Jimmy Miller
Projects

My projects are mostly experiments. I am more interested in exploring the ideas behind things, and learning from them, than in writing production-ready software. Some eventually graduate into tools I use every day, but even those are mostly things I find useful myself, not things built for others.

tensor-lang

From-scratch tensor compiler: one DSL, three backends, enough to run GPT-2.

I wanted to see how hard it would be to write a from-scratch DSL for GPT-2. I targeted only as much support as nanoGPT. Along the way I was able to get GPT-2 working. I learned more about the kinds of optimizations that apply to neural network workloads. And I was able to train some models and build a visualizer for the inner workings of simple ones. There will definitely be a write-up about how all this works at some point.

tensor · examples/gpt2.tensor
fn layernorm(x, gamma, beta) {
    let mean = mul(sum(x, axis: 2), inv_d)
    let xc = sub(x, mean)
    let var = mul(sum(mul(xc, xc), axis: 2), inv_d)
    let std = sqrt(add(var, 0.00001))
    let normed = mul(xc, recip(std))
    add(mul(normed, gamma), beta)
}

fn clamp(x, lo, hi) {
    // clamp(x, lo, hi) = max(min(x, hi), lo)
    // min(a, b) = neg(max(neg(a), neg(b)))
    let upper = neg(max(neg(x), neg(hi)))
    max(upper, lo)
}

fn gelu(x) {
    let x3 = mul(mul(x, x), x)
    let inner = mul(0.7978845608028654, add(x, mul(0.044715, x3)))
    // Clamp for numerically stable tanh (tanh(10) ≈ 1.0)
    let clamped = clamp(inner, neg(10.0), 10.0)
    let z2 = mul(clamped, 2.0)
    let ez2 = exp(z2)
    let tanh_val = mul(sub(ez2, 1.0), recip(add(ez2, 1.0)))
    mul(mul(0.5, x), add(1.0, tanh_val))
}

fn linear(x, w, b) {
    add(matmul(x, w), b)
}

fn softmax_attn(x) {
    let m = max(x, axis: 3)
    let e = exp(sub(x, m))
    let s = sum(e, axis: 3)
    mul(recip(s), e)
}
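
As a rough cross-check, the definitions above can be mirrored in plain Python. This is only a sketch and not part of tensor-lang: it works on scalars and single rows rather than tensors, the names `gelu_reference`, `layernorm_row`, and `softmax_row` are made up for illustration, and it assumes `inv_d` is one over the length of the normalized axis. It shows that building `min` from `neg` and `max`, and `tanh` from `exp`, gives the same values as the usual tanh-based GELU.

```python
import math


def clamp(x, lo, hi):
    # min(a, b) rewritten as -max(-a, -b), so only max and neg are needed.
    upper = -max(-x, -hi)
    return max(upper, lo)


def gelu(x):
    # Same steps as the DSL version: tanh(z) = (e^(2z) - 1) / (e^(2z) + 1).
    x3 = x * x * x
    inner = 0.7978845608028654 * (x + 0.044715 * x3)
    clamped = clamp(inner, -10.0, 10.0)  # keeps exp() from overflowing
    ez2 = math.exp(2.0 * clamped)
    tanh_val = (ez2 - 1.0) * (1.0 / (ez2 + 1.0))
    return 0.5 * x * (1.0 + tanh_val)


def gelu_reference(x):
    # Usual tanh approximation of GELU, using the library tanh directly.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def layernorm_row(row, gamma, beta, eps=0.00001):
    # Assumption: inv_d in the DSL is 1 / (size of the normalized axis).
    inv_d = 1.0 / len(row)
    mean = sum(row) * inv_d
    xc = [v - mean for v in row]
    var = sum(v * v for v in xc) * inv_d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [v * inv_std * g + b for v, g, b in zip(xc, gamma, beta)]


def softmax_row(row):
    # Subtract the row maximum before exp() so large inputs do not overflow.
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [(1.0 / s) * v for v in e]


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(x, gelu(x), gelu_reference(x))
    print(softmax_row([1000.0, 1001.0, 999.0]))
```

Running it prints the two GELU variants side by side, which should agree to many decimal places, and a softmax row computed from inputs that would overflow `exp` without the max subtraction.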
pass                               time       speedup
baseline (old ARM backend)         3740 ms    1.00x
+ nested loops                     3595 ms    1.04x
+ k-invariant hoisting             3201 ms    1.17x
+ matmul unfusion + MR=8 kernel    1475 ms    2.54x
+ pointer incrementing             1434 ms    2.61x
+ all combined                      198 ms    18.90x
llm.c -O3 (reference)                84 ms
GPT-2 124M · 12 layers · T=16 · single-threaded · Apple Silicon
Some of the numbers above are fudged a bit because I didn't feel like getting exact numbers, but the final numbers are real.
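
The speedup column is the baseline time divided by each pass's time. The short Python sketch below recomputes it from the times listed above. The helper name `speedup` is made up for illustration, and the last digit can differ slightly from the listed values, which may come down to rounding in the reported times.

```python
# Times copied from the table above, in milliseconds.
TIMES_MS = [
    ("baseline (old ARM backend)", 3740),
    ("+ nested loops", 3595),
    ("+ k-invariant hoisting", 3201),
    ("+ matmul unfusion + MR=8 kernel", 1475),
    ("+ pointer incrementing", 1434),
    ("+ all combined", 198),
]
LLM_C_MS = 84  # llm.c -O3 reference time from the same table


def speedup(baseline_ms, time_ms):
    # How many times faster time_ms is compared with baseline_ms.
    return baseline_ms / time_ms


if __name__ == "__main__":
    baseline_ms = TIMES_MS[0][1]
    for name, ms in TIMES_MS:
        print(f"{name:<34}{ms:>5} ms  {speedup(baseline_ms, ms):6.2f}x")
    # Remaining gap between the combined result and the llm.c reference.
    print(f"llm.c is {speedup(TIMES_MS[-1][1], LLM_C_MS):.2f}x faster than the combined result")
```

By the same arithmetic, 198 ms against 84 ms leaves the combined result a bit under 2.4x slower than the llm.c reference.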