Jimmy Miller
Projects

My projects are mostly experiments. I am more interested in exploring the ideas behind things, and learning from them, than in writing production-ready software. Some eventually graduate into tools I use every day, but even those are mostly things I find useful myself, not things built for others.

tensor-lang

From-scratch tensor compiler: one DSL, three backends, enough to run GPT-2.

I wanted to see how hard it would be to write a from-scratch DSL for GPT-2, targeting only as much as nanoGPT supports. Along the way I got GPT-2 working, learned more about the kinds of optimizations that apply to neural network workloads, trained some models, and built a visualizer for the inner workings of simple ones. There will definitely be a write-up about how all this works at some point.

tensorexamples/gpt2.tl — the layernorm
fn layernorm(x, gamma, beta) {
  let mean = mul(sum(x, axis: 2), inv_d)
  let xc = sub(x, mean)
  let var = mul(sum(mul(xc, xc), axis: 2), inv_d)
  let std = sqrt(add(var, 0.00001))
  let normed = mul(xc, recip(std))
  add(mul(normed, gamma), beta)
}

// the unlock: ~95% of time was in index math (SDIV).
// before:
for oi in 0..total_size {
  d0 = (oi / stride[0]) % shape[0]  // SDIV: 12-20 cycles
  d1 = (oi / stride[1]) % shape[1]  // SDIV: 12-20 cycles
  addr = d0 * input_stride_0 + d1 * input_stride_1
}
// after: nested loops. address falls out of the counters.
for d0 in 0..shape[0] {
  for d1 in 0..shape[1] {
    addr = d0 * input_stride_0 + d1 * input_stride_1
  }
}
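To make the before/after concrete, here is a minimal Python sketch (not the compiler's actual code) showing that the flat loop with division and modulo visits the same input addresses as the nested loops. The function names, the row-major `out_stride`, and the transposed-view example are assumptions chosen for illustration.

```python
def addresses_divmod(shape, out_stride, in_stride):
    """Flat loop: recover each coordinate from the flat index with div/mod."""
    addrs = []
    for oi in range(shape[0] * shape[1]):
        d0 = (oi // out_stride[0]) % shape[0]  # one divide + one modulo
        d1 = (oi // out_stride[1]) % shape[1]  # one divide + one modulo
        addrs.append(d0 * in_stride[0] + d1 * in_stride[1])
    return addrs


def addresses_nested(shape, in_stride):
    """Nested loops: the coordinates are the loop counters, no division needed."""
    addrs = []
    for d0 in range(shape[0]):
        for d1 in range(shape[1]):
            addrs.append(d0 * in_stride[0] + d1 * in_stride[1])
    return addrs


# Hypothetical example: a 3x4 iteration space reading a transposed view
# of a buffer stored as 4x3 row-major, so the input strides are (1, 3).
shape = (3, 4)
out_stride = (4, 1)  # row-major strides of the iteration space (assumed)
in_stride = (1, 3)

flat = addresses_divmod(shape, out_stride, in_stride)
nested = addresses_nested(shape, in_stride)
print(flat == nested)
```

Both functions produce the same address sequence; the nested version just never has to reconstruct `d0` and `d1` from `oi`, which is where the divides came from.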
pass                               time       speedup
baseline (old ARM backend)         3740 ms    1.00x
+ nested loops                     3595 ms    1.04x
+ k-invariant hoisting             3201 ms    1.17x
+ matmul unfusion + MR=8 kernel    1475 ms    2.54x
+ pointer incrementing             1434 ms    2.61x
+ all combined                     198 ms     18.90x
llm.c -O3 (reference)              84 ms
GPT-2 124M · 12 layers · T=16 · single-threaded · Apple Silicon
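For readers who want to check the layernorm listing above against something runnable, here is a plain-Python sketch of the same sequence of primitive ops. It assumes that `inv_d` in the DSL listing is the reciprocal of the size of axis 2 and that the input is a nested list shaped [batch][time][channel]; the helper names are hypothetical and not part of tensor-lang.

```python
import math


def layernorm_row(x, gamma, beta, eps=0.00001):
    """Normalize one channel vector, mirroring the DSL listing op by op."""
    inv_d = 1.0 / len(x)                  # assumption: inv_d = 1 / size of axis 2
    mean = sum(x) * inv_d                 # mul(sum(x, axis: 2), inv_d)
    xc = [v - mean for v in x]            # sub(x, mean)
    var = sum(v * v for v in xc) * inv_d  # mul(sum(mul(xc, xc), axis: 2), inv_d)
    std = math.sqrt(var + eps)            # sqrt(add(var, 0.00001))
    normed = [v * (1.0 / std) for v in xc]  # mul(xc, recip(std))
    return [n * g + b for n, g, b in zip(normed, gamma, beta)]


def layernorm(x, gamma, beta):
    """Apply layernorm_row over the last axis of a [batch][time][channel] list."""
    return [[layernorm_row(row, gamma, beta) for row in seq] for seq in x]


out = layernorm([[[1.0, 2.0, 3.0, 4.0]]], [1.0] * 4, [0.0] * 4)[0][0]
print(out)
```

With `gamma` set to ones and `beta` to zeros, the output row should have a mean near zero and a variance just under one (the epsilon keeps it slightly below).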