tensor-lang
From-scratch tensor compiler: one DSL, three backends, enough to run GPT-2.
I wanted to see how hard it would be to write a from-scratch DSL for GPT-2, targeting only as much functionality as nanoGPT supports. Along the way I got GPT-2 working, learned more about the kinds of optimizations that apply to neural network workloads, trained some models, and built a visualizer for the inner workings of simple ones. There will definitely be a write-up about how all this works at some point.
    fn layernorm(x, gamma, beta) {
      let mean = mul(sum(x, axis: 2), inv_d)
      let xc = sub(x, mean)
      let var = mul(sum(mul(xc, xc), axis: 2), inv_d)
      let std = sqrt(add(var, 0.00001))
      let normed = mul(xc, recip(std))
      add(mul(normed, gamma), beta)
    }

    // the unlock: ~95% of time was in index math (SDIV).
    // before:
    for oi in 0..total_size {
      d0 = (oi / stride[0]) % shape[0] // SDIV: 12-20 cycles
      d1 = (oi / stride[1]) % shape[1] // SDIV: 12-20 cycles
      addr = d0 * input_stride_0 + d1 * input_stride_1
    }
    // after: nested loops. address falls out of the counters.
    for d0 in 0..shape[0] {
      for d1 in 0..shape[1] {
        addr = d0 * input_stride_0 + d1 * input_stride_1
      }
    }
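To make the before/after index math concrete, here is a minimal Python sketch that checks both loop shapes visit the same input addresses in the same order. It is an illustration only, not tensor-lang's actual code generator: the function names (`contiguous_strides`, `addrs_divmod`, `addrs_nested`) are hypothetical, the tensor is assumed to be 2-D, and `stride` in the "before" loop is assumed to mean the row-major strides of the output.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor.

    Assumption: this is what `stride` refers to in the "before" loop.
    """
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def addrs_divmod(shape, input_strides):
    """'Before': recover each coordinate from the flat index.

    Costs one integer divide and one modulo per dimension per element.
    """
    out_strides = contiguous_strides(shape)
    total_size = shape[0] * shape[1]
    addrs = []
    for oi in range(total_size):
        d0 = (oi // out_strides[0]) % shape[0]
        d1 = (oi // out_strides[1]) % shape[1]
        addrs.append(d0 * input_strides[0] + d1 * input_strides[1])
    return addrs


def addrs_nested(shape, input_strides):
    """'After': the loop counters already are the coordinates, so no divides."""
    addrs = []
    for d0 in range(shape[0]):
        for d1 in range(shape[1]):
            addrs.append(d0 * input_strides[0] + d1 * input_strides[1])
    return addrs


if __name__ == "__main__":
    # Read a 2x3 transposed view of a 3x2 row-major buffer: input strides (1, 2).
    shape, input_strides = (2, 3), (1, 2)
    print(addrs_divmod(shape, input_strides))  # [0, 2, 4, 1, 3, 5]
    print(addrs_nested(shape, input_strides))  # [0, 2, 4, 1, 3, 5]
```

Both functions produce the same address sequence; the nested form just never has to reconstruct `d0` and `d1` from a flat index, which is where the divides came from.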