tensor-lang
From-scratch tensor compiler: one DSL, three backends, enough to run GPT-2.
I wanted to see how hard it would be to write a DSL for GPT-2 from scratch, aiming only for the feature set that nanoGPT covers. Along the way I got GPT-2 working and learned more about the kinds of optimizations that apply to neural network workloads. I also trained some models and built a visualizer for the inner workings of the simpler ones. There will definitely be a write-up about how all this works at some point.
fn layernorm(x, gamma, beta) {
    let mean = mul(sum(x, axis: 2), inv_d)
    let xc = sub(x, mean)
    let var = mul(sum(mul(xc, xc), axis: 2), inv_d)
    let std = sqrt(add(var, 0.00001))
    let normed = mul(xc, recip(std))
    add(mul(normed, gamma), beta)
}

fn clamp(x, lo, hi) {
    // clamp(x, lo, hi) = max(min(x, hi), lo)
    // min(a, b) = neg(max(neg(a), neg(b)))
    let upper = neg(max(neg(x), neg(hi)))
    max(upper, lo)
}

fn gelu(x) {
    let x3 = mul(mul(x, x), x)
    let inner = mul(0.7978845608028654, add(x, mul(0.044715, x3)))
    // Clamp for numerically stable tanh (tanh(10) ≈ 1.0)
    let clamped = clamp(inner, neg(10.0), 10.0)
    let z2 = mul(clamped, 2.0)
    let ez2 = exp(z2)
    let tanh_val = mul(sub(ez2, 1.0), recip(add(ez2, 1.0)))
    mul(mul(0.5, x), add(1.0, tanh_val))
}

fn linear(x, w, b) {
    add(matmul(x, w), b)
}

fn softmax_attn(x) {
    let m = max(x, axis: 3)
    let e = exp(sub(x, m))
    let s = sum(e, axis: 3)
    mul(recip(s), e)
}
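
The clamp, gelu, and softmax_attn definitions above lean on a few identities (min written with max and neg, tanh written with exp, and max subtraction before exp). The following is a rough scalar sketch of those identities in plain Python, not part of tensor-lang itself. It assumes the DSL ops (neg, max, exp, recip) have their usual elementwise meaning, and it replaces the axis-3 reduction with a single Python list. The names gelu_reference and softmax are hypothetical helpers introduced only for this check.

```python
import math


def clamp(x, lo, hi):
    # min(a, b) written as -max(-a, -b), so only max and neg are needed
    upper = -max(-x, -hi)
    return max(upper, lo)


def gelu(x):
    # tanh approximation of GELU, using tanh(z) = (e^(2z) - 1) / (e^(2z) + 1)
    inner = 0.7978845608028654 * (x + 0.044715 * x ** 3)
    # Without the clamp, exp(2z) would overflow for large inputs
    z = clamp(inner, -10.0, 10.0)
    ez2 = math.exp(2.0 * z)
    tanh_val = (ez2 - 1.0) * (1.0 / (ez2 + 1.0))
    return 0.5 * x * (1.0 + tanh_val)


def gelu_reference(x):
    # Same approximation, but with the library tanh and sqrt(2 / pi) spelled out
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def softmax(row):
    # Subtracting the row maximum keeps every exponent at or below zero
    m = max(row)
    e = [math.exp(v - m) for v in row]
    inv_s = 1.0 / sum(e)
    return [inv_s * v for v in e]


if __name__ == "__main__":
    for x in (-50.0, -3.0, 0.0, 0.5, 2.0, 50.0):
        print(x, gelu(x), gelu_reference(x))
    print(softmax([1000.0, 1000.0]))
```

For inputs around ±50 the unclamped argument to exp would be in the thousands, which is where the clamp matters; the clamped version still agrees with the tanh-based reference to within a small tolerance on the values tried here.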