tensor-lang

From-scratch tensor compiler: one DSL, three backends, enough to run GPT-2.

I wanted to see how hard it would be to write a DSL from scratch that could run GPT-2. I targeted only as much functionality as nanoGPT covers. Along the way I got GPT-2 working, learned more about the kinds of optimizations that apply to neural network workloads, trained some models, and built a visualizer for the inner workings of simple ones. There will definitely be a write-up about how all this works at some point.

examples/gpt2.tensor

fn layernorm(x, gamma, beta) {
    let mean = mul(sum(x, axis: 2), inv_d)
    let xc = sub(x, mean)
    let var = mul(sum(mul(xc, xc), axis: 2), inv_d)
    let std = sqrt(add(var, 0.00001))
    let normed = mul(xc, recip(std))
    add(mul(normed, gamma), beta)
}

fn clamp(x, lo, hi) {
    // clamp(x, lo, hi) = max(min(x, hi), lo)
    // min(a, b) = neg(max(neg(a), neg(b)))
    let upper = neg(max(neg(x), neg(hi)))
    max(upper, lo)
}

fn gelu(x) {
    let x3 = mul(mul(x, x), x)
    let inner = mul(0.7978845608028654, add(x, mul(0.044715, x3)))
    // Clamp for numerically stable tanh (tanh(10) ≈ 1.0)
    let clamped = clamp(inner, neg(10.0), 10.0)
    let z2 = mul(clamped, 2.0)
    let ez2 = exp(z2)
    let tanh_val = mul(sub(ez2, 1.0), recip(add(ez2, 1.0)))
    mul(mul(0.5, x), add(1.0, tanh_val))
}

fn linear(x, w, b) {
    add(matmul(x, w), b)
}

fn softmax_attn(x) {
    let m = max(x, axis: 3)
    let e = exp(sub(x, m))
    let s = sum(e, axis: 3)
    mul(recip(s), e)
}
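
The listing builds everything from a small set of primitives: min comes from neg and max, tanh comes from exp, and division comes from recip. As a sanity check on those identities, here is a minimal Python sketch that mirrors the DSL functions on plain floats and lists. It is not the DSL's semantics: the real code works on tensors, and inv_d is not defined in the excerpt, so the sketch assumes it is 1 / (size of the normalized axis). The helper names neg, recip, and gelu_reference are made up for the comparison.

```python
import math

def neg(a):
    return -a

def recip(a):
    return 1.0 / a

def clamp(x, lo, hi):
    # min(x, hi) expressed as neg(max(neg(x), neg(hi))), then max with lo.
    upper = neg(max(neg(x), neg(hi)))
    return max(upper, lo)

def gelu(x):
    # tanh-approximation GELU, with tanh(z) = (e^(2z) - 1) / (e^(2z) + 1).
    x3 = x * x * x
    inner = 0.7978845608028654 * (x + 0.044715 * x3)
    # Clamping keeps exp() in range; tanh is already ~1.0 at z = 10.
    clamped = clamp(inner, neg(10.0), 10.0)
    ez2 = math.exp(2.0 * clamped)
    tanh_val = (ez2 - 1.0) * recip(ez2 + 1.0)
    return 0.5 * x * (1.0 + tanh_val)

def gelu_reference(x):
    # Same approximation written with the library tanh, for comparison.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def softmax(row):
    # Subtracting the row max before exp() avoids overflow.
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [recip(s) * v for v in e]

def layernorm(row, gamma, beta):
    # Assumption: inv_d = 1 / d, where d is the size of the normalized axis.
    inv_d = 1.0 / len(row)
    mean = sum(row) * inv_d
    xc = [v - mean for v in row]
    var = sum(c * c for c in xc) * inv_d
    std = math.sqrt(var + 0.00001)
    return [c * recip(std) * g + b for c, g, b in zip(xc, gamma, beta)]

if __name__ == "__main__":
    for x in (-6.0, -1.0, 0.0, 0.5, 3.0, 6.0):
        print(x, gelu(x), gelu_reference(x))
    print(softmax([1000.0, 1001.0, 1002.0]))
    print(layernorm([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4))
```

Without the clamp, an input like x = 100 would push the argument of exp far outside floating-point range. With it, the exp-based tanh stays finite and still agrees with the library tanh to well within single-precision tolerance.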

pass                               time      speedup
baseline (old ARM backend)         3740 ms    1.00x
+ nested loops                     3595 ms    1.04x
+ k-invariant hoisting             3201 ms    1.17x
+ matmul unfusion + MR=8 kernel    1475 ms    2.54x
+ pointer incrementing             1434 ms    2.61x
+ all combined                      198 ms   18.90x
llm.c -O3 (reference)                84 ms    -

GPT-2 124M · 12 layers · T=16 · single-threaded · Apple Silicon

Some of the intermediate numbers above are fudged a bit because I didn't feel like measuring them exactly, but the final numbers are real.
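
For reference, the speedup column is the baseline time divided by each row's time. Here is a minimal Python sketch of that arithmetic, using the timings from the table. The variable names are only for illustration, and the last value compares the combined result against the llm.c reference.

```python
# Timings in milliseconds, copied from the benchmark table.
timings_ms = {
    "baseline (old ARM backend)": 3740,
    "+ nested loops": 3595,
    "+ k-invariant hoisting": 3201,
    "+ matmul unfusion + MR=8 kernel": 1475,
    "+ pointer incrementing": 1434,
    "+ all combined": 198,
}
LLM_C_MS = 84  # llm.c -O3 reference row

baseline = timings_ms["baseline (old ARM backend)"]

# speedup = baseline time / time with the pass enabled
speedups = {name: baseline / ms for name, ms in timings_ms.items()}

# How much slower the combined build still is than llm.c.
gap_to_llm_c = timings_ms["+ all combined"] / LLM_C_MS

if __name__ == "__main__":
    for name, s in speedups.items():
        print(f"{name:34s}{s:6.2f}x")
    print(f"combined vs llm.c: {gap_to_llm_c:.2f}x slower")
```

The computed ratios agree with the listed column to within rounding, and the combined result is a bit under 2.4x slower than llm.c at this configuration.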