Skip to content

sailfish009/purevlm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The most atomic Vision-Language Model, in pure Python

Building on @karpathy's pure GPT, this adds vision to the same minimalist framework. A complete VLM — see a digit, say its name — in dependency-free Python.

What it does

Feed it an MNIST digit. It generates the word:

[8] true: eight | generated: eight    ✓
[3] true: three | generated: three    ✓
[7] true: seven | generated: seven    ✓
[0] true:  zero | generated: zero     ✓

95% accuracy on MNIST. ~32K parameters. No GPU. Pure NumPy.

How it works

The architecture follows a single mathematical principle end-to-end:

Image pixels → [BAE × BAE] → vision tokens → [Tensor Attention GPT] → text tokens

During training, the model sees sequences like:

[VIS_0] [VIS_1] ... [VIS_27] [BOS] s e v e n [EOS]

During inference, it encodes the image into vision tokens, then autoregressively generates text — exactly like a language model, but the prompt is an image.

The visual tokenizer: BAE × BAE

The original Bilinear Autoencoder (Lieberum et al., 2025) captures second-order pixel interactions:

f_j = dot(l_j, x) * dot(r_j, x)     — degree-2 polynomial kernel

Two dot products — that's it. This captures pairwise pixel interactions without ever materializing the quadratic space x⊗x.

BAE × BAE takes this further. Two independent BAEs each produce stable, normalized degree-2 features. Their tensor product reaches degree-4 — capturing three-way and four-way pixel interactions — while each factor remains independently well-conditioned:

f₁ = (L₁x) ⊙ (R₁x)          — BAE₁: degree-2
f₂ = (L₂x) ⊙ (R₂x)          — BAE₂: degree-2
token_j = (Pₐ[j]·f₁)(Pᵦ[j]·f₂)  — degree-4 via tensor product

This is the kernel trick applied twice: the model implicitly operates in a ~1.25M-dimensional feature space (for 14×14 images) while only computing with vectors of length 10.

Direct degree-2 and degree-3 features flow alongside as skip connections, giving the transformer access to multiple levels of polynomial interaction simultaneously.

Tensor Product Attention

Standard attention computes similarity as a single dot product — a degree-2 kernel:

score(q, k) = qᵀk / √d

We replace this with tensor product attention, splitting each head in half:

score(q, k) = (q₁ᵀk₁)(q₂ᵀk₂) / d     — degree-4 kernel

This asks not just "are these tokens similar?" but "does feature-group-A match AND feature-group-B match?" — a conjunctive condition that distinguishes digits like 8 vs 0, where the top half is similar but the bottom half differs.

Zero extra parameters. Same Q, K, V matrices, just a different score function. This single change improved accuracy from 85% to 95%.

Everything else is the same GPT

The transformer is: RMSNorm, multi-head causal attention, ReLU MLP, Adam optimizer. Vision tokens and text tokens flow through the same attention layers in a single unified sequence.

The kernel trick runs the whole show

Every component speaks the same mathematical language:

Component Operation Kernel degree
BAE visual features (Lx) ⊙ (Rx) 2
Multi-degree skip (L₁x)(L₂x)(L₃x) 3
BAE × BAE tokens (Pₐ·f₁)(Pᵦ·f₂) 4
Vision projection (Wₗ·f) ⊙ (Wᵣ·f) 2
Tensor attention (q₁·k₁)(q₂·k₂) 4

No component was designed in isolation. Each is a polynomial kernel operating at a different degree, and they compose naturally because products of kernels are kernels.

The point

This isn't meant to compete with Qwen-VL or LLaVA. It's meant to explore whether the kernel trick — the same mathematical insight from SVMs — can serve as a unified foundation for vision-language models, replacing the engineering stack of ViT + adapter + LLM with a single coherent algebraic structure.

Every gradient is traceable. Every matrix multiply is visible. The entire model fits in one file.

As the original code says: "This file is the complete algorithm. Everything else is just efficiency."

That now includes vision — and a hint at what a kernel-native VLM architecture could become.

--

loaded 60000 training images, 28x28=784px
text vocab: ['e', 'f', 'g', 'h', 'i', 'n', 'o', 'r', 's', 't', 'u', 'v', 'w', 'x', 'z'] + BOS + EOS = 17 tokens
downscaling images to 14x14...

=== BAE × BAE Configuration ===
BAE₁: 10 features (degree-2)
BAE₂: 10 features (degree-2)
Tensor product: 10²=100 implicit dims → 12 tokens (degree-4)
Direct features: [8, 8] (degree [2, 3])
Total vision tokens: 28

num params: 32,240
architecture: BAE×BAE + Tensor Product Attention + GPT
attention: score = (q₁·k₁)(q₂·k₂)/d  [degree-4 kernel, no extra params]

--- training (7000 steps, BAE×BAE + Tensor Attention) ---
step   100/7000 | loss 0.6132 | text 0.6132 | recon 0.0000 | target: one | 2.6s
step   200/7000 | loss 0.2771 | text 0.2763 | recon 0.0771 | target: two | 5.2s
step   300/7000 | loss 0.3960 | text 0.3958 | recon 0.0231 | target: six | 7.8s
step   400/7000 | loss 0.9417 | text 0.9413 | recon 0.0439 | target: zero | 10.5s
step   500/7000 | loss 0.2659 | text 0.2633 | recon 0.2590 | target: eight | 13.1s
step   600/7000 | loss 0.5474 | text 0.5468 | recon 0.0668 | target: seven | 15.7s
step   700/7000 | loss 0.8002 | text 0.7999 | recon 0.0337 | target: five | 18.4s
step   800/7000 | loss 0.2323 | text 0.2318 | recon 0.0538 | target: five | 21.0s
step   900/7000 | loss 0.5897 | text 0.5889 | recon 0.0778 | target: five | 23.7s
step  1000/7000 | loss 0.0138 | text 0.0133 | recon 0.0552 | target: six | 26.3s
step  1100/7000 | loss 0.2423 | text 0.2421 | recon 0.0196 | target: four | 29.0s
step  1200/7000 | loss 0.0327 | text 0.0306 | recon 0.2082 | target: two | 31.6s
step  1300/7000 | loss 0.7342 | text 0.7338 | recon 0.0442 | target: two | 34.3s
step  1400/7000 | loss 0.1085 | text 0.1085 | recon 0.0000 | target: nine | 36.9s
step  1500/7000 | loss 0.0081 | text 0.0081 | recon 0.0000 | target: seven | 39.6s
step  1600/7000 | loss 0.2101 | text 0.2088 | recon 0.1307 | target: three | 42.2s
step  1700/7000 | loss 0.0022 | text 0.0005 | recon 0.1629 | target: one | 44.8s
step  1800/7000 | loss 0.1305 | text 0.1293 | recon 0.1262 | target: three | 47.5s
step  1900/7000 | loss 0.0303 | text 0.0303 | recon 0.0000 | target: seven | 50.1s
step  2000/7000 | loss 0.0076 | text 0.0076 | recon 0.0000 | target: zero | 52.8s
step  2100/7000 | loss 0.2230 | text 0.2230 | recon 0.0000 | target: nine | 55.4s
step  2200/7000 | loss 0.3399 | text 0.3398 | recon 0.0092 | target: nine | 58.0s
step  2300/7000 | loss 0.3529 | text 0.3529 | recon 0.0000 | target: nine | 60.7s
step  2400/7000 | loss 0.0191 | text 0.0162 | recon 0.2952 | target: two | 63.3s
step  2500/7000 | loss 0.0070 | text 0.0045 | recon 0.2513 | target: two | 65.9s
step  2600/7000 | loss 0.0028 | text 0.0015 | recon 0.1376 | target: one | 68.5s
step  2700/7000 | loss 0.0595 | text 0.0569 | recon 0.2552 | target: eight | 71.2s
step  2800/7000 | loss 0.0138 | text 0.0138 | recon 0.0000 | target: four | 73.8s
step  2900/7000 | loss 0.0091 | text 0.0091 | recon 0.0000 | target: zero | 76.4s
step  3000/7000 | loss 0.0609 | text 0.0605 | recon 0.0343 | target: five | 79.1s
step  3100/7000 | loss 0.0104 | text 0.0103 | recon 0.0107 | target: nine | 81.7s
step  3200/7000 | loss 0.0127 | text 0.0099 | recon 0.2844 | target: two | 84.3s
step  3300/7000 | loss 0.0081 | text 0.0061 | recon 0.1965 | target: eight | 87.0s
step  3400/7000 | loss 0.0051 | text 0.0051 | recon 0.0089 | target: four | 89.6s
step  3500/7000 | loss 0.0024 | text 0.0012 | recon 0.1190 | target: eight | 92.3s
step  3600/7000 | loss 0.0058 | text 0.0057 | recon 0.0109 | target: seven | 94.9s
step  3700/7000 | loss 0.0688 | text 0.0675 | recon 0.1314 | target: three | 97.5s
step  3800/7000 | loss 0.0028 | text 0.0006 | recon 0.2214 | target: two | 100.2s
step  3900/7000 | loss 0.0004 | text 0.0004 | recon 0.0000 | target: seven | 102.8s
step  4000/7000 | loss 0.0010 | text 0.0010 | recon 0.0000 | target: six | 105.5s
step  4100/7000 | loss 0.0030 | text 0.0029 | recon 0.0087 | target: four | 108.1s
step  4200/7000 | loss 0.0017 | text 0.0007 | recon 0.0997 | target: three | 110.8s
step  4300/7000 | loss 0.0133 | text 0.0131 | recon 0.0255 | target: five | 113.4s
step  4400/7000 | loss 0.0018 | text 0.0016 | recon 0.0227 | target: nine | 116.1s
step  4500/7000 | loss 0.0008 | text 0.0008 | recon 0.0000 | target: seven | 118.7s
step  4600/7000 | loss 0.0045 | text 0.0021 | recon 0.2454 | target: two | 121.3s
step  4700/7000 | loss 0.0337 | text 0.0337 | recon 0.0000 | target: six | 124.0s
step  4800/7000 | loss 0.0210 | text 0.0210 | recon 0.0000 | target: seven | 126.6s
step  4900/7000 | loss 0.0016 | text 0.0015 | recon 0.0027 | target: seven | 129.2s
step  5000/7000 | loss 0.0058 | text 0.0029 | recon 0.2963 | target: two | 131.9s
step  5100/7000 | loss 0.0012 | text 0.0008 | recon 0.0332 | target: six | 134.5s
step  5200/7000 | loss 0.0071 | text 0.0055 | recon 0.1645 | target: eight | 137.1s
step  5300/7000 | loss 0.0374 | text 0.0364 | recon 0.1005 | target: eight | 139.8s
step  5400/7000 | loss 0.0156 | text 0.0153 | recon 0.0342 | target: three | 142.4s
step  5500/7000 | loss 0.0170 | text 0.0168 | recon 0.0220 | target: five | 145.0s
step  5600/7000 | loss 0.0094 | text 0.0094 | recon 0.0000 | target: five | 147.7s
step  5700/7000 | loss 0.0032 | text 0.0013 | recon 0.1839 | target: eight | 150.3s
step  5800/7000 | loss 0.0034 | text 0.0029 | recon 0.0512 | target: three | 153.0s
step  5900/7000 | loss 0.0072 | text 0.0070 | recon 0.0234 | target: four | 155.6s
step  6000/7000 | loss 0.0011 | text 0.0007 | recon 0.0436 | target: nine | 158.2s
step  6100/7000 | loss 0.0052 | text 0.0023 | recon 0.2873 | target: two | 160.9s
step  6200/7000 | loss 0.0036 | text 0.0022 | recon 0.1407 | target: eight | 163.5s
step  6300/7000 | loss 0.0019 | text 0.0007 | recon 0.1245 | target: three | 166.1s
step  6400/7000 | loss 0.0297 | text 0.0297 | recon 0.0075 | target: five | 168.8s
step  6500/7000 | loss 0.0018 | text 0.0000 | recon 0.1726 | target: one | 171.4s
step  6600/7000 | loss 0.0076 | text 0.0071 | recon 0.0481 | target: nine | 174.1s
step  6700/7000 | loss 0.0141 | text 0.0135 | recon 0.0627 | target: nine | 176.7s
step  6800/7000 | loss 0.0040 | text 0.0034 | recon 0.0582 | target: six | 179.3s
step  6900/7000 | loss 0.0048 | text 0.0027 | recon 0.2074 | target: two | 182.0s
step  7000/7000 | loss 0.0160 | text 0.0158 | recon 0.0199 | target: nine | 184.6s

total training time: 184.6s

--- inference: BAE×BAE + Tensor Product Attention ---
  [8] true: eight | generated: eight    ✓
  [9] true:  nine | generated: nine     ✓
  [4] true:  four | generated: four     ✓
  [3] true: three | generated: three    ✓
  [4] true:  four | generated: four     ✓
  [2] true:   two | generated: two      ✓
  [8] true: eight | generated: eight    ✓
  [3] true: three | generated: three    ✓
  [1] true:   one | generated: one      ✓
  [0] true:  zero | generated: zero     ✓
  [4] true:  four | generated: four     ✓
  [7] true: seven | generated: seven    ✓
  [5] true:  five | generated: five     ✓
  [9] true:  nine | generated: nine     ✓
  [0] true:  zero | generated: five     ✗
  [9] true:  nine | generated: nine     ✓
  [3] true: three | generated: three    ✓
  [7] true: seven | generated: seven    ✓
  [8] true: eight | generated: eight    ✓
  [7] true: seven | generated: seven    ✓

accuracy: 19/20 = 95%

About

Pure VLM: See a digit, say its name. Bilinear AE visual tokenizer + GPT in pure Python, zero dependencies.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages