Building on @karpathy's pure GPT, this adds vision to the same minimalist framework: a complete VLM that sees a digit and says its name, in a single file of plain Python and NumPy.
Feed it an MNIST digit. It generates the word:
```
[8] true: eight | generated: eight ✓
[3] true: three | generated: three ✓
[7] true: seven | generated: seven ✓
[0] true: zero | generated: zero ✓
```
95% accuracy on MNIST. ~32K parameters. No GPU. Pure NumPy.
The architecture follows a single mathematical principle end-to-end:
Image pixels → [BAE × BAE] → vision tokens → [Tensor Attention GPT] → text tokens
During training, the model sees sequences like:
[VIS_0] [VIS_1] ... [VIS_27] [BOS] s e v e n [EOS]
During inference, it encodes the image into vision tokens, then autoregressively generates text — exactly like a language model, but the prompt is an image.
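A minimal sketch of that generation loop, with hypothetical stand-ins (`encode_image`, `gpt_logits`, `stoi`, `itos` are not names from the original file; they stand in for the BAE×BAE encoder, the transformer forward pass, and the vocab tables):

```python
import numpy as np

def generate_label(image, encode_image, gpt_logits, stoi, itos, max_len=8):
    # encode_image and gpt_logits are hypothetical stand-ins for the
    # BAE x BAE encoder and the transformer forward pass.
    vis = encode_image(image)             # 28 vision-token embeddings
    ids = [stoi["<BOS>"]]
    for _ in range(max_len):
        logits = gpt_logits(vis, ids)     # causal forward over [vision | text]
        nxt = int(np.argmax(logits[-1]))  # greedy: most likely next character
        if nxt == stoi["<EOS>"]:
            break
        ids.append(nxt)
    return "".join(itos[i] for i in ids[1:])  # e.g. "seven"
```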
The original Bilinear Autoencoder (Lieberum et al., 2025) captures second-order pixel interactions:
fⱼ = (lⱼ·x)(rⱼ·x) — degree-2 polynomial kernel
Two dot products — that's it. This captures pairwise pixel interactions without ever materializing the quadratic space x⊗x.
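In NumPy that is two matrix-vector products and an elementwise multiply. A minimal sketch, with random weights standing in for the learned factor matrices L and R:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(196)                       # a flattened 14x14 image
L = rng.standard_normal((10, 196)) * 0.1  # left factors  l_j (learned in practice)
R = rng.standard_normal((10, 196)) * 0.1  # right factors r_j (learned in practice)

f = (L @ x) * (R @ x)                     # f_j = (l_j . x)(r_j . x): degree-2 in pixels
```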
BAE × BAE takes this further. Two independent BAEs each produce stable, normalized degree-2 features. Their tensor product reaches degree-4 — capturing three-way and four-way pixel interactions — while each factor remains independently well-conditioned:
```
f₁ = (L₁x) ⊙ (R₁x)              — BAE₁: degree-2
f₂ = (L₂x) ⊙ (R₂x)              — BAE₂: degree-2
token_j = (Pₐ[j]·f₁)(Pᵦ[j]·f₂)  — degree-4 via tensor product
```
This is the kernel trick applied twice: the model implicitly operates in a ~1.25M-dimensional feature space (for 14×14 images) while only computing with vectors of length 10.
Direct degree-2 and degree-3 features flow alongside as skip connections, giving the transformer access to multiple levels of polynomial interaction simultaneously.
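Putting the encoder together as a sketch (shapes follow the run log below: 10 features per BAE, 12 tensor-product tokens; all weights here are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(196)                           # flattened 14x14 image
L1, R1, L2, R2 = rng.standard_normal((4, 10, 196)) * 0.1
Pa, Pb = rng.standard_normal((2, 12, 10))

f1 = (L1 @ x) * (R1 @ x)                      # BAE-1 features: degree-2
f2 = (L2 @ x) * (R2 @ x)                      # BAE-2 features: degree-2
tokens = (Pa @ f1) * (Pb @ f2)                # 12 tokens: degree-4

# Each token is a linear read-out of the implicit 100-dim tensor
# product f1 (x) f2, which is never materialized:
W = np.einsum('ja,jb->jab', Pa, Pb).reshape(12, 100)
assert np.allclose(tokens, W @ np.outer(f1, f2).ravel())
```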
Standard attention computes similarity as a single dot product — a degree-2 kernel:
score(q, k) = qᵀk / √d
We replace this with tensor product attention, splitting each head in half:
score(q, k) = (q₁ᵀk₁)(q₂ᵀk₂) / d — degree-4 kernel
This asks not just "are these tokens similar?" but "does feature-group-A match AND feature-group-B match?" — a conjunctive condition that distinguishes digits like 8 vs 0, where the top half is similar but the bottom half differs.
Zero extra parameters. Same Q, K, V matrices, just a different score function. This single change improved accuracy from 85% to 95%.
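In NumPy the change is two lines in the score computation; everything else is standard causal attention. A sketch for a single head of width d:

```python
import numpy as np

def tensor_attention_scores(Q, K):
    """Degree-4 attention scores: split each head in half and
    multiply the two halves' similarities. Q, K: (T, d)."""
    d = Q.shape[-1]
    h = d // 2
    s1 = Q[:, :h] @ K[:, :h].T   # (q1 . k1) for every query/key pair
    s2 = Q[:, h:] @ K[:, h:].T   # (q2 . k2) for every query/key pair
    return (s1 * s2) / d         # degree-4 kernel, scaled by d

# standard attention for comparison: (Q @ K.T) / np.sqrt(d)
```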
The transformer itself is standard: RMSNorm, multi-head causal attention, a ReLU MLP, all trained with Adam. Vision tokens and text tokens flow through the same attention layers as a single unified sequence.
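A sketch of that unified forward pass; `rmsnorm`, `tensor_attention`, and `relu_mlp` are assumed helpers (the tensor-attention scores are sketched above), and the real single-file implementation fuses this with its hand-written backward pass:

```python
import numpy as np

def forward(vision_tokens, text_ids, p):
    # vision_tokens: (28, n_embd) from the BAE x BAE encoder (assumed shape)
    txt = p["wte"][text_ids]                      # embed BOS + characters so far
    h = np.concatenate([vision_tokens, txt])      # one unified sequence
    h = h + p["wpe"][: len(h)]                    # learned positions (assumed)
    for blk in p["blocks"]:
        h = h + tensor_attention(rmsnorm(h), blk) # causal, degree-4 scores
        h = h + relu_mlp(rmsnorm(h), blk)
    return rmsnorm(h) @ p["wte"].T                # logits over the 17-token vocab
```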
Every component speaks the same mathematical language:
| Component | Operation | Kernel degree |
|---|---|---|
| BAE visual features | (Lx) ⊙ (Rx) | 2 |
| Multi-degree skip | (L₁x)(L₂x)(L₃x) | 3 |
| BAE × BAE tokens | (Pₐ·f₁)(Pᵦ·f₂) | 4 |
| Vision projection | (Wₗ·f) ⊙ (Wᵣ·f) | 2 |
| Tensor attention | (q₁·k₁)(q₂·k₂) | 4 |
No component was designed in isolation. Each is a polynomial kernel operating at a different degree, and they compose naturally because products of kernels are kernels.
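That closure is a one-line identity: a product of two dot-product kernels is a single dot product in the Kronecker product of the two feature spaces. A quick numerical check (not from the original file):

```python
import numpy as np

rng = np.random.default_rng(0)
q1, q2, k1, k2 = rng.standard_normal((4, 8))

product_of_kernels = (q1 @ k1) * (q2 @ k2)          # two degree-2 kernels multiplied
one_kernel = np.kron(q1, q2) @ np.kron(k1, k2)      # one dot product in 64-d
assert np.isclose(product_of_kernels, one_kernel)   # products of kernels are kernels
```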
This isn't meant to compete with Qwen-VL or LLaVA. It's meant to explore whether the kernel trick — the same mathematical insight from SVMs — can serve as a unified foundation for vision-language models, replacing the engineering stack of ViT + adapter + LLM with a single coherent algebraic structure.
Every gradient is traceable. Every matrix multiply is visible. The entire model fits in one file.
As the original code says: "This file is the complete algorithm. Everything else is just efficiency."
That now includes vision — and a hint at what a kernel-native VLM architecture could become.
---

```
loaded 60000 training images, 28x28=784px
text vocab: ['e', 'f', 'g', 'h', 'i', 'n', 'o', 'r', 's', 't', 'u', 'v', 'w', 'x', 'z'] + BOS + EOS = 17 tokens
downscaling images to 14x14...
=== BAE × BAE Configuration ===
BAE₁: 10 features (degree-2)
BAE₂: 10 features (degree-2)
Tensor product: 10²=100 implicit dims → 12 tokens (degree-4)
Direct features: [8, 8] (degree [2, 3])
Total vision tokens: 28
num params: 32,240
architecture: BAE×BAE + Tensor Product Attention + GPT
attention: score = (q₁·k₁)(q₂·k₂)/d [degree-4 kernel, no extra params]
--- training (7000 steps, BAE×BAE + Tensor Attention) ---
step 100/7000 | loss 0.6132 | text 0.6132 | recon 0.0000 | target: one | 2.6s
step 200/7000 | loss 0.2771 | text 0.2763 | recon 0.0771 | target: two | 5.2s
step 300/7000 | loss 0.3960 | text 0.3958 | recon 0.0231 | target: six | 7.8s
step 400/7000 | loss 0.9417 | text 0.9413 | recon 0.0439 | target: zero | 10.5s
step 500/7000 | loss 0.2659 | text 0.2633 | recon 0.2590 | target: eight | 13.1s
step 600/7000 | loss 0.5474 | text 0.5468 | recon 0.0668 | target: seven | 15.7s
step 700/7000 | loss 0.8002 | text 0.7999 | recon 0.0337 | target: five | 18.4s
step 800/7000 | loss 0.2323 | text 0.2318 | recon 0.0538 | target: five | 21.0s
step 900/7000 | loss 0.5897 | text 0.5889 | recon 0.0778 | target: five | 23.7s
step 1000/7000 | loss 0.0138 | text 0.0133 | recon 0.0552 | target: six | 26.3s
step 1100/7000 | loss 0.2423 | text 0.2421 | recon 0.0196 | target: four | 29.0s
step 1200/7000 | loss 0.0327 | text 0.0306 | recon 0.2082 | target: two | 31.6s
step 1300/7000 | loss 0.7342 | text 0.7338 | recon 0.0442 | target: two | 34.3s
step 1400/7000 | loss 0.1085 | text 0.1085 | recon 0.0000 | target: nine | 36.9s
step 1500/7000 | loss 0.0081 | text 0.0081 | recon 0.0000 | target: seven | 39.6s
step 1600/7000 | loss 0.2101 | text 0.2088 | recon 0.1307 | target: three | 42.2s
step 1700/7000 | loss 0.0022 | text 0.0005 | recon 0.1629 | target: one | 44.8s
step 1800/7000 | loss 0.1305 | text 0.1293 | recon 0.1262 | target: three | 47.5s
step 1900/7000 | loss 0.0303 | text 0.0303 | recon 0.0000 | target: seven | 50.1s
step 2000/7000 | loss 0.0076 | text 0.0076 | recon 0.0000 | target: zero | 52.8s
step 2100/7000 | loss 0.2230 | text 0.2230 | recon 0.0000 | target: nine | 55.4s
step 2200/7000 | loss 0.3399 | text 0.3398 | recon 0.0092 | target: nine | 58.0s
step 2300/7000 | loss 0.3529 | text 0.3529 | recon 0.0000 | target: nine | 60.7s
step 2400/7000 | loss 0.0191 | text 0.0162 | recon 0.2952 | target: two | 63.3s
step 2500/7000 | loss 0.0070 | text 0.0045 | recon 0.2513 | target: two | 65.9s
step 2600/7000 | loss 0.0028 | text 0.0015 | recon 0.1376 | target: one | 68.5s
step 2700/7000 | loss 0.0595 | text 0.0569 | recon 0.2552 | target: eight | 71.2s
step 2800/7000 | loss 0.0138 | text 0.0138 | recon 0.0000 | target: four | 73.8s
step 2900/7000 | loss 0.0091 | text 0.0091 | recon 0.0000 | target: zero | 76.4s
step 3000/7000 | loss 0.0609 | text 0.0605 | recon 0.0343 | target: five | 79.1s
step 3100/7000 | loss 0.0104 | text 0.0103 | recon 0.0107 | target: nine | 81.7s
step 3200/7000 | loss 0.0127 | text 0.0099 | recon 0.2844 | target: two | 84.3s
step 3300/7000 | loss 0.0081 | text 0.0061 | recon 0.1965 | target: eight | 87.0s
step 3400/7000 | loss 0.0051 | text 0.0051 | recon 0.0089 | target: four | 89.6s
step 3500/7000 | loss 0.0024 | text 0.0012 | recon 0.1190 | target: eight | 92.3s
step 3600/7000 | loss 0.0058 | text 0.0057 | recon 0.0109 | target: seven | 94.9s
step 3700/7000 | loss 0.0688 | text 0.0675 | recon 0.1314 | target: three | 97.5s
step 3800/7000 | loss 0.0028 | text 0.0006 | recon 0.2214 | target: two | 100.2s
step 3900/7000 | loss 0.0004 | text 0.0004 | recon 0.0000 | target: seven | 102.8s
step 4000/7000 | loss 0.0010 | text 0.0010 | recon 0.0000 | target: six | 105.5s
step 4100/7000 | loss 0.0030 | text 0.0029 | recon 0.0087 | target: four | 108.1s
step 4200/7000 | loss 0.0017 | text 0.0007 | recon 0.0997 | target: three | 110.8s
step 4300/7000 | loss 0.0133 | text 0.0131 | recon 0.0255 | target: five | 113.4s
step 4400/7000 | loss 0.0018 | text 0.0016 | recon 0.0227 | target: nine | 116.1s
step 4500/7000 | loss 0.0008 | text 0.0008 | recon 0.0000 | target: seven | 118.7s
step 4600/7000 | loss 0.0045 | text 0.0021 | recon 0.2454 | target: two | 121.3s
step 4700/7000 | loss 0.0337 | text 0.0337 | recon 0.0000 | target: six | 124.0s
step 4800/7000 | loss 0.0210 | text 0.0210 | recon 0.0000 | target: seven | 126.6s
step 4900/7000 | loss 0.0016 | text 0.0015 | recon 0.0027 | target: seven | 129.2s
step 5000/7000 | loss 0.0058 | text 0.0029 | recon 0.2963 | target: two | 131.9s
step 5100/7000 | loss 0.0012 | text 0.0008 | recon 0.0332 | target: six | 134.5s
step 5200/7000 | loss 0.0071 | text 0.0055 | recon 0.1645 | target: eight | 137.1s
step 5300/7000 | loss 0.0374 | text 0.0364 | recon 0.1005 | target: eight | 139.8s
step 5400/7000 | loss 0.0156 | text 0.0153 | recon 0.0342 | target: three | 142.4s
step 5500/7000 | loss 0.0170 | text 0.0168 | recon 0.0220 | target: five | 145.0s
step 5600/7000 | loss 0.0094 | text 0.0094 | recon 0.0000 | target: five | 147.7s
step 5700/7000 | loss 0.0032 | text 0.0013 | recon 0.1839 | target: eight | 150.3s
step 5800/7000 | loss 0.0034 | text 0.0029 | recon 0.0512 | target: three | 153.0s
step 5900/7000 | loss 0.0072 | text 0.0070 | recon 0.0234 | target: four | 155.6s
step 6000/7000 | loss 0.0011 | text 0.0007 | recon 0.0436 | target: nine | 158.2s
step 6100/7000 | loss 0.0052 | text 0.0023 | recon 0.2873 | target: two | 160.9s
step 6200/7000 | loss 0.0036 | text 0.0022 | recon 0.1407 | target: eight | 163.5s
step 6300/7000 | loss 0.0019 | text 0.0007 | recon 0.1245 | target: three | 166.1s
step 6400/7000 | loss 0.0297 | text 0.0297 | recon 0.0075 | target: five | 168.8s
step 6500/7000 | loss 0.0018 | text 0.0000 | recon 0.1726 | target: one | 171.4s
step 6600/7000 | loss 0.0076 | text 0.0071 | recon 0.0481 | target: nine | 174.1s
step 6700/7000 | loss 0.0141 | text 0.0135 | recon 0.0627 | target: nine | 176.7s
step 6800/7000 | loss 0.0040 | text 0.0034 | recon 0.0582 | target: six | 179.3s
step 6900/7000 | loss 0.0048 | text 0.0027 | recon 0.2074 | target: two | 182.0s
step 7000/7000 | loss 0.0160 | text 0.0158 | recon 0.0199 | target: nine | 184.6s
total training time: 184.6s
--- inference: BAE×BAE + Tensor Product Attention ---
[8] true: eight | generated: eight ✓
[9] true: nine | generated: nine ✓
[4] true: four | generated: four ✓
[3] true: three | generated: three ✓
[4] true: four | generated: four ✓
[2] true: two | generated: two ✓
[8] true: eight | generated: eight ✓
[3] true: three | generated: three ✓
[1] true: one | generated: one ✓
[0] true: zero | generated: zero ✓
[4] true: four | generated: four ✓
[7] true: seven | generated: seven ✓
[5] true: five | generated: five ✓
[9] true: nine | generated: nine ✓
[0] true: zero | generated: five ✗
[9] true: nine | generated: nine ✓
[3] true: three | generated: three ✓
[7] true: seven | generated: seven ✓
[8] true: eight | generated: eight ✓
[7] true: seven | generated: seven ✓
accuracy: 19/20 = 95%
```