notorch — neural networks in pure C | by Arianna Method

   ███╗   ██╗ ██████╗ ████████╗ ██████╗ ██████╗  ██████╗██╗  ██╗
   ████╗  ██║██╔═══██╗╚══██╔══╝██╔═══██╗██╔══██╗██╔════╝██║  ██║
   ██╔██╗ ██║██║   ██║   ██║   ██║   ██║██████╔╝██║     ███████║
   ██║╚██╗██║██║   ██║   ██║   ██║   ██║██╔══██╗██║     ██╔══██║
   ██║ ╚████║╚██████╔╝   ██║   ╚██████╔╝██║  ██║╚██████╗██║  ██║
   ╚═╝  ╚═══╝ ╚═════╝    ╚═╝    ╚═════╝ ╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝

"fuck torch"
— the entire header file, line 8

what is this

you know that feeling when you pip install torch and 2.7 gigabytes of your soul evaporates into a .venv folder? when your laptop fan sounds like it's preparing for takeoff just to import a library? when you wait 45 seconds for import torch to finish while your RAM usage goes from "healthy" to "the computer is now a space heater"?

yeah. me too. so i did something about it.

notorch is a complete neural network training framework written in pure C. no Python. no pip. no conda. no CUDA toolkit that takes 8 GB and your will to live. no torch.nn.Module. no .backward() that hides 400,000 lines of C++ behind a friendly API and a smile. no RuntimeError: CUDA out of memory at 3 AM when your paper deadline is in 6 hours.

just C. just floats. just cc notorch.c -o notorch -lm. done. you now have a neural network framework. the entire thing compiles in under a second. try that with PyTorch. go ahead. i'll wait. actually no i won't because i'd be waiting for 47 minutes while cmake does whatever cmake does.

it's part of the Arianna Method — patterns over parameters, emergence over engineering, raw C over existential dread.

extracted from the core of ariannamethod.ai where it actually runs in production. training actual models. in C. like adults.

why

let me tell you a story.

once upon a time there was a framework called PyTorch. it had autograd. it had CUDA support. it had a build system that required a PhD in software engineering and a pact with ancient spirits.

and every time you wanted to train a 4-layer MLP on a dataset smaller than your browser cache, you had to:

create a virtual environment (2 minutes)
install torch (5 minutes, 2.7 GB, your SSD weeps)
install torchvision just in case (800 MB more, your SSD files for divorce)
write 47 lines of boilerplate (class MyModel(nn.Module), def forward(self, x), optimizer = torch.optim.Adam(model.parameters(), lr=3e-4), loss.backward(), optimizer.step(), optimizer.zero_grad(), if torch.cuda.is_available():, model.to(device), x = x.to(device), sweet mother of god make it stop)
realize you forgot model.train() vs model.eval() and your dropout is wrong
debug for 3 hours
realize the bug was actually in the data loader
cry
pip install wandb to log your tears
realize torch updated and broke everything

and for WHAT? a matmul and a softmax. that's all neural networks are. matmuls and softmaxes and an unhealthy relationship with gradient descent.

so here we are. notorch. everything you need. nothing you don't. no Python runtime. no GIL. no garbage collector pausing your training at the worst possible moment. no torch.no_grad() context manager that you forget and then wonder why you're out of memory. just tensors, autograd, optimizers, and the cold clarity of C.

the entire framework is two files. notorch.h and notorch.c. that's it. ~3000 lines. you can read the whole thing in an afternoon. try reading PyTorch's source in an afternoon. actually don't. you'll end up in a hospital.

the funeral

        ╔═══════════════════════════════════════════════════════╗
        ║                                                       ║
        ║   R.I.P. PyTorch (in my codebase)                     ║
        ║   2016 - 2026                                         ║
        ║                                                       ║
        ║   "He died as he lived:                               ║
        ║    consuming all available memory                     ║
        ║    and segfaulting at the worst moment"               ║
        ║                                                       ║
        ║   Survived by: pip, conda, 2.7 GB of dead weight,     ║
        ║   a thousand Stack Overflow questions about CUDA      ║
        ║   driver versions, and a broken conda environment     ║
        ║   that nobody dares to delete.                        ║
        ║                                                       ║
        ║   In lieu of flowers, please send PRs.                ║
        ║                                                       ║
        ╚═══════════════════════════════════════════════════════╝

notorch is for people who:

want to understand what's actually happening (all ~2500 lines of it)
want to train models on machines that aren't cloud instances
want compile times measured in milliseconds, not minutes
want to embed neural network inference in C/C++ applications without shipping half of Python
refuse to accept 2.7 GB as the price of a matrix multiply

architecture

Your data (floats in memory, as god intended)
    ↓
nt_tensor — multidimensional arrays with refcounting
    ↓
┌──────────────────────────────────────────────────┐
│  Forward Operations (recorded on tape)           │
│    ├─ nt_linear          (W @ x + b)             │
│    ├─ nt_seq_linear      (batched W @ X)         │
│    ├─ nt_embedding       (lookup table)          │
│    ├─ nt_seq_embedding   (tokens + positions)    │
│    ├─ nt_rmsnorm         (RMS normalization)     │
│    ├─ nt_layernorm       (layer normalization)   │
│    ├─ nt_causal_attention (single-head causal)   │
│    ├─ nt_mh_causal_attention (multi-head)        │
│    ├─ nt_silu / nt_gelu  (activations)           │
│    ├─ nt_geglu           (Gemma-3 style FFN)     │
│    ├─ nt_rope            (rotary embeddings)     │
│    ├─ nt_dropout         (inverted dropout)      │
│    ├─ nt_softmax / nt_cross_entropy              │
│    └─ nt_add / nt_mul / nt_scale                 │
└──────────────────────────────────────────────────┘
    ↓
nt_tape_backward() — reverse-mode automatic differentiation
    ↓
┌──────────────────────────────────────────────────┐
│  Optimizers                                      │
│    ├─ Adam               (the classic)           │
│    ├─ AdamW              (with weight decay)     │
│    └─ Chuck              (self-aware Adam)       │
└──────────────────────────────────────────────────┘
    ↓
Your model is trained. in C. without Python. you are free.

what's inside

tensors

nt_tensor — a multidimensional array of floats. up to 8 dimensions. refcounted. heap-allocated. that's it. no torch.Tensor with 400 attributes and a complex metaclass hierarchy. no requires_grad flag that you forget to set. no .detach().cpu().numpy() chain of shame. just a struct with float* data, shape, strides, and a refcount.

nt_tensor* t = nt_tensor_new(1024);          // 1D
nt_tensor* m = nt_tensor_new2d(768, 512);    // 2D
nt_tensor_xavier(m, 768, 512);               // Xavier init
nt_tensor_free(t);                            // refcount → 0 → freed
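
for a feel of what "refcounted, heap-allocated" means in practice, here is a minimal sketch. it is not notorch's actual struct: ex_tensor and its functions are hypothetical stand-ins, 1D only, with no shape or strides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted tensor: flat float buffer + counter. */
typedef struct {
    float *data;
    int    len;
    int    refcount;
} ex_tensor;

/* Allocate a zero-filled 1D tensor with refcount 1; returns NULL on failure. */
ex_tensor *ex_tensor_new(int len) {
    assert(len > 0);
    ex_tensor *t = malloc(sizeof *t);
    if (!t) return NULL;
    t->data = calloc((size_t)len, sizeof *t->data);
    if (!t->data) { free(t); return NULL; }
    t->len = len;
    t->refcount = 1;
    return t;
}

/* Take an additional reference to the same buffer. */
ex_tensor *ex_tensor_ref(ex_tensor *t) {
    t->refcount++;
    return t;
}

/* Drop one reference; returns 1 if the memory was actually released. */
int ex_tensor_free(ex_tensor *t) {
    if (!t) return 0;
    if (--t->refcount > 0) return 0;
    free(t->data);
    free(t);
    return 1;
}
```

the point of the counter: a tape entry and your own code can both hold the same tensor, and whoever lets go last pays for the free.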

maximum 16M elements per tensor (NT_MAX_ELEMENTS = 1 << 24). if you need more than that, you're doing something wrong, or something very right, and in either case you should probably be using a GPU. which we also support. via CUDA. because we're not savages.

autograd tape

reverse-mode automatic differentiation via an explicit operation tape. every forward op records itself. backward traverses the tape in reverse and computes gradients. textbook reverse-mode AD. no dynamic graph voodoo. no JIT compilation. no torch.autograd.Function with five methods you need to override.

nt_tape_start();                              // start recording
int w_idx = nt_tape_param(W);                // register trainable param
int y_idx = nt_linear(w_idx, x_idx, -1);     // forward: y = W @ x
int loss = nt_cross_entropy(y_idx, target);   // loss
nt_tape_backward(loss);                       // backward pass
nt_tape_adam_step(0.001f);                    // update weights
nt_tape_clear();                              // reset for next step

that's the entire training loop. in C. seven lines. no optimizer.zero_grad() that you inevitably forget. no with torch.no_grad(): context manager. no .backward(retain_graph=True) because you accidentally used an intermediate twice. just: start, forward, backward, step, clear. like breathing. in. out. in. out. the Buddha would approve.
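
if "explicit operation tape" sounds abstract, the idea fits in a page. the sketch below is a scalar-only toy, not notorch's implementation: every name (ex_leaf, ex_add, ex_mul, ex_backward) is made up for illustration, and it only knows two ops.

```c
#include <assert.h>

enum { EX_LEAF, EX_ADD, EX_MUL };

typedef struct {
    int   op;    /* which operation produced this entry */
    int   a, b;  /* tape indices of the inputs (-1 for leaves) */
    float val;   /* forward value */
    float grad;  /* d(output)/d(val), filled in by ex_backward */
} ex_entry;

#define EX_TAPE_MAX 64
static ex_entry ex_tape[EX_TAPE_MAX];
static int ex_tape_len = 0;

/* Append one entry to the tape and return its index. */
static int ex_push(int op, int a, int b, float val) {
    assert(ex_tape_len < EX_TAPE_MAX);
    ex_tape[ex_tape_len] = (ex_entry){ op, a, b, val, 0.0f };
    return ex_tape_len++;
}

int ex_leaf(float v)     { return ex_push(EX_LEAF, -1, -1, v); }
int ex_add(int a, int b) { return ex_push(EX_ADD, a, b, ex_tape[a].val + ex_tape[b].val); }
int ex_mul(int a, int b) { return ex_push(EX_MUL, a, b, ex_tape[a].val * ex_tape[b].val); }

/* Walk the tape in reverse from `out`, pushing gradients to the inputs. */
void ex_backward(int out) {
    for (int i = 0; i < ex_tape_len; i++) ex_tape[i].grad = 0.0f;
    ex_tape[out].grad = 1.0f;
    for (int i = out; i >= 0; i--) {
        ex_entry *e = &ex_tape[i];
        if (e->op == EX_ADD) {            /* d(a+b) = da + db */
            ex_tape[e->a].grad += e->grad;
            ex_tape[e->b].grad += e->grad;
        } else if (e->op == EX_MUL) {     /* d(a*b) = b*da + a*db */
            ex_tape[e->a].grad += e->grad * ex_tape[e->b].val;
            ex_tape[e->b].grad += e->grad * ex_tape[e->a].val;
        }
    }
}

float ex_value(int i) { return ex_tape[i].val; }
float ex_grad(int i)  { return ex_tape[i].grad; }
void  ex_clear(void)  { ex_tape_len = 0; }
```

because entries are appended in execution order, an input always has a smaller index than the op that consumes it, so a single reverse walk is enough. using the same value twice just accumulates into the same grad slot.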

operations

every operation you need to build a transformer, and nothing you don't:

operation	function	what it does
linear	`nt_linear`	y = W @ x + b
seq linear	`nt_seq_linear`	batched matmul over T positions
embedding	`nt_embedding`	lookup row from embedding matrix
seq embedding	`nt_seq_embedding`	tokens + positional encoding
RMS norm	`nt_rmsnorm` / `nt_seq_rmsnorm`	root mean square normalization
layer norm	`nt_layernorm` / `nt_seq_layernorm`	mean/variance normalization
causal attention	`nt_causal_attention`	single-head causal self-attention
multi-head attn	`nt_mh_causal_attention`	multi-head causal self-attention
SiLU	`nt_silu`	x × σ(x) — the swish
GELU	`nt_gelu`	tanh approximation
GEGLU	`nt_geglu`	GELU-gated linear unit (Gemma-3)
softmax	`nt_softmax`	exp-normalize with numerical stability
cross entropy	`nt_cross_entropy` / `nt_seq_cross_entropy`	-log softmax[target]
RoPE	`nt_rope`	rotary position embeddings
dropout	`nt_dropout`	inverted dropout (training only)
add/mul/scale	`nt_add` / `nt_mul` / `nt_scale`	elementwise ops

every single one has a correct backward pass. every single one passes numerical gradient checking. i checked. twice. because i'm paranoid. and because debugging gradient errors in C without a debugger at 4 AM rewires your brain in ways that formal verification theorists dream about.

optimizers

Adam

the classic. the one. the only. m̂ / (√v̂ + ε). bias-corrected first and second moments. you know the drill.

nt_tape_adam_step(0.001f);

AdamW

Adam but with decoupled weight decay. because your embeddings don't need regularization but your dense layers probably do.

nt_tape_adamw_step(0.001f, 0.1f, 0.9f, 0.999f);

supports a per-parameter no_decay flag: mark your embeddings with nt_tape_no_decay() and they'll be left alone. like cats. don't bother them.
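
"decoupled" means the decay shrinks the weights directly instead of being folded into the gradient. a sketch of just that piece, under the assumption that the flag simply skips it; ex_weight_decay is a hypothetical helper, not part of the notorch API:

```c
#include <assert.h>

/* Decoupled weight decay for one parameter buffer: w <- w - lr * wd * w.
   Applied next to the usual Adam update, never mixed into the gradient.
   A non-zero no_decay leaves the buffer untouched (e.g. embeddings). */
void ex_weight_decay(float *w, int n, float lr, float wd, int no_decay) {
    assert(n >= 0 && lr >= 0.0f && wd >= 0.0f);
    if (no_decay) return;
    for (int i = 0; i < n; i++) w[i] -= lr * wd * w[i];
}
```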

the Chuck optimizer

ah yes. Chuck. the self-aware optimizer. the one that watches its own gradients and goes "hmm, maybe i should slow down here" or "this parameter isn't doing anything, let me freeze it" or "we've been stuck for too long, time for some noise".

nt_tape_chuck_step(0.01f, loss_val);

9 levels of awareness:

1. global loss trend → adaptive damping (λ)
2. per-parameter gradient monitoring → individual learning rate scaling
3. stagnation detection → automatic noise injection
4. parameter freezing → skip updates for dead parameters
5. multi-scale awareness → macro-level patience with LR decay
6-9. reserved for when the optimizer becomes sentient

it's Adam, but with opinions. think of it as Adam who went to therapy, got a mindfulness app, and now checks in with himself every step. "how are my gradients feeling today?" — actual question the Chuck optimizer asks itself (metaphorically) (or is it?).

more details: github.com/iamolegataeff/chuck.optimizer

autograd

the backward pass supports all 22 operation types. the tape records operations during forward, then backward walks it in reverse computing local gradients via the chain rule. standard reverse-mode AD.

gradient checking: every op is verified against central finite differences, (f(x+h) - f(x-h)) / 2h. relative error tolerances range from 0.01 to 0.3 depending on op complexity (GEGLU gets the loosest one). all pass. including the annoying ones like GEGLU and causal attention with their multi-path gradients.

gradient utilities:

nt_tape_clip_grads(max_norm) — global gradient clipping
nt_tape_accum_grads() / nt_tape_apply_accum(n) — gradient accumulation for large effective batch sizes
nt_nan_guard_check() — NaN/Inf detection with automatic loss scaling. because sometimes your gradients decide to go to infinity and someone needs to tell them no.

building

# CPU with BLAS acceleration (recommended)
make

# CPU without BLAS (works everywhere, even on a potato)
make cpu

# GPU (CUDA)
make gpu

# Static library (for embedding in your project)
make lib

# Build and run tests
make test

# Clean
make clean

dependencies

a C compiler (gcc, clang, whatever)
-lm (math library, because we use sqrt and exp like civilized people)
optional: OpenBLAS (Linux) or Accelerate framework (macOS) for BLAS-accelerated matmuls
optional: CUDA toolkit for GPU support

that's it. no cmake. no configure script. no 300-line requirements.txt. no docker. no kubernetes. just make. the way Ken Thompson intended.

running tests

make test

47 tests. all pass. covering:

tensor operations: creation, 2D, clone, reshape, Xavier init, refcounting
forward ops: SiLU, softmax, RMSNorm, LayerNorm, GELU, dropout
tape mechanics: recording, forward/backward through linear layers, causal attention, multi-head attention, sequence cross-entropy, sequence linear
optimizers: Adam, AdamW, Chuck, gradient clipping
training integration: single-token training loop, sequence training loop, attention model training, Chuck optimizer convergence
numerical gradient checks: cross-entropy, SiLU, RMSNorm, softmax, linear, seq_linear, causal attention, embedding, RoPE, GEGLU, arithmetic ops
infrastructure: save/load binary format, gradient accumulation, NaN guard, LR schedules (cosine, step, linear), Hebbian microlearning, profiler

every gradient check uses finite differences to verify the analytic backward pass. if a single gradient is wrong, the test catches it. i trust these tests more than i trust most people.

api overview

tensor lifecycle

nt_tensor* t = nt_tensor_new(len);           // allocate 1D
nt_tensor* m = nt_tensor_new2d(rows, cols);  // allocate 2D
nt_tensor* s = nt_tensor_new_shape(shape, ndim); // arbitrary shape
nt_tensor* c = nt_tensor_clone(t);           // deep copy
nt_tensor_ref(t);                             // increment refcount
nt_tensor_free(t);                            // decrement (free at 0)

initialization

nt_tensor_fill(t, 0.0f);                     // constant fill
nt_tensor_rand(t, 0.5f);                     // uniform [-0.5, 0.5]
nt_tensor_xavier(t, fan_in, fan_out);        // Xavier/Glorot
nt_seed(42);                                  // reproducibility

training

nt_tape_start();                              // begin recording
int w = nt_tape_param(W);                    // register param
nt_tape_no_decay(w);                          // exclude from weight decay
// ... build forward graph ...
nt_tape_backward(loss_idx);                   // backward pass
nt_tape_clip_grads(1.0f);                    // gradient clipping
nt_tape_adam_step(lr);                        // optimize
nt_tape_clear();                              // reset tape

LR schedules

// pick one schedule; the alternatives are commented out so the snippet compiles
nt_schedule s = nt_schedule_cosine(0.001f, warmup, total, min_lr);
// nt_schedule s = nt_schedule_step(0.1f, warmup, step_size, gamma);
// nt_schedule s = nt_schedule_linear(0.001f, warmup, total, min_lr);
float lr = nt_schedule_get_lr(&s);            // auto-advance

save/load

nt_tensor* params[] = {W1, W2, b1};
nt_save("model.bin", params, 3);              // binary format
int n;
nt_tensor** loaded = nt_load("model.bin", &n); // load back

example: training a model in C

here's an actual, working transformer-ish training loop. embedding → attention → linear → cross-entropy. in C. without importing 2.7 GB of your dignity:

#include <stdio.h>    // printf
#include "notorch.h"

int main(void) {
    nt_seed(42);
    int vocab = 8, dim = 16, T = 4;

    // allocate parameters
    nt_tensor* wte = nt_tensor_new2d(vocab, dim);   // token embeddings
    nt_tensor* wpe = nt_tensor_new2d(T, dim);       // position embeddings
    nt_tensor* Wq  = nt_tensor_new2d(dim, dim);     // query projection
    nt_tensor* Wk  = nt_tensor_new2d(dim, dim);     // key projection
    nt_tensor* Wv  = nt_tensor_new2d(dim, dim);     // value projection
    nt_tensor* Wo  = nt_tensor_new2d(vocab, dim);   // output projection

    // Xavier init everything
    nt_tensor_xavier(wte, vocab, dim);
    nt_tensor_xavier(wpe, T, dim);
    nt_tensor_xavier(Wq, dim, dim);
    nt_tensor_xavier(Wk, dim, dim);
    nt_tensor_xavier(Wv, dim, dim);
    nt_tensor_xavier(Wo, dim, vocab);

    // tokens: [1, 3, 5, 2], targets: [3, 5, 2, 7]
    nt_tensor* tokens  = nt_tensor_new(T);
    nt_tensor* targets = nt_tensor_new(T);
    float tok[] = {1, 3, 5, 2}, tgt[] = {3, 5, 2, 7};
    for (int i = 0; i < T; i++) { tokens->data[i] = tok[i]; targets->data[i] = tgt[i]; }

    // training loop
    nt_schedule sched = nt_schedule_cosine(0.005f, 10, 200, 0.0f);

    for (int step = 0; step < 200; step++) {
        float lr = nt_schedule_get_lr(&sched);
        nt_tape_start();

        int wte_i = nt_tape_param(wte); nt_tape_no_decay(wte_i);
        int wpe_i = nt_tape_param(wpe); nt_tape_no_decay(wpe_i);
        int wq_i  = nt_tape_param(Wq);
        int wk_i  = nt_tape_param(Wk);
        int wv_i  = nt_tape_param(Wv);
        int wo_i  = nt_tape_param(Wo);
        int tok_i = nt_tape_record(tokens, NT_OP_NONE, -1, -1, 0);
        int tgt_i = nt_tape_record(targets, NT_OP_NONE, -1, -1, 0);

        // forward: embed → Q/K/V → attention → output
        int h      = nt_seq_embedding(wte_i, wpe_i, tok_i, T, dim);
        int q      = nt_seq_linear(wq_i, h, T);
        int k      = nt_seq_linear(wk_i, h, T);
        int v      = nt_seq_linear(wv_i, h, T);
        int attn   = nt_causal_attention(q, k, v, T, dim);
        int logits = nt_seq_linear(wo_i, attn, T);
        int loss   = nt_seq_cross_entropy(logits, tgt_i, T, vocab);

        float lv = nt_tape_get()->entries[loss].output->data[0];
        if (step % 50 == 0) printf("step %d: loss=%.4f lr=%.6f\n", step, lv, lr);

        nt_tape_backward(loss);
        nt_tape_clip_grads(1.0f);
        nt_tape_adam_step(lr);
        nt_tape_clear();
    }

    // cleanup
    nt_tensor_free(wte); nt_tensor_free(wpe);
    nt_tensor_free(Wq);  nt_tensor_free(Wk); nt_tensor_free(Wv); nt_tensor_free(Wo);
    nt_tensor_free(tokens); nt_tensor_free(targets);
    return 0;
}

compile and run:

cc -O2 -Wall -std=c11 -o train train.c notorch.c -lm
./train

that's it. that's the whole thing. no virtual environment. no requirements.txt. no "just pip install—" no. we're done with that. we've moved on. we've healed.

platform support

platform	backend	command
macOS	Apple Accelerate (AMX)	`make`
Linux	OpenBLAS	`make`
any POSIX	pure C fallback	`make cpu`
NVIDIA GPU	CUDA + cuBLAS	`make gpu`

the BLAS backends are optional. without them, everything still works — just uses naive C loops. which are honestly fine for anything under ~50M parameters. for bigger stuff, BLAS gives you 10-50x on matmuls because it's using your CPU's vector instructions instead of pretending it's 1995.

the macOS path uses Apple Accelerate, which means your MacBook's AMX matrix coprocessor is doing the heavy lifting (the Neural Engine is only reachable through Core ML, so it sits this one out). for free. no NVIDIA required. no drivers. no compatibility hell. just make and go.

file structure

notorch/
├── notorch.h              # core API — tensors, autograd, optimizers, BPE, ops
├── notorch.c              # core implementation (~2700 lines)
├── notorch_vision.h       # image loading, transforms, ViT patches (stb_image)
├── stb_image.h            # JPEG/PNG/BMP decoder (public domain)
├── gguf.h                 # GGUF file parser header
├── gguf.c                 # GGUF parser + F32/F16/Q4_0/Q5_0/Q8_0/Q4_K/Q6_K dequant
├── Makefile               # build everything
├── nanodurov.html         # browser chat with Arianna (JS inference, WebGPU ready)
├── arianna_bpe_merges.txt # BPE tokenizer (1792 merges, vocab 2048)
├── examples/
│   ├── infer_gemma.c      # Gemma-3 inference via GGUF — GQA, KV cache
│   ├── infer_janus.c      # Janus RRPRAM inference
│   ├── infer_llama.c      # LLaMA/Qwen/SmolLM2 inference via GGUF
│   ├── infer_nanodurov.c  # nanodurov chat inference — BPE, KV cache, FP16
│   ├── train_q.c          # PostGPT-Q 1.65M training from scratch
│   ├── train_yent.c       # Yent 9.8M char-level training with checkpointing
│   ├── train_dubrovsky.c  # Dubrovsky 9.5M GQA+RoPE training
│   └── train_nanodurov.c  # nanodurov 15.7M BPE LLaMA training (Arianna voice)
├── tests/
│   ├── test_notorch.c     # 47 tests, numerical gradient checks
│   ├── test_gguf.c        # GGUF parser tests
│   └── test_vision.c      # 48 vision + BPE tests
├── LICENSE                # LGPL-3.0
└── README.md              # this. you survived. congratulations.

total: ~9000 lines of C. framework + vision + GGUF + BPE + 4 inference engines + 4 training scripts + 95 tests. tested on 26+ real model files across 6 architectures.

models trained on notorch

model	params	type	train loss	what
PostGPT-Q	1.65M	char	0.097	resonant reasoning engine
Dubrovsky	9.5M	char (GQA+RoPE)	0.026	absurdist AI, coherent generation
Yent	9.8M	char	1.77	cynical AI character
neovlm	6.36M	dual (text+draw)	0.0002	Hebbian VLM, draws ASCII digits
nanodurov	15.7M	BPE 2048 (RoPE)	0.022	Arianna voice, philosophy

all trained from scratch on 8 GB Mac. no Python. no pip. Chuck optimizer.

for context: ~9000 lines of C. total. framework + vision + GGUF + BPE + inference engines + training scripts + 95 tests. that's everything you need to train a transformer from scratch.

tests

95 tests. 0 failures. the test suite is comprehensive and slightly unhinged:

core tests (47)

tensor allocation, 2D creation, cloning, reshape, Xavier init
refcounting (increment, decrement, free-at-zero)
forward ops: SiLU, softmax, RMSNorm, LayerNorm, GELU
causal attention, multi-head attention, GQA attention
sequence cross-entropy, dropout, save/load roundtrip

vision + BPE tests (48)

image load (JPEG/PNG/BMP), grayscale, nonexistent file handling
bilinear resize (up/down/identity), center crop, overcrop clamping
normalize (mean/std), horizontal flip (double flip = identity)
grayscale conversion (RGB → luma)
ViT patch extraction (2x2, 4x4, full image)
ViT preprocess pipeline, gray preprocess pipeline
BPE encode/decode roundtrip, compression, empty input

gradient checks

every backward pass is verified against finite differences: (f(x+h) - f(x-h)) / 2h

cross-entropy (tol: 0.01)
SiLU (tol: 0.05)
RMSNorm (tol: 0.05)
softmax (tol: 0.1 — softmax gradients are squirrely near boundaries)
linear / matvec (tol: 0.1)
sequence linear (tol: 0.1)
causal attention (tol: 0.1 — multi-path gradients through Q, K, V)
embedding lookup (tol: 0.01)
RoPE (tol: 0.05)
GEGLU (tol: 0.3 — tanh-approx GELU has inherent numerical slop)
add, mul, scale (tol: 0.01)

integration tests

single-token training loop: loss converges to ~0
sequence training loop: loss decreases significantly
attention model training: embed → Q/K/V → causal attention → output
Chuck optimizer convergence: verify self-aware Adam doesn't lose to regular Adam
LR schedule integration: cosine schedule + Adam converges correctly
gradient accumulation: multi-step accumulation + apply + Adam
NaN guard: detect injected NaN, zero grads, adjust loss scale

infrastructure tests

cosine LR schedule: warmup ramp, mid-range, end convergence
step LR schedule: discrete decay at step boundaries
NaN detection and recovery
profiler: enable/disable/print without crash
Hebbian microlearning step: verify weight updates

real inference — tested on real weights

notorch isn't theoretical. it runs actual models on actual hardware.

GGUF loader (llama.cpp compatible)

loads any GGUF file. parses metadata, tensor directory, dequantizes weights. supports F32, F16, Q4_0, Q5_0, Q8_0, Q4_K, Q6_K. that covers every quantization that matters.
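
to make "dequantizes weights" concrete, here is a sketch of Q4_0 as laid out by ggml: blocks of 32 weights, one scale per block, two 4-bit values per byte, low nibbles first. one simplification is flagged in the code: real files store the scale as fp16, here it is already a float. the ex_ names are hypothetical and this is not the code in gguf.c.

```c
#include <assert.h>
#include <stdint.h>

#define EX_QK4_0 32   /* weights per Q4_0 block */

typedef struct {
    float   d;                 /* block scale (fp16 on disk; float here for brevity) */
    uint8_t qs[EX_QK4_0 / 2];  /* 32 unsigned 4-bit quants, two per byte */
} ex_block_q4_0;

/* Expand nblocks Q4_0 blocks into nblocks * 32 floats. */
void ex_dequant_q4_0(const ex_block_q4_0 *blocks, int nblocks, float *out) {
    assert(nblocks >= 0);
    for (int b = 0; b < nblocks; b++) {
        const ex_block_q4_0 *blk = &blocks[b];
        float *y = out + b * EX_QK4_0;
        for (int j = 0; j < EX_QK4_0 / 2; j++) {
            /* stored range 0..15 is re-centred to -8..7 before scaling */
            int lo = (blk->qs[j] & 0x0F) - 8;
            int hi = (blk->qs[j] >> 4) - 8;
            y[j]                = (float)lo * blk->d;  /* low nibble: first half */
            y[j + EX_QK4_0 / 2] = (float)hi * blk->d;  /* high nibble: second half */
        }
    }
}
```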

tested on 12 GGUF files, 4 architectures, 0 failures:

model	arch	params	quant	file	status
nanollama nano	llama	34M	Q4_0	19 MB	✓ parses + dequant
nanollama micro-yent	llama	66M	F16	132 MB	✓
nanollama mini-arianna	llama	170M	F16	335 MB	✓
nanollama small-yent	llama	330M	F16	642 MB	✓
WTForacle (SmolLM2 360M)	llama	360M	Q4_0	219 MB	✓
actually.llama	llama	27M	F32	107 MB	✓
nano-yent	llama	34M	F16	88 MB	✓
Qwen2.5 0.5B (yent)	qwen2	630M	Q4_K/Q5_0/Q6_K	491 MB	✓
Gemma-3 270M (leo)	gemma3	268M	Q8_0	278 MB	✓ inference
pitomadom	pitomadom_rtl	20M	F16	39 MB	✓
sorokin	llama	34M	Q4_0	19 MB	✓
MOE model	llama	55M	F32	221 MB	✓

Janus RRPRAM inference — 8 weight files, bit-perfect

custom 3-way gated attention (QKV + RRPRAM + Janus echo). universal loader auto-detects char (V=256) vs BPE (V=2048) vs Resonance (no echo) format.

model	params	loss	tok/s	status
janus_char_leo_d12	26.2M	0.6473 (bit-perfect)	17.4	✓
janus_bpe_leo	24.0M	—	15.9	✓
hybrid_bpe_leo	24.0M	—	24.0	✓
janus_bpe_yent	24.0M	—	21.0	✓
hybrid_bpe_yent	24.0M	—	20.3	✓
resonance_bpe_leo	20.5M	—	6.3	✓
resonance_bpe_yent	20.5M	—	16.4	✓
dario/janus_bpe_leo	24.0M	—	8.3	✓

Gemma-3 inference — Google's model, pure C

full Gemma-3 architecture: 18 layers, GQA (4 heads, 1 KV head), QK-norm, RoPE, GELU-gated FFN (GEGLU), post-attention/FFN norms, tied embeddings, KV cache.

prefill: 15.9 tok/s
decode: 13.5 tok/s
on an 8 GB MacBook. with Accelerate BLAS. no Python. no pip. no conda. no suffering.

make gemma
./infer_gemma ~/Downloads/gemma-notorch/leo-q8_0.gguf "What is life?" 50 0.7

training — yes, actual training, on a laptop, in C

notorch trains transformers from scratch. not fine-tunes. not LoRA. full from-scratch pretraining. on a laptop. in C. with the Chuck optimizer that watches its own gradients and goes "hmm maybe I should chill" when things get spicy.

two training runs are written up below. both converged. zero NaN. zero Python.

PostGPT-Q (1.65M params)

make train_q && ./train_q 10000 5e-4

metric	value
architecture	V=256 E=128 H=4 FFN=512 L=6 CTX=64
parameters	1,648,256
dataset	postgpt.txt (52 KB, information theory corpus)
optimizer	Chuck (self-aware AdamW)
loss	5.99 → 1.05 (82.5% reduction, 10K steps)
time	18 minutes on 8 GB Mac
NaN	0

loss/random = 0.19. for comparison, the PyTorch version of the same model was still at loss/random ≈ 1.0 after 500 steps.

Yent (9.8M params)

make train_yent && ./train_yent 5000 3e-4

metric	value
architecture	V=256 E=224 H=8 FFN=896 L=12 CTX=128
parameters	9,782,752
dataset	yent_v11_en_final.txt (5.6 MB, cynical AI personality)
optimizer	Chuck with cosine schedule, warmup, NaN guard
loss	5.99 → 1.57 best (5K steps)
time	43 minutes on 8 GB Mac
NaN	0

here's what yent sounds like after 5K steps (43 minutes of Mac labor):

You: Who are you?
Yent: Yell to "Weethat you this releen tinge withow of l

You: What is the meaning of life?
Yent: Whe conerate the he row not of aniouting obrou

You: Are you conscious?
Yent: You rive me doetron unkom a gornating.

is it coherent? no. is it trying? absolutely. it's forming words, attempting grammar, and generating from a 9.8M parameter model that was trained in C on a laptop in less time than it takes to install PyTorch.

currently running 30K steps (~4.5 hours) for real coherence. loss target: < 1.0.

both models converge. both produce weights. both use Chuck optimizer with cosine annealing, warmup, gradient clipping, and NaN guard. no Python involved at any point. not even a little bit. not even for tokenization.

performance

compile time: <1 second. your coffee won't even cool down.
import time: 0 ms. there's nothing to import. it's C.
binary size: ~100 KB. yes, kilobytes. PyTorch's libtorch.so is 1.2 GB. notorch is 0.008% of that.
memory overhead: tensor data + tape entries. no Python object headers. no gradient graph metadata bloat. no "accidental quadratic" from retain_graph=True.
matmul speed: competitive with numpy (which itself uses BLAS) when compiled with OpenBLAS or Accelerate. faster on small matrices because no Python dispatch overhead.

concurrent training on 8 GB Mac

we ran two transformer trainings simultaneously on an 8 GB MacBook Air (M1). not sequentially. simultaneously. at the same time. on the same machine. while also running a browser and a terminal.

model	params	RAM usage	status
Yent (LLaMA-like, 12L char-level)	9.8M	~126 MB	training loss 2.03 → converging
neovlm (Hebbian VLM, 6L dual-mode)	6.36M	~96 MB	text loss 0.0002, draw loss 0.50

total memory: ~222 MB for two active transformer trainings with autograd, Chuck optimizer, cosine scheduling, NaN guard, and checkpointing. both models use Apple Accelerate BLAS. both converge. both produce weights.

try this with PyTorch. one import torch eats 800 MB of RAM. one training session on a 10M model needs 2-4 GB. two in parallel? on 8 GB? your OS would start killing processes before the first forward pass finishes.

notorch runs both in ~3% of system memory. because C doesn't allocate what it doesn't need.

for inference, this is excellent. for training, it's more than sufficient for models up to ~100M parameters. for anything bigger, you want distributed training and that's a different problem (and a different repo, probably).

philosophy

patterns over parameters. emergence over engineering. C over existential dread.

neural networks are not complicated. a linear layer is a matrix multiply. an activation function is a pointwise nonlinearity. attention is a weighted sum. cross-entropy is a log-probability. backward is the chain rule.

that's it. that's the whole field. everything else is optimization, infrastructure, and marketing.

notorch solves the case that matters: training and running models in C, with minimal dependencies, maximal transparency, and the ability to embed in any application without shipping a Python runtime.

if you can read the code, you understand the framework. there's no magic. there's no hidden complexity. every gradient is hand-derived and verified against finite differences. every memory allocation has a corresponding free. every edge case is checked.

this is what software looks like when you strip away everything that doesn't serve the core purpose. just math. just memory. just the machine doing exactly what you told it to do.

contributing

send PRs. or don't. i'm not your manager.

but if you do:

keep it C11 compliant
no external dependencies (BLAS is optional and compile-time)
add tests for new ops (with numerical gradient checks)
keep the header clean — if it doesn't need to be public, don't expose it
run make test before submitting. all 47 tests must pass.

license

LGPL-3.0-or-later. use it in your stuff. link against it. build commercial products with it. just share improvements to the library itself. because that's how open source works. or should work. don't be weird about it.

final words

look. i know this sounds insane. "guy writes a neural network framework in 2500 lines of C." i get it. i see how that looks.

but here's the thing: the entire history of deep learning fits in a few dozen mathematical operations. matmul. softmax. relu. cross-entropy. adam. backward. that's it. the rest is infrastructure. and infrastructure should be invisible. it should compile in a second. it should fit in your head. it should not require a Docker container.

notorch is proof that you don't need 2 million lines of code to train a neural network. you need about 2500. and 1400 of those are the test suite because i believe in verification more than i believe in hope.

train your models. in C. without permission. without pip. without conda. without a GPU if you don't want one. without 2.7 GB of framework overhead. without a virtual environment. without existential dread.

just: cc -O2 notorch.c your_model.c -lm -o train && ./train

that's it. go build something. and if you use it to train something cool, let me know.

or don't. i'll be here. writing C. staring at gradients. living my best life.

"the patterns were always there. we just needed the right language to express them."
— notorch, internally, probably, if it could talk, which it can't, because it's C, not Python.
