
feat(Linux): Tinygrad runner for LLM inference on Linux#1660

Open
triko88 wants to merge 23 commits into exo-explore:main from triko88:feature/linux-support

Conversation


@triko88 triko88 commented Mar 4, 2026

Motivation

The initial version suffered from dysfunctional Linux code. The rewrite uses MLX CUDA for Linux; however, its performance has been reported to be sub-optimal. This limits users to two categories:

  • Apple users who use MLX
  • Linux users with Nvidia graphic cards, who can use MLX CUDA

While the older version (ex-exo) used tinygrad, the overall codebase wasn't optimized to handle the architectural differences between Apple Silicon and contemporary PC architecture. This led to a broken experience on Linux, which I and other Linux users faced, as evidenced by issues #904, #910, #913 and #934.

One user concluded (for the archived version) after using it on an Nvidia RTX 3060 with 12 GB VRAM:

Either I need some other hardware, OS or libraries or this Exo thing does not work at all…
Will give a try later.

Michał Sobczak (https://michalasobczak.pl/ai-ml/2025/03/exo-the-gpu-cluster-tinygrad-mlx/)

The motivation here was clear: build a usable, if not yet performant, tinygrad runner for Linux that can run heterogeneously with Apple systems in the future.

Changes

This change introduces a tinygrad-based runner that can load MLX safetensors weights and run inference with them. Because of its fundamental nature, this is a large change, done in 10 phases, to build a foundational and correct inference engine that can do the following while relying purely on tensor ops:

  • Deserialize Hugging Face MLX safetensors files. The idea is that Apple and non-Apple systems must share the weights and infer accordingly.
  • Implement an architecture spec registry to unify weight names and functions for each transformer architecture.
  • Build quantized linear and embedding layers that are interoperable with their tinygrad counterparts.
  • Run MLX's dequantization strategy on tinygrad during inference.
  • Reduce kernel dispatch and reuse kernels to minimize latency and maximize throughput.
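For context on the dequantization step: MLX's quantized format stores, per linear layer, a packed uint32 weight tensor plus per-group scales and biases, with each group dequantized as w = scale * q + bias. Below is a minimal numpy sketch of 4-bit dequantization; the function name and the low-nibble-first unpacking order are assumptions for illustration, not this PR's actual code.

```python
import numpy as np

def dequantize_mlx_4bit(packed, scales, biases, group_size=64):
    # packed: (out, in // 8) uint32, eight 4-bit values per word (low nibble first)
    # scales, biases: (out, in // group_size) float32, one pair per group
    out_features = packed.shape[0]
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[..., None] >> shifts) & 0xF            # (out, in//8, 8)
    q = nibbles.reshape(out_features, -1).astype(np.float32)  # (out, in)
    groups = q.reshape(out_features, -1, group_size)
    w = scales[..., None] * groups + biases[..., None]        # per-group affine
    return w.reshape(out_features, -1)
```

Since this is an affine map per group, it lowers to a handful of tensor ops (shift, mask, reshape, multiply-add), which is why it can run on tinygrad without any custom kernels.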

Why It Works

The original exo treated tinygrad as a drop-in runtime equivalent to MLX. It isn't. MLX is a runtime: you call an op, it executes. Tinygrad is a compiler: it builds a computation graph, generates GPU source code, compiles it into a kernel, and only then dispatches. First compilation costs 50–1600 ms per kernel shape; cached re-invocations cost ~3 ms.

The old architecture ran tinygrad on a background thread via run_in_executor inside the main process. This meant:

  1. Kernel compilation during inference. No warmup phase, so every unique shape compiled while the user waited, producing a 26 s time-to-first-token.
  2. Auto-tuner disabled. Tinygrad's BEAM search uses Python signals (main thread only). Running in a thread pool crashed with BEAM=1, so users defaulted to BEAM=0, no optimization.
  3. State leaking between requests. Lazy computation graphs accumulated across calls with no clean boundary, causing context contamination and infinite generation loops.

This change exploits the new architecture's process-isolated Runner model. The tinygrad Runner is a separate child process: main thread available, environment variables (DEV, JIT, TC, BEAM) inherited naturally, memory space isolated. The Worker's plan tree sequences the lifecycle correctly: DownloadModel → LoadModel → StartWarmup → Ready. StartWarmup pre-compiles every kernel before accepting requests.

Result: TTFT 26s → 745ms, throughput 9 → 64.5 tok/s, clean generation termination, and context isolation by process boundary. Later commits extend this pattern for other models and GPU backends.
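The compile-once, dispatch-many model and the warmup phase can be illustrated with a shape-keyed cache in plain Python. This is a stand-in sketch, not tinygrad's API; `dispatch`, `warmup`, and the toy kernel are hypothetical names.

```python
# Toy model of compile-then-dispatch: a "kernel" is compiled once per unique
# input shape and cached; warmup pre-populates the cache before serving.
compiled = {}        # shape -> compiled kernel
compile_count = 0

def dispatch(xs):
    global compile_count
    shape = len(xs)
    if shape not in compiled:              # cold path: 50-1600 ms in tinygrad
        compile_count += 1
        compiled[shape] = lambda v: [2.0 * e for e in v]   # toy "kernel"
    return compiled[shape](xs)             # warm path: ~3 ms in tinygrad

def warmup(shapes):
    # StartWarmup: compile every expected shape before accepting requests,
    # so no user-facing request ever pays the cold-compile cost
    for n in shapes:
        dispatch([0.0] * n)

warmup([1, 8])
dispatch([3.0])      # shape 1 already warm: no new compilation
```

Running the warmup inside a dedicated child process is what additionally frees the main thread for BEAM search and gives each request a clean memory boundary.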

Test Plan

Manual Testing

Hardware: HP Omen 16-n0079AX
Specs:

  • CPU: AMD Ryzen 7 6800H
  • GPU: AMD Radeon RX 6650M
  • OS: Omarchy Linux with Cachy OS kernel (Linux v6.19)

E2E test steps

  • Run exo using uv run exo.
  • To test cache coherence, run DEBUG=1 uv run exo.
  • Generally, tinygrad selects the best inference backend for the machine. However, in case of an MMU failure (as with RDNA 2), you can run exo with another inference backend via an environment variable.
  • To test another inference backend, run DEV=<backend> uv run exo with one of the following:
    • DEV=HIP to run Heterogeneous-computing Interface for Portability. Use it for RDNA 2 GPUs or older.
    • DEV=CL to run OpenCL.
  • Select tensor in the dashboard to run tensor cores on your GPU.

Automated Testing

Unit tests are written in src/exo/worker/tests/unittests/test_tinygrad. Run the tests using pytest:

uv run basedpyright && uv run ruff check && uv run pytest

Specifically for the changed files, the most relevant test files are:

  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_cache.py. Exercises KVCache (renamed keys/values fields)
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_sampling.py. Exercises sample_token (strict=True zip fix)
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_generate.py. Exercises the generator (TinyJit import, cache field refs)
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_layers.py. Exercises attention layer
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_weight_loader.py. Exercises weight loading
  • uv run pytest src/exo/shared/tests/test_tokenizer_shared.py. The fixed test_unknown_model_returns_none test

Expected Results

  • basedpyright: 8 pre-existing errors (all in connection_message.py, test_master.py, test_node_id_persistence.py. Not from this branch)
  • ruff: 0 errors
  • pytest: All pass. 5 MLX collection errors on Linux are pre-existing (macOS-only mlx.core dependency). 1 test_master failure is pre-existing (Keypair.to_node_id attribute error).
screenshot-2026-03-03_21-44-47
screenrecording-2026-03-03_22-19-30.mp4

triko88 and others added 21 commits February 19, 2026 15:07
parsing. Currently supports Llama, Qwen (dense) and Mistral.
Fused kernels per forward pass, reducing from 3942 kernels to 356.
Syncs 33 commits from main into the Linux/Tinygrad feature branch.
Resolved 7 merge conflicts:
- Dashboard: kept Tinygrad runtime option, adopted main's collapsible Advanced Options UI
- api.py: added ollama imports with renamed module paths
- runner.py: took main's MLX-specific LLM runner (engine_factory preserved separately)
- uv.lock: took main's lockfile baseline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@triko88 triko88 marked this pull request as ready for review March 4, 2026 22:05
@triko88 triko88 changed the title Tinygrad runner for LLM inference on non-Apple systems Tinygrad runner for LLM inference on Linux Mar 5, 2026
Member

Evanev7 commented Mar 5, 2026

huge! i appreciate our batching changes will have made some work for you to merge this in, but before i dig into it can you move exo/architecture/master back to exo/master?

Author

triko88 commented Mar 5, 2026

Sure, I'll move directory exo/architecture/master back to exo/master.

Author

triko88 commented Mar 5, 2026

I've pushed the change, please review it and let me know whether the directory has been correctly merged or not.

Member

@Evanev7 Evanev7 left a comment


just the merging for now. i might take a stab at merging this in with the recent refactor myself if that's ok? the new runner should be more backend agnostic, but im not sure if that's actually the case

  async def create(cls, args: "Args") -> Self:
      keypair = get_node_id_keypair()
-     node_id = NodeId(keypair.to_node_id())
+     node_id = NodeId(keypair.to_peer_id())
Member


this as well

Author

@triko88 triko88 Mar 5, 2026


Couldn't load exo on Linux without this change for E2E tests

Edit: Reverting the change on Linux led me to this:

exo feature/linux-support  ? ❯ DEV=HIP DEBUG=1 uv run exo
[ 11:36:30.8739PM | INFO    ] Starting EXO
[ 11:36:30.8743PM | INFO    ] EXO_LIBP2P_NAMESPACE: None
Traceback (most recent call last):
  File "/home/apan/git-projects/exo/.venv/bin/exo", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "/home/apan/git-projects/exo/src/exo/main.py", line 272, in main
    node = anyio.run(Node.create, args)
  File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_core/_eventloop.py", line 74, in run
    return async_backend.run(func, args, {}, backend_options)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 2325, in run
    return runner.run(wrapper())
           ~~~~~~~~~~^^^^^^^^^^^
  File "/home/apan/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/apan/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 2313, in wrapper
    return await func(*args)
           ^^^^^^^^^^^^^^^^^
  File "/home/apan/git-projects/exo/src/exo/main.py", line 46, in create
    node_id = NodeId(keypair.to_node_id())
                     ^^^^^^^^^^^^^^^^^^
AttributeError: 'builtins.Keypair' object has no attribute 'to_node_id'

Author

@triko88 triko88 Mar 5, 2026


I have no issues with merging my changes with your recent refactor. My runner is as backend agnostic as it can be, since the changes are tinygrad tensor operations.

Member


yup - have you rebuilt the rust bindings inbetween? uv should catch it but often you need to uv sync --upgrade-package exo_pyo3_bindings

Author


I hadn't rebuilt the rust bindings. I rebuilt them to see if they were causing any issues; running uv sync --upgrade-package exo_pyo3_bindings after the rebuild didn't surface any.

Member


hm - i suppose your base branch is far enough back that these changes haven't landed yet

Author


yep - there were cases where my initial iterations broke while catching up with the base branch, so I decided to slow down my update frequency with the branch.

Member


reasonable!

@triko88 triko88 changed the title Tinygrad runner for LLM inference on Linux feat(Linux): Tinygrad runner for LLM inference on Linux Mar 5, 2026
@compscidr

compscidr commented Mar 11, 2026

Tested this on my RTX 5080 on Ubuntu, seems to be working! Here are some instructions in case anyone wants to try to reproduce: https://gist.github.com/compscidr/3e071ab6c2dce60339ca953eb0a98787

Working with Llama 3.2 1B 4bit
However, Llama 3.1 8B 8bit fails with a shape broadcast error in rms_norm:

  ValueError: cannot broadcast (1, 64, 1024) to new_shape=(1, 64, 4096)
  Likely a hidden_size chunking bug in the tinygrad forward pass

Actually, I was able to get it working with the other model; there are more detailed notes about how to do so as well. There is one other issue that came up with fork/spawn, with notes in there too.
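The broadcast failure reported above can be reproduced with a minimal numpy version of RMSNorm, where the learned weight must match the hidden dimension of the activations. This is illustrative only, not the PR's tinygrad implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # x: (batch, seq, hidden); weight must broadcast against the last dim
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.ones((1, 64, 4096), dtype=np.float32)
rms_norm(x, np.ones(4096, dtype=np.float32))      # hidden sizes agree: ok
try:
    rms_norm(x, np.ones(1024, dtype=np.float32))  # mismatched hidden_size
except ValueError as e:
    print("broadcast error:", e)                  # same class of failure as reported
```

A 1024-wide weight against a 4096-wide hidden state is consistent with the suspected hidden_size chunking bug: a quarter-sized tensor reaching the norm for the 8B model.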
