
feat(Linux): Tinygrad runner for LLM inference on Linux#1660

Open
triko88 wants to merge 23 commits into exo-explore:main from triko88:feature/linux-support

Conversation


@triko88 triko88 commented Mar 4, 2026

Motivation

The initial version suffered from dysfunctional Linux code. The rewrite uses MLX CUDA for Linux; however, its performance has been reported to be sub-optimal. This limits users to two categories:

  • Apple users who use MLX
  • Linux users with Nvidia graphic cards, who can use MLX CUDA

While the older version (ex-exo) used tinygrad, the overall codebase wasn't optimized to handle the architectural differences between Apple Silicon and contemporary PC architecture. This led to a broken experience on Linux, which I and other Linux users faced, as evidenced by issues #904, #910, #913 and #934.

One user concluded (for the archived version) after using it on an Nvidia RTX 3060 with 12 GB VRAM:

Either I need some other hardware, OS or libraries or this Exo thing does not work at all…
Will give a try later.

Michał Sobczak (https://michalasobczak.pl/ai-ml/2025/03/exo-the-gpu-cluster-tinygrad-mlx/)

The motivation here was clear: build a usable, if not yet performant, tinygrad runner for Linux that can run heterogeneously with Apple systems in the future.

Changes

This change introduces a tinygrad-based runner that can load MLX safetensors weights and run inference with them. Because of its fundamental nature, this is a large change, done in 10 phases, to build a foundational and correct inference engine that can do the following while relying purely on tensor ops:

  • Deserialize Hugging Face MLX safetensors files. The idea is that Apple and non-Apple systems must share the weights and infer accordingly.
  • Implement an architecture spec registry to unify weight names and functions for each transformer architecture.
  • Build quantized linear and embedding layers that are interoperable with their tinygrad counterparts.
  • Run MLX's dequantization strategy on tinygrad during inference.
  • Reduce kernel dispatch and reuse kernels to minimize latency and maximize throughput.
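For context on the dequantization step: MLX's quantized format stores, per linear layer, a packed uint32 weight tensor plus per-group scales and biases, with each group dequantized as w = scale * q + bias. Below is a minimal numpy sketch of 4-bit dequantization; the function name and the low-nibble-first unpacking order are assumptions for illustration, not this PR's actual code.

```python
import numpy as np

def dequantize_mlx_4bit(packed, scales, biases, group_size=64):
    # packed: (out, in // 8) uint32, eight 4-bit values per word (low nibble first)
    # scales, biases: (out, in // group_size) float32, one pair per group
    out_features = packed.shape[0]
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[..., None] >> shifts) & 0xF            # (out, in//8, 8)
    q = nibbles.reshape(out_features, -1).astype(np.float32)  # (out, in)
    groups = q.reshape(out_features, -1, group_size)
    w = scales[..., None] * groups + biases[..., None]        # per-group affine
    return w.reshape(out_features, -1)
```

Since this is an affine map per group, it lowers to a handful of tensor ops (shift, mask, reshape, multiply-add), which is why it can run on tinygrad without any custom kernels.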

Why It Works

The original exo treated tinygrad as a drop-in runtime equivalent to MLX. It isn't. MLX is a runtime: you call an op, it executes. Tinygrad is a compiler: it builds a computation graph, generates GPU source code, compiles it into a kernel, and only then dispatches. First compilation costs 50–1600 ms per kernel shape; cached re-invocations cost ~3 ms.

The old architecture ran tinygrad on a background thread via run_in_executor inside the main process. This meant:

  1. Kernel compilation during inference. No warmup phase, so every unique shape compiled while the user waited, producing a 26 s time-to-first-token.
  2. Auto-tuner disabled. Tinygrad's BEAM search uses Python signals (main thread only). Running in a thread pool crashed with BEAM=1, so users defaulted to BEAM=0, no optimization.
  3. State leaking between requests. Lazy computation graphs accumulated across calls with no clean boundary, causing context contamination and infinite generation loops.

This change exploits the new architecture's process-isolated Runner model. The tinygrad Runner is a separate child process: main thread available, environment variables (DEV, JIT, TC, BEAM) inherited naturally, memory space isolated. The Worker's plan tree sequences the lifecycle correctly: DownloadModel → LoadModel → StartWarmup → Ready. StartWarmup pre-compiles every kernel before accepting requests.

Result: TTFT 26s → 745ms, throughput 9 → 64.5 tok/s, clean generation termination, and context isolation by process boundary. Later commits extend this pattern for other models and GPU backends.
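The compile-once, dispatch-many model and the warmup phase can be illustrated with a shape-keyed cache in plain Python. This is a stand-in sketch, not tinygrad's API; `dispatch`, `warmup`, and the toy kernel are hypothetical names.

```python
# Toy model of compile-then-dispatch: a "kernel" is compiled once per unique
# input shape and cached; warmup pre-populates the cache before serving.
compiled = {}        # shape -> compiled kernel
compile_count = 0

def dispatch(xs):
    global compile_count
    shape = len(xs)
    if shape not in compiled:              # cold path: 50-1600 ms in tinygrad
        compile_count += 1
        compiled[shape] = lambda v: [2.0 * e for e in v]   # toy "kernel"
    return compiled[shape](xs)             # warm path: ~3 ms in tinygrad

def warmup(shapes):
    # StartWarmup: compile every expected shape before accepting requests,
    # so no user-facing request ever pays the cold-compile cost
    for n in shapes:
        dispatch([0.0] * n)

warmup([1, 8])
dispatch([3.0])      # shape 1 already warm: no new compilation
```

Running the warmup inside a dedicated child process is what additionally frees the main thread for BEAM search and gives each request a clean memory boundary.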

Test Plan

Manual Testing

Hardware: HP Omen 16-n0079AX
Specs:

  • CPU: AMD Ryzen 7 6800H
  • GPU: AMD Radeon RX 6650M
  • OS: Omarchy Linux with Cachy OS kernel (Linux v6.19)

E2E test steps

  • Run exo using uv run exo.
  • To test cache coherence, run DEBUG=1 uv run exo.
  • Generally, tinygrad selects the best inference backend for the machine. However, in case of an MMU failure (as with RDNA 2), you can run exo with another inference backend via an environment variable.
  • To test another inference backend, run DEV=<backend> uv run exo with one of the following:
    • DEV=HIP to run Heterogeneous-computing Interface for Portability. Use it for RDNA 2 GPUs or older.
    • DEV=CL to run OpenCL.
  • Select tensor in the dashboard to run tensor cores on your GPU.

Automated Testing

Unit tests are written in src/exo/worker/tests/unittests/test_tinygrad. Run the tests using pytest:

uv run basedpyright && uv run ruff check && uv run pytest

Specifically for the changed files, the most relevant test files are:

  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_cache.py. Exercises KVCache (renamed keys/values fields)
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_sampling.py. Exercises sample_token (strict=True zip fix)
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_generate.py. Exercises the generator (TinyJit import, cache field refs)
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_layers.py. Exercises attention layer
  • uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_weight_loader.py. Exercises weight loading
  • uv run pytest src/exo/shared/tests/test_tokenizer_shared.py. The fixed test_unknown_model_returns_none test

Expected Results

  • basedpyright: 8 pre-existing errors (all in connection_message.py, test_master.py, test_node_id_persistence.py. Not from this branch)
  • ruff: 0 errors
  • pytest: All pass. 5 MLX collection errors on Linux are pre-existing (macOS-only mlx.core dependency). 1 test_master failure is pre-existing (Keypair.to_node_id attribute error).
screenshot-2026-03-03_21-44-47
screenrecording-2026-03-03_22-19-30.mp4

triko88 and others added 21 commits February 19, 2026 15:07
parsing. Currently supports Llama, Qwen (dense) and Mistral.
Fused kernels per forward pass, reducing from 3942 kernels to 356.
Syncs 33 commits from main into the Linux/Tinygrad feature branch.
Resolved 7 merge conflicts:
- Dashboard: kept Tinygrad runtime option, adopted main's collapsible Advanced Options UI
- api.py: added ollama imports with renamed module paths
- runner.py: took main's MLX-specific LLM runner (engine_factory preserved separately)
- uv.lock: took main's lockfile baseline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@triko88 triko88 marked this pull request as ready for review March 4, 2026 22:05
@triko88 triko88 changed the title Tinygrad runner for LLM inference on non-Apple systems Tinygrad runner for LLM inference on Linux Mar 5, 2026
Member

Evanev7 commented Mar 5, 2026

huge! i appreciate our batching changes will have made some work for you to merge this in, but before i dig into it can you move exo/architecture/master back to exo/master?

Author

triko88 commented Mar 5, 2026

Sure, I'll move directory exo/architecture/master back to exo/master.

Author

triko88 commented Mar 5, 2026

I've pushed the change, please review it and let me know whether the directory has been correctly merged or not.

Member

@Evanev7 Evanev7 left a comment


just the merging for now. i might take a stab at merging this in with the recent refactor myself if that's ok? the new runner should be more backend agnostic, but im not sure if that's actually the case

  async def create(cls, args: "Args") -> Self:
      keypair = get_node_id_keypair()
-     node_id = NodeId(keypair.to_node_id())
+     node_id = NodeId(keypair.to_peer_id())
Member


this as well

Author

@triko88 triko88 Mar 5, 2026


Couldn't load exo on Linux without this change for E2E tests

Edit: Reverting the change on Linux led me to this:

exo feature/linux-support  ? ❯ DEV=HIP DEBUG=1 uv run exo
[ 11:36:30.8739PM | INFO    ] Starting EXO
[ 11:36:30.8743PM | INFO    ] EXO_LIBP2P_NAMESPACE: None
Traceback (most recent call last):
  File "/home/apan/git-projects/exo/.venv/bin/exo", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "/home/apan/git-projects/exo/src/exo/main.py", line 272, in main
    node = anyio.run(Node.create, args)
  File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_core/_eventloop.py", line 74, in run
    return async_backend.run(func, args, {}, backend_options)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 2325, in run
    return runner.run(wrapper())
           ~~~~~~~~~~^^^^^^^^^^^
  File "/home/apan/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/apan/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 2313, in wrapper
    return await func(*args)
           ^^^^^^^^^^^^^^^^^
  File "/home/apan/git-projects/exo/src/exo/main.py", line 46, in create
    node_id = NodeId(keypair.to_node_id())
                     ^^^^^^^^^^^^^^^^^^
AttributeError: 'builtins.Keypair' object has no attribute 'to_node_id'

Author

@triko88 triko88 Mar 5, 2026


I have no issues with merging my changes with your recent refactor. My runner is as backend agnostic as it can be, since the changes are tinygrad tensor operations.

Member


yup - have you rebuilt the rust bindings inbetween? uv should catch it but often you need to uv sync --upgrade-package exo_pyo3_bindings

Author


I hadn't rebuilt the rust bindings. I rebuilt them to see if they were causing any issues; running uv sync --upgrade-package exo_pyo3_bindings after the rebuild didn't surface any.

Member


hm - i suppose your base branch is far enough back that these changes haven't landed yet

Author


yep - there were cases where my initial iterations broke while catching up with the base branch, so I decided to slow down my update frequency with the branch.

Member


reasonable!

@triko88 triko88 changed the title Tinygrad runner for LLM inference on Linux feat(Linux): Tinygrad runner for LLM inference on Linux Mar 5, 2026
@compscidr

compscidr commented Mar 11, 2026

Tested this on my RTX 5080 on Ubuntu, seems to be working! Here are some instructions in case anyone wants to try to reproduce: https://gist.github.com/compscidr/3e071ab6c2dce60339ca953eb0a98787

Working with Llama 3.2 1B 4bit
However, Llama 3.1 8B 8bit fails with a shape broadcast error in rms_norm:

  ValueError: cannot broadcast (1, 64, 1024) to new_shape=(1, 64, 4096)
  Likely a hidden_size chunking bug in the tinygrad forward pass

Actually, I was able to get it working with the other model; there are more detailed notes about how to do so as well. There is one other issue that came up with fork/spawn, with notes in there too.
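The broadcast failure reported above can be reproduced with a minimal numpy version of RMSNorm, where the learned weight must match the hidden dimension of the activations. This is illustrative only, not the PR's tinygrad implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # x: (batch, seq, hidden); weight must broadcast against the last dim
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.ones((1, 64, 4096), dtype=np.float32)
rms_norm(x, np.ones(4096, dtype=np.float32))      # hidden sizes agree: ok
try:
    rms_norm(x, np.ones(1024, dtype=np.float32))  # mismatched hidden_size
except ValueError as e:
    print("broadcast error:", e)                  # same class of failure as reported
```

A 1024-wide weight against a 4096-wide hidden state is consistent with the suspected hidden_size chunking bug: a quarter-sized tensor reaching the norm for the 8B model.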
