feat(Linux): Tinygrad runner for LLM inference on Linux #1660
triko88 wants to merge 23 commits into exo-explore:main
Conversation
parsing. Currently supports Llama, Qwen (dense) and Mistral.
…ge model architectures
Added tinygrad for Linux systems
Fused kernels per forward pass, reducing from 3942 to 356 kernels.
Syncs 33 commits from main into the Linux/Tinygrad feature branch. Resolved 7 merge conflicts:
- Dashboard: kept Tinygrad runtime option, adopted main's collapsible Advanced Options UI
- api.py: added ollama imports with renamed module paths
- runner.py: took main's MLX-specific LLM runner (engine_factory preserved separately)
- uv.lock: took main's lockfile baseline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…educe kernel dispatch overhead
…ucing prefill prompt buckets
huge! i appreciate our batching changes will have made some work for you to merge this in, but before i dig into it can you move exo/architecture/master back to exo/master?

Sure, I'll move the directory.

I've pushed the change, please review it and let me know whether the directory has been merged correctly.
Evanev7 left a comment
just the merging for now. i might take a stab at merging this in with the recent refactor myself if that's ok? the new runner should be more backend agnostic, but im not sure if that's actually the case
```diff
  async def create(cls, args: "Args") -> Self:
    keypair = get_node_id_keypair()
-   node_id = NodeId(keypair.to_node_id())
+   node_id = NodeId(keypair.to_peer_id())
```
Couldn't load exo on Linux without this change for E2E tests.

Edit: Reverting the change on Linux led me to this:
exo feature/linux-support ? ❯ DEV=HIP DEBUG=1 uv run exo
[ 11:36:30.8739PM | INFO ] Starting EXO
[ 11:36:30.8743PM | INFO ] EXO_LIBP2P_NAMESPACE: None
Traceback (most recent call last):
File "/home/apan/git-projects/exo/.venv/bin/exo", line 10, in <module>
sys.exit(main())
~~~~^^
File "/home/apan/git-projects/exo/src/exo/main.py", line 272, in main
node = anyio.run(Node.create, args)
File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_core/_eventloop.py", line 74, in run
return async_backend.run(func, args, {}, backend_options)
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 2325, in run
return runner.run(wrapper())
~~~~~~~~~~^^^^^^^^^^^
File "/home/apan/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/home/apan/.local/share/uv/python/cpython-3.13.12-linux-x86_64-gnu/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
return future.result()
~~~~~~~~~~~~~^^
File "/home/apan/git-projects/exo/.venv/lib/python3.13/site-packages/anyio/_backends/_asyncio.py", line 2313, in wrapper
return await func(*args)
^^^^^^^^^^^^^^^^^
File "/home/apan/git-projects/exo/src/exo/main.py", line 46, in create
node_id = NodeId(keypair.to_node_id())
^^^^^^^^^^^^^^^^^^
AttributeError: 'builtins.Keypair' object has no attribute 'to_node_id'
I have no issues with merging my changes with your recent refactor. My runner is as backend agnostic as it can be, since the changes are tinygrad tensor operations.
yup - have you rebuilt the rust bindings in between? uv should catch it but often you need to `uv sync --upgrade-package exo_pyo3_bindings`
I hadn't rebuilt the rust bindings. I rebuilt them to see if they were causing any issues; running `uv sync --upgrade-package exo_pyo3_bindings` after the rebuild didn't cause any issues.
hm - i suppose your base branch is far enough back that these changes haven't landed yet
yep - there were cases where my initial iterations broke while catching up with the base branch. So I decided to slow down my update frequency with the branch.
Tested this on my RTX 5080 on Ubuntu, and it seems to be working! Here are some instructions in case anyone wants to reproduce: https://gist.github.com/compscidr/3e071ab6c2dce60339ca953eb0a98787 It works with Llama 3.2 1B 4-bit. I was actually able to get it working with the other model as well; there are more detailed notes about how to do so. One other issue came up with fork/spawn, with notes in there too.
Motivation
The initial version suffered from dysfunctional Linux code. The rewrite uses MLX CUDA for Linux; however, the performance has been reported to be sub-optimal. This limits users to two categories:
While the older version (ex-exo) used tinygrad, the overall codebase wasn't optimized to handle the architectural differences between Apple Silicon and contemporary PC architecture. This led to a broken experience on Linux, which I and other Linux users faced, as evidenced by issues #904, #910, #913 and #934.
One user concluded (for the archived version) after using it on Nvidia RTX 3060 12GB VRAM:
The motivation here was clear: build a usable, if not yet performant, tinygrad runner for Linux that can run heterogeneously with Apple systems in the future.
Changes
This change introduces a tinygrad-based runner that can load MLX safetensor weights and run inference with them. Because of its fundamental nature, this is a large change, done in 10 phases to build a foundational and correct inference engine that can do the following while relying purely on tensor ops:
Why It Works
The original exo treated tinygrad as a drop-in runtime equivalent to MLX. It isn't. MLX is a runtime: you call an op, it executes. Tinygrad is a compiler: it builds a computation graph, generates GPU source code, compiles it into a kernel, and only then dispatches. First compilation costs 50–1600 ms per kernel shape; cached re-invocations cost ~3 ms.
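To make the compile-once/dispatch-many distinction concrete, here is a pure-Python sketch of shape-keyed kernel caching. This is a conceptual illustration, not tinygrad's actual API: the class and method names are invented, and the "compile" step is a stand-in for the 50–1600 ms of code generation described above.

```python
class KernelCache:
    """Conceptual sketch (not tinygrad's real implementation): compiling
    a kernel is expensive the first time a tensor shape is seen; later
    calls with the same shape only pay the cheap dispatch cost."""

    def __init__(self):
        self._compiled = {}       # shape -> callable kernel
        self.compile_count = 0

    def _compile(self, shape):
        # Stand-in for expensive codegen + compilation (50-1600 ms).
        self.compile_count += 1
        return lambda xs: [x * 2 for x in xs]   # toy "doubling" kernel

    def run(self, xs):
        shape = (len(xs),)
        if shape not in self._compiled:          # cold path: compile + cache
            self._compiled[shape] = self._compile(shape)
        return self._compiled[shape](xs)         # warm path: dispatch only

cache = KernelCache()
cache.run([1, 2, 3])       # new shape -> compiles
cache.run([4, 5, 6])       # same shape -> cache hit, dispatch only
cache.run([7, 8, 9, 10])   # new shape -> compiles again
print(cache.compile_count)  # -> 2
```

This is why the warmup phase below matters: pre-running every kernel shape pays all compilation costs before the first user request arrives.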
The old architecture ran tinygrad on a background thread via `run_in_executor` inside the main process. This meant, among other issues, that `BEAM=1` was problematic, so users defaulted to `BEAM=0`: no optimization.

This change exploits the new architecture's process-isolated Runner model. The tinygrad Runner is a separate child process: the main thread is available, environment variables (`DEV`, `JIT`, `TC`, `BEAM`) are inherited naturally, and the memory space is isolated. The Worker's plan tree sequences the lifecycle correctly: `DownloadModel → LoadModel → StartWarmup → Ready`. `StartWarmup` pre-compiles every kernel before accepting requests.

Result: TTFT 26s → 745ms, throughput 9 → 64.5 tok/s, clean generation termination, and context isolation by process boundary. Later commits extend this pattern for other models and GPU backends.
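The lifecycle sequencing can be sketched as a simple ordered state machine. The stage names come from the plan tree described above (`DownloadModel → LoadModel → StartWarmup → Ready`); the transition logic itself is illustrative and is not exo's actual Worker code.

```python
from enum import Enum

class RunnerState(Enum):
    # Stage names taken from the plan tree in the PR description;
    # everything else here is an illustrative sketch, not exo's code.
    DOWNLOAD_MODEL = "DownloadModel"
    LOAD_MODEL = "LoadModel"
    START_WARMUP = "StartWarmup"
    READY = "Ready"

LIFECYCLE = [RunnerState.DOWNLOAD_MODEL, RunnerState.LOAD_MODEL,
             RunnerState.START_WARMUP, RunnerState.READY]

def run_lifecycle(steps):
    """Execute each stage's callback strictly in order; warmup
    (kernel pre-compilation) must complete before READY is reached."""
    for state in LIFECYCLE:
        steps[state]()
    return RunnerState.READY

trace = []
final = run_lifecycle({s: (lambda s=s: trace.append(s.value)) for s in LIFECYCLE})
print(trace)   # -> ['DownloadModel', 'LoadModel', 'StartWarmup', 'Ready']
```

The key design point is that `READY` is unreachable until the warmup stage has run, which is what guarantees no user request ever pays first-compilation latency.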
Test Plan
Manual Testing
Hardware: HP Omen 16-n0079AX
Specs:
E2E test steps
- `uv run exo`.
- `DEBUG=1 uv run exo`.
- `DEV=<backend> uv run exo` to run the following backends:
  - `DEV=HIP` to run Heterogeneous-computing Interface for Portability. Use it for RDNA 2 GPUs or older.
  - `DEV=CL` to run OpenCL.

Automated Testing
Unit tests are written in `src/exo/worker/tests/unittests/test_tinygrad`. Run the tests using `pytest`. Specifically for the changed files, the most relevant test files are:
- `uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_cache.py`: exercises KVCache (renamed keys/values fields).
- `uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_sampling.py`: exercises `sample_token` (strict=True zip fix).
- `uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_generate.py`: exercises the generator (TinyJit import, cache field refs).
- `uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_layers.py`: exercises the attention layer.
- `uv run pytest src/exo/worker/tests/unittests/test_tinygrad/test_weight_loader.py`: exercises weight loading.
- `uv run pytest src/exo/shared/tests/test_tokenizer_shared.py`: the fixed test_unknown_model_returns_none test.
- `basedpyright`: 8 pre-existing errors (all in connection_message.py, test_master.py, test_node_id_persistence.py; not from this branch).
- `ruff`: 0 errors.
- `pytest`: all pass. 5 MLX collection errors on Linux are pre-existing (macOS-only mlx.core dependency). 1 test_master failure is pre-existing (`Keypair.to_node_id` attribute error).

screenrecording-2026-03-03_22-19-30.mp4