
Switch to magika+onnx instead of guesslang+tensorflow #251

Merged
robherley merged 13 commits into main from robherley/magika
Jan 10, 2026

Conversation

robherley (Owner) commented Jan 10, 2026

Closes #245

We get pretty dramatic image size savings by ditching TensorFlow (about 400 MB on amd64), but it looks like the guesslang model is faster at detection.

Guesslang itself is a smaller, less accurate model and only supports ~54 languages with a stated 90% accuracy, whereas Magika supports more than 200 content types with a stated 99% accuracy.

IMO the difference in speed isn't dramatic enough to stick with guesslang.


Note: the data below was collected and summarized by Claude.

Size

Image Size Comparison Summary

| Architecture | Remote (TensorFlow) | Local (Magika) | Savings |
|---|---|---|---|
| amd64 | 529.28 MB | 128.54 MB | 400.74 MB (75.7%) |
| arm64 | 104.88 MB | 115.24 MB | -10.36 MB (-9.9%) |

Key Findings:

amd64 (x86_64):

  • The old TensorFlow-based image had a massive 436 MB layer for TensorFlow libs
  • The new Magika image uses only 22.3 MB for ONNX Runtime libs
  • Total savings: ~401 MB (76% reduction)

arm64:

  • The remote arm64 image didn't actually include TensorFlow (0 B extra libs layer), likely because TensorFlow lacked arm64 support at the time
  • The new local arm64 image is slightly larger (+10 MB) because it now includes ONNX Runtime (18.9 MB) where previously there was nothing
  • This is actually a feature improvement - arm64 now has full ML inference support via Magika/ONNX
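The savings figures follow directly from the two image sizes per architecture. As a minimal sketch of that arithmetic (the `savings` helper is a name made up for illustration, not something from the PR):

```go
package main

import "fmt"

// savings returns the absolute size reduction in MB and the reduction as a
// percentage of the old size. Negative values mean the image grew.
// This helper is hypothetical and only illustrates the arithmetic.
func savings(oldMB, newMB float64) (deltaMB, pct float64) {
	deltaMB = oldMB - newMB
	pct = deltaMB / oldMB * 100
	return deltaMB, pct
}

func main() {
	// Sizes copied from the image size comparison in the PR description.
	images := []struct {
		arch         string
		oldMB, newMB float64
	}{
		{"amd64", 529.28, 128.54},
		{"arm64", 104.88, 115.24},
	}
	for _, img := range images {
		d, p := savings(img.oldMB, img.newMB)
		fmt.Printf("%s: %.2f MB (%.1f%%)\n", img.arch, d, p)
	}
}
```

A negative result for arm64 reflects the roughly 10 MB the image gained by shipping ONNX Runtime where no inference library was bundled before.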

Performance

Initialization Time

Time to load the ML model and prepare for inference.

| Library | Time | Memory | Allocations |
|---|---|---|---|
| Magika | 3.29 ms | 177 KB | 2,745 |
| Guesslang | 34.08 ms | N/A* | N/A* |

*Guesslang uses TensorFlow, which manages memory internally, so these figures are not available.

Takeaway: Magika initializes ~10x faster, making it better for CLI tools or short-lived processes.
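For readers who want to collect time, memory, and allocation figures of this shape themselves, Go's `testing.Benchmark` can be driven from a plain program. The sketch below is not the benchmark used for this PR; `model` and `loadModel` are hypothetical placeholders for whatever initialization is being measured.

```go
package main

import (
	"fmt"
	"testing"
)

// model is a hypothetical stand-in for a loaded detection model.
type model struct {
	weights []byte
}

// loadModel is a placeholder for the real initialization being measured.
// Here it only allocates a buffer so the benchmark has something to count.
func loadModel() *model {
	return &model{weights: make([]byte, 64*1024)}
}

// sink keeps the result alive so the compiler cannot discard the work.
var sink *model

// benchInit measures one initialization per iteration and records allocations.
func benchInit(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = loadModel()
	}
}

func main() {
	r := testing.Benchmark(benchInit)
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

Note that Go's allocation counters only see memory allocated by the Go runtime, which is consistent with the N/A entries for a library that allocates inside a native runtime.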


Detection Time

Average time to detect the language of a single file (after initialization).

| Library | Avg Time | Throughput | Memory/op | Allocs/op |
|---|---|---|---|---|
| Magika | 2.02 ms | 0.58 MB/s | ~21 KB | 12 |
| Guesslang | 0.27 ms | 4.30 MB/s | ~6.5 KB | 141 |

Per-Language Breakdown

| Language | Magika (ns/op) | Guesslang (ns/op) | Guesslang Speedup |
|---|---|---|---|
| Go | 1,995,021 | 235,510 | 8.5x |
| Python | 1,993,068 | 252,971 | 7.9x |
| JavaScript | 2,001,053 | 259,589 | 7.7x |
| Rust | 2,080,715 | 265,405 | 7.8x |
| Java | 2,051,423 | 288,144 | 7.1x |
| TypeScript | 2,034,903 | 265,808 | 7.7x |
| Ruby | 2,039,749 | 259,497 | 7.9x |
| C++ | 2,032,912 | 278,796 | 7.3x |
| C | 2,022,978 | 276,233 | 7.3x |
| PHP | 2,049,994 | 287,743 | 7.1x |

Takeaway: Guesslang is ~7.5x faster for per-file detection.
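The two takeaways pull in opposite directions, so it can help to see where they cross. The sketch below assumes a simple linear model (one-time initialization plus a fixed per-file cost) and plugs in the averages reported above; the `cost` type and `breakEven` function are names made up for illustration, and the model ignores variance and file size.

```go
package main

import "fmt"

// cost is a simple linear model of total detection time in one process:
// a one-time initialization cost plus a fixed cost per file, in milliseconds.
type cost struct {
	initMS    float64
	perFileMS float64
}

// total returns the modeled time in milliseconds to start up and scan n files.
func (c cost) total(n int) float64 {
	return c.initMS + c.perFileMS*float64(n)
}

// breakEven returns the file count at which the two modeled totals are equal.
// fastInit is the library with cheaper startup, fastScan the one with cheaper
// per-file detection.
func breakEven(fastInit, fastScan cost) float64 {
	return (fastScan.initMS - fastInit.initMS) / (fastInit.perFileMS - fastScan.perFileMS)
}

// Averages taken from the initialization and per-file measurements in the PR.
var (
	magika    = cost{initMS: 3.29, perFileMS: 2.02}
	guesslang = cost{initMS: 34.08, perFileMS: 0.27}
)

func main() {
	fmt.Printf("break-even at about %.1f files\n", breakEven(magika, guesslang))
	for _, n := range []int{1, 17, 18, 1000} {
		fmt.Printf("%4d files: magika %.2f ms, guesslang %.2f ms\n",
			n, magika.total(n), guesslang.total(n))
	}
}
```

Under this model Magika stays ahead in total time up to 17 files per process, and Guesslang pulls ahead from 18 files onward, so the per-file gap matters mainly for long-lived processes.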

@robherley robherley changed the base branch from main to v1 January 10, 2026 19:20
@robherley robherley marked this pull request as ready for review January 10, 2026 19:24
@robherley robherley changed the base branch from v1 to main January 10, 2026 19:25
Copilot AI (Contributor) left a comment

Pull request overview

This pull request switches the file type detection system from guesslang (based on TensorFlow) to magika (based on ONNX Runtime), enabling better cross-platform support including ARM64 architectures.

Changes:

  • Replaces TensorFlow/guesslang-go dependencies with ONNX Runtime/magika-go
  • Adds vendor-onnxruntime script to download and install ONNX Runtime binaries
  • Updates build system with new env and build scripts for CGO/linker configuration
  • Removes architecture-specific limitations (previously disabled on ARM64)

Reviewed changes

Copilot reviewed 18 out of 20 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| script/vendor-onnxruntime | New script to download and install ONNX Runtime for the platform |
| script/run | New script to run the application with proper environment setup |
| script/env | New script to configure CGO/linker flags for ONNX Runtime |
| script/build | New script to build the binary with cross-compilation support |
| script/install-libtensorflow | Removed TensorFlow installation script |
| internal/renderer/guess.go | Replaced guesslang with magika scanner implementation using lazy initialization |
| internal/renderer/guess_disabled.go | Updated build tags to remove arm64 restriction |
| internal/renderer/detect.go | Updated comments to reflect AI guessing instead of Guesslang |
| internal/config/config_guesser.go | Changed build tag from amd64 to cgo |
| internal/config/config_noguesser.go | Changed build tag from arm64 to !cgo |
| internal/config/config.go | Updated description to reference AI model instead of Guesslang |
| go.mod | Replaced guesslang-go with magika-go dependency |
| go.sum | Updated dependency checksums |
| docs/self-hosting.md | Updated documentation for new ONNX-based approach |
| docs/contributing.md | Updated setup instructions for local development |
| README.md | Updated technology credits |
| Dockerfile | Refactored to use ONNX Runtime instead of TensorFlow |
| .gitignore | Added third_party/ directory |
| .github/workflows/test.yml | Updated to use vendor-onnxruntime instead of install-libtensorflow |
| .github/workflows/lint.yml | Updated to use vendor-onnxruntime instead of install-libtensorflow |
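The summary mentions that the scanner in internal/renderer/guess.go is lazily initialized. The actual code is not shown in this conversation, so the following is only a generic sketch of the pattern using `sync.Once`; the `scanner` type, `newScanner`, and `getScanner` are hypothetical names and do not reflect the magika-go API.

```go
package main

import (
	"fmt"
	"sync"
)

// scanner is a hypothetical stand-in for an expensive-to-build detector.
type scanner struct {
	name string
}

var (
	once      sync.Once
	shared    *scanner
	initErr   error
	initCalls int // counts constructions, for demonstration only
)

// newScanner is a placeholder for the real, costly model load.
func newScanner() (*scanner, error) {
	initCalls++
	return &scanner{name: "placeholder"}, nil
}

// getScanner builds the scanner on first use and returns the same instance
// (or the same error) on every later call. It is safe for concurrent use.
func getScanner() (*scanner, error) {
	once.Do(func() {
		shared, initErr = newScanner()
	})
	return shared, initErr
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := getScanner(); err != nil {
				fmt.Println("init failed:", err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("constructions:", initCalls)
}
```

With this shape, processes that never detect a language never pay the model load, and concurrent first callers still trigger only one construction.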


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@robherley robherley changed the base branch from main to v1 January 10, 2026 21:42
@robherley robherley changed the base branch from v1 to main January 10, 2026 21:42
@robherley robherley merged commit 660a5f3 into main Jan 10, 2026
5 checks passed
@robherley robherley deleted the robherley/magika branch January 10, 2026 21:44

Development

Successfully merging this pull request may close these issues.

Switch to magika instead of guesslang for file detection

2 participants