Switch to magika+onnx instead of guesslang+tensorflow #251
Merged
Pull request overview
This pull request switches the file type detection system from guesslang (based on TensorFlow) to magika (based on ONNX Runtime), enabling better cross-platform support including ARM64 architectures.
Changes:
- Replaces TensorFlow/guesslang-go dependencies with ONNX Runtime/magika-go
- Adds vendor-onnxruntime script to download and install ONNX Runtime binaries
- Updates build system with new env and build scripts for CGO/linker configuration
- Removes architecture-specific limitations (previously disabled on ARM64)
Reviewed changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| script/vendor-onnxruntime | New script to download and install ONNX Runtime for the platform |
| script/run | New script to run the application with proper environment setup |
| script/env | New script to configure CGO/linker flags for ONNX Runtime |
| script/build | New script to build the binary with cross-compilation support |
| script/install-libtensorflow | Removed TensorFlow installation script |
| internal/renderer/guess.go | Replaced guesslang with magika scanner implementation using lazy initialization |
| internal/renderer/guess_disabled.go | Updated build tags to remove arm64 restriction |
| internal/renderer/detect.go | Updated comments to reflect AI guessing instead of Guesslang |
| internal/config/config_guesser.go | Changed build tag from amd64 to cgo |
| internal/config/config_noguesser.go | Changed build tag from arm64 to !cgo |
| internal/config/config.go | Updated description to reference AI model instead of Guesslang |
| go.mod | Replaced guesslang-go with magika-go dependency |
| go.sum | Updated dependency checksums |
| docs/self-hosting.md | Updated documentation for new ONNX-based approach |
| docs/contributing.md | Updated setup instructions for local development |
| README.md | Updated technology credits |
| Dockerfile | Refactored to use ONNX Runtime instead of TensorFlow |
| .gitignore | Added third_party/ directory |
| .github/workflows/test.yml | Updated to use vendor-onnxruntime instead of install-libtensorflow |
| .github/workflows/lint.yml | Updated to use vendor-onnxruntime instead of install-libtensorflow |
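The lazy initialization mentioned for `internal/renderer/guess.go` can be illustrated with `sync.Once`. This is a minimal sketch, not the code from this PR: the `scanner` type, its `Detect` method, and `getScanner` are hypothetical stand-ins, and no real magika-go API is used.

```go
package main

import (
	"fmt"
	"sync"
)

// scanner is a hypothetical stand-in for a model-backed content detector.
type scanner struct{}

// Detect is a placeholder; a real implementation would run model inference.
func (s *scanner) Detect(content []byte) string {
	if len(content) == 0 {
		return "unknown"
	}
	return "text"
}

var (
	once      sync.Once
	shared    *scanner
	loadCount int // counts how many times the "model" was loaded
)

// getScanner performs the expensive load on first use only.
// Later calls, including concurrent ones, reuse the same instance.
func getScanner() *scanner {
	once.Do(func() {
		loadCount++ // assumption: the real code would load the model here
		shared = &scanner{}
	})
	return shared
}

func main() {
	a := getScanner()
	b := getScanner()
	fmt.Println(a == b, loadCount) // prints: true 1
}
```

With this pattern, a process that never needs detection never pays the model load cost.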
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Closes #245
Ditching TensorFlow gives us pretty dramatic image size savings (roughly 400 MB), but it looks like the guesslang model is faster at per-file detection.
Guesslang is a smaller, less accurate model: it supports only ~54 languages with a stated 90% accuracy, whereas Magika supports more than 200 content types with a stated 99% accuracy.
IMO the difference in speed isn't dramatic enough to justify sticking with guesslang.
Note: the data below was collected and summarized by Claude.
Size
Image Size Comparison Summary
Key Findings:
amd64 (x86_64):
arm64:
Performance
Initialization Time
Time to load the ML model and prepare for inference.
*Guesslang uses TensorFlow which manages memory internally.
Takeaway: Magika initializes ~10x faster, making it better for CLI tools or short-lived processes.
Average time to detect the language of a single file (after initialization).
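A per-file average like this can be measured with a simple timing loop run after the model is already loaded. The sketch below is an assumption about how such a measurement might look, not the harness used for this PR; `detectFunc`, `averageLatency`, and the stub detector are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// detectFunc is a hypothetical signature for an already-initialized detector.
type detectFunc func(content []byte) string

// averageLatency calls detect on every sample, rounds times over,
// and returns the mean wall-clock time per call.
func averageLatency(detect detectFunc, samples [][]byte, rounds int) time.Duration {
	if len(samples) == 0 || rounds <= 0 {
		return 0
	}
	start := time.Now()
	for r := 0; r < rounds; r++ {
		for _, s := range samples {
			_ = detect(s)
		}
	}
	calls := rounds * len(samples)
	return time.Since(start) / time.Duration(calls)
}

func main() {
	// Stub detector; a real run would pass the model-backed function instead.
	stub := func(content []byte) string { return "go" }
	samples := [][]byte{[]byte("package main"), []byte("print('hi')")}
	fmt.Println("avg per call:", averageLatency(stub, samples, 1000))
}
```

Initialization is kept outside the timed loop on purpose, so the one-time load cost doesn't skew the per-file figure.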
Per-Language Breakdown
Takeaway: Guesslang is ~7.5x faster for per-file detection.