- 🔧 **Model Management** - Download, cache, and organize AI models from Hugging Face
- 🔍 **Advanced Filtering** - Search models with regex patterns and SQL-style queries
- 💻 **System Detection** - Automatic GPU detection and resource reporting
- 🚀 **OpenAI-Compatible API** - RESTful API with streaming support
```shell
cargo install puma
```

Or build from source:

```shell
# Clone the repository
git clone https://github.com/InftyAI/PUMA.git
cd PUMA

# Build the binary
make build

# The binary will be available at ./puma
./puma version
```

```shell
# Download a model
puma pull inftyai/tiny-random-gpt2

# List all models
puma ls

# Inspect model details
puma inspect inftyai/tiny-random-gpt2

# Check system info
puma info

# Remove a model
puma rm inftyai/tiny-random-gpt2
```
```shell
# Start the inference server
puma serve

# Server will start on http://0.0.0.0:8000
# API endpoints:
#   POST /v1/chat/completions
#   POST /v1/completions
#   GET  /v1/models
#   GET  /v1/models/:model
#   GET  /health
```

Test the API:
```shell
# Health check
curl http://localhost:8000/health

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Or use the test script
./hack/scripts/test_api.sh
```

| Command | Status | Description |
|---|---|---|
| `pull <model>` | ✅ | Download a model from a provider |
| `ls` | ✅ | List models (supports regex and label filters) |
| `inspect <model>` | ✅ | Show detailed model information |
| `rm <model>` | ✅ | Remove a model and its cache |
| `info` | ✅ | Display system information |
| `version` | ✅ | Show the PUMA version |
| `serve` | ✅ | Start the OpenAI-compatible API server |
| `ps` | 🚧 | List running models |
| `run` | 🚧 | Start model inference |
| `stop` | 🚧 | Stop a running model |
```shell
# Substring match
puma ls qwen

# Prefix match
puma ls "^inftyai/"

# Alternation
puma ls "llama-(2|3)"
```

```shell
# Single filter
puma ls -l author=inftyai

# Multiple filters (AND condition)
puma ls -l author=inftyai,license=mit

# Combine pattern + filter
puma ls llama -l author=meta
```

Available filters: `author`, `task`, `license`, `provider`, `model_series`
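The matching semantics above (a regex pattern on the model name, ANDed with every `key=value` label filter) can be sketched in Python. This is an illustrative model of the behavior, not PUMA's actual implementation; the function name `matches` is hypothetical:

```python
import re

def matches(name: str, labels: dict, pattern: str = "", filters: str = "") -> bool:
    """Sketch of `puma ls <pattern> -l <filters>` matching semantics.

    A model matches when the regex pattern is found in its name AND
    every key=value pair in the comma-separated filter string agrees
    with the model's labels (AND condition).
    """
    if pattern and not re.search(pattern, name):
        return False
    for pair in filter(None, filters.split(",")):
        key, _, value = pair.partition("=")
        if labels.get(key) != value:
            return False
    return True

labels = {"author": "inftyai", "license": "mit"}
print(matches("inftyai/tiny-random-gpt2", labels, "^inftyai/", "author=inftyai,license=mit"))  # True
print(matches("inftyai/tiny-random-gpt2", labels, "", "author=meta"))  # False
```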
PUMA provides an OpenAI-compatible API server for model inference.

```shell
# Default: 0.0.0.0:8000
puma serve

# Custom host and port
puma serve --host 127.0.0.1 --port 3000
```

Chat completion:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Streaming:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```

List models:

```shell
curl http://localhost:8000/v1/models
```

Health check:

```shell
curl http://localhost:8000/health
# Returns: {"status":"ok","version":"0.0.2"}
```

PUMA is compatible with the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Not required
)

response = client.chat.completions.create(
    model="inftyai/tiny-random-gpt2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
```

```
$ puma inspect inftyai/tiny-random-gpt2
name: inftyai/tiny-random-gpt2
kind: Model
spec:
  author: inftyai
  provider: huggingface
  task: text-generation
  license: MIT
  model_series: gpt2
  context_window: 2.05K
  safetensors:
    total: 7.00B
    parameters:
      f32: 7.00B
  cache:
    revision: abc123de
    size: 1.24 GB
    cache_path: ~/.puma/cache/...
status:
  created: 2 hours ago
  updated: 2 hours ago
```

- Database: `~/.puma/models.db` (SQLite)
- Cache: `~/.puma/cache/` (model files)
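The human-readable counts in the `inspect` output (`2.05K`, `7.00B`) suggest a formatter along these lines. This is a hypothetical sketch of the style, not PUMA's actual `src/utils/` code:

```python
def format_count(n: int) -> str:
    """Format a raw count with K/M/B suffixes and two decimals,
    mimicking the style shown in `puma inspect` output."""
    for threshold, suffix in ((10**9, "B"), (10**6, "M"), (10**3, "K")):
        if n >= threshold:
            return f"{n / threshold:.2f}{suffix}"
    return str(n)

print(format_count(2048))           # 2.05K (e.g. a context window)
print(format_count(7_000_000_000))  # 7.00B (e.g. a parameter count)
```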
Models are stored with lowercase names for case-insensitive matching.
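Because names are lowercased on storage, lookups can normalize the same way. A minimal sketch of that idea; the helper name is illustrative, not PUMA's API:

```python
def normalize_model_name(name: str) -> str:
    # Lowercase so that e.g. "InftyAI/Tiny-Random-GPT2" and
    # "inftyai/tiny-random-gpt2" resolve to the same stored model.
    return name.strip().lower()

print(normalize_model_name("InftyAI/Tiny-Random-GPT2"))  # inftyai/tiny-random-gpt2
```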
```shell
# Build
make build

# Run all tests
make test

# Test API manually
./hack/scripts/test_api.sh
```

```
puma/
├── src/
│   ├── api/         # OpenAI-compatible API
│   ├── backend/     # Inference backends (Mock, MLX)
│   ├── cli/         # Command implementations
│   ├── downloader/  # HuggingFace download logic
│   ├── registry/    # Model registry & metadata
│   ├── storage/     # SQLite storage backend
│   ├── system/      # System info detection
│   └── utils/       # Formatting & helpers
├── tests/           # Integration tests
├── hack/            # Development scripts
├── Cargo.toml       # Rust dependencies
└── Makefile         # Build commands
```
Apache-2.0