
# InftyAI/PUMA


A lightweight, high-performance inference engine for local AI


## ✨ Features

- 🔧 **Model Management** - Download, cache, and organize AI models from Hugging Face
- 🔍 **Advanced Filtering** - Search models with regex patterns and SQL-style queries
- 💻 **System Detection** - Automatic GPU detection and resource reporting
- 🚀 **OpenAI-Compatible API** - RESTful API with streaming support

## Installation

### Install with Cargo

```bash
cargo install puma
```

### Build from Source

```bash
# Clone the repository
git clone https://github.com/InftyAI/PUMA.git
cd PUMA

# Build the binary
make build

# The binary will be available at ./puma
./puma version
```

## Quick Start

### CLI Usage

```bash
# Download a model
puma pull inftyai/tiny-random-gpt2

# List all models
puma ls

# Inspect model details
puma inspect inftyai/tiny-random-gpt2

# Check system info
puma info

# Remove a model
puma rm inftyai/tiny-random-gpt2
```

### API Server

```bash
# Start the inference server
puma serve

# The server will start on http://0.0.0.0:8000
# API endpoints:
#   POST /v1/chat/completions
#   POST /v1/completions
#   GET  /v1/models
#   GET  /v1/models/:model
#   GET  /health
```

Test the API:

```bash
# Health check
curl http://localhost:8000/health

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Or use the test script
./hack/scripts/test_api.sh
```

## Commands

| Command | Status | Description |
| --- | --- | --- |
| `pull <model>` | ✅ | Download model from provider |
| `ls` | ✅ | List models (supports regex, label filters) |
| `inspect <model>` | ✅ | Show detailed model information |
| `rm <model>` | ✅ | Remove model and cache |
| `info` | ✅ | Display system information |
| `version` | ✅ | Show PUMA version |
| `serve` | ✅ | Start OpenAI-compatible API server |
| `ps` | 🚧 | List running models |
| `run` | 🚧 | Start model inference |
| `stop` | 🚧 | Stop running model |

## Advanced Usage

### Pattern Matching

```bash
# Substring match
puma ls qwen

# Prefix match
puma ls "^inftyai/"

# Alternation
puma ls "llama-(2|3)"
```
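These patterns are ordinary regular expressions matched against model names. A quick Python sketch of what each pattern selects, using made-up model names and assuming `puma ls` performs an unanchored regex search (as the examples above suggest):

```python
import re

# Illustrative model names -- not an actual registry listing.
models = ["inftyai/tiny-random-gpt2", "meta/llama-3-8b", "qwen/qwen2-7b"]

# Substring match: an unanchored search finds the pattern anywhere in the name
assert [m for m in models if re.search("qwen", m)] == ["qwen/qwen2-7b"]

# Prefix match: ^ anchors the pattern to the start of the name
assert [m for m in models if re.search(r"^inftyai/", m)] == ["inftyai/tiny-random-gpt2"]

# Alternation: matches llama-2 or llama-3
assert [m for m in models if re.search(r"llama-(2|3)", m)] == ["meta/llama-3-8b"]
```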

### Label Filtering

```bash
# Single filter
puma ls -l author=inftyai

# Multiple filters (AND condition)
puma ls -l author=inftyai,license=mit

# Combine pattern + filter
puma ls llama -l author=meta
```

Available filters: `author`, `task`, `license`, `provider`, `model_series`
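The AND semantics of comma-separated filters can be sketched as a simple dictionary check (the `matches` helper and the label values are hypothetical, for illustration only):

```python
def matches(labels: dict, filters: dict) -> bool:
    """AND condition: every filter key must be present with the given value."""
    return all(labels.get(k) == v for k, v in filters.items())

# Hypothetical label set for one model
model = {"author": "inftyai", "license": "mit", "task": "text-generation"}

assert matches(model, {"author": "inftyai", "license": "mit"})  # both filters hold
assert not matches(model, {"author": "meta"})                   # one filter fails
```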

## API Server

PUMA provides an OpenAI-compatible API server for model inference.

### Starting the Server

```bash
# Default: 0.0.0.0:8000
puma serve

# Custom host and port
puma serve --host 127.0.0.1 --port 3000
```

### API Endpoints

#### Chat Completions (Recommended)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

#### Streaming (Server-Sent Events)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
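With `"stream": true`, the response arrives as Server-Sent Events: each event is a `data: {...}` line carrying a delta chunk, and the stream ends with `data: [DONE]`. A minimal parsing sketch in the standard OpenAI streaming format (the sample payload below is illustrative, not captured PUMA output):

```python
import json

# Two content chunks followed by the [DONE] sentinel, as an SSE body would arrive.
sample = (
    'data: {"choices":[{"delta":{"content":"Once"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":" upon"}}]}\n\n'
    'data: [DONE]\n\n'
)

def collect(stream_text: str) -> str:
    """Concatenate the delta content of each SSE chunk until [DONE]."""
    parts = []
    for line in stream_text.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)

print(collect(sample))  # Once upon
```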

#### List Models

```bash
curl http://localhost:8000/v1/models
```

#### Health Check

```bash
curl http://localhost:8000/health
# Returns: {"status":"ok","version":"0.0.2"}
```

### OpenAI Python Client

PUMA is compatible with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Not validated by PUMA; the SDK just needs a value
)

response = client.chat.completions.create(
    model="inftyai/tiny-random-gpt2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
```

## Inspect Output

```text
$ puma inspect inftyai/tiny-random-gpt2

name: inftyai/tiny-random-gpt2
kind: Model
spec:
  author:         inftyai
  provider:       huggingface
  task:           text-generation
  license:        MIT
  model_series:   gpt2
  context_window: 2.05K
  safetensors:
    total:        7.00B
    parameters:
      f32:        7.00B
  cache:
    revision:     abc123de
    size:         1.24 GB
    cache_path:   ~/.puma/cache/...
status:
  created:        2 hours ago
  updated:        2 hours ago
```
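The abbreviated counts in this output (`2.05K`, `7.00B`) look like decimal suffix formatting. A hypothetical `humanize` helper (not PUMA's actual implementation) that would produce the same strings, assuming a 2048-token context window:

```python
def humanize(n: float) -> str:
    """Format a count with decimal suffixes, e.g. 2048 -> '2.05K'."""
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if n >= threshold:
            return f"{n / threshold:.2f}{suffix}"
    return str(n)

print(humanize(2048))           # 2.05K
print(humanize(7_000_000_000))  # 7.00B
```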

## Model Management

- **Database**: `~/.puma/models.db` (SQLite)
- **Cache**: `~/.puma/cache/` (model files)

Models are stored with lowercase names for case-insensitive matching.
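The lowercase-name convention can be sketched with an in-memory SQLite table. The schema below is hypothetical and simplified; the real `models.db` layout may differ:

```python
import sqlite3

# Hypothetical single-column schema, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE models (name TEXT PRIMARY KEY)")

def add_model(name: str) -> None:
    # Store lowercase so later lookups are case-insensitive.
    db.execute("INSERT INTO models (name) VALUES (?)", (name.lower(),))

def find_model(name: str):
    # Lowercase the query too, so any casing of the same name matches.
    row = db.execute(
        "SELECT name FROM models WHERE name = ?", (name.lower(),)
    ).fetchone()
    return row[0] if row else None

add_model("InftyAI/Tiny-Random-GPT2")
print(find_model("inftyai/TINY-random-gpt2"))  # inftyai/tiny-random-gpt2
```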

## Development

```bash
# Build
make build

# Run all tests
make test

# Test API manually
./hack/scripts/test_api.sh
```

## Project Structure

```text
puma/
├── src/
│   ├── api/          # OpenAI-compatible API
│   ├── backend/      # Inference backends (Mock, MLX)
│   ├── cli/          # Command implementations
│   ├── downloader/   # HuggingFace download logic
│   ├── registry/     # Model registry & metadata
│   ├── storage/      # SQLite storage backend
│   ├── system/       # System info detection
│   └── utils/        # Formatting & helpers
├── tests/            # Integration tests
├── hack/             # Development scripts
├── Cargo.toml        # Rust dependencies
└── Makefile          # Build commands
```

## License

Apache-2.0
