Lightweight client environment for NVIDIA Triton Inference Server covering all four client interfaces: the OpenAI-compatible frontend, the generate extension, the KServe v2 tensor inference protocol (HTTP and gRPC), and TRT-LLM raw tensor inference with client-side tokenization. Powered by Flox.
Provides an interactive chat REPL (triton-chat), a health/smoke/benchmark tool (triton-test), a universal inference CLI (triton-infer), and example scripts for every interface -- everything needed to develop against a Triton server without installing dependencies globally.
| Component | Description |
|---|---|
| `triton-chat` | Interactive multi-turn chat REPL via OpenAI-compatible frontend |
| `triton-test` | Health check, smoke test, and benchmark tool |
| `triton-infer` | Universal inference CLI with auto-detection (generate and TRT-LLM) |
| `examples/openai/` | Chat, streaming, and batch completions via OpenAI SDK |
| `examples/generate/` | Text generation via Triton's generate extension |
| `examples/kserve/` | Tensor inference (HTTP and gRPC) and server metadata |
| `examples/trtllm/` | TRT-LLM tensor inference with HuggingFace tokenization |
| Interface | Port | Protocol | Use case | Client library |
|---|---|---|---|---|
| OpenAI-compatible frontend | 9000 | HTTP | LLM chat/completions (dominant pattern) | openai Python SDK |
| Generate extension | 8000 | HTTP | Triton-specific LLM text generation | requests |
| KServe v2 inference | 8000 HTTP / 8001 gRPC | HTTP + gRPC | Standard tensor inference for any model type | tritonclient.http, tritonclient.grpc |
| TRT-LLM raw tensor | 8000 HTTP / 8001 gRPC | HTTP + gRPC | TRT-LLM models with INT32 tensor I/O | tritonclient + transformers |
cd ~/dev/triton-api-client
flox activate
# ── Universal inference (auto-detects model type) ────────────────────
TRITON_MODEL=my-model triton-infer "The capital of France is"
TRITON_MODEL=my-model triton-infer -v "Hello!"
TRITON_MODEL=my-model triton-infer --max-tokens 128 "Explain quantum computing"
# ── OpenAI-compatible frontend (port 9000) ──────────────────────────
# Interactive chat
TRITON_MODEL=my-llm triton-chat
# Health check + smoke test
TRITON_MODEL=my-llm triton-test
# Benchmark
TRITON_MODEL=my-llm triton-test bench -n 50 --concurrent 5
# Single completion
TRITON_MODEL=my-llm python examples/openai/chat.py "What is the capital of France?"
# Streaming completion
TRITON_MODEL=my-llm python examples/openai/streaming.py "Explain quantum computing"
# Batch completions from JSON
echo '["Hello", "What is 2+2?"]' | TRITON_MODEL=my-llm python examples/openai/batch.py -
# ── Generate extension (port 8000) ──────────────────────────────────
TRITON_MODEL=my-llm python examples/generate/single.py "Hello!"
TRITON_MODEL=my-llm python examples/generate/streaming.py "Write a haiku"
# ── KServe v2 tensor inference (port 8000 HTTP / 8001 gRPC) ────────
# HTTP inference
TRITON_MODEL=my-model python examples/kserve/infer_http.py "hello world"
# gRPC inference
TRITON_MODEL=my-model python examples/kserve/infer_grpc.py "hello world"
# Async gRPC (3 concurrent requests)
TRITON_MODEL=my-model python examples/kserve/infer_async.py
# Server health and model metadata
python examples/kserve/metadata.py
TRITON_MODEL=my-model python examples/kserve/metadata.py
# ── TRT-LLM raw tensor inference (port 8000 HTTP / 8001 gRPC) ──────
# HTTP inference with client-side tokenization
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/infer_http.py "The capital of France is"
# gRPC inference
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/infer_grpc.py "The capital of France is"
# Streaming gRPC inference
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/infer_streaming.py "Explain quantum computing"
# TRT-LLM model metadata + config parameters
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/metadata.py

| Variable | Default | Description |
|---|---|---|
| `TRITON_OPENAI_BASE` | `http://localhost:9000/v1` | OpenAI-compatible frontend URL |
| `TRITON_API_KEY` | `EMPTY` | API key for OpenAI frontend auth |
| `TRITON_API_BASE` | `http://localhost:8000` | KServe v2 + generate extension URL |
| `TRITON_MODEL` | (none) | Model name (used across all interfaces) |
| `TRITON_SYSTEM_PROMPT` | `You are a helpful assistant.` | System prompt for triton-chat |
| `TRITON_TOKENIZER` | (none) | HuggingFace tokenizer name/path for TRT-LLM models |
| `TRITON_GRPC_PORT` | `8001` | gRPC port for KServe tensor inference |
All variables are set with ${VAR:-default} fallbacks in the Flox on-activate hook, so they can be overridden at activation time or per-command:
# Override at activation time (persists for session)
TRITON_OPENAI_BASE=http://gpu-server:9000/v1 TRITON_MODEL=llama flox activate
# Override per-command
TRITON_MODEL=llama python examples/openai/chat.py "Hello"

triton-infer is a universal inference command that auto-detects the model type via metadata introspection and routes to the appropriate inference path.
| Model input tensors | Detected type | Inference path |
|---|---|---|
| `text_input` (BYTES) | generate | `POST /v2/models/{model}/generate` |
| `input_ids` (INT32) | trtllm | KServe v2 tensor inference with client-side tokenization |
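The detection logic can be sketched as a pure function over the model's KServe v2 metadata response (the function name and exact JSON handling here are illustrative, not the actual triton-infer source):

```python
def detect_model_type(metadata: dict) -> str:
    """Classify a model from its KServe v2 metadata (GET /v2/models/{model}).

    The metadata's "inputs" list carries each tensor's name and datatype;
    a BYTES text_input indicates the generate extension, an INT32
    input_ids indicates a TRT-LLM tensor model.
    """
    inputs = {t["name"]: t["datatype"] for t in metadata.get("inputs", [])}
    if inputs.get("text_input") == "BYTES":
        return "generate"
    if inputs.get("input_ids") == "INT32":
        return "trtllm"
    raise ValueError(f"unrecognized input tensors: {sorted(inputs)}")
```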
# Auto-detect and infer
TRITON_MODEL=my-model triton-infer "The capital of France is"
# Verbose mode (prints detection info to stderr)
TRITON_MODEL=my-model triton-infer -v "Hello!"
# Override model and max tokens
triton-infer -m qwen2_5_05b_trtllm -n 128 "Explain quantum computing"
# Specify tokenizer explicitly (TRT-LLM only)
TRITON_MODEL=qwen2_5_05b_trtllm triton-infer -t Qwen/Qwen2.5-0.5B "Hello"

For TRT-LLM models, the tokenizer is resolved in this order:

1. `--tokenizer`/`-t` CLI argument
2. `TRITON_TOKENIZER` environment variable
3. `tokenizer_dir` from the model's Triton config
4. Error with instructions if none found
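That precedence can be sketched as follows; the config key layout `parameters.tokenizer_dir.string_value` is an assumption about the Triton model-config format, and the function itself is illustrative rather than the tool's actual code:

```python
import os

def resolve_tokenizer(cli_arg=None, model_config=None):
    """Resolve a HuggingFace tokenizer name/path for a TRT-LLM model,
    following the precedence list above."""
    if cli_arg:                              # 1. --tokenizer/-t CLI argument
        return cli_arg
    if os.environ.get("TRITON_TOKENIZER"):   # 2. environment variable
        return os.environ["TRITON_TOKENIZER"]
    if model_config:                         # 3. tokenizer_dir from model config
        param = model_config.get("parameters", {}).get("tokenizer_dir", {})
        if param.get("string_value"):
            return param["string_value"]
    raise SystemExit(                        # 4. fail with instructions
        "No tokenizer found: pass --tokenizer, set TRITON_TOKENIZER, "
        "or set tokenizer_dir in the model's config.pbtxt"
    )
```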
triton-chat is an interactive REPL that uses Triton's OpenAI-compatible frontend (port 9000) to stream chat completions and renders output as markdown using Rich.
| Command | Description |
|---|---|
| `/clear` | Clear conversation history and start fresh |
| `/model [name]` | Show current model, or switch to a different model |
| `/system [prompt]` | Show current system prompt, or set a new one |
| `/help` | Show available commands |
| `/quit` | Exit the chat (also: `/exit`, Ctrl+C, Ctrl+D) |
$ TRITON_MODEL=llama triton-chat
triton-chat connected to http://localhost:9000/v1
Model: llama
Type /help for commands, /quit to exit.
you> What is 2 + 2?
2 + 2 = 4.
you> And if you multiply that by 3?
4 multiplied by 3 is 12.
you> /quit
Bye!
triton-test checks server connectivity, runs smoke tests, and benchmarks throughput via the OpenAI-compatible frontend.
# Health check + smoke test (exit 0 on success, 1 on failure)
triton-test
# Benchmark with defaults (10 requests, concurrency 1)
triton-test bench
# Heavier load test
triton-test bench -n 50 --concurrent 5 --max-tokens 256
# Custom prompt
triton-test bench --prompt "Summarize the theory of relativity"

| Check | Description |
|---|---|
| Health | Connects to server, lists available models |
| Smoke (non-streaming) | Single completion, reports latency and token count |
| Smoke (streaming) | Streaming completion, reports TTFT and chunk count |
| Benchmark | Concurrent streaming requests with p50/p90/p99 latency, TTFT, ITL, tokens/sec |
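As an illustration of the latency summary, here is a nearest-rank percentile sketch over per-request latency samples (triton-test's exact statistics method isn't documented here, so treat this as an approximation):

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q% of the samples are <= it (q in (0, 100])."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-request end-to-end latencies in seconds (made-up sample data)
latencies = [0.21, 0.19, 0.35, 0.28, 0.22, 0.41, 0.25, 0.30, 0.27, 0.24]
summary = {f"p{q}": percentile(latencies, q) for q in (50, 90, 99)}
```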
| Script | Description |
|---|---|
| `chat.py` | Single chat completion (non-streaming) |
| `streaming.py` | Streaming chat completion, prints tokens as they arrive |
| `batch.py` | Batch completions from a JSON array, outputs JSONL with usage stats |
| Script | Description |
|---|---|
| `single.py` | Single text generation via `POST /v2/models/{model}/generate` |
| `streaming.py` | Streaming text generation via SSE from `generate_stream` |
| Script | Description |
|---|---|
| `infer_http.py` | Synchronous tensor inference via `tritonclient.http` |
| `infer_grpc.py` | Synchronous tensor inference via `tritonclient.grpc` (port 8001) |
| `infer_async.py` | Async parallel inference via `tritonclient.grpc.aio` with `asyncio.gather()` |
| `metadata.py` | Server health, model metadata, and config introspection |
| Script | Description |
|---|---|
| `infer_http.py` | HTTP tensor inference with client-side tokenization |
| `infer_grpc.py` | gRPC tensor inference with client-side tokenization |
| `infer_streaming.py` | Streaming gRPC inference (decoupled mode) with incremental detokenization |
| `metadata.py` | TRT-LLM model metadata + config parameter viewer |
The OpenAI-compatible frontend provides standard OpenAI API endpoints. Uses the openai Python SDK.
| Endpoint | Method | Description |
|---|---|---|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (streaming and non-streaming) |
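A minimal non-streaming call through this frontend might look like the following; the message-building helper is my own, and only loosely mirrors what examples/openai/chat.py does:

```python
import os

def chat_messages(user_prompt, system_prompt=None):
    """Build the messages list for /v1/chat/completions, defaulting the
    system prompt from TRITON_SYSTEM_PROMPT like the repo's tools."""
    system = system_prompt or os.environ.get(
        "TRITON_SYSTEM_PROMPT", "You are a helpful assistant.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user_prompt}]

if __name__ == "__main__":
    from openai import OpenAI  # the SDK pointed at the Triton frontend

    client = OpenAI(
        base_url=os.environ.get("TRITON_OPENAI_BASE", "http://localhost:9000/v1"),
        api_key=os.environ.get("TRITON_API_KEY", "EMPTY"),
    )
    resp = client.chat.completions.create(
        model=os.environ["TRITON_MODEL"],
        messages=chat_messages("What is the capital of France?"),
    )
    print(resp.choices[0].message.content)
```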
The generate extension is Triton's native LLM text generation interface. Not part of the KServe v2 spec.
| Endpoint | Method | Description |
|---|---|---|
| `/v2/models/{model}/generate` | POST | Single text generation (blocking) |
| `/v2/models/{model}/generate_stream` | POST | Streaming text generation (SSE) |
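A sketch of a blocking generate call using `requests` follows; the `text_output` response field is an assumption about the backend's output naming, so check your model's actual response:

```python
import os

def build_generate_request(prompt: str, max_tokens: int = 256):
    """Assemble URL and JSON body for POST /v2/models/{model}/generate,
    using the same env vars as the example scripts."""
    base = os.environ.get("TRITON_API_BASE", "http://localhost:8000")
    model = os.environ["TRITON_MODEL"]
    body = {"text_input": prompt, "parameters": {"max_tokens": max_tokens}}
    return f"{base}/v2/models/{model}/generate", body

if __name__ == "__main__":
    import requests  # the client library this interface uses

    url, body = build_generate_request("What is the capital of France?")
    resp = requests.post(url, json=body, timeout=60)
    resp.raise_for_status()
    print(resp.json().get("text_output"))  # output field name is backend-defined
```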
Request format:
{
"text_input": "What is the capital of France?",
"parameters": { "max_tokens": 256 }
}

The KServe v2 inference protocol is the standard tensor-in/tensor-out interface for any model type.
| Endpoint | Method | Description |
|---|---|---|
| `/v2/health/live` | GET | Server liveness probe |
| `/v2/health/ready` | GET | Server readiness probe |
| `/v2/models/{model}` | GET | Model metadata (inputs, outputs, datatypes) |
| `/v2/models/{model}/config` | GET | Model configuration |
| `/v2/models/{model}/infer` | POST | Tensor inference |
The infer endpoint uses structured tensor payloads with typed inputs/outputs, accessed via tritonclient.http or tritonclient.grpc.
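For a model that takes a single BYTES text tensor, a sketch looks like this; the tensor names INPUT0/OUTPUT0 are placeholders that should be read from your model's metadata:

```python
import numpy as np

def to_bytes_tensor(texts):
    """Shape a batch of strings as the [N, 1] object array that
    tritonclient BYTES inputs accept via set_data_from_numpy()."""
    return np.array([[t] for t in texts], dtype=object)

if __name__ == "__main__":
    import os
    import tritonclient.http as httpclient

    # host:port only -- the client rejects a scheme prefix
    client = httpclient.InferenceServerClient(url="localhost:8000")
    data = to_bytes_tensor(["hello world"])
    inp = httpclient.InferInput("INPUT0", list(data.shape), "BYTES")
    inp.set_data_from_numpy(data)
    result = client.infer(
        os.environ["TRITON_MODEL"], inputs=[inp],
        outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
    )
    print(result.as_numpy("OUTPUT0"))
```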
TRT-LLM models use raw INT32 tensors instead of text BYTES tensors. The client must tokenize input and detokenize output.
Required input tensors:
| Name | Type | Shape | Description |
|---|---|---|---|
| `input_ids` | INT32 | [1, seq_len] | Tokenized input from `tokenizer.encode(prompt)` |
| `input_lengths` | INT32 | [1, 1] | Length of the input sequence |
| `request_output_len` | INT32 | [1, 1] | Maximum number of output tokens |
| `end_id` | INT32 | [1, 1] | End-of-sequence token ID from tokenizer |
| `pad_id` | INT32 | [1, 1] | Pad token ID from tokenizer (falls back to `end_id`) |
Optional input tensors:
| Name | Type | Shape | Description |
|---|---|---|---|
| `streaming` | BOOL | [1, 1] | Enable streaming responses (requires decoupled mode) |
Output tensors:
| Name | Type | Shape | Description |
|---|---|---|---|
| `output_ids` | INT32 | [1, beam, max_seq_len] | Generated token IDs |
| `sequence_length` | INT32 | [1, beam] | Actual length of generated sequence |
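The required tensors above can be assembled from already-tokenized input as in this sketch; each returned array would then be wrapped in an InferInput of matching name and datatype before calling infer:

```python
import numpy as np

def trtllm_inputs(token_ids, max_new_tokens, end_id, pad_id=None):
    """Build the required TRT-LLM input tensors, keyed by tensor name,
    from token IDs produced by tokenizer.encode(prompt). pad_id falls
    back to end_id, matching the table above."""
    ids = np.asarray([token_ids], dtype=np.int32)       # [1, seq_len]
    scalar = lambda v: np.array([[v]], dtype=np.int32)  # [1, 1] INT32
    return {
        "input_ids": ids,
        "input_lengths": scalar(ids.shape[1]),
        "request_output_len": scalar(max_new_tokens),
        "end_id": scalar(end_id),
        "pad_id": scalar(pad_id if pad_id is not None else end_id),
    }
```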
| Package | Purpose |
|---|---|
| `python312` | Python 3.12 interpreter |
| `uv` | Fast Python package installer (creates venv, installs pip packages) |
| `gcc-unwrapped` | Provides `libstdc++.so.6` needed by numpy/grpcio C extensions |
| `zlib` | Provides `libz.so.1` needed by numpy/grpcio C extensions |
| Package | Purpose |
|---|---|
| `tritonclient[all]` | Official Triton client (HTTP + gRPC, includes numpy) |
| `openai` | OpenAI Python SDK for the OpenAI-compatible frontend |
| `rich` | Terminal markdown rendering for triton-chat and triton-test |
| `requests` | HTTP client for generate extension endpoints |
| `transformers` | HuggingFace tokenizers for TRT-LLM client-side tokenization |
On flox activate:
- Sets `TRITON_OPENAI_BASE`, `TRITON_API_KEY`, `TRITON_API_BASE`, `TRITON_MODEL`, `TRITON_SYSTEM_PROMPT`, `TRITON_TOKENIZER`, and `TRITON_GRPC_PORT` with fallback defaults
- Adds `$FLOX_ENV/lib` to `LD_LIBRARY_PATH` (numpy and grpcio need native libs from Nix packages)
- Creates a Python venv in `$FLOX_ENV_CACHE/venv` (if it doesn't exist)
- Installs pip packages on first activation (skips if `$VENV/.installed-v2` marker exists)
- Adds the project root and venv `bin/` to `PATH` so `triton-chat`, `triton-test`, and `triton-infer` are available as commands
To force a clean reinstall of pip packages:
rm -rf .flox/cache/venv
flox activate

Most scripts require TRITON_MODEL. Set it before running:
export TRITON_MODEL=my-model
# or per-command
TRITON_MODEL=my-model triton-chat

The OpenAI-compatible frontend runs on port 9000 by default. Verify it's enabled in your Triton server configuration and reachable:
curl http://localhost:9000/v1/models

If on a different host/port:
export TRITON_OPENAI_BASE=http://gpu-server:9000/v1

If requests fail with connection errors, verify the Triton server is running at TRITON_API_BASE (default http://localhost:8000):
curl http://localhost:8000/v2/health/readygRPC inference uses port 8001 by default. Verify Triton's gRPC endpoint is enabled:
# Set a custom gRPC port if needed
export TRITON_GRPC_PORT=8001

The model may not support the generate extension. Only text-generation backends (vLLM, TensorRT-LLM) expose /v2/models/{model}/generate. Use examples/kserve/metadata.py to check the model's interface.
TRT-LLM scripts need a HuggingFace tokenizer for client-side tokenization. Specify one via:
# Environment variable
export TRITON_TOKENIZER=Qwen/Qwen2.5-0.5B
# CLI argument
python examples/trtllm/infer_http.py --tokenizer Qwen/Qwen2.5-0.5B "Hello"
# Or configure tokenizer_dir in the model's config.pbtxt

The Flox environment should provide the missing native libraries automatically via LD_LIBRARY_PATH. If the error still occurs, force a clean venv:
rm -rf .flox/cache/venv
flox activate

tritonclient.http.InferenceServerClient expects host:port without a scheme prefix (not http://host:port). The example scripts strip the scheme automatically.
For gRPC, tritonclient.grpc.InferenceServerClient also expects host:port without a scheme.
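The stripping itself can be done with the standard library, as in this illustrative helper (not the scripts' actual code):

```python
from urllib.parse import urlparse

def strip_scheme(url: str) -> str:
    """Reduce http://host:port to the bare host:port that
    InferenceServerClient expects; scheme-less values pass through."""
    netloc = urlparse(url).netloc
    return netloc or url
```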
triton-api-client/
.flox/
env/manifest.toml # Flox environment config
triton-chat # Interactive chat REPL (OpenAI SDK)
triton-test # Health/smoke/benchmark tool (OpenAI SDK)
triton-infer # Universal inference CLI with auto-detection
examples/
openai/
chat.py # Single chat completion
streaming.py # Streaming chat completion
batch.py # Batch completions from JSON
generate/
single.py # Single text generation
streaming.py # Streaming text generation (SSE)
kserve/
infer_http.py # HTTP tensor inference
infer_grpc.py # gRPC tensor inference
infer_async.py # Async gRPC parallel inference
metadata.py # Server health & model metadata
trtllm/
infer_http.py # TRT-LLM HTTP inference with tokenization
infer_grpc.py # TRT-LLM gRPC inference with tokenization
infer_streaming.py # TRT-LLM streaming gRPC inference
metadata.py # TRT-LLM metadata + config parameters
README.md
- Triton Inference Server -- The server this client targets
- Triton OpenAI-Compatible Frontend -- OpenAI API compatibility layer
- Triton Generate Extension -- Text generation endpoint spec
- KServe v2 Inference Protocol -- Standard tensor inference protocol
- TensorRT-LLM Backend -- TRT-LLM backend for Triton
- HuggingFace Transformers -- Tokenizer library used for TRT-LLM client-side tokenization
- tritonclient Python API -- Official client library documentation
- OpenAI Python SDK -- OpenAI client used for the frontend
- Flox -- Environment manager powering this project