# barstoolbluz/triton-api-client

Lightweight client environment for NVIDIA Triton Inference Server covering all four client interfaces: the OpenAI-compatible frontend, the generate extension, the KServe v2 tensor inference protocol (HTTP and gRPC), and TRT-LLM raw tensor inference with client-side tokenization. Powered by Flox.

Provides an interactive chat REPL (`triton-chat`), a health/smoke/benchmark tool (`triton-test`), a universal inference CLI (`triton-infer`), and example scripts for every interface -- everything needed to develop against a Triton server without installing dependencies globally.

## What's in the environment

| Component | Description |
|---|---|
| `triton-chat` | Interactive multi-turn chat REPL via the OpenAI-compatible frontend |
| `triton-test` | Health check, smoke test, and benchmark tool |
| `triton-infer` | Universal inference CLI with auto-detection (generate and TRT-LLM) |
| `examples/openai/` | Chat, streaming, and batch completions via the OpenAI SDK |
| `examples/generate/` | Text generation via Triton's generate extension |
| `examples/kserve/` | Tensor inference (HTTP and gRPC) and server metadata |
| `examples/trtllm/` | TRT-LLM tensor inference with HuggingFace tokenization |

## Four Triton interfaces

| Interface | Port | Protocol | Use case | Client library |
|---|---|---|---|---|
| OpenAI-compatible frontend | 9000 | HTTP | LLM chat/completions (dominant pattern) | `openai` Python SDK |
| Generate extension | 8000 | HTTP | Triton-specific LLM text generation | `requests` |
| KServe v2 inference | 8000 (HTTP) / 8001 (gRPC) | HTTP + gRPC | Standard tensor inference for any model type | `tritonclient.http`, `tritonclient.grpc` |
| TRT-LLM raw tensor | 8000 (HTTP) / 8001 (gRPC) | HTTP + gRPC | TRT-LLM models with INT32 tensor I/O | `tritonclient` + `transformers` |

## Quick start

```shell
cd ~/dev/triton-api-client
flox activate

# ── Universal inference (auto-detects model type) ────────────────────

TRITON_MODEL=my-model triton-infer "The capital of France is"
TRITON_MODEL=my-model triton-infer -v "Hello!"
TRITON_MODEL=my-model triton-infer --max-tokens 128 "Explain quantum computing"

# ── OpenAI-compatible frontend (port 9000) ──────────────────────────

# Interactive chat
TRITON_MODEL=my-llm triton-chat

# Health check + smoke test
TRITON_MODEL=my-llm triton-test

# Benchmark
TRITON_MODEL=my-llm triton-test bench -n 50 --concurrent 5

# Single completion
TRITON_MODEL=my-llm python examples/openai/chat.py "What is the capital of France?"

# Streaming completion
TRITON_MODEL=my-llm python examples/openai/streaming.py "Explain quantum computing"

# Batch completions from JSON
echo '["Hello", "What is 2+2?"]' | TRITON_MODEL=my-llm python examples/openai/batch.py -

# ── Generate extension (port 8000) ──────────────────────────────────

TRITON_MODEL=my-llm python examples/generate/single.py "Hello!"
TRITON_MODEL=my-llm python examples/generate/streaming.py "Write a haiku"

# ── KServe v2 tensor inference (port 8000 HTTP / 8001 gRPC) ────────

# HTTP inference
TRITON_MODEL=my-model python examples/kserve/infer_http.py "hello world"

# gRPC inference
TRITON_MODEL=my-model python examples/kserve/infer_grpc.py "hello world"

# Async gRPC (3 concurrent requests)
TRITON_MODEL=my-model python examples/kserve/infer_async.py

# Server health and model metadata
python examples/kserve/metadata.py
TRITON_MODEL=my-model python examples/kserve/metadata.py

# ── TRT-LLM raw tensor inference (port 8000 HTTP / 8001 gRPC) ──────

# HTTP inference with client-side tokenization
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/infer_http.py "The capital of France is"

# gRPC inference
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/infer_grpc.py "The capital of France is"

# Streaming gRPC inference
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/infer_streaming.py "Explain quantum computing"

# TRT-LLM model metadata + config parameters
TRITON_MODEL=qwen2_5_05b_trtllm python examples/trtllm/metadata.py
```

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `TRITON_OPENAI_BASE` | `http://localhost:9000/v1` | OpenAI-compatible frontend URL |
| `TRITON_API_KEY` | `EMPTY` | API key for OpenAI frontend auth |
| `TRITON_API_BASE` | `http://localhost:8000` | KServe v2 + generate extension URL |
| `TRITON_MODEL` | (none) | Model name (used across all interfaces) |
| `TRITON_SYSTEM_PROMPT` | `You are a helpful assistant.` | System prompt for triton-chat |
| `TRITON_TOKENIZER` | (none) | HuggingFace tokenizer name/path for TRT-LLM models |
| `TRITON_GRPC_PORT` | `8001` | gRPC port for KServe tensor inference |

All variables are set with `${VAR:-default}` fallbacks in the Flox on-activate hook, so they can be overridden at activation time or per-command:

```shell
# Override at activation time (persists for the session)
TRITON_OPENAI_BASE=http://gpu-server:9000/v1 TRITON_MODEL=llama flox activate

# Override per-command
TRITON_MODEL=llama python examples/openai/chat.py "Hello"
```

## Inference CLI

`triton-infer` is a universal inference command that auto-detects the model type via metadata introspection and routes the request to the appropriate inference path.

### Detection logic

| Model input tensor | Detected type | Inference path |
|---|---|---|
| `text_input` (BYTES) | generate | `POST /v2/models/{model}/generate` |
| `input_ids` (INT32) | trtllm | KServe v2 tensor inference with client-side tokenization |
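The detection step can be sketched as a small helper that inspects the model's KServe v2 metadata. This is an illustrative sketch of the logic in the table above, not `triton-infer`'s actual implementation:

```python
def detect_model_type(metadata: dict) -> str:
    """Classify a model from its /v2/models/{model} metadata (illustrative)."""
    inputs = {t["name"]: t["datatype"] for t in metadata.get("inputs", [])}
    if inputs.get("text_input") == "BYTES":
        return "generate"   # route to POST /v2/models/{model}/generate
    if inputs.get("input_ids") == "INT32":
        return "trtllm"     # route to KServe v2 tensor inference
    return "unknown"
```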

### Usage

```shell
# Auto-detect and infer
TRITON_MODEL=my-model triton-infer "The capital of France is"

# Verbose mode (prints detection info to stderr)
TRITON_MODEL=my-model triton-infer -v "Hello!"

# Override model and max tokens
triton-infer -m qwen2_5_05b_trtllm -n 128 "Explain quantum computing"

# Specify tokenizer explicitly (TRT-LLM only)
TRITON_MODEL=qwen2_5_05b_trtllm triton-infer -t Qwen/Qwen2.5-0.5B "Hello"
```

### Tokenizer resolution (TRT-LLM models)

For TRT-LLM models, the tokenizer is resolved in this order:

1. `--tokenizer` / `-t` CLI argument
2. `TRITON_TOKENIZER` environment variable
3. `tokenizer_dir` from the model's Triton config
4. Error with instructions if none of the above is found

## Chat CLI

`triton-chat` is an interactive REPL that uses Triton's OpenAI-compatible frontend (port 9000) to stream chat completions, rendering output as markdown with Rich.

### Commands

| Command | Description |
|---|---|
| `/clear` | Clear conversation history and start fresh |
| `/model [name]` | Show the current model, or switch to a different one |
| `/system [prompt]` | Show the current system prompt, or set a new one |
| `/help` | Show available commands |
| `/quit` | Exit the chat (also: `/exit`, Ctrl+C, Ctrl+D) |

### Example session

```text
$ TRITON_MODEL=llama triton-chat
triton-chat connected to http://localhost:9000/v1
Model: llama
Type /help for commands, /quit to exit.

you> What is 2 + 2?

2 + 2 = 4.

you> And if you multiply that by 3?

4 multiplied by 3 is 12.

you> /quit
Bye!
```

## Test CLI

`triton-test` checks server connectivity, runs smoke tests, and benchmarks throughput via the OpenAI-compatible frontend.

### Usage

```shell
# Health check + smoke test (exit 0 on success, 1 on failure)
triton-test

# Benchmark with defaults (10 requests, concurrency 1)
triton-test bench

# Heavier load test
triton-test bench -n 50 --concurrent 5 --max-tokens 256

# Custom prompt
triton-test bench --prompt "Summarize the theory of relativity"
```

### What it tests

| Check | Description |
|---|---|
| Health | Connects to the server and lists available models |
| Smoke (non-streaming) | Single completion; reports latency and token count |
| Smoke (streaming) | Streaming completion; reports TTFT and chunk count |
| Benchmark | Concurrent streaming requests with p50/p90/p99 latency, TTFT, ITL, and tokens/sec |
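The p50/p90/p99 figures are just percentiles over the per-request latency samples. As a standalone sketch, here is one common nearest-rank formulation (the actual computation inside `triton-test` may differ):

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over latency samples (illustrative)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # rank of the smallest sample covering pct percent of the data
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# e.g. per-request latencies in milliseconds from a benchmark run
latencies_ms = [12.1, 15.4, 11.8, 90.2, 14.9, 13.3, 16.0, 12.7, 14.1, 13.8]
p50, p90, p99 = (percentile(latencies_ms, p) for p in (50, 90, 99))
```

Note how a single slow outlier (90.2 ms) dominates p99 but leaves p50 untouched, which is exactly why benchmarks report the full spread rather than a mean.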

## Example scripts

### OpenAI-compatible frontend (`examples/openai/`)

| Script | Description |
|---|---|
| `chat.py` | Single chat completion (non-streaming) |
| `streaming.py` | Streaming chat completion, prints tokens as they arrive |
| `batch.py` | Batch completions from a JSON array, outputs JSONL with usage stats |

### Generate extension (`examples/generate/`)

| Script | Description |
|---|---|
| `single.py` | Single text generation via `POST /v2/models/{model}/generate` |
| `streaming.py` | Streaming text generation via SSE from `generate_stream` |

### KServe v2 inference (`examples/kserve/`)

| Script | Description |
|---|---|
| `infer_http.py` | Synchronous tensor inference via `tritonclient.http` |
| `infer_grpc.py` | Synchronous tensor inference via `tritonclient.grpc` (port 8001) |
| `infer_async.py` | Async parallel inference via `tritonclient.grpc.aio` with `asyncio.gather()` |
| `metadata.py` | Server health, model metadata, and config introspection |

### TRT-LLM raw tensor inference (`examples/trtllm/`)

| Script | Description |
|---|---|
| `infer_http.py` | HTTP tensor inference with client-side tokenization |
| `infer_grpc.py` | gRPC tensor inference with client-side tokenization |
| `infer_streaming.py` | Streaming gRPC inference (decoupled mode) with incremental detokenization |
| `metadata.py` | TRT-LLM model metadata + config parameter viewer |

## Triton API reference

### OpenAI-compatible frontend (port 9000)

The OpenAI-compatible frontend exposes standard OpenAI API endpoints; the example scripts access it through the `openai` Python SDK.

| Endpoint | Method | Description |
|---|---|---|
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (streaming and non-streaming) |
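A minimal non-streaming call against this endpoint might look like the sketch below. The `chat_payload` helper is hypothetical, and the SDK is imported lazily behind a `TRITON_MODEL` check so the helper can be reused without the SDK or a running server:

```python
import os

def chat_payload(prompt: str, model: str, max_tokens: int = 256) -> dict:
    """Build a chat/completions request body (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

if os.environ.get("TRITON_MODEL"):  # only call out when a model is configured
    from openai import OpenAI

    client = OpenAI(
        base_url=os.environ.get("TRITON_OPENAI_BASE", "http://localhost:9000/v1"),
        api_key=os.environ.get("TRITON_API_KEY", "EMPTY"),
    )
    resp = client.chat.completions.create(
        **chat_payload("Hello!", os.environ["TRITON_MODEL"])
    )
    print(resp.choices[0].message.content)
```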

### Generate extension (port 8000)

The generate extension is Triton's native LLM text generation interface; it is not part of the KServe v2 spec.

| Endpoint | Method | Description |
|---|---|---|
| `/v2/models/{model}/generate` | POST | Single text generation (blocking) |
| `/v2/models/{model}/generate_stream` | POST | Streaming text generation (SSE) |

Request format:

```json
{
  "text_input": "What is the capital of France?",
  "parameters": { "max_tokens": 256 }
}
```
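The request above can be sent with plain `requests`. A sketch, with the network call gated behind `TRITON_MODEL` so the payload helper stays importable on its own (the `text_output` response field is an assumption about the server's reply shape):

```python
import os

def generate_request(prompt: str, max_tokens: int = 256) -> dict:
    """Body for POST /v2/models/{model}/generate (matches the format above)."""
    return {"text_input": prompt, "parameters": {"max_tokens": max_tokens}}

if os.environ.get("TRITON_MODEL"):  # only issue the request when a model is configured
    import requests

    base = os.environ.get("TRITON_API_BASE", "http://localhost:8000")
    url = f"{base}/v2/models/{os.environ['TRITON_MODEL']}/generate"
    resp = requests.post(url, json=generate_request("What is the capital of France?"))
    resp.raise_for_status()
    print(resp.json()["text_output"])  # assumed response field
```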

### KServe v2 inference protocol (port 8000 HTTP / 8001 gRPC)

The KServe v2 inference protocol is the standard tensor-in/tensor-out interface for any model type.

| Endpoint | Method | Description |
|---|---|---|
| `/v2/health/live` | GET | Server liveness probe |
| `/v2/health/ready` | GET | Server readiness probe |
| `/v2/models/{model}` | GET | Model metadata (inputs, outputs, datatypes) |
| `/v2/models/{model}/config` | GET | Model configuration |
| `/v2/models/{model}/infer` | POST | Tensor inference |

The infer endpoint uses structured tensor payloads with typed inputs/outputs, accessed via `tritonclient.http` or `tritonclient.grpc`.
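For orientation, the JSON body that `tritonclient` assembles for `/infer` looks roughly like the sketch below (a simplified single-input shape, ignoring the binary-data extension; in practice the client classes build this for you):

```python
def infer_payload(name: str, datatype: str, shape: list[int], data: list) -> dict:
    """Single-input KServe v2 /infer request body (simplified sketch)."""
    return {
        "inputs": [
            {"name": name, "datatype": datatype, "shape": shape, "data": data}
        ]
    }

# e.g. a BYTES text input for a text model
payload = infer_payload("text_input", "BYTES", [1, 1], ["hello world"])
```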

### TRT-LLM tensor interface

TRT-LLM models use raw INT32 tensors instead of text BYTES tensors, so the client must tokenize input and detokenize output itself.

Required input tensors:

| Name | Type | Shape | Description |
|---|---|---|---|
| `input_ids` | INT32 | `[1, seq_len]` | Tokenized input from `tokenizer.encode(prompt)` |
| `input_lengths` | INT32 | `[1, 1]` | Length of the input sequence |
| `request_output_len` | INT32 | `[1, 1]` | Maximum number of output tokens |
| `end_id` | INT32 | `[1, 1]` | End-of-sequence token ID from the tokenizer |
| `pad_id` | INT32 | `[1, 1]` | Pad token ID from the tokenizer (falls back to `end_id`) |

Optional input tensors:

| Name | Type | Shape | Description |
|---|---|---|---|
| `streaming` | BOOL | `[1, 1]` | Enable streaming responses (requires decoupled mode) |

Output tensors:

| Name | Type | Shape | Description |
|---|---|---|---|
| `output_ids` | INT32 | `[1, beam, max_seq_len]` | Generated token IDs |
| `sequence_length` | INT32 | `[1, beam]` | Actual length of the generated sequence |
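Putting the tables together: the required tensors can be assembled with numpy, then handed to `tritonclient`. The shapes and dtypes below follow the tables; the helper name and the end-to-end call (gated behind `TRITON_MODEL`) are an illustrative sketch, not the repo's scripts:

```python
import os
import numpy as np

def build_trtllm_inputs(token_ids: list[int], max_tokens: int,
                        end_id: int, pad_id: int) -> dict:
    """Map tensor name -> np.ndarray per the required-inputs table above."""
    return {
        "input_ids": np.array([token_ids], dtype=np.int32),           # [1, seq_len]
        "input_lengths": np.array([[len(token_ids)]], dtype=np.int32),  # [1, 1]
        "request_output_len": np.array([[max_tokens]], dtype=np.int32), # [1, 1]
        "end_id": np.array([[end_id]], dtype=np.int32),                 # [1, 1]
        "pad_id": np.array([[pad_id]], dtype=np.int32),                 # [1, 1]
    }

if os.environ.get("TRITON_MODEL"):  # only run inference when a model is configured
    import tritonclient.http as httpclient
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(os.environ["TRITON_TOKENIZER"])
    tensors = build_trtllm_inputs(
        tok.encode("The capital of France is"), 64,
        tok.eos_token_id, tok.pad_token_id or tok.eos_token_id,
    )
    client = httpclient.InferenceServerClient(url="localhost:8000")
    inputs = []
    for name, arr in tensors.items():
        inp = httpclient.InferInput(name, list(arr.shape), "INT32")
        inp.set_data_from_numpy(arr)
        inputs.append(inp)
    result = client.infer(os.environ["TRITON_MODEL"], inputs)
    output_ids = result.as_numpy("output_ids")[0, 0]  # first beam
    print(tok.decode(output_ids, skip_special_tokens=True))
```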

## Flox environment details

### Installed packages

| Package | Purpose |
|---|---|
| `python312` | Python 3.12 interpreter |
| `uv` | Fast Python package installer (creates the venv, installs pip packages) |
| `gcc-unwrapped` | Provides `libstdc++.so.6`, needed by numpy/grpcio C extensions |
| `zlib` | Provides `libz.so.1`, needed by numpy/grpcio C extensions |

### Pip packages (installed in the venv on first activation)

| Package | Purpose |
|---|---|
| `tritonclient[all]` | Official Triton client (HTTP + gRPC; includes numpy) |
| `openai` | OpenAI Python SDK for the OpenAI-compatible frontend |
| `rich` | Terminal markdown rendering for triton-chat and triton-test |
| `requests` | HTTP client for the generate extension endpoints |
| `transformers` | HuggingFace tokenizers for TRT-LLM client-side tokenization |

### Activation behavior

On `flox activate`:

1. Sets `TRITON_OPENAI_BASE`, `TRITON_API_KEY`, `TRITON_API_BASE`, `TRITON_MODEL`, `TRITON_SYSTEM_PROMPT`, `TRITON_TOKENIZER`, and `TRITON_GRPC_PORT` with fallback defaults
2. Adds `$FLOX_ENV/lib` to `LD_LIBRARY_PATH` (numpy and grpcio need native libs from Nix packages)
3. Creates a Python venv in `$FLOX_ENV_CACHE/venv` (if it doesn't already exist)
4. Installs pip packages on first activation (skipped if the `$VENV/.installed-v2` marker exists)
5. Adds the project root and the venv's `bin/` to `PATH` so `triton-chat`, `triton-test`, and `triton-infer` are available as commands

To force a clean reinstall of pip packages:

```shell
rm -rf .flox/cache/venv
flox activate
```

## Troubleshooting

### `TRITON_MODEL` is not set

Most scripts require `TRITON_MODEL`. Set it before running:

```shell
export TRITON_MODEL=my-model
# or per-command
TRITON_MODEL=my-model triton-chat
```

### Connection refused on port 9000 (OpenAI frontend)

The OpenAI-compatible frontend runs on port 9000 by default. Verify that it's enabled in your Triton server configuration and reachable:

```shell
curl http://localhost:9000/v1/models
```

If it's on a different host or port:

```shell
export TRITON_OPENAI_BASE=http://gpu-server:9000/v1
```

### Connection refused on port 8000 (generate/KServe)

This usually means no Triton server is listening at `TRITON_API_BASE` (default `http://localhost:8000`). Check readiness:

```shell
curl http://localhost:8000/v2/health/ready
```

### Connection refused on port 8001 (gRPC)

gRPC inference uses port 8001 by default. Verify that Triton's gRPC endpoint is enabled:

```shell
# Set a custom gRPC port if needed
export TRITON_GRPC_PORT=8001
```

### 404 Not Found on the generate endpoint

The model may not support the generate extension. Only text-generation backends (vLLM, TensorRT-LLM) expose `/v2/models/{model}/generate`. Use `examples/kserve/metadata.py` to check the model's interface.

### No tokenizer found (TRT-LLM)

TRT-LLM scripts need a HuggingFace tokenizer for client-side tokenization. Specify one via:

```shell
# Environment variable
export TRITON_TOKENIZER=Qwen/Qwen2.5-0.5B

# CLI argument
python examples/trtllm/infer_http.py --tokenizer Qwen/Qwen2.5-0.5B "Hello"

# Or configure tokenizer_dir in the model's config.pbtxt
```

### ImportError: `libstdc++.so.6` or `libz.so.1`

The Flox environment should handle this automatically via `LD_LIBRARY_PATH`. If the error still occurs, force a clean venv:

```shell
rm -rf .flox/cache/venv
flox activate
```

### tritonclient host format

`tritonclient.http.InferenceServerClient` expects `host:port` without a scheme prefix (not `http://host:port`); the example scripts strip the scheme automatically. `tritonclient.grpc.InferenceServerClient` likewise expects a schemeless `host:port`.
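One way to normalize a URL into that form (an illustrative helper; the example scripts may do it differently):

```python
from urllib.parse import urlparse

def triton_host(url: str) -> str:
    """Strip the scheme so tritonclient gets a plain host:port string."""
    parsed = urlparse(url)
    # URLs with a scheme put host:port in netloc; schemeless input stays as-is
    return parsed.netloc or url
```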

## File structure

```text
triton-api-client/
  .flox/
    env/manifest.toml           # Flox environment config
  triton-chat                   # Interactive chat REPL (OpenAI SDK)
  triton-test                   # Health/smoke/benchmark tool (OpenAI SDK)
  triton-infer                  # Universal inference CLI with auto-detection
  examples/
    openai/
      chat.py                   # Single chat completion
      streaming.py              # Streaming chat completion
      batch.py                  # Batch completions from JSON
    generate/
      single.py                 # Single text generation
      streaming.py              # Streaming text generation (SSE)
    kserve/
      infer_http.py             # HTTP tensor inference
      infer_grpc.py             # gRPC tensor inference
      infer_async.py            # Async gRPC parallel inference
      metadata.py               # Server health & model metadata
    trtllm/
      infer_http.py             # TRT-LLM HTTP inference with tokenization
      infer_grpc.py             # TRT-LLM gRPC inference with tokenization
      infer_streaming.py        # TRT-LLM streaming gRPC inference
      metadata.py               # TRT-LLM metadata + config parameters
  README.md
```

