The fastest way to run YOLO models in C++ on NVIDIA GPUs.
Header-only. TensorRT-native. Zero-copy GPU pipeline. Sub-2ms inference.
Quick Start · Benchmarks · Installation · API · Docker · Docs
Most YOLO C++ wrappers treat preprocessing as an afterthought — resizing on the CPU, copying synchronously, rebuilding launch parameters every frame. YOLOs-TRT was built from scratch around a single principle: the GPU should never wait for the CPU.
| YOLOs-TRT | Typical C++ YOLO Wrappers | |
|---|---|---|
| Preprocessing | GPU (single CUDA kernel) | CPU (OpenCV) |
| Host-to-device transfer | Async (pinned memory) | Synchronous |
| Inference dispatch | CUDA Graph replay | Per-frame enqueue |
| YOLO version config | Auto-detected from tensor shape | Manual flag |
| API surface | 1 header include | Multiple source files to compile |
The result: sub-2ms end-to-end latency and 530+ FPS on a laptop GPU.
Measured on NVIDIA RTX 2000 Ada (Laptop) — YOLOv11n · 640×640 · 1000 iterations · 10-iter warm-up
| Precision | FPS | Avg Latency | P50 | P99 | GPU Memory |
|---|---|---|---|---|---|
| FP32 | 466 | 2.14 ms | 2.04 ms | 3.03 ms | 530 MB |
| FP16 | 479 | 2.09 ms | 1.98 ms | 2.91 ms | 536 MB |
| INT8 | 530 | 1.89 ms | 1.78 ms | 2.70 ms | 444 MB |
Note: These numbers include the full pipeline — preprocessing, inference, and postprocessing. Scaling is roughly linear on higher-end GPUs (RTX 4090, A100, H100).
What makes it fast?
| Optimization | Impact |
|---|---|
| GPU letterbox + normalize | A single CUDA kernel performs bilinear letterbox resize, BGR→RGB conversion, and /255.0 normalization — writing directly into the TRT input buffer. Eliminates all CPU preprocessing from the hot path. |
| Pinned staging buffers | Raw BGR pixels are memcpy'd into a CUDA pinned buffer, enabling truly asynchronous cudaMemcpyAsync H2D transfer that overlaps with compute. |
| CUDA Graph capture | For fixed-shape engines, the entire enqueueV3 call graph is captured once and replayed via cudaGraphLaunch, cutting ~0.1–0.3 ms of per-frame dispatch overhead. |
| 10-iteration warm-up | Lets TensorRT's internal autotuner converge on optimal kernel selections before timing begins. |
| Single-stream pipeline | Minimal synchronization points — one cudaStream_t drives the entire preprocess → infer → postprocess pipeline. |
3 commands to your first inference:
# 1. Clone & build
git clone https://github.com/Geekgineer/YOLOs-CPP-TensorRT.git
cd YOLOs-CPP-TensorRT && ./build.sh
# 2. Convert a model (requires Python + ultralytics)
pip install -r requirements.txt
python models/export_onnx.py --model yolo11n
trtexec --onnx=models/yolo11n.onnx --saveEngine=models/yolo11n.trt --fp16
# 3. Run inference
./build/image_inference models/yolo11n.trt data/dog.jpg models/coco.namesIn your own code — 5 lines:
#include "yolos/tasks/detection.hpp"
int main() {
yolos::det::YOLODetector detector("yolo11n.trt", "coco.names");
cv::Mat image = cv::imread("dog.jpg");
auto results = detector.detect(image);
detector.drawDetections(image, results);
cv::imshow("YOLOs-TRT", image);
cv::waitKey(0);
}YOLOs-TRT auto-detects the YOLO version from output tensor shapes — no manual configuration required.
| Task | API | Supported Versions |
|---|---|---|
| Detection | YOLODetector::detect() |
YOLOv5 · v7 · v8 · v9 · v10 · v11 · v12 · v26 · NAS |
| Instance Segmentation | YOLOSegDetector::segment() |
YOLOv8-seg · v11-seg · v26-seg |
| Pose Estimation | YOLOPoseDetector::detect() |
YOLOv8-pose · v11-pose · v26-pose |
| Oriented BBox (OBB) | YOLOOBBDetector::detect() |
YOLOv8-obb · v11-obb · v26-obb |
| Classification | YOLOClassifier::classify() |
YOLOv8-cls · v11-cls · v12-cls · v26-cls |
| Dependency | Version | Notes |
|---|---|---|
| NVIDIA GPU | CC ≥ 7.5 | Turing, Ampere, Ada, Hopper, or Jetson Xavier/Orin |
| CUDA Toolkit | ≥ 12.0 | |
| TensorRT | ≥ 10.0 | Tensor-based API (enqueueV3) |
| OpenCV | ≥ 4.5 | Image I/O and visualization |
| CMake | ≥ 3.18 | CUDA language support |
| C++ compiler | C++17 | GCC 9+ / Clang 10+ |
git clone https://github.com/Geekgineer/YOLOs-CPP-TensorRT.git
cd YOLOs-CPP-TensorRT
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)Custom TensorRT path / Jetson / Advanced options
# Custom TensorRT location
cmake .. -DCMAKE_BUILD_TYPE=Release -DTENSORRT_DIR=/opt/TensorRT-10.4
# Specific CUDA architectures (e.g. Jetson Orin)
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES="87"
# Build example applications
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=ONJetson (JetPack 6.x): TensorRT and CUDA are pre-installed. Just clone, build, and run.
Ubuntu (apt)
sudo apt update && sudo apt install -y tensorrtJetson (JetPack)
Pre-installed. Verify with:
dpkg -l | grep tensorrtManual / Tarball
Download from developer.nvidia.com/tensorrt, extract, and pass -DTENSORRT_DIR=/path/to/tensorrt to CMake.
C++ inference requires serialized TensorRT engines. Python is only needed for this one-time conversion step.
# Install Python deps (uv recommended for speed)
pip install uv && uv pip install -r requirements.txt
# Export ONNX from Ultralytics
python models/export_onnx.py --model yolo11n
# Convert to TensorRT engine
trtexec --onnx=models/yolo11n.onnx --saveEngine=models/yolo11n.trt --fp16INT8 quantization with calibration
# Generate calibration images (or use your own dataset)
python trt-files/scripts/generate_calibration_data.py
# Convert with INT8
python trt-files/scripts/convert_to_tensorrt.py \
--onnx models/yolo11n.onnx --int8 \
--calib-data trt-files/scripts/calibration_data/See the Quantization Guide for details on INT8 calibration strategies.
Important: TensorRT engines are GPU-architecture and TRT-version specific. Always rebuild on the target hardware.
YOLOs-TRT is header-only. Include the task header you need, link against TensorRT and CUDA, done.
#include "yolos/tasks/detection.hpp" // Detection
#include "yolos/tasks/segmentation.hpp" // Instance Segmentation
#include "yolos/tasks/pose.hpp" // Pose Estimation
#include "yolos/tasks/obb.hpp" // Oriented Bounding Boxes
#include "yolos/tasks/classification.hpp" // Classification
#include "yolos/yolos.hpp" // EverythingAll task classes share the same constructor pattern:
ClassName(const std::string& enginePath,
const std::string& labelsPath,
YOLOVersion version = YOLOVersion::Auto, // auto-detect from tensor shape
int dlaCore = -1); // -1 = GPU, 0/1 = Jetson DLA coreDetection
yolos::det::YOLODetector detector("yolo11n.trt", "coco.names");
auto detections = detector.detect(image, 0.4f, 0.45f);
detector.drawDetections(image, detections);Instance Segmentation
yolos::seg::YOLOSegDetector seg("yolo11n-seg.trt", "coco.names");
auto results = seg.segment(image);
seg.drawSegmentations(image, results);Pose Estimation
yolos::pose::YOLOPoseDetector pose("yolo11n-pose.trt");
auto results = pose.detect(image);
pose.drawPoses(image, results);Oriented Bounding Box (OBB)
yolos::obb::YOLOOBBDetector obb("yolo11n-obb.trt", "Dota.names");
auto results = obb.detect(image);
obb.drawDetections(image, results);Classification
yolos::cls::YOLOClassifier cls("yolov8n-cls.trt", "ImageNet.names");
auto result = cls.classify(image);
std::cout << result.className << ": " << result.confidence * 100 << "%" << std::endl;// Auto-detect version from tensor shape
auto det = yolos::det::createDetector("yolo11n.trt", "coco.names");
// Explicit version override
auto det = yolos::det::createDetector("yolo11n.trt", "coco.names", yolos::YOLOVersion::V11);docker build -f Dockerfile.tensorrt -t yolos-trt .
docker run --gpus all \
-v $(pwd)/models:/app/models \
-v $(pwd)/data:/app/data \
yolos-trt \
./image_inference ./models/yolo11n.trt ./data/dog.jpg ./models/coco.namesdocker build -f Dockerfile.tensorrt.jetson -t yolos-trt-jetson .
docker run --runtime nvidia \
-v $(pwd)/models:/app/models \
yolos-trt-jetson \
./image_inference ./models/yolo11n.trt ./data/dog.jpg ./models/coco.namescv::Mat (BGR, host)
│
│ memcpy → pinned staging buffer
▼
Pinned Host ──cudaMemcpyAsync──► Device uint8 (raw BGR)
│
┌───────────┘
▼
letterboxNormalizeKernel() ← single CUDA kernel
┌─────────────────────────┐
│ bilinear letterbox │
│ BGR → RGB │
│ ÷ 255.0 normalize │
│ HWC → NCHW transpose │
└────────────┬────────────┘
▼
TRT input buffer (float32, device)
│
enqueueV3() / cudaGraphLaunch()
│
▼
TRT output buffer(s) (device)
│
cudaMemcpyAsync D→H
│
▼
Postprocess (CPU) → Detections / Masks / Keypoints
| Component | File | Role |
|---|---|---|
| TrtSessionBase | core/trt_session_base.hpp |
Engine deserialization, I/O buffer management, warm-up, CUDA Graph capture, async inference |
| CUDA Preprocessing | core/cuda_preprocessing.cu |
Single kernel: letterbox + BGR→RGB + normalize + HWC→NCHW |
| Version Detection | core/version.hpp |
Auto-detect YOLO version from output tensor shape |
| NMS | core/nms.hpp |
Batched non-maximum suppression |
| Drawing | core/drawing.hpp |
Bounding box, mask, and skeleton visualization |
| Task Heads | tasks/*.hpp |
Version-aware postprocessing for each task type |
YOLOs-CPP-TensorRT/
├── include/yolos/ # Header-only library
│ ├── core/ # Engine, preprocessing, NMS, drawing, types
│ └── tasks/ # Detection, segmentation, pose, OBB, classification
├── src/ # Ready-to-use inference binaries
│ ├── image_inference.cpp # Single image / folder
│ ├── video_inference.cpp # Video file (multi-threaded)
│ ├── camera_inference.cpp # Live camera feed
│ ├── batch_image_inference.cpp
│ └── class_image_inference.cpp
├── examples/ # Per-task examples (image / video / camera × 5 tasks)
├── benchmarks/ # Unified benchmark tool (FPS, latency, mAP)
├── tests/ # Per-task validation suites (C++ vs Python ground truth)
├── models/ # ONNX export script + label files
├── trt-files/scripts/ # TensorRT conversion & INT8 calibration tools
├── doc/ # Installation, usage, quantization, contributing guides
├── Dockerfile.tensorrt # Multi-stage Docker (data-center GPU)
├── Dockerfile.tensorrt.jetson # Multi-stage Docker (Jetson)
└── CMakeLists.txt # Top-level build (CXX + CUDA)
The build produces five ready-to-use executables:
# Object detection on a single image
./image_inference models/yolo11n.trt data/dog.jpg models/coco.names
# Process a video file
./video_inference models/yolo11n.trt data/video.mp4 output.mp4 models/coco.names
# Live camera feed (V4L2 / RTSP)
./camera_inference models/yolo11n.trt /dev/video0 models/coco.names
# Batch detection over a folder
./batch_image_inference models/yolo11n.trt data/ models/coco.names
# Image classification
./class_image_inference models/yolov8n-cls.trt data/dog.jpg models/ImageNet.namescd tests
./test_all.sh # Run all task tests
./test_detection.sh # Detection only
./test_segmentation.sh # Segmentation only
./test_pose.sh # Pose estimation only
./test_obb.sh # Oriented bounding box only
./test_classification.sh # Classification onlyTests export models via Ultralytics, convert to TRT engines, run inference in both Python and C++, and compare outputs for correctness.
cd benchmarks
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Quick performance test
./yolo_unified_benchmark quick models/yolo11n.trt data/dog.jpg
# Comprehensive multi-model sweep
./yolo_unified_benchmark comprehensiveSee the Benchmark README for all modes (image, video, camera, accuracy evaluation).
"TensorRT not found" during CMake
# Check if TensorRT is installed
dpkg -l | grep nvinfer
# Point CMake to a custom install
cmake .. -DTENSORRT_DIR=/opt/TensorRT-10.4"CUDA graph capture failed"
This is expected for dynamic-shape models. YOLOs-TRT automatically falls back to standard enqueueV3 dispatch. Performance impact is minimal (~0.1–0.3 ms).
Engine fails to load / crashes on inference
TensorRT engines are GPU-specific and TRT-version-specific. Rebuild on the target device:
trtexec --onnx=model.onnx --saveEngine=model.trt --fp16Low FPS on first few frames
The 10-iteration warm-up handles this automatically. If you're benchmarking manually, discard the first ~10 frames.
Contributions are welcome! Please see the Contributing Guide for details.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes
- Push to the branch and open a Pull Request
| Guide | Description |
|---|---|
| Installation | Detailed setup for Ubuntu, Jetson, and Docker |
| Usage | API walkthrough with examples |
| Models | Model export, conversion, and optimization |
| Quantization | INT8 calibration and precision tuning |
| Development | Architecture deep-dive and extending the library |
| Acknowledgments | Credits and references |
If YOLOs-TRT helps your project, consider giving it a star — it helps others discover it!
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details.
Built with NVIDIA TensorRT and Ultralytics YOLO
Made with dedication to pushing the limits of real-time inference.