
YOLOs-TRT

The fastest way to run YOLO models in C++ on NVIDIA GPUs.

Header-only. TensorRT-native. Zero-copy GPU pipeline. Sub-2ms inference.




Quick Start · Benchmarks · Installation · API · Docker · Docs


Why YOLOs-TRT?

Most YOLO C++ wrappers treat preprocessing as an afterthought — resizing on the CPU, copying synchronously, rebuilding launch parameters every frame. YOLOs-TRT was built from scratch around a single principle: the GPU should never wait for the CPU.

| | YOLOs-TRT | Typical C++ YOLO Wrappers |
|---|---|---|
| Preprocessing | GPU (single CUDA kernel) | CPU (OpenCV) |
| Host-to-device transfer | Async (pinned memory) | Synchronous |
| Inference dispatch | CUDA Graph replay | Per-frame enqueue |
| YOLO version config | Auto-detected from tensor shape | Manual flag |
| API surface | 1 header include | Multiple source files to compile |

The result: sub-2ms end-to-end latency and 530+ FPS on a laptop GPU.


⚡ Benchmarks

Measured on NVIDIA RTX 2000 Ada (Laptop) — YOLOv11n · 640×640 · 1000 iterations · 10-iter warm-up

| Precision | FPS | Avg Latency | P50 | P99 | GPU Memory |
|---|---|---|---|---|---|
| FP32 | 466 | 2.14 ms | 2.04 ms | 3.03 ms | 530 MB |
| FP16 | 479 | 2.09 ms | 1.98 ms | 2.91 ms | 536 MB |
| INT8 | 530 | 1.89 ms | 1.78 ms | 2.70 ms | 444 MB |

Note: These numbers include the full pipeline — preprocessing, inference, and postprocessing. Expect proportionally higher throughput on higher-end GPUs (RTX 4090, A100, H100).
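For readers reproducing these results by hand, the P50/P99 figures above come from a simple percentile computation over raw per-frame latency samples. A generic sketch (not the repository's benchmark code; `percentile` is an illustrative helper):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-rank percentile over latency samples (in milliseconds).
double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p >= 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    // Nearest-rank definition: the ceil(p/100 * N)-th smallest sample (1-based).
    std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * samples.size()));
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
```

Collect one sample per frame after the warm-up iterations, then report `percentile(samples, 50.0)` and `percentile(samples, 99.0)`.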

What makes it fast?
| Optimization | Impact |
|---|---|
| GPU letterbox + normalize | A single CUDA kernel performs bilinear letterbox resize, BGR→RGB conversion, and /255.0 normalization — writing directly into the TRT input buffer. Eliminates all CPU preprocessing from the hot path. |
| Pinned staging buffers | Raw BGR pixels are memcpy'd into a CUDA pinned buffer, enabling truly asynchronous cudaMemcpyAsync H2D transfer that overlaps with compute. |
| CUDA Graph capture | For fixed-shape engines, the entire enqueueV3 call graph is captured once and replayed via cudaGraphLaunch, cutting ~0.1–0.3 ms of per-frame dispatch overhead. |
| 10-iteration warm-up | Lets TensorRT's internal autotuner converge on optimal kernel selections before timing begins. |
| Single-stream pipeline | Minimal synchronization points — one cudaStream_t drives the entire preprocess → infer → postprocess pipeline. |
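The letterbox step above reduces to a small piece of geometry: one uniform scale factor plus top/left pad offsets. A host-side sketch of that math (illustrative only; the actual kernel computes this per launch, and `computeLetterbox` is a hypothetical helper name):

```cpp
#include <algorithm>
#include <cassert>

struct Letterbox {
    float scale;  // uniform resize factor applied to the source image
    int   padX;   // left padding in the destination, in pixels
    int   padY;   // top padding in the destination, in pixels
};

// Aspect-preserving mapping from a srcW x srcH image into a
// dstW x dstH network input (e.g. 640x640).
Letterbox computeLetterbox(int srcW, int srcH, int dstW, int dstH) {
    float scale = std::min(static_cast<float>(dstW) / srcW,
                           static_cast<float>(dstH) / srcH);
    int newW = static_cast<int>(srcW * scale);
    int newH = static_cast<int>(srcH * scale);
    return {scale, (dstW - newW) / 2, (dstH - newH) / 2};
}
```

The same scale/pad pair is what postprocessing reuses to map detections back into original-image coordinates.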

🚀 Quick Start

Three steps to your first inference:

# 1. Clone & build
git clone https://github.com/Geekgineer/YOLOs-CPP-TensorRT.git
cd YOLOs-CPP-TensorRT && ./build.sh

# 2. Convert a model (requires Python + ultralytics)
pip install -r requirements.txt
python models/export_onnx.py --model yolo11n
trtexec --onnx=models/yolo11n.onnx --saveEngine=models/yolo11n.trt --fp16

# 3. Run inference
./build/image_inference models/yolo11n.trt data/dog.jpg models/coco.names

In your own code — just a few lines:

#include "yolos/tasks/detection.hpp"

int main() {
    yolos::det::YOLODetector detector("yolo11n.trt", "coco.names");
    cv::Mat image = cv::imread("dog.jpg");
    auto results = detector.detect(image);
    detector.drawDetections(image, results);
    cv::imshow("YOLOs-TRT", image);
    cv::waitKey(0);
}

🧩 Supported Models & Tasks

YOLOs-TRT auto-detects the YOLO version from output tensor shapes — no manual configuration required.

| Task | API | Supported Versions |
|---|---|---|
| Detection | YOLODetector::detect() | YOLOv5 · v7 · v8 · v9 · v10 · v11 · v12 · v26 · NAS |
| Instance Segmentation | YOLOSegDetector::segment() | YOLOv8-seg · v11-seg · v26-seg |
| Pose Estimation | YOLOPoseDetector::detect() | YOLOv8-pose · v11-pose · v26-pose |
| Oriented BBox (OBB) | YOLOOBBDetector::detect() | YOLOv8-obb · v11-obb · v26-obb |
| Classification | YOLOClassifier::classify() | YOLOv8-cls · v11-cls · v12-cls · v26-cls |
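Auto-detection is possible because different YOLO generations export distinguishable output tensor shapes. A simplified illustration of the idea (hypothetical logic and illustrative shapes — not the library's actual core/version.hpp implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Guess the YOLO family from a detection head's output shape.
// Illustrative shapes: YOLOv8/v11 typically export [1, 4+nc, anchors],
// YOLOv10 exports end-to-end [1, 300, 6], YOLOv5 exports [1, anchors, 5+nc].
std::string guessVersion(const std::vector<int>& shape) {
    if (shape.size() != 3) return "unknown";
    if (shape[2] == 6) return "v10-style (end-to-end, no NMS needed)";
    if (shape[1] > shape[2]) return "v5/v7-style (anchors first, attrs last)";
    return "v8/v11-style (attrs first, anchors last)";
}
```

The real detector inspects the engine's bound output tensors at load time, so no manual version flag is needed.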

📦 Installation

Prerequisites

| Dependency | Version | Notes |
|---|---|---|
| NVIDIA GPU | CC ≥ 7.5 | Turing, Ampere, Ada, Hopper, or Jetson Xavier/Orin |
| CUDA Toolkit | ≥ 12.0 | |
| TensorRT | ≥ 10.0 | Tensor-based API (enqueueV3) |
| OpenCV | ≥ 4.5 | Image I/O and visualization |
| CMake | ≥ 3.18 | CUDA language support |
| C++ compiler | C++17 | GCC 9+ / Clang 10+ |

Build from Source

git clone https://github.com/Geekgineer/YOLOs-CPP-TensorRT.git
cd YOLOs-CPP-TensorRT
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Custom TensorRT path / Jetson / advanced options:
# Custom TensorRT location
cmake .. -DCMAKE_BUILD_TYPE=Release -DTENSORRT_DIR=/opt/TensorRT-10.4

# Specific CUDA architectures (e.g. Jetson Orin)
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES="87"

# Build example applications
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=ON

Jetson (JetPack 6.x): TensorRT and CUDA are pre-installed. Just clone, build, and run.

Install TensorRT

Ubuntu (apt):

sudo apt update && sudo apt install -y tensorrt

Jetson (JetPack): Pre-installed. Verify with:

dpkg -l | grep tensorrt

Manual / Tarball: Download from developer.nvidia.com/tensorrt, extract, and pass -DTENSORRT_DIR=/path/to/tensorrt to CMake.


🔄 Model Conversion

C++ inference requires serialized TensorRT engines. Python is only needed for this one-time conversion step.

# Install Python deps (uv recommended for speed)
pip install uv && uv pip install -r requirements.txt

# Export ONNX from Ultralytics
python models/export_onnx.py --model yolo11n

# Convert to TensorRT engine
trtexec --onnx=models/yolo11n.onnx --saveEngine=models/yolo11n.trt --fp16

INT8 quantization with calibration:
# Generate calibration images (or use your own dataset)
python trt-files/scripts/generate_calibration_data.py

# Convert with INT8
python trt-files/scripts/convert_to_tensorrt.py \
    --onnx models/yolo11n.onnx --int8 \
    --calib-data trt-files/scripts/calibration_data/

See the Quantization Guide for details on INT8 calibration strategies.

Important: TensorRT engines are GPU-architecture and TRT-version specific. Always rebuild on the target hardware.


📖 API Reference

Include & Use

YOLOs-TRT is header-only. Include the task header you need, link against TensorRT and CUDA, done.

#include "yolos/tasks/detection.hpp"      // Detection
#include "yolos/tasks/segmentation.hpp"   // Instance Segmentation
#include "yolos/tasks/pose.hpp"           // Pose Estimation
#include "yolos/tasks/obb.hpp"            // Oriented Bounding Boxes
#include "yolos/tasks/classification.hpp" // Classification
#include "yolos/yolos.hpp"                // Everything

Constructor

All task classes share the same constructor pattern:

ClassName(const std::string& enginePath,
          const std::string& labelsPath,
          YOLOVersion version = YOLOVersion::Auto,  // auto-detect from tensor shape
          int dlaCore = -1);                         // -1 = GPU, 0/1 = Jetson DLA core

Task Examples

Detection
yolos::det::YOLODetector detector("yolo11n.trt", "coco.names");
auto detections = detector.detect(image, 0.4f, 0.45f);
detector.drawDetections(image, detections);
Instance Segmentation
yolos::seg::YOLOSegDetector seg("yolo11n-seg.trt", "coco.names");
auto results = seg.segment(image);
seg.drawSegmentations(image, results);
Pose Estimation
yolos::pose::YOLOPoseDetector pose("yolo11n-pose.trt");
auto results = pose.detect(image);
pose.drawPoses(image, results);
Oriented Bounding Box (OBB)
yolos::obb::YOLOOBBDetector obb("yolo11n-obb.trt", "Dota.names");
auto results = obb.detect(image);
obb.drawDetections(image, results);
Classification
yolos::cls::YOLOClassifier cls("yolov8n-cls.trt", "ImageNet.names");
auto result = cls.classify(image);
std::cout << result.className << ": " << result.confidence * 100 << "%" << std::endl;

Factory Functions

// Auto-detect version from tensor shape
auto det = yolos::det::createDetector("yolo11n.trt", "coco.names");

// Explicit version override
auto det = yolos::det::createDetector("yolo11n.trt", "coco.names", yolos::YOLOVersion::V11);

🐳 Docker

Data-Center GPU

docker build -f Dockerfile.tensorrt -t yolos-trt .
docker run --gpus all \
    -v $(pwd)/models:/app/models \
    -v $(pwd)/data:/app/data \
    yolos-trt \
    ./image_inference ./models/yolo11n.trt ./data/dog.jpg ./models/coco.names

NVIDIA Jetson (JetPack 6.x)

docker build -f Dockerfile.tensorrt.jetson -t yolos-trt-jetson .
docker run --runtime nvidia \
    -v $(pwd)/models:/app/models \
    yolos-trt-jetson \
    ./image_inference ./models/yolo11n.trt ./data/dog.jpg ./models/coco.names

🏗️ Architecture

cv::Mat (BGR, host)
   │
   │  memcpy → pinned staging buffer
   ▼
Pinned Host ──cudaMemcpyAsync──► Device uint8 (raw BGR)
                                      │
                          ┌───────────┘
                          ▼
              letterboxNormalizeKernel()          ← single CUDA kernel
              ┌─────────────────────────┐
              │  bilinear letterbox      │
              │  BGR → RGB               │
              │  ÷ 255.0 normalize       │
              │  HWC → NCHW transpose    │
              └────────────┬────────────┘
                           ▼
              TRT input buffer (float32, device)
                           │
                  enqueueV3() / cudaGraphLaunch()
                           │
                           ▼
              TRT output buffer(s) (device)
                           │
                  cudaMemcpyAsync D→H
                           │
                           ▼
              Postprocess (CPU) → Detections / Masks / Keypoints

Core Components

| Component | File | Role |
|---|---|---|
| TrtSessionBase | core/trt_session_base.hpp | Engine deserialization, I/O buffer management, warm-up, CUDA Graph capture, async inference |
| CUDA Preprocessing | core/cuda_preprocessing.cu | Single kernel: letterbox + BGR→RGB + normalize + HWC→NCHW |
| Version Detection | core/version.hpp | Auto-detect YOLO version from output tensor shape |
| NMS | core/nms.hpp | Batched non-maximum suppression |
| Drawing | core/drawing.hpp | Bounding box, mask, and skeleton visualization |
| Task Heads | tasks/*.hpp | Version-aware postprocessing for each task type |
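For context on the NMS component: greedy IoU-based suppression boils down to sorting by confidence, keeping the best box, and dropping overlapping ones. A minimal standalone sketch of that idea (not the library's batched implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-union of two axis-aligned boxes.
float iou(const Box& a, const Box& b) {
    float ix = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    float iy = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    float inter = ix * iy;
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (areaA + areaB - inter);
}

// Greedy NMS: returns indices of kept boxes, highest score first.
std::vector<std::size_t> nms(const std::vector<Box>& boxes, float iouThresh) {
    std::vector<std::size_t> order(boxes.size()), kept;
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return boxes[a].score > boxes[b].score; });
    std::vector<bool> removed(boxes.size(), false);
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (removed[order[i]]) continue;
        kept.push_back(order[i]);
        for (std::size_t j = i + 1; j < order.size(); ++j)
            if (!removed[order[j]] && iou(boxes[order[i]], boxes[order[j]]) > iouThresh)
                removed[order[j]] = true;
    }
    return kept;
}
```

The library's core/nms.hpp applies this per class and per batch; the detect() overloads expose the confidence and IoU thresholds shown in the API examples above.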

📂 Project Structure

YOLOs-CPP-TensorRT/
├── include/yolos/            # Header-only library
│   ├── core/                 #   Engine, preprocessing, NMS, drawing, types
│   └── tasks/                #   Detection, segmentation, pose, OBB, classification
├── src/                      # Ready-to-use inference binaries
│   ├── image_inference.cpp   #   Single image / folder
│   ├── video_inference.cpp   #   Video file (multi-threaded)
│   ├── camera_inference.cpp  #   Live camera feed
│   ├── batch_image_inference.cpp
│   └── class_image_inference.cpp
├── examples/                 # Per-task examples (image / video / camera × 5 tasks)
├── benchmarks/               # Unified benchmark tool (FPS, latency, mAP)
├── tests/                    # Per-task validation suites (C++ vs Python ground truth)
├── models/                   # ONNX export script + label files
├── trt-files/scripts/        # TensorRT conversion & INT8 calibration tools
├── doc/                      # Installation, usage, quantization, contributing guides
├── Dockerfile.tensorrt       # Multi-stage Docker (data-center GPU)
├── Dockerfile.tensorrt.jetson # Multi-stage Docker (Jetson)
└── CMakeLists.txt            # Top-level build (CXX + CUDA)

🏃 Built-in Binaries

The build produces five ready-to-use executables:

# Object detection on a single image
./image_inference models/yolo11n.trt data/dog.jpg models/coco.names

# Process a video file
./video_inference models/yolo11n.trt data/video.mp4 output.mp4 models/coco.names

# Live camera feed (V4L2 / RTSP)
./camera_inference models/yolo11n.trt /dev/video0 models/coco.names

# Batch detection over a folder
./batch_image_inference models/yolo11n.trt data/ models/coco.names

# Image classification
./class_image_inference models/yolov8n-cls.trt data/dog.jpg models/ImageNet.names

🧪 Testing

cd tests
./test_all.sh          # Run all task tests
./test_detection.sh    # Detection only
./test_segmentation.sh # Segmentation only
./test_pose.sh         # Pose estimation only
./test_obb.sh          # Oriented bounding box only
./test_classification.sh # Classification only

Tests export models via Ultralytics, convert to TRT engines, run inference in both Python and C++, and compare outputs for correctness.


📊 Benchmarking

cd benchmarks
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Quick performance test
./yolo_unified_benchmark quick models/yolo11n.trt data/dog.jpg

# Comprehensive multi-model sweep
./yolo_unified_benchmark comprehensive

See the Benchmark README for all modes (image, video, camera, accuracy evaluation).


🔧 Troubleshooting

"TensorRT not found" during CMake
# Check if TensorRT is installed
dpkg -l | grep nvinfer

# Point CMake to a custom install
cmake .. -DTENSORRT_DIR=/opt/TensorRT-10.4

"CUDA graph capture failed"

This is expected for dynamic-shape models. YOLOs-TRT automatically falls back to standard enqueueV3 dispatch. Performance impact is minimal (~0.1–0.3 ms).

Engine fails to load / crashes on inference

TensorRT engines are GPU-specific and TRT-version-specific. Rebuild on the target device:

trtexec --onnx=model.onnx --saveEngine=model.trt --fp16

Low FPS on first few frames

The 10-iteration warm-up handles this automatically. If you're benchmarking manually, discard the first ~10 frames.


🤝 Contributing

Contributions are welcome! Please see the Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes
  4. Push to the branch and open a Pull Request

📚 Documentation

| Guide | Description |
|---|---|
| Installation | Detailed setup for Ubuntu, Jetson, and Docker |
| Usage | API walkthrough with examples |
| Models | Model export, conversion, and optimization |
| Quantization | INT8 calibration and precision tuning |
| Development | Architecture deep-dive and extending the library |
| Acknowledgments | Credits and references |

⭐ Star History

If YOLOs-TRT helps your project, consider giving it a star — it helps others discover it!

Star History Chart


📄 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details.


Built with NVIDIA TensorRT and Ultralytics YOLO

Made with dedication to pushing the limits of real-time inference.
