This repository was archived by the owner on Feb 9, 2026. It is now read-only.

YOLOX for PyTorch

This repository provides scripts to train the YOLOX model on the Intel® Gaudi® AI accelerator to achieve state-of-the-art accuracy. To obtain model performance data, refer to the Intel Gaudi Model Performance Data page. For more information about training deep learning models using Gaudi, visit developer.habana.ai. Before you get started, make sure to review the Supported Configurations.

The YOLOX demo included in this release is YOLOX-S, trained in lazy mode for different batch sizes with FP32 and BF16 mixed precision.

Model Overview

YOLOX is an anchor-free object detector that adopts the YOLO architecture with a DarkNet53 backbone. The anchor-free mechanism greatly reduces the number of model parameters and therefore simplifies the detector. YOLOX also improves on the previous YOLO series with a decoupled head, an advanced label assignment strategy, and strong data augmentation. The decoupled head contains a 1x1 conv layer followed by two parallel branches, each with two 3x3 conv layers, for the classification and regression tasks respectively; this helps the model converge faster and with better accuracy. The advanced label assignment, SimOTA, selects the top k predictions with the lowest cost as the positive samples for each ground-truth object. SimOTA not only reduces training time by approximating the assignment instead of solving an optimization problem, but also improves the model's AP. Additionally, Mosaic and MixUp image augmentations are applied during training to further improve accuracy. Equipped with these techniques, YOLOX achieves a remarkably better trade-off between training speed and accuracy than comparable detectors across all model sizes.
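The core idea of the SimOTA selection step described above can be illustrated with a minimal, pure-Python sketch. The function name, the cost values, and the fixed k are illustrative only; the real implementation works on cost matrices built from classification and IoU losses and chooses k dynamically per ground truth:

```python
def simota_topk(costs, k):
    """Pick the k predictions with the lowest assignment cost for one
    ground-truth object (SimOTA-style positive-sample selection).

    costs: list of (prediction_index, cost) pairs.
    Returns the chosen prediction indices, lowest cost first.
    """
    ranked = sorted(costs, key=lambda pair: pair[1])
    return [idx for idx, _ in ranked[:k]]

# Toy costs for five candidate predictions against one ground truth:
candidate_costs = [(0, 2.5), (1, 0.7), (2, 1.9), (3, 0.4), (4, 3.1)]
positives = simota_topk(candidate_costs, k=2)
print(positives)  # [3, 1] — the two lowest-cost predictions become positives
```

All other candidates are treated as negatives for that ground truth, which is what makes the assignment cheap compared to solving the full optimal-transport problem.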

This repository is a PyTorch implementation of YOLOX, based on the source code from https://github.com/Megvii-BaseDetection/YOLOX. More details can be found in the paper YOLOX: Exceeding YOLO Series in 2021 by Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun.

Setup

Please follow the instructions provided in the Gaudi Installation Guide to set up the environment including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide. The guides will walk you through the process of setting up your system to run the model on Gaudi.

Clone Intel Gaudi Model-References

In the Docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.

git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Model-References

Go to PyTorch YOLOX directory:

cd Model-References/PyTorch/computer_vision/detection/yolox

Install Model Requirements

Install the required packages and add the current directory to PYTHONPATH:

pip install -r requirements.txt
pip install -e . --no-build-isolation
export PYTHONPATH=$PWD:$PYTHONPATH

Setting up the Dataset

Download the COCO 2017 dataset from http://cocodataset.org using the following commands:

cd Model-References/PyTorch/computer_vision/detection/yolox
source download_dataset.sh

You can either set the YOLOX_DATADIR environment variable to the dataset location:

export YOLOX_DATADIR=/data/COCO

Or create a datasets sub-directory and add a symbolic link from the COCO dataset path into it:

mkdir datasets
ln -s /data/COCO ./datasets/COCO

Alternatively, you can pass the COCO dataset location to the --data-dir argument of the training commands.
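The three ways of locating the dataset can be summarized as a lookup order, sketched below in plain Python. This is an illustration of the precedence only, not code from the repository; the function name is an assumption:

```python
import os

def resolve_data_dir(cli_data_dir=None):
    """Resolve the COCO dataset directory in order of precedence:
    an explicit --data-dir argument wins, then the YOLOX_DATADIR
    environment variable, then the symlinked datasets/COCO default.
    """
    if cli_data_dir:
        return cli_data_dir
    env_dir = os.environ.get("YOLOX_DATADIR")
    if env_dir:
        return env_dir
    return os.path.join("datasets", "COCO")

print(resolve_data_dir("/data/COCO"))  # → /data/COCO
```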

Training Examples

Run Single Card and Multi-Card Training Examples

Run training on 1 HPU:

  • FP32 data type, train for 500 steps:

    PT_HPU_LAZY_MODE=1 $PYTHON tools/train.py \
        --model-name yolox-s --devices 1 --batch-size 16 --data-dir /data/COCO --hpu steps 500 output_dir ./yolox_output
  • BF16 data type, train for 500 steps:

    PT_HPU_LAZY_MODE=1 PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt $PYTHON tools/train.py \
        --model-name yolox-s --devices 1 --batch-size 16 --data-dir /data/COCO --hpu --autocast \
        steps 500 output_dir ./yolox_output

Run training on 8 HPUs:

NOTE: mpirun map-by PE attribute value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.

  • FP32 data type, train for 2 epochs:

    export MASTER_ADDR=localhost
    export MASTER_PORT=12355
    export PT_HPU_LAZY_MODE=1
    mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by fill --report-bindings --allow-run-as-root \
    $PYTHON tools/train.py \
        --model-name yolox-s --devices 8 --batch-size 128 --data-dir /data/COCO --hpu max_epoch 2 output_dir ./yolox_output
  • BF16 data type, train for 2 epochs:

    export MASTER_ADDR=localhost
    export MASTER_PORT=12355
    export PT_HPU_LAZY_MODE=1
    PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by fill --report-bindings --allow-run-as-root \
    $PYTHON tools/train.py \
        --model-name yolox-s --devices 8 --batch-size 128 --data-dir /data/COCO --hpu --autocast \
        max_epoch 2 output_dir ./yolox_output
  • BF16 data type, train for 300 epochs:

    export MASTER_ADDR=localhost
    export MASTER_PORT=12355
    export PT_HPU_LAZY_MODE=1
    PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by fill --report-bindings --allow-run-as-root \
    $PYTHON tools/train.py \
        --model-name yolox-s --devices 8 --batch-size 128 --data-dir /data/COCO --hpu --autocast \
        print_interval 100 max_epoch 300 save_history_ckpt False eval_interval 300 output_dir ./yolox_output

Validation Examples

Run Single Card and Multi-Card Validation Examples

Pretrained model: you can download a pretrained model from the YOLOX releases page. For example, run the following command to download the pretrained yolox-s model:

curl -L -O https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_s.pth

Run validation on 1 HPU:

  • FP32 data type:

    PT_HPU_LAZY_MODE=1 $PYTHON tools/eval.py --model-name yolox-s --ckpt-path ./yolox_s.pth --data-dir /data/COCO --batch-size 256 --devices 1 --conf-threshold 0.001 --data-num-workers 4 --hpu --fuse --post-processing cpu-async --warmup-steps 4
  • BF16 data type:

    PT_HPU_LAZY_MODE=1 PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt \
    $PYTHON tools/eval.py --model-name yolox-s --ckpt-path ./yolox_s.pth --data-dir /data/COCO --batch-size 512 --devices 1 --conf-threshold 0.001 --hpu --autocast --fuse --post-processing cpu-async --warmup-steps 4

Run validation on 2 HPUs:

NOTE: mpirun map-by PE attribute value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.

  • FP32 data type:

    export MASTER_ADDR=localhost
    export MASTER_PORT=12355
    export PT_HPU_LAZY_MODE=1
    mpirun -n 2 --bind-to core --map-by socket:PE=6 --rank-by fill --report-bindings --allow-run-as-root \
    $PYTHON tools/eval.py --model-name yolox-s --ckpt-path ./yolox_s.pth --data-dir /data/COCO --batch-size 1024 --devices 2 --conf-threshold 0.001 --hpu --fuse --post-processing cpu-async --warmup-steps 4
  • BF16 data type:

    export MASTER_ADDR=localhost
    export MASTER_PORT=12355
    export PT_HPU_LAZY_MODE=1
    PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt \
    mpirun -n 2 --bind-to core --map-by socket:PE=6 --rank-by fill --report-bindings --allow-run-as-root \
    $PYTHON tools/eval.py --model-name yolox-s --ckpt-path ./yolox_s.pth --data-dir /data/COCO --batch-size 1024 --devices 2 --conf-threshold 0.001 --hpu --autocast --fuse --post-processing cpu-async --warmup-steps 4

Inference performance

Interpreting the output

In the log/console output, performance measurements are printed as follows:

  • Total evaluation loop throughput - Estimated total throughput of the full eval process: media decode and processing, inference, and post-processing (generate detections and bounding boxes)
  • Average inference throughput - Estimated throughput of only the inference and post-processing stages.

For example:

Total evaluation loop time:           3.30 (s)
Total evaluation loop throughput:   1499.83 (images/s)
Total evaluation loop images:         4952
Average inference time per batch:   289.61 (ms)
Average inference throughput:       1767.91 (images/s)

Comparing these two values provides an estimate of how well media processing is being parallelized with the other stages. If total throughput is close to inference-only throughput, then the overhead due to media processing is very low (which is good). See the evaluate() function in coco_evaluator.py for implementation details.
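Using the sample numbers above, the un-overlapped media-processing cost can be estimated as the relative gap between the two throughputs. This is a rough back-of-the-envelope calculation, not something the scripts print:

```python
def media_overhead_fraction(total_tput, inference_tput):
    """Fraction of the eval loop spent on media processing that was
    not hidden behind inference (0.0 means fully overlapped)."""
    return 1.0 - total_tput / inference_tput

# Sample values from the log excerpt above:
overhead = media_overhead_fraction(1499.83, 1767.91)
print(f"{overhead:.1%}")  # 15.2% — media work not overlapped with inference
```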

Post-processing options

There are four --post-processing options:

  • device - run post-processing on the device (HPU);
  • cpu - run post-processing on the CPU;
  • cpu-async - run post-processing on the CPU in asynchronous mode;
  • off - disable post-processing.

For HPU, it is better to use the cpu or cpu-async option to offload the post-processing stage to the CPU.

cpu-async is usually the best option because inference and post-processing execute simultaneously; its performance comes close to that with post-processing disabled (off).
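The overlap that cpu-async achieves can be sketched with a generic producer/consumer pattern. This is an illustration of the idea only, not the actual implementation in the scripts; the placeholder functions stand in for device inference and CPU post-processing (NMS, box decoding):

```python
from concurrent.futures import ThreadPoolExecutor

def infer(batch):
    # Placeholder for device inference; returns raw predictions.
    return [x * 2 for x in batch]

def post_process(raw):
    # Placeholder for CPU post-processing of raw predictions.
    return [x + 1 for x in raw]

batches = [[1, 2], [3, 4], [5, 6]]
results = []
with ThreadPoolExecutor(max_workers=1) as cpu_worker:
    pending = None
    for batch in batches:
        raw = infer(batch)                   # device works on batch N
        if pending is not None:
            results.append(pending.result()) # collect batch N-1 from the CPU
        pending = cpu_worker.submit(post_process, raw)  # overlap with next infer
    results.append(pending.result())
print(results)  # [[3, 5], [7, 9], [11, 13]]
```

Because the CPU worker processes batch N-1 while the device runs batch N, the post-processing time is hidden whenever it is shorter than the inference time.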

Disable HW-accelerated media processing

To disable HW-accelerated media processing, add the --disable-mediapipe switch.

Sample run, BF16 with batch size 128, HPU media processing disabled:

PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt $PYTHON tools/eval.py --model-name yolox-s --ckpt-path ./yolox_s.pth --data-dir /data/COCO --batch-size 128 --devices 1 --conf-threshold 0.001 --hpu --autocast --fuse --post-processing cpu-async --warmup-steps 4 --disable-mediapipe

Important notes

  • The first run after starting a new container may be slow due to loading JPEG images from disk. In subsequent runs, the OS will generally have cached them in the page cache, or the files may be copied to a ramdisk as described below.
  • The MediaPipe coco reader function skips images in the validation dataset where no detections are expected, so the total number of images processed is slightly lower than with the SW loader.
  • With different --post-processing CPU-related options enabled, performance of the host CPU may have a significant effect on overall throughput.
  • The inference accuracy may be lower using accelerated media processing compared to the reference (SW) path. For example (bf16):
# SW media processing (padded resize)
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.401

# HPU media processing (stretch resize)
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.375

This is due primarily to differences in the image resize algorithm. In both cases, test images are resized to a resolution of 640x640 prior to running inference. However, the SW resizer pads the images if necessary to maintain the aspect ratio (i.e. one dimension may be less than 640 pixels). The HW resizer does not pad, so all images are stretched to full 640x640. Bounding boxes for detected objects are scaled correctly back to the original image dimensions in both cases, but if the model was trained with 'padded' resize, and detection is run using 'stretched' resize, there can be a loss in accuracy. It is currently under investigation whether HW-accelerated media processing can be implemented with padded resize.
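The geometric difference between the two resize paths can be seen in a short calculation. This is illustrative geometry only; the real loaders also handle interpolation, normalization, and box rescaling:

```python
def padded_resize_dims(w, h, target=640):
    """SW path: scale by the limiting dimension so aspect ratio is
    preserved; the shorter side is then padded up to `target`."""
    scale = min(target / w, target / h)
    return round(w * scale), round(h * scale)

def stretched_resize_dims(w, h, target=640):
    """HW path: both dimensions are resized to `target`, which
    changes the aspect ratio of non-square images."""
    return target, target

# A 1280x720 image:
print(padded_resize_dims(1280, 720))     # (640, 360) — 280 rows of padding fill the rest
print(stretched_resize_dims(1280, 720))  # (640, 640) — objects are vertically stretched
```

A model trained on padded-resize inputs sees stretched inputs as distorted objects, which accounts for the AP gap shown above.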

More information about the HPU MediaPipe is available in the Intel Gaudi documentation.

Other useful options

The following options can be useful for measuring performance under different scenarios.

Change number of warm-up batches

By default, the script runs a few "warm-up" batches through the model to minimize the overhead of graph compilation. The default value is 4 batches, which is usually sufficient to get consistent performance results. The number of warm-up batches may be changed with the --warmup-steps switch.
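The effect of excluding warm-up batches from the reported average can be illustrated with the measurement pattern below. This is a simplified sketch, not code from eval.py; only the default of 4 warm-up batches is taken from the text:

```python
def average_batch_time(batch_times, warmup_steps=4):
    """Average per-batch time excluding the first `warmup_steps`
    batches, which carry one-time graph-compilation overhead."""
    measured = batch_times[warmup_steps:]
    return sum(measured) / len(measured)

# The first batches are slow due to compilation; steady state is ~0.29 s.
times = [2.1, 1.4, 0.6, 0.35, 0.29, 0.29, 0.30, 0.28]
print(round(average_batch_time(times), 3))  # 0.29
```

Without the exclusion, the compilation spikes would dominate the average and understate steady-state throughput.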

Run multiple evaluation passes

The option --repetitions can be used to run multiple passes over the same (full) eval dataset. For example, --repetitions 3 will run 3 full evaluation passes, printing performance results after each pass. This may provide more consistent results after the first iteration, as code and data are likely to be resident in RAM already.

If --repetitions is used, the script will only perform the coco accuracy evaluation (comparing detection results with the ground truth in annotations) after the last "epoch".

Copy input images to tmpfs (ramdisk)

If reading the JPEG images from COCO/val2017 is particularly slow due to disk access overhead, you can copy the dataset to /dev/shm/coco. This requires that /dev/shm exists and has sufficient space for the dataset.

Supported Configurations

| Device  | Intel Gaudi Software Version | PyTorch Version | Mode      |
|---------|------------------------------|-----------------|-----------|
| Gaudi   | 1.20.0                       | 2.6.0           | Training  |
| Gaudi 2 | 1.23.0                       | 2.9.0           | Inference |
| Gaudi 2 | 1.23.0                       | 2.9.0           | Training  |
| Gaudi 3 | 1.23.0                       | 2.9.0           | Inference |
| Gaudi 3 | 1.23.0                       | 2.9.0           | Training  |

Changelog

1.22.0

  • Enabled HPU MediaPipe for evaluation.
  • Added the following post-processing options for evaluation:
    • device
    • cpu
    • cpu-async
    • off
  • Added new options for inference performance measurement:
    • --performance-test
    • --repetitions
    • --export-performance-data
  • Renamed eval.py and train.py command-line arguments.

1.19.0

  • Evaluation script was enabled for HPU.
  • Enabled eager mode.

1.12.0

  • Removed PT_HPU_LAZY_MODE environment variable.
  • Removed flag use_lazy_mode.
  • Removed HMP data type.
  • Updated run commands to allow overriding the default lower-precision and FP32 op lists.

1.10.0

  • Enabled mixed precision training using PyTorch autocast on Gaudi.

Training Script Modifications

The following are the changes made to the training scripts:

  • Added source code to enable training on CPU.

  • Added source code to support Gaudi devices.

    • Enabled HMP data type.

    • Added support to run training in Lazy mode.

    • Re-implemented loss function with TorchScript and deployed the function to CPU.

    • Enabled distributed training with HCCL backend on 8 HPUs.

    • mark_step() is called to trigger execution.