The contents of this post are excerpted from the paper “YOLOX: Exceeding YOLO Series in 2021” 1, with slight modifications, as my notes on the paper.
1. Introduction
YOLOX integrates the following advanced detection techniques into the YOLO family:
- Anchor free manner
- Multi positives
- Decoupled head
- Label assignment strategy: SimOTA
- Strong data augmentation: Mosaic and MixUp
2. YOLOX
2.1. YOLOX-DarkNet53
Implementation details
Our training settings are mostly consistent from the baseline to our final model:
- Train the models for a total of 300 epochs with 5 epochs warm-up on COCO train2017.
- Use stochastic gradient descent (SGD).
- Use a learning rate of \(lr \times \text{BatchSize}/64\) (linear scaling), with an initial \(lr = 0.01\) and a cosine learning rate schedule.
- The batch size is 128 by default for typical 8-GPU devices. Other batch sizes, including single-GPU training, also work well.
- The input size is drawn uniformly from 448 to 832 in steps of 32.
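As a rough sketch of my own (not the official code), the learning rate and multi-scale settings above could be written as follows. The linear warm-up shape and the decay all the way to zero are assumptions; the paper only states "warm-up" and "cosine schedule". The names `learning_rate`, `INPUT_SIZES`, and `sample_input_size` are hypothetical.

```python
import math
import random

def learning_rate(epoch, batch_size=128, base_lr=0.01,
                  total_epochs=300, warmup_epochs=5):
    """Learning rate at a (possibly fractional) epoch."""
    # Linear scaling rule from the notes: lr * BatchSize / 64.
    peak_lr = base_lr * batch_size / 64
    if epoch < warmup_epochs:
        # Assumption: linear ramp from 0 up to the peak during warm-up.
        return peak_lr * epoch / warmup_epochs
    # Cosine decay from the peak down to 0 (assumed floor) afterwards.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t))

# Candidate input sizes: 448, 480, ..., 832 (steps of 32).
INPUT_SIZES = list(range(448, 832 + 1, 32))

def sample_input_size(rng=random):
    """Draw one training input size uniformly from the candidates."""
    return rng.choice(INPUT_SIZES)
```

With the default batch size of 128, the peak learning rate is 0.01 × 128 / 64 = 0.02, and there are 13 candidate input sizes.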
YOLOv3 baseline
Our baseline adopts the architecture of a DarkNet53 backbone and an SPP layer, referred to as YOLOv3-SPP. We slightly change some training strategies compared to the original YOLOv3 implementation:
- Add EMA weights updating.
- Add cosine learning rate schedule.
- Add IoU loss and IoU-aware branch.
- Use BCE loss for training the cls and obj branches.
- Use IoU loss for training the reg branch.
These general training tricks are orthogonal to the key improvements of YOLOX, so we put them on the baseline.
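To make the two losses concrete, here is a minimal sketch for a single box and a single probability. It assumes boxes in `(x1, y1, x2, y2)` corner format and uses the plain `1 - IoU` form of the IoU loss; the exact variant used in YOLOX may differ, so treat this as an illustration only. All function names are my own.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred_box, gt_box):
    """Assumed form: 1 - IoU, so a perfect overlap gives zero loss."""
    return 1.0 - iou(pred_box, gt_box)

def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))
```

For example, two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, so their IoU is 1/7.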
Decoupled head
Classification and regression tasks conflict with each other. Compared to a coupled detection head, a decoupled head for classification and localization:
- Converges much faster in training.
- Gets better performance.
This increased AP by 1.1%.
Refer to IoU-Net 2 for the IoU branch.
Strong data augmentation
We add Mosaic and MixUp to our augmentation strategies to boost YOLOX’s performance. We turn these augmentations off for the last 15 epochs.
This increased AP by 2.4%.
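A small sketch of the schedule described above, plus the pixel-blending step of a generic MixUp. The function names, the 0-based epoch convention, and the flat pixel-list representation are my own assumptions; the YOLOX implementation of MixUp may use a different blending ratio and extra scale jitter.

```python
def strong_augmentation_enabled(epoch, total_epochs=300, no_aug_epochs=15):
    """True while Mosaic and MixUp are still applied (epoch is 0-based)."""
    return epoch < total_epochs - no_aug_epochs

def mixup_pixels(pixels_a, pixels_b, lam=0.5):
    """Blend two equally sized images given as flat lists of pixel values.

    Generic MixUp: lam * a + (1 - lam) * b. For detection, the boxes of
    both source images are typically kept alongside the blended image.
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    return [lam * a + (1.0 - lam) * b for a, b in zip(pixels_a, pixels_b)]
```

With 300 epochs in total, epochs 0 to 284 use the strong augmentation and epochs 285 to 299 do not.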
Anchor-free
The anchor mechanism has many known problems:
- Anchor priors are domain-specific and less generalized.
- Anchor mechanism increases the complexity of detection heads, as well as the number of predictions for each image.
We reduce the predictions for each location in the feature map from 3 to 1 and make them directly predict four box values. This modification reduces the parameters and GFLOPs of the detector and makes it faster, while also obtaining better performance.
This increased AP by 0.9% over the previous step.
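To see how much the number of predictions shrinks, here is a quick count. The 640×640 input and the three output strides 8, 16, and 32 are assumptions of mine (they are common for YOLO-style detectors but are not stated in these notes), and `num_predictions` is a hypothetical helper.

```python
def num_predictions(input_size=640, strides=(8, 16, 32), per_location=1):
    """Total predictions per image over all feature-map levels."""
    total = 0
    for stride in strides:
        cells_per_side = input_size // stride
        total += cells_per_side * cells_per_side * per_location
    return total

anchor_based = num_predictions(per_location=3)  # 3 anchors per location
anchor_free = num_predictions(per_location=1)   # 1 prediction per location
print(anchor_based, anchor_free)  # prints "25200 8400"
```

Under these assumptions, the anchor-free head produces one third as many predictions per image.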
Multi positives
Selecting only ONE positive sample (the center location) for each object ignores other high quality predictions. However, optimizing those high quality predictions may also bring beneficial gradients, which may alleviate the extreme imbalance of positive/negative sampling during training. We simply assign the center 3×3 area as positives, also named “center sampling” in FCOS.
This increased AP by 2.1%.
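A minimal sketch of picking the center 3×3 area on one feature-map level. It assumes the object center is given in input-image pixels and that cells falling outside the grid are simply dropped; `center_positive_cells` is a hypothetical name.

```python
def center_positive_cells(cx, cy, stride, grid_w, grid_h, radius=1):
    """Grid cells (row, col) in the (2*radius+1)^2 area around an object center.

    cx, cy: object center in input-image pixels.
    stride: downsampling factor of this feature-map level.
    """
    center_col = int(cx // stride)
    center_row = int(cy // stride)
    cells = []
    for row in range(center_row - radius, center_row + radius + 1):
        for col in range(center_col - radius, center_col + radius + 1):
            # Drop cells that fall outside the feature map.
            if 0 <= row < grid_h and 0 <= col < grid_w:
                cells.append((row, col))
    return cells
```

For an object centered at (100, 60) on a stride-8 level, the center cell is (row 7, col 12), and its eight neighbors are marked positive as well. Near the image border, fewer than nine cells remain.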
SimOTA
Four key insights for an advanced label assignment:
- Loss/quality aware.
- Center prior.
- Dynamic number of positive anchors for each ground-truth (abbreviated as dynamic top-\(k\)). The term “anchor” refers to “anchor point” in the context of anchor-free detectors and “grid” in the context of YOLO.
- Global view.
OTA label assignment meets all four rules above, but it is very time consuming. We therefore simplify it to a dynamic top-\(k\) strategy, named SimOTA, to get an approximate solution.
Steps in SimOTA:
- Calculate the pair-wise matching degree, represented by cost or quality for each prediction-gt pair. The cost between ground-truth (gt) \(g_i\) and prediction \(p_j\) is calculated as: \(c_{ij} = L_{ij}^{\text{cls}} + \lambda L_{ij}^{\text{reg}}\), where \(\lambda\) is a balancing coefficient. \(L_{ij}^{\text{cls}}\) and \(L_{ij}^{\text{reg}}\) are the classification loss and regression loss between \(g_i\) and \(p_j\).
- For gt \(g_i\), select the top \(k\) predictions with the least cost within a fixed center region as its positive samples. Note that the value of \(k\) varies for different gt. Refer to the Dynamic \(k\) Estimation strategy in OTA for more details.
- The corresponding grids of those positive predictions are assigned as positives, while the remaining grids are negatives.
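The steps above can be sketched in plain Python as follows. This is my own simplified reading, not the official implementation. Three parts are assumptions: \(k\) is estimated as the sum of the top-`q` IoUs (my understanding of Dynamic \(k\) Estimation in OTA), a prediction claimed by several gts goes to the gt with the lowest cost, and the default `lam=3.0` is only a placeholder. The names `matching_cost` and `simota_assign` are hypothetical.

```python
def matching_cost(cls_loss, reg_loss, lam=3.0):
    """c_ij = L_cls + lambda * L_reg (lam default is an assumed placeholder)."""
    return cls_loss + lam * reg_loss

def simota_assign(cost, iou, in_center, q=10):
    """Assign predictions to ground truths.

    cost[i][j], iou[i][j]: cost and IoU between gt i and prediction j.
    in_center[i][j]: True if prediction j lies in the center region of gt i.
    Returns {prediction index: gt index}; unlisted predictions are negatives.
    """
    best = {}  # prediction j -> (cost, gt i)
    for i in range(len(cost)):
        candidates = [j for j in range(len(cost[i])) if in_center[i][j]]
        if not candidates:
            continue
        # Dynamic k (assumed): sum of the top-q IoUs, at least 1.
        top_ious = sorted((iou[i][j] for j in candidates), reverse=True)[:q]
        k = min(max(1, int(sum(top_ious))), len(candidates))
        # Keep the k candidates with the least cost for this gt.
        for j in sorted(candidates, key=lambda j: cost[i][j])[:k]:
            # Conflict rule (assumed): the lower-cost gt wins.
            if j not in best or cost[i][j] < best[j][0]:
                best[j] = (cost[i][j], i)
    return {j: gt for j, (_, gt) in best.items()}
```

A gt whose candidates overlap it well gets a larger \(k\), while a gt with poor overlaps still keeps at least one positive.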
Here is a good interpretation of SimOTA.
This increased AP by 2.3%.
2.2. Other Backbones
Model size and data augmentation
The suitable augmentation strategy varies across models of different sizes. It is better to weaken the augmentation for small models. Specifically, we remove the MixUp augmentation and weaken the Mosaic (reduce the scale range from \([0.1, 2.0]\) to \([0.5, 1.5]\)) when training small models, i.e., YOLOX-S, YOLOX-Tiny, and YOLOX-Nano.
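Expressed as a tiny configuration helper, this could look like the sketch below. The dictionary keys and the function name are hypothetical and do not mirror the actual YOLOX configuration files.

```python
SMALL_MODELS = {"YOLOX-S", "YOLOX-Tiny", "YOLOX-Nano"}

def augmentation_settings(model_name):
    """Return illustrative augmentation settings for a given model size."""
    if model_name in SMALL_MODELS:
        # Small models: no MixUp, narrower Mosaic scale range.
        return {"mixup": False, "mosaic_scale": (0.5, 1.5)}
    # Larger models: MixUp on, full Mosaic scale range.
    return {"mixup": True, "mosaic_scale": (0.1, 2.0)}
```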
Conclusion
Equipped with the advanced detection techniques above, YOLOX achieves a better trade-off between speed and accuracy than its counterparts across all model sizes.
Reference:
