The contents of this post are excerpted from the paper “Rich feature hierarchies for accurate object detection and semantic segmentation” 1, with slight modifications, as my notes on this great paper.
1. Introduction
This method solves the CNN localization problem by operating within the “recognition using regions” paradigm, as argued for by Gu et al. in “Recognition using regions”. At test time, this method consists of three modules:
- generates around 2000 category-independent region proposals for the input image (via selective search in the experiments);
- extracts a fixed-length (4096-dimensional in the experiments) feature vector from each proposal, using a CNN as a black-box feature extractor;
- classifies each region with category-specific linear SVMs.
This method uses a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape.
Since this method combines region proposals with CNNs, the authors dub it R-CNN: Regions with CNN features.
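To make the pipeline concrete, here is a minimal sketch of test-time detection. The `cnn` and `svms` objects, the corner-format boxes, and the plain resize (the paper additionally pads each region with context) are my assumptions for illustration, not the authors’ code:

```python
import numpy as np
import cv2  # assumed available for cropping/warping

def rcnn_detect(image, proposals, cnn, svms, input_size=227):
    """Sketch of the R-CNN test-time pipeline.

    image:     HxWx3 array
    proposals: list of (x1, y1, x2, y2) region proposals, e.g. from
               selective search (~2000 per image)
    cnn:       callable mapping a warped input_size x input_size crop
               to a 4096-d feature vector
    svms:      (num_classes, 4096) matrix of per-class linear SVM weights
    Returns an (num_proposals, num_classes) score matrix.
    """
    feats = []
    for (x1, y1, x2, y2) in proposals:
        crop = image[y1:y2, x1:x2]
        # Affine warp: resize the crop to the fixed CNN input size,
        # regardless of the region's original shape.
        warped = cv2.resize(crop, (input_size, input_size))
        feats.append(cnn(warped))          # 4096-d feature vector
    feats = np.stack(feats)                # (N, 4096)
    # The only class-specific computation: features (dot) SVM weights.
    return feats @ svms.T                  # (N, num_classes)
```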
The authors demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode.
2. Object detection with R-CNN
2.2 Test-time detection
Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
The only class-specific computations are dot products between features and SVM weights and non-maximum suppression.
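A minimal NumPy sketch of this greedy per-class NMS (the box format and names are my own; apply it once per class):

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression for a single class.

    boxes:  (N, 4) array of (x1, y1, x2, y2)
    scores: (N,) SVM scores for this class
    Returns indices of the kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # Reject regions that overlap the kept box more than the threshold.
        order = order[1:][iou <= iou_threshold]
    return keep
```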
2.3 Training
Since the training data is too large to fit in memory, we adopt the standard hard negative mining method.
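A rough sketch of hard negative mining for one class’s SVM, assuming an sklearn-style `decision_function`/`fit` interface and a streaming pool of negative features; everything here is illustrative rather than the authors’ implementation:

```python
def hard_negative_mining(svm, positives, negative_pool):
    """Sketch of standard hard negative mining for one class's SVM.

    The full negative set is too large for memory, so we stream it in
    batches and retrain only on the negatives the current model gets wrong.
    `svm` is assumed to be pre-trained on the positives plus an initial
    random subset of negatives.
    """
    cache = []  # running cache of hard negatives
    for batch in negative_pool:  # stream negative features in chunks
        scores = svm.decision_function(batch)
        # A negative is "hard" if it violates the margin (score > -1).
        cache.extend(x for x, s in zip(batch, scores) if s > -1)
        X = positives + cache
        y = [1] * len(positives) + [0] * len(cache)
        svm.fit(X, y)
    return svm
```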
Appendix
B. Positive vs. negative examples and softmax
B.1 Positive vs. negative examples
Positive and negative examples (region proposals) are defined differently for fine-tuning the CNN versus training the object detection SVMs.
| Scenario | IoU of negatives | IoU of positives |
| --- | --- | --- |
| Fine-tuning CNN | \(< 0.5\) | \(\geq 0.5\) (many jittered examples to avoid overfitting) |
| Training SVMs | \(< 0.3\) | \(= 1\) (ground truth only) |
Here the IoU refers to the overlap between an example (region proposal) and the ground-truth box.
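For concreteness, a minimal IoU implementation for boxes in \((x_1, y_1, x_2, y_2)\) corner format (my own helper, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```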
The fine-tuning data is limited. This scheme introduces many “jittered” examples (proposals with overlap between 0.5 and 1 that are not ground truth), which expands the number of positive examples by approximately 30x. The authors conjecture that this helps avoid overfitting.
B.2 SVMs rather than softmax
After fine-tuning, we train SVMs rather than simply using the last layer of the fine-tuned network, a 21-way softmax regression classifier, as the object detector. The authors found that performance on VOC 2007 dropped from 54.2% (SVMs) to 50.9% (softmax) mAP. This drop likely arises from a combination of factors: the definition of positive examples used in fine-tuning does not emphasize precise localization, and the softmax classifier was trained on randomly sampled negatives rather than on the subset of “hard negatives” used for SVM training.
My understanding: SVMs suit small sample sizes, so the strictly defined positive/negative examples are enough to train them. In contrast, training a CNN needs a large sample size, and the jittered examples enlarge the positive set enough to avoid overfitting.
C. Bounding-box regression
After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor to improve localization. Our goal is to learn a transformation that maps a proposed box \(P=(P_x, P_y, P_w, P_h)\) (center coordinates, width, and height) to a ground-truth box \(G=(G_x, G_y, G_w, G_h)\). We parameterize the transformation in terms of four functions \(\{d_{\star}(P)=\mathbf{w}_{\star}^{\mathrm{T}} \boldsymbol{\phi}_{5}(P)\}_{\star=x,y,w,h}\), where \(\boldsymbol{\phi}_{5}(P)\) is the pool5 feature vector of proposal \(P\).
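For reference, the paper specifies the predicted box and the learning objective as follows. The predicted box \(\hat{G}\) is obtained from \(P\) via

\[
\hat{G}_x = P_w d_x(P) + P_x, \qquad
\hat{G}_y = P_h d_y(P) + P_y, \qquad
\hat{G}_w = P_w \exp(d_w(P)), \qquad
\hat{G}_h = P_h \exp(d_h(P)),
\]

and the weights are learned by regularized least squares (ridge regression) on the targets

\[
t_x = (G_x - P_x)/P_w, \quad
t_y = (G_y - P_y)/P_h, \quad
t_w = \log(G_w/P_w), \quad
t_h = \log(G_h/P_h):
\]

\[
\mathbf{w}_{\star} = \underset{\hat{\mathbf{w}}_{\star}}{\operatorname{argmin}} \sum_{i}\left(t_{\star}^{i} - \hat{\mathbf{w}}_{\star}^{\mathrm{T}} \boldsymbol{\phi}_{5}(P^{i})\right)^{2} + \lambda\left\|\hat{\mathbf{w}}_{\star}\right\|^{2}.
\]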
Two subtle issues while implementing bounding-box regression:
- Regularization is important: the weights are learned by regularized least squares (ridge regression); see the sketch after this list.
- We select a training pair \((P, G)\) only if the proposal \(P\) is near at least one ground-truth box, i.e. \(\text{IoU}(P, G) > 0.6\). If \(P\) overlaps more than one ground-truth box, we pick the \(G\) with the largest IoU.
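A minimal NumPy sketch of fitting the regressors in closed form, assuming pool5 features are precomputed and each proposal is already matched to its ground-truth box (the paper reports setting \(\lambda = 1000\) on a validation set); all names here are mine:

```python
import numpy as np

def fit_bbox_regressors(feats, proposals, gts, lam=1000.0):
    """Fit the four ridge regressors (d_x, d_y, d_w, d_h) for one class.

    feats:     (N, D) pool5 features phi_5(P), one row per proposal
    proposals: (N, 4) boxes as (P_x, P_y, P_w, P_h): center x/y, width, height
    gts:       (N, 4) matched ground-truth boxes in the same format
    Returns a (4, D) weight matrix W, one row per target.
    """
    Px, Py, Pw, Ph = proposals.T
    Gx, Gy, Gw, Gh = gts.T
    # Regression targets t_star as defined in the paper.
    T = np.stack([(Gx - Px) / Pw,
                  (Gy - Py) / Ph,
                  np.log(Gw / Pw),
                  np.log(Gh / Ph)], axis=1)   # (N, 4)
    # Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T T.
    D = feats.shape[1]
    A = feats.T @ feats + lam * np.eye(D)
    W = np.linalg.solve(A, feats.T @ T)        # (D, 4)
    return W.T                                 # (4, D)
```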
Reference:
- Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. ↩