Scale Invariant Feature Transforms
Introduction
https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf
General approach to computing SIFT features:
- Scale-space Extrema Detection
- Keypoint localization
- Orientation Assignment
- Generate keypoint descriptors
Difference of Gaussians
This same technique for detecting interesting points in a scale-invariant way can be approximated by taking the Difference of Gaussians. Consider the figure below.
By taking the difference of images smoothed by a Gaussian with different values of \(\sigma\), the resulting pixel values correspond to areas with high gradient norms in the less blurry version.
Let \(I_{\sigma_1}\) be the image blurred with a smaller value of \(\sigma\) and \(I_{\sigma_2}\) be the image blurred with a larger value. Then \(D(I_{\sigma_1}, I_{\sigma_2}) = I_{\sigma_2} - I_{\sigma_1}\). If a region in \(I_{\sigma_1}\) is locally flat, it will also be in flat in \(I_{\sigma_2}\). The difference will be relatively small for that region. If there are abrupt changes in a local region within \(I_{\sigma_1}\), they will be smoothed in \(I_{\sigma_2}\). Therefore, the difference \(D(I_{\sigma_1}, I_{\sigma_2})\) will be higher for that region.
When building SIFT features, the extremum are selected by comparing 3 DoG images. These are selected by evaluating each pixel to 26 of its neighbors in the current scale space and neighboring DoG spaces as visualized below.
To build the DoG pyramid, the authors propose that images are separated by a constant factor \(k\) in scale space. Each octave of scale space is divided such that the scalespace doubles every \(s\) samples.
Starting with \(\sigma = 0.5\), if we choose \(s=3\) then the fourth sample will be at \(\sigma = 1\), the seventh at \(\sigma=2\), and so on. To make sure the DoG images cover the full range of an octave, \(s + 3\) images need to be created per octave.
Why \(s + 3\)?
Each octave should evaluate local extrema for \(s\) scales. To evaluate this for scale \(\sigma_s\), we need the DoG for scales \(\sigma_{s-1}\) and \(\sigma_{s+1}\). This would require 4 Gaussians images to compute. The figure below represents the stack for \(s=2\).
How is the value of \(s\) determined?
In the paper, the authors perform a repeatability test to determine if the keypoints would be localized even with random augmentations. The process is as follows:
- Randomly augment an input image with noise, color jitter, scale, rotation, etc.
- Compute keypoints using using the extrema detection.
- Compare detected keypoints with known keypoints from original samples.
The authors found that using \(s = 3\) provided the highest percentage of repeatability in their experiments.
Keypoint Localization
Given the candidate keypoints selected by picking out local extrema, they pool of responses can further be refined by removing points that are sensitive to noise or located along an edge. They borrow the same approach used in the Harris corner detector to select more robust interest points in corners.
Orientation Assignment
Given a keypoint, an orientation histogram is generated. The authors use 36 bins to cover a 360 degree range for orientations. Similar to Histogram of Oriented Gradients, the orientations are weighted by their gradient magnitudes (Dalal and Triggs 2005). Additionally, a Gaussian-weighted circular patch is applied, centered on the keypoint, to further weight the responses. This means that points farther away from the center contribute less to the overall feature vector.
In order to make the keypoint rotation invariant, the dominant orientation is determined. If there are orientations that are within 80% of the highest orientation peak, multiple keypoints will be created using those orientations as well.
Orientations in this window are rotated by the dominant gradient so that all directions are with respect to the dominant orientation. This is a more efficient alternative to rotating the entire image by that orientation.
Descriptor Formation
In the paper, the authors generate keypoints using a \(16 \times 16\) window from which \(4 \times 4\) descriptors are generated following the descriptions above. Through experimentation, each \(4 \times 4\) descriptor uses 8 orientations, resulting in a feature vector \(\mathbf{x} \in \mathbb{R}^{128}\).
Different levels of contrast will product edges with higher gradient magnitudes. To account for this, the final feature vector is normalized using the \(L2\) hysteresis approach used in Harris corner detection.