Table of Contents Bag of Visual Words Bag of Words is a technique used in Natural Language Processing for document classification. It is a collection of word counts. To create a Bag of Words for a document, it is necessary to create a dictionary first. Choosing a dictionary depends on many factors, including computational limitations. Next, the documents in a dataset are tokenized into words. The word counts are collected into a histogram and used as a feature vector for a machine learning model.
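The steps above can be sketched in a few lines, assuming a fixed dictionary and simple whitespace tokenization (both simplifications; real pipelines use more careful tokenizers):

```python
from collections import Counter

def bag_of_words(document, dictionary):
    """Count occurrences of each dictionary word in a document."""
    counts = Counter(document.lower().split())
    # The feature vector has one entry per dictionary word, in a fixed order.
    return [counts[word] for word in dictionary]

dictionary = ["cat", "dog", "sat", "mat"]
vector = bag_of_words("The cat sat on the mat because the cat was tired",
                      dictionary)
# vector -> [2, 0, 1, 1]
```

Words outside the dictionary are simply dropped, which is why dictionary choice matters.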
Table of Contents Vision Transformer (ViT) (Dosovitskiy et al. 2021) Swin Transformer (Liu et al. 2021) Vision Transformer (ViT) (Dosovitskiy et al. 2021) The original Vision Transformer (ViT) was published by Google Brain with a simple objective: apply the Transformer architecture to images with as few modifications as necessary. When trained on ImageNet, as was standard practice, the performance of ViT does not match models like ResNet. However, scaling the training data up to hundreds of millions of images results in a better-performing model.
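ViT's key modification is splitting the image into fixed-size patches that are flattened and treated as tokens. A minimal sketch of that patch step with NumPy, using the paper's default 16×16 patch size (the function name here is illustrative):

```python
import numpy as np

def to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Each row is one patch token of patch_size * patch_size * c values.
    return patches.reshape(-1, patch_size * patch_size * c)

tokens = to_patches(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

Each flattened patch is then linearly projected to the model dimension before entering the Transformer.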
Table of Contents Introduction Mask R-CNN (He et al. 2018) CenterMask (Lee and Park 2020) Cascade R-CNN (Cai and Vasconcelos 2019) MaskFormer (Cheng, Schwing, and Kirillov 2021) Mask2Former (Cheng et al. 2022) Mask-FrozenDETR (Liang and Yuan 2023) Segment Anything (Kirillov et al. 2023) Segment Anything 2 (Ravi et al. 2024) Introduction Mask R-CNN (He et al. 2018) Mask R-CNN adapts Faster R-CNN (Ren et al. 2017) to include a branch for instance segmentation. This branch predicts a binary mask for each RoI, and the training loss is updated to include this branch.
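The mask branch is trained with a per-pixel binary cross-entropy between the predicted mask and the ground-truth mask for each RoI. A sketch of that loss term in NumPy (the shapes and clipping constant are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def mask_loss(pred, target, eps=1e-7):
    """Average binary cross-entropy between a predicted mask (probabilities
    in [0, 1]) and a binary ground-truth mask of the same shape."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return bce.mean()

pred = np.full((28, 28), 0.9)   # confident foreground everywhere
target = np.ones((28, 28))      # ground truth: all foreground
loss = mask_loss(pred, target)  # small loss, -log(0.9) at every pixel
```

The total training loss sums this mask term with the existing classification and box-regression terms.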
Table of Contents Papers Evaluating Object Detection Methods Datasets An Incomplete History of Deep-Learning-based Object Detection Papers https://awesomeopensource.com/projects/object-detection Evaluating Object Detection Methods Object detection algorithms are evaluated using the mean of Average Precision (mAP) across all classes in the dataset.
Precision and recall are computed from the predictions and the ground truth. In classification, both a sample's true label and the model's prediction are either positive or negative: either the sample belongs to a class or it does not. The table below summarizes the possible outcomes between the model's prediction and the true underlying class.
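From those outcome counts, precision and recall follow directly; a minimal sketch (the example counts are made up for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of positive predictions that are correct.
    Recall: fraction of actual positives the model found."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 8 correct detections, 2 false alarms, 2 missed objects
p, r = precision_recall(tp=8, fp=2, fn=2)
# p -> 0.8, r -> 0.8
```

Average Precision then summarizes the precision-recall curve for one class, and mAP averages that value over all classes.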
Table of Contents Introduction Convolution Operator Properties of Convolutions Parameter Sharing Pooling Backwards Pass Example Neural Networks for Image Classification Useful Resources Key Concepts
Invariance and Equivariance Definition Padding, Stride, Kernel size, dilation Purpose of multiple feature maps Receptive fields and hierarchies of features Downsampling, Upsampling, Examples in research Introduction Dense neural networks made up of linear layers and a chosen activation function are not practical for image data. Consider an image of size \(224\times224\times3\). The first layer of a dense network would require a \(150,528\times n\) parameter matrix, where \(n\) is the number of nodes in the first layer. It is common to build dense networks where the first layer has more nodes than input features. In this case, we would need a minimum of \(150,528^2\) parameters in the first layer. Even if we chose something much smaller like \(n=1024\), this would require \(154,140,672\) parameters for just the first layer. This is clearly impractical.
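The parameter counts above are quick to verify (biases are excluded here for simplicity):

```python
# Parameters in the first dense layer for a flattened 224x224x3 image.
inputs = 224 * 224 * 3     # 150,528 input features
print(inputs)              # 150528

n = 1024                   # a deliberately small first layer
print(inputs * n)          # 154,140,672 weights in the first layer alone
```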
Table of Contents Introduction Tracking with Optical Flow Kalman Filters Introduction Tracking features and objects is required in many applications ranging from autonomous driving to security. Vision tracking systems are often used for live sports broadcasts to keep track of players, the ball, and other visual cues related to the game.
Figure 1: Source: https://azbigmedia.com/lifestyle/ball-tracking-technology-changes-way-fans-consume-practice-sport-of-golf/ Naive tracking will detect an object per frame without any regard for prior information. More sophisticated trackers will consider the previous frame as a starting point for their search space. However, even these trackers may need to reinitialize after a certain amount of time if their estimate drifts too far from the object’s actual location.
Table of Contents Introduction Motion Features Computing Optical Flow Assumptions of Small Motion Applications Introduction Optical flow refers to the apparent motion in a 2D image. Optical flow methods approximate the motion field, which refers to the true motion of objects in 3D projected onto the image plane. If a fixed camera records a video of someone walking from the left side of the screen to the right, a difference of two consecutive frames reveals much about the apparent motion.
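The frame-difference observation can be sketched directly: pixels where consecutive frames differ are candidates for apparent motion (the threshold value here is arbitrary).

```python
import numpy as np

def motion_mask(frame_prev, frame_next, threshold=25):
    """Binary mask of pixels whose intensity changed between frames."""
    diff = np.abs(frame_next.astype(np.int16) - frame_prev.astype(np.int16))
    return diff > threshold

# A bright square moves one pixel to the right between two synthetic frames.
prev = np.zeros((8, 8), dtype=np.uint8); prev[2:5, 2:5] = 200
nxt = np.zeros((8, 8), dtype=np.uint8);  nxt[2:5, 3:6] = 200
mask = motion_mask(prev, nxt)
# Changed pixels appear only at the trailing and leading edges of the square.
```

Differencing alone gives no direction or magnitude; that is what proper optical flow estimation adds.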
Table of Contents Introduction Agglomerative Clustering K-Means Clustering Simple Linear Iterative Clustering (SLIC) Superpixels in Recent Work Introduction The goal of segmentation is fairly broad: group visual elements together. For any given task, the question is how are elements grouped? At the smallest level of an image, pixels can be grouped by color, intensity, or spatial proximity. Without a model of higher level objects, the pixel-based approach will break down at a large enough scale.
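A minimal sketch of grouping pixels by color alone, using plain Lloyd-iteration k-means in NumPy (a deterministic, evenly spaced initialization is used here for simplicity; SLIC additionally weighs spatial proximity):

```python
import numpy as np

def kmeans_colors(pixels, k=2, iters=10):
    """Group pixels of shape (N, 3) into k clusters by color alone."""
    # Simple deterministic init: evenly spaced samples as initial centers.
    idx = np.linspace(0, len(pixels) - 1, k).astype(int)
    centers = pixels[idx].astype(float)
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        dists = np.linalg.norm(pixels[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for j in range(k):
            if (labels == j).any():
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated color groups: near-black and near-white pixels.
pixels = np.vstack([np.full((50, 3), 10.0), np.full((50, 3), 240.0)])
labels, centers = kmeans_colors(pixels)
# All near-black pixels share one label; all near-white share the other.
```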
Table of Contents Resources Introduction Gestalt Theory Grouping Segmentation Methods Resources https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/ (Berkeley Segmentation Database) https://arxiv.org/abs/2105.15203v2 (SegFormer) https://arxiv.org/abs/1703.06870 (Mask R-CNN) https://github.com/sithu31296/semantic-segmentation (Collection of SOTA models) Introduction Feature extraction methods such as SIFT provide us with many distinct, low-level features that are useful for providing local descriptions of images. We now “zoom out” and take a slightly higher level look at the next stage of image summarization. Our goal here is to take these low-level features and group, or fit, them together such that they represent a higher level feature. For example, from small patches representing color changes or edges, we may wish to build higher-level features representing an eye, a mouth, or a nose.
Table of Contents Introduction Computing Gradient Norms Nonmaxima Suppression Thresholding Connectivity Analysis Introduction Figure 1: Vertical derivative filter (left) and horizontal derivative filter (right). When image gradient filters are applied to an image, we can observe that the filter responses are very sensitive to noise and fine detail. For example, look at the surface at the back of the ship near the drive cone. To resolve this, the image should be smoothed before differentiating it. Recall that the Gaussian filter smooths a neighborhood by weighting nearby pixels more heavily than distant ones.
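The "smooth first, then differentiate" idea can be sketched with NumPy alone: a small Gaussian blur followed by a central-difference horizontal derivative (the kernel sizes and the naive convolution routine are illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'valid'-mode 2D correlation, adequate for small kernels."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0   # 3x3 Gaussian approximation
dx = np.array([[-1.0, 0.0, 1.0]]) / 2.0  # central-difference derivative

image = np.random.default_rng(0).normal(size=(32, 32))  # pure noise
smoothed = convolve2d(image, gaussian)
grad_x = convolve2d(smoothed, dx)
# Smoothing first suppresses the noise a bare derivative would amplify:
# grad_x has much lower variance than convolve2d(image, dx).
```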
Table of Contents Resizing Sampling Resizing Aliasing arises through resampling an image How to resize - algorithm How to resolve aliasing Resizing an image, whether increasing or decreasing its size, is a common image operation. In Linear Algebra, scaling is one of the transformations usually discussed, along with rotation and skew. Scaling is performed by creating a transformation matrix
\begin{equation*} M = \begin{bmatrix} s & 0\\ 0 & s \end{bmatrix}, \end{equation*}
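Applying \(M\) to pixel coordinates scales them uniformly by \(s\); a minimal sketch (the rectangle corners are an illustrative example):

```python
import numpy as np

s = 2.0
M = np.array([[s, 0.0],
              [0.0, s]])  # uniform scaling matrix from the text

# Corners of a 4x3 rectangle, one (x, y) coordinate per row.
corners = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
scaled = corners @ M.T  # each coordinate doubled: an 8x6 rectangle
```

Note that \(M\) only maps coordinates; resizing an actual image still requires resampling pixel values onto the new grid, which is where aliasing can arise.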
Table of Contents Topics The Human Eye Color Matching Color Physics Color Spaces HSV Color Space Topics What is color? How do we process color? What information does color contain? What can we infer from color? The Human Eye The eye acts as a camera, including a lens which focuses light onto a receptive surface. The cornea covers the lens, and together they act as a compound lens. The lens itself is flexible, allowing the eye to focus on objects at varying distances. The lens is attached to ciliary muscles which contract or relax to change the shape of the lens. This allows us to focus on near or far objects. As we age, the lens hardens and no longer returns to a spherical shape when the ciliary muscles contract, resulting in farsightedness.