How can LLMs provide results that are not only factual, but based on your own private data? This article accompanies a workshop given at HackUTA 6 on October 12, 2024.
Table of Contents Introduction Types of Differentiation Forward Mode AD Reverse Mode AD Basic Implementation in Python Matrix Implementation Comparison with PyTorch Introduction These notes largely follow the survey presented by (Baydin et al. 2018). I have added a few examples to clarify the matrix algebra, as well as a lead-in to a practical implementation.
Automatic differentiation is a method for computing the derivatives of functions in a modular way using the chain rule of calculus. It is used in many deep learning frameworks such as PyTorch and TensorFlow. Consider a complex series of functions that together work to yield some useful output, such as that of a deep learning model. Traditionally, the parameters of such a model would be optimized through gradient descent. This requires that the derivatives with respect to the parameters are implemented for every function used in the model.
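As a minimal sketch of the idea, forward-mode automatic differentiation can be implemented by propagating a derivative value alongside each regular value through every operation. The toy `Dual` class below is an illustration, not any framework's API:

```python
class Dual:
    """A value paired with its derivative, for forward-mode AD."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (u * v)' = u' v + u v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# Differentiate f(x) = x * x + x at x = 3 by seeding dx/dx = 1.
x = Dual(3.0, 1.0)
y = x * x + x
print(y.value, y.deriv)  # 12.0 7.0, matching f'(x) = 2x + 1
```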
Table of Contents Bag of Visual Words Bag of Words is a technique used in Natural Language Processing for document classification. It represents a document as a collection of word counts. To create a Bag of Words for a document, it is first necessary to create a dictionary. Choosing a dictionary depends on many factors, including computational limitations. Next, the documents in a dataset are tokenized into words. The word counts are collected into a histogram and used as a feature vector for a machine learning model.
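A minimal sketch of that process, using a small hypothetical dictionary chosen ahead of time, might look like:

```python
from collections import Counter

# A small, fixed dictionary chosen ahead of time (hypothetical example).
dictionary = ["cat", "dog", "ball", "park"]

def bag_of_words(document: str) -> list[int]:
    """Tokenize a document and count dictionary words into a histogram."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    # Words outside the dictionary are simply ignored.
    return [counts[word] for word in dictionary]

print(bag_of_words("The dog chased the ball in the park"))  # [0, 1, 1, 1]
```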
Table of Contents Unsupervised Pre-training From GPT to GPT2 These notes provide an overview of pre-training large language models like GPT and Llama.
Unsupervised Pre-training Let’s start by reviewing the pre-training procedure detailed in the GPT paper (Radford et al. 2020). The Generative in Generative Pre-Training reveals much about how the network can be trained without direct supervision. It is analogous to how you might have studied definitions as a kid: create some flash cards with the term on the front and the definition on the back. Given the term as context, you try to recite the definition. When pre-training a language model, it is given a series of tokens and tasked with generating the next token in the sequence. Since we have access to the original documents, we can easily determine whether it was correct.
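A rough sketch of this objective in PyTorch, assuming a generic `model` that maps token IDs to next-token logits, is to shift the input sequence by one position and score the predictions with cross-entropy:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """tokens: (batch, seq_len) integer IDs taken from the original documents."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t from tokens < t
    logits = model(inputs)                            # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```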
Table of Contents Notes from (Friedman 2001) Notes from (Friedman 2001) Many machine learning methods are parameterized functions that are optimized using numerical optimization techniques, notably steepest descent.
The initial learner is a stump; subsequent learners are commonly trees whose depth is some power of 2.
Numerical optimization in function space \[ g_m(\mathbf{x}) = E_y\Big[\frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})}\Big|\mathbf{x}\Big]_{F(\mathbf{x})=F_{m-1}(\mathbf{x})} \] The optimal step size is found by solving
\[ \rho_m = \mathop{\arg \min}_{\rho} E_{y,\mathbf{x}}\,L(y,F_{m-1}(\mathbf{x})-\rho g_m(\mathbf{x})) \] Then the increment at step \(m\) is \[ f_m(\mathbf{x}) = -\rho_m g_m(\mathbf{x}) \]
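In practice the expectations are replaced by averages over the training data, and the negative gradient is approximated by a regression tree. A rough sketch of one boosting iteration for squared-error loss (where the negative gradient is simply the residual) is below; here a fixed shrinkage factor `rho` stands in for the line search over \(\rho\) above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_step(X, y, F_prev, max_depth=2, rho=0.1):
    """One gradient boosting iteration for squared-error loss.

    F_prev: current model predictions F_{m-1}(x) on the training set.
    """
    residuals = y - F_prev                 # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residuals)                 # the tree approximates -g_m
    return F_prev + rho * tree.predict(X), tree
```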
Hidden Markov Models provide a way of modeling the dynamics of sequential information. They have been used for speech recognition, part-of-speech tagging, machine translation, handwriting recognition, and, as we will see in this article, gesture recognition.
Table of Contents Generalization Bias Variance Bias-Variance Tradeoff Generalization When fitting machine learning models to data, we want them to generalize well to the distribution that we have sampled from. We can measure a model’s ability to generalize by evaluating it on previously unseen data that is sampled from the same distribution as the training set. However, we often do not know the true underlying distribution. So we must fit the models to empirical distributions derived from observed data.
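As a small illustration of this idea (using scikit-learn's built-in dataset and utilities), generalization is typically estimated by holding out part of the observed data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out data the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))  # estimate of generalization
```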
Table of Contents Vision Transformer (ViT) (Dosovitskiy et al. 2021) Swin Transformer (Liu et al. 2021) Vision Transformer (ViT) (Dosovitskiy et al. 2021) The original Vision Transformer (ViT) was published by Google Brain with a simple objective: apply the Transformer architecture to images with as few modifications as possible. When trained on ImageNet, as was standard practice, the performance of ViT does not match models like ResNet. However, scaling the training data up to hundreds of millions of images results in a better-performing model.
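A minimal sketch of the core modification is turning an image into a sequence of patch embeddings that a standard Transformer encoder can consume; the dimensions below are illustrative defaults, not a full reproduction of the paper's model:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each one."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        x = self.proj(x)                       # (batch, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (batch, 196, embed_dim)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```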
Table of Contents Introduction Box Constraints Updating the Lagrangians The Algorithm Implementation Introduction Paper link: https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/
Sequential Minimal Optimization (SMO) is an algorithm to solve the SVM Quadratic Programming (QP) problem efficiently (Platt, n.d.). Developed by John Platt at Microsoft Research, SMO deals with the constraints of the SVM objective by breaking it down into a smaller optimization problem at each step.
The two key components of SMO are an analytic method for solving for the two Lagrange multipliers chosen at each step, and a heuristic for choosing which multipliers to optimize.
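A rough sketch of the analytic update for the two chosen multipliers is shown below, ignoring the selection heuristics and degenerate cases; the error terms, kernel values, and box constraint `C` are assumed to be supplied by the surrounding training loop:

```python
def smo_pair_update(alpha1, alpha2, y1, y2, E1, E2, K11, K12, K22, C):
    """Analytically optimize two Lagrange multipliers, holding the rest fixed.

    E1, E2 are the prediction errors f(x_i) - y_i; K11, K12, K22 are kernel values.
    """
    # Bounds keeping the pair on the constraint line sum_i alpha_i y_i = const.
    if y1 != y2:
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:
        L, H = max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)

    eta = K11 + K22 - 2 * K12          # curvature along the constraint line
    if eta <= 0 or L == H:
        return alpha1, alpha2          # skip the degenerate cases in this sketch

    a2 = alpha2 + y2 * (E1 - E2) / eta # unconstrained optimum
    a2 = min(H, max(L, a2))            # clip to the feasible segment
    a1 = alpha1 + y1 * y2 * (alpha2 - a2)
    return a1, a2
```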
Table of Contents Introduction Mask R-CNN (He et al. 2018) CenterMask (Lee and Park 2020) Cascade R-CNN (Cai and Vasconcelos 2019) MaskFormer (Cheng, Schwing, and Kirillov 2021) Mask2Former (Cheng et al. 2022) Mask-FrozenDETR (Liang and Yuan 2023) Segment Anything (Kirillov et al. 2023) Segment Anything 2 (Ravi et al. 2024) Introduction Mask R-CNN (He et al. 2018) Mask R-CNN adapts Faster R-CNN to include a branch for instance segmentation (Ren et al. 2017). This branch predicts a binary mask for each RoI, and the training loss is updated to include this branch.
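A sketch of the added mask loss, per-pixel binary cross-entropy on the mask predicted for each RoI's ground-truth class, is below; the tensor shapes are illustrative rather than taken from a specific implementation:

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """mask_logits: (num_rois, num_classes, H, W) raw outputs of the mask branch.
    gt_masks: (num_rois, H, W) binary targets; gt_classes: (num_rois,) class labels.
    """
    idx = torch.arange(mask_logits.size(0))
    # Only the mask for the RoI's ground-truth class contributes to the loss,
    # so masks for different classes do not compete with each other.
    selected = mask_logits[idx, gt_classes]             # (num_rois, H, W)
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())
```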
Table of Contents Papers Evaluating Object Detection Methods Datasets An Incomplete History of Deep-Learning-based Object Detection Papers https://awesomeopensource.com/projects/object-detection Evaluating Object Detection Methods Object detection algorithms are evaluated using the mean of Average Precision (mAP) across all classes in the dataset.
Precision and recall are computed from the predictions and the ground truth. In classification, both a sample’s true label and the model’s prediction are either positive or negative: either the sample belongs to a class or it does not. The table below summarizes the possible outcomes between the model’s prediction and the true underlying class.
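As a small sketch, precision and recall follow directly from those outcome counts:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 8 correct detections, 2 spurious detections, 4 missed objects.
print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)
```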