Table of Contents Key Concepts Key Concepts Traditional Fine-Tuning Fine-tuning a model for a specific task can be expensive if all of the model’s weights are updated. LLMs range from billions to trillions of parameters, making full fine-tuning infeasible for many applications.
Low Rank Decomposition Low-Rank Adaptation (LoRA) is a method of decomposing the weight update matrix \(\Delta W\) into two much smaller matrices \(A\) and \(B\) such that \(\Delta W \approx AB\); for a \(d \times k\) weight matrix, \(A\) is \(d \times r\) and \(B\) is \(r \times k\). The rank \(r\) of the decomposition is a hyperparameter that can be tuned to balance performance and computational cost (Hu et al. 2021).
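To make the parameter savings concrete, here is a minimal numpy sketch of the idea for a hypothetical \(d \times k\) layer with rank \(r\); the names and sizes are illustrative, not taken from the paper:

```python
import numpy as np

d, k, r = 512, 512, 8               # layer dimensions and a small rank r
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))         # frozen pretrained weight, never updated
A = rng.normal(size=(d, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, k))                # starts at zero so the initial update is zero

x = rng.normal(size=(k,))           # an example input vector

delta_W = A @ B                     # rank-r update, delta_W ≈ AB
y = (W + delta_W) @ x               # forward pass through the adapted layer

# Full fine-tuning trains d*k parameters; LoRA trains only d*r + r*k.
print(W.size, A.size + B.size)      # 262144 vs. 8192
```

For \(r \ll \min(d, k)\), the trainable parameter count drops from \(dk\) to \(r(d + k)\).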
Table of Contents Naive Patch Extraction Changing Perspective The Mechanics of as_strided What about RGB? This post is a recreation of Misha Laskin’s Twitter post about patch extraction in numpy. I wanted to provide a version of it that can be accessed without requiring a Twitter account.
Patch extraction is a common image preprocessing technique that splits an input image into a regular grid of sub-images. It is commonly used to prepare an image for input to a Vision Transformer (Dosovitskiy et al. 2021). As Misha points out in their original post, it is also used for convolutions, min and max pooling, and splicing audio and text.
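Below is a small numpy sketch of the grid-splitting step for a grayscale image whose side lengths are divisible by the patch size; it uses the as_strided trick that the post builds up to, with a safer reshape/transpose version for comparison (the array sizes are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

img = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)  # toy grayscale image
p = 4                                                        # patch size
H, W = img.shape
sH, sW = img.strides

# View the image as a (H//p, W//p) grid of p x p patches without copying.
patches = as_strided(
    img,
    shape=(H // p, W // p, p, p),
    strides=(sH * p, sW * p, sH, sW),
)
print(patches.shape)  # (4, 4, 4, 4)

# Equivalent result with an ordinary reshape and transpose.
patches2 = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
assert np.array_equal(patches, patches2)
```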
Table of Contents What is cuDNN? Setting up cuDNN Handling Errors Representing Data Dense Layers Activation Functions Loss Functions Convolutions Pooling What is cuDNN? NVIDIA cuDNN provides optimized implementations of core operations used in deep learning. It is designed to be integrated into higher-level machine learning frameworks, such as TensorFlow, PyTorch, and Caffe.
Setting up cuDNN To use cuDNN, each program must first establish a handle to the library. This is done by declaring a cudnnHandle_t object and initializing it with cudnnCreate.
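cuDNN is a C API and is normally driven from C or C++; purely to illustrate the create/use/destroy pattern, here is a Python ctypes sketch, assuming libcudnn.so is on the loader path (the library filename varies by platform and version):

```python
import ctypes

# Load the cuDNN shared library (assumed name; adjust for your install).
cudnn = ctypes.CDLL("libcudnn.so")

handle = ctypes.c_void_p()                        # will hold the opaque cudnnHandle_t
status = cudnn.cudnnCreate(ctypes.byref(handle))  # returns a cudnnStatus_t

# 0 is CUDNN_STATUS_SUCCESS; anything else is an error code.
if status != 0:
    cudnn.cudnnGetErrorString.restype = ctypes.c_char_p
    raise RuntimeError(cudnn.cudnnGetErrorString(status).decode())

# ... pass `handle` to subsequent cuDNN calls ...

cudnn.cudnnDestroy(handle)                        # release the handle when done
```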
Table of Contents Introduction Text Preprocessing Tasks Models Perplexity Introduction Text Preprocessing Character-level tokenization Word-level tokenization Subword tokenization Stopwords Batching Padding Unsupervised Pre-Training Autoregression BERT loss Tasks Text Classification Named Entity Recognition Question Answering Summarization Translation Text Generation Text Preprocessing Text preprocessing is an essential step in NLP that involves cleaning and transforming unstructured text data to prepare it for analysis. Some common text preprocessing techniques include:
Expanding contractions (e.g., “don’t” to “do not”) [7] Lowercasing text [7] Removing punctuation [7] Removing digits and words containing digits [7] Removing stopwords (common words that do not carry much meaning) [7] Rephrasing text [7] Stemming and Lemmatization (reducing words to their root forms) [7] Common Tokenizers Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens. Common tokenizers used in NLP include character-level, word-level, and subword tokenizers.
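As a rough sketch of a few of these steps using only the Python standard library (the contraction map and stopword list below are tiny illustrative stand-ins for the larger resources a real pipeline would use):

```python
import re
import string

CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                                  # lowercasing
    for short, full in CONTRACTIONS.items():             # expanding contractions
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))  # removing punctuation
    text = re.sub(r"\w*\d\w*", "", text)                 # removing words containing digits
    tokens = text.split()                                # simple word-level tokenization
    return [t for t in tokens if t not in STOPWORDS]     # removing stopwords

print(preprocess("Don't remove the 2nd token? It's a test."))
# ['do', 'not', 'remove', 'token', 'it', 'test']
```

Stemming and lemmatization would normally be handled by a dedicated library rather than by hand.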
Table of Contents Introduction Definition Attention Key-value Store Scaled Dot Product Attention Multi-Head Attention Encoder-Decoder Architecture Encoder Decoder Usage Resources Introduction The story of Transformers begins with “Attention Is All You Need” (Vaswani et al. 2017). In this seminal work, the authors describe the landscape of sequential models at the time, their shortcomings, and the novel ideas that led to their successful application.
Their first point highlights a fundamental flaw in how Recurrent Neural Networks process sequential data: the output at each time step is a function of the previous time step. With the hindsight of 2022, when large language models are crossing the trillion-parameter milestone, a model requiring recurrent computation over previous time steps, with no possibility of parallelization, would be virtually intractable at that scale.
The recurrent nature of RNNs means that gradients shrink as they are propagated back through more and more time steps. This is known as the vanishing gradient problem. One of the first popular solutions to this problem is Long Short-Term Memory (LSTM), a recurrent network architecture introduced by Hochreiter and Schmidhuber.
An LSTM is made up of memory blocks as opposed to simple hidden units. Each block is differentiable and contains a memory cell along with three gates: the input, output, and forget gates. These components allow the blocks to retain information across longer-range dependencies.
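A numpy sketch of a single LSTM step with those three gates may help; the weight shapes, initialization, and toy dimensions here are illustrative rather than taken from the original papers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One memory-block update: three gates guarding a memory cell."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([h_prev, x])   # previous hidden state and current input

    f = sigmoid(Wf @ z + bf)          # forget gate: what to erase from the cell
    i = sigmoid(Wi @ z + bi)          # input gate: what new information to write
    o = sigmoid(Wo @ z + bo)          # output gate: what to expose as the hidden state
    c_tilde = np.tanh(Wc @ z + bc)    # candidate cell contents

    c = f * c_prev + i * c_tilde      # updated memory cell
    h = o * np.tanh(c)                # new hidden state
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3                           # hidden units and input features
params = [rng.normal(scale=0.1, size=(H, H + D)) for _ in range(4)] + [np.zeros(H)] * 4
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):     # unroll over a 5-step sequence
    h, c = lstm_step(x, h, c, params)
```

The cell state c is what carries information across many time steps; the gates learn when to write to it, erase it, and read from it.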
Table of Contents Introduction Definition Bidirectional Recurrent Neural Networks References Introduction Neural networks are an effective tool for regression and classification tasks, but they do not consider the dependencies of information over time. Many tasks depend on context from inputs that have already been processed or that will not be seen until later in the sequence.
Recurrent Neural Networks (RNN) consider the historical context of time-series data. Bi-directional Recurrent Neural Networks (BRNN) consider both historical and future context. This is necessary for tasks like language translation.
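A minimal numpy sketch of that idea: run one recurrent pass forward, another backward, and concatenate the hidden states so every time step sees both past and future context (the dimensions and the simple tanh cell are illustrative):

```python
import numpy as np

def rnn(xs, Wx, Wh, b):
    """Run a simple tanh RNN and return the hidden state at every time step."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4                         # sequence length, input size, hidden size
xs = rng.normal(size=(T, D))
fwd = (rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H)), np.zeros(H))
bwd = (rng.normal(scale=0.1, size=(H, D)), rng.normal(scale=0.1, size=(H, H)), np.zeros(H))

h_fwd = rnn(xs, *fwd)                     # left-to-right pass (historical context)
h_bwd = rnn(xs[::-1], *bwd)[::-1]         # right-to-left pass, realigned (future context)

h_bi = np.concatenate([h_fwd, h_bwd], axis=1)
print(h_bi.shape)                         # (6, 8): each step sees both directions
```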
Table of Contents Resources Introduction Gradient Descent and its Variants Adaptive Learning Rate Methods Parameter Initialization Resources https://ruder.io/optimizing-gradient-descent/ https://www.deeplearningbook.org/contents/optimization.html Introduction Empirical risk minimization: minimizing the expected loss over the empirical distribution of the training data. It differs from risk minimization, which minimizes over the true data-generating distribution; we typically do not know the true distribution.
Sufficiently complex models are able to simply memorize the training set, which is why minimizing empirical risk alone can lead to overfitting.
In many applications, what we want to optimize is different from what we actually optimize, since we need useful derivatives for gradient descent. For example, the 0-1 loss has a derivative of zero almost everywhere, so a smooth surrogate such as the negative log-likelihood is minimized in its place.
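A small numpy illustration of that gap, using the negative log-likelihood as the surrogate (the scores and labels are random and purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=100)              # model scores f(x) for 100 examples
labels = rng.integers(0, 2, size=100)      # true labels in {0, 1}
probs = 1.0 / (1.0 + np.exp(-scores))      # sigmoid turns scores into probabilities

# What we care about: the 0-1 loss (error rate). Its derivative with respect
# to the scores is zero almost everywhere, so it gives gradient descent no signal.
zero_one = np.mean((probs >= 0.5).astype(int) != labels)

# What we minimize instead: a smooth surrogate with useful derivatives everywhere.
nll = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(f"0-1 loss: {zero_one:.2f}, surrogate NLL: {nll:.2f}")
```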
Table of Contents Introduction Convolution Operator Properties of Convolutions Parameter Sharing Pooling Backwards Pass Example Neural Networks for Image Classification Useful Resources Key Concepts
Invariance and Equivariance Definition Padding, Stride, Kernel size, dilation Purpose of multiple feature maps Receptive fields and hierarchies of features Downsampling, Upsampling, Examples in research Introduction Dense neural networks made up of linear layers and a chosen activation function are not practical for image data. Consider an image of size \(224\times224\times3\). The first layer of a dense network would require a \(150,528\times n\) parameter matrix, where \(n\) is the number of nodes in the first layer. It is common to build dense networks where the first layer has more nodes than input features. In this case, we would need a minimum of \(150,528^2\) parameters in the first layer. Even if we chose something much smaller like \(n=1024\), this would require \(154,140,672\) parameters for just the first layer. This is clearly impractical.
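A quick check of that arithmetic, together with the parameter count of a convolutional first layer for comparison (the 3×3 kernel with 64 feature maps is an illustrative choice, not from this post):

```python
# Weight counts (biases omitted) for the first layer on a 224x224x3 image.
in_features = 224 * 224 * 3                 # 150,528 inputs once the image is flattened

n = 1024                                    # a deliberately small first dense layer
print(f"dense: {in_features * n:,}")        # 154,140,672 parameters

# A convolution shares one small kernel across every spatial position instead.
kernel, in_ch, out_ch = 3, 3, 64            # 64 feature maps of 3x3 kernels over 3 channels
print(f"conv:  {kernel * kernel * in_ch * out_ch:,}")  # 1,728 parameters
```

Parameter sharing is what makes the convolutional count independent of the image resolution.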
Table of Contents Introduction What makes a model deep? Deep Networks Deep vs. Shallow Networks High Dimensional Structured Data Activation Functions Loss Functions A Typical Training Pipeline Useful Links Introduction Deep learning is a term that you’ve probably heard of a million times by now in different contexts. It is an umbrella term that encompasses techniques for computer vision, bioinformatics, natural language processing, and much more. It almost always involves a neural network of some kind that was trained on a large corpus of data.