Deep Learning

Low Rank Adaptation

Table of Contents: Key Concepts

Traditional Fine-Tuning: Fine-tuning a model for a specific task can be expensive if the entire weight matrix is updated. LLMs range from billions to trillions of parameters, making full fine-tuning infeasible for many applications.

Low Rank Decomposition: Low Rank Adaptation (LoRA) is a method of decomposing the weight update matrix \(\Delta W\) into smaller matrices \(A\) and \(B\) such that \(\Delta W \approx AB\). The rank \(r\) of the decomposition is a hyperparameter that can be tuned to balance performance and computational cost (Hu et al. 2021).
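To make the decomposition concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. The class name, the default rank \(r=8\), and the initialization are illustrative assumptions, not the implementation from the paper or the post.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, weight: torch.Tensor, r: int = 8):
        super().__init__()
        out_features, in_features = weight.shape
        # The pretrained weight W stays frozen; only A and B receive gradients.
        self.weight = nn.Parameter(weight, requires_grad=False)
        self.A = nn.Parameter(torch.randn(in_features, r) * 0.01)  # (in_features, r)
        self.B = nn.Parameter(torch.zeros(r, out_features))        # (r, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Delta W is approximated by A @ B, so the adapted output is
        # x W^T + (x A) B, computed without ever materializing Delta W.
        return x @ self.weight.T + (x @ self.A) @ self.B

layer = LoRALinear(torch.randn(256, 512), r=8)
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 256])
```

For the 512-to-256 projection above, the trainable update has only \(512\times8 + 8\times256 = 6,144\) parameters, compared with the \(512\times256 = 131,072\) that full fine-tuning would touch.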

Patch Extraction

Table of Contents: Native Patch Extraction, Changing Perspective, The Mechanics of as_strided, What about RGB?

This post is a recreation of Misha Laskin’s Twitter post about patch extraction in numpy. I wanted to provide a version of it that can be accessed without requiring a Twitter account. Patch extraction is a common image preprocessing technique that splits an input image into a regular grid of sub-images. It is commonly used to prepare an image for input into a Vision Transformer (Dosovitskiy et al. 2021). As Misha points out in their original post, it is also used for convolutions, min and max pooling, and splicing audio and text.
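As a quick illustration of the idea (my own sketch, not Misha’s original code), the snippet below extracts non-overlapping patches from a single-channel image with numpy’s as_strided; the function name and the assumption that the image size divides evenly by the patch size are mine.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def extract_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W) image into an (H//patch, W//patch, patch, patch) grid of patches."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    sh, sw = image.strides
    shape = (h // patch, w // patch, patch, patch)
    # Step patch rows/columns between patches, single rows/columns within a patch.
    strides = (sh * patch, sw * patch, sh, sw)
    return as_strided(image, shape=shape, strides=strides)

image = np.arange(16).reshape(4, 4)
print(extract_patches(image, 2).shape)  # (2, 2, 2, 2)
```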

Using the cuDNN Library

Table of Contents: What is cuDNN?, Setting up cuDNN, Handling Errors, Representing Data, Dense Layers, Activation Functions, Loss Functions, Convolutions, Pooling

What is cuDNN? NVIDIA cuDNN provides optimized implementations of core operations used in deep learning. It is designed to be integrated into higher-level machine learning frameworks, such as TensorFlow, PyTorch, and Caffe.

Natural Language Processing

Table of Contents: Introduction, Text Preprocessing, Tasks, Models, Perplexity

Topics covered include character-level, word-level, and subword tokenization, stopwords, batching, and padding; unsupervised pre-training (autoregression and the BERT loss); and tasks such as text classification, named entity recognition, question answering, summarization, translation, and text generation.

Text Preprocessing: Text preprocessing is an essential step in NLP that involves cleaning and transforming unstructured text data to prepare it for analysis. Some common text preprocessing techniques include tokenization, stopword removal, batching, and padding.
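As a small illustration of the tokenization techniques listed above, the sketch below contrasts character-level and word-level tokenization on a toy sentence; the variable names and the vocabulary construction are assumptions made for the example.

```python
text = "the cat sat on the mat"

# Character-level tokenization: every character becomes a token.
char_tokens = list(text)

# Word-level tokenization: split on whitespace and map words to integer ids.
word_tokens = text.split()
vocab = {word: idx for idx, word in enumerate(sorted(set(word_tokens)))}
word_ids = [vocab[word] for word in word_tokens]

print(char_tokens[:6])  # ['t', 'h', 'e', ' ', 'c', 'a']
print(word_tokens)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(word_ids)         # [4, 0, 3, 2, 4, 1]
```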

Transformers

Table of Contents: Introduction, Definition, Attention, Key-value Store, Scaled Dot Product Attention, Multi-Head Attention, Encoder-Decoder Architecture, Encoder, Decoder, Usage, Resources

Introduction: The story of Transformers begins with “Attention Is All You Need” (Vaswani et al. 2017). In this seminal work, the authors describe the then-current landscape of sequential models, their shortcomings, and the novel ideas that led to their successful application.
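One of those ideas is scaled dot-product attention, \(\mathrm{softmax}(QK^\top/\sqrt{d_k})V\). The numpy sketch below illustrates it for a single attention head; the shapes are chosen arbitrarily for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n_queries, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 64)), rng.normal(size=(7, 64)), rng.normal(size=(7, 32))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 32)
```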

Long Short-Term Memory

The recurrent nature of RNNs means that gradients get smaller and smaller as the number of timesteps increases. This is known as the vanishing gradient problem. One of the first popular solutions to this problem is Long Short-Term Memory (LSTM), a recurrent network architecture introduced by Hochreiter and Schmidhuber. An LSTM is made up of memory blocks as opposed to simple hidden units. Each block is differentiable and contains a memory cell along with three gates: the input, output, and forget gates. These components allow the blocks to retain information across longer-range dependencies.
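For reference, one standard formulation of a block with input \(x_t\), hidden state \(h_t\), and cell state \(c_t\) is shown below, where \(\sigma\) is the logistic sigmoid and \(\odot\) denotes elementwise multiplication (the notation is mine rather than the original paper’s):

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]

The additive update of \(c_t\) is what lets gradients flow across many timesteps without vanishing as quickly as in a plain RNN.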

Recurrent Neural Networks

Table of Contents: Introduction, Definition, Bidirectional Recurrent Neural Networks, References

Introduction: Neural networks are an effective tool for regression and classification tasks, but they do not consider dependencies in the data over time. Many tasks carry implicit information that depends on input that has already been processed or that will not be seen until later in the sequence.
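A recurrent network captures such dependencies by feeding its previous hidden state back into the computation at each step. In one common formulation (the weight names here are illustrative, not taken from the post),

\[
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y,
\]

so the output at step \(t\) depends on every input seen up to that point through \(h_{t-1}\).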

Optimization for Deep Learning

Table of Contents: Resources, Introduction, Gradient Descent and its Variants, Adaptive Learning Rate Methods, Parameter Initialization

Resources: https://ruder.io/optimizing-gradient-descent/ and https://www.deeplearningbook.org/contents/optimization.html

Introduction: Empirical risk minimization is minimization over an empirical distribution. It differs from risk minimization, which minimizes over the true distribution; in practice we typically do not know the true distribution.
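In symbols (generic notation, not necessarily the post’s), with loss \(L\), model \(f(x;\theta)\), data-generating distribution \(p_{\text{data}}\), and \(m\) training examples:

\[
J^*(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\, L\big(f(x;\theta), y\big), \qquad \hat{J}(\theta) = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta), y^{(i)}\big).
\]

Empirical risk minimization minimizes \(\hat{J}\) as a surrogate for the risk \(J^*\), which cannot be computed without access to the true distribution.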

Convolutional Neural Networks

Table of Contents: Introduction, Convolution Operator, Properties of Convolutions, Parameter Sharing, Pooling, Backwards Pass Example, Neural Networks for Image Classification, Useful Resources

Key Concepts: invariance and equivariance; definition; padding, stride, kernel size, and dilation; the purpose of multiple feature maps; receptive fields and hierarchies of features; downsampling and upsampling; examples in research.

Introduction: Dense neural networks made up of linear layers and a chosen activation function are not practical for image data. Consider an image of size \(224\times224\times3\). The first layer of a dense network would require a \(150,528\times n\) parameter matrix, where \(n\) is the number of nodes in the first layer. It is common to build dense networks where the first layer has more nodes than input features; in that case we would need a minimum of \(150,528^2\) parameters in the first layer. Even if we chose something much smaller, like \(n=1024\), the first layer alone would require \(154,140,672\) parameters. This is clearly impractical.
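The arithmetic above is easy to verify; the snippet below also compares it against a hypothetical convolutional first layer (the choice of 64 filters of size \(3\times3\) is arbitrary and only for illustration).

```python
h, w, c = 224, 224, 3
inputs = h * w * c                      # 150,528 input features

n = 1024                                # nodes in the first dense layer
dense_params = inputs * n               # weight matrix only, ignoring biases
print(f"dense first layer: {dense_params:,} parameters")  # 154,140,672

# A convolutional first layer shares its kernel across spatial positions.
filters, kernel = 64, 3
conv_params = filters * kernel * kernel * c
print(f"conv first layer:  {conv_params:,} parameters")   # 1,728
```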

Deep Learning

Table of Contents: Introduction, What makes a model deep?, Deep Networks, Deep vs. Shallow Networks, High Dimensional Structured Data, Activation Functions, Loss Functions, A Typical Training Pipeline, Useful Links

Introduction: Deep learning is a term that you’ve probably heard a million times by now in different contexts. It is an umbrella term that encompasses techniques for computer vision, bioinformatics, natural language processing, and much more. It almost always involves a neural network of some kind trained on a large corpus of data.