How can LLMs provide results that are not only factual, but also grounded in your own private data? This article accompanies a workshop given at HackUTA 6 on October 12, 2024.
Table of Contents
Key Concepts

Key Concepts
Traditional Fine-Tuning
Fine-tuning a model for a specific task can be expensive when every weight matrix is updated in full. LLMs range from billions to trillions of parameters, making full fine-tuning infeasible for many applications.
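To see why, here is a rough back-of-the-envelope sketch in Python (an illustration, not anything from the workshop). It assumes mixed-precision training with Adam, which needs roughly 16 bytes per parameter for the fp16 weights and gradients plus the fp32 master weights and optimizer moments, before counting activations; the exact footprint depends on the training setup.

```python
# Rough memory estimate for full fine-tuning with mixed-precision Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights,
# momentum, and variance (4 B each) ~= 16 bytes per parameter.
def full_finetune_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    return num_params * bytes_per_param / 1e9

for n in (7e9, 70e9, 1e12):
    print(f"{n:.0e} params -> ~{full_finetune_memory_gb(n):,.0f} GB")
# 7e+09 params -> ~112 GB
# 7e+10 params -> ~1,120 GB
# 1e+12 params -> ~16,000 GB
```

Even a 7B-parameter model already exceeds a single consumer GPU, which is what motivates parameter-efficient methods like LoRA below.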
Low Rank Decomposition
Low Rank Adaptation (LoRA) decomposes the weight update matrix \(\Delta W \in \mathbb{R}^{d \times k}\) into two much smaller matrices \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times k}\) such that \(\Delta W \approx AB\), so only \(r(d + k)\) parameters are trained instead of \(dk\). The rank \(r\) of the decomposition is a hyperparameter that can be tuned to balance performance and computational cost (Hu et al. 2021).
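As a concrete illustration, here is a minimal PyTorch sketch of a LoRA-style linear layer (an example for this section, not the workshop's code): the pretrained weight is frozen and only the two low-rank factors are trained. It follows the section's \(\Delta W \approx AB\) ordering, with \(A\) of shape \(d_{out} \times r\) and \(B\) of shape \(r \times d_{in}\); the LoRA paper writes the same update as \(BA\).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pretrained weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))        # d_out x r, zero-init so dW starts at 0
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
        self.scale = alpha / r                   # scaling used in the LoRA paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha / r) * (A B) x, computed without materializing A @ B
        return self.base(x) + self.scale * (x @ self.B.T) @ self.A.T

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs. ~16.8M for the full weight matrix
```

With \(r = 8\) on a 4096-by-4096 weight, the trainable parameter count drops by a factor of roughly 250, which is where the memory savings come from.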
How do LLMs read and process the high-dimensional landscape of text efficiently? Presented as a workshop at UTA's Datathon on April 13, 2024.
Table of Contents
Introduction
Definition
Attention
Key-value Store
Scaled Dot Product Attention
Multi-Head Attention
Encoder-Decoder Architecture
Encoder
Decoder
Usage
Resources

Introduction
The story of Transformers begins with "Attention Is All You Need" (Vaswani et al. 2017). In this seminal work, the authors describe the landscape of sequence models at the time, their shortcomings, and the novel ideas that made their new architecture successful.
Their first point highlights a fundamental flaw in how Recurrent Neural Networks process sequential data: the hidden state at each time step is a function of the previous time step, so computation cannot be parallelized across the sequence. With the hindsight of 2022, when large language models were crossing the trillion-parameter milestone, a model that requires recurrent computation over previous time steps, with no possibility of parallelization, would be virtually intractable to train.
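To make the contrast concrete, here is a small PyTorch sketch (an illustration, not taken from the paper or the workshop): a recurrent layer must step through the sequence one position at a time because each hidden state depends on the previous one, while self-attention relates every position to every other with a single set of matrix products.

```python
import torch
import torch.nn as nn

seq_len, d = 128, 64
x = torch.randn(seq_len, d)          # one sequence of 128 token embeddings

# Recurrent processing: an explicit loop, because h_t depends on h_{t-1}.
cell = nn.RNNCell(d, d)
h = torch.zeros(1, d)
states = []
for t in range(seq_len):
    h = cell(x[t].unsqueeze(0), h)   # step t cannot start until step t-1 finishes
    states.append(h)
rnn_out = torch.cat(states)          # (seq_len, d)

# Self-attention: all positions interact through one set of matrix products.
Wq, Wk, Wv = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.T / d ** 0.5                   # (seq_len, seq_len) similarities
attn_out = torch.softmax(scores, dim=-1) @ v  # every position computed in parallel
```

The loop in the recurrent version is inherently serial, whereas the attention version is a handful of matrix multiplications that map directly onto GPU hardware.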