Key Concepts
Traditional Fine-Tuning
Fine-tuning a model for a specific task can be expensive if all of its weights are updated. LLMs range from billions to trillions of parameters, making full fine-tuning infeasible for many applications.
Low Rank Decomposition
Low-Rank Adaptation (LoRA) is a method that decomposes the weight update matrix \(\Delta W\) into two much smaller matrices \(A\) and \(B\) such that \(\Delta W \approx AB\). The rank \(r\) of the decomposition is a hyperparameter that can be tuned to balance performance and computational cost (Hu et al. 2021).
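To make the shapes concrete, here is a minimal NumPy sketch of the factorization. The values \(d = 512\) and \(r = 8\) are illustrative choices, not from the text; zero-initializing one factor so that \(\Delta W\) starts at zero follows the initialization scheme in the LoRA paper.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the text).
d, r = 512, 8

A = np.zeros((d, r))              # d x r, zero-initialized so delta_W starts at 0
B = np.random.randn(r, d) * 0.01  # r x d, small random values

delta_W = A @ B                   # full d x d update, built from only 2*d*r values
print(delta_W.shape)              # (512, 512)
```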
Efficiency
LoRA is more efficient than full fine-tuning because the rank \(r\) is much smaller than the dimension \(d\) of the weight matrix. This allows for faster, more memory-efficient training, and since the learned update can be merged back into \(W\) after training, inference latency is unchanged. If \(W \in \mathbb{R}^{d \times d}\), then \(A \in \mathbb{R}^{d \times r}\) and \(B \in \mathbb{R}^{r \times d}\), so the decomposition has only \(2dr\) parameters. Compare this to the \(d^2\) parameters in the original weight matrix.
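The savings are easy to quantify. A quick sketch, using \(d = 4096\) (roughly the hidden size of a 7B-parameter model) and \(r = 8\) as illustrative values:

```python
# Parameter-count comparison; d and r are illustrative assumptions.
d, r = 4096, 8

full = d * d      # parameters in the full weight matrix W
lora = 2 * d * r  # parameters in the rank-r factors A and B

print(f"full:  {full:,}")            # full:  16,777,216
print(f"lora:  {lora:,}")            # lora:  65,536
print(f"ratio: {full / lora:.0f}x")  # ratio: 256x
```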
Implementation
When fine-tuning with LoRA, the original weights are frozen. This substantially reduces the memory footprint of training, since gradients (and optimizer states) do not need to be stored for the frozen weights. Only the gradients for the decomposition matrices \(A\) and \(B\) need to be stored.
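A minimal PyTorch sketch of this setup is below. The class name `LoRALinear` and the defaults `r=8` and `alpha=16.0` are illustrative assumptions; the \(\alpha / r\) scaling follows the LoRA paper. The base weights are frozen, so only \(A\) and \(B\) receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen linear layer (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W: no gradients stored
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        d_in, d_out = base.in_features, base.out_features
        # Only A and B are trainable; A is zero-initialized so the update AB
        # starts at zero and training begins from the base model's behavior.
        self.A = nn.Parameter(torch.zeros(d_out, r))
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute W x + scale * (A B) x without materializing the d x d update.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

# Usage: wrap an existing layer; only A and B show up as trainable parameters.
layer = LoRALinear(nn.Linear(512, 512))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # ['A', 'B']
```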