Table of Contents Unsupervised Pre-training From GPT to GPT2 These notes provide an overview of pre-training large language models like GPT and Llama.
Unsupervised Pre-training Let’s start by reviewing the pre-training procedure detailed in the GPT paper (Radford et al. 2020). The Generative in Generative Pre-Training reveals much about how the network can be trained without direct supervision. It is analogous to how you might have studied definitions as a kid: create some flash cards with the term on the front and the definition on the back. Given the context of the word, you try and recite the definition. For a pre-training language model, it is given a series of tokens and is tasked with generating the next token in the sequence. Since we have access to the original documents, we can easily determine if it was correct.