The Language of LLMs

The accompanying Colab notebook is available here.

Large Language Models like ChatGPT have rapidly become ubiquitous tools that enhance productivity, creativity, and even decision-making processes across various domains. Their ability to generate human-like text, comprehend complex instructions, and provide informative responses has captivated the imagination of users worldwide. This paragraph was generated by an LLM (and edited by me).

This workshop is for those who are curious about how these models interpret their input. By the end of this hour, you will hopefully be able to answer the following questions, among others:

  • How do Large Language Models read and process text?
  • Why are LLMs good at complex tasks, yet perform poorly on seemingly simple ones like spelling or arithmetic?
  • How does an LLM understand what it is processing?

Agenda

  1. Tokenization
  2. Unicode byte encodings
  3. Byte Pair Encoding (BPE)
  4. Embeddings

Tokenization

Tokenization is the process of transforming a sequence of characters into a sequence of tokens. A token is a unit of text that we treat as a single entity. For example, in English, a token could be a word, a sentence, or a paragraph. In programming languages, a token could be a variable name, a keyword, or a string.

Before we get started, let’s check out a live demonstration of tokenization.

Consider the input prompt below. It is unlikely that you would mix emoji and code in a single text file, but it serves as a good example for tokenization.

Why does my code 💥 with a segmentation fault?

int main() {
    int *arr = NULL;
    scanf("%d", arr);

    return 0;
}

At the most basic level, how is this text represented in a computer?

These characters are stored using an encoding such as ASCII or Unicode. For the rest of this article, we will assume the input is represented as Unicode.

Unicode Byte Encodings

If we were to print out the Unicode code points of the prompt above, we would get the following:

[10, 87, 104, 121, 32, 100, 111, 101, 115, 32, 109, 121, 32, 99, 111, 100, 101, 32, 128165, ...]

Most of the values in the list above match their ASCII equivalents. The emoji's code point is much larger and is easy to spot.
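These values can be reproduced with a couple of lines of Python using `ord`, which returns the code point of a single character. The `prompt` string below is a shortened stand-in for the full prompt above.

# A sketch of how the values above are produced. The full prompt also
# includes the C snippet; only the first line is used here.
prompt = "\nWhy does my code 💥 with a segmentation fault?"

# ord() returns the Unicode code point of each character.
print([ord(c) for c in prompt])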

Is that it? Is this how the input is fed into the model?

This encoding is done at the character level. What other types of encodings are there?

  • Character encoding
  • Word encoding
  • Sub-word encoding

What is the difference between them? Why would we pick one over another?

Character Encoding

Character encoding converts each character into a unique integer. This is by far the simplest form of tokenization and has the benefit of a compact vocabulary. However, it is not able to effectively compress any common subsequences in the input. This leads to much larger sequences and longer training times.

The biggest downside to this approach is that individual characters are not very informative on a semantic level. For example, the word “cat” would be represented as three separate tokens, ‘c’, ‘a’, and ‘t’. If someone were to present you with a single letter without context, you would likely not be able to understand the point of the message.
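A minimal sketch of character-level encoding, using the shortened `prompt` defined earlier: build a vocabulary from the unique characters and map each one to an integer ID.

# Character-level encoding: one token per character.
vocab = sorted(set(prompt))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

encoded = [char_to_id[ch] for ch in prompt]
decoded = "".join(id_to_char[i] for i in encoded)

print(len(vocab))    # compact vocabulary
print(len(encoded))  # but the sequence is as long as the text itself
assert decoded == prompt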

Word Encoding

Word encoding is a step up from character encoding. This encoding directly captures the semantic meaning of words and is a fine choice for text classification and sentiment analysis. The sequences formed are much shorter since every word can be converted into a unique token.

The downside is that the vocabulary becomes very large, since individual tokens cannot be broken down or recombined in new contexts. Word encoding also struggles with out-of-vocabulary words, since there are no base tokens to build upon.
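A rough sketch of word-level encoding: split on whitespace, assign each unique word an ID, and reserve a special token for anything unseen. The token names here are illustrative.

# Word-level encoding: one token per whitespace-separated word.
word_to_id = {"<unk>": 0}
for w in prompt.split():
    word_to_id.setdefault(w, len(word_to_id))

def encode_words(text):
    # Unseen words fall back to the <unk> token.
    return [word_to_id.get(w, word_to_id["<unk>"]) for w in text.split()]

print(encode_words("Why does my code crash?"))  # "crash?" is out-of-vocabulary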

Sub-word Encoding

Sub-word encoding is a compromise between character and word encoding. It captures much of the semantic meaning of words while still allowing them to be broken down into smaller, reusable tokens. This allows the model to generalize better to unseen words and phrases. Most large language models use sub-word encoding.

Byte Pair Encoding (BPE)

Byte Pair Encoding is a sub-word encoding technique that was originally designed for data compression. It is a simple algorithm that iteratively merges the most frequent pair of bytes in a sequence. This process is repeated until a predefined vocabulary size is reached.

The algorithm is as follows:

  1. Initialize the vocabulary with all the characters in the input.
  2. Count the frequency of all adjacent pairs of tokens in the sequence.
  3. Merge the most frequent pair into a single new token.
  4. Add the merged pair to the vocabulary.
  5. Repeat steps 2-4 until the vocabulary size reaches a predefined limit.
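The merge loop above can be sketched in a few lines of Python, using the shortened `prompt` defined earlier. Here the predefined vocabulary size is expressed as a number of merges, and the loop runs directly on characters; real tokenizers such as GPT-2's operate on UTF-8 bytes and handle many more details.

from collections import Counter

def bpe_train(text, num_merges):
    # Step 1: start with the text as a sequence of single characters.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Step 2: count the frequency of every adjacent pair.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Steps 3-4: merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

# Step 5: repeat until the merge budget (our stand-in for vocabulary size) is used up.
tokens, merges = bpe_train(prompt, num_merges=20)
print(merges[:5])   # the first learned merges
print(tokens[:10])  # the prompt as sub-word tokens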

Embeddings

A word embedding is a learned representation of text in which semantically similar words are mapped to nearby points in the embedding space. Since they are represented as vectors, all vector operations can be applied to them. This allows for the model to learn relationships between words and phrases, quantify their similarities and differences, and encode higher-level context information.

Embeddings can be learned independently or jointly with the model. For example, the Word2Vec model learns embeddings using an unsupervised approach: it predicts a word from its surrounding context (CBOW), or the surrounding context from the word itself (skip-gram). The embeddings are then used as input to a downstream task (Mikolov et al. 2013).
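As an illustration (the accompanying notebook does not train Word2Vec itself), the gensim library can fit a small skip-gram model on a toy corpus:

from gensim.models import Word2Vec

# A toy corpus; real embeddings need far more data than this.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# sg=1 selects the skip-gram objective: predict surrounding words from a word.
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1)
print(model.wv.most_similar("cat", topn=3))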

LLMs typically train embeddings jointly with the model. This allows them to learn embeddings for sentences, paragraphs, or even whole documents.

Creating an embedding layer

We can use libraries such as PyTorch to create a learnable embedding layer. The code below creates an embedding layer that maps each individual token to a `1024`-dimensional embedding vector.

import torch
import torch.nn as nn

# vocab_size and encode() are assumed to be defined earlier in the notebook.
token_embedding = nn.Embedding(vocab_size, 1024)

# Look up a 1024-dimensional vector for every token in the encoded prompt.
prompt_embedded = token_embedding(torch.LongTensor(encode(prompt)))
print(prompt_embedded.shape)  # (sequence_length, 1024)

Training the embeddings

Our corpus of a single C file is far too small to learn anything meaningful. Learning an embedding space requires a lot of data and compute. We can instead look at pre-trained embeddings. Hugging Face has a large collection of pre-trained models that can be used for a variety of tasks. The accompanying notebook uses embeddings from SentenceTransformer to demonstrate how embeddings can be used in practice.

Sentence Embeddings

To demonstrate the power of embeddings, we will close out the workshop by reviewing sentence embeddings. BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019) are language models that can be used for tasks such as semantic textual similarity. However, both require that each pair of sentences be fed through the network together, which makes comparing sentences across a large collection very expensive.

Sentence-BERT proposed an architecture that would embed these into meaningful embeddings that could be easily compared with vector operations (Reimers and Gurevych 2019).

In the cells below, we will use Hugging Face (huggingface.co) to download and run a pre-trained sentence transformer. This particular model was trained on 1,170,060,424 sentence pairs.
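A sketch of the kind of comparison the notebook performs; the model name here is an assumption (all-MiniLM-L6-v2 is one widely used pre-trained sentence transformer).

from sentence_transformers import SentenceTransformer, util

# The exact model used in the notebook is an assumption;
# all-MiniLM-L6-v2 is a widely used pre-trained sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Why does my code crash with a segmentation fault?",
    "My program dies with a segfault when I run it.",
    "What is the best recipe for banana bread?",
]

embeddings = model.encode(sentences)

# Cosine similarity: semantically similar sentences score closer to 1.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))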

References

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv. https://doi.org/10.48550/arXiv.1810.04805.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv. https://doi.org/10.48550/arXiv.1907.11692.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv. http://arxiv.org/abs/1301.3781.
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” arXiv. https://doi.org/10.48550/arXiv.1908.10084.