LLM Zero to Hero: From Text to Tokens
A beginner's guide to transforming raw text into something an LLM can understand.
This is the first post in a series I am calling “LLM Zero to Hero”. Why am I starting this series? Three reasons.
First, there are many tutorials out there on how to fine-tune existing open-source LLM models, but not as many on building a foundational LLM from scratch. Fine-tuning APIs and services will continue to add layers of abstraction and sensible defaults. Eventually, you’ll be able to fine-tune a model without really knowing much about LLMs, Transformers, or even ML, and it will be relatively hard to screw it up.
The second reason is that I recently gave an internal talk at Shopify on this exact topic, and it was well received.
The last reason is that I firmly believe there is no better way to make sure you understand a subject than to write about it.
So to kick things off, in this first post we'll explore the fundamental steps of preparing text data for language models. We'll cover the basics of text processing, tokenization, and creating embeddings. This knowledge forms the foundation for understanding how models like GPT-4 process and generate text.
Overview
Computers can’t understand raw text. They can, however, understand numbers. If you can figure out a way to convert some object—be it a piece of text, an image, a DNA sequence, or a molecular structure—into a list of numbers (vector) that represents the object, then the world is your oyster: there is likely some ML model architecture you can use to do some neat things.
When training a language model, raw text needs to be converted into a format the model can process. This involves several key steps:
Tokenizing the text
Converting tokens to numerical IDs
Creating embeddings of these tokens
Adding positional information to the embeddings
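To make those four steps concrete, here's a quick end-to-end preview in Python (a minimal sketch assuming PyTorch and tiktoken are installed; the specific sizes, like d_model = 768, are arbitrary choices for illustration):

```python
import tiktoken
import torch
import torch.nn as nn

text = "Hello, world!"

# 1 & 2. Tokenize the text and convert the tokens to numerical IDs
enc = tiktoken.get_encoding("gpt2")
token_ids = torch.tensor(enc.encode(text))      # a 1-D tensor of integer token IDs

# 3. Create embeddings for these tokens (a learnable lookup table)
d_model = 768
token_embedding = nn.Embedding(enc.n_vocab, d_model)
token_embeddings = token_embedding(token_ids)   # shape: (num_tokens, d_model)

# 4. Add positional information (here via a learned positional embedding, for brevity)
pos_embedding = nn.Embedding(1024, d_model)
positions = torch.arange(token_ids.shape[0])
model_input = token_embeddings + pos_embedding(positions)
```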
Let's examine each of these steps in more detail.
Tokenizing Text
Transformers, the model architecture that powers LLMs, don’t inherently know how to handle raw text. That's where tokenization comes into play. Tokenization breaks text into smaller units (tokens), which could be words, subwords, or characters. There are many different ways to perform tokenization, but modern language models tend to use Byte Pair Encoding (BPE) for tokenization.
How BPE Works
BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pair of consecutive bytes or characters. Here's a simplified step-by-step process:
Start with a vocabulary of individual characters.
Count the frequency of each pair of adjacent tokens in the text.
Merge the most frequent pair and add it to the vocabulary.
Repeat steps 2-3 for a fixed number of merges or until the desired vocabulary size is reached.
Let's look at a simple example:
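The snippet below is a toy, illustrative implementation of the count-and-merge loop on a made-up five-word corpus; real tokenizers are far more optimized, but the core idea is the same:

```python
from collections import Counter

corpus = ["low", "lower", "lowest", "new", "newer"]

# Step 1: start with a vocabulary of individual characters;
# each word is represented as a tuple of single-character tokens.
words = [tuple(word) for word in corpus]

num_merges = 5
for _ in range(num_merges):
    # Step 2: count the frequency of each pair of adjacent tokens.
    pair_counts = Counter()
    for word in words:
        for pair in zip(word, word[1:]):
            pair_counts[pair] += 1
    if not pair_counts:
        break

    # Step 3: merge the most frequent pair into a single new token.
    best = max(pair_counts, key=pair_counts.get)
    merged = "".join(best)
    new_words = []
    for word in words:
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_words.append(tuple(new_word))
    words = new_words
    print(f"merged {best} -> {merged!r}")
```

After a handful of merges, frequent fragments such as "lo" and "low" become single tokens in the vocabulary, which is exactly what lets BPE represent common words compactly while still being able to fall back to characters for rare ones.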
Benefits of BPE
Handling unknown words: BPE can tokenize words it hasn't seen during training by breaking them into subwords.
Efficiency: BPE finds a balance between character-level (too granular) and word-level (too sparse) tokenization.
Multilingual support: BPE works well across different languages, even those with different writing systems.
BPE in Practice
In practice, implementations like GPT-2's tokenizer use a variant of BPE that operates on bytes rather than Unicode characters. This allows it to encode any string of bytes without "unknown" tokens.
Below is an example using the widely used tiktoken library. Notice how the input texts contain a mixture of emojis, English, and Chinese characters.
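A minimal sketch, assuming the tiktoken package is installed and using the cl100k_base encoding (the one GPT-4 uses):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

texts = [
    "Hello, world! 👋",
    "语言模型很有趣",  # "Language models are fun" in Chinese
]

for text in texts:
    token_ids = enc.encode(text)
    print(f"{text!r} -> {len(token_ids)} tokens: {token_ids}")
    # Byte-level BPE round-trips losslessly: decoding recovers the original string.
    assert enc.decode(token_ids) == text
```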
Considerations
While BPE is powerful, it's not perfect:
It can sometimes create unintuitive subword splits.
The vocabulary size is a hyperparameter that needs tuning.
It doesn't inherently understand the semantics of the subwords it creates.
Embeddings
After tokenization, we need to convert the token IDs into embeddings. Embeddings are dense vector representations of tokens that capture semantic meaning in a way models can process.
Here's an example of creating an embedding layer:
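A minimal sketch using PyTorch's nn.Embedding; the vocabulary size and embedding dimension below match GPT-2 small, but they are just illustrative choices:

```python
import torch
import torch.nn as nn

vocab_size = 50257  # size of the tokenizer's vocabulary
d_model = 768       # dimensionality of each token embedding

# An embedding layer is a learnable lookup table: one d_model-dimensional
# vector per token ID, trained along with the rest of the model.
token_embedding = nn.Embedding(vocab_size, d_model)

# A batch containing one sequence of four token IDs from the tokenizer
token_ids = torch.tensor([[15496, 11, 995, 0]])

embeddings = token_embedding(token_ids)
print(embeddings.shape)  # torch.Size([1, 4, 768])
```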
Adding Positional Information
Positional encoding is crucial in Transformer models because the self-attention mechanism is permutation-invariant. This means the model needs additional information to understand the order of tokens in a sequence.
Sinusoidal Positional Encoding
The original Transformer uses a clever trigonometric formula to encode position:
For each dimension i of the encoding and position pos:
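PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))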
Where:
pos is the position of the token in the sequence
i indexes the dimension pairs of the encoding (each i yields one sine and one cosine dimension)
d_model is the dimensionality of the model's embeddings
In Python, that looks like:
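The sketch below uses NumPy and matplotlib; the sequence length and embedding dimension are arbitrary choices for the visualization:

```python
import numpy as np
import matplotlib.pyplot as plt

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    div_term = np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_term)             # even dimensions
    pe[:, 1::2] = np.cos(positions / div_term)             # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=100, d_model=128)

plt.figure(figsize=(10, 6))
plt.pcolormesh(pe, cmap="RdBu")
plt.xlabel("Embedding dimension")
plt.ylabel("Position in sequence")
plt.colorbar(label="Encoding value")
plt.title("Sinusoidal positional encoding")
plt.show()
```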
Which produces the following chart:
Properties of Sinusoidal Positional Encoding
Fixed encoding: It doesn't require learning, reducing model parameters.
Unique for each position: Each position gets a unique encoding.
Bounded: Values are between -1 and 1, matching typical embedding scales.
Consistency across sequence lengths: It can be computed for any sequence length.
Relative position information: The model can easily learn to attend to relative positions because for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
In practice, the code would look something like:
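Here's a sketch in PyTorch, following the common pattern of precomputing the encodings once, storing them as a non-trainable buffer, and adding them to the token embeddings (the sizes are illustrative):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to token embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        # 10000^(-2i / d_model) for each pair of dimensions
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer is saved with the model but is not a trainable parameter.
        self.register_buffer("pe", pe.unsqueeze(0))                   # (1, max_len, d_model)

    def forward(self, x):
        # x: token embeddings of shape (batch, seq_len, d_model)
        return x + self.pe[:, : x.size(1)]

vocab_size, d_model = 50257, 768
token_embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = PositionalEncoding(d_model)

token_ids = torch.tensor([[15496, 11, 995, 0]])   # a batch with one short sequence
model_input = pos_encoding(token_embedding(token_ids))
print(model_input.shape)  # torch.Size([1, 4, 768])
```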
Alternatives to Sinusoidal Encoding
While sinusoidal encoding is widely used, there are alternatives:
Learned positional embeddings: Some models learn the positional embeddings during training.
Relative positional encoding: Instead of absolute positions, encode relative distances between tokens.
Rotary Position Embedding (RoPE): A method that encodes position by rotating vector embeddings.
Each method has its trade-offs in terms of performance, generalization, and computational efficiency. Most modern LLM architectures use RoPE embeddings. The concept is really cool, but also tricky to explain in an intuitive manner. Perhaps I’ll take a stab at it in a future blog post!
And that's how we go from text to tokens :)