ML 101 - How LLMs Generate Text
Understanding LLMs Using a Kitchen Analogy
Imagine you're running a high-end restaurant, where every dish served is an AI-generated response.
- The chef represents the AI model (LLM).
- The ingredients are the tokens (words, subwords).
- The recipe book represents the model's training data.
- The cooking process is the text generation pipeline.
Let's go step by step and see how the kitchen (LLM) operates.
1. Tokenization = Preparing Ingredients
Before cooking begins, a chef prepares ingredients by cutting vegetables and measuring the spices needed for a dish. LLMs follow a similar process: they break down text into smaller units called tokens.
Example:
- Input sentence:
"We're revolutionizing grocery shopping."
- Chopping process: Breaking it into ingredients (tokens):
["We're", "revolutionizing", "grocery", "shopping"]
- Each token gets a numeric ID, like labeling ingredients with barcodes for inventory.
Just like a chef doesn't use whole vegetables but cuts them into usable pieces, LLMs split text into tokens for efficient processing.
2. Logits & Softmax = Deciding the Next Ingredient Based on Taste
A chef doesn't just throw in random spices β they taste the dish and decide which ingredient will make it better. Similarly, LLMs predict the most likely next token.
- The chef has a list of possible next ingredients (logits) ranked by suitability.
- They smell, taste, and evaluate (Softmax) to choose the best one.
Example: The chef considers:
- Salt (low probability) → Might overpower the dish.
- Garlic (medium probability) → Could add depth.
- Basil (high probability) → Complements the dish well.
After turning these rankings into probabilities (Softmax) and sampling from them, the chef selects Basil because it enhances the dish.
Just as a chef refines flavors with seasoning, LLMs select the next word based on learned probabilities.
3. Autoregressive Generation = Cooking Step-by-Step
A chef doesn't prepare a dish by mixing everything at once. Instead, they add ingredients in a specific order, letting each step build upon the last.
- The chef starts with a base (prompt):
  "We're revolutionizing grocery shopping."
- They follow a structured cooking process, adding one ingredient at a time:
  - Add "with" → Stir well.
    - Current dish: "We're revolutionizing grocery shopping with"
  - Add "same-day" → Let it cook.
    - Current dish: "We're revolutionizing grocery shopping with same-day"
  - Add "delivery" → Adjust seasoning.
    - Current dish: "We're revolutionizing grocery shopping with same-day delivery"
  - Add "powered by AI" → Final plating.
    - Current dish: "We're revolutionizing grocery shopping with same-day delivery powered by AI"
- The dish is complete when a special stop signal appears (like the [EOS] token in AI).
4. Transformer Architecture
A well-organized kitchen doesn't rely on a single chef doing everything. Instead, it follows a highly efficient workflow, where different kitchen staff members specialize in different tasks. This is exactly how the Transformer model works: processing information in parallel rather than sequentially.
4.1 Self-Attention = A Waiter Managing Multiple Tables at Once
A great waiter doesn't just focus on one table at a time. Instead, they keep track of all their tables, making sure each one gets the right service at the right time. Similarly, self-attention allows an LLM to analyze all words in a sentence at once, rather than just looking at the last word.
Example: Suppose a waiter is serving multiple tables:
- Table 1 orders an appetizer.
- Table 2 asks for a drink refill.
- Table 3 needs the check.
Instead of serving one table at a time, a skilled waiter manages all tables simultaneously, prioritizing based on urgency.
Self-attention ensures the model doesn't just focus on the last word but considers the entire sentence at once.
4.2 Positional Encoding = The Right Order of Courses
In a restaurant, a meal has a specific order: you wouldn't serve dessert before the main course. Similarly, LLMs use positional encoding to keep track of word order, ensuring that sentences are structured correctly.
Correct order:
- Serve the appetizer.
- Bring the main course.
- Deliver dessert.
Wrong order:
- Deliver dessert first.
- Bring the main course.
- Serve the appetizer.
Even though the same dishes are served, the experience is completely wrong if the order is mixed up. Similarly, an AI model ensures the correct sentence structure by encoding word positions.
4.3 Feedforward Layers = Final Presentation
Before a dish is served to the customer, it goes through a final quality check: the chef ensures the presentation is perfect, adds final garnishes, and makes sure the seasoning is balanced. Similarly, feedforward layers refine token embeddings, making sure the model's predictions are polished and well-formed.
Example: A chef checks a dish before serving:
- Too bland? → Add a final sprinkle of seasoning.
- Messy plating? → Rearrange for better presentation.
- Overcooked steak? → Adjust for future orders.
This last step ensures the final dish meets high standards. Similarly, a Transformer's feedforward layers refine the model's predictions before finalizing the output.
Technical Explanation
Picture this: You're at a startup demo day in the Bay Area. A founder steps onstage and says: "We're revolutionizing grocery shopping."
An AI assistant predicts the next phrase: "β¦with same-day delivery powered by AI."
How does the AI come up with this? LLMs generate text by tokenizing input, predicting the next token using probabilities derived from logits, and iteratively building sentences using the Transformer architecture. In summary, the process involves:
- Tokenize the prompt:
  - Input: "We're revolutionizing grocery shopping."
  - Tokens: ["We're", "revolutionizing", "grocery", "shopping"]
  - Token IDs: [101, 202, 303, 404]
- Add positional encodings:
  - Each token embedding is combined with positional information to retain word order.
- Pass through Transformer layers:
- The embeddings undergo self-attention and feedforward transformations, enabling the model to understand the context and relationships between tokens.
- Get logits and apply Softmax:
  - The model calculates logits for the next token, e.g.:
    - Logits: [1.2, 2.8, 0.9, 3.0, …]
    - Softmax: [0.05, 0.15, 0.02, 0.20, …]
  - The token with the highest probability is selected: "with."
- Generate final output:
- The model predicts tokens one by one: "with" → "same-day" → "delivery" → "powered" → "by AI."
The steps above in detail:
1. Tokenization: Splitting Text into Pieces
LLMs like GPT don't process raw text directly. Instead, they split the text into smaller units called tokens. These tokens can be whole words, parts of words, or even individual characters, depending on the model's tokenizer.
- Original text: "We're revolutionizing grocery shopping."
- Tokens: ["We're", "revolutionizing", "grocery", "shopping"]
Next, each token is assigned a numeric ID to enable mathematical processing by the model:
- Token IDs: [101, 202, 303, 404]
These IDs correspond to entries in the model's vocabulary, a giant list of all the tokens it has learned during training.
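To make this concrete, here is a minimal sketch using the GPT-2 tokenizer from Hugging Face. The actual subword tokens and IDs will differ from the illustrative values above, since every tokenizer splits text in its own way.

```python
# Minimal tokenization sketch with the GPT-2 tokenizer from Hugging Face.
# Real subwords and IDs will differ from the illustrative [101, 202, 303, 404] above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "We're revolutionizing grocery shopping."
tokens = tokenizer.tokenize(text)      # subword pieces (GPT-2 marks leading spaces with "Ġ")
token_ids = tokenizer.encode(text)     # numeric IDs that index into the vocabulary

print(tokens)
print(token_ids)
```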
2. Logits & Softmax: Turning Scores into Probabilities
Once the tokens are processed, the model predicts the next token by calculating logits. Logits are raw scores indicating how likely each token in the vocabulary (the full list of tokens the model learned during training, often 100,000+ entries) is to be the next word.
For example, if the model has a vocabulary of 100,000 tokens, it might output scores like:
- Logits:
[1.2, 2.8, 0.9, 3.0, β¦]
These raw scores aren't directly interpretable as probabilities. To convert them, the model uses Softmax, a mathematical function that normalizes these logits so they sum to 1.
After applying Softmax, the scores might look like this:
- Softmax:
[0.05, 0.15, 0.02, 0.20, β¦]
These probabilities represent the likelihood of each candidate token (e.g., "with", "same-day", "delivery") being the next one in the sequence.
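Here is a small sketch of the Softmax step itself, using a handful of made-up logits. A real model produces one logit per vocabulary entry, which is why the probabilities shown above (a slice of a much larger distribution) don't sum to 1 on their own.

```python
import numpy as np

# Four illustrative logits; a real model emits one score per vocabulary token (100,000+).
logits = np.array([1.2, 2.8, 0.9, 3.0])

# Softmax: exponentiate, then normalize so everything sums to 1.
# Subtracting the max first is a standard numerical-stability trick.
exp_scores = np.exp(logits - logits.max())
probs = exp_scores / exp_scores.sum()

print(probs.round(2))   # roughly [0.08, 0.39, 0.06, 0.47] over just these four candidates
print(probs.sum())      # 1.0 (up to floating-point rounding)
```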
3. Autoregressive Generation: Building the Sentence
Once the model has calculated the logits and applied Softmax, it predicts the next token based on the context of previous tokens.
- Start with the prompt:
  "We're revolutionizing grocery shopping."
- Predict the next token:
  The model evaluates all possible tokens and predicts the most likely next token: "with."
- Update the sequence:
  The prompt becomes: "We're revolutionizing grocery shopping with."
- Repeat the process:
  - The model predicts "same-day", appending it to the sequence:
    "We're revolutionizing grocery shopping with same-day."
  - Then "delivery":
    "We're revolutionizing grocery shopping with same-day delivery."
  - Then "powered":
    "We're revolutionizing grocery shopping with same-day delivery powered."
  - Finally "by AI":
    "We're revolutionizing grocery shopping with same-day delivery powered by AI."
- End the sequence:
  The model stops generating when it predicts a special end-of-sequence token ([EOS]).
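Here is a hedged sketch of that loop, implemented as greedy decoding with GPT-2 from Hugging Face transformers. Greedy selection is only one decoding strategy, and this small public model will not necessarily reproduce the continuation used in the running example.

```python
# Greedy, token-by-token generation: predict, append, repeat until [EOS] or a length cap.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("We're revolutionizing grocery shopping", return_tensors="pt")

for _ in range(10):                                    # generate at most 10 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits               # shape: (1, seq_len, vocab_size)
    next_id = torch.argmax(logits[0, -1]).item()       # greedy: take the highest-scoring token
    if next_id == tokenizer.eos_token_id:              # stop if the end-of-sequence token wins
        break
    input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)

print(tokenizer.decode(input_ids[0]))
```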
3.1 How Is [EOS] Predicted?
Most language models are trained with a vocabulary that includes a special token, typically called [EOS] (End of Sequence), which signals the completion of the text. Here's how it works:
During training, the model learns to predict [EOS] in contexts where text logically ends. For example, if the training data contains sentences like:
- "This is a complete sentence." [EOS]
- "Another example of a sentence." [EOS]
The model is trained to associate [EOS] with the point where text typically stops. This makes [EOS] part of its vocabulary, just like any other token.
At each step of generation, [EOS] is one of the potential tokens the model can select. When the probability of [EOS] becomes the highest (or it is sampled probabilistically), the model stops generating further tokens.
Example: After generating the phrase "We're revolutionizing grocery shopping with same-day delivery powered by AI", the model:
- Calculates probabilities for all possible next tokens, including [EOS].
- If [EOS] has the highest probability, the sequence ends.
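In practice, [EOS] really is just another vocabulary entry. For GPT-2, for instance, it is spelled <|endoftext|> and has its own numeric ID:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The end-of-sequence marker is an ordinary vocabulary entry with its own ID.
print(tokenizer.eos_token)      # '<|endoftext|>'
print(tokenizer.eos_token_id)   # 50256
```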
4. Transformer Architecture: The Engine Behind the Scenes
The Transformer architecture is the foundation of large language models. It uses several key mechanisms to process and generate text. Here's how it works with data examples:
4.1 Self-Attention: Understanding Context
Self-attention allows the model to determine which tokens in the input are most important for understanding a given token.
How It Works:
Each token generates three vectors:
- Query (Q): What this token is looking for
- Key (K): What this token offers for other tokens to match against
- Value (V): The actual content this token contributes
Think of attention scores like a student ("grocery") asking questions to their classmates. The student's question is the Query, and each classmate has their own expertise (Key) to offer. The attention score shows how relevant each classmate's knowledge is to the student's question.
The attention score is calculated by comparing how well the question (Query) matches with each classmate's expertise (Key), similar to how well students' study interests align. Just like in a classroom, we normalize these scores so the student divides their attention (100%) among their classmates.
Example:
For the sentence "We're revolutionizing grocery shopping":
When processing "grocery", the model asks: "Who should I pay attention to?" It looks at each word and calculates how relevant they are:
Attention Scores (before normalization):
- "We're" β 0.1 (Not very relevant)
- "revolutionizing" β 0.4 (Somewhat relevant)
- "shopping" β 0.8 (Highly relevant)
Softmax Normalization:
Just like a student can't give 200% attention, we normalize these scores to probabilities:
- "We're" β 0.08 (8% attention)
- "revolutionizing" β 0.25 (25% attention)
- "shopping" β 0.67 (67% attention)
The token "grocery" focuses most of its attention on "shopping" because they're naturally related in this context, just like how a student might pay more attention to a classmate discussing the same topic.
4.2 Positional Encoding: Keeping Word Order Intact
Think of positional encoding like giving each student a numbered badge to wear. Even if all students arrive at once, their badges tell us their correct order in line. Each student wears both their regular nametag (token embedding) and this numbered badge (positional encoding), so we always know both who they are and where they belong in line.
Example:
For the sentence "We're revolutionizing grocery shopping":
Token embeddings (like student nametags showing their identity):
- "We're" β [0.12, 0.45, 0.78] (their unique identity vector)
- "revolutionizing" β [0.34, 0.67, 0.11] (their unique identity vector)
- "grocery" β [0.56, 0.89, 0.32] (their unique identity vector)
- "shopping" β [0.23, 0.44, 0.91] (their unique identity vector)
Positional encodings (like numbered badges showing position in line):
- Position 1 → [0.01, 0.02, 0.03] (first position encoding)
- Position 2 → [0.02, 0.03, 0.04] (second position encoding)
- Position 3 → [0.03, 0.04, 0.05] (third position encoding)
- Position 4 → [0.04, 0.05, 0.06] (fourth position encoding)
Token embeddings with positional encoding (students wearing both nametag and badge):
- "We're" β [0.13, 0.47, 0.81] (identity + first position)
- "revolutionizing" β [0.36, 0.70, 0.15] (identity + second position)
- "grocery" β [0.59, 0.93, 0.37] (identity + third position)
- "shopping" β [0.27, 0.49, 0.97] (identity + fourth position)
This ensures the model knows both what each token means and where it sits in the sequence, e.g. that "We're" comes before "revolutionizing" and "grocery" comes before "shopping".
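A tiny sketch of this addition, reusing the made-up vectors from the example (real models typically use learned position embeddings or sinusoidal encodings in hundreds of dimensions):

```python
import numpy as np

# Made-up 3-dimensional token embeddings ("nametags") from the example.
token_embeddings = np.array([
    [0.12, 0.45, 0.78],   # "We're"
    [0.34, 0.67, 0.11],   # "revolutionizing"
    [0.56, 0.89, 0.32],   # "grocery"
    [0.23, 0.44, 0.91],   # "shopping"
])

# Made-up positional encodings ("numbered badges"), one per position.
positional_encodings = np.array([
    [0.01, 0.02, 0.03],   # position 1
    [0.02, 0.03, 0.04],   # position 2
    [0.03, 0.04, 0.05],   # position 3
    [0.04, 0.05, 0.06],   # position 4
])

# Element-wise addition gives vectors that carry both identity and position.
combined = token_embeddings + positional_encodings
print(combined)   # matches the "nametag + badge" vectors listed above
```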
4.3 Feedforward Layers: Refining Predictions
Think of feedforward layers like a student revising their notes after a class discussion. Each revision helps the student better understand how their topic connects to the overall lesson. The initial notes get refined through multiple passes, each making the understanding clearer and more complete.
Example:
Token embedding for "revolutionizing" (after attention and positional encoding):
Input embedding: [0.36, 0.70, 0.15] (initial understanding)
The feedforward neural network applies transformations:
- Linear Layer 1: Multiplies by a weight matrix and adds a bias (first revision)
- Activation Function: Applies non-linearity (ReLU) (highlighting key points)
- Linear Layer 2: Refines the representation further (final polish)
Output embedding: [0.42, 0.85, 0.28] (refined understanding)
This embedding now captures more abstract features, like how "revolutionizing" fits into the broader context of the sentence.
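A toy sketch of that two-layer block follows. The weights here are random, so the printed vector will not literally be [0.42, 0.85, 0.28]; that value is illustrative. Real Transformer feedforward layers also use a much wider hidden dimension (often 4x the embedding size).

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)   # the non-linearity between the two linear layers

# Made-up input: the embedding of "revolutionizing" after attention + positional encoding.
x = np.array([0.36, 0.70, 0.15])

# Random toy weights standing in for the learned parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)   # Linear Layer 1: expand 3 -> 8 dimensions
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # Linear Layer 2: project back 8 -> 3

hidden = relu(x @ W1 + b1)    # first transformation plus ReLU activation
output = hidden @ W2 + b2     # refined embedding passed on to the next layer

print(output.round(2))        # values depend on the random weights
```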
Bringing It All Together
For the input: "We're revolutionizing grocery shopping with..."
- Self-Attention: The model determines that "grocery" relates strongly to "shopping" and "revolutionizing" relates to "We're".
- Positional Encoding: The model knows the sequence of words, ensuring it doesn't mix up "grocery shopping" with "shopping grocery".
- Feedforward Layers: Each token's embedding is refined, enabling the model to predict the next token.
Using these mechanisms, the model generates the continuation: "We're revolutionizing grocery shopping with same-day delivery powered by AI."
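For completeness, here is a short end-to-end sketch using the Hugging Face generate API with GPT-2 as a stand-in model. Its continuation will likely differ from the "same-day delivery powered by AI" example, but it exercises every step described above: tokenization, Transformer layers, logits and Softmax, and autoregressive decoding until [EOS] or a length cap.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "We're revolutionizing grocery shopping"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# generate() runs the whole pipeline: Transformer forward passes, logits -> Softmax,
# and token-by-token decoding until [EOS] or max_new_tokens is reached.
output_ids = model.generate(
    input_ids,
    max_new_tokens=12,                     # cap the number of generated tokens
    do_sample=False,                       # greedy decoding for a deterministic result
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token, so reuse [EOS]
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```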