GenAI Transformers and Attention
The Transformer is the architecture that powers virtually every modern large language model. Introduced in a 2017 research paper titled "Attention Is All You Need," the Transformer replaced older sequential models and made it possible to train AI on text at a massive scale. Understanding how it works reveals the engine behind all of today's generative AI systems.
The Problem Transformers Solved
Before Transformers, models processed text word by word in sequence — like reading a sentence one word at a time from left to right. This approach had a major weakness: the model often forgot earlier context by the time it reached the end of a long sentence.
Example of the problem:
"The trophy did not fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? → A sequential model struggling with long-distance connections would often get this wrong.
Transformers solve this by looking at all words at the same time, rather than one at a time. This is what makes them both faster and more capable of understanding context across long distances in text.
What Is Attention?
Attention is the core mechanism inside a Transformer. It allows the model to decide how much focus to place on each word in a sentence when generating or understanding any given word.
Think of it as a spotlight. When the model is processing the word "it" in a sentence, the attention mechanism shines a spotlight on all other words and asks: which ones are most relevant to understanding what "it" means?
Sentence: "The trophy did not fit in the suitcase because it was too big." When processing "it": ───────────────────────────────────────────────────── Word | Attention Weight (simplified) ───────────────────────────────────────────────────── "The" | 0.01 (low — not relevant) "trophy" | 0.72 (high — likely the referent) "fit" | 0.08 (low) "suitcase" | 0.14 (medium — related but less likely) "big" | 0.05 (low) ─────────────────────────────────────────────────────
By assigning different weights to different words, the model builds a richer understanding of meaning and context.
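The weights in a table like the one above come from a softmax, which turns raw relevance scores into positive weights that sum to 1. A minimal sketch (the scores below are made-up numbers chosen only to roughly reproduce the illustrative weights above):

```python
import math

def softmax(scores):
    """Convert raw relevance scores into weights between 0 and 1 that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw relevance scores for each word when processing "it"
words = ["The", "trophy", "fit", "suitcase", "big"]
scores = [-1.5, 2.8, 0.6, 1.1, 0.2]

weights = softmax(scores)
for word, w in zip(words, weights):
    print(f"{word:>8}: {w:.2f}")
```

Because softmax exaggerates differences between scores, "trophy" ends up with most of the weight even though its raw score is only moderately higher than the others.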
Self-Attention Explained Simply
Self-attention means the model attends to all parts of the same input sentence at once. Each word looks at every other word and calculates a relevance score.
To calculate attention, each word is turned into three vectors:
- Query (Q): What this word is looking for
- Key (K): What this word offers to others
- Value (V): The actual information this word carries
Attention Score Calculation
──────────────────────────────────────────
For each word:
  Score = Query of current word × Key of each other word
          ↓
  Apply softmax to turn scores into weights (0 to 1)
          ↓
  Multiply weights by Values
          ↓
  Sum them up → Final context-aware representation of the word
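The steps above can be sketched in plain Python. This is a toy version of scaled dot-product attention (the Q, K, and V vectors here are arbitrary 2-dimensional examples; a real model would learn them and use hundreds of dimensions):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over a short sequence.
    queries/keys/values: lists of vectors, one per word."""
    d_k = len(keys[0])  # key dimension, used to scale the scores
    outputs = []
    for q in queries:
        # 1. Score: dot product of this word's Query with every word's Key
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # 2. Softmax turns scores into weights between 0 and 1
        weights = softmax(scores)
        # 3. Weighted sum of Values -> context-aware representation
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 3-word sequence with 2-dimensional Q/K/V vectors
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(self_attention(Q, K, V))
```

Each output vector is a weighted blend of all the Value vectors, which is exactly what "context-aware representation" means: every word's representation now contains information from every other word.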
Multi-Head Attention
A single attention pass looks at relationships from one perspective. Multi-head attention runs several attention processes in parallel — each one focusing on different kinds of relationships in the text.
Multi-Head Attention (simplified)
──────────────────────────────────────────────────────
Input sentence → Split into N attention "heads"
Head 1: focuses on grammatical relationships
Head 2: focuses on subject-verb connection
Head 3: focuses on pronoun references
Head N: focuses on topic relevance
...
↓
Combine outputs from all heads
↓
Richer, multi-perspective representation
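The split-and-combine structure can be sketched as follows. This only shows the bookkeeping — in a real model, each head would run its own attention over its slice, and each head learns on its own what kind of relationship to focus on:

```python
def split_heads(vector, num_heads):
    """Split one embedding vector into equal slices, one per head."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

def combine_heads(head_outputs):
    """Concatenate the per-head outputs back into one vector."""
    return [x for head in head_outputs for x in head]

embedding = [0.1, 0.4, -0.2, 0.7, 0.3, -0.5, 0.9, 0.0]  # toy 8-dim embedding
heads = split_heads(embedding, num_heads=4)  # 4 heads, each seeing 2 dimensions
# ... each head would now run attention over its own slice ...
combined = combine_heads(heads)
assert combined == embedding
```

Because the heads operate on separate slices in parallel, adding heads gives the model multiple "perspectives" at little extra sequential cost.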
The Full Transformer Architecture
The original Transformer has two main parts: an Encoder and a Decoder. Modern LLMs typically use only the Decoder portion.
ENCODER (understands input) DECODER (generates output)
───────────────────────────── ──────────────────────────────
Input Tokens Output Tokens (so far)
│ │
Embedding Layer Embedding Layer
│ │
Multi-Head Self-Attention Masked Self-Attention
│ │
Feed-Forward Network Cross-Attention (attends to encoder)
│ │
(Stack N times) Feed-Forward Network
│ │
Encoder Output ─────────────────▶ (Stack N times)
│
Final Output Tokens
In decoder-only models like GPT, the decoder generates text autoregressively — one token at a time, each new token dependent on all previous tokens.
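The autoregressive loop can be sketched like this. The `next_token` function below is a hypothetical stand-in for the model (a hard-coded lookup table, not a real predictor) — the point is the loop: each step feeds the entire sequence so far back in to choose the next token:

```python
def next_token(context):
    """Stand-in for the model: a hypothetical lookup that picks the
    next token given everything generated so far."""
    table = {
        ("The",): "trophy",
        ("The", "trophy"): "was",
        ("The", "trophy", "was"): "big",
    }
    return table.get(tuple(context), "<eos>")

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)  # each step conditions on ALL prior tokens
        if tok == "<eos>":        # stop when the model signals end-of-sequence
            break
        tokens.append(tok)
    return tokens

print(generate(["The"]))  # ['The', 'trophy', 'was', 'big']
```

Real decoders replace the lookup with a full Transformer forward pass and sample from a probability distribution, but the one-token-at-a-time structure is the same.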
Positional Encoding — Teaching the Model About Word Order
Since attention looks at all words simultaneously, the model would not naturally know which word comes first or last. Positional encoding adds information about each word's position in the sequence.
Without positional encoding:
  "Dog bites man" = "Man bites dog"
  (same words, different meaning — model can't tell)

With positional encoding:
  Each word gets its position tagged: "Dog[1] bites[2] man[3]"
  — model knows the order matters
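The original Transformer paper used sinusoidal positional encodings: each position gets a unique vector of sine and cosine values at different frequencies, which is added to the word's embedding. A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding from the original Transformer paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pe = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Each position produces a distinct vector, so after adding it to the
# word embeddings, "Dog bites man" and "Man bites dog" look different.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```

Many modern models use learned or rotary position embeddings instead, but the goal is the same: inject word order into an architecture that otherwise treats the input as an unordered set.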
Why Transformers Are So Powerful
| Feature | Benefit |
|---|---|
| Processes all tokens in parallel | Much faster training than sequential models |
| Attention across full context | Understands long-range dependencies in text |
| Scales with more data and parameters | Bigger models = better performance (scaling laws) |
| Works across modalities | Same architecture used for text, image, audio |
| Pretrain once, adapt many times | One foundation model serves many specialized tasks |
Real-World Analogy for Attention
Imagine reading a newspaper article and trying to understand who "he" refers to. A person does not re-read the entire article one word at a time. The eyes jump back to the most relevant earlier mention. The Transformer's attention mechanism replicates this — it jumps directly to the most relevant parts of the input when processing each word.
With Transformers understood, the next important concept is tokens — the units of text that LLMs actually process. Every word, space, and punctuation mark gets broken down into tokens before the model ever sees them.
