GenAI Transformers and Attention
The Transformer is the architecture that powers virtually every modern large language model. Introduced in a 2017 research paper titled "Attention Is All You Need," the Transformer replaced older sequential models and made it possible to train AI on text at a massive scale. Understanding how it works reveals the engine behind all of today's generative AI systems.
The Problem Transformers Solved
Before Transformers, models processed text word by word in sequence — like reading a sentence one word at a time from left to right. This approach had a major weakness: the model often forgot earlier context by the time it reached the end of a long sentence.
Example of the problem:
"The trophy did not fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? → A sequential model struggling with long-distance connections would often get this wrong.
Transformers solve this by looking at all words at the same time, rather than one at a time. This is what makes them both faster and more capable of understanding context across long distances in text.
What Is Attention?
Attention is the core mechanism inside a Transformer. It allows the model to decide how much focus to place on each word in a sentence when generating or understanding any given word.
Think of it as a spotlight. When the model is processing the word "it" in a sentence, the attention mechanism shines a spotlight on all other words and asks: which ones are most relevant to understanding what "it" means?
Sentence: "The trophy did not fit in the suitcase because it was too big." When processing "it": ───────────────────────────────────────────────────── Word | Attention Weight (simplified) ───────────────────────────────────────────────────── "The" | 0.01 (low — not relevant) "trophy" | 0.72 (high — likely the referent) "fit" | 0.08 (low) "suitcase" | 0.14 (medium — related but less likely) "big" | 0.05 (low) ─────────────────────────────────────────────────────
By assigning different weights to different words, the model builds a richer understanding of meaning and context.
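The weights in a table like the one above come from a softmax, which turns raw relevance scores into positive weights that sum to 1. A minimal sketch (the scores below are made-up numbers chosen only to roughly reproduce the illustrative weights above):

```python
import math

def softmax(scores):
    """Convert raw relevance scores into weights between 0 and 1 that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw relevance scores for each word when processing "it"
words = ["The", "trophy", "fit", "suitcase", "big"]
scores = [-1.5, 2.8, 0.6, 1.1, 0.2]

weights = softmax(scores)
for word, w in zip(words, weights):
    print(f"{word:>8}: {w:.2f}")
```

Because softmax exaggerates differences between scores, "trophy" ends up with most of the weight even though its raw score is only moderately higher than the others.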
Self-Attention Explained Simply
Self-attention means the model attends to all parts of the same input sentence at once. Each word looks at every other word and calculates a relevance score.
To calculate attention, each word is turned into three vectors:
- Query (Q): What this word is looking for
- Key (K): What this word offers to others
- Value (V): The actual information this word carries
Attention Score Calculation
──────────────────────────────────────────
For each word:
  Score = Query of current word × Key of each other word
          ↓
  Apply softmax to turn scores into weights (0 to 1)
          ↓
  Multiply weights by Values
          ↓
  Sum them up → Final context-aware representation of the word
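The steps above can be sketched in plain Python. This is a toy version of scaled dot-product attention (the Q, K, and V vectors here are arbitrary 2-dimensional examples; a real model would learn them and use hundreds of dimensions):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over a short sequence.
    queries/keys/values: lists of vectors, one per word."""
    d_k = len(keys[0])  # key dimension, used to scale the scores
    outputs = []
    for q in queries:
        # 1. Score: dot product of this word's Query with every word's Key
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # 2. Softmax turns scores into weights between 0 and 1
        weights = softmax(scores)
        # 3. Weighted sum of Values -> context-aware representation
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 3-word sequence with 2-dimensional Q/K/V vectors
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(self_attention(Q, K, V))
```

Each output vector is a weighted blend of all the Value vectors, which is exactly what "context-aware representation" means: every word's representation now contains information from every other word.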
Multi-Head Attention
A single attention pass looks at relationships from one perspective. Multi-head attention runs several attention processes in parallel — each one focusing on different kinds of relationships in the text.
Multi-Head Attention (simplified)
──────────────────────────────────────────────────────
Input sentence → Split into N attention "heads"
Head 1: focuses on grammatical relationships
Head 2: focuses on subject-verb connection
Head 3: focuses on pronoun references
Head N: focuses on topic relevance
...
↓
Combine outputs from all heads
↓
Richer, multi-perspective representation
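The split-and-combine structure can be sketched as follows. This only shows the bookkeeping — in a real model, each head would run its own attention over its slice, and each head learns on its own what kind of relationship to focus on:

```python
def split_heads(vector, num_heads):
    """Split one embedding vector into equal slices, one per head."""
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

def combine_heads(head_outputs):
    """Concatenate the per-head outputs back into one vector."""
    return [x for head in head_outputs for x in head]

embedding = [0.1, 0.4, -0.2, 0.7, 0.3, -0.5, 0.9, 0.0]  # toy 8-dim embedding
heads = split_heads(embedding, num_heads=4)  # 4 heads, each seeing 2 dimensions
# ... each head would now run attention over its own slice ...
combined = combine_heads(heads)
assert combined == embedding
```

Because the heads operate on separate slices in parallel, adding heads gives the model multiple "perspectives" at little extra sequential cost.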
The Full Transformer Architecture
The original Transformer has two main parts: an Encoder and a Decoder. Modern LLMs typically use only the Decoder portion.
ENCODER (understands input) DECODER (generates output)
───────────────────────────── ──────────────────────────────
Input Tokens Output Tokens (so far)
│ │
Embedding Layer Embedding Layer
│ │
Multi-Head Self-Attention Masked Self-Attention
│ │
Feed-Forward Network Cross-Attention (attends to encoder)
│ │
(Stack N times) Feed-Forward Network
│ │
Encoder Output ─────────────────▶ (Stack N times)
│
Final Output Tokens
In decoder-only models like GPT, the decoder generates text autoregressively — one token at a time, each new token dependent on all previous tokens.
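The autoregressive loop can be sketched like this. The `next_token` function below is a hypothetical stand-in for the model (a hard-coded lookup table, not a real predictor) — the point is the loop: each step feeds the entire sequence so far back in to choose the next token:

```python
def next_token(context):
    """Stand-in for the model: a hypothetical lookup that picks the
    next token given everything generated so far."""
    table = {
        ("The",): "trophy",
        ("The", "trophy"): "was",
        ("The", "trophy", "was"): "big",
    }
    return table.get(tuple(context), "<eos>")

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(tokens)  # each step conditions on ALL prior tokens
        if tok == "<eos>":        # stop when the model signals end-of-sequence
            break
        tokens.append(tok)
    return tokens

print(generate(["The"]))  # ['The', 'trophy', 'was', 'big']
```

Real decoders replace the lookup with a full Transformer forward pass and sample from a probability distribution, but the one-token-at-a-time structure is the same.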
Positional Encoding — Teaching the Model About Word Order
Since attention looks at all words simultaneously, the model would not naturally know which word comes first or last. Positional encoding adds information about each word's position in the sequence.
Without positional encoding:
  "Dog bites man" = "Man bites dog"
  (same words, different meaning — model can't tell)

With positional encoding:
  Each word gets its position tagged: "Dog[1] bites[2] man[3]"
  — model knows the order matters
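The original Transformer paper used sinusoidal positional encodings: each position gets a unique vector of sine and cosine values at different frequencies, which is added to the word's embedding. A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding from the original Transformer paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pe = []
    for i in range(d_model // 2):
        angle = position / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# Each position produces a distinct vector, so after adding it to the
# word embeddings, "Dog bites man" and "Man bites dog" look different.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```

Many modern models use learned or rotary position embeddings instead, but the goal is the same: inject word order into an architecture that otherwise treats the input as an unordered set.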
Why Transformers Are So Powerful
| Feature | Benefit |
|---|---|
| Processes all tokens in parallel | Much faster training than sequential models |
| Attention across full context | Understands long-range dependencies in text |
| Scales with more data and parameters | Bigger models = better performance (scaling laws) |
| Works across modalities | Same architecture used for text, image, audio |
| Pretrain once, adapt many times | One foundation model serves many specialized tasks |
Real-World Analogy for Attention
Imagine reading a newspaper article and trying to understand who "he" refers to. A person does not re-read the entire article one word at a time. The eyes jump back to the most relevant earlier mention. The Transformer's attention mechanism replicates this — it jumps directly to the most relevant parts of the input when processing each word.
With Transformers understood, the next important concept is tokens — the units of text that LLMs actually process. Every word, space, and punctuation mark gets broken down into tokens before the model ever sees them.
