Deep Learning Transformers

The Transformer is the architecture behind GPT, BERT, Google Translate, and most modern AI language models. Introduced in the 2017 paper "Attention Is All You Need," it largely replaced recurrent networks for sequence tasks by processing all positions in a sequence in parallel, which makes training on modern hardware dramatically faster and enables models of unprecedented scale.

What Makes Transformers Different

RNNs process sequences one step at a time: word 1, then word 2, then word 3. Each step depends on the result of the previous one, so the work cannot be spread across positions and training is slow. Transformers process all words in parallel, using attention to relate any word to any other word within a single layer.

RNN processing: "The cat sat on the mat"
  Step 1: "The"
  Step 2: "cat"
  Step 3: "sat"
  ... (must wait for each step to finish)

Transformer processing: "The cat sat on the mat"
  All 6 words processed simultaneously
  Every word attends to every other word at once
  → No word waits for an earlier step to finish, so the work runs in parallel; the advantage over an RNN grows with sequence length

The Transformer Architecture

High-Level View

INPUT TEXT
    ↓
[Token Embedding + Positional Encoding]
    ↓
┌──────────────────────────────┐
│      ENCODER STACK           │   ← Used by BERT-style models
│   (N identical layers)       │
│   Each layer:                │
│   → Multi-Head Self-Attention│
│   → Feed-Forward Network     │
│   → Layer Norm + Residual    │
└──────────────────────────────┘
    ↓
┌──────────────────────────────┐
│      DECODER STACK           │   ← Used by GPT-style models
│   (N identical layers)       │
│   Each layer:                │
│   → Masked Self-Attention    │
│   → Cross-Attention          │   ← Only in encoder-decoder models; GPT-style models omit it
│   → Feed-Forward Network     │
│   → Layer Norm + Residual    │
└──────────────────────────────┘
    ↓
OUTPUT

Key Components Explained

1. Token Embedding

Each word (or subword) is converted into a high-dimensional vector. The word "cat" becomes a list of, say, 512 or 768 numbers, and after training, words with similar meanings land close together in this embedding space.
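
How close two embeddings are is commonly measured with cosine similarity. The following is a minimal sketch, assuming made-up 3-dimensional vectors in place of real learned embeddings; the function name cosine_similarity is chosen here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors for illustration only; real embeddings are
# learned during training and have hundreds of dimensions.
embeddings = {
    "cat": [0.21, -0.45, 0.88],
    "dog": [0.19, -0.41, 0.85],
    "car": [-0.73, 0.12, -0.34],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy values, "cat" and "dog" score close to 1 (nearly the same direction), while "cat" and "car" score below 0.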

"cat"  →  [0.21, -0.45, 0.88, ... 512 numbers]
"dog"  →  [0.19, -0.41, 0.85, ... 512 numbers]   ← similar to cat
"car"  →  [-0.73, 0.12, -0.34, ... 512 numbers]  ← very different

2. Positional Encoding

Self-attention processes all words simultaneously, so on its own it has no notion of word order. Positional encoding adds a distinct position signal to each word's embedding so the model knows "cat" is word 2 and "mat" is word 6.

"The"   embedding + position 1 signal → position-aware embedding
"cat"   embedding + position 2 signal → position-aware embedding
"sat"   embedding + position 3 signal → position-aware embedding

Without this: "cat sat" and "sat cat" look the same to the Transformer.
With this:     Word order is preserved.
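
One way to build the position signal is the fixed sinusoidal scheme from the original Transformer paper; many later models learn their position embeddings instead. The sketch below assumes that sinusoidal scheme, toy 4-dimensional embeddings, and 0-indexed positions. The helper names are hypothetical.

```python
import math

def sinusoidal_position(pos, d_model):
    """Position signal for one position: sin on even dims, cos on odd dims."""
    signal = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one wavelength.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        signal.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return signal

def add_position(embedding, pos):
    """Element-wise sum of a token embedding and its position signal."""
    signal = sinusoidal_position(pos, len(embedding))
    return [e + s for e, s in zip(embedding, signal)]

# Toy 4-dimensional embedding (hypothetical values).
cat = [0.21, -0.45, 0.88, 0.10]

# The same word placed at two different positions gets different vectors.
cat_at_1 = add_position(cat, 1)
cat_at_2 = add_position(cat, 2)
print(cat_at_1 != cat_at_2)
```

Because the signal is added rather than appended, the embedding keeps its size, and the model can learn to read both meaning and position from the same vector.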

3. Multi-Head Self-Attention

Every word's representation is updated by attending to all other words in the sequence. Multiple attention heads run in parallel, each capturing different types of relationships.

Processing "sat" in "The cat sat on the mat":

Head 1 (syntactic): "sat" → attends strongly to "cat" (subject)
Head 2 (semantic):  "sat" → attends to "mat" (location)
Head 3 (context):   "sat" → attends to "on" (preposition)

All heads combined → "sat" now carries richer, multi-faceted meaning
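
The core computation inside each head is scaled dot-product attention. The sketch below shows a single head and, as a simplifying assumption, uses the input vectors directly as queries, keys, and values; a real layer first applies learned projection matrices, and the 2-dimensional vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys, and values are the input vectors
    themselves; a real layer first applies learned projections.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, vectors))
                  for j in range(d)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens.
tokens = ["cat", "sat", "mat"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Multi-head attention runs several such computations side by side, each with its own learned projections, and concatenates the results.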

4. Feed-Forward Network

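After attention, each position passes through a two-layer neural network. The same network, with the same weights, is applied to every position separately, so it transforms each word's representation individually and adds expressive capacity. In practice it is not small: its hidden layer is usually several times wider than the embedding.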
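
A minimal sketch of such a position-wise network follows, assuming a ReLU activation (as in the original paper) and tiny hand-picked weights; real weights are learned, and the function names are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def linear(x, weights, bias):
    """Compute x @ weights + bias for one vector x."""
    return [sum(xi * w_row[j] for xi, w_row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply ReLU, project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Hypothetical weights: model width 2, hidden width 4.
w1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.0, -1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# The same weights are applied to every position separately.
outputs = [feed_forward(x, w1, b1, w2, b2) for x in sequence]
print(outputs)
```

Positions 1 and 3 hold the same input vector, so they produce the same output: the network never looks at neighboring positions.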

5. Residual Connections and Layer Normalization

Each sub-layer's output is added back to its input (residual connection) and then normalized. The shortcut helps mitigate vanishing gradients in very deep stacks, and the normalization helps stabilize training.

output = LayerNorm(x + SubLayer(x))

The "x +" part is the residual connection.
It creates a shortcut that gradients flow through easily.
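
The formula above can be sketched in a few lines. This version omits the learned scale and shift parameters that a full layer normalization carries, and toy_sublayer is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance.

    Simplification: the learned scale and shift parameters are omitted.
    """
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def residual_block(x, sublayer):
    """output = LayerNorm(x + SubLayer(x))"""
    added = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(added)

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
out = residual_block(x, toy_sublayer)
print([round(v, 3) for v in out])
```

Whatever the sub-layer does, the output has roughly zero mean and unit variance, which keeps activations in a stable range from layer to layer.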

Encoder-Only vs Decoder-Only vs Encoder-Decoder

Architecture      Example Models         Best For
Encoder-Only      BERT, RoBERTa          Understanding tasks: classification, question answering, search
Decoder-Only      GPT-4, Claude, LLaMA   Generation tasks: writing text, code, chat
Encoder-Decoder   T5, BART               Sequence-to-sequence: translation, summarization

Masked Attention in Decoders

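Decoder models generate text one token at a time. When predicting word 4, the model may use only words 1 to 3; it must not look at word 4 or anything after it. At generation time those words do not exist yet, but during training the whole sequence is fed in at once, so masking hides the future positions.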

Generating: "The cat sat on the ___"

When predicting position 6:
  Can see:    "The" "cat" "sat" "on" "the"
  Cannot see: position 6 and anything after it (hidden by the mask during training)

This forces the model to predict each word based only on what came before.
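
A common way to implement this is a causal mask that sets the attention scores of future positions to negative infinity before the softmax, so they receive zero weight. The sketch below assumes equal raw scores to make the effect easy to see; the helper names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over allowed positions; masked positions get weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on"]
mask = causal_mask(len(tokens))

# Hypothetical raw attention scores (all equal, to make the effect obvious).
scores = [[1.0] * len(tokens) for _ in tokens]
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]

for token, row in zip(tokens, weights):
    print(f"{token:>4}", [round(w, 2) for w in row])
```

Each row spreads its attention only over itself and earlier tokens. The output at a position is then used to predict the next token, so no prediction can depend on the token it is trying to predict.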

Scale: Why Transformers Dominate Today

GPT-2 (2019):     1.5 billion parameters
GPT-3 (2020):   175 billion parameters
GPT-4 (2023):   not disclosed (~1 trillion is an unconfirmed outside estimate)

More parameters + more data + more compute = better performance, and the gains have been fairly predictable in practice
This scaling behavior helped make Transformers the foundation of modern AI.
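
The parameter counts for GPT-2 and GPT-3 can be roughly reproduced from the model shape. The sketch below uses a common rule of thumb of about 12 × d_model² parameters per layer, plus the token embedding table. The layer counts, widths, and vocabulary size are the ones reported in the GPT-2 and GPT-3 papers, not figures from this lesson, and the function name is hypothetical.

```python
def approx_transformer_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a decoder-only Transformer.

    Per layer: about 4 * d_model^2 for the attention projections plus
    8 * d_model^2 for a feed-forward network with hidden width 4 * d_model.
    Biases, layer norms, and position embeddings are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# Configurations as reported in the GPT-2 and GPT-3 papers.
gpt2 = approx_transformer_params(n_layers=48, d_model=1600, vocab_size=50257)
gpt3 = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)

print(f"GPT-2 estimate: {gpt2 / 1e9:.2f} billion")
print(f"GPT-3 estimate: {gpt3 / 1e9:.1f} billion")
```

Both estimates land close to the published 1.5 billion and 175 billion figures, which shows that most parameters sit in the attention and feed-forward weight matrices.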

Transformer Applications

Large Language Models: ChatGPT, Claude, Gemini, and LLaMA are all built on decoder-only Transformers
Machine Translation: Google Translate moved from LSTM-based models to Transformer-based ones for better accuracy
Code Generation: GitHub Copilot generates code suggestions using a Transformer trained on code
Computer Vision: Vision Transformers (ViT) apply the same architecture to image patches, for tasks such as image classification
Drug Discovery: Protein structure prediction models like AlphaFold use Transformer components

Key Terms

Transformer: an architecture that processes sequences using attention, without recurrence
Token Embedding: a dense vector representation of each word or subword
Positional Encoding: a signal added to embeddings to preserve word order
Residual Connection: a shortcut that adds a layer's input to its output
Encoder: the stack that reads the whole input sequence and builds a context-aware representation of each token
Decoder: the stack that generates the output sequence one token at a time
Masked Attention: attention that hides future positions so each position can use only earlier tokens
