Deep Learning Attention Mechanism
The attention mechanism fundamentally changed how neural networks process sequences. Instead of relying on a single compressed summary of the entire input, attention allows the model to focus on the most relevant parts of the input at each step of the output — the same way your eyes focus on specific words when reading a complex sentence.
The Problem Attention Solves
Consider translating a long English paragraph to French. A standard encoder-decoder RNN compresses the entire paragraph into one fixed-size vector and hands it to the decoder. That single vector becomes the only source of information for generating every word in the French translation. For long texts, this single vector cannot hold all the detail — and translation quality drops.
The Bottleneck Problem
ENCODER-DECODER WITHOUT ATTENTION:
English: "The cat sat on the mat near the window"
↓ entire sentence compressed into one vector
[ single fixed vector ]
↓
French: "Le chat était assis sur..."
Problem: By the time the encoder reaches the end of the sentence, the single vector has lost much of the detail about the early words.
Long sentences → poor translation quality
How Attention Works
With attention, the decoder does not rely on just one summary vector. At every output step, it looks back at all the encoder hidden states and assigns a relevance score to each one. The decoder then focuses its attention on the most relevant encoder states for generating each specific output word.
Attention in Translation
English words: "The" "cat" "sat" "on" "the" "mat"
Encoder states: h1 h2 h3 h4 h5 h6
Generating the French word "chat" (cat):
Attention scores:
h1 ("The") = 0.05 — barely relevant
h2 ("cat") = 0.85 — highly relevant ← model focuses here
h3 ("sat") = 0.03
h4 ("on") = 0.02
h5 ("the") = 0.03
h6 ("mat") = 0.02
───────────────────
Total = 1.00
The model generates "chat" by weighting h2 most heavily.
The Three Steps of Attention
Step 1: Score
Compare the current decoder state to each encoder hidden state. Each comparison produces a raw relevance score.
Score(decoder state, encoder state h_i) → raw alignment number
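One common choice for the score function is a plain dot product between the decoder state and each encoder state; additive and bilinear scores are other options. The sketch below assumes toy 3-dimensional vectors. The names `dot_score`, `decoder_state`, and `encoder_states`, and all the numbers, are made up for illustration and do not come from a trained model.

```python
def dot_score(decoder_state, encoder_state):
    """Raw alignment score: dot product of two equal-length vectors."""
    if len(decoder_state) != len(encoder_state):
        raise ValueError("vectors must have the same length")
    return sum(d * e for d, e in zip(decoder_state, encoder_state))


# Hypothetical toy vectors (illustrative values only).
decoder_state = [1.0, 0.0, 1.0]
encoder_states = [
    [0.5, 0.1, 0.0],   # h1
    [2.0, 0.3, 1.5],   # h2, points in a similar direction to the decoder state
    [-0.5, 1.0, 0.5],  # h3
]

# One raw score per encoder state.
raw_scores = [dot_score(decoder_state, h) for h in encoder_states]
print(raw_scores)  # [0.5, 3.5, 0.0]
```

The encoder state that lines up best with the decoder state (h2 here) gets the largest raw score. The scores are not yet weights: they can be negative and do not sum to 1.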
Step 2: Normalize (Softmax)
Apply Softmax to the raw scores so they all sum to 1. These become the attention weights — how much to focus on each encoder state.
Raw scores: [1.0, 3.84, 0.5, 0.1, 0.5, 0.1]
After Softmax: [0.05, 0.85, 0.03, 0.02, 0.03, 0.02] → sum = 1.0 (values rounded to two decimals)
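A minimal sketch of the softmax step, using only the standard library. The raw scores in it are illustrative values, picked so that the resulting weights round to the ones used for "chat" in the translation example; they are not taken from a real model.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative raw scores (assumed values, one per encoder state h1..h6).
raw_scores = [1.0, 3.84, 0.5, 0.1, 0.5, 0.1]
weights = softmax(raw_scores)

print([round(w, 2) for w in weights])  # [0.05, 0.85, 0.03, 0.02, 0.03, 0.02]
```

Softmax preserves the ranking of the raw scores but exaggerates the gaps between them, because each score is exponentiated before normalizing.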
Step 3: Weighted Sum (Context Vector)
Multiply each encoder state by its attention weight and add them all together. The result is a context vector — a weighted combination of all encoder states, emphasizing the most relevant ones.
Context = 0.05 × h1 + 0.85 × h2 + 0.03 × h3 + 0.02 × h4 + 0.03 × h5 + 0.02 × h6
This context vector carries focused information for generating "chat."
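The weighted sum can be sketched as follows. The attention weights are the ones from the "chat" example; the 2-dimensional encoder states are hypothetical values chosen only to keep the arithmetic easy to follow, and `context_vector` is an illustrative helper name.

```python
def context_vector(weights, encoder_states):
    """Weighted sum of encoder states: sum over i of weights[i] * encoder_states[i]."""
    dim = len(encoder_states[0])
    context = [0.0] * dim
    for w, h in zip(weights, encoder_states):
        for j in range(dim):
            context[j] += w * h[j]
    return context


# Attention weights from the "chat" example.
weights = [0.05, 0.85, 0.03, 0.02, 0.03, 0.02]

# Hypothetical 2-dimensional encoder states (made-up numbers).
encoder_states = [
    [1.0, 0.0],  # h1 "The"
    [0.0, 1.0],  # h2 "cat"
    [1.0, 1.0],  # h3 "sat"
    [0.5, 0.5],  # h4 "on"
    [1.0, 0.0],  # h5 "the"
    [0.0, 0.5],  # h6 "mat"
]

context = context_vector(weights, encoder_states)
print([round(c, 2) for c in context])  # [0.12, 0.9]
```

The result sits very close to h2 = [0.0, 1.0], which is what a weight of 0.85 on "cat" should produce.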
Self-Attention
Self-attention applies the attention mechanism within a single sequence — every word attends to every other word in the same sentence. This captures which words are related to each other, regardless of their distance.
Self-Attention Example
Sentence: "The trophy didn't fit in the suitcase because it was too big"
When processing "it":
Attention scores for "it" attending to every other word:
"The" → 0.02
"trophy" → 0.72 ← "it" refers to the trophy
"didn't" → 0.01
"fit" → 0.03
"suitcase" → 0.15 ← also somewhat related
"because" → 0.02
"it" → 0.00
"was" → 0.02
"too" → 0.01
"big" → 0.02
Self-attention resolves the pronoun "it" → refers to "trophy"
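A minimal sketch of self-attention, written as scaled dot-product attention (the form used in Transformers). To stay short it assumes that queries, keys, and values are all the raw token vectors, whereas real models first apply learned projection matrices. The three 2-dimensional token vectors are made up, and unlike the hand-written scores above, each token here also attends to itself.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in tokens:
        # Score this token against every token in the same sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # New representation: weighted sum of all token vectors.
        output = [sum(w * value[j] for w, value in zip(weights, tokens))
                  for j in range(dim)]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights


# Hypothetical embeddings; "it" is placed near "trophy" on purpose.
tokens = [
    [1.0, 0.0],  # "trophy"
    [0.0, 1.0],  # "suitcase"
    [0.9, 0.1],  # "it"
]

outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

In the last printed row, the weight for "trophy" is larger than the weight for "suitcase", because the made-up vector for "it" was placed closer to "trophy". In a trained model that closeness would come from learning, not from hand-picked numbers.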
Multi-Head Attention
Instead of running attention once, multi-head attention runs several attention processes in parallel, each looking at the sequence from a different angle. Each "head" has its own learned parameters, so different heads can come to specialize in different types of relationship.
Single attention head → one perspective on relationships
Multi-head attention (8 heads):
Head 1: Focus on grammatical structure (subject-verb-object)
Head 2: Focus on coreference (which pronouns refer to which nouns)
Head 3: Focus on semantic similarity (synonyms, related concepts)
Heads 4–8: Other learned relationships
All 8 heads run in parallel → results concatenated → final representation
Multi-head attention is one of the core components inside Transformer models, which the next topic covers in full.
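The split-attend-concatenate pattern can be sketched in plain Python. This is a simplified illustration under stated assumptions: each head simply works on its own slice of the token vector, whereas real Transformer layers give every head learned query, key, and value projections and apply an output projection after concatenation. The function names and the 4-dimensional token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def multi_head_self_attention(tokens, num_heads):
    """Slice each token vector into num_heads parts, attend per slice, concatenate."""
    dim = len(tokens[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_tokens = [t[lo:hi] for t in tokens]  # this head's view of the sequence
        head_outputs.append(attend(head_tokens, head_tokens, head_tokens))
    # Concatenate the heads' outputs position by position.
    return [[x for head in head_outputs for x in head[i]]
            for i in range(len(tokens))]


# Hypothetical 4-dimensional token vectors (made-up numbers).
tokens = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]

result = multi_head_self_attention(tokens, num_heads=2)
print(len(result), len(result[0]))  # 3 4
```

Each head computes its own attention weights from its own slice, so two heads can attend to different positions for the same token. The output keeps the original shape: one vector per token, with the same embedding size.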
Visualizing Attention
Attention weights can be plotted as heatmaps. These show which input words received the most weight when the model generated each output word.
English input: The | cat | sat | on | the | mat
↓
French output: Le chat était assis sur le tapis
Attention heatmap (rows = French output, columns = English input):
        The  cat  sat  on   the  mat
Le      ■    □    □    □    □    □    (focused on "The")
chat    □    ■    □    □    □    □    (focused on "cat")
était   □    □    ■    □    □    □    (focused on "sat")
assis   □    □    ■    □    □    □    (focused on "sat")
sur     □    □    □    ■    □    □    (focused on "on")
le      □    □    □    □    ■    □    (focused on "the")
tapis   □    □    □    □    □    ■    (focused on "mat")
■ = high attention weight
□ = low attention weight
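A text heatmap like this one can be produced from a matrix of attention weights with a few lines of code. The sketch below is illustrative: `render_heatmap` is a hypothetical helper, the weights are made up, only four words are used, and `#` and `.` stand in for the filled and empty squares.

```python
def render_heatmap(weights, row_labels, col_labels, threshold=0.5, high="#", low="."):
    """Return a text heatmap with one row per output word.

    Cells at or above `threshold` are drawn with `high`, the rest with `low`.
    """
    width = max(len(label) for label in row_labels + col_labels) + 1
    header = " " * width + "".join(label.ljust(width) for label in col_labels)
    lines = [header.rstrip()]
    for label, row in zip(row_labels, weights):
        cells = "".join((high if w >= threshold else low).ljust(width) for w in row)
        lines.append((label.ljust(width) + cells).rstrip())
    return "\n".join(lines)


english = ["The", "cat", "sat", "on"]
french = ["Le", "chat", "assis", "sur"]

# Hypothetical attention weights (rows = French output, columns = English input).
weights = [
    [0.91, 0.03, 0.03, 0.03],
    [0.04, 0.90, 0.03, 0.03],
    [0.02, 0.05, 0.88, 0.05],
    [0.02, 0.03, 0.05, 0.90],
]

heatmap = render_heatmap(weights, french, english)
print(heatmap)
```

A single threshold throws away most of the information in the weights. Plotting libraries usually map each weight to a color intensity instead, which also shows the cases where attention is spread over several input words.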
Why Attention Matters
| Capability | Without Attention | With Attention |
|---|---|---|
| Long sequences | Quality degrades | Handles them well |
| Context linking | Only nearby words | Any word to any word |
| Interpretability | Black box | Attention maps are visible |
| Parallelization | Sequential (one RNN step at a time) | All positions computed in parallel (with self-attention, as in Transformers) |
Key Terms
- Attention — a mechanism that assigns relevance weights to different parts of the input
- Attention Weight — how much focus to place on a specific encoder state
- Context Vector — a weighted combination of encoder states, used by the decoder
- Self-Attention — attention applied within a single sequence (every word attends to every other)
- Multi-Head Attention — running multiple attention processes in parallel, each capturing different relationships
