Deep Learning Attention Mechanism

The attention mechanism fundamentally changed how neural networks process sequences. Instead of relying on a single compressed summary of the entire input, attention allows the model to focus on the most relevant parts of the input at each step of the output — the same way your eyes focus on specific words when reading a complex sentence.

The Problem Attention Solves

Consider translating a long English paragraph to French. A standard encoder-decoder RNN compresses the entire paragraph into one fixed-size vector and hands it to the decoder. That single vector becomes the only source of information for generating every word in the French translation. For long texts, this single vector cannot hold all the detail — and translation quality drops.

The Bottleneck Problem

ENCODER-DECODER WITHOUT ATTENTION:

English:  "The cat sat on the mat near the window"
              ↓ entire sentence compressed into one vector
        [  single  fixed  vector  ]
              ↓
French:   "Le chat était assis sur..."

Problem: One fixed-size vector must hold the whole sentence, so details (often from the early words) are already lost when decoding starts.
Long sentences → poor translation quality
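
To make the bottleneck concrete, here is a toy sketch in plain Python. The recurrent update rule, its constants (0.5 and 0.8), the numeric "token values", and the function name encode are all illustrative assumptions, not a trained model. The point is only that the encoder produces one hidden state per word, while a decoder without attention receives just the last one, and that last state has the same size however long the sentence is.

```python
import math

def encode(token_values, state_size=4):
    """Toy RNN encoder: returns every hidden state, one per input token.

    The update rule and its constants are made up for illustration.
    """
    state = [0.0] * state_size
    states = []
    for x in token_values:
        # Mix the previous state with the current input, squash with tanh.
        state = [math.tanh(0.5 * h + 0.8 * x * (i + 1) / state_size)
                 for i, h in enumerate(state)]
        states.append(state)
    return states

short_states = encode([0.1, 0.4, 0.3])
long_states = encode([0.1, 0.4, 0.3, 0.9, 0.2, 0.7, 0.5, 0.6, 0.8])

# Without attention, only the final state is handed to the decoder.
short_summary = short_states[-1]
long_summary = long_states[-1]

print(len(short_states), len(long_states))    # one state per input token
print(len(short_summary), len(long_summary))  # summary size never changes
```

The 9-token input gets no more room in its summary than the 3-token one, which is the bottleneck in miniature.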

How Attention Works

With attention, the decoder does not rely on just one summary vector. At every output step, it looks back at all the encoder hidden states and assigns a relevance score to each one. The decoder then focuses its attention on the most relevant encoder states for generating each specific output word.

Attention in Translation

English words:    "The" "cat"  "sat"  "on"  "the" "mat"
Encoder states:    h1    h2     h3     h4    h5    h6

Generating the French word "chat" (cat):
  Attention weights (after Softmax):
    h1 ("The") = 0.05 — barely relevant
    h2 ("cat") = 0.85 — highly relevant ← model focuses here
    h3 ("sat") = 0.03
    h4 ("on")  = 0.02
    h5 ("the") = 0.03
    h6 ("mat") = 0.02
    ───────────────────
    Total:      = 1.00

The model generates "chat" by weighting h2 most heavily.

The Three Steps of Attention

Step 1: Score

Compare the current decoder state to each encoder hidden state. Each comparison produces a raw relevance score.

Score(decoder state, encoder state h_i) → raw alignment number
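
The text above does not say which scoring function is used. One common choice is the dot product, sketched below as an assumption. The function name dot_score and the small 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
def dot_score(decoder_state, encoder_state):
    """Raw alignment score: dot product of the two vectors."""
    return sum(d * e for d, e in zip(decoder_state, encoder_state))

# Hypothetical 3-dimensional states, chosen only for illustration.
decoder_state = [1.0, 0.5, -0.5]
encoder_states = [
    [0.5, 0.0, 0.5],   # h1
    [1.0, 1.0, -1.0],  # h2 (points the same way as the decoder state)
    [0.0, 0.5, 0.5],   # h3
]

raw_scores = [dot_score(decoder_state, h) for h in encoder_states]
print(raw_scores)  # [0.25, 2.0, 0.0]
```

The encoder state that points in the same direction as the decoder state (h2 here) receives the highest raw score.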

Step 2: Normalize (Softmax)

Apply Softmax to the raw scores so that each one becomes a positive number and all of them sum to 1. These become the attention weights: how much to focus on each encoder state.

Raw scores:    [1.2, 4.05, 0.7, 0.3, 0.7, 0.3]
After Softmax: [0.05, 0.85, 0.03, 0.02, 0.03, 0.02]  → sum = 1.0  (rounded to two decimals)
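
The Softmax step can be sketched in a few lines of plain Python. The raw scores below are illustrative values, picked so that the result rounds to the attention weights used in the translation example. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [1.2, 4.05, 0.7, 0.3, 0.7, 0.3]  # illustrative values
weights = softmax(raw_scores)

print([round(w, 2) for w in weights])  # [0.05, 0.85, 0.03, 0.02, 0.03, 0.02]
print(round(sum(weights), 6))          # 1.0
```

Because Softmax exponentiates the scores, a modest lead in raw score turns into a large share of the attention weight.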

Step 3: Weighted Sum (Context Vector)

Multiply each encoder state by its attention weight and add them all together. The result is a context vector — a weighted combination of all encoder states, emphasizing the most relevant ones.

Context = 0.05 × h1 + 0.85 × h2 + 0.03 × h3 + 0.02 × h4 + 0.03 × h5 + 0.02 × h6

This context vector carries focused information for generating "chat."
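
As a rough sketch, the weighted sum looks like this in plain Python. The weights are the ones from the example above. The 2-dimensional encoder states are hypothetical stand-ins, since real hidden states have hundreds of dimensions.

```python
def context_vector(weights, encoder_states):
    """Weighted sum of encoder states, one output value per dimension."""
    size = len(encoder_states[0])
    return [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(size)
    ]

weights = [0.05, 0.85, 0.03, 0.02, 0.03, 0.02]

# Hypothetical 2-dimensional encoder states h1..h6, for illustration only.
encoder_states = [
    [1.0, 0.0],  # h1 "The"
    [0.0, 1.0],  # h2 "cat"
    [1.0, 1.0],  # h3 "sat"
    [0.0, 0.0],  # h4 "on"
    [1.0, 0.0],  # h5 "the"
    [0.0, 0.0],  # h6 "mat"
]

context = context_vector(weights, encoder_states)
print([round(c, 2) for c in context])  # [0.11, 0.88]
```

The result sits close to h2, the state for "cat", because that state carries 0.85 of the total weight.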

Self-Attention

Self-attention applies the attention mechanism within a single sequence: every word attends to every other word in the same sentence (and, in standard implementations, to itself as well). This captures which words are related to each other, regardless of their distance.

Self-Attention Example

Sentence: "The trophy didn't fit in the suitcase because it was too big"

When processing "it":
  Illustrative attention weights for "it" ("in" and the second "the" are omitted; their weights are near zero):

  "The"      → 0.02
  "trophy"   → 0.72  ← "it" refers to the trophy
  "didn't"   → 0.01
  "fit"      → 0.03
  "suitcase" → 0.15  ← also somewhat related
  "because"  → 0.02
  "it"       → 0.00
  "was"      → 0.02
  "too"      → 0.01
  "big"      → 0.02

Self-attention resolves the pronoun "it" → refers to "trophy"
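
Below is a minimal sketch of self-attention in plain Python, under two loud assumptions. First, it uses the scaled dot-product form (scores divided by the square root of the vector size), which is the variant used in Transformers. Second, it skips the learned projections: each input vector acts as query, key, and value at once, whereas real models learn separate linear maps for each role. The 2-dimensional embeddings are hypothetical, with "it" deliberately placed near "trophy".

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Simplification: each input vector serves as query, key, and value.
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        output = [sum(w * v[i] for w, v in zip(weights, vectors))
                  for i in range(d)]
        all_weights.append(weights)
        outputs.append(output)
    return outputs, all_weights

# Hypothetical 2-d embeddings; "it" is placed near "trophy" on purpose.
tokens = ["trophy", "suitcase", "it"]
vectors = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.5]]

outputs, weights = self_attention(vectors)
for token, w in zip(tokens, weights[2]):
    print(f"it -> {token}: {w:.2f}")
```

With these made-up vectors, "it" puts its largest weight on "trophy". In a trained model that preference comes from learned parameters, not from hand-placed embeddings.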

Multi-Head Attention

Instead of running attention once, multi-head attention runs several attention processes in parallel, each with its own learned projections of the sequence. This lets different "heads" pick up different types of relationships.

Single attention head → one perspective on relationships

Multi-head attention (8 heads; the roles below are illustrative, since heads learn their own focus during training):
  Head 1: Focus on grammatical structure (subject-verb-object)
  Head 2: Focus on coreference (which pronouns refer to which nouns)
  Head 3: Focus on semantic similarity (synonyms, related concepts)
  Heads 4–8: Other learned relationships

All 8 heads run in parallel → results concatenated → linear projection → final representation

Multi-head attention is one of the core components inside Transformer models, which the next topic covers in full.
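
The split-attend-concatenate flow can be sketched in plain Python. This is a deliberately simplified version: slicing the feature dimension stands in for the learned per-head projections, and the learned output projection that normally follows concatenation is left out. The function names and the input sequence are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Scaled dot-product self-attention over a list of vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, vectors))
                    for i in range(d)])
    return out

def multi_head_attention(vectors, num_heads):
    """Slice features into heads, attend per head, concatenate.

    Simplification: slicing stands in for the learned per-head
    projections, and the final learned output projection is omitted.
    """
    d = len(vectors[0])
    assert d % num_heads == 0, "feature size must divide evenly"
    head_size = d // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * head_size, (h + 1) * head_size
        head_outputs.append(attention([v[lo:hi] for v in vectors]))
    # Concatenate the heads back together, position by position.
    return [sum((head[pos] for head in head_outputs), [])
            for pos in range(len(vectors))]

# Hypothetical sequence: 3 positions, 4 features, split into 2 heads.
sequence = [[1.0, 0.0, 0.5, 0.5],
            [0.0, 1.0, 0.5, -0.5],
            [1.0, 1.0, 0.0, 0.0]]
result = multi_head_attention(sequence, num_heads=2)
print(len(result), len(result[0]))  # 3 4
```

Each head sees only its own slice of the features, so the two heads compute different attention weights over the same three positions.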

Visualizing Attention

Attention weights can be plotted as heatmaps. Researchers can see which input words received the most weight when the model generated each output word.

English input:   The | cat | sat | on | the | mat
                  ↓
French output:  Le   chat  était assis sur   le  tapis

Attention heatmap (rows = French output, columns = English input):
          The  cat  sat  on   the  mat
Le         ■    □    □    □    □    □    (focused on "The")
chat       □    ■    □    □    □    □    (focused on "cat")
était      □    □    ■    □    □    □    (focused on "sat")
assis      □    □    ■    □    □    □    (focused on "sat")
sur        □    □    □    ■    □    □    (focused on "on")
le         □    □    □    □    ■    □    (focused on "the")
tapis      □    □    □    □    □    ■    (focused on "mat")

■ = high attention weight   □ = low attention weight
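
A text heatmap like the one above can be produced from a weight matrix with a short helper. This is only a sketch: the function name render_heatmap, the 0.5 threshold, and the weight values are assumptions made for illustration, and real tools usually draw a colour gradient instead of two symbols.

```python
def render_heatmap(weights, row_labels, col_labels, threshold=0.5):
    """Return a text heatmap: filled square where weight >= threshold."""
    width = max(len(label) for label in row_labels + col_labels) + 2
    lines = ["".join(label.ljust(width) for label in [""] + col_labels)]
    for label, row in zip(row_labels, weights):
        cells = ["■" if w >= threshold else "□" for w in row]
        lines.append(label.ljust(width)
                     + "".join(c.ljust(width) for c in cells))
    return "\n".join(line.rstrip() for line in lines)

# Hypothetical weights for the first three output words only.
english = ["The", "cat", "sat"]
french = ["Le", "chat", "était"]
weights = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.10, 0.85],
]

print(render_heatmap(weights, french, english))
```

Each row is one output word, and the filled square marks the input word that received at least half of that row's attention weight.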

Why Attention Matters

Capability	Without Attention	With Attention
Long sequences	Quality degrades	Handles them better
Context linking	Mostly nearby words	Any word to any word
Interpretability	Black box	Attention maps can be inspected
Parallelization	Sequential (RNN steps)	All positions computed in parallel (self-attention)

Key Terms

Attention — a mechanism that assigns relevance weights to different parts of the input
Attention Weight — how much focus to place on a specific encoder state
Context Vector — a weighted combination of encoder states, used by the decoder
Self-Attention — attention applied within a single sequence (every word attends to every other)
Multi-Head Attention — running multiple attention processes in parallel, each capturing different relationships
