Recurrent Neural Networks and LSTM

Recurrent Neural Networks (RNNs) are a class of neural networks designed specifically for sequential data — data where the order of inputs matters. Text, speech, stock prices, and sensor readings are all sequential. A standard neural network processes each input independently, with no memory of what came before. RNNs solve this by maintaining a hidden state that carries information across time steps.

Why Standard Networks Fail on Sequences

Sentence: "The bank approved the loan."

Standard Neural Network:
  Processes each word independently.
  Has no memory that "bank" appeared earlier when processing "loan."
  Cannot understand that "bank" here means financial institution,
  not a river bank — context from earlier words is lost.

RNN:
  Processes "The" → remembers context
  Processes "bank" → remembers "The bank"
  Processes "approved" → remembers "The bank approved"
  ...each step carries forward what was seen before.

RNN Architecture

At each time step t, the RNN takes:
  - Current input (x_t) — the current word, value, or token
  - Previous hidden state (h_t-1) — memory of past inputs

It produces:
  - New hidden state (h_t) — updated memory
  - Output (y_t) — prediction at this time step

Formula:
  h_t = tanh(W_h × h_t-1 + W_x × x_t + b)
  y_t = W_y × h_t + b_y
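
As a sketch, the two update equations above can be written directly in NumPy. The sizes (input 3, hidden 4, output 2) and the random weights are illustrative assumptions, not values from the text:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b, b_y):
    """One RNN time step: compute the new hidden state and the output."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)  # updated memory
    y_t = W_y @ h_t + b_y                        # prediction at this step
    return h_t, y_t

# Toy dimensions (assumed for illustration): input 3, hidden 4, output 2
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))
W_h = rng.normal(size=(4, 4))
W_y = rng.normal(size=(2, 4))
b, b_y = np.zeros(4), np.zeros(2)

h, y = rnn_step(rng.normal(size=3), np.zeros(4), W_x, W_h, W_y, b, b_y)
print(h.shape, y.shape)  # (4,) (2,)
```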

Unrolled RNN Diagram:

t=1           t=2           t=3           t=4
  │             │             │             │
  x1            x2            x3            x4
  │             │             │             │
[RNN] → h1 → [RNN] → h2 → [RNN] → h3 → [RNN] → h4 → Output
  │             │             │             │
  y1            y2            y3            y4

The same RNN cell (same weights) is reused at every time step.
The hidden state ht passes information from step to step.
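
The unrolling can be sketched as a plain loop: the same weight matrices are applied at every step, and only the hidden state changes. Dimensions and weights below are toy assumptions:

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Unroll an RNN over a sequence, reusing the SAME weights each step."""
    h = np.zeros(W_h.shape[0])  # initial hidden state h_0
    hs = []
    for x_t in xs:                             # t = 1, 2, ..., T
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # same W_h, W_x every time
        hs.append(h)
    return hs                                  # h_1 ... h_T

rng = np.random.default_rng(1)
W_x, W_h, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
hs = run_rnn(rng.normal(size=(5, 3)), W_x, W_h, b)  # sequence of length 5
print(len(hs), hs[0].shape)  # 5 (4,)
```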

Types of RNN Tasks

┌────────────────────┬────────────────────────────────────────────────┐
│ Task Type          │ Description and Example                        │
├────────────────────┼────────────────────────────────────────────────┤
│ One-to-Many        │ One input → sequence output                    │
│                    │ Example: Image → generates a caption           │
│ Many-to-One        │ Sequence input → one output                    │
│                    │ Example: Movie review text → Positive/Negative │
│ Many-to-Many       │ Sequence input → sequence output (same length) │
│ (synchronized)     │ Example: Classify each word in a sentence      │
│ Many-to-Many       │ Sequence input → sequence output (diff length) │
│ (encoder-decoder)  │ Example: English sentence → French sentence    │
└────────────────────┴────────────────────────────────────────────────┘

The Vanishing Gradient Problem in RNNs

Standard RNNs struggle with long sequences.

Problem:
  Sentence: "I grew up in India and have spoken _______ my whole life."
  The blank should be "Hindi" or another Indian language.
  But the relevant clue ("India") appeared many steps earlier.

  During backpropagation, the gradient from the output
  must flow back through many time steps.
  At each step it is multiplied by the recurrent weights
  and the tanh derivative, factors typically smaller than 1.
  After 20+ steps, the gradient becomes nearly zero,
  so the network cannot learn from what happened long ago.

Visual (each backward step scales the gradient by 0.8):
  Step 50 gradient: 0.8^0  = 1.0   (the loss originates here)
  Step 40 gradient: 0.8^10 ≈ 0.107
  Step 30 gradient: 0.8^20 ≈ 0.012
  Step 20 gradient: 0.8^30 ≈ 0.0012
  Step 10 gradient: 0.8^40 ≈ 0.0001 ← nearly zero

  Early steps learn almost nothing from the final output.
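
The geometric decay is easy to verify directly (the 0.8 per-step factor is the same illustrative value used above):

```python
# Each backward step multiplies the gradient by a factor < 1 (here 0.8),
# so the gradient reaching a step n positions back shrinks geometrically.
factor = 0.8
decay = {n: factor ** n for n in (10, 20, 30, 40)}
for n, g in decay.items():
    print(f"{n} steps back: {g:.4f}")
# 10 steps back: 0.1074
# 20 steps back: 0.0115
# 30 steps back: 0.0012
# 40 steps back: 0.0001
```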

LSTM: Long Short-Term Memory

LSTM is a special type of RNN designed to solve the vanishing gradient problem. It uses a more complex cell structure with special gates that control what information to remember, what to forget, and what to output at each step.

LSTM Cell Structure

An LSTM cell has two streams of information:
  Cell State (C): Long-term memory — like a conveyor belt
  Hidden State (h): Short-term working memory — like active RAM

Four components inside each LSTM cell:

1. Forget Gate:
   "What part of the long-term memory should be erased?"
   Output: Value between 0 (forget completely) and 1 (keep fully)

   f_t = sigmoid(W_f × [h_t-1, x_t] + b_f)

2. Input Gate:
   "What new information should be added to long-term memory?"
   Decides which parts of the candidate values to store.

   i_t = sigmoid(W_i × [h_t-1, x_t] + b_i)
   C̃_t = tanh(W_c × [h_t-1, x_t] + b_c)  ← candidate values

3. Cell State Update:
   Old memory is partially forgotten, new info is added
   (here × means element-wise multiplication).

   C_t = f_t × C_t-1 + i_t × C̃_t

4. Output Gate:
   "What should be sent as output and new hidden state?"

   o_t = sigmoid(W_o × [h_t-1, x_t] + b_o)
   h_t = o_t × tanh(C_t)
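
The four components can be sketched as one NumPy function that follows the gate equations above line by line. The sizes (input 3, hidden 4) and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following the four gate equations above."""
    z = np.concatenate([h_prev, x_t])   # [h_t-1, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to erase
    i_t = sigmoid(W_i @ z + b_i)        # input gate: what to store
    C_tilde = np.tanh(W_c @ z + b_c)    # candidate values
    C_t = f_t * C_prev + i_t * C_tilde  # cell state update (element-wise)
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)            # new hidden state
    return h_t, C_t

# Toy sizes (assumed): input 3, hidden 4 → each W is (4, 7)
rng = np.random.default_rng(2)
Ws = [rng.normal(size=(4, 7)) for _ in range(4)]
bs = [np.zeros(4) for _ in range(4)]
h, C = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), *Ws, *bs)
print(h.shape, C.shape)  # (4,) (4,)
```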

LSTM Cell Diagram:

  C_t-1 ──────────────────────────────────────────→ C_t
             ×         +
             │         │
           forget     input
           gate       gate
             │         │
  h_t-1 ──┬─┴─────────┴───────────────────────────→ h_t
           │                                output gate
  x_t ─────┘

LSTM Example: Sentiment Analysis

Input sentence: "The movie was not great."

Standard RNN might say: "great" → positive sentiment (wrong)
LSTM remembers: "not" from earlier → "not great" → negative

Processing:
  t=1: "The"   → h1, C1 (forget gate keeps little, input adds "the")
  t=2: "movie" → h2, C2 (stores "movie" context)
  t=3: "was"   → h3, C3
  t=4: "not"   → h4, C4 (LSTM strongly stores the negation word)
  t=5: "great" → h5, C5 (C4 says "NOT" → modifies meaning of "great")

Output: Negative sentiment ✓

The cell state "remembered" the negation across multiple steps.
Standard RNN would forget it.

GRU: Gated Recurrent Unit

GRU is a simplified version of LSTM with only two gates:
  Reset Gate: How much of past hidden state to forget
  Update Gate: How much of past hidden state to carry forward
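
The text does not spell out the GRU equations; the sketch below follows the standard formulation (reset gate, update gate, no separate cell state), with toy sizes and random weights as assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step: two gates, single hidden state."""
    z_cat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ z_cat + b_r)   # reset gate: how much past to forget
    z_t = sigmoid(W_z @ z_cat + b_z)   # update gate: how much to carry forward
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1 - z_t) * h_prev + z_t * h_tilde  # blend old and new memory

# Toy sizes (assumed): input 3, hidden 4 → each W is (4, 7)
rng = np.random.default_rng(3)
W_r, W_z, W_h = [rng.normal(size=(4, 7)) for _ in range(3)]
b_r = b_z = b_h = np.zeros(4)
h = gru_step(rng.normal(size=3), np.zeros(4), W_r, W_z, W_h, b_r, b_z, b_h)
print(h.shape)  # (4,)
```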

GRU vs LSTM:
┌────────────────────────┬──────────────┬──────────────────┐
│ Feature                │ LSTM         │ GRU              │
├────────────────────────┼──────────────┼──────────────────┤
│ Number of gates        │ 3            │ 2                │
│ Parameters             │ More         │ Fewer            │
│ Training speed         │ Slower       │ Faster           │
│ Performance            │ Slightly     │ Competitive,     │
│                        │ better on    │ better on        │
│                        │ long seqs    │ smaller datasets │
└────────────────────────┴──────────────┴──────────────────┘

Use GRU when: Dataset is smaller or training time matters.
Use LSTM when: Very long sequences and maximum accuracy needed.

Bidirectional RNN

Standard RNN reads a sequence left to right only.
In some tasks, context from the FUTURE also matters.

Example: Named Entity Recognition
  "Apple launched a new product in the California store."
  
  To know "Apple" is a company (not the fruit),
  seeing "launched a new product" AFTER "Apple" helps.

Bidirectional RNN:
  Forward RNN:  reads sequence left → right (past context)
  Backward RNN: reads sequence right → left (future context)
  Outputs from both directions are concatenated at each step.

  → Better for tasks where full context matters (translation,
    named entity recognition, text classification).
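
The steps above can be sketched by running the same simple RNN twice, once on the reversed sequence, and concatenating the aligned hidden states. All dimensions and weights are illustrative assumptions:

```python
import numpy as np

def rnn_pass(xs, W_x, W_h, b):
    """One directional pass: return the hidden state at every step."""
    h, hs = np.zeros(W_h.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(4)
xs = rng.normal(size=(5, 3))  # sequence of 5 inputs, each of size 3
Wf = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
Wb = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))

fwd = rnn_pass(xs, *Wf)                # left → right (past context)
bwd = rnn_pass(xs[::-1], *Wb)[::-1]    # right → left, re-aligned to t
combined = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(combined), combined[0].shape)  # 5 (8,)
```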

Applications of RNN and LSTM

┌─────────────────────────────┬───────────────────────────────────┐
│ Application                 │ How RNN/LSTM Is Used              │
├─────────────────────────────┼───────────────────────────────────┤
│ Sentiment Analysis          │ Read full review → positive/neg   │
│ Machine Translation         │ Encode source → decode to target  │
│ Speech Recognition          │ Audio sequence → text sequence    │
│ Text Generation             │ Predict next word from context    │
│ Stock Price Forecasting     │ Past prices → future prediction   │
│ Video Captioning            │ Frame sequence → description text │
│ Music Generation            │ Past notes → next note sequence   │
└─────────────────────────────┴───────────────────────────────────┘

The Rise of Transformers

RNNs and LSTMs dominated sequence tasks until 2017.
The Transformer architecture ("Attention Is All You Need",
Vaswani et al., Google, 2017) changed this.

Transformer advantages over LSTM:
  ✓ Processes entire sequence in parallel (not step-by-step)
  ✓ Handles much longer sequences efficiently
  ✓ Self-attention directly connects any two positions
  ✓ Scales much better with data and compute

Transformers power: BERT, GPT, ChatGPT, T5, and most
modern NLP systems.

LSTMs are still used for:
  ✓ Time series with limited compute resources
  ✓ Edge devices (smaller models)
  ✓ Problems requiring streaming sequential output
