ML Recurrent Neural Networks and LSTM
Recurrent Neural Networks (RNNs) are a class of neural networks designed specifically for sequential data — data where the order of inputs matters. Text, speech, stock prices, and sensor readings are all sequential. A standard neural network processes each input independently, with no memory of what came before. RNNs solve this by maintaining a hidden state that carries information across time steps.
Why Standard Networks Fail on Sequences
Sentence: "The bank approved the loan."

Standard Neural Network:
  Processes each word independently.
  Has no memory that "bank" appeared earlier when processing "loan."
  Cannot understand that "bank" here means financial institution,
  not a river bank; context from earlier words is lost.

RNN:
  Processes "The"      → remembers context
  Processes "bank"     → remembers "The bank"
  Processes "approved" → remembers "The bank approved"
  ...each step carries forward what was seen before.
RNN Architecture
At each time step t, the RNN takes:
- Current input (x_t) — the current word, value, or token
- Previous hidden state (h_t-1) — memory of past inputs

It produces:
- New hidden state (h_t) — updated memory
- Output (y_t) — prediction at this time step

Formula:
  h_t = tanh(W_h × h_t-1 + W_x × x_t + b)
  y_t = W_y × h_t + b_y

Unrolled RNN Diagram:

    t=1          t=2          t=3          t=4
     │            │            │            │
     x1           x2           x3           x4
     │            │            │            │
  [RNN] → h1 → [RNN] → h2 → [RNN] → h3 → [RNN] → h4 → Output
     │            │            │            │
     y1           y2           y3           y4

The same RNN cell (same weights) is reused at every time step. The hidden state h_t passes information from step to step.
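The recurrence above can be sketched directly in NumPy. This is a minimal illustration, not a trainable model: the weight shapes, random initialization, and sequence length are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3

# Shared weights: the SAME matrices are reused at every time step
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b   = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

def rnn_step(h_prev, x_t):
    """One time step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Unroll over a toy sequence of 4 inputs
xs = rng.normal(size=(4, input_dim))
h = np.zeros(hidden_dim)          # initial hidden state: no memory yet
outputs = []
for x_t in xs:
    h, y_t = rnn_step(h, x_t)     # h carries context to the next step
    outputs.append(y_t)
```

Note how `h` is threaded through the loop: that single vector is the network's entire memory of the sequence so far.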
Types of RNN Tasks
┌────────────────────┬───────────────────────────────────────────────┐
│ Task Type          │ Description and Example                       │
├────────────────────┼───────────────────────────────────────────────┤
│ One-to-Many        │ One input → sequence output                   │
│                    │ Example: Image → generates a caption          │
│ Many-to-One        │ Sequence input → one output                   │
│                    │ Example: Movie review text → Positive/Negative│
│ Many-to-Many       │ Sequence input → sequence output (same length)│
│ (synchronized)     │ Example: Classify each word in a sentence     │
│ Many-to-Many       │ Sequence input → sequence output (diff length)│
│ (encoder-decoder)  │ Example: English sentence → French sentence   │
└────────────────────┴───────────────────────────────────────────────┘
The Vanishing Gradient Problem in RNNs
Standard RNNs struggle with long sequences.
Problem:
Sentence: "I grew up in India and have spoken _______ my whole life."
The blank should be "Hindi" or another Indian language.
But the relevant clue ("India") appeared many steps earlier.
During backpropagation, the gradient from the output
must flow back through many time steps.
At each step, it is multiplied by factors with magnitude less than 1
(the recurrent weights and the derivative of tanh).
After 20+ steps, the gradient becomes nearly zero.
The network forgets what happened long ago.
Visual:
Step 50 gradient: 0.8
Step 40 gradient: 0.8^10 = 0.107
Step 30 gradient: 0.8^20 = 0.012
Step 20 gradient: 0.8^30 = 0.0012
Step 10 gradient: 0.8^40 = 0.0001 ← nearly zero
Early steps learn almost nothing from the final output.
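The decay shown above is just repeated multiplication, which a few lines make concrete. The per-step factor of 0.8 is the same illustrative value used in the figures, not a measured quantity.

```python
# Gradient magnitude after flowing back n steps, if each
# backward step scales the gradient by an illustrative factor of 0.8
factor = 0.8
for n in [0, 10, 20, 30, 40]:
    print(f"{n:2d} steps back: {factor ** n:.4f}")
```

After 40 steps the signal is about four orders of magnitude smaller than at the output, which is why distant words like "India" contribute almost nothing to the update.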
LSTM: Long Short-Term Memory
LSTM is a special type of RNN designed to solve the vanishing gradient problem. It uses a more complex cell structure with special gates that control what information to remember, what to forget, and what to output at each step.
LSTM Cell Structure
An LSTM cell has two streams of information:
Cell State (C): Long-term memory — like a conveyor belt
Hidden State (h): Short-term working memory — like active RAM
Four components inside each LSTM cell:
1. Forget Gate:
"What part of the long-term memory should be erased?"
Output: Value between 0 (forget completely) and 1 (keep fully)
f_t = sigmoid(W_f × [h_t-1, x_t] + b_f)
2. Input Gate:
"What new information should be added to long-term memory?"
Decides which parts of the candidate values to store.
i_t = sigmoid(W_i × [h_t-1, x_t] + b_i)
C̃_t = tanh(W_c × [h_t-1, x_t] + b_c) ← candidate values
3. Cell State Update:
Old memory is partially forgotten, new info is added.
C_t = f_t × C_t-1 + i_t × C̃_t
4. Output Gate:
"What should be sent as output and new hidden state?"
o_t = sigmoid(W_o × [h_t-1, x_t] + b_o)
h_t = o_t × tanh(C_t)
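The four components above can be sketched as one forward step. This follows the equations in the text; the stacked weight matrix `W` (producing all four pre-activations in one product) and the dimensions are implementation choices for the sketch, not part of the LSTM definition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_{t-1}, x_t] to all four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)        # order f, i, o, g is this sketch's choice
    f_t = sigmoid(f)                   # forget gate: what to erase from C
    i_t = sigmoid(i)                   # input gate: what new info to store
    o_t = sigmoid(o)                   # output gate: what to expose as h
    c_tilde = np.tanh(g)               # candidate values
    c_t = f_t * c_prev + i_t * c_tilde # cell state update
    h_t = o_t * np.tanh(c_t)          # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
h = c = np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c, W, b)
```

The key line is the cell state update: it is additive, so gradients can flow along `C` without being squashed at every step, which is what eases the vanishing gradient problem.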
LSTM Cell Diagram:

  C_t-1 ────×──────────+────────────────────────→ C_t
            │          │
         forget      input
          gate        gate
            │          │
  h_t-1 ──┬─┴──────────┴────────[output gate]───→ h_t
          │
  x_t ────┘
LSTM Example: Sentiment Analysis
Input sentence: "The movie was not great."

Standard RNN might say: "great" → positive sentiment (wrong)
LSTM remembers: "not" from earlier → "not great" → negative

Processing:
  t=1: "The"   → h1, C1 (forget gate keeps little, input adds "the")
  t=2: "movie" → h2, C2 (stores "movie" context)
  t=3: "was"   → h3, C3
  t=4: "not"   → h4, C4 (LSTM strongly stores the negation word)
  t=5: "great" → h5, C5 (C4 says "NOT" → modifies meaning of "great")

Output: Negative sentiment ✓

The cell state "remembered" the negation across multiple steps. A standard RNN would forget it.
GRU: Gated Recurrent Unit
GRU is a simplified version of LSTM with only two gates:

Reset Gate:  How much of the past hidden state to forget
Update Gate: How much of the past hidden state to carry forward

GRU vs LSTM:
┌────────────────────────┬──────────────┬──────────────────┐
│ Feature                │ LSTM         │ GRU              │
├────────────────────────┼──────────────┼──────────────────┤
│ Number of gates        │ 3            │ 2                │
│ Parameters             │ More         │ Fewer            │
│ Training speed         │ Slower       │ Faster           │
│ Performance            │ Slightly     │ Competitive,     │
│                        │ better on    │ better on        │
│                        │ long seqs    │ smaller datasets │
└────────────────────────┴──────────────┴──────────────────┘

Use GRU when: the dataset is smaller or training time matters.
Use LSTM when: sequences are very long and maximum accuracy is needed.
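A GRU forward step can be sketched the same way as the LSTM. This follows the two-gate description in the text, with the update gate written so that z_t controls how much of the past state is carried forward; the weight shapes and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, Wh, br, bz, bh):
    """One GRU step with reset gate r_t and update gate z_t."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ hx + br)        # reset gate: how much past to forget
    z_t = sigmoid(Wz @ hx + bz)        # update gate: how much past to keep
    # Candidate state, computed from the *reset* past state
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]) + bh)
    # Blend: z_t of the old state carried forward, the rest replaced
    return z_t * h_prev + (1.0 - z_t) * h_tilde

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
Wr = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
Wz = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
br = bz = bh = np.zeros(hidden_dim)
h = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim),
             Wr, Wz, Wh, br, bz, bh)
```

Note there is no separate cell state: the GRU folds long-term and working memory into the single vector h, which is where its parameter savings come from.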
Bidirectional RNN
Standard RNN reads a sequence left to right only.
In some tasks, context from the FUTURE also matters.
Example: Named Entity Recognition
"Apple launched a new product in the California store."
To know "Apple" is a company (not the fruit),
seeing "launched a new product" AFTER "Apple" helps.
Bidirectional RNN:
Forward RNN: reads sequence left → right (past context)
Backward RNN: reads sequence right → left (future context)
Outputs from both directions are concatenated at each step.
→ Better for tasks where full context matters (translation,
named entity recognition, text classification).
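The forward/backward/concatenate scheme above can be sketched with two independent copies of a simple RNN. The weights and dimensions are illustrative; the one essential detail is re-aligning the backward pass so both directions describe the same time step.

```python
import numpy as np

def run_rnn(xs, W_h, W_x, b):
    """Unroll a simple tanh RNN over xs; return the hidden state at each step."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)                      # shape (T, hidden_dim)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 4, 8
xs = rng.normal(size=(T, input_dim))
b = np.zeros(hidden_dim)

# Separate weights for each direction
Wf_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
Wf_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
Wb_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
Wb_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))

h_fwd = run_rnn(xs, Wf_h, Wf_x, b)             # past context at each step
h_bwd = run_rnn(xs[::-1], Wb_h, Wb_x, b)[::-1] # future context, re-aligned
h_bi  = np.concatenate([h_fwd, h_bwd], axis=1) # (T, 2 * hidden_dim)
```

Each row of `h_bi` now summarizes everything before and after that position, which is why the representation at "Apple" can reflect the later words "launched a new product".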
Applications of RNN and LSTM
┌─────────────────────────────┬───────────────────────────────────┐
│ Application                 │ How RNN/LSTM Is Used              │
├─────────────────────────────┼───────────────────────────────────┤
│ Sentiment Analysis          │ Read full review → positive/neg   │
│ Machine Translation         │ Encode source → decode to target  │
│ Speech Recognition          │ Audio sequence → text sequence    │
│ Text Generation             │ Predict next word from context    │
│ Stock Price Forecasting     │ Past prices → future prediction   │
│ Video Captioning            │ Frame sequence → description text │
│ Music Generation            │ Past notes → next note sequence   │
└─────────────────────────────┴───────────────────────────────────┘
The Rise of Transformers
RNNs and LSTMs dominated sequence tasks until 2017. The Transformer architecture (Google, 2017) changed this.

Transformer advantages over LSTM:
✓ Processes entire sequence in parallel (not step-by-step)
✓ Handles much longer sequences efficiently
✓ Self-attention directly connects any two positions
✓ Scales much better with data and compute

Transformers power: BERT, GPT, ChatGPT, T5, and most modern NLP systems.

LSTMs are still used for:
✓ Time series with limited compute resources
✓ Edge devices (smaller models)
✓ Problems requiring streaming sequential output
