Deep Learning: Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) networks address the main weakness of standard RNNs: they lose track of context from many steps earlier. LSTMs carry two separate memory channels, a hidden state for short-term details and a cell state for long-term patterns, which lets them retain relevant information across much longer sequences.

The Core Problem: Why RNNs Forget

Read this sentence: "Maria, who moved to Spain at age 9 after her family relocated from Brazil, later studied medicine and became a surgeon. She now works in Madrid."

By the time a standard RNN reaches "She," it may no longer hold the information that "She" refers to Maria. The hidden state is rewritten at every step, so over a long sequence newer inputs gradually overwrite older information.
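One way to see this fading is to measure how much a change in the very first hidden state still affects a later one. The sketch below uses a one-unit RNN, h_t = tanh(w * h_{t-1} + u * x_t). The weights w and u and the constant input are arbitrary values chosen for illustration, not taken from a trained model.

```python
import math

def influence_of_first_state(inputs, w=0.9, u=0.5):
    """Return |d h_t / d h_0| after each step of a one-unit tanh RNN."""
    h = 0.0
    sensitivity = 1.0  # d h_0 / d h_0
    history = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        sensitivity *= w * (1.0 - h * h)
        history.append(abs(sensitivity))
    return history

history = influence_of_first_state([1.0] * 30)
print(f"after  1 step : {history[0]:.4f}")
print(f"after 30 steps: {history[-1]:.2e}")
```

Each step multiplies the sensitivity by a factor below 1, so the influence of the earliest state shrinks geometrically with sequence length.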

LSTMs fix this by maintaining two parallel tracks of information:

Cell State (C_t) — long-term memory (like a notepad you carry around)
Hidden State (h_t) — short-term working memory (what you are focused on right now)

The Three Gates

LSTMs control information flow using three gates. Each gate is a small neural network layer with a sigmoid activation, so every output lies between 0 and 1 and acts like a valve: values near 0 block information, values near 1 let it through.
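The valve behaviour can be sketched in a few lines. This is an illustration only: the function names and the example scores are made up, and a real gate computes its scores from learned weights.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(scores, values):
    """Scale each value by its gate activation."""
    gate = [sigmoid(s) for s in scores]
    return [g * v for g, v in zip(gate, values)]

# Large negative score -> gate near 0 (blocked); large positive -> near 1 (passed).
gated = apply_gate(scores=[-6.0, 0.0, 6.0], values=[2.0, 2.0, 2.0])
print([round(v, 3) for v in gated])  # prints [0.005, 1.0, 1.995]
```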

Gate Overview Diagram

Previous cell state (C_{t-1})
        │
        ├──→ [Forget Gate] ──→ What to erase from memory?
        │
        ├──→ [Input Gate]  ──→ What new information to add?
        │
        ↓
Updated cell state (C_t)
        │
        └──→ [Output Gate] ──→ What to pass as the hidden state?
                                        │
                                    h_t (output)

1. Forget Gate

The forget gate decides what portion of the long-term memory to erase.

Sentence: "John is a good chef. Maria is a doctor."

After processing "Maria," the forget gate signals:
→ Forget "John" context (it is no longer relevant)
→ Keep general sentence structure (still useful)

Forget gate output = 0.1 for "John features" → nearly erased
                   = 0.9 for "sentence structure" → mostly kept
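In the standard LSTM formulation the forget gate is f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f), and it is multiplied elementwise into the previous cell state. A minimal sketch of that multiplication, using the gate values from the example above and a hypothetical two-slot cell state:

```python
def apply_forget_gate(prev_cell, forget_gate):
    """Elementwise f_t * C_{t-1}: scale each memory slot by its forget value."""
    return [f * c for f, c in zip(forget_gate, prev_cell)]

# Hypothetical slots: [John features, sentence structure]
prev_cell = [0.8, 0.5]
forget_gate = [0.1, 0.9]  # gate values from the example above

kept = apply_forget_gate(prev_cell, forget_gate)
print([round(v, 2) for v in kept])
```

The "John" slot shrinks to a tenth of its value while the "sentence structure" slot is almost unchanged.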

2. Input Gate

The input gate decides what new information to write into long-term memory.

New word: "doctor"

Input gate:
→ "Maria" is the subject → write this association
→ "doctor" is her role → write this into memory
→ Earlier sentence structure → do not rewrite (already stored)

New information added to cell state:
  Maria + doctor → profession association stored
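In the standard formulation the write step is C_t = f_t * C_{t-1} + i_t * g_t, where g_t = tanh(...) is the candidate content and i_t is the input gate. The sketch below assumes the forget gate has already been applied; the slot meanings and all numbers are hypothetical.

```python
import math

def write_to_cell(kept_cell, input_gate, candidate_scores):
    """Add gated candidate content to the cell state, elementwise."""
    candidate = [math.tanh(s) for s in candidate_scores]  # g_t in (-1, 1)
    return [k + i * g for k, i, g in zip(kept_cell, input_gate, candidate)]

# Hypothetical slots: [profession association, sentence structure]
kept_cell = [0.0, 0.45]        # cell state after the forget gate
input_gate = [0.95, 0.05]      # write strongly to slot 0, barely touch slot 1
candidate_scores = [2.0, 2.0]  # proposed new content before tanh squashing

new_cell = write_to_cell(kept_cell, input_gate, candidate_scores)
print([round(v, 3) for v in new_cell])
```

Because the update is additive, slots with an input gate near 0 keep what they already stored.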

3. Output Gate

The output gate decides what part of the updated long-term memory to expose as the hidden state — the working memory used for this time step's prediction.

Current task: predict next word after "Maria is a..."

Output gate:
→ Expose "Maria" context → yes, relevant
→ Expose "doctor" context → yes, it is being described
→ Expose old "chef" context → no, not relevant now

h_t = filtered version of C_t focused on current task
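In the standard formulation this filtering is h_t = o_t * tanh(C_t). A small sketch with hypothetical slots and values:

```python
import math

def expose_hidden(cell, output_gate):
    """h_t = o_t * tanh(C_t), elementwise."""
    return [o * math.tanh(c) for o, c in zip(output_gate, cell)]

# Hypothetical slots: [Maria, doctor, chef]
cell = [1.2, 0.9, 0.7]
output_gate = [0.9, 0.9, 0.05]  # hide the "chef" slot for this prediction

hidden = expose_hidden(cell, output_gate)
print([round(v, 3) for v in hidden])
```

The output gate only hides a slot from the hidden state. The cell state itself is left untouched, so the information can still be exposed at a later step.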

The Full LSTM Cell

                 ┌───────────────────────────────┐
Previous h_{t-1} │                               │
Previous C_{t-1} │   ┌──────────┐                │
     ↓           │   │ Forget   │                │
     ├──────────→│   │  Gate    │──→ C_t──────→  │→ C_t (next step)
     │           │   └──────────┘        ↓       │
     │           │   ┌──────────┐   ┌──────────┐ │
     │           │   │  Input   │   │  Output  │ │
     └──────────→│   │  Gate    │   │  Gate    │ │→ h_t (output)
Current x_t      │   └──────────┘   └──────────┘ │
                 └───────────────────────────────┘
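Putting the three gates together, one time step of the standard LSTM formulation can be sketched in plain Python. This is a teaching sketch, not how a framework implements it: the parameter names (W_f, b_f, and so on), the random initialisation, and the tiny sizes are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vector):
    """Return W @ vector + b for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns (h_t, c_t)."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (illustrative only)."""
    rng = random.Random(seed)
    cols = hidden_size + input_size
    params = {}
    for name in ("f", "i", "o", "c"):
        params[f"W_{name}"] = [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                               for _ in range(hidden_size)]
        params[f"b_{name}"] = [0.0] * hidden_size
    return params

params = init_params(input_size=3, hidden_size=4)
h, c = [0.0] * 4, [0.0] * 4
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(len(h), len(c))  # prints 4 4
```

Both h and c are carried from one step to the next, which matches the two arrows leaving the cell in the diagram.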

LSTM vs Standard RNN

Feature	Standard RNN	LSTM
Memory channels	1 (hidden state only)	2 (cell state + hidden state)
Long-term memory	Weak — fades quickly	Strong — preserved by cell state
Vanishing gradients	Severe problem	Largely mitigated by the gated, additive cell state
Complexity	Simple	More parameters to train
Training speed	Faster	Slower

GRU: A Simpler Alternative

The Gated Recurrent Unit (GRU) is a streamlined version of the LSTM. It merges the forget and input gates into a single update gate and removes the separate cell state. Because a GRU has fewer parameters, it often trains faster than an LSTM while reaching similar performance on many tasks.

LSTM: Forget Gate + Input Gate + Output Gate + Cell State + Hidden State
GRU:  Update Gate + Reset Gate + Hidden State only

GRU is usually faster to train.
LSTM can perform better on very long sequences, though results vary by task.
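The merged gate is easiest to see in code. The sketch below implements one GRU step for a single hidden unit using one common convention, h_t = (1 - z) * h_{t-1} + z * candidate (some references swap the roles of z and 1 - z). The weight names and values are hypothetical. The parameter-count helper assumes one bias vector per weight block; some implementations use extra bias terms, so framework-reported counts can differ slightly.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step for a single hidden unit and a scalar input."""
    z = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x_t + p["u_r"] * h_prev + p["b_r"])  # reset gate
    candidate = math.tanh(p["w_h"] * x_t + p["u_h"] * (r * h_prev) + p["b_h"])
    # One gate does both jobs: (1 - z) keeps old state, z writes new content.
    return (1.0 - z) * h_prev + z * candidate

def recurrent_param_count(input_size, hidden_size, weight_blocks):
    """Input weights, recurrent weights, and one bias vector per block."""
    return weight_blocks * (hidden_size * (input_size + hidden_size) + hidden_size)

params = {"w_z": 0.5, "u_z": 0.5, "b_z": 0.0,
          "w_r": 0.5, "u_r": 0.5, "b_r": 0.0,
          "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}
h = 0.0
for x_t in [1.0, 0.5, -0.5]:
    h = gru_step(x_t, h, params)

lstm_count = recurrent_param_count(16, 64, weight_blocks=4)  # f, i, o, candidate
gru_count = recurrent_param_count(16, 64, weight_blocks=3)   # z, r, candidate
print(lstm_count, gru_count)  # prints 20736 15552
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, which is one reason it tends to train faster.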

Real-World LSTM Applications

Machine Translation — Google Translate's original neural engine used LSTM encoders and decoders
Speech-to-Text — audio frames feed into LSTM networks that decode words
Text Generation — LSTMs generate coherent paragraphs by modeling long-range writing patterns
Financial Forecasting — LSTMs process multi-year stock history to identify long-range trends
Healthcare Monitoring — patient vital signs over time feed LSTM models that detect deterioration patterns

A Code Example

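from tensorflow import keras
from tensorflow.keras import layers

# Example input dimensions (placeholders; use the shape of your own data)
timesteps = 30  # sequence length
features = 8    # values per time step

model = keras.Sequential([
    keras.Input(shape=(timesteps, features)),
    # Emits one output per time step so the next LSTM layer receives a sequence
    layers.LSTM(64, return_sequences=True),
    # Returns only the final hidden state
    layers.LSTM(32),
    # Single sigmoid unit for binary classification
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])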

return_sequences=True passes the output at every time step to the next LSTM layer, not just the final step. Stack LSTM layers this way to build deeper sequence models.
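To get a feel for the size of this stacked model, the trainable parameters can be counted by hand. The sketch below is plain Python and does not need TensorFlow. It assumes features = 8, a hypothetical value, and uses the standard LSTM layout of four weight blocks, each with input weights, recurrent weights, and a bias vector.

```python
def lstm_params(input_dim, units):
    """Four blocks (forget, input, output, candidate) of weights plus biases."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus bias vector."""
    return input_dim * units + units

features = 8  # hypothetical number of values per time step

first_lstm = lstm_params(features, 64)  # LSTM(64) reads the raw features
second_lstm = lstm_params(64, 32)       # LSTM(32) reads the 64-dim outputs
head = dense_params(32, 1)              # Dense(1) reads the final 32-dim state
total = first_lstm + second_lstm + head
print(first_lstm, second_lstm, head, total)  # prints 18688 12416 33 31137
```

The number of time steps does not appear in the count, because the same weights are reused at every step of the sequence.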

Key Terms

LSTM — Long Short-Term Memory — an RNN architecture with gated memory control
Cell State — the long-term memory pathway through the LSTM
Forget Gate — decides what to erase from long-term memory
Input Gate — decides what new information to store
Output Gate — decides what to expose from memory at this step
GRU — Gated Recurrent Unit — a simpler, faster alternative to LSTM
