Deep Learning Long Short-Term Memory Networks
Long Short-Term Memory (LSTM) networks address a central weakness of standard RNNs: forgetting. Standard RNNs lose track of context from many steps ago. LSTMs carry two separate memory channels, a hidden state for short-term details and a cell state for long-term patterns, which lets them retain relevant information across much longer sequences.
The Core Problem: Why RNNs Forget
Read this sentence: "Maria, who moved to Spain at age 9 after her family relocated from Brazil, later studied medicine and became a surgeon. She now works in Madrid."
By the time a standard RNN reaches "She," it may have largely lost the information that "She" refers to Maria. The hidden state is rewritten at every time step, so over long sequences new information progressively overwrites old information.
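To see how quickly early inputs fade, here is a minimal sketch of a one-unit RNN. The weights `w` and `u`, the input values, and the function name `sensitivity_to_first_input` are illustrative assumptions, not values from a trained model. The function tracks how strongly the final hidden state still depends on the very first input.

```python
import math

def sensitivity_to_first_input(w, u, xs):
    """Return d h_T / d x_0 for the scalar RNN h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = 0.0
    grad = 0.0
    for t, x in enumerate(xs):
        h = math.tanh(w * h + u * x)
        local = 1.0 - h * h          # derivative of tanh at this step
        if t == 0:
            grad = local * u         # direct effect of x_0 on h_0
        else:
            grad = grad * local * w  # chain rule through one more time step
    return grad

# Illustrative weights and inputs (assumptions, not taken from a trained model)
xs = [1.0] + [0.5] * 49
near = sensitivity_to_first_input(0.9, 1.0, xs[:5])  # sequence of length 5
far = sensitivity_to_first_input(0.9, 1.0, xs)       # sequence of length 50
print(f"length 5: {near:.3e}, length 50: {far:.3e}")
```

Each extra step multiplies the sensitivity by a factor below 1, so the first input's influence shrinks roughly exponentially with sequence length. The same effect weakens the gradients used for training.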
LSTMs fix this by maintaining two parallel tracks of information:
- Cell State (C_t) — long-term memory (like a notepad you carry around)
- Hidden State (h_t) — short-term working memory (what you are focused on right now)
The Three Gates
LSTMs control information flow using three gates. Each gate is a small neural network layer with a sigmoid activation, so its outputs lie between 0 and 1 and act like a valve: 0 = block everything, 1 = let everything through.
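The valve idea can be sketched in a few lines. The helper name `apply_gate` and the numbers below are made up for illustration. A gate is a sigmoid output multiplied element by element with the values it controls.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each value by its gate activation (element-wise product)."""
    gate = [sigmoid(z) for z in gate_logits]
    return [g * v for g, v in zip(gate, values)]

memory = [2.0, -1.5, 0.8]
logits = [-10.0, 0.0, 10.0]  # strongly closed, half open, strongly open
filtered = apply_gate(logits, memory)
print(filtered)
```

The first value is almost entirely blocked, the second is halved, and the third passes through nearly unchanged.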
Gate Overview Diagram
Previous cell state (C_{t-1})
│
├──→ [Forget Gate] ──→ What to erase from memory?
│
├──→ [Input Gate] ──→ What new information to add?
│
↓
Updated cell state (C_t)
│
└──→ [Output Gate] ──→ What to pass as the hidden state?
│
h_t (output)
1. Forget Gate
The forget gate decides what portion of the long-term memory to erase.
Sentence: "John is a good chef. Maria is a doctor."
After processing "Maria," the forget gate signals:
→ Forget "John" context (it is no longer relevant)
→ Keep general sentence structure (still useful)
Forget gate output:
→ 0.1 for "John features" → nearly erased
→ 0.9 for "sentence structure" → mostly kept
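As a rough sketch of the arithmetic, the forget gate is commonly written as f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f). The two-slot cell state, the one-number "embedding" for "Maria," and the hand-picked weights below are all hypothetical, chosen only to reproduce the 0.1 and 0.9 values in the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state slot."""
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

# Hypothetical setup: slot 0 ~ "John features", slot 1 ~ "sentence structure"
c_prev = [1.0, 1.0]
h_prev = [0.5, -0.5]
x_t = [1.0]                   # stand-in embedding for "Maria"
W_f = [[0.0, 0.0, -2.2],      # a new subject pushes slot 0 toward "forget"
       [0.0, 0.0, 2.2]]       # and pushes slot 1 toward "keep"
b_f = [0.0, 0.0]

f_t = forget_gate(W_f, b_f, h_prev, x_t)
c_kept = [f * c for f, c in zip(f_t, c_prev)]  # what survives in the cell state
print(f_t, c_kept)
```

In a trained network these weights are learned, and the slots do not map to neat concepts like "John." The point is only that the gate scales each slot of the old cell state independently.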
2. Input Gate
The input gate decides what new information to write into long-term memory.
New word: "doctor"
Input gate:
→ "Maria" is the subject → write this association
→ "doctor" is her role → write this into memory
→ Earlier sentence structure → do not rewrite (already stored)
New information added to cell state:
Maria + doctor → profession association stored
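In the usual formulation, the write step has two parts: a tanh layer proposes candidate values, and the sigmoid input gate decides how much of each candidate to add. The sketch below skips the weight matrices and starts from assumed pre-activation numbers; `write_to_cell` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def write_to_cell(c_kept, input_logits, candidate_logits):
    """Add gated candidate values to what the forget gate left in the cell state."""
    i_t = [sigmoid(z) for z in input_logits]            # how much to write, in (0, 1)
    c_tilde = [math.tanh(z) for z in candidate_logits]  # what to write, in (-1, 1)
    return [c + i * g for c, i, g in zip(c_kept, i_t, c_tilde)]

# Assumed numbers: slot 0 should receive new content, slot 1 should be left alone
c_kept = [0.1, 0.9]
c_t = write_to_cell(c_kept, input_logits=[4.0, -4.0], candidate_logits=[2.0, 2.0])
print(c_t)
```

Slot 0 receives almost the full candidate value, while slot 1 barely changes because its input gate is nearly closed. Note that the update is an addition to the cell state, not a full rewrite.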
3. Output Gate
The output gate decides what part of the updated long-term memory to expose as the hidden state — the working memory used for this time step's prediction.
Current task: predict next word after "Maria is a..."
Output gate:
→ Expose "Maria" context → yes, relevant
→ Expose "doctor" context → yes, it is being described
→ Expose old "chef" context → no, not relevant now
h_t = filtered version of C_t focused on current task
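A common way to write this filtering step is h_t = o_t * tanh(C_t). The three-slot cell state and gate pre-activations below are invented for illustration, with the slots loosely standing for "Maria," "doctor," and "chef."

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expose(c_t, output_logits):
    """h_t = o_t * tanh(C_t): squash the cell state, then gate each slot."""
    o_t = [sigmoid(z) for z in output_logits]
    return [o * math.tanh(c) for o, c in zip(o_t, c_t)]

# Assumed values: slots loosely stand for "Maria", "doctor", "chef"
c_t = [1.2, -0.7, 2.0]
h_t = expose(c_t, output_logits=[5.0, 5.0, -5.0])
print(h_t)
```

The "chef" slot still holds a large value in the cell state, but almost none of it reaches the hidden state. The cell state itself is not modified by the output gate.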
The Full LSTM Cell
┌───────────────────────────────┐
Previous h_{t-1} │ │
Previous C_{t-1} │ ┌──────────┐ │
↓ │ │ Forget │ │
├──────────→│ │ Gate │──→ C_t──────→ │→ C_t (next step)
│ │ └──────────┘ ↓ │
│ │ ┌──────────┐ ┌──────────┐ │
│ │ │ Input │ │ Output │ │
└──────────→│ │ Gate │ │ Gate │ │→ h_t (output)
Current x_t │ └──────────┘ └──────────┘ │
└───────────────────────────────┘
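Putting the three gates together, one time step of an LSTM cell can be sketched as follows. This follows the widely used textbook formulation; the parameter names (`W_f`, `b_f`, and so on) and the hand-set weights in the demo are assumptions for illustration, not learned values or any library's internal layout.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x_t                                          # [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]   # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]   # input gate
    g = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)] # candidate values
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]   # output gate
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t

# Demo with hand-set (hypothetical) biases: forget gate open, input gate closed
units = 2
zeros = [[0.0] * (units + 1) for _ in range(units)]
params = {
    "W_f": zeros, "b_f": [10.0] * units,
    "W_i": zeros, "b_i": [-10.0] * units,
    "W_c": zeros, "b_c": [0.0] * units,
    "W_o": zeros, "b_o": [0.0] * units,
}
h, c = [0.0] * units, [1.0, -1.0]
for _ in range(100):
    h, c = lstm_step([0.3], h, c, params)
print(c)
```

With the forget gate held open and the input gate held closed, the cell state is still close to its starting value after 100 steps. That is the behavior a plain RNN hidden state cannot easily reproduce.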
LSTM vs Standard RNN
| Feature | Standard RNN | LSTM |
|---|---|---|
| Memory channels | 1 (hidden state only) | 2 (cell state + hidden state) |
| Long-term memory | Weak — fades quickly | Strong — preserved by cell state |
| Vanishing gradients | Severe problem | Greatly reduced by gating and the additive cell state update |
| Complexity | Simple | More parameters to train |
| Training speed | Faster | Slower |
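The "more parameters" row can be made concrete with a quick count. The sketch assumes the standard formulation with one weight matrix and one bias vector per gate; individual frameworks may differ slightly. The layer sizes are arbitrary examples.

```python
def simple_rnn_params(units, features):
    """One weight matrix over [h_{t-1}, x_t] plus one bias vector."""
    return units * (units + features) + units

def lstm_params(units, features):
    """Four such blocks: forget gate, input gate, candidate values, output gate."""
    return 4 * simple_rnn_params(units, features)

units, features = 64, 8  # arbitrary example sizes
rnn_count = simple_rnn_params(units, features)
lstm_count = lstm_params(units, features)
print(rnn_count, lstm_count)  # prints: 4672 18688
```

For the same number of units, the LSTM layer has four times as many parameters as the plain RNN layer, which accounts for much of the slower training.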
GRU: A Simpler Alternative
The Gated Recurrent Unit (GRU) is a streamlined version of the LSTM. It merges the forget and input gates into a single update gate and removes the separate cell state. GRUs have fewer parameters, so they typically train faster than LSTMs, with comparable performance on many tasks.
LSTM: Forget Gate + Input Gate + Output Gate + Cell State + Hidden State
GRU: Update Gate + Reset Gate + Hidden State only
GRU is usually faster to train. LSTM can perform better on very long sequences.
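For comparison, here is a sketch of a single GRU step. It uses the convention h_t = (1 - z) * h_{t-1} + z * h_candidate; some references swap the roles of z and 1 - z. Parameter names and the demo weights are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def gru_step(x_t, h_prev, p):
    """One GRU time step; there is no separate cell state."""
    concat = h_prev + x_t
    z = [sigmoid(a) for a in affine(p["W_z"], p["b_z"], concat)]  # update gate
    r = [sigmoid(a) for a in affine(p["W_r"], p["b_r"], concat)]  # reset gate
    reset_concat = [rk * hk for rk, hk in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in affine(p["W_h"], p["b_h"], reset_concat)]
    # The update gate blends old state and candidate in a single step
    return [(1 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, h_cand)]

# Hand-set (hypothetical) weights for a 1-unit, 1-feature GRU
base = {"W_z": [[0.0, 0.0]], "W_r": [[0.0, 0.0]], "W_h": [[0.0, 0.0]],
        "b_r": [0.0], "b_h": [0.5]}
keep = dict(base, b_z=[-10.0])       # update gate closed: keep old state
overwrite = dict(base, b_z=[10.0])   # update gate open: take the candidate

kept = gru_step([1.0], [0.8], keep)
replaced = gru_step([1.0], [0.8], overwrite)
print(kept, replaced)
```

The single update gate does the work of the LSTM's forget and input gates: whatever fraction of the old state is dropped is replaced by the same fraction of the candidate.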
Real-World LSTM Applications
- Machine Translation — Google Translate's original neural engine used LSTM encoders and decoders
- Speech-to-Text — audio frames feed into LSTM networks that decode words
- Text Generation — LSTMs generate coherent paragraphs by modeling long-range writing patterns
- Financial Forecasting — LSTMs process multi-year stock history to identify long-range trends
- Healthcare Monitoring — patient vital signs over time feed LSTM models that detect deterioration patterns
A Code Example
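from tensorflow import keras
from tensorflow.keras import layers

# Illustrative input dimensions (assumed): 30 time steps, 8 features per step
timesteps, features = 30, 8

model = keras.Sequential([
    keras.Input(shape=(timesteps, features)),
    # First LSTM layer emits an output at every time step
    layers.LSTM(64, return_sequences=True),
    # Second LSTM layer returns only its final output
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid'),  # single probability for binary classification
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])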
return_sequences=True passes the output at every time step to the next LSTM layer, not just the final step. Stack LSTM layers this way to build deeper sequence models.
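To make the shape difference concrete without depending on any framework, here is a toy recurrent layer with a `return_sequences` flag modeled on the Keras argument. The function `toy_recurrent_layer` and its fixed update rule are stand-ins invented for this sketch. It is not an LSTM and not how Keras is implemented.

```python
import math

def toy_recurrent_layer(sequence, units, return_sequences=False):
    """Run a fixed toy recurrence over a sequence of feature vectors."""
    h = [0.0] * units
    outputs = []
    for x_t in sequence:
        total = sum(x_t)
        # Arbitrary fixed update rule standing in for learned weights
        h = [math.tanh(0.5 * hk + 0.1 * (k + 1) * total) for k, hk in enumerate(h)]
        outputs.append(h)
    # Either every step's output (timesteps x units) or only the last one (units)
    return outputs if return_sequences else outputs[-1]

timesteps, features = 5, 3
sequence = [[0.1 * (t + f) for f in range(features)] for t in range(timesteps)]

all_steps = toy_recurrent_layer(sequence, units=4, return_sequences=True)
last_only = toy_recurrent_layer(sequence, units=4)
stacked = toy_recurrent_layer(all_steps, units=2)  # second layer consumes a sequence

print(len(all_steps), len(all_steps[0]), len(last_only), len(stacked))
```

A second recurrent layer needs one input vector per time step, which is why the first layer in a stack must return the full sequence while the last one can return only its final output.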
Key Terms
- LSTM — Long Short-Term Memory — an RNN architecture with gated memory control
- Cell State — the long-term memory pathway through the LSTM
- Forget Gate — decides what to erase from long-term memory
- Input Gate — decides what new information to store
- Output Gate — decides what to expose from memory at this step
- GRU — Gated Recurrent Unit — a simpler, faster alternative to LSTM
