Deep Learning Transfer Learning

Transfer learning lets you take a model that was trained on one large task and reuse it for a different, related task. Instead of training from scratch — which requires massive datasets and expensive compute — you start with a model that already understands language, images, or audio, and then specialize it for your specific problem with far less data and time.

The Core Idea

When a model trains on millions of images, it develops the ability to detect edges, textures, shapes, and complex visual patterns. These abilities are general — they are useful for recognizing cats, diagnosing tumors, and identifying car defects, not just the original task. Transfer learning reuses these general abilities.

Human Analogy

A surgeon learns anatomy, surgical technique, and hand precision over years.

Transfer to a new specialty:
  Surgeon trained in general surgery
  → Transfers skills to cardiac surgery
  → Does NOT relearn how to hold a scalpel

Deep Learning equivalent:
  Model trained on ImageNet (1.2M images, 1000 categories)
  → Transfers visual understanding to chest X-ray diagnosis
  → Does NOT relearn edge detection or shape recognition

How Transfer Learning Works

The Two-Stage Process

STAGE 1: Pre-training
  Large model → Large dataset → General skills learned
  Example: ResNet trained on 1.2 million ImageNet photos

STAGE 2: Fine-tuning
  Same model → Your small dataset → Task-specific knowledge added
  Example: Same ResNet fine-tuned on 500 chest X-rays to detect pneumonia

Architecture During Fine-Tuning

PRE-TRAINED MODEL:
  [Conv Layer 1] → [Conv Layer 2] → [Conv Layer 3] → [Dense] → [1000 classes]
        ↑               ↑               ↑               ↑
     FROZEN            FROZEN          FROZEN          FROZEN
   (keeps learned features from ImageNet)

AFTER SWAPPING HEAD:
  [Conv Layer 1] → [Conv Layer 2] → [Conv Layer 3] → [New Dense] → [2 classes]
        ↑               ↑               ↑                   ↑
     FROZEN            FROZEN          FROZEN            TRAINABLE
                                                    (learns your task)

The early layers detect general features (edges, textures). You freeze them and train only the new classification head on your own data. This is far cheaper than training all 50 layers of a network like ResNet-50 from scratch.
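The freeze-then-swap pattern in the diagram can be sketched in PyTorch as follows. This is a minimal sketch, not a definitive recipe: `TinyNet` is a hypothetical stand-in with random weights so the example runs without downloading anything. In practice you would load a real pre-trained network (for example a ResNet from torchvision) in its place.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained network: three conv layers plus a
# 1000-class head. A real workflow would load pre-trained weights instead.
class TinyNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.head(self.features(x))

model = TinyNet()

# 1. Freeze every existing parameter.
for p in model.parameters():
    p.requires_grad = False

# 2. Swap the head. Newly created layers are trainable by default.
model.head = nn.Linear(model.head.in_features, 2)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['head.weight', 'head.bias']

logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 2])
```

Only the two head tensors are left trainable, which matches the TRAINABLE column in the second diagram.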

Frozen vs Fine-Tuned Layers

Strategy	Layers Updated	When to Use
Feature Extraction	Only the new head	Very small dataset, task similar to pre-training
Partial Fine-Tuning	Head + last few layers	Medium dataset, moderate task difference
Full Fine-Tuning	Entire network	Large dataset, task very different from pre-training
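The three strategies in the table differ only in which layers are allowed to update. The sketch below expresses that as a framework-agnostic plan. The function name `plan_trainable`, the strategy strings, and the `last_n` parameter are all hypothetical names chosen for illustration. The resulting flags could be applied to something like `requires_grad` in PyTorch.

```python
def plan_trainable(layers, strategy, last_n=2):
    """Return {layer_name: is_trainable}.

    `layers` is ordered from input to output; the final entry is the new head,
    which is trained under every strategy.
    """
    *backbone, head = layers
    if strategy == "feature_extraction":
        unfrozen = set()                      # only the new head updates
    elif strategy == "partial":
        # Guard last_n == 0, because backbone[-0:] would select everything.
        unfrozen = set(backbone[-last_n:]) if last_n > 0 else set()
    elif strategy == "full":
        unfrozen = set(backbone)              # the entire network updates
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    plan = {name: name in unfrozen for name in backbone}
    plan[head] = True
    return plan

layers = ["conv1", "conv2", "conv3", "conv4", "head"]
results = {}
for strategy in ("feature_extraction", "partial", "full"):
    plan = plan_trainable(layers, strategy)
    results[strategy] = [name for name, on in plan.items() if on]
    print(strategy, results[strategy])
# feature_extraction ['head']
# partial ['conv3', 'conv4', 'head']
# full ['conv1', 'conv2', 'conv3', 'conv4', 'head']
```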

Popular Pre-Trained Models

For Images (Computer Vision)

Model	Trained On	Common Use
ResNet-50	ImageNet (1.2M images)	Image classification, feature extraction
VGG16	ImageNet	Image classification, style transfer
EfficientNet	ImageNet	High accuracy with small compute budget
CLIP	Image-text pairs	Image search, zero-shot classification

For Text (Natural Language Processing)

Model	Trained On	Common Use
BERT	Books + Wikipedia	Sentiment, classification, Q&A
GPT-2 / GPT-3	Internet text	Text generation, completion
RoBERTa	Large text corpus	Robust sentence understanding
T5	Large text corpus	Translation, summarization, classification

A Practical Transfer Learning Example

Task: Classify Flower Species (Only 500 Photos)

Without Transfer Learning:
  500 photos → Train ResNet-50 from scratch
  → Model has no prior knowledge
  → 500 photos is far too few → 45% accuracy (illustrative figure)

With Transfer Learning:
  Load ResNet-50 pre-trained on ImageNet
  → Freeze all layers except the final classifier
  → Replace: [1000-class head] with [5-class flower head]
  → Train only the new head on 500 photos
  → Pre-trained visual features are already expert at shapes, textures
  → 91% accuracy from the same 500 training photos (illustrative figure)
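The "train only the new head" step above might look like the following PyTorch loop. It is a sketch under stated assumptions: `backbone` is a small randomly initialized stand-in for the frozen ResNet-50 feature extractor, and the images and labels are random tensors standing in for the flower photos, so it demonstrates the mechanics only and says nothing about accuracy.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a frozen pre-trained feature extractor (hypothetical, random weights).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # keeps layers such as batch norm or dropout in inference mode

head = nn.Linear(16, 5)  # new 5-class flower head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # optimizer sees the head only
loss_fn = nn.CrossEntropyLoss()

# Random tensors standing in for one batch of labeled flower photos.
images = torch.randn(32, 3, 64, 64)
labels = torch.randint(0, 5, (32,))

backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

for step in range(20):
    with torch.no_grad():          # no gradients are needed for the frozen part
        feats = backbone(images)
    loss = loss_fn(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

backbone_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(backbone_before, backbone.parameters())
)
head_changed = not torch.equal(head_before, head.weight)
print(backbone_unchanged, head_changed)  # True True
```

Passing only `head.parameters()` to the optimizer is what keeps the pre-trained features untouched. With a real dataset you would also hold out some photos to measure accuracy.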

Domain Adaptation

Sometimes the pre-training domain and your target domain are different. A model trained on natural photographs of animals may not transfer perfectly to medical scans. In such cases, you fine-tune more layers with a low learning rate to gently shift the model's representations toward your domain.

Source domain: Natural photos (animals, objects, scenes)
Target domain: Chest X-rays (grayscale, medical, specialized)

Adaptation strategy:
  1. Load ImageNet-pre-trained model
  2. Fine-tune ALL layers with a very small learning rate (1e-5)
  3. Use your labeled X-ray dataset
  4. Model gradually shifts its visual vocabulary toward medical features

Result: Often better than training from scratch, even when the domains differ substantially
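Step 2 of the adaptation strategy (all layers trainable, very small learning rate) can be sketched like this. The model is a small hypothetical stand-in rather than a real ImageNet network, and the scans are random tensors. Repeating the grayscale channel three times is one common way to match a 3-channel input layer, assumed here for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in for an ImageNet-pre-trained network (expects 3 input channels).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)

# Every layer stays trainable; the small learning rate limits how far each step moves.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in for grayscale scans, repeated to 3 channels to match the input layer.
scans = torch.randn(8, 1, 64, 64).repeat(1, 3, 1, 1)
labels = torch.randint(0, 2, (8,))

before = [p.detach().clone() for p in model.parameters()]

loss = loss_fn(model(scans), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Largest change to any single weight after one update.
max_shift = max(
    (p.detach() - b).abs().max().item()
    for p, b in zip(model.parameters(), before)
)
print(f"largest weight change after one step: {max_shift:.2e}")
```

With Adam, a single step moves each weight by at most about the learning rate, so the pre-trained representations shift gradually instead of being overwritten.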

Zero-Shot and Few-Shot Learning

The most powerful pre-trained models can handle tasks they were never explicitly trained on.

Zero-shot: No task-specific training examples needed
  Example: CLIP can label an image as "a photo of a mango" without being fine-tuned on a mango dataset
  → It compares the image against candidate text descriptions using its joint understanding of language + images
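Zero-shot classification with an image-text model is commonly done by comparing one image embedding against the embeddings of candidate text labels. The sketch below shows that scoring step with NumPy. The 4-dimensional embeddings are made up for illustration (a real model would produce them from the image and the label strings), and the function name and `temperature` value are assumptions.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """Softmax over scaled cosine similarities between one image and N labels."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (text_embs @ image_emb)
    exp = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return exp / exp.sum()

labels = ["a photo of a mango", "a photo of an apple", "a photo of a banana"]

# Made-up embeddings; in practice these come from the model's text and image encoders.
text_embs = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9, 0.1],
])
image_emb = np.array([0.8, 0.2, 0.1, 0.3])

probs = zero_shot_scores(image_emb, text_embs)
best = labels[int(np.argmax(probs))]
print(best)  # a photo of a mango
```

No classifier is trained here. Adding a new class only means adding another text label and its embedding.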

Few-shot: Only a handful of examples needed
  Example: GPT-3 translates a sentence after seeing just 3 example translations in its prompt
  → No weights are updated; general language understanding does the heavy lifting
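In the few-shot case the examples are typically placed in the prompt itself. A minimal sketch of building such a prompt follows. The function name, the prompt layout, and the English-to-French pairs are illustrative assumptions, and the resulting string would be sent to whichever language model you use.

```python
def build_few_shot_prompt(examples, query, src="English", tgt="French"):
    """Format (source, target) example pairs followed by an unanswered query."""
    blocks = [f"Translate {src} to {tgt}."]
    for source_text, target_text in examples:
        blocks.append(f"{src}: {source_text}\n{tgt}: {target_text}")
    # Leave the final target empty so the model completes it.
    blocks.append(f"{src}: {query}\n{tgt}:")
    return "\n\n".join(blocks)

examples = [
    ("cheese", "fromage"),
    ("good morning", "bonjour"),
    ("thank you", "merci"),
]
prompt = build_few_shot_prompt(examples, "the cat")
print(prompt)
```

The three pairs act as the "training data" for this single request. Changing the task means changing the prompt, not the model.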

Transfer Learning Benefits Summary

Less Data Needed — get strong results with hundreds instead of millions of examples
Faster Training — fine-tuning takes hours instead of weeks
Lower Cost — no need for high-end GPU clusters for initial training
Better Performance — starting from pre-trained weights often beats random initialization, especially when labeled data is scarce
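One contributor to the speed and cost benefits is how few parameters need updating when only the head is trained. The arithmetic below uses the flower example's 5-class head and the commonly reported size of a torchvision-style ResNet-50. Treat the figures as approximate, and note that the frozen layers still cost a forward pass.

```python
# Commonly reported figures for a torchvision-style ResNet-50 (approximate).
TOTAL_PARAMS = 25_557_032        # whole network, including the 1000-class head
FEATURE_DIM = 2048               # width of the pooled feature vector fed to the head
OLD_CLASSES, NEW_CLASSES = 1000, 5

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

old_head = linear_params(FEATURE_DIM, OLD_CLASSES)
new_head = linear_params(FEATURE_DIM, NEW_CLASSES)
backbone = TOTAL_PARAMS - old_head

fraction = new_head / (backbone + new_head)
print(f"trainable: {new_head:,} of {backbone + new_head:,} ({fraction:.3%})")
# trainable: 10,245 of 23,518,277 (0.044%)
```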

Key Terms

Transfer Learning — reusing a pre-trained model's knowledge for a new task
Pre-trained Model — a model already trained on a large dataset
Fine-Tuning — continuing training on a new, smaller dataset
Frozen Layers — layers whose weights are not updated during fine-tuning
Feature Extraction — using a pre-trained model as a fixed feature generator
Zero-Shot — performing a task without any task-specific training examples
Domain Adaptation — adjusting a model to work well in a different data distribution
