Deep Learning Data Preparation

A Deep Learning model is only as good as the data it trains on. Think of data as food for the model. Bad food produces a weak, sickly model. Clean, well-prepared data produces a strong, accurate one. This lesson covers how to prepare data before any training begins.

Why Data Preparation Matters

Raw data collected from the real world is messy. It contains missing values, typos, inconsistent formats, and irrelevant information. Feeding messy data to a model teaches it the wrong things.

The Data Pipeline

Raw Data
   |
   v
[Collect] → [Clean] → [Transform] → [Split] → Ready for Training

Step 1: Collect Your Data

You need a lot of data. Deep Learning models learn by example, and more examples generally mean better learning, provided the examples are accurate and representative of what the model will see later.

Structured data — spreadsheets, databases (example: house prices with rooms, size, location)
Unstructured data — images, audio, text (example: thousands of cat photos)
Public datasets — free collections available on Kaggle, UCI Machine Learning Repository, and Hugging Face

Step 2: Clean the Data

Cleaning removes problems that would confuse the model during training.

Common Problems and Fixes

Problem	Example	Fix
Missing values	Age column has blanks	Fill with average age, or remove the row
Duplicate rows	Same customer entered twice	Keep one, delete the other
Wrong data type	"twenty" instead of 20	Convert text to number
Outliers	Age listed as 900	Investigate and remove if it's an error
Inconsistent labels	"cat", "Cat", "CAT" as labels	Standardize to lowercase "cat"
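
The fixes in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a general-purpose cleaner: the records, the `WORD_TO_NUMBER` lookup, and the `MAX_PLAUSIBLE_AGE` threshold are all made-up assumptions chosen to mirror the table's examples.

```python
from statistics import mean

# Hypothetical raw records containing the problems listed in the table.
raw_rows = [
    {"name": "Ana", "age": "34", "label": "Cat"},
    {"name": "Ben", "age": "", "label": "dog"},       # missing value
    {"name": "Ana", "age": "34", "label": "Cat"},     # duplicate row
    {"name": "Cy", "age": "900", "label": "CAT"},     # outlier
    {"name": "Di", "age": "twenty", "label": "dog"},  # wrong data type
]

WORD_TO_NUMBER = {"twenty": 20}  # tiny illustrative lookup, not a real parser
MAX_PLAUSIBLE_AGE = 120          # assumed cutoff for "clearly an error"


def parse_age(text):
    """Return the age as an int, or None when the value is missing."""
    text = text.strip().lower()
    if not text:
        return None
    if text in WORD_TO_NUMBER:
        return WORD_TO_NUMBER[text]
    return int(text)


def clean(rows):
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["name"], row["age"], row["label"])
        if key in seen:
            continue  # duplicate row: keep only the first copy
        seen.add(key)
        age = parse_age(row["age"])
        if age is not None and age > MAX_PLAUSIBLE_AGE:
            continue  # outlier treated as an entry error and dropped
        cleaned.append({
            "name": row["name"],
            "age": age,
            "label": row["label"].lower(),  # standardize labels
        })
    # Fill missing ages with the average of the known ages.
    fill_value = round(mean(r["age"] for r in cleaned if r["age"] is not None))
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned


cleaned_rows = clean(raw_rows)
for r in cleaned_rows:
    print(r)
```

Here the outlier is dropped before the average is computed, so the implausible age of 900 does not distort the value used to fill the blank.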

Step 3: Transform the Data

Models work only with numbers. Everything — images, text, and audio — must be converted into numbers before training.

Normalization

Different columns may have very different ranges. Age might go from 1 to 100, while salary might go from 10,000 to 500,000. A column with larger numbers is not more important, but its large values can dominate the calculations during training and make learning slower or less stable. Normalization rescales all values to a common range, usually 0 to 1.

Original salary column:  10000, 50000, 200000, 500000
After normalization:      0.00,  0.08,   0.39,   1.00
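
One common way to do this is min-max scaling: subtract the column minimum and divide by the range (maximum minus minimum). The sketch below assumes that formula; the function name `min_max_normalize` is just an illustrative choice.

```python
def min_max_normalize(values):
    """Rescale values to the range 0 to 1 using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


salaries = [10000, 50000, 200000, 500000]
normalized = [round(x, 2) for x in min_max_normalize(salaries)]
print(normalized)  # [0.0, 0.08, 0.39, 1.0]
```

In practice the minimum and maximum are usually computed from the training data only and then reused for the validation and test data, so that no information from those sets leaks into training.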

Encoding Categories

Words like "cat", "dog", and "bird" are categories. The model cannot process words — only numbers. One common method is one-hot encoding.

Label     cat    dog    bird
"cat"  →   1      0      0
"dog"  →   0      1      0
"bird" →   0      0      1

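The table above can be reproduced with a short function. This is a minimal sketch; the name `one_hot_encode` and the fixed category order are assumptions made for illustration.

```python
def one_hot_encode(labels, categories):
    """Turn each label into a list with a single 1 at its category's position."""
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for label in labels:
        vector = [0] * len(categories)
        vector[position[label]] = 1
        encoded.append(vector)
    return encoded


categories = ["cat", "dog", "bird"]
encoded = one_hot_encode(["cat", "dog", "bird"], categories)
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

A label that is not in `categories` raises a `KeyError` here, which is one reason to standardize labels during cleaning before encoding them.
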
Image Data

Every image is already a grid of numbers. In a standard 8-bit RGB image, each pixel's color is stored as three numbers (Red, Green, Blue), each between 0 and 255. A 100×100 image therefore contains 100 × 100 × 3 = 30,000 numbers.

Pixel (255, 128, 0) = bright orange
Pixel (0, 0, 255)   = pure blue
Pixel (0, 0, 0)     = black
Pixel (255,255,255) = white
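
To make the arithmetic concrete, the sketch below builds a hypothetical 100×100 image in which every pixel is orange, flattens it into one long list of numbers, and rescales each value from 0 to 255 down to 0 to 1. Real projects would normally load images with an imaging library; plain nested lists are used here only to keep the example self-contained.

```python
WIDTH, HEIGHT = 100, 100

# Hypothetical image: a grid of (Red, Green, Blue) pixels, all orange.
image = [[(255, 128, 0) for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Flatten rows, then pixels, then color channels into a single list.
flat = [channel for row in image for pixel in row for channel in pixel]
print(len(flat))  # 30000

# Rescale each channel value from the 0-255 range to the 0-1 range.
scaled = [value / 255 for value in flat]
print([round(v, 3) for v in scaled[:3]])  # [1.0, 0.502, 0.0]
```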

Step 4: Split the Data

You split data into three separate groups — each with a different job.

The Three-Way Split

All Data (100%)
      |
      ├──── Training Set (70%) ──── Model learns from this
      ├──── Validation Set (15%) ── Used during training to tune the model
      └──── Test Set (15%) ──────── Final check — model never sees this during training

Training Set — the textbook the model studies
Validation Set — the practice exam taken during study
Test Set — the real exam taken at the end

The test set must never be used during training or tuning. Using it early is like giving students the exam answers before the test: the scores become meaningless.
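
A 70/15/15 split can be sketched with the standard library alone. The function name `three_way_split`, the fixed seed, and the stand-in dataset of 100 integers are assumptions for illustration; shuffling first is a common precaution so that any ordering in the original data does not end up concentrated in one group.

```python
import random


def three_way_split(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of items and cut it into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


examples = list(range(100))  # stand-in for 100 data examples
train_set, val_set, test_set = three_way_split(examples)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Because the three lists are slices of the same shuffled copy, no example appears in more than one group.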

Data Augmentation

When you do not have enough training data, you can create new examples from existing ones. This technique is called data augmentation.

Example: Augmenting a Dog Photo

Original Photo
      |
      ├──── Flip horizontally → New example
      ├──── Rotate 15°       → New example
      ├──── Zoom in slightly  → New example
      └──── Adjust brightness → New example

All four new images still show a dog — the model learns from more examples without collecting any new photos.
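
Two of these transformations, the horizontal flip and the brightness adjustment, are simple enough to sketch by hand. The example below assumes a tiny grayscale image stored as nested lists, with one number per pixel instead of three, purely to keep it short; rotation and zoom are left out because they need interpolation that an image library normally provides.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    """Multiply every pixel by factor, keeping results inside 0-255."""
    return [
        [min(255, max(0, round(pixel * factor))) for pixel in row]
        for row in image
    ]


# Hypothetical 2x3 grayscale image.
original = [
    [0, 100, 200],
    [50, 150, 250],
]

flipped = flip_horizontal(original)
brighter = adjust_brightness(original, 1.2)

print(flipped)   # [[200, 100, 0], [250, 150, 50]]
print(brighter)  # [[0, 120, 240], [60, 180, 255]]
```

Both functions return new lists and leave `original` untouched, so one source image can yield several augmented copies.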

A Quick Data Preparation Checklist

Remove or fill missing values
Fix inconsistent labels and formats
Convert all data to numbers
Normalize or scale the values
Split into training, validation, and test sets
Apply augmentation if the dataset is small

Key Terms

Normalization — rescaling values to a fixed range
One-Hot Encoding — converting categories into binary columns
Augmentation — creating new training examples from existing data
Training Set — data used to teach the model
Validation Set — data used to tune the model during training
Test Set — data used to measure final model performance
