Deep Learning Data Preparation
A Deep Learning model is only as good as the data it trains on. Think of data as food for the model. Bad food produces a weak, sickly model. Clean, well-prepared data produces a strong, accurate one. This topic covers exactly how to prepare data before any training begins.
Why Data Preparation Matters
Raw data collected from the real world is messy. It contains missing values, typos, inconsistent formats, and irrelevant information. Feeding messy data to a model teaches it the wrong things.
The Data Pipeline
Raw Data
   |
   v
[Collect] → [Clean] → [Transform] → [Split] → Ready for Training
Step 1: Collect Your Data
You need a lot of data. Deep Learning models learn by example, so more examples generally mean better learning, provided the examples are accurate and representative of the problem you want to solve.
- Structured data — spreadsheets, databases (example: house prices with rooms, size, location)
- Unstructured data — images, audio, text (example: thousands of cat photos)
- Public datasets — free collections available on Kaggle, UCI Machine Learning Repository, and Hugging Face
Step 2: Clean the Data
Cleaning removes problems that would confuse the model during training.
Common Problems and Fixes
| Problem | Example | Fix |
|---|---|---|
| Missing values | Age column has blanks | Fill with average age, or remove the row |
| Duplicate rows | Same customer entered twice | Keep one, delete the other |
| Wrong data type | "twenty" instead of 20 | Convert text to number |
| Outliers | Age listed as 900 | Investigate and remove if it's an error |
| Inconsistent labels | "cat", "Cat", "CAT" as labels | Standardize to lowercase "cat" |
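The fixes in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a production cleaning routine: the records, the field names (`name`, `age`, `label`), and the `MAX_PLAUSIBLE_AGE` threshold are all made up for the example. It also assumes ages arrive as digit strings; spelled-out values such as "twenty" would need a separate lookup step.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"name": "Ana", "age": "34", "label": "Cat"},
    {"name": "Ben", "age": None, "label": "cat"},
    {"name": "Ana", "age": "34", "label": "Cat"},  # duplicate of the first row
    {"name": "Cy", "age": "900", "label": "CAT"},  # implausible age
    {"name": "Di", "age": "26", "label": "dog"},
]

MAX_PLAUSIBLE_AGE = 120  # assumed threshold for this example


def clean(rows):
    # 1. Drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Fix data types and standardize labels to lowercase.
    for row in unique:
        if row["age"] is not None:
            row["age"] = int(row["age"])
        row["label"] = row["label"].strip().lower()

    # 3. Remove rows whose age is clearly an error.
    unique = [r for r in unique
              if r["age"] is None or r["age"] <= MAX_PLAUSIBLE_AGE]

    # 4. Fill missing ages with the average of the known ages.
    known = [r["age"] for r in unique if r["age"] is not None]
    fill = round(mean(known))
    for row in unique:
        if row["age"] is None:
            row["age"] = fill
    return unique


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

The outlier is removed before the average is computed, so the implausible age does not distort the value used to fill the blanks.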
Step 3: Transform the Data
Models work only with numbers. Everything — images, text, and audio — must be converted into numbers before training.
Normalization
Different columns may have very different ranges. Age might go from 1 to 100, while salary might go from 10,000 to 500,000. A larger number does not mean the column is more important, but a model trained on raw values can treat it that way, because features with large ranges tend to dominate the updates made during training. Normalization rescales all values to a common range, usually 0 to 1.
Original salary column: 10000, 50000, 200000, 500000
After normalization:    0.00, 0.08, 0.39, 1.00
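One common way to rescale to the 0 to 1 range is min-max normalization, which maps each value x to (x - min) / (max - min). The sketch below applies it to the salary column; the function name `min_max_normalize` is only illustrative.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


salaries = [10000, 50000, 200000, 500000]
normalized = [round(x, 2) for x in min_max_normalize(salaries)]
print(normalized)  # [0.0, 0.08, 0.39, 1.0]
```

In practice, the minimum and maximum are usually taken from the training data only and then reused for the validation and test data, so that no information from those sets leaks into training.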
Encoding Categories
Words like "cat", "dog", and "bird" are categories. The model cannot process words — only numbers. One common method is one-hot encoding.
Label      cat  dog  bird
"cat"  →    1    0    0
"dog"  →    0    1    0
"bird" →    0    0    1
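A minimal sketch of one-hot encoding in plain Python follows. The helper name `one_hot_encode` and the fixed category order are assumptions made for the example; libraries such as scikit-learn and pandas offer their own encoders.

```python
def one_hot_encode(labels, categories):
    """Turn each label into a list with a single 1 at its category's position."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        encoded.append(row)
    return encoded


categories = ["cat", "dog", "bird"]
vectors = one_hot_encode(["cat", "dog", "bird"], categories)
for label, vector in zip(categories, vectors):
    print(label, vector)
```

Each category gets its own column, so the model does not read a false ordering into the labels, as it might if they were simply numbered 1, 2, 3.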
Image Data
Every image is already a grid of numbers. In the common 8-bit RGB format, a pixel's color is stored as three numbers (Red, Green, Blue), each between 0 and 255. A 100×100 color image becomes 100 × 100 × 3 = 30,000 numbers.
Pixel (255, 128, 0)   = bright orange
Pixel (0, 0, 255)     = pure blue
Pixel (0, 0, 0)       = black
Pixel (255, 255, 255) = white
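To make the arithmetic concrete, the sketch below builds a synthetic 100×100 image filled with one orange pixel value, flattens it into a single list, and confirms the count of 30,000. Dividing by 255 to bring the values into the 0 to 1 range is a common convention, shown here as an assumption rather than a requirement. Real images would be loaded with an imaging library instead of being built by hand.

```python
WIDTH, HEIGHT = 100, 100
ORANGE = (255, 128, 0)

# A synthetic image: HEIGHT rows, each holding WIDTH (R, G, B) pixels.
image = [[ORANGE for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Flatten rows -> pixels -> channels into one long list of numbers.
flat = [channel for row in image for pixel in row for channel in pixel]
print(len(flat))  # 30000

# Scale each 0..255 value to 0..1, in the spirit of normalization.
scaled = [value / 255 for value in flat]
print(scaled[:3])
```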
Step 4: Split the Data
You split data into three separate groups — each with a different job.
The Three-Way Split
All Data (100%)
|
├──── Training Set (70%) ──── Model learns from this
├──── Validation Set (15%) ── Used during training to tune the model
└──── Test Set (15%) ──────── Final check — model never sees this during training
- Training Set — the textbook the model studies
- Validation Set — the practice exam taken during study
- Test Set — the real exam taken at the end
The test set must never be used during training. Using it early is like giving students the exam answers before the test — the scores become meaningless.
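The 70/15/15 split can be sketched with the standard library alone. The function name `three_way_split`, its default fractions, and the seed value are illustrative choices. Shuffling first matters because data is often stored in some order (by date or by label), and a fixed seed makes the split repeatable.

```python
import random


def three_way_split(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle the items, then cut them into training, validation, and test sets."""
    shuffled = list(items)                 # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test


data = list(range(100))
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # 70 15 15
```

Every example lands in exactly one of the three sets, which is what keeps the test set unseen during training.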
Data Augmentation
When you do not have enough training data, you can create new examples from existing ones. This technique is called data augmentation.
Example: Augmenting a Dog Photo
Original Photo
|
├──── Flip horizontally → New example
├──── Rotate 15° → New example
├──── Zoom in slightly → New example
└──── Adjust brightness → New example
All four new images still show a dog — the model learns from more examples without collecting any new photos.
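Two of these augmentations, the horizontal flip and the brightness adjustment, are simple enough to sketch on a tiny 2×2 image stored as nested lists of (R, G, B) pixels. The helper names and the brightness factor of 1.2 are assumptions for illustration. Rotation and zoom need resampling and are normally done with an imaging library, so they are left out here.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing each row of pixels."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    """Multiply every channel by factor, clamping results to 0..255."""
    return [
        [tuple(min(255, max(0, round(c * factor))) for c in pixel)
         for pixel in row]
        for row in image
    ]


# A tiny stand-in for a photo: 2 rows of 2 pixels each.
original = [
    [(255, 128, 0), (0, 0, 255)],
    [(0, 0, 0), (255, 255, 255)],
]

flipped = flip_horizontal(original)
brighter = adjust_brightness(original, 1.2)

# One original image has become three training examples.
augmented = [original, flipped, brighter]
print(len(augmented))  # 3
```

Augmentation is normally applied to the training set only, so that the validation and test sets still reflect unmodified data.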
A Quick Data Preparation Checklist
- Remove or fill missing values
- Fix inconsistent labels and formats
- Convert all data to numbers
- Normalize or scale the values
- Split into training, validation, and test sets
- Apply augmentation if the dataset is small
Key Terms
- Normalization — rescaling values to a fixed range
- One-Hot Encoding — converting categories into binary columns
- Augmentation — creating new training examples from existing data
- Training Set — data used to teach the model
- Validation Set — data used to tune the model during training
- Test Set — data used to measure final model performance
