Deep Learning Data Preparation

A Deep Learning model is only as good as the data it trains on. Think of data as food for the model. Bad food produces a weak, sickly model. Clean, well-prepared data produces a strong, accurate one. This topic covers exactly how to prepare data before any training begins.

Why Data Preparation Matters

Raw data collected from the real world is messy. It contains missing values, typos, inconsistent formats, and irrelevant information. Feeding messy data to a model teaches it the wrong things.

The Data Pipeline

Raw Data
   |
   v
[Collect] → [Clean] → [Transform] → [Split] → Ready for Training

Step 1: Collect Your Data

You need a lot of data. Deep Learning models learn by example, and more examples generally mean better learning, as long as the examples are relevant and correctly labeled.

  • Structured data — spreadsheets, databases (example: house prices with rooms, size, location)
  • Unstructured data — images, audio, text (example: thousands of cat photos)
  • Public datasets — free collections available on Kaggle, UCI Machine Learning Repository, and Hugging Face

Step 2: Clean the Data

Cleaning removes problems that would confuse the model during training.

Common Problems and Fixes

Problem               Example                          Fix
Missing values        Age column has blanks            Fill with average age, or remove the row
Duplicate rows        Same customer entered twice      Keep one, delete the other
Wrong data type       "twenty" instead of 20           Convert text to number
Outliers              Age listed as 900                Investigate and remove if it's an error
Inconsistent labels   "cat", "Cat", "CAT" as labels    Standardize to lowercase "cat"
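
The fixes in the table can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the `WORD_TO_NUMBER` lookup, and the `MAX_PLAUSIBLE_AGE` cutoff are all hypothetical, and a real project would more likely use a data library and inspect outliers by hand before deleting them.

```python
from statistics import mean

# Hypothetical raw records showing each problem from the table above.
raw_rows = [
    {"name": "Ana", "age": "34", "label": "Cat"},
    {"name": "Ana", "age": "34", "label": "Cat"},      # duplicate row
    {"name": "Ben", "age": "", "label": "CAT"},        # missing age
    {"name": "Cho", "age": "twenty", "label": "dog"},  # number written as text
    {"name": "Dev", "age": "900", "label": "dog"},     # implausible outlier
]

WORD_TO_NUMBER = {"twenty": 20}  # tiny illustrative lookup, not a full parser
MAX_PLAUSIBLE_AGE = 120          # assumed cutoff for this example only


def parse_age(text):
    """Return the age as an int, or None when it is blank or unreadable."""
    text = text.strip().lower()
    if text.isdigit():
        return int(text)
    return WORD_TO_NUMBER.get(text)


def clean(rows):
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["name"], row["age"], row["label"])
        if key in seen:  # drop exact duplicates, keeping the first copy
            continue
        seen.add(key)
        age = parse_age(row["age"])
        if age is not None and age > MAX_PLAUSIBLE_AGE:
            continue  # treated here as an entry error; investigate in real data
        cleaned.append(
            {"name": row["name"], "age": age, "label": row["label"].lower()}
        )

    # Fill the remaining missing ages with the average of the known ages.
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill_value = round(mean(known_ages))
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill_value
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Three rows survive: the duplicate and the age-900 row are dropped, "twenty" becomes 20, the blank age is filled with the average of the known ages, and every label is lowercase.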

Step 3: Transform the Data

Models work only with numbers. Everything — images, text, and audio — must be converted into numbers before training.

Normalization

Different columns may have very different ranges. Age might go from 1 to 100, while salary might go from 10,000 to 500,000. A larger number does not make a column more important, but many models respond more strongly to large values, so the salary column would dominate training. Normalization rescales all values to a common range, usually 0 to 1.

Original salary column:  10000, 50000, 200000, 500000
After normalization:      0.00,  0.08,   0.39,   1.00
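
One common way to get these values is min-max scaling, where each value x becomes (x - min) / (max - min). The sketch below assumes that variant; the function name `min_max_normalize` is made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


salaries = [10000, 50000, 200000, 500000]
normalized = [round(x, 2) for x in min_max_normalize(salaries)]
print(normalized)  # [0.0, 0.08, 0.39, 1.0]
```

In practice the minimum and maximum are usually taken from the training data only and then reused for validation and test data, so that no information leaks from the held-out sets.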

Encoding Categories

Words like "cat", "dog", and "bird" are categories. The model cannot process words — only numbers. One common method is one-hot encoding.

Label     cat    dog    bird
"cat"  →   1      0      0
"dog"  →   0      1      0
"bird" →   0      0      1
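
The table above can be reproduced with a short function. This is a minimal sketch in plain Python; `one_hot_encode` is an illustrative name, and it assumes the full list of categories is known in advance.

```python
def one_hot_encode(labels, categories):
    """Map each label to a list with a single 1 at its category's position."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # raises KeyError for a label not in categories
        encoded.append(row)
    return encoded


categories = ["cat", "dog", "bird"]  # column order matches the table above
encoded = one_hot_encode(["cat", "dog", "bird"], categories)
for label, row in zip(categories, encoded):
    print(label, row)
```

Each row contains exactly one 1, so no category is treated as "larger" than another, which is what would happen if the labels were simply numbered 1, 2, 3.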

Image Data

Every image is already a grid of numbers. In a typical 8-bit color image, each pixel's color is stored as three numbers (Red, Green, Blue), each between 0 and 255. A 100×100 image therefore becomes 100 × 100 × 3 = 30,000 numbers.

Pixel (255, 128, 0) = bright orange
Pixel (0, 0, 255)   = pure blue
Pixel (0, 0, 0)     = black
Pixel (255,255,255) = white
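
To make the 30,000 figure concrete, the sketch below builds a hypothetical 100×100 image as nested Python lists and flattens it. Dividing each channel by 255 is an assumed, commonly used way to bring pixel values into the 0 to 1 range; it is one option, not the only one.

```python
WIDTH, HEIGHT = 100, 100
ORANGE = (255, 128, 0)

# A hypothetical 100x100 image filled with one color: rows of (R, G, B) pixels.
image = [[ORANGE for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Flatten to a single list of numbers and scale each channel from 0-255 to 0-1.
flat = [channel / 255 for row in image for pixel in row for channel in pixel]

print(len(flat))  # 30000
print(flat[:3])   # the first pixel's red, green, and blue values after scaling
```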

Step 4: Split the Data

You split data into three separate groups — each with a different job.

The Three-Way Split

All Data (100%)
      |
      ├──── Training Set (70%) ──── Model learns from this
      ├──── Validation Set (15%) ── Used during training to tune the model
      └──── Test Set (15%) ──────── Final check — model never sees this during training

An exam analogy makes the three roles easier to remember:

  • Training Set — the textbook the model studies
  • Validation Set — the practice exam taken during study
  • Test Set — the real exam taken at the end

The test set must never be used during training. Using it early is like giving students the exam answers before the test — the scores become meaningless.
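
A 70/15/15 split can be sketched with the standard library alone. The function name `three_way_split`, the default fractions, and the seed value are illustrative choices; shuffling first is an assumption that suits data without a time order.

```python
import random


def three_way_split(examples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of the examples and cut it into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


examples = list(range(1000))  # stand-in for 1,000 real examples
train_set, val_set, test_set = three_way_split(examples)
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```

Because the three slices come from one shuffled list, no example can land in more than one set, which is what keeps the test score meaningful.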

Data Augmentation

When you do not have enough training data, you can create new examples from existing ones. This technique is called data augmentation.

Example: Augmenting a Dog Photo

Original Photo
      |
      ├──── Flip horizontally → New example
      ├──── Rotate 15°       → New example
      ├──── Zoom in slightly  → New example
      └──── Adjust brightness → New example

All four new images still show a dog — the model learns from more examples without collecting any new photos.
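
Two of these transformations are simple enough to sketch on a tiny image stored as nested lists. The example below uses a hypothetical 2×3 grayscale image (one number per pixel) to stay short; rotation and zoom need interpolation and are normally left to an image library, so they are omitted here.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    """Multiply every pixel by factor, clamping results to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


# A hypothetical 2x3 grayscale image, one number per pixel for brevity.
original = [
    [0, 100, 200],
    [50, 150, 250],
]

flipped = flip_horizontal(original)          # [[200, 100, 0], [250, 150, 50]]
brighter = adjust_brightness(original, 1.2)  # [[0, 120, 240], [60, 180, 255]]
augmented_set = [original, flipped, brighter]
print(len(augmented_set))  # 3
```

Both functions return new lists and leave the original untouched, so one source image can feed several augmented copies.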

A Quick Data Preparation Checklist

  • Remove or fill missing values
  • Fix inconsistent labels and formats
  • Convert all data to numbers
  • Normalize or scale the values
  • Split into training, validation, and test sets
  • Apply augmentation if the dataset is small

Key Terms

  • Normalization — rescaling values to a fixed range
  • One-Hot Encoding — converting categories into binary columns
  • Augmentation — creating new training examples from existing data
  • Training Set — data used to teach the model
  • Validation Set — data used to tune the model during training
  • Test Set — data used to measure final model performance
