ML Train Test Split and Cross Validation

A Machine Learning model must prove it works on data it has never seen before. Training and testing on the same data gives a false sense of accuracy — the model memorizes answers rather than learning patterns. Train-test split and cross validation are two methods that create a fair evaluation environment.

The Core Problem: Why Split Data?

Imagine a student studying for an exam:

Scenario A — No Split (Wrong):
  Student memorizes the exact exam questions during study.
  Gets 100% on the exam.
  The score proves nothing — the student memorized the answers rather than learning the material.

Scenario B — With Split (Correct):
  Student studies from a textbook (training data).
  Exam uses new questions not seen during study (test data).
  Score on the exam = true measure of understanding.

Machine Learning works the same way.

Train-Test Split

Train-test split divides the dataset into two separate parts before any model training begins. The training portion teaches the model. The test portion evaluates the model on data it has never encountered.

Full Dataset: 1000 records
       │
       ├──► Training Set: 800 records (80%)  → Model trains here
       │
       └──► Test Set:     200 records (20%)  → Model evaluated here

Standard Split Ratios:
  70% Train / 30% Test  → Smaller datasets
  80% Train / 20% Test  → Most common default
  90% Train / 10% Test  → Large datasets with many records
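An 80/20 split takes one line with scikit-learn (the library is an assumption here; the article names no tools). A minimal sketch on 1000 hypothetical records:

```python
# Minimal 80/20 train-test split sketch using scikit-learn.
from sklearn.model_selection import train_test_split

X = list(range(1000))               # 1000 hypothetical feature records
y = [i % 2 for i in range(1000)]    # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% held out for testing
    random_state=42,    # fixed seed makes the random split reproducible
)
print(len(X_train), len(X_test))    # 800 200
```

Setting `random_state` is optional but recommended so the same split can be reproduced later.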

Rules for Train-Test Split

  • Split must happen before any preprocessing steps that use the whole dataset (like filling missing values with the mean)
  • The test set must never influence model training or parameter tuning
  • The split should be random so both sets represent the full data fairly

Correct Order:
  1. Split data into Train and Test
  2. Fit preprocessing (find mean, scaling range) on Train set ONLY
  3. Apply same preprocessing to Test set
  4. Train model on Train set
  5. Evaluate on Test set

Wrong Order:
  1. Fit preprocessing on FULL dataset (leaks test info into training)
  2. Split into Train and Test
  3. Train → Evaluate → Misleading results
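The correct order above can be sketched with a standard scaler (scikit-learn is assumed; any preprocessing that learns statistics from data follows the same pattern): fit on the training set only, then reuse those statistics on the test set.

```python
# Sketch of leakage-safe preprocessing: statistics come from Train ONLY.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(1000, 3))   # hypothetical numeric features

# Step 1: split first, before any preprocessing
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Step 2: fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)     # mean/std learned from Train

# Step 3: apply the SAME transformation to the test set (no refitting)
X_test_scaled = scaler.transform(X_test)           # test info never leaks in
```

Calling `fit_transform` on the full dataset before splitting would be the "wrong order" shown above: the scaler's mean and standard deviation would already contain information from the test records.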

The Validation Set

When tuning model settings (called hyperparameters), using the test set for guidance corrupts the final evaluation. A third split — the validation set — solves this problem.

Full Dataset: 1000 records
       │
       ├──► Training Set:   700 records (70%)  → Model trains here
       │
       ├──► Validation Set: 150 records (15%)  → Tune settings here
       │
       └──► Test Set:       150 records (15%)  → Final evaluation only

Workflow:
  Step 1: Train model on Training Set
  Step 2: Evaluate on Validation Set → adjust settings
  Step 3: Repeat Steps 1–2 until settings are good
  Step 4: Run ONCE on Test Set → report final performance

The Problem with a Single Split

One split depends on which records happen to land in each set. A lucky or unlucky split can make the model appear better or worse than it really is. This is especially risky with small datasets.

Example of Lucky Split Problem:

Dataset: 100 records about house prices
Random Split A → Test accuracy = 88%  (easy records landed in the test set)
Random Split B → Test accuracy = 74%  (harder records landed in the test set)

Which number is the true performance? Neither alone is reliable.
Cross validation solves this by averaging across multiple splits.
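This instability is easy to see directly. The sketch below (synthetic data and a logistic regression model, both illustrative rather than from the article) scores the same model on five different random splits of a 100-record dataset:

```python
# Sketch: the same model scored on five different random splits.
# Dataset and model are illustrative, not from the article.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # small synthetic dataset

scores = []
for seed in range(5):                                      # five different splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))                 # accuracy on this split

print(scores)   # the accuracies differ from split to split
```

With only 20 test records per split, a handful of easy or hard records is enough to move the reported accuracy noticeably.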

Cross Validation

Cross validation splits the dataset into multiple folds and runs the training and evaluation cycle multiple times. Each fold takes a turn being the test set. The final performance is the average of all runs, which gives a much more reliable estimate.

K-Fold Cross Validation

K-Fold is the most common cross validation method. The dataset divides into K equal parts (folds). The model trains K times — each time using a different fold as the test set and the remaining K-1 folds as the training set.

5-Fold Cross Validation on 500 records:

Fold 1:  [TEST ] [TRAIN] [TRAIN] [TRAIN] [TRAIN]  → Accuracy: 84%
Fold 2:  [TRAIN] [TEST ] [TRAIN] [TRAIN] [TRAIN]  → Accuracy: 81%
Fold 3:  [TRAIN] [TRAIN] [TEST ] [TRAIN] [TRAIN]  → Accuracy: 86%
Fold 4:  [TRAIN] [TRAIN] [TRAIN] [TEST ] [TRAIN]  → Accuracy: 82%
Fold 5:  [TRAIN] [TRAIN] [TRAIN] [TRAIN] [TEST ]  → Accuracy: 85%

Final Performance = Average = (84+81+86+82+85) / 5 = 83.6%

Every record gets tested exactly once.
Result is much more reliable than a single split.
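The whole fold-and-average procedure above is one call in scikit-learn (assumed here, with an illustrative synthetic dataset and model):

```python
# Sketch of 5-fold cross validation: train/evaluate 5 times, average the scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # 500 synthetic records

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=5,                       # 5 folds -> 5 train/evaluate rounds
)
print(scores)                   # one accuracy per fold
print(scores.mean())            # final performance = average of the 5 folds
```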

Choosing the Value of K

┌───────┬──────────────────────────────────────────────────────┐
│ K     │ Characteristics                                      │
├───────┼──────────────────────────────────────────────────────┤
│ 5     │ Fast, good default for most problems                 │
│ 10    │ More reliable estimate, slightly slower              │
│ N     │ Leave-One-Out (LOO) — each record is its own         │
│       │ test fold — very slow but useful for tiny datasets   │
└───────┴──────────────────────────────────────────────────────┘
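The cost of each choice of K is simply the number of train/evaluate rounds, which the scikit-learn splitters (assumed here) report directly:

```python
# Sketch: how many train/evaluate rounds each choice of K costs.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(100).reshape(-1, 1)               # 100 hypothetical records

print(KFold(n_splits=5).get_n_splits(X))        # 5 rounds
print(KFold(n_splits=10).get_n_splits(X))       # 10 rounds
print(LeaveOneOut().get_n_splits(X))            # 100 rounds — one per record
```

Leave-One-Out trains the model once per record, so on 100 records it is 20 times more work than 5-fold.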

Stratified K-Fold Cross Validation

Standard K-Fold splits randomly, which can cause some folds to have very few examples of a minority class. Stratified K-Fold preserves the class proportion in every fold.

Dataset: 100 records → 70 Class A, 30 Class B (30% minority)

Standard K-Fold (5 folds of 20 records each):
  Fold 3 might get: 18 Class A, 2 Class B  (only 10% minority)
  → Unrepresentative test set

Stratified K-Fold (5 folds of 20 records each):
  Every fold: 14 Class A, 6 Class B  (always 30% minority)
  → Each fold reflects the true class distribution
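The 70/30 example above can be verified with scikit-learn's `StratifiedKFold` (the library is an assumption; the data is the article's 70/30 class mix):

```python
# Sketch: stratified folds preserve the 70/30 class ratio in every test fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))                          # features are irrelevant here
y = np.array([0] * 70 + [1] * 30)               # 70 Class A, 30 Class B

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in skf.split(X, y):
    counts = np.bincount(y[test_idx])
    print(counts)                               # every fold: [14  6] — always 30% minority
```

Note that `split` needs the labels `y` as well as `X`, since the class proportions are what it preserves.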

Comparison: Train-Test Split vs Cross Validation

┌──────────────────────────┬──────────────────┬───────────────────┐
│ Feature                  │ Train-Test Split │ Cross Validation  │
├──────────────────────────┼──────────────────┼───────────────────┤
│ Speed                    │ Fast             │ K times slower    │
│ Result stability         │ Varies by split  │ Stable (averaged) │
│ Best for large datasets  │ Yes              │ Works for any     │
│ Best for small datasets  │ Risky            │ Yes (recommended) │
│ Uses all data for test?  │ No (only 20–30%) │ Yes (every record)│
│ Common in production?    │ Yes              │ Yes (in tuning)   │
└──────────────────────────┴──────────────────┴───────────────────┘

Full Evaluation Workflow

Dataset
   │
   ▼
Set aside Test Set (never touch again until final step)
   │
   ▼
Apply K-Fold Cross Validation on remaining data
   │  ┌── Train on K-1 folds
   │  └── Validate on 1 fold
   │  Repeat K times → Average score
   │
   ▼
Tune model settings using cross validation scores
   │
   ▼
Select best settings
   │
   ▼
Retrain model on ALL non-test data with best settings
   │
   ▼
Evaluate ONCE on Test Set → Final reported performance

Following this workflow ensures the test set acts as a true judge — seen only once, at the very end, to give an honest and unbiased measure of how the model performs on completely new data.
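The full workflow fits in a short sketch. `GridSearchCV` (a scikit-learn tool assumed here, with an illustrative model and parameter grid) runs the cross-validation tuning loop and, because `refit=True` by default, retrains the best model on all non-test data; the held-out test set is then scored exactly once.

```python
# Sketch of the full evaluation workflow: hold out a test set, tune with
# cross validation, retrain on all non-test data, evaluate once at the end.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # synthetic data

# Step 1: set aside the test set — never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Steps 2-4: 5-fold cross validation over candidate settings on non-test data;
# refit=True (the default) retrains the best model on ALL of X_train afterward
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},   # illustrative hyperparameter grid
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)                   # settings chosen by CV scores

# Final step: evaluate ONCE on the untouched test set
print(search.score(X_test, y_test))          # the honest, reported number
```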
