DS Machine Learning Fundamentals
Machine Learning (ML) is the part of artificial intelligence where computers learn from data without being explicitly programmed for every scenario. Instead of writing rules manually, a machine learning algorithm finds patterns in historical data and uses them to make predictions on new, unseen data.
What Machine Learning Is and Is Not
Traditional programming tells the computer exactly what to do: "If temperature > 35°C, send a heat alert." Machine learning lets the computer figure out those rules by looking at thousands of past examples.
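To make the contrast concrete, here is a small illustrative sketch (the temperatures and alert labels are invented): a hand-written rule versus a "rule" recovered from past examples.

```python
# Traditional programming: a human writes the threshold.
def heat_alert_manual(temp_c):
    return temp_c > 35

# Machine learning: infer the threshold from historical examples instead.
past_temps = [20, 25, 30, 34, 36, 38, 40, 42]
alerts     = [0,  0,  0,  0,  1,  1,  1,  1]   # labels recorded in the past

# A minimal "learned rule": the midpoint between the warmest non-alert
# temperature and the coolest alert temperature in the data.
threshold = (max(t for t, a in zip(past_temps, alerts) if a == 0) +
             min(t for t, a in zip(past_temps, alerts) if a == 1)) / 2

def heat_alert_learned(temp_c):
    return temp_c > threshold

print(threshold)               # 35.0 — recovered from examples alone
print(heat_alert_learned(37))  # True
```

Real algorithms learn far richer rules than a single threshold, but the principle is the same: the rule comes from the data, not from a programmer.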
Diagram – Traditional Programming vs Machine Learning
Traditional Programming:
┌──────────┐   ┌─────────┐   ┌────────┐
│  Rules   │ + │  Data   │ = │ Output │
│(Written  │   │         │   │        │
│manually) │   │         │   │        │
└──────────┘   └─────────┘   └────────┘

Machine Learning:
┌──────────┐   ┌─────────┐   ┌────────┐
│  Data    │ + │ Output  │ = │ Rules  │
│(Examples)│   │(Labels) │   │(Model) │
└──────────┘   └─────────┘   └────────┘

ML figures out the rules automatically from data + answers.
Types of Machine Learning
1. Supervised Learning
The model trains on labelled data — data where the correct answer (label) is already known. It learns to map input features to the output label.
Labelled Data Examples:
┌──────────────────────────┬────────────────────────┐
│ Features (Input)         │ Label (Answer)         │
├──────────────────────────┼────────────────────────┤
│ Email text               │ Spam / Not Spam        │
│ House size, location     │ House price            │
│ Patient symptoms         │ Disease / No Disease   │
│ Image pixels             │ Cat / Dog              │
└──────────────────────────┴────────────────────────┘
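As a toy illustration of supervised learning (the weights and ear lengths below are invented), a classifier can learn the feature → label mapping from a handful of labelled examples:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy labelled data: [weight_g, ear_length_cm] → "cat" or "dog"
X = [[4000, 6], [4500, 7], [3800, 6], [9000, 12], [11000, 13], [8500, 11]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                      # learn the feature → label mapping

print(model.predict([[4200, 6]]))   # → ['cat']
```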
2. Unsupervised Learning
The model trains on unlabelled data. No correct answers exist. The algorithm discovers hidden structures, groups, or patterns on its own.
Unlabelled Data Examples:
┌──────────────────────────┬────────────────────────┐
│ Data                     │ Goal                   │
├──────────────────────────┼────────────────────────┤
│ Customer purchase history│ Group similar customers│
│ News articles            │ Discover topics        │
│ Gene expression data     │ Find gene clusters     │
└──────────────────────────┴────────────────────────┘
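A minimal unsupervised sketch using scikit-learn's KMeans on made-up customer data. Note that no labels are supplied; the algorithm invents the groups itself:

```python
from sklearn.cluster import KMeans

# Unlabelled customer data: [annual_spend, visits_per_month] — no answers given
X = [[200, 1], [250, 2], [220, 1], [1800, 12], [2000, 15], [1900, 13]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster IDs discovered from structure alone

print(labels)   # low-spend customers share one ID, high-spend the other
```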
3. Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties. Over time, it learns which actions maximise rewards.
Reinforcement Learning Flow:
Agent → Takes Action → Environment → Receives Reward → Agent learns
         (Move left)    (hits wall)   (penalty: -10)
Used in: chess engines, robotics, self-driving cars, game-playing AI
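A stripped-down sketch of the reward loop, using a hypothetical two-action "bandit" instead of a full environment (the action names and reward probabilities are invented). The agent never sees the true probabilities; it learns purely from rewards:

```python
import random

random.seed(0)
true_reward = {"left": 0.2, "right": 0.8}   # hidden from the agent
value = {"left": 0.0, "right": 0.0}         # agent's running estimates
counts = {"left": 0, "right": 0}

for step in range(1000):
    # Explore half the time (kept high for this short demo),
    # otherwise exploit the best-known action.
    if random.random() < 0.5:
        action = random.choice(["left", "right"])
    else:
        action = max(value, key=value.get)
    reward = 1 if random.random() < true_reward[action] else 0
    counts[action] += 1
    # Update the running mean reward estimate for the chosen action
    value[action] += (reward - value[action]) / counts[action]

print(value)   # the estimate for "right" ends up higher
```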
Key Machine Learning Terminology
| Term | Definition | Real Example |
|---|---|---|
| Feature (X) | Input variable used to make a prediction | House size, number of rooms, location |
| Label (y) | Output variable the model predicts | House price |
| Training Data | Data used to teach the model | 80% of the dataset |
| Test Data | Unseen data used to evaluate the model | 20% of the dataset |
| Model | The mathematical function learned from training data | Linear regression, decision tree |
| Prediction | The output the model generates for new input | Predicted price: ₹45 lakhs |
| Algorithm | The method used to train the model | Random Forest, KNN, SVM |
| Parameter | Internal values the model learns during training | Weights in a linear regression |
| Hyperparameter | Settings configured before training begins | Number of trees in a random forest |
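The parameter/hyperparameter distinction shows up directly in scikit-learn code: `fit_intercept` is set before training (a hyperparameter), while `coef_` and `intercept_` are learned during `fit` (parameters). Toy data below:

```python
from sklearn.linear_model import LinearRegression

# Hyperparameter: configured before training begins
model = LinearRegression(fit_intercept=True)

# Parameters: learned from the data during fit
X = [[1], [2], [3], [4]]
y = [5, 7, 9, 11]          # exactly y = 2x + 3

model.fit(X, y)
print(model.coef_)         # learned weight  ≈ [2.]
print(model.intercept_)    # learned bias    ≈ 3.0
```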
Features and Labels in Practice
import pandas as pd

# Dataset: Predicting house price
houses = pd.DataFrame({
    "Size_sqft":   [800, 1200, 1500, 900, 2000, 1100, 1800, 2500],
    "Bedrooms":    [2, 3, 3, 2, 4, 3, 4, 5],
    "Age_years":   [10, 5, 8, 15, 3, 12, 6, 2],
    "Price_lakhs": [35, 55, 65, 40, 85, 48, 75, 105]   # ← Label
})
# Separate features (X) and label (y)
X = houses[["Size_sqft", "Bedrooms", "Age_years"]] # Input
y = houses["Price_lakhs"] # Output to predict
print("Features (X):\n", X.head())
print("\nLabel (y):\n", y.head())
Train-Test Split – The Golden Rule
A machine learning model must always be evaluated on data it has never seen during training. Mixing training and test data produces falsely optimistic results — the model appears accurate but fails on real-world data.
Diagram – Train-Test Split
Full Dataset (1000 rows):
┌─────────────────────────────────────────────────┐
│ Training Set (800 rows = 80%)                   │
│ ← Model learns from this                        │
├─────────────────────────────────────────────────┤
│ Test Set (200 rows = 20%)                       │
│ ← Model evaluated on this (never seen before)   │
└─────────────────────────────────────────────────┘
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% goes to test set
    random_state=42    # Ensures same split every run
)
print("Training rows:", len(X_train))
print("Testing rows :", len(X_test))
The Machine Learning Pipeline
Raw Data
│
▼
Clean & Preprocess
(handle nulls, encode categories, scale features)
│
▼
Split into Train / Test
│
▼
Choose an Algorithm
(Linear Regression, Decision Tree, Random Forest, etc.)
│
▼
Train the Model
model.fit(X_train, y_train)
│
▼
Evaluate on Test Set
model.score(X_test, y_test)
│
▼
Tune Hyperparameters
(GridSearchCV, cross-validation)
│
▼
Deploy / Use Model
model.predict(new_data)
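The steps above can be chained end-to-end with scikit-learn's `Pipeline` and `GridSearchCV`. The dataset here is synthetic and the alpha grid is an illustrative choice, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Synthetic data standing in for a cleaned dataset
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)

# Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Preprocessing + algorithm chained into one object
pipe = Pipeline([("scale", StandardScaler()), ("reg", Ridge())])

# Tune a hyperparameter with 5-fold cross-validation
search = GridSearchCV(pipe, {"reg__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                     # train (with tuning)

print(search.best_params_)                       # chosen hyperparameter
print(round(search.score(X_test, y_test), 3))    # evaluate on held-out data
```

Because preprocessing lives inside the pipeline, the scaler is fitted only on each training fold, which avoids leaking test information into training.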
Overfitting and Underfitting
Underfitting and overfitting are the two main ways a machine learning model can fail.
Underfitting
A model that is too simple fails to capture the patterns in the training data. It performs poorly on both training data and new data.
Overfitting
A model that is too complex memorises the training data — including its noise and random fluctuations. It performs perfectly on training data but fails on new data because it learned specific details instead of general patterns.
Diagram – Fitting Comparison
Data Points: ● ● ● ● ● ●

Underfitting:  a flat straight line that misses the curve in the data
Good Fit:      a smooth curve that follows the overall trend
Overfitting:   a wiggly curve that passes through every single point (too specific)

                     Underfitting   Good Fit   Overfitting
Training Accuracy:   Low            High       Very High
Test Accuracy:       Low            High       Low (fails on new data)
# Signs of overfitting:
# Training accuracy: 99%
# Test accuracy:     72%
# Gap is too large → model memorised training data

# Signs of underfitting:
# Training accuracy: 65%
# Test accuracy:     63%
# Both are low → model too simple
Solutions to Overfitting and Underfitting
| Problem | Cause | Solutions |
|---|---|---|
| Underfitting | Model too simple | Use more complex algorithm, add features, train longer |
| Overfitting | Model too complex | More data, regularisation, cross-validation, pruning, dropout |
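One way to watch overfitting happen is to compare a depth-limited decision tree against an unrestricted one on the same noisy data. This is a synthetic example, so the exact scores are illustrative, but the pattern (near-perfect training score, much lower test score) is the tell-tale gap:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy data: a deep tree memorises the noise; a limited tree generalises
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scores = {}
for depth in (4, None):   # None = grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train),   # training R²
                     tree.score(X_test, y_test))     # test R²
    print(depth, [round(s, 2) for s in scores[depth]])
```

The unrestricted tree scores essentially 1.0 on training data but noticeably worse on the test set; limiting `max_depth` is one of the regularisation tools from the table above.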
Introduction to Scikit-learn
Scikit-learn is the standard machine learning library in Python. It provides a consistent API for every algorithm — meaning the steps to train and predict are the same regardless of the algorithm chosen.
# Universal Scikit-learn Pattern:
# 1. Import the algorithm
from sklearn.linear_model import LinearRegression
# 2. Create the model object
model = LinearRegression()
# 3. Train the model
model.fit(X_train, y_train)
# 4. Make predictions
predictions = model.predict(X_test)
# 5. Evaluate (score() returns R² for regression, accuracy for classification)
score = model.score(X_test, y_test)
print(f"Model score: {score:.2f}")
First Complete Machine Learning Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
# Data
np.random.seed(42)
n = 100
X = pd.DataFrame({
    "StudyHours": np.random.uniform(1, 10, n)
})
y = X["StudyHours"] * 8 + np.random.normal(0, 5, n) + 35
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error : {mae:.2f} marks")
print(f"R² Score : {r2:.4f}")
# Predict for a new student
new_student = pd.DataFrame({"StudyHours": [6.5]})
predicted_score = model.predict(new_student)[0]
print(f"Predicted score for 6.5 hours study: {predicted_score:.1f}")
Output:
Mean Absolute Error : 4.83 marks
R² Score            : 0.8721
Predicted score for 6.5 hours study: 85.6
Choosing the Right Algorithm
| Problem Type | Output | Common Algorithms |
|---|---|---|
| Regression | Continuous number (price, temperature) | Linear Regression, Ridge, Random Forest |
| Classification | Category (spam/not spam, disease/healthy) | Logistic Regression, Decision Tree, SVM, KNN |
| Clustering | Groups (no label given) | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | Fewer features | PCA, t-SNE, UMAP |
| Recommendation | Suggested items | Collaborative Filtering, Matrix Factorisation |
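For contrast with the regression example above, here is a minimal classification sketch using Logistic Regression from the table. The data is invented (a single made-up "suspicious word count" feature), so treat it as a shape of the workflow, not a real spam filter:

```python
from sklearn.linear_model import LogisticRegression

# Classification: the output is a category, not a continuous number
X = [[1], [2], [3], [10], [11], [12]]   # toy feature: suspicious-word count
y = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[2], [11]]))   # → ['not spam' 'spam']
```

The fit/predict steps are identical to the regression example; only the algorithm and the type of label change.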
Summary
- Machine learning finds patterns in historical data and uses them to predict outcomes on new data
- Supervised learning uses labelled data; unsupervised learning discovers patterns in unlabelled data
- Features (X) are inputs; labels (y) are the target values the model learns to predict
- Always split data into training and test sets before building any model
- Underfitting means the model is too simple; overfitting means it memorised training data
- Scikit-learn provides a consistent fit/predict API across all machine learning algorithms
- Choosing the right algorithm depends on the problem type — regression, classification, or clustering
