Machine Learning Fundamentals

Machine Learning (ML) is the branch of artificial intelligence in which computers learn from data without being explicitly programmed for every scenario. Instead of writing rules manually, a machine learning algorithm finds patterns in historical data and uses them to make predictions on new, unseen data.

What Machine Learning Is and Is Not

Traditional programming tells the computer exactly what to do: "If temperature > 35°C, send a heat alert." Machine learning lets the computer figure out those rules by looking at thousands of past examples.

Diagram – Traditional Programming vs Machine Learning

Traditional Programming:
┌──────────┐   ┌─────────┐   ┌────────┐
│  Rules   │ + │  Data   │ = │ Output │
│(Written  │   │         │   │        │
│manually) │   │         │   │        │
└──────────┘   └─────────┘   └────────┘

Machine Learning:
┌──────────┐   ┌─────────┐   ┌────────┐
│  Data    │ + │ Output  │ = │  Rules │
│(Examples)│   │(Labels) │   │(Model) │
└──────────┘   └─────────┘   └────────┘

ML figures out the rules automatically from data + answers.
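
A minimal sketch of this contrast, reusing the heat-alert example (the temperature data and the midpoint-threshold "learning" rule are illustrative inventions, not a real algorithm):

```python
# Traditional programming: a human writes the rule
def heat_alert_manual(temp_c):
    return temp_c > 35

# Machine learning in miniature: derive the rule from examples
temps  = [20, 25, 30, 33, 36, 38, 40, 42]   # data
alerts = [0,  0,  0,  0,  1,  1,  1,  1]    # labels (the known answers)

# "Learn" a threshold: the midpoint between the two class means
mean_no  = sum(t for t, a in zip(temps, alerts) if a == 0) / alerts.count(0)
mean_yes = sum(t for t, a in zip(temps, alerts) if a == 1) / alerts.count(1)
threshold = (mean_no + mean_yes) / 2        # 33.0 for this data

def heat_alert_learned(temp_c):
    return temp_c > threshold
```

The manual version encodes a fixed rule; the learned version would shift its threshold automatically if the example data changed.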

Types of Machine Learning

1. Supervised Learning

The model trains on labelled data — data where the correct answer (label) is already known. It learns to map input features to the output label.

Labelled Data Examples:
┌───────────────────────────────────────────────────┐
│ Features (Input)         │  Label (Answer)        │
│──────────────────────────┼────────────────────────│
│ Email text               │ Spam / Not Spam        │
│ House size, location     │ House price            │
│ Patient symptoms         │ Disease / No Disease   │
│ Image pixels             │ Cat / Dog              │
└──────────────────────────┴────────────────────────┘
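
Supervised learning in a few lines of scikit-learn (the spam features and numbers here are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Labelled examples: features are [message_length, num_links],
# labels are 1 = spam, 0 = not spam
X = [[120, 1], [300, 12], [90, 0], [450, 20], [60, 1], [380, 15]]
y = [0, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)                       # learn the mapping: features → label

print(clf.predict([[400, 18]]))    # link-heavy message → predicted spam
```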

2. Unsupervised Learning

The model trains on unlabelled data. No correct answers exist. The algorithm discovers hidden structures, groups, or patterns on its own.

Unlabelled Data Examples:
┌───────────────────────────────────────────────────┐
│ Data                     │  Goal                  │
│──────────────────────────┼────────────────────────│
│ Customer purchase history│ Group similar customers│
│ News articles            │ Discover topics        │
│ Gene expression data     │ Find gene clusters     │
└───────────────────────────────────────────────────┘
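
The first row of the table, sketched with K-Means (the two-column customer data is invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: [annual_spend, visits_per_month] — no answers given
customers = np.array([
    [200, 2], [220, 3], [210, 2],      # low spenders
    [900, 12], [950, 11], [880, 13],   # high spenders
])

km = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = km.fit_predict(customers)   # algorithm discovers the two groups
print(groups)                        # cluster ids are arbitrary (e.g. 0/1)
```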

3. Reinforcement Learning

An agent learns by taking actions in an environment and receiving rewards or penalties. Over time, it learns which actions maximise rewards.

Reinforcement Learning Flow:
Agent → Takes Action → Environment → Receives Reward → Agent learns
         (Move left)    (hits wall)   (penalty: -10)

Used in: chess engines, robotics, self-driving cars, game-playing AI
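
The agent → action → reward loop above can be sketched as tabular Q-learning on a toy corridor (the states, rewards, and learning-rate values are all invented for illustration):

```python
import random
random.seed(0)

# Toy corridor: states 0..4, goal at state 4 (+10 reward), each step costs -1
n_states = 5
actions = [-1, +1]                          # 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]   # reward estimate per state/action
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(200):                        # training episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda i: Q[s][i])
        s2 = min(max(s + actions[a], 0), n_states - 1)
        r = 10 if s2 == n_states - 1 else -1
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max((0, 1), key=lambda i: Q[s][i]) for s in range(n_states - 1)]
print(policy)  # the agent has learned to always move right: [1, 1, 1, 1]
```

No one told the agent "move right"; it inferred that from accumulated rewards and penalties.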

Key Machine Learning Terminology

┌────────────────┬──────────────────────────────────────────────────────┬───────────────────────────────────────┐
│ Term           │ Definition                                           │ Real Example                          │
│────────────────┼──────────────────────────────────────────────────────┼───────────────────────────────────────│
│ Feature (X)    │ Input variable used to make a prediction             │ House size, number of rooms, location │
│ Label (y)      │ Output variable the model predicts                   │ House price                           │
│ Training Data  │ Data used to teach the model                         │ 80% of the dataset                    │
│ Test Data      │ Unseen data used to evaluate the model               │ 20% of the dataset                    │
│ Model          │ The mathematical function learned from training data │ Linear regression, decision tree      │
│ Prediction     │ The output the model generates for new input         │ Predicted price: ₹45 lakhs            │
│ Algorithm      │ The method used to train the model                   │ Random Forest, KNN, SVM               │
│ Parameter      │ Internal values the model learns during training     │ Weights in a linear regression        │
│ Hyperparameter │ Settings configured before training begins           │ Number of trees in a random forest    │
└────────────────┴──────────────────────────────────────────────────────┴───────────────────────────────────────┘
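
The parameter / hyperparameter distinction in code, on a tiny invented dataset where y = 2x + 1 exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])            # exactly y = 2x + 1

model = LinearRegression(fit_intercept=True)  # hyperparameter: set BEFORE training
model.fit(X, y)                               # parameters: learned DURING training

print(model.coef_[0])      # learned weight ≈ 2.0
print(model.intercept_)    # learned bias   ≈ 1.0
```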

Features and Labels in Practice

import pandas as pd
import numpy as np

# Dataset: Predicting house price
houses = pd.DataFrame({
    "Size_sqft":   [800, 1200, 1500, 900, 2000, 1100, 1800, 2500],
    "Bedrooms":    [2,   3,    3,    2,   4,    3,    4,    5   ],
    "Age_years":   [10,  5,    8,    15,  3,    12,   6,    2   ],
    "Price_lakhs": [35,  55,   65,   40,  85,   48,   75,   105 ]  # ← Label
})

# Separate features (X) and label (y)
X = houses[["Size_sqft", "Bedrooms", "Age_years"]]   # Input
y = houses["Price_lakhs"]                             # Output to predict

print("Features (X):\n", X.head())
print("\nLabel (y):\n", y.head())

Train-Test Split – The Golden Rule

A machine learning model must always be evaluated on data it has never seen during training. Mixing training and test data produces falsely optimistic results — the model appears accurate but fails on real-world data.

Diagram – Train-Test Split

Full Dataset (1000 rows):
┌─────────────────────────────────────────────────┐
│  Training Set (800 rows = 80%)                  │
│  ← Model learns from this                       │
├─────────────────────────────────────────────────┤
│  Test Set (200 rows = 20%)                      │
│  ← Model evaluated on this (never seen before) │
└─────────────────────────────────────────────────┘
In scikit-learn, the train_test_split function performs this split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% goes to test set
    random_state=42     # Ensures same split every run
)

print("Training rows:", len(X_train))
print("Testing rows :", len(X_test))

The Machine Learning Pipeline

Raw Data
    │
    ▼
Clean & Preprocess
(handle nulls, encode categories, scale features)
    │
    ▼
Split into Train / Test
    │
    ▼
Choose an Algorithm
(Linear Regression, Decision Tree, Random Forest, etc.)
    │
    ▼
Train the Model
model.fit(X_train, y_train)
    │
    ▼
Evaluate on Test Set
model.score(X_test, y_test)
    │
    ▼
Tune Hyperparameters
(GridSearchCV, cross-validation)
    │
    ▼
Deploy / Use Model
model.predict(new_data)
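
The preprocess → train steps above can also be chained into a single object with scikit-learn's Pipeline (a minimal sketch on invented house-size data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(500, 2500, size=(60, 1))      # house sizes in sqft
y = X[:, 0] * 0.04 + rng.normal(0, 2, 60)     # price in lakhs, with noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale, then fit — one object runs the whole pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)
print(round(pipe.score(X_test, y_test), 2))   # R² on unseen data
```

Calling fit on the pipeline fits the scaler and the model together, so preprocessing can never accidentally be applied differently at training and prediction time.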

Overfitting and Underfitting

These are the two main failure modes of machine learning models.

Underfitting

A model that is too simple fails to capture the patterns in the training data. It performs poorly on both training data and new data.

Overfitting

A model that is too complex memorises the training data — including its noise and random fluctuations. It performs perfectly on training data but fails on new data because it learned specific details instead of general patterns.

Diagram – Fitting Comparison

Data Points: ● ●  ●    ●  ●   ●

Underfitting:              Good Fit:              Overfitting:
───────────────           ╭──────╮               ╭ ╮  ╭ ╮
(flat straight line        │      │              ╯   ╰╯   ╰╯
 misses the curve)         │       ╰──            passes through
                                                  every single point
                                                  (too specific)

Training Accuracy:  Low         High              Very High
Test Accuracy:      Low         High              Low (fails on new data)

# Signs of overfitting:
# Training accuracy: 99%
# Test accuracy:     72%
# Gap is too large → model memorised training data

# Signs of underfitting:
# Training accuracy: 65%
# Test accuracy:     63%
# Both are low → model too simple
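
Both failure modes can be reproduced with decision trees of different depths on synthetic noisy data (the sine dataset and depth values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)   # noisy curve

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in (1, 4, None):  # too simple / reasonable / unlimited
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(f"depth={depth}: train R²={tree.score(X_tr, y_tr):.2f}, "
          f"test R²={tree.score(X_te, y_te):.2f}")
```

With these settings the unlimited-depth tree fits the training set almost perfectly but scores noticeably lower on the test set, while the depth-1 stump scores poorly on both — the two failure modes in numbers.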

Solutions to Overfitting and Underfitting

┌──────────────┬───────────────────┬───────────────────────────────────────────────────────────────┐
│ Problem      │ Cause             │ Solutions                                                     │
│──────────────┼───────────────────┼───────────────────────────────────────────────────────────────│
│ Underfitting │ Model too simple  │ Use more complex algorithm, add features, train longer        │
│ Overfitting  │ Model too complex │ More data, regularisation, cross-validation, pruning, dropout │
└──────────────┴───────────────────┴───────────────────────────────────────────────────────────────┘
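
One remedy for overfitting, regularisation, in miniature: Ridge adds a penalty (alpha) on large weights. The data below is synthetic — ten features of which only the first actually matters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 10))            # 10 features, only the first is real
y = X[:, 0] * 3 + rng.normal(0, 1, 30)

plain = LinearRegression().fit(X, y)
reg   = Ridge(alpha=10.0).fit(X, y)      # alpha = strength of the penalty

# The penalty shrinks the weight vector, reducing fitting to noise
print(float(np.linalg.norm(plain.coef_)))
print(float(np.linalg.norm(reg.coef_)))  # strictly smaller
```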

Introduction to Scikit-learn

Scikit-learn is the standard machine learning library in Python. It provides a consistent API for every algorithm — meaning the steps to train and predict are the same regardless of the algorithm chosen.

# Universal Scikit-learn Pattern:
# 1. Import the algorithm
from sklearn.linear_model import LinearRegression

# 2. Create the model object
model = LinearRegression()

# 3. Train the model
model.fit(X_train, y_train)

# 4. Make predictions
predictions = model.predict(X_test)

# 5. Evaluate accuracy
score = model.score(X_test, y_test)
print(f"Model accuracy: {score:.2f}")

First Complete Machine Learning Example

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Data
np.random.seed(42)
n = 100
X = pd.DataFrame({
    "StudyHours": np.random.uniform(1, 10, n)
})
y = X["StudyHours"] * 8 + np.random.normal(0, 5, n) + 35

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)

print(f"Mean Absolute Error : {mae:.2f} marks")
print(f"R² Score            : {r2:.4f}")

# Predict for a new student
new_student = pd.DataFrame({"StudyHours": [6.5]})
predicted_score = model.predict(new_student)[0]
print(f"Predicted score for 6.5 hours study: {predicted_score:.1f}")

Output:

Mean Absolute Error : 4.83 marks
R² Score            : 0.8721
Predicted score for 6.5 hours study: 85.6

Choosing the Right Algorithm

┌──────────────────────────┬────────────────────────────────────────────┬───────────────────────────────────────────────┐
│ Problem Type             │ Output                                     │ Common Algorithms                             │
│──────────────────────────┼────────────────────────────────────────────┼───────────────────────────────────────────────│
│ Regression               │ Continuous number (price, temperature)     │ Linear Regression, Ridge, Random Forest       │
│ Classification           │ Category (spam/not spam, disease/healthy)  │ Logistic Regression, Decision Tree, SVM, KNN  │
│ Clustering               │ Groups (no label given)                    │ K-Means, DBSCAN, Hierarchical                 │
│ Dimensionality Reduction │ Fewer features                             │ PCA, t-SNE, UMAP                              │
│ Recommendation           │ Suggested items                            │ Collaborative Filtering, Matrix Factorisation │
└──────────────────────────┴────────────────────────────────────────────┴───────────────────────────────────────────────┘

Summary

  • Machine learning finds patterns in historical data and uses them to predict outcomes on new data
  • Supervised learning uses labelled data; unsupervised learning discovers patterns in unlabelled data
  • Features (X) are inputs; labels (y) are the target values the model learns to predict
  • Always split data into training and test sets before building any model
  • Underfitting means the model is too simple; overfitting means it memorised training data
  • Scikit-learn provides a consistent fit/predict API across all machine learning algorithms
  • Choosing the right algorithm depends on the problem type — regression, classification, or clustering
