DS Supervised Learning with Regression
Regression is a type of supervised learning used when the target variable is a continuous number — such as predicting house prices, forecasting sales, or estimating a patient's blood pressure. This topic covers Linear Regression, Multiple Linear Regression, Polynomial Regression, and Regularised Regression (Ridge and Lasso) with complete Python implementations.
What Is Regression
Regression finds the mathematical relationship between input features and a continuous numeric output. The model learns this relationship from training data and applies it to predict values for new inputs.
Regression Problems:

┌─────────────────────────┬───────────────────────┐
│ Input Features          │ Output (Predict)      │
├─────────────────────────┼───────────────────────┤
│ House size, location    │ Selling price (₹)     │
│ Temperature, humidity   │ Ice cream sales       │
│ Hours studied           │ Exam score            │
│ Car engine size, age    │ Fuel efficiency (kmpl)│
│ Ad spend (₹)            │ Revenue generated     │
└─────────────────────────┴───────────────────────┘
Simple Linear Regression
Simple Linear Regression models the relationship between one input feature (X) and one continuous output (y) using a straight line.
The Equation
y = m × X + b

Where:
  y = predicted output (e.g., price)
  X = input feature (e.g., house size)
  m = slope (how much y changes for every 1-unit increase in X)
  b = intercept (value of y when X = 0)

For example, with m = 0.04 and b = 10, a 1400 sq.ft. house gives
0.04 × 1400 + 10 = ₹66 Lakhs.
Diagram – Linear Regression Line
Price (₹ Lakhs)
  │
90│                              ●
  │                          ● ↗
75│                      ● ↗
  │                  ● ↗
60│              ● ↗
  │           ● ↗
45│        ● ↗
  │     ● ↗
30│   ● ↗
  │ ●↗
  └────────────────────────────────→ Size (sq.ft.)
    500   750   1000  1250  1500  1750
The line minimises the sum of squared vertical distances
(residuals) between each point and the line itself.
This is called Ordinary Least Squares (OLS).
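For a single feature, OLS even has a closed-form solution: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄. Below is a minimal NumPy sketch on made-up points; it should match what sklearn's LinearRegression finds on the same data.

import numpy as np

# Made-up points: sizes (sq.ft.) and prices (₹ Lakhs)
xs = np.array([500, 800, 1100, 1500, 2000], dtype=float)
ys = np.array([30, 42, 55, 70, 90], dtype=float)

# Closed-form OLS for one feature
m = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
b = ys.mean() - m * xs.mean()
print(f"Slope: {m:.4f}, Intercept: {b:.4f}")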
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# House data
np.random.seed(42)
n = 80
size = np.random.randint(500, 2500, n)
price = size * 0.04 + np.random.normal(0, 5, n) + 10 # ₹ in Lakhs
df = pd.DataFrame({"Size": size, "Price": price})
# Split
X = df[["Size"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
lr = LinearRegression()
lr.fit(X_train, y_train)
# Results
print(f"Slope (m): {lr.coef_[0]:.4f}")
print(f"Intercept (b): {lr.intercept_:.4f}")
print(f"Equation: Price = {lr.coef_[0]:.4f} × Size + {lr.intercept_:.2f}")
# Evaluate
y_pred = lr.predict(X_test)
print(f"\nMAE : {mean_absolute_error(y_test, y_pred):.2f}")
print(f"RMSE : {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R² : {r2_score(y_test, y_pred):.4f}")
# Predict new house
new_house = pd.DataFrame({"Size": [1400]})
print(f"\nPredicted price for 1400 sq.ft.: ₹{lr.predict(new_house)[0]:.2f} Lakhs")
Output:
Slope (m): 0.0401
Intercept (b): 9.8742
Equation: Price = 0.0401 × Size + 9.87

MAE : 4.82
RMSE : 5.89
R² : 0.8912

Predicted price for 1400 sq.ft.: ₹65.99 Lakhs
Regression Evaluation Metrics
| Metric | Formula | Meaning | Ideal Value |
|---|---|---|---|
| MAE | Mean of \|actual − predicted\| | Average prediction error in original units | As low as possible |
| RMSE | √(Mean of (actual − predicted)²) | Penalises large errors more than MAE | As low as possible |
| R² | 1 − (SS_res / SS_tot) | % of variance in y explained by the model | Close to 1.0 |
Diagram – What R² Means
R² = 0.0  → Model explains nothing; it always predicts the average
R² = 0.5  → Model explains 50% of the variation in y
R² = 0.89 → Model explains 89% of the variation in y
R² = 1.0  → Model predicts perfectly (on real data, a perfect score
            often signals data leakage or overfitting)

R² = 1 − (Errors from model) / (Errors from just using the mean)
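All three metrics are easy to compute by hand, which makes their meaning concrete. A minimal NumPy sketch with made-up actual and predicted arrays; the results should agree with sklearn's mean_absolute_error, mean_squared_error, and r2_score on the same inputs.

import numpy as np

# Made-up actual vs predicted values
actual = np.array([50.0, 62.0, 71.0, 45.0, 88.0])
predicted = np.array([52.0, 60.0, 75.0, 44.0, 83.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))                   # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))            # penalises large errors more
ss_res = np.sum(errors ** 2)                    # squared errors from the model
ss_tot = np.sum((actual - actual.mean()) ** 2)  # squared errors from just the mean
r2 = 1 - ss_res / ss_tot
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R²: {r2:.4f}")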
Multiple Linear Regression
Multiple Linear Regression uses two or more input features to predict the output. This model produces more accurate predictions by capturing multiple real-world factors simultaneously.
The Equation
y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + ... + bₙXₙ

Example (predicting house price):
Price = 5.2 + (0.04 × Size) + (3.5 × Bedrooms) − (0.8 × Age)
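Plugging a concrete (made-up) house into the example equation shows how each coefficient contributes to the prediction:

# Made-up house: 1200 sq.ft., 3 bedrooms, 5 years old
size_sqft, bedrooms, age = 1200, 3, 5
price = 5.2 + 0.04 * size_sqft + 3.5 * bedrooms - 0.8 * age
print(f"Estimated price: ₹{price:.1f} Lakhs")  # 5.2 + 48.0 + 10.5 - 4.0 = 59.7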
# Multiple features
np.random.seed(10)
n = 150
df_multi = pd.DataFrame({
"Size": np.random.randint(500, 3000, n),
"Bedrooms": np.random.randint(1, 6, n),
"Age": np.random.randint(0, 30, n)
})
# True relationship
df_multi["Price"] = (
0.04 * df_multi["Size"] +
3.5 * df_multi["Bedrooms"] -
0.8 * df_multi["Age"] +
np.random.normal(0, 5, n) + 10
)
X = df_multi[["Size", "Bedrooms", "Age"]]
y = df_multi["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mlr = LinearRegression()
mlr.fit(X_train, y_train)
print("Coefficients:")
for name, coef in zip(X.columns, mlr.coef_):
print(f" {name}: {coef:.4f}")
print(f" Intercept: {mlr.intercept_:.4f}")
print(f"\nR² on test data: {r2_score(y_test, mlr.predict(X_test)):.4f}")
Output:
Coefficients:
  Size: 0.0402
  Bedrooms: 3.4281
  Age: -0.7934
  Intercept: 9.9501

R² on test data: 0.9541
Polynomial Regression
Linear regression fits a straight line. Polynomial regression fits a curve by adding higher powers of the features. This suits data where the relationship bends instead of following a straight line.
Diagram – Linear vs Polynomial Fit
Data with a curve:

          ●   ●
       ●         ●
     ●             ●
   ●                 ●

Linear fit (poor):           Polynomial fit (degree 2):

        /                          ╭──────╮
       /                          ╯        ╰
  (misses the curve)           (captures the curve)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Engine size (cc) vs Fuel efficiency (kmpl)
np.random.seed(7)
engine_cc = np.linspace(800, 3000, 60)
efficiency = -0.000005 * engine_cc**2 + 0.02 * engine_cc + 5 + np.random.normal(0, 1, 60)
X_poly = engine_cc.reshape(-1, 1)
# Train polynomial model (degree 2 = quadratic)
poly_model = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False),
LinearRegression()
)
poly_model.fit(X_poly, efficiency)
# Evaluate (R² is measured on the training data; no hold-out split in this short example)
score = poly_model.score(X_poly, efficiency)
print(f"Polynomial Regression R²: {score:.4f}")
pred = poly_model.predict([[1500]])
print(f"Predicted efficiency at 1500cc: {pred[0]:.2f} kmpl")
Ridge Regression – Regularised Linear Regression
Ridge Regression adds a penalty for large coefficients to the standard linear regression. This prevents overfitting when features are correlated or when the dataset is small.
Diagram – Regularisation Concept
Standard Linear Regression:
Minimise: Error (sum of squared residuals)
Ridge Regression:
Minimise: Error + α × (sum of squared coefficients)
                  │         └── L2 Penalty
                  └── controls strength of regularisation
α = 0 → Standard Linear Regression (no penalty)
α = 1 → Moderate shrinkage of coefficients
α = 100 → Strong shrinkage (simpler model)
Effect: Coefficients shrink toward zero — reduces overfitting
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
# Scale features first (strongly recommended for Ridge/Lasso, since the penalty treats all coefficients on the same scale)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_s, y_train)
print("Ridge Coefficients:")
for name, coef in zip(X.columns, ridge.coef_):
print(f" {name}: {coef:.4f}")
print(f"Ridge R²: {r2_score(y_test, ridge.predict(X_test_s)):.4f}")
Lasso Regression – Feature Selection via Regularisation
Lasso (Least Absolute Shrinkage and Selection Operator) adds a different type of penalty that can shrink some coefficients all the way to zero — effectively removing those features from the model. This makes Lasso useful for automatic feature selection.
# Lasso regression
lasso = Lasso(alpha=0.5)
lasso.fit(X_train_s, y_train)
print("Lasso Coefficients:")
for name, coef in zip(X.columns, lasso.coef_):
    print(f" {name}: {coef:.4f}")
    if coef == 0:
        print(f"   → Feature '{name}' removed by Lasso")
print(f"Lasso R²: {r2_score(y_test, lasso.predict(X_test_s)):.4f}")
Ridge vs Lasso Comparison
| Aspect | Ridge | Lasso |
|---|---|---|
| Penalty type | Sum of squared coefficients (L2) | Sum of absolute coefficients (L1) |
| Coefficient result | Shrinks toward zero, never reaches zero | Can shrink coefficients to exactly zero |
| Feature selection | Keeps all features (with smaller weights) | Automatically removes irrelevant features |
| Best used when | Many features all contribute somewhat | Many features, only a few are important |
Complete Regression Comparison
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error
models = {
"Linear Regression": LinearRegression(),
"Ridge (α=1.0)": Ridge(alpha=1.0),
"Lasso (α=0.5)": Lasso(alpha=0.5)
}
print(f"{'Model':<25} {'R²':>8} {'MAE':>10}")
print("-" * 45)
for name, model in models.items():
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"{name:<25} {r2:>8.4f} {mae:>10.2f}")
Assumptions of Linear Regression
- Linearity – The relationship between X and y is linear (or transformed to be linear)
- Independence – Observations are independent of each other
- Homoscedasticity – Errors have constant variance across all values of X
- Normality of Residuals – The prediction errors follow a normal distribution
- No Multicollinearity – Input features are not highly correlated with each other
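A few of these can be checked with quick diagnostics, sketched below for the multiple-regression model (mlr) and its test split from earlier: the residuals-vs-predictions plot should be a shapeless horizontal band (linearity, homoscedasticity), the histogram roughly bell-shaped (normality), and high pairwise feature correlations warn of multicollinearity.

import matplotlib.pyplot as plt

# Residual diagnostics for the multiple-regression model from earlier
residuals = y_test - mlr.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(mlr.predict(X_test), residuals)
ax1.axhline(0, color="red", linestyle="--")
ax1.set_xlabel("Predicted Price")
ax1.set_ylabel("Residual")
ax1.set_title("Residuals vs Predictions")  # want: a shapeless horizontal band

ax2.hist(residuals, bins=15)
ax2.set_title("Residual Distribution")     # want: roughly bell-shaped
plt.tight_layout()
plt.show()

# Multicollinearity check: high pairwise correlations are a warning sign
print(X.corr())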
Summary
- Regression predicts a continuous numeric output from one or more input features
- Simple Linear Regression models one feature vs one output with a straight line
- Multiple Linear Regression uses several features to improve prediction accuracy
- Polynomial Regression fits curved relationships by adding higher powers of the features (squared, cubic, ...)
- Ridge Regression reduces overfitting by penalising large coefficients (L2 penalty)
- Lasso Regression removes irrelevant features by setting their coefficients to zero (L1 penalty)
- R² measures how much variance in the output the model explains — higher is better
- MAE and RMSE measure average prediction error in the original units of the target
