DS Supervised Learning with Regression

Regression is a type of supervised learning used when the target variable is a continuous number — such as predicting house prices, forecasting sales, or estimating a patient's blood pressure. This topic covers Linear Regression, Multiple Linear Regression, Polynomial Regression, and Regularised Regression (Ridge and Lasso) with complete Python implementations.

What Is Regression

Regression finds the mathematical relationship between input features and a continuous numeric output. The model learns this relationship from training data and applies it to predict values for new inputs.

Regression Problems:
┌─────────────────────────┬───────────────────────┐
│ Input Features          │ Output (Predict)      │
├─────────────────────────┼───────────────────────┤
│ House size, location    │ Selling price (₹)     │
│ Temperature, humidity   │ Ice cream sales       │
│ Hours studied           │ Exam score            │
│ Car engine size, age    │ Fuel efficiency (kmpl)│
│ Ad spend (₹)            │ Revenue generated     │
└─────────────────────────┴───────────────────────┘

Simple Linear Regression

Simple Linear Regression models the relationship between one input feature (X) and one continuous output (y) using a straight line.

The Equation

y = m × X + b

Where:
  y = predicted output (e.g., price)
  X = input feature (e.g., house size)
  m = slope (how much y changes for every 1-unit increase in X)
  b = intercept (value of y when X = 0)
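
To make the equation concrete, here is a tiny sketch that applies it directly. The slope and intercept values below are made up for illustration, not fitted from data:

```python
# Hypothetical values for illustration only — not fitted from data
m = 0.04   # slope: price rises by 0.04 Lakhs per extra sq.ft.
b = 10.0   # intercept: baseline price in Lakhs when Size = 0

def predict_price(size_sqft):
    """Apply y = m * X + b to a single input."""
    return m * size_sqft + b

print(predict_price(1000))   # 0.04 * 1000 + 10 ≈ 50 Lakhs
```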

Diagram – Linear Regression Line

Price (₹ Lakhs)
  │
90│                              ●
  │                          ●
75│                      ●  ↗
  │                   ●   /
60│             ●    ↗
  │          ●  /
45│       ●  /
  │    ●  /
30│  ● /
  │●/
  └────────────────────────────────→ Size (sq.ft.)
    500  750  1000 1250 1500 1750

The fitted line minimises the total squared vertical distance (residual) between each data point and the line. This criterion is called Ordinary Least Squares (OLS).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# House data
np.random.seed(42)
n = 80
size   = np.random.randint(500, 2500, n)
price  = size * 0.04 + np.random.normal(0, 5, n) + 10   # ₹ in Lakhs

df = pd.DataFrame({"Size": size, "Price": price})

# Split
X = df[["Size"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
lr = LinearRegression()
lr.fit(X_train, y_train)

# Results
print(f"Slope     (m): {lr.coef_[0]:.4f}")
print(f"Intercept (b): {lr.intercept_:.4f}")
print(f"Equation: Price = {lr.coef_[0]:.4f} × Size + {lr.intercept_:.2f}")

# Evaluate
y_pred = lr.predict(X_test)
print(f"\nMAE  : {mean_absolute_error(y_test, y_pred):.2f}")
print(f"RMSE : {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R²   : {r2_score(y_test, y_pred):.4f}")

# Predict new house
new_house = pd.DataFrame({"Size": [1400]})
print(f"\nPredicted price for 1400 sq.ft.: ₹{lr.predict(new_house)[0]:.2f} Lakhs")

Output:

Slope     (m): 0.0401
Intercept (b): 9.8742
Equation: Price = 0.0401 × Size + 9.87

MAE  : 4.82
RMSE : 5.89
R²   : 0.8912

Predicted price for 1400 sq.ft.: ₹65.99 Lakhs
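
For a single feature, the OLS solution also has a closed form: the slope is cov(X, y) / var(X) and the intercept is ȳ − m·x̄. A minimal sketch on noise-free toy data (not the house dataset above):

```python
import numpy as np

# Toy data following y = 0.04 * x + 10 exactly
x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
y = 0.04 * x + 10

# Closed-form OLS for one feature
m = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b = y.mean() - m * x.mean()

print(f"slope = {m:.4f}, intercept = {b:.2f}")   # recovers 0.04 and 10
```

With real, noisy data these formulas give the same answer as `LinearRegression`; sklearn simply solves the same least-squares problem.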

Regression Evaluation Metrics

┌────────┬──────────────────────────────────┬────────────────────────────────────────────┬────────────────────┐
│ Metric │ Formula                          │ Meaning                                    │ Ideal Value        │
├────────┼──────────────────────────────────┼────────────────────────────────────────────┼────────────────────┤
│ MAE    │ Mean of |actual − predicted|     │ Average prediction error in original units │ As low as possible │
│ RMSE   │ √(Mean of (actual − predicted)²) │ Penalises large errors more than MAE       │ As low as possible │
│ R²     │ 1 − (SS_res / SS_tot)            │ % of variance in y explained by the model  │ Close to 1.0       │
└────────┴──────────────────────────────────┴────────────────────────────────────────────┴────────────────────┘

Diagram – What R² Means

R² = 0.0  → Model explains nothing (no better than always predicting the mean)
R² = 0.5  → Model explains 50% of the variation in y
R² = 0.89 → Model explains 89% of the variation in y
R² = 1.0  → Model predicts perfectly (on real data, often a sign of overfitting or leakage)

R² = 1 − (Errors from model) / (Errors from just using mean)
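
This formula is easy to compute by hand. The sketch below uses small made-up arrays; the result matches what `r2_score` would return for the same inputs:

```python
import numpy as np

actual    = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 38.0])

ss_res = np.sum((actual - predicted) ** 2)       # errors from the model
ss_tot = np.sum((actual - actual.mean()) ** 2)   # errors from predicting the mean
r2 = 1 - ss_res / ss_tot

print(f"R² = {r2:.4f}")   # 1 − 21/500 = 0.9580
```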

Multiple Linear Regression

Multiple Linear Regression uses two or more input features to predict the output. It often produces more accurate predictions because it captures several real-world factors simultaneously.

The Equation

y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + ... + bₙXₙ

Example (predicting house price):
Price = 5.2 + (0.04 × Size) + (3.5 × Bedrooms) − (0.8 × Age)
# Multiple features
np.random.seed(10)
n = 150

df_multi = pd.DataFrame({
    "Size":     np.random.randint(500, 3000, n),
    "Bedrooms": np.random.randint(1, 6, n),
    "Age":      np.random.randint(0, 30, n)
})
# True relationship
df_multi["Price"] = (
    0.04 * df_multi["Size"] +
    3.5  * df_multi["Bedrooms"] -
    0.8  * df_multi["Age"] +
    np.random.normal(0, 5, n) + 10
)

X = df_multi[["Size", "Bedrooms", "Age"]]
y = df_multi["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlr = LinearRegression()
mlr.fit(X_train, y_train)

print("Coefficients:")
for name, coef in zip(X.columns, mlr.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"  Intercept: {mlr.intercept_:.4f}")
print(f"\nR² on test data: {r2_score(y_test, mlr.predict(X_test)):.4f}")

Output:

Coefficients:
  Size:      0.0402
  Bedrooms:  3.4281
  Age:      -0.7934
  Intercept: 9.9501

R² on test data: 0.9541
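
Once fitted, a multiple-regression prediction is just the intercept plus the dot product of coefficients and feature values. A sketch using the rounded coefficients printed above (the new house here is hypothetical):

```python
import numpy as np

# Rounded coefficients taken from the output above
coefs     = np.array([0.0402, 3.4281, -0.7934])   # Size, Bedrooms, Age
intercept = 9.9501

# Hypothetical new house: 1800 sq.ft., 3 bedrooms, 5 years old
features = np.array([1800, 3, 5])

price = intercept + coefs @ features
print(f"Predicted price: ₹{price:.2f} Lakhs")
```

This is exactly what `mlr.predict()` computes internally for each row.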

Polynomial Regression

Linear regression fits a straight line. Polynomial regression fits a curve by adding higher powers of the features. This suits data where the relationship bends instead of following a straight line.

Diagram – Linear vs Polynomial Fit

Data with a curve:
●             ●
  ●         ●
    ●     ●
      ● ●

Linear fit (poor):       Polynomial fit (degree 2):
 /                          ╭──────╮
/                          ╯        ╰
(misses the curve)        (captures the curve)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Engine size (cc) vs Fuel efficiency (kmpl)
np.random.seed(7)
engine_cc   = np.linspace(800, 3000, 60)
efficiency  = -0.000005 * engine_cc**2 + 0.02 * engine_cc + 5 + np.random.normal(0, 1, 60)

X_poly = engine_cc.reshape(-1, 1)

# Train polynomial model (degree 2 = quadratic)
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression()
)
poly_model.fit(X_poly, efficiency)

# Evaluate
score = poly_model.score(X_poly, efficiency)
print(f"Polynomial Regression R²: {score:.4f}")

pred = poly_model.predict([[1500]])
print(f"Predicted efficiency at 1500cc: {pred[0]:.2f} kmpl")
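
To see the diagram's point numerically, the sketch below fits both a straight line and a degree-2 polynomial to noise-free quadratic toy data (separate from the fuel-efficiency example above). The straight line cannot follow the bend, so its R² is far lower:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free quadratic data: rises, peaks, then falls
x = np.linspace(800, 3000, 60).reshape(-1, 1)
y = -0.000005 * x.ravel() ** 2 + 0.02 * x.ravel() + 5

linear    = LinearRegression().fit(x, y)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression()
).fit(x, y)

print(f"Linear   R²: {linear.score(x, y):.4f}")     # misses the curve
print(f"Degree-2 R²: {quadratic.score(x, y):.4f}")  # captures it almost exactly
```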

Ridge Regression – Regularised Linear Regression

Ridge Regression adds a penalty for large coefficients to the standard linear regression. This prevents overfitting when features are correlated or when the dataset is small.

Diagram – Regularisation Concept

Standard Linear Regression:
Minimise: Error (sum of squared residuals)

Ridge Regression:
Minimise: Error + α × (sum of squared coefficients)

α (alpha) controls the strength of the L2 penalty:

  α = 0   → Standard Linear Regression (no penalty)
  α = 1   → Moderate shrinkage of coefficients
  α = 100 → Strong shrinkage (simpler model)

Effect: Coefficients shrink toward zero — reduces overfitting
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Scale features first (recommended for Ridge/Lasso: the penalty depends on coefficient scale)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_s, y_train)

print("Ridge Coefficients:")
for name, coef in zip(X.columns, ridge.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"Ridge R²: {r2_score(y_test, ridge.predict(X_test_s)):.4f}")
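
The shrinkage effect is easy to see by sweeping α. A self-contained sketch on toy data (separate from the house data above): as α grows, the squared size of the coefficients can only shrink.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two highly correlated features; only x1 truly drives y
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # near-copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

norms = []
for alpha in [0.01, 1, 100]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.sum(coefs ** 2)))
    print(f"α = {alpha:>6}: coefficients = {np.round(coefs, 3)}")
```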

Lasso Regression – Feature Selection via Regularisation

Lasso (Least Absolute Shrinkage and Selection Operator) adds a different type of penalty that can shrink some coefficients all the way to zero — effectively removing those features from the model. This makes Lasso useful for automatic feature selection.

# Lasso regression
lasso = Lasso(alpha=0.5)
lasso.fit(X_train_s, y_train)

print("Lasso Coefficients:")
for name, coef in zip(X.columns, lasso.coef_):
    print(f"  {name}: {coef:.4f}")
    if coef == 0:
        print(f"    → Feature '{name}' removed by Lasso")

print(f"Lasso R²: {r2_score(y_test, lasso.predict(X_test_s)):.4f}")

Ridge vs Lasso Comparison

┌────────────────────┬───────────────────────────────────────────┬───────────────────────────────────────────┐
│ Aspect             │ Ridge                                     │ Lasso                                     │
├────────────────────┼───────────────────────────────────────────┼───────────────────────────────────────────┤
│ Penalty type       │ Sum of squared coefficients (L2)          │ Sum of absolute coefficients (L1)         │
│ Coefficient result │ Shrinks toward zero, never reaches zero   │ Can shrink coefficients to exactly zero   │
│ Feature selection  │ Keeps all features (with smaller weights) │ Automatically removes irrelevant features │
│ Best used when     │ Many features all contribute somewhat     │ Many features, only a few are important   │
└────────────────────┴───────────────────────────────────────────┴───────────────────────────────────────────┘

Complete Regression Comparison

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error

models = {
    "Linear Regression": LinearRegression(),
    "Ridge (α=1.0)":     Ridge(alpha=1.0),
    "Lasso (α=0.5)":     Lasso(alpha=0.5)
}

print(f"{'Model':<25} {'R²':>8} {'MAE':>10}")
print("-" * 45)

for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    r2  = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    print(f"{name:<25} {r2:>8.4f} {mae:>10.2f}")

Assumptions of Linear Regression

  • Linearity – The relationship between X and y is linear (or transformed to be linear)
  • Independence – Observations are independent of each other
  • Homoscedasticity – Errors have constant variance across all values of X
  • Normality of Residuals – The prediction errors follow a normal distribution
  • No Multicollinearity – Input features are not highly correlated with each other
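
A few of these assumptions can be sanity-checked directly from the residuals. The sketch below runs on synthetic data that satisfies the assumptions by construction; it is a rough check, not a formal statistical test:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: linear trend + constant-variance Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(500, 2500, size=(100, 1))
y = 0.04 * X.ravel() + 10 + rng.normal(0, 5, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print(f"Residual mean: {residuals.mean():.6f}")   # ~0 when an intercept is fitted

# Homoscedasticity: residual spread should be similar in both halves of X
low  = residuals[X.ravel() <  1500]
high = residuals[X.ravel() >= 1500]
print(f"Std (small X vs large X): {low.std():.2f} vs {high.std():.2f}")
```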

Summary

  • Regression predicts a continuous numeric output from one or more input features
  • Simple Linear Regression models one feature vs one output with a straight line
  • Multiple Linear Regression uses several features to improve prediction accuracy
  • Polynomial Regression fits curved relationships by adding squared and cubic terms
  • Ridge Regression reduces overfitting by penalising large coefficients (L2 penalty)
  • Lasso Regression removes irrelevant features by setting their coefficients to zero (L1 penalty)
  • R² measures how much variance in the output the model explains — higher is better
  • MAE and RMSE measure average prediction error in the original units of the target
