DS Supervised Learning with Regression
Regression is a type of supervised learning used when the target variable is a continuous number — such as predicting house prices, forecasting sales, or estimating a patient's blood pressure. This topic covers Linear Regression, Multiple Linear Regression, Polynomial Regression, and Regularised Regression (Ridge and Lasso) with complete Python implementations.
What Is Regression
Regression finds the mathematical relationship between input features and a continuous numeric output. The model learns this relationship from training data and applies it to predict values for new inputs.
Regression Problems:

┌─────────────────────────┬───────────────────────┐
│ Input Features          │ Output (Predict)      │
├─────────────────────────┼───────────────────────┤
│ House size, location    │ Selling price (₹)     │
│ Temperature, humidity   │ Ice cream sales       │
│ Hours studied           │ Exam score            │
│ Car engine size, age    │ Fuel efficiency (kmpl)│
│ Ad spend (₹)            │ Revenue generated     │
└─────────────────────────┴───────────────────────┘
Simple Linear Regression
Simple Linear Regression models the relationship between one input feature (X) and one continuous output (y) using a straight line.
The Equation
y = m × X + b

Where:
  y = predicted output (e.g., price)
  X = input feature (e.g., house size)
  m = slope (how much y changes for every 1-unit increase in X)
  b = intercept (value of y when X = 0)

For example, with m = 0.04 and b = 10, a 1400 sq.ft. house gives
0.04 × 1400 + 10 = ₹66 Lakhs.
Diagram – Linear Regression Line
Price (₹ Lakhs)
  │
90│                              ●
  │                          ● ↗
75│                      ● ↗
  │                  ● ↗
60│              ● ↗
  │           ● ↗
45│        ● ↗
  │     ● ↗
30│   ● ↗
  │ ●↗
  └────────────────────────────────→ Size (sq.ft.)
    500   750   1000  1250  1500  1750
The line minimises the sum of squared vertical distances
(residuals) between each point and the line itself.
This is called Ordinary Least Squares (OLS).
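For a single feature, OLS even has a closed-form solution: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄. Below is a minimal NumPy sketch on made-up points; it should match what sklearn's LinearRegression finds on the same data.

import numpy as np

# Made-up points: sizes (sq.ft.) and prices (₹ Lakhs)
xs = np.array([500, 800, 1100, 1500, 2000], dtype=float)
ys = np.array([30, 42, 55, 70, 90], dtype=float)

# Closed-form OLS for one feature
m = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
b = ys.mean() - m * xs.mean()
print(f"Slope: {m:.4f}, Intercept: {b:.4f}")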
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# House data
np.random.seed(42)
n = 80
size = np.random.randint(500, 2500, n)
price = size * 0.04 + np.random.normal(0, 5, n) + 10 # ₹ in Lakhs
df = pd.DataFrame({"Size": size, "Price": price})
# Split
X = df[["Size"]]
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
lr = LinearRegression()
lr.fit(X_train, y_train)
# Results
print(f"Slope (m): {lr.coef_[0]:.4f}")
print(f"Intercept (b): {lr.intercept_:.4f}")
print(f"Equation: Price = {lr.coef_[0]:.4f} × Size + {lr.intercept_:.2f}")
# Evaluate
y_pred = lr.predict(X_test)
print(f"\nMAE : {mean_absolute_error(y_test, y_pred):.2f}")
print(f"RMSE : {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R² : {r2_score(y_test, y_pred):.4f}")
# Predict new house
new_house = pd.DataFrame({"Size": [1400]})
print(f"\nPredicted price for 1400 sq.ft.: ₹{lr.predict(new_house)[0]:.2f} Lakhs")
Output:
Slope (m): 0.0401
Intercept (b): 9.8742
Equation: Price = 0.0401 × Size + 9.87

MAE : 4.82
RMSE : 5.89
R² : 0.8912

Predicted price for 1400 sq.ft.: ₹65.99 Lakhs
Regression Evaluation Metrics
| Metric | Formula | Meaning | Ideal Value |
|---|---|---|---|
| MAE | Mean of \|actual − predicted\| | Average prediction error in original units | As low as possible |
| RMSE | √(Mean of (actual − predicted)²) | Penalises large errors more than MAE | As low as possible |
| R² | 1 − (SS_res / SS_tot) | % of variance in y explained by the model | Close to 1.0 |
Diagram – What R² Means
R² = 0.0  → Model explains nothing; it always predicts the average
R² = 0.5  → Model explains 50% of the variation in y
R² = 0.89 → Model explains 89% of the variation in y
R² = 1.0  → Model predicts perfectly (on real data, a perfect score
            often signals data leakage or overfitting)

R² = 1 − (Errors from model) / (Errors from just using the mean)
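All three metrics are easy to compute by hand, which makes their meaning concrete. A minimal NumPy sketch with made-up actual and predicted arrays; the results should agree with sklearn's mean_absolute_error, mean_squared_error, and r2_score on the same inputs.

import numpy as np

# Made-up actual vs predicted values
actual = np.array([50.0, 62.0, 71.0, 45.0, 88.0])
predicted = np.array([52.0, 60.0, 75.0, 44.0, 83.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))                   # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))            # penalises large errors more
ss_res = np.sum(errors ** 2)                    # squared errors from the model
ss_tot = np.sum((actual - actual.mean()) ** 2)  # squared errors from just the mean
r2 = 1 - ss_res / ss_tot
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R²: {r2:.4f}")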
Multiple Linear Regression
Multiple Linear Regression uses two or more input features to predict the output. This model produces more accurate predictions by capturing multiple real-world factors simultaneously.
The Equation
y = b₀ + b₁X₁ + b₂X₂ + b₃X₃ + ... + bₙXₙ

Example (predicting house price):
Price = 5.2 + (0.04 × Size) + (3.5 × Bedrooms) − (0.8 × Age)
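Plugging a concrete (made-up) house into the example equation shows how each coefficient contributes to the prediction:

# Made-up house: 1200 sq.ft., 3 bedrooms, 5 years old
size_sqft, bedrooms, age = 1200, 3, 5
price = 5.2 + 0.04 * size_sqft + 3.5 * bedrooms - 0.8 * age
print(f"Estimated price: ₹{price:.1f} Lakhs")  # 5.2 + 48.0 + 10.5 - 4.0 = 59.7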
# Multiple features
np.random.seed(10)
n = 150
df_multi = pd.DataFrame({
"Size": np.random.randint(500, 3000, n),
"Bedrooms": np.random.randint(1, 6, n),
"Age": np.random.randint(0, 30, n)
})
# True relationship
df_multi["Price"] = (
0.04 * df_multi["Size"] +
3.5 * df_multi["Bedrooms"] -
0.8 * df_multi["Age"] +
np.random.normal(0, 5, n) + 10
)
X = df_multi[["Size", "Bedrooms", "Age"]]
y = df_multi["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mlr = LinearRegression()
mlr.fit(X_train, y_train)
print("Coefficients:")
for name, coef in zip(X.columns, mlr.coef_):
print(f" {name}: {coef:.4f}")
print(f" Intercept: {mlr.intercept_:.4f}")
print(f"\nR² on test data: {r2_score(y_test, mlr.predict(X_test)):.4f}")
Output:
Coefficients:
  Size: 0.0402
  Bedrooms: 3.4281
  Age: -0.7934
  Intercept: 9.9501

R² on test data: 0.9541
Polynomial Regression
Linear regression fits a straight line. Polynomial regression fits a curve by adding higher powers of the features. This suits data where the relationship bends instead of following a straight line.
Diagram – Linear vs Polynomial Fit
Data with a curve:

          ●   ●
       ●         ●
     ●             ●
   ●                 ●

Linear fit (poor):           Polynomial fit (degree 2):

        /                          ╭──────╮
       /                          ╯        ╰
  (misses the curve)           (captures the curve)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Engine size (cc) vs Fuel efficiency (kmpl)
np.random.seed(7)
engine_cc = np.linspace(800, 3000, 60)
efficiency = -0.000005 * engine_cc**2 + 0.02 * engine_cc + 5 + np.random.normal(0, 1, 60)
X_poly = engine_cc.reshape(-1, 1)
# Train polynomial model (degree 2 = quadratic)
poly_model = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False),
LinearRegression()
)
poly_model.fit(X_poly, efficiency)
# Evaluate (R² is measured on the training data; no hold-out split in this short example)
score = poly_model.score(X_poly, efficiency)
print(f"Polynomial Regression R²: {score:.4f}")
pred = poly_model.predict([[1500]])
print(f"Predicted efficiency at 1500cc: {pred[0]:.2f} kmpl")
Ridge Regression – Regularised Linear Regression
Ridge Regression adds a penalty for large coefficients to the standard linear regression. This prevents overfitting when features are correlated or when the dataset is small.
Diagram – Regularisation Concept
Standard Linear Regression:
Minimise: Error (sum of squared residuals)
Ridge Regression:
Minimise: Error + α × (sum of squared coefficients)
                  │         └── L2 Penalty
                  └── controls strength of regularisation
α = 0 → Standard Linear Regression (no penalty)
α = 1 → Moderate shrinkage of coefficients
α = 100 → Strong shrinkage (simpler model)
Effect: Coefficients shrink toward zero — reduces overfitting
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
# Scale features first (strongly recommended for Ridge/Lasso, since the penalty treats all coefficients on the same scale)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_s, y_train)
print("Ridge Coefficients:")
for name, coef in zip(X.columns, ridge.coef_):
print(f" {name}: {coef:.4f}")
print(f"Ridge R²: {r2_score(y_test, ridge.predict(X_test_s)):.4f}")
Lasso Regression – Feature Selection via Regularisation
Lasso (Least Absolute Shrinkage and Selection Operator) adds a different type of penalty that can shrink some coefficients all the way to zero — effectively removing those features from the model. This makes Lasso useful for automatic feature selection.
# Lasso regression
lasso = Lasso(alpha=0.5)
lasso.fit(X_train_s, y_train)
print("Lasso Coefficients:")
for name, coef in zip(X.columns, lasso.coef_):
    print(f" {name}: {coef:.4f}")
    if coef == 0:
        print(f"   → Feature '{name}' removed by Lasso")
print(f"Lasso R²: {r2_score(y_test, lasso.predict(X_test_s)):.4f}")
Ridge vs Lasso Comparison
| Aspect | Ridge | Lasso |
|---|---|---|
| Penalty type | Sum of squared coefficients (L2) | Sum of absolute coefficients (L1) |
| Coefficient result | Shrinks toward zero, never reaches zero | Can shrink coefficients to exactly zero |
| Feature selection | Keeps all features (with smaller weights) | Automatically removes irrelevant features |
| Best used when | Many features all contribute somewhat | Many features, only a few are important |
Complete Regression Comparison
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error
models = {
"Linear Regression": LinearRegression(),
"Ridge (α=1.0)": Ridge(alpha=1.0),
"Lasso (α=0.5)": Lasso(alpha=0.5)
}
print(f"{'Model':<25} {'R²':>8} {'MAE':>10}")
print("-" * 45)
for name, model in models.items():
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"{name:<25} {r2:>8.4f} {mae:>10.2f}")
Assumptions of Linear Regression
- Linearity – The relationship between X and y is linear (or transformed to be linear)
- Independence – Observations are independent of each other
- Homoscedasticity – Errors have constant variance across all values of X
- Normality of Residuals – The prediction errors follow a normal distribution
- No Multicollinearity – Input features are not highly correlated with each other
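A few of these can be checked with quick diagnostics, sketched below for the multiple-regression model (mlr) and its test split from earlier: the residuals-vs-predictions plot should be a shapeless horizontal band (linearity, homoscedasticity), the histogram roughly bell-shaped (normality), and high pairwise feature correlations warn of multicollinearity.

import matplotlib.pyplot as plt

# Residual diagnostics for the multiple-regression model from earlier
residuals = y_test - mlr.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(mlr.predict(X_test), residuals)
ax1.axhline(0, color="red", linestyle="--")
ax1.set_xlabel("Predicted Price")
ax1.set_ylabel("Residual")
ax1.set_title("Residuals vs Predictions")  # want: a shapeless horizontal band

ax2.hist(residuals, bins=15)
ax2.set_title("Residual Distribution")     # want: roughly bell-shaped
plt.tight_layout()
plt.show()

# Multicollinearity check: high pairwise correlations are a warning sign
print(X.corr())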
Summary
- Regression predicts a continuous numeric output from one or more input features
- Simple Linear Regression models one feature vs one output with a straight line
- Multiple Linear Regression uses several features to improve prediction accuracy
- Polynomial Regression fits curved relationships by adding higher powers of the features (squared, cubic, ...)
- Ridge Regression reduces overfitting by penalising large coefficients (L2 penalty)
- Lasso Regression removes irrelevant features by setting their coefficients to zero (L1 penalty)
- R² measures how much variance in the output the model explains — higher is better
- MAE and RMSE measure average prediction error in the original units of the target
