DS Machine Learning Fundamentals
Machine Learning (ML) is the part of artificial intelligence where computers learn from data without being explicitly programmed for every scenario. Instead of writing rules manually, a machine learning algorithm finds patterns in historical data and uses them to make predictions on new, unseen data.
What Machine Learning Is and Is Not
Traditional programming tells the computer exactly what to do: "If temperature > 35°C, send a heat alert." Machine learning lets the computer figure out those rules by looking at thousands of past examples.
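To make the contrast concrete, here is a small illustrative sketch (the temperatures and alert labels are invented): a hand-written rule versus a "rule" recovered from past examples.

```python
# Traditional programming: a human writes the threshold.
def heat_alert_manual(temp_c):
    return temp_c > 35

# Machine learning: infer the threshold from historical examples instead.
past_temps = [20, 25, 30, 34, 36, 38, 40, 42]
alerts     = [0,  0,  0,  0,  1,  1,  1,  1]   # labels recorded in the past

# A minimal "learned rule": the midpoint between the warmest non-alert
# temperature and the coolest alert temperature in the data.
threshold = (max(t for t, a in zip(past_temps, alerts) if a == 0) +
             min(t for t, a in zip(past_temps, alerts) if a == 1)) / 2

def heat_alert_learned(temp_c):
    return temp_c > threshold

print(threshold)               # 35.0 — recovered from examples alone
print(heat_alert_learned(37))  # True
```

Real algorithms learn far richer rules than a single threshold, but the principle is the same: the rule comes from the data, not from a programmer.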
Diagram – Traditional Programming vs Machine Learning
Traditional Programming:
┌──────────┐   ┌─────────┐   ┌────────┐
│  Rules   │ + │  Data   │ = │ Output │
│(Written  │   │         │   │        │
│manually) │   │         │   │        │
└──────────┘   └─────────┘   └────────┘

Machine Learning:
┌──────────┐   ┌─────────┐   ┌────────┐
│  Data    │ + │ Output  │ = │ Rules  │
│(Examples)│   │(Labels) │   │(Model) │
└──────────┘   └─────────┘   └────────┘

ML figures out the rules automatically from data + answers.
Types of Machine Learning
1. Supervised Learning
The model trains on labelled data — data where the correct answer (label) is already known. It learns to map input features to the output label.
Labelled Data Examples:
┌──────────────────────────┬────────────────────────┐
│ Features (Input)         │ Label (Answer)         │
├──────────────────────────┼────────────────────────┤
│ Email text               │ Spam / Not Spam        │
│ House size, location     │ House price            │
│ Patient symptoms         │ Disease / No Disease   │
│ Image pixels             │ Cat / Dog              │
└──────────────────────────┴────────────────────────┘
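As a toy illustration of supervised learning (the weights and ear lengths below are invented), a classifier can learn the feature → label mapping from a handful of labelled examples:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy labelled data: [weight_g, ear_length_cm] → "cat" or "dog"
X = [[4000, 6], [4500, 7], [3800, 6], [9000, 12], [11000, 13], [8500, 11]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                      # learn the feature → label mapping

print(model.predict([[4200, 6]]))   # → ['cat']
```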
2. Unsupervised Learning
The model trains on unlabelled data. No correct answers exist. The algorithm discovers hidden structures, groups, or patterns on its own.
Unlabelled Data Examples:
┌──────────────────────────┬────────────────────────┐
│ Data                     │ Goal                   │
├──────────────────────────┼────────────────────────┤
│ Customer purchase history│ Group similar customers│
│ News articles            │ Discover topics        │
│ Gene expression data     │ Find gene clusters     │
└──────────────────────────┴────────────────────────┘
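A minimal unsupervised sketch using scikit-learn's KMeans on made-up customer data. Note that no labels are supplied; the algorithm invents the groups itself:

```python
from sklearn.cluster import KMeans

# Unlabelled customer data: [annual_spend, visits_per_month] — no answers given
X = [[200, 1], [250, 2], [220, 1], [1800, 12], [2000, 15], [1900, 13]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster IDs discovered from structure alone

print(labels)   # low-spend customers share one ID, high-spend the other
```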
3. Reinforcement Learning
An agent learns by taking actions in an environment and receiving rewards or penalties. Over time, it learns which actions maximise rewards.
Reinforcement Learning Flow:
Agent → Takes Action → Environment → Receives Reward → Agent learns
         (Move left)    (hits wall)   (penalty: -10)
Used in: chess engines, robotics, self-driving cars, game-playing AI
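A stripped-down sketch of the reward loop, using a hypothetical two-action "bandit" instead of a full environment (the action names and reward probabilities are invented). The agent never sees the true probabilities; it learns purely from rewards:

```python
import random

random.seed(0)
true_reward = {"left": 0.2, "right": 0.8}   # hidden from the agent
value = {"left": 0.0, "right": 0.0}         # agent's running estimates
counts = {"left": 0, "right": 0}

for step in range(1000):
    # Explore half the time (kept high for this short demo),
    # otherwise exploit the best-known action.
    if random.random() < 0.5:
        action = random.choice(["left", "right"])
    else:
        action = max(value, key=value.get)
    reward = 1 if random.random() < true_reward[action] else 0
    counts[action] += 1
    # Update the running mean reward estimate for the chosen action
    value[action] += (reward - value[action]) / counts[action]

print(value)   # the estimate for "right" ends up higher
```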
Key Machine Learning Terminology
| Term | Definition | Real Example |
|---|---|---|
| Feature (X) | Input variable used to make a prediction | House size, number of rooms, location |
| Label (y) | Output variable the model predicts | House price |
| Training Data | Data used to teach the model | 80% of the dataset |
| Test Data | Unseen data used to evaluate the model | 20% of the dataset |
| Model | The mathematical function learned from training data | Linear regression, decision tree |
| Prediction | The output the model generates for new input | Predicted price: ₹45 lakhs |
| Algorithm | The method used to train the model | Random Forest, KNN, SVM |
| Parameter | Internal values the model learns during training | Weights in a linear regression |
| Hyperparameter | Settings configured before training begins | Number of trees in a random forest |
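The parameter/hyperparameter distinction shows up directly in scikit-learn code: `fit_intercept` is set before training (a hyperparameter), while `coef_` and `intercept_` are learned during `fit` (parameters). Toy data below:

```python
from sklearn.linear_model import LinearRegression

# Hyperparameter: configured before training begins
model = LinearRegression(fit_intercept=True)

# Parameters: learned from the data during fit
X = [[1], [2], [3], [4]]
y = [5, 7, 9, 11]          # exactly y = 2x + 3

model.fit(X, y)
print(model.coef_)         # learned weight  ≈ [2.]
print(model.intercept_)    # learned bias    ≈ 3.0
```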
Features and Labels in Practice
import pandas as pd

# Dataset: Predicting house price
houses = pd.DataFrame({
    "Size_sqft":   [800, 1200, 1500, 900, 2000, 1100, 1800, 2500],
    "Bedrooms":    [2, 3, 3, 2, 4, 3, 4, 5],
    "Age_years":   [10, 5, 8, 15, 3, 12, 6, 2],
    "Price_lakhs": [35, 55, 65, 40, 85, 48, 75, 105]   # ← Label
})
# Separate features (X) and label (y)
X = houses[["Size_sqft", "Bedrooms", "Age_years"]] # Input
y = houses["Price_lakhs"] # Output to predict
print("Features (X):\n", X.head())
print("\nLabel (y):\n", y.head())
Train-Test Split – The Golden Rule
A machine learning model must always be evaluated on data it has never seen during training. Mixing training and test data produces falsely optimistic results — the model appears accurate but fails on real-world data.
Diagram – Train-Test Split
Full Dataset (1000 rows):
┌─────────────────────────────────────────────────┐
│ Training Set (800 rows = 80%)                   │
│ ← Model learns from this                        │
├─────────────────────────────────────────────────┤
│ Test Set (200 rows = 20%)                       │
│ ← Model evaluated on this (never seen before)   │
└─────────────────────────────────────────────────┘
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% goes to test set
    random_state=42    # Ensures same split every run
)
print("Training rows:", len(X_train))
print("Testing rows :", len(X_test))
The Machine Learning Pipeline
Raw Data
│
▼
Clean & Preprocess
(handle nulls, encode categories, scale features)
│
▼
Split into Train / Test
│
▼
Choose an Algorithm
(Linear Regression, Decision Tree, Random Forest, etc.)
│
▼
Train the Model
model.fit(X_train, y_train)
│
▼
Evaluate on Test Set
model.score(X_test, y_test)
│
▼
Tune Hyperparameters
(GridSearchCV, cross-validation)
│
▼
Deploy / Use Model
model.predict(new_data)
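The steps above can be chained end-to-end with scikit-learn's `Pipeline` and `GridSearchCV`. The dataset here is synthetic and the alpha grid is an illustrative choice, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Synthetic data standing in for a cleaned dataset
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)

# Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Preprocessing + algorithm chained into one object
pipe = Pipeline([("scale", StandardScaler()), ("reg", Ridge())])

# Tune a hyperparameter with 5-fold cross-validation
search = GridSearchCV(pipe, {"reg__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                     # train (with tuning)

print(search.best_params_)                       # chosen hyperparameter
print(round(search.score(X_test, y_test), 3))    # evaluate on held-out data
```

Because preprocessing lives inside the pipeline, the scaler is fitted only on each training fold, which avoids leaking test information into training.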
Overfitting and Underfitting
Underfitting and overfitting are the two main ways a machine learning model can fail.
Underfitting
A model that is too simple fails to capture the patterns in the training data. It performs poorly on both training data and new data.
Overfitting
A model that is too complex memorises the training data — including its noise and random fluctuations. It performs perfectly on training data but fails on new data because it learned specific details instead of general patterns.
Diagram – Fitting Comparison
Data Points: ● ● ● ● ● ●

Underfitting:  a flat straight line that misses the curve in the data
Good Fit:      a smooth curve that follows the overall trend
Overfitting:   a wiggly curve that passes through every single point (too specific)

                     Underfitting   Good Fit   Overfitting
Training Accuracy:   Low            High       Very High
Test Accuracy:       Low            High       Low (fails on new data)
# Signs of overfitting:
# Training accuracy: 99%
# Test accuracy:     72%
# Gap is too large → model memorised training data

# Signs of underfitting:
# Training accuracy: 65%
# Test accuracy:     63%
# Both are low → model too simple
Solutions to Overfitting and Underfitting
| Problem | Cause | Solutions |
|---|---|---|
| Underfitting | Model too simple | Use more complex algorithm, add features, train longer |
| Overfitting | Model too complex | More data, regularisation, cross-validation, pruning, dropout |
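One way to watch overfitting happen is to compare a depth-limited decision tree against an unrestricted one on the same noisy data. This is a synthetic example, so the exact scores are illustrative, but the pattern (near-perfect training score, much lower test score) is the tell-tale gap:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy data: a deep tree memorises the noise; a limited tree generalises
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scores = {}
for depth in (4, None):   # None = grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train),   # training R²
                     tree.score(X_test, y_test))     # test R²
    print(depth, [round(s, 2) for s in scores[depth]])
```

The unrestricted tree scores essentially 1.0 on training data but noticeably worse on the test set; limiting `max_depth` is one of the regularisation tools from the table above.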
Introduction to Scikit-learn
Scikit-learn is the standard machine learning library in Python. It provides a consistent API for every algorithm — meaning the steps to train and predict are the same regardless of the algorithm chosen.
# Universal Scikit-learn Pattern:
# 1. Import the algorithm
from sklearn.linear_model import LinearRegression
# 2. Create the model object
model = LinearRegression()
# 3. Train the model
model.fit(X_train, y_train)
# 4. Make predictions
predictions = model.predict(X_test)
# 5. Evaluate (score() returns R² for regression, accuracy for classification)
score = model.score(X_test, y_test)
print(f"Model score: {score:.2f}")
First Complete Machine Learning Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
# Data
np.random.seed(42)
n = 100
X = pd.DataFrame({
    "StudyHours": np.random.uniform(1, 10, n)
})
y = X["StudyHours"] * 8 + np.random.normal(0, 5, n) + 35
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error : {mae:.2f} marks")
print(f"R² Score : {r2:.4f}")
# Predict for a new student
new_student = pd.DataFrame({"StudyHours": [6.5]})
predicted_score = model.predict(new_student)[0]
print(f"Predicted score for 6.5 hours study: {predicted_score:.1f}")
Output:
Mean Absolute Error : 4.83 marks
R² Score            : 0.8721
Predicted score for 6.5 hours study: 85.6
Choosing the Right Algorithm
| Problem Type | Output | Common Algorithms |
|---|---|---|
| Regression | Continuous number (price, temperature) | Linear Regression, Ridge, Random Forest |
| Classification | Category (spam/not spam, disease/healthy) | Logistic Regression, Decision Tree, SVM, KNN |
| Clustering | Groups (no label given) | K-Means, DBSCAN, Hierarchical |
| Dimensionality Reduction | Fewer features | PCA, t-SNE, UMAP |
| Recommendation | Suggested items | Collaborative Filtering, Matrix Factorisation |
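For contrast with the regression example above, here is a minimal classification sketch using Logistic Regression from the table. The data is invented (a single made-up "suspicious word count" feature), so treat it as a shape of the workflow, not a real spam filter:

```python
from sklearn.linear_model import LogisticRegression

# Classification: the output is a category, not a continuous number
X = [[1], [2], [3], [10], [11], [12]]   # toy feature: suspicious-word count
y = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[2], [11]]))   # → ['not spam' 'spam']
```

The fit/predict steps are identical to the regression example; only the algorithm and the type of label change.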
Summary
- Machine learning finds patterns in historical data and uses them to predict outcomes on new data
- Supervised learning uses labelled data; unsupervised learning discovers patterns in unlabelled data
- Features (X) are inputs; labels (y) are the target values the model learns to predict
- Always split data into training and test sets before building any model
- Underfitting means the model is too simple; overfitting means it memorised training data
- Scikit-learn provides a consistent fit/predict API across all machine learning algorithms
- Choosing the right algorithm depends on the problem type — regression, classification, or clustering
