Machine Learning: Logistic Regression
Logistic Regression is a classification algorithm that predicts which category an input belongs to. Despite having "regression" in its name, it does not predict numbers — it predicts probabilities and converts them into class labels. It is one of the most widely used algorithms for binary (two-class) classification problems.
The Difference from Linear Regression
Linear Regression:
  Input: Study Hours = 6 → Output: Exam Score = 78 (a number)

Logistic Regression:
  Input: Study Hours = 6 → Output: Pass or Fail (a category)

Linear Regression predicts a continuous value. Logistic Regression predicts a class label.
Why Not Use Linear Regression for Classification?
Linear Regression can predict values below 0 or above 1, which makes no sense for probabilities. A probability must always stay between 0 and 1. Logistic Regression uses a special function to guarantee this constraint.
Linear Regression output for classification:
  Probability of spam = 1.8  ← impossible
  Probability of spam = -0.3 ← impossible

Logistic Regression fixes this:
  Probability always between 0.0 and 1.0 ✓
The Sigmoid Function
Logistic Regression uses the Sigmoid function to convert any number — positive, negative, or very large — into a value between 0 and 1. This output is interpreted as a probability.
Sigmoid Function:
σ(z) = 1 / (1 + e^(-z))
Where z = m1×X1 + m2×X2 + ... + b (the same linear combination used in Linear Regression)
Visual shape of Sigmoid:

Probability
1.0 |                      ________
0.9 |                   __/
0.7 |                 _/
0.5 | - - - - - - - -/          ← Decision boundary (z = 0)
0.3 |              _/
0.1 |           __/
0.0 |_________/
    ─────────────────────────────►
      Negative z    0    Positive z
Key observations:
When z is very positive → output approaches 1.0
When z is very negative → output approaches 0.0
When z = 0 → output = 0.5 (decision boundary)
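The observations above can be checked directly with a minimal sketch of the Sigmoid function:

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1 / (1 + math.exp(-z))

print(sigmoid(10))   # very positive z → close to 1.0
print(sigmoid(-10))  # very negative z → close to 0.0
print(sigmoid(0))    # z = 0 → exactly 0.5 (decision boundary)
```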
Making a Prediction: Threshold
The Sigmoid function gives a probability. A threshold (usually 0.5) converts this probability into a class label.
Output Probability = 0.82 → 0.82 > 0.5 → Predict Class 1 (Spam / Yes / Pass)
Output Probability = 0.31 → 0.31 < 0.5 → Predict Class 0 (Not Spam / No / Fail)

The threshold of 0.5 can be changed depending on the problem:
  Medical diagnosis: lower threshold (catch more true positives)
  Marketing email: higher threshold (target only confident leads)
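Thresholding is a one-line rule. A small sketch (the function name `predict_label` is illustrative, not a library API):

```python
def predict_label(probability, threshold=0.5):
    """Convert a probability into a class label using a cutoff."""
    return 1 if probability >= threshold else 0

print(predict_label(0.82))                  # 1 — Class 1 (Spam / Yes / Pass)
print(predict_label(0.31))                  # 0 — Class 0 (Not Spam / No / Fail)
# A medical screen might lower the threshold to catch more positives:
print(predict_label(0.31, threshold=0.2))   # 1
```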
Complete Example: Loan Approval Prediction
Features:
X1 = Credit Score
X2 = Annual Income (in Lakhs)
X3 = Existing Debt (in Lakhs)
Learned Formula:
z = 0.005×CreditScore + 0.3×Income - 0.4×Debt - 2.0
Applicant:
Credit Score = 750
Income = 8L
Debt = 1L
z = 0.005×750 + 0.3×8 - 0.4×1 - 2.0
= 3.75 + 2.4 - 0.4 - 2.0
= 3.75
Sigmoid(3.75) = 1 / (1 + e^(-3.75)) = 0.977
Probability of Approval = 97.7% → Predict: APPROVED ✓
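The same calculation in code, using the coefficients from the worked example above:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Applicant from the example: coefficients as learned in the formula above
credit_score, income, debt = 750, 8, 1
z = 0.005 * credit_score + 0.3 * income - 0.4 * debt - 2.0
prob = sigmoid(z)

print(round(z, 2))       # 3.75
print(round(prob, 3))    # 0.977
print("APPROVED" if prob >= 0.5 else "REJECTED")  # APPROVED
```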
Binary vs Multi-Class Logistic Regression
Binary Classification (2 classes):
Output: Spam or Not Spam
Output: Fraud or Not Fraud
Output: Pass or Fail
Multi-Class Classification (3+ classes):
Output: Dog, Cat, or Rabbit
Output: Grade A, B, C, D, or F
For multi-class problems, Logistic Regression uses two strategies:
1. One-vs-Rest (OvR):
Train one classifier per class.
Each classifier answers: "Is this class X or not?"
Final prediction = class with highest probability.
2. Softmax (Multinomial Logistic Regression):
Outputs a probability for EVERY class simultaneously.
All probabilities sum to exactly 1.0.
Example: Image classification
Cat: 42%
Dog: 51%
Rabbit: 7%
Prediction → Dog ✓
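Softmax itself is short to write down. A minimal sketch, with hypothetical raw scores chosen so the output roughly matches the Cat/Dog/Rabbit example:

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for Cat, Dog, Rabbit
probs = softmax([2.0, 2.2, 0.3])
print([round(p, 2) for p in probs])  # roughly [0.42, 0.51, 0.08]
# The probabilities sum to 1 (up to floating-point rounding),
# and the predicted class is the one with the highest probability: Dog.
```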
How Logistic Regression Learns
Logistic Regression minimizes a cost function called Log Loss (also called Binary Cross Entropy). Unlike MSE, Log Loss applies heavy penalties when the model is confidently wrong.
Log Loss Behavior:
If Actual = 1 (Spam):
Predicted probability = 0.95 → Small loss (correct and confident)
Predicted probability = 0.50 → Medium loss (uncertain)
Predicted probability = 0.05 → Very large loss (confidently wrong)
If Actual = 0 (Not Spam):
Predicted probability = 0.05 → Small loss (correct and confident)
Predicted probability = 0.95 → Very large loss (confidently wrong)
The model learns by minimizing total Log Loss across all records.
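The penalty pattern described above falls straight out of the Log Loss formula. A minimal sketch for a single prediction:

```python
import math

def log_loss(actual, predicted):
    """Binary cross entropy for one record (actual is 0 or 1)."""
    return -(actual * math.log(predicted)
             + (1 - actual) * math.log(1 - predicted))

# Actual = 1 (Spam):
print(round(log_loss(1, 0.95), 3))  # small loss (correct and confident)
print(round(log_loss(1, 0.50), 3))  # medium loss (uncertain)
print(round(log_loss(1, 0.05), 3))  # very large loss (confidently wrong)
```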
Evaluation Metrics for Logistic Regression
Confusion Matrix (for binary classification):
                Predicted: Yes        Predicted: No
Actual: Yes │ True Positive (TP)  │ False Negative (FN) │
Actual: No  │ False Positive (FP) │ True Negative (TN)  │
Metrics derived from Confusion Matrix:
Accuracy = (TP + TN) / Total
→ Overall correctness
Precision = TP / (TP + FP)
→ Of all "Yes" predictions, how many were right?
Recall = TP / (TP + FN)
→ Of all actual "Yes" cases, how many did we catch?
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
→ Balance between Precision and Recall
Example (Spam Filter, 100 emails):
TP=45 (spam caught) FN=5 (spam missed)
FP=3 (ham flagged) TN=47 (ham passed)
Accuracy = (45+47)/100 = 92%
Precision = 45/(45+3) = 93.75%
Recall = 45/(45+5) = 90%
F1 = 2×(0.9375×0.90)/(0.9375+0.90) = 91.8%
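The four metrics can be computed directly from the confusion-matrix counts in the spam-filter example:

```python
# Spam-filter counts from the example above
tp, fn, fp, tn = 45, 5, 3, 47

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2%}")   # 92.00%
print(f"Precision: {precision:.2%}")  # 93.75%
print(f"Recall:    {recall:.2%}")     # 90.00%
print(f"F1 Score:  {f1:.2%}")         # 91.84%
```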
Advantages and Limitations
Advantages:
✓ Simple and fast to train
✓ Outputs probabilities (not just class labels)
✓ Easy to interpret — each feature has a clear coefficient
✓ Works well when classes are linearly separable
✓ Good baseline model before trying complex algorithms

Limitations:
✗ Assumes a linear decision boundary
✗ Struggles with complex, non-linear relationships
✗ Sensitive to outliers in the feature space
✗ Requires feature scaling for best performance
✗ Needs independent features (multicollinearity hurts it)
Logistic Regression Flow Diagram
Input Features (X1, X2, X3...)
│
▼
Linear Combination: z = m1×X1 + m2×X2 + b
│
▼
Sigmoid Function → Probability between 0 and 1
│
▼
Apply Threshold (default 0.5)
│
├── Probability ≥ 0.5 → Class 1 (Positive)
│
└── Probability < 0.5 → Class 0 (Negative)
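The whole flow — linear combination, Sigmoid, Log Loss, threshold — can be put together in a minimal end-to-end sketch. This trains one weight and one bias by gradient descent on a tiny made-up study-hours dataset (the data and learning rate are illustrative assumptions, not from the text):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy dataset: study hours → pass (1) / fail (0)
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]

m, b = 0.0, 0.0   # weight and bias, start at zero
lr = 0.1          # learning rate (assumed)

for _ in range(5000):
    grad_m = grad_b = 0.0
    for xi, yi in zip(X, y):
        # (prediction - actual) is the gradient of Log Loss w.r.t. z
        error = sigmoid(m * xi + b) - yi
        grad_m += error * xi
        grad_b += error
    m -= lr * grad_m / len(X)  # step downhill on average Log Loss
    b -= lr * grad_b / len(X)

# Probability of passing with 6 study hours, then apply the 0.5 threshold
prob = sigmoid(m * 6 + b)
print("Pass" if prob >= 0.5 else "Fail")  # Pass
```

In practice a library such as scikit-learn's `LogisticRegression` would handle the training loop, but the mechanics are the same as this sketch.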
