Kubernetes Horizontal Pod Autoscaler: Scaling Automatically

Manually scaling your Deployments works fine when you know a traffic spike is coming, but most spikes are unpredictable. The Horizontal Pod Autoscaler (HPA) watches CPU and memory usage on your Pods and adds or removes Pod replicas automatically to match current demand, with no human action required.

How HPA Works

HPA checks a target metric (usually CPU utilization) every 15 seconds by default. When average CPU across all Pods rises above your target, it adds Pods. When demand drops and Pods are underused, it removes them, down to your defined minimum. It is a feedback loop: measure, compare against the target, act, repeat.

Normal traffic: 3 Pods, each at 30% CPU
                ↓
Traffic spike: 3 Pods, each at 85% CPU (threshold = 70%)
                ↓
HPA adds 1 more Pod
                ↓
4 Pods, each at ~64% CPU: load distributed
                ↓
Traffic drops: 4 Pods, each at 15% CPU
                ↓
HPA removes 1 Pod after the stabilization window (minimum = 3)
                ↓
3 Pods, each at ~20% CPU: back to normal

Prerequisites: Metrics Server

HPA needs actual CPU and memory metrics to work. These come from the Metrics Server, a lightweight component that collects resource usage data from each node. Install it on Minikube:

minikube addons enable metrics-server

Some managed clusters ship with Metrics Server already installed. Where it is missing, install it with:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify it is running:

kubectl top pods
kubectl top nodes

If kubectl top returns data, Metrics Server is working.

Creating an HPA with kubectl

kubectl autoscale deployment my-app \
  --cpu-percent=70 \
  --min=2 \
  --max=10

This creates an HPA that targets 70% average CPU utilization by scaling between 2 and 10 Pod replicas. Unless you pass --name, the HPA takes the Deployment's name (my-app).

Creating an HPA with YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

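This HPA scales on both CPU and memory. HPA computes a desired replica count for each metric and uses the largest, so it scales up when either metric exceeds its target and scales down only when both metrics allow it.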
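As a rough illustration of how two metrics combine, the sketch below computes a replica count per metric and keeps the largest. It assumes the per-metric count follows the ceil(currentReplicas × currentMetric / targetMetric) rule covered in the scaling calculation section. The function names and the sample utilization figures are hypothetical, and min/max clamping is left out for brevity.

```python
import math

def replicas_for_metric(current_replicas, current, target):
    # Per-metric proposal: ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current_replicas * current / target)

def desired_replicas(current_replicas, metrics):
    # metrics maps a metric name to a (current, target) pair of percentages.
    # The largest per-metric proposal wins.
    return max(
        replicas_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    )

# 4 Pods: CPU is below its 70% target, memory is above its 80% target.
sample = {"cpu": (50, 70), "memory": (95, 80)}
print(desired_replicas(4, sample))  # 5, driven by memory
```

CPU alone would suggest 3 replicas here, but memory asks for 5, so the Deployment grows even though CPU is under its target.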

HPA Scaling Calculation

Kubernetes calculates the required replica count with this formula:

desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)

Example:
currentReplicas = 3
currentCPU = 85%
targetCPU = 70%
desiredReplicas = ceil(3 × 85 / 70) = ceil(3.64) = 4

HPA adds 1 Pod to bring average CPU back toward the 70% target.
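The calculation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the controller's actual code: the function name and parameters are made up for this example, the clamping mirrors the minReplicas and maxReplicas settings, and the 0.1 tolerance is assumed to match the controller's default of leaving the replica count alone when the metric is within 10% of the target.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    ratio = current_metric / target_metric
    # Assumed default behavior: ignore small deviations from the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(3, 85, 70))  # 4, the worked example above
print(desired_replicas(4, 68, 70))  # 4, within tolerance so no change
```

The tolerance check is why a reading such as 68% against a 70% target does not trigger any scaling.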

Watching HPA in Action

kubectl get hpa
kubectl get hpa my-app-hpa --watch
kubectl describe hpa my-app-hpa

Output of kubectl get hpa:

NAME         REFERENCE          TARGETS        MINPODS   MAXPODS   REPLICAS
my-app-hpa   Deployment/my-app  68%/70%        2         10        4

The TARGETS column shows current metric / target metric. REPLICAS shows the current number of running Pods.

Resource Requests Are Mandatory for HPA

HPA calculates CPU utilization as a percentage of the CPU request set on the container, not of the node's total CPU. If you do not set CPU requests on your containers, HPA has no baseline to calculate against and will not scale; kubectl get hpa then shows <unknown> in the TARGETS column. The same applies to memory requests when you scale on memory utilization. Always set resource requests on containers that you plan to autoscale.

spec:                     # Pod spec (spec.template.spec in a Deployment)
  containers:
  - name: app
    image: my-app:v1
    resources:
      requests:
        cpu: "200m"       # HPA needs this to calculate utilization %
        memory: "256Mi"
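To make the dependence on requests concrete, here is a minimal Python sketch of the utilization percentage. It assumes utilization is total usage divided by total requests across the Pods. The function name and the usage figures are hypothetical, and the 200m request matches the manifest above.

```python
def average_utilization(usage_millicores, request_millicores):
    """Return CPU usage as a percentage of requested CPU across Pods."""
    # Without a request there is no baseline, so no percentage can be computed.
    if any(r is None or r <= 0 for r in request_millicores):
        raise ValueError("every container needs a CPU request")
    return 100 * sum(usage_millicores) / sum(request_millicores)

# Three Pods, each requesting 200m, using 170m, 160m and 180m.
print(average_utilization([170, 160, 180], [200, 200, 200]))  # 85.0
```

The same 510m of usage against 500m requests would read as 34%, so the request you choose directly shifts when HPA scales.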

Scale-Down Stabilization Window

HPA does not scale down immediately when load drops. It waits for a stabilization window (5 minutes by default) to prevent flapping, that is, rapidly scaling up and down as load oscillates around the threshold. You can configure this window under spec.behavior in the HPA manifest:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # 5 minutes (default)
    policies:
    - type: Percent
      value: 25
      periodSeconds: 60   # Remove at most 25% of pods per minute
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15   # Double pods every 15 seconds if needed
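One way to picture the scale-down stabilization window: the autoscaler remembers its recent recommendations and acts on the highest one still inside the window. The Python sketch below models that idea only. The class name, method name and timestamps are hypothetical, and rate-limiting policies are ignored.

```python
from collections import deque

class ScaleDownStabilizer:
    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended replicas)

    def recommend(self, now, desired):
        """Record a recommendation and return the highest one in the window."""
        self._history.append((now, desired))
        cutoff = now - self.window_seconds
        # Drop recommendations that are older than the window.
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)

stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(0, 5))    # 5
print(stabilizer.recommend(60, 2))   # 5, the earlier peak is still in the window
print(stabilizer.recommend(301, 2))  # 2, the peak has aged out
```

A brief dip in load therefore does not remove Pods; the lower count has to hold for the whole window.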

Custom and External Metrics

HPA is not limited to CPU and memory. With a metrics adapter such as Prometheus Adapter, or with KEDA (Kubernetes Event-Driven Autoscaling), you can scale on other metrics: requests per second (from Prometheus), queue depth (from a message queue), or active user sessions.

For example, a consumer Deployment can scale on Kafka queue depth (consumer lag):
- Queue has 1000 messages → scale to 10 consumers
- Queue drains to 50 messages → scale back to 2 consumers
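The numbers in this example are consistent with an average-value target of 100 messages per consumer and bounds of 2 and 10 replicas. Those three values are assumptions inferred from the example, not settings stated in the lesson, and the function name is hypothetical. Under them, the replica count could be sketched as:

```python
import math

def consumers_for_lag(total_lag, target_lag_per_consumer=100,
                      min_replicas=2, max_replicas=10):
    # With a per-Pod average target, the proposal is total / target, rounded up.
    desired = math.ceil(total_lag / target_lag_per_consumer)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))

print(consumers_for_lag(1000))  # 10
print(consumers_for_lag(50))    # 2, held at the assumed minimum
```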

HPA vs. VPA vs. Cluster Autoscaler

Tool                            What It Scales                  When to Use
HPA                             Pod replicas (horizontal)       More copies of your app
VPA (Vertical Pod Autoscaler)   CPU/memory per Pod (vertical)   Resize individual Pod resources
Cluster Autoscaler              Worker nodes                    Add/remove VMs from the cluster

In production, HPA and Cluster Autoscaler work together. HPA adds more Pods; Cluster Autoscaler adds more nodes when there is not enough space on existing ones.

Key Points

- HPA automatically scales your Deployment between a minimum and maximum replica count based on CPU or memory usage.
- Metrics Server must be installed; HPA has no data without it.
- Set CPU requests on your containers, or HPA cannot calculate utilization percentages.
- The scale-down stabilization window prevents flapping when load oscillates.
- Combine HPA with Cluster Autoscaler in production: HPA scales Pods, Cluster Autoscaler scales nodes.
