Kubernetes Horizontal Pod Autoscaler: Scaling Automatically
Manually scaling your Deployments works fine when you know a traffic spike is coming. Most of the time, spikes are unpredictable. The Horizontal Pod Autoscaler (HPA) watches CPU and memory usage on your Pods and adds or removes Pod replicas automatically to match the current demand — no human action required.
How HPA Works
HPA checks a target metric (usually CPU utilization) every 15 seconds by default. When average CPU across all Pods exceeds your target, it adds more Pods. When demand drops and Pods are underused, it removes them, down to your defined minimum. It is a feedback loop: measure, compare against the target, act, repeat.
Normal traffic: 3 Pods, each at 30% CPU
↓
Traffic spike: 3 Pods, each at 85% CPU (target = 70%)
↓
HPA adds 1 more Pod: ceil(3 × 85 / 70) = 4
↓
4 Pods, each at ~64% CPU, load distributed
↓
Traffic drops: 4 Pods, each at 15% CPU
↓
HPA removes 1 Pod after the stabilization window (minimum = 3 in this example)
↓
3 Pods, each at ~20% CPU, back to normal
Prerequisites: Metrics Server
HPA needs actual CPU and memory metrics to work. These come from the Metrics Server, a lightweight component that collects resource usage data from each node. Install it on Minikube:
minikube addons enable metrics-server
On cloud clusters, install it with:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Verify it is running:
kubectl top pods
kubectl top nodes
If kubectl top returns data, Metrics Server is working.
Creating an HPA with kubectl
kubectl autoscale deployment my-app \
  --cpu-percent=70 \
  --min=2 \
  --max=10
This creates an HPA that targets 70% average CPU utilization and scales the Deployment between 2 and 10 Pod replicas.
Creating an HPA with YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
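This HPA scales on both CPU and memory. It computes a desired replica count for each metric separately and uses the larger of the two, so it scales up when either metric exceeds its target and scales down only when both allow it.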
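To make the either-metric behavior concrete, here is a minimal Python sketch. It assumes each metric is evaluated with the ceil(replicas × current / target) formula covered in the next section and that the largest proposal wins. The function names and the sample readings are hypothetical, chosen only for illustration.

```python
import math


def desired_for_metric(current_replicas, current, target):
    """Replica count that a single metric asks for."""
    return math.ceil(current_replicas * current / target)


def desired_replicas(current_replicas, metrics):
    """Evaluate every metric on its own and keep the largest proposal.

    metrics maps a metric name to (current utilization %, target utilization %).
    """
    return max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    )


# Hypothetical readings: CPU is below its target, memory is above it.
readings = {"cpu": (50.0, 70.0), "memory": (95.0, 80.0)}
print(desired_replicas(4, readings))  # 5: memory asks for 5, CPU only for 3
```

In this sketch CPU alone would allow a scale-down to 3 Pods, but memory still needs 5, so the result is 5.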
HPA Scaling Calculation
Kubernetes calculates the required replica count with this formula:
desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)

Example:
currentReplicas = 3
currentCPU = 85%
targetCPU = 70%
desiredReplicas = ceil(3 × 85 / 70) = ceil(3.64) = 4
HPA adds 1 Pod to bring average CPU back toward the 70% target.
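The same calculation can be sketched in Python. This is an illustration rather than the controller's actual code: the function name is hypothetical, the min/max bounds reuse the 2 and 10 from the earlier example, and the 0.1 tolerance reflects the controller's default setting, under which small deviations from the target do not trigger scaling.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Sketch of the HPA replica formula with tolerance and min/max clamping."""
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 85, 70))   # 4  (the worked example above)
print(desired_replicas(3, 72, 70))   # 3  (within tolerance, no change)
print(desired_replicas(3, 300, 70))  # 10 (capped at max_replicas)
print(desired_replicas(3, 10, 70))   # 2  (held at min_replicas)
```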
Watching HPA in Action
kubectl get hpa
kubectl get hpa my-app-hpa --watch
kubectl describe hpa my-app-hpa
Output of kubectl get hpa:
NAME         REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS
my-app-hpa   Deployment/my-app   68%/70%   2         10        4
The TARGETS column shows current metric / target metric. REPLICAS shows the current number of running Pods.
Resource Requests Are Mandatory for HPA
HPA calculates CPU utilization as a percentage of the CPU request set on the container — not the node's total CPU. If you do not set CPU requests on your containers, HPA has no baseline to calculate against and will not scale. Always set resource requests on containers that you plan to autoscale.
spec:
  containers:
    - name: app
      image: my-app:v1
      resources:
        requests:
          cpu: "200m"      # HPA needs this to calculate utilization %
          memory: "256Mi"
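As a rough illustration of why the request matters, the sketch below computes utilization as total usage divided by total requested CPU across the Pods. The function name and the per-Pod usage figures are hypothetical; the 200m request matches the manifest above.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Average CPU utilization across Pods, as a percentage of their requests."""
    if len(usage_millicores) != len(request_millicores):
        raise ValueError("need one usage value and one request per Pod")
    total_request = sum(request_millicores)
    if total_request == 0:
        # Without requests there is no baseline, so utilization is undefined.
        raise ValueError("no CPU requests set; utilization cannot be computed")
    return 100 * sum(usage_millicores) / total_request


usage = [170, 150, 190]     # hypothetical per-Pod CPU usage in millicores
requests = [200, 200, 200]  # each container requests 200m
print(cpu_utilization_percent(usage, requests))  # 85.0
```

The same 510m of total usage would read as 42.5% if each container requested 400m, which is why the request value directly determines when scaling triggers.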
Scale-Down Stabilization Window
HPA does not scale down immediately when load drops. It waits for a stabilization window (default 5 minutes) to prevent flapping — rapidly scaling up and down as load oscillates around the threshold. You can configure this window:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # 5 minutes (default)
    policies:
      - type: Percent
        value: 25
        periodSeconds: 60             # Remove at most 25% of pods per minute
  scaleUp:
    stabilizationWindowSeconds: 0     # Scale up immediately
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15             # Double pods every 15 seconds if needed
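The following Python sketch shows one way to think about the scale-down settings. It assumes the controller keeps recent replica recommendations, uses the highest one inside the stabilization window, and then applies the Percent policy as a cap on how many Pods may be removed in one period. The function names, the sample history, and the rounding of the percentage are assumptions made for illustration.

```python
def stabilized_recommendation(history, now, window_seconds=300):
    """Pick the highest recommendation made within the stabilization window.

    history is a list of (timestamp_seconds, recommended_replicas) pairs.
    """
    recent = [replicas for timestamp, replicas in history
              if now - timestamp <= window_seconds]
    return max(recent)


def limit_scale_down(current_replicas, desired_replicas, percent=25):
    """Remove at most `percent` of the current replicas in one period."""
    max_removed = current_replicas * percent // 100
    return max(desired_replicas, current_replicas - max_removed)


# Hypothetical recommendations recorded as load falls.
history = [(0, 10), (120, 6), (240, 4), (330, 3)]

print(stabilized_recommendation(history[:3], now=240))  # 10: peak still in window
print(stabilized_recommendation(history, now=330))      # 6: the 10 has aged out
print(limit_scale_down(10, 6))                          # 8: at most 2 Pods removed
```

A brief dip in load therefore changes nothing until the earlier, higher recommendations fall outside the window, and even then the replica count steps down gradually.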
Custom and External Metrics
HPA is not limited to CPU and memory. It can scale on almost any metric, such as requests per second (from Prometheus), queue depth (from a message queue), or active user sessions. To expose these metrics to HPA, you need an additional component such as Prometheus Adapter or KEDA (Kubernetes Event-Driven Autoscaling).
Scale a consumer Deployment based on Kafka queue depth:
- Queue has 1000 messages → scale to 10 consumers
- Queue drains to 50 messages → scale back to 2 consumers
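The numbers in this example can be reproduced with a small Python sketch. It assumes a target of 100 messages per consumer, a minimum of 2 replicas, and a maximum of 10; none of these values are stated in the example, and the function name is hypothetical.

```python
import math


def consumers_for_queue(queue_depth, target_per_consumer=100,
                        min_replicas=2, max_replicas=10):
    """Replica count for an average-value target on total queue depth."""
    desired = math.ceil(queue_depth / target_per_consumer)
    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


print(consumers_for_queue(1000))  # 10
print(consumers_for_queue(50))    # 2 (one consumer would suffice, minimum is 2)
```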
HPA vs. VPA vs. Cluster Autoscaler
| Tool | What It Scales | When to Use |
|---|---|---|
| HPA | Pod replicas (horizontal) | More copies of your app |
| VPA (Vertical Pod Autoscaler) | CPU/memory per Pod (vertical) | Resize individual Pod resources |
| Cluster Autoscaler | Worker nodes | Add/remove VMs from the cluster |
In production, HPA and Cluster Autoscaler work together. HPA adds more Pods; Cluster Autoscaler adds more nodes when the new Pods cannot be scheduled because the existing nodes lack capacity.
Key Points
- HPA automatically scales your Deployment between a minimum and maximum replica count based on CPU or memory usage.
- Metrics Server must be installed — HPA has no data without it.
- Set CPU requests on your containers, or HPA cannot calculate utilization percentages.
- The scale-down stabilization window prevents flapping when load oscillates.
- Combine HPA with Cluster Autoscaler in production — HPA scales Pods, Cluster Autoscaler scales nodes.
