Kubernetes HPA and VPA — Auto-Scale Your Workloads the Right Way
Your application handles 100 requests per second during the day and 10,000 during flash sales. Running enough pods for peak traffic wastes money 95% of the time. Running too few means your app crashes when traffic spikes. Autoscaling solves this by matching your pod count and resource allocation to actual demand in real time.
Autoscaling in Kubernetes — The Big Picture
Kubernetes offers three levels of autoscaling:
| Level | Component | What It Scales |
|---|---|---|
| Pod horizontal | HPA | Number of pod replicas |
| Pod vertical | VPA | CPU and memory requests/limits per pod |
| Cluster | Cluster Autoscaler | Number of nodes in the cluster |
HPA and VPA handle pod-level scaling. The Cluster Autoscaler ensures there are enough nodes to schedule the pods that HPA creates.
HPA Basics — CPU-Based Scaling
The simplest HPA scales based on CPU utilization. Metrics Server must be installed for this to work.
# Prerequisite: install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify metrics-server is running
kubectl top pods -n kube-system
Basic HPA with kubectl
# Create an HPA that targets 50% CPU utilization
kubectl autoscale deployment web-app \
  --cpu-percent=50 \
  --min=2 \
  --max=20
# Check HPA status
kubectl get hpa
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
# web-app Deployment/web-app 35%/50% 2 20 3 5m
HPA v1 YAML
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 50
The HPA checks metrics every 15 seconds (default) and adjusts replicas using this formula:
desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))
If current CPU is 80% and target is 50%, with 3 replicas: ceil(3 * 80/50) = ceil(4.8) = 5 replicas.
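The arithmetic above can be sketched in a few lines. This is a simplified model, not the controller's actual code: the real HPA also accounts for pod readiness and missing metrics, and it skips scaling entirely when the current/target ratio is within a small tolerance (10% by default) to avoid churn.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Simplified sketch of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band around 1.0, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(3, 80, 50))  # 5, matching the worked example above
```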
HPA v2 — Memory, Multiple Metrics, and Custom Metrics
HPA v2 (autoscaling/v2) unlocks memory scaling, multiple metrics, and custom metric sources.
CPU and Memory Together
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
When multiple metrics are specified, the HPA computes a desired replica count for each metric independently and uses the largest result.
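That selection logic can be sketched directly (a simplified model; metric values are expressed as averaged percentages against their utilization targets):

```python
import math

def desired_for_metric(current_replicas, current, target):
    # Per-metric application of the HPA formula.
    return math.ceil(current_replicas * current / target)

def desired_replicas_multi(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    The HPA picks the largest per-metric replica count."""
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# 4 replicas, CPU at 90% (target 60) and memory at 65% (target 70):
# CPU alone wants 6 replicas, memory alone wants 4 -> HPA scales to 6.
print(desired_replicas_multi(4, [(90, 60), (65, 70)]))  # 6
```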
Custom Metrics (Requests Per Second)
Scale based on application-specific metrics like requests per second, queue depth, or active connections. This requires a custom metrics adapter like the Prometheus Adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  # Scale on CPU as a baseline
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  # Scale on requests per second per pod
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # Scale up when > 100 RPS per pod
  # Scale on an external metric (e.g., SQS queue depth)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: "50"  # Scale up when queue > 50 messages
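For an `AverageValue` target like the RPS metric above, the HPA effectively divides the metric's total across all pods by the per-pod target. A minimal sketch of that calculation (simplified; the real controller also handles missing and not-yet-ready pods):

```python
import math

def desired_from_average_value(metric_values, target_per_pod):
    """AverageValue targets: total metric divided by the per-pod target."""
    return math.ceil(sum(metric_values) / target_per_pod)

# Three pods serving 150, 140, and 160 RPS against a 100 RPS/pod target:
# 450 total RPS / 100 per pod -> 5 replicas.
print(desired_from_average_value([150, 140, 160], 100))  # 5
```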
Scaling Behavior — Fine-Tuning Scale-Up and Scale-Down
HPA v2 lets you control how aggressively the autoscaler scales in each direction.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 60s before scaling up again
      policies:
      - type: Percent
        value: 100                      # Double the pods at most
        periodSeconds: 60
      - type: Pods
        value: 10                       # Or add at most 10 pods
        periodSeconds: 60
      selectPolicy: Max                 # Use whichever policy allows more pods
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove at most 10% of pods
        periodSeconds: 60
      selectPolicy: Min                 # Use whichever policy removes fewer pods
The stabilization window prevents flapping. Without it, the HPA might scale up during a spike and immediately scale down when the spike passes, only to scale up again when the remaining pods get overwhelmed.
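Concretely, for scale-down the HPA acts on the *highest* replica recommendation seen during the stabilization window, so a brief dip in load cannot immediately remove pods. A minimal sketch of that behavior:

```python
def stabilized_scale_down(recommendation_history):
    """Scale-down stabilization (simplified): the controller uses the
    highest desired-replica recommendation from the window, so capacity
    only drops once every recommendation in the window agrees it can."""
    return max(recommendation_history)

# Recommendations over the last 5 minutes; load dipped briefly to 4 pods,
# but the window still contains an 8, so the HPA keeps 8 replicas.
print(stabilized_scale_down([8, 7, 4, 6, 8]))  # 8
```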
# Watch HPA decisions in real time
kubectl describe hpa web-app-hpa -n production
# Events show scaling decisions
# Events:
# Normal SuccessfulRescale 2m horizontal-pod-autoscaler New size: 8; reason: cpu resource utilization above target
# Normal SuccessfulRescale 30s horizontal-pod-autoscaler New size: 6; reason: All metrics below target
Custom Metrics with Prometheus Adapter
To scale on application metrics, you need to expose them to the Kubernetes metrics API via the Prometheus Adapter.
# Install Prometheus Adapter with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=80
Configure the adapter to expose your custom metric:
# prometheus-adapter-config.yaml (Helm values file; custom rules nest under rules.custom)
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# Verify the custom metric is available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
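The `matches`/`as` pair in the rule above is ordinary regex substitution: the capture group from `matches` is spliced into `as`, turning the cumulative counter name into a rate name. A quick illustration of the same rewrite (illustrative only, not the adapter's code):

```python
import re

def rename_metric(series_name):
    # Mirrors the rule: matches "^(.*)_total$", as "${1}_per_second"
    return re.sub(r"^(.*)_total$", r"\1_per_second", series_name)

print(rename_metric("http_requests_total"))  # http_requests_per_second
```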
Vertical Pod Autoscaler (VPA)
While HPA scales horizontally (more pods), VPA scales vertically (bigger pods). VPA adjusts CPU and memory requests based on observed usage.
Installing VPA
# Clone and install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Verify VPA components
kubectl get pods -n kube-system | grep vpa
# vpa-admission-controller-xxx 1/1 Running
# vpa-recommender-xxx 1/1 Running
# vpa-updater-xxx 1/1 Running
VPA Modes
| Mode | Behavior | Disruption | Use Case |
|---|---|---|---|
| Off | Recommendations only, no changes | None | Observe first, get baseline data |
| Initial | Sets resources at pod creation only | None for running pods | Avoid live disruptions |
| Auto | Evicts and recreates pods with new resources | Pod restarts | Full automation |
VPA Configuration
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Start with Off
  resourcePolicy:
    containerPolicies:
    - containerName: api
      controlledResources: ["cpu", "memory"]
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledValues: RequestsAndLimits
# Check recommendations after a few hours/days
kubectl get vpa api-vpa -n production -o jsonpath='{.status.recommendation}' | jq .
# {
# "containerRecommendations": [{
# "containerName": "api",
# "lowerBound": {"cpu": "120m", "memory": "200Mi"},
# "target": {"cpu": "250m", "memory": "384Mi"},
# "upperBound": {"cpu": "800m", "memory": "1.2Gi"},
# "uncappedTarget": {"cpu": "250m", "memory": "384Mi"}
# }]
# }
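The difference between `target` and `uncappedTarget` is the resource policy: `target` is the raw recommendation clamped into the `minAllowed`/`maxAllowed` range. A small sketch of that clamping, using plain numbers to stand in for quantities like CPU millicores:

```python
def capped_target(uncapped, min_allowed, max_allowed):
    """Sketch of VPA recommendation capping: `target` is `uncappedTarget`
    bounded by the container policy's minAllowed and maxAllowed."""
    return min(max(uncapped, min_allowed), max_allowed)

# Recommender wants 250m CPU; policy allows 100m..4000m -> unchanged.
print(capped_target(250, 100, 4000))  # 250
# Recommender wants 50m -> clamped up to the 100m floor.
print(capped_target(50, 100, 4000))   # 100
```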
HPA vs VPA — When to Use Which
| Criteria | HPA | VPA |
|---|---|---|
| What it scales | Number of replicas | Resource requests/limits |
| Best for | Stateless, horizontally scalable apps | Single-instance or hard-to-scale apps |
| Disruption | No pod restarts | Pod eviction in Auto mode |
| Can use together? | Yes, but not on the same metric | Use VPA for memory, HPA for CPU |
| Response time | Fast (seconds) | Slow (requires pod restart) |
Important: Do not use HPA and VPA on the same metric (e.g., both scaling on CPU). They will fight each other. A common pattern is HPA for CPU scaling and VPA for memory right-sizing.
KEDA — Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA to scale based on event sources like message queues, databases, and external APIs.
# Install KEDA with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0   # Scale to zero when idle
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
      queueLength: "5"  # Scale up when > 5 messages per pod
      awsRegion: us-east-1
      identityOwner: operator
KEDA supports 60+ scalers including Kafka, RabbitMQ, Redis, PostgreSQL, Prometheus, Cron, and HTTP.
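For a queue trigger like the one above, the replica count KEDA drives toward is roughly the queue depth divided by the per-replica `queueLength` target, clamped to the min/max bounds. A simplified sketch (KEDA's actual behavior involves the HPA it generates plus activation thresholds):

```python
import math

def keda_desired_replicas(queue_length, target_per_replica,
                          min_replicas=0, max_replicas=50):
    """Queue-based scaling sketch: messages / per-replica target,
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(desired, max_replicas))

print(keda_desired_replicas(23, 5))  # 5 pods for 23 queued messages
print(keda_desired_replicas(0, 5))   # 0 -> scale to zero when idle
```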
Cluster Autoscaler Integration
HPA adds pods, but if there are no nodes to schedule them, the pods stay Pending. The Cluster Autoscaler detects unschedulable pods and provisions new nodes from your cloud provider.
# Check for pods stuck in Pending (waiting for nodes)
kubectl get pods --field-selector=status.phase=Pending
# Check Cluster Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
The full autoscaling chain works like this:
- Traffic increases, CPU rises above HPA target
- HPA creates more pods
- New pods are Pending (no node capacity)
- Cluster Autoscaler adds a new node
- Scheduler places pods on the new node
- Traffic decreases, HPA scales down pods
- Node becomes underutilized, Cluster Autoscaler removes it
# Monitor the full autoscaling chain
kubectl get hpa -w &
kubectl get pods -w &
kubectl get nodes -w &
Next, we will dive into Kubernetes monitoring with Prometheus and Grafana — collecting metrics, building dashboards, and setting up alerts to catch problems before your users do.
