Kubernetes HPA and VPA — Auto-Scale Your Workloads the Right Way

8 min read
Goel Academy
DevOps & Cloud Learning Hub

Your application handles 100 requests per second during the day and 10,000 during flash sales. Running enough pods for peak traffic wastes money 95% of the time. Running too few means your app crashes when traffic spikes. Autoscaling solves this by matching your pod count and resource allocation to actual demand in real time.

Autoscaling in Kubernetes — The Big Picture

Kubernetes offers three levels of autoscaling:

Level          | Component          | What It Scales
Pod horizontal | HPA                | Number of pod replicas
Pod vertical   | VPA                | CPU and memory requests/limits per pod
Cluster        | Cluster Autoscaler | Number of nodes in the cluster

HPA and VPA handle pod-level scaling. The Cluster Autoscaler ensures there are enough nodes to schedule the pods that HPA creates.

HPA Basics — CPU-Based Scaling

The simplest HPA scales based on CPU utilization. Metrics Server must be installed for this to work.

# Prerequisite: install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics-server is running
kubectl top pods -n kube-system

Basic HPA with kubectl

# Create an HPA that targets 50% CPU utilization
kubectl autoscale deployment web-app \
  --cpu-percent=50 \
  --min=2 \
  --max=20

# Check HPA status
kubectl get hpa
# NAME      REFERENCE            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# web-app   Deployment/web-app   35%/50%   2         20        3          5m

HPA v1 YAML

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 50

The HPA checks metrics every 15 seconds (default) and adjusts replicas using this formula:

desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))

If current CPU is 80% and target is 50%, with 3 replicas: ceil(3 * 80/50) = ceil(4.8) = 5 replicas.
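The formula is easy to reproduce. The sketch below (in Python, with a hypothetical `desired_replicas` helper) reimplements it, including the clamping to min/max replicas that the HPA applies after the calculation:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_r: int = 1, max_r: int = 1000) -> int:
    """Reproduce the HPA replica formula, then clamp to the min/max bounds."""
    desired = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_r, min(max_r, desired))

# The worked example from the text: 3 replicas at 80% CPU with a 50% target
print(desired_replicas(3, 80, 50))  # 5
```

Note that if the result exceeds `maxReplicas`, the HPA simply pins the replica count at the maximum; it never scales beyond the configured bounds.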

HPA v2 — Memory, Multiple Metrics, and Custom Metrics

HPA v2 (autoscaling/v2) unlocks memory scaling, multiple metrics, and custom metric sources.

CPU and Memory Together

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

When multiple metrics are specified, HPA calculates the desired replicas for each metric and picks the highest value.
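That "take the highest" rule can be sketched in a few lines of Python (the `desired_for_metrics` helper is hypothetical, applying the standard HPA formula per metric):

```python
import math

def desired_for_metrics(current_replicas: int, metrics: list) -> int:
    """metrics: list of (current, target) pairs.
    HPA computes a replica count per metric and acts on the maximum,
    so the busiest dimension always wins."""
    return max(math.ceil(current_replicas * (c / t)) for c, t in metrics)

# 4 replicas: CPU at 90% vs 60% target, memory at 50% vs 70% target
print(desired_for_metrics(4, [(90, 60), (50, 70)]))  # 6 (CPU dominates)
```

This is why adding a second metric can only make the HPA more aggressive, never less: a metric that is comfortably below target is simply outvoted.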

Custom Metrics (Requests Per Second)

Scale based on application-specific metrics like requests per second, queue depth, or active connections. This requires a custom metrics adapter like the Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
    # Scale on CPU as a baseline
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Scale on requests per second per pod
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"  # Scale up when > 100 RPS per pod
    # Scale on an external metric (e.g., SQS queue depth)
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: orders
        target:
          type: Value
          value: "50"  # Scale up when queue > 50 messages

Scaling Behavior — Fine-Tuning Scale-Up and Scale-Down

HPA v2 lets you control how aggressively the autoscaler scales in each direction.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Wait 60s before scaling up again
      policies:
        - type: Percent
          value: 100                   # Double the pods at most
          periodSeconds: 60
        - type: Pods
          value: 10                    # Or add at most 10 pods
          periodSeconds: 60
      selectPolicy: Max                # Use whichever policy allows more pods
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                    # Remove at most 10% of pods
          periodSeconds: 60
      selectPolicy: Min                # Use whichever policy removes fewer pods

The stabilization window prevents flapping. Without it, the HPA might scale up during a spike and immediately scale down when the spike passes, only to scale up again when the remaining pods get overwhelmed.
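Mechanically, the scale-down stabilization window works by remembering recent recommendations and acting on the highest one, so a momentary dip never removes pods. A toy Python sketch of that rule (function name is illustrative, not from the HPA source):

```python
def stabilized_scale_down(window_recommendations: list) -> int:
    """During scale-down, HPA acts on the HIGHEST replica recommendation
    seen inside the stabilization window. A brief dip in load produces a
    low recommendation, but the recent high keeps the pods around."""
    return max(window_recommendations)

# Recommendations over the last 5 minutes: spike ended, load still wobbling
print(stabilized_scale_down([8, 5, 4, 6]))  # 8 -> no pods removed yet
```

Only once every recommendation in the window agrees that fewer pods are needed does the replica count actually drop.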

# Watch HPA decisions in real time
kubectl describe hpa web-app-hpa -n production

# Events show scaling decisions
# Events:
#   Normal  SuccessfulRescale  2m   horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization above target
#   Normal  SuccessfulRescale  30s  horizontal-pod-autoscaler  New size: 6; reason: All metrics below target

Custom Metrics with Prometheus Adapter

To scale on application metrics, you need to expose them to the Kubernetes metrics API via the Prometheus Adapter.

# Install Prometheus Adapter with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=80

Configure the adapter to expose your custom metric:

# prometheus-adapter-config.yaml (values file)
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'

# Verify the custom metric is available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

Vertical Pod Autoscaler (VPA)

While HPA scales horizontally (more pods), VPA scales vertically (bigger pods). VPA adjusts CPU and memory requests based on observed usage.

Installing VPA

# Clone and install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Verify VPA components
kubectl get pods -n kube-system | grep vpa
# vpa-admission-controller-xxx 1/1 Running
# vpa-recommender-xxx 1/1 Running
# vpa-updater-xxx 1/1 Running

VPA Modes

Mode    | Behavior                                     | Disruption            | Use Case
Off     | Recommendations only, no changes             | None                  | Observe first, get baseline data
Initial | Sets resources at pod creation only          | None for running pods | Avoid live disruptions
Auto    | Evicts and recreates pods with new resources | Pod restarts          | Full automation

VPA Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Start with Off
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledValues: RequestsAndLimits

# Check recommendations after a few hours/days
kubectl get vpa api-vpa -n production -o jsonpath='{.status.recommendation}' | jq .
# {
#   "containerRecommendations": [{
#     "containerName": "api",
#     "lowerBound": {"cpu": "120m", "memory": "200Mi"},
#     "target": {"cpu": "250m", "memory": "384Mi"},
#     "upperBound": {"cpu": "800m", "memory": "1.2Gi"},
#     "uncappedTarget": {"cpu": "250m", "memory": "384Mi"}
#   }]
# }

HPA vs VPA — When to Use Which

Criteria          | HPA                                  | VPA
What it scales    | Number of replicas                   | Resource requests/limits
Best for          | Stateless, horizontally scalable apps | Single-instance or hard-to-scale apps
Disruption        | No pod restarts                      | Pod eviction in Auto mode
Can use together? | Yes, but not on the same metric      | Use VPA for memory, HPA for CPU
Response time     | Fast (seconds)                       | Slow (requires pod restart)

Important: Do not use HPA and VPA on the same metric (e.g., both scaling on CPU). They will fight each other. A common pattern is HPA for CPU scaling and VPA for memory right-sizing.

KEDA — Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaler) extends HPA to scale based on event sources like message queues, databases, and external APIs.

# Install KEDA with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0    # Scale to zero when idle
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"  # Target 5 messages per pod
        awsRegion: us-east-1
        identityOwner: operator
KEDA supports 60+ scalers including Kafka, RabbitMQ, Redis, PostgreSQL, Prometheus, Cron, and HTTP.
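Under the hood, KEDA feeds the queue length to an HPA as an external metric, so the effective replica count for the ScaledObject above is roughly the queue depth divided by the per-pod target. A rough Python approximation (the `keda_queue_replicas` helper is hypothetical, not KEDA's exact internal logic):

```python
import math

def keda_queue_replicas(messages: int, per_pod_target: int,
                        min_r: int = 0, max_r: int = 50) -> int:
    """Approximate KEDA's queue-based sizing: about one pod per
    `per_pod_target` messages, scaling down to min_r (possibly zero)
    when the queue is empty."""
    if messages == 0:
        return min_r
    return max(1, min(max_r, math.ceil(messages / per_pod_target)))

# 23 queued orders with queueLength "5" -> 5 worker pods
print(keda_queue_replicas(23, 5))  # 5
```

The scale-to-zero case is what plain HPA cannot do: KEDA itself activates the first replica when messages appear, then hands ongoing scaling to the HPA it manages.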

Cluster Autoscaler Integration

HPA adds pods, but if there are no nodes to schedule them, the pods stay Pending. The Cluster Autoscaler detects unschedulable pods and provisions new nodes from your cloud provider.

# Check for pods stuck in Pending (waiting for nodes)
kubectl get pods --field-selector=status.phase=Pending

# Check Cluster Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

The full autoscaling chain works like this:

  1. Traffic increases, CPU rises above HPA target
  2. HPA creates more pods
  3. New pods are Pending (no node capacity)
  4. Cluster Autoscaler adds a new node
  5. Scheduler places pods on the new node
  6. Traffic decreases, HPA scales down pods
  7. Node becomes underutilized, Cluster Autoscaler removes it

# Monitor the full autoscaling chain
kubectl get hpa -w &
kubectl get pods -w &
kubectl get nodes -w &

Next, we will dive into Kubernetes monitoring with Prometheus and Grafana — collecting metrics, building dashboards, and setting up alerts to catch problems before your users do.