Kubernetes HPA and VPA — Auto-Scale Your Workloads the Right Way

8 min read
Goel Academy
DevOps & Cloud Learning Hub

Your application handles 100 requests per second during the day and 10,000 during flash sales. Running enough pods for peak traffic wastes money 95% of the time. Running too few means your app crashes when traffic spikes. Autoscaling solves this by matching your pod count and resource allocation to actual demand in real time.

Autoscaling in Kubernetes — The Big Picture

Kubernetes offers three levels of autoscaling:

Level          | Component          | What It Scales
Pod horizontal | HPA                | Number of pod replicas
Pod vertical   | VPA                | CPU and memory requests/limits per pod
Cluster        | Cluster Autoscaler | Number of nodes in the cluster

HPA and VPA handle pod-level scaling. The Cluster Autoscaler ensures there are enough nodes to schedule the pods that HPA creates.

HPA Basics — CPU-Based Scaling

The simplest HPA scales based on CPU utilization. Metrics Server must be installed for this to work.

# Prerequisite: install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify metrics-server is running
kubectl top pods -n kube-system

Basic HPA with kubectl

# Create an HPA that targets 50% CPU utilization
kubectl autoscale deployment web-app \
  --cpu-percent=50 \
  --min=2 \
  --max=20

# Check HPA status
kubectl get hpa
# NAME      REFERENCE            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# web-app   Deployment/web-app   35%/50%   2         20        3          5m

HPA v1 YAML

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  targetCPUUtilizationPercentage: 50

The HPA checks metrics every 15 seconds (default) and adjusts replicas using this formula:

desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))

If current CPU is 80% and target is 50%, with 3 replicas: ceil(3 * 80/50) = ceil(4.8) = 5 replicas.
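The formula is easy to reproduce. The sketch below (in Python, with a hypothetical `desired_replicas` helper) reimplements it, including the clamping to min/max replicas that the HPA applies after the calculation:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_r: int = 1, max_r: int = 1000) -> int:
    """Reproduce the HPA replica formula, then clamp to the min/max bounds."""
    desired = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_r, min(max_r, desired))

# The worked example from the text: 3 replicas at 80% CPU with a 50% target
print(desired_replicas(3, 80, 50))  # 5
```

Note that if the result exceeds `maxReplicas`, the HPA simply pins the replica count at the maximum; it never scales beyond the configured bounds.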

HPA v2 — Memory, Multiple Metrics, and Custom Metrics

HPA v2 (autoscaling/v2) unlocks memory scaling, multiple metrics, and custom metric sources.

CPU and Memory Together

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

When multiple metrics are specified, HPA calculates the desired replicas for each metric and picks the highest value.
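That "take the highest" rule can be sketched in a few lines of Python (the `desired_for_metrics` helper is hypothetical, applying the standard HPA formula per metric):

```python
import math

def desired_for_metrics(current_replicas: int, metrics: list) -> int:
    """metrics: list of (current, target) pairs.
    HPA computes a replica count per metric and acts on the maximum,
    so the busiest dimension always wins."""
    return max(math.ceil(current_replicas * (c / t)) for c, t in metrics)

# 4 replicas: CPU at 90% vs 60% target, memory at 50% vs 70% target
print(desired_for_metrics(4, [(90, 60), (50, 70)]))  # 6 (CPU dominates)
```

This is why adding a second metric can only make the HPA more aggressive, never less: a metric that is comfortably below target is simply outvoted.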

Custom Metrics (Requests Per Second)

Scale based on application-specific metrics like requests per second, queue depth, or active connections. This requires a custom metrics adapter like the Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
    # Scale on CPU as a baseline
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    # Scale on requests per second per pod
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"  # Scale up when > 100 RPS per pod
    # Scale on an external metric (e.g., SQS queue depth)
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: orders
        target:
          type: Value
          value: "50"  # Scale up when queue > 50 messages

Scaling Behavior — Fine-Tuning Scale-Up and Scale-Down

HPA v2 lets you control how aggressively the autoscaler scales in each direction.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Wait 60s before scaling up again
      policies:
        - type: Percent
          value: 100                   # Double the pods at most
          periodSeconds: 60
        - type: Pods
          value: 10                    # Or add at most 10 pods
          periodSeconds: 60
      selectPolicy: Max                # Use whichever policy allows more pods
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                    # Remove at most 10% of pods
          periodSeconds: 60
      selectPolicy: Min                # Use whichever policy removes fewer pods

The stabilization window prevents flapping. Without it, the HPA might scale up during a spike and immediately scale down when the spike passes, only to scale up again when the remaining pods get overwhelmed.
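Mechanically, the scale-down stabilization window works by remembering recent recommendations and acting on the highest one, so a momentary dip never removes pods. A toy Python sketch of that rule (function name is illustrative, not from the HPA source):

```python
def stabilized_scale_down(window_recommendations: list) -> int:
    """During scale-down, HPA acts on the HIGHEST replica recommendation
    seen inside the stabilization window. A brief dip in load produces a
    low recommendation, but the recent high keeps the pods around."""
    return max(window_recommendations)

# Recommendations over the last 5 minutes: spike ended, load still wobbling
print(stabilized_scale_down([8, 5, 4, 6]))  # 8 -> no pods removed yet
```

Only once every recommendation in the window agrees that fewer pods are needed does the replica count actually drop.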

# Watch HPA decisions in real time
kubectl describe hpa web-app-hpa -n production

# Events show scaling decisions
# Events:
#   Normal  SuccessfulRescale  2m   horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization above target
#   Normal  SuccessfulRescale  30s  horizontal-pod-autoscaler  New size: 6; reason: All metrics below target

Custom Metrics with Prometheus Adapter

To scale on application metrics, you need to expose them to the Kubernetes metrics API via the Prometheus Adapter.

# Install Prometheus Adapter with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=80

Configure the adapter to expose your custom metric:

# prometheus-adapter-config.yaml (values file)
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'

# Verify the custom metric is available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

Vertical Pod Autoscaler (VPA)

While HPA scales horizontally (more pods), VPA scales vertically (bigger pods). VPA adjusts CPU and memory requests based on observed usage.

Installing VPA

# Clone and install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Verify VPA components
kubectl get pods -n kube-system | grep vpa
# vpa-admission-controller-xxx 1/1 Running
# vpa-recommender-xxx 1/1 Running
# vpa-updater-xxx 1/1 Running

VPA Modes

Mode    | Behavior                                     | Disruption            | Use Case
Off     | Recommendations only, no changes             | None                  | Observe first, get baseline data
Initial | Sets resources at pod creation only          | None for running pods | Avoid live disruptions
Auto    | Evicts and recreates pods with new resources | Pod restarts          | Full automation

VPA Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Start with Off
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledValues: RequestsAndLimits

# Check recommendations after a few hours/days
kubectl get vpa api-vpa -n production -o jsonpath='{.status.recommendation}' | jq .
# {
#   "containerRecommendations": [{
#     "containerName": "api",
#     "lowerBound": {"cpu": "120m", "memory": "200Mi"},
#     "target": {"cpu": "250m", "memory": "384Mi"},
#     "upperBound": {"cpu": "800m", "memory": "1.2Gi"},
#     "uncappedTarget": {"cpu": "250m", "memory": "384Mi"}
#   }]
# }

HPA vs VPA — When to Use Which

Criteria          | HPA                                  | VPA
What it scales    | Number of replicas                   | Resource requests/limits
Best for          | Stateless, horizontally scalable apps | Single-instance or hard-to-scale apps
Disruption        | No pod restarts                      | Pod eviction in Auto mode
Can use together? | Yes, but not on the same metric      | Use VPA for memory, HPA for CPU
Response time     | Fast (seconds)                       | Slow (requires pod restart)

Important: Do not use HPA and VPA on the same metric (e.g., both scaling on CPU). They will fight each other. A common pattern is HPA for CPU scaling and VPA for memory right-sizing.

KEDA — Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaler) extends HPA to scale based on event sources like message queues, databases, and external APIs.

# Install KEDA with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0    # Scale to zero when idle
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/orders
        queueLength: "5"  # Target 5 messages per pod
        awsRegion: us-east-1
        identityOwner: operator
KEDA supports 60+ scalers including Kafka, RabbitMQ, Redis, PostgreSQL, Prometheus, Cron, and HTTP.
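Under the hood, KEDA feeds the queue length to an HPA as an external metric, so the effective replica count for the ScaledObject above is roughly the queue depth divided by the per-pod target. A rough Python approximation (the `keda_queue_replicas` helper is hypothetical, not KEDA's exact internal logic):

```python
import math

def keda_queue_replicas(messages: int, per_pod_target: int,
                        min_r: int = 0, max_r: int = 50) -> int:
    """Approximate KEDA's queue-based sizing: about one pod per
    `per_pod_target` messages, scaling down to min_r (possibly zero)
    when the queue is empty."""
    if messages == 0:
        return min_r
    return max(1, min(max_r, math.ceil(messages / per_pod_target)))

# 23 queued orders with queueLength "5" -> 5 worker pods
print(keda_queue_replicas(23, 5))  # 5
```

The scale-to-zero case is what plain HPA cannot do: KEDA itself activates the first replica when messages appear, then hands ongoing scaling to the HPA it manages.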

Cluster Autoscaler Integration

HPA adds pods, but if there are no nodes to schedule them, the pods stay Pending. The Cluster Autoscaler detects unschedulable pods and provisions new nodes from your cloud provider.

# Check for pods stuck in Pending (waiting for nodes)
kubectl get pods --field-selector=status.phase=Pending

# Check Cluster Autoscaler status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

The full autoscaling chain works like this:

  1. Traffic increases, CPU rises above HPA target
  2. HPA creates more pods
  3. New pods are Pending (no node capacity)
  4. Cluster Autoscaler adds a new node
  5. Scheduler places pods on the new node
  6. Traffic decreases, HPA scales down pods
  7. Node becomes underutilized, Cluster Autoscaler removes it

# Monitor the full autoscaling chain
kubectl get hpa -w &
kubectl get pods -w &
kubectl get nodes -w &

Next, we will dive into Kubernetes monitoring with Prometheus and Grafana — collecting metrics, building dashboards, and setting up alerts to catch problems before your users do.