Kubernetes Performance Tuning — etcd, API Server, and Scheduler Optimization
Most Kubernetes performance problems are not in your application code. They are in the platform underneath — an etcd database that has not been defragmented in months, an API server drowning in audit logs, a scheduler that takes 5 seconds to place a pod, or CoreDNS adding 30ms to every service call. Fixing these is free performance you are leaving on the table.
etcd Tuning — The Foundation of Everything
Every Kubernetes API call reads from or writes to etcd. If etcd is slow, your entire cluster is slow. Deployments take longer, pod scheduling stalls, and kubectl commands time out.
Check etcd Health
# Get etcd pod
kubectl get pods -n kube-system -l component=etcd
# Check etcd member health
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --write-out=table
# Check database size (alarm at >4GB, critical at >8GB)
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=table
Defragmentation
etcd uses a B+ tree with MVCC. Deleted keys leave dead space in the database file. Defragmentation reclaims this space.
# Check current DB size vs in-use size
# DB SIZE column = total file size, IN USE = actual data
# If DB SIZE is 2x IN USE, you need defrag
# Defragment (run on one member at a time, never all at once)
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag
# Repeat for each etcd member, one at a time
# etcd-master-2, etcd-master-3
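The DB SIZE vs IN USE comparison can be scripted. A minimal sketch, with sample byte values standing in for the columns of `etcdctl endpoint status` (the 2x threshold is the rule of thumb from above):

```shell
# Sample values standing in for the DB SIZE / IN USE columns
# of `etcdctl endpoint status` (both are reported in bytes in JSON output)
db_size=2147483648    # total file size on disk
db_in_use=858993459   # live data after compaction

ratio=$(( db_size * 100 / db_in_use ))
echo "DB file is ${ratio}% of in-use size"
[ "$ratio" -gt 200 ] && echo "defrag recommended"
```

With these sample numbers the file is 250% of the in-use size, well past the 2x threshold.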
Compaction
Kubernetes automatically compacts etcd every 5 minutes (default). For high-churn clusters, you may want to tune this.
# Inspect current revision, DB size, and raft state as JSON
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status -w json | python3 -m json.tool
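The five-minute cadence is driven by the API server rather than etcd itself, so for self-managed control planes the knob lives in the API server manifest. A sketch, assuming a kubeadm-style static pod layout:

```shell
# /etc/kubernetes/manifests/kube-apiserver.yaml
# The API server issues the compaction requests; this flag controls how often.
#   --etcd-compaction-interval=10m   # Default 5m0s; 0 disables API-server-driven
#                                    # compaction (rely on etcd's
#                                    # --auto-compaction-retention instead)
```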
Snapshot Interval and WAL Tuning
For self-managed etcd, tune the snapshot count and heartbeat interval:
# /etc/kubernetes/manifests/etcd.yaml (static pod on control plane)
# Add or modify these flags:
spec:
  containers:
  - name: etcd
    command:
    - etcd
    - --snapshot-count=5000              # Default 100000, lower = more frequent snapshots
    - --heartbeat-interval=200           # Default 100ms, increase for cross-AZ
    - --election-timeout=2000            # Default 1000ms, increase for cross-AZ
    - --quota-backend-bytes=8589934592   # 8GB (default 2GB, increase for large clusters)
    - --auto-compaction-retention=3      # Keep 3 hours of history
    - --auto-compaction-mode=periodic
    - --max-request-bytes=10485760       # 10MB max request size
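The quota flag takes raw bytes, which is easy to get wrong by a factor of 1024. A quick sanity check that the value above really is 8 GiB:

```shell
# 8 GiB expressed in bytes — matches --quota-backend-bytes above
echo $(( 8 * 1024 * 1024 * 1024 ))
```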
API Server Optimization
The API server is the front door to your cluster. Every kubectl command, controller reconciliation, and admission webhook goes through it.
Request Throttling
# Check current API server flags
kubectl get pods -n kube-system -l component=kube-apiserver \
-o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n'
# Key tuning parameters (in API server manifest):
# --max-requests-inflight=800 # Default 400, increase for large clusters
# --max-mutating-requests-inflight=400 # Default 200
# --min-request-timeout=1800 # Minimum timeout for long-running requests
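Whether those inflight limits are actually being hit shows up in the API Priority and Fairness metrics (`apiserver_flowcontrol_rejected_requests_total` vs `apiserver_flowcontrol_dispatched_requests_total` on the API server's /metrics endpoint). A sketch with sample counter values standing in for a real scrape:

```shell
# Sample counter values — stand-ins for the real APF metrics:
rejected=120       # apiserver_flowcontrol_rejected_requests_total
dispatched=60000   # apiserver_flowcontrol_dispatched_requests_total

# Rejections per 10k requests; a sustained non-zero rate means clients see 429s
echo "$(( rejected * 10000 / dispatched )) rejections per 10k requests"
```

Note that with API Priority and Fairness enabled (the default since v1.20), the two inflight flags together set the total concurrency budget that APF divides among priority levels.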
Audit Logging Impact
Audit logging is critical for compliance but expensive for performance. A verbose audit policy can add 20-40% latency to API calls. Audit rules are evaluated in order and the first match wins, so put the cheap `level: None` rules first.
# /etc/kubernetes/audit-policy.yaml
# Optimized audit policy — log less, keep what matters
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Skip noisy, low-value events
- level: None
  resources:
  - group: ""
    resources: ["events"]
- level: None
  users: ["system:kube-proxy"]
- level: None
  userGroups: ["system:nodes"]
  verbs: ["get"]
- level: None
  resources:
  - group: ""
    resources: ["endpoints", "services", "services/status"]
  users: ["system:kube-controller-manager"]
# Log metadata only for read operations
- level: Metadata
  verbs: ["get", "list", "watch"]
# Log request+response for writes to sensitive resources
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
  verbs: ["create", "update", "patch", "delete"]
# Default: log metadata for everything else
- level: Metadata
Watch Cache Tuning
The API server caches watch results in memory. Increase this for large clusters to reduce etcd reads.
# API server flags for watch cache
# --watch-cache=true # Default true
# --watch-cache-sizes=pods#1000,nodes#100,services#100
# --default-watch-cache-size=200 # Default 100
Scheduler Tuning
The scheduler decides which node gets each pod. In clusters with 100+ nodes, the scheduler filters every node and scores a large fraction of them for each pod — which can take seconds for pods with complex affinity rules.
Scheduling Profiles
# /etc/kubernetes/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: ImageLocality                    # Disable if all nodes pull from same registry
      - name: NodeResourcesBalancedAllocation  # Disable for bin-packing
      enabled:
      - name: NodeResourcesFit
        weight: 2
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated      # Bin-packing: prefer fuller nodes (saves money)
        # type: LeastAllocated   # Spreading: prefer emptier nodes (better HA)
---
# Pass to the kube-scheduler (not the API server) with the --config flag:
# --config=/etc/kubernetes/scheduler-config.yaml
Parallelism and Percentage of Nodes to Score
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 30   # Score only 30% of nodes (default is adaptive: ~50% for small clusters, tapering to 5% for very large ones)
parallelism: 32                # Parallel goroutines for scheduling (default: 16)
For a 200-node cluster, scoring 30% means the scheduler scores 60 nodes instead of roughly 100 — nearly halving the scoring phase of scheduling.
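The default percentage is adaptive rather than fixed. A sketch of the approximate formula the scheduler uses (assumed here from the scheduler source: 50 minus nodes/125, floored at 5 percent):

```shell
nodes=200

# Approximate default: 50 - nodes/125 percent, never below 5
pct=$(( 50 - nodes / 125 ))
[ "$pct" -lt 5 ] && pct=5

echo "default: score $(( nodes * pct / 100 )) of ${nodes} nodes (${pct}%)"
echo "tuned:   score $(( nodes * 30 / 100 )) of ${nodes} nodes (30%)"
```

For 200 nodes the default works out to 49%, so an explicit 30% is a real reduction; for a 5,000-node cluster the default is already down near the 5% floor.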
Kubelet Tuning
The kubelet runs on every node and manages pod lifecycle. Misconfigured kubelets cause slow pod starts and unnecessary evictions.
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Pod density
maxPods: 110        # Default 110. EKS default is based on ENI capacity.
podsPerCore: 0      # 0 = no per-core limit. Set to 10 for dense nodes.
# Eviction thresholds (when to evict pods to protect the node)
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "10%"
  pid.available: "5%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
# Image garbage collection
imageGCHighThresholdPercent: 85   # Start GC when disk is 85% full
imageGCLowThresholdPercent: 70    # Stop GC when disk drops to 70%
imageMinimumGCAge: "2m"
# Faster pod startup
serializeImagePulls: false   # Allow parallel image pulls (default: true)
registryPullQPS: 10          # Default 5. Increase for faster pulls.
registryBurst: 20            # Default 10. Allow burst pulls.
# CPU manager for guaranteed QoS pods
cpuManagerPolicy: static             # Pin CPUs for Guaranteed pods; changing this requires removing cpu_manager_state and restarting the kubelet
topologyManagerPolicy: best-effort   # NUMA-aware resource alignment
CoreDNS Optimization
Every service-to-service call starts with a DNS lookup. CoreDNS performance directly impacts application latency.
Tune the Cache
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 60
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 300 {             # Cache for up to 5 minutes (the stock Corefile uses 30s)
            success 9984 300    # Cache 9984 positive responses for 300s
            denial 9984 60      # Cache negative responses for 60s
        }
        loop
        reload
        loadbalance
    }
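Whether the larger cache pays off is visible in CoreDNS's own metrics (`coredns_cache_hits_total` and `coredns_cache_misses_total`, scraped from :9153/metrics). A sketch with sample counter values standing in for a real scrape:

```shell
hits=80000    # coredns_cache_hits_total (sample value)
misses=20000  # coredns_cache_misses_total (sample value)

echo "cache hit rate: $(( hits * 100 / (hits + misses) ))%"
```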
Fix the ndots Problem
By default, Kubernetes pods have ndots:5 in their DNS config. Because api.example.com contains fewer than five dots, the resolver first tries it against every domain in the search path — typically three or more failed lookups before the name is queried as written, adding 50-100ms to external calls.
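The expansion can be sketched with the standard three-entry cluster search path (each candidate below is a real DNS round trip, and all but the last return NXDOMAIN):

```shell
name="api.example.com"   # 2 dots, fewer than ndots:5, so the search path is tried first

for domain in production.svc.cluster.local svc.cluster.local cluster.local; do
  echo "query: ${name}.${domain}   (NXDOMAIN)"
done
echo "query: ${name}   (finally, the answer)"
```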
# Fix in pod spec
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"                   # Reduce from 5 to 2
    - name: single-request-reopen  # Flag-style option (no value); avoids a conntrack race condition
# Verify current DNS config inside a pod
kubectl exec -it debug-pod -- cat /etc/resolv.conf
# search production.svc.cluster.local svc.cluster.local cluster.local
# nameserver 10.96.0.10
# options ndots:5 ← This is the problem for external lookups
Pod Startup Latency Analysis
Slow pod starts can mean slower deployments, slower scaling, and worse user experience during rollouts.
# Measure pod startup time (from PodScheduled to Ready)
kubectl get pods -n production -o json | \
jq -r '.items[] |
.metadata.name + " " +
(.status.conditions[] | select(.type=="PodScheduled") | .lastTransitionTime) + " " +
(.status.conditions[] | select(.type=="Ready") | .lastTransitionTime)'
# Common bottlenecks and fixes:
# 1. Image pull time → Use pre-pulled images, smaller base images, or image caches (Spegel)
# 2. Readiness probe delay → Reduce initialDelaySeconds
# 3. Init containers → Parallelize where possible (K8s 1.29+ sidecar containers)
# 4. Scheduler latency → Tune percentageOfNodesToScore
# 5. Volume attachment → Use regional PVs, faster storage class
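Turning the two timestamps into a duration is a one-liner with GNU date. A sketch using hypothetical sample timestamps in place of the jq output:

```shell
# Hypothetical sample timestamps (PodScheduled and Ready transition times)
scheduled="2024-01-01T00:00:00Z"
ready="2024-01-01T00:00:42Z"

# GNU date (Linux); on macOS use gdate from coreutils
echo "startup: $(( $(date -d "$ready" +%s) - $(date -d "$scheduled" +%s) ))s"
```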
Network Performance
CNI Selection Impact
| CNI | Throughput | Latency | Overhead | Best For |
|---|---|---|---|---|
| Cilium (eBPF) | Very High | Very Low | Minimal | Performance-critical, security |
| Calico (eBPF mode) | High | Low | Low | General purpose, network policy |
| Calico (iptables) | Medium | Medium | Medium | Compatibility, small clusters |
| Flannel (VXLAN) | Medium | Medium | Medium | Simplicity, small clusters |
| AWS VPC CNI | Very High | Very Low | None | AWS-native, direct pod networking |
# Check current CNI and mode
kubectl get pods -n kube-system | grep -E "cilium|calico|flannel|weave"
# For Cilium, check if eBPF is active
kubectl exec -n kube-system ds/cilium -- cilium status | grep KubeProxyReplacement
# MTU check — mismatched MTU causes packet fragmentation and retransmits
kubectl exec -it debug-pod -- ip link show eth0
# MTU should match your network (typically 9001 on AWS with jumbo frames, 1500 otherwise)
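Overlay CNIs must subtract their encapsulation overhead from the interface MTU, which is why pod MTUs often look "wrong" at first glance. A quick sanity check for VXLAN's 50-byte header:

```shell
link_mtu=1500       # underlying network MTU
vxlan_overhead=50   # outer IP + UDP + VXLAN headers

# This is why Flannel VXLAN pods show MTU 1450 on a 1500-byte network
echo "pod MTU with VXLAN: $(( link_mtu - vxlan_overhead ))"
```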
Storage Performance
Disk I/O is often the silent killer. A database pod on a slow volume will bottleneck your entire application.
# Check storage class and provisioner
kubectl get storageclass
# Example: Upgrade from gp2 to gp3 on AWS (~20% cheaper per GB, 3000 baseline IOPS at any size)
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"        # Baseline: 3000, max: 16000
  throughput: "250"   # Baseline: 125 MB/s, max: 1000 MB/s
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
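The baseline difference matters most for small volumes, since gp2 IOPS scale with size (3 IOPS per provisioned GiB, minimum 100) while gp3's 3,000 baseline is size-independent. A quick comparison:

```shell
size_gib=100

gp2_iops=$(( size_gib * 3 ))   # gp2: 3 IOPS per provisioned GiB (min 100)
gp3_iops=3000                  # gp3: baseline regardless of size

echo "gp2 ${size_gib}GiB: ${gp2_iops} IOPS, gp3: ${gp3_iops} IOPS"
```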
# Benchmark storage inside a pod
kubectl exec -it storage-test -- fio \
--name=randwrite --ioengine=libaio --iodepth=32 \
--rw=randwrite --bs=4k --size=1G --numjobs=4 \
--runtime=30 --group_reporting --directory=/data
Tuning Summary
| Component | Metric to Watch | Default | Tuned | Impact |
|---|---|---|---|---|
| etcd DB size | etcd_mvcc_db_total_size_in_bytes | Grows unbounded | Defrag monthly | Prevents OOM, slow queries |
| API server inflight | apiserver_current_inflight_requests | 400/200 | 800/400 | Prevents 429 throttling |
| Scheduler latency | scheduler_scheduling_duration_seconds | Scores 50% nodes | Score 30% | 40% faster scheduling |
| Kubelet image pulls | kubelet_runtime_operations_duration_seconds | Serial pulls | Parallel pulls | 3-5x faster pod startup |
| CoreDNS cache | coredns_cache_hits_total | 30s TTL | 300s TTL | 80% fewer DNS queries |
| ndots | N/A (application latency) | 5 | 2 | 50-100ms saved per external call |
Performance tuning is not a one-time project. It is an ongoing practice of measuring, identifying bottlenecks, tuning, and measuring again. Start with the biggest pain points — usually etcd health and CoreDNS latency — and work outward. Every millisecond you save in the platform is a millisecond saved on every request your application serves.
