
Kubernetes Performance Tuning — etcd, API Server, and Scheduler Optimization

· 9 min read
Goel Academy
DevOps & Cloud Learning Hub

Most Kubernetes performance problems are not in your application code. They are in the platform underneath — an etcd database that has not been defragmented in months, an API server drowning in audit logs, a scheduler that takes 5 seconds to place a pod, or CoreDNS adding 30ms to every service call. Fixing these is free performance you are leaving on the table.

etcd Tuning — The Foundation of Everything

Every Kubernetes API call reads from or writes to etcd. If etcd is slow, your entire cluster is slow. Deployments take longer, pod scheduling stalls, and kubectl commands time out.

Check etcd Health

# Get etcd pod
kubectl get pods -n kube-system -l component=etcd

# Check etcd member health
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --write-table

# Check database size (alarm at >4GB, critical at >8GB)
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-table

Defragmentation

etcd uses a B+ tree with MVCC. Deleted keys leave dead space in the database file. Defragmentation reclaims this space.

# Check current DB size vs in-use size
# DB SIZE column = total file size, IN USE = actual data
# If DB SIZE is 2x IN USE, you need defrag

# Defragment (run on one member at a time, never all at once)
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag

# Repeat for each etcd member, one at a time
# etcd-master-2, etcd-master-3

Compaction

Kubernetes automatically compacts etcd every 5 minutes (default). For high-churn clusters, you may want to tune this.

# Check current compaction revision
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status -w json | python3 -m json.tool
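
Compaction is driven by the API server, not by etcd itself. If the default five-minute interval does not suit a high-churn cluster, it can be changed in the API server static pod manifest. A minimal sketch, assuming a kubeadm layout:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout assumed)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --etcd-compaction-interval=5m   # default is 5m; 0 disables API-server-driven compaction
```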

Snapshot Interval and WAL Tuning

For self-managed etcd, tune the snapshot count and heartbeat interval:

# /etc/kubernetes/manifests/etcd.yaml (static pod on control plane)
# Add or modify these flags:
spec:
  containers:
  - name: etcd
    command:
    - etcd
    - --snapshot-count=5000            # Default 100000; lower = more frequent snapshots
    - --heartbeat-interval=200         # Default 100ms; increase for cross-AZ
    - --election-timeout=2000          # Default 1000ms; increase for cross-AZ
    - --quota-backend-bytes=8589934592 # 8GB (default 2GB; increase for large clusters)
    - --auto-compaction-retention=3    # Keep 3 hours of history
    - --auto-compaction-mode=periodic
    - --max-request-bytes=10485760     # 10MB max request size
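
The quota flag takes raw bytes, which makes the value easy to mistype. A one-liner to derive (or sanity-check) the 8GB figure:

```shell
# 8 GiB expressed in bytes for --quota-backend-bytes
echo $((8 * 1024 * 1024 * 1024))   # 8589934592
```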

API Server Optimization

The API server is the front door to your cluster. Every kubectl command, controller reconciliation, and admission webhook goes through it.

Request Throttling

# Check current API server flags
kubectl get pods -n kube-system -l component=kube-apiserver \
-o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n'

# Key tuning parameters (in API server manifest):
# --max-requests-inflight=800 # Default 400, increase for large clusters
# --max-mutating-requests-inflight=400 # Default 200
# --min-request-timeout=1800 # Minimum timeout for long-running requests
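
Since Kubernetes 1.20, these inflight limits feed API Priority and Fairness, which divides the total concurrency pool between priority levels rather than applying one global gate. A hedged sketch of a custom PriorityLevelConfiguration (the name batch-controllers and the share values are illustrative; the v1 API shown requires Kubernetes 1.29+, earlier versions use v1beta3):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: batch-controllers   # illustrative name, not a built-in level
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 60   # relative share of the inflight pool
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
```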

Audit Logging Impact

Audit logging is critical for compliance but expensive for performance. A verbose audit policy can add 20-40% latency to API calls.

# /etc/kubernetes/audit-policy.yaml
# Optimized audit policy — log less, keep what matters.
# Rules are evaluated in order; the first match wins, so the None rules must come first.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Skip noisy, low-value events
- level: None
  resources:
  - group: ""
    resources: ["events"]
- level: None
  users: ["system:kube-proxy"]
- level: None
  userGroups: ["system:nodes"]
  verbs: ["get"]
- level: None
  resources:
  - group: ""
    resources: ["endpoints", "services", "services/status"]
  users: ["system:kube-controller-manager"]
# Log metadata only for read operations
- level: Metadata
  verbs: ["get", "list", "watch"]
# Log request+response for writes to sensitive resources
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
  verbs: ["create", "update", "patch", "delete"]
# Default: log metadata for everything else
- level: Metadata

Watch Cache Tuning

The API server caches watch results in memory. Increase this for large clusters to reduce etcd reads.

# API server flags for watch cache
# --watch-cache=true # Default true
# --watch-cache-sizes=pods#1000,nodes#100,services#100
# --default-watch-cache-size=200 # Default 100

Scheduler Tuning

The scheduler decides which node gets each pod. In clusters with 100+ nodes, filtering and scoring candidate nodes for every pod adds up, and placement can take seconds for pods with complex affinity rules.

Scheduling Profiles

# /etc/kubernetes/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: ImageLocality                    # Disable if all nodes pull from the same registry
      - name: NodeResourcesBalancedAllocation  # Disable for bin-packing
      enabled:
      - name: NodeResourcesFit
        weight: 2
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # Bin-packing: prefer fuller nodes (saves money)
        # type: LeastAllocated  # Spreading: prefer emptier nodes (better HA)

# Pass the file to kube-scheduler (not the API server) with the --config flag:
# --config=/etc/kubernetes/scheduler-config.yaml

Parallelism and Percentage of Nodes to Score

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 30   # Score only 30% of feasible nodes (default adapts with size: ~50 - nodes/125, floored at 5%)
parallelism: 32                # Parallel goroutines for scheduling (default: 16)

For a 200-node cluster, the adaptive default scores roughly 48% of nodes (close to 100); pinning percentageOfNodesToScore at 30 drops that to 60 nodes, cutting scoring work by more than a third.
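The adaptive default can be sketched numerically. The formula below (approximately 50 - nodes/125, floored at 5%) comes from the kube-scheduler source and may change between releases; integer shell arithmetic makes the adaptive figure slightly approximate:

```shell
# Compare the scheduler's adaptive default against a fixed 30% for 200 nodes
nodes=200
adaptive=$((50 - nodes / 125))        # integer approximation of 50 - nodes/125
[ "$adaptive" -lt 5 ] && adaptive=5   # floor at 5%
echo "adaptive default: ${adaptive}% -> $((nodes * adaptive / 100)) nodes scored"
echo "fixed at 30:      30% -> $((nodes * 30 / 100)) nodes scored"
```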

Kubelet Tuning

The kubelet runs on every node and manages pod lifecycle. Misconfigured kubelets cause slow pod starts and unnecessary evictions.

# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# Pod density
maxPods: 110     # Default 110. EKS default is based on ENI capacity.
podsPerCore: 0   # 0 = no per-core limit. Set to 10 for dense nodes.

# Eviction thresholds (when to evict pods to protect the node)
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "10%"
  pid.available: "5%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"

# Image garbage collection
imageGCHighThresholdPercent: 85  # Start GC when disk is 85% full
imageGCLowThresholdPercent: 70   # Stop GC when disk drops to 70%
imageMinimumGCAge: "2m"

# Faster pod startup
serializeImagePulls: false  # Allow parallel image pulls (default: true)
registryPullQPS: 10         # Default 5. Increase for faster pulls.
registryBurst: 20           # Default 10. Allow burst pulls.

# CPU manager for guaranteed QoS pods
cpuManagerPolicy: static            # Pin CPUs for Guaranteed pods; requires CPU reservations (kubeReserved/systemReserved)
topologyManagerPolicy: best-effort  # NUMA-aware placement

CoreDNS Optimization

Every service-to-service call starts with a DNS lookup. CoreDNS performance directly impacts application latency.

Tune the Cache

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 60
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 300 {            # cap cached TTLs at 300s (the default kubeadm Corefile uses cache 30)
            success 9984 300   # keep up to 9984 positive responses for up to 300s
            denial 9984 60     # cache negative responses for up to 60s
        }
        loop
        reload
        loadbalance
    }

Fix the ndots Problem

By default, Kubernetes pods get ndots:5 in /etc/resolv.conf. Any name with fewer than five dots, including api.example.com, is first tried against every domain in the search list: typically three failing in-cluster queries (doubled for A and AAAA records) before the real one, adding 50-100ms to external calls.

# Fix in the pod spec
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"  # Reduce from 5 to 2
    - name: single-request-reopen
      value: ""   # Avoid a conntrack race condition (glibc resolvers)

# Verify current DNS config inside a pod
kubectl exec -it debug-pod -- cat /etc/resolv.conf
# search production.svc.cluster.local svc.cluster.local cluster.local
# nameserver 10.96.0.10
# options ndots:5 ← This is the problem for external lookups
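
Under the assumptions above (glibc resolver, three search domains, A and AAAA lookups per attempt), the worst-case query counts work out as follows:

```shell
# Worst-case DNS queries for "api.example.com" (an external name with 2 dots)
search_domains=3
# ndots:5 -> the name has fewer dots than ndots, so every search domain is
# tried (and fails) before the absolute name; each attempt issues A + AAAA
echo "ndots:5: $(( (search_domains + 1) * 2 )) queries"
# ndots:2 -> 2 dots >= ndots, so the absolute name is tried first and succeeds
echo "ndots:2: $(( 1 * 2 )) queries"
```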

Pod Startup Latency Analysis

Slow pod starts can mean slower deployments, slower scaling, and worse user experience during rollouts.

# Measure pod startup time (from creation to Running)
kubectl get pods -n production -o json | \
jq -r '.items[] |
.metadata.name + " " +
(.status.conditions[] | select(.type=="PodScheduled") | .lastTransitionTime) + " " +
(.status.conditions[] | select(.type=="Ready") | .lastTransitionTime)'

# Common bottlenecks and fixes:
# 1. Image pull time → Use pre-pulled images, smaller base images, or image caches (Spegel)
# 2. Readiness probe delay → Reduce initialDelaySeconds
# 3. Init containers → Parallelize where possible (K8s 1.29+ sidecar containers)
# 4. Scheduler latency → Tune percentageOfNodesToScore
# 5. Volume attachment → Use regional PVs, faster storage class
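
The two timestamps printed by the jq query above can be subtracted to get a startup duration. A sketch assuming GNU date (Linux; BSD/macOS date needs -j -f instead of -d), with made-up example timestamps:

```shell
# Hypothetical timestamps in the format the jq query prints
scheduled="2024-05-01T10:00:02Z"
ready="2024-05-01T10:00:47Z"

# GNU date parses the ISO timestamps directly
start=$(date -u -d "$scheduled" +%s)
end=$(date -u -d "$ready" +%s)
echo "startup: $((end - start))s"   # startup: 45s
```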

Network Performance

CNI Selection Impact

CNI                | Throughput | Latency  | Overhead | Best For
Cilium (eBPF)      | Very High  | Very Low | Minimal  | Performance-critical, security
Calico (eBPF mode) | High       | Low      | Low      | General purpose, network policy
Calico (iptables)  | Medium     | Medium   | Medium   | Compatibility, small clusters
Flannel (VXLAN)    | Medium     | Medium   | Medium   | Simplicity, small clusters
AWS VPC CNI        | Very High  | Very Low | None     | AWS-native, direct pod networking

# Check current CNI and mode
kubectl get pods -n kube-system | grep -E "cilium|calico|flannel|weave"

# For Cilium, check if eBPF is active
kubectl exec -n kube-system ds/cilium -- cilium status | grep KubeProxyReplacement

# MTU check — mismatched MTU causes packet fragmentation and retransmits
kubectl exec -it debug-pod -- ip link show eth0
# MTU should match your network (typically 9001 on AWS with jumbo frames, 1500 otherwise)
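
Overlay networks eat into the MTU: VXLAN adds 50 bytes of headers (outer Ethernet 14 + IP 20 + UDP 8 + VXLAN 8), so pod interfaces on a VXLAN CNI should be that much smaller than the host MTU:

```shell
# Pod MTU for a VXLAN overlay on a standard 1500-byte network
host_mtu=1500
vxlan_overhead=50   # outer Ethernet/IP/UDP/VXLAN headers
echo "pod MTU for VXLAN: $((host_mtu - vxlan_overhead))"   # 1450
```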

Storage Performance

Disk I/O is often the silent killer. A database pod on a slow volume will bottleneck your entire application.

# Check storage class and provisioner
kubectl get storageclass

# Example: Upgrade from gp2 to gp3 on AWS (~20% cheaper per GB, with a
# 3000 IOPS baseline regardless of volume size)
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"        # Baseline: 3000, max: 16000
  throughput: "250"   # Baseline: 125 MB/s, max: 1000 MB/s
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

# Benchmark storage inside a pod
kubectl exec -it storage-test -- fio \
--name=randwrite --ioengine=libaio --iodepth=32 \
--rw=randwrite --bs=4k --size=1G --numjobs=4 \
--runtime=30 --group_reporting --directory=/data

Tuning Summary

Component           | Metric to Watch                       | Default              | Tuned          | Impact
etcd DB size        | etcd_mvcc_db_total_size_in_bytes      | Grows unbounded      | Defrag monthly | Prevents OOM, slow queries
API server inflight | apiserver_current_inflight_requests   | 400/200              | 800/400        | Prevents 429 throttling
Scheduler latency   | scheduler_scheduling_duration_seconds | Scores ~50% of nodes | Score 30%      | ~40% faster scheduling
Kubelet image pulls | kubelet_runtime_operations_duration   | Serial pulls         | Parallel pulls | 3-5x faster pod startup
CoreDNS cache       | coredns_cache_hits_total              | 30s TTL              | 300s TTL       | 80% fewer DNS queries
ndots               | N/A (application latency)             | 5                    | 2              | 50-100ms saved per external call

Performance tuning is not a one-time project. It is an ongoing practice of measuring, identifying bottlenecks, tuning, and measuring again. Start with the biggest pain points — usually etcd health and CoreDNS latency — and work outward. Every millisecond you save in the platform is a millisecond saved on every request your application serves.