Kubernetes Performance Tuning — etcd, API Server, and Scheduler Optimization
Most Kubernetes performance problems are not in your application code. They are in the platform underneath — an etcd database that has not been defragmented in months, an API server drowning in audit logs, a scheduler that takes 5 seconds to place a pod, or CoreDNS adding 30ms to every service call. Fixing these is free performance you are leaving on the table.
etcd Tuning — The Foundation of Everything
Every Kubernetes API call reads from or writes to etcd. If etcd is slow, your entire cluster is slow. Deployments take longer, pod scheduling stalls, and kubectl commands time out.
Check etcd Health
# Get etcd pod
kubectl get pods -n kube-system -l component=etcd
# Check etcd member health
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --write-out=table
# Check database size (alarm at >4GB, critical at >8GB)
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=table
Defragmentation
etcd uses a B+ tree with MVCC. Deleted keys leave dead space in the database file. Defragmentation reclaims this space.
# Check current DB size vs in-use size
# DB SIZE column = total file size, IN USE = actual data
# If DB SIZE is 2x IN USE, you need defrag
# Defragment (run on one member at a time, never all at once)
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag
# Repeat for each etcd member, one at a time
# etcd-master-2, etcd-master-3
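The DB SIZE vs IN USE comparison can be scripted. A minimal sketch, with sample byte values standing in for the columns of `etcdctl endpoint status` (the 2x threshold is the rule of thumb from above):

```shell
# Sample values standing in for the DB SIZE / IN USE columns
# of `etcdctl endpoint status` (both are reported in bytes in JSON output)
db_size=2147483648    # total file size on disk
db_in_use=858993459   # live data after compaction

ratio=$(( db_size * 100 / db_in_use ))
echo "DB file is ${ratio}% of in-use size"
[ "$ratio" -gt 200 ] && echo "defrag recommended"
```

With these sample numbers the file is 250% of the in-use size, well past the 2x threshold.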
Compaction
Kubernetes automatically compacts etcd every 5 minutes (default). For high-churn clusters, you may want to tune this.
# Inspect current revision, DB size, and raft state as JSON
kubectl exec -n kube-system etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status -w json | python3 -m json.tool
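The five-minute cadence is driven by the API server rather than etcd itself, so for self-managed control planes the knob lives in the API server manifest. A sketch, assuming a kubeadm-style static pod layout:

```shell
# /etc/kubernetes/manifests/kube-apiserver.yaml
# The API server issues the compaction requests; this flag controls how often.
#   --etcd-compaction-interval=10m   # Default 5m0s; 0 disables API-server-driven
#                                    # compaction (rely on etcd's
#                                    # --auto-compaction-retention instead)
```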
Snapshot Interval and WAL Tuning
For self-managed etcd, tune the snapshot count and heartbeat interval:
# /etc/kubernetes/manifests/etcd.yaml (static pod on control plane)
# Add or modify these flags:
spec:
  containers:
  - name: etcd
    command:
    - etcd
    - --snapshot-count=5000              # Default 100000, lower = more frequent snapshots
    - --heartbeat-interval=200           # Default 100ms, increase for cross-AZ
    - --election-timeout=2000            # Default 1000ms, increase for cross-AZ
    - --quota-backend-bytes=8589934592   # 8GB (default 2GB, increase for large clusters)
    - --auto-compaction-retention=3      # Keep 3 hours of history
    - --auto-compaction-mode=periodic
    - --max-request-bytes=10485760       # 10MB max request size
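The quota flag takes raw bytes, which is easy to get wrong by a factor of 1024. A quick sanity check that the value above really is 8 GiB:

```shell
# 8 GiB expressed in bytes — matches --quota-backend-bytes above
echo $(( 8 * 1024 * 1024 * 1024 ))
```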
API Server Optimization
The API server is the front door to your cluster. Every kubectl command, controller reconciliation, and admission webhook goes through it.
Request Throttling
# Check current API server flags
kubectl get pods -n kube-system -l component=kube-apiserver \
-o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n'
# Key tuning parameters (in API server manifest):
# --max-requests-inflight=800 # Default 400, increase for large clusters
# --max-mutating-requests-inflight=400 # Default 200
# --min-request-timeout=1800 # Minimum timeout for long-running requests
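Whether those inflight limits are actually being hit shows up in the API Priority and Fairness metrics (`apiserver_flowcontrol_rejected_requests_total` vs `apiserver_flowcontrol_dispatched_requests_total` on the API server's /metrics endpoint). A sketch with sample counter values standing in for a real scrape:

```shell
# Sample counter values — stand-ins for the real APF metrics:
rejected=120       # apiserver_flowcontrol_rejected_requests_total
dispatched=60000   # apiserver_flowcontrol_dispatched_requests_total

# Rejections per 10k requests; a sustained non-zero rate means clients see 429s
echo "$(( rejected * 10000 / dispatched )) rejections per 10k requests"
```

Note that with API Priority and Fairness enabled (the default since v1.20), the two inflight flags together set the total concurrency budget that APF divides among priority levels.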
Audit Logging Impact
Audit logging is critical for compliance but expensive for performance. A verbose audit policy can add 20-40% latency to API calls. Audit rules are evaluated in order and the first match wins, so put the cheap `level: None` rules first.
# /etc/kubernetes/audit-policy.yaml
# Optimized audit policy — log less, keep what matters
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Skip noisy, low-value events
- level: None
  resources:
  - group: ""
    resources: ["events"]
- level: None
  users: ["system:kube-proxy"]
- level: None
  userGroups: ["system:nodes"]
  verbs: ["get"]
- level: None
  resources:
  - group: ""
    resources: ["endpoints", "services", "services/status"]
  users: ["system:kube-controller-manager"]
# Log metadata only for read operations
- level: Metadata
  verbs: ["get", "list", "watch"]
# Log request+response for writes to sensitive resources
- level: RequestResponse
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
  verbs: ["create", "update", "patch", "delete"]
# Default: log metadata for everything else
- level: Metadata
Watch Cache Tuning
The API server caches watch results in memory. Increase this for large clusters to reduce etcd reads.
# API server flags for watch cache
# --watch-cache=true # Default true
# --watch-cache-sizes=pods#1000,nodes#100,services#100
# --default-watch-cache-size=200 # Default 100
Scheduler Tuning
The scheduler decides which node gets each pod. In clusters with 100+ nodes, the scheduler filters every node and scores a large fraction of them for each pod — which can take seconds for pods with complex affinity rules.
Scheduling Profiles
# /etc/kubernetes/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: ImageLocality                    # Disable if all nodes pull from same registry
      - name: NodeResourcesBalancedAllocation  # Disable for bin-packing
      enabled:
      - name: NodeResourcesFit
        weight: 2
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated      # Bin-packing: prefer fuller nodes (saves money)
        # type: LeastAllocated   # Spreading: prefer emptier nodes (better HA)
---
# Pass to the kube-scheduler (not the API server) with the --config flag:
# --config=/etc/kubernetes/scheduler-config.yaml
Parallelism and Percentage of Nodes to Score
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 30   # Score only 30% of nodes (default is adaptive: ~50% for small clusters, tapering to 5% for very large ones)
parallelism: 32                # Parallel goroutines for scheduling (default: 16)
For a 200-node cluster, scoring 30% means the scheduler scores 60 nodes instead of roughly 100 — nearly halving the scoring phase of scheduling.
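The default percentage is adaptive rather than fixed. A sketch of the approximate formula the scheduler uses (assumed here from the scheduler source: 50 minus nodes/125, floored at 5 percent):

```shell
nodes=200

# Approximate default: 50 - nodes/125 percent, never below 5
pct=$(( 50 - nodes / 125 ))
[ "$pct" -lt 5 ] && pct=5

echo "default: score $(( nodes * pct / 100 )) of ${nodes} nodes (${pct}%)"
echo "tuned:   score $(( nodes * 30 / 100 )) of ${nodes} nodes (30%)"
```

For 200 nodes the default works out to 49%, so an explicit 30% is a real reduction; for a 5,000-node cluster the default is already down near the 5% floor.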
Kubelet Tuning
The kubelet runs on every node and manages pod lifecycle. Misconfigured kubelets cause slow pod starts and unnecessary evictions.
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Pod density
maxPods: 110        # Default 110. EKS default is based on ENI capacity.
podsPerCore: 0      # 0 = no per-core limit. Set to 10 for dense nodes.
# Eviction thresholds (when to evict pods to protect the node)
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "10%"
  pid.available: "5%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
# Image garbage collection
imageGCHighThresholdPercent: 85   # Start GC when disk is 85% full
imageGCLowThresholdPercent: 70    # Stop GC when disk drops to 70%
imageMinimumGCAge: "2m"
# Faster pod startup
serializeImagePulls: false   # Allow parallel image pulls (default: true)
registryPullQPS: 10          # Default 5. Increase for faster pulls.
registryBurst: 20            # Default 10. Allow burst pulls.
# CPU manager for guaranteed QoS pods
cpuManagerPolicy: static             # Pin CPUs for Guaranteed pods; changing this requires removing cpu_manager_state and restarting the kubelet
topologyManagerPolicy: best-effort   # NUMA-aware resource alignment
CoreDNS Optimization
Every service-to-service call starts with a DNS lookup. CoreDNS performance directly impacts application latency.
Tune the Cache
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 60
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 300 {             # Cache for up to 5 minutes (the stock Corefile uses 30s)
            success 9984 300    # Cache 9984 positive responses for 300s
            denial 9984 60      # Cache negative responses for 60s
        }
        loop
        reload
        loadbalance
    }
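Whether the larger cache pays off is visible in CoreDNS's own metrics (`coredns_cache_hits_total` and `coredns_cache_misses_total`, scraped from :9153/metrics). A sketch with sample counter values standing in for a real scrape:

```shell
hits=80000    # coredns_cache_hits_total (sample value)
misses=20000  # coredns_cache_misses_total (sample value)

echo "cache hit rate: $(( hits * 100 / (hits + misses) ))%"
```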
Fix the ndots Problem
By default, Kubernetes pods have ndots:5 in their DNS config. Because api.example.com contains fewer than five dots, the resolver first tries it against every domain in the search path — typically three or more failed lookups before the name is queried as written, adding 50-100ms to external calls.
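The expansion can be sketched with the standard three-entry cluster search path (each candidate below is a real DNS round trip, and all but the last return NXDOMAIN):

```shell
name="api.example.com"   # 2 dots, fewer than ndots:5, so the search path is tried first

for domain in production.svc.cluster.local svc.cluster.local cluster.local; do
  echo "query: ${name}.${domain}   (NXDOMAIN)"
done
echo "query: ${name}   (finally, the answer)"
```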
# Fix in pod spec
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"                   # Reduce from 5 to 2
    - name: single-request-reopen  # Flag-style option (no value); avoids a conntrack race condition
# Verify current DNS config inside a pod
kubectl exec -it debug-pod -- cat /etc/resolv.conf
# search production.svc.cluster.local svc.cluster.local cluster.local
# nameserver 10.96.0.10
# options ndots:5 ← This is the problem for external lookups
Pod Startup Latency Analysis
Slow pod starts can mean slower deployments, slower scaling, and worse user experience during rollouts.
# Measure pod startup time (from PodScheduled to Ready)
kubectl get pods -n production -o json | \
jq -r '.items[] |
.metadata.name + " " +
(.status.conditions[] | select(.type=="PodScheduled") | .lastTransitionTime) + " " +
(.status.conditions[] | select(.type=="Ready") | .lastTransitionTime)'
# Common bottlenecks and fixes:
# 1. Image pull time → Use pre-pulled images, smaller base images, or image caches (Spegel)
# 2. Readiness probe delay → Reduce initialDelaySeconds
# 3. Init containers → Parallelize where possible (K8s 1.29+ sidecar containers)
# 4. Scheduler latency → Tune percentageOfNodesToScore
# 5. Volume attachment → Use regional PVs, faster storage class
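Turning the two timestamps into a duration is a one-liner with GNU date. A sketch using hypothetical sample timestamps in place of the jq output:

```shell
# Hypothetical sample timestamps (PodScheduled and Ready transition times)
scheduled="2024-01-01T00:00:00Z"
ready="2024-01-01T00:00:42Z"

# GNU date (Linux); on macOS use gdate from coreutils
echo "startup: $(( $(date -d "$ready" +%s) - $(date -d "$scheduled" +%s) ))s"
```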
Network Performance
CNI Selection Impact
| CNI | Throughput | Latency | Overhead | Best For |
|---|---|---|---|---|
| Cilium (eBPF) | Very High | Very Low | Minimal | Performance-critical, security |
| Calico (eBPF mode) | High | Low | Low | General purpose, network policy |
| Calico (iptables) | Medium | Medium | Medium | Compatibility, small clusters |
| Flannel (VXLAN) | Medium | Medium | Medium | Simplicity, small clusters |
| AWS VPC CNI | Very High | Very Low | None | AWS-native, direct pod networking |
# Check current CNI and mode
kubectl get pods -n kube-system | grep -E "cilium|calico|flannel|weave"
# For Cilium, check if eBPF is active
kubectl exec -n kube-system ds/cilium -- cilium status | grep KubeProxyReplacement
# MTU check — mismatched MTU causes packet fragmentation and retransmits
kubectl exec -it debug-pod -- ip link show eth0
# MTU should match your network (typically 9001 on AWS with jumbo frames, 1500 otherwise)
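Overlay CNIs must subtract their encapsulation overhead from the interface MTU, which is why pod MTUs often look "wrong" at first glance. A quick sanity check for VXLAN's 50-byte header:

```shell
link_mtu=1500       # underlying network MTU
vxlan_overhead=50   # outer IP + UDP + VXLAN headers

# This is why Flannel VXLAN pods show MTU 1450 on a 1500-byte network
echo "pod MTU with VXLAN: $(( link_mtu - vxlan_overhead ))"
```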
Storage Performance
Disk I/O is often the silent killer. A database pod on a slow volume will bottleneck your entire application.
# Check storage class and provisioner
kubectl get storageclass
# Example: Upgrade from gp2 to gp3 on AWS (~20% cheaper per GB, 3000 baseline IOPS at any size)
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"        # Baseline: 3000, max: 16000
  throughput: "250"   # Baseline: 125 MB/s, max: 1000 MB/s
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
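The baseline difference matters most for small volumes, since gp2 IOPS scale with size (3 IOPS per provisioned GiB, minimum 100) while gp3's 3,000 baseline is size-independent. A quick comparison:

```shell
size_gib=100

gp2_iops=$(( size_gib * 3 ))   # gp2: 3 IOPS per provisioned GiB (min 100)
gp3_iops=3000                  # gp3: baseline regardless of size

echo "gp2 ${size_gib}GiB: ${gp2_iops} IOPS, gp3: ${gp3_iops} IOPS"
```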
# Benchmark storage inside a pod
kubectl exec -it storage-test -- fio \
--name=randwrite --ioengine=libaio --iodepth=32 \
--rw=randwrite --bs=4k --size=1G --numjobs=4 \
--runtime=30 --group_reporting --directory=/data
Tuning Summary
| Component | Metric to Watch | Default | Tuned | Impact |
|---|---|---|---|---|
| etcd DB size | etcd_mvcc_db_total_size_in_bytes | Grows unbounded | Defrag monthly | Prevents OOM, slow queries |
| API server inflight | apiserver_current_inflight_requests | 400/200 | 800/400 | Prevents 429 throttling |
| Scheduler latency | scheduler_scheduling_duration_seconds | Scores 50% nodes | Score 30% | 40% faster scheduling |
| Kubelet image pulls | kubelet_runtime_operations_duration_seconds | Serial pulls | Parallel pulls | 3-5x faster pod startup |
| CoreDNS cache | coredns_cache_hits_total | 30s TTL | 300s TTL | 80% fewer DNS queries |
| ndots | N/A (application latency) | 5 | 2 | 50-100ms saved per external call |
Performance tuning is not a one-time project. It is an ongoing practice of measuring, identifying bottlenecks, tuning, and measuring again. Start with the biggest pain points — usually etcd health and CoreDNS latency — and work outward. Every millisecond you save in the platform is a millisecond saved on every request your application serves.
