Top 50 DevOps Interview Questions — From Junior to Senior
Whether you are preparing for your first DevOps role or interviewing for a Staff SRE position, these 50 questions cover the full spectrum. Each answer is concise enough to deliver in an interview, but detailed enough to demonstrate real understanding.
Junior Level (Questions 1-15)
These cover fundamentals every DevOps candidate should know cold.
CI/CD and Git
1. What is CI/CD and why does it matter?
CI (Continuous Integration) automatically builds and tests code on every commit. CD (Continuous Delivery/Deployment) automatically delivers that code to staging or production. It matters because it shortens feedback loops — developers learn about broken code in minutes instead of weeks.
2. What is the difference between Continuous Delivery and Continuous Deployment?
Continuous Delivery means every change is kept in a deployable state and can be released to production at any time, but the release itself requires manual approval. Continuous Deployment means every change that passes the automated tests is deployed to production automatically. Many organizations start with Delivery and move to Deployment once they trust their test suite.
3. Explain the difference between git merge and git rebase.
# Merge: Creates a merge commit, preserves branch history
git checkout main && git merge feature-branch
# History: A──B──C─────M  (M = merge commit)
#             └──D──E──┘
# Rebase: Replays commits on top of target, linear history
git checkout feature-branch && git rebase main
# History: A──B──C──D'──E' (clean linear history)
Merge is safer for shared branches. Rebase creates cleaner history but should never be used on commits already pushed to shared branches.
4. What is a Git branching strategy? Name two common ones.
A branching strategy defines how teams use branches for development. Two common ones: GitFlow (develop, feature, release, hotfix branches — good for scheduled releases) and Trunk-Based Development (everyone commits to main with short-lived feature branches — good for continuous deployment).
5. What is a webhook in CI/CD?
A webhook is an HTTP callback triggered by an event. In CI/CD, your Git provider sends a POST request to your CI server when code is pushed, triggering a pipeline automatically. Example: GitHub sends a webhook to Jenkins when a PR is opened.
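Because a webhook endpoint is reachable by anyone who knows the URL, receivers usually verify that a delivery really came from the Git provider. The sketch below assumes the GitHub convention: an HMAC-SHA256 of the raw request body, sent in the X-Hub-Signature-256 header as sha256=<hex digest>. The function names are hypothetical, and other providers use different headers and schemes.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Return the header value a sender would attach to this payload."""
    digest = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_valid_signature(secret: bytes, payload: bytes, header_value: str) -> bool:
    """Check a received signature header against the raw request body."""
    expected = sign_payload(secret, payload)
    # compare_digest runs in constant time, so it does not leak
    # how many leading characters of the signature matched.
    return hmac.compare_digest(expected, header_value)
```

A CI server would call is_valid_signature with the shared secret configured on both sides and reject the request before triggering any pipeline if it returns False.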
Linux and Networking
6. How do you check disk space, memory usage, and running processes in Linux?
df -h # Disk space (human-readable)
free -m # Memory usage in MB
top # Interactive process viewer
htop # Better interactive process viewer
ps aux | head -20 # Snapshot of running processes
7. What is the difference between TCP and UDP?
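TCP is connection-oriented: it opens each connection with a three-way handshake, guarantees ordered delivery, and retransmits lost packets (reliable, but with more overhead). UDP is connectionless and guarantees neither delivery nor ordering, which makes it lighter and faster. HTTP/1.1 and HTTP/2 run over TCP, while HTTP/3 runs over QUIC, which is built on UDP. DNS queries and video streaming often use UDP.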
8. Explain the purpose of ports 22, 80, 443, and 5432.
Port 22 — SSH (secure remote access)
Port 80 — HTTP (unencrypted web traffic)
Port 443 — HTTPS (encrypted web traffic)
Port 5432 — PostgreSQL database
Docker Basics
9. What is the difference between a Docker image and a container?
An image is a read-only template (like a class in OOP). A container is a running instance of an image (like an object). You can create many containers from one image. Images are built from Dockerfiles and stored in registries.
10. What does each line in this Dockerfile do?
# Base image (Node.js 20 on Alpine Linux)
FROM node:20-alpine
# Set working directory inside container
WORKDIR /app
# Copy dependency manifests first (layer caching)
COPY package*.json ./
# Install production dependencies (cached if the manifests are unchanged)
RUN npm ci --omit=dev
# Copy application source code
COPY . .
# Document that the app listens on port 3000
EXPOSE 3000
# Default command when container starts
CMD ["node", "server.js"]
11. What is the difference between CMD and ENTRYPOINT?
CMD provides the default command or default arguments, and anything passed after the image name in docker run replaces it. ENTRYPOINT defines the executable and is replaced only with the --entrypoint flag. Best practice: use ENTRYPOINT for the command and CMD for default arguments: ENTRYPOINT ["python"] + CMD ["app.py"].
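The interaction is easier to remember as a small rule. The following is a toy model, not Docker's implementation: it covers only the exec (JSON array) form, and effective_command and run_args are hypothetical names, with run_args standing for whatever follows the image name in docker run.

```python
from typing import List, Optional


def effective_command(
    entrypoint: Optional[List[str]],
    cmd: Optional[List[str]],
    run_args: Optional[List[str]] = None,
) -> List[str]:
    """Model the argv a container starts with (exec form only)."""
    # Arguments supplied at run time replace CMD entirely.
    tail = run_args if run_args else (cmd or [])
    # ENTRYPOINT, when set, always comes first.
    return (entrypoint or []) + tail
```

With ENTRYPOINT ["python"] and CMD ["app.py"], the model gives ["python", "app.py"] by default and ["python", "other.py"] when other.py is passed at run time. The --entrypoint flag is deliberately left out of the sketch.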
YAML and Configuration
12. What is YAML and why is it used in DevOps?
YAML (YAML Ain't Markup Language) is a human-readable data serialization format. It is used in DevOps because it is readable, supports complex data structures, and is the standard for Kubernetes manifests, Docker Compose files, CI/CD pipelines, and Ansible playbooks.
13. Spot the error in this YAML:
# BROKEN:
services:
  web:
    image: nginx
    ports:
      - 80:80
     environment:   # ERROR: wrong indentation (extra space)
      - PORT=3000
# FIXED:
services:
  web:
    image: nginx
    ports:
      - 80:80
    environment:    # Correct: aligned with 'ports'
      - PORT=3000
14. What is the difference between environment variables and config files?
Environment variables are dynamic, set at runtime, and good for secrets and environment-specific values (DB host, API keys). Config files are static, version-controlled, and good for application settings. In 12-factor apps, environment-specific config goes in env vars; application logic config goes in files.
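One common way to combine the two is to keep defaults in a version-controlled file and let environment variables override them per environment. The sketch below is only an illustration: the APP_ prefix, the setting names, and load_settings are all assumptions, and the defaults are inlined as a dict where a real application would read a config file.

```python
import os
from typing import Dict, Mapping

# Defaults that would normally live in a version-controlled config file.
DEFAULTS: Dict[str, str] = {
    "log_level": "info",
    "page_size": "50",
    "db_host": "localhost",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Start from file defaults, then apply APP_* environment overrides."""
    settings = dict(DEFAULTS)
    for key in DEFAULTS:
        env_name = f"APP_{key.upper()}"
        if env_name in env:
            settings[key] = env[env_name]
    return settings
```

Setting APP_DB_HOST=db.internal in production changes only db_host; every other value still comes from the defaults, and unrelated environment variables are ignored.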
15. What is Infrastructure as Code? Name two tools.
IaC manages infrastructure through machine-readable definition files instead of manual processes. Two tools: Terraform (cloud-agnostic, declarative, state-based) and Ansible (procedural/declarative hybrid, agentless, SSH-based). IaC enables version control, peer review, and reproducibility for infrastructure.
Mid-Level (Questions 16-35)
These test deeper understanding and practical experience.
Kubernetes
16. Explain the difference between a Deployment, StatefulSet, and DaemonSet.
Deployment: Stateless apps. Pods are interchangeable. Rolling updates.
  Example: Web API servers.
StatefulSet: Stateful apps. Each pod gets a stable hostname and its own persistent volume. Ordered startup/shutdown.
  Example: Stateful clusters (Postgres, Kafka).
DaemonSet: Runs one pod per node. Used for node-level services.
  Example: Log collectors (Fluentd), monitoring agents.
17. What is a Kubernetes Service, and what are the main types?
A Service provides a stable network endpoint for a set of Pods. ClusterIP (default): internal-only access within the cluster. NodePort: exposes the Service on each node's IP at a static port. LoadBalancer: provisions an external cloud load balancer. A fourth type, ExternalName, maps the Service to an external DNS name instead of to Pods.
18. How does a Kubernetes liveness probe differ from a readiness probe?
Liveness probe: "Is the container alive?" If it fails, Kubernetes restarts the container. Readiness probe: "Is the container ready to receive traffic?" If it fails, the pod is removed from the Service endpoint but not restarted. Use liveness for deadlock detection, readiness for startup dependencies.
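On the application side, the two probes are usually backed by two separate HTTP endpoints. The sketch below shows one possible shape using only the standard library. The /healthz and /readyz paths are a common convention rather than anything Kubernetes requires, and AppState, probe_status, and ProbeHandler are hypothetical names.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler


@dataclass
class AppState:
    alive: bool = True   # False once the process is stuck and needs a restart
    ready: bool = False  # True once dependencies (DB, cache) are reachable


def probe_status(path: str, state: AppState) -> int:
    """Map a probe path to the HTTP status code the kubelet would see."""
    if path == "/healthz":  # liveness: a failure leads to a container restart
        return 200 if state.alive else 500
    if path == "/readyz":   # readiness: a failure removes the pod from endpoints
        return 200 if state.ready else 503
    return 404


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Serve both probe endpoints from the shared application state."""

    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()
```

ProbeHandler can be served with http.server.HTTPServer on whatever port the probes in the pod spec point at. A freshly started pod answers 200 on /healthz and 503 on /readyz until the application flips STATE.ready, which is the "alive but not yet taking traffic" state described above.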
19. Explain Kubernetes RBAC.
RBAC (Role-Based Access Control) grants permissions through Roles (what is allowed) and RoleBindings (who gets the Role). ClusterRole and ClusterRoleBinding are the cluster-wide equivalents.
# Role: Defines what actions are allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# RoleBinding: Assigns the role to a user/group
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: User
  name: jane@company.com
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Terraform and IaC
20. What is Terraform state, and why is remote state important?
Terraform state is a JSON file that maps your configuration to real infrastructure. Remote state (S3 + DynamoDB, Terraform Cloud) is important because it enables team collaboration (shared source of truth), prevents concurrent modifications (state locking), and keeps sensitive data off developer laptops.
21. What is the difference between terraform plan and terraform apply?
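plan shows what changes Terraform would make without making them (a dry run). apply executes the changes. Always run plan first, require human approval, then apply. In CI/CD: plan on PR, apply on merge to main. Saving the plan with terraform plan -out=tfplan and then running terraform apply tfplan ensures that exactly the reviewed changes are applied.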
22. How do you handle secrets in Terraform?
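Never hardcode secrets in .tf files or commit them to version control. Instead: mark variables with sensitive = true (this masks CLI output only), read secrets from Vault or AWS Secrets Manager using data sources, or pass them through environment variables (TF_VAR_db_password). Any secret Terraform handles can still end up in the state file in plaintext, so encrypt the state backend (S3 SSE, Terraform Cloud) and restrict who can read it.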
Monitoring and Observability
23. What is the difference between monitoring and observability?
Monitoring answers "Is the system working?" (known-unknowns — you set up alerts for anticipated failures). Observability answers "Why is the system broken?" (unknown-unknowns — you can investigate novel failures using metrics, logs, and traces without deploying new code).
24. Explain the three pillars of observability.
Metrics: Numeric measurements over time (CPU usage, request count, latency percentiles). Logs: Timestamped event records with context (structured JSON logs). Traces: End-to-end request paths across services (distributed tracing with trace IDs). Together they let you detect, diagnose, and debug most issues.
25. What are SLIs, SLOs, and SLAs?
SLI (Service Level Indicator):
The measurement. Example: "99.2% of requests completed in < 200ms"
SLO (Service Level Objective):
The target. Example: "99.5% of requests must complete in < 200ms"
SLA (Service Level Agreement):
The contract. Example: "If we drop below 99.0%, customer gets credits"
Relationship: SLI measures → SLO targets → SLA contracts
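The SLI/SLO relationship comes down to a measurement and a comparison. Here is a minimal sketch using the numbers from the examples above. latency_sli and meets_slo are hypothetical helpers, and a real system would compute the SLI from its metrics backend rather than from a list in memory.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 200.0) -> float:
    """Percentage of requests that completed under the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100.0 * good / len(latencies_ms)


def meets_slo(sli_percent: float, slo_percent: float = 99.5) -> bool:
    """The SLO is met when the measured SLI reaches the target."""
    return sli_percent >= slo_percent
```

For 1,000 requests of which 992 finish under 200ms, the SLI is 99.2%. That misses the 99.5% SLO but still clears the 99.0% SLA floor, which is the intended buffer between the internal target and the contract.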
Incident Management and Security
26. Describe a blameless post-mortem process.
After an incident: 1) Establish timeline of events. 2) Identify contributing factors (not root cause — it is rarely one thing). 3) Focus on how the system allowed the failure, not who caused it. 4) Define actionable remediation items with owners and deadlines. 5) Share learnings widely. The goal is organizational learning, not punishment.
27. What is shift-left security?
Moving security testing earlier (leftward) in the development lifecycle. Instead of a security review before production deploy, you: run SAST in the IDE, scan dependencies in CI, check container images for CVEs automatically, and enforce policies with OPA/Kyverno. Security becomes everyone's responsibility, not a gate.
28. What is the principle of least privilege?
Grant only the minimum permissions necessary for a task. In practice: use IAM roles with scoped policies instead of admin access, create short-lived credentials (STS, OIDC), scope Kubernetes RBAC to specific namespaces, and audit permissions regularly.
Networking and Cloud
29. What is a load balancer and what are Layer 4 vs Layer 7?
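A load balancer distributes incoming traffic across multiple backend servers so that no single server becomes a bottleneck or a single point of failure.
Layer 4 (Transport):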
Routes based on IP + port.
Fast, protocol-agnostic.
Example: AWS NLB, HAProxy TCP mode.
Layer 7 (Application):
Routes based on HTTP headers, paths, cookies.
Can inspect/modify requests.
Example: AWS ALB, Nginx, Envoy.
Use L4 for: TCP/UDP services, maximum performance.
Use L7 for: HTTP routing, SSL termination, path-based routing.
30. Explain the difference between public and private subnets in a VPC.
Public subnets have a route to an Internet Gateway, so resources with public IPs can be reached from the internet. Private subnets have no Internet Gateway route, so resources cannot be reached directly from the internet and can make outbound connections only through a NAT Gateway. Databases and application servers go in private subnets; load balancers go in public subnets.
31. What is a reverse proxy?
A reverse proxy sits in front of backend servers, receives client requests, and forwards them. Benefits: SSL termination, load balancing, caching, rate limiting, and hiding backend topology. Examples: Nginx, Envoy, Traefik. Unlike a forward proxy (which sits in front of clients), a reverse proxy protects servers.
32. What are Blue-Green and Canary deployments?
Blue-Green:
Run two identical environments.
Route 100% traffic to "blue" (current).
Deploy new version to "green."
Switch all traffic to "green" instantly.
Rollback = switch back to "blue."
Canary:
Deploy new version to a small subset (5% of traffic).
Monitor metrics (error rate, latency).
Gradually increase (25% → 50% → 100%).
Rollback = route 100% back to old version.
Blue-Green: Faster rollout, higher resource cost.
Canary: Slower rollout, lower risk.
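The canary progression can be expressed as a small decision function that a rollout controller would call after each observation window. This is a sketch under assumptions: the steps follow the 5% → 25% → 50% → 100% sequence above, while the 1% error-rate threshold and the name next_canary_weight are invented for illustration.

```python
from typing import Sequence

CANARY_STEPS = (5, 25, 50, 100)  # percent of traffic sent to the new version


def next_canary_weight(
    current: int,
    error_rate: float,
    max_error_rate: float = 0.01,  # assumed threshold: 1% of requests failing
    steps: Sequence[int] = CANARY_STEPS,
) -> int:
    """Return the next traffic percentage for the new version; 0 means roll back."""
    if error_rate > max_error_rate:
        return 0  # route 100% back to the old version
    for step in steps:
        if step > current:
            return step
    return current  # already at full rollout
```

A healthy canary at 5% moves to 25%, while one showing a 5% error rate at any stage drops to 0. Real rollouts would usually compare against the old version's baseline and look at latency as well.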
33. What is GitOps?
GitOps uses Git as the single source of truth for infrastructure and application state. A controller (ArgoCD, Flux) continuously reconciles cluster state with the desired state in Git. Benefits: audit trail (git log), rollback (git revert), access control (PR reviews), and consistency.
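The reconciliation idea can be illustrated with a toy diff between desired and actual state. This is not how ArgoCD or Flux are implemented. It models resources as a name-to-version mapping, and plan_reconcile is a hypothetical name.

```python
from typing import Dict, List, Tuple


def plan_reconcile(
    desired: Dict[str, str], actual: Dict[str, str]
) -> List[Tuple[str, str]]:
    """List the actions needed to make the cluster match what Git declares."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    # Anything running that Git no longer declares is drift to prune.
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions
```

A controller runs a loop like this continuously, so a manual change to the cluster shows up as drift on the next pass and is reverted to what Git declares. An empty action list means the cluster is in sync.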
34. Explain the 12-Factor App methodology (key factors).
Key factors for DevOps interviews:
III. Config: Store config in environment variables
IV. Backing services: Treat databases and queues as attached resources
VI. Processes: Stateless, share-nothing processes
VII. Port binding: Export services via port binding
IX. Disposability: Fast startup, graceful shutdown
XI. Logs: Treat logs as event streams (stdout)
35. What is a service mesh? When do you need one?
A service mesh (Istio, Linkerd) handles service-to-service communication: mTLS encryption, traffic management, retries, circuit breaking, and observability. A mesh tends to pay off once you run dozens of microservices (20+ is a common rule of thumb) and need consistent security/observability without modifying application code. You probably do not need one for 5 services, where it adds significant complexity for little benefit.
Senior Level (Questions 36-50)
These test architecture thinking, leadership, and deep expertise.
36. How would you design a CI/CD platform for 50 engineering teams?
Shared platform with self-service: centralized pipeline templates (reusable workflows), golden paths for common languages, self-service environment provisioning, centralized secrets management (Vault), metrics dashboard (DORA), and guardrails (not gates). Think platform-as-a-product — your users are developers.
37. Explain error budgets and how they affect release velocity.
SLO: 99.9% availability over a 30-day month (43,200 min)
Error Budget: 0.1% = 43.2 min/month of allowed unreliability
Budget remaining → Ship features aggressively
Budget exhausted → Freeze features, focus on reliability
This aligns incentives:
Developers want to ship fast → they care about reliability
because burned budget = frozen releases
SREs want stability → they allow risk
because unspent budget = wasted capacity
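The arithmetic behind the budget is worth being able to reproduce on a whiteboard. A minimal sketch, assuming a 30-day window and availability measured in minutes of downtime; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def can_ship_features(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> bool:
    """Budget remaining means ship; budget exhausted means freeze."""
    return budget_remaining(slo_percent, downtime_minutes, window_days) > 0
```

For a 99.9% SLO the budget is about 43.2 minutes per 30 days. After a 30-minute outage roughly 13 minutes remain, and a second outage of similar size would exhaust the budget and trigger the feature freeze.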
38. How do you handle database schema migrations in a CI/CD pipeline?
Use the expand-and-contract pattern: 1) Add the new column (expand) and deploy code that writes to both old and new. 2) Backfill data. 3) Deploy code that reads from new only. 4) Drop the old column (contract). Never make breaking schema changes in one deploy. Tools: Flyway, Liquibase, golang-migrate. Run migrations as a pre-deploy step, never during application startup.
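The four steps can be walked through end to end against an in-memory SQLite database. This is only a compressed illustration: in production each step is a separate migration and deploy, often days apart. The users table and the name/full_name columns are invented for the example, and the contract step rebuilds the table so that it works on SQLite versions without DROP COLUMN.

```python
import sqlite3


def run_expand_contract(conn: sqlite3.Connection) -> None:
    """Rename users.name to users.full_name without a breaking change."""
    cur = conn.cursor()
    # Starting point: the old schema, with one row written by old code.
    cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

    # 1) Expand: add the new column; old code keeps working untouched.
    cur.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    # Transitional code writes to both the old and the new column.
    cur.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)",
        ("Grace Hopper", "Grace Hopper"),
    )

    # 2) Backfill rows that were written before the expand step.
    cur.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

    # 3) Code now reads from full_name only (no schema change needed).

    # 4) Contract: remove the old column by rebuilding the table.
    cur.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    cur.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    cur.execute("DROP TABLE users")
    cur.execute("ALTER TABLE users_new RENAME TO users")
    conn.commit()
```

At every point between steps, both the previous and the next version of the application can run against the schema, which is what makes a rollback safe.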
39. What is platform engineering, and how does it differ from DevOps?
Platform engineering builds internal developer platforms (IDPs) that abstract infrastructure complexity. DevOps is a culture and set of practices. Platform engineering is the productization of those practices into self-service tools. The platform team treats developers as customers and builds golden paths for common workflows.
40. Describe your approach to chaos engineering.
Start with a steady-state hypothesis ("API latency p99 stays below 200ms"). Introduce a controlled failure (kill a pod, inject network latency, simulate AZ failure). Observe whether the system maintains steady state. If it does not, you have found a weakness before a real outage exposed it. Start in staging, graduate to production. Tools: Litmus, Chaos Monkey, Gremlin.
41. How do you scale Kubernetes clusters for cost efficiency?
Cluster autoscaler for nodes, HPA for pods (CPU/memory/custom metrics), VPA for right-sizing requests (avoid letting HPA and VPA act on the same CPU or memory metric for the same workload). Use spot/preemptible instances for stateless workloads. Implement pod disruption budgets. Use Karpenter for faster, smarter node provisioning. Monitor with Kubecost. Set resource requests accurately: overprovisioning wastes money, while underprovisioning causes OOMKills.
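Interviewers sometimes ask how the HPA decides on a replica count. The core of the documented algorithm is a ratio: desired = ceil(current replicas × current metric / target metric). The sketch below implements only that ratio plus min/max clamping. The real controller also applies a tolerance band and stabilization windows, which are omitted here, and the parameter names are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to metric / target, then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas averaging 90% CPU against a 60% target, the result is 6 replicas. At 30% CPU the same deployment shrinks to 2, which is where accurate resource requests matter, because the utilization percentage is computed relative to them.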
42. Explain the CAP theorem and its relevance to distributed systems.
CAP: You can have at most 2 of 3:
C — Consistency (all nodes see same data)
A — Availability (every request gets a response)
P — Partition tolerance (system works despite network splits)
Partitions WILL happen, so you choose:
CP: Consistent but may reject requests (etcd, ZooKeeper)
AP: Available but may return stale data (Cassandra, DynamoDB)
DevOps relevance: This affects your architecture choices,
monitoring approach, and incident response procedures.
43. How do you implement zero-downtime deployments?
Rolling updates with readiness probes (K8s default). Pre-stop lifecycle hooks for graceful shutdown. Connection draining on load balancers. Database migrations using expand/contract. Feature flags for gradual rollout. Health check grace periods. Pod disruption budgets to maintain minimum availability during updates.
44. What is your approach to managing technical debt in infrastructure?
Track debt explicitly (tag tickets, maintain a debt register). Allocate 15-20% of sprint capacity to debt reduction. Prioritize by blast radius and operational cost. Use "boy scout rule" — leave infra better than you found it. Automate the repetitive parts first (highest ROI). Make debt visible to leadership with metrics (toil hours, incident count from debt).
45. How would you migrate a monolith to microservices?
Strangler Fig pattern: do not rewrite. Identify bounded contexts using domain-driven design. Extract one service at a time, starting with the most independent component. Use an API gateway to route between monolith and new services. Share nothing — each service owns its data. Accept that some services will stay in the monolith (and that is fine).
46. Explain SRE principles and how they relate to DevOps.
SRE Principles:
1. Embrace risk — Use error budgets, not zero-defect
2. SLOs over SLAs — Internal targets drive behavior
3. Eliminate toil — Automate repetitive operational work
4. Monitor meaningfully — Symptoms over causes
5. Release engineering — Safe, fast, repeatable deploys
6. Simplicity — Boring technology advantage
SRE is "DevOps with opinions" — it provides concrete practices
(error budgets, SLOs, toil budgets) for DevOps principles.
47. How do you handle multi-tenancy in Kubernetes?
Namespace isolation with network policies and RBAC. Resource quotas per namespace (prevent noisy neighbors). Pod security standards (restricted). Separate node pools for sensitive tenants. Use Hierarchical Namespaces for organization. For hard multi-tenancy (untrusted tenants), consider vCluster or separate clusters.
48. What is your strategy for secrets rotation?
Automate rotation with zero downtime: use Vault dynamic secrets (TTL-based, auto-rotated), AWS Secrets Manager auto-rotation with Lambda, or the Kubernetes External Secrets Operator for sync. Applications must handle credential refresh without a restart. Test rotation in staging regularly. Alert on rotation failures, not just expiry.
49. How do you evaluate and adopt new tools?
Evaluation framework:
1. Problem definition — What specific problem does this solve?
2. Build vs buy — Can we solve it with existing tools?
3. Proof of concept — 2-week timeboxed POC with real workload
4. Operational cost — Who maintains it? What is the learning curve?
5. Community + longevity— Is it CNCF? Active contributors? Bus factor?
6. Migration path — What is the exit strategy if it fails?
7. Team buy-in — Will engineers actually use it?
Red flags: "It was on Hacker News," "It replaces everything we have,"
"The vendor demo looked amazing."
50. How do you build a DevOps culture in a resistant organization?
Start small — find one willing team, deliver one measurable win (deploy time cut by 50%). Make it visible. Let success create demand. Never mandate tools — offer better alternatives. Invest in internal champions across teams. Measure with DORA metrics to show progress objectively. Address management fears (security, control) with data. Remember: culture change is a marathon, not a sprint.
Closing Note
The best DevOps interviews are conversations, not interrogations. Knowing the "textbook answer" is the minimum — interviewers want to hear about your real experience, your failures, and what you learned. For every question above, prepare a concrete example from your own work. The candidate who says "We had this exact problem, and here is what we did" will always beat the one who recites definitions. Good luck.
