Top 50 AWS Interview Questions for DevOps & Solutions Architect Roles
Walking into an AWS interview without preparation is like deploying to production without testing — technically possible, but you're going to have a bad time. These 50 questions cover the breadth of what interviewers actually ask, organized from fundamentals to architecture-level thinking. Each answer is concise and focused on what matters — the concept, the "why," and the key commands or configurations you should know.
Beginner (Questions 1-15)
Q1: What is the difference between IAM Users, Groups, Roles, and Policies?
Users are identities for people or services with long-term credentials. Groups are collections of users for easier policy management. Roles are assumed temporarily by users, services, or other AWS accounts — no permanent credentials. Policies are JSON documents that define permissions (Allow/Deny on Actions for Resources).
# List all IAM users
aws iam list-users --query 'Users[].UserName'
# See what policies are attached to a role
aws iam list-attached-role-policies --role-name MyRole
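To make the policy concept concrete, here is a minimal Python sketch of an identity policy document and a heavily simplified evaluator. It assumes only two rules: an explicit Deny overrides any Allow, and anything not explicitly allowed is implicitly denied. Real IAM evaluation also considers conditions, resource policies, permissions boundaries, and SCPs, and matches action names case-insensitively. The bucket name and the `is_allowed` helper are hypothetical.

```python
import json
from fnmatch import fnmatchcase

# Hypothetical policy: read access to one bucket, but never delete objects.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
        },
        {
            "Effect": "Deny",
            "Action": "s3:DeleteObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def is_allowed(policy_doc, action, resource):
    """Simplified check: explicit Deny wins, otherwise an Allow is required."""
    allowed = False
    for stmt in policy_doc["Statement"]:
        action_match = any(fnmatchcase(action, a) for a in _as_list(stmt["Action"]))
        resource_match = any(fnmatchcase(resource, r) for r in _as_list(stmt["Resource"]))
        if action_match and resource_match:
            if stmt["Effect"] == "Deny":
                return False  # explicit deny overrides any allow
            allowed = True
    return allowed  # False here means implicit deny


obj = "arn:aws:s3:::example-bucket/report.csv"
print(json.dumps(policy, indent=2))
print(is_allowed(policy, "s3:GetObject", obj))     # allowed by the first statement
print(is_allowed(policy, "s3:DeleteObject", obj))  # blocked by the explicit Deny
print(is_allowed(policy, "s3:PutObject", obj))     # no matching Allow: implicit deny
```

In an interview, the three cases (Allow, explicit Deny, implicit deny) are the part worth saying out loud.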
Q2: What are the S3 storage classes, and when would you use each?
| Storage Class | Use Case | Availability | Min Duration |
|---|---|---|---|
| S3 Standard | Frequently accessed data | 99.99% | None |
| S3 Intelligent-Tiering | Unknown/changing access patterns | 99.9% | None |
| S3 Standard-IA | Infrequent but needs fast access | 99.9% | 30 days |
| S3 One Zone-IA | Non-critical infrequent data | 99.5% | 30 days |
| S3 Glacier Instant Retrieval | Archive with millisecond access | 99.9% | 90 days |
| S3 Glacier Flexible Retrieval | Archive, minutes-to-hours retrieval | 99.99% | 90 days |
| S3 Glacier Deep Archive | Long-term archive, 12-hour retrieval | 99.99% | 180 days |
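The "Min Duration" column matters for cost: deleting or transitioning an object early still incurs the charge for the full minimum period. A small sketch of that arithmetic, using the durations from the table; the `billed_days` helper is hypothetical and ignores per-GB prices and retrieval fees.

```python
# Minimum storage durations in days, taken from the table above.
# Keys are the S3 API storage class names.
MIN_DURATION_DAYS = {
    "STANDARD": 0,
    "INTELLIGENT_TIERING": 0,
    "STANDARD_IA": 30,
    "ONEZONE_IA": 30,
    "GLACIER_IR": 90,    # Glacier Instant Retrieval
    "GLACIER": 90,       # Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 180,
}


def billed_days(storage_class, days_stored):
    """Days of storage you pay for: the actual time or the minimum, whichever is longer."""
    return max(days_stored, MIN_DURATION_DAYS[storage_class])


print(billed_days("STANDARD_IA", 10))    # deleted after 10 days, billed for 30
print(billed_days("DEEP_ARCHIVE", 200))  # past the minimum, billed for 200
```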
Q3: Explain the difference between Security Groups and NACLs.
Security Groups are stateful firewalls at the instance/ENI level — if you allow inbound traffic, the response is automatically allowed. NACLs are stateless firewalls at the subnet level — you must explicitly allow both inbound and outbound. Security Groups only have Allow rules; NACLs have both Allow and Deny.
Q4: What is a VPC, and what are its core components?
A VPC is your isolated virtual network in AWS. Core components: CIDR block (IP range), subnets (public/private per AZ), Internet Gateway (public access), NAT Gateway (outbound for private subnets), Route Tables (traffic routing rules), Security Groups, and NACLs.
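As an illustration of how these pieces fit together, the following sketch carves a VPC CIDR into one public and one private subnet per AZ and notes where each tier's default route would point. The CIDR, AZ names, and plan layout are assumptions chosen for the example; the only AWS-specific fact used is that five addresses in every subnet are reserved.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")  # hypothetical VPC CIDR
azs = ["us-east-1a", "us-east-1b"]         # hypothetical AZ choice

# Public subnets send 0.0.0.0/0 to the Internet Gateway, private ones to a NAT Gateway.
DEFAULT_ROUTE = {"public": "internet-gateway", "private": "nat-gateway"}
AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one

# Carve consecutive /24 subnets out of the VPC range.
subnets = vpc.subnets(new_prefix=24)
plan = []
for az in azs:
    for tier in ("public", "private"):
        cidr = next(subnets)
        plan.append({
            "az": az,
            "tier": tier,
            "cidr": cidr,
            "default_route": DEFAULT_ROUTE[tier],
            "usable_ips": cidr.num_addresses - AWS_RESERVED_PER_SUBNET,
        })

for entry in plan:
    print(f'{entry["az"]:<11} {entry["tier"]:<8} {entry["cidr"]}  '
          f'0.0.0.0/0 -> {entry["default_route"]}  usable={entry["usable_ips"]}')
```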
Q5: How does an Application Load Balancer (ALB) differ from a Network Load Balancer (NLB)?
ALB operates at Layer 7 (HTTP/HTTPS), supports path-based and host-based routing, WebSockets, and is ideal for microservices. NLB operates at Layer 4 (TCP/UDP/TLS), handles millions of requests per second with ultra-low latency, preserves source IP, and supports static IPs.
Q6: What is the difference between vertical and horizontal scaling on AWS?
Vertical scaling means changing instance size (t3.medium to t3.xlarge): simple, but capped by the largest available instance size, and resizing requires stopping the instance. Horizontal scaling means adding more instances behind a load balancer: it scales much further but requires stateless application design. Auto Scaling Groups handle horizontal scaling automatically.
Q7: What is an AMI, and how do you create one?
An Amazon Machine Image is a template containing the OS, application server, and applications. It's the blueprint for launching EC2 instances.
# --no-reboot avoids downtime, but file system integrity of the image is not guaranteed
aws ec2 create-image \
  --instance-id i-0abc123 \
  --name "my-app-v2.1" \
  --description "App server with latest patches" \
  --no-reboot
Q8: Explain S3 bucket policies vs IAM policies for S3 access.
IAM policies attach to users/roles and control what that identity can do across services. Bucket policies attach to the bucket and control who can access that specific bucket. Use bucket policies for cross-account access and public access rules. Use IAM policies for per-user permissions.
Q9: What is CloudFormation, and what problem does it solve?
CloudFormation is AWS's Infrastructure as Code service. You define resources in YAML/JSON templates, and CloudFormation creates, updates, and deletes them as a stack. It solves environment drift, manual configuration errors, and enables repeatable deployments.
Q10: How do EC2 instance types work (e.g., t3.large, c5.xlarge)?
The letter indicates the family (t = burstable, c = compute, m = general, r = memory, g = GPU). The number is the generation. The size indicates CPU/memory. Example: c5.2xlarge = compute-optimized, 5th generation, 8 vCPUs, 16 GB RAM.
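The naming scheme is regular enough to parse. This sketch splits an instance type into family, generation, optional attribute letters, and size, using the family letters listed above. The `parse_instance_type` helper and its regular expression are illustrative assumptions, not an official grammar, and the sketch does not look up vCPU or memory figures.

```python
import re

# Family letters as described in the answer above.
FAMILIES = {"t": "burstable", "c": "compute", "m": "general", "r": "memory", "g": "GPU"}

# family letters, generation digits, optional attribute letters (the "g" in m6g), ".", size
PATTERN = re.compile(
    r"^(?P<family>[a-z]+?)(?P<generation>\d+)(?P<attrs>[a-z-]*)\.(?P<size>[a-z0-9-]+)$"
)


def parse_instance_type(name):
    """Split an EC2 instance type name into its components."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name}")
    family = match["family"]
    return {
        "family": family,
        "family_label": FAMILIES.get(family, "unknown"),
        "generation": int(match["generation"]),
        "attributes": match["attrs"],
        "size": match["size"],
    }


for name in ("t3.large", "c5.2xlarge", "m6g.xlarge"):
    print(name, parse_instance_type(name))
```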
Q11: What is the Shared Responsibility Model?
AWS manages security of the cloud (hardware, networking, managed services). You manage security in the cloud (OS patching, firewall rules, IAM, data encryption, application code). The boundary shifts based on the service — more managed means less responsibility for you.
Q12: What is Route 53, and what routing policies does it support?
Route 53 is AWS's DNS and domain registration service. Routing policies: Simple (single resource), Weighted (percentage-based), Latency (closest region), Failover (active-passive DR), Geolocation (country-based), Geoproximity (distance with bias), and Multivalue Answer (multiple healthy resources).
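Weighted routing is the policy interviewers most often ask you to reason about numerically: each record receives a share of responses equal to its weight divided by the sum of all weights. The simulation below is a sketch of that rule only; the record names, weights, and fixed random seed are assumptions for the example.

```python
import random
from collections import Counter


def pick_endpoint(weights, rng):
    """Choose one endpoint with probability weight / sum(weights)."""
    endpoints = list(weights)
    return rng.choices(endpoints, weights=[weights[e] for e in endpoints], k=1)[0]


weights = {"blue.example.com": 90, "green.example.com": 10}  # hypothetical records
rng = random.Random(42)  # seeded so repeated runs draw the same sample
total = 10_000
sample = Counter(pick_endpoint(weights, rng) for _ in range(total))

for endpoint, hits in sample.most_common():
    print(f"{endpoint}: {hits / total:.1%}")
```

Shifting the weights step by step (90/10, then 50/50, then 0/100) is the same mechanism used for gradual traffic shifts between two environments.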
Q13: What are the different EBS volume types?
gp3: general-purpose SSD (most workloads). io2: high-performance provisioned-IOPS SSD (databases). st1: throughput-optimized HDD (big data). sc1: cold HDD (infrequent access). gp3 is a sensible default for most workloads.
Q14: What is Elastic Beanstalk?
Elastic Beanstalk is a PaaS that handles deployment, scaling, load balancing, and monitoring. You upload your code, and it provisions the infrastructure. It supports Java, .NET, Node.js, Python, Ruby, Go, Docker. Under the hood, it uses CloudFormation, EC2, ALB, and Auto Scaling.
Q15: What is the AWS Free Tier?
Three types: Always Free (Lambda 1M requests/month, DynamoDB 25 GB), 12-Month Free (750 hours EC2 t2.micro, 5 GB S3), Trials (short-term free trials of specific services). The 12-month clock starts when you create your account.
Intermediate (Questions 16-35)
Q16: How does VPC Peering work, and what are its limitations?
VPC Peering creates a private network connection between two VPCs. Traffic stays on the AWS backbone. Limitations: no transitive peering (A-B and B-C doesn't mean A-C), no overlapping CIDRs, no edge-to-edge routing through gateways.
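Two of these limitations are easy to check in code. The sketch below tests for overlapping CIDRs and models non-transitivity by allowing traffic only over a direct peering connection. The VPC names, CIDRs, and helper functions are hypothetical.

```python
import ipaddress


def can_peer(cidr_a, cidr_b):
    """Peering requires non-overlapping CIDR blocks."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def can_reach(peerings, src, dst):
    """Non-transitive: traffic flows only over a direct peering connection."""
    return frozenset((src, dst)) in peerings


vpcs = {
    "A": "10.0.0.0/16",
    "B": "10.1.0.0/16",
    "C": "10.2.0.0/16",
    "D": "10.0.128.0/17",  # falls inside A's range
}
peerings = {frozenset(("A", "B")), frozenset(("B", "C"))}

print(can_peer(vpcs["A"], vpcs["B"]))  # distinct ranges, peering is possible
print(can_peer(vpcs["A"], vpcs["D"]))  # overlapping ranges, peering is rejected
print(can_reach(peerings, "A", "B"))   # direct peering exists
print(can_reach(peerings, "A", "C"))   # A-B and B-C do not give A-C
```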
Q17: Explain AWS Lambda cold starts and how to mitigate them.
Cold starts occur when Lambda creates a new execution environment — downloading code, initializing the runtime. Mitigation: use Provisioned Concurrency, keep functions warm with scheduled invocations, minimize package size, use lighter runtimes (Python/Node over Java).
aws lambda put-provisioned-concurrency-config \
  --function-name my-function \
  --qualifier prod \
  --provisioned-concurrent-executions 10
Q18: What is the difference between SQS and SNS?
SQS is a message queue (pull-based, each message is processed by one consumer, at-least-once delivery on standard queues and exactly-once processing on FIFO queues). SNS is a pub/sub service (push-based, many subscribers, fan-out pattern). They're often used together: SNS fans out to multiple SQS queues.
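The fan-out pattern can be sketched with two tiny in-memory stand-ins. These classes are not the AWS SDK; `Queue` and `Topic` are hypothetical models that capture only the push-versus-pull difference and the fact that each subscribed queue receives its own copy of a message.

```python
from collections import deque


class Queue:
    """Stand-in for an SQS queue: consumers pull, and each message is consumed once."""

    def __init__(self, name):
        self.name = name
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        """Return the next message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


class Topic:
    """Stand-in for an SNS topic: every published message is pushed to all subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue):
        self._subscribers.append(queue)

    def publish(self, body):
        for queue in self._subscribers:
            queue.send(body)


orders = Topic()
billing, shipping = Queue("billing"), Queue("shipping")
orders.subscribe(billing)
orders.subscribe(shipping)
orders.publish("order-1001 created")

print(billing.receive())   # billing gets its own copy
print(shipping.receive())  # shipping gets its own copy
print(billing.receive())   # None: the billing copy was already consumed
```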
Q19: How does Auto Scaling work with target tracking policies?
Target tracking maintains a specific metric at a target value. Example: "keep average CPU at 50%." ASG adds or removes instances to maintain the target.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration \
  '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":50.0}'
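A useful mental model for target tracking is proportional: if the metric sits at 80% against a 50% target, the group needs roughly 80/50 times its current capacity. The sketch below encodes that estimate. It is an approximation for reasoning in an interview, not the algorithm AWS actually runs, and the `desired_capacity` helper is hypothetical.

```python
import math


def desired_capacity(current_instances, current_metric, target_metric, min_size, max_size):
    """Proportional estimate, clamped to the group's min and max size."""
    estimate = math.ceil(current_instances * current_metric / target_metric)
    return max(min_size, min(max_size, estimate))


# 4 instances at 80% CPU with a 50% target: 4 * 80 / 50 = 6.4, rounded up to 7
print(desired_capacity(4, 80.0, 50.0, min_size=2, max_size=10))

# 4 instances at 20% CPU: 4 * 20 / 50 = 1.6, rounded up to 2
print(desired_capacity(4, 20.0, 50.0, min_size=2, max_size=10))
```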
Q20: What is an ECS Task Definition?
A Task Definition is a blueprint for your container — it specifies the Docker image, CPU/memory, environment variables, port mappings, logging configuration, and IAM task role. Think of it as the "docker-compose.yml" of ECS.
Q21: Explain the difference between RDS Multi-AZ and Read Replicas.
Multi-AZ is for high availability: a standby replica in another AZ with synchronous replication and automatic failover (typically 60-120 seconds). In a Multi-AZ DB instance deployment you can't read from the standby. Read Replicas are for performance: asynchronous replication, you can read from them, useful for reporting queries. You can have both simultaneously.
Q22: What is AWS CloudTrail, and why is it critical?
CloudTrail records API activity in your account: who did what, when, and from where. Management events are logged by default; data events (such as S3 object-level operations) must be enabled explicitly. It's your audit log. Essential for security investigations, compliance, and troubleshooting. Always enable it in all regions and send logs to a centralized S3 bucket.
Q23: How do you troubleshoot an EC2 instance that won't start?
Check: instance state reason (aws ec2 describe-instances, StateReason field), status checks (aws ec2 describe-instance-status), system log (aws ec2 get-console-output), instance limit in the region, subnet has available IPs, EBS volume is available, AMI exists, key pair exists. If the instance is running but unreachable, check that the security group allows the needed traffic.
Q24: What are VPC Endpoints, and when should you use them?
VPC Endpoints let your VPC resources access AWS services privately without going through the internet. Gateway Endpoints (S3, DynamoDB) — free, add a route table entry. Interface Endpoints (everything else) — ENI in your subnet, cost per hour + per GB.
Q25: Explain the difference between ECS and EKS.
ECS is AWS's proprietary container orchestrator: simpler, no control plane cost, AWS-native. EKS is managed Kubernetes: industry standard, portable, massive ecosystem, about $73/month ($0.10/hour) per cluster for the control plane. Choose ECS for simplicity, EKS for Kubernetes compatibility and portability.
Q26: What is AWS Config, and how does it differ from CloudTrail?
CloudTrail records API calls (who did what). Config records resource configurations over time (what changed). Config can evaluate resources against rules and flag non-compliant resources. Example: "All S3 buckets must have encryption enabled."
Q27: How does cross-region replication work in S3?
Enable versioning on both buckets, create a replication rule. Objects are replicated asynchronously. Use it for compliance (data in multiple regions), latency reduction, or disaster recovery. You pay for storage in both regions plus data transfer.
Q28: What is AWS Systems Manager, and what can it do?
Systems Manager is a suite of tools: Session Manager (SSH without opening port 22), Parameter Store (config/secrets), Patch Manager (OS patching), Run Command (execute scripts across instances), State Manager (desired state), and Inventory (software inventory).
Q29: Explain DynamoDB partition keys and sort keys.
The partition key determines which physical partition stores your data — must be high cardinality for even distribution. The sort key enables range queries within a partition. Together they form the primary key. Bad partition key (e.g., "country") creates hot partitions. Good partition key (e.g., "user_id") distributes evenly.
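The hot-partition effect is easy to demonstrate. The sketch below hashes keys into a fixed number of buckets and reports the share of items on the busiest one. DynamoDB's internal hash function and partition count are not public, so SHA-256 and eight partitions are stand-in assumptions, as are the sample key values.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical; DynamoDB manages the real partition count


def partition_for(key):
    """Map a partition key to a bucket using a stand-in hash (SHA-256)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def hottest_share(keys):
    """Fraction of items that land on the busiest partition."""
    counts = Counter(partition_for(k) for k in keys)
    return max(counts.values()) / len(keys)


# 1,000 items each: a skewed low-cardinality key vs. a unique high-cardinality key
country_keys = ["US"] * 700 + ["DE"] * 200 + ["JP"] * 100
user_keys = [f"user-{i}" for i in range(1000)]

print(f"country: {hottest_share(country_keys):.0%} of items on the hottest partition")
print(f"user_id: {hottest_share(user_keys):.0%} of items on the hottest partition")
```

With three distinct country values, at most three partitions ever receive traffic no matter how many exist, while unique user IDs spread across all of them.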
Q30: What is AWS WAF, and what rules would you configure?
Web Application Firewall protects against common web exploits. Essential rules: rate limiting (block IPs exceeding thresholds), SQL injection detection, cross-site scripting (XSS) prevention, geographic blocking, IP reputation lists (AWS managed rules), and bot control.
Q31: How do you implement blue/green deployments on AWS?
Multiple approaches: Route 53 weighted routing (shift traffic between environments), ALB target groups (swap which group receives traffic), CodeDeploy (automated blue/green with EC2 or ECS), Elastic Beanstalk (URL swap between environments).
Q32: What is the difference between AWS Secrets Manager and Parameter Store?
Secrets Manager: automatic rotation, cross-account access, $0.40/secret/month. Parameter Store: free tier (standard), no automatic rotation, hierarchical naming. Use Secrets Manager for database credentials and API keys that need rotation. Use Parameter Store for configuration values.
Q33: How does AWS CloudWatch Logs Insights work?
CloudWatch Logs Insights is a query language for log analysis. It's faster than grep-ing through log files and supports aggregations, filtering, and visualization.
# Find the 10 longest-running Lambda invocations (billed duration drives cost)
fields @timestamp, @requestId, @duration, @billedDuration, @memorySize
| filter @type = "REPORT"
| sort @duration desc
| limit 10
Q34: What are Reserved Instances vs Savings Plans?
Reserved Instances: commit to a specific instance type/region for 1 or 3 years (up to 72% savings). Savings Plans: commit to a $/hour spend for 1 or 3 years. Compute Savings Plans apply across instance families, regions, and services (EC2, Lambda, Fargate); EC2 Instance Savings Plans are tied to one instance family in one region in exchange for a deeper discount. Savings Plans are generally the better choice now.
Q35: How do you secure a public-facing API on AWS?
Layer security: CloudFront with WAF (rate limiting, request filtering) and Shield (DDoS protection), API Gateway with authorizers (Cognito, Lambda), VPC with private subnets (backend), Security Groups (port-level), IAM policies (service-level), encryption in transit (TLS) and at rest (KMS).
Advanced (Questions 36-50)
Q36: Design a multi-region active-active architecture.
Use Route 53 latency routing, DynamoDB Global Tables (multi-region writes), S3 Cross-Region Replication, Aurora Global Database (cross-region replication lag typically under 1 second, with writes going to a single primary region), CloudFront for edge caching. Each region serves traffic independently. Challenge: handling conflicts in data written to multiple regions simultaneously.
Q37: What is your approach to AWS disaster recovery?
Four strategies with increasing cost and speed: Backup & Restore (RPO hours, RTO hours), Pilot Light (core infrastructure running, RPO minutes, RTO 10+ minutes), Warm Standby (scaled-down copy, RPO seconds, RTO minutes), Multi-Site Active-Active (RPO near-zero, RTO near-zero).
Q38: How would you implement zero-trust networking on AWS?
Assume no network perimeter is trusted. Use: IAM everywhere (no long-term credentials), VPC endpoints (no internet traversal), PrivateLink (service-to-service), mTLS (mutual TLS between services), Security Groups at the pod/instance level, AWS Verified Access (identity-based application access), and continuous monitoring with GuardDuty.
Q39: How do you handle secrets in a CI/CD pipeline on AWS?
Never store secrets in code or environment variables in build configs. Use Secrets Manager or Parameter Store, reference them at runtime. In CodeBuild, use env/secrets-manager in buildspec.yml. In GitHub Actions, use OIDC federation (no static credentials).
# buildspec.yml: reference secrets from Secrets Manager
env:
  secrets-manager:
    DB_PASSWORD: "prod/db:password"
    API_KEY: "prod/api:key"
Q40: What is the Well-Architected Framework?
Six pillars: Operational Excellence (automation, IaC), Security (IAM, encryption, detective controls), Reliability (multi-AZ, auto-scaling, backup), Performance Efficiency (right-sizing, caching, CDN), Cost Optimization (reserved capacity, right-sizing, lifecycle policies), Sustainability (efficient resource usage).
Q41: How would you optimize a $50K/month AWS bill?
Start with Cost Explorer to identify top spenders. Quick wins: right-size oversized instances, purchase Savings Plans, delete unused EBS volumes and snapshots, use S3 lifecycle policies, switch to Graviton instances (up to about 20% cheaper than comparable x86 instances), use Spot for non-critical workloads, review NAT Gateway data transfer costs.
Q42: Explain AWS Organizations and Service Control Policies.
Organizations groups accounts into OUs for centralized management. SCPs act as guardrails: they don't grant permissions but cap what IAM policies in member accounts can allow. Example: an SCP denying all actions in ap-southeast-1 means nobody in an affected member account, including its root user, can use that region, regardless of their IAM permissions. SCPs do not apply to the management account.
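The region example can be written down as an SCP document plus the rule that a request succeeds only when IAM allows it and no SCP denies it. This is a simplified sketch: the evaluator handles only a `StringEquals` condition on `aws:RequestedRegion`, and production region-deny SCPs usually exempt global services, which is omitted here. The helper names are hypothetical.

```python
# Hypothetical SCP: deny every action requested in ap-southeast-1.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:RequestedRegion": "ap-southeast-1"}},
        }
    ],
}


def scp_denies(scp_doc, region):
    """True if any Deny statement applies to a request made in this region."""
    for stmt in scp_doc["Statement"]:
        if stmt["Effect"] != "Deny":
            continue
        condition = stmt.get("Condition")
        if condition is None:
            return True  # unconditional deny
        regions = condition.get("StringEquals", {}).get("aws:RequestedRegion", [])
        if isinstance(regions, str):
            regions = [regions]
        if region in regions:
            return True
    return False


def effective_access(iam_allows, scp_doc, region):
    """A request succeeds only if IAM allows it AND no SCP denies it."""
    return iam_allows and not scp_denies(scp_doc, region)


print(effective_access(True, scp, "ap-southeast-1"))  # IAM allows, SCP blocks
print(effective_access(True, scp, "us-east-1"))       # IAM allows, SCP silent
print(effective_access(False, scp, "us-east-1"))      # an SCP never grants access
```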
Q43: How would you migrate a monolithic application to microservices on AWS?
Strangler Fig pattern: keep the monolith running, extract one capability at a time into a microservice (ECS/EKS), route traffic through API Gateway. Use SQS/SNS for async communication between services. Migrate the database last — start with shared database, then split per service.
Q44: What is AWS PrivateLink, and when should you use it?
PrivateLink exposes a service in one VPC to other VPCs (or accounts) privately. Traffic never traverses the public internet. Use it for: SaaS vendor integration, cross-account service access, or exposing internal APIs. It's more secure and scalable than VPC peering for service-oriented architectures.
Q45: How do you handle state in a horizontally scaled application?
Move state out of the application tier. Sessions: ElastiCache Redis. File uploads: S3 with presigned URLs. Database: RDS/Aurora with connection pooling (RDS Proxy). Caching: ElastiCache. Configuration: Parameter Store. Locks: DynamoDB (conditional writes). The application tier should be completely disposable.
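The DynamoDB lock idea relies on a conditional write: the put succeeds only if no live lock item exists, so exactly one worker wins. The class below is an in-memory stand-in that mimics that semantics; it is not DynamoDB code. In a real table the same check would be expressed as a `ConditionExpression` on `PutItem`. The class, method names, and lock ID are hypothetical.

```python
import time


class LockTable:
    """In-memory stand-in for a DynamoDB table used as a lock store."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, lock_id, owner, ttl_seconds, now=None):
        """Succeed only if no unexpired lock exists (models a conditional put)."""
        now = time.time() if now is None else now
        existing = self._items.get(lock_id)
        if existing is not None and existing["expires_at"] > now:
            return False  # condition failed: another owner holds the lock
        self._items[lock_id] = {"owner": owner, "expires_at": now + ttl_seconds}
        return True

    def release(self, lock_id, owner):
        """Delete the lock only if the caller still owns it."""
        item = self._items.get(lock_id)
        if item is not None and item["owner"] == owner:
            del self._items[lock_id]
            return True
        return False


table = LockTable()
print(table.put_if_absent("nightly-report", "worker-1", ttl_seconds=30, now=0))   # acquired
print(table.put_if_absent("nightly-report", "worker-2", ttl_seconds=30, now=10))  # rejected
print(table.put_if_absent("nightly-report", "worker-2", ttl_seconds=30, now=31))  # expired, acquired
```

The expiry time is the design choice that matters: without it, a worker that crashes while holding the lock would block everyone else indefinitely.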
Q46: Explain the CAP theorem in the context of AWS databases.
CAP: Consistency, Availability, Partition tolerance. Because network partitions can't be ruled out in a distributed system, the practical choice is between consistency and availability when one occurs. RDS/Aurora: CP (strong consistency, may lose availability during failover). DynamoDB: AP by default (eventually consistent), but supports strongly consistent reads. DynamoDB Global Tables: AP (eventual consistency across regions). Design your data layer based on which trade-off your application can tolerate.
Q47: How would you implement a data lake on AWS?
S3 as the storage layer (raw, curated, and processed zones). AWS Glue for ETL and catalog. Athena for serverless SQL queries. Lake Formation for governance and access control. Redshift Spectrum for complex analytics. QuickSight for visualization. Kinesis for real-time data ingestion.
Q48: What are the networking considerations for a multi-account strategy?
Use Transit Gateway as the hub. Centralize egress through a dedicated Network account. Share VPCs using RAM for tightly coupled workloads. Use PrivateLink for cross-account service access. Plan CIDR blocks carefully to avoid overlap. Use Route 53 Private Hosted Zones shared across accounts.
Q49: How do you achieve compliance (SOC 2, HIPAA, PCI) on AWS?
AWS provides compliant infrastructure — you handle compliant configurations. Use AWS Artifact for compliance reports. Enable AWS Config rules for continuous compliance monitoring. Use Security Hub for aggregated findings. Encrypt everything (KMS). Log everything (CloudTrail). Restrict access (SCPs + IAM). Use dedicated accounts for compliance-scoped workloads.
Q50: Design a serverless application that handles 10,000 requests per second.
API Gateway (HTTP API for lower latency and cost) with Lambda behind it. Use Provisioned Concurrency to avoid cold starts. DynamoDB with on-demand capacity for unpredictable scaling. SQS for decoupling write-heavy operations. CloudFront in front of API Gateway for caching. Step Functions for complex workflows. Monitor with X-Ray for tracing and CloudWatch for metrics.
# Check Lambda concurrent executions in your region
aws lambda get-account-settings \
  --query 'AccountLimit.ConcurrentExecutions'
# Default: 1000, request increase for production workloads
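Whether the default limit is enough depends on function duration, since concurrent executions equal the request rate multiplied by the average duration in seconds. A quick sketch of that arithmetic for the 10,000 requests-per-second target; the durations are assumed values and `required_concurrency` is a hypothetical helper.

```python
import math

DEFAULT_LIMIT = 1000  # default regional concurrency limit cited above


def required_concurrency(requests_per_second, avg_duration_ms):
    """Concurrent executions = request rate x average duration in seconds."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


for duration_ms in (50, 100, 250):
    needed = required_concurrency(10_000, duration_ms)
    status = "fits" if needed <= DEFAULT_LIMIT else "needs a quota increase"
    print(f"{duration_ms} ms avg -> {needed} concurrent executions ({status})")
```

At 10,000 requests per second, anything slower than 100 ms on average exceeds the default limit, which is why the design pairs a quota increase with keeping functions short.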
Interviews test more than memorization — they test your ability to reason about trade-offs. When answering, always explain the "why" behind your choice, mention what you'd monitor after implementation, and acknowledge the trade-offs. Saying "I'd use DynamoDB because it scales horizontally without connection limits, but the trade-off is eventual consistency for cross-region reads" is infinitely better than just saying "I'd use DynamoDB." Good luck with your interviews, and practice building these services hands-on — there's no substitute for real experience.
