The AWS Well-Architected Framework: 6 Pillars You're Probably Ignoring

Goel Academy
DevOps & Cloud Learning Hub

Most teams build on AWS by copying tutorials, stitching together Stack Overflow answers, and hoping for the best. Six months later they have a production system that works — until it doesn't. The bill is 3x what it should be, nobody knows what happens if us-east-1 goes down, and the security posture is "we'll deal with it when we get audited." The Well-Architected Framework exists to prevent this. It's not theoretical — it's a checklist distilled from thousands of AWS customer architectures.

What Is the Well-Architected Framework?

The Well-Architected Framework is a set of design principles and best practices organized into six pillars (originally five, with Sustainability added in 2021). AWS Solutions Architects use it to evaluate workloads through a structured review process, and AWS provides a free tool to run the review yourself.

The six pillars are:

  1. Operational Excellence — Run and monitor systems
  2. Security — Protect data, systems, and assets
  3. Reliability — Recover from failures, meet demand
  4. Performance Efficiency — Use resources efficiently
  5. Cost Optimization — Avoid unnecessary costs
  6. Sustainability — Minimize environmental impact

Pillar 1: Operational Excellence

Key principle: Operations as code. If a human is doing it manually, it should be automated.

Design principles:

  • Perform operations as code (CloudFormation, Terraform, CDK)
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure (chaos engineering, game days)
  • Learn from all operational failures

# Anti-pattern: SSH into a server to check logs
ssh ec2-user@10.0.1.50 "tail -f /var/log/app.log"

# Well-Architected: Centralized logging with CloudWatch
aws logs create-log-group --log-group-name /myapp/production
aws logs put-retention-policy \
  --log-group-name /myapp/production \
  --retention-in-days 30

# Query logs without SSH
aws logs filter-log-events \
  --log-group-name /myapp/production \
  --filter-pattern "ERROR" \
  --start-time "$(date -d '1 hour ago' +%s000)"
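
The `date -d '1 hour ago'` form used for `--start-time` is GNU syntax and fails on macOS/BSD. A minimal sketch of a portable alternative follows; `ms_ago` is a hypothetical helper name, and it assumes a `date` that supports `+%s` (both GNU and BSD do).

```shell
# ms_ago SECONDS: epoch milliseconds for "SECONDS ago" (default: 3600).
# Hypothetical helper; uses only `date +%s` and shell arithmetic.
ms_ago() {
  seconds="${1:-3600}"
  now="$(date +%s)"
  echo $(( (now - seconds) * 1000 ))
}

# Usage (needs AWS credentials, so it is left commented out):
# aws logs filter-log-events \
#   --log-group-name /myapp/production \
#   --filter-pattern "ERROR" \
#   --start-time "$(ms_ago 3600)"
```

CloudWatch Logs expects timestamps in milliseconds, which is why the helper multiplies by 1000 rather than appending a literal `000`.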

Common anti-patterns:

  • Manual deployments via SSH or console clicks
  • No runbooks for incident response
  • Ignoring CloudTrail and CloudWatch alarms
  • No post-incident reviews

Pillar 2: Security

Key principle: Apply security at all layers. Never trust a single security mechanism.

Design principles:

  • Implement a strong identity foundation (least privilege)
  • Enable traceability (log everything)
  • Apply security at all layers (network, instance, application, data)
  • Automate security best practices
  • Protect data in transit and at rest
  • Keep people away from data
  • Prepare for security events

# Anti-pattern: Overly permissive IAM
# "Effect": "Allow", "Action": "*", "Resource": "*"

# Well-Architected: Least privilege IAM
aws iam create-policy --policy-name app-minimal-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:PutObject"
        ],
        "Resource": "arn:aws:s3:::my-app-bucket/uploads/*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:Query"
        ],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/users"
      }
    ]
  }'
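
To catch the wildcard anti-pattern before a policy ships, a rough pre-commit check can scan policy files for a bare `"*"` Action or Resource. This is only a sketch: `has_wildcard_grant` is a hypothetical helper, it matches text rather than parsing JSON, and it does not flag service-level wildcards such as `"s3:*"`. For real validation, `aws accessanalyzer validate-policy` is the more thorough option.

```shell
# has_wildcard_grant FILE: succeeds if the policy text grants "*" as a whole
# Action or Resource (string form or single-line array form).
# Hypothetical helper; a textual check, not a JSON parser.
has_wildcard_grant() {
  grep -Eq '"(Action|Resource)"[[:space:]]*:[[:space:]]*\[?[[:space:]]*"\*"' "$1"
}

# Demo with two throwaway policy files.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/admin.json" <<'EOF'
{"Effect": "Allow", "Action": "*", "Resource": "*"}
EOF
cat > "$tmpdir/scoped.json" <<'EOF'
{"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::my-app-bucket/uploads/*"}
EOF

for f in admin scoped; do
  if has_wildcard_grant "$tmpdir/$f.json"; then
    echo "$f.json: wildcard grant found"
  else
    echo "$f.json: ok"
  fi
done
rm -rf "$tmpdir"
```

Some actions legitimately require `"Resource": "*"`, so a hit is a prompt for review rather than proof of a problem.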

Common anti-patterns:

  • IAM users with long-lived access keys instead of IAM roles
  • Security groups with 0.0.0.0/0 on non-public ports
  • Unencrypted S3 buckets and EBS volumes
  • No MFA on root account and privileged users
  • Single AWS account for everything (no blast radius isolation)

Pillar 3: Reliability

Key principle: Automatically recover from failure. Design for known and unknown failures.

Design principles:

  • Automatically recover from failure
  • Test recovery procedures
  • Scale horizontally to increase aggregate availability
  • Stop guessing capacity
  • Manage change through automation

# Anti-pattern: Single EC2 instance, no health checks
# If it dies, you find out from customer complaints

# Well-Architected: Multi-AZ with auto-healing
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name myapp-asg \
  --launch-template LaunchTemplateId=lt-abc123,Version='$Latest' \
  --min-size 2 --max-size 10 --desired-capacity 3 \
  --vpc-zone-identifier "subnet-az1,subnet-az2,subnet-az3" \
  --health-check-type ELB \
  --health-check-grace-period 300 \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp/abc123

# Multi-AZ RDS (automatic failover)
# Storage size and master username are placeholder values;
# --manage-master-user-password keeps the generated password in Secrets Manager.
aws rds create-db-instance \
  --db-instance-identifier mydb \
  --multi-az \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --allocated-storage 100 \
  --master-username dbadmin \
  --manage-master-user-password
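
The principle "scale horizontally to increase aggregate availability" can be made concrete with a little arithmetic. The sketch below assumes replicas fail independently and that any single healthy replica can serve traffic; `aggregate_availability` is a hypothetical helper, and the 0.99 per-instance figure is illustrative, not an AWS number.

```shell
# aggregate_availability A N: availability of N independent replicas in
# parallel, each with availability A (a fraction such as 0.99).
# Computes 1 - (1 - A)^N. Hypothetical helper.
aggregate_availability() {
  LC_ALL=C awk -v a="$1" -v n="$2" 'BEGIN { printf "%.6f\n", 1 - (1 - a) ^ n }'
}

for n in 1 2 3; do
  echo "$n instance(s): $(aggregate_availability 0.99 "$n")"
done
```

Correlated failures (a bad deployment, a regional outage) break the independence assumption, which is why spreading instances across Availability Zones matters as much as the instance count.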

Common anti-patterns:

  • Single-AZ deployments for production workloads
  • No auto-scaling (fixed instance count)
  • No backups, or backup retention that is too short
  • Untested disaster recovery plans
  • Hardcoded IP addresses and endpoints

Pillar 4: Performance Efficiency

Key principle: Use the right tool for the job. Don't force one solution for all problems.

Design principles:

  • Democratize advanced technologies (use managed services)
  • Go global in minutes
  • Use serverless architectures
  • Experiment more often
  • Consider mechanical sympathy (understand how services work)

# Anti-pattern: Using EC2 for everything
# "We run Redis on an EC2 instance we manage ourselves"

# Well-Architected: Use managed services
# ElastiCache for Redis (managed, Multi-AZ, auto-failover)
aws elasticache create-replication-group \
  --replication-group-id myapp-cache \
  --replication-group-description "App session cache" \
  --engine redis \
  --cache-node-type cache.r6g.large \
  --num-cache-clusters 2 \
  --automatic-failover-enabled \
  --multi-az-enabled

# Consider: Is this a caching problem or a database problem?
# DynamoDB DAX for read-heavy DynamoDB workloads
# CloudFront for static content delivery
# Aurora Serverless for variable database workloads
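
One way to answer "is this a caching problem?" is to estimate how much a given hit ratio would actually move read latency. A rough sketch, where `effective_latency_ms` is a hypothetical helper and the 1 ms cache and 20 ms database latencies are made-up illustrative values:

```shell
# effective_latency_ms HIT_RATIO CACHE_MS DB_MS: expected read latency when
# a fraction HIT_RATIO of reads is served from cache and the rest hit the DB.
# Hypothetical helper; the latency figures below are illustrative only.
effective_latency_ms() {
  LC_ALL=C awk -v h="$1" -v c="$2" -v d="$3" \
    'BEGIN { printf "%.1f\n", h * c + (1 - h) * d }'
}

for hit in 0 0.5 0.9; do
  echo "hit ratio $hit: $(effective_latency_ms "$hit" 1 20) ms"
done
```

If the achievable hit ratio is low because the access pattern has little reuse, a cache adds cost without much benefit, and the data model or database choice is the better place to look.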

Common anti-patterns:

  • Running self-managed databases on EC2
  • Not using caching (CloudFront, ElastiCache)
  • Wrong instance type for the workload (for example, memory-optimized instances for a CPU-bound job)
  • Monolithic architecture when microservices make sense
  • Not leveraging edge locations for global users

Pillar 5: Cost Optimization

Key principle: Pay only for what you need. Measure continuously.

This pillar deserves its own deep-dive (see our AWS Cost Optimization post), but the key design principles are:

  • Implement cloud financial management (dedicated team/process)
  • Adopt a consumption model (pay for what you use)
  • Measure overall efficiency
  • Stop spending money on undifferentiated heavy lifting
  • Analyze and attribute expenditure

Common anti-patterns:

  • No cost monitoring or budgets
  • Running dev/staging 24/7 at production scale
  • Ignoring Reserved Instances or Savings Plans
  • No lifecycle policies on S3
  • Unused Elastic IPs, old snapshots, idle load balancers
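
The cost of running dev/staging around the clock is easy to quantify. A minimal sketch, assuming on-demand compute billed per hour; `schedule_savings_pct` is a hypothetical helper and the 12-hour, 5-day schedule is just an example.

```shell
# schedule_savings_pct HOURS_PER_DAY DAYS_PER_WEEK: percentage of always-on
# instance-hours avoided by running only on that schedule (integer, rounded down).
# Hypothetical helper.
schedule_savings_pct() {
  always_on=$(( 24 * 7 ))
  scheduled=$(( $1 * $2 ))
  echo $(( (always_on - scheduled) * 100 / always_on ))
}

echo "12h x 5d schedule avoids $(schedule_savings_pct 12 5)% of instance-hours"
```

This covers compute hours only; attached storage, snapshots, and idle load balancers keep billing while instances are stopped.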

Pillar 6: Sustainability

Key principle: Minimize the environmental impact of running cloud workloads.

Design principles:

  • Understand your impact
  • Establish sustainability goals
  • Maximize utilization
  • Adopt more efficient hardware and software
  • Reduce the downstream impact of your workloads

# Use Graviton (ARM) instances: AWS cites up to 60% less energy for the same performance
# Anti-pattern: x86 instances for all workloads
# t3.xlarge → 4 vCPU, 16 GB (x86)

# Well-Architected: Switch to Graviton where possible
# t4g.xlarge → 4 vCPU, 16 GB (ARM), roughly 20% cheaper, lower energy
# ami-0graviton-arm64 is a placeholder; substitute a real arm64 AMI ID for your region
aws ec2 run-instances \
  --instance-type t4g.xlarge \
  --image-id ami-0graviton-arm64 \
  --count 1
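
The price gap quoted above can be checked with simple arithmetic. In this sketch `pct_cheaper` is a hypothetical helper, and the hourly rates are assumed on-demand Linux prices for us-east-1; verify them against current pricing before relying on the result.

```shell
# pct_cheaper OLD NEW: how much cheaper NEW is than OLD, as a percentage.
# Hypothetical helper.
pct_cheaper() {
  LC_ALL=C awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

# Assumed on-demand prices in USD per hour (us-east-1, Linux); check current rates.
T3_XLARGE=0.1664
T4G_XLARGE=0.1344

echo "t4g.xlarge is $(pct_cheaper "$T3_XLARGE" "$T4G_XLARGE")% cheaper than t3.xlarge"
```

The saving only materializes if the workload runs on arm64, so test container images and native dependencies before switching.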

Common anti-patterns:

  • Over-provisioned resources running at 10% utilization
  • Not using auto-scaling to match demand
  • Keeping large datasets that are never accessed
  • Not considering region carbon intensity

Running a Well-Architected Review

AWS provides a free tool in the console to run a structured review of your workload:

# Create a workload in the Well-Architected Tool
aws wellarchitected create-workload \
  --workload-name "Production API" \
  --description "Customer-facing REST API" \
  --environment PRODUCTION \
  --lenses wellarchitected \
  --aws-regions us-east-1 \
  --review-owner "platform-team@company.com"

# List available lenses (specialized reviews)
aws wellarchitected list-lenses \
  --query 'LensSummaries[*].{Name: LensName, ARN: LensArn}' \
  --output table

The tool walks you through questions for each pillar and generates a report of high-risk and medium-risk issues with remediation recommendations.

Specialized Lenses

Beyond the core framework, AWS offers specialized lenses for specific workload types:

Lens                      Focus Area
Serverless Applications   Lambda, API Gateway, Step Functions
SaaS                      Multi-tenancy, isolation, onboarding
Machine Learning          Training, inference, data pipelines
Data Analytics            Data lakes, warehouses, streaming
Container Build           Container image design and build process
Financial Services        Compliance, security, resilience
IoT                       Device management, edge computing

Each lens adds domain-specific questions and best practices on top of the core framework.

The Practical Starting Point

Don't try to be perfect across all six pillars on day one. Prioritize based on your situation:

  1. Security first — because a breach can end a company
  2. Reliability second — because downtime loses revenue and trust
  3. Cost Optimization third — because an unsustainable bill kills projects
  4. Everything else — iterate once the foundation is solid

Schedule a Well-Architected Review quarterly. Treat the high-risk findings like bugs — track them, prioritize them, and fix them.
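
One way to treat high-risk findings like bugs is to fail a pipeline step while any remain open. A minimal sketch, assuming the workload already exists in the Well-Architected Tool: `high_risk_count` and `review_gate` are hypothetical helper names, while `get-lens-review` and its `RiskCounts` field come from the AWS CLI.

```shell
# high_risk_count WORKLOAD_ID: number of HIGH risk items in the core lens review.
# Hypothetical helper around `aws wellarchitected get-lens-review`.
high_risk_count() {
  count="$(aws wellarchitected get-lens-review \
    --workload-id "$1" \
    --lens-alias wellarchitected \
    --query 'LensReview.RiskCounts.HIGH' \
    --output text)" || return 2
  # The CLI prints "None" when the key is absent, meaning no high-risk items.
  case "$count" in
    ''|None) echo 0 ;;
    *) echo "$count" ;;
  esac
}

# review_gate WORKLOAD_ID: exits non-zero if any high-risk item is still open,
# or if the review could not be fetched at all.
review_gate() {
  n="$(high_risk_count "$1")" || return 2
  if [ "$n" -gt 0 ]; then
    echo "FAIL: $n high-risk findings open" >&2
    return 1
  fi
  echo "OK: no high-risk findings"
}
```

Calling `review_gate "$WORKLOAD_ID"` from a scheduled job, with the ID taken from the output of `create-workload`, gives the quarterly review a visible pass/fail signal.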

What's Next

The Well-Architected Framework gives you the big picture. But how do you enforce security across dozens of AWS services? In the next post, we'll dive into GuardDuty, Security Hub, and Config Rules — the services that automate security monitoring and compliance on AWS.