The AWS Well-Architected Framework: 6 Pillars You're Probably Ignoring

October 4, 2025 · 7 min read

DevOps & Cloud Learning Hub

Most teams build on AWS by copying tutorials, stitching together Stack Overflow answers, and hoping for the best. Six months later they have a production system that works — until it doesn't. The bill is 3x what it should be, nobody knows what happens if us-east-1 goes down, and the security posture is "we'll deal with it when we get audited." The Well-Architected Framework exists to prevent this. It's not theoretical — it's a checklist distilled from thousands of AWS customer architectures.

What Is the Well-Architected Framework?

The Well-Architected Framework is a set of design principles and best practices organized into six pillars (originally five, with Sustainability added in 2021). AWS Solutions Architects use it to evaluate workloads through a structured review process, and AWS provides a free tool to run the review yourself.

The six pillars are:

Operational Excellence — Run and monitor systems
Security — Protect data, systems, and assets
Reliability — Recover from failures, meet demand
Performance Efficiency — Use resources efficiently
Cost Optimization — Avoid unnecessary costs
Sustainability — Minimize environmental impact

Pillar 1: Operational Excellence

Key principle: Operations as code. If a human is doing it manually, it should be automated.

Design principles:

Perform operations as code (CloudFormation, Terraform, CDK)
Make frequent, small, reversible changes
Refine operations procedures frequently
Anticipate failure (chaos engineering, game days)
Learn from all operational failures

# Anti-pattern: SSH into a server to check logs
ssh ec2-user@10.0.1.50 "tail -f /var/log/app.log"

# Well-Architected: Centralized logging with CloudWatch
aws logs create-log-group --log-group-name /myapp/production
aws logs put-retention-policy \
  --log-group-name /myapp/production \
  --retention-in-days 30

# Query the last hour of errors without SSH
# (date -d is GNU date syntax; on macOS/BSD use: date -v-1H +%s000)
aws logs filter-log-events \
  --log-group-name /myapp/production \
  --filter-pattern "ERROR" \
  --start-time "$(date -d '1 hour ago' +%s000)"

Common anti-patterns:

Manual deployments via SSH or console clicks
No runbooks for incident response
Ignoring CloudTrail and CloudWatch alarms
No post-incident reviews
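
"Ignoring alarms" often really means no alarms were ever created. The sketch below turns the ERROR filter from the logging example into a metric and alarms on it. The function name, the MyApp namespace, the AppErrors metric, the threshold of 10 errors per 5 minutes, and the SNS topic ARN are all illustrative assumptions; tune them to your own workload.

```shell
# Sketch: alarm when ERROR lines in a log group exceed a threshold.
# Assumptions: namespace "MyApp", metric "AppErrors", and the threshold
# are placeholders; pass your own SNS topic ARN for notifications.
create_error_alarm() {
  log_group="${1:?usage: create_error_alarm LOG_GROUP SNS_TOPIC_ARN}"
  topic_arn="${2:?usage: create_error_alarm LOG_GROUP SNS_TOPIC_ARN}"

  # Count every log event containing "ERROR" as one data point
  aws logs put-metric-filter \
    --log-group-name "$log_group" \
    --filter-name app-errors \
    --filter-pattern "ERROR" \
    --metric-transformations metricName=AppErrors,metricNamespace=MyApp,metricValue=1

  # Notify when more than 10 errors land in a 5-minute window
  aws cloudwatch put-metric-alarm \
    --alarm-name app-error-rate \
    --namespace MyApp \
    --metric-name AppErrors \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$topic_arn"
}

# Example (hypothetical topic ARN):
# create_error_alarm /myapp/production arn:aws:sns:us-east-1:123456789012:ops-alerts
```

Setting missing data to notBreaching keeps the alarm quiet when the app logs no errors at all, which is usually what you want for an error counter.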

Pillar 2: Security

Key principle: Apply security at all layers. Never trust a single security mechanism.

Design principles:

Implement a strong identity foundation (least privilege)
Enable traceability (log everything)
Apply security at all layers (network, instance, application, data)
Automate security best practices
Protect data in transit and at rest
Keep people away from data
Prepare for security events

# Anti-pattern: Overly permissive IAM
# "Effect": "Allow", "Action": "*", "Resource": "*"

# Well-Architected: Least privilege IAM
aws iam create-policy --policy-name app-minimal-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:PutObject"
        ],
        "Resource": "arn:aws:s3:::my-app-bucket/uploads/*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:Query"
        ],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/users"
      }
    ]
  }'
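
Hand-written policies are easy to get subtly wrong, so it can help to lint them before creating them. A minimal sketch using IAM Access Analyzer's policy validation, assuming the policy JSON above has been saved to a local file (policy.json is a hypothetical name, as is the function):

```shell
# Sketch: lint an identity policy with IAM Access Analyzer before creating it.
# Assumption: the policy JSON is saved locally, e.g. as policy.json.
validate_policy() {
  policy_file="${1:?usage: validate_policy POLICY_FILE}"

  # Returns errors, security warnings, and suggestions for the document
  aws accessanalyzer validate-policy \
    --policy-type IDENTITY_POLICY \
    --policy-document "file://$policy_file" \
    --query 'findings[*].{Type: findingType, Issue: issueCode, Details: findingDetails}' \
    --output table
}

# Example:
# validate_policy policy.json
```

An empty result means the validator found nothing to flag. It does not prove the policy is least privilege for your application.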

Common anti-patterns:

IAM users with long-lived access keys instead of IAM roles
Security groups with 0.0.0.0/0 on non-public ports
Unencrypted S3 buckets and EBS volumes
No MFA on root account and privileged users
Single AWS account for everything (no blast radius isolation)
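
Two of these anti-patterns can be checked from the CLI in a few seconds. The sketch below is a starting point rather than a full audit; the function names are illustrative, and both commands act on whichever region your CLI is currently configured for.

```shell
# Sketch: two quick checks against the anti-patterns above.

# List security groups that have any rule open to the whole internet.
# Review each hit: 0.0.0.0/0 on 443 may be intended, on 22 or 5432 rarely is.
list_open_security_groups() {
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[*].{ID: GroupId, Name: GroupName}' \
    --output table
}

# Encrypt all new EBS volumes in the current region by default,
# then read the setting back to confirm it took effect.
enable_default_ebs_encryption() {
  aws ec2 enable-ebs-encryption-by-default
  aws ec2 get-ebs-encryption-by-default
}
```

Default encryption only applies to volumes created after it is enabled; existing unencrypted volumes still need to be migrated.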

Pillar 3: Reliability

Key principle: Automatically recover from failure. Design for known and unknown failures.

Design principles:

Automatically recover from failure
Test recovery procedures
Scale horizontally to increase aggregate availability
Stop guessing capacity
Manage change through automation
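
The "scale horizontally" principle has simple arithmetic behind it. If a single instance is available a fraction a of the time, the chance that at least one of n instances is up is 1 - (1 - a)^n. The sketch below assumes failures are independent, which is roughly what spreading instances across Availability Zones is meant to approximate; the function name is hypothetical.

```shell
# Sketch: availability of n redundant instances, assuming independent failures.
# Formula: 1 - (1 - a)^n, where a is the availability of a single instance.
aggregate_availability() {
  a="${1:?usage: aggregate_availability SINGLE_AVAILABILITY COUNT}"
  n="${2:?usage: aggregate_availability SINGLE_AVAILABILITY COUNT}"
  awk -v a="$a" -v n="$n" 'BEGIN { printf "%.6f\n", 1 - (1 - a) ^ n }'
}

# One 99% instance vs. two vs. three:
# aggregate_availability 0.99 1   # 0.990000
# aggregate_availability 0.99 2   # 0.999900
# aggregate_availability 0.99 3   # 0.999999
```

Correlated failures (a bad deploy, a regional outage) break the independence assumption, so treat these numbers as an upper bound.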

# Anti-pattern: Single EC2 instance, no health checks
# If it dies, you find out from customer complaints

# Well-Architected: Multi-AZ with auto-healing
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name myapp-asg \
  --launch-template LaunchTemplateId=lt-abc123,Version='$Latest' \
  --min-size 2 --max-size 10 --desired-capacity 3 \
  --vpc-zone-identifier "subnet-az1,subnet-az2,subnet-az3" \
  --health-check-type ELB \
  --health-check-grace-period 300 \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp/abc123

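# Multi-AZ RDS (automatic failover)
# Storage size and master username are example values; the password
# is generated and stored in Secrets Manager rather than passed inline
aws rds create-db-instance \
  --db-instance-identifier mydb \
  --multi-az \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --allocated-storage 100 \
  --master-username appadmin \
  --manage-master-user-password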

Common anti-patterns:

Single-AZ deployments for production workloads
No auto-scaling (fixed instance count)
No backup or backup retention too short
Untested disaster recovery plans
Hardcoded IP addresses and endpoints
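
"Untested disaster recovery plans" is the easiest of these to fix, because a Multi-AZ failover can be rehearsed on demand. A minimal sketch, assuming a Multi-AZ instance like the mydb example above; the function name is illustrative, and this belongs in a planned game day rather than an ad hoc run against production.

```shell
# Sketch: rehearse a Multi-AZ failover instead of waiting for a real one.
# Assumption: the instance is Multi-AZ; expect a brief connection drop.
test_rds_failover() {
  db_id="${1:?usage: test_rds_failover DB_INSTANCE_ID}"

  # Reboot with failover promotes the standby in the other AZ
  aws rds reboot-db-instance \
    --db-instance-identifier "$db_id" \
    --force-failover

  # Poll until the instance reports "available" again
  aws rds wait db-instance-available \
    --db-instance-identifier "$db_id"
}

# Example:
# test_rds_failover mydb
```

The waiter only tells you the database came back. Whether your application reconnected cleanly is the part worth watching during the test.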

Pillar 4: Performance Efficiency

Key principle: Use the right tool for the job. Don't force one solution for all problems.

Design principles:

Democratize advanced technologies (use managed services)
Go global in minutes
Use serverless architectures
Experiment more often
Consider mechanical sympathy (understand how services work)

# Anti-pattern: Using EC2 for everything
# "We run Redis on an EC2 instance we manage ourselves"

# Well-Architected: Use managed services
# ElastiCache for Redis (managed, Multi-AZ, auto-failover)
aws elasticache create-replication-group \
  --replication-group-id myapp-cache \
  --replication-group-description "App session cache" \
  --engine redis \
  --cache-node-type cache.r6g.large \
  --num-cache-clusters 2 \
  --automatic-failover-enabled \
  --multi-az-enabled

# Consider: Is this a caching problem or a database problem?
# DynamoDB DAX for read-heavy DynamoDB workloads
# CloudFront for static content delivery
# Aurora Serverless for variable database workloads

Common anti-patterns:

Running self-managed databases on EC2
Not using caching (CloudFront, ElastiCache)
Wrong instance type for workload (memory-optimized for CPU-bound)
Monolithic architecture when microservices make sense
Not leveraging edge locations for global users
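
For the "wrong instance type" anti-pattern, AWS Compute Optimizer can report which instances look over- or under-provisioned. A minimal sketch, assuming the account has already opted in to Compute Optimizer and it has had time to collect utilization metrics; the function name is illustrative.

```shell
# Sketch: list Compute Optimizer's right-sizing findings for EC2 instances.
# Assumption: Compute Optimizer is enabled for this account.
list_rightsizing_findings() {
  aws compute-optimizer get-ec2-instance-recommendations \
    --query 'instanceRecommendations[*].{Instance: instanceArn, Current: currentInstanceType, Finding: finding}' \
    --output table
}
```

Treat the findings as a prompt to investigate, not an instruction: the recommendations are based on observed utilization and know nothing about your upcoming launch.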

Pillar 5: Cost Optimization

Key principle: Pay only for what you need. Measure continuously.

This pillar deserves its own deep-dive (see our AWS Cost Optimization post), but the key design principles are:

Implement cloud financial management (dedicated team/process)
Adopt a consumption model (pay for what you use)
Measure overall efficiency
Stop spending money on undifferentiated heavy lifting
Analyze and attribute expenditure
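
The consumption model is easiest to see with arithmetic. A dev environment that is only needed during working hours is idle for most of the week, and the sketch below computes how many instance-hours a schedule would save compared to running 24/7. The function name is hypothetical, and the figure covers hourly-billed compute only; storage keeps costing money while instances are stopped.

```shell
# Sketch: percent of instance-hours saved by running an environment only
# on a schedule instead of 24/7. Integer math, rounds down.
schedule_savings_pct() {
  hours_per_day="${1:?usage: schedule_savings_pct HOURS_PER_DAY DAYS_PER_WEEK}"
  days_per_week="${2:?usage: schedule_savings_pct HOURS_PER_DAY DAYS_PER_WEEK}"
  always_on=$((24 * 7))
  scheduled=$((hours_per_day * days_per_week))
  echo $(( (always_on - scheduled) * 100 / always_on ))
}

# schedule_savings_pct 10 5   # 70  (50 of 168 hours used)
# schedule_savings_pct 12 7   # 50
```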

Common anti-patterns:

No cost monitoring or budgets
Running dev/staging 24/7 at production scale
Ignoring Reserved Instances or Savings Plans
No lifecycle policies on S3
Unused Elastic IPs, old snapshots, idle load balancers
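
The last item on that list is the quickest to check. The sketch below surfaces unassociated Elastic IPs and your oldest snapshots in the current region; the function names and the cutoff of 20 snapshots are illustrative choices.

```shell
# Sketch: surface two common sources of idle spend in the current region.

# Elastic IPs not associated with any instance or network interface
list_unassociated_eips() {
  aws ec2 describe-addresses \
    --query 'Addresses[?AssociationId==`null`].[PublicIp, AllocationId]' \
    --output text
}

# The 20 oldest snapshots owned by this account, oldest first
list_oldest_snapshots() {
  aws ec2 describe-snapshots \
    --owner-ids self \
    --query 'sort_by(Snapshots, &StartTime)[:20].[SnapshotId, StartTime, VolumeSize]' \
    --output text
}
```

Neither function deletes anything. Check that an old snapshot is not backing an AMI or a compliance requirement before removing it.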

Pillar 6: Sustainability

Key principle: Minimize the environmental impact of running cloud workloads.

Design principles:

Understand your impact
Establish sustainability goals
Maximize utilization
Adopt more efficient hardware and software
Reduce the downstream impact of your workloads

# Graviton (ARM) instances: AWS cites up to 60% less energy for the same performance
# Anti-pattern: x86 instances for all workloads
# t3.xlarge → 4 vCPU, 16 GB (x86)

# Well-Architected: Switch to Graviton where possible
# t4g.xlarge → 4 vCPU, 16 GB (ARM), roughly 20% cheaper on-demand, lower energy
# (ami-0graviton-arm64 is a placeholder; substitute a real arm64 AMI ID)
aws ec2 run-instances \
  --instance-type t4g.xlarge \
  --image-id ami-0graviton-arm64 \
  --count 1
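
Graviton instances need an arm64 AMI, and AMI IDs differ per region and change over time. One way to avoid hardcoding an ID is to read the current one from the public SSM parameters AWS publishes. A minimal sketch, assuming Amazon Linux 2023 is an acceptable base image; the function name is illustrative.

```shell
# Sketch: look up a current arm64 AMI instead of hardcoding an AMI ID.
# Assumption: Amazon Linux 2023 is acceptable as the base image.
latest_arm64_ami() {
  aws ssm get-parameter \
    --name /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64 \
    --query 'Parameter.Value' \
    --output text
}

# Example:
# aws ec2 run-instances \
#   --instance-type t4g.xlarge \
#   --image-id "$(latest_arm64_ami)" \
#   --count 1
```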

Common anti-patterns:

Over-provisioned resources running at 10% utilization
Not using auto-scaling to match demand
Keeping large datasets that are never accessed
Not considering region carbon intensity

Running a Well-Architected Review

AWS provides a free tool in the console to run a structured review of your workload:

# Create a workload in the Well-Architected Tool
aws wellarchitected create-workload \
  --workload-name "Production API" \
  --description "Customer-facing REST API" \
  --environment PRODUCTION \
  --lenses wellarchitected \
  --aws-regions us-east-1 \
  --review-owner "platform-team@company.com"

# List available lenses (specialized reviews)
aws wellarchitected list-lenses \
  --query 'LensSummaries[*].{Name: LensName, ARN: LensArn}' \
  --output table

The tool walks you through questions for each pillar and generates a report of high-risk and medium-risk issues with remediation recommendations.
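
Once the questions are answered, the findings can be pulled from the CLI as well. A minimal sketch, assuming you have the workload ID that create-workload returned; the function names and the milestone naming scheme are illustrative.

```shell
# Sketch: pull high-risk findings for one pillar and snapshot the review.
# Assumption: WORKLOAD_ID is the ID returned by create-workload.
list_high_risk() {
  workload_id="${1:?usage: list_high_risk WORKLOAD_ID PILLAR_ID}"
  pillar_id="${2:?usage: list_high_risk WORKLOAD_ID PILLAR_ID}"

  # Titles of the questions currently rated high risk
  aws wellarchitected list-answers \
    --workload-id "$workload_id" \
    --lens-alias wellarchitected \
    --pillar-id "$pillar_id" \
    --query "AnswerSummaries[?Risk=='HIGH'].QuestionTitle" \
    --output text
}

# Record the current state so the next review can show progress
save_milestone() {
  workload_id="${1:?usage: save_milestone WORKLOAD_ID}"
  aws wellarchitected create-milestone \
    --workload-id "$workload_id" \
    --milestone-name "review-$(date +%Y-%m-%d)"
}

# Example (hypothetical workload ID):
# list_high_risk 0123456789abcdef0123456789abcdef security
```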

Specialized Lenses

Beyond the core framework, AWS offers specialized lenses for specific workload types:

Lens	Focus Area
Serverless Applications	Lambda, API Gateway, Step Functions
SaaS	Multi-tenancy, isolation, onboarding
Machine Learning	Training, inference, data pipelines
Data Analytics	Lake, warehouse, streaming
Container Build	Container image design and build process
Financial Services	Compliance, security, resilience
IoT	Device management, edge computing

Each lens adds domain-specific questions and best practices on top of the core framework.

The Practical Starting Point

Don't try to be perfect across all six pillars on day one. Prioritize based on your situation:

Security first — because a breach can end a company
Reliability second — because downtime loses revenue and trust
Cost Optimization third — because an unsustainable bill kills projects
Everything else — iterate once the foundation is solid

Schedule a Well-Architected Review quarterly. Treat the high-risk findings like bugs — track them, prioritize them, and fix them.

What's Next

The Well-Architected Framework gives you the big picture. But how do you enforce security across dozens of AWS services? In the next post, we'll dive into GuardDuty, Security Hub, and Config Rules — the services that automate security monitoring and compliance on AWS.
