CloudWatch — Logs, Metrics, Alarms, and Dashboards That Save You at 3 AM
It's 3:17 AM. Your phone buzzes. "Site is down." You SSH into the server, tail the logs, see nothing obvious, check CPU — it's fine. Memory? Fine. Disk? 100% full. Log files ate the disk three hours ago, and nobody noticed because monitoring wasn't set up. CloudWatch exists so that you don't have to be the monitoring system. It collects metrics, aggregates logs, fires alarms, and pages you before users start tweeting.
CloudWatch Metrics — Default vs Custom
Every AWS service sends default metrics to CloudWatch automatically, at no extra cost. For EC2 that means basic monitoring at five-minute granularity; one-minute detailed monitoring is a paid opt-in. EC2 sends CPU utilization, network I/O, and disk I/O. RDS sends connections, read/write latency, and free storage. Lambda sends invocations, duration, and errors.
But default metrics have limits. EC2 doesn't report memory or disk usage by default — the hypervisor can't see inside the OS. For those, you need custom metrics.
Namespaces and Dimensions
Metrics are organized into namespaces (like AWS/EC2, AWS/RDS) and filtered by dimensions (like InstanceId, DBInstanceIdentifier):
# List all available metrics for EC2
aws cloudwatch list-metrics \
--namespace AWS/EC2 \
--query 'Metrics[].MetricName' \
--output table
# Get CPU utilization for a specific instance
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum \
--output table
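One portability note on the timestamps above: date -d '1 hour ago' is GNU coreutils syntax and fails on macOS/BSD, where the equivalent is -v-1H. A small sketch that handles both; treat the fallback branch as a starting point rather than a guarantee:

```shell
# Build the --start-time/--end-time window in a way that works on both
# GNU date (Linux) and BSD date (macOS).
if date -d '1 hour ago' > /dev/null 2>&1; then
  START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)   # GNU coreutils
else
  START=$(date -u -v-1H +%Y-%m-%dT%H:%M:%S)             # BSD/macOS
fi
END=$(date -u +%Y-%m-%dT%H:%M:%S)
echo "querying from $START to $END"
```

Then pass --start-time "$START" --end-time "$END" to get-metric-statistics as above.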
Publishing Custom Metrics
Use put-metric-data to send application-specific metrics:
# Push a custom metric — active user sessions
aws cloudwatch put-metric-data \
--namespace "MyApp/Production" \
--metric-name ActiveSessions \
--value 1247 \
--unit Count \
--dimensions Environment=production,Service=web-api
# Push with high-resolution (1-second granularity)
aws cloudwatch put-metric-data \
--namespace "MyApp/Production" \
--metric-name RequestLatency \
--value 45.2 \
--unit Milliseconds \
--storage-resolution 1
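If you have several metrics to report, put-metric-data also accepts a JSON batch through --metric-data, which saves API calls. A sketch: the metric names below are made up, the payload is validated locally, and the actual aws call is left commented out so the snippet runs anywhere:

```shell
# Batch several datapoints into one put-metric-data call via a JSON file.
# ActiveSessions/QueueDepth are illustrative names, not real app metrics.
cat > /tmp/metric-batch.json << 'EOF'
[
  {"MetricName": "ActiveSessions", "Value": 1247, "Unit": "Count",
   "Dimensions": [{"Name": "Environment", "Value": "production"}]},
  {"MetricName": "QueueDepth", "Value": 38, "Unit": "Count"}
]
EOF
# Sanity-check the JSON before shipping it
python3 -m json.tool /tmp/metric-batch.json > /dev/null && echo "payload ok"
# aws cloudwatch put-metric-data --namespace "MyApp/Production" \
#   --metric-data file:///tmp/metric-batch.json
```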
CloudWatch Alarms
Alarms watch a metric and trigger actions when it crosses a threshold. An alarm is always in one of three states: OK, ALARM, or INSUFFICIENT_DATA (not enough data points to evaluate, which is normal right after creation).
# Create an alarm: CPU > 80% for 5 minutes
aws cloudwatch put-metric-alarm \
--alarm-name "high-cpu-web-server" \
--alarm-description "CPU utilization exceeded 80% for 5 minutes" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts" \
--ok-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts" \
--treat-missing-data missing # other options: ignore | breaching | notBreaching
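It's worth firing the notification path once by hand before you rely on it at 3 AM. set-alarm-state forces an alarm into a given state, and CloudWatch flips it back on the next evaluation of the real metric. A sketch wrapped in a hypothetical helper function:

```shell
# Hypothetical helper: force an alarm into ALARM to exercise its SNS actions.
# CloudWatch re-evaluates the underlying metric on the next period and
# resets the state, so this is safe to run against a live alarm.
trigger_test_alarm() {
  aws cloudwatch set-alarm-state \
    --alarm-name "$1" \
    --state-value ALARM \
    --state-reason "Manual test of notification path"
}
# Usage: trigger_test_alarm high-cpu-web-server
```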
Anomaly Detection Alarms
Instead of static thresholds, let CloudWatch learn a metric's normal pattern and alert on deviations. The second argument to ANOMALY_DETECTION_BAND is the band width in standard deviations; widen it to alert less often:
# Create an anomaly detection alarm. Note: the metric is defined inside
# --metrics; the API rejects top-level --namespace/--metric-name/--statistic
# when --metrics is used.
aws cloudwatch put-metric-alarm \
--alarm-name "anomalous-request-count" \
--alarm-description "Request count deviates from expected pattern" \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--comparison-operator GreaterThanUpperThreshold \
--threshold-metric-id ad1 \
--metrics '[
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "AWS/ApplicationELB",
"MetricName": "RequestCount",
"Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
},
"Period": 300,
"Stat": "Sum"
}
},
{
"Id": "ad1",
"Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
}
]' \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts"
Composite Alarms
Combine multiple alarms with AND/OR logic to reduce alert noise:
# Only fire when BOTH CPU is high AND error rate is elevated
aws cloudwatch put-composite-alarm \
--alarm-name "critical-app-issue" \
--alarm-description "High CPU combined with elevated error rate" \
--alarm-rule 'ALARM("high-cpu-web-server") AND ALARM("high-error-rate")' \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-critical"
CloudWatch Logs
Log Groups and Log Streams
Logs are organized into log groups (one per application or service) and log streams (one per instance or container):
# Create a log group with 30-day retention
aws logs create-log-group \
--log-group-name /app/web-api/production
aws logs put-retention-policy \
--log-group-name /app/web-api/production \
--retention-in-days 30
# Tail logs in real-time
aws logs tail /app/web-api/production \
--follow --since 10m \
--format short
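tail is for watching live. To search a past window from a script, filter-log-events takes a filter pattern plus epoch-millisecond timestamps. A sketch, with a made-up helper name:

```shell
# Hypothetical helper: fetch matching events from the last N minutes.
# filter-log-events expects epoch *milliseconds*, hence the * 1000.
recent_matches() {
  local group="$1" pattern="$2" minutes="${3:-10}"
  aws logs filter-log-events \
    --log-group-name "$group" \
    --filter-pattern "$pattern" \
    --start-time $(( ( $(date -u +%s) - minutes * 60 ) * 1000 )) \
    --query 'events[].message' \
    --output text
}
# Usage: recent_matches /app/web-api/production ERROR 30
```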
Metric Filters — Turn Logs Into Metrics
Extract counts or numeric values from log lines and publish them as CloudWatch metrics. One caveat: metric filters only apply to log events ingested after the filter is created; they are not retroactive:
# Count ERROR occurrences in logs
aws logs put-metric-filter \
--log-group-name /app/web-api/production \
--filter-name ErrorCount \
--filter-pattern "ERROR" \
--metric-transformations '[{
"metricName": "ApplicationErrors",
"metricNamespace": "MyApp/Production",
"metricValue": "1",
"defaultValue": 0
}]'
# Extract response time from JSON logs
# Log format: {"status":200,"responseTime":145,"path":"/api/users"}
aws logs put-metric-filter \
--log-group-name /app/web-api/production \
--filter-name ResponseTime \
--filter-pattern '{$.responseTime > 0}' \
--metric-transformations '[{
"metricName": "ResponseTime",
"metricNamespace": "MyApp/Production",
"metricValue": "$.responseTime"
}]'
CloudWatch Logs Insights
Logs Insights gives you a purpose-built query language for searching and analyzing log data. Think of it as SQL for logs:
# Find the 20 most recent errors with context
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
# Top 10 slowest API endpoints (among requests over 500 ms)
fields @timestamp, path, responseTime
| filter responseTime > 500
| stats avg(responseTime) as avgLatency,
max(responseTime) as maxLatency,
count(*) as requestCount
by path
| sort avgLatency desc
| limit 10
# Error volume per 5-minute interval, worst intervals first
filter @message like /ERROR|WARN/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc
# Find 5xx errors from ALB access logs (assumes logs parsed into these fields)
fields @timestamp, elb_status_code, target_status_code, request_url
| filter elb_status_code >= 500
| stats count(*) as count by request_url, target_status_code
| sort count desc
| limit 20
# Run a Logs Insights query from the CLI. start-query is asynchronous:
# capture the queryId, then poll get-query-results until status is Complete.
QUERY_ID=$(aws logs start-query \
--log-group-name /app/web-api/production \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 10' \
--query 'queryId' --output text)
aws logs get-query-results --query-id "$QUERY_ID"
Installing the CloudWatch Agent
The CloudWatch Agent collects memory, disk, and custom application metrics from EC2 instances:
# Install on Amazon Linux 2 / AL2023
sudo yum install -y amazon-cloudwatch-agent
# Create the agent configuration
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json << 'EOF'
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"metrics": {
"namespace": "CWAgent",
"append_dimensions": {
"InstanceId": "${aws:InstanceId}",
"AutoScalingGroupName": "${aws:AutoScalingGroupName}"
},
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent", "mem_available"],
"metrics_collection_interval": 60
},
"disk": {
"measurement": ["disk_used_percent", "disk_free"],
"resources": ["/", "/data"],
"metrics_collection_interval": 300
}
}
},
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app/application.log",
"log_group_name": "/app/web-api/production",
"log_stream_name": "{instance_id}",
"timezone": "UTC"
}
]
}
}
}
}
EOF
# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json
# Confirm it's running
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a status
CloudWatch Dashboards
Dashboards give you a single-pane view of your infrastructure:
# Create a dashboard with key metrics
aws cloudwatch put-dashboard \
--dashboard-name "Production-Overview" \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0, "width": 12, "height": 6,
"properties": {
"metrics": [
["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg",
{"stat": "Average", "label": "CPU %"}]
],
"period": 300,
"title": "EC2 CPU Utilization",
"yAxis": {"left": {"min": 0, "max": 100}}
}
},
{
"type": "log",
"x": 12, "y": 0, "width": 12, "height": 6,
"properties": {
"query": "fields @timestamp, @message\n| filter @message like /ERROR/\n| sort @timestamp desc\n| limit 20",
"region": "us-east-1",
"stacked": false,
"title": "Recent Errors",
"view": "table"
}
}
]
}'
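Inline JSON in a shell argument gets unwieldy fast. --dashboard-body also accepts a file:// path, which lets you keep the definition in version control and lint it before pushing. A sketch, with a trimmed-down widget and the aws call commented out so the snippet runs anywhere:

```shell
# Keep the dashboard definition in a reviewable file and validate it first.
cat > /tmp/dashboard.json << 'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg"]],
        "period": 300,
        "title": "EC2 CPU Utilization"
      }
    }
  ]
}
EOF
python3 -m json.tool /tmp/dashboard.json > /dev/null && echo "dashboard json ok"
# aws cloudwatch put-dashboard --dashboard-name "Production-Overview" \
#   --dashboard-body file:///tmp/dashboard.json
```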
Cost Optimization for Logging
CloudWatch Logs can get expensive fast. Here's how to keep costs under control:
| Strategy | Savings | Implementation |
|---|---|---|
| Set retention policies | 40-70% | Never keep logs forever; 30 days for most, 90 for compliance |
| Filter before shipping | 20-50% | Log only what matters at the agent level |
| Use log classes | Up to 50% | Infrequent Access class (create-log-group --log-group-class INFREQUENT_ACCESS) for archival logs |
| Export to S3 | 60-80% long-term | Use create-export-task for old logs |
| Reduce log verbosity | 30-60% | DEBUG in dev, WARN/ERROR in prod |
# Export old logs to S3 for cheap archival
aws logs create-export-task \
--log-group-name /app/web-api/production \
--from $(date -u -d '90 days ago' +%s)000 \
--to $(date -u -d '30 days ago' +%s)000 \
--destination "my-log-archive-bucket" \
--destination-prefix "cloudwatch-logs/web-api"
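Note the 000 appended to the timestamps above: create-export-task expects epoch milliseconds, and string-concatenating zeros is the quick conversion. Plain arithmetic is harder to get subtly wrong (GNU date assumed for -d):

```shell
# Millisecond timestamps via arithmetic instead of appending "000"
FROM=$(( $(date -u -d '90 days ago' +%s) * 1000 ))
TO=$(( $(date -u -d '30 days ago' +%s) * 1000 ))
echo "export window: $FROM .. $TO"
```

Then pass --from "$FROM" --to "$TO" to create-export-task.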
What's Next?
Monitoring tells you when things go wrong. But in a distributed system, services need to communicate reliably without tight coupling. Next, we'll explore SQS, SNS, and EventBridge — AWS's messaging services — and learn when to use queues, topics, and event buses.
This is Part 12 of our AWS series. The best monitoring system is the one that wakes you up 10 minutes before the outage, not 10 minutes after.
