CloudWatch — Logs, Metrics, Alarms, and Dashboards That Save You at 3 AM

7 min read
Goel Academy
DevOps & Cloud Learning Hub

It's 3:17 AM. Your phone buzzes. "Site is down." You SSH into the server, tail the logs, see nothing obvious, check CPU — it's fine. Memory? Fine. Disk? 100% full. Log files ate the disk three hours ago, and nobody noticed because monitoring wasn't set up. CloudWatch exists so that you don't have to be the monitoring system. It collects metrics, aggregates logs, fires alarms, and pages you before users start tweeting.

CloudWatch Metrics — Default vs Custom

Every AWS service automatically sends default metrics to CloudWatch at no extra cost. EC2 sends CPU utilization, network I/O, and disk I/O (every 5 minutes by default, every minute with paid detailed monitoring). RDS sends connections, read/write latency, and free storage. Lambda sends invocations, duration, and errors.

But default metrics have limits. EC2 doesn't report memory or disk usage by default — the hypervisor can't see inside the OS. For those, you need custom metrics.

Namespaces and Dimensions

Metrics are organized into namespaces (like AWS/EC2, AWS/RDS) and filtered by dimensions (like InstanceId, DBInstanceIdentifier):

# List all available metrics for EC2
aws cloudwatch list-metrics \
--namespace AWS/EC2 \
--query 'Metrics[].MetricName' \
--output table

# Get CPU utilization for a specific instance
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average Maximum \
--output table
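Under the hood, `--period 300` buckets the raw datapoints into 5-minute windows and computes the requested statistic per window. A minimal Python sketch of that aggregation, using hypothetical CPU samples:

```python
from collections import defaultdict

def aggregate(datapoints, period):
    """Bucket (timestamp, value) pairs into fixed windows and compute
    the Average and Maximum statistic per window."""
    buckets = defaultdict(list)
    for ts, value in datapoints:
        buckets[ts - ts % period].append(value)
    return {
        window: {"Average": sum(vals) / len(vals), "Maximum": max(vals)}
        for window, vals in sorted(buckets.items())
    }

# CPU samples at 60-second resolution, aggregated into 300-second windows
samples = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 90.0), (360, 70.0)]
print(aggregate(samples, 300))
# {0: {'Average': 20.0, 'Maximum': 30.0}, 300: {'Average': 80.0, 'Maximum': 90.0}}
```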

Publishing Custom Metrics

Use put-metric-data to send application-specific metrics:

# Push a custom metric — active user sessions
aws cloudwatch put-metric-data \
--namespace "MyApp/Production" \
--metric-name ActiveSessions \
--value 1247 \
--unit Count \
--dimensions Environment=production,Service=web-api

# Push with high-resolution (1-second granularity)
aws cloudwatch put-metric-data \
--namespace "MyApp/Production" \
--metric-name RequestLatency \
--value 45.2 \
--unit Milliseconds \
--storage-resolution 1

CloudWatch Alarms

Alarms watch a metric and trigger actions when it crosses a threshold. An alarm is always in one of three states: OK, ALARM, or INSUFFICIENT_DATA.
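The state machine is driven by `--threshold`, `--evaluation-periods`, and (optionally) `--datapoints-to-alarm`. A simplified sketch of the evaluation logic, ignoring missing-data handling:

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm=None):
    """Simplified alarm evaluation over the most recent N periods.

    Real CloudWatch also applies the treat-missing-data policy; this
    sketch assumes every period has a datapoint.
    """
    datapoints_to_alarm = datapoints_to_alarm or evaluation_periods
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

print(evaluate_alarm([70, 85], threshold=80, evaluation_periods=1))  # ALARM
print(evaluate_alarm([85, 70], threshold=80, evaluation_periods=1))  # OK
```

With `--evaluation-periods 3 --datapoints-to-alarm 2`, a single noisy spike no longer pages anyone; two out of three bad periods are required.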

# Create an alarm: CPU > 80% for 5 minutes
aws cloudwatch put-metric-alarm \
--alarm-name "high-cpu-web-server" \
--alarm-description "CPU utilization exceeded 80% for 5 minutes" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts" \
--ok-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts" \
--treat-missing-data missing

Anomaly Detection Alarms

Instead of static thresholds, let CloudWatch learn the normal pattern and alert on deviations:

# Create an anomaly detection alarm
aws cloudwatch put-metric-alarm \
--alarm-name "anomalous-request-count" \
--alarm-description "Request count deviates from expected pattern" \
--namespace AWS/ApplicationELB \
--metric-name RequestCount \
--dimensions Name=LoadBalancer,Value=app/my-alb/abc123 \
--evaluation-periods 3 \
--datapoints-to-alarm 2 \
--comparison-operator GreaterThanUpperThreshold \
--threshold-metric-id ad1 \
--metrics '[
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "AWS/ApplicationELB",
"MetricName": "RequestCount",
"Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/abc123"}]
},
"Period": 300,
"Stat": "Sum"
}
},
{
"Id": "ad1",
"Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
}
]' \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:ops-alerts"
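The `2` in `ANOMALY_DETECTION_BAND(m1, 2)` is the band width in standard deviations. CloudWatch trains a much richer model (seasonality, trends), but the core intuition can be sketched with simple statistics over hypothetical request counts:

```python
import statistics

def anomaly_band(history, width=2):
    """Expected range as mean +/- width standard deviations.

    Only the intuition behind ANOMALY_DETECTION_BAND(m1, 2); the real
    model also learns hourly and weekly patterns.
    """
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return mean - width * stdev, mean + width * stdev

# request counts per 5-minute period (hypothetical)
history = [100, 110, 95, 105, 90, 100, 108, 97]
low, high = anomaly_band(history)
print(f"normal range: {low:.1f}-{high:.1f}")  # a spike to 300 breaches the band
```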

Composite Alarms

Combine multiple alarms with AND/OR logic to reduce alert noise:

# Only fire when BOTH CPU is high AND error rate is elevated
aws cloudwatch put-composite-alarm \
--alarm-name "critical-app-issue" \
--alarm-description "High CPU combined with elevated error rate" \
--alarm-rule 'ALARM("high-cpu-web-server") AND ALARM("high-error-rate")' \
--alarm-actions "arn:aws:sns:us-east-1:123456789012:pagerduty-critical"
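The `--alarm-rule` string is just a small boolean expression over child alarm states. A toy evaluator for the common `ALARM(...) AND/OR ALARM(...)` form (real rules also support `OK()`, `INSUFFICIENT_DATA()`, `NOT`, and parentheses):

```python
import re

def evaluate_rule(rule, states):
    """Evaluate a simplified composite-alarm rule against child alarm states."""
    # Replace each ALARM("name") with True/False based on the child's state
    expr = re.sub(
        r'ALARM\("([^"]+)"\)',
        lambda m: str(states.get(m.group(1)) == "ALARM"),
        rule,
    )
    expr = expr.replace("AND", "and").replace("OR", "or")
    return eval(expr)  # fine for this toy sketch; never eval untrusted input

states = {"high-cpu-web-server": "ALARM", "high-error-rate": "OK"}
print(evaluate_rule('ALARM("high-cpu-web-server") AND ALARM("high-error-rate")', states))
# False: both conditions must hold before the composite alarm fires
```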

CloudWatch Logs

Log Groups and Log Streams

Logs are organized into log groups (one per application or service) and log streams (one per instance or container):

# Create a log group with 30-day retention
aws logs create-log-group \
--log-group-name /app/web-api/production

aws logs put-retention-policy \
--log-group-name /app/web-api/production \
--retention-in-days 30

# Tail logs in real-time
aws logs tail /app/web-api/production \
--follow --since 10m \
--format short

Metric Filters — Turn Logs Into Metrics

Extract numeric data from log lines and push them as CloudWatch metrics:

# Count ERROR occurrences in logs
aws logs put-metric-filter \
--log-group-name /app/web-api/production \
--filter-name ErrorCount \
--filter-pattern "ERROR" \
--metric-transformations '[{
"metricName": "ApplicationErrors",
"metricNamespace": "MyApp/Production",
"metricValue": "1",
"defaultValue": 0
}]'

# Extract response time from JSON logs
# Log format: {"status":200,"responseTime":145,"path":"/api/users"}
aws logs put-metric-filter \
--log-group-name /app/web-api/production \
--filter-name ResponseTime \
--filter-pattern '{$.responseTime > 0}' \
--metric-transformations '[{
"metricName": "ResponseTime",
"metricNamespace": "MyApp/Production",
"metricValue": "$.responseTime"
}]'
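To see what these two filters actually match before shipping logs, you can replay them locally. A sketch that mimics both patterns against raw log lines:

```python
import json

def apply_filters(log_lines):
    """Mimic the two metric filters above on raw log lines (local sketch)."""
    error_count = 0
    response_times = []
    for line in log_lines:
        if "ERROR" in line:           # filter pattern "ERROR"
            error_count += 1
        try:
            event = json.loads(line)  # filter pattern {$.responseTime > 0}
            if event.get("responseTime", 0) > 0:
                response_times.append(event["responseTime"])
        except json.JSONDecodeError:
            pass
    return error_count, response_times

lines = [
    'ERROR failed to connect to database',
    '{"status":200,"responseTime":145,"path":"/api/users"}',
    '{"status":500,"responseTime":900,"path":"/api/orders"}',
]
print(apply_filters(lines))  # (1, [145, 900])
```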

CloudWatch Logs Insights

Logs Insights provides a purpose-built query language for searching and analyzing log data. Think of it as SQL for your logs:

# Find the 20 most recent errors with context
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

# Top 10 slowest API endpoints in the last hour
fields @timestamp, path, responseTime
| filter responseTime > 500
| stats avg(responseTime) as avgLatency,
max(responseTime) as maxLatency,
count(*) as requestCount
by path
| sort avgLatency desc
| limit 10

# Error rate per 5-minute interval
fields @timestamp, @message
| filter @message like /ERROR|WARN/
| stats count(*) as errorCount by bin(5m)
| sort @timestamp desc

# Find 5xx errors from ALB access logs
fields @timestamp, elb_status_code, target_status_code, request_url
| filter elb_status_code >= 500
| stats count(*) as count by request_url, target_status_code
| sort count desc
| limit 20

# Run a Logs Insights query from the CLI
aws logs start-query \
--log-group-name /app/web-api/production \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 10'

# start-query runs asynchronously and returns a queryId; fetch results with:
# aws logs get-query-results --query-id <queryId>
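The `bin(5m)` function in the error-rate query simply truncates each timestamp down to its 5-minute bucket. A local Python sketch of that grouping, using hypothetical error timestamps:

```python
from collections import Counter
from datetime import datetime, timezone

def bin_5m(timestamps):
    """Group ISO timestamps into 5-minute buckets, like bin(5m) in Logs Insights."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
        bucket = t.replace(minute=t.minute - t.minute % 5, second=0, microsecond=0)
        counts[bucket.isoformat()] += 1
    return dict(counts)

errors = ["2024-01-15T03:01:10", "2024-01-15T03:04:59", "2024-01-15T03:07:30"]
print(bin_5m(errors))
# the first two land in the 03:00 bucket, the third in 03:05
```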

Installing the CloudWatch Agent

The CloudWatch Agent collects memory, disk, and custom application metrics from EC2 instances:

# Install on Amazon Linux 2 / AL2023
sudo yum install -y amazon-cloudwatch-agent

# Create the agent configuration
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/config.json << 'EOF'
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"metrics": {
"namespace": "CWAgent",
"append_dimensions": {
"InstanceId": "${aws:InstanceId}",
"AutoScalingGroupName": "${aws:AutoScalingGroupName}"
},
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent", "mem_available"],
"metrics_collection_interval": 60
},
"disk": {
"measurement": ["disk_used_percent", "disk_free"],
"resources": ["/", "/data"],
"metrics_collection_interval": 300
}
}
},
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app/application.log",
"log_group_name": "/app/web-api/production",
"log_stream_name": "{instance_id}",
"timezone": "UTC"
}
]
}
}
}
}
EOF

# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m ec2 \
-s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json

CloudWatch Dashboards

Dashboards give you a single-pane view of your infrastructure:

# Create a dashboard with key metrics
aws cloudwatch put-dashboard \
--dashboard-name "Production-Overview" \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0, "width": 12, "height": 6,
"properties": {
"metrics": [
["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg",
{"stat": "Average", "label": "CPU %"}]
],
"period": 300,
"title": "EC2 CPU Utilization",
"yAxis": {"left": {"min": 0, "max": 100}}
}
},
{
"type": "log",
"x": 12, "y": 0, "width": 12, "height": 6,
"properties": {
"query": "fields @timestamp, @message\n| filter @message like /ERROR/\n| sort @timestamp desc\n| limit 20",
"region": "us-east-1",
"stacked": false,
"title": "Recent Errors",
"view": "table"
}
}
]
}'
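Because the dashboard body is plain JSON, it's often easier to generate than to hand-edit, especially once you have many widgets. A sketch of a small builder (the widget helper and its parameters are illustrative, not an AWS API):

```python
import json

def metric_widget(title, namespace, metric, dim_name, dim_value, x=0, y=0):
    """Build one metric-widget dict for a CloudWatch dashboard body."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "metrics": [[namespace, metric, dim_name, dim_value]],
            "period": 300,
            "title": title,
        },
    }

body = {"widgets": [
    metric_widget("EC2 CPU", "AWS/EC2", "CPUUtilization",
                  "AutoScalingGroupName", "web-asg"),
    metric_widget("ALB Requests", "AWS/ApplicationELB", "RequestCount",
                  "LoadBalancer", "app/my-alb/abc123", x=12),
]}
# pass json.dumps(body) as --dashboard-body to put-dashboard
print(json.dumps(body)[:80], "...")
```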

Cost Optimization for Logging

CloudWatch Logs can get expensive fast. Here's how to keep costs under control:

| Strategy | Savings | Implementation |
|---|---|---|
| Set retention policies | 40-70% | Never keep logs forever; 30 days for most, 90 for compliance |
| Filter before shipping | 20-50% | Log only what matters at the agent level |
| Use log classes | Up to 50% | Infrequent Access class for archival logs |
| Export to S3 | 60-80% long-term | Use create-export-task for old logs |
| Reduce log verbosity | 30-60% | DEBUG in dev, WARN/ERROR in prod |

# Export old logs to S3 for cheap archival
aws logs create-export-task \
--log-group-name /app/web-api/production \
--from $(date -u -d '90 days ago' +%s)000 \
--to $(date -u -d '30 days ago' +%s)000 \
--destination "my-log-archive-bucket" \
--destination-prefix "cloudwatch-logs/web-api"
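A quick back-of-envelope calculator shows why retention policies matter. The per-GB prices below are illustrative assumptions (roughly us-east-1 ballpark); check current AWS pricing before relying on the numbers:

```python
def monthly_log_cost(gb_per_day, retention_days,
                     ingest_per_gb=0.50, store_per_gb_month=0.03):
    """Rough monthly CloudWatch Logs bill. Prices are assumed, not official."""
    ingestion = gb_per_day * 30 * ingest_per_gb
    # steady-state stored volume is roughly daily volume x retention window
    storage = gb_per_day * retention_days * store_per_gb_month
    return ingestion + storage

# 10 GB/day: keeping logs for a year vs a 30-day retention policy
print(round(monthly_log_cost(10, 365), 2))  # 259.5
print(round(monthly_log_cost(10, 30), 2))   # 159.0
```

Ingestion dominates at short retention, which is why filtering before shipping and reducing verbosity are on the table above at all.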

What's Next?

Monitoring tells you when things go wrong. But in a distributed system, services need to communicate reliably without tight coupling. Next, we'll explore SQS, SNS, and EventBridge — AWS's messaging services — and learn when to use queues, topics, and event buses.


This is Part 12 of our AWS series. The best monitoring system is the one that wakes you up 10 minutes before the outage, not 10 minutes after.