18 posts tagged with "Monitoring"

Observability, logging, and monitoring

CloudWatch — Logs, Metrics, Alarms, and Dashboards That Save You at 3 AM

7 min read
Goel Academy
DevOps & Cloud Learning Hub

It's 3:17 AM. Your phone buzzes. "Site is down." You SSH into the server, tail the logs, see nothing obvious, check CPU — it's fine. Memory? Fine. Disk? 100% full. Log files ate the disk three hours ago, and nobody noticed because monitoring wasn't set up. CloudWatch exists so that you don't have to be the monitoring system. It collects metrics, aggregates logs, fires alarms, and pages you before users start tweeting.

Azure Monitor — Logs, Metrics, Alerts, and Application Insights

7 min read

It is 2 AM and your application is down. The on-call engineer opens the Azure portal, stares at a wall of services, and asks the worst question in operations: "Where do I even start looking?" Azure Monitor is the answer. It collects, analyzes, and acts on telemetry from every layer of your infrastructure — from VM CPU spikes to application exceptions to user click patterns. But only if you set it up properly before that 2 AM call.

Prometheus and Grafana — Set Up Production Monitoring in 15 Minutes

7 min read

You have read about the Golden Signals and the three pillars of observability. Now it is time to stop theorizing and start measuring. In this post, we will set up a complete monitoring stack: Prometheus for metrics collection, Grafana for visualization, node_exporter for system metrics, and Alertmanager for routing alerts to Slack. Everything runs locally with Docker Compose and follows production-ready patterns.

Monitoring 101 — Metrics, Logs, Traces, and the Golden Signals

7 min read

It is 3 AM. Your phone buzzes with a PagerDuty alert: "CPU usage above 90%." You drag yourself out of bed, SSH into the server, and discover the CPU spike was caused by a log rotation cron job that runs every night. It resolved itself two minutes later. This happens three times a week. You start ignoring alerts. Then one night, the database actually fills up and takes down production. Nobody notices for 47 minutes because the team has learned to silence their phones.