18 posts tagged with "Monitoring"

Observability, logging, and monitoring

MLOps and AIOps — DevOps for Machine Learning

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

An often-cited industry estimate holds that 87% of machine learning models never make it to production. The usual cause is not bad models but the enormous gap between a Jupyter notebook and a reliable production system. MLOps bridges that gap by applying DevOps principles to the ML lifecycle. AIOps flips the script by using AI to make operations smarter. Together, they represent the frontier of modern DevOps.

Monitor Docker Containers — cAdvisor, Prometheus, and Grafana

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

docker stats shows you what is happening right now. It does not show you what happened at 3 AM when response times spiked. It does not alert you when a container's memory is trending toward its limit. It does not graph CPU usage over the past week to help you right-size your resource limits. For real monitoring, you need metrics collection, storage, visualization, and alerting. The standard stack for Docker is cAdvisor + Prometheus + Grafana, and you can have it running in under fifteen minutes.
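
To make "trending toward its limit" concrete, here is a minimal sketch, assuming you already have a series of (timestamp, memory bytes) samples for one container, such as values scraped from cAdvisor. The function name `seconds_until_limit` and the sample data are hypothetical. It fits a straight line to the samples and extrapolates. Prometheus offers a comparable idea through its `predict_linear` function, so in practice you would more likely express this as an alert rule than as application code.

```python
from typing import List, Optional, Tuple


def seconds_until_limit(
    samples: List[Tuple[float, float]], limit_bytes: float
) -> Optional[float]:
    """Estimate seconds from the last sample until memory reaches the limit.

    Fits a least-squares line through (timestamp_seconds, memory_bytes)
    samples. Returns None when there are too few samples or when usage
    is flat or falling, since no crossing can be extrapolated.
    """
    if len(samples) < 2:
        return None

    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n

    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp

    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None  # not growing

    intercept = mean_m - slope * mean_t
    t_hit = (limit_bytes - intercept) / slope
    last_t = samples[-1][0]
    return max(0.0, t_hit - last_t)


if __name__ == "__main__":
    # Hypothetical samples: memory grows by 60 bytes per minute.
    history = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0)]
    print(seconds_until_limit(history, limit_bytes=400.0))
```

A linear fit is a deliberate simplification: it ignores garbage-collection sawtooths and sudden spikes, which is why a real setup pairs a trend alert like this with a plain threshold alert.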

Linux Troubleshooting Like a Pro — strace, lsof, tcpdump

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

The app works on staging but fails on production — here's the systematic way to find out why. Every seasoned SRE has a mental decision tree for production incidents. The tools are always the same: strace to see what a process is doing, lsof to see what files it has open, tcpdump to see what's on the wire, and ss to see socket state. Master these four and you can debug almost anything.

Observability vs Monitoring — Distributed Tracing with Jaeger and OpenTelemetry

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

When a user reports that checkout is slow, monitoring tells you that latency spiked. Observability tells you why — the payment service waited 3 seconds for a database query that normally takes 20ms because a missing index caused a full table scan on a table that grew past 10 million rows last Tuesday. That's the difference.
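
The idea behind a trace can be sketched with nothing but the standard library. The toy `Tracer` below is not the OpenTelemetry or Jaeger API; every name in it is hypothetical, and it only illustrates how nested, timed spans let you point at the step that actually consumed the time instead of just seeing that the whole request was slow.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Span:
    name: str
    start: float
    end: float = 0.0
    children: List["Span"] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return self.end - self.start

    @property
    def self_time(self) -> float:
        # Time spent in this span excluding its direct children.
        return self.duration - sum(c.duration for c in self.children)


class Tracer:
    """Toy in-process tracer: records a tree of timed spans."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._current: Optional[Span] = None
        self.roots: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._current
        s = Span(name=name, start=self._clock())
        (parent.children if parent else self.roots).append(s)
        self._current = s
        try:
            yield s
        finally:
            s.end = self._clock()
            self._current = parent


def slowest_by_self_time(root: Span) -> Span:
    """Walk the span tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        s = stack.pop()
        if s.self_time > best.self_time:
            best = s
        stack.extend(s.children)
    return best


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("checkout"):
        with tracer.span("payment"):
            with tracer.span("db.query"):
                time.sleep(0.05)  # stand-in for the slow query
    print(slowest_by_self_time(tracer.roots[0]).name)
```

A real tracing system adds what this sketch leaves out: propagating trace context across service boundaries, attaching attributes such as the query text, and exporting spans to a backend for search and visualization.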

Chaos Engineering — Break Your System Before It Breaks You

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

Netflix famously runs a tool called Chaos Monkey that randomly terminates production instances on purpose. It sounds insane until you realize their systems stayed up through the 2017 AWS S3 outage while a large share of the internet went down. That's chaos engineering: deliberately injecting failure so your systems learn to handle it gracefully.
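
Failure injection can be illustrated at a much smaller scale than terminating servers. The sketch below is not how Chaos Monkey works; it is a hypothetical in-process wrapper (`with_chaos`, `call_with_retry`, and `InjectedFault` are all made-up names) that makes a call fail with a chosen probability so you can check that the caller's retry logic copes.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised by the chaos wrapper in place of a real failure."""


def with_chaos(
    fn: Callable[[], T],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[], T]:
    """Wrap fn so each call fails with probability failure_rate."""
    rng = rng or random.Random()

    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return fn()

    return wrapped


def call_with_retry(fn: Callable[[], T], attempts: int = 3) -> T:
    """Call fn up to `attempts` times, retrying only on injected faults."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn()
        except InjectedFault as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    # Seeded RNG so the experiment is repeatable.
    flaky = with_chaos(lambda: "order placed", failure_rate=0.3,
                       rng=random.Random(42))
    try:
        print(call_with_retry(flaky, attempts=5))
    except RuntimeError as err:
        print(f"resilience gap found: {err}")
```

Passing a seeded `random.Random` keeps an experiment reproducible, which matters when you want to rerun the exact failure sequence after fixing the code under test.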