18 posts tagged with "Monitoring"

Observability, logging, and monitoring

Kubernetes Logging — EFK Stack, Loki, and Fluent Bit

· 6 min read
Goel Academy
DevOps & Cloud Learning Hub

A pod crashes at 3 AM and restarts. By the time you check in the morning, kubectl logs shows only the current container's output by default, and the crash logs may already be out of reach: kubectl logs --previous goes back only one restart, and nothing survives once the pod is deleted or rescheduled. Kubernetes does not persist logs beyond the lifetime of a pod, and on a busy cluster, even node-level logs can rotate away within hours. If you are not shipping logs to a central store, you are debugging with one eye closed.

Monitor Kubernetes with Prometheus and Grafana

· 6 min read
Goel Academy
DevOps & Cloud Learning Hub

Your cluster is running thirty microservices, and one of them is silently eating all the memory on node-3. By the time someone notices, the node is in NotReady state and pods are getting evicted left and right. Without proper monitoring, you are flying blind in production, and Kubernetes gives you very little visibility out of the box: no metrics history, no dashboards, and no alerting.

Incident Management — On-Call, Runbooks, and Blameless Postmortems

· 9 min read
Goel Academy
DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes. The checkout service is returning 500 errors, revenue is dropping, and the on-call engineer has no idea where to start. There is no runbook, no clear escalation path, and last time this happened the fix was "someone restarted the pod." This is what happens when you treat incident management as an afterthought. In this post, we will build an incident management process from the ground up — one that detects problems fast, resolves them faster, and actually prevents them from recurring.

SRE Principles — SLOs, Error Budgets, and Toil Reduction

· 9 min read
Goel Academy
DevOps & Cloud Learning Hub

Your service is "up." But is it reliable? Can you quantify exactly how reliable it is? Can you answer whether it is reliable enough for your users, and whether you are spending too much engineering time keeping it that way? Site Reliability Engineering gives you a framework to answer all of these questions with data instead of gut feelings. SRE was born at Google in 2003, and its principles now drive reliability practices at companies of every size.
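To make "quantify exactly how reliable it is" concrete, here is a minimal sketch of the error-budget arithmetic behind an availability SLO. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values taken from the post.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    # The error budget is whatever share of the window the SLO does not cover.
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, 30)
    print(f"Error budget: {budget:.1f} minutes per 30 days")
    print(f"Remaining after 10.8 minutes down: {budget_remaining(0.999, 30, 10.8):.0%}")
```

Under these assumptions, a 99.9% target over 30 days leaves about 43 minutes of allowed downtime, which turns "is it reliable enough?" into a number a team can track and spend.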

Docker Logging — From docker logs to ELK Stack

· 8 min read
Goel Academy
DevOps & Cloud Learning Hub

Your application logs are the single most important debugging tool you have. In a containerized world, those logs disappear when the container is removed — unless you have a logging strategy. Most teams start with docker logs and stop there. That works for a single container. It falls apart completely at 50 containers across 10 services.