18 posts tagged with "Monitoring"

Observability, logging, and monitoring

Kubernetes Logging — EFK Stack, Loki, and Fluent Bit

August 23, 2025 · 6 min read

DevOps & Cloud Learning Hub

A pod crashes at 3 AM, restarts, and by the time you check in the morning, kubectl logs shows only the current container's output. At best, kubectl logs --previous recovers the last terminated container, and once the pod is deleted or rescheduled, even that is gone. Kubernetes keeps container logs only on the node and only while the pod exists, and on a busy cluster, even node-level logs rotate away within hours. If you are not shipping logs to a central store, you are debugging with one eye closed.

Monitor Kubernetes with Prometheus and Grafana

August 9, 2025 · 6 min read

Goel Academy

DevOps & Cloud Learning Hub

Your cluster is running thirty microservices, and one of them is silently eating all the memory on node-3. By the time someone notices, the node is in NotReady state and pods are getting evicted left and right. Without proper monitoring, you are flying blind in production, and Kubernetes gives you very little visibility out of the box.

Linux Performance Tuning — CPU, Memory, and I/O Optimization

August 9, 2025 · 6 min read

Goel Academy

DevOps & Cloud Learning Hub

Your server handles 1,000 requests per second — here's how to push it to 10,000. Performance tuning isn't about guessing; it's about measuring, identifying bottlenecks, and making targeted changes to CPU scheduling, memory management, and disk I/O.

Incident Management — On-Call, Runbooks, and Blameless Postmortems

July 19, 2025 · 9 min read

Goel Academy

DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes. The checkout service is returning 500 errors, revenue is dropping, and the on-call engineer has no idea where to start. There is no runbook, no clear escalation path, and last time this happened the fix was "someone restarted the pod." This is what happens when you treat incident management as an afterthought. In this post, we will build an incident management process from the ground up — one that detects problems fast, resolves them faster, and actually prevents them from recurring.

SRE Principles — SLOs, Error Budgets, and Toil Reduction

July 12, 2025 · 9 min read

Goel Academy

DevOps & Cloud Learning Hub

Your service is "up." But is it reliable? Can you quantify exactly how reliable it is? Can you answer whether it is reliable enough for your users, and whether you are spending too much engineering time keeping it that way? Site Reliability Engineering gives you a framework to answer all of these questions with data instead of gut feelings. SRE was born at Google in 2003, and its principles now drive reliability practices at companies of every size.

Docker Logging — From docker logs to ELK Stack

July 12, 2025 · 8 min read

Goel Academy

DevOps & Cloud Learning Hub

Your application logs are the single most important debugging tool you have. In a containerized world, those logs disappear when the container is removed — unless you have a logging strategy. Most teams start with docker logs and stop there. That works for a single container. It falls apart completely at 50 containers across 10 services.