151 posts tagged with "DevOps"

DevOps practices, CI/CD, and automation

Incident Management — On-Call, Runbooks, and Blameless Postmortems

· 9 min read
Goel Academy
DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes. The checkout service is returning 500 errors, revenue is dropping, and the on-call engineer has no idea where to start. There is no runbook, no clear escalation path, and last time this happened the fix was "someone restarted the pod." This is what happens when you treat incident management as an afterthought. In this post, we will build an incident management process from the ground up — one that detects problems fast, resolves them faster, and actually prevents them from recurring.

Kubernetes Workloads — Jobs, CronJobs, DaemonSets, and StatefulSets

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

Deployments are the workhorse of Kubernetes, but not every workload is a long-running web server. You may need to run a database migration once, process a queue of images every night, collect logs from every node, or deploy a database cluster with stable identities. Kubernetes has a dedicated workload type for each of these patterns: Jobs, CronJobs, DaemonSets, and StatefulSets, respectively.

Terraform Provisioners — When (and When Not) to Use Them

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

Terraform is a provisioning tool, not a configuration management tool. It creates infrastructure — VMs, networks, databases — but it was never designed to install packages, configure services, or manage files on running machines. Provisioners exist as an escape hatch for those cases, and HashiCorp explicitly recommends using them only as a last resort.

SRE Principles — SLOs, Error Budgets, and Toil Reduction

· 9 min read
Goel Academy
DevOps & Cloud Learning Hub

Your service is "up." But is it reliable? Can you quantify exactly how reliable it is? Can you answer whether it is reliable enough for your users, and whether you are spending too much engineering time keeping it that way? Site Reliability Engineering gives you a framework to answer all of these questions with data instead of gut feelings. SRE was born at Google in 2003, and its principles now drive reliability practices at companies of every size.
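
To make "quantify exactly how reliable it is" concrete, here is a minimal sketch of the arithmetic behind an error budget. It assumes an availability SLO expressed as a fraction (for example 0.999) and a 30-day window. The function names `error_budget_minutes` and `budget_consumed` are hypothetical, chosen for illustration rather than taken from any SRE tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    Assumption: slo_target is a fraction strictly between 0 and 1, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Return the fraction of the error budget already spent (1.0 means exhausted).

    Assumption: the SLO is request-based, so the budget is the number of
    requests allowed to fail within the window.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    return failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_consumed(1_000_000, 250, 0.999), 2))
```

A budget expressed this way turns "reliable enough" into a number the team can spend: while budget remains, shipping can continue; once it is gone, reliability work takes priority.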