
Terraform at Scale — Monorepo vs Polyrepo, and State Blast Radius

7 min read
Goel Academy
DevOps & Cloud Learning Hub

Managing 10 Terraform resources is straightforward. Managing 10,000 across 20 teams is a different game entirely. At scale, every decision compounds: where you store code, how you split state, who can approve applies, and how CI/CD pipelines run. Get these wrong and you end up with 45-minute plans, state files that lock out entire teams, and a single bad merge that takes down production networking. This post covers the architectural decisions that separate "Terraform works for my team" from "Terraform works for my organization."

The Scaling Wall

Most teams hit the scaling wall around the same milestones:

  • 5-10 engineers using Terraform — merge conflicts on state become frequent.
  • 500+ resources in a single state — plans take minutes, applies become risky.
  • 3+ teams sharing modules — versioning and breaking changes cause friction.
  • Multiple environments — copy-paste configurations drift apart.

The fix is not a single tool. It is a combination of repo structure, state architecture, module strategy, and CI/CD design.

Monorepo vs Polyrepo

The first structural decision is where the Terraform code lives.

Dimension             | Monorepo                                              | Polyrepo
----------------------|-------------------------------------------------------|----------------------------------------------------
Code location         | Single repo, folder per component                     | One repo per component or team
Discoverability       | Everything in one place, easy to search               | Spread across repos, needs catalog
Dependency management | Internal references, shared modules in-tree           | Versioned module registry, explicit versions
CI complexity         | Needs path-based triggers to avoid running everything | Simple per-repo pipelines
Code review           | Cross-team visibility, can review everything          | Isolated reviews, team autonomy
Access control        | Requires CODEOWNERS, branch protection per path       | Repo-level permissions
Best for              | Small-medium orgs, platform teams, tight coupling     | Large orgs, autonomous teams, compliance boundaries
Risk                  | One bad merge affects everything                      | Module version sprawl, inconsistency
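The per-path access control a monorepo needs is usually enforced with a CODEOWNERS file plus branch protection. A minimal sketch, with hypothetical team handles:

```
# .github/CODEOWNERS — team names are assumptions for illustration
/modules/               @my-org/platform-team
/environments/prod/     @my-org/platform-team @my-org/sre-team
/environments/*/data/   @my-org/data-team
/global/iam/            @my-org/security-team
```

With "require review from code owners" enabled on the main branch, a change under environments/prod/ cannot merge without a platform-team or SRE approval, even though the whole repo is open for pull requests.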

Monorepo Structure

infrastructure/
├── modules/                 # Shared modules
│   ├── vpc/
│   ├── eks-cluster/
│   └── rds-instance/
├── environments/
│   ├── dev/
│   │   ├── networking/      # State: dev-networking
│   │   ├── compute/         # State: dev-compute
│   │   └── data/            # State: dev-data
│   ├── staging/
│   └── prod/
├── global/                  # Account-level resources
│   ├── iam/
│   └── dns/
└── .github/workflows/
    └── terraform.yml        # Path-based CI triggers

Polyrepo Structure

# Separate repositories:
terraform-modules            # Shared module library
terraform-networking         # VPCs, subnets, peering
terraform-compute            # EKS, EC2, ASG
terraform-data               # RDS, ElastiCache, S3
terraform-platform-team-a    # Team A's services
terraform-platform-team-b    # Team B's services

In practice, many organizations use a hybrid: a monorepo for shared modules and platform infrastructure, with polyrepos for team-specific service configurations.

State Blast Radius

State blast radius is the amount of infrastructure affected when something goes wrong with a single state file. A state file with 2,000 resources means a corrupted state, a bad apply, or a long lock can impact everything in it.

Signs Your State Is Too Big

# Check your state size
terraform state list | wc -l

# If this number is > 200, consider splitting
# If plan takes > 2 minutes, definitely split

A healthy state file manages 50-200 resources. Beyond that, you pay in plan duration, lock contention, and risk.

Splitting State by Concern

The best approach is to split state along blast radius boundaries — components that can fail independently should live in separate state files:

# BEFORE: One giant state
prod/
└── main.tf # VPC + EKS + RDS + IAM + DNS + S3 = 800 resources

# AFTER: Split by lifecycle and blast radius
prod/
├── networking/ # VPC, subnets, NAT gateways (rarely changes)
├── compute/ # EKS cluster, node groups (changes weekly)
├── data/ # RDS, ElastiCache (rarely changes, high risk)
├── iam/ # Roles, policies (changes with new services)
└── applications/ # Service-specific resources (changes daily)

Each directory has its own state file, its own lock, and its own blast radius. A bad apply in applications/ cannot touch networking or databases.
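Each split directory then points its backend at its own state key. A sketch of what prod/networking's backend block might look like (bucket, table, and key names are assumptions):

```hcl
# prod/networking/backend.tf — one state key, one lock, per component
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"   # per-key locking, so compute and networking never block each other
  }
}
```

Because the lock is taken per key, a long-running apply in compute/ no longer queues behind anyone touching networking/.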

Cross-State References

Split states need to share data. The simplest mechanism is the terraform_remote_state data source, which reads another state's outputs; provider data sources that look resources up by tag are a looser-coupled alternative:

# In compute/main.tf — reference networking outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "prod-cluster"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
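Reading remote state couples the consumer to the producer's state layout and backend location. An alternative, sketched here with assumed tag names, is to look the infrastructure up directly through provider data sources:

```hcl
# Tag-based lookup — no dependency on the networking state file (tag values assumed)
data "aws_vpc" "main" {
  tags = {
    Environment = "prod"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
  tags = {
    Tier = "private"
  }
}
```

The consumer would then use data.aws_subnets.private.ids. The trade-off: the networking team must treat its tags as a stable interface instead of its outputs.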

Terragrunt for DRY Configurations

When you split state into many directories, configuration duplication explodes. Terragrunt solves this by letting you define backend config, provider config, and variable values once:

# terragrunt.hcl (root)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "my-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = "${basename(get_terragrunt_dir())}"
    }
  }
}
EOF
}

# prod/compute/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../modules//eks-cluster"
}

dependency "networking" {
  config_path = "../networking"
}

inputs = {
  cluster_name = "prod-cluster"
  subnet_ids   = dependency.networking.outputs.private_subnet_ids
  node_count   = 5
}

Now terragrunt apply in prod/compute/ generates the backend config, provider config, and passes the right variables — no duplication.

CI/CD Patterns at Scale

Path-Based Triggers (Monorepo)

Only run Terraform for directories that changed:

# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      directories: ${{ steps.changes.outputs.directories }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so origin/main...HEAD can be diffed
      - id: changes
        run: |
          DIRS=$(git diff --name-only origin/main...HEAD \
            | grep '^environments/' \
            | cut -d'/' -f1-3 \
            | sort -u \
            | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "directories=$DIRS" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    strategy:
      matrix:
        directory: ${{ fromJson(needs.detect-changes.outputs.directories) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          cd ${{ matrix.directory }}
          terraform init
          terraform plan -out=plan.tfplan
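The grep/cut/sort pipeline in detect-changes is easy to sanity-check locally. A sketch with sample file paths standing in for the git diff output:

```shell
# Map changed files to component directories (sample input, no git needed)
printf '%s\n' \
  'environments/prod/networking/main.tf' \
  'environments/prod/networking/vars.tf' \
  'environments/dev/compute/main.tf' \
  'modules/vpc/main.tf' \
  | grep '^environments/' \
  | cut -d'/' -f1-3 \
  | sort -u
# → environments/dev/compute
# → environments/prod/networking
```

Two changed files in the same component collapse into one matrix entry, and module-only changes produce none — the matrix fans out exactly one plan job per touched directory.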

PR-Based Workflow

The safest pattern at scale: plan on PR, apply on merge.

1. Engineer opens PR with Terraform changes
2. CI runs terraform plan, posts output as PR comment
3. CODEOWNERS review the plan output
4. PR merged → CI runs terraform apply on main branch
5. Apply output posted to Slack/Teams channel

This ensures every infrastructure change is reviewed before it touches production.
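The merge half of that workflow can live in a second pipeline triggered on pushes to main. A sketch, with the target directory hardcoded for brevity (a real pipeline would reuse the same change-detection job as the plan workflow):

```yaml
# .github/workflows/terraform-apply.yml — hypothetical apply-on-merge sketch
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'environments/**'

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          cd environments/prod/networking   # real pipelines derive this from changed paths
          terraform init
          terraform apply -auto-approve
```

Restricting this workflow's cloud credentials to the main branch (for example via OIDC trust conditions) is what makes "apply only on merge" enforceable rather than just conventional.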

Module Registry for Sharing

At scale, shared modules need versioning. A private module registry (Terraform Cloud, Artifactory, or S3-backed) lets teams consume modules with explicit version pins:

module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "~> 3.0"

  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

Version constraints prevent breaking changes from cascading. The platform team publishes v4.0.0 with breaking changes, but existing consumers stay on ~> 3.0 until they are ready to upgrade.
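Terraform's constraint operators make each consumer's upgrade policy explicit. The common forms (standard Terraform syntax, illustrative version numbers):

```hcl
version = "3.4.1"           # exact pin — no automatic updates
version = "~> 3.4"          # >= 3.4.0, < 4.0.0 — accept minor releases
version = "~> 3.4.1"        # >= 3.4.1, < 3.5.0 — accept patch releases only
version = ">= 3.0, < 4.0"   # explicit range
```

Pessimistic constraints (`~>`) are the usual default for shared modules: consumers pick up fixes automatically but never cross a major-version boundary unreviewed.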

Organizational Patterns

The most effective pattern at scale is a platform team model:

Role          | Responsibility
--------------|-------------------------------------------------------------------
Platform team | Owns shared modules, provider configs, CI pipelines, state backend
Product teams | Consume modules, define service-specific infrastructure
Security team | Reviews module changes, owns IAM and policy modules
SRE team      | Owns monitoring, alerting, and reliability modules

Product teams write simple configurations that compose platform-provided modules:

# Product team's Terraform — simple, constrained, safe
module "my_service" {
  source  = "app.terraform.io/my-org/microservice/aws"
  version = "~> 2.0"

  service_name    = "order-api"
  team            = "commerce"
  container_image = "order-api:v1.2.3"
  cpu             = 512
  memory          = 1024
  desired_count   = 3
}

The platform module handles VPC placement, security groups, IAM roles, logging, monitoring, and DNS — all with organizational standards baked in. Product teams cannot misconfigure networking or skip encryption because the module does not expose those knobs.
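Inside the platform module, guardrails like this are commonly enforced with variable validation blocks. A sketch, with assumed allowed values:

```hcl
# In the platform module — reject unsupported sizes at plan time (allowed values assumed)
variable "cpu" {
  type        = number
  description = "Task CPU units"
  validation {
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "cpu must be one of the supported task sizes."
  }
}
```

A product team passing cpu = 300 gets a clear error at plan time, long before anything reaches production.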

Remote Execution for Consistency

Running Terraform on developer laptops leads to inconsistency: different provider versions, different OS behaviors, stale credentials. Remote execution solves this:

# Using Terraform Cloud for remote execution
terraform {
  cloud {
    organization = "my-org"
    workspaces {
      tags = ["prod", "networking"]
    }
  }
}

With remote execution, every plan and apply runs in the same environment with the same provider versions, the same credentials, and full audit logging. No more "it worked on my machine."

Closing Note

Scaling Terraform is not about finding a magic tool. It is about making deliberate architectural decisions: split state to limit blast radius, choose a repo structure that matches your team topology, version modules so changes are controlled, and run everything through CI so no one is applying from a laptop. Start with state splitting — that alone eliminates the worst pain. Then layer in Terragrunt or Terraform Cloud as your organization grows. The goal is not to manage more resources; it is to let more teams manage their own resources safely.