Terraform at Scale — Monorepo vs Polyrepo, and State Blast Radius
Managing 10 Terraform resources is straightforward. Managing 10,000 across 20 teams is a different game entirely. At scale, every decision compounds: where you store code, how you split state, who can approve applies, and how CI/CD pipelines run. Get these wrong and you end up with 45-minute plans, state files that lock out entire teams, and a single bad merge that takes down production networking. This post covers the architectural decisions that separate "Terraform works for my team" from "Terraform works for my organization."
The Scaling Wall
Most teams hit the scaling wall around the same milestones:
- 5-10 engineers using Terraform — merge conflicts on code and lock contention on shared state become frequent.
- 500+ resources in a single state — plans take minutes, applies become risky.
- 3+ teams sharing modules — versioning and breaking changes cause friction.
- Multiple environments — copy-paste configurations drift apart.
The fix is not a single tool. It is a combination of repo structure, state architecture, module strategy, and CI/CD design.
Monorepo vs Polyrepo
The first structural decision is where the Terraform code lives.
| Dimension | Monorepo | Polyrepo |
|---|---|---|
| Code location | Single repo, folder per component | One repo per component or team |
| Discoverability | Everything in one place, easy to search | Spread across repos, needs catalog |
| Dependency management | Internal references, shared modules in-tree | Versioned module registry, explicit versions |
| CI complexity | Needs path-based triggers to avoid running everything | Simple per-repo pipelines |
| Code review | Cross-team visibility, can review everything | Isolated reviews, team autonomy |
| Access control | Requires CODEOWNERS, branch protection per path | Repo-level permissions |
| Best for | Small-medium orgs, platform teams, tight coupling | Large orgs, autonomous teams, compliance boundaries |
| Risk | One bad merge affects everything | Module version sprawl, inconsistency |
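In a monorepo, the per-path access control mentioned above is usually enforced with a CODEOWNERS file plus branch protection. A minimal sketch — the team handles and path layout are illustrative, not prescriptive:

```text
# .github/CODEOWNERS — hypothetical team handles
/modules/              @my-org/platform-team
/environments/prod/    @my-org/platform-team @my-org/sre-team
/environments/dev/     @my-org/all-engineers
/global/iam/           @my-org/security-team
```

Combined with a branch protection rule requiring code-owner review, this gives a monorepo roughly the same approval boundaries a polyrepo gets from repo-level permissions.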
Monorepo Structure
infrastructure/
├── modules/                  # Shared modules
│   ├── vpc/
│   ├── eks-cluster/
│   └── rds-instance/
├── environments/
│   ├── dev/
│   │   ├── networking/       # State: dev-networking
│   │   ├── compute/          # State: dev-compute
│   │   └── data/             # State: dev-data
│   ├── staging/
│   └── prod/
├── global/                   # Account-level resources
│   ├── iam/
│   └── dns/
└── .github/workflows/
    └── terraform.yml         # Path-based CI triggers
Polyrepo Structure
# Separate repositories:
terraform-modules # Shared module library
terraform-networking # VPCs, subnets, peering
terraform-compute # EKS, EC2, ASG
terraform-data # RDS, ElastiCache, S3
terraform-platform-team-a # Team A's services
terraform-platform-team-b # Team B's services
In practice, many organizations use a hybrid: a monorepo for shared modules and platform infrastructure, with polyrepos for team-specific service configurations.
State Blast Radius
State blast radius is the amount of infrastructure affected when something goes wrong with a single state file. A state file with 2,000 resources means a corrupted state, a bad apply, or a long lock can impact everything in it.
Signs Your State Is Too Big
# Check your state size
terraform state list | wc -l
# If this number is > 200, consider splitting
# If plan takes > 2 minutes, definitely split
A healthy state file manages 50-200 resources. Beyond that, you pay in plan duration, lock contention, and risk.
Splitting State by Concern
The best approach is to split state along blast radius boundaries — components that can fail independently should live in separate state files:
# BEFORE: One giant state
prod/
└── main.tf # VPC + EKS + RDS + IAM + DNS + S3 = 800 resources
# AFTER: Split by lifecycle and blast radius
prod/
├── networking/ # VPC, subnets, NAT gateways (rarely changes)
├── compute/ # EKS cluster, node groups (changes weekly)
├── data/ # RDS, ElastiCache (rarely changes, high risk)
├── iam/ # Roles, policies (changes with new services)
└── applications/ # Service-specific resources (changes daily)
Each directory has its own state file, its own lock, and its own blast radius. A bad apply in applications/ cannot touch networking or databases.
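Mechanically, splitting an existing state means relocating resources without destroying them. On Terraform 1.7+, one hedged approach uses `removed` and `import` blocks; the resource address and VPC ID below are illustrative:

```hcl
# In the old monolithic root — stop tracking the VPC
# in this state, without destroying the real resource
removed {
  from = aws_vpc.main

  lifecycle {
    destroy = false
  }
}
```

```hcl
# In the new prod/networking/ root — adopt the same VPC into the new state
import {
  to = aws_vpc.main
  id = "vpc-0abc123" # the real VPC ID from your account, illustrative here
}
```

On older Terraform versions, `terraform state mv` and `terraform import` accomplish the same migration imperatively. Either way, run a plan in both roots afterward and confirm it shows no changes before touching anything else.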
Cross-State References
Split states need to share data. Use terraform_remote_state or, better, data sources:
# In compute/main.tf — reference networking outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "prod-cluster"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
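The remote-state approach couples the consumer to the producer's state layout and output names. The looser alternative mentioned above is a plain data source lookup keyed on tags — the tag names here are an assumed convention, not from the source:

```hcl
# Look up the VPC and subnets by tag instead of reading another state file
data "aws_vpc" "prod" {
  tags = {
    Name = "prod-vpc" # assumed tagging convention
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.prod.id]
  }
  filter {
    name   = "tag:Tier"
    values = ["private"] # assumes networking tags its private subnets
  }
}
```

Consumers then reference `data.aws_subnets.private.ids`. The trade-off: tags become a public contract, but the networking team can restructure its state freely without breaking downstream plans.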
Terragrunt for DRY Configurations
When you split state into many directories, configuration duplication explodes. Terragrunt solves this by letting you define backend config, provider config, and variable values once:
# terragrunt.hcl (root)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "my-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = "${element(split("/", path_relative_to_include()), 0)}"
    }
  }
}
EOF
}
# prod/compute/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../modules//eks-cluster"
}

dependency "networking" {
  config_path = "../networking"
}

inputs = {
  cluster_name = "prod-cluster"
  subnet_ids   = dependency.networking.outputs.private_subnet_ids
  node_count   = 5
}
Now terragrunt apply in prod/compute/ generates the backend config, provider config, and passes the right variables — no duplication.
CI/CD Patterns at Scale
Path-Based Triggers (Monorepo)
Only run Terraform for directories that changed:
# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      directories: ${{ steps.changes.outputs.directories }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so origin/main...HEAD resolves
      - id: changes
        run: |
          DIRS=$(git diff --name-only origin/main...HEAD \
            | grep '^environments/' \
            | cut -d'/' -f1-3 \
            | sort -u \
            | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "directories=$DIRS" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    if: needs.detect-changes.outputs.directories != '[]'
    strategy:
      matrix:
        directory: ${{ fromJson(needs.detect-changes.outputs.directories) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: |
          cd "${{ matrix.directory }}"
          terraform init -input=false
          terraform plan -out=plan.tfplan
PR-Based Workflow
The safest pattern at scale: plan on PR, apply on merge.
1. Engineer opens PR with Terraform changes
2. CI runs terraform plan, posts output as PR comment
3. CODEOWNERS review the plan output
4. PR merged → CI runs terraform apply on main branch
5. Apply output posted to Slack/Teams channel
This ensures every infrastructure change is reviewed before it touches production.
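Steps 4 and 5 above translate into a second workflow triggered on push to main. A hedged sketch — a real setup would reuse the change-detection matrix from the plan workflow rather than hard-coding a directory, and the notification step is omitted because its shape depends on your chat tooling:

```yaml
# .github/workflows/terraform-apply.yml — illustrative sketch
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'environments/**'

jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production # optional second approval gate via GitHub environments
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: |
          cd environments/prod/networking # one directory per matrix job in practice
          terraform init -input=false
          terraform apply -input=false -auto-approve
```

The `environment: production` line is the key safety knob: it lets you require a named reviewer to approve the apply even after the PR has merged.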
Module Registry for Sharing
At scale, shared modules need versioning. A private module registry (Terraform Cloud, Artifactory, or S3-backed) lets teams consume modules with explicit version pins:
module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "~> 3.0"

  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
Version constraints prevent breaking changes from cascading. The platform team publishes v4.0.0 with breaking changes, but existing consumers stay on ~> 3.0 until they are ready to upgrade.
Organizational Patterns
The most effective pattern at scale is a platform team model:
| Role | Responsibility |
|---|---|
| Platform team | Owns shared modules, provider configs, CI pipelines, state backend |
| Product teams | Consume modules, define service-specific infrastructure |
| Security team | Reviews module changes, owns IAM and policy modules |
| SRE team | Owns monitoring, alerting, and reliability modules |
Product teams write simple configurations that compose platform-provided modules:
# Product team's Terraform — simple, constrained, safe
module "my_service" {
  source  = "app.terraform.io/my-org/microservice/aws"
  version = "~> 2.0"

  service_name    = "order-api"
  team            = "commerce"
  container_image = "order-api:v1.2.3"
  cpu             = 512
  memory          = 1024
  desired_count   = 3
}
The platform module handles VPC placement, security groups, IAM roles, logging, monitoring, and DNS — all with organizational standards baked in. Product teams cannot misconfigure networking or skip encryption because the module does not expose those knobs.
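To make that constraint concrete, here is a fragment of what such a platform module might enforce internally. These internals are hypothetical — the resource names, tag convention, and `org_kms_key_arn` variable are invented for illustration:

```hcl
# Inside the hypothetical microservice module — opinionated defaults
# that are deliberately NOT exposed as input variables

# Services always land in private subnets; consumers cannot override this
data "aws_subnets" "private" {
  filter {
    name   = "tag:Tier"
    values = ["private"] # assumed organizational tagging convention
  }
}

resource "aws_cloudwatch_log_group" "service" {
  name              = "/ecs/${var.service_name}"
  retention_in_days = 30                  # org-wide retention standard
  kms_key_id        = var.org_kms_key_arn # encryption is not optional
}
```

Because neither the subnet selection nor the KMS key appears in the module's variables, a product team literally cannot deploy into a public subnet or ship unencrypted logs.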
Remote Execution for Consistency
Running Terraform on developer laptops leads to inconsistency: different provider versions, different OS behaviors, stale credentials. Remote execution solves this:
# Using Terraform Cloud for remote execution
terraform {
  cloud {
    organization = "my-org"

    workspaces {
      tags = ["prod", "networking"]
    }
  }
}
With remote execution, every plan and apply runs in the same environment with the same provider versions, the same credentials, and full audit logging. No more "it worked on my machine."
Closing Note
Scaling Terraform is not about finding a magic tool. It is about making deliberate architectural decisions: split state to limit blast radius, choose a repo structure that matches your team topology, version modules so changes are controlled, and run everything through CI so no one is applying from a laptop. Start with state splitting — that alone eliminates the worst pain. Then layer in Terragrunt or Terraform Cloud as your organization grows. The goal is not to manage more resources; it is to let more teams manage their own resources safely.
