
Terraform at Scale — Monorepo vs Polyrepo, and State Blast Radius

7 min read
Goel Academy
DevOps & Cloud Learning Hub

Managing 10 Terraform resources is straightforward. Managing 10,000 across 20 teams is a different game entirely. At scale, every decision compounds: where you store code, how you split state, who can approve applies, and how CI/CD pipelines run. Get these wrong and you end up with 45-minute plans, state files that lock out entire teams, and a single bad merge that takes down production networking. This post covers the architectural decisions that separate "Terraform works for my team" from "Terraform works for my organization."

The Scaling Wall

Most teams hit the scaling wall around the same milestones:

  • 5-10 engineers using Terraform — merge conflicts on state become frequent.
  • 500+ resources in a single state — plans take minutes, applies become risky.
  • 3+ teams sharing modules — versioning and breaking changes cause friction.
  • Multiple environments — copy-paste configurations drift apart.

The fix is not a single tool. It is a combination of repo structure, state architecture, module strategy, and CI/CD design.

Monorepo vs Polyrepo

The first structural decision is where the Terraform code lives.

Dimension             | Monorepo                                              | Polyrepo
----------------------|-------------------------------------------------------|----------------------------------------------------
Code location         | Single repo, folder per component                     | One repo per component or team
Discoverability       | Everything in one place, easy to search               | Spread across repos, needs catalog
Dependency management | Internal references, shared modules in-tree           | Versioned module registry, explicit versions
CI complexity         | Needs path-based triggers to avoid running everything | Simple per-repo pipelines
Code review           | Cross-team visibility, can review everything          | Isolated reviews, team autonomy
Access control        | Requires CODEOWNERS, branch protection per path       | Repo-level permissions
Best for              | Small-medium orgs, platform teams, tight coupling     | Large orgs, autonomous teams, compliance boundaries
Risk                  | One bad merge affects everything                      | Module version sprawl, inconsistency
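The per-path access control a monorepo needs is usually enforced with a CODEOWNERS file plus branch protection. A minimal sketch, with hypothetical team handles:

```
# .github/CODEOWNERS — team names are assumptions for illustration
/modules/               @my-org/platform-team
/environments/prod/     @my-org/platform-team @my-org/sre-team
/environments/*/data/   @my-org/data-team
/global/iam/            @my-org/security-team
```

With "require review from code owners" enabled on the main branch, a change under environments/prod/ cannot merge without a platform-team or SRE approval, even though the whole repo is open for pull requests.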

Monorepo Structure

infrastructure/
├── modules/                 # Shared modules
│   ├── vpc/
│   ├── eks-cluster/
│   └── rds-instance/
├── environments/
│   ├── dev/
│   │   ├── networking/      # State: dev-networking
│   │   ├── compute/         # State: dev-compute
│   │   └── data/            # State: dev-data
│   ├── staging/
│   └── prod/
├── global/                  # Account-level resources
│   ├── iam/
│   └── dns/
└── .github/workflows/
    └── terraform.yml        # Path-based CI triggers

Polyrepo Structure

# Separate repositories:
terraform-modules            # Shared module library
terraform-networking         # VPCs, subnets, peering
terraform-compute            # EKS, EC2, ASG
terraform-data               # RDS, ElastiCache, S3
terraform-platform-team-a    # Team A's services
terraform-platform-team-b    # Team B's services

In practice, many organizations use a hybrid: a monorepo for shared modules and platform infrastructure, with polyrepos for team-specific service configurations.

State Blast Radius

State blast radius is the amount of infrastructure affected when something goes wrong with a single state file. A state file with 2,000 resources means a corrupted state, a bad apply, or a long lock can impact everything in it.

Signs Your State Is Too Big

# Check your state size
terraform state list | wc -l

# If this number is > 200, consider splitting
# If plan takes > 2 minutes, definitely split

A healthy state file manages 50-200 resources. Beyond that, you pay in plan duration, lock contention, and risk.

Splitting State by Concern

The best approach is to split state along blast radius boundaries — components that can fail independently should live in separate state files:

# BEFORE: One giant state
prod/
└── main.tf # VPC + EKS + RDS + IAM + DNS + S3 = 800 resources

# AFTER: Split by lifecycle and blast radius
prod/
├── networking/ # VPC, subnets, NAT gateways (rarely changes)
├── compute/ # EKS cluster, node groups (changes weekly)
├── data/ # RDS, ElastiCache (rarely changes, high risk)
├── iam/ # Roles, policies (changes with new services)
└── applications/ # Service-specific resources (changes daily)

Each directory has its own state file, its own lock, and its own blast radius. A bad apply in applications/ cannot touch networking or databases.
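Each split directory then points its backend at its own state key. A sketch of what prod/networking's backend block might look like (bucket, table, and key names are assumptions):

```hcl
# prod/networking/backend.tf — one state key, one lock, per component
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"   # per-key locking, so compute and networking never block each other
  }
}
```

Because the lock is taken per key, a long-running apply in compute/ no longer queues behind anyone touching networking/.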

Cross-State References

Split states need to share data. The simplest mechanism is the terraform_remote_state data source, which reads another state's outputs; provider data sources that look resources up by tag are a looser-coupled alternative:

# In compute/main.tf — reference networking outputs
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_eks_cluster" "main" {
  name     = "prod-cluster"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
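Reading remote state couples the consumer to the producer's state layout and backend location. An alternative, sketched here with assumed tag names, is to look the infrastructure up directly through provider data sources:

```hcl
# Tag-based lookup — no dependency on the networking state file (tag values assumed)
data "aws_vpc" "main" {
  tags = {
    Environment = "prod"
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
  tags = {
    Tier = "private"
  }
}
```

The consumer would then use data.aws_subnets.private.ids. The trade-off: the networking team must treat its tags as a stable interface instead of its outputs.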

Terragrunt for DRY Configurations

When you split state into many directories, configuration duplication explodes. Terragrunt solves this by letting you define backend config, provider config, and variable values once:

# terragrunt.hcl (root)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "my-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = "${basename(get_terragrunt_dir())}"
    }
  }
}
EOF
}

# prod/compute/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../modules//eks-cluster"
}

dependency "networking" {
  config_path = "../networking"
}

inputs = {
  cluster_name = "prod-cluster"
  subnet_ids   = dependency.networking.outputs.private_subnet_ids
  node_count   = 5
}

Now terragrunt apply in prod/compute/ generates the backend config, provider config, and passes the right variables — no duplication.

CI/CD Patterns at Scale

Path-Based Triggers (Monorepo)

Only run Terraform for directories that changed:

# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      directories: ${{ steps.changes.outputs.directories }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so origin/main...HEAD can be diffed
      - id: changes
        run: |
          DIRS=$(git diff --name-only origin/main...HEAD \
            | grep '^environments/' \
            | cut -d'/' -f1-3 \
            | sort -u \
            | jq -R -s -c 'split("\n") | map(select(. != ""))')
          echo "directories=$DIRS" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    strategy:
      matrix:
        directory: ${{ fromJson(needs.detect-changes.outputs.directories) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          cd ${{ matrix.directory }}
          terraform init
          terraform plan -out=plan.tfplan
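The grep/cut/sort pipeline in detect-changes is easy to sanity-check locally. A sketch with sample file paths standing in for the git diff output:

```shell
# Map changed files to component directories (sample input, no git needed)
printf '%s\n' \
  'environments/prod/networking/main.tf' \
  'environments/prod/networking/vars.tf' \
  'environments/dev/compute/main.tf' \
  'modules/vpc/main.tf' \
  | grep '^environments/' \
  | cut -d'/' -f1-3 \
  | sort -u
# → environments/dev/compute
# → environments/prod/networking
```

Two changed files in the same component collapse into one matrix entry, and module-only changes produce none — the matrix fans out exactly one plan job per touched directory.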

PR-Based Workflow

The safest pattern at scale: plan on PR, apply on merge.

1. Engineer opens PR with Terraform changes
2. CI runs terraform plan, posts output as PR comment
3. CODEOWNERS review the plan output
4. PR merged → CI runs terraform apply on main branch
5. Apply output posted to Slack/Teams channel

This ensures every infrastructure change is reviewed before it touches production.
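The merge half of that workflow can live in a second pipeline triggered on pushes to main. A sketch, with the target directory hardcoded for brevity (a real pipeline would reuse the same change-detection job as the plan workflow):

```yaml
# .github/workflows/terraform-apply.yml — hypothetical apply-on-merge sketch
name: Terraform Apply
on:
  push:
    branches: [main]
    paths:
      - 'environments/**'

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: |
          cd environments/prod/networking   # real pipelines derive this from changed paths
          terraform init
          terraform apply -auto-approve
```

Restricting this workflow's cloud credentials to the main branch (for example via OIDC trust conditions) is what makes "apply only on merge" enforceable rather than just conventional.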

Module Registry for Sharing

At scale, shared modules need versioning. A private module registry (Terraform Cloud, Artifactory, or S3-backed) lets teams consume modules with explicit version pins:

module "vpc" {
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "~> 3.0"

  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

Version constraints prevent breaking changes from cascading. The platform team publishes v4.0.0 with breaking changes, but existing consumers stay on ~> 3.0 until they are ready to upgrade.
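Terraform's constraint operators make each consumer's upgrade policy explicit. The common forms (standard Terraform syntax, illustrative version numbers):

```hcl
version = "3.4.1"           # exact pin — no automatic updates
version = "~> 3.4"          # >= 3.4.0, < 4.0.0 — accept minor releases
version = "~> 3.4.1"        # >= 3.4.1, < 3.5.0 — accept patch releases only
version = ">= 3.0, < 4.0"   # explicit range
```

Pessimistic constraints (`~>`) are the usual default for shared modules: consumers pick up fixes automatically but never cross a major-version boundary unreviewed.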

Organizational Patterns

The most effective pattern at scale is a platform team model:

Role          | Responsibility
--------------|-------------------------------------------------------------------
Platform team | Owns shared modules, provider configs, CI pipelines, state backend
Product teams | Consume modules, define service-specific infrastructure
Security team | Reviews module changes, owns IAM and policy modules
SRE team      | Owns monitoring, alerting, and reliability modules

Product teams write simple configurations that compose platform-provided modules:

# Product team's Terraform — simple, constrained, safe
module "my_service" {
  source  = "app.terraform.io/my-org/microservice/aws"
  version = "~> 2.0"

  service_name    = "order-api"
  team            = "commerce"
  container_image = "order-api:v1.2.3"
  cpu             = 512
  memory          = 1024
  desired_count   = 3
}

The platform module handles VPC placement, security groups, IAM roles, logging, monitoring, and DNS — all with organizational standards baked in. Product teams cannot misconfigure networking or skip encryption because the module does not expose those knobs.
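Inside the platform module, guardrails like this are commonly enforced with variable validation blocks. A sketch, with assumed allowed values:

```hcl
# In the platform module — reject unsupported sizes at plan time (allowed values assumed)
variable "cpu" {
  type        = number
  description = "Task CPU units"
  validation {
    condition     = contains([256, 512, 1024, 2048, 4096], var.cpu)
    error_message = "cpu must be one of the supported task sizes."
  }
}
```

A product team passing cpu = 300 gets a clear error at plan time, long before anything reaches production.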

Remote Execution for Consistency

Running Terraform on developer laptops leads to inconsistency: different provider versions, different OS behaviors, stale credentials. Remote execution solves this:

# Using Terraform Cloud for remote execution
terraform {
  cloud {
    organization = "my-org"
    workspaces {
      tags = ["prod", "networking"]
    }
  }
}

With remote execution, every plan and apply runs in the same environment with the same provider versions, the same credentials, and full audit logging. No more "it worked on my machine."

Closing Note

Scaling Terraform is not about finding a magic tool. It is about making deliberate architectural decisions: split state to limit blast radius, choose a repo structure that matches your team topology, version modules so changes are controlled, and run everything through CI so no one is applying from a laptop. Start with state splitting — that alone eliminates the worst pain. Then layer in Terragrunt or Terraform Cloud as your organization grows. The goal is not to manage more resources; it is to let more teams manage their own resources safely.