Multi-Cloud DevOps — Terraform, K8s, and Cross-Cloud CI/CD
When GitLab suffered its infamous database outage in 2017, companies running exclusively on the platform scrambled. When AWS us-east-1 went down for hours in December 2021, single-cloud shops lost millions. Multi-cloud is no longer a luxury; it is a strategic decision that protects your business. But doing it wrong costs more than doing nothing at all.
Why Multi-Cloud?
The case for multi-cloud goes beyond buzzwords:
Business Drivers for Multi-Cloud:

```
1. Avoid Vendor Lock-in
   └─ Negotiate from strength, not dependency
2. Best-of-Breed Services
   └─ AWS for compute, GCP for ML, Azure for enterprise integration
3. Regulatory Compliance
   └─ Data sovereignty: EU data on EU-region providers
   └─ Government contracts requiring specific clouds
4. Resilience
   └─ Survive single-provider outages
   └─ Geographic redundancy beyond one provider's regions
5. M&A Reality
   └─ Acquired company uses different cloud
   └─ Faster integration than migration
```
Multi-Cloud Challenges (Be Honest About These)
| Challenge | Impact | Mitigation |
|---|---|---|
| Complexity explosion | 3x the networking, IAM, monitoring to manage | Abstraction layers (Terraform, K8s) |
| Cost management | Billing across providers is painful | FinOps tooling (Kubecost, Infracost) |
| Skills gap | Team needs expertise in multiple clouds | Invest in cloud-agnostic tools |
| Lowest common denominator | Using only features available everywhere | Accept some provider-specific code (80/20 rule) |
| Networking complexity | Cross-cloud latency, security, DNS | Dedicated interconnects, service mesh |
| Inconsistent IAM | Different permission models per cloud | Centralized identity (Okta/Azure AD + OIDC) |
Terraform as the Unified IaC Layer
Terraform's provider model makes it the natural choice for multi-cloud infrastructure:
```hcl
# main.tf — Multi-cloud infrastructure with Terraform
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

# --- AWS: Primary compute ---
provider "aws" {
  region = "us-east-1"
  alias  = "primary"
}

resource "aws_eks_cluster" "primary" {
  provider = aws.primary
  name     = "app-primary"
  role_arn = aws_iam_role.eks.arn
  version  = "1.29"

  vpc_config {
    subnet_ids = module.aws_vpc.private_subnets
  }
}

# --- Azure: EU compliance workloads ---
provider "azurerm" {
  features {}
  subscription_id = var.azure_subscription_id
}

resource "azurerm_kubernetes_cluster" "eu" {
  name                = "app-eu"
  location            = "westeurope"
  resource_group_name = azurerm_resource_group.eu.name

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }
}

# --- GCP: ML workloads (TPU access) ---
provider "google" {
  project = var.gcp_project
  region  = "us-central1"
}

resource "google_container_cluster" "ml" {
  name     = "ml-cluster"
  location = "us-central1"

  node_config {
    machine_type = "n2-standard-8"
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }

  initial_node_count = 3
}
```
Multi-Cloud Module Pattern
```hcl
# modules/kubernetes-cluster/main.tf
# Abstract the cloud-specific details behind a unified interface

variable "provider" {
  type        = string
  description = "Cloud provider: aws, azure, gcp"
}

variable "cluster_name" {
  type = string
}

variable "node_count" {
  type    = number
  default = 3
}

variable "node_size" {
  type        = string
  description = "Normalized size: small, medium, large"
}

locals {
  # Normalize instance sizes across providers
  instance_types = {
    aws = {
      small  = "t3.medium"
      medium = "m5.xlarge"
      large  = "m5.2xlarge"
    }
    azure = {
      small  = "Standard_D2s_v3"
      medium = "Standard_D4s_v3"
      large  = "Standard_D8s_v3"
    }
    gcp = {
      small  = "e2-medium"
      medium = "n2-standard-4"
      large  = "n2-standard-8"
    }
  }
}

# Usage:
# module "primary_cluster" {
#   source       = "./modules/kubernetes-cluster"
#   provider     = "aws"
#   cluster_name = "app-primary"
#   node_count   = 5
#   node_size    = "medium"
# }
```
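Inside the module, the normalization map resolves to a concrete machine type with a two-level lookup. A minimal sketch (`node_type` is an illustrative local, not part of the module above):

```hcl
locals {
  # e.g. provider = "azure", node_size = "medium" resolves to "Standard_D4s_v3"
  node_type = local.instance_types[var.provider][var.node_size]
}
```

This is the entire trick: callers speak in normalized sizes, and only this one lookup knows about provider-specific SKU names.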
Kubernetes as the Unified Runtime
Kubernetes provides the application-level abstraction that makes workloads portable:
```yaml
# deployment.yml — Cloud-agnostic application deployment
# This same manifest deploys to EKS, AKS, and GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  labels:
    app: api-service
    cloud: "{{ .Values.cloud }}"  # Helm value: aws|azure|gcp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/api-service:v2.4.1
          ports:
            - containerPort: 8080
          env:
            - name: CLOUD_PROVIDER
              value: "{{ .Values.cloud }}"
            - name: DB_CONNECTION
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: connection_string
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
```
Cross-Cloud CI/CD with GitHub Actions
```yaml
# .github/workflows/multi-cloud-deploy.yml
name: Multi-Cloud Deploy

on:
  push:
    branches: [main]

env:
  IMAGE: ghcr.io/myorg/api-service

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # needed to push to ghcr.io
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push container image
        id: meta
        run: |
          IMAGE_TAG="${IMAGE}:${GITHUB_SHA::8}"
          docker build -t "$IMAGE_TAG" .
          docker push "$IMAGE_TAG"
          echo "tags=$IMAGE_TAG" >> "$GITHUB_OUTPUT"
      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.meta.outputs.tags }}  # scan the exact tag we pushed

  deploy-aws:
    needs: build
    runs-on: ubuntu-latest
    environment: production-aws
    permissions:
      contents: read
      id-token: write  # OIDC federation to AWS
    steps:
      - uses: actions/checkout@v4  # needed for ./charts/api-service
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Deploy to EKS
        run: |
          aws eks update-kubeconfig --name app-primary
          helm upgrade --install api-service ./charts/api-service \
            --set image.tag=${GITHUB_SHA::8} \
            --set cloud=aws \
            --wait --timeout 300s

  deploy-azure:
    needs: build
    runs-on: ubuntu-latest
    environment: production-azure
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Deploy to AKS
        run: |
          az aks get-credentials --resource-group eu-rg --name app-eu
          helm upgrade --install api-service ./charts/api-service \
            --set image.tag=${GITHUB_SHA::8} \
            --set cloud=azure \
            --wait --timeout 300s

  deploy-gcp:
    needs: build
    runs-on: ubuntu-latest
    environment: production-gcp
    permissions:
      contents: read
      id-token: write  # OIDC via Workload Identity Federation
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.GCP_WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}
      - name: Deploy to GKE
        run: |
          gcloud container clusters get-credentials ml-cluster \
            --region us-central1
          helm upgrade --install api-service ./charts/api-service \
            --set image.tag=${GITHUB_SHA::8} \
            --set cloud=gcp \
            --wait --timeout 300s
```
Secrets Management Across Clouds
Secrets Management Options for Multi-Cloud:

```
Option 1: HashiCorp Vault (recommended for multi-cloud)
├── Single source of truth for all secrets
├── Dynamic secrets for each cloud (AWS STS, Azure SPN, GCP SA)
├── Unified audit trail
└── K8s integration via Vault Secrets Operator

Option 2: External Secrets Operator (ESO)
├── Syncs secrets FROM cloud-native stores INTO Kubernetes
├── Supports AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
├── Team uses native tools, K8s gets unified Secrets
└── Less operational overhead than Vault

Option 3: Sealed Secrets + GitOps
├── Encrypted secrets stored in Git
├── Decrypted only inside the cluster
└── Works anywhere K8s runs (cloud agnostic)
```
```yaml
# external-secrets.yml — Sync secrets across clouds using ESO
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager  # or azure-keyvault, gcp-secret-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: connection_string
      remoteRef:
        key: production/database
        property: connection_string
```
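The `aws-secrets-manager` store referenced above has to be defined once per cluster. A minimal sketch for the AWS backend, assuming ESO runs under a service account named `external-secrets` that already has IAM access to Secrets Manager (the names and namespace are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets      # assumed SA with IRSA permissions
            namespace: external-secrets
```

The Azure Key Vault and GCP Secret Manager equivalents swap only the `provider` block; the ExternalSecret manifests stay identical across clouds.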
Cross-Cloud Networking
Cross-Cloud Networking Options:

```
1. VPN Tunnels (simplest, higher latency)
   AWS VPN Gateway ←──IPSec──→ Azure VPN Gateway
   Cost: ~$0.05/GB + hourly gateway fees
   Latency: +5-15ms

2. Dedicated Interconnects (lowest latency, highest cost)
   AWS Direct Connect ←──→ Equinix ←──→ Azure ExpressRoute
   Cost: ~$0.02/GB + port fees ($200-500/mo)
   Latency: +1-3ms

3. Service Mesh (application-level connectivity)
   Istio/Consul across clusters
   mTLS between services across clouds
   No infrastructure-level connectivity needed

4. Cloud-native peering
   AWS PrivateLink, Azure Private Link, GCP Private Service Connect
   Best for specific service-to-service connections
```
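Using the rough numbers above, the VPN-versus-interconnect decision is simple arithmetic: the interconnect saves about $0.03/GB but adds a fixed port fee, so it pays off once monthly transfer exceeds the port fee divided by the per-GB savings. A quick sketch (the prices are the illustrative figures from this section, not real quotes):

```python
def interconnect_breakeven_gb(vpn_per_gb: float = 0.05,
                              interconnect_per_gb: float = 0.02,
                              port_fee_monthly: float = 300.0) -> float:
    """Monthly GB above which a dedicated interconnect beats a VPN tunnel."""
    savings_per_gb = vpn_per_gb - interconnect_per_gb
    return port_fee_monthly / savings_per_gb

# At a $300/mo port fee and $0.03/GB savings: 10,000 GB/mo breakeven
print(round(interconnect_breakeven_gb()))  # → 10000
```

Below roughly 10 TB/mo of cross-cloud traffic at these rates, the VPN's simplicity usually wins; the latency difference is the other axis to weigh.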
Monitoring Across Clouds
```yaml
# Unified monitoring with Grafana Cloud + Prometheus
# Each cluster ships metrics to a central Grafana Cloud instance
# prometheus-agent.yml (runs on each cluster)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-agent
spec:
  replicas: 1
  remoteWrite:
    - url: "https://prometheus-prod-01-us-east-0.grafana.net/api/prom/push"
      basicAuth:
        username:
          name: grafana-cloud-creds
          key: username
        password:
          name: grafana-cloud-creds
          key: api-key
      writeRelabelConfigs:  # operator field is writeRelabelConfigs
        - sourceLabels: [__name__]
          action: keep
          regex: "container_.*|kube_.*|node_.*|http_.*"
      queueConfig:
        maxSamplesPerSend: 1000
        batchSendDeadline: 30s
  externalLabels:           # set at the Prometheus level, not per remote write
    cluster: "aws-primary"  # or "azure-eu", "gcp-ml"
    cloud: "aws"            # or "azure", "gcp"
    region: "us-east-1"     # cloud-specific region
```
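With `cluster` and `cloud` attached as external labels, one Grafana panel can compare clouds side by side. For example, request rate per cloud (this assumes the application exports a counter such as `http_requests_total`, which is a common convention rather than something defined in this section):

```promql
sum by (cloud) (rate(http_requests_total[5m]))
```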
Cost Management
| Tool | Multi-Cloud Support | Key Feature |
|---|---|---|
| Kubecost | Any K8s cluster | Per-pod cost allocation across clouds |
| Infracost | Terraform-native | Cost estimates in PR comments before deploy |
| CloudHealth (VMware) | AWS, Azure, GCP | FinOps dashboards and recommendations |
| Vantage | All major clouds | Unified billing with Kubernetes cost reports |
| OpenCost | Any K8s cluster | Open-source, CNCF project |
```shell
# Infracost: See cost impact in every Terraform PR
infracost breakdown --path . --format table

# Example output:
# Name                              Monthly Cost
# ─────────────────────────────────────────────
# aws_eks_cluster.primary           $73.00
# azurerm_kubernetes_cluster.eu     $438.00
# google_container_cluster.ml       $621.00
# ─────────────────────────────────────────────
# Total                             $1,132.00/mo
```
When Multi-Cloud Makes Sense vs. When It Does Not
```
Multi-cloud MAKES SENSE when:
  ✓ Regulatory requirements mandate specific providers for specific data
  ✓ You are acquiring companies on different clouds
  ✓ You need genuinely best-of-breed (GCP AI + AWS networking)
  ✓ Your scale justifies the operational overhead
  ✓ You have a platform team to manage the complexity

Multi-cloud DOES NOT make sense when:
  ✗ "Avoiding vendor lock-in" is the only reason
  ✗ Your team is < 20 engineers
  ✗ You are still figuring out one cloud
  ✗ You do not have a platform team
  ✗ The cost of abstraction exceeds the cost of lock-in
```
The 80/20 Rule: run 80% of workloads on your primary cloud, and 20% on secondary clouds where there is a clear advantage. Do NOT split evenly; that maximizes pain for minimal benefit.
Closing Note
Multi-cloud is a spectrum, not a binary choice. Start with Terraform and Kubernetes as your abstraction layers — they buy you portability without requiring immediate multi-cloud deployment. When the business case is clear (compliance, M&A, best-of-breed), you will already have the foundation to expand. The teams that succeed with multi-cloud are the ones that treat it as an engineering capability, not a checkbox on an architecture diagram.
