
Multi-Cloud DevOps — Terraform, K8s, and Cross-Cloud CI/CD

8 min read
Goel Academy
DevOps & Cloud Learning Hub

When GitLab's production database was accidentally deleted in 2017, companies running exclusively on their platform scrambled. When AWS us-east-1 went down for hours in 2021, single-cloud shops lost millions. Multi-cloud is no longer a luxury; it is a strategic decision that protects your business. But doing it wrong costs more than doing nothing at all.

Why Multi-Cloud?

The case for multi-cloud goes beyond buzzwords:

Business Drivers for Multi-Cloud:

1. Avoid Vendor Lock-in
   └─ Negotiate from strength, not dependency

2. Best-of-Breed Services
   └─ AWS for compute, GCP for ML, Azure for enterprise integration

3. Regulatory Compliance
   └─ Data sovereignty: EU data on EU-region providers
   └─ Government contracts requiring specific clouds

4. Resilience
   └─ Survive single-provider outages
   └─ Geographic redundancy beyond one provider's regions

5. M&A Reality
   └─ Acquired company uses different cloud
   └─ Faster integration than migration

Multi-Cloud Challenges (Be Honest About These)

| Challenge | Impact | Mitigation |
| --- | --- | --- |
| Complexity explosion | 3x the networking, IAM, and monitoring to manage | Abstraction layers (Terraform, K8s) |
| Cost management | Billing across providers is painful | FinOps tooling (Kubecost, Infracost) |
| Skills gap | Team needs expertise in multiple clouds | Invest in cloud-agnostic tools |
| Lowest common denominator | Using only features available everywhere | Accept some provider-specific code (80/20 rule) |
| Networking complexity | Cross-cloud latency, security, DNS | Dedicated interconnects, service mesh |
| Inconsistent IAM | Different permission models per cloud | Centralized identity (Okta/Azure AD + OIDC) |

Terraform as the Unified IaC Layer

Terraform's provider model makes it the natural choice for multi-cloud infrastructure:

# main.tf — Multi-cloud infrastructure with Terraform

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

# --- AWS: Primary compute ---
provider "aws" {
  region = "us-east-1"
  alias  = "primary"
}

resource "aws_eks_cluster" "primary" {
  provider = aws.primary
  name     = "app-primary"
  role_arn = aws_iam_role.eks.arn  # IAM role defined elsewhere
  version  = "1.29"

  vpc_config {
    subnet_ids = module.aws_vpc.private_subnets
  }
}

# --- Azure: EU compliance workloads ---
provider "azurerm" {
  features {}
  subscription_id = var.azure_subscription_id
}

resource "azurerm_kubernetes_cluster" "eu" {
  name                = "app-eu"
  location            = "westeurope"
  resource_group_name = azurerm_resource_group.eu.name
  dns_prefix          = "app-eu"  # required by the azurerm provider

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }
}

# --- GCP: ML workloads (TPU access) ---
provider "google" {
  project = var.gcp_project
  region  = "us-central1"
}

resource "google_container_cluster" "ml" {
  name     = "ml-cluster"
  location = "us-central1"

  node_config {
    machine_type = "n2-standard-8"

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }

  initial_node_count = 3
}

Multi-Cloud Module Pattern

# modules/kubernetes-cluster/main.tf
# Abstract the cloud-specific details behind a unified interface

# Named cloud_provider rather than provider to avoid colliding with
# Terraform's provider/providers meta-argument terminology
variable "cloud_provider" {
  type        = string
  description = "Cloud provider: aws, azure, gcp"
}

variable "cluster_name" {
  type = string
}

variable "node_count" {
  type    = number
  default = 3
}

variable "node_size" {
  type        = string
  description = "Normalized size: small, medium, large"
}

locals {
  # Normalize instance sizes across providers
  instance_types = {
    aws = {
      small  = "t3.medium"
      medium = "m5.xlarge"
      large  = "m5.2xlarge"
    }
    azure = {
      small  = "Standard_D2s_v3"
      medium = "Standard_D4s_v3"
      large  = "Standard_D8s_v3"
    }
    gcp = {
      small  = "e2-medium"
      medium = "n2-standard-4"
      large  = "n2-standard-8"
    }
  }
}

# Usage:
# module "primary_cluster" {
#   source         = "./modules/kubernetes-cluster"
#   cloud_provider = "aws"
#   cluster_name   = "app-primary"
#   node_count     = 5
#   node_size      = "medium"
# }
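Inside the module, the normalized size resolves to a concrete machine type with a nested map lookup. A minimal sketch, assuming the cloud selector variable is named cloud_provider (to avoid confusion with Terraform's provider meta-arguments):

```hcl
# Resolve the normalized node size for the selected cloud.
# Example: cloud_provider = "azure" with node_size = "medium"
# resolves to "Standard_D4s_v3".
locals {
  machine_type = local.instance_types[var.cloud_provider][var.node_size]
}
```

The cloud-specific cluster resources then reference local.machine_type, so callers never see a raw instance-type string.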

Kubernetes as the Unified Runtime

Kubernetes provides the application-level abstraction that makes workloads portable:

# deployment.yml — Cloud-agnostic application deployment
# This same manifest deploys to EKS, AKS, and GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  labels:
    app: api-service
    cloud: "{{ .Values.cloud }}"  # Helm value: aws|azure|gcp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: ghcr.io/myorg/api-service:v2.4.1
          ports:
            - containerPort: 8080
          env:
            - name: CLOUD_PROVIDER
              value: "{{ .Values.cloud }}"
            - name: DB_CONNECTION
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: connection_string
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"

Cross-Cloud CI/CD with GitHub Actions

# .github/workflows/multi-cloud-deploy.yml
name: Multi-Cloud Deploy

on:
  push:
    branches: [main]

env:
  IMAGE: ghcr.io/myorg/api-service

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # needed to push images to GHCR
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin

      - name: Build and push container image
        id: meta
        run: |
          IMAGE_TAG="${IMAGE}:${GITHUB_SHA::8}"
          docker build -t "$IMAGE_TAG" .
          docker push "$IMAGE_TAG"
          echo "tags=$IMAGE_TAG" >> "$GITHUB_OUTPUT"

      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ steps.meta.outputs.tags }}  # scan the exact tag that was pushed

  deploy-aws:
    needs: build
    runs-on: ubuntu-latest
    environment: production-aws
    permissions:
      id-token: write  # required for OIDC role assumption
      contents: read
    steps:
      - uses: actions/checkout@v4  # the Helm chart lives in this repo
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Deploy to EKS
        run: |
          aws eks update-kubeconfig --name app-primary
          helm upgrade --install api-service ./charts/api-service \
            --set image.tag=${GITHUB_SHA::8} \
            --set cloud=aws \
            --wait --timeout 300s

  deploy-azure:
    needs: build
    runs-on: ubuntu-latest
    environment: production-azure
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Deploy to AKS
        run: |
          az aks get-credentials --resource-group eu-rg --name app-eu
          helm upgrade --install api-service ./charts/api-service \
            --set image.tag=${GITHUB_SHA::8} \
            --set cloud=azure \
            --wait --timeout 300s

  deploy-gcp:
    needs: build
    runs-on: ubuntu-latest
    environment: production-gcp
    permissions:
      id-token: write  # required for Workload Identity Federation
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.GCP_WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}

      - name: Deploy to GKE
        run: |
          gcloud container clusters get-credentials ml-cluster \
            --region us-central1
          helm upgrade --install api-service ./charts/api-service \
            --set image.tag=${GITHUB_SHA::8} \
            --set cloud=gcp \
            --wait --timeout 300s
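All three deploy jobs derive the image tag from the first eight characters of the commit SHA using Bash substring expansion. A quick standalone illustration:

```shell
#!/usr/bin/env bash
# Bash substring expansion: ${VAR:offset:length}.
# ${GITHUB_SHA::8} omits the offset, meaning "from position 0, take 8 chars".
GITHUB_SHA="3f2504e04f8911d39a0c0305e82c3301abcdef12"  # example commit SHA
IMAGE="ghcr.io/myorg/api-service"

IMAGE_TAG="${IMAGE}:${GITHUB_SHA::8}"
echo "$IMAGE_TAG"  # → ghcr.io/myorg/api-service:3f2504e0
```

Because every job computes the tag from the same GITHUB_SHA, the build and all three deploys agree on the image reference without passing state between clouds.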

Secrets Management Across Clouds

Secrets Management Options for Multi-Cloud:

Option 1: HashiCorp Vault (recommended for multi-cloud)
├── Single source of truth for all secrets
├── Dynamic secrets for each cloud (AWS STS, Azure SPN, GCP SA)
├── Unified audit trail
└── K8s integration via Vault Secrets Operator

Option 2: External Secrets Operator (ESO)
├── Syncs secrets FROM cloud-native stores INTO Kubernetes
├── Supports AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
├── Team uses native tools, K8s gets unified Secrets
└── Less operational overhead than Vault

Option 3: Sealed Secrets + GitOps
├── Encrypted secrets stored in Git
├── Decrypted only inside the cluster
└── Works anywhere K8s runs (cloud agnostic)

# external-secrets.yml — Sync secrets across clouds using ESO
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager  # or azure-keyvault, gcp-secret-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: connection_string
      remoteRef:
        key: production/database
        property: connection_string
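The ExternalSecret references a ClusterSecretStore named aws-secrets-manager, which tells ESO how to authenticate to the backing store. A minimal sketch of that store for AWS Secrets Manager, with illustrative service-account names:

```yaml
# cluster-secret-store.yml — backs the ExternalSecret above
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa   # assumed IRSA-annotated service account
            namespace: external-secrets
```

On AKS or GKE the same ExternalSecret can stay unchanged; only the store (azurekv or gcpsm provider) differs per cluster.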

Cross-Cloud Networking

Cross-Cloud Networking Options:

1. VPN Tunnels (simplest, higher latency)
   AWS VPN Gateway ←──IPSec──→ Azure VPN Gateway
   Cost: ~$0.05/GB + hourly gateway fees
   Latency: +5-15ms

2. Dedicated Interconnects (lowest latency, highest cost)
   AWS Direct Connect ←──→ Equinix ←──→ Azure ExpressRoute
   Cost: ~$0.02/GB + port fees ($200-500/mo)
   Latency: +1-3ms

3. Service Mesh (application-level connectivity)
   Istio/Consul across clusters
   mTLS between services across clouds
   No infrastructure-level connectivity needed

4. Cloud-native peering
   AWS PrivateLink, Azure Private Link, GCP Private Service Connect
   Best for specific service-to-service connections

Monitoring Across Clouds

# Unified monitoring with Grafana Cloud + Prometheus
# Each cluster ships metrics to a central Grafana Cloud instance

# prometheus-agent.yml (runs on each cluster)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-agent
spec:
  replicas: 1
  remoteWrite:
    - url: "https://prometheus-prod-01-us-east-0.grafana.net/api/prom/push"
      basicAuth:
        username:
          name: grafana-cloud-creds
          key: username
        password:
          name: grafana-cloud-creds
          key: api-key
      writeRelabelConfigs:  # Prometheus Operator field for remote-write relabeling
        - sourceLabels: [__name__]
          action: keep
          regex: "container_.*|kube_.*|node_.*|http_.*"
      queueConfig:
        maxSamplesPerSend: 1000
        batchSendDeadline: 30s
  externalLabels:
    cluster: "aws-primary"  # or "azure-eu", "gcp-ml"
    cloud: "aws"            # or "azure", "gcp"
    region: "us-east-1"     # cloud-specific region
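The keep rule forwards only metric names matching the regex and drops everything else, which cuts remote-write costs. Since Prometheus anchors relabel regexes to the full metric name, the filter behaves like this sketch (anchors added explicitly):

```shell
#!/usr/bin/env bash
# Simulate the keep relabel rule: matching names survive, others are dropped.
REGEX='^(container_.*|kube_.*|node_.*|http_.*)$'

for metric in container_cpu_usage_seconds_total kube_pod_status_phase go_goroutines; do
  if [[ "$metric" =~ $REGEX ]]; then
    echo "keep: $metric"
  else
    echo "drop: $metric"
  fi
done
```

Here go_goroutines would be dropped at the edge, so it never counts against the central Grafana Cloud ingest bill.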

Cost Management

| Tool | Multi-Cloud Support | Key Feature |
| --- | --- | --- |
| Kubecost | Any K8s cluster | Per-pod cost allocation across clouds |
| Infracost | Terraform-native | Cost estimates in PR comments before deploy |
| CloudHealth (VMware) | AWS, Azure, GCP | FinOps dashboards and recommendations |
| Vantage | All major clouds | Unified billing with Kubernetes cost reports |
| OpenCost | Any K8s cluster | Open-source, CNCF project |

# Infracost: See cost impact in every Terraform PR
infracost breakdown --path . --format table

# Example output:
# Name Monthly Cost
# ─────────────────────────────────────────────
# aws_eks_cluster.primary $73.00
# azurerm_kubernetes_cluster.eu $438.00
# google_container_cluster.ml $621.00
# ─────────────────────────────────────────────
# Total $1,132.00/mo

When Multi-Cloud Makes Sense vs. When It Does Not

Multi-cloud MAKES SENSE when:
✓ Regulatory requirements mandate specific providers for specific data
✓ You are acquiring companies on different clouds
✓ You need genuinely best-of-breed (GCP AI + AWS networking)
✓ Your scale justifies the operational overhead
✓ You have a platform team to manage the complexity

Multi-cloud DOES NOT make sense when:
✗ "Avoiding vendor lock-in" is the only reason
✗ Your team is < 20 engineers
✗ You are still figuring out one cloud
✗ You do not have a platform team
✗ The cost of abstraction exceeds the cost of lock-in

The 80/20 Rule:
Run 80% of workloads on your primary cloud.
Run 20% on secondary clouds where there is a clear advantage.
Do NOT split evenly — that maximizes pain for minimal benefit.

Closing Note

Multi-cloud is a spectrum, not a binary choice. Start with Terraform and Kubernetes as your abstraction layers — they buy you portability without requiring immediate multi-cloud deployment. When the business case is clear (compliance, M&A, best-of-breed), you will already have the foundation to expand. The teams that succeed with multi-cloud are the ones that treat it as an engineering capability, not a checkbox on an architecture diagram.