Terraform Data Sources — Query Existing Infrastructure

6 min read
Goel Academy
DevOps & Cloud Learning Hub

Not everything in your cloud account was created by Terraform. Maybe the VPC was built by another team using CloudFormation. Maybe the DNS zone was set up manually in the console two years ago. Maybe you need the latest Amazon Linux AMI and its ID changes every week. Data sources let Terraform read information from your cloud provider without managing the resource itself.

Resources vs Data Sources

This distinction trips up a lot of beginners.

  • A resource block tells Terraform: "Create and manage this thing."
  • A data block tells Terraform: "Look up this thing that already exists and give me its attributes."

```hcl
# Resource — Terraform creates and manages this VPC
resource "aws_vpc" "new_vpc" {
  cidr_block = "10.1.0.0/16"
  tags       = { Name = "new-vpc" }
}

# Data source — Terraform reads an existing VPC, does NOT manage it
data "aws_vpc" "existing_vpc" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}
```

If you delete the data source block and run apply, nothing happens in your cloud account. The VPC still exists. If you delete the resource block and run apply, Terraform destroys the VPC.
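
Attributes of a looked-up object are referenced with the `data.` prefix. As a small sketch, the existing VPC's CIDR block can be exposed without Terraform ever owning the VPC:

```hcl
output "existing_vpc_cidr" {
  # Read-only: this value comes from the AWS API at plan time,
  # not from anything Terraform manages.
  value = data.aws_vpc.existing_vpc.cidr_block
}
```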

Looking Up the Latest AMI

This is the single most common use case for data sources. AMI IDs change with every update, and hardcoding them means your configuration is immediately stale.

```hcl
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  tags = {
    Name    = "web-server"
    AMI     = data.aws_ami.amazon_linux.name
    ImageID = data.aws_ami.amazon_linux.id
  }
}
```

Every time you run terraform plan, Terraform queries AWS for the latest matching AMI. If a newer AMI has been published, the plan shows that the instance will be replaced, because changing the ami argument forces a new instance.
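
If you want to track the latest AMI without replacing running instances on every release, one common approach (a sketch, not the only option) is to ignore drift on the ami argument:

```hcl
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  lifecycle {
    # A newer AMI no longer forces replacement; the latest AMI is
    # only picked up when the instance is recreated for other reasons.
    ignore_changes = [ami]
  }
}
```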

Referencing an Existing VPC and Subnets

Your networking team manages the VPC. You need to deploy resources inside it. Data sources bridge this gap perfectly.

```hcl
# Look up the shared VPC by tag
data "aws_vpc" "shared" {
  filter {
    name   = "tag:Name"
    values = ["shared-services-vpc"]
  }
}

# Find all private subnets in that VPC
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }

  filter {
    name   = "tag:Tier"
    values = ["private"]
  }
}

# Get details of each subnet (for AZ info, CIDR, etc.)
data "aws_subnet" "private_details" {
  for_each = toset(data.aws_subnets.private.ids)
  id       = each.value
}

# Deploy into those subnets
resource "aws_instance" "app" {
  count         = length(data.aws_subnets.private.ids)
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  subnet_id     = data.aws_subnets.private.ids[count.index]

  tags = {
    Name = "app-${count.index + 1}"
    AZ   = data.aws_subnet.private_details[data.aws_subnets.private.ids[count.index]].availability_zone
  }
}
```

This pattern is extremely common in organizations where networking is centralized but application teams deploy independently.
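
One caveat with count: instances are keyed by position, so if a subnet disappears from the middle of the list, every later instance gets recreated. A for_each variant (a sketch built on the same data sources) keys each instance by subnet ID instead:

```hcl
resource "aws_instance" "app" {
  # Keyed by subnet ID, so removing one subnet only affects
  # the instance in that subnet.
  for_each      = toset(data.aws_subnets.private.ids)
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
  subnet_id     = each.value

  tags = {
    Name = "app-${each.value}"
    AZ   = data.aws_subnet.private_details[each.value].availability_zone
  }
}
```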

Getting Caller Identity

Need to know which AWS account you are deploying to? The aws_caller_identity data source gives you account ID, user ARN, and user ID without any configuration.

```hcl
data "aws_caller_identity" "current" {}

data "aws_region" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
  region     = data.aws_region.current.name
}

# Use it to construct ARNs
resource "aws_iam_policy" "app_policy" {
  name = "app-s3-access"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::${local.account_id}-app-data/*"
    }]
  })
}

output "deploying_to" {
  value = "Account ${local.account_id} in ${local.region}"
}
```

This is invaluable for building account-agnostic configurations that work across dev, staging, and production accounts.
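
A related pattern (the bucket name here is illustrative) is folding the account ID and region into globally unique resource names, so the same configuration applies cleanly to any account without collisions:

```hcl
resource "aws_s3_bucket" "app_data" {
  # S3 bucket names are globally unique, so embedding the account
  # and region keeps environments from colliding.
  bucket = "${local.account_id}-${local.region}-app-data"
}
```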

Azure Data Sources

Data sources work the same way across all providers. Here are common Azure examples:

```hcl
# Look up an existing resource group
data "azurerm_resource_group" "existing" {
  name = "rg-shared-services"
}

# Look up an existing virtual network
data "azurerm_virtual_network" "main" {
  name                = "vnet-production"
  resource_group_name = data.azurerm_resource_group.existing.name
}

# Look up a specific subnet
data "azurerm_subnet" "app" {
  name                 = "snet-applications"
  virtual_network_name = data.azurerm_virtual_network.main.name
  resource_group_name  = data.azurerm_resource_group.existing.name
}

# Deploy a VM into the existing subnet
# (assumes an azurerm_network_interface "app" attached to
# data.azurerm_subnet.app is defined elsewhere)
resource "azurerm_linux_virtual_machine" "app" {
  name                  = "vm-app-01"
  resource_group_name   = data.azurerm_resource_group.existing.name
  location              = data.azurerm_resource_group.existing.location
  size                  = "Standard_B2s"
  network_interface_ids = [azurerm_network_interface.app.id]

  admin_username = "azureuser"
  admin_ssh_key {
    username   = "azureuser"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}
```
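
Azure has an analog of aws_caller_identity as well: azurerm_client_config exposes the subscription, tenant, and client IDs of the credentials Terraform is running with. A minimal sketch:

```hcl
data "azurerm_client_config" "current" {}

output "deploying_to" {
  # Handy sanity check before applying to the wrong subscription
  value = "Subscription ${data.azurerm_client_config.current.subscription_id} in tenant ${data.azurerm_client_config.current.tenant_id}"
}
```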

The External Data Source

Sometimes you need data that no provider offers. The external data source runs an external script and reads its JSON output.

```hcl
data "external" "git_info" {
  program = ["bash", "-c", <<-EOF
    echo '{"branch":"'$(git rev-parse --abbrev-ref HEAD)'","commit":"'$(git rev-parse --short HEAD)'"}'
  EOF
  ]
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  tags = {
    Name       = "app-server"
    GitBranch  = data.external.git_info.result.branch
    GitCommit  = data.external.git_info.result.commit
    DeployedAt = timestamp()
  }
}
```

The script must output valid JSON to stdout. The result attribute is a map of the JSON keys and values. Use this sparingly — it introduces external dependencies that can break portability.
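
Hand-building JSON with echo breaks as soon as a branch name contains a quote. A slightly more robust variant (a sketch, assuming jq is installed on the machine running Terraform) lets jq handle quoting and escaping:

```hcl
data "external" "git_info" {
  # jq builds the JSON object, so special characters in branch
  # names cannot corrupt the output.
  program = ["bash", "-c", <<-EOF
    jq -n \
      --arg branch "$(git rev-parse --abbrev-ref HEAD)" \
      --arg commit "$(git rev-parse --short HEAD)" \
      '{branch: $branch, commit: $commit}'
  EOF
  ]
}
```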

Data Source Refresh Behavior

Data sources are read during the plan phase, not just during apply. This means:

  1. Every terraform plan makes API calls to refresh data source values.
  2. If the underlying data changes between plan and apply, the applied state reflects the plan-time values.
  3. Data sources with depends_on are read during the apply phase instead (after their dependencies are resolved).

```hcl
# This data source depends on a resource being created first
data "aws_instance" "created_instance" {
  instance_id = aws_instance.web.id

  depends_on = [aws_instance.web]
}

output "instance_private_dns" {
  value = data.aws_instance.created_instance.private_dns
}
```

Practical Example — Complete Lookup Pattern

Here is a realistic pattern that ties together multiple data sources to deploy an application into existing infrastructure:

```hcl
# Who am I?
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

# What infrastructure exists?
data "aws_vpc" "main" {
  tags = { Name = "main-vpc" }
}

data "aws_subnets" "app" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
  tags = { Tier = "application" }
}

# What is the latest AMI?
data "aws_ami" "app" {
  most_recent = true
  owners      = [data.aws_caller_identity.current.account_id]
  filter {
    name   = "name"
    values = ["app-server-*"]
  }
}

# What security group should I use?
data "aws_security_group" "app" {
  vpc_id = data.aws_vpc.main.id
  tags   = { Name = "app-sg" }
}

# Now deploy using all of that looked-up information
resource "aws_instance" "app" {
  count                  = length(data.aws_subnets.app.ids)
  ami                    = data.aws_ami.app.id
  instance_type          = var.instance_type
  subnet_id              = data.aws_subnets.app.ids[count.index]
  vpc_security_group_ids = [data.aws_security_group.app.id]

  tags = {
    Name = "app-${count.index + 1}"
  }
}
```

Zero hardcoded IDs. Every value comes from data sources or variables. This configuration works across any account that has the expected infrastructure in place.
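
Because nothing is hardcoded, it is worth failing loudly when a lookup comes back empty: aws_subnets returns an empty list rather than an error if no subnets match. On Terraform 1.5 or newer, a check block (sketched here) can assert that the expected infrastructure was actually found:

```hcl
check "app_subnets_exist" {
  assert {
    # aws_subnets silently returns [] when nothing matches,
    # which would otherwise deploy zero instances without warning.
    condition     = length(data.aws_subnets.app.ids) > 0
    error_message = "No subnets tagged Tier=application were found in main-vpc."
  }
}
```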

Wrapping Up

Data sources are the glue between Terraform-managed infrastructure and everything else. Use them to look up AMIs, reference shared VPCs, grab account metadata, and bridge team boundaries. They keep your configurations portable and eliminate the hardcoded-ID problem.

In the next post, we will do a deep dive into the Terraform CLI — every command you need, from init to import, with the flags that matter most.