Terraform Best Practices for Production Environments

Learning Terraform syntax is one thing. Running it reliably for production infrastructure used by real customers — with a team, at scale, over years — is another. This final topic brings together the principles, patterns, and habits that experienced Terraform practitioners apply to keep production infrastructure stable, secure, and maintainable.

1. Structure Your Repository Clearly

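File organisation has a big impact on how easily a team can navigate and safely change infrastructure. Two widely used patterns are the flat layout, where all configuration lives in a single directory, and the module layout, which separates reusable modules from one root configuration per environment. For production, the module layout shown below scales better.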

Recommended Repository Structure

infrastructure/
  modules/                  # Reusable building blocks
    networking/
      main.tf
      variables.tf
      outputs.tf
    compute/
    database/

  environments/             # One directory per environment
    dev/
      main.tf               # Calls shared modules
      variables.tf
      terraform.tfvars
      backend.tf
    staging/
      main.tf
      ...
    prod/
      main.tf
      ...

Separate directories per environment make it much harder to apply dev changes to prod by accident. Each environment has its own backend, its own state file, and its own variable values, so a terraform apply run inside environments/dev reads and writes only the dev state.
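As a hypothetical illustration, each environment directory can keep its own terraform.tfvars, which Terraform loads automatically when you run it from that directory. The variable names and values below are assumptions, not taken from a real project:

```hcl
# environments/dev/terraform.tfvars (hypothetical values)
environment   = "dev"
instance_type = "t3.small"
min_capacity  = 1

# environments/prod/terraform.tfvars (hypothetical values)
environment   = "prod"
instance_type = "m6i.large"
min_capacity  = 3
```

Because you run terraform plan from inside environments/prod, that run picks up only prod's backend.tf and terraform.tfvars; the dev values are never in scope.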

2. Always Pin Versions

Pin three things: Terraform version, provider versions, and module versions. Unpinned versions invite surprise breaking changes on terraform init.

terraform {
  required_version = ">= 1.9.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.31"
    }
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.2.0"   # exact pin
}
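The `~> 5.31` constraint above accepts 5.31 and any later 5.x release but never 6.0, so you receive fixes without an unplanned major-version upgrade. The `version` argument only works for modules installed from a registry; for modules pulled from Git, pin with a `ref` in the source address instead. A minimal sketch, where the repository URL, subdirectory, and tag are hypothetical:

```hcl
module "networking" {
  # Hypothetical repository: "//networking" selects a subdirectory and
  # "?ref=v1.4.0" pins a tag (a full commit SHA is stricter still)
  source = "git::https://github.com/example-org/terraform-modules.git//networking?ref=v1.4.0"
}
```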

3. Use Remote State with Encryption and Locking

Every production project must use remote state. Never leave state on a developer's machine. Encrypt the bucket, enable versioning, and enable state locking.

terraform {
  backend "s3" {
    bucket         = "company-terraform-state-prod"
    key            = "infrastructure/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
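The bucket and lock table must exist before the backend can use them, so they are usually created once from a separate bootstrap configuration. A minimal sketch using AWS provider v5 resources and reusing the bucket and table names from the backend block above; the choice of KMS encryption and on-demand billing is an assumption:

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "company-terraform-state-prod"
}

# Versioning keeps earlier state files so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state objects at rest; with no key ID, S3 uses the AWS managed KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State can contain secrets, so block every form of public access
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string partition key named exactly "LockID"
resource "aws_dynamodb_table" "lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

On Terraform 1.11 and later, the S3 backend can instead lock with `use_lockfile = true`, which stores the lock in the bucket itself, and the `dynamodb_table` argument is deprecated there.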

4. Never Commit Secrets

Add these lines to every Terraform project's .gitignore:

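.terraform/
*.tfstate
*.tfstate.*
# Ignore variable files if any of them might contain secrets
*.tfvars
*.tfvars.json

Keep comments on their own lines: Git treats # as a comment only at the start of a line, so a trailing comment becomes part of the pattern and the pattern stops matching. Do commit .terraform.lock.hcl, which records the exact provider versions selected by terraform init; HashiCorp recommends keeping it in version control.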

Store secrets in a vault (AWS Secrets Manager, HashiCorp Vault) and read them at apply time with data sources. Use environment variables or CI/CD secret storage for credentials.
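A minimal sketch of reading a secret at apply time, assuming the database password is stored in AWS Secrets Manager under the hypothetical name prod/app/db-password; the database settings are also hypothetical. Values read through data sources are still written to the state file, which is one more reason the state bucket must be encrypted and access-controlled:

```hcl
# Fetch the current version of the secret during plan and apply
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/app/db-password" # hypothetical secret name
}

resource "aws_db_instance" "main" {
  identifier        = "app-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  username          = "app_admin"

  # The password never appears in .tf or .tfvars files committed to Git
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```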

5. Review Every Plan Before Applying

Never run terraform apply without first reading the plan output carefully. In production, make apply approval a human gate in your CI/CD pipeline — the plan runs automatically, but a team member must review and approve it before apply executes.

Look specifically for:

Unexpected destroy operations (resources marked - in the plan output)
Resource replacements (-/+ or +/-) on critical resources like databases
More changes than expected — someone else may have committed code you did not notice
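Plan review is the main safeguard, but you can also make Terraform itself refuse to destroy a critical resource. A minimal sketch using the `prevent_destroy` lifecycle argument on a database; the instance settings and `var.db_password` are hypothetical:

```hcl
resource "aws_db_instance" "main" {
  identifier        = "app-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  username          = "app_admin"
  password          = var.db_password # hypothetical sensitive variable

  lifecycle {
    # Terraform rejects, with an error, any plan that would destroy this
    # instance, including a -/+ replacement
    prevent_destroy = true
  }
}
```

The guard lives inside the resource block, so deleting the whole block from the configuration also removes the protection. Reading the plan still matters.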

6. Tag Every Resource

Tags on cloud resources are essential for cost allocation, security audits, and incident response. Define a standard tagging strategy once in a central locals block and, if you use HCP Terraform or Terraform Enterprise, enforce it with a Sentinel policy.

locals {
  required_tags = {
    Environment = var.environment
    Project     = var.project_name
    Owner       = var.team_name
    ManagedBy   = "Terraform"
    CostCenter  = var.cost_center
  }
}

Apply local.required_tags to every resource that supports tags. When an alert fires at 3 AM, tags tell you which team owns the resource, which environment it runs in, and which project it belongs to.
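Rather than repeating the map on every resource, the AWS provider's `default_tags` block can apply it to every taggable resource the provider manages, leaving each resource to add only its own specific tags. A sketch, assuming the locals block above and a hypothetical bucket name:

```hcl
provider "aws" {
  region = "us-east-1"

  # Added to every resource this provider manages that supports tags
  default_tags {
    tags = local.required_tags
  }
}

resource "aws_s3_bucket" "logs" {
  bucket = "company-app-logs-prod" # hypothetical bucket name

  # Resource-specific tags are combined with the default tags
  tags = {
    Name = "app-logs"
  }
}
```

For providers without an equivalent feature, pass `merge(local.required_tags, { Name = "..." })` to each resource's `tags` argument instead.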

7. Use Modules for Every Repeatable Pattern

If you write the same group of resources more than once, make a module. A good module:

Has a clear, single responsibility (networking, compute, database)
Takes everything that varies between uses as input variables, with no hard-coded environment-specific values inside
Exposes all useful outputs
Has a README.md explaining its inputs, outputs, and usage
Has tests that verify its core behaviour
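A hypothetical networking module that follows these rules might look like this: values that change between environments are variables, inputs are validated, useful IDs are exposed as outputs, and a `terraform test` file (available since Terraform 1.6) checks the core behaviour. The file paths follow the repository layout from section 1; the variable names are assumptions:

```hcl
# modules/networking/variables.tf
variable "vpc_cidr" {
  description = "CIDR block for the VPC, e.g. 10.0.0.0/16"
  type        = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block."
  }
}

variable "tags" {
  description = "Tags applied to every resource in the module"
  type        = map(string)
  default     = {}
}

# modules/networking/main.tf
resource "aws_vpc" "this" {
  cidr_block = var.vpc_cidr
  tags       = var.tags
}

# modules/networking/outputs.tf
output "vpc_id" {
  description = "ID of the VPC, for use by other modules"
  value       = aws_vpc.this.id
}

# modules/networking/tests/vpc.tftest.hcl
variables {
  vpc_cidr = "10.0.0.0/16"
}

run "vpc_uses_requested_cidr" {
  command = plan

  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block does not match the vpc_cidr input."
  }
}
```

Run the test with `terraform test` from the module directory.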

8. Keep Modules Small and Focused

A module that creates everything — VPC, subnets, EC2, RDS, S3, IAM — is hard to test, hard to reuse, and dangerous to change. Break it into focused modules and compose them at the environment level.

Diagram: Composing Small Modules

prod/main.tf
  |
  |---> module "network"    (creates VPC + subnets)
  |---> module "database"   (creates RDS, uses network outputs)
  |---> module "compute"    (creates EC2/ECS, uses network + db outputs)
  |---> module "monitoring" (creates CloudWatch dashboards + alarms)
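In code, composition means passing one module's outputs into another module's inputs. A sketch of prod/main.tf, assuming the modules live under modules/ as in the repository layout from section 1; every input and output name here is hypothetical:

```hcl
module "network" {
  source   = "../../modules/networking"
  vpc_cidr = "10.20.0.0/16"
}

module "database" {
  source     = "../../modules/database"
  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.private_subnet_ids
}

module "compute" {
  source      = "../../modules/compute"
  vpc_id      = module.network.vpc_id
  subnet_ids  = module.network.private_subnet_ids
  db_endpoint = module.database.endpoint
}

module "monitoring" {
  source      = "../../modules/monitoring"
  environment = "prod"
}
```

Terraform infers the creation order from these references, so the network comes before the database and the database before compute, without any explicit depends_on.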

9. Implement Drift Detection

Schedule a regular terraform plan run (daily is common) to detect drift — changes made to real infrastructure outside of Terraform. Alert when the plan shows unexpected differences.

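# GitHub Actions scheduled drift detection
# (cloud credentials for the backend and provider must also be configured; not shown)
name: terraform-drift-check
on:
  schedule:
    - cron: '0 6 * * *'    # Run daily at 6 AM UTC

jobs:
  drift-check:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure/environments/prod   # layout from section 1
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init -input=false
      - name: Terraform Plan (drift check)
        run: terraform plan -detailed-exitcode -input=false
        # Exit code 0 = no drift, 1 = error, 2 = changes detected.
        # Any non-zero code fails the job; route failed runs to your alerting channel.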
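Some attributes legitimately change outside Terraform, and each one would show up as drift every day. For those, the `ignore_changes` lifecycle argument tells Terraform to stop comparing them. A sketch, assuming an Auto Scaling group whose `desired_capacity` is adjusted by a scaling policy; the names, `var.private_subnet_ids`, and the launch template it references are hypothetical and assumed to be defined elsewhere:

```hcl
resource "aws_autoscaling_group" "app" {
  name                = "app-prod"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids # hypothetical variable

  launch_template {
    id      = aws_launch_template.app.id # assumed to be defined elsewhere
    version = "$Latest"
  }

  lifecycle {
    # A scaling policy changes this value at runtime; ignore it so the
    # daily drift check does not report it as an unexpected difference
    ignore_changes = [desired_capacity]
  }
}
```

Use `ignore_changes` sparingly: every ignored attribute is one that drift detection can no longer see.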

10. Document Your Infrastructure

Write a README.md for every module and every environment directory. Include what the code creates, what variables are required, what the outputs mean, and any operational notes. Future you — or your teammate at 2 AM — will be grateful.

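Use terraform-docs, an open-source tool that automatically generates Markdown documentation from your Terraform variable and output declarations. Redirecting its output with > overwrites the whole README, including any operational notes written by hand. Instead, let it write between <!-- BEGIN_TF_DOCS --> and <!-- END_TF_DOCS --> markers and leave the rest of the file alone:

terraform-docs markdown table --output-file README.md --output-mode inject .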

Key Points Summary

Separate environment directories prevent accidental cross-environment changes and reduce the blast radius of any single change.
Pin Terraform, provider, and module versions to prevent surprise breakage from upstream changes.
Remote state with encryption, versioning, and locking is non-negotiable for any production project.
Every resource must be tagged with environment, project, owner, and managed-by information.
Plan review is a human gate — never auto-apply to production without a person reviewing the plan output.
Schedule daily drift detection to catch out-of-band changes before they cause incidents.
Document every module and environment with clear READMEs; use terraform-docs to automate input/output documentation.
