DevOps Cloud Cost Optimization and FinOps

Cloud computing offers flexibility and speed, but unchecked usage leads to significant waste. Industry surveys commonly estimate that organizations waste roughly 30–35% of their cloud spend on unused or over-provisioned resources. FinOps (a blend of "Finance" and "DevOps") is the practice of bringing financial accountability to cloud spending: engineering, finance, and business teams work together to make informed cost decisions.

In DevOps, cost awareness is not just a finance concern. Engineers who provision infrastructure, design architectures, and build pipelines directly control cloud costs. Understanding cost optimization is a core DevOps competency.

Why Cloud Costs Get Out of Control

Over-provisioned resources: Servers sized for peak load running at 5% CPU most of the time.
Forgotten resources: Dev environments, test databases, and load balancers left running after projects end.
Unoptimized storage: Old snapshots, unused volumes, and data in expensive storage tiers.
Missing auto-scaling: Fixed resource counts that cannot shrink during low-traffic periods.
No tagging strategy: No way to identify which team, project, or environment owns a resource.
Data transfer costs: Unexpected egress fees from inter-region data movement.

Core FinOps Principles

Visibility: Everyone sees what they spend in near-real-time.
Accountability: Teams own their cloud costs — not just the infrastructure team.
Optimization: Continuous improvement rather than one-time cost-cutting exercises.
Collaboration: Engineering, finance, and product teams make cost decisions together.

AWS Cost Management Tools

AWS Cost Explorer

Cost Explorer provides interactive graphs showing spending by service, region, account, and tag. Use it to identify the top cost drivers and trends over time.

AWS Budgets

Budgets set spending thresholds and send alerts when costs approach or exceed them. Every team and project should have a budget alarm.

# Terraform: Create a budget alert
resource "aws_budgets_budget" "webapp_monthly" {
  name         = "webapp-production-monthly"
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["devops-team@company.com"]
  }

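  # cost_filter blocks replace the cost_filters map, which newer AWS provider versions no longer accept
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Project$webapp"]
  }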
}
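
To make the threshold semantics concrete, the following Python sketch mirrors the comparison the budget above performs: an ACTUAL notification with GREATER_THAN at 80% of a 500 USD limit fires once spend passes 400 USD. The function name is illustrative and not part of any AWS SDK.

```python
def budget_alert_triggered(actual_spend: float, limit: float, threshold_pct: float) -> bool:
    """Return True when actual spend is strictly greater than threshold_pct percent of the limit."""
    return actual_spend > limit * threshold_pct / 100.0


# With a 500 USD limit and an 80% threshold, the alert line sits at 400 USD.
print(budget_alert_triggered(399.0, 500.0, 80.0))  # False
print(budget_alert_triggered(410.0, 500.0, 80.0))  # True
```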

AWS Cost Anomaly Detection

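Cost Anomaly Detection is a machine-learning-based service that monitors spending patterns, detects unexpected cost spikes, and sends alerts. It can surface forgotten resources and unexpected usage that would otherwise go unnoticed until the monthly bill arrives.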
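
AWS does not publish the model behind Cost Anomaly Detection, so the following is only a rough illustration of the general idea, using a simple rolling z-score rather than the service's actual method. The function name, window size, threshold, and sample figures are all assumptions.

```python
from statistics import mean, stdev


def find_cost_spikes(daily_costs, window=7, z_threshold=3.0, min_sigma=1.0):
    """Return indexes of days whose cost sits more than z_threshold standard
    deviations above the mean of the preceding `window` days.

    min_sigma (in currency units) keeps a perfectly flat history from turning
    every tiny increase into a spike.
    """
    spikes = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if (daily_costs[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily spend in USD; day 7 is the outlier.
costs = [100, 102, 98, 101, 99, 100, 103, 250, 101]
print(find_cost_spikes(costs))  # [7]
```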

AWS Trusted Advisor

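Trusted Advisor provides automated recommendations for cost optimization, security, performance, and fault tolerance. It flags idle EC2 instances, underutilized RDS databases, and unassociated Elastic IPs. The full set of cost optimization checks requires a paid support plan at the Business tier or higher.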

Right-Sizing

Right-sizing means choosing the smallest resource that meets performance requirements. Most teams over-provision to be safe — then never revisit the decision.

Right-Sizing Process

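Enable CloudWatch detailed monitoring on EC2 instances (1-minute instead of 5-minute metrics).
Collect 2–4 weeks of CPU, memory, and network utilization data. Memory is not reported by default and requires the CloudWatch agent.
Identify instances consistently running below 40% CPU as candidates for downsizing.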
Use AWS Compute Optimizer recommendations.
Downsize in staging first, monitor for 1 week, then apply to production.

# AWS CLI: Get Compute Optimizer recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --region us-east-1 \
  --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].{
    Instance:instanceArn,
    CurrentType:currentInstanceType,
    RecommendedType:recommendationOptions[0].instanceType,
    MonthlySavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }'
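
The "consistently below 40% CPU" step of the process above can be sketched in Python. This assumes the utilization data has already been exported as one daily peak CPU percentage per instance; the instance IDs and figures are made up for illustration, and memory or network limits would still need a separate check.

```python
def downsizing_candidates(cpu_by_instance, threshold_pct=40.0, min_samples=14):
    """Return IDs of instances whose daily peak CPU never reached threshold_pct.

    Instances with fewer than min_samples data points are skipped, since two
    weeks is the minimum observation period suggested above.
    """
    candidates = []
    for instance_id, samples in cpu_by_instance.items():
        if len(samples) < min_samples:
            continue  # not enough history to judge
        if max(samples) < threshold_pct:
            candidates.append(instance_id)
    return sorted(candidates)


# Hypothetical daily peak CPU percentages.
metrics = {
    "i-hypothetical-a": [22.0, 31.5] * 7,      # always under 40%
    "i-hypothetical-b": [35.0] * 13 + [88.0],  # one busy day rules it out
    "i-hypothetical-c": [10.0] * 5,            # too little data
}
print(downsizing_candidates(metrics))  # ['i-hypothetical-a']
```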

Reserved Instances and Savings Plans

On-demand pricing is the most expensive option. Committing to 1 or 3 years of usage unlocks significant discounts.

Pricing Model	Commitment	Typical Discount	Flexibility
On-Demand	None	0%	Maximum — pay as you go
Savings Plans (Compute)	1 or 3 years (hourly spend)	Up to 66%	High — applies to any instance type or region
Reserved Instances	1 or 3 years (instance type)	Up to 72%	Medium — tied to instance family and region
Spot Instances	None	Up to 90%	Low — can be interrupted with 2 min notice

Strategy

Use Savings Plans for baseline, predictable workloads (e.g., always-on application servers).
Use Spot Instances for fault-tolerant workloads: CI/CD build agents, batch jobs, data processing, Kubernetes worker nodes.
Use On-Demand for unpredictable workloads above the committed baseline.
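
A small worked example may help show how the three pricing models combine. The sketch below assumes identical instances at a hypothetical 0.10 USD hourly on-demand rate, with illustrative discounts of 30% for the Savings Plan and 60% for Spot; real discounts vary by instance type, region, and term, and the "up to" figures in the table are upper bounds.

```python
def blended_hourly_cost(on_demand_rate, baseline_count, spot_count, burst_count,
                        savings_plan_discount, spot_discount):
    """Hourly cost of a fleet split across Savings Plan, Spot, and On-Demand capacity."""
    baseline = baseline_count * on_demand_rate * (1 - savings_plan_discount)
    spot = spot_count * on_demand_rate * (1 - spot_discount)
    burst = burst_count * on_demand_rate  # no discount above the committed baseline
    return baseline + spot + burst


rate = 0.10  # hypothetical on-demand USD per instance-hour
all_on_demand = 20 * rate
mixed = blended_hourly_cost(rate, baseline_count=10, spot_count=6, burst_count=4,
                            savings_plan_discount=0.30, spot_discount=0.60)
print(f"all on-demand: {all_on_demand:.2f} USD/h, mixed: {mixed:.2f} USD/h")
print(f"savings: {1 - mixed / all_on_demand:.0%}")
```

Under these assumed numbers the mixed fleet costs about 1.34 USD per hour instead of 2.00 USD, roughly a third less, without committing to more than the predictable baseline.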

Auto-Scaling for Cost Efficiency

Auto-scaling reduces costs by running only the resources needed for current demand.

Kubernetes Cluster Autoscaler

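# Cluster Autoscaler scales worker nodes based on pending pods
# It also needs IAM permissions to modify the node group's Auto Scaling group
# Add the chart repository, then install into the EKS cluster:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \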
  --set autoDiscovery.clusterName=my-eks-cluster \
  --set awsRegion=us-east-1 \
  --set extraArgs.scale-down-utilization-threshold=0.5 \
  --set extraArgs.scale-down-delay-after-add=10m

Scheduled Scaling – Dev Environment Cost Control

# Terraform: Scale dev EKS node group to 0 overnight
resource "aws_autoscaling_schedule" "dev_scale_down" {
  scheduled_action_name  = "dev-scale-down-overnight"
  autoscaling_group_name = aws_eks_node_group.dev.resources[0].autoscaling_groups[0].name
  recurrence             = "0 20 * * MON-FRI"   # 8 PM on weekdays (UTC unless time_zone is set)
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
}

resource "aws_autoscaling_schedule" "dev_scale_up" {
  scheduled_action_name  = "dev-scale-up-morning"
  autoscaling_group_name = aws_eks_node_group.dev.resources[0].autoscaling_groups[0].name
  recurrence             = "0 7 * * MON-FRI"    # 7 AM on weekdays (UTC unless time_zone is set)
  min_size               = 1
  max_size               = 5
  desired_capacity       = 2
}

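A dev cluster that runs only during business hours instead of 24/7 cuts node compute costs sharply. The schedule above keeps nodes up for 13 hours on 5 days, or 65 of 168 hours per week, which is a reduction of about 61%; a 10-hour window brings it to roughly 70%.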
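
The arithmetic behind such estimates is easy to check. The following Python sketch assumes cost is proportional to node-hours and ignores fixed charges such as the EKS control plane fee; the function name is illustrative.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def scheduled_savings(start_hour, stop_hour, days_per_week):
    """Fraction of weekly node-hours saved by running only between start and stop."""
    running_hours = (stop_hour - start_hour) * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 07:00 to 20:00 on weekdays, as in the Terraform schedule above
print(f"{scheduled_savings(7, 20, 5):.0%}")  # 61%
# A shorter 08:00 to 18:00 window
print(f"{scheduled_savings(8, 18, 5):.0%}")  # 70%
```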

Storage Cost Optimization

S3 Lifecycle Policies

resource "aws_s3_bucket_lifecycle_configuration" "logs_lifecycle" {
  bucket = aws_s3_bucket.app_logs.id

  rule {
    id     = "log-archival"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"   # 30 days: move to infrequent access (45% cheaper)
    }

    transition {
      days          = 90
      storage_class = "GLACIER"       # 90 days: archive to Glacier (80% cheaper)
    }

    expiration {
      days = 365                      # 1 year: delete permanently
    }
  }
}
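
To see what this policy is worth over an object's lifetime, the sketch below averages the relative storage price across the three tiers. It takes the discounts from the comments above (45% and 80% cheaper than Standard) as given and ignores transition, retrieval, and minimum-duration charges, so treat the result as a rough upper bound on savings.

```python
def lifecycle_relative_cost(tiers, total_days):
    """Average storage price over total_days, as a fraction of the Standard price.

    tiers is a list of (start_day, relative_price) pairs sorted by start_day.
    """
    cost = 0.0
    for idx, (start, price) in enumerate(tiers):
        end = tiers[idx + 1][0] if idx + 1 < len(tiers) else total_days
        cost += (end - start) * price
    return cost / total_days


policy = [
    (0, 1.00),   # Standard for the first 30 days
    (30, 0.55),  # Standard-IA, assumed 45% cheaper
    (90, 0.20),  # Glacier, assumed 80% cheaper
]
relative = lifecycle_relative_cost(policy, total_days=365)
print(f"average price: {relative:.0%} of Standard, saving {1 - relative:.0%}")
```

With these inputs a log object kept for a year costs about a third of what it would in Standard the whole time.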

EBS Volume Cleanup

# Find unattached EBS volumes (wasted spend)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,SizeGiB:Size,Type:VolumeType,Created:CreateTime}' \
  --output table
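
The CLI reports sizes, not prices. A rough monthly figure can be estimated by multiplying each volume's size by a per-GiB rate, as in this Python sketch. The `Size` and `VolumeType` keys match the `describe-volumes` JSON output, but the volume IDs and the rates in `PRICE_PER_GIB_MONTH` are illustrative assumptions; check current pricing for your region, and note that provisioned IOPS and throughput charges are not included.

```python
# Illustrative USD rates per GiB-month; not authoritative pricing.
PRICE_PER_GIB_MONTH = {"gp3": 0.08, "gp2": 0.10}


def estimate_monthly_cost(volumes, price_table):
    """Sum size * rate over volumes; unknown volume types raise KeyError."""
    total = 0.0
    for vol in volumes:
        total += vol["Size"] * price_table[vol["VolumeType"]]
    return total


# Hypothetical unattached volumes, shaped like `aws ec2 describe-volumes` output.
unattached = [
    {"VolumeId": "vol-hypothetical-1", "Size": 100, "VolumeType": "gp3"},
    {"VolumeId": "vol-hypothetical-2", "Size": 500, "VolumeType": "gp2"},
]
print(f"{estimate_monthly_cost(unattached, PRICE_PER_GIB_MONTH):.2f} USD per month")
```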

Tagging Strategy for Cost Allocation

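Without tags, it is very hard to know which team, application, or environment generated a cost. A consistent tagging strategy is the foundation of cost accountability. On AWS, user-defined tags must also be activated as cost allocation tags in the Billing console before they appear in Cost Explorer and Budgets.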

Required Tags (enforced via AWS Config or Terraform policy)

Tag Key	Example Value	Purpose
Environment	production / staging / dev	Environment cost breakdown
Team	payments / frontend / data	Team-level charge-back
Project	checkout-v2 / mobile-app	Project cost tracking
ManagedBy	terraform / manual	IaC compliance tracking
Owner	john.smith@company.com	Escalation contact

Enforce Tagging with Terraform

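# Default tags applied to every taggable resource created through this provider
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = var.environment
      Team        = var.team_name
      Project     = var.project_name # assumed variable name; Project is a required tag in the table above
      ManagedBy   = "terraform"
      Owner       = var.owner_email
    }
  }
}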
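
Default tags cover resources created through Terraform, but anything created by hand can still slip through. A compliance check over exported resource tags could look like the following Python sketch, which uses the required keys from the table above; the function name and sample data are illustrative.

```python
REQUIRED_TAGS = ("Environment", "Team", "Project", "ManagedBy", "Owner")


def missing_tags(tags):
    """Return the required tag keys that are absent or blank in a resource's tags."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


# Hypothetical resource: Project is missing and Owner is blank.
resource_tags = {
    "Environment": "dev",
    "Team": "payments",
    "ManagedBy": "manual",
    "Owner": " ",
}
print(missing_tags(resource_tags))  # ['Project', 'Owner']
```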

FinOps Tools

Tool	Purpose
AWS Cost Explorer	Native AWS cost analysis and forecasting
Infracost	Shows cost estimates for Terraform plans in CI/CD PRs
Kubecost	Cost visibility per Kubernetes namespace, workload, and team
CloudHealth	Multi-cloud cost management and governance
Spot.io (Spot by NetApp)	Automated Spot Instance management for Kubernetes

Infracost in CI/CD Pipeline

# GitHub Actions: Show cost estimate on every Terraform PR
- name: Setup Infracost
  uses: infracost/actions/setup@v2
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

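- name: Generate cost estimate
  # For an increase/decrease figure, generate a baseline on the base branch
  # and run `infracost diff --compare-to <baseline.json>` instead
  run: infracost breakdown --path ./terraform/production --format json --out-file /tmp/infracost.json

- name: Post PR comment with cost estimate
  run: |
    infracost comment github --path /tmp/infracost.json \
      --repo $GITHUB_REPOSITORY \
      --pull-request ${{ github.event.pull_request.number }} \
      --github-token ${{ github.token }} \
      --behavior update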

Every pull request that changes infrastructure automatically shows how much cost increases or decreases — before the change is applied.
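
Some teams go a step further and fail the pipeline when a change adds too much cost. The Python sketch below shows one way such a gate might work. It assumes the Infracost JSON report carries a top-level `diffTotalMonthlyCost` field holding a numeric string, which should be verified against the output of your Infracost version; the function name and threshold are hypothetical.

```python
import json


def cost_increase_exceeds(report, max_increase_usd):
    """Return True when the reported monthly cost increase is above the limit.

    A missing diff field is treated as "no increase" rather than a failure.
    """
    raw = report.get("diffTotalMonthlyCost")  # assumed field name
    if raw is None:
        return False
    return float(raw) > max_increase_usd


# Hypothetical report fragment; a real one would be read from the JSON file.
sample = json.loads('{"diffTotalMonthlyCost": "42.50"}')
print(cost_increase_exceeds(sample, max_increase_usd=25.0))  # True
```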

Summary

Cloud waste is commonly estimated at 30–35% of spend; right-sizing, auto-scaling, and storage lifecycle policies recover a significant share of it.
FinOps makes cloud costs visible, accountable, and continuously optimized across engineering and finance teams.
Savings Plans and Reserved Instances reduce baseline compute costs by up to 66–72% vs on-demand pricing, depending on the commitment type.
Spot Instances cut costs by up to 90% for fault-tolerant workloads like CI/CD agents and batch jobs.
A mandatory tagging strategy enables accurate cost allocation to teams, projects, and environments.
Infracost integrates cost estimates into CI/CD pipelines — surfacing financial impact during code review.
