AWS CloudWatch and Monitoring

AWS CloudWatch is the central monitoring and observability service for AWS. It collects metrics, logs, and events from AWS resources and applications, then allows setting alarms, creating dashboards, and automating responses based on what the data shows.

Monitoring is not optional in production systems. An application that goes down at 2 AM and is not discovered until morning costs money and users. CloudWatch is the tool that makes sure problems are detected and acted upon — often before users notice.

What CloudWatch Monitors

CloudWatch collects data from nearly every AWS service automatically. Examples of what it tracks:

  • EC2: CPU utilization, network traffic, disk read/write operations
  • RDS: Database connections, free storage space, read/write latency
  • Lambda: Invocation count, execution duration, error rate, throttles
  • S3: Number of requests, bytes downloaded, 4xx/5xx errors
  • API Gateway: Request count, latency, error rates
  • ECS/EKS: Container CPU and memory usage

CloudWatch Core Components

1. Metrics

A metric is a time-ordered series of data points. Each data point has a timestamp and a value. For example, EC2 reports CPU utilization (a percentage) every 5 minutes with basic monitoring, or every 1 minute when detailed monitoring is enabled.

EC2 CPU Utilization — Last 1 Hour

100% |
 75% |              *
 50% |    *    *         *    *
 25% |         
  0% +----+----+----+----+----+----
    12:00 12:10 12:20 12:30 12:40 12:50

Metrics have:

  • Namespace: Category of the metric (e.g., AWS/EC2, AWS/Lambda)
  • Dimension: Identifies which specific resource (e.g., InstanceId = i-1234567890)
  • Resolution: Standard (1 minute) or High Resolution (1 second) for custom metrics
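
The namespace, metric name, and dimension together identify one series. A minimal sketch of reading that series with the AWS SDK for Python, assuming `client` is a `boto3.client("cloudwatch")` object; the helper names `cpu_query` and `fetch_cpu` are illustrative, not part of any AWS API:

```python
from datetime import datetime, timedelta, timezone


def cpu_query(instance_id, minutes=60, period_seconds=300):
    """Build GetMetricStatistics parameters for one instance's CPU metric."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",            # category of the metric
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_seconds,          # seconds covered by each data point
        "Statistics": ["Average"],
    }


def fetch_cpu(client, instance_id):
    """Return (timestamp, average) pairs, oldest first."""
    response = client.get_metric_statistics(**cpu_query(instance_id))
    # The API does not guarantee ordering, so sort by timestamp.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Average"]) for p in points]
```

Passing the client in as an argument keeps the sketch testable without AWS credentials.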

2. Alarms

A CloudWatch Alarm watches a metric and triggers an action when the metric crosses a defined threshold. An alarm can be in three states:

  • OK: Metric is within the normal range.
  • ALARM: Metric has crossed the threshold.
  • INSUFFICIENT_DATA: Not enough data to evaluate the alarm.

Example alarm: send an email notification when average EC2 CPU usage stays above 80% for two consecutive 5-minute periods (10 minutes in total):

Metric:    AWS/EC2 CPUUtilization
Condition: Greater than 80%
Period:    5 minutes (2 consecutive periods)
Action:    Notify via SNS → sends email to ops-team@company.com

Alarms can also trigger Auto Scaling actions (add more EC2 instances when CPU is high) or stop/terminate EC2 instances automatically.
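
The example alarm above could be created from code roughly as follows. This is a sketch, assuming `client` is a `boto3.client("cloudwatch")` object and that an SNS topic with the email subscription already exists; the alarm name and the helper names are hypothetical:

```python
def cpu_alarm_params(instance_id, sns_topic_arn):
    """Parameters for PutMetricAlarm matching the example: CPU > 80% for 2 x 5 minutes."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",   # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                             # 5 minutes, in seconds
        "EvaluationPeriods": 2,                    # 2 consecutive periods
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],           # notified on entering ALARM
    }


def create_cpu_alarm(client, instance_id, sns_topic_arn):
    """Create or update the alarm."""
    client.put_metric_alarm(**cpu_alarm_params(instance_id, sns_topic_arn))
```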

3. Logs — CloudWatch Logs

CloudWatch Logs collects and stores log files from applications and AWS services. Log data flows from applications into log groups and log streams.

  • Log Group: A container for logs from one application or service (e.g., /aws/lambda/my-function).
  • Log Stream: A sequence of log events from one specific instance or invocation.
  • Log Event: A single line entry — a timestamp and a message.

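Lambda automatically sends print() or console.log() output to CloudWatch Logs, provided the function's execution role has permission to write logs (the AWSLambdaBasicExecutionRole managed policy grants this). No agent or extra logging code is required.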

Example log stream from a Lambda function:

[2024-03-15 10:23:01] START RequestId: abc123
[2024-03-15 10:23:01] Processing order: ORD-9876
[2024-03-15 10:23:01] Order validated successfully
[2024-03-15 10:23:02] Email sent to customer@example.com
[2024-03-15 10:23:02] END RequestId: abc123
[2024-03-15 10:23:02] REPORT Duration: 980ms Memory Used: 64MB
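
A handler that would produce application lines like the ones above might look like this. It is only a sketch: the event shape (`order_id`) is an assumption, and the START, END, and REPORT lines are written by the Lambda runtime, not by the function code.

```python
import json


def lambda_handler(event, context):
    """Hypothetical order handler; each print() becomes one log event."""
    order_id = event.get("order_id", "unknown")
    print(f"Processing order: {order_id}")
    # Validation and email sending would happen here.
    print("Order validated successfully")
    return {"statusCode": 200, "body": json.dumps({"order_id": order_id})}
```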

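4. Logs Insights

CloudWatch Logs Insights is a query engine for log data. Queries are written in a purpose-built language that chains commands such as fields, filter, sort, and limit with the pipe character, and they can search, filter, and aggregate large volumes of log entries interactively.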

Example query: find the 10 slowest Lambda invocations. The time range (here, the last hour) is selected when the query is run, not in the query text:

fields @timestamp, @duration
| filter @type = "REPORT"
| sort @duration desc
| limit 10
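
The same query can be run from code. A sketch, assuming `client` is a `boto3.client("logs")` object; queries run asynchronously, so the code starts the query and then polls for the result. The function name and the polling interval are illustrative choices.

```python
import time

QUERY = """fields @timestamp, @duration
| filter @type = "REPORT"
| sort @duration desc
| limit 10"""


def slowest_invocations(client, log_group, lookback_seconds=3600, poll_seconds=1.0):
    """Run QUERY over the last `lookback_seconds` and return rows as dicts."""
    end = int(time.time())                       # epoch seconds
    started = client.start_query(
        logGroupName=log_group,
        startTime=end - lookback_seconds,
        endTime=end,
        queryString=QUERY,
    )
    while True:
        result = client.get_query_results(queryId=started["queryId"])
        if result["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if result["status"] != "Complete":
        raise RuntimeError(f"query ended with status {result['status']}")
    # Each row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row} for row in result["results"]]
```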

5. Dashboards

CloudWatch Dashboards are customizable visual panels that display multiple metrics side by side. A production application dashboard might show:

  • EC2 CPU and memory usage (memory requires the CloudWatch Agent)
  • API Gateway request count and error rate
  • Lambda execution count and duration
  • RDS connections and storage free

Dashboards can refresh automatically at a set interval and can be shared across the team.
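
Dashboards are defined by a JSON document, so they can be created from code as well as in the console. A minimal sketch of a one-widget dashboard body; the title, layout values, and default region are placeholders, and the resulting string would be passed as `DashboardBody` to `put_dashboard` on a `boto3.client("cloudwatch")` object:

```python
import json


def dashboard_body(instance_id, region="us-east-1"):
    """Return a dashboard definition with a single EC2 CPU graph."""
    widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,   # position on the grid
        "properties": {
            "title": "EC2 CPU",
            # [namespace, metric name, dimension name, dimension value]
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "stat": "Average",
            "period": 300,
            "region": region,
        },
    }
    return json.dumps({"widgets": [widget]})
```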

6. CloudWatch Events / EventBridge

CloudWatch Events (now called Amazon EventBridge) detects changes in AWS resources and triggers automated responses. It is used to create rule-based automation.

Examples:

  • Run a Lambda function every day at 8 AM (cron-based schedule).
  • Trigger a notification when an EC2 instance is stopped unexpectedly.
  • Invoke a Step Functions workflow when a new file arrives in S3.
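
The first two examples correspond to the two kinds of rule: a schedule and an event pattern. A sketch of the rule definitions, assuming they would be passed to `put_rule` on a `boto3.client("events")` object and then given a target with `put_targets`; the rule names are hypothetical. Schedule expressions use a six-field cron syntax evaluated in UTC.

```python
import json


def daily_8am_rule(name="daily-8am"):
    """Scheduled rule: every day at 08:00 UTC."""
    return {
        "Name": name,
        "ScheduleExpression": "cron(0 8 * * ? *)",  # min hour day month weekday year
        "State": "ENABLED",
    }


def instance_stopped_rule(name="ec2-stopped"):
    """Event-pattern rule: matches EC2 instances entering the stopped state."""
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["stopped"]},
    }
    return {"Name": name, "EventPattern": json.dumps(pattern), "State": "ENABLED"}
```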

Custom Metrics

CloudWatch collects standard AWS service metrics automatically. Custom metrics allow sending application-specific data. Examples:

  • Active users count in a web app
  • Number of failed login attempts
  • Items in a processing queue
  • Business transactions per minute

Custom metrics are pushed to CloudWatch using the AWS SDK or CLI:

aws cloudwatch put-metric-data \
  --namespace "MyApp" \
  --metric-name "ActiveUsers" \
  --value 342 \
  --unit Count
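
The SDK equivalent of the CLI call above, as a sketch assuming `client` is a `boto3.client("cloudwatch")` object; the function name is illustrative:

```python
def publish_active_users(client, count):
    """Send one ActiveUsers data point to the MyApp namespace."""
    client.put_metric_data(
        Namespace="MyApp",
        MetricData=[
            {"MetricName": "ActiveUsers", "Value": float(count), "Unit": "Count"},
        ],
    )
```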

CloudWatch Agent

By default, CloudWatch does not collect memory usage or disk space from EC2 instances — only CPU, network, and disk I/O. The CloudWatch Agent is a lightweight program installed on EC2 that collects additional system-level metrics and sends application logs directly to CloudWatch.
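
The agent is driven by a JSON configuration file. The sketch below generates a small configuration that collects memory and root-disk usage and ships one log file; the log path and log group name are hypothetical, and a real deployment would usually need more settings than shown here.

```python
import json


def agent_config(log_path, log_group):
    """Return a minimal CloudWatch Agent configuration as a dict."""
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_path,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",  # filled in by the agent
                        }
                    ]
                }
            }
        },
    }


print(json.dumps(agent_config("/var/log/myapp/app.log", "myapp-logs"), indent=2))
```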

Monitoring Architecture for a Production App

+----------------------------------------+
|           Application Stack            |
|  [EC2 Web Servers] [Lambda Functions]  |
|  [RDS Database]    [API Gateway]       |
+----------------------------------------+
           |           |
   [Metrics]         [Logs]
           |           |
      [CloudWatch Metrics + Logs]
           |
   +-------+-------+
   |               |
[Alarms]      [Dashboards]
   |               |
[SNS: Email,   [Visible to
 PagerDuty,     Ops Team]
 Auto Scaling]

CloudWatch Pricing

Feature       | Free Tier                      | Paid
--------------+--------------------------------+-----------------------------------
Metrics       | 10 custom metrics, 10 alarms   | $0.30/metric/month after free tier
Logs          | 5 GB ingestion, 5 GB storage   | $0.50/GB ingested
Dashboards    | 3 dashboards (50 metrics each) | $3/dashboard/month
Logs Insights | None                           | $0.005 per GB scanned
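
As a rough worked example, the rates listed above can be turned into a monthly estimate. This sketch uses only the figures from the table; it ignores alarm and log storage charges (no paid rate is listed for them) and any regional or tiered pricing, so treat the result as an approximation.

```python
def estimate_monthly_cost(custom_metrics, log_gb_ingested, dashboards, insights_gb_scanned):
    """Estimate monthly CloudWatch cost in USD from the table's rates."""
    metrics = max(custom_metrics - 10, 0) * 0.30     # first 10 custom metrics free
    logs = max(log_gb_ingested - 5, 0) * 0.50        # first 5 GB ingested free
    boards = max(dashboards - 3, 0) * 3.00           # first 3 dashboards free
    insights = insights_gb_scanned * 0.005           # no free tier listed
    return round(metrics + logs + boards + insights, 2)
```

For 25 custom metrics, 20 GB of ingested logs, 5 dashboards, and 100 GB scanned, the parts are 15 × $0.30 + 15 × $0.50 + 2 × $3 + 100 × $0.005 = $4.50 + $7.50 + $6.00 + $0.50 = $18.50.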

Summary

  • CloudWatch is the central monitoring service for AWS — it collects metrics, logs, and events.
  • Alarms trigger notifications or automated actions when metrics cross defined thresholds.
  • CloudWatch Logs stores application and service logs. Logs Insights provides query capability.
  • Dashboards give a visual overview of system health in real time.
  • EventBridge enables rule-based automation triggered by AWS events and schedules.
