GCP Cloud Monitoring and Logging

Cloud Monitoring and Cloud Logging are GCP's tools for observing what is happening inside a cloud environment. Monitoring collects metrics (numbers over time — like CPU usage, request count, error rate), while Logging collects log entries (text records of events — like "user logged in" or "database query failed"). Together, they provide complete visibility into the health and behavior of cloud resources.

Running an application without monitoring is like driving a car with no dashboard — the engine might be overheating or the fuel nearly empty, but there is no way to know until the car stops. Monitoring and logging provide that dashboard for cloud applications.

Cloud Monitoring

What is a Metric?

A metric is a numeric measurement recorded over time. GCP automatically collects hundreds of built-in metrics for every service:

Resource            Example Metrics
Compute Engine VM   CPU utilization %, disk read/write bytes, network traffic
Cloud SQL           Database connections, queries/sec, disk usage
Cloud Run           Request count, request latency, instance count
Cloud Storage       Total bytes stored, object count, request count
GKE                 Pod CPU/memory usage, node count, container restarts

Dashboards

Metrics are visualized on dashboards in the Cloud Console. GCP provides pre-built dashboards for common services, and custom dashboards can be created to display the metrics most important to an application.

Cloud Monitoring Dashboard
┌──────────────────────────────────────────────────────────┐
│  CPU Utilization (Last 1 hour)                           │
│  80% ─────────────────────────────────┐                  │
│  60%                                  │                  │
│  40%               ╱╲          ╱╲     │                  │
│  20%          ╱╲  ╱  ╲    ╱╲  ╱  ╲    │                  │
│   0% ────────╱──╲╱────╲──╱──╲╱────╲───┘                  │
│       00:00       00:20       00:40      01:00           │
│                                                          │
│  Request Rate: 1,245 req/min  Error Rate: 0.02%          │
└──────────────────────────────────────────────────────────┘

Alerting Policies

An alerting policy defines conditions that, when met, trigger a notification. For example: send an email if CPU usage stays above 90% for 5 minutes.

Alerting Policy: "High CPU Alert"
Condition: VM CPU utilization > 90% for 5 minutes
Notification Channel: email → ops-team@company.com
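
The same policy can also be expressed declaratively. A minimal sketch of an alert-policy JSON file — field names follow the Monitoring API's AlertPolicy resource, while the project ID and notification channel ID here are placeholders:

```json
{
  "displayName": "High CPU Alert",
  "combiner": "OR",
  "conditions": [{
    "displayName": "VM CPU above 90%",
    "conditionThreshold": {
      "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
      "comparison": "COMPARISON_GT",
      "thresholdValue": 0.9,
      "duration": "300s"
    }
  }],
  "notificationChannels": ["projects/my-project/notificationChannels/CHANNEL_ID"]
}
```

Note that the threshold is 0.9, not 90: the CPU utilization metric is reported as a fraction between 0 and 1.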

Alerting policies are most easily created through the Console wizard:

# Monitoring → Alerting → Create Policy
# Select metric: compute.googleapis.com/instance/cpu/utilization
# Threshold: > 0.90 for 5 minutes
# Add notification channel: email

Custom Metrics

Applications can send their own custom metrics to Cloud Monitoring using the Cloud Monitoring API or a library like OpenTelemetry.

# Python — write a custom metric (requires google-cloud-monitoring)
from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/app/active_users"
series.resource.type = "global"

# Each data point needs a value and an end time for its interval
interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 142}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])

Cloud Logging

What is a Log Entry?

A log entry is a timestamped record of an event. Every GCP service automatically writes logs. Applications can also write their own logs.

Example Log Entry:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "severity": "ERROR",
  "resource": { "type": "gce_instance", "labels": { "instance_id": "1234567890" } },
  "textPayload": "Database connection failed: timeout after 30s",
  "logName": "projects/my-project/logs/app-logs"
}

Log Severity Levels

Level      Description                                      Example
DEBUG      Detailed development information                 "Entering function process_order()"
INFO       Normal operational events                        "User user_001 logged in successfully"
WARNING    Something unexpected but not critical            "Retry attempt 2 of 3 for API call"
ERROR      An error occurred; the request may have failed   "Failed to write to database"
CRITICAL   Severe error; the service may be unavailable     "Out of memory — process terminated"
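
Each severity also has a numeric rank in the Logging API's LogSeverity enum, which is what makes ordered filters like `severity >= ERROR` possible. A stdlib-only sketch of that comparison (the sample entries are invented for illustration):

```python
# Numeric ranks from the Cloud Logging LogSeverity enum
SEVERITY_RANK = {
    "DEBUG": 100,
    "INFO": 200,
    "WARNING": 400,
    "ERROR": 500,
    "CRITICAL": 600,
}

def at_least(entries, threshold):
    """Keep entries whose severity is at or above the given level."""
    floor = SEVERITY_RANK[threshold]
    return [e for e in entries if SEVERITY_RANK[e["severity"]] >= floor]

entries = [
    {"severity": "INFO", "message": "User logged in"},
    {"severity": "ERROR", "message": "Database connection failed"},
    {"severity": "CRITICAL", "message": "Out of memory"},
]

print(at_least(entries, "ERROR"))
# Keeps only the ERROR and CRITICAL entries
```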

Writing Application Logs

Applications running on GCP (Cloud Run, GKE, App Engine) automatically have their stdout and stderr output sent to Cloud Logging. If a printed line is a single JSON object, Cloud Logging parses it into a structured entry and recognizes special fields such as severity and message:

# Python — write structured logs
import json
import sys

def log(severity, message, **kwargs):
    entry = {
        "severity": severity,
        "message": message,
        **kwargs
    }
    print(json.dumps(entry), file=sys.stdout)

log("INFO",  "Order processed successfully", order_id="ORD-9001", user_id="u123")
log("ERROR", "Payment failed",               order_id="ORD-9002", reason="Insufficient funds")

Log Explorer

The Log Explorer in the Cloud Console allows filtering, searching, and analyzing logs in real time. It uses a query language called Logging Query Language (LQL).

-- Show all ERROR and CRITICAL logs (the time range is set in the Explorer UI)
severity >= ERROR

-- Show logs from a specific Cloud Run service
resource.type = "cloud_run_revision"
resource.labels.service_name = "my-app"

-- Find logs containing a specific order ID
textPayload: "ORD-9001"

-- Combine filters
resource.type = "cloud_run_revision"
severity = "ERROR"
timestamp >= "2024-01-15T10:00:00Z"
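
Filters like these are often assembled programmatically, for example by a script that checks the last hour of logs. A minimal sketch using only the standard library — the service name and lookback window are illustrative:

```python
from datetime import datetime, timedelta, timezone

def error_filter(service_name, lookback_minutes=60):
    """Build a Logging query for recent ERROR+ logs of one Cloud Run service."""
    since = datetime.now(timezone.utc) - timedelta(minutes=lookback_minutes)
    return "\n".join([
        'resource.type = "cloud_run_revision"',
        f'resource.labels.service_name = "{service_name}"',
        'severity >= ERROR',
        f'timestamp >= "{since.strftime("%Y-%m-%dT%H:%M:%SZ")}"',
    ])

print(error_filter("my-app"))
```

The resulting string can be pasted into Log Explorer or passed as the filter when listing entries through the Logging API.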

Log Sinks – Exporting Logs

By default, logs are retained for 30 days (_Default bucket) or 400 days (_Required bucket). For longer retention or analysis in BigQuery, logs can be exported using Log Sinks.

Log Sink Flow:
Cloud Logging
    │
    │ Sink (filter: severity >= ERROR)
    ▼
Destination options:
├── Cloud Storage bucket   (long-term archive)
├── BigQuery dataset       (SQL analysis of logs)
├── Pub/Sub topic          (real-time stream processing)
└── Another GCP project    (centralized logging)

# Create a sink to export ERROR logs to BigQuery
gcloud logging sinks create error-logs-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/app_logs \
  --log-filter='severity >= ERROR'

Uptime Checks

Cloud Monitoring can periodically check if a URL or IP address is reachable and responding correctly. This is the simplest way to detect if an application goes offline.

Uptime Check: "my-app availability"
Target: https://my-app.run.app/health
Frequency: every 1 minute
Locations: USA, Europe, Asia

If the check fails from 2+ locations:
→ Alert fires → Email sent to ops team

Key Takeaways

  • Cloud Monitoring collects metrics (numeric measurements) and Cloud Logging collects log entries (event records).
  • Alerting policies send notifications when metric thresholds are breached.
  • Applications write structured JSON logs to stdout for automatic ingestion by Cloud Logging.
  • Log Explorer's LQL allows filtering millions of log entries in seconds.
  • Log Sinks export logs to BigQuery for analysis or Cloud Storage for long-term retention.
  • Uptime Checks detect when a public URL stops responding.
