GCP Cloud Monitoring and Logging
Cloud Monitoring and Cloud Logging are GCP's tools for observing what is happening inside a cloud environment. Monitoring collects metrics (numbers over time — like CPU usage, request count, error rate), while Logging collects log entries (text records of events — like "user logged in" or "database query failed"). Together, they provide complete visibility into the health and behavior of cloud resources.
Running an application without monitoring is like driving a car with no dashboard — the engine might be overheating or the fuel nearly empty, but there is no way to know until the car stops. Monitoring and logging provide that dashboard for cloud applications.
Cloud Monitoring
What is a Metric?
A metric is a numeric measurement recorded over time. GCP automatically collects hundreds of built-in metrics for every service:
| Resource | Example Metrics |
|---|---|
| Compute Engine VM | CPU utilization %, Disk read/write bytes, Network traffic |
| Cloud SQL | Database connections, Queries/sec, Disk usage |
| Cloud Run | Request count, Request latency, Instance count |
| Cloud Storage | Total bytes stored, Object count, Request count |
| GKE | Pod CPU/memory usage, Node count, Container restarts |
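Conceptually, a metric is nothing more than a series of (timestamp, value) samples that charts and alerts aggregate. A minimal sketch with invented CPU-utilization data (not real Cloud Monitoring output):

```python
# A metric is just a series of (timestamp, value) samples.
# Invented CPU-utilization data, one sample per minute (fraction of 1.0):
cpu_samples = [(0, 0.22), (60, 0.35), (120, 0.41), (180, 0.38), (240, 0.55)]

# Charts and alerting policies work on aggregates of these samples:
mean_cpu = sum(v for _, v in cpu_samples) / len(cpu_samples)
peak_cpu = max(v for _, v in cpu_samples)

print(f"mean={mean_cpu:.0%} peak={peak_cpu:.0%}")  # mean=38% peak=55%
```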
Dashboards
Metrics are visualized on dashboards in the Cloud Console. GCP provides pre-built dashboards for common services, and custom dashboards can be created to display the metrics most important to an application.
Cloud Monitoring Dashboard
┌──────────────────────────────────────────────────────────┐
│  CPU Utilization (Last 1 hour)                           │
│  80% ─────────────────────────────────┐                  │
│  60%                                  │                  │
│  40%        ╱╲        ╱╲              │                  │
│  20%   ╱╲  ╱  ╲  ╱╲  ╱  ╲             │                  │
│   0% ───╲╱────╲──╱──╲╱────╲───────────┘                  │
│       00:00     00:20     00:40     01:00                │
│                                                          │
│  Request Rate: 1,245 req/min    Error Rate: 0.02%        │
└──────────────────────────────────────────────────────────┘
Alerting Policies
An alerting policy defines conditions that, when met, trigger a notification. For example: send an email if CPU usage stays above 90% for 5 minutes.
Alerting Policy: "High CPU Alert"
  Condition: VM CPU utilization > 90% for 5 minutes
  Notification Channel: email → ops-team@company.com
Creating an alerting policy:
# Alerting policies are easiest to create via the Console:
#   Monitoring → Alerting → Create Policy
#   Select metric: compute.googleapis.com/instance/cpu/utilization
#   Threshold: > 0.90 for 5 minutes
#   Add notification channel: email
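Under the hood, a condition like this is a predicate evaluated over a rolling window of metric samples. A minimal sketch of the "CPU > 90% for 5 minutes" rule; the helper function and sample data are hypothetical, not a Cloud Monitoring API:

```python
def breaches_threshold(samples, threshold=0.90, duration=300):
    """True if every sample in the trailing `duration` seconds exceeds
    `threshold`. `samples` is a list of (unix_ts, value) pairs, sorted
    by timestamp."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    window = [v for ts, v in samples if ts > latest_ts - duration]
    return bool(window) and all(v > threshold for v in window)

# One sample per minute; the last five minutes are all above 90%.
samples = [(t * 60, v) for t, v in enumerate([0.40, 0.95, 0.93, 0.97, 0.92, 0.94])]
print(breaches_threshold(samples))  # True: the alert would fire
```

A single spike below the threshold inside the window resets the condition, which is exactly why "for 5 minutes" suppresses noisy, short-lived spikes.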
Custom Metrics
Applications can send their own custom metrics to Cloud Monitoring using the Cloud Monitoring API or a library like OpenTelemetry.
# Python — write a custom metric
from google.cloud import monitoring_v3
import time

client = monitoring_v3.MetricServiceClient()
project_name = client.common_project_path("my-project")

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/app/active_users"
series.resource.type = "global"

# Points are built explicitly and assigned as a list
# (the proto-plus classes do not support points.add()):
now = int(time.time())
interval = monitoring_v3.TimeInterval({"end_time": {"seconds": now}})
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 142}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
Cloud Logging
What is a Log Entry?
A log entry is a timestamped record of an event. Every GCP service automatically writes logs. Applications can also write their own logs.
Example Log Entry:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "severity": "ERROR",
  "resource": { "type": "gce_instance", "labels": { "instance_id": "1234567890" } },
  "textPayload": "Database connection failed: timeout after 30s",
  "logName": "projects/my-project/logs/app-logs"
}
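Because entries are plain JSON, they are straightforward to process outside of GCP as well. A small sketch that filters a batch of entries shaped like the one above (the sample entries are invented):

```python
import json

# Two invented log lines in the entry format shown above:
raw_entries = [
    '{"timestamp": "2024-01-15T10:30:45Z", "severity": "ERROR",'
    ' "textPayload": "Database connection failed: timeout after 30s"}',
    '{"timestamp": "2024-01-15T10:31:02Z", "severity": "INFO",'
    ' "textPayload": "Health check OK"}',
]

entries = [json.loads(line) for line in raw_entries]
errors = [e for e in entries if e["severity"] == "ERROR"]

for e in errors:
    print(e["timestamp"], e["textPayload"])
```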
Log Severity Levels
| Level | Description | Example |
|---|---|---|
| DEBUG | Detailed development information | "Entering function process_order()" |
| INFO | Normal operational events | "User user_001 logged in successfully" |
| WARNING | Something unexpected but not critical | "Retry attempt 2 of 3 for API call" |
| ERROR | An error occurred, request may have failed | "Failed to write to database" |
| CRITICAL | Severe error — service may be unavailable | "Out of memory — process terminated" |
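Severities are ordered, which is what makes filters like `severity >= ERROR` meaningful. A sketch of that comparison using only the five levels from the table above (Cloud Logging's full LogSeverity enum has additional levels such as NOTICE and ALERT; the helper is illustrative, not part of any GCP library):

```python
# Ordered from least to most severe (subset of Cloud Logging's levels):
SEVERITY_ORDER = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

def at_least(severity, minimum):
    # True if `severity` ranks at or above `minimum` in the ordering.
    return SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(minimum)

print(at_least("CRITICAL", "ERROR"))  # True: matches severity >= ERROR
print(at_least("WARNING", "ERROR"))   # False: filtered out
```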
Writing Application Logs
Applications running on GCP (Cloud Run, GKE, App Engine) automatically send stdout and stderr output to Cloud Logging. For structured JSON logs:
# Python — write structured logs
import json
import sys

def log(severity, message, **kwargs):
    entry = {
        "severity": severity,
        "message": message,
        **kwargs,
    }
    print(json.dumps(entry), file=sys.stdout)

log("INFO", "Order processed successfully", order_id="ORD-9001", user_id="u123")
log("ERROR", "Payment failed", order_id="ORD-9002", reason="Insufficient funds")
Logs Explorer
The Logs Explorer in the Cloud Console allows filtering, searching, and analyzing logs in real time. It uses the Logging Query Language (LQL).
-- Show all ERROR and CRITICAL logs (the time range is set with the Explorer's time picker)
severity >= ERROR

-- Show logs from a specific Cloud Run service
resource.type = "cloud_run_revision"
resource.labels.service_name = "my-app"

-- Find logs containing a specific order ID
textPayload:"ORD-9001"

-- Combine filters
resource.type = "cloud_run_revision"
severity = "ERROR"
timestamp >= "2024-01-15T10:00:00Z"
Log Sinks – Exporting Logs
By default, logs are retained for 30 days (the _Default bucket) or 400 days (the _Required bucket, which stores admin activity audit logs). For longer retention or for analysis in BigQuery, logs can be exported using Log Sinks.
Log Sink Flow:
Cloud Logging
     │
     │  Sink (filter: severity >= ERROR)
     ▼
Destination options:
  ├── Cloud Storage Bucket   (long-term archive)
  ├── BigQuery Dataset       (SQL analysis of logs)
  ├── Pub/Sub Topic          (real-time stream processing)
  └── Another GCP Project    (centralized logging)
# Create a sink to export ERROR logs to BigQuery
gcloud logging sinks create error-logs-to-bq \
  bigquery.googleapis.com/projects/my-project/datasets/app_logs \
  --log-filter='severity >= ERROR'
# Afterward, grant the sink's writer service account
# (shown in the command output) write access to the dataset.
Uptime Checks
Cloud Monitoring can periodically check if a URL or IP address is reachable and responding correctly. This is the simplest way to detect if an application goes offline.
Uptime Check: "my-app availability"
  Target: https://my-app.run.app/health
  Frequency: every 1 minute
  Locations: USA, Europe, Asia

If the check fails from 2+ locations:
  → Alert fires → Email sent to ops team
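The "fails from 2+ locations" rule guards against a single prober's network blip being mistaken for an outage. A sketch of that decision logic; purely illustrative, since Cloud Monitoring implements this server-side:

```python
def should_alert(results, min_failing=2):
    # `results` maps probe location -> True (check passed) / False (failed).
    failing = [loc for loc, ok in results.items() if not ok]
    return len(failing) >= min_failing

print(should_alert({"USA": False, "Europe": True, "Asia": True}))   # False: likely a local blip
print(should_alert({"USA": False, "Europe": False, "Asia": True}))  # True: app probably down
```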
Key Takeaways
- Cloud Monitoring collects metrics (numeric measurements) and Cloud Logging collects log entries (event records).
- Alerting policies send notifications when metric thresholds are breached.
- Applications write structured JSON logs to stdout for automatic ingestion by Cloud Logging.
- The Logs Explorer's LQL allows filtering millions of log entries in seconds.
- Log Sinks export logs to BigQuery for analysis or Cloud Storage for long-term retention.
- Uptime Checks detect when a public URL stops responding.
