GCP Architecture Best Practices

Building a working application on GCP is one thing. Building a production-ready system that is reliable, secure, scalable, and cost-efficient is another. This topic consolidates the architectural principles, design patterns, and GCP-specific best practices that experienced cloud architects apply when designing systems on Google Cloud Platform.

Google Cloud's Well-Architected Framework

Google defines five pillars for well-architected cloud systems. Every architectural decision should be evaluated against these pillars:

Pillar	Key Question	GCP Tools
Operational Excellence	Can the team operate and monitor this effectively?	Cloud Monitoring, Cloud Logging, Cloud Trace
Security	Is data and access protected at every layer?	IAM, Secret Manager, Cloud Armor, VPC SC
Reliability	Does the system recover from failures automatically?	Multi-zone VMs, Cloud SQL HA, GKE, Load Balancer
Performance Optimization	Are resources sized appropriately for the workload?	Autoscaling, Memorystore, Cloud CDN, Bigtable
Cost Optimization	Are only necessary resources being paid for?	Budget alerts, CUDs, Spot VMs, lifecycle policies

Reliability – Designing for Failure

In cloud systems, individual components fail regularly. A well-designed system expects failures and continues operating through them.

Multi-Zone Architecture

Single Zone (avoid in production):
us-central1-a
    └── VM 1 (single point of failure)
        Zone fails → Application offline ✗

Multi-Zone (production standard):
us-central1-a    us-central1-b    us-central1-c
    └── VM 1          └── VM 2          └── VM 3
         └─────────────────────────────────┘
                        │
                  Load Balancer
                  (Routes to healthy zones automatically)
Zone fails → Load balancer routes to remaining zones ✓
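
The routing behaviour in the multi-zone diagram can be sketched in a few lines of Python. This is a simulation only: the `Backend` class and `route` function are illustrative names, not a GCP API, and a real load balancer makes this decision from health check results rather than a flag set by hand.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    zone: str
    healthy: bool = True


def route(backends, request_id):
    """Pick a healthy backend round-robin; raise if none is left."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends in any zone")
    return healthy[request_id % len(healthy)]


backends = [
    Backend("vm-1", "us-central1-a"),
    Backend("vm-2", "us-central1-b"),
    Backend("vm-3", "us-central1-c"),
]

# Simulate a zone outage: vm-1 stops passing health checks.
backends[0].healthy = False

# Six requests are now spread over the two surviving zones only.
served_by = {route(backends, i).name for i in range(6)}
```

With a single-zone deployment the same outage would leave the `healthy` list empty and every request would fail.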

Multi-Region Architecture (for critical workloads)

Global Load Balancer
        │
        ├── Region: us-central1
        │       ├── Instance Group (3 VMs across 3 zones)
        │       └── Cloud SQL HA
        │
        └── Region: asia-south1 (Mumbai)
                ├── Instance Group (3 VMs across 3 zones)
                └── Cloud SQL Read Replica

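Entire region fails → Global load balancer routes all traffic to the surviving region ✓
(The Cloud SQL read replica in that region must be promoted before it can accept writes)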

Health Checks and Graceful Shutdown

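from flask import Flask

app = Flask(__name__)

def check_database_connection():
    return True  # Placeholder: replace with a cheap query such as SELECT 1

def check_redis_connection():
    return True  # Placeholder: replace with a Redis PING

# Health endpoint for the load balancer health check to poll
@app.route('/health')
def health():
    # Check the dependencies this instance needs to serve traffic
    db_ok = check_database_connection()
    cache_ok = check_redis_connection()

    if db_ok and cache_ok:
        return {"status": "healthy"}, 200
    # 503 marks the instance unhealthy so traffic is routed elsewhere
    return {"status": "unhealthy", "db": db_ok, "cache": cache_ok}, 503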
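
The heading also mentions graceful shutdown, which the endpoint above does not cover. One possible sketch, assuming the platform sends SIGTERM before stopping the instance (as container platforms commonly do): on SIGTERM the process only sets a flag, the health endpoint starts returning 503 so the load balancer stops sending new requests, and in-flight requests are left to finish. `DrainState` is a hypothetical helper name, not part of Flask or GCP.

```python
import signal
import threading


class DrainState:
    """Tracks whether the process has been asked to shut down."""

    def __init__(self):
        self._draining = threading.Event()

    def handle_sigterm(self, signum, frame):
        # Only set a flag here; in-flight requests finish normally.
        self._draining.set()

    def health_response(self):
        # Report 503 while draining so the load balancer stops
        # sending new requests to this instance.
        if self._draining.is_set():
            return {"status": "draining"}, 503
        return {"status": "healthy"}, 200


state = DrainState()

# Signal handlers can only be installed from the main thread.
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, state.handle_sigterm)
```

A `/health` route would return `state.health_response()` once its dependency checks pass.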

Security – Defence in Depth

Security is not a single control — it is multiple overlapping layers. If one layer is bypassed, others still protect the system.

Defence in Depth Layers:
┌────────────────────────────────────────────────────────┐
│ Layer 1: Network Perimeter                             │
│   Cloud Armor (block DDoS, SQL injection)              │
│   VPC Firewall rules (restrict port access)            │
│   No public IPs on databases/internal services         │
├────────────────────────────────────────────────────────┤
│ Layer 2: Identity and Access                           │
│   IAM least privilege (minimum permissions)            │
│   Service accounts for applications (not user accounts)│
│   2-Step Verification for all human accounts           │
├────────────────────────────────────────────────────────┤
│ Layer 3: Data Protection                               │
│   Encryption at rest (automatic in GCP)                │
│   Secret Manager (no credentials in code)              │
│   HTTPS everywhere (TLS in transit)                    │
├────────────────────────────────────────────────────────┤
│ Layer 4: Detection and Response                        │
│   Security Command Center (vulnerability scanning)     │
│   Cloud Logging (audit all access)                     │
│   Budget alerts (detect unusual spending patterns)     │
└────────────────────────────────────────────────────────┘

Security Checklist for Every GCP Project

Enable organization policies to prevent public Cloud Storage buckets
Use VPC Service Controls for projects handling sensitive data
Rotate service account keys, or better, avoid keys entirely by using Workload Identity Federation
Enable Cloud Audit Logs for all admin and data access events
Set up Security Command Center and review findings weekly
Use Binary Authorization to prevent unsigned container images on GKE
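
Parts of this checklist can be checked automatically. The sketch below scans an IAM policy in the JSON shape that `gcloud projects get-iam-policy --format=json` returns (a `bindings` list of `role` and `members`). It flags public principals and broad basic roles granted to service accounts. The function name, the choice of "broad" roles, and the example policy are assumptions made for illustration.

```python
BROAD_ROLES = {"roles/owner", "roles/editor"}
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def find_risky_bindings(policy):
    """Return (role, member, reason) tuples worth a manual review."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding["role"]
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                findings.append((role, member, "public access"))
            elif role in BROAD_ROLES and member.startswith("serviceAccount:"):
                findings.append((role, member, "broad role on service account"))
    return findings


# Hypothetical policy for illustration only.
example_policy = {
    "bindings": [
        {"role": "roles/editor",
         "members": ["serviceAccount:app@example-project.iam.gserviceaccount.com"]},
        {"role": "roles/storage.objectViewer",
         "members": ["allUsers"]},
        {"role": "roles/cloudsql.client",
         "members": ["serviceAccount:api@example-project.iam.gserviceaccount.com"]},
    ]
}

findings = find_risky_bindings(example_policy)
```

Here the first two bindings are flagged and the narrowly scoped `roles/cloudsql.client` binding is not.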

Performance – Caching and Latency Reduction

Performance Optimization Stack:
┌──────────────────────────────────────────────────────────┐
│  Layer 1: CDN (Cloud CDN)                                │
│  Cache static assets at edge locations near users        │
├──────────────────────────────────────────────────────────┤
│  Layer 2: Application Cache (Memorystore Redis)          │
│  Cache database query results — avoid hitting DB         │
├──────────────────────────────────────────────────────────┤
│  Layer 3: Read Replicas (Cloud SQL / Spanner)            │
│  Route read-heavy queries to replicas — offload primary  │
├──────────────────────────────────────────────────────────┤
│  Layer 4: Async Processing (Pub/Sub / Cloud Tasks)       │
│  Move non-critical work out of the request path          │
├──────────────────────────────────────────────────────────┤
│  Layer 5: Resource Right-Sizing                          │
│  Match machine type to actual workload profile           │
└──────────────────────────────────────────────────────────┘

Scalability – Building Systems That Grow

Stateless Application Design

Stateless applications keep no session data in instance memory; session state lives in an external store such as Redis. Each request is self-contained, so the system can scale horizontally by adding instances that share no state with each other.

Stateful (bad for scaling):
VM 1: Session for User A in memory
VM 2: No session for User A
→ User A must always be routed to VM 1 (session affinity required)

Stateless (good for scaling):
VM 1: Reads session from Redis on each request
VM 2: Reads session from Redis on each request
→ User A can be served by any VM (any instance can handle any request ✓)
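
A minimal sketch of the stateless variant follows. `SessionStore` is an in-memory stand-in for Redis, and `AppInstance` represents one VM; both names are hypothetical. Sessions are serialized to JSON on every save to mimic the fact that an external store holds bytes, not live Python objects.

```python
import json


class SessionStore:
    """Stand-in for an external session store shared by all instances."""

    def __init__(self):
        self._blobs = {}

    def load(self, session_id):
        blob = self._blobs.get(session_id)
        return json.loads(blob) if blob else {}

    def save(self, session_id, session):
        self._blobs[session_id] = json.dumps(session)


class AppInstance:
    """One application instance; it holds no session data of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, item=None):
        session = self.store.load(session_id)
        cart = session.setdefault("cart", [])
        if item is not None:
            cart.append(item)
        self.store.save(session_id, session)
        return {"served_by": self.name, "cart": cart}


store = SessionStore()
vm1 = AppInstance("vm-1", store)
vm2 = AppInstance("vm-2", store)

vm1.handle("user-a", "book")           # first request lands on VM 1
result = vm2.handle("user-a", "pen")   # next request lands on VM 2
```

Because VM 2 reads the cart that VM 1 wrote, no session affinity is needed and either instance can be removed at any time.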

Event-Driven Architecture

Synchronous call chain (tight coupling):
Order Service → Inventory → Billing → Shipping → Notification
(One failure cascades through the entire chain)

Event-Driven (loose coupling):
Order Service → Pub/Sub Topic: "orders"
                    │
                    ├──▶ Inventory Service (subscribes)
                    ├──▶ Billing Service (subscribes)
                    ├──▶ Shipping Service (subscribes)
                    └──▶ Notification Service (subscribes)
(Each service fails independently — others keep running ✓)
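
The failure isolation in the event-driven diagram can be illustrated with an in-memory fan-out. This is a simplified sketch: `Topic` is a hypothetical stand-in, and real Pub/Sub delivers to each subscription asynchronously rather than in a synchronous loop. The point it shows is that one failing subscriber does not stop delivery to the others.

```python
class Topic:
    """In-memory stand-in for a Pub/Sub topic with fan-out delivery."""

    def __init__(self, name):
        self.name = name
        self._subscribers = {}

    def subscribe(self, subscriber_name, handler):
        self._subscribers[subscriber_name] = handler

    def publish(self, message):
        """Deliver to every subscriber; return the names that failed."""
        failed = []
        for subscriber_name, handler in self._subscribers.items():
            try:
                handler(message)
            except Exception:
                # A real subscription would retry or dead-letter the message.
                failed.append(subscriber_name)
        return failed


shipped = []


def billing_handler(message):
    # Simulated outage in one downstream service.
    raise ConnectionError("billing database unavailable")


def shipping_handler(message):
    shipped.append(message["order_id"])


orders = Topic("orders")
orders.subscribe("billing", billing_handler)
orders.subscribe("shipping", shipping_handler)

failed = orders.publish({"order_id": "o-1001"})
```

The order service only publishes and never waits on billing or shipping, which is what breaks the cascade.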

Observability – What Cannot Be Measured Cannot Be Improved

Every production system needs three types of observability:

Type	Tool	What It Answers
Metrics	Cloud Monitoring	"Is the system healthy? What are the numbers?"
Logs	Cloud Logging	"What exactly happened? When and for which user?"
Traces	Cloud Trace	"Where is the latency? Which service is slow?"

Service Level Objectives (SLOs)

An SLO defines the reliability target for a service. It is measured using Service Level Indicators (SLIs) — metrics that represent the user experience.

SLI: Request success rate = (successful requests / total requests) × 100%
SLO: Request success rate ≥ 99.9% over any 30-day window
SLA: Contractual commitment to the customer; if the agreed target is missed, the customer receives a service credit

Example SLOs for a web application:
├── Availability: 99.9% uptime (≤ 43.2 minutes downtime per 30-day window)
├── Latency:      95% of requests respond within 200ms
└── Error rate:   Less than 0.1% of requests return 5xx errors
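
The arithmetic behind these targets is worth making explicit. The sketch below computes the SLI, the downtime a given SLO allows, and how much of the error budget remains. The function names are illustrative. For 99.9%, a 30-day window allows about 43.2 minutes of downtime, while an average calendar month (365/12 days) allows about 43.8 minutes.

```python
def success_rate(successful, total):
    """SLI: percentage of requests that succeeded."""
    if total == 0:
        return 100.0
    return successful / total * 100


def allowed_downtime_minutes(slo_percent, window_days):
    """Downtime the SLO permits over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def error_budget_remaining(slo_percent, successful, total):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    if total == 0:
        return 1.0
    allowed_failures = total * (1 - slo_percent / 100)
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures


downtime_30d = allowed_downtime_minutes(99.9, 30)
downtime_month = allowed_downtime_minutes(99.9, 365 / 12)
budget_left = error_budget_remaining(99.9, 999_500, 1_000_000)
```

In the last call, 500 failures out of an allowed 1,000 leave roughly half of the error budget for the rest of the window.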

Disaster Recovery Planning

Metric	Definition	Example Target
RPO (Recovery Point Objective)	Maximum acceptable data loss (how old can the backup be?)	RPO = 1 hour (backup every hour)
RTO (Recovery Time Objective)	Maximum time to restore service after a failure	RTO = 15 minutes (an HA failover of about 60 seconds fits well within this)

DR Strategy Examples on GCP:
┌──────────────────────────────────────────────────────────────┐
│ Tier 1 — Active-Active (RTO: seconds, RPO: seconds)          │
│   Multi-region load balancer + Spanner/Bigtable global       │
│   Cost: Highest                                              │
├──────────────────────────────────────────────────────────────┤
│ Tier 2 — Active-Passive (RTO: minutes, RPO: minutes)         │
│   Primary region + warm standby + cross-region SQL replica   │
│   Cost: Medium                                               │
├──────────────────────────────────────────────────────────────┤
│ Tier 3 — Backup and Restore (RTO: hours, RPO: hours)         │
│   Regular Cloud SQL exports to Cloud Storage + GCS backups   │
│   Cost: Lowest                                               │
└──────────────────────────────────────────────────────────────┘
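
One way to turn RTO and RPO targets into a tier choice is sketched below. The cut-offs (under one minute for Tier 1, under one hour for Tier 2) are assumptions chosen to mirror the seconds/minutes/hours labels in the diagram, not official thresholds, and the function names are illustrative.

```python
from datetime import timedelta


def meets_rpo(backup_interval, rpo):
    """Worst-case data loss equals the time since the last backup."""
    return backup_interval <= rpo


def choose_dr_tier(rto, rpo):
    """Map the stricter of the two targets to a DR tier."""
    tightest = min(rto, rpo)
    if tightest < timedelta(minutes=1):
        return "Tier 1: Active-Active"
    if tightest < timedelta(hours=1):
        return "Tier 2: Active-Passive"
    return "Tier 3: Backup and Restore"


# Targets from the table above: RTO = 15 minutes, RPO = 1 hour.
tier = choose_dr_tier(rto=timedelta(minutes=15), rpo=timedelta(hours=1))
hourly_ok = meets_rpo(timedelta(hours=1), rpo=timedelta(hours=1))
daily_ok = meets_rpo(timedelta(hours=24), rpo=timedelta(hours=1))
```

With these targets the 15-minute RTO drives the choice, and a daily backup schedule would not satisfy the 1-hour RPO.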

Reference Architecture – Production Web Application

Internet Users
        │
        ▼
Cloud DNS (domain → global IP)
        │
        ▼
Cloud CDN + Cloud Armor (cache + WAF/DDoS)
        │
        ▼
Global External Application Load Balancer (HTTPS, SSL termination)
        │
        ├── /api/*  →  Cloud Run (stateless API service, autoscales 0→100)
        │               │  Reads from Memorystore Redis (cache)
        │               │  Writes to Cloud SQL HA (primary + standby in separate us-central1 zones)
        │               └  Events → Pub/Sub → Cloud Functions (async tasks)
        │
        └── /*  →  Cloud Storage Bucket (static site — React/Vue build)

Observability:
├── Cloud Monitoring (dashboards, alerts)
├── Cloud Logging (structured logs from Cloud Run)
└── Cloud Trace (distributed request tracing)

Security:
├── Secret Manager (DB passwords, API keys)
├── IAM (least privilege for all service accounts)
└── VPC private networking (Cloud SQL on private IP only)

CI/CD:
├── GitHub → Cloud Build trigger on main branch
├── Cloud Build: test → build container → push to Artifact Registry
└── Cloud Build: deploy to Cloud Run (zero-downtime rolling update)
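
For the "structured logs from Cloud Run" item in the observability list above, a common approach is to write one JSON object per line to stdout, where Cloud Logging recognises fields such as `severity` and `message`. The sketch below assumes that setup. The `log` helper and the extra fields (`order_id`, `latency_ms`) are hypothetical examples.

```python
import json
import sys


def log(severity, message, **fields):
    """Emit one JSON log line on stdout and return it."""
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line


line = log("ERROR", "payment failed", order_id="o-123", latency_ms=812)
```

Structured fields can then be filtered and aggregated in Cloud Logging instead of being parsed out of free-form text.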

Key Takeaways

Apply all five pillars of the Well-Architected Framework: Operational Excellence, Security, Reliability, Performance, and Cost.
Deploy across multiple zones and regions to eliminate single points of failure.
Use defence in depth — multiple security layers protect against the failure of any single control.
Design stateless applications so any instance can serve any request, enabling horizontal scaling.
Event-driven architectures using Pub/Sub decouple services and prevent cascading failures.
Define SLOs with measurable SLIs to objectively track and communicate system reliability.
Match DR strategy (active-active, active-passive, backup-restore) to the business's RTO and RPO requirements.
