Microservices Service Mesh

As a microservices system grows, every service needs the same set of capabilities: encrypted communication, retries, timeouts, circuit breakers, load balancing, and observability. Embedding this logic in every service means writing and maintaining the same infrastructure code many times over. A service mesh moves this logic out of the services and into a dedicated layer of network proxies that operators configure centrally.

The Repeated Code Problem

WITHOUT A SERVICE MESH
=======================
Order Service code:
  - Business logic (create order, validate items)     [the real work]
  - Retry logic (retry failed calls 3 times)          [infrastructure]
  - Circuit breaker (stop calling broken services)    [infrastructure]
  - mTLS setup (encrypt service-to-service traffic)   [infrastructure]
  - Metrics collection (track request counts)         [infrastructure]
  - Timeout handling (give up after 2 seconds)        [infrastructure]

Payment Service code:
  - Business logic (charge card, issue refund)        [the real work]
  - Retry logic                                       [infrastructure, again]
  - Circuit breaker                                   [infrastructure, again]
  - mTLS setup                                        [infrastructure, again]
  - Metrics collection                                [infrastructure, again]
  - Timeout handling                                  [infrastructure, again]

50 services = 50 copies of the same infrastructure code.
A bug in retry logic requires fixing 50 services.

WITH A SERVICE MESH
====================
Order Service code:    business logic only
Payment Service code:  business logic only

Service Mesh handles: retry, circuit breaker, mTLS, metrics, timeouts
                      for ALL services automatically.
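
To make the "without a mesh" case concrete, here is a minimal sketch of the kind of retry helper each service would otherwise carry. The name `call_with_retry`, the choice of `ConnectionError` as the retryable failure, and reading "3 times" as three total attempts are all assumptions made for illustration.

```python
import time


def call_with_retry(operation, max_attempts=3, delay_seconds=0.0):
    """Run operation(); on ConnectionError, try again up to max_attempts total.

    Hypothetical helper: without a mesh, every service would need its own
    copy of something like this.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

With a mesh, a helper like this disappears from the service code, and the retry count becomes a setting in the mesh configuration.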

How a Service Mesh Works: The Sidecar Proxy

A service mesh injects a small proxy, called a sidecar, next to every service container. All network traffic entering and leaving the service passes through this sidecar, which handles the infrastructure concerns. The service itself sends and receives plain HTTP as if the mesh did not exist.

SIDECAR PROXY PATTERN
=======================
Without Mesh:
[Order Service] ----network----> [Payment Service]

With Mesh:
[Order Service] --> [Sidecar A] ----network----> [Sidecar B] --> [Payment Service]

Sidecar A handles:              Sidecar B handles:
- Encrypt outgoing traffic      - Decrypt incoming traffic
- Apply retry policy            - Collect metrics
- Apply timeout                 - Apply rate limiting
- Collect outgoing metrics      - Verify caller identity (mTLS)

Order Service and Payment Service know nothing about any of this.
They communicate as if talking directly to each other.

Think of the sidecar as a personal security and communications officer attached to each service. The service focuses on its job. The officer handles all external communication rules.
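
The sidecar pattern can be sketched in a few lines of Python. This is a toy in-process model, not how a real proxy works: the `Sidecar` class, its `send` and `receive` methods, and the service names are hypothetical. It only shows that the service handler sees the plain payload while the sidecars add identity and count requests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Sidecar:
    """Hypothetical stand-in for the proxy that sits next to one service."""
    service_name: str
    handler: Callable[[dict], dict]  # the local service's business logic
    metrics: Dict[str, int] = field(
        default_factory=lambda: {"requests_out": 0, "requests_in": 0}
    )

    def send(self, destination: "Sidecar", payload: dict) -> dict:
        # Outbound side: count the call and attach the caller's identity.
        self.metrics["requests_out"] += 1
        envelope = {"caller": self.service_name, "payload": payload}
        return destination.receive(envelope)

    def receive(self, envelope: dict) -> dict:
        # Inbound side: count the call, then hand only the payload to the service.
        self.metrics["requests_in"] += 1
        return self.handler(envelope["payload"])


def order_handler(payload: dict) -> dict:
    return {"status": "order accepted"}


def payment_handler(payload: dict) -> dict:
    # Business logic only: no retries, metrics, or identity checks here.
    return {"status": "charged", "amount": payload["amount"]}


order_sidecar = Sidecar("order-service", order_handler)
payment_sidecar = Sidecar("payment-service", payment_handler)
response = order_sidecar.send(payment_sidecar, {"amount": 25})
```

Note that `payment_handler` never touches the envelope, which mirrors the claim that the services know nothing about the mesh.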

Control Plane and Data Plane

A service mesh has two layers:

CONTROL PLANE
=============
The brain of the mesh. Operators configure policies here:
  - "All traffic must use mTLS"
  - "Retry failed calls up to 3 times"
  - "Route 10% of traffic to v2 of Payment Service"
  - "Block calls from Analytics to Payment Service"

Pushes configuration to all sidecars automatically.

DATA PLANE
==========
The sidecars. They do the actual work:
  - Enforce the policies received from the control plane
  - Handle real network traffic
  - Collect metrics and traces

Operators configure the control plane.
The data plane executes those configurations in real time.
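
The split between the two planes can be illustrated with a small sketch. The classes `ControlPlane` and `SidecarProxy` and the policy keys are hypothetical. Real meshes distribute configuration over the network, but the shape is the same: operators change one place, and every sidecar receives the update.

```python
class SidecarProxy:
    """Data plane: holds the latest policy and would enforce it on traffic."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.policy = {}

    def apply_config(self, policy):
        # Keep a private copy so later control-plane edits arrive only via push.
        self.policy = dict(policy)


class ControlPlane:
    """Control plane: stores operator policy and pushes it to every sidecar."""

    def __init__(self):
        self.sidecars = []
        self.policy = {}

    def register(self, sidecar):
        self.sidecars.append(sidecar)
        sidecar.apply_config(self.policy)  # new sidecars start with current policy

    def set_policy(self, key, value):
        self.policy[key] = value
        for sidecar in self.sidecars:
            sidecar.apply_config(self.policy)


control_plane = ControlPlane()
order_proxy = SidecarProxy("order-service")
payment_proxy = SidecarProxy("payment-service")
control_plane.register(order_proxy)
control_plane.register(payment_proxy)

# One operator action reaches every sidecar.
control_plane.set_policy("mtls_required", True)
control_plane.set_policy("max_retries", 3)
```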

Key Capabilities of a Service Mesh

Traffic Management

The mesh controls how requests flow between services. This enables powerful deployment strategies without changing application code.

CANARY DEPLOYMENT VIA MESH
===========================
Payment Service v1.0 runs. You want to test v2.0 safely.

Mesh configuration:
  90% of requests --> Payment Service v1.0
  10% of requests --> Payment Service v2.0

Monitor v2.0 for errors.
Gradually shift: 50/50, then 90% v2.0, then 100% v2.0.
If v2.0 shows errors: shift 100% back to v1.0 in seconds.
No redeployment. Just a configuration change in the mesh.
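
The 90/10 split above amounts to a weighted random choice per request. The sketch below assumes a hypothetical `pick_version` function and version labels `v1` and `v2`. A real mesh makes this decision inside the proxy, driven by routing configuration.

```python
import random
from collections import Counter


def pick_version(weights, rng):
    """Choose a backend version in proportion to its configured weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


rng = random.Random(42)  # seeded so the sketch is repeatable

# Canary phase: 90% of requests go to v1, 10% to v2.
canary_weights = {"v1": 90, "v2": 10}
canary_counts = Counter(pick_version(canary_weights, rng) for _ in range(10_000))

# Rollback: a configuration change only, no redeployment.
rollback_weights = {"v1": 100, "v2": 0}
rollback_counts = Counter(pick_version(rollback_weights, rng) for _ in range(1_000))
```

Shifting traffic to 50/50 or rolling back means changing only the weights dictionary, which mirrors the "just a configuration change" point above.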

Mutual TLS (mTLS)

The mesh issues certificates to every service and enforces encrypted, authenticated connections between all services automatically. Zero code changes in your services.

mTLS IN A MESH
===============
Operator sets policy: "All internal traffic must use mTLS"

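Sidecar A (Order Service):
  - Intercepts the outgoing request
  - Opens a TLS connection to Sidecar B and encrypts the traffic
  - Presents its certificate as identity proof: "I am Order Service"
  - Verifies the certificate presented by Sidecar B

Sidecar B (Payment Service):
  - Receives the incoming connection
  - Presents its own certificate: "I am Payment Service"
  - Verifies the Order Service certificate
  - Decrypts the traffic
  - Passes the plain request to Payment Service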
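
For a sense of what the sidecars take over, here is a rough sketch of the TLS settings a service would have to manage itself, using Python's standard `ssl` module. The function names and the file-path parameters are hypothetical, and the certificate files are optional here so the sketch runs without real certificates. In a mesh, the certificates are issued and rotated for you.

```python
import ssl


def build_server_context(cert_file=None, key_file=None, ca_file=None):
    """TLS settings for the receiving side (Sidecar B in the diagram)."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # The "mutual" part: the server refuses clients that present no valid certificate.
    context.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)  # own identity
    if ca_file:
        context.load_verify_locations(cafile=ca_file)  # CA that signs caller certs
    return context


def build_client_context(cert_file=None, key_file=None, ca_file=None):
    """TLS settings for the calling side (Sidecar A in the diagram)."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_file and key_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)  # client identity
    if ca_file:
        context.load_verify_locations(cafile=ca_file)
    return context


server_context = build_server_context()
client_context = build_client_context()
```

Both sides verify the other's certificate, which is the difference between mTLS and ordinary one-way TLS.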

Observability

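Because all traffic passes through sidecars, the mesh can collect metrics, access logs, and trace spans for every service-to-service call without developers adding instrumentation to their code. One caveat: to join those spans into a complete end-to-end trace, services usually still need to copy the trace headers from each incoming request onto the outgoing calls it triggers.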

AUTO-COLLECTED METRICS (no code required)
==========================================
For every service pair (A calls B):
  - Request rate
  - Error rate
  - Latency (p50, p95, p99)
  - Connection count

Dashboards (for example, in Grafana) can show this data for every service
without a single line of monitoring code in any service.
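
The metrics listed above are simple aggregates over the calls a sidecar has observed. The sketch below assumes each observed call is recorded as a `(latency_ms, status_code)` pair, counts 5xx responses as errors, and uses the nearest-rank method for percentiles. The function names are hypothetical, and real meshes typically export histograms rather than raw samples.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    # Integer form of ceil(pct * n / 100), avoiding floating-point rounding.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[rank - 1]


def summarize(calls):
    """Aggregate (latency_ms, status_code) pairs for one service pair."""
    if not calls:
        raise ValueError("need at least one observed call")
    latencies = sorted(latency for latency, _ in calls)
    errors = sum(1 for _, status in calls if status >= 500)
    return {
        "requests": len(calls),
        "error_rate": errors / len(calls),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }


# 100 observed calls with latencies of 1..100 ms; the two slowest calls failed.
observed = [(ms, 500 if ms > 98 else 200) for ms in range(1, 101)]
summary = summarize(observed)
```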

Policy Enforcement

Operators define which services may call which. The sidecars enforce these rules using the caller identity established by mTLS, so a compromised service cannot reach services it is not authorized to call, even if it tries.

ACCESS POLICY
==============
ALLOW:
  Order Service --> Payment Service
  Order Service --> Inventory Service

DENY:
  Analytics Service --> Payment Service
  Any external call --> Internal database services

If Analytics Service (compromised) tries to call Payment Service:
  Payment Service's sidecar rejects the request before it reaches the application.
  The Payment Service code never sees the attempt.
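
The access policy above behaves like an allow-list with a default of deny. In this sketch the names `ALLOWED_CALLS`, `authorize`, and `handle_inbound` are hypothetical, and the caller identity is passed in as a plain string. In a real mesh it would come from the verified mTLS certificate.

```python
# Only the pairs listed here may communicate; everything else is denied.
ALLOWED_CALLS = {
    ("order-service", "payment-service"),
    ("order-service", "inventory-service"),
}


def authorize(caller, destination):
    """Return True only if the (caller, destination) pair is explicitly allowed."""
    return (caller, destination) in ALLOWED_CALLS


def handle_inbound(caller, destination, forward):
    """What the receiving sidecar does before the service sees a request."""
    if not authorize(caller, destination):
        # Rejected in the proxy: forward() is never called.
        return {"status": 403, "body": "denied by mesh policy"}
    return forward()
```

Because the check runs in the sidecar, the `forward` callable that stands in for the service is never invoked for a denied caller.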

Popular Service Mesh Tools

Istio: the most feature-rich option. Built on the Envoy proxy. Used widely in large production systems. Has a steeper learning curve.
Linkerd: lightweight, simpler to operate, excellent performance. Built specifically for Kubernetes.
Consul Connect: part of HashiCorp Consul. Works across multiple platforms, not just Kubernetes.
AWS App Mesh: managed mesh for applications running on AWS. AWS has announced end of support for App Mesh (September 30, 2026), so check its status before adopting it.

When to Use a Service Mesh

USE A SERVICE MESH WHEN:
  - You have 10+ services with complex inter-service communication
  - You need mTLS across all services without changing each service
  - You want centralized traffic policies and observability
  - Your teams lack the capacity to implement resilience patterns in each service

DO NOT ADD A SERVICE MESH WHEN:
  - You have fewer than 5-10 services (overhead exceeds benefit)
  - Your team is still learning basic microservices concepts
  - Your organization has not yet mastered Kubernetes
  - You want fast initial delivery (a mesh adds operational complexity)

A service mesh is powerful infrastructure, but it is not a starting point. Teams typically add one after the system has grown to the point where maintaining resilience code in every service costs more than learning and running the mesh.
