DevOps Advanced Observability and OpenTelemetry
As applications grow into dozens of microservices, understanding what happens during a single user request becomes extremely difficult. A slow checkout might involve the API gateway, authentication service, product service, cart service, payment service, and notification service — all in sequence. Finding where the delay or error occurred requires distributed tracing.
OpenTelemetry (OTel) is the open standard for collecting observability data — traces, metrics, and logs — in a vendor-neutral way. It is the most significant development in observability in recent years, replacing fragmented, vendor-specific instrumentation with a single, unified approach.
The Problem with Distributed Systems Observability
In a monolith, a stack trace pinpoints the exact line of code that caused an error. In microservices:
- A request passes through 8 different services, each with its own logs.
- Logs across services have no shared identifier to connect them.
- Metrics show that response time is high, but not where the time is spent.
- An error in Service G might have been caused by a slow response from Service B three hops earlier.
Distributed tracing solves this by assigning a unique trace ID to each request. Every service that handles the request records a span (a timed unit of work) and tags it with that trace ID. The full trace — the complete journey of a request — can then be visualized as a waterfall diagram.
Core Concepts
Trace
A trace represents the complete journey of a single request through a distributed system. Each trace has a unique trace ID that flows through every service the request touches.
Span
A span is a named, timed operation within a trace. Each service creates one or more spans as it processes the request. Spans have:
- A name (e.g., "HTTP POST /checkout")
- Start and end timestamps
- The parent span ID (creating a tree structure)
- Attributes (key-value pairs like http.status_code=200)
- Events (point-in-time annotations like "cache miss")
- Status (OK, Error)
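To make the fields above concrete, here is a hypothetical plain-object sketch of a span — illustrative only, not the actual class the OpenTelemetry SDK uses:

```javascript
// Hypothetical plain-object sketch of a span's fields (illustrative only —
// the real OpenTelemetry SDK has its own Span type).
const span = {
  name: 'HTTP POST /checkout',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736', // shared by every span in the trace
  spanId: '00f067aa0ba902b7',
  parentSpanId: 'b7ad6b7169203331',            // links this span into the trace tree
  startTimeMs: 1710498600000,
  endTimeMs: 1710498600245,
  attributes: { 'http.status_code': 200 },
  events: [{ name: 'cache miss', timestampMs: 1710498600050 }],
  status: 'OK',
};

// Duration is what a waterfall view renders as the bar length.
const durationMs = span.endTimeMs - span.startTimeMs; // 245
console.log(durationMs);
```

The parent span ID is what turns a flat list of spans into the tree that backends render as a waterfall.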
Context Propagation
For tracing to work across services, the trace ID and span ID must be passed from one service to the next — typically via HTTP headers (traceparent and tracestate in the W3C standard). OpenTelemetry handles this propagation automatically once instrumented.
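As a sketch of what that propagation carries, a W3C traceparent header can be decoded in a few lines. The `parseTraceparent` helper below is hypothetical — OTel's propagators do this for you:

```javascript
// Hypothetical helper that decodes a W3C traceparent header:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    version: m[1],
    traceId: m[2],
    parentSpanId: m[3],
    sampled: (parseInt(m[4], 16) & 1) === 1, // lowest flag bit = sampled
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

The downstream service creates its next span with the same trace ID and uses the incoming parent-id as that span's parent, which is how the cross-service tree is stitched together.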
OpenTelemetry Architecture
OpenTelemetry has three main components:
- API: Language-specific interfaces for instrumentation — how code emits traces, metrics, and logs. Vendor-neutral.
- SDK: Implementation of the API. Handles sampling, processing, and exporting. Configured per service.
- Collector: A standalone service that receives telemetry from applications, processes it, and exports it to one or more backends (Jaeger, Prometheus, Datadog, etc.).
OTel Collector Pipeline
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: service.environment
        value: production
        action: upsert
exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: warn
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
Instrumenting Applications with OpenTelemetry
Node.js Auto-Instrumentation
OpenTelemetry provides auto-instrumentation libraries that patch popular frameworks (Express, gRPC, HTTP, databases) without modifying application code.
// tracing.js — loaded before application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4317' }),
    exportIntervalMillis: 10000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true }, // PostgreSQL
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

# Start the app with tracing loaded first:
node -r ./tracing.js server.js
Manual Custom Spans
Auto-instrumentation traces HTTP calls and database queries. Add custom spans for business logic that matters:
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('checkout-service');

async function processOrder(orderId, userId) {
  // Create a custom span for this business operation
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'user.id': userId,
      'order.source': 'web',
    });
    try {
      const order = await fetchOrder(orderId);
      // Child span for the payment step — ended in finally so it closes
      // even if the charge throws
      await tracer.startActiveSpan('chargePayment', async (paymentSpan) => {
        paymentSpan.setAttributes({ 'payment.amount': order.total });
        try {
          await chargePayment(order);
        } finally {
          paymentSpan.end();
        }
      });
      span.setStatus({ code: opentelemetry.SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: opentelemetry.SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
Python Auto-Instrumentation
# Install
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run Flask app with auto-instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument python app.py
Jaeger – Distributed Tracing Backend
Jaeger (developed by Uber, now a CNCF project) stores and visualizes distributed traces. Its UI shows:
- A timeline of all spans within a trace, arranged as a waterfall.
- The duration of each span — instantly revealing where time is spent.
- Errors highlighted in red with full exception details.
- Span attributes and events for debugging context.
- Service dependency graphs showing how services communicate.
# Deploy Jaeger to Kubernetes
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml -n observability
# Create a simple Jaeger instance
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
Correlating Traces, Metrics, and Logs
The real power of observability emerges when traces, metrics, and logs are correlated. The trace ID connects all three:
Structured Logging with Trace Context
// Add trace ID to every log line
const winston = require('winston');
const { trace } = require('@opentelemetry/api');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // Inject trace context BEFORE json() serializes, so the fields appear in output
    winston.format((info) => {
      const span = trace.getActiveSpan();
      if (span) {
        const spanContext = span.spanContext();
        info.traceId = spanContext.traceId;
        info.spanId = spanContext.spanId;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Log output:
// {"timestamp":"2025-03-15T10:30:00Z","level":"error","message":"Payment failed",
//  "orderId":"ORD-9821","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7"}

In Grafana, clicking on a spike in the error-rate metric jumps directly to the relevant traces. Clicking on a trace ID in Jaeger pulls up all log lines from that exact request across all services.
Sampling Strategies
High-traffic services generate millions of traces per minute. Storing every trace is expensive. Sampling controls which traces to keep.
| Strategy | How It Works | Use Case |
|---|---|---|
| Head-based (probabilistic) | Decision made at trace start — e.g., keep 10% | High-volume, low-criticality traffic |
| Tail-based | Decision made after trace completes — always keep errors and slow traces | Production critical paths |
| Always-on | Keep every trace | Dev and staging only |
| Rate limiting | Keep up to N traces per second | Controlled cost with full recent coverage |
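To illustrate how a head-based probabilistic decision can be made consistently across services, the sketch below derives the keep/drop choice deterministically from the trace ID. This is a simplified illustration in the spirit of samplers such as OTel's TraceIdRatioBasedSampler, not its actual algorithm:

```javascript
// Sketch of head-based probabilistic sampling keyed on the trace ID.
// Because the decision is a pure function of the trace ID, every service
// that sees the same trace makes the same choice — traces are kept or
// dropped whole, never half-recorded. Simplified illustration only.
function shouldSample(traceId, ratio) {
  // Interpret the first 8 hex chars of the 32-char trace ID as an integer
  // in [0, 2^32), then keep the trace if it falls below ratio * 2^32.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.5)); // true — 0x4bf92f35 is below half the range
```

Tail-based sampling cannot work this way: it must buffer all spans of a trace (typically in the Collector) until the trace completes, then decide based on what actually happened — which is why it can always keep errors and slow traces.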
Summary
- Distributed tracing follows a single request through all services using a shared trace ID.
- Spans record the timing and context of each operation within a trace — forming a complete waterfall diagram.
- OpenTelemetry is the vendor-neutral standard for instrumentation — instrument once, export to any backend.
- Auto-instrumentation handles HTTP, gRPC, and database calls without code changes. Custom spans add business context.
- The OTel Collector receives, processes, and routes telemetry to multiple backends simultaneously.
- Correlating traces, metrics, and logs through a shared trace ID enables rapid root-cause analysis across microservices.
- Tail-based sampling keeps all error and slow traces while discarding a percentage of normal traffic to control costs.
