DevOps Advanced Observability and OpenTelemetry

As applications grow into dozens of microservices, understanding what happens during a single user request becomes extremely difficult. A slow checkout might involve the API gateway, authentication service, product service, cart service, payment service, and notification service — all in sequence. Finding where the delay or error occurred requires distributed tracing.

OpenTelemetry (OTel) is the open standard for collecting observability data — traces, metrics, and logs — in a vendor-neutral way. It is the most significant development in observability in recent years, replacing fragmented, vendor-specific instrumentation with a single, unified approach.

The Problem with Distributed Systems Observability

In a monolith, a stack trace pinpoints the exact line of code that caused an error. In microservices:

  • A request passes through 8 different services, each with its own logs.
  • Logs across services have no shared identifier to connect them.
  • Metrics show that response time is high, but not where the time is spent.
  • An error in Service G might have been caused by a slow response from Service B several hops earlier.

Distributed tracing solves this by assigning a unique trace ID to each request. Every service that handles the request records a span (a timed unit of work) and tags it with that trace ID. The full trace — the complete journey of a request — can then be visualized as a waterfall diagram.

Core Concepts

Trace

A trace represents the complete journey of a single request through a distributed system. Each trace has a unique trace ID that flows through every service the request touches.

Span

A span is a named, timed operation within a trace. Each service creates one or more spans as it processes the request. Spans have:

  • A name (e.g., "HTTP POST /checkout")
  • Start and end timestamps
  • The parent span ID (creating a tree structure)
  • Attributes (key-value pairs like http.status_code=200)
  • Events (point-in-time annotations like "cache miss")
  • Status (OK, Error)
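
The parent-span link is what turns a flat list of spans into a tree. A minimal sketch in plain JavaScript — illustrative data only, not OpenTelemetry's actual wire format:

```javascript
// Illustrative only: two spans sharing a trace ID, with parentSpanId
// forming the tree that a tracing UI renders as a waterfall.
const root = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: 'a1',
  parentSpanId: null,                 // no parent: this is the root span
  name: 'HTTP POST /checkout',
  startMs: 0,
  endMs: 420,
  attributes: { 'http.status_code': 200 },
  events: [{ name: 'cache miss', atMs: 35 }],
  status: 'OK',
};

const child = {
  traceId: root.traceId,              // same trace ID ties the spans together
  spanId: 'b2',
  parentSpanId: root.spanId,          // parent link creates the tree structure
  name: 'chargePayment',
  startMs: 120,
  endMs: 390,
  attributes: {},
  events: [],
  status: 'OK',
};

console.log(child.parentSpanId === root.spanId); // true
```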

Context Propagation

For tracing to work across services, the trace ID and span ID must be passed from one service to the next — typically via HTTP headers (traceparent and tracestate in the W3C standard). OpenTelemetry handles this propagation automatically once instrumented.
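
The traceparent header itself is a small, fixed format. A hand-rolled sketch of pulling it apart, purely for illustration — OpenTelemetry's W3C propagator does this for you:

```javascript
// W3C traceparent format: version-traceId-spanId-traceFlags, lowercase hex,
// e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
function parseTraceparent(header) {
  const [version, traceId, parentSpanId, flags] = header.split('-');
  return {
    version,
    traceId,                                      // 128-bit trace ID (32 hex chars)
    parentSpanId,                                 // 64-bit span ID of the caller
    sampled: (parseInt(flags, 16) & 0x01) === 1,  // bit 0 = sampled flag
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId);  // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx.sampled);  // true
```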

OpenTelemetry Architecture

OpenTelemetry has three main components:

  1. API: Language-specific interfaces for instrumentation — how code emits traces, metrics, and logs. Vendor-neutral.
  2. SDK: Implementation of the API. Handles sampling, processing, and exporting. Configured per service.
  3. Collector: A standalone service that receives telemetry from applications, processes it, and exports it to one or more backends (Jaeger, Prometheus, Datadog, etc.).

OTel Collector Pipeline

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: service.environment
        value: production
        action: upsert

exporters:
  # Note: recent Collector releases removed the dedicated jaeger exporter;
  # on those versions, point an otlp exporter at Jaeger's OTLP port instead.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: warn

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Instrumenting Applications with OpenTelemetry

Node.js Auto-Instrumentation

OpenTelemetry provides auto-instrumentation libraries that patch popular frameworks (Express, gRPC, HTTP, databases) without modifying application code.

// tracing.js — loaded before application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4317' }),
    exportIntervalMillis: 10000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },  // PostgreSQL
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
// Start the app with tracing loaded first:
//   node -r ./tracing.js server.js

Manual Custom Spans

Auto-instrumentation traces HTTP calls and database queries. Add custom spans for business logic that matters:

const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('checkout-service');

async function processOrder(orderId, userId) {
  // Create a custom span for this business operation
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'user.id': userId,
      'order.source': 'web',
    });

    try {
      const order = await fetchOrder(orderId);

      // Child span for the payment step
      await tracer.startActiveSpan('chargePayment', async (paymentSpan) => {
        try {
          paymentSpan.setAttributes({ 'payment.amount': order.total });
          await chargePayment(order);
        } finally {
          paymentSpan.end();  // end the child span even if chargePayment throws
        }
      });

      span.setStatus({ code: opentelemetry.SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: opentelemetry.SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Python Auto-Instrumentation

# Install
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run Flask app with auto-instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument python app.py

Jaeger – Distributed Tracing Backend

Jaeger (developed by Uber, now a CNCF project) stores and visualizes distributed traces. Its UI shows:

  • A timeline of all spans within a trace, arranged as a waterfall.
  • The duration of each span — instantly revealing where time is spent.
  • Errors highlighted in red with full exception details.
  • Span attributes and events for debugging context.
  • Service dependency graphs showing how services communicate.

# Deploy Jaeger to Kubernetes
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml -n observability

# Create a simple Jaeger instance (save as jaeger.yaml, then: kubectl apply -f jaeger.yaml)
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      serverUrls: https://elasticsearch:9200

Correlating Traces, Metrics, and Logs

The real power of observability emerges when traces, metrics, and logs are correlated. The trace ID connects all three:

Structured Logging with Trace Context

// Add trace ID to every log line
const winston = require('winston');
const { context, trace } = require('@opentelemetry/api');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json(),
    winston.format((info) => {
      const span = trace.getActiveSpan();
      if (span) {
        const spanContext = span.spanContext();
        info.traceId = spanContext.traceId;
        info.spanId = spanContext.spanId;
      }
      return info;
    })()
  ),
  transports: [new winston.transports.Console()],
});

// Log output:
// {"timestamp":"2025-03-15T10:30:00Z","level":"error","message":"Payment failed",
//  "orderId":"ORD-9821","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7"}

In Grafana, clicking a spike in the error-rate metric can jump directly to the relevant traces. Searching the log backend for a trace ID found in Jaeger pulls up every log line from that exact request across all services.
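Mechanically, that correlation is just a filter on the shared trace ID. A toy sketch using hypothetical log lines in the structured format shown above:

```javascript
// Hypothetical log lines from different services, in the structured JSON
// format shown above; the traceId field is the shared key.
const logLines = [
  '{"service":"api-gateway","level":"info","message":"request received","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}',
  '{"service":"checkout-service","level":"error","message":"Payment failed","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}',
  '{"service":"cart-service","level":"info","message":"unrelated request","traceId":"deadbeefdeadbeefdeadbeefdeadbeef"}',
];

// What a log backend effectively does when you search by trace ID
function logsForTrace(lines, traceId) {
  return lines.map((l) => JSON.parse(l)).filter((l) => l.traceId === traceId);
}

console.log(logsForTrace(logLines, '4bf92f3577b34da6a3ce929d0e0e4736').length); // 2
```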

Sampling Strategies

High-traffic services generate millions of traces per minute. Storing every trace is expensive. Sampling controls which traces to keep.

Strategy                   | How It Works                                                             | Use Case
Head-based (probabilistic) | Decision made at trace start — e.g., keep 10%                            | High-volume, low-criticality traffic
Tail-based                 | Decision made after trace completes — always keep errors and slow traces | Production critical paths
Always-on                  | Keep every trace                                                         | Dev and staging only
Rate limiting              | Keep up to N traces per second                                           | Controlled cost with full recent coverage
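
Head-based sampling is usually made deterministic by deriving the decision from the trace ID itself, so every service in the trace reaches the same keep/drop verdict without coordination. A minimal illustrative sketch of that principle (not the SDK's exact algorithm):

```javascript
// Illustrative head-based sampler: treat the leading bits of the trace ID
// as a uniform value, so all services agree on the decision independently.
function shouldSample(traceId, ratio) {
  const upper = parseInt(traceId.slice(0, 8), 16);  // first 32 bits as a number
  return upper / 0xffffffff < ratio;                // keep roughly `ratio` of traces
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';  // maps to ~0.30
console.log(shouldSample(traceId, 0.5)); // true
console.log(shouldSample(traceId, 0.1)); // false
```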

Summary

  • Distributed tracing follows a single request through all services using a shared trace ID.
  • Spans record the timing and context of each operation within a trace — forming a complete waterfall diagram.
  • OpenTelemetry is the vendor-neutral standard for instrumentation — instrument once, export to any backend.
  • Auto-instrumentation handles HTTP, gRPC, and database calls without code changes. Custom spans add business context.
  • The OTel Collector receives, processes, and routes telemetry to multiple backends simultaneously.
  • Correlating trace IDs across traces, metrics, and logs enables instant root-cause analysis across microservices.
  • Tail-based sampling keeps all error and slow traces while discarding a percentage of normal traffic to control costs.
