DevOps Advanced Observability and OpenTelemetry
As applications grow into dozens of microservices, understanding what happens during a single user request becomes extremely difficult. A slow checkout might involve the API gateway, authentication service, product service, cart service, payment service, and notification service — all in sequence. Finding where the delay or error occurred requires distributed tracing.
OpenTelemetry (OTel) is the open standard for collecting observability data — traces, metrics, and logs — in a vendor-neutral way. It is the most significant development in observability in recent years, replacing fragmented, vendor-specific instrumentation with a single, unified approach.
The Problem with Distributed Systems Observability
In a monolith, a stack trace pinpoints the exact line of code that caused an error. In microservices:
- A request passes through 8 different services, each with its own logs.
- Logs across services have no shared identifier to connect them.
- Metrics show that response time is high, but not where the time is spent.
- An error in Service G might have been caused by a slow response from Service B three hops earlier.
Distributed tracing solves this by assigning a unique trace ID to each request. Every service that handles the request records a span (a timed unit of work) and tags it with that trace ID. The full trace — the complete journey of a request — can then be visualized as a waterfall diagram.
Core Concepts
Trace
A trace represents the complete journey of a single request through a distributed system. Each trace has a unique trace ID that flows through every service the request touches.
Span
A span is a named, timed operation within a trace. Each service creates one or more spans as it processes the request. Spans have:
- A name (e.g., "HTTP POST /checkout")
- Start and end timestamps
- The parent span ID (creating a tree structure)
- Attributes (key-value pairs like http.status_code=200)
- Events (point-in-time annotations like "cache miss")
- Status (OK, Error)
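To make the fields above concrete, here is a hypothetical plain-object sketch of a span — illustrative only, not the actual class the OpenTelemetry SDK uses:

```javascript
// Hypothetical plain-object sketch of a span's fields (illustrative only —
// the real OpenTelemetry SDK has its own Span type).
const span = {
  name: 'HTTP POST /checkout',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736', // shared by every span in the trace
  spanId: '00f067aa0ba902b7',
  parentSpanId: 'b7ad6b7169203331',            // links this span into the trace tree
  startTimeMs: 1710498600000,
  endTimeMs: 1710498600245,
  attributes: { 'http.status_code': 200 },
  events: [{ name: 'cache miss', timestampMs: 1710498600050 }],
  status: 'OK',
};

// Duration is what a waterfall view renders as the bar length.
const durationMs = span.endTimeMs - span.startTimeMs; // 245
console.log(durationMs);
```

The parent span ID is what turns a flat list of spans into the tree that backends render as a waterfall.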
Context Propagation
For tracing to work across services, the trace ID and span ID must be passed from one service to the next — typically via HTTP headers (traceparent and tracestate in the W3C standard). OpenTelemetry handles this propagation automatically once instrumented.
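As a sketch of what that propagation carries, a W3C traceparent header can be decoded in a few lines. The `parseTraceparent` helper below is hypothetical — OTel's propagators do this for you:

```javascript
// Hypothetical helper that decodes a W3C traceparent header:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    version: m[1],
    traceId: m[2],
    parentSpanId: m[3],
    sampled: (parseInt(m[4], 16) & 1) === 1, // lowest flag bit = sampled
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

The downstream service creates its next span with the same trace ID and uses the incoming parent-id as that span's parent, which is how the cross-service tree is stitched together.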
OpenTelemetry Architecture
OpenTelemetry has three main components:
- API: Language-specific interfaces for instrumentation — how code emits traces, metrics, and logs. Vendor-neutral.
- SDK: Implementation of the API. Handles sampling, processing, and exporting. Configured per service.
- Collector: A standalone service that receives telemetry from applications, processes it, and exports it to one or more backends (Jaeger, Prometheus, Datadog, etc.).
OTel Collector Pipeline
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: service.environment
        value: production
        action: upsert
exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: warn
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
Instrumenting Applications with OpenTelemetry
Node.js Auto-Instrumentation
OpenTelemetry provides auto-instrumentation libraries that patch popular frameworks (Express, gRPC, HTTP, databases) without modifying application code.
// tracing.js — loaded before application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4317' }),
    exportIntervalMillis: 10000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true }, // PostgreSQL
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

# Start the app with tracing loaded first:
node -r ./tracing.js server.js
Manual Custom Spans
Auto-instrumentation traces HTTP calls and database queries. Add custom spans for business logic that matters:
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('checkout-service');

async function processOrder(orderId, userId) {
  // Create a custom span for this business operation
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'user.id': userId,
      'order.source': 'web',
    });
    try {
      const order = await fetchOrder(orderId);
      // Child span for the payment step — ended in finally so it closes
      // even if the charge throws
      await tracer.startActiveSpan('chargePayment', async (paymentSpan) => {
        paymentSpan.setAttributes({ 'payment.amount': order.total });
        try {
          await chargePayment(order);
        } finally {
          paymentSpan.end();
        }
      });
      span.setStatus({ code: opentelemetry.SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: opentelemetry.SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
Python Auto-Instrumentation
# Install
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
# Run Flask app with auto-instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument python app.py
Jaeger – Distributed Tracing Backend
Jaeger (developed by Uber, now a CNCF project) stores and visualizes distributed traces. Its UI shows:
- A timeline of all spans within a trace, arranged as a waterfall.
- The duration of each span — instantly revealing where time is spent.
- Errors highlighted in red with full exception details.
- Span attributes and events for debugging context.
- Service dependency graphs showing how services communicate.
# Deploy Jaeger to Kubernetes
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml -n observability
# Create a simple Jaeger instance
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: https://elasticsearch:9200
Correlating Traces, Metrics, and Logs
The real power of observability emerges when traces, metrics, and logs are correlated. The trace ID connects all three:
Structured Logging with Trace Context
// Add trace ID to every log line
const winston = require('winston');
const { trace } = require('@opentelemetry/api');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    // Inject trace context BEFORE json() serializes, so the fields appear in output
    winston.format((info) => {
      const span = trace.getActiveSpan();
      if (span) {
        const spanContext = span.spanContext();
        info.traceId = spanContext.traceId;
        info.spanId = spanContext.spanId;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Log output:
// {"timestamp":"2025-03-15T10:30:00Z","level":"error","message":"Payment failed",
//  "orderId":"ORD-9821","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7"}

In Grafana, clicking on a spike in the error-rate metric jumps directly to the relevant traces. Clicking on a trace ID in Jaeger pulls up all log lines from that exact request across all services.
Sampling Strategies
High-traffic services generate millions of traces per minute. Storing every trace is expensive. Sampling controls which traces to keep.
| Strategy | How It Works | Use Case |
|---|---|---|
| Head-based (probabilistic) | Decision made at trace start — e.g., keep 10% | High-volume, low-criticality traffic |
| Tail-based | Decision made after trace completes — always keep errors and slow traces | Production critical paths |
| Always-on | Keep every trace | Dev and staging only |
| Rate limiting | Keep up to N traces per second | Controlled cost with full recent coverage |
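To illustrate how a head-based probabilistic decision can be made consistently across services, the sketch below derives the keep/drop choice deterministically from the trace ID. This is a simplified illustration in the spirit of samplers such as OTel's TraceIdRatioBasedSampler, not its actual algorithm:

```javascript
// Sketch of head-based probabilistic sampling keyed on the trace ID.
// Because the decision is a pure function of the trace ID, every service
// that sees the same trace makes the same choice — traces are kept or
// dropped whole, never half-recorded. Simplified illustration only.
function shouldSample(traceId, ratio) {
  // Interpret the first 8 hex chars of the 32-char trace ID as an integer
  // in [0, 2^32), then keep the trace if it falls below ratio * 2^32.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.5)); // true — 0x4bf92f35 is below half the range
```

Tail-based sampling cannot work this way: it must buffer all spans of a trace (typically in the Collector) until the trace completes, then decide based on what actually happened — which is why it can always keep errors and slow traces.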
Summary
- Distributed tracing follows a single request through all services using a shared trace ID.
- Spans record the timing and context of each operation within a trace — forming a complete waterfall diagram.
- OpenTelemetry is the vendor-neutral standard for instrumentation — instrument once, export to any backend.
- Auto-instrumentation handles HTTP, gRPC, and database calls without code changes. Custom spans add business context.
- The OTel Collector receives, processes, and routes telemetry to multiple backends simultaneously.
- Correlating traces, metrics, and logs through a shared trace ID enables rapid root-cause analysis across microservices.
- Tail-based sampling keeps all error and slow traces while discarding a percentage of normal traffic to control costs.
