
Observability vs Monitoring — Distributed Tracing with Jaeger and OpenTelemetry

7 min read
Goel Academy
DevOps & Cloud Learning Hub

When a user reports that checkout is slow, monitoring tells you that latency spiked. Observability tells you why — the payment service waited 3 seconds for a database query that normally takes 20ms because a missing index caused a full table scan on a table that grew past 10 million rows last Tuesday. That's the difference.

Monitoring vs Observability

These terms are often used interchangeably, but they represent fundamentally different approaches:

Aspect              Monitoring                           Observability
------------------  -----------------------------------  -------------------------------------
Question answered   "Is something broken?"               "Why is it broken?"
Approach            Pre-defined dashboards and alerts    Explore data to find unknown unknowns
Data model          Metrics (time series)                Metrics + Logs + Traces (correlated)
Investigation       Check known failure modes            Ask arbitrary questions of your data
Setup               Define what to watch                 Instrument everything, query later
Analogy             Car dashboard warning lights         Full diagnostic computer readout

Monitoring is a subset of observability. You need monitoring, but in a microservices world, you also need the ability to trace a request across 15 services to find where those 3 seconds went.

The Three Pillars Deep Dive

Observability rests on three complementary signal types:

┌──────────────────────────────────────────────────────────┐
│                      Observability                       │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐   │
│  │   Metrics    │  │     Logs     │  │    Traces     │   │
│  │              │  │              │  │               │   │
│  │  "What is    │  │  "What       │  │  "What path   │   │
│  │  happening?" │  │  happened?"  │  │  did it take?"│   │
│  │              │  │              │  │               │   │
│  │  Prometheus  │  │  Loki / ELK  │  │ Jaeger / Tempo│   │
│  │  counters,   │  │  structured  │  │ spans, traces │   │
│  │  gauges,     │  │  events,     │  │ context       │   │
│  │  histograms  │  │  stack traces│  │ propagation   │   │
│  └──────────────┘  └──────────────┘  └───────────────┘   │
│                                                          │
│          Correlation: trace_id links all three           │
└──────────────────────────────────────────────────────────┘

Metrics tell you the system's vital signs (request rate, error rate, latency). Logs give you detailed event records. Traces show you the journey of a single request through your distributed system. The magic happens when you correlate all three using a shared trace_id.

Distributed Tracing Concepts

In a microservices architecture, a single user request can touch dozens of services. Distributed tracing tracks that journey:

User Request: GET /checkout
│
└── [Span 1] API Gateway (12ms)
    │
    ├── [Span 2] Auth Service - Validate Token (8ms)
    │
    ├── [Span 3] Cart Service - Get Items (45ms)
    │   │
    │   └── [Span 4] Redis Cache Lookup (2ms)
    │
    ├── [Span 5] Payment Service (3,200ms)        ← BOTTLENECK
    │   │
    │   ├── [Span 6] Fraud Check API (150ms)
    │   │
    │   └── [Span 7] Database Query (3,020ms)     ← ROOT CAUSE
    │       └── SELECT * FROM transactions WHERE user_id = ?
    │           (full table scan — missing index!)
    │
    └── [Span 8] Notification Service (25ms)

Total: 3,342ms

Key terminology:

  • Trace: The entire journey of a request (all spans combined)
  • Span: A single unit of work (one service call, one DB query)
  • Context Propagation: Passing the trace ID between services via HTTP headers
  • Parent-Child Relationship: Spans form a tree showing causality
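
In practice, context propagation means the W3C Trace Context `traceparent` HTTP header. To make the format concrete, here is a minimal sketch of a parser for it; `parseTraceparent` is a hypothetical helper for illustration, and real services should rely on the OpenTelemetry propagators instead:

```javascript
// Minimal sketch: parsing a W3C Trace Context "traceparent" header.
// Format: version-traceid-spanid-flags, all lowercase hex.
// Hypothetical helper for illustration only.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return {
    version,
    traceId,        // shared by every span in the trace
    parentSpanId,   // the caller's span; the next span becomes its child
    sampled: (parseInt(flags, 16) & 0x1) === 0x1,
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId);  // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx.sampled);  // true
```

Every service that receives this header creates its spans with the same `traceId` and points them at `parentSpanId`, which is what lets the backend reassemble the tree shown above.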

OpenTelemetry: Vendor-Neutral Instrumentation

OpenTelemetry (OTel) is the CNCF standard for generating telemetry data. It's vendor-neutral — instrument once, send data to any backend (Jaeger, Tempo, Datadog, New Relic):

// Node.js — OpenTelemetry auto-instrumentation setup
// tracing.js — Load this BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'payment-service',
    [ATTR_SERVICE_VERSION]: '1.4.2',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: HTTP, Express, pg, mysql, redis, grpc, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry tracing initialized');

Auto-instrumentation captures HTTP requests, database queries, cache calls, and gRPC calls without modifying your application code.
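
To make sure the SDK is loaded before any other module, preload it with Node's `--require` flag (`app.js` here is a placeholder for your actual entry point):

```shell
# Preload the tracing setup so auto-instrumentation can patch modules
# before the application requires them (app.js is a placeholder name)
node --require ./tracing.js app.js
```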

Manual Instrumentation for Custom Spans

Sometimes you need to trace business logic that auto-instrumentation can't capture:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('payment-service');

async function processPayment(order) {
  // Create a custom span for the payment workflow
  return tracer.startActiveSpan('process-payment', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.amount', order.totalAmount);
      span.setAttribute('payment.method', order.paymentMethod);

      // Nested span for the fraud check
      const fraudResult = await tracer.startActiveSpan('fraud-check', async (fraudSpan) => {
        const result = await fraudService.checkTransaction(order);
        fraudSpan.setAttribute('fraud.score', result.score);
        fraudSpan.setAttribute('fraud.decision', result.decision);
        fraudSpan.end();
        return result;
      });

      if (fraudResult.decision === 'REJECT') {
        // Recorded as an exception in the catch block below
        throw new Error('Payment rejected by fraud check');
      }

      // Nested span for the payment gateway call
      const chargeResult = await tracer.startActiveSpan('charge-gateway', async (chargeSpan) => {
        chargeSpan.setAttribute('gateway', 'stripe');
        const result = await stripe.charges.create({ amount: order.totalAmount });
        chargeSpan.setAttribute('charge.id', result.id);
        chargeSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return chargeResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // End the span exactly once, on both success and failure paths
      span.end();
    }
  });
}

Jaeger for Trace Visualization

Jaeger (created by Uber) is one of the most widely used open-source tracing backends. Deploy it on Kubernetes:

# jaeger-deployment.yaml — All-in-one for development
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.54
          ports:
            - containerPort: 16686   # UI
            - containerPort: 4317    # OTLP gRPC
            - containerPort: 4318    # OTLP HTTP
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: "elasticsearch"
            - name: ES_SERVER_URLS
              value: "http://elasticsearch:9200"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
spec:
  ports:
    - name: ui
      port: 16686
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
  selector:
    app: jaeger

Open http://jaeger:16686 to search traces by service, operation, duration, or tags.
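
If the UI isn't exposed through an Ingress, a port-forward against the Service above works for local access (names match the manifest in this post):

```shell
# Forward the Jaeger UI from the cluster to localhost:16686
kubectl port-forward -n observability svc/jaeger 16686:16686
```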

OpenTelemetry Collector Architecture

The OTel Collector is a proxy that receives, processes, and exports telemetry data. It decouples your application from your backend:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Filter out health check spans to reduce noise
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Add environment metadata to all spans
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel

service:
  pipelines:
    traces:
      receivers: [otlp]
      # batch goes last, as recommended for batching processors
      processors: [filter, resource, batch]
      exporters: [otlp/jaeger, otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Architecture:

  App → OTel SDK → OTel Collector ─→ Jaeger (traces)
                        │
                        ├─────────→ Prometheus (metrics)
                        └─────────→ Loki (logs)

  Processing inside the Collector:
  - Batching
  - Filtering
  - Sampling
  - Enrichment

The Collector gives you a single place to manage sampling, filtering, and routing without touching application code.

Trace Sampling Strategies

In high-traffic systems, tracing every request is expensive. Sampling strategies help:

Strategy        Description                                           Use Case
--------------  ----------------------------------------------------  ----------------------------------
Head-based      Decide at request start (e.g., 10% of requests)       Simple, predictable cost
Tail-based      Decide after request completes (keep errors + slow)   Better signal, higher resource use
Rate limiting   Max N traces per second per service                   Cost control
Priority        Always trace flagged requests (debug header)          On-demand debugging

# Tail-based sampling in OTel Collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: percentage
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

This configuration keeps all errors, all slow requests (>2s), and a 5% random sample of everything else — giving you great signal while controlling costs.
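
For contrast with tail-based sampling, a head-based probabilistic decision can be derived deterministically from the trace ID itself, so every service reaches the same verdict without coordination. The sketch below illustrates the idea (`shouldSample` is a simplified stand-in, not OpenTelemetry's exact `TraceIdRatioBasedSampler` algorithm):

```javascript
// Head-based sampling sketch: derive the decision from the trace ID so
// all services in a trace agree. Simplified illustration only.
function shouldSample(traceId, samplingRatio) {
  // Interpret the last 8 hex chars of the 32-char trace ID as a number
  const value = parseInt(traceId.slice(-8), 16);   // 0 .. 2^32 - 1
  const threshold = samplingRatio * 0xffffffff;
  return value < threshold;
}

// The same trace ID always yields the same decision:
const id = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(id, 1.0));  // true  — ratio 1.0 keeps this trace
console.log(shouldSample(id, 0.0));  // false — ratio 0 drops everything
```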

Correlating Logs, Metrics, and Traces

The real power comes from linking all three signals:

// Inject trace context into structured logs
const { trace } = require('@opentelemetry/api');
const pino = require('pino');

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const context = span.spanContext();
      return {
        trace_id: context.traceId,
        span_id: context.spanId,
      };
    }
    return {};
  },
});

// Now every log line includes trace_id and span_id:
// {"level":30,"trace_id":"abc123","span_id":"def456","msg":"Payment processed","order_id":"ORD-789"}

In Grafana, you can click a trace in Jaeger, jump to the corresponding logs in Loki, and see the related metrics in Prometheus — all linked by trace_id. This is the workflow that turns hours of debugging into minutes.
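
That trace-to-logs jump is configured in Grafana as a derived field on the Loki datasource. A provisioning sketch, assuming a Jaeger datasource whose UID is `jaeger-uid` (an example value to replace with your own):

```yaml
# grafana-datasources.yaml — provisioning sketch; uid values are examples
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract trace_id from JSON log lines and link it to Jaeger
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: jaeger-uid   # must match your Jaeger datasource UID
```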

Closing Note

Observability transforms how teams debug production issues. Instead of guessing which service is slow, you follow the trace. Instead of grepping logs across 20 services, you click a trace ID. In the next post, we'll explore Deployment Strategies — blue-green, canary, rolling updates, and feature flags — so your well-observed system also deploys safely.