
Observability vs Monitoring — Distributed Tracing with Jaeger and OpenTelemetry

7 min read
Goel Academy
DevOps & Cloud Learning Hub

When a user reports that checkout is slow, monitoring tells you that latency spiked. Observability tells you why — the payment service waited 3 seconds for a database query that normally takes 20ms because a missing index caused a full table scan on a table that grew past 10 million rows last Tuesday. That's the difference.

Monitoring vs Observability

These terms are often used interchangeably, but they represent fundamentally different approaches:

Aspect              Monitoring                           Observability
------------------  -----------------------------------  -------------------------------------
Question answered   "Is something broken?"               "Why is it broken?"
Approach            Pre-defined dashboards and alerts    Explore data to find unknown unknowns
Data model          Metrics (time series)                Metrics + Logs + Traces (correlated)
Investigation       Check known failure modes            Ask arbitrary questions of your data
Setup               Define what to watch                 Instrument everything, query later
Analogy             Car dashboard warning lights         Full diagnostic computer readout

Monitoring is a subset of observability. You need monitoring, but in a microservices world, you also need the ability to trace a request across 15 services to find where those 3 seconds went.

The Three Pillars Deep Dive

Observability rests on three complementary signal types:

┌──────────────────────────────────────────────────────────┐
│                      Observability                       │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐   │
│  │   Metrics    │  │     Logs     │  │    Traces     │   │
│  │              │  │              │  │               │   │
│  │  "What is    │  │  "What       │  │  "What path   │   │
│  │  happening?" │  │  happened?"  │  │  did it take?"│   │
│  │              │  │              │  │               │   │
│  │  Prometheus  │  │  Loki / ELK  │  │ Jaeger / Tempo│   │
│  │  counters,   │  │  structured  │  │ spans, traces │   │
│  │  gauges,     │  │  events,     │  │ context       │   │
│  │  histograms  │  │  stack traces│  │ propagation   │   │
│  └──────────────┘  └──────────────┘  └───────────────┘   │
│                                                          │
│          Correlation: trace_id links all three           │
└──────────────────────────────────────────────────────────┘

Metrics tell you the system's vital signs (request rate, error rate, latency). Logs give you detailed event records. Traces show you the journey of a single request through your distributed system. The magic happens when you correlate all three using a shared trace_id.

Distributed Tracing Concepts

In a microservices architecture, a single user request can touch dozens of services. Distributed tracing tracks that journey:

User Request: GET /checkout
│
└── [Span 1] API Gateway (12ms)
    │
    ├── [Span 2] Auth Service - Validate Token (8ms)
    │
    ├── [Span 3] Cart Service - Get Items (45ms)
    │   │
    │   └── [Span 4] Redis Cache Lookup (2ms)
    │
    ├── [Span 5] Payment Service (3,200ms)        ← BOTTLENECK
    │   │
    │   ├── [Span 6] Fraud Check API (150ms)
    │   │
    │   └── [Span 7] Database Query (3,020ms)     ← ROOT CAUSE
    │       └── SELECT * FROM transactions WHERE user_id = ?
    │           (full table scan — missing index!)
    │
    └── [Span 8] Notification Service (25ms)

Total: 3,342ms

Key terminology:

  • Trace: The entire journey of a request (all spans combined)
  • Span: A single unit of work (one service call, one DB query)
  • Context Propagation: Passing the trace ID between services via HTTP headers
  • Parent-Child Relationship: Spans form a tree showing causality
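
In practice, context propagation means the W3C Trace Context `traceparent` HTTP header. To make the format concrete, here is a minimal sketch of a parser for it; `parseTraceparent` is a hypothetical helper for illustration, and real services should rely on the OpenTelemetry propagators instead:

```javascript
// Minimal sketch: parsing a W3C Trace Context "traceparent" header.
// Format: version-traceid-spanid-flags, all lowercase hex.
// Hypothetical helper for illustration only.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return {
    version,
    traceId,        // shared by every span in the trace
    parentSpanId,   // the caller's span; the next span becomes its child
    sampled: (parseInt(flags, 16) & 0x1) === 0x1,
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx.traceId);  // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(ctx.sampled);  // true
```

Every service that receives this header creates its spans with the same `traceId` and points them at `parentSpanId`, which is what lets the backend reassemble the tree shown above.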

OpenTelemetry: Vendor-Neutral Instrumentation

OpenTelemetry (OTel) is the CNCF standard for generating telemetry data. It's vendor-neutral — instrument once, send data to any backend (Jaeger, Tempo, Datadog, New Relic):

// Node.js — OpenTelemetry auto-instrumentation setup
// tracing.js — Load this BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'payment-service',
    [ATTR_SERVICE_VERSION]: '1.4.2',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments: HTTP, Express, pg, mysql, redis, grpc, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry tracing initialized');

Auto-instrumentation captures HTTP requests, database queries, cache calls, and gRPC calls without modifying your application code.
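
To make sure the SDK is loaded before any other module, preload it with Node's `--require` flag (`app.js` here is a placeholder for your actual entry point):

```shell
# Preload the tracing setup so auto-instrumentation can patch modules
# before the application requires them (app.js is a placeholder name)
node --require ./tracing.js app.js
```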

Manual Instrumentation for Custom Spans

Sometimes you need to trace business logic that auto-instrumentation can't capture:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('payment-service');

async function processPayment(order) {
  // Create a custom span for the payment workflow
  return tracer.startActiveSpan('process-payment', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.amount', order.totalAmount);
      span.setAttribute('payment.method', order.paymentMethod);

      // Nested span for the fraud check
      const fraudResult = await tracer.startActiveSpan('fraud-check', async (fraudSpan) => {
        const result = await fraudService.checkTransaction(order);
        fraudSpan.setAttribute('fraud.score', result.score);
        fraudSpan.setAttribute('fraud.decision', result.decision);
        fraudSpan.end();
        return result;
      });

      if (fraudResult.decision === 'REJECT') {
        // Recorded as an exception in the catch block below
        throw new Error('Payment rejected by fraud check');
      }

      // Nested span for the payment gateway call
      const chargeResult = await tracer.startActiveSpan('charge-gateway', async (chargeSpan) => {
        chargeSpan.setAttribute('gateway', 'stripe');
        const result = await stripe.charges.create({ amount: order.totalAmount });
        chargeSpan.setAttribute('charge.id', result.id);
        chargeSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return chargeResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // End the span exactly once, on both success and failure paths
      span.end();
    }
  });
}

Jaeger for Trace Visualization

Jaeger (created by Uber) is one of the most widely used open-source tracing backends. Deploy it on Kubernetes:

# jaeger-deployment.yaml — All-in-one for development
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.54
          ports:
            - containerPort: 16686   # UI
            - containerPort: 4317    # OTLP gRPC
            - containerPort: 4318    # OTLP HTTP
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: "elasticsearch"
            - name: ES_SERVER_URLS
              value: "http://elasticsearch:9200"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
spec:
  ports:
    - name: ui
      port: 16686
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
  selector:
    app: jaeger

Open http://jaeger:16686 to search traces by service, operation, duration, or tags.
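
If the UI isn't exposed through an Ingress, a port-forward against the Service above works for local access (names match the manifest in this post):

```shell
# Forward the Jaeger UI from the cluster to localhost:16686
kubectl port-forward -n observability svc/jaeger 16686:16686
```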

OpenTelemetry Collector Architecture

The OTel Collector is a proxy that receives, processes, and exports telemetry data. It decouples your application from your backend:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Filter out health check spans to reduce noise
  filter:
    spans:
      exclude:
        match_type: strict
        attributes:
          - key: http.route
            value: /health

  # Add environment metadata to all spans
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel

service:
  pipelines:
    traces:
      receivers: [otlp]
      # batch goes last, as recommended for batching processors
      processors: [filter, resource, batch]
      exporters: [otlp/jaeger, otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Architecture:

  App → OTel SDK → OTel Collector ─→ Jaeger (traces)
                        │
                        ├─────────→ Prometheus (metrics)
                        └─────────→ Loki (logs)

  Processing inside the Collector:
  - Batching
  - Filtering
  - Sampling
  - Enrichment

The Collector gives you a single place to manage sampling, filtering, and routing without touching application code.

Trace Sampling Strategies

In high-traffic systems, tracing every request is expensive. Sampling strategies help:

Strategy        Description                                           Use Case
--------------  ----------------------------------------------------  ----------------------------------
Head-based      Decide at request start (e.g., 10% of requests)       Simple, predictable cost
Tail-based      Decide after request completes (keep errors + slow)   Better signal, higher resource use
Rate limiting   Max N traces per second per service                   Cost control
Priority        Always trace flagged requests (debug header)          On-demand debugging

# Tail-based sampling in OTel Collector
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: percentage
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

This configuration keeps all errors, all slow requests (>2s), and a 5% random sample of everything else — giving you great signal while controlling costs.
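
For contrast with tail-based sampling, a head-based probabilistic decision can be derived deterministically from the trace ID itself, so every service reaches the same verdict without coordination. The sketch below illustrates the idea (`shouldSample` is a simplified stand-in, not OpenTelemetry's exact `TraceIdRatioBasedSampler` algorithm):

```javascript
// Head-based sampling sketch: derive the decision from the trace ID so
// all services in a trace agree. Simplified illustration only.
function shouldSample(traceId, samplingRatio) {
  // Interpret the last 8 hex chars of the 32-char trace ID as a number
  const value = parseInt(traceId.slice(-8), 16);   // 0 .. 2^32 - 1
  const threshold = samplingRatio * 0xffffffff;
  return value < threshold;
}

// The same trace ID always yields the same decision:
const id = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(id, 1.0));  // true  — ratio 1.0 keeps this trace
console.log(shouldSample(id, 0.0));  // false — ratio 0 drops everything
```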

Correlating Logs, Metrics, and Traces

The real power comes from linking all three signals:

// Inject trace context into structured logs
const { trace } = require('@opentelemetry/api');
const pino = require('pino');

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const context = span.spanContext();
      return {
        trace_id: context.traceId,
        span_id: context.spanId,
      };
    }
    return {};
  },
});

// Now every log line includes trace_id and span_id:
// {"level":30,"trace_id":"abc123","span_id":"def456","msg":"Payment processed","order_id":"ORD-789"}

In Grafana, you can click a trace in Jaeger, jump to the corresponding logs in Loki, and see the related metrics in Prometheus — all linked by trace_id. This is the workflow that turns hours of debugging into minutes.
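
That trace-to-logs jump is configured in Grafana as a derived field on the Loki datasource. A provisioning sketch, assuming a Jaeger datasource whose UID is `jaeger-uid` (an example value to replace with your own):

```yaml
# grafana-datasources.yaml — provisioning sketch; uid values are examples
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract trace_id from JSON log lines and link it to Jaeger
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: jaeger-uid   # must match your Jaeger datasource UID
```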

Closing Note

Observability transforms how teams debug production issues. Instead of guessing which service is slow, you follow the trace. Instead of grepping logs across 20 services, you click a trace ID. In the next post, we'll explore Deployment Strategies — blue-green, canary, rolling updates, and feature flags — so your well-observed system also deploys safely.