Why Observability Stack
Tracing, metrics, and logs with OpenTelemetry
The Problem
Distributed systems fail in distributed ways. A request touches multiple services, each with its own logs. Finding the root cause means correlating events across systems.
Logs are not enough.
You can grep logs. You cannot grep causality. "Service B returned 500" does not tell you why, or what upstream service caused it, or which user request triggered the chain.
We needed observability that:
- Traces requests across service boundaries
- Correlates logs, metrics, and traces
- Uses vendor-neutral instrumentation
- Works with our Rust/TypeScript stack
Current Options
| Option | Description |
|---|---|
| Vendor SDKs (Datadog, New Relic) | Integrated platforms with proprietary instrumentation. |
| OpenTelemetry | Vendor-neutral standard for telemetry. |
| DIY (Prometheus + Jaeger + ELK) | Self-hosted stack from components. |
Future Outlook
OpenTelemetry is becoming the standard.
Vendor lock-in is over.
Instrument once with OpenTelemetry, export to any backend. Switch from Jaeger to Datadog to Honeycomb without changing application code. The collector is your adapter.
The three pillars are merging.
Traces, metrics, and logs are converging into a unified model. OpenTelemetry already handles all three. Backends are learning to correlate them automatically.
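Correlation in practice hinges on a shared trace ID: a log line that carries it can be joined to the matching span and metrics. A minimal sketch of the idea (hypothetical log shape, plain Rust, not an OTel API):

```rust
// Build a structured log line carrying the active trace context.
// Hypothetical shape for illustration; real emitters (tracing, OTel SDKs)
// attach these fields automatically.
fn log_with_trace(trace_id: &str, span_id: &str, msg: &str) -> String {
    format!(
        r#"{{"level":"info","trace_id":"{trace_id}","span_id":"{span_id}","msg":"{msg}"}}"#
    )
}

fn main() {
    let line = log_with_trace(
        "0af7651916cd43dd8448eb211c80319c",
        "b7ad6b7169203331",
        "User fetched",
    );
    // A backend can join this line to the span with the same trace_id.
    assert!(line.contains(r#""trace_id":"0af7651916cd43dd8448eb211c80319c""#));
    println!("{line}");
}
```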
Auto-instrumentation is improving.
Libraries and frameworks are adding OpenTelemetry support natively. The amount of manual instrumentation needed is decreasing.
Our Decision
✓ Why we chose this
- Vendor neutrality: Switch backends without changing instrumentation. No lock-in.
- Distributed tracing: Follow requests across services. See the full picture.
- Correlation: Trace IDs link logs, metrics, and spans. One ID, full context.
- Ecosystem: Rust (tracing), TypeScript, Python, and Go all support OpenTelemetry.
× Trade-offs we accept
- Setup complexity: More initial configuration than vendor SDKs.
- Collector operations: Running the OTel collector adds infrastructure.
- Language maturity: Some languages have better support than others.
Motivation
We run multiple services in Rust, TypeScript, and Python. A single user action might touch five services. When something fails, we need to know which service, which request, which line of code.
Vendor-specific instrumentation would mean different SDKs in each service. OpenTelemetry gives us one approach everywhere. Rust services use tracing with opentelemetry-rust. TypeScript services use @opentelemetry/sdk-node. Same trace context, same correlation.
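"Same trace context" works because every service propagates the W3C `traceparent` header: `version-traceid-parentid-flags`. As a minimal sketch of what the propagator does under the hood (plain Rust, handling only version `00`; real extraction is done by the OTel propagator):

```rust
// Parse a W3C `traceparent` header into (trace_id, parent_span_id, sampled).
// Sketch only: handles version 00 and does not validate hex digits.
fn parse_traceparent(header: &str) -> Option<(String, String, bool)> {
    let parts: Vec<&str> = header.split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None; // only version 00 handled here
    }
    let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
    if trace_id.len() != 32 || parent_id.len() != 16 {
        return None; // 16-byte trace ID, 8-byte span ID, hex-encoded
    }
    let sampled = u8::from_str_radix(flags, 16).ok()? & 0x01 == 1;
    Some((trace_id.to_string(), parent_id.to_string(), sampled))
}

fn main() {
    let h = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";
    let (trace_id, parent_id, sampled) = parse_traceparent(h).unwrap();
    assert_eq!(trace_id, "0af7651916cd43dd8448eb211c80319c");
    assert_eq!(parent_id, "b7ad6b7169203331");
    assert!(sampled);
    println!("trace={trace_id} parent={parent_id} sampled={sampled}");
}
```

Every service that forwards this header keeps the request in one trace, regardless of language.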
We export to Grafana Cloud (Tempo for traces, Loki for logs, Prometheus for metrics). But we can switch backends without touching application code. The instrumentation is ours; the visualization is replaceable.
Recommendation
Adopt OpenTelemetry for new services. The standard is mature enough for production.
Instrumentation strategy:
- Auto-instrument frameworks: Axum, Express, FastAPI have OpenTelemetry middleware
- Add manual spans for business-critical paths
- Include trace context in all logs
- Export metrics for SLIs (latency, error rate, throughput)
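The SLIs above reduce to simple arithmetic over request samples; a minimal sketch (plain Rust with illustrative function names, not an OTel metrics API; in production these come from exported metrics and Prometheus queries):

```rust
// Compute p99 latency (nearest-rank method) and error rate from raw samples.
// Illustrative only; these are normally derived from exported OTel metrics.
fn p99_ms(latencies: &mut Vec<u64>) -> u64 {
    latencies.sort_unstable();
    // nearest-rank percentile: the ceil(0.99 * n)-th smallest sample
    let rank = ((latencies.len() as f64) * 0.99).ceil() as usize;
    latencies[rank.saturating_sub(1)]
}

fn error_rate(total: u64, errors: u64) -> f64 {
    if total == 0 { 0.0 } else { errors as f64 / total as f64 }
}

fn main() {
    let mut lat: Vec<u64> = (1..=100).collect(); // 1..100 ms of samples
    assert_eq!(p99_ms(&mut lat), 99);
    assert_eq!(error_rate(200, 5), 0.025);
}
```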
Start with traces. They provide the most insight for distributed debugging. Add metrics for SLOs. Use structured logging with trace correlation.
For Rust, use the tracing crate with opentelemetry-rust. For TypeScript, use @opentelemetry/sdk-node with auto-instrumentations.
Run the OTel collector as a sidecar or central service. It handles batching, retry, and routing to backends.
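A collector configuration along these lines routes each signal to its backend (the endpoints and the Tempo/Prometheus exporter choices are illustrative; adjust to your backends):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}            # batch telemetry before export

exporters:
  otlp/tempo:          # traces to Grafana Tempo (illustrative endpoint)
    endpoint: tempo.example.internal:4317
  prometheusremotewrite:
    endpoint: https://prometheus.example.internal/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Swapping backends means editing this file, not the services.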
Examples
```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

pub fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // OTLP exporter sending spans to the collector
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://otel-collector:4317"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Bridge the `tracing` ecosystem to OpenTelemetry
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    tracing_subscriber::registry()
        .with(telemetry)
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(())
}

// Usage in handlers
#[tracing::instrument(skip(db))]
async fn get_user(db: &Database, user_id: UserId) -> Result<User, Error> {
    tracing::info!("Fetching user");
    let user = db.get(user_id).await?;
    tracing::info!(user_name = %user.name, "User fetched");
    Ok(user)
}
```

The Rust tracing crate integrates with OpenTelemetry: `#[tracing::instrument]` creates spans automatically, and logs emitted inside them carry the trace context.