OpenTelemetry · Infrastructure

Why an Observability Stack

Tracing, metrics, and logs with OpenTelemetry

v1.1 · 11 min read · Kenneth Pernyér
observability · opentelemetry · tracing · metrics · logging

The Problem

Distributed systems fail in distributed ways. A request touches multiple services, each with its own logs. Finding the root cause means correlating events across systems.

Logs are not enough.

You can grep logs. You cannot grep causality. "Service B returned 500" does not tell you why, or what upstream service caused it, or which user request triggered the chain.

We needed observability that:

  • Traces requests across service boundaries
  • Correlates logs, metrics, and traces
  • Uses vendor-neutral instrumentation
  • Works with our Rust/TypeScript stack
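The first two requirements rest on trace context propagation. As a rough illustration (stdlib-only, not SDK code; the helper names here are invented), this is the W3C `traceparent` header that OpenTelemetry carries between services, parsed on the way in and re-issued with a new span id on the way out:

```rust
/// Parsed `traceparent` header: the trace id is shared by every span in a
/// request; the span id identifies one hop. Sketch of the W3C Trace Context
/// format only -- real services use the SDK's propagators.
#[derive(Debug, PartialEq)]
struct TraceParent {
    trace_id: String, // 32 hex chars, constant across all services
    span_id: String,  // 16 hex chars, unique per span
    sampled: bool,
}

fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None; // only version 00 is defined today
    }
    let (trace_id, span_id, flags) = (parts[1], parts[2], parts[3]);
    if trace_id.len() != 32 || span_id.len() != 16 || flags.len() != 2 {
        return None;
    }
    let flags_byte = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        span_id: span_id.to_string(),
        sampled: (flags_byte & 1) == 1,
    })
}

/// Header a service sends downstream: same trace id, fresh span id.
fn make_traceparent(trace_id: &str, new_span_id: &str, sampled: bool) -> String {
    format!("00-{}-{}-{:02x}", trace_id, new_span_id, if sampled { 1u8 } else { 0 })
}

fn main() {
    let incoming = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";
    let ctx = parse_traceparent(incoming).expect("valid traceparent");
    // The downstream call reuses the trace id, so all spans correlate.
    let outgoing = make_traceparent(&ctx.trace_id, "00f067aa0ba902b7", ctx.sampled);
    println!("{outgoing}");
}
```

Because every hop reuses the same trace id, a backend can stitch the spans from five services back into one request timeline.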

Current Options

Vendor SDKs (Datadog, New Relic): integrated platforms with proprietary instrumentation.

  Pros:
  • Easy setup
  • Unified dashboards
  • Managed infrastructure
  • Good default visualizations

  Cons:
  • Vendor lock-in
  • Expensive at scale
  • Proprietary instrumentation

OpenTelemetry: vendor-neutral standard for telemetry.

  Pros:
  • Vendor-neutral
  • Wide ecosystem support
  • Single instrumentation, any backend
  • Active development (CNCF)

  Cons:
  • More setup than vendor SDKs
  • Collector complexity
  • Still maturing in some languages

DIY (Prometheus + Jaeger + ELK): self-hosted stack assembled from components.

  Pros:
  • Full control
  • No licensing costs
  • Proven components

  Cons:
  • Operational burden
  • Integration work
  • Scaling challenges

Future Outlook

OpenTelemetry is becoming the standard.

Vendor lock-in is ending.

Instrument once with OpenTelemetry, export to any backend. Switch from Jaeger to Datadog to Honeycomb without changing application code. The collector is your adapter.

The three pillars are merging.

Traces, metrics, and logs are converging into a unified model. OpenTelemetry already handles all three. Backends are learning to correlate them automatically.

Auto-instrumentation is improving.

Libraries and frameworks are adding OpenTelemetry support natively. The amount of manual instrumentation needed is decreasing.

Our Decision

Why we chose this

  • Vendor neutrality: Switch backends without changing instrumentation. No lock-in.
  • Distributed tracing: Follow requests across services. See the full picture.
  • Correlation: Trace IDs link logs, metrics, and spans. One ID, full context.
  • Ecosystem: Rust (tracing), TypeScript, Python, Go all support OpenTelemetry.

Trade-offs we accept

  • Setup complexity: More initial configuration than vendor SDKs.
  • Collector operations: Running the OTel Collector adds infrastructure.
  • Language maturity: Some languages have better support than others.

Motivation

We run multiple services in Rust, TypeScript, and Python. A single user action might touch five services. When something fails, we need to know which service, which request, which line of code.

Vendor-specific instrumentation would mean different SDKs in each service. OpenTelemetry gives us one approach everywhere. Rust services use tracing with opentelemetry-rust. TypeScript services use @opentelemetry/sdk-node. Same trace context, same correlation.

We export to Grafana Cloud (Tempo for traces, Loki for logs, Prometheus for metrics). But we can switch backends without touching application code. The instrumentation is ours; the visualization is replaceable.

Recommendation

Adopt OpenTelemetry for new services. The standard is mature enough for production.

Instrumentation strategy:

  1. Auto-instrument frameworks: Axum, Express, FastAPI have OpenTelemetry middleware
  2. Add manual spans for business-critical paths
  3. Include trace context in all logs
  4. Export metrics for SLIs (latency, error rate, throughput)
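To make step 4 concrete, here is a hand-rolled sketch of the latency and error-rate SLIs computed from raw request samples. Illustrative only: in production these numbers come from OpenTelemetry histogram and counter instruments, not manual math.

```rust
// One recorded request: how long it took, and whether it failed.
struct Sample {
    latency_ms: f64,
    is_error: bool,
}

/// p-th percentile latency by nearest rank over a sorted copy.
fn latency_percentile(samples: &[Sample], p: f64) -> f64 {
    let mut lat: Vec<f64> = samples.iter().map(|s| s.latency_ms).collect();
    lat.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * lat.len() as f64).ceil() as usize;
    lat[rank.saturating_sub(1)]
}

/// Fraction of requests that failed.
fn error_rate(samples: &[Sample]) -> f64 {
    let errors = samples.iter().filter(|s| s.is_error).count();
    errors as f64 / samples.len() as f64
}

fn main() {
    // 100 requests with latencies 1..=100 ms; every 50th one fails.
    let samples: Vec<Sample> = (1..=100)
        .map(|i| Sample { latency_ms: i as f64, is_error: i % 50 == 0 })
        .collect();
    println!("p99 latency: {} ms", latency_percentile(&samples, 99.0)); // 99 ms
    println!("error rate: {}", error_rate(&samples)); // 0.02
}
```

Throughput is just a counter over a time window; the two functions above cover the other two SLIs the list names.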

Start with traces. They provide the most insight for distributed debugging. Add metrics for SLOs. Use structured logging with trace correlation.
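What trace correlation buys you in logs can be shown without any SDK: if every line carries the active trace id, one filter reconstructs a request's full history across services. The field names below follow common conventions but are illustrative; real output would come from a JSON logging layer such as tracing-subscriber's, not hand-built strings.

```rust
/// Hand-built JSON log line for illustration only.
fn log_line(msg: &str, trace_id: &str, span_id: &str) -> String {
    format!(
        r#"{{"msg":"{}","trace_id":"{}","span_id":"{}"}}"#,
        msg, trace_id, span_id
    )
}

/// All log lines belonging to one trace -- the "one ID, full context" query.
fn lines_for_trace<'a>(logs: &'a [String], trace_id: &str) -> Vec<&'a str> {
    let needle = format!(r#""trace_id":"{}""#, trace_id);
    logs.iter()
        .filter(|l| l.contains(&needle))
        .map(|s| s.as_str())
        .collect()
}

fn main() {
    let logs = vec![
        log_line("Fetching user", "aaa1", "01"),
        log_line("Cache miss", "bbb2", "02"), // different request
        log_line("User fetched", "aaa1", "03"),
    ];
    // Every line for one request, regardless of which service emitted it:
    for line in lines_for_trace(&logs, "aaa1") {
        println!("{line}");
    }
}
```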

For Rust, use the tracing crate with opentelemetry-rust. For TypeScript, use @opentelemetry/sdk-node with auto-instrumentations.

Run the OTel collector as a sidecar or central service. It handles batching, retry, and routing to backends.
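A minimal Collector configuration, as a sketch of the batching-and-routing role described above: OTLP in over gRPC, a batch processor, and OTLP out with retry to a Tempo-style backend. The endpoint hostname is a placeholder.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # services export here

processors:
  batch:
    timeout: 5s                  # batch spans before export

exporters:
  otlp:
    endpoint: tempo.example.internal:4317  # placeholder backend
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping backends means changing the `exporters` section here; application code keeps pointing at the Collector.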

Examples

src/telemetry.rs (Rust)
use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

pub fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // OTLP exporter to collector
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://otel-collector:4317"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Connect tracing to OpenTelemetry
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);

    tracing_subscriber::registry()
        .with(telemetry)
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(())
}

// Usage in handlers
#[tracing::instrument(skip(db))]
async fn get_user(db: &Database, user_id: UserId) -> Result<User, Error> {
    tracing::info!("Fetching user");
    let user = db.get(user_id).await?;
    tracing::info!(user_name = %user.name, "User fetched");
    Ok(user)
}

The Rust tracing crate integrates with OpenTelemetry: #[instrument] creates a span around each call automatically, and log events emitted inside that span carry its trace context.

Stockholm, Sweden
