Why Observability Stack
Tracing, metrics, and logs with OpenTelemetry
The Problem
Distributed systems fail in distributed ways. A request touches multiple services, each with its own logs. Finding the root cause means correlating events across systems.
Logs are not enough.
You can grep logs. You cannot grep causality. "Service B returned 500" does not tell you why, or what upstream service caused it, or which user request triggered the chain.
We needed observability that:
- Traces requests across service boundaries
- Correlates logs, metrics, and traces
- Uses vendor-neutral instrumentation
- Works with our Rust/TypeScript stack
Current Options
| Option | Description |
|---|---|
| Vendor SDKs (Datadog, New Relic) | Integrated platforms with proprietary instrumentation. |
| OpenTelemetry | Vendor-neutral standard for telemetry. |
| DIY (Prometheus + Jaeger + ELK) | Self-hosted stack from components. |
Future Outlook
OpenTelemetry is becoming the standard.
Vendor lock-in is over.
Instrument once with OpenTelemetry, export to any backend. Switch from Jaeger to Datadog to Honeycomb without changing application code. The collector is your adapter.
The three pillars are merging.
Traces, metrics, and logs are converging into a unified model. OpenTelemetry already handles all three. Backends are learning to correlate them automatically.
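Correlation in practice hinges on a shared trace ID: a log line that carries it can be joined to the matching span and metrics. A minimal sketch of the idea (hypothetical log shape, plain Rust, not an OTel API):

```rust
// Build a structured log line carrying the active trace context.
// Hypothetical shape for illustration; real emitters (tracing, OTel SDKs)
// attach these fields automatically.
fn log_with_trace(trace_id: &str, span_id: &str, msg: &str) -> String {
    format!(
        r#"{{"level":"info","trace_id":"{trace_id}","span_id":"{span_id}","msg":"{msg}"}}"#
    )
}

fn main() {
    let line = log_with_trace(
        "0af7651916cd43dd8448eb211c80319c",
        "b7ad6b7169203331",
        "User fetched",
    );
    // A backend can join this line to the span with the same trace_id.
    assert!(line.contains(r#""trace_id":"0af7651916cd43dd8448eb211c80319c""#));
    println!("{line}");
}
```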
Auto-instrumentation is improving.
Libraries and frameworks are adding OpenTelemetry support natively. The amount of manual instrumentation needed is decreasing.
Our Decision
✓ Why we chose this
- Vendor neutrality: Switch backends without changing instrumentation. No lock-in.
- Distributed tracing: Follow requests across services. See the full picture.
- Correlation: Trace IDs link logs, metrics, and spans. One ID, full context.
- Ecosystem: Rust (tracing), TypeScript, Python, and Go all support OpenTelemetry.
× Trade-offs we accept
- Setup complexity: More initial configuration than vendor SDKs.
- Collector operations: Running the OTel collector adds infrastructure.
- Language maturity: Some languages have better support than others.
Motivation
We run multiple services in Rust, TypeScript, and Python. A single user action might touch five services. When something fails, we need to know which service, which request, which line of code.
Vendor-specific instrumentation would mean different SDKs in each service. OpenTelemetry gives us one approach everywhere. Rust services use tracing with opentelemetry-rust. TypeScript services use @opentelemetry/sdk-node. Same trace context, same correlation.
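"Same trace context" works because every service propagates the W3C `traceparent` header: `version-traceid-parentid-flags`. As a minimal sketch of what the propagator does under the hood (plain Rust, handling only version `00`; real extraction is done by the OTel propagator):

```rust
// Parse a W3C `traceparent` header into (trace_id, parent_span_id, sampled).
// Sketch only: handles version 00 and does not validate hex digits.
fn parse_traceparent(header: &str) -> Option<(String, String, bool)> {
    let parts: Vec<&str> = header.split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None; // only version 00 handled here
    }
    let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
    if trace_id.len() != 32 || parent_id.len() != 16 {
        return None; // 16-byte trace ID, 8-byte span ID, hex-encoded
    }
    let sampled = u8::from_str_radix(flags, 16).ok()? & 0x01 == 1;
    Some((trace_id.to_string(), parent_id.to_string(), sampled))
}

fn main() {
    let h = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";
    let (trace_id, parent_id, sampled) = parse_traceparent(h).unwrap();
    assert_eq!(trace_id, "0af7651916cd43dd8448eb211c80319c");
    assert_eq!(parent_id, "b7ad6b7169203331");
    assert!(sampled);
    println!("trace={trace_id} parent={parent_id} sampled={sampled}");
}
```

Every service that forwards this header keeps the request in one trace, regardless of language.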
We export to Grafana Cloud (Tempo for traces, Loki for logs, Prometheus for metrics). But we can switch backends without touching application code. The instrumentation is ours; the visualization is replaceable.
Recommendation
Adopt OpenTelemetry for new services. The standard is mature enough for production.
Instrumentation strategy:
- Auto-instrument frameworks: Axum, Express, FastAPI have OpenTelemetry middleware
- Add manual spans for business-critical paths
- Include trace context in all logs
- Export metrics for SLIs (latency, error rate, throughput)
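The SLIs above reduce to simple arithmetic over request samples; a minimal sketch (plain Rust with illustrative function names, not an OTel metrics API; in production these come from exported metrics and Prometheus queries):

```rust
// Compute p99 latency (nearest-rank method) and error rate from raw samples.
// Illustrative only; these are normally derived from exported OTel metrics.
fn p99_ms(latencies: &mut Vec<u64>) -> u64 {
    latencies.sort_unstable();
    // nearest-rank percentile: the ceil(0.99 * n)-th smallest sample
    let rank = ((latencies.len() as f64) * 0.99).ceil() as usize;
    latencies[rank.saturating_sub(1)]
}

fn error_rate(total: u64, errors: u64) -> f64 {
    if total == 0 { 0.0 } else { errors as f64 / total as f64 }
}

fn main() {
    let mut lat: Vec<u64> = (1..=100).collect(); // 1..100 ms of samples
    assert_eq!(p99_ms(&mut lat), 99);
    assert_eq!(error_rate(200, 5), 0.025);
}
```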
Start with traces. They provide the most insight for distributed debugging. Add metrics for SLOs. Use structured logging with trace correlation.
For Rust, use the tracing crate with opentelemetry-rust. For TypeScript, use @opentelemetry/sdk-node with auto-instrumentations.
Run the OTel collector as a sidecar or central service. It handles batching, retry, and routing to backends.
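A collector configuration along these lines routes each signal to its backend (the endpoints and the Tempo/Prometheus exporter choices are illustrative; adjust to your backends):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}            # batch telemetry before export

exporters:
  otlp/tempo:          # traces to Grafana Tempo (illustrative endpoint)
    endpoint: tempo.example.internal:4317
  prometheusremotewrite:
    endpoint: https://prometheus.example.internal/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

Swapping backends means editing this file, not the services.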
Examples
```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

pub fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // OTLP exporter sending spans to the collector
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://otel-collector:4317"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    // Bridge the `tracing` ecosystem to OpenTelemetry
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    tracing_subscriber::registry()
        .with(telemetry)
        .with(tracing_subscriber::fmt::layer())
        .init();

    Ok(())
}

// Usage in handlers
#[tracing::instrument(skip(db))]
async fn get_user(db: &Database, user_id: UserId) -> Result<User, Error> {
    tracing::info!("Fetching user");
    let user = db.get(user_id).await?;
    tracing::info!(user_name = %user.name, "User fetched");
    Ok(user)
}
```

The Rust tracing crate integrates with OpenTelemetry: `#[tracing::instrument]` creates spans automatically, and logs emitted inside them carry the trace context.