A critical payment service processing 15,000 daily orders went dark in DataDog last year. The service ran. Logs showed successful transactions. But the APM dashboard was empty.
For three days, the team debugged blind. Was the payment gateway slow? Were retries happening? Where was latency hiding? Without traces, they resorted to print statements in production and log tailing.
The breakthrough came when someone asked: "Can we run this locally and watch traces leave the application?"
They couldn't. DataDog requires cloud connectivity. The local agent needs an API key and phones home. No way to intercept telemetry without a DataDog account, and staging keys hit rate limits that made local testing impractical.
The team built a local stack that accepts ddtrace telemetry and routes it to open-source backends. Within an hour, they found the bug:
```yaml
filter/health:
  traces:
    span:
      - 'attributes["http.target"] =~ "/.*"'   # Matched EVERYTHING
```
The rule was meant to drop only /health spans, but the =~ regex match against "/.*" matched every span, so the filter dropped all of them. A one-character fix, changing =~ to ==, and traces reappeared in production.
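The rule the team actually wanted drops only health-check spans. A sketch of that version, assuming the OTTL-based filter processor and a health endpoint at exactly /health:

```yaml
processors:
  filter/health:
    traces:
      span:
        # Spans matching a condition are dropped; this now matches only the health check.
        - 'attributes["http.target"] == "/health"'
```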
Why three days for a one-character bug? No visibility into what the collector actually did. The config looked fine. The collector reported healthy. Logs showed "traces exported successfully" from other services. Without isolating their service's telemetry, they were guessing.
The Local Stack
Point your ddtrace-instrumented app at localhost:8126. The OpenTelemetry Collector receives DataDog-format traces, converts to OTLP, and exports to Grafana Tempo. Your application thinks it's talking to a DataDog agent. No code changes required.
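A minimal collector config for that pipeline might look like the sketch below, assuming the contrib distribution's datadog receiver and a Tempo instance reachable at tempo:4317 (both names are illustrative):

```yaml
receivers:
  datadog:
    endpoint: 0.0.0.0:8126      # the port a DataDog agent would normally listen on

exporters:
  otlp:
    endpoint: tempo:4317        # Tempo's OTLP gRPC intake
    tls:
      insecure: true            # fine locally; not for production

service:
  pipelines:
    traces:
      receivers: [datadog]
      exporters: [otlp]
```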
The stack is now used across three teams: the original payments team, logistics (who had similar "missing traces" issues), and platform (for testing collector configs before production).
When This Matters
Use it to verify that ddtrace instrumentation works before deploying, to debug why traces aren't appearing in production, or to test collector configurations locally (wiring sketched in the compose file below).
Don't use it for new projects (use OpenTelemetry native instrumentation instead) or if you need DataDog-specific features like service maps or RUM.
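When it does fit, pointing a service at the stack is just an environment change. A compose-file sketch, with image tags, file paths, and service names as assumptions:

```yaml
services:
  app:
    build: .
    environment:
      DD_AGENT_HOST: otel-collector   # ddtrace ships spans here instead of a real agent
      DD_TRACE_AGENT_PORT: "8126"
    depends_on: [otel-collector]

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "8126:8126"                   # DataDog-format trace intake

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```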
Why Tempo over Jaeger? Tempo integrates with Grafana's Explore view, enabling bidirectional log-trace correlation: click from a log line to its trace, or from a trace back to its logs. Essential for debugging.
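The trace-to-log side of that correlation is configured on the Tempo datasource. A rough provisioning sketch, assuming logs are queried through a Loki datasource with UID loki (the UID and URLs are illustrative):

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki     # logs datasource opened by "logs for this span"
        filterByTraceID: true   # narrow the log query to the selected trace
```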
The Broader Pattern
This highlights an architectural tension in modern observability: cloud-scale insights versus local development practicality. When your debugging tool requires cloud connectivity, you can't test in isolation. That's fine until it isn't.
The repository (github.com/LukaszGrochal/demo-repo-otel-stack) includes working examples and the collector config that caused the original incident. Worth studying if you run DataDog at scale.
Three teams avoided similar multi-day debugging sessions because they could now test locally. That's the real value: finding config bugs in minutes, not days.