Distributed Tracing in Go: Finding the Needle in the Microservice Haystack


Go & Backend · October 12, 2025 · 22 min read

Running a web of Go services taught us that logs alone do not tell the full story. Distributed tracing stitched requests together, let us spot bottlenecks faster, and gave on-call engineers a calmer playbook.

The Debug Session That Broke Our Team

Picture a familiar incident: production alarms fire, dashboards show errors, and the failing request passes through a dozen services with incompatible log formats.

Every team owns its own logging style, timestamps are out of sync, and correlating events means scraping Kibana, Splunk, and raw stdout. Hours disappear while everyone tries to rebuild a single timeline.

That was the moment we decided to stop doing log archaeology and ship distributed tracing.

Once traces were available, incident responders could jump straight to a shared view of the request path. Multi-hour hunts turned into targeted investigations measured in minutes.

What Distributed Tracing Actually Solves

Before tracing, our debugging workflow looked like this:

  1. Find the failing request ID in the API gateway logs
  2. Search for that ID in the auth service logs
  3. Search for that ID in the user service logs
  4. Search for that ID in the payment service logs
  5. Repeat across every downstream service
  6. Try to reconstruct the timeline manually
  7. Give up and restart everything

After tracing:

  1. Search for trace ID in Grafana Tempo
  2. See the entire request flow in one view
  3. Identify the bottleneck immediately
  4. Fix the actual problem

OpenTelemetry Setup: The Foundation

We use OpenTelemetry (OTEL) as our tracing standard. Here's the basic setup:

package tracing

import (
    "context"
    "log"
    "os"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func InitTracer(serviceName string) func() {
    // Create OTLP exporter that pushes spans to Grafana Tempo
    client := otlptracehttp.NewClient(
        otlptracehttp.WithEndpoint(os.Getenv("TEMPO_ENDPOINT")),
        otlptracehttp.WithInsecure(),
    )
    exporter, err := otlptrace.New(context.Background(), client)
    if err != nil {
        log.Fatalf("Failed to create OTLP exporter: %v", err)
    }

    // Create resource with service information
    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName(serviceName),
        semconv.ServiceVersion(os.Getenv("SERVICE_VERSION")),
        semconv.DeploymentEnvironment(os.Getenv("ENVIRONMENT")),
    )

    // Create trace provider
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(res),
        trace.WithSampler(trace.TraceIDRatioBased(getSamplingRatio())),
    )

    // Set global trace provider
    otel.SetTracerProvider(tp)

    // Set global propagator (for cross-service tracing)
    otel.SetTextMapPropagator(
        propagation.NewCompositeTextMapPropagator(
            propagation.TraceContext{},
            propagation.Baggage{},
        ),
    )

    // Return cleanup function
    return func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Printf("Error shutting down tracer provider: %v", err)
        }
    }
}

func getSamplingRatio() float64 {
    switch os.Getenv("ENVIRONMENT") {
    case "production":
        return 0.1 // Sample 10% in production
    case "staging":
        return 0.5 // Sample 50% in staging
    default:
        return 1.0 // Sample 100% in development
    }
}
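
For completeness, here's a minimal sketch of how a service could call this at startup. The import path and service name are illustrative, not taken from our codebase:

package main

import (
    "log"
    "net/http"

    "example.com/yourapp/internal/tracing" // hypothetical path to the package above
)

func main() {
    // Initialize the tracer once at startup; the returned function flushes
    // buffered spans and shuts down the exporter when the service exits.
    cleanup := tracing.InitTracer("checkout-service")
    defer cleanup()

    log.Println("starting checkout-service")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        log.Fatal(err)
    }
}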

HTTP Instrumentation Playbook

Instead of pasting hundreds of lines, here's the checklist we actually follow, with a minimal sketch after the list:

  • Server: wrap handlers with `otelhttp.NewHandler` or a small middleware that extracts headers, starts a span, and records status/errors.
  • Client: use `otelhttp.NewTransport` so every `Do` call propagates trace context and produces child spans automatically.
  • Metadata: add user agent, endpoint, tenant IDs sparingly to avoid high-cardinality attribute explosions.
  • Result: cross-service requests show up in a single trace without custom plumbing in every handler.
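
The sketch below shows both sides of that checklist, assuming the stock `otelhttp` package from opentelemetry-go-contrib; the route, port, and package name are placeholders. Both pieces rely on the global tracer provider set by `InitTracer` above.

package web

import (
    "io"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newServer wraps the mux so every incoming request gets a span with the
// extracted trace context and the response status recorded on it.
func newServer() http.Handler {
    mux := http.NewServeMux()
    mux.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
        io.WriteString(w, "ok")
    })
    return otelhttp.NewHandler(mux, "http.server")
}

// newClient returns a client whose outgoing requests carry trace headers and
// show up as child spans automatically.
func newClient() *http.Client {
    return &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
}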

Database Tracing: Finding the Slow Queries

We keep database tracing light:

  • Lean on existing instrumentation (`otelsql`, pgx, GORM, etc.) instead of reinventing a wrapped driver.
  • Record just enough context—operation type, shard/tenant, business identifier—and skip full SQL bodies in production.
  • Mark spans as errors when queries fail so Tempo highlights hot spots.

That combination makes it obvious which query or shard slowed down a request.
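
To make that concrete, here's a hedged sketch of the lightweight approach using the core OTEL API rather than any particular driver wrapper; the tracer name, attribute keys, and query are illustrative:

package store

import (
    "context"
    "database/sql"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

// queryEmailTraced wraps a single query in a span, recording the operation
// type and a business identifier instead of the full SQL text.
func queryEmailTraced(ctx context.Context, db *sql.DB, userID string) (string, error) {
    tracer := otel.Tracer("user-store")
    ctx, span := tracer.Start(ctx, "db.user.get_by_id")
    defer span.End()

    span.SetAttributes(
        attribute.String("db.operation", "select"), // operation type, not the SQL body
        attribute.String("app.user_id", userID),    // business identifier
    )

    var email string
    err := db.QueryRowContext(ctx, "SELECT email FROM users WHERE id = $1", userID).Scan(&email)
    if err != nil {
        // Mark the span as failed so Tempo highlights it.
        span.RecordError(err)
        span.SetStatus(codes.Error, "user lookup failed")
        return "", err
    }
    return email, nil
}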

gRPC Tracing: Service-to-Service Communication

gRPC needs almost no custom code when you rely on `otelgrpc`:

  • Add `otelgrpc.UnaryServerInterceptor`/`UnaryClientInterceptor` (and streaming variants) to servers and clients.
  • Tempo captures service and method names automatically, so bottlenecks between gateway → auth → billing stand out.
  • gRPC status codes map to span status, turning retry storms into bright red spans.

With that in place every hop shows up without bespoke interceptors.
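
For reference, a minimal sketch of wiring those interceptors; the address and credentials are placeholders, and recent otelgrpc releases also offer a stats-handler API, but the interceptor form below matches the bullets above:

package grpcwiring

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
)

// newServer gives every unary and streaming RPC a span with the
// service/method name and the gRPC status code.
func newServer() *grpc.Server {
    return grpc.NewServer(
        grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
        grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
    )
}

// newClientConn makes outgoing calls propagate trace context and appear as
// child spans of the caller's span.
func newClientConn(addr string) (*grpc.ClientConn, error) {
    return grpc.Dial(addr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
        grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
    )
}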

Custom Spans: Business Logic Tracing

Technical spans are useful, but business spans are what unlock fast incident response:

  • Name spans after domain actions (`payment.process`, `fraud.check`, `invoice.render`) so on-call engineers know what they are looking at.
  • Attach compact business attributes—amount, plan, region—so Tempo filters traces by customer context.
  • Tag errors with the stage (`error.stage=fraud_check`) to jump straight to the failing branch.
  • Break long operations into child spans to highlight where time disappears.

A trace becomes an interactive runbook instead of a stack trace.
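
Here's an abridged sketch of what that looks like with the core API; the tracer name, attribute keys, and fraud-check helper are illustrative:

package payments

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func ProcessPayment(ctx context.Context, orderID string, amount float64, plan string) error {
    tracer := otel.Tracer("payments")

    // Parent span named after the domain action, with compact business attributes.
    ctx, span := tracer.Start(ctx, "payment.process")
    defer span.End()
    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.Float64("payment.amount", amount),
        attribute.String("payment.plan", plan),
    )

    // Child span for a stage that tends to dominate latency.
    ctx, fraudSpan := tracer.Start(ctx, "fraud.check")
    err := runFraudCheck(ctx, orderID) // illustrative helper
    if err != nil {
        fraudSpan.RecordError(err)
        fraudSpan.SetStatus(codes.Error, "fraud check failed")
        fraudSpan.End()

        // Tag the parent with the failing stage so on-call can jump straight to it.
        span.SetAttributes(attribute.String("error.stage", "fraud_check"))
        span.SetStatus(codes.Error, "payment rejected")
        return err
    }
    fraudSpan.End()

    return nil
}

func runFraudCheck(ctx context.Context, orderID string) error { return nil } // placeholder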

Grafana Tempo Explorer: Making Sense of Traces

The real magic happens once traces land in Grafana Tempo and we open them in Explore. Here's what we look for:

Trace Analysis Patterns:

  1. Long traces - Total duration > 5 seconds usually indicates problems
  2. Wide traces - Too many parallel calls can indicate N+1 problems
  3. Deep traces - Too many service hops suggest architecture issues
  4. Error patterns - Services that frequently fail downstream calls
  5. Bottleneck identification - Single spans taking 80%+ of total time

Production Debugging Workflow:

  • Before: grep through logs and dashboards to reconstruct timelines by hand.
  • Now: open Tempo, search by `trace_id`, click the red span, and review the full context in one place.

Production Deployment and Configuration

The production playbook stays intentionally boring:

  • Services bootstrap the tracer on startup and reuse the same HTTP/gRPC/DB wrappers.
  • Graceful shutdown waits a few seconds so the exporter flushes batches to Tempo.
  • Configuration lives in environment variables: 100% sampling locally, single-digit percentages in production.

Automation keeps everyone honest—code reviews fail if a service ships without tracing middleware.
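
The flush-on-shutdown bullet deserves one concrete detail: bound the shutdown with a timeout so an unreachable Tempo can't hang the process. A minimal sketch, assuming the tracer provider from the setup section is in scope:

package tracing

import (
    "context"
    "log"
    "time"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// shutdownTracer flushes any buffered spans but gives up after a few seconds,
// so a slow or unreachable Tempo cannot block process exit.
func shutdownTracer(tp *sdktrace.TracerProvider) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    if err := tp.Shutdown(ctx); err != nil {
        log.Printf("tracer shutdown did not complete cleanly: %v", err)
    }
}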

Security Considerations

Security warning: distributed tracing can expose sensitive data. Always follow these security practices:

1. Sanitize Sensitive Data

  • Mask anything that looks like a secret: tokens, card numbers, personal data.
  • Use an allowlist so only known-safe attributes get through; everything else becomes `[REDACTED]` (a minimal sketch follows).
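
A minimal sketch of the allowlist idea; the key list is illustrative, not our production set:

package tracing

import "go.opentelemetry.io/otel/attribute"

// allowedAttrKeys is the illustrative allowlist; anything not listed is redacted.
var allowedAttrKeys = map[attribute.Key]bool{
    "http.method":  true,
    "http.route":   true,
    "db.operation": true,
    "payment.plan": true,
}

// SafeAttributes keeps known-safe attributes and replaces the value of
// everything else with a redaction marker before it reaches a span.
func SafeAttributes(attrs ...attribute.KeyValue) []attribute.KeyValue {
    out := make([]attribute.KeyValue, 0, len(attrs))
    for _, kv := range attrs {
        if allowedAttrKeys[kv.Key] {
            out = append(out, kv)
            continue
        }
        out = append(out, attribute.String(string(kv.Key), "[REDACTED]"))
    }
    return out
}

Handlers then call span.SetAttributes(SafeAttributes(attrs...)...) instead of attaching attributes directly.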

2. Secure Trace Storage

  • Tempo sits behind Grafana with TLS and authenticated dashboards.
  • Keep traces for days, not months, unless compliance says otherwise.

3. Access Control

Teams see traces for the services they own; shared dashboards expose technical metadata, not raw payloads.

4. Compliance Considerations

  • GDPR: be able to delete traces by user identifier.
  • PCI DSS: never ship full PAN/CVV to telemetry.
  • HIPAA: encrypt storage and audit access.

Testing Strategy

Tracing is only useful if tests keep it honest. Here's what we validate:

  • Unit: middleware propagates context and child spans point at their parent (see the sketch after this list).
  • Integration: a multi-service test spins up Tempo, sends a request, and asserts the expected spans exist.
  • Performance: benchmarks compare with/without tracing to track overhead.
  • Chaos: simulate Tempo being down; services should degrade gracefully.
  • Lint: a custom check ensures we don't add secret or high-cardinality attributes.
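
For the unit-level check, here's a sketch of the pattern we lean on, using the SDK's in-memory tracetest span recorder; the span names are placeholders:

package tracing_test

import (
    "context"
    "testing"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/otel/sdk/trace/tracetest"
)

func TestChildSpanPointsAtParent(t *testing.T) {
    // Record spans in memory instead of exporting them.
    recorder := tracetest.NewSpanRecorder()
    tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(recorder))
    tracer := tp.Tracer("test")

    ctx, parent := tracer.Start(context.Background(), "payment.process")
    _, child := tracer.Start(ctx, "fraud.check")
    child.End()
    parent.End()

    spans := recorder.Ended()
    if len(spans) != 2 {
        t.Fatalf("expected 2 spans, got %d", len(spans))
    }
    // Ended spans come out in end order: child first, then parent.
    if spans[0].Parent().SpanID() != spans[1].SpanContext().SpanID() {
        t.Fatalf("child span is not linked to its parent")
    }
}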

Performance Impact and Optimization

Tracing has overhead. Here's how we minimize it:

What to watch:

  • Latency: per-span cost is usually sub-millisecond, but measure in your environment and adjust sampling if it spikes.
  • Memory: traces are small, yet long-lived spans or baggage can accumulate—keep an eye on allocations.
  • Network: exporters batch spans, though the collector still receives a steady stream; budget for it.
  • CPU: serialization costs are modest at low sampling rates, but heavy instrumentation can become noticeable.

Optimization Strategies:

  • Adjust sampling dynamically: single-digit percentages in production, 100% in dev, with the ability to spike it temporarily during incidents.
  • Skip noisy operations (health checks, trivial queries) so quota goes to useful spans (see the filter sketch after this list).
  • Export in batches and fail fast when Tempo is unavailable to keep the main request path responsive.
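
As an example of skipping noisy operations, here's a sketch using otelhttp's request filter so health checks never produce spans; the paths are placeholders, and most instrumentation libraries offer equivalent filtering:

package web

import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newTracedHandler serves filtered requests normally but never starts a span
// for them, so the sampling budget is spent on traffic worth looking at.
func newTracedHandler(mux http.Handler) http.Handler {
    return otelhttp.NewHandler(mux, "http.server",
        otelhttp.WithFilter(func(r *http.Request) bool {
            return r.URL.Path != "/healthz" && r.URL.Path != "/metrics"
        }),
    )
}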

Where Tracing Paid Off

With traces available, the same few incident patterns kept showing up—and became much easier to resolve:

Case 1: The Mysterious Timeout

  • Symptom: random spikes to 30 seconds on the checkout endpoint.
  • Trace view: one branch of the request sat idle while the dependency pool filled up.
  • Root cause: a rarely hit code path forgot to release database connections.
  • Resolution time: about 15 minutes once we pulled the trace, versus multi‑hour log spelunking before.

Case 2: The Performance Regression

  • Symptom: p95 latency drifting from ~200 ms to ~2 s after a release.
  • Trace view: a group of spans ballooned in the profile service.
  • Root cause: an N+1 query introduced in the latest merge.
  • Resolution time: under 30 minutes; previously we would bisect builds for half a day.

Case 3: The Ghost Error

  • Symptom: a few percent of requests returned 500 with no obvious log trail.
  • Trace view: retries cascading between services until a fragile fraud check finally threw.
  • Root cause: swallowed exceptions that left spans marked OK while the caller still failed.
  • Resolution time: roughly 10 minutes after spotting the trace.

The Hard-Won Lessons

Keeping tracing useful is mostly about people and process:

  • Instrumentation must live in the definition of done; otherwise spans drift away from reality.
  • Moderate sampling (single-digit percentages for production) balances visibility with cost.
  • Traces, metrics, and logs belong in one shared dashboard so incident response starts from the same facts.
  • Domain-specific spans—"prepare invoice", "charge card"—make runbooks actionable. Purely technical spans rarely help product teams.

The Bottom Line

Before tracing, root-cause hunts routinely took around four hours and optimism was low. After rolling it out across services, the same class of incidents typically closes in about 15 minutes because the team can jump straight to a shared timeline.

P.S.

Distributed tracing did not replace thoughtful logging, but it removed the guesswork. These days we open Grafana Tempo first and only dive into raw logs for supporting detail.
