Implementing Full-Stack Trace Correlation from a Phoenix-Served Frontend to PostgreSQL with OpenTelemetry and Zipkin


The performance degradation was subtle but undeniable. Our real-time analytics dashboard, featuring complex data visualizations driven by a heavily-styled SCSS component system, exhibited intermittent sluggishness. User reports were vague: “it just feels slow sometimes.” The frontend team suspected API latency. The backend team pointed to potential database bottlenecks in PostgreSQL. We were chasing ghosts with isolated logging, leading to unproductive cycles of blame and speculation. The core problem was a lack of unified context. An individual log line from Phoenix or a slow query log from PostgreSQL was meaningless without knowing the originating user action that triggered it.

Our initial attempt involved manually passing a request ID through every layer. This proved to be a maintenance nightmare, polluting business logic with observability concerns. It was clear we needed a proper distributed tracing solution. The decision was to adopt OpenTelemetry for instrumentation due to its vendor-neutral approach, allowing us to connect the JavaScript frontend, the Elixir/Phoenix backend, and the Ecto database layer into a single, cohesive trace. We chose Zipkin as the backend for its operational simplicity and proven track record. The goal was not just to see traces, but to achieve deep correlation: linking a specific trace ID to every structured log message generated during a request’s lifecycle.

Phase 1: Backend Instrumentation with OpenTelemetry

The foundation of this architecture is laid in the Phoenix application. In a real-world project, you cannot afford to have instrumentation code scattered loosely. It needs to be configured centrally and managed through the application’s lifecycle.

First, we add the necessary OpenTelemetry and Zipkin dependencies to mix.exs.

# mix.exs
def deps do
  [
    # ... other deps
    {:opentelemetry_api, "~> 1.2"},
    {:opentelemetry, "~> 1.2"},
    {:opentelemetry_exporter, "~> 1.2"},
    {:opentelemetry_zipkin, "~> 1.1"},
    # Instrumentation for Phoenix and Ecto
    {:opentelemetry_phoenix, "~> 1.2"},
    {:opentelemetry_ecto, "~> 1.1"}
  ]
end

The next critical step is to attach the Phoenix and Ecto instrumentation when the application starts; the SDK itself and the Zipkin exporter are configured through the application environment, which we cover below. A common mistake is to scatter this setup logic directly through application.ex, which quickly becomes cluttered. A better approach is to create a dedicated module, MyApp.Telemetry, and invoke it from the application’s start/2 callback.

# lib/my_app/telemetry.ex
defmodule MyApp.Telemetry do
  @moduledoc """
  Attaches the OpenTelemetry instrumentation for Phoenix and Ecto.
  """
  require Logger

  def setup do
    # The SDK itself (tracer provider, batch processor, Zipkin exporter) is
    # configured in config/config.exs and started with the :opentelemetry
    # application. Here we only attach the instrumentation libraries, which
    # hook into the telemetry events emitted by the Endpoint and the Repo.
    OpentelemetryPhoenix.setup()

    # The event prefix must match the Repo's telemetry_prefix.
    OpentelemetryEcto.setup([:my_app, :repo])

    Logger.info("OpenTelemetry instrumentation attached for Phoenix and Ecto.")
  end
end

# lib/my_app/application.ex
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # Set up telemetry first, so it's available for everything else.
    MyApp.Telemetry.setup()

    children = [
      # ... other children like Repo and Endpoint
      MyApp.Repo,
      MyApp.Endpoint
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Now, we configure the batch processor and the Zipkin exporter in config/config.exs. In production you do not want to export every span the moment it ends; the batch processor queues spans and ships them periodically, which keeps network overhead down.

# config/config.exs
import Config

config :opentelemetry, :processors,
  otel_batch_processor: %{
    exporter:
      {:opentelemetry_zipkin,
       %{
         # In a real setup this block belongs in config/runtime.exs so the
         # values are read from the environment at boot. Do not hardcode URLs.
         address:
           "http://#{System.get_env("ZIPKIN_HOST", "localhost")}:#{System.get_env("ZIPKIN_PORT", "9411")}/api/v2/spans",
         local_endpoint: %{service_name: "my_app"}
       }},
    # Fine-tune these settings based on your load.
    max_queue_size: 2048,
    scheduled_delay_ms: 5000
  }

# Phoenix and Ecto are not configured here. Their instrumentation is attached
# at startup by MyApp.Telemetry.setup/0, which hooks OpentelemetryPhoenix and
# OpentelemetryEcto into the telemetry events emitted by the Endpoint and the
# Repo.

config :my_app, MyApp.Repo,
  telemetry_prefix: [:my_app, :repo]
  # ... other repo config

With this configuration, any incoming HTTP request to our Phoenix application will automatically generate a parent span, and any Ecto query made within that request’s context will generate a child span. This is the baseline, but the real value comes from correlating this trace data with our logs.
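
To make that concrete, here is a minimal controller sketch; the controller, route, and schema names are hypothetical and chosen only to illustrate the point. No tracing code appears in it, yet a request to it shows up in Zipkin as a Phoenix span with an Ecto child span.

# Illustrative sketch: the module and schema names below are hypothetical.
defmodule MyAppWeb.DashboardController do
  use MyAppWeb, :controller

  def data(conn, _params) do
    # No manual spans needed: OpentelemetryPhoenix opens a span for the
    # request, and this Repo call appears as a child Ecto span beneath it.
    metrics = MyApp.Repo.all(MyApp.Analytics.Metric)

    json(conn, %{count: length(metrics)})
  end
end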

Phase 2: Injecting Trace Context into Structured Logs

A trace is useful, but sifting through thousands of them to find a problem is inefficient. The goal is to be able to look at a single error log message and immediately have the trace_id to pull up the entire execution context in Zipkin. This requires modifying Elixir’s default logger.

We’ll define a custom logger format that inspects the OpenTelemetry context and injects the trace and span IDs.

# lib/my_app/logger/trace_formatter.ex
defmodule MyApp.Logger.TraceFormatter do
  @moduledoc """
  A custom Logger formatter that injects OpenTelemetry trace context.
  """
  require OpenTelemetry.Tracer

  # This pattern ensures we don't pollute logs with empty values when no trace is active.
  def format(level, message, timestamp, metadata) do
    trace_context =
      case OpenTelemetry.Tracer.current_span_ctx() do
        :undefined ->
          ""

        ctx ->
          # Trace and span IDs are plain integers in the Erlang/Elixir API.
          # Render them as the zero-padded lowercase hex that Zipkin displays.
          trace_id_hex = ctx |> OpenTelemetry.Span.trace_id() |> to_hex(32)
          span_id_hex = ctx |> OpenTelemetry.Span.span_id() |> to_hex(16)
          " [trace_id=#{trace_id_hex} span_id=#{span_id_hex}]"
      end

    # Build the final log string.
    "#{format_timestamp(timestamp)} #{level}#{trace_context} #{format_message(message)}#{format_metadata(metadata)}\n"
  end

  defp to_hex(id, width),
    do: id |> Integer.to_string(16) |> String.downcase() |> String.pad_leading(width, "0")

  # Logger timestamps arrive as {{year, month, day}, {hour, minute, second, millisecond}}.
  defp format_timestamp({{year, month, day}, {hour, minute, second, _ms}}) do
    :io_lib.format("~4..0B-~2..0B-~2..0B ~2..0B:~2..0B:~2..0B", [year, month, day, hour, minute, second])
    |> IO.iodata_to_binary()
  end

  defp format_message(message) do
    IO.iodata_to_binary(message)
  end

  defp format_metadata(metadata) do
    if Enum.empty?(metadata) do
      ""
    else
      # A simple metadata formatter. In a production system, you might want JSON.
      " " <> Enum.map_join(metadata, " ", fn {k, v} -> "#{k}=#{inspect(v)}" end)
    end
  end
end

Now, we replace the default console logger format in config/config.exs. One caveat worth knowing: reading the span context inside a formatter only works when formatting happens in the process that emitted the log message, which holds for Erlang's default :logger handler used by recent Elixir versions; with the legacy console backend you would instead copy the IDs into Logger metadata at the call site.

# config/config.exs

# ... existing configs ...

config :logger, :console,
  format: {MyApp.Logger.TraceFormatter, :format},
  metadata: [:request_id] # Keep other useful metadata

After restarting the server, a log entry for a request handled by an instrumented Phoenix endpoint will now look like this:

2023-10-27 11:35:10 info [trace_id=a1b2c3d4e5f6a7b8a1b2c3d4e5f6a7b8 span_id=f1e2d3c4b5a69876] GET /api/dashboard/data

This is a game-changer for debugging. An exception logger can now capture this trace_id, and we can instantly pivot from a single error to the full distributed trace.
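
The same IDs are useful outside the Logger pipeline as well, for example when attaching context to an exception tracker. A small helper along these lines, sketched here with a hypothetical module name and reusing the same API calls as the formatter above, keeps that lookup in one place:

# lib/my_app/trace_context.ex (hypothetical helper module)
defmodule MyApp.TraceContext do
  require OpenTelemetry.Tracer

  # Returns the current trace and span IDs as hex strings, or an empty map
  # when no span is active, so callers can merge them into error reports.
  def current_ids do
    case OpenTelemetry.Tracer.current_span_ctx() do
      :undefined ->
        %{}

      ctx ->
        %{
          trace_id: ctx |> OpenTelemetry.Span.trace_id() |> Integer.to_string(16) |> String.downcase(),
          span_id: ctx |> OpenTelemetry.Span.span_id() |> Integer.to_string(16) |> String.downcase()
        }
    end
  end
end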

Phase 3: Extending Tracing to the Frontend

The backend is instrumented, but the story starts in the user’s browser. The original problem was perceived slowness in a complex UI component. We need to capture the entire user journey, starting from the client-side click event.

We use the OpenTelemetry Web SDK. This code would typically be part of your frontend build process (e.g., Webpack, esbuild).

// assets/js/telemetry.js
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { ZipkinExporter } from '@opentelemetry/exporter-zipkin';
import { B3Propagator } from '@opentelemetry/propagator-b3';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

const provider = new WebTracerProvider();

// Use a Zipkin exporter to send spans directly from the browser.
// Note: In a real-world scenario, you might send this to an OTel Collector
// to avoid CORS issues and to add batching/authentication.
const exporter = new ZipkinExporter({
  serviceName: 'frontend-dashboard',
  url: 'http://localhost:9411/api/v2/spans', // Your Zipkin instance URL
});

provider.addSpanProcessor(new SimpleSpanProcessor(exporter));

provider.register({
  // B3 is the propagation format that Zipkin understands by default.
  // This is the key to linking frontend and backend traces.
  propagator: new B3Propagator(),
  // A Zone.js-based context manager keeps the active span across awaits;
  // without it, spans created after an await lose their parent.
  contextManager: new ZoneContextManager(),
});

// Automatically instrument 'fetch' requests to add trace headers.
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // We can enrich fetch spans with additional attributes.
      propagateTraceHeaderCorsUrls: [
        /http:\/\/localhost:4000\/api\/.*/,
      ]
    }),
  ],
});

const tracer = provider.getTracer('dashboard-tracer');
export default tracer;
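
One backend detail is easy to miss here: the Erlang SDK extracts W3C tracecontext and baggage headers by default, so it would ignore the B3 headers the browser injects. Assuming a recent opentelemetry release, adding the B3 propagators to the configuration closes that gap:

# config/config.exs
# Accept B3 headers (single- and multi-header forms) from the browser in
# addition to the default W3C propagators, so the Phoenix span is parented
# to the frontend span.
config :opentelemetry,
  text_map_propagators: [:b3, :b3multi, :trace_context, :baggage]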

Now, in our application logic, we can create spans manually to trace specific UI interactions. This is where we can finally get quantitative data on our SCSS animation performance.

// assets/js/app.js
import tracer from './telemetry';
import { context, trace, SpanStatusCode } from '@opentelemetry/api';

const dataButton = document.getElementById('load-data-button');
const dashboard = document.getElementById('dashboard-container');

dataButton.addEventListener('click', () => {
  // 1. Create a parent span for the entire user interaction.
  const parentSpan = tracer.startSpan('user-click-load-dashboard');

  // 2. Activate this span's context. Any child spans will be nested under it.
  context.with(trace.setSpan(context.active(), parentSpan), async () => {
    try {
      // 3. Create a child span for the complex SCSS animation/layout phase.
      const animationSpan = tracer.startSpan('apply-loading-animation');
      dashboard.classList.add('loading-state'); // This class triggers complex SCSS transitions.
      // A more robust way would be to use requestAnimationFrame or TransitionEnd events
      // to accurately measure the duration. For simplicity, we'll simulate.
      await new Promise(resolve => setTimeout(resolve, 150)); // Simulate render/animation time.
      animationSpan.end();

      // 4. The FetchInstrumentation will automatically create a span for this API call
      // and inject the B3 headers, making it a child of 'user-click-load-dashboard'.
      const response = await fetch('/api/dashboard/data');
      const data = await response.json();

      // 5. Create another child span for client-side data processing and rendering.
      const renderSpan = tracer.startSpan('render-dashboard-data');
      renderDataToDashboard(data);
      dashboard.classList.remove('loading-state');
      renderSpan.end();

    } catch (error) {
      parentSpan.recordException(error);
      parentSpan.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    } finally {
      // 6. Ensure the parent span is always closed.
      parentSpan.end();
    }
  });
});

function renderDataToDashboard(data) {
  // ... logic to render data ...
}

This setup provides a complete, hierarchical trace in Zipkin:

  1. user-click-load-dashboard (root span)
    • apply-loading-animation (child span, measures frontend render time)
    • HTTP GET (child span, created automatically by FetchInstrumentation)
      • Phoenix /api/dashboard/data (backend parent span, linked via B3 headers)
        • Ecto query (backend child span)
    • render-dashboard-data (child span, measures frontend processing time)

We finally have an end-to-end view.

Phase 4: Deepening Database Insights

The default opentelemetry_ecto instrumentation is good, but we can make it better. A common problem is identifying why a query is slow. We can use Ecto’s telemetry events to attach the PostgreSQL EXPLAIN plan directly to our trace data for slow queries. This is an advanced but incredibly powerful technique.

First, we create a telemetry handler module.

# lib/my_app/ecto/trace_enricher.ex
defmodule MyApp.Ecto.TraceEnricher do
  require OpenTelemetry.Tracer
  alias OpenTelemetry.Tracer

  @slow_query_threshold_ms 100 # Define what we consider a "slow" query.

  def attach do
    # Attach to the Ecto query event emitted by the Repo.
    :telemetry.attach("my-app-ecto-enricher", [:my_app, :repo, :query], &__MODULE__.handle_query/4, nil)
  end

  def handle_query(_event, measurements, metadata, _config) do
    # The Ecto telemetry event reports the total query time in native units.
    total_time_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if total_time_ms > @slow_query_threshold_ms do
      # If the query is slow, get the currently active span.
      current_span_ctx = Tracer.current_span_ctx()

      # Only enrich read queries. EXPLAIN without ANALYZE plans the statement
      # but does not execute it, so restricting to SELECTs is a cheap extra
      # safety net rather than a hard requirement.
      if current_span_ctx != :undefined and select_query?(metadata.query) do
        plan = get_query_plan(metadata.repo, metadata.query, metadata.params)
        # Add the plan as a custom attribute to the active span.
        Tracer.set_attribute("db.statement.plan", plan)
        Tracer.add_event("slow_query_detected", %{total_time_ms: total_time_ms})
      end
    end
  end

  # A very basic check. A production version would be more robust.
  defp select_query?(sql) do
    sql |> String.downcase() |> String.starts_with?("select")
  end

  defp get_query_plan(repo, sql, params) do
    # EXPLAIN returns one text row per plan line; join them into one string.
    case Ecto.Adapters.SQL.query(repo, "EXPLAIN (FORMAT TEXT, VERBOSE) " <> sql, params) do
      {:ok, %{rows: rows}} ->
        rows
        |> Enum.map(&hd/1)
        |> Enum.join("\n")

      {:error, _reason} ->
        "Failed to retrieve query plan."
    end
  end
end

We then attach this handler during application startup in lib/my_app/application.ex.

# lib/my_app/application.ex
def start(_type, _args) do
  MyApp.Telemetry.setup()
  # Attach our custom handler after telemetry is set up.
  MyApp.Ecto.TraceEnricher.attach()

  # ... rest of the start function ...
end

Now, when a query exceeds 100ms, the Zipkin trace for that request carries a db.statement.plan tag with the full PostgreSQL execution plan. (The attribute lands on whichever span is active in the calling process, typically the surrounding Phoenix request span, because OpentelemetryEcto builds its span retroactively from the same telemetry event.) This allows a developer to diagnose an N+1 query or a missing index directly from the observability platform, without needing separate access to the database logs or a DB console.

The complete flow can be visualized as follows:

sequenceDiagram
    participant Browser
    participant Phoenix
    participant PostgreSQL
    participant Zipkin

    Browser->>Browser: user clicks button, starts 'user-click' span
    Browser->>Browser: starts 'animation' child span
    Browser->>+Phoenix: GET /api/dashboard/data (with B3 Headers)
    Phoenix->>Phoenix: starts 'Phoenix' span (parented to browser span)
    Phoenix->>Phoenix: logs "Processing request..." (with trace_id)
    Phoenix->>+PostgreSQL: executes Ecto query
    PostgreSQL-->>-Phoenix: returns data
    Phoenix->>Phoenix: Ecto span ends. If slow, enricher adds EXPLAIN plan.
    Phoenix-->>-Browser: returns JSON response
    Browser->>Browser: starts 'render' child span
    Browser->>Browser: all spans complete
    Note right of Browser: All spans (Browser, Phoenix, Ecto) are asynchronously sent to Zipkin.
    Browser->>Zipkin: sends frontend spans
    Phoenix->>Zipkin: sends backend spans

Through this methodical, layered implementation, we moved from a state of ambiguity and finger-pointing to a system of clear, actionable data. The problem turned out to be a combination of a slightly inefficient query that performed poorly under specific data conditions and a frontend rendering function that was re-calculating layouts unnecessarily. Neither would have been obvious without seeing the entire interaction in a single, correlated trace.

The current implementation uses a SimpleSpanProcessor on the frontend and traces every request on the backend, which is not suitable for a high-traffic production environment due to performance and cost overhead. The logical next steps involve implementing robust sampling strategies. On the backend, a tail-based sampler in an OpenTelemetry Collector would allow us to make sampling decisions based on the entire trace (e.g., only keep traces that contain an error or exceed a certain duration). For the frontend, configuring a probability-based sampler in the WebTracerProvider is a necessary step to reduce the volume of data being sent from clients’ browsers. Furthermore, while custom attributes are powerful, defining a consistent semantic convention for them across all services is crucial for long-term maintainability.
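
As a sketch of the backend half of that work, and assuming the sampler configuration accepted by recent opentelemetry releases, a parent-based ratio sampler is a one-line change:

# config/config.exs
# Keep roughly 10% of root traces; child spans inherit their parent's
# sampling decision, so traces are never recorded partially.
config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.10}}}

The frontend counterpart is a ratio-based sampler configured on the WebTracerProvider, and once an OpenTelemetry Collector sits between the services and Zipkin, both sides can defer the final decision to its tail-based sampling policies.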

