The problem started with a disconnect. Our platform’s frontend is a large Next.js application, heavily leveraging Static Site Generation (SSG) for performance. On the backend, a suite of microservices communicates asynchronously, all traced beautifully with Zipkin. When a user reported an issue—a data inconsistency, a broken UI component—we could pull up the backend traces for their session. We’d see the API calls, the database queries, the service-to-service hops. But a critical piece of context was always missing: what version of the static page did the user actually see?
In a fast-moving CI/CD pipeline, dozens of builds can go out in a day. The static HTML file a user loads is a build artifact, a snapshot in time. An issue might not be in the backend logic, but in the data baked into the static HTML during a specific build, or a bug in the JavaScript bundle from that particular deployment. Without a way to link a runtime user trace back to the build that generated the page, we were flying blind, wasting hours trying to reproduce bugs that were already fixed or dependent on a specific build’s state.
Our initial concept was to create a “birth certificate” for every generated page. During the next build process, we would generate a unique trace context for the build itself, and then a child context for each page rendered. This context—specifically a traceId—would be embedded directly into the static HTML. When a user loaded the page, our client-side observability code would extract this build traceId and link it to the new runtime trace for that user session. This would create an unbroken chain of causality from the build process to the user interaction.
The technology choice for instrumentation was OpenTelemetry. Its vendor-agnostic nature is a major plus, and its robust context propagation mechanisms are exactly what this problem requires. We were already using Zipkin, and OpenTelemetry has a straightforward exporter for it. The challenge wasn’t the backend, but instrumenting the two distinct phases of an SSG application’s life: build-time and runtime.
The Foundational Setup: Backend Service and Zipkin
Before tackling the frontend, we needed a target for our traces. A simple Express.js service and a Docker-based Zipkin instance provide a realistic environment. In a real-world project, this would be a complex mesh of services, but for demonstrating the core tracing propagation, a single service is sufficient.
Here is the docker-compose.yml to run Zipkin locally. It’s the standard configuration and requires no modification.
# docker-compose.yml
version: '3.9'
services:
  zipkin:
    image: openzipkin/zipkin
    container_name: zipkin
    ports:
      - "9411:9411"
Next, the instrumented backend service. This Node.js server uses several OpenTelemetry packages to automatically instrument incoming HTTP requests and export the trace data to Zipkin.
// backend/tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { ZipkinExporter } = require('@opentelemetry/exporter-zipkin');
// Configure the Zipkin exporter
// This assumes Zipkin is running on the default port
const zipkinExporter = new ZipkinExporter({
serviceName: 'my-backend-service',
url: 'http://localhost:9411/api/v2/spans',
});
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'my-backend-service',
}),
traceExporter: zipkinExporter,
instrumentations: [getNodeAutoInstrumentations()],
});
// Gracefully shut down the SDK on process exit
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.error('Error terminating tracing', error))
.finally(() => process.exit(0));
});
module.exports = sdk;
The server itself is a standard Express app. The key is to start the tracing SDK before any other modules are imported.
// backend/server.js
const sdk = require('./tracing');
sdk.start();
const express = require('express');
const cors = require('cors');
const app = express();
const PORT = 4000;
app.use(cors()); // Allow requests from the Next.js app
app.get('/api/data', (req, res) => {
// The OpenTelemetry instrumentation will automatically create a span for this request
// and link it to the parent span from the frontend if the traceparent header is present.
console.log('Received request with headers:', req.headers);
res.json({
message: 'Data from the backend',
timestamp: new Date().toISOString()
});
});
app.listen(PORT, () => {
console.log(`Backend server listening on port ${PORT}`);
});
With docker-compose up and node backend/server.js, we have a functioning trace-collection environment. Any request with a W3C traceparent header will now appear in the Zipkin UI.
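Before wiring up the frontend, it’s worth sanity-checking the backend and Zipkin on their own. The following is a minimal smoke test, not part of the article’s codebase, that hand-rolls a W3C traceparent header; it assumes Node 18+ for the global fetch.
// scripts/smoke-test-trace.js (hypothetical helper)
const crypto = require('crypto');

// W3C trace context: 32 hex chars for the trace ID, 16 for the span ID.
const traceId = crypto.randomBytes(16).toString('hex');
const spanId = crypto.randomBytes(8).toString('hex');
// Format: version-traceId-parentSpanId-flags ("01" = sampled)
const traceparent = `00-${traceId}-${spanId}-01`;

fetch('http://localhost:4000/api/data', { headers: { traceparent } })
  .then((res) => res.json())
  .then((body) => {
    console.log('Response:', body);
    console.log(`Search Zipkin for trace ID ${traceId}`);
  })
  .catch(console.error);
If everything is wired correctly, the backend span shows up in the Zipkin UI under the trace ID the script printed.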
Phase 1: Instrumenting the Build Process
This was the most significant challenge. The next build command is an ephemeral process: it starts, generates files, and exits. Standard tracing setups are designed for long-running servers. Our solution was to wrap the Next.js build command in a custom Node.js script that establishes a root “build span.”
The core mechanism for maintaining context across the asynchronous operations within the Next.js build is Node.js’s AsyncLocalStorage.
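If AsyncLocalStorage is unfamiliar, the idea in isolation looks like this. This is a standalone sketch, separate from the build script, just to show that a value set by run() survives across await boundaries in the same process.
// Minimal illustration of AsyncLocalStorage, the mechanism the build wrapper relies on.
const { AsyncLocalStorage } = require('async_hooks');

const als = new AsyncLocalStorage();

async function doWork() {
  // getStore() returns whatever value the surrounding run() call established,
  // even after asynchronous hops.
  await new Promise((resolve) => setTimeout(resolve, 10));
  console.log('store inside async work:', als.getStore());
}

als.run({ buildId: 'example-123' }, () => {
  doWork(); // eventually logs { buildId: 'example-123' }
});

console.log('store outside run():', als.getStore()); // undefined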
Here’s the wrapper script that kicks off the entire process.
// scripts/build-with-tracing.js
const { exec } = require('child_process');
const { AsyncLocalStorage } = require('async_hooks');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { ZipkinExporter } = require('@opentelemetry/exporter-zipkin');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const api = require('@opentelemetry/api');
// This is the key for context propagation within the build process
const asyncLocalStorage = new AsyncLocalStorage();
global.buildContextStore = asyncLocalStorage;
function initializeTracer() {
const provider = new NodeTracerProvider({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'ssg-build-process',
}),
});
const exporter = new ZipkinExporter({
serviceName: 'ssg-build-process',
url: 'http://localhost:9411/api/v2/spans',
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();
// Return the provider as well, so the caller can flush/shut it down at the end.
return { tracer: api.trace.getTracer('nextjs-build-tracer'), provider };
}
async function runBuild() {
const { tracer, provider } = initializeTracer();
const commitSha = require('child_process').execSync('git rev-parse --short HEAD').toString().trim();
// Create the root span for the entire build
const rootSpan = tracer.startSpan('nextjs-build', {
attributes: {
'build.commit_sha': commitSha,
'build.start_time': new Date().toISOString(),
}
});
// The core of the solution: run the build within the context of this root span,
// and stash that context in AsyncLocalStorage so that code running later in this
// process (e.g. the getStaticProps wrapper) can retrieve it as the parent context.
const buildContext = api.trace.setSpan(api.context.active(), rootSpan);
await asyncLocalStorage.run(buildContext, () => api.context.with(buildContext, async () => {
console.log('Starting Next.js build with tracing enabled...');
console.log(`Build Trace ID: ${rootSpan.spanContext().traceId}`);
const buildProcess = exec('next build');
buildProcess.stdout.on('data', (data) => {
process.stdout.write(data);
});
buildProcess.stderr.on('data', (data) => {
process.stderr.write(data);
});
return new Promise((resolve, reject) => {
buildProcess.on('close', (code) => {
if (code !== 0) {
rootSpan.setStatus({ code: api.SpanStatusCode.ERROR, message: 'Build Failed' });
console.error(`Build failed with code ${code}`);
reject(new Error(`Build failed with code ${code}`));
} else {
rootSpan.setStatus({ code: api.SpanStatusCode.OK });
console.log('Build completed successfully.');
resolve();
}
rootSpan.end();
// Ensure all spans are sent before exiting
setTimeout(() => provider.shutdown(), 2000);
});
});
}));
}
runBuild().catch(err => {
console.error('Build script encountered an error:', err);
process.exit(1);
});
We modify package.json to use this script:
"scripts": {
"build:traced": "node scripts/build-with-tracing.js",
"dev": "next dev",
"start": "next start"
}
Now, when we run npm run build:traced, a root span is created. The next step is to create child spans for each page generated by getStaticProps and pass the context down. A pitfall here is assuming that the context will just “exist.” We need to explicitly retrieve it from our AsyncLocalStorage instance.
We create a higher-order function to wrap getStaticProps in our pages. This keeps the tracing logic separate from the page logic.
// lib/tracing.js
import * as api from '@opentelemetry/api';
// A HOF to wrap getStaticProps and create a span for each page generation
export function withBuildTrace(getStaticPropsFunc) {
return async function (context) {
const buildContextStore = global.buildContextStore;
if (!buildContextStore) {
// Failsafe if not running in the traced build script
return getStaticPropsFunc(context);
}
const store = buildContextStore.getStore();
if (!store) {
return getStaticPropsFunc(context);
}
// This is how we retrieve the parent context set by the build script
const parentContext = store;
const tracer = api.trace.getTracer('nextjs-build-tracer');
const pagePath = context.params?.slug ? `/posts/${context.params.slug.join('/')}` : '/';
const pageSpan = tracer.startSpan(`getStaticProps:${pagePath}`, {}, parentContext);
// Run the actual getStaticProps within the context of the new page span
const result = await api.context.with(api.trace.setSpan(api.context.active(), pageSpan), async () => {
try {
const propsResult = await getStaticPropsFunc(context);
// This is where we inject the build context into the page props
if (propsResult.props) {
propsResult.props.buildTraceContext = {
traceId: pageSpan.spanContext().traceId,
spanId: pageSpan.spanContext().spanId
};
}
pageSpan.setStatus({ code: api.SpanStatusCode.OK });
return propsResult;
} catch (e) {
pageSpan.recordException(e);
pageSpan.setStatus({ code: api.SpanStatusCode.ERROR, message: e.message });
throw e; // re-throw the error so Next.js can handle it
} finally {
pageSpan.end();
}
});
return result;
};
}
Now, in a page file, we use this wrapper:
// pages/posts/[...slug].js
import { withBuildTrace } from '../../lib/tracing';
export default function Post({ post, buildTraceContext }) {
// The buildTraceContext is now available as a prop
// We'll use this in the next step to embed it in the HTML
return (
<article>
{/* ... page content ... */}
</article>
);
}
export const getStaticProps = withBuildTrace(async (context) => {
// Regular data fetching logic
// e.g., fetch post content from a CMS
console.log(`Generating page for: ${context.params.slug.join('/')}`);
const post = { title: `Post about ${context.params.slug.join('/')}` };
return {
props: {
post,
},
};
});
// ... getStaticPaths implementation ...
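The article leaves getStaticPaths out; purely for illustration, a minimal implementation that lets the catch-all route build might look like this (the hard-coded slug is hypothetical; a real project would enumerate slugs from its CMS).
// Illustrative only: a minimal getStaticPaths so the [...slug] route builds.
export async function getStaticPaths() {
  return {
    paths: [{ params: { slug: ['hello-world'] } }], // placeholder slug
    fallback: false,
  };
}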
The final step for the build phase is embedding this context into the HTML. A common mistake is to try to render this in the page component itself, but that’s too late: the context needs to be in the <head> of the document. We use a custom _app.js and Next.js’s Head component.
// pages/_app.js
import Head from 'next/head';
function MyApp({ Component, pageProps }) {
const { buildTraceContext } = pageProps;
return (
<>
<Head>
{buildTraceContext && (
<>
<meta name="build-trace-id" content={buildTraceContext.traceId} />
<meta name="build-span-id" content={buildTraceContext.spanId} />
</>
)}
</Head>
<Component {...pageProps} />
</>
);
}
export default MyApp;
After running npm run build:traced, we can inspect the generated .next/server/pages/posts/hello-world.html and see our meta tags:
<meta name="build-trace-id" content="a1b2c3d4..."/>
<meta name="build-span-id" content="e5f6a7b8..."/>
And in Zipkin, we see the build trace, with a root span for the build and child spans for each page generated. The link is forged.
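Because the stamp is plain HTML, a CI step can also assert that every generated page carries it. Here is a minimal, hypothetical check; the file path is the one from the example above, and the script is not part of the article’s repo.
// scripts/verify-build-stamp.js (hypothetical CI check)
const fs = require('fs');

const html = fs.readFileSync('.next/server/pages/posts/hello-world.html', 'utf8');
// Look for the meta tag injected by _app.js during the traced build.
const match = html.match(/<meta name="build-trace-id" content="([0-9a-f]+)/);

if (!match) {
  console.error('build-trace-id meta tag missing from build output');
  process.exit(1);
}
console.log(`Page was generated under build trace ${match[1]}`);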
Phase 2: Client-Side Hydration and Runtime Tracing
Now that the build context is available in the DOM, the client-side code needs to pick it up and use it. We set up an OpenTelemetry provider for the browser. The key is to read the meta tags on initialization and attach their content as attributes to new spans.
// lib/client-tracing.js
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { ZipkinExporter } from '@opentelemetry/exporter-zipkin';
let tracerProvider;
export const initTracer = () => {
if (tracerProvider) return tracerProvider;
// Read build context from meta tags
const buildTraceId = document.querySelector('meta[name="build-trace-id"]')?.content;
const buildSpanId = document.querySelector('meta[name="build-span-id"]')?.content;
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'ssg-ui-client',
}).merge(new Resource({
// Attach build context as resource attributes. This ensures every span
// created on the client will have this metadata.
'build.trace_id': buildTraceId,
'build.span_id': buildSpanId,
}));
tracerProvider = new WebTracerProvider({ resource });
// Use the Zipkin exporter pointing to our backend proxy or Zipkin directly
// In a real app, this URL would be proxied to avoid CORS issues if Zipkin is on a different domain
const exporter = new ZipkinExporter({
serviceName: 'ssg-ui-client',
url: 'http://localhost:9411/api/v2/spans',
});
tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter, {
// How long to batch spans before sending
scheduledDelayMillis: 500,
}));
tracerProvider.register({
contextManager: new ZoneContextManager(),
});
// Automatically instrument fetch requests to propagate trace headers
registerInstrumentations({
instrumentations: [
new FetchInstrumentation({
propagateTraceHeaderCorsUrls: [
/localhost:4000/i // Match our backend API
],
// Clear the browser's resource-timing buffer after each capture so it doesn't fill up
clearTimingResources: true,
}),
],
});
return tracerProvider;
};
export const getTracer = () => {
if (!tracerProvider) {
initTracer();
}
return tracerProvider.getTracer('ssg-ui-client-tracer');
}
We initialize this tracer in _app.js using a useEffect hook to ensure it only runs on the client.
// pages/_app.js (updated)
import Head from 'next/head';
import { useEffect } from 'react';
import { initTracer, getTracer } from '../lib/client-tracing';
import { context, trace } from '@opentelemetry/api';
function MyApp({ Component, pageProps }) {
const { buildTraceContext } = pageProps;
useEffect(() => {
initTracer();
const tracer = getTracer();
// Create a root span for the initial page load
const pageLoadSpan = tracer.startSpan('page-load', {
attributes: {
'page.path': window.location.pathname,
'page.title': document.title,
}
});
// We must manually end the span once the page is considered "loaded"
// A more robust implementation might use performance APIs.
context.with(trace.setSpan(context.active(), pageLoadSpan), () => {
  console.log("Page load trace started.");
  // Guard against the load event having already fired before hydration.
  if (document.readyState === 'complete') {
    pageLoadSpan.end();
  } else {
    window.addEventListener('load', () => {
      pageLoadSpan.end();
      console.log("Page load trace ended.");
    }, { once: true });
  }
});
}, []);
return (
<>
<Head>
{buildTraceContext && (
<>
<meta name="build-trace-id" content={buildTraceContext.traceId} />
<meta name="build-span-id" content={buildTraceContext.spanId} />
</>
)}
</Head>
<Component {...pageProps} />
</>
);
}
export default MyApp;
Now, when a page loads, a new trace is started. Any fetch call made from the UI to http://localhost:4000/api/data will automatically have the traceparent header injected. The backend service will pick this up, create its own span as a child of the client-side span, and the entire end-to-end interaction will appear in Zipkin as a single, unified trace.
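The same tracer can also wrap discrete user interactions so the resulting API calls are grouped under a meaningful span. A hedged sketch follows; the refreshData helper and the import path are hypothetical, but the context-wrapping pattern is the standard OpenTelemetry API.
// Hypothetical example: wrap a user interaction in a span so the auto-instrumented
// fetch inside it becomes a child of that span.
import { context, trace } from '@opentelemetry/api';
import { getTracer } from '../lib/client-tracing'; // path assumed

export async function refreshData() {
  const tracer = getTracer();
  const span = tracer.startSpan('refresh-data-click');
  try {
    // Any fetch issued inside this context carries the span's traceparent header.
    return await context.with(trace.setSpan(context.active(), span), () =>
      fetch('http://localhost:4000/api/data').then((res) => res.json())
    );
  } finally {
    span.end();
  }
}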
The Final Result: A Correlated View
The flow is now complete. We can visualize the entire lifecycle.
sequenceDiagram
    participant CI/CD as CI/CD Pipeline
    participant BuildScript as build-with-tracing.js
    participant NextJS as Next.js Build
    participant Browser as User's Browser
    participant Backend as Backend Service
    participant Zipkin
    CI/CD->>BuildScript: npm run build:traced
    BuildScript->>Zipkin: Start Root Span (build)
    BuildScript->>NextJS: exec('next build')
    NextJS->>NextJS: Renders page via getStaticProps
    Note over NextJS: withBuildTrace() wrapper creates child span (page-gen)
    NextJS-->>BuildScript: Returns HTML with meta tags
    BuildScript->>Zipkin: End Spans (build, page-gen)
    BuildScript-->>CI/CD: Build artifact (HTML)
    Browser->>Browser: Loads static HTML
    Note over Browser: initTracer() reads meta tags
    Browser->>Zipkin: Start Root Span (page-load) with build_id attribute
    Browser->>Backend: fetch('/api/data') with traceparent header
    Backend->>Zipkin: Start Child Span (api-request)
    Backend-->>Browser: API Response
    Browser->>Zipkin: End Span (page-load)
    Backend->>Zipkin: End Span (api-request)
In the Zipkin UI, if we search for a trace originating from the browser, we’ll see the page-load span. Clicking on it reveals its tags, which now include build.trace_id and build.span_id. This is our link. An engineer can copy that build.trace_id, search for it in Zipkin, and instantly pull up the entire trace for the build process that generated that specific asset, complete with commit SHA, timestamps, and logs. The disconnect is bridged.
This solution is not without its limitations. The client-side tracing library adds a non-trivial amount to the JavaScript bundle size, which needs to be monitored for its impact on Core Web Vitals. The current implementation only traces the initial page load; a production-ready version would need to handle client-side navigations (e.g., via next/router events) to create new spans for each virtual page view. Furthermore, this setup traces every user interaction, which is not feasible for a high-traffic site. A robust sampling strategy, either head-based on the client or tail-based in a collector, would be a necessary next step to manage the volume of trace data without losing valuable insight.
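For the navigation gap specifically, here is a sketch of what route-change tracing could look like with the pages router’s events; the useNavigationTracing hook is hypothetical and not part of the codebase described above.
// Hypothetical hook: start a span on routeChangeStart, end it when the route settles.
import { useEffect } from 'react';
import { useRouter } from 'next/router';
import { getTracer } from '../lib/client-tracing'; // path assumed

export function useNavigationTracing() {
  const router = useRouter();
  useEffect(() => {
    let navSpan = null;
    const onStart = (url) => {
      navSpan = getTracer().startSpan('client-navigation', {
        attributes: { 'page.target_path': url },
      });
    };
    const onDone = () => {
      navSpan?.end();
      navSpan = null;
    };
    router.events.on('routeChangeStart', onStart);
    router.events.on('routeChangeComplete', onDone);
    router.events.on('routeChangeError', onDone);
    return () => {
      router.events.off('routeChangeStart', onStart);
      router.events.off('routeChangeComplete', onDone);
      router.events.off('routeChangeError', onDone);
    };
  }, [router]);
}
For sampling, one head-based option is to pass a sampler such as TraceIdRatioBasedSampler from @opentelemetry/sdk-trace-base into the WebTracerProvider configuration, so only a fraction of page loads emit trace data in the first place.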