Implementing Full-Stack Distributed Tracing for JavaScript ISR Builds Using SkyWalking and BDD


The production incident started with a vague customer complaint: “The product pages are sometimes slow to load.” Our platform is built on Next.js using Incremental Static Regeneration (ISR), which promises fast initial loads by serving static HTML, then regenerating it in the background once the configured revalidation window has elapsed and a new request arrives. For our back-end services, we rely on Apache SkyWalking, which gives us excellent visibility into our Java microservices. The problem was the massive observability gap between the two worlds. When a user experienced a slow load, it was almost always because a background ISR regeneration had been triggered. Our front-end monitoring showed a high Time to First Byte (TTFB), but the Next.js server process was a black box. We had no way to know whether the slowness was within the Node.js runtime, during data fetching from our APIs, or somewhere else entirely.

Our initial attempt involved using the standard SkyWalking Node.js agent. This failed. The auto-instrumentation couldn’t correctly handle the ISR lifecycle. It would trace the initial server start, but the on-demand regeneration requests were lost, their context detached from any incoming user request. We were flying blind on our most critical user-facing component.

The core concept we settled on was to stop treating the JavaScript server as a “front-end” and start treating it as a first-class citizen in our distributed system. This meant it needed to participate in the distributed trace, originating its own spans and propagating the context to downstream services. The bridge to connect our JavaScript world with the Java-centric SkyWalking backend had to be OpenTelemetry (OTel). It provided the low-level, vendor-neutral primitives for manual instrumentation, giving us the control that auto-instrumentation lacked.
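
In practice, “propagating the context” means each outgoing HTTP request carries trace headers that the next hop can parse. The values below are illustrative, but the shapes are the standard W3C traceparent and B3 single-header formats:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1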

To ensure this wasn’t just a one-off fix, we decided to codify our performance expectations using Behavior-Driven Development (BDD). A vague requirement like “pages must be fast” is unactionable. A BDD scenario like “Given a stale product page, when regeneration is triggered, then the server-side data fetching and rendering must complete within 300ms” is a concrete, testable contract. This contract would be enforced during code review, making observability a non-negotiable part of our definition of done.

Here is the docker-compose.yml that defines our entire development and testing environment. It includes the Next.js application, a simple backend catalog service written in Java, and the full SkyWalking stack. This ensures anyone on the team can replicate the exact environment.

# docker-compose.yml
version: '3.8'

services:
  # SkyWalking Backend and UI
  elasticsearch:
    image: elasticsearch:7.17.9
    container_name: elasticsearch
    ports:
      - "9200:9200"
    healthcheck:
      test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"

  skywalking-oap:
    image: apache/skywalking-oap-server:9.5.0
    container_name: skywalking-oap
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - "11800:11800"
      - "12800:12800"
    healthcheck:
      test: ["CMD", "/bin/sh", "-c", "curl http://localhost:12800/graphql -X POST -H 'Content-Type: application/json' --data '{\"query\":\"query healthCheck{ checkHealth { score } }\"}' | grep -q '\"score\":[1-9]'"]
      interval: 30s
      timeout: 10s
      retries: 5
    environment:
      - SW_STORAGE=elasticsearch
      - SW_STORAGE_ES_CLUSTER_NODES=elasticsearch:9200
      - SW_HEALTH_CHECKER=default
      - SW_TELEMETRY=prometheus
      - SW_CORE_GRPC_HOST=0.0.0.0
      # Enable the OAP's OpenTelemetry receiver. OTLP data arrives on the
      # shared gRPC port (11800). Handler names vary across OAP releases,
      # so verify them against the application.yml shipped with your version.
      - SW_OTEL_RECEIVER=default
      - SW_OTEL_RECEIVER_ENABLED_HANDLERS=otlp-traces

  skywalking-ui:
    image: apache/skywalking-ui:9.5.0
    container_name: skywalking-ui
    depends_on:
      - skywalking-oap
    ports:
      - "8080:8080"
    environment:
      - SW_OAP_ADDRESS=http://skywalking-oap:12800

  # Backend Java Service
  catalog-service:
    build:
      context: ./catalog-service
      dockerfile: Dockerfile
    container_name: catalog-service
    ports:
      - "8090:8090"
    environment:
      # Attach the SkyWalking Java Agent
      - SW_AGENT_NAME=catalog-service
      - SW_AGENT_COLLECTOR_BACKEND_SERVICES=skywalking-oap:11800
      - JAVA_TOOL_OPTIONS=-javaagent:/usr/local/skywalking/agent/skywalking-agent.jar
    depends_on:
      - skywalking-oap

  # Frontend JavaScript (Next.js) App
  frontend-app:
    build:
      context: ./frontend-app
      dockerfile: Dockerfile
    container_name: frontend-app
    ports:
      - "3000:3000"
    environment:
      # Critical for injecting the OTel tracer
      - NODE_OPTIONS=--require ./instrumentation.js
      - CATALOG_API_URL=http://catalog-service:8090
      # The OAP's OTel receiver listens on its shared gRPC port (11800),
      # not on the collector-style 4317.
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://skywalking-oap:11800
      - OTEL_SERVICE_NAME=frontend-isr-server
    depends_on:
      - catalog-service
      - skywalking-oap

The key parts of this configuration are the JAVA_TOOL_OPTIONS for the catalog-service, which transparently attaches the SkyWalking agent, and the NODE_OPTIONS for the frontend-app, which forces our custom OpenTelemetry instrumentation script to be loaded before any application code runs.
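
For reference, here is a sketch of how the agent jar can end up at the path referenced by JAVA_TOOL_OPTIONS. The image tag and the /skywalking/agent path inside the official apache/skywalking-java-agent image are assumptions to verify against Docker Hub, not a tested recipe:

# catalog-service/Dockerfile (sketch; image tag and agent path are assumptions)
FROM apache/skywalking-java-agent:9.0.0-java17 AS agent

FROM eclipse-temurin:17-jre
# Copy the agent to the path referenced by JAVA_TOOL_OPTIONS in docker-compose.yml
COPY --from=agent /skywalking/agent /usr/local/skywalking/agent
COPY target/catalog-service.jar /app/app.jar
EXPOSE 8090
ENTRYPOINT ["java", "-jar", "/app/app.jar"]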

The backend service is a standard Spring Boot application. The SkyWalking Java agent handles all instrumentation automatically. This is our baseline for what “good” observability looks like.

// catalog-service/src/main/java/com/example/catalog/ProductController.java
package com.example.catalog;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@RestController
public class ProductController {

    private static final Logger logger = LoggerFactory.getLogger(ProductController.class);

    @GetMapping("/api/products/{id}")
    public Map<String, Object> getProduct(@PathVariable String id) throws InterruptedException {
        logger.info("Fetching product details for ID: {}", id);

        // Simulate database latency with a random delay
        int latency = ThreadLocalRandom.current().nextInt(50, 150);
        TimeUnit.MILLISECONDS.sleep(latency);

        logger.info("Product {} retrieved in {}ms", id, latency);

        return Map.of(
            "id", id,
            "name", "Super Widget " + id,
            "description", "A high-quality widget for all your needs.",
            "price", 99.99,
            "retrieval_latency_ms", latency
        );
    }
}

Now for the difficult part: instrumenting the Next.js application. We create a file named instrumentation.js in the root of the frontend-app directory. This script is responsible for configuring and initializing the entire OpenTelemetry SDK.

// frontend-app/instrumentation.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { W3CTraceContextPropagator } = require('@opentelemetry/core');
const { CompositePropagator } = require('@opentelemetry/core');
const { B3Propagator, B3InjectEncoding } = require('@opentelemetry/propagator-b3');

// A common mistake is not properly configuring the resource.
// Without a 'service.name', traces often get grouped into a generic 'unknown_service'.
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'frontend-isr-server',
});

// Configure the gRPC exporter to send traces to SkyWalking's OTLP receiver.
// Ensure the endpoint is correct; pointing to localhost in a container won't work.
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
});

// SkyWalking's native cross-process header is sw8, not B3 or W3C; its
// OTLP pipeline correlates on the trace IDs we emit. A dedicated sw8
// propagator is an option, but W3C Trace Context and B3 are far more
// widely supported, so we configure both through a CompositePropagator.
const propagator = new CompositePropagator({
  propagators: [
    new W3CTraceContextPropagator(),
    new B3Propagator(),
    new B3Propagator({ injectEncoding: B3InjectEncoding.MULTI_HEADER }),
  ],
});

const sdk = new NodeSDK({
  resource: resource,
  traceExporter,
  // NodeSDK expects the propagator under `textMapPropagator`;
  // a `propagator` key would be silently ignored.
  textMapPropagator: propagator,
  instrumentations: [
    getNodeAutoInstrumentations({
      // We disable fs instrumentation as it can be very noisy during Next.js builds.
      '@opentelemetry/instrumentation-fs': {
        enabled: false,
      },
    }),
  ],
});

try {
  sdk.start();
  console.log('OpenTelemetry SDK started successfully.');

  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated.'))
      .catch((error) => console.error('Error shutting down tracing', error))
      .finally(() => process.exit(0));
  });
} catch (error) {
  console.error('Error starting OpenTelemetry SDK', error);
}
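
The script assumes the packages below are installed. The versions are illustrative rather than tested recommendations; the OTel JS SDK moves quickly, so pin a set that is known to work together:

// frontend-app/package.json (dependencies excerpt; versions illustrative)
"dependencies": {
  "@opentelemetry/api": "^1.4.0",
  "@opentelemetry/auto-instrumentations-node": "^0.37.0",
  "@opentelemetry/core": "^1.15.0",
  "@opentelemetry/exporter-trace-otlp-grpc": "^0.41.0",
  "@opentelemetry/propagator-b3": "^1.15.0",
  "@opentelemetry/resources": "^1.15.0",
  "@opentelemetry/sdk-node": "^0.41.0",
  "@opentelemetry/semantic-conventions": "^1.15.0",
  "next": "13.x",
  "react": "18.x",
  "react-dom": "18.x"
}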

With the SDK initialized, we can now manually create spans within our ISR data-fetching logic. This is where we gain the visibility that the automatic agent couldn’t provide. We will wrap our fetch calls and the getStaticProps function itself.

Here is the product page component (pages/products/[id].js). The key is how we use the OpenTelemetry API to create a custom span and ensure the trace context is propagated.

// frontend-app/pages/products/[id].js
// Note: we alias the OTel `context` export so it cannot be shadowed by the
// argument Next.js passes to getStaticProps. We also import SpanStatusCode
// directly; it is a top-level export of @opentelemetry/api, not a property
// of `trace`.
import { trace, context as otelContext, propagation, SpanStatusCode } from '@opentelemetry/api';

// This is the core of our manual instrumentation.
// We get the global tracer instance that was configured in `instrumentation.js`.
const tracer = trace.getTracer('nextjs-isr-tracer');

// A reusable function to fetch data while propagating trace context.
async function fetchWithTracing(url, span) {
  const headers = {};
  
  // This is the magic. It injects the current active span's context
  // (e.g., B3 or W3C headers) into the headers object. The backend
  // SkyWalking agent will then pick these up and continue the trace.
  propagation.inject(otelContext.active(), headers);

  const response = await fetch(url, { headers });
  
  if (!response.ok) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: `HTTP Error: ${response.status}` });
    span.recordException(new Error(`Failed to fetch ${url}`));
    throw new Error('Failed to fetch product data.');
  }
  
  const data = await response.json();
  span.setAttribute('http.response.content_length', JSON.stringify(data).length);
  return data;
}

export async function getStaticPaths() {
  // Define some initial paths to pre-render at build time
  return {
    paths: [{ params: { id: '1' } }, { params: { id: '2' } }],
    fallback: 'blocking',
  };
}

// This is the function that runs on the server for ISR.
// Named `ctx` (not `context`) to avoid shadowing the OTel context API.
export async function getStaticProps(ctx) {
  const { id } = ctx.params;
  const apiUrl = `${process.env.CATALOG_API_URL}/api/products/${id}`;

  // Here we start a new, custom span. This will be the parent span
  // for all work done within this function, including the API call.
  // In SkyWalking, this will appear as a segment named 'isr.getStaticProps'.
  const parentSpan = tracer.startSpan('isr.getStaticProps', {
    attributes: {
      'product.id': id,
      'isr.type': 'revalidation',
    },
  });

  // We must wrap the asynchronous work in `otelContext.with` to ensure
  // that `parentSpan` is the active span for the duration of this logic.
  // A common pitfall is to forget this, which leads to broken traces
  // because the context is lost across `await` calls. (Another pitfall,
  // fixed here: `trace.context` does not exist, and the Next.js context
  // argument would shadow the OTel API if we hadn't renamed it.)
  const product = await otelContext.with(trace.setSpan(otelContext.active(), parentSpan), async () => {
    try {
      console.log(`Fetching data for product ${id} from ${apiUrl}`);
      const productData = await fetchWithTracing(apiUrl, parentSpan);
      parentSpan.setStatus({ code: SpanStatusCode.OK });
      return productData;
    } catch (error) {
      console.error('Error in getStaticProps:', error);
      parentSpan.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      parentSpan.recordException(error);
      // Even on error, we must return a valid props structure for Next.js
      return { error: 'Failed to load product.' };
    } finally {
      // It is critical to end the span in all cases.
      parentSpan.end();
    }
  });

  if (product.error) {
    // Handle the error case appropriately
    return { notFound: true };
  }

  return {
    props: {
      product,
    },
    // The core of ISR: regenerate this page at most once every 10 seconds.
    revalidate: 10,
  };
}

function ProductPage({ product }) {
  if (!product) {
    return <div>Loading...</div>;
  }

  return (
    <div>
      <h1>{product.name}</h1>
      <p>{product.description}</p>
      <h2>${product.price}</h2>
      <small>Backend latency: {product.retrieval_latency_ms}ms</small>
    </div>
  );
}

export default ProductPage;

With this in place, a request that triggers an ISR regeneration produces a complete trace in SkyWalking. We can see the entry point at the Next.js server, our custom isr.getStaticProps span, and the downstream call to the catalog-service. We can now definitively answer where the time is being spent.
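
To exercise the regeneration path by hand before automating it, warm the cache, wait out the revalidate window, and hit the page again while watching TTFB. The second timed request is served stale while regeneration runs in the background:

# Warm the cache, wait past revalidate (10s), then measure time-to-first-byte
curl -s -o /dev/null http://localhost:3000/products/1
sleep 12
curl -s -o /dev/null -w 'TTFB: %{time_starttransfer}s\n' http://localhost:3000/products/1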

The final piece is to automate the verification of our performance Service Level Objective (SLO) using BDD. We use Cucumber.js for this.

Here’s our feature file that defines the performance contract.

# frontend-app/features/performance.feature
Feature: ISR Performance Validation

  Scenario: Regenerating a stale product page meets SLO
    Given the product page for ID "123" has been visited and is now stale
    When a request triggers the background regeneration of the page for ID "123"
    Then the "isr.getStaticProps" span duration for product ID "123" should be less than 300ms

The implementation of these steps requires a way to query the SkyWalking backend, and SkyWalking exposes a GraphQL API for exactly this purpose. The “When” step triggers the regeneration and waits a few seconds for the trace data to be ingested; the “Then” step then queries the API to find the relevant span and asserts its duration.

// frontend-app/features/step_definitions/performance_steps.js
const { Given, When, Then } = require('@cucumber/cucumber');
const assert = require('assert');
const fetch = require('node-fetch');

const FRONTEND_URL = 'http://localhost:3000';
const SKYWALKING_GQL_URL = 'http://localhost:8080/graphql';
const REVALIDATE_SECONDS = 10;

// Helper to wait for a specified duration
const wait = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// SkyWalking's Duration input takes step-specific timestamp strings, not
// ISO-8601. For step: MINUTE the expected format is 'yyyy-MM-dd HHmm' (UTC).
const toSwMinute = (date) => {
  const iso = date.toISOString(); // e.g. 2024-01-01T12:34:56.000Z
  return `${iso.slice(0, 10)} ${iso.slice(11, 13)}${iso.slice(14, 16)}`;
};

// Helper to query SkyWalking's GraphQL API
async function querySkyWalkingForTrace(traceId) {
  const query = `
    query ($traceId: String!) {
      trace: queryTrace(traceId: $traceId) {
        spans {
          spanId
          parentSpanId
          serviceCode
          endpointName
          startTime
          endTime
          tags {
            key
            value
          }
        }
      }
    }
  `;
  
  for (let i = 0; i < 5; i++) {
    const response = await fetch(SKYWALKING_GQL_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query, variables: { traceId } }),
    });
    const result = await response.json();
    // Optional chaining guards against GraphQL errors, where data is null.
    if (result.data?.trace?.spans?.length > 0) {
      return result.data.trace.spans;
    }
    await wait(2000); // Wait and retry
  }
  
  throw new Error(`Trace with ID ${traceId} not found in SkyWalking after retries.`);
}


Given('the product page for ID {string} has been visited and is now stale', async function (productId) {
  // First hit warms up the cache
  await fetch(`${FRONTEND_URL}/products/${productId}`);
  // Wait for longer than the revalidate period to ensure the next hit is a regeneration
  console.log(`Waiting for ${REVALIDATE_SECONDS + 2} seconds to ensure staleness...`);
  await wait((REVALIDATE_SECONDS + 2) * 1000);
});

When('a request triggers the background regeneration of the page for ID {string}', async function (productId) {
  const response = await fetch(`${FRONTEND_URL}/products/${productId}`);
  assert.ok(response.ok, `Expected a successful page response, got ${response.status}`);
  
  // A pitfall in testing traces is not capturing the trace ID.
  // We can't get it from the response here easily. A more robust solution might
  // involve logging it in the dev server or using a custom header.
  // For this test, we will fetch the most recent trace for the service.
  // This is flaky but sufficient for a demonstration. A production implementation
  // would require a more reliable way to identify the correct trace.
  console.log('Regeneration triggered. Waiting for trace to be processed...');
  await wait(5000); // Wait for SkyWalking to ingest the trace
});

Then('the "isr.getStaticProps" span duration for product ID {string} should be less than {int}ms', async function (productId, maxDuration) {
  // pageSize 1 + BY_START_TIME lets us grab the most recent trace for the
  // service. Note the SkyWalking condition field names: traceState and
  // queryOrder, not state and orderBy.
  const tracesQuery = `
    query ($service: ID!, $duration: Duration!) {
      traces: queryBasicTraces(condition: {
        serviceId: $service,
        queryDuration: $duration,
        traceState: ALL,
        queryOrder: BY_START_TIME,
        paging: { pageNum: 1, pageSize: 1 }
      }) {
        traces {
          traceIds
        }
      }
    }
  `;
  const sixtySecondsAgo = new Date(Date.now() - 60000);
  const now = new Date();

  const servicesQuery = `query { services: getAllServices(duration: {
      start: "${toSwMinute(sixtySecondsAgo)}",
      end: "${toSwMinute(now)}",
      step: MINUTE
  }) { id name } }`;

  const servicesRes = await fetch(SKYWALKING_GQL_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: servicesQuery }),
  });
  const servicesData = await servicesRes.json();
  // OTEL_SERVICE_NAME is set in the container, not necessarily in the test
  // process, so fall back to the default used in instrumentation.js.
  const serviceName = process.env.OTEL_SERVICE_NAME || 'frontend-isr-server';
  const service = servicesData.data.services.find(s => s.name === serviceName);

  assert.ok(service, `Service ${serviceName} not found in SkyWalking.`);
  
  const traceRes = await fetch(SKYWALKING_GQL_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: tracesQuery,
      variables: {
        service: service.id,
        duration: {
          start: toSwMinute(sixtySecondsAgo),
          end: toSwMinute(now),
          step: 'MINUTE',
        },
      },
    }),
  });
  const traceData = await traceRes.json();
  // Guard against an empty result before indexing into it.
  const traceId = traceData.data.traces.traces[0]?.traceIds[0];

  assert.ok(traceId, 'Could not find the last trace for the service.');

  const spans = await querySkyWalkingForTrace(traceId);
  const isrSpan = spans.find(s => s.endpointName === 'isr.getStaticProps');
  
  assert.ok(isrSpan, 'The "isr.getStaticProps" span was not found in the trace.');
  
  const duration = isrSpan.endTime - isrSpan.startTime;
  console.log(`Found span "isr.getStaticProps" with duration: ${duration}ms`);
  
  assert.ok(
    duration < maxDuration,
    `ISR regeneration took ${duration}ms, which exceeds the SLO of ${maxDuration}ms.`
  );
});
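
With the docker-compose stack up, the suite runs with a single command. The explicit flags match Cucumber's defaults, but they make the project layout obvious:

# From the frontend-app directory
npx cucumber-js features/ --require features/step_definitions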

During code review, if a developer introduces a new data-fetching call inside getStaticProps without wrapping it in our tracing utility, this BDD test will fail. The pull request comment becomes precise: “The performance SLO test failed. The trace shows a 400ms duration for getStaticProps. Please investigate the new dependency added in this function and ensure its trace context is correctly propagated.” This elevates code review from stylistic checks to enforcing operational excellence.

The solution is not without its limitations. The BDD test’s reliance on querying the most recent trace is inherently fragile and would need to be hardened for a CI environment, perhaps by passing a unique request ID through the entire stack. Manual instrumentation also carries a maintenance burden; as the framework evolves, our custom tracing logic might need updates. Furthermore, this only covers the server-side aspect of ISR. A complete picture would involve integrating Real User Monitoring (RUM) to connect these server-side traces with the actual user experience in the browser. Future work could focus on building a dedicated OpenTelemetry instrumentation package for Next.js to automate this process, reducing the need for manual span creation in every data-fetching function.
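
As a sketch of that hardening, note that the isr.getStaticProps span already carries a product.id attribute, which surfaces as a queryable tag in SkyWalking. Requesting a unique product ID per test run turns the flaky “most recent trace” lookup into an exact match. The tags filter on queryBasicTraces is an assumption to verify against your OAP version:

// Sketch: find the test's own trace by tag instead of by recency.
const runProductId = `test-${Date.now()}`; // requested as /products/test-...

const tracesByTagQuery = `
  query ($service: ID!, $duration: Duration!) {
    traces: queryBasicTraces(condition: {
      serviceId: $service,
      queryDuration: $duration,
      traceState: ALL,
      queryOrder: BY_START_TIME,
      paging: { pageNum: 1, pageSize: 1 },
      tags: [{ key: "product.id", value: "${runProductId}" }]
    }) { traces { traceIds } }
  }
`;

Until a dedicated instrumentation package exists, a tag-based lookup like this is the cheapest way to de-flake the suite in CI.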
