Implementing End-to-End Distributed Tracing Across a NestJS, Kafka, and Neo4j Pipeline


The synchronous API endpoint for processing financial transactions was becoming a liability. Each incoming request triggered a series of complex checks, culminating in a set of Cypher queries against a Neo4j database to detect potential fraud rings. As transaction volume grew, the P95 latency for this endpoint climbed into unacceptable territory. The direct coupling of ingestion and deep analysis was the clear bottleneck. In a real-world project, this kind of synchronous performance degradation is a ticking time bomb.

The initial architectural concept was straightforward: decouple ingestion from analysis. A primary NestJS service, the ingestor-service, would handle initial validation, then immediately push the transaction data into a Kafka topic and return a 202 Accepted. A second, independently scalable NestJS microservice, the analyzer-service, would consume from this topic, perform the expensive graph analysis against Neo4j, and take appropriate action. This event-driven approach promised to restore our API’s responsiveness and provide a durable buffer for transaction spikes.

The problem, however, shifted from performance to visibility. In our monolithic world, a single Datadog APM trace captured the entire lifecycle of a request. With the new architecture, a trace would start at the ingestor-service and abruptly end the moment the message was handed off to the Kafka producer client. The analyzer-service would start an entirely new, disconnected trace upon message consumption. We had created an observability black hole. Debugging a problematic transaction meant manually correlating logs between two services using timestamps and transaction IDs—a slow and error-prone process. The requirement was clear: restore end-to-end tracing across this asynchronous boundary.

graph TD
    subgraph "Observability Black Hole (Initial State)"
        Client -- HTTP Request --> A[ingestor-service];
        A -- Trace 1 (Ends) --> B[Kafka Topic];
        B -- Trace 2 (New) --> C[analyzer-service];
        C -- Cypher Query --> D[Neo4j];
    end

    subgraph "Desired State: Unified Trace"
        Client_ -- HTTP Request --> A_[ingestor-service];
        A_ -- Span A --> B_[Kafka Topic];
        subgraph "Trace Context Propagation"
            B_ -- Span B (Child of A) --> C_[analyzer-service];
        end
        C_ -- Span C (Child of B) --> D_[Neo4j];
    end

    style B fill:#f9f,stroke:#333,stroke-width:2px
    style B_ fill:#f9f,stroke:#333,stroke-width:2px

Instrumenting the Producer: Injecting Trace Context into Kafka

The first step was to ensure the ingestor-service was properly instrumented and, crucially, could inject the active trace context into the outgoing Kafka message headers. We relied on the Datadog Node.js tracer, which is compatible with OpenTelemetry.

The initialization of the tracer is critical. It must be the very first thing imported in the application’s entry point to ensure it can patch all subsequent modules.

ingestor-service/src/main.ts

// src/main.ts
// CRITICAL: Tracer must be initialized before any other module.
import tracer from 'dd-trace';
tracer.init({
  logInjection: true,
  runtimeMetrics: true,
  service: 'ingestor-service', // Essential for Datadog service mapping
  env: process.env.NODE_ENV,
});

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Standard NestJS bootstrap...
  await app.listen(3000);
}
bootstrap();

With the basic instrumentation in place, we focused on the Kafka producer logic. We used the @nestjs/microservices package with kafkajs. The default client.emit() method doesn’t offer a straightforward way to set per-message headers. To solve this, we inject the ClientKafka instance and work with the underlying kafkajs Producer it returns once connected, whose send() method gives us full control over headers. The client registration itself is sketched below.
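
For reference, the 'KAFKA_SERVICE' injection token used below is assumed to be registered in the ingestor’s module via ClientsModule; the module name, broker address, and client id in this sketch are illustrative.

ingestor-service/src/transaction/transaction.module.ts

import { Module } from '@nestjs/common';
import { ClientsModule, Transport } from '@nestjs/microservices';
import { TransactionController } from './transaction.controller';
import { TransactionService } from './transaction.service';

@Module({
  imports: [
    ClientsModule.register([
      {
        // The token injected into TransactionService below.
        name: 'KAFKA_SERVICE',
        transport: Transport.KAFKA,
        options: {
          client: {
            clientId: 'ingestor-service',
            brokers: [process.env.KAFKA_BROKER || 'localhost:9092'],
          },
        },
      },
    ]),
  ],
  controllers: [TransactionController],
  providers: [TransactionService],
})
export class TransactionModule {}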

The core of the solution is manually interacting with the tracer to inject the current span’s context into the Kafka message headers.

ingestor-service/src/transaction/transaction.service.ts

import { Inject, Injectable, OnModuleInit } from '@nestjs/common';
import { ClientKafka } from '@nestjs/microservices';
import { Producer } from 'kafkajs';
import tracer from 'dd-trace';
import { ITransaction } from './interfaces/transaction.interface';

@Injectable()
export class TransactionService implements OnModuleInit {
  private kafkaProducer: Producer;

  constructor(
    @Inject('KAFKA_SERVICE') private readonly kafkaClient: ClientKafka,
  ) {}

  async onModuleInit() {
    // A common mistake is to assume the client is connected immediately.
    // ClientKafka.connect() establishes the connection and resolves with the
    // underlying kafkajs Producer, which we keep for header-aware sends.
    this.kafkaProducer = await this.kafkaClient.connect();
  }

  async processTransaction(transaction: ITransaction): Promise<string> {
    const transactionId = transaction.id;

    // Use tracer.scope().active() to get the currently active span.
    // If there's no active span, it returns null.
    const activeSpan = tracer.scope().active();

    if (!activeSpan) {
      // In a production system, you'd have specific logging here.
      // This indicates a potential instrumentation issue.
      console.error('No active dd-trace span found. Tracing context will be lost.');
      return this.sendToKafka(transaction, {});
    }

    const headers = {};
    // The tracer's inject method serializes the span context into a format
    // that can be propagated. We inject it into our headers object.
    tracer.inject(activeSpan.context(), 'text_map', headers);
    
    // In a real project, you would log the injected headers for debugging,
    // but be careful not to log sensitive data.
    // console.log(`Injecting headers for transaction ${transactionId}:`, headers);
    
    return this.sendToKafka(transaction, headers);
  }

  private async sendToKafka(transaction: ITransaction, headers: Record<string, string>): Promise<string> {
    try {
      await this.kafkaProducer.send({
        topic: 'transactions.new',
        messages: [
          {
            key: transaction.id,
            value: JSON.stringify(transaction),
            headers, // Pass the trace context here.
          },
        ],
      });
      return transaction.id;
    } catch (error) {
        // Proper error handling and logging are crucial for a message producer.
        console.error(`Failed to send transaction ${transaction.id} to Kafka`, { error });
        // Depending on the business requirements, you might re-throw,
        // or implement a retry mechanism with backoff.
        throw new Error('Kafka message production failed.');
    }
  }
}

This code explicitly retrieves the active span, creates a headers object, and uses tracer.inject to populate it with Datadog-specific propagation keys (x-datadog-trace-id, x-datadog-parent-id, etc.). This enriched message is then sent to Kafka.
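
Tying this back to the original design goal, here is a minimal sketch of the HTTP endpoint that invokes this service and immediately returns 202 Accepted; the controller name and route are illustrative, not taken from the original code.

ingestor-service/src/transaction/transaction.controller.ts

import { Body, Controller, HttpCode, Post } from '@nestjs/common';
import { TransactionService } from './transaction.service';
import { ITransaction } from './interfaces/transaction.interface';

@Controller('transactions')
export class TransactionController {
  constructor(private readonly transactionService: TransactionService) {}

  @Post()
  @HttpCode(202) // Acknowledge receipt; the expensive analysis happens asynchronously.
  async ingest(@Body() transaction: ITransaction): Promise<{ id: string }> {
    // The HTTP span created by dd-trace is the active span that
    // processTransaction injects into the Kafka message headers.
    const id = await this.transactionService.processTransaction(transaction);
    return { id };
  }
}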

Resuming the Trace in the Consumer: Context Extraction

Now for the analyzer-service. Its job is to consume the message, extract the trace context from the headers, and start a new span that is correctly parented to the one from the ingestor-service. A NestJS Interceptor is the perfect tool for this, as it allows us to wrap the message handling logic declaratively.

First, the tracer initialization is identical to the producer’s.

analyzer-service/src/main.ts

// src/main.ts
// Again, tracer initialization is paramount.
import tracer from 'dd-trace';
tracer.init({
  logInjection: true,
  service: 'analyzer-service', // A distinct service name
  env: process.env.NODE_ENV,
});

import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';

async function bootstrap() {
  const app = await NestFactory.createMicroservice<MicroserviceOptions>(
    AppModule,
    {
      transport: Transport.KAFKA,
      options: {
        client: {
          brokers: [process.env.KAFKA_BROKER || 'localhost:9092'],
        },
        consumer: {
          groupId: 'fraud-analyzer-group',
        },
      },
    },
  );
  await app.listen();
}
bootstrap();

Next, we create the KafkaTraceInterceptor. This interceptor uses tracer.extract to parse the headers and then tracer.trace to create a new span, ensuring any work done within the message handler (including calls to other services like Neo4j) is captured as a child of this new span.

analyzer-service/src/tracing/kafka-trace.interceptor.ts

import {
  CallHandler,
  ExecutionContext,
  Injectable,
  NestInterceptor,
} from '@nestjs/common';
import { KafkaContext } from '@nestjs/microservices';
import { lastValueFrom, Observable } from 'rxjs';
import tracer from 'dd-trace';

@Injectable()
export class KafkaTraceInterceptor implements NestInterceptor {
  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const kafkaContext = context.switchToRpc().getContext<KafkaContext>();
    const message = kafkaContext.getMessage();
    const topic = kafkaContext.getTopic();
    const partition = kafkaContext.getPartition();
    
    // The raw header values are Buffers, so they must be converted to strings.
    const headers = this.convertHeadersToString(message.headers ?? {});

    // Extract the parent span context from the message headers.
    // The 'text_map' format corresponds to what we injected.
    const parentSpanContext = tracer.extract('text_map', headers);

    const spanOptions = {
      tags: {
        'span.kind': 'consumer',
        'component': 'kafka',
        'messaging.system': 'kafka',
        'messaging.destination': topic,
        'messaging.kafka.partition': partition,
        'messaging.message_id': message.key?.toString(),
      },
    };

    if (parentSpanContext) {
      spanOptions['childOf'] = parentSpanContext;
    }
    
    // tracer.trace wraps the handler in a new span. When the wrapped async
    // callback's promise settles, the span is finished automatically, so no
    // manual span.finish() is required.
    return new Observable(subscriber => {
      tracer.trace('kafka.consume', spanOptions, async (span) => {
        try {
          span.setTag('resource.name', `CONSUME ${topic}`);

          const result = await lastValueFrom(next.handle(), { defaultValue: undefined });
          subscriber.next(result);
          subscriber.complete();
        } catch (err) {
          // Ensure errors are tagged on the span for visibility in Datadog.
          span.setTag('error', err);
          subscriber.error(err);
        }
      });
    });
  }

  private convertHeadersToString(
    headers: Record<string, Buffer | string | undefined>,
  ): Record<string, string> {
    const result: Record<string, string> = {};
    for (const key in headers) {
      const value = headers[key];
      if (value != null) { // Header values can be null or undefined.
        result[key] = value.toString();
      }
    }
    return result;
  }
}

This interceptor is then applied globally or to a specific controller.

analyzer-service/src/app.module.ts

import { Module } from '@nestjs/common';
import { APP_INTERCEPTOR } from '@nestjs/core';
import { AnalysisController } from './analysis/analysis.controller';
import { AnalysisModule } from './analysis/analysis.module';
import { KafkaTraceInterceptor } from './tracing/kafka-trace.interceptor';
import { Neo4jModule } from './neo4j/neo4j.module';

@Module({
  imports: [AnalysisModule, Neo4jModule.forRoot()],
  controllers: [AnalysisController],
  providers: [
    {
      provide: APP_INTERCEPTOR,
      useClass: KafkaTraceInterceptor, // Applying the interceptor globally
    },
  ],
})
export class AppModule {}

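The AnalysisController referenced above is not shown in the original code; here is a minimal sketch of what it could look like, with the topic name matching the producer and the handler delegating to AnalysisService. The decorator choice and payload shape are assumptions.

analyzer-service/src/analysis/analysis.controller.ts

import { Controller } from '@nestjs/common';
import { Ctx, EventPattern, KafkaContext, Payload } from '@nestjs/microservices';
import { AnalysisService } from './analysis.service';
import { ITransaction } from './interfaces/transaction.interface';

@Controller()
export class AnalysisController {
  constructor(private readonly analysisService: AnalysisService) {}

  // The globally registered KafkaTraceInterceptor wraps this handler, so the
  // work below runs inside the 'kafka.consume' span parented to the ingestor's trace.
  @EventPattern('transactions.new')
  async handleNewTransaction(
    @Payload() transaction: ITransaction,
    @Ctx() context: KafkaContext,
  ): Promise<void> {
    const isSuspicious = await this.analysisService.analyzeTransaction(transaction);
    if (isSuspicious) {
      // Downstream action (alerting, blocking, case creation) is out of scope here.
      console.warn(`Suspicious transaction detected: ${transaction.id}`);
    }
  }
}
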
Instrumenting the Database Layer: Tracing Neo4j Queries

With the trace context successfully propagated across the Kafka boundary, the final step is to get visibility into the Neo4j queries executed by the analyzer-service. The official neo4j-driver doesn’t have automatic Datadog instrumentation, so we must add it manually.

We’ll create a thin wrapper around the Neo4j driver session. This wrapper will start a new child span for each query it executes.

analyzer-service/src/neo4j/neo4j.service.ts

import { Inject, Injectable, OnApplicationShutdown } from '@nestjs/common';
import neo4j, { Driver, QueryResult, Session } from 'neo4j-driver';
import tracer from 'dd-trace';
import { NEO4J_CONFIG } from './neo4j.constants';
import { Neo4jConfig } from './neo4j.config.interface';

@Injectable()
export class Neo4jService implements OnApplicationShutdown {
  private readonly driver: Driver;

  constructor(@Inject(NEO4J_CONFIG) private readonly config: Neo4jConfig) {
    this.driver = neo4j.driver(
      this.config.uri,
      neo4j.auth.basic(this.config.user, this.config.password),
    );
  }

  onApplicationShutdown() {
    return this.driver.close();
  }

  // This is the instrumented method.
  async runInSession(
    cypher: string,
    params: Record<string, any>,
    database?: string,
  ): Promise<QueryResult> {
    // We create a new span programmatically using tracer.trace.
    // This will automatically be a child of the active 'kafka.consume' span.
    return tracer.trace('neo4j.query', {
      resource: cypher, // The resource name in Datadog will be the query itself.
      tags: {
        'service.name': 'analyzer-service-neo4j', // Or use the main service name.
        'span.kind': 'client',
        'db.system': 'neo4j',
        'db.statement': cypher,
        'db.instance': database || this.config.database,
      }
    }, async (span) => {
      const session: Session = this.driver.session({ database });
      try {
        const result = await session.run(cypher, params);
        // Tagging metadata helps in analysis. Avoid tagging sensitive data.
        span.setTag('db.neo4j.records_count', result.records.length);
        return result;
      } catch (error) {
        span.setTag('error', error);
        throw error;
      } finally {
        await session.close();
      }
    });
  }
}
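
The Neo4jModule.forRoot() call referenced in app.module.ts must supply the NEO4J_CONFIG provider consumed by this service; one possible sketch, assuming the configuration comes from environment variables.

analyzer-service/src/neo4j/neo4j.module.ts

import { DynamicModule, Module } from '@nestjs/common';
import { Neo4jService } from './neo4j.service';
import { NEO4J_CONFIG } from './neo4j.constants';
import { Neo4jConfig } from './neo4j.config.interface';

@Module({})
export class Neo4jModule {
  static forRoot(config?: Neo4jConfig): DynamicModule {
    return {
      module: Neo4jModule,
      providers: [
        {
          provide: NEO4J_CONFIG,
          // Fall back to environment variables when no explicit config is passed.
          useValue: config ?? {
            uri: process.env.NEO4J_URI || 'bolt://localhost:7687',
            user: process.env.NEO4J_USER || 'neo4j',
            password: process.env.NEO4J_PASSWORD || 'neo4j',
            database: process.env.NEO4J_DATABASE,
          },
        },
        Neo4jService,
      ],
      exports: [Neo4jService],
    };
  }
}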

Now, the AnalysisService uses this instrumented method instead of the raw driver.

analyzer-service/src/analysis/analysis.service.ts

import { Injectable } from '@nestjs/common';
import { Neo4jService } from '../neo4j/neo4j.service';
import { ITransaction } from './interfaces/transaction.interface';

@Injectable()
export class AnalysisService {
  constructor(private readonly neo4jService: Neo4jService) {}

  async analyzeTransaction(transaction: ITransaction): Promise<boolean> {
    // This is a simplified query for demonstration.
    // In a real system, this would be far more complex.
    const cypher = `
      MATCH (t:Transaction {id: $transactionId})
      // Find other transactions sharing the same payment card
      MATCH (t)-[:USES_CARD]->(c:Card)<-[:USES_CARD]-(other:Transaction)
      WHERE other.timestamp > t.timestamp - duration({minutes: 5}) AND other <> t
      // Find if those transactions share a device
      MATCH (t)-[:FROM_DEVICE]->(d:Device)<-[:FROM_DEVICE]-(other)
      RETURN count(other) > 0 AS isSuspicious
    `;

    const params = { transactionId: transaction.id };
    
    // Using our instrumented method provides automatic tracing.
    const result = await this.neo4jService.runInSession(cypher, params);
    
    if (result.records.length === 0) {
      return false;
    }
    
    return result.records[0].get('isSuspicious');
  }
}

With this final piece, our trace is complete. When viewed in Datadog, a request hitting the ingestor-service now shows a single, continuous flame graph. It details the initial HTTP processing, the Kafka production span, the time spent in the Kafka queue, the consumption span in the analyzer-service, and finally, a child span representing the exact Cypher query run against Neo4j, complete with its duration and metadata. The observability black hole is closed.

The solution isn’t without its own maintenance considerations. The manual context propagation adds boilerplate. A more advanced implementation might involve creating a custom NestJS Kafka transport strategy that bakes this tracing logic in, making it transparent to the application developer. Furthermore, the current instrumentation only covers successful message processing and basic errors. A production-grade system must also trace retry attempts and the flow of messages to a dead-letter queue, ensuring that failures are just as observable as successes.
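
As one concrete example of closing that gap, the same injection mechanism can be reused when forwarding a failed message to a dead-letter topic, so the failure hop stays attached to the original trace. The helper below is a hypothetical sketch; the topic name and producer wiring are assumptions, not part of the original pipeline.

analyzer-service/src/analysis/dead-letter.producer.ts

import tracer from 'dd-trace';
import { Producer } from 'kafkajs';

// Hypothetical helper: forwards a failed message to a dead-letter topic while
// propagating the active trace context, keeping the failure path observable.
export async function sendToDeadLetterTopic(
  producer: Producer,
  originalValue: string,
  errorMessage: string,
): Promise<void> {
  const headers: Record<string, string> = { 'x-error-message': errorMessage };

  const activeSpan = tracer.scope().active();
  if (activeSpan) {
    // Same mechanism as the ingestor-service: a DLQ consumer can then
    // resume the trace exactly like the analyzer-service does.
    tracer.inject(activeSpan.context(), 'text_map', headers);
  }

  await producer.send({
    topic: 'transactions.dead-letter', // Assumed topic name.
    messages: [{ value: originalValue, headers }],
  });
}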

