Implementing a Real-Time Entity Analysis Pipeline with Node.js GraphQL Orchestration of spaCy and HBase on GKE


The operational challenge was unambiguous: provide near real-time entity analysis over a multi-terabyte corpus of unstructured text residing in an existing HBase cluster. The dataset contained billions of documents, and the user requirement was to query for a known entity—say, a company name—and immediately see a graph of co-occurring entities from relevant documents. Our initial batch-processing approach using MapReduce took hours, rendering it useless for interactive analysis. The latency target was sub-5 seconds for a typical query. This constraint immediately invalidated any full-scan or large-scale batch solution. The problem shifted from a big data processing task to a low-latency service orchestration puzzle.

Our data was already in HBase, a choice we couldn’t change. The primary table was keyed by a unique document UUID and contained a single column family doc with a raw_text qualifier.

HBase Table: 'documents'
Row Key: <document_uuid>
Column Family: 'doc'
  Qualifier: 'raw_text' -> "..."

A naive query for an entity required a full table scan with a SingleColumnValueFilter, a non-starter for performance. The initial pain point was a fundamental mismatch between the storage schema and the access pattern. To solve this, we had to introduce a secondary index. In a real-world project, you don’t just add indexes casually; you model them after the precise queries you need to serve. Our primary query was “find all documents mentioning entity X.” This led to a new index table, entity_index, designed for rapid lookups.

The row key design is the single most critical factor in HBase performance. Our chosen schema for the index was [entity_name_normalized_hash]:[document_uuid]. The hash prevents hot-spotting on lexicographically close entity names.
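Concretely, the index table looks like this (the column family and qualifier shown are an assumption; since the row key itself carries all the information, the cell value can be left empty):

HBase Table: 'entity_index'
Row Key: <entity_hash_prefix>:<document_uuid>
Column Family: 'idx'
  Qualifier: 'e' -> "" (empty marker cell; the row key carries the data)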

// Example Java code for row key generation (conceptual)
import org.apache.commons.codec.digest.DigestUtils;
import java.nio.charset.StandardCharsets;

public class HBaseKeyGenerator {
    public static byte[] createEntityIndexKey(String entityName, String docUUID) {
        // Normalize: lowercase, remove punctuation, etc.
        String normalizedEntity = entityName.toLowerCase().replaceAll("[^a-z0-9]", "");
        String entityHash = DigestUtils.sha256Hex(normalizedEntity).substring(0, 16); // 16 hex chars = first 8 bytes of the digest

        String key = String.format("%s:%s", entityHash, docUUID);
        return key.getBytes(StandardCharsets.UTF_8);
    }
}

With this index, a query for “Apple Inc.” becomes a highly efficient prefix scan on the entity_index table for the hash of the normalized entity name. This gets us a list of document_uuids in milliseconds, not hours. This was the first major breakthrough, shifting the bottleneck from data retrieval to data processing.
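For illustration, here is a minimal sketch of that prefix scan from Python, assuming an HBase Thrift gateway and the happybase client (the production path runs through the Node.js orchestrator shown later; the gateway hostname is hypothetical):

# prefix_scan_sketch.py -- illustrative only; assumes an HBase Thrift
# gateway and the happybase client, not part of the production path.
import hashlib
import re

import happybase

def entity_hash_prefix(entity_name):
    """Must match the ingestion-side key logic exactly."""
    normalized = re.sub(r"[^a-z0-9]", "", entity_name.lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16].encode("ascii")

connection = happybase.Connection("hbase-thrift-gateway")  # hypothetical host
table = connection.table("entity_index")

# Every row for this entity shares the same 16-hex-char hash prefix,
# so a prefix scan returns exactly the matching documents.
doc_uuids = [
    key.split(b":", 1)[1].decode("utf-8")
    for key, _ in table.scan(row_prefix=entity_hash_prefix("Apple Inc."), limit=10)
]
print(doc_uuids)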

Now that we could fetch relevant document UUIDs quickly, we needed to perform Named Entity Recognition (NER) on the raw text of those documents. The JavaScript ecosystem’s NLP libraries were not mature enough for the accuracy and performance we required. Python, with spaCy, was the obvious choice. This decision forced a polyglot microservice architecture. Our core API, responsible for orchestrating the workflow, would be built in Node.js to leverage its non-blocking I/O for managing concurrent requests to HBase and the new Python service.

Communication between the Node.js orchestrator and the Python NLP service needed to be fast. A common mistake is to default to REST with JSON payloads; the serialization/deserialization overhead of JSON for large text blocks would have killed our latency budget. gRPC was the pragmatic choice: compact binary serialization via Protocol Buffers and multiplexed, long-lived connections over HTTP/2.

Our service contract, defined in ner.proto, was simple but explicit:

// ner.proto
syntax = "proto3";

package ner;

// The request message containing the text to be analyzed.
message NerRequest {
  string request_id = 1; // For logging and tracing
  string text = 2;
}

// A single recognized entity.
message Entity {
  string text = 1;
  string label = 2; // e.g., ORG, PERSON, GPE
}

// The response message containing the list of entities.
message NerResponse {
  string request_id = 1;
  repeated Entity entities = 2;
}

// The service definition.
service NerService {
  rpc ExtractEntities(NerRequest) returns (NerResponse) {}
}

The Python gRPC server implementation using spaCy is where the heavy lifting happens. The ner_pb2 and ner_pb2_grpc modules it imports are generated from ner.proto with grpcio-tools (python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. ner.proto). It's crucial that this service is stateless and horizontally scalable.

# nlp_service/server.py
import grpc
import spacy
from concurrent import futures
import logging

from ner_pb2 import NerResponse, Entity
from ner_pb2_grpc import NerServiceServicer, add_NerServiceServicer_to_server

# A common production pitfall is loading the model on every request.
# Load it once at startup. Use a smaller, faster model if full accuracy isn't needed.
# For GKE, ensure the container has enough memory for this model.
NLP = spacy.load("en_core_web_sm")

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

class NerServiceImpl(NerServiceServicer):
    def ExtractEntities(self, request, context):
        """
        Extracts named entities from the provided text.
        """
        logging.info(f"Processing request_id: {request.request_id}")
        if not request.text:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("Text field cannot be empty.")
            return NerResponse()

        try:
            doc = NLP(request.text)
            entities = [
                Entity(text=ent.text, label=ent.label_)
                for ent in doc.ents
            ]
            return NerResponse(request_id=request.request_id, entities=entities)
        except Exception as e:
            logging.error(f"NLP processing failed for request {request.request_id}: {e}")
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details("An internal error occurred during NLP processing.")
            return NerResponse()

def serve():
    """Starts the gRPC server."""
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    add_NerServiceServicer_to_server(NerServiceImpl(), server)
    server.add_insecure_port('[::]:50051')
    logging.info("gRPC server started on port 50051.")
    server.start()
    server.wait_for_termination()

if __name__ == '__main__':
    serve()
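
Before wiring this into the orchestrator, a quick smoke test helps. A minimal client sketch, assuming the generated stubs are importable and the server is running locally:

# nlp_service/smoke_test.py -- minimal client sketch for local testing
import grpc

from ner_pb2 import NerRequest
from ner_pb2_grpc import NerServiceStub

def main():
    # An insecure channel is acceptable here; production traffic stays
    # inside the cluster behind the Kubernetes Service.
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = NerServiceStub(channel)
        request = NerRequest(request_id="smoke-1",
                             text="Apple Inc. hired Tim Cook in Cupertino.")
        response = stub.ExtractEntities(request, timeout=2.0)
        for entity in response.entities:
            print(f"{entity.label}: {entity.text}")

if __name__ == '__main__':
    main()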

To deploy this on GKE, we need a Dockerfile. A multi-stage build is best practice to keep the final image slim.

# Dockerfile for Python spaCy service

# Stage 1: Build stage with all dependencies
FROM python:3.9-slim as builder

WORKDIR /app

RUN pip install --no-cache-dir --upgrade pip
COPY requirements.txt .
# Assumes requirements.txt pins spacy, grpcio, and grpcio-tools
RUN pip install --no-cache-dir -r requirements.txt
# Download the spaCy model; it installs as a package into site-packages
RUN python -m spacy download en_core_web_sm
# Generate the gRPC stubs (ner_pb2.py, ner_pb2_grpc.py) that server.py imports
COPY ner.proto .
RUN python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. ner.proto

# Stage 2: Final production image
FROM python:3.9-slim

WORKDIR /app

# Copy the installed packages (including the spaCy model) from the builder
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages
# Copy the application code, then the generated stubs next to server.py
# so its top-level imports resolve at runtime
COPY nlp_service/ /app/nlp_service/
COPY --from=builder /app/ner_pb2.py /app/ner_pb2_grpc.py /app/nlp_service/

# Set the entrypoint
CMD ["python", "nlp_service/server.py"]

The orchestration layer is a Node.js server exposing a GraphQL API. We chose Apollo Server. This service is the brain of the operation. It receives a user query, translates it into an HBase prefix scan, retrieves multiple documents, fans out parallel gRPC calls to the Python NLP service, and finally aggregates the results.

Here is the core structure of the Node.js orchestrator.

// orchestrator/index.js
const { ApolloServer } = require('@apollo/server');
const { startStandaloneServer } = require('@apollo/server/standalone');
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');
const { v4: uuidv4 } = require('uuid');
const hbase = require('hbase-rest'); // Hypothetical REST-gateway client; substitute your HBase client of choice

// --- Configuration ---
// In a real project, these come from environment variables or a config service.
const NER_SERVICE_ADDRESS = process.env.NER_SERVICE_ADDRESS || 'localhost:50051';
const HBASE_REST_URL = process.env.HBASE_REST_URL || 'http://localhost:8080';

// --- gRPC Client Setup ---
// It's critical to create ONE client and reuse it.
const PROTO_PATH = __dirname + '/../ner.proto';
const packageDefinition = protoLoader.loadSync(PROTO_PATH, {
    keepCase: true,
    longs: String,
    enums: String,
    defaults: true,
    oneofs: true,
});
const ner_proto = grpc.loadPackageDefinition(packageDefinition).ner;
const nerClient = new ner_proto.NerService(NER_SERVICE_ADDRESS, grpc.credentials.createInsecure());

// --- HBase Client Setup ---
const hbaseClient = hbase({
    host: HBASE_REST_URL,
    // Add authentication, etc. for production
});

// --- GraphQL Schema ---
const typeDefs = `#graphql
  type Entity {
    text: String!
    label: String!
  }

  type DocumentAnalysis {
    documentId: String!
    entities: [Entity!]!
  }

  type Query {
    analyzeEntity(entityText: String!, docLimit: Int = 10): [DocumentAnalysis!]
  }
`;

// --- Helper for HBase Key ---
// This MUST match the logic used for ingestion.
const { createHash } = require('crypto');
function getEntityHashPrefix(entityText) {
    const normalized = entityText.toLowerCase().replace(/[^a-z0-9]/g, '');
    return createHash('sha256').update(normalized).digest('hex').substring(0, 16);
}

// --- GraphQL Resolvers ---
const resolvers = {
    Query: {
        analyzeEntity: async (_, { entityText, docLimit }) => {
            console.log(`Analyzing entity: ${entityText}`);

            // 1. Find document UUIDs from the HBase index
            const entityHashPrefix = getEntityHashPrefix(entityText);
            const scanner = hbaseClient.table('entity_index').scan({
                startRow: entityHashPrefix,
                endRow: entityHashPrefix + '~', // Scan prefix
                limit: docLimit,
            });

            // A common pitfall is not handling scanner errors or empty results gracefully.
            const docIds = await new Promise((resolve, reject) => {
                const ids = [];
                scanner.on('data', (row) => {
                    const key = row.key;
                    const parts = key.split(':');
                    if (parts.length === 2) {
                        ids.push(parts[1]);
                    }
                });
                scanner.on('end', () => resolve(ids));
                scanner.on('error', (err) => reject(new Error(`HBase index scan failed: ${err.message}`)));
            });

            if (docIds.length === 0) {
                return [];
            }
            
            // 2. Fetch raw text for each document from the main HBase table
            const getDocPromises = docIds.map(id => 
                hbaseClient.table('documents').row(id).get('doc:raw_text')
            );

            // Using Promise.allSettled is safer than Promise.all in production.
            // It ensures that one failed document fetch doesn't kill the entire request.
            const docResults = await Promise.allSettled(getDocPromises);

            // 3. Fan-out requests to the gRPC NLP service
            const analysisPromises = docResults
                .filter(res => res.status === 'fulfilled' && res.value && res.value.length > 0)
                .map(res => {
                    const docId = res.value[0].key;
                    // HBase values are base64 encoded by the REST gateway
                    const rawText = Buffer.from(res.value[0].$, 'base64').toString('utf-8');
                    
                    const requestId = uuidv4();
                    const request = { request_id: requestId, text: rawText };

                    return new Promise((resolve, reject) => {
                        nerClient.ExtractEntities(request, { deadline: Date.now() + 2000 }, (err, response) => {
                            if (err) {
                                console.error(`gRPC call failed for ${docId}:`, err.details);
                                // Resolve with a partial error state instead of rejecting the whole batch
                                resolve({ documentId: docId, entities: [], error: err.details });
                                return;
                            }
                            resolve({ documentId: docId, entities: response.entities });
                        });
                    });
                });

            const analyses = await Promise.all(analysisPromises);
            return analyses.filter(a => !a.error);
        },
    },
};

const server = new ApolloServer({
    typeDefs,
    resolvers,
});

async function startServer() {
    const { url } = await startStandaloneServer(server, { listen: { port: 4000 } });
    console.log(`🚀 GraphQL orchestrator ready at ${url}`);
}

startServer();

The flow is clear: GraphQL Query -> Resolver -> HBase Index Scan -> HBase Document Fetch -> Parallel gRPC Calls -> Aggregate -> GraphQL Response.
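
An end-to-end call against the orchestrator then looks like the following sketch (Python with the requests library; the standalone Apollo Server listens on port 4000 at the root path):

# query_example.py -- sketch of a client-side call to the orchestrator
import requests

QUERY = """
query Analyze($entityText: String!, $docLimit: Int) {
  analyzeEntity(entityText: $entityText, docLimit: $docLimit) {
    documentId
    entities { text label }
  }
}
"""

response = requests.post(
    "http://localhost:4000/",
    json={"query": QUERY, "variables": {"entityText": "Apple Inc.", "docLimit": 10}},
    timeout=10,
)
response.raise_for_status()
for doc in response.json()["data"]["analyzeEntity"]:
    print(doc["documentId"], [(e["label"], e["text"]) for e in doc["entities"]])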

Finally, deploying this polyglot system on GKE requires defining our workloads and services declaratively.

# nlp-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nlp-service
spec:
  replicas: 3 # Start with a few replicas, then configure HPA
  selector:
    matchLabels:
      app: nlp-service
  template:
    metadata:
      labels:
        app: nlp-service
    spec:
      containers:
      - name: spacy-ner
        image: your-gcr-repo/nlp-service:latest
        ports:
        - containerPort: 50051
        resources:
          requests:
            memory: "1Gi" # spaCy models can be memory hungry
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: nlp-service
spec:
  selector:
    app: nlp-service
  ports:
    - protocol: TCP
      port: 50051
      targetPort: 50051
  type: ClusterIP # Expose only within the cluster

The Node.js orchestrator deployment is similar, but its service needs to be exposed to the internet via an Ingress. One caveat: gRPC multiplexes calls over long-lived HTTP/2 connections, so a plain ClusterIP Service tends to pin each orchestrator pod to a single NLP pod; a headless Service with client-side load balancing (or a service mesh) spreads load across replicas more evenly.

# orchestrator-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orchestrator
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orchestrator
  template:
    metadata:
      labels:
        app: orchestrator
    spec:
      containers:
      - name: graphql-orchestrator
        image: your-gcr-repo/orchestrator:latest
        ports:
        - containerPort: 4000
        env:
        - name: NER_SERVICE_ADDRESS
          value: "nlp-service:50051" # K8s DNS resolves the service name
        - name: HBASE_REST_URL
          value: "http://your-hbase-rest-gateway-service:8080"
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: orchestrator-service
spec:
  selector:
    app: orchestrator
  ports:
    - protocol: TCP
      port: 80
      targetPort: 4000
  type: NodePort # Or LoadBalancer, depending on GKE setup
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
spec:
  rules:
  - http:
      paths:
      - path: /graphql
        pathType: Prefix
        backend:
          service:
            name: orchestrator-service
            port:
              number: 80

This diagram illustrates the final request flow within the GKE cluster:

sequenceDiagram
    participant C as GraphQL Client (JavaScript)
    participant I as GKE Ingress
    participant O as Orchestrator (Node.js)
    participant H_IDX as HBase (entity_index)
    participant H_DOC as HBase (documents)
    participant N as NLP Service (Python/spaCy)

    C->>I: POST /graphql (analyzeEntity query)
    I->>O: Forwards request
    O->>H_IDX: Scan for documents by entity hash
    H_IDX-->>O: Returns list of document UUIDs
    O->>H_DOC: Parallel GET requests for raw text
    H_DOC-->>O: Returns raw text for each document
    par
        O->>N: gRPC ExtractEntities(doc1)
        N-->>O: Returns entities for doc1
    and
        O->>N: gRPC ExtractEntities(doc2)
        N-->>O: Returns entities for doc2
    and
        O->>N: gRPC ExtractEntities(doc3)
        N-->>O: Returns entities for doc3
    end
    O->>O: Aggregate results
    O-->>I: GraphQL JSON response
    I-->>C: Returns response

This architecture successfully met our sub-5-second latency target. The solution’s main strength lies in its decomposition of concerns: HBase for fast, indexed retrieval; Python/spaCy for high-quality NLP; Node.js for efficient I/O-bound orchestration; and GKE for scalable, resilient deployment.

However, the system is not without its limitations. The on-the-fly NER processing is a potential bottleneck and can be costly at scale, as the NLP pods require significant memory and CPU. A future iteration would involve pre-calculating entities for our entire corpus in a batch process and storing them in another indexed HBase table. This would transform the query into a simple data-retrieval operation, drastically reducing latency and computational cost, at the expense of storage and the complexity of the batch ingestion pipeline. The current approach is a trade-off, prioritizing real-time processing on newly ingested data over the ultimate query speed and operational cost. The applicability of this real-time pattern is therefore bounded by the user’s tolerance for slightly higher latency and the business’s budget for compute resources.
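
As a sketch of that future direction, a batch job could walk the documents table and materialize index rows ahead of time (hypothetical happybase code; the label qualifier and gateway hostname are assumptions):

# batch_precompute.py -- hypothetical sketch of the future batch pipeline:
# pre-compute entities for every document and materialize index rows.
import hashlib
import re

import happybase
import spacy

NLP = spacy.load("en_core_web_sm")

def entity_hash_prefix(entity_name):
    normalized = re.sub(r"[^a-z0-9]", "", entity_name.lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16].encode("ascii")

connection = happybase.Connection("hbase-thrift-gateway")  # hypothetical host
documents = connection.table("documents")
index = connection.table("entity_index")

with index.batch(batch_size=1000) as batch:
    for row_key, data in documents.scan(columns=[b"doc:raw_text"]):
        text = data[b"doc:raw_text"].decode("utf-8")
        for ent in NLP(text).ents:
            # Same normalized-hash key scheme as the online path.
            batch.put(entity_hash_prefix(ent.text) + b":" + row_key,
                      {b"idx:label": ent.label_.encode("utf-8")})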

