Instrumenting End-to-End LLM Interactions with Custom Spans in Apache SkyWalking from a Frontend Client


Our new AI-powered Q&A feature, built on a Retrieval-Augmented Generation (RAG) pipeline, went live last week. And the user feedback is a mix of awe and frustration. Some users get brilliant, context-aware answers in seconds. Others get slow, irrelevant responses or outright errors. From a monitoring perspective, it’s a black box. Our frontend observability shows a single, long-running API call to /api/v1/query. Our backend logs show a request coming in and a response going out, with a call to an external LLM API somewhere in the middle. But the crucial “why” is missing. Why was this specific query slow? Was it the vector database search? Was the context retrieval pulling irrelevant documents? Did the LLM itself have a high time-to-first-token? We were flying blind, and in a production environment, that’s an unacceptable risk.

The initial thought was to add more structured logging. But sifting through terabytes of logs to manually piece together the journey of a single user request is inefficient and often futile. The real need was to visualize the entire lifecycle of a request, from the user’s click in the browser all the way through the complex, multi-stage process of our RAG pipeline. We already use Apache SkyWalking for our standard microservices, so the logical step was to extend its tracing capabilities to become “LLM-aware.” This meant not just tracking latency but enriching the trace data with LLM-specific metadata: token counts, model names, retrieved document IDs, and internal processing steps.

This is the log of how we instrumented our system, from the React frontend to the Python backend, forcing SkyWalking to expose the inner workings of our LLM pipeline.

The Foundational Stack: Docker, SkyWalking, and a Basic Application

Before instrumentation, we need a runnable environment. In any real-world project, this means a reproducible setup. We’ll use Docker Compose to orchestrate our services: the SkyWalking OAP (Observability Analysis Platform) server, the SkyWalking UI, its Elasticsearch storage, our React frontend, and the Python FastAPI backend.

The docker-compose.yml is the backbone of the local development and testing environment. A common mistake is to have a complex setup that only works on one developer’s machine. This file ensures consistency.

# docker-compose.yml
version: '3.8'

services:
  elasticsearch:
    image: elasticsearch:7.17.10
    container_name: skywalking-elasticsearch
    ports:
      - "9200:9200"
    healthcheck:
      test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health | grep -q '\"status\":\"green\"'"]
      interval: 30s
      timeout: 10s
      retries: 3
    environment:
      - "discovery.type=single-node"
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"

  skywalking-oap:
    image: apache/skywalking-oap-server:9.5.0
    container_name: skywalking-oap
    depends_on:
      elasticsearch:
        condition: service_healthy
    ports:
      - "11800:11800" # gRPC
      - "12800:12800" # HTTP
    environment:
      - SW_STORAGE=elasticsearch
      - SW_STORAGE_ES_CLUSTER_NODES=elasticsearch:9200
      - SW_CORE_RECORD_DATA_TTL=7 # days
      - SW_CORE_METRICS_DATA_TTL=7 # days
      - JAVA_OPTS=-Xms2g -Xmx2g
    healthcheck:
      test: ["CMD", "/bin/bash", "-c", "curl http://localhost:12800/graphql --header 'Content-Type: application/json' --data '{\"query\":\"query healthCheck{ checkHealth { score } }\"}' -s | grep '\"score\":1'"]
      interval: 30s
      timeout: 10s
      retries: 5

  skywalking-ui:
    image: apache/skywalking-ui:9.5.0
    container_name: skywalking-ui
    depends_on:
      - skywalking-oap
    ports:
      - "8080:8080"
    environment:
      - SW_OAP_ADDRESS=http://skywalking-oap:12800

  backend:
    build:
      context: ./backend
    container_name: llm-backend-service
    ports:
      - "8000:8000"
    environment:
      # SkyWalking Agent Configuration
      - SW_AGENT_NAME=llm-rag-service
      - SW_AGENT_COLLECTOR_BACKEND_SERVICES=skywalking-oap:11800
      - SW_AGENT_LOGGING_LEVEL=INFO
      # For demo purposes, we'll use a mock LLM key
      - OPENAI_API_KEY=mock-key-not-real
    depends_on:
      - skywalking-oap

  frontend:
    build:
      context: ./frontend
    container_name: llm-frontend-client
    ports:
      - "3000:3000"
    depends_on:
      - backend

This configuration sets up the entire observability stack and our application containers. The health checks are critical; they ensure that services start in the correct order, preventing cascading failures during startup.

Step 1: Frontend Instrumentation - Capturing the User’s First Action

The trace must begin at the source: the user’s browser. If we only trace the backend, we lose visibility into network latency and frontend processing time. We use skywalking-client-js for this.

Our simple React app has a button that triggers a query to the backend.

// frontend/src/App.js
import React, { useState } from 'react';
import ClientMonitor from 'skywalking-client-js';
import './App.css';

// --- SkyWalking Client Configuration ---
// In a real-world project, these would come from environment variables.
const collectorUrl = 'http://localhost:12800'; // SkyWalking OAP HTTP endpoint
const serviceName = 'llm-frontend-ui';
const serviceVersion = '1.0.0';

ClientMonitor.init({
  service: serviceName,
  pagePath: window.location.href,
  serviceVersion: serviceVersion,
  collector: collectorUrl,
  jsErrors: true, // Report JS errors
  apiErrors: true, // Report API errors
  resourceErrors: true, // Report resource loading errors
});
// ----------------------------------------

function App() {
  const [query, setQuery] = useState('What is Apache SkyWalking?');
  const [response, setResponse] = useState('');
  const [isLoading, setIsLoading] = useState(false);
  const [error, setError] = useState(null);

  const handleQuery = async () => {
    setIsLoading(true);
    setError(null);
    setResponse('');

    try {
      // The fetch call is automatically traced by skywalking-client-js.
      // It injects the 'sw8' header to propagate the trace context.
      const res = await fetch('http://localhost:8000/api/v1/query', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ query: query }),
      });

      if (!res.ok) {
        throw new Error(`API request failed with status ${res.status}`);
      }
      
      const data = await res.json();
      setResponse(data.answer);

    } catch (err) {
      setError(err.message);
      console.error("Failed to fetch LLM response:", err);
    } finally {
      setIsLoading(false);
    }
  };

  return (
    <div className="App">
      <header className="App-header">
        <h1>LLM RAG Query Interface</h1>
        <textarea
          value={query}
          onChange={(e) => setQuery(e.target.value)}
          rows="3"
          cols="80"
        />
        <button onClick={handleQuery} disabled={isLoading}>
          {isLoading ? 'Processing...' : 'Ask'}
        </button>
        {error && <div className="error">Error: {error}</div>}
        {response && (
          <div className="response">
            <h2>Response:</h2>
            <pre>{response}</pre>
          </div>
        )}
      </header>
    </div>
  );
}

export default App;

The critical part is the ClientMonitor.init block. Once initialized, it wraps native browser APIs like fetch and XMLHttpRequest. When handleQuery calls fetch, the library automatically generates a trace context and injects it into the request headers as sw8. This header contains the trace ID, parent span ID, and other context needed by the backend to continue the trace. Without this, the backend would start a brand new trace, breaking the end-to-end view.
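
For reference, the sw8 value is a single dash-separated string defined by SkyWalking’s cross-process propagation protocol (v3). Roughly, its anatomy looks like this (most fields are Base64-encoded):

sw8: <sample>-<trace-id>-<parent-segment-id>-<parent-span-id>-<service>-<instance>-<endpoint>-<target-address>

sample              0 or 1: whether the caller sampled this trace
trace-id            Base64-encoded trace ID generated in the browser
parent-segment-id   Base64-encoded ID of the frontend's trace segment
parent-span-id      plain integer index of the parent span within that segment
service, instance, endpoint, target-address
                    Base64-encoded identifiers of the caller and the address being called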

Step 2: Backend Instrumentation - Receiving the Trace and the Pitfall of Surface-Level Spans

On the backend, we use FastAPI with the skywalking-python agent. Getting the agent to run is straightforward: we change the start command in our Dockerfile so the sw-python launcher bootstraps the application.

# backend/Dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# The sw-python command bootstraps the application with the SkyWalking agent.
CMD ["sw-python", "run", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

And our initial main.py is simple:

# backend/main.py
import time
import random
import logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Standard logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

class QueryRequest(BaseModel):
    query: str

@app.post("/api/v1/query")
async def process_query_endpoint(request: QueryRequest):
    logger.info(f"Received query: {request.query}")
    
    # Simulate a complex, multi-step process
    time.sleep(random.uniform(0.5, 1.0)) # Simulate vector DB lookup
    time.sleep(random.uniform(0.1, 0.3)) # Simulate prompt construction
    time.sleep(random.uniform(1.0, 2.5)) # Simulate LLM API call
    
    return {"answer": f"This is a placeholder response for the query: '{request.query}'"}

With this setup, we can run docker-compose up and see a trace in the SkyWalking UI. It will show a span from llm-frontend-ui followed by a child span for the /api/v1/query endpoint on llm-rag-service. This is a good start, but it’s where most teams stop. The entire backend process is represented by a single, opaque span. All those time.sleep calls are lumped together. We still don’t know which part of the RAG pipeline is the bottleneck. This is the surface-level tracing that fails to diagnose complex application logic.
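
For reference, the loop for exercising the stack and finding these traces looks like this (ports and service names come straight from the compose file):

# Build and start everything; the OAP takes a minute or two to become healthy.
docker-compose up --build -d

# Fire a query from the UI at http://localhost:3000, or hit the API directly
# (note that curl bypasses the frontend span, so the trace starts at the backend):
curl -X POST http://localhost:8000/api/v1/query \
  -H 'Content-Type: application/json' \
  -d '{"query": "What is Apache SkyWalking?"}'

# Then open the SkyWalking UI at http://localhost:8080 and look for traces under
# the llm-rag-service and llm-frontend-ui services.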

Step 3: Deep Instrumentation with Custom Local Spans

The real value comes from instrumenting the internal logic of our application. SkyWalking’s Python agent provides an API to create custom “local spans.” A local span represents an operation happening within the scope of a single service. We’ll refactor our endpoint to use these.

We create a dedicated module, rag_pipeline.py, to encapsulate the logic and its instrumentation. This is a much cleaner approach than cluttering the API endpoint code.

# backend/rag_pipeline.py
import time
import random
import logging

from skywalking import Layer, Component
from skywalking.trace.context import SpanContext, get_context
from skywalking.trace.tags import Tag

logger = logging.getLogger(__name__)


# The Python agent models tags as subclasses of Tag with a fixed key;
# instantiating one with a value produces the tag attached to a span.
class TagDbSystem(Tag):
    key = 'db.system'

class TagRetrievedDocsCount(Tag):
    key = 'db.retrieved_docs.count'

class TagQueryTopK(Tag):
    key = 'db.query.top_k'

class TagRetrievalScores(Tag):
    key = 'db.retrieval_scores'

class TagPromptTemplateId(Tag):
    key = 'llm.prompt.template_id'

class TagPromptLengthChars(Tag):
    key = 'llm.prompt.length_chars'

class TagLlmModelName(Tag):
    key = 'llm.model_name'

class TagLlmPromptTokens(Tag):
    key = 'llm.usage.prompt_tokens'

class TagLlmCompletionTokens(Tag):
    key = 'llm.usage.completion_tokens'

class TagLlmTotalTokens(Tag):
    key = 'llm.usage.total_tokens'

class TagLlmFinishReason(Tag):
    key = 'llm.finish_reason'


def process_llm_query(user_query: str) -> str:
    """
    Simulates a full RAG pipeline with detailed instrumentation using custom SkyWalking spans.
    """
    context: SpanContext = get_context()

    # --- Span 1: Context Retrieval ---
    # This is the core of custom instrumentation. We manually start a local span,
    # which becomes a child of the currently active endpoint span.
    with context.new_local_span(op="RAG/retrieve_context") as span:
        span.layer = Layer.Database
        span.component = Component.Unknown  # no dedicated component ID for a vector DB

        try:
            # Simulate a query to a vector database like Milvus or Pinecone
            logger.info("Retrieving context from vector database...")
            time.sleep(random.uniform(0.5, 1.0))

            num_docs = random.randint(3, 8)
            similarity_scores = [round(random.uniform(0.85, 0.99), 3) for _ in range(num_docs)]

            # A CRITICAL STEP: Enrich the span with domain-specific metadata.
            # These tags land on the span in SkyWalking and provide immense diagnostic value.
            span.tag(TagDbSystem("faiss_mock"))
            span.tag(TagRetrievedDocsCount(str(num_docs)))
            span.tag(TagQueryTopK("8"))

            # Higher-cardinality detail, such as the raw similarity scores, also fits
            # in a tag (or in correlated application logs) rather than the span name.
            span.tag(TagRetrievalScores(str(similarity_scores)))

            retrieved_context = f"Retrieved {num_docs} documents for query."
            logger.info("Context retrieval complete.")

        except Exception:
            # Proper error handling is essential for observability: mark the span
            # as failed and re-raise so the endpoint can turn it into a 500.
            span.error_occurred = True
            logger.error("Failed during context retrieval", exc_info=True)
            raise

    # --- Span 2: Prompt Construction ---
    with context.new_local_span(op="RAG/construct_prompt") as span:
        try:
            logger.info("Constructing final prompt...")
            time.sleep(random.uniform(0.1, 0.3))

            prompt_template = "Context: {context}\n\nQuestion: {query}\n\nAnswer:"
            final_prompt = prompt_template.format(context=retrieved_context, query=user_query)

            # Do NOT record the final prompt if it contains PII. Record the template ID instead.
            span.tag(TagPromptTemplateId("rag_v1.2"))
            span.tag(TagPromptLengthChars(str(len(final_prompt))))
            logger.info("Prompt construction complete.")

        except Exception:
            span.error_occurred = True
            logger.error("Failed during prompt construction", exc_info=True)
            raise

    # --- Span 3: LLM API Call ---
    with context.new_local_span(op="RAG/call_llm_api") as span:
        span.layer = Layer.Http
        span.component = Component.Unknown  # stands in for the mock LLM provider client
        try:
            logger.info("Calling external LLM API...")
            time.sleep(random.uniform(1.0, 2.5))  # Simulating the actual API call

            # In a real application, this data would come from the LLM provider's response.
            prompt_tokens = len(final_prompt) // 4
            completion_tokens = random.randint(50, 200)
            total_tokens = prompt_tokens + completion_tokens
            finish_reason = "stop"
            model_name = "gpt-4-turbo-mock"

            # This is GOLD for FinOps and performance tuning.
            span.tag(TagLlmModelName(model_name))
            span.tag(TagLlmPromptTokens(str(prompt_tokens)))
            span.tag(TagLlmCompletionTokens(str(completion_tokens)))
            span.tag(TagLlmTotalTokens(str(total_tokens)))
            span.tag(TagLlmFinishReason(finish_reason))

            llm_answer = f"Based on {num_docs} retrieved documents, the answer is generated by {model_name}."
            logger.info("LLM API call complete.")

        except Exception:
            span.error_occurred = True
            logger.error("Failed during LLM API call", exc_info=True)
            raise

    return llm_answer

And we update our main.py to use this instrumented function:

# backend/main.py
import logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from rag_pipeline import process_llm_query

# The SkyWalking agent is bootstrapped by the `sw-python run` command in the
# Dockerfile and configured through the SW_AGENT_* environment variables set in
# docker-compose.yml (service name, collector address, log level). Calling
# config.init()/agent.start() here as well would initialize the agent twice,
# so the application code stays agent-free apart from the custom spans.

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

class QueryRequest(BaseModel):
    query: str

@app.post("/api/v1/query")
async def process_query_endpoint(request: QueryRequest):
    logger.info(f"Received query: {request.query}")
    try:
        # Call the instrumented function
        answer = process_llm_query(request.query)
        return {"answer": answer}
    except Exception as e:
        logger.error(f"An error occurred in the RAG pipeline: {e}", exc_info=True)
        # The spans inside process_llm_query will have already marked the error.
        raise HTTPException(status_code=500, detail="Internal processing error in RAG pipeline.")

The difference is night and day. The with context.new_local_span(...) block is the key. It creates a new span as a child of the currently active span (in this case, the main endpoint span). We give it a descriptive name (op), set its layer and component (using the agent’s Layer and Component enums) for better categorization in the UI, and most importantly, we attach custom tags, declared as small Tag subclasses with fixed keys. These tags are not just arbitrary strings; they are structured data that transforms SkyWalking from a simple APM into a powerful business and application analysis tool. We can now filter traces by llm.model_name or create dashboards that plot the average db.retrieved_docs.count against query latency.
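
One caveat on trace search: out of the box, the OAP only indexes a default set of tag keys (http.method, status codes, and so on) for the trace query page. To filter by our custom keys such as llm.model_name, they need to be added to the core setting searchableTracesTags, exposed on the OAP container as the SW_SEARCHABLE_TRACES_TAGS environment variable. A sketch of the docker-compose.yml addition; check your OAP version’s application.yml for the current default list and keep any defaults you still rely on:

# docker-compose.yml (addition to the skywalking-oap service)
  skywalking-oap:
    environment:
      # Append the LLM tag keys to the set the OAP indexes for trace search.
      - SW_SEARCHABLE_TRACES_TAGS=http.method,http.status_code,db.system,llm.model_name,llm.finish_reason,llm.prompt.template_id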

The Final Result: A Coherent, Drillable Trace

After deploying these changes, a single user query now produces a rich, hierarchical trace in the SkyWalking UI.

sequenceDiagram
    participant Browser
    participant FE as Frontend (React)
    participant BE as Backend (FastAPI)
    participant VDB as VectorDB (Mock)
    participant LLM as LLM API (Mock)

    Browser->>FE: User clicks 'Ask' button
    activate FE
    Note over FE: SkyWalking client starts Trace (T1)
    FE->>BE: POST /api/v1/query (Header: sw8=T1-...)
    deactivate FE
    activate BE
    Note over BE: Agent continues Trace (T1), starts Span (S1) for endpoint

    rect rgba(173, 216, 230, 0.3)
        Note over BE: Start Local Span (S2: RAG/retrieve_context)
        BE->>VDB: Query for context
        VDB-->>BE: Return documents
        Note over BE: Add tags: doc_count, scores. Stop Span S2.
    end

    rect rgba(144, 238, 144, 0.3)
        Note over BE: Start Local Span (S3: RAG/construct_prompt)
        BE->>BE: Build prompt from template and context
        Note over BE: Add tag: template_id. Stop Span S3.
    end

    rect rgba(255, 228, 181, 0.3)
        Note over BE: Start Local Span (S4: RAG/call_llm_api)
        BE->>LLM: Send final prompt
        LLM-->>BE: Return generated text and usage stats
        Note over BE: Add tags: model_name, token_counts. Stop Span S4.
    end

    Note over BE: Stop Endpoint Span (S1)
    BE-->>FE: 200 OK with answer
    deactivate BE
    activate FE
    FE->>Browser: Display response
    deactivate FE

Now when a user reports a slow query, we can find their exact trace and see a waterfall diagram showing the duration of each internal step. If RAG/retrieve_context is the long bar, we can inspect its tags and see if an unusually high number of documents were retrieved. If RAG/call_llm_api is slow, we check its tags to see which model was used and how many tokens were generated. The debugging process is no longer guesswork; it’s a data-driven analysis of a high-fidelity trace.

Lingering Issues and Future Iterations

This solution provides immense value, but it’s not a silver bullet. In a high-traffic production environment, tracing every single request is prohibitively expensive in terms of both performance overhead and storage costs for the observability backend. The immediate next step is to configure sampling. SkyWalking supports probabilistic sampling (e.g., trace 5% of all requests), but a more intelligent approach would be tail-based or error-focused sampling: always trace 100% of requests that result in an error, and sample the successful ones.
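
For the agent-side knob, the skywalking-python agent implements head sampling as a cap on the number of trace segments per 3-second window rather than a percentage. The exact configuration name has moved between agent versions, so treat the variable below as an assumption to verify against the configuration table of the agent release you run:

# docker-compose.yml (addition to the backend service) -- verify the exact variable name for your agent version
  backend:
    environment:
      - SW_AGENT_SAMPLE_N_PER_3_SECS=10  # sample at most 10 segments every 3 seconds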

Furthermore, the metadata we’re capturing is primarily operational. The next evolution is to capture semantic and quality metrics. We could add another span, RAG/evaluate_answer, that uses a separate LLM call or a cheaper model to score the generated answer for relevance and factuality, adding these scores as tags. This would allow us to correlate user-reported issues not just with latency but with the semantic quality of the AI’s output.
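
Structurally, that evaluation step would look just like the spans above. The sketch below reuses the local-span pattern from rag_pipeline.py; the score_answer() helper is a placeholder for whatever judge model or heuristic we eventually adopt:

# backend/rag_pipeline.py (future iteration, sketch only)
from skywalking.trace.context import get_context
from skywalking.trace.tags import Tag

class TagAnswerRelevance(Tag):
    key = 'llm.eval.relevance'

class TagAnswerFactuality(Tag):
    key = 'llm.eval.factuality'

def score_answer(query: str, answer: str) -> tuple:
    # Placeholder: in practice this would call a cheaper judge model or an
    # offline heuristic and return scores in the range [0, 1].
    return 0.9, 0.85

def evaluate_answer(user_query: str, answer: str) -> None:
    context = get_context()
    with context.new_local_span(op="RAG/evaluate_answer") as span:
        relevance, factuality = score_answer(user_query, answer)
        span.tag(TagAnswerRelevance(f"{relevance:.2f}"))
        span.tag(TagAnswerFactuality(f"{factuality:.2f}"))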

Finally, this implementation uses a synchronous, sequential pipeline. Modern LLM applications are increasingly using asynchronous agentic workflows with parallel tool calls. Propagating the trace context correctly across these complex execution graphs is a significant challenge. It requires careful manual context management that goes beyond the automatic instrumentation provided by the agent, representing the next frontier in LLM observability.

