Constructing a Searchable Distributed Tracing Pipeline with Istio, an API Gateway, and Meilisearch


The transition to a microservices architecture solved our deployment isolation problems but introduced a new, more insidious one: operational blindness. When a user reported a slow API response, the subsequent investigation was a frantic exercise in archaeology. We’d grep through logs across a dozen Kubernetes pods, trying to stitch together a narrative from timestamps and disparate request IDs. The process was slow, unreliable, and failed completely under concurrent traffic. Our mean-time-to-resolution (MTTR) was unacceptable, and every incident felt like we were starting from scratch.

Our initial concept was to move beyond logging and embrace distributed tracing. However, mainstream open-source solutions like Jaeger felt heavy for our needs. Their query interfaces were often restrictive, and we found ourselves wanting the ability to perform fast, full-text searches across trace attributes, almost like querying a database. The idea crystallized: what if we treated trace data not as a time-series blob but as structured documents in a high-performance search engine? This would empower engineers to ask complex, ad-hoc questions during an outage, like “Find all traces tagged with customer_id: 123 that invoked the inventory-service and exceeded 500ms latency in the last hour.”

This led to a technology selection process guided by a single principle: leverage the infrastructure we already have and introduce a specialized component for the one thing we need most—query speed.

  • Service Mesh (Istio): We were already using Istio for traffic management and mTLS. Its most compelling feature for this project is automatic trace context propagation. By enabling tracing in the mesh, Istio’s Envoy sidecars automatically forward the traceparent and tracestate headers (an example follows this list). This frees application developers from the burden of manual instrumentation for every network call, a massive win for adoption and consistency.
  • API Gateway (Gloo Edge): As our ingress, the API Gateway is the logical place to initiate the trace. We chose Gloo Edge because it’s also built on Envoy, ensuring configuration patterns are similar to Istio. More importantly, it has first-class support for generating OpenTelemetry (OTLP) traces, which has become the industry standard. It can start the trace and inject the initial headers before the request even enters our mesh.
  • Search Engine (Meilisearch): This was the unconventional choice. Instead of Elasticsearch or a dedicated observability backend, we opted for Meilisearch. The rationale is purely pragmatic: for our specific use case—incident response—sub-second query latency is non-negotiable. Meilisearch is designed for this “type-as-you-search” speed. We were willing to trade the complex analytical and aggregation features of Elasticsearch for raw query performance and operational simplicity. Meilisearch’s schema-less nature and simple API meant we could get a proof-of-concept running in hours, not weeks.
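
The propagation mentioned in the first bullet is nothing more than HTTP headers. As a quick illustration (host, path, and ID values here are made up), a call from any meshed workload carries W3C Trace Context headers in the form traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>:

# Illustrative only: the hostname and header values are placeholders.
curl http://service-b.default.svc.cluster.local:8080/items \
  -H 'traceparent: 00-5f2f18a203b8643193e549114e743452-00f067aa0ba902b7-01' \
  -H 'tracestate: vendor1=opaque-value'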

The resulting architecture is a data pipeline. Gloo Edge and Istio generate spans, an OpenTelemetry Collector aggregates them, a custom adapter transforms the data, and Meilisearch indexes it for immediate querying.

The Foundation: Kubernetes and Core Services

Everything runs on Kubernetes. The following YAML deploys Meilisearch with a PersistentVolumeClaim to ensure our trace data survives pod restarts. In a real-world project, you’d use a more robust storage class.

# meilisearch-deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: meilisearch
  labels:
    app: meilisearch
spec:
  ports:
  - port: 7700
    name: http
    targetPort: 7700
  selector:
    app: meilisearch
  type: ClusterIP
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: meilisearch-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: meilisearch
  labels:
    app: meilisearch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: meilisearch
  template:
    metadata:
      labels:
        app: meilisearch
    spec:
      containers:
      - name: meilisearch
        image: getmeili/meilisearch:v1.3.2
        env:
          # In production, this should be a secret from a vault.
          - name: MEILI_MASTER_KEY
            value: "aVerySecureMasterKey"
        ports:
        - containerPort: 7700
          name: http
        volumeMounts:
        - name: meilidata
          mountPath: /meili_data
      volumes:
      - name: meilidata
        persistentVolumeClaim:
          claimName: meilisearch-data-pvc
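
Before wiring anything else to it, it is worth confirming Meilisearch is actually reachable. A port-forward plus its /health endpoint is enough:

kubectl port-forward svc/meilisearch 7700:7700 &
curl http://localhost:7700/health
# Expected: {"status":"available"}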

Installing Istio and Gloo Edge is a standard procedure that we assume has already been done. The critical part is configuring them to emit traces.

Configuring the Trace Generation Layer

First, we need an OpenTelemetry Collector to receive trace data from both Gloo Edge and the Istio mesh. It will be configured to receive OTLP over gRPC (from the gateway), to receive the OpenCensus protocol (from the Istio sidecars), and to export data over HTTP to our custom adapter, which we’ll build later.

# otel-collector-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  # The key name must match the --config path used by the Deployment below,
  # since the ConfigMap is mounted as files named after its keys.
  otel-agent-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      # Istio's openCensusAgent tracer speaks the OpenCensus agent protocol
      # rather than OTLP, so the mesh spans arrive on this receiver.
      opencensus:
        endpoint: 0.0.0.0:55678
    processors:
      batch:
        # Batches help in reducing the number of outgoing connections.
        # Tuned for throughput. Timeout is low to ensure data is flushed quickly.
        timeout: 1s
        send_batch_size: 512

    exporters:
      # We will send the transformed traces to a custom adapter service
      # which will then write to Meilisearch. traces_endpoint pins the exact
      # URL (a plain `endpoint` would get /v1/traces appended again), and
      # encoding: json matches what the adapter unmarshals (default is protobuf).
      otlphttp:
        traces_endpoint: "http://trace-adapter.default.svc.cluster.local:8080/v1/traces"
        encoding: json
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp, opencensus]
          processors: [batch]
          exporters: [otlphttp]

With the collector configuration defined, we deploy it. The Service exposes port 4317 for OTLP traffic from the gateway and port 55678 for the OpenCensus protocol used by the Istio sidecars.

# otel-collector-deployment.yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: opentelemetry
  ports:
  - name: otlp-grpc
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: opencensus
    port: 55678
    protocol: TCP
    targetPort: 55678
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opentelemetry
  template:
    metadata:
      labels:
        app: opentelemetry
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.85.0
        # The image's entrypoint is the collector binary, so we only override
        # the config path instead of hard-coding the binary location.
        args:
          - "--config=/conf/otel-agent-config.yaml"
        ports:
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 55678
          name: opencensus
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
      volumes:
      - name: otel-collector-config-vol
        configMap:
          name: otel-collector-conf
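
A quick sanity check that the collector came up with the mounted configuration and that both receiver ports are exposed (these commands assume the default namespace):

kubectl get pods -l app=opentelemetry
kubectl get svc otel-collector
# Tail the logs and look for receiver/exporter startup errors.
kubectl logs deploy/otel-collector --tail=50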

Next, we configure Istio’s mesh-wide tracing settings and point them at our collector. These live in the istio ConfigMap in istio-system, which is normally managed via istioctl install or an IstioOperator rather than edited by hand. The openCensusAgent tracer speaks the OpenCensus agent protocol, which is why it targets the collector’s opencensus receiver on port 55678 rather than the OTLP port. We also set a 100% sampling rate for this demonstration; in production, this would be a lower, configurable value.

# istio-tracing-config.yaml
# Edits the `mesh` entry of the `istio` ConfigMap that istiod reads.
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # Write Envoy access logs to stdout.
    accessLogFile: /dev/stdout

    # Default proxy config for all sidecars in the mesh.
    defaultConfig:
      tracing:
        sampling: 100.0
        openCensusAgent:
          # Matches the collector's opencensus receiver.
          address: "otel-collector.default.svc.cluster.local:55678"
          # Propagate W3C Trace Context so the gateway-initiated trace is joined.
          context: [W3C_TRACE_CONTEXT]

Now, for Gloo Edge. We configure tracing at the Gateway level, under the httpGateway listener’s HTTP connection manager settings, so that any request hitting any of our VirtualServices has a trace initiated. The OpenTelemetry provider points at a Gloo Upstream that represents the OTel collector.

# gloo-gateway-proxy-config.yaml
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  proxyNames:
    - gateway-proxy
  useProxyProto: false
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          # Adds more detail to the spans generated by the gateway.
          verbose: true
          openTelemetryConfig:
            # References the Gloo Upstream representing the OTel collector.
            collectorUpstreamRef:
              name: otel-collector-upstream
              namespace: gloo-system
---
# We must define the OTel Collector as an Upstream for Gloo to send data to it.
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: otel-collector-upstream
  namespace: gloo-system
spec:
  # OTLP is gRPC, so the upstream must speak HTTP/2.
  useHttp2: true
  static:
    hosts:
      - addr: otel-collector.default.svc.cluster.local
        port: 4317

A common mistake here is forgetting to define the Upstream that collectorUpstreamRef points at. Without it, Gloo has no cluster to hand Envoy’s OpenTelemetry tracer, and trace data will be silently dropped.
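
It is worth confirming that Gloo actually accepted the Upstream before moving on; its status should report an Accepted state:

# Inspect the Upstream status and check for an Accepted state.
kubectl get upstream otel-collector-upstream -n gloo-system -o yaml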

The OpenTelemetry Collector forwards spans as OTLP JSON (thanks to the encoding: json setting on the exporter), while Meilisearch expects a simple JSON array of documents. We need a service to bridge this gap. This trace-adapter service will receive OTLP JSON, flatten the complex structure of spans into searchable documents, and batch-insert them into Meilisearch.

Here is the core logic of the adapter, written in Go. It’s designed to be stateless and horizontally scalable.

// main.go (trace-adapter service)
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"os"
	"strconv"
	"time"

	"github.com/meilisearch/meilisearch-go"
)

const (
	meiliAddress  = "http://meilisearch.default.svc.cluster.local:7700"
	meiliIndexUID = "traces"
)

var meiliClient *meilisearch.Client

// MeiliSpan is the flattened document structure we will store in Meilisearch.
// We are cherry-picking fields from the OTLP span format that are most
// relevant for searching and filtering.
type MeiliSpan struct {
	TraceID        string                 `json:"trace_id"`
	SpanID         string                 `json:"span_id"`
	ParentSpanID   string                 `json:"parent_span_id,omitempty"`
	Name           string                 `json:"name"`
	ServiceName    string                 `json:"service_name"`
	Kind           int32                  `json:"kind"`
	StartTimeUnix  int64                  `json:"start_time_unix"`
	EndTimeUnix    int64                  `json:"end_time_unix"`
	DurationNanos  int64                  `json:"duration_nanos"`
	StatusCode     int32                  `json:"status_code"`
	StatusMessage  string                 `json:"status_message,omitempty"`
	Attributes     map[string]interface{} `json:"attributes"`
}

func main() {
	masterKey := os.Getenv("MEILI_MASTER_KEY")
	if masterKey == "" {
		log.Fatal("MEILI_MASTER_KEY environment variable not set.")
	}

	meiliClient = meilisearch.NewClient(meilisearch.ClientConfig{
		Host:   meiliAddress,
		APIKey: masterKey,
	})

	// Perform a health check on Meilisearch on startup.
	if !meiliClient.IsHealthy() {
		log.Fatalf("Meilisearch at %s is not healthy.", meiliAddress)
	}
	
	// Create the index if it doesn't exist and configure it.
	// This operation is idempotent.
	setupMeiliIndex()

	http.HandleFunc("/v1/traces", handleTraces)
	log.Println("Trace adapter server starting on port 8080...")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Failed to start server: %v", err)
	}
}

// OTLP structures are complex; we only need a subset.
// These are simplified structs to unmarshal the incoming JSON from OTel Collector.
type OTLPTraces struct {
	ResourceSpans []ResourceSpan `json:"resourceSpans"`
}
type ResourceSpan struct {
	Resource   Resource    `json:"resource"`
	ScopeSpans []ScopeSpan `json:"scopeSpans"`
}
type Resource struct {
	Attributes []Attribute `json:"attributes"`
}
type ScopeSpan struct {
	Spans []Span `json:"spans"`
}
type Span struct {
	TraceID           string      `json:"traceId"`
	SpanID            string      `json:"spanId"`
	ParentSpanID      string      `json:"parentSpanId"`
	Name              string      `json:"name"`
	Kind              int32       `json:"kind"`
	StartTimeUnixNano string      `json:"startTimeUnixNano"`
	EndTimeUnixNano   string      `json:"endTimeUnixNano"`
	Attributes        []Attribute `json:"attributes"`
	Status            Status      `json:"status"`
}
type Attribute struct {
	Key   string    `json:"key"`
	Value ValueBody `json:"value"`
}
type ValueBody struct {
	StringValue string `json:"stringValue"`
	IntValue    string `json:"intValue"`
	BoolValue   bool   `json:"boolValue"`
}
type Status struct {
	// OTLP/JSON encodes enum values as integers, so the status code arrives
	// as a number (0 = UNSET, 1 = OK, 2 = ERROR).
	Code    int32  `json:"code"`
	Message string `json:"message"`
}

func handleTraces(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "Only POST method is accepted", http.StatusMethodNotAllowed)
		return
	}

	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "Error reading request body", http.StatusInternalServerError)
		return
	}
	defer r.Body.Close()

	var traces OTLPTraces
	if err := json.Unmarshal(body, &traces); err != nil {
		log.Printf("Failed to unmarshal OTLP JSON: %v. Body was: %s", err, string(body))
		http.Error(w, "Error unmarshalling OTLP JSON", http.StatusBadRequest)
		return
	}

	meiliSpans, err := transformToMeiliSpans(traces)
	if err != nil {
		log.Printf("Failed to transform spans: %v", err)
		http.Error(w, "Error processing spans", http.StatusInternalServerError)
		return
	}

	if len(meiliSpans) == 0 {
		w.WriteHeader(http.StatusOK)
		return
	}

	// In a production system, this should be an asynchronous operation.
	// If Meilisearch is down, we could buffer to a queue.
	_, err = meiliClient.Index(meiliIndexUID).AddDocuments(meiliSpans, "span_id")
	if err != nil {
		log.Printf("Failed to add documents to Meilisearch: %v", err)
		http.Error(w, "Failed to index documents", http.StatusInternalServerError)
		return
	}

	w.WriteHeader(http.StatusOK)
}

func transformToMeiliSpans(traces OTLPTraces) ([]MeiliSpan, error) {
	var meiliSpans []MeiliSpan
	for _, rs := range traces.ResourceSpans {
		serviceName := "unknown-service"
		for _, attr := range rs.Resource.Attributes {
			if attr.Key == "service.name" {
				serviceName = attr.Value.StringValue
				break
			}
		}

		for _, ss := range rs.ScopeSpans {
			for _, span := range ss.Spans {
				// OTLP/JSON encodes the uint64 timestamps as decimal strings of
				// nanoseconds since the Unix epoch, not RFC3339.
				startNanos, _ := strconv.ParseInt(span.StartTimeUnixNano, 10, 64)
				endNanos, _ := strconv.ParseInt(span.EndTimeUnixNano, 10, 64)

				ms := MeiliSpan{
					TraceID:       span.TraceID,
					SpanID:        span.SpanID,
					ParentSpanID:  span.ParentSpanID,
					Name:          span.Name,
					ServiceName:   serviceName,
					Kind:          span.Kind,
					StartTimeUnix: startNanos / int64(time.Second),
					EndTimeUnix:   endNanos / int64(time.Second),
					DurationNanos: endNanos - startNanos,
					StatusCode:    span.Status.Code,
					StatusMessage: span.Status.Message,
					Attributes:    make(map[string]interface{}),
				}
				for _, attr := range span.Attributes {
					// Simple value extraction; production code needs to handle all types.
					if attr.Value.StringValue != "" {
						ms.Attributes[attr.Key] = attr.Value.StringValue
					} else if attr.Value.IntValue != "" {
						ms.Attributes[attr.Key] = attr.Value.IntValue
					} else {
						ms.Attributes[attr.Key] = attr.Value.BoolValue
					}
				}
				meiliSpans = append(meiliSpans, ms)
			}
		}
	}
	return meiliSpans, nil
}

// setupMeiliIndex configures the 'traces' index with optimal settings
// for our query patterns. This is a critical step for performance.
func setupMeiliIndex() {
	index := meiliClient.Index(meiliIndexUID)

	// Define which attributes we want to filter on.
	// This is the most important setting for query performance.
	filterableAttributes := []string{
		"trace_id",
		"service_name",
		"name",
		"kind",
		"status_code",
		"duration_nanos",
		"start_time_unix",
	}
	
	// Define attributes to sort by.
	sortableAttributes := []string{
		"start_time_unix",
		"duration_nanos",
	}

	// The primary key for each document is the span ID.
	_, err := meiliClient.CreateIndex(&meilisearch.IndexConfig{
		Uid:        meiliIndexUID,
		PrimaryKey: "span_id",
	})
	if err != nil {
		// This might fail if index already exists, which is fine.
		// A real implementation would check the error type.
		log.Printf("Could not create index (may already exist): %v", err)
	}

	// Atomically update all settings.
	task, err := index.UpdateSettings(&meilisearch.Settings{
		FilterableAttributes: filterableAttributes,
		SortableAttributes:   sortableAttributes,
	})
	if err != nil {
		log.Fatalf("Failed to update Meilisearch settings: %v", err)
	}
	log.Printf("Meilisearch settings update task queued with UID: %d", task.TaskUID)
}

The pitfall here is the setupMeiliIndex function. Without explicitly defining filterableAttributes, Meilisearch will not allow filtering on fields like trace_id or service_name, rendering our entire effort useless. This configuration step is as important as the data transformation itself.
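
One piece is still missing from the manifests: the collector’s exporter points at trace-adapter.default.svc.cluster.local:8080, so the adapter needs its own Deployment and Service. A minimal sketch is below; the image reference is a placeholder for wherever you build and push the Go service, and the master key should come from a Secret in anything beyond a demo.

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: trace-adapter
spec:
  selector:
    app: trace-adapter
  ports:
  - name: http
    port: 8080
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trace-adapter
  labels:
    app: trace-adapter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trace-adapter
  template:
    metadata:
      labels:
        app: trace-adapter
    spec:
      containers:
      - name: trace-adapter
        # Placeholder image: build the Go service above and push it to your registry.
        image: registry.example.com/trace-adapter:0.1.0
        env:
        - name: MEILI_MASTER_KEY
          value: "aVerySecureMasterKey" # as before, use a Secret in production
        ports:
        - containerPort: 8080
          name: http
EOF

The adapter can also be exercised on its own, before the rest of the pipeline is switched on, by posting a hand-written payload that mirrors the simplified OTLP/JSON structs above (the IDs and timestamps are made up):

kubectl port-forward svc/trace-adapter 8080:8080 &
curl -X POST http://localhost:8080/v1/traces \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "resourceSpans": [{
      "resource": {
        "attributes": [{"key": "service.name", "value": {"stringValue": "smoke-test"}}]
      },
      "scopeSpans": [{
        "spans": [{
          "traceId": "5f2f18a203b8643193e549114e743452",
          "spanId": "00f067aa0ba902b7",
          "name": "GET /smoke",
          "kind": 2,
          "startTimeUnixNano": "1700000000000000000",
          "endTimeUnixNano": "1700000000250000000",
          "attributes": [{"key": "http.method", "value": {"stringValue": "GET"}}]
        }]
      }]
    }]
  }'
# A 200 response plus a matching document in the traces index means the
# write path into Meilisearch works end to end.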

Demonstrating the Full Pipeline

With all components deployed, we can now send a request through the system and query the resulting trace. First, we deploy two simple services, service-a and service-b, where service-a calls service-b. Both are included in the Istio mesh.
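
The two services only end up in the mesh if their namespace has sidecar injection enabled (or if they are injected manually). One way to do that, assuming they run in a dedicated application namespace:

# Enable automatic sidecar injection, then (re)deploy service-a and service-b.
kubectl label namespace <your-app-namespace> istio-injection=enabled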

A request is sent to service-a through the Gloo Edge gateway:
curl http://<GLOO_GATEWAY_IP>:8080/service-a
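
The <GLOO_GATEWAY_IP> placeholder can be resolved from the gateway-proxy Service that Gloo Edge installs; note that the external port depends on how that Service maps to the listener bound on 8080 above:

GLOO_GATEWAY_IP=$(kubectl get svc gateway-proxy -n gloo-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# Some cloud providers publish a hostname instead of an IP:
#   -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
echo "$GLOO_GATEWAY_IP"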

The following sequence occurs:

  1. Gloo Edge receives the request, generates a trace_id and an initial parent span, and adds the traceparent header. It exports this span to the OTel Collector.
  2. The request is routed to the service-a pod. The Istio sidecar intercepts it, reads the traceparent header, and generates a new span representing the server-side work.
  3. service-a makes an HTTP call to service-b. The Istio sidecar for service-a intercepts this egress call, generates a client-side span, and ensures the traceparent header is propagated to service-b.
  4. The Istio sidecar for service-b receives the request, generates its own server-side span, and so on.
  5. All these spans, from different sources, flow to the OTel Collector, get batched, and are sent to our trace-adapter.
  6. The trace-adapter transforms and indexes them in Meilisearch.

Now, for the payoff. We can query Meilisearch directly. Let’s assume the request generated a trace with ID 5f2f18a203b8643193e549114e743452.

Querying for all spans in that trace:

curl \
  -X POST 'http://localhost:7700/indexes/traces/search' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer aVerySecureMasterKey' \
  --data-binary '{
    "filter": "trace_id = 5f2f18a203b8643193e549114e743452",
    "sort": ["start_time_unix:asc"]
  }'

The response is an immediate JSON array of all spans, ordered by their start time, allowing us to reconstruct the entire request flow.

A more complex query—finding all slow requests to service-b in the last 10 minutes:

# Calculate the timestamp for 10 minutes ago.
# (BSD/macOS date shown; on GNU/Linux use: date -d '10 minutes ago' +%s)
TEN_MIN_AGO=$(date -v-10M +%s)

curl \
  -X POST 'http://localhost:7700/indexes/traces/search' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer aVerySecureMasterKey' \
  --data-binary '{
    "filter": [
        "service_name = service-b",
        "duration_nanos > 500000000",
        "start_time_unix > '"$TEN_MIN_AGO"'"
    ]
  }'

This query executes in milliseconds, even with millions of documents, which is precisely the outcome we were aiming for.

The data flow can be visualized as follows:

sequenceDiagram
    participant User
    participant Gateway as Gloo Gateway
    participant Collector as OTel Collector
    participant Adapter as Trace Adapter
    participant Meili as Meilisearch
    participant SidecarA as Istio Sidecar A
    participant ServiceA as Service A
    participant SidecarB as Istio Sidecar B
    participant ServiceB as Service B

    User->>Gateway: GET /service-a
    Gateway->>Collector: Export Span 1 (Gateway)
    Gateway->>SidecarA: Forward Request
    SidecarA->>Collector: Export Span 2 (Server-A)
    SidecarA->>ServiceA: Request
    ServiceA->>SidecarA: Call Service B
    SidecarA->>Collector: Export Span 3 (Client-A)
    SidecarA->>SidecarB: Forward Request to B
    SidecarB->>Collector: Export Span 4 (Server-B)
    SidecarB->>ServiceB: Request
    ServiceB-->>SidecarB: Response
    SidecarB-->>SidecarA: Response
    SidecarA-->>ServiceA: Response
    ServiceA-->>SidecarA: Response
    SidecarA-->>Gateway: Response
    Gateway-->>User: Final Response

    Collector->>Adapter: Batch OTLP Spans
    Adapter->>Meili: Add Documents

This solution isn’t without its limitations. The choice of Meilisearch prioritizes query speed at the cost of storage efficiency and analytical depth. It is not a replacement for a long-term metrics and logging warehouse. Data retention must be managed aggressively, as storing high-cardinality trace data indefinitely would be prohibitively expensive. Furthermore, the trace-adapter is a custom component that carries a maintenance burden; if the OTLP specification evolves, the adapter must be updated. The current implementation’s error handling is also basic; in a production setting, a dead-letter queue would be necessary between the collector and the adapter to prevent data loss if Meilisearch is unavailable. The system’s true value is realized during active incidents, providing a surgical tool for rapid diagnosis, not a platform for historical trend analysis.
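
One pragmatic way to enforce that aggressive retention with the pieces already in place is Meilisearch’s delete-by-filter endpoint (available since v1.2, so covered by the v1.3.2 image used here), run on a schedule such as a Kubernetes CronJob. A curl sketch that drops spans older than 24 hours:

# start_time_unix is already a filterable attribute, which delete-by-filter requires.
CUTOFF=$(( $(date +%s) - 86400 ))
curl \
  -X POST 'http://localhost:7700/indexes/traces/documents/delete' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer aVerySecureMasterKey' \
  --data-binary '{"filter": "start_time_unix < '"$CUTOFF"'"}'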

