Debugging our distributed order processing system had become an exercise in frustration. The Saga pattern, implemented across three core microservices, ensured eventual consistency but created an operational black box. When an order failed, tracking its lifecycle involved grepping through logs on different machines, manually correlating timestamps, and piecing together a story from disjointed fragments. This process was slow, error-prone, and fundamentally unscalable. The initial pain point wasn’t the saga logic itself, but the utter lack of visibility into its execution flow, especially in aggregate. We couldn’t answer simple questions like “Which saga step fails most often?” or “What is the P95 latency for the payment verification step?”.
The initial concept was to build a centralized dashboard. The first pass at a solution was a simple log aggregator feeding into a plain text viewer. This was a marginal improvement, but it still required developers to read and mentally parse hundreds of log lines. The real breakthrough came from the idea of treating saga execution data not just as logs to be read, but as a dataset to be visualized. We needed a system that could not only collate logs per transaction but also generate statistical insights on the fly, presenting them graphically alongside the raw log data.
This led to a multi-faceted technology selection. For the frontend, React with Material-UI (MUI) was the pragmatic choice for rapidly building a clean, functional internal tool. To prevent style bleed and ensure maintainability, we committed to using CSS Modules for any custom styling. On the backend, Grafana Loki was selected for log aggregation due to its simplicity and powerful label-based querying, which is perfectly suited for tagging logs with a per-saga correlation_id. The core of the visualization engine would be a Python service using Flask, leveraging the pandas library for data manipulation and Seaborn for generating rich statistical plots. This Python backend would act as an intermediary, querying Loki, processing the data, and serving visualizations to the React frontend.
The Foundation: Structured Logging for Sagas
The entire system hinges on one critical principle: structured, correlatable logs. Without a consistent format and a shared transaction identifier, any attempt at aggregation is doomed. We mandated that every log entry related to a saga must be a JSON object containing a correlation_id.

Here is the logging setup for one of the participating Python services, the payment-service.
# payment-service/logger_config.py
import logging
import json
from contextvars import ContextVar

# ContextVar allows us to carry the correlation_id through the call stack
# without passing it as an argument to every function.
correlation_id_var = ContextVar('correlation_id', default=None)

class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_record = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "payment-service",
            "correlation_id": correlation_id_var.get(),
            "saga_step": getattr(record, "saga_step", "N/A"),
        }
        # Add exception info if it exists
        if record.exc_info:
            log_record['exc_info'] = self.formatException(record.exc_info)
        return json.dumps(log_record)

def setup_logging():
    """Configures the root logger for the application."""
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    # Remove any existing handlers
    if logger.hasHandlers():
        logger.handlers.clear()
    handler = logging.StreamHandler()
    formatter = JsonFormatter()
    handler.setFormatter(formatter)
    logger.addHandler(handler)

# Example usage within a saga step execution
def process_payment(order_id, amount, correlation_id):
    correlation_id_var.set(correlation_id)
    extra_info = {"saga_step": "ProcessPayment"}
    logging.info(f"Starting payment processing for order {order_id}", extra=extra_info)
    try:
        # ... payment gateway interaction logic ...
        if amount < 0:
            raise ValueError("Payment amount cannot be negative.")
        logging.info(f"Payment successful for order {order_id}", extra=extra_info)
        return {"status": "SUCCESS"}
    except Exception as e:
        # This log automatically includes traceback information via exc_info
        logging.error(f"Payment failed for order {order_id}: {e}", exc_info=True, extra=extra_info)
        return {"status": "FAILED", "reason": str(e)}
This setup ensures every log line is a self-contained, machine-readable JSON document. The use of ContextVar is a crucial detail for real-world projects; it prevents correlation_id from polluting every function signature and makes the logging transparent to the business logic.
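In practice, the correlation_id is set once at the service boundary rather than in every handler. Here is a minimal sketch of that wiring, assuming the services expose HTTP endpoints via Flask and propagate the ID in an X-Correlation-ID header; the header name and the boundary module are illustrative conventions, not part of the setup shown above:

# payment-service/boundary.py (illustrative, not part of the setup above)
import uuid
from flask import Flask, request

from logger_config import correlation_id_var, setup_logging

app = Flask(__name__)
setup_logging()

@app.before_request
def bind_correlation_id():
    # Reuse the caller's ID when present; mint one otherwise so that
    # even saga-initiating requests are traceable end to end.
    incoming = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(incoming)

Downstream calls then forward the same header, so every service participating in the saga logs the same ID without any business code being aware of it.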
Ingestion and Querying with Loki
With structured logs being emitted, the next step is to collect them. We use Promtail as the agent and Loki as the storage backend. The configuration is designed to parse the JSON logs and promote key fields to Loki labels for efficient querying.
Here’s the promtail-config.yml snippet:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: microservices
    static_configs:
      - targets:
          - localhost
        labels:
          job: services-logs
          __path__: /var/log/services/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            correlation_id: correlation_id
            saga_step: saga_step
      - labels:
          level:
          service:
          saga_step:
A key decision here is what to make a label. A common mistake is to label high-cardinality fields like correlation_id. This would lead to a “label cardinality explosion” and cripple Loki’s performance. Instead, we label service, level, and saga_step, which each have a small, finite set of values. The correlation_id is used for filtering log content, not as a label.
The LogQL query to find all logs for a specific saga instance is then:

{job="services-logs"} | json | correlation_id = "saga-instance-12345" | line_format "{{.message}}"

The correlation_id filter operates on the fields extracted by the json parser stage, so it works without any label support.
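Before building anything on top of this query, it is worth sanity-checking it directly against Loki's HTTP API. A throwaway script along these lines (assuming Loki is reachable on localhost:3100 and a saga with this ID actually exists) prints the matching lines:

# check_query.py -- quick sanity check against Loki's query_range API
import time
import requests

LOKI = "http://localhost:3100"
query = '{job="services-logs"} | json | correlation_id = "saga-instance-12345"'

resp = requests.get(
    f"{LOKI}/loki/api/v1/query_range",
    params={
        "query": query,  # requests URL-encodes params for us
        "start": int((time.time() - 86400) * 1e9),  # 24h ago, nanosecond epoch
        "end": int(time.time() * 1e9),
        "limit": 100,
        "direction": "forward",
    },
    timeout=10,
)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    for ts, line in stream["values"]:
        print(line)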
The Backend: A Visualization and Data API
The Python backend service is the heart of this solution. It exposes endpoints that the React frontend will consume.
1. Dockerfile for the Python service:
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
RUN pip install --no-cache-dir Flask pandas seaborn requests matplotlib
COPY . .
CMD ["flask", "--app", "app:app", "run", "--host=0.0.0.0"]
2. The Flask application (app.py):

This application has two main endpoints: one to fetch raw logs for a given correlation_id and another to generate a statistical plot of saga step failure counts.
# app.py
import os
import io
import json
import time
import base64
import logging

import requests
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; we render to in-memory buffers, not a display
import matplotlib.pyplot as plt
import seaborn as sns
from flask import Flask, jsonify

app = Flask(__name__)

# In a real project, this would come from a config file or env var
LOKI_URL = os.environ.get("LOKI_URL", "http://loki:3100")

# Configure basic logging for the API itself
logging.basicConfig(level=logging.INFO)

def last_24h_range():
    """Returns (start, end) of the query window as nanosecond epoch timestamps."""
    now = time.time()
    return int((now - 24 * 3600) * 1e9), int(now * 1e9)

@app.route('/api/saga_logs/<correlation_id>', methods=['GET'])
def get_saga_logs(correlation_id):
    """Fetches all raw log lines for a given correlation ID from Loki."""
    if not correlation_id:
        return jsonify({"error": "correlation_id is required"}), 400

    # LogQL query filtering on extracted fields. This avoids high-cardinality labels.
    # Note: the ID is interpolated into the query; validate or escape it in production.
    query = f'{{job="services-logs"}} | json | correlation_id="{correlation_id}"'

    # We query the last 24 hours. In a production system, start/end times
    # might be passed from the client instead.
    start, end = last_24h_range()
    params = {
        'query': query,  # requests URL-encodes params itself; no manual quoting needed
        'start': start,
        'end': end,
        'limit': 500,  # Set a reasonable limit
        'direction': 'forward',
    }
    try:
        response = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
        response.raise_for_status()  # Raises HTTPError for bad responses (4xx or 5xx)
        data = response.json()
        # The structure of the response needs to be parsed carefully
        results = data.get('data', {}).get('result', [])
        log_entries = []
        for result in results:
            for value in result.get('values', []):
                # value is a pair: [timestamp, log_line_string]
                log_entry_json = json.loads(value[1])
                log_entries.append({
                    "timestamp": log_entry_json.get("timestamp"),
                    "service": log_entry_json.get("service"),
                    "level": log_entry_json.get("level"),
                    "message": log_entry_json.get("message"),
                    "saga_step": log_entry_json.get("saga_step"),
                })
        # Sort by timestamp to ensure chronological order
        log_entries.sort(key=lambda x: x['timestamp'])
        return jsonify(log_entries)
    except requests.exceptions.RequestException as e:
        logging.error(f"Error querying Loki: {e}")
        return jsonify({"error": "Failed to connect to Loki"}), 502
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return jsonify({"error": "Internal server error"}), 500

@app.route('/api/saga_stats/step_failures', methods=['GET'])
def get_step_failure_chart():
    """
    Generates a bar chart of saga step failure counts over the last 24 hours.
    Returns the chart as a base64-encoded PNG image.
    """
    query = '{job="services-logs", level="ERROR"} | json | saga_step != "N/A"'
    start, end = last_24h_range()
    params = {'query': query, 'start': start, 'end': end, 'limit': 5000}  # larger sample for stats
    try:
        response = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
        response.raise_for_status()
        data = response.json()
        results = data.get('data', {}).get('result', [])
        if not results:
            return jsonify({"message": "No failure data found"}), 200
        failure_logs = []
        for result in results:
            for value in result.get('values', []):
                log_json = json.loads(value[1])
                failure_logs.append({"saga_step": log_json.get("saga_step")})
        df = pd.DataFrame(failure_logs)
        if df.empty:
            return jsonify({"message": "No failure data found"}), 200

        # Create the plot using Seaborn
        plt.figure(figsize=(10, 6))
        sns.set_theme(style="whitegrid")
        ax = sns.countplot(y='saga_step', data=df, order=df['saga_step'].value_counts().index)
        ax.set_title('Saga Step Failure Counts (Last 24h)')
        ax.set_xlabel('Failure Count')
        ax.set_ylabel('Saga Step')
        plt.tight_layout()

        # Save plot to an in-memory buffer
        buf = io.BytesIO()
        plt.savefig(buf, format='png')
        buf.seek(0)
        plt.close()  # Important to release memory
        img_base64 = base64.b64encode(buf.read()).decode('utf-8')
        return jsonify({"image": img_base64})
    except requests.exceptions.RequestException as e:
        logging.error(f"Error querying Loki for stats: {e}")
        return jsonify({"error": "Failed to connect to Loki"}), 502
    except Exception as e:
        logging.error(f"Error generating chart: {e}")
        return jsonify({"error": "Internal server error during chart generation"}), 500

if __name__ == '__main__':
    app.run(debug=True)
The critical part here is the /api/saga_stats/step_failures endpoint. It queries Loki for error logs, loads them into a pandas DataFrame, and then uses Seaborn to generate a countplot. Instead of saving to a file, it writes the plot to an in-memory buffer, base64-encodes it, and sends it as JSON. This avoids disk I/O and simplifies data transfer to the frontend, a pragmatic trade-off for an internal tool.
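As a quick smoke test of that contract (assuming the Flask service is running locally on its default port 5000), the chart can be fetched and decoded back into a viewable PNG with a few lines of Python:

# fetch_chart.py -- decode the base64 chart payload into a PNG file
import base64
import requests

resp = requests.get("http://localhost:5000/api/saga_stats/step_failures", timeout=30)
resp.raise_for_status()
payload = resp.json()

if "image" in payload:
    with open("step_failures.png", "wb") as f:
        f.write(base64.b64decode(payload["image"]))
    print("Wrote step_failures.png")
else:
    # The endpoint returns a message instead of an image when no failures exist
    print(payload.get("message", "No chart returned"))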
The Frontend: An Interactive Diagnostic UI
The React application provides the user interface for developers to interact with the saga data.
1. Project Structure:
src/
├── components/
│ ├── SagaDetail/
│ │ ├── SagaDetail.js
│ │ └── SagaDetail.module.css
│ ├── SagaStats/
│ │ └── SagaStats.js
│ └── SagaList.js
└── App.js
2. The Statistics Component (SagaStats.js):

This component fetches the base64-encoded image from our Python API and displays it.
// src/components/SagaStats/SagaStats.js
import React, { useState, useEffect } from 'react';
import { Card, CardContent, Typography, CircularProgress, Box } from '@mui/material';

const SagaStats = () => {
  const [chart, setChart] = useState(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  useEffect(() => {
    const fetchStats = async () => {
      try {
        setLoading(true);
        const response = await fetch('/api/saga_stats/step_failures');
        if (!response.ok) {
          throw new Error(`HTTP error! status: ${response.status}`);
        }
        const data = await response.json();
        if (data.image) {
          setChart(`data:image/png;base64,${data.image}`);
        } else {
          setChart(null); // Handle case with no data
        }
        setError(null);
      } catch (e) {
        setError('Failed to load statistics chart.');
        console.error(e);
      } finally {
        setLoading(false);
      }
    };
    fetchStats();
  }, []);

  return (
    <Card>
      <CardContent>
        <Typography variant="h6" gutterBottom>
          Saga Failure Analysis
        </Typography>
        <Box sx={{ display: 'flex', justifyContent: 'center', alignItems: 'center', minHeight: 300 }}>
          {loading && <CircularProgress />}
          {error && <Typography color="error">{error}</Typography>}
          {!loading && !error && chart && (
            <img src={chart} alt="Saga step failure statistics" style={{ maxWidth: '100%', height: 'auto' }} />
          )}
          {!loading && !error && !chart && (
            <Typography>No failure data available to display.</Typography>
          )}
        </Box>
      </CardContent>
    </Card>
  );
};

export default SagaStats;
3. The Detail View and CSS Modules (SagaDetail.js & SagaDetail.module.css):

This component displays the raw logs for a selected saga. Here, we use CSS Modules to apply custom styling to log levels without affecting the global MUI styles or other components.
SagaDetail.module.css:

/* src/components/SagaDetail/SagaDetail.module.css */
.logEntry {
  font-family: 'Fira Code', 'Courier New', Courier, monospace;
  padding: 4px 8px;
  border-radius: 4px;
  margin-bottom: 2px;
  white-space: pre-wrap;
  word-break: break-all;
}

.logLevelINFO {
  background-color: #e3f2fd;
  color: #1565c0;
}

.logLevelERROR {
  background-color: #ffebee;
  color: #c62828;
  font-weight: bold;
}

.logLevelWARNING {
  background-color: #fff8e1;
  color: #ff8f00;
}
SagaDetail.js:

// src/components/SagaDetail/SagaDetail.js
import React, { useState, useEffect } from 'react';
import { Paper, Typography, Box, CircularProgress, Alert } from '@mui/material';
// Import the CSS module
import styles from './SagaDetail.module.css';

const SagaDetail = ({ correlationId }) => {
  const [logs, setLogs] = useState([]);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState(null);

  useEffect(() => {
    if (!correlationId) {
      setLogs([]);
      return;
    }
    const fetchLogs = async () => {
      try {
        setLoading(true);
        setError(null);
        const response = await fetch(`/api/saga_logs/${correlationId}`);
        if (!response.ok) throw new Error('Failed to fetch logs.');
        const data = await response.json();
        setLogs(data);
      } catch (e) {
        setError(e.message);
      } finally {
        setLoading(false);
      }
    };
    fetchLogs();
  }, [correlationId]);

  // A mapping function to get the correct CSS module class
  const getLogLevelClass = (level) => {
    switch (level) {
      case 'INFO':
        return styles.logLevelINFO;
      case 'ERROR':
        return styles.logLevelERROR;
      case 'WARNING':
        return styles.logLevelWARNING;
      default:
        return '';
    }
  };

  return (
    <Paper elevation={3} sx={{ p: 2, height: '600px', overflowY: 'auto' }}>
      <Typography variant="h6">Logs for: {correlationId || 'No Saga Selected'}</Typography>
      {loading && <CircularProgress />}
      {error && <Alert severity="error">{error}</Alert>}
      <Box component="pre" sx={{ mt: 2 }}>
        {logs.map((log, index) => (
          <div key={index} className={`${styles.logEntry} ${getLogLevelClass(log.level)}`}>
            <code>
              {`${log.timestamp} [${log.service}] [${log.saga_step}] - ${log.message}`}
            </code>
          </div>
        ))}
      </Box>
    </Paper>
  );
};

export default SagaDetail;

The use of className={`${styles.logEntry} ${getLogLevelClass(log.level)}`} is the essence of CSS Modules in practice. It ensures the .logEntry and .logLevelERROR styles are scoped to this component only, preventing conflicts with any other part of the application. The final result is a powerful diagnostic tool: an engineer can see the aggregate failure chart, identify a problematic step, and then drill down into specific failed saga instances to view the exact, correlated log messages that tell the story of the failure.
The current implementation, while functional, is not without its limitations. The on-the-fly generation of Seaborn plots in the API is synchronous and could become a performance bottleneck under heavy use. A more robust solution might involve a background worker that pre-generates these visualizations periodically and caches them. Furthermore, relying solely on Loki queries for statistical aggregation can be slow for large time windows. For true real-time monitoring, a better architecture would be to have the services emit structured metrics to a system like Prometheus, which is purpose-built for efficient numerical aggregation, while still using Loki for the detailed, high-cardinality log data needed for drill-down analysis. The current tool serves its purpose as a powerful diagnostic aid, but its evolution into a full-fledged monitoring platform would require these architectural considerations.
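To make that last point concrete, here is a minimal sketch of what step-level metrics emission could look like in the participating services, assuming the prometheus_client library and a scrapeable metrics endpoint; none of this exists in the current system, and the metric and step names are illustrative:

# Hypothetical metrics emission in a saga participant (not part of the
# current implementation); assumes the prometheus_client package.
from prometheus_client import Counter, Histogram, start_http_server

SAGA_STEP_DURATION = Histogram(
    "saga_step_duration_seconds",
    "Wall-clock duration of saga step executions",
    ["service", "saga_step"],
)
SAGA_STEP_FAILURES = Counter(
    "saga_step_failures_total",
    "Number of failed saga step executions",
    ["service", "saga_step"],
)

def run_step(step_name, func, *args, **kwargs):
    """Executes a saga step, recording its duration and any failure."""
    labels = {"service": "payment-service", "saga_step": step_name}
    with SAGA_STEP_DURATION.labels(**labels).time():
        try:
            return func(*args, **kwargs)
        except Exception:
            SAGA_STEP_FAILURES.labels(**labels).inc()
            raise

# Expose the metrics for Prometheus to scrape (the port is arbitrary here).
start_http_server(8000)

With a histogram like this being scraped, the P95 latency question from the introduction reduces to a histogram_quantile() query over saga_step_duration_seconds_bucket, with no log parsing involved, while Loki keeps its role as the drill-down store for the full per-saga narrative.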