Implementing Real-Time PII Detection in Container Syscalls via eBPF and spaCy with a Solid.js Frontend

Observability

Word Count: 2.7k

Read Times: 17 Min

The discovery hit us during a routine log review. A new microservice, containerized and deployed, was leaking unsanitized user email addresses directly into its standard output. The immediate fix was a code patch and redeploy, but the incident left a nagging question: how could we have caught this at a lower level, without instrumenting every single application and hoping developers remembered to sanitize everything? This led to a fascinating and slightly unconventional idea: what if we could build a security layer that operates entirely outside the application, monitoring the data at the very boundary between the container’s user space and the host’s kernel?

Our concept was to intercept system calls like write and sendto originating from a specific container. We’d inspect the data buffers associated with these calls, stream them to an analysis engine for real-time natural language processing, and flag any potential Personally Identifiable Information (PII). The absolute constraint was zero application code modification. This ambitious goal immediately pointed to eBPF as the core technology. For the container environment, Podman’s daemonless architecture felt like a clean, lightweight choice. For the PII detection, simple regex felt too brittle; spaCy’s powerful pre-trained NLP models offered a much more robust solution. And to visualize the findings, a highly performant, real-time dashboard was essential, making Solid.js, with its fine-grained reactivity, the perfect candidate. This project was born from a post-mortem, but it quickly became an exploration into the synergy of these four distinct technologies.

The architecture we settled on looks like this:

graph TD
    subgraph "Podman Container (Victim App)"
        A[Python Flask App] -- logs PII --> B[sys_write call]
    end

    subgraph "Host Linux Kernel"
        B -- triggers --> C{eBPF kprobe}
        C -- sends data via perf buffer --> D[User Space Controller]
    end

    subgraph "Backend Controller (Python)"
        D -- polls buffer --> E[BCC Framework]
        E -- passes string --> F[spaCy NLP Engine]
        F -- detects PII --> G[WebSocket Server]
    end

    subgraph "Browser"
        H[Solid.js Dashboard] -- connects to --> G
        G -- streams alerts --> H
    end

    I[Operator] -- observes --> H

The Target: A Container That Leaks Data

First, we need a “victim” application. A simple Python Flask app running inside a Podman container will serve this purpose. The application will have an endpoint that, when called, deliberately logs a sentence containing PII.

Here is the Containerfile to build our target container:

# Containerfile
FROM python:3.11-slim

WORKDIR /app

RUN pip install Flask==3.0.0

COPY app.py .

EXPOSE 5000

CMD ["python", "app.py"]

And the Python application itself, app.py:

# app.py
from flask import Flask, request
import logging
import sys

# Configure logging to stream to stdout
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

app = Flask(__name__)

@app.route('/register', methods=['POST'])
def register_user():
    """
    An endpoint that simulates user registration and leaks PII into the logs.
    """
    try:
        data = request.get_json()
        if not data or 'name' not in data or 'email' not in data:
            return {"error": "Missing name or email"}, 400

        user_name = data['name']
        user_email = data['email']

        # This is the problematic line we want to detect externally.
        logging.info(f"Processing registration for user {user_name} with email {user_email} from New York.")
        
        # In a real app, you'd do database operations here.
        # ...

        return {"status": "success", "message": f"User {user_name} registered."}, 200
    except Exception as e:
        logging.error(f"Error during registration: {e}")
        return {"error": "Internal server error"}, 500

if __name__ == '__main__':
    # Running on 0.0.0.0 to be accessible from the host.
    app.run(host='0.0.0.0', port=5000)

We can build and run this container using Podman:

# Build the container image
podman build -t pii-leaker .

# Run the container in the background and map the port
podman run -d --name leaker-container -p 5000:5000 pii-leaker

To test it, we can send a curl request:

curl -X POST -H "Content-Type: application/json" \
-d '{"name": "Alice Johnson", "email": "[email protected]"}' \
http://localhost:5000/register

If we check the container’s logs with podman logs leaker-container, we will see the PII-laden log message. Our goal is to detect this without ever looking at these logs directly.

The Observer: eBPF Kernel Probe and User-Space Controller

This is the heart of the system. We need an eBPF program to run in the kernel and a user-space Python script to load it, manage it, and process the data it sends. We’ll use the BCC (BPF Compiler Collection) framework, which simplifies this interaction significantly.

The eBPF C Program

Our eBPF program will attach a kprobe (kernel probe) to the sys_enter_write syscall. When triggered, it will check if the call is from our target process ID (PID), and if so, it will read the data buffer and send it to our user-space program through a perf buffer.

// bpf_probe.c
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/fs.h>

#define MAX_BUFFER_SIZE 4096 // A reasonable size to capture log lines

// Data structure to send to user space
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
    char buf[MAX_BUFFER_SIZE];
};

// BPF perf buffer to send data events
BPF_PERF_OUTPUT(events);

// Entry point for the kprobe on the write syscall
int trace_write_entry(struct pt_regs *ctx, int fd, const char __user *buf, size_t count) {
    // 1. Filter by target PID. PID_FILTER is a placeholder
    //    that will be replaced by the user-space script.
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    if (pid != PID_FILTER) {
        return 0;
    }

    // 2. We are primarily interested in stdout (fd=1) and stderr (fd=2)
    //    as common sources for log leakage.
    if (fd != 1 && fd != 2) {
        return 0;
    }

    // 3. Create a data structure on the stack to hold our event
    struct data_t data = {};
    data.pid = pid;
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    // 4. Safely read the buffer from user space memory.
    //    This is the most critical and potentially fragile part.
    u32 size_to_read = count < MAX_BUFFER_SIZE ? count : MAX_BUFFER_SIZE - 1;
    bpf_probe_read_user(&data.buf, size_to_read, (void *)buf);
    
    // Null-terminate the string for safety in user space
    data.buf[size_to_read] = '\0';
    
    // 5. Submit the event to the perf buffer for user-space consumption.
    events.perf_submit(ctx, &data, sizeof(data));
    
    return 0;
}

A few critical points about this C code:

PID_FILTER: This is a placeholder. Our Python script will dynamically replace this with the actual PID of the containerized process before loading the BPF program.
bpf_probe_read_user: This helper function is essential for safely reading memory from the user-space process being traced. Direct dereferencing of the buf pointer would cause the BPF verifier to reject the program.
BPF_PERF_OUTPUT: This macro defines the high-performance, lock-free, per-CPU buffer that acts as the communication channel from kernel to user space.

The Python Controller

This Python script orchestrates everything on the user-space side. It finds the target PID, loads the eBPF program, listens for events, and then passes the data to spaCy and the WebSocket server.

# controller.py
import subprocess
import sys
import asyncio
import websockets
import json
import spacy
from bcc import BPF

# --- Configuration ---
CONTAINER_NAME = "leaker-container"
NLP_MODEL = "en_core_web_sm" # Use a small, efficient model
WEBSOCKET_HOST = "localhost"
WEBSOCKET_PORT = 8765

# --- Global State ---
connected_clients = set()
nlp = None

def get_container_pid(container_name):
    """
    Finds the PID of the main process inside a Podman container.
    A real-world implementation should be more robust, perhaps using cgroups.
    """
    try:
        # Podman's 'top' command can give us the PID on the host
        command = ["podman", "top", container_name, "pid"]
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        # The output is like:
        # PID
        # 12345
        pids = result.stdout.strip().split('\n')
        if len(pids) > 1:
            print(f"Found PID for container '{container_name}': {pids[1]}")
            return int(pids[1])
        else:
            print(f"Error: Could not parse PID from podman top output: {pids}", file=sys.stderr)
            return None
    except (subprocess.CalledProcessError, FileNotFoundError, IndexError, ValueError) as e:
        print(f"Error getting PID for container '{container_name}': {e}", file=sys.stderr)
        return None

def load_bpf_program(target_pid):
    """
    Loads the eBPF C code, replacing the PID placeholder,
    and attaches the kprobe.
    """
    try:
        with open("bpf_probe.c", "r") as f:
            bpf_text = f.read()

        # This is the coolest part: dynamically injecting the PID
        bpf_text = bpf_text.replace("PID_FILTER", str(target_pid))

        b = BPF(text=bpf_text)
        b.attach_kprobe(event=b.get_syscall_fnname("write"), fn_name="trace_write_entry")
        print("eBPF probe attached successfully.")
        return b
    except Exception as e:
        print(f"Error loading or attaching BPF program: {e}", file=sys.stderr)
        sys.exit(1)

def process_event(cpu, data, size):
    """
    Callback function executed for each event from the kernel.
    This is where we run our NLP analysis.
    """
    global nlp
    # The 'events' name must match the BPF_PERF_OUTPUT name in the C code
    event = b["events"].event(data)
    
    try:
        # Decode the buffer, ignoring errors for now.
        log_line = event.buf.decode('utf-8', 'ignore')
        
        # A simple heuristic to filter out non-log-like noise
        if len(log_line.strip()) < 10:
            return

        doc = nlp(log_line)
        pii_found = []
        for ent in doc.ents:
            # We can customize this list based on what we consider PII
            if ent.label_ in ["PERSON", "EMAIL", "GPE", "ORG"]:
                pii_found.append({
                    "text": ent.text,
                    "type": ent.label_,
                    "pid": event.pid,
                    "comm": event.comm.decode('utf-8', 'ignore')
                })

        if pii_found:
            message = {
                "type": "pii_alert",
                "payload": {
                    "original_log": log_line.strip(),
                    "entities": pii_found
                }
            }
            # Schedule the broadcast to run on the main asyncio event loop
            asyncio.run_coroutine_threadsafe(broadcast(json.dumps(message)), loop)

    except Exception as e:
        print(f"Error processing event: {e}", file=sys.stderr)

async def register_client(websocket):
    connected_clients.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        connected_clients.remove(websocket)

async def broadcast(message):
    if connected_clients:
        # Use asyncio.gather for concurrent sending
        await asyncio.gather(*[client.send(message) for client in connected_clients])

async def main():
    global nlp, b, loop
    
    print("Initializing PII detection controller...")
    
    target_pid = get_container_pid(CONTAINER_NAME)
    if not target_pid:
        print(f"Could not find container '{CONTAINER_NAME}'. Exiting.", file=sys.stderr)
        sys.exit(1)
        
    print("Loading spaCy NLP model...")
    nlp = spacy.load(NLP_MODEL)
    print("Model loaded.")

    b = load_bpf_program(target_pid)
    
    # Open the perf buffer and set the callback
    b["events"].open_perf_buffer(process_event)
    
    loop = asyncio.get_running_loop()

    # Start the WebSocket server
    server = await websockets.serve(register_client, WEBSOCKET_HOST, WEBSOCKET_PORT)
    print(f"WebSocket server started on ws://{WEBSOCKET_HOST}:{WEBSOCKET_PORT}")

    # The main polling loop. This runs indefinitely.
    print("Listening for events from kernel... (Press Ctrl+C to exit)")
    while True:
        try:
            # This is a blocking call, but we can add a timeout.
            # We'll run it in a loop with a small sleep to keep the event loop responsive.
            b.perf_buffer_poll(timeout=100)
            await asyncio.sleep(0.1) # Yield control to the asyncio loop
        except KeyboardInterrupt:
            print("\nDetaching BPF probe and shutting down...")
            server.close()
            await server.wait_closed()
            break

if __name__ == "__main__":
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nShutdown complete.")

This script is dense but showcases the complete pipeline: finding the container PID, injecting it into the eBPF source, loading the probe, setting up a WebSocket server, and then entering a polling loop where every kernel event triggers an NLP analysis and a potential WebSocket broadcast.

The Visualization: A Real-Time Solid.js Dashboard

Finally, we need a frontend to consume the WebSocket stream and display the alerts. Solid.js is an amazing choice here because its performance is stellar for high-frequency updates, and its reactive primitives (createSignal, createEffect) are incredibly intuitive for this kind of task.

We can set up a basic Solid.js project using Vite:
npm create vite@latest solid-pii-dashboard -- --template solid-ts

Then, we’ll create a component to handle the logic.

// src/App.tsx
import { createSignal, onMount, For, Component } from 'solid-js';
import { createStore } from "solid-js/store";
import './App.css';

// --- Types for our data structure ---
interface PiiEntity {
  text: string;
  type: string;
  pid: number;
  comm: string;
}

interface PiiAlert {
  original_log: string;
  entities: PiiEntity[];
}

interface AlertMessage {
  type: "pii_alert";
  payload: PiiAlert;
}

const WEBSOCKET_URL = "ws://localhost:8765";

const App: Component = () => {
  const [alerts, setAlerts] = createStore<PiiAlert[]>([]);
  const [connectionStatus, setConnectionStatus] = createSignal<string>("Disconnected");

  const connectWebSocket = () => {
    console.log("Attempting to connect to WebSocket...");
    setConnectionStatus("Connecting...");
    const ws = new WebSocket(WEBSOCKET_URL);

    ws.onopen = () => {
      console.log("WebSocket connection established.");
      setConnectionStatus("Connected");
    };

    ws.onmessage = (event) => {
      try {
        const message: AlertMessage = JSON.parse(event.data);
        if (message.type === "pii_alert") {
          // The magic of stores: just prepend the new alert.
          // Solid handles the efficient DOM update.
          setAlerts(prev => [message.payload, ...prev]);
        }
      } catch (error) {
        console.error("Failed to parse WebSocket message:", error);
      }
    };

    ws.onclose = () => {
      console.log("WebSocket connection closed. Reconnecting in 3 seconds...");
      setConnectionStatus("Disconnected");
      setTimeout(connectWebSocket, 3000);
    };

    ws.onerror = (error) => {
      console.error("WebSocket error:", error);
      setConnectionStatus("Error");
      ws.close();
    };
  };

  onMount(() => {
    connectWebSocket();
  });

  const getLabelColor = (type: string) => {
    switch(type) {
      case 'PERSON': return '#ffadad';
      case 'EMAIL': return '#ffd6a5';
      case 'GPE': return '#fdffb6';
      case 'ORG': return '#caffbf';
      default: return '#e0e0e0';
    }
  };

  return (
    <div class="container">
      <header>
        <h1>Live PII Leak Detector</h1>
        <div class="status">
          Status: <span class={`status-indicator ${connectionStatus.toLowerCase()}`}>{connectionStatus}</span>
        </div>
      </header>
      <main class="alerts-feed">
        <For each={alerts} fallback={<p class="fallback">No PII alerts detected yet. Trigger an event in the target container.</p>}>
          {(alert, index) => (
            <div class="alert-card">
              <p class="log-line"><strong>Original Log:</strong> {alert.original_log}</p>
              <div class="entities-list">
                <strong>Detected Entities:</strong>
                <ul>
                  <For each={alert.entities}>
                    {(entity) => (
                      <li>
                        <span class="entity-text">{entity.text}</span>
                        <span class="entity-label" style={{ "background-color": getLabelColor(entity.type) }}>
                          {entity.type}
                        </span>
                        <span class="meta-info">(comm: {entity.comm}, pid: {entity.pid})</span>
                      </li>
                    )}
                  </For>
                </ul>
              </div>
            </div>
          )}
        </For>
      </main>
    </div>
  );
};

export default App;

With some basic CSS, this component provides a clean, auto-updating feed of detected PII leaks. The most powerful part is how little code is needed to handle the real-time updates. By wrapping our array of alerts in createStore, any modification to it automatically and efficiently triggers a re-render of only the necessary parts of the DOM. This is where Solid.js truly outshines frameworks that rely on a Virtual DOM for this kind of high-frequency data stream.

Running the Full System

Start the Container: podman run -d --name leaker-container -p 5000:5000 pii-leaker
Start the Backend: sudo python3 controller.py (sudo is required for eBPF operations)
Start the Frontend: cd solid-pii-dashboard && npm run dev
Trigger an Event: curl -X POST -H "Content-Type: application/json" -d '{"name": "Bob Smith", "email": "[email protected]"}' http://localhost:5000/register

Instantly, a new card should appear on the Solid.js dashboard, identifying “Bob Smith” as a PERSON, “bob.smith@company.net“ as an EMAIL, and “company.net” possibly as an ORG, demonstrating the entire pipeline in action.

While this proof-of-concept is incredibly powerful, it’s far from a production-ready system. The current implementation, which relies on a single Python process to handle kernel events, run NLP, and manage WebSockets, is a clear bottleneck. A more robust architecture would use the eBPF controller solely to publish raw data to a message queue like Kafka or Redis Streams. A separate, scalable pool of NLP workers could then consume from this queue, and another service could handle the WebSocket connections, allowing each component to be scaled independently. Furthermore, the PID-based filtering is fragile; a production system should leverage cgroup information to reliably trace all processes within a given container. The biggest blind spot, however, is encrypted traffic; this entire method is ineffective against data encrypted within the application before being passed to a write syscall, which severely limits its applicability for monitoring network traffic over TLS.

Security eBPF Solid.js Podman spaCy Observability

Implementing a Secure Data Transformation Service Within a VPC Using Go and WebAssembly Plugins

2023-10-27 Backend Architecture

Microservices Security VPC Go-Gin WebAssembly

Implementing a Distributed Feature Serving Layer with Dynamic Configuration Propagation Using Go-Gin and etcd

2023-10-27 Distributed Systems

etcd Go MLOps Go-Gin Feature Store