Implementing a Metrics-Driven Distributed C++ Build System with Jenkins and Prometheus


The build for our monolithic C++ backend was taking over two hours. This wasn’t just an inconvenience; it was a systemic drag on developer productivity. Our Jenkins pipeline was a simple, brute-force invocation of make -j$(nproc) on a single, powerful build agent. When the build ran, the agent’s CPUs were pegged at 100%, its memory saturated, but we had zero insight into the process. It was a complete black box. We didn’t know which translation units were the primary offenders, whether we were CPU-bound or I/O-bound, or why the final link step seemed to take an eternity. The first attempt to fix this was predictable: throw more, bigger machines at the problem. It barely made a dent. The real issue wasn’t a lack of resources, but a profound lack of observability.

Our objective shifted. Instead of just making the build faster, we needed to make it transparent. We decided to re-architect our CI process to treat the C++ build not as a single command, but as a distributed, observable system. The plan involved distributing the compilation workload across a fleet of Jenkins agents and, critically, instrumenting every single compilation step to emit detailed performance metrics to Prometheus.

Phase 1: The Foundation - Distributed Compilation

The first step was to get the compilation off a single machine. Tools like distcc are well-suited for this: distcc intercepts compiler calls (g++, clang++) and, when possible, forwards them to available remote machines (our Jenkins agents) for execution.

Our Jenkins agents were already managed as Docker containers, so we created a new build agent image that included distcc.

FROM ubuntu:20.04

# Basic dependencies (noninteractive so tzdata and friends don't prompt during the image build)
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    openssh-client \
    python3 \
    distcc \
    && rm -rf /var/lib/apt/lists/*

# Configure distcc
# We will populate this dynamically in the Jenkins pipeline
ENV DISTCC_HOSTS=""
ENV DISTCC_DIR=/var/tmp/distcc
RUN mkdir -p ${DISTCC_DIR}
ENV PATH="/usr/lib/distcc:${PATH}"

# Add jenkins user
RUN useradd -m -s /bin/bash jenkins

USER jenkins
WORKDIR /home/jenkins

The Jenkinsfile needed to be updated to orchestrate this. The core idea is to have one agent act as the “coordinator” that runs cmake and make, while the other agents act as pure compilation workers running the distccd daemon. The coordinator’s DISTCC_HOSTS environment variable is populated with the addresses of those worker nodes.
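
Each worker simply runs the distccd daemon; the coordinator then lists those workers in DISTCC_HOSTS. A minimal sketch of the worker-side command and the coordinator's host list (the subnet, IPs, and per-host job limits are placeholders for our environment):

# On each worker: accept compile jobs from the build network, one slot per core.
distccd --daemon --allow 192.168.1.0/24 --jobs $(nproc) --log-stderr

# On the coordinator: list the workers; an optional /N suffix caps the number
# of concurrent jobs sent to each host.
export DISTCC_HOSTS="192.168.1.101/8 192.168.1.102/8 192.168.1.103/8"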

A simplified initial Jenkinsfile looked like this:

pipeline {
    agent none
    stages {
        stage('Distributed Build') {
            agent {
                label 'cxx-build-coordinator'
            }
            environment {
                // In a real scenario, these would be discovered dynamically
                DISTCC_HOSTS = '192.168.1.101 192.168.1.102 192.168.1.103'
            }
            steps {
                script {
                    checkout scm

                    // Basic build commands. Each sh step gets its own shell,
                    // so the cd into build/ must happen in the same step as
                    // cmake and make.
                    sh '''
                        mkdir -p build
                        cd build
                        cmake ..

                        # Use distcc via the masqueraded PATH; give make more
                        # jobs than local cores, assuming remote distribution.
                        make -j32
                    '''
                }
            }
        }
    }
}

This gave us a modest speedup, from around 120 minutes to 75. A win, but we were still blind. Was the workload distributed evenly? Were some files so complex they held up a remote worker for minutes while others finished in seconds? Was the network becoming a bottleneck? The distcc logs were cryptic at best. We had achieved distribution, but not observability.

Phase 2: Building the Eyes - A Compiler Metrics Wrapper

The core of our new system is a custom compiler wrapper. This script sits between the build system (make) and the actual compiler (g++). Its sole purpose is to execute the compilation, measure everything about it, and push those metrics to the Prometheus Pushgateway. The Pushgateway is necessary because our compilation jobs are short-lived; they start, run for a few seconds or minutes, and then exit. They don’t live long enough for a standard Prometheus server to scrape them directly.

We chose Python for the wrapper due to its simplicity for scripting and handling system commands.

metrics_compiler_wrapper.py

#!/usr/bin/env python3

import sys
import os
import subprocess
import time
import re
import tempfile
from urllib.request import urlopen, Request
from urllib.error import URLError

# --- Configuration ---
# These would be fetched from environment variables set by Jenkins
PUSHGATEWAY_URL = os.environ.get("PROMETHEUS_PUSHGATEWAY", "http://prometheus-pushgateway:9091")
JOB_NAME = os.environ.get("JOB_NAME", "unknown_job")
BUILD_NUMBER = os.environ.get("BUILD_NUMBER", "0")
AGENT_HOSTNAME = os.environ.get("NODE_NAME", "unknown_agent")

# The actual compiler to call, e.g., 'g++' or 'clang++'.
# Guarded so that running the script with no arguments reaches the usage
# message below instead of raising an IndexError at import time.
REAL_COMPILER = sys.argv[1] if len(sys.argv) > 1 else None

def format_prometheus_metric(metric_name, value, labels):
    """Formats a single metric in Prometheus text format."""
    label_str = ",".join([f'{k}="{v}"' for k, v in labels.items()])
    return f'{metric_name}{{{label_str}}} {value}'

def get_source_file(args):
    """A best-effort attempt to find the source file from compiler args."""
    for i, arg in enumerate(args):
        if arg == '-c' and i + 1 < len(args):
            # The argument after -c is typically the source file
            return os.path.basename(args[i+1])
    # Fallback for other cases
    for arg in args:
        if arg.endswith(('.cpp', '.c', '.cc', '.cxx')):
            return os.path.basename(arg)
    return "unknown_source_file"

def push_metrics(metrics_payload):
    """Pushes the formatted metrics payload to the Pushgateway."""
    # We group the metrics for a single compilation by a unique instance id.
    # This prevents concurrent compilations from overwriting each other's metrics;
    # a timestamp plus pid is usually unique enough for this purpose.
    # The jenkins_job and build grouping labels let the post-build cleanup step
    # find everything pushed for one build.
    instance_id = f"{time.time_ns()}_{os.getpid()}"
    url = (f"{PUSHGATEWAY_URL}/metrics/job/cpp_compilation"
           f"/jenkins_job/{JOB_NAME}/build/{BUILD_NUMBER}"
           f"/instance/{instance_id}")
    
    try:
        req = Request(url, data=metrics_payload.encode('utf-8'), method='POST')
        req.add_header('Content-Type', 'text/plain; version=0.0.4')
        with urlopen(req, timeout=5) as response:
            if response.status not in [200, 202]:
                print(f"Error: Unexpected status code {response.status} from Pushgateway", file=sys.stderr)
    except URLError as e:
        # In a real-world project, failing to push metrics should not fail the build.
        # We log the error and continue.
        print(f"Error: Could not connect to Prometheus Pushgateway at {url}: {e}", file=sys.stderr)
    except Exception as e:
        print(f"An unexpected error occurred while pushing metrics: {e}", file=sys.stderr)


def main():
    compiler_args = sys.argv[2:]
    
    # We use `/usr/bin/time -v` to get detailed process statistics.
    # Writing its report to a temporary file (-o) keeps the compiler's own
    # stdout/stderr clean, so warnings and errors can be passed through to
    # the build logs untouched.
    with tempfile.NamedTemporaryFile(suffix='.time', delete=False) as tf:
        time_report_path = tf.name
    command_to_run = ['/usr/bin/time', '-v', '-o', time_report_path, REAL_COMPILER] + compiler_args

    start_time = time.monotonic()

    try:
        proc = subprocess.run(command_to_run, check=False, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"Error: '/usr/bin/time' or compiler '{REAL_COMPILER}' not found.", file=sys.stderr)
        os.unlink(time_report_path)
        sys.exit(127)

    end_time = time.monotonic()

    # Pass the compiler's own output through so warnings and errors still
    # reach the build logs.
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)

    # Read (and remove) the /usr/bin/time report.
    try:
        with open(time_report_path) as report:
            time_output = report.read()
    finally:
        os.unlink(time_report_path)

    # If the compiler failed, we must also fail.
    if proc.returncode != 0:
        sys.exit(proc.returncode)

    # --- Metrics Parsing and Formatting ---

    wall_clock_duration = end_time - start_time
    
    # Regex to parse the verbose output of /usr/bin/time
    user_time_match = re.search(r'User time \(seconds\): ([\d\.]+)', time_output)
    system_time_match = re.search(r'System time \(seconds\): ([\d\.]+)', time_output)
    cpu_percent_match = re.search(r'Percent of CPU this job got: ([\d]+)%', time_output)
    max_rss_match = re.search(r'Maximum resident set size \(kbytes\): ([\d]+)', time_output)
    
    user_time = float(user_time_match.group(1)) if user_time_match else 0.0
    system_time = float(system_time_match.group(1)) if system_time_match else 0.0
    cpu_percent = int(cpu_percent_match.group(1)) if cpu_percent_match else 0
    max_rss_kb = int(max_rss_match.group(1)) if max_rss_match else 0
    total_cpu_time = user_time + system_time

    source_file = get_source_file(compiler_args)
    
    # --- Prometheus Labels ---
    # These labels provide the context for our metrics, allowing us to slice and dice the data.
    labels = {
        "job": JOB_NAME,
        "build": BUILD_NUMBER,
        "agent": AGENT_HOSTNAME,
        "source_file": source_file,
    }
    
    metrics = [
        format_prometheus_metric("cpp_compilation_duration_seconds", wall_clock_duration, labels),
        format_prometheus_metric("cpp_compilation_cpu_user_seconds", user_time, labels),
        format_prometheus_metric("cpp_compilation_cpu_system_seconds", system_time, labels),
        format_prometheus_metric("cpp_compilation_cpu_total_seconds", total_cpu_time, labels),
        format_prometheus_metric("cpp_compilation_cpu_utilization_percent", cpu_percent, labels),
        format_prometheus_metric("cpp_compilation_max_resident_set_size_kb", max_rss_kb, labels),
        format_prometheus_metric("cpp_compilation_last_success_timestamp_seconds", int(time.time()), labels)
    ]
    
    metrics_payload = "\n".join(metrics) + "\n"
    
    push_metrics(metrics_payload)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: metrics_compiler_wrapper.py <real_compiler> [compiler_args...]", file=sys.stderr)
        sys.exit(1)
    main()

To integrate this, we updated our build agent’s Dockerfile to include this script and then configured CMake to use it via the CMAKE_CXX_COMPILER_LAUNCHER property. This is a clean way to prepend a command to every compiler invocation without modifying the compiler path itself.
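
The Dockerfile additions are small. A sketch of what gets appended to the Phase 1 image, placed before the USER jenkins line (note that the wrapper also relies on GNU time, which provides /usr/bin/time and is not part of the base image):

# Install GNU time for /usr/bin/time -v and drop the wrapper into the PATH
RUN apt-get update && apt-get install -y time && rm -rf /var/lib/apt/lists/*
COPY metrics_compiler_wrapper.py /usr/local/bin/metrics_compiler_wrapper.py
RUN chmod +x /usr/local/bin/metrics_compiler_wrapper.py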

The CMake command in our Jenkinsfile changed to:
cmake .. -DCMAKE_CXX_COMPILER_LAUNCHER=/usr/local/bin/metrics_compiler_wrapper.py

A critical pitfall here is that the wrapper script itself adds overhead. For a project with thousands of small files, the cumulative time spent launching Python, running /usr/bin/time, and making an HTTP request can become significant. In our case, the compilation time of individual files was substantial enough that this overhead was negligible, but it’s a trade-off to be aware of.
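
A quick way to gauge that overhead is to time one small translation unit compiled directly and then through the wrapper (the file name below is just a placeholder):

time g++ -c small_file.cpp -o /dev/null
time /usr/local/bin/metrics_compiler_wrapper.py g++ -c small_file.cpp -o /dev/null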

Phase 3: A Robust Jenkins Pipeline

The Jenkins pipeline now becomes more complex. It’s no longer just running make; it’s managing a monitoring environment.

Key additions to the Jenkinsfile:

  1. Dynamic Environment Variables: Pass JOB_NAME, BUILD_NUMBER, and NODE_NAME to the build environment so the wrapper script can use them as Prometheus labels.
  2. Pushgateway URL: Provide the location of the Pushgateway.
  3. Post-Build Cleanup: The Prometheus Pushgateway retains metrics until they are explicitly deleted. It’s crucial to have a post block in the pipeline that cleans up the metrics for the specific job and build to prevent stale data from polluting our dashboards.

pipeline {
    agent none
    environment {
        // Centralized configuration
        PROMETHEUS_PUSHGATEWAY = 'http://prometheus-pushgateway.monitoring:9091'
    }
    stages {
        stage('Distributed Build with Metrics') {
            agent {
                label 'cxx-build-coordinator'
            }
            environment {
                // Assuming agents are labeled 'cxx-build-worker'
                // A more robust solution would use the Kubernetes or Docker plugin to get IPs
                DISTCC_HOSTS = 'cxx-agent-1 cxx-agent-2 cxx-agent-3'
            }
            steps {
                script {
                    checkout scm
                    
                    // The build needs to run in a context where the required env vars are set
                    withEnv([
                        "JOB_NAME=${env.JOB_NAME}",
                        "BUILD_NUMBER=${env.BUILD_NUMBER}",
                        "PROMETHEUS_PUSHGATEWAY=${env.PROMETHEUS_PUSHGATEWAY}"
                    ]) {
                        sh '''
                            mkdir -p build
                            cd build

                            # Pass the wrapper script to CMake as the compiler launcher.
                            # CMake then invokes it with the real compiler as the first
                            # argument, which is exactly what the wrapper expects.
                            cmake .. -DCMAKE_CXX_COMPILER_LAUNCHER=/usr/local/bin/metrics_compiler_wrapper.py

                            # We use distcc's pump mode for better include file distribution
                            pump make -j80
                        '''
                    }
                }
            }
        }
    }
    post {
        always {
            script {
                // --- CRUCIAL CLEANUP STEP ---
                // Without this, the Pushgateway will hold metrics from old builds forever.
                // Delete the metric group for this Jenkins job and build number
                // (see the note below on per-compilation instance groups).
                echo "Cleaning up metrics from Pushgateway for ${env.JOB_NAME}, build ${env.BUILD_NUMBER}"
                def cleanupUrl = "${env.PROMETHEUS_PUSHGATEWAY}/metrics/job/cpp_compilation/jenkins_job/${env.JOB_NAME}/build/${env.BUILD_NUMBER}"
                
                // A simple curl command handles the DELETE request
                // In a real project, add error handling and retries.
                sh "curl -s -X DELETE ${cleanupUrl}"
            }
        }
    }
}

This pipeline is now a proper orchestration engine. It sets up the environment, runs the instrumented build, and, most importantly, cleans up after itself. Forgetting the cleanup step is a common mistake that leads to a completely unmanageable Pushgateway instance.
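
One subtlety: the Pushgateway only deletes a group whose grouping key exactly matches the DELETE path, and each compilation above was pushed under its own unique instance label, so the single curl in the post block will not remove those per-compilation groups. A more thorough cleanup enumerates the pushed groups and deletes every one belonging to the finished build. Below is a sketch of such a helper (the script name is hypothetical); it assumes the Pushgateway's /api/v1/metrics listing endpoint, which reports each group together with its grouping-key labels, and the grouping scheme used by the wrapper.

pushgateway_cleanup.py

#!/usr/bin/env python3
"""Delete every Pushgateway metric group pushed for one Jenkins build (sketch)."""

import json
import os
from urllib.parse import quote
from urllib.request import urlopen, Request

PUSHGATEWAY_URL = os.environ.get("PROMETHEUS_PUSHGATEWAY", "http://prometheus-pushgateway:9091")
JENKINS_JOB = os.environ["JOB_NAME"]
BUILD = os.environ["BUILD_NUMBER"]

def main():
    # List every metric group currently held by the Pushgateway.
    with urlopen(f"{PUSHGATEWAY_URL}/api/v1/metrics", timeout=10) as resp:
        groups = json.load(resp).get("data", [])

    deleted = 0
    for group in groups:
        labels = group.get("labels", {})  # the group's grouping-key labels
        if labels.get("jenkins_job") != JENKINS_JOB or labels.get("build") != BUILD:
            continue
        # Rebuild the grouping-key path: /metrics/job/<job>/<label>/<value>/...
        path = f"/metrics/job/{quote(labels.get('job', 'cpp_compilation'), safe='')}"
        for key, value in sorted(labels.items()):
            if key != "job":
                path += f"/{quote(key, safe='')}/{quote(value, safe='')}"
        with urlopen(Request(PUSHGATEWAY_URL + path, method="DELETE"), timeout=10):
            deleted += 1

    print(f"Deleted {deleted} metric group(s) for {JENKINS_JOB} build {BUILD}")

if __name__ == "__main__":
    main()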

Phase 4: Visualization and Diagnosis

With metrics flowing, we can finally see inside the black box. We configured our Prometheus server to scrape the Pushgateway, and then we built a Grafana dashboard to make sense of the data.

Our Prometheus server configuration (prometheus.yml) includes a job for the Pushgateway:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true # Crucial for Pushgateway, allows metrics to keep their original labels
    static_configs:
      - targets: ['prometheus-pushgateway.monitoring:9091']

The true power comes from the PromQL queries we can now run. We created a build-specific Grafana dashboard that takes $job and $build as template variables.
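
Populating those variables is straightforward because every series carries the jenkins_job and build labels; in Grafana's Prometheus data source the usual approach is a label_values() template query, for example:

label_values(cpp_compilation_duration_seconds, jenkins_job)
label_values(cpp_compilation_duration_seconds{jenkins_job="$job"}, build)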

Here is the architecture we built:

graph TD
    subgraph Jenkins Environment
        Master[Jenkins Master]
        Coordinator[Coordinator Agent]
        Worker1[Worker Agent 1]
        Worker2[Worker Agent 2]
    end

    subgraph Monitoring Stack
        Prometheus[Prometheus Server]
        Pushgateway[Pushgateway]
        Grafana[Grafana]
    end

    User[Developer pushes code] --> Master
    Master -- triggers build --> Coordinator
    Coordinator -- runs make --> distcc
    distcc -- distributes compilation --> Worker1
    distcc -- distributes compilation --> Worker2
    
    Worker1 -- runs --> Wrapper[metrics_compiler_wrapper.py]
    Worker2 -- runs --> Wrapper
    
    Wrapper -- HTTP POST --> Pushgateway

    Prometheus -- scrapes --> Pushgateway
    Grafana -- queries --> Prometheus
    
    User -- views dashboard --> Grafana

Key Diagnostic Panels on our Grafana Dashboard:

  1. Top 10 Slowest Compiling Files (Wall Clock Time):

    • Query: topk(10, max(cpp_compilation_duration_seconds{jenkins_job="$job", build="$build"}) by (source_file))
    • Insight: This immediately identified the “god objects” and heavily-templated headers that were the worst offenders. Refactoring efforts could now be precisely targeted.
  2. Top 10 Most CPU-Intensive Files:

    • Query: topk(10, max(cpp_compilation_cpu_total_seconds{jenkins_job="$job", build="$build"}) by (source_file))
    • Insight: Sometimes a file doesn’t take long in wall-clock time but consumes an enormous amount of CPU, indicating complex template instantiation or optimization work. This helped us find code that was computationally expensive to compile.
  3. Top 10 Memory-Hungry Files:

    • Query: topk(10, max(cpp_compilation_max_resident_set_size_kb{jenkins_job="$job", build="$build"}) by (source_file))
    • Insight: We found a few files that required over 4GB of RAM to compile, which explained why some of our smaller Jenkins agents were thrashing their swap space and performing poorly. We could now either optimize the code or ensure these compilations were scheduled on high-memory agents.
  4. Build Parallelism Over Time:

    • Query: sum(running:cpp_compilation_duration_seconds{jenkins_job="$job", build="$build"}) (requires a recording rule; a sketch of one appears after this list) or simply count(cpp_compilation_cpu_total_seconds{jenkins_job="$job", build="$build"}) visualized as a gauge.
    • Insight: This graph showed us the “shape” of our build. We saw a massive ramp-up of concurrent jobs, a long plateau, and then a sharp drop-off as the build waited for the last few, long-running compilations to finish. It also revealed that our final link step was a huge serialization point.
  5. Compilation Distribution Across Agents:

    • Query: count(cpp_compilation_duration_seconds{jenkins_job="$job", build="$build"}) by (agent)
    • Insight: This allowed us to see if distcc was distributing the load evenly. We found one agent was consistently underutilized due to a subtle network configuration issue, something we never would have found otherwise.
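
Recording rules such as the one referenced in panel 4 live in a Prometheus rule file. A minimal sketch (group and rule names are placeholders) that pre-aggregates the number of compilation results reported per build; the rate at which this counter climbs during a build is a rough proxy for how many compilations are completing per interval:

groups:
  - name: cpp_build_rules
    rules:
      # How many compilation results have been reported so far for each build.
      - record: build:cpp_compilations_reported:count
        expr: count(cpp_compilation_duration_seconds) by (jenkins_job, build)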

The results were transformative. The build time dropped to under 30 minutes, not just because of distribution, but because we could now have data-driven conversations about code complexity. We could quantify the “compilation cost” of a new feature. We even considered pre-commit hooks to reject changes that introduced files which were excessively slow or memory-intensive to compile. We turned an opaque, frustrating process into a transparent, manageable one.

Lingering Issues and Future Paths

This system, while effective, is not without its limitations. The reliance on the Pushgateway introduces a stateful component into our monitoring stack that requires careful management; it can become a performance bottleneck itself if hit with too many metrics from too many concurrent builds. A potential evolution would be to run a Prometheus agent as a sidecar on each Jenkins build agent, allowing for a more standard pull-based collection model and reducing the load on a central gateway.

Furthermore, we’ve only instrumented the compilation (-c) steps. The final link step remains a monolithic, unobserved block of time. Applying similar metric-gathering principles to the linker (ld) is the next logical frontier. This is significantly more complex, as it would likely involve instrumenting the linker itself or using advanced tools like perf to sample its execution during the pipeline.

Finally, the Python wrapper, while flexible, does introduce a non-trivial overhead for each of the thousands of files compiled. A more performant long-term solution might involve writing a native C++ launcher application or contributing instrumentation features directly to build systems like Ninja or CMake to emit these metrics with lower overhead. The current implementation represents a pragmatic balance between ease of implementation and depth of observability.

