The build for our monolithic C++ backend was taking over two hours. This wasn’t just an inconvenience; it was a systemic drag on developer productivity. Our Jenkins pipeline was a simple, brute-force invocation of make -j$(nproc) on a single, powerful build agent. When the build ran, the agent’s CPUs were pegged at 100%, its memory saturated, but we had zero insight into the process. It was a complete black box. We didn’t know which translation units were the primary offenders, whether we were CPU-bound or I/O-bound, or why the final link step seemed to take an eternity. The first attempt to fix this was predictable: throw more, bigger machines at the problem. It barely made a dent. The real issue wasn’t a lack of resources, but a profound lack of observability.
Our objective shifted. Instead of just making the build faster, we needed to make it transparent. We decided to re-architect our CI process to treat the C++ build not as a single command, but as a distributed, observable system. The plan involved distributing the compilation workload across a fleet of Jenkins agents and, critically, instrumenting every single compilation step to emit detailed performance metrics to Prometheus.
Phase 1: The Foundation - Distributed Compilation
The initial step was to break the compilation out of a single machine. Tools like distcc are well-suited for this. It’s a simple concept: distcc intercepts compiler calls (g++, clang++) and, when possible, forwards them to available remote machines (our Jenkins agents) for execution.
Our Jenkins agents were already managed as Docker containers, so we created a new build agent image that included distcc.
FROM ubuntu:20.04
# Basic dependencies
RUN apt-get update && apt-get install -y \
        build-essential \
        cmake \
        git \
        openssh-client \
        python3 \
        distcc \
    && rm -rf /var/lib/apt/lists/*
# Configure distcc
# We will populate this dynamically in the Jenkins pipeline
ENV DISTCC_HOSTS=""
ENV DISTCC_DIR=/var/tmp/distcc
RUN mkdir -p ${DISTCC_DIR}
ENV PATH="/usr/lib/distcc:${PATH}"
# Add jenkins user
RUN useradd -m -s /bin/bash jenkins
USER jenkins
WORKDIR /home/jenkins
The Jenkinsfile needed to be updated to orchestrate this. The core idea is to have one agent act as the “coordinator”, which runs cmake and make, while the other agents act as pure compilation workers running the distccd daemon. The coordinator’s DISTCC_HOSTS environment variable would be populated with the IPs of the worker nodes.
A simplified initial Jenkinsfile looked like this:
pipeline {
    agent none
    stages {
        stage('Distributed Build') {
            agent {
                label 'cxx-build-coordinator'
            }
            environment {
                // In a real scenario, these would be discovered dynamically
                DISTCC_HOSTS = '192.168.1.101 192.168.1.102 192.168.1.103'
            }
            steps {
                script {
                    checkout scm
                    // Basic build commands. Each sh step starts in the workspace root,
                    // so the cd into build/ has to be repeated per step.
                    sh 'mkdir -p build'
                    sh 'cd build && cmake ..'
                    // Use distcc by leveraging the updated PATH
                    // We'll give it more jobs than local cores, assuming distribution
                    sh 'cd build && make -j32'
                }
            }
        }
    }
}
This gave us a modest speedup, from around 120 minutes to 75. A win, but we were still blind. Was the workload distributed evenly? Were some files so complex they held up a remote worker for minutes while others finished in seconds? Was the network becoming a bottleneck? The distcc logs were cryptic at best. We had achieved distribution, but not observability.
Phase 2: Building the Eyes - A Compiler Metrics Wrapper
The core of our new system is a custom compiler wrapper. This script sits between the build system (make) and the actual compiler (g++). Its sole purpose is to execute the compilation, measure everything about it, and push those metrics to the Prometheus Pushgateway. The Pushgateway is necessary because our compilation jobs are short-lived; they start, run for a few seconds or minutes, and then exit. They don’t live long enough for a standard Prometheus server to scrape them directly.
We chose Python for the wrapper due to its simplicity for scripting and handling system commands.
metrics_compiler_wrapper.py
#!/usr/bin/env python3
import sys
import os
import subprocess
import time
import re
from urllib.request import urlopen, Request
from urllib.error import URLError

# --- Configuration ---
# These would be fetched from environment variables set by Jenkins
PUSHGATEWAY_URL = os.environ.get("PROMETHEUS_PUSHGATEWAY", "http://prometheus-pushgateway:9091")
JOB_NAME = os.environ.get("JOB_NAME", "unknown_job")
BUILD_NUMBER = os.environ.get("BUILD_NUMBER", "0")
AGENT_HOSTNAME = os.environ.get("NODE_NAME", "unknown_agent")

# The actual compiler to call, e.g., 'g++' or 'clang++'.
# Guarded so that running the script with no arguments reaches the usage check below.
REAL_COMPILER = sys.argv[1] if len(sys.argv) > 1 else None


def format_prometheus_metric(metric_name, value, labels):
    """Formats a single metric in Prometheus text format."""
    label_str = ",".join([f'{k}="{v}"' for k, v in labels.items()])
    return f'{metric_name}{{{label_str}}} {value}'


def get_source_file(args):
    """A best-effort attempt to find the source file from compiler args."""
    for i, arg in enumerate(args):
        if arg == '-c' and i + 1 < len(args):
            # The argument after -c is typically the source file
            return os.path.basename(args[i + 1])
    # Fallback for other cases
    for arg in args:
        if arg.endswith(('.cpp', '.c', '.cc', '.cxx')):
            return os.path.basename(arg)
    return "unknown_source_file"


def push_metrics(metrics_payload):
    """Pushes the formatted metrics payload to the Pushgateway."""
    # We group all metrics for a single compilation by a unique instance id.
    # This prevents different compilations from overwriting each other's metrics;
    # a simple timestamp and pid is usually unique enough for this purpose.
    # The Jenkins job and build number are part of the grouping key as well, so the
    # post-build cleanup can find and delete every group belonging to one build.
    # (Label values are assumed to be URL-safe here.)
    instance_id = f"{time.time_ns()}_{os.getpid()}"
    url = (f"{PUSHGATEWAY_URL}/metrics/job/cpp_compilation"
           f"/jenkins_job/{JOB_NAME}/build/{BUILD_NUMBER}/instance/{instance_id}")
    try:
        req = Request(url, data=metrics_payload.encode('utf-8'), method='POST')
        req.add_header('Content-Type', 'text/plain; version=0.0.4')
        with urlopen(req, timeout=5) as response:
            if response.status not in [200, 202]:
                print(f"Error: Unexpected status code {response.status} from Pushgateway", file=sys.stderr)
    except URLError as e:
        # In a real-world project, failing to push metrics should not fail the build.
        # We log the error and continue.
        print(f"Error: Could not connect to Prometheus Pushgateway at {url}: {e}", file=sys.stderr)
    except Exception as e:
        print(f"An unexpected error occurred while pushing metrics: {e}", file=sys.stderr)


def main():
    compiler_args = sys.argv[2:]
    # We use `/usr/bin/time -v` to get detailed process statistics. Its report goes
    # to a temporary file (-o) so it never mixes with the compiler's own stderr.
    time_report_file = f"/tmp/compile_time_{os.getpid()}.txt"
    command_to_run = ['/usr/bin/time', '-v', '-o', time_report_file, REAL_COMPILER] + compiler_args

    start_time = time.monotonic()
    try:
        proc = subprocess.run(command_to_run, check=False, capture_output=True, text=True)
    except FileNotFoundError:
        print(f"Error: '/usr/bin/time' or real compiler '{REAL_COMPILER}' not found.", file=sys.stderr)
        sys.exit(127)
    end_time = time.monotonic()

    # Always pass the compiler's output through to the build logs so warnings survive.
    if proc.stdout:
        print(proc.stdout, file=sys.stdout, end='')
    if proc.stderr:
        print(proc.stderr, file=sys.stderr, end='')

    # If the compiler failed, we must also fail.
    if proc.returncode != 0:
        if os.path.exists(time_report_file):
            os.unlink(time_report_file)
        sys.exit(proc.returncode)

    # --- Metrics Parsing and Formatting ---
    wall_clock_duration = end_time - start_time
    try:
        with open(time_report_file) as f:
            time_output = f.read()
        os.unlink(time_report_file)
    except OSError:
        time_output = ""

    # Regex to parse the verbose output of /usr/bin/time
    user_time_match = re.search(r'User time \(seconds\): ([\d\.]+)', time_output)
    system_time_match = re.search(r'System time \(seconds\): ([\d\.]+)', time_output)
    cpu_percent_match = re.search(r'Percent of CPU this job got: ([\d]+)%', time_output)
    max_rss_match = re.search(r'Maximum resident set size \(kbytes\): ([\d]+)', time_output)

    user_time = float(user_time_match.group(1)) if user_time_match else 0.0
    system_time = float(system_time_match.group(1)) if system_time_match else 0.0
    cpu_percent = int(cpu_percent_match.group(1)) if cpu_percent_match else 0
    max_rss_kb = int(max_rss_match.group(1)) if max_rss_match else 0
    total_cpu_time = user_time + system_time

    source_file = get_source_file(compiler_args)

    # --- Prometheus Labels ---
    # These labels provide the context for our metrics, allowing us to slice and dice the data.
    # The Jenkins job name is exposed as "jenkins_job" because "job" is reserved for the
    # Pushgateway grouping key and the Prometheus scrape job.
    labels = {
        "jenkins_job": JOB_NAME,
        "build": BUILD_NUMBER,
        "agent": AGENT_HOSTNAME,
        "source_file": source_file,
    }

    metrics = [
        format_prometheus_metric("cpp_compilation_duration_seconds", wall_clock_duration, labels),
        format_prometheus_metric("cpp_compilation_cpu_user_seconds", user_time, labels),
        format_prometheus_metric("cpp_compilation_cpu_system_seconds", system_time, labels),
        format_prometheus_metric("cpp_compilation_cpu_total_seconds", total_cpu_time, labels),
        format_prometheus_metric("cpp_compilation_cpu_utilization_percent", cpu_percent, labels),
        format_prometheus_metric("cpp_compilation_max_resident_set_size_kb", max_rss_kb, labels),
        format_prometheus_metric("cpp_compilation_last_success_timestamp_seconds", int(time.time()), labels)
    ]
    metrics_payload = "\n".join(metrics) + "\n"
    push_metrics(metrics_payload)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: metrics_compiler_wrapper.py <real_compiler> [compiler_args...]", file=sys.stderr)
        sys.exit(1)
    main()
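Before wiring the wrapper into CI, it helps to sanity-check its parsing and output format locally. The snippet below is purely illustrative: it imports the script as a module (which works because all side effects live in main()) and prints what one metric line would look like. The job name, build number, agent, and file paths are made up.
# sanity_check_wrapper.py -- illustrative only, not part of the pipeline
import metrics_compiler_wrapper as wrapper

# get_source_file() pulls the translation unit out of a typical compile command
args = ["-O2", "-c", "src/render/scene_graph.cpp", "-o", "scene_graph.o"]
print(wrapper.get_source_file(args))
# -> scene_graph.cpp

# format_prometheus_metric() emits one line of Prometheus text exposition format
labels = {
    "jenkins_job": "backend-main",   # hypothetical Jenkins job name
    "build": "1234",                 # hypothetical build number
    "agent": "cxx-agent-1",
    "source_file": "scene_graph.cpp",
}
print(wrapper.format_prometheus_metric("cpp_compilation_duration_seconds", 12.7, labels))
# -> cpp_compilation_duration_seconds{jenkins_job="backend-main",build="1234",agent="cxx-agent-1",source_file="scene_graph.cpp"} 12.7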
To integrate this, we updated our build agent’s Dockerfile to install the script at /usr/local/bin/metrics_compiler_wrapper.py (along with the GNU time package that provides /usr/bin/time), and then configured CMake to use it via the CMAKE_CXX_COMPILER_LAUNCHER property. This is a clean way to prepend a command to every compiler invocation without modifying the compiler path itself; CMake passes the real compiler to the launcher as its first argument.
The CMake command in our Jenkinsfile changed to:
cmake .. -DCMAKE_CXX_COMPILER_LAUNCHER=/usr/local/bin/metrics_compiler_wrapper.py
A critical pitfall here is that the wrapper script itself adds overhead. For a project with thousands of small files, the cumulative time spent launching Python, running /usr/bin/time, and making an HTTP request can become significant. In our case, the compilation time of individual files was substantial enough that this overhead was negligible, but it’s a trade-off to be aware of.
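If you want to put a number on that overhead for your own codebase, a quick harness like the one below compares a direct compile of a trivial file against a wrapped compile. It is a rough sketch, assuming g++ is on the PATH and the wrapper is installed executable at /usr/local/bin/metrics_compiler_wrapper.py; the wrapped runs will also try to push metrics and merely log an error if the Pushgateway is unreachable.
# measure_wrapper_overhead.py -- rough, illustrative benchmark
import os
import statistics
import subprocess
import tempfile
import time

def median_runtime(cmd, runs=5):
    """Run a command several times and return the median wall-clock time."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.monotonic() - start)
    return statistics.median(samples)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "trivial.cpp")
    obj = os.path.join(tmp, "trivial.o")
    with open(src, "w") as f:
        f.write("int main() { return 0; }\n")

    direct = median_runtime(["g++", "-c", src, "-o", obj])
    wrapped = median_runtime(
        ["/usr/local/bin/metrics_compiler_wrapper.py", "g++", "-c", src, "-o", obj])

    print(f"direct:  {direct * 1000:.0f} ms")
    print(f"wrapped: {wrapped * 1000:.0f} ms")
    print(f"overhead per file: {(wrapped - direct) * 1000:.0f} ms")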
Phase 3: A Robust Jenkins Pipeline
The Jenkins pipeline now becomes more complex. It’s no longer just running make; it’s managing a monitoring environment.
Key additions to the Jenkinsfile:
- Dynamic Environment Variables: Pass JOB_NAME, BUILD_NUMBER, and NODE_NAME to the build environment so the wrapper script can use them as Prometheus labels.
- Pushgateway URL: Provide the location of the Pushgateway.
- Post-Build Cleanup: The Prometheus Pushgateway retains metrics until they are manually deleted. It’s crucial to have a post block in the pipeline that cleans up all metrics for the specific job and build to prevent stale data from polluting our dashboards.
pipeline {
    agent none
    environment {
        // Centralized configuration
        PROMETHEUS_PUSHGATEWAY = 'http://prometheus-pushgateway.monitoring:9091'
    }
    stages {
        stage('Distributed Build with Metrics') {
            agent {
                label 'cxx-build-coordinator'
            }
            environment {
                // Assuming agents are labeled 'cxx-build-worker'
                // A more robust solution would use the Kubernetes or Docker plugin to get IPs
                // The ,cpp,lzo host options are required for distcc's pump mode used below
                DISTCC_HOSTS = 'cxx-agent-1,cpp,lzo cxx-agent-2,cpp,lzo cxx-agent-3,cpp,lzo'
            }
            steps {
                script {
                    checkout scm
                    // The build needs to run in a context where the required env vars are set
                    withEnv([
                        "JOB_NAME=${env.JOB_NAME}",
                        "BUILD_NUMBER=${env.BUILD_NUMBER}",
                        "PROMETHEUS_PUSHGATEWAY=${env.PROMETHEUS_PUSHGATEWAY}"
                    ]) {
                        // Each sh step starts in the workspace root, so cd into build/ per step
                        sh 'mkdir -p build'
                        // Point CMake at the wrapper script. CMake prepends the launcher to every
                        // compile command and passes the real compiler as its first argument.
                        sh 'cd build && cmake .. -DCMAKE_CXX_COMPILER_LAUNCHER=/usr/local/bin/metrics_compiler_wrapper.py'
                        // We use distcc's pump mode for better include file distribution
                        sh 'cd build && pump make -j80'
                    }
                }
            }
        }
    }
    post {
        always {
            script {
                // --- CRUCIAL CLEANUP STEP ---
                // Without this, the Pushgateway will hold metrics from old builds forever.
                // Each compilation pushes its own grouping key (job/jenkins_job/build/instance),
                // and the Pushgateway only deletes exact grouping-key matches, so we enumerate
                // and delete every group belonging to this build with a small helper script
                // (a hypothetical name; a sketch of it follows below).
                echo "Cleaning up metrics from Pushgateway for ${env.JOB_NAME}, build ${env.BUILD_NUMBER}"
                // In a real project, add error handling and retries.
                // 'agent none' at the pipeline level means sh needs an explicit node context here.
                node('cxx-build-coordinator') {
                    sh 'python3 /usr/local/bin/pushgateway_cleanup.py'
                }
            }
        }
    }
}
This pipeline is now a proper orchestration engine. It sets up the environment, runs the instrumented build, and, most importantly, cleans up after itself. Forgetting the cleanup step is a common mistake that leads to a completely unmanageable Pushgateway instance.
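The cleanup helper referenced in the post block is not a stock tool; it is something you would ship in the agent image alongside the wrapper. Here is a minimal sketch of what it could look like, assuming the wrapper pushes grouping keys of job/jenkins_job/build/instance as shown earlier and that the Pushgateway API is reachable from the agent; the name pushgateway_cleanup.py and its install path are placeholders.
#!/usr/bin/env python3
"""Delete every Pushgateway metric group that belongs to one Jenkins build."""
import json
import os
import sys
from urllib.parse import quote
from urllib.request import Request, urlopen

PUSHGATEWAY_URL = os.environ.get("PROMETHEUS_PUSHGATEWAY", "http://prometheus-pushgateway:9091")
JOB_NAME = os.environ.get("JOB_NAME", "")
BUILD_NUMBER = os.environ.get("BUILD_NUMBER", "")

def delete_group(labels):
    """Issue a DELETE for one exact grouping key: the job label first, then the rest."""
    path = f"/metrics/job/{quote(labels['job'], safe='')}"
    for name, value in sorted(labels.items()):
        if name != "job":
            path += f"/{quote(name, safe='')}/{quote(value, safe='')}"
    req = Request(PUSHGATEWAY_URL + path, method="DELETE")
    with urlopen(req, timeout=10):
        pass

def main():
    # The Pushgateway API lists every metric group along with its grouping-key labels.
    with urlopen(f"{PUSHGATEWAY_URL}/api/v1/metrics", timeout=10) as resp:
        groups = json.load(resp).get("data", [])
    deleted = 0
    for group in groups:
        labels = group.get("labels", {})
        # Only touch groups pushed by the wrapper for this specific job and build.
        if labels.get("jenkins_job") == JOB_NAME and labels.get("build") == BUILD_NUMBER:
            delete_group(labels)
            deleted += 1
    print(f"Deleted {deleted} metric group(s) for {JOB_NAME} #{BUILD_NUMBER}")

if __name__ == "__main__":
    if not JOB_NAME or not BUILD_NUMBER:
        print("JOB_NAME and BUILD_NUMBER must be set", file=sys.stderr)
        sys.exit(1)
    main()
Deleting by exact grouping key is the only deletion the Pushgateway offers; a plain DELETE on the job and build labels alone would silently match nothing, because every pushed group also carries its instance label.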
Phase 4: Visualization and Diagnosis
With metrics flowing, we can finally see inside the black box. We configured our Prometheus server to scrape the Pushgateway, and then we built a Grafana dashboard to make sense of the data.
Our Prometheus server configuration (prometheus.yml) includes a job for the Pushgateway:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true # Crucial for Pushgateway, allows metrics to keep their original labels
    static_configs:
      - targets: ['prometheus-pushgateway.monitoring:9091']
The true power comes from the PromQL queries we can now run. We created a build-specific dashboard in Grafana that takes the Jenkins job name ($job) and build number ($build) as template variables, matched against the jenkins_job and build labels.
Here is the architecture we built:
graph TD
    subgraph Jenkins Environment
        Master[Jenkins Master]
        Coordinator[Coordinator Agent]
        Worker1[Worker Agent 1]
        Worker2[Worker Agent 2]
    end
    subgraph Monitoring Stack
        Prometheus[Prometheus Server]
        Pushgateway[Pushgateway]
        Grafana[Grafana]
    end
    User[Developer pushes code] --> Master
    Master -- triggers build --> Coordinator
    Coordinator -- runs make --> distcc
    distcc -- distributes compilation --> Worker1
    distcc -- distributes compilation --> Worker2
    Worker1 -- runs --> Wrapper[metrics_compiler_wrapper.py]
    Worker2 -- runs --> Wrapper
    Wrapper -- HTTP POST --> Pushgateway
    Prometheus -- scrapes --> Pushgateway
    Grafana -- queries --> Prometheus
    User -- views dashboard --> Grafana
Key Diagnostic Panels on our Grafana Dashboard:
Top 10 Slowest Compiling Files (Wall Clock Time):
- Query: topk(10, max(cpp_compilation_duration_seconds{jenkins_job="$job", build="$build"}) by (source_file))
- Insight: This immediately identified the “god objects” and heavily templated headers that were the worst offenders. Refactoring efforts could now be precisely targeted.

Top 10 Most CPU-Intensive Files:
- Query: topk(10, max(cpp_compilation_cpu_total_seconds{jenkins_job="$job", build="$build"}) by (source_file))
- Insight: Sometimes a file doesn’t take long in wall-clock time but consumes an enormous amount of CPU, indicating complex template instantiation or optimization work. This helped us find code that was computationally expensive to compile.

Top 10 Memory-Hungry Files:
- Query: topk(10, max(cpp_compilation_max_resident_set_size_kb{jenkins_job="$job", build="$build"}) by (source_file))
- Insight: We found a few files that required over 4GB of RAM to compile, which explained why some of our smaller Jenkins agents were thrashing their swap space and performing poorly. We could now either optimize the code or ensure these compilations were scheduled on high-memory agents.

Build Parallelism Over Time:
- Query: sum(running:cpp_compilation_duration_seconds{jenkins_job="$job", build="$build"}) (requires a recording rule), or simply count(cpp_compilation_cpu_total_seconds{jenkins_job="$job", build="$build"}) visualized as a gauge.
- Insight: This graph showed us the “shape” of our build. We saw a massive ramp-up of concurrent jobs, a long plateau, and then a sharp drop-off as the build waited for the last few, long-running compilations to finish. It also revealed that our final link step was a huge serialization point.

Compilation Distribution Across Agents:
- Query: count(cpp_compilation_duration_seconds{jenkins_job="$job", build="$build"}) by (agent)
- Insight: This allowed us to see if distcc was distributing the load evenly. We found one agent was consistently underutilized due to a subtle network configuration issue, something we never would have found otherwise.
The results were transformative. The build time dropped to under 30 minutes, not just because of distribution, but because we could now have data-driven conversations about code complexity. We could quantify the “compilation cost” of a new feature. Pre-commit hooks were even considered to reject code that introduced files that were excessively slow or memory-intensive to compile. We turned an opaque, frustrating process into a transparent, manageable one.
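That pre-commit idea is easy to prototype against the metrics we already collect. The sketch below is a hypothetical gate rather than something we shipped: it assumes a Prometheus server reachable at PROMETHEUS_URL, the metric and label names used by the wrapper above, and per-file budgets that would need tuning for a real codebase.
#!/usr/bin/env python3
"""Hypothetical CI / pre-commit gate: flag files whose recent compilation cost exceeds a budget."""
import json
import os
import sys
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://prometheus.monitoring:9090")
MAX_COMPILE_SECONDS = 120.0   # example budget, tune per project
MAX_RSS_KB = 4 * 1024 * 1024  # example budget: 4 GB

def query_max(metric, source_file):
    """Return the highest value of `metric` for one source file over the last week, or None."""
    promql = f'max(max_over_time({metric}{{source_file="{source_file}"}}[7d]))'
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})
    with urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def main(changed_files):
    failures = []
    for path in changed_files:
        name = os.path.basename(path)
        duration = query_max("cpp_compilation_duration_seconds", name)
        rss = query_max("cpp_compilation_max_resident_set_size_kb", name)
        if duration is not None and duration > MAX_COMPILE_SECONDS:
            failures.append(f"{name}: {duration:.1f}s to compile (budget {MAX_COMPILE_SECONDS}s)")
        if rss is not None and rss > MAX_RSS_KB:
            failures.append(f"{name}: {rss / 1024 / 1024:.1f} GB peak RSS (budget {MAX_RSS_KB / 1024 / 1024:.0f} GB)")
    for failure in failures:
        print(f"Compilation budget exceeded: {failure}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
Whether such a gate runs as a local pre-commit hook or as a CI check is mostly a question of where the Prometheus server is reachable from.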
Lingering Issues and Future Paths
This system, while effective, is not without its limitations. The reliance on the Pushgateway introduces a stateful component into our monitoring stack that requires careful management; it can become a performance bottleneck itself if hit with too many metrics from too many concurrent builds. A potential evolution would be to run a Prometheus agent as a sidecar on each Jenkins build agent, allowing for a more standard pull-based collection model and reducing the load on a central gateway.
Furthermore, we’ve only instrumented the compilation (-c) steps. The final link step remains a monolithic, unobserved block of time. Applying similar metric-gathering principles to the linker (ld) is the next logical frontier. This is significantly more complex, as it would likely involve instrumenting the linker itself or using advanced tools like perf to sample its execution during the pipeline.
Finally, the Python wrapper, while flexible, does introduce a non-trivial overhead for each of the thousands of files compiled. A more performant long-term solution might involve writing a native C++ launcher application or contributing instrumentation features directly to build systems like Ninja or CMake to emit these metrics with lower overhead. The current implementation represents a pragmatic balance between ease of implementation and depth of observability.