Our MLOps pipeline feedback loop was broken. A single commit to our model’s inference service, a Node.js application, triggered a 25-minute build-and-deploy cycle to our Azure Kubernetes Service (AKS) staging environment. The data science team, needing to iterate quickly on feature extraction logic implemented in this service, found their productivity crippled. The culprit wasn’t the Python model training or the Kubernetes deployment manifests; a flame graph of our GitHub Actions pipeline pointed squarely at a single, egregious step: `npm run build`. This command, invoking Babel to transpile our TypeScript and modern JavaScript, was consuming nearly 15 minutes of every run, single-handedly turning our CI/CD process into a bottleneck.
The initial, almost reflexive, reaction was to cache `node_modules`. It’s a common pattern in CI environments. We implemented `actions/cache` keyed on our `package-lock.json` hash. The result was negligible: a few seconds saved on `npm ci`, but the 15-minute transpilation beast remained untouched. A common mistake is to conflate dependency installation time with build-time computation. Our problem wasn’t I/O from the npm registry; it was the CPU-intensive task of Babel parsing and transforming thousands of files on every single commit, regardless of the change’s scope.
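Separating the two costs is easy to confirm locally. A minimal sketch, assuming the standard `npm ci` and `npm run build` scripts described above:

# Measure dependency installation (registry I/O) and transpilation (CPU) separately.
rm -rf node_modules dist
time npm ci          # dependency installation: mostly network and disk
time npm run build   # the dominant cost: Babel parsing and transforming the src tree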
This forced a deeper analysis. The core issue was the lack of build artifact caching: Babel was re-transpiling the entire codebase for every minor change. The challenge, then, was to devise a caching strategy for Babel’s output (the `dist` directory) that was both reliable and performant within the ephemeral context of GitHub Actions runners. We explicitly decided against migrating to a faster transpiler like `swc` or `esbuild`. In a real-world project with a mature codebase, replete with custom Babel macros and plugins critical to our application’s logic, such a migration would represent a multi-month engineering effort with significant risk. The directive was to fix the pipeline, not rewrite the application. Our solution had to work with the tools we had.
The concept evolved into a multi-layered caching approach. First, we needed a more intelligent cache key. Instead of hashing a lockfile, which only tracks dependency versions, we needed to hash the actual source code that Babel processes. If the source code didn’t change, the output `dist` folder would be identical, and we could safely restore it from a cache. Second, we had to rethink our Docker image build process. A naive build was invalidating Docker’s layer cache and creating bloated images. A multi-stage Docker build was necessary to separate the build environment from the lean production environment. Finally, this new, complex pipeline logic needed to be protected. How could we prevent future code changes from inadvertently breaking our hermetic build assumptions? The answer was to bake automated validation directly into our code review process, flagging problematic pull requests before they were ever merged.
Implementing Granular Artifact Caching in CI
Our baseline GitHub Actions workflow was painfully simple and slow. It checked out the code, installed dependencies, built the application, and then built and pushed a Docker image.
Here is the problematic build stage in its original form:
# .github/workflows/ci-cd.yml (Original Snippet)
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install Dependencies
        run: npm ci
      - name: Build Application (The Bottleneck)
        # This step consistently takes 12-15 minutes.
        run: npm run build
      # ... subsequent steps for Docker build, push to ACR, and deploy to AKS
The first real attempt at a fix involved creating a stable hash from our source files. A simple hash of the entire `src` directory would serve as the key for our `dist` cache. If the hash matches a previously cached version, we skip the build.
The initial script to generate this hash was straightforward. We needed to ensure it was fast and consistent across runs.
#!/bin/bash
# scripts/generate-source-hash.sh
set -e # Exit immediately if a command exits with a non-zero status.

# We must find all relevant source files, sort them to ensure a consistent
# order, and then pipe their contents into sha256sum.
# Sorting is critical for determinism; the file system order is not guaranteed.
# We also include babel.config.js and package.json, as changes to them must
# invalidate the cache.
{
  find src -type f \( -name "*.ts" -o -name "*.js" \) -print0 | sort -z | xargs -0 cat
  cat babel.config.js
  cat package.json
} | sha256sum | awk '{ print $1 }'
This script finds all `.ts` and `.js` files in the `src` folder, sorts the paths for deterministic output, concatenates their content along with the Babel config and `package.json` (as changes to dependencies or build configuration should also bust the cache), and pipes the entire stream into `sha256sum`. Grouping the three commands in braces ensures that all of their output feeds a single hash rather than only the last command’s.
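A quick sanity check is to run the script twice from a clean checkout and confirm the hash is stable, then make a trivial content change and confirm it moves (the file name below is purely illustrative):

# Determinism check: two consecutive runs must print the same hash.
./scripts/generate-source-hash.sh
./scripts/generate-source-hash.sh
# Any content change to src/, babel.config.js, or package.json must change it.
echo "// cache-bust test" >> src/server.ts   # illustrative file name
./scripts/generate-source-hash.sh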
With this hashing mechanism in place, we modified the workflow to use `actions/cache` targeting the `dist` directory.
# .github/workflows/ci-cd.yml (Improved Caching Logic)
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Cache Node Modules
        id: cache-node-modules
        uses: actions/cache@v3
        with:
          path: node_modules
          key: ${{ runner.os }}-node-modules-${{ hashFiles('**/package-lock.json') }}
      - name: Install Dependencies
        if: steps.cache-node-modules.outputs.cache-hit != 'true'
        run: npm ci
      - name: Generate Source Hash
        id: source-hash
        run: echo "hash=$(./scripts/generate-source-hash.sh)" >> $GITHUB_OUTPUT
      - name: Cache Babel Build Artifacts
        id: cache-dist
        uses: actions/cache@v3
        with:
          path: dist
          key: ${{ runner.os }}-dist-${{ steps.source-hash.outputs.hash }}
      - name: Build Application
        # This step now only runs on a cache miss.
        if: steps.cache-dist.outputs.cache-hit != 'true'
        run: |
          echo "Cache miss for build artifacts. Running full Babel build..."
          npm run build
      - name: Verify Build Output
        run: |
          if [ ! -d "dist" ] || [ -z "$(ls -A dist)" ]; then
            echo "::error::Build directory 'dist' is empty after build step. Aborting."
            exit 1
          fi
          echo "Build artifacts successfully generated or restored from cache."
      # ... subsequent steps
The results were immediate and dramatic. For commits that didn’t touch the Node.js source code (e.g., updating documentation or the Python model code in another directory), the build time for the Node.js service dropped from 15 minutes to under 30 seconds, roughly the time it took to calculate the hash and download the cached `dist` directory. For changes to the source code, we still paid the full 15-minute penalty, but this was now the exception rather than the rule.
Optimizing for Containerization with Multi-Stage Docker Builds
Our CI build was now fast, but the Docker build process introduced the next performance problem. Our original `Dockerfile` was naive.
# Dockerfile (Original)
FROM node:18-alpine
WORKDIR /app
# This copies EVERYTHING into the build context.
COPY . .
# Install all dependencies, including devDependencies.
RUN npm ci
# Run the build inside the container.
RUN npm run build
CMD ["node", "dist/server.js"]
This approach had several critical flaws in a CI/CD context:
- Bloated Image: It copied the entire repository (`.git`, source files, test files) and installed all `devDependencies`, resulting in an image size over 1GB.
- Broken Layer Caching: Any file change, even in a README, would invalidate the `COPY . .` layer, forcing `npm ci` and `npm run build` to re-run inside the Docker build, completely bypassing our GitHub Actions caching.
- Security Risk: It bundled source code and build tools into the final production image.
The solution was a multi-stage `Dockerfile`. This pattern allows you to use one stage with a full build environment and then copy only the necessary artifacts into a final, slim production image.
# Dockerfile (Multi-Stage)
# --- Build Stage ---
# This stage uses a full Node.js image to build our application.
FROM node:18-alpine AS builder
WORKDIR /app
# Copy package files and install ALL dependencies (including dev)
COPY package.json package-lock.json ./
RUN npm ci
# Copy the source code
COPY . .
# Run the build. This is the same slow step, but it happens in a
# throwaway container layer. In our CI, this step will eventually be skipped
# as we will pre-build the `dist` folder.
RUN npm run build
# Prune devDependencies to reduce the size of the node_modules folder
# that we will copy to the next stage.
RUN npm prune --production
# --- Production Stage ---
# This stage uses a slim image for the final production container.
FROM node:18-alpine AS production
WORKDIR /app
# Set production environment
ENV NODE_ENV=production
# Copy only the pruned node_modules from the 'builder' stage.
COPY --from=builder /app/node_modules ./node_modules
# Copy only the compiled application code from the 'builder' stage.
COPY --from=builder /app/dist ./dist
# Copy package.json for metadata purposes, but it's not strictly needed for running.
COPY package.json .
# Expose the application port
EXPOSE 3000
# The command to run the application.
CMD ["node", "dist/server.js"]
The multi-stage image was a major improvement for size and security, but it didn’t solve the core CI performance issue on its own. The `RUN npm run build` step was still happening inside the Docker build. The key was to connect our GitHub Actions caching with this multi-stage build: we would perform the `npm run build` step before the Docker build in the CI workflow (where it could be cached) and then copy the resulting `dist` folder into the Docker build context.
The `Dockerfile` was modified slightly so that it no longer runs the build, but instead expects a `dist` directory to be present in the build context.
# Dockerfile (Final CI-Optimized Multi-Stage)
# --- Build Stage ---
FROM node:18-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
RUN npm prune --production # We can prune early now
# --- Production Stage ---
FROM node:18-alpine AS production
WORKDIR /app
ENV NODE_ENV=production
# Copy production node_modules from the builder stage
COPY --from=builder /app/node_modules ./node_modules
COPY package.json .
# CRITICAL CHANGE: Copy the pre-built dist folder from the build context.
# This 'dist' directory is now generated by our cached GitHub Actions step.
COPY dist/ ./dist
EXPOSE 3000
CMD ["node", "dist/server.js"]
The updated CI workflow now ensures the `dist` directory exists (either from the cache or a fresh build) before invoking `docker build`. This synergy between CI-level artifact caching and Docker multi-stage builds provided a fast CI step and a small, efficient final container image for deployment to AKS.
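The remaining steps are conventional; a minimal sketch of the build-and-push commands, where the registry and image names are illustrative and an earlier step is assumed to have authenticated against ACR:

# Runs only after 'dist' has been restored from cache or rebuilt.
docker build -t myregistry.azurecr.io/inference-service:${GITHUB_SHA} .
docker push myregistry.azurecr.io/inference-service:${GITHUB_SHA}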
graph TD
    subgraph "GitHub Actions Runner"
        A[Start CI Job] --> B{Checkout Code}
        B --> C[Generate Source Hash]
        C --> D{"Restore 'dist' from Cache?"}
        D -- "Yes (Cache Hit)" --> F[Use Cached 'dist']
        D -- "No (Cache Miss)" --> E[Run 'npm run build']
        E --> G[Save new 'dist' to Cache]
        F --> H[Build Docker Image]
        G --> H
    end
    subgraph "Docker Build Process"
        H -- "docker build ." --> I("Start Multi-Stage Build")
        I --> J("Stage 'builder': Install prod dependencies")
        J --> K("Stage 'production': Copy node_modules from builder")
        K --> L("Stage 'production': Copy CI-built 'dist' folder")
        L --> M("Create Final Lean Image")
    end
    M --> N[Push Image to Azure Container Registry]
    N --> O[Deploy to AKS]
Automating Sanity Checks in Code Review
With the pipeline optimized, a new threat emerged: human error. A developer could unknowingly introduce a change that breaks our caching assumptions. For example, modifying the build script to fetch a remote configuration file at build time would make the build non-hermetic and render our content-based hashing useless. The output would change even if the source code didn’t.
We needed to shift left and catch these issues during code review. A manual process is unreliable; automation was the only scalable solution. We decided to use a custom GitHub Action that runs on every pull request to act as a “pipeline sanity checker.”
The action performs two key checks:
- Configuration Linting: It inspects `babel.config.js` and `package.json` for patterns we’ve identified as risky, like plugins that rely on environment variables that might not be consistent across runs.
- Hermetic Build Test: It attempts to run `npm run build` in a sandboxed container with network access disabled. A successful build proves the build has no hidden network dependencies.
Here’s the implementation of a custom composite action for this purpose.
First, the action definition file:
# .github/actions/pr-build-validator/action.yml
name: 'PR Build Validator'
description: 'Checks for non-hermetic build configurations and other CI anti-patterns.'
runs:
  using: "composite"
  steps:
    - name: Check for risky build script dependencies
      id: script-check
      run: |
        if grep -q "prebuild" package.json; then
          echo "::error::'prebuild' script detected. This can cause unexpected side effects. Please integrate logic into the main build script."
          exit 1
        fi
        # Add more advanced checks here, e.g., for certain babel plugins
        echo "Build script structure looks OK."
      shell: bash
    - name: Run hermetic build test
      id: hermetic-test
      run: |
        echo "Installing dependencies with network access, then building with the network disabled..."
        docker run --rm \
          -v "${{ github.workspace }}":/app -w /app \
          node:18-alpine npm ci
        if ! docker run --rm --network none \
          -v "${{ github.workspace }}":/app -w /app \
          node:18-alpine npm run build; then
          echo "::error::Hermetic build test failed. The build process may have external network dependencies, which breaks caching."
          exit 1
        fi
        echo "Hermetic build test passed."
      shell: bash
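Developers can reproduce the same check locally before opening a pull request; a sketch, assuming Docker is available on the workstation:

# Install dependencies with network access, then build with the network disabled.
docker run --rm -v "$PWD":/app -w /app node:18-alpine npm ci
docker run --rm --network none -v "$PWD":/app -w /app node:18-alpine npm run build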
And then integrating this action into our main workflow to run specifically on pull requests:
# .github/workflows/ci-cd.yml (Final version with PR checks)
name: MLOps Service CI/CD

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  validate-pr:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Run Build Validator
        uses: ./.github/actions/pr-build-validator

  build-and-deploy:
    # Note: no 'needs: validate-pr' here. That job only runs on pull_request
    # events, and depending on a skipped job would cause this job to be
    # skipped on pushes; branch protection enforces the validator before merge.
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    # ... The full build-and-deploy job from before ...
This setup ensures that no code can be merged into `main` without first passing our automated checks. It codifies the architectural constraints of our CI pipeline and presents them to developers directly in the pull request. If a developer’s change requires network access during the build, the check will fail, forcing a conversation about how to refactor the feature to comply with our performance and reliability goals. This automated gatekeeping is a far more effective form of code review for architectural patterns than relying on human reviewers to spot subtle, pipeline-breaking changes.
The result of this three-pronged approach—granular artifact caching, multi-stage Docker builds, and automated PR validation—was a transformation of our MLOps workflow. Build times on cache hits are now consistently under three minutes, down from 25. The data science team can once again iterate rapidly. The final system is not just faster but also more robust, producing smaller, more secure container images and programmatically preventing the introduction of performance regressions.
The caching strategy, while effective, relies on GitHub’s cache storage, which has size limits and eviction policies that are outside our direct control. For an enterprise-scale system, a more resilient option is a dedicated artifact store such as Azure Blob Storage in place of (or behind) `actions/cache`, providing more space and finer-grained control over cache persistence. Furthermore, our source hash generation could be improved to also incorporate the versions of key Babel plugins from `package-lock.json`, to handle cases where a dependency update changes the transpilation output without any source code modification. This current implementation is a pragmatic and high-impact optimization, but it’s a treatment of a symptom; the long-term cure remains a planned migration away from our legacy Babel build system to a more modern, performant alternative.
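As a parting sketch of that hash refinement, assuming an npm lockfile v2/v3 layout with a top-level "packages" map (the script name is hypothetical):

#!/bin/bash
# scripts/generate-build-cache-key.sh (sketch)
set -e
{
  ./scripts/generate-source-hash.sh
  # Fold in the resolved versions of Babel-related packages so a transpiler
  # or plugin upgrade also invalidates the dist cache.
  node -e '
    const lock = require("./package-lock.json");
    for (const [name, meta] of Object.entries(lock.packages || {})) {
      if (name.includes("babel")) console.log(name, meta.version);
    }
  '
} | sha256sum | awk '{ print $1 }'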