Implementing a Resilient Elixir-based Build Service for JavaScript Transpilation in an Air-Gapped VPC


The mandate was simple: no public internet access for our production build and deployment environment. Our entire infrastructure runs inside a strictly controlled, air-gapped Virtual Private Cloud (VPC). For our backend Elixir services, this was manageable. We built release artifacts using mix release, packaged them in containers with all dependencies vendored, and pushed them to our internal registry. The real friction emerged with our frontend assets. The web application relies on a modern JavaScript stack, with Babel at its core for transpilation. The standard npm install && npx babel ... workflow, which assumes unfettered access to the public npm registry, was a non-starter.

Our initial workaround was a painful, multi-step manual process. A developer would build the assets on their local machine, commit the compiled artifacts to a separate repository, and then a sanitized CI/CD pipeline would pull these pre-built assets and deploy them. This was fragile, slow, and a constant source of “it works on my machine” issues. It completely broke the ethos of a repeatable, automated build pipeline. We needed an internal, automated service that could reliably transpile our JavaScript assets within our secure perimeter.

The first thought was to stand up a dedicated Jenkins or GitLab runner inside the VPC, equipped with a private npm registry mirror. This is a common pattern, but it felt heavy and ill-suited for the specific task. We needed something lightweight, concurrent, and exceptionally fault-tolerant. A stuck npm install shouldn’t require an administrator to SSH into a runner and kill a process. We wanted a system where build jobs were isolated processes that could fail and be cleaned up automatically without affecting other pending jobs. This line of thinking led us to Elixir and OTP. The BEAM’s model of lightweight, isolated processes with built-in supervision is a perfect match for managing potentially unreliable external tasks like a Babel build. We decided to build a dedicated Elixir service, codenamed “AssetForge,” to act as an on-demand, concurrent build orchestrator.

The core concept was an Elixir application that exposes an internal API. This API would accept a git repository URL and a commit hash. The service would then:

  1. Check out the specified code into a temporary, isolated directory.
  2. Run npm install against our internal, mirrored npm registry.
  3. Execute the Babel CLI to transpile the assets.
  4. On success, push the compiled assets to a designated S3 bucket inside the VPC.
  5. On failure, log the error and clean up the temporary directory.

Each build job would run in its own supervised OTP process, ensuring that a single failure could not crash the entire system.

The first step was setting up the Elixir project and outlining the core components.

$ mix new asset_forge --sup

Our supervision tree would be simple but robust. The main application supervisor would oversee a DynamicSupervisor, which would be responsible for starting and stopping our BuildWorker GenServers on demand. A DynamicSupervisor is ideal here because we don’t know how many builds will be running at any given time.

# lib/asset_forge/application.ex
defmodule AssetForge.Application do
  @moduledoc false

  use Application

  @impl true
  def start(_type, _args) do
    children = [
      {DynamicSupervisor, name: AssetForge.BuildSupervisor, strategy: :one_for_one}
    ]

    opts = [strategy: :one_for_one, name: AssetForge.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

The real work happens inside the BuildWorker, a GenServer that encapsulates the state and logic for a single build job. It’s started with the necessary context (repo, commit hash) and manages the entire lifecycle of the build.

A public-facing module, AssetForge, provides the entry point to start a new build.

# lib/asset_forge.ex
defmodule AssetForge do
  @doc """
  Starts a new build job asynchronously.
  """
  def start_build(repo_url, commit_hash) do
    spec = {AssetForge.BuildWorker, {repo_url, commit_hash}}
    DynamicSupervisor.start_child(AssetForge.BuildSupervisor, spec)
  end
end
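
For illustration, kicking off a build from an iex session would look something like this (the repository URL and commit hash are hypothetical placeholders; in practice the call sits behind the internal API):

$ iex -S mix
iex> AssetForge.start_build("ssh://git@git.internal/frontend/webapp.git", "3f2a9c1d")
{:ok, #PID<0.245.0>}

DynamicSupervisor.start_child/2 returns the pid of the new BuildWorker, so the caller gets an immediate acknowledgment while the build proceeds in the background.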

The BuildWorker itself needs to handle the sequence of operations. We decided to use Task.async/1 within the GenServer's init/1 callback. This prevents the init from blocking the caller and allows the DynamicSupervisor to immediately start tracking the new process while the actual work happens in the background.

# lib/build_worker.ex
defmodule AssetForge.BuildWorker do
  # Builds are one-off jobs; :temporary stops the DynamicSupervisor from
  # restarting a worker that has already finished (or failed) its build.
  use GenServer, restart: :temporary
  require Logger

  # Time in milliseconds to allow for the entire build process.
  @build_timeout 300_000 # 5 minutes

  def start_link({repo_url, commit_hash}) do
    GenServer.start_link(__MODULE__, {repo_url, commit_hash})
  end

  @impl true
  def init({repo_url, commit_hash}) do
    # The state holds all necessary information for the build.
    state = %{
      repo_url: repo_url,
      commit_hash: commit_hash,
      work_dir: create_work_dir(),
      status: :pending,
      build_task: nil
    }

    # Start the build in a separate task so init/1 doesn't block the caller.
    # Task.async/1 links and monitors, so this GenServer receives the result
    # as a {ref, result} message (handled in handle_info/2 below).
    build_task = Task.async(fn -> run_build_flow(state.work_dir, repo_url, commit_hash) end)

    # The third element arms a GenServer timeout: if no message arrives within
    # @build_timeout, handle_info(:timeout, state) fires as a failsafe.
    {:ok, %{state | build_task: build_task}, @build_timeout}
  end

  # The GenServer waits for the build task to complete.
  @impl true
  def handle_info({ref, result}, %{build_task: %Task{ref: ref}} = state) do
    # The task has delivered its result; flush its :DOWN message.
    Process.demonitor(ref, [:flush])

    new_state =
      case result do
        {:ok, build_output_path} ->
          Logger.info("Build succeeded. Artifacts at: #{build_output_path}")
          # In a real system, we'd trigger the VPC deployment here.
          # e.g., AssetForge.Deployer.upload_to_vpc_s3(build_output_path)
          %{state | status: :success}

        {:error, reason} ->
          Logger.error("Build failed: #{inspect(reason)}")
          %{state | status: :failed}
      end

    # Clean up the working directory regardless of outcome.
    File.rm_rf!(state.work_dir)
    Logger.info("Cleaned up work directory: #{state.work_dir}")

    # The process has done its job and can now terminate.
    {:stop, :normal, new_state}
  end

  # Failsafe: if no message arrives within @build_timeout (armed in init/1),
  # the GenServer receives :timeout and aborts the build.
  @impl true
  def handle_info(:timeout, state) do
     Logger.error("Build timed out for repo #{state.repo_url} at commit #{state.commit_hash}")
     # Ensure the task is killed
     Task.shutdown(state.build_task, :brutal_kill)
     File.rm_rf!(state.work_dir)
     {:stop, :shutdown, %{state | status: :timeout}}
  end

  # --- Private Helper Functions ---

  defp create_work_dir do
    # Generate a unique directory for each build to ensure isolation.
    tmp_path = System.tmp_dir!()
    build_id = :crypto.strong_rand_bytes(8) |> Base.encode16()
    work_dir = Path.join(tmp_path, "asset_forge_#{build_id}")
    File.mkdir_p!(work_dir)
    work_dir
  end

  defp run_build_flow(work_dir, repo_url, commit_hash) do
    with {:ok, _} <- git_clone(work_dir, repo_url, commit_hash),
         {:ok, _} <- npm_install(work_dir),
         {:ok, build_path} <- babel_transpile(work_dir) do
      {:ok, build_path}
    else
      {:error, {:timeout, message}} ->
        {:error, message}

      {:error, {exit_code, output}} ->
        {:error, "Command failed with exit code #{exit_code}. Output: #{output}"}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
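
The success branch above references an AssetForge.Deployer module that we haven't shown. A minimal sketch of what upload_to_vpc_s3/1 could look like, assuming the ex_aws and ex_aws_s3 packages are mirrored internally and ExAws is pointed at the S3 endpoint inside the VPC (both assumptions, not part of the code above):

# lib/asset_forge/deployer.ex (sketch)
defmodule AssetForge.Deployer do
  @moduledoc """
  Uploads compiled assets to the internal S3 bucket. Illustrative sketch;
  assumes :ex_aws / :ex_aws_s3 are available from the internal Hex mirror.
  """

  def upload_to_vpc_s3(build_output_path) do
    bucket = Application.fetch_env!(:asset_forge, :deployment_bucket)

    build_output_path
    |> Path.join("**/*")
    |> Path.wildcard()
    |> Enum.reject(&File.dir?/1)
    |> Enum.each(fn file ->
      # Preserve the directory layout of the dist/ output as the object key.
      key = Path.relative_to(file, build_output_path)
      body = File.read!(file)

      bucket
      |> ExAws.S3.put_object(key, body)
      |> ExAws.request!()
    end)

    :ok
  end
end

The :deployment_bucket key is the one defined in the production configuration shown later.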

The most critical part is the interaction with external commands: git, npm, and babel. Using System.cmd/3 is a common approach, but for better control over I/O, error handling, and process management, Elixir’s Port is a superior tool. A Port allows the BEAM to communicate with an external OS process through standard I/O streams. This gives us a robust way to capture logs and errors.

We built a small utility module to wrap the Port interaction, making it easier to reuse and test.

# lib/asset_forge/command_runner.ex
defmodule AssetForge.CommandRunner do
  require Logger

  # Per-command timeout. The overall build timeout lives in the BuildWorker.
  @command_timeout 60_000 # 1 minute

  # Declaring the contract as a behaviour callback is what lets Mox
  # generate a mock for this module in the test suite.
  @callback run(String.t(), [String.t()], keyword()) ::
              {:ok, String.t()} | {:error, term()}

  @doc """
  Executes a command in a specified directory using a Port.
  Returns {:ok, output} or {:error, {exit_code, output}}.
  """
  def run(executable, args, opts \\ []) do
    work_dir = Keyword.get(opts, :cd, File.cwd!())

    # :spawn_executable does not search $PATH, so resolve the full path first.
    case System.find_executable(executable) do
      nil ->
        {:error, {:not_found, "#{executable} not found in PATH"}}

      executable_path ->
        port_opts = [
          :binary,
          {:args, args},
          {:cd, work_dir},
          :exit_status,
          :stderr_to_stdout,
          :hide,
          :use_stdio
        ]

        port = Port.open({:spawn_executable, executable_path}, port_opts)

        # Collect the merged output stream and wait for the exit status.
        collect_output(port, "")
    end
  end

  defp collect_output(port, output) do
    receive do
      {^port, {:data, data}} ->
        # :stderr_to_stdout merges both streams, which is enough for build logs.
        # A production system might capture them separately.
        collect_output(port, output <> data)

      {^port, {:exit_status, 0}} ->
        Logger.debug("Command successful. Output: #{output}")
        {:ok, output}

      {^port, {:exit_status, exit_code}} ->
        Logger.error("Command failed with exit code #{exit_code}. Output: #{output}")
        {:error, {exit_code, output}}
    after
      @command_timeout ->
        Port.close(port)
        {:error, {:timeout, "Command timed out after #{@command_timeout}ms"}}
    end
  end
end
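
Because run/3 resolves the executable on $PATH via System.find_executable/1, callers pass bare command names, exactly as the BuildWorker helpers do. A quick smoke test from iex (the output shown is illustrative):

iex> AssetForge.CommandRunner.run("echo", ["hello"])
{:ok, "hello\n"}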

With this CommandRunner behind a small configuration lookup, our run_build_flow helpers become much cleaner and more robust, and the test suite can substitute a mock for the real module.

# lib/build_worker.ex (helper function implementations)
defmodule AssetForge.BuildWorker do
  # ... existing code ...

  # The runner module is read from application config so tests can swap in
  # a Mox mock (see config/test.exs below).
  defp command_runner do
    Application.get_env(:asset_forge, :command_runner, AssetForge.CommandRunner)
  end

  defp git_clone(work_dir, repo_url, commit_hash) do
    Logger.info("Cloning #{repo_url} into #{work_dir}")

    case command_runner().run("git", ["clone", repo_url, "."], cd: work_dir) do
      {:ok, _} ->
        Logger.info("Checking out commit #{commit_hash}")
        command_runner().run("git", ["checkout", commit_hash], cd: work_dir)

      error ->
        error
    end
  end

  defp npm_install(work_dir) do
    Logger.info("Running npm install in #{work_dir}")
    # --registry points to our internal, air-gapped mirror.
    # This configuration is critical.
    registry_url = Application.fetch_env!(:asset_forge, :npm_registry)
    command_runner().run("npm", ["install", "--registry=#{registry_url}"], cd: work_dir)
  end

  defp babel_transpile(work_dir) do
    Logger.info("Running babel transpilation in #{work_dir}")
    # These paths would be configured based on project conventions.
    source_dir = Path.join(work_dir, "src")
    output_dir = Path.join(work_dir, "dist")
    File.mkdir_p!(output_dir)

    # Using npx to ensure we use the project's local babel version.
    case command_runner().run("npx", ["babel", source_dir, "--out-dir", output_dir], cd: work_dir) do
      {:ok, _} -> {:ok, output_dir}
      error -> error
    end
  end
end

Configuration is managed through the standard config files, so nothing like a registry URL is hardcoded. Because the service ships as a mix release, runtime values are read from environment variables in config/runtime.exs when the release boots, rather than being baked in at compile time.

# config/runtime.exs
import Config

config :asset_forge,
  # This URL points to our internal Verdaccio/Artifactory instance
  # within the VPC.
  npm_registry: System.get_env("NPM_REGISTRY_URL"),
  # Configuration for the S3 bucket where assets are deployed.
  deployment_bucket: System.get_env("ASSET_DEPLOYMENT_BUCKET")

# Logger configuration for production
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id]

To visualize the flow, the entire process can be mapped out.

sequenceDiagram
    participant Client as API Client
    participant AssetForge as AssetForge Service
    participant Supervisor as BuildSupervisor
    participant Worker as BuildWorker (GenServer)
    participant Runner as CommandRunner (Port)
    participant Infra as VPC Infrastructure

    Client->>+AssetForge: start_build(repo, hash)
    AssetForge->>+Supervisor: start_child(BuildWorker, {repo, hash})
    Supervisor->>Worker: start_link()
    Note right of Worker: init() called, Task.async started
    Supervisor-->>-AssetForge: {:ok, pid}
    AssetForge-->>-Client: {:ok, pid} (build runs asynchronously)
    Worker->>+Runner: run("git clone ...")
    Runner-->>-Worker: {:ok, _}
    Worker->>+Runner: run("npm install ...")
    Runner-->>-Worker: {:ok, _}
    Worker->>+Runner: run("npx babel ...")
    Runner-->>-Worker: {:ok, build_path}
    Worker->>Infra: Upload assets from build_path to S3
    Note right of Worker: Work dir cleaned up
    Worker-->>Supervisor: :normal exit

Testing this system required careful consideration of the external dependencies. We can’t run git and npm in our unit tests. The solution was Mox: because CommandRunner declares its run/3 contract as a behaviour callback, Mox can generate a mock for it.

# test/support/mocks.ex
Mox.defmock(AssetForge.CommandRunnerMock, for: AssetForge.CommandRunner)

And in config/test.exs, we tell our application to use the mock instead of the real module.

# config/test.exs
config :asset_forge,
  command_runner: AssetForge.CommandRunnerMock,
  # Placeholder; the mocked runner never contacts a real registry.
  npm_registry: "http://npm-mirror.test"

This allows us to write tests that verify the BuildWorker's logic without shelling out to git, npm, or Babel, and without any network access.

# test/asset_forge/build_worker_test.exs
defmodule AssetForge.BuildWorkerTest do
  # The worker (and its internal Task) call the mock from their own processes,
  # so we run Mox in global mode, which requires async: false.
  use ExUnit.Case, async: false
  import Mox

  alias AssetForge.CommandRunnerMock

  setup :set_mox_global
  setup :verify_on_exit!

  test "a successful build flow executes all commands and stops normally" do
    repo_url = "..."
    commit_hash = "..."

    # Expect all external commands to be called in order
    expect(CommandRunnerMock, :run, fn "git", ["clone", ^repo_url, "."], _opts -> {:ok, ""} end)
    expect(CommandRunnerMock, :run, fn "git", ["checkout", ^commit_hash], _opts -> {:ok, ""} end)
    expect(CommandRunnerMock, :run, fn "npm", ["install", _], _opts -> {:ok, ""} end)
    expect(CommandRunnerMock, :run, fn "npx", ["babel", _, _, _], _opts -> {:ok, ""} end)

    # Start the worker, which will trigger the flow
    {:ok, pid} = AssetForge.BuildWorker.start_link({repo_url, commit_hash})

    # Assert that the process terminates cleanly after the work is done.
    ref = Process.monitor(pid)
    assert_receive {:DOWN, ^ref, :process, ^pid, :normal}
  end

  test "a failing npm install stops the flow and terminates" do
    repo_url = "..."
    commit_hash = "..."

    # Three calls in total: clone and checkout succeed, then npm install fails.
    expect(CommandRunnerMock, :run, 3, fn
      "git", ["clone", _, _], _ -> {:ok, ""}
      "git", ["checkout", _], _ -> {:ok, ""}
      # Simulate npm failure
      "npm", _, _ -> {:error, {1, "npm failed"}}
    end)

    {:ok, pid} = AssetForge.BuildWorker.start_link({repo_url, commit_hash})
    ref = Process.monitor(pid)

    # Even on failure, the GenServer deliberately stops with :normal so the
    # supervisor doesn't restart the build; the failure shows up in the logs.
    assert_receive {:DOWN, ^ref, :process, ^pid, :normal}

    # A more advanced test could check for the log output, as sketched below.
  end
end
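
Following up on that last comment, a log-asserting variant is sketched below. It assumes it lives in the same test module (reusing the global Mox setup) and uses ExUnit.CaptureLog; the asserted phrase matches the Logger.error call in the BuildWorker above.

  # Sketch: verifying the failure is actually logged.
  import ExUnit.CaptureLog

  test "a failing npm install logs the failure reason" do
    expect(CommandRunnerMock, :run, 3, fn
      "git", ["clone", _, _], _ -> {:ok, ""}
      "git", ["checkout", _], _ -> {:ok, ""}
      "npm", _, _ -> {:error, {1, "npm failed"}}
    end)

    log =
      capture_log(fn ->
        {:ok, pid} = AssetForge.BuildWorker.start_link({"...", "..."})
        ref = Process.monitor(pid)
        assert_receive {:DOWN, ^ref, :process, ^pid, :normal}
      end)

    assert log =~ "Build failed"
  end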

The final implementation provided a stable, observable, and resilient service. Build jobs are isolated, timeouts prevent stuck processes from consuming resources indefinitely, and the supervision tree ensures the service as a whole remains healthy even if individual builds fail. It solved our air-gapped deployment problem in a way that felt native to the Elixir ecosystem, turning a brittle manual process into a reliable piece of infrastructure.

This architecture is not without its limitations. The current system executes commands directly on the host machine, which presents a security risk if a malicious package.json script were introduced. A future iteration must sandbox the entire build process within a short-lived container (e.g., using Docker or gVisor), which the Elixir service would orchestrate. Furthermore, the build queue logic is non-existent; builds are started immediately. A more sophisticated system would use a queuing mechanism to control concurrency and prioritize jobs, perhaps by introducing another GenServer to act as a pool manager in front of the DynamicSupervisor. The mechanism of using OS ports also carries a performance penalty due to data serialization between the BEAM and the external process; however, for a task as coarse-grained as a full npm/Babel build, this overhead is negligible compared to the benefits of process isolation and robustness.
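
As a sketch of that last idea, a small queue GenServer could sit in front of the BuildSupervisor and cap concurrency. The module name, the enqueue/2 API, and the @max_concurrent value below are assumptions rather than existing code:

# lib/asset_forge/build_queue.ex (sketch)
defmodule AssetForge.BuildQueue do
  use GenServer

  # Assumed concurrency cap; tune to the build host's capacity.
  @max_concurrent 4

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def enqueue(repo_url, commit_hash) do
    GenServer.cast(__MODULE__, {:enqueue, {repo_url, commit_hash}})
  end

  @impl true
  def init(:ok), do: {:ok, %{queue: :queue.new(), running: 0}}

  @impl true
  def handle_cast({:enqueue, job}, state) do
    {:noreply, maybe_start(%{state | queue: :queue.in(job, state.queue)})}
  end

  # Each started build is monitored; when it finishes (or crashes), free the
  # slot and pull the next job off the queue.
  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    {:noreply, maybe_start(%{state | running: state.running - 1})}
  end

  defp maybe_start(%{running: running} = state) when running >= @max_concurrent, do: state

  defp maybe_start(state) do
    case :queue.out(state.queue) do
      {{:value, {repo_url, commit_hash}}, rest} ->
        {:ok, pid} = AssetForge.start_build(repo_url, commit_hash)
        Process.monitor(pid)
        maybe_start(%{state | queue: rest, running: state.running + 1})

      {:empty, _} ->
        state
    end
  end
end

The queue would be added to the children of AssetForge.Supervisor next to the DynamicSupervisor, and callers would switch from start_build/2 to enqueue/2.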

