The operational burden of processing security advisories is a significant source of toil. A new CVE is published, and the platform team is tasked with translating a block of unstructured English text into a precise, targeted change within our Chef-managed infrastructure. This manual process—reading the advisory, identifying the affected package or configuration, locating the relevant cookbook, modifying attributes, testing, and deploying—is slow and prone to human error. A missed detail in the advisory can lead to an incomplete fix, while a typo in a Chef attribute can cause an outage. Our core pain point was this translation layer: from human language to machine-executable, declarative configuration.
We conceptualized a system to automate this translation. The idea was to build a pipeline that ingests raw CVE text, uses Natural Language Processing (NLP) to extract structured, actionable intelligence, presents a proposed remediation plan to an operator for approval, and then automatically applies the change to our Chef infrastructure. This isn’t just about running a script; it’s about programmatically generating the declarative state changes that our configuration management system expects.
The technology selection was driven by pragmatic, production-focused concerns. For the NLP component, spaCy was chosen for its performance and trainable Named Entity Recognition (NER) models. We needed to teach a model to recognize specific technical entities like package names, configuration file paths, and remediation verbs, which generic models fail at. For our configuration management, Chef was the incumbent tool, so integration was non-negotiable. The challenge was to interact with it programmatically and safely. For the human-in-the-loop interface, we settled on a Vue.js frontend with a tRPC backend. The end-to-end type safety provided by tRPC is not a luxury; in a system that proposes changes to production infrastructure, eliminating entire classes of data-related bugs between the client and server is a critical risk mitigation strategy.
Phase 1: Training a Domain-Specific NLP Model with spaCy
A generic NLP model understands people, places, and organizations. It does not understand openssl-1.1.1k, /etc/ssh/sshd_config, or the imperative to "set the PermitRootLogin directive to no". Our first and most critical task was to build a custom spaCy NER model trained on a corpus of security advisories and technical documentation.
A common mistake is underestimating the effort required for data annotation. We created a small, focused dataset to prove the concept. The data format is a simple list of tuples, where each tuple contains the text and a dictionary of entity spans.
# file: training_data.py
# A small, representative sample of our training data.
# In a real-world project, this would be thousands of entries and managed
# with an annotation tool like Prodigy.
TRAIN_DATA = [
(
"Update the openssl package to version 1.1.1n-0+deb11u3 to mitigate the vulnerability.",
{
"entities": [
(11, 18, "PACKAGE_NAME"),
                (38, 54, "PACKAGE_VERSION"),
                (0, 6, "ACTION_VERB"),
]
},
),
(
"Users should upgrade bind9 to 1:9.16.22-1~deb11u1 immediately.",
{
"entities": [
(13, 20, "ACTION_VERB"),
(21, 26, "PACKAGE_NAME"),
(30, 49, "PACKAGE_VERSION"),
]
},
),
(
"To fix this, set the 'UseDNS' option to 'no' in /etc/ssh/sshd_config.",
{
"entities": [
                (13, 16, "ACTION_VERB"),
                (18, 24, "CONFIG_KEY"),
                (37, 39, "CONFIG_VALUE"),
                (44, 64, "CONFIG_PATH"),
]
},
),
(
"It is required to disable the TLSv1.1 protocol in your web server configuration.",
{
"entities": [
                (18, 25, "ACTION_VERB"),
                (30, 37, "CONFIG_KEY"),
]
},
),
]
The training script itself is standard spaCy procedure. We start with a blank English model, add our custom NER pipeline component, and train it on our annotated data. The key is running enough iterations for the model to converge without overfitting.
# file: train_ner_model.py
import spacy
from spacy.training.example import Example
from spacy.util import minibatch, compounding
import random
import pathlib
import logging
# Configure basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Import training data
from training_data import TRAIN_DATA
def train_custom_ner_model(output_dir: str, n_iter: int = 100):
"""
Train a custom Named Entity Recognition model with spaCy.
"""
# Create a blank English model
nlp = spacy.blank("en")
logging.info("Created blank 'en' model")
# The pitfall here is trying to use a pre-trained model like 'en_core_web_sm'
# and updating its NER component. This often leads to 'catastrophic forgetting'
# where the model forgets its original training. Starting fresh is more reliable
# for a highly domain-specific task.
if "ner" not in nlp.pipe_names:
ner = nlp.add_pipe("ner", last=True)
else:
ner = nlp.get_pipe("ner")
# Add our custom labels to the NER pipeline
for _, annotations in TRAIN_DATA:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
# We want to train only the NER component, so we disable other pipes
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
# Training loop
with nlp.select_pipes(disable=unaffected_pipes):
optimizer = nlp.begin_training()
for iteration in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
# Batch up the examples using spaCy's minibatch
batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
examples = []
for text, annots in batch:
# Create Example objects for training
examples.append(Example.from_dict(nlp.make_doc(text), annots))
# Update the model with the batch of examples
nlp.update(
examples,
drop=0.35, # Dropout rate for regularization
sgd=optimizer,
losses=losses,
)
if iteration % 10 == 0:
logging.info(f"Iteration {iteration}/{n_iter}, Losses: {losses}")
# Save the trained model to the output directory
output_path = pathlib.Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)  # create parent dirs (e.g. ./models) if missing
nlp.to_disk(output_path)
logging.info(f"Saved model to {output_path}")
if __name__ == "__main__":
# In a real project, this path would be configured and versioned,
# likely stored in an artifact repository.
train_custom_ner_model(output_dir="./models/cve_ner_model")
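Before wiring the model into a service, it is worth a quick smoke test against a sentence the model has not seen. The snippet below is an illustrative check, not part of the training script; the model path simply matches the output directory used above.
# file: smoke_test.py
# Illustrative sanity check for the trained model; not part of the pipeline.
import spacy

nlp = spacy.load("./models/cve_ner_model")
doc = nlp("Upgrade the openssl package to 1.1.1n-0+deb11u3 on all affected hosts.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)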
Once trained, this model can be loaded into a simple Python service (e.g., using Flask or FastAPI) that exposes an endpoint to analyze new text. The output is a structured JSON object, which is the foundational data for our entire remediation pipeline.
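As a hedged illustration, here is a minimal FastAPI sketch of such a service. The endpoint path, request model, and model location are assumptions for demonstration, not the production implementation; the JSON shape mirrors the CveAnalysis structure the backend expects in Phase 2.
# file: nlp_service.py
# Minimal sketch of an NLP analysis service. Endpoint and request model are
# illustrative assumptions, not the production code.
import spacy
from fastapi import FastAPI
from pydantic import BaseModel

nlp = spacy.load("./models/cve_ner_model")
app = FastAPI()

class AnalyzeRequest(BaseModel):
    cve_id: str
    text: str

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    doc = nlp(req.text)
    # Structured JSON consumed by the rest of the remediation pipeline.
    return {
        "cveId": req.cve_id,
        "rawText": req.text,
        "extractedEntities": [
            {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
            for ent in doc.ents
        ],
    }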
Phase 2: Building a Type-Safe API with tRPC
The backend acts as the central nervous system, connecting the NLP service, the Chef infrastructure, and the operator UI. We chose tRPC because it allows us to define our API contract in a single place using TypeScript and Zod schemas. This contract is then automatically shared with the Vue.js frontend, providing compile-time type checking and intellisense across the entire stack.
First, we define the shared data structures. This is typically done in a shared monorepo package.
// file: packages/api-types/src/index.ts
import { z } from 'zod';
// Schema for entities extracted by our spaCy model
export const NlpEntitySchema = z.object({
text: z.string(),
label: z.enum([
'PACKAGE_NAME',
'PACKAGE_VERSION',
'CONFIG_KEY',
'CONFIG_VALUE',
'CONFIG_PATH',
'ACTION_VERB',
]),
start: z.number(),
end: z.number(),
});
export type NlpEntity = z.infer<typeof NlpEntitySchema>;
// Schema for the analysis result of a given CVE text
export const CveAnalysisSchema = z.object({
cveId: z.string().regex(/^CVE-\d{4}-\d{4,}$/),
rawText: z.string(),
extractedEntities: z.array(NlpEntitySchema),
});
export type CveAnalysis = z.infer<typeof CveAnalysisSchema>;
// Schema representing a proposed change to a Chef attribute
export const ChefAttributeChangeSchema = z.object({
cookbook: z.string(),
attributePath: z.string(), // e.g., 'openssh.server.permit_root_login'
currentValue: z.any().optional(),
proposedValue: z.any(),
});
export type ChefAttributeChange = z.infer<typeof ChefAttributeChangeSchema>;
// The final remediation proposal, which includes the original analysis
// and the generated Chef changes for operator review.
export const RemediationProposalSchema = z.object({
proposalId: z.string().uuid(),
analysis: CveAnalysisSchema,
proposedChanges: z.array(ChefAttributeChangeSchema),
status: z.enum(['pending_approval', 'approved', 'rejected', 'applied']),
createdAt: z.date(),
});
export type RemediationProposal = z.infer<typeof RemediationProposalSchema>;
With the types defined, we build the tRPC router. This is the core of our backend API. It defines the available procedures (queries for reading data, mutations for changing data), their input validation schemas, and their resolver logic.
// file: packages/server/src/trpc/router.ts
import { initTRPC } from '@trpc/server';
import { z } from 'zod';
import {
CveAnalysisSchema,
RemediationProposalSchema
} from 'api-types'; // Assuming a shared package
import { CveProcessorService } from '../services/cveProcessor';
import { ChefOrchestratorService } from '../services/chefOrchestrator';
// Dummy service instances. In production, these would be properly instantiated
// with dependency injection.
const cveProcessor = new CveProcessorService();
const chefOrchestrator = new ChefOrchestratorService();
const t = initTRPC.create();
export const appRouter = t.router({
// Query to get a list of all current remediation proposals
getProposals: t.procedure
.query(async () => {
// In a real implementation, this would fetch from a database.
// We'll return a mock list for demonstration.
console.log('Fetching all proposals...');
return chefOrchestrator.getAllProposals();
}),
// Query to get details for a single proposal
getProposalById: t.procedure
.input(z.object({ proposalId: z.string().uuid() }))
.query(async ({ input }) => {
console.log(`Fetching proposal with ID: ${input.proposalId}`);
const proposal = await chefOrchestrator.getProposal(input.proposalId);
if (!proposal) {
throw new Error('Proposal not found');
}
return proposal;
}),
// Mutation to ingest a new CVE, analyze it, and generate a proposal
createProposalFromCve: t.procedure
.input(z.object({
cveId: z.string().regex(/^CVE-\d{4}-\d{4,}$/),
cveText: z.string().min(20),
}))
.output(RemediationProposalSchema) // Enforces the return type
.mutation(async ({ input }) => {
console.log(`Processing CVE: ${input.cveId}`);
// Step 1: Call the NLP service to get structured entities
const analysisResult = await cveProcessor.analyze(input.cveId, input.cveText);
// Step 2: Translate the NLP output into a concrete Chef change proposal
const proposal = await chefOrchestrator.generateProposal(analysisResult);
// Step 3: Persist the proposal and return it
return await chefOrchestrator.saveProposal(proposal);
}),
// Mutation for an operator to approve a proposal
approveProposal: t.procedure
.input(z.object({ proposalId: z.string().uuid() }))
.mutation(async ({ input }) => {
console.log(`Approving proposal: ${input.proposalId}`);
const proposal = await chefOrchestrator.getProposal(input.proposalId);
if (!proposal) {
throw new Error('Proposal not found');
}
// A critical real-world check: ensure we're not applying an already-applied proposal.
if(proposal.status !== 'pending_approval') {
throw new Error(`Cannot approve proposal in status: ${proposal.status}`);
}
// Apply the changes to the Chef server
await chefOrchestrator.applyRemediation(proposal);
// Update the proposal status
await chefOrchestrator.updateProposalStatus(input.proposalId, 'applied');
return { success: true, message: `Proposal ${input.proposalId} applied.` };
}),
});
// Export the type of the router, which is what the client will use.
export type AppRouter = typeof appRouter;
This router provides a strongly-typed, self-documenting API. The use of Zod for input validation is crucial; it prevents malformed data from ever reaching our core business logic.
Phase 3: The Orchestration and Translation Logic
This is where the magic happens. The orchestration service is responsible for taking the structured JSON from the spaCy model and translating it into a proposed change for a Chef attribute. This requires mapping the abstract entities to our concrete infrastructure reality.
A key component here is an “Infrastructure Manifest,” which is essentially a database or configuration file that maps NLP entities to Chef cookbook details. A naive implementation could be a simple JSON file.
// file: infrastructure-manifest.json
{
"packages": {
"openssl": {
"cookbook": "base_linux",
"attribute_path": "base_linux.packages.openssl.version"
},
"bind9": {
"cookbook": "dns_server",
"attribute_path": "dns_server.bind.version"
}
},
"configs": {
"/etc/ssh/sshd_config": {
"cookbook": "openssh",
"key_mapping": {
"UseDNS": "openssh.server.use_dns",
"PermitRootLogin": "openssh.server.permit_root_login"
}
}
}
}
The service uses this manifest to perform the translation.
// file: packages/server/src/services/chefOrchestrator.ts
import { v4 as uuidv4 } from 'uuid';
import { CveAnalysis, ChefAttributeChange, RemediationProposal } from 'api-types';
import manifest from './infrastructure-manifest.json';
// In-memory store for proposals. Replace with a real database (e.g., PostgreSQL).
const proposalStore: Map<string, RemediationProposal> = new Map();
export class ChefOrchestratorService {
public async generateProposal(analysis: CveAnalysis): Promise<RemediationProposal> {
const proposedChanges: ChefAttributeChange[] = [];
// Logic to process extracted entities
const pkgNameEntity = analysis.extractedEntities.find(e => e.label === 'PACKAGE_NAME');
const pkgVersionEntity = analysis.extractedEntities.find(e => e.label === 'PACKAGE_VERSION');
// Handle package updates
if (pkgNameEntity && pkgVersionEntity) {
const pkgInfo = manifest.packages[pkgNameEntity.text as keyof typeof manifest.packages];
if (pkgInfo) {
// A production system would fetch the *current* attribute value from Chef Server
// to show a proper diff.
proposedChanges.push({
cookbook: pkgInfo.cookbook,
attributePath: pkgInfo.attribute_path,
proposedValue: pkgVersionEntity.text,
currentValue: 'unknown' // Placeholder
});
} else {
// Important logging: The CVE mentions a package we don't manage declaratively.
console.warn(`No manifest entry found for package: ${pkgNameEntity.text}`);
}
}
// Handle configuration changes
const configPathEntity = analysis.extractedEntities.find(e => e.label === 'CONFIG_PATH');
const configKeyEntity = analysis.extractedEntities.find(e => e.label === 'CONFIG_KEY');
const configValueEntity = analysis.extractedEntities.find(e => e.label === 'CONFIG_VALUE');
if(configPathEntity && configKeyEntity && configValueEntity) {
const configInfo = manifest.configs[configPathEntity.text as keyof typeof manifest.configs];
if(configInfo) {
const attributePath = configInfo.key_mapping[configKeyEntity.text as keyof typeof configInfo.key_mapping];
if(attributePath) {
proposedChanges.push({
cookbook: configInfo.cookbook,
attributePath: attributePath,
proposedValue: configValueEntity.text,
currentValue: 'unknown' // Placeholder
});
}
}
}
const proposal: RemediationProposal = {
proposalId: uuidv4(),
analysis,
proposedChanges,
status: 'pending_approval',
createdAt: new Date(),
};
return proposal;
}
public async saveProposal(proposal: RemediationProposal): Promise<RemediationProposal> {
proposalStore.set(proposal.proposalId, proposal);
console.log(`Proposal ${proposal.proposalId} saved.`);
return proposal;
}
// ... other methods like getProposal, getAllProposals, etc. ...
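  // Minimal in-memory sketches of those elided methods (illustrative only;
  // a production system would back them with a real database such as PostgreSQL).
  public async getProposal(proposalId: string): Promise<RemediationProposal | undefined> {
    return proposalStore.get(proposalId);
  }
  public async getAllProposals(): Promise<RemediationProposal[]> {
    return Array.from(proposalStore.values());
  }
  public async updateProposalStatus(
    proposalId: string,
    status: RemediationProposal['status']
  ): Promise<void> {
    const proposal = proposalStore.get(proposalId);
    if (proposal) {
      proposal.status = status;
    }
  }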
public async applyRemediation(proposal: RemediationProposal): Promise<void> {
// This is the integration point with Chef.
// In a real-world scenario, this would use a library like 'chef-api' or
// shell out to 'knife' commands to update Chef Environment attributes
// or push a new Policyfile.
console.log(`Applying remediation for proposal ${proposal.proposalId}`);
for (const change of proposal.proposedChanges) {
console.log(` -> SETTING Chef attribute '${change.attributePath}' to '${change.proposedValue}'`);
// Example command (conceptual):
// knife exec -E "nodes.find('*:*').each { |n| n.normal['${path[0]}']['${path[1]}'] = '${change.proposedValue}'; n.save }"
// The exact implementation depends heavily on the Chef setup (Policyfiles vs. Environments).
}
console.log('Chef server update simulation complete.');
}
}
This entire flow can be visualized with a simple diagram.
graph TD
    A[Unstructured CVE Text] --> B{spaCy NER Service};
    B -->|Structured JSON| C{tRPC Backend};
    C --> D{Orchestration Logic};
    subgraph "Translation"
        D -- Uses --> E[Infrastructure Manifest];
    end
    D --> F[Remediation Proposal];
    F --> G((Database));
    C -- getProposal --> G;
    H[Vue.js UI] -- tRPC Client --> C;
    H --> I{Operator Review};
    I -- Approve --> H;
    H -- approveProposal --> C;
    C --> J{Apply to Chef Server};
    J --> K[Chef Client Converges];
Phase 4: The Operator Interface with Vue.js and tRPC Client
The frontend’s job is to present the remediation proposal in a clear, unambiguous way. The operator must see the original CVE text, the NLP model’s interpretation, and the exact, declarative change that will be applied to the infrastructure.
Thanks to tRPC, creating the client is trivial. We simply import the AppRouter type from our backend.
// file: packages/client/src/utils/trpc.ts
import { createTRPCProxyClient, httpBatchLink } from '@trpc/client';
import type { AppRouter } from 'server/src/trpc/router'; // Direct type import
export const trpc = createTRPCProxyClient<AppRouter>({
links: [
httpBatchLink({
url: 'http://localhost:3000/trpc', // URL of our tRPC server
}),
],
});
Now, in our Vue component, we can call backend procedures as if they were local, async functions, with full type safety and autocompletion.
<!-- file: packages/client/src/components/ProposalDetail.vue -->
<template>
<div v-if="isLoading">Loading proposal...</div>
<div v-else-if="error">Error: {{ error.message }}</div>
<div v-else-if="proposal" class="proposal-container">
<h2>Remediation Proposal: {{ proposal.proposalId }}</h2>
<p><strong>Status:</strong> <span :class="`status-${proposal.status}`">{{ proposal.status }}</span></p>
<div class="card">
<h3>CVE Analysis ({{ proposal.analysis.cveId }})</h3>
<p class="cve-text">{{ proposal.analysis.rawText }}</p>
<h4>Extracted Entities:</h4>
<ul>
<li v-for="entity in proposal.analysis.extractedEntities" :key="entity.start">
<strong>{{ entity.label }}:</strong> <code>{{ entity.text }}</code>
</li>
</ul>
</div>
<div class="card">
<h3>Proposed Chef Changes</h3>
<div v-if="proposal.proposedChanges.length === 0">
<p>No declarative changes could be generated from this advisory.</p>
</div>
<div v-for="change in proposal.proposedChanges" :key="change.attributePath" class="change-block">
<p><strong>Cookbook:</strong> {{ change.cookbook }}</p>
<p><strong>Attribute Path:</strong> <code>{{ change.attributePath }}</code></p>
<pre class="diff">
- currentValue: {{ change.currentValue ?? 'not set' }}
+ proposedValue: {{ change.proposedValue }}
</pre>
</div>
</div>
<div class="actions">
<button
@click="handleApprove"
:disabled="isApproving || proposal.status !== 'pending_approval'">
{{ isApproving ? 'Approving...' : 'Approve and Apply' }}
</button>
</div>
</div>
</template>
<script setup lang="ts">
import { ref, onMounted } from 'vue';
import { trpc } from '../utils/trpc';
import type { RemediationProposal } from 'api-types';
const props = defineProps<{
proposalId: string;
}>();
const proposal = ref<RemediationProposal | null>(null);
const isLoading = ref(true);
const isApproving = ref(false);
const error = ref<Error | null>(null);
onMounted(async () => {
try {
// Calling the backend procedure. Notice the full type safety.
proposal.value = await trpc.getProposalById.query({ proposalId: props.proposalId });
} catch (err) {
error.value = err as Error;
} finally {
isLoading.value = false;
}
});
const handleApprove = async () => {
if(!proposal.value) return;
if (confirm('Are you sure you want to apply these changes to the Chef infrastructure?')) {
isApproving.value = true;
try {
await trpc.approveProposal.mutate({ proposalId: props.proposalId });
// In a real app, we would either refetch the data or optimistically update the state.
alert('Approval successful. Changes are being applied.');
proposal.value.status = 'applied';
} catch (err) {
alert(`Approval failed: ${(err as Error).message}`);
} finally {
isApproving.value = false;
}
}
};
</script>
<style scoped>
/* Basic styling for clarity */
.proposal-container { font-family: sans-serif; }
.card { border: 1px solid #ccc; border-radius: 5px; padding: 1em; margin-bottom: 1em; }
.diff { background-color: #f0f0f0; padding: 0.5em; border-radius: 3px; }
.actions button:disabled { opacity: 0.5; }
.status-applied { color: green; }
.status-pending_approval { color: orange; }
</style>
This component provides the essential “human-in-the-loop” functionality. The operator is the final gatekeeper, using their expertise to validate the AI’s interpretation before committing a change. The clarity and type-safety of the UI are paramount to making this a tool that builds trust rather than creating anxiety.
The system described is a proof-of-concept, a framework for a much more robust internal platform. Its primary dependency and most significant weakness is the accuracy of the NER model. A model trained on a few dozen examples will be brittle; a production system requires a continuous feedback loop where operators can correct mis-identified entities, feeding that data back into the training set. Furthermore, the logic for translating entities into Chef changes is simplistic. It cannot handle complex remediations that require multiple steps or conditional logic, which would necessitate a more sophisticated translation engine, perhaps one that generates small, dynamic Chef recipes instead of just modifying attributes. Finally, the mapping from a vulnerability to the specific nodes it affects is assumed. A mature implementation would need deep integration with a CMDB or asset inventory to know precisely which subset of the fleet requires the change, allowing for more targeted and less risky deployments.