The deployment pipeline for service-billing-v3 failed again. Silently. The Nomad job showed it was running, logs were empty, and health checks were pending. After two hours of digging through the Consul UI, checking ACL tokens, and tailing Envoy proxy logs, we found it: a typo in an upstream service name within the connect stanza of the job file. service-account instead of service-accounts. This single-character difference meant the mTLS connection was refused, traffic was black-holed, and a critical deployment was blocked. This wasn’t an isolated incident; it was a systemic flaw in our process. Expecting every developer to be a Nomad and Consul Connect expert was proving to be an expensive and frustrating assumption.
Our platform is a mix of old and new. Legacy stateful services run on Nomad for its simplicity with raw exec drivers, while newer stateless applications are deployed on Kubernetes (EKS) to leverage its rich ecosystem. We unified service discovery and security across both orchestrators with HashiCorp Consul, enforcing strict mTLS for all inter-service communication via Consul Connect. The security posture was strong, but the developer experience was brittle. The feedback loop was hours long, culminating in a frustrating kubectl exec or nomad alloc exec session. The problem wasn’t the infrastructure; it was the lack of validation at the source.
The initial thought was to build a complex pre-deployment validation service, maybe using Open Policy Agent (OPA), that would check configurations against a set of policies. But in a real-world project, adding another moving part to the deployment pipeline introduces its own failure modes. The goal is to reduce friction, not relocate it. The validation needed to happen earlier, on the developer’s machine, right in their editor. It needed to be part of the same workflow they used for checking code quality. Since our application teams primarily use TypeScript and manage their infrastructure-as-code definitions within the same repositories, the tool they all use is ESLint. The idea formed: what if we treated our Nomad HCL and Kubernetes YAML configurations as lintable artifacts? Could we build a custom ESLint plugin to enforce our specific mTLS policies? It felt unconventional, but the pragmatic benefits were too compelling to ignore.
The Foundation: A Custom ESLint Plugin for Non-JS Files
The first hurdle is that ESLint is designed for JavaScript. It works by parsing code into an Abstract Syntax Tree (AST) and then having rules traverse that tree to find patterns. It has no native understanding of HCL or YAML. A common mistake here would be to abandon the tool. The pragmatic path is to adapt it. ESLint has a concept of “processors” which allows pre-processing of files before they are passed to the linter.
We can create a processor that reads a .nomad.hcl or .k8s.yaml file, parses it into a JavaScript object using an external library, and then exports this object as a string of JavaScript code. ESLint can then parse this generated JavaScript.
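To make the trick concrete, here is roughly what the processor hands to ESLint for a trimmed Kubernetes manifest. The filename and manifest are illustrative; the exact object shape depends on the parser, but the wrapping is always the same:
// Input (billing-api.k8s.yaml):
//   kind: Deployment
//   metadata:
//     name: billing-api
//
// Generated "JavaScript" that ESLint actually lints:
module.exports = {
  "kind": "Deployment",
  "metadata": {
    "name": "billing-api"
  }
};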
Here is the basic structure of our plugin:
eslint-plugin-infra/package.json:
{
  "name": "eslint-plugin-infra",
  "version": "1.0.0",
  "description": "ESLint rules for internal infrastructure configuration.",
  "main": "lib/index.js",
  "scripts": {
    "test": "mocha tests/**/*.js"
  },
  "keywords": [
    "eslint",
    "eslintplugin"
  ],
  "author": "Platform Team",
  "license": "Internal",
  "dependencies": {
    "hcl-parser": "^0.2.0",
    "js-yaml": "^4.1.0"
  },
  "devDependencies": {
    "eslint": "^8.50.0",
    "mocha": "^10.2.0"
  }
}
eslint-plugin-infra/lib/index.js:
/**
 * @fileoverview Main entry point for the eslint-plugin-infra plugin.
 * @author Platform Team
 */
"use strict";

const yaml = require("js-yaml");
const { parse } = require("hcl-parser");

// Processor to handle YAML and HCL files
const infraProcessor = {
  // Takes the text of the file and the filename
  preprocess: function(text, filename) {
    let configObject;
    try {
      if (filename.endsWith(".yaml") || filename.endsWith(".yml")) {
        // For Kubernetes, we might have multiple documents in one file.
        // We'll process the first one for simplicity in this example.
        // A production implementation should handle multi-document YAML.
        const docs = yaml.loadAll(text);
        configObject = docs[0] || {};
      } else if (filename.endsWith(".hcl")) {
        // The hcl-parser returns an array for the top-level blocks
        [configObject] = parse(text);
      } else {
        // Not a file we care about, return original text
        return [text];
      }
    } catch (e) {
      // If parsing fails, we can't lint.
      // A common mistake is to swallow this error. It's better to make it visible.
      console.error(`Error parsing ${filename}:`, e.message);
      // Return an empty object so rules don't crash.
      configObject = {};
    }
    // Wrap the parsed JSON object in a fake JS module export.
    // This is the core trick. ESLint can now parse this string.
    const jsCode = `module.exports = ${JSON.stringify(configObject, null, 2)};`;
    return [jsCode];
  },
  // Takes messages from ESLint and maps them back to the original file
  postprocess: function(messages, filename) {
    // For now, we'll just return the messages as-is.
    // A more advanced implementation could try to map line/column numbers.
    return messages[0];
  },
  supportsAutofix: false
};

module.exports = {
  // Import rules
  rules: {
    "nomad-require-connect": require("./rules/nomad-require-connect"),
    "validate-service-upstreams": require("./rules/validate-service-upstreams"),
    "k8s-require-mtls-annotation": require("./rules/k8s-require-mtls-annotation")
  },
  // Define processors
  processors: {
    ".hcl": infraProcessor,
    ".yaml": infraProcessor,
    ".yml": infraProcessor
  }
};
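The preprocess hook above only looks at the first YAML document, which the inline comment flags as a simplification. ESLint processors can return multiple named code blocks, one per document, so a multi-document variant is possible. The following is a sketch of that approach rather than what we run in production:
// Sketch: a multi-document-aware preprocess/postprocess pair (reuses the js-yaml import above).
// Each YAML document becomes its own code block; ESLint lints each block separately,
// and postprocess receives one message array per block, which we flatten back together.
const multiDocProcessor = {
  preprocess: function(text, filename) {
    const docs = yaml.loadAll(text);
    return docs.map((doc, index) => ({
      text: `module.exports = ${JSON.stringify(doc || {}, null, 2)};`,
      filename: `${index}.js`
    }));
  },
  postprocess: function(messages, filename) {
    return messages.flat();
  },
  supportsAutofix: false
};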
This setup provides the bridge. When we configure ESLint to use this plugin for .hcl and .yaml files, our processor will transparently convert them into a format the linter engine can understand.
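Wiring the plugin into a repository is one small config file plus the --ext flag on the command line (shown later in the CI workflow). A minimal legacy-format .eslintrc.js might look like this, assuming the plugin is installed locally as eslint-plugin-infra; because the processors are registered under extension names (".hcl", ".yaml", ".yml"), ESLint's legacy config applies them automatically to matching files once the plugin is listed:
// .eslintrc.js — a minimal sketch for ESLint 8's legacy config format.
module.exports = {
  root: true,
  plugins: ["infra"],
  overrides: [
    {
      // Nomad job files
      files: ["**/*.hcl"],
      rules: {
        "infra/nomad-require-connect": "error",
        "infra/validate-service-upstreams": "error"
      }
    },
    {
      // Kubernetes manifests
      files: ["**/*.yaml", "**/*.yml"],
      rules: {
        "infra/k8s-require-mtls-annotation": "error"
      }
    }
  ]
};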
Rule 1: Enforcing mTLS Enablement on Nomad Jobs
Our first policy is simple: every service defined in a Nomad job must explicitly enable Consul Connect. This prevents developers from accidentally deploying a service outside the mesh. A connect block with a sidecar_service is required.
Here’s the implementation of the rule, nomad-require-connect.js.
eslint-plugin-infra/lib/rules/nomad-require-connect.js:
/**
 * @fileoverview Enforces that Nomad services have Consul Connect enabled.
 * @author Platform Team
 */
"use strict";

module.exports = {
  meta: {
    type: "problem",
    docs: {
      description: "Enforce Consul Connect sidecar for Nomad services",
      category: "Infrastructure Configuration",
      recommended: true,
    },
    fixable: null,
    schema: [] // no options
  },
  create: function(context) {
    // ESLint rules operate on AST nodes. Our "source code" is the JS-wrapped config.
    // We start from the root of the object expression.
    return {
      "ObjectExpression": function(node) {
        // Our HCL parser puts the top-level block (e.g., 'job') as the first property.
        const jobProperty = node.properties.find(p => p.key.value === 'job');
        if (!jobProperty) return;
        const jobNode = jobProperty.value.elements[0]; // HCL blocks are parsed as arrays
        // Find the 'group' blocks within the job
        const groupProperty = jobNode.properties.find(p => p.key.value === 'group');
        if (!groupProperty) return;
        groupProperty.value.elements.forEach(groupNode => {
          // Find the 'service' blocks within the group
          const serviceProperty = groupNode.properties.find(p => p.key.value === 'service');
          if (!serviceProperty) return;
          serviceProperty.value.elements.forEach(serviceNode => {
            const connectProperty = serviceNode.properties.find(p => p.key.value === 'connect');
            // Check for the existence of the 'connect' block
            if (!connectProperty) {
              context.report({
                node: serviceNode,
                message: "Service definition is missing a 'connect' block for mTLS.",
              });
              return; // Stop checking this service block
            }
            // The 'connect' block should also be an array containing an object
            const connectNode = connectProperty.value.elements[0];
            const sidecarProperty = connectNode.properties.find(p => p.key.value === 'sidecar_service');
            if (!sidecarProperty) {
              context.report({
                node: connectNode,
                message: "The 'connect' block must contain a 'sidecar_service' to enable mTLS.",
              });
            }
          });
        });
      }
    };
  }
};
Now, if a developer writes this Nomad job file:
jobs/billing-service.nomad.hcl:
job "billing-service" {
group "api" {
task "server" {
driver = "docker"
// ...
}
service {
name = "billing-api"
port = "8080"
// Missing 'connect' block
}
}
}
Running ESLint will immediately produce an error: Service definition is missing a 'connect' block for mTLS.
This feedback is instantaneous.
Rule 2: Validating Service Upstreams Against a Known Registry
This is where we solve the typo problem that started this entire initiative. The rule needs to check every service name listed in an upstreams block against an authoritative list of all registered services in our environment.
A common pitfall here is to make the ESLint rule perform a network call to the Consul API. This would make linting slow, non-deterministic, and dependent on network access and credentials. It’s a critical design error for a tool that needs to run quickly and frequently.
The pragmatic solution is to decouple the data fetching from the validation. We’ll have a CI job that runs periodically (or on-demand) to query the Consul catalog and write the list of service names to a file, which is then committed to the repository.
scripts/sync-services.sh:
#!/bin/bash
# This script should be run in a CI pipeline with access to Consul.
# CONSUL_HTTP_ADDR and CONSUL_HTTP_TOKEN are expected in the environment (see the CI workflow).
set -euo pipefail

CONSUL_HTTP_ADDR="${CONSUL_HTTP_ADDR:-http://localhost:8500}"
OUTPUT_FILE=".known-services.json"

echo "Fetching known services from Consul..."
curl -sf -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN:-}" \
  "${CONSUL_HTTP_ADDR}/v1/catalog/services" | jq 'keys' > "$OUTPUT_FILE"
echo "Service list synced to $OUTPUT_FILE"
The repository will now contain a .known-services.json file:
[
  "service-accounts",
  "service-payments",
  "service-ledger",
  "api-gateway"
]
The ESLint rule can now safely and quickly read this local file for validation.
eslint-plugin-infra/lib/rules/validate-service-upstreams.js:
/**
 * @fileoverview Validates Nomad connect upstreams against a known service list.
 * @author Platform Team
 */
"use strict";

const fs = require('fs');
const path = require('path');

// Memoize the service list to avoid reading the file for every linted file.
let knownServices = null;

function getKnownServices(rootDir) {
  if (knownServices !== null) {
    return knownServices;
  }
  try {
    const filePath = path.join(rootDir, '.known-services.json');
    const fileContent = fs.readFileSync(filePath, 'utf-8');
    knownServices = new Set(JSON.parse(fileContent));
    return knownServices;
  } catch (error) {
    // If the file doesn't exist or is invalid, we can't validate.
    // Return an empty set to avoid false positives and log a warning.
    console.warn("Could not load .known-services.json. Upstream validation will be skipped.");
    knownServices = new Set();
    return knownServices;
  }
}

module.exports = {
  meta: {
    type: "problem",
    docs: {
      description: "Validate Consul Connect upstreams against a pre-generated list of known services",
      category: "Infrastructure Configuration",
      recommended: true,
    },
    fixable: null,
    schema: []
  },
  create: function(context) {
    // We need to find the project root to locate the .known-services.json file.
    // context.getCwd() gives us the directory where ESLint was run.
    const services = getKnownServices(context.getCwd());
    if (services.size === 0) {
      // Don't proceed if we couldn't load the service list.
      return {};
    }
    function checkUpstreams(upstreamsNode) {
      if (!upstreamsNode || upstreamsNode.type !== 'ArrayExpression') return;
      upstreamsNode.elements.forEach(upstreamElement => {
        if (upstreamElement.type !== 'ObjectExpression') return;
        const destNameProperty = upstreamElement.properties.find(p => p.key.value === 'destination_name');
        if (!destNameProperty) return;
        const serviceName = destNameProperty.value.value;
        if (!services.has(serviceName)) {
          context.report({
            node: destNameProperty.value,
            message: `Upstream service '${serviceName}' is not a known registered service. Check for typos.`,
          });
        }
      });
    }
    return {
      "ObjectExpression": function(node) {
        // This traversal logic is similar to the previous rule, but drills down further.
        const jobProperty = node.properties.find(p => p.key.value === 'job');
        if (!jobProperty) return;
        const jobNode = jobProperty.value.elements[0];
        const groupProperty = jobNode.properties.find(p => p.key.value === 'group');
        if (!groupProperty) return;
        groupProperty.value.elements.forEach(groupNode => {
          // Connect sidecars are defined on group-level services (as in rule 1),
          // so we look for 'service' blocks directly under each group.
          const serviceProperty = groupNode.properties.find(p => p.key.value === 'service');
          if (!serviceProperty) return;
          serviceProperty.value.elements.forEach(serviceNode => {
            const connectProperty = serviceNode.properties.find(p => p.key.value === 'connect');
            if (!connectProperty) return;
            const connectNode = connectProperty.value.elements[0];
            const sidecarProperty = connectNode.properties.find(p => p.key.value === 'sidecar_service');
            if (!sidecarProperty) return;
            const sidecarNode = sidecarProperty.value.elements[0];
            const proxyProperty = sidecarNode.properties.find(p => p.key.value === 'proxy');
            if (!proxyProperty) return;
            const proxyNode = proxyProperty.value.elements[0];
            const upstreamsProperty = proxyNode.properties.find(p => p.key.value === 'upstreams');
            if (upstreamsProperty) {
              checkUpstreams(upstreamsProperty.value);
            }
          });
        });
      }
    };
  }
};
Now, when a developer makes the original typo in their job file:
// ...
connect {
  sidecar_service {
    proxy {
      upstreams {
        destination_name = "service-account" // The typo
        local_bind_port  = 8081
      }
    }
  }
}
// ...
They will get an immediate ESLint error: Upstream service 'service-account' is not a known registered service. Check for typos.
The feedback loop shrinks from hours to seconds.
Extending the Paradigm to Kubernetes
The same processor logic works for Kubernetes YAML. We just need a rule tailored to its structure. In our K8s clusters, we use Linkerd for mTLS, which is enabled by adding an annotation to the workload. Our policy is that every Deployment must have the linkerd.io/inject: enabled annotation.
eslint-plugin-infra/lib/rules/k8s-require-mtls-annotation.js:
/**
 * @fileoverview Ensures Kubernetes Deployments have Linkerd mTLS injection enabled.
 * @author Platform Team
 */
"use strict";

module.exports = {
  meta: {
    type: "problem",
    docs: {
      description: "Require Linkerd injection annotation for all Deployments.",
      category: "Infrastructure Configuration",
      recommended: true,
    },
    fixable: null,
    schema: [],
  },
  create: function(context) {
    return {
      "ObjectExpression": function(node) {
        // Find the 'kind' property to identify the resource type.
        const kindProperty = node.properties.find(p => p.key.value === 'kind');
        if (!kindProperty || kindProperty.value.value !== 'Deployment') {
          return; // Not a deployment, we don't care.
        }
        // Navigate to metadata.annotations
        const metadataProperty = node.properties.find(p => p.key.value === 'metadata');
        if (!metadataProperty || metadataProperty.value.type !== 'ObjectExpression') {
          context.report({ node, message: "Deployment is missing 'metadata'." });
          return;
        }
        const annotationsProperty = metadataProperty.value.properties.find(p => p.key.value === 'annotations');
        if (!annotationsProperty || annotationsProperty.value.type !== 'ObjectExpression') {
          context.report({ node: metadataProperty, message: "Deployment metadata is missing 'annotations' for mTLS configuration." });
          return;
        }
        const annotations = annotationsProperty.value;
        const linkerdInjectProp = annotations.properties.find(p => p.key.value === 'linkerd.io/inject');
        if (!linkerdInjectProp) {
          context.report({ node: annotations, message: "Annotation 'linkerd.io/inject' is missing." });
        } else if (linkerdInjectProp.value.value !== 'enabled') {
          context.report({ node: linkerdInjectProp.value, message: "Annotation 'linkerd.io/inject' must be set to 'enabled'." });
        }
      }
    };
  }
};
This single plugin now provides consistent policy enforcement across both our Nomad and Kubernetes environments, using a tool developers already have integrated into their workflow.
Testing and Integration
A tool like this is only production-grade if it’s reliable and testable. ESLint provides a RuleTester utility that makes unit testing rules straightforward.
eslint-plugin-infra/tests/lib/rules/nomad-require-connect.test.js:
const { RuleTester } = require("eslint");
const rule = require("../../../lib/rules/nomad-require-connect");
const { parse } = require("hcl-parser");

// RuleTester needs a parser that can handle our fake JS module.
// The default one works fine for this.
const ruleTester = new RuleTester();

// Helper to wrap HCL fixtures in the same format the processor produces.
const wrapHcl = (hclString) => {
  const [configObject] = parse(hclString);
  return `module.exports = ${JSON.stringify(configObject)};`;
};

ruleTester.run("nomad-require-connect", rule, {
  valid: [
    {
      code: wrapHcl(`
        job "valid" {
          group "api" {
            service {
              name = "api"
              connect { sidecar_service {} }
            }
          }
        }
      `),
    },
  ],
  invalid: [
    {
      code: wrapHcl(`
        job "invalid" {
          group "api" {
            service {
              name = "api"
            }
          }
        }
      `),
      errors: [{ message: "Service definition is missing a 'connect' block for mTLS." }],
    },
    {
      code: wrapHcl(`
        job "invalid-no-sidecar" {
          group "api" {
            service {
              name = "api"
              connect {}
            }
          }
        }
      `),
      errors: [{ message: "The 'connect' block must contain a 'sidecar_service' to enable mTLS." }],
    },
  ],
});
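The Kubernetes rule can be exercised the same way; the only difference is that the fixture is wrapped via js-yaml rather than the HCL parser. A minimal sketch of such a test, with the file path assumed to mirror the Nomad test:
eslint-plugin-infra/tests/lib/rules/k8s-require-mtls-annotation.test.js:
const { RuleTester } = require("eslint");
const yaml = require("js-yaml");
const rule = require("../../../lib/rules/k8s-require-mtls-annotation");

const ruleTester = new RuleTester();

// Wrap a YAML manifest the same way the processor does.
const wrapYaml = (yamlString) =>
  `module.exports = ${JSON.stringify(yaml.load(yamlString))};`;

ruleTester.run("k8s-require-mtls-annotation", rule, {
  valid: [
    {
      code: wrapYaml(`
kind: Deployment
metadata:
  annotations:
    linkerd.io/inject: enabled
`),
    },
  ],
  invalid: [
    {
      code: wrapYaml(`
kind: Deployment
metadata:
  annotations: {}
`),
      errors: [{ message: "Annotation 'linkerd.io/inject' is missing." }],
    },
  ],
});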
With tests in place, the final step is integration into the CI pipeline.
.github/workflows/lint.yml:
name: Lint Infrastructure Config

on:
  pull_request:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install Dependencies
        run: npm install
      - name: Sync Consul Services
        env:
          CONSUL_HTTP_ADDR: ${{ secrets.CONSUL_ADDR }}
          CONSUL_HTTP_TOKEN: ${{ secrets.CONSUL_TOKEN }}
        run: ./scripts/sync-services.sh
      - name: Run ESLint on Infra Configs
        run: npx eslint --ext .hcl,.yaml,.yml ./services/
Now, any pull request that introduces an mTLS configuration error will fail the build, preventing the error from ever reaching a staging or production environment. We’ve shifted a class of runtime infrastructure failures left, transforming them into simple, static analysis errors that are caught in seconds.
The reliance on a checked-in .known-services.json file is a deliberate trade-off. It keeps the linter fast and deterministic, but introduces the possibility of the list becoming stale. A future improvement could involve having the linter query a lightweight, highly available service catalog API, but that would compromise its ability to run offline and would add a network dependency, something we explicitly avoided. The current solution, refreshed via a simple CI job, represents a robust and pragmatic balance. Furthermore, the HCL/YAML-to-JS AST conversion is an effective hack, but a more academically pure approach would be a custom ESLint parser. In a real-world project, however, the maintenance cost of a custom parser far outweighs the benefits when a simple processor solves the problem adequately. The goal was not perfect architecture, but the effective elimination of a recurring, expensive problem.