Implementing a Zero-Trust Service on Docker Swarm with Vault AppRole and SAML Authentication


The mandate was clear: build a new internal administrative service that couldn’t trust the network and couldn’t rely on manually provisioned credentials. Every internal tool we had deployed previously suffered from the same original sin—a collection of API keys, database passwords, and TLS certificates checked into environment files, managed via Docker secrets populated by a CI runner, or worse, left in a project’s docker-compose.yml commented as “for dev only.” This approach was fragile, insecure, and an operational nightmare. For this project, the service had to bootstrap its own identity and secrets, and user access had to be gated by our corporate single sign-on.

Our chosen stack was Docker Swarm for orchestration due to its operational simplicity for our scale, Tornado for the asynchronous Python service, HashiCorp Vault as our central secrets broker, and SAML for federated identity. The core challenge wasn’t using any one of these technologies, but orchestrating their interaction to achieve a zero-trust startup sequence. The application container, upon starting, would know nothing. It had to securely prove its identity to Vault, fetch its entire configuration—including the SAML specifics needed to authenticate human users—and only then begin serving traffic.

This is the log of how we built it, the problems we hit, and the patterns we established.

Technical Pain Point: The Empty Container Problem

A freshly scheduled container on a Docker Swarm node is fundamentally untrusted. It has no inherent, verifiable identity that another system, like Vault, can immediately accept. The most common anti-pattern is to bake a Vault token into the Docker image or pass it as an environment variable. This is a non-starter; the token is static, long-lived, and exposed.

Our initial concept was to leverage Vault’s AppRole authentication backend. AppRole is designed for machine-to-machine authentication. An “AppRole” is a set of policies. To authenticate, a client needs a RoleID (publicly known, like a username) and a SecretID (a secret credential, like a password). The RoleID could be baked into the image, but the SecretID is the critical bootstrap secret. How do we deliver this SecretID to the container securely and ephemerally?

Kubernetes solves this elegantly with Service Account Tokens automatically mounted into pods, which Vault’s Kubernetes auth method can then validate. Docker Swarm has no direct equivalent. We had to devise a secure introduction mechanism. The solution was to use a trusted orchestrator (our CI/CD pipeline) to request a short-lived, single-use SecretID from Vault, but with a twist: we’d request a wrapped SecretID.

With Vault’s response wrapping, the secret never appears in the response itself; it is stored in the cubbyhole of a temporary, single-use “wrapping token” and can be unwrapped exactly once. The CI/CD process looks like this:

  1. CI/CD authenticates to Vault with its own high-privilege policy.
  2. It requests a new SecretID for the application’s AppRole, asking Vault to response-wrap the result.
  3. Vault returns only a short-lived wrapping token; the plaintext SecretID never reaches the pipeline.
  4. The CI/CD pipeline injects this wrapping token into a Docker Secret, which is then attached to the Swarm service.

The container starts, reads the wrapping token from the Docker Secret file, unwraps it to get the real SecretID, and then proceeds with the AppRole login. This wrapping token is useless after its first use, significantly shrinking the attack surface.
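
That single-use behaviour is easy to see against a dev Vault. Below is a minimal hvac sketch (illustrative only, not part of the service code; it consumes the token, so run it instead of the application rather than alongside it):

# check_wrapping_token.py -- illustrative standalone script
import os
import hvac

# The wrapping token delivered by the orchestrator, at the path used later in this post.
wrapping_token = open("/run/secrets/vault_wrapping_token").read().strip()

client = hvac.Client(url=os.environ["VAULT_ADDR"], token=wrapping_token)

# First use: returns the wrapped payload, here the AppRole SecretID.
secret_id = client.sys.unwrap()["data"]["secret_id"]
print("unwrapped a SecretID")

# Second use: Vault revoked the wrapping token on first unwrap, so this is rejected.
try:
    client.sys.unwrap()
except (hvac.exceptions.Forbidden, hvac.exceptions.InvalidRequest):
    print("wrapping token already used, as intended")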

Step 1: Configuring Vault and the Swarm Stack

First, we need a foundational docker-compose.yml to deploy Vault and our Tornado application on a Swarm cluster. For this demonstration, Vault runs in dev mode, which is not for production but simplifies setup.

# docker-compose.yml
version: '3.8'

services:
  vault:
    image: hashicorp/vault:1.15
    ports:
      - "8200:8200"
    environment:
      - VAULT_DEV_ROOT_TOKEN_ID=root
      - VAULT_ADDR=http://127.0.0.1:8200
    cap_add:
      - IPC_LOCK
    command: server -dev -dev-listen-address="0.0.0.0:8200"

  admin_app:
    # docker stack deploy ignores the build section, so build and push this
    # image to a registry reachable by the Swarm nodes before deploying.
    image: my-secure-app:latest
    build:
      context: ./app
    environment:
      - VAULT_ADDR=http://vault:8200
      - VAULT_ROLE_ID=${VAULT_ROLE_ID} # Substituted from the deployer's shell at deploy time
    secrets:
      - vault_wrapping_token
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

secrets:
  vault_wrapping_token:
    external: true

The key parts here are the VAULT_ROLE_ID environment variable and the vault_wrapping_token Docker secret. The deployer is responsible for creating this secret.

Next, we configure Vault. This is done via its CLI or API after the container starts.

# Wait for vault to be up
# Run these commands from a machine with Vault CLI or exec into the container

export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN='root'

# 1. Enable AppRole auth method
vault auth enable approle

# 2. Enable KV v2 secrets engine
# (A KV v2 engine is already mounted at secret/ in -dev mode; in that case this
# command reports "path is already in use" and can be skipped.)
vault secrets enable -path=secret kv-v2

# 3. Create a policy for our app
# This policy allows reading SAML configs and database credentials
vault policy write admin-app-policy - <<EOF
path "secret/data/admin-app/saml" {
  capabilities = ["read"]
}
path "secret/data/admin-app/database" {
  capabilities = ["read"]
}
EOF

# 4. Create the AppRole
# We bind the policy to this role.
# We set a short TTL for the generated token for security.
vault write auth/approle/role/admin-app \
    token_policies="admin-app-policy" \
    token_ttl=1h \
    token_max_ttl=4h

# 5. Get the RoleID
# This is considered non-sensitive and can be baked into the image or CI vars.
vault read auth/approle/role/admin-app/role-id
# Output:
# Key        Value
# ---        -----
# role_id    <some-role-id> -> This goes into VAULT_ROLE_ID

# 6. Store some secrets for the app to fetch later
vault kv put secret/admin-app/saml \
    sp_private_key=@/path/to/sp.key \
    sp_public_cert=@/path/to/sp.crt \
    idp_metadata_xml=@/path/to/idp-metadata.xml

vault kv put secret/admin-app/database \
    username="dbuser" \
    password="supersecretpassword"

Now for the deployment-time magic. The CI/CD script generates and injects the wrapped SecretID.

# This would be in a CI/CD deployment script

# 1. Generate a new SecretID for the role, response-wrapped.
# The -wrap-ttl flag tells Vault to wrap the response itself, so the plaintext
# SecretID never reaches the CI runner; only the 5-minute wrapping token does.
WRAPPING_TOKEN=$(vault write -f -wrap-ttl=5m -field=wrapping_token \
    auth/approle/role/admin-app/secret-id)

# 2. Create the Docker secret before deploying the stack
# The secret name must match what's in docker-compose.yml. Docker secrets are
# immutable, so on re-deploys the old secret must be removed (or the name
# versioned) before this step.
echo "$WRAPPING_TOKEN" | docker secret create vault_wrapping_token -

# 3. Deploy the stack
# Inject the RoleID as an environment variable; docker stack deploy substitutes
# ${VAULT_ROLE_ID} in the compose file from the shell environment.
export VAULT_ROLE_ID=$(vault read -field=role_id auth/approle/role/admin-app/role-id)
docker stack deploy -c docker-compose.yml my_secure_stack

The application now has a one-time-use token to fetch its real credential.

Step 2: The Tornado Application’s Bootstrap Logic

The core of the application is a Python service using Tornado. On startup, it must perform the Vault authentication sequence before it even thinks about listening on a port.

Here is the app/ directory structure:

app/
├── Dockerfile
├── requirements.txt
├── vault_client.py
└── main.py

requirements.txt:

tornado==6.4
hvac==1.2.1
python3-saml==1.16.0

vault_client.py contains the logic for the bootstrap sequence.

# app/vault_client.py

import os
import hvac
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

class VaultClient:
    """
    A client to handle the secure introduction to HashiCorp Vault.
    1. Reads a wrapping token from a Docker secret.
    2. Unwraps the token to get a SecretID.
    3. Performs an AppRole login to get a client token.
    4. Provides a configured HVAC client for further secret retrieval.
    """
    def __init__(self, vault_addr: str, role_id: str, wrapping_token_path: str):
        self.vault_addr = vault_addr
        self.role_id = role_id
        self.wrapping_token_path = wrapping_token_path
        self.client = None

    def initialize(self) -> bool:
        """
        Performs the full authentication and initialization sequence.
        Returns True on success, False on failure.
        """
        try:
            wrapping_token = self._read_wrapping_token()
            if not wrapping_token:
                return False

            # Initialize a temporary client just for unwrapping
            unwrap_client = hvac.Client(url=self.vault_addr)
            unwrap_client.token = wrapping_token
            
            logging.info("Attempting to unwrap SecretID from wrapping token...")
            # The unwrap operation is a one-time deal.
            unwrap_response = unwrap_client.sys.unwrap()
            secret_id = unwrap_response['data']['secret_id']
            logging.info("Successfully unwrapped SecretID.")

            # Now perform the AppRole login
            approle_client = hvac.Client(url=self.vault_addr)
            logging.info(f"Performing AppRole login with RoleID: {self.role_id[:8]}...")
            
            login_response = approle_client.auth.approle.login(
                role_id=self.role_id,
                secret_id=secret_id
            )
            
            # We got our client token. This is the goal of the bootstrap.
            client_token = login_response['auth']['client_token']
            logging.info("AppRole login successful. Client token acquired.")

            # Create the final, authenticated client instance
            self.client = hvac.Client(url=self.vault_addr, token=client_token)
            
            if not self.client.is_authenticated():
                logging.error("Client is not authenticated even after AppRole login.")
                return False
            
            logging.info("Vault client is fully initialized and authenticated.")
            return True

        except hvac.exceptions.InvalidRequest as e:
            logging.error(f"HVAC Invalid Request during initialization: {e}. Check Vault policies and roles.")
            return False
        except hvac.exceptions.Forbidden as e:
            logging.error(f"HVAC Forbidden error: {e}. Wrapping token might be expired or already used.")
            return False
        except Exception as e:
            logging.error(f"An unexpected error occurred during Vault initialization: {e}")
            return False

    def _read_wrapping_token(self) -> str | None:
        """Reads the wrapping token from the file path provided by Docker secrets."""
        try:
            with open(self.wrapping_token_path, 'r') as f:
                token = f.read().strip()
                if not token:
                    logging.error(f"Wrapping token file is empty: {self.wrapping_token_path}")
                    return None
                logging.info(f"Read wrapping token from {self.wrapping_token_path}")
                return token
        except FileNotFoundError:
            logging.error(f"Wrapping token file not found at: {self.wrapping_token_path}")
            return None
        except IOError as e:
            logging.error(f"Could not read wrapping token file: {e}")
            return None

    def read_secret(self, path: str) -> dict | None:
        """Reads a secret from the KVv2 engine."""
        if not self.client or not self.client.is_authenticated():
            logging.error("Cannot read secret: Vault client is not initialized or authenticated.")
            return None
        
        try:
            logging.info(f"Reading secret from path: {path}")
            response = self.client.secrets.kv.v2.read_secret_version(path=path)
            # The actual data is nested under 'data' -> 'data' for KVv2
            return response['data']['data']
        except hvac.exceptions.InvalidPath:
            logging.error(f"Secret path not found in Vault: {path}")
            return None
        except Exception as e:
            logging.error(f"Failed to read secret from {path}: {e}")
            return None

This VaultClient class encapsulates the entire complex bootstrap process. The main.py will use it to fetch configuration before starting the web server.
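
In isolation, the class boils down to three calls. A short usage sketch (the environment variable names and secret paths match the compose file and Vault setup above):

# Illustrative standalone usage of VaultClient (main.py does the same thing)
import os
from vault_client import VaultClient

client = VaultClient(
    vault_addr=os.environ["VAULT_ADDR"],
    role_id=os.environ["VAULT_ROLE_ID"],
    wrapping_token_path="/run/secrets/vault_wrapping_token",
)

if not client.initialize():
    raise SystemExit("Vault bootstrap failed")

saml_config = client.read_secret("admin-app/saml")      # dict or None
db_config = client.read_secret("admin-app/database")    # {'username': ..., 'password': ...}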

Step 3: Integrating SAML with Dynamically Fetched Configuration

The next challenge was configuring the python3-saml library. It typically takes a static JSON configuration file. Our configuration, however, lives in Vault. The application must fetch it at runtime and construct the settings dictionary dynamically.
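
One wrinkle worth flagging before the code: python3-saml does not accept raw IdP metadata XML in its settings dictionary. It does, however, ship a metadata parser that converts the XML into the idp settings block, and that is what make_app below relies on. In isolation the conversion looks like this (the local file path is just a stand-in for the string fetched from Vault):

from onelogin.saml2.idp_metadata_parser import OneLogin_Saml2_IdPMetadataParser

# In the real service this string comes from Vault; a local file stands in here.
idp_metadata_xml = open("/path/to/idp-metadata.xml").read()

# Convert the metadata into the structure python3-saml expects for its 'idp'
# settings block (entityId, singleSignOnService URL, x509cert, ...).
idp_settings = OneLogin_Saml2_IdPMetadataParser.parse(idp_metadata_xml)["idp"]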

main.py orchestrates this.

# app/main.py

import asyncio
import os
import tornado.web
import tornado.httpserver
import logging
from urllib.parse import urlparse
from onelogin.saml2.auth import OneLogin_Saml2_Auth
from onelogin.saml2.idp_metadata_parser import OneLogin_Saml2_IdPMetadataParser

from vault_client import VaultClient

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Global state (in a real app, manage this better)
SAML_SETTINGS = None

def build_saml_request(req):
    """Builds a SAML request dictionary for python3-saml from a Tornado request."""
    # This is a helper to adapt Tornado's request object to the format expected by the SAML library
    http_host = req.headers.get('Host', 'localhost')
    server_port = urlparse(req.full_url()).port or (443 if req.protocol == 'https' else 80)

    return {
        'https': 'on' if req.protocol == 'https' else 'off',
        'http_host': http_host,
        'script_name': req.path,
        'server_port': str(server_port),
        'get_data': {k: v[0].decode('utf-8') for k, v in req.query_arguments.items()},
        'post_data': {k: v[0].decode('utf-8') for k, v in req.body_arguments.items()},
        'query_string': req.query
    }

class BaseHandler(tornado.web.RequestHandler):
    def get_current_user(self):
        return self.get_secure_cookie("user_email")

class MainHandler(BaseHandler):
    @tornado.web.authenticated
    def get(self):
        email = self.get_current_user().decode('utf-8')
        db_creds = self.application.settings.get('db_credentials')
        self.write(f"Hello, {email}. You are authenticated.<br>")
        self.write(f"I have fetched these DB creds from Vault: {db_creds}")

class SamlLoginHandler(BaseHandler):
    def get(self):
        req = build_saml_request(self.request)
        auth = OneLogin_Saml2_Auth(req, SAML_SETTINGS)
        # This redirects the user to the IdP for authentication
        self.redirect(auth.login())

class AcsHandler(BaseHandler):
    """Assertion Consumer Service (ACS) Handler."""
    async def post(self):
        req = build_saml_request(self.request)
        auth = OneLogin_Saml2_Auth(req, SAML_SETTINGS)
        
        auth.process_response()
        errors = auth.get_errors()

        if errors:
            logging.error(f"SAML ACS Error: {errors}, Reason: {auth.get_last_error_reason()}")
            self.set_status(401)
            self.write("SAML authentication failed.")
            return

        if not auth.is_authenticated():
            self.set_status(401)
            self.write("Not authenticated via SAML.")
            return

        # SAML authentication is successful.
        # 'nameId' is typically the user's email or username.
        user_email = auth.get_nameid()
        logging.info(f"User authenticated successfully: {user_email}")
        
        # In a real app, you would check attributes, provision a session, etc.
        self.set_secure_cookie("user_email", user_email)
        self.redirect("/")

class MetadataHandler(BaseHandler):
    def get(self):
        req = build_saml_request(self.request)
        auth = OneLogin_Saml2_Auth(req, SAML_SETTINGS)
        settings = auth.get_settings()
        metadata = settings.get_sp_metadata()
        errors = settings.validate_metadata(metadata)

        if len(errors) == 0:
            self.set_header('Content-Type', 'text/xml')
            self.write(metadata)
        else:
            self.set_status(500)
            self.write(', '.join(errors))

async def make_app(vault_client: VaultClient) -> tornado.web.Application:
    """
    Fetches configuration from Vault and constructs the Tornado application.
    """
    global SAML_SETTINGS
    
    # 1. Fetch SAML configuration from Vault
    saml_secrets = vault_client.read_secret("admin-app/saml")
    if not saml_secrets:
        raise RuntimeError("Failed to fetch SAML secrets from Vault.")
    
    # A pitfall here is ensuring the secrets are in the exact format the SAML library needs.
    # The private key and cert must not have extra whitespace or encoding issues.
    sp_private_key = saml_secrets['sp_private_key']
    sp_public_cert = saml_secrets['sp_public_cert']
    idp_metadata_xml = saml_secrets['idp_metadata_xml']
    
    # 2. Fetch other application secrets, e.g., database credentials
    db_credentials = vault_client.read_secret("admin-app/database")
    if not db_credentials:
        raise RuntimeError("Failed to fetch database credentials from Vault.")

    # 3. Dynamically construct the SAML settings dictionary.
    # python3-saml does not accept raw metadata XML in its settings, so we
    # parse the IdP metadata into the expected 'idp' settings block first.
    # In a real-world project, the entityId and ACS URL should also come from config.
    idp_settings = OneLogin_Saml2_IdPMetadataParser.parse(idp_metadata_xml)["idp"]

    SAML_SETTINGS = {
        "strict": True,
        "debug": True,
        "sp": {
            "entityId": "http://localhost:8888/saml/metadata/",
            "assertionConsumerService": {
                "url": "http://localhost:8888/saml/acs/",
                "binding": "urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
            },
            "x509cert": sp_public_cert,
            "privateKey": sp_private_key,
        },
        "idp": idp_settings
    }
    
    logging.info("SAML settings successfully constructed from Vault secrets.")

    return tornado.web.Application([
        (r"/", MainHandler),
        (r"/saml/login/", SamlLoginHandler),
        (r"/saml/acs/", AcsHandler),
        (r"/saml/metadata/", MetadataHandler),
    ], 
    cookie_secret="a_secret_key_that_should_also_come_from_vault",
    login_url="/saml/login/",
    # Pass fetched secrets to handlers
    db_credentials=db_credentials
    )

async def main():
    """Main entry point for the application."""
    vault_addr = os.getenv("VAULT_ADDR")
    role_id = os.getenv("VAULT_ROLE_ID")
    wrapping_token_path = "/run/secrets/vault_wrapping_token"

    if not all([vault_addr, role_id]):
        logging.critical("VAULT_ADDR and VAULT_ROLE_ID must be set.")
        exit(1)

    # --- Secure Bootstrap Sequence ---
    vault_client = VaultClient(vault_addr, role_id, wrapping_token_path)
    if not vault_client.initialize():
        logging.critical("Failed to initialize Vault client. Shutting down.")
        exit(1)
    
    # --- Application Initialization ---
    try:
        app = await make_app(vault_client)
        http_server = tornado.httpserver.HTTPServer(app)
        http_server.listen(8888)
        logging.info("Server started and listening on port 8888")
        # Keep this coroutine alive forever; the running asyncio event loop drives Tornado.
        await asyncio.Event().wait()
    except Exception as e:
        logging.critical(f"Failed to create or start the application: {e}")
        exit(1)

if __name__ == "__main__":
    asyncio.run(main())

Finally, the Dockerfile to package it all.

# app/Dockerfile
FROM python:3.11-slim

# python3-saml pulls in the xmlsec bindings, which generally need these
# system libraries (and a compiler) on a slim base image.
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential pkg-config libxml2-dev libxmlsec1-dev libxmlsec1-openssl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "main.py"]

The Full Picture: A Zero-Trust Flow

The complete, end-to-end process is now visible. We can represent it with a sequence diagram.

sequenceDiagram
    participant CI/CD
    participant Vault
    participant Docker Swarm
    participant TornadoApp

    CI/CD->>Vault: Generate & Wrap SecretID for 'admin-app'
    Vault-->>CI/CD: Wrapping Token
    CI/CD->>Docker Swarm: docker secret create vault_wrapping_token (content=token)
    CI/CD->>Docker Swarm: docker stack deploy (with RoleID)

    Docker Swarm->>TornadoApp: Start Container (mounts secret at /run/secrets/vault_wrapping_token)

    TornadoApp->>TornadoApp: Read Wrapping Token from file
    TornadoApp->>Vault: Unwrap SecretID using Wrapping Token
    Vault-->>TornadoApp: Plain SecretID
    TornadoApp->>Vault: AppRole Login(RoleID, SecretID)
    Vault-->>TornadoApp: Client Token

    TornadoApp->>Vault: Read secret/admin-app/saml (using Client Token)
    Vault-->>TornadoApp: SAML Config (SP Key, IdP Meta)
    TornadoApp->>Vault: Read secret/admin-app/database
    Vault-->>TornadoApp: DB Credentials

    TornadoApp->>TornadoApp: Configure SAML library & DB connections
    TornadoApp->>TornadoApp: Start Tornado HTTP Server on Port 8888

    participant User
    User->>TornadoApp: GET /
    TornadoApp-->>User: Redirect to /saml/login/
    User->>TornadoApp: GET /saml/login/
    TornadoApp-->>User: Redirect to IdP for authentication

    participant IdP
    User->>IdP: Authenticates (user/pass/MFA)
    IdP-->>User: POST SAML Assertion to /saml/acs/

    User->>TornadoApp: POST /saml/acs/ (with assertion)
    TornadoApp->>TornadoApp: Validate SAML Assertion
    TornadoApp-->>User: Set session cookie & Redirect to /
    User->>TornadoApp: GET / (now with valid session)
    TornadoApp-->>User: 200 OK (Authenticated Content)

This architecture successfully decouples the application from its secrets. The Docker image contains no credentials. The runtime configuration contains only a non-sensitive RoleID and a path to a one-time-use token. The service dynamically pulls everything it needs to operate from a central, audited, and secure source. This is a significant step up from traditional configuration management.

The primary limitation of this specific implementation is its reliance on the CI/CD system as the “trusted introducer.” The security of the entire bootstrap process hinges on the security of the pipeline runner and its Vault token. For higher security environments, one might explore platform-level attestations if the orchestrator supports it. Furthermore, the client token obtained by the application is relatively long-lived (1 hour in our config). For services handling extremely sensitive data, a more sophisticated approach would involve the application renewing its own token periodically, a feature Vault’s tokens support, and ensuring the application code can handle token expiry and re-authentication gracefully without a full restart.
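
For reference, a renewal loop along those lines can be quite small. The sketch below uses hvac's renew_self and Tornado's PeriodicCallback; the interval and increment are illustrative, and a production version would also detect when the token's max_ttl is exhausted and re-run the AppRole bootstrap.

# Illustrative token-renewal helper; not part of the service as deployed above.
import logging
import tornado.ioloop

def start_token_renewal(vault_client, interval_ms=45 * 60 * 1000):
    """Periodically renew the app's Vault token, comfortably inside the 1h token_ttl."""
    def _renew():
        try:
            vault_client.client.auth.token.renew_self(increment="1h")
            logging.info("Renewed Vault client token.")
        except Exception:
            logging.exception("Token renewal failed; a full re-bootstrap may be needed.")

    renewer = tornado.ioloop.PeriodicCallback(_renew, interval_ms)
    renewer.start()
    return renewer

# Called from main() once vault_client.initialize() has succeeded.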

