Implementing an SQS-based Certificate Signing Request Pipeline for mTLS Automation with Ansible


The operational burden of managing TLS certificate lifecycles for a large fleet of internal microservices became untenable. Our initial approach, involving long-lived certificates deployed via configuration management, introduced significant security risks and made rotation a high-stakes, manual event. Every rotation cycle was a coordinated, all-hands-on-deck affair prone to human error, resulting in service outages. We needed a system for zero-touch, automated mutual TLS (mTLS) certificate issuance and rotation that was secure, resilient, and decoupled from our core deployment pipeline.

Our first thought was to build a centralized service that Ansible could call to issue certificates. However, this required granting broad network access from our Ansible control node into various secure network segments, a configuration our security team was rightfully hesitant to approve. Furthermore, a synchronous request-response model made the entire process fragile; any transient network issue or temporary unavailability of the Certificate Authority (CA) service would fail a deployment. This led to the core architectural decision: leverage an asynchronous, message-driven pipeline.

We selected AWS SQS as the backbone for this pipeline. It offered a durable, highly available, and decoupled communication channel. Ansible remains the orchestrator for triggering actions on the nodes, but the critical certificate signing request (CSR) and signed certificate data flow through SQS. This design decouples the certificate consumer (the microservice node) from the certificate producer (the CA), allowing each to operate independently. If the CA is down for maintenance, CSRs simply queue up in SQS and are processed when it returns, without failing the client-side automation.
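For CSRs to survive a CA outage, the queue has to retain them long enough. SQS keeps messages for four days by default and for at most fourteen. The sketch below builds an attribute set for the pending-CSR queue under a few assumptions: the helper name, the default values, and the dead-letter queue ARN are illustrative and not part of the original setup. A dead-letter queue is a second queue that receives messages which repeatedly fail processing.

```python
import json


def queue_attributes(dlq_arn, max_receive_count=5,
                     retention_days=14, visibility_timeout_s=120):
    """Build an Attributes dict for SQS create_queue / set_queue_attributes."""
    if not 1 <= retention_days <= 14:
        raise ValueError("SQS retains messages for at most 14 days")
    return {
        # SQS expects every attribute value to be a string.
        "MessageRetentionPeriod": str(retention_days * 24 * 3600),
        # Should comfortably exceed the time needed to sign and publish.
        "VisibilityTimeout": str(visibility_timeout_s),
        # Enable long polling by default for consumers of this queue.
        "ReceiveMessageWaitTimeSeconds": "10",
        # After max_receive_count failed receives, SQS moves the message
        # to the dead-letter queue instead of redelivering it forever.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


# Hypothetical ARN, for illustration only.
attrs = queue_attributes("arn:aws:sqs:us-east-1:123456789012:csr-pending-dlq")
print(attrs["MessageRetentionPeriod"])
```

The resulting dictionary could be passed as `Attributes=attrs` to boto3's `create_queue` or `set_queue_attributes`.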

The architecture is composed of three main components:

  1. Client Node Role: An Ansible role deployed to each service instance responsible for generating a private key and CSR, then publishing the CSR to a designated SQS queue.
  2. CA Worker: A secure, isolated process (orchestrated by Ansible) that polls the SQS queue for pending CSRs, validates and signs them, and publishes the signed certificate to a second SQS queue.
  3. Certificate Retrieval: A final step in the client node role where it polls the second queue for its specific signed certificate, installs it, and reloads the service.

The end-to-end flow as a Mermaid sequence diagram:

sequenceDiagram
    participant Ansible Control Node
    participant Service Node
    participant SQS (csr-pending-queue)
    participant CA Worker
    participant SQS (certs-signed-queue)

    Ansible Control Node->>Service Node: 1. Run 'mtls-client' role
    Service Node->>Service Node: 2. Generate private key & CSR
    Service Node->>SQS (csr-pending-queue): 3. Publish CSR message with instance_id attribute
    
    loop Poll for CSRs
        CA Worker->>SQS (csr-pending-queue): 4. Receive message
        opt Message received
            CA Worker->>CA Worker: 5. Validate CSR
            CA Worker->>CA Worker: 6. Sign CSR -> certificate
            CA Worker->>SQS (certs-signed-queue): 7. Publish certificate with instance_id attribute
            CA Worker->>SQS (csr-pending-queue): 8. Delete CSR message
        end
    end

    Ansible Control Node->>Service Node: 9. Trigger certificate retrieval task
    Service Node->>SQS (certs-signed-queue): 10. Poll for message with its instance_id
    SQS (certs-signed-queue)-->>Service Node: 11. Return signed certificate message
    Service Node->>Service Node: 12. Install certificate & key
    Service Node->>Service Node: 13. Reload service
    Service Node->>SQS (certs-signed-queue): 14. Delete certificate message

This asynchronous flow forms the basis of a resilient PKI automation system. The following sections detail the Ansible roles, Python scripts, and configurations required to build it. A common mistake in such systems is inadequate security for the CA itself. For this implementation, we will use a simple OpenSSL-based CA managed by Ansible, but in a real-world project the signing key should live in a hardware security module (HSM), or the whole component should be replaced by a managed CA such as HashiCorp Vault's PKI secrets engine or AWS Private CA (formerly ACM Private CA).
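Both queues carry small JSON documents, so it can help to pin the message shapes down before looking at the roles. The following is a minimal sketch of the two payloads using the field names that appear in the scripts later in this article (`instance_id`, `csr_pem`, `certificate_pem`, `ca_chain_pem`). The class and function names are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CsrRequest:
    """Body of a message on the pending-CSR queue."""
    instance_id: str
    csr_pem: str


@dataclass(frozen=True)
class SignedCert:
    """Body of a message on the signed-certificates queue."""
    instance_id: str
    certificate_pem: str
    ca_chain_pem: str


def to_send_kwargs(queue_url, payload):
    """Turn a payload into keyword arguments for SQS send_message."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(asdict(payload)),
        # The instance ID is duplicated as an attribute so consumers can
        # match messages without parsing the body.
        "MessageAttributes": {
            "InstanceId": {
                "DataType": "String",
                "StringValue": payload.instance_id,
            }
        },
    }


def parse_body(cls, body):
    """Parse a message body; raises TypeError on missing or extra fields."""
    return cls(**json.loads(body))
```

Parsing into a fixed shape means a malformed or unexpected message fails immediately at the boundary instead of surfacing later as a `KeyError` in the signing code.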

Building the Certificate Authority with Ansible

The foundation of any mTLS system is a trusted CA. For our automated system, we’ll establish a two-tier hierarchy: a long-lived Root CA and an online Intermediate CA used by the worker to sign service certificates. This is a standard security practice that limits the exposure of the root key. For this walkthrough both CAs are generated on the same host; in production, the root key should be created on an offline machine and never leave it.

The Ansible role ca-setup is responsible for creating this structure on a secured host.

roles/ca-setup/tasks/main.yml

---
- name: Ensure base directories exist
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: '0700'
  loop:
    - "/etc/pki/CA"
    - "/etc/pki/CA/root"
    - "/etc/pki/CA/root/certs"
    - "/etc/pki/CA/root/private"
    - "/etc/pki/CA/intermediate"
    - "/etc/pki/CA/intermediate/csr"
    - "/etc/pki/CA/intermediate/certs"
    - "/etc/pki/CA/intermediate/newcerts"
    - "/etc/pki/CA/intermediate/private"

- name: Create required files for intermediate CA database
  ansible.builtin.file:
    path: "{{ item }}"
    state: touch
    owner: root
    group: root
    mode: '0600'
  loop:
    - "/etc/pki/CA/intermediate/index.txt"
  changed_when: false

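- name: Initialize intermediate CA serial file
  ansible.builtin.copy:
    content: "1000\n"
    dest: "/etc/pki/CA/intermediate/serial"
    owner: root
    group: root
    mode: '0600'
    # Seed the file only once; overwriting it on every run would reset
    # the counter and lead to duplicate serial numbers.
    force: false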

- name: Copy OpenSSL configurations
  ansible.builtin.template:
    src: "{{ item.src }}"
    dest: "{{ item.dest }}"
    owner: root
    group: root
    mode: '0644'
  loop:
    - { src: 'root-openssl.cnf.j2', dest: '/etc/pki/CA/root/openssl.cnf' }
    - { src: 'intermediate-openssl.cnf.j2', dest: '/etc/pki/CA/intermediate/openssl.cnf' }

- name: Generate Root CA private key
  community.crypto.openssl_privatekey:
    path: /etc/pki/CA/root/private/ca.key.pem
    size: 4096
    type: rsa
    mode: '0400'

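- name: Generate Root CA CSR (carries the CA extensions)
  community.crypto.openssl_csr_pipe:
    privatekey_path: /etc/pki/CA/root/private/ca.key.pem
    common_name: "My Corp Root CA"
    organization_name: "My Corporation"
    country_name: "US"
    # The selfsigned provider copies extensions from the CSR, so the
    # CA flag and key usage have to be requested here.
    basic_constraints:
      - "CA:TRUE"
    basic_constraints_critical: true
    key_usage:
      - keyCertSign
      - cRLSign
    key_usage_critical: true
  register: root_csr

- name: Generate Root CA self-signed certificate
  community.crypto.x509_certificate:
    path: /etc/pki/CA/root/certs/ca.cert.pem
    csr_content: "{{ root_csr.csr }}"
    privatekey_path: /etc/pki/CA/root/private/ca.key.pem
    provider: selfsigned
    selfsigned_not_after: "+3650d" # 10 years
    mode: '0444'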

- name: Generate Intermediate CA private key
  community.crypto.openssl_privatekey:
    path: /etc/pki/CA/intermediate/private/intermediate.key.pem
    size: 4096
    type: rsa
    mode: '0400'

- name: Generate Intermediate CA CSR
  community.crypto.openssl_csr_pipe:
    privatekey_path: /etc/pki/CA/intermediate/private/intermediate.key.pem
    common_name: "My Corp Intermediate CA"
    organization_name: "My Corporation"
    country_name: "US"
    # The ownca provider takes extensions from the CSR. pathlen:0 stops
    # this CA from issuing further CA certificates.
    basic_constraints:
      - "CA:TRUE"
      - "pathlen:0"
    basic_constraints_critical: true
    key_usage:
      - keyCertSign
      - cRLSign
    key_usage_critical: true
    extended_key_usage:
      - serverAuth
      - clientAuth
  register: intermediate_csr

- name: Sign Intermediate CA certificate with Root CA
  community.crypto.x509_certificate:
    path: /etc/pki/CA/intermediate/certs/intermediate.cert.pem
    csr_content: "{{ intermediate_csr.csr }}"
    provider: ownca
    ownca_path: /etc/pki/CA/root/certs/ca.cert.pem
    ownca_privatekey_path: /etc/pki/CA/root/private/ca.key.pem
    ownca_not_after: "+1825d" # 5 years
    mode: '0444'

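- name: Create CA chain file (intermediate first, then root)
  ansible.builtin.shell: >
    cat /etc/pki/CA/intermediate/certs/intermediate.cert.pem
    /etc/pki/CA/root/certs/ca.cert.pem
    > /etc/pki/CA/intermediate/certs/ca-chain.cert.pem
    && chmod 0444 /etc/pki/CA/intermediate/certs/ca-chain.cert.pem
  args:
    creates: /etc/pki/CA/intermediate/certs/ca-chain.cert.pem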

The critical part of this role is the openssl.cnf templates, which define the policies and constraints for the CAs. A pitfall here is using a generic configuration that doesn’t properly constrain the intermediate CA.

roles/ca-setup/templates/intermediate-openssl.cnf.j2 (Excerpt)

[ ca ]
default_ca = CA_default

[ CA_default ]
dir               = /etc/pki/CA/intermediate
certs             = $dir/certs
crl_dir           = $dir/crl
database          = $dir/index.txt
new_certs_dir     = $dir/newcerts
certificate       = $dir/certs/intermediate.cert.pem
serial            = $dir/serial
private_key       = $dir/private/intermediate.key.pem

# Policy for signing server/client certs
policy            = policy_strict

[ policy_strict ]
countryName             = match
stateOrProvinceName     = optional
organizationName        = match
organizationalUnitName  = optional
commonName              = supplied

[ server_cert ]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth

[ client_cert ]
basicConstraints = CA:FALSE
nsCertType = client
nsComment = "OpenSSL Generated Client Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth

This configuration ensures the intermediate CA can only issue end-entity certificates (not other CAs) and defines specific profiles for server and client certificates.
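The profiles are selected by section name at signing time, through OpenSSL's `-extfile` and `-extensions` options. A minimal sketch of how a signer might map a requested profile to one of the two sections above while refusing anything else; the function name and the `profile` values are assumptions for illustration.

```python
# Only the two profiles defined in the excerpt are allowed.
EXTENSION_SECTIONS = {"server": "server_cert", "client": "client_cert"}


def build_sign_command(profile, ca_config, ca_key, ca_cert, days=30):
    """Build the `openssl x509 -req` argument list for an allowlisted profile."""
    try:
        section = EXTENSION_SECTIONS[profile]
    except KeyError:
        raise ValueError(f"unknown certificate profile: {profile!r}") from None
    return [
        "openssl", "x509", "-req",
        "-in", "/dev/stdin", "-out", "/dev/stdout",
        "-CA", ca_cert, "-CAkey", ca_key, "-CAcreateserial",
        "-days", str(days), "-sha256",
        "-extfile", ca_config, "-extensions", section,
    ]
```

Note that each excerpted profile grants a single extended key usage. A service that both accepts and initiates mTLS connections with the same certificate may need a profile that lists both `serverAuth` and `clientAuth`.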

The Client-Side Role: CSR Generation and Submission

The mtls-client role runs on each service node. Its first job is to generate a key and a CSR, then dispatch it to SQS. It uses a helper Python script for the AWS interaction, which is more robust and testable than shelling out to the AWS CLI.

roles/mtls-client/tasks/request_cert.yml

---
- name: Ensure certificate directory exists
  ansible.builtin.file:
    path: /etc/my-service/pki
    state: directory
    owner: my-service-user
    group: my-service-user
    mode: '0750'

- name: Generate private key for the service
  community.crypto.openssl_privatekey:
    path: /etc/my-service/pki/service.key
    size: 2048
    type: rsa
    mode: '0600'
    owner: my-service-user
    group: my-service-user
  register: pkey

# openssl_privatekey is idempotent: an existing, valid key at this path is kept
# unless force: true is set. That is convenient for re-runs, but for rotation
# you would normally generate a fresh key together with each new certificate.

- name: Generate CSR for the service
  community.crypto.openssl_csr_pipe:
    privatekey_path: /etc/my-service/pki/service.key
    # The CN should be specific and verifiable, e.g., instance ID or FQDN.
    # ec2_instance_id is only available after amazon.aws.ec2_metadata_facts has run.
    common_name: "{{ ansible_facts.ec2_instance_id }}"
    subject_alt_name:
      - "DNS:{{ ansible_facts.fqdn }}"
      - "IP:{{ ansible_facts.default_ipv4.address }}"
  register: csr

- name: Install Boto3 for SQS interaction
  ansible.builtin.pip:
    name: boto3
    state: present

- name: Copy CSR submission script to node
  ansible.builtin.template:
    src: submit_csr.py.j2
    dest: /usr/local/bin/submit_csr.py
    mode: '0755'

- name: Submit CSR to SQS queue
  ansible.builtin.command: >
    /usr/local/bin/submit_csr.py
    --queue-url "{{ csr_pending_queue_url }}"
    --region "{{ aws_region }}"
    --instance-id "{{ ansible_facts.ec2_instance_id }}"
  args:
    stdin: "{{ csr.csr }}"
  register: submission_result
  changed_when: submission_result.rc == 0

- name: Log submission result
  ansible.builtin.debug:
    var: submission_result.stdout

The Python script submit_csr.py handles the logic of sending the CSR to SQS. Crucially, it attaches the instance ID as a message attribute. This attribute is essential for the retrieval step, allowing the client to find its specific certificate later without having to parse message bodies.

roles/mtls-client/templates/submit_csr.py.j2

#!/usr/bin/env python3
import argparse
import sys
import json
import logging
import boto3
from botocore.exceptions import ClientError

# Set up basic logging for traceability in system logs
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def submit_csr_to_sqs(queue_url, region, instance_id, csr_content):
    """
    Submits a CSR to a specified SQS queue with instance_id as a message attribute.

    Args:
        queue_url (str): The URL of the SQS queue.
        region (str): The AWS region of the queue.
        instance_id (str): The EC2 instance ID to use as a message attribute.
        csr_content (str): The PEM-encoded CSR string.

    Returns:
        str: The MessageId of the sent message, or None on failure.
    """
    sqs_client = boto3.client('sqs', region_name=region)
    message_body = json.dumps({
        'instance_id': instance_id,
        'csr_pem': csr_content
    })

    try:
        response = sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=message_body,
            MessageAttributes={
                'InstanceId': {
                    'DataType': 'String',
                    'StringValue': instance_id
                }
            }
        )
        logging.info(f"Successfully sent CSR for instance {instance_id}. MessageId: {response['MessageId']}")
        return response['MessageId']
    except ClientError as e:
        logging.error(f"Failed to send message to SQS queue {queue_url}: {e}")
        return None

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Submit a CSR from stdin to an AWS SQS queue.")
    parser.add_argument('--queue-url', required=True, help='The URL of the target SQS queue.')
    parser.add_argument('--region', required=True, help='The AWS region.')
    parser.add_argument('--instance-id', required=True, help='The EC2 instance ID for message attribute.')
    args = parser.parse_args()

    # Read CSR from standard input, which is piped from Ansible
    csr_pem = sys.stdin.read()
    if not csr_pem:
        logging.error("No CSR content received from stdin.")
        sys.exit(1)

    message_id = submit_csr_to_sqs(args.queue_url, args.region, args.instance_id, csr_pem)

    if message_id:
        print(json.dumps({"status": "success", "message_id": message_id}))
        sys.exit(0)
    else:
        print(json.dumps({"status": "error", "message": "Failed to submit CSR to SQS."}))
        sys.exit(1)

The CA Worker: Processing Signing Requests

The CA worker is the heart of the system. It’s a long-running process that polls the csr-pending-queue. For this example, we’ll create an Ansible playbook that runs a Python script in a loop. In a real-world project, this would be a systemd service or a container running in ECS/EKS.

The worker script must be idempotent. It fetches a message, processes it, and only if signing and publishing the certificate is successful does it delete the original message from the CSR queue. SQS’s visibility timeout feature prevents other workers from picking up the same message while it’s being processed. If the worker crashes, the message becomes visible again after the timeout and can be re-processed.
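Because a message can be delivered more than once (for example when the worker crashes between publishing the certificate and deleting the CSR), the delete-after-success rule is usually paired with some form of de-duplication. The sketch below illustrates the ordering with injected callables; `sign`, `publish`, and `delete` are hypothetical stand-ins for the OpenSSL call and the SQS `send_message` / `delete_message` calls, and the in-memory set would need to be durable storage in a real worker.

```python
import hashlib


def handle_csr_message(message, sign, publish, delete, already_issued):
    """Process one CSR message; delete it only after the cert is published.

    Returns True if the message was fully handled and deleted, False if it
    was left on the queue to reappear after the visibility timeout.
    """
    # Fingerprint of the request body, used as an idempotency key.
    fingerprint = hashlib.sha256(message["Body"].encode("utf-8")).hexdigest()

    if fingerprint not in already_issued:
        cert = sign(message["Body"])
        if cert is None:
            # Signing failed: do not delete, SQS will redeliver later.
            return False
        publish(cert)
        already_issued.add(fingerprint)

    # Reached on first success and on redelivery of an already-signed CSR.
    delete(message["ReceiptHandle"])
    return True
```

On a redelivery the function skips signing and only deletes the message, so a duplicate delivery does not produce a second certificate for the same request.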

ca_worker_playbook.yml

---
- hosts: ca_worker
  become: yes
  vars:
    csr_pending_queue_url: "your-csr-pending-queue-url"
    certs_signed_queue_url: "your-certs-signed-queue-url"
    aws_region: "us-east-1"
  
  tasks:
    - name: Install dependencies
      ansible.builtin.pip:
        name:
          - boto3
          - cryptography
        state: present

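    - name: Ensure CA worker directory exists
      ansible.builtin.file:
        path: /opt/ca-worker
        state: directory
        owner: root
        group: root
        mode: '0700'

    - name: Copy CA worker script
      ansible.builtin.template:
        src: ca_worker.py.j2
        dest: /opt/ca-worker/ca_worker.py
        mode: '0755'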

    - name: Run CA worker script in a loop
      ansible.builtin.shell: |
        while true; do
          /opt/ca-worker/ca_worker.py \
            --csr-queue-url "{{ csr_pending_queue_url }}" \
            --cert-queue-url "{{ certs_signed_queue_url }}" \
            --region "{{ aws_region }}" \
            --ca-config /etc/pki/CA/intermediate/openssl.cnf \
            --ca-key /etc/pki/CA/intermediate/private/intermediate.key.pem \
            --ca-cert /etc/pki/CA/intermediate/certs/intermediate.cert.pem
          sleep 5
        done
      async: 3600
      poll: 0

The worker script itself is more complex, handling message parsing, CSR validation, OpenSSL command execution, and publishing the result.

templates/ca_worker.py.j2 (Core Logic)

# ... imports and arg parsing ...

def sign_csr(csr_pem, ca_config, ca_key, ca_cert, duration_days=30):
    """Signs a CSR using openssl command line tool."""
    # In a real system, use a cryptography library instead of subprocess
    # for better security and error handling. This is for demonstration.
    
    # A critical step: validate the CSR here. Check the CN against an
    # inventory database, verify domain ownership, etc.
    # For now, we trust the input, which is a significant security risk.
    logging.info("Validating CSR (stubbed)...")

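    # Note: `openssl x509 -req` applies only the -extfile section below and
    # drops extensions requested in the CSR, including subjectAltName.
    # OpenSSL 3.0+ adds -copy_extensions to carry them over.
    cmd = [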
        'openssl', 'x509', '-req',
        '-in', '/dev/stdin',
        '-CA', ca_cert,
        '-CAkey', ca_key,
        '-CAcreateserial',
        '-out', '/dev/stdout',
        '-days', str(duration_days),
        '-sha256',
        '-extfile', ca_config,
        '-extensions', 'server_cert' # or client_cert based on request
    ]
    process = subprocess.run(cmd, input=csr_pem.encode('utf-8'), capture_output=True, check=False)
    
    if process.returncode != 0:
        logging.error(f"OpenSSL failed to sign CSR: {process.stderr.decode('utf-8')}")
        return None
    
    logging.info("CSR signed successfully.")
    return process.stdout.decode('utf-8')

def process_messages(sqs_client, csr_queue_url, cert_queue_url, ca_config, ca_key, ca_cert):
    response = sqs_client.receive_message(
        QueueUrl=csr_queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=10, # Use long polling
        MessageAttributeNames=['All']
    )
    
    if 'Messages' not in response:
        return

    for message in response['Messages']:
        receipt_handle = message['ReceiptHandle']
        try:
            body = json.loads(message['Body'])
            csr_pem = body['csr_pem']
            instance_id = body.get('instance_id', 'unknown') # Fallback
            logging.info(f"Processing CSR for instance: {instance_id}")

            signed_cert_pem = sign_csr(csr_pem, ca_config, ca_key, ca_cert)

            if signed_cert_pem:
                # Get the full CA chain
                with open('/etc/pki/CA/intermediate/certs/ca-chain.cert.pem', 'r') as f:
                    ca_chain = f.read()
                
                # Publish to the signed certs queue
                cert_body = json.dumps({
                    'instance_id': instance_id,
                    'certificate_pem': signed_cert_pem,
                    'ca_chain_pem': ca_chain
                })
                
                sqs_client.send_message(
                    QueueUrl=cert_queue_url,
                    MessageBody=cert_body,
                    MessageAttributes={
                        'InstanceId': {
                            'DataType': 'String',
                            'StringValue': instance_id
                        }
                    }
                )
                logging.info(f"Published signed certificate for {instance_id}")
                
                # Only delete the message after successful processing
                sqs_client.delete_message(QueueUrl=csr_queue_url, ReceiptHandle=receipt_handle)
                logging.info(f"Deleted CSR message for {instance_id}")
            else:
                logging.error(f"Failed to sign CSR for {instance_id}. Message will become visible again.")
                # The pitfall here is not having a Dead-Letter Queue (DLQ). A repeatedly failing
                # message will poison the queue. A DLQ is essential for production.

        except Exception as e:
            logging.exception(f"Unhandled error processing message. It will be retried. Error: {e}")

# ... main loop calling process_messages ...
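The validation step is stubbed in the worker above. As a starting point, a few cheap checks can run before any signing is attempted. The sketch below is an assumption-laden illustration, not the full validation the stub calls for: it checks only the message envelope, it assumes EC2 instance IDs of the form `i-` followed by 8 or 17 hex digits, and `known_instances` stands in for whatever inventory source is available. It does not parse the CSR, so checking the CN and SANs would still require an X.509 parser, and the message itself does not prove who sent it.

```python
import re

# Assumed EC2 instance ID formats: legacy 8-digit and current 17-digit.
INSTANCE_ID_RE = re.compile(r"^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$")
CSR_BEGIN = "-----BEGIN CERTIFICATE REQUEST-----"
CSR_END = "-----END CERTIFICATE REQUEST-----"


def precheck_request(body, attribute_instance_id, known_instances):
    """Return a list of reasons to reject the request (empty means accept)."""
    problems = []
    instance_id = body.get("instance_id")

    if not instance_id or not INSTANCE_ID_RE.match(instance_id):
        problems.append("malformed instance_id")
    elif instance_id not in known_instances:
        problems.append("instance_id not in inventory")

    # The body and the InstanceId message attribute must agree.
    if instance_id != attribute_instance_id:
        problems.append("body and message attribute disagree")

    csr = body.get("csr_pem", "").strip()
    if not (csr.startswith(CSR_BEGIN) and csr.endswith(CSR_END)):
        problems.append("csr_pem is not a PEM certificate request")

    return problems
```

A request that fails these checks will never succeed on retry, so it is a candidate for deletion or a dead-letter queue rather than redelivery.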

Client-Side Role: Certificate Retrieval and Installation

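The final piece is retrieving the signed certificate. Another task in the mtls-client role handles this. It uses a script to poll the certs-signed-queue and keep only the message whose InstanceId attribute matches its own instance ID. SQS cannot filter on message attributes at receive time, so this filtering happens on the client side after a batch of messages has been received. That is simple but not efficient: every node may receive other nodes' certificates, and a received message stays hidden from its real owner until its visibility timeout expires or the receiver releases it. For a large fleet, this is a trade-off worth revisiting.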

roles/mtls-client/tasks/retrieve_cert.yml

---
- name: Copy certificate retrieval script to node
  ansible.builtin.template:
    src: retrieve_cert.py.j2
    dest: /usr/local/bin/retrieve_cert.py
    mode: '0755'

- name: Retrieve and install certificate from SQS
  ansible.builtin.command: >
    /usr/local/bin/retrieve_cert.py
    --queue-url "{{ certs_signed_queue_url }}"
    --region "{{ aws_region }}"
    --instance-id "{{ ansible_facts.ec2_instance_id }}"
    --cert-path /etc/my-service/pki/service.pem
    --chain-path /etc/my-service/pki/ca-chain.pem
    --timeout 300
  register: retrieval_result
  changed_when: "'Certificate installed' in retrieval_result.stdout"
  notify: restart my-service

- name: Set correct permissions on installed certificates
  ansible.builtin.file:
    path: "{{ item }}"
    owner: my-service-user
    group: my-service-user
    mode: '0640'
  loop:
    - /etc/my-service/pki/service.pem
    - /etc/my-service/pki/ca-chain.pem

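The Ansible handler restart my-service is notified only when the retrieval task reports a change, which here requires the script to print Certificate installed to stdout after a successful install. Handlers run at the end of the play, so the service restarts after the permissions task has also completed and picks up the new credentials.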

The retrieval script polls the queue, checks message attributes, and if a match is found, writes the certificate files and deletes the message.

roles/mtls-client/templates/retrieve_cert.py.j2 (Core Logic)

# ... imports and arg parsing ...

def retrieve_certificate(queue_url, region, instance_id, cert_path, chain_path, timeout):
    sqs_client = boto3.client('sqs', region_name=region)
    start_time = time.time()
    
    while time.time() - start_time < timeout:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=5,
            MessageAttributeNames=['InstanceId']
        )
        
        if 'Messages' not in response:
            logging.info("No messages in queue, polling again.")
            continue
            
        for message in response['Messages']:
            msg_attrs = message.get('MessageAttributes', {})
            msg_instance_id = msg_attrs.get('InstanceId', {}).get('StringValue')
            
            if msg_instance_id == instance_id:
                logging.info(f"Found matching certificate message for instance {instance_id}")
                body = json.loads(message['Body'])
                
                with open(cert_path, 'w') as f:
                    f.write(body['certificate_pem'])
                
                with open(chain_path, 'w') as f:
                    f.write(body['ca_chain_pem'])

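                logging.info("Certificate and chain written to disk.")
                
                # Acknowledge and delete message
                sqs_client.delete_message(
                    QueueUrl=queue_url,
                    ReceiptHandle=message['ReceiptHandle']
                )
                return True

            # Not ours: make the message visible again immediately so its
            # owner does not have to wait out the visibility timeout.
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message['ReceiptHandle'],
                VisibilityTimeout=0
            )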
    
    logging.error(f"Timed out after {timeout} seconds waiting for certificate.")
    return False

# ... main function calling retrieve_certificate ...
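The retrieval script writes the certificate directly to its final path, so a service that reloads at the wrong moment could read a half-written file. One common remedy is to write to a temporary file in the same directory and rename it into place. This is a minimal sketch of that technique; the function name and the default mode are assumptions, and ownership is left to the Ansible permissions task.

```python
import os
import tempfile


def write_atomically(path, content, mode=0o640):
    """Write content to path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, mode)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

In the retrieval script, the two `open(..., 'w')` blocks could be swapped for calls such as `write_atomically(cert_path, body['certificate_pem'])`.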

This SQS-based architecture successfully decouples certificate lifecycle management from the main deployment flow. It’s more resilient to transient failures and adheres to better security practices by avoiding direct network access to sensitive infrastructure. However, the system presented here is a foundation, not a final product. The OpenSSL-based CA is its weakest link from a security and manageability perspective; replacing it with a solution like HashiCorp Vault, which provides API-driven certificate management, robust auditing, and secure key storage, would be the necessary next step for a production environment. Furthermore, this design does not address certificate revocation. An effective PKI must include a mechanism for revoking compromised certificates, which could be implemented using another SQS queue to distribute a CRL or by pointing services to an OCSP responder.

