The operational burden of managing TLS certificate lifecycles for a large fleet of internal microservices became untenable. Our initial approach, involving long-lived certificates deployed via configuration management, introduced significant security risks and made rotation a high-stakes, manual event. Every rotation cycle was a coordinated, all-hands-on-deck affair prone to human error, resulting in service outages. We needed a system for zero-touch, automated mutual TLS (mTLS) certificate issuance and rotation that was secure, resilient, and decoupled from our core deployment pipeline.
Our first thought was to build a centralized service that Ansible could call to issue certificates. However, this required granting broad network access from our Ansible control node into various secure network segments, a configuration our security team was rightfully hesitant to approve. Furthermore, a synchronous request-response model made the entire process fragile; any transient network issue or temporary unavailability of the Certificate Authority (CA) service would fail a deployment. This led to the core architectural decision: leverage an asynchronous, message-driven pipeline.
We selected AWS SQS as the backbone for this pipeline. It offered a durable, highly available, and decoupled communication channel. Ansible remains the orchestrator for triggering actions on the nodes, but the critical certificate signing request (CSR) and signed certificate data flow through SQS. This design decouples the certificate consumer (the microservice node) from the certificate producer (the CA), allowing each to operate independently. If the CA is down for maintenance, CSRs simply queue up in SQS and are processed when it returns, without failing the client-side automation.
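The two queues themselves are provisioned outside the Ansible roles shown below. As a sketch of what that might look like with boto3 (queue names, timeouts, and the retention period are assumptions, not values from the roles; the dead-letter queue anticipates a pitfall discussed alongside the worker):

```python
import json

def queue_attributes(visibility_timeout, dlq_arn=None, max_receive=5):
    """Build an SQS attribute map for one pipeline queue.

    The visibility timeout should comfortably exceed worst-case processing
    time; CSR signing is fast, so a couple of minutes is generous.
    """
    attrs = {
        "VisibilityTimeout": str(visibility_timeout),
        # Keep unclaimed messages long enough to survive a multi-day CA outage.
        "MessageRetentionPeriod": str(4 * 24 * 3600),
    }
    if dlq_arn:
        # Route messages that repeatedly fail processing to a dead-letter
        # queue so one poison CSR cannot clog the pipeline.
        attrs["RedrivePolicy"] = json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive}
        )
    return attrs

if __name__ == "__main__":
    import boto3  # only needed for the actual API calls

    sqs = boto3.client("sqs", region_name="us-east-1")
    dlq_url = sqs.create_queue(QueueName="csr-pending-dlq")["QueueUrl"]
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    sqs.create_queue(QueueName="csr-pending-queue",
                     Attributes=queue_attributes(120, dlq_arn))
    sqs.create_queue(QueueName="certs-signed-queue",
                     Attributes=queue_attributes(30))
```

In practice this belongs in Terraform or CloudFormation; the point is that the queue attributes, not the roles, carry the delivery-semantics decisions.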
The architecture is composed of three main components:
- Client Node Role: An Ansible role deployed to each service instance responsible for generating a private key and CSR, then publishing the CSR to a designated SQS queue.
- CA Worker: A secure, isolated process (orchestrated by Ansible) that polls the SQS queue for pending CSRs, validates and signs them, and publishes the signed certificate to a second SQS queue.
- Certificate Retrieval: A final step in the client node role where it polls the second queue for its specific signed certificate, installs it, and reloads the service.
sequenceDiagram
    participant ACN as Ansible Control Node
    participant SN as Service Node
    participant CSRQ as SQS (csr-pending-queue)
    participant CA as CA Worker
    participant CERTQ as SQS (certs-signed-queue)
    ACN->>SN: 1. Run 'mtls-client' role
    SN->>SN: 2. Generate private key & CSR
    SN->>CSRQ: 3. Publish CSR message with instance_id attribute
    loop Poll for CSRs
        CA->>CSRQ: 4. Receive message
        opt Message received
            CA->>CA: 5. Validate CSR
            CA->>CA: 6. Sign CSR -> certificate
            CA->>CERTQ: 7. Publish certificate with instance_id attribute
            CA->>CSRQ: 8. Delete CSR message
        end
    end
    ACN->>SN: 9. Trigger certificate retrieval task
    SN->>CERTQ: 10. Poll for message with its instance_id
    CERTQ-->>SN: 11. Return signed certificate message
    SN->>SN: 12. Install certificate & key
    SN->>SN: 13. Reload service
    SN->>CERTQ: 14. Delete certificate message
This asynchronous flow forms the basis of a resilient PKI automation system. The following sections detail the Ansible roles, Python scripts, and configurations required to build it. A common mistake in such systems is inadequate security for the CA itself; for this implementation, we will use a simple OpenSSL-based CA managed by Ansible, but in a real-world project, this component must be replaced by a hardware security module (HSM) or a dedicated secrets management service like HashiCorp Vault or AWS Certificate Manager Private CA.
Building the Certificate Authority with Ansible
The foundation of any mTLS system is a trusted CA. For our automated system, we’ll establish a two-tier hierarchy: a long-lived, offline Root CA and an online Intermediate CA used by the worker to sign service certificates. This is a standard security practice that limits the exposure of the root key.
The Ansible role ca-setup is responsible for creating this structure on a secured host.
roles/ca-setup/tasks/main.yml
---
- name: Ensure base directories exist
ansible.builtin.file:
path: "{{ item }}"
state: directory
owner: root
group: root
mode: '0700'
loop:
- "/etc/pki/CA"
- "/etc/pki/CA/root"
- "/etc/pki/CA/root/private"
- "/etc/pki/CA/root/certs"
- "/etc/pki/CA/intermediate"
- "/etc/pki/CA/intermediate/csr"
- "/etc/pki/CA/intermediate/certs"
- "/etc/pki/CA/intermediate/newcerts"
- "/etc/pki/CA/intermediate/private"
- name: Create required files for intermediate CA database
ansible.builtin.file:
path: "{{ item }}"
state: touch
owner: root
group: root
mode: '0600'
loop:
- "/etc/pki/CA/intermediate/index.txt"
changed_when: false
- name: Initialize intermediate CA serial file
ansible.builtin.copy:
content: "1000"
dest: "/etc/pki/CA/intermediate/serial"
owner: root
group: root
mode: '0600'
force: false # never overwrite: OpenSSL increments this file with every signature
- name: Copy OpenSSL configurations
ansible.builtin.template:
src: "{{ item.src }}"
dest: "{{ item.dest }}"
owner: root
group: root
mode: '0644'
loop:
- { src: 'root-openssl.cnf.j2', dest: '/etc/pki/CA/root/openssl.cnf' }
- { src: 'intermediate-openssl.cnf.j2', dest: '/etc/pki/CA/intermediate/openssl.cnf' }
- name: Generate Root CA private key
community.crypto.openssl_privatekey:
path: /etc/pki/CA/root/private/ca.key.pem
size: 4096
type: rsa
mode: '0400'
- name: Generate Root CA CSR
community.crypto.openssl_csr_pipe:
privatekey_path: /etc/pki/CA/root/private/ca.key.pem
common_name: "My Corp Root CA"
organization_name: "My Corporation"
country_name: "US"
basic_constraints:
- "CA:TRUE"
basic_constraints_critical: true
key_usage:
- keyCertSign
- cRLSign
key_usage_critical: true
register: root_csr
- name: Generate Root CA self-signed certificate
community.crypto.x509_certificate:
path: /etc/pki/CA/root/certs/ca.cert.pem
csr_content: "{{ root_csr.csr }}"
privatekey_path: /etc/pki/CA/root/private/ca.key.pem
provider: selfsigned
selfsigned_not_after: "+3650d" # 10 years
mode: '0444'
- name: Generate Intermediate CA private key
community.crypto.openssl_privatekey:
path: /etc/pki/CA/intermediate/private/intermediate.key.pem
size: 4096
type: rsa
mode: '0400'
- name: Generate Intermediate CA CSR
community.crypto.openssl_csr_pipe:
privatekey_path: /etc/pki/CA/intermediate/private/intermediate.key.pem
common_name: "My Corp Intermediate CA"
organization_name: "My Corporation"
country_name: "US"
# The ownca provider copies extensions from the CSR, so the CA constraints
# must be declared here, not on the x509_certificate task.
basic_constraints:
- "CA:TRUE"
- "pathlen:0"
basic_constraints_critical: true
key_usage:
- keyCertSign
- cRLSign
key_usage_critical: true
register: intermediate_csr
- name: Sign Intermediate CA certificate with Root CA
community.crypto.x509_certificate:
path: /etc/pki/CA/intermediate/certs/intermediate.cert.pem
csr_content: "{{ intermediate_csr.csr }}"
provider: ownca
ownca_path: /etc/pki/CA/root/certs/ca.cert.pem
ownca_privatekey_path: /etc/pki/CA/root/private/ca.key.pem
ownca_not_after: "+1825d" # 5 years
mode: '0444'
- name: Create CA chain file
ansible.builtin.shell: >
cat /etc/pki/CA/intermediate/certs/intermediate.cert.pem
/etc/pki/CA/root/certs/ca.cert.pem
> /etc/pki/CA/intermediate/certs/ca-chain.cert.pem
args:
creates: /etc/pki/CA/intermediate/certs/ca-chain.cert.pem
- name: Set chain file permissions
ansible.builtin.file:
path: /etc/pki/CA/intermediate/certs/ca-chain.cert.pem
mode: '0444'
The critical part of this role is the openssl.cnf templates, which define the policies and constraints for the CAs. A pitfall here is using a generic configuration that doesn't properly constrain the intermediate CA.
roles/ca-setup/templates/intermediate-openssl.cnf.j2 (Excerpt)
[ ca ]
default_ca = CA_default
[ CA_default ]
dir = /etc/pki/CA/intermediate
certs = $dir/certs
crl_dir = $dir/crl
database = $dir/index.txt
new_certs_dir = $dir/newcerts
certificate = $dir/certs/intermediate.cert.pem
serial = $dir/serial
private_key = $dir/private/intermediate.key.pem
# Policy for signing server/client certs
policy = policy_strict
[ policy_strict ]
countryName = match
stateOrProvinceName = optional
organizationName = match
organizationalUnitName = optional
commonName = supplied
[ server_cert ]
basicConstraints = CA:FALSE
nsCertType = server
nsComment = "OpenSSL Generated Server Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
[ client_cert ]
basicConstraints = CA:FALSE
nsCertType = client
nsComment = "OpenSSL Generated Client Certificate"
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth
This configuration ensures the intermediate CA can only issue end-entity certificates (not other CAs) and defines specific profiles for server and client certificates.
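The policy section reads as a small rule table: match means the field must equal the CA's own value, supplied means it must be present, and optional means anything goes. As an illustration, here is a hypothetical pure-Python rendering of that logic, useful for pre-validating CSR subjects before they ever reach OpenSSL (the helper name and CA subject values are assumptions, not part of the roles above):

```python
# Mirror of the [ policy_strict ] section: field -> rule.
POLICY_STRICT = {
    "countryName": "match",
    "stateOrProvinceName": "optional",
    "organizationName": "match",
    "organizationalUnitName": "optional",
    "commonName": "supplied",
}

# Subject fields of our (hypothetical) intermediate CA.
CA_SUBJECT = {"countryName": "US", "organizationName": "My Corporation"}

def subject_satisfies_policy(subject, policy=POLICY_STRICT, ca_subject=CA_SUBJECT):
    """Return a list of policy violations (empty list means the subject passes)."""
    violations = []
    for field, rule in policy.items():
        value = subject.get(field)
        if rule == "match" and value != ca_subject.get(field):
            violations.append(f"{field} must match CA value {ca_subject.get(field)!r}")
        elif rule == "supplied" and not value:
            violations.append(f"{field} must be supplied")
        # rule == "optional": anything, including absence, is accepted
    return violations
```

Running this check in the worker before invoking OpenSSL gives clearer error messages than parsing OpenSSL's policy failures.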
The Client-Side Role: CSR Generation and Submission
The mtls-client role runs on each service node. Its first job is to generate a key and a CSR, then dispatch it to SQS. It uses a helper Python script for the AWS interaction, which is more robust and testable than shelling out to the AWS CLI.
roles/mtls-client/tasks/request_cert.yml
---
- name: Gather EC2 instance metadata
# Assumes the amazon.aws collection; provides the EC2 instance ID fact
# referenced below. Without it, ansible_facts.ec2_instance_id is undefined.
amazon.aws.ec2_metadata_facts:
- name: Ensure certificate directory exists
ansible.builtin.file:
path: /etc/my-service/pki
state: directory
owner: my-service-user
group: my-service-user
mode: '0750'
- name: Generate private key for the service
community.crypto.openssl_privatekey:
path: /etc/my-service/pki/service.key
size: 2048
type: rsa
mode: '0600'
owner: my-service-user
group: my-service-user
register: pkey
# Note: openssl_privatekey is idempotent and reuses an existing key by default.
# In production you'd add explicit rotation logic (e.g. the module's
# regenerate option) to control when a new key pair is actually generated.
- name: Generate CSR for the service
community.crypto.openssl_csr_pipe:
privatekey_path: /etc/my-service/pki/service.key
# The CN should be specific and verifiable, e.g., instance ID or FQDN
common_name: "{{ ansible_facts.ec2_instance_id }}"
subject_alt_name:
- "DNS:{{ ansible_facts.fqdn }}"
- "IP:{{ ansible_facts.default_ipv4.address }}"
register: csr
- name: Install Boto3 for SQS interaction
ansible.builtin.pip:
name: boto3
state: present
- name: Copy CSR submission script to node
ansible.builtin.template:
src: submit_csr.py.j2
dest: /usr/local/bin/submit_csr.py
mode: '0755'
- name: Submit CSR to SQS queue
ansible.builtin.command: >
/usr/local/bin/submit_csr.py
--queue-url "{{ csr_pending_queue_url }}"
--region "{{ aws_region }}"
--instance-id "{{ ansible_facts.ec2_instance_id }}"
args:
stdin: "{{ csr.csr }}"
register: submission_result
changed_when: submission_result.rc == 0
- name: Log submission result
ansible.builtin.debug:
var: submission_result.stdout
The Python script submit_csr.py handles the logic of sending the CSR to SQS. Crucially, it attaches the instance ID as a message attribute. This attribute is essential for the retrieval step, allowing the client to find its specific certificate later without having to parse message bodies.
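Both producer and consumer have to agree on this envelope. As a small illustrative rendering of the contract (the helper names are hypothetical, not part of the actual scripts), each message pairs a JSON body with a duplicate InstanceId attribute:

```python
import json

# Fields every CSR message body must carry.
REQUIRED_FIELDS = {"instance_id", "csr_pem"}

def build_csr_message(instance_id, csr_pem):
    """Return (body, attributes) suitable for an SQS send_message call."""
    body = json.dumps({"instance_id": instance_id, "csr_pem": csr_pem})
    # The attribute duplicates instance_id so consumers can route without
    # parsing the body.
    attributes = {"InstanceId": {"DataType": "String", "StringValue": instance_id}}
    return body, attributes

def parse_csr_message(body):
    """Validate an incoming message body; raise ValueError if malformed."""
    payload = json.loads(body)
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"message missing fields: {sorted(missing)}")
    return payload
```

Centralizing the envelope like this keeps the client, worker, and retrieval script from drifting apart as fields are added.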
roles/mtls-client/templates/submit_csr.py.j2
#!/usr/bin/env python3
import argparse
import sys
import json
import logging
import boto3
from botocore.exceptions import ClientError
# Set up basic logging for traceability in system logs
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
def submit_csr_to_sqs(queue_url, region, instance_id, csr_content):
"""
Submits a CSR to a specified SQS queue with instance_id as a message attribute.
Args:
queue_url (str): The URL of the SQS queue.
region (str): The AWS region of the queue.
instance_id (str): The EC2 instance ID to use as a message attribute.
csr_content (str): The PEM-encoded CSR string.
Returns:
str: The MessageId of the sent message, or None on failure.
"""
sqs_client = boto3.client('sqs', region_name=region)
message_body = json.dumps({
'instance_id': instance_id,
'csr_pem': csr_content
})
try:
response = sqs_client.send_message(
QueueUrl=queue_url,
MessageBody=message_body,
MessageAttributes={
'InstanceId': {
'DataType': 'String',
'StringValue': instance_id
}
}
)
logging.info(f"Successfully sent CSR for instance {instance_id}. MessageId: {response['MessageId']}")
return response['MessageId']
except ClientError as e:
logging.error(f"Failed to send message to SQS queue {queue_url}: {e}")
return None
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Submit a CSR from stdin to an AWS SQS queue.")
parser.add_argument('--queue-url', required=True, help='The URL of the target SQS queue.')
parser.add_argument('--region', required=True, help='The AWS region.')
parser.add_argument('--instance-id', required=True, help='The EC2 instance ID for message attribute.')
args = parser.parse_args()
# Read CSR from standard input, which is piped from Ansible
csr_pem = sys.stdin.read()
if not csr_pem:
logging.error("No CSR content received from stdin.")
sys.exit(1)
message_id = submit_csr_to_sqs(args.queue_url, args.region, args.instance_id, csr_pem)
if message_id:
print(json.dumps({"status": "success", "message_id": message_id}))
sys.exit(0)
else:
print(json.dumps({"status": "error", "message": "Failed to submit CSR to SQS."}))
sys.exit(1)
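boto3 already retries throttling internally, but a deployment-level wrapper with jittered exponential backoff keeps a brief SQS brown-out from failing the play outright. A sketch with the sender injected as a callable so the retry policy is testable in isolation (the helper name and retry budget are assumptions):

```python
import random
import time

def send_with_backoff(send, attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it succeeds, with jittered exponential backoff.

    send: zero-argument callable that raises on failure, e.g. a lambda
    wrapping sqs_client.send_message(...). Returns send()'s result, or
    re-raises the last error once the attempt budget is exhausted.
    """
    for attempt in range(attempts):
        try:
            return send()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # "full jitter" strategy
```

Injecting sleep also makes unit tests instant, since the test can pass a no-op in its place.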
The CA Worker: Processing Signing Requests
The CA worker is the heart of the system. It's a long-running process that polls the csr-pending-queue. For this example, we'll create an Ansible playbook that runs a Python script in a loop. In a real-world project, this would be a systemd service or a container running in ECS/EKS.
The worker script must be idempotent. It fetches a message, processes it, and only if signing and publishing the certificate is successful does it delete the original message from the CSR queue. SQS’s visibility timeout feature prevents other workers from picking up the same message while it’s being processed. If the worker crashes, the message becomes visible again after the timeout and can be re-processed.
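The visibility-timeout semantics the worker relies on can be sketched as a toy in-memory queue with a logical clock. A crashed worker simply never deletes its message, and the message resurfaces once the timeout elapses (this model is purely illustrative, not how SQS is implemented):

```python
class ToyQueue:
    """Minimal model of SQS visibility semantics (logical clock, no network)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.now = 0
        self.messages = {}  # receipt_handle -> (body, visible_at)
        self._next = 0

    def send(self, body):
        handle = f"rh-{self._next}"
        self._next += 1
        self.messages[handle] = (body, 0)  # visible immediately

    def receive(self):
        for handle, (body, visible_at) in self.messages.items():
            if visible_at <= self.now:
                # Hide the message until the visibility timeout elapses.
                self.messages[handle] = (body, self.now + self.visibility_timeout)
                return handle, body
        return None  # nothing visible right now

    def delete(self, handle):
        """The consumer's acknowledgement: only now is the message gone."""
        self.messages.pop(handle, None)

    def tick(self, seconds):
        self.now += seconds
```

The key property: receive hides but never removes; only an explicit delete after successful processing removes the message, which is exactly the at-least-once guarantee the worker's delete-last ordering depends on.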
ca_worker_playbook.yml
---
- hosts: ca_worker
become: yes
vars:
csr_pending_queue_url: "your-csr-pending-queue-url"
certs_signed_queue_url: "your-certs-signed-queue-url"
aws_region: "us-east-1"
tasks:
- name: Install dependencies
ansible.builtin.pip:
name:
- boto3
- cryptography
state: present
- name: Ensure worker directory exists
ansible.builtin.file:
path: /opt/ca-worker
state: directory
mode: '0750'
- name: Copy CA worker script
ansible.builtin.template:
src: ca_worker.py.j2
dest: /opt/ca-worker/ca_worker.py
mode: '0755'
- name: Run CA worker script in a loop
# Fire-and-forget: with async 3600 the loop is killed after an hour.
# In production, run the worker as a systemd service or container instead.
ansible.builtin.shell: |
while true; do
/opt/ca-worker/ca_worker.py \
--csr-queue-url "{{ csr_pending_queue_url }}" \
--cert-queue-url "{{ certs_signed_queue_url }}" \
--region "{{ aws_region }}" \
--ca-config /etc/pki/CA/intermediate/openssl.cnf \
--ca-key /etc/pki/CA/intermediate/private/intermediate.key.pem \
--ca-cert /etc/pki/CA/intermediate/certs/intermediate.cert.pem
sleep 5
done
async: 3600
poll: 0
The worker script itself is more complex, handling message parsing, CSR validation, OpenSSL command execution, and publishing the result.
templates/ca_worker.py.j2 (Core Logic)
# ... imports and arg parsing ...
def sign_csr(csr_pem, ca_config, ca_key, ca_cert, duration_days=30):
"""Signs a CSR using openssl command line tool."""
# In a real system, use a cryptography library instead of subprocess
# for better security and error handling. This is for demonstration.
# A critical step: validate the CSR here. Check the CN against an
# inventory database, verify domain ownership, etc.
# For now, we trust the input, which is a significant security risk.
logging.info("Validating CSR (stubbed)...")
cmd = [
'openssl', 'x509', '-req',
'-in', '/dev/stdin',
'-CA', ca_cert,
'-CAkey', ca_key,
'-CAcreateserial',
'-out', '/dev/stdout',
'-days', str(duration_days),
'-sha256',
'-extfile', ca_config,
'-extensions', 'server_cert' # or client_cert based on request
]
process = subprocess.run(cmd, input=csr_pem.encode('utf-8'), capture_output=True, check=False)
if process.returncode != 0:
logging.error(f"OpenSSL failed to sign CSR: {process.stderr.decode('utf-8')}")
return None
logging.info("CSR signed successfully.")
return process.stdout.decode('utf-8')
def process_messages(sqs_client, csr_queue_url, cert_queue_url, ca_config, ca_key, ca_cert):
response = sqs_client.receive_message(
QueueUrl=csr_queue_url,
MaxNumberOfMessages=1,
WaitTimeSeconds=10, # Use long polling
MessageAttributeNames=['All']
)
if 'Messages' not in response:
return
for message in response['Messages']:
receipt_handle = message['ReceiptHandle']
try:
body = json.loads(message['Body'])
csr_pem = body['csr_pem']
instance_id = body.get('instance_id', 'unknown') # Fallback
logging.info(f"Processing CSR for instance: {instance_id}")
signed_cert_pem = sign_csr(csr_pem, ca_config, ca_key, ca_cert)
if signed_cert_pem:
# Get the full CA chain
with open('/etc/pki/CA/intermediate/certs/ca-chain.cert.pem', 'r') as f:
ca_chain = f.read()
# Publish to the signed certs queue
cert_body = json.dumps({
'instance_id': instance_id,
'certificate_pem': signed_cert_pem,
'ca_chain_pem': ca_chain
})
sqs_client.send_message(
QueueUrl=cert_queue_url,
MessageBody=cert_body,
MessageAttributes={
'InstanceId': {
'DataType': 'String',
'StringValue': instance_id
}
}
)
logging.info(f"Published signed certificate for {instance_id}")
# Only delete the message after successful processing
sqs_client.delete_message(QueueUrl=csr_queue_url, ReceiptHandle=receipt_handle)
logging.info(f"Deleted CSR message for {instance_id}")
else:
logging.error(f"Failed to sign CSR for {instance_id}. Message will become visible again.")
# The pitfall here is not having a Dead-Letter Queue (DLQ). A repeatedly failing
# message will poison the queue. A DLQ is essential for production.
except Exception as e:
logging.exception(f"Unhandled error processing message. It will be retried. Error: {e}")
# ... main loop calling process_messages ...
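As the comment in sign_csr notes, shelling out to openssl is a stopgap. Here is a sketch of the same operation with the cryptography library (which the playbook already installs); the function name is an assumption, and the extension values mirror the [ server_cert ] profile from the intermediate CA config:

```python
import datetime

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import ExtendedKeyUsageOID

def sign_csr_pem(csr_pem, ca_cert_pem, ca_key_pem, duration_days=30):
    """Sign a PEM CSR with the CA cert/key, applying a server_cert-style profile."""
    csr = x509.load_pem_x509_csr(csr_pem.encode())
    if not csr.is_signature_valid:
        raise ValueError("CSR signature is invalid")
    ca_cert = x509.load_pem_x509_certificate(ca_cert_pem.encode())
    ca_key = serialization.load_pem_private_key(ca_key_pem.encode(), password=None)
    now = datetime.datetime.now(datetime.timezone.utc)
    builder = (
        x509.CertificateBuilder()
        .subject_name(csr.subject)
        .issuer_name(ca_cert.subject)
        .public_key(csr.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=duration_days))
        # Extensions mirroring the [ server_cert ] OpenSSL profile:
        .add_extension(x509.BasicConstraints(ca=False, path_length=None),
                       critical=True)
        .add_extension(
            x509.KeyUsage(digital_signature=True, key_encipherment=True,
                          content_commitment=False, data_encipherment=False,
                          key_agreement=False, key_cert_sign=False,
                          crl_sign=False, encipher_only=False,
                          decipher_only=False),
            critical=True,
        )
        .add_extension(x509.ExtendedKeyUsage([ExtendedKeyUsageOID.SERVER_AUTH]),
                       critical=False)
    )
    cert = builder.sign(ca_key, hashes.SHA256())
    return cert.public_bytes(serialization.Encoding.PEM).decode()
```

Beyond avoiding subprocess handling, this gives structured errors and makes the extension set explicit in code rather than buried in an -extensions flag.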
Client-Side Role: Certificate Retrieval and Installation
The final piece is retrieving the signed certificate, handled by another task in the mtls-client role. It uses a script to poll the certs-signed-queue, filtering messages by its own instance ID. Note that SQS offers no server-side filtering: every client receives a batch and must skip messages addressed to other instances, which remain invisible until the visibility timeout expires unless the client explicitly releases them. This is simple and workable at moderate scale; at high fan-out, a per-instance queue or SNS subscription filtering would be more efficient.
roles/mtls-client/tasks/retrieve_cert.yml
---
- name: Copy certificate retrieval script to node
ansible.builtin.template:
src: retrieve_cert.py.j2
dest: /usr/local/bin/retrieve_cert.py
mode: '0755'
- name: Retrieve and install certificate from SQS
ansible.builtin.command: >
/usr/local/bin/retrieve_cert.py
--queue-url "{{ certs_signed_queue_url }}"
--region "{{ aws_region }}"
--instance-id "{{ ansible_facts.ec2_instance_id }}"
--cert-path /etc/my-service/pki/service.pem
--chain-path /etc/my-service/pki/ca-chain.pem
--timeout 300
register: retrieval_result
changed_when: "'Certificate installed' in retrieval_result.stdout"
notify: restart my-service
- name: Set correct permissions on installed certificates
ansible.builtin.file:
path: "{{ item }}"
owner: my-service-user
group: my-service-user
mode: '0640'
loop:
- /etc/my-service/pki/service.pem
- /etc/my-service/pki/ca-chain.pem
The Ansible handler restart my-service is triggered only when the certificate is successfully installed, ensuring the service picks up the new credentials.
The retrieval script polls the queue, checks message attributes, and if a match is found, writes the certificate files and deletes the message.
roles/mtls-client/templates/retrieve_cert.py.j2 (Core Logic)
# ... imports and arg parsing ...
def retrieve_certificate(queue_url, region, instance_id, cert_path, chain_path, timeout):
sqs_client = boto3.client('sqs', region_name=region)
start_time = time.time()
while time.time() - start_time < timeout:
response = sqs_client.receive_message(
QueueUrl=queue_url,
MaxNumberOfMessages=10,
WaitTimeSeconds=5,
MessageAttributeNames=['InstanceId']
)
if 'Messages' not in response:
logging.info("No messages in queue, polling again.")
continue
for message in response['Messages']:
msg_attrs = message.get('MessageAttributes', {})
msg_instance_id = msg_attrs.get('InstanceId', {}).get('StringValue')
if msg_instance_id != instance_id:
# Not ours: release it immediately so its owner does not have to
# wait out the visibility timeout.
sqs_client.change_message_visibility(
QueueUrl=queue_url,
ReceiptHandle=message['ReceiptHandle'],
VisibilityTimeout=0
)
continue
logging.info(f"Found matching certificate message for instance {instance_id}")
body = json.loads(message['Body'])
with open(cert_path, 'w') as f:
f.write(body['certificate_pem'])
with open(chain_path, 'w') as f:
f.write(body['ca_chain_pem'])
logging.info("Certificate and chain written to disk.")
# Acknowledge and delete message
sqs_client.delete_message(
QueueUrl=queue_url,
ReceiptHandle=message['ReceiptHandle']
)
return True
logging.error(f"Timed out after {timeout} seconds waiting for certificate.")
return False
# ... main function calling retrieve_certificate ...
This SQS-based architecture successfully decouples certificate lifecycle management from the main deployment flow. It’s more resilient to transient failures and adheres to better security practices by avoiding direct network access to sensitive infrastructure. However, the system presented here is a foundation, not a final product. The OpenSSL-based CA is its weakest link from a security and manageability perspective; replacing it with a solution like HashiCorp Vault, which provides API-driven certificate management, robust auditing, and secure key storage, would be the necessary next step for a production environment. Furthermore, this design does not address certificate revocation. An effective PKI must include a mechanism for revoking compromised certificates, which could be implemented using another SQS queue to distribute a CRL or by pointing services to an OCSP responder.
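One practical detail this raises: since leaf certificates are short-lived (30 days in the worker above), clients must know when to re-enter the request flow. A small stdlib helper for that decision, assuming the common convention of renewing once two-thirds of the lifetime has elapsed (the function name and fraction are assumptions, not part of the roles above):

```python
import datetime

def should_renew(not_before, not_after, now=None, fraction=2 / 3):
    """Return True once `fraction` of the certificate lifetime has elapsed.

    Renewing well before expiry leaves headroom for the asynchronous
    pipeline (queued CSRs, CA maintenance windows) to complete in time.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    lifetime = not_after - not_before
    return now >= not_before + lifetime * fraction
```

A periodic Ansible run (or cron job on the node) calling this against the installed certificate's validity dates closes the rotation loop without any manual scheduling.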