Implementing Mutual TLS for a Django and ChromaDB RAG Service on Azure with Consul Connect


The initial proof-of-concept for our Retrieval-Augmented Generation (RAG) service was straightforward. A Django application handled the business logic and API, while a standalone ChromaDB instance served as the vector store. On a local machine, connecting the two was a matter of pointing the Django service to localhost:8000. Deploying this to our Azure environment, however, surfaced a foundational security problem that simple prototypes conveniently ignore: securing service-to-service communication.

The immediate, and frankly lazy, solution was to place both services on the same Azure Virtual Network and open port 8000 on the ChromaDB VM’s Network Security Group (NSG) to allow traffic from the Django VM’s private IP. This approach is brittle and violates zero-trust principles. IP addresses are ephemeral in a dynamic cloud environment, and an NSG rule-based approach creates a flat, overly permissive network. Any other compromised process on the application VM could potentially access the vector database. Manually managing TLS certificates for every internal service is an operational burden we were determined to avoid. The technical pain point was clear: we needed automated, identity-based, encrypted-in-transit communication between our Django application and our ChromaDB data store.

Our concept solidified around a service mesh. Instead of embedding security logic into our application code or wrestling with network ACLs, a mesh could provide a transparent, secure communication layer. We evaluated a few options. Istio felt like overkill for this two-service interaction, bringing a level of complexity we didn’t yet need. Linkerd was a strong contender but our long-term vision included services running outside of Kubernetes, potentially on dedicated VMs for performance-sensitive workloads like databases. This led us to Consul. Its ability to run agents on any node—VM or container—and its straightforward Consul Connect feature for providing mTLS via sidecar proxies offered the exact flexibility we required. The plan was to deploy our Django app and ChromaDB instance on separate Azure VMs, with Consul agents managing service discovery and secure connectivity, effectively creating a private, encrypted data plane for our RAG components.

Infrastructure Provisioning on Azure

In a real-world project, infrastructure must be codified. We used Terraform to define the necessary Azure resources. This ensures reproducibility and provides a clear audit trail. The core components are a virtual network, two subnets (one for application services, one for data services, for logical separation), and three Ubuntu 22.04 VMs: one for the Consul server, one for the Django application, and one for ChromaDB.

A common mistake is to create overly permissive NSG rules. Our setup allows inbound SSH only from a specific bastion IP range. The packets between VMs still ride on Azure's default intra-VNet allow rule, but every inter-service application request is forced through the Consul Connect mTLS tunnel (ChromaDB will only listen on localhost), so we never need to maintain explicit per-service NSG rules between the VMs.

Here is the core Terraform configuration for the network and VMs.

# main.tf

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

variable "resource_group_name" {
  default = "rag-secure-infra-rg"
}

variable "location" {
  default = "East US"
}

variable "admin_username" {
  default = "azureuser"
}

variable "admin_ssh_key_public" {
  description = "Public SSH key for VM access."
  type        = string
}

resource "azurerm_resource_group" "main" {
  name     = var.resource_group_name
  location = var.location
}

resource "azurerm_virtual_network" "main" {
  name                = "rag-vnet"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
}

resource "azurerm_subnet" "app" {
  name                 = "app-subnet"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
}

resource "azurerm_subnet" "data" {
  name                 = "data-subnet"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.2.0/24"]
}

resource "azurerm_network_security_group" "ssh_only" {
  name                = "ssh-only-nsg"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  security_rule {
    name                       = "SSH"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "YOUR_BASTION_IP/32" # IMPORTANT: Lock this down
    destination_address_prefix = "*"
  }
}

# --- Consul Server VM ---
# In a production setup, this should be a cluster of 3 or 5 nodes.
# For this demonstration, a single server simplifies the setup.
resource "azurerm_network_interface" "consul_server_nic" {
  name                = "consul-server-nic"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.app.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "consul_server" {
  name                            = "consul-server-vm"
  resource_group_name             = azurerm_resource_group.main.name
  location                        = azurerm_resource_group.main.location
  size                            = "Standard_B1s"
  admin_username                  = var.admin_username
  network_interface_ids           = [azurerm_network_interface.consul_server_nic.id]
  disable_password_authentication = true

  admin_ssh_key {
    username   = var.admin_username
    public_key = var.admin_ssh_key_public
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}

# --- Django App VM ---
resource "azurerm_network_interface" "app_nic" {
  name                = "app-vm-nic"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.app.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_network_interface_security_group_association" "app_nsg_assoc" {
  network_interface_id      = azurerm_network_interface.app_nic.id
  network_security_group_id = azurerm_network_security_group.ssh_only.id
}

resource "azurerm_linux_virtual_machine" "app_vm" {
  name                            = "django-app-vm"
  # ... similar configuration to consul_server ...
  network_interface_ids           = [azurerm_network_interface.app_nic.id]
  # ... rest of VM config ...
}

# --- ChromaDB VM ---
resource "azurerm_network_interface" "db_nic" {
  name                = "db-vm-nic"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name

  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.data.id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_network_interface_security_group_association" "db_nsg_assoc" {
  network_interface_id      = azurerm_network_interface.db_nic.id
  network_security_group_id = azurerm_network_security_group.ssh_only.id
}

resource "azurerm_linux_virtual_machine" "db_vm" {
  name                            = "chromadb-vm"
  # ... similar configuration to consul_server ...
  network_interface_ids           = [azurerm_network_interface.db_nic.id]
  # ... rest of VM config ...
}
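
Provisioning is then the standard Terraform workflow. A minimal run looks like the following; the path to the SSH public key is only an example, so substitute whichever key pair you intend to use for the VMs.

terraform init
terraform plan -var="admin_ssh_key_public=$(cat ~/.ssh/id_rsa.pub)" -out=rag.tfplan
terraform apply rag.tfplan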

After applying this configuration, we have our base infrastructure. The next step is installing and configuring the Consul agents on each node.

Establishing the Consul Control Plane

We’ll start with the Consul server. SSH into the consul-server-vm.

# On consul-server-vm
# Install dependencies
sudo apt-get update && sudo apt-get install -y wget gpg unzip

# Install Consul
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install -y consul

# Create Consul configuration directory
sudo mkdir -p /etc/consul.d
sudo chmod 755 /etc/consul.d

# Get the private IP of this VM
PRIVATE_IP=$(hostname -I | awk '{print $1}')

# Create the server configuration file
sudo tee /etc/consul.d/server.hcl > /dev/null <<EOF
datacenter = "azure-eastus"
data_dir = "/opt/consul"
server = true
bootstrap_expect = 1

# It's critical to bind to the private network interface
bind_addr = "${PRIVATE_IP}"
client_addr = "127.0.0.1 ${PRIVATE_IP}"

# Enable the UI for diagnostics
ui_config {
  enabled = true
}

# Enable Connect for service mesh capabilities
connect {
  enabled = true
}
EOF

To run Consul as a service, we create a systemd unit file.

# /etc/systemd/system/consul.service
# Note the quoted 'EOF': it stops the shell from expanding $MAINPID below.
sudo tee /etc/systemd/system/consul.service > /dev/null <<'EOF'
[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/consul.d/server.hcl

[Service]
User=consul
Group=consul
ExecStart=/usr/bin/consul agent -config-dir=/etc/consul.d/
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGTERM
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

# Create a dedicated user and set permissions. The HashiCorp apt package
# typically creates the consul user and /opt/consul already, so treat these
# steps as idempotent safety checks.
sudo useradd --system --home /etc/consul.d --shell /bin/false consul || true
sudo mkdir -p /opt/consul
sudo chown -R consul:consul /etc/consul.d /opt/consul
sudo systemctl enable consul
sudo systemctl start consul
sudo systemctl status consul
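
Before configuring any clients, it is worth confirming that the server agent actually started and elected itself leader of its single-node cluster:

# On consul-server-vm
consul members
# Expect a single row for consul-server-vm with Status "alive" and Type "server".
consul info | grep leader
# "leader = true" confirms this node holds the Raft leadership.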

With the server running, we can now configure the clients on the Django and ChromaDB VMs.

Registering ChromaDB with the Service Mesh

On the chromadb-vm, we install the Consul agent and Docker. The key insight here is that ChromaDB itself will only listen on the localhost interface. All external traffic will be proxied through the Consul Connect sidecar.

# On chromadb-vm
# Install the Consul agent (add the HashiCorp apt repository first, exactly
# as on the server) along with Docker
sudo apt-get update
sudo apt-get install -y docker.io consul

# Create Consul configuration
sudo mkdir -p /etc/consul.d
sudo chmod 755 /etc/consul.d

# Get private IPs of this VM and the Consul server
PRIVATE_IP=$(hostname -I | awk '{print $1}')
CONSUL_SERVER_IP="10.0.1.X" # Replace with the actual private IP of the consul-server-vm

# Create client configuration
sudo tee /etc/consul.d/client.hcl > /dev/null <<EOF
datacenter = "azure-eastus"
data_dir = "/opt/consul"
bind_addr = "${PRIVATE_IP}"

# Instructs the client how to find the server
retry_join = ["${CONSUL_SERVER_IP}"]

# Enable connect for this client node
connect {
  enabled = true
}
EOF

# Create the service definition for ChromaDB
# This is the most critical part for enabling service mesh functionality.
sudo tee /etc/consul.d/chromadb.hcl > /dev/null <<EOF
service {
  name = "chromadb"
  port = 8000
  
  # The 'connect' stanza registers this service with the mesh.
  # The 'sidecar_service' block tells Consul how to run the proxy.
  connect {
    sidecar_service {}
  }

  # Health check to ensure ChromaDB is responsive.
  # Consul will mark the service as unhealthy if this fails.
  check {
    id       = "chromadb-api-check"
    name     = "ChromaDB API TCP Check"
    tcp      = "127.0.0.1:8000"
    interval = "10s"
    timeout  = "2s"
  }
}
EOF

# Create and start the systemd service for the Consul client
# (use a similar systemd file as the server, but without ConditionFileNotEmpty)
# ... then enable and start the service.
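# A minimal sketch of that client unit, assuming the same consul user and
# /opt/consul data directory conventions as on the server:
sudo tee /etc/systemd/system/consul.service > /dev/null <<'EOF'
[Unit]
Description="HashiCorp Consul Agent"
Requires=network-online.target
After=network-online.target

[Service]
User=consul
Group=consul
ExecStart=/usr/bin/consul agent -config-dir=/etc/consul.d/
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

sudo useradd --system --home /etc/consul.d --shell /bin/false consul || true
sudo mkdir -p /opt/consul
sudo chown -R consul:consul /etc/consul.d /opt/consul
sudo systemctl enable --now consul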

# Finally, run ChromaDB via Docker
# The container only exposes the port on the host's loopback interface.
sudo docker run -d --name chromadb -p 127.0.0.1:8000:8000 chromadb/chroma

At this point, ChromaDB is running, but it’s completely inaccessible from the Django VM. The Consul agent on this node has registered the chromadb service, but the communication path is not yet open. We need to start the sidecar proxy.

# On chromadb-vm, run this in a separate terminal or as another systemd service
consul connect proxy -service=chromadb

This command starts Consul's built-in sidecar proxy (in production you would typically run an Envoy sidecar instead, via consul connect envoy -sidecar-for chromadb). It accepts incoming mTLS connections intended for the chromadb service and forwards the decrypted traffic to 127.0.0.1:8000.
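
Leaving that command running in a foreground terminal is fine for experimentation, but for anything longer-lived the proxy deserves its own supervisor. A minimal systemd sketch follows; the unit name is our own choice, and the same pattern applies to the django-app proxy introduced later.

sudo tee /etc/systemd/system/consul-connect-chromadb.service > /dev/null <<'EOF'
[Unit]
Description="Consul Connect sidecar proxy for chromadb"
Requires=consul.service
After=consul.service

[Service]
User=consul
ExecStart=/usr/bin/consul connect proxy -service=chromadb
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now consul-connect-chromadb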

Configuring the Django Application as a Downstream Client

Now for the django-app-vm. The process is similar: install the Consul agent and define the service. The difference is in the service definition, where we declare an upstream dependency on ChromaDB.

# On django-app-vm
# Install the Consul agent (HashiCorp apt repository, as before) and Python dependencies
sudo apt-get update
sudo apt-get install -y python3-pip python3-venv consul

# Setup Django project structure (assuming a `myproject` Django app)
# ...

# Create Consul configuration (client.hcl, same as chromadb-vm)
# ...

# Create the service definition for the Django application
# The upstream definition is the key piece of configuration.
sudo tee /etc/consul.d/django-app.hcl > /dev/null <<EOF
service {
  name = "django-app"
  port = 8080 # The port the Django app itself listens on

  connect {
    sidecar_service {
      # This 'upstreams' block is what configures the client-side proxy.
      # It tells Consul to create a local listener on port 8001
      # that securely forwards traffic to the 'chromadb' service.
      upstreams {
        destination_name = "chromadb"
        local_bind_port  = 8001
      }
    }
  }

  check {
    id       = "django-http-check"
    name     = "Django App Health Check"
    http     = "http://127.0.0.1:8080/health/" # Assuming a /health endpoint
    interval = "15s"
    timeout  = "3s"
  }
}
EOF

# Start the Consul agent via systemd
# ...

The upstreams block instructs the local Consul agent to manage a proxy. When we start the proxy for django-app, it will open a listener on 127.0.0.1:8001. Any traffic sent to this local port will be automatically wrapped in mTLS, sent across the network to the chromadb service’s proxy, decrypted, and forwarded to the ChromaDB container.
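
At this point it is worth sanity-checking the catalog. Both application services and their automatically registered sidecar proxy entries should be visible from any node running a Consul agent (the -sidecar-proxy suffix is Consul's default naming):

consul catalog services
# chromadb
# chromadb-sidecar-proxy
# consul
# django-app
# django-app-sidecar-proxy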

The beauty of this is the application’s configuration becomes incredibly simple and static.

# myproject/settings.py

# ... other settings ...

# The application now points to its local sidecar proxy, not the remote DB.
# This configuration never needs to change, regardless of where ChromaDB is running.
CHROMA_HOST = "127.0.0.1"
CHROMA_PORT = 8001

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
        },
    },
    'root': {
        'handlers': ['console'],
        'level': 'INFO',
    },
}

The Django service code that interacts with ChromaDB needs no special logic for TLS or service discovery. It does, however, include deliberate error handling, which is crucial because a failure could now originate in the application, the local proxy, the network, the remote proxy, or the database itself.

# myproject/rag/service.py
import chromadb
from chromadb.utils import embedding_functions
from django.conf import settings
import logging

logger = logging.getLogger(__name__)

class ChromaDBService:
    _client = None

    def get_client(self):
        if self._client:
            # Basic health check before returning existing client
            try:
                self._client.heartbeat()
                return self._client
            except Exception:
                logger.warning("Existing ChromaDB client connection is stale. Reconnecting.")
                self._client = None

        logger.info(f"Connecting to ChromaDB via Consul proxy at {settings.CHROMA_HOST}:{settings.CHROMA_PORT}")
        try:
            # The HttpClient connects to the local port managed by the Consul sidecar proxy.
            client = chromadb.HttpClient(
                host=settings.CHROMA_HOST,
                port=settings.CHROMA_PORT,
                # The auth settings below are optional: they only matter if the
                # Chroma server itself enforces token auth. Transport security
                # comes from the Consul Connect mTLS tunnel, not this client.
                # (Production code would also want explicit request timeouts.)
                settings=chromadb.config.Settings(
                    chroma_client_auth_provider="chromadb.auth.token.TokenAuthClientProvider",
                    chroma_client_auth_credentials="your-static-token-if-needed"
                )
            )
            # The heartbeat call is a good way to verify the entire chain is working.
            client.heartbeat()
            self._client = client
            logger.info("Successfully connected to ChromaDB.")
            return self._client
        except Exception as e:
            # This error is critical. It indicates a failure in the service mesh path.
            logger.error(
                "Failed to connect to ChromaDB. This could be due to: "
                "1. The local consul-connect-proxy for django-app is not running. "
                "2. The remote consul-connect-proxy for chromadb is not running. "
                "3. A Consul intention is blocking the connection. "
                "4. The ChromaDB service itself is down.",
                exc_info=True
            )
            raise ConnectionError("Could not establish secure connection to vector store.") from e

    def query_collection(self, collection_name: str, query_text: str):
        client = self.get_client()
        collection = client.get_or_create_collection(name=collection_name)
        
        results = collection.query(
            query_texts=[query_text],
            n_results=5
        )
        return results

Now, start the sidecar proxy on the django-app-vm:

# On django-app-vm
consul connect proxy -service=django-app
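
Before involving any Django code, a quick way to confirm the entire path is to hit ChromaDB's heartbeat endpoint through the local upstream listener. The /api/v1/heartbeat path below assumes Chroma's 0.4.x API; if it returns a 404, check your server version for the v2 equivalent.

# On django-app-vm, with both sidecar proxies running
curl -s http://127.0.0.1:8001/api/v1/heartbeat
# A JSON payload containing a nanosecond timestamp means the request traversed
# the mTLS tunnel and reached the ChromaDB container.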

Verification and Zero-Trust Enforcement

With both proxies running, the Django application can now successfully connect to ChromaDB. But the mesh is still permissive: with no ACL default-deny policy and no intentions defined, Consul allows any service to connect to any other. To enforce zero-trust, we use Consul intentions, starting from an explicit deny-all rule so that traffic only flows where we have deliberately allowed it.

# First, let's create a "deny all" intention for our datacenter.
# Run this from any machine with the `consul` CLI configured.
consul intention create -deny "*" "*"

# Now, if you try to use the Django app, it will fail to reach ChromaDB.
# The logs will show connection failures, because the destination proxy
# rejects any connection that is not explicitly authorized.

# Let's create a specific intention to allow django-app to talk to chromadb.
consul intention create -allow django-app chromadb

This single command updates the policy across the entire mesh. The django-app proxy is now authorized to initiate a connection to the chromadb proxy. This is true, identity-based authorization, completely decoupled from network topology.
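
With the intention in place, a smoke test from the Django side exercises the full chain: application, local proxy, mTLS tunnel, remote proxy, ChromaDB. The module path below assumes the ChromaDBService class lives in myproject/rag/service.py as shown above.

# On django-app-vm, from the project directory with its virtualenv activated
python manage.py shell -c "from myproject.rag.service import ChromaDBService; ChromaDBService().get_client(); print('vector store reachable over the mesh')"
# Re-running this after deleting the intention (consul intention delete django-app chromadb)
# should fail with the ConnectionError raised in get_client().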

The final architecture can be visualized as follows:

graph TD
    subgraph vnet["Azure VNet: 10.0.0.0/16"]
        subgraph app_subnet["App Subnet: 10.0.1.0/24"]
            subgraph django_vm["Django VM"]
                A[Django App] -- HTTP --> B(localhost:8001);
                B -- Plaintext --> C{Consul Sidecar Proxy};
            end
        end

        subgraph data_subnet["Data Subnet: 10.0.2.0/24"]
            subgraph chroma_vm["ChromaDB VM"]
                F{Consul Sidecar Proxy} -- Plaintext --> G[ChromaDB on localhost:8000];
            end
        end

        C -- mTLS Tunnel --> F;
    end

    subgraph control_plane["Consul Control Plane"]
        H[Consul Server] -- Manages Proxies and Intentions --> C;
        H -- Manages Proxies and Intentions --> F;
    end

This setup achieves our goal. Communication is encrypted via mTLS, service discovery is automatic, and authorization is based on logical service identity rather than brittle IP addresses. The application code remains blissfully unaware of this complexity, simply connecting to a localhost port. The pitfall here is that the operational complexity has moved from the network layer (NSGs) to the service mesh layer (Consul agents, proxies, intentions). This is a trade-off, but one that provides far greater security and flexibility.

While this implementation on VMs is robust, its lifecycle management relies on systemd, which can be cumbersome. The proxies and agents are distinct processes that must be monitored. A more cloud-native approach would involve deploying this entire stack to Azure Kubernetes Service (AKS). The Consul Helm chart can automate the injection of sidecar proxies and the management of Consul clients, tying the proxy lifecycle directly to the application pod’s lifecycle. Furthermore, automating the creation and destruction of Consul intentions through a CI/CD pipeline using Terraform’s Consul provider would be the next logical step to fully embrace Infrastructure as Code for not just the infrastructure, but the security policies governing it.

