Our PyTorch model deployment pipeline was functional but carried a significant security liability. It relied on a long-lived Azure Service Principal secret stored directly in Jenkins Credentials. During a routine security audit, this was flagged as a critical vulnerability. A compromised Jenkins instance meant an attacker could extract this credential, granting them persistent, and often overly broad, access to our Azure environment. The mandate was clear: eliminate static, long-lived cloud credentials from our CI/CD system entirely.
The initial thought was to rotate secrets more frequently using Azure Key Vault. However, this only shifts the problem. The Jenkins pipeline would still need an initial secret to authenticate to the Key Vault. The core issue of a stored credential remains. A more fundamental change was required. We decided to pursue a “passwordless” architecture leveraging Azure’s Workload Identity Federation. The concept is to establish a trust relationship between our Jenkins instance (acting as an OpenID Connect provider) and Azure Active Directory. This allows Jenkins to issue a short-lived JSON Web Token (JWT) for a specific pipeline run, which Azure then exchanges for a temporary, narrowly-scoped Azure access token. No secrets are stored, and access is ephemeral by design.
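Under the hood, this exchange is a standard OAuth 2.0 client-credentials request in which the Jenkins-issued JWT stands in for a client secret as a "client assertion". A minimal sketch of the form body that gets POSTed to the Azure AD token endpoint (the helper name `build_token_exchange_body` is illustrative, not part of any SDK):

```python
# Sketch of the token-exchange request body Azure AD expects when a
# federated JWT is presented instead of a client secret. This is what
# 'az login --federated-token' constructs on your behalf.

def build_token_exchange_body(client_id: str, jenkins_jwt: str) -> dict:
    """Build the form body POSTed to
    https://login.microsoftonline.com/<tenant>/oauth2/v2.0/token."""
    return {
        "grant_type": "client_credentials",
        "client_id": client_id,  # client ID of the managed identity
        # The externally issued JWT is presented as a client assertion...
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": jenkins_jwt,
        # ...in exchange for a short-lived token scoped to the ARM API.
        "scope": "https://management.azure.com/.default",
    }

body = build_token_exchange_body("00000000-0000-0000-0000-000000000000", "<jenkins-jwt>")
print(body["grant_type"])  # client_credentials
```

The key point is that the `client_assertion` is minted fresh for each pipeline run and expires within minutes, so there is nothing durable to steal.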
The architecture centers on four components: Jenkins, which orchestrates the pipeline and issues OIDC JWTs; Azure, which provides the target infrastructure and identity management; PyTorch, the payload being delivered; and the JWT, the cryptographic glue enabling this secure handshake. We selected Azure Kubernetes Service (AKS) as our deployment target due to its robust ecosystem for serving containerized models and Azure Container Registry (ACR) for storing our versioned model images. The entire process hinges on correctly configuring the federated trust between the Jenkins OIDC provider and a User-Assigned Managed Identity in Azure, which will be granted the precise, minimal permissions required for the deployment.
Phase 1: Establishing the Trust Anchor in Azure
Before Jenkins can talk to Azure, the Azure environment must be configured to trust it. This involves creating the target infrastructure and, most critically, the identity components. In a real-world project, this setup should be managed via Infrastructure as Code (e.g., Terraform), but for clarity, the raw Azure CLI commands demonstrate the underlying mechanics.
The first step is setting up the necessary resources: a resource group, the container registry, and the Kubernetes cluster.
# Set environment variables for consistency
export RESOURCE_GROUP="pytorch-secure-pipeline-rg"
export LOCATION="eastus"
export ACR_NAME="pytorchsecureacr$(openssl rand -hex 4)"
export AKS_NAME="pytorch-secure-aks"
export MANAGED_IDENTITY_NAME="jenkins-deployer-identity"
# Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION
# Create Azure Container Registry
az acr create \
--resource-group $RESOURCE_GROUP \
--name $ACR_NAME \
--sku Basic \
--admin-enabled false # Best practice: disable admin user
# Create AKS Cluster
# Note: In production, use a more robust configuration with networking specifics.
# AAD integration and Azure RBAC are required for the 'Azure Kubernetes Service
# RBAC Writer' role assignment made later in this phase.
az aks create \
--resource-group $RESOURCE_GROUP \
--name $AKS_NAME \
--node-count 1 \
--generate-ssh-keys \
--enable-managed-identity \
--enable-aad \
--enable-azure-rbac
With the infrastructure in place, the core identity configuration begins. We need a User-Assigned Managed Identity that our Jenkins job will assume. This identity will be granted permissions, not the Jenkins server itself.
# Create a User-Assigned Managed Identity
az identity create \
--resource-group $RESOURCE_GROUP \
--name $MANAGED_IDENTITY_NAME
# Retrieve identity details for later use
export IDENTITY_CLIENT_ID=$(az identity show --resource-group $RESOURCE_GROUP --name $MANAGED_IDENTITY_NAME --query "clientId" -o tsv)
export IDENTITY_RESOURCE_ID=$(az identity show --resource-group $RESOURCE_GROUP --name $MANAGED_IDENTITY_NAME --query "id" -o tsv)
# Grant the Managed Identity the 'AcrPush' role on the Container Registry
# This adheres to the principle of least privilege. It can push images, nothing more.
export ACR_RESOURCE_ID=$(az acr show --resource-group $RESOURCE_GROUP --name $ACR_NAME --query "id" -o tsv)
az role assignment create \
--assignee $IDENTITY_CLIENT_ID \
--scope $ACR_RESOURCE_ID \
--role "AcrPush"
# Grant the Managed Identity 'Azure Kubernetes Service Cluster User Role' on the AKS cluster
# This allows 'az aks get-credentials' without needing admin access.
export AKS_RESOURCE_ID=$(az aks show --resource-group $RESOURCE_GROUP --name $AKS_NAME --query id -o tsv)
az role assignment create \
--assignee $IDENTITY_CLIENT_ID \
--scope $AKS_RESOURCE_ID \
--role "Azure Kubernetes Service Cluster User Role"
# For 'kubectl apply', we need permission to create objects in the cluster.
# 'Azure Kubernetes Service Contributor Role' is often used, but it is too broad.
# A better approach is a custom role or native Kubernetes RBAC. For this
# demonstration, we grant the built-in 'RBAC Writer' role, which requires the
# cluster to have Azure RBAC for Kubernetes authorization enabled.
az role assignment create \
--assignee $IDENTITY_CLIENT_ID \
--scope $AKS_RESOURCE_ID \
--role "Azure Kubernetes Service RBAC Writer"
The final and most crucial step is creating the federated credential. This is the link that tells Azure to trust JWTs from a specific issuer (our Jenkins server) for a specific subject (our Jenkins job).
A common pitfall here is mis-identifying the Jenkins issuer URL and the expected subject claim. The Jenkins OIDC provider exposes its configuration at https://<your-jenkins-url>/.well-known/openid-configuration; the issuer field in the returned JSON is the value Azure needs, character for character. The subject is constructed by Jenkins from the plugin's claim template and typically encodes the job path; for a Jenkins job named pytorch-secure-deploy in the root folder, ours resolved to system:serviceaccount:jenkins:job:pytorch-secure-deploy/. Check your plugin's claim configuration for the exact value, because Azure performs an exact string match on both issuer and subject.
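To confirm the exact iss and sub values your Jenkins instance emits, you can decode the token's payload segment (a JWT is three base64url-encoded parts joined by dots). A stdlib-only sketch, exercised here against a locally constructed throwaway token rather than a real one; `decode_jwt_claims` is a hypothetical debugging helper:

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Decode the claims (payload) segment of a JWT without verifying the
    signature -- fine for debugging, never for authentication."""
    payload_b64 = token.split(".")[1]
    # base64url decoding requires padding to a multiple of 4
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a throwaway, unsigned token carrying the claims Azure will inspect.
claims = {"iss": "https://your-jenkins-instance.com/",
          "sub": "system:serviceaccount:jenkins:job:pytorch-secure-deploy/",
          "aud": "api://AzureADTokenExchange"}
fake_jwt = ".".join([
    base64.urlsafe_b64encode(b'{"alg":"none"}').decode().rstrip("="),
    base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("="),
    "",  # empty signature, for demonstration only
])
decoded = decode_jwt_claims(fake_jwt)
print(decoded["iss"])  # https://your-jenkins-instance.com/
```

Run the same decode against a real token captured from a pipeline run and compare the iss and sub strings byte for byte with the federated credential you configure below; a stray trailing slash is the classic mismatch.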
# Prerequisite: Have your Jenkins URL and Job Name ready
# The Jenkins OIDC plugin must be installed and configured
export JENKINS_ISSUER_URL="https://your-jenkins-instance.com/" # Must match the token's 'iss' claim exactly, including any trailing slash
export JENKINS_SUBJECT="system:serviceaccount:jenkins:job:pytorch-secure-deploy/" # Must match the token's 'sub' claim exactly; check your Jenkins OIDC config
# Note: no Azure AD App Registration is required here. When federating to an
# App Registration you would attach the credential to it with
# 'az ad app federated-credential create', but because our pipeline logs in
# as the User-Assigned Managed Identity directly, the federated credential
# belongs on the managed identity itself.
# Create the federated credential on the Managed Identity.
# This is the critical link: it tells Azure which issuer/subject pair to trust.
az identity federated-credential create \
--name "jenkins-federation-link" \
--identity-name $MANAGED_IDENTITY_NAME \
--resource-group $RESOURCE_GROUP \
--issuer $JENKINS_ISSUER_URL \
--subject $JENKINS_SUBJECT \
--audiences "api://AzureADTokenExchange" # The default, stated explicitly; Azure validates the token's 'aud' claim against it
This completes the Azure side. We’ve created an identity with specific, limited permissions and taught it to trust a token from a designated Jenkins job. The entire configuration is declarative and auditable, a significant improvement over a static secret.
Phase 2: The MLOps Payload and Pipeline Logic
The heart of the pipeline is the code it builds and deploys. We will use a simple sentiment analysis model based on a pre-trained transformer from Hugging Face, served via a Flask API.
PyTorch Application (app/main.py)
import logging

from flask import Flask, request, jsonify
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

app = Flask(__name__)

# A common mistake is to load the model on every request, which is
# inefficient and leads to terrible performance and memory issues.
# The model must be loaded once at application startup.
try:
    logging.info("Initializing sentiment analysis pipeline...")
    # Using a smaller, faster model for demonstration purposes
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    logging.info("Sentiment analysis pipeline initialized successfully.")
except Exception as e:
    logging.critical(f"Failed to load model or pipeline: {e}", exc_info=True)
    # If the model fails to load, the application is useless. We record the
    # failure and surface it through /health so Kubernetes never routes
    # traffic to a broken replica.
    sentiment_pipeline = None

@app.route('/health', methods=['GET'])
def health_check():
    """Health check endpoint for Kubernetes probes."""
    if sentiment_pipeline:
        return jsonify({"status": "ok"}), 200
    return jsonify({"status": "error", "message": "Model not loaded"}), 503

@app.route('/predict', methods=['POST'])
def predict():
    """Prediction endpoint."""
    if not sentiment_pipeline:
        logging.error("Prediction request received but pipeline is not available.")
        return jsonify({"error": "Service is unavailable, model not loaded"}), 503
    # silent=True returns None instead of raising on a missing/invalid JSON body
    payload = request.get_json(silent=True)
    if not payload or 'text' not in payload:
        logging.warning("Invalid prediction request: missing 'text' field.")
        return jsonify({"error": "Invalid input, 'text' field is required"}), 400
    text_to_analyze = payload['text']
    if not isinstance(text_to_analyze, str) or not text_to_analyze.strip():
        logging.warning("Invalid prediction request: 'text' field is empty or not a string.")
        return jsonify({"error": "'text' must be a non-empty string"}), 400
    try:
        logging.info(f"Processing prediction for text: '{text_to_analyze[:50]}...'")
        result = sentiment_pipeline(text_to_analyze)
        logging.info(f"Prediction successful: {result}")
        return jsonify(result)
    except Exception as e:
        logging.error(f"An error occurred during prediction: {e}", exc_info=True)
        return jsonify({"error": "Internal server error during prediction"}), 500

if __name__ == '__main__':
    # In production, run behind a proper WSGI server such as Gunicorn.
    app.run(host='0.0.0.0', port=8000)
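The input checks in /predict are easy to unit-test if factored into a pure function that the endpoint delegates to. A small sketch of that refactoring (`validate_predict_payload` is a hypothetical helper, not part of the app above):

```python
def validate_predict_payload(payload):
    """Return (error_message, http_status); (None, 200) when the payload
    is valid. Mirrors the checks performed inline by the /predict endpoint."""
    if not isinstance(payload, dict) or "text" not in payload:
        return "Invalid input, 'text' field is required", 400
    text = payload["text"]
    if not isinstance(text, str) or not text.strip():
        return "'text' must be a non-empty string", 400
    return None, 200

print(validate_predict_payload({"text": "great movie"}))  # (None, 200)
print(validate_predict_payload({"text": "   "})[1])       # 400
```

Keeping validation out of the route handler lets the pipeline's "Build and Test" stage exercise it with plain pytest, no Flask test client or model download required.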
Containerization (Dockerfile)
# Use a slim, secure base image.
FROM python:3.9-slim
# Set working directory
WORKDIR /app
# Copy requirements and install dependencies
# This layer is cached as long as requirements.txt doesn't change
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code
COPY ./app /app
# Expose the port the app runs on
EXPOSE 8000
# Run the application using Gunicorn for production
# Use multiple workers for better concurrency. The number of workers is a common tuning parameter.
CMD ["gunicorn", "--workers", "2", "--bind", "0.0.0.0:8000", "main:app"]
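The worker count on the CMD line is the main tuning knob. The Gunicorn docs suggest (2 × CPU cores) + 1 as a starting point, but for transformer models memory is usually the binding constraint, since each worker loads its own copy of the model. A quick sketch of the heuristic (`suggested_workers` is an illustrative helper):

```python
import os

def suggested_workers(cpu_count: int) -> int:
    """Gunicorn's rule-of-thumb starting point: (2 * cores) + 1.
    For large models, cap this by available memory / per-worker footprint."""
    return 2 * cpu_count + 1

cores = os.cpu_count() or 1
print(f"suggested workers for this machine: {suggested_workers(cores)}")
```

With a roughly 250 MB model per worker, the 2 Gi container limit in the Kubernetes manifest comfortably accommodates the two workers chosen here.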
Dependencies (requirements.txt)
# Pin exact versions in a real project for reproducible builds
flask
gunicorn
torch
transformers
Finally, the Jenkinsfile orchestrates the entire process. This is where the JWT-based authentication happens.
Jenkins Pipeline (Jenkinsfile)
// Define constants at the top for easy management
final String RESOURCE_GROUP = "pytorch-secure-pipeline-rg"
final String ACR_NAME = "pytorchsecureacr..." // Replace with your actual ACR name
final String AKS_NAME = "pytorch-secure-aks"
final String MANAGED_IDENTITY_CLIENT_ID = "..." // Client ID of the User-Assigned Managed Identity
final String AZURE_TENANT_ID = "..." // Your Azure Tenant ID
pipeline {
    agent any
    environment {
        // Use the Jenkins build number for image tagging
        IMAGE_TAG = "${ACR_NAME}.azurecr.io/sentiment-analyzer:${BUILD_NUMBER}"
        AZURE_CONFIG_DIR = "${env.WORKSPACE}/.azure" // Isolate Azure CLI config per workspace
    }
    stages {
        stage('Build and Test') {
            steps {
                script {
                    echo "Building PyTorch application container..."
                    // In a real pipeline, run unit tests here, e.g. sh 'pytest tests/'
                    docker.build(IMAGE_TAG, '.')
                }
            }
        }
        stage('Secure Azure Login & Push Image') {
            steps {
                script {
                    echo "Attempting secure, passwordless login to Azure..."
                    // This is the core of the passwordless flow: we request an
                    // OIDC token (JWT) from Jenkins itself via the OIDC provider
                    // plugin's credential. The token's 'aud' claim must match the
                    // federated credential's audience (api://AzureADTokenExchange
                    // by default).
                    //
                    // Keep the sh step INSIDE withCredentials and reference the
                    // token as a shell environment variable (\$OIDC_TOKEN_JWT)
                    // rather than Groovy-interpolating it: interpolation would
                    // leak the secret into the build log and process list.
                    withCredentials([string(credentialsId: 'jenkins-oidc-token', variable: 'OIDC_TOKEN_JWT')]) {
                        // A common debugging step is to decode the JWT payload and
                        // confirm its 'iss' and 'sub' claims match the federated
                        // credential configured in Azure, e.g.:
                        // sh 'echo "$OIDC_TOKEN_JWT" | cut -d. -f2 | base64 -d'
                        //
                        // 'az login' exchanges the federated token for a short-lived
                        // Azure access token; no secrets are ever stored. Jenkins'
                        // sh step aborts the stage on any non-zero exit status, so
                        // explicit error checks are unnecessary.
                        sh """
                            az login --service-principal -u ${MANAGED_IDENTITY_CLIENT_ID} -t ${AZURE_TENANT_ID} --federated-token "\$OIDC_TOKEN_JWT"
                            echo "Successfully logged into Azure using federated identity."
                            az acr login --name ${ACR_NAME}
                            echo "Successfully authenticated with Azure Container Registry."
                        """
                    }
                    echo "Pushing Docker image to ACR..."
                    docker.image(IMAGE_TAG).push()
                }
            }
        }
        stage('Deploy to AKS') {
            steps {
                script {
                    echo "Deploying application to AKS..."
                    // The Azure CLI login is still active in the agent's session,
                    // so the following commands are authenticated. For clusters
                    // with AAD integration enabled, 'kubelogin' (installed on the
                    // agent) converts the kubeconfig to reuse the Azure CLI's
                    // token for non-interactive kubectl authentication.
                    sh """
                        az aks get-credentials --resource-group ${RESOURCE_GROUP} --name ${AKS_NAME} --overwrite-existing
                        kubelogin convert-kubeconfig -l azurecli
                    """
                    // Use a templated Kubernetes manifest so the build-specific
                    // image tag can be injected.
                    sh "sed 's|IMAGE_PLACEHOLDER|${IMAGE_TAG}|g' kubernetes/deployment.yaml | kubectl apply -f -"
                    sh "kubectl apply -f kubernetes/service.yaml"
                    // Wait for the rollout; a production pipeline should add
                    // deeper health checks here to enable automated rollback.
                    sh "kubectl rollout status deployment/sentiment-analyzer-deployment"
                }
            }
        }
}
    post {
        always {
            // Clean up the workspace and log out of Azure so no credentials linger on the agent.
            sh "az logout || true"
            sh "az account clear || true"
            deleteDir()
        }
    }
}
This pipeline explicitly uses withCredentials to fetch a JWT from the Jenkins OIDC provider, then passes the token directly to the az login command via the --federated-token flag. The successful execution of that command proves the entire trust chain is working. From there, subsequent az commands and kubectl (after az aks get-credentials) are seamlessly authenticated.
Phase 3: Kubernetes Deployment Manifests
The final piece is the Kubernetes objects that define how our application runs in the AKS cluster.
Deployment (kubernetes/deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-analyzer-deployment
  labels:
    app: sentiment-analyzer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-analyzer
  template:
    metadata:
      labels:
        app: sentiment-analyzer
    spec:
      containers:
        - name: sentiment-analyzer
          image: IMAGE_PLACEHOLDER # This is replaced by the Jenkins pipeline
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "1"
              memory: "2Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 5
Service (kubernetes/service.yaml)
apiVersion: v1
kind: Service
metadata:
  name: sentiment-analyzer-service
spec:
  type: LoadBalancer # For external access in a demo; use Ingress in production
  selector:
    app: sentiment-analyzer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
The Jenkins pipeline replaces IMAGE_PLACEHOLDER with the dynamically tagged image and applies these manifests. The result is a running, externally accessible PyTorch model endpoint, deployed without a single static secret being stored in the CI system.
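One caveat with the sed substitution: it succeeds silently even when the placeholder is absent, for instance after a manifest refactor, leaving a literal IMAGE_PLACEHOLDER in the cluster spec. A stricter stand-in, sketched in Python with a hypothetical `render_manifest` helper:

```python
def render_manifest(template: str, image: str, placeholder: str = "IMAGE_PLACEHOLDER") -> str:
    """Replace the image placeholder in a manifest, failing loudly if it
    is missing so a refactored manifest cannot ship a stale image."""
    if placeholder not in template:
        raise ValueError(f"{placeholder} not found in manifest")
    return template.replace(placeholder, image)

manifest = "image: IMAGE_PLACEHOLDER # replaced at deploy time"
print(render_manifest(manifest, "myacr.azurecr.io/sentiment-analyzer:42"))
```

Failing the stage at render time is far cheaper than debugging an ImagePullBackOff in the cluster afterwards.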
The shift to a JWT-based federated identity model is not merely a technical exercise; it's a fundamental improvement in security posture. The process eliminates an entire class of vulnerabilities associated with credential leakage. However, this architecture is not without its own trade-offs. The tight coupling between the Azure federated credential configuration and the Jenkins job name (the subject claim) means that renaming a Jenkins job is now a breaking change that requires an infrastructure update. Furthermore, while the pipeline is secure, the security of the Jenkins master itself becomes even more paramount, as its ability to issue trusted JWTs is the root of this entire security model. Future work would involve managing the Azure identity configurations via Terraform to create "Identity as Code" and developing more sophisticated health and performance checks within the deployment stage to enable automated rollbacks, moving from continuous delivery to true continuous deployment.