The initial system architecture was straightforward but fundamentally flawed for our latency requirements. An ASP.NET Core backend handled business logic and exposed a REST API, which in turn called a Python Flask service wrapping a PyTorch model for real-time classification. The problem manifested under even moderate load: response times from the API endpoint involving the model call would skyrocket from a baseline of 50ms to over 800ms. Profiling pointed to two primary culprits: JSON serialization/deserialization overhead for the tensor data being sent between the C# and Python processes, and the repetitive, expensive computation for frequently recurring input vectors. The frontend, a simple Lit-based component, was left displaying a loading spinner for an unacceptably long duration. This was not a scalable or production-ready solution.
Our objective became clear: surgically replace the inefficient REST/JSON communication layer and introduce an aggressive caching strategy to shield the PyTorch model from redundant work. The core backend in ASP.NET Core had to remain, as did the PyTorch model, but the bridge between them and the interaction pattern needed a complete overhaul. We settled on gRPC for its high-performance binary serialization with Protocol Buffers and Memcached for its raw, no-frills speed as an in-memory cache. The entire stack would be deployed on Oracle Cloud Infrastructure (OCI), our standard platform.
This isn’t a theoretical exercise. What follows is the build log of that refactoring effort, detailing the gRPC contract design, the implementation of the polyglot services, the caching logic, and the deployment topology on OCI.
Defining the Communication Contract with Protobuf
The first step was to eliminate the ambiguity and overhead of JSON. A Protobuf contract establishes a rigid, efficient schema for communication between the ASP.NET Core client and the Python gRPC server. In a real-world project, getting this contract right is critical because changing it later requires coordinated deployments.
The core of our contract needed to represent a multi-dimensional tensor for the request and a set of class probabilities for the response.
protos/inference.proto:
syntax = "proto3";
package inference;
// The service definition for the PyTorch model server.
service InferenceService {
// Performs a classification inference on the input tensor.
rpc Classify (InferenceRequest) returns (InferenceResponse);
}
// Represents the input data for the model.
// For simplicity, we're using a flattened list of floats and a shape.
// In a more complex scenario, this could be a bytes field with a serialized tensor.
message Tensor {
repeated int64 shape = 1;
repeated float data = 2;
}
// The request message containing the tensor to be classified.
message InferenceRequest {
string model_id = 1; // To support multiple models in the future
Tensor input_tensor = 2;
}
// A single classification result with its confidence score.
message Classification {
string label = 1;
float confidence = 2;
}
// The response message containing the classification results.
message InferenceResponse {
repeated Classification predictions = 1;
}
A key decision here was how to represent the tensor. We opted for a flattened repeated float field plus a shape field. This is human-readable and easy to work with in both C# and Python. The pitfall is performance for extremely large tensors; for those cases, serializing the tensor to a bytes field using a library like NumPy’s tobytes() and then deserializing it on the other end can be more efficient, though it sacrifices some readability in the .proto file itself.
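For reference, a rough Python sketch of that bytes-based approach is below. The raw_data field it implies is hypothetical and not part of the contract above; the sketch assumes float32 values and C-contiguous layout.
# Sketch: packing/unpacking a tensor through a hypothetical `bytes raw_data` field
# instead of `repeated float data`. Assumes float32 values and C-contiguous layout.
import numpy as np

def tensor_to_bytes(arr: np.ndarray) -> bytes:
    # tobytes() emits the raw buffer; the receiver must already know dtype and shape.
    return np.ascontiguousarray(arr, dtype=np.float32).tobytes()

def bytes_to_tensor(raw: bytes, shape: list[int]) -> np.ndarray:
    # frombuffer avoids copying the payload; reshape restores the dimensions.
    return np.frombuffer(raw, dtype=np.float32).reshape(shape)

# Round trip
original = np.random.rand(1, 1, 28, 28).astype(np.float32)
restored = bytes_to_tensor(tensor_to_bytes(original), list(original.shape))
assert np.array_equal(original, restored)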
The Python gRPC Server: Serving the PyTorch Model
With the contract defined, the next task was to build the Python server. This service is a dedicated, single-purpose process: load a pre-trained PyTorch model and expose it via the gRPC interface we just defined.
We need the following dependencies: grpcio, grpcio-tools, torch, and numpy.
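With those installed, the Python stubs imported below (inference_pb2 and inference_pb2_grpc) are generated from the contract; assuming the .proto file lives in a protos/ directory, the commands look like this:
pip install grpcio grpcio-tools torch numpy
python -m grpc_tools.protoc -I./protos --python_out=. --grpc_python_out=. ./protos/inference.proto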
The server implementation involves several key parts:
- Model Loading: The model is loaded once at startup to avoid filesystem I/O on every request.
- gRPC Service Implementation: A class that inherits from the generated InferenceServiceServicer and implements the Classify method.
- Data Transformation: Logic to convert the incoming Protobuf Tensor message into a PyTorch tensor, and the model’s output tensor back into a Protobuf InferenceResponse.
Here’s the core server code.
server.py:
import grpc
import torch
import torch.nn.functional as F
import numpy as np
from concurrent import futures
import logging
# Import generated gRPC files
import inference_pb2
import inference_pb2_grpc
# --- Configuration ---
_PORT = "50051"
_MODEL_PATH = "./model/mnist_cnn.pt" # Path to the pre-trained model
_CLASS_LABELS = [str(i) for i in range(10)] # MNIST labels 0-9
_MAX_WORKERS = 10
# --- Logging Setup ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
class InferenceServiceImpl(inference_pb2_grpc.InferenceServiceServicer):
"""
Implements the gRPC InferenceService. Loads a PyTorch model and uses it for predictions.
"""
def __init__(self):
try:
self.device = "cuda" if torch.cuda.is_available() else "cpu"
logging.info(f"Using device: {self.device}")
# In a real system, you might have a more sophisticated model loading mechanism.
# For this example, we assume a ScriptModule saved with torch.jit.script.
self.model = torch.jit.load(_MODEL_PATH).to(self.device)
self.model.eval() # Set the model to evaluation mode
logging.info(f"Model loaded successfully from {_MODEL_PATH}")
except Exception as e:
logging.error(f"Failed to load model: {e}")
raise
def Classify(self, request, context):
"""
Handles an inference request.
"""
try:
logging.info(f"Received classification request for model '{request.model_id}'")
# 1. Convert Protobuf message to PyTorch Tensor
input_tensor = self._proto_to_tensor(request.input_tensor)
# 2. Perform inference
with torch.no_grad(): # Disable gradient calculation for efficiency
output = self.model(input_tensor)
probabilities = F.softmax(output, dim=1)
top_prob, top_catid = torch.topk(probabilities, 1)
# 3. Format the response
response = inference_pb2.InferenceResponse()
# For this example, we return the top prediction.
# A real application might return top-k results.
prediction = response.predictions.add()
prediction.label = _CLASS_LABELS[top_catid.item()]
prediction.confidence = top_prob.item()
logging.info(f"Prediction complete. Label: {prediction.label}, Confidence: {prediction.confidence:.4f}")
return response
except Exception as e:
logging.error(f"Error during inference: {e}")
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"An internal error occurred: {e}")
return inference_pb2.InferenceResponse()
def _proto_to_tensor(self, tensor_proto):
"""
Converts a Protobuf Tensor message to a PyTorch tensor.
A common mistake is mishandling the shape and data type.
"""
if not tensor_proto.shape or not tensor_proto.data:
raise ValueError("Input tensor proto is missing shape or data.")
# Reshape the flattened data array into the specified tensor shape
np_array = np.array(tensor_proto.data, dtype=np.float32)
reshaped_array = np_array.reshape(tensor_proto.shape)
# Convert NumPy array to PyTorch tensor and move to the correct device
return torch.from_numpy(reshaped_array).to(self.device)
def serve():
"""
Starts the gRPC server.
"""
server = grpc.server(futures.ThreadPoolExecutor(max_workers=_MAX_WORKERS))
inference_pb2_grpc.add_InferenceServiceServicer_to_server(InferenceServiceImpl(), server)
server.add_insecure_port(f'[::]:{_PORT}')
logging.info(f"gRPC server starting on port {_PORT}...")
server.start()
server.wait_for_termination()
if __name__ == '__main__':
serve()
The critical part here is the _proto_to_tensor function. Mismatched shapes or data types between the client and server are a common source of bugs in polyglot systems. The with torch.no_grad(): block is also essential for performance, as it tells PyTorch not to track gradients, significantly reducing memory consumption and computation time during inference.
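Before wiring up the C# client, it’s worth smoke-testing the server in isolation. A minimal client sketch, assuming the server is running locally on port 50051 and the stubs were generated as above:
# quick_client.py — illustrative smoke test with a random 28x28 input.
import grpc
import numpy as np

import inference_pb2
import inference_pb2_grpc

def main():
    channel = grpc.insecure_channel("localhost:50051")
    stub = inference_pb2_grpc.InferenceServiceStub(channel)

    dummy = np.random.rand(1, 1, 28, 28).astype(np.float32)
    request = inference_pb2.InferenceRequest(
        model_id="mnist-cnn-v1",
        input_tensor=inference_pb2.Tensor(
            shape=list(dummy.shape),
            data=dummy.flatten().tolist(),
        ),
    )
    response = stub.Classify(request, timeout=5.0)
    for p in response.predictions:
        print(f"label={p.label} confidence={p.confidence:.4f}")

if __name__ == "__main__":
    main()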
The ASP.NET Core Backend: Caching and Orchestration
The ASP.NET Core application now acts as the smart orchestrator. It exposes the public-facing API, manages the caching layer, and communicates with the Python gRPC service only when necessary.
First, we set up the project to consume the .proto file and connect to Memcached.
InferenceApi.csproj:
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<TargetFramework>net8.0</TargetFramework>
<Nullable>enable</Nullable>
<ImplicitUsings>enable</ImplicitUsings>
</PropertyGroup>
<ItemGroup>
<Protobuf Include="Protos\inference.proto" GrpcServices="Client" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="Grpc.AspNetCore" Version="2.60.0" />
<PackageReference Include="Grpc.Net.Client" Version="2.60.0" />
<PackageReference Include="Grpc.Tools" Version="2.60.0">
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
<PrivateAssets>all</PrivateAssets>
</PackageReference>
<!-- Memcached provider for IDistributedCache -->
<PackageReference Include="EnyimMemcachedCore" Version="2.5.3" />
</ItemGroup>
</Project>
Next, we configure the services in Program.cs. This includes registering the gRPC client and the Memcached distributed cache.
Program.cs:
using Enyim.Caching.Configuration;
using InferenceApi.Services;
using System.Security.Cryptography;
using System.Text;
var builder = WebApplication.CreateBuilder(args);
// --- Configuration ---
var grpcConfig = builder.Configuration.GetSection("GrpcInferenceService");
var memcachedConfig = builder.Configuration.GetSection("Memcached");
// Add services to the container.
builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();
// --- gRPC Client Registration ---
builder.Services.AddGrpcClient<Inference.InferenceService.InferenceServiceClient>(o =>
{
o.Address = new Uri(grpcConfig["Address"] ??
throw new InvalidOperationException("gRPC service address not configured."));
});
// --- Memcached Registration ---
builder.Services.AddEnyimMemcached(o => o.AddServer(
memcachedConfig["Host"] ?? "localhost",
int.Parse(memcachedConfig["Port"] ?? "11211")));
builder.Services.AddSingleton<Microsoft.Extensions.Caching.Distributed.IDistributedCache,
Enyim.Caching.Memcached.MemcachedDistributedCache>();
// --- Register our custom service that orchestrates the logic ---
builder.Services.AddScoped<OrchestrationService>();
var app = builder.Build();
// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
app.UseSwagger();
app.UseSwaggerUI();
}
app.UseAuthorization();
app.MapControllers();
app.Run();
The configuration in appsettings.json wires everything up.
appsettings.json:
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft.AspNetCore": "Warning"
}
},
"AllowedHosts": "*",
"GrpcInferenceService": {
"Address": "http://localhost:50051" // Address of the Python gRPC server
},
"Memcached": {
"Host": "localhost", // Address of the Memcached instance
"Port": "11211"
}
}
The core logic resides in the OrchestrationService and the InferenceController. The service implements the cache-aside pattern.
Services/OrchestrationService.cs:
using Google.Protobuf;
using Grpc.Core;
using Inference;
using Microsoft.Extensions.Caching.Distributed;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;
namespace InferenceApi.Services;
public class OrchestrationService
{
private readonly InferenceService.InferenceServiceClient _grpcClient;
private readonly IDistributedCache _cache;
private readonly ILogger<OrchestrationService> _logger;
public OrchestrationService(
InferenceService.InferenceServiceClient grpcClient,
IDistributedCache cache,
ILogger<OrchestrationService> logger)
{
_grpcClient = grpcClient;
_cache = cache;
_logger = logger;
}
public async Task<InferenceResponse?> GetClassificationAsync(InferenceRequest request)
{
var cacheKey = GenerateCacheKey(request);
// 1. Check cache first
// Protobuf-generated types don't round-trip reliably through System.Text.Json
// (their repeated fields are read-only), so we cache the protobuf binary form.
var cachedBytes = await _cache.GetAsync(cacheKey);
if (cachedBytes is { Length: > 0 })
{
_logger.LogInformation("Cache HIT for key: {CacheKey}", cacheKey);
return InferenceResponse.Parser.ParseFrom(cachedBytes);
}
_logger.LogInformation("Cache MISS for key: {CacheKey}. Calling gRPC service.", cacheKey);
// 2. On cache miss, call the gRPC service
try
{
var response = await _grpcClient.ClassifyAsync(request);
if (response != null)
{
// 3. Store the result in Memcached
var cacheOptions = new DistributedCacheEntryOptions
{
// In a production system, this TTL should be carefully chosen.
AbsoluteExpirationRelativeToNow = TimeSpan.FromMinutes(5)
};
// Store the compact protobuf wire format; it deserializes back with full fidelity.
await _cache.SetAsync(cacheKey, response.ToByteArray(), cacheOptions);
}
return response;
}
catch (RpcException ex)
{
_logger.LogError(ex, "gRPC call failed. Status: {Status}, Detail: {Detail}", ex.StatusCode, ex.Status.Detail);
// Propagate the error or handle it as per business requirements
return null;
}
}
private string GenerateCacheKey(InferenceRequest request)
{
// A common mistake is to create a non-deterministic cache key.
// It must be stable for the same logical input.
// Hashing the entire serialized request is a robust way to achieve this.
using var sha256 = SHA256.Create();
var requestBytes = request.ToByteArray();
var hashBytes = sha256.ComputeHash(requestBytes);
// Using a prefix helps organize keys in the cache.
return $"inference:{Convert.ToHexString(hashBytes)}";
}
}
The GenerateCacheKey method is more important than it looks. A naive key based on request.ToString() would be fragile, since the text representation is meant for debugging rather than as a stable identifier. Serializing the entire protobuf message to bytes and hashing it provides a deterministic and collision-resistant key. This is a production-grade approach.
Finally, the InferenceController exposes this logic via a standard REST endpoint for the frontend to consume.
Controllers/InferenceController.cs:
using Inference;
using InferenceApi.Services;
using Microsoft.AspNetCore.Mvc;
namespace InferenceApi.Controllers;
[ApiController]
[Route("[controller]")]
public class InferenceController : ControllerBase
{
private readonly OrchestrationService _orchestrationService;
private readonly ILogger<InferenceController> _logger;
public InferenceController(OrchestrationService orchestrationService, ILogger<InferenceController> logger)
{
_orchestrationService = orchestrationService;
_logger = logger;
}
[HttpPost("classify")]
public async Task<IActionResult> Classify([FromBody] float[] inputData)
{
// Basic input validation
if (inputData == null || inputData.Length != 784) // 28x28 for MNIST
{
return BadRequest("Input data must be a flat array of 784 floats.");
}
// Construct the gRPC request message from the HTTP request body
var request = new InferenceRequest
{
ModelId = "mnist-cnn-v1",
InputTensor = new Tensor
{
Shape = { 1, 1, 28, 28 }, // Batch size 1, 1 channel, 28x28 pixels
}
};
request.InputTensor.Data.AddRange(inputData);
var result = await _orchestrationService.GetClassificationAsync(request);
if (result == null)
{
return StatusCode(503, "Inference service is unavailable or failed.");
}
return Ok(result);
}
}
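With both services (and Memcached) running, the whole path can be exercised end to end with a short script; the API port below is an assumption, so match it to your launch profile. Posting the same payload twice should show a cache MISS followed by a cache HIT in the API logs.
# Sketch: POST a random 784-float payload to the REST endpoint, twice.
import json
import random
import urllib.request

payload = [random.random() for _ in range(784)]
req = urllib.request.Request(
    "http://localhost:5000/inference/classify",  # adjust host/port as needed
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Second request with the identical payload should be served from Memcached.
for attempt in range(2):
    with urllib.request.urlopen(req) as resp:
        print(attempt, resp.status, resp.read().decode("utf-8"))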
The Lit Frontend: A Lightweight Display Component
The frontend component’s role is simple: provide an interface to send data to the backend and display the result. Lit is excellent for this because it produces a standard, interoperable Web Component with minimal boilerplate. The component below uses Lit’s decorators and the private modifier, so it is written as TypeScript.
inference-form.ts:
import { LitElement, html, css } from 'lit';
import { customElement, state } from 'lit/decorators.js';
@customElement('inference-form')
export class InferenceForm extends LitElement {
static styles = css`
:host {
display: block;
font-family: sans-serif;
border: 1px solid #ccc;
padding: 16px;
max-width: 500px;
}
textarea {
width: 100%;
min-height: 100px;
margin-bottom: 12px;
}
button {
padding: 8px 16px;
cursor: pointer;
}
.result {
margin-top: 16px;
padding: 12px;
background-color: #f0f0f0;
border-left: 4px solid #007bff;
}
.error {
border-left-color: #dc3545;
color: #dc3545;
}
`;
@state()
private _isLoading = false;
@state()
private _result = null;
@state()
private _error = null;
private _textAreaRef = null;
firstUpdated() {
this._textAreaRef = this.shadowRoot.querySelector('textarea');
}
async _classify() {
if (!this._textAreaRef.value) {
this._error = 'Input data cannot be empty.';
return;
}
let inputData;
try {
// Expecting a JSON array of numbers
inputData = JSON.parse(this._textAreaRef.value);
if (!Array.isArray(inputData) || inputData.length !== 784) {
throw new Error("Input must be an array of 784 numbers.");
}
} catch (e) {
this._error = `Invalid input format: ${e.message}`;
this._result = null;
return;
}
this._isLoading = true;
this._result = null;
this._error = null;
try {
const response = await fetch('/inference/classify', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(inputData),
});
if (!response.ok) {
throw new Error(`API Error: ${response.status} ${response.statusText}`);
}
const data = await response.json();
this._result = data;
} catch (e) {
this._error = e.message;
} finally {
this._isLoading = false;
}
}
render() {
return html`
<h3>Real-time Classifier</h3>
<p>Paste a flattened 28x28 (784 floats) image tensor below.</p>
<textarea placeholder="[0.0, 0.1, ..., 0.9]"></textarea>
<button @click=${this._classify} ?disabled=${this._isLoading}>
${this._isLoading ? 'Classifying...' : 'Classify'}
</button>
${this._result ? html`
<div class="result">
<strong>Prediction:</strong> ${this._result.predictions[0]?.label ?? 'N/A'}<br>
<strong>Confidence:</strong> ${this._result.predictions[0]?.confidence.toFixed(4) ?? 'N/A'}
</div>
` : ''}
${this._error ? html`
<div class="result error">
<strong>Error:</strong> ${this._error}
</div>
` : ''}
`;
}
}
This component is self-contained and handles its own state for loading and errors. It communicates with our ASP.NET Core backend, completely unaware of the gRPC and PyTorch machinery running behind the scenes.
OCI Deployment Architecture
Deploying this polyglot system on OCI requires coordinating several components. A minimal production-viable setup would look like this:
graph TD
    subgraph User Browser
        A[Lit Component]
    end
    subgraph OCI Network
        LB[OCI Load Balancer]
        subgraph VCN
            subgraph Public Subnet
                API_Instance[Compute E4 Flex Instance]
            end
            subgraph Private Subnet
                Model_Instance[GPU Compute Instance]
                Memcached_Instance[OCI Cache with Memcached]
            end
        end
    end
    A -- HTTPS --> LB
    LB -- HTTP --> API_Instance
    API_Instance -- Caches --> Memcached_Instance
    API_Instance -- "gRPC (Port 50051)" --> Model_Instance
    subgraph API_Instance
        B[ASP.NET Core App]
    end
    subgraph Model_Instance
        C[Python gRPC Server + PyTorch]
    end
Key Deployment Considerations:
- Network Security: The ASP.NET Core instance in the public subnet is exposed via a load balancer. The PyTorch GPU instance and the Memcached service are placed in a private subnet. Network Security Groups (NSGs) are crucial here. One NSG would allow public traffic on port 443 to the load balancer. Another would allow traffic from the API instance’s subnet to the model instance on port 50051 (for gRPC) and to the Memcached instance on port 11211. All other traffic should be denied by default.
- Configuration Management: The addresses for the gRPC service and Memcached should not be hardcoded. They should be passed to the ASP.NET Core application via environment variables or OCI’s configuration service and read through IConfiguration (see the example after this list).
- Scalability: The ASP.NET Core tier can be scaled horizontally in an instance pool behind the load balancer. Scaling the PyTorch service is more complex. If one GPU instance is insufficient, you could deploy multiple instances and place another internal load balancer in front of them, which the ASP.NET Core service would then target.
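As a concrete illustration (the addresses are placeholders for the private-subnet endpoints), ASP.NET Core’s environment variable provider maps a double underscore to the configuration separator, so values like these override the appsettings.json entries shown earlier:
export GrpcInferenceService__Address="http://10.0.2.15:50051"
export Memcached__Host="10.0.2.20"
export Memcached__Port="11211"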
This architecture solves our initial problem. gRPC drastically reduces the serialization tax. The Memcached layer absorbs a significant portion of the load, preventing the expensive GPU-bound computation from running unnecessarily. The result is a system that is both faster and more efficient.
Lingering Issues and Future Iterations
This solution, while effective, is not without its own set of trade-offs and potential improvements. The cache key generation, while robust, hashes the entire payload on every request. For very large tensors, this hashing itself could become a non-trivial CPU cost on the API server. A more advanced approach might involve using a faster, non-cryptographic hash like MurmurHash3 or only hashing a stable subset of the input if domain knowledge allows.
Furthermore, the Python gRPC server’s ThreadPoolExecutor offers concurrency but is still bound by Python’s Global Interpreter Lock (GIL) for CPU-bound tasks. While our workload is largely I/O-bound (waiting on the GPU), a very high request rate could still saturate the single Python process. A potential next step would be to run several server processes that share the same port via SO_REUSEPORT (sketched below), or to scale horizontally by running multiple Python server instances managed by a process manager or a container orchestrator like Kubernetes on OKE (Oracle Kubernetes Engine).
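A minimal sketch of that multi-process option, adapted from the serve() function in server.py; every worker loads its own copy of the model, and SO_REUSEPORT support depends on the platform (Linux builds of grpcio enable it).
# multiprocess_server.py — illustrative only: N worker processes, one shared port.
import multiprocessing
from concurrent import futures

import grpc

import inference_pb2_grpc
from server import InferenceServiceImpl  # the servicer class defined earlier

_PORT = "50051"
_PROCESS_COUNT = 4  # tune to available CPU/GPU headroom

def _run_worker():
    # Each process builds its own server (and therefore its own model instance).
    options = (("grpc.so_reuseport", 1),)
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4), options=options)
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(InferenceServiceImpl(), server)
    server.add_insecure_port(f"[::]:{_PORT}")
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    # Spawn (rather than fork) to avoid inheriting any gRPC state from the parent.
    ctx = multiprocessing.get_context("spawn")
    workers = [ctx.Process(target=_run_worker) for _ in range(_PROCESS_COUNT)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()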
Finally, Memcached is a volatile cache. A service restart clears it completely, leading to a “thundering herd” problem where a storm of requests suddenly hits the PyTorch model server. For applications requiring cache warmth, pre-populating the cache on startup or using a cache with persistence options like Redis might be a better, albeit more complex, choice.