Implementing a Resumable UI for Real-Time TensorFlow Inference with an APISIX Edge Gateway


The core business requirement was deceptively simple: deliver hyper-personalized product descriptions to users in real time. The non-functional requirement, however, was brutal: the entire experience, from page load to displaying the generated text, had to feel instantaneous, even on poor network conditions. Our existing stack, a monolithic React SPA making direct calls to a Python backend, was posting Largest Contentful Paint (LCP) times north of four seconds. The client-side hydration process, coupled with the latency of the initial inference request, was killing user engagement. This necessitated a complete architectural rethink, moving away from a monolithic frontend and toward a distributed, edge-first model.

Our initial hypothesis was that the problem could be broken down into two distinct domains: an ultra-fast frontend rendering mechanism and a highly efficient, manageable service layer for handling ML inference. The two had to be decoupled but seamlessly integrated.

Technology Selection Rationale

In a real-world project, technology choices are driven by constraints and specific problems, not by hype.

  1. Backend: Python + TensorFlow. This was the least flexible constraint. Our data science team operates exclusively in the Python ecosystem. The text generation models, built on a fine-tuned variant of a transformer architecture, were already implemented in TensorFlow. Migrating them to another framework like PyTorch or, worse, another language like Rust, was a non-starter due to team skillset and time-to-market pressure. Our task was to build a robust service layer around this existing core competency.

  2. API Gateway: Apache APISIX. A simple reverse proxy like Nginx wouldn’t suffice. We needed a control plane at the edge to handle authentication, rate-limiting, and, most importantly, request/response transformation without burdening the core ML service. The ML service’s contract needed to be stable. APISIX’s plugin architecture was the key attraction. Specifically, its support for external plugins, including a Python plugin runner, meant our platform team could write complex orchestration logic in a language they were already comfortable with, creating a “smart edge” that decouples clients from the backend.

  3. Frontend Shell: Qwik. This was the most radical choice. The core issue with our previous SPA was the hydration tax. We investigated server-side rendering (SSR), but that only solves the initial paint; interactivity is still blocked until the JavaScript is downloaded, parsed, and executed. Qwik’s concept of “resumability” was the solution we targeted. By serializing the application’s state and component listeners into the HTML, Qwik bypasses the hydration step entirely, leading to a near-instant Time to Interactive (TTI). For a feature so dependent on immediate user feedback, this was a potential game-changer.

  4. Frontend Component: Lit. A common mistake is to assume a single framework must solve all problems. While Qwik would manage the application shell and routing, the actual component displaying the generated text needed to be a self-contained, standards-based Web Component. Lit was the pragmatic choice. It’s lightweight, produces standard custom elements, and has no complex runtime. This allows the component to be versioned and deployed independently, and even reused in other, older parts of our application portfolio that aren’t built with Qwik. This composition strategy minimized risk and maximized reusability.

Architecture Overview

The final data flow is orchestrated through distinct, specialized layers. Each layer has a single responsibility, which is critical for maintenance and independent scaling.

sequenceDiagram
    participant User as Browser (Qwik + Lit)
    participant Edge as Apache APISIX
    participant Plugin as APISIX Python Plugin
    participant Backend as Python Inference Service (FastAPI + TensorFlow)

    User->>+Edge: POST /api/v1/generate (payload: {context}, header: Authorization Bearer JWT)
    Edge->>+Plugin: Forward request for processing
    Plugin-->>Plugin: 1. Validate JWT token
    Plugin-->>Plugin: 2. Reshape payload to backend format
    Plugin->>+Backend: POST /infer (payload: {user_id, prompt_data})
    Backend-->>Backend: 3. Load/Cache TensorFlow Model
    Backend-->>Backend: 4. Perform inference
    Backend-->>-Plugin: Return {generated_text}
    Plugin-->>-Edge: Forward response
    Edge-->>-User: Return {generated_text: "..."}

This architecture isolates the concerns: the client knows nothing about the backend’s data model, the backend knows nothing about authentication, and APISIX acts as the central orchestration point.
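
To make the decoupling concrete, here is a minimal sketch (illustrative only, not production code) of the two payload contracts involved: the public body the browser sends to /api/v1/generate, and the internal body the edge forwards to /infer. The field names come from the sequence diagram above; the mapping function itself is just a stand-in for the APISIX plugin shown later.

from typing import TypedDict

class PublicGenerateRequest(TypedDict):
    """Body the browser POSTs to /api/v1/generate (identity travels in the Authorization header)."""
    context: str

class InternalInferRequest(TypedDict):
    """Body the edge forwards to the inference service's /infer endpoint."""
    user_id: str
    prompt_data: str

def to_internal(user_id: str, public: PublicGenerateRequest) -> InternalInferRequest:
    # Purely illustrative: the real transformation lives in the APISIX Python plugin.
    prompt = f"Generate a product description based on this user behavior: {public['context']}"
    return {"user_id": user_id, "prompt_data": prompt}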

Implementation Deep Dive: The Python Inference Service

The backend service must be robust. It’s not just a script that calls model.predict(). It needs proper configuration, logging, and a well-defined startup and shutdown lifecycle to manage the expensive TensorFlow model. We used FastAPI for its speed and automatic OpenAPI documentation.

File: inference_server/main.py

import os
import logging
from contextlib import asynccontextmanager  # required for the FastAPI lifespan handler below

import tensorflow as tf  # noqa: F401 -- surfaces a broken TensorFlow install at import time
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

# --- Configuration Management ---
# In a real-world project, this would come from a more robust system
# like HashiCorp Vault or environment variables managed by Kubernetes.
MODEL_NAME = os.getenv("MODEL_NAME", "gpt2")
MODEL_PATH = os.getenv("MODEL_PATH", "./model_cache")
SERVER_HOST = os.getenv("SERVER_HOST", "0.0.0.0")
SERVER_PORT = int(os.getenv("SERVER_PORT", "8000"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()

# --- Logging Setup ---
logging.basicConfig(
    level=LOG_LEVEL,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# --- Pydantic Models for API Contract ---
class InferenceRequest(BaseModel):
    user_id: str = Field(..., description="Unique identifier for the user.")
    prompt_data: str = Field(..., max_length=512, description="Contextual prompt for text generation.")

class InferenceResponse(BaseModel):
    generated_text: str

# --- Application State ---
# Use a dictionary to hold application state, including the model and tokenizer.
# This prevents loading the model on every request.
app_state = {}

# --- FastAPI Lifecycle Events ---
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: Load the model into memory. This is a blocking operation
    # and will delay server start, but it prevents cold start latency on first request.
    logger.info(f"Loading tokenizer '{MODEL_NAME}' from path '{MODEL_PATH}'...")
    try:
        app_state["tokenizer"] = GPT2Tokenizer.from_pretrained(MODEL_NAME, cache_dir=MODEL_PATH)
        logger.info(f"Loading model '{MODEL_NAME}' from path '{MODEL_PATH}'...")
        # A common mistake is not handling the pad_token_id for open-ended generation.
        if app_state["tokenizer"].pad_token is None:
            app_state["tokenizer"].pad_token = app_state["tokenizer"].eos_token

        app_state["model"] = TFGPT2LMHeadModel.from_pretrained(MODEL_NAME, pad_token_id=app_state["tokenizer"].eos_token_id, cache_dir=MODEL_PATH)
        logger.info("Model and tokenizer loaded successfully.")
    except Exception as e:
        logger.critical(f"Failed to load model: {e}", exc_info=True)
        # Fail fast if the core component can't be loaded.
        raise RuntimeError("Model loading failed.") from e
    yield
    # Shutdown: Clean up resources. Not strictly necessary for this model,
    # but good practice for resources like database connections.
    logger.info("Cleaning up resources...")
    app_state.clear()


app = FastAPI(lifespan=lifespan)

# --- API Endpoint ---
@app.post("/infer", response_model=InferenceResponse)
async def infer(request: InferenceRequest):
    """
    Performs inference using the pre-loaded TensorFlow model.
    """
    logger.debug(f"Received inference request for user '{request.user_id}'.")
    
    if "model" not in app_state or "tokenizer" not in app_state:
        logger.error("Model is not available in app state. This should not happen after startup.")
        raise HTTPException(status_code=503, detail="Service Unavailable: Model not loaded.")

    try:
        tokenizer = app_state["tokenizer"]
        model = app_state["model"]

        # Tokenize the input text
        inputs = tokenizer(request.prompt_data, return_tensors="tf")
        input_ids = inputs["input_ids"]

        # Generate text. The parameters here are critical for performance and quality.
        # max_length should be carefully tuned to prevent runaway generation.
        # num_return_sequences=1 ensures we only generate one output.
        output_sequences = model.generate(
            input_ids=input_ids,
            max_length=100,
            num_return_sequences=1,
            no_repeat_ngram_size=2,  # prevents repetitive phrases
            # Note: early_stopping only affects beam search (num_beams > 1),
            # so it is omitted here where greedy decoding is used.
        )

        generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
        
        logger.info(f"Successfully generated text for user '{request.user_id}'.")
        return InferenceResponse(generated_text=generated_text)

    except Exception as e:
        logger.error(f"Inference failed for user '{request.user_id}': {e}", exc_info=True)
        raise HTTPException(status_code=500, detail="Internal Server Error during inference.")


if __name__ == "__main__":
    # This is for local development. In production, a Gunicorn/Uvicorn process manager is used.
    uvicorn.run(app, host=SERVER_HOST, port=SERVER_PORT)

This service is self-contained and exposes a single /infer endpoint. It correctly handles model loading at startup to prevent per-request latency spikes.
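
Before wiring up the gateway, it is worth exercising the /infer contract directly. The sketch below uses FastAPI's TestClient; the test name and assertions are illustrative, and in CI you would typically point MODEL_NAME at a tiny stub model so the test does not download GPT-2.

# test_infer.py -- a minimal smoke test for the /infer contract (illustrative).
from fastapi.testclient import TestClient

from inference_server.main import app

def test_infer_returns_generated_text():
    # Using TestClient as a context manager runs the lifespan handler,
    # so the model is loaded before the request is made.
    with TestClient(app) as client:
        response = client.post(
            "/infer",
            json={"user_id": "user-123", "prompt_data": "A compact mirrorless camera for travel"},
        )
        assert response.status_code == 200
        body = response.json()
        assert isinstance(body["generated_text"], str)
        assert len(body["generated_text"]) > 0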

Implementation Deep Dive: The APISIX Smart Edge

APISIX configuration is declarative: routes and upstreams are defined in YAML. The key component is the ext-plugin-pre-req plugin, which delegates request processing to a separate Python process managed by the apisix-python-plugin-runner.

File: apisix_config/config.yaml

# This is a snippet for the relevant route and upstream configuration.
# A full APISIX config.yaml would include admin settings, etc.

routes:
  - id: "1"
    uri: "/api/v1/generate"
    methods: ["POST"]
    plugins:
      # The core of our edge logic.
      ext-plugin-pre-req:
        name: python-transformer
        conf:
          # This key must match the name of the function in our Python plugin code.
          hook_name: "rewrite" 
          # You can pass static JSON configuration to your plugin.
          # Here we can define things like required headers or token issuers.
          config:
            required_scope: "product:generate"

    # Reference the shared upstream defined below instead of duplicating
    # an inline upstream block on the route.
    upstream_id: "inference-server"

upstreams:
  - id: "inference-server"
    type: roundrobin
    nodes:
      # Kubernetes service name (or Docker/Compose service name locally)
      "python-inference-service.default.svc.cluster.local:8000": 1

# This section enables the external Python plugin runner. APISIX talks to
# the runner process over a unix socket, so every external-plugin call is
# an extra local hop (see "Limitations" below).
ext-plugin:
  # For local development, point APISIX at an apisix-python-plugin-runner
  # process started manually and listening on this socket.
  path_for_test: /tmp/runner.sock
  # In production, prefer the cmd option so APISIX manages the runner's
  # lifecycle itself. Watch runner responsiveness closely: a hung or slow
  # runner stalls every request on routes that use it.

The logic itself resides in a separate Python file, executed by the apisix-python-plugin-runner.

File: apisix_plugins/transformer.py

import sys
import json
import logging
from typing import Any, Dict

# This code is executed by the apisix-python-plugin-runner, not a web server.
# It communicates with APISIX over a unix socket.

# --- Basic Logging ---
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger("APISIX_Python_Plugin")

class Plugin:
    def __init__(self, config: Dict[str, Any]):
        """
        The config passed here is the 'conf' block from the APISIX YAML.
        """
        self.required_scope = config.get("config", {}).get("required_scope", "")
        logger.info(f"Initialized plugin with required_scope: {self.required_scope}")

    def get_user_id_from_token(self, token: str) -> str:
        """
        A placeholder for JWT validation logic. In a real project,
        this would use a library like PyJWT to decode and verify the token.
        For this example, we'll just simulate it.
        """
        # A common mistake is to not handle the "Bearer " prefix.
        if token and token.startswith("Bearer "):
            # In a real implementation:
            # decoded = jwt.decode(token[7:], key, algorithms=["RS256"])
            # if self.required_scope not in decoded.get("scope", ""):
            #   raise ValueError("Insufficient scope")
            # return decoded["sub"]
            return "user-123-from-token" # Mocked user ID
        raise ValueError("Invalid or missing token")

    def rewrite(self, request: Dict[str, Any], conf: Dict[str, Any]):
        """
        This method is called by APISIX for every request hitting the configured route.
        The name 'rewrite' must match the 'hook_name' in config.yaml.
        """
        try:
            # --- 1. Authentication ---
            auth_header = request.get("headers", {}).get("Authorization", "")
            user_id = self.get_user_id_from_token(auth_header)
        except ValueError as e:
            logger.warning(f"Authentication failed: {e}")
            # Stop processing and return a 401 Unauthorized response directly from the edge.
            # The backend service is never touched.
            return 401, json.dumps({"error": "Unauthorized"}).encode('utf-8')

        try:
            # --- 2. Request Transformation ---
            # Get the original request body from the client.
            # It's passed as a bytes object.
            client_body = json.loads(request.get("body", b"{}"))
            client_context = client_body.get("context", "")

            # The pitfall here is assuming the client sends a perfect payload.
            # Always validate and provide defaults.
            if not client_context:
                return 400, json.dumps({"error": "Bad Request: context is missing"}).encode('utf-8')

            # Construct the new body for the backend service.
            # This is where we decouple the client API from the internal service API.
            backend_prompt = f"Generate a product description based on this user behavior: {client_context}"
            backend_payload = {
                "user_id": user_id,
                "prompt_data": backend_prompt
            }

            # Set the new, transformed body for the upstream request.
            # APISIX requires the body to be bytes.
            request["body"] = json.dumps(backend_payload).encode('utf-8')

            # --- 3. Modify Upstream Request Path & Headers ---
            # Change the path from the public-facing `/api/v1/generate` to the internal `/infer`.
            request["path"] = "/infer"
            
            # It's good practice to update the Content-Length header after changing the body.
            request["headers"]["Content-Length"] = str(len(request["body"]))
            # We can also inject internal tracking headers.
            request["headers"]["X-Internal-Trace-Id"] = request.get("headers", {}).get("X-Request-ID", "unknown")

            # A return value of None means "continue processing with the modified request".
            return None 

        except json.JSONDecodeError:
            logger.error("Failed to decode client JSON body.")
            return 400, json.dumps({"error": "Bad Request: Invalid JSON"}).encode('utf-8')
        except Exception as e:
            logger.error(f"Unexpected error in plugin: {e}", exc_info=True)
            return 500, json.dumps({"error": "Internal Server Error in Gateway"}).encode('utf-8')

# The runner instantiates this class with the 'conf' block from the YAML and
# then invokes the method named by 'hook_name' ('rewrite' here). We export the
# class itself and leave instantiation to the runner.
transformer_plugin = Plugin

This plugin is the architectural lynchpin. It absorbs authentication and data transformation logic, allowing the ML service to remain simple and focused solely on inference.
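
The get_user_id_from_token placeholder above is where real verification belongs. Below is a hedged sketch of what that could look like with PyJWT, written as a drop-in method for the Plugin class; the public key, audience, and scope-claim layout are assumptions that would come from your identity provider's configuration, not values from this article.

# Sketch of real token validation with PyJWT. self.public_key is assumed to
# be loaded in __init__ (e.g. from the identity provider's JWKS endpoint).
import jwt  # PyJWT

def get_user_id_from_token(self, token: str) -> str:
    if not token or not token.startswith("Bearer "):
        raise ValueError("Invalid or missing token")
    try:
        decoded = jwt.decode(
            token[len("Bearer "):],
            key=self.public_key,      # assumption: loaded at plugin init
            algorithms=["RS256"],
            audience="product-api",   # assumption: audience configured for this gateway
        )
    except jwt.PyJWTError as exc:
        raise ValueError(f"Token verification failed: {exc}") from exc

    # Scopes are commonly a space-separated string in the 'scope' claim.
    scopes = decoded.get("scope", "").split()
    if self.required_scope not in scopes:
        raise ValueError("Insufficient scope")
    return decoded["sub"]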

Implementation Deep Dive: The Frontend (Qwik + Lit)

The frontend is a composition of two technologies.

1. The Qwik Application Shell (src/routes/product/[id]/index.tsx)

Qwik handles the page itself. Its useResource$ hook is perfect for asynchronous data fetching, as it creates a resumable boundary.

// src/routes/product/[id]/index.tsx
import { component$, useResource$, Resource } from '@builder.io/qwik';
import { useLocation } from '@builder.io/qwik-city';
// Importing the module registers the <personalized-description> custom element
// (via Lit's @customElement decorator); we render it by its tag name below.
import '~/components/personalized-description/personalized-description';

// In a real app, this would be a more robust API client. Note that a
// relative URL only resolves in the browser; if this code runs during SSR,
// it needs an absolute origin instead.
export const fetchPersonalizedDescription = async (productId: string, context: string) => {
  try {
    const response = await fetch('/api/v1/generate', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Token would come from an auth state management solution.
        'Authorization': 'Bearer my-dummy-token'
      },
      body: JSON.stringify({ context: `${context} for product ${productId}` })
    });

    if (!response.ok) {
      // The pitfall is not handling API errors gracefully.
      console.error('API Error:', response.statusText);
      return { text: 'Unable to generate a personalized description at this time.' };
    }
    const data = await response.json();
    // The gateway forwards the backend's response body unchanged, so this
    // matches the inference service's contract: { generated_text: string }.
    return { text: data.generated_text };
  } catch (error) {
    console.error('Network or fetch error:', error);
    return { text: 'A network error occurred.' };
  }
};

export default component$(() => {
  const location = useLocation();
  const productId = location.params.id;

  // useResource$ is the key Qwik API for async operations.
  // It runs on the server during SSR and can resume on the client.
  const descriptionResource = useResource$(async ({ track }) => {
    // track tells Qwik to re-run this resource when these values change.
    track(() => productId);

    // This simulates user context we might gather on the client.
    const userContext = "User has viewed 'electronics' category three times.";
    
    // The fetch call is made. Qwik handles streaming the response.
    const result = await fetchPersonalizedDescription(productId, userContext);
    return result;
  });

  return (
    <div>
      <h1>Product Page: {productId}</h1>
      <p>Standard static description content goes here...</p>
      
      {/* Resource component handles the different states of the async operation. */}
      <Resource
        value={descriptionResource}
        onPending={() => <div class="loading-placeholder">Generating description...</div>}
        onRejected={(error) => <div>Error: {error.message}</div>}
        onResolved={(desc) => (
          // We hand the resolved data to our Lit web component. Qwik renders
          // the custom element as plain HTML and Lit upgrades it in the
          // browser, so no framework-level hydration step is involved.
          <personalized-description generated-text={desc.text}></personalized-description>
        )}
      />
    </div>
  );
});

2. The Lit Web Component (src/components/personalized-description/personalized-description.ts)

This component is a self-contained unit of UI. It knows nothing about Qwik; it receives data through a standard attribute/property pair on the custom element.

// src/components/personalized-description/personalized-description.ts
import { LitElement, css, html } from 'lit';
import { customElement, property } from 'lit/decorators.js';

@customElement('personalized-description')
export class PersonalizedDescription extends LitElement {
  // Public property backed by the 'generated-text' attribute, so it can be
  // set from plain HTML, from Qwik's JSX, or imperatively as a property.
  @property({ type: String, attribute: 'generated-text' })
  generatedText = '...';

  static styles = css`
    :host {
      display: block;
      border: 1px solid #ccc;
      padding: 16px;
      margin-top: 20px;
      border-radius: 8px;
      background-color: #f9f9f9;
    }
    p {
      margin: 0;
    }
    .header {
      font-weight: bold;
      margin-bottom: 8px;
      font-size: 0.9em;
      color: #555;
    }
  `;

  // The render function defines the component's internal DOM.
  render() {
    return html`
      <div class="header">Your Personalized Insight:</div>
      <p>${this.generatedText}</p>
    `;
  }
}

// We also need to tell TypeScript about the new tag so Qwik's JSX accepts
// <personalized-description>. The exact augmentation depends on the Qwik and
// TypeScript versions in use; conceptually, you add the tag and its attribute
// contract to the JSX intrinsic-elements interface your setup exposes, e.g.:
//   'personalized-description': { 'generated-text'?: string };

This separation of concerns was critical. The Qwik team could optimize the shell and data loading, while another team could iterate on the personalized-description component’s UI and features without conflict. The integration point is a stable public contract: the generatedText property, exposed to markup as the generated-text attribute.

Limitations and Future Outlook

This architecture successfully met our performance goals for TTI and LCP, but it’s not without its own set of complexities and areas for improvement.

The APISIX Python plugin, while powerful, introduces an extra inter-process hop between APISIX and the apisix-python-plugin-runner over a unix socket. For extreme low-latency scenarios, rewriting performance-critical plugins in Lua to run directly within the Nginx worker process would yield better results. This remains a future optimization path.

Furthermore, our current model deployment is static; updating the TensorFlow model requires a full service redeployment. The next iteration will involve integrating an MLOps pipeline. APISIX can play a key role here, using its traffic splitting capabilities to canary release new model versions from the data science team with minimal risk.
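
As a concrete illustration of that direction, the sketch below uses the APISIX Admin API and its traffic-split plugin to send roughly 10% of /api/v1/generate traffic to a hypothetical canary upstream (inference-server-canary) serving the new model. The admin URL, API key, and upstream id are placeholders, not values from our deployment.

# canary_release.py -- illustrative sketch of a 10% canary rollout via the
# APISIX Admin API and the traffic-split plugin. All identifiers are placeholders.
import requests

ADMIN_URL = "http://apisix-admin:9180/apisix/admin/routes/1"
ADMIN_KEY = "replace-with-admin-key"

route_patch = {
    "plugins": {
        "traffic-split": {
            "rules": [
                {
                    "weighted_upstreams": [
                        # ~10% of requests go to the canary model service...
                        {"upstream_id": "inference-server-canary", "weight": 1},
                        # ...and the rest fall through to the route's default
                        # upstream (the current production model).
                        {"weight": 9},
                    ]
                }
            ]
        }
    }
}

resp = requests.patch(
    ADMIN_URL,
    json=route_patch,
    headers={"X-API-KEY": ADMIN_KEY},
    timeout=5,
)
resp.raise_for_status()
print("Canary rules applied:", resp.json())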

Finally, while the Qwik and Lit combination provided the performance we needed, it introduced a “two-framework” cognitive load for developers. Onboarding new engineers requires explaining the rationale behind this compositional architecture and the distinct roles of each library. The long-term maintainability depends on clear documentation and strict adherence to the boundaries we’ve established between the application shell and the leaf components.

