Constructing a Serverless Real User Monitoring Pipeline with AWS Lambda and InfluxDB for a Svelte Front-End


The project began with a clear and painful constraint: commercial Real User Monitoring (RUM) solutions were prohibitively expensive for the sheer volume of interaction events we needed to track. Our applications, built with Svelte, required granular insight not just into Core Web Vitals, but also into component-level mount times and custom user interactions. Sending this firehose of data to a third-party service would have crippled our budget. The initial mandate was to build an in-house ingestion pipeline that was cost-effective, scalable, and could handle spiky, unpredictable traffic without falling over.

Our first napkin sketch was simple: a lightweight Svelte beacon on the client sends metrics to an API Gateway endpoint, which triggers a Lambda function to write the data into a time-series database. For the database, InfluxDB was the obvious choice due to its high write throughput and data lifecycle management features, which are critical for observability metrics.

This architecture, while simple on paper, hides significant complexity in its implementation. In a real-world project, the primary concerns shift from “does it work?” to “is it reliable, maintainable, and cost-efficient at scale?”. The journey from that initial concept to a production-ready system was fraught with performance bottlenecks, cost overruns, and architectural refactoring.

The Client-Side Beacon: From Naive to Batched

The first component was the data collection script, or “beacon.” We chose to build this as a tiny Svelte utility, allowing us to easily integrate it into our existing applications. The initial version was dangerously simple. Using the web-vitals library, we captured metrics and sent them immediately.

// src/rum/beacon.v1.js - The naive, non-production approach
import { onLCP, onFID, onCLS } from 'web-vitals';

const RUM_ENDPOINT = 'https://api.example.com/ingest';

function sendMetric(metric) {
    const body = JSON.stringify({
        name: metric.name,
        value: metric.value,
        // ... other metadata like page path, user agent
    });

    // Fire-and-forget, no batching. This is the problem.
    if (navigator.sendBeacon) {
        navigator.sendBeacon(RUM_ENDPOINT, body);
    } else {
        fetch(RUM_ENDPOINT, { body, method: 'POST', keepalive: true });
    }
}

export function initializeRUM() {
    onLCP(sendMetric);
    onFID(sendMetric);
    onCLS(sendMetric);
}

This approach failed spectacularly under even light testing. Every metric (LCP, FID, CLS, plus our custom events) triggered a separate network request and a separate Lambda invocation. The cost implications were immediately obvious. For a single user session, we could generate 5-10 separate invocations. At a million sessions a day, this would be 5-10 million Lambda invocations—a financial disaster.

The pitfall here is underestimating the sheer volume of events and the cost of individual processing. The solution was client-side batching. We needed to collect metrics in a queue and flush them periodically or when the page was about to unload.

Here is the production-grade beacon. It collects metrics in a queue and flushes when the batch reaches a size limit or a short timer fires, so a session costs one or two requests instead of one per metric. The visibilitychange and pagehide event listeners ensure that we capture data from users leaving the page.

// src/rum/beacon.v2.js - Production version with batching

import { onLCP, onFID, onCLS } from 'web-vitals';

// Injected at build time by the bundler's env replacement (e.g., webpack DefinePlugin).
const RUM_ENDPOINT = process.env.RUM_INGESTION_ENDPOINT;
const BATCH_SIZE_LIMIT = 20;
const BATCH_TIME_LIMIT_MS = 5000; // 5 seconds

let metricQueue = [];
let timeoutId = null;

function flushQueue() {
    if (metricQueue.length === 0) {
        return;
    }

    // A common mistake is to mutate the original queue while the send is in flight.
    // Create a copy and clear the original immediately.
    const batch = [...metricQueue];
    metricQueue = [];
    
    // Clear any scheduled flush, since we are flushing now.
    if (timeoutId) {
        clearTimeout(timeoutId);
        timeoutId = null;
    }

    const body = JSON.stringify({ events: batch });

    try {
        // sendBeacon returns false when the browser refuses the payload
        // (e.g., its in-flight beacon quota is full), so fall back to fetch.
        const queued = navigator.sendBeacon
            ? navigator.sendBeacon(RUM_ENDPOINT, body)
            : false;

        if (!queued) {
            fetch(RUM_ENDPOINT, { body, method: 'POST', keepalive: true });
        }
    } catch (error) {
        // In a real-world scenario, you might want to log this to the console
        // for debugging, but avoid sending it to a logging service to prevent loops.
        console.error('RUM Beacon failed to send:', error);
    }
}

function addToQueue(metric) {
    const event = {
        timestamp: new Date().toISOString(),
        name: metric.name,
        value: metric.value,
        // It's crucial to add context here that can become tags in InfluxDB.
        // Merge any caller-supplied context (e.g., from trackCustomEvent) last,
        // so custom fields aren't silently discarded.
        context: {
            path: window.location.pathname,
            browser: navigator.userAgentData?.brands?.[0]?.brand || 'unknown',
            connection: navigator.connection?.effectiveType || 'unknown',
            ...(metric.context || {}),
        }
    };
    
    metricQueue.push(event);

    if (metricQueue.length >= BATCH_SIZE_LIMIT) {
        flushQueue();
    } else if (!timeoutId) {
        // Schedule a flush in the future if one isn't already scheduled.
        timeoutId = setTimeout(flushQueue, BATCH_TIME_LIMIT_MS);
    }
}

// Ensure data is sent before the page unloads
document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') {
        flushQueue();
    }
});

// `pagehide` is more reliable for mobile devices.
window.addEventListener('pagehide', flushQueue);

export function initializeRUM() {
    if (!RUM_ENDPOINT) {
        console.warn('RUM_INGESTION_ENDPOINT is not configured.');
        return;
    }
    
    onLCP(addToQueue);
    onFID(addToQueue);
    onCLS(addToQueue);
    
    console.log('RUM Beacon Initialized.');
}

// We can also expose a function for custom application events
export function trackCustomEvent({ name, value, context = {} }) {
    addToQueue({ name, value, context: { ...context, type: 'custom' } });
}

This version reduced our Lambda invocation count by a factor of 10-20, fundamentally changing the cost model of the entire system.
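
The custom events mentioned at the outset, such as component-level mount times, ride this same queue through trackCustomEvent. As an illustration of how a component opts in, a hypothetical helper like the following wraps Svelte's onMount (the module path and event name are our own inventions, not part of the beacon itself):

// src/rum/mount-timing.js - hypothetical helper around trackCustomEvent
import { onMount } from 'svelte';
import { trackCustomEvent } from './beacon.v2.js';

// Must be called during component initialization (top of the <script> block),
// because onMount has to be registered while the component is being created.
export function trackMountTime(componentName) {
    const start = performance.now();
    onMount(() => {
        trackCustomEvent({
            name: 'component_mount',
            value: performance.now() - start,
            context: { component: componentName },
        });
    });
}

A component then opts in with a single line at the top of its script block: trackMountTime('PerfBar').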

The AWS Lambda Ingester: Surviving the Firehose

With the client now sending batches, the Lambda function became the next focus. The initial implementation was a simple loop that processed each event in the batch and wrote it to InfluxDB one by one.

// lambda/ingester.v1.js - Simple but inefficient
const { InfluxDB, Point } = require('@influxdata/influxdb-client');

// These should be set as environment variables
const INFLUX_URL = process.env.INFLUX_URL;
const INFLUX_TOKEN = process.env.INFLUX_TOKEN;
const INFLUX_ORG = process.env.INFLUX_ORG;
const INFLUX_BUCKET = process.env.INFLUX_BUCKET;

// IMPORTANT: Initialize the client *outside* the handler to reuse the connection
// across warm invocations. This is a critical performance optimization.
const influxDB = new InfluxDB({ url: INFLUX_URL, token: INFLUX_TOKEN });
const writeApi = influxDB.getWriteApi(INFLUX_ORG, INFLUX_BUCKET);

exports.handler = async (event) => {
    try {
        const body = JSON.parse(event.body);
        const events = body.events || [];

        if (events.length === 0) {
            return { statusCode: 204, body: 'No content' };
        }

        for (const ev of events) {
            const point = new Point(ev.name)
                .tag('path', ev.context.path || 'unknown')
                .tag('browser', ev.context.browser || 'unknown')
                .floatField('value', ev.value)
                .timestamp(new Date(ev.timestamp));
            
            // Writing points one by one is a huge performance bottleneck.
            writeApi.writePoint(point);
        }

        // This flush is also problematic for latency.
        await writeApi.flush();

        return { statusCode: 202, body: 'Accepted' };
    } catch (error) {
        console.error('Ingestion failed:', error);
        // A generic 500 error loses the data. This is where a DLQ is needed.
        return { statusCode: 500, body: 'Internal Server Error' };
    }
};

This implementation suffered from two major problems:

  1. High Latency: Calling writeApi.writePoint for each event and then awaiting flush on every invocation creates significant I/O overhead and increases the Lambda’s execution time; the client is never handed a whole batch to optimize.
  2. Data Loss: If InfluxDB is unavailable or the data is malformed, the entire batch is dropped, and a 500 error is returned. The data is lost forever.

The solution was to leverage the InfluxDB client’s own batching capabilities and to configure a Dead Letter Queue (DLQ) for the Lambda function. The refined Lambda prepares an array of Point objects and writes them all in a single writePoints call.

For local development and testing, we used a docker-compose.yml to spin up InfluxDB.

# docker-compose.yml
version: '3'
services:
  influxdb:
    image: influxdb:2.7
    container_name: influxdb_rum
    ports:
      - "8086:8086"
    volumes:
      - influxdb_data:/var/lib/influxdb2
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=password123
      - DOCKER_INFLUXDB_INIT_ORG=my-org
      - DOCKER_INFLUXDB_INIT_BUCKET=rum-bucket
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=my-super-secret-token

volumes:
  influxdb_data:
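
With the stack running (docker compose up -d), we could smoke-test ingestion end-to-end before deploying anything to AWS. A minimal harness along these lines, using the credentials from the compose file, invokes the handler from the listing that follows:

// scripts/invoke-local.js - hypothetical smoke test against the local InfluxDB
// Set the environment before requiring the module, since it validates on load.
process.env.INFLUX_URL = 'http://localhost:8086';
process.env.INFLUX_TOKEN = 'my-super-secret-token';
process.env.INFLUX_ORG = 'my-org';
process.env.INFLUX_BUCKET = 'rum-bucket';

const { handler } = require('../lambda/ingester.v2.js');

handler({
    requestContext: { requestId: 'local-smoke-test' },
    body: JSON.stringify({
        events: [{
            name: 'LCP',
            value: 1830.5,
            timestamp: new Date().toISOString(),
            context: { path: '/checkout', browser: 'Chromium' },
        }],
    }),
}).then(
    (res) => console.log('Response:', res),
    (err) => console.error('Invocation failed:', err)
);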

Here’s the production-ready Lambda function. It includes structured logging, robust error handling, and efficient batch writing.

// lambda/ingester.v2.js - Production-ready version
const { InfluxDB, Point } = require('@influxdata/influxdb-client');

const {
    INFLUX_URL,
    INFLUX_TOKEN,
    INFLUX_ORG,
    INFLUX_BUCKET,
    NODE_ENV
} = process.env;

// A common mistake is not validating environment variables on startup.
if (!INFLUX_URL || !INFLUX_TOKEN || !INFLUX_ORG || !INFLUX_BUCKET) {
    throw new Error('InfluxDB environment variables are not fully configured.');
}

const influxDB = new InfluxDB({ url: INFLUX_URL, token: INFLUX_TOKEN });
const writeApi = influxDB.getWriteApi(INFLUX_ORG, INFLUX_BUCKET, 'ns');

// Close the write API when the runtime shuts down. Note that 'exit' handlers
// cannot await asynchronous work, so we hook 'beforeExit' instead; the awaited
// flush inside the handler remains the real guarantee that data is sent.
process.on('beforeExit', () => {
    writeApi.close()
        .then(() => console.log('InfluxDB writeApi closed.'))
        .catch((e) => console.error('Error closing InfluxDB writeApi', e));
});

exports.handler = async (event) => {
    // Basic structured logging
    const log = (level, message, context) => {
        console.log(JSON.stringify({ level, message, ...context }));
    };

    const requestID = event.requestContext?.requestId;
    log('info', 'Ingestion started', { requestID });

    try {
        const body = JSON.parse(event.body);
        const events = body.events || [];

        if (events.length === 0) {
            log('warn', 'Received empty batch', { requestID });
            return { statusCode: 204, body: '' };
        }
        
        const points = events.map(ev => {
            // Data validation is crucial for a production system.
            if (!ev.name || typeof ev.value !== 'number' || !ev.timestamp) {
                log('error', 'Malformed event skipped', { event: ev, requestID });
                return null;
            }
            
            // High-cardinality values should never be tags; here, we assume path
            // and browser are low-cardinality. The client library escapes
            // line-protocol special characters itself, so escaping by hand would
            // store literal backslashes. We only normalize missing values.
            const pathTag = ev.context?.path || 'unknown';
            const browserTag = ev.context?.browser || 'unknown';

            return new Point(ev.name)
                .tag('path', pathTag)
                .tag('browser', browserTag)
                .floatField('value', ev.value)
                .timestamp(new Date(ev.timestamp));
        }).filter(p => p !== null); // Filter out any malformed events

        if (points.length === 0) {
            log('warn', 'Batch contained only malformed events', { requestID });
            return { statusCode: 204, body: '' };
        }

        writeApi.writePoints(points);
        // The client library handles batching internally. We can flush to ensure
        // it's sent, which is a good trade-off for an ingest endpoint.
        await writeApi.flush();

        log('info', 'Ingestion successful', { requestID, pointCount: points.length });
        return { statusCode: 202, body: JSON.stringify({ message: 'Accepted' }) };

    } catch (error) {
        log('error', 'Ingestion failed catastrophically', {
            requestID,
            errorMessage: error.message,
            stack: NODE_ENV === 'development' ? error.stack : undefined,
        });

        // This is where the AWS Lambda DLQ (configured via IaC, e.g., SAM/Terraform)
        // comes in. Note that DLQs apply only to *asynchronous* invocations, so the
        // API Gateway integration must invoke this function as an event rather than
        // synchronously. The failed `event` payload is then sent to an SQS queue
        // for later inspection and reprocessing.
        throw error; // Re-throwing ensures Lambda marks the invocation as failed.
    }
};

This version is far more robust. By re-throwing the error, we let the Lambda service mark the invocation as failed and forward the payload to our configured SQS DLQ, which prevents data loss and lets us debug and re-process failed batches later. The one subtlety is that DLQs fire only for asynchronous invocations, so the API Gateway integration invokes the ingester as a fire-and-forget event, which suits a beacon endpoint that never needs a meaningful response anyway. The end-to-end flow now looks like this:

graph TD
    subgraph Browser
        A[Svelte Beacon] -- Batched JSON payload --> B{navigator.sendBeacon}
    end
    
    B --> C[API Gateway]
    C -- Proxies request --> D[AWS Lambda Ingester]
    
    subgraph "Error Path"
        D -- On unhandled error --> E[SQS Dead Letter Queue]
    end
    
    subgraph "Success Path"
        D -- Writes batch of points --> F[InfluxDB]
    end

    subgraph Dashboard
        G[Svelte App] -- Flux Query --> F
    end

    style F fill:#2b9696,stroke:#333,stroke-width:2px,color:#fff
    style D fill:#ff9900,stroke:#333,stroke-width:2px,color:#fff
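
The DLQ itself is pure infrastructure and lives in our IaC, as noted above. Purely as an illustration of the wiring (every name below is hypothetical, and deadLetterQueue/retryAttempts only take effect for asynchronous invocations), an AWS CDK sketch in JavaScript looks roughly like this:

// infra/ingester-stack.js - hypothetical CDK sketch of the DLQ wiring
const cdk = require('aws-cdk-lib');
const lambda = require('aws-cdk-lib/aws-lambda');
const sqs = require('aws-cdk-lib/aws-sqs');

class IngesterStack extends cdk.Stack {
    constructor(scope, id, props) {
        super(scope, id, props);

        // Failed async invocations land here for inspection and replay.
        const dlq = new sqs.Queue(this, 'RumIngestDlq', {
            retentionPeriod: cdk.Duration.days(14),
        });

        new lambda.Function(this, 'RumIngester', {
            runtime: lambda.Runtime.NODEJS_18_X,
            handler: 'ingester.handler',
            code: lambda.Code.fromAsset('lambda'),
            deadLetterQueue: dlq,
            retryAttempts: 2, // async retries before the payload goes to the DLQ
            environment: {
                INFLUX_URL: process.env.INFLUX_URL ?? '',
                INFLUX_TOKEN: process.env.INFLUX_TOKEN ?? '',
                INFLUX_ORG: process.env.INFLUX_ORG ?? '',
                INFLUX_BUCKET: process.env.INFLUX_BUCKET ?? '',
            },
        });
    }
}

module.exports = { IngesterStack };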

The Dashboard: Svelte, SCSS, and the Emotion Conundrum

The final piece was the visualization dashboard. We built this, naturally, with Svelte. The initial structure was straightforward, using global styles managed with SCSS via svelte-preprocess.

// svelte.config.js
import sveltePreprocess from 'svelte-preprocess';

export default {
  preprocess: sveltePreprocess({
    scss: {
      prependData: `@import 'src/styles/variables.scss';`
    }
  })
};

This worked well for the overall layout, buttons, and static components. A typical component looked like this:

<!-- src/components/Card.svelte -->
<div class="card">
    <div class="card-header">
        <slot name="header"></slot>
    </div>
    <div class="card-body">
        <slot></slot>
    </div>
</div>

<style lang="scss">
    .card {
        background-color: $color-surface;
        border: 1px solid $color-border;
        border-radius: $border-radius-md;
        // ... more styles
    }
</style>

The problem arose when we started building complex data visualization components, like a chart displaying LCP distribution. We needed to dynamically color bars based on the data values (e.g., green for good, orange for needs improvement, red for poor).

Doing this with SCSS was clumsy. We could inject CSS custom properties from the Svelte script block, but it felt like a hack and coupled the logic tightly to the style structure.

<!-- The clumsy SCSS way -->
<script>
    export let value;
    let color;
    $: {
        if (value < 2500) color = 'var(--color-success)';
        else if (value < 4000) color = 'var(--color-warning)';
        else color = 'var(--color-danger)';
    }
</script>

<div class="bar" style="--bar-color: {color}; height: {value / 100}px;"></div>

<style lang="scss">
    .bar {
        background-color: var(--bar-color);
    }
</style>

This pattern, repeated across many components, became difficult to maintain. The real issue is that the styling is a direct function of the component’s props. This is a perfect use case for a CSS-in-JS library. After some debate, we introduced Emotion.

This led to a hybrid styling approach. We kept our global styles and layout in SCSS, as it’s excellent for defining a design system. For highly dynamic, data-driven components, we used Emotion to co-locate the style logic with the component logic.

We created a simple Svelte action to apply Emotion classes.

// src/utils/emotion.js
import { css } from '@emotion/css';

export function useStyles(node, styles) {
    let className = css(styles);
    node.classList.add(className);

    return {
        update(newStyles) {
            node.classList.remove(className);
            className = css(newStyles);
            node.classList.add(className);
        }
    };
}

Now, our chart component became much cleaner and more self-contained.

<!-- src/components/PerfBar.svelte -->
<script>
    import { useStyles } from '../utils/emotion.js';

    export let value;
    export let maxValue;

    function getBarStyles(val) {
        let backgroundColor;
        if (val < 2500) backgroundColor = '#2ecc71';
        else if (val < 4000) backgroundColor = '#f39c12';
        else backgroundColor = '#e74c3c';

        const height = Math.max(5, (val / maxValue) * 100);

        // The styles are now a pure function of the props
        return {
            height: `${height}%`,
            backgroundColor,
            transition: 'all 0.3s ease',
            '&:hover': {
                transform: 'scaleY(1.05)',
                boxShadow: '0 0 10px rgba(0,0,0,0.2)'
            }
        };
    }
</script>

<!-- The use:useStyles action applies the dynamic class -->
<div class="bar-container">
    <div class="bar" use:useStyles={getBarStyles(value)}></div>
</div>

<!-- We can still use SCSS for the container and other static parts -->
<style lang="scss">
    .bar-container {
        width: 20px;
        height: 150px;
        background: $color-surface-secondary;
        display: flex;
        align-items: flex-end;
    }
    .bar {
        width: 100%;
    }
</style>

In a real-world project, this hybrid approach is often the most pragmatic. It allows us to use the best tool for the job: SCSS for the stable, global design language and Emotion for the volatile, state-dependent styles at the component level. It avoids the all-or-nothing trap of many styling philosophies.

Lingering Issues and Future Optimizations

The current architecture, while robust, is not without its limitations. The data schema in InfluxDB is optimized for querying recent data, but calculating trends over months requires querying and aggregating vast numbers of individual points, which can be slow and expensive. The next logical step is to introduce InfluxDB Tasks or another scheduled Lambda to pre-aggregate data into new buckets daily—for example, calculating p95/p99 LCP values per page and rolling them up.
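
We have not built that rollup yet, but the shape is clear: a scheduled Lambda (or an InfluxDB Task) runs a Flux aggregation whose to() clause writes into a dedicated rollup bucket. A minimal sketch under those assumptions; the rum-rollups bucket, the measurement name, and the daily p95 are all illustrative:

// lambda/rollup.js - hypothetical daily p95 rollup, triggered by an EventBridge schedule
const { InfluxDB } = require('@influxdata/influxdb-client');

const influxDB = new InfluxDB({
    url: process.env.INFLUX_URL,
    token: process.env.INFLUX_TOKEN,
});
const queryApi = influxDB.getQueryApi(process.env.INFLUX_ORG);

// Compute the daily p95 LCP per series and write it server-side via to().
// aggregateWindow keeps a _time column, which to() requires.
const flux = `
from(bucket: "rum-bucket")
  |> range(start: -1d)
  |> filter(fn: (r) => r._measurement == "LCP" and r._field == "value")
  |> aggregateWindow(
       every: 1d,
       fn: (column, tables=<-) => tables |> quantile(q: 0.95, column: column)
     )
  |> set(key: "_measurement", value: "LCP_p95_daily")
  |> to(bucket: "rum-rollups")
`;

exports.handler = async () => {
    // collectRows drains the result stream; the write happens inside InfluxDB.
    await queryApi.collectRows(flux);
    return { status: 'rollup complete' };
};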

Furthermore, while the Lambda ingester is scalable, its reliance on API Gateway can become a cost center at extreme traffic volumes. For a system processing billions of events per month, bypassing API Gateway and writing directly to a Kinesis Data Stream, which then triggers the Lambda, could offer better throughput and a more favorable cost structure. This would also provide better buffering and replay capabilities than an SQS DLQ alone. Finally, the client-side beacon itself, though lightweight, still runs on the main thread. Offloading the queuing and network logic to a Web Worker is a potential micro-optimization to guarantee zero performance impact on the host application.
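
The Web Worker half of that last idea is mostly plumbing: the page thread would only postMessage raw metrics, while the worker owns the queue, the timer, and the network. A sketch with hypothetical file names follows; one wrinkle is that sendBeacon is not available inside workers, so a keepalive fetch takes its place.

// src/rum/beacon.worker.js - hypothetical worker owning the queue and network I/O
const RUM_ENDPOINT = 'https://api.example.com/ingest';

let queue = [];
let timer = null;

function flush() {
    if (queue.length === 0) return;
    const batch = queue;
    queue = [];
    if (timer) {
        clearTimeout(timer);
        timer = null;
    }
    // navigator.sendBeacon does not exist in workers; a keepalive fetch gives
    // small payloads a similar chance to survive page unload.
    fetch(RUM_ENDPOINT, {
        method: 'POST',
        body: JSON.stringify({ events: batch }),
        keepalive: true,
    }).catch(() => { /* never let RUM errors surface to the app */ });
}

self.onmessage = ({ data }) => {
    if (data.type === 'metric') {
        queue.push(data.event);
        if (queue.length >= 20) {
            flush();
        } else if (!timer) {
            timer = setTimeout(flush, 5000);
        }
    } else if (data.type === 'flush') {
        // The page posts this from its pagehide/visibilitychange handlers.
        flush();
    }
};

The main thread's beacon then shrinks to worker.postMessage({ type: 'metric', event }), plus a pagehide listener that posts { type: 'flush' }.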

