Integrating Playwright, Prometheus, and DVC for Front-End Performance Regression Analysis


Our component library’s performance was degrading in subtle, untraceable ways. Each sprint, a few milliseconds crept in here, a few dozen there. The aggregate effect was a sluggish user experience that our CI pipeline nonetheless passed with flying colors. Functional correctness was guaranteed, but performance was a ghost we couldn’t catch. The root of the problem was a lack of context. A regression could stem from a change in a React component, a CSS-in-JS update in our Chakra UI theme, or, as we discovered later, a change in the shape of the representative data used for rendering. We needed a system that would not just measure performance, but correlate it with the precise version of every dependency: code, infrastructure, and data.

The initial concept was to build a version-aware synthetic monitoring pipeline. For every commit, this system would execute a series of performance-critical user journeys, capture granular metrics far beyond standard Web Vitals, and store them in a way that could be sliced and diced against code and data history. This meant going beyond simple pass/fail assertions and treating performance data as a first-class, versioned artifact.

Technology selection was driven by this need for deep integration and context.

  • UI Target: Our application is built with Chakra UI. Its component-centric architecture was an advantage, allowing us to instrument and measure the performance of individual, isolated components like tables or complex forms.
  • Measurement Agent: Playwright was chosen over alternatives for its robust tracing capabilities and its ability to programmatically access low-level browser performance APIs. We needed to execute custom JavaScript in the page context to extract measurements from our instrumented components, a task Playwright handles cleanly.
  • Time-Series Storage: Prometheus is the de facto standard for metrics. Its dimensional data model allows tagging measurements with labels. We planned to use labels for the component name, test name, and browser. The critical decision was what not to put in labels. Highly unique identifiers like Git commit hashes would lead to a cardinality explosion, a classic Prometheus anti-pattern.
  • Metadata and Context Storage: This is where DynamoDB entered the architecture. To avoid overwhelming Prometheus, all high-cardinality metadata for a test run—the full Git commit hash, the DVC data version hash, the Playwright script’s own hash, a link to CI logs—would be stored in a DynamoDB table, indexed by a unique runId. Prometheus metrics would be tagged with this runId, creating the link between the time-series data and its rich context. A sketch of both record shapes follows this list.
  • Data Versioning: The most unconventional piece was DVC (Data Version Control). Our application’s most complex components rendered large, structured datasets. We realized that changes to this data (e.g., adding more columns, increasing row count, changing data distribution) had a direct impact on rendering performance. By versioning our test datasets with DVC, we could finally attribute performance changes to the data itself, not just the code.
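
To make the split concrete, the sketch below shows the two record shapes side by side, assuming the label set described above and the metadata fields used later in this post; the field names are illustrative rather than a fixed schema.

// Low-cardinality labels attached to every Prometheus sample for a run.
type PrometheusLabels = {
  job: 'frontend-performance';
  component: string; // e.g. 'DataTable'
  test: string;      // e.g. 'render large dataset'
  browser: string;   // e.g. 'chromium'
  runId: string;     // the single bridge to the metadata store
};

// High-cardinality context stored once per run in DynamoDB, keyed by runId.
type RunMetadata = {
  runId: string;                 // partition key
  gitCommit: string;             // full 40-character hash
  dataVersion: string;           // DVC md5 of the test dataset
  playwrightScriptHash?: string;
  ciJobUrl?: string;
  runTimestamp: string;          // ISO 8601
};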

The final architecture coalesced around an orchestration script at the heart of our CI pipeline. This script would be responsible for coordinating these disparate systems into a cohesive workflow.

graph TD
    subgraph CI Runner
        A[Git Checkout] --> B(Orchestration Script);
        B --> C[Get Git Commit];
        B --> D[dvc pull & Get Data Hash];
        B --> E[Generate runId];
        B --> F(Run Playwright Tests);
    end

    subgraph Browser Instance
        F --> G[Chakra UI Application];
        G --> H[Performance Instrumentation];
    end

    subgraph Data Collection
        F -- Collects Metrics --> I[Metrics Payload];
        F -- Collects Metadata --> J[Metadata Payload];
    end

    subgraph Storage & Analysis
        I -- Remote Write --> K(Prometheus TSDB);
        J -- PutItem --> L(DynamoDB Table);
        K -- runId label --> L;
        M[Grafana/Alerting] --> K;
        N[Manual Investigation] --> L;
    end

Instrumenting the Target Application

The first step was to modify our Chakra UI application to expose the metrics we cared about. We focused on a particularly complex DataTable component that was a frequent source of performance issues. We used the browser’s performance.mark and performance.measure APIs, which are more granular and less intrusive than React’s Profiler for our use case.

The goal is to create performance entries that Playwright can later harvest.

// src/components/instrumented-data-table.jsx

import React, { useLayoutEffect, useRef } from 'react';
import { Table, Thead, Tbody, Tr, Th, Td, TableContainer } from '@chakra-ui/react';

// A unique identifier for this component's instrumentation
const COMPONENT_ID = 'DataTable';

const InstrumentedDataTable = ({ data, 'data-testid': dataTestId }) => {
  const tableRef = useRef(null);

  // We use useLayoutEffect to ensure the mark is created after the DOM has been mutated
  // but before the browser has painted.
  useLayoutEffect(() => {
    if (data && data.length > 0) {
      const startMark = `${COMPONENT_ID}_render_start`;
      const endMark = `${COMPONENT_ID}_render_end`;
      const measureName = `${COMPONENT_ID}_render_duration`;

      // Clear previous measures to avoid confusion
      performance.clearMarks(startMark);
      performance.clearMarks(endMark);
      performance.clearMeasures(measureName);

      performance.mark(startMark);

      // Record the end mark on a zero-delay timeout (a macrotask), so it fires after the
      // browser has had a chance to lay out and paint the current frame.
      const timerId = setTimeout(() => {
        performance.mark(endMark);
        try {
          performance.measure(measureName, startMark, endMark);
          const measure = performance.getEntriesByName(measureName)[0];
          
          // Expose the measurement on a global object for Playwright to scrape
          window.__PERF_METRICS__ = window.__PERF_METRICS__ || {};
          window.__PERF_METRICS__[COMPONENT_ID] = measure.duration;

        } catch (e) {
          // It's possible for marks to be cleared before measurement in complex scenarios.
          // In a real-world project, you would log this error.
          console.error(`Failed to measure performance for ${COMPONENT_ID}:`, e);
        }
      }, 0);

      return () => clearTimeout(timerId);
    }
  }, [data]); // Rerun effect when data changes

  return (
    <TableContainer data-testid={dataTestId} ref={tableRef}>
      <Table variant='simple'>
        <Thead>
          <Tr>
            {data.length > 0 && Object.keys(data[0]).map((key) => <Th key={key}>{key}</Th>)}
          </Tr>
        </Thead>
        <Tbody>
          {data.map((row, rowIndex) => (
            <Tr key={rowIndex}>
              {Object.values(row).map((cell, cellIndex) => (
                <Td key={cellIndex}>{cell}</Td>
              ))}
            </Tr>
          ))}
        </Tbody>
      </Table>
    </TableContainer>
  );
};

export default InstrumentedDataTable;

The layout effect places the start mark after React commits the DOM but before the browser paints, and the end mark on the next macrotask, approximating the component’s full render-and-paint cycle. The result is pushed to a global window.__PERF_METRICS__ object. This is a simple mechanism for decoupling the application from the test runner: the app just exposes data, and the test runner is responsible for collecting it.
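
Because the Playwright spec in the next section is written in TypeScript, it helps to declare this global contract once so that window.__PERF_METRICS__ type-checks inside page.evaluate callbacks. A minimal ambient declaration, assuming a hypothetical types/perf-metrics.d.ts that the test tsconfig includes:

// types/perf-metrics.d.ts (hypothetical path; include it via the test tsconfig)
// Shared contract: the app writes to window.__PERF_METRICS__, the tests read from it.
export {};

declare global {
  interface Window {
    __PERF_METRICS__?: Record<string, number>;
  }
}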

Crafting the Playwright Collection Script

With the application instrumented, the Playwright script’s job is to orchestrate the browser, trigger a render, and scrape the resulting metrics. The script needs to be robust, waiting for the application to be in the correct state before attempting to collect data.

First, the Playwright configuration needs to be set up to handle custom test logic.

// playwright.config.ts

import { defineConfig, devices } from '@playwright/test';
import path from 'path';

// We'll pass the runId via an environment variable to the test
export const RUN_ID = process.env.PERF_RUN_ID || `local-${Date.now()}`; // Date.now() avoids ':' characters in file names
export const METRICS_OUTPUT_DIR = path.join(__dirname, 'perf-results');

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  forbidOnly: !!process.env.CI,
  retries: process.env.CI ? 2 : 0,
  workers: process.env.CI ? 1 : undefined,
  reporter: 'html',
  use: {
    baseURL: 'http://localhost:3000',
    trace: 'on-first-retry',
  },
  projects: [
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
    },
  ],
  // Custom setup to ensure the output directory exists
  globalSetup: require.resolve('./tests/global-setup'),
});

The test script itself loads the page, interacts with it if necessary to trigger the component render, and then executes a page evaluation to pull the metrics.

// tests/performance.spec.ts

import { test, expect, Page } from '@playwright/test';
import fs from 'fs/promises';
import path from 'path';
import { METRICS_OUTPUT_DIR, RUN_ID } from '../playwright.config';

type PerfMetrics = {
  [key: string]: number;
};

async function getPerformanceMetrics(page: Page): Promise<PerfMetrics> {
  // Wait for our custom metrics object to be populated
  await page.waitForFunction(() => window.__PERF_METRICS__ && window.__PERF_METRICS__.DataTable, null, { timeout: 10000 });

  const customMetrics = await page.evaluate(() => window.__PERF_METRICS__);

  // Additionally, collect standard Web Vitals like LCP
  const lcp = await page.evaluate(() => {
    return new Promise((resolve) => {
      new PerformanceObserver((entryList) => {
        const entries = entryList.getEntries();
        if (entries.length > 0) {
          // The most recent entry is the current LCP candidate.
          const lcpEntry = entries[entries.length - 1];
          resolve(lcpEntry.startTime);
        }
      }).observe({ type: 'largest-contentful-paint', buffered: true });

      // If no LCP entry is ever reported, resolve with a sentinel rather than hanging the test.
      setTimeout(() => resolve(-1), 5000);
    });
  });

  return { ...customMetrics, LargestContentfulPaint: lcp as number };
}

test.describe('DataTable Performance', () => {
  test('should render large dataset within performance budget', async ({ page }) => {
    await page.goto('/');

    const table = page.getByTestId('large-data-table');
    await expect(table).toBeVisible();

    const metrics = await getPerformanceMetrics(page);
    
    // In a real-world project, you might have assertions here, but our primary
    // goal is data collection, not pass/fail in the test runner itself.
    // The regression analysis happens later in Prometheus.
    expect(metrics.DataTable).toBeGreaterThan(0);
    expect(metrics.LargestContentfulPaint).toBeGreaterThan(0);

    // Save metrics to a file for the orchestrator to process
    const result = {
      runId: RUN_ID,
      timestamp: new Date().toISOString(),
      metrics,
    };

    const outputPath = path.join(METRICS_OUTPUT_DIR, `${RUN_ID}.json`);
    await fs.writeFile(outputPath, JSON.stringify(result, null, 2));

    console.log(`Performance metrics saved to ${outputPath}`);
  });
});

// tests/global-setup.ts
import fs from 'fs/promises';
import path from 'path';
import { METRICS_OUTPUT_DIR } from '../playwright.config';

async function globalSetup() {
  await fs.mkdir(METRICS_OUTPUT_DIR, { recursive: true });
}

export default globalSetup;

This script collects both our custom DataTable render duration and the standard LargestContentfulPaint metric. It doesn’t fail the build if a metric exceeds a threshold; instead, it writes the results to a JSON file. The orchestration script will handle the subsequent processing.

The Orchestration and Data Persistence Layer

This is the core of the system, a Node.js script that ties everything together. It’s responsible for generating context, running the tests, and pushing the results to their respective datastores.

For infrastructure, we need a DynamoDB table. Using AWS CDK for definition:

// infrastructure/stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class PerformanceMonitoringStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new dynamodb.Table(this, 'PerfRunMetadata', {
      tableName: 'performance-run-metadata',
      partitionKey: { name: 'runId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: cdk.RemovalPolicy.RETAIN, // Keep historical run metadata even if the stack is deleted
    });
  }
}

The orchestration script uses the AWS SDK and external processes to execute the workflow.

// scripts/run-perf-test.js
const { execSync } = require('child_process');
const { randomUUID } = require('crypto');
const fs = require('fs/promises');
const path = require('path');
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand } = require('@aws-sdk/lib-dynamodb');
const fetch = require('node-fetch'); // For Prometheus remote write

// --- Configuration ---
const DYNAMO_TABLE_NAME = 'performance-run-metadata';
const PROMETHEUS_REMOTE_WRITE_URL = process.env.PROMETHEUS_REMOTE_WRITE_URL; // e.g., http://prometheus:9090/api/v1/write
const METRICS_OUTPUT_DIR = path.join(__dirname, '..', 'perf-results');

const region = 'us-east-1';
const ddbClient = new DynamoDBClient({ region });
const ddbDocClient = DynamoDBDocumentClient.from(ddbClient);

// --- Helper Functions ---
function getGitCommit() {
  try {
    return execSync('git rev-parse HEAD').toString().trim();
  } catch (e) {
    console.warn('Could not get git commit. Defaulting to "unknown".');
    return 'unknown';
  }
}

async function getDataVersionHash(dataPath) {
  // DVC stores metadata in a .dvc file
  const dvcFilePath = `${dataPath}.dvc`;
  try {
    const dvcFileContent = await fs.readFile(dvcFilePath, 'utf-8');
    // A simple parser for the md5 hash. A real implementation should use a YAML parser.
    const match = dvcFileContent.match(/md5: (\w+)/);
    if (match && match[1]) {
      return match[1];
    }
    throw new Error('MD5 hash not found in .dvc file');
  } catch (e) {
    console.warn(`Could not parse DVC file ${dvcFilePath}. Error: ${e.message}`);
    return 'unknown';
  }
}

// Push metrics toward Prometheus.
// NOTE: the real remote-write protocol is Snappy-compressed Protobuf. For brevity, this example
// sends a simplified JSON payload and assumes a receiver that accepts it; a production system
// should use the official remote-write encoding (an alternative push path via a Pushgateway
// is sketched after this script).
async function pushToPrometheus(metricsData, metadata) {
  if (!PROMETHEUS_REMOTE_WRITE_URL) {
    console.log('PROMETHEUS_REMOTE_WRITE_URL not set. Skipping push.');
    return;
  }

  const series = Object.entries(metricsData.metrics).map(([key, value]) => ({
    // Prometheus metric names are conventionally lowercase snake_case.
    __name__: `frontend_perf_${key.toLowerCase()}`,
    // Labels common to all metrics in this run
    labels: {
      job: 'frontend-performance',
      runId: metadata.runId,
      commit: metadata.gitCommit.substring(0, 7), // Short hash for cardinality sanity
      data_version: metadata.dataVersion.substring(0, 7),
    },
    // The value and timestamp
    samples: [[Date.parse(metricsData.timestamp) / 1000, String(value)]],
  }));

  try {
    const response = await fetch(PROMETHEUS_REMOTE_WRITE_URL, {
      method: 'POST',
      body: JSON.stringify({ series }),
      headers: { 'Content-Type': 'application/json' },
    });
    if (!response.ok) {
      throw new Error(`Prometheus remote write failed: ${response.statusText}`);
    }
    console.log('Successfully pushed metrics to Prometheus.');
  } catch (error) {
    console.error('Error pushing metrics to Prometheus:', error);
    // In a CI environment, you might want to fail the job here.
    throw error;
  }
}

async function saveMetadataToDynamoDB(metadata) {
  const command = new PutCommand({
    TableName: DYNAMO_TABLE_NAME,
    Item: metadata,
  });

  try {
    await ddbDocClient.send(command);
    console.log(`Successfully saved metadata for runId ${metadata.runId} to DynamoDB.`);
  } catch (error) {
    console.error('Error saving metadata to DynamoDB:', error);
    throw error;
  }
}

// --- Main Execution Logic ---
async function main() {
  const runId = randomUUID();
  console.log(`Starting performance test run with ID: ${runId}`);

  // 1. Gather Context
  const gitCommit = getGitCommit();
  // Assume our test data is tracked by DVC at 'public/data/large-dataset.json'
  execSync('dvc pull public/data/large-dataset.json', { stdio: 'inherit' });
  const dataVersion = await getDataVersionHash('public/data/large-dataset.json');

  const metadata = {
    runId,
    gitCommit,
    dataVersion,
    runTimestamp: new Date().toISOString(),
    ciJobUrl: process.env.CI_JOB_URL || 'local',
  };

  // 2. Run Playwright Tests
  try {
    console.log('Running Playwright...');
    execSync('npx playwright test --project=chromium', {
      env: { ...process.env, PERF_RUN_ID: runId },
      stdio: 'inherit',
    });
  } catch (e) {
    console.error('Playwright tests failed.');
    // You might want to upload failed metadata to DynamoDB for triage
    metadata.status = 'FAILED';
    await saveMetadataToDynamoDB(metadata);
    process.exit(1);
  }

  // 3. Process and Persist Results
  const metricsFilePath = path.join(METRICS_OUTPUT_DIR, `${runId}.json`);
  const metricsData = JSON.parse(await fs.readFile(metricsFilePath, 'utf-8'));
  
  await pushToPrometheus(metricsData, metadata);
  metadata.status = 'SUCCESS';
  await saveMetadataToDynamoDB(metadata);

  console.log('Performance test run completed successfully.');
}

main().catch(err => {
  console.error('Orchestration script failed:', err);
  process.exit(1);
});

This script is the engine. It collects the git and DVC hashes, runs the test, and then fans out the results to DynamoDB for context and Prometheus for time-series analysis. The pitfall here is error handling; a failure in any step (Playwright, Prometheus push, DynamoDB write) must be handled gracefully, likely by marking the run as failed in DynamoDB for later inspection.
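
If a remote-write-compatible receiver is more infrastructure than the pipeline warrants, a pragmatic alternative is to push through a Prometheus Pushgateway with the prom-client library. The following is a minimal sketch under two assumptions not made elsewhere in this post: a recent, promise-based prom-client and a gateway reachable at PUSHGATEWAY_URL.

// scripts/push-via-gateway.js - a hypothetical drop-in alternative to pushToPrometheus()
const client = require('prom-client');

async function pushViaGateway(metricsData, metadata) {
  const registry = new client.Registry();

  // One gauge per collected metric, labelled with the same low-cardinality context as before.
  for (const [key, value] of Object.entries(metricsData.metrics)) {
    const gauge = new client.Gauge({
      name: `frontend_perf_${key.toLowerCase()}`,
      help: `Synthetic front-end performance metric: ${key}`,
      labelNames: ['runId', 'commit', 'data_version'],
      registers: [registry],
    });
    gauge.set(
      {
        runId: metadata.runId,
        commit: metadata.gitCommit.substring(0, 7),
        data_version: metadata.dataVersion.substring(0, 7),
      },
      value
    );
  }

  const gateway = new client.Pushgateway(process.env.PUSHGATEWAY_URL, {}, registry);
  await gateway.pushAdd({ jobName: 'frontend-performance' });
}

module.exports = { pushViaGateway };

The trade-off is that a Pushgateway retains the last pushed value until it is overwritten or deleted, so stale series need to be cleaned up or grouped per run.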

Analysis and the Closed Loop

With data flowing, we can finally close the loop. A Grafana dashboard is set up to visualize the metrics stored in Prometheus. We can create panels that track the p95 render duration of our DataTable component over time.

A PromQL query to spot a regression might look like this:

# 95th percentile of the DataTable render duration over a 1-day window
quantile_over_time(0.95, frontend_perf_datatable[1d])

When we see a spike on the graph, we’ve identified a regression. The workflow is:

  1. Identify the approximate time of the regression from the Grafana chart.
  2. Use the runId label on the problematic Prometheus data point.
  3. Query the performance-run-metadata table in DynamoDB with that runId (a lookup sketch follows this list).
  4. The result from DynamoDB gives us the exact gitCommit and dataVersion that caused the performance degradation.
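
Step 3 is a single-key lookup. A minimal sketch with the AWS SDK v3 DocumentClient, reusing the table name and region from the orchestration script (the script path is hypothetical):

// scripts/lookup-run.js - hypothetical helper for step 3 of the workflow
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, GetCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({ region: 'us-east-1' }));

async function lookupRun(runId) {
  const { Item } = await ddb.send(new GetCommand({
    TableName: 'performance-run-metadata',
    Key: { runId },
  }));
  return Item; // e.g. { runId, gitCommit, dataVersion, runTimestamp, ciJobUrl, status }
}

// Usage: node scripts/lookup-run.js <runId>
lookupRun(process.argv[2])
  .then((item) => console.log(JSON.stringify(item, null, 2)))
  .catch((err) => { console.error(err); process.exit(1); });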

This system transformed our performance tuning from guesswork into a data-driven process. We were able to definitively prove that a recent “optimization” to a data normalization function actually caused a 15% increase in table render times. In another case, we found that adding three new columns to our test dataset, a change tracked only in DVC, was the sole cause of a major slowdown.

The primary limitation of this architecture is its handling of cardinality in Prometheus. While we’ve offloaded most unique identifiers to DynamoDB, using a short commit hash as a label can still be risky on projects with thousands of commits per day. A future iteration might replace the commit hash label with a version number that increments on each merge to the main branch, reducing the label set significantly. Furthermore, this synthetic monitoring only captures performance in a controlled environment. The next architectural evolution would be to correlate these synthetic results with Real User Monitoring (RUM) data to confirm that CI-detected regressions have a measurable impact on the production user experience.

