Implementing a Domain Aggregate in PHP to Manage an In-Memory Vector Index Lifecycle


The technical pain point was deceptive in its simplicity: our product search was failing. In our domain, a “curated product collection” is a core business concept. Users searching for “sunny day outfit” were not finding collections containing floral print dresses and straw hats, because simple keyword matching on product descriptions is fundamentally inadequate. The business required semantic understanding. The obvious path was to offload this to a dedicated vector database or a third-party search service. We rejected it. For our highest-value “flash sale” collections, network latency for every search query was unacceptable, and worse, we couldn’t guarantee transactional consistency between a collection’s state in our primary database and its representation in an external search index.

This led to a controversial architectural decision: manage the vector index inside the Domain Model. Specifically, the CuratedCollection Aggregate Root would become responsible for its own semantic searchability. The vector index would not be an external dependency but an integral part of the Aggregate’s state. This meant the index’s lifecycle—creation, updates, and deletion—would be governed by the same domain logic and transactional boundaries as the collection itself. In a real-world project, coupling your domain to such a specific implementation detail is a major red flag, but the performance and consistency requirements were absolute. We decided to proceed, fully aware of the trade-offs. The chosen stack was our existing one: PHP, with the understanding that this would only be viable in a long-running application server environment like Swoole or RoadRunner, not traditional PHP-FPM where memory is discarded after each request.

graph TD
    subgraph "Application Layer"
        A[UpdateCollectionService] --> B(CuratedCollectionRepository)
        C[SearchInCollectionService] --> B
    end

    subgraph "Domain Layer"
        B -- loads/saves --> D{CuratedCollection Aggregate}
        D -- contains --> E[Product Entity]
        D -- contains --> F[Value Objects: CollectionId, ProductId]
        D -- manages lifecycle of --> G(InMemoryVectorIndex)
    end

    subgraph "Infrastructure Layer"
        B_IMPL(RedisAndDbCollectionRepository) -- implements --> B
        B_IMPL -- reads/writes state --> H[PostgreSQL for Products]
        B_IMPL -- serializes/deserializes index --> I[Redis for Vector Index]
        
        J(EmbeddingServiceClient) -- generates vectors --> D
    end

    style D fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#ccf,stroke:#333,stroke-width:2px

The Domain Foundation: Value Objects and Aggregate Root

Before touching vectors, the domain model itself needed to be sound. We started with the core building blocks. The Vector itself is a perfect candidate for a Value Object: an immutable array of floats with behavior for distance calculation. In a production system, these calculations are performance-critical and should be implemented in native code, either called through FFI or packaged as a custom extension. For this implementation, we’ll use pure PHP to keep the example self-contained, while acknowledging that this is a major performance bottleneck.

<?php

declare(strict_types=1);

namespace App\Domain\Model\Collection\ValueObject;

use InvalidArgumentException;
use Countable;

final class Vector implements Countable
{
    /** @var float[] */
    private readonly array $components;
    private readonly int $dimension;

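    public function __construct(array $components)
    {
        if (empty($components)) {
            throw new InvalidArgumentException('Vector components cannot be empty.');
        }
        // Re-index to a list and coerce to float so that cosineSimilarity() can rely on
        // contiguous integer offsets and float arithmetic.
        $validated = [];
        foreach ($components as $component) {
            if (!is_int($component) && !is_float($component)) {
                throw new InvalidArgumentException('Vector components must be numeric.');
            }
            $validated[] = (float) $component;
        }
        $this->components = $validated;
        $this->dimension = count($validated);
    }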

    public function getComponents(): array
    {
        return $this->components;
    }

    public function getDimension(): int
    {
        return $this->dimension;
    }

    public function count(): int
    {
        return $this->dimension;
    }

    /**
     * Calculates the cosine similarity between this vector and another.
     * Returns a value between -1 and 1. Higher is more similar.
     */
    public function cosineSimilarity(self $other): float
    {
        if ($this->dimension !== $other->dimension) {
            throw new InvalidArgumentException('Vectors must have the same dimension for cosine similarity.');
        }

        $dotProduct = 0.0;
        $magnitudeA = 0.0;
        $magnitudeB = 0.0;

        for ($i = 0; $i < $this->dimension; $i++) {
            $dotProduct += $this->components[$i] * $other->components[$i];
            $magnitudeA += $this->components[$i] ** 2;
            $magnitudeB += $other->components[$i] ** 2;
        }

        $magnitudeA = sqrt($magnitudeA);
        $magnitudeB = sqrt($magnitudeB);

        if ($magnitudeA == 0.0 || $magnitudeB == 0.0) {
            return 0.0; // Avoid division by zero
        }
        
        return $dotProduct / ($magnitudeA * $magnitudeB);
    }
}
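One inexpensive optimization, which is not part of the implementation above, is to scale vectors to unit length once when they enter the index, so that cosine similarity at query time reduces to a plain dot product. The following is a minimal sketch under that assumption; `normalize` and `dotProduct` are hypothetical helpers operating on plain float arrays rather than on the Vector class.

```php
<?php

declare(strict_types=1);

/**
 * Scales a vector to unit length. Hypothetical helper, not part of the Vector class above.
 *
 * @param float[] $components
 * @return float[]
 */
function normalize(array $components): array
{
    $sumOfSquares = 0.0;
    foreach ($components as $component) {
        $sumOfSquares += $component * $component;
    }

    $magnitude = sqrt($sumOfSquares);
    if ($magnitude === 0.0) {
        return $components; // Zero vector: there is nothing to scale.
    }

    return array_map(
        static fn (float $component): float => $component / $magnitude,
        $components
    );
}

/**
 * Dot product of two equally sized lists. For unit-length inputs this equals
 * their cosine similarity.
 *
 * @param float[] $a
 * @param float[] $b
 */
function dotProduct(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension.');
    }

    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }

    return $sum;
}

// Usage: both vectors point in the same direction, so the similarity is close to 1.0.
$stored = normalize([6.0, 8.0]);
$query = normalize([3.0, 4.0]);
$similarity = dotProduct($query, $stored);
```

The trade-off is that the original magnitudes are lost, which is acceptable only if nothing else in the domain depends on them.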

With the Vector VO defined, we can structure the CuratedCollection Aggregate Root. It contains a list of Product entities and, crucially, an instance of our InMemoryVectorIndex. Its public methods represent the only valid state transitions (use cases) for the collection.

<?php

declare(strict_types=1);

namespace App\Domain\Model\Collection;

use App\Domain\Model\Collection\ValueObject\CollectionId;
use App\Domain\Model\Collection\ValueObject\Product;
use App\Domain\Model\Collection\ValueObject\ProductId;
use App\Domain\Model\Collection\ValueObject\Vector;
use App\Domain\Service\EmbeddingGeneratorInterface;
use App\Domain\Model\Collection\Exception\ProductNotFoundException;

class CuratedCollection
{
    private CollectionId $id;
    private string $name;
    /** @var Product[] */
    private array $products = [];
    private InMemoryVectorIndex $searchIndex;

    // We use a logger for visibility into the aggregate's internal decisions.
    private ?\Psr\Log\LoggerInterface $logger;

    public function __construct(CollectionId $id, string $name, ?\Psr\Log\LoggerInterface $logger = null)
    {
        $this->id = $id;
        $this->name = $name;
        // The index is born with the aggregate. Its lifecycle is tied completely.
        $this->searchIndex = new InMemoryVectorIndex();
        $this->logger = $logger;
    }

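    public function getId(): CollectionId
    {
        return $this->id;
    }

    // Required by the repository when persisting the collection's main properties.
    public function getName(): string
    {
        return $this->name;
    }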

    public function getProductIds(): array
    {
        return array_keys($this->products);
    }
    
    public function getProducts(): array
    {
        return $this->products;
    }

    public function getSearchIndex(): InMemoryVectorIndex
    {
        return $this->searchIndex;
    }

    public function addProduct(Product $product, EmbeddingGeneratorInterface $embeddingGenerator): void
    {
        $productIdStr = (string) $product->getId();
        if (isset($this->products[$productIdStr])) {
            // Idempotency is important. Don't throw an error if it already exists.
            $this->logger?->info('Product already in collection, skipping.', ['collection_id' => (string)$this->id, 'product_id' => $productIdStr]);
            return;
        }

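        // Generate the embedding before mutating any state: if the generator throws,
        // the product list and the search index stay consistent with each other.
        $vector = $embeddingGenerator->generateForText($product->getDescription());

        // This is a critical domain operation: updating the search index is not optional.
        $this->products[$productIdStr] = $product;
        $this->searchIndex->add($productIdStr, $vector);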
        
        $this->logger?->info('Product added to collection and search index.', ['collection_id' => (string)$this->id, 'product_id' => $productIdStr]);
    }

    public function removeProduct(ProductId $productId): void
    {
        $productIdStr = (string) $productId;
        if (!isset($this->products[$productIdStr])) {
            throw new ProductNotFoundException("Product with ID {$productIdStr} not found in collection {$this->id}.");
        }

        unset($this->products[$productIdStr]);
        $this->searchIndex->remove($productIdStr);

        $this->logger?->info('Product removed from collection and search index.', ['collection_id' => (string)$this->id, 'product_id' => $productIdStr]);
    }

    /**
     * Performs a semantic search within the collection.
     * The aggregate itself is the source of truth for this query.
     * @return array An array of [product_id => similarity_score]
     */
    public function findSimilarProducts(string $searchText, int $k, EmbeddingGeneratorInterface $embeddingGenerator): array
    {
        $queryVector = $embeddingGenerator->generateForText($searchText);
        $similarProductIds = $this->searchIndex->findNearest($queryVector, $k);

        $results = [];
        foreach ($similarProductIds as $productId => $similarity) {
            if (isset($this->products[$productId])) {
                $results[$productId] = $similarity;
            } else {
                // This indicates a data inconsistency, which shouldn't happen if the aggregate logic is sound.
                $this->logger?->warning('Search index returned a product ID not present in the aggregate.', [
                    'collection_id' => (string)$this->id, 
                    'product_id' => $productId
                ]);
            }
        }
        return $results;
    }
}
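The aggregate depends on EmbeddingGeneratorInterface, whose definition is not shown here. Judging from how it is called, the contract could look roughly like the sketch below. The `HashingEmbeddingGenerator` class is a hypothetical, deterministic stand-in that is useful for local experiments only; it carries no semantic meaning and is not the embedding service the system used. The `Vector` class in the block is a reduced stand-in so the example runs on its own.

```php
<?php

declare(strict_types=1);

// Reduced stand-in for the Vector value object, limited to what this sketch needs.
final class Vector
{
    /** @param float[] $components */
    public function __construct(private readonly array $components)
    {
    }

    /** @return float[] */
    public function getComponents(): array
    {
        return $this->components;
    }
}

// Assumed shape of the contract, inferred from the calls made by the aggregate.
interface EmbeddingGeneratorInterface
{
    public function generateForText(string $text): Vector;
}

/**
 * Hypothetical generator: hashes each token into one of $dimension buckets and counts hits.
 * Deterministic, but it does not capture meaning the way a real embedding model does.
 */
final class HashingEmbeddingGenerator implements EmbeddingGeneratorInterface
{
    public function __construct(private readonly int $dimension = 8)
    {
    }

    public function generateForText(string $text): Vector
    {
        $components = array_fill(0, $this->dimension, 0.0);
        $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

        foreach ($tokens as $token) {
            // abs() guards against negative checksums on 32-bit builds.
            $bucket = abs(crc32($token)) % $this->dimension;
            $components[$bucket] += 1.0;
        }

        return new Vector($components);
    }
}

// Usage: the same text always maps to the same vector.
$generator = new HashingEmbeddingGenerator(8);
$vector = $generator->generateForText('A light dress with a floral pattern.');
```

Passing the generator as a method argument, as the aggregate does, keeps the aggregate free of infrastructure references while still letting it own the moment at which the index is updated.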

The In-Memory Vector Index Implementation

This component is the heart of the performance-critical part of our system. A real-world implementation would use a sophisticated algorithm like HNSW (Hierarchical Navigable Small World) for efficient approximate nearest neighbor search. Building a production-grade HNSW index is a massive undertaking. For this post-mortem, we’ll implement a brute-force k-NN index. Every query scores all n stored vectors and then sorts the scores, so the cost grows at least linearly with the collection size, but it correctly demonstrates the architectural pattern of being managed by the Aggregate. The key is its simple, well-defined interface.

<?php

declare(strict_types=1);

namespace App\Domain\Model\Collection;

use App\Domain\Model\Collection\ValueObject\Vector;

/**
 * A simple, brute-force in-memory vector index.
 * In production, this would be replaced by a more efficient ANN implementation (e.g., HNSW).
 */
class InMemoryVectorIndex implements \Serializable
{
    /** @var array<string, Vector> */
    private array $vectors = [];

    public function add(string $id, Vector $vector): void
    {
        $this->vectors[$id] = $vector;
    }

    public function remove(string $id): void
    {
        unset($this->vectors[$id]);
    }

    public function getVectorById(string $id): ?Vector
    {
        return $this->vectors[$id] ?? null;
    }

    /**
     * @return array<string, float> An array of IDs and their cosine similarity scores.
     */
    public function findNearest(Vector $queryVector, int $k): array
    {
        if (empty($this->vectors)) {
            return [];
        }

        $similarities = [];
        foreach ($this->vectors as $id => $vector) {
            try {
                $similarities[$id] = $queryVector->cosineSimilarity($vector);
            } catch (\InvalidArgumentException $e) {
                // Log this error in a real app. It indicates a dimension mismatch.
                continue;
            }
        }

        // Sort by similarity score in descending order
        arsort($similarities);

        return array_slice($similarities, 0, $k, true);
    }
    
    public function count(): int
    {
        return count($this->vectors);
    }

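    // --- Serialization for Persistence ---
    // A pitfall here is using PHP's native serialize. For cross-version compatibility
    // or if the worker environment changes, a more robust format like JSON or MessagePack
    // would be safer. For simplicity, we use the native mechanism.
    //
    // Implementing only the Serializable interface is deprecated as of PHP 8.1, so the class
    // also provides __serialize()/__unserialize(), which take precedence when present.

    public function __serialize(): array
    {
        return ['vectors' => $this->vectors];
    }

    public function __unserialize(array $data): void
    {
        $vectors = $data['vectors'] ?? [];
        // Fall back to an empty index if the payload is malformed.
        $this->vectors = is_array($vectors) ? $vectors : [];
    }

    public function serialize(): string
    {
        return serialize($this->vectors);
    }

    public function unserialize(string $data): void
    {
        $unserializedData = unserialize($data);
        if (is_array($unserializedData)) {
            $this->vectors = $unserializedData;
        } else {
            // Handle potential corruption of serialized data
            $this->vectors = [];
        }
    }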
}
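findNearest sorts every score with arsort even though only k results are returned. A modest refinement that stays within brute-force search is to keep a bounded min-heap of the k best candidates while scanning. The sketch below assumes a hypothetical standalone function `topK` that takes the same ID-to-score map; it is an illustration, not the implementation used above.

```php
<?php

declare(strict_types=1);

/**
 * Returns the $k highest-scoring entries, best first, without sorting all scores.
 * Hypothetical helper; findNearest() above uses arsort() and array_slice() instead.
 *
 * @param array<string, float> $scores ID => similarity
 * @return array<string, float>
 */
function topK(array $scores, int $k): array
{
    if ($k <= 0) {
        return [];
    }

    // Min-heap of [score, id] pairs: the weakest retained candidate sits on top.
    $heap = new SplMinHeap();
    foreach ($scores as $id => $score) {
        if ($heap->count() < $k) {
            $heap->insert([$score, (string) $id]);
        } elseif ($score > $heap->top()[0]) {
            // The new score beats the weakest retained one, so swap them.
            $heap->extract();
            $heap->insert([$score, (string) $id]);
        }
    }

    // Extraction yields ascending scores, so collect first and reverse afterwards.
    $ascending = [];
    while (!$heap->isEmpty()) {
        [$score, $id] = $heap->extract();
        $ascending[$id] = $score;
    }

    return array_reverse($ascending, true);
}

// Usage: keep the two best matches out of three candidates.
$best = topK(['p1' => 0.1, 'p2' => 0.9, 'p3' => 0.5], 2);
```

This reduces the selection step from sorting n scores to maintaining a heap of size k. Scoring every vector remains the dominant cost, so it does not replace an ANN index.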

Persistence: The Hardest Problem

The single biggest challenge with this architecture is persistence. An ORM cannot map an in-memory object graph like our vector index. We need a custom repository implementation that understands how to save and reconstruct the entire Aggregate state. We chose a hybrid approach: product metadata is stored in a relational database (PostgreSQL), while the serialized vector index, being a performance-critical blob, is stored in Redis.

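The repository’s save method has to behave transactionally: it must succeed or fail as a whole. A failure to write the index to Redis after committing the product list to Postgres would lead to a catastrophic state inconsistency, so the implementation writes to Redis before committing the database transaction; if the Redis write throws, the database changes are rolled back. The reverse failure, a commit that fails after Redis has already been updated, is still possible. In a production system, this would require a two-phase commit or a saga pattern, but for our controlled environment, we accepted the risk and added robust error handling and logging.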

<?php

declare(strict_types=1);

namespace App\Infrastructure\Persistence;

use App\Domain\Model\Collection\CuratedCollection;
use App\Domain\Model\Collection\CuratedCollectionRepositoryInterface;
use App\Domain\Model\Collection\ValueObject\CollectionId;
use App\Domain\Model\Collection\ValueObject\Product;
use App\Domain\Model\Collection\ValueObject\ProductId;
use Doctrine\DBAL\Connection;
use Predis\Client as RedisClient;
use Psr\Log\LoggerInterface;

class RedisAndDbCollectionRepository implements CuratedCollectionRepositoryInterface
{
    private const REDIS_KEY_PREFIX = 'collection:index:';

    private Connection $db;
    private RedisClient $redis;
    private LoggerInterface $logger;

    public function __construct(Connection $db, RedisClient $redis, LoggerInterface $logger)
    {
        $this->db = $db;
        $this->redis = $redis;
        $this->logger = $logger;
    }

    public function findById(CollectionId $id): ?CuratedCollection
    {
        // 1. Fetch collection metadata from the primary DB
        $stmt = $this->db->executeQuery('SELECT id, name FROM collections WHERE id = ?', [$id->toString()]);
        $collectionData = $stmt->fetchAssociative();

        if ($collectionData === false) {
            return null;
        }

        $collection = new CuratedCollection($id, $collectionData['name'], $this->logger);

        // 2. Fetch all products associated with this collection
        $productStmt = $this->db->executeQuery(
            'SELECT p.id, p.name, p.description FROM products p JOIN collection_products cp ON p.id = cp.product_id WHERE cp.collection_id = ?',
            [$id->toString()]
        );
        $productsData = $productStmt->fetchAllAssociative();

        // Rehydrate the product list in the aggregate.
        // We do not generate embeddings here; we assume the saved index is the source of truth.
        foreach ($productsData as $productData) {
            $product = new Product(
                new ProductId($productData['id']),
                $productData['name'],
                $productData['description']
            );
            // This is a private method or a protected constructor variant in a real DDD implementation
            // to rehydrate without triggering domain events/logic. For simplicity, we just set it.
            // This is a common pattern: rehydration logic is different from business logic.
            $this->addProductToCollectionViaReflection($collection, $product);
        }

        // 3. Fetch and unserialize the vector index from Redis
        $redisKey = self::REDIS_KEY_PREFIX . $id->toString();
        $serializedIndex = $this->redis->get($redisKey);

        if ($serializedIndex) {
            try {
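                // Restrict unserialize() to the classes the index is expected to contain.
                $index = unserialize($serializedIndex, [
                    'allowed_classes' => [
                        \App\Domain\Model\Collection\InMemoryVectorIndex::class,
                        \App\Domain\Model\Collection\ValueObject\Vector::class,
                    ],
                ]);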
                if ($index instanceof \App\Domain\Model\Collection\InMemoryVectorIndex) {
                    // Replace the new, empty index with the hydrated one
                    $this->setIndexOnCollectionViaReflection($collection, $index);
                } else {
                     $this->logger->error('Failed to unserialize index from Redis, data corrupted.', ['key' => $redisKey]);
                }
            } catch (\Throwable $e) {
                $this->logger->error('Exception during index unserialization.', ['key' => $redisKey, 'exception' => $e]);
                // Decide on a failure strategy: return null, or return a collection without search?
                // Returning null is safer as it indicates a failure to load the complete state.
                return null;
            }
        }
        
        return $collection;
    }

    public function save(CuratedCollection $collection): void
    {
        $this->db->beginTransaction();
        try {
            // 1. Persist the collection's main properties
            $this->db->executeStatement(
                'INSERT INTO collections (id, name) VALUES (?, ?) ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name',
                [$collection->getId()->toString(), $collection->getName()]
            );

            // 2. Sync the list of products
            // A common mistake is to not handle deletions. We must diff the states.
            $currentProductIdsInDb = $this->getProductIdsInDb($collection->getId());
            $productIdsInAggregate = $collection->getProductIds();

            $toAdd = array_diff($productIdsInAggregate, $currentProductIdsInDb);
            $toRemove = array_diff($currentProductIdsInDb, $productIdsInAggregate);

            foreach ($toAdd as $productId) {
                 $this->db->executeStatement(
                    'INSERT INTO collection_products (collection_id, product_id) VALUES (?, ?)',
                    [$collection->getId()->toString(), $productId]
                );
            }
            foreach ($toRemove as $productId) {
                 $this->db->executeStatement(
                    'DELETE FROM collection_products WHERE collection_id = ? AND product_id = ?',
                    [$collection->getId()->toString(), $productId]
                );
            }

            // 3. Serialize and save the vector index to Redis.
            // This happens before commit() so that a Redis failure rolls back the DB changes.
            $redisKey = self::REDIS_KEY_PREFIX . $collection->getId()->toString();
            $serializedIndex = serialize($collection->getSearchIndex());
            $this->redis->set($redisKey, $serializedIndex);

            $this->db->commit();
        } catch (\Throwable $e) {
            $this->db->rollBack();
            $this->logger->critical('Failed to save collection transactionally.', ['id' => $collection->getId()->toString(), 'exception' => $e]);
            // Propagate the exception to inform the Application Service
            throw $e;
        }
    }

    private function getProductIdsInDb(CollectionId $id): array
    {
        $stmt = $this->db->executeQuery(
            'SELECT product_id FROM collection_products WHERE collection_id = ?',
            [$id->toString()]
        );
        return $stmt->fetchFirstColumn();
    }
    
    // Using reflection to bypass public methods for rehydration is a common, pragmatic DDD technique.
    private function addProductToCollectionViaReflection(CuratedCollection $collection, Product $product): void
    {
        $reflection = new \ReflectionClass($collection);
        $productsProp = $reflection->getProperty('products');
        $products = $productsProp->getValue($collection);
        $products[(string)$product->getId()] = $product;
        $productsProp->setValue($collection, $products);
    }
    
    private function setIndexOnCollectionViaReflection(CuratedCollection $collection, \App\Domain\Model\Collection\InMemoryVectorIndex $index): void
    {
        $reflection = new \ReflectionClass($collection);
        $indexProp = $reflection->getProperty('searchIndex');
        $indexProp->setValue($collection, $index);
    }
}
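The comments in the index class mention JSON as a more portable alternative to native serialization. A minimal sketch of what such an encoding could look like follows, assuming the index is exported as a map of product ID to float components. The function names and the `version` field are hypothetical and are not part of the repository above.

```php
<?php

declare(strict_types=1);

/**
 * Encodes an index snapshot as JSON. Hypothetical helper.
 *
 * @param array<string, float[]> $vectors product ID => components
 */
function encodeIndexPayload(array $vectors): string
{
    return json_encode(
        ['version' => 1, 'vectors' => $vectors],
        // Keep "1.0" as a float in the output instead of collapsing it to "1".
        JSON_THROW_ON_ERROR | JSON_PRESERVE_ZERO_FRACTION
    );
}

/**
 * Decodes a snapshot produced by encodeIndexPayload().
 *
 * @return array<string, float[]>
 */
function decodeIndexPayload(string $json): array
{
    $payload = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (
        !is_array($payload)
        || ($payload['version'] ?? null) !== 1
        || !is_array($payload['vectors'] ?? null)
    ) {
        throw new UnexpectedValueException('Unsupported or malformed index payload.');
    }

    $vectors = [];
    foreach ($payload['vectors'] as $id => $components) {
        if (!is_array($components)) {
            throw new UnexpectedValueException("Malformed vector for ID {$id}.");
        }
        // Coerce explicitly so integers that slipped into the payload become floats.
        $vectors[(string) $id] = array_map('floatval', $components);
    }

    return $vectors;
}

// Usage: a snapshot survives a round trip unchanged.
$snapshot = ['p1' => [0.1, 0.9, 0.2], 'p2' => [1.0, 0.0, 0.0]];
$restored = decodeIndexPayload(encodeIndexPayload($snapshot));
```

JSON is larger and slower to parse than the native format, but it contains no class names, so it avoids object instantiation on load and can be read by non-PHP tooling.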

Unit Testing the Aggregate’s Logic

The primary benefit of this architecture is testability. We can test the Aggregate’s complex behavior—including its search consistency—in complete isolation, without needing a database or a network connection. We can mock the EmbeddingGeneratorInterface to provide deterministic vectors for our tests.

<?php

declare(strict_types=1);

namespace App\Tests\Domain\Model\Collection;

use App\Domain\Model\Collection\CuratedCollection;
use App\Domain\Model\Collection\ValueObject\CollectionId;
use App\Domain\Model\Collection\ValueObject\Product;
use App\Domain\Model\Collection\ValueObject\ProductId;
use App\Domain\Model\Collection\ValueObject\Vector;
use App\Domain\Service\EmbeddingGeneratorInterface;
use PHPUnit\Framework\TestCase;

class CuratedCollectionTest extends TestCase
{
    private EmbeddingGeneratorInterface $embeddingGeneratorMock;

    protected function setUp(): void
    {
        $this->embeddingGeneratorMock = $this->createMock(EmbeddingGeneratorInterface::class);
    }

    public function testProductAdditionUpdatesSearchIndex(): void
    {
        $collectionId = new CollectionId('c1');
        $collection = new CuratedCollection($collectionId, 'Summer Collection');

        $product1 = new Product(new ProductId('p1'), 'Floral Dress', 'A light dress with a floral pattern.');
        $vector1 = new Vector([0.1, 0.9, 0.2]);

        $this->embeddingGeneratorMock
            ->expects($this->once())
            ->method('generateForText')
            ->with('A light dress with a floral pattern.')
            ->willReturn($vector1);

        $collection->addProduct($product1, $this->embeddingGeneratorMock);

        $index = $collection->getSearchIndex();
        $this->assertEquals(1, $index->count());
        $this->assertSame($vector1, $index->getVectorById('p1'));
    }

    public function testProductRemovalUpdatesSearchIndex(): void
    {
        $collection = new CuratedCollection(new CollectionId('c1'), 'Test');
        $product = new Product(new ProductId('p1'), 'Test Product', 'desc');
        $vector = new Vector([1.0, 0.0, 0.0]);

        $this->embeddingGeneratorMock->method('generateForText')->willReturn($vector);
        $collection->addProduct($product, $this->embeddingGeneratorMock);
        
        $this->assertEquals(1, $collection->getSearchIndex()->count());

        $collection->removeProduct($product->getId());

        $this->assertEquals(0, $collection->getSearchIndex()->count());
        $this->assertNull($collection->getSearchIndex()->getVectorById('p1'));
    }

    public function testFindSimilarProductsUsesInternalIndex(): void
    {
        $collection = new CuratedCollection(new CollectionId('c1'), 'Test');
        
        $dress = new Product(new ProductId('p1'), 'Dress', 'A beautiful red dress.');
        $dressVector = new Vector([0.9, 0.1, 0.1]);

        $shirt = new Product(new ProductId('p2'), 'Shirt', 'A casual blue shirt.');
        $shirtVector = new Vector([0.1, 0.1, 0.9]);

        $this->embeddingGeneratorMock->method('generateForText')
            ->willReturnMap([
                ['A beautiful red dress.', $dressVector],
                ['A casual blue shirt.', $shirtVector],
                ['something red', new Vector([1.0, 0.0, 0.0])] // Query vector
            ]);

        $collection->addProduct($dress, $this->embeddingGeneratorMock);
        $collection->addProduct($shirt, $this->embeddingGeneratorMock);

        $results = $collection->findSimilarProducts('something red', 1, $this->embeddingGeneratorMock);

        $this->assertCount(1, $results);
        $this->assertArrayHasKey('p1', $results); // p1 (dress) should be the most similar
        $this->assertArrayNotHasKey('p2', $results);
    }
}

This approach is not a silver bullet. The memory footprint of the PHP worker process is now a primary concern. A collection with 100,000 products and 768-dimensional vectors holds 76.8 million components. Stored as raw 32-bit floats, that is already about 307 MB; held as PHP arrays of 64-bit floats, where every element occupies at least a 16-byte zval, it is well over 1 GB for the index alone, per worker. This solution is only viable for bounded contexts dealing with a relatively small, but highly active, set of data. The necessity of a long-running process model like Swoole or RoadRunner also completely changes deployment and operational considerations compared to a typical LAMP stack application. Furthermore, the brute-force search scores every vector on every query, so its cost grows at least linearly with the collection size and becomes a CPU bottleneck as the collection grows. A logical next step would be to implement a proper ANN index like HNSW within the InMemoryVectorIndex class, which would dramatically improve search performance at the cost of much higher implementation complexity. The persistence strategy, relying on serialization, also presents challenges with “cold starts,” as hydrating a large index from Redis on first access can introduce significant latency.
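One way to reduce the memory concern, offered here as a sketch and not as something the system implemented, is to hold each vector as a packed binary string of 32-bit floats instead of a PHP array. The helper names `packVector` and `unpackVector` are hypothetical. The `g` format code of pack() denotes a little-endian float, which is 4 bytes on common platforms.

```php
<?php

declare(strict_types=1);

/**
 * Packs components into a binary string of 32-bit little-endian floats.
 * Precision drops from 64 to 32 bits. Hypothetical helper.
 *
 * @param float[] $components
 */
function packVector(array $components): string
{
    // array_values() ensures positional arguments even if the input has string keys.
    return pack('g*', ...array_values($components));
}

/**
 * Restores the components from a string produced by packVector().
 *
 * @return float[]
 */
function unpackVector(string $binary): array
{
    $values = unpack('g*', $binary);
    if ($values === false) {
        throw new UnexpectedValueException('Could not unpack vector data.');
    }

    // unpack() returns a 1-indexed array; re-index it to a plain list.
    return array_values($values);
}

// Usage: three components occupy 12 bytes of string payload.
$binary = packVector([0.5, -1.25, 2.0]);
$components = unpackVector($binary);
```

The cost is an unpack step before every distance calculation in pure PHP, so this mainly pays off when the distance computation itself moves to native code that can read the packed bytes directly.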

