Implementing Read-Your-Writes Consistency for iOS Clients Over an AP Memcached Layer


An iOS application’s social feed is experiencing unacceptable latency when fetching user profile data. The backend architecture consists of a primary PostgreSQL database fronted by a large Memcached cluster. The core design principle is to maximize availability and read performance, consciously choosing the AP (Availability, Partition Tolerance) side of the CAP theorem. This works well for the general user base, who can tolerate a few seconds of data staleness. However, a critical user experience flaw has emerged: when a user updates their own profile—for instance, changing their display name—they are frequently served the old, stale data from Memcached upon navigating back to their profile screen. This breaks the user’s mental model and erodes trust. The system lacks read-your-writes consistency, a property that is non-negotiable for user-initiated actions.

The current, problematic data flow can be visualized as follows:

sequenceDiagram
    participant iOS App
    participant API Gateway
    participant Profile Service (Go)
    participant Memcached
    participant PostgreSQL

    iOS App->>+API Gateway: GET /users/profile/123
    API Gateway->>+Profile Service (Go): GetProfile(id: 123)
    Profile Service (Go)->>+Memcached: GET profile:123
    Memcached-->>-Profile Service (Go): (Cache Hit) Stale Profile Data
    Profile Service (Go)-->>-API Gateway: Stale Profile
    API Gateway-->>-iOS App: Stale Profile

    Note right of iOS App: User sees old data after just updating it.

The fundamental challenge is to introduce a strong guarantee of read-your-writes consistency for the originating user’s session without sacrificing the high availability and low latency that the AP-focused Memcached architecture provides for the wider system. Re-architecting the entire caching layer to be strongly consistent (e.g., using a CP system like etcd or ZooKeeper) is not a viable option, as it would introduce unacceptable latency for the 99.9% of reads that are not post-write.

Solution A: Server-Side Strong Cache Invalidation

A common first approach is to enforce consistency entirely on the server. When a user profile is updated, the service updates the database and then immediately sends a DELETE command to Memcached for the corresponding key. The next read for that profile will result in a cache miss, forcing a fetch from the primary database, which then repopulates the cache with the fresh data.

This approach appears straightforward. The Go backend code for the update operation would look something like this:

// In profile_service.go

package profileservice

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

type Profile struct {
	ID          int64  `json:"id"`
	Username    string `json:"username"`
	DisplayName string `json:"display_name"`
	Version     int    `json:"version"`
}

type Service struct {
	db *sql.DB
	mc *memcache.Client
}

const (
	cacheKeyPrefix = "profile:"
	cacheTTL       = 300 // 5 minutes
)

// UpdateProfileWithInvalidation attempts to solve consistency by invalidating the cache.
// This is Solution A: The flawed approach.
func (s *Service) UpdateProfileWithInvalidation(ctx context.Context, profile Profile) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		log.Printf("ERROR: failed to begin transaction: %v", err)
		return err
	}
	defer tx.Rollback() // Rollback is a no-op if the transaction is committed.

	// Optimistic locking: the version check in the WHERE clause below
	// prevents lost updates from concurrent writers.
	res, err := tx.ExecContext(ctx,
		`UPDATE user_profiles SET display_name = $1, version = version + 1 WHERE id = $2 AND version = $3`,
		profile.DisplayName, profile.ID, profile.Version,
	)
	if err != nil {
		log.Printf("ERROR: failed to execute update: %v", err)
		return err
	}

	rowsAffected, err := res.RowsAffected()
	if err != nil {
		log.Printf("ERROR: failed to get rows affected: %v", err)
		return err
	}
	if rowsAffected == 0 {
		return fmt.Errorf("concurrency error: profile version mismatch or profile not found")
	}

	if err := tx.Commit(); err != nil {
		log.Printf("ERROR: failed to commit transaction: %v", err)
		return err
	}

	// The core of this approach: invalidate the cache after DB commit.
	cacheKey := fmt.Sprintf("%s%d", cacheKeyPrefix, profile.ID)
	err = s.mc.Delete(cacheKey)
	if err != nil && err != memcache.ErrCacheMiss {
		// A common mistake is to fail the whole operation if cache invalidation fails.
		// In a production system, this is often treated as a non-critical error.
		// We log it but don't return an error to the client, as the primary data is safe.
		log.Printf("WARN: failed to invalidate cache for key %s: %v", cacheKey, err)
	}

	log.Printf("INFO: successfully updated profile %d and invalidated cache", profile.ID)
	return nil
}

While simple, this strategy is riddled with subtle but critical race conditions in a distributed environment.

Pros:

  • Simplicity: The logic is contained entirely within the backend service. The client remains stateless and unaware of the caching strategy.
  • Low Overhead: A single DELETE operation is cheap.

Cons:

  • Write/Invalidate vs. Read/Set Race Condition: This is the most severe flaw. Consider the following sequence:
    1. Request A (Update): Begins updating the profile in the database.
    2. Request B (Read): Misses the cache, queries the database, and gets the old data (because Request A’s transaction hasn’t committed yet).
    3. Request A (Update): Commits the transaction and sends a DELETE command to Memcached for the profile key.
    4. Request B (Read): Receives the old data from the database and performs a SET operation in Memcached, overwriting the cache with stale data after it was just invalidated.
      The cache is now poisoned with stale data and keeps serving it until the TTL expires or another update occurs.
  • Cache Topology Hazards: In a sharded Memcached cluster, the DELETE is routed to a single node by a client-side hash ring. If service instances temporarily disagree on that ring (for example, during a rolling deploy or after a node failure), the delete and a later SET can land on different nodes, leaving stale data reachable. Standard Memcached has no replication, but managed services or proxy layers that add it can further delay propagation of the delete.
  • Database Thundering Herd: If a very popular profile is updated, the resulting cache miss sends a wave of concurrent read requests straight to the primary database, defeating the cache’s role as a protective layer during peak traffic. (A request-coalescing sketch that mitigates this follows the list.)
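
The thundering-herd exposure can be reduced independently of the consistency question by coalescing concurrent cache misses for the same key. Below is a minimal sketch, assuming the golang.org/x/sync/singleflight package as an added dependency; the CoalescedFetcher type is hypothetical and not part of either solution as described.

// A hypothetical coalesced_fetch.go, same package as the profile service.

package profileservice

import (
	"context"
	"fmt"

	"golang.org/x/sync/singleflight"
)

// CoalescedFetcher collapses concurrent cache-miss lookups for the same profile
// into a single database query whose result is shared by all waiting callers.
type CoalescedFetcher struct {
	group   singleflight.Group
	fetchDB func(ctx context.Context, id int64) (*Profile, error)
}

func (c *CoalescedFetcher) Fetch(ctx context.Context, id int64) (*Profile, error) {
	key := fmt.Sprintf("profile:%d", id)
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		// Only one goroutine per key reaches the database; the rest wait here
		// and receive the same result (or error).
		return c.fetchDB(ctx, id)
	})
	if err != nil {
		return nil, err
	}
	return v.(*Profile), nil
}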

This solution is ultimately too fragile for a production system that requires reliable read-your-writes guarantees. It attempts to force strong consistency onto a system (Memcached) that was never designed for it, leading to unpredictable behavior.

Solution B: Client-Augmented Consistency with Server Cooperation

Instead of fighting the AP nature of the distributed cache, this approach embraces it. We acknowledge that the cache is eventually consistent for the general population of users. The specific problem of read-your-writes is solved by a cooperative effort between the server and the specific client initiating the write.

The core idea is this:

  1. Server Write (PUT /users/profile/{id}): The server updates the database and then performs a SET (write-through) to the cache. Crucially, upon success, it returns the complete, updated object in the API response body.
  2. iOS Client Logic: The client, upon receiving a 200 OK with the new profile data from the update request, does not immediately trigger a background refresh. Instead, it stores this fresh object in a small, local, in-memory cache with a very short Time-To-Live (TTL), for example, 10 seconds.
  3. iOS Client Read Logic: When any part of the app requests that specific user’s profile, the data repository first checks this short-lived local cache. If a valid, non-expired entry is found, it’s returned instantly without a network call. If not, it proceeds with the standard network fetch, which will then hit the server’s (likely updated) Memcached layer.

This creates a “consistency bubble” around the user who performed the action, for a short duration, guaranteeing they see their own changes.

sequenceDiagram
    participant iOS App
    participant Local Cache (NSCache)
    participant API Gateway
    participant Profile Service (Go)
    participant Memcached
    participant PostgreSQL

    %% Update Flow
    iOS App->>+API Gateway: PUT /users/profile/123 (new_name: "Alpha")
    API Gateway->>+Profile Service (Go): UpdateProfile(id: 123, name: "Alpha")
    Profile Service (Go)->>+PostgreSQL: UPDATE user_profiles ...
    PostgreSQL-->>-Profile Service (Go): OK
    Profile Service (Go)->>+Memcached: SET profile:123 (new_profile_data)
    Memcached-->>-Profile Service (Go): STORED
    Profile Service (Go)-->>-API Gateway: 200 OK, Body: {id: 123, display_name: "Alpha", ...}
    API Gateway-->>-iOS App: 200 OK, Body: {id: 123, display_name: "Alpha", ...}
    iOS App->>+Local Cache (NSCache): Set(key: "profile_123", value: new_profile_data, ttl: 10s)
    Local Cache (NSCache)-->>-iOS App: OK

    %% Immediate Read Flow
    iOS App->>+Local Cache (NSCache): Get(key: "profile_123")
    Local Cache (NSCache)-->>-iOS App: (Cache Hit) Fresh Profile Data
    Note right of iOS App: User instantly sees the new name. No network request needed.

Final Choice and Rationale

Solution B is the superior architectural choice. It pragmatically solves the precise user-facing problem without compromising the overall system’s design goals. It acknowledges the constraints of the CAP theorem and works with them, rather than against them. While it introduces statefulness on the client, this complexity is well-contained and justified by the robust user experience it delivers. The system remains highly available and performant for all other users, who continue to benefit from the eventually consistent distributed cache. This trade-off—accepting device-specific consistency for a short period in exchange for system-wide availability—is a hallmark of mature distributed system design.

Core Implementation

Server-Side: Go Profile Service (Write-Through)

The Go service is modified to perform a write-through cache update and return the full object.

// In profile_service_solution_b.go

package profileservice

import (
	"context"
	"database/sql"
	"encoding/json"
	"fmt"
	"log"

	"github.com/bradfitz/gomemcache/memcache"
)

// Profile struct remains the same.
// Service struct remains the same.

func (s *Service) GetProfile(ctx context.Context, id int64) (*Profile, error) {
	cacheKey := fmt.Sprintf("%s%d", cacheKeyPrefix, id)

	// 1. Attempt to get from cache
	item, err := s.mc.Get(cacheKey)
	if err == nil {
		// Cache Hit
		var profile Profile
		if uerr := json.Unmarshal(item.Value, &profile); uerr != nil {
			// The cached value is corrupt; log the unmarshal error and fall
			// through to the DB fetch below.
			log.Printf("WARN: Failed to unmarshal cached profile %d: %v", id, uerr)
		} else {
			log.Printf("INFO: Cache hit for profile %d", id)
			return &profile, nil
		}
	} else if err != memcache.ErrCacheMiss {
		log.Printf("WARN: Memcached Get failed for key %s: %v", cacheKey, err)
	}

	// 2. Cache Miss or Error: Get from database
	log.Printf("INFO: Cache miss for profile %d, fetching from DB", id)
	var profile Profile
	err = s.db.QueryRowContext(ctx,
		`SELECT id, username, display_name, version FROM user_profiles WHERE id = $1`,
		id).Scan(&profile.ID, &profile.Username, &profile.DisplayName, &profile.Version)

	if err != nil {
		if err == sql.ErrNoRows {
			return nil, fmt.Errorf("profile not found")
		}
		log.Printf("ERROR: DB query failed for profile %d: %v", id, err)
		return nil, err
	}

	// 3. Populate cache (Set)
	profileJSON, err := json.Marshal(profile)
	if err != nil {
		log.Printf("WARN: Failed to marshal profile %d for caching: %v", id, err)
		return &profile, nil // Return data even if caching fails
	}

	err = s.mc.Set(&memcache.Item{
		Key:        cacheKey,
		Value:      profileJSON,
		Expiration: int32(cacheTTL),
	})
	if err != nil {
		log.Printf("WARN: Memcached Set failed for key %s: %v", cacheKey, err)
	}

	return &profile, nil
}


// UpdateProfileWriteThrough performs a write-through cache update and returns the new object.
func (s *Service) UpdateProfileWriteThrough(ctx context.Context, profileUpdate Profile) (*Profile, error) {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		log.Printf("ERROR: failed to begin transaction: %v", err)
		return nil, err
	}
	defer tx.Rollback()

	// Update the profile and fetch the new version and other fields in one go
	var updatedProfile Profile
	err = tx.QueryRowContext(ctx,
		`UPDATE user_profiles SET display_name = $1, version = version + 1
		 WHERE id = $2 AND version = $3
		 RETURNING id, username, display_name, version`,
		profileUpdate.DisplayName, profileUpdate.ID, profileUpdate.Version,
	).Scan(&updatedProfile.ID, &updatedProfile.Username, &updatedProfile.DisplayName, &updatedProfile.Version)
	
	if err != nil {
		if err == sql.ErrNoRows {
			return nil, fmt.Errorf("concurrency error: profile version mismatch or not found")
		}
		log.Printf("ERROR: failed to execute update and returning query: %v", err)
		return nil, err
	}

	if err := tx.Commit(); err != nil {
		log.Printf("ERROR: failed to commit transaction: %v", err)
		return nil, err
	}

	// The core of the write-through strategy.
	cacheKey := fmt.Sprintf("%s%d", cacheKeyPrefix, updatedProfile.ID)
	profileJSON, err := json.Marshal(updatedProfile)
	if err != nil {
		log.Printf("WARN: Failed to marshal updated profile %d for caching: %v", updatedProfile.ID, err)
		// Don't fail the request; the primary store is consistent.
		return &updatedProfile, nil
	}

	err = s.mc.Set(&memcache.Item{
		Key:        cacheKey,
		Value:      profileJSON,
		Expiration: int32(cacheTTL),
	})
	if err != nil {
		// Again, log but do not fail the request.
		log.Printf("WARN: failed to update cache (write-through) for key %s: %v", cacheKey, err)
	}
	
	log.Printf("INFO: successfully updated profile %d and performed write-through cache set", updatedProfile.ID)
	
	// Return the fresh, complete object to the client.
	return &updatedProfile, nil
}
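
For the client to pre-warm its cache, the updated object also has to travel back over HTTP with the 200 response. The service code above stops at the method boundary, so here is a minimal handler sketch; it assumes Go 1.22+ (for ServeMux path patterns and r.PathValue), and the handler name and error mapping are illustrative rather than part of the original service.

// In a hypothetical profile_http.go (same package).

package profileservice

import (
	"encoding/json"
	"log"
	"net/http"
	"strconv"
)

func (s *Service) HandleUpdateProfile(w http.ResponseWriter, r *http.Request) {
	// Assumes the route pattern exposes the profile ID as a path wildcard.
	id, err := strconv.ParseInt(r.PathValue("id"), 10, 64)
	if err != nil {
		http.Error(w, "invalid profile id", http.StatusBadRequest)
		return
	}

	var req struct {
		DisplayName string `json:"display_name"`
		Version     int    `json:"version"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid request body", http.StatusBadRequest)
		return
	}

	updated, err := s.UpdateProfileWriteThrough(r.Context(), Profile{
		ID:          id,
		DisplayName: req.DisplayName,
		Version:     req.Version,
	})
	if err != nil {
		// Simplified mapping: version conflicts and other failures all surface
		// as 409 here; a real handler would distinguish them.
		http.Error(w, err.Error(), http.StatusConflict)
		return
	}

	// Returning the complete updated object in the response body is what lets
	// the iOS client pre-warm its local cache without a follow-up GET.
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(updated); err != nil {
		log.Printf("WARN: failed to encode response for profile %d: %v", updated.ID, err)
	}
}

With a Go 1.22 ServeMux, this might be registered as, for example, mux.HandleFunc("PUT /users/profile/{id}", svc.HandleUpdateProfile).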

Client-Side: Swift Profile Repository

The iOS client needs a repository layer that encapsulates the local caching logic. We use NSCache for a simple, thread-safe in-memory cache.

// Profile.swift

import Foundation

// The Codable model must match the JSON from the Go service.
struct Profile: Codable, Identifiable {
    let id: Int
    let username: String
    let displayName: String
    let version: Int
}

// ProfileRepository.swift

import Foundation

// A wrapper to store the profile along with its expiration date in the local cache.
// NSCache can only hold class instances, so this is a final class (not a struct)
// and lives alongside the repository that uses it.
private final class CachedProfile {
    let profile: Profile
    let expirationDate: Date

    init(profile: Profile, expirationDate: Date) {
        self.profile = profile
        self.expirationDate = expirationDate
    }

    func isExpired(now: Date = Date()) -> Bool {
        return now > expirationDate
    }
}

enum RepositoryError: Error {
    case networkError(Error)
    case decodingError(Error)
    case invalidURL
    case serverError(statusCode: Int)
}

class ProfileRepository {
    
    // NSCache is thread-safe and automatically purges objects on memory pressure,
    // which suits a short-lived local cache. Note that NSCache does not evict by
    // time on its own; expiration is enforced by the isExpired() check on read.
    // NSCache requires reference types, hence NSString keys and CachedProfile values.
    private let localCache = NSCache<NSString, CachedProfile>()
    private let localCacheTTL: TimeInterval // in seconds
    private let apiBaseURL: URL

    init(apiBaseURL: URL, localCacheTTL: TimeInterval = 10.0) {
        self.apiBaseURL = apiBaseURL
        self.localCacheTTL = localCacheTTL
    }
    
    private func cacheKey(for profileID: Int) -> NSString {
        return "profile_\(profileID)" as NSString
    }

    // Public fetch method that orchestrates the caching logic.
    func fetchProfile(id: Int) async -> Result<Profile, RepositoryError> {
        // 1. Check the local, short-lived cache first.
        let key = cacheKey(for: id)
        if let cached = localCache.object(forKey: key), !cached.isExpired() {
            print("LOG: Local cache hit for profile \(id)")
            return .success(cached.profile)
        }

        // 2. If local cache misses, fetch from the network.
        print("LOG: Local cache miss for profile \(id). Fetching from network.")
        return await fetchProfileFromNetwork(id: id)
    }

    func updateProfile(id: Int, newDisplayName: String, currentVersion: Int) async -> Result<Profile, RepositoryError> {
        guard let url = URL(string: "users/profile/\(id)", relativeTo: apiBaseURL) else {
            return .failure(.invalidURL)
        }

        var request = URLRequest(url: url)
        request.httpMethod = "PUT"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        // A heterogeneous [String: Any] dictionary is not Encodable, so we build
        // the small request body with JSONSerialization instead of JSONEncoder.
        let payload: [String: Any] = ["display_name": newDisplayName, "version": currentVersion]

        do {
            request.httpBody = try JSONSerialization.data(withJSONObject: payload)
        } catch {
            // Reuse decodingError for any serialization failure on this path.
            return .failure(.decodingError(error))
        }

        do {
            let (data, response) = try await URLSession.shared.data(for: request)
            guard let httpResponse = response as? HTTPURLResponse, httpResponse.statusCode == 200 else {
                let statusCode = (response as? HTTPURLResponse)?.statusCode ?? -1
                return .failure(.serverError(statusCode: statusCode))
            }
            
            let decoder = JSONDecoder()
            decoder.keyDecodingStrategy = .convertFromSnakeCase
            let updatedProfile = try decoder.decode(Profile.self, from: data)

            // 3. On successful update, pre-warm the local cache with the returned data.
            // This is the key to providing read-your-writes consistency.
            let expirationDate = Date().addingTimeInterval(localCacheTTL)
            let cacheEntry = CachedProfile(profile: updatedProfile, expirationDate: expirationDate)
            let key = cacheKey(for: updatedProfile.id)
            localCache.setObject(cacheEntry, forKey: key)
            print("LOG: Pre-warmed local cache for profile \(updatedProfile.id) after update.")

            return .success(updatedProfile)
            
        } catch {
            return .failure(.networkError(error))
        }
    }
    
    private func fetchProfileFromNetwork(id: Int) async -> Result<Profile, RepositoryError> {
        // This method would fetch from the network, which hits the server-side
        // Memcached layer. For brevity, we are focusing on the update and local cache logic.
        // In a real implementation, this would contain a standard GET request.
        // We simulate a successful network fetch for demonstration.
        print("LOG: Simulating network fetch for profile \(id).")
        // A placeholder implementation.
        let fakeProfile = Profile(id: id, username: "testuser", displayName: "Fetched Name", version: 1)
        return .success(fakeProfile)
    }
}

This implementation cleanly separates concerns. The UI layer interacts with ProfileRepository and is oblivious to the underlying consistency mechanism. It simply gets the correct data, instantly.
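
At the call site, the read-your-writes behavior falls out of the repository API. The following view-model snippet is a hypothetical usage example; the ProfileViewModel type, base URL, and error handling are illustrative and not part of the code above.

// ProfileViewModel.swift (hypothetical usage example)

import Combine
import Foundation

@MainActor
final class ProfileViewModel: ObservableObject {
    @Published private(set) var profile: Profile?

    private let repository = ProfileRepository(apiBaseURL: URL(string: "https://api.example.com/")!)

    func rename(profileID: Int, to newName: String, currentVersion: Int) async {
        // The PUT returns the fresh object and pre-warms the repository's local cache.
        if case .success(let updated) = await repository.updateProfile(
            id: profileID, newDisplayName: newName, currentVersion: currentVersion
        ) {
            profile = updated
        }

        // An immediate re-read (e.g., navigating back to the profile screen) is
        // served from the short-lived local cache rather than the network, so the
        // user is guaranteed to see their own change.
        if case .success(let fresh) = await repository.fetchProfile(id: profileID) {
            profile = fresh
        }
    }
}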

Limitations and Future Considerations

This solution is not a silver bullet for all consistency problems. It is a targeted pattern with specific trade-offs.

A primary limitation is that the consistency guarantee is scoped to the device that performed the write. If the same user is logged into two devices (e.g., an iPhone and an iPad), updating their profile on the iPhone will provide an instant update there, but the iPad will remain eventually consistent and may see stale data for a few seconds. Solving this multi-device consistency problem would require a much more complex architecture, likely involving real-time push notifications (via WebSockets or APNS) from the server to all of a user’s active clients to proactively update their local caches.
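
As a rough illustration of that direction, the client-side half could be as simple as evicting the local entry when a push reports a change made elsewhere. The sketch below assumes such a push exists; handleRemoteProfileChange is hypothetical and not part of the repository shown earlier.

// Hypothetical addition, placed in ProfileRepository.swift so the extension can
// reach the repository's private cache members (private is visible to same-file
// extensions in Swift 4+).
extension ProfileRepository {
    // Called when a silent push reports that this profile changed on another device.
    func handleRemoteProfileChange(id: Int) {
        // Drop the local entry so the next read goes to the network, which by then
        // should observe the write-through value in Memcached.
        localCache.removeObject(forKey: cacheKey(for: id))
    }
}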

Furthermore, the duration of the local cache TTL (10 seconds in this example) is a parameter that requires careful tuning based on the expected propagation delay in the backend cache cluster and user behavior patterns. A TTL that is too short might expose the user to a “flicker” of stale data if the network read happens before the Memcached SET has propagated. A TTL that is too long might prevent the user from seeing a change initiated from another source (e.g., a customer support tool).

Finally, this pattern is most effective for user-centric, non-collaborative data. For highly collaborative features, where multiple users need to see each other’s updates in near real-time, different technologies like CRDTs (Conflict-free Replicated Data Types) or operational transforms, combined with real-time messaging, would be more appropriate.

