Implementing a Cross-Cloud Two-Phase Commit Coordinator with gRPC, etcd, and Go


The requirement was deceptively simple: achieve an atomic write across two disparate systems. The first was a core user profile service, a long-running Go application deployed on our AWS EKS cluster. The second was an ephemeral, event-triggered Google Cloud Function responsible for processing partner data feeds. A single logical operation needed to succeed or fail entirely in both places. We were initially drawn towards an eventual consistency model built on the Saga pattern, but business requirements mandated strict, immediate consistency for this specific workflow. This forced our hand towards a classic, and often maligned, solution: the Two-Phase Commit (2PC) protocol.

This is the log of that implementation. It’s a record of building a transaction coordinator from scratch using Go, gRPC, and etcd, orchestrating a transaction that spans the network boundary between AWS and GCP, and wrestling with the fundamental architectural mismatch between a stateful container orchestrator and a stateless serverless platform.

Technical Pain Point: The Cross-Cloud Atomic Write

The core problem can be visualized as follows. An external event triggers the Google Cloud Function. This function must perform an operation on a Firestore collection. Concurrently, it must trigger a corresponding update on a PostgreSQL database managed by a service running in EKS. If the Firestore write succeeds but the PostgreSQL write fails (or vice versa), the system is left in an inconsistent state that is difficult and costly to reconcile.

A simplified view of the required interaction:

sequenceDiagram
    participant Ext as External Event
    participant GCF as GCP Cloud Function
    participant Coord as 2PC Coordinator (EKS)
    participant Svc as Profile Service (EKS)

    Ext->>GCF: Trigger with data payload
    GCF->>Coord: StartTransaction
    Coord-->>GCF: transactionId
    Coord->>Svc: Prepare(transactionId)
    GCF->>GCF: Prepare local change (Firestore)
    Svc-->>Coord: VOTE_COMMIT
    GCF-->>Coord: VOTE_COMMIT
    Coord->>Coord: Decide: GLOBAL_COMMIT
    Coord->>Svc: Commit(transactionId)
    Coord->>GCF: (How to notify?)
    Svc-->>Coord: ACK
    GCF-->>Coord: ACK

The diagram already exposes the primary challenge: how does a central coordinator on EKS, which expects to communicate with stable network endpoints, reliably issue a Commit command to a serverless function that may have already finished its execution after voting? This became the central design constraint.

Initial Concept and Technology Selection Rationale

In a real-world project, choosing 2PC is a decision that requires significant justification due to its blocking nature and susceptibility to coordinator failure. For this specific, low-throughput but high-consistency requirement, we deemed it acceptable.

  • Go: The language of choice for its excellent concurrency primitives, static typing, and robust gRPC ecosystem. It’s well-suited for both the long-running services in EKS and the lean execution environment of Cloud Functions.
  • gRPC: For communication, a RESTful API would introduce ambiguity. gRPC provides a strongly-typed contract via Protocol Buffers, ensuring all participants—the coordinator, the EKS service, and the Cloud Function client—speak the exact same language. Its performance is also a significant advantage over JSON/HTTP.
  • etcd: The coordinator needs a reliable, consistent data store to track transaction states. Using a relational database would introduce another dependency. etcd, already a core component of Kubernetes, provides the ideal primitives: atomic compare-and-swap operations for state transitions, transactions for multi-key operations, and leases for automatic timeout and recovery of stuck transactions (a minimal sketch of these primitives follows this list). We deployed a dedicated etcd cluster within EKS for this purpose.
  • AWS EKS & Google Cloud Functions: This reflects the reality of our multi-cloud architecture. It wasn’t a choice but a constraint to be engineered around.
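
Before getting into the coordinator itself, here is a minimal, self-contained sketch of the two etcd primitives the design leans on: attaching transaction keys to a lease so a stuck transaction eventually expires, and a compare-and-swap transaction so the global state can only move out of PREPARING once. The endpoint, TTL, and key names here are illustrative only.

// etcd_primitives_sketch.go
// A standalone sketch of the lease and compare-and-swap primitives the
// coordinator relies on. Endpoint, TTL, and key names are illustrative.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumption: a local etcd for the sketch
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("etcd connect: %v", err)
	}
	defer cli.Close()

	ctx := context.Background()
	stateKey := "/transactions/demo-tx/state"

	// Lease: if the coordinator dies and stops renewing it, every key attached
	// to the lease disappears after the TTL, which a recovery process can treat
	// as an implicit abort signal.
	lease, err := cli.Grant(ctx, 60)
	if err != nil {
		log.Fatalf("lease grant: %v", err)
	}
	if _, err := cli.Put(ctx, stateKey, "PREPARING", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatalf("put: %v", err)
	}

	// Compare-and-swap: the state only moves from PREPARING to COMMITTING if
	// no other process has decided first. This is the atomic commit point.
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(stateKey), "=", "PREPARING")).
		Then(clientv3.OpPut(stateKey, "COMMITTING", clientv3.WithLease(lease.ID))).
		Commit()
	if err != nil {
		log.Fatalf("txn: %v", err)
	}
	fmt.Println("state transition applied:", resp.Succeeded)
}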

Step 1: Defining the Transaction Protocol with gRPC

The foundation of any distributed system is its communication contract. Our transaction.proto file defined the entire 2PC flow.

// transaction.proto
syntax = "proto3";

package transaction;

option go_package = ".;transaction";

enum TransactionState {
  UNKNOWN = 0;
  PREPARING = 1;
  COMMITTING = 2;
  ABORTING = 3;
  COMMITTED = 4;
  ABORTED = 5;
}

enum Vote {
  VOTE_UNKNOWN = 0;
  VOTE_COMMIT = 1;
  VOTE_ABORT = 2;
}

// Service implemented by the Transaction Coordinator
service Coordinator {
  // Initiates a new transaction with a list of participants
  rpc StartTransaction(StartTransactionRequest) returns (StartTransactionResponse);

  // A participant reports its vote for a given transaction
  rpc RegisterVote(RegisterVoteRequest) returns (RegisterVoteResponse);
  
  // A client or participant can poll the final state of a transaction
  rpc GetTransactionState(GetTransactionStateRequest) returns (GetTransactionStateResponse);
}

// Service implemented by each Transaction Participant
service Participant {
  // Coordinator asks participant to prepare for a commit
  rpc Prepare(PrepareRequest) returns (PrepareResponse);

  // Coordinator orders participant to commit the transaction
  rpc Commit(CommitRequest) returns (CommitResponse);

  // Coordinator orders participant to abort the transaction
  rpc Abort(AbortRequest) returns (AbortResponse);
}

// Coordinator Messages
message StartTransactionRequest {
  repeated string participant_ids = 1;
}

message StartTransactionResponse {
  string transaction_id = 1;
}

message RegisterVoteRequest {
  string transaction_id = 1;
  string participant_id = 2;
  Vote vote = 3;
}

message RegisterVoteResponse {}

message GetTransactionStateRequest {
  string transaction_id = 1;
}

message GetTransactionStateResponse {
  TransactionState state = 1;
}


// Participant Messages
message PrepareRequest {
  string transaction_id = 1;
}

message PrepareResponse {
  Vote vote = 1;
}

message CommitRequest {
  string transaction_id = 1;
}

message CommitResponse {}

message AbortRequest {
  string transaction_id = 1;
}

message AbortResponse {}

A key design decision here was to split the services. The Coordinator service is implemented by our central manager in EKS. The Participant service is implemented by any service that wishes to join a transaction. This clean separation is critical for maintainability.
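
For completeness: the pb package imported throughout the Go code is generated from this contract with protoc and the protoc-gen-go / protoc-gen-go-grpc plugins. Below is a sketch of a generation hook that could sit next to the .proto file; the flags are the standard ones for those plugins, but the output layout is an assumption to adapt to your repository.

// transaction/gen.go
// A sketch of the code-generation hook; the output paths and module layout are
// assumptions, adjust them to the actual repository structure.
package transaction

//go:generate protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative transaction.proto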

Step 2: The Coordinator Implementation on EKS with etcd

The coordinator is the brain of the operation. Its sole job is to manage the state machine of each transaction, persisting every state change to etcd to survive crashes.

The data model in etcd is key-value based:

  • /transactions/{txid}/state: Stores the global state (PREPARING, COMMITTING, etc.).
  • /transactions/{txid}/participants/{participant_id}: Stores the vote of each participant (VOTE_COMMIT, VOTE_ABORT).
  • A lease is attached to the transaction’s root key /transactions/{txid}. If the coordinator fails to renew the lease (e.g., because the transaction is stuck), the key expires, and a watch can trigger a cleanup/abort process.
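
That cleanup path is not implemented in the coordinator code below, so here is a hedged sketch of what such a watcher could look like against the key layout above. A lease expiry deletes the keys attached to it, which surfaces as DELETE events on a prefix watch; the recovery hook itself is left as a stub, and it would still need to distinguish expired in-flight transactions from ones that completed normally.

// recovery_watcher_sketch.go
// A sketch of a watcher that notices transaction keys disappearing (lease not
// renewed) and hands them to a cleanup/abort routine. The hook is hypothetical.
package main

import (
	"context"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func watchExpiredTransactions(ctx context.Context, cli *clientv3.Client) {
	// A lease expiry deletes every key attached to it, so DELETE events on the
	// /transactions/ prefix are the signal that a transaction record vanished.
	watchChan := cli.Watch(ctx, "/transactions/", clientv3.WithPrefix())
	for resp := range watchChan {
		for _, ev := range resp.Events {
			if ev.Type != clientv3.EventTypeDelete {
				continue
			}
			key := string(ev.Kv.Key)
			if !strings.HasSuffix(key, "/state") {
				continue // only react once per transaction, on its state key
			}
			log.Printf("transaction key %s expired; scheduling abort/cleanup", key)
			// abortOrphanedTransaction(key) // hypothetical recovery hook
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("etcd connect: %v", err)
	}
	defer cli.Close()
	watchExpiredTransactions(context.Background(), cli)
}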

Here is the core logic of the coordinator server.

// coordinator/main.go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"sync"
	"time"

	"github.com/google/uuid"
	"go.etcd.io/etcd/client/v3"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "path/to/your/gen/transaction" // Import generated protobuf code
)

const (
	etcdEndpoint      = "etcd-cluster.default.svc.cluster.local:2379"
	transactionPrefix = "/transactions/"
	leaseTTL          = 60 // seconds
)

type coordinatorServer struct {
	pb.UnimplementedCoordinatorServer
	etcdClient *clientv3.Client
	// Map to hold connections to participants
	participantClients map[string]pb.ParticipantClient
	clientLock         sync.RWMutex
}

func newServer() (*coordinatorServer, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{etcdEndpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return nil, fmt.Errorf("failed to connect to etcd: %v", err)
	}

	return &coordinatorServer{
		etcdClient:         cli,
		participantClients: make(map[string]pb.ParticipantClient),
	}, nil
}

// In a real system, participant endpoints would be discovered via service discovery.
// For this example, we'll hardcode them.
func (s *coordinatorServer) getParticipantClient(participantID string) (pb.ParticipantClient, error) {
	s.clientLock.RLock()
	client, ok := s.participantClients[participantID]
	s.clientLock.RUnlock()

	if ok {
		return client, nil
	}

	// This is a simplified service discovery. Production systems would use Consul, Kubernetes services, etc.
	var participantAddr string
	switch participantID {
	case "eks-profile-service":
		participantAddr = "profile-service.default.svc.cluster.local:50052"
	default:
		return nil, status.Errorf(codes.NotFound, "participant %s not found", participantID)
	}

	conn, err := grpc.Dial(participantAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, status.Errorf(codes.Internal, "failed to connect to participant %s: %v", participantID, err)
	}

	newClient := pb.NewParticipantClient(conn)

	s.clientLock.Lock()
	s.participantClients[participantID] = newClient
	s.clientLock.Unlock()

	return newClient, nil
}

// StartTransaction is called by the application logic (e.g., the Cloud Function)
func (s *coordinatorServer) StartTransaction(ctx context.Context, req *pb.StartTransactionRequest) (*pb.StartTransactionResponse, error) {
	txID := uuid.New().String()
	log.Printf("Starting transaction %s for participants: %v", txID, req.ParticipantIds)

	lease, err := s.etcdClient.Grant(ctx, leaseTTL)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "etcd lease grant failed: %v", err)
	}

	// Use an etcd transaction to initialize all keys atomically
	ops := []clientv3.Op{
		clientv3.OpPut(fmt.Sprintf("%s%s/state", transactionPrefix, txID), fmt.Sprintf("%d", pb.TransactionState_PREPARING), clientv3.WithLease(lease.ID)),
	}
	for _, pID := range req.ParticipantIds {
		ops = append(ops, clientv3.OpPut(fmt.Sprintf("%s%s/participants/%s", transactionPrefix, txID, pID), fmt.Sprintf("%d", pb.Vote_VOTE_UNKNOWN), clientv3.WithLease(lease.ID)))
	}

	txn := s.etcdClient.Txn(ctx)
	_, err = txn.Then(ops...).Commit()
	if err != nil {
		return nil, status.Errorf(codes.Internal, "etcd transaction failed: %v", err)
	}

	// Asynchronously start the Prepare phase for all participants except the GCF one.
	// The GCF participant will report its vote itself.
	go s.initiatePreparePhase(txID, req.ParticipantIds)

	return &pb.StartTransactionResponse{TransactionId: txID}, nil
}

func (s *coordinatorServer) initiatePreparePhase(txID string, participants []string) {
	for _, pID := range participants {
        // The Cloud Function is a special case; it acts as a client and drives its own prepare.
        // We only actively call Prepare on long-running services.
		if pID == "gcp-cloud-function" {
			continue
		}

		go func(participantID string) {
			log.Printf("TX[%s]: Sending Prepare to participant %s", txID, participantID)
			client, err := s.getParticipantClient(participantID)
			if err != nil {
				log.Printf("TX[%s]: Failed to get client for %s: %v", txID, participantID, err)
				// If we can't even connect, we must abort.
				s.RegisterVote(context.Background(), &pb.RegisterVoteRequest{
					TransactionId: txID,
					ParticipantId: participantID,
					Vote:          pb.Vote_VOTE_ABORT,
				})
				return
			}

			ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
			defer cancel()

			resp, err := client.Prepare(ctx, &pb.PrepareRequest{TransactionId: txID})
			if err != nil {
				log.Printf("TX[%s]: Prepare failed for %s: %v", txID, participantID, err)
				s.RegisterVote(context.Background(), &pb.RegisterVoteRequest{
					TransactionId: txID,
					ParticipantId: participantID,
					Vote:          pb.Vote_VOTE_ABORT,
				})
				return
			}
			log.Printf("TX[%s]: Received vote %s from participant %s", txID, resp.Vote, participantID)
			s.RegisterVote(context.Background(), &pb.RegisterVoteRequest{
				TransactionId: txID,
				ParticipantId: participantID,
				Vote:          resp.Vote,
			})
		}(pID)
	}
}

// RegisterVote is the central logic point. Once all votes are in, it decides the outcome.
func (s *coordinatorServer) RegisterVote(ctx context.Context, req *pb.RegisterVoteRequest) (*pb.RegisterVoteResponse, error) {
	log.Printf("TX[%s]: Registering vote %s for participant %s", req.TransactionId, req.Vote, req.ParticipantId)
	
	// Record this participant's vote in etcd
	voteKey := fmt.Sprintf("%s%s/participants/%s", transactionPrefix, req.TransactionId, req.ParticipantId)
	_, err := s.etcdClient.Put(ctx, voteKey, fmt.Sprintf("%d", req.Vote))
	if err != nil {
		return nil, status.Errorf(codes.Internal, "etcd put failed for vote: %v", err)
	}
	
	// Check if all votes are in and decide the outcome
	go s.checkTransactionStatus(req.TransactionId)
	
	return &pb.RegisterVoteResponse{}, nil
}

func (s *coordinatorServer) checkTransactionStatus(txID string) {
    // This function must be idempotent and safe to be called multiple times.
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // First, lock this transaction check to prevent race conditions from concurrent vote registrations.
    // In a real system, a distributed lock (e.g., using etcd's lock recipe) is necessary.
    // For simplicity, we'll rely on the atomicity of etcd transactions.

    stateKey := fmt.Sprintf("%s%s/state", transactionPrefix, txID)
    participantsPrefix := fmt.Sprintf("%s%s/participants/", transactionPrefix, txID)

    // Use a transaction to read current state and all participant votes
    resp, err := s.etcdClient.Txn(ctx).Then(
        clientv3.OpGet(stateKey),
        clientv3.OpGet(participantsPrefix, clientv3.WithPrefix()),
    ).Commit()

    if err != nil {
        log.Printf("TX[%s]: Failed to read state from etcd: %v", txID, err)
        return
    }

    // Ensure we are still in PREPARING state. If not, a decision has already been made.
    stateKvs := resp.Responses[0].GetResponseRange().Kvs
    if len(stateKvs) == 0 {
        log.Printf("TX[%s]: State key not found (transaction may have expired). Aborting check.", txID)
        return
    }
    currentState := pb.TransactionState(atoi(string(stateKvs[0].Value)))
    if currentState != pb.TransactionState_PREPARING {
        log.Printf("TX[%s]: Decision already made (state: %s). Aborting check.", txID, currentState)
        return
    }

    participantVotes := resp.Responses[1].GetResponseRange().Kvs
    
    allVotesIn := true
    decision := pb.Vote_VOTE_COMMIT
    for _, kv := range participantVotes {
        vote := pb.Vote(atoi(string(kv.Value)))
        if vote == pb.Vote_VOTE_UNKNOWN {
            allVotesIn = false
            break
        }
        if vote == pb.Vote_VOTE_ABORT {
            decision = pb.Vote_VOTE_ABORT
            break // One abort vote is enough to doom the transaction
        }
    }

    if !allVotesIn {
        log.Printf("TX[%s]: Not all votes are in yet. Waiting.", txID)
        return
    }

    if decision == pb.Vote_VOTE_COMMIT {
        log.Printf("TX[%s]: All votes are COMMIT. Moving to COMMITTING.", txID)
        s.updateGlobalStateAndBroadcast(txID, pb.TransactionState_COMMITTING)
    } else {
        log.Printf("TX[%s]: At least one vote is ABORT. Moving to ABORTING.", txID)
        s.updateGlobalStateAndBroadcast(txID, pb.TransactionState_ABORTING)
    }
}

func (s *coordinatorServer) updateGlobalStateAndBroadcast(txID string, newState pb.TransactionState) {
    ctx := context.Background()
    stateKey := fmt.Sprintf("%s%s/state", transactionPrefix, txID)

    // Atomically update the global state from PREPARING to the new state.
    // This is the "commit point". If this fails, the state remains PREPARING.
    resp, err := s.etcdClient.Txn(ctx).
        If(clientv3.Compare(clientv3.Value(stateKey), "=", fmt.Sprintf("%d", pb.TransactionState_PREPARING))).
        Then(clientv3.OpPut(stateKey, fmt.Sprintf("%d", newState))).
        Commit()
    
    if err != nil {
        log.Printf("TX[%s]: Failed to update global state to %s: %v", txID, newState, err)
        // A retry mechanism would be needed here in production.
        return
    }

    if !resp.Succeeded {
        log.Printf("TX[%s]: Lost race to update global state. Another process already made a decision.", txID)
        return
    }

    log.Printf("TX[%s]: Global state successfully updated to %s. Broadcasting to participants.", txID, newState)
    
    // Now, broadcast the decision.
	participantsPrefix := fmt.Sprintf("%s%s/participants/", transactionPrefix, txID)
    getResp, _ := s.etcdClient.Get(ctx, participantsPrefix, clientv3.WithPrefix())

    for _, kv := range getResp.Kvs {
        // Extract participant ID from the key
        pID := string(kv.Key)[len(participantsPrefix):]

        // Again, GCF is a special case. We don't broadcast to it. It polls for the state.
        if pID == "gcp-cloud-function" {
            continue
        }
        
        go func(participantID string) {
            client, err := s.getParticipantClient(participantID)
            if err != nil {
                log.Printf("TX[%s]: Error getting client for %s during broadcast: %v", txID, participantID, err)
                return
            }

            broadcastCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
            defer cancel()

            if newState == pb.TransactionState_COMMITTING {
                _, err = client.Commit(broadcastCtx, &pb.CommitRequest{TransactionId: txID})
            } else {
                _, err = client.Abort(broadcastCtx, &pb.AbortRequest{TransactionId: txID})
            }
            if err != nil {
                // A production system needs a robust retry/recovery mechanism here.
                // e.g., writing to a recovery log.
                log.Printf("TX[%s]: FAILED to send %s to %s: %v", txID, newState, participantID, err)
            } else {
                 log.Printf("TX[%s]: Successfully sent %s to %s", txID, newState, participantID)
            }
        }(pID)
    }
}

// GetTransactionState is the crucial endpoint for polling clients like our Cloud Function.
func (s *coordinatorServer) GetTransactionState(ctx context.Context, req *pb.GetTransactionStateRequest) (*pb.GetTransactionStateResponse, error) {
	stateKey := fmt.Sprintf("%s%s/state", transactionPrefix, req.TransactionId)
	resp, err := s.etcdClient.Get(ctx, stateKey)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "etcd get failed: %v", err)
	}
	if len(resp.Kvs) == 0 {
		return nil, status.Errorf(codes.NotFound, "transaction %s not found", req.TransactionId)
	}
	
	stateVal := atoi(string(resp.Kvs[0].Value))
	return &pb.GetTransactionStateResponse{State: pb.TransactionState(stateVal)}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	s, err := newServer()
	if err != nil {
		log.Fatalf("failed to create server: %v", err)
	}

	grpcServer := grpc.NewServer()
	pb.RegisterCoordinatorServer(grpcServer, s)
	log.Println("Coordinator server listening on :50051")
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}

func atoi(s string) int {
	var i int
	// fmt.Sscan returns the number of items scanned, not the parsed value, so
	// the integer has to be read back through the pointer argument.
	fmt.Sscan(s, &i)
	return i
}

The pitfall here is the complexity in checkTransactionStatus. It must be thread-safe and idempotent. A race condition where two RegisterVote calls trigger the check simultaneously could lead to incorrect decisions. Using an etcd transaction to read the state and participant votes in one atomic operation mitigates this.
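
The code comment inside checkTransactionStatus mentions etcd’s lock recipe; the client ships one in go.etcd.io/etcd/client/v3/concurrency. Here is a hedged sketch of serializing the decision check per transaction with it; the lock key prefix, session lifetime, and timeout are assumptions, and error handling is simplified.

// lock_sketch.go
// A sketch of guarding the per-transaction decision check with etcd's lock
// recipe (go.etcd.io/etcd/client/v3/concurrency).
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// withTransactionLock serializes concurrent decision checks for one transaction
// so two RegisterVote calls cannot race each other.
func withTransactionLock(cli *clientv3.Client, txID string, check func(ctx context.Context)) error {
	session, err := concurrency.NewSession(cli) // keeps a lease alive backing the lock
	if err != nil {
		return fmt.Errorf("create session: %w", err)
	}
	defer session.Close()

	mutex := concurrency.NewMutex(session, fmt.Sprintf("/locks/transactions/%s", txID))

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := mutex.Lock(ctx); err != nil {
		return fmt.Errorf("acquire lock: %w", err)
	}
	defer mutex.Unlock(context.Background())

	check(ctx) // e.g. the body of checkTransactionStatus
	return nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatalf("etcd connect: %v", err)
	}
	defer cli.Close()

	_ = withTransactionLock(cli, "demo-tx", func(ctx context.Context) {
		log.Println("decision check runs here, serialized per transaction")
	})
}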

Step 3: The Stateful Participant on EKS

This is the “easy” part. It’s a standard gRPC service that implements the Participant interface. For this example, it will just log its actions, but in a real system, it would interact with a database.
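
It is worth noting that PostgreSQL supports exactly this participant role natively via PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED (the server must be configured with max_prepared_transactions > 0). Below is a hedged sketch of what a durable Prepare/Commit/Abort could look like on top of database/sql; the table, the statement, and the pgx driver are assumptions.

// pg_participant_sketch.go
// A sketch of a durable participant against PostgreSQL using its native
// two-phase commit statements. Table name, SQL, and the pgx driver are
// assumptions; the server needs max_prepared_transactions > 0.
package participant

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/jackc/pgx/v5/stdlib" // assumed driver; registers as "pgx"
)

// prepare does the work inside a local transaction, then parks it durably with
// PREPARE TRANSACTION instead of committing. txID comes from the coordinator
// and is assumed to be a trusted UUID (it is interpolated into the statement).
func prepare(ctx context.Context, db *sql.DB, txID, userID string) error {
	// PREPARE TRANSACTION detaches the transaction from the session, so we use
	// a pinned connection with an explicit BEGIN rather than sql.Tx.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "BEGIN"); err != nil {
		return err
	}
	if _, err := conn.ExecContext(ctx,
		"UPDATE profiles SET partner_synced = true WHERE id = $1", userID); err != nil {
		conn.ExecContext(ctx, "ROLLBACK")
		return err
	}
	if _, err := conn.ExecContext(ctx, fmt.Sprintf("PREPARE TRANSACTION '%s'", txID)); err != nil {
		conn.ExecContext(ctx, "ROLLBACK")
		return err
	}
	return nil // the prepared transaction now survives crashes and restarts
}

// commit finalizes a previously prepared transaction; it can run on any connection.
func commit(ctx context.Context, db *sql.DB, txID string) error {
	_, err := db.ExecContext(ctx, fmt.Sprintf("COMMIT PREPARED '%s'", txID))
	return err
}

// abort rolls a previously prepared transaction back.
func abort(ctx context.Context, db *sql.DB, txID string) error {
	_, err := db.ExecContext(ctx, fmt.Sprintf("ROLLBACK PREPARED '%s'", txID))
	return err
}

The simplified, log-only participant used for this walkthrough follows.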

// participant/main.go
package main

import (
	"context"
	"log"
	"net"
	"sync"

	"google.golang.org/grpc"
	pb "path/to/your/gen/transaction"
)

type participantServer struct {
	pb.UnimplementedParticipantServer
	preparedTxs sync.Map // In-memory store for prepared transactions state
}

// Prepare simulates locking a resource and writing to a write-ahead-log.
func (s *participantServer) Prepare(ctx context.Context, req *pb.PrepareRequest) (*pb.PrepareResponse, error) {
	log.Printf("Received Prepare for transaction %s", req.TransactionId)
	
	// In a real system:
	// 1. Begin a database transaction.
	// 2. Perform the required writes.
	// 3. DO NOT COMMIT.
	// 4. Store the transaction state in a durable log.
	
	// Simulate success
	s.preparedTxs.Store(req.TransactionId, "prepared")
	log.Printf("Transaction %s prepared locally.", req.TransactionId)
	return &pb.PrepareResponse{Vote: pb.Vote_VOTE_COMMIT}, nil
}

// Commit finalizes the change.
func (s *participantServer) Commit(ctx context.Context, req *pb.CommitRequest) (*pb.CommitResponse, error) {
	log.Printf("Received Commit for transaction %s", req.TransactionId)
	
	state, ok := s.preparedTxs.Load(req.TransactionId)
	if !ok || state != "prepared" {
		log.Printf("ERROR: Commit called for unknown or non-prepared tx %s", req.TransactionId)
		// This is a serious error state that requires manual intervention.
		return &pb.CommitResponse{}, nil
	}
	
	// In a real system:
	// 1. Commit the database transaction.
	// 2. Clean up the durable log entry.
	s.preparedTxs.Store(req.TransactionId, "committed")
	log.Printf("Transaction %s committed.", req.TransactionId)
	return &pb.CommitResponse{}, nil
}

// Abort rolls back the prepared change.
func (s *participantServer) Abort(ctx context.Context, req *pb.AbortRequest) (*pb.AbortResponse, error) {
	log.Printf("Received Abort for transaction %s", req.TransactionId)
	
	// In a real system:
	// 1. Rollback the database transaction.
	// 2. Clean up the durable log entry.
	s.preparedTxs.Delete(req.TransactionId)
	log.Printf("Transaction %s aborted.", req.TransactionId)
	return &pb.AbortResponse{}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":50052")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	grpcServer := grpc.NewServer()
	pb.RegisterParticipantServer(grpcServer, &participantServer{})
	log.Println("EKS Participant server listening on :50052")
	if err := grpcServer.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}

Step 4: The Ephemeral Participant - Google Cloud Function

This is where the architecture gets ugly, a direct consequence of forcing a blocking protocol onto a non-blocking, ephemeral execution model. A Cloud Function cannot host a gRPC server to receive Commit/Abort calls. Therefore, it must change its role from a passive participant to an active client that polls for the final decision.

The flow inside the function is:

  1. Connect to the coordinator on EKS. This requires setting up VPC connectors and firewall rules to allow traffic from GCP to the EKS cluster’s VPC.
  2. Call StartTransaction, registering itself (gcp-cloud-function) and the EKS service (eks-profile-service).
  3. Perform its local “prepare” work (e.g., writing to a temporary collection in Firestore).
  4. If the local prepare succeeds, call RegisterVote on the coordinator with VOTE_COMMIT.
  5. Enter a polling loop, calling GetTransactionState on the coordinator. This is the most problematic part: it keeps the function instance active, incurring cost and risking a timeout.
  6. Once the state changes from PREPARING to COMMITTING or ABORTING, exit the loop.
  7. Perform the local commit (move data from temp to final collection) or abort (delete temp data).
  8. Return a final status.

// gcf/function.go
package gcf

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "path/to/your/gen/transaction"
)

var coordClient pb.CoordinatorClient

// Use an init function to establish the gRPC connection once per instance.
func init() {
    // NOTE: This endpoint must be publicly accessible or connected via VPC Connector.
    // In production, this would use mTLS, not insecure credentials.
	conn, err := grpc.Dial("YOUR_COORDINATOR_IP_OR_DNS:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("Failed to connect to coordinator: %v", err)
	}
	coordClient = pb.NewCoordinatorClient(conn)
}

// HttpTrigger is the entry point for the Cloud Function.
func HttpTrigger(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	log.Println("Cloud Function triggered, starting cross-cloud transaction.")

	// 1. Start the transaction
	startResp, err := coordClient.StartTransaction(ctx, &pb.StartTransactionRequest{
		ParticipantIds: []string{"gcp-cloud-function", "eks-profile-service"},
	})
	if err != nil {
		log.Printf("Failed to start transaction: %v", err)
		http.Error(w, "Failed to start transaction", http.StatusInternalServerError)
		return
	}
	txID := startResp.TransactionId
	log.Printf("TX[%s]: Transaction started.", txID)

	// 2. Perform local prepare work
	if !performLocalPrepare(txID) {
		log.Printf("TX[%s]: Local prepare failed. Voting to abort.", txID)
		coordClient.RegisterVote(ctx, &pb.RegisterVoteRequest{
			TransactionId: txID,
			ParticipantId: "gcp-cloud-function",
			Vote:          pb.Vote_VOTE_ABORT,
		})
		http.Error(w, "Local prepare failed", http.StatusInternalServerError)
		return
	}
	log.Printf("TX[%s]: Local prepare successful.", txID)

	// 3. Register VOTE_COMMIT
	_, err = coordClient.RegisterVote(ctx, &pb.RegisterVoteRequest{
		TransactionId: txID,
		ParticipantId: "gcp-cloud-function",
		Vote:          pb.Vote_VOTE_COMMIT,
	})
	if err != nil {
		log.Printf("TX[%s]: Failed to register vote: %v", err)
		// We can't proceed. We must assume the transaction will time out and abort.
		performLocalAbort(txID)
		http.Error(w, "Failed to register vote", http.StatusInternalServerError)
		return
	}
	log.Printf("TX[%s]: Vote registered. Polling for final decision...", txID)


	// 4. Poll for the final decision
    // This is the anti-pattern, but necessary for this architecture.
	finalState, err := pollForFinalState(ctx, txID)
	if err != nil {
		log.Printf("TX[%s]: Polling failed: %v", err)
        // At this point, the state is uncertain. A recovery process is needed.
		http.Error(w, "Polling for final state failed", http.StatusInternalServerError)
		return
	}

	log.Printf("TX[%s]: Final decision received: %s", txID, finalState)

	// 5. Execute final decision
	if finalState == pb.TransactionState_COMMITTING || finalState == pb.TransactionState_COMMITTED {
		if performLocalCommit(txID) {
			fmt.Fprint(w, "Transaction committed successfully.")
		} else {
			http.Error(w, "CRITICAL: Local commit failed after global commit decision!", http.StatusInternalServerError)
		}
	} else {
		performLocalAbort(txID)
		fmt.Fprint(w, "Transaction aborted.")
	}
}

func pollForFinalState(ctx context.Context, txID string) (pb.TransactionState, error) {
	// A Cloud Function has a max timeout. We must respect that.
	// For this example, polling for up to 30 seconds.
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	
	timeout := time.After(30 * time.Second)
	
	for {
		select {
		case <-ticker.C:
			resp, err := coordClient.GetTransactionState(ctx, &pb.GetTransactionStateRequest{TransactionId: txID})
			if err != nil {
				log.Printf("TX[%s]: Error during polling: %v", txID, err)
				continue
			}
			
			if resp.State != pb.TransactionState_PREPARING {
				return resp.State, nil
			}
            log.Printf("TX[%s]: Still PREPARING...", txID)
		case <-timeout:
			return pb.TransactionState_UNKNOWN, fmt.Errorf("polling timed out after 30 seconds")
        case <-ctx.Done():
            return pb.TransactionState_UNKNOWN, fmt.Errorf("client request context cancelled")
		}
	}
}

// Stubs for local actions
func performLocalPrepare(txID string) bool { log.Printf("TX[%s]: Performing local prepare (e.g., write to temp doc)", txID); return true }
func performLocalCommit(txID string) bool  { log.Printf("TX[%s]: Performing local commit (e.g., move temp doc to final)", txID); return true }
func performLocalAbort(txID string)  { log.Printf("TX[%s]: Performing local abort (e.g., delete temp doc)", txID) }
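
The stubs above stand in for the Firestore staging pattern from the flow description: prepare writes to a temporary collection keyed by transaction ID, commit promotes the document to its final collection, abort deletes it. Here is a hedged sketch using the Firestore Go client; the collection names, and the way the client and payload are threaded through, are assumptions.

// gcf/firestore_sketch.go
// A sketch of the Firestore staging pattern the stubs stand in for. The
// collection names ("pending_partner_feeds", "partner_feeds") and the way the
// client and payload are passed in are assumptions.
package gcf

import (
	"context"
	"log"

	"cloud.google.com/go/firestore"
)

func prepareToTempCollection(ctx context.Context, client *firestore.Client, txID string, payload map[string]interface{}) bool {
	// Stage the write under the transaction ID; nothing user-visible changes yet.
	if _, err := client.Collection("pending_partner_feeds").Doc(txID).Set(ctx, payload); err != nil {
		log.Printf("TX[%s]: staging write failed: %v", txID, err)
		return false
	}
	return true
}

func commitFromTempCollection(ctx context.Context, client *firestore.Client, txID string) bool {
	tempRef := client.Collection("pending_partner_feeds").Doc(txID)
	snap, err := tempRef.Get(ctx)
	if err != nil {
		log.Printf("TX[%s]: reading staged doc failed: %v", txID, err)
		return false
	}
	// Promote the staged data to its final collection, then drop the temp doc.
	if _, err := client.Collection("partner_feeds").Doc(txID).Set(ctx, snap.Data()); err != nil {
		log.Printf("TX[%s]: promoting staged doc failed: %v", txID, err)
		return false
	}
	if _, err := tempRef.Delete(ctx); err != nil {
		log.Printf("TX[%s]: deleting staged doc failed (safe to retry): %v", txID, err)
	}
	return true
}

func abortTempCollection(ctx context.Context, client *firestore.Client, txID string) {
	if _, err := client.Collection("pending_partner_feeds").Doc(txID).Delete(ctx); err != nil {
		log.Printf("TX[%s]: deleting staged doc during abort failed: %v", txID, err)
	}
}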

This code highlights the architectural impedance mismatch. A common mistake is to underestimate the networking and IAM configuration needed to make the grpc.Dial from a Cloud Function to an internal EKS service work securely and reliably.

Lingering Issues and Applicability Boundaries

This implementation, while functional as a proof-of-concept, is fraught with production risks. The polling mechanism in the Google Cloud Function is inefficient, costly, and brittle; it’s a direct fight against the serverless execution model. A long-running Prepare phase in the EKS service would cause the Cloud Function to time out, leaving the transaction in an indeterminate state that requires manual intervention.

Furthermore, the entire system is vulnerable to the coordinator being a single point of failure. While etcd provides data durability, a network partition isolating the coordinator would halt all new transactions and block all in-flight ones until it’s resolved.

This entire exercise serves less as a recommended pattern and more as a stark illustration of why blocking distributed transaction protocols are a poor fit for serverless architectures. For any system with moderate throughput or a need for higher availability, an asynchronous, choreography-based Saga pattern would be a more resilient and scalable, albeit more complex to debug, choice.

