
This is currently just a dump of ideas and discussions - probably very vague, and thin on content. It will shortly be formatted well enough to be (hopefully) readable and understandable by others.

Cluster:

The cluster has a centralized coordinator service, and multiple masters and backups (which are clients of the coordinator). The coordinator consists of a small group of nodes. At any given point in time, exactly one of them is the leader and the others are followers. Only the leader services requests to the coordinator.
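A minimal sketch of this leader/follower split, with all class and method names assumed (not RAMCloud's actual API):

```python
# Hypothetical sketch: only the leader services coordinator requests;
# followers tell the client to retry at the leader.

class CoordinatorNode:
    def __init__(self, node_id, is_leader=False):
        self.node_id = node_id
        self.is_leader = is_leader

    def handle_request(self, request):
        # Followers do not service requests; clients must retry
        # against whichever node is currently the leader.
        if not self.is_leader:
            return ("NOT_LEADER", None)
        return ("OK", f"serviced {request} at node {self.node_id}")
```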

General Coordinator Design:

Thread Safety:

We want the coordinator to be thread safe.

Modularity:

Pull as much logic out of Coordinator as possible, for better* information hiding and modularity. Examples:

  • TabletMap:
    • Examples: Atomic operations like addTablet, removeTabletsForTable, etc.
    • Done. Ryan's commit: fb0a623d3957a172314fbe0b53566a5699e1c0e6
  • Will (a master's will, i.e. the plan for recovering its tablets if it crashes):
    • Similar to TabletMap.
  • ServerList:
    • Examples:
      • An atomic operation to mark a server as crashed and update the serverList version.
      • Updates to the serverList are broadcast directly to the cluster, without the coordinator having to do it separately.
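The TabletMap idea above can be sketched as a small thread-safe module; the method names follow the text, everything else (lock-per-map, key layout) is an assumption:

```python
import threading

# Hedged sketch of a TabletMap offering atomic operations, so the
# Coordinator itself no longer needs to reason about tablet bookkeeping.

class TabletMap:
    def __init__(self):
        self._lock = threading.Lock()
        self._tablets = {}   # (table_id, start_key) -> master_id

    def add_tablet(self, table_id, start_key, master_id):
        with self._lock:
            self._tablets[(table_id, start_key)] = master_id

    def remove_tablets_for_table(self, table_id):
        # Atomic: callers never observe a table with only some
        # of its tablets removed.
        with self._lock:
            removed = [k for k in self._tablets if k[0] == table_id]
            for k in removed:
                del self._tablets[k]
            return len(removed)
```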

Fault Tolerance:

To make the coordinator fault tolerant, the leader should persist information about the requests being serviced in a highly reliable, consistent storage service, such as LogCabin. If the leader fails at some point, one of the followers takes over as the leader and can replay the log.
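A minimal sketch of this replay idea, assuming a reliable append-only log in the spirit of LogCabin (modeled here as a plain list; function names are illustrative):

```python
# Re-applying an entry must be deterministic, so that replay
# reconstructs exactly the state the failed leader had.

def apply(entry, state):
    op, table = entry
    if op == "createTable":
        state.add(table)
    elif op == "dropTable":
        state.discard(table)

def recover(log):
    # A follower taking over as leader replays the whole log to
    # rebuild the coordinator's local state before serving requests.
    state = set()
    for entry in log:
        apply(entry, state)
    return state
```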

Encapsulating Information:

We encapsulate the information to be persisted in terms of state. The leader pushes the current state of the request being serviced to the log at distinct commit points. The state captures changes to the coordinator's local state, as well as the communication done (or still to be done) with the external world (the masters and backups in the cluster) to change their state.
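One possible shape for such a persisted entry; the field names are assumptions, but the idea (a commit point captures local changes plus pending external RPCs, with a done flag) follows the text above:

```python
from dataclasses import dataclass, field

# Illustrative state entry pushed to the log at each commit point.

@dataclass
class StateEntry:
    opcode: str              # e.g. "createTable"
    local_update: dict       # changes to the coordinator's own state
    external_ops: list = field(default_factory=list)  # RPCs to masters/backups still owed
    done: bool = False       # set once all external communication has completed
```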

RPC thread and Replay thread:

Requests arriving at the coordinator are serviced by the RPC thread. The actual work is done by a separate thread, called the replay thread. Only the current leader has an active RPC thread; the leader and the followers all run the replay thread continuously.

In the general case, when the leader's RPC thread receives a request, it appends the request to the log. The leader's replay thread follows the log, picks up the request from there, and does the required work. Every follower's replay thread can also follow the log in real time to keep its local state up to date, but it only applies a request after that request has been completed (its state indicates "done"). On recovery, the new leader's replay thread replays all log entries that have not yet been replayed, including those for incomplete actions. This can be represented as:

Design: RPC and Replay threads
// RPC
if (leader) {
    LogCabinEntry e = new LogCabinEntry(actionNotCompleted = x);
    appendToLogCabin(e);
    wait(conditionVariable);
    replyToRpc(read data to be returned from some common data struct / log);
}

// REPLAY
while (true) {
    e <- LogCabinIterator.getNextEntry()

    // IF the entry contains state for a pending request,
    // and I am not the leader, THEN
    // stop at this point, and retry getting the entry.
    if (!e.done && me != leader) {
        LogCabinIterator.goBack();
        continue;
    }

    // CASE 1: The entry contains state for a pending action, and I am
    // a) the leader in the general case, OR
    // b) a recovering leader. OR
    // CASE 2: The entry contains state for a completed action, and I am
    // a) a follower trying to keep my state up to date in real time, OR
    // b) a recovering leader catching up on state changes I missed.
    // THEN do the action corresponding to this entry.
    // The action is determined by the state encapsulated in the entry.
    e.doAction();
    signal(conditionVariable);
}

// doAction dispatches the request to the appropriate method,
// such as doCreateTable or doDropTable.
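The handoff between the two threads can also be sketched as runnable code. This is a loose translation, not the actual implementation: a queue stands in for the LogCabin log, a per-entry Event replaces the shared condition variable, and all names are assumptions:

```python
import threading
import queue

log = queue.Queue()   # stand-in for the LogCabin log

def rpc_thread(request):
    # Leader's RPC thread: append the pending request, then wait
    # until the replay thread has performed the action.
    entry = {"request": request, "done": threading.Event(), "result": None}
    log.put(entry)
    entry["done"].wait()
    return entry["result"]

def replay_thread():
    # Follows the log and does the actual work for each entry.
    while True:
        entry = log.get()
        if entry is None:          # shutdown sentinel, for this sketch only
            return
        entry["result"] = f"did {entry['request']}"
        entry["done"].set()
```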

 

Note that when a client sends a request to the coordinator, we make no guarantee that the request will be completed, so it is perfectly fine to simply drop a request in case of failure. However, we don't want to leave the cluster in an inconsistent state, which could happen if the original leader had partially serviced the request. For example, in the case of createTable, the leader might have assigned tablet ownership for 2 of the 5 tablets to masters before it failed. So during recovery, we are only concerned with requests that had already started being serviced. This eliminates the need for the RPC thread on the leader to append the state for an incoming request to the log; it can directly call the corresponding method in the replay thread, which will then append the state when necessary. This modifies the above design to:

Change in Design: RPC and Replay threads
// RPC
if (leader) {
    switch (opcode) {
        case "createTable":
            retVal = doCreateTable(rpc, NULL);
            break;
        case "dropTable":
            retVal = doDropTable(rpc, NULL);
            break;
        .
        .
        .
    }
    replyToRpc(retVal);
}

// REPLAY
// The replay thread is the same as in the previous design.
// However, CASE 1a does not exist anymore.
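With this change, the handler itself appends state at its first commit point, i.e. only once it is about to make changes that could leave the cluster inconsistent. A hedged sketch (the names do_create_table and persisted_log are illustrative):

```python
persisted_log = []   # stand-in for the LogCabin log

def do_create_table(name, tablet_count):
    # Before the first commit point nothing is persisted: if the
    # leader fails here, the request is simply dropped, which is
    # acceptable since completion was never guaranteed.
    plan = [(name, i) for i in range(tablet_count)]
    # First commit point: we are about to assign tablet ownership to
    # masters, so persist enough state for a new leader to finish
    # (or clean up) the work.
    persisted_log.append({"op": "createTable", "name": name,
                          "tablets": plan, "done": False})
    # ... assign tablet ownership to masters here ...
    persisted_log[-1]["done"] = True
    return name
```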

 

Order of execution on recovery:

When a leader fails and a new leader takes over, the new leader first replays the remainder of the log in order, and only after that starts servicing requests (RPCs).
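A minimal sketch of this ordering, with all names assumed:

```python
def take_over_as_leader(log, replayed_upto, service_requests):
    state = {}
    # 1. Replay every log entry not yet applied locally, in log order,
    #    including entries for incomplete actions.
    for entry in log[replayed_upto:]:
        state[entry["op"]] = entry
    # 2. Only after replay completes does the RPC thread start
    #    servicing requests against the rebuilt state.
    return service_requests(state)
```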

To understand why this decision was made, consider a simple example:

 

When replay for an action can't be completed:

 

Get recovery to do your work for you: Changing the definition of "done":

 

 

 


* For some definition of this word.
