We can reuse the linearizability infrastructure: RPC id assignment, tracking of unacknowledged RPCs, and the efficient garbage collection mechanism. The transaction we are suggesting is very similar to the one described in Sinfonia, but our mechanism provides a much more straightforward and efficient way to do garbage collection.

(John's comments are in red, marked "JO: ..."; replies follow as "A: ...")

 

Data Structures

- Lock table entry (per object): <clientId, seq# of PREPARE (only needed if we allow multiple concurrent transactions from a single client), WorkerTimer, newVal>

- LockRecord: <TableId, key, condition, newVal, clientId, (seq# of PREPARE)>

- RpcRecord: header: <clientId, keyHash, TableId>; response: <VOTE, list of <TableId, keyHash, rpcId>> (for recovery).

- COMMITTED: set of <rpcId>
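
To pin these down, here is a minimal C++ sketch of the four structures. All field names and types are illustrative assumptions; the actual RAMCloud key and log-entry types differ.

    #include <cstdint>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // All names/types below are assumptions for illustration only.
    typedef uint64_t ClientId;
    typedef uint64_t SeqNum;    // seq# of a PREPARE; doubles as its rpcId
    typedef uint64_t KeyHash;
    typedef uint64_t TableId;

    // Lock table entry: kept per object while a PREPARE holds the lock.
    struct LockTableEntry {
        ClientId clientId;
        SeqNum prepareSeq;      // only needed with concurrent TXs per client
        // WorkerTimer omitted; it would trigger client-failure recovery.
        std::string newVal;     // buffered write, applied on COMMIT
    };

    // LockRecord: durably logged when the lock is granted.
    struct LockRecord {
        TableId tableId;
        std::string key;
        std::string condition;  // serialized RejectRules; may be empty (NULL)
        std::string newVal;
        ClientId clientId;
        SeqNum prepareSeq;
    };

    // One participant of the transaction, listed in every PREPARE.
    struct Participant {
        TableId tableId;
        KeyHash keyHash;
        SeqNum rpcId;
    };

    // RpcRecord: linearizability record for one PREPARE.
    struct RpcRecord {
        ClientId clientId;      // header
        KeyHash keyHash;
        TableId tableId;
        bool commitVote;        // response: COMMIT-VOTE or ABORT-VOTE
        std::vector<Participant> participants;  // for recovery
    };

    // COMMITTED: outcomes durably decided as COMMIT (see Garbage Collection).
    std::unordered_set<SeqNum> committedSet;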

 

Normal Operation

<networkDiagram_normalOp>

  1. A client reserves sequence numbers for RPC ids. It reserves M+1 consecutive ids, where M is the number of objects involved in the current transaction. The lowest seq# is not assigned to any object or RPC and works as a placeholder. The other M sequence numbers are assigned one per object.
  2. RPC (1) PREPARE: A client sends prepare messages to all data master servers participating in the transaction. For understandability, we send a separate RPC request for each object in the transaction.
    1. Request msg: <list of <tableId, keyHash, Seq#>, tableId, key, condition, newVal>
      1. list of <tableId, keyHash, Seq#>: used in case of client disconnection.
      2. TableId & Key: object operating on.
      3. Condition: a condition for COMMIT-VOTE other than successful locking, expressed as RAMCloud RejectRules. This can be NULL.
      4. newVal: value to be written for “key” on the receipt of “COMMIT”.
    2. Handling:
      1. Grab a lock for “key” on lock table. Buffer newVal for the key.
      2. - If the lock was grabbed & condition is satisfied, log LockRecord (lock information. See figure~\ref{fig:lockRecord}) and RpcRecord with the result of "COMMIT-VOTE" and <list of <tableId, keyHash, Seq#>> (linearizability. See figure~\ref{fig:rpcRecord})
        - If grabbed the lock but condition is not satisfied, unlock immediately, and log RpcRecord with the result of “ABORT-VOTE” and <list of <tableId, keyHash, Seq#>>
        - If we failed to grab the lock, log RpcRecord with the result of “ABORT-VOTE” and <list of <tableId, keyHash, Seq#>>.
        (JO: why do we need to log anything here? The abort condition is permanent, no? A: a retried PREPARE can successfully grab a lock. I suspect this can cause the client to see "ABORT" while the recovery process sees "COMMIT".)

        (JO: Ahah, I see now: I was thinking of the case where the condition is not satisfied. This condition is permanent (any retries will also fail), so we only need to log ABORT-VOTE if we couldn't grab the lock, right? A: Well, it depends on the details of the protocol. Currently, we unlock immediately if the condition didn't match. A subsequent TX can change the object. The retried PREPARE can now vote for COMMIT, leaving a "COMMIT-VOTE" in the linearizability record. The same problem can happen.)
         
        (JO: if the condition wasn't satisfied initially, how could an object modification result in the condition being satisfied in the future?)
      3. Sync log with backup.
      4. JO: I think that the server needs to record the <list of <tableId, KeyHash, Seq#>> as well; this needs to be durable, no? A: Yes, it is recorded with the linearizability record, in the response field of the RpcRecord.
    3. Response: either “COMMIT-VOTE” or “ABORT-VOTE”.
  3. RPC(3) DECISION: After collecting all votes from the data masters, the client broadcasts the decision to the cohorts that voted COMMIT.
    1. Request: <tableId, keyHash, seq# for PREPARE, DECISION>
    2. Handling: if DECISION = COMMIT,
      1. If there is a buffered write, log Object (with new value), Tombstone for old Object, and Tombstone for LockRecord atomically.
      2. Unlock the object in lock table.
      3. Sync log with backup.
        (It is not okay to delay the sync until we sync the next transaction's LockRecord.)
    3. Response: ACK.
  4. After collecting "ACK" from all cohorts, the client acknowledges the lowest reserved seq#, so that the ACK# can reach up to the highest seq# used in this transaction. (A client-side sketch of steps 1-4 follows.)
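
To make the flow above concrete, here is a hedged client-side sketch. reserveRpcIds, hash, sendPrepare, sendDecision, and ackRpcId are hypothetical placeholders, not actual RAMCloud client calls.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    typedef uint64_t SeqNum;
    typedef uint64_t KeyHash;
    typedef uint64_t TableId;
    enum Decision { COMMIT, ABORT };
    struct Participant { TableId tableId; KeyHash keyHash; SeqNum rpcId; };
    struct WriteOp { TableId tableId; std::string key; std::string newVal; };

    // Hypothetical transport/bookkeeping hooks (assumed, not real APIs).
    SeqNum reserveRpcIds(size_t count);
    KeyHash hash(const std::string& key);
    bool sendPrepare(const WriteOp& op, const std::vector<Participant>& all,
                     SeqNum seq);                  // true means COMMIT-VOTE
    void sendDecision(const WriteOp& op, SeqNum prepareSeq, Decision d);
    void ackRpcId(SeqNum lowestReserved);

    bool commitTransaction(const std::vector<WriteOp>& ops) {
        size_t m = ops.size();
        // Step 1: reserve M+1 consecutive seq#s; the lowest is a placeholder.
        SeqNum base = reserveRpcIds(m + 1);

        // Step 2: one PREPARE per object; each carries the full participant
        // list <tableId, keyHash, seq#> so any cohort can drive recovery.
        std::vector<Participant> participants;
        for (size_t i = 0; i < m; i++)
            participants.push_back({ops[i].tableId, hash(ops[i].key),
                                    base + 1 + i});
        std::vector<bool> votes(m);
        bool allCommit = true;
        for (size_t i = 0; i < m; i++) {
            votes[i] = sendPrepare(ops[i], participants, base + 1 + i);
            allCommit = allCommit && votes[i];
        }

        // Step 3: send the decision only to cohorts that voted COMMIT
        // (ABORT-VOTE cohorts have already released their locks).
        for (size_t i = 0; i < m; i++)
            if (votes[i])
                sendDecision(ops[i], base + 1 + i, allCommit ? COMMIT : ABORT);

        // Step 4: acking the lowest reserved seq# lets the ACK# cover
        // base..base+m, so servers can garbage-collect this TX's RpcRecords.
        ackRpcId(base);
        return allCommit;
    }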

 

Discussions & Questions:

Latency for the transaction result: 1 RTT + 1D. However, how should we implement background processing of RPC(3)? It may be easier to implement the "harder" scheme, which requires a much simpler client library.

 

Lock time: 1 RTT + 1D. We merge the log sync for 3 with that of 1 to claim 1D. Is that a valid claim?

 

 

Automatic recovery from master’s failures

Two options:

- A client messages every cohort to prevent the WorkerTimer from firing and transaction recovery from being invoked.

- Even a master's crash causes transaction recovery: this can cause complexity.

 

Client Failure Recovery

Guarantee that only one recovery or cleanup can proceed concurrently by marking the lowest reserved RPC id in UnackedRpcResults (of the recovery coordinator) as an indicator that recovery is in progress/finished. (No durability is required.) (JO: sounds like additional complexity for our UnackedRpcResults mechanism. A: Well, just a monitor-style lock is okay for now. But I think the meaning of UnackedRpcResults remains the same, since we are tracking unacked TXs using a (virtual) rpcId representing the TX.)
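
A minimal sketch of such a monitor-style guard, keyed on the transaction's lowest reserved rpcId (the shape is an assumption; UnackedRpcResults' real interface differs):

    #include <cstdint>
    #include <mutex>
    #include <unordered_set>

    typedef uint64_t SeqNum;

    // At most one recovery/cleanup may run per transaction. The TX is
    // identified by its lowest reserved rpcId (the "virtual" rpcId).
    class RecoveryGuard {
      public:
        // Returns true iff the caller won the right to run recovery.
        bool tryStart(SeqNum txRpcId) {
            std::lock_guard<std::mutex> guard(mutex);
            return inProgress.insert(txRpcId).second;
        }
        void finish(SeqNum txRpcId) {
            std::lock_guard<std::mutex> guard(mutex);
            inProgress.erase(txRpcId);
        }
      private:
        std::mutex mutex;
        std::unordered_set<SeqNum> inProgress;
    };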

 

  1. RPC(5): when a DM detects the crash of a client (or slowness of a client) via the lock's WorkerTimer, it sends a "StartRecovery" request to the recovery coordinator (the server with the 1st entry in the list of keyHashes).
    1. Request: <clientId, list of <tableId, keyHash, rpcId>>
    2. Handling: the recovery coordinator initiates the recovery protocol. Possible optimization: use UnackedRpcResults to avoid duplicate recoveries. CAUTION: avoid the deadlock where recovery jobs occupy all threads in a master.
    3. Response: Empty
  2. RPC(6): Recovery coordinator sends requestAbort to clean up & release all locks in masters.
    1. Request: <clientId, seq#, list of <tableId, keyHash, Seq#>>
    2. Handling:
      1. checkDuplicate with given clientID & seq#
      2. if exists, respond with saved results.
      3. If not, respond "ABORT-VOTE" and durably log an RpcRecord with the result "ABORT-VOTE" and <list of <tableId, keyHash, Seq#>>. (JO: does this need to be logged durably? A: Right.)
    3. Response: COMMIT-VOTE | ABORT-VOTE
  3. After the recovery coordinator collects all votes, it sends the decision to the cohorts that voted COMMIT.
    1. Request: <DECISION, clientId, rpcId in RPC(6)>
    2. Handling:
      1. Check whether a lock is grabbed for the rpcId. (2 methods, need discussion: the 1st soln is saving "key" in RpcRecord::response and using the key to look up the lock table; the 2nd soln is keeping a separate table or list of all locks.) (JO: just allow locks to be looked up by rpcId? This is unique. Or, just scan the lock table for the rpcId; this won't happen very often. A: depends on the implementation of the lock table. If the lock table is a separate table, we can just enumerate it. If the lock information is kept as part of the object hash table, I think it is not feasible to enumerate the whole hash table. Collin is thinking about the lock table implementation.)
      2. If no lock is grabbed, respond with “ACK”
      3. If a lock was grabbed, flush the buffered write (detail is same as normal operation.) and unlock the object.
    3. Response: ACK (empty)
  4. The recovery coordinator is finished with the transaction. Leaving the RpcRecord around is safe in case the client resurrects before its lease times out. (A coordinator-side sketch of steps 2-4 follows.)
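
A hedged sketch of the coordinator side of steps 2-4; sendRequestAbort and sendRecoveryDecision are assumed helper names for RPC(6) and the decision RPC, not real APIs.

    #include <cstdint>
    #include <vector>

    typedef uint64_t ClientId;
    typedef uint64_t SeqNum;
    typedef uint64_t KeyHash;
    typedef uint64_t TableId;
    enum Vote { COMMIT_VOTE, ABORT_VOTE };
    enum Decision { COMMIT, ABORT };
    struct Participant { TableId tableId; KeyHash keyHash; SeqNum rpcId; };

    // Assumed RPC helpers.
    Vote sendRequestAbort(ClientId client, const Participant& p);   // RPC(6)
    void sendRecoveryDecision(ClientId client, const Participant& p,
                              Decision d);

    void recoverTransaction(ClientId client,
                            const std::vector<Participant>& all) {
        // Step 2 / RPC(6): requestAbort doubles as a vote probe. If the
        // cohort already logged a vote for this rpcId, checkDuplicate
        // returns it; otherwise the cohort durably logs ABORT-VOTE.
        std::vector<Vote> votes;
        bool allCommit = true;
        for (const Participant& p : all) {
            votes.push_back(sendRequestAbort(client, p));
            allCommit = allCommit && (votes.back() == COMMIT_VOTE);
        }
        // Step 3: send the decision to the cohorts that voted COMMIT;
        // each one flushes its buffered write (on COMMIT) and unlocks.
        for (size_t i = 0; i < all.size(); i++)
            if (votes[i] == COMMIT_VOTE)
                sendRecoveryDecision(client, all[i],
                                     allCommit ? COMMIT : ABORT);
        // Step 4: done. RpcRecords are left in place in case the client
        // resurrects before its lease expires.
    }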

 

 

Garbage Collection

In addition to garbage collection by ack ID, we should purge unnecessary information after a client's lease times out (typically due to client termination). Once the lease has expired, we don't expect the client to finish its outstanding transactions or to send RPC requests to find out their outcomes. So we should finish any pending transactions and purge all durable records that were kept to answer client RPCs about those transactions. After this garbage collection, any PREPARE request from the client will be rejected due to the expired lease, and a DECISION request will simply be answered with ACK without any internal processing (since no lock is held).

Since we purge the durable records about votes in this process, we cannot rely on the distributed durable vote information to recover from a crash. Thus, we switch to regular 2-phase commit and rely on a single durable record of the transaction outcome: after collecting votes, the recovery coordinator durably records its decision. Effectively, we only need to log COMMIT decisions, since we assume abort if no linearizability record exists at the RequestAbort step. We also keep the commit decisions in memory, in the COMMITTED set, for faster access. Without this record, a crash of the recovery coordinator after partially sending DECISIONs in step 4 could cause inconsistency: such a crash could change a transaction originally decided to commit into an abort, since the RequestAbort message returns ABORT-VOTE after the garbage collection in step 4. (JO: I still don't understand this; perhaps the best thing is to discuss in person. Why can't garbage collection be separated entirely from aborting/completing transactions, and thereby made much simpler? For example, use locks on objects to force transaction completion/abort, and make this independent of the mechanism for garbage-collecting unacked RPC results?) (JO: also, if the COMMITTED set is needed here, isn't it also needed in the Client Crash Recovery section above?)
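
A sketch of this decision logging, assuming a hypothetical logCommitDecision helper that durably appends the outcome to the log:

    #include <cstdint>
    #include <unordered_set>

    typedef uint64_t SeqNum;   // the TX's (virtual) rpcId

    // Assumed helper: durably appends a "TX committed" entry to the log.
    void logCommitDecision(SeqNum txRpcId);

    std::unordered_set<SeqNum> committedSet;  // in-memory COMMITTED set

    // Called by the recovery coordinator once all votes are collected
    // during lease-expiration cleanup. Only COMMIT outcomes are made
    // durable: with the vote records purged, RequestAbort defaults to
    // ABORT-VOTE, so an unlogged TX is implicitly an abort.
    void recordOutcome(SeqNum txRpcId, bool committed) {
        if (committed) {
            logCommitDecision(txRpcId);   // survives a coordinator crash
            committedSet.insert(txRpcId); // fast path for step 2 below
        }
    }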

  1. RPC(5): when a DM detects the expiration of a client lease, it checks whether there is unacknowledged transaction information, and sends a "StartCleanup" request to the recovery coordinator of each such transaction (the server with the 1st entry in the list of keyHashes). (JO: what is "unacknowledged transaction information"? In addition, why is this step necessary? If the transaction completed, then on lease expiration the DM can just discard its unacked RPC results, like it would for any other linearizable operation that completed. If the transaction didn't complete, then the timer mechanism for locks will already have triggered long before the lease expired, no?)
    1. Request: <clientId, list of <tableId, keyHash, rpcId>>  (+ also clusterTime).
    2. Handling: recovery coordinator initiates cleanup protocol. Possible optimization: use UnackedRpcResults to avoid duplicate cleanups/recoveries.
    3. Response: Empty
  2. Check if the COMMITTED set has this TX's record. If it was decided to commit before, skip step 3 and send the COMMIT message in step 4.
    (JO: I'm still confused (my earlier comment on this seems to have gotten deleted without answering the questions). Exactly what is the COMMITTED set? Perhaps explain this above when you first mention the COMMITTED set? Why is this needed? Is there a problem if step 3 gets executed multiple times? A: answered in 2nd paragraph of this section.)
  3. RPC(6): Recovery coordinator sends requestAbort to clean up & release all locks in masters.
    1. Request: <clientId, seq#>
    2. Handling:
      1. checkDuplicate with given clientID & seq#
      2. if exists, respond with saved results.
      3. If not, respond "ABORT-VOTE" (JO: durable? A: we don't expect a retried PREPARE, and we will reject the PREPARE anyway, so it is safe without durable logging here.)
    3. Response: COMMIT-VOTE | ABORT-VOTE
  4. After the recovery coordinator collects all votes, it durably logs the outcome of the TX (only if the outcome is COMMIT), adds it to the COMMITTED set, and sends the decision & orders cleanup.
    1. Request: <DECISION, clientId, rpcId in RPC(6)>
    2. Handling:
      1. Check a lock is grabbed for rpcId
      2. If a lock was grabbed, flush the buffered write (detail is same as normal operation.) and unlock the object.
      3. Clean up the RpcRecord by manually marking it "acked" in UnackedRpcResults. Refactoring UnackedRpcResults is required to support marking "acked" and shrinking its window accordingly. We delete the whole client information as soon as all TXs are marked "acked". (A sketch of this refactoring follows this list.)
      4. Respond ACK.
    3. Response: ACK (empty)
  5. The recovery coordinator deletes the logged result of the transaction (written in step 4) by appending a tombstone for the TX outcome entry. It is now safe to remove the TX's record from the COMMITTED set.
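
Finally, a hedged sketch of the "manual ack" refactoring mentioned in step 4.3; the real UnackedRpcResults interface differs, and this only illustrates how the window could shrink.

    #include <cstdint>
    #include <map>

    typedef uint64_t SeqNum;

    // Sketch of per-client state in UnackedRpcResults, extended with a
    // manual "acked" mark. The shape is an assumption for illustration.
    struct ClientRecord {
        SeqNum ackId = 0;              // everything <= ackId is acknowledged
        std::map<SeqNum, bool> acked;  // manual acks above ackId

        // Mark one rpcId acked, then shrink the window by advancing ackId
        // over any contiguous run of manually acked ids just above it.
        void markAcked(SeqNum rpcId) {
            acked[rpcId] = true;
            while (!acked.empty() && acked.begin()->first == ackId + 1) {
                ackId++;
                acked.erase(acked.begin());
            }
        }
        // Once all TXs are acked, the whole client record can be deleted.
        bool fullyAcked() const { return acked.empty(); }
    };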