...
- A client reserves sequence numbers for RPC ids. It reserves M+1 consecutive ids, where M is the number of objects involved in the current transaction. The lowest seq# is not assigned to any object or RPC and works as a placeholder. The other M sequence numbers are assigned one per object.
- RPC (1) PREPARE: The client sends a prepare message to every data master participating in the transaction. For understandability, we send a separate RPC request for each object in the transaction.
- Request msg: <list of <tableId, keyHash, Seq#>, tableId, key, condition, newVal>
- list of <tableId, keyHash, Seq#>: used in case of client disconnection.
- TableId & Key: object operating on.
- Condition: an additional requirement for COMMIT-VOTE beyond successful locking, expressed as RAMCloud RejectRules. This can be NULL.
- newVal: value to be written for “key” on the receipt of “COMMIT”.
- Handling:
- Grab a lock for “key” in the lock table. Buffer newVal for the key.
- If the lock was grabbed and the condition is satisfied, log a LockRecord (lock information; see figure~\ref{fig:lockRecord}) and an RpcRecord with the result “COMMIT-VOTE” and the <list of <tableId, keyHash, Seq#>> (linearizability; see figure~\ref{fig:rpcRecord}).
- If the lock was grabbed but the condition is not satisfied, unlock immediately and log an RpcRecord with the result “ABORT-VOTE” and the <list of <tableId, keyHash, Seq#>>.
- If the lock could not be grabbed, log an RpcRecord with the result “ABORT-VOTE” and the <list of <tableId, keyHash, Seq#>>.
(JO: why do we need to log anything here? The abort condition is permanent, no? A: a retried PREPARE could successfully grab the lock. I suspect this could let the client see “ABORT” while the recovery process decides “COMMIT”.)
- Sync log with backup.
- JO: I think that the server needs to record the <list of <tableId, keyHash, Seq#>> as well; this needs to be durable, no? A: Yes, it is recorded with the linearizability record, in the response field of the RpcRecord.
- Response: either “COMMIT-VOTE” or “ABORT-VOTE”.
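The PREPARE handling above can be sketched as follows. This is a minimal in-memory sketch, not RAMCloud code: the `Master` class, its dictionaries, the `sync` placeholder, and the use of a plain key instead of <tableId, keyHash> are all assumptions made for illustration. The condition is modelled as an optional predicate on the current value rather than as real RejectRules.

```python
from dataclasses import dataclass, field

COMMIT_VOTE = "COMMIT-VOTE"
ABORT_VOTE = "ABORT-VOTE"


@dataclass
class Master:
    """Toy data master; every field is a hypothetical in-memory stand-in."""
    objects: dict = field(default_factory=dict)      # key -> current value
    locks: dict = field(default_factory=dict)        # key -> rpc id holding the lock
    buffered: dict = field(default_factory=dict)     # key -> newVal awaiting DECISION
    rpc_records: dict = field(default_factory=dict)  # rpc id -> (vote, participants)
    log: list = field(default_factory=list)          # appended records, in order

    def prepare(self, client_id, participants, key, seq, condition, new_val):
        rpc_id = (client_id, seq)
        # A retried PREPARE replays the recorded vote instead of re-evaluating,
        # which is why ABORT-VOTE is recorded too.
        if rpc_id in self.rpc_records:
            return self.rpc_records[rpc_id][0]

        if key in self.locks:
            vote = ABORT_VOTE                  # failed to grab the lock
        else:
            self.locks[key] = rpc_id           # grab the lock
            if condition is None or condition(self.objects.get(key)):
                vote = COMMIT_VOTE
                self.buffered[key] = new_val
                self.log.append(("LockRecord", key, rpc_id))
            else:
                del self.locks[key]            # condition failed: unlock immediately
                vote = ABORT_VOTE

        # The RpcRecord carries the vote and the participant list in every case.
        self.rpc_records[rpc_id] = (vote, tuple(participants))
        self.log.append(("RpcRecord", rpc_id, vote))
        self.sync()
        return vote

    def sync(self):
        """Placeholder for replicating the log to backups (no-op here)."""
```

Keeping the vote in `rpc_records` for both outcomes is what prevents a retried PREPARE from changing its answer after the lock has been released by someone else.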
- RPC(3) DECISION: After collecting all votes from the data masters, the client sends its decision to all cohorts that voted for COMMIT. (JO: no need to broadcast to servers that voted ABORT?)
- Request: <tableId, keyHash, seq# for PREPARE, DECISION>
- Handling: if DECISION = COMMIT,
- If there is a buffered write, log Object (with new value), Tombstone for old Object, and Tombstone for LockRecord atomically.
- Unlock the object in lock table.
- Sync log with backup.
(It is not okay to delay this sync until the next transaction’s LockRecord is synced.)
- Response: ACK.
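A minimal sketch of the DECISION handling on a data master, under stated assumptions: the `Master` class and its fields are hypothetical, a nested list stands in for one atomic log append, and the ABORT branch is a guess, since the notes above only spell out the COMMIT case.

```python
COMMIT = "COMMIT"
ABORT = "ABORT"


class Master:
    """Toy data master holding at most one prepared write per key."""

    def __init__(self):
        self.objects = {}    # key -> committed value
        self.locks = {}      # key -> PREPARE rpc id that holds the lock
        self.buffered = {}   # key -> newVal buffered by PREPARE
        self.log = []        # each entry models one atomic log append

    def decision(self, key, prepare_rpc_id, decision):
        # Only the PREPARE that holds the lock may resolve it. A duplicate
        # DECISION finds no matching lock and is acknowledged as a no-op.
        if self.locks.get(key) != prepare_rpc_id:
            return "ACK"
        new_val = self.buffered.pop(key)
        if decision == COMMIT:
            # New Object, Tombstone for the old Object, and Tombstone for the
            # LockRecord are appended together as one atomic unit.
            self.log.append([
                ("Object", key, new_val),
                ("Tombstone", "Object", key),
                ("Tombstone", "LockRecord", key),
            ])
            self.objects[key] = new_val
        else:
            # Assumption: on ABORT only the LockRecord needs a Tombstone.
            self.log.append([("Tombstone", "LockRecord", key)])
        del self.locks[key]   # unlock the object
        self.sync()           # must happen before the ACK is returned
        return "ACK"

    def sync(self):
        """Placeholder for replicating the log to backups (no-op here)."""
```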
- After collecting “ACK” from all cohorts, the client acknowledges the lowest seq# reserved, so that ACK# can advance up to the highest seq# used in this transaction.
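One reading of the placeholder is that it holds the client's ACK# back while the per-object RPCs complete, so that servers keep their RpcRecords until the whole transaction is done. The sketch below illustrates that reading only. `RpcTracker`, `start_transaction`, and the rule "ACK# is one less than the lowest unfinished seq#" are assumptions made for illustration, not RAMCloud's actual classes.

```python
class RpcTracker:
    """Toy per-client tracker of sequence numbers (hypothetical)."""

    def __init__(self):
        self.next_seq = 1
        self.unfinished = set()

    def reserve(self, count):
        """Reserve `count` consecutive ids and return the first one."""
        first = self.next_seq
        self.next_seq += count
        self.unfinished.update(range(first, first + count))
        return first

    def finish(self, seq):
        """Mark one sequence number as acknowledged."""
        self.unfinished.discard(seq)

    def ack_id(self):
        """Highest seq# such that it and every lower seq# has finished."""
        return min(self.unfinished, default=self.next_seq) - 1


def start_transaction(tracker, keys):
    """Reserve M+1 ids: the lowest is the placeholder, the rest map to objects."""
    first = tracker.reserve(len(keys) + 1)
    seq_of = {key: first + 1 + i for i, key in enumerate(keys)}
    return first, seq_of


tracker = RpcTracker()
placeholder, seq_of = start_transaction(tracker, ["a", "b", "c"])
for seq in seq_of.values():
    tracker.finish(seq)               # per-object RPCs complete
ack_before = tracker.ack_id()         # still below the placeholder
tracker.finish(placeholder)           # all cohorts have ACKed
ack_after = tracker.ack_id()          # now covers the whole transaction
```

In this model the ACK# stays below the placeholder until the final step, then jumps to the highest seq# used by the transaction.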
...
- RPC(5): When a DM detects the crash (or slowness) of the client through the lock’s WorkerTimer, it sends a “StartRecovery” request to the recovery coordinator (the server that owns the first entry in the list of keyHashes).
- Request: <clientId, list of <tableId, keyHash, rpcId>>
- Handling: the recovery coordinator initiates the recovery protocol. Possible optimization: use UnackedRpcResults to avoid duplicate recoveries. CAUTION: avoid the deadlock that arises when recovery jobs occupy all threads in a master.
- Response: Empty
- RPC(6): The recovery coordinator sends requestAbort to the masters to clean up and release all locks.
- Request: <clientId, seq#>
- Handling:
- Run checkDuplicate with the given clientId and seq#.
- If a record exists, respond with the saved result.
- If not, respond with “ABORT-VOTE”.
- Response: COMMIT-VOTE | ABORT-VOTE
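The duplicate check above can be sketched as a lookup keyed by <clientId, seq#>. The `Master` class and `rpc_records` dictionary are hypothetical stand-ins. Recording the ABORT-VOTE when no record exists is an assumption that the notes do not state explicitly; it is included so that a delayed PREPARE arriving afterwards would replay the abort rather than vote to commit.

```python
COMMIT_VOTE = "COMMIT-VOTE"
ABORT_VOTE = "ABORT-VOTE"


class Master:
    """Toy data master keeping recorded votes in memory (hypothetical)."""

    def __init__(self):
        self.rpc_records = {}   # (client_id, seq) -> recorded vote

    def request_abort(self, client_id, seq):
        rpc_id = (client_id, seq)
        saved = self.rpc_records.get(rpc_id)
        if saved is not None:
            # The PREPARE already ran here: answer with its saved vote.
            return saved
        # Assumption: persist the abort so a late PREPARE with the same
        # rpc id cannot obtain a COMMIT-VOTE after recovery has started.
        self.rpc_records[rpc_id] = ABORT_VOTE
        return ABORT_VOTE
```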
- After the recovery coordinator collects all votes, it sends the decision to the cohorts that voted for COMMIT.
- Request: <DECISION, clientId, rpcId in RPC(6)>
- Handling:
- Check whether a lock is held for rpcId. (Two methods; needs discussion. The first solution is to save “key” in RpcRecord::response and use the key to look up the lock table. The second solution is to keep a separate table or list of all locks.) (JO: just allow locks to be looked up by rpcId? This is unique. Or, just scan the lock table for the rpcId; this won't happen very often. A: it depends on the implementation of the lock table. If the lock table is a separate table, we can just enumerate it. If the lock information is kept as part of the object hash table, I think it is not feasible to enumerate the whole hash table. Collin is thinking about the lock table implementation.)
- If no lock is held, respond with “ACK”.
- If a lock is held, flush the buffered write (details are the same as in normal operation) and unlock the object.
- Response: ACK (empty)
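Taken together, RPC(6) and the decision step amount to a small coordinator loop. The sketch below assumes the outcome is COMMIT only when every participant answers COMMIT-VOTE. The function name `recover` and the two callbacks standing in for the actual RPCs are hypothetical.

```python
def recover(participants, client_id, request_abort, send_decision):
    """Drive recovery for one transaction and return the decision.

    participants: list of (table_id, key_hash, seq) from StartRecovery.
    request_abort(participant, client_id, seq) -> vote string.
    send_decision(participant, decision, client_id, seq) -> None.
    Both callbacks are placeholders for the real RPCs.
    """
    # Collect a vote for every participant's rpc id.
    votes = {p: request_abort(p, client_id, p[2]) for p in participants}

    # Commit only if every participant had already voted to commit.
    all_commit = all(v == "COMMIT-VOTE" for v in votes.values())
    decision = "COMMIT" if all_commit else "ABORT"

    # Only cohorts that voted COMMIT hold locks and need the decision.
    for p, vote in votes.items():
        if vote == "COMMIT-VOTE":
            send_decision(p, decision, client_id, p[2])
    return decision
```

Cohorts that answered ABORT-VOTE hold no lock for the transaction, which is why they are skipped in the final loop.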
- The recovery coordinator is now finished with the transaction. Leaving the RpcRecord in place keeps the protocol safe if the client comes back before its lease times out.
...