Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. RPC(5): as a DM detects the crash of client (or slowness of client) by WorkerTimer of lock, sends “StartRecovery” request to recovery coordinator (the server with 1st entry in list of keyHash).
    1. Request: <clientId, list of <tableId, keyHash, rpcId>>
    2. Handling: recovery coordinator initiates recovery protocol. Possible optimization: use UnackedRpcResults to avoid duplicate recoveries. CAUTION: avoid deadlock by recovery job occupies all threads in a master.
    3. Response: Empty
  2. RPC(6): Recovery coordinator sends requestAbort to clean up & release all locks in masters.
    1. Request: <clientId, seq#>
    2. Handling:
      1. checkDuplicate with given clientID & seq#
      2. if exists, respond with saved results.
      3. If not, respond “ABORT-VOTE”
    3. Response: COMMIT-VOTE | ABORT-VOTE
  3. After recovery coordinator collects all votes, it sends decision.
    1. Request: <DECISION, clientId, rpcId in RPC(6)>
    2. Handling:
      1. Check a lock is grabbed for rpcId (2 methods. Need discussion: 1st soln is saving “key” in RpcRecord::response and use the key to look up lock table. 2nd soln is keeping a separate table or list of all locks.)
      2. If no lock is grabbed, respond with “ACK”
      3. If a lock was grabbed, flush the buffered write (detail is same as normal operation.) and unlock the object.
    3. Response: ACK (empty)
  4. Recovery coordinator is finished with transaction. Leaving RpcRecord around is safe for client’s resurrection before lease timeout.

 

 

Garbage Collection

 

  1. RPC(5): as a DM detects the expiration of a client lease, it checks whether there is unacknowledged transaction information, and sends “StartCleanup” request to recovery coordinator of each transaction (the server with 1st entry in list of keyHash).
    1. Request: <clientId, list of <tableId, keyHash, rpcId>>
    2. Handling: recovery coordinator initiates cleanup protocol. Possible optimization: use UnackedRpcResults to avoid duplicate cleanups/recoveries.
    3. Response: Empty
  2. RPC(6): Recovery coordinator sends requestAbort to clean up & release all locks in masters.
    1. Request: <clientId, seq#>
    2. Handling:

      ...

          1. checkDuplicate with given clientID & seq#

      ...

          1. if exists, respond with saved results.

      ...

          1. If not, respond “ABORT-VOTE”
        1. Response: COMMIT-VOTE | ABORT-VOTE
      1. Check if COMMITED set has this TX’s record. After recovery coordinator collects all votes, durably log outcome of TX (only if outcome is COMMIT) & add to COMMITED set and send decision & order clean up.
        1. Request: <DECISION, clientId, rpcId in RPC(6)>
        2. Handling:

          ...

              1. Check a lock is grabbed for rpcId

          ...

              1. If a lock was grabbed, flush the buffered write (detail is same as normal operation.) and unlock the object.

          ...

              1. Clean up RpcRecord by manually marking “acked” on UnackedRpcResults. Refactoring UnackedRpcResults is required to support marking “acked” and shrinking its window accordingly. We delete the whole client information as soon as all TX are marked as “acked”.

          ...

              1. Respond ACK.
            1. Response: ACK (empty)
          1. Recovery coordinator deletes the logged result (written in 7) of transaction (appending tombstone for the TX outcome entry). It is now safe to remove the TX’s record from COMMITED set.