Transactions
A transaction is a set of operations that should happen atomically. Exactly which operations a transaction may include is still up for discussion; for now, think of reads, writes, deletes, and possibly index lookups.
Warning: This page uses the term coordinator in the 2PC sense: the role of coordinating the execution of a transaction. This differs from Coordinator, the role of coordinating the RAMCloud cluster.
Client-Side Transactions
Assume that initially object ID 1 contains A, 2 contains B, and 3 contains C.
- Client writes placeholder T with object IDs (1, 2, 3)
- Client adds masks to objects 1, 2, 3 which all point to T
- Client updates T with (write A' at 1 over version V1, write B' at 2 over version V2, write C' at 3 over version V3)
- Client writes A' into 1, B' into 2, C' into 3
- Client deletes T
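The five steps above can be sketched against an in-memory stand-in for a table. This is only an illustration under stated assumptions: `Store`, `Obj`, `add_mask`, `write_back`, and `run_transaction` are hypothetical names, not RAMCloud APIs; the sketch assumes that masking an object rewrites it and so bumps its version, and that V1, V2, V3 are the versions after masking. It does not handle aborts, crashes, or concurrent clients.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Obj:
    value: Any
    version: int = 1
    mask: Optional[int] = None  # object ID of the transaction object T, if masked


class Store:
    """Hypothetical stand-in for a table: object ID -> Obj."""

    def __init__(self) -> None:
        self.objects: Dict[int, Obj] = {}
        self.next_id = 100  # arbitrary starting point for server-assigned IDs

    def create(self, value: Any) -> int:
        oid = self.next_id
        self.next_id += 1
        self.objects[oid] = Obj(value)
        return oid

    def add_mask(self, oid: int, txn_id: int) -> int:
        obj = self.objects[oid]
        if obj.mask is not None:
            raise RuntimeError(f"object {oid} is already masked")
        obj.mask = txn_id
        obj.version += 1  # assumption: masking rewrites the object
        return obj.version

    def write_back(self, oid: int, txn_id: int, value: Any, expected: int) -> None:
        obj = self.objects[oid]
        if obj.mask != txn_id or obj.version != expected:
            raise RuntimeError(f"object {oid} changed underneath the transaction")
        obj.value, obj.mask = value, None  # writing back also removes the mask
        obj.version += 1


def run_transaction(store: Store, writes: Dict[int, Any]) -> None:
    # Step 1: write placeholder T listing the participating object IDs.
    t = store.create({"ids": sorted(writes), "ops": None})
    # Step 2: mask each object so that it points at T.
    versions = {oid: store.add_mask(oid, t) for oid in sorted(writes)}
    # Step 3: update T with the intended writes and the versions they replace.
    store.objects[t].value["ops"] = {
        oid: (writes[oid], versions[oid]) for oid in writes
    }
    # Step 4: write the new values into the masked objects.
    for oid, (value, version) in store.objects[t].value["ops"].items():
        store.write_back(oid, t, value, version)
    # Step 5: delete T.
    del store.objects[t]
```

In this sketch each object is written twice (once to mask it, once to write back), which is where the extra log entries listed under Consequences come from.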
Consequences
- 1:A, 2:B, 3:C appear an extra time in the log (when they are masked).
- A', B', C' appear an extra time in the log (in T).
- Client clocks must be synchronized and/or other clients must always wait some amount of time before aborting a transaction.
- Blind (unconditional) modifications are no longer possible as the objects they operate on may be masked.
- Having an object in the read-set of a transaction bumps its version number, so it invalidates everyone's cache.
- Can only support creates using server-assigned object IDs if it is acceptable to burn object IDs when clients crash.
- It'd be hard to do this safely if we need to consider access control. (We don't with namespaces.)
Optimization: Don't write placeholder T
If the transaction commits:
- Client reserves object ID for T
- Client adds masks to objects 1, 2, 3 which all point to T (T does not yet exist)
- Client creates T with (write A' at 1 over version V1, write B' at 2 over version V2, write C' at 3 over version V3)
- Client writes A' into 1, B' into 2, C' into 3
- Client deletes T
If some other client wants to abort T before step 3, the other client may create a tombstone at T's object ID. This blocks the create in step 3 and the coordinator will be forced to abort.
The client behaviors that follow:
- If some other client finds an object masked by a committed transaction object, it can do the write-back.
- If some other client finds an object masked by a tombstone, it can remove the mask.
- If some other client finds an object masked by a missing transaction object, it can either wait for a transaction object to appear, or it can create a tombstone and then remove the mask.
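The interaction between the tombstone and the coordinator's create, together with the three client behaviors above, can be sketched as follows. The names `TOMBSTONE`, `try_commit`, and `resolve_masked` are hypothetical, and the transaction table is modeled as a plain dictionary keyed by T's object ID; a real implementation would need the create to be an atomic conditional operation on the server.

```python
from typing import Dict, Union

TOMBSTONE = "tombstone"  # hypothetical marker for an aborted transaction

TxnTable = Dict[int, Union[str, dict]]


def try_commit(txn_table: TxnTable, t_id: int, ops: dict) -> bool:
    """Coordinator's step 3: create T at its reserved object ID."""
    if t_id in txn_table:
        # Another client left a tombstone here, so the create is blocked
        # and the coordinator is forced to abort.
        return False
    txn_table[t_id] = ops
    return True


def resolve_masked(txn_table: TxnTable, t_id: int, willing_to_wait: bool) -> str:
    """What another client does on finding an object masked by t_id."""
    entry = txn_table.get(t_id)
    if entry == TOMBSTONE:
        return "remove mask"   # the transaction was aborted
    if entry is not None:
        return "write back"    # committed transaction object: finish its writes
    if willing_to_wait:
        return "wait"          # T may still appear
    txn_table[t_id] = TOMBSTONE  # abort the transaction ourselves
    return "remove mask"
```

For example, once `resolve_masked(table, 7, willing_to_wait=False)` has left a tombstone at ID 7, a later `try_commit(table, 7, ops)` returns `False`.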
The cleaning rules:
- It is safe for anyone to delete a committed transaction object if all participating objects have been unmasked.
- It is safe for the coordinator to delete a tombstone if it discards knowledge of that transaction ID.
- And here is the gotcha: when can anyone delete a tombstone?
- With high probability, it is safe for anyone to delete a tombstone after a large amount of time has elapsed. This amount of time (possibly measured in weeks) would have to be long enough to convince us that the coordinator has either died or observed the tombstone.
- Or, don't clean the tombstones, but periodically delete the table and start using a different one, invalidating all previous transaction IDs.
- Or, invalidate the coordinator's token.
- Or, drop the table fragment instead of the whole table.
I think the main benefit is that there is one less write operation in the common case (this approach doesn't write the placeholder T). I think the main drawback is that cleaning tombstones is somewhat troubling and/or annoying.
John says:
My main question is whether it's worth the complications of tombstone manager to save the extra write, given that the transaction is already doing a lot of writes. Things feel a lot more obvious and safe with the first scheme.
Server-Side Transactions (2PC)
- Client sends MT ("minitransaction") to master
- predicates: list of (table id, oid, version)
- If the transaction commits, the object must have the given version at the time the decision is made. If the object has some other operation applied to it below, it must additionally not be modified until the operation is applied.
- writes: list of (table id, oid, data, indexes)
- If the transaction commits, the data and index entries will be stored in the object.
- creates (with server-assigned keys): list of (table, data, indexes)
- If the transaction commits, the data and index entries will be stored in a new object.
- How will the client get to know which object this is? Either the master will have to delay returning the outcome to the client or the participant will have to allocate the ID before the decision to commit is made.
- deletes: list of (table id, oid)
- If the transaction commits, the object will be deleted.
- reads: list of (table id, oid)
- If the transaction commits, the data and index entries will be returned.
- Doing reads outside the transaction and then sending a transaction consisting of predicates for the versions that were read is the optimistic alternative to this. Do we want to support pessimistic reads inside transactions?
- Master writes transaction object with list of participants, acquiring a txid
- Master sends txid, MT to all participants
- By sending the txid and MT, the master guarantees to send the participants a decision eventually.
- Participants lock objects and log MT
- Participants send vote to master
- If they vote no, they can unlock their objects and forget all about the transaction.
- If they vote yes, they guarantee to keep their objects locked until they learn the decision and to be able to commit if that is the decision (i.e., they persist the intent of the MT).
- If the decision is to commit, the master notes it in the transaction object
- Master relays decision to participants
- Master sends response to client
- If the decision is to commit, the participants commit. Otherwise, they unlock/roll back.
- Participants send commit acknowledgements to master
- When the master receives their commit acknowledgements, it may logically remove them from the participants list.
- Master removes transaction object
- There are no more participants and the clients have no way to refer to the transaction. It is complete.
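The protocol above can be sketched in a single process as follows. This is a minimal illustration, not a proposed implementation: `MT`, `Participant`, and `Master` are hypothetical classes, only predicates and writes are modeled, RPCs are plain method calls, nothing is actually persisted, and each participant is assumed to receive only the portion of the MT that concerns objects it owns.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Key = Tuple[int, int]  # (table id, oid)


@dataclass
class MT:
    predicates: Dict[Key, int] = field(default_factory=dict)  # key -> required version
    writes: Dict[Key, str] = field(default_factory=dict)      # key -> new data


class Participant:
    def __init__(self, data: Dict[Key, Tuple[str, int]]) -> None:
        self.data = dict(data)            # key -> (value, version)
        self.locked: Set[Key] = set()
        self.pending: Dict[int, MT] = {}  # txid -> logged MT

    def prepare(self, txid: int, mt: MT) -> bool:
        """Lock objects, log the MT, and return the vote."""
        keys = set(mt.predicates) | set(mt.writes)
        if keys & self.locked:
            return False  # vote no and forget about the transaction
        for key, version in mt.predicates.items():
            if key not in self.data or self.data[key][1] != version:
                return False
        self.locked |= keys
        self.pending[txid] = mt  # stands in for logging the MT durably
        return True

    def decide(self, txid: int, commit: bool) -> bool:
        """Apply the decision; the return value is the acknowledgement."""
        mt = self.pending.pop(txid, None)
        if mt is None:
            return True  # no record of the transaction: simply agree
        if commit:
            for key, value in mt.writes.items():
                version = self.data[key][1] + 1 if key in self.data else 1
                self.data[key] = (value, version)
        self.locked -= set(mt.predicates) | set(mt.writes)
        return True


class Master:
    def __init__(self) -> None:
        self.txn_objects: Dict[int, dict] = {}
        self.next_txid = 1

    def run(self, parts: Dict[str, Tuple[Participant, MT]]) -> bool:
        # Write the transaction object with the participant list, acquiring a txid.
        txid = self.next_txid
        self.next_txid += 1
        self.txn_objects[txid] = {"participants": set(parts), "commit": False}
        # Send txid and MT to all participants and collect their votes.
        votes = [p.prepare(txid, mt) for p, mt in parts.values()]
        commit = all(votes)
        if commit:
            self.txn_objects[txid]["commit"] = True  # note the decision
        # Relay the decision; drop participants as they acknowledge.
        for name, (p, _) in parts.items():
            if p.decide(txid, commit):
                self.txn_objects[txid]["participants"].discard(name)
        # Remove the transaction object and report the outcome to the client.
        del self.txn_objects[txid]
        return commit
```

Unlike the protocol above, this synchronous sketch only returns the outcome to the client after all acknowledgements have arrived; in the listed steps the master responds to the client as soon as the decision is made.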
Failure Scenarios
- If the client crashes after sending the MT:
The application never learns of the decision. The MT will still commit or abort.
- If the master crashes before writing the transaction object (step 2):
The client's RPC library will retry.
- If the master crashes before removing the transaction object (step 11):
When the master recovers, it must scan its transaction objects. For each transaction object that does not have a decision to commit, the master assumes it aborted. For each transaction that does, the master knows it committed. It picks back up by relaying the decision to participants (step 7). Participants that have no record of the transaction should simply agree to this.
- If a participant crashes before it logs the MT (step 4):
When the participant comes back up, it will not know about the MT.
The master will not receive a vote from the participant in a timely manner. The master can resend the txid and MT to the participant (step 3) or simply abort the transaction (step 7).
- If a participant crashes before committing or rolling back (step 9):
It comes back up and waits. If the master asks for its vote, it should resend the response. If the master sends a decision, it should proceed from there.
- If a participant crashes before sending a commit acknowledgement (step 10):
It comes back up and has no record of the transaction. When the master resends it the decision, it simply agrees.
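The master-recovery case above (a crash before the transaction object is removed) can be sketched as a scan over the surviving transaction objects. The layout of a transaction object (a dictionary with a `participants` set and an optional `commit` flag) and the `relay` callback are assumptions made for this illustration; `relay` stands in for the RPC that delivers a decision and returns whether the participant acknowledged it.

```python
from typing import Callable, Dict, List, Tuple


def recover_master(
    txn_objects: Dict[int, dict],
    relay: Callable[[str, int, bool], bool],
) -> List[Tuple[int, bool]]:
    """Resume every unfinished transaction from step 7 after a restart."""
    outcomes = []
    for txid in sorted(txn_objects):
        txn = txn_objects[txid]
        # No recorded decision to commit means the transaction is assumed aborted.
        commit = txn.get("commit", False)
        for name in sorted(txn["participants"]):
            # Participants with no record of the transaction simply agree.
            if relay(name, txid, commit):
                txn["participants"].discard(name)
        if not txn["participants"]:
            del txn_objects[txid]  # step 11: the transaction is complete
        outcomes.append((txid, commit))
    return outcomes
```

A transaction object whose participants have not all acknowledged stays in `txn_objects`, so a later call can retry the remaining participants.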
Optimization: Client coordinates transaction
Ignacio pointed out that this is how Sinfonia does it, and that it allows for better latency on the critical path. They block on memory node (master) failures so that the coordinator does not have to keep any state. We should explore the trade-offs.