Transaction_satoshi
I-1. Objective)
Providing multi-object (database) transaction with ACID properties:
Atomic
Consistent
Isolated
Durable
Database transaction: http://en.wikipedia.org/wiki/Database_transaction
I-2. CAP Theorem)
We can not simultaneously provide all three of the following:
Consistency
Availability
Partition tolerance
CAP Theorem: http://en.wikipedia.org/wiki/CAP_theorem
On RAMCloud, availability and partition is already relaxed as follows:
Relaxation of availability) No data replication for concurrent service. System is temporally unavailable before node recovery.
Relaxation of partition tolerance) when the network is partitioned, the partition isolated from coordinator can not continue service.
II. Current Proposal)
II-1. Basic Strategy)
Figure II-1a
Each transaction consists of a sequence of operations and executed in some period of time as seen in Figure II-1a.
To achieve ACID, there are two strategies:
Pessimistic lock - allow only one transaction to be executed.
Pros) Simple, No re-execution overhead by conflict
Cons) Small parallelism
Optimistic lock - executes multiple transaction at one time with watching conflicts. If conflict occurs, cancel conflicting transaction and rewind its side-effects.
Pros) Large parallelism
Cons) Re-execution overhead with conflict
In traditional database or transaction, the probability of conflict is considered very low.
The most systems take optimistic lock approach. Some system combines pessimistic lock to improve the efficiency.
We will apply only optimistic approach.
II-2. Memory Renaming)
In order to efficiency of parallel execution, we apply the technique called 'memory renaming' widely used in parallel processing such as out of order super scalar processors. Memory renaming is the technique similar to register renaming: http://en.wikipedia.org/wiki/Register_renaming .
Similar technique is used in transactional memory http://en.wikipedia.org/wiki/Transactional_memory
Figure II-2a
Considers (RAR hazard: Read after Read) is harmless.
True dependency (RAW: Read after Write) is detected and resolved by canceling affected transactions and re-execute them.
All other cases (WAR: Write after Read, WAW: Write after Write) are false dependency and resolved by the renaming (buffering and writing later).
II-3. Programming model)
Figure. II-3a : Conflict of two Transactions
Define a commit API for a transaction:
A commit is a bundle of multiple conditional and non-conditional operations executed with ACID property.
API is similar to multiOp.
Advantage)
Simple:
Blue arrows in Figure II-3a (from Start Tx to Commit) is written as a user code which:
Takes care of transaction retry caused by nacked (aborted) commit.
Saves version numbers and un-committed objects and passes them to the commit API for the condition check
Returns a un-committed written object for a local read. (local RAW resolution)
Flexible to extending to:
Multiple transaction can be reside concurrently on a single client application
A transaction involving multiple clients can be written by exchanging the condition with the clients
Other transaction APIs:
SQL like primitives
Transactional memory like model: Define the life of a transaction with start transaction and commit. Library record versions and temporal data implicitly.
Commit API:
Basic Idea)
Defines a combination of operations 'compare, read, write' for each object
Handles objects which are much larger than several bytes. Data validation is done with version number not with contents.
Data needs to be loaded and used before commit. Then the version number is given to the commit.
Table II-3-b shows the operations.
Implementation
Basically extending the existing multiOp context.
Parameters of request: <requestId, commitObjects>
requestId : <clientId, sequentialNumber> (Guarantee linearizability. See liniarizable+RPC)
commitObjects (To Revise) : a list of <operation, tableId, key, [condition], [blob]> where:
operation is ether : check, overWrite, conditionalWrite
key is a primary key of the object.
[condition] is a version number to be compared for check or conditionalWrite.
[blob] is a write data for overWrite or conditional write
First commitObject is specially handled as 'anchor object'.
To be extended for secondary index.
response contains:
Ack (commit successfully) or Nack (aborted)
List of new version numbers for writes
Note)
No priority is given by the start order of transactions (aka. age of transaction) such as Sinfonia [SOSP2007]
Solution to avoid conflict to a long life transaction.
Split the transaction into multiple mini-transactions
Write a intermediate layer to prioritize transactions considering their life or start time
Issues)
Which operations remain when the transaction is implemented in RAMCloud?
How much can we make its implementation simpler with:
assuming all the data fits in a RPC
assuming all the data fits in the same segment
Table II-3-b: Commit succeeds when all the object operations are successful otherwise it aborts without any side-effects
Operation | Conditional? | Parameter for the Entry | Comment | |
Request | Reply | |||
|---|---|---|---|---|
Compare | Yes | version | None | Check if the object is unmodified |
Conditional write | Yes | version, blob | newVersion | Write if the object is still in the version |
Conditional remove | Yes | version | None | Remove if the object is still in the version |
Create | Yes | blob | newVersion | Create if the object is not exist |
Uc Overwrite | No | blob | newVersion | Unconditionally write when commit succeeds. |
Uc Remove | No | None | None | Unconditionally remove when commit succeeds. |
Note1) We extend MultiOp to make sure our implement strategy and measure performance. However, since current MultiOp lacks linearizability we may create different rpc.
Note2) Usecase of 'Create' :
Read an object before commit
The read returns the object does not exist
Create the object locally on the client
Commit with the condition if the object does not still exist on the master.
II-4. Design Outline)
II-4-1. Commit Control Flow)
Flow and log structure are modified in first trial implementation, see 'IV) First Trial Implementation' instead and skip this section.
Figure II-4-1a
1.5 RTT and two log write commit (Figure II-4-1a )
One master takes care of commit control as a "Transaction-master" while it works as object storage, the rests "Data-masters" work as ordinal data storage.
The receiver of anchor object (the first object) in the multi-write of commit operation is defined as Transaction-master.
Transaction-master receives the followings:
List of keys the transaction referred: both written and read objects
Key-hash of the anchor object which specify transaction-master (the master itself)
Written objects for the master: list of (key, read-version, new version, blob)
Keys of read objects for the master: list of (key, read-version)
Data-master receives
Key-hash of the anchor object which specify transaction-master
Written objects for the master: list of (key, read-version, new version, blob)
Keys of read objects for the master: list of (key, read-version)
In both cases, no read proceeded to the write, special value 'no-read-proceeded' is given as the read-version.
Transaction master (T-Master) coordinates the durable commit operation with writing Non-return-point in its log and backups.
D-Master checks and lock the objects as follows:
Loop while objects exists:
If the object is already locked by other transaction or the version mismatches for a conditional operation:
It sends nack to T-Master
It cleans up all the locks for the transaction
exit
Otherwise lock the object
Send ack of the transaction to T-Master. Ack contains Tid.
T-Master receives an acknowledge (ack) from D-Master
In Phase1
The transaction aborts if T-Master receives any nack.
If T-Master receives ack from all D-Master, it proceeds Phase2.
In Phase2
All D-Masters should eventually send ack. Ack tells phase2 operation is completed on the D-Master so that T-Master prepares for the resource clean-up.
If the T-Master crashes during commit operation, the recovery master which stores the anchor object takes over the T-Master's role.
So Non-return-point should have the same hash as the anchor object's one.T-Master is responsible for the lock clean-up at the commit or abort.
The node where T-Master resides works as a D-Master for the objects assigned to it.
Data structure in log and log cleaning
A object is attributed as follows:
During commit operation)
unlocked (before locking)
locked for committed object or temporal object for transactional update.
After commit operation)
unlocked
removed
The lock modifier contains Tid which consists of <clientId, anchorObjectId>.
We can identify T-Master from anchorObjectId.
If some clients access to a object and it remains locked even after some retry.
The D-master need to request T-Master to clean up lock, ie. complete the transaction with abort or commit.Log structure is in section II-2-6.
T-Master (transaction master) uses three log entries for a durable commit operation
Start-commit: specifies from whom T-Master gets phase1 ack.
NRP: specifies the operation is committed.
Tomb stone for deleting control objects (Start-commit, NRP) for a completed transaction.
Behavior in accessing locked objects is discussed in section II-2-3 and II-2-4.
Suspend returning response before the lock is released assuming:
Defines phase1 timeout :
If T-Master phase1-timeouts before all acks arrive, it aborts the transaction. Then T-Master eventually release all the locks.
Defines phase2 timeout:
If phase1 is successful, all the locks are eventually released in phase2.
If T-Master phase2-timeouts before all the phase2 acks are returned,
T-Master requests all unacked D-Masters for the completion of Phase2 (ie. unlock).
According above i) and ii), all the locks are released before phase1-timeout + phase2-timeout + some retry time
T-Master returns ack (committed) to the client right after successful NRP write. This is not too early because:
Eventually commit phase2 is completed in all D-Masters as described in 3-a-iii).
There might be a case where:
The client that receives ack as a commit completion tries to access the object which supposed to be release the commit. However, it still remains locked.
It is harmless, client can just wait or retry because the lock is released eventually.
Resource Release:
Successful commit: (See Figure II-4-1a )
Resource release point: Client - RC, T-Master - RT, D-Master - RD
T-Master release control objects in log at RT
Insert Tomb stone for the Tid
Normal case: T-Master wait all acks from D-Master
Corner cases: T-Master timeouts before it receives all asks from D-Masters. T-Master interrogates un-acked D-Masters for the completion to insert tomb stone.
Abort:
Any crash without NRP:
Control structures are located in memory. Resources are automatically released
Abort (Nack) :
T-Master: Any Nack from D-Master
Client: Nack from T-Master
Issues:
Infinite chain of acks
To reach RT, acks from the client and all D-Masters are needed.
Can not remove RC and RDs until making sure all the acks are delivered to T-Master.
That requires infinite chain of acks such as ack_of_ack, ack_of_ack_of_ack,....
Solution
RD: It is idempotent. Harmless for re-execution by repeated request from T-Master.
RC: Similar to the solution for linearizable RPC.
Releases the transaction data by receiving ack or nack(abort), moves to committing or aborting state, and sends ack_of_ack to T-Master.
Largest ack_of_ack number is piggy backed on an ack of later commit.
RT: retries until all the acks arrive from all D-Masters and Client.
Interrogate liveness of the client to coordinator
II-4-2. Conflicting commits to a locked object by a preceding commit)
See II-2. Memory Renaming
Real conflict is RAW (True dependency): transactions reading write-locked objects should be aborted and re-executed.
Pseudo conflicts are WAR, WAW (False dependency): writes in the conflicting transaction just corrupts locked object. The write needs to be delayed until the lock is released.
We can detect WAW (or WAR) and re-execute conflicting transaction by:
Preceding transaction reads the object before writes it to make WAW treated as RAW.
RAR: We consider read - read conflict is harmless. The object is not locked for the upcoming read access.
If we want to detect RAR to Object O1, we can provide anther Object O2 and detect the conflict as RAW to O2 as follows:
Definition of symbols:
Suppose a group of transactions Tn want to detect RAR conflict to O1
Suppose Ti, Tj, and Tk belong to Tn want to read O1.
Sequence:
Provides O2 before any Tn starts.
Any of Tn reads O2 before it starts O1 access.
Ti writes O2 before it reads O1
Tj, Tk are aborted by RAW conflict if Ti successfully commits.
Table II-4-2a
Operation in another | Status of locked object | |
read | written | |
|---|---|---|
read | continue (not locked) | abort (True dependency, RAW) |
write | block (wait: WAR) | block (wait: WAW) |
II-4-3. Effect of crashes)
Table II-4-3a
| Comments | Effect to the Transaction by Crash of: | ||
|---|---|---|---|---|
Client | T-Master | D-Master | ||
Before Commit | Data stored in Client code | Abort | No-effect | No-effect |
Commit Phase1 | Ack table stored in T-Master's memory. | Abort | Abort | No-effect |
After Phase2 | Status is durable in T-Master's log | No-effect | No-effect | No-effect |
II-4-4. Log Structure on Masters)
Inserts object modifier in log to lock and unlock the object during commit operation
Chain of modifiers are created by sequence of transaction (commit)
Cleanup: associate lock and unlock with transaction id and cleanup (now investigating log cleaner for the modification).
Some of them are terminated with tomb stone.
Object modifier example in log
Figure II-4-4a: sequence of transactions read the same object
Figure II-4-4b: modifier objects for object write at successful commit:
both old object and newer (transactional temporal) object are locked at the same time. Log objects might be optimized such as "temporal object for write + lock", etc
See section IV for the implementation.
Figure II-4-4a (Consecutive Read Lock by transactions)
Figure II-4-4b (Object Modifiers for conditionalWrite)
III. Corner Cases)
Linearizablity issue: Ether commit request, nack, or ack is undelivered.
(Resolved) Partial data loss in commit request with multi-op.
Solution) Retry original request with timeout for ack(committed)/nack(aborted). See section IV-2-2.
(Alternative method) Extending linearizable operations to multi-op using multiple RPCs