Transaction_satoshi

Transaction_satoshi

I-1. Objective)

Providing multi-object (database) transaction with ACID properties:

  1. Atomic

  2. Consistent

  3. Isolated

  4. Durable

Database transaction: http://en.wikipedia.org/wiki/Database_transaction

I-2. CAP Theorem) 

  We can not simultaneously provide all three of the following:

  1. Consistency

  2. Availability

  3. Partition tolerance

CAP Theorem: http://en.wikipedia.org/wiki/CAP_theorem



On RAMCloud, availability and partition is already relaxed as follows:

  1. Relaxation of availability) No data replication for concurrent service. System is temporally unavailable before node recovery.

  2. Relaxation of partition tolerance) when the network is partitioned, the partition isolated from coordinator can not continue service.

II. Current Proposal) 

II-1. Basic Strategy)


   Figure II-1a

Each transaction consists of a sequence of operations and executed in some period of time as seen in Figure II-1a.

To achieve ACID, there are two strategies:

  1. Pessimistic lock - allow only one transaction to be executed. 

    1. Pros) Simple, No re-execution overhead by conflict

    2. Cons) Small parallelism

  2. Optimistic lock - executes multiple transaction at one time with watching conflicts. If conflict occurs, cancel conflicting transaction and rewind its side-effects.

    1. Pros) Large parallelism

    2. Cons) Re-execution overhead with conflict

In traditional database or transaction, the probability of conflict is considered very low.
The most systems take optimistic lock approach. Some system combines pessimistic lock to improve the efficiency.
We will apply only optimistic approach.

II-2. Memory Renaming)  

  In order to efficiency of parallel execution, we apply the technique called 'memory renaming' widely used in parallel processing such as out of order super scalar processors. Memory renaming is the technique similar to register renaming: http://en.wikipedia.org/wiki/Register_renaming . 

Similar technique is used in transactional memory http://en.wikipedia.org/wiki/Transactional_memory 


     Figure II-2a

  1. Considers (RAR hazard: Read after Read) is harmless.

  2. True dependency (RAW: Read after Write) is detected and resolved by canceling affected transactions and re-execute them.

  3. All other cases  (WAR: Write after Read, WAW: Write after Write) are false dependency and resolved by the renaming (buffering and writing later).

II-3. Programming model)  


     Figure. II-3a : Conflict of two Transactions

  1. Define a commit API for a transaction:

    1. A commit is a bundle of multiple conditional and non-conditional operations executed with ACID property.

    2. API is similar to multiOp.

    3. Advantage)

      1. Simple:

        1. Blue arrows in Figure II-3a (from Start Tx to Commit) is written as a user code which:

          1. Takes care of transaction retry caused by nacked (aborted) commit.

          2. Saves version numbers and un-committed objects and passes them to the commit API for the condition check

          3. Returns a un-committed written object for a local read. (local RAW resolution)

      2. Flexible to extending to:

        1. Multiple transaction can be reside concurrently on a single client application

        2. A transaction involving multiple clients can be written by exchanging the condition with the clients

        3. Other transaction APIs:

          1. SQL like primitives

          2. Transactional memory like model: Define the life of a transaction with start transaction and commit. Library record versions and temporal data implicitly.

  2. Commit API:

    1. Basic Idea)

      1. Defines a combination of operations 'compare, read, write' for each object

      2. Handles objects which are much larger than several bytes. Data validation is done with version number not with contents.

        1. Data needs to be loaded and used before commit. Then the version number is given to the commit.

      3. Table II-3-b shows the operations.

    2. Implementation

      1. Basically extending the existing multiOp context.

      2. Parameters of request: <requestId, commitObjects>

        1. requestId : <clientId, sequentialNumber> (Guarantee linearizability. See liniarizable+RPC)

        2. commitObjects (To Revise) : a list of <operation, tableId, key, [condition], [blob]> where:

          1. operation is ether : check, overWrite, conditionalWrite

          2. key is a primary key of the object.

          3. [condition] is a version number to be compared for check or conditionalWrite.

          4. [blob] is a write data for overWrite or conditional write

      3. First commitObject is specially handled as 'anchor object'.

      4. To be extended for secondary index.

    3. response contains:

    4.  

      1. Ack (commit successfully) or Nack (aborted)

      2. List of new version numbers for writes

  3. Note)

  4.  

    1. No priority is given by the start order of transactions (aka. age of transaction) such as Sinfonia [SOSP2007]

    2. Solution to avoid conflict to a long life transaction.

      1. Split the transaction into multiple mini-transactions

      2. Write a intermediate layer to prioritize transactions considering their life or start time

  5. Issues)

    1. Which operations remain when the transaction is implemented in RAMCloud?

    2. How much can we make its implementation simpler with:

      1. assuming all the data fits in a RPC

      2. assuming all the data fits in the same segment

 

    Table II-3-b: Commit succeeds when all the object operations  are successful otherwise it aborts without any side-effects

Operation

Conditional?

Parameter for the Entry 

Comment

Request

Reply 

Compare

Yes

version

None

Check if the object is unmodified

Conditional write

Yes

version, blob

newVersion

Write if the object is still in the version

Conditional remove

Yes

version

None

Remove if the object is still in the version

Create

Yes

blob

newVersion

Create if the object is not exist
(Can use ConditionalWrite with version=0)
(See Note below for its use case) 

Uc Overwrite

No

blob

newVersion

Unconditionally write when commit succeeds.
(renaming for WAR, WAW hazard)

Uc Remove

No

None

None

Unconditionally remove when commit succeeds.

Note1) We extend MultiOp to make sure our implement strategy and measure performance. However, since current MultiOp lacks linearizability we may create different rpc.

Note2) Usecase of  'Create' :

  1.  Read an object before commit

  2. The read returns the object does not exist

  3. Create the object locally on the client

  4. Commit with the condition if the object does not still exist on the master.

II-4. Design Outline)

II-4-1. Commit Control Flow)

Flow and log structure are modified in first trial implementation, see 'IV) First Trial Implementation'  instead and skip this section.
 


  Figure II-4-1a

  1. 1.5 RTT and two log write commit (Figure II-4-1a )

    1. One master takes care of commit control as a "Transaction-master" while it works as object storage, the rests "Data-masters" work as ordinal data storage.

    2. The receiver of anchor object (the first object) in the multi-write of commit operation is defined as Transaction-master.

      1. Transaction-master receives the followings:

        1. List of keys the transaction referred: both written and read objects

        2. Key-hash of the anchor object which specify transaction-master (the master itself)

        3. Written objects for the master: list of (key, read-version, new version, blob)

        4. Keys of read objects for the master: list of (key, read-version)

      2. Data-master receives

        1. Key-hash of the anchor object which specify transaction-master

        2. Written objects for the master: list of (key, read-version, new version, blob)

        3. Keys of read objects for the master: list of (key, read-version)

      3. In both cases, no read proceeded to the write, special value 'no-read-proceeded' is given as the read-version.

    3. Transaction master (T-Master) coordinates the durable commit operation with writing Non-return-point in its log and backups.

      1. D-Master checks and lock the objects as follows:

        1. Loop while objects exists:

          1. If the object is already locked by other transaction or the version mismatches for a conditional operation:

            1. It sends nack to T-Master

            2. It cleans up all the locks for the transaction

            3. exit

          2. Otherwise lock the object

        2. Send ack of the transaction to T-Master. Ack contains Tid.

      2. T-Master receives an acknowledge (ack) from D-Master

        1. In Phase1

          1. The transaction aborts if T-Master receives any nack.

          2. If T-Master receives ack from all D-Master, it proceeds Phase2.

        2. In Phase2

          1. All D-Masters should eventually send ack. Ack tells phase2 operation is completed on the D-Master so that T-Master prepares for the resource clean-up.

      3. If the T-Master crashes during commit operation, the recovery master which stores the anchor object takes over the T-Master's role. 
        So Non-return-point should have the same hash as the anchor object's one.

      4. T-Master is responsible for the lock clean-up at the commit or abort.

      5. The node where T-Master resides  works as a D-Master for the objects assigned to it.

    4. Data structure in log and log cleaning

      1. A object is attributed as follows:

        1. During commit operation)

          1. unlocked (before locking)

          2. locked for committed object or temporal object for transactional update.

        2. After commit operation)

          1. unlocked

          2. removed

      2. The lock modifier contains Tid which consists of <clientId, anchorObjectId>.

        1. We can identify T-Master from anchorObjectId.

        2. If some clients access to a object and it remains locked even after some retry.
          The D-master need to request T-Master to clean up lock, ie. complete the transaction with abort or commit.

        3. Log structure is in section II-2-6.

  2. T-Master (transaction master) uses three log entries for a durable commit operation

    1. Start-commit: specifies from whom T-Master gets phase1 ack.

    2. NRP: specifies the operation is committed.

    3. Tomb stone for deleting control objects (Start-commit, NRP) for a completed transaction.

  3. Behavior in accessing locked objects is discussed in section II-2-3 and II-2-4.

    1. Suspend returning response before the lock is released assuming:

      1. Defines phase1 timeout :

        1. If T-Master phase1-timeouts before all acks arrive, it aborts the transaction. Then T-Master eventually release all the locks.

      2. Defines phase2 timeout:

        1. If phase1 is successful, all the locks are eventually released in phase2. 

        2. If T-Master phase2-timeouts before all the phase2 acks are returned,
          T-Master requests all unacked D-Masters for the completion of Phase2 (ie. unlock).

      3. According above i) and ii), all the locks are released before phase1-timeout  + phase2-timeout + some retry time

  4. T-Master returns ack (committed) to the client right after successful NRP write. This is not too early because:

    1. Eventually commit phase2 is completed in all D-Masters as described in 3-a-iii).

    2. There might be a case where:

      1. The client that receives ack as a commit completion tries to access the object which supposed to be release the commit. However, it still remains locked.

      2. It is harmless, client can just wait or retry because the lock is released eventually.

  5. Resource Release:

    1. Successful commit: (See Figure II-4-1a ) 

      1. Resource release point: Client - RC, T-Master - RT, D-Master - RD

      2. T-Master release control objects in log at RT

        1. Insert Tomb stone for the Tid

          1. Normal case: T-Master wait all acks from D-Master

          2. Corner cases: T-Master timeouts before it receives all asks from D-Masters. T-Master interrogates un-acked D-Masters for the completion to insert tomb stone.

    2. Abort:

      1. Any crash without NRP:

        1. Control structures are located in memory. Resources are automatically released

      2. Abort (Nack) :

        1. T-Master: Any Nack from D-Master

        2. Client: Nack from T-Master

    3. Issues:

      1.  Infinite chain of acks 

        1. To reach RT, acks from the client and all D-Masters are needed.

        2. Can not remove RC and RDs until making sure all the acks are delivered to T-Master.

        3. That requires infinite chain of acks such as ack_of_ack, ack_of_ack_of_ack,.... 

      2. Solution

        1. RD: It is idempotent. Harmless for re-execution by repeated request from T-Master.

        2. RC: Similar to the solution for linearizable RPC.

          1. Releases the transaction data by receiving ack or nack(abort), moves to committing or aborting state, and sends ack_of_ack to T-Master.

          2. Largest ack_of_ack number is piggy backed on an ack of later commit.

        3. RT: retries until all the acks arrive from all D-Masters and Client.

          1. Interrogate liveness of the client to coordinator

II-4-2. Conflicting commits to a locked object by a preceding commit)

See II-2. Memory Renaming

  1. Real conflict is RAW (True dependency): transactions reading write-locked objects should be aborted and re-executed.

  2. Pseudo conflicts are WAR, WAW (False dependency): writes in the conflicting transaction just corrupts locked object. The write needs to be delayed until the lock is released.

    1. We can detect WAW (or WAR) and re-execute conflicting transaction by:

      1. Preceding transaction reads the object before writes it to make WAW treated as RAW.

  3. RAR: We consider read - read conflict is harmless. The object is not locked for the upcoming read access.

    1. If we want to detect RAR to Object O1, we can provide anther Object O2 and detect the conflict as RAW to O2 as follows:

      1. Definition of symbols:

        1. Suppose a group of transactions Tn want to detect RAR conflict to O1

        2. Suppose Ti, Tj, and Tk  belong to Tn want to read O1.

      2. Sequence:

        1. Provides O2 before any Tn starts.

        2. Any of Tn reads O2 before it starts O1 access.

        3. Ti writes O2 before it reads O1

        4. Tj, Tk are aborted by RAW conflict if Ti successfully commits.

     Table II-4-2a 

Operation in another
(newer) transaction's
commit 

Status of locked object

read

written

read

continue (not locked)

abort (True dependency, RAW)

write

block (wait: WAR)

block (wait: WAW)

II-4-3. Effect of crashes)

    Table II-4-3a

 

Comments

Effect to the Transaction by Crash of:

Client

T-Master

D-Master

Before Commit

Data stored in Client code

Abort

No-effect

No-effect

Commit Phase1

Ack table stored in T-Master's memory.
After the all the commit requests have arrived to both
T-Master and D-Masters, crash of any client
has no effect to commit operation. 

Abort
or No-effect

Abort

No-effect

After Phase2

Status is durable in T-Master's log

No-effect

No-effect

No-effect

 

II-4-4. Log Structure on Masters)

  1. Inserts object modifier in log to lock and unlock the object during commit operation

    1. Chain of modifiers are created by sequence of transaction (commit)

      1. Cleanup: associate lock and unlock with transaction id and cleanup (now investigating log cleaner for the modification).

    2. Some of them are terminated with tomb stone.

  2. Object modifier example in log

    1. Figure II-4-4a: sequence of transactions read the same object

    2. Figure II-4-4b: modifier objects for object write at successful commit:

      1. both old object and newer (transactional temporal) object are locked at the same time. Log objects might  be optimized such as "temporal object for write + lock", etc

    3. See section IV for the implementation.


       Figure II-4-4a (Consecutive Read Lock by transactions)


     Figure II-4-4b (Object Modifiers for conditionalWrite)

III. Corner Cases)

  1. Linearizablity issue: Ether commit request, nack, or ack is undelivered.

    1.  (Resolved) Partial data loss in commit request with multi-op.

      1. Solution) Retry original request with timeout for ack(committed)/nack(aborted). See section IV-2-2.

      2. (Alternative method) Extending linearizable operations to multi-op using multiple RPCs