I-1. Objective)

Providing multi-object (database) transaction with ACID properties:

  1. Atomic
  2. Consistent
  3. Isolated
  4. Durable

I-2. CAP Theorem) 

  We can not simultaneously provide all three of the following:
  1. Consistency
  2. Availability
  3. Partition tolerance

On RAMCloud, availability and partition is already relaxed as follows:
  1. Relaxation of availability) No data replication for concurrent service. System is temporally unavailable before node recovery.
  2. Relaxation of partition tolerance) when the network is partitioned, the partition isolated from coordinator can not continue service.

II. Current Proposal) 

II-1. Basic Strategy)

   Figure II-1a

Each transaction consists of a sequence of operations and executed in some period of time as seen in Figure II-1a.

To achieve ACID, there are two strategies:

  1. Pessimistic lock - allow only one transaction to be executed. 
    1. Pros) Simple, No re-execution overhead by conflict
    2. Cons) Small parallelism
  2. Optimistic lock - executes multiple transaction at one time with watching conflicts. If conflict occurs, cancel conflicting transaction and rewind its side-effects.
    1. Pros) Large parallelism
    2. Cons) Re-execution overhead with conflict

In traditional database or transaction, the probability of conflict is considered very low.
The most systems take optimistic lock approach. Some system combines pessimistic lock to improve the efficiency.
We will apply only optimistic approach.

II-2. Memory Renaming)  

  In order to efficiency of parallel execution, we apply the technique called 'memory renaming' widely used in parallel processing such as out of order super scalar processors. Memory renaming is the technique similar to register renaming: . 

Similar technique is used in transactional memory 

     Figure II-2a

  1. Considers (RAR hazard: Read after Read) is harmless.

  2. True dependency (RAW: Read after Write) is detected and resolved by canceling affected transactions and re-execute them.
  3. All other cases  (WAR: Write after Read, WAW: Write after Write) are false dependency and resolved by the renaming (buffering and writing later).

II-3. Programming model)  

     Figure. II-3a : Conflict of two Transactions

  1. Define a commit API for a transaction:
    1. A commit is a bundle of multiple conditional and non-conditional operations executed with ACID property.
    2. API is similar to multiOp.
    3. Advantage)
      1. Simple:
        1. Blue arrows in Figure II-3a (from Start Tx to Commit) is written as a user code which:
          1. Takes care of transaction retry caused by nacked (aborted) commit.
          2. Saves version numbers and un-committed objects and passes them to the commit API for the condition check
          3. Returns a un-committed written object for a local read. (local RAW resolution)
      2. Flexible to extending to:
        1. Multiple transaction can be reside concurrently on a single client application
        2. A transaction involving multiple clients can be written by exchanging the condition with the clients
        3. Other transaction APIs:
          1. SQL like primitives
          2. Transactional memory like model: Define the life of a transaction with start transaction and commit. Library record versions and temporal data implicitly.
  2. Commit API:
    1. Basic Idea)
      1. Defines a combination of operations 'compare, read, write' for each object
      2. Handles objects which are much larger than several bytes. Data validation is done with version number not with contents.
        1. Data needs to be loaded and used before commit. Then the version number is given to the commit.
      3. Table II-3-b shows the operations.
    2. Implementation
      1. Basically extending the existing multiOp context.
      2. Parameters of request: <requestId, commitObjects>
        1. requestId : <clientId, sequentialNumber> (Guarantee linearizability. See liniarizable+RPC)
        2. commitObjects (To Revise) : a list of <operation, tableId, key, [condition], [blob]> where:
          1. operation is ether : check, overWrite, conditionalWrite
          2. key is a primary key of the object.
          3. [condition] is a version number to be compared for check or conditionalWrite.
          4. [blob] is a write data for overWrite or conditional write
      3. First commitObject is specially handled as 'anchor object'.
      4. To be extended for secondary index.
    3. response contains:
      1. Ack (commit successfully) or Nack (aborted)
      2. List of new version numbers for writes
  3. Note)
    1. No priority is given by the start order of transactions (aka. age of transaction) such as Sinfonia [SOSP2007]
    2. Solution to avoid conflict to a long life transaction.
      1. Split the transaction into multiple mini-transactions
      2. Write a intermediate layer to prioritize transactions considering their life or start time
  4. Issues)
    1. Which operations remain when the transaction is implemented in RAMCloud?
    2. How much can we make its implementation simpler with:
      1. assuming all the data fits in a RPC
      2. assuming all the data fits in the same segment


    Table II-3-b: Commit succeeds when all the object operations  are successful otherwise it aborts without any side-effects
OperationConditional?Parameter for the Entry Comment

Check if the object is unmodified

Conditional writeYesversion, blobnewVersion

Write if the object is still in the version

Conditional removeYesversionNoneRemove if the object is still in the version
CreateYesblobnewVersionCreate if the object is not exist
(Can use ConditionalWrite with version=0)
(See Note below for its use case) 
Uc OverwriteNoblobnewVersion

Unconditionally write when commit succeeds.
(renaming for WAR, WAW hazard)

Uc RemoveNoNoneNoneUnconditionally remove when commit succeeds.

Note1) We extend MultiOp to make sure our implement strategy and measure performance. However, since current MultiOp lacks linearizability we may create different rpc.

Note2) Usecase of  'Create' :

  1.  Read an object before commit
  2. The read returns the object does not exist
  3. Create the object locally on the client
  4. Commit with the condition if the object does not still exist on the master.

II-4. Design Outline)

II-4-1. Commit Control Flow)

Flow and log structure are modified in first trial implementation, see 'IV) First Trial Implementation'  instead and skip this section.

  Figure II-4-1a

  1. 1.5 RTT and two log write commit (Figure II-4-1a )
    1. One master takes care of commit control as a "Transaction-master" while it works as object storage, the rests "Data-masters" work as ordinal data storage.
    2. The receiver of anchor object (the first object) in the multi-write of commit operation is defined as Transaction-master.
      1. Transaction-master receives the followings:
        1. List of keys the transaction referred: both written and read objects
        2. Key-hash of the anchor object which specify transaction-master (the master itself)
        3. Written objects for the master: list of (key, read-version, new version, blob)
        4. Keys of read objects for the master: list of (key, read-version)
      2. Data-master receives
        1. Key-hash of the anchor object which specify transaction-master
        2. Written objects for the master: list of (key, read-version, new version, blob)
        3. Keys of read objects for the master: list of (key, read-version)
      3. In both cases, no read proceeded to the write, special value 'no-read-proceeded' is given as the read-version.
    3. Transaction master (T-Master) coordinates the durable commit operation with writing Non-return-point in its log and backups.
      1. D-Master checks and lock the objects as follows:
        1. Loop while objects exists:
          1. If the object is already locked by other transaction or the version mismatches for a conditional operation:
            1. It sends nack to T-Master
            2. It cleans up all the locks for the transaction
            3. exit
          2. Otherwise lock the object
        2. Send ack of the transaction to T-Master. Ack contains Tid.
      2. T-Master receives an acknowledge (ack) from D-Master
        1. In Phase1
          1. The transaction aborts if T-Master receives any nack.
          2. If T-Master receives ack from all D-Master, it proceeds Phase2.
        2. In Phase2
          1. All D-Masters should eventually send ack. Ack tells phase2 operation is completed on the D-Master so that T-Master prepares for the resource clean-up.
      3. If the T-Master crashes during commit operation, the recovery master which stores the anchor object takes over the T-Master's role. 
        So Non-return-point should have the same hash as the anchor object's one.
      4. T-Master is responsible for the lock clean-up at the commit or abort.
      5. The node where T-Master resides  works as a D-Master for the objects assigned to it.
    4. Data structure in log and log cleaning
      1. A object is attributed as follows:
        1. During commit operation)
          1. unlocked (before locking)
          2. locked for committed object or temporal object for transactional update.
        2. After commit operation)
          1. unlocked
          2. removed
      2. The lock modifier contains Tid which consists of <clientId, anchorObjectId>.
        1. We can identify T-Master from anchorObjectId.
        2. If some clients access to a object and it remains locked even after some retry.
          The D-master need to request T-Master to clean up lock, ie. complete the transaction with abort or commit.
        3. Log structure is in section II-2-6.
  2. T-Master (transaction master) uses three log entries for a durable commit operation
    1. Start-commit: specifies from whom T-Master gets phase1 ack.
    2. NRP: specifies the operation is committed.
    3. Tomb stone for deleting control objects (Start-commit, NRP) for a completed transaction.
  3. Behavior in accessing locked objects is discussed in section II-2-3 and II-2-4.
    1. Suspend returning response before the lock is released assuming:
      1. Defines phase1 timeout :
        1. If T-Master phase1-timeouts before all acks arrive, it aborts the transaction. Then T-Master eventually release all the locks.
      2. Defines phase2 timeout:
        1. If phase1 is successful, all the locks are eventually released in phase2. 
        2. If T-Master phase2-timeouts before all the phase2 acks are returned,
          T-Master requests all unacked D-Masters for the completion of Phase2 (ie. unlock).
      3. According above i) and ii), all the locks are released before phase1-timeout  + phase2-timeout + some retry time
  4. T-Master returns ack (committed) to the client right after successful NRP write. This is not too early because:
    1. Eventually commit phase2 is completed in all D-Masters as described in 3-a-iii).
    2. There might be a case where:
      1. The client that receives ack as a commit completion tries to access the object which supposed to be release the commit. However, it still remains locked.
      2. It is harmless, client can just wait or retry because the lock is released eventually.
  5. Resource Release:
    1. Successful commit: (See Figure II-4-1a ) 
      1. Resource release point: Client - RC, T-Master - RT, D-Master - RD
      2. T-Master release control objects in log at RT
        1. Insert Tomb stone for the Tid
          1. Normal case: T-Master wait all acks from D-Master
          2. Corner cases: T-Master timeouts before it receives all asks from D-Masters. T-Master interrogates un-acked D-Masters for the completion to insert tomb stone.
    2. Abort:
      1. Any crash without NRP:
        1. Control structures are located in memory. Resources are automatically released
      2. Abort (Nack) :
        1. T-Master: Any Nack from D-Master
        2. Client: Nack from T-Master
    3. Issues:
      1.  Infinite chain of acks 
        1. To reach RT, acks from the client and all D-Masters are needed.
        2. Can not remove RC and RDs until making sure all the acks are delivered to T-Master.
        3. That requires infinite chain of acks such as ack_of_ack, ack_of_ack_of_ack,.... 
      2. Solution
        1. RD: It is idempotent. Harmless for re-execution by repeated request from T-Master.
        2. RC: Similar to the solution for linearizable RPC.
          1. Releases the transaction data by receiving ack or nack(abort), moves to committing or aborting state, and sends ack_of_ack to T-Master.
          2. Largest ack_of_ack number is piggy backed on an ack of later commit.
        3. RT: retries until all the acks arrive from all D-Masters and Client.
          1. Interrogate liveness of the client to coordinator

II-4-2. Conflicting commits to a locked object by a preceding commit)

See II-2. Memory Renaming

  1. Real conflict is RAW (True dependency): transactions reading write-locked objects should be aborted and re-executed.
  2. Pseudo conflicts are WAR, WAW (False dependency): writes in the conflicting transaction just corrupts locked object. The write needs to be delayed until the lock is released.
    1. We can detect WAW (or WAR) and re-execute conflicting transaction by:
      1. Preceding transaction reads the object before writes it to make WAW treated as RAW.
  3. RAR: We consider read - read conflict is harmless. The object is not locked for the upcoming read access.
    1. If we want to detect RAR to Object O1, we can provide anther Object O2 and detect the conflict as RAW to O2 as follows:
      1. Definition of symbols:
        1. Suppose a group of transactions Tn want to detect RAR conflict to O1
        2. Suppose Ti, Tj, and Tk  belong to Tn want to read O1.

      2. Sequence:
        1. Provides O2 before any Tn starts.
        2. Any of Tn reads O2 before it starts O1 access.
        3. Ti writes O2 before it reads O1
        4. Tj, Tk are aborted by RAW conflict if Ti successfully commits.
     Table II-4-2a 
Operation in another
(newer) transaction's
Status of locked object
readcontinue (not locked)abort (True dependency, RAW)
writeblock (wait: WAR)block (wait: WAW)

II-4-3. Effect of crashes)

    Table II-4-3a

 CommentsEffect to the Transaction by Crash of:
Before CommitData stored in Client codeAbortNo-effectNo-effect
Commit Phase1Ack table stored in T-Master's memory.
After the all the commit requests have arrived to both
T-Master and D-Masters, crash of any client
has no effect to commit operation. 
or No-effect
After Phase2Status is durable in T-Master's logNo-effectNo-effectNo-effect


II-4-4. Log Structure on Masters)

  1. Inserts object modifier in log to lock and unlock the object during commit operation
    1. Chain of modifiers are created by sequence of transaction (commit)
      1. Cleanup: associate lock and unlock with transaction id and cleanup (now investigating log cleaner for the modification).
    2. Some of them are terminated with tomb stone.
  2. Object modifier example in log
    1. Figure II-4-4a: sequence of transactions read the same object
    2. Figure II-4-4b: modifier objects for object write at successful commit:
      1. both old object and newer (transactional temporal) object are locked at the same time. Log objects might  be optimized such as "temporal object for write + lock", etc
    3. See section IV for the implementation.

       Figure II-4-4a (Consecutive Read Lock by transactions)

     Figure II-4-4b (Object Modifiers for conditionalWrite)

III. Corner Cases)

  1. Linearizablity issue: Ether commit request, nack, or ack is undelivered.
    1.  (Resolved) Partial data loss in commit request with multi-op.
      1. Solution) Retry original request with timeout for ack(committed)/nack(aborted). See section IV-2-2.
      2. (Alternative method) Extending linearizable operations to multi-op using multiple RPCs
        1. Automatic retries for un-delivered RPCs?
    2. (Almost resolved. See section II-4-1 item5) Resource management (GC) and racing
      1. TransactionId
      2. Control objects in T-Master's log
      3. Lock/unlock objects in D-Masters' log conflicting with cleaning and object migration
  2. (Resolved) Corner cases during crash behavior transition
    1. Transition is: no-effect - abort (Volatile) - commit (NRP)
    2. Solution)
      1. Client's state is always volatile. If client crashes, masters clean up resource with inquiring expired clientId lease information on coordinator.
      2. On T-Master, all the information is volatile and lost by its crash before NRP is written.
        1. The recovery master for the T-Master, which receives the anchor object, starts aborting the transaction because D-Master list (See Figure IV-3b) is lost.
        2. If NRP is found on the recovered T-Master, the T-Master continues commit.
      3. D-Master decides its behavior by inquiring the transaction's state to T-Master after recovery.
        1. Transaction id of a locked object is found in the lock.
  3. (Resolved) Arrival delay of the anchor object to T-Master
    1. ack arrives from D-Master before the anchor object arrives to T-Master
      1. T-Master gets the Tid in ack and allocate the transaction's control structure. See section IV-4-2, item 1-c.

IV) First Trial Implementation

We will implement a simple trial RAMCloud with 'muti-object transaction'. We are going to evaluate the performance of our first implementation with micro-benchmark and some sample applications and estimate the improvement by different implementations.

IV-1) Overall design

IV-1-1) Transaction Id 

  1. TransactionId = <ClientId, sequentialNo>
  2. TM location is hashed by ClientId.

 IV-1-2) Data and control flow with RPC.

       Figure IV-1-2a

  IV-1-2-1) Initiation by client

  1. MultiCommit
    1. Sending startTx to TM
      1. Piggy backs TxDone information
        1. The biggest sequence number in the completed (acked or aborted) transaction is notified to transaction master for the client by piggy backing it with a MultiCommit request
    2. Sending Operation list to Data master
      1. Including TxId
  2. DoCommit / DoAbort
    1. Including TxId, primary keys of objects in the transaction

  IV-1-2-2)  Initiation by Transaction Master

  1. DoCommitDM
    1. unlock and commit
  2. DoAbortDM
    1. unlock and abort

   IV-1-2-3) False dependency

False dependency 'WAR' and 'WAR' described in  'Figure II-2a' could be treated differently with blocking (holding) the lost transaction 'Section : II-4-2. Conflicting commits to a locked object by a preceding commit)' to reduce abort and retry overhead.

In out first trial implementation, we treat them as conflict and abort the lost transaction. We are going to make an evaluation with micro-benchmark and some sample applications to estimate the improvement by the complex implementation.

   IV-1-2-4) Transaction Abort

  1.  A client sends DoAbort request to TM if any of MultiCommit returns nack by:
    1. the object is locked by different transaction (different transaction is proceeding phase1 of commit with the object).
    2. condition check (version check) for the object failed
  2.  Transaction Master (TM) sends DoAbortDM to DM to abort transaction and unlock object. The operation is triggered by:
    1. DoAbort request from Client
    2. Transactions without NRP at TM recovery  (Tx start log entry needed in TM log.)
    3. Client lease expiration
    4. Timeout for DoCommit request from Client (Tx start is recorded in memory, because pending transaction without NRP is aborted by TM recovery (See b. above).
  3.  Data Master (DM) aborts transaction in the following cases
    1. DoAbortDM request from TM
    2. No NRP found inTM for the transaction at Data Master recovery. If TM recovery is going on, DM needs to retry after TM recovery.
  4. Client crash: client crash during phase1 may leave object locked. The lock prevent object access from other clients like a master crash. It should by released within a few second to make the system available as follows:
    1. TM is notified client lease expiration from coordinator. Since TM for the transaction is hashed by clientId, coordinator easily decide the TM to notify. (May take a while)
    2. Errorneous client codes: The client code is not error free. Even the client still responds to ping from coordinator, the client library may hang or live-locks before completing commit operation. Solution is:
      1. Watch dog timeout in TM's memory initiates aborting the transaction. (Faster normally this works.)
      2. If TM crashes, TM automatically aborts all transactions without NRP. To tell undergoing transactions to the recovered TM, we provides a 'startTx' log entry (See next section for the new log entry).
        1. If we provide object list in 'startTx', TM can selectively send abort requests to the relevant masters.
        2. Without the list, TM needs to broadcast abort requests to all the masters. (current proposal considering the corner case in next bullet.)
      3. Following corner case is eventually resolved:
        1. Client issues MultiCommit
        2. Some DMs lock objects
        3. Before TM writes 'startTX' to log, the TM crashes
        4. The recovered TM will not start the transaction abort.
        5. However the locks are eventually released:
          1. TM recognizes crash eventually:
            1. If the client is still arrive the linearizable rpc resends MultiCommit request to TM. (Faster) Since no startTx is found in log at recovery, TM treat the request as  a new request. If some DM has already accepted the original request, it does not cause any problem because client retries the request with the same TxId and DM can tell the request is retry for the object .
            2. If the client crashed too, coordinator notices the clientId expiration to the TM. (Slower)
          2. Since TM does not know the DM involved in the transaction, it broadcast Tx abort request to all masters.

IV-1-3) New log entries with contents

  1.  TM (Transaction-managementl Master)
     These entries for the same client identifier is located on the same master, ie. objectFinder is called by clientId.
    1. TxStart : Contents = (TxId=[ClientId, SeqNo] , ackedSequenceNumber)
    2. NRP: (TxID, (list of primary keys))
    3. TxDMDone: (TxId that received all acks from DM)
    4. TxAbort : Transaction is aborted : (TxID, (list of primary keys))  –  not sure????  needed for efficiency or object migration??? If it is needed, it is simple to move 'list of primary keys' to TxStart.
  2.  DM (transaction Data Master)
    1. Lock:  (TxId, Object info= [primary key, version])
    2. Unlock: (TxId, Object info= [primary key, version])

IV-1-4) Log Cleaning

  • TM log
    • Regular cleaning) NRP for the client is cleaned up to minimum (sequence number in TxDone@Client, sequence number in TxDone@DM)
    • Client lease expiration triggered) Cleaning initiated after interrogating coordinator for the activeness of the clients that the TM manages
  • DM log: periodically cleaning following pairs
    • Completed transaction: Lock-Unlock pair with the same TxId
    • Deleted object: (Object, Lock, Unlock, Lock, Unlock, Lock, ...., Tomb stone ) sequence with the same key and version.

IV-1-5) Linearizable RPC

  1. Implementing MultiCommit with MultiOp. Since MultiCommit is a conditional operation, it needs to be implemented on linearizable RPC.

IV-1-6) Recovery

  • Operation as a Dater Master (DM)
    1. Recovers unlocked (normal) Objects
    2. Recovers locked Object according inquiry result to TM identified by the TxId in lock entry.
      1. Proceeds commits if NRP exists for the TxId.
      2. Cannot abort the transaction even if NRP does not exist. Because if the ack has already been sent to client, client may decides to go to phase2 (Do Commit). Final decision for the commit should be done by TM. When DoCommit is arrived to TM, the transaction can be ether committed or aborted, TM decides it with crash information it knows at the decision point.
    3. If the master crashed before ack to MultiCommit request, the client retries the MultiCommit again. The DM checks objects hash in memory, continue the commit operation, and respond the client with Ack or Nack.
  • Operation as a Transaction Master (TM) 
    1. Recovers TxDoneClient, NRP, TxDoneDM onto the recovery master hashed by ClientId in these entries.
    2. Rebuilds in-memory control structure for transactions on recovered TM.
    3. Recovered TM responds requests or acks.
      1. Responds to DoCommit from clients, which might be first request or retry
        1. Receives DoCommitDM acks from DM. The acks may be the result of:
        2. Acks corresponding to DoCommitDM issued by crashed TM may be discarded because the TM has already been crashed. In-memory information about returned acks is also lost.
      2. Requests DoCommitDM to DM if the recovered TM finds NRP.

IV-2) Corner cases

Corner cases should be resolved as follows:

  1. Non reversible operations (below) are recorded in Log synchronously
    1. Written object
    2. Lock
    3. Unlock
    4. Tomb stone
    5. NRP
  2.  Final decision of commit is delegated to TM as mentioned in IV-1-6 (DM  list #2). The decision can satisfy ACID (Atomic Consistent  Isolated Durable) constraint.

We would like to list corner cases and verify the strategy works.

IV-3) Client

  1. Flow
    1. Client program constructs MultiCommit object lists with RAMCloud libraries.
    2. Below are implemented in MultiCommit RPC.
      1. Sends MultiCommit request to DMs and a TM using linearizable RPC. A new transaction id=<clientId, seqNumber, ackedSeqNumber> is inserted before sending MultiCommit request.
      2. Waits response of MultiCommit request
      3. Sends request according the MultiCommit response:
        1. Sends DoAbort request to TM if any of response is nack
        2. Sends DoCommit request to TM if all of the responses are ack.
      4. Waits response of item iii.
      5. the RPC returns Ack (committed) or Nack (aborted)
        1. Nack may be trigger trap??
  2. Data Structure: all the data is in-memory and volatile. 
    1. TxId
    2. MultiCommit object list
    3. MultiCommit return list –  success and fail state is given for each object
      1. When the MultiCommit response is returned. The client library scans the response and decides whether it sends DoCommit or DoAbort to TM.

IV-4) D-Master

Normal master works as D-Master too.

  1. Using linearizable RPC and retry is encapsulated.
  2. Modifies log structure first. Then corresponding in-memory structure.
  3. In memory (Volatile) data structure: See Figure IV-3b
    1. Main hash:
      1. Add <locked, r/w> in hash entry tell the object is locked.
    2. Committing transaction map:
      1. <Tid, Status, numToProcess, numProcessed, (Object List) >
        1. numToProcess : number of object to be processed
          1. Phase1: number of objects to be checked: Since some objects would be for unconditional write, it would be smaller than number of object for the DM.
          2. Phase2: number of objects to be committed.
        2. numProcessed : number of objects completed. If it matches to numToProcess, DM returns ack.
        3. Status ether:
          1. phase1 started
          2. phase1 acked
          3. phase2 started
          4. phase2 acked
        4. Object List is list of: <status, txOperation, cur-obj, new-obj>
          1. status:  status of the object
            1. ready
            2. phase1Passed
            3. phase1Failed —  Phase1 check failed.
            4. phase2Completed –- Phase2 process (unlock) completed
          2. txOperation : duplication of operation command given by MultiCommit including:
            1. transactional operation to the object in Table II-3-b
            2. version to be compared if the operation is conditional
          3. cur-obj : pointer to current (committed) object in hash
          4. new-obj : pointer to overwriting object at successful commit, which is overwritten to cur-obj at phase2.
    3. Log Structure:
      1. See Section 'II-2-4' for basic log entry linkage at commit operation.
      2. See Section 'IV-1-3) New log entries with contents' for introducing log entries in this implementation.
      3. Figure IV-4a depicts in-memory data structure of DM. (Not quite exact.... )

        Figure IV-4a

IV-4) T-Master

Normally collocates with D-Master.

  IV-4-1) Transactional Control Logic

See Figure IV-1-2a for overview.

  1. Volatile data structure in memory: 
    1. Committing transaction map
      1. Tid : transaction Id which consists of <clientId, seqNumber>
      2. Status: transaction status ether:
        1. Started
        2. NRP
        3. Completed : commited transaction whose sequence number is smaller than both of:
          1. ackedSeqNumber in TxStart object given by MultiCommit
          2. sequenceNumber in TxDMDone which means phase2 operation on DMs are all completed.
        4. Aborted : abort request is sent.
      3. Object list
  2. Persistent data is located in log: See  Figure IV-1-2a

V. Comparison against other implementations) 

VI. References) 

  1. Sinfonia [SOSP2007]
  2. Parallel Processor Design
    1. Transactional Memory
    2. Speculative Multithread Processor
      1. NEC Merlot: ISSCC, 2000,
      2. NEC Pinot: Micro38, 2005,

Old materials) 

Until Nov 12, 2014)

IV-1) Transaction Id 

  1. TransactionId is a transaction identifier in T-Master or D-Master for

    1. memory data entries
    2. log entries
  2. TransactionId = <requestId, anchorObjectId>
    1. requestId = <clientId, sequenceNumber>
      1. the parameter in commit operation given by the client like a ordinal linearizable RPC
        1. clientId is given from coordinator
        2. sequenceNumber is 64bit number and monotonically increase without garbage collection
        3. Using the same resource management as the linearizable RPC.
    2. anchorObjectId
      1. A hash which specifies T-Master for the transaction
      2. If the same client issues another transaction including the same anchorObjectId in the live transaction, the issued transaction is immediately aborted because it conflicts any of previous active transactions.

Log Structure: See Section II-2-5: detailed implementation will be considered soon.

  1. Object modifiers are used only for log replay and cleaning which follows the new space/computing effective log structure for cleaning.
  2. Tid=<ClinentID, Seq, AnchorObjectID> include no-ordering information between different Tids generated in different clients. We use shadow-version to identify final state of each log entries.
  3. Object-modifier: We want to re-define each log entry as follows:
    1. Object: consists of primary-hash, tableID,  keys, version, blob, etc
    2. Object modifier: consists of TransactionID, primary-key, modifierVersion (shadowVersion in the figure), state
      State is ether:
      1. tomb stone
      2. lock
      3. unlock
      4. NRP (for transaction state management object) 
    3. All the log entries involved in a transaction must be modified by lock. In log replay, we can recover in-memory control structure for the transactions handled by the recovered TM(Transaction master) or DM (Data Master).

Figure II-4-4a, II-4-4b is modified as followings:

     Figure IV-3a:  Object state management in Log with shadow version in consecutive reads.

Figure IV-3b: Object state management in Log with shadow version in write.

IV-4) T-Master


Normally collocates with D-Master.


  IV-4-1) Transactional Control Logic


  1. Volatile data structure in memory: See Figure IV-4-2a
    1. Committing Transaction Map
      1. Tid : transaction Id which consists of <clientId, anchorObjectId>
        1. Master which stores anchorObject is T-Master
      2. startCommit always points log entry SC (StartCommit; see Figure II-2-2a)
        1. objectList can be used to verify whether some operation is performed to all objects.
      3. status points a control object in the log to tell the current transaction status ether:
        1. StartCommit
        2. NRP
        3. Done (Tomb Stone)
      4. D-Master List points D-Master List.
    2. D-Master List contains all D-Masters' information
      1. T-Master has the D-Master's role. T-Master has a D-Master entry too.
      2. masterId: id of the master
      3. status ether of:
        1. p1acked : received phase1 (can commit?) ack
        2. p1abort : received abort request from this D-Master
        3. p2acked : received phase2 (do commit) ack
    3. Arrival delay of the anchor object to T-Master and ack from D-Master arrivers before. (see section III  item 3)
      1. Can allocate committing transaction map from the ack.
        1. Get Tid from ack
        2. Leave startCommit and status Null
        3. Allocate partial D-Master list which contains only the D-Master which sent the ack.
          1. The rest of the D-Master list will be allocated when the anchor object arrives. 

    1.      Figure IV-4-2a (Out-dated)


On July 10, 2014)

    Table II-3-b   :  Note that the success of all the compare operation can be returned as commit ack
  (’compare & read’ combination has no meaning because we use version number for comparison, instead of value,  acquired by a read preceding to the commit request.)


Command for the EntryBehaviorParameter for the EntryComment

Check if the object is unmodified

YesNoYesConditional writeversion, blobnewVersionWrite if the object is still in the version
NoYesNoReadNoneversion, blobRead if the commit succeeds
NoNoYesOverwriteblobnewVersionWrite without read if the commit succeeds
(renaming for WAR, WAW hazard)
NoYesYesSwapblobnewVersionSwap if the commit succeeds

On July 3, 2014)

II-3. Programming model)  

As seen in Figure II-3-a, transition of a transaction is as follows:

  1. A client starts a transaction and receive a transaction identifier (Tid).
  2. The client performs object accesses for the Tid.
  3. RAMCloud system watches access conflict against other transactions.
  4. The client issues commit operation for the Tid. Only one commit completes successfully and other transaction will be aborted. Operations done by aborted transaction are all cancelled (rewound). 
    1. Read is not treated as a conflict
      1. If we would like to abort transaction by read-read conflict, we can allocate a separate flag and cause write-write conflict to detect the conflict.
    2. All other conflict are detected and resolved by renaming or retrying through transaction abort.
  5. Version number: (Need discussion for the definition)
    1. Proposal 1) Using local version number: See Figure II-2-1b
      1. Create local version number starting from zero for an object written without read. The local version number is  a kind of delta of version number.
      2. Add the delta when the object is written back to the master at commit
      3. Can perform conditional write locally with the local version number.
      4. In case that the object is once created and deleted, the definition says global version number must be bigger than previous version. To follow this, scheme, the local version number delta should be given from local next-version-number. The local next-version-number starts from zero when the transaction start and is incremented by one when it is used.
        1. But it does not match to the global version number rule: Version number is incremented by one with one write access.
    2. Proposal 2) Refer the master version number: see Figure II-2-1c
      1. Pros) 
        1. This exactly follows the current definition of version number increment.
      2. Cons)
        1. Additional rcp needed to fetch version number from a master.
        2. Needs to mark the version number is acquired not by the preceding read. Otherwise, needless RAW conflict is detected which leads the transaction abort.
  6. No support:
    1. Single transaction owned by multiple clients.

Figure. II-3a

II-4. Design Outline)

  1. Commit is a linealizable operation containing a large chunk of object to be conditionally updated. We can apply the idea for linializable operation on RAMCloud (see liniarizable+RPC)
  2. Client allocates a unique transaction identifier locally using its unique client id (see liniarizable+RPC)
  3. No priority is given by the start order of transactions (aka. age of transaction) such as Sinfonia [SOSP2007].
  4. When multiple transactions have object-access-conflicts, the earliest commit has priority and success.
    1. We assume probability of access conflict can be kept small with low latency system.
    2. If some users assume a quite long life transaction and need to avoid its cancelation by another transaction started later, they can implement a sophisticated transaction management system upon our simple transaction model.
  5. All the write data before commit is stored in the client and volatile.
    1. We can update data in masters with multi-writes for performance.
  6. During commit operation, we define a NRP: Non-return-point. Before NRP, a transaction is aborted by any crash. After NRP, the transaction is committed regardless of any crash or recovery.

II-4-1. Design Issues)

  1. Before commit
    1. Assigns of transaction Id (Tid)
    2. Stores temporal data store in client
      1. Records read information (key, version) — Figure II-4-1a
      2. Records write information (key, version, blob)
        1. Even write follows read, both read-version and new-version are recorded. Read-version is used to verify conflict on the object — Figure II-2-1a
        2. For write without reads. We just records 'no-read-proceeded' to detect WAR, WAW, which is pseudo conflict and can be handled without transaction abort.
          1. If read follows the write-without-read, the result only depends on the cluster local write, the sequence is isolated from external events.
          2. We do not need to interrogates read-version if read is not executed before write, which was originally considered in Figure II-2-1b, which is:
            1. Client interrogates the server for the current version to verify its modification at commit.
            2. If the write crates a new object, a version-reserve-object is allocated in the server and log to record the version number of the new object for the conflict detection at commit time.
  2. Commit operation (see section II-2-2 )
    1. Fast commit: 1.5 RTT and two log write 
   Figure II-4-1a

       Figure II-4-1b (Proposal 1)

  Figure II-4-1c  (Proposal 2)

II-4-3. Interaction between transactional and non-transactional operation)

  Figure II-2-7a

  1. Order of operations. See Figure II-2-7a
    1. Logical view) According to the definition of commit, Op-T is atomically executed at some period in commit.
      1. Transactional operation is logically executed in ACID (Atomic, Consistent, Isolated, Durable ) at commit time.
      2. Non-transactional operation is executed at the access time which is categorized to:
        1. Op-N occurs before commit and within commit before the object is locked.
        2. Op-NC occurs after the object is locked.
    2. Practically) 
      1. If Op-T is read, it is physically performed at some point after StartTx and the version read is recorded in transaction.
      2. Written objects are eventual updated within some range in commit operation.
    3. Order of operations:
      1. Suppose Op-NC is the non-transactional operation physically occurs after the modification-check of commit.
      2. Logical order is   first) Op-N,   second) Op-T,  third) Op-NC
  2. Interaction of non-transactional operation to transactional operation
    1. Follow conflict matrix in Table II-2-3a (RAW, WAR, WAW)
    2. Conflict between Op-N and Op-T. See Table II-2-7b
      1. Simplest implementation and natural behavior
        1. Only Op-N-written to Op-T-read (RAW) results abort.
        2. The no-read-proceeded is recorded when no read for the object proceeds writes. It can be used to distinguish RAW from WAW.
      2. By requesting non-transactional operation to a master without interacting client's local transactional information, the above behavior i) is realized.
    3. Conflict between Op-T and Op-NC. See Table II-2-7c
      1. Should be consistent to the behavior before commit.
      2. Minimizing the performance degradation of traditional operations. (See -  III) Implementation )
        1. Commit completed in a short period of time. Probability of conflict between commit and Op-NC is very small.
        2. Add two bits <lock-for-commit, r/w> in hash entry in order to:
          1. Blocking non-transactional-write before the ongoing commit is completed or aborted.
          2. Continue non-transactional-read if the lock is for read commit.
          3. Blocking non-transactional-read if the lock is for write commit
Table II-2-7b
 Op-N (always completes)
Op-Treadcontinueabort (True dependency, RAW)


Table II-2-7c
 Op-T (locked for commit)
Op-NCreadcontinueblock (wait)
writeblock (wait)block (wait)

IV-2) Client

  IV-2-1) Data Structure

   All data is located in memory.
   Figure II-2-1a and Figure II-2-2a describe the overall picture.

   Detail: TBD.

Before July 3, 2014)

  • Simplified:
    • Removes prioritizing transaction with Transaction state order. The transaction is prioritized by commit order. This enables storing data in client local buffer and write data multi-write operation at commit time. We think user can build their own transaction prioritizing code with RAMCloud transaction if considerable performance loss would be concerned.
    • Not involving coordinator with using 2 phase commit with pessimistic lock. A solution for lock clean up is done with timeout.
    • For commit operation, we are investigating multiple options to implement transaction manager (TM). I am going to classify and investigate performance issues, etc:
      • Locate TM in user library
        • Pros) 
          • Smaller ( ? ) network traffic at commit
        • Cons) 
          • Needs Non-Return-Point (NRP) repository and incurs the communication overhead.
          • Application recompilation is needed when the transaction algorithm would be modified.
      • Locate TM in a master (which we hereafter name transaction master)
        • Pros)
          • No communication to NRP repository is needed
          • Object code is independent with transactional algorithm
        • Cons)
          • Data is hopped by Transaction commit master
    • There are many similarities between transaction and secondary key. We are going to share the idea.

 Previous version)

Still the proposal is under revise for below: (Updated on Jan. 29, 2014)

  1. traditional transaction system
  2. requirement for transaction
    1. priority : –   Earlier commit wins similar to Sinfonia.
      1. should  prioritize for older transaction?  or early commit would overtake older transaction?
    2. should minimize re-execution? – According to (a) no optimization to reduce re-execution would be implemented so far.
      1. lock or rewind to final consistent point?
    3. should complex or simple? –  Take simple one. RAMCloud philosophy is to reduce latency with simple implementation, which I think reduces possibility of conflict of transaction with well organized code.
  3. find missing racing conditions and solutions
  4. implementation proposal: will be re-written.
    1. recovery sequence — done
    2. control sequence — writing
    3. data structure – writing
  5. optimization (TBD)
  6. assumption of transactions and examples - done
  7. related work
    1. 3 phase commit by Stonebraker, Quorum commit, Consensus algorithm 
    2. Sinfornia, H-Store, SDD-1 ----- writing
    3. sharding, micro-sharding  ---- writing
    4. ACID, CAP theorem –- writing
  8. benchmarks
    1. Introduction of TPC: