Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 164 Next »

I-1. Objective)

Providing multi-object (database) transaction with ACID properties:

  1. Atomic
  2. Consistent
  3. Isolated
  4. Durable

I-2. CAP Theorem) 

  We can not simultaneously provide all three of the following:
  1. Consistency
  2. Availability
  3. Partition tolerance

On RAMCloud, availability and partition is already relaxed as follows:
  1. Relaxation of availability) No data replication for concurrent service. System is temporally unavailable before node recovery.
  2. Relaxation of partition tolerance) when the network is partitioned, the partition isolated from coordinator can not continue service.

II. Current Proposal) 

II-1. Basic Strategy)


   Figure II-1a

Each transaction consists of a sequence of operations and executed in some period of time as seen in Figure II-1a.

To achieve ACID, there are two strategies:

  1. Pessimistic lock - allow only one transaction to be executed. 
    1. Pros) Simple, No re-execution overhead by conflict
    2. Cons) Small parallelism
  2. Optimistic lock - executes multiple transaction at one time with watching conflicts. If conflict occurs, cancel conflicting transaction and rewind its side-effects.
    1. Pros) Large parallelism
    2. Cons) Re-execution overhead with conflict

In traditional database or transaction, the probability of conflict is considered very low.
The most systems take optimistic lock approach. Some system combines pessimistic lock to improve the efficiency.
We will apply only optimistic approach.

II-2. Memory Renaming)  

 In order to efficiency of parallel execution, we apply the technique called 'memory renaming' widely used in parallel processing such as out of order super scalar processors. Memory renaming is the technique similar to register renaming: http://en.wikipedia.org/wiki/Register_renaming .
Similar technique is used in transactional memory http://en.wikipedia.org/wiki/Transactional_memory 


     Figure II-2a

  1. Considers (RAR: Read after Read) is harmless.

  2. True dependency (RAW: Read after Write) is detected and resolved by canceling affected transactions and re-execute them.
  3. All other cases  (WAR: Write after Read, WAW: Write after Write) are false dependency and resolved by the renaming (buffering and writing later).

II-3. Programming model)  

As seen in Figure II-3-a, transition of a transaction is as follows:

  1. A client starts a transaction and receive a transaction identifier (Tid).
  2. The client performs object accesses for the Tid.
  3. RAMCloud system watches access conflict against other transactions.
  4. The client issues commit operation for the Tid. Only one commit completes successfully and other transaction will be aborted. Operations done by aborted transaction are all cancelled (rewound). 
    1. Read is not treated as a conflict
      1. If we would like to abort transaction by read-read conflict, we can allocate a separate flag and cause write-write conflict to detect the conflict.
    2. All other conflict are detected and resolved by renaming or retrying through transaction abort.
  5. Version number: (Need discussion for the definition)
    1. Proposal 1) Using local version number: See Figure II-2-1b
      1. Create local version number starting from zero for an object written without read. The local version number is  a kind of delta of version number.
      2. Add the delta when the object is written back to the master at commit
      3. Can perform conditional write locally with the local version number.
      4. In case that the object is once created and deleted, the definition says global version number must be bigger than previous version. To follow this, scheme, the local version number delta should be given from local next-version-number. The local next-version-number starts from zero when the transaction start and is incremented by one when it is used.
        1. But it does not match to the global version number rule: Version number is incremented by one with one write access.
    2. Proposal 2) Refer the master version number: see Figure II-2-1c
      1. Pros) 
        1. This exactly follows the current definition of version number increment.
      2. Cons)
        1. Additional rcp needed to fetch version number from a master.
        2. Needs to mark the version number is acquired not by the preceding read. Otherwise, needless RAW conflict is detected which leads the transaction abort.
  6. No support:
    1. Single transaction owned by multiple clients.


Figure. II-3a

II-2. Design Outline)

  1. Commit is a linealizable operation containing a large chunk of object to be conditionally updated. We can apply the idea for linializable operation on RAMCloud (see liniarizable+RPC)
  2. Client allocates a unique transaction identifier locally using its unique client id (see liniarizable+RPC)
  3. No priority is given by the start order of transactions (aka. age of transaction) such as Sinfonia [SOSP2007].
  4. When multiple transactions have object-access-conflicts, the earliest commit has priority and success.
    1. We assume probability of access conflict can be kept small with low latency system.
    2. If some users assume a quite long life transaction and need to avoid its cancelation by another transaction started later, they can implement a sophisticated transaction management system upon our simple transaction model.
  5. All the write data before commit is stored in the client and volatile.
    1. We can update data in masters with multi-writes for performance.
  6. During commit operation, we define a NRP: Non-return-point. Before NRP, a transaction is aborted by any crash. After NRP, the transaction is committed regardless of any crash or recovery.

II-2-1. Design Issues)

  1. Before commit
    1. Assigns of transaction Id (Tid)
    2. Stores temporal data store in client
      1. Records read information (key, version) — Figure II-2-1a
      2. Records write information (key, version, blob)
        1. Even write follows read, both read-version and new-version are recorded. Read-version is used to verify conflict on the object — Figure II-2-1a
        2. For write without reads. We just records 'no-read-proceeded' to detect WAR, WAW, which is pseudo conflict and can be handled without transaction abort.
          1. If read follows the write-without-read, the result only depends on the cluster local write, the sequence is isolated from external events.
          2. We do not need to interrogates read-version if read is not executed before write, which was originally considered in Figure II-2-1b, which is:
            1. Client interrogates the server for the current version to verify its modification at commit.
            2. If the write crates a new object, a version-reserve-object is allocated in the server and log to record the version number of the new object for the conflict detection at commit time.
  2. Commit operation (see section II-2-2 )
    1. Fast commit: 1.5 RTT and two log write 
   Figure II-2-1a

       Figure II-2-1b (Proposal 1)

  Figure II-2-1c  (Proposal 2)

II-2-2. Commit Operation)
  1. 1.5 RTT and two log write commit (Figure II-2-2a )
    1. One master takes care of commit control as a "Transaction-master" while it works as object storage, the rests "Data-masters" work as ordinal data storage.
    2. The receiver of anchor object (the first object) in the multi-write of commit operation is defined as Transaction-master.
      1. Transaction-master receives the followings:
        1. List of keys the transaction referred: both written and read objects
        2. Key-hash of the anchor object which specify transaction-master (the master itself)
        3. Written objects for the master: list of (key, read-version, new version, blob)
        4. Keys of read objects for the master: list of (key, read-version)
      2. Data-master receives
        1. Key-hash of the anchor object which specify transaction-master
        2. Written objects for the master: list of (key, read-version, new version, blob)
        3. Keys of read objects for the master: list of (key, read-version)
      3. In both of cases, no read proceeded to the write, special value 'no-read-proceeded' is given as the read-version.
    3. Transaction master (T-Master) coordinates the durable commit operation with writing Non-return-point in its log and backups.
      1. T-Master receives an acknowledge (ack) from D-Master
        1. ack contains Tid
        2. In Phase1, Any D-Master may request abort with 'abort (nack) instead.
        3. In Phase2, All D-Masters should eventually send ack. Ack tells phase2 operation is completed on the D-Master so that T-Master prepares for the resource clean-up.
      2. If the T-Master crashes during commit operation, the recovery master which stores the anchor object takes over the T-Master's role. 
        So Non-return-point should have the same hash as the anchor object.
      3. Lock clean-up for commit or abort is done by T-Master.
      4. T-Master works as a D-Master for the objects assigned to it.
    4. Data structure in log and log cleaning
      1. A object is attributed as follows:
        1. During commit operation)
          1. unlocked (before locking)
          2. locked for committed object or temporal object for transactional update.
        2. After commit operation)
          1. unlocked
          2. removed
      2. The lock modifier contains Tid which consists of <clientId, anchorObjectId>.
        1. We can identify T-Master from anchorObjectId.
        2. If some clients access to a object and it remains locked even after some retry.
          The D-master need to request T-Master to clean up lock, ie. complete the transaction with abort or commit.
        3. Log structure is in section II-2-6.
  2. T-Master (transaction master) uses three log entries for a durable commit operation
    1. Start-commit: specifies from whom T-Master gets phase1 ack.
    2. NRP: specifies the operation is committed.
    3. Tomb stone for deleting control objects (Start-commit, NRP) for a completed transaction.
  3. Behavior in accessing locked objects is discussed in section II-2-3 and II-2-4.
    1. Suspend returning response before the lock is released assuming:
      1. Defines phase1 timeout :
        1. If T-Master phase1-timeouts before all acks arrive, it aborts the transaction. Then T-Master eventually release all the locks.
      2. Defines phase2 timeout:
        1. If phase1 is successful, all the locks are eventually released in phase2. 
        2. If T-Master phase2-timeouts before all the phase2 acks are returned,
          T-Master requests all unacked D-Masters for the completion of Phase2 (ie. unlock).
      3. According above i) and ii), all the locks are released before phase1-timeout  + phase2-timeout + some retry time
  4. T-Master returns ack (committed) to the client right after successful NRP write. This is not too early because:
    1. Eventually commit phase2 is completed in all D-Masters as described in 3-a-iii).
    2. There might be a case where:
      1. The client that receives ack as a commit completion tries to access the object which supposed to be release the commit. However, it still remains locked.
      2. It is harmless, client can just wait or retry because the lock is released eventually.
  5. Resource Release:
    1. Successful commit: (See Fiture 2-2-2a ) 
      1. Resource release point: Client - RC, T-Master - RT, D-Master - RD
      2. T-Master release control objects in log at RT
        1. Insert Tomb stone for the Tid
          1. Normal case: T-Master wait all acks from D-Master
          2. Corner cases: T-Master timeouts before it receives all asks from D-Masters. T-Master interrogates un-acked D-Masters for the completion to insert tomb stone.
    2. Abort:
      1. Any crash without NRP:
        1. Control structures are located in memory. Resources are automatically released
      2. Abort (Nack) :
        1. T-Master: Any Nack from D-Master
        2. Client: Nack from T-Master


  Figure II-2-2a

II-2-3. Conflicting another transactions to locked object at Commit)

hazard resolution technique known as memory renaming in multiprocessing.

  1. Real conflict is RAW: transactions reading write-locked objects should be aborted and re-executed.
  2. Pseudo conflicts are WAR, WAR: writes in the conflicting transaction just corrupts locked object. The write needs to be delayed until the lock is released.
  3. RAR: We consider read - read conflict is harmless. The object is not locked for the upcoming read access.
    1. If we want to detect RAR to Object O1, we can provide anther Object O2 and detect the conflict as RAW to O2 as follows:
      1. Definition of symbols:
        1. Suppose a group of transactions Tn want to detect RAR conflict to O1
        2. Suppose Ti, Tj, and Tk  belong to Tn want to read O1.

      2. Sequence:
        1. Provides O2 before any Tn starts.
        2. Any of Tn reads O2 before it starts O1 access.
        3. Ti writes O2 before it reads O1
        4. Tj, Tk are aborted by RAW conflict if Ti successfully commits.
Table II-2-3a 
Operation in another
(newer) transaction's
commit 
Status of locked object
readwritten
readcontinue (not locked)abort (True dependency, RAW)
writeblock (wait: WAR)block (wait: WAW)

II-2-4. Interaction between transactional and non-transactional operation)


  Figure II-2-4a

(See: Table II-2-d).

  1. Order of operations. See Figure II-2-4a
    1. Logical view) According to the definition of commit, Op-T is atomically executed at some period in commit.
      1. Transactional operation is logically executed in ACID (Atomic, Consistent, Isolated, Durable ) at commit time.
      2. Non-transactional operation is executed at the access time which is categorized to:
        1. Op-N occurs before commit and within commit before the object is locked.
        2. Op-NC occurs after the object is locked.
    2. Practically) 
      1. If Op-T is read, it is physically performed at some point after StartTx and the version read is recorded in transaction.
      2. Written objects are eventual updated within some range in commit operation.
    3. Order of operations:
      1. Suppose Op-NC is the non-transactional operation physically occurs after the modification-check of commit.
      2. Logical order is   first) Op-N,   second) Op-T,  third) Op-NC
  2. Interaction of non-transactional operation to transactional operation
    1. Follow conflict matrix in Table II-3a
    2. Conflict between Op-N and Op-T. See Table II-2-4b
      1. Simplest implementation and natural behavior
        1. Only Op-N-written to Op-T-read (RAW) results abort.
        2. The no-read-proceeded is recorded when no read for the object proceeds writes. It can be used to distinguish RAW from WAW.
      2. By requesting non-transactional operation to a master without interacting client's local transactional information, the above behavior i) is realized.
    3. Conflict between Op-T and Op-NC. See Table II-2-4c
      1. Should be consistent to the behavior before commit.
      2. Minimizing the performance degradation of traditional operations. (See -  III) Implementation )
        1. Commit completed in a short period of time. Probability of conflict between commit and Op-NC is very small.
        2. Add two bits <lock-for-commit, r/w> in hash entry in order to:
          1. Blocking non-transactional-write before the ongoing commit is completed or aborted.
          2. Continue non-transactional-read if the lock is for read commit.
          3. Blocking non-transactional-read if the lock is for write commit
Table II-2-4b
 Op-N (always completes)
readwritten
Op-Treadcontinueabort (True dependency, RAW)
writecontinuecontinue

 

Table II-2-4c
 Op-T (locked for commit)
readwritten
Op-NCreadcontinueblock (wait)
writeblock (wait)block (wait)

 

II-2-5. Effect of Crashes)

    Table II-2-5a

 CommentsEffect to the Transaction by Crash of:
ClientT-MasterD-Master
Before CommitData stored in ClientAbortNo-effectNo-effect
Commit Phase1Ack table stored in T-Master's memory.
After the all the commit requests have arrived to both
T-Master and D-Masters, crash of any client
has no effect to commit operation. 
Abort
or No-effect
AbortNo-effect
After Phase2Status is durable in T-Master's logNo-effectNo-effectNo-effect

 

II-2-6. Log Structure)

  1. Inserts object modifier in log to lock and unlock the object during commit operation
    1. Chain of modifiers are created by sequence of transaction (commit)
      1. Cleanup: associate lock and unlock with transaction id and cleanup (now investigating log cleaner for the modification).
    2. Some of them are terminated with tomb stone.
  2. Object modifier example in log
    1. Figure II-2-5a: sequence of transactions read the same object
    2. Figure II-2-5b: modifier objects for object write at successful commit:
      1. both old object and newer (transactional temporal) object are locked at the same time. Log objects might  be optimized such as "temporal object for write + lock", etc


       Figure II-2-6a


     Figure II-2-6b

III. Corner Cases)

  1. Linializability issue: Ether commit request, nack, or ack is undelivered.
    1. Partial data loss in commit request with multi-op
      1. Extending linearizable operations to multi-op using multiple RPCs
        1. Automatic retries for un-delivered RPCs?
      2. Can use the same strategy for resources (clientId, requestId) management with lease.
    2. Resource management (GC) and racing
      1. Transaction Id (client id)
      2. Control objects in T-Master's log
      3. Lock/unlock objects in D-Masters' log conflicting with cleaning and object migration
  2. Corner cases during crash behavior transition
    1. no-effect - abort (Volatile) - commit (NRP)
  3. Arrival delay of the anchor object to T-Master (Resolved)
    1. ack arrives from D-Master before the anchor object arrives to T-Master
      1. T-Master gets the Tid in ack and allocate the transaction's control structure. See section IV-4-2, item 1-c.

IV) Implementation

IV-1) Transaction Id 

    TransactionId = <clientId, sequence_number, anchorObjectId>

  1. <clientId, sequence_number> is used for implementing linearizability.
    1. ClientId is given from coordinator
    2. 64bit sequence number avoid garbage collection of used number
    3. Using the same resource management scheme as linearizability.
  2. anchorObjectId
    1. A hash which specifies T-Master for the transaction
    2. If the same client issues another transaction including the same anchorObjectId in the live transaction, the issued transaction is immediately aborted because it conflicts any of previous active transactions.

IV-2) Client

  IV-2-1) Data Structure

   TBD:
   All data is located in memory.
   Figure II-2-1a and Figure II-2-2a describe the overall picture.

  IV-2-2) Linearizability on multi-op

   TBD.

IV-3) D-Master

  1. Modify log structure first,  then corresponding in memory structure.
  2. In memory (Volatile) data structure: See Figure IV-3b
    1. Main hash:
      1. Add <locked, r/w> in hash table entry to lock the entry at commit and to realize the behavior described in section II-2-3.
    2. Committing transaction map:
      1. <Tid, Status, Object List>
        1. Status is ether
          1. Commit Stated
          2. Phase1 acking
          3. Phase2 requested
          4. Phase2 acking
          5. Completed
        2. Object List is list of: <locked, r/w, rd-ver, cur-obj, new-obj>
          1. <locked, r/w> is the duplicated in ordinal hash : Might be un-needed.
          2. rd-ver is version number when the object is first read in the transaction.
          3. cur-obj is <hash-value-to-ordinal-hash, slot-number>
          4. new-obj is only allocated for object write and consists of <segment, offset>
        3. To consider the pointer structure to optimize log cleaning for object modifiers
  3. Log Structure: See Section II-2-5: detailed implementation will be considered soon.
    1. Object modifiers are used only for log replay and cleaning which follows the new space/computing effective log structure for cleaning.
  4. Ack to D-Master.see Section II-2-2 item 1-c


   Figure IV-3b

IV-4) T-Master

  IV-4-1) Transactional Data Store

          Instanciates D-Master

  IV-4-2) Transactional Control Logic

  1. Volatile data structure in memory: See Figure IV-42a
    1. Committing Transaction Map
      1. Tid : transaction Id which consists of <clientId, anchorObjectId>
        1. Master which stores anchorObject is T-Master
      2. startCommit always points log entry SC (StartCommit; see Figure II-2-2a)
        1. objectList can be used to verify whether some operation is performed to all objects.
      3. status points a control object in the log to tell the current transaction status ether:
        1. StartCommit
        2. NRP
        3. Done (Tomb Stone)
      4. D-Master List points D-Master List.
    2. D-Master List contains all D-Masters' information
      1. T-Master has the D-Master's role. T-Master has a D-Master entry too.
      2. masterId: id of the master
      3. status ether of:
        1. p1acked : received phase1 (can commit?) ack
        2. p1abort : received abort request from this D-Master
        3. p2acked : received phase2 (do commit) ack
    3. Arrival delay of the anchor object to T-Master and ack from D-Master arrivers before. (see section III  item 3)
      1. Can allocate committing transaction map from the ack.
        1. Get Tid from ack
        2. Leave startCommit and status Null
        3. Allocate partial D-Master list which contains only the D-Master which sent the ack.
          1. The rest of the D-Master list will be allocated when the anchor object arrives. 
  2. Persistent data is located in log: See Figure II-2-2a 

     Figure IV-4-2a

V. Comparison against other implementations) 

VI. References) 

  1. Sinfonia [SOSP2007]
  2. Parallel Processor Design
    1. Transactional Memory http://en.wikipedia.org/wiki/Transactional_memory
    2. Speculative Multithread Processor
      1. NEC Merlot: ISSCC, 2000, http://bit.ly/1ly07Pa
      2. NEC Pinot: Micro38, 2005,  http://dl.acm.org/citation.cfm?id=1100541

Old materials) 

  • Simplified:
    • Removes prioritizing transaction with Transaction state order. The transaction is prioritized by commit order. This enables storing data in client local buffer and write data multi-write operation at commit time. We think user can build their own transaction prioritizing code with RAMCloud transaction if considerable performance loss would be concerned.
    • Not involving coordinator with using 2 phase commit with pessimistic lock. A solution for lock clean up is done with timeout.
    • For commit operation, we are investigating multiple options to implement transaction manager (TM). I am going to classify and investigate performance issues, etc:
      • Locate TM in user library
        • Pros) 
          • Smaller ( ? ) network traffic at commit
        • Cons) 
          • Needs Non-Return-Point (NRP) repository and incurs the communication overhead.
          • Application recompilation is needed when the transaction algorithm would be modified.
      • Locate TM in a master (which we hereafter name transaction master)
        • Pros)
          • No communication to NRP repository is needed
          • Object code is independent with transactional algorithm
        • Cons)
          • Data is hopped by Transaction commit master
    • There are many similarities between transaction and secondary key. We are going to share the idea.

 Previous version)

Still the proposal is under revise for below: (Updated on Jan. 29, 2014)

  1. traditional transaction system
    1. http://searchdatamanagement.techtarget.com/feature/Understanding-and-comparing-six-types-of-processing-systems
  2. requirement for transaction
    1. priority : –   Earlier commit wins similar to Sinfonia.
      1. should  prioritize for older transaction?  or early commit would overtake older transaction?
    2. should minimize re-execution? – According to (a) no optimization to reduce re-execution would be implemented so far.
      1. lock or rewind to final consistent point?
    3. should complex or simple? –  Take simple one. RAMCloud philosophy is to reduce latency with simple implementation, which I think reduces possibility of conflict of transaction with well organized code.
  3. find missing racing conditions and solutions
  4. implementation proposal: will be re-written.
    1. recovery sequence — done
    2. control sequence — writing
    3. data structure – writing
  5. optimization (TBD)
  6. assumption of transactions and examples - done
  7. related work
    1. 3 phase commit by Stonebraker, Quorum commit, Consensus algorithm 
    2. Sinfornia, H-Store, SDD-1 ----- writing
    3. sharding, micro-sharding  ---- writing
    4. ACID, CAP theorem –- writing
  8. benchmarks
    1. Introduction of TPC: http://searchdatamanagement.techtarget.com/feature/Key-benchmarks-for-measuring-transaction-processing-performance
  • No labels