
Index key fields for a table

 

 

Index keys for objects

Now RAMCloud knows which fields to build keys on. But how do we figure out these keys for each object? RAMCloud doesn't understand the object blob; only the application does.

Solution: Have the application provide the index keys explicitly, as an array passed separately from the object value (key-value).

Note that the first entry in the array must correspond to the primary key.

What would the actual stored object look like?
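One plausible layout (a sketch only; all field names and widths below are assumptions, not a finalized format): a key count, per-key lengths, the keys themselves with the primary key first, then the opaque value.

// Sketch of one plausible stored layout for a multi-key object.
// Field names and widths are assumptions, not a finalized format.
#include <cstdint>
#include <string>
#include <vector>

// Serialized form: | keyCount | keyLength[0..keyCount-1] | keys | value |
// keys[0] must be the primary key; RAMCloud never interprets the value.
std::string serializeMultiKeyObject(const std::vector<std::string>& keys,
                                    const std::string& value) {
    std::string blob;
    uint8_t keyCount = static_cast<uint8_t>(keys.size());
    blob.append(reinterpret_cast<const char*>(&keyCount), sizeof(keyCount));
    for (const std::string& k : keys) {
        uint16_t len = static_cast<uint16_t>(k.size());
        blob.append(reinterpret_cast<const char*>(&len), sizeof(len));
    }
    for (const std::string& k : keys)
        blob.append(k);              // keys[0] is the primary key
    blob.append(value);              // opaque application blob
    return blob;
}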

 

 

 

Index location

If index and table can both fit on one server, great!

But what if they don't?

Options:

(a) Index on one or more index servers (these index servers could hold data too, but the point is that they don't have to), partitioned based on index key: (tick)

(b) Index distributed such that it is co-located with the corresponding data.

Lookup analysis for both cases: assume the data and index are together distributed across n servers, and a read arrives that uses a secondary index. Which option is better?

For n = 1: (a) and (b) perform the same.

For n > 1:
(a) does 1 lookup then 1 multiRead to at most n-1 servers (only to the servers that actually hold matching data); (b) does a multiLookup to all n servers and then reads the data locally on some or all of them.
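To make option (a)'s read path concrete, here's a client-side sketch; lookupIndex and multiRead are illustrative names, not a committed API:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical client stubs; names and signatures are assumptions.
std::vector<uint64_t> lookupIndex(uint64_t tableId, uint8_t indexId,
                                  const std::string& firstKey,
                                  const std::string& lastKey);
std::vector<std::string> multiRead(uint64_t tableId,
                                   const std::vector<uint64_t>& keyHashes);

// Option (a): one lookup RPC to the index server returns the matching
// primary key hashes; one multiRead then fans out only to the (at most
// n-1) data servers that actually hold matching objects.
std::vector<std::string> indexedRead(uint64_t tableId, uint8_t indexId,
                                     const std::string& first,
                                     const std::string& last) {
    std::vector<uint64_t> hashes = lookupIndex(tableId, indexId, first, last);
    return multiRead(tableId, hashes);
}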

 

TODOs: Figure out

 

 

Index entry format

Options:

  1. <index key, object pointer> (as in the primary-key hash table)
  2. <index key, primary key>
  3. <index key, primary key hash> (tick)
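As a rough sketch, the chosen format could look like this (the 64-bit hash width matches RAMCloud's key hashes; the rest is illustrative):

#include <cstdint>
#include <string>

// Sketch of the chosen entry format: <index key, primary key hash>.
struct IndexEntry {
    std::string indexKey;   // secondary key value (variable length)
    uint64_t pkHash;        // 64-bit hash of the object's primary key
};

The hash is fixed-size (unlike the full primary key) and is what's needed to locate the object's master anyway, which presumably motivates the tick.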

 

 

Index format

Initial thoughts below; more up-to-date info to be maintained by Ashish on a different wiki page.

Tree: Enables range queries in addition to lookups.

Hash table: Enables exact-match lookups only. Options:

To begin, we could implement trees, since that's where we get the biggest functionality boost. Later, we want to implement both and provide the option when defining an index (see the sketch below).
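If both structures end up supported, the choice might be exposed at index-definition time along these lines (purely illustrative; no such API is defined yet):

#include <cstdint>

// Illustrative only: a possible way to pick the structure per index.
enum class IndexType : uint8_t {
    TREE,        // ordered: supports range queries as well as lookups
    HASH_TABLE   // unordered: exact-match lookups only
};

// Hypothetical index-definition call.
void createIndex(uint64_t tableId, uint8_t indexId, IndexType type);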

 

 

Index memory allocation (and sharing memory with data)

Initial thoughts below; more up-to-date info to be maintained by Ashish on a different wiki page.

There will be some index (for some part of some table) and some data (from the same or another table) on a given server, so index trees will have to share memory with segment storage for data. How do we allocate (and reclaim) memory for the index? Options:


 

Consistency

Definition

At any point, data is consistent with index entries corresponding to it:

Required:

  1. If data X exists, X is reachable from all of its key indexes.
  2. Data returned to a client is consistent with the key used to look it up.

If these are not satisfied, linearizability is violated. But there is a tradeoff with performance; in the future, consider providing an option to turn this on or off.

Desirable:
  1. Dangling pointers (an index entry exists, but the data corresponding to it doesn't) do not accumulate. (Temporary dangling pointers are okay.)

If this isn't satisfied, the memory footprint will grow beyond what is necessary.

For multi-threaded read/write non-locking access, no failures

(Can directly look at Object Update, since that includes the other cases.)

Simplest solution for all cases: wrap the steps into a transaction. But we'd like to avoid transactions as far as possible (to retain as much performance and scalability as possible).

Various cases:

Object(s) lookup based on secondary index:

Object write: Options:

  1. Insert index entry then write object: (tick) -> Object is available (via all indexes) as soon as the object is written. (See the sketch after this list.)
    1. Read before either the insertion or the write: no inconsistency.
    2. Read after both the insertion and the write: no inconsistency.
    3. Read between the insertion and the write: index entry found but matching object not found, so the read returns no value, as if the object hadn't been written yet. Same with multiple indexes.
  2. Write object then insert index entry:
    1. Read before either the write or the insertion: no inconsistency.
    2. Read after both the write and the insertion: no inconsistency.
    3. Read between the write and the insertion: index entry not found, so the read returns no value, as if the write hadn't completed. This is inconsistent, since the object does exist. With multiple indexes, a lookup on one index could succeed while a lookup on another fails.
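A minimal sketch of the chosen ordering (option 1); insertIndexEntry, writeObject, and keyHash are hypothetical helpers:

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers; names and signatures are assumptions.
void insertIndexEntry(uint64_t tableId, uint8_t indexId,
                      const std::string& indexKey, uint64_t pkHash);
void writeObject(uint64_t tableId, const std::string& primaryKey,
                 const std::string& value);
uint64_t keyHash(const std::string& primaryKey);

// Option 1: insert all index entries first, then write the object.
// A read arriving between the two steps finds an index entry but no
// matching object, which looks the same as "not written yet".
void indexedWrite(
        uint64_t tableId, const std::string& primaryKey,
        const std::vector<std::pair<uint8_t, std::string>>& secondaryKeys,
        const std::string& value) {
    const uint64_t pkHash = keyHash(primaryKey);
    for (const auto& [indexId, indexKey] : secondaryKeys)
        insertIndexEntry(tableId, indexId, indexKey, pkHash);
    writeObject(tableId, primaryKey, value);   // object visible from here on
}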

Object delete: Options:

  1. Remove index entry then delete object:
    1. Violates required property 1.
    2. Example of an undesirable situation: with index A and index B on the same data, removing the entry from index A, then from index B, then actually deleting the data could cause a situation where a lookup on index A doesn't return the value but a lookup on index B does.
  2. Delete object then remove entry: (tick) -> Object is unavailable (via any index) as soon as the object is deleted. (See the sketch after this list.)
    1. Read before the deletion and removal: object read, no inconsistency.
    2. Read after the deletion and removal: object not found, no inconsistency.
    3. Read between the deletion and removal: index entry found but the corresponding object not found, so no object is returned.
    4. Note: the delete() operation doesn't have information about indexes. Hence, before / while deleting the object, read out its index keys so that they can be removed. How to handle failures?
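And the mirror-image sketch for option 2 of delete; readIndexKeys, deleteObject, and removeIndexEntry are again hypothetical:

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers; names and signatures are assumptions.
std::vector<std::pair<uint8_t, std::string>> readIndexKeys(
        uint64_t tableId, const std::string& primaryKey);
void deleteObject(uint64_t tableId, const std::string& primaryKey);
void removeIndexEntry(uint64_t tableId, uint8_t indexId,
                      const std::string& indexKey, uint64_t pkHash);
uint64_t keyHash(const std::string& primaryKey);

// Option 2: read the index keys out of the object (delete() doesn't
// know them), delete the object, then remove the entries. Reads in
// between see a dangling entry, find no object, and return nothing.
void indexedDelete(uint64_t tableId, const std::string& primaryKey) {
    auto keys = readIndexKeys(tableId, primaryKey);
    deleteObject(tableId, primaryKey);         // object gone from here on
    const uint64_t pkHash = keyHash(primaryKey);
    for (const auto& [indexId, indexKey] : keys)
        removeIndexEntry(tableId, indexId, indexKey, pkHash);
}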

Object update:

From the consistency requirement and RAMCloud operation linearizability, it follows that we want the following linearizability property for object updates:

There exists a time x s.t. before x, a client can look up the old data; at and after x, it can look up the new data.

We have to remove the old entry AND insert the new entry – a combination of object delete and object write.

Key idea: Writing the new object is the commit point.

Steps:
  1. Insert the new index entries.
  2. Read the old object (to get its index entry information) and write the new object (this invalidates the old object) – can return to the client at this point.
    • Note: Don't actually need to read the old object out; can keep track of its position in the log and look it up later.
  3. Remove the old index entries – can happen asynchronously after returning to the client; can be batched across requests.
    • Note: Will have to mark the old object in some way to make sure it doesn't get deleted before this step.

Step 2 is a single operation (a new kind of write) in which writing an object returns the old one if it exists. This has to be atomic. (A previous option achieved the same result with the current reads and writes and no new atomic write/read, but it needed more RPCs and more complex code for write/write access.) A sketch of the whole sequence follows.
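A sketch of the full update sequence, with the atomic step 2 modeled as a hypothetical writeAndReturnOld operation:

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers; names and signatures are assumptions.
void insertIndexEntry(uint64_t tableId, uint8_t indexId,
                      const std::string& indexKey, uint64_t pkHash);
void removeIndexEntry(uint64_t tableId, uint8_t indexId,
                      const std::string& indexKey, uint64_t pkHash);
// Step 2 as one atomic operation: write the new object and hand back
// the old one (empty if none existed) so its entries can be removed.
std::string writeAndReturnOld(uint64_t tableId,
                              const std::string& primaryKey,
                              const std::string& newValue);
std::vector<std::pair<uint8_t, std::string>> extractIndexKeys(
        const std::string& objectValue);
uint64_t keyHash(const std::string& primaryKey);

void indexedUpdate(
        uint64_t tableId, const std::string& primaryKey,
        const std::vector<std::pair<uint8_t, std::string>>& newKeys,
        const std::string& newValue) {
    const uint64_t pkHash = keyHash(primaryKey);
    // Step 1: insert the new index entries.
    for (const auto& [indexId, indexKey] : newKeys)
        insertIndexEntry(tableId, indexId, indexKey, pkHash);
    // Step 2: the commit point. Writing the new object invalidates the
    // old one and returns it. Could reply to the client right here.
    std::string oldValue = writeAndReturnOld(tableId, primaryKey, newValue);
    // Step 3: remove the old entries. Can run asynchronously after the
    // reply, and be batched across requests.
    for (const auto& [indexId, indexKey] : extractIndexKeys(oldValue))
        removeIndexEntry(tableId, indexId, indexKey, pkHash);
}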

TODO: What happens if step 2 is done before step 1?

Example: The primary key is "Foo", which hashes to 444444. Its value is initially Bob Brown and is later updated to Sam Smith. There are indexes on the first name (Bob / Sam) and the last name (Brown / Smith). Timeline:

Note: The old and new index entries may not be on the same physical server.

 

For multi-threaded write/write non-locking access, no failures

Previous state: Foo: Bob Brown. Client 1 writes: Foo: Sam Smith. Now, say client 2 writes: Foo: Jen Jones.

Allowed end states: Consistent data and index entries:

Data - Foo: Sam Smith; fname - Sam: 444444; lname - Smith: 444444 ---OR--- vice versa

Not allowed end states: Data X but index entries corresponding to Y:

Data - Foo: Sam Smith; fname - Jen: 444444; lname - Jones: 444444 ---OR--- vice versa

Not ideal end states: Consistent data and index entries plus dangling pointers:

Data - Foo: Sam Smith; fname -  Sam: 444444, Jen: 444444; lname - Smith: 444444, Jones: 444444 ---OR--- vice versa


It seems there is no interleaving of actions that leads to a not-allowed state.

A potentially bad interleaving still produces the ideal result (BB = Bob Brown, SS = Sam Smith, JJ = Jen Jones):

  1. Master writes index entries for SS
  2. Master writes index entries for JJ
  3. Master reads BB and writes data SS (this invalidates BB)
  4. Master reads SS and writes data JJ (this invalidates SS)
  5. Master removes index entries for BB
  6. Master removes index entries for SS
 
Complication: What if client 2 wants to write BB (the same as the original data)?
Given the current algorithm, with the interleaving above, the end state would have BB as the data (okay) but no index entries pointing to it (not okay).

Potential Solution: Add an id (nonce?) to each index entry and store the ids (for all indexes corresponding to an object) in the object. This keeps track of the "correct" index entry.
Problem: Space overhead and a small time overhead.

Potential Solution (tick): (If a write involves indexes,) serialize writes to the same object, including index updates: lock updates on the object before step 1 and release the lock after step 3. – Implications: This makes batching of index deletes infeasible. Also, the indexing logic can no longer be pushed to the client.
BUT: Holding a lock across RPCs or timeouts is not ideal. (TODO)
 
Need to prevent delayed, repeated messages.
Can return early to the client without releasing the lock; given this, removes can also be batched with the next write.
(TODO)
 

Failure of an Index Server

The client request being processed blocks until the index server is recovered; it is then completed and returned to the client.

In the meanwhile, two things can happen:

  1. The client might retry this request
  2. Another client might send another request for the same object

If an update involves this index server, return RETRY to the client. This can leave dangling pointers around. A failure (and returning RETRY) at any point will not violate the required consistency properties.

No need to return RETRY if the failure happens during the removal of index entries.

Implementation-wise: Use an RPC wrapper that throws an error if the index server has failed (sketch below).

For a reasonable first implementation: just wait for the index server to recover; no need to return RETRY.
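One shape the wrapper could take (a sketch; the liveness check and error type are assumptions, not RAMCloud's actual wrapper classes):

#include <cstdint>
#include <stdexcept>

// Hypothetical liveness check; declared here just for the sketch.
bool indexServerUp(uint64_t serverId);

// Sketch of an RPC wrapper that surfaces index-server failure so the
// update path can return RETRY to the client. A first implementation
// would instead block inside the wrapper until recovery completes.
template <typename Rpc>
auto callIndexServer(uint64_t serverId, Rpc rpc) -> decltype(rpc()) {
    if (!indexServerUp(serverId))
        throw std::runtime_error("index server down: client should RETRY");
    return rpc();
}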

 

Failure of Master Server

Consider the sequence of steps for an update: insert new index entries, update the object, remove old index entries. Let's say they happen on the same server (i.e., the index and data are located on the same server in this case). What if the server fails at some point here?

Simple solution: Wrap all the ops into a transaction. But, as in the previous subsection, we'd like to avoid transactions as far as possible. So, for now, let's assume no transactions.

Failure scenarios:

Failure before the update begins, or after it is complete, is okay – it doesn't affect anything.

Failure after step 1, during step 2 – Case A:

Failure after step 2, during step 3 – Case B:

 
Dangling pointers problem: Options:
  1. Just leave them around. How many do we expect to accumulate anyway?
  2. Remove a dangling pointer when a lookup hits it and the object is not found. (Will have to be smart enough not to remove a pointer whose object is currently being added; can do so by giving ownership of this removal to the object's master.)
  3. Cleaning! (See the next subsection.)

 

Cleaning

An index cleaner process can scan indexes in background and delete index entries that are dangling pointers.

Complication: By design, we have dangling pointers for a little while (write the index entries, then update / write the object). The cleaner will have to detect this and not delete such index entries.

Another concern: Since indexes and data may not be on the same server, cleaning an index could require an RPC for each object in the index. Batching with a multiread will make this much more efficient.

Approach (that addresses the above concerns too; a sketch follows the steps):

  1. The index cleaner starts running on an index partition (index partition = the part of one index, corresponding to one table, that is located on one server).
  2. Send a (new) confirmExists(.., .., .., ..) RPC with all the index entries as parameters; the entries get bucketed and sent to all the masters owning the corresponding data (similar to the multi-op mechanism).
  3. Each master checks whether that data exists (or is currently being written) and, for data that doesn't exist, sends a (new) multiRemoveFromIndex(..., ...) RPC back to the index server.
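A sketch of one cleaner pass; confirmExists and multiRemoveFromIndex are stubbed with made-up signatures (the real ones are still TBD, as the ".." placeholders above suggest), and for simplicity the sketch has the cleaner issue the removals itself rather than the masters calling back:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct IndexEntry {
    std::string indexKey;   // secondary key
    uint64_t pkHash;        // primary key hash of the indexed object
};

// Stubs for the two new RPCs; real signatures are still to be decided.
// confirmExists: the master returns the subset of entries whose objects
// do NOT exist (and are not currently being written).
std::vector<IndexEntry> confirmExists(uint64_t masterId,
                                      const std::vector<IndexEntry>& entries);
void multiRemoveFromIndex(uint64_t indexServerId,
                          const std::vector<IndexEntry>& deadEntries);
uint64_t masterOwning(uint64_t pkHash);   // hypothetical tablet-map lookup

// One cleaner pass over an index partition: bucket entries by owning
// master (multi-op style), ask each master which objects are gone, and
// batch-remove the confirmed-dead entries from the index.
void cleanIndexPartition(uint64_t indexServerId,
                         const std::vector<IndexEntry>& partition) {
    std::map<uint64_t, std::vector<IndexEntry>> buckets;
    for (const IndexEntry& e : partition)
        buckets[masterOwning(e.pkHash)].push_back(e);
    for (const auto& [masterId, entries] : buckets) {
        std::vector<IndexEntry> dead = confirmExists(masterId, entries);
        if (!dead.empty())
            multiRemoveFromIndex(indexServerId, dead);
    }
}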

 

 

Failures / Recovery

Tablet (data) server failure

 

Index server failure

After a crash, until the index is restored:


Index restoration options:

Backup/Recover:

Rebuild:


Cost during data write:

Write analysis for backup/recover and rebuild: we can have m secondary indexes on a table. Consider what happens on a write (which will require insertions into all m indexes).

First, let's consider the case where m = 2, and where for option (a) the data is on one server and the index on another, while for option (b) data+index are on both servers.

Generalizing, for any m:

 

| | Latency | # Mem writes | # Backup writes | # Msgs from data to index servers | # Msgs to backups |
| No indexing | 15 us | 1 | R | 0 | R |
| Indexing w/ backup/restore | 35 us | m+1 | R*(m+1) | m | R*(m+1) |
| Indexing w/ no-backup/rebuild | 25 us | m+1 | R | m | R |

(m = number of secondary indexes; R = replication factor.)
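For example, with m = 2 secondary indexes and R = 3: backup/restore costs m+1 = 3 memory writes and R*(m+1) = 9 backup writes per object write (the object and both index entries are each replicated R ways), whereas no-backup/rebuild costs the same 3 memory writes but only R = 3 backup writes, since index contents are never backed up.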

 

Indexing w/ backup/restore has a very high latency.

Idea 1: Put index entries in master log? (TODO)

Idea 2: Use transactions so that the time before returning to the client can be reduced (though the total time remains the same).


Cost during crash recovery:

Backup/recover: similar to regular crash recovery?

Rebuild:

This will depend on the configuration. Let's find the worst-case performance under a "reasonable" set of assumptions.

The parallelization for an index recovery master is done as follows:

The total cache miss time of 5.12 s is higher than we're willing to tolerate. The only way to reduce it would be to enforce a constraint that data for a particular table cannot occupy more than a fraction of a master (and then this time can be multiplied by that fraction).

Given these tradeoffs, let's go with the backup/recover option for now. Moreover, if we do eventually decide to go with no-backup/rebuild, not much implementation work will have to be thrown away, so backup/recover makes a good first step.

Note: For details on how we got the cache miss time in the above table, look at the experiments here.