Secondary Indexes Design


Index key fields for a table

  • Index fields (and other necessary information, such as the type of index) are declared at the level of the table. Indexes can be added or dropped at any time. (See the sketch after this list.)
  • Internally, each indexed key field is assigned an enumerated value.
  • Later: Allow multi-key indexes, i.e., an index can be over just one key or over two or more keys.
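A minimal sketch of what such a per-table index declaration could look like; IndexSpec, IndexType, createIndex, and dropIndex are illustrative names only, not part of the existing RAMCloud API.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-table index declaration (names are illustrative only).
enum class IndexType : uint8_t { HASH, TREE };

struct IndexSpec {
    uint8_t indexId;      // enumerated value assigned to the indexed key field
    IndexType type;       // type of index (and any other necessary information)
    std::string name;     // human-readable name of the indexed field
};

// Declared at the level of the table; indexes can be added or dropped any time.
void createIndex(uint64_t tableId, const IndexSpec& spec);
void dropIndex(uint64_t tableId, uint8_t indexId);
```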

 

 

Index keys for objects

Now, RAMCloud knows what fields to build keys on. How do we figure out these keys for each object? RAMCloud doesn't understand the object blob, the application does.

 Solution: Provide index keys explicitly, apart from the object (key-value).

  • write(tableId, key, value, ..., numKeys, Array<indexName, indexLength>)

Note that the first entry in the array must correspond to the primary key.
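A hedged usage sketch of the proposed multi-key write; the KeyInfo layout and the exact parameter order are assumptions (here the primary key is simply the first entry of the key array, per the note above).

```cpp
#include <cstdint>
#include <cstring>

// Assumed shape of one entry in the key array (primary key first).
struct KeyInfo {
    const void* key;      // pointer to the key bytes
    uint16_t keyLength;   // length of the key in bytes
};

// Hypothetical signature for the proposed multi-key write.
void write(uint64_t tableId, const void* value, uint32_t valueLength,
           uint8_t numKeys, const KeyInfo* keyList);

void exampleWrite(uint64_t tableId) {
    // Keys in index enumeration order; the first entry is the primary key.
    KeyInfo keys[3] = {
        { "Foo",   3 },   // key 0: primary key (always present)
        { "Bob",   3 },   // key 1: first-name index
        { "Brown", 5 },   // key 2: last-name index
    };
    const char* value = "Bob Brown";
    write(tableId, value, static_cast<uint32_t>(std::strlen(value)), 3, keys);
}
```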

What would the actual stored object look like?

 

  • key0 is the primary key; it always has to be present.
  • The offsets are the relative positions of the keys; by definition, they are cumulative. (See the sketch after this list.)
  • Offset 0 points to beginning of key 0.
  • Using offsets makes key length computation trivial.
  • The key offsets and keys are put in the order of index enumeration.
    • For i >= 1, if key_i is not present, then keyOff_i = keyOff_{i-1}. This basically means the length of the i-th key is 0.
  • A smart application could leave the key data out of the value and use it directly from the returned index fields.
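One possible encoding of this layout; the exact offset convention (start vs. end of each key) is an assumption, and the sketch uses cumulative end offsets, which makes the absent-key rule above work out. The struct is an in-memory illustration of the idea, not the actual wire format.

```cpp
#include <cstdint>
#include <vector>

// Assumed layout: [numKeys][keyOff_0 .. keyOff_{n-1}][key_0 .. key_{n-1}][value]
// keyOff_i is the cumulative offset just past key i, so
//   length(key_0) = keyOff_0 and length(key_i) = keyOff_i - keyOff_{i-1}.
// An absent key_i (i >= 1) is encoded as keyOff_i == keyOff_{i-1}, i.e. length 0.
struct StoredObject {
    uint8_t numKeys;                   // key 0 (primary key) is always present
    std::vector<uint16_t> keyOffsets;  // cumulative, in index enumeration order
    std::vector<char> keysAndValue;    // keys back-to-back, then the value

    // Key length computation is trivial with cumulative offsets.
    uint16_t keyLength(uint8_t i) const {
        return (i == 0) ? keyOffsets[0] : keyOffsets[i] - keyOffsets[i - 1];
    }
    const char* keyData(uint8_t i) const {
        return keysAndValue.data() + ((i == 0) ? 0 : keyOffsets[i - 1]);
    }
};
```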

 

 

Index location

If index and table can both fit on one server, great!

But what if they don't?

Options:

Index on one or more index servers (these index servers could have data too, but the point is that they don't have to). Partitioned based on index key: (tick)
  • Index distributed based on index key on various index servers (these index servers could also have other data).
  • Coordinator keeps a map of which parts of which secondary indexes reside on which servers
  • Lookup analysis below - (1) - This is option (a)
  • Recovery times for data table server and index table server will be different. :-/
  • TODO: What if the list of keys (to multiread on) returned from index lookup are too long to fit in one rpc?

Index distributed such that it is co-located with corresponding data:
  • Any index lookup would go to all servers on which index is located.
  • Data (for each table) is scattered across servers and gets re-scattered during recovery. Re-distributing the index according to the data would be an additional overhead during recovery.
  • Lookup analysis below - (1) - This is option (b)
  • This will certainly be "better" for writes (no rpcs needed for index writes, hence lower latency) and better for reconstruction recovery.
(1): Lookup analysis for both cases. Let's assume the data+index are distributed across n servers. There's a read based on a secondary index. Which option is better?
 
For n = 1: (a) and (b) perform the same.
 
For n > 1:
(a) does 1 lookup then 1 multiRead to at most n-1 servers (only to servers that actually hold matching data); (b) does a multiLookup to n servers and then reads data locally from some or all.
  • For n = 2, 3, 4, 5, doing a direct multi-rpc to n is cheaper than 1 rpc + multi-rpc to n-1. (According to clusterperf benchmarks from wiki uploaded on Nov 12, 2012.)
  • For n > 5, they're about the same.
  • However, for large n, say n ~ 1000, we expect matching data to be found on a much smaller set of servers (and not on all). So, (a) will be better.

 

TODOs: Figure out

  • How to balance index partitioning?
  • Consider grouping indexes that the user will typically query together, similar to HyperDex.
  • What will be the performance when loaded?
  • If a key is very hot, split it for network bandwidth?

 

 

Index entry format

<index key, object pointer> (as in primary key hash table)

  • Not possible if index entries are not on the same server as the corresponding data records. Also, if the object is moved (for example, during recovery), the index entries would have to be updated.

<index key, primary key>

  • Avoids problems with previous option.
  • Primary key could be large, at least larger than obj ptr.

<index key, primary key hash> (tick)

  • Better than option 2: since the hash table is keyed by primary key hash anyway, why not store the hash directly – saves time and space. (See the sketch after this list.)
  • Since primary key hash may not uniquely identify object, need to compare read object with the key provided to ensure we have the right object(s).
  • Concern: How to do garbage collection?
    • Problem: If an object A is deleted, we look at the secondary index. When we are looking at a secondary index entry with a 64 bit primary key hash X, how do we know that there wasn't another object B with the same 64 bit primary key hash X that had the same secondary index value (and hence this entry corresponds to another object, and hence can't be deleted). Sure, the likelihood of this is very very slim, but we have to consider it.
    • Solution idea: Keep a count in the secondary index entries for each primary key hash corresponding to this index key. Doesn't work since incrementing / decrementing this count is not an idempotent action – and we need all our operations to be idempotent (so that transactions will work).
    • Solution: Look up hash X in the primary hash table, then compare the entire key hash (>64 bit) where .....
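A small sketch of the chosen <index key, primary key hash> entry and the collision check a lookup needs; the types and field names are placeholders, not a final format.

```cpp
#include <cstdint>
#include <string>

// Chosen entry format: <index key, primary key hash>.
// Layout and names below are illustrative only.
struct IndexEntry {
    std::string indexKey;   // secondary key (need not be unique)
    uint64_t pKHash;        // 64-bit hash of the object's primary key
};

// Because pKHash may collide, a lookup must verify each candidate object:
// read the object via pKHash, then compare the secondary key stored in the
// object with the key the client asked for before returning it.
bool matches(const IndexEntry& entry, const std::string& storedSecondaryKey) {
    return entry.indexKey == storedSecondaryKey;  // reject hash false positives
}
```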

 

 

Index format

Initial thoughts below. More up-to-date info to be maintained by Ashish on a different wiki page.

Tree: Enables range queries in addition to lookups.

  • Modular (tree implementation should be changeable without affecting rest of the implementation)
  • API assumptions: Allows lookup, add, remove. All ops are atomic and idempotent.
  • Our "distribution" strategy is to partition based on index key. Many distributed tree implementations do random node placement, which will be highly inefficient / not scalable.
  • Tree structure: Check related work (B+ Tree, Adaptive Radix Tree, Masstree, Bw-Tree, etc.) and figure out which one(s) would work best. If possible, make it modular so that it can be changed if needed.
  • Concurrency issues? Is it good enough to take a global lock on the tree partition?
  • Later: Trie for string key indexes and normal tree for numerical key indexes? Might help with space efficiency. Would be more complex since need two implementations.

Hash table: Enables lookups on exact match. Options:

  • Make it just like the current primary key hash table. Good: Can reuse current implementation. Bad: This index would have to be modified every time data moves. (See section on index entry format.)
  • Index key points to the primary key, then another lookup in primary key hash table to get the object itself. (tick)
    • Will need a mechanism to handle duplicates. (Since secondary keys don't have to be unique, could have multiple keys here.)

To begin, we could implement trees, since that's where we get the maximum functionality boost. Later, we want to implement both and provide the option when defining indexes.
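A minimal sketch of the modular interface assumed above (lookup, add, remove, all atomic and idempotent, plus a range lookup), so the tree implementation can be swapped without affecting the rest of the code; class and method names are placeholders.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Placeholder interface for one index partition. Any tree (B+ Tree, ART,
// Masstree, Bw-Tree, ...) or a hash table can sit behind it.
class IndexPartition {
  public:
    virtual ~IndexPartition() = default;

    // All operations are atomic and idempotent (adding an existing entry or
    // removing a missing one is a no-op), so retries after failures are safe.
    virtual void add(const std::string& indexKey, uint64_t pKHash) = 0;
    virtual void remove(const std::string& indexKey, uint64_t pKHash) = 0;

    // Exact-match lookup; returns primary key hashes of candidate objects.
    // Duplicates are allowed since secondary keys need not be unique.
    virtual std::vector<uint64_t> lookup(const std::string& indexKey) = 0;

    // Range query [first, last] -- the main reason to prefer a tree.
    virtual std::vector<uint64_t> lookupRange(const std::string& first,
                                              const std::string& last) = 0;
};
```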

 

 

Index memory allocation (and sharing memory with data)

Initial thoughts below. More up-to-date info to be maintained by Ashish on a different wiki page.

There will be some index (for some part of some table) and some data (from the same or another table) on a given server. So index trees will have to share memory with segment storage for data. How do we allocate (and clean) memory to the index? Options:

  • Stuff index nodes into RAMCloud key-value objects. (See the sketch after this list.)
    • Might be inefficient – TODO analysis.
    • Simpler since these can get managed (allocated / read / cleaned / recovered) in the same way as regular RAMCloud objects.
    • If we want to use backup/recover option for recovery, then this is particularly good.
  • Allocate log segments to index and use these segment spaces to store parts of the tree.
    • How will this allocation and cleaning work?
    • This approach might be convenient if we can use RAMCloud recovery for recovering the index too.
  • Split memory between index and data.
    • Dynamic splitting not possible without changing RAMCloud memory allocation (and don't want to go there).
    • Static splitting will require a priori information about index and data sizes, which might not be available, or else it will result in underutilized space.
  • Completely separate index servers and data servers.
    • Then there would be issues about how many servers to allocate for indexes and how many for data?
    • How would this impact index server recoveries?
    • Memory allocation can then be done independently for data and index without interference from the other.
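A sketch of the first option above: each tree node is serialized as an ordinary RAMCloud key-value object in a hypothetical backing table, keyed by a node id, so allocation, cleaning, backup, and recovery come from the existing object machinery. The node layout and the readNode/writeNode helpers are assumptions standing in for the normal object read/write path.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical: each index partition gets a backing table; tree nodes are
// stored as ordinary objects keyed by a 64-bit node id.
struct TreeNode {
    bool isLeaf;
    std::vector<uint64_t> childNodeIds;  // inner nodes: ids of child nodes
    std::vector<uint64_t> pKHashes;      // leaves: primary key hashes
    // separator / index keys omitted for brevity
};

// Assumed helpers wrapping the normal RAMCloud object read/write path, so
// nodes are logged, cleaned, backed up, and recovered like any other object.
TreeNode readNode(uint64_t backingTableId, uint64_t nodeId);
void writeNode(uint64_t backingTableId, uint64_t nodeId, const TreeNode& node);
```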


 

Consistency

Definition

At any point, data is consistent with index entries corresponding to it:

Required:

  1. If data X exists, X is reachable from all key indexes.
  2. Data returned to client is consistent with key used to look it up.

If these are not satisfied, then linearizability is violated. But there's a tradeoff with performance. In future, consider providing an option to turn this on or off.

Desirable:
  1. Dangling pointers (index entry exists, but data corresponding to it doesn't) are not accumulating. (Temporary dangling pointers okay.)

If this isn't satisfied, memory footprint will increase beyond what is necessary.

For multi-threaded read/write non-locking access, no failures

(Can directly look at Object Update, since that includes the other cases.)

Simplest solution for all cases: wrap the steps into a transaction. But we'd like to avoid transactions as far as possible (to retain as much performance and scalability as possible).

Various cases:

Object(s) lookup based on secondary index:

  • Lookup in the index, return keyhashes corresponding to query
    • If the index should exist (based on the index declaration) but doesn't exist right now, then maybe it's being reconstructed after a crash.
      • Handling options: keep retrying till success (tick) or return "retry" to client (question)
  • Client does multiread on these keyhashes
    • For any keyhash, if no match, then return empty result for that object (just as in normal read)
    • If matches are found for a keyhash, check that we have the right objects based on the index keys, and return the matching values. (See the sketch after this list.)
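A sketch of this lookup flow; lookupIndexKeys, multiRead, and the return types are placeholders for the operations this design proposes, not existing RAMCloud calls.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct ObjectValue {
    bool exists;               // false if no object matched this key hash
    std::string secondaryKey;  // the indexed key stored with the object
    std::string value;
};

// Placeholder RPCs proposed by this design (signatures are assumptions).
std::vector<uint64_t> lookupIndexKeys(uint64_t tableId, uint8_t indexId,
                                      const std::string& key);
std::vector<ObjectValue> multiRead(uint64_t tableId,
                                   const std::vector<uint64_t>& pKHashes);

std::vector<std::string> readBySecondaryKey(uint64_t tableId, uint8_t indexId,
                                            const std::string& key) {
    // 1. Lookup in the index; returns key hashes matching the query.
    //    (If the index is being reconstructed after a crash, the caller
    //    retries until it is available again.)
    std::vector<uint64_t> hashes = lookupIndexKeys(tableId, indexId, key);

    // 2. Multiread the candidate objects by primary key hash.
    std::vector<ObjectValue> objects = multiRead(tableId, hashes);

    // 3. Keep only objects whose stored secondary key really matches; a key
    //    hash that matched nothing is skipped, as in a normal read.
    std::vector<std::string> results;
    for (const ObjectValue& obj : objects) {
        if (obj.exists && obj.secondaryKey == key)
            results.push_back(obj.value);
    }
    return results;
}
```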

Object write: Options:

  1. Insert index entry then write object: (tick) -> Object available (via all indexes) as soon as object written.
    1. Read before either insertion / write: no inconsistency
    2. Read after both insertion and write: no inconsistency
    3. Read between insertion and write: index entry found, matching object not found, read will return no value, as if object wasn't written yet. Same thing with multiple indexes.
  2. Write object then insert index entry:
    1. Read before either write / insertion: no inconsistency
    2. Read after both write and insertion: no inconsistency
    3. Read between write and insertion: index entry not found, read will return no value, as if write hadn't completed. This is inconsistent since object does exist. If we have multiple indexes, then lookup on one could succeed and lookup on another could fail.

Object delete: Options:

  1. Remove index entry then delete object
    1. Violates property 1.
    2. Example of undesirable situation: If I have index A and index B on same data, then removing entry from index A then from B then actually deleting data could cause a situation where lookup on index A doesn't return value, but lookup on index B returns a value.
  2. Delete object then remove entry: (tick) -> Object not available (via any index) as soon as object deleted.
    1. Read on this object before deletion and removal: object read, no inconsistency
    2. Read after deletion and removal: object not found, no inconsistency
    3. Read between deletion and removal: index entry found, corresponding object not found, no object returned.
    4. Note: The delete() operation doesn't have information about indexes. Hence, before / while deleting the object, read out its index keys so that they can be removed. How to handle failures?

Object update:

From the consistency requirement and RAMCloud operation linearizability, it follows that we want the following linearizability property for object updates:

There exists time x, s.t. before x, client can lookup old data; at and after time x, it can lookup the new data.

Have to remove old entry AND insert new entry. Combination of object delete and object write.

Key idea: Writing object is the commit point.

Steps:
  1. Insert new index entries
  2. Read old object (to get old index entry information) and write new object (this invalidates the old object) – Can return to client at this point
    • Note: Don't need to read out. Can keep track of position in log and look it up later.
  3. Remove old index entries – Can happen asynchronously after return to client; Can batch from various requests.
    • Note: Will have to mark the old object in some way to make sure it doesn't get deleted.
Step 2 is a single operation (a new kind of write) where writing an object returns the old one if it exists. This will have to be atomic. (A previous option achieved the same result with current reads and writes, with no new atomic write/read. But that needed more rpcs, and more complex code for write/write access.)
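A sketch of the three-step update described above; insertIndexEntry, removeIndexEntry, and writeAndReturnOldKeys are placeholders for the operations this design proposes (step 2 being the new atomic write that returns the old object's keys).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Secondary keys of an object, in index enumeration order (keys 1..m).
struct KeyList { std::vector<std::string> keys; };

// Placeholder helpers for the proposed operations (signatures are assumptions).
void insertIndexEntry(uint8_t indexId, const std::string& key, uint64_t pKHash);
void removeIndexEntry(uint8_t indexId, const std::string& key, uint64_t pKHash);
// New kind of write: atomically writes the new object (invalidating the old
// one) and returns the old object's secondary keys, if any.
KeyList writeAndReturnOldKeys(uint64_t tableId, const std::string& primaryKey,
                              const std::string& newValue,
                              const KeyList& newKeys);
uint64_t primaryKeyHash(const std::string& primaryKey);

void updateObject(uint64_t tableId, const std::string& primaryKey,
                  const std::string& newValue, const KeyList& newKeys) {
    uint64_t pKHash = primaryKeyHash(primaryKey);

    // Step 1: insert the new index entries (dangling until step 2 commits).
    for (size_t i = 0; i < newKeys.keys.size(); i++)
        insertIndexEntry(static_cast<uint8_t>(i + 1), newKeys.keys[i], pKHash);

    // Step 2: commit point -- write the new object and get the old keys back.
    // Can return to the client at this point.
    KeyList oldKeys = writeAndReturnOldKeys(tableId, primaryKey, newValue, newKeys);

    // Step 3: remove the old entries; can happen asynchronously and be batched.
    for (size_t i = 0; i < oldKeys.keys.size(); i++)
        removeIndexEntry(static_cast<uint8_t>(i + 1), oldKeys.keys[i], pKHash);
}
```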

TODO: What happens if step 2 done before step 1?

Example: Primary key is "Foo" which hashes to 4444444. Its value is initially Bob Brown and later updated to Sam Smith. There are indexes on the first name (Bob / Sam) and last name (Brown / Smith). Timeline:

Note: The old and new index entries may not be on the same physical server.

 

For multi-threaded write/write non-locking access, no failures

Previous state: Foo: Bob Brown. Client 1 writes: Foo: Sam Smith. Now, say client 2 writes: Foo: Jen Jones.

Allowed end states: Consistent data and index entries:

Data - Foo: Sam Smith; fname - Sam: 444444; lname - Smith: 444444 ---OR--- vice versa

Not allowed end states: Data X but index entries corresponding to Y:

Data - Foo: Sam Smith; fname - Jen: 444444; lname - Jones: 444444 ---OR--- vice versa

Not ideal end states: Consistent data and index entries plus dangling pointers:

Data - Foo: Sam Smith; fname -  Sam: 444444, Jen: 444444; lname - Smith: 444444, Jones: 444444 ---OR--- vice versa


It seems like there is no interleaving of actions leading to a not-allowed state.

Potentially bad interleaving produces ideal result:

  1. Master writes index entries for SS
  2. Master writes index entries for JJ
  3. Master reads BB and writes data SS (this invalidates BB)
  4. Master reads SS and writes data JJ (this invalidates SS)
  5. Master removes index entries for BB
  6. Master removes index entries for SS
 
Complication: What if client 2 wants to write BB (same as the original data)?
Given the current algorithm, with the previous example interleaving, the end state would have BB as the data (okay) but no index entries pointing to it (not okay).

Potential Solution: Add id (nonce?) to index entry and store ids (for all indexes corresponding to an obj) in the object. This keeps track of the "correct" index entry.
Problem: Space overhead and small time overhead.

Potential Solution (tick): (If a write involves indexes,) serialize writes to the same object, including index updates. Lock the object before step 1 and release the lock after step 3. (See the sketch below.) – Implications: This makes batching of index deletes not feasible. Also, cannot push the indexing logic to the client anymore.
BUT: Holding lock during rpcs or timeouts is not ideal. (TODO)
 
Need to prevent delayed repeated messages.
Can return early to client, but not release the lock. Given this, can also batch removes with next write.
(TODO)
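A sketch of this serialization, assuming a per-object lock keyed by primary key hash; ObjectLockTable and the step helpers are placeholders, and in a real implementation the lock table itself would need its own synchronization.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Placeholder lock table, keyed by primary key hash.
class ObjectLockTable {
    std::unordered_map<uint64_t, std::mutex> locks;
  public:
    std::mutex& lockFor(uint64_t pKHash) { return locks[pKHash]; }
};

// Placeholders for the three update steps described earlier.
void insertNewIndexEntries(uint64_t pKHash);   // step 1
void writeObjectAndReadOld(uint64_t pKHash);   // step 2
void removeOldIndexEntries(uint64_t pKHash);   // step 3

void serializedUpdate(ObjectLockTable& lockTable, uint64_t pKHash) {
    // Lock before step 1 and release after step 3, so two clients updating
    // the same object cannot interleave their index updates.
    std::lock_guard<std::mutex> guard(lockTable.lockFor(pKHash));
    insertNewIndexEntries(pKHash);
    writeObjectAndReadOld(pKHash);   // could reply to the client here,
    removeOldIndexEntries(pKHash);   // but the lock is held until step 3 ends
}
```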
 

Failure of an Index Server

The client request being processed blocks until the index server recovers; it is then completed and the reply returned to the client.

In the meanwhile, two things can happen:

  1. The client might retry this request
    • The write (including index updates) is idempotent by design, so not a problem.
  2. Another client might send another request for the same object
    • This is the same as multi-threaded access to the same object problem.

If an update involves this index server, return RETRY to client. Can leave dangling pointers around. Failure (and returning retry) at any point will not violate required consistency properties.

No need to return RETRY if failure happens at removal of index entries stage.

Implementation-wise: Use RPC wrapper to throw error if index server failed.

For a reasonable first implementation: Just wait for index server to recover, no need to return RETRY.

 

Failure of Master Server

Consider the sequence of steps for an update: insert new index entry, update object, remove old index entry. Let's say they happen on the same server (i.e., index and data are located on the same server in this case). What if the server fails at some point here?

Simple solution: Wrap all the ops into a transaction. But as in the previous subsection, we'd like to avoid transactions as far as possible. So, for now, let's assume no transactions.

Failure scenarios:

Failure before the update begins or after it is complete is okay – it doesn't affect anything.

Failure after step 1, during step 2 – Case A:

  • Current state maintains consistency properties stated above
  • There are dangling pointers (to new unwritten data)
  • On client retry:
    • Step 1 redone: Idempotent, so not a problem
    • Step 2-3 done: Client operation completes
  • If the client dies and never retries?
    • If index restoration is done by rebuilding, then eventually, when this server crashes/recovers, the dangling pointers will go away.
    • If index restoration is done by backup/recovery, then these may stay around forever. (warning)

Failure after step 2, during step 3 – Case B:

  • Current state maintains consistency properties stated above
  • There are dangling pointers (to old deleted data)
  • On client retry:
    • Step 1 redone: Idempotent, so not a problem
    • Step 2 redone: The data written is still SS, but the old object read out is now SS (instead of BB)
    • Step 3 done: Delete the index entry pointing to SS – bad! We should add a check to not delete index entries corresponding to the data we're currently writing (since that might happen in other cases too, like overwriting the data with the same value). With that check, step 3 is a no-op, and the dangling pointer to the old data (BB) stays. (warning)
  • If the client dies and never retries?
    • If index restoration is done by rebuilding, then eventually, when this server crashes/recovers, the dangling pointers will go away.
    • If index restoration is done by backup/recovery, then these may stay around forever. (warning)
 
Dangling pointers problem: Options:
  1. Just leave them around. How much do we expect to accumulate anyway?
  2. Remove dangling pointers if a lookup is done on that and object not found. (Will have to be smart to not remove a pointer for which object is being currently added. Can do so by giving ownership of this removal to the object master.)
  3. Cleaning! (See next subsection.)

 

Cleaning

An index cleaner process can scan indexes in background and delete index entries that are dangling pointers.

Complication: By design, we have dangling pointers for a little while (write index entries, then update / write the object). The cleaner will have to detect this and not delete such index entries.

Another concern: Since indexes and data may not be on the same server, cleaning an index could require an rpc for each object in the index. Batching with multiread will make this much more efficient.

Approach (that addresses the above concerns too; a sketch follows the steps below). Steps:

  1. Index cleaner starts running on an index partition (index partition = part of one index corresponding to one table that is located on one server).
  2. Send a (new) rpc confirmExists(.., .., .., ..) with all the index entries as params; this gets bucketed and sent to all the masters owning the referenced data (similar to the multi-op mechanism).
  3. Each master server checks whether each object exists (or is currently being written), and for data that doesn't exist, sends a (new) rpc multiRemoveFromIndex(..., ...) to the index server.
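A sketch of the cleaner using the proposed confirmExists / multiRemoveFromIndex RPCs; the signatures and the bucketing helper are assumptions, and for brevity the sketch routes the removal through the cleaner, whereas in step 3 above each master sends multiRemoveFromIndex to the index server directly.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct IndexEntry {
    std::string indexKey;
    uint64_t pKHash;
};

// Proposed RPCs from the steps above (signatures are assumptions).
// confirmExists returns the entries whose objects do not exist on the master
// and are not currently being written.
std::vector<IndexEntry> confirmExists(uint64_t masterId, uint64_t tableId,
                                      const std::vector<IndexEntry>& entries);
void multiRemoveFromIndex(uint64_t indexServerId,
                          const std::vector<IndexEntry>& deadEntries);
uint64_t masterOwning(uint64_t tableId, uint64_t pKHash);  // hypothetical map lookup

void cleanIndexPartition(uint64_t indexServerId, uint64_t tableId,
                         const std::vector<IndexEntry>& partitionEntries) {
    // Bucket entries by the master owning the corresponding data, similar to
    // the multi-op mechanism, so one rpc covers many objects.
    std::map<uint64_t, std::vector<IndexEntry>> buckets;
    for (const IndexEntry& e : partitionEntries)
        buckets[masterOwning(tableId, e.pKHash)].push_back(e);

    for (const auto& bucket : buckets) {
        // Each master reports its dangling entries; those are then removed
        // from the index partition.
        std::vector<IndexEntry> dead =
                confirmExists(bucket.first, tableId, bucket.second);
        if (!dead.empty())
            multiRemoveFromIndex(indexServerId, dead);
    }
}
```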

 

 

Failures / Recovery

Tablet (data) server failure

  • Normal recovery happens
    • Tablets from crashed master recovered on multiple recovery masters
    • Corresponding tablet maps updated on coordinator
  • Since our index record maps index key to object key hash, which doesn't change on recovery, this doesn't affect indexes at all.
    • Just retry till object found (i.e., till recovery completes)
 

Index server failure

After crash, till index is restored:

  • Normal reads can continue if the lookups are not on secondary indexes.
  • Lookups will block.
  • All writes (to table having this index) will block.


Index restoration options:

Backup/Recover:

  • Backup index on R disks (as with normal data) and recover on crash.
  • Can we use regular RAMCloud crash recovery with minor changes if index written to segments?
  • Recovery Mechanism:
    • If using RAMCloud key-val to store object, then RAMCloud recovery automatically recovers the index.
    • If not, steps:
      • Find backups
      • Partition failed index, assign m recovery masters
      • Read log from all backups for this index server
      • Ship data from backups to recovery masters
      • Write + recompute tree on recovery masters

Rebuild:

  • Keep index only in memory, then rebuild on crash.
  • Can read from masters that hold data for table being indexed.
  • Can recover on multiple index recovery masters (partition based on index key) – can later consolidate them back onto one index server. (Similar to regular recovery partitions.)
  • There was some pathology discussed for this case (before I joined the group).
  • Rebuild Mechanism:
    • Locate master servers for entire table (not only the partition that failed)
    • Partition failed index, assign m recovery masters
    • Data masters scan all data for table
    • Data masters multiplex ship data for this index partition to recovery masters
    • Recovery masters reconstruct + write index tree nodes.


Cost during data write:

Write analysis for backup/recover and rebuild. We can have m secondary indexes on a table. Consider what happens on a write (which will require insertions in all m indexes).

First, let's consider the case where m = 2, and where for option (a) the data is on one server and the index on another, while for option (b) data+index are on both servers.

Generalizing, for any m:

  • Best case latency to write: for backup/recover = 20 us, for rebuild = 15 us.
  • Actual work done for backup/recover is R*m more backup (disk) writes than for rebuild – i.e., 3*m extra with the usual R = 3 replicas.

 

| | Latency | # Mem writes | # Backup writes | # Msgs from data to index servers | # Msgs to backups |
| --- | --- | --- | --- | --- | --- |
| No indexing | 15 us | 1 | R | 0 | R |
| Indexing w/ backup/restore | 35 us | m+1 | R*(m+1) | m | R*(m+1) |
| Indexing w/ no-backup/rebuild | 25 us | m+1 | R | m | R |

 

Indexing w/ backup/restore has a very high latency.

Idea 1: Put index entries in master log? (TODO)

Idea 2: Use transactions so that the time before returning to the client can be reduced (though the total time remains the same).


Cost during crash recovery:

Backup/recover: similar to regular crash recovery?

Rebuild:

Will depend on configuration. Let's find the worst-case performance with a "reasonable" set of assumptions.

The parallelization for the index recovery masters is done as follows:

  • Reduce the partition size to be recovered to 300 MB => total time 10s
  • Multithread across 10 cores => total time 1s.
    • This is problematic since our current tree implementation is not thread-safe. We'd then have to replace our tree with a thread-safe one. It's certainly a possibility, though that might increase tree insert times too.

Total cache miss time of 5.12s is a higher number than we're willing to tolerate. The only way to reduce this would be to enforce a constraint that data for a particular table cannot occupy more than a fraction of a master. (And then this time can be multiplied by that fraction).

Given these tradeoffs, let's go with the backup/recover option for now. Moreover, if we do decide to go with no-backup/rebuild eventually, not much implementation work will have to be thrown away, and it makes a good first step.

Note: For details on how we got the cache miss time in the above table, look at the experiments here.