
Table of Contents

...

For multi-threaded read/write non-locking access, no failures

(Can directly look at Object Update, since that includes the other cases.)

The simplest solution for all cases is to wrap the steps in a transaction. But we'd like to avoid transactions as far as possible (to retain as much performance and scalability as possible).

...

  1. Insert index entry then write object: (tick) -> Object available (via all indexes) as soon as object written. (See the sketch after this list.)
    1. Read before either insertion / write: no inconsistency
    2. Read after both insertion and write: no inconsistency
    3. Read between insertion and write: index entry found, matching object not found, read will return no value, as if object wasn't written yet. Same thing with multiple indexes.
  2. Write object then insert index entry:
    1. Read before either write / insertion: no inconsistency
    2. Read after both write and insertion: no inconsistency
    3. Read between write and insertion: index entry not found, read will return no value, as if write hadn't completed. This is inconsistent since object does exist. If we have multiple indexes, then lookup on one could succeed and lookup on another could fail.
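
A minimal single-process sketch of the chosen ordering (option 1), assuming hypothetical in-memory maps (fnameIndex, objectStore) as stand-ins for RPCs to the index servers and the data master; writeObject and lookupByFname are illustrative names, not RAMCloud APIs. It shows why a reader that finds an index entry with no matching object can simply behave as if the object weren't written yet.

// Hypothetical stand-ins: secondary index (index key -> object key hash)
// and object store (key hash -> object value).
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>

static std::multimap<std::string, uint64_t> fnameIndex;
static std::unordered_map<uint64_t, std::string> objectStore;

// Write path: insert the index entry first, then write the object, so the
// object becomes visible via the index only once it actually exists.
void writeObject(uint64_t keyHash, const std::string& value,
                 const std::string& fnameKey)
{
    fnameIndex.emplace(fnameKey, keyHash);   // step 1: index entry
    objectStore[keyHash] = value;            // step 2: object write
}

// Lookup path: an index entry whose object is missing is treated as
// "object not written yet" and skipped.
std::optional<std::string> lookupByFname(const std::string& fnameKey)
{
    auto range = fnameIndex.equal_range(fnameKey);
    for (auto it = range.first; it != range.second; ++it) {
        auto obj = objectStore.find(it->second);
        if (obj != objectStore.end())
            return obj->second;              // entry and object both present
    }
    return std::nullopt;                     // dangling or absent: no value
}

int main()
{
    writeObject(444444, "Foo: Sam Smith", "Sam");
    return lookupByFname("Sam") ? 0 : 1;
}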

Object delete: Options:

  1. Remove index entry then delete object
    1. Violates property 1.
    2. Example of an undesirable situation: if I have index A and index B on the same data, then removing the entry from index A, then from B, then actually deleting the data could cause a situation where a lookup on index A doesn't return a value but a lookup on index B does.
  2. Delete object then remove entry: (tick) -> Object not available (via any index) as soon as object deleted.
    1. Read on this object before deletion and removal: object read, no inconsistency
    2. Read after deletion and removal: object not found, no inconsistency
    3. Read between deletion and removal: index entry found, corresponding object not found, no object returned.
    4. Note: the delete() operation doesn't have information about the indexes. Hence, before / while deleting the object, read out its index keys so that they can be removed. How to handle failures? (A sketch of this ordering follows the list.)
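
A corresponding sketch of the chosen delete ordering (option 2), under the same assumptions as before: in-memory maps stand in for RPCs, and the names are illustrative. The index keys are read out of the object before it is deleted, and the index entries are removed afterwards.

#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

struct Object {
    std::string value;
    std::vector<std::string> fnameKeys;   // index keys recorded with the object
};

static std::multimap<std::string, uint64_t> fnameIndex;
static std::unordered_map<uint64_t, Object> objectStore;

// Delete path: read the index keys out of the object (delete() itself has no
// index information), delete the object, then remove the index entries.
void deleteObject(uint64_t keyHash)
{
    auto it = objectStore.find(keyHash);
    if (it == objectStore.end())
        return;
    std::vector<std::string> keys = it->second.fnameKeys;   // step 0: read keys
    objectStore.erase(it);                                   // step 1: delete object
    for (const auto& k : keys) {                             // step 2: remove entries
        auto range = fnameIndex.equal_range(k);
        for (auto e = range.first; e != range.second; ++e) {
            if (e->second == keyHash) { fnameIndex.erase(e); break; }
        }
    }
}

int main()
{
    objectStore[444444] = {"Foo: Sam Smith", {"Sam"}};
    fnameIndex.emplace("Sam", 444444);
    deleteObject(444444);
    return fnameIndex.count("Sam") == 0 ? 0 : 1;
}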

Object update:

From the consistency requirement and RAMCloud operation linearizability, it follows that we want the following linearizability property for object updates:

...

  1. Insert new index entries
  2. Read old object (to get old index entry information) and write new object (this invalidates the old object) – Can return to client at this point
    • Note: Don't need to read out. Can keep track of position in log and look it up later.
  3. Remove old index entries – Can happen asynchronously after return to client; Can batch from various requests.
    • Note: Will have to mark the old object in some way to make sure it doesn't get deleted.
Step 2 is a single operation (a new kind of write) where writing an object returns the old one if it exists. This will have to be atomic. (A previous option achieved the same result with the current reads and writes, with no new atomic write/read, but it needed more RPCs and more complex code for write/write access.)
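
A rough sketch of the three-step update under the same in-memory assumptions; writeReturningOld is a hypothetical stand-in for the new atomic write that hands back the old object. Step 3 is shown inline for simplicity, although (as noted above) it can run asynchronously and be batched; a real implementation would also need to avoid removing entries for keys that are unchanged between the old and new object.

#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

struct Object {
    std::string value;
    std::vector<std::string> indexKeys;
};

static std::multimap<std::string, uint64_t> index_;
static std::unordered_map<uint64_t, Object> store;

// Step 2 as a single operation: write the new object and hand back the old
// one (if any) so that its index entries can be removed later.
std::optional<Object> writeReturningOld(uint64_t keyHash, Object newObj)
{
    std::optional<Object> old;
    auto it = store.find(keyHash);
    if (it != store.end())
        old = it->second;
    store[keyHash] = std::move(newObj);
    return old;
}

void update(uint64_t keyHash, Object newObj)
{
    // Step 1: insert the new index entries.
    for (const auto& k : newObj.indexKeys)
        index_.emplace(k, keyHash);

    // Step 2: write the new object, obtaining the old one. The client could
    // be answered at this point.
    std::optional<Object> old = writeReturningOld(keyHash, std::move(newObj));

    // Step 3: remove the old index entries; may happen asynchronously and be
    // batched (simplified: assumes old and new keys don't overlap).
    if (old) {
        for (const auto& k : old->indexKeys) {
            auto range = index_.equal_range(k);
            for (auto e = range.first; e != range.second; ++e)
                if (e->second == keyHash) { index_.erase(e); break; }
        }
    }
}

int main()
{
    update(444444, {"Foo: Bob Brown", {"Bob", "Brown"}});
    update(444444, {"Foo: Sam Smith", {"Sam", "Smith"}});
    return index_.count("Bob") == 0 && index_.count("Sam") == 1 ? 0 : 1;
}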

TODO: What happens if step 2 done before step 1?

...

Previous state: Foo: Bob Brown. Client 1 writes: Foo: Sam Smith. Now, say client 2 writes: Foo: Jen Jones.

Allowed end states: Consistent data and index entries:

Data - Foo: Sam Smith; fname - Sam: 444444; lname - Smith: 444444 ---OR--- vice versa

Not allowed end states: Data X but index entries corresponding to Y:

Data - Foo: Sam Smith; fname - Jen: 444444; lname - Jones: 444444 ---OR--- vice versa

Not ideal end states: Consistent data and index entries plus dangling pointers:

Data - Foo: Sam Smith; fname -  Sam: 444444, Jen: 444444; lname - Smith: 444444, Jones: 444444 ---OR--- vice versa


It seems like there is no interleaving of actions that leads to a not-allowed state.

A potentially bad interleaving produces the ideal result:

  1. Master writes index entries for SS
  2. Master writes index entries for JJ
  3. Master reads BB and writes data SS (this invalidates BB)
  4. Master reads SS and writes data JJ (this invalidates SS)
  5. Master removes index entries for BB
  6. Master removes index entries for SS
 
Complication: What if client 2 wants to write BB (same as the original data)?

...

Need to prevent delayed repeated messages.
Can return early to client, but not release the lock. Given this, can also batch removes with next write.
(TODO)
 

Failure of an Index Server

The client request being processed gets blocked till the index server is recovered; it is completed and a response returned to the client once the index server recovers.

In the meanwhile, two things can happen:

  1. The client might retry this request
    • The write (including index updates) is idempotent by design, so not a problem.
  2. Another client might send another request for the same object
    • This is the same problem as multi-threaded access to the same object.

If an update involves this index server, return RETRY to the client. This can leave dangling pointers around. Failure (and returning RETRY) at any point will not violate the required consistency properties.

No need to return RETRY if the failure happens at the removal-of-index-entries stage.

Implementation-wise: use an RPC wrapper that throws an error if the index server has failed.

For a reasonable first implementation: Just wait for index server to recover, no need to return RETRY.
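
A hedged sketch of this first-implementation behaviour (just wait for the index server to recover). IndexServerDown, sendInsertEntryRpc, and the fixed sleep are hypothetical stand-ins, not RAMCloud's actual RPC wrapper; a later version could give up after some deadline and return RETRY to the client instead.

#include <chrono>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <thread>

struct IndexServerDown : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical transport call; throws while the index server is crashed.
static int attempts = 0;
void sendInsertEntryRpc(const std::string& key, uint64_t keyHash)
{
    if (++attempts < 3)                       // simulate a crashed server
        throw IndexServerDown("index server unavailable");
    (void)key; (void)keyHash;                 // entry inserted
}

// Wrapper: block (retry) until the index server has been recovered.
void insertEntryBlocking(const std::string& key, uint64_t keyHash)
{
    for (;;) {
        try {
            sendInsertEntryRpc(key, keyHash);
            return;
        } catch (const IndexServerDown&) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }
}

int main()
{
    insertEntryBlocking("Sam", 444444);
    return 0;
}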

 

...

Dangling pointers problem: Options:
  1. Just leave them around. How many do we expect to accumulate anyway?
  2. Remove a dangling pointer if a lookup finds it but the corresponding object is not found. (Will have to be smart to not remove a pointer whose object is currently being added; can do so by giving ownership of this removal to the object master. See the sketch after this list.)
  3. Cleaning! (See next subsection.)
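
A small sketch of option 2 under the same in-memory assumptions: a lookup that finds an entry whose object is missing removes that entry lazily. In the real system the removal would be routed through the object's master, as noted above, so that it cannot race with an in-progress write of the same object.

#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>

static std::multimap<std::string, uint64_t> fnameIndex;
static std::unordered_map<uint64_t, std::string> objectStore;

std::optional<std::string> lookupAndClean(const std::string& key)
{
    for (auto it = fnameIndex.lower_bound(key);
         it != fnameIndex.end() && it->first == key; ) {
        auto obj = objectStore.find(it->second);
        if (obj != objectStore.end())
            return obj->second;          // live entry: return the object
        it = fnameIndex.erase(it);       // dangling entry: remove lazily
    }
    return std::nullopt;
}

int main()
{
    fnameIndex.emplace("Jen", 444444);   // dangling: no matching object
    lookupAndClean("Jen");
    return fnameIndex.count("Jen") == 0 ? 0 : 1;
}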

 

...

Failures / Recovery

Tablet (data) server failure

  • Normal recovery happens
    • Tablets from crashed master recovered on multiple recovery masters
    • Corresponding tablet maps updated on coordinator
  • Since our index record maps index key to object key hash, which doesn't change on recovery, this doesn't affect indexes at all.
    • Just retry till object found (i.e., till recovery completes)
 

Index server failure

After crash, till index is restored:

  • Normal reads can continue if the lookups are not on secondary indexes.
  • Lookups will block.
  • All writes (to table having this index) will block.


Index restoration options:

Backup/Recover:

  • Backup index on R disks (as with normal data) and recover on crash.
  • Can we use regular RAMCloud crash recovery with minor changes if index written to segments?
  • Recovery Mechanism:
    • If the index is stored using the regular RAMCloud key-value store, then RAMCloud recovery automatically recovers the index.
    • If not, steps:
      • Find backups
      • Partition failed index, assign m recovery masters
      • Read log from all backups for this index server
      • Ship data from backups to recovery masters
      • Write + recompute tree on recovery masters

Rebuild:

  • Keep index only in memory, then rebuild on crash.
  • Can read from masters that hold data for table being indexed.
  • Can recover on multiple index recovery masters (partition based on index key) – can later consolidate them back onto one index server. (Similar to regular recovery partitions.)
  • There was some pathology discussed for this case (before I joined the group).
  • Rebuild Mechanism:
    • Locate master servers for entire table (not only the partition that failed)
    • Partition failed index, assign m recovery masters
    • Data masters scan all data for table
    • Data masters multiplex ship data for this index partition to recovery masters
    • Recovery masters reconstruct + write index tree nodes.
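
A simplified single-process sketch of this rebuild pipeline, with vectors and maps standing in for data masters and recovery masters, a toy key-range split standing in for index partitioning, and a std::multimap standing in for the rebuilt index tree. All names and the partitioning rule are illustrative assumptions, not the actual mechanism.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Object {
    uint64_t keyHash;
    std::string fnameKey;     // the secondary key being indexed
};

int main()
{
    // Data spread across data masters for the whole table (not just the
    // partition whose index server failed).
    std::vector<std::vector<Object>> dataMasters = {
        {{444444, "Sam"}, {555555, "Jen"}},
        {{666666, "Bob"}},
    };

    // Partition the failed index's key space across m recovery masters.
    const int m = 2;
    auto partitionOf = [](const std::string& key) {
        return key.empty() ? 0 : (key[0] < 'M' ? 0 : 1);   // toy key-range split
    };

    // Each recovery master rebuilds its part of the index tree from the
    // (index key, key hash) pairs shipped by the data masters' scans.
    std::vector<std::multimap<std::string, uint64_t>> recoveryMasters(m);
    for (const auto& master : dataMasters)
        for (const auto& obj : master)
            recoveryMasters[partitionOf(obj.fnameKey)].emplace(obj.fnameKey,
                                                               obj.keyHash);

    return recoveryMasters[1].count("Sam") == 1 ? 0 : 1;
}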


Cost during data write:

Write analysis for backup/recover and rebuild. We can have m secondary indexes on a table. Consider what happens on a write (which will require insertions in all m indexes).

First, let's consider the case where m = 2, and assume that for option (a) the data is on one server and the index on another, while for option (b) data + index are on both servers.

Generalizing, for any m:

  • Best case latency to write: for backup/recover = 20 us, for rebuild = 15 us.
  • The actual work done for backup/recover is R*m = 3*m (with R = 3) more backup disk writes than for rebuild (R*(m+1) vs. R in the table below).

 

                               | Latency | # Mem writes | # Backup writes | # Msgs from data to index servers | # Msgs to backups
No indexing                    | 15 us   | 1            | R               | 0                                 | R
Indexing w/ backup/restore     | 35 us   | m+1          | R*(m+1)         | m                                 | R*(m+1)
Indexing w/ no-backup/rebuild  | 25 us   | m+1          | R               | m                                 | R

 

Indexing w/ backup/restore has a very high latency.

Idea 1: Put index entries in master log? (TODO)

Idea 2: Use transactions so that the time before returning to the client can be reduced (though the total time remains the same).


Cost during crash recovery:

Backup/recover: similar to regular crash recovery?

Rebuild:

This will depend on the configuration. Let's find the worst-case performance with a "reasonable" set of assumptions.

The parallelization for an index recovery master is done as follows:

  • Reduce the partition size to be recovered to 300 MB => total time 10s
  • Multithread across 10 cores => total time 1s.
    • This is problematic since our current tree implementation is not thread-safe; we'd then have to replace our tree with a thread-safe one. It's certainly a possibility, though that might increase tree insert times too.

The total cache miss time of 5.12 s is higher than we're willing to tolerate. The only way to reduce this would be to enforce a constraint that data for a particular table cannot occupy more than a fraction of a master (and then this time can be multiplied by that fraction).

Given these tradeoffs, let's go with the backup/recover option for now. Moreover, if we do decide to go with no-backup/rebuild eventually, not much implementation work will have to be thrown away, and it makes a good first step.

Note: For details on how we got the cache miss time in the above table, look at the experiments here.