...

  • Disk write bandwidth
  • Best approach to achieve good write bandwidth on disk: use at least three disks, one for write logging, one for archiving the most recent log data, and one for compaction. This scheme completely eliminates seeks, which should give us about 100 MB/s. Unfortunately we'll also need RAID, so the total is more than three disks just to achieve the write bandwidth of one disk.
  • Network bandwidth
  • Incast issues on reconstruction
    • Particularly at TOR switches, etc.
    • Otherwise the lower bound on machine reconstruction is about 1 min (see the back-of-the-envelope sketch below).
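
A back-of-the-envelope sketch of the numbers above, in Python. The 100 MB/s seek-free disk figure and the roughly one-minute reconstruction bound come from the notes; the 64 GB master size and the 10 Gb/s NIC speed are assumptions made only for illustration.

```python
# Rough bandwidth arithmetic for the reconstruction lower bound discussed above.

DISK_SEQ_BW = 100e6          # bytes/s, seek-free sequential disk bandwidth (from the notes)
NIC_BW = 10e9 / 8            # bytes/s, assumed 10 Gb/s NIC (assumption, not from the notes)
DRAM_PER_MASTER = 64e9       # bytes, assumed size of one master's data (assumption)

def reconstruction_time(data_bytes, source_disks):
    """Lower bound on rebuilding one master's data: limited either by the
    aggregate sequential bandwidth of the backup disks holding its log or
    by the NIC of the machine receiving the data."""
    disk_limited = data_bytes / (source_disks * DISK_SEQ_BW)
    nic_limited = data_bytes / NIC_BW
    return max(disk_limited, nic_limited)

if __name__ == "__main__":
    # With enough backup disks the NIC dominates: 64 GB / 1.25 GB/s ~ 51 s,
    # which lines up with the "about 1 min" lower bound noted above.
    print(f"{reconstruction_time(DRAM_PER_MASTER, 100):.0f} s")
```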

Looming Questions

  • Read/Write mix?
  • Merge recent writes
  • Making data available during recovery
  • If backups are going to have to serve shards while recovering a master, why recover at all?

Alternative

...

- No Recovery

  • Soft state masters, backups
    • Once a master or backup fails, a new one is chosen and the state is sent there using the remaining backups/masters.
    • All soft state; fixes many consistency issues on recovery.
  • Hold the most recent writes in a partial version of the shard in RAM (see the buffer sketch after this list)
    • LRU
  • Don't evict values from RAM on write-back (this allows fast recovery if backups need to compare recent values for consistency)
  • We get write coalescing for cheap
  • Can maintain on-disk structure in a compact and quickly serviceable layout
    • ?? Is this going to be impossible to maintain with decent performance, though?
    • If so, we can still do logging, but how quickly can we service requests then?
  • On master failure
    • For each shard served, one of the backups for that shard will be elected master
    • The election will be aware of the load on each server, so mastership will tend toward the less-loaded servers
    • Compare backups after a master is elected to check consistency (see the reconciliation sketch after this list)
      • The master could've failed while writes were in flight
      • Version numbers make this easy
      • Only the most recently written value can be inconsistent, so we only need to compare k values for k backup shards
      • This is much cheaper than an agreement protocol for the backups
    • After getting no response on a write to a master, the client retries the write against one of the backups, which redirects it to the new master. If the write was already seen, the retried write fails harmlessly; if it was not seen, it succeeds.
    • To repair locality we'll punt to a higher-level service, which will eventually migrate the data
  • Backups are not recovered either
  • When a backup becomes non-responsive, a new backup is chosen based on load and location (must be outside of the rack, etc.) and bootstrapped
    • Bootstrap from the primary or from the other backup?
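
A minimal sketch of the in-RAM recent-write buffer described above: LRU ordering, write coalescing, and no eviction on write-back so recent values stay available for consistency checks. All names are invented for illustration, and `log` stands in for whatever on-disk log interface we end up with.

```python
from collections import OrderedDict

class RecentWriteBuffer:
    """Partial in-RAM version of a shard holding the most recent writes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> (value, version, dirty), LRU order

    def write(self, key, value, version):
        # Coalescing: a newer write to the same key simply replaces the buffered
        # one, so only the latest value ever needs to reach the disk log.
        self.entries[key] = (value, version, True)
        self.entries.move_to_end(key)
        self._evict_clean_if_needed()

    def write_back(self, log):
        # Push dirty entries to the on-disk log but keep them cached (now clean),
        # so backups can still compare recent values during reconciliation.
        for key, (value, version, dirty) in self.entries.items():
            if dirty:
                log.append(key, value, version)
                self.entries[key] = (value, version, False)

    def _evict_clean_if_needed(self):
        # Evict least-recently-used *clean* entries once over capacity;
        # dirty entries are never dropped, they must be written back first.
        for key in list(self.entries):
            if len(self.entries) <= self.capacity:
                break
            value, version, dirty = self.entries[key]
            if not dirty:
                del self.entries[key]
```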
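A sketch of the post-election consistency check and the version-based retry idempotence described above. `Entry` and `BackupShard` are hypothetical stand-ins, not a real API; the point is that only the highest-versioned last entry needs to be compared across the k backups.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    key: str
    value: bytes
    version: int

class BackupShard:
    """Toy stand-in for one backup's copy of a shard (illustrative only)."""

    def __init__(self):
        self.entries = {}              # key -> Entry
        self.last = Entry("", b"", 0)  # most recently written entry

    def apply(self, entry):
        # Writes carry a version; re-applying an already-seen version is a
        # harmless no-op, which is what makes client retries safe.
        if entry.version <= self.last.version:
            return False
        self.entries[entry.key] = entry
        self.last = entry
        return True

def reconcile_after_election(backups):
    """Compare only the most recently written entry on each of the k backups.
    Only a write in flight at the moment the master died can be missing, so
    the highest version wins and is re-applied to any backup that missed it.
    Much cheaper than an agreement protocol over the whole shard."""
    latest = max((b.last for b in backups), key=lambda e: e.version)
    for b in backups:
        b.apply(latest)                # no-op where the write was already seen
    return latest
```

The same version check in apply() is what lets a client retry a timed-out write against a backup: if the new master already saw the write, the retry is silently ignored.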

Questions

  • How much serving capacity needs to be reserved on each host versus how much data that host can act as a backup for?
    • We'll need target ratios based on machine failure rates in order to trigger migration of shards to less-loaded servers
  • Shard sizes?
    • Static?
    • Dynamic?
  • Index backup?