Recap

Assumptions

  • Must be just as durable as disk storage
  • Durable when write request returns
  • Almost always in a state of crash recovery
  • Recovery instantaneous and transparent
  • Records of a few thousand bytes, perhaps partial updates

Failures

  • Bit errors
  • Single machine
  • Group of machines
  • Power failure
  • Loss of datacenter
  • Network partitions
  • Byzantine failures
  • Thermal emergency

Possible Solution

  • Approach #4: temporarily commit to another server's RAM for speed, eventually to disk
    • some form of logging in RAM and batching writes to disk + checkpointing
    • need to "shard" each server's data so that many servers serve as backups for a single master, speeding recovery time (rough write-path sketch below)
      • backup shards will likely need to temporarily become masters for their data while the failed master is rebuilt
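
A minimal sketch of the write path this approach seems to imply, under assumed numbers (batch size, file paths) and with made-up names (Master, Backup, flush); this is not a decided interface, just one way the idea could look:

```python
# Sketch of approach #4: a write counts as durable once it sits in the RAM of
# its backups; each backup batches its in-RAM log to disk asynchronously.
BATCH_BYTES = 8 << 20   # assumed batch size before a backup flushes to disk

class Backup:
    def __init__(self, path):
        self.path = path
        self.ram_log = []      # log entries buffered in RAM
        self.buffered = 0      # bytes currently buffered

    def append(self, key, value):
        """Accept a log entry into RAM; the write is 'durable' from here on."""
        self.ram_log.append((key, value))
        self.buffered += len(value)
        if self.buffered >= BATCH_BYTES:
            self.flush()

    def flush(self):
        """Batch the buffered log to disk with one sequential append."""
        with open(self.path, "ab") as f:
            for key, value in self.ram_log:
                f.write(key.encode() + b"\t" + value + b"\n")
        self.ram_log.clear()
        self.buffered = 0

class Master:
    def __init__(self, backups_by_shard):
        # Each shard is replicated to a different set of backups, so many
        # servers hold pieces of this master and can help rebuild it in parallel.
        self.table = {}
        self.backups_by_shard = backups_by_shard

    def write(self, shard, key, value):
        self.table[key] = value                  # update the in-RAM copy
        for b in self.backups_by_shard[shard]:   # replicate into backup RAM
            b.append(key, value)
        # Returning here reflects "durable when write request returns",
        # under the assumption that replicated RAM is durable enough.
        return True

# Example: shard 0 of this master is backed by three other servers.
backups = {0: [Backup("/tmp/b0.log"), Backup("/tmp/b1.log"), Backup("/tmp/b2.log")]}
m = Master(backups)
m.write(0, "key42", b"a few thousand bytes of record data")
```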

9/30 Discussion Notes

Data Durability

  • New assumption: 1-2 seconds of non-availability on a failure. How long does full reconstruction take?
  • Issue: estimate the time to reconstitute a failed node (worked out below):
    1. No Sharding
      • 64 GB per server
      • 128 MB/s disk bandwidth
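
A back-of-envelope check using the numbers above; the sharded cases at the end assume recovery stays disk-bound and parallelizes evenly across N backups, which is an assumption rather than a measurement:

```python
# Back-of-envelope: time to reconstitute a failed node from a single disk.
SERVER_DATA = 64 * 2**30      # 64 GB of data per server
DISK_BW     = 128 * 2**20     # 128 MB/s of disk bandwidth

no_sharding = SERVER_DATA / DISK_BW
print(f"no sharding: {no_sharding:.0f} s (~{no_sharding / 60:.1f} min)")
# -> 512 s, i.e. roughly 8.5 minutes to rebuild one node

# If the same 64 GB were sharded across N backups that all stream in parallel
# (assumption: recovery remains disk-bound and parallelizes perfectly):
for n in (8, 64, 512):
    print(f"{n:3d}-way sharding: {no_sharding / n:6.1f} s")
```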

Old Notes

Issues

  • Disk write bandwidth
  • Best approach to achieve good write bandwidth on disk: use at least 3 disks, one for write logging, one for archiving the most recent log data, and one for compaction. This scheme eliminates seeks entirely, which should give us about 100 MB/s. Unfortunately, we'll also need RAID, so the total is more than 3 disks just to achieve the write bandwidth of a single disk.
  • Network bandwidth
  • Incast issues on reconstruction
    • Particularly at TOR switches, etc.
    • Otherwise, the lower bound on machine reconstruction is about 1 minute (rough check below).
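
A rough sanity check on that figure. The notes don't record the assumed network bandwidth; the 10 Gb/s into the recovering machine used below is purely illustrative:

```python
# Rough check on the ~1 minute reconstruction lower bound.
SERVER_DATA = 64 * 2**30   # 64 GB to re-fetch over the network
NET_BW      = 10e9 / 8     # assumed 10 Gb/s into the new machine, in bytes/s

print(f"~{SERVER_DATA / NET_BW:.0f} s to stream 64 GB")   # about 55 s
# Incast at the TOR switches could push the real number well above this.
```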

Looming Questions

  • Read/Write mix?
  • Merge recent writes
  • Making data available during recovery
  • If backups are going to have to serve shards while recovering a master, why recover at all?

Alternative - No Recovery

  • Soft state masters, backups
    • Once a master or backup fails, a new one is chosen and the state is sent there from the remaining backups/masters.
    • All soft state, fixes many consistency issues on recovery
  • Hold most recent writes in a partial version of the shard in RAM
    • LRU
  • Don't evict values from RAM on write-back (this allows fast recovery if backups need to compare recent values for consistency)
  • We get write coalescing for cheap
  • Can maintain on-disk structure in a compact and quickly serviceable layout
    • ?? Is this going to be impossible to maintain with decent performance, though?
    • If so, we can still do logging, but how quickly can we service requests then?
  • On master failure
    • For each shard the failed master served, one of that shard's backups will be elected master
    • The election will be aware of the load on each server, so mastership will tend to go to less-loaded servers
    • Compare backups after a master is elected to check consistency
      • The master could've failed while writes were in flight
      • Version numbers make this easy
      • Only the most recently written value can be inconsistent, so we only need to compare k values for k backup shards
      • This is much cheaper than an agreement protocol for the backups
    • After getting no response on a write to a master, the client retries the write against one of the backups, which redirects it to the new master. If the write was already seen, the retried write fails harmlessly; if it was not seen, it succeeds (see the sketch after this list).
    • To repair locality we'll punt to a higher level service which will eventually migrate the data
  • Backups are not recovered either
  • When a backup is non-responsive, a new backup is chosen based on load and location (must be outside the rack, etc.) and bootstrapped
    • Bootstrap from the primary or from the other backup?
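
A sketch of the failover path described above, assuming versioned values and k backups per shard; every name here (Versioned, reconcile_latest, the cluster methods in client_write) is illustrative rather than an agreed design:

```python
# Illustrative sketch of the no-recovery failover path. Assumes every stored
# value carries a version number; only the most recently written key can be
# inconsistent, so the new master compares just that key across the k backups.

class Versioned:
    def __init__(self, version, data):
        self.version = version
        self.data = data

def reconcile_latest(new_master_table, backup_tables, last_written_key):
    """After election, settle the single key that may have been in flight."""
    copies = [new_master_table.get(last_written_key)]
    copies += [b.get(last_written_key) for b in backup_tables]
    copies = [c for c in copies if c is not None]
    if not copies:
        return None                                # the in-flight write landed nowhere
    winner = max(copies, key=lambda v: v.version)  # k comparisons, no agreement protocol
    new_master_table[last_written_key] = winner
    return winner

def client_write(cluster, shard, key, value):
    """Client-side retry when the master goes silent (cluster API is hypothetical)."""
    try:
        return cluster.master_for(shard).write(key, value)
    except TimeoutError:
        backup = cluster.any_backup_for(shard)
        new_master = backup.redirect_to_new_master()
        # If the original write was already applied, the retry fails harmlessly
        # as a duplicate version; otherwise it simply succeeds.
        return new_master.write(key, value)
```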

Questions

  • How much serving capacity will need to be reserved versus the amount that a host can act as backup for?
    • We'll need some target ratios based on the failure rate of machines in order to trigger migration of shards to less loaded servers
  • Shard sizes?
    • Static?
    • Dynamic?
  • Index backup?
    • Shard and store along with the values
    • How do we shard it, though, since the index is ordered on a different key?
  • We still have no real idea what the workload looks like, which makes it hard to tell whether any of these approaches is reasonable. In particular, we don't know how much disk bandwidth we'll actually need to make the system work.