Notes from 9/30 Discussion

...

  • New assumption: 1-2 seconds of non-availability on failure. How long does full reconstruction take?
  • Issue: Estimate the reconstitution of a failed node (see the sketch after this list):
    1. No Sharding
      • 64 GB per server
      • 128 MB/s disk bandwidth
        • 1/10th of 10 GigE
      • 512 s to read snapshot (64 GB) off a single disk
    2. Sharding (Want < 2s recovery)
      • 512z shards needed for 1 s restore from disk to RAM
        • z := # of replicas per server
      • Assume 5 ms of write latency (seek + rotation) per disk write
      • Want 90% disk write bandwidth efficiency, so each buffered write must carry 45 ms worth of data (5 ms / 50 ms = 10% overhead)
      • 45 ms worth of writes is ~6 MB, or ~6000 1 KB records
      • Must keep 512z * 6 MB ≈ 3z GB of buffer to maintain 6 MB of log per shard
      • How do we clean the log?
        • If we just keep appending, recovery may read lots of irrelevant data
      • 25z s to persist the outstanding log at any given time (3z GB at ~128 MB/s)
        • Important in case of failure: we need this much time to make everything durable
      • With z = 2, ~10% memory overhead (6 GB of buffer vs. 64 GB of DRAM)
      • What if memory goes up? e.g. 128 GB
        • Double shards, write bandwidth is static
        • Still 10% overhead
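
A quick back-of-envelope check of the numbers above, written as a Python sketch (not part of the original notes). The constants are the assumptions stated in the list; the function names and the reading that each of the 512 shards per server is buffered z times across its backups are mine.

```
GB, MB = 1024**3, 1024**2

dram_per_server = 64 * GB     # DRAM to reconstruct per server
disk_bw         = 128 * MB    # per-disk bandwidth, ~1/10th of 10 GigE
restore_budget  = 1.0         # seconds allowed for the disk -> RAM restore

# 1. No sharding: a single disk streams the whole 64 GB snapshot.
print(dram_per_server / disk_bw)                    # -> 512 s

# 2. Sharding: each shard must fit through one disk within the budget,
#    so a server's DRAM is split into 64 GB / 128 MB = 512 shards.
shards_per_server = dram_per_server / (disk_bw * restore_budget)

def sharded_figures(z):
    """z = number of replicas per server, as defined in the notes."""
    shard_replicas = shards_per_server * z           # 512z buffered shards
    buffer_bytes   = shard_replicas * 6 * MB         # 6 MB of log per shard
    persist_s      = buffer_bytes / disk_bw          # time to flush it all
    overhead       = buffer_bytes / dram_per_server  # buffer vs. DRAM
    return shard_replicas, buffer_bytes / GB, persist_s, overhead

print(sharded_figures(2))   # -> 1024 shards, 6 GB buffer, ~48 s (≈ 25z), ~10%
```

Doubling DRAM to 128 GB doubles shards_per_server while disk_bw stays fixed, so the buffer remains roughly 10% of memory, matching the "Still 10% overhead" bullet above.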

Write Bandwidth Limitations

  • ~128 MB/s => 100 MB/s @ 90% efficiency
    • Assuming no cleaning overhead
    • ~100k recs/sec (1 KB recs)
    • ~100k/z writes/sec/disk
    • z = 2 => max 50k writes/sec per logging disk, i.e. ~5% of 1M ops/sec/server
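
The same limit as a sketch, taking the notes' rounded ~100 MB/s of usable log bandwidth and 1 KB records as given; the names are illustrative.

```
MB, KB = 1024**2, 1024

usable_bw   = 100 * MB      # ~128 MB/s raw, rounded down to ~100 MB/s usable
record_size = 1 * KB        # 1 KB records, no cleaning overhead assumed

recs_per_sec = usable_bw / record_size      # ~100k log records/s per disk

def max_client_writes(z):
    # Every client write is logged on z backup disks, so one disk's log
    # bandwidth supports only 1/z that many client-visible writes.
    return recs_per_sec / z

print(max_client_writes(2))   # -> ~50k writes/s per logging disk
```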

Questions

  • Idle standbys versus homogeneous load
  • Parameters: how do we choose the number of replicas (z) and the number of shards, and at what granularity are they set?
  • Major failures: how long to bring the system back up if we lose DC power?
  • Do we reconstitute failed backup disks, or not bother?
  • All backups must be able to initiate reconstruction. Why?
  • Flash may be worth another look, particularly if we can play games in the FTL.

Old Notes

Issues

  • Disk write bandwidth
    • Best approach to achieve good write bandwidth on disk: use at least 3 disks, one for write logging, one for archiving the most recent log data, and one for compaction. This scheme completely eliminates seeks, which should give us about 100 MB/s. Unfortunately, we'll also need RAID, so the total is more than 3 disks just to achieve the write bandwidth of 1 disk.
  • Network bandwidth
    • Incast issues on reconstruction
      • Particularly at TOR switches, etc.
      • Otherwise the lower bound on machine reconstruction is about 1 min (see the sketch below).
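
The ~1 min figure presumably comes from streaming one server's 64 GB image over a single 10 GigE link; a one-line check of that reading (the 1.25 GB/s line rate, and reusing the 64 GB per-server figure from the newer notes, are assumptions here).

```
GB = 1024**3

dram_per_server = 64 * GB        # image to move to the recovering host
nic_bw          = 1.25 * GB      # ~10 GigE line rate, ignoring protocol overhead

# If the recovering machine pulls everything over its own single link,
# the transfer alone takes about a minute, regardless of disk speed.
print(dram_per_server / nic_bw)  # -> ~51 s
```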

...

  • Backups are not recovered either
  • When a backup becomes non-responsive, a new backup is chosen based on load and location (it must be outside the rack, etc.) and bootstrapped (see the sketch after this list)
    • Do we bootstrap it from the primary or from another backup?
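
A hypothetical sketch of the replacement policy described above: exclude hosts in the racks already involved, then take the least-loaded survivor. The Host fields, the load metric, and the function name are illustrative, not an agreed design.

```
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    rack: str
    load: float     # e.g. fraction of backup capacity already in use

def choose_replacement_backup(candidates, primary, surviving_backups):
    """Pick a new backup based on load and location, per the note above:
    it must live outside the primary's rack and outside the racks of the
    replicas that are still alive."""
    excluded_racks = {primary.rack} | {b.rack for b in surviving_backups}
    eligible = [h for h in candidates
                if h.rack not in excluded_racks and h is not primary]
    if not eligible:
        raise RuntimeError("no eligible replacement backup found")
    return min(eligible, key=lambda h: h.load)
```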

Questions

  • How much serving capacity needs to be reserved on each host versus how much data it can act as a backup for?
    • We'll need some target ratios based on the failure rate of machines in order to trigger migration of shards to less loaded servers
  • Shard sizes?
    • Static?
    • Dynamic?
  • Index backup?
    • Shard and store along with the values
    • How do we shard it, though, since the index is ordered on a different key?
  • We still have no real idea what the workload looks like, which makes it hard to tell whether any of these approaches is reasonable. In particular, we have no idea how much disk bandwidth we'll really need to make the system work.