...

  • Idle standbys versus homogeneous load
  • Parameters - replicas (z) and shards; how do we choose them? On what granularity are they?
  • Major failures: How long to bring up if we lose DC power?
  • Reconstituting failed backup disks? Don't?
    • Adds to the reason for more, smaller disks in a single machine
    1. More heads = more write bandwidth per node
    2. If we partition the shards to only log to a specific disk then we only have to replicate 1/4 as many shards when a single disk fails
    • The downside, of course, is that disk failures will occur more often, and cost goes up
  • All backups must be able to initiate reconstruction. Why?
  • Flash may be worth another look, particularly if we can play games in the FTL.

Old Notes

Issues

  • Disk write bandwidth
  • Best approach to achieve good write bandwidth on disk: have at least 3 disks, one for write logging, one for archiving the most recent log data, and one for compaction. This scheme completely eliminates seeks, which should give us about 100 MB/s.
    • Unfortunately, we'll need RAID as well, so the total is more than 3 disks just to achieve the write bandwidth of 1 disk.
  • Network bandwidth
  • Incast issues on reconstruction
    • Particularly at TOR switches, etc.
    • Otherwise the lower bound on machine reconstruction time is about 1 min (rough arithmetic below)
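
A rough sanity check on that lower bound (a sketch only; the 64 GB of DRAM per master and the 10 Gb/s NIC are assumed figures, not numbers from these notes):

    # Back-of-envelope: lower bound on re-replicating one failed machine,
    # limited only by the network interface sourcing (or sinking) the data.
    dram_per_master_gb = 64        # assumed DRAM per master
    nic_gbps = 10                  # assumed NIC line rate

    bytes_to_move = dram_per_master_gb * 2**30
    nic_bytes_per_s = nic_gbps * 1e9 / 8
    seconds = bytes_to_move / nic_bytes_per_s
    print(f"~{seconds:.0f} s to move {dram_per_master_gb} GB at {nic_gbps} Gb/s")
    # ~55 s, i.e. roughly the 1 min quoted above; incast at the TOR switch
    # can only push this number up.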

Looming Questions

  • Read/Write mix?
  • Merge recent writes
  • Making data available during recovery
  • If backups are going to have to serve shards while recovering a master, why recover at all?

Alternative - No Recovery

  • Soft state masters, backups
    • Once a master or backup fails, a new one is chosen and the state is sent there using the remaining backups/masters.
    • All soft state, fixes many consistency issues on recovery
  • Hold most recent writes in a partial version of the shard in RAM
    • LRU
  • Don't evict values from RAM on write-back (this allows fast recovery if backups need to compare recent values for consistency)
  • We get write coalescing for cheap
  • Can maintain on-disk structure in a compact and quickly serviceable layout
    • ?? Is this going to be impossible to maintain with decent performance, though?
    • If so, we can still do logging, but how quickly can we service requests then?
  • On master failure
    • For each shard served, one of the backups for that shard will be elected master
    • The election will be aware of the load on each server, so the master role will tend to go to less loaded servers
    • Compare backups after a master is elected to check consistency (a minimal sketch follows this list)
      • The master could've failed while writes were in flight
      • Version numbers make this easy
      • Only the most recently written value can be inconsistent, so we only need to compare k values for k backup shards
      • This is much cheaper than an agreement protocol for the backups
    • After getting no response on a write to a master, the client will need to retry the write at one of the backups, which will redirect the client to the new master. If the write was already seen, the retried write will fail harmlessly; if it was not seen, it will succeed.
    • To repair locality we'll punt to a higher level service which will eventually migrate the data
  • Backups are not recovered either
  • When a backup becomes non-responsive, a new backup is chosen based on load and location (must be outside of the rack, etc.) and bootstrapped
    • Bootstrap from the primary or from the other backup?
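
A minimal sketch of the post-election consistency check described above, assuming each backup can report the version of its most recently applied write for the shard; the function and tuple layout are illustrative only, not an existing interface:

    # After a backup is elected master for a shard, gather the most recent
    # (version, key, value) each of the k backups holds.  Because writes are
    # versioned, only the newest value can differ: the old master may have
    # died with exactly one write still in flight.
    def reconcile_latest_write(backup_reports):
        """backup_reports: list of (version, key, value), one entry per backup."""
        version, key, value = max(backup_reports, key=lambda r: r[0])
        stale = [r for r in backup_reports if r[0] < version]
        # Stale backups are patched by re-sending the newest version; no
        # general agreement protocol among the backups is needed.
        return key, version, value, stale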

Questions

...

  • We'll need some target ratios based on the failure rate of machines in order to trigger migration of shards to less loaded servers

...

  • Static?
  • Dynamic?

...

  • Shard and store along with the values
  • How do we shard it, though, since it is ordered on a different key?

...

10/7 Discussion (Logging/Cleaning on Backups)

Knobs/Tradeoffs

  • Time to on-disk durability (TTD)
    • Also determines how much "NVRAM" we need
  • Memory needed to "stage" or buffer writes
  • Write bandwidth and write efficiency (WE = (new data written)/(total bytes written to disk), i.e. the fraction of write bandwidth that carries new data)
  • Time to Recover a shard/machine (TTR)
    • Shards are intended to speed recovery time

Cluster writes into chunks ("shardlets") to be batch-written to segments on disk

  • From previous discussion, achieves about 90% WE using 6 MB chunks per shard
  • 3z GB RAM for 512 shards
  • 25z s TTD in case of power failure (rough arithmetic below)
  • 90% RE providing TTR of 5z s per shard
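
Back-of-envelope behind the 3z GB and 25z s figures (a sketch; the ~120 MB/s sequential disk rate is an assumption, in line with the ~100 MB/s noted in the old issues list):

    # Staging memory and time-to-durability for the 512-shard / 6 MB-shardlet case.
    shards = 512
    shardlet_mb = 6                 # staging chunk per shard
    disk_mb_per_s = 120             # assumed sequential disk write rate

    staged_gb = shards * shardlet_mb / 1024
    print(f"staged data per copy: {staged_gb:.0f} GB -> 3z GB of RAM across z replicas")

    flush_s = shards * shardlet_mb / disk_mb_per_s
    print(f"flush on power failure: ~{flush_s:.0f} s per copy, roughly the "
          "quoted 25z s if a node stages shardlets for z masters")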

Question remains: Log compaction or checkpointing to keep replay time from growing indefinitely

No-read LFS-style cleaning

  • Masters write to 6 MB shardlets in log style
  • Writes are dispatched to all backups just as before
    • For each shard, the backup maintains a shardlet image in RAM that is kept in lock-step with the master's
    • Once the server is done placing writes on a shardlet, the backup commits it to disk
  • The master recovers space from deleted or updated old values by performing compaction
    • This is simply an automatic rewrite of the live data from two or more old shardlets into a new one
    • This allows the old shardlets to be reclaimed
      • To allow the backups to reclaim the space in their on-disk logs, the master sends a short trailer after finishing writes to a shardlet, describing which previous shardlets are now considered "free" (see the sketch after this list)
  • The backup maintains a catalog of where active data for each shard is (its shardlets) on disk and periodically writes this to the log
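
A minimal sketch of the master-side compaction step and the "free" trailer (class and field names are illustrative only, not an existing interface):

    # No-read cleaning: live values from two or more mostly-dead shardlets are
    # rewritten into a fresh shardlet; a short trailer appended after those
    # writes tells the backups which old shardlets their on-disk logs may drop.
    class Shardlet:
        def __init__(self, sid):
            self.sid = sid
            self.live = {}          # key -> value still referenced by the shard

    def compact(old_shardlets, next_sid):
        new = Shardlet(next_sid)
        for s in old_shardlets:
            new.live.update(s.live)             # rewrite only the live data
        trailer = {"freed": [s.sid for s in old_shardlets]}
        return new, trailer                     # trailer is shipped to the backups

    # usage: merge two half-dead shardlets into shardlet 7
    # new, trailer = compact([s1, s2], next_sid=7)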

Back of the envelope

Master cleans shardlets whose utilization falls below some threshold (one possible model is sketched at the end of this section)

Shardlet Util. | Disk/Master Memory Util. (Median Shardlet Util.) | TTR Off Optimal | Remaining Disk BW Off Optimal | Combined WE with 90% approach from earlier
---------------|--------------------------------------------------|-----------------|-------------------------------|-------------------------------------------
50%            | 75%                                              | 1.33x slower    | ~50%                          | 45%
10%            | 55%                                              | 1.82x slower    | ~88%                          | 80%

  • Need to take a serious look at the distribution, however. Will need to force bimodality of the shardlet utilization distribution, which will bank on some locality (temporal locality will be just fine)
  • We tie our memory management of the server to the disk layout
  • Specifically, we can't afford to have the memory be underutilized, but we won't benefit much if shardlets are dense
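
One plausible model behind the table above (a sketch; the assumption that live shardlets sit uniformly between the cleaning threshold and 100% full is mine, and it only approximately reproduces the ~88% entry):

    # u = utilization threshold below which the master cleans a shardlet.
    # Cleaning a shardlet at utilization u rewrites u bytes to reclaim (1-u),
    # so each byte of new data costs u/(1-u) bytes of cleaning traffic.
    def back_of_envelope(u, base_we=0.90):
        median_util = (u + 1.0) / 2          # disk / master-memory utilization
        ttr_factor = 1.0 / median_util       # x slower than perfectly packed
        remaining_bw = 1.0 - u               # disk BW fraction left for new data
        combined_we = base_we * remaining_bw # on top of the 90% WE scheme above
        return median_util, ttr_factor, remaining_bw, combined_we

    for u in (0.50, 0.10):
        m, t, r, c = back_of_envelope(u)
        print(f"u={u:.0%}: util {m:.0%}, TTR {t:.2f}x slower, "
              f"remaining BW ~{r:.0%}, combined WE {c:.0%}")
    # u=50%: util 75%, TTR 1.33x slower, remaining BW ~50%, combined WE 45%
    # u=10%: util 55%, TTR 1.82x slower, remaining BW ~90%, combined WE 81%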

10/8 Discussion ("Paging" to yield higher utilization on masters)