...

  • Disk write bandwidth
  • Best approach to achieve good write bandwidth on disk: use at least three disks, one for write logging, one for archiving the most recent log data, and one for compaction. This scheme completely eliminates seeks, which should give us about 100 MB/s. Unfortunately we'll also need RAID, so the total is more than three disks just to achieve the write bandwidth of one disk.
  • Network bandwidth
  • Incast issues on reconstruction
    • Particularly at TOR switches, etc.
    • Otherwise the lower bound on machine reconstruction is about 1 min (see the back-of-the-envelope sketch below).
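
A back-of-the-envelope sketch of the numbers above, in Python. The 100 MB/s seek-free disk figure and the roughly one-minute reconstruction bound come from the notes; the 64 GB master size and the 10 Gb/s NIC speed are assumptions made only for illustration.

```python
# Rough bandwidth arithmetic for the reconstruction lower bound discussed above.

DISK_SEQ_BW = 100e6          # bytes/s, seek-free sequential disk bandwidth (from the notes)
NIC_BW = 10e9 / 8            # bytes/s, assumed 10 Gb/s NIC (assumption, not from the notes)
DRAM_PER_MASTER = 64e9       # bytes, assumed size of one master's data (assumption)

def reconstruction_time(data_bytes, source_disks):
    """Lower bound on rebuilding one master's data: limited either by the
    aggregate sequential bandwidth of the backup disks holding its log or
    by the NIC of the machine receiving the data."""
    disk_limited = data_bytes / (source_disks * DISK_SEQ_BW)
    nic_limited = data_bytes / NIC_BW
    return max(disk_limited, nic_limited)

if __name__ == "__main__":
    # With enough backup disks the NIC dominates: 64 GB / 1.25 GB/s ~ 51 s,
    # which lines up with the "about 1 min" lower bound noted above.
    print(f"{reconstruction_time(DRAM_PER_MASTER, 100):.0f} s")
```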

Looming Questions

  • Read/Write mix?
  • Merge recent writes
  • Making data available during recovery
  • If backups are going to have to serve shards while recovering a master, why recover at all?

Alternative

...

- No Recovery

  • Soft state masters, backups
    • Once a master or backup fails, a new one is chosen and the state is sent there using the remaining backups/masters.
    • All soft state; fixes many consistency issues on recovery.
  • Hold the most recent writes in a partial version of the shard in RAM (see the buffer sketch after this list)
    • LRU
  • Don't evict values from RAM on write-back (this allows fast recovery if backups need to compare recent values for consistency)
  • We get write coalescing for cheap
  • Can maintain on-disk structure in a compact and quickly serviceable layout
    • ?? Is this going to be impossible to maintain with decent performance, though?
    • If so, we can still do logging, but how quickly can we service requests then?
  • On master failure
    • For each shard served, one of the backups for that shard will be elected master
    • The election will be aware of the load on each server, so mastership will tend toward the less-loaded servers
    • Compare backups after a master is elected to check consistency (see the reconciliation sketch after this list)
      • The master could've failed while writes were in flight
      • Version numbers make this easy
      • Only the most recently written value can be inconsistent, so we only need to compare k values for k backup shards
      • This is much cheaper than an agreement protocol for the backups
    • After getting no response on a write to a master, the client retries the write against one of the backups, which redirects it to the new master. If the write was already seen, the retried write fails harmlessly; if it was not seen, it succeeds.
    • To repair locality we'll punt to a higher-level service, which will eventually migrate the data
  • Backups are not recovered either
  • When a backup becomes non-responsive, a new backup is chosen based on load and location (must be outside of the rack, etc.) and bootstrapped
    • Bootstrap from the primary or from the other backup?
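
A minimal sketch of the in-RAM recent-write buffer described above: LRU ordering, write coalescing, and no eviction on write-back so recent values stay available for consistency checks. All names are invented for illustration, and `log` stands in for whatever on-disk log interface we end up with.

```python
from collections import OrderedDict

class RecentWriteBuffer:
    """Partial in-RAM version of a shard holding the most recent writes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> (value, version, dirty), LRU order

    def write(self, key, value, version):
        # Coalescing: a newer write to the same key simply replaces the buffered
        # one, so only the latest value ever needs to reach the disk log.
        self.entries[key] = (value, version, True)
        self.entries.move_to_end(key)
        self._evict_clean_if_needed()

    def write_back(self, log):
        # Push dirty entries to the on-disk log but keep them cached (now clean),
        # so backups can still compare recent values during reconciliation.
        for key, (value, version, dirty) in self.entries.items():
            if dirty:
                log.append(key, value, version)
                self.entries[key] = (value, version, False)

    def _evict_clean_if_needed(self):
        # Evict least-recently-used *clean* entries once over capacity;
        # dirty entries are never dropped, they must be written back first.
        for key in list(self.entries):
            if len(self.entries) <= self.capacity:
                break
            value, version, dirty = self.entries[key]
            if not dirty:
                del self.entries[key]
```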
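A sketch of the post-election consistency check and the version-based retry idempotence described above. `Entry` and `BackupShard` are hypothetical stand-ins, not a real API; the point is that only the highest-versioned last entry needs to be compared across the k backups.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    key: str
    value: bytes
    version: int

class BackupShard:
    """Toy stand-in for one backup's copy of a shard (illustrative only)."""

    def __init__(self):
        self.entries = {}              # key -> Entry
        self.last = Entry("", b"", 0)  # most recently written entry

    def apply(self, entry):
        # Writes carry a version; re-applying an already-seen version is a
        # harmless no-op, which is what makes client retries safe.
        if entry.version <= self.last.version:
            return False
        self.entries[entry.key] = entry
        self.last = entry
        return True

def reconcile_after_election(backups):
    """Compare only the most recently written entry on each of the k backups.
    Only a write in flight at the moment the master died can be missing, so
    the highest version wins and is re-applied to any backup that missed it.
    Much cheaper than an agreement protocol over the whole shard."""
    latest = max((b.last for b in backups), key=lambda e: e.version)
    for b in backups:
        b.apply(latest)                # no-op where the write was already seen
    return latest
```

The same version check in apply() is what lets a client retry a timed-out write against a backup: if the new master already saw the write, the retry is silently ignored.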

Questions

  • How much serving capacity needs to be reserved on each host versus how much data that host can act as a backup for?
    • We'll need target ratios based on machine failure rates in order to trigger migration of shards to less-loaded servers
  • Shard sizes?
    • Static?
    • Dynamic?
  • Index backup?