Recap

Assumptions

  • Must be just as durable as disk storage
  • Durable when write request returns
  • Almost always in a state of crash recovery
  • Recovery instantaneous and transparent
  • Records of a few thousand bytes, perhaps partial updates

Failures

  • Bit errors
  • Single machine
  • Group of machines
  • Power failure
  • Loss of datacenter
  • Network partitions
  • Byzantine failures
  • Thermal emergency

Possible Solution

  • Approach #4: temporarily commit to another server's RAM for speed, eventually to disk
    • some form of logging in RAM and batching writes to disk + checkpointing
    • need to "shard" each server's data so that many servers serve as backups for a single master, speeding recovery time (rough write-path sketch below)
      • backup shards will likely need to temporarily become masters for their data while the failed master is rebuilt
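
A minimal sketch of the write path this approach seems to imply, under assumed numbers (batch size, file paths) and with made-up names (Master, Backup, flush); this is not a decided interface, just one way the idea could look:

```python
# Sketch of approach #4: a write counts as durable once it sits in the RAM of
# its backups; each backup batches its in-RAM log to disk asynchronously.
BATCH_BYTES = 8 << 20   # assumed batch size before a backup flushes to disk

class Backup:
    def __init__(self, path):
        self.path = path
        self.ram_log = []      # log entries buffered in RAM
        self.buffered = 0      # bytes currently buffered

    def append(self, key, value):
        """Accept a log entry into RAM; the write is 'durable' from here on."""
        self.ram_log.append((key, value))
        self.buffered += len(value)
        if self.buffered >= BATCH_BYTES:
            self.flush()

    def flush(self):
        """Batch the buffered log to disk with one sequential append."""
        with open(self.path, "ab") as f:
            for key, value in self.ram_log:
                f.write(key.encode() + b"\t" + value + b"\n")
        self.ram_log.clear()
        self.buffered = 0

class Master:
    def __init__(self, backups_by_shard):
        # Each shard is replicated to a different set of backups, so many
        # servers hold pieces of this master and can help rebuild it in parallel.
        self.table = {}
        self.backups_by_shard = backups_by_shard

    def write(self, shard, key, value):
        self.table[key] = value                  # update the in-RAM copy
        for b in self.backups_by_shard[shard]:   # replicate into backup RAM
            b.append(key, value)
        # Returning here reflects "durable when write request returns",
        # under the assumption that replicated RAM is durable enough.
        return True

# Example: shard 0 of this master is backed by three other servers.
backups = {0: [Backup("/tmp/b0.log"), Backup("/tmp/b1.log"), Backup("/tmp/b2.log")]}
m = Master(backups)
m.write(0, "key42", b"a few thousand bytes of record data")
```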

9/30 Discussion Notes

Data Durability

  • New assumption: 1-2 seconds of non-availability on a failure. How long does full reconstruction take?
  • Issue: estimate the time to reconstitute a failed node (worked out below):
    1. No Sharding
      • 64 GB per server
      • 128 MB/s disk bandwidth
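
A back-of-envelope check using the numbers above; the sharded cases at the end assume recovery stays disk-bound and parallelizes evenly across N backups, which is an assumption rather than a measurement:

```python
# Back-of-envelope: time to reconstitute a failed node from a single disk.
SERVER_DATA = 64 * 2**30      # 64 GB of data per server
DISK_BW     = 128 * 2**20     # 128 MB/s of disk bandwidth

no_sharding = SERVER_DATA / DISK_BW
print(f"no sharding: {no_sharding:.0f} s (~{no_sharding / 60:.1f} min)")
# -> 512 s, i.e. roughly 8.5 minutes to rebuild one node

# If the same 64 GB were sharded across N backups that all stream in parallel
# (assumption: recovery remains disk-bound and parallelizes perfectly):
for n in (8, 64, 512):
    print(f"{n:3d}-way sharding: {no_sharding / n:6.1f} s")
```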

Old Notes

Issues

  • Disk write bandwidth
  • Best approach to achieve good write bandwidth on disk: use at least 3 disks, one for write logging, one for archiving the most recent log data, and one for compaction. This scheme eliminates seeks entirely, which should give us about 100 MB/s. Unfortunately, we'll also need RAID, so the total is more than 3 disks just to achieve the write bandwidth of a single disk.
  • Network bandwidth
  • Incast issues on reconstruction
    • Particularly at TOR switches, etc.
    • Otherwise, the lower bound on machine reconstruction is about 1 minute (rough check below).
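
A rough sanity check on that figure. The notes don't record the assumed network bandwidth; the 10 Gb/s into the recovering machine used below is purely illustrative:

```python
# Rough check on the ~1 minute reconstruction lower bound.
SERVER_DATA = 64 * 2**30   # 64 GB to re-fetch over the network
NET_BW      = 10e9 / 8     # assumed 10 Gb/s into the new machine, in bytes/s

print(f"~{SERVER_DATA / NET_BW:.0f} s to stream 64 GB")   # about 55 s
# Incast at the TOR switches could push the real number well above this.
```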

Looming Questions

  • Read/Write mix?
  • Merge recent writes
  • Making data available during recovery
  • If backups are going to have to serve shards while recovering a master, why recover at all?

Alternative - No Recovery

  • Soft state masters, backups
    • Once a master or backup fails, a new one is chosen and the state is sent there from the remaining backups/masters.
    • All soft state, fixes many consistency issues on recovery
  • Hold most recent writes in a partial version of the shard in RAM
    • LRU
  • Don't evict values from RAM on write-back (this allows fast recovery if backups need to compare recent values for consistency)
  • We get write coalescing for cheap
  • Can maintain on-disk structure in a compact and quickly serviceable layout
    • ?? Is this going to be impossible to maintain with decent performance, though?
    • If so, we can still do logging, but how quickly can we service requests then?
  • On master failure
    • For each shard the failed master served, one of that shard's backups will be elected master
    • The election will be aware of the load on each server, so mastership will tend to go to less-loaded servers
    • Compare backups after a master is elected to check consistency
      • The master could've failed while writes were in flight
      • Version numbers make this easy
      • Only the most recently written value can be inconsistent, so we only need to compare k values for k backup shards
      • This is much cheaper than an agreement protocol for the backups
    • After getting no response on a write to a master, the client retries the write against one of the backups, which redirects it to the new master. If the write was already seen, the retried write fails harmlessly; if it was not seen, it succeeds (see the sketch after this list).
    • To repair locality we'll punt to a higher level service which will eventually migrate the data
  • Backups are not recovered either
  • When a backup is non-responsive, a new backup is chosen based on load and location (must be outside the rack, etc.) and bootstrapped
    • Bootstrap from the primary or from the other backup?
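
A sketch of the failover path described above, assuming versioned values and k backups per shard; every name here (Versioned, reconcile_latest, the cluster methods in client_write) is illustrative rather than an agreed design:

```python
# Illustrative sketch of the no-recovery failover path. Assumes every stored
# value carries a version number; only the most recently written key can be
# inconsistent, so the new master compares just that key across the k backups.

class Versioned:
    def __init__(self, version, data):
        self.version = version
        self.data = data

def reconcile_latest(new_master_table, backup_tables, last_written_key):
    """After election, settle the single key that may have been in flight."""
    copies = [new_master_table.get(last_written_key)]
    copies += [b.get(last_written_key) for b in backup_tables]
    copies = [c for c in copies if c is not None]
    if not copies:
        return None                                # the in-flight write landed nowhere
    winner = max(copies, key=lambda v: v.version)  # k comparisons, no agreement protocol
    new_master_table[last_written_key] = winner
    return winner

def client_write(cluster, shard, key, value):
    """Client-side retry when the master goes silent (cluster API is hypothetical)."""
    try:
        return cluster.master_for(shard).write(key, value)
    except TimeoutError:
        backup = cluster.any_backup_for(shard)
        new_master = backup.redirect_to_new_master()
        # If the original write was already applied, the retry fails harmlessly
        # as a duplicate version; otherwise it simply succeeds.
        return new_master.write(key, value)
```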

Questions

  • How much serving capacity will need to be reserved versus the amount that a host can act as backup for?
    • We'll need some target ratios based on the failure rate of machines in order to trigger migration of shards to less loaded servers
  • Shard sizes?
    • Static?
    • Dynamic?
  • Index backup?
    • Shard and store along with the values
    • How do we shard it, though, since the index is ordered on a different key?
  • We still have no real idea what the workload looks like, which makes it hard to tell whether any of these approaches is reasonable. In particular, we don't know how much disk bandwidth we'll actually need to make the system work.