Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Approach #4: temporarily commit to another server's RAM for speed, eventually to disk
    • some form of logging in RAM and batching writes to disk + checkpointing
    • need to "shard" data for each server so that many servers serve as a backup for a single master to speed recovery time
      • likely, backup shards will need to be able to temporarily become masters for the data while rebuilding the master

9/30 Discussion Notes

Data Durability

  • New assumption: 1-2 second non-availability when failure. How long for full reconstruction?
  • Issue: Estimate reconstitution of a failed node:
    1. No Sharding
      • 64 GB per server
      • 128 MB/s disk bandwidth

Old Notes

Issues

  • Disk write bandwidth
  • Best approach to achieve good write bandwidth on disk: have at least 3 disks, one for write logging, one archiving the last amount of log data, one for compaction. Using this scheme we can completely eliminate seeks which should give us about 100 MB/s. Unfortunately, we'll need RAID as well so the total is more than 3 disks just to achieve the write bandwidth of 1 disk.
  • Network bandwidth
  • Incast issues on reconstruction
    • Particularly at TOR switches, etc.
    • Otherwise lower-bound on machine reconstruction is about 1 min.

...