...

  • Approach #1: write data to disk synchronously.  Too slow.
  • Approach #2: write data to flash memory synchronously?  Also probably too slow.
  • An RPC to another server is much faster than a local disk write, so "commit" by sending updates to one or more servers.
  • Approach #3: keep all backup copies in main memory.  This would at least double the amount of memory needed and hence the cost of the system.  Too expensive.  Also, doesn't handle power failures.
  • Approach #4: temporarily commit to another server for speed, but eventually commit data to disk or some other secondary storage:
    • This would handle power failures as well as crashes.
    • However, can't afford to do a disk I/O for each update.
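
      A back-of-the-envelope comparison of the commit latencies behind these tradeoffs. The numbers below are order-of-magnitude assumptions for illustration, not measurements:

        # Rough per-operation commit latencies (illustrative assumptions only;
        # real hardware will differ).
        DISK_WRITE  = 10e-3    # ~10 ms: synchronous write to a rotating disk
        FLASH_WRITE = 200e-6   # ~200 us: synchronous write to flash
        BACKUP_RPC  = 50e-6    # ~50 us: round trip to another server in the datacenter

        for name, latency in [("disk", DISK_WRITE),
                              ("flash", FLASH_WRITE),
                              ("RPC to backup", BACKUP_RPC)]:
            print("%-14s %6.0f us  -> at most %8.0f commits/s per thread"
                  % (name, latency * 1e6, 1 / latency))
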
  • One approach: log plus checkpoint, with staging:
    • Keep an operation log that records all state changes.
    • For each server, the operation log is stored on one or more other servers (call these backups).
    • Before responding to an update request, record log entries on all the backup servers.
    • Each backup server stores log entries temporarily in main memory; once it has accumulated a large batch, it writes them to its local disk in a single I/O. This way the log reaches disk relatively soon, but doesn't require a disk I/O for each update (see the sketch after this list).
    • Occasionally write a checkpoint of the server's memory image to disk.
    • Record information about the latest checkpoint in the operation log.
    • After a crash, reload the memory image from the latest checkpoint, then replay all of the log entries more recent than that checkpoint.
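
      A minimal sketch of this scheme, assuming hypothetical Master/Backup classes and a dict standing in for the checkpoint disk (not RAMCloud's actual structures):

        FLUSH_THRESHOLD = 1000          # log entries a backup buffers in RAM per disk write

        class Backup:
            def __init__(self):
                self.buffer = []        # entries held temporarily in main memory
                self.on_disk = []       # flushed batches: one disk I/O covers many updates

            def append(self, entry):
                self.buffer.append(entry)
                if len(self.buffer) >= FLUSH_THRESHOLD:
                    self.on_disk.append(self.buffer)
                    self.buffer = []

            def log_entries(self):
                return [e for batch in self.on_disk for e in batch] + self.buffer

        class Master:
            def __init__(self, backups, disk):
                self.memory = {}        # the in-RAM image of the data
                self.backups = backups
                self.disk = disk        # secondary storage for checkpoints

            def update(self, key, value):
                # Record the change on every backup before replying to the client.
                for b in self.backups:
                    b.append((key, value))
                self.memory[key] = value

            def checkpoint(self):
                # Occasionally dump the memory image and note its position in the log.
                self.disk["checkpoint"] = dict(self.memory)
                for b in self.backups:
                    b.append(("__CHECKPOINT__", None))

        def recover(disk, backup):
            # Reload the latest checkpoint, then replay only entries newer than it.
            image = dict(disk.get("checkpoint", {}))
            log = backup.log_entries()
            marks = [i for i, (k, _) in enumerate(log) if k == "__CHECKPOINT__"]
            for key, value in log[(marks[-1] + 1) if marks else 0:]:
                image[key] = value
            return image
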
  • Problem: checkpoint reload time.
    • Suppose the checkpoint is stored on the server's disk.
    • If the disk transfer rate is 100 MB/second and the server has 32 GB of RAM, it will take about 320 seconds (over five minutes) just to reload the checkpoint.
    • This time will increase in the future, as RAM capacity increases faster than disk transfer rates.
    • If the checkpoint is stored on the disk(s) of one or more other machines and reconstituted on a server over a 1Gb network, the reload time is about the same.
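
      The reload-time arithmetic spelled out, using the rates from the bullets above (decimal units):

        ram_bytes = 32e9               # 32 GB memory image to reconstitute
        disk_rate = 100e6              # 100 MB/s disk transfer rate
        net_rate  = 1e9 / 8            # 1 Gb/s network link = 125 MB/s

        print(ram_bytes / disk_rate)   # 320 s to reload from the local disk
        print(ram_bytes / net_rate)    # 256 s of network time alone; the remote disks
                                       # still read at ~100 MB/s, so it is no faster overall
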
  • Possible solution:
    • Split the checkpoint into, say, 1000 fragments and store each fragment on a different backup server.
    • These fragments are still large enough to be transferred efficiently to/from disk.
    • After a crash, each of the 1000 servers reads its fragment from disk in parallel (only 0.3 seconds). During recovery, these servers temporarily fill in for the crashed server.
    • Eventually a replacement server comes up, fetches the fragments from the backup servers, rolls forward through the operation log, and takes over for the backups.
    • For this to work, each backup server will have to replay the portion of the operation log relevant for it; perhaps split the log across backup servers?
    • A potential problem with this approach: if recovery involves 1000 servers, there is a significant chance that one of them will crash during the recovery interval, so RAMCloud will probably need to tolerate failures during recovery. Would a RAID-like approach work for this? Or just use full redundancy (disk space probably isn't an issue).
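
      A quick sketch of why scattering helps, plus the failure-during-recovery concern; the per-server crash rate and recovery window below are assumptions for illustration:

        ram_bytes = 32e9
        disk_rate = 100e6
        fragments = 1000

        fragment_bytes = ram_bytes / fragments    # 32 MB per backup: still a large,
                                                  # efficient sequential disk transfer
        print(fragment_bytes / disk_rate)         # 0.32 s for all backups to read their
                                                  # fragments in parallel

        # Chance that at least one of the 1000 backups crashes during recovery,
        # assuming (hypothetically) one crash per server per 30 days and a
        # one-minute recovery window.
        per_server_rate = 1.0 / (30 * 24 * 3600)  # crashes per server per second
        window = 60                               # seconds
        print(1 - (1 - per_server_rate * window) ** fragments)   # ~2% per recovery
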
  • What is the cost of replaying log entries?
    • Suppose the fraction of operations that are updates is U.
    • Suppose also that it takes the same amount of time for a basic read operation, a basic update, and replaying of an update.
    • Then it will take U seconds to replay each second of typical work.
    • If U is 0.1 and a crash happens 1000 seconds after the last checkpoint, it will take 100 seconds to replay the log, assuming it is all done on one machine. This suggests that checkpoints would need to be made frequently, which sounds expensive.
    • On the other hand, with the fragmented approach described above, 1000 machines might handle the replay in parallel, which means 100,000 seconds of activity (more than a day) could be replayed in 10 seconds.
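
      The replay arithmetic from the two bullets above:

        U = 0.1                          # fraction of operations that are updates
        seconds_since_checkpoint = 1000
        print(U * seconds_since_checkpoint)   # 100 s of replay on a single machine

        machines = 1000                  # backups replaying their log portions in parallel
        workload = 100_000               # seconds of normal activity to replay
        print(U * workload / machines)   # 10 s of parallel replay
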
  • Must RAMCloud handle complete failure of an entire datacenter?
    • This would require shipping checkpoints and operation logs to one or more additional datacenters.
    • How fast must disaster recovery be?  Is recovery from disk OK in this case?
  • Checkpoint overhead:
    • If a server has 32GB of memory and a 1Gbit network connection, it will take 256 seconds of network time to transmit a checkpoint externally.
    • If a server is willing to use 10% of its network bandwidth for checkpointing, that means a checkpoint every 2560 seconds (about 43 minutes).
    • If the server is willing to use only 1% of its network bandwidth for checkpointing, that means a checkpoint every 25600 seconds (about seven hours).
    • Cross-datacenter checkpointing:

      RAMCloud capacity    Bandwidth between datacenters    Checkpoint interval
      50TB                 1Gb/s                            556 hours (23 days)
      50TB                 10Gb/s                           55.6 hours (2.3 days)

    • Most checkpoints will be almost identical to their predecessors; we should look for approaches that retain old checkpoint data for data that hasn't changed, in order to reduce the overhead of checkpointing.
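
      The per-server figures above follow from checkpoint size divided by the share of link bandwidth devoted to checkpointing:

        server_ram = 32e9                    # bytes per server
        link       = 1e9 / 8                 # 1 Gb/s = 125 MB/s
        print(server_ram / link)             # 256 s of network time per full checkpoint
        print(server_ram / (0.10 * link))    # 2560 s (~43 minutes) at 10% of the link
        print(server_ram / (0.01 * link))    # 25600 s (~7 hours) at 1% of the link
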
  • One way of thinking about the role of disks in RAMCloud is that we are creating a very simple file system with two operations:
    • Write batches of small changes.
    • Read the entire file system at once (during recovery).
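
      A hypothetical interface capturing those two operations (a sketch, not an actual RAMCloud API):

        class BackupDisk:
            """The disk's role reduced to a two-operation 'file system'."""

            def __init__(self):
                self.batches = []

            def write_batch(self, changes):
                # Append a batch of small changes with one sequential disk write.
                self.batches.append(list(changes))

            def read_all(self):
                # Read everything back at once; only needed during recovery.
                return [c for batch in self.batches for c in batch]
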

...