
Assumptions

  • Even though the current copy of information is in main memory, data must be just as durable as if it were stored in a traditional disk-based database.
  • The durability of data must be guaranteed before a write request returns.
  • With thousands of servers in the cluster, machines will die constantly; it's possible that the system will almost always be in a state of crash recovery.
  • Recovery from "simple" failures must be instantaneous and transparent.

Failure scenarios

We might want RAMCloud data to survive any or all of the following scenarios:

  • Sudden loss of a single machine. A hardware failure could make the machine's disk inaccessible for an extended period of time. Data unavailability in the event of a single machine failure may not be tolerable.
  • Sudden loss/unavailability of a group of machines, such as a rack (e.g., the top-of-rack switch could fail).
  • Power failure. This may not introduce any additional problems beyond those presented by other failure modes. It appears that large data centers may have backup power reliable enough to "guarantee" continuous operation. Even in the event of a power failure, we may be able to assume enough warning to flush (modest amounts of) volatile information to disk.
  • Complete loss of a datacenter:
    • Do we need to survive this?
    • How long a period of unavailability is acceptable during disaster recovery (must the system support hot standbys, or are backups in other data centers sufficient)?

Possible solutions

  • Approach #1: write data to disk synchronously.  Too slow.
  • Approach #2: write data to flash memory synchronously?  Also probably too slow.
  • An RPC to another server is much faster than a local disk write, so "commit" by sending updates to one or more servers.
  • Approach #3: keep all backup copies in main memory.  This would at least double the amount of memory needed and hence the cost of the system.  Too expensive.  Also, doesn't handle power failures.
  • Approach #4: temporarily commit to another server for speed, but eventually commit data to disk or some other secondary storage:
    • This would handle power failures as well as crashes.
    • However, can't afford to do a disk I/O for each update.
  • One approach: log plus snapshot, with staging:
    • Keep an operation log that records all state changes.
    • For each server, the operation log is stored on one or more other servers (call these backups).
    • Before responding to an update request, record log entries on all the backup servers.
    • Each backup server stores log entries temporarily in main memory. Once it has collected a large number of them it then writes them to a local disk. This way the log ends up on disk relatively soon, but doesn't require a disk I/O for each update.
    • Occasionally write a snapshot of the server's memory image to disk.
    • Record information about the latest snapshot in the operation log.
    • After a crash, reload the memory image from the latest snapshot, then replay all of the log entries more recent than that snapshot.
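
Below is a minimal sketch of this staged-logging idea, assuming hypothetical names (BackupServer, Master) and an arbitrary 8 MB flush threshold; the backup "RPCs" are plain function calls here rather than network messages, so this illustrates the data flow, not RAMCloud's actual protocol.

```python
import os
import pickle

class BackupServer:
    """Holds log entries in memory and writes them to disk in large batches,
    so the log reaches disk fairly soon without a disk I/O per update."""

    def __init__(self, log_path, flush_threshold_bytes=8 * 1024 * 1024):
        self.log_path = log_path
        self.flush_threshold = flush_threshold_bytes
        self.buffer = []            # serialized log entries not yet on disk
        self.buffered_bytes = 0

    def append(self, entry):
        """Called by the master for every update, before it replies to the client."""
        data = pickle.dumps(entry)
        self.buffer.append(data)
        self.buffered_bytes += len(data)
        if self.buffered_bytes >= self.flush_threshold:
            self.flush()
        return True                 # the entry is now held by this backup

    def flush(self):
        """Write the accumulated batch to local disk with one large sequential I/O."""
        with open(self.log_path, "ab") as f:
            for data in self.buffer:
                f.write(len(data).to_bytes(4, "little"))   # simple length-prefixed framing
                f.write(data)
            f.flush()
            os.fsync(f.fileno())
        self.buffer.clear()
        self.buffered_bytes = 0

class Master:
    """Records each state change on all of its backups before acknowledging it."""

    def __init__(self, backups):
        self.backups = backups

    def update(self, key, value):
        entry = {"op": "write", "key": key, "value": value}
        for backup in self.backups:     # in practice these would be parallel RPCs
            backup.append(entry)
        return "OK"                     # only now may the master reply to the client
```
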
  • Problem: snapshot reload time.
    • Suppose the snapshot is stored on the server's disk.
    • If the disk transfer rate is 100 MB/second and the server has 32 GB of RAM, it will take over 300 seconds (32 GB ÷ 100 MB/s = 320 s) just to reload the snapshot.
    • This time will increase in the future, as RAM capacity increases faster than disk transfer rates.
    • If the snapshot is stored on the disk(s) of one or more other machines and reconstituted on a server over a 1Gb network, the reload time is about the same.
  • Possible solution:
    • Split the snapshot into, say, 1000 fragments and store each fragment on a different backup server.
    • These fragments are still large enough to be transferred efficiently to/from disk.
    • After a crash, each of the 1000 servers reads its fragment from disk in parallel (only 0.3 seconds). During recovery, these servers temporarily fill in for the crashed server.
    • Eventually a replacement server comes up, fetches the fragments from the backup servers, rolls forward through the operation log, and takes over from the backups.
    • For this to work, each backup server will have to replay the portion of the operation log relevant for it; perhaps split the log across backup servers?
    • A potential problem with this approach: if it takes 1000 servers to recover, there is a significant chance that one of them will crash during the recovery interval, so RAMCloud will probably need to tolerate failures during recovery.  Would a RAID-like approach work for this?    Or, just use full redundancy (disk space probably isn't an issue).
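
A rough sketch of fragment-based recovery under the assumptions above (1000 fragments, one per backup server). The file layout, the read_fragment/recover names, and the representation of a fragment as a Python dict are invented for illustration; in the real system each backup would read and replay its own fragment locally rather than one process doing everything.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

NUM_FRAGMENTS = 1000        # one fragment of the crashed server's image per backup

def read_fragment(fragment_id):
    """Read one ~32 MB fragment of the snapshot from local disk (hypothetical path)."""
    with open(f"/backup/snapshot.frag{fragment_id}", "rb") as f:
        return fragment_id, pickle.load(f)       # a fragment is a dict of objects here

def recover(log_entries_by_fragment):
    """Read all fragments in parallel, then roll each fragment forward through the
    log entries that are more recent than the snapshot."""
    with ThreadPoolExecutor(max_workers=32) as pool:
        fragments = dict(pool.map(read_fragment, range(NUM_FRAGMENTS)))
    for frag_id, entries in log_entries_by_fragment.items():
        table = fragments[frag_id]
        for entry in entries:                     # each entry is a {"key", "value"} write
            table[entry["key"]] = entry["value"]
    return fragments                              # backups can now serve requests
```
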
  • Another approach to recovery: full pages
    • Use a relatively small "page" size (1024 bytes?)
    • During updates, send entire pages to backup servers
    • Backup servers must coalesce these updated pages into larger units for transfer to disk
    • The additional overhead for sending entire pages during commit may make this approach impractical
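
A sketch of the backup side of the full-page approach, with hypothetical names (PageBackup, store_page) and an arbitrary coalescing threshold. It shows how 1024-byte pages could be collected and written to disk in larger units; the cost concern above is on the master side, which must ship an entire page for every update.

```python
PAGE_SIZE = 1024                  # small page size suggested above

class PageBackup:
    """Keeps the latest copy of each dirty page in memory and coalesces the
    dirty pages into one large pass over the disk."""

    def __init__(self, disk_path, coalesce_pages=8192):
        self.disk_path = disk_path
        self.coalesce_pages = coalesce_pages
        self.dirty = {}           # page number -> most recent page contents
        open(disk_path, "ab").close()   # make sure the backing file exists

    def store_page(self, page_no, page_bytes):
        """Called once per update with the full page containing the modified data."""
        assert len(page_bytes) == PAGE_SIZE
        self.dirty[page_no] = page_bytes          # newer versions overwrite older ones
        if len(self.dirty) >= self.coalesce_pages:
            self.flush()

    def flush(self):
        """Write all dirty pages to their home locations, sorted for sequential I/O."""
        with open(self.disk_path, "r+b") as f:
            for page_no in sorted(self.dirty):
                f.seek(page_no * PAGE_SIZE)
                f.write(self.dirty[page_no])
        self.dirty.clear()
```
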
  • Must RAMCloud handle complete failure of an entire datacenter?
    • This would require shipping snapshots and operation logs to one or more additional datacenters.
    • How fast must disaster recovery be?
  • What is the cost of replaying log entries?
    • Suppose the fraction of operations that are updates is U.
    • Suppose also that it takes the same amount of time for a basic read operation, a basic update, and replaying of an update.
    • Then it will take U seconds to replay each second of typical work.
    • If U is 0.1 and a crash happens 1000 seconds after last snapshot, then it will take
      100 seconds to replay the log, assuming it is all done on the same machine. This suggests that snapshots would need to be made frequently, which sounds expensive.
    • On the other hand, with the fragmented approach described above, 1000 machines
      might handle the replay in parallel, which means 100,000 seconds of activity (more than a day) could be replayed in 10 seconds.
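
The replay-cost arithmetic above is reproduced below; the inputs (U = 0.1, replay as expensive as the original operation, 1000-way parallelism) are the assumptions stated in the bullets, not measurements.

```python
U = 0.1                                  # fraction of operations that are updates

# Single machine: 1000 seconds of normal activity since the last snapshot.
replay_single = U * 1000
print(replay_single)                     # 100.0 seconds of replay on one machine

# Fragmented recovery: 1000 machines replay different parts of the log in parallel.
machines = 1000
seconds_of_activity = 100_000            # more than a day of activity
replay_parallel = U * seconds_of_activity / machines
print(replay_parallel)                   # 10.0 seconds
```
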
  • Snapshot overhead:
    • If a server has 32GB of memory and a 1Gbit network connection, it will take 256 seconds of network time to transmit a snapshot externally.
    • If a server is willing to use 10% of its network bandwidth for snapshotting, that means a snapshot every 2560 seconds (about 43 minutes)
    • If the server is willing to use only 1% of its network bandwidth for snapshotting, that means a snapshot every 25600 seconds (seven hours)
    • Cross-datacenter snapshotting:

      RAMCloud capacity   Bandwidth between datacenters   Snapshot interval
      50 TB               1 Gb/s                          556 hours (23 days)
      50 TB               10 Gb/s                         55.6 hours (2.3 days)
    • Most snapshots will be almost identical to their predecessors; we should look for approaches that allow old snapshot data to be retained for data that hasn't changed, in order to reduce the overhead for snapshotting.
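
The per-server figures above follow from simple arithmetic, sketched below with the same assumptions (32 GB of DRAM, a 1 Gbit/s link, and a fixed fraction of the link reserved for snapshotting).

```python
memory_bytes = 32 * 10**9                # 32 GB of DRAM per server
link_bits_per_second = 10**9             # 1 Gbit/s network connection

transfer_seconds = memory_bytes * 8 / link_bits_per_second
print(transfer_seconds)                  # 256.0 seconds to ship one full snapshot

for fraction in (0.10, 0.01):            # share of bandwidth devoted to snapshotting
    print(fraction, transfer_seconds / fraction)   # 2560 s (~43 min), 25600 s (~7 h)
```
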
  • One way of thinking about the role of disks in RAMCloud is that we are creating a very simple file system with two operations:
    • Write small chunks in batches.
    • Read the entire file system at once (during recovery)
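
A minimal sketch of this two-operation "file system"; the BackupDisk name, the length-prefixed chunk framing, and the single-file layout are assumptions made purely for illustration.

```python
class BackupDisk:
    """The only two operations RAMCloud needs from a backup's disk:
    append small chunks in large batches, and read everything back on recovery."""

    def __init__(self, path):
        self.path = path

    def write_batch(self, chunks):
        """Append a batch of small chunks with one large sequential write."""
        with open(self.path, "ab") as f:
            for chunk in chunks:
                f.write(len(chunk).to_bytes(4, "little"))
                f.write(chunk)

    def read_all(self):
        """Read the entire contents back; used only during crash recovery."""
        chunks = []
        with open(self.path, "rb") as f:
            while True:
                header = f.read(4)
                if not header:
                    return chunks
                chunks.append(f.read(int.from_bytes(header, "little")))
```
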

Additional Questions and Ideas

  • Could the operation log also serve as an undo/redo log and audit trail?
  • What happens if there is an additional crash during recovery?
  • How is disk space managed for backup information?
  • What level of replication is required for RAMCloud to be widely accepted?  Most likely, single-level redundancy will not be enough.