Data persistence

Assumptions

Even though the current copy of information is in main memory, data must be just as durable as if it were stored in a traditional disk-based database.
The durability of data must be guaranteed before a write request returns.
With thousands of servers in the cluster, machines will die constantly; it's possible that the system will almost always be in a state of crash recovery.
Recovery from "simple" failures must be instantaneous and transparent.
Data consists of small chunks of information (a few hundred bytes), updated in their entirety.

Failure scenarios

We might want RAMCloud data to survive any or all of the following scenarios:

Bit errors in memory. May need additional checks on top of those provided by hardware to make sure errors are detected.
Sudden loss of a single machine. A hardware failure could make the machine's disk inaccessible for an extended period of time. Data unavailability in the event of a single machine failure may not be tolerable.
Sudden loss/unavailability of a group of machines, such as a rack (e.g., the top-of-rack switch could fail).
Power failure. This may not introduce any additional problems beyond those presented by other failure modes. It appears that large data centers may have reliable enough backup to "guarantee" reliable power. Even in the event a power failure, we may be able to assume enough warning to flush (modest amounts of) volatile information to disk.
Complete loss of a datacenter:
- Do we need to survive this?
- How long of a period of unavailability is acceptable during disaster recovery (does the system have to support hot standbys, versus backups in other data centers)?
Network partitions: how likely are these within a datacenter?
Byzantine failures?
Thermal emergency: need to shut down machines to reduce energy usage. This issue may relate more to scaling the system up and down and to data durability.

Possible solutions

Approach #1: write data to disk synchronously. Too slow.
Approach #2: write data to flash memory synchronously? Also probably too slow.
An RPC to another server is much faster than a local disk write, so "commit" by sending updates to one or more servers.
Approach #3: keep all backup copies in main memory. This would at least double the amount of memory needed and hence the cost of the system. Too expensive. Also, doesn't handle power failures.
Approach #4: temporarily commit to another server for speed, but eventually commit data to disk or some other secondary storage:
- This would handle power failures as well as crashes.
- However, can't afford to do a disk I/O for each update.
One approach: log plus checkpoint, with staging:
- Keep an operation log that records all state changes.
- For each server, the operation log is stored on one or more other servers (call these backups).
- Before responding to an update request, record log entries on all the backup servers.
- Each backup server stores log entries temporarily in main memory. Once it has collected a large number of them it then writes them to a local disk. This way the log ends up on disk relatively soon, but doesn't require a disk I/O for each update.
- Occasionally write a checkpoint of the server's memory image to disk.
- Record information about the latest checkpoint in the operation log.
- After a crash, reload the memory image from the latest checkpoint, then replay all of the log entries more recent than that checkpoint.
Problem: checkpoint reload time.
- Suppose the checkpoint is stored on the server's disk.
- If the disk transfer rate is 100 MB/second and the server has 32 GB of RAM, it will take 300 seconds just to reload the checkpoint.
- This time will increase in the future, as RAM capacity increases faster than disk transfer rates.
- If the checkpoint is stored on the disk(s) of one or more other machines and reconstituted on a server over a 1Gb network, the reload time is about the same.
Possible solution:
- Split the checkpoint into, say, 1000 fragments and store each fragment on a different backup server.
- These fragments are still large enough to be transferred efficiently to/from disk.
- After a crash, each of the 1000 servers reads its fragment from disk in parallel (only 0.3 seconds). During recovery, these servers temporarily fill in for the crashed server.
- Eventually a replacement server comes up, fetches the fragments from the backup servers, rolls forward through the operation log, and takes over for the backups.
- Or, alternatively, the backup becomes the permanent home for its fragment. Fragments might later get reassigned as part of the scaling facilities in the system.
- For this to work, each backup server will have to replay the portion of the operation log relevant for it; perhaps split the log across backup servers?
- A potential problem with this approach: if it takes 1000 servers to recover, there is a significant chance that one of them will crash during the recovery interval, so RAMCloud will probably need to tolerate failures during recovery. Would a RAID-like approach work for this? Or, just use full redundancy (disk space probably isn't an issue).
What is the cost of replaying log entries?
- Suppose the fraction of operations that are updates is U.
- Suppose also that it takes the same amount of time for a basic read operation, a basic update, and replaying of an update.
- Then it will take U seconds to replay each second of typical work.
- If U is 0.1 and a crash happens 1000 seconds after last checkpoint, then it will take
  100 seconds to replay the log, assuming it is all done on the same machine. This suggests that checkpoints would need to be made frequently, which sounds expensive.
- On the other hand, with the fragmented approach described above, 1000 machines
  might handle the replay in parallel, which means 100,000 seconds of activity (more than a day) could be replayed in 10 seconds.
Must RAMCloud handle complete failure of an entire datacenter?
- This would require shipping checkpoints and operation logs to one or more additional datacenters.
- How fast must disaster recovery be? Is recovery from disk OK in this case?
Checkpoint overhead:
- If a server has 32GB of memory and a 1Gbit network connection, it will take 256 seconds of network time to transmit a checkpoint externally.
- If a server is willing to use 10% of its network bandwidth for checkpointing, that means a checkpoint every 2560 seconds (40 minutes)
- If the server is willing to use only 1% of its network bandwidth for checkpointing, that means a checkpoint every 25600 seconds (seven hours)
- Cross-datacenter checkpointing
  
  RAMCloud capacity
  
  Bandwidth bewteen datacenters
  
  Checkpoint Interval
  
  50TB
  
  1Gb/s
  
  556 hours (23 days)
  
  50TB
  
  10Gb/s
  
  55.6 hours (2.3 days)
- Most checkpoints will be almost identical to their predecessors; we should look for approaches that allow old checkpoint data to be retained for data that hasn't changed, in order to reduce the overhead for checkpointing.
One way of thinking about the role of disks in RAMCloud is that are creating a very simple file system with two operations:
- Write batches of small changes.
- Read the entire file system at once (during recovery)

Another approach to checkpointing: full pages
- Use a relatively small "page" size (1024 bytes?)
- During updates, send entire pages to backup servers
- Backup servers must coalesce these updated pages into larger units for transfer to disk
- The additional overhead for sending entire pages during commit may make this approach impractical
Yet another approach to checkpointing:
- Server sends operation log to backups.
- It's up to backups to figure out how to persist this: they must generate a complete enough disk image so that they can quickly regenerate all of the data they have ever received. No explicit checkpoint data is sent to backups.
- During recovery the backup can provide the recovered data in any order (it can always be rearranged in memory).
- This starts sounding a lot like a log-structured file system.

Update fraction and log length

What will be the update fraction (U) in the system?
If U is 0.1 we can't handle the logs:
- At 10*6 operations/sec/server and U=0.1 and 103 log bytes/write, we have 100MB/sec of log, which (a) fills outgoing network and (b) completely fills the bandwidth of a disk. For a cloud with 10*4 servers we have 1TB/sec. of log; certainly couldn't transmit all this externally to a backup datacenter.
Perhaps U <<< 0.1? Our expected access pattern multiplies reads (e.g. one Facebook page reads from 1000 servers); elevating the read rate effectively reduces U.
If U is 10**-4 then each server generates 100 KB/sec. of log, which is much more manageable.
If U is 10*-4 then 10*4 servers generate 1GB/sec. of log, which may (barely) be tolerable.

Additional questions and ideas

Could the operation log also serve as an undo/redo log and audit trail?
What happens if there is an additional crash during recovery?
What level of replication is required for RAMCloud to be widely accepted? Most likely, single-level redundancy will not be enough.
The approach to persistence may depend on the data model. For example, if there are indexes, will they require different techniques for persistence
than data objects?

Follow-up discussion topics

Design a different checkpointing algorithm using LFS-like techniques to avoid re-checkpointing data that hasn't changed. For example, how can logging be organized to merge repeated writes for the same object?
Exactly how to make data available during recovery.
With ubiquitous flash memory, how would we approach the checkpointing problem?

RAMCloud capacity	Bandwidth bewteen datacenters	Checkpoint Interval
50TB	1Gb/s	556 hours (23 days)
50TB	10Gb/s	55.6 hours (2.3 days)