Backup

Backup Goals

  • Store log segments durably for future recovery requests by masters.
  • Return data to masters as quickly as possible for recovery.
  • Minimize memory footprint during all operations (both backup and recovery).

Rough Overview of the Components

  • BackupServer - Implements the backup service and serves RPC requests from BackupClients.
  • BackupManager - Orchestrates all the details of durability and recovery from the Master's perspective (e.g. choosing and scheduling servers to back up to, ensuring replication requirements are met, etc.).
    • BackupClient - A thin RPC client implementing stubs for the RPC calls serviced by BackupServer.
  • BackupStorage - Manages all the details of durable storage inside the BackupServer through a simple interface (sketched after this list).
    • SingleFileStorage - A durable backup storage backend backed by a single file or raw device.
    • InMemoryStorage - A volatile backup storage backend for debugging, or for high-performance in-memory replication when durability requirements are low.
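
To make the overview concrete, here is a minimal sketch of what a storage interface along these lines could look like; the class and method names are illustrative assumptions, not the actual RAMCloud definitions.

#include <cstdint>

// Hypothetical storage abstraction: fixed-size frames, each large enough to
// hold one segment, identified by an opaque handle so the backend can place
// them wherever it likes.
class BackupStorage {
  public:
    virtual ~BackupStorage() {}
    virtual uint32_t allocateFrame() = 0;                         // reserve space for one segment
    virtual void putSegment(uint32_t frame, const char* src) = 0; // persist a full segment
    virtual void getSegment(uint32_t frame, char* dst) = 0;       // read a full segment back
    virtual void freeFrame(uint32_t frame) = 0;                   // return the frame to the pool
};

// SingleFileStorage would implement these calls with pread/pwrite against one
// large file or raw device; InMemoryStorage would back frames with heap memory
// and provide no durability across crashes.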

Operations

Normal Operation Walkthrough:

  1. Start Coordinator
  2. Start Backups
    1. coordinator.serverAlive(BACKUP, "fast+udp:host=1.2.3.4,port=12345") -> id 1
    2. coordinator.serverAlive(BACKUP, "fast+udp:host=1.2.3.5,port=12345") -> id 2
  3. Start Server
    1. coordinator.serverAlive(MASTER, "fast+udp:host=1.2.3.6,port=12344") -> id 3
  4. Open a segment to begin writing the log (first segment header).
  5. This write is passed on to the BackupManager.
  6. The BackupManager does not yet have any BackupHosts through which to communicate with the BackupServers.
    1. BackupManager tries to discover some service locators for backups:
      1. coordinator.getServerList() -> [(BACKUP, "fast+udp:host=1.2.3.4,port=12345", 1), ...]
    2. BackupManager randomly selects enough backups from the list to meet the replication requirement.
  7. foreach backupHost: backup.openSegment(masterId=3, segmentId=0) -> OK
  8. Write of segment header proceeds:
    • foreach backupHost: backup.writeSegment(masterId=3, segmentId=0, offset=0, length=8, data=...) -> OK
  9. Writes proceed until Master decides the segment is full.
  10. Opens the next segment on a new set of backups in a temporary list.
  11. Closes the old segment on each of the backups.
    • foreach backupHost: backup.closeSegment(masterId=3, segmentId=0) -> OK
  12. Promote the temporary backup list opened above to be the new list of backups (see the sketch below).
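
A minimal sketch of how a master-side BackupManager might drive steps 6-12 above. The BackupClient method names follow the RPCs in this walkthrough; the selection and bookkeeping details are simplifying assumptions.

#include <cstdint>
#include <vector>

// Stub standing in for the real RPC client; bodies omitted.
struct BackupClient {
    void openSegment(uint64_t masterId, uint64_t segmentId) {}
    void writeSegment(uint64_t masterId, uint64_t segmentId,
                      uint32_t offset, uint32_t length, const char* data) {}
    void closeSegment(uint64_t masterId, uint64_t segmentId) {}
};

class BackupManager {
  public:
    explicit BackupManager(uint64_t masterId) : masterId(masterId) {}

    // Steps 6-7: hosts have been chosen (randomly, from coordinator.getServerList())
    // to meet the replication requirement; open the segment on each of them.
    void openSegment(uint64_t segmentId, std::vector<BackupClient*> chosenHosts) {
        hosts = chosenHosts;
        for (BackupClient* backup : hosts)
            backup->openSegment(masterId, segmentId);
    }

    // Step 8: mirror every log write to all replicas.
    void write(uint64_t segmentId, uint32_t offset,
               uint32_t length, const char* data) {
        for (BackupClient* backup : hosts)
            backup->writeSegment(masterId, segmentId, offset, length, data);
    }

    // Steps 10-12: open the next segment on a new (temporary) set of hosts,
    // close the old segment on the old set, then adopt the new set.
    void rollOver(uint64_t oldSegmentId, uint64_t newSegmentId,
                  std::vector<BackupClient*> newHosts) {
        for (BackupClient* backup : newHosts)
            backup->openSegment(masterId, newSegmentId);
        for (BackupClient* backup : hosts)
            backup->closeSegment(masterId, oldSegmentId);
        hosts = newHosts;
    }

  private:
    uint64_t masterId;
    std::vector<BackupClient*> hosts;
};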

Log Cleaning:

  • backup.freeSegment(masterId=3, segmentId=0);

Recovery:

  1. Someone (an application) decides a master may be down.
    • coordinator.serverMayBeDown(3) -> OK, I'll check it out, you retry for a bit while I work.
  2. Assume the Coordinator coordinates recovery for now.
  3. The Coordinator calls backup.startReadingData(3) on every backup in RAMCloud -> [(segmentId, length)]
    • Coordinator prunes any entry from the list whose segmentId duplicates another entry's AND whose length is less than the maximum seen for that segmentId (see the sketch after this walkthrough).
  4. Coordinator selects several recovery masters and sends them a tablet map and a list of segments [(backupLocator, segmentId)]
  5. Each recovery master asks each backup in RAMCloud for objects in their backed up segments that pertain to its key range of the table:
    • foreach (backupLocator, segmentId) in list given by Coordinator: backup.getRecoveryData(3, segmentId, startId, endId) -> [LogEntry with data blobs]
    • Insert all log entries into the hash table and log.
  6. Notify recovery coordinator that recovery for the key range is complete.
  7. Once recovery coordinator knows all key ranges are recovered:
    • foreach backup in this recovery: backup.releaseRecoveryData(masterId).
  8. Coordinator jettisons state for masterId.
  9. If the crashed server restarts, it performs serverAlive(MASTER) as usual, getting a new unique masterId.
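
A small sketch of the pruning in step 3: for each segmentId the Coordinator keeps only the entry with the greatest reported length, since a longer copy reflects at least as many writes as a shorter one. The ReplicaEntry type and function name are assumptions for illustration.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ReplicaEntry {
    std::string backupLocator;  // where this copy lives
    uint64_t segmentId;
    uint32_t length;            // bytes the backup reported for this copy
};

// Collapse the combined startReadingData results down to one entry per
// segmentId, preferring the longest copy seen.
std::vector<ReplicaEntry>
pruneReplicaList(const std::vector<ReplicaEntry>& entries)
{
    std::map<uint64_t, ReplicaEntry> best;
    for (const ReplicaEntry& entry : entries) {
        auto it = best.find(entry.segmentId);
        if (it == best.end() || entry.length > it->second.length)
            best[entry.segmentId] = entry;
    }
    std::vector<ReplicaEntry> pruned;
    for (const auto& pair : best)
        pruned.push_back(pair.second);
    return pruned;
}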

Segment Durability

void
openSegment(uint64_t masterId, uint64_t segmentId);

Reserve segment buffer and storage space to hold this segment. Once an open has succeeded, the backup must make every effort to replay the segment during recovery and ensure that it appears as open.

void
writeSegment(uint64_t masterId, uint64_t segmentId, uint32_t offset, uint32_t length, char *data);

Write length bytes starting at data into the segment buffer for (masterId, segmentId), beginning at offset within the segment. The backup should not refuse this operation. The backup guarantees, to the best of its ability, that the segment replayed during recovery will reflect this write and will appear open.

void
closeSegment(uint64_t masterId, uint64_t segmentId);

Move the data in the staging buffer for segment (masterId, segmentId) to permanent storage and release the staging buffer for reuse. The backup promises to return data from this segment during recovery until the master indicates it is no longer interested by calling freeSegment, to make the segment appear closed during recovery, and to refuse any future writes to it.

void
freeSegment(uint64_t masterId, uint64_t segmentId);

Remove all knowledge of segment (masterId, segmentId) from the backup; the segment must previously have been closed to storage. The backup guarantees this data will not be seen by any future recoveries.
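
A backup-side sketch of how these four calls could manage a segment's staging buffer and storage. The SegmentInfo bookkeeping and the writeToStorage placeholder are assumptions for illustration, not the actual BackupServer internals.

#include <cstdint>
#include <cstring>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical per-segment state kept by a backup.
struct SegmentInfo {
    std::vector<char> stagingBuffer;  // in-memory copy while the segment is open
    bool open = false;
    int storageFrame = -1;            // set once the segment is closed to storage
};

class BackupService {
  public:
    explicit BackupService(size_t segmentSize) : segmentSize(segmentSize) {}

    // openSegment: reserve the buffer (and, implicitly, storage) up front so
    // that a later close cannot fail for lack of space.
    void openSegment(uint64_t masterId, uint64_t segmentId) {
        SegmentInfo& info = segments[{masterId, segmentId}];
        info.stagingBuffer.assign(segmentSize, 0);
        info.open = true;
    }

    // writeSegment: copy data into the staging buffer; never refused while open.
    void writeSegment(uint64_t masterId, uint64_t segmentId,
                      uint32_t offset, uint32_t length, const char* data) {
        SegmentInfo& info = segments.at({masterId, segmentId});
        if (!info.open || offset + length > segmentSize)
            throw std::runtime_error("write to closed segment or past its end");
        std::memcpy(&info.stagingBuffer[offset], data, length);
    }

    // closeSegment: flush the staging buffer to durable storage, release the
    // buffer for reuse, and refuse any further writes.
    void closeSegment(uint64_t masterId, uint64_t segmentId) {
        SegmentInfo& info = segments.at({masterId, segmentId});
        info.storageFrame = writeToStorage(info.stagingBuffer);
        info.stagingBuffer.clear();
        info.open = false;
    }

    // freeSegment: forget the segment; it will never appear in a recovery.
    void freeSegment(uint64_t masterId, uint64_t segmentId) {
        segments.erase({masterId, segmentId});
    }

  private:
    int writeToStorage(const std::vector<char>&) { return 0; }  // placeholder for real storage
    size_t segmentSize;
    std::map<std::pair<uint64_t, uint64_t>, SegmentInfo> segments;
};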

Data Recovery

vector<uint64_t>
startReadingData(uint64_t masterId);

For the given masterId, implicitly close all open segments and return a list of the segments this backup has stored for it. This also hints to the backup that it should load all segments for this master from storage into RAM reserved for staging recoveries.

vector<LogEntry>
getRecoveryData(uint64_t masterId, uint64_t segmentId,
                uint64_t startId, uint64_t endId);

For the segment given by (masterId, segmentId), return all object and tombstone log entries that affect objects in the given start/end id range. This effectively serves a compacted version of the segment to the recovering master.
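
For example, the filtering a backup might perform for getRecoveryData could look like the following; the LogEntry layout here is a stand-in, not the real log format.

#include <cstdint>
#include <vector>

// Stand-in log entry; the real log format differs.
enum class EntryType { OBJECT, TOMBSTONE, OTHER };

struct LogEntry {
    EntryType type;
    uint64_t objectId;
    std::vector<char> data;
};

// Return the entries of one stored segment that a recovering master needs
// for the id range [startId, endId]; everything else is filtered out, so
// the reply is effectively a compacted view of the segment.
std::vector<LogEntry>
getRecoveryData(const std::vector<LogEntry>& segmentEntries,
                uint64_t startId, uint64_t endId)
{
    std::vector<LogEntry> result;
    for (const LogEntry& entry : segmentEntries) {
        if (entry.type != EntryType::OBJECT &&
            entry.type != EntryType::TOMBSTONE)
            continue;                      // skip headers, checksums, etc.
        if (entry.objectId < startId || entry.objectId > endId)
            continue;                      // outside the recovered id range
        result.push_back(entry);
    }
    return result;
}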

Should we support multitable filtering in the first design?

Normal Operation Memory Use

Approximately 2 * segmentSize * masters.
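
For example, assuming an 8 MB segment size and 100 masters replicating to a given backup, and assuming each master keeps at most two segments staged on that backup at a time (the open segment plus one being closed to storage), the staging buffers consume roughly 2 * 8 MB * 100 = 1.6 GB.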

Recovery Operation Memory Use and Amount of Data Needed from Disk

(recoveries * (ram per master) * replicas) / backups
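
For example, a single recovery of a 24 GB master with 2 replicas spread across 40 backups requires roughly (1 * 24 GB * 2) / 40 = 1.2 GB from each backup, which matches the 40-backup row of the table below.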

Using 2 replicas (values are GB of segment data per backup):

Backups   4 GB/Master   8 GB/Master   16 GB/Master   24 GB/Master   32 GB/Master   64 GB/Master
      1          8.00         16.00          32.00          48.00          64.00         128.00
      2          4.00          8.00          16.00          24.00          32.00          64.00
      3          2.67          5.33          10.67          16.00          21.33          42.67
      4          2.00          4.00           8.00          12.00          16.00          32.00
      5          1.60          3.20           6.40           9.60          12.80          25.60
     10          0.80          1.60           3.20           4.80           6.40          12.80
     20          0.40          0.80           1.60           2.40           3.20           6.40
     40          0.20          0.40           0.80           1.20           1.60           3.20

Add an extra round during recovery coordination to bring only one copy of each segment online, and we can halve the numbers above.

Note that if we assume 100 MB/s/disk, then we can recover at about 4000 MB/s across 40 single-disk machines. If we bump this up to 8 disks per machine (roughly 32 GB/s aggregate), it is not completely impossible to recover a 24 GB server in about 1s.

writeSegment consistency issue

Just reconcile all copies, taking the longest copy of each segmentId (the same rule used to prune the segment list during recovery coordination).