Replay
Data Scattering
Motivation
Fast Crash Recovery in RAMCloud
Ankita Kejriwal, Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum
● 60-node cluster, 32 Gbps InfiniBand network
● Recovered 11.7 GB in ~1 second
● Using flash improves this to 35 GB in 1.6 seconds
● Time spent replicating is the current bottleneck
● Implementation hides disk speed variance well
20 masters, 120 backups, 120 disks: 11.7 GB
1 master, 6 backups, 6 disks: 600 MB
● Masters replicate writes to backups immediately
● Backups buffer the writes and flush them to disk/flash in batches
  – Buffers need an auxiliary power source to survive power failures
● Backup locations are chosen randomly to scatter segments
  – Constraints on placement due to correlated failures
  – Tweaked to balance expected read time
  – Provides the needed read bandwidth for recovery
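A minimal sketch of randomized backup selection under placement constraints, to make the scattering bullets concrete. The poster does not give the algorithm; this sketch assumes that a few random candidates are sampled and the one with the lowest expected disk read time wins, that "correlated failures" means avoiding the master's rack and racks already holding a replica, and that segments are 8 MB with 3 replicas. The names `Backup`, `choose_backup`, and `place_segment` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Backup:
    name: str
    rack: str
    disk_mb_per_s: float      # measured disk read bandwidth
    segments_stored: int = 0  # segments already placed on this backup

    def expected_read_time(self, segment_mb: float = 8.0) -> float:
        """Seconds to read all segments here if one more is added (8 MB is an assumption)."""
        return (self.segments_stored + 1) * segment_mb / self.disk_mb_per_s


def choose_backup(backups, master_rack, used_racks, rng, candidates=5):
    """Pick one backup: random sample first, then prefer the shortest expected read time."""
    # Placement constraint (assumed form): avoid racks that would fail together
    # with the master or with a replica that was already placed.
    eligible = [b for b in backups
                if b.rack != master_rack and b.rack not in used_racks]
    if not eligible:
        raise ValueError("no backup satisfies the placement constraints")
    sample = rng.sample(eligible, min(candidates, len(eligible)))
    best = min(sample, key=lambda b: b.expected_read_time())
    best.segments_stored += 1
    return best


def place_segment(backups, master_rack, rng, replicas=3):
    """Choose `replicas` backups for one segment, each in a different rack."""
    chosen = []
    for _ in range(replicas):
        used = {b.rack for b in chosen}
        chosen.append(choose_backup(backups, master_rack, used, rng))
    return chosen
```

Sampling only a handful of candidates keeps placement close to random, which is what scatters segments across the cluster, while the read-time comparison steers new segments away from slow or heavily loaded disks.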
1. Process write request
2. Append object to log & update hash table
3. Replicate object to backups
4. Respond to write request
[Figure: a master (hash table, in-memory log) replicating to three backups, each holding a buffered segment in front of its disk]
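The four write steps can be sketched as follows. This is an illustrative toy model, not RAMCloud code: `Master`, `BufferingBackup`, and their methods are hypothetical names, and replication is a plain in-process call rather than an RPC.

```python
class BufferingBackup:
    """Holds replicated entries in a memory buffer and writes them to disk in batches."""

    def __init__(self):
        self.buffered_segment = []  # in memory; needs backup power to survive an outage
        self.disk = []

    def replicate(self, entry):
        self.buffered_segment.append(entry)  # buffered, not yet durable on disk

    def flush(self):
        self.disk.extend(self.buffered_segment)  # batch write to disk/flash
        self.buffered_segment.clear()


class Master:
    """Keeps the only DRAM copy: an append-only log plus a hash table index."""

    def __init__(self, backups):
        self.log = []
        self.hash_table = {}  # key -> position of the newest entry in the log
        self.backups = backups

    def write(self, key, value):
        entry = (key, value)                      # 1. process write request
        self.log.append(entry)                    # 2. append object to log ...
        self.hash_table[key] = len(self.log) - 1  #    ... and update hash table
        for backup in self.backups:               # 3. replicate object to backups
            backup.replicate(entry)
        return "OK"                               # 4. respond to write request

    def read(self, key):
        return self.log[self.hash_table[key]][1]
```

In this model the write returns as soon as every backup has buffered the entry; the disk write happens later in `flush`, which is why the buffers need protection against power failure.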
[Figure: replay pipeline between a backup (disk) and a recovery master (hash table, in-memory log)]
1. Read disk
2. Divide segment data
3. Transfer data to masters
4. Add objects to hash table & log
5. Replicate log data to backups
6. Write backup data to disk
● All data always in RAM
  – 1,000 to 10,000 commodity servers
  – 64 GB DRAM/server or more
● Durability goals:
  – Small impact on performance
  – Minimum cost and energy
● Keep replicas in DRAM of other servers?
  – Triples cost and energy usage
  – Power failures are still a problem
● RAMCloud's approach: fast recovery
  – 1 copy in DRAM, backup copies on disk/flash
  – Hypothesis: with fast enough recovery, failures will not be noticed
● Every host is involved in recovery, and all hosts work in parallel
● Work on each host also proceeds in parallel (steps are pipelined)
● Recovery masters make several parallel requests to backups
● This prevents pipeline stalls when backups are not ready with data
● New log segments are buffered until recovery is complete
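A minimal sketch of a recovery master keeping several requests to backups in flight, so that one slow backup does not stall replay. It uses a thread pool as a stand-in for asynchronous RPCs; `fetch_segment`, `recover`, and `max_outstanding` are hypothetical names, and each segment carries a single made-up object.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_segment(backup_delay_s, segment_id):
    """Stand-in for an RPC to a backup; the sleep models a backup that is not ready yet."""
    time.sleep(backup_delay_s)
    return segment_id, [(f"key{segment_id}", f"value{segment_id}")]


def recover(segment_delays, max_outstanding=4):
    """Replay segments in whatever order the backups deliver them."""
    hash_table = {}
    log = []
    with ThreadPoolExecutor(max_workers=max_outstanding) as pool:
        futures = [pool.submit(fetch_segment, delay, sid)
                   for sid, delay in enumerate(segment_delays)]
        # as_completed yields finished requests first, so replay keeps
        # moving while slower backups are still reading from disk.
        for future in as_completed(futures):
            _, objects = future.result()
            for key, value in objects:
                log.append((key, value))
                hash_table[key] = len(log) - 1
    return hash_table, log
```

Because segments arrive out of order, a real replay would have to compare object versions before overwriting a hash table entry; the sketch sidesteps this by giving every segment a distinct key.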
Approach
[Figure: three recovery configurations, each connecting recovery master(s) to backups over the datacenter network]
Disk Bottleneck: 64 GB / 3 disks / 100 MB/s/disk = 3.5 minutes
Network Bottleneck: 64 GB / 10 Gbps = 1 minute
Fast Recovery: the crashed master's data is recovered by many recovery masters reading from many backups
● Static set of backups is insufficient
  – Harness scale: Use many disks during recovery
    – From all 1,000+ machines
  – Scatter data throughout the cluster
  – 64 GB / 1000 disks / 100 MB/s/disk = 0.6 s
● Cannot reconstitute data quickly through a single NIC
  – Harness scale: Use many hosts (NICs)
    – About 100 recovery masters will do
    – Each recovery master can recover 400-800 MB/s
  – Need a ratio of about 6 disks to each recovery master
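The back-of-the-envelope estimates in this panel can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes), which is what matches the poster's rounded figures; the function names are illustrative.

```python
GB = 1e9  # decimal units, assumed
MB = 1e6


def disk_limited_seconds(data_bytes, disks, disk_bytes_per_s=100 * MB):
    """Time to read the data when it is spread evenly over `disks` disks."""
    return data_bytes / (disks * disk_bytes_per_s)


def network_limited_seconds(data_bytes, nic_bits_per_s):
    """Time to push the data through a single network interface."""
    return data_bytes * 8 / nic_bits_per_s


data = 64 * GB

three_disks = disk_limited_seconds(data, 3)       # about 213 s, i.e. roughly 3.5 minutes
single_nic = network_limited_seconds(data, 10e9)  # about 51 s, i.e. roughly 1 minute
scattered = disk_limited_seconds(data, 1000)      # about 0.64 s

# With about 100 recovery masters, each one replays 640 MB, which a master
# recovering 400-800 MB/s can finish in roughly a second.
per_master_bytes = data / 100

print(f"3 disks: {three_disks:.0f} s, one NIC: {single_nic:.1f} s, "
      f"1000 disks: {scattered:.2f} s, per master: {per_master_bytes / MB:.0f} MB")
```

One reading of the 6:1 ratio is that a recovery master replaying at around 600 MB/s needs about six 100 MB/s disks to keep it supplied with data.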
Results
[Figure: Recovery Time (ms), 0 to 1800, vs. Number of 600 MB Partitions (Recovery Masters), 0 to 20; series: Total Recovery, Max. Disk Reading, Avg. Disk Reading]