Data persistence
Assumptions
- Even though the current copy of information is in main memory, data must be just as durable as if it were stored in a traditional disk-based database.
- The durability of data must be guaranteed before a write request returns.
- With thousands of servers in the cluster, machines will die constantly; it's possible that the system will almost always be in a state of crash recovery.
- Recovery from "simple" failures must be instantaneous and transparent.
- Data consists of small chunks of information (a few hundred bytes), updated in their entirety.
Failure scenarios
We might want RAMCloud data to survive any or all of the following scenarios:
- Bit errors in memory. May need additional checks on top of those provided by hardware to make sure errors are detected.
- Sudden loss of a single machine. A hardware failure could make the machine's disk inaccessible for an extended period of time. Data unavailability in the event of a single machine failure may not be tolerable.
- Sudden loss/unavailability of a group of machines, such as a rack (e.g., the top-of-rack switch could fail).
- Power failure. This may not introduce any additional problems beyond those presented by other failure modes. It appears that large data centers may have reliable enough backup to "guarantee" reliable power. Even in the event of a power failure, we may be able to assume enough warning to flush (modest amounts of) volatile information to disk.
- Complete loss of a datacenter:
  - Do we need to survive this?
  - How long a period of unavailability is acceptable during disaster recovery (does the system have to support hot standbys, versus backups in other data centers)?
- Network partitions: how likely are these within a datacenter?
- Byzantine failures?
- Thermal emergency: need to shut down machines to reduce energy usage. This issue may relate more to scaling the system up and down than to data durability.
Possible solutions
- Approach #1: write data to disk synchronously. Too slow.
- Approach #2: write data to flash memory synchronously? Also probably too slow.
- An RPC to another server is much faster than a local disk write, so "commit" by sending updates to one or more servers.
- Approach #3: keep all backup copies in main memory. This would at least double the amount of memory needed and hence the cost of the system. Too expensive. Also, doesn't handle power failures.
- Approach #4: temporarily commit to another server for speed, but eventually commit data to disk or some other secondary storage:
  - This would handle power failures as well as crashes.
  - However, we can't afford to do a disk I/O for each update.
  - One approach: log plus checkpoint, with staging (a minimal sketch of this scheme appears at the end of this section):
    - Keep an operation log that records all state changes.
    - For each server, the operation log is stored on one or more other servers (call these backups).
    - Before responding to an update request, record log entries on all the backup servers.
    - Each backup server stores log entries temporarily in main memory. Once it has collected a large number of them, it writes them to a local disk. This way the log ends up on disk relatively soon, but doesn't require a disk I/O for each update.
    - Occasionally write a checkpoint of the server's memory image to disk.
    - Record information about the latest checkpoint in the operation log.
    - After a crash, reload the memory image from the latest checkpoint, then replay all of the log entries more recent than that checkpoint.
- Problem: checkpoint reload time.
  - Suppose the checkpoint is stored on the server's disk.
  - If the disk transfer rate is 100 MB/second and the server has 32 GB of RAM, it will take about 320 seconds just to reload the checkpoint.
  - This time will increase in the future, as RAM capacity increases faster than disk transfer rates.
  - If the checkpoint is stored on the disk(s) of one or more other machines and reconstituted on a server over a 1 Gb network, the reload time is about the same.
- Possible solution:
  - Split the checkpoint into, say, 1000 fragments and store each fragment on a different backup server.
  - These fragments are still large enough to be transferred efficiently to/from disk.
  - After a crash, each of the 1000 servers reads its fragment from disk in parallel (only about 0.3 seconds). During recovery, these servers temporarily fill in for the crashed server.
  - Eventually a replacement server comes up, fetches the fragments from the backup servers, rolls forward through the operation log, and takes over from the backups.
  - Or, alternatively, the backup becomes the permanent home for its fragment. Fragments might later get reassigned as part of the scaling facilities in the system.
  - For this to work, each backup server will have to replay the portion of the operation log relevant to it; perhaps split the log across backup servers?
  - A potential problem with this approach: if it takes 1000 servers to recover, there is a significant chance that one of them will crash during the recovery interval, so RAMCloud will probably need to tolerate failures during recovery. Would a RAID-like approach work for this? Or, just use full redundancy (disk space probably isn't an issue).
- What is the cost of replaying log entries?
  - Suppose the fraction of operations that are updates is U.
  - Suppose also that it takes the same amount of time for a basic read operation, a basic update, and replaying an update.
  - Then it will take U seconds to replay each second of typical work.
  - If U is 0.1 and a crash happens 1000 seconds after the last checkpoint, then it will take 100 seconds to replay the log, assuming it is all done on the same machine. This suggests that checkpoints would need to be made frequently, which sounds expensive.
  - On the other hand, with the fragmented approach described above, 1000 machines might handle the replay in parallel, which means 100,000 seconds of activity (more than a day) could be replayed in 10 seconds.
- Must RAMCloud handle complete failure of an entire datacenter?
  - This would require shipping checkpoints and operation logs to one or more additional datacenters.
  - How fast must disaster recovery be? Is recovery from disk OK in this case?
- Checkpoint overhead:
  - If a server has 32 GB of memory and a 1 Gbit network connection, it will take 256 seconds of network time to transmit a checkpoint externally.
  - If a server is willing to use 10% of its network bandwidth for checkpointing, that means a checkpoint every 2560 seconds (about 43 minutes).
  - If the server is willing to use only 1% of its network bandwidth for checkpointing, that means a checkpoint every 25600 seconds (about seven hours).
- Cross-datacenter checkpointing:

  RAMCloud capacity | Bandwidth between datacenters | Checkpoint interval
  50 TB             | 1 Gb/s                        | 556 hours (23 days)
  50 TB             | 10 Gb/s                       | 55.6 hours (2.3 days)
- Most checkpoints will be almost identical to their predecessors; we should look for approaches that allow old checkpoint data to be retained for data that hasn't changed, in order to reduce the overhead for checkpointing.
- One way of thinking about the role of disks in RAMCloud is that we are creating a very simple file system with two operations:
  - Write batches of small changes.
  - Read the entire file system at once (during recovery).
- Another approach to checkpointing: full pages
  - Use a relatively small "page" size (1024 bytes?).
  - During updates, send entire pages to backup servers.
  - Backup servers must coalesce these updated pages into larger units for transfer to disk.
  - The additional overhead for sending entire pages during commit may make this approach impractical.
- Yet another approach to checkpointing:
  - Server sends the operation log to backups.
  - It's up to the backups to figure out how to persist this: they must generate a complete enough disk image so that they can quickly regenerate all of the data they have ever received. No explicit checkpoint data is sent to backups.
  - During recovery the backup can provide the recovered data in any order (it can always be rearranged in memory).
  - This starts sounding a lot like a log-structured file system.
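To make the log-plus-checkpoint staging idea under Approach #4 a bit more concrete, here is a minimal Python sketch of the backup side. It is only an illustration of the buffering behavior described above, not RAMCloud's actual implementation; the names (BackupBuffer, FLUSH_THRESHOLD, append, flush, replay) and the entry format are invented for this sketch, and a real backup would also have to deal with checkpoint fragments, replication factors, and failures during recovery.

```python
# Sketch of "log plus checkpoint, with staging" on a backup server.
# All names and formats are assumptions made for this illustration.

import os
import pickle


class BackupBuffer:
    """Backup-side staging: log entries are held in memory and written to
    disk in large batches, so the log reaches disk relatively soon without
    requiring one disk I/O per update."""

    FLUSH_THRESHOLD = 8 * 1024 * 1024   # flush once ~8 MB of entries accumulate

    def __init__(self, log_path):
        self.log_path = log_path
        self.pending = []        # log entries not yet on disk
        self.pending_bytes = 0

    def append(self, entry):
        """Called for every update on the primary, before the write is acked."""
        self.pending.append(entry)
        self.pending_bytes += len(entry["value"])
        if self.pending_bytes >= self.FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        """Write the buffered entries to local disk in one batch."""
        with open(self.log_path, "ab") as f:
            pickle.dump(self.pending, f)
            f.flush()
            os.fsync(f.fileno())
        self.pending = []
        self.pending_bytes = 0

    def replay(self, image, since_seq):
        """Recovery: apply log entries newer than the latest checkpoint to a
        reloaded memory image (a plain dict from key to value in this sketch)."""
        entries = []
        if os.path.exists(self.log_path):
            with open(self.log_path, "rb") as f:
                while True:
                    try:
                        entries.extend(pickle.load(f))
                    except EOFError:
                        break
        entries.extend(self.pending)          # entries still only in memory
        for e in entries:
            if e["seq"] > since_seq:
                image[e["key"]] = e["value"]
        return image
```

In this sketch a primary would call append on each of its backups before acknowledging a write; after a crash, the recovering server reloads the latest checkpoint fragment into the image and calls replay with that checkpoint's sequence number.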
Update fraction and log length
- What will be the update fraction (U) in the system?
- If U is 0.1 we can't handle the logs (the sketch after this list reproduces these numbers):
  - At 10^6 operations/sec/server, U = 0.1, and 10^3 log bytes/write, we have 100 MB/sec of log per server, which (a) fills the outgoing network and (b) completely fills the bandwidth of a disk. For a cloud with 10^4 servers we have 1 TB/sec of log; we certainly couldn't transmit all of this externally to a backup datacenter.
- Perhaps U << 0.1? Our expected access pattern multiplies reads (e.g., one Facebook page reads from 1000 servers); elevating the read rate effectively reduces U.
- If U is 10^-4 then each server generates 100 KB/sec of log, which is much more manageable.
- If U is 10^-4 then 10^4 servers generate 1 GB/sec of log, which may (barely) be tolerable.
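The numbers above follow directly from operations per second, the update fraction, and bytes per log entry; the short sketch below just reproduces that arithmetic. The parameter values are the assumptions stated in these notes, not measurements.

```python
# Back-of-envelope log bandwidth, reproducing the numbers above.

def log_bandwidth(ops_per_sec, update_fraction, bytes_per_log_entry, servers):
    """Return (per-server, cluster-wide) log bandwidth in bytes/sec."""
    per_server = ops_per_sec * update_fraction * bytes_per_log_entry
    return per_server, per_server * servers

# U = 0.1: 100 MB/sec per server, 1 TB/sec for 10^4 servers -- too much.
print(log_bandwidth(ops_per_sec=1e6, update_fraction=0.1,
                    bytes_per_log_entry=1e3, servers=1e4))
# -> (100000000.0, 1000000000000.0)

# U = 10^-4: 100 KB/sec per server, 1 GB/sec cluster-wide -- manageable.
print(log_bandwidth(ops_per_sec=1e6, update_fraction=1e-4,
                    bytes_per_log_entry=1e3, servers=1e4))
# -> (100000.0, 1000000000.0)
```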
Additional questions and ideas
- Could the operation log also serve as an undo/redo log and audit trail?
- What happens if there is an additional crash during recovery?
- What level of replication is required for RAMCloud to be widely accepted? Most likely, single-level redundancy will not be enough.
- The approach to persistence may depend on the data model. For example, if there are indexes, will they require different techniques for persistence than data objects?
Follow-up discussion topics
- Design a different checkpointing algorithm using LFS-like techniques to avoid re-checkpointing data that hasn't changed. For example, how can logging be organized to merge repeated writes for the same object? (A minimal sketch of one possible organization follows this list.)
- Exactly how to make data available during recovery.
- With ubiquitous flash memory, how would we approach the checkpointing problem?
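For the first topic above, one purely illustrative organization is sketched below: when a log segment is cleaned, keep only the newest entry per object, so repeated writes to the same object collapse into one and unchanged data never needs to be re-checkpointed. The function name and the (seq, object_id, value) entry format are assumptions for this sketch, not a worked-out proposal.

```python
# Illustrative LFS-style log cleaning: merge repeated writes so that only
# the newest entry per object survives in the rewritten segment.

def clean_segment(entries):
    """Given log entries as (seq, object_id, value) tuples, keep only the
    newest entry for each object, preserving sequence order."""
    newest = {}
    for seq, obj, value in entries:
        prev = newest.get(obj)
        if prev is None or seq > prev[0]:
            newest[obj] = (seq, obj, value)
    return sorted(newest.values())   # rewritten, much smaller segment

segment = [(1, "a", "v1"), (2, "b", "v1"), (3, "a", "v2"), (4, "a", "v3")]
print(clean_segment(segment))        # [(2, 'b', 'v1'), (4, 'a', 'v3')]
```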