Current Design

Goals

Achieving fast TTR (time to recovery) of a single machine from a single machine is not practical, because reconstituting 64 GB at disk rate requires several minutes. Thus, we need to bring more disk heads or SSDs to bear on the problem. To that end we introduce shards. A shard is simply a contiguous key range. The key range for each shard is chosen so that the shard represents approximately some fixed size in bytes, with some hysteresis. Shards can also be split/merged to allow key ranges to move between servers for request load balancing, but this is secondary to the TTR goal. (Question: What changes if shards are contiguous machine/address ranges, allowing objects to move between shards? One problem: dealing with client-side key-to-machine lookups.) We therefore expect to maintain a near-constant number of shards for a single master's main memory. Each of these shards is backed up by z other (hopefully distinct) machines.
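To make the single-disk bottleneck concrete, here is a minimal back-of-envelope sketch. The 64 GB figure comes from the text above; the ~100 MB/s per-disk rate, the helper name, and the 100-disk case are illustrative assumptions.

```python
# Back-of-envelope TTR: time to re-read one master's DRAM image from disk.
# 64 GB is from the design text; 100 MB/s per disk and the disk counts are
# assumed, illustrative numbers.
GB = 1024 ** 3
MB = 1024 ** 2

def ttr_seconds(dram_bytes=64 * GB, per_disk_rate=100 * MB, num_disks=1):
    """Seconds to reconstitute dram_bytes when num_disks read in parallel."""
    return dram_bytes / (per_disk_rate * num_disks)

print(ttr_seconds())               # ~655 s (~11 minutes) from a single disk
print(ttr_seconds(num_disks=100))  # ~6.6 s if the data is sharded across 100 disks
```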

A master for some shard receives all writes to that shard. All of those writes are logged on the master and forwarded to the backups. The backups log each write in memory in precisely the same way. Once the in-memory backup log reaches a configurable length, that log is committed to disk (we call these fixed-length, shard-specific log fragments segments). This gives us a knob to control how much buffering we need in order to achieve some target write bandwidth for our disks. The necessary segment size depends on the write bandwidth and seek latency of the logging devices. Standard disks appear to need about 6 MB segments to achieve 90% of their write bandwidth, whereas SSDs appear to need only 128 to 512 KB segments.
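The 6 MB and 128-512 KB figures fall out of a simple efficiency model; the sketch below shows the arithmetic. The specific seek latencies and bandwidths are assumed round numbers, not measurements.

```python
# Segment size needed to reach a target fraction of raw write bandwidth.
# Model: efficiency f = transfer / (seek + transfer), transfer = size / bandwidth,
# so size = f / (1 - f) * seek * bandwidth. Device numbers below are assumptions.
MB = 1024 ** 2

def segment_size_for_efficiency(f, seek_s, bw_bytes_per_s):
    return f / (1.0 - f) * seek_s * bw_bytes_per_s

# Standard disk: ~10 ms positioning time, ~70 MB/s sequential write
print(segment_size_for_efficiency(0.9, 0.010, 70 * MB) / MB)    # ~6.3 MB
# SSD: ~0.1 ms access latency, ~200 MB/s write
print(segment_size_for_efficiency(0.9, 0.0001, 200 * MB) / MB)  # ~0.18 MB (~180 KB)
```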

Recap

Assumptions

Failures

Possible Solution

9/30 Discussion Notes

Data Durability

Write Bandwidth Limitations

Questions

10/7 Discussion (Logging/Cleaning on Backups)

Knobs/Tradeoffs

Cluster writes into chunks ("shardlets") to be batch-written to segments on disk
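A minimal sketch of that batching, using hypothetical names (ShardletBuffer, SHARDLET_BYTES, SEGMENT_BYTES, flush_segment) rather than anything from the actual code: incoming writes are clustered into shardlet chunks, and accumulated shardlets are written to disk together as one segment.

```python
# Hypothetical sketch of shardlet batching on a backup; the sizes and the
# flush_segment callback are illustrative assumptions.
SEGMENT_BYTES = 6 * 1024 * 1024    # ~6 MB segment (90% of disk write BW, per above)
SHARDLET_BYTES = 64 * 1024         # size of one write cluster ("shardlet"); assumed

class ShardletBuffer:
    def __init__(self, flush_segment):
        self.flush_segment = flush_segment  # callable that writes one segment to disk
        self.shardlets = []                 # completed shardlets awaiting a segment write
        self.current = bytearray()          # shardlet currently being filled

    def append(self, write_record: bytes):
        self.current += write_record
        if len(self.current) >= SHARDLET_BYTES:              # shardlet is full
            self.shardlets.append(bytes(self.current))
            self.current = bytearray()
        if sum(map(len, self.shardlets)) >= SEGMENT_BYTES:   # enough for a segment
            self.flush_segment(b"".join(self.shardlets))     # one batched disk write
            self.shardlets = []
```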

Question remains: Log compaction or checkpointing to keep replay time from growing indefinitely

No-read LFS-style cleaning

Back of the envelope

Master cleans shardlets whose utilization falls below some threshold

Shardlet Util. | Disk/Master Memory Util. (Median Shardlet Util.) | TTR Off Optimal | Remaining Disk BW Off Optimal | Combined WE with 90% approach from earlier
50%            | 75%                                              | 1.33x slower    | ~50%                          | 45%
10%            | 55%                                              | 1.82x slower    | ~88%                          | 80%
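The relationship among the first three columns can be checked with a short script, under the assumed model that live shardlet utilizations end up roughly uniform between the cleaning threshold and 100% (so disk/memory utilization tracks the median, and TTR scales with its inverse since dead data must be read and replayed too). The function names are illustrative.

```python
# Reproduce the table's first three columns from the cleaning threshold,
# assuming shardlet utilization is roughly uniform in [threshold, 1.0].

def median_util(clean_threshold):
    return (clean_threshold + 1.0) / 2.0

def ttr_off_optimal(clean_threshold):
    return 1.0 / median_util(clean_threshold)

for t in (0.50, 0.10):
    print(t, median_util(t), round(ttr_off_optimal(t), 2))
# 0.50 -> 75% median util, 1.33x slower
# 0.10 -> 55% median util, 1.82x slower
```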

Fast (fine-grained, shard-interleaved) logging + shardlets

10/8 Discussion ("Paging" to yield higher utilization on masters)