Distribution of data among servers, replication, locality
Addressing
Some possibilities
Structured, Unstructured
Random, Hashes, Sequential
User-specified, generated
Need address capacity for at least ~2^43 objects (2^48 with headroom)
Hence, unstructured addresses probably need to be at least 64 bits (2^64)
Estimate: 64 * 2^30 bytes/machine * 2^14 machines = 2^50 bytes; 2^50 bytes / 2^7 bytes/obj = 2^43 objects
Sequential and Structured
Temporal id locality
Gives clients some means to control locality, if that can be used to advantage
Allocation could be tricky to make practical in a distributed setting
How many tables does a typical (or large) RDBMS have?
How many applications do we expect to support in a single RAMCloud instance?
| application (16 bits) | table (16 bits) | address (64 bits) |
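As a concrete sketch, the layout above could be packed as follows; the struct and function names are hypothetical, and the 96-bit split is just the one proposed above:

```cpp
#include <cstdint>

// Hypothetical packing of the structured address proposed above:
// | application (16 bits) | table (16 bits) | address (64 bits) |
struct StructuredAddr {
    uint16_t application;  // owning application
    uint16_t table;        // table within the application
    uint64_t address;      // object id within the table
};

// The 32-bit application/table prefix doubles as an access-control
// handle if permissions are checked per application or per table.
inline uint32_t prefixOf(const StructuredAddr& a) {
    return (uint32_t(a.application) << 16) | uint32_t(a.table);
}
```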
Random
Smaller ids (64-bit?)
Not if we want these to look like capabilities
Simple to make generation fast
Not meaningful to client (both a plus and minus)
Indexing must be done by clients and stored in the cloud
Akin to FriendFeed's setup
Content-based
Can't share objects without references
Less general
Potential vulnerabilities if hashes have weaknesses (are there good 128-bit hashes?)
Built-in de-duplication
Which also opens a covert storage channel under multi-tenancy
Distribution
Effects
+ Capacity
Strictly necessary for this reason alone
+ Throughput/Latency
+ Avoid hot spots
Multiple hosts can service requests in parallel
Do either of these matter if nodes can handle 1M requests/sec?
- Throughput/Latency
Cost to map address -> shard, shard -> host on client
Cost to map address -> metadata (physical location, size, and permissions)
- Consistency
- Reliability
How do durability and reliability tie in to the discussion?
Evaluation
Reqs/sec/GB
Index size/GB
Questions
How much metadata space is needed for all tables/applications?
Object Level
Up to 2^43 objects; per object: size (4 bytes?) and permissions/appid, plus tableid if not in the address (4 bytes)
(2^43)*8 = 64 TB fully loaded (6.25% of capacity), not including the index size
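A minimal sketch of what such an 8-byte entry could look like; the exact field split is an assumption based on the sizes guessed above:

```cpp
#include <cstdint>

// Hypothetical 8-byte per-object metadata entry.
struct ObjectMeta {
    uint32_t size;   // object size in bytes (4 bytes)
    uint32_t owner;  // permissions/appid, or tableid if not in the address (4 bytes)
};
static_assert(sizeof(ObjectMeta) == 8, "keep the 8-byte estimate honest");
// Fully loaded: 2^43 objects * 8 bytes = 2^46 bytes = 64 TB.
```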
How does metadata replication occur and what is the frequency?
On writes for object metadata
Shard Mappings
Lazily
Not sufficient when a client discovers a host is down
at minimum, mappings must be updated in the new replicas very quickly
May additionally want leases, heartbeats, or something similar (as in MapReduce) to ensure enough copies of shards are maintained after a failure, even if the data is cold (see the sketch below)
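A rough sketch of what such a lease/heartbeat check could look like on the coordinator side; everything here (names, lease length, the re-replication hook) is an assumption, not a worked-out design:

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct LeaseTracker {
    std::chrono::seconds leaseLength{10};  // assumed lease length
    std::unordered_map<uint32_t, Clock::time_point> lastBeat;      // hostId -> last heartbeat
    std::unordered_map<uint32_t, std::vector<uint64_t>> shardsOf;  // hostId -> shards held

    void heartbeat(uint32_t hostId) { lastBeat[hostId] = Clock::now(); }

    // Run periodically: any host whose lease lapsed triggers
    // re-replication of its shards, whether or not clients have
    // touched that (possibly cold) data.
    void checkLeases() {
        const auto now = Clock::now();
        for (const auto& [host, beat] : lastBeat)
            if (now - beat > leaseLength)
                for (uint64_t shard : shardsOf[host])
                    reReplicate(shard, host);
    }

    void reReplicate(uint64_t shard, uint32_t deadHost) {
        // Stub: a real coordinator would pick a new host and copy
        // the shard from a surviving replica.
        (void)shard; (void)deadHost;
    }
};
```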
Approaches
addr mod servers
+ Balanced storage
Statistically, if keys are random
Need other protections if keys are not random (e.g. user can control placement)
- Statistically requires the same amount of storage on every machine
- Adding/removing machines
Requires nearly all objects to move, or the cluster to only double in size each time (see the sketch below)
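For concreteness, a tiny sketch of the modulo scheme and its resharding problem (names hypothetical):

```cpp
#include <cstdint>

// addr mod servers: O(1) and stateless, but changing numServers
// remaps almost every address.
inline uint32_t hostFor(uint64_t addr, uint32_t numServers) {
    return uint32_t(addr % numServers);
}
// e.g. addr = 12: with 5 servers it lives on host 2, with 6 servers
// on host 0 -- so growing from 5 to 6 forces most objects to move.
```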
Mapping/Address Space Partitioning
+ Adding/removing easier
Layer of indirection shard -> host
- Requires reverse lookup addr -> shard, requires a range search
Goal: RAMCloud address -> metadata as quickly as possible (i.e. with as few requests and as little processing as possible), where metadata includes at least the physical location, size, and permissions.
Ideal: 0 network messages and O(1) address to host mapping time with high probability
Implies all clients are aware of mapping.
Possibly update the mapping only on lookup failures: requests sent to an incorrect host get a reply with a mapping correction (sketched below). The exception: replicas must still be notified of failures quickly.
Complication: access control requires highly consistent mapping replication if control is attached to addresses (e.g. the application/table is part of the structured address). Otherwise a missing entry causes spurious denial, and a stale extra entry allows access after revocation.
Objects (likely as whole shards) may need to relocate due to failures, load, or capacity.
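A sketch of the lazy, correction-on-miss scheme described above; all names are hypothetical, and sendRead() is a stub standing in for the actual RPC:

```cpp
#include <cstdint>
#include <unordered_map>

// Lazy client-side mapping maintenance: send requests using the
// cached shard -> host map, and repair an entry only when the
// contacted host replies with a correction.
struct Reply {
    bool wrongHost = false;   // the contacted host no longer owns the shard
    uint32_t correctHost = 0; // valid when wrongHost is true
    // ... object data on success ...
};

class ClientMap {
    std::unordered_map<uint64_t, uint32_t> shardToHost;  // cached mapping

public:
    Reply read(uint64_t shardId, uint64_t addr) {
        Reply r = sendRead(shardToHost[shardId], shardId, addr);
        if (r.wrongHost) {
            shardToHost[shardId] = r.correctHost;        // lazy repair
            r = sendRead(r.correctHost, shardId, addr);  // retry once
        }
        return r;
    }

private:
    Reply sendRead(uint32_t host, uint64_t shardId, uint64_t addr) {
        // Stub for the network request; a real client would RPC here.
        (void)host; (void)shardId; (void)addr;
        return Reply{};
    }
};
```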
Fixed Number of Objects per Shard
+ O(1) lookup
Just keep a table: shift the address to get the shard number, then look up the host by shard id (see the sketch at the end of this list)
Table is too large initially or too small eventually
Assuming 64-bit addrs, 32-bit host addrs, 4 replicas (i.e. 128 bits stored per shard entry)
| log2 # shards | log2 # addrs/shard | index size |
|---|---|---|
| 30 | 34 | 16 GB |
| 40 | 24 | 16 TB |
- Size of a shard can never grow beyond capacity of the largest machine
Might not always be able to store even when there is excess capacity in the system
Could do something hackish: have a saturated host forward requests it doesn't have the entries for to another host holding a different part of the same shard
- Nearly impossible to determine address range chunk size initially
- Nightmare if we decide we need a new address range chunk size
Requires "rehashing" the whole thing
- Adding and removing hosts
Some of these issues are addressable with linear hashing techniques.
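The O(1) lookup referenced above, sketched with the 2^30-shard row of the table (the constants are that example's, not a decided design):

```cpp
#include <cstdint>
#include <vector>

// Fixed number of addresses per shard: shift to get the shard id,
// then index a flat table. Constants follow the 2^30-shard example
// above (2^34 addrs/shard, 4 replicas of 32-bit host addrs).
constexpr unsigned kAddrBitsPerShard = 34;
struct ShardEntry { uint32_t replicas[4]; };  // 128 bits per entry

struct ShardTable {
    std::vector<ShardEntry> entries;  // 2^30 entries * 16 bytes = 16 GB
    const ShardEntry& lookup(uint64_t addr) const {
        return entries[addr >> kAddrBitsPerShard];  // O(1): shift + index
    }
};
```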
DHT
Allows variable- or fixed-width address-space shards and variable- or fixed-size storage shards. Flexibility along both dimensions, at some cost in time and data-structure code complexity. Could be hard to implement in an FPGA?
+ Simple
+ Natural replication
Shard address space (needed for distribution anyway), multiple servers cover each shard
- Latency
Address to shard mapping has log(# shards) time in general, if shards are not fixed width in terms of # of addresses
Can be mitigated, at a cost in index space, using radix trees or tries
How many levels is too deep?
Even in the face of cache misses? (what is the cache-miss cost?)
At 3.0 GHz we get 3000 cycles per us
Cache misses seem to be about 60 cycles = 20 ns?
+ Load sharing
- More difficult to co-locate related data on a single machine
Probably the case that we want to intentionally distribute related data (more network overhead, but reduces latency because lookups happen on independent machines)
Extendible
Still an expensive "rehash" when the index doubles in size; it just consists of making an entire copy of the index
Linear
Not great unless shards are split prematurely, since the original algorithm still requires bucket chaining
Even so, it doesn't provide load relief where it's needed, but rather on nodes in a predefined order
Consistent
log(# shards) lookup time
In reality, high-radix radix trees can solve this
Example: create 1 node (shard) in the consistent hash map per GB in the system.
2^14 machines * 64 nodes = 2^20 nodes (shards)
Using a 64-entry radix tree, like the one that ships with Linux for reverse address mappings, gives an expected ~3.3 accesses per lookup with an index likely between 1 and 128 MB
Tried this with an off-the-shelf radix tree that claims to be fast and cache-optimized (Judy): with 2^20 nodes, average lookup time is about 1000 cycles on an AMD64 and 520 cycles on a recent Core 2 Duo and a Xeon; it allocated 15 MB of RAM during inserts. Even with some cache pressure (1000 lines touched randomly) it still responds in about 1800 cycles.
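For reference, the baseline consistent-hash lookup that gives the log(# shards) bound above: shard points on a sorted ring, searched with binary search (a radix tree or Judy array would replace the ring to get the timings quoted). Names here are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Basic consistent hashing: shard points placed on a 64-bit ring;
// an address maps to the first point at or after its hash, found
// by binary search -- hence log(# shards) lookup time.
struct RingPoint { uint64_t point; uint32_t host; };

struct ConsistentRing {
    std::vector<RingPoint> ring;  // sorted by point; e.g. 2^20 points per the example above

    uint32_t hostFor(uint64_t hashedAddr) const {
        auto it = std::lower_bound(
            ring.begin(), ring.end(), hashedAddr,
            [](const RingPoint& p, uint64_t h) { return p.point < h; });
        if (it == ring.end()) it = ring.begin();  // wrap around the ring
        return it->host;
    }
};
```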
B-Trees
+ Supports range queries on totally ordered keys
+/- Allows several records from the same table to be returned with a single request
Can give control over placement if they can take advantage of it for locality
May cause a server to become a hot spot
Is this any more true than with hashing?
- Latency (at least as bad as DHT)
Hard in FPGA?
Replication
Effects
- Capacity
+ Throughput
+ Mitigates hot spots
If there is a hot spot, data reduction may not solve overloading problems
If a single server can handle 1M requests/second, is there any need to replicate?
If the load gets too high, perhaps reduce the load by reducing the amount of data stored on a server, rather than replicating the data
+ Latency
Eliminates cross data center requests (East to West Coast datacenters)
- Consistency
A system without replication would be much easier to manage
Evaluation
Reqs/sec
MTBF if considering Reliability
Questions
How do we update the address mapping when a host fails?
Clients that detect the failed node initiate an update of the mapping
Where, to whom?
Locality
Effects
+ Network traffic
+ Latency for serial requests
+ Performance isolation in multi-tenant environments
+ Economy of metadata
For example, only access control information for data which resides on a host must be replicated to that host
Is there any locality in interesting database applications?
The most interesting form of locality is locality within a request: would like to satisfy each request with a single call to a single server, if possible
Evaluation
Bytes transferred/response
Reqs/sec
bps leaked between users
Metadata/GB