Distribution of data among servers, replication, locality
Addressing
Some possibilities
Structured, Unstructured
Random, Hashes, Sequential
User-specified, generated
Need address capacity for at least ~2^43 objects (2^48 with headroom)
Hence, unstructured addresses probably need to be at least 64 bits (2^64)
Estimate: 64 * 2^30 bytes/machine * 2^14 machines = 2^50 bytes; 2^50 bytes / 2^7 bytes/obj = 2^43 objects
Sequential and Structured
Temporal id locality
Gives clients some means to control locality, if that can be used to advantage
Allocation could be tricky to make practical in a distributed setting
How many tables does a typical (or large) RDBMS have?
How many applications do we expect to support in a single RAMCloud instance?
| application (16 bits) | table (16 bits) | address (64 bits) |
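As a concrete sketch, the layout above could be packed as follows; the struct and function names are hypothetical, and the 96-bit split is just the one proposed above:

```cpp
#include <cstdint>

// Hypothetical packing of the structured address proposed above:
// | application (16 bits) | table (16 bits) | address (64 bits) |
struct StructuredAddr {
    uint16_t application;  // owning application
    uint16_t table;        // table within the application
    uint64_t address;      // object id within the table
};

// The 32-bit application/table prefix doubles as an access-control
// handle if permissions are checked per application or per table.
inline uint32_t prefixOf(const StructuredAddr& a) {
    return (uint32_t(a.application) << 16) | uint32_t(a.table);
}
```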
Random
Smaller ids (64-bit?)
Not if we want these to look like capabilities
Simple to make generation fast
Not meaningful to client (both a plus and minus)
Indexing must be done by clients and stored in the cloud
Akin to FriendFeed's setup
Content-based
Can't share objects without references
Less general
Potential vulnerabilities if hashes have weaknesses (are there good 128-bit hashes?)
Built-in de-duplication
Which also opens a covert storage channel under multi-tenancy
Distribution
Effects
+ Capacity
Strictly necessary for this reason alone
+ Throughput/Latency
+ Avoid hot spots
Multiple hosts can service requests in parallel
Do either of these matter if nodes can handle 1M requests/sec?
- Throughput/Latency
Cost to map address -> shard, shard -> host on client
Cost to map address -> metadata (physical location, size, and permissions)
- Consistency
- Reliability
How do durability and reliability tie in to the discussion?
Evaluation
Reqs/sec/GB
Index size/GB
Questions
How much metadata space is needed for all tables/applications?
Object Level
Up to 2^43 objects; per object: size (4 bytes?) and permissions/appid, plus tableid if not in the address (4 bytes)
(2^43)*8 = 64 TB fully loaded (6.25% of capacity), not including the index size
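A minimal sketch of what such an 8-byte entry could look like; the exact field split is an assumption based on the sizes guessed above:

```cpp
#include <cstdint>

// Hypothetical 8-byte per-object metadata entry.
struct ObjectMeta {
    uint32_t size;   // object size in bytes (4 bytes)
    uint32_t owner;  // permissions/appid, or tableid if not in the address (4 bytes)
};
static_assert(sizeof(ObjectMeta) == 8, "keep the 8-byte estimate honest");
// Fully loaded: 2^43 objects * 8 bytes = 2^46 bytes = 64 TB.
```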
How does metadata replication occur and what is the frequency?
On writes for object metadata
Shard Mappings
Lazily
Not sufficient when a client discovers a host is down
at minimum, mappings must be updated in the new replicas very quickly
May additionally want leases, heartbeats, or something similar (as in MapReduce) to ensure enough copies of shards are maintained after a failure, even if the data is cold (see the sketch below)
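A rough sketch of what such a lease/heartbeat check could look like on the coordinator side; everything here (names, lease length, the re-replication hook) is an assumption, not a worked-out design:

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct LeaseTracker {
    std::chrono::seconds leaseLength{10};  // assumed lease length
    std::unordered_map<uint32_t, Clock::time_point> lastBeat;      // hostId -> last heartbeat
    std::unordered_map<uint32_t, std::vector<uint64_t>> shardsOf;  // hostId -> shards held

    void heartbeat(uint32_t hostId) { lastBeat[hostId] = Clock::now(); }

    // Run periodically: any host whose lease lapsed triggers
    // re-replication of its shards, whether or not clients have
    // touched that (possibly cold) data.
    void checkLeases() {
        const auto now = Clock::now();
        for (const auto& [host, beat] : lastBeat)
            if (now - beat > leaseLength)
                for (uint64_t shard : shardsOf[host])
                    reReplicate(shard, host);
    }

    void reReplicate(uint64_t shard, uint32_t deadHost) {
        // Stub: a real coordinator would pick a new host and copy
        // the shard from a surviving replica.
        (void)shard; (void)deadHost;
    }
};
```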
Approaches
addr mod servers
+ Balanced storage
Statistically, if keys are random
Need other protections if keys are not random (e.g. user can control placement)
- Statistically requires the same amount of storage on every machine
- Adding/removing machines
Requires nearly all objects to move, or the cluster to only double in size each time (see the sketch below)
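For concreteness, a tiny sketch of the modulo scheme and its resharding problem (names hypothetical):

```cpp
#include <cstdint>

// addr mod servers: O(1) and stateless, but changing numServers
// remaps almost every address.
inline uint32_t hostFor(uint64_t addr, uint32_t numServers) {
    return uint32_t(addr % numServers);
}
// e.g. addr = 12: with 5 servers it lives on host 2, with 6 servers
// on host 0 -- so growing from 5 to 6 forces most objects to move.
```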
Mapping/Address Space Partitioning
+ Adding/removing easier
Layer of indirection shard -> host
- Requires reverse lookup addr -> shard, requires a range search
Goal: RAMCloud address -> metadata as quickly as possible (i.e. with as few requests and as little processing as possible), where metadata includes at least the physical location, size, and permissions.
Ideal: 0 network messages and O(1) address to host mapping time with high probability
Implies all clients are aware of mapping.
Possibly update the mapping only on lookup failures: requests sent to an incorrect host get a reply with a mapping correction (sketched below). The exception: replicas must still be notified of failures quickly.
Complication: access control requires highly consistent mapping replication if control is attached to addresses (e.g. the application/table is part of the structured address). Otherwise a missing entry causes spurious denial, and a stale extra entry allows access after revocation.
Objects (likely as whole shards) may need to relocate due to failures, load, or capacity.
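A sketch of the lazy, correction-on-miss scheme described above; all names are hypothetical, and sendRead() is a stub standing in for the actual RPC:

```cpp
#include <cstdint>
#include <unordered_map>

// Lazy client-side mapping maintenance: send requests using the
// cached shard -> host map, and repair an entry only when the
// contacted host replies with a correction.
struct Reply {
    bool wrongHost = false;   // the contacted host no longer owns the shard
    uint32_t correctHost = 0; // valid when wrongHost is true
    // ... object data on success ...
};

class ClientMap {
    std::unordered_map<uint64_t, uint32_t> shardToHost;  // cached mapping

public:
    Reply read(uint64_t shardId, uint64_t addr) {
        Reply r = sendRead(shardToHost[shardId], shardId, addr);
        if (r.wrongHost) {
            shardToHost[shardId] = r.correctHost;        // lazy repair
            r = sendRead(r.correctHost, shardId, addr);  // retry once
        }
        return r;
    }

private:
    Reply sendRead(uint32_t host, uint64_t shardId, uint64_t addr) {
        // Stub for the network request; a real client would RPC here.
        (void)host; (void)shardId; (void)addr;
        return Reply{};
    }
};
```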
Fixed Number of Objects per Shard
+ O(1) lookup
Just keep a table: shift the address to get the shard number, then look up the host by shard id (see the sketch at the end of this list)
Table is too large initially or too small eventually
Assuming 64-bit addrs, 32-bit host addrs, 4 replicas (i.e. 128 bits stored per shard entry)
| log2 # shards | log2 # addrs/shard | index size |
|---|---|---|
| 30 | 34 | 16 GB |
| 40 | 24 | 16 TB |
- Size of a shard can never grow beyond capacity of the largest machine
Might not always be able to store even when there is excess capacity in the system
Could do something hackish: have a saturated host forward requests it doesn't have the entries for to another host holding a different part of the same shard
- Nearly impossible to determine address range chunk size initially
- Nightmare if we decide we need a new address range chunk size
Requires "rehashing" the whole thing
- Adding and removing hosts
Some of these issues are addressable with linear hashing techniques.
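The O(1) lookup referenced above, sketched with the 2^30-shard row of the table (the constants are that example's, not a decided design):

```cpp
#include <cstdint>
#include <vector>

// Fixed number of addresses per shard: shift to get the shard id,
// then index a flat table. Constants follow the 2^30-shard example
// above (2^34 addrs/shard, 4 replicas of 32-bit host addrs).
constexpr unsigned kAddrBitsPerShard = 34;
struct ShardEntry { uint32_t replicas[4]; };  // 128 bits per entry

struct ShardTable {
    std::vector<ShardEntry> entries;  // 2^30 entries * 16 bytes = 16 GB
    const ShardEntry& lookup(uint64_t addr) const {
        return entries[addr >> kAddrBitsPerShard];  // O(1): shift + index
    }
};
```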
DHT
Allows variable- or fixed-width address-space shards and variable- or fixed-size storage shards. Flexibility along both dimensions, at some cost in time and data-structure code complexity. Could be hard to implement in an FPGA?
+ Simple
+ Natural replication
Shard address space (needed for distribution anyway), multiple servers cover each shard
- Latency
Address to shard mapping has log(# shards) time in general, if shards are not fixed width in terms of # of addresses
Can be mitigated, at a cost in index space, using radix trees or tries
How many levels is too deep?
Even in the face of cache misses? (what is the cache-miss cost?)
At 3.0 GHz we get 3000 cycles per us
Cache misses seem to be about 60 cycles = 20 ns?
+ Load sharing
- More difficult to co-locate related data on a single machine
Probably the case that we want to intentionally distribute related data (more network overhead, but reduces latency because lookups happen on independent machines)
Extendible
Still an expensive "rehash" when the index doubles in size; it just consists of making an entire copy of the index
Linear
Not great unless shards are split prematurely, since the original algorithm still requires bucket chaining
Even so, it doesn't provide load relief where it's needed, but rather on nodes in a predefined order
Consistent
log(# shards) lookup time
In reality, high-radix radix trees can solve this
Example: create 1 node (shard) in the consistent hash map per GB in the system.
2^14 machines * 64 nodes = 2^20 nodes (shards)
Using a 64-entry radix tree, like the one that ships with Linux for reverse address mappings, gives an expected ~3.3 accesses per lookup with an index likely between 1 and 128 MB
Tried this with an off-the-shelf radix tree that claims to be fast and cache-optimized (Judy): with 2^20 nodes, average lookup time is about 1000 cycles on an AMD64 and 520 cycles on a recent Core 2 Duo and a Xeon; it allocated 15 MB of RAM during inserts. Even with some cache pressure (1000 lines touched randomly) it still responds in about 1800 cycles.
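For reference, the baseline consistent-hash lookup that gives the log(# shards) bound above: shard points on a sorted ring, searched with binary search (a radix tree or Judy array would replace the ring to get the timings quoted). Names here are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Basic consistent hashing: shard points placed on a 64-bit ring;
// an address maps to the first point at or after its hash, found
// by binary search -- hence log(# shards) lookup time.
struct RingPoint { uint64_t point; uint32_t host; };

struct ConsistentRing {
    std::vector<RingPoint> ring;  // sorted by point; e.g. 2^20 points per the example above

    uint32_t hostFor(uint64_t hashedAddr) const {
        auto it = std::lower_bound(
            ring.begin(), ring.end(), hashedAddr,
            [](const RingPoint& p, uint64_t h) { return p.point < h; });
        if (it == ring.end()) it = ring.begin();  // wrap around the ring
        return it->host;
    }
};
```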
B-Trees
+ Supports range queries on totally ordered keys
+/- Allows several records from the same table to be returned with a single request
Can give control over placement if they can take advantage of it for locality
May cause a server to become a hot spot
Is this any more true than with hashing?
- Latency (at least as bad as DHT)
Hard in FPGA?
Replication
Effects
- Capacity
+ Throughput
+ Mitigates hot spots
If there is a hot spot, data reduction may not solve overloading problems
If a single server can handle 1M requests/second, is there any need to replicate?
If the load gets too high, perhaps reduce the load by reducing the amount of data stored on a server, rather than replicating the data
+ Latency
Eliminates cross data center requests (East to West Coast datacenters)
- Consistency
A system without replication would be much easier to manage
Evaluation
Reqs/sec
MTBF if considering Reliability
Questions
How do we update the address mapping when a host fails?
Clients that detect the failed node initiate an update of the mapping
Where, to whom?
Locality
Effects
+ Network traffic
+ Latency for serial requests
+ Performance isolation in multi-tenant environments
+ Economy of metadata
For example, only access control information for data which resides on a host must be replicated to that host
Is there any locality in interesting database applications?
The most interesting form of locality is locality within a request: would like to satisfy each request with a single call to a single server, if possible
Evaluation
Bytes transferred/response
Reqs/sec
bps leaked between users
Metadata/GB