
Addressing

Some possibilities

  • Structured, Unstructured
  • Random, Hashes, Sequential
  • User-specified, generated
  • Need at least 2^48 address capacity for objects
    • Hence, unstructured addresses probably need to be at least 2^64
    • Rough estimate (checked below): 64 * 2^30 bytes/machine * 2^14 machines = 2^50 bytes; 2^50 bytes / 2^7 bytes/obj = 2^43 objects stored, so a 2^48 address space leaves ample headroom
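A quick compile-time check of the capacity arithmetic above (a sketch; the 64 GB/machine, 2^14 machines, and 128-byte average object size are the assumptions stated in the bullet):

```cpp
#include <cstdint>
#include <cstdio>

// Capacity estimate from the bullets above. All inputs are the stated
// assumptions: 64 GB/machine, 2^14 machines, 128-byte objects.
constexpr uint64_t bytesPerMachine = 64ULL << 30;                  // 2^36
constexpr uint64_t machines        = 1ULL << 14;                   // 2^14
constexpr uint64_t totalBytes      = bytesPerMachine * machines;   // 2^50
constexpr uint64_t bytesPerObject  = 1ULL << 7;                    // 128 B
constexpr uint64_t objects         = totalBytes / bytesPerObject;  // 2^43

static_assert(totalBytes == 1ULL << 50, "2^50 bytes of total capacity");
static_assert(objects == 1ULL << 43, "2^43 objects at 128 B each");

int main() {
    std::printf("total bytes = 2^50, objects = 2^43\n");
}
```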

Sequential and Structured

  • Temporal id locality
    • Gives clients some means to control locality, if that can be used to advantage
  • Allocation could be tricky to make practical in a distributed setting
  • How many tables does a typical (or large) RDBMS have?
  • How many applications do we expect to support in a single RAMCloud instance?

application (16 bit) | table (16 bit) | address (64 bit)
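A minimal sketch of one way to represent the structured layout above (16-bit application id, 16-bit table id, 64-bit per-table address, 96 bits in total); the struct and field names are purely illustrative:

```cpp
#include <cstdint>

// One possible representation of the structured layout above: 16-bit
// application id, 16-bit table id, 64-bit address within the table
// (96 bits total). Names are illustrative only, not a committed format.
struct StructuredAddress {
    uint16_t application;   // which application owns the table
    uint16_t table;         // table within that application
    uint64_t address;       // sequentially allocated within the table

    // Pack the application/table prefix into one 32-bit word, e.g. for
    // routing or for access-control checks on the prefix alone.
    uint32_t prefix() const {
        return (static_cast<uint32_t>(application) << 16) | table;
    }
};
```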

Random

  • Smaller ids (64-bit?)
    • Not if we want these to look like capabilities, which need to be hard to guess (see the sketch after this list)
  • Simple to make fast
  • Not meaningful to client (both a plus and minus)
  • Indexing must be done by clients and stored in the cloud
    • Akin to FriendFeed's setup
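A sketch of generating an unstructured random id; 128 bits are used so that ids could plausibly serve as capabilities, and the generator is a placeholder (a real deployment would want a cryptographically strong source):

```cpp
#include <array>
#include <cstdint>
#include <random>

// Generate a 128-bit random object id. 64 bits would be smaller and
// cheaper to index, but (per the bullets above) too guessable to act
// as a capability. std::mt19937_64 is NOT crypto-strength; it stands
// in for whatever secure generator the real system would use.
std::array<uint64_t, 2> randomObjectId() {
    static std::random_device seed;
    static std::mt19937_64 gen(seed());
    return { gen(), gen() };
}
```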

Distribution

  • How much metadata space is needed for all tables/applications?
    • Up to 2^48 addressable objects; per-object metadata is at least a 128-bit (16-byte) key, a size (4-8 bytes), and permissions or appid if not in the address (2-4 bytes)
    • For the ~2^43 objects estimated above, keys alone take 2^43 * 16 / 2^40 = 128 TB (see the check below)
  • How does metadata replication occur and what is the frequency?
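A compile-time restatement of that metadata estimate (a sketch; the 2^43 object count and 16-byte key size are the figures from the bullets above):

```cpp
#include <cstdint>

// Metadata-space estimate from the bullets above: ~2^43 stored objects
// with a 16-byte (128-bit) key each. Size and permission fields would
// add roughly another 6-12 bytes per object on top of this.
constexpr uint64_t objects    = 1ULL << 43;
constexpr uint64_t keyBytes   = 16;
constexpr uint64_t keySpace   = objects * keyBytes;        // 2^47 bytes
constexpr uint64_t keySpaceTB = keySpace / (1ULL << 40);   // in TB

static_assert(keySpaceTB == 128, "keys alone occupy 128 TB");

int main() { return 0; }
```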

Effects

  • + Capacity
    • Strictly necessary for this reason alone
  • + Throughput/Latency
    • + Avoid hot spots
    • Multiple hosts can service requests in parallel
    • Does either of these matter if nodes can handle 1M requests/sec?
  • - Throughput/Latency
    • Cost to map address -> shard, shard -> host on client
    • Cost to map address -> metadata (physical location, size, and permissions)
  • - Consistency
  • - Reliability
  • How do durability and reliability tie in to the discussion?

Approaches

Mapping/Address Space Partitioning

Map a RAMCloud address -> metadata as quickly as possible (i.e. with as few requests and as little processing as possible), where the metadata includes at least the physical address, size, and permissions.

Ideal: 0 network messages and O(1) address to host mapping time with high probability

This implies that all clients are aware of the mapping.

Possibly update the mapping only on lookup failures: a request sent to the wrong host is answered with a mapping correction (see the client-side sketch below).

Complication: access-control requires highly-consistent mapping replication if control is on addresses (e.g. the application/table is part of the structured address).

Objects (likely as whole shards) may need to relocate due to failures, load, or capacity.
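A sketch of the client-side flow just described: addresses resolve through a locally cached mapping with no network messages, and a request that lands on the wrong host comes back with a correction that refreshes the cache. Every type, name, and constant here is hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical client-side view of the address -> shard -> host mapping.
// The cache may be stale; a server that no longer owns the requested
// data replies with a correction instead of the object.
struct HostId { uint32_t id; };

struct Correction {          // returned by a server that is not the owner
    uint64_t shard;
    HostId   newOwner;
};

class ClientMappingCache {
public:
    // Address -> host believed to own it: O(1) with high probability,
    // zero network messages (the ideal stated above).
    HostId hostFor(uint64_t address) const {
        uint64_t shard = address >> kAddrBitsPerShard;
        auto it = shardToHost_.find(shard);
        // Fall back to some well-known host (assumption) on a cold cache.
        return it != shardToHost_.end() ? it->second : HostId{0};
    }

    // Apply a correction piggybacked on a misdirected request, so only
    // clients that actually hit a stale entry pay for the update.
    void applyCorrection(const Correction& c) {
        shardToHost_[c.shard] = c.newOwner;
    }

private:
    static constexpr unsigned kAddrBitsPerShard = 34;  // assumed shard width
    std::unordered_map<uint64_t, HostId> shardToHost_;
};
```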

Fixed Number of Objects per Shard

  • + O(1) lookup
    • Just keep a table: shift the address to get the shard number, then look up the host by shard id (see the sketch at the end of this subsection)
    • Table is too large initially or too small eventually

Assuming 64 bit addrs, 32 bit host addrs, 4 replicas

  log_2 # shards | log_2 # addrs/shard | index size
  30             | 34                  | 16 GB
  40             | 24                  | 16 TB

  • - Size of a shard can never grow beyond capacity of the largest machine
    • Might not always be able to store even when there is excess capacity in the system
    • Could do something hackish: have the saturated host forward requests it doesn't have the entries for to another host holding a different part of the same shard
  • - Nearly impossible to determine address range chunk size initially
  • - Nightmare if we decide we need a new address range chunk size
    • Requires "rehashing" the whole thing
  • - Adding and removing hosts
    • Some of these issues are addressable with linear hashing techniques.
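A sketch of the O(1) shift-and-index lookup described at the top of this subsection, using the first row of the table above (34 address bits per shard, 4 replicas of a 32-bit host address per entry) as the assumed parameters:

```cpp
#include <cstdint>
#include <vector>

// Fixed-width shards: each shard covers exactly 2^kAddrBitsPerShard
// addresses, so the shard id is just the high bits of the address and
// lookup is one shift plus one table index. Parameters follow the
// first row of the table above (2^30 shards of 2^34 addresses).
constexpr unsigned kAddrBitsPerShard = 34;
constexpr unsigned kReplicas         = 4;

struct ShardEntry {
    uint32_t hosts[kReplicas];   // 4 x 32-bit host addresses = 16 B/shard
};

class ShardIndex {
public:
    explicit ShardIndex(uint64_t numShards) : table_(numShards) {}

    // O(1): shift the 64-bit address, then index the table by shard id.
    const ShardEntry& lookup(uint64_t address) const {
        return table_[address >> kAddrBitsPerShard];
    }

private:
    // At 2^30 shards this table is 16 GB (hence the sizing concerns above).
    std::vector<ShardEntry> table_;
};
```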

DHT

  • + Simple
  • + Natural replication
    • Shard address space (needed for distribution anyway), multiple servers cover each shard
  • - Latency
    • Address-to-shard mapping takes O(log # shards) time in general, if shards are not fixed-width in terms of # of addresses
    • Can be mitigated, at the cost of index space, by using a radix tree or trie
    • How many levels is too deep?
    • Even 2-3 in the face of cache misses? (what is the cache miss cost?)
      • Definitely a concern: at 3.0 GHz there are only 3,000 cycles per us, and each miss to DRAM costs on the order of a few hundred cycles
  • + Load sharing
  • - More difficult to co-locate related data on a single machine
    • Probably the case that we want to intentionally distribute related data (more network overhead, but lower latency because lookups happen on independent machines)
  • Extendible hashing
  • Linear hashing
  • Consistent hashing (see the ring sketch after this list)
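A minimal consistent-hashing sketch of the address-to-host mapping; the hash choices and the single ring position per host are simplifying assumptions (a real ring would give each host many virtual nodes for better load sharing):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Minimal consistent-hashing ring: hosts and addresses hash onto the
// same 64-bit ring, and an address belongs to the first host at or
// after its position. Adding or removing a host only moves the keys
// adjacent to it, unlike the fixed table scheme in the previous section.
class ConsistentRing {
public:
    void addHost(const std::string& host) {
        ring_[std::hash<std::string>{}(host)] = host;
    }
    void removeHost(const std::string& host) {
        ring_.erase(std::hash<std::string>{}(host));
    }

    // O(log #hosts) per lookup with std::map; a radix structure could
    // trade memory for fewer pointer chases, as discussed above.
    const std::string& hostFor(uint64_t address) const {
        if (ring_.empty()) throw std::runtime_error("no hosts");
        auto it = ring_.lower_bound(std::hash<uint64_t>{}(address));
        if (it == ring_.end()) it = ring_.begin();   // wrap around the ring
        return it->second;
    }

private:
    std::map<uint64_t, std::string> ring_;   // ring position -> host
};
```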

B-Trees

  • + Supports range queries on totally ordered keys
  • +/- Allows several records from the same table to be returned with a single request (see the range-query sketch after this list)
    • Can give control over placement if they can take advantage of it for locality
  • May cause a server to become a hot spot
    • Is this any more true than with hashing?
  • - Latency (at least as bad as DHT)
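A sketch of the range-query property mentioned above, using std::map as a stand-in for a B-tree (same ordered-key interface, different node structure); the (table, address) key layout reuses the structured-address idea from earlier and is only an assumption:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Ordered index keyed by (table id, address within table). Keys from
// the same table are adjacent in the ordering, so one request can
// return several records from that table -- the range-query property
// the bullets above describe. std::map stands in for a B-tree here.
using Key = std::pair<uint16_t, uint64_t>;

std::vector<std::string>
rangeQuery(const std::map<Key, std::string>& index,
           uint16_t table, uint64_t first, uint64_t last) {
    std::vector<std::string> results;
    for (auto it = index.lower_bound(Key{table, first});
         it != index.end() && it->first <= Key{table, last}; ++it) {
        results.push_back(it->second);   // contiguous scan, one "request"
    }
    return results;
}
```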

Replication

Effects

  • - Capacity
  • + Throughput
    • + Mitigates hot spots
    • If there is a hot spot, data reduction may not solve overloading problems
    • If a single server can handle 1M requests/second, is there any need to replicate?
    • If the load gets too high, perhaps reduce the load by reducing the amount of data stored on a server, rather than replicating the data
  • + Latency
    • Eliminates cross-datacenter requests (e.g. East Coast to West Coast datacenters)
  • - Consistency
    • A system without replication would be much easier to manage

Locality

Effects

  • + Network traffic
  • + Latency for serial requests
  • + Performance isolation in multi-tenant environments
  • + Economy of metadata
    • For example, only access control information for data which resides on a host must be replicated to that host
  • Is there any locality in interesting database applications?
  • The most interesting form of locality is locality within a request: would like to satisfy each request with a single call to a single server, if possible