Scalability
Scaling up: Where is the ceiling?
What is our target size, anyway?
Facebook has 4k MySQL, 2k memcached and 15k www/php machines
Google? Many 10k clusters? How big before partitioning for other tasks?
Step back: what is a 'server'?
NUMA architectures somewhat like many smaller machines bundled together
different bandwidths and latencies to various memory / network / misc i/o resources
may want to treat cores/clusters of cores as much like distinct machines as possible (avoid complexity)
but be sure any buddy nodes are not intra-machine for durability
Unclear what optimal RC hardware will be
Fewer big boxes stuffed with memory?
Many small boxes?
Something in-between?
Is it more meaningful to target # of cores, rather than servers?
Increases scalability requirements by 1-2 orders of magnitude right now
e.g., 10k machines could have 100k cores now. Perhaps 1e6 to 1e7 cores in 5 years?
Will we need to think bigger than we already are?
Managing instances
Must manage growth automatically
ideally, just plug in more servers and the system automatically remodels itself to handle the additional capacity
at large scale, individual management is impossible => need LOM
at small scale, management probably feasible => assuming additional complexity for small setups is ok
primarily, must automate network autoconfiguration (easy)
what about autoconfiguring location?
system will probably want to know what lives where (which machines share same rack, etc, for availability)
probably can be derived from subnet?
Scale affects:
Network architecture
Distribution/Replication/Reliability
Fundamental algorithms - how well do they scale? Examples:
Dynamo (DHT)
paper doesn't say (probably 1000's, though each app has own Dynamo instance)
DHT's generally scale well and deal well with churn
i.e. popular with P2P apps in general
uses consistent hashing
various schemes tested
focus is on availability and latency guarantees (milliseconds)
BigTable (B+ like tree atop bunch of other distributed services)
paper indicates generally < 500 servers in production
diminishing returns due to GFS large block sizes
interestingly, BigTable can run in all-RAM mode and is not decentralised
What we choose depends largely on data model
e.g. DHT good if we don't want range queries/care about key locality, else a tree may be better
Must have good system visibility/insight
For health/system status/load/etc (managerial)
For development/debugging/performance optimisation/etc (for us. Want an X-Trace/DTrade for RAMCloud, perhaps?)
Scaling Down
The system should scale down as well as up:
Within a large datacenter installation, it should be possible to have small applications whose memory and bandwidth needs can be met by a fraction of a server. These applications should get all of the durability benefits of the full installation, but at a cost proportional to actual server usage.
It should also be possible to deploy RAMCloud outside the datacenter in an installation with only a few servers. The performance and durability of such an installation should scale down smoothly with the number of servers. For example, an installation with only two or three servers should still provide good durability, though it might not provide as good availability in the event of power outage or the loss of a network switch, and recovery time after a crash might be longer.
It should also be possible to scale an existing installation down (preferably by just unplugging nodes, downing VMs, etc)
Dynamic vs. Static Scalability
RAMCloud should permit:
Static Scalability
New installations can be created of many sizes: 1 machine, 10k machines, etc.
Dynamic Scalability
Existing installations must permit expansion - both incremental and explosive
Need to scale up as quickly as user requires - may be orders of magnitude in a few days
(Orran Krieger's Forum presentation - EC2 customer example)
Scaling down may be as important as scaling up
server consolidation may be important
regular: reduce active nodes during off-peak times (assuming we can maintain the in-memory dataset)
irregular: data center resources may be re-provisioned (to cut costs, handle reduced popularity, RAMCloud 2.0 is just too efficient, etc)
Addressing
10,000 128GB nodes = 1.250 PB of storage
1 PB = 2^50 bytes
Assuming average object size is 128 bytes:
2^50 / 2^7 = 2^43 objects
=> need at least 43 bits of address space (rounds up to 64-bit)
May want much larger key spaces, though:
if random keys, want less probability of collision, aid to distribution, etc
if structured keys, need bits for user, table, access rights, future/undefined fields, etc
So, we probably want 128-bit addressing, at the minimum.
Heterogeneity
Machines will be used at least long enough such that new additions will be faster, have more memory, or otherwise differ in important characteristics
Need to be sure heterogeneity is taken into account in distribution
e.g.: using DHT's and consistent hashing we can varying the key space a node is responsible for to be in proportion to its capabilities
Is purposeful heterogeneity useful?
Lots of big, dumb, efficient nodes with lots of RAM coupled with some smaller, very fast, expensive nodes for offloading serious hotspots?
Or will high throughput/low latencies save us?
Alternative is to shrink responsibility of overloaded node to concentrate on hottest data
Perhaps there's a performance, energy, or cost win to being specifically heterogeneous?
Sharding
Likely want to shard RAMCloud into chunks
Useful for heterogeneity - chunk size <= smallest memory capacity, bigger servers responsible for more chunks
Variable vs. static chunk sizes
Variable complicates mapping addresses to chunks, but:
permits squeezing an address range to break apart hot data
hot spots may grow increasingly unlikely with scale, but should we consider the low end?
Virtualisation Interplay
Is it reasonable to expect to run within a virtualised environment?
could imply much greater dynamism than we might be anticipating
high churn in joining/leaving DHT, lots of resultant swap in/out to maintain availability
could also imply larger number of nodes than we expect, e.g.
let a hypervisor worry about multiprocessors
VMs may have significant latency penalties (though can be mitigated with PCI device pass-through, core pinning, etc)