Scalability
Scaling up: Where is the ceiling?
- What is our target size, anyway?
- Facebook has 4k MySQL, 2k memcached and 15k www/php machines
- Google? Many 10k clusters? How big before partitioning for other tasks?
- Step back: what is a 'server'?
- NUMA architectures somewhat like many smaller machines bundled together
- different bandwidths and latencies to various memory / network / misc i/o resources
- may want to treat cores/clusters of cores as much like distinct machines as possible (avoid complexity)
- but be sure buddy nodes are not on the same physical machine, for durability
- Unclear what optimal RC hardware will be
- Fewer big boxes stuffed with memory?
- Many small boxes?
- Something in-between?
- Is it more meaningful to target # of cores, rather than servers?
- Increases scalability requirements by 1-2 orders of magnitude right now
- e.g., 10k machines could have 100k cores now. Perhaps 1e6 to 1e7 cores in 5 years?
- Will we need to think bigger than we already are?
Managing instances
- Must manage growth automatically
- ideally, just plug in more servers and the system automatically remodels itself to handle the additional capacity
- at large scale, managing machines individually is impossible => need lights-out management (LOM)
- at small scale, management probably feasible => assuming additional complexity for small setups is ok
- primarily, must automate network autoconfiguration (easy)
- what about autoconfiguring location?
- system will probably want to know what lives where (which machines share same rack, etc, for availability)
- probably can be derived from the subnet? (a rough sketch follows at the end of this section)
- Scale affects:
- Network architecture
- Distribution/Replication/Reliability
- Fundamental algorithms - how well do they scale? Examples:
- Dynamo (DHT)
- paper doesn't say (probably 1000s, though each app has its own Dynamo instance)
- DHT's generally scale well and deal well with churn
- i.e. popular with P2P apps in general
- uses consistent hashing
- various schemes tested
- focus is on availability and latency guarantees (milliseconds)
- BigTable (B+-tree-like structure atop a bunch of other distributed services)
- paper indicates generally < 500 servers in production
- diminishing returns due to GFS large block sizes
- interestingly, BigTable can run in all-RAM mode and is not decentralised
- What we choose depends largely on data model
- e.g. a DHT is good if we don't need range queries or care about key locality; otherwise a tree may be better
- Must have good system visibility/insight
- For health/system status/load/etc (managerial)
- For development/debugging/performance optimisation/etc (for us; want an X-Trace/DTrace for RAMCloud, perhaps?)
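A minimal sketch of the "location from subnet" idea above, assuming (purely for illustration) that each rack is assigned its own /24; the function name and the convention are invented here, not part of RAMCloud:

```cpp
// Illustrative only: infer a coarse rack id from an IPv4 address, assuming
// the (invented) convention that every rack gets its own /24 subnet.
#include <arpa/inet.h>
#include <cstdint>
#include <cstdio>
#include <string>

// Returns the /24 prefix of a dotted-quad address as a rack key, or 0 if the
// address cannot be parsed.
uint32_t rackIdFromAddress(const std::string& dottedQuad)
{
    struct in_addr addr;
    if (inet_aton(dottedQuad.c_str(), &addr) == 0)
        return 0;
    return ntohl(addr.s_addr) & 0xFFFFFF00u;   // keep only the /24 prefix
}

int main()
{
    // Servers on the same subnet get the same rack id; replica placement can
    // then refuse to put buddy nodes behind the same id.
    printf("%08x\n", rackIdFromAddress("10.1.7.12"));   // same rack...
    printf("%08x\n", rackIdFromAddress("10.1.7.93"));   // ...as this one
    printf("%08x\n", rackIdFromAddress("10.1.8.5"));    // different rack
    return 0;
}
```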
Scaling Down
The system should scale down as well as up:
- Within a large datacenter installation, it should be possible to have small applications whose memory and bandwidth needs can be met by a fraction of a server. These applications should get all of the durability benefits of the full installation, but at a cost proportional to actual server usage.
- It should also be possible to deploy RAMCloud outside the datacenter in an installation with only a few servers. The performance and durability of such an installation should scale down smoothly with the number of servers. For example, an installation with only two or three servers should still provide good durability, though it might not provide as good availability in the event of power outage or the loss of a network switch, and recovery time after a crash might be longer.
- It should also be possible to scale an existing installation down (preferably by just unplugging nodes, downing VMs, etc)
Dynamic vs. Static Scalability
RAMCloud should permit:
- Static Scalability
- New installations can be created of many sizes: 1 machine, 10k machines, etc.
- Dynamic Scalability
- Existing installations must permit expansion - both incremental and explosive
- Need to scale up as quickly as user requires - may be orders of magnitude in a few days
- (Orran Krieger's Forum presentation - EC2 customer example)
- Scaling down may be as important as scaling up
- server consolidation may be important
- regular: reduce active nodes during off-peak times (assuming we can maintain the in-memory dataset)
- irregular: data center resources may be re-provisioned (to cut costs, handle reduced popularity, RAMCloud 2.0 is just too efficient, etc)
Addressing
- 10,000 nodes with 128 GB each = ~1.2 PB of storage
- 1 PB = 2^50 bytes
- Assuming average object size is 128 bytes:
- 2^50 / 2^7 = 2^43 objects
- => need at least 43 bits of address space (rounds up to 64-bit)
- May want much larger key spaces, though:
- if keys are random, we want a lower probability of collision, better distribution, etc
- if structured keys, need bits for user, table, access rights, future/undefined fields, etc
- So, we probably want 128-bit addressing, at the minimum.
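A quick back-of-the-envelope check of the arithmetic above; the constants are the assumptions from these notes, nothing here is RAMCloud code:

```cpp
// Back-of-the-envelope check of the object-count math above; constants are
// the assumptions from these notes, not measured values.
#include <cmath>
#include <cstdio>

int main()
{
    const double nodes        = 10000.0;
    const double bytesPerNode = 128.0 * 1024 * 1024 * 1024;  // 128 GB
    const double objectSize   = 128.0;                       // 2^7 bytes

    const double totalBytes   = nodes * bytesPerNode;         // ~2^50.3
    const double totalObjects = totalBytes / objectSize;      // ~2^43.3
    const double bitsNeeded   = std::ceil(std::log2(totalObjects));

    printf("total bytes   ~ 2^%.1f\n", std::log2(totalBytes));
    printf("total objects ~ 2^%.1f\n", std::log2(totalObjects));
    printf("min key bits  = %.0f\n", bitsNeeded);             // 44 for these exact figures
    return 0;
}
```

With the exact figures the minimum comes out at 44 bits rather than 43; either way it rounds up to a 64-bit word, and the structured-key arguments above are what push toward 128 bits.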
Heterogeneity
- Machines will stay in service long enough that newer additions will be faster, have more memory, or otherwise differ in important characteristics
- Need to be sure heterogeneity is taken into account in distribution
- e.g.: with DHTs and consistent hashing we can vary the key space a node is responsible for in proportion to its capabilities (sketch after this list)
- Is purposeful heterogeneity useful?
- Lots of big, dumb, efficient nodes with lots of RAM coupled with some smaller, very fast, expensive nodes for offloading serious hotspots?
- Or will high throughput/low latencies save us?
- Alternative is to shrink responsibility of overloaded node to concentrate on hottest data
- Perhaps there's a performance, energy, or cost win to being specifically heterogeneous?
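A hedged sketch of the proportional-key-space idea from the list above: consistent hashing where each server gets virtual nodes in proportion to its RAM. Class, method, and server names are invented for illustration; this is not RAMCloud's actual placement scheme.

```cpp
// Illustrative sketch of capacity-weighted consistent hashing: each server
// gets virtual nodes in proportion to its RAM, so bigger servers own a
// proportionally larger slice of the key space. Class, method, and server
// names are invented; this is not RAMCloud's actual placement scheme.
#include <cstdint>
#include <functional>
#include <map>
#include <string>

class WeightedRing {
  public:
    // One virtual node per GB of RAM (an arbitrary granularity).
    void addServer(const std::string& name, uint32_t gigabytes)
    {
        for (uint32_t i = 0; i < gigabytes; i++)
            ring[hashOf(name + "#" + std::to_string(i))] = name;
    }

    // Owner = first virtual node at or after the key's hash, wrapping around.
    // Assumes at least one server has been added.
    const std::string& serverForKey(const std::string& key) const
    {
        auto it = ring.lower_bound(hashOf(key));
        if (it == ring.end())
            it = ring.begin();
        return it->second;
    }

  private:
    static uint64_t hashOf(const std::string& s)
    {
        return std::hash<std::string>{}(s);
    }
    std::map<uint64_t, std::string> ring;   // ring position -> server name
};

int main()
{
    WeightedRing ring;
    ring.addServer("big-node", 128);       // 128 GB: ~4x the key space...
    ring.addServer("small-node", 32);      //  32 GB: ...of this one
    ring.serverForKey("table7/object42");  // deterministically picks an owner
    return 0;
}
```

Shrinking an overloaded node's responsibility (the alternative mentioned above) then amounts to removing some of its virtual nodes.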
Sharding
- Likely want to shard RAMCloud into chunks
- Useful for heterogeneity - chunk size <= smallest memory capacity, bigger servers responsible for more chunks
- Variable vs. static chunk sizes
- Variable chunk sizes complicate mapping addresses to chunks (sketch after this list), but:
- permits squeezing an address range to break apart hot data
- hot spots may grow increasingly unlikely with scale, but should we consider the low end?
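To make the static-vs-variable trade-off above concrete, here is a hypothetical sketch of both mappings; the types and names are invented, not RAMCloud data structures.

```cpp
// Hypothetical contrast of the two chunking options above. With fixed-size
// chunks, address -> chunk is pure arithmetic; with variable-size chunks we
// need a range map, but a hot range can be carved off into its own chunk.
#include <cstdint>
#include <map>

// Fixed-size chunks: one shift, no lookup structure at all.
constexpr uint64_t CHUNK_BITS = 30;                 // e.g. 1 GB chunks
uint64_t staticChunkOf(uint64_t address)
{
    return address >> CHUNK_BITS;
}

// Variable-size chunks: map from the first address of each chunk to its id.
class VariableChunkMap {
  public:
    VariableChunkMap() { starts[0] = nextId++; }    // one chunk covers everything

    // Carve a new chunk starting at splitAddress out of whichever chunk
    // currently contains it; this is how a hot range could be isolated.
    void splitAt(uint64_t splitAddress)
    {
        if (starts.find(splitAddress) == starts.end())
            starts[splitAddress] = nextId++;
    }

    uint64_t chunkOf(uint64_t address) const
    {
        auto it = starts.upper_bound(address);      // first start > address
        --it;                                       // chunk containing address
        return it->second;
    }

  private:
    std::map<uint64_t, uint64_t> starts;            // chunk start -> chunk id
    uint64_t nextId = 0;
};

int main()
{
    VariableChunkMap chunks;
    chunks.splitAt(1ULL << 40);          // carve off a hot region
    chunks.chunkOf(123);                 // -> chunk 0
    chunks.chunkOf((1ULL << 40) + 5);    // -> chunk 1
    staticChunkOf((1ULL << 40) + 5);     // fixed-size alternative
    return 0;
}
```

The fixed-size mapping needs no lookup state at all, while the range map costs an O(log n) lookup but lets a hot address range be split off into its own chunk.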
Virtualisation Interplay
- Is it reasonable to expect to run within a virtualised environment?
- could imply much greater dynamism than we might be anticipating
- high churn in joining/leaving DHT, lots of resultant swap in/out to maintain availability
- could also imply larger number of nodes than we expect, e.g.
- let a hypervisor worry about multiprocessors
- VMs may have significant latency penalties (though can be mitigated with PCI device pass-through, core pinning, etc)