Scaling up: Where is the ceiling?

  • What is our target size, anyway?
    • Facebook has 4k MySQL, 2k memcached, and 15k www/PHP machines
    • Google? Many clusters of ~10k machines? How big can a cluster get before it is partitioned for other tasks?
  • Step back: what is a 'server'?
    • NUMA architectures are somewhat like many smaller machines bundled together
      • different bandwidths and latencies to various memory / network / misc I/O resources
      • may want to treat cores/clusters of cores as much like distinct machines as possible (to avoid complexity)
        • but be sure any buddy nodes are not intra-machine, for durability (see the sketch at the end of this section)
    • It is unclear what the optimal RAMCloud hardware will be
      • Fewer big boxes stuffed with memory?
      • Many small boxes?
      • Something in-between?
  • Is it more meaningful to target # of cores, rather than servers?
    • Increases scalability requirements by 1-2 orders of magnitude right now
      • e.g., 10k machines could have 100k cores now. Perhaps 1e6 to 1e7 cores in 5 years?
    • Will we need to think bigger than we already are?
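
A minimal sketch of the buddy-node constraint above, in Python with hypothetical node and chassis names: even if NUMA domains are treated as logical servers, replica placement must still cross physical machine boundaries.

    # Hypothetical map from logical nodes (e.g., NUMA domains) to the
    # physical chassis they live in; node0/node1 share one box.
    chassis_of = {
        "node0": "chassisA", "node1": "chassisA",
        "node2": "chassisB", "node3": "chassisB",
    }

    def pick_buddy(primary, candidates):
        """Return a buddy for `primary` on a different chassis, so that a
        single machine failure cannot destroy both copies of the data."""
        for node in candidates:
            if chassis_of[node] != chassis_of[primary]:
                return node
        raise RuntimeError("no off-chassis buddy available")

    print(pick_buddy("node0", ["node1", "node2"]))  # -> node2, never node1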

Managing instances

  • Must manage growth automatically
    • ideally, just plug in more servers and the system automatically reconfigures itself to handle the additional capacity
      • at large scale, managing machines individually is impossible => need lights-out management (LOM)
      • at small scale, manual management is probably feasible => assumes the extra complexity of automation is acceptable for small setups
    • primarily, must automate network configuration (easy)
      • what about autoconfiguring location?
        • the system will probably want to know what lives where (which machines share the same rack, etc., for availability)
        • probably can be derived from the subnet? (first sketch at the end of this section)
    • Scale affects:
      • Network architecture
      • Distribution/Replication/Reliability
        • Fundamental algorithms - how well do they scale? Examples:
          • Dynamo (DHT)
            • the paper doesn't say (probably thousands, though each application has its own Dynamo instance)
              • DHTs generally scale well and deal well with churn
                • i.e., popular with P2P apps in general
              • uses consistent hashing (second sketch at the end of this section)
                • various partitioning schemes tested in the paper
            • focus is on availability and latency guarantees (milliseconds)
          • BigTable (a B+-tree-like structure atop a collection of other distributed services)
            • paper indicates generally < 500 servers in production
            • diminishing returns, due in part to GFS's large block sizes
            • interestingly, BigTable can run in an all-RAM mode and is not decentralised
          • What we choose depends largely on the data model
            • e.g., a DHT is a good fit if we don't need range queries or key locality; otherwise a tree may be better (see the second sketch below)
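
On deriving location from the subnet: a minimal sketch, assuming a hypothetical addressing plan in which each rack gets its own /24, so the third octet of a server's address identifies its rack.

    import ipaddress

    # Hypothetical addressing plan: one /24 per rack, so the third octet
    # of an IPv4 address identifies the rack a server lives in.
    def rack_of(addr: str) -> int:
        return (int(ipaddress.IPv4Address(addr)) >> 8) & 0xff

    assert rack_of("10.0.17.42") == rack_of("10.0.17.99")  # same rack
    assert rack_of("10.0.17.42") != rack_of("10.0.18.7")   # different racks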
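
On consistent hashing and key locality: a toy consistent-hash ring with virtual nodes (not Dynamo's exact partitioning scheme; the paper evaluates several variants). It also illustrates why a DHT is a poor fit for range queries: adjacent keys hash to unrelated servers.

    import bisect, hashlib

    def h(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    class Ring:
        """Toy consistent-hash ring; each server owns many points on it."""
        def __init__(self, servers, vnodes=100):
            self._points = sorted((h(f"{s}#{i}"), s)
                                  for s in servers for i in range(vnodes))
            self._hashes = [p for p, _ in self._points]

        def lookup(self, key: str) -> str:
            # First ring point clockwise of the key's hash, wrapping around.
            i = bisect.bisect(self._hashes, h(key)) % len(self._points)
            return self._points[i][1]

    ring = Ring(["s1", "s2", "s3"])
    # Adjacent keys land on unrelated servers: no key locality, so a scan
    # over key17..key19 may have to contact every server in the cluster.
    for k in ("key17", "key18", "key19"):
        print(k, "->", ring.lookup(k))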

Scaling Down

The system should scale down as well as up:

  • Within a large datacenter installation, it should be possible to have small applications whose memory and bandwidth needs can be met by a fraction of a server. These applications should get all of the durability benefits of the full installation, but at a cost proportional to actual server usage.
  • It should also be possible to deploy RAMCloud outside the datacenter in an installation with only a few servers. The performance and durability of such an installation should scale down smoothly with the number of servers. For example, an installation with only two or three servers should still provide good durability, though it might not provide as good availability in the event of a power outage or the loss of a network switch, and recovery time after a crash might be longer.
  • It should also be possible to scale an existing installation down (preferably by just unplugging nodes, shutting down VMs, etc.)

Dynamic vs. Static Scalability

RAMCloud should permit:

  • Static Scalability
    • New installations can be created at many sizes: 1 machine, 10k machines, etc.
  • Dynamic Scalability
    • Existing installations must permit expansion - both incremental and explosive
      • Need to scale up as quickly as user requires - may be orders of magnitude in a few days
        • (Orran Krieger's Forum presentation - EC2 customer example)
      • Scaling down may be as important as scaling up
        • server consolidation may be important
          • regular: reduce the number of active nodes during off-peak times (assuming the remaining nodes can still hold the in-memory dataset)
          • irregular: data center resources may be re-provisioned (to cut costs, to handle reduced popularity, because RAMCloud 2.0 is just too efficient, etc.)

Virtualisation Interplay

  • Is it reasonable to expect to run within a virtualised environment?
    • could imply much greater dynamism than we might be anticipating
      • high churn as nodes join/leave the DHT, with lots of resultant swapping of data in/out to maintain availability
    • could also imply a larger number of nodes than we expect, e.g.
      • let a hypervisor worry about multiprocessors (each VM appears as a separate, smaller node)
  • VMs may impose significant latency penalties (though these can be mitigated with PCI device pass-through, core pinning, etc.)