Scaling up: Where is the ceiling?

  • What is our target size, anyway?
    • Facebook has 4k MySQL, 2k memcached and 15k www/php machines
    • Google? Many 10k clusters? How big before partitioning for other tasks?
  • Step back: what is a 'server'?
    • NUMA architectures are somewhat like many smaller machines bundled together
      • different bandwidths and latencies to various memory / network / misc i/o resources
      • may want to treat cores as much like distinct machines as possible (avoid complexity)
    • Unclear what the optimal RAMCloud hardware will be
      • Fewer big boxes stuffed with memory?
      • Many small boxes?
      • Something in-between?
  • Is it more meaningful to target # of cores, rather than servers?
    • Increases scalability requirements by 1-2 orders of magnitude right now
      • e.g., 10k machines could have 100k cores now; perhaps 1e6 to 1e7 cores in 5 years? (see the back-of-the-envelope calculation after this list)
    • Will we need to think bigger than we already are?
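
A quick back-of-the-envelope calculation makes the difference concrete. The cores-per-machine figures below are assumptions for illustration only, not measurements:

    # Servers vs. cores as the unit of scale.
    # Cores-per-machine figures are assumptions for illustration only.
    machines_now = 10_000
    cores_per_machine_now = 10          # assumed typical server today
    cores_per_machine_future = 100      # assumed many-core server in ~5 years

    print(machines_now * cores_per_machine_now)       # 100,000 cores today
    print(machines_now * cores_per_machine_future)    # 1,000,000 cores
    # Growing the machine count by 10x as well pushes the target to ~1e7 cores.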

Managing instances

  • Must manage growth automatically
    • ideally, just plug in more servers and the system automatically remodels itself to handle the additional capacity (see the sketch after this list)
      • at large scale, individual management is impossible => need lights-out management (LOM)
      • at small scale, manual management is probably feasible => we assume the extra complexity of automation is acceptable for small setups
    • Scale affects:
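
A minimal sketch of what "plug in more servers and the system remodels itself" could look like, using consistent hashing so that a newly added server automatically takes over a small, even share of the key space. The Cluster class and its methods are illustrative assumptions, not the RAMCloud design:

    import bisect
    import hashlib

    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class Cluster:
        """Toy consistent-hash ring; each server owns many virtual nodes."""
        def __init__(self, servers=(), vnodes=64):
            self.vnodes = vnodes
            self.ring = []                       # sorted list of (hash, server)
            for s in servers:
                self.add_server(s)

        def add_server(self, server: str):
            # Plugging in a server remaps only ~1/N of the key space;
            # no manual repartitioning is required as capacity grows.
            for v in range(self.vnodes):
                bisect.insort(self.ring, (_hash(f"{server}#{v}"), server))

        def lookup(self, key: str) -> str:
            h = _hash(key)
            i = bisect.bisect(self.ring, (h, chr(0x10FFFF)))
            return self.ring[i % len(self.ring)][1]

    cluster = Cluster(f"server-{i}" for i in range(10))
    before = cluster.lookup("table:42/object:7")
    cluster.add_server("server-10")              # "plug in" one more machine
    after = cluster.lookup("table:42/object:7")  # most keys keep their owner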

Scaling Down

The system should scale down as well as up:

  • Within a large datacenter installation, it should be possible to have small applications whose memory and bandwidth needs can be met by a fraction of a server.  These applications should get all of the durability benefits of the full installation, but at a cost proportional to actual server usage.
  • It should also be possible to deploy RAMCloud outside the datacenter in an installation with only a few servers.  The performance and durability of such an installation should scale down smoothly with the number of servers.  For example, an installation with only two or three servers should still provide good durability, though it might not provide as good availability in the event of a power outage or the loss of a network switch, and recovery time after a crash might be longer (see the sketch below).
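
One way durability could degrade smoothly at very small scale is to cap the number of backup replicas by the number of distinct servers available. The policy below is an assumption for illustration, not the actual RAMCloud backup scheme:

    # Sketch: replica count that scales down with installation size.
    # Policy and numbers are assumptions for illustration only.
    def replica_count(num_servers: int, desired: int = 3) -> int:
        # Replicas must live on distinct servers, so a 2- or 3-server
        # installation keeps fewer copies instead of refusing to run.
        return max(1, min(desired, num_servers - 1))

    for n in (2, 3, 10, 100):
        print(f"{n} servers -> {replica_count(n)} backup replica(s)")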

Dynamic vs. Static Scalability

RAMCloud should permit:

  • Static Scalability: New installations can be created in many sizes: 1 machine, 10k machines, etc.
  • Dynamic Scalability: Existing installations must permit expansion - both incremental and explosive
    • Need to scale up as quickly as the user requires - this may mean orders of magnitude in a few days
      • (Orran Krieger's Forum presentation - EC2 customer example)
    • Scaling down may be as important as scaling up
      • server consolidation may be important (see the sketch after this list)
        • regular: reduce active nodes during off-peak times (assuming we can maintain the in-memory dataset)
        • irregular: data center resources may be re-provisioned (to cut costs, handle reduced popularity, RAMCloud 2.0 is just too efficient, etc)
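
A sketch of the "regular" consolidation case: before parking servers during off-peak hours, check how many active servers the in-memory dataset actually needs. The function and the capacity figures are illustrative assumptions:

    import math

    # Off-peak consolidation check: how many servers must stay active to
    # keep the whole dataset in memory? Figures are assumptions.
    def servers_needed(dataset_gb: float, dram_per_server_gb: float,
                       headroom: float = 0.8) -> int:
        # Leave headroom so the remaining servers are not packed to 100%.
        return math.ceil(dataset_gb / (dram_per_server_gb * headroom))

    active = 100                # servers active at peak
    dataset_gb = 10_000         # total in-memory data (assumed)
    dram_per_server_gb = 256    # DRAM per server (assumed)

    keep = servers_needed(dataset_gb, dram_per_server_gb)
    print(f"off-peak: keep {keep} of {active} servers, park {active - keep}")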