Scaling up: Where is the ceiling?

  • What is our target size, anyway?
    • Facebook has 4k MySQL, 2k memcached and 15k www/php machines
    • Google? Many 10k clusters? How big before partitioning for other tasks?
  • Step back: what is a 'server'?
    • NUMA architectures somewhat like many smaller machines bundled together
      • different bandwidths and latencies to memory / network / misc i/o resources
      • may want to treat each NUMA node as a separate machine as much as possible (to avoid complexity)
    • Unclear what optimal RC hardware will be
      • Fewer big boxes stuffed with memory?
      • Many small boxes?
      • Something in-between?
  • Is it more meaningful to target # of cores, rather than servers?
    • Increases scalability requirements by 1-2 orders of magnitude right now
      • e.g., 10k machines could have 100k cores now. Perhaps 1e6 to 1e7 cores in 5 years?
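The core-count arithmetic above can be checked with a short sketch. The figure of 10 cores per machine is implied by the 10k-machine / 100k-core example; holding the machine count fixed at 10k for the five-year projection is an assumption made here only for illustration, and the function names are hypothetical.

```python
def total_cores(machines: int, cores_per_machine: int) -> int:
    """Total cores in a cluster of identical machines."""
    return machines * cores_per_machine


def implied_cores_per_machine(target_cores: int, machines: int) -> float:
    """Cores each machine would need in order to reach target_cores."""
    return target_cores / machines


machines = 10_000  # cluster size from the example above

# Today: 10 cores per machine gives the 100k-core figure.
print(total_cores(machines, 10))

# Projection: 1e6 to 1e7 cores, ASSUMING the machine count stays at 10k.
for target in (10**6, 10**7):
    print(target, implied_cores_per_machine(target, machines))
```

Under that assumption, the projected range corresponds to roughly 100 to 1000 cores per machine; if the machine count also grows, the per-machine figure is correspondingly lower.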

Managing instances

  • Must manage growth automatically.
    • ideally, just plug in more servers and the system automatically remodels itself to handle the additional capacity.
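As a rough illustration of what "automatically remodels itself" could mean, the sketch below spreads a fixed set of data partitions evenly over the current server list, moving only as many partitions as needed when a server is added. This is not a description of any actual RAMCloud mechanism: the partition model, the `rebalance` function, and the server names are all hypothetical.

```python
from __future__ import annotations


def rebalance(assignment: dict[int, str], servers: list[str]) -> dict[int, str]:
    """Return a new partition -> server map spread evenly over `servers`.

    Hypothetical sketch: partitions stay where they are unless their
    server is over its quota or no longer in the server list.
    """
    by_server: dict[str, list[int]] = {s: [] for s in servers}
    to_move: list[int] = []
    for part, srv in sorted(assignment.items()):
        if srv in by_server:
            by_server[srv].append(part)
        else:
            to_move.append(part)  # its server is gone

    # Even quotas; any remainder goes to the currently fullest servers,
    # which avoids moving partitions unnecessarily.
    base, extra = divmod(len(assignment), len(servers))
    ranked = sorted(servers, key=lambda s: -len(by_server[s]))
    quota = {s: base + (1 if i < extra else 0) for i, s in enumerate(ranked)}

    # Trim overloaded servers, then fill underloaded ones.
    for s in servers:
        while len(by_server[s]) > quota[s]:
            to_move.append(by_server[s].pop())
    for s in servers:
        while len(by_server[s]) < quota[s]:
            by_server[s].append(to_move.pop())

    return {p: s for s, parts in by_server.items() for p in parts}


# Usage: 12 partitions on 3 servers, then a 4th server is plugged in.
servers = ["s1", "s2", "s3"]
before = {p: servers[p % 3] for p in range(12)}
after = rebalance(before, servers + ["s4"])
moved = [p for p in before if before[p] != after[p]]
print(len(moved), "partitions moved")
```

In this example each server ends up with three partitions, and only the partitions that land on the new server change owner. A real system would also have to weigh memory use, request load, and the cost of the data transfer itself, which this sketch ignores.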

Scaling Down

  • The system should scale down as well as up:
    • Within a large datacenter installation, it should be possible to have small applications whose memory and bandwidth needs can be met by a fraction of a server.  These applications should get all of the durability benefits of the full installation, but at a cost proportional to actual server usage.
    • It should also be possible to deploy RAMCloud outside the datacenter in an installation with only a few servers.  The performance and durability of such an installation should scale down smoothly with the number of servers.  For example, an installation with only two or three servers should still provide good durability, though it might not provide as good availability in the event of a power outage or the loss of a network switch, and recovery time after a crash might be longer.