Nanoscheduling

RAMCloud servers do not currently manage their threads very efficiently. This page documents some of the problems with the current approach, and also serves as a collection point for design notes on a new "nanoscheduler" that might solve the problems.

Problems

The degree of concurrency within a server is configured when it starts, for example by specifying the number of worker threads for each service. However, it's not clear how to set these configuration options. The "right" configuration depends on how many cores the machine has, and it could vary dynamically depending on workload.
If all of the workers for a server are busy, the dispatch thread queues any new requests. However, the turnaround time for dispatching requests from the queue is long. For example, in the YCSB "A" workload, it can take 1 microsecond or more between when a worker thread finishes its current request and when it receives a new request to start processing. Even though the server is maxed out, neither the dispatch thread nor the worker threads are fully utilized.
In the YCSB "A" workload at 90% memory utilization, having 3 worker threads for the master service runs slower than having only 1 (12.4 kops/sec/server vs. 16.1 kops/sec/server).
Housekeeping tasks such as the log cleaner compete for threads with the workers. It would be better to reduce the number of workers temporarily while the cleaner is running, so that the operating system does not have to deschedule any of RAMCloud's threads.
There is no mechanism for the operating system to tell RAMCloud how many threads it's going to schedule, or for RAMCloud to pick which threads those are. A new OS API for this could be very useful (but we need to review Scheduler Activations to see whether that work already solved this problem).
If a worker thread blocks for a long time on a nested RPC, it might make sense for it to switch over to some other task until the result comes back, if this context switching can be done quickly enough.
The assignment of work to threads feels too coarse-grained. For example, it might be useful to use a separate thread just for sending network packets under conditions of high load, but right now there's no way to separate that from the other dispatch thread functions (it probably wouldn't make sense to dedicate a thread to this permanently).
Write throughput under load is poor in RAMCloud because replication operations are effectively serialized. The best we can currently hope for is that during one replication operation, several other writes can add entries to the log, so that the next replication operation can handle all of the new entries at once. However, this uses threads relatively inefficiently: it takes 2N master service threads to get N-way batching, and all of those threads but one are idle. Ideally it should be easy and efficient for any number of writes to get batched together, so that if a replication operation takes a long time, then the next replication will handle a huge number of writes. This means that some threads/cores must multiplex among several write RPCs.
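
The batching behavior described in the last item can be illustrated with a small model. This is only a sketch under stated assumptions, not RAMCloud code: the names BatchingLog, simulate, and arrivals_per_sync are hypothetical, the model is single-threaded, and it assumes that a fixed number of writes arrive during each replication operation. It shows the ideal case, where each replication handles every entry appended since the previous one, so the batch size tracks the arrival rate rather than the number of threads.

```python
class BatchingLog:
    """Toy in-memory log (hypothetical; not the RAMCloud log).

    append() is cheap; sync() stands in for one replication operation
    and covers every entry appended since the previous sync().
    """

    def __init__(self):
        self.entries = []   # all appended entries
        self.synced = 0     # number of entries already replicated
        self.batches = []   # size of each replication batch

    def append(self, entry):
        self.entries.append(entry)

    def sync(self):
        # One replication operation handles all unsynced entries at once.
        batch = len(self.entries) - self.synced
        if batch > 0:
            self.batches.append(batch)
            self.synced = len(self.entries)
        return batch


def simulate(num_writes, arrivals_per_sync):
    """Return the batch sizes seen when replications are serialized.

    Assumption: exactly arrivals_per_sync writes arrive while each
    replication is in flight (fewer once the writes run out).
    """
    if num_writes <= 0:
        return []
    log = BatchingLog()
    # The first write arrives alone and starts a replication right away.
    log.append(0)
    pending = num_writes - 1
    while log.synced < len(log.entries):
        log.sync()
        # Writes that arrived during that replication join the next batch.
        arrived = min(arrivals_per_sync, pending)
        for i in range(arrived):
            log.append(i)
        pending -= arrived
    return log.batches


if __name__ == "__main__":
    print(simulate(10, 4))
```

In this model a slower replication (a larger arrivals_per_sync) simply produces larger batches. A real implementation would still need some way for a small number of threads to hold many outstanding write RPCs while they wait, which is the multiplexing problem noted above.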