Nanoscheduling

RAMCloud servers do not currently manage their threads very efficiently. This page documents some of the problems with the current approach, and also serves as a collection point for design notes on a new "nanoscheduler" that might solve the problems.

Problems

The degree of concurrency within a server is configured when it starts, for example by specifying the number of worker threads for each service. However, it's not clear how to set these configuration options. The "right" configuration depends on how many cores the machine has, and it could vary dynamically depending on workload.
If all of the workers for a server are busy, the dispatch thread queues any new requests. However, the turnaround time for dispatching requests from the queue is slow. For example, in the YCSB "A" workload, it can take 1 microsecond or more between when a worker thread finishes its current request and when it receives a new request to start processing. Even though the server is maxed out, neither the dispatch thread nor the worker threads are fully utilized.
In the YCSB "A" workload at 90% memory utilization, having 3 worker threads for the master service runs slower than having only 1 (12.4 kops/sec/server vs. 16.1 kops/sec/server).
Housekeeping tasks such as the log cleaner compete for threads with the workers. It would be better to reduce the number of workers temporarily while the cleaner is running, so that the operating system does not have to deschedule any of RAMCloud's threads.
There is no mechanism for the operating system to tell RAMCloud how many threads it's going to schedule, or for RAMCloud to pick which threads those are. A new OS API for this could be very useful (though we first need to review Scheduler Activations to see whether it already solved this problem).
If a worker thread blocks for a long time on a nested RPC, it might make sense for it to switch over to some other task until the result comes back, if this context switching can be done quickly enough.
The assignment of work to threads feels too coarse-grained. For example, it might be useful to use a separate thread just for sending network packets under conditions of high load, but right now there's no way to separate that from the other dispatch thread functions (it probably wouldn't make sense to dedicate a thread to this permanently).
Write throughput under load is poor in RAMCloud because replication operations are effectively serialized. The best we can currently hope for is that during one replication operation, several other writes can add entries to the log, so that the next replication operation can handle all of the new entries at once. However, this uses threads relatively inefficiently: it takes 2N master service threads to get N-way batching, and all of those threads but one are idle. Ideally it should be easy and efficient for any number of writes to get batched together, so that if a replication operation takes a long time, the next replication will handle a huge number of writes. This means that some threads/cores must multiplex among several write RPCs.
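To make the last point concrete, here is a rough sketch of one worker multiplexing several write RPCs so that each replication operation covers every entry appended since the previous one. This is a toy model, not RAMCloud code: Log, write_rpc, run_worker, and the arrivals_per_replication parameter (how many writes arrive while one replication is in flight) are all made-up names, and Python generators stand in for whatever lightweight task mechanism a nanoscheduler would actually provide.

```python
from collections import deque


class Log:
    """Toy in-memory log; entries[:replicated_upto] count as durable."""

    def __init__(self):
        self.entries = []
        self.replicated_upto = 0
        self.replication_calls = 0

    def append(self, data):
        self.entries.append(data)
        return len(self.entries)  # log position the writer must wait for

    def replicate(self):
        """One replication operation covering every unreplicated entry."""
        batch = len(self.entries) - self.replicated_upto
        self.replicated_upto = len(self.entries)
        self.replication_calls += 1
        return batch


def write_rpc(log, data, completed):
    """A write RPC as a task: append, then give up the core until durable."""
    position = log.append(data)
    while log.replicated_upto < position:
        yield  # hand the core to another RPC instead of blocking a thread
    completed.append(data)


def run_worker(log, incoming, arrivals_per_replication):
    """A single worker multiplexing any number of pending write RPCs."""
    incoming = deque(incoming)
    waiting = deque()
    completed, batches = [], []
    while incoming or waiting:
        # Start the writes that arrived during the previous replication.
        for _ in range(min(arrivals_per_replication, len(incoming))):
            rpc = write_rpc(log, incoming.popleft(), completed)
            next(rpc)  # runs the append, then parks at the yield
            waiting.append(rpc)
        # One replication handles everything appended so far.
        batches.append(log.replicate())
        # Resume each parked write; finished ones drop out of the queue.
        for _ in range(len(waiting)):
            rpc = waiting.popleft()
            try:
                next(rpc)
                waiting.append(rpc)
            except StopIteration:
                pass
    return completed, batches


if __name__ == "__main__":
    done, sizes = run_worker(Log(), range(10), arrivals_per_replication=4)
    print(sizes)  # batch size of each replication operation
```

In this model the batch size tracks the number of writes that arrive per replication, and no thread sits idle waiting: a slower replication would simply mean a larger arrivals_per_replication and a correspondingly larger next batch.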

Scheduler Activations

Here is some information on Scheduler Activations, which has many of the same goals as nanoscheduling.

The goal is for the kernel to provide each application with a "virtual multiprocessor": the kernel controls the number of processors allocated to the application, but the application can use the processors wherever it wants.
In Scheduler Activations, the kernel starts up what is effectively a new thread (a "scheduler activation") whenever it wants to notify the application of anything. Typically it does this by interrupting an existing activation, effectively delivering 2 notifications: one for the desired event and one for the preemption. Wouldn't it be simpler and more efficient to do these notifications by leaving messages in shared memory, which the application can poll?

Miscellaneous Thoughts

This new mechanism might also be useful in a VM world, where the hypervisor allocates cores to guest OSes (which adjust their requests based on demand), the guest OSes allocate cores to applications (based on their requests), and applications schedule cores to fibers.
The new OS APIs should be designed with polling in mind. For example, if an OS needs to reclaim a core from an application, it leaves a note in memory; the application's nanoscheduler sees the note and then picks a thread to relinquish.
With the increasing number of cores, and the importance of application-level thread scheduling, it makes more sense for OSes to think of cores as a resource to be allocated, not scheduled.
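The polling idea above could look roughly like the following sketch. Everything in it is hypothetical: CoreMailbox stands in for a memory region shared between the OS and the application, and Nanoscheduler, its poll method, and the per-thread priorities are invented for illustration; no such OS API exists today. The only point is the division of labor: the OS decides how many cores the application gets, and the application decides which of its threads keep running.

```python
class CoreMailbox:
    """Stand-in for memory shared between the OS and one application."""

    def __init__(self, granted_cores):
        # Written by the "OS" side, read by the application's nanoscheduler.
        self.granted_cores = granted_cores


class Nanoscheduler:
    """Application-side scheduler that polls the mailbox between tasks."""

    def __init__(self, mailbox, priorities):
        self.mailbox = mailbox
        self.priorities = dict(priorities)  # thread name -> importance
        self.running = set(self.priorities)  # assume one thread per core
        self.parked = set()

    def poll(self):
        """Bring the set of running threads in line with the granted cores."""
        target = self.mailbox.granted_cores
        # The OS reclaimed cores: the application picks what to give up,
        # here the least important running thread (an assumed policy).
        while len(self.running) > target:
            victim = min(self.running, key=self.priorities.get)
            self.running.remove(victim)
            self.parked.add(victim)
        # The OS granted cores back: restart the most important parked thread.
        while len(self.running) < target and self.parked:
            best = max(self.parked, key=self.priorities.get)
            self.parked.remove(best)
            self.running.add(best)
        return sorted(self.running)


if __name__ == "__main__":
    mailbox = CoreMailbox(granted_cores=3)
    sched = Nanoscheduler(mailbox, {"dispatch": 3, "worker": 2, "cleaner": 1})
    mailbox.granted_cores = 2  # the OS leaves a note reclaiming one core
    print(sched.poll())        # the cleaner is parked first
```

Because the OS never interrupts the application, the cost of a reclaim is bounded by how often the nanoscheduler polls; a real design would presumably need a timeout after which the OS takes the core anyway.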