Arachne Load Balancing

  • If tasks are sufficiently short, do we need to explicitly re-balance load at all between cores?
    • Could we do it based only on correct assignment of threads to cores on creation?
  • When everyone is busy, where do we put the new thread?
  • How many cores could you support with a central queue?
    • What is the throughput of a centralized scheduler queue?
    • Most systems today use work-stealing.
    • A central queue lets a new thread start on the first available core, but it may suffer from contention.
  • Ideally, we would like to assign to the least loaded core, by number of runnable threads, but how might we obtain this information without incurring additional cache coherency traffic?
  • How many cores are we trying to support?
    • Hardware is moving quickly to greater core counts
    • Suppose you wanted to support 100 cores, each finishing a task every microsecond; then the scheduler must hand out a task every 10 ns on average, so a centralized scheduler queue probably won't keep up.

 

Designs for Load Balancing at Creation

Goals

  1. Low-overhead thread creation both when cores are busy and when they are not busy.
  2. Low-latency thread start when there is an available core.
  3. Work-conserving: Cores are not idle for long periods when there is work available.

 

Notes

  • In the following designs, we assume that we are polling in an empty context (we have a stack but are not executing any user code). If we are polling in a blocked context, then we will switch to an empty context when we find thread creation requests.
  • We considered work-stealing designs, and the following designs can be modified to support work-stealing, but we have currently decided to omit it until we discover it is necessary.
  • We currently believe that bounded array and queue sizes simplify the implementation considerably.
  • In all cases where we mention "loop over all threads", this represents a scalability problem. We can fix the scalability problem by replacing this with "pick n cores and perform action on those n cores".
  • In the designs below, we look for new thread creation requests only when a core is idle (a core is considered idle iff it has no runnable threads).

Global Ordered Queue with Per-Core TaskBox

Creator

  1. Loop through each fast path TaskBox across all cores. If any are available, then enqueue the new thread there.
  2. If all cores are busy, then enqueue the function and arguments onto a global queue.

Idle Core

  1. Check for a new creation request in its own fast-path TaskBox. If one exists, execute it.
  2. Check for a doorbell on the global queue. If the global queue is non-empty, then dequeue an item and execute it.
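The creator and idle-core paths above can be sketched as follows. This is an illustrative sketch, not Arachne's actual code: the names (TaskBox, GlobalQueueScheduler, createThread, pollForWork) are hypothetical, a mutex stands in for a real lock-free bounded queue, and a three-state slot is one way to let the creator publish the task before signaling that it is ready.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

// Single-slot fast-path mailbox per core. A three-state slot lets the
// creator claim the slot, write the task, then mark it ready.
enum class SlotState { Empty, Claimed, Ready };

struct TaskBox {
    std::atomic<SlotState> state{SlotState::Empty};
    Task task;
};

struct GlobalQueueScheduler {
    std::vector<TaskBox> boxes;    // one per core
    std::deque<Task> globalQueue;  // bounded in this design
    std::mutex queueLock;
    std::size_t maxQueueSize;

    GlobalQueueScheduler(std::size_t numCores, std::size_t maxQueue)
        : boxes(numCores), maxQueueSize(maxQueue) {}

    // Creator: try every core's fast-path TaskBox; fall back to the
    // global queue. Returns false to apply back-pressure when full.
    bool createThread(Task t) {
        for (auto& box : boxes) {
            SlotState expected = SlotState::Empty;
            if (box.state.compare_exchange_strong(expected, SlotState::Claimed)) {
                box.task = std::move(t);
                box.state.store(SlotState::Ready);
                return true;
            }
        }
        std::lock_guard<std::mutex> guard(queueLock);
        if (globalQueue.size() >= maxQueueSize) return false;
        globalQueue.push_back(std::move(t));
        return true;
    }

    // Idle core: check its own TaskBox first, then the global queue.
    std::optional<Task> pollForWork(std::size_t coreId) {
        TaskBox& box = boxes[coreId];
        if (box.state.load() == SlotState::Ready) {
            Task t = std::move(box.task);
            box.state.store(SlotState::Empty);
            return t;
        }
        std::lock_guard<std::mutex> guard(queueLock);
        if (!globalQueue.empty()) {
            Task t = std::move(globalQueue.front());
            globalQueue.pop_front();
            return t;
        }
        return std::nullopt;
    }
};
```

A rejected createThread (full queue) is where the back-pressure mentioned under Advantages would surface to the application.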

Advantages

  • Simple and work-conserving (as long as the queue is not a bottleneck); can be efficient if contention is low.
  • Bounded queue size may be a good way to apply back-pressure to application by rejecting enqueue requests.

Potential Issues

  • It is possible for creation requests on the global queue to starve if later requests keep filling up the per-core TaskBoxes.
    • One fix is to have cores alternate between serving the global queue and serving their own TaskBox when both are occupied.

  • Contention on the head and tail of the global queue can become a bottleneck which prevents work from happening.
  • Queuing contention and overheads will grow worse under heavy load.

 

Per-Core Queue with Per-Core TaskBox

Creator

  1. Loop through each TaskBox across all cores. If any are available, then enqueue the new thread there.
    While looping, also track the id and length of the shortest per-core queue; each queue's length is stored on the same cache line as its TaskBox.
  2. If all the TaskBoxes are non-empty, then add the work to the queue of the core with the shortest queue.

Idle Core

  1. Check for new creation request in its own TaskBox. If one exists, execute it.
  2. Check for new creation request on its own queue.  If one exists, execute it.
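The creator's single pass, which both looks for an empty TaskBox and remembers the shortest queue as a fallback, can be sketched as below. This is a single-threaded sketch: plain fields stand in for the atomic TaskBox and lock-free queues, and the names (CoreState, createThread) are hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

using Task = std::function<void()>;

// Per-core state: a one-slot fast-path box plus an overflow queue. In
// the real design the queue length would share a cache line with the
// TaskBox, so the creator's scan pays one cache miss per core.
struct CoreState {
    bool boxFull = false;  // stands in for the atomic TaskBox slot
    Task boxTask;
    std::deque<Task> queue;
};

// Creator: one pass over all cores both looks for an empty TaskBox and
// tracks the core with the shortest queue as the fallback target.
void createThread(std::vector<CoreState>& cores, Task t) {
    std::size_t shortest = 0;
    for (std::size_t i = 0; i < cores.size(); i++) {
        if (!cores[i].boxFull) {
            cores[i].boxTask = std::move(t);
            cores[i].boxFull = true;
            return;
        }
        if (cores[i].queue.size() < cores[shortest].queue.size())
            shortest = i;
    }
    cores[shortest].queue.push_back(std::move(t));
}
```

Note that two concurrent creators running this pass can both see the same shortest queue, which is exactly the pile-on issue listed below.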

Advantages

  • If most tasks take roughly the same amount of time, and thread creation distributes tasks evenly, then work should be reasonably balanced.

  • Locality and reduced contention.

Potential Issues

  • Simultaneous creators may find the same short queue and then all dump work onto that core.

Per-Core Array

Data Structure

  • For each core, allocate an array of TaskBox objects instead of a single one.
  • Each core maintains a private index, which is the index in the array that it will search for first.

Creator

  1. Loop over all cores, checking the first slot of every core. If one is available, enqueue the thread creation request there.
  2. If the work is not yet enqueued, then try the 2nd slot of every core, then the 3rd, and so on.

Idle Core

  1. Scan over the array for this core, starting at the private index and wrapping around.
  2. If it finds a request, advance the index to point one element past the location where the request was found, then execute the request.
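The column-major creator scan and the rotating consumer index can be sketched as below, with concurrency control elided; slots hold an int request id (0 means empty) and the names (CoreSlots, enqueue, pollSlots) are hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Per-core array of TaskBox slots, plus the core's private scan index.
struct CoreSlots {
    std::vector<int> slots;  // 0 means empty; otherwise a request id
    std::size_t index = 0;   // where this core starts scanning
    explicit CoreSlots(std::size_t n) : slots(n, 0) {}
};

// Creator: column-major scan -- slot 0 of every core, then slot 1, and
// so on -- so a lightly loaded system only ever touches the first slots.
bool enqueue(std::vector<CoreSlots>& cores, int request) {
    for (std::size_t slot = 0; slot < cores[0].slots.size(); slot++)
        for (auto& core : cores)
            if (core.slots[slot] == 0) {
                core.slots[slot] = request;
                return true;
            }
    return false;  // every slot on every core is occupied
}

// Idle core: scan its own array starting at the private index, wrap
// around, and leave the index one past where a request was found.
std::optional<int> pollSlots(CoreSlots& core) {
    std::size_t n = core.slots.size();
    for (std::size_t i = 0; i < n; i++) {
        std::size_t pos = (core.index + i) % n;
        if (core.slots[pos] != 0) {
            int request = core.slots[pos];
            core.slots[pos] = 0;
            core.index = (pos + 1) % n;
            return request;
        }
    }
    return std::nullopt;
}
```

The rotating index keeps consumption roughly FIFO per core without any shared head/tail pointers.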

Advantages

  • Simplicity: One unified mechanism for creating threads.

Potential Issues

  • If most slots are occupied, a creator may scan many slots before finding a free one, so creation may take a long time.

Multiple Shared Queues with Randomization

Data Structure

  • Several queues with no core affiliations.

Creator

  1. Pick n queues at random, and add the thread creation request to the queue with the shortest length.

Idle Core

  1. Pick n queues at random, and dequeue from the longest of those queues.
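Both steps are instances of the power-of-n-choices technique (n = 2 below). A sketch with plain deques standing in for the real concurrent queues, and a fixed RNG seed for determinism; all names are hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <random>
#include <vector>

// Several queues with no core affiliation; creators and idle cores
// each sample two at random (power-of-two-choices).
struct SharedQueues {
    std::vector<std::deque<int>> queues;
    std::mt19937 rng{42};  // fixed seed keeps this sketch deterministic

    explicit SharedQueues(std::size_t numQueues) : queues(numQueues) {}

    std::size_t pick() {
        return std::uniform_int_distribution<std::size_t>(
            0, queues.size() - 1)(rng);
    }

    // Creator: of two random queues, enqueue onto the shorter.
    void enqueue(int request) {
        std::size_t a = pick(), b = pick();
        std::size_t target = queues[a].size() <= queues[b].size() ? a : b;
        queues[target].push_back(request);
    }

    // Idle core: of two random queues, dequeue from the longer.
    std::optional<int> dequeue() {
        std::size_t a = pick(), b = pick();
        std::size_t source = queues[a].size() >= queues[b].size() ? a : b;
        if (queues[source].empty()) return std::nullopt;
        int request = queues[source].front();
        queues[source].pop_front();
        return request;
    }
};
```

Note a corner the sketch makes visible: dequeue only samples n queues, so it can come up empty even while some other queue holds work, which costs extra polling iterations under light load.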

Advantages

  • High throughput
  • Simplicity: One unified mechanism for creating threads.

Potential Issues

  • Cache coherency traffic & miss overhead since queues are shared.

Merge Active Queue and TaskBox Array with Randomization

Data Structure

  • The current active list of UserContexts, with a TaskBox embedded inside each one.
  • Furthermore, the active list for each core will start with M empty contexts and M stacks, where M is the maximum number of user threads we will allow per core.
  • For each core, a single word of memory will be allocated for storing both a bit mask of occupied flags on that core and a count of how many threads are running on that core.

Creator

  1. Pick two cores at random.
  2. Read the thread count for each one and identify the core with the lower thread count.
  3. Find an unoccupied bit in the bit mask, and modify a local copy of the word with the count incremented and the bit set.
  4. Attempt to CAS the local copy into the core's version.
  5. Fill in the TaskBox of the appropriate slot.
  6. Set the wakeup flag on that slot.
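Steps 2–4 hinge on packing the occupancy bit mask and the thread count into a single word, so that one CAS both claims a slot and bumps the count atomically. A sketch, assuming a hypothetical layout of a 56-bit mask in the low bits and an 8-bit count in the top bits:

```cpp
#include <atomic>
#include <cstdint>
#include <optional>

// One word per core: low 56 bits are an occupancy bit mask over the
// context slots; the top 8 bits count running threads. The exact
// layout here is an assumption for illustration.
struct MaskAndCount {
    std::atomic<uint64_t> word{0};
    static constexpr uint64_t kCountShift = 56;

    uint64_t count() const { return word.load() >> kCountShift; }

    // Find a free bit, then CAS in a copy with that bit set and the
    // count incremented. Returns the claimed slot index, or nullopt if
    // the core is full or the CAS lost a race (the caller retries,
    // possibly on a different core).
    std::optional<int> tryClaimSlot(int maxSlots) {
        uint64_t old = word.load();
        for (int i = 0; i < maxSlots; i++) {
            if (old & (1ULL << i)) continue;  // slot occupied
            uint64_t desired = (old | (1ULL << i)) + (1ULL << kCountShift);
            if (word.compare_exchange_strong(old, desired))
                return i;
            return std::nullopt;  // raced with another creator
        }
        return std::nullopt;  // all slots occupied
    }

    // Exiting thread: clear its bit and decrement the count in one
    // atomic subtraction (the bit is known to be set, so no borrow).
    void releaseSlot(int i) {
        word.fetch_sub((1ULL << kCountShift) | (1ULL << i));
    }
};
```

A creator would pick two cores at random, compare their count() values (step 2), and call tryClaimSlot on the less loaded one; the count read and the claim together cost only a couple of cache misses per creation.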

Idle Core

  1. If wakeup is set in the current core, then return from the blocking state.
  2. Otherwise, keep scanning the array of possibly active threads until we find one with a wakeup flag that is set.

Advantages

  • A unified mechanism for new and existing threads reduces complexity.
  • We get reasonable load-balancing built in at the cost of roughly 4 cache misses.

Potential Issues

  • Pre-allocating a bunch of stacks is memory-intensive, and memory usage is not proportional to workload.
  • Cores lose the ability to differentiate between new thread creations and existing threads.
    • Harder to prioritize existing work over new work.
  • In the unloaded case, the list that must be searched for runnable threads is larger than it needs to be.