...
- Cache coherency traffic & miss overhead since queues are shared.
...
Merge Active Queue and TaskBox Array with Randomization
Data Structure
- The current active list of UserContext's with TaskBox embedded inside each one.
- Furthermore, the active list for each core will start with M empty contexts and M stacks, where M is the maximum number of user threads we will allow per core.
- For each core, a single word of memory will be allocated for storing both a bit mask of occupied flags on that core and a count of how many threads are running on that core.
Creator
- Pick two cores at random.
- Read the thread count for each one and identify the core with the lower thread count.
- Find an unoccupied bit in the bit mask, and create modify a local copy with the count incremented and the bit set.
- Attempt to CAS the local copy into the core's version.
- Fill in the TaskBox of the appropriate slot.
- Set the wakeup flag on that slot.
Idle Core
- If wakeup is set in the current core, then return from the blocking state.
- Otherwise, keep scanning the array of possibly active threads until we find one with a wakeup flag that is set.
Advantages
- Unified mechanism for new threads and old reduces the complexity.
- We get reasonable load-balancing built in at the cost of roughly 4 cache misses.
Potential Issues
- Pre-allocating a bunch of stacks is memory-intensive, and memory usage is not proportional to workload.
- Cores lose the ability to differentiate between new thread creations and existing threads.
- Harder to prioritize existing work over new work.
- Size of the list that must be searched for runnable threads in the unloaded case is larger than it needs to be.