WHAT WE MEAN WHEN WE SAY THAT WE ARE CORE AWARE

  • Originally we thought this meant that applications would schedule to the number of cores.
  • Now we believe that this means intelligently scheduling user threads onto cores, and intelligently determining when more cores are necessary to maintain low latency.
  • Knowledge of how many cores are available moves up the stack.
    • Originally that knowledge was deep down in the kernel.
    • How high up the stack does it need to go?
    • It should go at least up to the user-level threading library, Arachne.
  • Old model - Application creates threads and the kernel schedules them.
  • New model - Application queries the number of cores and schedules based on that (a minimal sketch follows).
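
A minimal sketch of the new model, assuming only the standard library: the application queries the core count and sizes its per-core scheduling structures to match. std::thread::hardware_concurrency() stands in for whatever core-count interface the runtime (e.g. Arachne) ultimately exposes, and the worker loop is a placeholder.

    #include <cstdio>
    #include <thread>
    #include <vector>

    void workerLoop(unsigned core) {
        // Placeholder: a real runtime would pin this kernel thread to
        // `core` and run a per-core user-level scheduler here.
        std::printf("worker for core %u started\n", core);
    }

    int main() {
        unsigned numCores = std::thread::hardware_concurrency();
        if (numCores == 0) numCores = 1;        // the query may fail
        std::vector<std::thread> workers;
        for (unsigned c = 0; c < numCores; c++)
            workers.emplace_back(workerLoop, c);
        for (auto& w : workers) w.join();
    }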

 

 

MINIMUM TIME TO SCHEDULE A THREAD ONTO ANOTHER CORE

  • What is the correct architecture for the fast path, and what is the lowest possible time to schedule a thread onto another core? (One candidate fast path is sketched below.)
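
One candidate fast path, sketched under assumptions (per-core mailboxes, a 64-core limit, illustrative names rather than Arachne's actual API): the creating core publishes a function pointer into a cache-line-sized slot that the target core polls. Each side then pays roughly one cache-coherence round trip, which suggests a floor on the order of 100-200 ns on current hardware.

    #include <atomic>

    struct alignas(64) Mailbox {             // one cache line per core
        std::atomic<void (*)()> task{nullptr};
    };

    Mailbox mailboxes[64];                   // assumed 64-core limit

    // Creator side: hand a task to `core` if its mailbox is free.
    bool scheduleOn(int core, void (*fn)()) {
        void (*expected)() = nullptr;
        return mailboxes[core].task.compare_exchange_strong(expected, fn);
    }

    // Target side: each core polls its own mailbox in its idle loop.
    void pollOnce(int core) {
        auto fn = mailboxes[core].task.exchange(nullptr);
        if (fn)
            fn();                            // run the new user thread
    }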

 

HANDLING BLOCKING SYSCALLS

  • How much or how little kernel assistance do we need?
  • Current idea is to migrate all the kernel threads doing syscalls onto one core. Then how do we ensure there is at least one kernel thread on every other idle core?
  • Requires experimentation to evaluate performance (a minimal sketch of the migration idea follows).
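
A minimal Linux sketch of the migration idea, assuming a dedicated syscall core (core 0 here, an arbitrary choice): before a potentially blocking syscall, the kernel thread moves itself onto the syscall core and moves back afterward, so only that core stalls. Whether the two affinity changes are cheap enough is exactly what needs to be measured.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    static const int SYSCALL_CORE = 0;       // assumed: core reserved for blocked threads

    static void pinTo(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread
    }

    ssize_t blockingRead(int fd, void* buf, size_t len, int homeCore) {
        pinTo(SYSCALL_CORE);                 // get off the latency-critical core
        ssize_t n = read(fd, buf, len);      // may block; only the syscall core stalls
        pinTo(homeCore);                     // return to the scheduling core
        return n;
    }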

 

DATA STRUCTURE FOR TASKS THAT DO NOT YET HAVE A CORE

  • Single Global Queue vs Per-Core Queues (both sketched below).
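
Both options in miniature, with placeholder types: the global queue gives perfect load sharing but a single contended lock; per-core queues are contention-free in the common case but can leave a core idle while work sits elsewhere.

    #include <deque>
    #include <mutex>
    #include <optional>

    struct Task { void (*fn)(); };

    // Option A: one global queue shared by all cores.
    class GlobalQueue {
        std::mutex m;
        std::deque<Task> q;
    public:
        void push(Task t) { std::lock_guard<std::mutex> g(m); q.push_back(t); }
        std::optional<Task> pop() {
            std::lock_guard<std::mutex> g(m);
            if (q.empty()) return std::nullopt;
            Task t = q.front(); q.pop_front(); return t;
        }
    };

    // Option B: one queue per core; each core pops locally, and the
    // creator picks a target queue (placement policy left open).
    class PerCoreQueues {
        struct alignas(64) Shard { std::mutex m; std::deque<Task> q; };
        Shard shards[64];                    // assumed 64-core limit
    public:
        void push(int core, Task t) {
            std::lock_guard<std::mutex> g(shards[core].m);
            shards[core].q.push_back(t);
        }
        std::optional<Task> pop(int core) {
            std::lock_guard<std::mutex> g(shards[core].m);
            auto& q = shards[core].q;
            if (q.empty()) return std::nullopt;
            Task t = q.front(); q.pop_front(); return t;
        }
    };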

 

 

HOW TO INTEGRATE PRIORITIES INTO THE THREAD CREATION MECHANISM

  • We want the priorities to be orthogonal to the thread creation mechanism
    • Might involve 'accepting' a low-priority thread creation request without running it.
  • Since we cannot predict the priority of threads to come, and it is not desirable to block them from enqueuing on the fast path, it may make most sense to always first use a stack on the receiving core, and then have the new thread immediately yield if there are higher-priority threads (sketched below).
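
A sketch of that idea, with hypothetical names (thisCore, yield, highPriorityRunnable): creation never inspects priority, so the fast path stays uniform; the new thread's first action is to step aside if higher-priority work is runnable.

    enum Priority { LOW, HIGH };

    struct Core {
        bool highPriorityRunnable = false;   // maintained by the dispatcher (sketch)
        void yield() { /* switch to the dispatcher; placeholder */ }
    };

    thread_local Core* thisCore = nullptr;

    // Trampoline executed as the first code of every new user thread:
    // the creation request is always 'accepted' (a stack is bound on the
    // receiving core), but a low-priority thread gets out of the way
    // immediately if something more urgent is waiting.
    void threadTrampoline(Priority p, void (*body)()) {
        if (p == LOW && thisCore->highPriorityRunnable)
            thisCore->yield();
        body();
    }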

 

 

LOGICAL CONCURRENCY VS DESIRED PARALLELISM (CORE COUNT)

  • They are far from identical, because threads are also created for programming convenience and for deadlock avoidance.
  • When should we ask the kernel for more cores, and when should we return cores to the kernel? (One possible policy is sketched after this list.)
  • Tradeoff - Core efficiency vs latency vs throughput
    • Greedy core allocation is not necessarily best even for latency?
    • How many cores would we be wasting if we did that?
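
One possible policy, sketched with assumed thresholds and a hypothetical requestCore()/releaseCore() kernel interface; the hysteresis gap between the two constants is where the efficiency-vs-latency tradeoff lives.

    #include <cstddef>

    constexpr double GROW_THRESHOLD = 1.5;   // runnable threads per core
    constexpr double SHRINK_THRESHOLD = 0.5; // assumed values, to be tuned

    void requestCore() { /* hypothetical kernel interface */ }
    void releaseCore() { /* hypothetical kernel interface */ }

    void adjustCores(std::size_t runnableThreads, std::size_t& cores) {
        double load = static_cast<double>(runnableThreads) / cores;
        if (load > GROW_THRESHOLD) {
            requestCore();                   // latency is suffering: grow
            cores++;
        } else if (load < SHRINK_THRESHOLD && cores > 1) {
            releaseCore();                   // cores sit idle: give one back
            cores--;
        }
    }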

 

 

TIMING GUARANTEES UNDER COOPERATIVE THREADING

  • Can we have any kind of timing guarantees without preemption?
  • If the application helps, can we bound the scheduling contribution to tail latency by the largest time between yields?
  • Assuming sufficiently few deadlined threads relative to the number of cores, an application can bound the maximum time before a deadlined thread gets to run by the longest non-yielding run time of a non-deadlined thread.
  • How expensive is a yield call, especially a no-op yield call?
    • Hope to make it a few ns (see the fast-path sketch below).
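
A sketch of why a no-op yield can be that cheap, with illustrative names: the common case is a single relaxed load of a per-core flag (likely an L1 hit) followed by a return; only the rare contended case pays for a context switch.

    #include <atomic>

    struct alignas(64) CoreState {
        std::atomic<bool> othersWaiting{false};  // set when the run queue is non-empty
    };

    thread_local CoreState* myCore = nullptr;

    void contextSwitchToDispatcher() { /* user-level context switch; placeholder */ }

    inline void yield() {
        // Fast path: one load, then return.
        if (!myCore->othersWaiting.load(std::memory_order_relaxed))
            return;
        contextSwitchToDispatcher();             // slow path: actually switch
    }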

 

PRIORITIES

  • Current thought is that two priorities are sufficient, but it is possible more are needed.
    • One priority for highly latency-sensitive, but short-running user threads such as pings.
    • Another priority for ordinary cases.
  • How do we handle starvation?
  • Is it necessary to have arbitrarily many priority levels?
  • What are the performance implications of having multiple priority levels?
    • If checking multiple run queues, the cost will increase with the number of priorities (see the dispatch sketch below).
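
A sketch of dispatch with the two proposed levels: the dispatcher always drains the high-priority queue first, so the per-dispatch cost is one queue check per priority level, which is the cost the last bullet refers to.

    #include <deque>
    #include <optional>

    struct Task { void (*fn)(); };

    struct RunQueues {
        std::deque<Task> high;   // latency-sensitive, short-running (e.g. pings)
        std::deque<Task> low;    // ordinary threads

        std::optional<Task> next() {
            for (auto* q : {&high, &low}) {      // one check per priority level
                if (!q->empty()) {
                    Task t = q->front();
                    q->pop_front();
                    return t;
                }
            }
            return std::nullopt;                 // nothing runnable
        }
    };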

 

 

HOW TO MAKE SCHEDULER VERY LOW LATENCY

 

BENEFITS / SELLING POINTS / CONTRIBUTIONS

 0. Making more resource information available at higher levels of software,
    and letting higher levels leverage it to enable better performance.

 1. Having many more threads than we could afford in the kernel
        ==> avoids deadlocks caused by thread limitations.

 2. Ability to run to completion (no preemption at weird times).

 3. Fast context switches
       ==> Reuse of idle time without paying for a kernel context switch
       ==> Allows us to get high throughput without giving up low latency

 4. Policies for relating the number of cores to logical concurrency (number of user threads)
    - Enable us to fill in the idle time without adding extra latency

 

 

PRIORITIES OF EXISTING THREADS VS THREADS WITHOUT A CORE

  • Orthogonal issues: the goal is to run the highest-priority thread.

  • Current thought: we must treat them equally, or prioritize just-created threads, to avoid deadlock.

 

HOW TO PREVENT STARVATION

  • In a priority system, do we want to starve the low-priority threads?
    • Consider specific actual or expected use cases.
  • Idea 1: Ensure that the number of CPU-bound high-priority threads is lower than the number of cores.
  • Idea 2: Ensure there is at least one core that runs only low-priority threads, even if that means putting more than one high-priority thread on the same core (sketched below).
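
A sketch of Idea 2, with core 0 as the reserved core (an arbitrary choice) and a per-core count of high-priority threads as the load signal: low-priority threads always land on the reserved core, and high-priority threads share the remaining cores, doubling up when necessary.

    #include <vector>

    enum Priority { LOW, HIGH };

    // Returns the core for a new thread; assumes at least two cores.
    // Illustrative placement only, not a complete policy.
    int placeThread(Priority p, const std::vector<int>& highCount) {
        if (p == LOW)
            return 0;                        // core 0 runs only low priority
        int best = 1;                        // least-loaded core among 1..N-1
        for (int c = 2; c < static_cast<int>(highCount.size()); c++)
            if (highCount[c] < highCount[best])
                best = c;
        return best;                         // may already hold a high-priority thread
    }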

 

 

LOAD BALANCING BETWEEN CORES

  • If tasks are sufficiently short, do we need to explicitly rebalance load between cores at all?
    • Could we do it with good initial assignment only?
  • Which core should we assign a new thread to?
  • When every core is busy, where do we put the new thread?
  • How many cores could you support with a central queue?
    • What is the throughput of a centralized scheduler queue?
    • Most systems today are moving to work stealing.
    • A central queue lets work go to the first available core, but might suffer from contention.
  • Ideally, we would like to assign to the least loaded core (by number of runnable threads), but how might we obtain this information without incurring additional cache coherency traffic? (One decentralized approach is sketched after this list.)
  • How many cores are we trying to support?
    • Hardware is moving quickly to greater core counts.
    • Suppose you wanted 100 cores, each finishing a task every microsecond; then a task must be handed out every 10 ns, so a centralized scheduler queue probably won't keep up.
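
One decentralized alternative to a central queue, sketched here as the classic "power of two choices" policy (not from the notes; named plainly as a swap-in technique): sample the runnable counts of two random cores and place the thread on the less loaded one. Each placement reads only two remote cache lines instead of contending on one shared queue, at the cost of approximate balance.

    #include <atomic>
    #include <random>

    constexpr int MAX_CORES = 64;            // assumed limit

    struct alignas(64) CoreLoad {            // one cache line per counter
        std::atomic<int> runnable{0};
    };
    CoreLoad loads[MAX_CORES];

    int pickCore(int numCores) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<int> d(0, numCores - 1);
        int a = d(rng), b = d(rng);
        // Place on whichever sampled core has fewer runnable threads.
        int target = loads[a].runnable.load(std::memory_order_relaxed) <=
                     loads[b].runnable.load(std::memory_order_relaxed) ? a : b;
        loads[target].runnable.fetch_add(1, std::memory_order_relaxed);
        return target;
    }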

 

 

USER API

KERNEL API
