Implement sleep strategy using the IO subsystem

Florian Fischer requested to merge aj46ezos/emper:pipe-sleep-strategy into master

Implement a pipe-based sleep strategy using the IO subsystem.

Design goals

  • Wake up either on external newWork notifications or on local IO completions, so the sleep strategy is sound even without the IO completer
  • Do as little as possible in a system saturated with work
  • Pass a hint where to find new work to suspended workers

Algorithm

Data:
    Global:
        hint pipe
        sleepers count
    Per worker:
        dispatch hint buffer
        in flight flag

Sleep:
    if we have no sleep request in flight
            Atomic increment sleep count
            Remember that we are sleeping
            Prepare read cqe from the hint pipe to dispatch hint buffer
    Prevent the completer from reaping completions on this worker's IoContext
    Wait until IO completions occurred

NotifyEmper(n):
    if observed sleepers <= 0
            return

    // Determine how many we are responsible to wake
    do
            toWakeup = min(observed sleepers, n)
    while (!CAS(sleepers, observed sleepers - toWakeup))

    write toWakeup hints to the hint pipe

NotifyAnywhere(n):
    // Ensure all n notifications take effect
    while (!CAS(sleepers, observed sleepers - n))
            if observed sleepers <= -n
                    return

    toWakeup = min(observed sleepers, n)
    write toWakeup hints to the hint pipe

onNewWorkCompletion:
    reset in flight flag
    allow completer to reap completions on this IoContext
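
The counter handling of the two notifier paths above could look roughly like the following C++ sketch. It only covers the sleeper count, not the pipe or the IO subsystem. The function names claimSleepersFromWorker and claimSleepersFromAnywhere are hypothetical and not taken from the EMPER code base. Two details are assumptions of this sketch: the early-return checks run before every CAS attempt, and the number of hints is clamped to zero when the observed count is already negative.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not EMPER's API): NotifyEmper(n) counter handling.
// Returns the number of hints the caller should write to the hint pipe.
inline int64_t claimSleepersFromWorker(std::atomic<int64_t>& sleepers, int64_t n) {
  assert(n > 0);
  int64_t observed = sleepers.load();
  int64_t toWakeup = 0;
  do {
    if (observed <= 0) return 0;  // nobody is sleeping, nothing to write
    toWakeup = std::min(observed, n);
    // On failure compare_exchange_weak stores the current value in `observed`,
    // so the next iteration works with a fresh observation.
  } while (!sleepers.compare_exchange_weak(observed, observed - toWakeup));
  return toWakeup;
}

// Hypothetical helper (not EMPER's API): NotifyAnywhere(n) counter handling.
// Always subtracts n unless the count is already at or below -n.
inline int64_t claimSleepersFromAnywhere(std::atomic<int64_t>& sleepers, int64_t n) {
  assert(n > 0);
  int64_t observed = sleepers.load();
  do {
    if (observed <= -n) return 0;  // enough notifications are already recorded
  } while (!sleepers.compare_exchange_weak(observed, observed - n));
  // Assumption of this sketch: never report a negative number of hints.
  return std::max<int64_t>(0, std::min(observed, n));
}
```

With a count of 3, claimSleepersFromWorker(sleepers, 2) returns 2 and leaves the count at 1. With a count of 1, claimSleepersFromAnywhere(sleepers, 3) returns 1 and leaves the count at -2, so the two remaining notifications stay recorded in the counter.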

Notes

  • We must decrement the sleepers count on the notifier side. Otherwise multiple notifiers could all observe the same number of sleepers and try to wake up the same sleepers by writing to the pipe. This would jam the pipe with unconsumed hints, a notify write would block, and the result would be a deadlock.
  • The CAS loops on the notifier side are needed because decrementing first and then adding back the excess is racy: two notifiers can each observe the sum of both their decrements and add back too much, resulting in a broken counter.
  • Add the dispatch hint code in AbstractWorkStealingScheduler::nextFiber. This allows workers to check the dispatch hint after there was no local work to execute. This is a trade-off: wakeup is slower, because a just-awoken worker first checks for local work, but the dispatch hot path is faster when there is work in the local WSQ.
  • The completer thread must not reap completions on the IoContexts of sleeping workers, because this introduces a race for cqes and a possible lost wakeup if the completer consumes the completions before the worker is actually waiting for them.
  • When notifying sleeping workers from anywhere, we must ensure that all notifications take effect. This is needed, for example, when terminating the runtime, to prevent sleep attempts by worker threads that are about to sleep but have not incremented the sleeper count yet. We achieve this by always decrementing the sleeper count by the notification count.
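
The ordering described in the nextFiber note (local work first, dispatch hint second) can be illustrated with a minimal C++17 sketch. All types and names below (Fiber, Worker, popFront, and this free-standing nextFiber) are hypothetical stand-ins and not the real AbstractWorkStealingScheduler code. The hint is modeled, as an assumption, as the index of a queue that may hold work.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration only.
using Fiber = int;

struct Worker {
  std::deque<Fiber> localQueue;             // stands in for the local WSQ
  std::optional<std::size_t> dispatchHint;  // filled by the hint pipe read
};

inline std::optional<Fiber> popFront(std::deque<Fiber>& q) {
  if (q.empty()) return std::nullopt;
  Fiber f = q.front();
  q.pop_front();
  return f;
}

inline std::optional<Fiber> nextFiber(Worker& self,
                                      std::vector<std::deque<Fiber>>& others) {
  // Hot path: local work is dispatched without looking at the hint.
  if (auto f = popFront(self.localQueue)) return f;

  // Only when there was no local work is the hint consumed.
  if (self.dispatchHint) {
    const std::size_t victim = *self.dispatchHint;
    self.dispatchHint.reset();
    if (victim < others.size()) {
      if (auto f = popFront(others[victim])) return f;
    }
  }

  // A real scheduler would continue with regular work stealing here.
  return std::nullopt;
}
```

In this sketch a worker with one local fiber and a pending hint first returns the local fiber and keeps the hint. The next call consumes the hint and takes a fiber from the hinted queue.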

Thanks to Florian Schmaus flow@cs.fau.de for spotting bugs and suggesting improvements.

Edited by Florian Fischer