Fix reap completion race (!117) · Merge requests · Lehrstuhl für Informatik 4 (Systemsoftware) / manycore / emper

Merged Florian Fischer requested to merge aj46ezos/emper:fix_reap_completion_race into master 4 years ago

Mar 01, 2021

[Runtime] add iterator-based scheduling and optimize Runtime::nextFiber · 941f22fa
Florian Fischer authored 4 years ago

[IO] fix the possible lost wakeup for the IoContext::cq_lock race · e6cc92f1

Florian Fischer authored 4 years ago

Our current naive try lock protecting a worker's IoContext's cq is racy.
This fact alone is not a problem: a try lock is racy by design, in the sense
that two threads race for who can take the lock.

The actual problem is:

While a worker is holding the lock, additional completions can arrive
which the worker does not observe because it may have already finished
iterating the CQ.

If the worker still holds the lock at that point, it prevents the
globalCompleter from reaping the additional completions. The result is a lost
wakeup, possibly leading to a completely sleeping runtime with runnable
completions in a worker's IoContext.

To prevent this lost wakeup, the cq_lock now counts the unsuccessful
lock attempts from the globalCompleter.

If a worker observes that the globalCompleter tried to reapCompletions
more than once, we know that a lost wakeup could have occurred and we try to
reap again.
Observing one attempt is normal, since we know the globalCompleter and the
worker owning the IoContext race for the cq_lock required to reap completions.
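The counting scheme described above could look roughly like the following. This is a minimal sketch, not the code from this merge request: the class name `CountingTryLock`, its methods, the state encoding, and the helper `reapUntilNoLostWakeup` are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a counting try lock; not EMPER's actual code.
// State: 0 = unlocked, n >= 1 = locked with n - 1 recorded failed attempts.
class CountingTryLock {
  std::atomic<std::uint32_t> state{0};

 public:
  // Owner path (worker): plain try lock, a failure is not counted.
  bool tryLock() {
    std::uint32_t expected = 0;
    return state.compare_exchange_strong(expected, 1, std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }

  // Competitor path (globalCompleter): take the lock if it is free,
  // otherwise leave a trace of the failed attempt for the holder.
  bool tryLockOrCount() {
    return state.fetch_add(1, std::memory_order_acq_rel) == 0;
  }

  // Release the lock and report how many attempts failed while it was held.
  std::uint32_t unlockAndGetFailedAttempts() {
    std::uint32_t prev = state.exchange(0, std::memory_order_acq_rel);
    assert(prev >= 1);  // must only be called by the current holder
    return prev - 1;
  }
};

// Worker-side loop: reap again whenever more than one failed attempt
// was observed, since completions may have arrived after the last scan.
// Returns the number of reap rounds that were executed.
template <typename ReapFn>
unsigned reapUntilNoLostWakeup(CountingTryLock& lock, ReapFn reap) {
  unsigned rounds = 0;
  std::uint32_t failed = 0;
  do {
    if (!lock.tryLock()) return rounds;  // someone else is reaping
    reap();
    ++rounds;
    failed = lock.unlockAndGetFailedAttempts();
  } while (failed > 1);
  return rounds;
}
```

The retry threshold of more than one failed attempt follows the commit message; the sketch only shows where the counter is written and read.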

Additionally:

* Reduce the critical section in which the cq_lock is held by copying all
  seen cqes and completing the Futures after the lock is released.

* Don't immediately schedule blocked Fibers or Callbacks; rather collect them
  and return them as a batch. Maybe the caller knows better what to do with a
  batch of runnable Fibers.
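The two additional changes can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the code from this merge request: `Cqe`, `RunnableFiber`, and `IoContextSketch` are hypothetical stand-ins, the completion queue is a plain vector rather than an io_uring CQ, and a simple atomic flag replaces the real cq_lock.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for a completion queue entry and an unblocked Fiber.
struct Cqe {
  int futureId;
  int result;
};
struct RunnableFiber {
  int futureId;
  int result;
};

struct IoContextSketch {
  std::atomic<bool> cqLocked{false};  // simplified stand-in for cq_lock
  std::vector<Cqe> cq;                // stand-in for the completion queue

  // Copy the seen cqes while holding the lock, complete them afterwards,
  // and return the runnable Fibers as a batch instead of scheduling them.
  std::vector<RunnableFiber> reapCompletions() {
    // Try lock: give up immediately if someone else is reaping.
    if (cqLocked.exchange(true, std::memory_order_acquire)) return {};

    // Critical section: only move the entries out of the queue.
    std::vector<Cqe> seen;
    seen.swap(cq);
    assert(cq.empty());
    cqLocked.store(false, std::memory_order_release);

    // Outside the lock: complete each Future and collect what became runnable.
    std::vector<RunnableFiber> runnable;
    runnable.reserve(seen.size());
    for (const Cqe& cqe : seen) {
      runnable.push_back(RunnableFiber{cqe.futureId, cqe.result});
    }
    return runnable;  // the caller decides how to schedule the batch
  }
};
```

Returning the batch leaves the scheduling policy to the caller, which could, for example, keep one Fiber for itself and enqueue the rest.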
