
Fix reap completion race

Merged Florian Fischer requested to merge aj46ezos/emper:fix_reap_completion_race into master
  1. Mar 01, 2021
      [IO] fix the possible lost wakeup for the IoContext::cq_lock race · e6cc92f1
      Florian Fischer authored
      Our current naive try lock protecting a worker's IoContext's cq is racy.
      This fact alone is not a problem: a try lock is racy by design, in the
      sense that two threads race to take the lock.
      
      The actual problem is:
      
      While a worker is holding the lock, additional completions could arrive
      that the worker does not observe because it may already have finished
      iterating the CQ.
      
      If the worker still holds the lock at that point, preventing the
      globalCompleter from reaping the additional completions, a lost wakeup
      can occur, possibly leading to a completely sleeping runtime with
      runnable completions in a worker's IoContext.
      
      To prevent this lost wakeup the cq_lock now counts the unsuccessful
      lock attempts from the globalCompleter.
      
      If a worker observes that the globalCompleter tried to reapCompletions
      more than once, we know that a lost wakeup could have occurred and we try
      to reap again.
      Observing one attempt is normal, since we know the globalCompleter and the
      worker owning the IoContext race for the cq_lock required to reap completions.
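
The counting scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this merge request: `CountingTryLock`, `try_lock(countFailure)`, `unlock()` and `reapAsWorker` are hypothetical names, and the single-atomic encoding (0 means unlocked, 1 + n means locked with n counted failed attempts) is one possible design.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not EMPER's actual cq_lock.
// state == 0: unlocked; state == 1 + n: locked with n counted failed attempts.
class CountingTryLock {
  std::atomic<std::uint32_t> state{0};

 public:
  // Returns true if the lock was taken. If the lock is already held and
  // countFailure is set (assumed to be used by the global completer), the
  // failed attempt is recorded in the lock state before returning false.
  bool try_lock(bool countFailure = false) {
    std::uint32_t s = state.load(std::memory_order_relaxed);
    for (;;) {
      if (s == 0) {
        if (state.compare_exchange_weak(s, 1, std::memory_order_acquire,
                                        std::memory_order_relaxed))
          return true;
      } else {
        if (!countFailure) return false;
        if (state.compare_exchange_weak(s, s + 1, std::memory_order_release,
                                        std::memory_order_relaxed))
          return false;
      }
      // A failed compare-exchange reloaded s; retry with the new value.
    }
  }

  // Releases the lock and returns how many failed attempts were counted
  // while it was held.
  std::uint32_t unlock() {
    std::uint32_t prev = state.exchange(0, std::memory_order_acq_rel);
    assert(prev >= 1 && "unlock() called on an unlocked lock");
    return prev - 1;
  }
};

// Worker-side loop: reap again if more than one failed attempt was counted,
// since completions may have arrived after the CQ was iterated.
// Returns the number of reap passes performed.
template <typename ReapFn>
unsigned reapAsWorker(CountingTryLock& lock, ReapFn reap) {
  unsigned passes = 0;
  for (;;) {
    if (!lock.try_lock()) return passes;  // someone else is reaping
    reap();
    ++passes;
    if (lock.unlock() <= 1) return passes;  // one attempt is the normal race
  }
}
```

Because the failed attempts are folded into the same atomic word as the lock bit, the holder learns about them atomically at the moment it releases the lock, so no attempt can slip in unnoticed between the last check and the unlock.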
      
      Additionally:
      
      * Reduce the critical section in which the cq_lock is held by copying all
        seen cqes and completing the Futures after the lock is released.
      
      * Don't immediately schedule blocked Fibers or Callbacks; rather, collect
        them and return them as a batch. Maybe the caller knows better what to
        do with a batch of runnable Fibers.
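
The two additional changes above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `Cqe`, `Fiber`, `CompletionQueue` and this `reapCompletions` signature are hypothetical stand-ins, a `std::vector` replaces the io_uring CQ ring, and a `std::mutex` replaces the cq_lock.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct Cqe {
  std::int32_t res;        // result of the IO operation
  std::uint64_t userData;  // identifies the waiting Future/Fiber
};

struct Fiber {
  std::uint64_t id;
  std::int32_t result;
};

struct CompletionQueue {
  std::mutex lock;           // stand-in for the cq_lock
  std::vector<Cqe> pending;  // stand-in for the CQ ring
};

// Copies all seen cqes while holding the lock, then processes them after the
// lock is released and returns the unblocked Fibers as one batch instead of
// scheduling each of them immediately.
std::vector<Fiber> reapCompletions(CompletionQueue& cq) {
  std::vector<Cqe> seen;
  {
    std::unique_lock<std::mutex> guard(cq.lock, std::try_to_lock);
    if (!guard.owns_lock()) return {};  // someone else is reaping
    seen.swap(cq.pending);              // short critical section: copy only
  }  // lock released here

  std::vector<Fiber> runnable;
  runnable.reserve(seen.size());
  for (const Cqe& cqe : seen) {
    // "Completing the Future" is reduced to building a runnable Fiber.
    runnable.push_back(Fiber{cqe.userData, cqe.res});
  }
  assert(runnable.size() == seen.size());
  return runnable;  // the caller decides how to schedule the batch
}
```

Returning the batch leaves the scheduling policy to the caller, which could for example enqueue all Fibers at once or keep one to run directly.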