  1. Feb 07, 2022
  2. Jan 23, 2022
  3. Jan 14, 2022
  4. Jan 11, 2022
    • [meson] add boost as dependency · e976a496
      Florian Fischer authored
      I set up a new development environment and emper did not compile because
      emper::io::Stats uses the circular_buffer provided by boost.
      Boost was not installed and our build system failed to detect it.
      
      This change adds the header-only boost dependency to emper.
      https://mesonbuild.com/Dependencies.html#boost
      The header-only dependency is enough to build emper's default configuration.
      
      When linking against boost is required we use the 'modules' kwarg.
  5. Dec 14, 2021
    • fdff953f
    • [IO] overhaul SQPOLL support · 50c965e4
      Florian Fischer authored
      Two meson options control the io_uring sqpoll feature:
      * io_uring_sqpoll - enable sq polling
      * io_uring_shared_poller - share the polling thread between all io_urings
      
      Since kernel 5.12 IORING_SETUP_ATTACH_WQ only causes sharing of the
      poller thread, not the work queues.
      See: https://github.com/axboe/liburing/issues/349
      
      When using SQPOLL userspace has no good way to know how many sqes
      the kernel has consumed. Therefore we wait for available sqes
      using io_uring_sqring_wait if there was no usable sqe.
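
      A minimal sketch of this wait-for-sqe pattern using liburing (the
      getSqe helper is hypothetical; io_uring_get_sqe and
      io_uring_sqring_wait are the liburing calls named above):

      #include <liburing.h>

      // Get an SQE, waiting for the kernel-side poller to consume
      // entries when the SQ ring is currently full (SQPOLL mode).
      static struct io_uring_sqe* getSqe(struct io_uring* ring) {
      	struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
      	while (!sqe) {
      		// Blocks until at least one SQ ring entry is usable again.
      		if (io_uring_sqring_wait(ring) < 0) return nullptr;
      		sqe = io_uring_get_sqe(ring);
      	}
      	return sqe;
      }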
      
      Remove the GlobalIoContext::registerLock and register all worker
      io_uring eventfd reads at the beginning of the completer function.
      Also register all the worker io_uring eventfds since they never change
      and it hopefully reduces overhead in the global io_uring.
  6. Dec 10, 2021
    • Introduce waitfree workstealing · 1c538024
      Florian Fischer authored
      Waitfree work stealing is configured with the meson option
      'waitfree_work_stealing'.
      
      The retry logic is intentionally left in the Queues and not lifted to
      the scheduler to reuse the load of an unsuccessful CAS.
      
      Consider the following pseudo code examples:

      steal() -> bool:
        load
      loop:
        if empty return EMPTY
        cas
        if not WAITFREE and not cas:
          goto loop
        return cas ? STOLEN : LOST_RACE

      outer():
        steal()

      steal() -> res:
        load
        if empty return EMPTY
        cas
        return cas ? STOLEN : LOST_RACE

      outer():
      loop:
        res = steal()
        if not WAITFREE and res == LOST_RACE:
          goto loop

      In the second example the value loaded by a possibly unsuccessful CAS
      cannot be reused, and a loop of unsuccessful CAS' will result in
      double loads.
      
      The number of retries is configurable through a template variable
      maxRetries (see the sketch below):
      * maxRetries < 0: retry indefinitely
      * maxRetries >= 0: retry at most maxRetries times
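
      A C++ sketch of the second variant with bounded retries (the
      single-word queue head and the isEmpty/advance helpers are
      assumptions for illustration): on failure compare_exchange_weak
      writes the freshly read value back into its expected argument,
      which is exactly the reused load described above.

      #include <atomic>
      #include <cstdint>

      enum class StealResult { Empty, Stolen, LostRace };

      static bool isEmpty(uint64_t head) { return head == 0; }     // hypothetical
      static uint64_t advance(uint64_t head) { return head + 1; }  // hypothetical

      // maxRetries < 0: retry indefinitely; maxRetries >= 0: bounded retries.
      template <int maxRetries>
      StealResult steal(std::atomic<uint64_t>& head) {
      	uint64_t observed = head.load(std::memory_order_acquire);  // single explicit load
      	int retries = maxRetries;
      	for (;;) {
      		if (isEmpty(observed)) return StealResult::Empty;
      		// A failed CAS stores the current value into 'observed'.
      		if (head.compare_exchange_weak(observed, advance(observed),
      		                               std::memory_order_acq_rel))
      			return StealResult::Stolen;
      		if (maxRetries >= 0 && retries-- <= 0) return StealResult::LostRace;
      	}
      }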
  7. Dec 06, 2021
    • [meson] set check_anywhere_queue_while_stealing automatic · 7da8e687
      Florian Fischer authored
      We introduced the check_anywhere_queue_while_stealing configuration
      as an optimization to get IO completions reaped by the completer
      into the normal WSQ faster.
      But emper now has configurations where we don't use a completer,
      making this optimization useless or even harmful.

      By default the value of check_anywhere_queue_while_stealing is now
      decided automatically based on the value of io_completer_behavior.
  8. Nov 10, 2021
  9. Oct 13, 2021
    • [meson] introduce lockless memory order and rename lockless option · 67b0c77a
      Florian Fischer authored
      The lockless algorithm can now be enabled by setting -Dio_lockless_cq=true
      and the memory ordering it uses by setting -Dio_lockless_memory_order={weak,strong}.
      
      io_lockless_memory_order=weak:
          read with acquire
          write with release
      
      io_lockless_memory_order=strong:
          read with seq_cst
          write with seq_cst
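
      A sketch of what the two settings translate to in C++ (the macro
      name is made up for illustration):

      #include <atomic>

      #ifdef IO_LOCKLESS_MEMORY_ORDER_WEAK
      // io_lockless_memory_order=weak
      constexpr std::memory_order readOrder = std::memory_order_acquire;
      constexpr std::memory_order writeOrder = std::memory_order_release;
      #else
      // io_lockless_memory_order=strong
      constexpr std::memory_order readOrder = std::memory_order_seq_cst;
      constexpr std::memory_order writeOrder = std::memory_order_seq_cst;
      #endif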
  10. Oct 11, 2021
    • [IoContext] implement lockless CQ reaping · d9d350d9
      Florian Fischer authored
      TODO: think about stats and possible ring buffer pointer overflows and ABA problems.
    • implement IO stealing · 0abc29ad
      Florian Fischer authored
      IO stealing is analogous to work-stealing: worker threads
      without work try to steal IO completions (CQEs) from other workers'
      IoContexts. The work-stealing algorithm is modified to check a victim's
      CQ after finding its work queue empty.
      
      This approach, in combination with future additions (global notifications
      on IO completions and lock-free CQE consumption), is a realistic candidate
      to replace the completer thread without losing its benefits.
      
      To allow IO stealing the CQ must be synchronized which is already the
      case with the IoContext::cq_lock.
      Currently stealing workers always try to pop a single CQE (this could
      be made configurable).
      Steal attempts are recorded in the IoContext's Stats object and
      successfully stolen IO continuations in the AbstractWorkStealingWorkerStats.
      
      I moved the code transforming CQEs into continuation Fibers from
      reapCompletions into a separate function to make the rather complicated
      function more readable and thus easier to understand.
      
      Remove the default CallerEnvironment template arguments to make
      the code more explicit and prevent easy errors (not propagating
      the caller environment or forgetting the function takes a caller environment).
      
      io::Stats now need to use atomics because multiple threads may increment
      them in parallel from the EMPER and OWNER caller environments.
      And since using std::atomic<T*> in a std::map is not easily possible we
      use the compiler's __atomic_* builtins.
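
      A sketch of such an increment using the builtins (the counter
      layout is an assumption):

      #include <cstdint>

      // Owner and stealing workers may bump the same counter in
      // parallel, so a plain increment would be a data race.
      inline void incrementStat(uint64_t* counter, uint64_t n = 1) {
      	__atomic_fetch_add(counter, n, __ATOMIC_RELAXED);
      }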
      
      Add, adjust and fix some comments.
  11. Sep 22, 2021
  12. Aug 19, 2021
  13. Aug 18, 2021
    • [IO] Implement configurable "simple architecture" · 06b5bf0f
      Florian Fischer authored
      Introduce a new meson option io_single_uring which causes EMPER
      to only use the GlobalIoContext for all IO.
      
      To submit SQEs to the io_uring SQ, the SubmitActor is used.
      
      Futures can be in a new state where they are submitted to the SubmitActor
      but not to the io_uring yet.
      In this state (isSubmitted && !isPrepared) the Future must not be destroyed.
      To ensure this we yield when forgetting a Future until it is prepared
      and thus safe to destroy.
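
      A sketch of that rule with stand-in types (Future's flags and the
      yield are assumptions for illustration):

      #include <atomic>

      struct Future {
      	std::atomic<bool> submitted{false};  // handed to the SubmitActor
      	std::atomic<bool> prepared{false};   // sqe prepared on the io_uring
      };

      inline void yieldFiber() { /* stand-in for emper's fiber yield */ }

      // Do not destroy a Future while isSubmitted && !isPrepared:
      // yield until the SubmitActor has prepared it.
      void forgetFuture(Future& future) {
      	while (future.submitted.load() && !future.prepared.load())
      		yieldFiber();  // give the SubmitActor a chance to run
      	// now destroying the Future is safe
      }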
      
      This commit contains no optimizations (no batching, no trying the
      non-blocking syscall first, ...).
      
      Refactor GlobalIoContext.cpp:
      
      * rename globalCompleter to completer
      * make the completer loop non-static
  14. Aug 02, 2021
  15. Jul 14, 2021
    • implement a pipe based sleep strategy using the IO subsystem · 4ec30fd4
      Florian Fischer authored
      Design goals
      ============
      
      * Wakeup either on external newWork notifications or on local IO completions
        -> the sleep strategy is sound without the IO completer
      * Do as little as possible in a system saturated with work
      * Pass a hint where to find new work to suspended workers
      
      Algorithm
      =========
      
      Data:
      	Global:
      		hint pipe
      		sleepers count
      	Per worker:
      		dispatch hint buffer
      		in flight flag
      
      Sleep:
      	if we have no sleep request in flight
      		Atomically increment the sleepers count
      		Remember that we are sleeping
      		Prepare a read from the hint pipe into the dispatch hint buffer
      	Prevent the completer from reaping completions on this worker's IoContext
      	Wait until IO completions occurred
      
      NotifyEmper(n):
      	if observed sleepers <= 0
      		return
      
      	// Determine how many we are responsible to wake
      	do
      		toWakeup = min(observed sleepers, n)
      	while (!CAS(sleepers, observed sleepers - toWakeup))
      
      	write toWakeup hints to the hint pipe
      
      NotifyAnywhere(n):
      	// Ensure all n notifications take effect
      	while (!CAS(sleepers, observed sleepers - n))
      		if observed sleepers <= -n
      			return
      
      	toWakeup = min(observed sleepers, n)
      	write toWakeup hints to the hint pipe
      
      onNewWorkCompletion:
      	reset in flight flag
      	allow completer to reap completions on this IoContext
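
      A C++ sketch of the NotifyEmper loop above (the hint pipe write is
      elided; names are assumptions):

      #include <algorithm>
      #include <atomic>
      #include <cstdint>

      static std::atomic<int64_t> sleepers{0};

      void notifyEmper(int64_t n) {
      	int64_t observed = sleepers.load();
      	int64_t toWakeup;
      	do {
      		if (observed <= 0) return;  // nobody is sleeping
      		toWakeup = std::min(observed, n);
      		// Decrement on the notifier side so concurrent notifiers
      		// cannot claim the same sleepers (see Notes below).
      	} while (!sleepers.compare_exchange_weak(observed, observed - toWakeup));
      	// write toWakeup hints to the hint pipe (elided)
      }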
      
      Notes
      =====
      
      * We must decrement the sleepers count on the notifier side to
        prevent multiple notifiers from observing the same amount of sleepers
        and trying to wake up the same sleepers by writing to the pipe,
        jamming it up with unconsumed hints and thus blocking in the notify
        write, resulting in a deadlock.
      * The CAS loops on the notifier side are needed because decrementing
        and then incrementing back any excess is racy: two notifiers can both
        observe the combined excess of their decrements and increment back
        too much, resulting in a broken counter.
      * Add the dispatch hint code in AbstractWorkStealingScheduler::nextFiber.
        This allows workers to check the dispatch hint when there
        was no local work to execute.
        This is a trade-off where we trade slower wakeup - a just awoken worker
        will check for local work first - against a faster dispatch hot path when
        we have work to do in our local WSQ.
      * The completer thread must not reap completions on the IoContexts of
        sleeping workers because this introduces a race for cqes and a possible
        lost wakeup if the completer consumes the completions before the worker
        is actually waiting for them.
      * When notifying sleeping workers from anywhere we must ensure that all
        notifications take effect. This is needed, for example, when terminating
        the runtime, to prevent sleep attempts from worker threads which are
        about to sleep but have not incremented the sleepers count yet.
        We achieve this by always decrementing the sleepers count by the
        notification count.
      
      Thanks to Florian Schmaus <flow@cs.fau.de> for spotting bugs and suggesting
      improvements.
  16. May 05, 2021
  17. Mar 23, 2021
    • [IO] make the behavior of the completer thread configurable · 5ea44519
      Florian Fischer authored
      Available behaviors:
        * none - the completer thread is not started
      
        * schedule (default) - the completer thread will reap and schedule available
                               completions from worker IoContexts
      
        * wakeup - the completer thread will wake up all workers if it observes
                   completions in a worker IoContext. The Fiber produced by a
                   completion will be scheduled when the worker in whose
                   IoContext the cqe lies reaps its completions.
  18. Mar 12, 2021
  19. Mar 09, 2021
    • [IO] make the lock implementation protecting an IoContext's cq configurable · a619ba3e
      Florian Fischer authored
      This change introduces a new synchronization primitive "PseudoCountingTryLock"
      which takes an actual lock type as template parameter and provides the
      CountingTryLock interface.
      By using a PseudoCountingTryLock we don't have to change any synchronization
      code in IoContext::reapCompletion.
      
      Since all PseudoCountingTryLock code is defined in a header the compiler
      should see our constant return values and hopefully optimize away any
      check depending on them.
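
      A sketch of the idea, assuming a CountingTryLock-style interface
      of try_lock() plus an unlock() that reports contention:

      #include <cstdint>
      #include <mutex>

      // Wraps an ordinary lock behind the counting interface so the
      // synchronization code in IoContext::reapCompletion stays unchanged.
      template <typename Lock>
      class PseudoCountingTryLock {
      	Lock lock;

       public:
      	bool try_lock() { return lock.try_lock(); }

      	// A real CountingTryLock reports how many lock attempts happened
      	// while it was held; the pseudo variant constantly reports none,
      	// which the inliner can use to drop the dependent checks.
      	uint32_t unlock() {
      		lock.unlock();
      		return 0;
      	}
      };

      using CqLock = PseudoCountingTryLock<std::mutex>;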
      
      Options:
      * spin_lock - naive CAS spin lock
      * mutex - std::mutex
      * counting_try_lock (default) - our own lightweight special
                                      purpose synchronization primitive
    • [meson] Fix 'iwyu' target for meson >= 0.57 · 08224cd2
      Florian Schmaus authored
      The run_target() function requires an absolute path in meson >= 0.57.
  20. Mar 08, 2021
  21. Mar 01, 2021
  22. Feb 26, 2021
    • Make LockedUnboundedQueue implementation configurable · 9b949e49
      Florian Fischer authored
      Available implementations, configurable through the meson option
      'locked_unbounded_queue_implementation', are:
      
      mutex - our current LockedUnboundedQueue implementation using std::mutex
      
      rwlock - An implementation using pthread_rwlock. The implementation tries
               to upgrade its rdlock, dropping it and acquiring a wrlock on failure
      
      shared_mutex - An implementation using std::shared_mutex.
               dequeue() acquires a shared lock at first, drops it and
               acquires a unique lock
      
      boost_shared_mutex - An implementation using boost::shared_mutex.
               dequeue() acquires an upgradable lock and upgrades it
               to a unique lock if necessary
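
      A sketch of the shared_mutex variant described above (illustrative
      only; the real LockedUnboundedQueue interface may differ):

      #include <mutex>
      #include <optional>
      #include <queue>
      #include <shared_mutex>

      template <typename T>
      class SharedMutexQueue {
      	std::shared_mutex mutex;
      	std::queue<T> queue;

       public:
      	void enqueue(T item) {
      		std::unique_lock lock(mutex);
      		queue.push(std::move(item));
      	}

      	std::optional<T> dequeue() {
      		{
      			// Cheap concurrent emptiness check under the shared lock.
      			std::shared_lock lock(mutex);
      			if (queue.empty()) return std::nullopt;
      		}
      		// Drop the shared lock, then acquire the unique lock.
      		std::unique_lock lock(mutex);
      		if (queue.empty()) return std::nullopt;  // raced with another dequeue
      		T item = std::move(queue.front());
      		queue.pop();
      		return item;
      	}
      };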
    • add a batch optimization for the global completer · 17776ba2
      Florian Fischer authored
      This change introduces new scheduleFromAnywhere methods which take
      a range of Fibers to schedule.
      
      Blockable gets a new method returning the fiber used to start
      the unblocked context. It is used by Future/PartialCompletableFuture
      to provide a way of completing the Future and returning the continuation
      Fiber to the caller so they may schedule the continuation how they want.
      
      If the meson option io_batch_anywhere_completions is set the global
      completer will collect all callback and continuation fibers and
      schedule them all at once when it is done reaping the completions.
      The idea is that taking the AnywhereQueue write lock and calling onNewWork
      must only be done once.
      
      TODO: investigate if onNewWork should be extended by an amountOfWork
      argument which determines how many workers can be awoken and have work to
      do. This should be trivial since our WorkerWakeupSemaphore implementations
      already support notify_many(), which may be implemented in terms of
      notify_all though.
  23. Feb 23, 2021
    • [WorkerWakeupSemaphore] add three possible implementations · 3cde3e16
      Florian Fischer authored
      LockedSemaphore is the already existing Semaphore using
      a mutex and a condition variable.
      PosixMutex is a thin wrapper around a POSIX semaphore.
      SpuriousFutexSemaphore is an atomic/futex-based implementation
      prone to spurious wakeups, which is fine for the worker wakeup use case.
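
      A sketch of the futex-based variant (illustrative; EMPER's actual
      implementation may differ):

      #include <atomic>
      #include <cstdint>
      #include <linux/futex.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      // Spurious wakeups are acceptable here: a woken worker just
      // rechecks its queues and goes back to sleep if they are empty.
      class SpuriousFutexSemaphore {
      	std::atomic<uint32_t> value{0};

      	long futex(int op, uint32_t val) {
      		return syscall(SYS_futex, reinterpret_cast<uint32_t*>(&value),
      		               op, val, nullptr, nullptr, 0);
      	}

       public:
      	void notify_many(uint32_t n) {
      		value.fetch_add(1, std::memory_order_release);
      		futex(FUTEX_WAKE_PRIVATE, n);  // wake at most n waiters
      	}

      	void wait() {
      		// Sleep only if no notification is pending; FUTEX_WAIT
      		// rechecks the value in the kernel, and callers tolerate
      		// spurious returns.
      		if (value.exchange(0, std::memory_order_acquire) == 0)
      			futex(FUTEX_WAIT_PRIVATE, 0);
      	}
      };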
  24. Feb 22, 2021
  25. Feb 10, 2021
  26. Jan 26, 2021
    • [IO] introduce emper::io an IO subsystem using io_uring · 460c2f05
      Florian Fischer authored
      Emper's IO design is based on a proactor pattern where each worker
      can issue IO requests through its exclusive IoContext object, which wraps
      an io_uring instance.
      
      IO completions are reaped at 4 places:
      1. After a submit to collect inline completions
      2. Before dispatching a new Fiber
      3. When no new IO can be submitted because the completion queue is full
      4. And by a global completer thread which gets notified about completions
         on worker IoContexts through registered eventfds
      
      All IO requests are modeled as Future objects which can either be
      instantiated and submitted manually, retrieved from POSIX-like
      non-blocking functions, or used implicitly through POSIX-like
      blocking functions.
      
      User facing API is exported in the following headers:
      * emper/io.hpp (POSIX-like)
      * emper.h (POSIX-like)
      * emper/io/Future.hpp
      
      Catching short writes/reads/sends and resubmitting the request without
      unblocking the Fiber is supported.
      
      Using AlarmFuture objects Fibers have an emper-native way to sleep for
      a given time.
      
      IO request timeouts are supported with the TimeoutWrapper class.
      Request cancellation is supported with Future::cancel() or the
      CancelWrapper Future class.
      
      A proactor design demands that buffers are committed to the kernel
      as long as the request is active. To guarantee memory safety, Futures
      get canceled in their destructor, which will only return after the
      committed memory is free to use again.
      
      Linking Futures into chains is supported using the Future::SetDependency()
      method. Futures are submitted when the last Future of their chain gets
      submitted. A linked request will start when the previous one has finished.
      Errors or partial completions will cancel the not yet started tail of a chain.
      
      TODO: Handle possible situations where the CQ of the global completer is full
      and no more sqes can be submitted to the SQ.
    • [Blockable] add global set of all blocked contexts for debugging · a745c865
      Florian Fischer authored
      This feature must be activated using the blocked_context_set meson option.
  27. Jan 22, 2021
  28. Jan 13, 2021
  29. Jan 11, 2021
  30. Jan 05, 2021