emper issues (https://gitlab.cs.fau.de/i4/manycore/emper/-/issues, last updated 2022-05-27)

Issue #33: Add scheduleIn(std::uint64_t ns, emper::io::Future::Callback fun) (Florian Schmaus, 2022-05-27)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/33

Possible implementation:
```
void scheduleIn(std::uint64_t ns, emper::io::Future::Callback fun) {
	// Normalize: tv_nsec must stay below one second (1'000'000'000 ns).
	emper::io::AlarmFuture::Timespec ts;
	ts.tv_sec = static_cast<decltype(ts.tv_sec)>(ns / 1'000'000'000);
	ts.tv_nsec = static_cast<decltype(ts.tv_nsec)>(ns % 1'000'000'000);
	emper::io::AlarmFuture alarmFuture(ts);
	alarmFuture.setCallback(fun);
	// Note: alarmFuture lives on this stack frame; the callback-based
	// completion must not require the frame to outlive submit().
	alarmFuture.submit();
}
```

Issue #32: fsearch with continuation stealing fails (Florian Fischer, 2022-03-30)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/32

I wanted to add fsearch using continuation stealing to the EMPER variant of our fs evaluation, but fsearch using continuation stealing fails with:
```
munmap_chunk(): invalid pointer
```

Issue #31: Future cancellation is broken by design (Florian Fischer, 2022-01-21)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/31

The design decision of distributed, independent `io_uring`s per worker, which allows us to submit requests without synchronization, prevents us from having simple cancellation logic.
An `io_uring` can only cancel a request it actually knows about. But in the EMPER design it is entirely possible that a fiber submits a request, blocks, is resumed on a different worker, and then wants to cancel the original request. This will fail because the `io_uring` of the current worker does not know about the request submitted on the previous worker.
The cancellation of the partially completed future chain in our CancelFutureTest is the perfect example:
cancelPartialCompletedChain();
004 1535.46106302149 PS 0x7ff904082460 constructed by fiber Fiber [ptr=0x562abe4b9c40 func=0 arg=0 aff=
nullptr]
004 1535.46106309403 PS 0x7ff9040824f0 constructed by fiber Fiber [ptr=0x562abe4b9c40 func=0 arg=0 aff=
nullptr]
IOC 1535.46106308601 IO 0x7ff904000900 Reaping completions for worker 4
004 1535.46106317007 PS 0x7ff904082580 constructed by fiber Fiber [ptr=0x562abe4b9c40 func=0 arg=0 aff=
nullptr]
004 1535.46106367300 IO 0x7ff904082570 submit read Future 0x7ff904082570 to IoContext 0x7ff904000900
004 1535.46106374043 IO 0x7ff904000900 submitting read Future 0x7ff904082570 and it's dependencies
004 1535.46106381056 IO 0x7ff904000900 Prepare read Future 0x7ff904082450 as a dependency
Worker 4 has prepared a chain of two read futures and submitted those.
004 1535.46106389892 IO 0x7ff904000900 Reaping completions for worker 4
004 1535.46106396855 IO 0x7ff9040824e0 submit write Future 0x7ff9040824e0 to IoContext 0x7ff904000900
004 1535.46106403487 IO 0x7ff904000900 submitting write Future 0x7ff9040824e0
004 1535.46106416892 IO 0x7ff904000900 Reaping completions for worker 4
004 1535.46106423745 IO 0x7ff9040824e0 Waiting on write Future 0x7ff9040824e0
004 1535.46106431088 PS 0x7ff9040824f0 block() blockedContext is 0x7ff904073a00
004 1535.46106443040 C 0x7ff904073a00 saving and switching to Context 0x7ff904083ac0 [tos: 0x7ff904093
b30 bos: 0x7ff904083b40]
004 1535.46106451175 IO 0x7ff904000900 Reaping completions for worker 4
004 1535.46106462526 SLEEP_S 0x7ffedd48c478 going to sleep
Worker 4 has submitted the write requests completing one of the reads and the Fiber blocks until the write is completed.
Thus there is no more work in the system and Worker 4 goes to sleep.
IOC 1535.46106575295 IO 0x7ff904000900 Reaping completions for worker 4
IOC 1535.46106629826 IO 0x7ff904000900 got 2 cqes from worker 4's io_uring
IOC 1535.46106651016 IO 0x7ff9040824e0 Complete write Future 0x7ff9040824e0 with result 8
IOC 1535.46106676874 PS 0x7ff9040824f0 unblock in fast path
IOC 1535.46106689126 IO 0x7ff904082450 Complete read Future 0x7ff904082450 with result 8
IOC 1535.46106698343 PS 0x7ff904082460 no unblock in slow path
IOC 1535.46106711087 SLEEP_S 0x7ffedd48c478 NotifyMany 1 from ANYWHERE
000 1535.46106848001 SLEEP_S 0x7ffedd48c478 awoken
000 1535.46106878818 IO 0x7ff91c000900 Reaping completions for worker 0
000 1535.46106890039 DISP 0x562abe4a0dd8 executing fiber 0x7ff8d0001140
000 1535.46106899837 F 0x7ff8d0001140 run() calling 0 (ZN5FiberC4ERKSt8functionIFvvEEPiEUlPvE_) with a
rg 0
000 1535.46106908683 C 0x7ff91c073a00 discarding and switching to 0x7ff904073a00
000 1535.46106916628 CM 0x562abe4a0c40 Freeing context 0x7ff91c073a00
000 1535.46106932126 PS 0x7ff9040821b0 constructed by fiber Fiber [ptr=0x562abe4b9c40 func=0 arg=0 aff=
nullptr]
The completer thread does its job and reaps the completions of the sleeping Worker 4 and
notifies a sleeping Worker.
Worker 0 is awoken and resumes the blocked Fiber.
000 1535.46106940171 IO 0x7ff9040821a0 submit cancel Future 0x7ff9040821a0 to IoContext 0x7ff91c000900
000 1535.46106947505 IO 0x7ff91c000900 submitting cancel Future 0x7ff9040821a0
000 1535.46106967592 IO 0x7ff91c000900 Reaping completions for worker 0
000 1535.46106975317 IO 0x7ff91c000900 got 1 cqes from worker 0's io_uring
000 1535.46106983652 IO 0x7ff9040821a0 Complete cancel Future 0x7ff9040821a0 with result -2
The cancellation fails with -2 (-ENOENT) because the future submitted
on Worker 4 is unknown to the io_uring of Worker 0.
000 1535.46106991056 PS 0x7ff9040821b0 no unblock in slow path
000 1535.46106998640 IO 0x7ff9040821a0 Waiting on cancel Future 0x7ff9040821a0
000 1535.46107005452 IO 0x7ff904082570 Waiting on read Future 0x7ff904082570
000 1535.46107012415 PS 0x7ff904082580 block() blockedContext is 0x7ff904073a00
000 1535.46107024698 C 0x7ff904073a00 saving and switching to Context 0x7ff91c073a00 [tos: 0x7ff91c083
a70 bos: 0x7ff91c073a80]
000 1535.46107032232 IO 0x7ff91c000900 Reaping completions for worker 0
000 1535.46107040748 SLEEP_S 0x7ffedd48c478 going to sleep
IOC 1535.46107059783 IO 0x7ff91c000900 Reaping completions for worker 0
The result is a sleeping emper.
The problem is that this is not trivially fixable.

Issue #30: BinaryPrivateSemaphoreTest timeout for pipe sleep strategy (Florian Fischer, 2022-01-14)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/30

Job [#485907](https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/485907) failed for f133174a
Emper config
-Dworker_sleep_strategy=pipe
* i4cinode15
* [#485907](https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/485907)
* [#507415](https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/507415)
* [#508657](https://gitlab.cs.fau.de/flow/emper/-/jobs/508657) (pipe-no-completer)
* faui49phi01
* [#488767](https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/488767)

Issue #29: [Job Failed #463590] LinkFutureTest hangs without worker suspension (Florian Fischer, 2022-01-18)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/29

Job [#463590](https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/463590) failed for 964278bc2d6fd0a66b4654e909f4f4d5512fe4c4:
I can not reproduce this on my machine.
What kernel version is the CI using?
* jenkins2
* https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/477320
* https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/479439
* https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/481977
* phi01
* https://gitlab.cs.fau.de/i4/manycore/emper/-/jobs/482393
* i4cinode15
* https://gitlab.cs.fau.de/aj46ezos/emper/-/jobs/483469
* https://gitlab.cs.fau.de/i4/manycore/emper/-/jobs/490542

Issue #28: Find a better name for WakeupStrategy::ThrottleState::pending (Florian Fischer, 2021-10-11)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/28

Issue #27: The introduction of WakeupStrategy introduces/reveals a memory corruption (Florian Fischer, 2021-09-24)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/27

This bug happens in master in debugoptimized builds.
Responsible commit 37143de207fce79a76d6187f7056149b0e9f19f5.
Found using:
git bisect start master ad10eb3a00493a12045dd251fa75dc32205ae80b --
git bisect run meson test -C build-debugoptimized/ c_api_test
The bug results in this stacktrace:
#0 0x00007ffff7dead22 in raise () from /usr/lib/libc.so.6
#1 0x00007ffff7dd4862 in abort () from /usr/lib/libc.so.6
#2 0x00007ffff7e2cd28 in __libc_message () from /usr/lib/libc.so.6
#3 0x00007ffff7e3492a in malloc_printerr () from /usr/lib/libc.so.6
#4 0x00007ffff7e35826 in unlink_chunk.constprop () from /usr/lib/libc.so.6
#5 0x00007ffff7e3607b in _int_free () from /usr/lib/libc.so.6
#6 0x00007ffff7e399e8 in free () from /usr/lib/libc.so.6
#7 0x00007ffff7ded553 in __run_exit_handlers () from /usr/lib/libc.so.6
#8 0x00007ffff7ded64e in exit () from /usr/lib/libc.so.6
#9 0x00005555555552ca in check_fun () at ../tests/c_api_test.c:25
#10 0x00007ffff7d69947 in std::function<void (void*)>::operator()(void*) const (
__args#0=<optimized out>, this=0x7fffdc0018c8) at /usr/include/c++/11.1.0/bits/std_function.h:560
#11 Fiber::run (this=0x7fffdc0018c0) at ../emper/Fiber.cpp:13
#12 0x00007ffff7d943ff in Dispatcher::dispatch (fiber=0x7fffdc0018c0, this=0x55555556b6b8)
at ../emper/Dispatcher.hpp:30
#13 WsDispatcher::dispatchLoop (this=0x55555556b6b8) at ../emper/strategies/ws/WsDispatcher.cpp:20
The stacktrace is from thread 13 and apparently it is currently exiting.
I see this behavior only in the `c_api_test`.

Issue #26: Wakeup-Strategy 'throttle' is unsound (Florian Fischer, 2021-10-11)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/26

Job [#449233](https://gitlab.cs.fau.de/i4/manycore/emper/-/jobs/449233) failed for d8434b57e971136bb8376ce06a04f79a3c100318:

Issue #25: SimpleDiskAndNetworkTest hangs after terminating the runtime (Florian Fischer, 2021-09-24)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/25

I observed the `SimpleDiskAndNetworkTest` hang with only the main thread left, waiting on the `successSem` in the test-runner's main.
It is reproducible for me on master, consistently with release builds and occasionally with debugoptimized builds using an mmaped log file.
### Steps to reproduce
make release
meson test -C build SimpleDiskAndNetworkTest --repeat=10
### First observed
I remember seeing it somewhere in our CI but I cannot find it anymore.

Issue #24: gitlab-ci tests once we can use io_uring from docker (Florian Schmaus, 2021-09-24)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/24

It should be possible to use io_uring from the gitlab runner's docker with the upcoming Debian bullseye update.
Then we should:
- classify unit tests as 'io' tests
- create jobs for io_uring and SINGLE_URING in the 'test' stage

Issue #23: Provide EMPER-native buffered I/O (Florian Schmaus, 2021-08-12)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/23

Issue #22: Alternative completer behavior: Only remove one CQE from the CQ at a time (Florian Schmaus, 2021-08-12)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/22

Instead of having the completer drain a whole CQ at a time, have it remove only one item and then proceed to the next ready CQ. After 8 (or 16) CQEs have been obtained, push those into the AnywhereQueue. Continue doing so until all CQs are drained.

Issue #21: Create a variant where exactly one io_uring is used (Florian Schmaus, 2021-08-23)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/21

For comparison, it would be great to have a variant where only exactly one io_uring is used (instead of one per worker).

Issue #20: Invalid link behaviour changed in io_uring (Florian Fischer, 2021-07-08)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/20

For a couple of months now I have noticed that our `LinkFutureTest` fails in the `Valid->Invalid->Valid` case.
We expect io_uring to submit the broken chain up to the invalid request, but since [cf10960426515](https://github.com/torvalds/linux/commit/cf109604265156bb22c45e0c2aa62f53a697a3f4) `io_uring` no longer submits broken links at all, which seems reasonable.

Issue #19: Incorrect worker count in qemu using kvm (Florian Fischer, 2021-07-05)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/19

When I test emper with a custom kernel in qemu, `Runtime::getDefaultWorkerCount()` returns the number of CPUs available to the host, not to the guest running in qemu.
But running `nproc(1)` in the guest reports the correct value.
That led me to [the coreutils nproc source code](https://github.com/coreutils/gnulib/blob/90e79512d8b385801218d6e9c4d88ff77186560b/lib/nproc.c#L206): they use `min(configured-cpus, online-cpus, cpus-available-to-process)`.
We should use something similar.

Issue #18: Shrink AnywhereQueue / LockedUnboundedQueue (Florian Schmaus, 2021-05-08)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/18

As of now, at least some incarnations of LockedUnboundedQueue do not release memory. We should change that.

Issue #17: emper::IoCompleterBehavior::wakeup fails in IncrementalCompletionTest (Florian Fischer, 2021-04-20)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/17

With !172 the IncrementalCompletionTest fails due to a memory corruption, most of the time in the allocator.
But I could not find any invalid/double frees in the allocation trace using chattymalloc.

Issue #16: CancelFutureTest is unreliable (Florian Schmaus, 2021-04-14)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/16

We observe CancelFutureTest failing with SIGABRT, caused by an assert firing, or running into the test timeout.

Issue #15: Consider using "Fast Random Integer Generation in an Interval", Lemire (2019), e.g. for random victim selection when stealing work (Florian Schmaus, 2021-03-24)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/15

Right now, EMPER's random victim selection is even biased, due to https://gitlab.cs.fau.de/i4/manycore/emper/-/blob/33cad423220b3fc3c4a0f0202a61d45d104b0a19/emper/strategies/AbstractWorkStealingScheduler.cpp#L71
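Lemire's method replaces the modulo with a 64x64-to-128-bit multiply and rejects only the tiny "unfair" sliver of inputs, yielding unbiased bounded integers. A sketch (`boundedRand` is a hypothetical helper, not EMPER's API):

```cpp
#include <cstdint>
#include <random>

// Unbiased mapping of a 64-bit random word into [0, n), after Lemire (2019).
// The rejection loop only runs when the low 64 bits of the product fall
// into the zone of size (2^64 mod n), which is rare for small n.
uint64_t boundedRand(std::mt19937_64& rng, uint64_t n) {
	unsigned __int128 m = static_cast<unsigned __int128>(rng()) * n;
	auto low = static_cast<uint64_t>(m);
	if (low < n) {
		const uint64_t threshold = -n % n;  // == 2^64 mod n
		while (low < threshold) {
			m = static_cast<unsigned __int128>(rng()) * n;
			low = static_cast<uint64_t>(m);
		}
	}
	return static_cast<uint64_t>(m >> 64);  // high bits are the result
}
```

For victim selection this would replace `rng() % workerCount`, removing the modulo bias the issue points at (and the division, which is the paper's performance argument).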
- https://arxiv.org/abs/1805.10941
- https://dl.acm.org/doi/10.1145/3230636

Issue #14: SIGSEGV during exit (Safely exit the program from within the Runtime) (Florian Fischer, 2021-03-24)
https://gitlab.cs.fau.de/i4/manycore/emper/-/issues/14

I can reproduce this [crash](https://gitlab.cs.fau.de/i4/manycore/emper/-/jobs/337220).
It happens when a thread calls `exit` and the cleanup code runs while another thread is still using the runtime under destruction.
Thread 5 receives SIGSEGV in `Scheduler::schedule` because the Runtime object and its Scheduler member are garbage values.
(gdb) i threads
Id Target Id Frame
1 Thread 0x7ffff79b5280 (LWP 444113) "TellActorFromAn" 0x00007ffff7ce79ba in __futex_abstimed_wait_common64 () from /usr/lib/libpthread.so.0
2 Thread 0x7ffff79af640 (LWP 444148) "TellActorFromAn" 0x00007ffff7bfca9d in syscall ()
from /usr/lib/libc.so.6
3 Thread 0x7ffff71ae640 (LWP 444149) "TellActorFromAn" 0x00007ffff7ce79ba in __futex_abstimed_wait_common64 () from /usr/lib/libpthread.so.0
4 Thread 0x7ffff69ad640 (LWP 444150) "TellActorFromAn" 0x00007ffff7ce79ba in __futex_abstimed_wait_common64 () from /usr/lib/libpthread.so.0
* 5 Thread 0x7ffff61ac640 (LWP 444153) "TellActorFromAn" 0x0000555555560640 in Scheduler::schedule (
this=0xfd284c0940fe485, fiber=...) at ../emper/Scheduler.hpp:60
6 Thread 0x7ffff59ab640 (LWP 444156) "TellActorFromAn" 0x00007ffff7fdc272 in _dl_fini ()
from /lib64/ld-linux-x86-64.so.2
runtime and scheduler objects seen by Thread 5
Scheduler object in `Scheduler::schedule`
(gdb) p *this
Cannot access memory at address 0xfd284c0940fe485
Runtime object in `Runtime::schedule`
(gdb) up
#1 0x0000555555560786 in Runtime::schedule (this=0x7ffff7fdc0e7 <_dl_fini+119>, fiber=...)
at ../emper/Runtime.hpp:168
168 scheduler.schedule(fiber);
(gdb) p *this
$1 = {<Logger<(LogSubsystem)6>> = {<No data fields>}, static currentRuntimeMutex =
{<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>},
static currentRuntime = 0x7fffffffe040, workerCount = 19339,
newWorkerHooks = std::vector of length 132845363851615715, capacity -267351304115112441 = {
<error reading variable>
(gdb) p this
$2 = (Runtime * const) 0x7ffff7fdc0e7 <_dl_fini+119>
Thread 6 is destructing the process, resulting in an invalid Runtime object:
(gdb) thread 6
[Switching to thread 6 (Thread 0x7ffff59ab640 (LWP 444156))]
#0 0x00007ffff7fdc272 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0 0x00007ffff7fdc272 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#1 0x00007ffff7b42697 in __run_exit_handlers () from /usr/lib/libc.so.6
#2 0x00007ffff7b4283e in exit () from /usr/lib/libc.so.6
#3 0x00007ffff7f26edb in invokeTest () at ../tests/test-runner/test-runner.cpp:14
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Assignee: Florian Fischer