Since !120 (merged) I observe that the echo server benchmark does not terminate.
I suspect the EchoClient because the server does work with the rust echo client and accepts new connections.
After reverting afafc707 the benchmark works again therefore I am quite sure that the issue must be somewhere in !120 (merged).
Edited
Designs
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
I don't want to shift the blame, but I do wonder if this is not an issue somewhere else than afafc707 that got uncovered by afafc707's different timing behavior (the critical section is shorter with afafc707). Maybe we should try to add a short (1us? 100ns?) sleep() statement before unlocking the cq_lock. Just an idea.
I don't want to blame anyone for anything just share my observation. !120 (merged) still seams correct to me. I will experiment with sleeping in the critical section
Ok now is the time to blame me. I found a first bug. See !132 (merged)
But not sure If this patch solves all our problems yet.
I thought I have seen a double signal of a Future's semaphore with the CountingTryLock but I can't reproduce it right now.
And I could reproduce the double signal using a build-release-cq-lock-mutex echo server and a debugoptimized-log-error echoclient.
#0 0x00007ffff7b5aef5 in raise () from /usr/lib/libc.so.6#1 0x00007ffff7b44862 in abort () from /usr/lib/libc.so.6#2 0x00007ffff7b44747 in __assert_fail_base.cold () from /usr/lib/libc.so.6#3 0x00007ffff7b53646 in __assert_fail () from /usr/lib/libc.so.6#4 0x00007ffff7f8e9b6 in BinaryPrivateSemaphore::signalInternal (this=0x7ffff1d08308) at ../emper/BinaryPrivateSemaphore.cpp:47#5 0x00005555555649a9 in PrivateSemaphore::signalAndGetContinuation (this=0x7ffff1d08308) at ../emper/PrivateSemaphore.hpp:43#6 0x000055555556524a in emper::io::Future::completeAndGetContinuation (this=0x7ffff1d08300, res=32) at ../emper/io/Future.hpp:113#7 0x0000555555565951 in emper::io::PartialCompletableFuture::completeAndGetContinuation ( this=0x7ffff1d08300, res=32) at ../emper/io/Future.hpp:396#8 0x00007ffff7f9cdbb in emper::io::IoContext::reapCompletions<(CallerEnvironment)0> ( this=0x7ffff0000900) at ../emper/io/IoContext.cpp:281#9 0x00007ffff7f9d5b1 in emper::io::IoContext::reapAndScheduleCompletions<(CallerEnvironment)0> ( this=0x7ffff0000900) at ../emper/io/IoContext.hpp:154#10 0x00007ffff7f9b90f in emper::io::IoContext::submit<(CallerEnvironment)0> (this=0x7ffff0000900, future=...) at ../emper/io/IoContext.cpp:156#11 0x00007ffff7f92910 in emper::io::Future::submit (this=0x7ffff1d08290) at ../emper/io/Future.cpp:80#12 0x00005555555656c4 in emper::io::Future::submitAndWait (this=0x7ffff1d08290) at ../emper/io/Future.hpp:280#13 0x0000555555560e8a in clientFunc (client_id=2432, readySem=..., startSem=..., cps=...)
So a worker is trying to submitting a Future into a ring with an full CQ and therefore
has to reap. But it stumbles upon a Future which's semaphore was already signaled.
Maybe this has actually something to do with memory initialization.
I added assert(!state.completed); in Future.hpp:96.
My assumption was that if the Future is actually completed twice the assert would abort before overwriting the values
from the first completion and I could observe the original Future state.
But the new assert does not trigger but the original (BinaryPrivateSemaphore) does still.
This means setCompletion which is always called before we signal the Future's semaphore (Future.hpp:102-115) is not called twice this means the Future's semaphore is also not signaled twice.
This leafs us with some sort of memory bug I suppose.
I could trigger the assert(currentState != signaled); from BinaryPrivateSemaphore.cpp:47 with the new test case in LinkFutureTest introduced by !134 (merged).
The meson cmds I used:
meson build-debugoptimized-no-log-cq-mutex -Dlog_level=Error -Dio_cq_lock_implementation=mutexmeson test -C build-debugoptimized-no-log-cq-mutex --repeat=100
Florian Fischerchanged title from Echo benchmark hangs since afafc7073f3fa2493fd21a346b2034b0cbfcdf80 to Linked Futures can result in a BinaryPrivateSemaphore beeing signaled twice
changed title from Echo benchmark hangs since afafc7073f3fa2493fd21a346b2034b0cbfcdf80 to Linked Futures can result in a BinaryPrivateSemaphore beeing signaled twice
For documentation purposes here what we have discovered so far:
The Future whose BinarySemaphore has bpsState == signaled when signaling it in Future::completeAndGetContinuation is not completed twice because we have an assert for that in Future::setCompletion which is always called before sem.signal{,AndGetContinuation}.
It does not seam to be a memory corruption where someone writes to the front of a Future object
bpsState is explicitly set to signaled (which is 1)
we introduced a canary at the front of Future and moved the semaphore to the back and still only the BPS state was changed to signaled
Backtrace from a crash produced by a -Dlog_level=Error -Dio_cq_lock_implementation=mutex build's LinkFutureTest on 03727b29.
#0 0x00007ffff7b64ef5 in raise () from /usr/lib/libc.so.6#1 0x00007ffff7b4e862 in abort () from /usr/lib/libc.so.6#2 0x00007ffff7b4e747 in __assert_fail_base.cold () from /usr/lib/libc.so.6#3 0x00007ffff7b5d646 in __assert_fail () from /usr/lib/libc.so.6#4 0x00007ffff7f8c3b8 in BinaryPrivateSemaphore::signalInternal (this=0x7fffe8082f48) at ../emper/BinaryPrivateSemaphore.cpp:47#5 0x000055555555fb8f in PrivateSemaphore::signalAndGetContinuation (this=0x7fffe8082f48) at ../emper/PrivateSemaphore.hpp:43#6 0x000055555555fd9e in emper::io::Future::completeAndGetContinuation (this=0x7fffe8082f40, res=8) at ../emper/io/Future.hpp:115#7 0x00005555555604cb in emper::io::PartialCompletableFuture::completeAndGetContinuation ( this=0x7fffe8082f40, res=8) at ../emper/io/Future.hpp:398#8 0x00007ffff7f9aadb in emper::io::IoContext::reapCompletions<(CallerEnvironment)1> ( this=0x7fffe0000900) at ../emper/io/IoContext.cpp:308#9 0x00007ffff7f9fee3 in emper::io::IoContext::reapAndScheduleCompletions<(CallerEnvironment)1> ( this=0x7fffe0000900) at ../emper/io/IoContext.hpp:175#10 0x00007ffff7f9f543 in emper::io::GlobalIoContext::globalCompleterFunc (arg=0x55555557d840) at ../emper/io/GlobalIoContext.cpp:106#11 0x00007ffff7b09299 in start_thread () from /usr/lib/libpthread.so.0#12 0x00007ffff7c27053 in clone () from /usr/lib/libc.so.6
locals from the frame 8 emper::io::IoContext::reapCompletions<(CallerEnvironment)1>
The Futures in reapedCompletions (note we call completeAndGetContinuation() on those from low to high indices).
reapedCompletions[0].second - the WriteFuture triggering the assert in 5:
The WriteFuture is completed because we call BPS::signalAndGetContinuation after Future::setCompletion but we are not completing the Future twice because Future::setCompletion asserts that the Future in not completed yet.
reapedCompletions[1].second - the ReadFuture dependent on the WriteFuture:
Unfortunately I can't see anything suspicious here.
But I think it is still interesting that I only see the assert for the Dependency-Future
never for the Dependent-Future.