Message-ID: <1482595274.3038.35.camel@redhat.com>
Subject: Robust futexes: lost wakeups and design flaws in the glibc/kernel
 synchronization scheme
From: Torvald Riegel <triegel@redhat.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: GLIBC Devel <libc-alpha@sourceware.org>,
        "Zijlstra, Peter" <peter.zijlstra@intel.com>,
        Thomas Gleixner <tglx@linutronix.de>, Rich Felker <dalias@libc.org>
Date: Sat, 24 Dec 2016 17:01:14 +0100
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 8542
Lines: 165

TLDR:
(1) The Linux kernel fails to wake futex waiters when a userspace thread
that releases a mutex is killed between modifying the futex word and the
subsequent FUTEX_WAKE syscall.
(2) POSIX makes certain requirements regarding when destruction of a
mutex is safe that are violated by the current glibc/kernel
synchronization scheme, which can lead to corruption of userspace
memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
the mutex, try to detect this by interpreting (current position in)
userspace code, or we must add a new futex recovery syscall if we want
to preserve the userspace-only mutex-release fastpath.

[Please keep me in CC because I'm not subscribed to LKML.]


=== Robust mutexes have bugs, in both glibc and the kernel

I've been reviewing the implementation of robust mutexes in both glibc
and the kernel support code recently because there are several bug
reports, for example:
https://bugzilla.redhat.com/show_bug.cgi?id=1401665
https://sourceware.org/bugzilla/show_bug.cgi?id=19402
https://sourceware.org/bugzilla/show_bug.cgi?id=14485

This review revealed a bunch of bugs.  I have committed/proposed patches
that fix all glibc-only bugs that I am aware of:
https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html
These fixes are meant to be easy to backport, so they don't contain as
much cleanup and explanation of the glibc code as would be ideal.

I plan to follow these up with more cleanup and adding of documentation,
which should also describe the glibc/kernel synchronization scheme in
more detail.  I believe I have found all the bugs in robust mutexes
(when including the kernel-side bugs described in this email), but I'll
make another review pass when preparing this cleanup+documentation
patch.


=== Background

A brief reminder of how glibc and the kernel attempt to synchronize with
each other:
Each thread maintains a list of robust mutexes that it has acquired, and
a single pointer to a futex word that it is about to acquire or release
(called op_pending).  The list is a concurrent list (to some extent)
because userspace can be interrupted by the post-crash cleanup code run
by the kernel (ie, this is similar to how one would synchronize with a
signal handler).

To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
does the following:
(L1)  atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
(L2)  <"thread-safe" dequeue of mutex->futex from robust mutex list>
(L3)  fw = atomic_exchange(&mutex->futex, 0, memory_order_release);
(L4)  if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1);
(L5)  atomic_store (&op_pending, NULL, memory_order_relaxed);
(I'm not showing the signal fences required; for simplicity, assume that
there is a signal fence after each line.)

In case of a robust PI mutex, L3 and L4 are done by the kernel in
FUTEX_UNLOCK_PI.


=== Lost wakeups caused by the kernel

If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
set, then userspace will have set the futex word to 0 but not yet have
run FUTEX_WAKE.
op_pending still points to the futex word, but handle_futex_death in
kernel/futex.c only calls futex_wake if the TID bits in the futex word
are equal to the TID of the killed thread.  The result of that is a lost
wake-up for one of the threads that set or shared FUTEX_WAITERS.

This can be fixed by running futex_wake conservatively, even if the
value of *op_ending does not match the TID of the killed thread.  This
can lead to spurious wakeups (eg, if the thread was killed between L4
and L5), but userspace code has to, in the general case, be prepared for
spurious futex wake-ups anyway if using POSIX mutexes:
https://lkml.org/lkml/2014/11/27/472
Also, the same spurious wake-ups can happen under slightly different
executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
to timeouts on another thread that set it).


=== POSIX' mutex destruction requirements are violated

In a nutshell, the mutex destruction requirements mean that there must
be a single atomic action that releases a mutex; after a mutex is
released, the thread releasing the mutex must not access the mutex'
memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see
https://lkml.org/lkml/2014/11/27/472 for additional background.

Currently, the kernel can potentially modify the mutex' memory until L5
has executed.  This means that getting killed while releasing a robust
mutex is not guaranteed to be a single atomic action.

For example, assume that Thread1 has acquired the mutex and releases it,
but is killed right before L5; if the program is built in such a way
that only Thread2 will acquire the mutex after Thread1 has released it,
then POSIX allows Thread2 to acquire the mutex, release it, destroy it,
and reuse its memory for something else (these are the destruction
requirements).  So, Thread2 does that and reuses the memory for another
object that is not a mutex, which happens to put the same value as the
TID of the killed thread at the position of the futex word.  Then, the
kernel sees that *op_pending equals the TID, and sets *op_pending to 0,
thus corrupting the other object.
(Note that there is no similar problem when acquiring the mutex because
just the existance of a thread attempting to acquire the mutex disallows
attempting to concurrently destroy the mutex.) 

For FUTEX_UNLOCK_PI, the fix is relatively straightforward: The kernel
must clear op_pending in the syscall if op_pending is equal to the
address of the unlocked futex.  This ensures that unlock is a single
atomic action; the store by the kernel is harmless at least for glibc
(and it would certainly be possible to get the same effect without a
userspace-visible change).

If the userspace fastpath is used (ie, L3), then there is no simple fix.
Never using the fastpath but only something like FUTEX_UNLOCK_PI slows
down lock release, which can be bad for performance.

When using the fastpath, and to be able to solve this in the kernel when
cleaning up after the thread is killed, the kernel would need to be able
to detect whether the thread was killed before the critical atomic
action in L3; the kernel cannot use the value of the futex word to
determine this because it is not guaranteed to be a futex word of a
robust mutex anymore.  Thus, the instruction pointer of the userspace
thread would have to be taken into account, and userspace would have to
enforce code generation in a way that makes it possible for the kernel
to easily detect whether the mutex release sequence is after L3 or not.
On the userspace side, the release sequence starting with L1 is not
exactly trivial, and I'd prefer to not have to maintain assembly for
that (OTOH, maybe some of the machinery of restartable sequences could
be reused here).  Therefore, that's not an approach I'd prefer.

Alternatively, it should be possible to let threads that want to acquire
a mutex with a dead owner request kernel-side recovery of this mutex.
Basically, the mutex-releasing thread would first set the futex word to
an intermediate state that signals that it intends to release the mutex
(eg, setting a special bit but preserving the bits for the TID); it
would then dequeue the mutex from its robust futex list; and then set
the futex word to 0 to finalize release.
Threads that see the special bit in the futex word signaling the intent
to release cannot simply acquire the mutex in userspace.  They would
have to call a new futex operation in the kernel that then either blocks
until woken by a normal futex_wake, or recovers the mutex when the
thread that intended to release it was killed in the meantime.
The advantage of this approach would be that the interface is simpler
for userspace (eg, no funny interpretation of instruction pointers or
such).  However, the kernel-side synchronization is probably more
complex, and there are a couple of details to be solved such as avoiding
an ABA on the TID of the thread that intended to release the mutex.
If there is interest in this approach, I can work on the details of the
general synchronization scheme (though I'd like someone else to take
care of the kernel-side implementation).

Thoughts?


PS: Isn't this problem a neat little holiday present for all those who
enjoy the beauty of futexes? ;)