Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755436AbcLXQBV (ORCPT ); Sat, 24 Dec 2016 11:01:21 -0500 Received: from mx1.redhat.com ([209.132.183.28]:40610 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753004AbcLXQBT (ORCPT ); Sat, 24 Dec 2016 11:01:19 -0500 Message-ID: <1482595274.3038.35.camel@redhat.com> Subject: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme From: Torvald Riegel To: LKML Cc: GLIBC Devel , "Zijlstra, Peter" , Thomas Gleixner , Rich Felker Date: Sat, 24 Dec 2016 17:01:14 +0100 Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.32]); Sat, 24 Dec 2016 16:01:18 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8542 Lines: 165 TLDR: (1) The Linux kernel fails to wake futex waiters when a userspace thread that releases a mutex is killed between modifying the futex word and the subsequent FUTEX_WAKE syscall. (2) POSIX makes certain requirements regarding when destruction of a mutex is safe that are violated by the current glibc/kernel synchronization scheme, which can lead to corruption of userspace memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI robust mutexes must either use a FUTEX_UNLOCK-like syscall to release the mutex, try to detect this by interpreting (current position in) userspace code, or we must add a new futex recovery syscall if we want to preserve the userspace-only mutex-release fastpath. [Please keep me in CC because I'm not subscribed to LKML.] === Robust mutexes have bugs, in both glibc and the kernel I've been reviewing the implementation of robust mutexes in both glibc and the kernel support code recently because there are several bug reports, for example: https://bugzilla.redhat.com/show_bug.cgi?id=1401665 https://sourceware.org/bugzilla/show_bug.cgi?id=19402 https://sourceware.org/bugzilla/show_bug.cgi?id=14485 This review revealed a bunch of bugs. I have committed/proposed patches that fix all glibc-only bugs that I am aware of: https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html These fixes are meant to be easy to backport, so they don't contain as much cleanup and explanation of the glibc code as would be ideal. I plan to follow these up with more cleanup and adding of documentation, which should also describe the glibc/kernel synchronization scheme in more detail. I believe I have found all the bugs in robust mutexes (when including the kernel-side bugs described in this email), but I'll make another review pass when preparing this cleanup+documentation patch. === Background A brief reminder of how glibc and the kernel attempt to synchronize with each other: Each thread maintains a list of robust mutexes that it has acquired, and a single pointer to a futex word that it is about to acquire or release (called op_pending). The list is a concurrent list (to some extent) because userspace can be interrupted by the post-crash cleanup code run by the kernel (ie, this is similar to how one would synchronize with a signal handler). To release a mutex without using FUTEX_UNLOCK_PI, userspace basically does the following: (L1) atomic_store (&op_pending, &mutex->futex, memory_order_relaxed); (L2) <"thread-safe" dequeue of mutex->futex from robust mutex list> (L3) fw = atomic_exchange(&mutex->futex, 0, memory_order_release); (L4) if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1); (L5) atomic_store (&op_pending, NULL, memory_order_relaxed); (I'm not showing the signal fences required; for simplicity, assume that there is a signal fence after each line.) In case of a robust PI mutex, L3 and L4 are done by the kernel in FUTEX_UNLOCK_PI. === Lost wakeups caused by the kernel If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is set, then userspace will have set the futex word to 0 but not yet have run FUTEX_WAKE. op_pending still points to the futex word, but handle_futex_death in kernel/futex.c only calls futex_wake if the TID bits in the futex word are equal to the TID of the killed thread. The result of that is a lost wake-up for one of the threads that set or shared FUTEX_WAITERS. This can be fixed by running futex_wake conservatively, even if the value of *op_ending does not match the TID of the killed thread. This can lead to spurious wakeups (eg, if the thread was killed between L4 and L5), but userspace code has to, in the general case, be prepared for spurious futex wake-ups anyway if using POSIX mutexes: https://lkml.org/lkml/2014/11/27/472 Also, the same spurious wake-ups can happen under slightly different executions without a crash (eg, if FUTEX_WAITERS is set spuriously due to timeouts on another thread that set it). === POSIX' mutex destruction requirements are violated In a nutshell, the mutex destruction requirements mean that there must be a single atomic action that releases a mutex; after a mutex is released, the thread releasing the mutex must not access the mutex' memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see https://lkml.org/lkml/2014/11/27/472 for additional background. Currently, the kernel can potentially modify the mutex' memory until L5 has executed. This means that getting killed while releasing a robust mutex is not guaranteed to be a single atomic action. For example, assume that Thread1 has acquired the mutex and releases it, but is killed right before L5; if the program is built in such a way that only Thread2 will acquire the mutex after Thread1 has released it, then POSIX allows Thread2 to acquire the mutex, release it, destroy it, and reuse its memory for something else (these are the destruction requirements). So, Thread2 does that and reuses the memory for another object that is not a mutex, which happens to put the same value as the TID of the killed thread at the position of the futex word. Then, the kernel sees that *op_pending equals the TID, and sets *op_pending to 0, thus corrupting the other object. (Note that there is no similar problem when acquiring the mutex because just the existance of a thread attempting to acquire the mutex disallows attempting to concurrently destroy the mutex.) For FUTEX_UNLOCK_PI, the fix is relatively straightforward: The kernel must clear op_pending in the syscall if op_pending is equal to the address of the unlocked futex. This ensures that unlock is a single atomic action; the store by the kernel is harmless at least for glibc (and it would certainly be possible to get the same effect without a userspace-visible change). If the userspace fastpath is used (ie, L3), then there is no simple fix. Never using the fastpath but only something like FUTEX_UNLOCK_PI slows down lock release, which can be bad for performance. When using the fastpath, and to be able to solve this in the kernel when cleaning up after the thread is killed, the kernel would need to be able to detect whether the thread was killed before the critical atomic action in L3; the kernel cannot use the value of the futex word to determine this because it is not guaranteed to be a futex word of a robust mutex anymore. Thus, the instruction pointer of the userspace thread would have to be taken into account, and userspace would have to enforce code generation in a way that makes it possible for the kernel to easily detect whether the mutex release sequence is after L3 or not. On the userspace side, the release sequence starting with L1 is not exactly trivial, and I'd prefer to not have to maintain assembly for that (OTOH, maybe some of the machinery of restartable sequences could be reused here). Therefore, that's not an approach I'd prefer. Alternatively, it should be possible to let threads that want to acquire a mutex with a dead owner request kernel-side recovery of this mutex. Basically, the mutex-releasing thread would first set the futex word to an intermediate state that signals that it intends to release the mutex (eg, setting a special bit but preserving the bits for the TID); it would then dequeue the mutex from its robust futex list; and then set the futex word to 0 to finalize release. Threads that see the special bit in the futex word signaling the intent to release cannot simply acquire the mutex in userspace. They would have to call a new futex operation in the kernel that then either blocks until woken by a normal futex_wake, or recovers the mutex when the thread that intended to release it was killed in the meantime. The advantage of this approach would be that the interface is simpler for userspace (eg, no funny interpretation of instruction pointers or such). However, the kernel-side synchronization is probably more complex, and there are a couple of details to be solved such as avoiding an ABA on the TID of the thread that intended to release the mutex. If there is interest in this approach, I can work on the details of the general synchronization scheme (though I'd like someone else to take care of the kernel-side implementation). Thoughts? PS: Isn't this problem a neat little holiday present for all those who enjoy the beauty of futexes? ;)