Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753339AbcCZP46 (ORCPT ); Sat, 26 Mar 2016 11:56:58 -0400 Received: from mout.perfora.net ([74.208.4.197]:56694 "EHLO mout.perfora.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752279AbcCZP4z (ORCPT ); Sat, 26 Mar 2016 11:56:55 -0400 Message-ID: <1459007812.5648.5.camel@petros-ultrathin> Subject: futex: clarification needed with drop_futex_key_refs and memory barriers From: Petros Koutoupis To: linux-kernel@vger.kernel.org Cc: "petros@petroskoutoupis.com" Date: Sat, 26 Mar 2016 10:56:52 -0500 Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.10.4-0ubuntu2 Mime-Version: 1.0 Content-Transfer-Encoding: 7bit X-Provags-ID: V03:K0:mx7qp/cZ8NQJXo4gLw9Gz7g53SBYZnk40l36Xj1hMmLL0jNZmFP Gi8Sy4LgyEv8R7e3ddLsQ/jnqPoQPxBY6chV7w8jR7txyf50e1ogXaxfJENhX/+H6WHcVcs nnZlwSBa8WvGg17Pn2udyONq+hPb2PQutBXhToCpc2X5a9JxEw964RLmEfND4aJq38Kt1fR ZlbemGE0JWGbf0U71ke9Q== X-UI-Out-Filterresults: notjunk:1;V01:K0:eB1Rd70AZY8=:7FBVOMvDxdN99loh75b/3N f5gQhMSLKT7o9j1Rh304XJK0OPFkmkh9Y3QcMeik3JjxEFFDJeRl3gCa0XB35Eic/Ge0DV3MG L6O095ty4m59FS+5OZwWVjQj5SHa8CfsqGageas2fM6ZoT0SDhQ4BRXde8Q0Oph+lzYjDuIvQ NTumKGGWXuoT2IyxZLeRsSeBMruoLjtplkkNZ3FhmhlQrjQwAbQ9n5sifSmKMbqCXpgQR3OMb CtYeMHNV4aO467nKHegEZjUkjIo5YP6H2uQziFFyKBKW0HngP5em+O5GSxK+OLBZLo2JYTf+M 2HzDU95la0yygf0nKA0SrDukmVK1sAvnijpJMzUMyViGqnzGzZW6EsSdpoXU4HC8sp5Y4oKPx DY1d0TGmYa7innHYdigHsVYnMNR57uXmSAx0cJSEwijb7QCNNUnhEhFR7F1GuzC5auVhd1bZ+ kgB6PapyO1epZVgrmQmR5Kr5e8iPe1854PxT+LXAXotF8V+KA6u9wAJcqopLrp/v3jyJFc8H2 FriQWCxPUuiUktBPnauCiulpUu4GCcLJt7hLkT7jmxhBChXGmY+m0N2nr0pzW3cRb62Pis7Ll 1M578vGNCQfWDwyQIMhe9AJV71jyIj86rmBghCzc41ahjBnJi3/F5Q/wJAlnAQ1GLO/mxn/aK 4rXUJge23xLj8T/QPhP64dRQMAgOuPLZqMHiFHVqqGtvbTzcx8K7d61Jo1nhSUGKJ083nkdVx Ye2aH8Zrgd/YStuQ Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2751 Lines: 56 I stumbled on an interesting scenario which I am unable to fully explain and I was hoping to get some other opinions on why this would or wouldn't work. In recent testing on a 48-core Haswell arch server, our multi-threaded user space application was utilizing 60% to 100% more CPU than on our smaller 24-core servers (running an identical load). After spending a considerable amount of time analyzing stack dumps and straces it became immediately apparent that those exact threads operating with the higher CPU utilization were off in futex land. Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0 (futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the handling of private futexes and preventing a race condition by completing the function with a memory barrier. Now, I fully understand why this patch was implemented: to have a memory barrier before checking the "waiters." It makes sense. What doesn't make sense (so far) is when I apply the same patch to the drop counterpart, drop_futex_key_refs(), and the problem goes away. See the change and my notes below. --- linux/kernel/futex.c.orig 2016-03-25 19:45:08.169563263 -0500 +++ linux/kernel/futex.c 2016-03-25 19:46:06.901562211 -0500 @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { case FUT_OFF_INODE: - iput(key->shared.inode); + iput(key->shared.inode); /* implies smp_mb(); (B) */ break; case FUT_OFF_MMSHARED: - mmdrop(key->private.mm); + mmdrop(key->private.mm); /* implies smp_mb(); (B) */ break; + default: + smp_mb(); /* explicit smp_mb(); (B) */ } } The iput() and mmdrop() routines in the switch statement eventually use atomic_dec_and_test() which according to the Documentation/memory-barriers.txt implies an smp_mb() on each side of the actual operation. Notice that private futexes aren't handled by this (read below) this switch. Now there is a wrapper put_futex_key() which is called in a few function as a way to clean up before before retrying, but in every case, and before it is called, a check is made to see if the futex is private and if so, retries at a more appropriate area of its respective function. Now I have found two functions where this type of check/protection aren't made and I am curious as to if I stumbled on what could potentially lead to a race condition in a large SMP environment. Please refer to futex_wait() (called indirectly via unqueue_me()) and futex_requeue(). Any thoughts or opinions would be greatly appreciated. Thank you in advance. -- Petros