Date: Wed, 26 Aug 2015 20:16:59 +0200
From: Peter Zijlstra
To: Thomas Gleixner
Cc: Linus Torvalds, Oleg Nesterov, Paul McKenney, Ingo Molnar,
    mtk.manpages@gmail.com, dvhart@infradead.org, dave@stgolabs.net,
    Vineet.Gupta1@synopsys.com, ralf@linux-mips.org,
    ddaney@caviumnetworks.com, Will Deacon, linux-kernel@vger.kernel.org
Subject: futex atomic vs ordering constraints
Message-ID: <20150826181659.GW16853@twins.programming.kicks-ass.net>

Hi all,

I tried to keep this email short, but failed miserably at this. For the
TL;DR skip to the tail.

So the question of ordering constraints of futex atomic operations has
come up recently:

  http://marc.info/?l=linux-kernel&m=143894765931868

This email will attempt to describe the two primitives and start a
discussion on the constraints.


 * futex_atomic_op_inuser()

There is but a single callsite of this function: futex_wake_op().

It being part of a wake primitive seems to suggest a (RCsc) RELEASE is
the strongest required (the RCsc part because I don't think we want to
expose RCpc to userspace if we don't have to).

The immediate scenario where this is important is:

    CPU0                CPU1                CPU2

    futex_lock(); -> uncontended user acquire
    A = 1;
                        futex_lock(); -> kernel, set pending, sleep
    B = 1;
    futex_unlock();
    if pending
      futex_wake_op
        spin_lock(bh->lock)
        RELEASE
        futex_atomic_op_inuser(); -> futex unlocked

                                            futex_lock() -> uncontended user steal
                                            load A;

In other words, the moment we perform the WAKE_OP userspace can observe
the 'lock' as unlocked and do a lock (steal) acquire of the 'lock'.

If userspace succeeds with this acquire, we need full serialization of
the locked (RCsc) variables (eg A and B in the above).

Of course, if anything else prior to futex_atomic_op_inuser() implies an
(RCsc) RELEASE or stronger the primitive can do without providing
anything itself.

This turns out to be the case: a successful get_futex_key() implies a
full memory barrier; recent: 1d0dcb3ad9d3 ("futex: Implement lockless
wakeups").

And since get_futex_key() is fundamental to doing _anything_ with a
futex, I think it's semi-sane to rely on this.

So we have two valid options:

 - RCsc RELEASE
 - no ordering at all

Current implementation:

    alpha:      MB ll/sc                RELEASE
    arm64:      ll/sc-release MB        FULL
    arm:        MB ll/sc                RELEASE
    mips:       ll/sc MB                ACQUIRE
    powerpc:    lwsync ll/sc sync       FULL


 * futex_atomic_cmpxchg_inatomic()

This is called from:

    lock_pi_update_atomic
    wake_futex_pi
    fixup_pi_state_owner
    futex_unlock_pi
    handle_futex_death

But I think we can form a position from just two of them:

    futex_unlock_pi() and lock_pi_update_atomic()

these end up being ACQUIRE and RELEASE, and a combination of these two
would give us a requirement for full serialization.

And unlike the previous we cannot talk this one away. Even though every
futex op needs a get_futex_key() which implies a full memory barrier,
and every get_futex_key() needs a put_futex_key(), the latter does _NOT_
imply a full barrier.

So while we could relax the RELEASE semantics we cannot relax the
ACQUIRE semantics.
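
To make the ACQUIRE/RELEASE/FULL distinction above concrete, here is a
rough userspace sketch in C11 atomics. It is not kernel code: the
toy_* names and the "0 means unlocked, anything else is the owner id"
encoding are assumptions made up for illustration, and
memory_order_seq_cst is only the closest C11 analogue of a fully
ordered kernel cmpxchg, not an exact match.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a futex word: 0 = unlocked, else owner id. */
typedef _Atomic uint32_t toy_futex_t;

/*
 * Lock-side cmpxchg: wants ACQUIRE, so that accesses inside the
 * critical section cannot be reordered before taking the lock.
 */
static inline bool toy_lock_cmpxchg(toy_futex_t *f, uint32_t tid)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong_explicit(f, &expected, tid,
			memory_order_acquire, memory_order_relaxed);
}

/*
 * Unlock-side cmpxchg: wants RELEASE, so that stores made under the
 * lock (A and B in the scenario) are visible to whoever acquires next.
 */
static inline bool toy_unlock_cmpxchg(toy_futex_t *f, uint32_t tid)
{
	uint32_t expected = tid;

	return atomic_compare_exchange_strong_explicit(f, &expected, 0,
			memory_order_release, memory_order_relaxed);
}

/*
 * One primitive shared by both kinds of caller has to satisfy both
 * requirements at once, which is where "full" ordering comes from.
 */
static inline bool toy_cmpxchg_full(toy_futex_t *f, uint32_t oldval,
				    uint32_t newval)
{
	return atomic_compare_exchange_strong_explicit(f, &oldval, newval,
			memory_order_seq_cst, memory_order_seq_cst);
}
```

A caller would pair toy_lock_cmpxchg() with toy_unlock_cmpxchg() on the
same word; toy_cmpxchg_full() is what a single shared helper would look
like if it must be safe for either use.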

Then there is handle_futex_death(), which is difficult; I _think_ it
wants to be a RELEASE, but state is corrupted anyhow and I can well
imagine not wanting to play any games here and go fully serialized like
we're used to with cmpxchg.

Now the robust stuff doesn't use {get,put}_futex_key() stuff, so no
implied barriers here.

Which leaves us all with a great big mess.

Current implementation:

    alpha:      MB ll/sc                RELEASE
    arm64:      ll/sc-release MB        FULL
    arm:        MB ll/sc MB             FULL
    mips:       ll/sc MB                ACQUIRE
    powerpc:    lwsync ll/sc sync       FULL


There are a few options:

 1) punt, mandate they're both fully ordered and stop thinking about it

 2) make them both fully relaxed, rely on implied barriers and employ
    smp_mb__{before,after}_atomic in key places

Given the current state of things and that I don't really think there
is a compelling performance argument to be made for 2, I would suggest
we go with 1.
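
For readers who want to see what the two options amount to, here is a
rough userspace analogue in C11 atomics. Again this is only a sketch
under assumptions: the toy_* names are invented, and
atomic_thread_fence(memory_order_seq_cst) is used as a stand-in for
smp_mb__{before,after}_atomic, which it approximates but does not
exactly reproduce.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the user word the primitives operate on. */
typedef _Atomic uint32_t toy_word_t;

/* Option 1: the primitive itself is fully ordered; callers add nothing. */
static inline bool toy_cmpxchg_ordered(toy_word_t *w, uint32_t oldval,
				       uint32_t newval)
{
	return atomic_compare_exchange_strong_explicit(w, &oldval, newval,
			memory_order_seq_cst, memory_order_seq_cst);
}

/* Option 2: the primitive provides atomicity only, no ordering. */
static inline bool toy_cmpxchg_relaxed(toy_word_t *w, uint32_t oldval,
				       uint32_t newval)
{
	return atomic_compare_exchange_strong_explicit(w, &oldval, newval,
			memory_order_relaxed, memory_order_relaxed);
}

/*
 * Option 2, at a call site that has no implied barrier to lean on:
 * the ordering is added explicitly around the relaxed primitive.
 */
static inline bool toy_cmpxchg_fenced(toy_word_t *w, uint32_t oldval,
				      uint32_t newval)
{
	bool ok;

	atomic_thread_fence(memory_order_seq_cst); /* ~ smp_mb__before_atomic() */
	ok = toy_cmpxchg_relaxed(w, oldval, newval);
	atomic_thread_fence(memory_order_seq_cst); /* ~ smp_mb__after_atomic() */

	return ok;
}
```

The cost of option 2 shows up in the last helper: every call site has
to be audited to decide whether it can use the relaxed form or needs
the fenced one, whereas option 1 leaves nothing to audit.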