From: Mathieu Desnoyers
Date: Fri, 7 Jul 2023 11:31:46 -0400
Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
To: Jonathan Wakely, Peter Zijlstra
Cc: Olivier Dion, rnk@google.com, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng, Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget, "Paul E. McKenney", Nathan Chancellor, Nick Desaulniers, Tom Rix, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, gcc@gcc.gnu.org, llvm@lists.linux.dev
References: <87ttukdcow.fsf@laura> <20230704094627.GS4253@hirez.programming.kicks-ass.net>

On 7/4/23 06:23, Jonathan Wakely wrote:
> On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote:
>>
>> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
>>
>>> int x = 0;
>>> int y = 0;
>>> int r0, r1;
>>>
>>> int dummy;
>>>
>>> void t0(void)
>>> {
>>>     __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
>>>
>>>     __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
>>>     __atomic_thread_fence(__ATOMIC_SEQ_CST);
>>>
>>>     r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
>>> }
>>>
>>> void t1(void)
>>> {
>>>     __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
>>>     __atomic_thread_fence(__ATOMIC_SEQ_CST);
>>>     r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
>>> }
>>>
>>> // BUG_ON(r0 == 0 && r1 == 0)
>>>
>>> On x86-64 (gcc 13.1 -O2) we get:
>>>
>>> t0():
>>>         movl    $1, x(%rip)
>>>         movl    $1, %eax
>>>         xchgl   dummy(%rip), %eax
>>>         lock orq $0, (%rsp)      ;; Redundant with previous exchange.
>>>         movl    y(%rip), %eax
>>>         movl    %eax, r0(%rip)
>>>         ret
>>> t1():
>>>         movl    $1, y(%rip)
>>>         lock orq $0, (%rsp)
>>>         movl    x(%rip), %eax
>>>         movl    %eax, r1(%rip)
>>>         ret
>>
>> So I would expect the compilers to do better here. It should know those
>> __atomic_thread_fence() thingies are superfluous and simply not emit
>> them. This could even be done as a peephole pass later, where it sees
>> consecutive atomic ops and the second being a no-op.
>
> Right, I don't see why we need a whole set of new built-ins that say
> "this fence isn't needed if the adjacent atomic op already implies a
> fence". If the adjacent atomic op already implies a fence for a given
> ISA, then the compiler should already be able to elide the explicit
> fence.
>
> So just write your code with the explicit fence, and rely on the
> compiler to optimize it properly. Admittedly, today's compilers don't
> do that optimization well, but they also don't support your proposed
> built-ins, so you're going to have to wait for compilers to make
> improvements either way.

Emitting the redundant fences is the plan we have for liburcu. The
current situation unfortunately forces users to choose between
generating inefficient code with C11 atomics and implementing their own
inline assembly until the compilers catch up.

>
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> discusses that compilers could (and should) optimize around atomics
> better.

Our understanding of the C11/C++11 memory model is that it aims at
defining the weakest possible guarantees for each ordering, so that it
can be as efficient as possible on weakly ordered architectures.
However, when writing portable code in practice, the C11/C++11 memory
model forces the programmer to insert memory fences which are redundant
on strongly ordered architectures.

We want something that can apply across procedures from different
modules: e.g.
a mutex lock operation (glibc) has acquire semantics implemented with
an RMW operation, which the caller could promote to a full fence.
Peephole optimizations cannot do this because they focus on a single
basic block. PRE (partial redundancy elimination) can apply across
procedures, but would rely on LTO and possibly function annotation
across modules. I am not aware of any progress in that research field
in the past 6 years. [1-2]

The new atomic builtins we propose allow users to better express their
intent to the compiler, which allows for better code generation and
reduces the number of redundant fences emitted, without having to rely
on optimizations.

It should be noted that the builtin extensions we propose are not
entirely free. Here are our perceived downsides of introducing those
APIs:

- They add complexity to the atomic builtins API.

- They add constraints which need to be taken into account for future
  architecture-specific backend optimizations, as an example the
  (broken) xchg RELEASE | RELAXED -> store on x86 (Clang) [3].

  If an atomic op class (e.g. rmw) can be optimized to a weaker
  instruction by the architecture backend, then the emission of a
  before/after-fence associated with this class of atomic op must be
  pessimistic and assume the weakest instruction pattern which can be
  generated.

There are optimizations of atomics and redundant fences in Clang. The
redundant fence optimizations appear to be limited to a peephole, which
does not appear to leverage the fact that lock-prefixed atomic
operations act as implicit fences on x86. Perhaps this could be
low-hanging fruit for optimization.

We have not observed any similar optimizations in gcc as of today,
which appears to be a concern for many users.
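
To illustrate the point about lock-prefixed operations acting as
implicit fences on x86, here is a minimal sketch of the kind of
hand-written workaround users resort to today. The macro name
fence_after_seq_cst_rmw() is hypothetical: it is neither one of the
builtins we propose nor an existing liburcu API. It assumes that, on
x86, a seq_cst RMW is always emitted as a lock-prefixed (or xchg)
instruction, which is a property of the hardware and of current code
generation, not something the C11 model guarantees.

```c
/*
 * Hypothetical helper (illustration only): a full fence meant to be
 * placed immediately after a __ATOMIC_SEQ_CST read-modify-write.
 */
#if defined(__x86_64__) || defined(__i386__)
/*
 * Assumption: the preceding seq_cst RMW compiles to a lock-prefixed
 * (or xchg) instruction, which already orders earlier stores before
 * later loads. Only a compiler barrier is emitted here.
 */
# define fence_after_seq_cst_rmw()	__atomic_signal_fence(__ATOMIC_SEQ_CST)
#else
/* Other architectures: fall back to an explicit full fence. */
# define fence_after_seq_cst_rmw()	__atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

int x, y, dummy;

/* Same shape as t0() in the quoted example, returning r0. */
int t0(void)
{
	__atomic_store_n(&x, 1, __ATOMIC_RELAXED);

	__atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
	fence_after_seq_cst_rmw();	/* no extra "lock orq" on x86 */

	return __atomic_load_n(&y, __ATOMIC_RELAXED);
}

/* Same as t1() in the quoted example, returning r1. */
int t1(void)
{
	__atomic_store_n(&y, 1, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	return __atomic_load_n(&x, __ATOMIC_RELAXED);
}
```

The drawback of this sketch is that the decision lives in user code
rather than in the compiler: if a backend ever lowers the RMW to a
weaker instruction, the macro silently becomes wrong.
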
[4-7]

Thanks,

Mathieu

[1] https://dl.acm.org/doi/10.1145/3033019.3033021
[2] https://reviews.llvm.org/D5758
[3] https://github.com/llvm/llvm-project/issues/60418
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056
[5] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68622
[6] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86072
[7] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com