Date: Thu, 28 Sep 2023 17:05:59 +0200
From: André Almeida
To: Mathieu Desnoyers
Cc: "linux-kernel@vger.kernel.org", Thomas Gleixner,
    "Paul E . McKenney", Boqun Feng, "H . Peter Anvin", Paul Turner,
    "linux-api@vger.kernel.org", David Laight, Christian Brauner,
    Florian Weimer, "carlos@redhat.com", Peter Oskolkov,
    Alexander Mikhalitsyn, 'Peter Zijlstra', Chris Kennelly, Ingo Molnar,
    Darren Hart, Davidlohr Bueso, "libc-alpha@sourceware.org",
    Steven Rostedt, Jonathan Corbet, Noah Goldstein, Daniel Colascione,
    "longman@redhat.com", Florian Weimer
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
References: <20230529191416.53955-1-mathieu.desnoyers@efficios.com>
    <20230529191416.53955-2-mathieu.desnoyers@efficios.com>
    <20230928103926.GI9829@noisy.programming.kicks-ass.net>
    <34ddb730-8893-19a8-00fe-84c4e281eef1@efficios.com>
In-Reply-To: <34ddb730-8893-19a8-00fe-84c4e281eef1@efficios.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 9/28/23 15:20, Mathieu Desnoyers wrote:
> On 9/28/23 07:22, David Laight wrote:
>> From: Peter Zijlstra
>>> Sent: 28 September 2023 11:39
>>>
>>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptive mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state
>>>> of the mutex owner.
>>
>> Are you trying to avoid spinning when the owning process is sleeping?
>
> Yes, this is my main intent.
>
>> Or trying to avoid the system call when it will find that the futex
>> is no longer held?
>>
>> The latter is really horribly detrimental.
>
> That's a good question. What should we do in those three situations
> when trying to grab the lock:
>
> 1) Lock has no owner
>
> We probably want to simply grab the lock with an atomic instruction.
> But then if other threads are queued on sys_futex and did not manage
> to grab the lock yet, this would be detrimental to fairness.
>
> 2) Lock owner is running:
>
> The lock owner is certainly running on another cpu (I'm using the term
> "cpu" here as logical cpu).
>
> I guess we could either decide to bypass sys_futex entirely and try to
> grab the lock with an atomic, or we go through sys_futex nevertheless
> to allow futex to guarantee some fairness across threads.

About the fairness part:

Even if you enqueue everyone, the futex syscall doesn't provide any
guarantee about the order of the wake. The current implementation tries
to be fair, but I don't think it works for every case. I wouldn't be too
concerned about being fair here, given that it's an inherent problem of
the futex anyway.

From the man pages:

"No guarantee is provided about which waiters are awoken"

>
> 3) Lock owner is sleeping:
>
> The lock owner may be either tied to the same cpu as the requester, or
> a different cpu. Here calling FUTEX_WAIT and friends is pretty much
> required.
>
> Can you elaborate on why skipping sys_futex in scenario (2) would be
> so bad? I wonder if we could get away with skipping futex entirely in
> this scenario and still guarantee fairness by implementing MCS locking
> or ticket locks in userspace. Basically, if userspace queues itself on
> the lock through either MCS locking or ticket locks, it could
> guarantee fairness on its own.
>
> Of course things are more complicated with PI-futex, is that what you
> have in mind?
>
>>
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache,
>>>> and therefore the scheduler may very well fail to clear the on-cpu
>>>> state on preemption. This is expected to be rare though, and is
>>>> resolved as soon as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution,
>>>> preventing statistical guesses from doing a good job at choosing
>>>> between busy-wait and futex wait behavior.
>>>
>>> As always, are syscalls really *that* expensive? Why can't we busy wait
>>> in the kernel instead?
>>>
>>> I mean, sure, meltdown sucked, but most people should now be running
>>> chips that are not affected by that particular horror show, no?
>>
>> IIRC 'page table separation' which is what makes system calls expensive
>> is only a compile-time option. So is likely to be enabled on any
>> 'distro' kernel.
>> But a lot of other mitigations (eg RSB stuffing) are also pretty
>> detrimental.
>>
>> OTOH if you have a 'hot' userspace mutex you are going to lose whatever.
>> All that needs to happen is for an Ethernet interrupt to decide to
>> discard completed transmits and refill the rx ring, and then for the
>> softint code to free a load of stuff deferred by rcu while you've
>> grabbed the mutex and no matter how short the user-space code path
>> the mutex won't be released for absolutely ages.
>>
>> I had to change a load of code to use arrays and atomic increments
>> to avoid delays acquiring the mutex.
>
> That's good input, thanks! I mostly defer to André Almeida on the
> use-case motivation. I mostly provided this POC patch to show that it
> _can_ be done with sys_rseq(2).
>
> Thanks!
>
> Mathieu
>
>>
>>     David
>>
>> -
>> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes,
>> MK1 1PT, UK
>> Registration No: 1397386 (Wales)
>>
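
For reference, the three cases quoted above (no owner, owner running,
owner sleeping) can be written down as a toy decision function. This is
only a sketch of the policy being discussed, not the RFC's code: the
names `Action` and `choose_action`, the use of `None` for "no owner",
and the `spins_left` budget are all assumptions made for illustration.
The budget is there because the on-cpu state is only a hint and may be
stale, so a waiter should probably not spin on it forever.

```python
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    TRY_ATOMIC = auto()   # case 1: grab the lock with an atomic instruction
    SPIN = auto()         # case 2: busy-wait, the owner looks on-cpu
    FUTEX_WAIT = auto()   # case 3: give up the CPU via FUTEX_WAIT


def choose_action(owner_tid: Optional[int], owner_on_cpu: bool,
                  spins_left: int) -> Action:
    """Pick the next step for a thread trying to take the lock.

    owner_tid    -- None when the lock has no owner (hypothetical encoding)
    owner_on_cpu -- the on-cpu hint for the owner; it may be stale
    spins_left   -- remaining spin budget (an assumption of this sketch)
    """
    if owner_tid is None:
        return Action.TRY_ATOMIC
    if owner_on_cpu and spins_left > 0:
        return Action.SPIN
    # Owner is sleeping, or we have spun long enough on a possibly stale hint.
    return Action.FUTEX_WAIT
```

Note that this says nothing about fairness: a ticket or MCS queue in
userspace, as Mathieu suggests, would have to sit in front of the
TRY_ATOMIC step.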