Message-ID: <34ddb730-8893-19a8-00fe-84c4e281eef1@efficios.com>
Date: Thu, 28 Sep 2023 09:20:36 -0400
Subject: Re: [RFC PATCH v2 1/4] rseq: Add sched_state field to struct rseq
From: Mathieu Desnoyers
To: David Laight, 'Peter Zijlstra'
Cc: "linux-kernel@vger.kernel.org", Thomas Gleixner, "Paul E . McKenney",
 Boqun Feng, "H . Peter Anvin", Paul Turner, "linux-api@vger.kernel.org",
 Christian Brauner, Florian Weimer, "carlos@redhat.com", Peter Oskolkov,
 Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
 Davidlohr Bueso, André Almeida, "libc-alpha@sourceware.org",
 Steven Rostedt, Jonathan Corbet, Noah Goldstein, Daniel Colascione,
 "longman@redhat.com", Florian Weimer
References: <20230529191416.53955-1-mathieu.desnoyers@efficios.com>
 <20230529191416.53955-2-mathieu.desnoyers@efficios.com>
 <20230928103926.GI9829@noisy.programming.kicks-ass.net>
List-ID: linux-kernel@vger.kernel.org

On 9/28/23 07:22, David Laight wrote:
> From: Peter Zijlstra
>> Sent: 28 September 2023 11:39
>>
>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote:
>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>> adaptative mutexes to decide more accurately between busy-waiting and
>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
>>> mutex owner.
>
> Are you trying to avoid spinning when the owning process is sleeping?

Yes, this is my main intent.

> Or trying to avoid the system call when it will find that the futex
> is no longer held?
>
> The latter is really horribly detremental.

That's a good question. What should we do in each of these three
situations when trying to grab the lock?

1) Lock has no owner:

We probably want to simply grab the lock with an atomic instruction. But
then if other threads are queued on sys_futex and did not manage to grab
the lock yet, this would be detrimental to fairness.

2) Lock owner is running:

The lock owner is certainly running on another cpu (I'm using the term
"cpu" here as logical cpu).

I guess we could either decide to bypass sys_futex entirely and try to
grab the lock with an atomic, or we go through sys_futex nevertheless to
allow futex to guarantee some fairness across threads.

3) Lock owner is sleeping:

The lock owner may be either tied to the same cpu as the requester, or a
different cpu. Here calling FUTEX_WAIT and friends is pretty much required.

Can you elaborate on why skipping sys_futex in scenario (2) would be so
bad? I wonder if we could get away with skipping futex entirely in this
scenario and still guarantee fairness by implementing MCS locking or
ticket locks in userspace. Basically, if userspace queues itself on the
lock through either MCS locking or ticket locks, it could guarantee
fairness on its own.

Of course things are more complicated with PI-futex. Is that what you
have in mind?
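To make the three scenarios and the ticket-lock idea more concrete, here is
a minimal sketch in C11. It is an illustration only, not code from the
patch. The names choose_action(), owner_on_cpu and struct ticket_lock are
hypothetical. The owner_on_cpu flag stands in for the proposed sched_state
hint, and picking "busy-wait" for scenario (2) is just one of the two
options discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: what a waiter could do in each of the three scenarios. */
enum lock_action { LOCK_TRY_ATOMIC, LOCK_BUSY_WAIT, LOCK_FUTEX_WAIT };

/*
 * owner_tid == 0 means "no owner" (an assumption of this sketch).
 * owner_on_cpu stands in for the proposed rseq sched_state hint; it is
 * only a hint and may be stale.
 */
static enum lock_action choose_action(uint32_t owner_tid, bool owner_on_cpu)
{
	if (owner_tid == 0)
		return LOCK_TRY_ATOMIC;	/* 1) lock has no owner */
	if (owner_on_cpu)
		return LOCK_BUSY_WAIT;	/* 2) lock owner is running */
	return LOCK_FUTEX_WAIT;		/* 3) lock owner is sleeping */
}

/* Userspace ticket lock: waiters are served in FIFO order. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next, 0);
	atomic_init(&l->serving, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Taking a ticket is what queues this thread behind earlier waiters. */
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	/* Pure spinning; a real lock would fall back to a futex wait here. */
	while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	unsigned int cur = atomic_load_explicit(&l->serving,
						memory_order_relaxed);

	/* Debug check: at least one ticket must be outstanding. */
	assert(atomic_load_explicit(&l->next, memory_order_relaxed) != cur);
	atomic_store_explicit(&l->serving, cur + 1, memory_order_release);
}
```

The ticket counter gives FIFO ordering without any help from sys_futex,
which is the fairness property in question. A real implementation would
still need a sleeping path for scenario (3), and this sketch does not
address PI-futex at all.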
>
>>>
>>> It is only provided as an optimization hint, because there is no
>>> guarantee that the page containing this field is in the page cache, and
>>> therefore the scheduler may very well fail to clear the on-cpu state on
>>> preemption. This is expected to be rare though, and is resolved as soon
>>> as the task returns to user-space.
>>>
>>> The goal is to improve use-cases where the duration of the critical
>>> sections for a given lock follows a multi-modal distribution, preventing
>>> statistical guesses from doing a good job at choosing between busy-wait
>>> and futex wait behavior.
>>
>> As always, are syscalls really *that* expensive? Why can't we busy wait
>> in the kernel instead?
>>
>> I mean, sure, meltdown sucked, but most people should now be running
>> chips that are not affected by that particular horror show, no?
>
> IIRC 'page table separation' which is what makes system calls expensive
> is only a compile-time option. So is likely to be enabled on any 'distro'
> kernel.
> But a lot of other mitigations (eg RSB stuffing) are also pretty detrimental.
>
> OTOH if you have a 'hot' userspace mutex you are going to lose whatever.
> All that needs to happen is for a ethernet interrupt to decide to discard
> completed transmits and refill the rx ring, and then for the softint code
> to free a load of stuff deferred by rcu while you've grabbed the mutex
> and no matter how short the user-space code path the mutex won't be
> released for absolutely ages.
>
> I had to change a load of code to use arrays and atomic increments
> to avoid delays acquiring mutex.

That's good input, thanks! I mostly defer to André Almeida on the
use-case motivation. I mostly provided this POC patch to show that it
_can_ be done with sys_rseq(2).

Thanks!

Mathieu

>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com