Date: Thu, 7 Dec 2023 08:44:57 -0500
From: Steven Rostedt
To: "Paul E. McKenney"
Cc: Thomas Gleixner, Ankur Arora, linux-kernel@vger.kernel.org,
 peterz@infradead.org, torvalds@linux-foundation.org, linux-mm@kvack.org,
 x86@kernel.org, akpm@linux-foundation.org, luto@kernel.org, bp@alien8.de,
 dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
 juri.lelli@redhat.com, vincent.guittot@linaro.org, willy@infradead.org,
 mgorman@suse.de, jon.grimm@amd.com, bharata@amd.com, raghavendra.kt@amd.com,
 boris.ostrovsky@oracle.com, konrad.wilk@oracle.com, jgross@suse.com,
 andrew.cooper3@citrix.com, mingo@kernel.org, bristot@kernel.org,
 mathieu.desnoyers@efficios.com, geert@linux-m68k.org,
 glaubitz@physik.fu-berlin.de, anton.ivanov@cambridgegreys.com,
 mattst88@gmail.com, krypton@ulrich-teichert.org, David.Laight@aculab.com,
 richard@nod.at, mjguzik@gmail.com, Simon Horman, Julian Anastasov,
 Alexei Starovoitov, Daniel Borkmann
Subject: Re: [RFC PATCH 47/86] rcu: select PREEMPT_RCU if PREEMPT
Message-ID: <20231207084457.78ab7d31@gandalf.local.home>
In-Reply-To: <842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop>
References: <87wmu2ywrk.ffs@tglx>
 <20231205100114.0bd3c4a2@gandalf.local.home>
 <1375e409-2593-45e1-b27e-3699c17c47dd@paulmck-laptop>
 <20231205154518.70d042c3@gandalf.local.home>
 <842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop>

On Wed, 6 Dec 2023 20:34:11 -0800
"Paul E. McKenney" wrote:

> > > I like the concept, but those with mutex_lock() of rarely-held mutexes
> > > in their fastpaths might have workloads that have a contrary opinion.
> > 
> > I don't understand your above statement. Maybe I wasn't clear with my
> > statement? The above is more about PREEMPT_FULL, as it currently will
> > preempt immediately. My above comment is that we can have an option for
> > PREEMPT_FULL where if the scheduler decided to preempt even in a fast path,
> > it would at least hold off until there's no mutex held. Who cares if it's a
> > fast path when a task needs to give up the CPU for another task? What I
> > worry about is scheduling out while holding a mutex, which increases the
> > chance of that mutex being contended upon. Which does have a drastic impact
> > on performance.
> 
> As I understand the current mutex_lock() code, the fastpaths leave no
> scheduler-visible clue that a mutex is in fact held. If there is no
> such clue, it is quite likely that those fastpaths will need to do some
> additional clue-leaving work, increasing their overhead. And while it
> is always possible that this overhead will be down in the noise, if it
> was too far down in the noise there would be no need for those fastpaths.
> 
> So it is possible (but by no means certain) that some workloads will end
> up caring.

OK, that makes more sense, and I do agree with that statement. It would
need to do something like what spin locks do with preempt_disable(), but I
agree, this would need to be done in a way that does not cause performance
regressions.
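As a rough userspace model of the "clue-leaving" bookkeeping being
discussed here (not kernel code and not anything from the patch series;
names like task_ctx, mutex_lock_fast() and can_preempt_now() are invented
for illustration), the idea would look roughly like this: the lock fastpath
bumps a scheduler-visible counter, much as preempt_disable() does for
spinlocks, and a requested preemption is deferred until the counter drops
back to zero.

/*
 * Userspace model of the idea, not kernel code.  All names are made up
 * for illustration; the point is only to show the kind of bookkeeping
 * the mutex fastpath would have to add.
 */
#include <stdbool.h>
#include <stdio.h>

struct task_ctx {
	int mutexes_held;	/* the scheduler-visible "clue" */
	bool need_resched;	/* pending preemption request */
};

/* Fastpath plus one extra increment; the increment is the added overhead. */
static void mutex_lock_fast(struct task_ctx *t)
{
	/* the real fastpath is a single atomic cmpxchg on the mutex owner */
	t->mutexes_held++;
}

static void mutex_unlock_fast(struct task_ctx *t)
{
	t->mutexes_held--;
	/* a preemption that was held off gets honored at unlock time */
	if (!t->mutexes_held && t->need_resched)
		printf("deferred resched happens now\n");
}

/* Scheduler side: hold off preemption while any mutex is held. */
static bool can_preempt_now(struct task_ctx *t)
{
	return t->mutexes_held == 0;
}

int main(void)
{
	struct task_ctx t = { 0, false };

	mutex_lock_fast(&t);
	t.need_resched = true;		/* preemption requested mid-critical-section */
	printf("preempt now? %s\n", can_preempt_now(&t) ? "yes" : "no");
	mutex_unlock_fast(&t);
	return 0;
}

The extra increment and decrement on every lock and unlock is exactly the
fastpath overhead being worried about above; whether it would be measurable
on real workloads is the open question.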
> > > > > Another is the aforementioned situations where removing the cond_resched()
> > > > > increases latency. Yes, capping the preemption latency is a wonderful
> > > > > thing, and the people I chatted with are all for that, but it is only
> > > > > natural that there would be a corresponding level of concern about the
> > > > > cases where removing the cond_resched() calls increases latency.
> > > > 
> > > > With the "capped preemption" I'm not sure that would still be the case.
> > > > cond_resched() currently only preempts if NEED_RESCHED is set. That means
> > > > the system had to already be in a situation where a schedule needs to
> > > > happen. There's lots of places in the kernel that run for over a tick
> > > > without any cond_resched(). The cond_resched() is usually added for
> > > > locations that show tremendous latency (where either a watchdog triggered,
> > > > or it showed up in some analysis with a latency that was much greater than
> > > > a tick).
> > > 
> > > For non-real-time workloads, the average case is important, not just the
> > > worst case. In the new lazily preemptible mode of thought, a preemption
> > > by a non-real-time task will wait a tick. Earlier, it would have waited
> > > for the next cond_resched(). Which, in the average case, might have
> > > arrived much sooner than one tick.
> > 
> > Or much later. It's random. And what's nice about this model, we can add
> > more models than just "NONE", "VOLUNTARY", "FULL". We could have a way to
> > say "this task needs to preempt immediately" and not just for RT tasks.
> > 
> > This allows the user to decide which task preempts more and which does not
> > (defined by the scheduler), instead of some random cond_resched() that can
> > also preempt a higher priority task that just finished its quota to run a
> > low priority task, causing latency for the higher priority task.
> > 
> > This is what I mean by "think differently".
> 
> I did understand your meaning, and it is a source of some concern. ;-)
> 
> When things become sufficiently stable, larger-scale tests will of course
> be needed, not just different thought.

Fair enough.

> > > > The point is, if/when we switch to the new preemption model, we would need
> > > > to re-evaluate if any cond_resched() is needed. Yes, testing needs to be
> > > > done to prevent regressions. But the reasons I see cond_resched() being
> > > > added today should no longer exist with this new model.
> > > 
> > > This I agree with. Also, with the new paradigm and new mode of thought
> > > in place, it should be safe to drop any cond_resched() that is in a loop
> > > that consumes more than a tick of CPU time per iteration.
> > 
> > Why does that matter? Is the loop not important? Why stop it from finishing
> > for some random task that may not be important? And cond_resched() has no
> > idea if it is or not.
> 
> Because if it takes more than a tick to reach the next cond_resched(),
> lazy preemption is likely to preempt before that cond_resched() is
> reached. Which suggests that such a cond_resched() would not be all
> that valuable in the new thought paradigm. Give or take potential issues
> with exactly where the preemption happens.

I'm just saying that there are lots of places where the above happens,
which is why we are still scattering cond_resched() all over the place.

> > > > > There might be others as well. These are the possibilities that have
> > > > > come up thus far.
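A minimal userspace sketch of the argument above, purely illustrative (the
single-variable encoding and the *_model() names are made up; this is not
the kernel's implementation): cond_resched() is a no-op unless NEED_RESCHED
is already pending, while under the lazy scheme the tick promotes a lazy
request to a hard one after at most one tick, so a loop that lost its
cond_resched() is still preempted within roughly a tick.

/*
 * Userspace model only.  RESCHED_LAZY stands in for NEED_RESCHED_LAZY
 * and RESCHED_NOW for NEED_RESCHED; the encoding is invented.
 */
#include <stdbool.h>
#include <stdio.h>

enum resched { RESCHED_NONE, RESCHED_LAZY, RESCHED_NOW };

static enum resched pending = RESCHED_NONE;

/* Model of cond_resched(): does nothing unless a hard request is pending. */
static bool cond_resched_model(void)
{
	if (pending == RESCHED_NOW) {
		printf("schedule() from cond_resched()\n");
		pending = RESCHED_NONE;
		return true;
	}
	return false;
}

/* Model of the tick: one tick after a lazy request, force preemption. */
static void tick_model(void)
{
	if (pending == RESCHED_LAZY) {
		pending = RESCHED_NOW;
		printf("tick: lazy request promoted, preempt at next opportunity\n");
	}
}

int main(void)
{
	pending = RESCHED_LAZY;	/* the scheduler asked politely */

	cond_resched_model();	/* no-op: no hard request pending yet */
	tick_model();		/* at most one tick later */
	cond_resched_model();	/* now any preemption point reschedules */
	return 0;
}

Run as a normal C program it just prints the ordering of events; the point
is the ordering, not the mechanism.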
> > > > > 
> > > > > > They all suck and keeping some of them is just counterproductive as
> > > > > > again people will sprinkle them all over the place for the very wrong
> > > > > > reasons.
> > > > > 
> > > > > Yes, but do they suck enough and are they counterproductive enough to
> > > > > be useful and necessary? ;-)
> > > > 
> > > > They are only useful and necessary because of the way we handle preemption
> > > > today. With the new preemption model, they are all likely to be useless and
> > > > unnecessary ;-)
> > > 
> > > The "all likely" needs some demonstration. I agree that a great many
> > > of them would be useless and unnecessary. Maybe even the vast majority.
> > > But that is different than "all". ;-)
> > 
> > I'm betting it is "all" ;-) But I also agree that this "needs some
> > demonstration". We are not there yet, and likely will not be until the
> > second half of next year. So we have plenty of time to speak rhetorically
> > to each other!
> 
> You know, we usually find time to engage in rhetorical conversation. ;-)
> 
> > > > The conflict is with the new paradigm (I love that word! It's so "buzzy").
> > > > As I mentioned above, cond_resched() is usually added when a problem was
> > > > seen. I really believe that those problems would never have been seen if
> > > > the new paradigm had already been in place.
> > > 
> > > Indeed, that sort of wording does quite the opposite of raising my
> > > confidence levels. ;-)
> > 
> > Yes, I admit the "manager speak" isn't something to brag about here. But I
> > really do like that word. It's just fun to say (and spell)! Paradigm,
> > paradigm, paradigm! It's that silent 'g'. Although, I wonder if we should
> > be like gnu, and pronounce it when speaking about free software? Although,
> > that makes the word sound worse. :-p
> 
> Pair a' dime, pair a' quarter, pair a' fifty-cent pieces, whatever it takes!

Pair a' two-bits: that's all it's worth.

Or

Pair a' two-cents: as it's my two cents that I'm giving.

> > > You know, the ancient Romans would have had no problem dealing with the
> > > dot-com boom, cryptocurrency, some of the shadier areas of artificial
> > > intelligence and machine learning, and who knows what all else. As the
> > > Romans used to say, "Beware of geeks bearing grifts."
> > > 
> > > > > > 3) Looking at the initial problem Ankur was trying to solve there is
> > > > > >    absolutely no acceptable solution to solve that unless you think
> > > > > >    that the semantically inverse 'allow_preempt()/disallow_preempt()'
> > > > > >    is anywhere near acceptable.
> > > > > 
> > > > > I am not arguing for allow_preempt()/disallow_preempt(), so for that
> > > > > argument, you need to find someone else to argue with. ;-)
> > > > 
> > > > Anyway, there's still a long path before cond_resched() can be removed. It
> > > > was a mistake by Ankur to add those removals this early (and he has
> > > > acknowledged that mistake).
> > > 
> > > OK, that I can live with. But that seems to be a bit different of a
> > > take than that of some earlier emails in this thread. ;-)
> > 
> > Well, we are also stating the final goal. I think there's some confusion
> > about what's going to happen immediately and what's going to happen in
> > the long run.
> 
> If I didn't know better, I might suspect that in addition to the
> confusion, there are a few differences of opinion. ;-)

Confusion enhances differences of opinion.

> > > > First we need to get the new preemption model implemented. When it is, it
> > > > can be just a config option at first. Then when that config option is set,
> > > > you can enable the NONE, VOLUNTARY or FULL preemption modes, even switch
> > > > between them at run time, as they are just a way to tell the scheduler when
> > > > to set NEED_RESCHED_LAZY vs NEED_RESCHED.
> > > 
> > > Assuming CONFIG_PREEMPT_RCU=y, agreed. With CONFIG_PREEMPT_RCU=n,
> > > the runtime switching needs to be limited to NONE and VOLUNTARY.
> > > Which is fine.
> > 
> > But why? Because the run time switches of NONE and VOLUNTARY are no
> > different than FULL.
> > 
> > Why do I say that? Because:
> > 
> > For all modes, once NEED_RESCHED_LAZY is set, the kernel has one tick to
> > get out or NEED_RESCHED will be set (of course that one tick may be
> > configurable). Once NEED_RESCHED is set, the kernel is effectively
> > converted to PREEMPT_FULL.
> > 
> > Even if the user sets the mode to "NONE", after the above scenario (one tick
> > after NEED_RESCHED_LAZY is set) the kernel will be behaving no differently
> > than PREEMPT_FULL.
> > 
> > So why make the distinction for CONFIG_PREEMPT_RCU=n and limit it to only
> > NONE and VOLUNTARY? It must work with FULL or it will be broken for NONE
> > and VOLUNTARY after one tick from NEED_RESCHED_LAZY being set.
> 
> Because PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> combination. All of the gains from PREEMPT_FULL=y are more than lost
> due to PREEMPT_RCU=n, especially when the kernel decides to do something
> like walk a long task list under RCU protection. We should not waste
> people's time getting burned by this combination, nor should we waste
> cycles testing it.

The issue I see here is that PREEMPT_RCU is not something that we can
convert at run time, whereas NONE, VOLUNTARY and FULL (and more to come)
can be. And you have stated that PREEMPT_RCU adds some more overhead that
people may not care for.

But even though you say PREEMPT_RCU=n makes no sense with PREEMPT_FULL, it
doesn't mean we should not allow it. Especially since we have to make sure
that it still works (even NONE and VOLUNTARY turn to FULL after that one
tick).

Remember, what we are looking at is having:

  N : NEED_RESCHED      - schedule at the next possible location
  L : NEED_RESCHED_LAZY - schedule when going into user space

When to set what for a task needing to schedule?

	Model		SCHED_OTHER	RT/DL (or user specified)
	-----		-----------	-------------------------
	NONE		     L			L
	VOLUNTARY	     L			N
	FULL		     N			N

By saying FULL, you are saying that you want SCHED_OTHER as well as RT/DL
tasks to schedule as soon as possible and not wait until going into user
space. This is still applicable even with PREEMPT_RCU=n.

It may be that someone wants better latency for all tasks (like VOLUNTARY)
but not the overhead that PREEMPT_RCU adds, and is OK with the added
latency as a result.

-- Steve
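A purely illustrative userspace sketch of the L/N table above (hypothetical
names; not scheduler code from the patch series). It only encodes which
flag each model would request for SCHED_OTHER versus RT/DL tasks; the
one-tick promotion of L to N is a separate mechanism and is not modeled
here.

/*
 * Illustrative encoding of the table: which resched flag each preemption
 * model would request, per scheduling class.  All names are invented.
 */
#include <stdbool.h>
#include <stdio.h>

enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };
enum resched_flag { FLAG_LAZY = 'L', FLAG_NOW = 'N' };

static enum resched_flag resched_flag_for(enum preempt_model model,
					  bool rt_or_dl)
{
	switch (model) {
	case MODEL_NONE:	/* L for everything */
		return FLAG_LAZY;
	case MODEL_VOLUNTARY:	/* L for SCHED_OTHER, N for RT/DL */
		return rt_or_dl ? FLAG_NOW : FLAG_LAZY;
	case MODEL_FULL:	/* N for everything */
	default:
		return FLAG_NOW;
	}
}

int main(void)
{
	const char *names[] = { "NONE", "VOLUNTARY", "FULL" };

	/* Print the three rows of the table above. */
	for (int m = MODEL_NONE; m <= MODEL_FULL; m++)
		printf("%-9s  SCHED_OTHER=%c  RT/DL=%c\n", names[m],
		       resched_flag_for(m, false), resched_flag_for(m, true));
	return 0;
}

Compiling and running it just reproduces the three rows of the table.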