Received: by 10.223.185.111 with SMTP id b44csp1541346wrg; Sat, 10 Mar 2018 08:07:06 -0800 (PST) X-Google-Smtp-Source: AG47ELuPAUNGrzW2Jc5Lk4i07LhxGm9rCdEYbozBqIUQCS26FHNE7IwfDn+zw4Yu0TffA4Pr6nks X-Received: by 2002:a17:902:ab84:: with SMTP id f4-v6mr2501803plr.239.1520698026261; Sat, 10 Mar 2018 08:07:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1520698026; cv=none; d=google.com; s=arc-20160816; b=eCcWSFcTjbXCwKHbvdR4F16I+w9sZCRIwk6Ejy2VgGoEEyeY3+Pz/NyjZ8lWr0ajME UVeues/TgPlFWkDoFYUiZfOfHT3d3dc562JgQlNKTDe8rwAN/N91sw456A23CunOfucD 1LONivGVj6Lw9wVp7sBYu5CfeDa2heq3+BQFuQahtcsAfBGQptD1VFVoHJ4Ta15ijTFu 80Qws9NUmjNkQQHN/b+sWKZEzTM61nLrqWUdXe6pd9cAqpznHAYOuaaQqsN3G6dtRekw 6mzj7EsLjyTJz7Q/4ImNKl20gumXuPtsreKVTRQ+bp6n6ip9Yn7CkQ85uq0ovq0YfC7h WCoQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:user-agent:in-reply-to :content-disposition:mime-version:references:reply-to:subject:cc:to :from:date:arc-authentication-results; bh=8DnFnCK0iU9Wv454I/TKqb+X2ivRQqPFOxiCiYKCNWI=; b=ef6BQS6LXt4vuaIIR7Z74OX4Z/zbDEWrZhMoVGYSWqewqd48OWuYRUAL6JOOJb0z4L KVFUiAJWSJDB1ceDCOY1ttUp/7O20ivzicYlpHiTSPpQY0pS+fZ7QcmcYGZ6cgXhYgJ3 tUERJToCw58zTrl9oqFaEYxLek0THUEYLTnMfdXuoR9sQ1MbpZ8LWGDybJnqZMn+1QAg tINz8dyyEHCD0JYe8avNVAkt6utDbUUkkd/635TrruSi58mjqX5Zu0vp8o3PCTo4dRFF fIUTxNZ4BBJgl4jPXb9u5PA6nuzm2AsLCy46sgvTLEIuwL3jWJN/OsBG7PFXjO+E1Egq XWSw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id w7si2420628pgp.150.2018.03.10.08.06.51; Sat, 10 Mar 2018 08:07:06 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932474AbeCJQFo (ORCPT + 99 others); Sat, 10 Mar 2018 11:05:44 -0500 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:32814 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932368AbeCJQFn (ORCPT ); Sat, 10 Mar 2018 11:05:43 -0500 Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w2AG3iCa015200 for ; Sat, 10 Mar 2018 11:05:42 -0500 Received: from e17.ny.us.ibm.com (e17.ny.us.ibm.com [129.33.205.207]) by mx0a-001b2d01.pphosted.com with ESMTP id 2gmdsnqhaf-1 (version=TLSv1.2 cipher=AES256-SHA256 bits=256 verify=NOT) for ; Sat, 10 Mar 2018 11:05:42 -0500 Received: from localhost by e17.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Sat, 10 Mar 2018 11:05:41 -0500 Received: from b01cxnp23034.gho.pok.ibm.com (9.57.198.29) by e17.ny.us.ibm.com (146.89.104.204) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Sat, 10 Mar 2018 11:05:37 -0500 Received: from b01ledav003.gho.pok.ibm.com (b01ledav003.gho.pok.ibm.com [9.57.199.108]) by b01cxnp23034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w2AG5aqG46268648; Sat, 10 Mar 2018 16:05:36 GMT Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id F09FCB2046; Sat, 10 Mar 2018 12:07:50 -0500 (EST) Received: from paulmck-ThinkPad-W541 (unknown [9.80.197.11]) by b01ledav003.gho.pok.ibm.com (Postfix) with ESMTP id 87010B205D; Sat, 10 Mar 2018 12:07:50 -0500 (EST) Received: by paulmck-ThinkPad-W541 (Postfix, from userid 1000) id 72B9216C2A6B; Sat, 10 Mar 2018 08:06:14 -0800 (PST) Date: Sat, 10 Mar 2018 08:06:14 -0800 From: "Paul E. McKenney" To: Lai Jiangshan Cc: "Eric W. Biederman" , Ingo Molnar , Linus Torvalds , Tejun Heo , Jann Horn , Benjamin LaHaise , Al Viro , Thomas Gleixner , Peter Zijlstra , LKML Subject: Re: Simplifying our RCU models Reply-To: paulmck@linux.vnet.ibm.com References: <20180305001600.GO3918@linux.vnet.ibm.com> <20180305030949.GP3918@linux.vnet.ibm.com> <20180305082441.4hao2z4dqn2n5on6@gmail.com> <87po4izj67.fsf_-_@xmission.com> <20180305161446.GQ3918@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 18031016-0040-0000-0000-0000040561A2 X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00008647; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000254; SDB=6.01001063; UDB=6.00509249; IPR=6.00780393; MB=3.00019958; MTD=3.00000008; XFM=3.00000015; UTC=2018-03-10 16:05:40 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18031016-0041-0000-0000-000008066992 Message-Id: <20180310160614.GG3918@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2018-03-10_05:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1803100194 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 09, 2018 at 05:48:55PM +0800, Lai Jiangshan wrote: > On Tue, Mar 6, 2018 at 12:14 AM, Paul E. McKenney > wrote: > > On Mon, Mar 05, 2018 at 08:33:20AM -0600, Eric W. Biederman wrote: > >> > >> Moving this discussion to a public list as discussing how to reduce the > >> number of rcu variants does not make sense in private. We should have > >> an archive of such discussions. > >> > >> Ingo Molnar writes: > >> > >> > * Paul E. McKenney wrote: > >> > > >> >> > So if people really want that low-cost RCU, and some people really > >> >> > need the sleepable version, the only one that can _possibly_ be dumped > >> >> > is the preempt one. > >> >> > > >> >> > But I may - again - be confused and/or missing something. > >> >> > >> >> I am going to do something very stupid and say that I was instead thinking in > >> >> terms of getting rid of RCU-bh, thus reminding you of its existence. ;-) > >> >> > >> >> The reason for believing that it is possible to get rid of RCU-bh is the work > >> >> that has gone into improving the forward progress of RCU grace periods under > >> >> heavy load and in corner-case workloads. > >> >> > >> > > >> > [...] > >> > > >> >> The other reason for RCU-sched is it has the side effect of waiting > >> >> for all in-flight hardware interrupt handlers, NMI handlers, and > >> >> preempt-disable regions of code to complete, and last I checked, this side > >> >> effect is relied on. In contrast, RCU-preeempt is only guaranteed to wait > >> >> on regions of code protected by rcu_read_lock() and rcu_read_unlock(). > >> > > >> > Instead of only trying to fix the documentation (which is never a bad idea but it > >> > is fighting the symptom in this case), I think the first step should be to > >> > simplify the RCU read side APIs of RCU from 4 APIs: > >> > > >> > rcu_read_lock() > >> > srcu_read_lock() > >> > rcu_read_lock_sched() > >> > rcu_read_lock_bh() > >> > > >> > ... which have ~8 further sub-model variations depending on CONFIG_PREEMPT, > >> > CONFIG_PREEMPT_RCU - which is really a crazy design! > > > > If it is possible to set CONFIG_PREEMPT_RCU differently than CONFIG_PREEMPT, > > then that is a bug that I need to fix. > > > >> > I think we could reduce this to just two APIs with no Kconfig dependencies: > >> > > >> > rcu_read_lock() > >> > rcu_read_lock_preempt_disable() > >> > > >> > Which would be much, much simpler. > > > > No argument on the simpler part, at least from an API perspective. > > > >> > This is how we could do it I think: > >> > > >> > 1) > >> > > >> > Getting rid of the _bh() variant should be reasonably simple and involve a > >> > treewide replacement of: > >> > > >> > rcu_read_lock_bh() -> local_bh_disable() > >> > rcu_read_unlock_bh() -> local_bh_enable() > >> > > >> > Correct? > > > > Assuming that I have done enough forward-progress work on grace periods, yes. > > > >> > 2) > >> > > >> > Further reducing the variants is harder, due to this main asymmetry: > >> > > >> > !PREEMPT_RCU PREEMPT_RCU=y > >> > rcu_read_lock_sched(): atomic atomic > >> > rcu_read_lock(): atomic preemptible > >> > > >> > ('atomic' here is meant in the scheduler, non-preemptible sense.) > >> > > >> > But if we look at the bigger API picture: > >> > > >> > !PREEMPT_RCU PREEMPT_RCU=y > >> > rcu_read_lock(): atomic preemptiblep > >> > rcu_read_lock_sched(): atomic atomic > >> > srcu_read_lock(): preemptible preemptible > >> > > >> > Then we could maintain full read side API flexibility by making PREEMPT_RCU=y the > >> > only model, merging it with SRCU and using these main read side APIs: > >> > > >> > rcu_read_lock_preempt_disable((): atomic > >> > rcu_read_lock() preemptible > > > > One issue with merging SRCU into rcu_read_lock() is the general blocking > > within SRCU readers. Once merged in, these guys block everyone. We should > > focus initially on the non-SRCU variants. > > > > On the other hand, Linus's suggestion of merging rcu_read_lock_sched() > > into rcu_read_lock() just might be feasible. If that really does pan > > out, we end up with the following: > > > > !PREEMPT PREEMPT=y > > rcu_read_lock(): atomic preemptible > > srcu_read_lock(): preemptible preemptible > > > > In this model, rcu_read_lock_sched() maps to preempt_disable() and (as > > you say above) rcu_read_lock_bh() maps to local_bh_disable(). The way > > this works is that in PREEMPT=y kernels, synchronize_rcu() waits not > > only for RCU read-side critical sections, but also for regions of code > > with preemption disabled. The main caveat seems to be that there be an > > assumed point of preemptibility between each interrupt and each softirq > > handler, which should be OK. > > > > There will be some adjustments required for lockdep-RCU, but that should > > be reasonably straightforward. > > > > Seem reasonable? > > It's good. I hope there is only one global(non-srcu) rcu variant. Well, there will still be both SRCU and RCU-tasks, but reducing by two should at least help. > It does have the trade-off, the grace period will be extended a little > in some cases, > so will the call_rcu()/synchronze_rcu(). But it simplifies the coding a lot. True, and the extended grace periods did bother me at first. But then I realized that it was no worse than RCU-sched, so it should be OK. And thank you for looking this over! Thanx, Paul > >> > It's a _really_ simple and straightforward RCU model, with very obvious semantics > >> > all around: > >> > > >> > - Note how the 'atomic' (non-preempt) variant uses the well-known > >> > preempt_disable() name as a postfix to signal its main property. (It's also a > >> > bit of a mouthful, which should discourage over-use.) > > > > My thought is to eliminate the atomic variant entirely. If you want > > to disable preemption, interrupts, or whatever, you simply do so. > > It might turn out that there are documentation benefits to having a > > separate rcu_read_lock_preempt_disable() that maps to preempt_disable() > > with lockdep semantics, and if so, that can be provided trivially. > > > >> > - The read side APIs are really as straightforward as possible: there's no SRCU > >> > distinction on the read side, no _bh() distinction and no _sched() distinction. > >> > (On -rt all of these would turn into preemptible sections, > >> > obviously.) > > > > Agreed, and both models accomplish that. > > > >> And it looses the one advantage of srcu_read_lock. That you don't have > >> to wait for the entire world. If you actually allow sleeping that is an > >> important distinction to have. Or are you proposing that we add the > >> equivalent of init_srcu_struct to all of the rcu users? > > > > I am instead proposing folding rcu_read_lock_bh() and rcu_read_lock_sched() > > into rcu_read_lock(), and leaving srcu_read_lock() separate. > > > >> That rcu_read_lock would need to take an argument about which rcu region > >> we are talking about. > > > > From what I can see, it would be far better to leave SRCU separate. As you > > say, it really does have very different semantics. > > > >> > rcu_read_lock_preempt_disable() would essentially be all the current > >> > rcu_read_lock_sched() users (where the _sched() postfix was a confusing misnomer > >> > anyway). > > > > I agree that rcu_read_lock_preempt_disable() is a better name. > > We might not need it at all, though. There are only about 20 uses of > > rcu_read_lock_sched() in v4.15. ;-) > > > >> > Wrt. merging SRCU and RCU: this can be done by making PREEMPT_RCU=y the one and > >> > only main RCU model and converting all SRCU users to main RCU. This is relatively > >> > straightforward to perform, as there are only ~170 SRCU critical sections, versus > >> > the 3000+ main RCU critical sections ... > >> > >> It really sounds like you are talking about adding a requirement that > >> everyone update their rcu_read_lock() calls with information about which > >> region you are talking about. That seems like quite a bit of work. > > > > Agreed, merging RCU, RCU-bh, and RCU-sched seems much more straightforward > > to me from the viewpoint of both usage and implementation. > > > >> Doing something implicit when PREEMPT_RCU=y and converting > >> "rcu_read_lock()" to "srcu_read_lock(&kernel_srcu_region)" only in that > >> case I can see. > >> > >> Except in very specific circustances I don't think I ever want to run a > >> kernel with PREEMPT_RCU the default. All of that real time stuff trades > >> off predictability with performance. Having lost enough performance to > >> spectre and meltdown I don't think it makes sense for us all to start > >> runing predictable^H^H^H^H^H^H^H^H^H^H^H time kernels now. > > > > Yes, in PREEMPT=n kernels RCU would act exactly as it does today. > > > >> > AFAICS this should be a possible read side design that keeps correctness, without > >> > considering grace period length patterns, i.e. without considering GC latency and > >> > scalability aspects. > >> > > >> > Before we get into ways to solve the latency and scalability aspects of such a > >> > simplified RCU model, do you agree with this analysis so far, or have I missed > >> > something important wrt. correctness? > >> > >> RCU region specification. If we routinely allow preemption of rcu > >> critical sections for any length of time I can't imagine we will want to > >> wait for every possible preempted rcu critical section. > >> > >> Of course I could see the merge working the other way. Adding the > >> debugging we need to find rcu critical secions that are held to long and > >> shrinking them so we don't need PREEMPT_RCU at all. > > > > Again, from what I can see, merging rcu_read_lock(), rcu_read_lock_sched(), > > and rcu_read_lock_bh() together should get us to a much better place. > > > > Make sense, or am I missing something? > > > > Thanx, Paul > > >