Received: by 10.223.185.111 with SMTP id b44csp106043wrg; Fri, 9 Mar 2018 01:51:20 -0800 (PST) X-Google-Smtp-Source: AG47ELtn9V4DYK3dISDK4HVL/XFjxMN8DApJ33QHIvlLqeNCbXdNosyjfz0VSomA4NjVdqWlrb/e X-Received: by 10.98.58.3 with SMTP id h3mr29466128pfa.178.1520589080092; Fri, 09 Mar 2018 01:51:20 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1520589080; cv=none; d=google.com; s=arc-20160816; b=Dr5YcCyJKC3a/8afm6E4n9Bdl72Qq6xqJZXDbSD7zRBlcvILV9N11inIZLU3mnZX+A pFvotsT6SLmMrVG+nvJwE3GpNsutnqIPH+MprB4J4NbISzqA7Pmt4JizaSqe9BpfQkAb ymyb1Z7AWF/q7FTkbKLkv9KHQPd51XU/oj4E+DMaxjvLVL4mNqHruIZ/5qjXkDJv1r7K QWIaOEMw2PKXiWHK4gyh/WRABkRbaN3W+/sF8acnwCHdVFfVVAco+TKG9oV2vj1nsZCL bgD4tjmYIjwQFWCCuAZSWuxxx0bClO9t2VCelDM/UkJAsc8EkhAFYLBjhD5wAq27X5Zu yr5w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :references:in-reply-to:mime-version:dkim-signature :arc-authentication-results; bh=8Bny6GI6ShdMw7Eo8qewOoTi1A/SX9FJCNzO2KaKxfw=; b=jZvn9n/2mjSxzjXFuu6AKvABHQBoqUb8dTNKUZeYHklzqoSkUQO9xydIHkXg7OnDqB nrjsFm1EsgfSsMH3NMQeP0KULNrJwvpUbt/pSQUrF6i7R/Tl7vhJe/uuVnGMsJeFI05o pLb1yrVt6UJ9fREwDgkc8Cc53oQaFQVJ1gthkICK6lnEq7L2xApjYwf+bL/QNeEH96r4 vDvk/FYQkm8FA/2FY7fcvmyhmUcLW9b+B05fFkKnsNYKh5CZshk8BFn3RXbXXnYVCdZq wAsmQKG92fAalFa3MG5y+/RpWrgTkBlCQqeCqoRiVNDL5ScUKMxtn9QoPnXO+OY8bv1t irDg== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@gmail.com header.s=20161025 header.b=jgquddTB; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d10-v6si592536pls.660.2018.03.09.01.51.05; Fri, 09 Mar 2018 01:51:20 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=fail header.i=@gmail.com header.s=20161025 header.b=jgquddTB; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751198AbeCIJtD (ORCPT + 99 others); Fri, 9 Mar 2018 04:49:03 -0500 Received: from mail-it0-f53.google.com ([209.85.214.53]:53576 "EHLO mail-it0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751169AbeCIJs5 (ORCPT ); Fri, 9 Mar 2018 04:48:57 -0500 Received: by mail-it0-f53.google.com with SMTP id w63so2094160ita.3 for ; Fri, 09 Mar 2018 01:48:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=8Bny6GI6ShdMw7Eo8qewOoTi1A/SX9FJCNzO2KaKxfw=; b=jgquddTBpfJxV/kcMVo3mdHuUeRykg6Ey4pLkFhK6zzts9ZXdnhTemo+YuMcc7gQ44 EAOoNT6EiwGeCEqa1llaH6FvDVqaAQ13hG/UZ+6MswwInAkXAGEKgmpFNDUEtqNyO1t6 PvV4P+RCSvHUINaiNzN7kfRaDzY/Gk7UbTbEG7zLXSkPuwbgCxQSx3T3/A6Ztx+asJ+q Pcoc0jtOj2zvnwTizXCq75lHCcqziIValBDH1Divhi5GAd1IQa1yeHM5hUXynrBXzjrA R/B++EHbBGdskdcQiF6gEPozNNdZFtlgxePxi8k2fejK/h/xG5IB0K9G2M3RHhTYoWNc Dlxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=8Bny6GI6ShdMw7Eo8qewOoTi1A/SX9FJCNzO2KaKxfw=; b=BfGAOwV1ujgKvvVrQoWVLDo/3XHXZYL5RFxQjGxIoC85ez/eSWExRXBEZ96bJThVHs jLMUsCWYDWvSw31An2cxfBepw/fJFo9or6HWBGBWSU68NHJVu6t1LJl5EMn1BJR9E8BS JHydjK5ydKvVfcJTeSe0wIxUhL3TAHDK0yZqcgE55PWTKip/WGJs4TLK0btBk8aSaoT6 PmpS5QBvoLHraM/S4i8GFZvtqN9Werxb7EqkdCJ7bX8NyPny1mmwCAdrs/LT/FAXF13u uiaDtzazpCf0QFntFL+8yRRQgw0bF+fwygQ4mD6FsyznnyiBcSSoPxWM8ab5OXrhq1/h RIDQ== X-Gm-Message-State: AElRT7GxV6DGfUOppjF8UluJuySQHYA3rpsnNnUD+OgkqTVvHjuoKU2N jvChQVrYndWh//sHL9q2dQ2WX85oPR893buyIT8= X-Received: by 10.36.131.3 with SMTP id d3mr2549605ite.149.1520588936258; Fri, 09 Mar 2018 01:48:56 -0800 (PST) MIME-Version: 1.0 Received: by 10.79.120.3 with HTTP; Fri, 9 Mar 2018 01:48:55 -0800 (PST) In-Reply-To: <20180305161446.GQ3918@linux.vnet.ibm.com> References: <20180305001600.GO3918@linux.vnet.ibm.com> <20180305030949.GP3918@linux.vnet.ibm.com> <20180305082441.4hao2z4dqn2n5on6@gmail.com> <87po4izj67.fsf_-_@xmission.com> <20180305161446.GQ3918@linux.vnet.ibm.com> From: Lai Jiangshan Date: Fri, 9 Mar 2018 17:48:55 +0800 X-Google-Sender-Auth: GpOk7HYcldMBag_bjsOgP3KgqfU Message-ID: Subject: Re: Simplifying our RCU models To: "Paul E. McKenney" Cc: "Eric W. Biederman" , Ingo Molnar , Linus Torvalds , Tejun Heo , Jann Horn , Benjamin LaHaise , Al Viro , Thomas Gleixner , Peter Zijlstra , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 6, 2018 at 12:14 AM, Paul E. McKenney wrote: > On Mon, Mar 05, 2018 at 08:33:20AM -0600, Eric W. Biederman wrote: >> >> Moving this discussion to a public list as discussing how to reduce the >> number of rcu variants does not make sense in private. We should have >> an archive of such discussions. >> >> Ingo Molnar writes: >> >> > * Paul E. McKenney wrote: >> > >> >> > So if people really want that low-cost RCU, and some people really >> >> > need the sleepable version, the only one that can _possibly_ be dumped >> >> > is the preempt one. >> >> > >> >> > But I may - again - be confused and/or missing something. >> >> >> >> I am going to do something very stupid and say that I was instead thinking in >> >> terms of getting rid of RCU-bh, thus reminding you of its existence. ;-) >> >> >> >> The reason for believing that it is possible to get rid of RCU-bh is the work >> >> that has gone into improving the forward progress of RCU grace periods under >> >> heavy load and in corner-case workloads. >> >> >> > >> > [...] >> > >> >> The other reason for RCU-sched is it has the side effect of waiting >> >> for all in-flight hardware interrupt handlers, NMI handlers, and >> >> preempt-disable regions of code to complete, and last I checked, this side >> >> effect is relied on. In contrast, RCU-preeempt is only guaranteed to wait >> >> on regions of code protected by rcu_read_lock() and rcu_read_unlock(). >> > >> > Instead of only trying to fix the documentation (which is never a bad idea but it >> > is fighting the symptom in this case), I think the first step should be to >> > simplify the RCU read side APIs of RCU from 4 APIs: >> > >> > rcu_read_lock() >> > srcu_read_lock() >> > rcu_read_lock_sched() >> > rcu_read_lock_bh() >> > >> > ... which have ~8 further sub-model variations depending on CONFIG_PREEMPT, >> > CONFIG_PREEMPT_RCU - which is really a crazy design! > > If it is possible to set CONFIG_PREEMPT_RCU differently than CONFIG_PREEMPT, > then that is a bug that I need to fix. > >> > I think we could reduce this to just two APIs with no Kconfig dependencies: >> > >> > rcu_read_lock() >> > rcu_read_lock_preempt_disable() >> > >> > Which would be much, much simpler. > > No argument on the simpler part, at least from an API perspective. > >> > This is how we could do it I think: >> > >> > 1) >> > >> > Getting rid of the _bh() variant should be reasonably simple and involve a >> > treewide replacement of: >> > >> > rcu_read_lock_bh() -> local_bh_disable() >> > rcu_read_unlock_bh() -> local_bh_enable() >> > >> > Correct? > > Assuming that I have done enough forward-progress work on grace periods, yes. > >> > 2) >> > >> > Further reducing the variants is harder, due to this main asymmetry: >> > >> > !PREEMPT_RCU PREEMPT_RCU=y >> > rcu_read_lock_sched(): atomic atomic >> > rcu_read_lock(): atomic preemptible >> > >> > ('atomic' here is meant in the scheduler, non-preemptible sense.) >> > >> > But if we look at the bigger API picture: >> > >> > !PREEMPT_RCU PREEMPT_RCU=y >> > rcu_read_lock(): atomic preemptiblep >> > rcu_read_lock_sched(): atomic atomic >> > srcu_read_lock(): preemptible preemptible >> > >> > Then we could maintain full read side API flexibility by making PREEMPT_RCU=y the >> > only model, merging it with SRCU and using these main read side APIs: >> > >> > rcu_read_lock_preempt_disable((): atomic >> > rcu_read_lock() preemptible > > One issue with merging SRCU into rcu_read_lock() is the general blocking > within SRCU readers. Once merged in, these guys block everyone. We should > focus initially on the non-SRCU variants. > > On the other hand, Linus's suggestion of merging rcu_read_lock_sched() > into rcu_read_lock() just might be feasible. If that really does pan > out, we end up with the following: > > !PREEMPT PREEMPT=y > rcu_read_lock(): atomic preemptible > srcu_read_lock(): preemptible preemptible > > In this model, rcu_read_lock_sched() maps to preempt_disable() and (as > you say above) rcu_read_lock_bh() maps to local_bh_disable(). The way > this works is that in PREEMPT=y kernels, synchronize_rcu() waits not > only for RCU read-side critical sections, but also for regions of code > with preemption disabled. The main caveat seems to be that there be an > assumed point of preemptibility between each interrupt and each softirq > handler, which should be OK. > > There will be some adjustments required for lockdep-RCU, but that should > be reasonably straightforward. > > Seem reasonable? It's good. I hope there is only one global(non-srcu) rcu variant. It does have the trade-off, the grace period will be extended a little in some cases, so will the call_rcu()/synchronze_rcu(). But it simplifies the coding a lot. > >> > It's a _really_ simple and straightforward RCU model, with very obvious semantics >> > all around: >> > >> > - Note how the 'atomic' (non-preempt) variant uses the well-known >> > preempt_disable() name as a postfix to signal its main property. (It's also a >> > bit of a mouthful, which should discourage over-use.) > > My thought is to eliminate the atomic variant entirely. If you want > to disable preemption, interrupts, or whatever, you simply do so. > It might turn out that there are documentation benefits to having a > separate rcu_read_lock_preempt_disable() that maps to preempt_disable() > with lockdep semantics, and if so, that can be provided trivially. > >> > - The read side APIs are really as straightforward as possible: there's no SRCU >> > distinction on the read side, no _bh() distinction and no _sched() distinction. >> > (On -rt all of these would turn into preemptible sections, >> > obviously.) > > Agreed, and both models accomplish that. > >> And it looses the one advantage of srcu_read_lock. That you don't have >> to wait for the entire world. If you actually allow sleeping that is an >> important distinction to have. Or are you proposing that we add the >> equivalent of init_srcu_struct to all of the rcu users? > > I am instead proposing folding rcu_read_lock_bh() and rcu_read_lock_sched() > into rcu_read_lock(), and leaving srcu_read_lock() separate. > >> That rcu_read_lock would need to take an argument about which rcu region >> we are talking about. > > From what I can see, it would be far better to leave SRCU separate. As you > say, it really does have very different semantics. > >> > rcu_read_lock_preempt_disable() would essentially be all the current >> > rcu_read_lock_sched() users (where the _sched() postfix was a confusing misnomer >> > anyway). > > I agree that rcu_read_lock_preempt_disable() is a better name. > We might not need it at all, though. There are only about 20 uses of > rcu_read_lock_sched() in v4.15. ;-) > >> > Wrt. merging SRCU and RCU: this can be done by making PREEMPT_RCU=y the one and >> > only main RCU model and converting all SRCU users to main RCU. This is relatively >> > straightforward to perform, as there are only ~170 SRCU critical sections, versus >> > the 3000+ main RCU critical sections ... >> >> It really sounds like you are talking about adding a requirement that >> everyone update their rcu_read_lock() calls with information about which >> region you are talking about. That seems like quite a bit of work. > > Agreed, merging RCU, RCU-bh, and RCU-sched seems much more straightforward > to me from the viewpoint of both usage and implementation. > >> Doing something implicit when PREEMPT_RCU=y and converting >> "rcu_read_lock()" to "srcu_read_lock(&kernel_srcu_region)" only in that >> case I can see. >> >> Except in very specific circustances I don't think I ever want to run a >> kernel with PREEMPT_RCU the default. All of that real time stuff trades >> off predictability with performance. Having lost enough performance to >> spectre and meltdown I don't think it makes sense for us all to start >> runing predictable^H^H^H^H^H^H^H^H^H^H^H time kernels now. > > Yes, in PREEMPT=n kernels RCU would act exactly as it does today. > >> > AFAICS this should be a possible read side design that keeps correctness, without >> > considering grace period length patterns, i.e. without considering GC latency and >> > scalability aspects. >> > >> > Before we get into ways to solve the latency and scalability aspects of such a >> > simplified RCU model, do you agree with this analysis so far, or have I missed >> > something important wrt. correctness? >> >> RCU region specification. If we routinely allow preemption of rcu >> critical sections for any length of time I can't imagine we will want to >> wait for every possible preempted rcu critical section. >> >> Of course I could see the merge working the other way. Adding the >> debugging we need to find rcu critical secions that are held to long and >> shrinking them so we don't need PREEMPT_RCU at all. > > Again, from what I can see, merging rcu_read_lock(), rcu_read_lock_sched(), > and rcu_read_lock_bh() together should get us to a much better place. > > Make sense, or am I missing something? > > Thanx, Paul >