Received: by 2002:a25:b794:0:0:0:0:0 with SMTP id n20csp1220801ybh; Sat, 3 Aug 2019 20:57:39 -0700 (PDT) X-Google-Smtp-Source: APXvYqz/tiywotcqgo1ghTc2N7lhs6NFHhiBGWOi+xTEKNhbc8FMDz5FgD+skHk1L8I0G+csLcVw X-Received: by 2002:aa7:83ce:: with SMTP id j14mr31692593pfn.55.1564891059668; Sat, 03 Aug 2019 20:57:39 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1564891059; cv=none; d=google.com; s=arc-20160816; b=TlRM02lUx9GcULqWOzTUIQqBziVJfzmvnHLm1SdH/XUpguuPJOLBreeG+WVa9mM0q4 yJFSUBPSctLcXvOyjAxjTz5g1xeof1hM8FJBjgKL4tpwNZmA4ag5gQUL4HNXXtMsVOss tBcViNUbBJdhidy7u1IkBUkEe88MVsW/Hd99wUk++LXZ8QbBKu/GCNGTcqhoxsVl7ffh 6eqxLQTDY/uj1UJRdKew/XVNWwvgZd3lPBZ/u/ODRomXqbx/9WQygVAAGXYfHB1BIeue TZCNxRxl2WUPH5YhA72Yhx6xgdLOulrtGaLHh1GpjhxvoN68KECbP/VPHxMoFUnxGNB2 PkiA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=Eqe0jo6oWvHCtdfAFOdxdsyP1aAj3Ti7HkefwkqRcYc=; b=iRrJppjeULgCLWRjAZw5qYs2cXNh4GrdmJu11fXLvpw8gghJftYgU6/EeCmVDGi+0g eIPKkj7ZzNT53GfsMAHBkZgcqGh82BGsMH4A0LzojuLOWowWmFXz1jVFNuuMN2YIrrBz G1XGT7xLLWkx9a9SSV993eqZ0xhfkarH5PQ4UZS3oNFiqkwYJ35PrihbyabLiWJHNyTE 45/E0PJKfG2baox3kr7IateB6qznWxu3VR36/yY5kspcWXwP3HRNYHQ/ZMFo+qr5q/7b 4knh2nVRjuMlWm/j4/6Z2gy6jQvtkjb59o7yJ10pyO8R4rqD0Fj0KNFeZkRtiS2aPtJI pajQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=PBWAelOc; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id q7si43153071pgk.456.2019.08.03.20.57.24; Sat, 03 Aug 2019 20:57:39 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=PBWAelOc; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728173AbfHCRlb (ORCPT + 99 others); Sat, 3 Aug 2019 13:41:31 -0400 Received: from mail-pg1-f194.google.com ([209.85.215.194]:34438 "EHLO mail-pg1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725886AbfHCRlb (ORCPT ); Sat, 3 Aug 2019 13:41:31 -0400 Received: by mail-pg1-f194.google.com with SMTP id n9so31405445pgc.1 for ; Sat, 03 Aug 2019 10:41:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=Eqe0jo6oWvHCtdfAFOdxdsyP1aAj3Ti7HkefwkqRcYc=; b=PBWAelOc6a31GMyf6OK4PlVyVkGZqs7fvdv3stJDwJr9uOglewfSI3bqkiH0Dahpxr J0cv9TtxiGll49zKmGylA44SuhfGuo3Lbyke7FMULlMCSbP971pKTIg2ZQ9eSxs/lTl8 M6JTc3P/Z9GhT0qheNuV/WnRCrgEe411gAA1g= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=Eqe0jo6oWvHCtdfAFOdxdsyP1aAj3Ti7HkefwkqRcYc=; b=JFNV1zo/L5VkqEqH2/MS305Ei9mIGeaJpAc2ziiLI/7ML88fuY+LVgXOjcW3siwC2r oJJHeqaPLVMGHH2QDs6BuE1KjohJ1K4eFnAem0HF9RL6GcbpgT1CwS0tqodb82IB+81P vtgsvzzh1mvVxUMCzV+BaLYkA3pWMlsSiTtqHKbuAmaySNRI1koCI7jSQ0sEO6gCdXhP 99QWKYEbcwHlY64oPq2qW0gEUrKvCXYb29jWaSV0Zaw1quWObPW/yTQsnSTZ8lucIQOz ZU6gtsf2pO+Ocu3CCks82Ylu8WAZ+FJID6TTfLIOZbtLJU22ApUZb6uQg54fMXONAUQ2 AhXw== X-Gm-Message-State: APjAAAWwWZt2t8kLTFQA+6i9NFlrQx7wi6w5xCevTKuCRO6xycteWZw4 YOSjIkQr1ln0f9jnjifdJJ8= X-Received: by 2002:a62:cec4:: with SMTP id y187mr64899524pfg.84.1564854090339; Sat, 03 Aug 2019 10:41:30 -0700 (PDT) Received: from localhost ([2620:15c:6:12:9c46:e0da:efbf:69cc]) by smtp.gmail.com with ESMTPSA id e124sm126833785pfh.181.2019.08.03.10.41.28 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Sat, 03 Aug 2019 10:41:29 -0700 (PDT) Date: Sat, 3 Aug 2019 13:41:27 -0400 From: Joel Fernandes To: "Paul E. McKenney" Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org, mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com, oleg@redhat.com Subject: Re: [PATCH tip/core/rcu 03/11] rcu/nocb: Provide separate no-CBs grace-period kthreads Message-ID: <20190803174127.GA83025@google.com> References: <20190801225009.GA17155@linux.ibm.com> <20190801225028.18225-3-paulmck@linux.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190801225028.18225-3-paulmck@linux.ibm.com> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Aug 01, 2019 at 03:50:20PM -0700, Paul E. McKenney wrote: > Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads > are divided into groups. The first rcuo kthread to come online in a > given group is that group's leader, and the leader both waits for grace > periods and invokes its CPU's callbacks. The non-leader rcuo kthreads > only invoke callbacks. > > This works well in the real-time/embedded environments for which it was > intended because such environments tend not to generate all that many > callbacks. However, given huge floods of callbacks, it is possible for > the leader kthread to be stuck invoking callbacks while its followers > wait helplessly while their callbacks pile up. This is a good recipe > for an OOM, and rcutorture's new callback-flood capability does generate > such OOMs. > > One strategy would be to wait until such OOMs start happening in > production, but similar OOMs have in fact happened starting in 2018. > It would therefore be wise to take a more proactive approach. I haven't looked much into nocbs/nohz_full stuff (yet). In particular, I did not even know that the rcuo threads do grace period life-cycle management and waiting, I thought only the RCU GP threads did :-/. however, it seems this is a completely separate grace-period management state machine outside of the RCU GP thread right? I was wondering for this patch, could we also just have the rcuo leader handle both callback execution and waking other non-leader threads at the same time? So like, execute few callbacks, then do the wake up of the non-leaders to execute their callbacks, the get back to executing their own callbacks, etc. That way we don't need a separate rcuog thread to wait for grace period, would that not work? If you don't mind could you share with me a kvm.sh command (which has config, boot parameters etc) that can produce the OOM without this patch? I'd like to take a closer look at it. Is there also a short answer for my the RCU GP thread cannot do the job of these new rcuog threads? thanks a lot, - Joel > This commit therefore features per-CPU rcuo kthreads that do nothing > but invoke callbacks. Instead of having one of these kthreads act as > leader, each group has a separate rcog kthread that handles grace periods > for its group. Because these rcuog kthreads do not invoke callbacks, > callback floods on one CPU no longer block callbacks from reaching the > rcuc callback-invocation kthreads on other CPUs. > > This change does introduce additional kthreads, however: > > 1. The number of additional kthreads is about the square root of > the number of CPUs, so that a 4096-CPU system would have only > about 64 additional kthreads. Note that recent changes > decreased the number of rcuo kthreads by a factor of two > (CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so > this still represents a significant improvement on most systems. > > 2. The leading "rcuo" of the rcuog kthreads should allow existing > scripting to affinity these additional kthreads as needed, the > same as for the rcuop and rcuos kthreads. (There are no longer > any rcuob kthreads.) > > 3. A state-machine approach was considered and rejected. Although > this would allow the rcuo kthreads to continue their dual > leader/follower roles, it complicates callback invocation > and makes it more difficult to consolidate rcuo callback > invocation with existing softirq callback invocation. > > The introduction of rcuog kthreads should thus be acceptable. > > Signed-off-by: Paul E. McKenney > --- > kernel/rcu/tree.h | 6 +- > kernel/rcu/tree_plugin.h | 115 +++++++++++++++++++-------------------- > 2 files changed, 61 insertions(+), 60 deletions(-) > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index 32b3348d3a4d..dc3c53cb9608 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -200,8 +200,8 @@ struct rcu_data { > atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */ > struct rcu_head *nocb_cb_head; /* CBs ready to invoke. */ > struct rcu_head **nocb_cb_tail; > - struct swait_queue_head nocb_wq; /* For nocb kthreads to sleep on. */ > - struct task_struct *nocb_cb_kthread; > + struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */ > + struct task_struct *nocb_gp_kthread; > raw_spinlock_t nocb_lock; /* Guard following pair of fields. */ > int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ > struct timer_list nocb_timer; /* Enforce finite deferral. */ > @@ -211,6 +211,8 @@ struct rcu_data { > /* CBs waiting for GP. */ > struct rcu_head **nocb_gp_tail; > bool nocb_gp_sleep; /* Is the nocb GP thread asleep? */ > + struct swait_queue_head nocb_gp_wq; /* For nocb kthreads to sleep on. */ > + struct task_struct *nocb_cb_kthread; > struct rcu_data *nocb_next_cb_rdp; > /* Next rcu_data in wakeup chain. */ > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h > index 5a72700c3a32..c3b6493313ab 100644 > --- a/kernel/rcu/tree_plugin.h > +++ b/kernel/rcu/tree_plugin.h > @@ -1531,7 +1531,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, > struct rcu_data *rdp_leader = rdp->nocb_gp_rdp; > > lockdep_assert_held(&rdp->nocb_lock); > - if (!READ_ONCE(rdp_leader->nocb_cb_kthread)) { > + if (!READ_ONCE(rdp_leader->nocb_gp_kthread)) { > raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); > return; > } > @@ -1541,7 +1541,7 @@ static void __wake_nocb_leader(struct rcu_data *rdp, bool force, > del_timer(&rdp->nocb_timer); > raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); > smp_mb(); /* ->nocb_gp_sleep before swake_up_one(). */ > - swake_up_one(&rdp_leader->nocb_wq); > + swake_up_one(&rdp_leader->nocb_gp_wq); > } else { > raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); > } > @@ -1646,7 +1646,7 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp, > smp_mb__after_atomic(); /* Store *old_rhpp before _wake test. */ > > /* If we are not being polled and there is a kthread, awaken it ... */ > - t = READ_ONCE(rdp->nocb_cb_kthread); > + t = READ_ONCE(rdp->nocb_gp_kthread); > if (rcu_nocb_poll || !t) { > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > TPS("WakeNotPoll")); > @@ -1786,7 +1786,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp) > * No-CBs GP kthreads come here to wait for additional callbacks to show up. > * This function does not return until callbacks appear. > */ > -static void nocb_leader_wait(struct rcu_data *my_rdp) > +static void nocb_gp_wait(struct rcu_data *my_rdp) > { > bool firsttime = true; > unsigned long flags; > @@ -1794,12 +1794,10 @@ static void nocb_leader_wait(struct rcu_data *my_rdp) > struct rcu_data *rdp; > struct rcu_head **tail; > > -wait_again: > - > /* Wait for callbacks to appear. */ > if (!rcu_nocb_poll) { > trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, TPS("Sleep")); > - swait_event_interruptible_exclusive(my_rdp->nocb_wq, > + swait_event_interruptible_exclusive(my_rdp->nocb_gp_wq, > !READ_ONCE(my_rdp->nocb_gp_sleep)); > raw_spin_lock_irqsave(&my_rdp->nocb_lock, flags); > my_rdp->nocb_gp_sleep = true; > @@ -1838,7 +1836,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp) > trace_rcu_nocb_wake(rcu_state.name, my_rdp->cpu, > TPS("WokeEmpty")); > } > - goto wait_again; > + return; > } > > /* Wait for one grace period. */ > @@ -1862,34 +1860,47 @@ static void nocb_leader_wait(struct rcu_data *my_rdp) > rdp->nocb_cb_tail = rdp->nocb_gp_tail; > *tail = rdp->nocb_gp_head; > raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags); > - if (rdp != my_rdp && tail == &rdp->nocb_cb_head) { > + if (tail == &rdp->nocb_cb_head) { > /* List was empty, so wake up the kthread. */ > - swake_up_one(&rdp->nocb_wq); > + swake_up_one(&rdp->nocb_cb_wq); > } > } > +} > > - /* If we (the GP kthreads) don't have CBs, go wait some more. */ > - if (!my_rdp->nocb_cb_head) > - goto wait_again; > +/* > + * No-CBs grace-period-wait kthread. There is one of these per group > + * of CPUs, but only once at least one CPU in that group has come online > + * at least once since boot. This kthread checks for newly posted > + * callbacks from any of the CPUs it is responsible for, waits for a > + * grace period, then awakens all of the rcu_nocb_cb_kthread() instances > + * that then have callback-invocation work to do. > + */ > +static int rcu_nocb_gp_kthread(void *arg) > +{ > + struct rcu_data *rdp = arg; > + > + for (;;) > + nocb_gp_wait(rdp); > + return 0; > } > > /* > * No-CBs CB kthreads come here to wait for additional callbacks to show up. > - * This function does not return until callbacks appear. > + * This function returns true ("keep waiting") until callbacks appear and > + * then false ("stop waiting") when callbacks finally do appear. > */ > -static void nocb_follower_wait(struct rcu_data *rdp) > +static bool nocb_follower_wait(struct rcu_data *rdp) > { > - for (;;) { > - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); > - swait_event_interruptible_exclusive(rdp->nocb_wq, > - READ_ONCE(rdp->nocb_cb_head)); > - if (smp_load_acquire(&rdp->nocb_cb_head)) { > - /* ^^^ Ensure CB invocation follows _head test. */ > - return; > - } > - WARN_ON(signal_pending(current)); > - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty")); > + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FollowerSleep")); > + swait_event_interruptible_exclusive(rdp->nocb_cb_wq, > + READ_ONCE(rdp->nocb_cb_head)); > + if (smp_load_acquire(&rdp->nocb_cb_head)) { /* VVV */ > + /* ^^^ Ensure CB invocation follows _head test. */ > + return false; > } > + WARN_ON(signal_pending(current)); > + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty")); > + return true; > } > > /* > @@ -1899,7 +1910,7 @@ static void nocb_follower_wait(struct rcu_data *rdp) > * have to do quite so many wakeups (as in they only need to wake the > * no-CBs GP kthreads, not the CB kthreads). > */ > -static int rcu_nocb_kthread(void *arg) > +static int rcu_nocb_cb_kthread(void *arg) > { > int c, cl; > unsigned long flags; > @@ -1911,10 +1922,8 @@ static int rcu_nocb_kthread(void *arg) > /* Each pass through this loop invokes one batch of callbacks */ > for (;;) { > /* Wait for callbacks. */ > - if (rdp->nocb_gp_rdp == rdp) > - nocb_leader_wait(rdp); > - else > - nocb_follower_wait(rdp); > + while (nocb_follower_wait(rdp)) > + continue; > > /* Pull the ready-to-invoke callbacks onto local list. */ > raw_spin_lock_irqsave(&rdp->nocb_lock, flags); > @@ -2048,7 +2057,8 @@ void __init rcu_init_nohz(void) > static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) > { > rdp->nocb_tail = &rdp->nocb_head; > - init_swait_queue_head(&rdp->nocb_wq); > + init_swait_queue_head(&rdp->nocb_cb_wq); > + init_swait_queue_head(&rdp->nocb_gp_wq); > rdp->nocb_cb_tail = &rdp->nocb_cb_head; > raw_spin_lock_init(&rdp->nocb_lock); > timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); > @@ -2056,50 +2066,39 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) > > /* > * If the specified CPU is a no-CBs CPU that does not already have its > - * rcuo kthread, spawn it. If the CPUs are brought online out of order, > - * this can require re-organizing the GP-CB relationships. > + * rcuo CB kthread, spawn it. Additionally, if the rcuo GP kthread > + * for this CPU's group has not yet been created, spawn it as well. > */ > static void rcu_spawn_one_nocb_kthread(int cpu) > { > - struct rcu_data *rdp; > - struct rcu_data *rdp_last; > - struct rcu_data *rdp_old_leader; > - struct rcu_data *rdp_spawn = per_cpu_ptr(&rcu_data, cpu); > + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); > + struct rcu_data *rdp_gp; > struct task_struct *t; > > /* > * If this isn't a no-CBs CPU or if it already has an rcuo kthread, > * then nothing to do. > */ > - if (!rcu_is_nocb_cpu(cpu) || rdp_spawn->nocb_cb_kthread) > + if (!rcu_is_nocb_cpu(cpu) || rdp->nocb_cb_kthread) > return; > > /* If we didn't spawn the GP kthread first, reorganize! */ > - rdp_old_leader = rdp_spawn->nocb_gp_rdp; > - if (rdp_old_leader != rdp_spawn && !rdp_old_leader->nocb_cb_kthread) { > - rdp_last = NULL; > - rdp = rdp_old_leader; > - do { > - rdp->nocb_gp_rdp = rdp_spawn; > - if (rdp_last && rdp != rdp_spawn) > - rdp_last->nocb_next_cb_rdp = rdp; > - if (rdp == rdp_spawn) { > - rdp = rdp->nocb_next_cb_rdp; > - } else { > - rdp_last = rdp; > - rdp = rdp->nocb_next_cb_rdp; > - rdp_last->nocb_next_cb_rdp = NULL; > - } > - } while (rdp); > - rdp_spawn->nocb_next_cb_rdp = rdp_old_leader; > + rdp_gp = rdp->nocb_gp_rdp; > + if (!rdp_gp->nocb_gp_kthread) { > + t = kthread_run(rcu_nocb_gp_kthread, rdp_gp, > + "rcuog/%d", rdp_gp->cpu); > + if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo GP kthread, OOM is now expected behavior\n", __func__)) > + return; > + WRITE_ONCE(rdp_gp->nocb_gp_kthread, t); > } > > /* Spawn the kthread for this CPU. */ > - t = kthread_run(rcu_nocb_kthread, rdp_spawn, > + t = kthread_run(rcu_nocb_cb_kthread, rdp, > "rcuo%c/%d", rcu_state.abbr, cpu); > - if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo kthread, OOM is now expected behavior\n", __func__)) > + if (WARN_ONCE(IS_ERR(t), "%s: Could not start rcuo CB kthread, OOM is now expected behavior\n", __func__)) > return; > - WRITE_ONCE(rdp_spawn->nocb_cb_kthread, t); > + WRITE_ONCE(rdp->nocb_cb_kthread, t); > + WRITE_ONCE(rdp->nocb_gp_kthread, rdp_gp->nocb_gp_kthread); > } > > /* > -- > 2.17.1 >