Received: by 2002:a05:6a10:2726:0:0:0:0 with SMTP id ib38csp3534274pxb; Mon, 4 Apr 2022 20:15:35 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxvU8Y6dTJbsG8LTHI0i0DqR/gKj3RCwWFNqtTk7r/xCLDYR3UOe6iCxPoktnep3e/kx21S X-Received: by 2002:a63:f147:0:b0:399:2b1f:7aff with SMTP id o7-20020a63f147000000b003992b1f7affmr1105970pgk.341.1649128535294; Mon, 04 Apr 2022 20:15:35 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1649128535; cv=none; d=google.com; s=arc-20160816; b=Puvdu0oWmaLGLxnMQsx+eVQCEMA3qygly87OEh5bG0DBVRCh4tlIaZlxmnfY63l7Q1 cPMpF9Px9KxWGO0o57Cv/4RexjMd/3C+CrGv5PcIaubYCeHX5cuoE+MELEImt1dPP/nR lA4S6Kg95KQss17enXeFNKZBmbed7m9ZcctLT/dwL+yEKFyZ7nO5rr9I2KoZBt0502lp XIMwkT6XboJMlVYJ+87md0dTu+RFH44+cek/2OBl92naYkAqBAMMk+JFC3VugagZ18EU yywcjCwxNp1nJCwTub0q8ACsGqZJOTy2moDRp3cqv4xb3QGyZ/M5ID3xDqN+yHceUg1P O5wA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:reply-to:message-id:subject:cc:to:from:date :dkim-signature; bh=yorgjSX0pF9CEPCLNlCyRRDtyvCf6H3SiFL07JUbIXY=; b=ncyeXn2T1XJor5Yr1WcHbkvxjXxkp+q6qOdotPm+hP+FVT+m335m0udt2Y3QHKIbBm JOO7yBjmXi+G+w7TKA0JLi+WHZXXKQVzksTf1aK6L26uLmIbgRuiVqF45V+lggu/RzOT Y4BMDtgZ9PtWxKZ185b8KJDhEf8EUry/10j7/OdjTUyRN/BIMLrIrIvcCrGeP4WkpApw ia73aoGIRlqwNTaFsB/2Be4UJuCV9+uYLjunV89C31A7aqtX5qeA3wbkqiMUq+LYxZ8j aJsW9OciG2+7nNhb+i6ikubeupsLe1moyzH/lr4Pn43v8rr2l8cUUIlrFoXC2T+RjcH0 d2cw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=cctDP4nf; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id y3-20020a17090a86c300b001c70228d117si867729pjv.55.2022.04.04.20.15.34 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Apr 2022 20:15:35 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=cctDP4nf; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id C13981AD1E5; Mon, 4 Apr 2022 18:15:40 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245381AbiDBAcP (ORCPT + 99 others); Fri, 1 Apr 2022 20:32:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43732 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240464AbiDBAcO (ORCPT ); Fri, 1 Apr 2022 20:32:14 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2A2451A635A; Fri, 1 Apr 2022 17:30:23 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id A58C561C15; Sat, 2 Apr 2022 00:30:22 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 06852C2BBE4; Sat, 2 Apr 2022 00:30:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1648859422; bh=Qjn4dFR4+GP9AoL4/Dzm7663OTMV3h+c7ZnLAc2omIo=; h=Date:From:To:Cc:Subject:Reply-To:References:In-Reply-To:From; b=cctDP4nfyxq81bfUdHvC+akA55vibAFciiEGspyiOY2E7OJ1+8regxbotKa+Ys97V cfjHIBXwvpr0pTFy/A9wJYvTCCQob5MYptiTgjNki2+iI3AFrXLdw0nR3D5iQvwzcn nai+2LWDbjGdHz+H50el2AqcKp8lm9yAsMqTeQn5yin6GDPYcHdFbJgZ7E3dcRtCqm RnrUIKhxSztlgdiFa17wJGu+sCsOihHu7T/gOF1rmyU+WJWn1ifgTRY/9/trg0Wz5u 8PEbdNW0RtheRRIP2Vomdxe8kRG/BPA4JWhJoiuz6SrCEc2F7CcIF7jh0eJCzjKgrJ OyH7Q55CQLa1g== Received: by paulmck-ThinkPad-P17-Gen-1.home (Postfix, from userid 1000) id 9532D5C0ADE; Fri, 1 Apr 2022 17:30:21 -0700 (PDT) Date: Fri, 1 Apr 2022 17:30:21 -0700 From: "Paul E. McKenney" To: Kalesh Singh Cc: surenb@google.com, kernel-team@android.com, Tejun Heo , Tim Murray , Wei Wang , Kyle Lin , Chunwei Lu , Lulu Wang , Frederic Weisbecker , Neeraj Upadhyay , Josh Triplett , Steven Rostedt , Mathieu Desnoyers , Lai Jiangshan , Joel Fernandes , rcu@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] RCU: Move expedited grace period (GP) work to RT kthread_worker Message-ID: <20220402003021.GI4285@paulmck-ThinkPad-P17-Gen-1> Reply-To: paulmck@kernel.org References: <20220401225740.1984689-1-kaleshsingh@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220401225740.1984689-1-kaleshsingh@google.com> X-Spam-Status: No, score=-2.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Apr 01, 2022 at 03:57:40PM -0700, Kalesh Singh wrote: > It was noticed that enabling CONFIG_RCU_BOOST did not help RCU > performance because the workqueues that runs expedited GP work are > subject to scheduling delays. This patch moves the expedited GP > work items to RT kthread_worker. > > The results were evaluated on arm64 Android devices (6GB ram) running > 5.10 kernel, and caturing trace data during critical user journeys. > > The table below compares the time synchronize_rcu_expedited() is blocked: > > ---------------------------------------------------------------------- > | | Using WQ | Using kthead_worker | Diff | > ----------------------------------------------------------------------- > | Max duration (ns) | 372766967 | 2329671 | -99.38% | > ----------------------------------------------------------------------- > | Avg duration (ns) | 2746353.16 | 151242.311 | -94.49% | > ---------------------------------------------------------------------- Impressive reductions in latency!!! 151 microsecond average and two millisecond maximum, not bad at all. ;-) We will need number of samples and some sort of information on variation, for example, a range of maximums and a standard deviation to go with the average. These might seem silly given that you have up to two orders of magnitude effect sizes, but we should still set a good example. I have queued this on an experimental -rcu branch fastexp.2022.04.01a for testing purposes, but I cannot send it upstream in this form. However, I believe that a few adjustments can make it into something that can be upstreamed, maybe even as early as the next (v5.19) merge window. As in I probably will need to revert it after running tests, but let's see how it goes. Please see below for questions and comments. Also, as usual, I could not resist the urge to do a bit of wordsmithing, most especially inserting commas into the big numbers. Please see the end of this email for the variant queued. Oh, and I do have Uladzislau's CONFIG_RCU_EXP_CPU_STALL_TIMEOUT commit already queued, which will be helpful in finding overly long RCU read-side critical sections. Good stuff!!! Thanx, Paul > Cc: "Paul E. McKenney" > Cc: Tejun Heo > Reported-by: Tim Murray > Reported-by: Wei Wang > Tested-by: Kyle Lin > Tested-by: Chunwei Lu > Tested-by: Lulu Wang > Signed-off-by: Kalesh Singh > --- > kernel/rcu/rcu.h | 3 ++- > kernel/rcu/tree.c | 41 +++++++++++++++++++++++++++++++++++++---- > kernel/rcu/tree.h | 3 ++- > kernel/rcu/tree_exp.h | 35 +++++++++++++++-------------------- > 4 files changed, 56 insertions(+), 26 deletions(-) > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 24b5f2c2de87..13d2b74bf19f 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -534,7 +534,8 @@ int rcu_get_gp_kthreads_prio(void); > void rcu_fwd_progress_check(unsigned long j); > void rcu_force_quiescent_state(void); > extern struct workqueue_struct *rcu_gp_wq; > -extern struct workqueue_struct *rcu_par_gp_wq; > +extern struct kthread_worker *rcu_exp_gp_kworker; > +extern struct kthread_worker *rcu_exp_par_gp_kworker; > #endif /* #else #ifdef CONFIG_TINY_RCU */ > > #ifdef CONFIG_RCU_NOCB_CPU > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index a4b8189455d5..bd5e672ffa5a 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4471,6 +4471,33 @@ static int rcu_pm_notify(struct notifier_block *self, > return NOTIFY_OK; > } > > +struct kthread_worker *rcu_exp_gp_kworker; > +struct kthread_worker *rcu_exp_par_gp_kworker; > + > +static void __init rcu_start_exp_gp_kworkers(void) > +{ > + const char *par_gp_kworker_name = "rcu_exp_par_gp_kthread_worker"; > + const char *gp_kworker_name = "rcu_exp_gp_kthread_worker"; > + struct sched_param param = { .sched_priority = kthread_prio }; > + > + rcu_exp_gp_kworker = kthread_create_worker(0, gp_kworker_name); > + if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) { > + pr_err("Failed to create %s!\n", gp_kworker_name); > + return; > + } > + > + rcu_exp_par_gp_kworker = kthread_create_worker(0, par_gp_kworker_name); > + if (IS_ERR_OR_NULL(rcu_exp_par_gp_kworker)) { > + pr_err("Failed to create %s!\n", par_gp_kworker_name); > + kthread_destroy_worker(rcu_exp_gp_kworker); > + return; > + } > + > + sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, ¶m); > + sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO, > + ¶m); What happens when kthread_prio is zero, as it is by default on systems built with CONFIG_RCU_BOOST=n? My guess is that we need to use normal workqueues when CONFIG_RCU_BOOST=n. The real-time guys usually completely disable expedited grace periods except during boot, so they might also want normal workqueues as well. They build with CONFIG_PREEMPT_RT=y. > +} > + > /* > * Spawn the kthreads that handle RCU's grace periods. > */ > @@ -4500,6 +4527,10 @@ static int __init rcu_spawn_gp_kthread(void) > rcu_spawn_nocb_kthreads(); > rcu_spawn_boost_kthreads(); > rcu_spawn_core_kthreads(); > + > + /* Create kthread worker for expedited GPs */ > + rcu_start_exp_gp_kworkers(); > + > return 0; > } > early_initcall(rcu_spawn_gp_kthread); > @@ -4745,7 +4776,6 @@ static void __init rcu_dump_rcu_node_tree(void) > } > > struct workqueue_struct *rcu_gp_wq; > -struct workqueue_struct *rcu_par_gp_wq; > > static void __init kfree_rcu_batch_init(void) > { > @@ -4808,11 +4838,14 @@ void __init rcu_init(void) > rcutree_online_cpu(cpu); > } > > - /* Create workqueue for Tree SRCU and for expedited GPs. */ > + /* > + * Create workqueue for Tree SRCU. > + * > + * Expedited GPs use RT kthread_worker. > + * See: rcu_start_exp_gp_kworkers() > + */ > rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0); > WARN_ON(!rcu_gp_wq); > - rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0); > - WARN_ON(!rcu_par_gp_wq); > > /* Fill in default value for rcutree.qovld boot parameter. */ > /* -After- the rcu_node ->lock fields are initialized! */ > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index 926673ebe355..0193d67a706a 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -10,6 +10,7 @@ > */ > > #include > +#include > #include > #include > #include > @@ -23,7 +24,7 @@ > /* Communicate arguments to a workqueue handler. */ > struct rcu_exp_work { > unsigned long rew_s; > - struct work_struct rew_work; > + struct kthread_work rew_work; > }; > > /* RCU's kthread states for tracing. */ > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 60197ea24ceb..f5f3722c0a74 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -334,7 +334,7 @@ static bool exp_funnel_lock(unsigned long s) > * Select the CPUs within the specified rcu_node that the upcoming > * expedited grace period needs to wait for. > */ > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) > +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp) > { > int cpu; > unsigned long flags; > @@ -423,7 +423,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) > */ > static void sync_rcu_exp_select_cpus(void) > { > - int cpu; > struct rcu_node *rnp; > > trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset")); > @@ -435,28 +434,27 @@ static void sync_rcu_exp_select_cpus(void) > rnp->exp_need_flush = false; > if (!READ_ONCE(rnp->expmask)) > continue; /* Avoid early boot non-existent wq. */ > - if (!READ_ONCE(rcu_par_gp_wq) || > + if (!READ_ONCE(rcu_exp_par_gp_kworker) || > rcu_scheduler_active != RCU_SCHEDULER_RUNNING || > rcu_is_last_leaf_node(rnp)) { > - /* No workqueues yet or last leaf, do direct call. */ > + /* kthread worker not started yet or last leaf, do direct call. */ > sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work); > continue; > } > - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); > - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1); > - /* If all offline, queue the work on an unbound CPU. */ > - if (unlikely(cpu > rnp->grphi - rnp->grplo)) > - cpu = WORK_CPU_UNBOUND; > - else > - cpu += rnp->grplo; > - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work); > + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); > + /* > + * Use rcu_exp_par_gp_kworker, because flushing a work item from > + * another work item on the same kthread worker can result in > + * deadlock. > + */ > + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work); It looks like kthread_queue_work() uses a single lock per worker. This is likely to be problematic on large systems having multiple leaf rcu_node structures, both due to lock contention and due to reduced concurrency. For a point of reference, I have gotten RCU bug reports for issues on systems with 256 leaf rcu_node structures. One approach would be to create a new Kconfig option to select the kthread_queue implementation on small systems. This Kconfig option could choose its default based on system size, perhaps with 32 or more CPUs continuing to use workqueues. Or it could choose its default based on CONFIG_ANDROID. Another approach would be to have a separate kthread_worker for each leaf rcu_node structure. The first approach would be safer in that it would avoid changing behavior for systems not needing ultrafast expedited grace-period latencies. It might also make more sense for systems built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y. I don't see a reason to make this switchable at boot time, but others might be able to provide good reasons. Either way, good encapsulation will of course be necessary. > rnp->exp_need_flush = true; > } > > /* Wait for workqueue jobs (if any) to complete. */ > rcu_for_each_leaf_node(rnp) > if (rnp->exp_need_flush) > - flush_work(&rnp->rew.rew_work); > + kthread_flush_work(&rnp->rew.rew_work); > } > > /* > @@ -625,7 +623,7 @@ static void rcu_exp_sel_wait_wake(unsigned long s) > /* > * Work-queue handler to drive an expedited grace period forward. > */ > -static void wait_rcu_exp_gp(struct work_struct *wp) > +static void wait_rcu_exp_gp(struct kthread_work *wp) > { > struct rcu_exp_work *rewp; > > @@ -848,20 +846,17 @@ void synchronize_rcu_expedited(void) > } else { > /* Marshall arguments & schedule the expedited grace period. */ > rew.rew_s = s; > - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp); > - queue_work(rcu_gp_wq, &rew.rew_work); > + kthread_init_work(&rew.rew_work, wait_rcu_exp_gp); > + kthread_queue_work(rcu_exp_gp_kworker, &rew.rew_work); > } > > /* Wait for expedited grace period to complete. */ > rnp = rcu_get_root(); > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3], > sync_exp_work_done(s)); > - smp_mb(); /* Workqueue actions happen before return. */ > + smp_mb(); /* kthread actions happen before return. */ > > /* Let the next expedited grace period start. */ > mutex_unlock(&rcu_state.exp_mutex); > - > - if (likely(!boottime)) > - destroy_work_on_stack(&rew.rew_work); > } > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); > > base-commit: 7a3ecddc571cc3294e5d6bb5948ff2b0cfa12735 > -- > 2.35.1.1094.g7c7d902a7c-goog ------------------------------------------------------------------------ commit d9f3e7d671416fdf5b61f094765754269b652db0 Author: Kalesh Singh Date: Fri Apr 1 15:57:40 2022 -0700 EXP rcu: Move expedited grace period (GP) work to RT kthread_worker Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period latency because its workqueues run at SCHED_OTHER, and thus can be delayed by normal processes. This commit avoids these delays by moving the expedited GP work items to a real-time-priority kthread_worker. The results were evaluated on arm64 Android devices (6GB ram) running 5.10 kernel, and capturing trace data in critical user-level code. The table below shows the resulting order-of-magnitude improvements in synchronize_rcu_expedited() latency: --------------------------------------------------------------- | | workqueues | kthead_worker | Diff | ---------------------------------------------------------------- | Max duration (ns) | 372,766,967 | 2,329,671 | -99.38% | ---------------------------------------------------------------- | Avg duration (ns) | 2,746,353 | 151,242 | -94.49% | --------------------------------------------------------------- Cc: "Paul E. McKenney" Cc: Tejun Heo Reported-by: Tim Murray Reported-by: Wei Wang Tested-by: Kyle Lin Tested-by: Chunwei Lu Tested-by: Lulu Wang Signed-off-by: Kalesh Singh Signed-off-by: Paul E. McKenney diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 23c877fa220e..a7f6c6665b01 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -566,6 +566,8 @@ void rcu_fwd_progress_check(unsigned long j); void rcu_force_quiescent_state(void); extern struct workqueue_struct *rcu_gp_wq; extern struct workqueue_struct *rcu_par_gp_wq; +extern struct kthread_worker *rcu_exp_gp_kworker; +extern struct kthread_worker *rcu_exp_par_gp_kworker; void rcu_gp_slow_register(atomic_t *rgssp); void rcu_gp_slow_unregister(atomic_t *rgssp); #endif /* #else #ifdef CONFIG_TINY_RCU */ diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 542cb514567f..2c7e714d34e9 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4559,6 +4559,33 @@ static int rcu_pm_notify(struct notifier_block *self, return NOTIFY_OK; } +struct kthread_worker *rcu_exp_gp_kworker; +struct kthread_worker *rcu_exp_par_gp_kworker; + +static void __init rcu_start_exp_gp_kworkers(void) +{ + const char *par_gp_kworker_name = "rcu_exp_par_gp_kthread_worker"; + const char *gp_kworker_name = "rcu_exp_gp_kthread_worker"; + struct sched_param param = { .sched_priority = kthread_prio }; + + rcu_exp_gp_kworker = kthread_create_worker(0, gp_kworker_name); + if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) { + pr_err("Failed to create %s!\n", gp_kworker_name); + return; + } + + rcu_exp_par_gp_kworker = kthread_create_worker(0, par_gp_kworker_name); + if (IS_ERR_OR_NULL(rcu_exp_par_gp_kworker)) { + pr_err("Failed to create %s!\n", par_gp_kworker_name); + kthread_destroy_worker(rcu_exp_gp_kworker); + return; + } + + sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, ¶m); + sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO, + ¶m); +} + /* * Spawn the kthreads that handle RCU's grace periods. */ @@ -4595,6 +4622,10 @@ static int __init rcu_spawn_gp_kthread(void) rcu_spawn_cpu_nocb_kthread(smp_processor_id()); rcu_spawn_one_boost_kthread(rdp->mynode); rcu_spawn_core_kthreads(); + + /* Create kthread worker for expedited GPs */ + rcu_start_exp_gp_kworkers(); + return 0; } early_initcall(rcu_spawn_gp_kthread); @@ -4843,7 +4874,6 @@ static void __init rcu_dump_rcu_node_tree(void) } struct workqueue_struct *rcu_gp_wq; -struct workqueue_struct *rcu_par_gp_wq; static void __init kfree_rcu_batch_init(void) { @@ -4905,11 +4935,14 @@ void __init rcu_init(void) rcu_cpu_starting(cpu); rcutree_online_cpu(cpu); - /* Create workqueue for Tree SRCU and for expedited GPs. */ + /* + * Create workqueue for Tree SRCU. + * + * Expedited GPs use RT kthread_worker. + * See: rcu_start_exp_gp_kworkers() + */ rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0); WARN_ON(!rcu_gp_wq); - rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0); - WARN_ON(!rcu_par_gp_wq); /* Fill in default value for rcutree.qovld boot parameter. */ /* -After- the rcu_node ->lock fields are initialized! */ diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index b8d07bf92d29..b0afcc36a896 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -10,6 +10,7 @@ */ #include +#include #include #include #include @@ -23,7 +24,7 @@ /* Communicate arguments to a workqueue handler. */ struct rcu_exp_work { unsigned long rew_s; - struct work_struct rew_work; + struct kthread_work rew_work; }; /* RCU's kthread states for tracing. */ diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index bd47fce0e08c..cdb299e80cff 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -334,7 +334,7 @@ static bool exp_funnel_lock(unsigned long s) * Select the CPUs within the specified rcu_node that the upcoming * expedited grace period needs to wait for. */ -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp) { int cpu; unsigned long flags; @@ -423,7 +423,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) */ static void sync_rcu_exp_select_cpus(void) { - int cpu; struct rcu_node *rnp; trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset")); @@ -435,28 +434,27 @@ static void sync_rcu_exp_select_cpus(void) rnp->exp_need_flush = false; if (!READ_ONCE(rnp->expmask)) continue; /* Avoid early boot non-existent wq. */ - if (!READ_ONCE(rcu_par_gp_wq) || + if (!READ_ONCE(rcu_exp_par_gp_kworker) || rcu_scheduler_active != RCU_SCHEDULER_RUNNING || rcu_is_last_leaf_node(rnp)) { - /* No workqueues yet or last leaf, do direct call. */ + /* kthread worker not started yet or last leaf, do direct call. */ sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work); continue; } - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1); - /* If all offline, queue the work on an unbound CPU. */ - if (unlikely(cpu > rnp->grphi - rnp->grplo)) - cpu = WORK_CPU_UNBOUND; - else - cpu += rnp->grplo; - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work); + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); + /* + * Use rcu_exp_par_gp_kworker, because flushing a work item from + * another work item on the same kthread worker can result in + * deadlock. + */ + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work); rnp->exp_need_flush = true; } /* Wait for workqueue jobs (if any) to complete. */ rcu_for_each_leaf_node(rnp) if (rnp->exp_need_flush) - flush_work(&rnp->rew.rew_work); + kthread_flush_work(&rnp->rew.rew_work); } /* @@ -625,7 +623,7 @@ static void rcu_exp_sel_wait_wake(unsigned long s) /* * Work-queue handler to drive an expedited grace period forward. */ -static void wait_rcu_exp_gp(struct work_struct *wp) +static void wait_rcu_exp_gp(struct kthread_work *wp) { struct rcu_exp_work *rewp; @@ -835,22 +833,18 @@ static void __synchronize_rcu_expedited(bool polling) } else { /* Marshall arguments & schedule the expedited grace period. */ rew.rew_s = s; - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp); - queue_work(rcu_gp_wq, &rew.rew_work); + kthread_init_work(&rew.rew_work, wait_rcu_exp_gp); + kthread_queue_work(rcu_exp_gp_kworker, &rew.rew_work); } /* Wait for expedited grace period to complete. */ rnp = rcu_get_root(); wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3], sync_exp_work_done(s)); - smp_mb(); /* Workqueue actions happen before return. */ + smp_mb(); /* kthread actions happen before return. */ /* Let the next expedited grace period start. */ mutex_unlock(&rcu_state.exp_mutex); - - if (likely(!boottime)) - destroy_work_on_stack(&rew.rew_work); - } /**