McKenney" To: Kalesh Singh Cc: surenb@google.com, kernel-team@android.com, Tejun Heo , Tim Murray , Wei Wang , Kyle Lin , Chunwei Lu , Lulu Wang , Frederic Weisbecker , Neeraj Upadhyay , Josh Triplett , Steven Rostedt , Mathieu Desnoyers , Lai Jiangshan , Joel Fernandes , rcu@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] RCU: Move expedited grace period (GP) work to RT kthread_worker Message-ID: <20220402003021.GI4285@paulmck-ThinkPad-P17-Gen-1> Reply-To: paulmck@kernel.org References: <20220401225740.1984689-1-kaleshsingh@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220401225740.1984689-1-kaleshsingh@google.com> X-Spam-Status: No, score=-2.3 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Apr 01, 2022 at 03:57:40PM -0700, Kalesh Singh wrote: > It was noticed that enabling CONFIG_RCU_BOOST did not help RCU > performance because the workqueues that runs expedited GP work are > subject to scheduling delays. This patch moves the expedited GP > work items to RT kthread_worker. > > The results were evaluated on arm64 Android devices (6GB ram) running > 5.10 kernel, and caturing trace data during critical user journeys. > > The table below compares the time synchronize_rcu_expedited() is blocked: > > ---------------------------------------------------------------------- > | | Using WQ | Using kthead_worker | Diff | > ----------------------------------------------------------------------- > | Max duration (ns) | 372766967 | 2329671 | -99.38% | > ----------------------------------------------------------------------- > | Avg duration (ns) | 2746353.16 | 151242.311 | -94.49% | > ---------------------------------------------------------------------- Impressive reductions in latency!!! 151 microsecond average and two millisecond maximum, not bad at all. ;-) We will need number of samples and some sort of information on variation, for example, a range of maximums and a standard deviation to go with the average. These might seem silly given that you have up to two orders of magnitude effect sizes, but we should still set a good example. I have queued this on an experimental -rcu branch fastexp.2022.04.01a for testing purposes, but I cannot send it upstream in this form. However, I believe that a few adjustments can make it into something that can be upstreamed, maybe even as early as the next (v5.19) merge window. As in I probably will need to revert it after running tests, but let's see how it goes. Please see below for questions and comments. Also, as usual, I could not resist the urge to do a bit of wordsmithing, most especially inserting commas into the big numbers. Please see the end of this email for the variant queued. Oh, and I do have Uladzislau's CONFIG_RCU_EXP_CPU_STALL_TIMEOUT commit already queued, which will be helpful in finding overly long RCU read-side critical sections. Good stuff!!! Thanx, Paul > Cc: "Paul E. 
McKenney" > Cc: Tejun Heo > Reported-by: Tim Murray > Reported-by: Wei Wang > Tested-by: Kyle Lin > Tested-by: Chunwei Lu > Tested-by: Lulu Wang > Signed-off-by: Kalesh Singh > --- > kernel/rcu/rcu.h | 3 ++- > kernel/rcu/tree.c | 41 +++++++++++++++++++++++++++++++++++++---- > kernel/rcu/tree.h | 3 ++- > kernel/rcu/tree_exp.h | 35 +++++++++++++++-------------------- > 4 files changed, 56 insertions(+), 26 deletions(-) > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 24b5f2c2de87..13d2b74bf19f 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -534,7 +534,8 @@ int rcu_get_gp_kthreads_prio(void); > void rcu_fwd_progress_check(unsigned long j); > void rcu_force_quiescent_state(void); > extern struct workqueue_struct *rcu_gp_wq; > -extern struct workqueue_struct *rcu_par_gp_wq; > +extern struct kthread_worker *rcu_exp_gp_kworker; > +extern struct kthread_worker *rcu_exp_par_gp_kworker; > #endif /* #else #ifdef CONFIG_TINY_RCU */ > > #ifdef CONFIG_RCU_NOCB_CPU > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index a4b8189455d5..bd5e672ffa5a 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -4471,6 +4471,33 @@ static int rcu_pm_notify(struct notifier_block *self, > return NOTIFY_OK; > } > > +struct kthread_worker *rcu_exp_gp_kworker; > +struct kthread_worker *rcu_exp_par_gp_kworker; > + > +static void __init rcu_start_exp_gp_kworkers(void) > +{ > + const char *par_gp_kworker_name = "rcu_exp_par_gp_kthread_worker"; > + const char *gp_kworker_name = "rcu_exp_gp_kthread_worker"; > + struct sched_param param = { .sched_priority = kthread_prio }; > + > + rcu_exp_gp_kworker = kthread_create_worker(0, gp_kworker_name); > + if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) { > + pr_err("Failed to create %s!\n", gp_kworker_name); > + return; > + } > + > + rcu_exp_par_gp_kworker = kthread_create_worker(0, par_gp_kworker_name); > + if (IS_ERR_OR_NULL(rcu_exp_par_gp_kworker)) { > + pr_err("Failed to create %s!\n", par_gp_kworker_name); > + kthread_destroy_worker(rcu_exp_gp_kworker); > + return; > + } > + > + sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, ¶m); > + sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO, > + ¶m); What happens when kthread_prio is zero, as it is by default on systems built with CONFIG_RCU_BOOST=n? My guess is that we need to use normal workqueues when CONFIG_RCU_BOOST=n. The real-time guys usually completely disable expedited grace periods except during boot, so they might also want normal workqueues as well. They build with CONFIG_PREEMPT_RT=y. > +} > + > /* > * Spawn the kthreads that handle RCU's grace periods. > */ > @@ -4500,6 +4527,10 @@ static int __init rcu_spawn_gp_kthread(void) > rcu_spawn_nocb_kthreads(); > rcu_spawn_boost_kthreads(); > rcu_spawn_core_kthreads(); > + > + /* Create kthread worker for expedited GPs */ > + rcu_start_exp_gp_kworkers(); > + > return 0; > } > early_initcall(rcu_spawn_gp_kthread); > @@ -4745,7 +4776,6 @@ static void __init rcu_dump_rcu_node_tree(void) > } > > struct workqueue_struct *rcu_gp_wq; > -struct workqueue_struct *rcu_par_gp_wq; > > static void __init kfree_rcu_batch_init(void) > { > @@ -4808,11 +4838,14 @@ void __init rcu_init(void) > rcutree_online_cpu(cpu); > } > > - /* Create workqueue for Tree SRCU and for expedited GPs. */ > + /* > + * Create workqueue for Tree SRCU. > + * > + * Expedited GPs use RT kthread_worker. 
> + * See: rcu_start_exp_gp_kworkers() > + */ > rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0); > WARN_ON(!rcu_gp_wq); > - rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0); > - WARN_ON(!rcu_par_gp_wq); > > /* Fill in default value for rcutree.qovld boot parameter. */ > /* -After- the rcu_node ->lock fields are initialized! */ > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index 926673ebe355..0193d67a706a 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -10,6 +10,7 @@ > */ > > #include > +#include > #include > #include > #include > @@ -23,7 +24,7 @@ > /* Communicate arguments to a workqueue handler. */ > struct rcu_exp_work { > unsigned long rew_s; > - struct work_struct rew_work; > + struct kthread_work rew_work; > }; > > /* RCU's kthread states for tracing. */ > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 60197ea24ceb..f5f3722c0a74 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -334,7 +334,7 @@ static bool exp_funnel_lock(unsigned long s) > * Select the CPUs within the specified rcu_node that the upcoming > * expedited grace period needs to wait for. > */ > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) > +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp) > { > int cpu; > unsigned long flags; > @@ -423,7 +423,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp) > */ > static void sync_rcu_exp_select_cpus(void) > { > - int cpu; > struct rcu_node *rnp; > > trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset")); > @@ -435,28 +434,27 @@ static void sync_rcu_exp_select_cpus(void) > rnp->exp_need_flush = false; > if (!READ_ONCE(rnp->expmask)) > continue; /* Avoid early boot non-existent wq. */ > - if (!READ_ONCE(rcu_par_gp_wq) || > + if (!READ_ONCE(rcu_exp_par_gp_kworker) || > rcu_scheduler_active != RCU_SCHEDULER_RUNNING || > rcu_is_last_leaf_node(rnp)) { > - /* No workqueues yet or last leaf, do direct call. */ > + /* kthread worker not started yet or last leaf, do direct call. */ > sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work); > continue; > } > - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); > - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1); > - /* If all offline, queue the work on an unbound CPU. */ > - if (unlikely(cpu > rnp->grphi - rnp->grplo)) > - cpu = WORK_CPU_UNBOUND; > - else > - cpu += rnp->grplo; > - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work); > + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus); > + /* > + * Use rcu_exp_par_gp_kworker, because flushing a work item from > + * another work item on the same kthread worker can result in > + * deadlock. > + */ > + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work); It looks like kthread_queue_work() uses a single lock per worker. This is likely to be problematic on large systems having multiple leaf rcu_node structures, both due to lock contention and due to reduced concurrency. For a point of reference, I have gotten RCU bug reports for issues on systems with 256 leaf rcu_node structures. One approach would be to create a new Kconfig option to select the kthread_queue implementation on small systems. This Kconfig option could choose its default based on system size, perhaps with 32 or more CPUs continuing to use workqueues. Or it could choose its default based on CONFIG_ANDROID. Another approach would be to have a separate kthread_worker for each leaf rcu_node structure. 
>  		rnp->exp_need_flush = true;
>  	}
>
>  	/* Wait for workqueue jobs (if any) to complete. */
>  	rcu_for_each_leaf_node(rnp)
>  		if (rnp->exp_need_flush)
> -			flush_work(&rnp->rew.rew_work);
> +			kthread_flush_work(&rnp->rew.rew_work);
>  }
>
>  /*
> @@ -625,7 +623,7 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
>  /*
>   * Work-queue handler to drive an expedited grace period forward.
>   */
> -static void wait_rcu_exp_gp(struct work_struct *wp)
> +static void wait_rcu_exp_gp(struct kthread_work *wp)
>  {
>  	struct rcu_exp_work *rewp;
>
> @@ -848,20 +846,17 @@ void synchronize_rcu_expedited(void)
>  	} else {
>  		/* Marshall arguments & schedule the expedited grace period. */
>  		rew.rew_s = s;
> -		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> -		queue_work(rcu_gp_wq, &rew.rew_work);
> +		kthread_init_work(&rew.rew_work, wait_rcu_exp_gp);
> +		kthread_queue_work(rcu_exp_gp_kworker, &rew.rew_work);
>  	}
>
>  	/* Wait for expedited grace period to complete. */
>  	rnp = rcu_get_root();
>  	wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>  		   sync_exp_work_done(s));
> -	smp_mb(); /* Workqueue actions happen before return. */
> +	smp_mb(); /* kthread actions happen before return. */
>
>  	/* Let the next expedited grace period start. */
>  	mutex_unlock(&rcu_state.exp_mutex);
> -
> -	if (likely(!boottime))
> -		destroy_work_on_stack(&rew.rew_work);
>  }
>  EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
>
> base-commit: 7a3ecddc571cc3294e5d6bb5948ff2b0cfa12735
> --
> 2.35.1.1094.g7c7d902a7c-goog

------------------------------------------------------------------------

commit d9f3e7d671416fdf5b61f094765754269b652db0
Author: Kalesh Singh
Date:   Fri Apr 1 15:57:40 2022 -0700

    EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

    Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
    latency because its workqueues run at SCHED_OTHER, and thus can be
    delayed by normal processes.  This commit avoids these delays by moving
    the expedited GP work items to a real-time-priority kthread_worker.

    The results were evaluated on arm64 Android devices (6GB ram) running
    a 5.10 kernel, capturing trace data in critical user-level code.  The
    table below shows the resulting order-of-magnitude improvements in
    synchronize_rcu_expedited() latency:

    ------------------------------------------------------------------
    |                    |  workqueues  | kthread_worker |   Diff    |
    ------------------------------------------------------------------
    | Max duration (ns)  | 372,766,967  |   2,329,671    |  -99.38%  |
    ------------------------------------------------------------------
    | Avg duration (ns)  |   2,746,353  |     151,242    |  -94.49%  |
    ------------------------------------------------------------------

    Cc: "Paul E. McKenney"
    Cc: Tejun Heo
    Reported-by: Tim Murray
    Reported-by: Wei Wang
    Tested-by: Kyle Lin
    Tested-by: Chunwei Lu
    Tested-by: Lulu Wang
    Signed-off-by: Kalesh Singh
    Signed-off-by: Paul E. McKenney
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 23c877fa220e..a7f6c6665b01 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -566,6 +566,8 @@ void rcu_fwd_progress_check(unsigned long j);
 void rcu_force_quiescent_state(void);
 extern struct workqueue_struct *rcu_gp_wq;
 extern struct workqueue_struct *rcu_par_gp_wq;
+extern struct kthread_worker *rcu_exp_gp_kworker;
+extern struct kthread_worker *rcu_exp_par_gp_kworker;
 void rcu_gp_slow_register(atomic_t *rgssp);
 void rcu_gp_slow_unregister(atomic_t *rgssp);
 #endif /* #else #ifdef CONFIG_TINY_RCU */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 542cb514567f..2c7e714d34e9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4559,6 +4559,33 @@ static int rcu_pm_notify(struct notifier_block *self,
 	return NOTIFY_OK;
 }
 
+struct kthread_worker *rcu_exp_gp_kworker;
+struct kthread_worker *rcu_exp_par_gp_kworker;
+
+static void __init rcu_start_exp_gp_kworkers(void)
+{
+	const char *par_gp_kworker_name = "rcu_exp_par_gp_kthread_worker";
+	const char *gp_kworker_name = "rcu_exp_gp_kthread_worker";
+	struct sched_param param = { .sched_priority = kthread_prio };
+
+	rcu_exp_gp_kworker = kthread_create_worker(0, gp_kworker_name);
+	if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) {
+		pr_err("Failed to create %s!\n", gp_kworker_name);
+		return;
+	}
+
+	rcu_exp_par_gp_kworker = kthread_create_worker(0, par_gp_kworker_name);
+	if (IS_ERR_OR_NULL(rcu_exp_par_gp_kworker)) {
+		pr_err("Failed to create %s!\n", par_gp_kworker_name);
+		kthread_destroy_worker(rcu_exp_gp_kworker);
+		return;
+	}
+
+	sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, &param);
+	sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO,
+				   &param);
+}
+
 /*
  * Spawn the kthreads that handle RCU's grace periods.
  */
@@ -4595,6 +4622,10 @@ static int __init rcu_spawn_gp_kthread(void)
 	rcu_spawn_cpu_nocb_kthread(smp_processor_id());
 	rcu_spawn_one_boost_kthread(rdp->mynode);
 	rcu_spawn_core_kthreads();
+
+	/* Create kthread worker for expedited GPs */
+	rcu_start_exp_gp_kworkers();
+
 	return 0;
 }
 early_initcall(rcu_spawn_gp_kthread);
@@ -4843,7 +4874,6 @@ static void __init rcu_dump_rcu_node_tree(void)
 }
 
 struct workqueue_struct *rcu_gp_wq;
-struct workqueue_struct *rcu_par_gp_wq;
 
 static void __init kfree_rcu_batch_init(void)
 {
@@ -4905,11 +4935,14 @@ void __init rcu_init(void)
 		rcu_cpu_starting(cpu);
 		rcutree_online_cpu(cpu);
 
-	/* Create workqueue for Tree SRCU and for expedited GPs. */
+	/*
+	 * Create workqueue for Tree SRCU.
+	 *
+	 * Expedited GPs use RT kthread_worker.
+	 * See: rcu_start_exp_gp_kworkers()
+	 */
 	rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);
 	WARN_ON(!rcu_gp_wq);
-	rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
-	WARN_ON(!rcu_par_gp_wq);
 
 	/* Fill in default value for rcutree.qovld boot parameter. */
 	/* -After- the rcu_node ->lock fields are initialized! */
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b8d07bf92d29..b0afcc36a896 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -10,6 +10,7 @@
  */
 
 #include <linux/cache.h>
+#include <linux/kthread.h>
 #include <linux/spinlock.h>
 #include <linux/rtmutex.h>
@@ -23,7 +24,7 @@
 /* Communicate arguments to a workqueue handler. */
 struct rcu_exp_work {
 	unsigned long rew_s;
-	struct work_struct rew_work;
+	struct kthread_work rew_work;
 };
 
 /* RCU's kthread states for tracing. */
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index bd47fce0e08c..cdb299e80cff 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -334,7 +334,7 @@ static bool exp_funnel_lock(unsigned long s)
  * Select the CPUs within the specified rcu_node that the upcoming
  * expedited grace period needs to wait for.
  */
-static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
+static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
 {
 	int cpu;
 	unsigned long flags;
@@ -423,7 +423,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
  */
 static void sync_rcu_exp_select_cpus(void)
 {
-	int cpu;
 	struct rcu_node *rnp;
 
 	trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
@@ -435,28 +434,27 @@ static void sync_rcu_exp_select_cpus(void)
 		rnp->exp_need_flush = false;
 		if (!READ_ONCE(rnp->expmask))
 			continue; /* Avoid early boot non-existent wq. */
-		if (!READ_ONCE(rcu_par_gp_wq) ||
+		if (!READ_ONCE(rcu_exp_par_gp_kworker) ||
 		    rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
 		    rcu_is_last_leaf_node(rnp)) {
-			/* No workqueues yet or last leaf, do direct call. */
+			/* kthread worker not started yet or last leaf, do direct call. */
 			sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
 			continue;
 		}
-		INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
-		cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
-		/* If all offline, queue the work on an unbound CPU. */
-		if (unlikely(cpu > rnp->grphi - rnp->grplo))
-			cpu = WORK_CPU_UNBOUND;
-		else
-			cpu += rnp->grplo;
-		queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
+		kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
+		/*
+		 * Use rcu_exp_par_gp_kworker, because flushing a work item from
+		 * another work item on the same kthread worker can result in
+		 * deadlock.
+		 */
+		kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
 		rnp->exp_need_flush = true;
 	}
 
 	/* Wait for workqueue jobs (if any) to complete. */
 	rcu_for_each_leaf_node(rnp)
 		if (rnp->exp_need_flush)
-			flush_work(&rnp->rew.rew_work);
+			kthread_flush_work(&rnp->rew.rew_work);
 }
 
 /*
@@ -625,7 +623,7 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
 /*
  * Work-queue handler to drive an expedited grace period forward.
  */
-static void wait_rcu_exp_gp(struct work_struct *wp)
+static void wait_rcu_exp_gp(struct kthread_work *wp)
 {
 	struct rcu_exp_work *rewp;
 
@@ -835,22 +833,18 @@ static void __synchronize_rcu_expedited(bool polling)
 	} else {
 		/* Marshall arguments & schedule the expedited grace period. */
 		rew.rew_s = s;
-		INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
-		queue_work(rcu_gp_wq, &rew.rew_work);
+		kthread_init_work(&rew.rew_work, wait_rcu_exp_gp);
+		kthread_queue_work(rcu_exp_gp_kworker, &rew.rew_work);
 	}
 
 	/* Wait for expedited grace period to complete. */
 	rnp = rcu_get_root();
 	wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
 		   sync_exp_work_done(s));
-	smp_mb(); /* Workqueue actions happen before return. */
+	smp_mb(); /* kthread actions happen before return. */
 
 	/* Let the next expedited grace period start. */
 	mutex_unlock(&rcu_state.exp_mutex);
-
-	if (likely(!boottime))
-		destroy_work_on_stack(&rew.rew_work);
-
 }
 
/**