Subject: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

It was possible that sync_rcu_exp_select_cpus() enqueued something on
CPU0 while CPU0 was offline. Such a work item wouldn't be processed
until CPU0 gets back online. This problem was addressed in commit
fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being offline"). I
don't think the issue is fully addressed.

Assume grplo = 0 and grphi = 7 and sync_rcu_exp_select_cpus() is invoked
on CPU1. The preempt_disable() section on CPU1 won't ensure that CPU0
remains online between looking at cpu_online_mask and invoking
queue_work_on() on CPU1.

Use cpus_read_lock() to ensure that `cpu' is not going down between
looking at cpu_online_mask and invoking queue_work_on() and waiting for
its completion. It is added around the loop + flush_work() which is
similar to work_on_cpu_safe() (and we can have multiple jobs running on
NUMA systems).

Fixes: fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being offline")
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/rcu/tree_exp.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 01b6ddeb4f050..a104cf91e6b90 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -479,6 +479,7 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
sync_exp_reset_tree(rsp);
trace_rcu_exp_grace_period(rsp->name, rcu_exp_gp_seq_endval(rsp), TPS("select"));

+ cpus_read_lock();
/* Schedule work for each leaf rcu_node structure. */
rcu_for_each_leaf_node(rsp, rnp) {
rnp->exp_need_flush = false;
@@ -493,13 +494,11 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
continue;
}
INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
- preempt_disable();
cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask);
/* If all offline, queue the work on an unbound CPU. */
if (unlikely(cpu > rnp->grphi))
cpu = WORK_CPU_UNBOUND;
queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
- preempt_enable();
rnp->exp_need_flush = true;
}

@@ -507,6 +506,7 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
rcu_for_each_leaf_node(rsp, rnp)
if (rnp->exp_need_flush)
flush_work(&rnp->rew.rew_work);
+ cpus_read_unlock();
}

static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
--
2.19.0.rc2



2018-09-11 16:06:10

by Paul E. McKenney

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On Mon, Sep 10, 2018 at 03:56:16PM +0200, Sebastian Andrzej Siewior wrote:
> It was possible that sync_rcu_exp_select_cpus() enqueued something on
> CPU0 while CPU0 was offline. Such a work item wouldn't be processed
> until CPU0 gets back online. This problem was addressed in commit
> fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being offline"). I
> don't think the issue is fully addressed.
>
> Assume grplo = 0 and grphi = 7 and sync_rcu_exp_select_cpus() is invoked
> on CPU1. The preempt_disable() section on CPU1 won't ensure that CPU0
> remains online between looking at cpu_online_mask and invoking
> queue_work_on() on CPU1.
>
> Use cpus_read_lock() to ensure that `cpu' is not going down between
> looking at cpu_online_mask and invoking queue_work_on() and waiting for
> its completion. It is added around the loop + flush_work() which is
> similar to work_on_cpu_safe() (and we can have multiple jobs running on
> NUMA systems).

Is this experimental or theoretical? If theoretical, the counter-theory
is that the stop-machine processing prevents any of the cpu_online_mask
bits from changing, though, yes, we would like to get rid of the
stop-machine processing. So either way, yes, the current state could
use some improvement.

But one problem with the patch below is that sync_rcu_exp_select_cpus()
can be called while the cpu_hotplug_lock is write-held. Or is that
somehow OK these days? Assuming not, how about the (untested) patch
below?

Thanx, Paul

> Fixes: fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being offline")
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> kernel/rcu/tree_exp.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 01b6ddeb4f050..a104cf91e6b90 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -479,6 +479,7 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
> sync_exp_reset_tree(rsp);
> trace_rcu_exp_grace_period(rsp->name, rcu_exp_gp_seq_endval(rsp), TPS("select"));
>
> + cpus_read_lock();
> /* Schedule work for each leaf rcu_node structure. */
> rcu_for_each_leaf_node(rsp, rnp) {
> rnp->exp_need_flush = false;
> @@ -493,13 +494,11 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
> continue;
> }
> INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> - preempt_disable();
> cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask);
> /* If all offline, queue the work on an unbound CPU. */
> if (unlikely(cpu > rnp->grphi))
> cpu = WORK_CPU_UNBOUND;
> queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> - preempt_enable();
> rnp->exp_need_flush = true;
> }
>
> @@ -507,6 +506,7 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
> rcu_for_each_leaf_node(rsp, rnp)
> if (rnp->exp_need_flush)
> flush_work(&rnp->rew.rew_work);
> + cpus_read_unlock();
> }
>
> static void synchronize_sched_expedited_wait(struct rcu_state *rsp)

commit 5214cbbfe6a5d6b92c76c4e411a049fe57245d4a
Author: Paul E. McKenney <[email protected]>
Date: Tue Sep 11 08:57:48 2018 -0700

rcu: Stop expedited grace periods from relying on stop-machine

The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption
to prevent the cpu_online_mask from changing. However, this relies on
the stop-machine mechanism in the CPU-hotplug offline code, which is not
desirable (it would be good to someday remove the stop-machine mechanism).

This commit therefore instead uses the relevant leaf rcu_node structure's
->ffmask, which has a bit set for all CPUs that are fully functional.
A given CPU's bit is cleared very early during offline processing by
rcutree_offline_cpu() and set very late during online processing by
rcutree_online_cpu(). Therefore, if a CPU's bit is set in this mask, and
preemption is disabled, we have to be before the synchronize_sched() in
the CPU-hotplug offline code, which means that the CPU is guaranteed to be
workqueue-ready throughout the duration of the enclosing preempt_disable()
region of code.

This also has the side-effect of using WORK_CPU_UNBOUND if all the CPUs for
this leaf rcu_node structure are offline, which is an acceptable difference
in behavior.

Reported-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 8d18c1014e2b..e669ccf3751b 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -450,10 +450,12 @@ static void sync_rcu_exp_select_cpus(smp_call_func_t func)
}
INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
preempt_disable();
- cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask);
+ cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
/* If all offline, queue the work on an unbound CPU. */
- if (unlikely(cpu > rnp->grphi))
+ if (unlikely(cpu > rnp->grphi - rnp->grplo))
cpu = WORK_CPU_UNBOUND;
+ else
+ cpu += rnp->grplo;
queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
preempt_enable();
rnp->exp_need_flush = true;


Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On 2018-09-11 09:05:32 [-0700], Paul E. McKenney wrote:
> On Mon, Sep 10, 2018 at 03:56:16PM +0200, Sebastian Andrzej Siewior wrote:
> > It was possible that sync_rcu_exp_select_cpus() enqueued something on
> > CPU0 while CPU0 was offline. Such a work item wouldn't be processed
> > until CPU0 gets back online. This problem was addressed in commit
> > fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being offline"). I
> don't think the issue is fully addressed.
> >
> > Assume grplo = 0 and grphi = 7 and sync_rcu_exp_select_cpus() is invoked
> > on CPU1. The preempt_disable() section on CPU1 won't ensure that CPU0
> > remains online between looking at cpu_online_mask and invoking
> > queue_work_on() on CPU1.
> >
> > Use cpus_read_lock() to ensure that `cpu' is not going down between
> > looking at cpu_online_mask and invoking queue_work_on() and waiting for
> > its completion. It is added around the loop + flush_work() which is
> > similar to work_on_cpu_safe() (and we can have multiple jobs running on
> > NUMA systems).
>
> Is this experimental or theoretical?

theoretical. I saw that hunk on RT and I can't have queue_work() within
a preempt_disable() section here.

> If theoretical, the counter-theory
> is that the stop-machine processing prevents any of the cpu_online_mask
> bits from changing, though, yes, we would like to get rid of the
> stop-machine processing. So either way, yes, the current state could
> use some improvement.
>
> But one problem with the patch below is that sync_rcu_exp_select_cpus()
> can be called while the cpu_hotplug_lock is write-held. Or is that
> somehow OK these days?

depends. Is it okay to wait until the write-lock is dropped? If it is,
then it is okay. If not…

> Assuming not, how about the (untested) patch
> below?

Doesn't work for me because it is still within the preempt-disable
section :/.
Would it work to use WORK_CPU_UNBOUND? As far as I understand it, the
CPU number does not matter, you just want to spread it across multiple
CPUs in the NUMA case.

> Thanx, Paul
>
> commit 5214cbbfe6a5d6b92c76c4e411a049fe57245d4a
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Sep 11 08:57:48 2018 -0700
>
> rcu: Stop expedited grace periods from relying on stop-machine
>
> The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption
> to prevent the cpu_online_mask from changing. However, this relies on
> the stop-machine mechanism in the CPU-hotplug offline code, which is not
> desirable (it would be good to someday remove the stop-machine mechanism).

not that I tested it, but I still don't understand how a
preempt_disable() section on CPU1 can ensure that CPU3 won't go down. Is
there some code that invokes stop_cpus() for each CPU or so?

Sebastian

2018-09-11 17:03:18

by Paul E. McKenney

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On Tue, Sep 11, 2018 at 06:21:42PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-09-11 09:05:32 [-0700], Paul E. McKenney wrote:
> > On Mon, Sep 10, 2018 at 03:56:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > It was possible that sync_rcu_exp_select_cpus() enqueued something on
> > > CPU0 while CPU0 was offline. Such a work item wouldn't be processed
> > > until CPU0 gets back online. This problem was addressed in commit
> > > fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being offline"). I
> > > don't think the issue is fully addressed.
> > >
> > > Assume grplo = 0 and grphi = 7 and sync_rcu_exp_select_cpus() is invoked
> > > on CPU1. The preempt_disable() section on CPU1 won't ensure that CPU0
> > > remains online between looking at cpu_online_mask and invoking
> > > queue_work_on() on CPU1.
> > >
> > > Use cpus_read_lock() to ensure that `cpu' is not going down between
> > > looking at cpu_online_mask and invoking queue_work_on() and waiting for
> > > its completion. It is added around the loop + flush_work() which is
> > > similar to work_on_cpu_safe() (and we can have multiple jobs running on
> > > NUMA systems).
> >
> > Is this experimental or theoretical?
>
> theoretical. I saw that hunk on RT and I can't have queue_work() within
> a preempt_disable() section here.

OK, I feel better now. ;-)

> > If theoretical, the counter-theory
> > is that the stop-machine processing prevents any of the cpu_online_mask
> > bits from changing, though, yes, we would like to get rid of the
> > stop-machine processing. So either way, yes, the current state could
> > use some improvement.
> >
> > But one problem with the patch below is that sync_rcu_exp_select_cpus()
> > can be called while the cpu_hotplug_lock is write-held. Or is that
> > somehow OK these days?
>
> depends. Is it okay to wait until the write-lock is dropped? If it is,
> then it is okay. If not…

The problem is that people wait for grace periods within CPU hotplug
notifiers, so the lock won't be dropped until long after that notifier
returns.

> > Assuming not, how about the (untested) patch
> > below?
>
> Doesn't work for me because it is still within the preempt-disable
> section :/.
> Would it work to use WORK_CPU_UNBOUND? As far as I understand it, the
> CPU number does not matter, you just want to spread it across multiple
> CPUs in the NUMA case.

Locality is a good thing, but yes, something like this?

if (!IS_ENABLED(CONFIG_PREEMPT_RT) && /* or whatever it is called */
unlikely(cpu > rnp->grphi - rnp->grplo))

Another approach that might be better longer term would be to have a
workqueue interface that treats the specified CPU as a suggestion,
and silently switches to WORK_CPU_UNBOUND if there is any problem
whatsoever with the specified CPU. Tejun, Lai, thoughts?

> > commit 5214cbbfe6a5d6b92c76c4e411a049fe57245d4a
> > Author: Paul E. McKenney <[email protected]>
> > Date: Tue Sep 11 08:57:48 2018 -0700
> >
> > rcu: Stop expedited grace periods from relying on stop-machine
> >
> > The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption
> > to prevent the cpu_online_mask from changing. However, this relies on
> > the stop-machine mechanism in the CPU-hotplug offline code, which is not
> > desirable (it would be good to someday remove the stop-machine mechanism).
>
> not that I tested it, but I still don't understand how a
> preempt_disable() section on CPU1 can ensure that CPU3 won't go down. Is
> there some code that invokes stop_cpus() for each CPU or so?

Yes, there is a stop_machine_cpuslocked() in takedown_cpu().

There is also a synchronize_rcu_mult(call_rcu, call_rcu_sched) in
sched_cpu_deactivate().

The old code relies on the stop_machine_cpuslocked() and my proposed
patch relies on the synchronize_rcu_mult().

Thanx, Paul


2018-09-19 20:55:50

by Tejun Heo

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

Hello,

On Tue, Sep 11, 2018 at 10:02:22AM -0700, Paul E. McKenney wrote:
> > Doesn't work for me because it is still within the preempt-disable
> > section :/.
> > Would it work to use WORK_CPU_UNBOUND? As far as I understand it, the
> > CPU number does not matter, you just want to spread it across multiple
> > CPUs in the NUMA case.
>
> Locality is a good thing, but yes, something like this?
>
> if (!IS_ENABLED(CONFIG_PREEMPT_RT) && /* or whatever it is called */
> unlikely(cpu > rnp->grphi - rnp->grplo))
>
> Another approach that might be better longer term would be to have a
> workqueue interface that treats the specified CPU as a suggestion,
> and silently switches to WORK_CPU_UNBOUND if there is any problem
> whatsoever with the specified CPU. Tejun, Lai, thoughts?

Unbound workqueue is NUMA-affine by default, so using it by default
might not harm anything. Also, per-cpu work items get unbound from
the cpu if the cpu goes down while the work item is running or queued,
so it might just work already.

Thanks.

--
tejun

2018-09-19 22:12:39

by Paul E. McKenney

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On Wed, Sep 19, 2018 at 01:55:21PM -0700, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 11, 2018 at 10:02:22AM -0700, Paul E. McKenney wrote:
> > > Doesn't work for me because it is still within the preempt-disable
> > > section :/.
> > > Would it work to use WORK_CPU_UNBOUND? As far as I understand it, the
> > > CPU number does not matter, you just want to spread it across multiple
> > > CPUs in the NUMA case.
> >
> > Locality is a good thing, but yes, something like this?
> >
> > if (!IS_ENABLED(CONFIG_PREEMPT_RT) && /* or whatever it is called */
> > unlikely(cpu > rnp->grphi - rnp->grplo))
> >
> > Another approach that might be better longer term would be to have a
> > workqueue interface that treats the specified CPU as a suggestion,
> > and silently switches to WORK_CPU_UNBOUND if there is any problem
> > whatsoever with the specified CPU. Tejun, Lai, thoughts?
>
> Unbound workqueue is NUMA-affine by default, so using it by default
> might not harm anything.

OK, so the above workaround would function correctly on -rt, thank you!

Sebastian, is there a counterpart to CONFIG_PREEMPT_RT already in
mainline? If so, I would be happy to make mainline safe for -rt.

> Also, per-cpu work items get unbound from
> the cpu if the cpu goes down while the work item is running or queued,
> so it might just work already.

There are race conditions where the work item is queued at an inopportune
time during the offline process, resulting in a splat, hence the need
for a check with preemption disabled in order to synchronize with the
synchronize_sched() in the offline process.

Thanx, Paul


Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On 2018-09-19 15:11:40 [-0700], Paul E. McKenney wrote:
> On Wed, Sep 19, 2018 at 01:55:21PM -0700, Tejun Heo wrote:
> > Unbound workqueue is NUMA-affine by default, so using it by default
> > might not harm anything.
>
> OK, so the above workaround would function correctly on -rt, thank you!
>
> Sebastian, is there a counterpart to CONFIG_PREEMPT_RT already in
> mainline? If so, I would be happy to make mainline safe for -rt.

Now that I stumbled upon it again, I noticed that I never replied here.
Sorry for that.

Let me summarize:
sync_rcu_exp_select_cpus() used
queue_work_on(rnp->grplo, rcu_par_gp_wq, &rnp->rew.rew_work);

which was changed in commit fcc6354365015 ("rcu: Make expedited GPs
handle CPU 0 being offline"). The commit claims that this is needed in
case CPU0 is offline, so it tries to find another CPU, starting with the
possibly offline CPU. It might cross to another NUMA node, but that is
not really a problem; it just tries to remain on the "local NUMA node".

After that commit, the code invokes queue_work_on() within a
preempt_disable() section because it can't use cpus_read_lock() to
ensure that the CPU won't go offline in the middle of testing (and
preempt_disable() does the trick).
For RT reasons I would like to get rid of queue_work_on() within the
preempt_disable() section.
Tejun said that enqueueing an item on an unbound workqueue is NUMA
affine.

I figured out that enqueueing an item on an offline CPU is not a problem:
it will pop up on a "random" CPU, which means it will be carried out
asap and will not wait until the CPU gets back online. Therefore I don't
understand commit fcc6354365015.

May I suggest the following change? It will enqueue the work item on
the first CPU on the NUMA node and the "unbound" part of the work queue
ensures that a CPU of that NUMA node will perform the work.
This is mostly a revert of fcc6354365015 except that the workqueue
gained the WQ_UNBOUND flag.

------------------>8----

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0b760c1369f76..94d6c50c4e796 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4162,7 +4162,7 @@ void __init rcu_init(void)
/* Create workqueue for expedited GPs and for Tree SRCU. */
rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);
WARN_ON(!rcu_gp_wq);
- rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
+ rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
WARN_ON(!rcu_par_gp_wq);
}

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 0b2c2ad69629c..a0486414edb40 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -472,7 +472,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
smp_call_func_t func)
{
- int cpu;
struct rcu_node *rnp;

trace_rcu_exp_grace_period(rsp->name, rcu_exp_gp_seq_endval(rsp), TPS("reset"));
@@ -494,13 +493,7 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
continue;
}
INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
- preempt_disable();
- cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask);
- /* If all offline, queue the work on an unbound CPU. */
- if (unlikely(cpu > rnp->grphi))
- cpu = WORK_CPU_UNBOUND;
- queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
- preempt_enable();
+ queue_work_on(rnp->grplo, rcu_par_gp_wq, &rnp->rew.rew_work);
rnp->exp_need_flush = true;
}


>
> Thanx, Paul

Sebastian

2018-10-13 13:48:55

by Paul E. McKenney

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On Fri, Oct 12, 2018 at 08:41:15PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-09-19 15:11:40 [-0700], Paul E. McKenney wrote:
> > On Wed, Sep 19, 2018 at 01:55:21PM -0700, Tejun Heo wrote:
> > > Unbound workqueue is NUMA-affine by default, so using it by default
> > > might not harm anything.
> >
> > OK, so the above workaround would function correctly on -rt, thank you!
> >
> > Sebastian, is there a counterpart to CONFIG_PREEMPT_RT already in
> > mainline? If so, I would be happy to make mainline safe for -rt.
>
> Now that I stumbled upon it again, I noticed that I never replied here.
> Sorry for that.
>
> Let me summarize:
> sync_rcu_exp_select_cpus() used
> queue_work_on(rnp->grplo, rcu_par_gp_wq, &rnp->rew.rew_work);
>
> which was changed in commit fcc6354365015 ("rcu: Make expedited GPs
> handle CPU 0 being offline"). The commit claims that this is needed in
> case CPU0 is offline, so it tries to find another CPU, starting with the
> possibly offline CPU. It might cross to another NUMA node, but that is
> not really a problem; it just tries to remain on the "local NUMA node".
>
> After that commit, the code invokes queue_work_on() within a
> preempt_disable() section because it can't use cpus_read_lock() to
> ensure that the CPU won't go offline in the middle of testing (and
> preempt_disable() does the trick).
> For RT reasons I would like to get rid of queue_work_on() within the
> preempt_disable() section.
> Tejun said that enqueueing an item on an unbound workqueue is NUMA
> affine.
>
> I figured out that enqueueing an item on an offline CPU is not a problem:
> it will pop up on a "random" CPU, which means it will be carried out
> asap and will not wait until the CPU gets back online. Therefore I don't
> understand commit fcc6354365015.
>
> May I suggest the following change? It will enqueue the work item on
> the first CPU on the NUMA node and the "unbound" part of the work queue
> ensures that a CPU of that NUMA node will perform the work.
> This is mostly a revert of fcc6354365015 except that the workqueue
> gained the WQ_UNBOUND flag.

My concern would be that it would queue it by default for the current
CPU, which would serialize the processing, losing the concurrency of
grace-period initialization. But that was a long time ago, and perhaps
workqueues have changed. So, have you tried using rcuperf to test the
update performance on a large system?

If this change does not impact performance on an rcuperf test, why not
send me a formal patch with Signed-off-by and commit log (including
performance testing results)? I will then apply it, it will be exposed
to 0day and eventually -next testing, and if no problems arise, it will go
to mainline, perhaps as soon as the merge window after the upcoming one.

Fair enough?

Thanx, Paul

> ------------------>8----
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0b760c1369f76..94d6c50c4e796 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4162,7 +4162,7 @@ void __init rcu_init(void)
> /* Create workqueue for expedited GPs and for Tree SRCU. */
> rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);
> WARN_ON(!rcu_gp_wq);
> - rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
> + rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
> WARN_ON(!rcu_par_gp_wq);
> }
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 0b2c2ad69629c..a0486414edb40 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -472,7 +472,6 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
> smp_call_func_t func)
> {
> - int cpu;
> struct rcu_node *rnp;
>
> trace_rcu_exp_grace_period(rsp->name, rcu_exp_gp_seq_endval(rsp), TPS("reset"));
> @@ -494,13 +493,7 @@ static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
> continue;
> }
> INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> - preempt_disable();
> - cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask);
> - /* If all offline, queue the work on an unbound CPU. */
> - if (unlikely(cpu > rnp->grphi))
> - cpu = WORK_CPU_UNBOUND;
> - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> - preempt_enable();
> + queue_work_on(rnp->grplo, rcu_par_gp_wq, &rnp->rew.rew_work);
> rnp->exp_need_flush = true;
> }
>
>
> >
> > Thanx, Paul
>
> Sebastian
>


Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On 2018-10-13 06:48:13 [-0700], Paul E. McKenney wrote:
>
> My concern would be that it would queue it by default for the current
> CPU, which would serialize the processing, losing the concurrency of
> grace-period initialization. But that was a long time ago, and perhaps
> workqueues have changed.

but the code here is always using the first CPU of a NUMA node or did I
miss something?

> So, have you tried using rcuperf to test the
> update performance on a large system?

Something like this:
| tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuperf --configs TREE
| ----Start batch 1: Mon Oct 15 12:46:13 CEST 2018
| TREE 142: Starting build. …
| …
| Average grace-period duration: 19952.7 microseconds
| Minimum grace-period duration: 9004.51
| 50th percentile grace-period duration: 19998.3
| 90th percentile grace-period duration: 24994.4
| 99th percentile grace-period duration: 30002.1
| Maximum grace-period duration: 42998.2
| Grace periods: 6560 Batches: 209 Ratio: 31.3876
| Computed from rcuperf printk output.
| Completed in 27 vs. 1800
|
| Average grace-period duration: 18700 microseconds
| Minimum grace-period duration: 7069.2
| 50th percentile grace-period duration: 18987.5
| 90th percentile grace-period duration: 22997
| 99th percentile grace-period duration: 28944.7
| Maximum grace-period duration: 36994.5
| Grace periods: 6551 Batches: 209 Ratio: 31.3445
| Computed from rcuperf printk output.
| Completed in 27 vs. 1800

two runs with the patch followed by two runs without it:
| Average grace-period duration: 19423.3 microseconds
| Minimum grace-period duration: 8006.93
| 50th percentile grace-period duration: 19002.8
| 90th percentile grace-period duration: 23997.5
| 99th percentile grace-period duration: 29995.7
| Maximum grace-period duration: 37997.9
| Grace periods: 6526 Batches: 208 Ratio: 31.375
| Computed from rcuperf printk output.
| Completed in 27 vs. 1800
|
| Average grace-period duration: 18822.4 microseconds
| Minimum grace-period duration: 8348.15
| 50th percentile grace-period duration: 18996.9
| 90th percentile grace-period duration: 23000
| 99th percentile grace-period duration: 27999.5
| Maximum grace-period duration: 39001.9
| Grace periods: 6540 Batches: 209 Ratio: 31.2919
| Computed from rcuperf printk output.
| Completed in 27 vs. 1800

I think the difference might come from cpufreq on the host. But in general,
is this what you have been asking for, or did you want to see something
on the host (or an additional argument to the script)?

> If this change does not impact performance on an rcuperf test, why not
> send me a formal patch with Signed-off-by and commit log (including
> performance testing results)? I will then apply it, it will be exposed
> to 0day and eventually -next testing, and if no problems arise, it will go
> to mainline, perhaps as soon as the merge window after the upcoming one.
>
> Fair enough?
sure.

> Thanx, Paul
>

Sebastian

2018-10-15 15:01:27

by Boqun Feng

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

Hi, Sebastian

On Mon, Oct 15, 2018 at 04:42:17PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-10-13 06:48:13 [-0700], Paul E. McKenney wrote:
> >
> > My concern would be that it would queue it by default for the current
> > CPU, which would serialize the processing, losing the concurrency of
> > grace-period initialization. But that was a long time ago, and perhaps
> > workqueues have changed.
>
> but the code here is always using the first CPU of a NUMA node or did I
> miss something?
>

The thing is, the original approach picks one CPU per *RCU* node to
run the grace-period work, but with your proposal, if an RCU node is
smaller than a NUMA node (i.e. has fewer CPUs), we could end up with two
grace-period work items running on one CPU. I think that's Paul's concern.

Regards,
Boqun

> > So, have you tried using rcuperf to test the
> > update performance on a large system?
>
> Something like this:
> | tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuperf --configs TREE
> | ----Start batch 1: Mon Oct 15 12:46:13 CEST 2018
> | TREE 142: Starting build. …
> | …
> | Average grace-period duration: 19952.7 microseconds
> | Minimum grace-period duration: 9004.51
> | 50th percentile grace-period duration: 19998.3
> | 90th percentile grace-period duration: 24994.4
> | 99th percentile grace-period duration: 30002.1
> | Maximum grace-period duration: 42998.2
> | Grace periods: 6560 Batches: 209 Ratio: 31.3876
> | Computed from rcuperf printk output.
> | Completed in 27 vs. 1800
> |
> | Average grace-period duration: 18700 microseconds
> | Minimum grace-period duration: 7069.2
> | 50th percentile grace-period duration: 18987.5
> | 90th percentile grace-period duration: 22997
> | 99th percentile grace-period duration: 28944.7
> | Maximum grace-period duration: 36994.5
> | Grace periods: 6551 Batches: 209 Ratio: 31.3445
> | Computed from rcuperf printk output.
> | Completed in 27 vs. 1800
>
> two runs with the patch followed by two runs without it:
> | Average grace-period duration: 19423.3 microseconds
> | Minimum grace-period duration: 8006.93
> | 50th percentile grace-period duration: 19002.8
> | 90th percentile grace-period duration: 23997.5
> | 99th percentile grace-period duration: 29995.7
> | Maximum grace-period duration: 37997.9
> | Grace periods: 6526 Batches: 208 Ratio: 31.375
> | Computed from rcuperf printk output.
> | Completed in 27 vs. 1800
> |
> | Average grace-period duration: 18822.4 microseconds
> | Minimum grace-period duration: 8348.15
> | 50th percentile grace-period duration: 18996.9
> | 90th percentile grace-period duration: 23000
> | 99th percentile grace-period duration: 27999.5
> | Maximum grace-period duration: 39001.9
> | Grace periods: 6540 Batches: 209 Ratio: 31.2919
> | Computed from rcuperf printk output.
> | Completed in 27 vs. 1800
>
> I think the difference might come from cpufreq on the host. But in general,
> is this what you have been asking for, or did you want to see something
> on the host (or an additional argument to the script)?
>
> > If this change does not impact performance on an rcuperf test, why not
> > send me a formal patch with Signed-off-by and commit log (including
> > performance testing results)? I will then apply it, it will be exposed
> > to 0day and eventually -next testing, and if no problems arise, it will go
> > to mainline, perhaps as soon as the merge window after the upcoming one.
> >
> > Fair enough?
> sure.
>
> > Thanx, Paul
> >
>
> Sebastian


Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On 2018-10-15 23:07:15 [+0800], Boqun Feng wrote:
> Hi, Sebastian
Hi Boqun,

> On Mon, Oct 15, 2018 at 04:42:17PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2018-10-13 06:48:13 [-0700], Paul E. McKenney wrote:
> > >
> > > My concern would be that it would queue it by default for the current
> > > CPU, which would serialize the processing, losing the concurrency of
> > > grace-period initialization. But that was a long time ago, and perhaps
> > > workqueues have changed.
> >
> > but the code here is always using the first CPU of a NUMA node or did I
> > miss something?
> >
>
> The thing is the original way is to pick one CPU for a *RCU* node to
> run the grace-period work, but with your proposal, if a RCU node is
> smaller than a NUMA node (having fewer CPUs), we could end up having two
> grace-period works running on one CPU. I think that's Paul's concern.

Ah. Okay. From what I observed, the RCU nodes and NUMA nodes were 1:1
here. Noted.
Given that I can enqueue a work item on an offlined CPU I don't see why
commit fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being
offline") should make a difference. Any objections to just revert it?

> Regards,
> Boqun

Sebastian
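[The race Sebastian refers to can be sketched as follows; this is simplified from the sync_rcu_exp_select_cpus() hunks in the patch at the top of the thread, a kernel-code sketch rather than standalone compilable code:]

```c
/* Before the patch: preempt_disable() only pins the CPU we are
 * currently running on; it does not keep a *different* CPU online. */
preempt_disable();
cpu = cpumask_next(rnp->grplo - 1, cpu_online_mask);
if (unlikely(cpu > rnp->grphi))
	cpu = WORK_CPU_UNBOUND;
/* e.g. CPU0 (picked above while running on CPU1) may go offline here */
queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
preempt_enable();

/* With the patch: the hotplug read lock holds off CPU removal for the
 * whole loop plus the flush_work() that waits for the queued items,
 * similar to what work_on_cpu_safe() does. */
cpus_read_lock();
/* ... rcu_for_each_leaf_node() loop doing the queue_work_on() calls ... */
cpus_read_unlock();
```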

2018-10-15 15:28:02

by Boqun Feng

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On Mon, Oct 15, 2018 at 05:09:03PM +0200, Sebastian Andrzej Siewior wrote:
> On 2018-10-15 23:07:15 [+0800], Boqun Feng wrote:
> > Hi, Sebastian
> Hi Boqun,
>
> > On Mon, Oct 15, 2018 at 04:42:17PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2018-10-13 06:48:13 [-0700], Paul E. McKenney wrote:
> > > >
> > > > My concern would be that it would queue it by default for the current
> > > > CPU, which would serialize the processing, losing the concurrency of
> > > > grace-period initialization. But that was a long time ago, and perhaps
> > > > workqueues have changed.
> > >
> > > but the code here is always using the first CPU of a NUMA node or did I
> > > miss something?
> > >
> >
> > The thing is the original way is to pick one CPU for a *RCU* node to
> > run the grace-period work, but with your proposal, if a RCU node is
> > smaller than a NUMA node (having fewer CPUs), we could end up having two
> > grace-period works running on one CPU. I think that's Paul's concern.
>
> Ah. Okay. From what I observed, the RCU nodes and NUMA nodes were 1:1
> here. Noted.

Ok, in that case, there should be no significant performance difference.

> Given that I can enqueue a work item on an offlined CPU I don't see why
> commit fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being
> offline") should make a difference. Any objections to just revert it?

Well, that commit is trying to avoid queueing a work item on an offlined
CPU, because according to the workqueue API, it is the user's
responsibility to make sure the CPU is online when a work item is
enqueued. So there is a difference ;-)
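[As a sketch of the hazard behind the API rule Boqun describes, based on the problem statement in the original patch (a per-CPU work item queued on an offline CPU sits unprocessed until that CPU returns); kernel-code fragment, not standalone compilable:]

```c
/* Hazard: if 'cpu' is offline when this runs, the work item lands on
 * that CPU's per-CPU worklist and is not processed until the CPU
 * comes back online -- stalling the expedited grace period that long. */
queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);

/* Hence the API rule: either guarantee 'cpu' stays online (e.g. by
 * holding cpus_read_lock() across the check and the queueing, as the
 * patch does), or fall back to an unbound worker: */
queue_work_on(WORK_CPU_UNBOUND, rcu_par_gp_wq, &rnp->rew.rew_work);
```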

But I don't have any objection to reverting it in favor of your
proposal, since yours is simpler and more straightforward, and doesn't
perform worse if NUMA nodes and RCU nodes have a one-to-one
correspondence.

Besides, I think even if we observe some performance difference in the
future, the best way to solve that is to make workqueues support a more
fine-grained affinity group than a NUMA node.

Regards,
Boqun

>
> > Regards,
> > Boqun
>
> Sebastian



2018-10-15 16:37:32

by Paul E. McKenney

Subject: Re: [PATCH] rcu: Use cpus_read_lock() while looking at cpu_online_mask

On Mon, Oct 15, 2018 at 11:33:48PM +0800, Boqun Feng wrote:
> On Mon, Oct 15, 2018 at 05:09:03PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2018-10-15 23:07:15 [+0800], Boqun Feng wrote:
> > > Hi, Sebastian
> > Hi Boqun,
> >
> > > On Mon, Oct 15, 2018 at 04:42:17PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2018-10-13 06:48:13 [-0700], Paul E. McKenney wrote:
> > > > >
> > > > > My concern would be that it would queue it by default for the current
> > > > > CPU, which would serialize the processing, losing the concurrency of
> > > > > grace-period initialization. But that was a long time ago, and perhaps
> > > > > workqueues have changed.
> > > >
> > > > but the code here is always using the first CPU of a NUMA node or did I
> > > > miss something?
> > > >
> > >
> > > The thing is the original way is to pick one CPU for a *RCU* node to
> > > run the grace-period work, but with your proposal, if a RCU node is
> > > smaller than a NUMA node (having fewer CPUs), we could end up having two
> > > grace-period works running on one CPU. I think that's Paul's concern.
> >
> > Ah. Okay. From what I observed, the RCU nodes and NUMA nodes were 1:1
> > here. Noted.
>
> Ok, in that case, there should be no significant performance difference.
>
> > Given that I can enqueue a work item on an offlined CPU I don't see why
> > commit fcc6354365015 ("rcu: Make expedited GPs handle CPU 0 being
> > offline") should make a difference. Any objections to just revert it?
>
> Well, that commit is trying to avoid queueing a work item on an offlined
> CPU, because according to the workqueue API, it is the user's
> responsibility to make sure the CPU is online when a work item is
> enqueued. So there is a difference ;-)
>
> But I don't have any objection to reverting it in favor of your
> proposal, since yours is simpler and more straightforward, and doesn't
> perform worse if NUMA nodes and RCU nodes have a one-to-one
> correspondence.
>
> Besides, I think even if we observe some performance difference in the
> future, the best way to solve that is to make workqueues support a more
> fine-grained affinity group than a NUMA node.

Please keep in mind that there are computer systems out there with NUMA
topologies that are completely incompatible with RCU's rcu_node tree
structure. According to Rik van Riel (CCed), there are even systems
out there where CPU 0 is on socket 0, CPU 1 on socket 1, and so on,
round-robining across the sockets.

The system that convinced me that the additional constraints on the
workqueue's CPU were needed had CPUs 0-7 on one socket and CPUs 8-15 on
the second, with CPUs 0-15 sharing the same leaf rcu_node structure.
Unfortunately, I no longer have useful access to this system (dead disk
drive, apparently).

I am not saying that Sebastian's approach is bad, rather that it does
need to be tested on a variety of systems.

Thanx, Paul

> Regards,
> Boqun
>
> >
> > > Regards,
> > > Boqun
> >
> > Sebastian