2022-04-01 13:22:15

by Eric Dumazet

Subject: [BUG] rcu-tasks : should take care of sparse cpu masks

Hi Paul

It seems you assume that per-cpu pointers for arbitrary indexes (< nr_cpu_ids) are valid.

What do you think of the (untested) following patch ?

Thanks.

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 99cf3a13954cfb17828fbbeeb884f11614a526a9..df3785be4022e903d9682dd403464aa9927aa5c2 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -273,13 +273,17 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
bool needadjust = false;
bool needwake;
struct rcu_tasks_percpu *rtpcp;
+ int ideal_cpu, chosen_cpu;

rhp->next = NULL;
rhp->func = func;
local_irq_save(flags);
rcu_read_lock();
- rtpcp = per_cpu_ptr(rtp->rtpcpu,
- smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
+
+ ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
+ chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
+
+ rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
j = jiffies;
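
For clarity: cpumask_next(n, mask) returns the first CPU in the mask that is
strictly greater than n, so passing ideal_cpu - 1 yields ideal_cpu itself
whenever it is present and otherwise falls forward to the next present CPU.
A quick sketch with a made-up layout:

/* Hypothetical layout: cpu_online_mask = { 0, 4, 5, 6, 7 }. */
ideal_cpu = 2;                           /* a hole in the mask */
chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
/* cpumask_next(1, ...) returns 4, so per_cpu_ptr(rtp->rtpcpu, chosen_cpu)
 * always refers to a CPU that really exists; for ideal_cpu == 0,
 * cpumask_next(-1, ...) returns the first online CPU. */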


2022-04-01 14:29:22

by Paul E. McKenney

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> >
> > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > IDs to shift down to zero. The grace-period kthread is allowed to run
> > where it likes. The callback lists are protected by locking, even in
> > the case of local access, so this should be safe.
> >
> > Or am I missing your point?
> >
>
> In fact I have been looking at this code, because we bisected a
> regression back to this patch:
>
> 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> per-grace-period sleep for RCU Tasks Trace
>
> It is very possible the regression comes because the RCU task thread
> is using more cpu cycles, from 'CPU 0' where our system daemons are
> pinned.

Heh! I did express that concern when creating that patch, but was
assured that the latency was much more important.

Yes, that patch most definitely increases CPU utilization during RCU Tasks
Trace grace periods. If you can tolerate longer grace-period latencies,
it might be worth toning it down a bit. The ask was for about twice
the latency I achieved in my initial attempt, and I made the mistake of
forwarding that attempt out for testing. They liked the shorter latency
very much, and objected strenuously to the thought that I might detune
it back to the latency that they originally asked for. ;-)

But I can easily provide the means to detune it through use of a kernel
boot parameter or some such, if that would help.

> But I could not spot where the RCU task kthread is forced to run on CPU 0.

I never did intend this kthread be bound anywhere. RCU's policy is
that any binding of its kthreads is the responsibility of the sysadm,
be that carbon-based or otherwise.

But this kthread is spawned early enough that only CPU 0 is online,
so maybe the question is not "what is binding it to CPU 0?" but rather
"why isn't something kicking it off of CPU 0?"

> I attempted to backport to our kernel all related patches that were
> not yet backported,
> and we still see a regression in our tests.

The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
increased by the above commit, and I never have done anything to reduce
that CPU consumption. In part because you are the first to call my
attention to it.

Oh, and one other issue that I very recently fixed, that has not
yet reached mainline, just in case it matters. If you are building a
CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
production!), then your kernel will use RCU Tasks instead of vanilla RCU.
(Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessary
for sleepable BPF programs regardless of kernel .config).

> Please ignore the sha1 in this current patch series, this is only to
> show my current attempt to fix the regression in our tree.
>
> 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> queue selection
> ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> sequence number
> 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> 8cafaadb6144 rcu: Add callbacks-invoked counters
> 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> 65784460a392 rcu-tasks: Use workqueues for multiple
> rcu_tasks_invoke_cbs() invocations
> dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> callback queues
> 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> initial queueing
> a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> call_rcu_tasks_generic()
> e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> dequeueing
> 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention

2022-04-01 15:12:01

by Eric Dumazet

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
>
> The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> IDs to shift down to zero. The grace-period kthread is allowed to run
> where it likes. The callback lists are protected by locking, even in
> the case of local access, so this should be safe.
>
> Or am I missing your point?
>

In fact I have been looking at this code, because we bisected a
regression back to this patch:

4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
per-grace-period sleep for RCU Tasks Trace

It is very possible the regression comes because the RCU task thread
is using more cpu cycles, from 'CPU 0' where our system daemons are
pinned.

But I could not spot where the RCU task kthread is forced to run on CPU 0.

I attempted to backport to our kernel all related patches that were
not yet backported,
and we still see a regression in our tests.

Please ignore the sha1 in this current patch series, this is only to
show my current attempt to fix the regression in our tree.

450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection
ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period sequence number
22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
8cafaadb6144 rcu: Add callbacks-invoked counters
323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
4408105116de rcu/segcblist: Add counters to segcblist datastructure
4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
d3e8be598546 rcu-tasks: Abstract invocations of callbacks
65784460a392 rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations
dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple callback queues
2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set initial queueing
a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from call_rcu_tasks_generic()
e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback dequeueing
533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention

2022-04-02 03:08:29

by Paul E. McKenney

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> > > On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > > > IDs to shift down to zero. The grace-period kthread is allowed to run
> > > > where it likes. The callback lists are protected by locking, even in
> > > > the case of local access, so this should be safe.
> > > >
> > > > Or am I missing your point?
> > > >
> > >
> > > In fact I have been looking at this code, because we bisected a
> > > regression back to this patch:
> > >
> > > 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> > > per-grace-period sleep for RCU Tasks Trace
> > >
> > > It is very possible the regression comes because the RCU task thread
> > > is using more cpu cycles, from 'CPU 0' where our system daemons are
> > > pinned.
> >
> > Heh! I did express that concern when creating that patch, but was
> > assured that the latency was much more important.
> >
> > Yes, that patch most definitely increases CPU utilization during RCU Tasks
> > Trace grace periods. If you can tolerate longer grace-period latencies,
> > it might be worth toning it down a bit. The ask was for about twice
> > the latency I achieved in my initial attempt, and I made the mistake of
> > forwarding that attempt out for testing. They liked the shorter latency
> > very much, and objected strenuously to the thought that I might detune
> > it back to the latency that they originally asked for. ;-)
> >
> > But I can easily provide the means to detune it through use of a kernel
> > boot parameter or some such, if that would help.
> >
> > > But I could not spot where the RCU task kthread is forced to run on CPU 0.
> >
> > I never did intend this kthread be bound anywhere. RCU's policy is
> > that any binding of its kthreads is the responsibility of the sysadm,
> > be that carbon-based or otherwise.
> >
> > But this kthread is spawned early enough that only CPU 0 is online,
> > so maybe the question is not "what is binding it to CPU 0?" but rather
> > "why isn't something kicking it off of CPU 0?"
>
> I guess the answer to this question can be found in the following
> piece of code :)
>
> rcu_read_lock();
> for_each_process_thread(g, t)
> rtp->pertask_func(t, &holdouts);
> rcu_read_unlock();
>
>
> With ~150,000 threads on a 256 cpu host, this holds current cpu for
> very long times:
>
> rcu_tasks_trace 11 [017] 5010.544762:
> probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
> rcu_tasks_trace 11 [017] 5010.600396:
> probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)

So about 55 milliseconds for the tasklist scan, correct? Or am I
losing the plot here?

> rcu_tasks_trace 11 [022] 5010.618783:
> probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
> rcu_tasks_trace 11 [022] 5010.618840:
> probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
>
> In this case, CPU 22 is the victim, not CPU 0 :)

My faith in the scheduler is restored! ;-)

My position has been that this tasklist scan does not need to be broken
up because it should happen only when a sleepable BPF program is removed,
which is a rare event.

In addition, breaking up this scan is not trivial, because as far as I
know there is no way to force a given task to stay in the list. I would
have to instead use something like rcu_lock_break(), and restart the
scan if either of the nailed-down pair of tasks was removed from the list.
In a system where tasks were coming and going very frequently, it might
be that such a broken-up scan would never complete.
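
For reference, the kind of lock break I have in mind looks roughly like the
rcu_lock_break() helper in kernel/hung_task.c, sketched here purely as an
illustration of the pattern rather than as a proposed rcu-tasks change:

/* Pin the current scan position, drop the RCU read lock so the scan can
 * be preempted, then report whether either pinned task has since left
 * the tasklist, in which case the caller would have to restart (or
 * abandon) the scan. */
static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
{
	bool can_cont;

	get_task_struct(g);
	get_task_struct(t);
	rcu_read_unlock();
	cond_resched();
	rcu_read_lock();
	can_cont = pid_alive(g) && pid_alive(t);
	put_task_struct(t);
	put_task_struct(g);

	return can_cont;
}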

I can imagine tricks where the nailed-down tasks are kept on a list,
and the nailed-downness is moved to the next task when those tasks
are removed. I can also imagine a less-than-happy response to such
a proposal.

So I am not currently thinking in terms of breaking up this scan.

Or is there some trick that I am missing?

In the meantime, here is a simple patch that reduces the frequency of the
scan by a factor of two. What it halves is not the frequency of the full
tasklist scan, but rather the frequency of the calls to
check_all_holdout_tasks_trace(). And the total of these looks to be less
than 20 milliseconds, if I am correctly interpreting your trace, with most
of that 20 milliseconds spent sleeping.
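
(Working from your timestamps: 5010.600396 - 5010.544762 is about 55.6 ms
for the tasklist scan itself, and 5010.618840 - 5010.600396 is about 18.4 ms
from the postscan probe to the postgp probe.)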

Nevertheless, the patch is at the end of this email.

Other than that, I could imagine batching removal of sleepable BPF
programs and using a single grace period for all of their trampolines.
But are there enough sleepable BPF programs ever installed to make this
a useful approach?

Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)

Thanx, Paul

> > > I attempted to backport to our kernel all related patches that were
> > > not yet backported,
> > > and we still see a regression in our tests.
> >
> > The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> > increased by the above commit, and I never have done anything to reduce
> > that CPU consumption. In part because you are the first to call my
> > attention to it.
> >
> > Oh, and one other issue that I very recently fixed, that has not
> > yet reached mainline, just in case it matters. If you are building a
> > CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> > CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> > production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessary
> > for sleepable BPF programs regardless of kernel .config).
> >
> > > Please ignore the sha1 in this current patch series, this is only to
> > > show my current attempt to fix the regression in our tree.
> > >
> > > 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> > > 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> > > 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> > > queue selection
> > > ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> > > sequence number
> > > 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> > > 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> > > 8cafaadb6144 rcu: Add callbacks-invoked counters
> > > 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> > > f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> > > 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> > > 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> > > 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> > > 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> > > 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> > > cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> > > 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> > > 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> > > d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> > > 65784460a392 rcu-tasks: Use workqueues for multiple
> > > rcu_tasks_invoke_cbs() invocations
> > > dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> > > callback queues
> > > 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> > > initial queueing
> > > a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> > > 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> > > call_rcu_tasks_generic()
> > > e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> > > 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> > > dequeueing
> > > 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> > > f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> > > bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> > > d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> > > 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention


diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 65d6e21a607a..141e2b4c70cc 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
rcu_tasks_trace.gp_sleep = HZ / 10;
rcu_tasks_trace.init_fract = HZ / 10;
} else {
- rcu_tasks_trace.gp_sleep = HZ / 200;
+ rcu_tasks_trace.gp_sleep = HZ / 100;
if (rcu_tasks_trace.gp_sleep <= 0)
rcu_tasks_trace.gp_sleep = 1;
- rcu_tasks_trace.init_fract = HZ / 200;
+ rcu_tasks_trace.init_fract = HZ / 100;
if (rcu_tasks_trace.init_fract <= 0)
rcu_tasks_trace.init_fract = 1;
}
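
For reference, with HZ=1000 this changes the per-loop sleep from HZ/200 = 5
jiffies (5 ms) to HZ/100 = 10 jiffies (10 ms); with HZ=250 it goes from 1
jiffy to 2 jiffies, and the "<= 0" checks cover configurations like HZ=100,
where HZ/200 truncates to zero.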

2022-04-02 09:15:33

by Eric Dumazet

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> > On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > > IDs to shift down to zero. The grace-period kthread is allowed to run
> > > where it likes. The callback lists are protected by locking, even in
> > > the case of local access, so this should be safe.
> > >
> > > Or am I missing your point?
> > >
> >
> > In fact I have been looking at this code, because we bisected a
> > regression back to this patch:
> >
> > 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> > per-grace-period sleep for RCU Tasks Trace
> >
> > It is very possible the regression comes because the RCU task thread
> > is using more cpu cycles, from 'CPU 0' where our system daemons are
> > pinned.
>
> Heh! I did express that concern when creating that patch, but was
> assured that the latency was much more important.
>
> Yes, that patch most definitely increases CPU utilization during RCU Tasks
> Trace grace periods. If you can tolerate longer grace-period latencies,
> it might be worth toning it down a bit. The ask was for about twice
> the latency I achieved in my initial attempt, and I made the mistake of
> forwarding that attempt out for testing. They liked the shorter latency
> very much, and objected strenuously to the thought that I might detune
> it back to the latency that they originally asked for. ;-)
>
> But I can easily provide the means to detune it through use of a kernel
> boot parameter or some such, if that would help.
>
> > But I could not spot where the RCU task kthread is forced to run on CPU 0.
>
> I never did intend this kthread be bound anywhere. RCU's policy is
> that any binding of its kthreads is the responsibility of the sysadm,
> be that carbon-based or otherwise.
>
> But this kthread is spawned early enough that only CPU 0 is online,
> so maybe the question is not "what is binding it to CPU 0?" but rather
> "why isn't something kicking it off of CPU 0?"

I guess the answer to this question can be found in the following
piece of code :)

rcu_read_lock();
for_each_process_thread(g, t)
        rtp->pertask_func(t, &holdouts);
rcu_read_unlock();


With ~150,000 threads on a 256 cpu host, this holds current cpu for
very long times:

rcu_tasks_trace 11 [017] 5010.544762: probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
rcu_tasks_trace 11 [017] 5010.600396: probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
rcu_tasks_trace 11 [022] 5010.618783: probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
rcu_tasks_trace 11 [022] 5010.618840: probe:rcu_tasks_trace_postgp: (ffffffff963fba70)

In this case, CPU 22 is the victim, not CPU 0 :)



>
> > I attempted to backport to our kernel all related patches that were
> > not yet backported,
> > and we still see a regression in our tests.
>
> The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> increased by the above commit, and I never have done anything to reduce
> that CPU consumption. In part because you are the first to call my
> attention to it.
>
> Oh, and one other issue that I very recently fixed, that has not
> yet reached mainline, just in case it matters. If you are building a
> CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessary
> for sleepable BPF programs regardless of kernel .config).
>
> > Please ignore the sha1 in this current patch series, this is only to
> > show my current attempt to fix the regression in our tree.
> >
> > 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> > 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> > 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> > queue selection
> > ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> > sequence number
> > 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> > 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> > 8cafaadb6144 rcu: Add callbacks-invoked counters
> > 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> > f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> > 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> > 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> > 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> > 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> > 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> > cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> > 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> > 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> > d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> > 65784460a392 rcu-tasks: Use workqueues for multiple
> > rcu_tasks_invoke_cbs() invocations
> > dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> > callback queues
> > 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> > initial queueing
> > a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> > 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> > call_rcu_tasks_generic()
> > e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> > 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> > dequeueing
> > 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> > f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> > bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> > d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> > 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention

2022-04-02 10:51:53

by Paul E. McKenney

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> Hi Paul
>
> It seems you assume per cpu ptr for arbitrary indexes (< nr_cpu_ids) are valid.

Gah! I knew I was forgetting something...

But just to check, is this a theoretical problem or something you hit
on real hardware? (For the rest of this email, I am assuming the latter.)

> What do you think of the (untested) following patch ?

One issue with this patch is that the contention could be unpredictable,
or worse, vary among CPU, especially if the cpu_possible_mask was oddly
distributed.

So might it be better to restrict this to all on CPU 0 on the one hand
and completely per-CPU on the other? (Or all on the boot CPU, in case
I am forgetting some misbegotten architecture that can run without a
CPU 0.)

Thanx, Paul

> Thanks.
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 99cf3a13954cfb17828fbbeeb884f11614a526a9..df3785be4022e903d9682dd403464aa9927aa5c2
> 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -273,13 +273,17 @@ static void call_rcu_tasks_generic(struct
> rcu_head *rhp, rcu_callback_t func,
> bool needadjust = false;
> bool needwake;
> struct rcu_tasks_percpu *rtpcp;
> + int ideal_cpu, chosen_cpu;
>
> rhp->next = NULL;
> rhp->func = func;
> local_irq_save(flags);
> rcu_read_lock();
> - rtpcp = per_cpu_ptr(rtp->rtpcpu,
> - smp_processor_id() >>
> READ_ONCE(rtp->percpu_enqueue_shift));
> +
> + ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
> + chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
> +
> + rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
> if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> j = jiffies;

2022-04-02 16:39:58

by Paul E. McKenney

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Thu, Mar 31, 2022 at 03:57:36PM -0700, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 3:54 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 3:42 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> > > > Hi Paul
> > > >
> > > > It seems you assume per cpu ptr for arbitrary indexes (< nr_cpu_ids) are valid.
> > >
> > > Gah! I knew I was forgetting something...
> > >
> > > But just to check, is this a theoretical problem or something you hit
> > > on real hardware? (For the rest of this email, I am assuming the latter.)
> >
> > Code review really...
> >
> > >
> > > > What do you think of the (untested) following patch ?
> > >
> > > One issue with this patch is that the contention could be unpredictable,
> > > or worse, vary among CPU, especially if the cpu_possible_mask was oddly
> > > distributed.
> > >
> > > So might it be better to restrict this to all on CPU 0 on the one hand
> > > and completely per-CPU on the other? (Or all on the boot CPU, in case
> > > I am forgetting some misbegotten architecture that can run without a
> > > CPU 0.)
> >
> > If I understand correctly, cblist_init_generic() could setup
> > percpu_enqueue_shift
> > to something smaller than order_base_2(nr_cpu_ids)
> >
> > Meaning that we could reach a non zero idx in (smp_processor_id() >>
> > percpu_enqueue_shift)
> >
> > So even if CPU0 is always present (I am not sure this is guaranteed,
> > but this seems reasonable),
> > we could still attempt a per_cpu_ptr(PTR, not_present_cpu), and get garbage.
>
> Also you mention CPU 0, but I do not see where cpu binding is
> performed on the kthread ?

The initial setting of ->percpu_enqueue_shift forces all in-range CPU
IDs to shift down to zero. The grace-period kthread is allowed to run
where it likes. The callback lists are protected by locking, even in
the case of local access, so this should be safe.
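
Concretely, with made-up numbers: if nr_cpu_ids were 16, the initial shift
would be large enough (order_base_2(16) = 4) that:

	idx = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
	/* shift == 4, so every CPU ID in 0..15 yields idx == 0, and each
	 * enqueue lands on CPU 0's callback list no matter which CPU (or
	 * which kthread) does the enqueuing. */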

Or am I missing your point?

Thanx, Paul

> > > > Thanks.
> > > >
> > > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > > index 99cf3a13954cfb17828fbbeeb884f11614a526a9..df3785be4022e903d9682dd403464aa9927aa5c2
> > > > 100644
> > > > --- a/kernel/rcu/tasks.h
> > > > +++ b/kernel/rcu/tasks.h
> > > > @@ -273,13 +273,17 @@ static void call_rcu_tasks_generic(struct
> > > > rcu_head *rhp, rcu_callback_t func,
> > > > bool needadjust = false;
> > > > bool needwake;
> > > > struct rcu_tasks_percpu *rtpcp;
> > > > + int ideal_cpu, chosen_cpu;
> > > >
> > > > rhp->next = NULL;
> > > > rhp->func = func;
> > > > local_irq_save(flags);
> > > > rcu_read_lock();
> > > > - rtpcp = per_cpu_ptr(rtp->rtpcpu,
> > > > - smp_processor_id() >>
> > > > READ_ONCE(rtp->percpu_enqueue_shift));
> > > > +
> > > > + ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
> > > > + chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
> > > > +
> > > > + rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
> > > > if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> > > > raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> > > > j = jiffies;

2022-04-02 17:02:03

by Eric Dumazet

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
> > On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> > > > On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > > > > IDs to shift down to zero. The grace-period kthread is allowed to run
> > > > > where it likes. The callback lists are protected by locking, even in
> > > > > the case of local access, so this should be safe.
> > > > >
> > > > > Or am I missing your point?
> > > > >
> > > >
> > > > In fact I have been looking at this code, because we bisected a
> > > > regression back to this patch:
> > > >
> > > > 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> > > > per-grace-period sleep for RCU Tasks Trace
> > > >
> > > > It is very possible the regression comes because the RCU task thread
> > > > is using more cpu cycles, from 'CPU 0' where our system daemons are
> > > > pinned.
> > >
> > > Heh! I did express that concern when creating that patch, but was
> > > assured that the latency was much more important.
> > >
> > > Yes, that patch most definitely increases CPU utilization during RCU Tasks
> > > Trace grace periods. If you can tolerate longer grace-period latencies,
> > > it might be worth toning it down a bit. The ask was for about twice
> > > the latency I achieved in my initial attempt, and I made the mistake of
> > > forwarding that attempt out for testing. They liked the shorter latency
> > > very much, and objected strenuously to the thought that I might detune
> > > it back to the latency that they originally asked for. ;-)
> > >
> > > But I can easily provide the means to detune it through use of a kernel
> > > boot parameter or some such, if that would help.
> > >
> > > > But I could not spot where the RCU task kthread is forced to run on CPU 0.
> > >
> > > I never did intend this kthread be bound anywhere. RCU's policy is
> > > that any binding of its kthreads is the responsibility of the sysadm,
> > > be that carbon-based or otherwise.
> > >
> > > But this kthread is spawned early enough that only CPU 0 is online,
> > > so maybe the question is not "what is binding it to CPU 0?" but rather
> > > "why isn't something kicking it off of CPU 0?"
> >
> > I guess the answer to this question can be found in the following
> > piece of code :)
> >
> > rcu_read_lock();
> > for_each_process_thread(g, t)
> > rtp->pertask_func(t, &holdouts);
> > rcu_read_unlock();
> >
> >
> > With ~150,000 threads on a 256 cpu host, this holds current cpu for
> > very long times:
> >
> > rcu_tasks_trace 11 [017] 5010.544762:
> > probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
> > rcu_tasks_trace 11 [017] 5010.600396:
> > probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
>
> So about 55 milliseconds for the tasklist scan, correct? Or am I
> losing the plot here?
>
> > rcu_tasks_trace 11 [022] 5010.618783:
> > probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
> > rcu_tasks_trace 11 [022] 5010.618840:
> > probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
> >
> > In this case, CPU 22 is the victim, not CPU 0 :)
>
> My faith in the scheduler is restored! ;-)
>
> My position has been that this tasklist scan does not need to be broken
> up because it should happen only when a sleepable BPF program is removed,
> which is a rare event.

Hmm... what about bpf_sk_storage_free() ?

Definitely not a rare event.

>
> In addition, breaking up this scan is not trivial, because as far as I
> know there is no way to force a given task to stay in the list. I would
> have to instead use something like rcu_lock_break(), and restart the
> scan if either of the nailed-down pair of tasks was removed from the list.
> In a system where tasks were coming and going very frequently, it might
> be that such a broken-up scan would never complete.
>
> I can imagine tricks where the nailed-down tasks are kept on a list,
> and the nailed-downness is moved to the next task when those tasks
> are removed. I can also imagine a less-than-happy response to such
> a proposal.
>
> So I am not currently thinking in terms of breaking up this scan.
>
> Or is there some trick that I am missing?
>
> In the meantime, a simple patch that reduces the frequency of the scan
> by a factor of two. But this would not be the scan of the full tasklist,
> but rather the frequency of the calls to check_all_holdout_tasks_trace().
> And the total of these looks to be less than 20 milliseconds, if I am
> correctly interpreting your trace. And most of that 20 milliseconds
> is sleeping.
>
> Nevertheless, the patch is at the end of this email.
>
> Other than that, I could imagine batching removal of sleepable BPF
> programs and using a single grace period for all of their trampolines.
> But are there enough sleepable BPF programs ever installed to make this
> a useful approach?
>
> Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
>
> Thanx, Paul
>
> > > > I attempted to backport to our kernel all related patches that were
> > > > not yet backported,
> > > > and we still see a regression in our tests.
> > >
> > > The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> > > increased by the above commit, and I never have done anything to reduce
> > > that CPU consumption. In part because you are the first to call my
> > > attention to it.
> > >
> > > Oh, and one other issue that I very recently fixed, that has not
> > > yet reached mainline, just in case it matters. If you are building a
> > > CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> > > CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> > > production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> > > (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessary
> > > for sleepable BPF programs regardless of kernel .config).
> > >
> > > > Please ignore the sha1 in this current patch series, this is only to
> > > > show my current attempt to fix the regression in our tree.
> > > >
> > > > 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> > > > 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> > > > 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> > > > queue selection
> > > > ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> > > > sequence number
> > > > 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> > > > 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> > > > 8cafaadb6144 rcu: Add callbacks-invoked counters
> > > > 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> > > > f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> > > > 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> > > > 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> > > > 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> > > > 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> > > > 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> > > > cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> > > > 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> > > > 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> > > > d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> > > > 65784460a392 rcu-tasks: Use workqueues for multiple
> > > > rcu_tasks_invoke_cbs() invocations
> > > > dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> > > > callback queues
> > > > 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> > > > initial queueing
> > > > a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> > > > 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> > > > call_rcu_tasks_generic()
> > > > e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> > > > 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> > > > dequeueing
> > > > 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> > > > f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> > > > bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> > > > d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> > > > 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
>
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 65d6e21a607a..141e2b4c70cc 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
> rcu_tasks_trace.gp_sleep = HZ / 10;
> rcu_tasks_trace.init_fract = HZ / 10;
> } else {
> - rcu_tasks_trace.gp_sleep = HZ / 200;
> + rcu_tasks_trace.gp_sleep = HZ / 100;
> if (rcu_tasks_trace.gp_sleep <= 0)
> rcu_tasks_trace.gp_sleep = 1;
> - rcu_tasks_trace.init_fract = HZ / 200;
> + rcu_tasks_trace.init_fract = HZ / 100;
> if (rcu_tasks_trace.init_fract <= 0)
> rcu_tasks_trace.init_fract = 1;
> }

It seems that if the scan time is > 50ms in some common cases (at
least at Google scale),
the claim of having a latency of 10ms is not reasonable.

Given that, I do not think bpf_sk_storage_free() can/should use
call_rcu_tasks_trace(); we probably will have to fix this soon (or revert
it from our kernels).

Thanks.

2022-04-04 23:59:55

by Eric Dumazet

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Mon, Apr 4, 2022 at 12:49 PM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 03:54:02PM -0700, Eric Dumazet wrote:
> > On Thu, Mar 31, 2022 at 3:42 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> > > > Hi Paul
> > > >
> > > > It seems you assume per cpu ptr for arbitrary indexes (< nr_cpu_ids) are valid.
> > >
> > > Gah! I knew I was forgetting something...
> > >
> > > But just to check, is this a theoretical problem or something you hit
> > > on real hardware? (For the rest of this email, I am assuming the latter.)
> >
> > Code review really...
> >
> > >
> > > > What do you think of the (untested) following patch ?
> > >
> > > One issue with this patch is that the contention could be unpredictable,
> > > or worse, vary among CPU, especially if the cpu_possible_mask was oddly
> > > distributed.
> > >
> > > So might it be better to restrict this to all on CPU 0 on the one hand
> > > and completely per-CPU on the other? (Or all on the boot CPU, in case
> > > I am forgetting some misbegotten architecture that can run without a
> > > CPU 0.)
> >
> > If I understand correctly, cblist_init_generic() could setup
> > percpu_enqueue_shift
> > to something smaller than order_base_2(nr_cpu_ids)
> >
> > Meaning that we could reach a non zero idx in (smp_processor_id() >>
> > percpu_enqueue_shift)
> >
> > So even if CPU0 is always present (I am not sure this is guaranteed,
> > but this seems reasonable),
> > we could still attempt a per_cpu_ptr(PTR, not_present_cpu), and get garbage.
>
> And the problem with my wish to provide load balancing is that a
> sparse cpumask could be sparse any which way that it wants to be.
> Another problem is that, unlike TREE SRCU, Tasks RCU doesn't have an
> efficient way to find all the CPUs with callbacks queued. Yes, I could
> add that information, but the benefit does not seem worth the complexity.
>
> So I took your patch after all, but changed from cpu_online_mask to
> cpu_possible_mask. Thank you for bearing with me on this one!
>
> Are you OK with your Signed-off-by on this patch as shown below?

Absolutely, thanks Paul for taking care of this.

>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit b77b2981bb22c4449a0a6e86eeb9fbab36a2beae
> Author: Eric Dumazet <[email protected]>
> Date: Mon Apr 4 12:30:18 2022 -0700
>
> rcu-tasks: Handle sparse cpu_possible_mask
>
> If the rcupdate.rcu_task_enqueue_lim kernel boot parameter is set to
> something greater than 1 and less than nr_cpu_ids, the code attempts to
> use a subset of the CPU's RCU Tasks callback lists. This works, but only
> if the cpu_possible_mask is contiguous. If there are "holes" in this
> mask, the callback-enqueue code might attempt to access a non-existent
> per-CPU ->rtcpu variable for a non-existent CPU. For example, if only
> CPUs 0, 4, 8, 12, 16 and so on are in cpu_possible_mask, specifying
> rcupdate.rcu_task_enqueue_lim=4 would cause the code to attempt to
> use callback queues for non-existent CPUs 1, 2, and 3. Because such
> systems have existed in the past and might still exist, the code needs
> to gracefully handle this situation.
>
> This commit therefore checks to see whether the desired CPU is present
> in cpu_possible_mask, and, if not, searches for the next CPU. This means
> that the systems administrator of a system with a sparse cpu_possible_mask
> will need to account for this sparsity when specifying the value of
> the rcupdate.rcu_task_enqueue_lim kernel boot parameter. For example,
> setting this parameter to the value 4 will use only CPUs 0 and 4, with
> CPU 4 getting three times the callback load of CPU 0.
>
> This commit assumes that bit (nr_cpu_ids - 1) is always set in
> cpu_possible_mask.
>
> Link: https://lore.kernel.org/lkml/CANn89iKaNEwyNZ=L_PQnkH0LP_XjLYrr_dpyRKNNoDJaWKdrmg@mail.gmail.com/
> Signed-off-by: Eric Dumazet <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index 65d6e21a607a..44977c6a1cb8 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -273,7 +273,9 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
> static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
> struct rcu_tasks *rtp)
> {
> + int chosen_cpu;
> unsigned long flags;
> + int ideal_cpu;
> unsigned long j;
> bool needadjust = false;
> bool needwake;
> @@ -283,8 +285,9 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
> rhp->func = func;
> local_irq_save(flags);
> rcu_read_lock();
> - rtpcp = per_cpu_ptr(rtp->rtpcpu,
> - smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
> + ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
> + chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_possible_mask);
> + rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
> if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> j = jiffies;
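
For the record, on the sparse layout described in the changelog above (a
hypothetical cpu_possible_mask of { 0, 4, 8, 12, ... } with
rcupdate.rcu_task_enqueue_lim=4), the fixed mapping works out as:

	chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_possible_mask);
	/* ideal_cpu 0 -> chosen_cpu 0
	 * ideal_cpu 1 -> chosen_cpu 4
	 * ideal_cpu 2 -> chosen_cpu 4
	 * ideal_cpu 3 -> chosen_cpu 4
	 * so only CPUs 0 and 4 are used, with CPU 4 seeing three times the
	 * callback load of CPU 0, as noted in the changelog. */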

2022-04-05 00:09:47

by Paul E. McKenney

Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Fri, Apr 01, 2022 at 08:28:14AM -0700, Paul E. McKenney wrote:
> [ Adding Andrii and Alexei at Andrii's request. ]
>
> On Fri, Apr 01, 2022 at 08:20:37AM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 01, 2022 at 06:24:13AM -0700, Eric Dumazet wrote:
> > > On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
> > > > > On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> > > > > > > On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> > > > > > > >
> > > > > > > > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > > > > > > > IDs to shift down to zero. The grace-period kthread is allowed to run
> > > > > > > > where it likes. The callback lists are protected by locking, even in
> > > > > > > > the case of local access, so this should be safe.
> > > > > > > >
> > > > > > > > Or am I missing your point?
> > > > > > > >
> > > > > > >
> > > > > > > In fact I have been looking at this code, because we bisected a
> > > > > > > regression back to this patch:
> > > > > > >
> > > > > > > 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> > > > > > > per-grace-period sleep for RCU Tasks Trace
> > > > > > >
> > > > > > > It is very possible the regression comes because the RCU task thread
> > > > > > > is using more cpu cycles, from 'CPU 0' where our system daemons are
> > > > > > > pinned.
> > > > > >
> > > > > > Heh! I did express that concern when creating that patch, but was
> > > > > > assured that the latency was much more important.
> > > > > >
> > > > > > Yes, that patch most definitely increases CPU utilization during RCU Tasks
> > > > > > Trace grace periods. If you can tolerate longer grace-period latencies,
> > > > > > it might be worth toning it down a bit. The ask was for about twice
> > > > > > the latency I achieved in my initial attempt, and I made the mistake of
> > > > > > forwarding that attempt out for testing. They liked the shorter latency
> > > > > > very much, and objected strenuously to the thought that I might detune
> > > > > > it back to the latency that they originally asked for. ;-)
> > > > > >
> > > > > > But I can easily provide the means to detune it through use of a kernel
> > > > > > boot parameter or some such, if that would help.
> > > > > >
> > > > > > > But I could not spot where the RCU task kthread is forced to run on CPU 0.
> > > > > >
> > > > > > I never did intend this kthread be bound anywhere. RCU's policy is
> > > > > > that any binding of its kthreads is the responsibility of the sysadm,
> > > > > > be that carbon-based or otherwise.
> > > > > >
> > > > > > But this kthread is spawned early enough that only CPU 0 is online,
> > > > > > so maybe the question is not "what is binding it to CPU 0?" but rather
> > > > > > "why isn't something kicking it off of CPU 0?"
> > > > >
> > > > > I guess the answer to this question can be found in the following
> > > > > piece of code :)
> > > > >
> > > > > rcu_read_lock();
> > > > > for_each_process_thread(g, t)
> > > > > rtp->pertask_func(t, &holdouts);
> > > > > rcu_read_unlock();
> > > > >
> > > > >
> > > > > With ~150,000 threads on a 256 cpu host, this holds current cpu for
> > > > > very long times:
> > > > >
> > > > > rcu_tasks_trace 11 [017] 5010.544762:
> > > > > probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
> > > > > rcu_tasks_trace 11 [017] 5010.600396:
> > > > > probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
> > > >
> > > > So about 55 milliseconds for the tasklist scan, correct? Or am I
> > > > losing the plot here?
> > > >
> > > > > rcu_tasks_trace 11 [022] 5010.618783:
> > > > > probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
> > > > > rcu_tasks_trace 11 [022] 5010.618840:
> > > > > probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
> > > > >
> > > > > In this case, CPU 22 is the victim, not CPU 0 :)
> > > >
> > > > My faith in the scheduler is restored! ;-)
> > > >
> > > > My position has been that this tasklist scan does not need to be broken
> > > > up because it should happen only when a sleepable BPF program is removed,
> > > > which is a rare event.
> > >
> > > Hmm... what about bpf_sk_storage_free() ?
> > >
> > > Definitely not a rare event.
> >
> > Hmmm... Are the BPF guys using call_rcu_tasks_trace() to free things that
> > are not trampolines for sleepable BPF programs? Kind of looks like it.
> >
> > Maybe RCU Tasks Trace was too convenient to use? ;-)
> >
> > > > In addition, breaking up this scan is not trivial, because as far as I
> > > > know there is no way to force a given task to stay in the list. I would
> > > > have to instead use something like rcu_lock_break(), and restart the
> > > > scan if either of the nailed-down pair of tasks was removed from the list.
> > > > In a system where tasks were coming and going very frequently, it might
> > > > be that such a broken-up scan would never complete.
> > > >
> > > > I can imagine tricks where the nailed-down tasks are kept on a list,
> > > > and the nailed-downness is moved to the next task when those tasks
> > > > are removed. I can also imagine a less-than-happy response to such
> > > > a proposal.
> > > >
> > > > So I am not currently thinking in terms of breaking up this scan.
> > > >
> > > > Or is there some trick that I am missing?
> > > >
> > > > In the meantime, a simple patch that reduces the frequency of the scan
> > > > by a factor of two. But this would not be the scan of the full tasklist,
> > > > but rather the frequency of the calls to check_all_holdout_tasks_trace().
> > > > And the total of these looks to be less than 20 milliseconds, if I am
> > > > correctly interpreting your trace. And most of that 20 milliseconds
> > > > is sleeping.
> > > >
> > > > Nevertheless, the patch is at the end of this email.
> > > >
> > > > Other than that, I could imagine batching removal of sleepable BPF
> > > > programs and using a single grace period for all of their trampolines.
> > > > But are there enough sleepable BPF programs ever installed to make this
> > > > a useful approach?
> > > >
> > > > Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
> > > >
> > > > Thanx, Paul
> > > >
> > > > > > > I attempted to backport to our kernel all related patches that were
> > > > > > > not yet backported,
> > > > > > > and we still see a regression in our tests.
> > > > > >
> > > > > > The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> > > > > > increased by the above commit, and I never have done anything to reduce
> > > > > > that CPU consumption. In part because you are the first to call my
> > > > > > attention to it.
> > > > > >
> > > > > > Oh, and one other issue that I very recently fixed, that has not
> > > > > > yet reached mainline, just in case it matters. If you are building a
> > > > > > CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> > > > > > CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> > > > > > production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> > > > > > (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessary
> > > > > > for sleepable BPF programs regardless of kernel .config).
> > > > > >
> > > > > > > Please ignore the sha1 in this current patch series, this is only to
> > > > > > > show my current attempt to fix the regression in our tree.
> > > > > > >
> > > > > > > 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> > > > > > > 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> > > > > > > 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> > > > > > > queue selection
> > > > > > > ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> > > > > > > sequence number
> > > > > > > 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> > > > > > > 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> > > > > > > 8cafaadb6144 rcu: Add callbacks-invoked counters
> > > > > > > 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> > > > > > > f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> > > > > > > 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> > > > > > > 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> > > > > > > 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> > > > > > > 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> > > > > > > 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> > > > > > > cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> > > > > > > 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> > > > > > > 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> > > > > > > d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> > > > > > > 65784460a392 rcu-tasks: Use workqueues for multiple
> > > > > > > rcu_tasks_invoke_cbs() invocations
> > > > > > > dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> > > > > > > callback queues
> > > > > > > 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> > > > > > > initial queueing
> > > > > > > a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> > > > > > > 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> > > > > > > call_rcu_tasks_generic()
> > > > > > > e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> > > > > > > 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> > > > > > > dequeueing
> > > > > > > 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> > > > > > > f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> > > > > > > bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> > > > > > > d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> > > > > > > 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
> > > >
> > > >
> > > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > > index 65d6e21a607a..141e2b4c70cc 100644
> > > > --- a/kernel/rcu/tasks.h
> > > > +++ b/kernel/rcu/tasks.h
> > > > @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
> > > > rcu_tasks_trace.gp_sleep = HZ / 10;
> > > > rcu_tasks_trace.init_fract = HZ / 10;
> > > > } else {
> > > > - rcu_tasks_trace.gp_sleep = HZ / 200;
> > > > + rcu_tasks_trace.gp_sleep = HZ / 100;
> > > > if (rcu_tasks_trace.gp_sleep <= 0)
> > > > rcu_tasks_trace.gp_sleep = 1;
> > > > - rcu_tasks_trace.init_fract = HZ / 200;
> > > > + rcu_tasks_trace.init_fract = HZ / 100;
> > > > if (rcu_tasks_trace.init_fract <= 0)
> > > > rcu_tasks_trace.init_fract = 1;
> > > > }
> > >
> > > It seems that if the scan time is > 50ms in some common cases (at
> > > least at Google scale),
> > > the claim of having a latency of 10ms is not reasonable.
> >
> > But does the above patch make things better? If it does, I will send
> > you a proper patch with kernel boot parameters. We can then discuss
> > better autotuning, for example, making the defaults a function of the
> > number of CPUs.
> >
> > Either way, that certainly is a fair point. Another fair point is that
> > the offending commit was in response to a bug report from your colleagues. ;-)
> >
> > Except that I don't see any uses of synchronize_rcu_tasks_trace(), so
> > I am at a loss as to why latency matters anymore.
> >
> > Is the issue the overall CPU consumption of the scan (which is my
> > current guess) or the length of time that the scan runs without invoking
> > cond_resched() or similar?
> >
> > Either way, how frequently is call_rcu_tasks_trace() being invoked in
> > your setup? If it is being invoked frequently, increasing delays would
> > allow multiple call_rcu_tasks_trace() instances to be served by a single
> > tasklist scan.
> >
> > > Given that, I do not think bpf_sk_storage_free() can/should use
> > > call_rcu_tasks_trace(),
> > > we probably will have to fix this soon (or revert from our kernels)
> >
> > Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> > bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> > bpf_sk_storage_free():
> >
> > 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
> >
> > This commit was authored by KP Singh, who I am adding on CC. Or I would
> > have, except that you beat me to it. Good show!!! ;-)
> >
> > If this commit provoked this issue, then the above patch might help.

Another question... Were there actually any sleepable BPF
programs running on this system at that time? If not, then maybe
bpf_selem_unlink_storage_nolock() should check for that condition, and
invoke call_rcu_tasks_trace() only if there are actually sleepable BPF
programs running (or in the process of being removed). If I understand
correctly (ha!), if there were no sleepable BPF programs running, you
would instead want call_rcu(). (Here I am assuming that non-sleepable
BPF programs still run within a normal RCU read-side critical section.)
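
Something like this, purely as a sketch (sleepable_bpf_progs_active()
is a made-up placeholder for whatever test the BPF folks would actually
use, as is the free callback name):

	/* Fall back to vanilla RCU when no sleepable BPF program can
	 * still be reading the element. */
	if (sleepable_bpf_progs_active())
		call_rcu_tasks_trace(&selem->rcu, selem_free_cb);
	else
		call_rcu(&selem->rcu, selem_free_cb);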

Thanx, Paul

> > I am also adding Neeraj Uphadhyay on CC for his thoughts, as he has
> > also been throught this code.
> >
> > My response time may be a bit slow next week, but I will be checking
> > email.
> >
> > Thanx, Paul

2022-04-05 01:20:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

[ Adding Andrii and Alexei at Andrii's request. ]

On Fri, Apr 01, 2022 at 08:20:37AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 01, 2022 at 06:24:13AM -0700, Eric Dumazet wrote:
> > On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
> > > > On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> > > > > > On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> > > > > > >
> > > > > > > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > > > > > > IDs to shift down to zero. The grace-period kthread is allowed to run
> > > > > > > where it likes. The callback lists are protected by locking, even in
> > > > > > > the case of local access, so this should be safe.
> > > > > > >
> > > > > > > Or am I missing your point?
> > > > > > >
> > > > > >
> > > > > > In fact I have been looking at this code, because we bisected a
> > > > > > regression back to this patch:
> > > > > >
> > > > > > 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> > > > > > per-grace-period sleep for RCU Tasks Trace
> > > > > >
> > > > > > It is very possible the regression comes because the RCU task thread
> > > > > > is using more cpu cycles, from 'CPU 0' where our system daemons are
> > > > > > pinned.
> > > > >
> > > > > Heh! I did express that concern when creating that patch, but was
> > > > > assured that the latency was much more important.
> > > > >
> > > > > Yes, that patch most definitely increases CPU utilization during RCU Tasks
> > > > > Trace grace periods. If you can tolerate longer grace-period latencies,
> > > > > it might be worth toning it down a bit. The ask was for about twice
> > > > > the latency I achieved in my initial attempt, and I made the mistake of
> > > > > forwarding that attempt out for testing. They liked the shorter latency
> > > > > very much, and objected strenuously to the thought that I might detune
> > > > > it back to the latency that they originally asked for. ;-)
> > > > >
> > > > > But I can easily provide the means to detune it through use of a kernel
> > > > > boot parameter or some such, if that would help.
> > > > >
> > > > > > But I could not spot where the RCU task kthread is forced to run on CPU 0.
> > > > >
> > > > > I never did intend this kthread be bound anywhere. RCU's policy is
> > > > > that any binding of its kthreads is the responsibility of the sysadm,
> > > > > be that carbon-based or otherwise.
> > > > >
> > > > > But this kthread is spawned early enough that only CPU 0 is online,
> > > > > so maybe the question is not "what is binding it to CPU 0?" but rather
> > > > > "why isn't something kicking it off of CPU 0?"
> > > >
> > > > I guess the answer to this question can be found in the following
> > > > piece of code :)
> > > >
> > > > rcu_read_lock();
> > > > for_each_process_thread(g, t)
> > > > rtp->pertask_func(t, &holdouts);
> > > > rcu_read_unlock();
> > > >
> > > >
> > > > With ~150,000 threads on a 256 cpu host, this holds current cpu for
> > > > very long times:
> > > >
> > > > rcu_tasks_trace 11 [017] 5010.544762:
> > > > probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
> > > > rcu_tasks_trace 11 [017] 5010.600396:
> > > > probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
> > >
> > > So about 55 milliseconds for the tasklist scan, correct? Or am I
> > > losing the plot here?
> > >
> > > > rcu_tasks_trace 11 [022] 5010.618783:
> > > > probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
> > > > rcu_tasks_trace 11 [022] 5010.618840:
> > > > probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
> > > >
> > > > In this case, CPU 22 is the victim, not CPU 0 :)
> > >
> > > My faith in the scheduler is restored! ;-)
> > >
> > > My position has been that this tasklist scan does not need to be broken
> > > up because it should happen only when a sleepable BPF program is removed,
> > > which is a rare event.
> >
> > Hmm... what about bpf_sk_storage_free() ?
> >
> > Definitely not a rare event.
>
> Hmmm... Are the BPF guys using call_rcu_tasks_trace() to free things that
> are not trampolines for sleepable BPF programs? Kind of looks like it.
>
> Maybe RCU Tasks Trace was too convenient to use? ;-)
>
> > > In addition, breaking up this scan is not trivial, because as far as I
> > > know there is no way to force a given task to stay in the list. I would
> > > have to instead use something like rcu_lock_break(), and restart the
> > > scan if either of the nailed-down pair of tasks was removed from the list.
> > > In a system where tasks were coming and going very frequently, it might
> > > be that such a broken-up scan would never complete.
> > >
> > > I can imagine tricks where the nailed-down tasks are kept on a list,
> > > and the nailed-downness is moved to the next task when those tasks
> > > are removed. I can also imagine a less-than-happy response to such
> > > a proposal.
> > >
> > > So I am not currently thinking in terms of breaking up this scan.
> > >
> > > Or is there some trick that I am missing?
> > >
> > > In the meantime, a simple patch that reduces the frequency of the scan
> > > by a factor of two. But this would not be the scan of the full tasklist,
> > > but rather the frequency of the calls to check_all_holdout_tasks_trace().
> > > And the total of these looks to be less than 20 milliseconds, if I am
> > > correctly interpreting your trace. And most of that 20 milliseconds
> > > is sleeping.
> > >
> > > Nevertheless, the patch is at the end of this email.
> > >
> > > Other than that, I could imagine batching removal of sleepable BPF
> > > programs and using a single grace period for all of their trampolines.
> > > But are there enough sleepable BPF programs ever installed to make this
> > > a useful approach?
> > >
> > > Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
> > >
> > > Thanx, Paul
> > >
> > > > > > I attempted to backport to our kernel all related patches that were
> > > > > > not yet backported,
> > > > > > and we still see a regression in our tests.
> > > > >
> > > > > The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> > > > > increased by the above commit, and I never have done anything to reduce
> > > > > that CPU consumption. In part because you are the first to call my
> > > > > attention to it.
> > > > >
> > > > > Oh, and one other issue that I very recently fixed, that has not
> > > > > yet reached mainline, just in case it matters. If you are building a
> > > > > CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> > > > > CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> > > > > production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> > > > > (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessaary
> > > > > for sleepable BPF programs regardless of kernel .config).
> > > > >
> > > > > > Please ignore the sha1 in this current patch series, this is only to
> > > > > > show my current attempt to fix the regression in our tree.
> > > > > >
> > > > > > 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> > > > > > 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> > > > > > 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> > > > > > queue selection
> > > > > > ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> > > > > > sequence number
> > > > > > 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> > > > > > 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> > > > > > 8cafaadb6144 rcu: Add callbacks-invoked counters
> > > > > > 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> > > > > > f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> > > > > > 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> > > > > > 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> > > > > > 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> > > > > > 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> > > > > > 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> > > > > > cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> > > > > > 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> > > > > > 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> > > > > > d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> > > > > > 65784460a392 rcu-tasks: Use workqueues for multiple
> > > > > > rcu_tasks_invoke_cbs() invocations
> > > > > > dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> > > > > > callback queues
> > > > > > 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> > > > > > initial queueing
> > > > > > a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> > > > > > 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> > > > > > call_rcu_tasks_generic()
> > > > > > e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> > > > > > 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> > > > > > dequeueing
> > > > > > 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> > > > > > f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> > > > > > bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> > > > > > d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> > > > > > 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
> > >
> > >
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 65d6e21a607a..141e2b4c70cc 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
> > > rcu_tasks_trace.gp_sleep = HZ / 10;
> > > rcu_tasks_trace.init_fract = HZ / 10;
> > > } else {
> > > - rcu_tasks_trace.gp_sleep = HZ / 200;
> > > + rcu_tasks_trace.gp_sleep = HZ / 100;
> > > if (rcu_tasks_trace.gp_sleep <= 0)
> > > rcu_tasks_trace.gp_sleep = 1;
> > > - rcu_tasks_trace.init_fract = HZ / 200;
> > > + rcu_tasks_trace.init_fract = HZ / 100;
> > > if (rcu_tasks_trace.init_fract <= 0)
> > > rcu_tasks_trace.init_fract = 1;
> > > }
> >
> > It seems that if the scan time is > 50ms in some common cases (at
> > least at Google scale),
> > the claim of having a latency of 10ms is not reasonable.
>
> But does the above patch make things better? If it does, I will send
> you a proper patch with kernel boot parameters. We can then discuss
> better autotuning, for example, making the defaults a function of the
> number of CPUs.
>
> Either way, that certainly is a fair point. Another fair point is that
> the offending commit was in response to a bug report from your colleagues. ;-)
>
> Except that I don't see any uses of synchronize_rcu_tasks_trace(), so
> I am at a loss as to why latency matters anymore.
>
> Is the issue the overall CPU consumption of the scan (which is my
> current guess) or the length of time that the scan runs without invoking
> cond_resched() or similar?
>
> Either way, how frequently is call_rcu_tasks_trace() being invoked in
> your setup? If it is being invoked frequently, increasing delays would
> allow multiple call_rcu_tasks_trace() instances to be served by a single
> tasklist scan.
>
> > Given that, I do not think bpf_sk_storage_free() can/should use
> > call_rcu_tasks_trace(),
> > we probably will have to fix this soon (or revert from our kernels)
>
> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> bpf_sk_storage_free():
>
> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
>
> This commit was authored by KP Singh, who I am adding on CC. Or I would
> have, except that you beat me to it. Good show!!! ;-)
>
> If this commit provoked this issue, then the above patch might help.
>
> I am also adding Neeraj Uphadhyay on CC for his thoughts, as he has
> also been throught this code.
>
> My response time may be a bit slow next week, but I will be checking
> email.
>
> Thanx, Paul

2022-04-05 01:23:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Fri, Apr 01, 2022 at 06:24:13AM -0700, Eric Dumazet wrote:
> On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
> > > On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> > > > > On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> > > > > >
> > > > > > The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> > > > > > IDs to shift down to zero. The grace-period kthread is allowed to run
> > > > > > where it likes. The callback lists are protected by locking, even in
> > > > > > the case of local access, so this should be safe.
> > > > > >
> > > > > > Or am I missing your point?
> > > > > >
> > > > >
> > > > > In fact I have been looking at this code, because we bisected a
> > > > > regression back to this patch:
> > > > >
> > > > > 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> > > > > per-grace-period sleep for RCU Tasks Trace
> > > > >
> > > > > It is very possible the regression comes because the RCU task thread
> > > > > is using more cpu cycles, from 'CPU 0' where our system daemons are
> > > > > pinned.
> > > >
> > > > Heh! I did express that concern when creating that patch, but was
> > > > assured that the latency was much more important.
> > > >
> > > > Yes, that patch most definitely increases CPU utilization during RCU Tasks
> > > > Trace grace periods. If you can tolerate longer grace-period latencies,
> > > > it might be worth toning it down a bit. The ask was for about twice
> > > > the latency I achieved in my initial attempt, and I made the mistake of
> > > > forwarding that attempt out for testing. They liked the shorter latency
> > > > very much, and objected strenuously to the thought that I might detune
> > > > it back to the latency that they originally asked for. ;-)
> > > >
> > > > But I can easily provide the means to detune it through use of a kernel
> > > > boot parameter or some such, if that would help.
> > > >
> > > > > But I could not spot where the RCU task kthread is forced to run on CPU 0.
> > > >
> > > > I never did intend this kthread be bound anywhere. RCU's policy is
> > > > that any binding of its kthreads is the responsibility of the sysadm,
> > > > be that carbon-based or otherwise.
> > > >
> > > > But this kthread is spawned early enough that only CPU 0 is online,
> > > > so maybe the question is not "what is binding it to CPU 0?" but rather
> > > > "why isn't something kicking it off of CPU 0?"
> > >
> > > I guess the answer to this question can be found in the following
> > > piece of code :)
> > >
> > > rcu_read_lock();
> > > for_each_process_thread(g, t)
> > > rtp->pertask_func(t, &holdouts);
> > > rcu_read_unlock();
> > >
> > >
> > > With ~150,000 threads on a 256 cpu host, this holds current cpu for
> > > very long times:
> > >
> > > rcu_tasks_trace 11 [017] 5010.544762:
> > > probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
> > > rcu_tasks_trace 11 [017] 5010.600396:
> > > probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
> >
> > So about 55 milliseconds for the tasklist scan, correct? Or am I
> > losing the plot here?
> >
> > > rcu_tasks_trace 11 [022] 5010.618783:
> > > probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
> > > rcu_tasks_trace 11 [022] 5010.618840:
> > > probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
> > >
> > > In this case, CPU 22 is the victim, not CPU 0 :)
> >
> > My faith in the scheduler is restored! ;-)
> >
> > My position has been that this tasklist scan does not need to be broken
> > up because it should happen only when a sleepable BPF program is removed,
> > which is a rare event.
>
> Hmm... what about bpf_sk_storage_free() ?
>
> Definitely not a rare event.

Hmmm... Are the BPF guys using call_rcu_tasks_trace() to free things that
are not trampolines for sleepable BPF programs? Kind of looks like it.

Maybe RCU Tasks Trace was too convenient to use? ;-)

> > In addition, breaking up this scan is not trivial, because as far as I
> > know there is no way to force a given task to stay in the list. I would
> > have to instead use something like rcu_lock_break(), and restart the
> > scan if either of the nailed-down pair of tasks was removed from the list.
> > In a system where tasks were coming and going very frequently, it might
> > be that such a broken-up scan would never complete.
> >
> > I can imagine tricks where the nailed-down tasks are kept on a list,
> > and the nailed-downness is moved to the next task when those tasks
> > are removed. I can also imagine a less-than-happy response to such
> > a proposal.
> >
> > So I am not currently thinking in terms of breaking up this scan.
> >
> > Or is there some trick that I am missing?
> >
> > In the meantime, a simple patch that reduces the frequency of the scan
> > by a factor of two. But this would not be the scan of the full tasklist,
> > but rather the frequency of the calls to check_all_holdout_tasks_trace().
> > And the total of these looks to be less than 20 milliseconds, if I am
> > correctly interpreting your trace. And most of that 20 milliseconds
> > is sleeping.
> >
> > Nevertheless, the patch is at the end of this email.
> >
> > Other than that, I could imagine batching removal of sleepable BPF
> > programs and using a single grace period for all of their trampolines.
> > But are there enough sleepable BPF programs ever installed to make this
> > a useful approach?
> >
> > Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
> >
> > Thanx, Paul
> >
> > > > > I attempted to backport to our kernel all related patches that were
> > > > > not yet backported,
> > > > > and we still see a regression in our tests.
> > > >
> > > > The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> > > > increased by the above commit, and I never have done anything to reduce
> > > > that CPU consumption. In part because you are the first to call my
> > > > attention to it.
> > > >
> > > > Oh, and one other issue that I very recently fixed, that has not
> > > > yet reached mainline, just in case it matters. If you are building a
> > > > CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> > > > CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> > > > production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> > > > (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessaary
> > > > for sleepable BPF programs regardless of kernel .config).
> > > >
> > > > > Please ignore the sha1 in this current patch series, this is only to
> > > > > show my current attempt to fix the regression in our tree.
> > > > >
> > > > > 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> > > > > 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> > > > > 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> > > > > queue selection
> > > > > ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> > > > > sequence number
> > > > > 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> > > > > 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> > > > > 8cafaadb6144 rcu: Add callbacks-invoked counters
> > > > > 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> > > > > f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> > > > > 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> > > > > 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> > > > > 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> > > > > 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> > > > > 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> > > > > cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> > > > > 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> > > > > 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> > > > > d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> > > > > 65784460a392 rcu-tasks: Use workqueues for multiple
> > > > > rcu_tasks_invoke_cbs() invocations
> > > > > dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> > > > > callback queues
> > > > > 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> > > > > initial queueing
> > > > > a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> > > > > 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> > > > > call_rcu_tasks_generic()
> > > > > e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> > > > > 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> > > > > dequeueing
> > > > > 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> > > > > f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> > > > > bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> > > > > d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> > > > > 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
> >
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index 65d6e21a607a..141e2b4c70cc 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
> > rcu_tasks_trace.gp_sleep = HZ / 10;
> > rcu_tasks_trace.init_fract = HZ / 10;
> > } else {
> > - rcu_tasks_trace.gp_sleep = HZ / 200;
> > + rcu_tasks_trace.gp_sleep = HZ / 100;
> > if (rcu_tasks_trace.gp_sleep <= 0)
> > rcu_tasks_trace.gp_sleep = 1;
> > - rcu_tasks_trace.init_fract = HZ / 200;
> > + rcu_tasks_trace.init_fract = HZ / 100;
> > if (rcu_tasks_trace.init_fract <= 0)
> > rcu_tasks_trace.init_fract = 1;
> > }
>
> It seems that if the scan time is > 50ms in some common cases (at
> least at Google scale),
> the claim of having a latency of 10ms is not reasonable.

But does the above patch make things better? If it does, I will send
you a proper patch with kernel boot parameters. We can then discuss
better autotuning, for example, making the defaults a function of the
number of CPUs.
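
For example (purely illustrative numbers, not a proposal):

	/* Scale the per-grace-period sleep with the CPU count, since the
	 * tasklist scan gets more expensive on larger systems; the
	 * constants here are arbitrary. */
	rcu_tasks_trace.gp_sleep = HZ / 200 + nr_cpu_ids / 100;
	if (rcu_tasks_trace.gp_sleep <= 0)
		rcu_tasks_trace.gp_sleep = 1;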

Either way, that certainly is a fair point. Another fair point is that
the offending commit was in response to a bug report from your colleagues. ;-)

Except that I don't see any uses of synchronize_rcu_tasks_trace(), so
I am at a loss as to why latency matters anymore.

Is the issue the overall CPU consumption of the scan (which is my
current guess) or the length of time that the scan runs without invoking
cond_resched() or similar?

Either way, how frequently is call_rcu_tasks_trace() being invoked in
your setup? If it is being invoked frequently, increasing delays would
allow multiple call_rcu_tasks_trace() instances to be served by a single
tasklist scan.

> Given that, I do not think bpf_sk_storage_free() can/should use
> call_rcu_tasks_trace(),
> we probably will have to fix this soon (or revert from our kernels)

Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
bpf_sk_storage_free():

0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")

This commit was authored by KP Singh, who I am adding on CC. Or I would
have, except that you beat me to it. Good show!!! ;-)

If this commit provoked this issue, then the above patch might help.

I am also adding Neeraj Upadhyay on CC for his thoughts, as he has
also been through this code.

My response time may be a bit slow next week, but I will be checking
email.

Thanx, Paul

2022-04-05 01:37:35

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

Hi,

Trying to understand the issue.

On 4/1/2022 9:18 PM, Paul E. McKenney wrote:
> On Fri, Apr 01, 2022 at 08:28:14AM -0700, Paul E. McKenney wrote:
>> [ Adding Andrii and Alexei at Andrii's request. ]
>>
>> On Fri, Apr 01, 2022 at 08:20:37AM -0700, Paul E. McKenney wrote:
>>> On Fri, Apr 01, 2022 at 06:24:13AM -0700, Eric Dumazet wrote:
>>>> On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <[email protected]> wrote:
>>>>>
>>>>> On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
>>>>>> On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
>>>>>>>
>>>>>>> On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
>>>>>>>> On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> The initial setting of ->percpu_enqueue_shift forces all in-range CPU
>>>>>>>>> IDs to shift down to zero. The grace-period kthread is allowed to run
>>>>>>>>> where it likes. The callback lists are protected by locking, even in
>>>>>>>>> the case of local access, so this should be safe.
>>>>>>>>>
>>>>>>>>> Or am I missing your point?
>>>>>>>>>
>>>>>>>>
>>>>>>>> In fact I have been looking at this code, because we bisected a
>>>>>>>> regression back to this patch:
>>>>>>>>
>>>>>>>> 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
>>>>>>>> per-grace-period sleep for RCU Tasks Trace
>>>>>>>>
>>>>>>>> It is very possible the regression comes because the RCU task thread
>>>>>>>> is using more cpu cycles, from 'CPU 0' where our system daemons are
>>>>>>>> pinned.
>>>>>>>
>>>>>>> Heh! I did express that concern when creating that patch, but was
>>>>>>> assured that the latency was much more important.
>>>>>>>
>>>>>>> Yes, that patch most definitely increases CPU utilization during RCU Tasks
>>>>>>> Trace grace periods. If you can tolerate longer grace-period latencies,
>>>>>>> it might be worth toning it down a bit. The ask was for about twice
>>>>>>> the latency I achieved in my initial attempt, and I made the mistake of
>>>>>>> forwarding that attempt out for testing. They liked the shorter latency
>>>>>>> very much, and objected strenuously to the thought that I might detune
>>>>>>> it back to the latency that they originally asked for. ;-)
>>>>>>>
>>>>>>> But I can easily provide the means to detune it through use of a kernel
>>>>>>> boot parameter or some such, if that would help.
>>>>>>>
>>>>>>>> But I could not spot where the RCU task kthread is forced to run on CPU 0.
>>>>>>>
>>>>>>> I never did intend this kthread be bound anywhere. RCU's policy is
>>>>>>> that any binding of its kthreads is the responsibility of the sysadm,
>>>>>>> be that carbon-based or otherwise.
>>>>>>>
>>>>>>> But this kthread is spawned early enough that only CPU 0 is online,
>>>>>>> so maybe the question is not "what is binding it to CPU 0?" but rather
>>>>>>> "why isn't something kicking it off of CPU 0?"
>>>>>>
>>>>>> I guess the answer to this question can be found in the following
>>>>>> piece of code :)
>>>>>>
>>>>>> rcu_read_lock();
>>>>>> for_each_process_thread(g, t)
>>>>>> rtp->pertask_func(t, &holdouts);
>>>>>> rcu_read_unlock();
>>>>>>
>>>>>>
>>>>>> With ~150,000 threads on a 256 cpu host, this holds current cpu for
>>>>>> very long times:
>>>>>>
>>>>>> rcu_tasks_trace 11 [017] 5010.544762:
>>>>>> probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
>>>>>> rcu_tasks_trace 11 [017] 5010.600396:
>>>>>> probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
>>>>>
>>>>> So about 55 milliseconds for the tasklist scan, correct? Or am I
>>>>> losing the plot here?
>>>>>
>>>>>> rcu_tasks_trace 11 [022] 5010.618783:
>>>>>> probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
>>>>>> rcu_tasks_trace 11 [022] 5010.618840:
>>>>>> probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
>>>>>>
>>>>>> In this case, CPU 22 is the victim, not CPU 0 :)
>>>>>
>>>>> My faith in the scheduler is restored! ;-)
>>>>>
>>>>> My position has been that this tasklist scan does not need to be broken
>>>>> up because it should happen only when a sleepable BPF program is removed,
>>>>> which is a rare event.
>>>>
>>>> Hmm... what about bpf_sk_storage_free() ?
>>>>
>>>> Definitely not a rare event.
>>>
>>> Hmmm... Are the BPF guys using call_rcu_tasks_trace() to free things that
>>> are not trampolines for sleepable BPF programs? Kind of looks like it.
>>>
>>> Maybe RCU Tasks Trace was too convenient to use? ;-)
>>>
>>>>> In addition, breaking up this scan is not trivial, because as far as I
>>>>> know there is no way to force a given task to stay in the list. I would
>>>>> have to instead use something like rcu_lock_break(), and restart the
>>>>> scan if either of the nailed-down pair of tasks was removed from the list.
>>>>> In a system where tasks were coming and going very frequently, it might
>>>>> be that such a broken-up scan would never complete.
>>>>>
>>>>> I can imagine tricks where the nailed-down tasks are kept on a list,
>>>>> and the nailed-downness is moved to the next task when those tasks
>>>>> are removed. I can also imagine a less-than-happy response to such
>>>>> a proposal.
>>>>>
>>>>> So I am not currently thinking in terms of breaking up this scan.
>>>>>
>>>>> Or is there some trick that I am missing?
>>>>>
>>>>> In the meantime, a simple patch that reduces the frequency of the scan
>>>>> by a factor of two. But this would not be the scan of the full tasklist,
>>>>> but rather the frequency of the calls to check_all_holdout_tasks_trace().
>>>>> And the total of these looks to be less than 20 milliseconds, if I am
>>>>> correctly interpreting your trace. And most of that 20 milliseconds
>>>>> is sleeping.
>>>>>
>>>>> Nevertheless, the patch is at the end of this email.
>>>>>
>>>>> Other than that, I could imagine batching removal of sleepable BPF
>>>>> programs and using a single grace period for all of their trampolines.
>>>>> But are there enough sleepable BPF programs ever installed to make this
>>>>> a useful approach?
>>>>>
>>>>> Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
>>>>>
>>>>> Thanx, Paul
>>>>>
>>>>>>>> I attempted to backport to our kernel all related patches that were
>>>>>>>> not yet backported,
>>>>>>>> and we still see a regression in our tests.
>>>>>>>
>>>>>>> The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
>>>>>>> increased by the above commit, and I never have done anything to reduce
>>>>>>> that CPU consumption. In part because you are the first to call my
>>>>>>> attention to it.
>>>>>>>
>>>>>>> Oh, and one other issue that I very recently fixed, that has not
>>>>>>> yet reached mainline, just in case it matters. If you are building a
>>>>>>> CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
>>>>>>> CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
>>>>>>> production!), then your kernel will use RCU Tasks instead of vanilla RCU.
>>>>>>> (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessaary
>>>>>>> for sleepable BPF programs regardless of kernel .config).
>>>>>>>
>>>>>>>> Please ignore the sha1 in this current patch series, this is only to
>>>>>>>> show my current attempt to fix the regression in our tree.
>>>>>>>>
>>>>>>>> 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
>>>>>>>> 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
>>>>>>>> 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
>>>>>>>> queue selection
>>>>>>>> ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
>>>>>>>> sequence number
>>>>>>>> 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
>>>>>>>> 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
>>>>>>>> 8cafaadb6144 rcu: Add callbacks-invoked counters
>>>>>>>> 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
>>>>>>>> f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
>>>>>>>> 4408105116de rcu/segcblist: Add counters to segcblist datastructure
>>>>>>>> 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
>>>>>>>> 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
>>>>>>>> 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
>>>>>>>> 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
>>>>>>>> cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
>>>>>>>> 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
>>>>>>>> 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
>>>>>>>> d3e8be598546 rcu-tasks: Abstract invocations of callbacks
>>>>>>>> 65784460a392 rcu-tasks: Use workqueues for multiple
>>>>>>>> rcu_tasks_invoke_cbs() invocations
>>>>>>>> dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
>>>>>>>> callback queues
>>>>>>>> 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
>>>>>>>> initial queueing
>>>>>>>> a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
>>>>>>>> 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
>>>>>>>> call_rcu_tasks_generic()
>>>>>>>> e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
>>>>>>>> 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
>>>>>>>> dequeueing
>>>>>>>> 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
>>>>>>>> f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
>>>>>>>> bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
>>>>>>>> d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
>>>>>>>> 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
>>>>>
>>>>>
>>>>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>>>>> index 65d6e21a607a..141e2b4c70cc 100644
>>>>> --- a/kernel/rcu/tasks.h
>>>>> +++ b/kernel/rcu/tasks.h
>>>>> @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
>>>>> rcu_tasks_trace.gp_sleep = HZ / 10;
>>>>> rcu_tasks_trace.init_fract = HZ / 10;
>>>>> } else {
>>>>> - rcu_tasks_trace.gp_sleep = HZ / 200;
>>>>> + rcu_tasks_trace.gp_sleep = HZ / 100;
>>>>> if (rcu_tasks_trace.gp_sleep <= 0)
>>>>> rcu_tasks_trace.gp_sleep = 1;
>>>>> - rcu_tasks_trace.init_fract = HZ / 200;
>>>>> + rcu_tasks_trace.init_fract = HZ / 100;
>>>>> if (rcu_tasks_trace.init_fract <= 0)
>>>>> rcu_tasks_trace.init_fract = 1;
>>>>> }
>>>>
>>>> It seems that if the scan time is > 50ms in some common cases (at
>>>> least at Google scale),
>>>> the claim of having a latency of 10ms is not reasonable.
>>>
>>> But does the above patch make things better? If it does, I will send
>>> you a proper patch with kernel boot parameters. We can then discuss
>>> better autotuning, for example, making the defaults a function of the
>>> number of CPUs.
>>>
>>> Either way, that certainly is a fair point. Another fair point is that
>>> the offending commit was in response to a bug report from your colleagues. ;-)
>>>
>>> Except that I don't see any uses of synchronize_rcu_tasks_trace(), so
>>> I am at a loss as to why latency matters anymore.
>>>
>>> Is the issue the overall CPU consumption of the scan (which is my
>>> current guess) or the length of time that the scan runs without invoking
>>> cond_resched() or similar?
>>>

I agree on this part. Depending on the results of increasing the sleep
time for the trace kthread to 10 ms: if scanning all threads is holding
the CPU, we can try cond_resched() to isolate the issue. I checked the
other commits in this code path and don't see anything obvious
impacting this, but will check more.
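
To make that concrete, a rough sketch of the experiment (emphatically
not a patch: as Paul noted, nothing keeps g and t on the list across
the break, so a real version would need rcu_lock_break()-style handling
of those anchors):

	int batch = 0;

	rcu_read_lock();
	for_each_process_thread(g, t) {
		rtp->pertask_func(t, &holdouts);
		/* Arbitrary batch size; drop the RCU read lock now and then
		 * so the scan does not monopolize the CPU.  Ignores the
		 * task-list stability problem described earlier. */
		if (!(++batch % 1000)) {
			rcu_read_unlock();
			cond_resched();
			rcu_read_lock();
		}
	}
	rcu_read_unlock();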


Thanks
Neeraj

>>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
>>> your setup? If it is being invoked frequently, increasing delays would
>>> allow multiple call_rcu_tasks_trace() instances to be served by a single
>>> tasklist scan.
>>>
>>>> Given that, I do not think bpf_sk_storage_free() can/should use
>>>> call_rcu_tasks_trace(),
>>>> we probably will have to fix this soon (or revert from our kernels)
>>>
>>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
>>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
>>> bpf_sk_storage_free():
>>>
>>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
>>>
>>> This commit was authored by KP Singh, who I am adding on CC. Or I would
>>> have, except that you beat me to it. Good show!!! ;-)
>>>
>>> If this commit provoked this issue, then the above patch might help.
>
> Another question... Were there actually any sleepable BPF
> programs running on this system at that time? If not, then maybe
> bpf_selem_unlink_storage_nolock() should check for that condition, and
> invoke call_rcu_tasks_trace() only if there are actually sleepable BPF
> programs running (or in the process of being removed. If I understand
> correctly (ha!), if there were no sleepable BPF programs running, you
> would instead want call_rcu(). (Here I am assuming that non-sleepable
> BPF programs still run within a normal RCU read-side critical section.)
>
> Thanx, Paul
>
>>> I am also adding Neeraj Uphadhyay on CC for his thoughts, as he has
>>> also been throught this code.
>>>
>>> My response time may be a bit slow next week, but I will be checking
>>> email.
>>>
>>> Thanx, Paul

2022-04-05 03:36:51

by KP Singh

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Mon, Apr 4, 2022 at 7:45 AM Neeraj Upadhyay <[email protected]> wrote:
>
> Hi,
>
> Trying to understand the issue.
>
> On 4/1/2022 9:18 PM, Paul E. McKenney wrote:
> > On Fri, Apr 01, 2022 at 08:28:14AM -0700, Paul E. McKenney wrote:
> >> [ Adding Andrii and Alexei at Andrii's request. ]
> >>
> >> On Fri, Apr 01, 2022 at 08:20:37AM -0700, Paul E. McKenney wrote:
> >>> On Fri, Apr 01, 2022 at 06:24:13AM -0700, Eric Dumazet wrote:
> >>>> On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <[email protected]> wrote:
> >>>>>
> >>>>> On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
> >>>>>> On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
> >>>>>>>> On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> The initial setting of ->percpu_enqueue_shift forces all in-range CPU
> >>>>>>>>> IDs to shift down to zero. The grace-period kthread is allowed to run
> >>>>>>>>> where it likes. The callback lists are protected by locking, even in
> >>>>>>>>> the case of local access, so this should be safe.
> >>>>>>>>>
> >>>>>>>>> Or am I missing your point?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> In fact I have been looking at this code, because we bisected a
> >>>>>>>> regression back to this patch:
> >>>>>>>>
> >>>>>>>> 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
> >>>>>>>> per-grace-period sleep for RCU Tasks Trace
> >>>>>>>>
> >>>>>>>> It is very possible the regression comes because the RCU task thread
> >>>>>>>> is using more cpu cycles, from 'CPU 0' where our system daemons are
> >>>>>>>> pinned.
> >>>>>>>
> >>>>>>> Heh! I did express that concern when creating that patch, but was
> >>>>>>> assured that the latency was much more important.
> >>>>>>>
> >>>>>>> Yes, that patch most definitely increases CPU utilization during RCU Tasks
> >>>>>>> Trace grace periods. If you can tolerate longer grace-period latencies,
> >>>>>>> it might be worth toning it down a bit. The ask was for about twice
> >>>>>>> the latency I achieved in my initial attempt, and I made the mistake of
> >>>>>>> forwarding that attempt out for testing. They liked the shorter latency
> >>>>>>> very much, and objected strenuously to the thought that I might detune
> >>>>>>> it back to the latency that they originally asked for. ;-)
> >>>>>>>
> >>>>>>> But I can easily provide the means to detune it through use of a kernel
> >>>>>>> boot parameter or some such, if that would help.
> >>>>>>>
> >>>>>>>> But I could not spot where the RCU task kthread is forced to run on CPU 0.
> >>>>>>>
> >>>>>>> I never did intend this kthread be bound anywhere. RCU's policy is
> >>>>>>> that any binding of its kthreads is the responsibility of the sysadm,
> >>>>>>> be that carbon-based or otherwise.
> >>>>>>>
> >>>>>>> But this kthread is spawned early enough that only CPU 0 is online,
> >>>>>>> so maybe the question is not "what is binding it to CPU 0?" but rather
> >>>>>>> "why isn't something kicking it off of CPU 0?"
> >>>>>>
> >>>>>> I guess the answer to this question can be found in the following
> >>>>>> piece of code :)
> >>>>>>
> >>>>>> rcu_read_lock();
> >>>>>> for_each_process_thread(g, t)
> >>>>>> rtp->pertask_func(t, &holdouts);
> >>>>>> rcu_read_unlock();
> >>>>>>
> >>>>>>
> >>>>>> With ~150,000 threads on a 256 cpu host, this holds current cpu for
> >>>>>> very long times:
> >>>>>>
> >>>>>> rcu_tasks_trace 11 [017] 5010.544762:
> >>>>>> probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
> >>>>>> rcu_tasks_trace 11 [017] 5010.600396:
> >>>>>> probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
> >>>>>
> >>>>> So about 55 milliseconds for the tasklist scan, correct? Or am I
> >>>>> losing the plot here?
> >>>>>
> >>>>>> rcu_tasks_trace 11 [022] 5010.618783:
> >>>>>> probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
> >>>>>> rcu_tasks_trace 11 [022] 5010.618840:
> >>>>>> probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
> >>>>>>
> >>>>>> In this case, CPU 22 is the victim, not CPU 0 :)
> >>>>>
> >>>>> My faith in the scheduler is restored! ;-)
> >>>>>
> >>>>> My position has been that this tasklist scan does not need to be broken
> >>>>> up because it should happen only when a sleepable BPF program is removed,
> >>>>> which is a rare event.
> >>>>
> >>>> Hmm... what about bpf_sk_storage_free() ?
> >>>>
> >>>> Definitely not a rare event.
> >>>
> >>> Hmmm... Are the BPF guys using call_rcu_tasks_trace() to free things that
> >>> are not trampolines for sleepable BPF programs? Kind of looks like it.
> >>>
> >>> Maybe RCU Tasks Trace was too convenient to use? ;-)
> >>>
> >>>>> In addition, breaking up this scan is not trivial, because as far as I
> >>>>> know there is no way to force a given task to stay in the list. I would
> >>>>> have to instead use something like rcu_lock_break(), and restart the
> >>>>> scan if either of the nailed-down pair of tasks was removed from the list.
> >>>>> In a system where tasks were coming and going very frequently, it might
> >>>>> be that such a broken-up scan would never complete.
> >>>>>
> >>>>> I can imagine tricks where the nailed-down tasks are kept on a list,
> >>>>> and the nailed-downness is moved to the next task when those tasks
> >>>>> are removed. I can also imagine a less-than-happy response to such
> >>>>> a proposal.
> >>>>>
> >>>>> So I am not currently thinking in terms of breaking up this scan.
> >>>>>
> >>>>> Or is there some trick that I am missing?
> >>>>>
> >>>>> In the meantime, a simple patch that reduces the frequency of the scan
> >>>>> by a factor of two. But this would not be the scan of the full tasklist,
> >>>>> but rather the frequency of the calls to check_all_holdout_tasks_trace().
> >>>>> And the total of these looks to be less than 20 milliseconds, if I am
> >>>>> correctly interpreting your trace. And most of that 20 milliseconds
> >>>>> is sleeping.
> >>>>>
> >>>>> Nevertheless, the patch is at the end of this email.
> >>>>>
> >>>>> Other than that, I could imagine batching removal of sleepable BPF
> >>>>> programs and using a single grace period for all of their trampolines.
> >>>>> But are there enough sleepable BPF programs ever installed to make this
> >>>>> a useful approach?
> >>>>>
> >>>>> Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
> >>>>>
> >>>>> Thanx, Paul
> >>>>>
> >>>>>>>> I attempted to backport to our kernel all related patches that were
> >>>>>>>> not yet backported,
> >>>>>>>> and we still see a regression in our tests.
> >>>>>>>
> >>>>>>> The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
> >>>>>>> increased by the above commit, and I never have done anything to reduce
> >>>>>>> that CPU consumption. In part because you are the first to call my
> >>>>>>> attention to it.
> >>>>>>>
> >>>>>>> Oh, and one other issue that I very recently fixed, that has not
> >>>>>>> yet reached mainline, just in case it matters. If you are building a
> >>>>>>> CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
> >>>>>>> CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
> >>>>>>> production!), then your kernel will use RCU Tasks instead of vanilla RCU.
> >>>>>>> (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessaary
> >>>>>>> for sleepable BPF programs regardless of kernel .config).
> >>>>>>>
> >>>>>>>> Please ignore the sha1 in this current patch series, this is only to
> >>>>>>>> show my current attempt to fix the regression in our tree.
> >>>>>>>>
> >>>>>>>> 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
> >>>>>>>> 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
> >>>>>>>> 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic
> >>>>>>>> queue selection
> >>>>>>>> ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period
> >>>>>>>> sequence number
> >>>>>>>> 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
> >>>>>>>> 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
> >>>>>>>> 8cafaadb6144 rcu: Add callbacks-invoked counters
> >>>>>>>> 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
> >>>>>>>> f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
> >>>>>>>> 4408105116de rcu/segcblist: Add counters to segcblist datastructure
> >>>>>>>> 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
> >>>>>>>> 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
> >>>>>>>> 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
> >>>>>>>> 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
> >>>>>>>> cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
> >>>>>>>> 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
> >>>>>>>> 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
> >>>>>>>> d3e8be598546 rcu-tasks: Abstract invocations of callbacks
> >>>>>>>> 65784460a392 rcu-tasks: Use workqueues for multiple
> >>>>>>>> rcu_tasks_invoke_cbs() invocations
> >>>>>>>> dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple
> >>>>>>>> callback queues
> >>>>>>>> 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set
> >>>>>>>> initial queueing
> >>>>>>>> a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
> >>>>>>>> 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from
> >>>>>>>> call_rcu_tasks_generic()
> >>>>>>>> e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
> >>>>>>>> 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback
> >>>>>>>> dequeueing
> >>>>>>>> 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
> >>>>>>>> f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
> >>>>>>>> bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
> >>>>>>>> d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
> >>>>>>>> 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
> >>>>>
> >>>>>
> >>>>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> >>>>> index 65d6e21a607a..141e2b4c70cc 100644
> >>>>> --- a/kernel/rcu/tasks.h
> >>>>> +++ b/kernel/rcu/tasks.h
> >>>>> @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
> >>>>> rcu_tasks_trace.gp_sleep = HZ / 10;
> >>>>> rcu_tasks_trace.init_fract = HZ / 10;
> >>>>> } else {
> >>>>> - rcu_tasks_trace.gp_sleep = HZ / 200;
> >>>>> + rcu_tasks_trace.gp_sleep = HZ / 100;
> >>>>> if (rcu_tasks_trace.gp_sleep <= 0)
> >>>>> rcu_tasks_trace.gp_sleep = 1;
> >>>>> - rcu_tasks_trace.init_fract = HZ / 200;
> >>>>> + rcu_tasks_trace.init_fract = HZ / 100;
> >>>>> if (rcu_tasks_trace.init_fract <= 0)
> >>>>> rcu_tasks_trace.init_fract = 1;
> >>>>> }
> >>>>
> >>>> It seems that if the scan time is > 50ms in some common cases (at
> >>>> least at Google scale),
> >>>> the claim of having a latency of 10ms is not reasonable.
> >>>
> >>> But does the above patch make things better? If it does, I will send
> >>> you a proper patch with kernel boot parameters. We can then discuss
> >>> better autotuning, for example, making the defaults a function of the
> >>> number of CPUs.
> >>>
> >>> Either way, that certainly is a fair point. Another fair point is that
> >>> the offending commit was in response to a bug report from your colleagues. ;-)
> >>>
> >>> Except that I don't see any uses of synchronize_rcu_tasks_trace(), so
> >>> I am at a loss as to why latency matters anymore.
> >>>
> >>> Is the issue the overall CPU consumption of the scan (which is my
> >>> current guess) or the length of time that the scan runs without invoking
> >>> cond_resched() or similar?
> >>>
>
> I agree on this part; depending on the results of increasing the sleep
> time for trace kthread to 10 ms; if scanning all threads is holding the
> CPU, we can try cond_resched(), to isolate the issue. I checked other
> commits in this code path. Don't see anything obvious impacting this.
> However, will check more on this.
>
>
> Thanks
> Neeraj
>
> >>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
> >>> your setup? If it is being invoked frequently, increasing delays would
> >>> allow multiple call_rcu_tasks_trace() instances to be served by a single
> >>> tasklist scan.
> >>>
> >>>> Given that, I do not think bpf_sk_storage_free() can/should use
> >>>> call_rcu_tasks_trace(),
> >>>> we probably will have to fix this soon (or revert from our kernels)
> >>>
> >>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> >>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> >>> bpf_sk_storage_free():
> >>>
> >>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
> >>>
> >>> This commit was authored by KP Singh, who I am adding on CC. Or I would
> >>> have, except that you beat me to it. Good show!!! ;-)

Hello :)

Martin, if this ends up being an issue, we might have to go with the
initially proposed approach of marking local storage maps explicitly
as sleepable, so that not all maps are forced to be synchronized via
trace RCU.

We can make the verifier reject loading programs that try to use
non-sleepable local storage maps in sleepable programs.
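
The check would be something along these lines (sketch only;
BPF_F_SLEEPABLE_STORAGE and bpf_map_is_local_storage() are placeholder
names here, not existing symbols):

	/* In the verifier's program/map compatibility check: reject a
	 * sleepable program that uses a local storage map which was not
	 * explicitly created as sleepable. */
	if (prog->aux->sleepable && bpf_map_is_local_storage(map) &&
	    !(map->map_flags & BPF_F_SLEEPABLE_STORAGE)) {
		verbose(env, "sleepable program cannot use non-sleepable local storage map\n");
		return -EINVAL;
	}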

Do you think this is a feasible approach we can take or do you have
other suggestions?

> >>>
> >>> If this commit provoked this issue, then the above patch might help.
> >
> > Another question... Were there actually any sleepable BPF
> > programs running on this system at that time? If not, then maybe
> > bpf_selem_unlink_storage_nolock() should check for that condition, and
> > invoke call_rcu_tasks_trace() only if there are actually sleepable BPF
> > programs running (or in the process of being removed. If I understand
> > correctly (ha!), if there were no sleepable BPF programs running, you
> > would instead want call_rcu(). (Here I am assuming that non-sleepable
> > BPF programs still run within a normal RCU read-side critical section.)
> >
> > Thanx, Paul
> >
> >>> I am also adding Neeraj Uphadhyay on CC for his thoughts, as he has
> >>> also been throught this code.
> >>>
> >>> My response time may be a bit slow next week, but I will be checking
> >>> email.
> >>>
> >>> Thanx, Paul

2022-04-06 08:50:25

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Tue, Apr 05, 2022 at 02:04:34AM +0200, KP Singh wrote:
> > >>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
> > >>> your setup? If it is being invoked frequently, increasing delays would
> > >>> allow multiple call_rcu_tasks_trace() instances to be served by a single
> > >>> tasklist scan.
> > >>>
> > >>>> Given that, I do not think bpf_sk_storage_free() can/should use
> > >>>> call_rcu_tasks_trace(),
> > >>>> we probably will have to fix this soon (or revert from our kernels)
> > >>>
> > >>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> > >>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> > >>> bpf_sk_storage_free():
> > >>>
> > >>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
> > >>>
> > >>> This commit was authored by KP Singh, who I am adding on CC. Or I would
> > >>> have, except that you beat me to it. Good show!!! ;-)
>
> Hello :)
>
> Martin, if this ends up being an issue we might have to go with the
> initial proposed approach
> of marking local storage maps explicitly as sleepable so that not all
> maps are forced to be
> synchronized via trace RCU.
>
> We can make the verifier reject loading programs that try to use
> non-sleepable local storage
> maps in sleepable programs.
>
> Do you think this is a feasible approach we can take or do you have
> other suggestions?
bpf_sk_storage_free() does not need to use call_rcu_tasks_trace().
The same should go for the bpf_{task,inode}_storage_free().
The sk at this point is being destroyed. No bpf prog (sleepable or not)
can have a hold on this sk. The only storage reader left is from
bpf_local_storage_map_free() which is under rcu_read_lock(),
so a 'kfree_rcu(selem, rcu)' is enough.
A few lines below in bpf_sk_storage_free(), 'kfree_rcu(sk_storage, rcu)'
is currently used instead of call_rcu_tasks_trace() for the same reason.

KP, if the above makes sense, can you make a patch for it?
The bpf_local_storage_map_free() code path also does not need
call_rcu_tasks_trace(), so may as well change it together.
The bpf_*_storage_delete() helper and the map_{delete,update}_elem()
syscall still require the call_rcu_tasks_trace().
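
In other words, the call path, not the map, decides which flavor is
needed. A minimal sketch of that rule (the helper name is invented; the
patch later in the thread implements it by threading a use_trace_rcu
argument through the existing functions instead):

/* Illustrative helper: object-destruction paths such as
 * bpf_{sk,task,inode}_storage_free() have no remaining readers that can
 * sleep, so plain RCU reclamation is enough, while the delete()/update()
 * paths must still wait for sleepable readers via trace RCU.
 */
static void selem_reclaim(struct bpf_local_storage_elem *selem,
			  bool sleepable_readers_possible)
{
	if (sleepable_readers_possible)
		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
	else
		kfree_rcu(selem, rcu);
}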

2022-04-06 17:38:31

by KP Singh

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Tue, Apr 5, 2022 at 10:38 PM Martin KaFai Lau <[email protected]> wrote:
>
> On Tue, Apr 05, 2022 at 02:04:34AM +0200, KP Singh wrote:
> > > >>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
> > > >>> your setup? If it is being invoked frequently, increasing delays would
> > > >>> allow multiple call_rcu_tasks_trace() instances to be served by a single
> > > >>> tasklist scan.
> > > >>>
> > > >>>> Given that, I do not think bpf_sk_storage_free() can/should use
> > > >>>> call_rcu_tasks_trace(),
> > > >>>> we probably will have to fix this soon (or revert from our kernels)
> > > >>>
> > > >>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> > > >>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> > > >>> bpf_sk_storage_free():
> > > >>>
> > > >>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
> > > >>>
> > > >>> This commit was authored by KP Singh, who I am adding on CC. Or I would
> > > >>> have, except that you beat me to it. Good show!!! ;-)
> >
> > Hello :)
> >
> > Martin, if this ends up being an issue we might have to go with the
> > initial proposed approach
> > of marking local storage maps explicitly as sleepable so that not all
> > maps are forced to be
> > synchronized via trace RCU.
> >
> > We can make the verifier reject loading programs that try to use
> > non-sleepable local storage
> > maps in sleepable programs.
> >
> > Do you think this is a feasible approach we can take or do you have
> > other suggestions?
> bpf_sk_storage_free() does not need to use call_rcu_tasks_trace().
> The same should go for the bpf_{task,inode}_storage_free().
> The sk at this point is being destroyed. No bpf prog (sleepable or not)
> can have a hold on this sk. The only storage reader left is from
> bpf_local_storage_map_free() which is under rcu_read_lock(),
> so a 'kfree_rcu(selem, rcu)' is enough.
> A few lines below in bpf_sk_storage_free(), 'kfree_rcu(sk_storage, rcu)'
> is currently used instead of call_rcu_tasks_trace() for the same reason.
>
> KP, if the above makes sense, can you make a patch for it?
> The bpf_local_storage_map_free() code path also does not need
> call_rcu_tasks_trace(), so may as well change it together.
> The bpf_*_storage_delete() helper and the map_{delete,update}_elem()
> syscall still require the call_rcu_tasks_trace().

Thanks, I will send a patch.

2022-04-12 19:49:42

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Tue, Apr 12, 2022 at 08:28:07AM +0200, KP Singh wrote:
> On Wed, Apr 6, 2022 at 5:44 PM KP Singh <[email protected]> wrote:
> >
> > On Tue, Apr 5, 2022 at 10:38 PM Martin KaFai Lau <[email protected]> wrote:
> > >
> > > On Tue, Apr 05, 2022 at 02:04:34AM +0200, KP Singh wrote:
> > > > > >>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
> > > > > >>> your setup? If it is being invoked frequently, increasing delays would
> > > > > >>> allow multiple call_rcu_tasks_trace() instances to be served by a single
> > > > > >>> tasklist scan.
> > > > > >>>
> > > > > >>>> Given that, I do not think bpf_sk_storage_free() can/should use
> > > > > >>>> call_rcu_tasks_trace(),
> > > > > >>>> we probably will have to fix this soon (or revert from our kernels)
> > > > > >>>
> > > > > >>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> > > > > >>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> > > > > >>> bpf_sk_storage_free():
> > > > > >>>
> > > > > >>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
> > > > > >>>
> > > > > >>> This commit was authored by KP Singh, who I am adding on CC. Or I would
> > > > > >>> have, except that you beat me to it. Good show!!! ;-)
> > > >
> > > > Hello :)
> > > >
> > > > Martin, if this ends up being an issue we might have to go with the
> > > > initial proposed approach
> > > > of marking local storage maps explicitly as sleepable so that not all
> > > > maps are forced to be
> > > > synchronized via trace RCU.
> > > >
> > > > We can make the verifier reject loading programs that try to use
> > > > non-sleepable local storage
> > > > maps in sleepable programs.
> > > >
> > > > Do you think this is a feasible approach we can take or do you have
> > > > other suggestions?
> > > bpf_sk_storage_free() does not need to use call_rcu_tasks_trace().
> > > The same should go for the bpf_{task,inode}_storage_free().
> > > The sk at this point is being destroyed. No bpf prog (sleepable or not)
> > > can have a hold on this sk. The only storage reader left is from
> > > bpf_local_storage_map_free() which is under rcu_read_lock(),
> > > so a 'kfree_rcu(selem, rcu)' is enough.
> > > A few lines below in bpf_sk_storage_free(), 'kfree_rcu(sk_storage, rcu)'
> > > is currently used instead of call_rcu_tasks_trace() for the same reason.
> > >
> > > KP, if the above makes sense, can you make a patch for it?
> > > The bpf_local_storage_map_free() code path also does not need
> > > call_rcu_tasks_trace(), so may as well change it together.
> > > The bpf_*_storage_delete() helper and the map_{delete,update}_elem()
> > > syscall still require the call_rcu_tasks_trace().
> >
> > Thanks, I will send a patch.
>
> Martin, does this work? (I can send it as a patch later today)
LGTM. Thanks.

2022-04-12 21:01:00

by KP Singh

[permalink] [raw]
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks

On Wed, Apr 6, 2022 at 5:44 PM KP Singh <[email protected]> wrote:
>
> On Tue, Apr 5, 2022 at 10:38 PM Martin KaFai Lau <[email protected]> wrote:
> >
> > On Tue, Apr 05, 2022 at 02:04:34AM +0200, KP Singh wrote:
> > > > >>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
> > > > >>> your setup? If it is being invoked frequently, increasing delays would
> > > > >>> allow multiple call_rcu_tasks_trace() instances to be served by a single
> > > > >>> tasklist scan.
> > > > >>>
> > > > >>>> Given that, I do not think bpf_sk_storage_free() can/should use
> > > > >>>> call_rcu_tasks_trace(),
> > > > >>>> we probably will have to fix this soon (or revert from our kernels)
> > > > >>>
> > > > >>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
> > > > >>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
> > > > >>> bpf_sk_storage_free():
> > > > >>>
> > > > >>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
> > > > >>>
> > > > >>> This commit was authored by KP Singh, who I am adding on CC. Or I would
> > > > >>> have, except that you beat me to it. Good show!!! ;-)
> > >
> > > Hello :)
> > >
> > > Martin, if this ends up being an issue we might have to go with the
> > > initial proposed approach
> > > of marking local storage maps explicitly as sleepable so that not all
> > > maps are forced to be
> > > synchronized via trace RCU.
> > >
> > > We can make the verifier reject loading programs that try to use
> > > non-sleepable local storage
> > > maps in sleepable programs.
> > >
> > > Do you think this is a feasible approach we can take or do you have
> > > other suggestions?
> > bpf_sk_storage_free() does not need to use call_rcu_tasks_trace().
> > The same should go for the bpf_{task,inode}_storage_free().
> > The sk at this point is being destroyed. No bpf prog (sleepable or not)
> > can have a hold on this sk. The only storage reader left is from
> > bpf_local_storage_map_free() which is under rcu_read_lock(),
> > so a 'kfree_rcu(selem, rcu)' is enough.
> > A few lines below in bpf_sk_storage_free(), 'kfree_rcu(sk_storage, rcu)'
> > is currently used instead of call_rcu_tasks_trace() for the same reason.
> >
> > KP, if the above makes sense, can you make a patch for it?
> > The bpf_local_storage_map_free() code path also does not need
> > call_rcu_tasks_trace(), so may as well change it together.
> > The bpf_*_storage_delete() helper and the map_{delete,update}_elem()
> > syscall still require the call_rcu_tasks_trace().
>
> Thanks, I will send a patch.

Martin, does this work? (I can send it as a patch later today)

diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 493e63258497..7ea18d4da84b 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -143,9 +143,9 @@ void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,

bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
struct bpf_local_storage_elem *selem,
- bool uncharge_omem);
+ bool uncharge_omem, bool use_trace_rcu);

-void bpf_selem_unlink(struct bpf_local_storage_elem *selem);
+void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu);

void bpf_selem_link_map(struct bpf_local_storage_map *smap,
struct bpf_local_storage_elem *selem);
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index 96be8d518885..10424a1cda51 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -90,7 +90,7 @@ void bpf_inode_storage_free(struct inode *inode)
*/
bpf_selem_unlink_map(selem);
free_inode_storage = bpf_selem_unlink_storage_nolock(
- local_storage, selem, false);
+ local_storage, selem, false, false);
}
raw_spin_unlock_bh(&local_storage->lock);
rcu_read_unlock();
@@ -149,7 +149,7 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
if (!sdata)
return -ENOENT;

- bpf_selem_unlink(SELEM(sdata));
+ bpf_selem_unlink(SELEM(sdata), true);

return 0;
}
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 01aa2b51ec4d..fbe35554bab7 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -106,7 +106,7 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
*/
bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
struct bpf_local_storage_elem *selem,
- bool uncharge_mem)
+ bool uncharge_mem, bool use_trace_rcu)
{
struct bpf_local_storage_map *smap;
bool free_local_storage;
@@ -150,11 +150,16 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
SDATA(selem))
RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);

- call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
+ if (use_trace_rcu)
+ call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
+ else
+ kfree_rcu(selem, rcu);
+
return free_local_storage;
}

-static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem)
+static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem,
+ bool use_trace_rcu)
{
struct bpf_local_storage *local_storage;
bool free_local_storage = false;
@@ -169,12 +174,16 @@ static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem)
raw_spin_lock_irqsave(&local_storage->lock, flags);
if (likely(selem_linked_to_storage(selem)))
free_local_storage = bpf_selem_unlink_storage_nolock(
- local_storage, selem, true);
+ local_storage, selem, true, use_trace_rcu);
raw_spin_unlock_irqrestore(&local_storage->lock, flags);

- if (free_local_storage)
- call_rcu_tasks_trace(&local_storage->rcu,
+ if (free_local_storage) {
+ if (use_trace_rcu)
+ call_rcu_tasks_trace(&local_storage->rcu,
bpf_local_storage_free_rcu);
+ else
+ kfree_rcu(local_storage, rcu);
+ }
}

void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
@@ -214,14 +223,14 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
raw_spin_unlock_irqrestore(&b->lock, flags);
}

-void bpf_selem_unlink(struct bpf_local_storage_elem *selem)
+void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu)
{
/* Always unlink from map before unlinking from local_storage
* because selem will be freed after successfully unlinked from
* the local_storage.
*/
bpf_selem_unlink_map(selem);
- __bpf_selem_unlink_storage(selem);
+ __bpf_selem_unlink_storage(selem, use_trace_rcu);
}

struct bpf_local_storage_data *
@@ -548,7 +557,7 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
migrate_disable();
__this_cpu_inc(*busy_counter);
}
- bpf_selem_unlink(selem);
+ bpf_selem_unlink(selem, false);
if (busy_counter) {
__this_cpu_dec(*busy_counter);
migrate_enable();
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 6638a0ecc3d2..57904263a710 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -102,7 +102,7 @@ void bpf_task_storage_free(struct task_struct *task)
*/
bpf_selem_unlink_map(selem);
free_task_storage = bpf_selem_unlink_storage_nolock(
- local_storage, selem, false);
+ local_storage, selem, false, false);
}
raw_spin_unlock_irqrestore(&local_storage->lock, flags);
bpf_task_storage_unlock();
@@ -192,7 +192,7 @@ static int task_storage_delete(struct task_struct *task, struct bpf_map *map)
if (!sdata)
return -ENOENT;

- bpf_selem_unlink(SELEM(sdata));
+ bpf_selem_unlink(SELEM(sdata), true);

return 0;
}
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index e3ac36380520..83d7641ef67b 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -40,7 +40,7 @@ static int bpf_sk_storage_del(struct sock *sk, struct bpf_map *map)
if (!sdata)
return -ENOENT;

- bpf_selem_unlink(SELEM(sdata));
+ bpf_selem_unlink(SELEM(sdata), true);

return 0;
}
@@ -75,8 +75,8 @@ void bpf_sk_storage_free(struct sock *sk)
* sk_storage.
*/
bpf_selem_unlink_map(selem);
- free_sk_storage = bpf_selem_unlink_storage_nolock(sk_storage,
- selem, true);
+ free_sk_storage = bpf_selem_unlink_storage_nolock(
+ sk_storage, selem, true, false);
}
raw_spin_unlock_bh(&sk_storage->lock);
rcu_read_unlock();