On Thu, Mar 31, 2022 at 3:42 PM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> > Hi Paul
> >
> > It seems you assume per-cpu ptrs for arbitrary indexes (< nr_cpu_ids) are valid.
>
> Gah! I knew I was forgetting something...
>
> But just to check, is this a theoretical problem or something you hit
> on real hardware? (For the rest of this email, I am assuming the latter.)
Code review really...
>
> > What do you think of the (untested) following patch ?
>
> One issue with this patch is that the contention could be unpredictable,
> or worse, vary among CPUs, especially if the cpu_possible_mask was oddly
> distributed.
>
> So might it be better to restrict this to all on CPU 0 on the one hand
> and completely per-CPU on the other? (Or all on the boot CPU, in case
> I am forgetting some misbegotten architecture that can run without a
> CPU 0.)
If I understand correctly, cblist_init_generic() could set up
percpu_enqueue_shift to something smaller than order_base_2(nr_cpu_ids).

Meaning that we could reach a non-zero idx in
(smp_processor_id() >> percpu_enqueue_shift).

So even if CPU0 is always present (I am not sure this is guaranteed,
but this seems reasonable), we could still attempt a
per_cpu_ptr(PTR, not_present_cpu), and get garbage.
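
To make the arithmetic concrete, here is a tiny user-space model of that
index computation. The mask layout and the shift value are assumptions
picked for illustration (a sparse mask with only CPUs 0, 4, 8, ... present,
and a shift corresponding to four queue groups on a 64-CPU box); none of
the helper names below are kernel code.

#include <stdio.h>

/* Stand-in for a sparse cpu_possible_mask: only CPUs 0, 4, 8, ... exist. */
static int cpu_is_possible(int cpu)
{
	return (cpu % 4) == 0;
}

int main(void)
{
	int shift = 4;	/* roughly what rcu_task_enqueue_lim=4 means on 64 CPUs */

	for (int cpu = 0; cpu < 64; cpu += 4) {	/* the CPUs that actually exist */
		int idx = cpu >> shift;		/* index the enqueue path would use */

		if (!cpu_is_possible(idx))
			printf("CPU %2d -> idx %d, which is not a possible CPU\n",
			       cpu, idx);
	}
	return 0;
}

Every CPU from 16 upward maps to idx 1, 2, or 3, none of which is a
possible CPU, which is exactly the per_cpu_ptr() hazard above.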
> > Thanks.
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index 99cf3a13954cfb17828fbbeeb884f11614a526a9..df3785be4022e903d9682dd403464aa9927aa5c2 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -273,13 +273,17 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
> >  	bool needadjust = false;
> >  	bool needwake;
> >  	struct rcu_tasks_percpu *rtpcp;
> > +	int ideal_cpu, chosen_cpu;
> >
> >  	rhp->next = NULL;
> >  	rhp->func = func;
> >  	local_irq_save(flags);
> >  	rcu_read_lock();
> > -	rtpcp = per_cpu_ptr(rtp->rtpcpu,
> > -			    smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
> > +
> > +	ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
> > +	chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
> > +
> > +	rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
> >  	if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> >  		raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> >  		j = jiffies;
On Thu, Mar 31, 2022 at 03:54:02PM -0700, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 3:42 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> > > Hi Paul
> > >
> > > It seems you assume per-cpu ptrs for arbitrary indexes (< nr_cpu_ids) are valid.
> >
> > Gah! I knew I was forgetting something...
> >
> > But just to check, is this a theoretical problem or something you hit
> > on real hardware? (For the rest of this email, I am assuming the latter.)
>
> Code review really...
OK, not yet an emergency, then. ;-)
If it does become an emergency, one workaround is to set the
rcupdate.rcu_task_enqueue_lim kernel boot parameter to the number of CPUs.
> > > What do you think of the (untested) following patch ?
> >
> > One issue with this patch is that the contention could be unpredictable,
> > or worse, vary among CPUs, especially if the cpu_possible_mask was oddly
> > distributed.
> >
> > So might it be better to restrict this to all on CPU 0 on the one hand
> > and completely per-CPU on the other? (Or all on the boot CPU, in case
> > I am forgetting some misbegotten architecture that can run without a
> > CPU 0.)
>
> If I understand correctly, cblist_init_generic() could set up
> percpu_enqueue_shift to something smaller than order_base_2(nr_cpu_ids).
>
> Meaning that we could reach a non-zero idx in
> (smp_processor_id() >> percpu_enqueue_shift).
>
> So even if CPU0 is always present (I am not sure this is guaranteed,
> but this seems reasonable), we could still attempt a
> per_cpu_ptr(PTR, not_present_cpu), and get garbage.
My thought is to replace the shift with a boolean. One setting just
always uses CPU 0's cblist (or that of the boot CPU, if there is some
architecture that is completely innocent of CPU 0), and the other setting
always uses that of the currently running CPU. This does rule out the
current possibility of using (say) 16 cblists on a 64-CPU system, but
I suspect that this would be just fine.
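
For illustration only, here is roughly the selection I have in mind.
The percpu_enqueue_single field and the helper name are made up for this
sketch and are not in the tree:

/* Sketch only: a hypothetical boolean replacing percpu_enqueue_shift.
 * One setting funnels every callback through the first possible CPU's
 * cblist, the other is fully per-CPU, with nothing in between.
 */
static int rcu_tasks_choose_enqueue_cpu(struct rcu_tasks *rtp)
{
	if (READ_ONCE(rtp->percpu_enqueue_single))
		return cpumask_first(cpu_possible_mask);	/* CPU 0 on most systems */
	return smp_processor_id();	/* caller already has interrupts disabled */
}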
Neeraj's and my proposed memory-diet patches for SRCU do something like
this. These are on branch srcu.2022.03.29a on -rcu.
Thoughts?
Thanx, Paul
> > > Thanks.
> > >
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 99cf3a13954cfb17828fbbeeb884f11614a526a9..df3785be4022e903d9682dd403464aa9927aa5c2 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -273,13 +273,17 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
> > >  	bool needadjust = false;
> > >  	bool needwake;
> > >  	struct rcu_tasks_percpu *rtpcp;
> > > +	int ideal_cpu, chosen_cpu;
> > >
> > >  	rhp->next = NULL;
> > >  	rhp->func = func;
> > >  	local_irq_save(flags);
> > >  	rcu_read_lock();
> > > -	rtpcp = per_cpu_ptr(rtp->rtpcpu,
> > > -			    smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
> > > +
> > > +	ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
> > > +	chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
> > > +
> > > +	rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
> > >  	if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> > >  		raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> > >  		j = jiffies;
On Thu, Mar 31, 2022 at 03:54:02PM -0700, Eric Dumazet wrote:
> On Thu, Mar 31, 2022 at 3:42 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> > > Hi Paul
> > >
> > > It seems you assume per-cpu ptrs for arbitrary indexes (< nr_cpu_ids) are valid.
> >
> > Gah! I knew I was forgetting something...
> >
> > But just to check, is this a theoretical problem or something you hit
> > on real hardware? (For the rest of this email, I am assuming the latter.)
>
> Code review really...
>
> >
> > > What do you think of the (untested) following patch ?
> >
> > One issue with this patch is that the contention could be unpredictable,
> > or worse, vary among CPUs, especially if the cpu_possible_mask was oddly
> > distributed.
> >
> > So might it be better to restrict this to all on CPU 0 on the one hand
> > and completely per-CPU on the other? (Or all on the boot CPU, in case
> > I am forgetting some misbegotten architecture that can run without a
> > CPU 0.)
>
> If I understand correctly, cblist_init_generic() could set up
> percpu_enqueue_shift to something smaller than order_base_2(nr_cpu_ids).
>
> Meaning that we could reach a non-zero idx in
> (smp_processor_id() >> percpu_enqueue_shift).
>
> So even if CPU0 is always present (I am not sure this is guaranteed,
> but this seems reasonable), we could still attempt a
> per_cpu_ptr(PTR, not_present_cpu), and get garbage.
And the problem with my wish to provide load balancing is that a
sparse cpumask could be sparse any which way that it wants to be.
Another problem is that, unlike TREE SRCU, Tasks RCU doesn't have an
efficient way to find all the CPUs with callbacks queued. Yes, I could
add that information, but the benefit does not seem worth the complexity.
So I took your patch after all, but changed from cpu_online_mask to
cpu_possible_mask. Thank you for bearing with me on this one!
Are you OK with your Signed-off-by on this patch as shown below?
Thanx, Paul
------------------------------------------------------------------------
commit b77b2981bb22c4449a0a6e86eeb9fbab36a2beae
Author: Eric Dumazet <[email protected]>
Date: Mon Apr 4 12:30:18 2022 -0700

rcu-tasks: Handle sparse cpu_possible_mask

If the rcupdate.rcu_task_enqueue_lim kernel boot parameter is set to
something greater than 1 and less than nr_cpu_ids, the code attempts to
use a subset of the CPUs' RCU Tasks callback lists. This works, but only
if the cpu_possible_mask is contiguous. If there are "holes" in this
mask, the callback-enqueue code might attempt to access a non-existent
per-CPU ->rtpcpu variable for a non-existent CPU. For example, if only
CPUs 0, 4, 8, 12, 16 and so on are in cpu_possible_mask, specifying
rcupdate.rcu_task_enqueue_lim=4 would cause the code to attempt to
use callback queues for non-existent CPUs 1, 2, and 3. Because such
systems have existed in the past and might still exist, the code needs
to gracefully handle this situation.

This commit therefore checks to see whether the desired CPU is present
in cpu_possible_mask, and, if not, searches for the next CPU. This means
that the systems administrator of a system with a sparse cpu_possible_mask
will need to account for this sparsity when specifying the value of
the rcupdate.rcu_task_enqueue_lim kernel boot parameter. For example,
setting this parameter to the value 4 will use only CPUs 0 and 4, with
CPU 4 getting three times the callback load of CPU 0.

This commit assumes that bit (nr_cpu_ids - 1) is always set in
cpu_possible_mask.

Link: https://lore.kernel.org/lkml/CANn89iKaNEwyNZ=L_PQnkH0LP_XjLYrr_dpyRKNNoDJaWKdrmg@mail.gmail.com/
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 65d6e21a607a..44977c6a1cb8 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -273,7 +273,9 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
 static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
 				   struct rcu_tasks *rtp)
 {
+	int chosen_cpu;
 	unsigned long flags;
+	int ideal_cpu;
 	unsigned long j;
 	bool needadjust = false;
 	bool needwake;
@@ -283,8 +285,9 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
 	rhp->func = func;
 	local_irq_save(flags);
 	rcu_read_lock();
-	rtpcp = per_cpu_ptr(rtp->rtpcpu,
-			    smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
+	ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
+	chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_possible_mask);
+	rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
 	if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
 		raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
 		j = jiffies;
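
As a sanity check on the three-to-one load figure in the commit log
above, here is a small user-space model of the chosen_cpu computation.
The mask layout and helpers are stand-ins written for this example
(CPUs 0, 4, 8, ... possible, 64 CPUs, rcu_task_enqueue_lim=4), not
kernel code:

#include <stdio.h>

#define NR_CPUS 64

/* Stand-in for the commit-log example: only CPUs 0, 4, 8, ... are possible. */
static int cpu_is_possible(int cpu)
{
	return (cpu % 4) == 0;
}

/* Rough user-space stand-in for cpumask_next(n, cpu_possible_mask):
 * the first possible CPU strictly greater than n.
 */
static int next_possible(int n)
{
	for (int cpu = n + 1; cpu < NR_CPUS; cpu++)
		if (cpu_is_possible(cpu))
			return cpu;
	return NR_CPUS;
}

int main(void)
{
	int shift = 4;			/* models rcu_task_enqueue_lim=4 on 64 CPUs */
	int load[NR_CPUS] = { 0 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_is_possible(cpu))
			continue;	/* only possible CPUs ever enqueue */
		load[next_possible((cpu >> shift) - 1)]++;	/* ideal_cpu - 1 */
	}
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (load[cpu])
			printf("callback list of CPU %d serves %d CPUs\n", cpu, load[cpu]);
	return 0;
}

This prints a list on CPU 0 serving 4 CPUs and a list on CPU 4 serving
12 CPUs, which is the 3-to-1 ratio mentioned above.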
On Thu, Mar 31, 2022 at 3:54 PM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Mar 31, 2022 at 3:42 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Mar 31, 2022 at 02:45:25PM -0700, Eric Dumazet wrote:
> > > Hi Paul
> > >
> > > It seems you assume per-cpu ptrs for arbitrary indexes (< nr_cpu_ids) are valid.
> >
> > Gah! I knew I was forgetting something...
> >
> > But just to check, is this a theoretical problem or something you hit
> > on real hardware? (For the rest of this email, I am assuming the latter.)
>
> Code review really...
>
> >
> > > What do you think of the (untested) following patch ?
> >
> > One issue with this patch is that the contention could be unpredictable,
> > or worse, vary among CPUs, especially if the cpu_possible_mask was oddly
> > distributed.
> >
> > So might it be better to restrict this to all on CPU 0 on the one hand
> > and completely per-CPU on the other? (Or all on the boot CPU, in case
> > I am forgetting some misbegotten architecture that can run without a
> > CPU 0.)
>
> If I understand correctly, cblist_init_generic() could set up
> percpu_enqueue_shift to something smaller than order_base_2(nr_cpu_ids).
>
> Meaning that we could reach a non-zero idx in
> (smp_processor_id() >> percpu_enqueue_shift).
>
> So even if CPU0 is always present (I am not sure this is guaranteed,
> but this seems reasonable), we could still attempt a
> per_cpu_ptr(PTR, not_present_cpu), and get garbage.
>
Also, you mention CPU 0, but I do not see where CPU binding is
performed on the kthread?
>
>
> > > Thanks.
> > >
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index 99cf3a13954cfb17828fbbeeb884f11614a526a9..df3785be4022e903d9682dd403464aa9927aa5c2 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -273,13 +273,17 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
> > >  	bool needadjust = false;
> > >  	bool needwake;
> > >  	struct rcu_tasks_percpu *rtpcp;
> > > +	int ideal_cpu, chosen_cpu;
> > >
> > >  	rhp->next = NULL;
> > >  	rhp->func = func;
> > >  	local_irq_save(flags);
> > >  	rcu_read_lock();
> > > -	rtpcp = per_cpu_ptr(rtp->rtpcpu,
> > > -			    smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift));
> > > +
> > > +	ideal_cpu = smp_processor_id() >> READ_ONCE(rtp->percpu_enqueue_shift);
> > > +	chosen_cpu = cpumask_next(ideal_cpu - 1, cpu_online_mask);
> > > +
> > > +	rtpcp = per_cpu_ptr(rtp->rtpcpu, chosen_cpu);
> > >  	if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> > >  		raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> > >  		j = jiffies;