2023-10-11 07:21:33

by Ankit Jain

Subject: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
introduced logic to distribute tasks at their initial wakeup on cpus
where load balancing works poorly or is disabled entirely (isolated cpus).

There are cases in which tasks spawned onto isolcpus are not
distributed properly. In production deployments, the initial wakeup
of tasks spawned from housekeeping cpus onto isolcpus (nohz_full cpus)
happens on the first cpu within the isolcpus range instead of being
distributed across the isolcpus.

Usage of distribute_cpu_mask_prev by one group of processes
clobbers the previous value stored by other groups, and vice versa.

When housekeeping cpus spawn multiple child tasks to wake up on
isolcpus (nohz_full cpus) via cpusets.cpus/sched_setaffinity(),
distribution is currently performed based on the per-cpu
distribute_cpu_mask_prev counter.
At the same time, the housekeeping cpus run percpu-bound timer
interrupts, rcu threads and other system/user tasks whose affinity
is the housekeeping cpus. In a real-life environment the
housekeeping cpus are far fewer and heavily loaded, so the
distribute_cpu_mask_prev values written by those tasks skew the
offset used by tasks spawned to wake up on isolcpus, and most of
them end up waking up on the first cpu within the isolcpus set.

Steps to reproduce:
Kernel cmdline parameters:
isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1

* pid=$(echo $$)
* taskset -pc 0 $pid
* cat loop-normal.c
int main(void)
{
	while (1)
		;
	return 0;
}
* gcc -o loop-normal loop-normal.c
* for i in {1..50}; do ./loop-normal & done
* pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
* for i in $pids; do taskset -pc 2-5 $i ; done

Expected output:
* All 50 “loop-normal” tasks should wake up on cpus 2-5,
equally distributed.
* ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"

Actual output:
* All 50 “loop-normal” tasks got woken up on cpu2 only

Analysis:
Percpu-bound timer interrupt and rcu thread activity occurs every
few microseconds on the housekeeping cpus, exercising
find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute().
The per-cpu variable distribute_cpu_mask_prev on the housekeeping
cpus therefore keeps getting set to housekeeping cpus. Bash/docker
processes share that same per-cpu variable because they run on
housekeeping cpus. Thus, per the logic in the commits above, the
wrapped search that intersects the clobbered distribute_cpu_mask_prev
with the new mask (isolcpus) always returns the first cpu within
the new mask.

Fix the issue by using random cores out of the applicable CPU set
instead of relying on distribute_cpu_mask_prev.

Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")

Signed-off-by: Ankit Jain <[email protected]>
---
lib/cpumask.c | 40 +++++++++++++++++++++-------------------
1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index a7fd02b5ae26..95a7c1b40e95 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -155,45 +155,47 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
EXPORT_SYMBOL(cpumask_local_spread);

-static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
-
/**
* cpumask_any_and_distribute - Return an arbitrary cpu within src1p & src2p.
* @src1p: first &cpumask for intersection
* @src2p: second &cpumask for intersection
*
- * Iterated calls using the same srcp1 and srcp2 will be distributed within
- * their intersection.
+ * Iterated calls using the same srcp1 and srcp2 will be randomly distributed
+ * within their intersection.
*
* Returns >= nr_cpu_ids if the intersection is empty.
*/
unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
const struct cpumask *src2p)
{
- unsigned int next, prev;
+ unsigned int n_cpus, nth_cpu;

- /* NOTE: our first selection will skip 0. */
- prev = __this_cpu_read(distribute_cpu_mask_prev);
+ n_cpus = cpumask_weight_and(src1p, src2p);
+ if (n_cpus == 0)
+ return nr_cpu_ids;

- next = find_next_and_bit_wrap(cpumask_bits(src1p), cpumask_bits(src2p),
- nr_cpumask_bits, prev + 1);
- if (next < nr_cpu_ids)
- __this_cpu_write(distribute_cpu_mask_prev, next);
+ nth_cpu = get_random_u32_below(n_cpus);

- return next;
+ return find_nth_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+ nr_cpumask_bits, nth_cpu);
}
EXPORT_SYMBOL(cpumask_any_and_distribute);

+/**
+ * Returns an arbitrary cpu within srcp.
+ *
+ * Iterated calls using the same srcp will be randomly distributed
+ */
unsigned int cpumask_any_distribute(const struct cpumask *srcp)
{
- unsigned int next, prev;
+ unsigned int n_cpus, nth_cpu;

- /* NOTE: our first selection will skip 0. */
- prev = __this_cpu_read(distribute_cpu_mask_prev);
- next = find_next_bit_wrap(cpumask_bits(srcp), nr_cpumask_bits, prev + 1);
- if (next < nr_cpu_ids)
- __this_cpu_write(distribute_cpu_mask_prev, next);
+ n_cpus = cpumask_weight(srcp);
+ if (n_cpus == 0)
+ return nr_cpu_ids;

- return next;
+ nth_cpu = get_random_u32_below(n_cpus);
+
+ return find_nth_bit(cpumask_bits(srcp), nr_cpumask_bits, nth_cpu);
}
EXPORT_SYMBOL(cpumask_any_distribute);
--
2.23.1


2023-10-11 10:39:35

by Andy Shevchenko

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks at initial wakeup on cpus
> where load balancing works poorly or disabled at all (isolated cpus).
>
> There are cases in which the distribution of tasks
> that are spawned on isolcpus does not happen properly.
> In production deployment, initial wakeup of tasks spawn from
> housekeeping cpus to isolcpus[nohz_full cpu] happens on first cpu
> within isolcpus range instead of distributed across isolcpus.
>
> Usage of distribute_cpu_mask_prev from one processes group,
> will clobber previous value of another or other groups and vice-versa.
>
> When housekeeping cpus spawn multiple child tasks to wakeup on
> isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
> distribution is currently performed based on per-cpu
> distribute_cpu_mask_prev counter.
> At the same time, on housekeeping cpus there are percpu
> bounded timers interrupt/rcu threads and other system/user tasks
> would be running with affinity as housekeeping cpus. In a real-life
> environment, housekeeping cpus are much fewer and are too much loaded.
> So, distribute_cpu_mask_prev value from these tasks impacts
> the offset value for the tasks spawning to wakeup on isolcpus and
> thus most of the tasks end up waking up on first cpu within the
> isolcpus set.
>
> Steps to reproduce:
> Kernel cmdline parameters:
> isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
> rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
>
> * pid=$(echo $$)
> * taskset -pc 0 $pid
> * cat loop-normal.c
> int main(void)
> {
> while (1)
> ;
> return 0;
> }
> * gcc -o loop-normal loop-normal.c
> * for i in {1..50}; do ./loop-normal & done
> * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
> * for i in $pids; do taskset -pc 2-5 $i ; done
>
> Expected output:
> * All 50 “loop-normal” tasks should wake up on cpu2-5
> equally distributed.
> * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
>
> Actual output:
> * All 50 “loop-normal” tasks got woken up on cpu2 only
>
> Analysis:
> There are percpu bounded timer interrupt/rcu threads activities
> going on every few microseconds on housekeeping cpus, exercising
> find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute()
> So, per cpu variable distribute_cpu_mask_prev for housekeeping cpus
> keep on getting set to housekeeping cpus. Bash/docker processes
> are sharing same per cpu variable as they run on housekeeping cpus.
> Thus intersection of clobbered distribute_cpu_mask_prev and
> new mask(isolcpus) return always first cpu within the new mask(isolcpus)
> in accordance to the logic mentioned in commits above.
>
> Fix the issue by using random cores out of the applicable CPU set
> instead of relying on distribute_cpu_mask_prev.

> Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")

>

Blank lines are not allowed in the tag block.

> Signed-off-by: Ankit Jain <[email protected]>

...

> +/**
> + * Returns an arbitrary cpu within srcp.
> + *
> + * Iterated calls using the same srcp will be randomly distributed
> + */

This is invalid. Always run

scripts/kernel-doc -v -none -Wall ...

against the file of interest and fix all warnings and errors reported.

--
With Best Regards,
Andy Shevchenko


2023-10-11 10:54:00

by Peter Zijlstra

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks at initial wakeup on cpus
> where load balancing works poorly or disabled at all (isolated cpus).
>
> There are cases in which the distribution of tasks
> that are spawned on isolcpus does not happen properly.
> In production deployment, initial wakeup of tasks spawn from
> housekeeping cpus to isolcpus[nohz_full cpu] happens on first cpu
> within isolcpus range instead of distributed across isolcpus.
>
> Usage of distribute_cpu_mask_prev from one processes group,
> will clobber previous value of another or other groups and vice-versa.
>
> When housekeeping cpus spawn multiple child tasks to wakeup on
> isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
> distribution is currently performed based on per-cpu
> distribute_cpu_mask_prev counter.
> At the same time, on housekeeping cpus there are percpu
> bounded timers interrupt/rcu threads and other system/user tasks
> would be running with affinity as housekeeping cpus. In a real-life
> environment, housekeeping cpus are much fewer and are too much loaded.
> So, distribute_cpu_mask_prev value from these tasks impacts
> the offset value for the tasks spawning to wakeup on isolcpus and
> thus most of the tasks end up waking up on first cpu within the
> isolcpus set.
>
> Steps to reproduce:
> Kernel cmdline parameters:
> isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
> rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
>
> * pid=$(echo $$)
> * taskset -pc 0 $pid
> * cat loop-normal.c
> int main(void)
> {
> while (1)
> ;
> return 0;
> }
> * gcc -o loop-normal loop-normal.c
> * for i in {1..50}; do ./loop-normal & done
> * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
> * for i in $pids; do taskset -pc 2-5 $i ; done
>
> Expected output:
> * All 50 “loop-normal” tasks should wake up on cpu2-5
> equally distributed.
> * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
>
> Actual output:
> * All 50 “loop-normal” tasks got woken up on cpu2 only

Your expectation is wrong. Things work as advertised.

Also, isolcpus is crap.


2023-10-11 11:47:13

by Peter Zijlstra

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

On Wed, Oct 11, 2023 at 12:53:29PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
> > commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> > and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> > introduced the logic to distribute the tasks at initial wakeup on cpus
> > where load balancing works poorly or disabled at all (isolated cpus).
> >
> > There are cases in which the distribution of tasks
> > that are spawned on isolcpus does not happen properly.
> > In production deployment, initial wakeup of tasks spawn from
> > housekeeping cpus to isolcpus[nohz_full cpu] happens on first cpu
> > within isolcpus range instead of distributed across isolcpus.
> >
> > Usage of distribute_cpu_mask_prev from one processes group,
> > will clobber previous value of another or other groups and vice-versa.
> >
> > When housekeeping cpus spawn multiple child tasks to wakeup on
> > isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
> > distribution is currently performed based on per-cpu
> > distribute_cpu_mask_prev counter.
> > At the same time, on housekeeping cpus there are percpu
> > bounded timers interrupt/rcu threads and other system/user tasks
> > would be running with affinity as housekeeping cpus. In a real-life
> > environment, housekeeping cpus are much fewer and are too much loaded.
> > So, distribute_cpu_mask_prev value from these tasks impacts
> > the offset value for the tasks spawning to wakeup on isolcpus and
> > thus most of the tasks end up waking up on first cpu within the
> > isolcpus set.
> >
> > Steps to reproduce:
> > Kernel cmdline parameters:
> > isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
> > rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
> >
> > * pid=$(echo $$)
> > * taskset -pc 0 $pid
> > * cat loop-normal.c
> > int main(void)
> > {
> > while (1)
> > ;
> > return 0;
> > }
> > * gcc -o loop-normal loop-normal.c
> > * for i in {1..50}; do ./loop-normal & done
> > * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
> > * for i in $pids; do taskset -pc 2-5 $i ; done
> >
> > Expected output:
> > * All 50 “loop-normal” tasks should wake up on cpu2-5
> > equally distributed.
> > * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
> >
> > Actual output:
> > * All 50 “loop-normal” tasks got woken up on cpu2 only
>
> Your expectation is wrong. Things work as advertised.

That is, isolcpus results in single CPU balance domains and as such we
must not distribute -- there is no load balancing.

Ideally we'd reject setting cpumasks with multi bits set on domains like
that, but alas, that would break historical behaviour :/

Now, looking at the code, I don't think the current code actually
behaves correct in this case :-(, somewhere along the line we should
truncate cpu_valid_mask to a single bit. Let me see where the sane place
is to do that.


2023-10-11 13:53:09

by Peter Zijlstra

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

On Wed, Oct 11, 2023 at 01:46:42PM +0200, Peter Zijlstra wrote:

> Now, looking at the code, I don't think the current code actually
> behaves correct in this case :-(, somewhere along the line we should
> truncate cpu_valid_mask to a single bit. Let me see where the sane place
> is to do that.

Something like so I suppose, that limits newmask to the first
root_domain, which should be a superset of the cpuset if there is such a
thing.


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 23f638d431d6..334c5bc59160 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3081,6 +3081,29 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
return 0;
}

+static struct cpumask *root_domain_allowed(struct cpumask *newmask,
+ struct cpumask *scratch,
+ struct cpumask *valid)
+{
+ struct root_domain *rd;
+ struct cpumask *mask;
+ struct rq *rq;
+
+ int first = cpumask_first_and(newmask, valid);
+ if (first >= nr_cpu_ids)
+ return NULL;
+
+ mask = cpumask_of(first);
+ rd = cpu_rq(first)->rd;
+ if (rd)
+ mask = rd->span;
+
+ if (!cpumask_and(scratch, newmask, mask))
+ return NULL;
+
+ return scratch;
+}
+
/*
* Called with both p->pi_lock and rq->lock held; drops both before returning.
*/
@@ -3113,6 +3136,13 @@ static int __set_cpus_allowed_ptr_locked(struct task_struct *p,
cpu_valid_mask = cpu_online_mask;
}

+ ctx->new_mask = root_domain_allowed(ctx->new_mask,
+ rq->scratch_mask, cpu_valid_mask);
+ if (!ctx->new_mask) {
+ ret = -EINVAL;
+ goto out;
+ }
+
if (!kthread && !cpumask_subset(ctx->new_mask, cpu_allowed_mask)) {
ret = -EINVAL;
goto out;

2023-10-11 23:56:04

by Josh Don

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

Hey Peter,

> +static struct cpumask *root_domain_allowed(struct cpumask *newmask,
> + struct cpumask *scratch,
> + struct cpumask *valid)
> +{
> + struct root_domain *rd;
> + struct cpumask *mask;
> + struct rq *rq;
> +
> + int first = cpumask_first_and(newmask, valid);
> + if (first >= nr_cpu_ids)
> + return NULL;

This picks the first arbitrarily, but isn't it possible for the mask
to span both isolated and non-isolated cpus? In which case, the rest
of this just boils down to chance (ie. whatever the span happens to be
for the first cpu here)?

2023-10-12 00:18:55

by Yury Norov

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks at initial wakeup on cpus
> where load balancing works poorly or disabled at all (isolated cpus).
>
> There are cases in which the distribution of tasks
> that are spawned on isolcpus does not happen properly.
> In production deployment, initial wakeup of tasks spawn from
> housekeeping cpus to isolcpus[nohz_full cpu] happens on first cpu
> within isolcpus range instead of distributed across isolcpus.
>
> Usage of distribute_cpu_mask_prev from one processes group,
> will clobber previous value of another or other groups and vice-versa.
>
> When housekeeping cpus spawn multiple child tasks to wakeup on
> isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
> distribution is currently performed based on per-cpu
> distribute_cpu_mask_prev counter.
> At the same time, on housekeeping cpus there are percpu
> bounded timers interrupt/rcu threads and other system/user tasks
> would be running with affinity as housekeeping cpus. In a real-life
> environment, housekeeping cpus are much fewer and are too much loaded.
> So, distribute_cpu_mask_prev value from these tasks impacts
> the offset value for the tasks spawning to wakeup on isolcpus and
> thus most of the tasks end up waking up on first cpu within the
> isolcpus set.
>
> Steps to reproduce:
> Kernel cmdline parameters:
> isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
> rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
>
> * pid=$(echo $$)
> * taskset -pc 0 $pid
> * cat loop-normal.c
> int main(void)
> {
> while (1)
> ;
> return 0;
> }
> * gcc -o loop-normal loop-normal.c
> * for i in {1..50}; do ./loop-normal & done
> * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
> * for i in $pids; do taskset -pc 2-5 $i ; done
>
> Expected output:
> * All 50 “loop-normal” tasks should wake up on cpu2-5
> equally distributed.
> * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
>
> Actual output:
> * All 50 “loop-normal” tasks got woken up on cpu2 only
>
> Analysis:
> There are percpu bounded timer interrupt/rcu threads activities
> going on every few microseconds on housekeeping cpus, exercising
> find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute()
> So, per cpu variable distribute_cpu_mask_prev for housekeeping cpus
> keep on getting set to housekeeping cpus. Bash/docker processes
> are sharing same per cpu variable as they run on housekeeping cpus.
> Thus intersection of clobbered distribute_cpu_mask_prev and
> new mask(isolcpus) return always first cpu within the new mask(isolcpus)
> in accordance to the logic mentioned in commits above.
>
> Fix the issue by using random cores out of the applicable CPU set
> instead of relying on distribute_cpu_mask_prev.
>
> Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
>
> Signed-off-by: Ankit Jain <[email protected]>
> ---
> lib/cpumask.c | 40 +++++++++++++++++++++-------------------
> 1 file changed, 21 insertions(+), 19 deletions(-)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index a7fd02b5ae26..95a7c1b40e95 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -155,45 +155,47 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> }
> EXPORT_SYMBOL(cpumask_local_spread);
>
> -static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
> -
> /**
> * cpumask_any_and_distribute - Return an arbitrary cpu within src1p & src2p.
> * @src1p: first &cpumask for intersection
> * @src2p: second &cpumask for intersection
> *
> - * Iterated calls using the same srcp1 and srcp2 will be distributed within
> - * their intersection.
> + * Iterated calls using the same srcp1 and srcp2 will be randomly distributed
> + * within their intersection.
> *
> * Returns >= nr_cpu_ids if the intersection is empty.
> */

This has been discussed a while ago, and the bottomline is that 'any'
is not the same as 'random'. In practice, it means that getting 'any'
cpu is cheaper than getting randomized one.

I'm not that deep in context of the problem you're trying to solve, but
if you need randomized CPU, can you just add a new function for it?
Something like cpumask_get_random().

The approach with find_nth_bit() itself looks good to me.

Thanks,
Yury

> unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> const struct cpumask *src2p)
> {
> - unsigned int next, prev;
> + unsigned int n_cpus, nth_cpu;
>
> - /* NOTE: our first selection will skip 0. */
> - prev = __this_cpu_read(distribute_cpu_mask_prev);
> + n_cpus = cpumask_weight_and(src1p, src2p);
> + if (n_cpus == 0)
> + return nr_cpu_ids;
>
> - next = find_next_and_bit_wrap(cpumask_bits(src1p), cpumask_bits(src2p),
> - nr_cpumask_bits, prev + 1);
> - if (next < nr_cpu_ids)
> - __this_cpu_write(distribute_cpu_mask_prev, next);
> + nth_cpu = get_random_u32_below(n_cpus);
>
> - return next;
> + return find_nth_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> + nr_cpumask_bits, nth_cpu);
> }
> EXPORT_SYMBOL(cpumask_any_and_distribute);
>
> +/**
> + * Returns an arbitrary cpu within srcp.
> + *
> + * Iterated calls using the same srcp will be randomly distributed
> + */
> unsigned int cpumask_any_distribute(const struct cpumask *srcp)
> {
> - unsigned int next, prev;
> + unsigned int n_cpus, nth_cpu;
>
> - /* NOTE: our first selection will skip 0. */
> - prev = __this_cpu_read(distribute_cpu_mask_prev);
> - next = find_next_bit_wrap(cpumask_bits(srcp), nr_cpumask_bits, prev + 1);
> - if (next < nr_cpu_ids)
> - __this_cpu_write(distribute_cpu_mask_prev, next);
> + n_cpus = cpumask_weight(srcp);
> + if (n_cpus == 0)
> + return nr_cpu_ids;
>
> - return next;
> + nth_cpu = get_random_u32_below(n_cpus);
> +
> + return find_nth_bit(cpumask_bits(srcp), nr_cpumask_bits, nth_cpu);
> }
> EXPORT_SYMBOL(cpumask_any_distribute);
> --
> 2.23.1

2023-10-12 08:06:24

by Peter Zijlstra

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask

On Wed, Oct 11, 2023 at 04:55:35PM -0700, Josh Don wrote:
> Hey Peter,
>
> > +static struct cpumask *root_domain_allowed(struct cpumask *newmask,
> > + struct cpumask *scratch,
> > + struct cpumask *valid)
> > +{
> > + struct root_domain *rd;
> > + struct cpumask *mask;
> > + struct rq *rq;
> > +
> > + int first = cpumask_first_and(newmask, valid);
> > + if (first >= nr_cpu_ids)
> > + return NULL;
>
> This picks the first arbitrarily, but isn't it possible for the mask
> to span both isolated and non-isolated cpus? In which case, the rest
> of this just boils down to chance (ie. whatever the span happens to be
> for the first cpu here)?

Yes, but it matches historical behaviour :/

Like I said, ideally we'd flat out reject masks like this, but...

2023-10-12 15:43:46

by Ankit Jain

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask



> On 11-Oct-2023, at 5:16 PM, Peter Zijlstra <[email protected]> wrote:
>
>
> On Wed, Oct 11, 2023 at 12:53:29PM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
>>> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
>>> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
>>> introduced the logic to distribute the tasks at initial wakeup on cpus
>>> where load balancing works poorly or disabled at all (isolated cpus).
>>>
>>> There are cases in which the distribution of tasks
>>> that are spawned on isolcpus does not happen properly.
>>> In production deployment, initial wakeup of tasks spawn from
>>> housekeeping cpus to isolcpus[nohz_full cpu] happens on first cpu
>>> within isolcpus range instead of distributed across isolcpus.
>>>
>>> Usage of distribute_cpu_mask_prev from one processes group,
>>> will clobber previous value of another or other groups and vice-versa.
>>>
>>> When housekeeping cpus spawn multiple child tasks to wakeup on
>>> isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
>>> distribution is currently performed based on per-cpu
>>> distribute_cpu_mask_prev counter.
>>> At the same time, on housekeeping cpus there are percpu
>>> bounded timers interrupt/rcu threads and other system/user tasks
>>> would be running with affinity as housekeeping cpus. In a real-life
>>> environment, housekeeping cpus are much fewer and are too much loaded.
>>> So, distribute_cpu_mask_prev value from these tasks impacts
>>> the offset value for the tasks spawning to wakeup on isolcpus and
>>> thus most of the tasks end up waking up on first cpu within the
>>> isolcpus set.
>>>
>>> Steps to reproduce:
>>> Kernel cmdline parameters:
>>> isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
>>> rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
>>>
>>> * pid=$(echo $$)
>>> * taskset -pc 0 $pid
>>> * cat loop-normal.c
>>> int main(void)
>>> {
>>> while (1)
>>> ;
>>> return 0;
>>> }
>>> * gcc -o loop-normal loop-normal.c
>>> * for i in {1..50}; do ./loop-normal & done
>>> * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
>>> * for i in $pids; do taskset -pc 2-5 $i ; done
>>>
>>> Expected output:
>>> * All 50 “loop-normal” tasks should wake up on cpu2-5
>>> equally distributed.
>>> * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
>>>
>>> Actual output:
>>> * All 50 “loop-normal” tasks got woken up on cpu2 only
>>
>> Your expectation is wrong. Things work as advertised.
>
> That is, isolcpus results in single CPU balance domains and as such we
> must not distribute -- there is no load balancing.
>
> Ideally we'd reject setting cpumasks with multi bits set on domains like
> that, but alas, that would break historical behaviour :/

Thank you Peter for investing your time in reviewing this change.
I completely agree and understand that cpumask with multi bits
should not be set for domains like isolcpus and should probably
be addressed in user space.

>
> Now, looking at the code, I don't think the current code actually
> behaves correct in this case :-(, somewhere along the line we should
> truncate cpu_valid_mask to a single bit. Let me see where the sane place
> is to do that.
>
>
>

2023-10-12 15:52:49

by Ankit Jain

Subject: Re: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask



> On 12-Oct-2023, at 5:46 AM, Yury Norov <[email protected]> wrote:
>
>
> On Wed, Oct 11, 2023 at 12:49:25PM +0530, Ankit Jain wrote:
>> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
>> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
>> introduced the logic to distribute the tasks at initial wakeup on cpus
>> where load balancing works poorly or disabled at all (isolated cpus).
>>
>> There are cases in which the distribution of tasks
>> that are spawned on isolcpus does not happen properly.
>> In production deployment, initial wakeup of tasks spawn from
>> housekeeping cpus to isolcpus[nohz_full cpu] happens on first cpu
>> within isolcpus range instead of distributed across isolcpus.
>>
>> Usage of distribute_cpu_mask_prev from one processes group,
>> will clobber previous value of another or other groups and vice-versa.
>>
>> When housekeeping cpus spawn multiple child tasks to wakeup on
>> isolcpus[nohz_full cpu], using cpusets.cpus/sched_setaffinity(),
>> distribution is currently performed based on per-cpu
>> distribute_cpu_mask_prev counter.
>> At the same time, on housekeeping cpus there are percpu
>> bounded timers interrupt/rcu threads and other system/user tasks
>> would be running with affinity as housekeeping cpus. In a real-life
>> environment, housekeeping cpus are much fewer and are too much loaded.
>> So, distribute_cpu_mask_prev value from these tasks impacts
>> the offset value for the tasks spawning to wakeup on isolcpus and
>> thus most of the tasks end up waking up on first cpu within the
>> isolcpus set.
>>
>> Steps to reproduce:
>> Kernel cmdline parameters:
>> isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
>> rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
>>
>> * pid=$(echo $$)
>> * taskset -pc 0 $pid
>> * cat loop-normal.c
>> int main(void)
>> {
>> while (1)
>> ;
>> return 0;
>> }
>> * gcc -o loop-normal loop-normal.c
>> * for i in {1..50}; do ./loop-normal & done
>> * pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
>> * for i in $pids; do taskset -pc 2-5 $i ; done
>>
>> Expected output:
>> * All 50 “loop-normal” tasks should wake up on cpu2-5
>> equally distributed.
>> * ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
>>
>> Actual output:
>> * All 50 “loop-normal” tasks got woken up on cpu2 only
>>
>> Analysis:
>> There are percpu bounded timer interrupt/rcu threads activities
>> going on every few microseconds on housekeeping cpus, exercising
>> find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute()
>> So, per cpu variable distribute_cpu_mask_prev for housekeeping cpus
>> keep on getting set to housekeeping cpus. Bash/docker processes
>> are sharing same per cpu variable as they run on housekeeping cpus.
>> Thus intersection of clobbered distribute_cpu_mask_prev and
>> new mask(isolcpus) return always first cpu within the new mask(isolcpus)
>> in accordance to the logic mentioned in commits above.
>>
>> Fix the issue by using random cores out of the applicable CPU set
>> instead of relying on distribute_cpu_mask_prev.
>>
>> Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
>> Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
>>
>> Signed-off-by: Ankit Jain <[email protected]>
>> ---
>> lib/cpumask.c | 40 +++++++++++++++++++++-------------------
>> 1 file changed, 21 insertions(+), 19 deletions(-)
>>
>> diff --git a/lib/cpumask.c b/lib/cpumask.c
>> index a7fd02b5ae26..95a7c1b40e95 100644
>> --- a/lib/cpumask.c
>> +++ b/lib/cpumask.c
>> @@ -155,45 +155,47 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>> }
>> EXPORT_SYMBOL(cpumask_local_spread);
>>
>> -static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
>> -
>> /**
>> * cpumask_any_and_distribute - Return an arbitrary cpu within src1p & src2p.
>> * @src1p: first &cpumask for intersection
>> * @src2p: second &cpumask for intersection
>> *
>> - * Iterated calls using the same srcp1 and srcp2 will be distributed within
>> - * their intersection.
>> + * Iterated calls using the same srcp1 and srcp2 will be randomly distributed
>> + * within their intersection.
>> *
>> * Returns >= nr_cpu_ids if the intersection is empty.
>> */
>
> This has been discussed a while ago, and the bottomline is that 'any'
> is not the same as 'random'. In practice, it means that getting 'any'
> cpu is cheaper than getting randomized one.
>

Thank you Yury for this clarification.
My objective behind this change was distribution of tasks on cpus
with no load balancing (i.e. isolcpus), even at some higher cost
(the “random” logic).
However, I realize that distribution on isolcpus should probably
be addressed in userspace.