2024-03-13 08:58:37

by Ze Gao

Subject: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

We observed that select_idle_sibling() is likely to return the *target* cpu
early, which is often the previous cpu the task was running on, even when
that cpu is not in the newly set affinity mask. From there on we can only
rely on select_fallback_rq() to choose a cpu for us at its will (mostly the
first valid one for now).

However, the cpu chosen by select_fallback_rq() is highly likely not a
good enough candidate, and the task then has to rely on the load balancer
to kick in and place it on a better cpu, which no doubt adds one or more
unnecessary migrations. For example, this is what I get when I move task
3964 to cpu 23-24, where cpu 23 already has a cpu-bound task pinned:

swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24

The thing is, we can actually do better if we do the checks early and take
more advantage of *target* in select_idle_sibling(). That is, we continue
the idle cpu selection if *target* fails the cpumask_test_cpu(*target*,
p->cpus_ptr) check. By doing so, we are likely to pick a good candidate,
especially when the newly allowed cpu set shares some cpu resources with
*target*.

And with this change, we clearly see the improvement when I move task 3964
to cpu 25-26, where cpu 25 already has a cpu-bound task pinned:

swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26

Note we do the same check for *prev* in select_idle_sibling() as well.

Signed-off-by: Ze Gao <[email protected]>
---
kernel/sched/fair.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..9ef6e74c6b2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7511,16 +7511,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 */
 	lockdep_assert_irqs_disabled();
 
-	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, target))
+	if (cpumask_test_cpu(target, p->cpus_ptr) &&
+	    (available_idle_cpu(target) || sched_idle_cpu(target)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
-	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
+	if (cpumask_test_cpu(prev, p->cpus_ptr) &&
+	    prev != target &&
+	    cpus_share_cache(prev, target) &&
+	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
 
 		if (!static_branch_unlikely(&sched_cluster_active) ||
 		    cpus_share_resources(prev, target))
--
2.41.0



2024-03-13 15:16:55

by Pierre Gondois

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

Hello Ze,

I am running stress-ng with the following command:
stress-ng -c 1 -l 10 &
and migrating the process with:
taskset -pc [cpus] [pid]

The thread seems to be migrated via:
sched_setaffinity()
 \-__sched_setaffinity()
   \-__set_cpus_allowed_ptr()
     \-__set_cpus_allowed_ptr_locked()
       \- [1]

[1]
	/*
	 * Picking a ~random cpu helps in cases where we are changing affinity
	 * for groups of tasks (ie. cpuset), so that load balancing is not
	 * immediately required to distribute the tasks within their new mask.
	 */
	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);

So it seems the destination CPU is chosen among the new CPU affinity mask
here, by picking a ~random CPU in the mask.
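As an illustration, here is a minimal userspace mock of the distribute idea
(an assumption-level sketch, not the kernel code: the kernel version works
on struct cpumask and keeps a similar static rotor internally):

#include <stdio.h>

/* Mock of the distribute idea: each call starts scanning just after the
 * CPU returned last time, so repeated affinity changes spread tasks
 * across the mask instead of piling them onto the first allowed CPU. */
static int pick_distributed(unsigned long valid, unsigned long newmask)
{
	static int rotor;
	unsigned long avail = valid & newmask;

	if (!avail)
		return -1;

	for (int i = 1; i <= 64; i++) {
		int cpu = (rotor + i) % 64;

		if (avail & (1UL << cpu)) {
			rotor = cpu;
			return cpu;
		}
	}
	return -1;
}

int main(void)
{
	unsigned long mask = (1UL << 23) | (1UL << 24);

	/* Successive picks alternate between CPUs 23 and 24. */
	for (int k = 0; k < 4; k++)
		printf("pick %d -> cpu %d\n", k, pick_distributed(~0UL, mask));
	return 0;
}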

Checking the cpus_ptr in select_idle_sibling() might be useful in other cases,
but I think the experiment doesn't show that. Maybe another small tweak could
be done at [1] instead?

Regards,
Pierre

On 3/13/24 09:58, Ze Gao wrote:
> [...]

2024-03-14 03:00:11

by Ze Gao

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On Wed, Mar 13, 2024 at 11:16 PM Pierre Gondois <[email protected]> wrote:
>
> Hello Ze,
>
> I am running stress-ng with the following command:
> stress-ng -c 1 -l 10 &
> and migrating the process with:
> taskset -pc [cpus] [pid]
>
> The thread seems to be migrated via:
> sched_setaffinity()
>  \-__sched_setaffinity()
>    \-__set_cpus_allowed_ptr()
>      \-__set_cpus_allowed_ptr_locked()
>        \- [1]



> [1]
> 	/*
> 	 * Picking a ~random cpu helps in cases where we are changing affinity
> 	 * for groups of tasks (ie. cpuset), so that load balancing is not
> 	 * immediately required to distribute the tasks within their new mask.
> 	 */
> 	dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
>
> So it seems the destination CPU is chosen among the new CPU affinity mask
> here, by picking a ~random CPU in the mask.

IIUC, this is for running/queued/waking tasks instead of blocked tasks.

Am I missing something obvious here?

> Checking the cpus_ptr in select_idle_sibling() might be useful in other cases,
> but I think the experiment doesn't show that. Maybe another small tweak could

The experiment illustrates that the status quo does not do well: it has
to rely on select_fallback_rq() to choose a cpu for the woken task, which
turns out to be a bad choice since that cpu is already monopolized by a
cpu-bound task. That is why a second migration happens with the help of
the load balancer.

Actually, we can reuse the same reasons for doing so as in

commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")

> be done at [1] instead?

As for blocked tasks, check out the comments on set_task_cpu() and
select_task_rq(): we never call set_task_cpu() on blocked tasks, so we
have no way to change p->wake_cpu to the dest_cpu randomly chosen here.
When the task is woken up, it still goes through the select_task_rq()
process using the outdated p->wake_cpu.
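To make this concrete, here is a small userspace mock of the situation
(purely illustrative: struct task and cpu_allowed() are made-up stand-ins;
only the wake_cpu/cpus_mask relationship mirrors the kernel):

#include <stdio.h>
#include <stdbool.h>

struct task {
	int wake_cpu;			/* last cpu the task ran on */
	unsigned long cpus_mask;	/* bit i set => cpu i allowed */
};

static bool cpu_allowed(const struct task *t, int cpu)
{
	return t->cpus_mask & (1UL << cpu);
}

int main(void)
{
	struct task t = { .wake_cpu = 13, .cpus_mask = 1UL << 13 };

	/* sched_setaffinity() while the task is blocked: only the mask
	 * changes; wake_cpu is never updated for a blocked task. */
	t.cpus_mask = (1UL << 23) | (1UL << 24);

	/* At wakeup, rq selection starts from the stale wake_cpu, which
	 * is no longer in the allowed mask. */
	printf("wake_cpu=%d allowed=%s\n", t.wake_cpu,
	       cpu_allowed(&t, t.wake_cpu) ? "yes" : "no");
	return 0;
}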


Thanks,
-- Ze

> [...]

2024-03-14 03:49:38

by Ze Gao

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On Thu, Mar 14, 2024 at 10:59 AM Ze Gao <[email protected]> wrote:
>
> On Wed, Mar 13, 2024 at 11:16 PM Pierre Gondois <[email protected]> wrote:
> > [...]
>
> IIUC, this is for running/queued/waking tasks instead of blocked tasks.
>
> Am I missing something obvious here?
>
> > Checking the cpus_ptr in select_idle_sibling() might be useful in other cases,
> > but I think the experiment doesn't show that. Maybe another small tweak could
>
> The experiment illustrates that the status quo does not do well: it has
> to rely on select_fallback_rq() to choose a cpu for the woken task, which
> turns out to be a bad choice since that cpu is already monopolized by a
> cpu-bound task. That is why a second migration happens with the help of
> the load balancer.

Btw, perf alone does not show obvious results here; you need some other
observability tooling to make sure the migration is not initiated by
__set_cpus_allowed_ptr_locked() (i.e., for running tasks). I achieved this
by directly adding some tracepoints to both select_fallback_rq() and
select_idle_sibling().

Regards,
-- Ze

> [...]

2024-03-14 10:32:55

by Pierre Gondois

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes



On 3/14/24 04:49, Ze Gao wrote:
> On Thu, Mar 14, 2024 at 10:59 AM Ze Gao <[email protected]> wrote:
>>
>> On Wed, Mar 13, 2024 at 11:16 PM Pierre Gondois <[email protected]> wrote:
>>> [...]
>>
>> IIUC, this is for running/queued/waking tasks instead of blocked tasks.
>>
>> Am I missing something obvious here?

Not at all, I just didn't test the patch on a blocked task, as you said.

>> [...]
>
> Btw, perf alone does not show obvious results here; you need some other
> observability tooling to make sure the migration is not initiated by
> __set_cpus_allowed_ptr_locked() (i.e., for running tasks). I achieved this
> by directly adding some tracepoints to both select_fallback_rq() and
> select_idle_sibling().

FWIW, I could effectively reproduce the issue described in the commit message, so:
Tested-by: Pierre Gondois <[email protected]>

Regards,
Pierre

> [...]

2024-04-12 17:00:24

by Madadi Vineeth Reddy

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

Hi Ze Gao,

On 13/03/24 14:28, Ze Gao wrote:
> We observed that select_idle_sibling() is likely to return the *target* cpu
> early, which is often the previous cpu the task was running on, even when
> that cpu is not in the newly set affinity mask. From there on we can only
> rely on select_fallback_rq() to choose a cpu for us at its will (mostly the
> first valid one for now).
>
> However, the cpu chosen by select_fallback_rq() is highly likely not a
> good enough candidate, and the task then has to rely on the load balancer
> to kick in and place it on a better cpu, which no doubt adds one or more
> unnecessary migrations. For example, this is what I get when I move task
> 3964 to cpu 23-24, where cpu 23 already has a cpu-bound task pinned:
>
> swapper 0 [013] 959.791829: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=13 dest_cpu=23
> kworker/24:2-mm 1014 [024] 959.806148: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=23 dest_cpu=24
>

I am able to reproduce this scenario of having an extra migration through load balance:
swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34

I wrote a simple C program that blocks for a few seconds; meanwhile I taskset it to CPUs 33-34 while I already
have a busy task running on CPU 33 (a sketch of such a program follows).
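A minimal sketch of such a test program (illustrative only; the exact sleep
length and the busy phase afterwards are assumptions, not necessarily the
exact program used):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	printf("pid: %d\n", getpid());
	fflush(stdout);

	/* Block here; change the affinity externally with
	 * `taskset -pc 33,34 <pid>` while the task sleeps. */
	sleep(10);

	/* Burn cpu after wakeup so the chosen runqueue (and any
	 * follow-up load-balancer migration) shows up in the trace. */
	for (;;)
		;
	return 0;
}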

> The thing is, we can actually do better if we do the checks early and take
> more advantage of *target* in select_idle_sibling(). That is, we continue
> the idle cpu selection if *target* fails the cpumask_test_cpu(*target*,
> p->cpus_ptr) check. By doing so, we are likely to pick a good candidate,
> especially when the newly allowed cpu set shares some cpu resources with
> *target*.
>
> And with this change, we clearly see the improvement when I move task 3964
> to cpu 25-26, where cpu 25 already has a cpu-bound task pinned:
>
> swapper 0 [027] 4249.204658: sched:sched_migrate_task: comm=stress-ng-cpu pid=3964 prio=120 orig_cpu=27 dest_cpu=26

But after applying this patch, the extra migration is still happening, as CPU 33 is still chosen by try_to_wake_up().

After placing some perf probes and testing:
migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34

It is observed that select_fallback_rq() is still taken in this scenario, as the default target (CPU 4 here) is returned
at the end of select_idle_sibling().

In most of my testing, the default target is returned at the end of the function due to the initial checks. There may be
cases where we can find an optimal CPU before reaching the end of select_idle_sibling(), but it would be interesting to
know whether the extra time spent finding an optimal cpu has an impact, given that most of the time we end up returning
the default target at the end anyway.

Thanks and Regards
Madadi Vineeth Reddy

>
> Note we do the same check for *prev* in select_idle_sibling() as well.
>
> Signed-off-by: Ze Gao <[email protected]>
> ---


2024-04-15 02:03:50

by Ze Gao

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
<[email protected]> wrote:
>
> Hi Ze Gao,
>
> On 13/03/24 14:28, Ze Gao wrote:
> > [...]
>
> I am able to reproduce this scenario of having an extra migration through load balance:
> swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
> ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
>
> I wrote a simple C program that blocks for a few seconds; meanwhile I taskset it to CPUs 33-34 while I already
> have a busy task running on CPU 33.
>
> > [...]
>
> But after applying this patch, the extra migration is still happening, as CPU 33 is still chosen by try_to_wake_up().
>
> After placing some perf probes and testing:
> migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
> swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
> swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
> swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
> swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
>
> It is observed that select_fallback_rq() is still taken in this scenario, as the default target (CPU 4 here) is returned
> at the end of select_idle_sibling().

My best guess is that you may have hit the code path for running tasks
(taskset happens right after the task is woken up). Should that happen,
the picking is done via:

dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);

And it also makes sense that select_fallback_rq() returns 4, since that
happens before you change the affinities.

You may need to rule out this case first :)
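One way to rule it out (a hypothetical helper, not something from the
patch: it just reads the state field of /proc/<pid>/stat) is to confirm
the task is actually sleeping right before you taskset it:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return the scheduler state letter of a task: 'R' running,
 * 'S' sleeping, 'D' uninterruptible, or '?' on error. */
static char task_state(int pid)
{
	char path[64], comm[64], state = '?';
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return '?';
	/* /proc/<pid>/stat begins: pid (comm) state ... */
	if (fscanf(f, "%*d (%63[^)]) %c", comm, &state) != 2)
		state = '?';
	fclose(f);
	return state;
}

int main(int argc, char **argv)
{
	int pid = argc > 1 ? atoi(argv[1]) : getpid();

	printf("pid %d state: %c\n", pid, task_state(pid));
	return 0;
}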

Regards,
Ze

> [...]

2024-04-15 05:40:36

by Ze Gao

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
<[email protected]> wrote:
>
> Hi Ze Gao,
>
> On 13/03/24 14:28, Ze Gao wrote:
> > [...]
>
> I am able to reproduce this scenario of having an extra migration through load balance:
> swapper 0 [031] 398764.057232: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=31 dest_cpu=33
> ksoftirqd/0 13 [000] 398764.356138: sched:sched_migrate_task: comm=loop pid=178687 prio=120 orig_cpu=33 dest_cpu=34
>
> I wrote a simple C program that blocks for a few seconds; meanwhile I taskset it to CPUs 33-34 while I already
> have a busy task running on CPU 33.
>
> > [...]
>
> But after applying this patch, the extra migration is still happening, as CPU 33 is still chosen by try_to_wake_up().
>
> After placing some perf probes and testing:
> migration/57 304 [057] 12216.988491: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=57 dest_cpu=4
> swapper 0 [004] 12226.989065: probe:select_idle_sibling_L124: (c0000000001bafc0) i=-1 recent_used_cpu=-1 prev_aff=-1
> swapper 0 [004] 12226.989071: probe:select_fallback_rq: (c0000000001a2e38) cpu=4
> swapper 0 [004] 12226.989074: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=4 dest_cpu=33
> swapper 0 [000] 12227.007768: sched:sched_migrate_task: comm=loop pid=11172 prio=120 orig_cpu=33 dest_cpu=34
>
> It is observed that select_fallback_rq() is still taken in this scenario, as the default target (CPU 4 here) is returned
> at the end of select_idle_sibling().

After second thoughts, it indeed could happen if CPU 4 shares nothing
with CPU 33(34), for example, if they are in different numa nodes.

IOW, it cannot benefit from select_idle_sibling() and has to rely on
select_fallback_rq() as the last resort. Just like I said in the
changelog, this patch aims to improve rq selection for the cases where
the newly allowed cpu set shares some cpu resources with the old cpu set.

Sorry for not being able to recall all the details immediately, since this
thread has been inactive for a long while and has received no feedback
from the sched folks.

Best,
Ze

> [...]

2024-04-15 10:31:51

by Madadi Vineeth Reddy

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

Hi Ze Gao,

On 15/04/24 07:33, Ze Gao wrote:
> On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
> <[email protected]> wrote:
>> [...]
>
> My best guess is that you may have hit the code path for running tasks
> (taskset happens right after the task is woken up). Should that happen,
> the picking is done via:
>
> dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);

Yes, I verified that cpumask_any_and_distribute() is hit.

>
> > And it also makes sense that select_fallback_rq() returns 4, since that
> > happens before you change the affinities.

In this case, 4 is passed as an argument to select_fallback_rq(); it's not the return
value. It actually returns the first affined CPU, which is 33 here.

Also, I see select_fallback_rq() happening after cpumask_any_and_distribute().

>
> > You may need to rule out this case first :)
>

I am not sure how to avoid hitting cpumask_any_and_distribute(). Can you explain how you moved the tasks
in the stress-ng case?

Thanks and Regards
Madadi Vineeth Reddy

> [...]

2024-04-15 10:38:07

by Madadi Vineeth Reddy

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On 15/04/24 11:10, Ze Gao wrote:
> On Sat, Apr 13, 2024 at 12:59 AM Madadi Vineeth Reddy
> <[email protected]> wrote:
>> [...]
>
> After second thoughts, it indeed could happen if CPU 4 shares nothing
> with CPU 33(34), for example, if they are in different numa nodes.
>
> IOW, it cannot benefit from select_idle_sibling() and has to rely on
> select_fallback_rq() as the last resort. Just like I said in the
> changelog, this patch aims to improve rq selection for the cases where
> the newly allowed cpu set shares some cpu resources with the old cpu set.

Right. On Power10 (where I tested), the LLC is smaller, being at the SMT4
small-core level. So I think there is less chance that the newly affined
CPUs share resources with the old CPU set.

I also ran schbench and hackbench just to make sure that there is no regression
on powerpc. Luckily, there is none.

Tested-by: Madadi Vineeth Reddy <[email protected]>

Thanks and Regards
Madadi Vineeth Reddy

> [...]


2024-04-15 11:05:23

by Ze Gao

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On Mon, Apr 15, 2024 at 6:25 PM Madadi Vineeth Reddy
<[email protected]> wrote:
>
> Hi Ze Gao,
>
> On 15/04/24 07:33, Ze Gao wrote:
> > [...]
> >
> > My best guess is that you may have hit the code path for running tasks
> > (taskset happens right after the task is woken up). Should that happen,
> > the picking is done via:
> >
> > dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, ctx->new_mask);
>
> Yes, I verified that cpumask_any_and_distribute() is hit.
>
> >
> > And it also makes sense that select_fallback_rq() returns 4, since that
> > happens before you change the affinities.
>
> In this case, 4 is passed as an argument to select_fallback_rq(); it's not the return
> value. It actually returns the first affined CPU, which is 33 here.
>
> Also, I see select_fallback_rq() happening after cpumask_any_and_distribute().
>
> >
> > You may need to rule out this case first :)
> >
>
> I am not sure how to avoid hitting cpumask_any_and_distribute(). Can you explain how you moved the tasks
> in the stress-ng case?

Actually, there is no easy way to avoid taskset hitting a running task
(maybe -l with a low load expectation, which increases the chance that
the task sleeps). And then it is simple and stupid: make the trace as
verbose as I can, and filter for what I pay attention to.

Regards,
Ze

> [...]

2024-04-15 11:06:48

by Ze Gao

Subject: Re: [PATCH] sched: Improve rq selection for a blocked task when its affinity changes

On Mon, Apr 15, 2024 at 6:36 PM Madadi Vineeth Reddy
<[email protected]> wrote:
>
> On 15/04/24 11:10, Ze Gao wrote:
> > [...]
>
> Right. On Power10 (where I tested), the LLC is smaller, being at the SMT4
> small-core level. So I think there is less chance that the newly affined
> CPUs share resources with the old CPU set.
>
> I also ran schbench and hackbench just to make sure that there is no regression
> on powerpc. Luckily, there is none.
> Tested-by: Madadi Vineeth Reddy <[email protected]>

Thanks for the testing!

Sincerely,
Ze

> [...]
>