2022-08-10 23:33:47

by Libo Chen

Subject: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine

There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.

When wake_affine_idle() returns prev_cpu, which is not equal to
nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
in wake_affine() and is counted as if target == this_cpu in schedstats.

Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.

Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
Suggested-by: Daniel Jordan <[email protected]>
Signed-off-by: Libo Chen <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da388657d5ac..b179da4f8105 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

schedstat_inc(p->stats.nr_wakeups_affine_attempts);
- if (target == nr_cpumask_bits)
+ if (target != this_cpu)
return prev_cpu;

schedstat_inc(sd->ttwu_move_affine);
--
2.31.1
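
For context, the function being patched looks roughly like the sketch
below. This is a paraphrase of wake_affine() around the commit under
discussion, not the verbatim source; only the identifiers visible in the
diff above are taken directly from it. wake_affine_idle() can return
this_cpu, prev_cpu, or the nr_cpumask_bits sentinel meaning "no
decision", and the old check treated only the sentinel as a non-affine
outcome:

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
                       int this_cpu, int prev_cpu, int sync)
{
        /* nr_cpumask_bits acts as a sentinel: "no decision made yet" */
        int target = nr_cpumask_bits;

        if (sched_feat(WA_IDLE))
                target = wake_affine_idle(this_cpu, prev_cpu, sync);

        if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
                target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

        schedstat_inc(p->stats.nr_wakeups_affine_attempts);
        if (target == nr_cpumask_bits)  /* old check: misses target == prev_cpu */
                return prev_cpu;

        /*
         * With the old check, target == prev_cpu (a non-affine outcome
         * from wake_affine_idle()) also reaches this point and inflates
         * the affine counter; the patch guards it with target != this_cpu.
         */
        schedstat_inc(sd->ttwu_move_affine);
        return target;
}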


2022-08-15 11:05:00

by Peter Zijlstra

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine

On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
> There are scenarios where non-affine wakeups are incorrectly counted as
> affine wakeups by schedstats.
>
> When wake_affine_idle() returns prev_cpu, which is not equal to
> nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
> in wake_affine() and is counted as if target == this_cpu in schedstats.
>
> Replace target == nr_cpumask_bits with target != this_cpu to make sure
> affine wakeups are accurately tallied.
>
> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
> Suggested-by: Daniel Jordan <[email protected]>
> Signed-off-by: Libo Chen <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da388657d5ac..b179da4f8105 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>
> schedstat_inc(p->stats.nr_wakeups_affine_attempts);
> - if (target == nr_cpumask_bits)
> + if (target != this_cpu)
> return prev_cpu;
>
> schedstat_inc(sd->ttwu_move_affine);

This not only changes the accounting but also the placement, no?

2022-08-15 23:18:50

by Libo Chen

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine



On 8/15/22 04:01, Peter Zijlstra wrote:
> On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
>> There are scenarios where non-affine wakeups are incorrectly counted as
>> affine wakeups by schedstats.
>>
>> When wake_affine_idle() returns prev_cpu, which is not equal to
>> nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
>> in wake_affine() and is counted as if target == this_cpu in schedstats.
>>
>> Replace target == nr_cpumask_bits with target != this_cpu to make sure
>> affine wakeups are accurately tallied.
>>
>> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
>> Suggested-by: Daniel Jordan <[email protected]>
>> Signed-off-by: Libo Chen <[email protected]>
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da388657d5ac..b179da4f8105 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>>
>> schedstat_inc(p->stats.nr_wakeups_affine_attempts);
>> - if (target == nr_cpumask_bits)
>> + if (target != this_cpu)
>> return prev_cpu;
>>
>> schedstat_inc(sd->ttwu_move_affine);
> This not only changes the accounting but also the placement, no?
No, it should only change the accounting. wake_affine() still returns
prev_cpu if target equals prev_cpu or nr_cpumask_bits, the same
behavior as before.


Libo
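
Libo's equivalence claim can be checked by enumerating the three
possible decisions. The standalone userspace program below models the
old and new checks side by side; the CPU numbers and the
NR_CPUMASK_BITS stand-in are made up for illustration, and nothing here
is kernel code:

#include <stdio.h>
#include <stdbool.h>

#define NR_CPUMASK_BITS 64      /* stand-in sentinel: "no decision" */

enum decision { PICK_THIS_CPU, PICK_PREV_CPU, PICK_NONE };

int main(void)
{
        const int this_cpu = 0, prev_cpu = 5;
        const char *names[] = { "this_cpu", "prev_cpu", "none" };

        for (int d = PICK_THIS_CPU; d <= PICK_NONE; d++) {
                int target = (d == PICK_THIS_CPU) ? this_cpu :
                             (d == PICK_PREV_CPU) ? prev_cpu :
                                                    NR_CPUMASK_BITS;
                /* old: only the sentinel counts as non-affine */
                bool old_affine = (target != NR_CPUMASK_BITS);
                /* new: anything other than this_cpu is non-affine */
                bool new_affine = (target == this_cpu);
                int old_placement = old_affine ? target : prev_cpu;
                int new_placement = new_affine ? target : prev_cpu;

                printf("decision=%-8s placement old=%d new=%d  "
                       "counted-affine old=%d new=%d\n",
                       names[d], old_placement, new_placement,
                       old_affine, new_affine);
        }
        return 0;
}

For every decision the placement (the returned CPU) is identical; only
the decision=prev_cpu row differs, and only in the counted-affine
column.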

2022-08-17 13:27:00

by Vincent Guittot

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine

On Mon, 15 Aug 2022 at 21:19, Libo Chen <[email protected]> wrote:
>
>
>
> On 8/15/22 04:01, Peter Zijlstra wrote:
> > On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
> >> There are scenarios where non-affine wakeups are incorrectly counted as
> >> affine wakeups by schedstats.
> >>
> >> When wake_affine_idle() returns prev_cpu, which is not equal to
> >> nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
> >> in wake_affine() and is counted as if target == this_cpu in schedstats.
> >>
> >> Replace target == nr_cpumask_bits with target != this_cpu to make sure
> >> affine wakeups are accurately tallied.
> >>
> >> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
> >> Suggested-by: Daniel Jordan <[email protected]>
> >> Signed-off-by: Libo Chen <[email protected]>
> >> ---
> >> kernel/sched/fair.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index da388657d5ac..b179da4f8105 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >>
> >> schedstat_inc(p->stats.nr_wakeups_affine_attempts);
> >> - if (target == nr_cpumask_bits)
> >> + if (target != this_cpu)
> >> return prev_cpu;
> >>
> >> schedstat_inc(sd->ttwu_move_affine);
> > This not only changes the accounting but also the placement, no?
> No, it should only change the accounting. wake_affine() still returns
> prev_cpu if target equals prev_cpu or nr_cpumask_bits, the same
> behavior as before.

Looks good to me

Reviewed-by: Vincent Guittot <[email protected]>

>
>
> Libo

2022-08-25 07:48:20

by Gautham R. Shenoy

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine

On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
> There are scenarios where non-affine wakeups are incorrectly counted as
> affine wakeups by schedstats.
>
> When wake_affine_idle() returns prev_cpu, which is not equal to
> nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
> in wake_affine() and is counted as if target == this_cpu in schedstats.
>
> Replace target == nr_cpumask_bits with target != this_cpu to make sure
> affine wakeups are accurately tallied.
>
> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
> Suggested-by: Daniel Jordan <[email protected]>
> Signed-off-by: Libo Chen <[email protected]>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da388657d5ac..b179da4f8105 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>
> schedstat_inc(p->stats.nr_wakeups_affine_attempts);
> - if (target == nr_cpumask_bits)
> + if (target != this_cpu)
> return prev_cpu;


This seems to be the right thing to do. However...

if this_cpu and prev_cpu were in the same LLC and we pick prev_cpu,
technically, is it still not an affine wakeup?


>
> schedstat_inc(sd->ttwu_move_affine);
> --
> 2.31.1
>

--
Thanks and Regards
gautham.

2022-08-25 09:37:31

by Libo Chen

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine



On 8/25/22 00:30, Gautham R. Shenoy wrote:
> On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
>> There are scenarios where non-affine wakeups are incorrectly counted as
>> affine wakeups by schedstats.
>>
>> When wake_affine_idle() returns prev_cpu, which is not equal to
>> nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
>> in wake_affine() and is counted as if target == this_cpu in schedstats.
>>
>> Replace target == nr_cpumask_bits with target != this_cpu to make sure
>> affine wakeups are accurately tallied.
>>
>> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
>> Suggested-by: Daniel Jordan <[email protected]>
>> Signed-off-by: Libo Chen <[email protected]>
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da388657d5ac..b179da4f8105 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>>
>> schedstat_inc(p->stats.nr_wakeups_affine_attempts);
>> - if (target == nr_cpumask_bits)
>> + if (target != this_cpu)
>> return prev_cpu;
>
> This seems to be the right thing to do. However...
>
> if this_cpu and prev_cpu were in the same LLC and we pick prev_cpu,
> technically, is it still not an affine wakeup?
>
I think schedstats like ttwu_move_affine/ttwu_wake_remote are defined
within a sched domain, so if the waking CPU and the previous CPU are in
the same MC domain, then picking the previous CPU is a remote wakeup
within that MC domain. If the two candidate CPUs are from two different
NUMA nodes, then picking the waking CPU is an affine wakeup within that
NUMA domain. Correct me if I am wrong, but this definition is
consistent across all levels of sched domains.

But I do understand that when the two candidate CPUs are within an LLC,
    a) all the fast-path wakeups should be affine wakeups if your
definition of an affine wakeup is a wakeup to the same LLC as the waker;
    b) select_idle_sibling() may pick a CPU in that LLC other than the
two candidate CPUs, which makes the affine/remote stats here useless
even if we stay consistent with the ttwu_move_affine/ttwu_wake_remote
definition.

I personally think it's just too much trouble to add extra code in the
kernel to, say, treat all wakeups within an LLC as ttwu_move_affine.
It's a lot easier to do that when you process the schedstats data,
whether you want to treat all wakeups in LLC domains as affine wakeups
or ignore the ttwu_move_affine/ttwu_wake_remote stats from LLC domains.
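
As a concrete (and hypothetical) example of that post-processing, the
userspace sketch below sums the two wakeup counters from
/proc/schedstat while skipping selected domain levels. It assumes the
ttwu_* counters are the trailing three fields of each domainN line,
which matches schedstat version 15 as documented in sched-stats.rst but
should be verified against the version line; the level-0 skip is only a
placeholder, since which level corresponds to the LLC depends on the
machine's topology.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/schedstat", "r");
        char line[4096];
        unsigned long long remote = 0, affine = 0;

        if (!f) {
                perror("/proc/schedstat");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                int level;
                unsigned long long last[3] = { 0, 0, 0 };
                char *tok;

                /* Only domainN lines carry the ttwu_* counters; skip
                 * level 0 as a stand-in for "at or below the LLC". */
                if (sscanf(line, "domain%d", &level) != 1 || level == 0)
                        continue;

                /* Keep a rolling window of the last three fields:
                 * ttwu_wake_remote, ttwu_move_affine, ttwu_move_balance. */
                for (tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
                        last[0] = last[1];
                        last[1] = last[2];
                        last[2] = strtoull(tok, NULL, 10);
                }
                remote += last[0];
                affine += last[1];
        }
        fclose(f);

        printf("ttwu_wake_remote=%llu ttwu_move_affine=%llu\n",
               remote, affine);
        return 0;
}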



>>
>> schedstat_inc(sd->ttwu_move_affine);
>> --
>> 2.31.1
>>
> --
> Thanks and Regards
> gautham.

2023-01-09 22:38:58

by Libo Chen

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine

Hi Peter,

A gentle ping~ Vincent has given it his Reviewed-by. Let me know what
else I should do for this patch.

Libo

On 8/15/22 12:19 PM, Libo Chen wrote:
>
>
> On 8/15/22 04:01, Peter Zijlstra wrote:
>> On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
>>> There are scenarios where non-affine wakeups are incorrectly counted as
>>> affine wakeups by schedstats.
>>>
>>> When wake_affine_idle() returns prev_cpu, which is not equal to
>>> nr_cpumask_bits, it slips through the target == nr_cpumask_bits
>>> check in wake_affine() and is counted as if target == this_cpu in
>>> schedstats.
>>>
>>> Replace target == nr_cpumask_bits with target != this_cpu to make sure
>>> affine wakeups are accurately tallied.
>>>
>>> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is
>>> idle)
>>> Suggested-by: Daniel Jordan <[email protected]>
>>> Signed-off-by: Libo Chen <[email protected]>
>>> ---
>>>   kernel/sched/fair.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index da388657d5ac..b179da4f8105 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain
>>> *sd, struct task_struct *p,
>>>           target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>>>         schedstat_inc(p->stats.nr_wakeups_affine_attempts);
>>> -    if (target == nr_cpumask_bits)
>>> +    if (target != this_cpu)
>>>           return prev_cpu;
>>>         schedstat_inc(sd->ttwu_move_affine);
>> This not only changes the accounting but also the placement, no?
> No, it should only change the accounting. wake_affine() still returns
> prev_cpu if target equals prev_cpu or nr_cpumask_bits, the same
> behavior as before.
>
>
> Libo

2023-03-09 03:18:49

by Libo Chen

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine



> On Jan 9, 2023, at 2:00 PM, Libo Chen <[email protected]> wrote:
>
> Hi Peter,
>
> A gentle ping~ Vincent has given it his Reviewed-by. Let me know what else I should do for this patch.
>
> Libo
>
> On 8/15/22 12:19 PM, Libo Chen wrote:
>>
>>
>> On 8/15/22 04:01, Peter Zijlstra wrote:
>>> On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
>>>> There are scenarios where non-affine wakeups are incorrectly counted as
>>>> affine wakeups by schedstats.
>>>>
>>>> When wake_affine_idle() returns prev_cpu, which is not equal to
>>>> nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
>>>> in wake_affine() and is counted as if target == this_cpu in schedstats.
>>>>
>>>> Replace target == nr_cpumask_bits with target != this_cpu to make sure
>>>> affine wakeups are accurately tallied.
>>>>
>>>> Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
>>>> Suggested-by: Daniel Jordan <[email protected]>
>>>> Signed-off-by: Libo Chen <[email protected]>
>>>> ---
>>>> kernel/sched/fair.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index da388657d5ac..b179da4f8105 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>>>> target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
>>>> schedstat_inc(p->stats.nr_wakeups_affine_attempts);
>>>> - if (target == nr_cpumask_bits)
>>>> + if (target != this_cpu)
>>>> return prev_cpu;
>>>> schedstat_inc(sd->ttwu_move_affine);
>>> This not only changes the accounting but also the placement, no?
>> No, it should only change the accounting. wake_affine() still returns prev_cpu if target equals prev_cpu or nr_cpumask_bits, the same behavior as before.
>>

Hi Peter,

A second ping in case you missed the first one: what else should I do to get this fix in?


Libo

>>
>> Libo
>


2023-03-30 10:45:26

by Gautham R. Shenoy

Subject: Re: [PATCH 1/1] sched/fair: Fix inaccurate tally of ttwu_move_affine

Hello Libo,

On Thu, Aug 25, 2022 at 02:13:17AM -0700, Libo Chen wrote:
>

Sorry, looks like this message got buried under the pile.

>
> On 8/25/22 00:30, Gautham R. Shenoy wrote:
> > On Wed, Aug 10, 2022 at 03:33:13PM -0700, Libo Chen wrote:
> > > There are scenarios where non-affine wakeups are incorrectly counted as
> > > affine wakeups by schedstats.
> > >
> > > When wake_affine_idle() returns prev_cpu, which is not equal to
> > > nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
> > > in wake_affine() and is counted as if target == this_cpu in schedstats.
> > >
> > > Replace target == nr_cpumask_bits with target != this_cpu to make sure
> > > affine wakeups are accurately tallied.
> > >
> > > Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
> > > Suggested-by: Daniel Jordan <[email protected]>
> > > Signed-off-by: Libo Chen <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index da388657d5ac..b179da4f8105 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6114,7 +6114,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > > target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> > > schedstat_inc(p->stats.nr_wakeups_affine_attempts);
> > > - if (target == nr_cpumask_bits)
> > > + if (target != this_cpu)
> > > return prev_cpu;
> >
> > This seems to be the right thing to do. However...
> >
> > if this_cpu and prev_cpu were in the same LLC and we pick prev_cpu,
> > technically, is it still not an affine wakeup?
> >
> I think schedstats like ttwu_move_affine/ttwu_wake_remote are defined within
> a sched domain, so if the waking CPU and the previous CPU are in the same MC
> domain, then picking the previous CPU is a remote wakeup within that MC
> domain. If the two candidate CPUs are from two different NUMA nodes,
> then picking the waking CPU is an affine wakeup within that NUMA domain.
> Correct me if I am wrong, but this definition is consistent across
> all levels of sched domains.

Yes, the definition of ttwu_wake_remote in the lowest sched-domain
containing both prev_cpu and this_cpu is target_cpu != this_cpu.
This is fairly unambiguous.


From the code, ttwu_move_affine captures an _attempt_ to choose
this_cpu as the target_cpu in the lowest sched-domain containing both
prev_cpu and this_cpu. It is merely an attempt since the actual target
CPU is selected by SIS and could be any idle CPU in the LLC of
prev_cpu/this_cpu.

ttwu_move_affine makes sense for sched-domains higher than the LLC
domain, since we move the task to the LLC of this_cpu and away from the
LLC of prev_cpu (this is possible on AMD processors, which contain
multiple LLC domains within a NUMA node). Having given it some more
thought, I am not sure how to interpret this metric for the LLC domain
and lower ones, since the eventual target CPU may not even be "closer"
to this_cpu.
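
For readers who want to see where that "merely an attempt" happens, the
sketch below paraphrases the relevant slice of the wakeup fast path. It
is not the verbatim kernel source: find_affine_domain() is an
illustrative stand-in for the for_each_domain() walk that locates the
lowest SD_WAKE_AFFINE domain spanning both CPUs, and the surrounding
checks (wake_wide(), cpus_ptr, and so on) are omitted.

static int select_task_rq_fair_sketch(struct task_struct *p,
                                      int prev_cpu, int this_cpu, int sync)
{
        int new_cpu = prev_cpu;
        struct sched_domain *sd = find_affine_domain(this_cpu, prev_cpu);

        /* wake_affine() only ever picks between this_cpu and prev_cpu,
         * and that is the point where ttwu_move_affine is counted. */
        if (sd && this_cpu != prev_cpu)
                new_cpu = wake_affine(sd, p, this_cpu, prev_cpu, sync);

        /* The pick is then just a seed for SIS, which may return any
         * idle CPU sharing new_cpu's LLC, which is why the counter is
         * hard to interpret at or below the LLC level. */
        return select_idle_sibling(p, prev_cpu, new_cpu);
}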

>
> But I do understand that when the two candidate CPUs are within an LLC,
>     a) all the fast-path wakeups should be affine wakeups if your definition
> of an affine wakeup is a wakeup to the same LLC as the waker;
>     b) select_idle_sibling() may pick a CPU in that LLC other than the two
> candidate CPUs, which makes the affine/remote stats here useless even if we
> stay consistent with the ttwu_move_affine/ttwu_wake_remote definition.

Fair enough.

>
> I personally think it's just too much trouble to add extra code in the
> kernel to, say, treat all wakeups within an LLC as ttwu_move_affine.
> It's a lot easier to do that when you process the schedstats data,
> whether you want to treat all wakeups in LLC domains as affine wakeups or
> ignore the ttwu_move_affine/ttwu_wake_remote stats from LLC domains.


I agree.

I think having your fix is the right thing, since currently the
move_affine data in schedstats isn't accurate when wake_affine_idle()
or wake_affine_weight() picks the prev_cpu, especially when prev_cpu
and this_cpu are in sched-domains higher than the LLC. Thus today we
overcount affine wakeups, which is incorrect.

So,
Reviewed-by: Gautham R. Shenoy <[email protected]>

--
Thanks and Regards
gautham.

2023-04-06 10:30:12

by tip-bot2 for Libo Chen

Subject: [tip: sched/core] sched/fair: Fix inaccurate tally of ttwu_move_affine

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 39afe5d6fc59237ff7738bf3ede5a8856822d59d
Gitweb: https://git.kernel.org/tip/39afe5d6fc59237ff7738bf3ede5a8856822d59d
Author: Libo Chen <[email protected]>
AuthorDate: Wed, 10 Aug 2022 15:33:13 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 05 Apr 2023 09:58:48 +02:00

sched/fair: Fix inaccurate tally of ttwu_move_affine

There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.

When wake_affine_idle() returns prev_cpu, which is not equal to
nr_cpumask_bits, it slips through the target == nr_cpumask_bits check
in wake_affine() and is counted as if target == this_cpu in schedstats.

Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.

Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle)
Suggested-by: Daniel Jordan <[email protected]>
Signed-off-by: Libo Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc358dc..f5da01a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6582,7 +6582,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

schedstat_inc(p->stats.nr_wakeups_affine_attempts);
- if (target == nr_cpumask_bits)
+ if (target != this_cpu)
return prev_cpu;

schedstat_inc(sd->ttwu_move_affine);