2021-11-04 17:56:04

by Vincent Donnefort

[permalink] [raw]
Subject: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

Nothing protects the access to the per_cpu variable sd_llc_id. When testing
the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
update_top_cache_domain(). One scenario being:

CPU1 CPU2
==================================================================

per_cpu(sd_llc_id, CPUX) => 0
partition_sched_domains_locked()
detach_destroy_domains()
cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX)
per_cpu(sd_llc_id, CPUX) => 0
per_cpu(sd_llc_id, CPUX) = CPUX
per_cpu(sd_llc_id, CPUX) => CPUX
return false

ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
is a warning triggered from ttwu_queue_wakelist().

Avoid a such race in cpus_share_cache() by always returning true when
this_cpu == that_cpu.

Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
Reported-by: Jing-Ting Wu <[email protected]>
Signed-off-by: Vincent Donnefort <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2611b9cf503..f5ca15cdcff4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3726,6 +3726,9 @@ void wake_up_if_idle(int cpu)

bool cpus_share_cache(int this_cpu, int that_cpu)
{
+ if (this_cpu == that_cpu)
+ return true;
+
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}

--
2.25.1


2021-11-05 16:36:03

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

On Thu, 4 Nov 2021 at 18:52, Vincent Donnefort
<[email protected]> wrote:
>
> Nothing protects the access to the per_cpu variable sd_llc_id. When testing
> the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
> update_top_cache_domain(). One scenario being:
>
> CPU1 CPU2
> ==================================================================
>
> per_cpu(sd_llc_id, CPUX) => 0
> partition_sched_domains_locked()
> detach_destroy_domains()
> cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX)
> per_cpu(sd_llc_id, CPUX) => 0
> per_cpu(sd_llc_id, CPUX) = CPUX
> per_cpu(sd_llc_id, CPUX) => CPUX
> return false
>
> ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
> is a warning triggered from ttwu_queue_wakelist().
>
> Avoid a such race in cpus_share_cache() by always returning true when
> this_cpu == that_cpu.
>
> Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
> Reported-by: Jing-Ting Wu <[email protected]>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f2611b9cf503..f5ca15cdcff4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3726,6 +3726,9 @@ void wake_up_if_idle(int cpu)
>
> bool cpus_share_cache(int this_cpu, int that_cpu)
> {
> + if (this_cpu == that_cpu)
> + return true;
> +
> return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> }
>
> --
> 2.25.1
>

2021-11-08 15:44:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

On Fri, Nov 05, 2021 at 04:08:33PM +0100, Vincent Guittot wrote:
> On Thu, 4 Nov 2021 at 18:52, Vincent Donnefort
> <[email protected]> wrote:
> >
> > Nothing protects the access to the per_cpu variable sd_llc_id. When testing
> > the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
> > update_top_cache_domain(). One scenario being:
> >
> > CPU1 CPU2
> > ==================================================================
> >
> > per_cpu(sd_llc_id, CPUX) => 0
> > partition_sched_domains_locked()
> > detach_destroy_domains()
> > cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX)
> > per_cpu(sd_llc_id, CPUX) => 0
> > per_cpu(sd_llc_id, CPUX) = CPUX
> > per_cpu(sd_llc_id, CPUX) => CPUX
> > return false
> >
> > ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
> > is a warning triggered from ttwu_queue_wakelist().
> >
> > Avoid a such race in cpus_share_cache() by always returning true when
> > this_cpu == that_cpu.
> >
> > Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
> > Reported-by: Jing-Ting Wu <[email protected]>
> > Signed-off-by: Vincent Donnefort <[email protected]>
> > Reviewed-by: Valentin Schneider <[email protected]>
>
> Reviewed-by: Vincent Guittot <[email protected]>
>
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f2611b9cf503..f5ca15cdcff4 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3726,6 +3726,9 @@ void wake_up_if_idle(int cpu)
> >
> > bool cpus_share_cache(int this_cpu, int that_cpu)
> > {
> > + if (this_cpu == that_cpu)
> > + return true;
> > +
> > return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> > }

Blergh, that's annoying. Thanks guys, picked it up for /urgent

Subject: [tip: sched/urgent] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 42dc938a590c96eeb429e1830123fef2366d9c80
Gitweb: https://git.kernel.org/tip/42dc938a590c96eeb429e1830123fef2366d9c80
Author: Vincent Donnefort <[email protected]>
AuthorDate: Thu, 04 Nov 2021 17:51:20
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 11 Nov 2021 13:09:32 +01:00

sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

Nothing protects the access to the per_cpu variable sd_llc_id. When testing
the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
update_top_cache_domain(). One scenario being:

CPU1 CPU2
==================================================================

per_cpu(sd_llc_id, CPUX) => 0
partition_sched_domains_locked()
detach_destroy_domains()
cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX)
per_cpu(sd_llc_id, CPUX) => 0
per_cpu(sd_llc_id, CPUX) = CPUX
per_cpu(sd_llc_id, CPUX) => CPUX
return false

ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
is a warning triggered from ttwu_queue_wakelist().

Avoid a such race in cpus_share_cache() by always returning true when
this_cpu == that_cpu.

Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
Reported-by: Jing-Ting Wu <[email protected]>
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 523fd60..cec173a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3726,6 +3726,9 @@ out:

bool cpus_share_cache(int this_cpu, int that_cpu)
{
+ if (this_cpu == that_cpu)
+ return true;
+
return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}


2021-11-15 10:46:20

by Shinichiro Kawasaki

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

On Nov 04, 2021 / 17:51, Vincent Donnefort wrote:
> Nothing protects the access to the per_cpu variable sd_llc_id. When testing
> the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
> update_top_cache_domain(). One scenario being:
>
> CPU1 CPU2
> ==================================================================
>
> per_cpu(sd_llc_id, CPUX) => 0
> partition_sched_domains_locked()
> detach_destroy_domains()
> cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX)
> per_cpu(sd_llc_id, CPUX) => 0
> per_cpu(sd_llc_id, CPUX) = CPUX
> per_cpu(sd_llc_id, CPUX) => CPUX
> return false
>
> ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
> is a warning triggered from ttwu_queue_wakelist().
>
> Avoid a such race in cpus_share_cache() by always returning true when
> this_cpu == that_cpu.
>
> Fixes: 518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
> Reported-by: Jing-Ting Wu <[email protected]>
> Signed-off-by: Vincent Donnefort <[email protected]>
> Reviewed-by: Valentin Schneider <[email protected]>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f2611b9cf503..f5ca15cdcff4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3726,6 +3726,9 @@ void wake_up_if_idle(int cpu)
>
> bool cpus_share_cache(int this_cpu, int that_cpu)
> {
> + if (this_cpu == that_cpu)
> + return true;
> +
> return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
> }
>
> --
> 2.25.1
>

Oh, this is the exactly same fix as I posted before [1]. It is a little bit sad
that my post did not get reviewed. Anyway, good to see the issue fixed. Thanks.

[1] https://lore.kernel.org/all/[email protected]/

--
Best Regards,
Shin'ichiro Kawasaki

2021-11-15 22:29:56

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

On 15/11/21 10:46, Shinichiro Kawasaki wrote:
> Oh, this is the exactly same fix as I posted before [1]. It is a little bit sad
> that my post did not get reviewed. Anyway, good to see the issue fixed. Thanks.
>
> [1] https://lore.kernel.org/all/[email protected]/
>

Oh, that sucks, sorry about that. I do try to keep an eye out for sched
stuff sent to LKML, but clearly there are some that fall through :(

> --
> Best Regards,
> Shin'ichiro Kawasaki

2021-11-16 08:02:55

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

Hi Shinichiro

On Mon, 15 Nov 2021 at 19:34, Valentin Schneider
<[email protected]> wrote:
>
> On 15/11/21 10:46, Shinichiro Kawasaki wrote:
> > Oh, this is the exactly same fix as I posted before [1]. It is a little bit sad
> > that my post did not get reviewed. Anyway, good to see the issue fixed. Thanks.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> >
>
> Oh, that sucks, sorry about that. I do try to keep an eye out for sched
> stuff sent to LKML, but clearly there are some that fall through :(

I would advise you to use get_maintainer.pl to make sure to not miss
to cc someone.
Like valentin, this patch has been lost in the lkml flow and i haven't
noticed it

Regards,
Vincent
>
> > --
> > Best Regards,
> > Shin'ichiro Kawasaki

2021-11-16 08:45:24

by Shinichiro Kawasaki

[permalink] [raw]
Subject: Re: [PATCH] sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()

On Nov 16, 2021 / 09:02, Vincent Guittot wrote:
> Hi Shinichiro
>
> On Mon, 15 Nov 2021 at 19:34, Valentin Schneider
> <[email protected]> wrote:
> >
> > On 15/11/21 10:46, Shinichiro Kawasaki wrote:
> > > Oh, this is the exactly same fix as I posted before [1]. It is a little bit sad
> > > that my post did not get reviewed. Anyway, good to see the issue fixed. Thanks.
> > >
> > > [1] https://lore.kernel.org/all/[email protected]/
> > >
> >
> > Oh, that sucks, sorry about that. I do try to keep an eye out for sched
> > stuff sent to LKML, but clearly there are some that fall through :(
>
> I would advise you to use get_maintainer.pl to make sure to not miss
> to cc someone.
> Like valentin, this patch has been lost in the lkml flow and i haven't
> noticed it

Hi Velentin, Vincent, thank you for the comments.

I have run get_maintainer.pl on may patch, and it listed 9 maintainers and
reviewers. But my patch had only 2 out of the 9 in its To/Cc list, which was
too small. For next time, I will use get_maintainer.pl and post with full
To/Cc list. Thanks!

--
Best Regards,
Shin'ichiro Kawasaki