2020-04-15 21:42:55

by Qais Yousef

[permalink] [raw]
Subject: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

Now that we have a proper function that returns a 'random' CPU in a mask [1],
utilize it in find_lowest_rq() to solve the thundering herd issue described
in this thread:

https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/

But as a preamble, I noticed that the new cpumask_any_and_distribute() is
actually an alias for cpumask_any_and(), which is documented as returning
a 'random' CPU but actually just does cpumask_first_and().
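
For reference, the pre-series include/linux/cpumask.h has roughly:

#define cpumask_any(srcp) cpumask_first(srcp)
#define cpumask_any_and(mask1, mask2) cpumask_first_and((mask1), (mask2))

i.e. both 'any' helpers silently degenerate to 'first'; the first line is
what patch 2 removes, the second is what patch 1 replaces.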

The first 3 patches clean up the API so that the whole family of
cpumask_any*() takes advantage of the new 'random' behavior, and in patch
4 I convert cpumask_first_and() --> cpumask_any_and() in find_lowest_rq()
to better distribute the RT tasks that wake up simultaneously.

[1] https://lore.kernel.org/lkml/[email protected]/

CC: Juri Lelli <[email protected]>
CC: Vincent Guittot <[email protected]>
CC: Dietmar Eggemann <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ben Segall <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Yury Norov <[email protected]>
CC: Paul Turner <[email protected]>
CC: Alexey Dobriyan <[email protected]>
CC: Josh Don <[email protected]>
CC: Pavan Kondeti <[email protected]>
CC: [email protected]

Qais Yousef (4):
cpumask: Rename cpumask_any_and_distribute
cpumask: Make cpumask_any() truly random
cpumask: Convert cpumask_any_but() to the new random function
sched/rt: Better distribute tasks that wake up simultaneously

include/linux/cpumask.h | 33 ++++++-----------
kernel/sched/core.c | 2 +-
kernel/sched/rt.c | 4 +-
lib/cpumask.c | 82 +++++++++++++++++++++++++++--------------
4 files changed, 68 insertions(+), 53 deletions(-)

--
2.17.1


2020-04-15 21:43:09

by Qais Yousef

[permalink] [raw]
Subject: [PATCH 2/4] cpumask: Make cpumask_any() truly random

Commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
added a new cpumask_any_and_distribute() which truly returns a random
cpu within the mask.

The previous patch renamed that function to cpumask_any_and(), so that
existing users can take advantage of the new randomness behavior.

Build on that, and make cpumask_any() truly random too by reusing the
logic from cpumask_any_and().

Signed-off-by: Qais Yousef <[email protected]>
CC: Juri Lelli <[email protected]>
CC: Vincent Guittot <[email protected]>
CC: Dietmar Eggemann <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ben Segall <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Yury Norov <[email protected]>
CC: Paul Turner <[email protected]>
CC: Alexey Dobriyan <[email protected]>
CC: Josh Don <[email protected]>
CC: Pavan Kondeti <[email protected]>
CC: [email protected]
---
include/linux/cpumask.h | 14 ++++++--------
lib/cpumask.c | 24 ++++++++++++++++++++++++
2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index e4d6d140a67c..7fb25d256043 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node)
return 0;
}

+static inline int cpumask_any(const struct cpumask *src1p)
+{
+ return 0;
+}
+
static inline int cpumask_any_and(const struct cpumask *src1p,
const struct cpumask *src2p)
{
@@ -251,6 +256,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
unsigned int cpumask_local_spread(unsigned int i, int node);
+int cpumask_any(const struct cpumask *srcp);
int cpumask_any_and(const struct cpumask *src1p, const struct cpumask *src2p);

/**
@@ -600,14 +606,6 @@ static inline void cpumask_copy(struct cpumask *dstp,
bitmap_copy(cpumask_bits(dstp), cpumask_bits(srcp), nr_cpumask_bits);
}

-/**
- * cpumask_any - pick a "random" cpu from *srcp
- * @srcp: the input cpumask
- *
- * Returns >= nr_cpu_ids if no cpus set.
- */
-#define cpumask_any(srcp) cpumask_first(srcp)
-
/**
* cpumask_first_and - return the first cpu from *srcp1 & *srcp2
* @src1p: the first input
diff --git a/lib/cpumask.c b/lib/cpumask.c
index b527a153b023..bcac63e45374 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -259,3 +259,27 @@ int cpumask_any_and(const struct cpumask *src1p, const struct cpumask *src2p)
return next;
}
EXPORT_SYMBOL(cpumask_any_and);
+
+/**
+ * cpumask_any - pick a "random" cpu from *srcp
+ * @srcp: the input cpumask
+ *
+ * Returns >= nr_cpu_ids if no cpus set.
+ */
+int cpumask_any(const struct cpumask *srcp)
+{
+ int next, prev;
+
+ /* NOTE: our first selection will skip 0. */
+ prev = __this_cpu_read(distribute_cpu_mask_prev);
+
+ next = cpumask_next(prev, srcp);
+ if (next >= nr_cpu_ids)
+ next = cpumask_first(srcp);
+
+ if (next < nr_cpu_ids)
+ __this_cpu_write(distribute_cpu_mask_prev, next);
+
+ return next;
+}
+EXPORT_SYMBOL(cpumask_any);
--
2.17.1

2020-04-15 21:43:46

by Qais Yousef

[permalink] [raw]
Subject: [PATCH 4/4] sched/rt: Better distribute tasks that wake up simultaneously

If multiple RT tasks of the same priority wake up simultaneously,
select_task_rq_rt() will always return the same CPU for all of them
because find_lowest_rq() always returns the first CPU in the
lowest_mask.

Use cpumask_any_and() to randomize the selection and hence better
distribute the tasks.

This helps alleviate the thundering herd issue described here when
multiple tasks of the same priority wake up simultaneously:

https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/
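
As a rough illustration of the effect, here is a minimal userspace model of
the per-CPU cursor, with an 8-bit word standing in for the cpumask
(illustrative only, not kernel code):

#include <stdio.h>

/* Models distribute_cpu_mask_prev; like the kernel's, the first pick skips CPU0. */
static int prev;

/* Find the first set bit strictly after 'after'; 8 plays the role of nr_cpu_ids. */
static int next_set_bit(unsigned int mask, int after)
{
	int cpu;

	for (cpu = after + 1; cpu < 8; cpu++)
		if (mask & (1U << cpu))
			return cpu;
	return 8;
}

static int any_cpu(unsigned int mask)
{
	int next = next_set_bit(mask, prev);

	if (next >= 8)
		next = next_set_bit(mask, -1);	/* wrap to the first set bit */
	if (next < 8)
		prev = next;
	return next;
}

int main(void)
{
	unsigned int lowest_mask = 0x2c;	/* CPUs 2, 3 and 5 are eligible */
	int i;

	/* Prints CPU2, CPU3, CPU5, CPU2, ... rather than CPU2 every time. */
	for (i = 0; i < 6; i++)
		printf("wakeup %d -> CPU%d\n", i, any_cpu(lowest_mask));
	return 0;
}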

Signed-off-by: Qais Yousef <[email protected]>
CC: Juri Lelli <[email protected]>
CC: Vincent Guittot <[email protected]>
CC: Dietmar Eggemann <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ben Segall <[email protected]>
CC: Mel Gorman <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Yury Norov <[email protected]>
CC: Paul Turner <[email protected]>
CC: Alexey Dobriyan <[email protected]>
CC: Josh Don <[email protected]>
CC: Pavan Kondeti <[email protected]>
CC: [email protected]
---
kernel/sched/rt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index df11d88c9895..16c7eede370a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1750,8 +1750,8 @@ static int find_lowest_rq(struct task_struct *task)
return this_cpu;
}

- best_cpu = cpumask_first_and(lowest_mask,
- sched_domain_span(sd));
+ best_cpu = cpumask_any_and(lowest_mask,
+ sched_domain_span(sd));
if (best_cpu < nr_cpu_ids) {
rcu_read_unlock();
return best_cpu;
--
2.17.1

2020-04-15 21:45:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/4] cpumask: Make cpumask_any() truly random

On Tue, 14 Apr 2020 16:05:54 +0100
Qais Yousef <[email protected]> wrote:

> Commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> added a new cpumask_any_and_distribute() which truly returns a random
> cpu within the mask.
>
> The previous patch renamed that function to cpumask_any_and(), so that
> existing users can take advantage of the new randomness behavior.
>
> Build on that, and make cpumask_any() truly random too by reusing the
> logic from cpumask_any_and().
>
> Signed-off-by: Qais Yousef <[email protected]>
> CC: Juri Lelli <[email protected]>
> CC: Vincent Guittot <[email protected]>
> CC: Dietmar Eggemann <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Ben Segall <[email protected]>
> CC: Mel Gorman <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Yury Norov <[email protected]>
> CC: Paul Turner <[email protected]>
> CC: Alexey Dobriyan <[email protected]>
> CC: Josh Don <[email protected]>
> CC: Pavan Kondeti <[email protected]>
> CC: [email protected]
> ---
> include/linux/cpumask.h | 14 ++++++--------
> lib/cpumask.c | 24 ++++++++++++++++++++++++
> 2 files changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index e4d6d140a67c..7fb25d256043 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node)
> return 0;
> }
>
> +static inline int cpumask_any(const struct cpumask *src1p)
> +{
> + return 0;
> +}
> +
> static inline int cpumask_any_and(const struct cpumask *src1p,
> const struct cpumask *src2p)
> {
> @@ -251,6 +256,7 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
> int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
> int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
> unsigned int cpumask_local_spread(unsigned int i, int node);
> +int cpumask_any(const struct cpumask *srcp);
> int cpumask_any_and(const struct cpumask *src1p, const struct cpumask *src2p);
>
> /**
> @@ -600,14 +606,6 @@ static inline void cpumask_copy(struct cpumask *dstp,
> bitmap_copy(cpumask_bits(dstp), cpumask_bits(srcp), nr_cpumask_bits);
> }
>
> -/**
> - * cpumask_any - pick a "random" cpu from *srcp
> - * @srcp: the input cpumask
> - *
> - * Returns >= nr_cpu_ids if no cpus set.
> - */
> -#define cpumask_any(srcp) cpumask_first(srcp)
> -
> /**
> * cpumask_first_and - return the first cpu from *srcp1 & *srcp2
> * @src1p: the first input
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index b527a153b023..bcac63e45374 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -259,3 +259,27 @@ int cpumask_any_and(const struct cpumask *src1p, const struct cpumask *src2p)
> return next;
> }
> EXPORT_SYMBOL(cpumask_any_and);
> +
> +/**
> + * cpumask_any - pick a "random" cpu from *srcp
> + * @srcp: the input cpumask
> + *
> + * Returns >= nr_cpu_ids if no cpus set.
> + */
> +int cpumask_any(const struct cpumask *srcp)
> +{
> + int next, prev;
> +
> + /* NOTE: our first selection will skip 0. */
> + prev = __this_cpu_read(distribute_cpu_mask_prev);
> +
> + next = cpumask_next(prev, srcp);
> + if (next >= nr_cpu_ids)
> + next = cpumask_first(srcp);
> +
> + if (next < nr_cpu_ids)
> + __this_cpu_write(distribute_cpu_mask_prev, next);

Do we care if this gets preempted and migrated to a new CPU where we read
"prev" from one distribute_cpu_mask_prev on one CPU and write it to another
CPU?

-- Steve

> +
> + return next;
> +}
> +EXPORT_SYMBOL(cpumask_any);

2020-04-15 21:53:31

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

Hi,

On 14/04/20 16:05, Qais Yousef wrote:
> Now that we have a proper function that returns a 'random' CPU in a mask [1],
> utilize it in find_lowest_rq() to solve the thundering herd issue described
> in this thread:
>
> https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/
>
> But as a preamble, I noticed that the new cpumask_any_and_distribute() is
> actually an alias for cpumask_any_and(), which is documented as returning
> a 'random' CPU but actually just does cpumask_first_and().
>
> The first 3 patches clean up the API so that the whole family of
> cpumask_any*() takes advantage of the new 'random' behavior

I'm a bit wary about such blanket changes. I feel like most places impacted
by this change don't gain anything by using the random thing. In sched land
that would be:

- The single cpumask_any() in core.c::select_task_rq()
- Pretty much any function that wants a CPU id to dereference a
root_domain; there's some of them in deadline.c, topology.c

Looking some more into it, there are shadier things:

- cpufreq_offline() uses cpumask_any() to figure out the new policy
leader... That one should be cpumask_first()
- gic_set_affinity() uses cpumask_any_and() (in the common case). If this
starts using randomness, you will stop affining e.g. all SPIs to CPU0
by default (!!!)
- ... and there might be more

I think people went with cpumask_any_* mostly because there is just
cpumask_first() while there are more cpumask_any_* variants, and since
those have been returning the first set CPU for over a decade people just
went with it.

To move this forward, I would suggest renaming the current cpumask_any_*()
into cpumask_first_*(), and THEN introduce the new pseudo-random
ones. People are then free to hand-fix specific locations if it makes sense
there, like you're doing for RT.

I think it's safe to say the vast majority of the current callers do not
require randomness - the exceptions should mainly be scheduler / workqueues
and the like.

> and in patch
> 4 I convert cpumask_first_and() --> cpumask_any_and() in find_lowest_rq()
> to better distribute the RT tasks that wake up simultaneously.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> CC: Juri Lelli <[email protected]>
> CC: Vincent Guittot <[email protected]>
> CC: Dietmar Eggemann <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Ben Segall <[email protected]>
> CC: Mel Gorman <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: Thomas Gleixner <[email protected]>
> CC: Yury Norov <[email protected]>
> CC: Paul Turner <[email protected]>
> CC: Alexey Dobriyan <[email protected]>
> CC: Josh Don <[email protected]>
> CC: Pavan Kondeti <[email protected]>
> CC: [email protected]
>
> Qais Yousef (4):
> cpumask: Rename cpumask_any_and_distribute
> cpumask: Make cpumask_any() truly random
> cpumask: Convert cpumask_any_but() to the new random function
> sched/rt: Better distribute tasks that wake up simultaneously
>
> include/linux/cpumask.h | 33 ++++++-----------
> kernel/sched/core.c | 2 +-
> kernel/sched/rt.c | 4 +-
> lib/cpumask.c | 82 +++++++++++++++++++++++++++--------------
> 4 files changed, 68 insertions(+), 53 deletions(-)

2020-04-15 21:55:57

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()


On 14/04/20 21:27, Steven Rostedt wrote:
> On Tue, 14 Apr 2020 19:58:49 +0100
> Valentin Schneider <[email protected]> wrote:
>
>> To move this forward, I would suggest renaming the current cpumask_any_*()
>> into cpumask_first_*(), and THEN introduce the new pseudo-random
>> ones. People are then free to hand-fix specific locations if it makes sense
>> there, like you're doing for RT.
>
> Or leave "cpumask_any()" as is, and create a new "cpumask_random()" for
> this purpose.
>

Sure; I'll be happy as long as there isn't a blanket replacement of
existing callsites.

> -- Steve

2020-04-15 21:56:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On Tue, 14 Apr 2020 19:58:49 +0100
Valentin Schneider <[email protected]> wrote:

> To move this forward, I would suggest renaming the current cpumask_any_*()
> into cpumask_first_*(), and THEN introduce the new pseudo-random
> ones. People are then free to hand-fix specific locations if it makes sense
> there, like you're doing for RT.

Or leave "cpumask_any()" as is, and create a new "cpumask_random()" for
this purpose.

-- Steve

2020-04-15 22:51:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4] cpumask: Make cpumask_any() truly random

On Tue, Apr 14, 2020 at 12:19:56PM -0400, Steven Rostedt wrote:

> > +/**
> > + * cpumask_any - pick a "random" cpu from *srcp
> > + * @srcp: the input cpumask
> > + *
> > + * Returns >= nr_cpu_ids if no cpus set.
> > + */
> > +int cpumask_any(const struct cpumask *srcp)
> > +{
> > + int next, prev;
> > +
> > + /* NOTE: our first selection will skip 0. */
> > + prev = __this_cpu_read(distribute_cpu_mask_prev);
> > +
> > + next = cpumask_next(prev, srcp);
> > + if (next >= nr_cpu_ids)
> > + next = cpumask_first(srcp);
> > +
> > + if (next < nr_cpu_ids)
> > + __this_cpu_write(distribute_cpu_mask_prev, next);
>
> Do we care if this gets preempted and migrated to a new CPU where we read
> "prev" from one distribute_cpu_mask_prev on one CPU and write it to another
> CPU?

I don't think we do; that just adds to the randomness ;-), but you do
raise a good point in that __this_cpu_*() ops assume preemption is
already disabled, which is true of the one existing
cpumask_any_and_distribute() caller, but is no longer true after patch
1, and this patch repeats the mistake.

So either we need to disable preemption across the function or
transition to this_cpu_*() ops.
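
A minimal sketch of both options, reusing the names from the patch
(illustrative only, not a tested fix):

/* Option A: keep __this_cpu_*() but disable preemption across the RMW. */
int cpumask_any(const struct cpumask *srcp)
{
	int next, prev;

	preempt_disable();
	prev = __this_cpu_read(distribute_cpu_mask_prev);

	next = cpumask_next(prev, srcp);
	if (next >= nr_cpu_ids)
		next = cpumask_first(srcp);

	if (next < nr_cpu_ids)
		__this_cpu_write(distribute_cpu_mask_prev, next);
	preempt_enable();

	return next;
}

/*
 * Option B: switch to the preemption-safe this_cpu_*() ops. A migration
 * between the read and the write then only means a different CPU's
 * cursor is advanced, which is harmless here:
 *
 *	prev = this_cpu_read(distribute_cpu_mask_prev);
 *	...
 *	this_cpu_write(distribute_cpu_mask_prev, next);
 */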

2020-04-15 22:52:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On Tue, Apr 14, 2020 at 04:27:42PM -0400, Steven Rostedt wrote:
> On Tue, 14 Apr 2020 19:58:49 +0100
> Valentin Schneider <[email protected]> wrote:
>
> > To move this forward, I would suggest renaming the current cpumask_any_*()
> > into cpumask_first_*(), and THEN introduce the new pseudo-random
> > ones. People are then free to hand-fix specific locations if it makes sense
> > there, like you're doing for RT.
>
> Or leave "cpumask_any()" as is, and create a new "cpumask_random()" for
> this purpose.

Well, that's just twisting words, not sure I like that. 'Any' really
means 'any'. So in order to preserve long term sanity, I'd vote for
Valentin's approach of converting existing users over to first.

2020-04-16 00:04:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On Wed, 15 Apr 2020 11:39:35 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Apr 14, 2020 at 04:27:42PM -0400, Steven Rostedt wrote:
> > On Tue, 14 Apr 2020 19:58:49 +0100
> > Valentin Schneider <[email protected]> wrote:
> >
> > > To move this forward, I would suggest renaming the current cpumask_any_*()
> > > into cpumask_first_*(), and THEN introduce the new pseudo-random
> > > ones. People are then free to hand-fix specific locations if it makes sense
> > > there, like you're doing for RT.
> >
> > Or leave "cpumask_any()" as is, and create a new "cpumask_random()" for
> > this purpose.
>
> Well, that's just twisting words, not sure I like that. 'Any' really
> means 'any'. So in order to preserve long term sanity, I'd vote for
> Valentin's approach of converting existing users over to first.

Well, to me "any" means it can be "any" and leaves it up to the
implementation to decide. Thus, "any" can always be the first one, or the
last one, or some random one. It means "you don't care". If someone comes
to you with a deck of cards and says "pick any card", picking the
first one every time is legitimate.

But saying "random" or perhaps better "shuffle" documents that you expect
something different each time.

I guess the best thing to do is to look at how they are used in the kernel,
and see if picking the same one (first or last) could cause issues. If all
the use cases really want something different each time, then great, I
agree, let's make "any" give something different. But if the use cases are
"I don't care, just give me something fast" then I think we should leave it
as is, since the "random" version does add a slight overhead.

To me it comes down to documenting the use case. "first" means you really
want the first one. "any" would mean you don't care which one. "random" (or
whatever) means you would like to get something different each time.

How about calling it: cpumask_surprise_me().

-- Steve

2020-04-20 15:45:50

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 2/4] cpumask: Make cpumask_any() truly random

On 04/15/20 11:36, Peter Zijlstra wrote:
> On Tue, Apr 14, 2020 at 12:19:56PM -0400, Steven Rostedt wrote:
>
> > > +/**
> > > + * cpumask_any - pick a "random" cpu from *srcp
> > > + * @srcp: the input cpumask
> > > + *
> > > + * Returns >= nr_cpu_ids if no cpus set.
> > > + */
> > > +int cpumask_any(const struct cpumask *srcp)
> > > +{
> > > + int next, prev;
> > > +
> > > + /* NOTE: our first selection will skip 0. */
> > > + prev = __this_cpu_read(distribute_cpu_mask_prev);
> > > +
> > > + next = cpumask_next(prev, srcp);
> > > + if (next >= nr_cpu_ids)
> > > + next = cpumask_first(srcp);
> > > +
> > > + if (next < nr_cpu_ids)
> > > + __this_cpu_write(distribute_cpu_mask_prev, next);
> >
> > Do we care if this gets preempted and migrated to a new CPU where we read
> > "prev" from one distribute_cpu_mask_prev on one CPU and write it to another
> > CPU?
>
> I don't think we do; that just adds to the randomness ;-), but you do

Yep we don't care and it should enhance the randomness.

> raise a good point in that __this_cpu_*() ops assume preemption is
> already disabled, which is true of the one existing
> cpumask_any_and_distribute() caller, but is no longer true after patch
> 1, and this patch repeats the mistake.
>
> So either we need to disable preemption across the function or
> transition to this_cpu_*() ops.

Sorry, I wasn't aware of the preemption check in __this_cpu_write().

Transitioning to this_cpu_write() makes sense, unless Josh comes back
saying it'll break something he noticed.

Thanks

--
Qais Yousef

2020-04-20 21:38:47

by Josh Don

[permalink] [raw]
Subject: Re: [PATCH 2/4] cpumask: Make cpumask_any() truly random

On Mon, Apr 20, 2020 at 8:43 AM Qais Yousef <[email protected]> wrote:
>
> On 04/15/20 11:36, Peter Zijlstra wrote:
> > On Tue, Apr 14, 2020 at 12:19:56PM -0400, Steven Rostedt wrote:
> > > Do we care if this gets preempted and migrated to a new CPU where we read
> > > "prev" from one distribute_cpu_mask_prev on one CPU and write it to another
> > > CPU?
> >
> > I don't think we do; that just adds to the randomness ;-), but you do
>
> Yep we don't care and it should enhance the randomness.
>
> > raise a good point in that __this_cpu_*() ops assume preemption is
> > already disabled, which is true of the one existing
> > cpumask_any_and_distribute() caller, but is no longer true after patch
> > 1, and this patch repeats the mistake.
> >
> > So either we need to disable preemption across the function or
> > transition to this_cpu_*() ops.
>
> Sorry, I wasn't aware of the preemption check in __this_cpu_write().
>
> Transitioning to this_cpu_write() makes sense, unless Josh comes back
> saying it'll break something he noticed.

Yep, this_cpu_* makes sense to me. Preemption is ok, since prev must
always be a valid cpu id, thus we just get a little more _random_ from
this pseudorandom implementation.

2020-04-21 12:14:39

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

+Rafael +Marc

(There's a question below about a function that you maintain; the context is
that cpumask_any() now truly returns a random CPU in the mask, and there was
concern it might break something you maintain.)

On 04/14/20 19:58, Valentin Schneider wrote:
> Hi,
>
> On 14/04/20 16:05, Qais Yousef wrote:
> > Now that we have a proper function that returns a 'random' CPU in a mask [1],
> > utilize it in find_lowest_rq() to solve the thundering herd issue described
> > in this thread:
> >
> > https://lore.kernel.org/lkml/20200219140243.wfljmupcrwm2jelo@e107158-lin/
> >
> > But as a preamble, I noticed that the new cpumask_any_and_distribute() is
> > actually an alias for cpumask_any_and(), which is documented as returning
> > a 'random' CPU but actually just does cpumask_first_and().
> >
> > The first 3 patches clean up the API so that the whole family of
> > cpumask_any*() takes advantage of the new 'random' behavior
>
> I'm a bit wary about such blanket changes. I feel like most places impacted
> by this change don't gain anything by using the random thing. In sched land
> that would be:

The API has always been clear that cpumask_any() returns a random CPU within
the mask. And given it's a one-liner with cpumask_first() directly visible,
a user who made the choice to stick with cpumask_any() is indicating that's
what they wanted.

Probably a lot of them don't care what CPU is returned and are happy with a
random value. I don't see why this has to have an adverse effect. Some could
benefit, like my use case here; others truly don't care, and then it's fine
to return anything, as requested.

>
> - The single cpumask_any() in core.c::select_task_rq()
> - Pretty much any function that wants a CPU id to dereference a
> root_domain; there's some of them in deadline.c, topology.c
>
> Looking some more into it, there are shadier things:
>
> - cpufreq_offline() uses cpumask_any() to figure out the new policy
> leader... That one should be cpumask_first()

I CCed Rafael, who's the maintainer of this file and can speak better to
what should be expected.

I don't see why it _should_ be cpumask_first(). AFAICT any CPU can be the new
leader, so nothing breaks here.

> - gic_set_affinity() uses cpumask_any_and() (in the common case). If this
> starts using randomness, you will stop affining e.g. all SPIs to CPU0
> by default (!!!)

I CCed Marc, who's the maintainer of this file and can clarify better
whether this really breaks anything.

If any interrupt expects to be affined to a specific CPU then this must be
described in DT/driver. I think the GIC controller is free to distribute them
to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
in userspace, IIUC.

I don't see how cpumask_any_and() breaks anything here either. I actually
think it improves things by better distributing the irqs on the system by
default.

Marc will know better for sure. It's been a long time since I looked at
irqchip code.

> - ... and there might be more

I can appreciate that you have some concerns this might break something. But
IMO those will be isolated cases of bad copy/paste. I think the blanket
conversion of cpumask_any() to cpumask_first() is the worse one because it
dishonours the choice developers made to pick one over the other. As
I mentioned above, the doc was clear it would return a random value, and the
one-liner gave a clear alternative to any user who missed the cpumask_first()
API.

The name of the API is descriptive enough that it's hard to believe most
users got it wrong. And I think it acts as good natural documentation of the
code.

>
> I think people went with cpumask_any_* mostly because there is just
> cpumask_first() while there are more cpumask_any_* variants, and since
> those have been returning the first set CPU for over a decade people just
> went with it.

This is speculation; unless we ask every developer, we can't tell. Given the
clear documentation that cpumask_any() returns a random value, the natural
conclusion is that people want that or truly don't care.

>
> To move this forward, I would suggest renaming the current cpumask_any_*()
> into cpumask_first_*(), and THEN introduce the new pseudo-random
> ones. People are then free to hand-fix specific locations if it makes sense
> there, like you're doing for RT.

I strongly disagree with this. I'd rather keep the status quo and just use
the new function in rt.c as it's currently named.

>
> I think it's safe to say the vast majority of the current callers do not
> require randomness - the exceptions should mainly be scheduler / workqueues
> and the like.

It's hard to judge. As the API clearly says this returns a random value,
I think assuming that most users got it wrong is the unnatural way of
thinking. If there happen to be bad users, those are isolated cases and bugs
that can certainly be fixed easily.

Thanks

--
Qais Yousef

>
> > and in patch
> > 4 I convert cpumask_first_and() --> cpumask_any_and() in find_lowest_rq()
> > to better distribute the RT tasks that wake up simultaneously.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > CC: Juri Lelli <[email protected]>
> > CC: Vincent Guittot <[email protected]>
> > CC: Dietmar Eggemann <[email protected]>
> > CC: Steven Rostedt <[email protected]>
> > CC: Ben Segall <[email protected]>
> > CC: Mel Gorman <[email protected]>
> > CC: Andrew Morton <[email protected]>
> > CC: Thomas Gleixner <[email protected]>
> > CC: Yury Norov <[email protected]>
> > CC: Paul Turner <[email protected]>
> > CC: Alexey Dobriyan <[email protected]>
> > CC: Josh Don <[email protected]>
> > CC: Pavan Kondeti <[email protected]>
> > CC: [email protected]
> >
> > Qais Yousef (4):
> > cpumask: Rename cpumask_any_and_distribute
> > cpumask: Make cpumask_any() truly random
> > cpumask: Convert cpumask_any_but() to the new random function
> > sched/rt: Better distribute tasks that wake up simultaneously
> >
> > include/linux/cpumask.h | 33 ++++++-----------
> > kernel/sched/core.c | 2 +-
> > kernel/sched/rt.c | 4 +-
> > lib/cpumask.c | 82 +++++++++++++++++++++++++++--------------
> > 4 files changed, 68 insertions(+), 53 deletions(-)

2020-04-21 12:19:36

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On 04/15/20 11:39, Peter Zijlstra wrote:
> On Tue, Apr 14, 2020 at 04:27:42PM -0400, Steven Rostedt wrote:
> > On Tue, 14 Apr 2020 19:58:49 +0100
> > Valentin Schneider <[email protected]> wrote:
> >
> > > To move this forward, I would suggest renaming the current cpumask_any_*()
> > > into cpumask_first_*(), and THEN introduce the new pseudo-random
> > > ones. People are then free to hand-fix specific locations if it makes sense
> > > there, like you're doing for RT.
> >
> > Or leave "cpumask_any()" as is, and create a new "cpumask_random()" for
> > this purpose.
>
> Well, that's just twisting words, not sure I like that. 'Any' really
> means 'any'. So in order to preserve long term sanity, I'd vote for
> Valentin's approach of converting existing users over to first.

I have answered Valentin's email, so hopefully I have addressed the concerns.

If not, I really hate this mass conversion and think we're better off with
the status quo.

Thanks

--
Qais Yousef

2020-04-21 13:20:21

by Valentin Schneider

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()


On 21/04/20 13:13, Qais Yousef wrote:
> On 04/14/20 19:58, Valentin Schneider wrote:
>>
>> I'm a bit wary about such blanket changes. I feel like most places impacted
>> by this change don't gain anything by using the random thing. In sched land
>> that would be:
>
> The API has always been clear that cpumask_any() returns a random CPU within
> the mask. And given it's a one-liner with cpumask_first() directly visible,
> a user who made the choice to stick with cpumask_any() is indicating that's
> what they wanted.
>
> Probably a lot of them don't care what CPU is returned and are happy with a
> random value. I don't see why this has to have an adverse effect. Some could
> benefit, like my use case here; others truly don't care, and then it's fine
> to return anything, as requested.
>

Exactly, *some* (which AFAICT is a minority) might benefit. So why should
all the others pay the price for a functionality they do not need?

I don't think your change would actually cause a splat somewhere; my point
is about changing existing behaviour without having a story for it. The
thing said 'pick a "random" cpu', sure, but it never did that, it always
picked the first.

I've pointed out two examples that want to be cpumask_first(), and I'm
absolutely certain there are more than these two out there. What if folks
ran some performance test and were completely fine with the _first()
behaviour? What tells you randomness won't degrade some cases?

IMO the correct procedure is to keep everything as it is and improve the
specific callsites that benefit from randomness. I get your point that
using cpumask_any() should be a good enough indicator of the latter, but I
don't think it can realistically be followed. To give my PoV, if in the
past someone had used a cpumask_any() where a cpumask_first() could do, I
would've acked it (disclaimer: super representative population of sample
size = 1).

Flipping the switch on everyone to then have a series of patches "oh this
one didn't need it", "this one neither", "I actually need this to be the
first" just feels sloppy.

> I CCed Marc, who's the maintainer of this file and can clarify better
> whether this really breaks anything.
>
> If any interrupt expects to be affined to a specific CPU then this must be
> described in DT/driver. I think the GIC controller is free to distribute them
> to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
> in userspace, IIUC.
>
> I don't see how cpumask_any_and() breaks anything here either. I actually
> think it improves things by better distributing the irqs on the system by
> default.
>

As you say, if someone wants smarter IRQ affinity they can do irq_balancer
and whatnot. The default kernel policy for now has been to shove everything
on the lowest-numbered CPU, and I see no valid reason to change that.

2020-04-21 13:29:48

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On Tue, 21 Apr 2020 at 15:18, Valentin Schneider
<[email protected]> wrote:
>
>
> On 21/04/20 13:13, Qais Yousef wrote:
> > On 04/14/20 19:58, Valentin Schneider wrote:
> >>
> >> I'm a bit wary about such blanket changes. I feel like most places impacted
> >> by this change don't gain anything by using the random thing. In sched land
> >> that would be:
> >
> > > The API has always been clear that cpumask_any() returns a random CPU within
> > > the mask. And given it's a one-liner with cpumask_first() directly visible,
> > > a user who made the choice to stick with cpumask_any() is indicating that's
> > > what they wanted.
> > >
> > > Probably a lot of them don't care what CPU is returned and are happy with a
> > > random value. I don't see why this has to have an adverse effect. Some could
> > > benefit, like my use case here; others truly don't care, and then it's fine
> > > to return anything, as requested.
> >
>
> Exactly, *some* (which AFAICT is a minority) might benefit. So why should
> all the others pay the price for a functionality they do not need?
>
> I don't think your change would actually cause a splat somewhere; my point
> is about changing existing behaviour without having a story for it. The
> thing said 'pick a "random" cpu', sure, but it never did that, it always
> picked the first.
>
> I've pointed out two examples that want to be cpumask_first(), and I'm
> absolutely certain there are more than these two out there. What if folks
> ran some performance test and were completely fine with the _first()
> behaviour? What tells you randomness won't degrade some cases?

I tend to agree that 'any' doesn't mean random, and using a random cpu
will create strange behavior.

One example is irq affinity on a big.LITTLE system. Right now, the irqs are
always pinned to the same CPU (the 1st one, which is most probably a little
core), but with your change we can imagine this changing, and it might even
change over 2 consecutive boots if for whatever reason (and this happens)
the drivers are not probed in the same order. In the end you will run some
tests with the irq on a little core and other times on a big one. More
generally, an SMP system can be impacted because an irq will no longer be
pinned to the same CPU alongside the same other irqs.

>
> IMO the correct procedure is to keep everything as it is and improve the
> specific callsites that benefit from randomness. I get your point that

I agree with this point

> using cpumask_any() should be a good enough indicator of the latter, but I
> don't think it can realistically be followed. To give my PoV, if in the
> past someone had used a cpumask_any() where a cpumask_first() could do, I
> would've acked it (disclaimer: super representative population of sample
> size = 1).
>
> Flipping the switch on everyone to then have a series of patches "oh this
> one didn't need it", "this one neither", "I actually need this to be the
> first" just feels sloppy.
>
> > > I CCed Marc, who's the maintainer of this file and can clarify better
> > > whether this really breaks anything.
> >
> > If any interrupt expects to be affined to a specific CPU then this must be
> > described in DT/driver. I think the GIC controller is free to distribute them
> > to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
> > in userspace, IIUC.
> >
> > > I don't see how cpumask_any_and() breaks anything here either. I actually
> > > think it improves things by better distributing the irqs on the system by
> > > default.
> >
>
> As you say, if someone wants smarter IRQ affinity they can do irq_balancer
> and whatnot. The default kernel policy for now has been to shove everything
> on the lowest-numbered CPU, and I see no valid reason to change that.

2020-04-21 14:11:06

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On 2020-04-21 14:18, Valentin Schneider wrote:
> On 21/04/20 13:13, Qais Yousef wrote:

[...]

>> I CCed Marc, who's the maintainer of this file and can clarify better
>> whether this really breaks anything.
>>
>> If any interrupt expects to be affined to a specific CPU then this must be
>> described in DT/driver. I think the GIC controller is free to distribute them
>> to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
>> in userspace, IIUC.
>>
>> I don't see how cpumask_any_and() breaks anything here either. I
>> actually think it
>> improves things by better distributing the irqs on the system by
>> default.

That's a pretty bold statement. Unfortunately, it isn't universally true.
Some workload will be very happy with interrupts spread all over the map,
and some others will suffer from it because, well, it interrupts userspace.

> As you say, if someone wants smarter IRQ affinity they can do irq_balancer
> and whatnot. The default kernel policy for now has been to shove everything
> on the lowest-numbered CPU, and I see no valid reason to change that.

Exactly. I would like to keep the kernel policy as simple as possible for
non-managed interrupts (managed interrupts are another kettle of fish
entirely). Userspace is in control to place things "intelligently", so let's
not try and make the kernel smarter than it strictly needs to be.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-04-21 14:25:02

by Yury Norov

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On Tue, Apr 21, 2020 at 03:28:14PM +0200, Vincent Guittot wrote:
> On Tue, 21 Apr 2020 at 15:18, Valentin Schneider
> <[email protected]> wrote:
> >
> >
> > On 21/04/20 13:13, Qais Yousef wrote:
> > > On 04/14/20 19:58, Valentin Schneider wrote:
> > >>
> > >> I'm a bit wary about such blanket changes. I feel like most places impacted
> > >> by this change don't gain anything by using the random thing. In sched land
> > >> that would be:
> > >
> > > The API has always been clear that cpumask_any() returns a random CPU within
> > > the mask. And given it's a one-liner with cpumask_first() directly visible,
> > > a user who made the choice to stick with cpumask_any() is indicating that's
> > > what they wanted.
> > >
> > > Probably a lot of them don't care what CPU is returned and are happy with a
> > > random value. I don't see why this has to have an adverse effect. Some could
> > > benefit, like my use case here; others truly don't care, and then it's fine
> > > to return anything, as requested.
> > >
> >
> > Exactly, *some* (which AFAICT is a minority) might benefit. So why should
> > all the others pay the price for a functionality they do not need?
> >
> > I don't think your change would actually cause a splat somewhere; my point
> > is about changing existing behaviour without having a story for it. The
> > thing said 'pick a "random" cpu', sure, but it never did that, it always
> > picked the first.
> >
> > I've pointed out two examples that want to be cpumask_first(), and I'm
> > absolutely certain there are more than these two out there. What if folks
> > ran some performance test and were completely fine with the _first()
> > behaviour? What tells you randomness won't degrade some cases?
>
> I tend to agree that 'any' doesn't mean random, and using a random cpu
> will create strange behavior.
>
> One example is irq affinity on a big.LITTLE system. Right now, the irqs are
> always pinned to the same CPU (the 1st one, which is most probably a little
> core), but with your change we can imagine this changing, and it might even
> change over 2 consecutive boots if for whatever reason (and this happens)
> the drivers are not probed in the same order. In the end you will run some
> tests with the irq on a little core and other times on a big one. More
> generally, an SMP system can be impacted because an irq will no longer be
> pinned to the same CPU alongside the same other irqs.
>
> >
> > IMO the correct procedure is to keep everything as it is and improve the
> > specific callsites that benefit from randomness. I get your point that
>
> I agree with this point
> > using cpumask_any() should be a good enough indicator of the latter, but I
> > don't think it can realistically be followed. To give my PoV, if in the
> > past someone had used a cpumask_any() where a cpumask_first() could do, I
> > would've acked it (disclaimer: super representative population of sample
> > size = 1).
> >
> > Flipping the switch on everyone to then have a series of patches "oh this
> > one didn't need it", "this one neither", "I actually need this to be the
> > first" just feels sloppy.
> >
> > > I CCed Marc, who's the maintainer of this file and can clarify better
> > > whether this really breaks anything.
> > >
> > > If any interrupt expects to be affined to a specific CPU then this must be
> > > described in DT/driver. I think the GIC controller is free to distribute them
> > > to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
> > > in userspace, IIUC.
> > >
> > > I don't see how cpumask_any_and() breaks anything here either. I actually
> > > think it improves things by better distributing the irqs on the system by
> > > default.
> > >
> >
> > As you say, if someone wants smarter IRQ affinity they can do irq_balancer
> > and whatnot. The default kernel policy for now has been to shove everything
> > on the lowest-numbered CPU, and I see no valid reason to change that.

My 5 cents: I was also surprised when I found cpumask_any() nailed to the first
CPU, but for my use I found it beneficial.

Namely, all system IRQs and other events are targeted at the first CPU, which is
treated as the system maintenance unit. Other CPUs are dedicated to user-specific
payloads using task isolation. This approach improves latency a lot.

Systems that have many cores operating in idle/powersave mode probably benefit
from the current state of things as well - they don't wake up sleeping cores, and
therefore save power and improve IRQ turnaround.

Thanks,
Yury

2020-04-21 14:25:06

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On 04/21/20 15:09, Marc Zyngier wrote:
> On 2020-04-21 14:18, Valentin Schneider wrote:
> > On 21/04/20 13:13, Qais Yousef wrote:
>
> [...]
>
> > > I CCed Marc, who's the maintainer of this file and can clarify better
> > > whether this really breaks anything.
> > >
> > > If any interrupt expects to be affined to a specific CPU then this must be
> > > described in DT/driver. I think the GIC controller is free to distribute them
> > > to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
> > > in userspace, IIUC.
> > >
> > > I don't see how cpumask_any_and() breaks anything here either. I
> > > actually think it
> > > improves things by better distributing the irqs on the system by
> > > default.
>
> That's a pretty bold statement. Unfortunately, it isn't universally true.
> Some workload will be very happy with interrupts spread all over the map,
> and some others will suffer from it because, well, it interrupts userspace.
>
> > As you say, if someone wants smarter IRQ affinity they can do
> > irq_balancer
> > and whatnot. The default kernel policy for now has been to shove
> > everything
> > on the lowest-numbered CPU, and I see no valid reason to change that.
>
> Exactly. I would like to keep the kernel policy as simple as possible for
> non-managed interrupts (managed interrupts are another kettle of fish
> entirely).
> Userspace is in control to place things "intelligently", so let's not try and
> make the kernel smarter than it strictly needs to be.

Fair enough. But why is it asking for cpumask_any() in the first place?

Thanks

--
Qais Yousef

2020-04-21 14:27:39

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On 04/21/20 15:28, Vincent Guittot wrote:
> On Tue, 21 Apr 2020 at 15:18, Valentin Schneider
> <[email protected]> wrote:
> >
> >
> > On 21/04/20 13:13, Qais Yousef wrote:
> > > On 04/14/20 19:58, Valentin Schneider wrote:
> > >>
> > >> I'm a bit wary about such blanket changes. I feel like most places impacted
> > >> by this change don't gain anything by using the random thing. In sched land
> > >> that would be:
> > >
> > > The API has always been clear that cpumask_any() returns a random CPU within
> > > the mask. And given it's a one-liner with cpumask_first() directly visible,
> > > a user who made the choice to stick with cpumask_any() is indicating that's
> > > what they wanted.
> > >
> > > Probably a lot of them don't care what CPU is returned and are happy with a
> > > random value. I don't see why this has to have an adverse effect. Some could
> > > benefit, like my use case here; others truly don't care, and then it's fine
> > > to return anything, as requested.
> > >
> >
> > Exactly, *some* (which AFAICT is a minority) might benefit. So why should
> > all the others pay the price for a functionality they do not need?
> >
> > I don't think your change would actually cause a splat somewhere; my point
> > is about changing existing behaviour without having a story for it. The
> > thing said 'pick a "random" cpu', sure, but it never did that, it always
> > picked the first.
> >
> > I've pointed out two examples that want to be cpumask_first(), and I'm
> > absolutely certain there are more than these two out there. What if folks
> > ran some performance test and were completely fine with the _first()
> > behaviour? What tells you randomness won't degrade some cases?
>
> I tend to agree that 'any' doesn't mean random, and using a random cpu
> will create strange behavior.
>
> One example is irq affinity on a big.LITTLE system. Right now, the irqs are
> always pinned to the same CPU (the 1st one, which is most probably a little
> core), but with your change we can imagine this changing, and it might even
> change over 2 consecutive boots if for whatever reason (and this happens)
> the drivers are not probed in the same order. In the end you will run some
> tests with the irq on a little core and other times on a big one. More
> generally, an SMP system can be impacted because an irq will no longer be
> pinned to the same CPU alongside the same other irqs.

For me this highlights issues that need to be addressed, which I think are
easy enough to fix. But I won't push too hard for this.

Thanks

--
Qais Yousef

2020-04-21 14:29:51

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On 2020-04-21 15:22, Qais Yousef wrote:
> On 04/21/20 15:09, Marc Zyngier wrote:
>> On 2020-04-21 14:18, Valentin Schneider wrote:
>> > On 21/04/20 13:13, Qais Yousef wrote:
>>
>> [...]
>>
>> > > I CCed Marc, who's the maintainer of this file and can clarify better
>> > > whether this really breaks anything.
>> > >
>> > > If any interrupt expects to be affined to a specific CPU then this must be
>> > > described in DT/driver. I think the GIC controller is free to distribute them
>> > > to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
>> > > in userspace, IIUC.
>> > >
>> > > I don't see how cpumask_any_and() breaks anything here either. I
>> > > actually think it
>> > > improves things by better distributing the irqs on the system by
>> > > default.
>>
>> That's a pretty bold statement. Unfortunately, it isn't universally true.
>> Some workload will be very happy with interrupts spread all over the map,
>> and some others will suffer from it because, well, it interrupts userspace.
>>
>> > As you say, if someone wants smarter IRQ affinity they can do irq_balancer
>> > and whatnot. The default kernel policy for now has been to shove everything
>> > on the lowest-numbered CPU, and I see no valid reason to change that.
>>
>> Exactly. I would like to keep the kernel policy as simple as possible for
>> non-managed interrupts (managed interrupts are another kettle of fish
>> entirely). Userspace is in control to place things "intelligently", so
>> let's not try and make the kernel smarter than it strictly needs to be.
>
> Fair enough. But why is it asking for cpumask_any() in the first place?

Implementation detail. Turn it into cpumask_first_and() if you want.

M.
--
Jazz is not dead. It just smells funny...

2020-04-21 14:40:37

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH 0/4] sched/rt: Distribute tasks in find_lowest_rq()

On 04/21/20 15:28, Marc Zyngier wrote:
> On 2020-04-21 15:22, Qais Yousef wrote:
> > On 04/21/20 15:09, Marc Zyngier wrote:
> > > On 2020-04-21 14:18, Valentin Schneider wrote:
> > > > On 21/04/20 13:13, Qais Yousef wrote:
> > >
> > > [...]
> > >
> > > > > I CCed Marc, who's the maintainer of this file and can clarify better
> > > > > whether this really breaks anything.
> > > > >
> > > > > If any interrupt expects to be affined to a specific CPU then this must be
> > > > > described in DT/driver. I think the GIC controller is free to distribute them
> > > > > to any cpu otherwise if !force. Which is usually done by irq_balancer anyway
> > > > > in userspace, IIUC.
> > > > >
> > > > > I don't see how cpumask_any_and() breaks anything here either. I
> > > > > actually think it
> > > > > improves things by better distributing the irqs on the system by
> > > > > default.
> > >
> > > That's a pretty bold statement. Unfortunately, it isn't universally true.
> > > Some workload will be very happy with interrupts spread all over the map,
> > > and some others will suffer from it because, well, it interrupts userspace.
> > >
> > > > As you say, if someone wants smarter IRQ affinity they can do irq_balancer
> > > > and whatnot. The default kernel policy for now has been to shove everything
> > > > on the lowest-numbered CPU, and I see no valid reason to change that.
> > >
> > > Exactly. I would like to keep the kernel policy as simple as possible for
> > > non-managed interrupts (managed interrupts are another kettle of fish
> > > entirely). Userspace is in control to place things "intelligently", so
> > > let's not try and make the kernel smarter than it strictly needs to be.
> >
> > Fair enough. But why is it asking for cpumask_any() in the first place?
>
> Implementation detail. Turn it into cpumask_first_and() if you want.

Will do.

Thanks

--
Qais Yousef

2020-05-28 09:04:41

by Chen, Rong A

[permalink] [raw]
Subject: [cpumask] a7934287d8: BUG:using__this_cpu_read()in_preemptible[#]code:kworker

Greetings,

FYI, we noticed the following commit (built with gcc-7):

commit: a7934287d8a6c43811fca8ddf421b3b6091564f2 ("[PATCH 2/4] cpumask: Make cpumask_any() truly random")
url: https://github.com/0day-ci/linux/commits/Qais-Yousef/sched-rt-Distribute-tasks-in-find_lowest_rq/20200415-040515


in testcase: locktorture
with following parameters:

runtime: 300s
test: cpuhotplug

test-description: This torture test consists of creating a number of kernel threads which acquire the lock and hold it for specific amount of time, thus simulating different critical region behaviors.
test-url: https://www.kernel.org/doc/Documentation/locking/locktorture.txt


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):


+----------------------------------------------------------+------------+------------+
| | bb4424cb99 | a7934287d8 |
+----------------------------------------------------------+------------+------------+
| boot_successes | 8 | 0 |
| boot_failures | 0 | 8 |
| BUG:using__this_cpu_read()in_preemptible[#]code:kworker | 0 | 8 |
| BUG:using__this_cpu_write()in_preemptible[#]code:kworker | 0 | 8 |
+----------------------------------------------------------+------------+------------+


If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>


[ 69.501074] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/0:1/12
[ 69.505571] caller is cpumask_any+0x14/0x70
[ 69.508332] CPU: 1 PID: 12 Comm: kworker/0:1 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 69.512218] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 69.516270] Workqueue: events cpuset_hotplug_workfn
[ 69.519114] Call Trace:
[ 69.521394] dump_stack+0x8f/0xcb
[ 69.524026] __this_cpu_preempt_check+0xaf/0xc0
[ 69.526886] cpumask_any+0x14/0x70
[ 69.529265] partition_sched_domains_locked+0x152/0x500
[ 69.532200] rebuild_sched_domains_locked+0x5a1/0xba0
[ 69.535164] ? down_write+0x38/0x70
[ 69.537489] ? cpumask_next+0x17/0x20
[ 69.539929] ? percpu_down_write+0xa3/0x100
[ 69.542643] rebuild_sched_domains+0x1b/0x30
[ 69.545257] cpuset_hotplug_workfn+0x74e/0x1310
[ 69.547867] ? process_one_work+0x1be/0x5f0
[ 69.550317] process_one_work+0x23e/0x5f0
[ 69.552855] ? worker_thread+0xce/0x390
[ 69.555291] worker_thread+0x3c/0x390
[ 69.557624] ? process_one_work+0x5f0/0x5f0
[ 69.560003] kthread+0x145/0x170
[ 69.562255] ? kthread_park+0x90/0x90
[ 69.564666] ret_from_fork+0x3a/0x50
[ 69.567057] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/0:1/12
[ 69.570605] caller is cpumask_any+0x4d/0x70
[ 69.573025] CPU: 1 PID: 12 Comm: kworker/0:1 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 69.576279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 69.579764] Workqueue: events cpuset_hotplug_workfn
[ 69.582268] Call Trace:
[ 69.584206] dump_stack+0x8f/0xcb
[ 69.586294] __this_cpu_preempt_check+0xaf/0xc0
[ 69.588763] cpumask_any+0x4d/0x70
[ 69.591035] partition_sched_domains_locked+0x152/0x500
[ 69.593763] rebuild_sched_domains_locked+0x5a1/0xba0
[ 69.596343] ? down_write+0x38/0x70
[ 69.598635] ? cpumask_next+0x17/0x20
[ 69.600800] ? percpu_down_write+0xa3/0x100
[ 69.603283] rebuild_sched_domains+0x1b/0x30
[ 69.605590] cpuset_hotplug_workfn+0x74e/0x1310
[ 69.608078] ? process_one_work+0x1be/0x5f0
[ 69.610607] process_one_work+0x23e/0x5f0
[ 69.613089] ? worker_thread+0xce/0x390
[ 69.615435] worker_thread+0x3c/0x390
[ 69.617739] ? process_one_work+0x5f0/0x5f0
[ 69.620199] kthread+0x145/0x170
[ 69.622269] ? kthread_park+0x90/0x90
[ 69.624534] ret_from_fork+0x3a/0x50
[ 72.608226] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 72.623234] kvm-clock: cpu 0, msr 23b401001, secondary cpu clock
[ 72.623294] masked ExtINT on CPU#0
[ 72.655688] KVM setup async PF for cpu 0
[ 72.658605] kvm-stealtime: cpu 0, msr 23fc30040
[ 75.728553] Unregister pv shared memory for cpu 1
[ 75.741355] smpboot: CPU 1 is now offline
[ 75.749031] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/1:1/32
[ 75.753000] caller is cpumask_any+0x14/0x70
[ 75.755574] CPU: 0 PID: 32 Comm: kworker/1:1 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 75.758960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 75.762748] Workqueue: events cpuset_hotplug_workfn
[ 75.765380] Call Trace:
[ 75.767570] dump_stack+0x8f/0xcb
[ 75.769807] __this_cpu_preempt_check+0xaf/0xc0
[ 75.772365] cpumask_any+0x14/0x70
[ 75.774705] partition_sched_domains_locked+0x152/0x500
[ 75.777399] rebuild_sched_domains_locked+0x5a1/0xba0
[ 75.779940] ? down_write+0x38/0x70
[ 75.782189] ? cpumask_next+0x17/0x20
[ 75.784537] ? percpu_down_write+0xa3/0x100
[ 75.787181] rebuild_sched_domains+0x1b/0x30
[ 75.789560] cpuset_hotplug_workfn+0x74e/0x1310
[ 75.791999] ? process_one_work+0x1be/0x5f0
[ 75.794440] process_one_work+0x23e/0x5f0
[ 75.796702] ? worker_thread+0xce/0x390
[ 75.799062] worker_thread+0x3c/0x390
[ 75.801445] ? process_one_work+0x5f0/0x5f0
[ 75.803864] kthread+0x145/0x170
[ 75.805972] ? kthread_park+0x90/0x90
[ 75.808664] ret_from_fork+0x3a/0x50
[ 75.811162] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/1:1/32
[ 75.814497] caller is cpumask_any+0x4d/0x70
[ 75.816861] CPU: 0 PID: 32 Comm: kworker/1:1 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 75.820558] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 75.828295] Workqueue: events cpuset_hotplug_workfn
[ 75.830705] Call Trace:
[ 75.833009] dump_stack+0x8f/0xcb
[ 75.835214] __this_cpu_preempt_check+0xaf/0xc0
[ 75.837779] cpumask_any+0x4d/0x70
[ 75.840083] partition_sched_domains_locked+0x152/0x500
[ 75.842971] rebuild_sched_domains_locked+0x5a1/0xba0
[ 75.845729] ? down_write+0x38/0x70
[ 75.848117] ? cpumask_next+0x17/0x20
[ 75.850597] ? percpu_down_write+0xa3/0x100
[ 75.853049] rebuild_sched_domains+0x1b/0x30
[ 75.855647] cpuset_hotplug_workfn+0x74e/0x1310
[ 75.858190] ? process_one_work+0x1be/0x5f0
[ 75.860745] process_one_work+0x23e/0x5f0
[ 75.863280] ? worker_thread+0xce/0x390
[ 75.865699] worker_thread+0x3c/0x390
[ 75.868085] ? process_one_work+0x5f0/0x5f0
[ 75.870592] kthread+0x145/0x170
[ 75.872739] ? kthread_park+0x90/0x90
[ 75.875055] ret_from_fork+0x3a/0x50
[ 81.896088] x86: Booting SMP configuration:
[ 81.905907] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 81.913431] kvm-clock: cpu 1, msr 23b401041, secondary cpu clock
[ 81.913488] masked ExtINT on CPU#1
[ 81.937706] KVM setup async PF for cpu 1
[ 81.940401] kvm-stealtime: cpu 1, msr 23fd30040
[ 84.980215] Unregister pv shared memory for cpu 0
[ 85.001386] smpboot: CPU 0 is now offline
[ 85.010941] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/0:2/144
[ 85.014653] caller is cpumask_any+0x14/0x70
[ 85.017195] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 85.020677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 85.024172] Workqueue: events cpuset_hotplug_workfn
[ 85.026826] Call Trace:
[ 85.028689] dump_stack+0x8f/0xcb
[ 85.030943] __this_cpu_preempt_check+0xaf/0xc0
[ 85.033582] cpumask_any+0x14/0x70
[ 85.035717] partition_sched_domains_locked+0x152/0x500
[ 85.038475] rebuild_sched_domains_locked+0x5a1/0xba0
[ 85.041030] ? down_write+0x38/0x70
[ 85.043283] ? cpumask_next+0x17/0x20
[ 85.045656] ? percpu_down_write+0xa3/0x100
[ 85.048027] rebuild_sched_domains+0x1b/0x30
[ 85.050432] cpuset_hotplug_workfn+0x74e/0x1310
[ 85.052972] ? process_one_work+0x1be/0x5f0
[ 85.055616] process_one_work+0x23e/0x5f0
[ 85.057954] ? worker_thread+0xce/0x390
[ 85.060290] worker_thread+0x3c/0x390
[ 85.062673] ? process_one_work+0x5f0/0x5f0
[ 85.065007] kthread+0x145/0x170
[ 85.067219] ? kthread_park+0x90/0x90
[ 85.069517] ret_from_fork+0x3a/0x50
[ 85.071998] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/0:2/144
[ 85.075495] caller is cpumask_any+0x4d/0x70
[ 85.078053] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 85.081638] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 85.085139] Workqueue: events cpuset_hotplug_workfn
[ 85.087972] Call Trace:
[ 85.090133] dump_stack+0x8f/0xcb
[ 85.092449] __this_cpu_preempt_check+0xaf/0xc0
[ 85.094945] cpumask_any+0x4d/0x70
[ 85.097255] partition_sched_domains_locked+0x152/0x500
[ 85.099968] rebuild_sched_domains_locked+0x5a1/0xba0
[ 85.102690] ? down_write+0x38/0x70
[ 85.104983] ? cpumask_next+0x17/0x20
[ 85.107318] ? percpu_down_write+0xa3/0x100
[ 85.109927] rebuild_sched_domains+0x1b/0x30
[ 85.112333] cpuset_hotplug_workfn+0x74e/0x1310
[ 85.114892] ? process_one_work+0x1be/0x5f0
[ 85.117334] process_one_work+0x23e/0x5f0
[ 85.119868] ? worker_thread+0xce/0x390
[ 85.122441] worker_thread+0x3c/0x390
[ 85.124850] ? process_one_work+0x5f0/0x5f0
[ 85.127409] kthread+0x145/0x170
[ 85.129521] ? kthread_park+0x90/0x90
[ 85.136185] ret_from_fork+0x3a/0x50
[ 88.041127] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 88.047260] kvm-clock: cpu 0, msr 23b401001, secondary cpu clock
[ 88.047315] masked ExtINT on CPU#0
[ 88.080615] KVM setup async PF for cpu 0
[ 88.083215] kvm-stealtime: cpu 0, msr 23fc30040
[ 91.122185] Unregister pv shared memory for cpu 0
[ 91.136805] smpboot: CPU 0 is now offline
[ 91.142632] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/0:2/144
[ 91.146321] caller is cpumask_any+0x14/0x70
[ 91.148857] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 91.152441] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 91.155959] Workqueue: events cpuset_hotplug_workfn
[ 91.158633] Call Trace:
[ 91.160515] dump_stack+0x8f/0xcb
[ 91.162804] __this_cpu_preempt_check+0xaf/0xc0
[ 91.165289] cpumask_any+0x14/0x70
[ 91.167414] partition_sched_domains_locked+0x152/0x500
[ 91.170239] rebuild_sched_domains_locked+0x5a1/0xba0
[ 91.172877] ? down_write+0x38/0x70
[ 91.175155] ? cpumask_next+0x17/0x20
[ 91.177550] ? percpu_down_write+0xa3/0x100
[ 91.179958] rebuild_sched_domains+0x1b/0x30
[ 91.182500] cpuset_hotplug_workfn+0x74e/0x1310
[ 91.185112] ? process_one_work+0x1be/0x5f0
[ 91.187582] process_one_work+0x23e/0x5f0
[ 91.190095] ? worker_thread+0xce/0x390
[ 91.192481] worker_thread+0x3c/0x390
[ 91.194906] ? process_one_work+0x5f0/0x5f0
[ 91.197369] kthread+0x145/0x170
[ 91.199452] ? kthread_park+0x90/0x90
[ 91.201615] ret_from_fork+0x3a/0x50
[ 91.203878] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/0:2/144
[ 91.207496] caller is cpumask_any+0x4d/0x70
[ 91.209991] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 91.213556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 91.216919] Workqueue: events cpuset_hotplug_workfn
[ 91.219450] Call Trace:
[ 91.221455] dump_stack+0x8f/0xcb
[ 91.223683] __this_cpu_preempt_check+0xaf/0xc0
[ 91.226348] cpumask_any+0x4d/0x70
[ 91.228592] partition_sched_domains_locked+0x152/0x500
[ 91.231337] rebuild_sched_domains_locked+0x5a1/0xba0
[ 91.234159] ? down_write+0x38/0x70
[ 91.236447] ? cpumask_next+0x17/0x20
[ 91.238827] ? percpu_down_write+0xa3/0x100
[ 91.241275] rebuild_sched_domains+0x1b/0x30
[ 91.248641] cpuset_hotplug_workfn+0x74e/0x1310
[ 91.251506] ? process_one_work+0x1be/0x5f0
[ 91.254775] process_one_work+0x23e/0x5f0
[ 91.257257] ? worker_thread+0xce/0x390
[ 91.259720] worker_thread+0x3c/0x390
[ 91.262119] ? process_one_work+0x5f0/0x5f0
[ 91.264523] kthread+0x145/0x170
[ 91.266890] ? kthread_park+0x90/0x90
[ 91.269381] ret_from_fork+0x3a/0x50
[ 94.185102] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 94.194225] kvm-clock: cpu 0, msr 23b401001, secondary cpu clock
[ 94.194284] masked ExtINT on CPU#0
[ 94.227621] KVM setup async PF for cpu 0
[ 94.230391] kvm-stealtime: cpu 0, msr 23fc30040
[ 97.267310] Unregister pv shared memory for cpu 0
[ 97.277923] smpboot: CPU 0 is now offline
[ 97.284802] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/0:2/144
[ 97.292569] caller is cpumask_any+0x14/0x70
[ 97.295186] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 97.298722] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 97.302094] Workqueue: events cpuset_hotplug_workfn
[ 97.304832] Call Trace:
[ 97.306912] dump_stack+0x8f/0xcb
[ 97.309299] __this_cpu_preempt_check+0xaf/0xc0
[ 97.311842] cpumask_any+0x14/0x70
[ 97.314204] partition_sched_domains_locked+0x152/0x500
[ 97.317091] rebuild_sched_domains_locked+0x5a1/0xba0
[ 97.319800] ? down_write+0x38/0x70
[ 97.322168] ? cpumask_next+0x17/0x20
[ 97.324453] ? percpu_down_write+0xa3/0x100
[ 97.326961] rebuild_sched_domains+0x1b/0x30
[ 97.329583] cpuset_hotplug_workfn+0x74e/0x1310
[ 97.332048] ? process_one_work+0x1be/0x5f0
[ 97.334608] process_one_work+0x23e/0x5f0
[ 97.336954] ? worker_thread+0xce/0x390
[ 97.339368] worker_thread+0x3c/0x390
[ 97.341637] ? process_one_work+0x5f0/0x5f0
[ 97.344033] kthread+0x145/0x170
[ 97.346210] ? kthread_park+0x90/0x90
[ 97.348454] ret_from_fork+0x3a/0x50
[ 97.350814] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/0:2/144
[ 97.354066] caller is cpumask_any+0x4d/0x70
[ 97.356562] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 97.360002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 97.363342] Workqueue: events cpuset_hotplug_workfn
[ 97.366045] Call Trace:
[ 97.368053] dump_stack+0x8f/0xcb
[ 97.370361] __this_cpu_preempt_check+0xaf/0xc0
[ 97.372869] cpumask_any+0x4d/0x70
[ 97.375182] partition_sched_domains_locked+0x152/0x500
[ 97.378070] rebuild_sched_domains_locked+0x5a1/0xba0
[ 97.380701] ? down_write+0x38/0x70
[ 97.383118] ? cpumask_next+0x17/0x20
[ 97.385516] ? percpu_down_write+0xa3/0x100
[ 97.387933] rebuild_sched_domains+0x1b/0x30
[ 97.390302] cpuset_hotplug_workfn+0x74e/0x1310
[ 97.392884] ? process_one_work+0x1be/0x5f0
[ 97.395551] process_one_work+0x23e/0x5f0
[ 97.397888] ? worker_thread+0xce/0x390
[ 97.400253] worker_thread+0x3c/0x390
[ 97.402490] ? process_one_work+0x5f0/0x5f0
[ 97.404993] kthread+0x145/0x170
[ 97.407242] ? kthread_park+0x90/0x90
[ 97.409554] ret_from_fork+0x3a/0x50
[ 99.106901] Writes: Total: 35434807 Max/Min: 0/0 Fail: 0
[ 109.353172] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 109.388740] kvm-clock: cpu 0, msr 23b401001, secondary cpu clock
[ 109.388796] masked ExtINT on CPU#0
[ 109.443615] KVM setup async PF for cpu 0
[ 109.446345] kvm-stealtime: cpu 0, msr 23fc30040
[ 112.508187] Unregister pv shared memory for cpu 0
[ 112.515229] smpboot: CPU 0 is now offline
[ 112.519948] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/0:2/144
[ 112.523383] caller is cpumask_any+0x14/0x70
[ 112.530155] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 112.533776] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 112.537314] Workqueue: events cpuset_hotplug_workfn
[ 112.539926] Call Trace:
[ 112.541903] dump_stack+0x8f/0xcb
[ 112.544231] __this_cpu_preempt_check+0xaf/0xc0
[ 112.547000] cpumask_any+0x14/0x70
[ 112.549359] partition_sched_domains_locked+0x152/0x500
[ 112.552141] rebuild_sched_domains_locked+0x5a1/0xba0
[ 112.554933] ? down_write+0x38/0x70
[ 112.557276] ? cpumask_next+0x17/0x20
[ 112.559628] ? percpu_down_write+0xa3/0x100
[ 112.562486] rebuild_sched_domains+0x1b/0x30
[ 112.565144] cpuset_hotplug_workfn+0x74e/0x1310
[ 112.567700] ? process_one_work+0x1be/0x5f0
[ 112.570089] process_one_work+0x23e/0x5f0
[ 112.572424] ? worker_thread+0xce/0x390
[ 112.574794] worker_thread+0x3c/0x390
[ 112.577083] ? process_one_work+0x5f0/0x5f0
[ 112.579431] kthread+0x145/0x170
[ 112.581443] ? kthread_park+0x90/0x90
[ 112.583466] ret_from_fork+0x3a/0x50
[ 112.585688] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/0:2/144
[ 112.588903] caller is cpumask_any+0x4d/0x70
[ 112.591186] CPU: 1 PID: 144 Comm: kworker/0:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 112.594570] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 112.597845] Workqueue: events cpuset_hotplug_workfn
[ 112.600392] Call Trace:
[ 112.602477] dump_stack+0x8f/0xcb
[ 112.604778] __this_cpu_preempt_check+0xaf/0xc0
[ 112.607346] cpumask_any+0x4d/0x70
[ 112.609670] partition_sched_domains_locked+0x152/0x500
[ 112.612419] rebuild_sched_domains_locked+0x5a1/0xba0
[ 112.615130] ? down_write+0x38/0x70
[ 112.617627] ? cpumask_next+0x17/0x20
[ 112.619936] ? percpu_down_write+0xa3/0x100
[ 112.622448] rebuild_sched_domains+0x1b/0x30
[ 112.624916] cpuset_hotplug_workfn+0x74e/0x1310
[ 112.627483] ? process_one_work+0x1be/0x5f0
[ 112.630192] process_one_work+0x23e/0x5f0
[ 112.632670] ? worker_thread+0xce/0x390
[ 112.634895] worker_thread+0x3c/0x390
[ 112.637239] ? process_one_work+0x5f0/0x5f0
[ 112.639832] kthread+0x145/0x170
[ 112.647876] ? kthread_park+0x90/0x90
[ 112.650142] ret_from_fork+0x3a/0x50
[ 115.561110] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 115.574240] kvm-clock: cpu 0, msr 23b401001, secondary cpu clock
[ 115.574300] masked ExtINT on CPU#0
[ 115.607548] KVM setup async PF for cpu 0
[ 115.610451] kvm-stealtime: cpu 0, msr 23fc30040
[ 118.648289] Unregister pv shared memory for cpu 1
[ 118.659140] smpboot: CPU 1 is now offline
[ 118.669405] BUG: using __this_cpu_read() in preemptible [00000000] code: kworker/1:2/167
[ 118.672719] caller is cpumask_any+0x14/0x70
[ 118.675145] CPU: 0 PID: 167 Comm: kworker/1:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 118.678321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 118.681838] Workqueue: events cpuset_hotplug_workfn
[ 118.684491] Call Trace:
[ 118.686463] dump_stack+0x8f/0xcb
[ 118.688602] __this_cpu_preempt_check+0xaf/0xc0
[ 118.690971] cpumask_any+0x14/0x70
[ 118.693063] partition_sched_domains_locked+0x152/0x500
[ 118.695702] rebuild_sched_domains_locked+0x5a1/0xba0
[ 118.698322] ? down_write+0x38/0x70
[ 118.700546] ? cpumask_next+0x17/0x20
[ 118.702883] ? percpu_down_write+0xa3/0x100
[ 118.705310] rebuild_sched_domains+0x1b/0x30
[ 118.707662] cpuset_hotplug_workfn+0x74e/0x1310
[ 118.710009] ? process_one_work+0x1be/0x5f0
[ 118.712400] process_one_work+0x23e/0x5f0
[ 118.714585] ? worker_thread+0xce/0x390
[ 118.716965] worker_thread+0x3c/0x390
[ 118.719188] ? process_one_work+0x5f0/0x5f0
[ 118.721459] kthread+0x145/0x170
[ 118.723631] ? kthread_park+0x90/0x90
[ 118.726020] ret_from_fork+0x3a/0x50
[ 118.728511] BUG: using __this_cpu_write() in preemptible [00000000] code: kworker/1:2/167
[ 118.731747] caller is cpumask_any+0x4d/0x70
[ 118.734313] CPU: 0 PID: 167 Comm: kworker/1:2 Not tainted 5.6.0-rc4-00110-ga7934287d8a6c #2
[ 118.737683] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 118.740914] Workqueue: events cpuset_hotplug_workfn
[ 118.743399] Call Trace:
[ 118.745514] dump_stack+0x8f/0xcb
[ 118.747848] __this_cpu_preempt_check+0xaf/0xc0
[ 118.750522] cpumask_any+0x4d/0x70
[ 118.752752] partition_sched_domains_locked+0x152/0x500
[ 118.755264] rebuild_sched_domains_locked+0x5a1/0xba0
[ 118.757816] ? down_write+0x38/0x70
[ 118.760109] ? cpumask_next+0x17/0x20
[ 118.762669] ? percpu_down_write+0xa3/0x100
[ 118.765308] rebuild_sched_domains+0x1b/0x30
[ 118.767876] cpuset_hotplug_workfn+0x74e/0x1310
[ 118.770286] ? process_one_work+0x1be/0x5f0
[ 118.772723] process_one_work+0x23e/0x5f0
[ 118.775191] ? worker_thread+0xce/0x390
[ 118.777766] worker_thread+0x3c/0x390
[ 118.780116] ? process_one_work+0x5f0/0x5f0
[ 118.782612] kthread+0x145/0x170
[ 118.784938] ? kthread_park+0x90/0x90
[ 118.787058] ret_from_fork+0x3a/0x50

Elapsed time: 120
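
All of the splats above point at the same two offsets: cpumask_any+0x14
(the __this_cpu_read() call) and cpumask_any+0x4d (the __this_cpu_write()
call), reached from cpuset_hotplug_workfn() where preemption is enabled.
Below is a minimal sketch of the pattern CONFIG_DEBUG_PREEMPT is
complaining about, assuming cpumask_any() now carries the same per-cpu
rotation state as cpumask_any_and_distribute(); the per-cpu variable
name is illustrative, not taken from lib/cpumask.c:

	/*
	 * Hedged sketch, not the actual lib/cpumask.c code.  The per-cpu
	 * variable name below is assumed for illustration.
	 */
	static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);

	int cpumask_any(const struct cpumask *srcp)
	{
		int prev, next;

		/*
		 * cpumask_any+0x14: __this_cpu_read() requires preemption
		 * to be disabled, which it is not on this path.
		 */
		prev = __this_cpu_read(distribute_cpu_mask_prev);

		/* Rotate through the mask, starting after the last pick. */
		next = cpumask_next(prev, srcp);
		if (next >= nr_cpu_ids)
			next = cpumask_first(srcp);

		if (next < nr_cpu_ids)
			/* cpumask_any+0x4d: same requirement here. */
			__this_cpu_write(distribute_cpu_mask_prev, next);

		return next;
	}

Since the per-cpu value is only a hint for spreading picks, the
preempt-safe this_cpu_read()/this_cpu_write() accessors would quiet the
check without wrapping the lookup in a preempt_disable() section; a
stale or cross-CPU hint merely changes which CPU gets returned.
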

qemu-img create -f qcow2 disk-vm-snb-60-0 256G
qemu-img create -f qcow2 disk-vm-snb-60-1 256G
qemu-img create -f qcow2 disk-vm-snb-60-2 256G
qemu-img create -f qcow2 disk-vm-snb-60-3 256G
qemu-img create -f qcow2 disk-vm-snb-60-4 256G
qemu-img create -f qcow2 disk-vm-snb-60-5 256G
qemu-img create -f qcow2 disk-vm-snb-60-6 256G

kvm=(
qemu-system-x86_64
-enable-kvm
-cpu SandyBridge
-kernel $kernel
-initrd initrd-vm-snb-60.cgz
-m 8192
-smp 2
-device e1000,netdev=net0
-netdev user,id=net0,hostfwd=tcp::32032-:22
-boot order=nc
-no-reboot
-watchdog i6300esb
-watchdog-action debug
-rtc base=localtime
-drive file=disk-vm-snb-60-0,media=disk,if=virtio
-drive file=disk-vm-snb-60-1,media=disk,if=virtio
-drive file=disk-vm-snb-60-2,media=disk,if=virtio
-drive file=disk-vm-snb-60-3,media=disk,if=virtio
-drive file=disk-vm-snb-60-4,media=disk,if=virtio
-drive file=disk-vm-snb-60-5,media=disk,if=virtio
-drive file=disk-vm-snb-60-6,media=disk,if=virtio
-serial stdio
-display none
-monitor null
)

append=(
ip=::::vm-snb-60::dhcp
root=/dev/ram0
user=lkp
job=/job-script
ARCH=x86_64
kconfig=x86_64-rhel-7.6-kselftests
branch=linux-devel/devel-hourly-2020041507
commit=a7934287d8a6c43811fca8ddf421b3b6091564f2
BOOT_IMAGE=/pkg/linux/x86_64-rhel-7.6-kselftests/gcc-7/a7934287d8a6c43811fca8ddf421b3b6091564f2/vmlinuz-5.6.0-rc4-00110-ga7934287d8a6c
max_uptime=1500
RESULT_ROOT=/result/locktorture/300s-cpuhotplug/vm-snb/yocto-i386-minimal-20190520.cgz/x86_64-rhel-7.6-kselftests/gcc-7/a7934287d8a6c43811fca8ddf421b3b6091564f2/3
result_service=tmpfs
selinux=0
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
net.ifnames=0
printk.devkmsg=on
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
drbd.minor_count=8
systemd.log_level=err
ignore_loglevel
console=tty0
earlyprintk=ttyS0,115200
console=ttyS0,115200
vga=normal
rw
rcuperf.shutdown=0
watchdog_thresh=60
)

"${kvm[@]}" -append "${append[*]}"


To reproduce:

# build kernel
cd linux
cp config-5.6.0-rc4-00110-ga7934287d8a6c .config
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email



Thanks,
Rong Chen


Attachments:
config-5.6.0-rc4-00110-ga7934287d8a6c (207.17 kB)
job-script (4.44 kB)
dmesg.xz (15.57 kB)