2014-07-16 14:57:20

by Bruno Wolff III

[permalink] [raw]
Subject: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

caffcdd8d27ba78730d5540396ce72ad022aff2c has been causing crashes early in
the boot process on one of three machines I have been testing the kernel
on. On that one machine it happens every boot. It happens before netconsole
is functional.

A partial revert of the commit fixes the problem. I do not know why the
commit is broken though.

I have filed https://bugzilla.kernel.org/show_bug.cgi?id=80251 for this
issue.

The problem happens on both Fedora and Linus kernels.

git diff caffcdd8d27ba78730d5540396ce72ad022aff2c^ caffcdd8d27ba78730d5540396ce72ad022aff2c
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 45d077ed24fb..6340c601475d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5794,8 +5794,6 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
-		cpumask_clear(sched_group_cpus(sg));
-		sg->sgp->power = 0;
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {

By rc5 the second line can't be added back because the structure has changed.
However adding back cpumask_clear(sched_group_cpus(sg)); to rc5 got things
working for me again.
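
For clarity, the partial revert against rc5 boils down to roughly this one-line hunk (a sketch, not the exact patch I applied):

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ build_sched_groups(struct sched_domain *sd, int cpu)
 		group = get_group(i, sdd, &sg);
+		cpumask_clear(sched_group_cpus(sg));
 		cpumask_setall(sched_group_mask(sg));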


2014-07-16 15:18:25

by Josh Boyer

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

Adding Dietmar in since he is the original author.

josh

On Wed, Jul 16, 2014 at 09:55:46AM -0500, Bruno Wolff III wrote:
> caffcdd8d27ba78730d5540396ce72ad022aff2c has been causing crashes
> early in the boot process on one of three machines I have been
> testing the kernel on. On that one machine it happens every boot. It
> happens before netconsole is functional.
>
> A partial revert of the commit fixes the problem. I do not know why
> the commit is broken though.
>
> I have filed https://bugzilla.kernel.org/show_bug.cgi?id=80251 for
> this issue.
>
> The problem happens on both Fedora and Linus kernels.
>
> git diff caffcdd8d27ba78730d5540396ce72ad022aff2c^ caffcdd8d27ba78730d5540396ce72ad022aff2c
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 45d077ed24fb..6340c601475d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5794,8 +5794,6 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> continue;
>
> group = get_group(i, sdd, &sg);
> - cpumask_clear(sched_group_cpus(sg));
> - sg->sgp->power = 0;
> cpumask_setall(sched_group_mask(sg));
>
> for_each_cpu(j, span) {
>
> By rc5 the second line can't be added back because the structure has
> changed. However adding back cpumask_clear(sched_group_cpus(sg)); to
> rc5 got things working for me again.

2014-07-16 19:17:40

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

Hi Bruno and Josh,

On 16/07/14 17:17, Josh Boyer wrote:
> Adding Dietmar in since he is the original author.
>
> josh
>
> On Wed, Jul 16, 2014 at 09:55:46AM -0500, Bruno Wolff III wrote:
>> caffcdd8d27ba78730d5540396ce72ad022aff2c has been causing crashes
>> early in the boot process on one of three machines I have been
>> testing the kernel on. On that one machine it happens every boot. It
>> happens before netconsole is functional.

I tested this patch on two platforms (ARM TC2 and Intel i5 M520) by
replacing the two lines (already with the new sg->sgc->capacity instead
of the old sg->sgp->power) with:

BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
BUG_ON(sg->sgc->capacity);

The memory for sg and sgc is allocated and zeroed out in __sdt_alloc(), e.g. with:

sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));

The related call chain:

build_sched_domains()
  __visit_domain_allocation_hell()
    __sdt_alloc()
  build_sched_groups()

>>
>> A partial revert of the commit fixes the problem. I do not know why
>> the commit is broken though.
>>
>> I have filed https://bugzilla.kernel.org/show_bug.cgi?id=80251 for
>> this issue.

From the issue, I see that the machine making trouble is a Xeon (2
processors w/ hyper-threading).

Could you please share:

cat /proc/cpuinfo and
cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)

from this machine.

I don't think it is SMT-related, since SMT is also there on my Intel i5
M520 (arch/x86/configs/x86_64_defconfig).

Could you also put the two BUG_ON lines into build_sched_groups()
[kernel/sched/core.c], w/o the cpumask_clear() and w/o setting
sg->sgc->capacity to 0, and share the possible crash output as well?
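
Concretely, the hunk in build_sched_groups() would then look roughly
like this (just a sketch; sg->sgc->capacity because of the rename
explained below):

		group = get_group(i, sdd, &sg);

		BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
		BUG_ON(sg->sgc->capacity);

		cpumask_setall(sched_group_mask(sg));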

>>
>> The problem happens on both Fedora and Linus kernels.
>>
>> git diff caffcdd8d27ba78730d5540396ce72ad022aff2c^ caffcdd8d27ba78730d5540396ce72ad022aff2c
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 45d077ed24fb..6340c601475d 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5794,8 +5794,6 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>> continue;
>>
>> group = get_group(i, sdd, &sg);
>> - cpumask_clear(sched_group_cpus(sg));
>> - sg->sgp->power = 0;
>> cpumask_setall(sched_group_mask(sg));
>>
>> for_each_cpu(j, span) {
>>
>> By rc5 the second line can't be added back because the structure has
>> changed. However adding back cpumask_clear(sched_group_cpus(sg)); to
>> rc5 got things working for me again.

That's because 'sched: Let 'struct sched_group_power' care about CPU
capacity' (commit id 63b2ca30bdb3) changes the struct sched_group member
from struct sched_group_power *sgp to struct sched_group_capacity *sgc.

I.e. the second line becomes

sg->sgc->capacity = 0;
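
For reference, after that rename the relevant part of struct sched_group
looks roughly like this (a sketch, not a verbatim copy):

struct sched_group {
	struct sched_group *next;		/* Must be a circular list */
	atomic_t ref;

	unsigned int group_weight;
	struct sched_group_capacity *sgc;	/* was: struct sched_group_power *sgp */

	/* variable-length cpumask of the CPUs this group covers */
	unsigned long cpumask[0];
};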

Thanks,

-- Dietmar

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2014-07-16 19:55:50

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Wed, Jul 16, 2014 at 21:17:32 +0200,
Dietmar Eggemann <[email protected]> wrote:
>Hi Bruno and Josh,
>
>From the issue, I see that the machine making trouble is an Xeon (2
>processors w/ hyper-threading).
>
>Could you please share:
>
> cat /proc/cpuinfo and

I have attached it to the bug and to this message.

> cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)

It looks like that isn't set for my previous builds and I'll need to
set it for my next test build.

>Could you also put the two BUG_ON lines into build_sched_groups()
>[kernel/sched/core.c] wo/ the cpumask_clear() and setting
>sg->sgc->capacity to 0 and share the possible crash output as well?

I can try a new build with this. I can probably get results back tomorrow
before I leave for work. The crashes happen too early in the boot process
for me to easily capture output as text. I can slow things down to take
pictures though.


Attachments:
(No filename) (926.00 B)
cpuinfo.out (2.50 kB)

2014-07-16 23:18:45

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 16/07/14 21:54, Bruno Wolff III wrote:
> On Wed, Jul 16, 2014 at 21:17:32 +0200,
> Dietmar Eggemann <[email protected]> wrote:
>> Hi Bruno and Josh,
>>
>>From the issue, I see that the machine making trouble is an Xeon (2
>> processors w/ hyper-threading).
>>
>> Could you please share:
>>
>> cat /proc/cpuinfo and
>
> I have attached it to the bug and to this message.
>
>> cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)
>
> It looks like that isn't set for my previous builds and I'll need to
> set it for my next test build.
>
>> Could you also put the two BUG_ON lines into build_sched_groups()
>> [kernel/sched/core.c] wo/ the cpumask_clear() and setting
>> sg->sgc->capacity to 0 and share the possible crash output as well?
>
> I can try a new build with this. I can probably get results back tomorrow
> before I leave for work. The crashes happen too early in the boot process
> for me to easily capture output as text. I can slow things down to take
> pictures though.
>

That would be helpful. Thanks. I saw that you have CONFIG_SCHED_DEBUG
enabled.

So the output of

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/*

would be handy too.

The difference from the Intel machine I tested on is that yours is a
'dual single-core CPU with hyper-threading' and mine is a 'dual-core with
hyper-threading'

yours:
$ cat cpuinfo.out | grep '^physical\|^core\|^cpu cores'
physical id : 0
core id : 0
cpu cores : 1
physical id : 3
core id : 0
cpu cores : 1
physical id : 0
core id : 0
cpu cores : 1
physical id : 3
core id : 0
cpu cores : 1

mine:
$ cat /proc/cpuinfo | grep '^physical\|^core\|^cpu cores'
physical id : 0
core id : 0
cpu cores : 2
physical id : 0
core id : 0
cpu cores : 2
physical id : 0
core id : 1
cpu cores : 2
physical id : 0
core id : 1
cpu cores : 2

Just to make sure, you do have 'CONFIG_X86_32=y' and '# CONFIG_NUMA is
not set' in your build?

2014-07-17 03:11:26

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Thu, Jul 17, 2014 at 01:18:36 +0200,
Dietmar Eggemann <[email protected]> wrote:
>So the output of
>
>$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/*
>
>would be handy too.

Attached and added to the bug.

>Just to make sure, you do have 'CONFIG_X86_32=y' and '# CONFIG_NUMA is
>not set' in your build?

Yes.

I probably won't be able to get /proc/schedstat on my next test since the
system will probably crash right away. However, I probably will have a
much faster rebuild and might still be able to get the info for you
before I leave tomorrow.


Attachments:
(No filename) (567.00 B)
sched_domain.out (300.00 B)

2014-07-17 04:23:33

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

>>>Could you also put the two BUG_ON lines into build_sched_groups()
>>>[kernel/sched/core.c] wo/ the cpumask_clear() and setting
>>>sg->sgc->capacity to 0 and share the possible crash output as well?
>>
>>I can try a new build with this. I can probably get results back tomorrow
>>before I leave for work. The crashes happen too early in the boot process
>>for me to easily capture output as text. I can slow things down to take
>>pictures though.
>>
>
>That would be helpful. Thanks. I saw that you have CONFIG_SCHED_DEBUG
>enabled.

Well that didn't help much. It still crashed. Taking pictures didn't
get a good capture. I wasn't able to use boot_delay to slow things down,
as even a small value resulted in me only seeing one line of output
before giving up after a minute or two. I used a serial console to
slow things down, but it isn't enough to make it easy to take pictures
with the camera I had. The crash was someplace inside the scheduler, but
I don't know if there were messages from the BUG_ON lines.

2014-07-17 04:30:17

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Wed, Jul 16, 2014 at 21:17:32 +0200,
Dietmar Eggemann <[email protected]> wrote:
>Could you please share:
>
> cat /proc/cpuinfo and
> cat /proc/schedstat (kernel config w/ CONFIG_SCHEDSTATS=y)

/proc/schedstat output is attached.


Attachments:
(No filename) (242.00 B)
schedstat.out (1.38 kB)

2014-07-17 08:58:06

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 17/07/14 05:09, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 01:18:36 +0200,
> Dietmar Eggemann <[email protected]> wrote:
>> So the output of
>>
>> $ cat /proc/sys/kernel/sched_domain/cpu*/domain*/*
>>
>> would be handy too.

Thanks, this was helpful.

I see from the sched domain layout that you have SMT (domain0) and DIE
(domain1) levels. So on this system, the MC level gets degenerated
(sd_degenerate() in kernel/sched/core.c). So far I fail to see how this
can have an effect on the memory of the sched groups, but I can try to
fake this situation on one of my platforms.
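
For those following along, the relevant check is roughly this (heavily
simplified sketch of sd_degenerate(); the remaining flag/group conditions
are elided):

static int sd_degenerate(struct sched_domain *sd)
{
	/* A level whose span contains only a single CPU is redundant. */
	if (cpumask_weight(sched_domain_span(sd)) == 1)
		return 1;
	...
}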

There is also the possibility that the memory for sched_group sg is not
(completely) zeroed out:

sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));


struct sched_group {
	...
	/*
	 * NOTE: this field is variable length. (Allocated dynamically
	 * by attaching extra space to the end of the structure,
	 * depending on how many CPUs the kernel has booted up with)
	 */
	unsigned long cpumask[0];
};

so that the cpumask of a sched group is not 0, and this can only be cured
by an explicit cpumask_clear(sched_group_cpus(sg)) in build_sched_groups()
on this kind of machine.
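
Note that sched_group_cpus() is just an accessor for that trailing array,
roughly (as in kernel/sched/sched.h of that era):

static inline struct cpumask *sched_group_cpus(struct sched_group *sg)
{
	return to_cpumask(sg->cpumask);
}

so anything scribbling past the end of the allocation would show up
directly as spurious CPUs in the group mask.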

>
> Attached and added to the bug.
>
>> Just to make sure, you do have 'CONFIG_X86_32=y' and '# CONFIG_NUMA is
>> not set' in your build?
>
> Yes.
>
> I probably won't be able to get /proc/schedstat on my next test since the
> system will probably crash right away. However, I probably will have a
> much faster rebuild and might still be able to get the info for you
> before I leave tomorrow.
>

2014-07-17 09:05:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Thu, Jul 17, 2014 at 10:57:55AM +0200, Dietmar Eggemann wrote:
> There is also the possibility that the memory for sched_group sg is not
> (completely) zeroed out:
>
> sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
> GFP_KERNEL, cpu_to_node(j));
>
>
> struct sched_group {
> ...
> * NOTE: this field is variable length. (Allocated dynamically
> * by attaching extra space to the end of the structure,
> * depending on how many CPUs the kernel has booted up with)
> */
> unsigned long cpumask[0];

well kZalloc should Zero the entire allocated size, and the specified
size very much includes the cpumask size as per:
sizeof(struct sched_group) + cpumask_size()

But yeah, I'm also a bit puzzled why this goes bang. Makes me worry we
scribble it somewhere or so.


Attachments:
(No filename) (802.00 B)
(No filename) (836.00 B)

2014-07-17 11:24:01

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 17/07/14 11:04, Peter Zijlstra wrote:
> On Thu, Jul 17, 2014 at 10:57:55AM +0200, Dietmar Eggemann wrote:
>> There is also the possibility that the memory for sched_group sg is not
>> (completely) zeroed out:
>>
>> sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
>> GFP_KERNEL, cpu_to_node(j));
>>
>>
>> struct sched_group {
>> ...
>> * NOTE: this field is variable length. (Allocated dynamically
>> * by attaching extra space to the end of the structure,
>> * depending on how many CPUs the kernel has booted up with)
>> */
>> unsigned long cpumask[0];
>
> well kZalloc should Zero the entire allocated size, and the specified
> size very much includes the cpumask size as per:
> sizeof(struct sched_group) + cpumask_size()

Yes, I think so too.

>
> But yeah, I'm also a bit puzzled why this goes bang. Makes me worry we
> scribble it somewhere or so.
>

But then, this must be happening in build_sched_domains() between
__visit_domain_allocation_hell()->__sdt_alloc() and build_sched_groups().


I couldn't catch this phenomenon by adding a fake SMT level (just a copy of
the real MC level) to my ARM TC2 (dual cluster, dual/triple core, no
hyper-threading) to provoke sd degeneration. It does not show the issue
and the MC level gets degenerated nicely. (It might not be the real
example, since SMT and MC use the same cpu mask here.)

@@ -281,6 +281,7 @@ static inline const int cpu_corepower_flags(void)
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(SMT) },
 #ifdef CONFIG_SCHED_MC
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },


Maybe by enabling sched_debug on the command line (earlyprintk=keep
sched_debug), Bruno could spot topology setup issues on his Xeon machine
which could lead to this problem, unless the sg cpumask gets zeroed out
a second time in build_sched_groups()?

Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz dmesg snippet as an example when
booted with 'earlyprintk=keep sched_debug':
...
[ 0.119737] CPU0 attaching sched-domain:
[ 0.119740] domain 0: span 0-1 level SIBLING
[ 0.119742] groups: 0 (cpu_power = 588) 1 (cpu_power = 588)
[ 0.119745] domain 1: span 0-3 level MC
[ 0.119747] groups: 0-1 (cpu_power = 1176) 2-3 (cpu_power = 1176)
...

2014-07-17 12:35:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Thu, Jul 17, 2014 at 01:23:51PM +0200, Dietmar Eggemann wrote:
> On 17/07/14 11:04, Peter Zijlstra wrote:
> >On Thu, Jul 17, 2014 at 10:57:55AM +0200, Dietmar Eggemann wrote:
> >>There is also the possibility that the memory for sched_group sg is not
> >>(completely) zeroed out:
> >>
> >> sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
> >> GFP_KERNEL, cpu_to_node(j));
> >>
> >>
> >> struct sched_group {
> >> ...
> >> * NOTE: this field is variable length. (Allocated dynamically
> >> * by attaching extra space to the end of the structure,
> >> * depending on how many CPUs the kernel has booted up with)
> >> */
> >> unsigned long cpumask[0];
> >
> >well kZalloc should Zero the entire allocated size, and the specified
> >size very much includes the cpumask size as per:
> > sizeof(struct sched_group) + cpumask_size()
>
> Yes, I think so too.
>
> >
> >But yeah, I'm also a bit puzzled why this goes bang. Makes me worry we
> >scribble it somewhere or so.
> >
>
> But then, this must be happening in build_sched_domains() between
> __visit_domain_allocation_hell()->__sdt_alloc() and build_sched_groups().
>
>
> Couldn't catch this phenomena by adding a fake SMT level (just a copy of the
> real MC level) to my ARM TC2 (dual cluster dual/triple core, no
> hyper-threading) to provoke sd degenerate. It does not show the issue and MC
> level gets degenerated nicely. Might not be the real example since SMT and
> MC are using the same cpu mask here).

Yeah, obviously my machines didn't trigger this either, and afaik none
of Ingo's did either.

In any case, can someone who can trigger this run with the below; it's
'clean' for me, but supposedly you'll trigger a FAIL somewhere.

It includes the cpumask_clear() so your machines should boot, albeit
with a noisy dmesg :-)

---
kernel/sched/core.c | 16 ++++++++++++++++
lib/vsprintf.c | 5 +++++
2 files changed, 21 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..1c140057db12 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,17 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
+
+		if (!cpumask_empty(sched_group_cpus(sg)))
+			printk("%s: FAIL\n", __func__);
+
+		printk("%s: got group %p with cpus: %pc\n",
+				__func__,
+				sg,
+				sched_group_cpus(sg));
+
+		cpumask_clear(sched_group_cpus(sg));
+
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {
@@ -6418,6 +6429,11 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg)
 				return -ENOMEM;
 
+			printk("%s: allocated %p with cpus: %pc\n",
+					__func__,
+					sg,
+					sched_group_cpus(sg));
+
 			sg->next = sg;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6fe2c84eb055..ac22c46fd6d0 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -28,6 +28,7 @@
#include <linux/ioport.h>
#include <linux/dcache.h>
#include <linux/cred.h>
+#include <linux/cpumask.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -1250,6 +1251,7 @@ int kptr_restrict __read_mostly;
* (default assumed to be phys_addr_t, passed by reference)
* - 'd[234]' For a dentry name (optionally 2-4 last components)
* - 'D[234]' Same as 'd' but for a struct file
+ * - 'c' For a cpumask list
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -1389,6 +1391,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
return dentry_name(buf, end,
((const struct file *)ptr)->f_path.dentry,
spec, fmt);
+ case 'c':
+ return buf + cpulist_scnprintf(buf, end - buf, ptr);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1635,6 +1639,7 @@ int format_decode(const char *fmt, struct printf_spec *spec)
* case.
* %*ph[CDN] a variable-length hex string with a separator (supports up to 64
* bytes of the input)
+ * %pc print a cpumask as comma-separated list
* %n is ignored
*
* ** Please update Documentation/printk-formats.txt when making changes **


Attachments:
(No filename) (4.09 kB)
(No filename) (836.00 B)

2014-07-17 16:38:03

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

I did a few quick boots this morning while taking a bunch of pictures. I have
gone through some of them and found one that shows the BUG_ON at line 5850
was triggered, which is:
BUG_ON(!cpumask_empty(sched_group_cpus(sg)));

You can see the JPEG at:
https://bugzilla.kernel.org/attachment.cgi?id=143331

2014-07-17 18:43:23

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 17/07/14 18:36, Bruno Wolff III wrote:
> I did a few quick boots this morning while taking a bunch of pictures. I have
> gone through some of them this morning and found one that shows bug on
> was triggered at 5850 which is from:
> BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
>
> You can see the JPEG at:
> https://bugzilla.kernel.org/attachment.cgi?id=143331
>

Many thanks for testing this, Bruno!

So the memory of the cpumask of some sched_group(s) in your system has
been altered between __visit_domain_allocation_hell()->__sdt_alloc() and
build_sched_groups().

In the meantime, PeterZ has posted a patch which barfs when this happens
and prints out the sched groups with the related cpus, but also includes
the cpumask_clear, so your machine would still boot fine.

If you could apply the patch:

https://lkml.org/lkml/2014/7/17/288

and then run it on your machine, that would give us more details, i.e.
the information on which sched_group(s) and in which sched domain level
(SMT and/or DIE) this issue occurs.


Another thing which you could do is to boot with an extra
'earlyprintk=keep sched_debug' in your command line options with a build
containing the cpumask_clear() in build_sched_groups() and extract the
dmesg output of the scheduler-setup code:

Example:

[ 0.119737] CPU0 attaching sched-domain:
[ 0.119740] domain 0: span 0-1 level SIBLING
[ 0.119742] groups: 0 (cpu_power = 588) 1 (cpu_power = 588)
[ 0.119745] domain 1: span 0-3 level MC
[ 0.119747] groups: 0-1 (cpu_power = 1176) 2-3 (cpu_power = 1176)
[ 0.119751] CPU1 attaching sched-domain:
[ 0.119752] domain 0: span 0-1 level SIBLING
[ 0.119753] groups: 1 (cpu_power = 588) 0 (cpu_power = 588)
[ 0.119756] domain 1: span 0-3 level MC
[ 0.119757] groups: 0-1 (cpu_power = 1176) 2-3 (cpu_power = 1176)
[ 0.119759] CPU2 attaching sched-domain:
[ 0.119760] domain 0: span 2-3 level SIBLING
[ 0.119761] groups: 2 (cpu_power = 588) 3 (cpu_power = 588)
[ 0.119764] domain 1: span 0-3 level MC
[ 0.119765] groups: 2-3 (cpu_power = 1176) 0-1 (cpu_power = 1176)
[ 0.119767] CPU3 attaching sched-domain:
[ 0.119768] domain 0: span 2-3 level SIBLING
[ 0.119769] groups: 3 (cpu_power = 588) 2 (cpu_power = 588)
[ 0.119772] domain 1: span 0-3 level MC
[ 0.119773] groups: 2-3 (cpu_power = 1176) 0-1 (cpu_power = 1176)

2014-07-17 18:55:55

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Thu, Jul 17, 2014 at 20:43:16 +0200,
Dietmar Eggemann <[email protected]> wrote:
>
>If you could apply the patch:
>
>https://lkml.org/lkml/2014/7/17/288
>
>and then run it on your machine, that would give us more details, i.e.
>the information on which sched_group(s) and in which sched domain
>level (SMT and/or DIE) this issue occurs.
>
>
>Another thing which you could do is to boot with an extra
>'earlyprintk=keep sched_debug' in your command line options with a
>build containing the cpumask_clear() in build_sched_groups() and
>extract the dmesg output of the scheduler-setup code:

I'll see what I can do. I have plans after work today and I don't know
if I'll be awake enough when I get home to followup tonight or early
tomorrow. Worst case will be over the weekend.

2014-07-18 05:36:24

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Thu, Jul 17, 2014 at 14:35:02 +0200,
Peter Zijlstra <[email protected]> wrote:
>
>In any case, can someone who can trigger this run with the below; its
>'clean' for me, but supposedly you'll trigger a FAIL somewhere.

I got a couple of fail messages.

dmesg output is available in the bug as the following attachment:
https://bugzilla.kernel.org/attachment.cgi?id=143361

The part of interest is probably:

[ 0.253354] build_sched_groups: got group f255b020 with cpus:
[ 0.253436] build_sched_groups: got group f255b120 with cpus:
[ 0.253519] build_sched_groups: got group f255b1a0 with cpus:
[ 0.253600] build_sched_groups: got group f255b2a0 with cpus:
[ 0.253681] build_sched_groups: got group f255b2e0 with cpus:
[ 0.253762] build_sched_groups: got group f255b320 with cpus:
[ 0.253843] build_sched_groups: got group f255b360 with cpus:
[ 0.254004] build_sched_groups: got group f255b0e0 with cpus:
[ 0.254087] build_sched_groups: got group f255b160 with cpus:
[ 0.254170] build_sched_groups: got group f255b1e0 with cpus:
[ 0.254252] build_sched_groups: FAIL
[ 0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
[ 0.255004] build_sched_groups: FAIL
[ 0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

I also booted with early printk=keepsched_debug as requested by
Dietmar.

2014-07-18 09:28:21

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 18/07/14 07:34, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 14:35:02 +0200,
> Peter Zijlstra <[email protected]> wrote:
>>
>> In any case, can someone who can trigger this run with the below; its
>> 'clean' for me, but supposedly you'll trigger a FAIL somewhere.
>
> I got a couple of fail messages.
>
> dmesg output is available in the bug as the following attachment:
> https://bugzilla.kernel.org/attachment.cgi?id=143361
>
> The part of interest is probably:
>
> [ 0.253354] build_sched_groups: got group f255b020 with cpus:
> [ 0.253436] build_sched_groups: got group f255b120 with cpus:
> [ 0.253519] build_sched_groups: got group f255b1a0 with cpus:
> [ 0.253600] build_sched_groups: got group f255b2a0 with cpus:
> [ 0.253681] build_sched_groups: got group f255b2e0 with cpus:
> [ 0.253762] build_sched_groups: got group f255b320 with cpus:
> [ 0.253843] build_sched_groups: got group f255b360 with cpus:
> [ 0.254004] build_sched_groups: got group f255b0e0 with cpus:
> [ 0.254087] build_sched_groups: got group f255b160 with cpus:
> [ 0.254170] build_sched_groups: got group f255b1e0 with cpus:
> [ 0.254252] build_sched_groups: FAIL
> [ 0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
> [ 0.255004] build_sched_groups: FAIL
> [ 0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

That (partly) explains it. f255b1a0 (5) and f255b1e0 (6) are reused
here! This reuse doesn't happen on my machines.

But if they are used for a different cpu mask (not including cpu0 resp.
cpu1), this would mess up their first usage?

I guess that the second time, cpu3 will be added to the cpumask of
f255b1a0 and cpu4 to f255b1e0?

Maybe we can extend PeterZ's patch to print out cpu and span as well,
and use this printk also in free_sched_domain(), to debug further if
this is not enough evidence?
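
Such an extension could look roughly like this (just a sketch on top of
PeterZ's debug patch, which provides the %pc specifier):

	printk("%s: cpu: %d span: %pc got group %p with cpus: %pc\n",
			__func__, cpu, sched_domain_span(sd),
			sg, sched_group_cpus(sg));

plus the same kind of printk in free_sched_domain().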

[ 0.252059] __sdt_alloc: allocated f255b020 with cpus: (1)
[ 0.252147] __sdt_alloc: allocated f255b0e0 with cpus: (2)
[ 0.252229] __sdt_alloc: allocated f255b120 with cpus: (3)
[ 0.252311] __sdt_alloc: allocated f255b160 with cpus: (4)
[ 0.252395] __sdt_alloc: allocated f255b1a0 with cpus: (5)
[ 0.252477] __sdt_alloc: allocated f255b1e0 with cpus: (6)
[ 0.252559] __sdt_alloc: allocated f255b220 with cpus: (7) (not used)
[ 0.252641] __sdt_alloc: allocated f255b260 with cpus: (8) (not used)
[ 0.253013] __sdt_alloc: allocated f255b2a0 with cpus: (9)
[ 0.253097] __sdt_alloc: allocated f255b2e0 with cpus: (10)
[ 0.253184] __sdt_alloc: allocated f255b320 with cpus: (11)
[ 0.253265] __sdt_alloc: allocated f255b360 with cpus: (12)

[ 0.253354] build_sched_groups: got group f255b020 with cpus: (1)
[ 0.253436] build_sched_groups: got group f255b120 with cpus: (3)
[ 0.253519] build_sched_groups: got group f255b1a0 with cpus: (5)
[ 0.253600] build_sched_groups: got group f255b2a0 with cpus: (9)
[ 0.253681] build_sched_groups: got group f255b2e0 with cpus: (10)
[ 0.253762] build_sched_groups: got group f255b320 with cpus: (11)
[ 0.253843] build_sched_groups: got group f255b360 with cpus: (12)
[ 0.254004] build_sched_groups: got group f255b0e0 with cpus: (2)
[ 0.254087] build_sched_groups: got group f255b160 with cpus: (4)
[ 0.254170] build_sched_groups: got group f255b1e0 with cpus: (6)
[ 0.254252] build_sched_groups: FAIL
[ 0.254331] build_sched_groups: got group f255b1a0 with cpus: 0 (5)
[ 0.255004] build_sched_groups: FAIL
[ 0.255084] build_sched_groups: got group f255b1e0 with cpus: 1 (6)
[ 0.255365] devtmpfs: initialized

>
> I also booted with early printk=keepsched_debug as requested by
> Dietmar.
>

Didn't see what I was looking for in your dmesg output. Did you use
'earlyprintk=keep sched_debug'?

2014-07-18 10:16:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 12:34:49AM -0500, Bruno Wolff III wrote:
> On Thu, Jul 17, 2014 at 14:35:02 +0200,
> Peter Zijlstra <[email protected]> wrote:
> >
> >In any case, can someone who can trigger this run with the below; its
> >'clean' for me, but supposedly you'll trigger a FAIL somewhere.
>
> I got a couple of fail messages.
>
> dmesg output is available in the bug as the following attachment:
> https://bugzilla.kernel.org/attachment.cgi?id=143361

Thanks!

[ 0.252059] __sdt_alloc: allocated f255b020 with cpus:
[ 0.252147] __sdt_alloc: allocated f255b0e0 with cpus:
[ 0.252229] __sdt_alloc: allocated f255b120 with cpus:
[ 0.252311] __sdt_alloc: allocated f255b160 with cpus:

[ 0.252395] __sdt_alloc: allocated f255b1a0 with cpus:
[ 0.252477] __sdt_alloc: allocated f255b1e0 with cpus:
[ 0.252559] __sdt_alloc: allocated f255b220 with cpus:
[ 0.252641] __sdt_alloc: allocated f255b260 with cpus:

[ 0.253013] __sdt_alloc: allocated f255b2a0 with cpus:
[ 0.253097] __sdt_alloc: allocated f255b2e0 with cpus:
[ 0.253184] __sdt_alloc: allocated f255b320 with cpus:
[ 0.253265] __sdt_alloc: allocated f255b360 with cpus:

[ 0.253354] build_sched_groups: got group f255b020 with cpus:
[ 0.253436] build_sched_groups: got group f255b120 with cpus:
[ 0.253519] build_sched_groups: got group f255b1a0 with cpus:
[ 0.253600] build_sched_groups: got group f255b2a0 with cpus:
[ 0.253681] build_sched_groups: got group f255b2e0 with cpus:

[ 0.253762] build_sched_groups: got group f255b320 with cpus:
[ 0.253843] build_sched_groups: got group f255b360 with cpus:
[ 0.254004] build_sched_groups: got group f255b0e0 with cpus:
[ 0.254087] build_sched_groups: got group f255b160 with cpus:
[ 0.254170] build_sched_groups: got group f255b1e0 with cpus:
[ 0.254252] build_sched_groups: FAIL
[ 0.254331] build_sched_groups: got group f255b1a0 with cpus: 0
[ 0.255004] build_sched_groups: FAIL
[ 0.255084] build_sched_groups: got group f255b1e0 with cpus: 1

So from previous msgs we know:

     CPU0  CPU1  CPU2  CPU3

D0    *           *          SMT
            *           *

D2    *     *     *     *    DIE


This gives us (from __sdt_alloc):

020 0e0 120 160 SMT
1a0 1e0 220 260 MC
2a0 2e0 320 360 DIE

Given that you have a DIE domain, and MC is found degenerate, I'll
conclude that you do not have the shared L3 possible for your machine
and only have the dual socket, with 2 threads per socket.

So the domains _should_ look like:

D0   0,2      1,3      0,2      1,3
D1   0,2      1,3      0,2      1,3
D2   0,1,2,3  0,1,2,3  0,1,2,3  0,1,2,3

Assuming that, in build_sched_groups(), which gets called for each cpu,
for each domain, we get:

D0g 020(0) 120(2)
D1g 1a0(0,2)
D2g 2a0(0,2)

So far so good; at this point we're in build_sched_groups(), we have
@cpu=0 @span=0-3 @covered=0,2 @i=0 and we're just about to start the
loop for @i=1.

1 is not set in covered

get_group(i=1, sdd, &sg)
  @sd = *per_cpu_ptr(sdd->sd, 1);    /* should be D2 for CPU1 */
  @child = sd->child;                /* should be D1 for CPU1: 1,3 */
  @cpu = 1
  @sg = *per_cpu_ptr(sdd->sg, 1);    /* should be: 2e0 */

But instead we get 320 !?

The 2e0 group would cover 1,3, thereby increasing @cover to 0-3 and
we're done for CPU0. Instead things go on to return 360, more WTF!

So it looks like the actual domain tree is broken, and not what we
assumed it was.
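
For reference, get_group() at the time looked roughly like this
(paraphrased, not a verbatim copy):

static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
{
	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
	struct sched_domain *child = sd->child;

	/* use the group of the first CPU in the child's span */
	if (child)
		cpu = cpumask_first(sched_domain_span(child));

	if (sg) {
		*sg = *per_cpu_ptr(sdd->sg, cpu);
		(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
		atomic_set(&(*sg)->sgc->ref, 1);	/* for claim_allocations */
	}

	return cpu;
}

The walkthrough above follows exactly this logic.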

Could I bother you to run with the below instead? It should also print
out the sched domain masks so we don't need to guess about them.

(make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)

> I also booted with early printk=keepsched_debug as requested by Dietmar.

can you make that: sched_debug ?

---
kernel/sched/core.c | 22 ++++++++++++++++++++++
lib/vsprintf.c | 5 +++++
2 files changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..4babcbbc11b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5857,6 +5857,17 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 
 		group = get_group(i, sdd, &sg);
+
+		if (!cpumask_empty(sched_group_cpus(sg)))
+			printk("%s: FAIL\n", __func__);
+
+		printk("%s: got group %p with cpus: %pc\n",
+				__func__,
+				sg,
+				sched_group_cpus(sg));
+
+		cpumask_clear(sched_group_cpus(sg));
+
 		cpumask_setall(sched_group_mask(sg));
 
 		for_each_cpu(j, span) {
@@ -6418,6 +6429,11 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg)
 				return -ENOMEM;
 
+			printk("%s: allocated %p with cpus: %pc\n",
+					__func__,
+					sg,
+					sched_group_cpus(sg));
+
 			sg->next = sg;
 
 			*per_cpu_ptr(sdd->sg, j) = sg;
@@ -6474,6 +6490,12 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 	if (!sd)
 		return child;
 
+	printk("%s: cpu: %d level: %s cpu_map: %pc tl->mask: %pc\n",
+			__func__,
+			cpu, tl->name,
+			cpu_map,
+			tl->mask(cpu));
+
 	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
 	if (child) {
 		sd->level = child->level + 1;
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 6fe2c84eb055..ac22c46fd6d0 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -28,6 +28,7 @@
#include <linux/ioport.h>
#include <linux/dcache.h>
#include <linux/cred.h>
+#include <linux/cpumask.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -1250,6 +1251,7 @@ int kptr_restrict __read_mostly;
* (default assumed to be phys_addr_t, passed by reference)
* - 'd[234]' For a dentry name (optionally 2-4 last components)
* - 'D[234]' Same as 'd' but for a struct file
+ * - 'c' For a cpumask list
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -1389,6 +1391,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
return dentry_name(buf, end,
((const struct file *)ptr)->f_path.dentry,
spec, fmt);
+ case 'c':
+ return buf + cpulist_scnprintf(buf, end - buf, ptr);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1635,6 +1639,7 @@ int format_decode(const char *fmt, struct printf_spec *spec)
* case.
* %*ph[CDN] a variable-length hex string with a separator (supports up to 64
* bytes of the input)
+ * %pc print a cpumask as comma-separated list
* %n is ignored
*
* ** Please update Documentation/printk-formats.txt when making changes **


Attachments:
(No filename) (6.26 kB)
(No filename) (836.00 B)

2014-07-18 12:10:58

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 11:28:14 +0200,
Dietmar Eggemann <[email protected]> wrote:
>
>Didn't see what I was looking for in your dmesg output. Did you use
>'earlyprintk=keep sched_debug'

I was missing a space. I'll get it on the next run.

2014-07-18 13:05:46

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 12:16:33 +0200,
Peter Zijlstra <[email protected]> wrote:
>So it looks like the actual domain tree is broken, and not what we
>assumed it was.
>
>Could I bother you to run with the below instead? It should also print
>out the sched domain masks so we don't need to guess about them.

The full dmesg output is at:
https://bugzilla.kernel.org/attachment.cgi?id=143381

>(make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)
>
>> I also booted with early printk=keepsched_debug as requested by Dietmar.
>
>can you make that: sched_debug ?

I think I've fixed that.

I think the part you are most interested in contains the following:
[ 0.252280] smpboot: Total of 4 processors activated (21438.11 BogoMIPS)
[ 0.253058] __sdt_alloc: allocated f255b020 with cpus:
[ 0.253146] __sdt_alloc: allocated f255b0e0 with cpus:
[ 0.253227] __sdt_alloc: allocated f255b120 with cpus:
[ 0.253308] __sdt_alloc: allocated f255b160 with cpus:
[ 0.253390] __sdt_alloc: allocated f255b1a0 with cpus:
[ 0.253471] __sdt_alloc: allocated f255b1e0 with cpus:
[ 0.253551] __sdt_alloc: allocated f255b220 with cpus:
[ 0.253632] __sdt_alloc: allocated f255b260 with cpus:
[ 0.254009] __sdt_alloc: allocated f255b2a0 with cpus:
[ 0.254092] __sdt_alloc: allocated f255b2e0 with cpus:
[ 0.254181] __sdt_alloc: allocated f255b320 with cpus:
[ 0.254262] __sdt_alloc: allocated f255b360 with cpus:
[ 0.254350] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
[ 0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
[ 0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
[ 0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
[ 0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
[ 0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
[ 0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
[ 0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
[ 0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.256006] build_sched_groups: got group f255b020 with cpus:
[ 0.256089] build_sched_groups: got group f255b120 with cpus:
[ 0.256171] build_sched_groups: got group f255b1a0 with cpus:
[ 0.256252] build_sched_groups: got group f255b2a0 with cpus:
[ 0.256333] build_sched_groups: got group f255b2e0 with cpus:
[ 0.256414] build_sched_groups: got group f255b320 with cpus:
[ 0.256495] build_sched_groups: got group f255b360 with cpus:
[ 0.256576] build_sched_groups: got group f255b0e0 with cpus:
[ 0.256657] build_sched_groups: got group f255b160 with cpus:
[ 0.256740] build_sched_groups: got group f255b1e0 with cpus:
[ 0.256821] build_sched_groups: FAIL
[ 0.257004] build_sched_groups: got group f255b1a0 with cpus: 0
[ 0.257087] build_sched_groups: FAIL
[ 0.257167] build_sched_groups: got group f255b1e0 with cpus: 1

2014-07-18 14:16:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 08:01:26AM -0500, Bruno Wolff III wrote:
> build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [ 0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
> [ 0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [ 0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> [ 0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [ 0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> [ 0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [ 0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> [ 0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3

*blink*...

That's, shall we say, unexpected. Let me ponder that a bit. HPA any clue
why a machine might report such a weird topology? AFAIK threads _always_
share cache. So how can cpu_coregroup_mask be a subset (instead of a
superset) of topology_thread_cpumask?

Let me go stare at the x86 topology mask setup code.
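
For reference, the two masks being compared are set up roughly like this
on x86 at the time (a sketch from memory, not verbatim source):

/* arch/x86/include/asm/topology.h */
#define topology_thread_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))

/* arch/x86/kernel/smpboot.c */
const struct cpumask *cpu_coregroup_mask(int cpu)
{
	return cpu_llc_shared_mask(cpu);	/* per_cpu(cpu_llc_shared_map, cpu) */
}

i.e. the SMT level is driven by the sibling map while the MC level is
driven by the LLC-sharing map, which in turn depends on cpu_llc_id being
set up correctly.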

2014-07-18 14:16:54

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 18/07/14 15:01, Bruno Wolff III wrote:
> On Fri, Jul 18, 2014 at 12:16:33 +0200,
> Peter Zijlstra <[email protected]> wrote:
>> So it looks like the actual domain tree is broken, and not what we
>> assumed it was.
>>
>> Could I bother you to run with the below instead? It should also print
>> out the sched domain masks so we don't need to guess about them.
>
> The full dmesg output is at:
> https://bugzilla.kernel.org/attachment.cgi?id=143381
>
>> (make sure you have CONFIG_SCHED_DEBUG=y otherwise it will not build)
>>
>>> I also booted with early printk=keepsched_debug as requested by Dietmar.
>>
>> can you make that: sched_debug ?
>
> I think I've fixed that.
>
> I think the part you are most interested in contains the following:
> [ 0.252280] smpboot: Total of 4 processors activated (21438.11 BogoMIPS)
> [ 0.253058] __sdt_alloc: allocated f255b020 with cpus:
> [ 0.253146] __sdt_alloc: allocated f255b0e0 with cpus:
> [ 0.253227] __sdt_alloc: allocated f255b120 with cpus:
> [ 0.253308] __sdt_alloc: allocated f255b160 with cpus:
> [ 0.253390] __sdt_alloc: allocated f255b1a0 with cpus:
> [ 0.253471] __sdt_alloc: allocated f255b1e0 with cpus:
> [ 0.253551] __sdt_alloc: allocated f255b220 with cpus:
> [ 0.253632] __sdt_alloc: allocated f255b260 with cpus:
> [ 0.254009] __sdt_alloc: allocated f255b2a0 with cpus:
> [ 0.254092] __sdt_alloc: allocated f255b2e0 with cpus:
> [ 0.254181] __sdt_alloc: allocated f255b320 with cpus:
> [ 0.254262] __sdt_alloc: allocated f255b360 with cpus:
> [ 0.254350] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [ 0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0

So the MC level cpu mask function is wrong on this machine. Should be
0,2 here, right?

The cpu_capacity values look strange too (probably a subsequent error).

[ 0.257260] CPU0 attaching sched-domain:
[ 0.257264] domain 0: span 0,2 level SMT
[ 0.257268] groups: 0 (cpu_capacity = 586) 2 (cpu_capacity = 587)
[ 0.257275] domain 1: span 0-3 level DIE
[ 0.257278] groups: 0 (cpu_capacity = 587) 1 (cpu_capacity = 588) 2 (cpu_capacity = 587) 3 (cpu_capacity = 588)

> [ 0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [ 0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> [ 0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [ 0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> [ 0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> [ 0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> [ 0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
> [ 0.256006] build_sched_groups: got group f255b020 with cpus:
> [ 0.256089] build_sched_groups: got group f255b120 with cpus:
> [ 0.256171] build_sched_groups: got group f255b1a0 with cpus:
> [ 0.256252] build_sched_groups: got group f255b2a0 with cpus:
> [ 0.256333] build_sched_groups: got group f255b2e0 with cpus:
> [ 0.256414] build_sched_groups: got group f255b320 with cpus:
> [ 0.256495] build_sched_groups: got group f255b360 with cpus:
> [ 0.256576] build_sched_groups: got group f255b0e0 with cpus:
> [ 0.256657] build_sched_groups: got group f255b160 with cpus:
> [ 0.256740] build_sched_groups: got group f255b1e0 with cpus:
> [ 0.256821] build_sched_groups: FAIL
> [ 0.257004] build_sched_groups: got group f255b1a0 with cpus: 0
> [ 0.257087] build_sched_groups: FAIL
> [ 0.257167] build_sched_groups: got group f255b1e0 with cpus: 1
>
>

2014-07-18 14:50:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 04:16:48PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 18, 2014 at 08:01:26AM -0500, Bruno Wolff III wrote:
> > build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > [ 0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
> > [ 0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > [ 0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > [ 0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> > [ 0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > [ 0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > [ 0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> > [ 0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > [ 0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > [ 0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> > [ 0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
>
> *blink*...
>
> That's, shall we say, unexpected. Let me ponder that a bit. HPA any clue
> why a machine might report such a weird topology? AFAIK threads _always_
> share cache. So how can cpu_coregroup_mask be a subset (instead of a
> superset) of topology_thread_cpumask?
>
> Let me go stare at the x86 topology mask setup code.

Possibly something like so, but I'm not too sure. Anybody?

---
arch/x86/kernel/smpboot.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798930ef..5eefa9abc2a9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -338,9 +338,15 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
 
-	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
-	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
+	if (cpu_has_topoext) {
+		if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
+		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
+			return topology_sane(c, o, "llc");
+
+	} else if (c->phys_proc_id == o->phys_proc_id &&
+		   c->cpu_core_id == o->cpu_core_id) {
 		return topology_sane(c, o, "llc");
+	}
 
 	return false;
 }

2014-07-18 16:16:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Fri, Jul 18, 2014 at 04:50:40PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 18, 2014 at 04:16:48PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 18, 2014 at 08:01:26AM -0500, Bruno Wolff III wrote:
> > > build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > > [ 0.254433] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0
> > > [ 0.254516] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > > [ 0.254600] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > > [ 0.254683] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1
> > > [ 0.254766] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > > [ 0.254850] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
> > > [ 0.254932] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 2
> > > [ 0.255005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
> > > [ 0.255091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
> > > [ 0.255176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 3
> > > [ 0.255260] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
> >
> > *blink*...
> >
> > That's, shall we say, unexpected. Let me ponder that a bit. HPA any clue
> > why a machine might report such a weird topology? AFAIK threads _always_
> > share cache. So how can cpu_coregroup_mask be a subset (instead of a
> > superset) of topology_thread_cpumask?
> >
> > Let me go stare at the x86 topology mask setup code.
>
> Possibly something like so, but I'm not too sure. Anybody?

OK, Borislav says topoext is AMD only, so that's not the problem. In
which case the problem must be that cpu_llc_id is wrong.

This gets set in init_intel_cacheinfo() but that's hurting my brain for
the moment. There's plenty of P4-specific cruft in there though.

2014-07-21 16:37:34

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

Is there more I can do to help with this now? Or should I just wait for
patches to test?

2014-07-21 16:52:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
> Is there more I can do to help with this now? Or should I just wait for
> patches to test?

Yeah, sorry, was wiped out today. I'll go stare harder at the P4
topology setup code tomorrow. Something fishy there.

2014-07-22 09:47:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
> > Is there more I can do to help with this now? Or should I just wait for
> > patches to test?
>
> Yeah, sorry, was wiped out today. I'll go stare harder at the P4
> topology setup code tomorrow. Something fishy there.

Does this make your machine boot again (while giving an error)?

It tries to robustify the topology setup a bit; crashing on crap input
should be avoided if possible, of course.

I'll go stare at the x86/P4 topology code as promised.

---
Subject: sched: Robustify topology setup
From: Peter Zijlstra <[email protected]>
Date: Mon Jul 21 23:07:06 CEST 2014

We hard assume that higher topology levels are strict supersets of
lower levels.

Detect, warn and try to fixup when we encounter this violated.

Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
kernel/sched/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6480,6 +6480,20 @@ struct sched_domain *build_sched_domain(
 		sched_domain_level_max = max(sched_domain_level_max, sd->level);
 		child->parent = sd;
 		sd->child = child;
+
+		if (!cpumask_subset(sched_domain_span(child),
+				    sched_domain_span(sd))) {
+			pr_err("BUG: arch topology borken\n");
+#ifdef CONFIG_SCHED_DEBUG
+			pr_err("     the %s domain not a subset of the %s domain\n",
+					child->name, sd->name);
+#endif
+			/* Fixup, ensure @sd has at least @child cpus. */
+			cpumask_or(sched_domain_span(sd),
+				   sched_domain_span(sd),
+				   sched_domain_span(child));
+		}
+
 	}
 	set_domain_attribute(sd, attr);

2014-07-22 10:39:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 11:47:40AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
> > > Is there more I can do to help with this now? Or should I just wait for
> > > patches to test?
> >
> > Yeah, sorry, was wiped out today. I'll go stare harder at the P4
> > topology setup code tomorrow. Something fishy there.
>
> Does this make your machine boot again (while giving an error)?
>
> It tries to robustify the topology setup a bit, crashing on crap input
> should be avoided if possible of course.
>
> I'll go stare at the x86/P4 topology code like promised.

Could you provide the output of cpuid and cpuid -r for your machine?
This code is magic and I've no idea what your machine is telling it to
do :/

2014-07-22 12:12:17

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 12:38:57 +0200,
Peter Zijlstra <[email protected]> wrote:
>
>Could you provide the output of cpuid and cpuid -r for your machine?
>This code is magic and I've no idea what your machine is telling it to
>do :/

I am attaching both sets of output. (I also added copies to the bug report.)


Attachments:
(No filename) (316.00 B)
cpuid.out (20.02 kB)
cpuidr.out (3.15 kB)

2014-07-22 12:12:41

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 22/07/14 10:47, Peter Zijlstra wrote:
> On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
>>> Is there more I can do to help with this now? Or should I just wait for
>>> patches to test?
>>
>> Yeah, sorry, was wiped out today. I'll go stare harder at the P4
>> topology setup code tomorrow. Something fishy there.
>
> Does this make your machine boot again (while giving an error)?
>
> It tries to robustify the topology setup a bit, crashing on crap input
> should be avoided if possible of course.
>
> I'll go stare at the x86/P4 topology code like promised.
>
> ---
> Subject: sched: Robustify topology setup
> From: Peter Zijlstra <[email protected]>
> Date: Mon Jul 21 23:07:06 CEST 2014
>
> We hard assume that higher topology levels are strict supersets of
> lower levels.

IMHO, we require only that higher topology levels are supersets of lower
levels, not strict (proper) supersets.

AFAICS, the patch itself requires only supersets, i.e. on ARM TC2 with
the following change in cpu_corepower_mask:

const struct cpumask *cpu_corepower_mask(int cpu)
{
-	return &cpu_topology[cpu].thread_sibling;
+	return &cpu_topology[cpu].core_sibling;
}

I get:

...
build_sched_domain: cpu: 0 level: GMC cpu_map: 0-4 tl->mask: 0-1
build_sched_domain: cpu: 0 level: MC cpu_map: 0-4 tl->mask: 0-1
build_sched_domain: cpu: 0 level: DIE cpu_map: 0-4 tl->mask: 0-4
...

without hitting the newly introduced pr_err's.

>
> Detect, warn and try to fixup when we encounter this violated.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> ---
> kernel/sched/core.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6480,6 +6480,20 @@ struct sched_domain *build_sched_domain(
> sched_domain_level_max = max(sched_domain_level_max, sd->level);
> child->parent = sd;
> sd->child = child;
> +
> + if (!cpumask_subset(sched_domain_span(child),
> + sched_domain_span(sd))) {
> + pr_err("BUG: arch topology borken\n");
> +#ifdef CONFIG_SCHED_DEBUG
> + pr_err(" the %s domain not a subset of the %s domain\n",
> + child->name, sd->name);
> +#endif
> + /* Fixup, ensure @sd has at least @child cpus. */
> + cpumask_or(sched_domain_span(sd),
> + sched_domain_span(sd),
> + sched_domain_span(child));

This fixup will (probably) heal Bruno's issue with its wrong
cpu_coregroup_mask() function.

If I exchange cpu_corepower_mask with cpu_coregroup_mask in arm_topology[]

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_coregroup_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
+	{ cpu_corepower_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif

I get:

...
build_sched_domain: cpu: 0 level: GMC cpu_map: 0-4 tl->mask: 0-1
build_sched_domain: cpu: 0 level: MC cpu_map: 0-4 tl->mask: 0
BUG: arch topology borken
the GMC domain not a subset of the MC domain
build_sched_domain: cpu: 0 level: DIE cpu_map: 0-4 tl->mask: 0-4
...

cat /proc/schedstat
...
cpu0 0 0 16719 6169 10937 6392 5348510220 2935348625 10448
domain0 03 19190 19168 9 10265 13 0 0 19168 16 16 0 0 0 0 0 16 1196 1080 46 43570 75 0 0 1080 0 0 0 0 0 0 0 0 0 2947 280 0
domain1 1f 18768 18763 3 3006 2 0 9 18055 6 6 0 0 0 0 0 1 1125 996 94 81038 43 0 18 978 0 0 0 0 0 0 0 0 0 1582 172 0

# cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
GMC
DIE

so MC level gets changed to mask 0-1.

> + }
> +
> }
> set_domain_attribute(sd, attr);
>
>

Tested-by: Dietmar Eggemann <[email protected]>
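
[Editorial aside, not part of the original thread: the added check relies on
cpumask_subset(), which treats equal masks as subsets, so sibling levels with
identical spans pass and the warning only fires when a child span contains a
CPU its parent lacks. A minimal userspace sketch of that semantics, with a
hypothetical mask_subset() helper standing in for the kernel API:]

/*
 * mask_subset() mirrors cpumask_subset(src1, src2): true when every bit
 * set in src1 is also set in src2. Equal masks therefore count as
 * subsets, which is why only genuinely broken spans trip the pr_err().
 */
#include <stdio.h>

static int mask_subset(unsigned long src1, unsigned long src2)
{
	return (src1 & ~src2) == 0;
}

int main(void)
{
	unsigned long child  = 0x3;	/* CPUs 0-1, e.g. an SMT span */
	unsigned long parent = 0x3;	/* equal span at the next level */
	unsigned long broken = 0x1;	/* parent missing CPU 1 */

	printf("equal spans subset?  %d\n", mask_subset(child, parent)); /* 1 */
	printf("broken spans subset? %d\n", mask_subset(child, broken)); /* 0 */
	return 0;
}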

2014-07-22 12:59:29

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 11:47:40 +0200,
Peter Zijlstra <[email protected]> wrote:
>On Mon, Jul 21, 2014 at 06:52:12PM +0200, Peter Zijlstra wrote:
>> On Mon, Jul 21, 2014 at 11:35:28AM -0500, Bruno Wolff III wrote:
>> > Is there more I can do to help with this now? Or should I just wait for
>> > patches to test?
>>
>> Yeah, sorry, was wiped out today. I'll go stare harder at the P4
>> topology setup code tomorrow. Something fishy there.
>
>Does this make your machine boot again (while giving an error)?

I won't be able to actually test this until after work.

2014-07-22 13:03:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 07:10:01AM -0500, Bruno Wolff III wrote:
> On Tue, Jul 22, 2014 at 12:38:57 +0200,
> Peter Zijlstra <[email protected]> wrote:
> >
> >Could you provide the output of cpuid and cpuid -r for your machine?
> >This code is magic and I've no idea what your machine is telling it to
> >do :/
>
> I am attaching both sets of output. (I also added copies to the bug report.)

Thanks! And yes, I now see (and should have seen before) what is
'broken'.

> 0x00000000 0x00: eax=0x00000002 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69

This gives us cpuid_level=0x02

> 0x00000002 0x00: eax=0x665b5001 ebx=0x00000000 ecx=0x00000000 edx=0x007b7040

Which means that init_intel_cacheinfo() will not use cpuid4 for
cacheinfo and we revert to cpuid2, which translates into:

> cache and TLB information (2):
> 0x50: instruction TLB: 4K & 2M/4M pages, 64 entries
> 0x5b: data TLB: 4K & 4M pages, 64 entries
> 0x66: L1 data cache: 8K, 4-way, 64 byte lines
> 0x40: No L3 cache
> 0x70: Trace cache: 12K-uop, 8-way
> 0x7b: L2 cache: 512K, 8-way, sectored, 64 byte lines

Now the problem is that cpu_llc_id is only set on new_l[23], and set to
l[23]_id. Both new_l[23] and l[23]_id are only set in the cpuid4 case.

So for this P4 cpu_llc_id remains unset.

Furthermore cpuid2 does not include cpu masks, so we need to use cpuid1:

> (multi-processing method): Intel leaf 1

> 0x00000001 0x00: eax=0x00000f29 ebx=0x0002080b ecx=0x00004400 edx=0xbfebfbff

to reconstruct the topology, with the added assumption that SMT threads
share all caches.

Oh, of course we do SMP detection and setup after the cache setup...
lovely.

/me goes bang head against wall
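
[Editorial aside, not from the thread: the leaves decoded above can be read
from userspace with GCC's <cpuid.h> helper; a minimal sketch, assuming an
x86 build with GCC or clang:]

/*
 * Leaf 0 EAX is the maximum basic leaf (the kernel's cpuid_level); leaf 1
 * EDX bit 28 is the HTT flag and EBX bits 23:16 the logical-CPU count per
 * package (valid when HTT is set). A maximum leaf below 4 means there is
 * no leaf-4 cache topology, which is exactly the fallback case above.
 */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
		return 1;
	printf("max basic leaf (cpuid_level): %u\n", eax);

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;
	printf("HTT: %u, logical CPUs per package: %u\n",
	       (edx >> 28) & 1, (ebx >> 16) & 0xff);
	return 0;
}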

2014-07-22 13:26:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 03:03:43PM +0200, Peter Zijlstra wrote:
> Oh, of course we do SMP detection and setup after the cache setup...
> lovely.
>
> /me goes bang head against wall

hpa, could we move the legacy cpuid1/cpuid4 topology detection muck up,
preferably right after detect_extended_topology()?

I need c->phys_proc_id in init_intel_cacheinfo() for machines with
cpuid_level < 4.

2014-07-22 13:35:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 03:03:43PM +0200, Peter Zijlstra wrote:
> > Oh, of course we do SMP detection and setup after the cache setup...
> > lovely.
> >
> > /me goes bang head against wall
>
> hpa, could we move the legacy cpuid1/cpuid4 topology detection muck up,
> preferably right after detect_extended_topology()?
>
> I need c->phys_proc_id in init_intel_cacheinfo() for machines with
> cpuid_level < 4.

Something like so.. anything obviously broken?

---
arch/x86/kernel/cpu/intel.c | 22 +++++++++++-----------
arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0fd955778f35..9483ee5b3991 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x86 *c)
*/
detect_extended_topology(c);

+ if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
+ /*
+ * let's use the legacy cpuid vector 0x1 and 0x4 for topology
+ * detection.
+ */
+ c->x86_max_cores = intel_num_cpu_cores(c);
+#ifdef CONFIG_X86_32
+ detect_ht(c);
+#endif
+ }
+
l2 = init_intel_cacheinfo(c);
if (c->cpuid_level > 9) {
unsigned eax = cpuid_eax(10);
@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_P3);
#endif

- if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
- /*
- * let's use the legacy cpuid vector 0x1 and 0x4 for topology
- * detection.
- */
- c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
- detect_ht(c);
-#endif
- }
-
/* Work around errata */
srat_detect_node(c);

diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index a952e9c85b6f..9c8f7394c612 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c)
#endif
}

+#ifdef CONFIG_X86_HT
+ /*
+ * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
+ * turns means that the only possibility is SMT (as indicated in
+ * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
+ * that SMT shares all caches, we can unconditionally set cpu_llc_id to
+ * c->phys_proc_id.
+ */
+ if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
+ per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+#endif
+
c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));

return l2;

2014-07-22 14:11:28

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 15:35:14 +0200,
Peter Zijlstra <[email protected]> wrote:
>On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
>
>Something like so.. anything obviously broken?

Do you want me to test this change instead of, or combined with, the other
patch you wanted tested earlier?

>
>---
> arch/x86/kernel/cpu/intel.c | 22 +++++++++++-----------
> arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
> 2 files changed, 23 insertions(+), 11 deletions(-)
>
>diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
>index 0fd955778f35..9483ee5b3991 100644
>--- a/arch/x86/kernel/cpu/intel.c
>+++ b/arch/x86/kernel/cpu/intel.c
>@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x86 *c)
> */
> detect_extended_topology(c);
>
>+ if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
>+ /*
>+ * let's use the legacy cpuid vector 0x1 and 0x4 for topology
>+ * detection.
>+ */
>+ c->x86_max_cores = intel_num_cpu_cores(c);
>+#ifdef CONFIG_X86_32
>+ detect_ht(c);
>+#endif
>+ }
>+
> l2 = init_intel_cacheinfo(c);
> if (c->cpuid_level > 9) {
> unsigned eax = cpuid_eax(10);
>@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x86 *c)
> set_cpu_cap(c, X86_FEATURE_P3);
> #endif
>
>- if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
>- /*
>- * let's use the legacy cpuid vector 0x1 and 0x4 for topology
>- * detection.
>- */
>- c->x86_max_cores = intel_num_cpu_cores(c);
>-#ifdef CONFIG_X86_32
>- detect_ht(c);
>-#endif
>- }
>-
> /* Work around errata */
> srat_detect_node(c);
>
>diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
>index a952e9c85b6f..9c8f7394c612 100644
>--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
>+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
>@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c)
> #endif
> }
>
>+#ifdef CONFIG_X86_HT
>+ /*
>+ * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
>+ * turns means that the only possibility is SMT (as indicated in
>+ * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
>+ * that SMT shares all caches, we can unconditionally set cpu_llc_id to
>+ * c->phys_proc_id.
>+ */
>+ if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
>+ per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
>+#endif
>+
> c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
>
> return l2;

2014-07-22 14:19:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 09:09:12AM -0500, Bruno Wolff III wrote:
> On Tue, Jul 22, 2014 at 15:35:14 +0200,
> Peter Zijlstra <[email protected]> wrote:
> >On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
> >
> >Something like so.. anything obviously broken?
>
> Do you want me to test this change instead of, or combined with the other
> patch you wanted tested earlier?

You can put this on top of them. I hope that this will make the pr_err()
introduced in the robustify patch go away.

2014-07-22 17:06:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 07/22/2014 06:35 AM, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 03:26:03PM +0200, Peter Zijlstra wrote:
>> On Tue, Jul 22, 2014 at 03:03:43PM +0200, Peter Zijlstra wrote:
>>> Oh, of course we do SMP detection and setup after the cache setup...
>>> lovely.
>>>
>>> /me goes bang head against wall
>>
>> hpa, could we move the legacy cpuid1/cpuid4 topology detection muck up,
>> preferably right after detect_extended_topology()?
>>
>> I need c->phys_proc_id in init_intel_cacheinfo() for machines with
>> cpuid_level < 4.
>
> Something like so.. anything obviously broken?
>

Nothing *obvious*. I should take a closer pass at the code, though.

-hpa

2014-07-23 01:39:00

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 16:18:55 +0200,
Peter Zijlstra <[email protected]> wrote:
>
>You can put this on top of them. I hope that this will make the pr_err()
>introduced in the robustify patch go away.

I went to 3.16-rc6 and then reapplied three patches from your previous
email messages. The dmesg output and the diff from 3.16-rc6 have been
added to https://bugzilla.kernel.org/show_bug.cgi?id=80251 .
The dmesg output is at: https://bugzilla.kernel.org/attachment.cgi?id=143961
The combined diff is at: https://bugzilla.kernel.org/attachment.cgi?id=143971

What I think you are probably looking for in the dmesg output:
[ 0.251061] __sdt_alloc: allocated f515b020 with cpus:
[ 0.251149] __sdt_alloc: allocated f515b0e0 with cpus:
[ 0.251231] __sdt_alloc: allocated f515b120 with cpus:
[ 0.251313] __sdt_alloc: allocated f515b160 with cpus:
[ 0.251397] __sdt_alloc: allocated f515b1a0 with cpus:
[ 0.251479] __sdt_alloc: allocated f515b1e0 with cpus:
[ 0.251561] __sdt_alloc: allocated f515b220 with cpus:
[ 0.251643] __sdt_alloc: allocated f515b260 with cpus:
[ 0.252011] __sdt_alloc: allocated f515b2a0 with cpus:
[ 0.252095] __sdt_alloc: allocated f515b2e0 with cpus:
[ 0.252184] __sdt_alloc: allocated f515b320 with cpus:
[ 0.252266] __sdt_alloc: allocated f515b360 with cpus:
[ 0.252355] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
[ 0.252441] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0,2
[ 0.252526] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.252611] build_sched_domain: cpu: 1 level: SMT cpu_map: 0-3 tl->mask: 1,3
[ 0.252696] build_sched_domain: cpu: 1 level: MC cpu_map: 0-3 tl->mask: 1,3
[ 0.252781] build_sched_domain: cpu: 1 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.252866] build_sched_domain: cpu: 2 level: SMT cpu_map: 0-3 tl->mask: 0,2
[ 0.252951] build_sched_domain: cpu: 2 level: MC cpu_map: 0-3 tl->mask: 0,2
[ 0.253005] build_sched_domain: cpu: 2 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.253091] build_sched_domain: cpu: 3 level: SMT cpu_map: 0-3 tl->mask: 1,3
[ 0.253176] build_sched_domain: cpu: 3 level: MC cpu_map: 0-3 tl->mask: 1,3
[ 0.253261] build_sched_domain: cpu: 3 level: DIE cpu_map: 0-3 tl->mask: 0-3
[ 0.254004] build_sched_groups: got group f515b020 with cpus:
[ 0.254088] build_sched_groups: got group f515b120 with cpus:
[ 0.254170] build_sched_groups: got group f515b1a0 with cpus:
[ 0.254253] build_sched_groups: got group f515b2a0 with cpus:
[ 0.254336] build_sched_groups: got group f515b2e0 with cpus:
[ 0.254419] build_sched_groups: got group f515b0e0 with cpus:
[ 0.254502] build_sched_groups: got group f515b160 with cpus:
[ 0.254585] build_sched_groups: got group f515b1e0 with cpus:
[ 0.254680] CPU0 attaching sched-domain:
[ 0.254684] domain 0: span 0,2 level SMT
[ 0.254687] groups: 0 (cpu_capacity = 586) 2 (cpu_capacity = 588)
[ 0.254695] domain 1: span 0-3 level DIE
[ 0.254698] groups: 0,2 (cpu_capacity = 1174) 1,3 (cpu_capacity = 1176)
[ 0.254709] CPU1 attaching sched-domain:
[ 0.254711] domain 0: span 1,3 level SMT
[ 0.254714] groups: 1 (cpu_capacity = 588) 3 (cpu_capacity = 588)
[ 0.254721] domain 1: span 0-3 level DIE
[ 0.254724] groups: 1,3 (cpu_capacity = 1176) 0,2 (cpu_capacity = 1174)
[ 0.254733] CPU2 attaching sched-domain:
[ 0.254735] domain 0: span 0,2 level SMT
[ 0.254738] groups: 2 (cpu_capacity = 588) 0 (cpu_capacity = 586)
[ 0.254745] domain 1: span 0-3 level DIE
[ 0.254747] groups: 0,2 (cpu_capacity = 1174) 1,3 (cpu_capacity = 1176)
[ 0.254756] CPU3 attaching sched-domain:
[ 0.254758] domain 0: span 1,3 level SMT
[ 0.254761] groups: 3 (cpu_capacity = 588) 1 (cpu_capacity = 588)
[ 0.254768] domain 1: span 0-3 level DIE
[ 0.254770] groups: 1,3 (cpu_capacity = 1176) 0,2 (cpu_capacity = 1174)

2014-07-23 06:51:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Tue, Jul 22, 2014 at 08:37:19PM -0500, Bruno Wolff III wrote:
> [ 0.252355] build_sched_domain: cpu: 0 level: SMT cpu_map: 0-3 tl->mask: 0,2
> [ 0.252441] build_sched_domain: cpu: 0 level: MC cpu_map: 0-3 tl->mask: 0,2
> [ 0.252526] build_sched_domain: cpu: 0 level: DIE cpu_map: 0-3 tl->mask: 0-3

W00t! that seems to have cured it.

Thanks Bruno, I'll go write up a proper changelog and such.

2014-07-23 15:11:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c


OK, so that's become the below patch. I'll feed it to Ingo if that's OK
with hpa.

---
Subject: x86: Fix cache topology for early P4-SMT
From: Peter Zijlstra <[email protected]>
Date: Tue, 22 Jul 2014 15:35:14 +0200

P4 systems with cpuid level < 4 can have SMT, but the cache topology
description available (cpuid2) does not include SMP information.

Now we know that SMT shares all cache levels, and therefore we can
mark all available cache levels as shared.

We do this by setting cpu_llc_id to ->phys_proc_id, since that's
the same for each SMT thread. We can do this unconditionally since, if
there's no SMT, it's still true: the one CPU shares cache only with
itself.

This fixes a problem where such CPUs report an incorrect LLC CPU mask.

This in turn fixes a crash in the scheduler where the topology was
built wrong; the scheduler assumes the LLC mask includes at least the SMT CPUs.

Cc: Josh Boyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Tested-by: Bruno Wolff III <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/intel.c | 22 +++++++++++-----------
arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
2 files changed, 23 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x8
*/
detect_extended_topology(c);

+ if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
+ /*
+ * let's use the legacy cpuid vector 0x1 and 0x4 for topology
+ * detection.
+ */
+ c->x86_max_cores = intel_num_cpu_cores(c);
+#ifdef CONFIG_X86_32
+ detect_ht(c);
+#endif
+ }
+
l2 = init_intel_cacheinfo(c);
if (c->cpuid_level > 9) {
unsigned eax = cpuid_eax(10);
@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x8
set_cpu_cap(c, X86_FEATURE_P3);
#endif

- if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
- /*
- * let's use the legacy cpuid vector 0x1 and 0x4 for topology
- * detection.
- */
- c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
- detect_ht(c);
-#endif
- }
-
/* Work around errata */
srat_detect_node(c);

--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct
#endif
}

+#ifdef CONFIG_X86_HT
+ /*
+ * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
+ * turns means that the only possibility is SMT (as indicated in
+ * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
+ * that SMT shares all caches, we can unconditionally set cpu_llc_id to
+ * c->phys_proc_id.
+ */
+ if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
+ per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+#endif
+
c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));

return l2;

2014-07-23 15:13:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On 07/23/2014 08:11 AM, Peter Zijlstra wrote:
>
> OK, so that's become the below patch. I'll feed it to Ingo if that's OK
> with hpa.
>

I'll grab it directly; it is a bit quicker that way.

-hpa

Subject: [tip:x86/urgent] x86, cpu: Fix cache topology for early P4-SMT

Commit-ID: 2a2261553dd1472ca574acadbd93e12f44c4e6d5
Gitweb: http://git.kernel.org/tip/2a2261553dd1472ca574acadbd93e12f44c4e6d5
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 22 Jul 2014 15:35:14 +0200
Committer: H. Peter Anvin <[email protected]>
CommitDate: Wed, 23 Jul 2014 08:16:17 -0700

x86, cpu: Fix cache topology for early P4-SMT

P4 systems with cpuid level < 4 can have SMT, but the cache topology
description available (cpuid2) does not include SMP information.

Now we know that SMT shares all cache levels, and therefore we can
mark all available cache levels as shared.

We do this by setting cpu_llc_id to ->phys_proc_id, since that's
the same for each SMT thread. We can do this unconditionally since, if
there's no SMT, it's still true: the one CPU shares cache only with
itself.

This fixes a problem where such CPUs report an incorrect LLC CPU mask.

This in turn fixes a crash in the scheduler where the topology was
built wrong; the scheduler assumes the LLC mask includes at least the SMT CPUs.

Cc: Josh Boyer <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Tested-by: Bruno Wolff III <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 22 +++++++++++-----------
arch/x86/kernel/cpu/intel_cacheinfo.c | 12 ++++++++++++
2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index a800290..f9e4fdd 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -370,6 +370,17 @@ static void init_intel(struct cpuinfo_x86 *c)
*/
detect_extended_topology(c);

+ if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
+ /*
+ * let's use the legacy cpuid vector 0x1 and 0x4 for topology
+ * detection.
+ */
+ c->x86_max_cores = intel_num_cpu_cores(c);
+#ifdef CONFIG_X86_32
+ detect_ht(c);
+#endif
+ }
+
l2 = init_intel_cacheinfo(c);
if (c->cpuid_level > 9) {
unsigned eax = cpuid_eax(10);
@@ -438,17 +449,6 @@ static void init_intel(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_P3);
#endif

- if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
- /*
- * let's use the legacy cpuid vector 0x1 and 0x4 for topology
- * detection.
- */
- c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
- detect_ht(c);
-#endif
- }
-
/* Work around errata */
srat_detect_node(c);

diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index a952e9c..9c8f739 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -730,6 +730,18 @@ unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c)
#endif
}

+#ifdef CONFIG_X86_HT
+ /*
+ * If cpu_llc_id is not yet set, this means cpuid_level < 4 which in
+ * turns means that the only possibility is SMT (as indicated in
+ * cpuid1). Since cpuid2 doesn't specify shared caches, and we know
+ * that SMT shares all caches, we can unconditionally set cpu_llc_id to
+ * c->phys_proc_id.
+ */
+ if (per_cpu(cpu_llc_id, cpu) == BAD_APICID)
+ per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
+#endif
+
c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));

return l2;

2014-07-24 01:46:45

by Bruno Wolff III

[permalink] [raw]
Subject: Re: Scheduler regression from caffcdd8d27ba78730d5540396ce72ad022aff2c

On Wed, Jul 23, 2014 at 17:11:40 +0200,
Peter Zijlstra <[email protected]> wrote:
>
>OK, so that's become the below patch. I'll feed it to Ingo if that's OK
>with hpa.

I tested this patch on 3 machines and it continued to fix the one that
was broken and didn't seem to break anything on the two that weren't
broken.

Thanks for developing this patch so quickly.

Subject: [tip:sched/core] sched: Robustify topology setup

Commit-ID: 6ae72dff37596f736b795426306404f0793e4b1b
Gitweb: http://git.kernel.org/tip/6ae72dff37596f736b795426306404f0793e4b1b
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 22 Jul 2014 11:47:40 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 28 Jul 2014 10:04:13 +0200

sched: Robustify topology setup

We hard assume that higher topology levels are supersets of lower
levels.

Detect, warn and try to fix up when we encounter this being violated.

Tested-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Josh Boyer <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Bruno Wolff III <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 415ab02..2a36a74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6481,6 +6481,20 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
sched_domain_level_max = max(sched_domain_level_max, sd->level);
child->parent = sd;
sd->child = child;
+
+ if (!cpumask_subset(sched_domain_span(child),
+ sched_domain_span(sd))) {
+ pr_err("BUG: arch topology borken\n");
+#ifdef CONFIG_SCHED_DEBUG
+ pr_err(" the %s domain not a subset of the %s domain\n",
+ child->name, sd->name);
+#endif
+ /* Fixup, ensure @sd has at least @child cpus. */
+ cpumask_or(sched_domain_span(sd),
+ sched_domain_span(sd),
+ sched_domain_span(child));
+ }
+
}
set_domain_attribute(sd, attr);