2012-05-09 08:40:05

by Igor Mammedov

[permalink] [raw]
Subject: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

if we have one cpu that failed to boot and boot cpu gave up on waiting for it
and then another cpu is being booted, kernel might crash with following OOPS:

[ 723.865765] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 723.866616] IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
[ 723.866616] PGD 7ba91067 PUD 7a205067 PMD 0
[ 723.866616] Oops: 0000 [#1] SMP
[ 723.898527] CPU 1
...
[ 723.898527] Pid: 1221, comm: offV2.sh Tainted: G W 3.4.0-rc4+ #213 Red Hat KVM
[ 723.898527] RIP: 0010:[<ffffffff812c3630>] [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
[ 723.898527] RSP: 0018:ffff88007ab9dc18 EFLAGS: 00010246
[ 723.898527] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 0000000000000000
[ 723.898527] RDX: 0000000000000018 RSI: 0000000000000100 RDI: 0000000000000018
[ 723.898527] RBP: ffff88007ab9dc18 R08: 0000000000000000 R09: 0000000000000020
[ 723.898527] R10: 0000000000000004 R11: 0000000000000000 R12: ffff88007c06ed60
[ 723.898527] R13: ffff880037a94000 R14: 0000000000000003 R15: ffff88007c06ed60
[ 723.898527] FS: 00007f1d6a7d8700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
[ 723.898527] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 723.898527] CR2: 0000000000000018 CR3: 000000007bb7f000 CR4: 00000000000007e0
[ 723.898527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 723.898527] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 723.898527] Process offV2.sh (pid: 1221, threadinfo ffff88007ab9c000, task ffff88007b358000)
[ 723.898527] Stack:
[ 723.898527] ffff88007ab9dcc8 ffffffff8108b9b6 ffff88007ab9dc58 ffff88007b4f2a00
[ 723.898527] ffff88007c06ed60 0000000000000003 000000037ab9dc58 0000000000010008
[ 723.898527] ffffffff81a308e8 0000000000000003 ffff88007b489cc0 ffff880037b6bd20
[ 723.898527] Call Trace:
[ 723.898527] [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50
[ 723.898527] [<ffffffff8108bea9>] partition_sched_domains+0x259/0x3f0
[ 723.898527] [<ffffffff810c4485>] cpuset_update_active_cpus+0x85/0x90
[ 723.898527] [<ffffffff81084f65>] cpuset_cpu_active+0x25/0x30
[ 723.898527] [<ffffffff81545b45>] notifier_call_chain+0x55/0x80
[ 723.898527] [<ffffffff8107e59e>] __raw_notifier_call_chain+0xe/0x10
[ 723.898527] [<ffffffff81058be0>] __cpu_notify+0x20/0x40
[ 723.898527] [<ffffffff8153af08>] _cpu_up+0xc7/0x10e
[ 723.898527] [<ffffffff8153af9b>] cpu_up+0x4c/0x5c

crash happens in init_sched_groups_power() that expects sched_groups to be
circular linked list. However it is not always true, since sched_groups
preallocated in __sdt_alloc are initialized in build_sched_groups and it
may exit early

if (cpu != cpumask_first(sched_domain_span(sd)))
return 0;

without initializing sd->groups->next field.

Fix bug by initializing next field right after sched_group was allocated.

Signed-off-by: Igor Mammedov <[email protected]>
---
kernel/sched/core.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a68..e5212ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6382,6 +6382,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sg)
return -ENOMEM;

+ sg->next = sg;
+
*per_cpu_ptr(sdd->sg, j) = sg;

sgp = kzalloc_node(sizeof(struct sched_group_power),
--
1.7.1


2012-05-09 10:21:57

by Jiang Liu

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

Hi Igor,
Thanks for fixing this bug! We encountered the same issue with an
IA64 systems too. That system could boot with 2.6.32, but can't boot with
any 3.x.x kernels. We have just found the root cause today.
--gerry

On 05/09/2012 06:38 PM, Igor Mammedov wrote:
> if we have one cpu that failed to boot and boot cpu gave up on waiting for it
> and then another cpu is being booted, kernel might crash with following OOPS:
>
> [ 723.865765] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [ 723.866616] IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
> [ 723.866616] PGD 7ba91067 PUD 7a205067 PMD 0
> [ 723.866616] Oops: 0000 [#1] SMP
> [ 723.898527] CPU 1
> ...
> [ 723.898527] Pid: 1221, comm: offV2.sh Tainted: G W 3.4.0-rc4+ #213 Red Hat KVM
> [ 723.898527] RIP: 0010:[<ffffffff812c3630>] [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
> [ 723.898527] RSP: 0018:ffff88007ab9dc18 EFLAGS: 00010246
> [ 723.898527] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 0000000000000000
> [ 723.898527] RDX: 0000000000000018 RSI: 0000000000000100 RDI: 0000000000000018
> [ 723.898527] RBP: ffff88007ab9dc18 R08: 0000000000000000 R09: 0000000000000020
> [ 723.898527] R10: 0000000000000004 R11: 0000000000000000 R12: ffff88007c06ed60
> [ 723.898527] R13: ffff880037a94000 R14: 0000000000000003 R15: ffff88007c06ed60
> [ 723.898527] FS: 00007f1d6a7d8700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
> [ 723.898527] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 723.898527] CR2: 0000000000000018 CR3: 000000007bb7f000 CR4: 00000000000007e0
> [ 723.898527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 723.898527] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 723.898527] Process offV2.sh (pid: 1221, threadinfo ffff88007ab9c000, task ffff88007b358000)
> [ 723.898527] Stack:
> [ 723.898527] ffff88007ab9dcc8 ffffffff8108b9b6 ffff88007ab9dc58 ffff88007b4f2a00
> [ 723.898527] ffff88007c06ed60 0000000000000003 000000037ab9dc58 0000000000010008
> [ 723.898527] ffffffff81a308e8 0000000000000003 ffff88007b489cc0 ffff880037b6bd20
> [ 723.898527] Call Trace:
> [ 723.898527] [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50
> [ 723.898527] [<ffffffff8108bea9>] partition_sched_domains+0x259/0x3f0
> [ 723.898527] [<ffffffff810c4485>] cpuset_update_active_cpus+0x85/0x90
> [ 723.898527] [<ffffffff81084f65>] cpuset_cpu_active+0x25/0x30
> [ 723.898527] [<ffffffff81545b45>] notifier_call_chain+0x55/0x80
> [ 723.898527] [<ffffffff8107e59e>] __raw_notifier_call_chain+0xe/0x10
> [ 723.898527] [<ffffffff81058be0>] __cpu_notify+0x20/0x40
> [ 723.898527] [<ffffffff8153af08>] _cpu_up+0xc7/0x10e
> [ 723.898527] [<ffffffff8153af9b>] cpu_up+0x4c/0x5c
>
> crash happens in init_sched_groups_power() that expects sched_groups to be
> circular linked list. However it is not always true, since sched_groups
> preallocated in __sdt_alloc are initialized in build_sched_groups and it
> may exit early
>
> if (cpu != cpumask_first(sched_domain_span(sd)))
> return 0;
>
> without initializing sd->groups->next field.
>
> Fix bug by initializing next field right after sched_group was allocated.
>
> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> kernel/sched/core.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0533a68..e5212ae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6382,6 +6382,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> if (!sg)
> return -ENOMEM;
>
> + sg->next = sg;
> +
> *per_cpu_ptr(sdd->sg, j) = sg;
>
> sgp = kzalloc_node(sizeof(struct sched_group_power),

2012-05-09 10:35:46

by Igor Mammedov

[permalink] [raw]
Subject: [tip:sched/urgent] sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption

Commit-ID: 30b4e9eb783d94e9f5d503b15eb31720679ae1c7
Gitweb: http://git.kernel.org/tip/30b4e9eb783d94e9f5d503b15eb31720679ae1c7
Author: Igor Mammedov <[email protected]>
AuthorDate: Wed, 9 May 2012 12:38:28 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 May 2012 12:27:35 +0200

sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption

If we have one cpu that failed to boot and boot cpu gave up on
waiting for it and then another cpu is being booted, kernel
might crash with following OOPS:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
Call Trace:
[<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50

The crash happens in init_sched_groups_power() that expects
sched_groups to be circular linked list. However it is not
always true, since sched_groups preallocated in __sdt_alloc are
initialized in build_sched_groups and it may exit early

if (cpu != cpumask_first(sched_domain_span(sd)))
return 0;

without initializing sd->groups->next field.

Fix bug by initializing next field right after sched_group was
allocated.

Also-Reported-by: Jiang Liu <[email protected]>
Signed-off-by: Igor Mammedov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a68..e5212ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6382,6 +6382,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sg)
return -ENOMEM;

+ sg->next = sg;
+
*per_cpu_ptr(sdd->sg, j) = sg;

sgp = kzalloc_node(sizeof(struct sched_group_power),

2012-05-09 11:41:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Wed, 2012-05-09 at 12:38 +0200, Igor Mammedov wrote:
> init_sched_groups_power() that expects sched_groups to be
> circular linked list. However it is not always true, since sched_groups
> preallocated in __sdt_alloc are initialized in build_sched_groups and it
> may exit early
>
> if (cpu != cpumask_first(sched_domain_span(sd)))
> return 0;
>
> without initializing sd->groups->next field.

The only way I can see that happen is if the arch code is lying to us.
We build the sched_domain_span() like:

cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));

and the above != first_cpumask() can only happen if the topology mask
provided by the architecture includes a cpu that isn't actually there.

(and equally, how did it get in the active_mask if its not there).

Jiang, how did your IA64 arrive in this state?

> Fix bug by initializing next field right after sched_group was allocated.

I'd not call it a bug, the bug is the arch being broken, its a
robustification of the code to better handle broken input.

Ideally we'd also add a WARN someplace to notify us of this situation.

Still, nice catch..

Acked-by: Peter Zijlstra <[email protected]>

> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> kernel/sched/core.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0533a68..e5212ae 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6382,6 +6382,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
> if (!sg)
> return -ENOMEM;
>
> + sg->next = sg;
> +
> *per_cpu_ptr(sdd->sg, j) = sg;
>
> sgp = kzalloc_node(sizeof(struct sched_group_power),

2012-05-09 11:44:31

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On 05/09/2012 12:21 PM, Jiang Liu wrote:
> Hi Igor,
> Thanks for fixing this bug! We encountered the same issue with an
> IA64 systems too. That system could boot with 2.6.32, but can't boot with
> any 3.x.x kernels. We have just found the root cause today.
> --gerry

This patch fixes only build_sched_groups path, but there is another fail path
that results in below OOPS.
build_overlap_sched_groups() may exit without setting groups and later it will crash
init_sched_groups_power as well. But I just don't know how to fix it, so I've just
posted partial fix that reduces crash frequency. And I have to admit that
cpu_active_mask and siblings map are busted but we either should not exit from builder
funcs with NULL group or BUG there if it is impossible to come-up with sane group
for insane domain span.


>
> On 05/09/2012 06:38 PM, Igor Mammedov wrote:
>> if we have one cpu that failed to boot and boot cpu gave up on waiting for it
>> and then another cpu is being booted, kernel might crash with following OOPS:
>>
>> [ 723.865765] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>> [ 723.866616] IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
>> [ 723.866616] PGD 7ba91067 PUD 7a205067 PMD 0
>> [ 723.866616] Oops: 0000 [#1] SMP
>> [ 723.898527] CPU 1
>> ...
>> [ 723.898527] Pid: 1221, comm: offV2.sh Tainted: G W 3.4.0-rc4+ #213 Red Hat KVM
>> [ 723.898527] RIP: 0010:[<ffffffff812c3630>] [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
>> [ 723.898527] RSP: 0018:ffff88007ab9dc18 EFLAGS: 00010246
>> [ 723.898527] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 0000000000000000
>> [ 723.898527] RDX: 0000000000000018 RSI: 0000000000000100 RDI: 0000000000000018
>> [ 723.898527] RBP: ffff88007ab9dc18 R08: 0000000000000000 R09: 0000000000000020
>> [ 723.898527] R10: 0000000000000004 R11: 0000000000000000 R12: ffff88007c06ed60
>> [ 723.898527] R13: ffff880037a94000 R14: 0000000000000003 R15: ffff88007c06ed60
>> [ 723.898527] FS: 00007f1d6a7d8700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
>> [ 723.898527] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 723.898527] CR2: 0000000000000018 CR3: 000000007bb7f000 CR4: 00000000000007e0
>> [ 723.898527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 723.898527] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 723.898527] Process offV2.sh (pid: 1221, threadinfo ffff88007ab9c000, task ffff88007b358000)
>> [ 723.898527] Stack:
>> [ 723.898527] ffff88007ab9dcc8 ffffffff8108b9b6 ffff88007ab9dc58 ffff88007b4f2a00
>> [ 723.898527] ffff88007c06ed60 0000000000000003 000000037ab9dc58 0000000000010008
>> [ 723.898527] ffffffff81a308e8 0000000000000003 ffff88007b489cc0 ffff880037b6bd20
>> [ 723.898527] Call Trace:
>> [ 723.898527] [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50
>> [ 723.898527] [<ffffffff8108bea9>] partition_sched_domains+0x259/0x3f0
>> [ 723.898527] [<ffffffff810c4485>] cpuset_update_active_cpus+0x85/0x90
>> [ 723.898527] [<ffffffff81084f65>] cpuset_cpu_active+0x25/0x30
>> [ 723.898527] [<ffffffff81545b45>] notifier_call_chain+0x55/0x80
>> [ 723.898527] [<ffffffff8107e59e>] __raw_notifier_call_chain+0xe/0x10
>> [ 723.898527] [<ffffffff81058be0>] __cpu_notify+0x20/0x40
>> [ 723.898527] [<ffffffff8153af08>] _cpu_up+0xc7/0x10e
>> [ 723.898527] [<ffffffff8153af9b>] cpu_up+0x4c/0x5c
>>
>> crash happens in init_sched_groups_power() that expects sched_groups to be
>> circular linked list. However it is not always true, since sched_groups
>> preallocated in __sdt_alloc are initialized in build_sched_groups and it
>> may exit early
>>
>> if (cpu != cpumask_first(sched_domain_span(sd)))
>> return 0;
>>
>> without initializing sd->groups->next field.
>>
>> Fix bug by initializing next field right after sched_group was allocated.
>>
>> Signed-off-by: Igor Mammedov<[email protected]>
>> ---
>> kernel/sched/core.c | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0533a68..e5212ae 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6382,6 +6382,8 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
>> if (!sg)
>> return -ENOMEM;
>>
>> + sg->next = sg;
>> +
>> *per_cpu_ptr(sdd->sg, j) = sg;
>>
>> sgp = kzalloc_node(sizeof(struct sched_group_power),
>

--
-----
Igor

2012-05-09 11:52:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Wed, 2012-05-09 at 13:44 +0200, Igor Mammedov wrote:
> This patch fixes only build_sched_groups path, but there is another fail path
> that results in below OOPS.
> build_overlap_sched_groups() may exit without setting groups and later it will crash
> init_sched_groups_power as well.

if that allocation fails? Or is there another fail path?

> But I just don't know how to fix it, so I've just
> posted partial fix that reduces crash frequency.


> And I have to admit that
> cpu_active_mask and siblings map are busted but we either should not exit from builder
> funcs with NULL group or BUG there if it is impossible to come-up with sane group
> for insane domain span.

I'm perfectly OK with taking the machine down, provided we can output
useful messages as to what is broken first..

2012-05-09 11:59:07

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On 05/09/2012 01:52 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-09 at 13:44 +0200, Igor Mammedov wrote:
>> This patch fixes only build_sched_groups path, but there is another fail path
>> that results in below OOPS.
>> build_overlap_sched_groups() may exit without setting groups and later it will crash
>> init_sched_groups_power as well.
>
> if that allocation fails? Or is there another fail path?

build_overlap_sched_groups(struct sched_domain *sd, int cpu)
...
if (cpumask_test_cpu(cpu, sg_span))
groups = sg;
...

above test fails and leaves local var groups set to NULL
and before exit there is:

sd->groups = groups;

which resets sd->groups to NULL and I'm not sure if it is correct at all to skip this
assignment if groups == NULL.

>
>> But I just don't know how to fix it, so I've just
>> posted partial fix that reduces crash frequency.
>
>
>> And I have to admit that
>> cpu_active_mask and siblings map are busted but we either should not exit from builder
>> funcs with NULL group or BUG there if it is impossible to come-up with sane group
>> for insane domain span.
>
> I'm perfectly OK with taking the machine down, provided we can output
> useful messages as to what is broken first..

--
-----
Igor

2012-05-09 12:21:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Wed, 2012-05-09 at 13:58 +0200, Igor Mammedov wrote:
> On 05/09/2012 01:52 PM, Peter Zijlstra wrote:
> > On Wed, 2012-05-09 at 13:44 +0200, Igor Mammedov wrote:
> >> This patch fixes only build_sched_groups path, but there is another fail path
> >> that results in below OOPS.
> >> build_overlap_sched_groups() may exit without setting groups and later it will crash
> >> init_sched_groups_power as well.
> >
> > if that allocation fails? Or is there another fail path?
>
> build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> ...
> if (cpumask_test_cpu(cpu, sg_span))
> groups = sg;
> ...
>
> above test fails and leaves local var groups set to NULL
> and before exit there is:
>
> sd->groups = groups;
>
> which resets sd->groups to NULL

Cute! So we're building groups for @cpu, for a domain on the same @cpu,
but none of the groups actually span this @cpu. This would imply the
domain doesn't actually contain @cpu.

> and I'm not sure if it is correct at all to skip this
> assignment if groups == NULL.

It would avoid exploding, but nothing in the above situation is anywhere
near correct.

Does something like the below give any clues as to how we got there?

---
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1043,6 +1043,15 @@ struct sched_domain {
unsigned long span[0];
};

+static inline char *sched_domain_name(struct sched_domain *sd)
+{
+#ifdef CONFIG_SCHED_DEBUG
+ return sd->name;
+#else
+ return "";
+#endif
+}
+
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
{
return to_cpumask(sd->span);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5993,6 +5993,22 @@ build_overlap_sched_groups(struct sched_
last = sg;
last->next = first;
}
+ if (!groups) {
+ char str[256];
+
+ printk(KERN_ERR "sched: Topology is hosed for CPU-%d!!\n", cpu);
+ cpulist_scnprintf(str, sizeof(str), sched_domain_span(sd));
+ printk(KERN_ERR "sched: domain: %s %s\n", sched_domain_name(sd), str);
+
+ sg = first;
+ if (sg) do {
+ cpulist_scnprintf(str, sizeof(str), sched_group_cpus(sg));
+ printk(KERN_ERR "sched: group: %s\n", str);
+ sg = sg->next;
+ } while (sg != first);
+
+ BUG();
+ }
sd->groups = groups;

return 0;

2012-05-09 12:23:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Wed, 2012-05-09 at 14:21 +0200, Peter Zijlstra wrote:
> Does something like the below give any clues as to how we got there?

I'm assuming you've got means of reproducing this?

2012-05-09 12:30:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Wed, 2012-05-09 at 14:21 +0200, Peter Zijlstra wrote:
> Does something like the below give any clues as to how we got there?

New version that checks we include the right cpu in build_sched_domain()
too.. on a related note, we should add a printk-%p modifier for cpulist,
this cpulist_scnprintf() stuff gets annoying.

---
include/linux/sched.h | 9 +++++++++
kernel/sched/core.c | 26 ++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 714fa99..416aad9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,15 @@ struct sched_domain {
unsigned long span[0];
};

+static inline char *sched_domain_name(struct sched_domain *sd)
+{
+#ifdef CONFIG_SCHED_DEBUG
+ return sd->name;
+#else
+ return "";
+#endif
+}
+
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
{
return to_cpumask(sd->span);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b77e3b..2884134 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6072,6 +6072,22 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
last = sg;
last->next = first;
}
+ if (!groups) {
+ char str[256];
+
+ printk(KERN_ERR "sched: Topology is hosed for CPU-%d!!\n", cpu);
+ cpulist_scnprintf(str, sizeof(str), sched_domain_span(sd));
+ printk(KERN_ERR "sched: domain: %s %s\n", sched_domain_name(sd), str);
+
+ sg = first;
+ if (sg) do {
+ cpulist_scnprintf(str, sizeof(str), sched_group_cpus(sg));
+ printk(KERN_ERR "sched: group: %s\n", str);
+ sg = sg->next;
+ } while (sg != first);
+
+ BUG();
+ }
sd->groups = groups;

return 0;
@@ -6440,6 +6456,16 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,

set_domain_attribute(sd, attr);
cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
+ if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+ char str[256];
+ cpulist_scnprintf(str, sizeof(str), cpu_map);
+ printk(KERN_ERR "sched: Bonkers domain doesn't include its own cpu: %d %s\n",
+ cpu, str);
+ cpulist_scnprintf(str, sizeof(str), tl->mask(cpu));
+ printk(KERN_ERR "sched: Bonkers domain doesn't include its own cpu: %d %s\n",
+ cpu, str);
+ WARN(1);
+ }
if (child) {
sd->level = child->level + 1;
sched_domain_level_max = max(sched_domain_level_max, sd->level);

2012-05-09 12:36:13

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On 05/09/2012 02:22 PM, Peter Zijlstra wrote:
> On Wed, 2012-05-09 at 14:21 +0200, Peter Zijlstra wrote:
>> Does something like the below give any clues as to how we got there?
>
> I'm assuming you've got means of reproducing this?

Yep, it reproduces several times a day. I'll retest with your patch.

--
-----
Igor

2012-05-09 13:28:01

by Peter Zijlstra

[permalink] [raw]
Subject: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 14:30 +0200, Peter Zijlstra wrote:
> on a related note, we should add a printk-%p modifier for cpulist,
> this cpulist_scnprintf() stuff gets annoying.

Something like so I guess.. I'll try and convert some of the sched_debug
code to test it..

---
lib/vsprintf.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..90bfcd1 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -25,6 +25,9 @@
#include <linux/kallsyms.h>
#include <linux/uaccess.h>
#include <linux/ioport.h>
+#include <linux/bitmap.h>
+#include <linux/cpumask.h>
+#include <linux/numa.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -857,15 +860,19 @@ int kptr_restrict __read_mostly;
* correctness of the format string and va_list arguments.
* - 'K' For a kernel pointer that should be hidden from unprivileged users
* - 'NF' For a netdev_features_t
+ * - 'b' For a bitmap, consumes 2 args, second is int
+ * - 'bc' For a cpumask
+ * - 'bn' For a nodemask
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
* pointer to the real address.
*/
-static noinline_for_stack
-char *pointer(const char *fmt, char *buf, char *end, void *ptr,
- struct printf_spec spec)
+static noinline_for_stack
+char *pointer(const char *fmt, char *buf, char *end, va_list args, struct printf_spec spec)
{
+ void *ptr = va_arg(args, void *);
+
if (!ptr && *fmt != 'K') {
/*
* Print (null) with the same width as a pointer so it makes
@@ -941,6 +948,26 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
return netdev_feature_string(buf, end, ptr, spec);
}
break;
+ case 'b':
+ {
+ int bits, len;
+
+ switch (fmt[1]) {
+ case 'c':
+ bits = nr_cpumask_bits;
+ break;
+ case 'n':
+ bits = MAX_NUMNODES;
+ break;
+ default:
+ bits = va_arg(args, int);
+ break;
+ }
+
+ len = bitmap_scnlistprintf(buf, end - buf, ptr, bits);
+
+ return buf + len;
+ }
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1261,8 +1288,7 @@ int vsnprintf(char *buf, size_t size, const char *fmt, va_list args)
break;

case FORMAT_TYPE_PTR:
- str = pointer(fmt+1, str, end, va_arg(args, void *),
- spec);
+ str = pointer(fmt+1, str, end, args, spec);
while (isalnum(*fmt))
fmt++;
break;

2012-05-09 13:30:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 15:27 +0200, Peter Zijlstra wrote:
> +char *pointer(const char *fmt, char *buf, char *end, va_list args,
> struct printf_spec spec)

I'd better make that a va_list *, otherwise the iteration gets lost.

2012-05-09 13:36:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> @@ -857,15 +860,19 @@ int kptr_restrict __read_mostly;
> * correctness of the format string and va_list arguments.
> * - 'K' For a kernel pointer that should be hidden from unprivileged users
> * - 'NF' For a netdev_features_t
> + * - 'b' For a bitmap, consumes 2 args, second is int

hm, won't the second arg confuse gcc's printf format checker?

> + case 'b':
> + {
> + int bits, len;
> +
> + switch (fmt[1]) {
> + case 'c':
> + bits = nr_cpumask_bits;
> + break;
> + case 'n':
> + bits = MAX_NUMNODES;
> + break;
> + default:
> + bits = va_arg(args, int);
> + break;
> + }
> +
> + len = bitmap_scnlistprintf(buf, end - buf, ptr, bits);

also, if the second argument is not provided, could va_arg()
return noise or even NULL?

Thanks,

Ingo

2012-05-09 13:44:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 15:36 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > @@ -857,15 +860,19 @@ int kptr_restrict __read_mostly;
> > * correctness of the format string and va_list arguments.
> > * - 'K' For a kernel pointer that should be hidden from unprivileged users
> > * - 'NF' For a netdev_features_t
> > + * - 'b' For a bitmap, consumes 2 args, second is int
>
> hm, won't the second arg confuse gcc's printf format checker?


Ah, yes, I suppose I could abuse something like %*pb. Let me try that.

2012-05-09 13:59:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 15:44 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-09 at 15:36 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > @@ -857,15 +860,19 @@ int kptr_restrict __read_mostly;
> > > * correctness of the format string and va_list arguments.
> > > * - 'K' For a kernel pointer that should be hidden from unprivileged users
> > > * - 'NF' For a netdev_features_t
> > > + * - 'b' For a bitmap, consumes 2 args, second is int
> >
> > hm, won't the second arg confuse gcc's printf format checker?
>
>
> Ah, yes, I suppose I could abuse something like %*pb. Let me try that.

I guess I should use %.*pb and keep the field_width in case someone
manages to actually make bitmap_scnlistprintf() conform to it. The
precision is unused anyway.

The current implementation limits both (field_width and precision) to
s16, which will limit us to printing 32Kb bitmaps. If need arises we
could increase printf_spec size I guess.

---
lib/vsprintf.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..6eb30a9 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -25,6 +25,9 @@
#include <linux/kallsyms.h>
#include <linux/uaccess.h>
#include <linux/ioport.h>
+#include <linux/bitmap.h>
+#include <linux/cpumask.h>
+#include <linux/numa.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -857,6 +860,9 @@ int kptr_restrict __read_mostly;
* correctness of the format string and va_list arguments.
* - 'K' For a kernel pointer that should be hidden from unprivileged users
* - 'NF' For a netdev_features_t
+ * - 'b' For a bitmap; %.*pb is required and the precision is used as bitmap length
+ * - 'bc' For a cpumask
+ * - 'bn' For a nodemask
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -941,6 +947,26 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
return netdev_feature_string(buf, end, ptr, spec);
}
break;
+ case 'b':
+ {
+ int bits, len;
+
+ switch (fmt[1]) {
+ case 'c':
+ bits = nr_cpumask_bits;
+ break;
+ case 'n':
+ bits = MAX_NUMNODES;
+ break;
+ default:
+ bits = spec->precision;
+ break;
+ }
+
+ len = bitmap_scnlistprintf(buf, end - buf, ptr, bits);
+
+ return buf + len;
+ }
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1175,6 +1201,9 @@ int format_decode(const char *fmt, struct printf_spec *spec)
* %pI6c print an IPv6 address as specified by RFC 5952
* %pU[bBlL] print a UUID/GUID in big or little endian using lower or upper
* case.
+ * %.*pb print a bitmap list using the precision as bitmap length
+ * %pbc print a cpumask bitmap
+ * %pbn print a nodemask bitmap
* %n is ignored
*
* The return value is the number of characters which would

2012-05-09 14:15:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2012-05-09 at 15:44 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-05-09 at 15:36 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > @@ -857,15 +860,19 @@ int kptr_restrict __read_mostly;
> > > > * correctness of the format string and va_list arguments.
> > > > * - 'K' For a kernel pointer that should be hidden from unprivileged users
> > > > * - 'NF' For a netdev_features_t
> > > > + * - 'b' For a bitmap, consumes 2 args, second is int
> > >
> > > hm, won't the second arg confuse gcc's printf format checker?
> >
> >
> > Ah, yes, I suppose I could abuse something like %*pb. Let me try that.
>
> I guess I should use %.*pb and keep the field_width in case someone
> manages to actually make bitmap_scnlistprintf() conform to it. The
> precision is unused anyway.

That's a cute trick, and it's intuitive as well.

> + case 'b':
> + {
> + int bits, len;
> +
> + switch (fmt[1]) {
> + case 'c':
> + bits = nr_cpumask_bits;
> + break;
> + case 'n':
> + bits = MAX_NUMNODES;
> + break;
> + default:
> + bits = spec->precision;
> + break;

So, if someone specifies an incomplete "%pb" format - fmt[1]
will be 0 and we take precision as the length - presumably also
0. We stick that 0 into:

> + len = bitmap_scnlistprintf(buf, end - buf, ptr, bits);

Will that work?

Provided it all tests out fine for you it looks good to me.

I'd probably write the switch statement as:

switch (fmt[1]) {
case 'c': bits = nr_cpumask_bits; break;
case 'n': bits = MAX_NUMNODES; break;
default: bits = spec->precision; break;
}

... but that's a small detail and a matter of taste in any case.

Thanks,

Ingo

2012-05-09 14:19:50

by Joe Perches

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 15:59 +0200, Peter Zijlstra wrote:
> The current implementation limits both (field_width and precision) to
> s16, which will limit us to printing 32Kb bitmaps. If need arises we
> could increase printf_spec size I guess.

Not sure that's a problem really.
A single printk is limited to 1024 bytes.

> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
[]
> @@ -941,6 +947,26 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
> return netdev_feature_string(buf, end, ptr, spec);
> }
> break;
> + case 'b':
> + {
> + int bits, len;

trivia: a tab indent could be avoided by using
case b: {

2012-05-09 14:24:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> So, if someone specifies an incomplete "%pb" format - fmt[1]
> will be 0 and we take precision as the length - presumably also
> 0.

yes, printf_spec is initialized to 0s.

> We stick that 0 into:
>
> > + len = bitmap_scnlistprintf(buf, end - buf, ptr, bits);
>
> Will that work?

Yes, bitmap_scnlistprintf() will only set buf[0] = 0 (which will be
overwritten by a possible next printf token) and return 0 length not
advancing the ptr.

2012-05-09 15:33:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> > I guess I should use %.*pb and keep the field_width in case someone
> > manages to actually make bitmap_scnlistprintf() conform to it. The
> > precision is unused anyway.
>
> That's a cute trick, and it's intuitive as well.

kernel/sched/core.c:5570:3: warning: precision used with ‘%p’ gnu_printf format [-Wformat]

/me curses a bit.. anybody?


The below compiles and works (with the above caveat) and has all the
style muck people 'wanted'.

---
lib/vsprintf.c | 46 ++++++++++++++++++++++++++++++++--------------
1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..6bd9a66 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -25,6 +25,9 @@
#include <linux/kallsyms.h>
#include <linux/uaccess.h>
#include <linux/ioport.h>
+#include <linux/bitmap.h>
+#include <linux/cpumask.h>
+#include <linux/numa.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -857,6 +860,9 @@ int kptr_restrict __read_mostly;
* correctness of the format string and va_list arguments.
* - 'K' For a kernel pointer that should be hidden from unprivileged users
* - 'NF' For a netdev_features_t
+ * - 'b' For a bitmap; %.*pb is required and the precision is used as bitmap length
+ * - 'bc' For a cpumask
+ * - 'bn' For a nodemask
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -902,24 +908,21 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
* 6: 000102...0f
*/
switch (fmt[1]) {
- case '6':
- return ip6_addr_string(buf, end, ptr, spec, fmt);
- case '4':
- return ip4_addr_string(buf, end, ptr, spec, fmt);
+ case '6': return ip6_addr_string(buf, end, ptr, spec, fmt);
+ case '4': return ip4_addr_string(buf, end, ptr, spec, fmt);
}
break;
case 'U':
return uuid_string(buf, end, ptr, spec, fmt);
- case 'V':
- {
- va_list va;
-
- va_copy(va, *((struct va_format *)ptr)->va);
- buf += vsnprintf(buf, end > buf ? end - buf : 0,
- ((struct va_format *)ptr)->fmt, va);
- va_end(va);
- return buf;
- }
+ case 'V': {
+ va_list va;
+
+ va_copy(va, *((struct va_format *)ptr)->va);
+ buf += vsnprintf(buf, end > buf ? end - buf : 0,
+ ((struct va_format *)ptr)->fmt, va);
+ va_end(va);
+ return buf;
+ }
case 'K':
/*
* %pK cannot be used in IRQ context because its test
@@ -941,6 +944,18 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
return netdev_feature_string(buf, end, ptr, spec);
}
break;
+ case 'b': {
+ int bits;
+
+ switch (fmt[1]) {
+ case 'c': bits = nr_cpumask_bits; break;
+ case 'n': bits = MAX_NUMNODES; break;
+ default: bits = spec.precision; break;
+ }
+
+ return buf + bitmap_scnlistprintf(buf, end - buf, ptr, bits);
+ }
+ default: break;
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1175,6 +1190,9 @@ int format_decode(const char *fmt, struct printf_spec *spec)
* %pI6c print an IPv6 address as specified by RFC 5952
* %pU[bBlL] print a UUID/GUID in big or little endian using lower or upper
* case.
+ * %.*pb print a bitmap list using the precision as bitmap length
+ * %pbc print a cpumask bitmap
+ * %pbn print a nodemask bitmap
* %n is ignored
*
* The return value is the number of characters which would

2012-05-09 15:34:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Joe Perches <[email protected]> wrote:

> On Wed, 2012-05-09 at 15:59 +0200, Peter Zijlstra wrote:
> > The current implementation limits both (field_width and precision) to
> > s16, which will limit us to printing 32Kb bitmaps. If need arises we
> > could increase printf_spec size I guess.
>
> Not sure that's a problem really.
> A single printk is limited to 1024 bytes.
>
> > diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> []
> > @@ -941,6 +947,26 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
> > return netdev_feature_string(buf, end, ptr, spec);
> > }
> > break;
> > + case 'b':
> > + {
> > + int bits, len;
>
> trivia: a tab indent could be avoided by using
> case b: {

Nit: 'b'.

Thanks,

Ingo

2012-05-09 15:41:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> > > I guess I should use %.*pb and keep the field_width in case someone
> > > manages to actually make bitmap_scnlistprintf() conform to it. The
> > > precision is unused anyway.
> >
> > That's a cute trick, and it's intuitive as well.
>
> kernel/sched/core.c:5570:3: warning: precision used with ‘%p’ gnu_printf format [-Wformat]
>
> /me curses a bit.. anybody?

Yeah, that's a floating point precision thing (or for strings,
right-align) - could you use %30pb, etc?

Thanks,

Ingo

2012-05-09 16:07:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 17:41 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> > > > I guess I should use %.*pb and keep the field_width in case someone
> > > > manages to actually make bitmap_scnlistprintf() conform to it. The
> > > > precision is unused anyway.
> > >
> > > That's a cute trick, and it's intuitive as well.
> >
> > kernel/sched/core.c:5570:3: warning: precision used with ‘%p’ gnu_printf format [-Wformat]
> >
> > /me curses a bit.. anybody?
>
> Yeah, that's a floating point precision thing (or for strings,
> right-align) - could you use %30pb, etc?

%*pb you mean? Yeah that works, field width is allowed for %p. It does
mean we cannot ever make bitmaps respect field width though.

2012-05-09 16:39:32

by Joe Perches

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 18:06 +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-09 at 17:41 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> > > > > I guess I should use %.*pb and keep the field_width in case someone
> > > > > manages to actually make bitmap_scnlistprintf() conform to it. The
> > > > > precision is unused anyway.
> > > >
> > > > That's a cute trick, and it's intuitive as well.
> > >
> > > kernel/sched/core.c:5570:3: warning: precision used with ‘%p’ gnu_printf format [-Wformat]
> > >
> > > /me curses a bit.. anybody?
> >
> > Yeah, that's a floating point precision thing (or for strings,
> > right-align) - could you use %30pb, etc?
>
> %*pb you mean? Yeah that works, field width is allowed for %p. It does
> mean we cannot ever make bitmaps respect field width though.

You could also fill a new struct with info
about the bitmap and pass a pointer to that.

struct print_bitmap_info {
int foo;
int precision;
int width;
void *bitmap;
};
...
struct print_bitmap_info bi;
bi.foo = bar;
...
bi.bitmap = &bitmap;
...
printk("%pb", &bi);

2012-05-09 17:15:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, May 9, 2012 at 6:27 AM, Peter Zijlstra <[email protected]> wrote:
> + * - 'b' For a bitmap, consumes 2 args, second is int
> + * - 'bc' For a cpumask
> + * - 'bn' For a nodemask

Hard NAK.

No way. The "consumes 2 args" is fundamentally idiotic, since it
forces compiler warnings. The whole idea of %pXX was that you can give
it any pointer, because all that the compiler cares about is the "%p"
part, so random pointers to stuff won't break.

Your patch breaks the whole point of the extension.

The "bc" and "bn" would work, except for the fact that I doubt they
are printed out enough to matter.

A "%.*pb" is the only interface that can work for a "sized" bitmap
(with obviously fixed-length ones being possible with a "%.32bp" like
thing)

But the whole va_args games you play are not acceptable. %p *will*
continue to take a void *, and nothing else.

Linus

2012-05-09 17:22:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Joe Perches <[email protected]> wrote:

> On Wed, 2012-05-09 at 18:06 +0200, Peter Zijlstra wrote:
> > On Wed, 2012-05-09 at 17:41 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> > > > > > I guess I should use %.*pb and keep the field_width in case someone
> > > > > > manages to actually make bitmap_scnlistprintf() conform to it. The
> > > > > > precision is unused anyway.
> > > > >
> > > > > That's a cute trick, and it's intuitive as well.
> > > >
> > > > kernel/sched/core.c:5570:3: warning: precision used with ‘%p’ gnu_printf format [-Wformat]
> > > >
> > > > /me curses a bit.. anybody?
> > >
> > > Yeah, that's a floating point precision thing (or for strings,
> > > right-align) - could you use %30pb, etc?
> >
> > %*pb you mean? Yeah that works, field width is allowed for %p. It does
> > mean we cannot ever make bitmaps respect field width though.
>
> You could also fill a new struct with info
> about the bitmap and pass a pointer to that.
>
> struct print_bitmap_info {
> int foo;
> int precision;
> int width;
> void *bitmap;
> };
> ...
> struct print_bitmap_info bi;
> bi.foo = bar;
> ...
> bi.bitmap = &bitmap;
> ...
> printk("%pb", &bi);

That defeats the whole purpose of avoiding extra complexity and
preparation - just passing in the bitmap pointer is the most
natural, most intuitive type to use.

Thanks,

Ingo

2012-05-09 17:23:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 10:15 -0700, Linus Torvalds wrote:
>
> The "bc" and "bn" would work, except for the fact that I doubt they
> are printed out enough to matter.

There's a number %pbc candidates in the scheduler that would be nice to
not have to do manually.

> A "%.*pb" is the only interface that can work for a "sized" bitmap
> (with obviously fixed-length ones being possible with a "%.32bp" like
> thing)

Right, sadly that also generates a warning :/

> But the whole va_args games you play are not acceptable. %p *will*
> continue to take a void *, and nothing else.

The va_args() games are dead.

2012-05-09 17:24:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2012-05-09 at 17:41 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Wed, 2012-05-09 at 16:15 +0200, Ingo Molnar wrote:
> > > > > I guess I should use %.*pb and keep the field_width in case someone
> > > > > manages to actually make bitmap_scnlistprintf() conform to it. The
> > > > > precision is unused anyway.
> > > >
> > > > That's a cute trick, and it's intuitive as well.
> > >
> > > kernel/sched/core.c:5570:3: warning: precision used with ‘%p’ gnu_printf format [-Wformat]
> > >
> > > /me curses a bit.. anybody?
> >
> > Yeah, that's a floating point precision thing (or for strings,
> > right-align) - could you use %30pb, etc?
>
> %*pb you mean? Yeah that works, field width is allowed for %p.
> It does mean we cannot ever make bitmaps respect field width
> though.

I don't think that's a significant limitation, because it's
fixed width anyway. So if this works then this would be a pretty
good and simple to use solution.

Thanks,

Ingo

2012-05-09 17:25:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 19:24 +0200, Ingo Molnar wrote:
> I don't think that's a significant limitation, because it's
> fixed width anyway. So if this works then this would be a pretty
> good and simple to use solution.

Not quite following, there's nothing fixed width about
bitmap_scnlistprintf() output.

2012-05-09 17:26:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> > A "%.*pb" is the only interface that can work for a "sized"
> > bitmap (with obviously fixed-length ones being possible with
> > a "%.32bp" like thing)
>
> Right, sadly that also generates a warning :/

Why not use what I suggested, %*pb? It also doubles as a width
field in essence, as bitmap output is fundamentally fixed width.

Thanks,

Ingo

2012-05-09 17:30:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 19:26 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > > A "%.*pb" is the only interface that can work for a "sized"
> > > bitmap (with obviously fixed-length ones being possible with
> > > a "%.32bp" like thing)
> >
> > Right, sadly that also generates a warning :/
>
> Why not use what I suggested, %*pb? It also doubles as a width
> field in essence, as bitmap output is fundamentally fixed width.

The bitmap list output uses RLE and generates things like:

6-11,18-23

Its the stuff used for all the sched_domain output etc..


2012-05-09 17:31:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2012-05-09 at 19:24 +0200, Ingo Molnar wrote:
>
> > I don't think that's a significant limitation, because it's
> > fixed width anyway. So if this works then this would be a
> > pretty good and simple to use solution.
>
> Not quite following, there's nothing fixed width about
> bitmap_scnlistprintf() output.

There's a maximum field length possible, and we should/could use
that as padding.

Say for 8 bits, the worst-case is something like:

1-2,4-5,7-8

right? The max is roughly len*3/2.

I see a problem there: for something larger like 256 it would
lead to quite some padding ...

Well, we could just skip doing width for now - the ability to
easily print bitmaps is very valuable in itself.

Thanks,

Ingo

2012-05-09 19:07:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 09 May 2012 15:27:23 +0200
Peter Zijlstra <[email protected]> wrote:

> + * - 'b' For a bitmap, consumes 2 args, second is int

There's a pretty widespread convention that %b is used to print in
binary: printf("%b", 10) -> 1010. Regrettably it isn't part of
standard C, but it's often supported anyway. Perl's printf does this
as well. Perl uses %B also.

If we later want to add binary printing to kernel print(), we'd presumably
use %pb, to match this precedent.

So perhaps it would be better to avoid consuming `b' for bitmaps?

2012-05-09 20:58:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps

On Wed, 2012-05-09 at 12:07 -0700, Andrew Morton wrote:
> On Wed, 09 May 2012 15:27:23 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > + * - 'b' For a bitmap, consumes 2 args, second is int
>
> There's a pretty widespread convention that %b is used to print in
> binary: printf("%b", 10) -> 1010. Regrettably it isn't part of
> standard C, but it's often supported anyway. Perl's printf does this
> as well. Perl uses %B also.
>
> If we later want to add binary printing to kernel print(), we'd presumably
> use %pb, to match this precedent.

I'd expect something like %ub, %p would take a pointer to a value.

Too bad both %b and %B generate a warning, it would be rather trivial to
add base 2 number stuff.

> So perhaps it would be better to avoid consuming `b' for bitmaps?

With %.*pb not actually working I could use %pC to mean a cpumask and
skip the generic bitmap stuff.

2012-05-10 07:45:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH] printk: Add %pb to print bitmaps


* Peter Zijlstra <[email protected]> wrote:

> > So perhaps it would be better to avoid consuming `b' for
> > bitmaps?
>
> With %.*pb not actually working I could use %pC to mean a
> cpumask and skip the generic bitmap stuff.

I'd suggest to try that, all the objections came from trying to
support variable size bitmaps. Lets just solve the scheduler
bitmask problem and let others worry about the generic case?

You are trying to be too nice ;-)

Thanks,

Ingo

2012-05-10 13:26:52

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Wed, May 09, 2012 at 02:30:44PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-09 at 14:21 +0200, Peter Zijlstra wrote:
> > Does something like the below give any clues as to how we got there?
>
> New version that checks we include the right cpu in build_sched_domain()
> too.. on a related note, we should add a printk-%p modifier for cpulist,
> this cpulist_scnprintf() stuff gets annoying.
>
Logs produced with your patches:

[ 141.699854] sched: Bonkers domain doesn't include its own cpu: 3 0-1,3
[ 141.725038] sched: Bonkers domain doesn't include its own cpu: 3 0-1
[ 141.725785] ------------[ cut here ]------------
[ 141.726351] WARNING: at kernel/sched/core.c:6468 build_sched_domain+0x1a4/0x1b0()
[ 141.727233] Hardware name: KVM
[ 141.727597] Modules linked in: sunrpc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc crc32c_intel ghash_clmulni_intel microcode serio_raw e1000 virtio_balloon i2c_piix4 i2c_core floppy [last unloaded: scsi_wait_scan]
[ 141.731149] Pid: 2659, comm: offV2.sh Not tainted 3.4.0-rc6+ #239
[ 141.731853] Call Trace:
[ 141.732166] [<ffffffff81057bcf>] warn_slowpath_common+0x7f/0xc0
[ 141.732866] [<ffffffff81057c2a>] warn_slowpath_null+0x1a/0x20
[ 141.733584] [<ffffffff8108c3f4>] build_sched_domain+0x1a4/0x1b0
[ 141.734317] [<ffffffff810bbe9d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
[ 141.735102] [<ffffffff810bbf3d>] ? trace_hardirqs_on+0xd/0x10
[ 141.735775] [<ffffffff810bbb2d>] ? mark_held_locks+0x7d/0x120
[ 141.736496] [<ffffffff81093713>] ? build_sched_domains+0x323/0x8f0
[ 141.737249] [<ffffffff81186618>] ? kmem_cache_alloc_trace+0x48/0x140
[ 141.737978] [<ffffffff810937b7>] build_sched_domains+0x3c7/0x8f0
[ 141.738728] [<ffffffff810945ab>] partition_sched_domains+0x2cb/0x4d0
[ 141.739494] [<ffffffff81094403>] ? partition_sched_domains+0x123/0x4d0
[ 141.740287] [<ffffffff810daca7>] cpuset_update_active_cpus+0x87/0x90
[ 141.741042] [<ffffffff8108c035>] cpuset_cpu_active+0x25/0x30
[ 141.741711] [<ffffffff815b598c>] notifier_call_chain+0x5c/0x120
[ 141.742445] [<ffffffff81085c4e>] __raw_notifier_call_chain+0xe/0x10
[ 141.743197] [<ffffffff8105a3b0>] __cpu_notify+0x20/0x40
[ 141.743806] [<ffffffff815a9cdd>] _cpu_up+0xc7/0x10e
[ 141.744421] [<ffffffff815a9d70>] cpu_up+0x4c/0x5c
[ 141.744974] [<ffffffff8159b96c>] store_online+0x9c/0xd0
[ 141.745631] [<ffffffff813aa590>] dev_attr_store+0x20/0x30
[ 141.746306] [<ffffffff812147a6>] sysfs_write_file+0xe6/0x170
[ 141.746964] [<ffffffff8119bfc8>] vfs_write+0xc8/0x190
[ 141.747600] [<ffffffff8119c191>] sys_write+0x51/0x90
[ 141.748214] [<ffffffff815ba469>] system_call_fastpath+0x16/0x1b
[ 141.748926] ---[ end trace 09ac555cab7508f1 ]---
[ 141.749516] sched: Bonkers domain doesn't include its own cpu: 3 0-1,3
[ 141.750298] sched: Bonkers domain doesn't include its own cpu: 3 0-1
[ 141.751039] ------------[ cut here ]------------
[ 141.751594] WARNING: at kernel/sched/core.c:6468 build_sched_domain+0x1a4/0x1b0()
[ 141.752473] Hardware name: KVM
[ 141.752829] Modules linked in: sunrpc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc crc32c_intel ghash_clmulni_intel microcode serio_raw e1000 virtio_balloon i2c_piix4 i2c_core floppy [last unloaded: scsi_wait_scan]
[ 141.756387] Pid: 2659, comm: offV2.sh Tainted: G W 3.4.0-rc6+ #239
[ 141.757231] Call Trace:
[ 141.757533] [<ffffffff81057bcf>] warn_slowpath_common+0x7f/0xc0
[ 141.758251] [<ffffffff81057c2a>] warn_slowpath_null+0x1a/0x20
[ 141.758912] [<ffffffff8108c3f4>] build_sched_domain+0x1a4/0x1b0
[ 141.759640] [<ffffffff810bbe9d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
[ 141.760446] [<ffffffff810bbf3d>] ? trace_hardirqs_on+0xd/0x10
[ 141.761133] [<ffffffff810bbb2d>] ? mark_held_locks+0x7d/0x120
[ 141.761820] [<ffffffff81093713>] ? build_sched_domains+0x323/0x8f0
[ 141.762590] [<ffffffff81186618>] ? kmem_cache_alloc_trace+0x48/0x140
[ 141.763368] [<ffffffff810937b7>] build_sched_domains+0x3c7/0x8f0
[ 141.764077] [<ffffffff810945ab>] partition_sched_domains+0x2cb/0x4d0
[ 141.764829] [<ffffffff81094403>] ? partition_sched_domains+0x123/0x4d0
[ 141.765631] [<ffffffff810daca7>] cpuset_update_active_cpus+0x87/0x90
[ 141.766399] [<ffffffff8108c035>] cpuset_cpu_active+0x25/0x30
[ 141.767074] [<ffffffff815b598c>] notifier_call_chain+0x5c/0x120
[ 141.767780] [<ffffffff81085c4e>] __raw_notifier_call_chain+0xe/0x10
[ 141.768549] [<ffffffff8105a3b0>] __cpu_notify+0x20/0x40
[ 141.769177] [<ffffffff815a9cdd>] _cpu_up+0xc7/0x10e
[ 141.769756] [<ffffffff815a9d70>] cpu_up+0x4c/0x5c
[ 141.770349] [<ffffffff8159b96c>] store_online+0x9c/0xd0
[ 141.770971] [<ffffffff813aa590>] dev_attr_store+0x20/0x30
[ 141.771632] [<ffffffff812147a6>] sysfs_write_file+0xe6/0x170
[ 141.772327] [<ffffffff8119bfc8>] vfs_write+0xc8/0x190
[ 141.772916] [<ffffffff8119c191>] sys_write+0x51/0x90
[ 141.773530] [<ffffffff815ba469>] system_call_fastpath+0x16/0x1b
[ 141.774251] ---[ end trace 09ac555cab7508f2 ]---
[ 141.775040] sched: Topology is hosed for CPU-3!!
[ 141.775596] sched: domain: NODE 0-1
[ 141.776004] sched: group: 0-1
[ 141.776411] ------------[ cut here ]------------
[ 141.776940] kernel BUG at kernel/sched/core.c:6088!
[ 141.777394] invalid opcode: 0000 [#1] SMP
[ 141.777394] Dumping ftrace buffer:
[ 141.777394] (ftrace buffer empty)
[ 141.777394] CPU 0
[ 141.777394] Modules linked in: sunrpc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc crc32c_intel ghash_clmulni_intel microcode serio_raw e1000 virtio_balloon i2c_piix4 i2c_core floppy [last unloaded: scsi_wait_scan]
[ 141.777394]
[ 141.777394] Pid: 2659, comm: offV2.sh Tainted: G W 3.4.0-rc6+ #239 Red Hat KVM
[ 141.777394] RIP: 0010:[<ffffffff8108e115>] [<ffffffff8108e115>] build_overlap_sched_groups+0x2e5/0x320
[ 141.777394] RSP: 0018:ffff880037b21a88 EFLAGS: 00010246
[ 141.777394] RAX: 0000000000000028 RBX: ffff880037b21ac8 RCX: 0000000043d543d4
[ 141.777394] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000246
[ 141.777394] RBP: ffff880037b21c08 R08: 0000000000000002 R09: 0000000000000000
[ 141.777394] R10: 0720072007200720 R11: 0000000000000001 R12: ffff88007b21c700
[ 141.777394] R13: ffff88007b21c718 R14: ffff88007b21c700 R15: 0000000000000001
[ 141.777394] FS: 00007fe4009f3700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 141.777394] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 141.777394] CR2: 00007f03a8708580 CR3: 000000007a48e000 CR4: 00000000000007f0
[ 141.777394] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 141.777394] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 141.777394] Process offV2.sh (pid: 2659, threadinfo ffff880037b20000, task ffff88007ac78000)
[ 141.777394] Stack:
[ 141.777394] ffff88007a979c00 0000000000000003 0000000381a33300 0000000000010468
[ 141.777394] ffffffff81a332e8 0000000000000000 ffff88007b21c700 ffff88007a979d20
[ 141.777394] ffff003300312d30 ffffffff810bbe9d ffffffff81a45d20 0000000000000246
[ 141.777394] Call Trace:
[ 141.777394] [<ffffffff810bbe9d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
[ 141.777394] [<ffffffff810bbf3d>] ? trace_hardirqs_on+0xd/0x10
[ 141.777394] [<ffffffff810bbb2d>] ? mark_held_locks+0x7d/0x120
[ 141.777394] [<ffffffff81093713>] ? build_sched_domains+0x323/0x8f0
[ 141.777394] [<ffffffff81186618>] ? kmem_cache_alloc_trace+0x48/0x140
[ 141.777394] [<ffffffff81093864>] build_sched_domains+0x474/0x8f0
[ 141.777394] [<ffffffff810945ab>] partition_sched_domains+0x2cb/0x4d0
[ 141.777394] [<ffffffff81094403>] ? partition_sched_domains+0x123/0x4d0
[ 141.777394] [<ffffffff810daca7>] cpuset_update_active_cpus+0x87/0x90
[ 141.777394] [<ffffffff8108c035>] cpuset_cpu_active+0x25/0x30
[ 141.777394] [<ffffffff815b598c>] notifier_call_chain+0x5c/0x120
[ 141.777394] [<ffffffff81085c4e>] __raw_notifier_call_chain+0xe/0x10
[ 141.777394] [<ffffffff8105a3b0>] __cpu_notify+0x20/0x40
[ 141.777394] [<ffffffff815a9cdd>] _cpu_up+0xc7/0x10e
[ 141.777394] [<ffffffff815a9d70>] cpu_up+0x4c/0x5c
[ 141.777394] [<ffffffff8159b96c>] store_online+0x9c/0xd0
[ 141.777394] [<ffffffff813aa590>] dev_attr_store+0x20/0x30
[ 141.777394] [<ffffffff812147a6>] sysfs_write_file+0xe6/0x170
[ 141.777394] [<ffffffff8119bfc8>] vfs_write+0xc8/0x190
[ 141.777394] [<ffffffff8119c191>] sys_write+0x51/0x90
[ 141.777394] [<ffffffff815ba469>] system_call_fastpath+0x16/0x1b
[ 141.777394] Code: 89 df e8 7f e2 26 00 48 8b 95 80 fe ff ff 31 c0 48 c7 c7 9f 5b 79 81 48 8b b2 00 01 00 00 48 89 da e8 ea fb 51 00 4d 85 f6 75 04 <0f> 0b eb fe 4d 89 f4 49 8d 54 24 18 b9 00 01 00 00 be 00 01 00
[ 141.777394] RIP [<ffffffff8108e115>] build_overlap_sched_groups+0x2e5/0x320
[ 141.777394] RSP <ffff880037b21a88>
[ 141.816980] ---[ end trace 09ac555cab7508f3 ]---
[ 141.817563] Kernel panic - not syncing: Fatal exception

2012-05-10 13:46:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Thu, 2012-05-10 at 15:26 +0200, Igor Mammedov wrote:
> [ 141.699854] sched: Bonkers domain doesn't include its own cpu: 3 0-1,3
> [ 141.725038] sched: Bonkers domain doesn't include its own cpu: 3 0-1

Whee!! so cpu_mask (active_mask) does include 3, but the tl->mask()
doesn't.

> [ 141.775040] sched: Topology is hosed for CPU-3!!
> [ 141.775596] sched: domain: NODE 0-1
> [ 141.776004] sched: group: 0-1
>
This seems to suggest its the node topology being wrecked.

which with your code-base would be
cpu_node_mask()->sched_domain_node_span()..

Did you specify any node topology on the qemu command line? If not, it
should all reduce to cpumask_of_node(0).

identify_secondary_cpu()->identify_cpu()->numa_add_cpu() should set that
bit. which is well before the CPU_ONLINE->cpuset_update_active_cpus()
sched domain rebuild.


Most puzzling. Can you dig a little deeper as to why these masks might
be wrong? Also, can you reproduce on actual hardware? The reason I never
use kvm or other virt for debugging is that I always end up spending
time chasing virt bugs, and I hate virt..

2012-05-10 17:02:09

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

----- Original Message -----
> From: "Peter Zijlstra" <[email protected]>
> To: "Igor Mammedov" <[email protected]>
> Cc: "Jiang Liu" <[email protected]>, [email protected], [email protected], [email protected],
> [email protected], "seto hidetoshi" <[email protected]>
> Sent: Thursday, May 10, 2012 3:45:44 PM
> Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation
>
> On Thu, 2012-05-10 at 15:26 +0200, Igor Mammedov wrote:
> > [ 141.699854] sched: Bonkers domain doesn't include its own cpu: 3
> > 0-1,3
> > [ 141.725038] sched: Bonkers domain doesn't include its own cpu: 3
> > 0-1
>
> Whee!! so cpu_mask (active_mask) does include 3, but the tl->mask()
> doesn't.
>
> > [ 141.775040] sched: Topology is hosed for CPU-3!!
> > [ 141.775596] sched: domain: NODE 0-1
> > [ 141.776004] sched: group: 0-1
> >
> This seems to suggest its the node topology being wrecked.
>
> which with your code-base would be
> cpu_node_mask()->sched_domain_node_span()..
>
> Did you specify any node topology on the qemu command line? If not,
> it
> should all reduce to cpumask_of_node(0).
>
> identify_secondary_cpu()->identify_cpu()->numa_add_cpu() should set
> that
> bit. which is well before the CPU_ONLINE->cpuset_update_active_cpus()
> sched domain rebuild.
>
>
> Most puzzling. Can you dig a little deeper as to why these masks
> might
> be wrong? Also, can you reproduce on actual hardware? The reason I
probably not

> never
> use kvm or other virt for debugging is that I always end up spending
> time chasing virt bugs, and I hate virt..

qemu started with -smp 4,sockets=4,cores=1,threads=1

I can answer it right away. It happens when one of cpus failed
to come online on time and boot cpu decided to abandon attempt and exit
do_boot_cpu with error. While exiting with error it clears some masks:

if (boot_error) {
/* Try to put things back the way they were before ... */
numa_remove_cpu(cpu); /* was set by numa_add_cpu */

/* was set by do_boot_cpu() */
cpumask_clear_cpu(cpu, cpu_callout_mask);

/* was set by cpu_init() */
cpumask_clear_cpu(cpu, cpu_initialized_mask);

set_cpu_present(cpu, false);
per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
}

However cpu that has been blissfully forgotten, continues to boot and sets
cpu_active_mask. So causing all sorts of havoc in places that use this mask.
Then when next cpu is being onlined/offlined it eventually calls
CPU_[ONLINE|DOWN]->cpuset_update_active_cpus()
result is OOPS.

That is why I've posted https://lkml.org/lkml/2012/5/9/125
that prevent all of this from happening by 'gracefully' aborting boot
process on failed cpu with necessary cleanup.

Checked Ingo's git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git:master
races, that patches 1, 2, 4 are fixing, are still there.
If it is preferable for this patches to go into -tip, then I'll rebase them
on top of it and repost. ???

Please review them if you think it has a sense to fix races in mainline,
before cpu-hotplug overhaul of this part is complete.

2012-05-10 17:33:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched_groups are expected to be circular linked list, make it so right after allocation

On Thu, 2012-05-10 at 13:01 -0400, Igor Mammedov wrote:
> However cpu that has been blissfully forgotten, continues to boot and sets
> cpu_active_mask. So causing all sorts of havoc in places that use this mask.
> Then when next cpu is being onlined/offlined it eventually calls
> CPU_[ONLINE|DOWN]->cpuset_update_active_cpus()
> result is OOPS.
>
> That is why I've posted https://lkml.org/lkml/2012/5/9/125
> that prevent all of this from happening by 'gracefully' aborting boot
> process on failed cpu with necessary cleanup.
>
> Checked Ingo's git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git:master
> races, that patches 1, 2, 4 are fixing, are still there.
> If it is preferable for this patches to go into -tip, then I'll rebase them
> on top of it and repost. ???

Right, I'd hoped tglx would respond as he's got more stuff on his
harddrive. But he's been rather quiet the past few days. This either
means he's ignoring us all and working very hard to finish these patches
so he can post them and delight us all with a new hotplug
infrastructure. Or not and there's something else..

Anyway, I think you're right in that fixing that bringup fail is
important -- and you've done the work already, so yeah unless tglx shows
up soon your patches ought to go on.

I'll try and have a peek at them soon.