2012-05-29 13:54:15

by Borislav Petkov

[permalink] [raw]
Subject: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

Dudes,

I'm getting the warning below on current linus. AFAICT, it is caused by

static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
if (c->phys_proc_id == o->phys_proc_id)
return topology_sane(c, o, "mc");

return false;
}

and the reason is, IMHO, that because this is a MCM box which has two
nodes in one physical package, i.e., phys_proc_id is 0 on both CPU6 and
CPU0 but it has two internal nodes, 0 and 1 and CPUs 0-5 are on node 0
and CPUs 6-11 are on node 1, the warning fires.

Maybe we could do something like this untested hunk:

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 433529e29be4..e52538cd48bb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -348,7 +348,8 @@ static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
if (c->phys_proc_id == o->phys_proc_id)
- return topology_sane(c, o, "mc");
+ if (!cpu_has(c, X86_FEATURE_AMD_DCM))
+ return topology_sane(c, o, "mc");

return false;
}

or you have a better idea...?

[ 0.444413] Booting Node 0, Processors #1 #2 #3 #4 #5 Ok.
[ 0.461388] ------------[ cut here ]------------
[ 0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[ 0.473960] Hardware name: Dinar
[ 0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.486860] Booting Node 1, Processors #6
[ 0.491104] Modules linked in:
[ 0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[ 0.499510] Call Trace:
[ 0.501946] [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[ 0.508185] [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[ 0.514163] [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[ 0.519881] [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[ 0.525943] [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[ 0.532004] [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[ 0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.628197] #7 #8 #9 #10 #11 Ok.
[ 0.807108] Booting Node 3, Processors #12 #13 #14 #15 #16 #17 Ok.
[ 0.897587] Booting Node 2, Processors #18 #19 #20 #21 #22 #23 Ok.
[ 0.917443] Brought up 24 CPUs

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


2012-05-29 14:52:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

On Tue, 2012-05-29 at 15:54 +0200, Borislav Petkov wrote:
> Dudes,
>
> I'm getting the warning below on current linus. AFAICT, it is caused by
>
> static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> {
> if (c->phys_proc_id == o->phys_proc_id)
> return topology_sane(c, o, "mc");
>
> return false;
> }
>
> and the reason is, IMHO, that because this is a MCM box which has two
> nodes in one physical package, i.e., phys_proc_id is 0 on both CPU6 and
> CPU0 but it has two internal nodes, 0 and 1 and CPUs 0-5 are on node 0
> and CPUs 6-11 are on node 1, the warning fires.
>
> Maybe we could do something like this untested hunk:
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 433529e29be4..e52538cd48bb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -348,7 +348,8 @@ static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> {
> if (c->phys_proc_id == o->phys_proc_id)
> - return topology_sane(c, o, "mc");
> + if (!cpu_has(c, X86_FEATURE_AMD_DCM))
> + return topology_sane(c, o, "mc");
>
> return false;
> }
>
> or you have a better idea...?

Ah,.. uhm.. unfortunate this... we only seem to use cpu_core_mask for
topology_core_cpumask() and its purpose is to enumerate cores in a
package for some very limited generic functions.

Its a bit sad we defined it thus, the multi-core concept only really
make sense if you share caches, otherwise its just smp.

Also, our generic topology as defined doesn't match nodes. Which is
weird to say the least.

I'd almost be tempted to say you should fake phys_id, but I can only
imagine what all would explode if we'd do that :-)

Yeah, I guess we should do the thing you propose, unless someone else
has a sane idea?

Subject: Re: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

On Tue, May 29, 2012 at 04:51:46PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 15:54 +0200, Borislav Petkov wrote:
> > Dudes,
> >
> > I'm getting the warning below on current linus. AFAICT, it is caused by
> >
> > static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> > {
> > if (c->phys_proc_id == o->phys_proc_id)
> > return topology_sane(c, o, "mc");
> >
> > return false;
> > }
> >
> > and the reason is, IMHO, that because this is a MCM box which has two
> > nodes in one physical package, i.e., phys_proc_id is 0 on both CPU6 and
> > CPU0 but it has two internal nodes, 0 and 1 and CPUs 0-5 are on node 0
> > and CPUs 6-11 are on node 1, the warning fires.
> >
> > Maybe we could do something like this untested hunk:
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 433529e29be4..e52538cd48bb 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -348,7 +348,8 @@ static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> > static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> > {
> > if (c->phys_proc_id == o->phys_proc_id)
> > - return topology_sane(c, o, "mc");
> > + if (!cpu_has(c, X86_FEATURE_AMD_DCM))
> > + return topology_sane(c, o, "mc");
> >
> > return false;
> > }
> >
> > or you have a better idea...?
>
> Ah,.. uhm.. unfortunate this... we only seem to use cpu_core_mask for
> topology_core_cpumask() and its purpose is to enumerate cores in a
> package for some very limited generic functions.
>
> Its a bit sad we defined it thus, the multi-core concept only really
> make sense if you share caches, otherwise its just smp.
>
> Also, our generic topology as defined doesn't match nodes. Which is
> weird to say the least.
>
> I'd almost be tempted to say you should fake phys_id, but I can only
> imagine what all would explode if we'd do that :-)
>
> Yeah, I guess we should do the thing you propose, unless someone else
> has a sane idea?

I've also looked at this. core_siblings mask is broken with this patch.
And there is this new irritating warning ...

I second Boris' suggestion for a fix. But I think the check for
X86_FEATURE_AMD_DCM should go into topology_sane() which in theory
could check other things as well.


Andreas

2012-05-29 16:59:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

On Tue, 2012-05-29 at 17:29 +0200, Andreas Herrmann wrote:

> I've also looked at this. core_siblings mask is broken with this patch.
> And there is this new irritating warning ...

Hehe, you made this irritating hardware ;-) But fair enough.

> I second Boris' suggestion for a fix. But I think the check for
> X86_FEATURE_AMD_DCM should go into topology_sane() which in theory
> could check other things as well.

Unless you plan to go span cache (or even SMT siblings) over physical
IDs I'd strongly argue against putting it in topology_sane().

As it stands I think we should discuss the definition for the generic
topology bits (drivers/base/topology.c), because I think your
Magny-Cours thing does the wrong thing here.

The core span in a phys_id is all nice and such, but what does it mean?
IOW what would you do with it?

I would think the LLC range and the node-span are much more useful
things to have. Once you have nodes the sysfs node topology takes over.

2012-05-29 17:12:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

On Tue, May 29, 2012 at 06:59:03PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 17:29 +0200, Andreas Herrmann wrote:
>
> > I've also looked at this. core_siblings mask is broken with this patch.
> > And there is this new irritating warning ...
>
> Hehe, you made this irritating hardware ;-) But fair enough.
>
> > I second Boris' suggestion for a fix. But I think the check for
> > X86_FEATURE_AMD_DCM should go into topology_sane() which in theory
> > could check other things as well.
>
> Unless you plan to go span cache (or even SMT siblings) over physical
> IDs I'd strongly argue against putting it in topology_sane().

Nah, the check goes in match_mc - we just talked it over with Andreas.

> As it stands I think we should discuss the definition for the generic
> topology bits (drivers/base/topology.c), because I think your
> Magny-Cours thing does the wrong thing here.

"wrong" is such a strong word :-) Please elaborate and I'll have a look.

> The core span in a phys_id is all nice and such, but what does it mean?

AFAICT, this is the physical package id to which the cores belong, i.e.
physical socket.

> IOW what would you do with it?

Shoot empty cans with it... :-)

Andreas?

> I would think the LLC range and the node-span are much more useful
> things to have. Once you have nodes the sysfs node topology takes
> over.

Yes, the LLC range should be the cores belonging to an internal node and
the node-span is the cores belonging to a physical socket. I think you
can compute anything else from those topo-wise.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-05-29 17:25:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

On Tue, 2012-05-29 at 19:13 +0200, Borislav Petkov wrote:
>
> > As it stands I think we should discuss the definition for the generic
> > topology bits (drivers/base/topology.c), because I think your
> > Magny-Cours thing does the wrong thing here.
>
> "wrong" is such a strong word :-) Please elaborate and I'll have a look.

Right, so I meant LLC is the useful mask, and in my mind LLC is what
makes a multi-core, without shared cache its just SMP. So core_siblings
to me would mean LLC sharing cores.

But its all very subjective I guess, but using strong words gets the
discussion going better ;-)

> > The core span in a phys_id is all nice and such, but what does it mean?
>
> AFAICT, this is the physical package id to which the cores belong, i.e.
> physical socket.
>
> > IOW what would you do with it?
>
> Shoot empty cans with it... :-)

Right, I actually came up with proper use-case, physical hotplug :-)

Its not immediately obvious the sysfs topo bits have the llc mask, which
is the more 'useful' one IMO.

Another funny case I don't see represented well is where there's
multiple sockets to a node -- I know this is like ancient tech and
unlikely in these days of multi-node sockets, but still ;-)

I guess what I'm asking is what is the purpose of the sys topo bits?

Subject: Re: WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()

On Tue, May 29, 2012 at 07:25:19PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 19:13 +0200, Borislav Petkov wrote:
> >
> > > As it stands I think we should discuss the definition for the generic
> > > topology bits (drivers/base/topology.c), because I think your
> > > Magny-Cours thing does the wrong thing here.
> >
> > "wrong" is such a strong word :-) Please elaborate and I'll have a look.
>
> Right, so I meant LLC is the useful mask, and in my mind LLC is what
> makes a multi-core, without shared cache its just SMP. So core_siblings
> to me would mean LLC sharing cores.
>
> But its all very subjective I guess, but using strong words gets the
> discussion going better ;-)
>
> > > The core span in a phys_id is all nice and such, but what does it mean?
> >
> > AFAICT, this is the physical package id to which the cores belong, i.e.
> > physical socket.
> >
> > > IOW what would you do with it?
> >
> > Shoot empty cans with it... :-)
>
> Right, I actually came up with proper use-case, physical hotplug :-)
>
> Its not immediately obvious the sysfs topo bits have the llc mask, which
> is the more 'useful' one IMO.
>
> Another funny case I don't see represented well is where there's
> multiple sockets to a node -- I know this is like ancient tech and
> unlikely in these days of multi-node sockets, but still ;-)
>
> I guess what I'm asking is what is the purpose of the sys topo bits?

You mean the topology directory for each CPU in sysfs?

Where else do you find reliable/complete CPU topology information for
you system? (And yes I know that the info provided in this directory
is already incomplete for AMD multi-node CPUs.)

AFAIK hwloc relies on this information. (Together with cache topology
information in the cache sub-directory for each CPU.)

And if the question was why does it matter to know which cores belong
to which physical socket. What if you want to pin all your tasks to
both nodes of the same socket on AMD mult-node CPUs? Use
core_siblings/core_siblings_list provided in sysfs, use it as
parameter for a tool like taskset and you are done with it.


Andreas

2012-06-04 12:41:36

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH] x86, smp: Fix topology checks on AMD MCM

From: Borislav Petkov <[email protected]>

The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[ 0.444413] Booting Node 0, Processors #1 #2 #3 #4 #5 Ok.
[ 0.461388] ------------[ cut here ]------------
[ 0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[ 0.473960] Hardware name: Dinar
[ 0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.486860] Booting Node 1, Processors #6
[ 0.491104] Modules linked in:
[ 0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[ 0.499510] Call Trace:
[ 0.501946] [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[ 0.508185] [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[ 0.514163] [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[ 0.519881] [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[ 0.525943] [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[ 0.532004] [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[ 0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.628197] #7 #8 #9 #10 #11 Ok.
[ 0.807108] Booting Node 3, Processors #12 #13 #14 #15 #16 #17 Ok.
[ 0.897587] Booting Node 2, Processors #18 #19 #20 #21 #22 #23 Ok.
[ 0.917443] Brought up 24 CPUs

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/kernel/smpboot.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f56f96da77f5..912e5cac61d8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -350,7 +350,8 @@ static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
if (c->phys_proc_id == o->phys_proc_id)
- return topology_sane(c, o, "mc");
+ if (!cpu_has(c, X86_FEATURE_AMD_DCM))
+ return topology_sane(c, o, "mc");

return false;
}
--
1.7.9.3.362.g71319

2012-06-04 12:44:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86, smp: Fix topology checks on AMD MCM

On Mon, 2012-06-04 at 14:41 +0200, Borislav Petkov wrote:
> The warning below triggers on AMD MCM packages because physical
> package
> IDs on the cores of a _physical_ socket are the same. I.e., this field
> says which CPUs belong to the same physical package.
>
Thanks!

2012-06-04 13:36:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86, smp: Fix topology checks on AMD MCM

On Mon, Jun 04, 2012 at 02:43:50PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-06-04 at 14:41 +0200, Borislav Petkov wrote:
> > The warning below triggers on AMD MCM packages because physical
> > package
> > IDs on the cores of a _physical_ socket are the same. I.e., this field
> > says which CPUs belong to the same physical package.
> >
> Thanks!

This is still crap.

It breaks tests like those in powernow-k8:

if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu)
print_basics(data);

and it calls print_basics() on each core filling up dmesg with:

[ 5.216925] powernow-k8: 0 : pstate 0 (1700 MHz)
[ 5.222139] powernow-k8: 1 : pstate 1 (1300 MHz)
[ 5.227354] powernow-k8: 2 : pstate 2 (1100 MHz)
[ 5.232568] powernow-k8: 3 : pstate 3 (1000 MHz)
[ 5.237783] powernow-k8: 4 : pstate 4 (800 MHz)
[ 5.242988] powernow-k8: 0 : pstate 0 (1700 MHz)
[ 5.248242] powernow-k8: 1 : pstate 1 (1300 MHz)
[ 5.253462] powernow-k8: 2 : pstate 2 (1100 MHz)
[ 5.258677] powernow-k8: 3 : pstate 3 (1000 MHz)
[ 5.263892] powernow-k8: 4 : pstate 4 (800 MHz)
[ 5.269094] powernow-k8: 0 : pstate 0 (1700 MHz)
[ 5.274348] powernow-k8: 1 : pstate 1 (1300 MHz)
[ 5.279566] powernow-k8: 2 : pstate 2 (1100 MHz)
[ 5.284787] powernow-k8: 3 : pstate 3 (1000 MHz)
[ 5.298387] powernow-k8: 4 : pstate 4 (800 MHz)
...

while it should dump the Pstates only once per physical socket.

The proper fix should be:

static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
if (c->phys_proc_id == o->phys_proc_id) {
if (cpu_has(c, X86_FEATURE_AMD_DCM))
return true;

return topology_sane(c, o, "mc");
}
return false;
}

which basically circumvents the topology check on MCM modules. I'll run
this a bit longer to verify it doesn't break anything else.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-06-04 13:38:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86, smp: Fix topology checks on AMD MCM

On Mon, 2012-06-04 at 15:37 +0200, Borislav Petkov wrote:
> This is still crap.

*poof* and the patch is gone.. I'll wait for an update ;-)

Subject: Re: [PATCH] x86, smp: Fix topology checks on AMD MCM

On Mon, Jun 04, 2012 at 03:38:37PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-06-04 at 15:37 +0200, Borislav Petkov wrote:
> > This is still crap.
>
> *poof* and the patch is gone.. I'll wait for an update ;-)

Thanks.

So I'm looking more into this and there's another issue IMHO:

processor : 2
vendor_id : AuthenticAMD
cpu family : 16
model : 9
stepping : 1
microcode : 0x10000d9
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 12
core id : 2
cpu cores : 2
^^^^^^^^^^^^^^^^^^^

This is cpuinfo_x86.booted_cores and here's how it progresses on this
MCM box (below).

On core 2 it becomes 2 and it doesn't change again. Until the next
physical socket comes (cpu 12) where it starts again from 1.

And, it is normally the number of booted cores on the socket, i.e. 12 in
this case.

Now, before we go babbling about what the right fix is, maybe we should
remove this thing completely - I mean, it looks like no one uses it, it
is only in the /proc/cpuinfo thing. Ingo, hpa?

$ grep -E "(cpu cores|processor)" /proc/cpuinfo
processor : 0
cpu cores : 1
processor : 1
cpu cores : 2
processor : 2
cpu cores : 2
processor : 3
cpu cores : 2
processor : 4
cpu cores : 2
processor : 5
cpu cores : 2
processor : 6
cpu cores : 2
processor : 7
cpu cores : 2
processor : 8
cpu cores : 2
processor : 9
cpu cores : 2
processor : 10
cpu cores : 2
processor : 11
cpu cores : 2
processor : 12
cpu cores : 1
processor : 13
cpu cores : 2
processor : 14
cpu cores : 2
processor : 15
cpu cores : 2
processor : 16
cpu cores : 2
processor : 17
cpu cores : 2
processor : 18
cpu cores : 2
processor : 19
cpu cores : 2
processor : 20
cpu cores : 2
processor : 21
cpu cores : 2
processor : 22
cpu cores : 2
processor : 23
cpu cores : 2


--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-06-04 14:56:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86, smp: Fix topology checks on AMD MCM

On Mon, 2012-06-04 at 16:48 +0200, Borislav Petkov wrote:
> This is cpuinfo_x86.booted_cores and here's how it progresses on this
> MCM box (below).


http://marc.info/?l=linux-kernel&m=133849647709656

2012-06-04 16:00:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86, smp: Fix topology checks on AMD MCM

On Mon, Jun 04, 2012 at 04:56:01PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-06-04 at 16:48 +0200, Borislav Petkov wrote:
> > This is cpuinfo_x86.booted_cores and here's how it progresses on this
> > MCM box (below).
>
> http://marc.info/?l=linux-kernel&m=133849647709656

Ho-hum, it workie.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-06-06 15:31:04

by Borislav Petkov

[permalink] [raw]
Subject: [PATCH -v2] x86, smp: Fix topology checks on AMD MCM

From: Borislav Petkov <[email protected]>

The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[ 0.444413] Booting Node 0, Processors #1 #2 #3 #4 #5 Ok.
[ 0.461388] ------------[ cut here ]------------
[ 0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[ 0.473960] Hardware name: Dinar
[ 0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.486860] Booting Node 1, Processors #6
[ 0.491104] Modules linked in:
[ 0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[ 0.499510] Call Trace:
[ 0.501946] [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[ 0.508185] [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[ 0.514163] [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[ 0.519881] [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[ 0.525943] [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[ 0.532004] [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[ 0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.628197] #7 #8 #9 #10 #11 Ok.
[ 0.807108] Booting Node 3, Processors #12 #13 #14 #15 #16 #17 Ok.
[ 0.897587] Booting Node 2, Processors #18 #19 #20 #21 #22 #23 Ok.
[ 0.917443] Brought up 24 CPUs

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
---

This is -v2, we ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).

arch/x86/kernel/smpboot.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f56f96da77f5..9432dcd86340 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -349,9 +349,12 @@ static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)

static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
- if (c->phys_proc_id == o->phys_proc_id)
- return topology_sane(c, o, "mc");
+ if (c->phys_proc_id == o->phys_proc_id) {
+ if (cpu_has(c, X86_FEATURE_AMD_DCM))
+ return true;

+ return topology_sane(c, o, "mc");
+ }
return false;
}

--
1.7.11.rc1


--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Subject: [tip:x86/urgent] x86/smp: Fix topology checks on AMD MCM CPUs

Commit-ID: 161270fc1f9ddfc17154e0d49291472a9cdef7db
Gitweb: http://git.kernel.org/tip/161270fc1f9ddfc17154e0d49291472a9cdef7db
Author: Borislav Petkov <[email protected]>
AuthorDate: Wed, 6 Jun 2012 17:31:26 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 13 Jun 2012 14:56:12 +0200

x86/smp: Fix topology checks on AMD MCM CPUs

The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[ 0.444413] Booting Node 0, Processors #1 #2 #3 #4 #5 Ok.
[ 0.461388] ------------[ cut here ]------------
[ 0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[ 0.473960] Hardware name: Dinar
[ 0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.486860] Booting Node 1, Processors #6
[ 0.491104] Modules linked in:
[ 0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[ 0.499510] Call Trace:
[ 0.501946] [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[ 0.508185] [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[ 0.514163] [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[ 0.519881] [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[ 0.525943] [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[ 0.532004] [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[ 0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.628197] #7 #8 #9 #10 #11 Ok.
[ 0.807108] Booting Node 3, Processors #12 #13 #14 #15 #16 #17 Ok.
[ 0.897587] Booting Node 2, Processors #18 #19 #20 #21 #22 #23 Ok.
[ 0.917443] Brought up 24 CPUs

We ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/smpboot.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fd019d7..c3a6bac 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -349,9 +349,12 @@ static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)

static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
- if (c->phys_proc_id == o->phys_proc_id)
- return topology_sane(c, o, "mc");
+ if (c->phys_proc_id == o->phys_proc_id) {
+ if (cpu_has(c, X86_FEATURE_AMD_DCM))
+ return true;

+ return topology_sane(c, o, "mc");
+ }
return false;
}