2009-04-03 20:02:20

by Chris Worley

Subject: Off topic: Numactl "distance" wrong

Sorry for the off-topic post, but I'm hoping somebody can point me to
the right place to direct this question...

I have an Opteron system for which I've seen the HW diagrams: each of
its 4 sockets is directly connected (HT) to two other sockets and is
two HT hops away from the third (i.e. a simple square topology, no X
in the middle).

Yet, "numactl --hardware" shows but one hop to each socket:

# numactl --hardware
...
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

I know this is wrong.
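
For this square topology I'd expect the matrix to distinguish the
one-hop neighbors from the two-hop diagonal, presumably something like
this (assuming node 2 is node 0's diagonal and something like AMD's
usual 10/16/22 scaling; the exact values are up to the BIOS):

node   0   1   2   3
  0:  10  16  22  16
  1:  16  10  16  22
  2:  22  16  10  16
  3:  16  22  16  10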

Who would be the right person (or list) to talk about this?

I'm using the latest numactl 2.0.3-rc2 source and a 2.6.29.1 kernel.

Thanks,

Chris


2009-04-03 20:16:20

by Brice Goglin

Subject: Re: Off topic: Numactl "distance" wrong

Chris Worley wrote:
> Sorry for the off-topic post, but I'm hoping somebody can point me to
> the right place to direct this question...
>
> I have an Opteron system for which I've seen the HW diagrams: each of
> its 4 sockets is directly connected (HT) to two other sockets and is
> two HT hops away from the third (i.e. a simple square topology, no X
> in the middle).
>
> Yet, "numactl --hardware" shows but one hop to each socket:
>
> # numactl --hardware
> ...
> node   0   1   2   3
>   0:  10  20  20  20
>   1:  20  10  20  20
>   2:  20  20  10  20
>   3:  20  20  20  10
>
> I know this is wrong.
>
> Who would be the right person (or list) to talk about this?
>

IIRC, the motherboard/BIOS is supposed to report NUMA distances through
the ACPI SLIT table. But I have never seen any Opteron box do it
properly. So you just get 10 for "local" and 20 for "remote". Some
Itanium machines, however, report actual distances.
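
If you want to check what your BIOS actually reports, you can dump the
raw table and disassemble it, for instance with the ACPICA tools
(assuming acpidump, acpixtract and iasl are installed):

# acpidump > acpi.dat
# acpixtract -s SLIT acpi.dat   # extracts slit.dat, if a SLIT exists
# iasl -d slit.dat              # disassembles it into slit.dsl

If no SLIT is present at all, the kernel just falls back to the 10/20
values above.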

Brice

2009-04-03 20:49:41

by Yinghai Lu

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, Apr 3, 2009 at 1:16 PM, Brice Goglin <[email protected]> wrote:
> Chris Worley wrote:
>> Sorry for the off-topic post, but I'm hoping somebody can point me to
>> the right place to direct this question...
>>
>> I have an Opteron system for which I've seen the HW diagrams: each of
>> its 4 sockets is directly connected (HT) to two other sockets and is
>> two HT hops away from the third (i.e. a simple square topology, no X
>> in the middle).
>>
>> Yet, "numactl --hardware" shows but one hop to each socket:
>>
>> # numactl --hardware
>> ...
>> node   0   1   2   3
>>   0:  10  20  20  20
>>   1:  20  10  20  20
>>   2:  20  20  10  20
>>   3:  20  20  20  10
>>
>> I know this is wrong.
>>
>> Who would be the right person (or list) to talk about this?
>>
>
> IIRC, the motherboard/BIOS is supposed to report NUMA distances through
> the ACPI SLIT table. But I have never seen any Opteron box do it
> properly. So you just get 10 for "local" and 20 for "remote". Some
> Itanium machines, however, report actual distances.

For x86 64-bit, we already copy the SLIT table and save our own copy.

We could provide a /sys interface so the user could modify it...

YH

2009-04-03 21:01:10

by Cliff Wickman

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, Apr 03, 2009 at 01:49:27PM -0700, Yinghai Lu wrote:
> On Fri, Apr 3, 2009 at 1:16 PM, Brice Goglin <[email protected]> wrote:
> > Chris Worley wrote:
> >> Sorry for the off-topic post, but I'm hoping somebody can point me to
> >> the right place to direct this question...
> >>
> >> I have an Opteron system for which I've seen the HW diagrams: each of
> >> its 4 sockets is directly connected (HT) to two other sockets and is
> >> two HT hops away from the third (i.e. a simple square topology, no X
> >> in the middle).
> >>
> >> Yet, "numactl --hardware" shows but one hop to each socket:
> >>
> >> # numactl --hardware
> >> ...
> >> node   0   1   2   3
> >>   0:  10  20  20  20
> >>   1:  20  10  20  20
> >>   2:  20  20  10  20
> >>   3:  20  20  20  10
> >>
> >> I know this is wrong.
> >>
> >> Who would be the right person (or list) to talk about this?

numactl and libnuma are discussed on
[email protected]

> >>
> >
> > IIRC, the motherboard/BIOS is supposed to report NUMA distances through
> > the ACPI SLIT table. But I have never seen any Opteron box do it
> > properly. So you just get 10 for "local" and 20 for "remote". Some
> > Itanium machines, however, report actual distances.
>
> For x86 64-bit, we already copy the SLIT table and save our own copy.
>
> We could provide a /sys interface so the user could modify it...
>
> YH

--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824

2009-04-03 21:25:25

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

Chris Worley <[email protected]> writes:

> Who would be the right person (or list) to talk about this?

Your BIOS vendor whose code reported the wrong values. Not that it matters
really on small systems.

-Andi

2009-04-03 21:43:53

by David Rientjes

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, 3 Apr 2009, Andi Kleen wrote:

> > Who would be the right person (or list) to talk about this?
>
> Your BIOS vendor whose code reported the wrong values. Not that it matters
> really on small systems.
>

The numactl --hardware values are coming directly from the sysfs per-node
distance interface, so this may not be a result of erroneous BIOS data but
rather the lack of a SLIT to describe the physical topology better. When
we lack a SLIT, nodes are simply given these remote distances of 20
because their ids differ.
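
For example, the distances that numactl prints for node 0 on this box
come straight from:

# cat /sys/devices/system/node/node0/distance
10 20 20 20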

Yinghai, can you elaborate on exactly what type of interface you can
imagine for modifying the distance for nodes through sysfs? It seems like
you'd have to report the entire physical topology in one write, for which
we currently have no interface beyond pxms, instead of per-node
distances to remote nodes.

2009-04-03 21:44:19

by Yinghai Lu

Subject: Re: Off topic: Numactl "distance" wrong

http://tracker.coreboot.org/trac/coreboot/browser/trunk/coreboot-v2/src/northbridge/amd/amdk8/amdk8_acpi.c
acpi_fill_slit() there shows how to generate a SLIT according to hop
counts on an AMD Opteron system.

Then you could use a /sys interface (later) to update acpi_slit in the
kernel, so that __node_distance() could return the correct distance.
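
The idea is roughly like this (just a sketch, not the actual coreboot
code; the "10 plus 6 per hop" scaling is my assumption):

/* Fill a SLIT-style distance matrix from an HT hop-count matrix. */
static void fill_slit_from_hops(unsigned char *slit,
                                const unsigned char *hops, int n)
{
        int i, j;

        for (i = 0; i < n; i++)
                for (j = 0; j < n; j++)
                        /* 10 = local; add an assumed 6 per HT hop */
                        slit[n*i + j] = (i == j) ?
                                10 : 10 + 6 * hops[n*i + j];
}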

YH

2009-04-03 21:48:39

by Yinghai Lu

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, Apr 3, 2009 at 2:43 PM, David Rientjes <[email protected]> wrote:
>
> Yinghai, can you elaborate on exactly what type of interface you can
> imagine for modifying the distance for nodes through sysfs? It seems like
> you'd have to report the entire physical topology in one write, for which
> we currently have no interface beyond pxms, instead of per-node
> distances to remote nodes.

acpi_numa_slit_init() in srat_64.c keeps one copy (called acpi_slit)
of the SLIT, if ACPI provides one.

So if NUMA is enabled, we could expose that acpi_slit via sysfs for
the user to update.

YH

2009-04-03 21:52:11

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, Apr 03, 2009 at 02:43:32PM -0700, David Rientjes wrote:
> On Fri, 3 Apr 2009, Andi Kleen wrote:
>
> > > Who would be the right person (or list) to talk about this?
> >
> > Your BIOS vendor whose code reported the wrong values. Not that it matters
> > really on small systems.
> >
>
> The numactl --hardware values are coming directly from the sysfs per-node
> distance interface, so this may not be a result of erroneous BIOS data but
> rather the lack of a SLIT to describe the physical topology better.

That's the same really. Think about it. No SLIT on a NUMA system is a wrong
SLIT.

BTW, there are more cases, like an illegal SLIT, which is also
replaced with 10/20.

-Andi

2009-04-03 21:52:41

by David Rientjes

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, 3 Apr 2009, Yinghai Lu wrote:

> acpi_numa_slit_init() in srat_64.c keeps one copy (called acpi_slit)
> of the SLIT, if ACPI provides one.
>
> So if NUMA is enabled, we could expose that acpi_slit via sysfs for
> the user to update.
>

Yeah, but in what format do you expect the user to update it? There is no
ACPI requirement that all nodes that have a remote distance of 20 to node
0, for instance, are local to one another.

So you'll have to report the entire physical topology in one write to a
sysfs interface. We don't currently have such a format unless we are to
allow updating of the pxms.
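
To illustrate (the file below is purely hypothetical, nothing like it
exists today), such a write would have to carry the whole matrix at
once, something like

# echo "10 16 22 16  16 10 16 22  22 16 10 16  16 22 16 10" \
        > /sys/devices/system/node/slit

rather than one distance file per node, which couldn't express which
remote nodes are close to each other.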

2009-04-03 21:55:29

by David Rientjes

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, 3 Apr 2009, Andi Kleen wrote:

> That's the same really. Think about it. No SLIT on a NUMA system is a wrong
> SLIT.
>

Unless each node really is symmetrically distant from each other node, in
which case the distance is reported correctly as a result of the differing
pxms.

2009-04-03 21:55:44

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

Yinghai Lu <[email protected]> writes:

> On Fri, Apr 3, 2009 at 2:43 PM, David Rientjes <[email protected]> wrote:
>>
>> Yinghai, can you elaborate on exactly what type of interface you can
>> imagine for modifying the distance for nodes through sysfs? It seems like
>> you'd have to report the entire physical topology in one write, for which
>> we currently have no interface beyond pxms, instead of per-node
>> distances to remote nodes.
>
> acpi_numa_slit_init() in srat_64.c keeps one copy (called acpi_slit)
> of the SLIT, if ACPI provides one.
>
> So if NUMA is enabled, we could expose that acpi_slit via sysfs for
> the user to update.

That's not enough. You would need to redo all the zone fallback tables
in the VM that are initialized based on topology, do new scheduler
topologies and all kinds of other stuff.

Besides, I don't know of any user-space software which actually does
anything with the distances. The kernel does, but for it, it doesn't
make much difference on smaller systems.

-Andi

--
[email protected] -- Speaking for myself only.

2009-04-03 22:16:03

by Yinghai Lu

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, Apr 3, 2009 at 2:55 PM, Andi Kleen <[email protected]> wrote:
> Yinghai Lu <[email protected]> writes:
>
>> On Fri, Apr 3, 2009 at 2:43 PM, David Rientjes <[email protected]> wrote:
>>>
>>> Yinghai, can you elaborate on exactly what type of interface you can
>>> imagine for modifying the distance for nodes through sysfs? It seems like
>>> you'd have to report the entire physical topology in one write, for which
>>> we currently have no interface beyond pxms, instead of per-node
>>> distances to remote nodes.
>>
>> acpi_numa_slit_init() in srat_64.c keeps one copy (called acpi_slit)
>> of the SLIT, if ACPI provides one.
>>
>> So if NUMA is enabled, we could expose that acpi_slit via sysfs for
>> the user to update.
>
> That's not enough. You would need to redo all the zone fallback tables
> in the VM that are initialized based on topology, do new scheduler
> topologies and all kinds of other stuff.
>
> Besides, I don't know of any user-space software which actually does
> anything with the distances. The kernel does, but for it, it doesn't
> make much difference on smaller systems.

How do CPU and memory hotplug work when they can change that kind of
topology? For the scheduler, I wonder whether arch_reinit_sched_domains()
could be used: rebuild_sched_domains() will call node_distance().

Or the ACPI ASL code could update the SLIT table, and let the OS use
the updated table.

YH

2009-04-03 22:19:17

by Yinghai Lu

Subject: Re: Off topic: Numactl "distance" wrong

On Fri, Apr 3, 2009 at 2:54 PM, David Rientjes <[email protected]> wrote:
> On Fri, 3 Apr 2009, Andi Kleen wrote:
>
>> That's the same really. Think about it. No SLIT on a NUMA system is a wrong
>> SLIT.
>>
>
> Unless each node really is symmetrically distant from each other node, in
> which case the distance is reported correctly as a result of the differing
> pxms.
>

When ACPI is used, the first PXM to show up in the SRAT table becomes
node 0... so you already have the node_to_pxm mapping.

YH

2009-04-07 02:53:20

by KOSAKI Motohiro

Subject: Re: Off topic: Numactl "distance" wrong

> That's not enough. You would need to redo all the zone fallback tables
> in the VM that are initialized based on topology, do new scheduler
> topologies and all kinds of other stuff.

I think this is a very good viewpoint.

Rebuilding the zone fallback tables and scheduler topologies would
require adding new locking.

Oh well, who needs a memory and scheduler performance regression?
So such a /sys interface isn't so useful.


I don't think manually setting node distances would improve an
Opteron's (or any other small machine's) performance.


2009-04-07 06:17:56

by Brice Goglin

Subject: Re: Off topic: Numactl "distance" wrong

KOSAKI Motohiro wrote:
>> That's not enough. You would need to redo all the zone fallback tables
>> in the VM that are initialized based on topology, do new scheduler
>> topologies and all kinds of other stuff.
>
> I think this is a very good viewpoint.
>
> Rebuilding the zone fallback tables and scheduler topologies would
> require adding new locking.

Could you clarify how changing numa distances could break
zone fallback tables and scheduler topologies?

> Oh well, who needs a memory and scheduler performance regression?
> So such a /sys interface isn't so useful.

If changing the slit table at runtime is too hard, what about
changing it at boot through a new kernel command-line parameter?

> I don't think manually setting node distances would improve an
> Opteron's (or any other small machine's) performance.

Well, some user-space application may use these distances
to improve their binding. Maybe nobody does yet because
numa distances have never been available on x86_64 boxes...
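
As a rough sketch of what such an application could do (this only uses
numa_available(), numa_max_node() and numa_distance(), which libnuma
already provides; with the bogus 10/20 tables all remote nodes simply
tie):

/* print each node's nearest remote node from the SLIT distances */
/* build with: gcc nearest.c -o nearest -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
        int i, j;

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not supported\n");
                return 1;
        }
        for (i = 0; i <= numa_max_node(); i++) {
                int best = -1, bestdist = 256;

                for (j = 0; j <= numa_max_node(); j++) {
                        int d = numa_distance(i, j); /* SLIT entry, 0 on error */

                        if (i != j && d > 0 && d < bestdist) {
                                bestdist = d;
                                best = j;
                        }
                }
                printf("node %d: nearest remote node %d (distance %d)\n",
                       i, best, bestdist);
        }
        return 0;
}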

Brice

2009-04-07 07:03:34

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

> Well, some user-space application may use these distances
> to improve their binding.

I'm not aware of any that does. In general, applications usually use
only the bare basics of the NUMA API (if at all); the fancy stuff tends
to be more slideware.

If it's true then the correct place would be to fix the BIOS.

> Maybe nobody does yet because
> numa distances have never been available on x86_64 boxes...

That's an incorrect statement. There are x86-64 boxes which
report correct NUMA distances.

-Andi

2009-04-07 07:41:09

by Brice Goglin

Subject: Re: Off topic: Numactl "distance" wrong

Andi Kleen wrote:
>> Well, some user-space application may use these distances
>> to improve their binding.
>>
>
> I'm not aware of any that does.

We have some people here who would ideally like to use them. But they
know NUMA distances are almost never available, so they don't really
look into using them...

> If it's true then the correct place would be to fix the BIOS.
>

Come on, you know it's not going to happen for 99.9% of the existing
Opteron boxes. We have many hardware quirks in the kernel; I don't see
why this NUMA distance problem would not deserve its own workaround.


By the way, has anybody looked at this on Nehalem boxes?

Brice

2009-04-07 07:44:49

by David Rientjes

Subject: Re: Off topic: Numactl "distance" wrong

On Tue, 7 Apr 2009, Andi Kleen wrote:

> I'm not aware of any that does. In general, applications usually use
> only the bare basics of the NUMA API (if at all); the fancy stuff tends
> to be more slideware.
>
> If it's true then the correct place would be to fix the BIOS.
>

We already verify that each node has local distance to itself and that its
distance to any other node is greater than local when determining whether
the SLIT is valid.

It would also be possible to verify that the distance between two
localities is described consistently in the table (like in the following
patch).

I do think it would be helpful to add an acpi=noslit option, however, that
would disable parsing the SLIT if it is known to incorrectly describe the
physical topology of the system.
---
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -172,6 +172,8 @@ static __init int slit_valid(struct acpi_table_slit *slit)
 					return 0;
 			} else if (val <= LOCAL_DISTANCE)
 				return 0;
+			if (val != slit->entry[d*j + i])
+				return 0;
 		}
 	}
 	return 1;

2009-04-07 07:54:21

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

On Tue, Apr 07, 2009 at 09:40:56AM +0200, Brice Goglin wrote:
> Andi Kleen wrote:
> >> Well, some user-space application may use these distances
> >> to improve their binding.
> >>
> >
> > I'm not aware of any that does.
>
> We have some people here who would ideally like to use them. But they
> know NUMA distances are almost never available, so they don't really
> look into using them...

From my experience and from talking to people, they tend to have enough
trouble getting the basic NUMA tunings done, without caring about
such (arcane) details.

>
> > If it's true then the correct place would be to fix the BIOS.
> >
>
> Come on, you know it's not going to happen for 99.9% of the existing
> Opteron boxes. We have many hardware quirks in the kernel; I don't see
> why this NUMA distance problem would not deserve its own workaround.

The systems where it makes a large difference typically report correct
distances anyway.

Anyway, if you really want, you can ask Len for a way to override SLIT
tables at boot time (similar to the mechanism for MADTs), but I suspect
he wouldn't be particularly enthusiastic. Also it's a little more tricky
than for normal MADTs because SLIT parsing happens very early.

> By the way, has anybody looked at this on Nehalem boxes?

Current Nehalem boxes are all fully connected, so 10/20
(or sometimes scaled to trigger the zone fallback workaround) is the
correct answer and you don't get any benefits from magic in this area.

-Andi

--
[email protected] -- Speaking for myself only.

2009-04-07 07:57:29

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

On Tue, Apr 07, 2009 at 12:44:21AM -0700, David Rientjes wrote:
> On Tue, 7 Apr 2009, Andi Kleen wrote:
>
> > I'm not aware of any that does. In general, applications usually use
> > only the bare basics of the NUMA API (if at all); the fancy stuff tends
> > to be more slideware.
> >
> > If it's true then the correct place would be to fix the BIOS.
> >
>
> We already verify that each node has local distance to itself and that its
> distance to any other node is greater than local when determining whether
> the SLIT is valid.
>
> It would also be possible to verify that the distance between two
> localities is described consistently in the table (like in the following
> patch).

Do you have a real-world example where this is wrong?

>
> I do think it would be helpful to add an acpi=noslit option, however, that
> would disable parsing the SLIT if it is known to incorrectly describe the
> physical topology of the system.

The check heuristic handles this. I am not aware of a case where it
really fails and lets something really bogus through.

In general this thread seems to contain much more speculation than
facts.

-Andi

--
[email protected] -- Speaking for myself only.

2009-04-07 08:09:36

by David Rientjes

Subject: Re: Off topic: Numactl "distance" wrong

On Tue, 7 Apr 2009, Andi Kleen wrote:

> > It would also be possible to verify that the distance between two
> > localities is described consistently in the table (like in the following
> > patch).
>
> Do you have a real-world example where this is wrong?
>

Um, this is a SLIT validation method, so the change is only necessary to
ensure that the table is actually valid unless affinity is not symmetric
in both directions between localities.

Do you have a real-world example of the firmware handing off a locality
distance that is less than LOCAL_DISTANCE?

If so, that would violate the specification since values 0-9 are reserved.
But the validation method still checks for that, and you're not arguing
against it, right?

slit_valid() is intended to prevent invalid tables from being used because
they are incorrect and, thus, can't possibly be used to describe the
physical topology.

> In general this thread seems to contain much more speculation than
> facts.
>

The fact, which you seem to be ignoring, is that node hotplug would
require this table to change anyway. It's quite possible to use an _SLI
method to dynamically reconfigure the localities, including those that
were statically described by the BIOS at boot. So while you may be
satisfied with the ACPI 2.0 way of thinking, machines have actually
changed in the last five years.

2009-04-07 08:21:53

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

> If so, that would violate the specification since values 0-9 are reserved.
> But the validation method still checks for that, and you're not arguing
> against it, right?
>
> slit_valid() is intended to prevent invalid tables from being used because

No, actually it was intended to prevent tables that confuse the
scheduler/VM from being used. At least that is what I wrote it for. The
only check that is really needed is that remote != local; the rest is
admittedly fluff and could all be dropped.

> they are incorrect and, thus, can't possibly be used to describe the
> physical topology.

As long as the scheduler/VM does roughly the right thing it's ok.

>
> > In general this thread seems to contain much more speculation than
> > facts.
> >
>
> The fact, which you seem to be ignoring, is that node hotplug would
> require this table to change anyway. It's quite possible to use an _SLI
> method to dynamically reconfigure the localities, including those that
> were statically described by the BIOS at boot. So while you may be
> satisfied with the ACPI 2.0 way of thinking, machines have actually
> changed in the last five years.

That may be all true in theory, but Linux doesn't implement node hotplug
in this way (not even on architectures like ia64 that could do it in
theory). In general, node hotplug in Linux is pretty useless because you
can only add, never remove, so people don't really use it.

-Andi

--
[email protected] -- Speaking for myself only.

2009-04-07 08:30:30

by David Rientjes

Subject: Re: Off topic: Numactl "distance" wrong

On Tue, 7 Apr 2009, Andi Kleen wrote:

> No, actually it was intended to prevent tables that confuse the
> scheduler/VM from being used. At least that is what I wrote it for. The
> only check that is really needed is that remote != local; the rest is
> admittedly fluff and could all be dropped.
>

I think users, current and future, of node_distance(a, b) would assume it
would be equal to node_distance(b, a). Admittedly, later revisions of the
SLIT specification allow this to not necessarily be true.

I agreed with you early on that we shouldn't add an interface to
dynamically change the localities in the SLIT, not only because of the
dependencies that the VM and scheduler have which you mentioned, but also
because there is no current interface for changing it and its much more
reasonable to simply fix the BIOS instead of tuning this (which is even
more argument for making slit_valid() as sane as possible).

I do think that it would be helpful to add a parameter to disable parsing
the SLIT, however, when it is known to be incorrect. I haven't heard your
objection to that yet.

2009-04-07 09:14:06

by Andi Kleen

Subject: Re: Off topic: Numactl "distance" wrong

> I do think that it would be helpful to add a parameter to disable parsing
> the SLIT, however, when it is known to be incorrect. I haven't heard your
> objection to that yet.

Then you would get 10/20 reported (the scheduler/VM need those). Would that
help user space? I have some doubts.

The broken Opteron BIOSes report 10/20 anyway, IIRC.

-Andi


--
[email protected] -- Speaking for myself only.