2004-11-04 02:13:36

by Takayoshi Kochi

Subject: Re: Externalize SLIT table

Hi,

For wider audience, added LKML.

From: Jack Steiner <[email protected]>
Subject: Externalize SLIT table
Date: Wed, 3 Nov 2004 14:56:56 -0600

> The SLIT table provides useful information on internode
> distances. Has anyone considered externalizing this
> table via /proc or some equivalent mechanism.
>
> For example, something like the following would be useful:
>
> # cat /proc/acpi/slit
> 010 066 046 066
> 066 010 066 046
> 046 066 010 020
> 066 046 020 010
>
> If this looks ok (or something equivalent), I'll generate a patch....

For user space to manipulate scheduling domains, pin processes
to some cpu groups etc., that kind of information is very useful!
Without it, users have no notion of how far apart two nodes are.

But the ACPI SLIT table is too arch specific (ia64 and x86 only), and
the user-visible logical node number and the ACPI proximity domain
number are not always identical.

Why not export node_distance() under sysfs?
I like (1).

(1) obey one-value-per-file sysfs principle

% cat /sys/devices/system/node/node0/distance0
10
% cat /sys/devices/system/node/node0/distance1
66

(2) one distance for each line

% cat /sys/devices/system/node/node0/distance
0:10
1:66
2:46
3:66

(3) all distances in one line like /proc/<PID>/stat

% cat /sys/devices/system/node/node0/distance
10 66 46 66
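
If we go with (3), user space could parse it with something as simple
as this (an illustrative sketch only; the path follows the examples
above and error handling is kept minimal):

#include <stdio.h>

/* Read the one-line distance format of option (3) for node0 and
 * print each entry.  The node number is just an example. */
int main(void)
{
	FILE *f = fopen("/sys/devices/system/node/node0/distance", "r");
	int d, to = 0;

	if (!f) {
		perror("distance");
		return 1;
	}
	while (fscanf(f, "%d", &d) == 1)
		printf("node0 -> node%d: %d\n", to++, d);
	fclose(f);
	return 0;
}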

---
Takayoshi Kochi


2004-11-04 04:14:18

by Andi Kleen

Subject: Re: Externalize SLIT table

On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote:
> Hi,
>
> For wider audience, added LKML.
>
> From: Jack Steiner <[email protected]>
> Subject: Externalize SLIT table
> Date: Wed, 3 Nov 2004 14:56:56 -0600
>
> > The SLIT table provides useful information on internode
> > distances. Has anyone considered externalizing this
> > table via /proc or some equivalent mechanism.
> >
> > For example, something like the following would be useful:
> >
> > # cat /proc/acpi/slit
> > 010 066 046 066
> > 066 010 066 046
> > 046 066 010 020
> > 066 046 020 010
> >
> > If this looks ok (or something equivalent), I'll generate a patch....

This isn't very useful without information about proximity domains.
e.g. on x86-64 the proximity domain number is not necessarily
the same as the node number.


> For user space to manipulate scheduling domains, pin processes
> to some cpu groups etc., that kind of information is very useful!
> Without it, users have no notion of how far apart two nodes are.

Also some reporting of _PXM for PCI devices is needed. I had an
experimental patch for this on x86-64 (not ACPI based) that
reported nearby nodes for PCI busses.

>
> But the ACPI SLIT table is too arch specific (ia64 and x86 only), and
> the user-visible logical node number and the ACPI proximity domain
> number are not always identical.

Exactly.

>
> Why not export node_distance() under sysfs?
> I like (1).
>
> (1) obey one-value-per-file sysfs principle
>
> % cat /sys/devices/system/node/node0/distance0
> 10

Surely distance from 0 to 0 is 0?

> % cat /sys/devices/system/node/node0/distance1
> 66

>
> (2) one distance for each line
>
> % cat /sys/devices/system/node/node0/distance
> 0:10
> 1:66
> 2:46
> 3:66
>
> (3) all distances in one line like /proc/<PID>/stat
>
> % cat /sys/devices/system/node/node0/distance
> 10 66 46 66

I would prefer that.

-Andi

2004-11-04 04:57:56

by Takayoshi Kochi

Subject: Re: Externalize SLIT table

Hi,

From: Andi Kleen <[email protected]>
Subject: Re: Externalize SLIT table
Date: Thu, 4 Nov 2004 05:07:13 +0100

> > Why not export node_distance() under sysfs?
> > I like (1).
> >
> > (1) obey one-value-per-file sysfs principle
> >
> > % cat /sys/devices/system/node/node0/distance0
> > 10
>
> Surely distance from 0 to 0 is 0?

According to the ACPI spec, 10 means local and other values are
ratios to 10. But what the distance number should mean is ambiguous
in the spec (e.g. some vendors interpret it as memory access latency,
others as memory throughput, etc.)
However, relative distance just works for most uses, I believe.

Anyway, we should clarify how the numbers should be interpreted
to avoid confusion.

How about this?
"The distance to itself means the base value. Distance to
other nodes are relative to the base value.
0 means unreachable (hot-removed or disabled) to that node."

(Just FYI, numbers 0-9 are reserved and 255 (unsigned char -1) means
unreachable, according to the ACPI spec.)
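
As a sketch of that interpretation (illustrative only, not a proposed
kernel interface; only the 10-is-local ratio and the reserved/unreachable
values from the spec are assumed):

#include <stdio.h>

/* A SLIT entry is a ratio to the local value 10; 0-9 are reserved
 * and 255 means unreachable. */
static double slit_ratio(unsigned char slit)
{
	if (slit < 10 || slit == 255)	/* reserved or unreachable */
		return -1.0;
	return slit / 10.0;		/* e.g. 20 -> 2.0 x local */
}

int main(void)
{
	printf("%.1f\n", slit_ratio(66));	/* prints 6.6 */
	return 0;
}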

> > % cat /sys/devices/system/node/node0/distance1
> > 66
>
> >
> > (2) one distance for each line
> >
> > % cat /sys/devices/system/node/node0/distance
> > 0:10
> > 1:66
> > 2:46
> > 3:66
> >
> > (3) all distances in one line like /proc/<PID>/stat
> >
> > % cat /sys/devices/system/node/node0/distance
> > 10 66 46 66
>
> I would prefer that.

Ah, I missed the following last sentence in
Documentation/filesystems/sysfs.txt:

|Attributes should be ASCII text files, preferably with only one value
|per file. It is noted that it may not be efficient to contain only one
|value per file, so it is socially acceptable to express an array of
|values of the same type.

If an array is acceptable, I would prefer (3), too.

---
Takayoshi Kochi

2004-11-04 06:45:43

by Andi Kleen

Subject: Re: Externalize SLIT table

On Thu, Nov 04, 2004 at 01:57:21PM +0900, Takayoshi Kochi wrote:
> Hi,
>
> From: Andi Kleen <[email protected]>
> Subject: Re: Externalize SLIT table
> Date: Thu, 4 Nov 2004 05:07:13 +0100
>
> > > Why not export node_distance() under sysfs?
> > > I like (1).
> > >
> > > (1) obey one-value-per-file sysfs principle
> > >
> > > % cat /sys/devices/system/node/node0/distance0
> > > 10
> >
> > Surely distance from 0 to 0 is 0?
>
> According to the ACPI spec, 10 means local and other values are
> ratios to 10. But what the distance number should mean

Ah, missed that. OK, I guess it makes sense to use the same
encoding as ACPI; no need to be intentionally different.

> is ambiguous in the spec (e.g. some vendors interpret it as
> memory access latency, others as memory throughput, etc.)
> However, relative distance just works for most uses, I believe.
>
> Anyway, we should clarify how the numbers should be interpreted
> to avoid confusion.

Defining it as "as defined in the ACPI spec" should be ok.
I guess even non-ACPI architectures will be able to live with that.

Anyway, since we seem to agree and so far nobody has complained,
it's just that somebody needs to do a patch? If possible make it
generic code in drivers/acpi/numa.c; there won't be anything
architecture specific in this and it should work for x86-64 too.


-Andi

2004-11-04 14:19:38

by Jack Steiner

Subject: Re: Externalize SLIT table

On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote:
> Hi,
>
> For wider audience, added LKML.
>
> From: Jack Steiner <[email protected]>
> Subject: Externalize SLIT table
> Date: Wed, 3 Nov 2004 14:56:56 -0600
>
> > The SLIT table provides useful information on internode
> > distances. Has anyone considered externalizing this
> > table via /proc or some equivalent mechanism.
> >
> > For example, something like the following would be useful:
> >
> > # cat /proc/acpi/slit
> > 010 066 046 066
> > 066 010 066 046
> > 046 066 010 020
> > 066 046 020 010
> >
> > If this looks ok (or something equivalent), I'll generate a patch....
>
> For user space to manipulate scheduling domains, pin processes
> to some cpu groups etc., that kind of information is very useful!
> Without it, users have no notion of how far apart two nodes are.
>
> But the ACPI SLIT table is too arch specific (ia64 and x86 only), and
> the user-visible logical node number and the ACPI proximity domain
> number are not always identical.
>
> Why not export node_distance() under sysfs?
> I like (1).
>
> (1) obey one-value-per-file sysfs principle
>
> % cat /sys/devices/system/node/node0/distance0
> 10
> % cat /sys/devices/system/node/node0/distance1
> 66

I'm not familiar with the internals of sysfs. For example, on a 256 node
system, there will be 65536 instances of
/sys/devices/system/node/node<M>/distance<N>

Does this require a significant amount of kernel resources to
maintain this information?




>
> (2) one distance for each line
>
> % cat /sys/devices/system/node/node0/distance
> 0:10
> 1:66
> 2:46
> 3:66
>
> (3) all distances in one line like /proc/<PID>/stat
>
> % cat /sys/devices/system/node/node0/distance
> 10 66 46 66
>


I like (3) the best.

I think it would also be useful to have a similar cpu-to-cpu distance
metric:
% cat /sys/devices/system/cpu/cpu0/distance
10 20 40 60

This gives the same information but is cpu-centric rather than
node centric.



--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2004-11-04 14:33:58

by Andi Kleen

Subject: Re: Externalize SLIT table

On Thu, Nov 04, 2004 at 08:13:37AM -0600, Jack Steiner wrote:
> On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote:
> > Hi,
> >
> > For wider audience, added LKML.
> >
> > From: Jack Steiner <[email protected]>
> > Subject: Externalize SLIT table
> > Date: Wed, 3 Nov 2004 14:56:56 -0600
> >
> > > The SLIT table provides useful information on internode
> > > distances. Has anyone considered externalizing this
> > > table via /proc or some equivalent mechanism.
> > >
> > > For example, something like the following would be useful:
> > >
> > > # cat /proc/acpi/slit
> > > 010 066 046 066
> > > 066 010 066 046
> > > 046 066 010 020
> > > 066 046 020 010
> > >
> > > If this looks ok (or something equivalent), I'll generate a patch....
> >
> > For user space to manipulate scheduling domains, pin processes
> > to some cpu groups etc., that kind of information is very useful!
> > Without it, users have no notion of how far apart two nodes are.
> >
> > But the ACPI SLIT table is too arch specific (ia64 and x86 only), and
> > the user-visible logical node number and the ACPI proximity domain
> > number are not always identical.
> >
> > Why not export node_distance() under sysfs?
> > I like (1).
> >
> > (1) obey one-value-per-file sysfs principle
> >
> > % cat /sys/devices/system/node/node0/distance0
> > 10
> > % cat /sys/devices/system/node/node0/distance1
> > 66
>
> I'm not familiar with the internals of sysfs. For example, on a 256 node
> system, there will be 65536 instances of
> /sys/devices/system/node/node<M>/distance<N>
>
> Does this require a significant amount of kernel resources to
> maintain this information?

Yes it does, even with the new sysfs backing store. And reading
it would create all the inodes and dentries, which are quite
bloated.

>
> I think it would also be useful to have a similar cpu-to-cpu distance
> metric:
> % cat /sys/devices/system/cpu/cpu0/distance
> 10 20 40 60
>
> This gives the same information but is cpu-centric rather than
> node centric.


And the same thing for PCI busses, like in this patch. However,
for strict ACPI systems this information would need to come
from _PXM first. x86-64 on Opteron currently reads it directly
from the hardware and uses it to allocate DMA memory near the device.

-Andi


diff -urpN -X ../KDIFX linux-2.6.8rc3/drivers/pci/pci-sysfs.c linux-2.6.8rc3-amd64/drivers/pci/pci-sysfs.c
--- linux-2.6.8rc3/drivers/pci/pci-sysfs.c 2004-07-27 14:44:10.000000000 +0200
+++ linux-2.6.8rc3-amd64/drivers/pci/pci-sysfs.c 2004-08-04 02:42:11.000000000 +0200
@@ -17,6 +17,7 @@
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/stat.h>
+#include <linux/topology.h>

#include "pci.h"

@@ -38,6 +39,15 @@ pci_config_attr(subsystem_device, "0x%04
pci_config_attr(class, "0x%06x\n");
pci_config_attr(irq, "%u\n");

+static ssize_t local_cpus_show(struct device *dev, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	cpumask_t mask = pcibus_to_cpumask(pdev->bus->number);
+	int len = cpumask_scnprintf(buf, PAGE_SIZE-1, mask);
+	strcat(buf,"\n");
+	return 1+len;
+}
+
/* show resources */
static ssize_t
resource_show(struct device * dev, char * buf)
@@ -67,6 +77,7 @@ struct device_attribute pci_dev_attrs[]
__ATTR_RO(subsystem_device),
__ATTR_RO(class),
__ATTR_RO(irq),
+ __ATTR_RO(local_cpus),
__ATTR_NULL,
};
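
For completeness, user space would read the new attribute roughly like
this (illustrative only; the device address is made up, and the value is
the hex cpumask string produced by cpumask_scnprintf()):

#include <stdio.h>

int main(void)
{
	/* Example address; substitute a real one from /sys/bus/pci/devices. */
	FILE *f = fopen("/sys/bus/pci/devices/0000:00:01.0/local_cpus", "r");
	char buf[256];

	if (!f || !fgets(buf, sizeof(buf), f)) {
		perror("local_cpus");
		return 1;
	}
	printf("CPUs near this device: %s", buf);
	fclose(f);
	return 0;
}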



2004-11-04 15:33:48

by Erich Focht

Subject: Re: Externalize SLIT table

On Thursday 04 November 2004 15:13, Jack Steiner wrote:
> I think it would also be useful to have a similar cpu-to-cpu distance
> metric:
>         % cat /sys/devices/system/cpu/cpu0/distance
>         10 20 40 60
>
> This gives the same information but is cpu-centric rather than
> node centric.

I don't see the use of that once you have some way to find the logical
CPU to node number mapping. The "node distances" are meant to be
proportional to the memory access latency ratios (20 means 2 times
larger than local (intra-node) access, which is by definition 10).
If the cpu_to_cpu distance is necessary because there is a hierarchy
in the memory blocks inside one node, then maybe the definition of a
node should be changed...

We currently have (at least in -mm kernels):
% ls /sys/devices/system/node/node0/cpu*
for finding out which CPUs belong to which nodes. Together with
/sys/devices/system/node/node0/distances
this should be enough for user space NUMA tools.
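
For example, a user space tool could build the cpu -> node mapping like
this (a rough sketch against the sysfs layout above; the array sizes are
arbitrary):

#include <dirent.h>
#include <stdio.h>

#define MAX_CPUS	512
#define MAX_NODES	128

static int cpu_node[MAX_CPUS];

/* Scan /sys/devices/system/node/node<N> for cpu<M> entries. */
static void build_cpu_to_node(void)
{
	char path[128];
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		DIR *d;
		struct dirent *e;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d", node);
		d = opendir(path);
		if (!d)
			continue;	/* node not present */
		while ((e = readdir(d)) != NULL) {
			int cpu;

			if (sscanf(e->d_name, "cpu%d", &cpu) == 1 &&
			    cpu >= 0 && cpu < MAX_CPUS)
				cpu_node[cpu] = node;
		}
		closedir(d);
	}
}

int main(void)
{
	build_cpu_to_node();
	printf("cpu0 is on node %d\n", cpu_node[0]);
	return 0;
}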

Regards,
Erich

2004-11-04 17:13:03

by Andi Kleen

Subject: Re: Externalize SLIT table

On Thu, Nov 04, 2004 at 04:31:42PM +0100, Erich Focht wrote:
> On Thursday 04 November 2004 15:13, Jack Steiner wrote:
> > I think it would also be useful to have a similar cpu-to-cpu distance
> > metric:
> >         % cat /sys/devices/system/cpu/cpu0/distance
> >         10 20 40 60
> >
> > This gives the same information but is cpu-centric rather than
> > node centric.
>
> I don't see the use of that once you have some way to find the logical
> CPU to node number mapping. The "node distances" are meant to be

I think he wants it just to have a more convenient interface,
which is not necessarily a bad thing. But then one could put the
convenience into libnuma anyways.

-Andi

2004-11-04 19:41:48

by Jack Steiner

Subject: Re: Externalize SLIT table

On Thu, Nov 04, 2004 at 06:04:35PM +0100, Andi Kleen wrote:
> On Thu, Nov 04, 2004 at 04:31:42PM +0100, Erich Focht wrote:
> > On Thursday 04 November 2004 15:13, Jack Steiner wrote:
> > > I think it would also be useful to have a similar cpu-to-cpu distance
> > > metric:
> > >         % cat /sys/devices/system/cpu/cpu0/distance
> > >         10 20 40 60
> > >
> > > This gives the same information but is cpu-centric rather than
> > > node centric.
> >
> > I don't see the use of that once you have some way to find the logical
> > CPU to node number mapping. The "node distances" are meant to be
>
> I think he wants it just to have a more convenient interface,
> which is not necessarily a bad thing. But then one could put the
> convenience into libnuma anyways.
>
> -Andi

Yes, strictly convenience. Most of the cases that I have seen deal with
cpu placement & cpu distances from each other. I agree that cpu-to-cpu
distances can be determined by converting to nodes & finding the
node-to-node distance.

A second reason is symmetry. If there is a /sys/devices/system/node/node0/distance
metric, it seems as though there should also be a /sys/devices/system/cpu/cpu0/distance
metric.

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2004-11-05 16:10:25

by Jack Steiner

Subject: Re: Externalize SLIT table

Based on the ideas from Andi & Takayoshi, I created a patch to
add the SLIT distance information to sysfs.

I've tested this on Altix/IA64 & it appears to work ok. I have
not tried it on other architectures.

Andi also posted a related patch for adding similar information
for PCI busses.

Comments, suggestions, .....


# cd /sys/devices/system
# find .
./node
./node/node5
./node/node5/cpu11
./node/node5/cpu10
./node/node5/distance
./node/node5/numastat
./node/node5/meminfo
./node/node5/cpumap
./node/node4
./node/node4/cpu9
./node/node4/cpu8
./node/node4/distance
./node/node4/numastat
./node/node4/meminfo
./node/node4/cpumap
....
./cpu
./cpu/cpu11
./cpu/cpu11/distance
./cpu/cpu10
./cpu/cpu10/distance
./cpu/cpu9
./cpu/cpu9/distance
./cpu/cpu8
...

# cat ./node/node0/distance
10 20 64 42 42 22

# cat ./cpu/cpu8/distance
42 42 64 64 22 22 42 42 10 10 20 20

# cat node/*/distance
10 20 64 42 42 22
20 10 42 22 64 84
64 42 10 20 22 42
42 22 20 10 42 62
42 64 22 42 10 20
22 84 42 62 20 10

# cat cpu/*/distance
10 10 20 20 64 64 42 42 42 42 22 22
10 10 20 20 64 64 42 42 42 42 22 22
20 20 10 10 42 42 22 22 64 64 84 84
20 20 10 10 42 42 22 22 64 64 84 84
64 64 42 42 10 10 20 20 22 22 42 42
64 64 42 42 10 10 20 20 22 22 42 42
42 42 22 22 20 20 10 10 42 42 62 62
42 42 22 22 20 20 10 10 42 42 62 62
42 42 64 64 22 22 42 42 10 10 20 20
42 42 64 64 22 22 42 42 10 10 20 20
22 22 84 84 42 42 62 62 20 20 10 10
22 22 84 84 42 42 62 62 20 20 10 10



Index: linux/drivers/base/node.c
===================================================================
--- linux.orig/drivers/base/node.c 2004-11-05 08:34:42.000000000 -0600
+++ linux/drivers/base/node.c 2004-11-05 09:00:01.000000000 -0600
@@ -111,6 +111,21 @@ static ssize_t node_read_numastat(struct
}
static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);

+static ssize_t node_read_distance(struct sys_device * dev, char * buf)
+{
+	int nid = dev->id;
+	int len = 0;
+	int i;
+
+	for (i = 0; i < numnodes; i++)
+		len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));
+
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static SYSDEV_ATTR(distance, S_IRUGO, node_read_distance, NULL);
+
+
/*
* register_node - Setup a driverfs device for a node.
* @num - Node number to use when creating the device.
@@ -129,6 +144,7 @@ int __init register_node(struct node *no
sysdev_create_file(&node->sysdev, &attr_cpumap);
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
+ sysdev_create_file(&node->sysdev, &attr_distance);
}
return error;
}
Index: linux/drivers/base/cpu.c
===================================================================
--- linux.orig/drivers/base/cpu.c 2004-11-05 08:58:09.000000000 -0600
+++ linux/drivers/base/cpu.c 2004-11-05 08:59:25.000000000 -0600
@@ -8,6 +8,7 @@
#include <linux/cpu.h>
#include <linux/topology.h>
#include <linux/device.h>
+#include <linux/cpumask.h>


struct sysdev_class cpu_sysdev_class = {
@@ -58,6 +59,31 @@ static inline void register_cpu_control(
}
#endif /* CONFIG_HOTPLUG_CPU */

+#ifdef CONFIG_NUMA
+static ssize_t cpu_read_distance(struct sys_device * dev, char * buf)
+{
+	int nid = cpu_to_node(dev->id);
+	int len = 0;
+	int i;
+
+	for (i = 0; i < num_possible_cpus(); i++)
+		len += sprintf(buf + len, "%s%d", i ? " " : "",
+			       node_distance(nid, cpu_to_node(i)));
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static SYSDEV_ATTR(distance, S_IRUGO, cpu_read_distance, NULL);
+
+static inline void register_cpu_distance(struct cpu *cpu)
+{
+	sysdev_create_file(&cpu->sysdev, &attr_distance);
+}
+#else /* !CONFIG_NUMA */
+static inline void register_cpu_distance(struct cpu *cpu)
+{
+}
+#endif
+
/*
* register_cpu - Setup a driverfs device for a CPU.
* @cpu - Callers can set the cpu->no_control field to 1, to indicate not to
@@ -81,6 +107,10 @@ int __init register_cpu(struct cpu *cpu,
kobject_name(&cpu->sysdev.kobj));
if (!error && !cpu->no_control)
register_cpu_control(cpu);
+
+ if (!error)
+ register_cpu_distance(cpu);
+
return error;
}


On Thu, Nov 04, 2004 at 01:57:21PM +0900, Takayoshi Kochi wrote:
> Hi,
>
> From: Andi Kleen <[email protected]>
> Subject: Re: Externalize SLIT table
> Date: Thu, 4 Nov 2004 05:07:13 +0100
>
> > > Why not export node_distance() under sysfs?
> > > I like (1).
> > >
> > > (1) obey one-value-per-file sysfs principle
> > >
> > > % cat /sys/devices/system/node/node0/distance0
> > > 10
> >
> > Surely distance from 0 to 0 is 0?
>
> According to the ACPI spec, 10 means local and other values are
> ratios to 10. But what the distance number should mean is ambiguous
> in the spec (e.g. some vendors interpret it as memory access latency,
> others as memory throughput, etc.)
> However, relative distance just works for most uses, I believe.
>
> Anyway, we should clarify how the numbers should be interpreted
> to avoid confusion.
>
> How about this?
> "The distance to itself means the base value. Distance to
> other nodes are relative to the base value.
> 0 means unreachable (hot-removed or disabled) to that node."
>
> (Just FYI, numbers 0-9 are reserved and 255 (unsigned char -1) means
> unreachable, according to the ACPI spec.)
>
> > > % cat /sys/devices/system/node/node0/distance1
> > > 66
> >
> > >
> > > (2) one distance for each line
> > >
> > > % cat /sys/devices/system/node/node0/distance
> > > 0:10
> > > 1:66
> > > 2:46
> > > 3:66
> > >
> > > (3) all distances in one line like /proc/<PID>/stat
> > >
> > > % cat /sys/devices/system/node/node0/distance
> > > 10 66 46 66
> >
> > I would prefer that.
>
> Ah, I missed the following last sentence in
> Documentation/filesystems/sysfs.txt:
>
> |Attributes should be ASCII text files, preferably with only one value
> |per file. It is noted that it may not be efficient to contain only one
> |value per file, so it is socially acceptable to express an array of
> |values of the same type.
>
> If an array is acceptable, I would prefer (3), too.
>
> ---
> Takayoshi Kochi

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2004-11-05 16:30:38

by Andreas Schwab

Subject: Re: Externalize SLIT table

Jack Steiner <[email protected]> writes:

> @@ -111,6 +111,21 @@ static ssize_t node_read_numastat(struct
> }
> static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
>
> +static ssize_t node_read_distance(struct sys_device * dev, char * buf)
> +{
> + int nid = dev->id;
> + int len = 0;
> + int i;
> +
> + for (i = 0; i < numnodes; i++)
> + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));

Can this overflow the space allocated for buf?

> @@ -58,6 +59,31 @@ static inline void register_cpu_control(
> }
> #endif /* CONFIG_HOTPLUG_CPU */
>
> +#ifdef CONFIG_NUMA
> +static ssize_t cpu_read_distance(struct sys_device * dev, char * buf)
> +{
> + int nid = cpu_to_node(dev->id);
> + int len = 0;
> + int i;
> +
> + for (i = 0; i < num_possible_cpus(); i++)
> + len += sprintf(buf + len, "%s%d", i ? " " : "",
> + node_distance(nid, cpu_to_node(i)));

Or this?

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2004-11-05 16:45:26

by Jack Steiner

Subject: Re: Externalize SLIT table

On Fri, Nov 05, 2004 at 05:26:10PM +0100, Andreas Schwab wrote:
> Jack Steiner <[email protected]> writes:
>
> > @@ -111,6 +111,21 @@ static ssize_t node_read_numastat(struct
> > }
> > static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
> >
> > +static ssize_t node_read_distance(struct sys_device * dev, char * buf)
> > +{
> > + int nid = dev->id;
> > + int len = 0;
> > + int i;
> > +
> > + for (i = 0; i < numnodes; i++)
> > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));
>
> Can this overflow the space allocated for buf?


Good point. I think we are ok for now. AFAIK, the largest cpu count
currently supported is 512. That gives a max string of 2k (max of 3
digits + space per cpu).

However, I should probably add a BUILD_BUG_ON to check for overflow.

BUILD_BUG_ON(NR_NODES*4 > PAGE_SIZE/2);
BUILD_BUG_ON(NR_CPUS*4 > PAGE_SIZE/2);
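
An alternative, untested sketch that bounds the writes at run time
instead (assuming scnprintf(), which returns the number of characters
actually stored, is usable here):

static ssize_t node_read_distance(struct sys_device * dev, char * buf)
{
	int nid = dev->id;
	int len = 0;
	int i;

	for (i = 0; i < numnodes; i++)
		len += scnprintf(buf + len, PAGE_SIZE - len, "%s%d",
				 i ? " " : "", node_distance(nid, i));
	len += scnprintf(buf + len, PAGE_SIZE - len, "\n");
	return len;
}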



>
> > @@ -58,6 +59,31 @@ static inline void register_cpu_control(
> > }
> > #endif /* CONFIG_HOTPLUG_CPU */
> >
> > +#ifdef CONFIG_NUMA
> > +static ssize_t cpu_read_distance(struct sys_device * dev, char * buf)
> > +{
> > + int nid = cpu_to_node(dev->id);
> > + int len = 0;
> > + int i;
> > +
> > + for (i = 0; i < num_possible_cpus(); i++)
> > + len += sprintf(buf + len, "%s%d", i ? " " : "",
> > + node_distance(nid, cpu_to_node(i)));
>
> Or this?
>
> Andreas.
>
> --
> Andreas Schwab, SuSE Labs, [email protected]
> SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
> Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
> "And now for something completely different."

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2004-11-05 17:13:45

by Erich Focht

Subject: Re: Externalize SLIT table

Hi Jack,

the patch looks fine, of course.
> # cat ./node/node0/distance
> 10 20 64 42 42 22
Great!

But:
> # cat ./cpu/cpu8/distance
> 42 42 64 64 22 22 42 42 10 10 20 20
...

what exactly do you mean by cpu_to_cpu distance? In analogy with the
node distance I'd say it is the time (latency) for moving data from
the register of one CPU into the register of another CPU:
cpu*/distance :  cpu  ->  memory  ->  cpu
                node1     node?      node2

On most architectures this means flushing a cacheline to memory on one
side and reading it on another side. What you actually implement is
the latency from memory (one node) to a particular cpu (on some
node).
memory  ->  cpu
 node1      node2

That's only half of the story and actually misleading. I don't
think the complexity hiding is good in this place. Questions coming to
my mind are: Where is the memory? Is the SLIT matrix really symmetric
(cpu_to_cpu distance only makes sense for symmetric matrices)? I
remember talking to IBM people about hardware where the node distance
matrix was asymmetric.

Why do you want this distance anyway? libnuma offers you _node_ masks
for allocating memory from a particular node. And when you want to
arrange a complex MPI process structure you'll have to think about
latency for moving data from one process's buffer to the other
process's buffer. The buffers live on nodes, not on cpus.

Regards,
Erich

2004-11-05 19:14:06

by Jack Steiner

Subject: Re: Externalize SLIT table

On Fri, Nov 05, 2004 at 06:13:24PM +0100, Erich Focht wrote:
> Hi Jack,
>
> the patch looks fine, of course.
> > # cat ./node/node0/distance
> > 10 20 64 42 42 22
> Great!
>
> But:
> > # cat ./cpu/cpu8/distance
> > 42 42 64 64 22 22 42 42 10 10 20 20
> ...
>
> what exactly do you mean by cpu_to_cpu distance? In analogy with the
> node distance I'd say it is the time (latency) for moving data from
> the register of one CPU into the register of another CPU:
> cpu*/distance : cpu -> memory -> cpu
> node1 node? node2
>

I'm trying to create an easy-to-use metric for finding sets of cpus that
are close to each other. By "close", I mean that the average offnode
reference from a cpu to remote memory in the set is minimized.

The numbers in cpuN/distance represent the distance from cpu N to
the memory that is local to each of the other cpus.

I agree that this can be derived from converting cpuN->node, finding
internode distances, then finding the cpus on each remote node.
The cpu metric is much easier to use.


> On most architectures this means flushing a cacheline to memory on one
> side and reading it on another side. What you actually implement is
> the latency from memory (one node) to a particular cpu (on some
> node).
> memory -> cpu
> node1 node2

I see how the term can be misleading. The metric is intended to
represent ONLY the cost of remote access to another processor's local memory.
Is there a better way to describe the cpu-to-remote-cpu's-memory metric OR
should we let users construct their own matrix from the node data?


>
> That's only half of the story and actually misleading. I don't
> think the complexity hiding is good in this place. Questions coming to
> my mind are: Where is the memory? Is the SLIT matrix really symmetric
> (cpu_to_cpu distance only makes sense for symmetric matrices)? I
> remember talking to IBM people about hardware where the node distance
> matrix was asymmetric.
>
> Why do you want this distance anyway? libnuma offers you _node_ masks
> for allocating memory from a particular node. And when you want to
> arrange a complex MPI process structure you'll have to think about
> latency for moving data from one processes buffer to the other
> processes buffer. The buffers live on nodes, not on cpus.

One important use is in the creation of cpusets. The batch scheduler needs
to pick a subset of cpus that are as close together as possible.


--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2004-11-06 11:50:58

by Christoph Hellwig

Subject: Re: Externalize SLIT table

On Fri, Nov 05, 2004 at 10:44:49AM -0600, Jack Steiner wrote:
> > > + for (i = 0; i < numnodes; i++)
> > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));
> >
> > Can this overflow the space allocated for buf?
>
>
> Good point. I think we are ok for now. AFAIK, the largest cpu count
> currently supported is 512. That gives a max string of 2k (max of 3
> digits + space per cpu).

I always wondered why sysfs doesn't use the seq_file interface that makes
life easier in the rest of the kernel.

2004-11-06 12:49:06

by Andi Kleen

Subject: Re: Externalize SLIT table

On Sat, Nov 06, 2004 at 11:50:29AM +0000, Christoph Hellwig wrote:
> On Fri, Nov 05, 2004 at 10:44:49AM -0600, Jack Steiner wrote:
> > > > + for (i = 0; i < numnodes; i++)
> > > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));
> > >
> > > Can this overflow the space allocated for buf?
> >
> >
> > Good point. I think we are ok for now. AFAIK, the largest cpu count
> > currently supported is 512. That gives a max string of 2k (max of 3
> > digits + space per cpu).
>
> I always wondered why sysfs doesn't use the seq_file interface that makes
> life easier in the rest of the kernel.

Most fields only output a single number, and seq_file would be
extreme overkill for that.

-Andi

2004-11-06 13:08:04

by Christoph Hellwig

Subject: Re: Externalize SLIT table

On Sat, Nov 06, 2004 at 01:48:38PM +0100, Andi Kleen wrote:
> On Sat, Nov 06, 2004 at 11:50:29AM +0000, Christoph Hellwig wrote:
> > On Fri, Nov 05, 2004 at 10:44:49AM -0600, Jack Steiner wrote:
> > > > > + for (i = 0; i < numnodes; i++)
> > > > > + len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));
> > > >
> > > > Can this overflow the space allocated for buf?
> > >
> > >
> > > Good point. I think we are ok for now. AFAIK, the largest cpu count
> > > currently supported is 512. That gives a max string of 2k (max of 3
> > > digits + space per cpu).
> >
> > I always wondered why sysfs doesn't use the seq_file interface that makes
> > life easier in the rest of the kernel.
>
> Most fields only output a single number, and seq_file would be
> extreme overkill for that.

Personally I think even a:

static void
show_foo(struct device *dev, struct seq_file *s)
{
	seq_printf(s, "blafcsvsdfg\n");
}

static ssize_t
show_foo(struct device *dev, char *buf)
{
	return snprintf(buf, 20, "blafcsvsdfg\n");
}

would be a definitive improvement.

2004-11-09 19:23:51

by Matthew Dobson

Subject: Re: Externalize SLIT table

On Wed, 2004-11-03 at 20:07, Andi Kleen wrote:
> On Thu, Nov 04, 2004 at 10:59:08AM +0900, Takayoshi Kochi wrote:
> > (3) all distances in one line like /proc/<PID>/stat
> >
> > % cat /sys/devices/system/node/node0/distance
> > 10 66 46 66
>
> I would prefer that.
>
> -Andi

That would be my vote as well. One line, space delimited. Easy to
parse... Plus you could easily reproduce the entire SLIT matrix by:

cd /sys/devices/system/node/
for i in node*; do cat $i/distance; done


-Matt

2004-11-09 19:44:20

by Matthew Dobson

Subject: Re: Externalize SLIT table

On Thu, 2004-11-04 at 07:31, Erich Focht wrote:
> On Thursday 04 November 2004 15:13, Jack Steiner wrote:
> > I think it would also be useful to have a similar cpu-to-cpu distance
> > metric:
> > % cat /sys/devices/system/cpu/cpu0/distance
> > 10 20 40 60
> >
> > This gives the same information but is cpu-centric rather than
> > node centric.
>
> I don't see the use of that once you have some way to find the logical
> CPU to node number mapping. The "node distances" are meant to be
> proportional to the memory access latency ratios (20 means 2 times
> larger than local (intra-node) access, which is by definition 10).
> If the cpu_to_cpu distance is necessary because there is a hierarchy
> in the memory blocks inside one node, then maybe the definition of a
> node should be changed...
>
> We currently have (at least in -mm kernels):
> % ls /sys/devices/system/node/node0/cpu*
> for finding out which CPUs belong to which nodes. Together with
> /sys/devices/system/node/node0/distances
> this should be enough for user space NUMA tools.
>
> Regards,
> Erich

I have to agree with Erich here. Node distances make sense, but adding
'cpu distances' which are just re-exporting the node distances in each
cpu's directory in sysfs doesn't make much sense to me. Especially
because it is so trivial to get a list of which CPUs are on which node.
If you're looking for groups of CPUs which are close, simply look for
groups of nodes that are close, then use the CPUs on those nodes. If we
came up with some sort of different notion of 'distance' for CPUs and
exported that, I'd be OK with it, because it'd be new information. I
don't think we should export the *exact same* node distance information
through the CPUs, though.

-Matt

2004-11-09 19:47:57

by Matthew Dobson

Subject: Re: Externalize SLIT table

On Thu, 2004-11-04 at 09:04, Andi Kleen wrote:
> On Thu, Nov 04, 2004 at 04:31:42PM +0100, Erich Focht wrote:
> > On Thursday 04 November 2004 15:13, Jack Steiner wrote:
> > > I think it would also be useful to have a similar cpu-to-cpu distance
> > > metric:
> > >         % cat /sys/devices/system/cpu/cpu0/distance
> > >         10 20 40 60
> > >
> > > This gives the same information but is cpu-centric rather than
> > > node centric.
> >
> > I don't see the use of that once you have some way to find the logical
> > CPU to node number mapping. The "node distances" are meant to be
>
> I think he wants it just to have a more convenient interface,
> which is not necessarily a bad thing. But then one could put the
> convenience into libnuma anyways.
>
> -Andi

Using libnuma sounds fine to me. On a 512 CPU system, with 4 CPUs/node,
we'd have 128 nodes. Re-exporting ALL the same data, those huge strings
of node-to-node distances, 512 *additional* times in the per-CPU sysfs
directories seems like a waste.

-Matt

2004-11-09 20:34:49

by Mark Goodwin

Subject: Re: Externalize SLIT table


On Tue, 9 Nov 2004, Matthew Dobson wrote:
> ...
> I don't think we should export the *exact same* node distance information
> through the CPUs, though.

We should still export cpu distances though because the distance between
cpus on the same node may not be equal. e.g. consider a node with multiple
cpu sockets, each socket with a hyperthreaded (or dual core) cpu.

Once again however, it depends on the definition of distance. For nodes,
we've established it's the ACPI SLIT (relative distance to memory). For
cpus, should it be distance to memory? Distance to cache? Registers? Or
what?

-- Mark

2004-11-09 22:03:58

by Jesse Barnes

Subject: Re: Externalize SLIT table

On Tuesday, November 09, 2004 3:34 pm, Mark Goodwin wrote:
> On Tue, 9 Nov 2004, Matthew Dobson wrote:
> > ...
> > I don't think we should export the *exact same* node distance information
> > through the CPUs, though.
>
> We should still export cpu distances though because the distance between
> cpus on the same node may not be equal. e.g. consider a node with multiple
> cpu sockets, each socket with a hyperthreaded (or dual core) cpu.
>
> Once again however, it depends on the definition of distance. For nodes,
> we've established it's the ACPI SLIT (relative distance to memory). For
> cpus, should it be distance to memory? Distance to cache? Registers? Or
> what?

Yeah, that's a tough call. We should definitely get the node stuff in there
now though, IMO. We can always add the CPU distances later if we figure out
what they should mean.

Jesse

2004-11-10 00:02:08

by Matthew Dobson

Subject: Re: Externalize SLIT table

On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote:
> On Tue, 9 Nov 2004, Matthew Dobson wrote:
> > ...
> > I don't think we should export the *exact same* node distance information
> > through the CPUs, though.
>
> We should still export cpu distances though because the distance between
> cpus on the same node may not be equal. e.g. consider a node with multiple
> cpu sockets, each socket with a hyperthreaded (or dual core) cpu.

Well, I'm not sure that just because a CPU has two hyperthread units in
the same core that those HT units have a different distance or latency
to memory...? The fact that it is a HT unit and not a physical core has
implications to the scheduler, but I thought that the 2 siblings looked
identical to userspace, no? If 2 CPUs in the same node are on the same
bus, then in all likelihood they have the same "distance".


> Once again however, it depends on the definition of distance. For nodes,
> we've established it's the ACPI SLIT (relative distance to memory). For
> cpus, should it be distance to memory? Distance to cache? Registers? Or
> what?
>
> -- Mark

That's the real issue. We need to agree upon a meaningful definition of
CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all
agree on what Node-to-Node "distance" means, but there doesn't appear to
be much consensus on what CPU "distance" means.

-Matt

2004-11-10 05:05:41

by Mark Goodwin

Subject: Re: Externalize SLIT table


On Tue, 9 Nov 2004, Matthew Dobson wrote:
> On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote:
>> Once again however, it depends on the definition of distance. For nodes,
>> we've established it's the ACPI SLIT (relative distance to memory). For
>> cpus, should it be distance to memory? Distance to cache? Registers? Or
>> what?
>>
> That's the real issue. We need to agree upon a meaningful definition of
> CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all
> agree on what Node-to-Node "distance" means, but there doesn't appear to
> be much consensus on what CPU "distance" means.

How about we define cpu-distance to be "relative distance to the
lowest level cache on another CPU". On a system that has nodes with
multiple sockets (each supporting multiple cores or HT "CPUs" sharing
some level of cache), when the scheduler needs to migrate a task it would
first choose a CPU sharing the same cache, then a CPU on the same node,
then an off-node CPU (i.e. falling back to node distance).

Of course, I have no idea if that's anything like an optimal or desirable
task migration policy. Probably depends on the cache-thrashiness of the task
being migrated.

-- Mark

2004-11-10 18:45:54

by Erich Focht

Subject: Re: Externalize SLIT table

On Wednesday 10 November 2004 06:05, Mark Goodwin wrote:
>
> On Tue, 9 Nov 2004, Matthew Dobson wrote:
> > On Tue, 2004-11-09 at 12:34, Mark Goodwin wrote:
> >> Once again however, it depends on the definition of distance. For nodes,
> >> we've established it's the ACPI SLIT (relative distance to memory). For
> >> cpus, should it be distance to memory? Distance to cache? Registers? Or
> >> what?
> >>
> > That's the real issue. We need to agree upon a meaningful definition of
> > CPU-to-CPU "distance". As Jesse mentioned in a follow-up, we can all
> > agree on what Node-to-Node "distance" means, but there doesn't appear to
> > be much consensus on what CPU "distance" means.
>
> How about we define cpu-distance to be "relative distance to the
> lowest level cache on another CPU".

Several definitions are possible; this is really a source of
confusion. Any of these can be reconstructed if one has access to the
constituents: node-to-node latency (SLIT) and cache-to-cache
latencies. The latter aren't available and would anyhow be better
placed in something like /proc/cpuinfo or similar. They are CPU or
package specific and have nothing to do with NUMA.

> On a system that has nodes with multiple sockets (each supporting
> multiple cores or HT "CPUs" sharing some level of cache), when the
> scheduler needs to migrate a task it would first choose a CPU
> sharing the same cache, then a CPU on the same node, then an
> off-node CPU (i.e. falling back to node distance).

This should be done by correctly setting up the sched domains. It's
not a question of exporting useless or redundant information to user
space.

The need for some (any) cpu-to-cpu metrics initially brought up by
Jack seemed mainly motivated by existing user space tools for
constructing cpusets (maybe in PBS). I think it is a tolerable effort
to introduce in user space an inlined function or macro doing
something like
cpu_metric(i,j) := node_metric(cpu_node(i),cpu_node(j))
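
In C that would be little more than the following (a sketch; the names
and sizes are made up, and the arrays would be filled from the sysfs
files discussed above):

#include <stdio.h>

#define MAX_CPUS	512
#define MAX_NODES	128

static int cpu_node[MAX_CPUS];			/* from the node directories' cpu entries */
static int node_metric[MAX_NODES][MAX_NODES];	/* from the per-node distance files */

static inline int cpu_metric(int i, int j)
{
	return node_metric[cpu_node[i]][cpu_node[j]];
}

int main(void)
{
	/* With real data loaded this would print e.g. 42. */
	printf("%d\n", cpu_metric(0, 8));
	return 0;
}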

It keeps the kernel free of misleading information which might just
slightly make cpusets construction more comfortable. In user space you
have the full freedom to enhance your metrics when getting more
details about the next generation cpus.

Regards,
Erich

2004-11-10 22:09:54

by Matthew Dobson

Subject: Re: Externalize SLIT table

On Wed, 2004-11-10 at 10:45, Erich Focht wrote:
> On Wednesday 10 November 2004 06:05, Mark Goodwin wrote:
> > On a system that has nodes with multiple sockets (each supporting
> > multiple cores or HT "CPUs" sharing some level of cache), when the
> > scheduler needs to migrate a task it would first choose a CPU
> > sharing the same cache, then a CPU on the same node, then an
> > off-node CPU (i.e. falling back to node distance).
>
> This should be done by correctly setting up the sched domains. It's
> not a question of exporting useless or redundant information to user
> space.
>
> The need for some (any) cpu-to-cpu metrics initially brought up by
> Jack seemed mainly motivated by existing user space tools for
> constructing cpusets (maybe in PBS). I think it is a tolerable effort
> to introduce in user space an inlined function or macro doing
> something like
> cpu_metric(i,j) := node_metric(cpu_node(i),cpu_node(j))
>
> It keeps the kernel free of misleading information which might just
> slightly make cpusets construction more comfortable. In user space you
> have the full freedom to enhance your metrics when getting more
> details about the next generation cpus.

Good point, Erich. I don't think there is any desperate need for
CPU-to-CPU distances to be exported to userspace right now. If that is
incorrect and someone really needs a particular distance metric to be
exported by the kernel, we can look into that and export the required
info. For now I think the Node-to-Node distance information is enough.
-Matt