2008-01-31 00:48:23

by Chris Snook

Subject: Purpose of numa_node?

While pondering ways to optimize I/O and swapping on large NUMA machines, I
noticed that the numa_node field in struct device isn't actually used anywhere.
We just have a couple dozen lines of code to conditionally create a sysfs file
that will always return -1. Is anyone even working on code to actually use this
field? I think it's a good piece of information to keep track of, so I'm not
suggesting we remove it, but I want to make sure I'm not stepping on toes or
duplicating effort if I try to make it useful.

-- Chris


2008-01-31 08:12:04

by Paul Mundt

Subject: Re: Purpose of numa_node?

On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> While pondering ways to optimize I/O and swapping on large NUMA machines, I
> noticed that the numa_node field in struct device isn't actually used
> anywhere. We just have a couple dozen lines of code to conditionally
> create a sysfs file that will always return -1. Is anyone even working on
> code to actually use this field? I think it's a good piece of information
> to keep track of, so I'm not suggesting we remove it, but I want to make
> sure I'm not stepping on toes or duplicating effort if I try to make it
> useful.
>
It's manipulated with accessors. If you look at the users of
dev_to_node()/set_dev_node() you can see where it's being used. It's
primarily used in allocation paths for node locality, and the existing
set_dev_node() callsites are places where node locality information
already exists (i.e., which node a given controller sits on). You can see
this in places like PCI (pcibus_to_node()) and USB, with node allocation
hints used in places like the dmapool and skb alloc paths.

The in-kernel use looks perfectly sane in that regard, though I'm not
sure what the point of exporting this as a RO attribute to userspace is.
Presumably someone has a tool somewhere that cares about this.
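
The accessor pattern described above can be illustrated with a small userspace model. This is only a sketch, not the kernel source: struct device is reduced to two fields, NUMA_NO_NODE is a name used here for the -1 "unknown" value, and alloc_node_for() is a hypothetical helper showing how an allocation path might fall back when a device carries no locality hint.

```c
#define NUMA_NO_NODE (-1)   /* name assumed here for the "unknown" value -1 */

/* Simplified userspace stand-in for the kernel's struct device. */
struct device {
    const char *name;
    int numa_node;          /* NUMA_NO_NODE when locality is unknown */
};

/* Modeled on the accessors named in the thread. */
static int dev_to_node(const struct device *dev)
{
    return dev->numa_node;
}

static void set_dev_node(struct device *dev, int node)
{
    dev->numa_node = node;
}

/* Hypothetical helper: choose the node to allocate from for a device,
 * falling back to the caller's local node when there is no hint. */
static int alloc_node_for(const struct device *dev, int local_node)
{
    int node = dev_to_node(dev);

    return node == NUMA_NO_NODE ? local_node : node;
}
```

In this model, a bus driver that knows where a controller sits would call set_dev_node() once at probe time, and later allocations for that device would consult dev_to_node().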

2008-01-31 09:56:24

by Andi Kleen

Subject: Re: Purpose of numa_node?

Paul Mundt writes:
>
> The in-kernel use looks perfectly sane in that regard, though I'm not
> sure what the point of exporting this as a RO attribute to userspace is.
> Presumably someone has a tool somewhere that cares about this.

The idea was to allow, e.g., a NUMA-aware irqbalanced that directs
interrupts to the same node the device is connected to. I don't know if it
was ever actually implemented.

-Andi
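
As a rough illustration of what such a balancer would do for a single interrupt, the sketch below copies a node's CPU mask into an IRQ's affinity file. It assumes the node mask is available at <sysfs>/devices/system/node/nodeN/cpumap and that <proc>/irq/N/smp_affinity accepts the same hex mask format; the function name and the two root parameters are hypothetical, and a real daemon would also need to handle offline CPUs and interrupts whose affinity cannot be changed.

```c
#include <stdio.h>

/* Copy the CPU mask of NUMA node `node` into the affinity file of `irq`.
 * sysfs_root and proc_root would normally be "/sys" and "/proc"; they are
 * parameters so the function can be tried on a scratch directory.
 * Returns 0 on success, -1 on any I/O error. */
static int bind_irq_to_node(const char *sysfs_root, const char *proc_root,
                            int node, int irq)
{
    char src[256], dst[256], mask[4096];

    snprintf(src, sizeof src, "%s/devices/system/node/node%d/cpumap",
             sysfs_root, node);
    snprintf(dst, sizeof dst, "%s/irq/%d/smp_affinity", proc_root, irq);

    /* Read the node's CPU mask (one line of hex). */
    FILE *in = fopen(src, "r");
    if (!in)
        return -1;
    char *got = fgets(mask, sizeof mask, in);
    fclose(in);
    if (!got)
        return -1;

    /* Write it unchanged as the interrupt's affinity mask. */
    FILE *out = fopen(dst, "w");
    if (!out)
        return -1;
    int rc = fputs(mask, out) < 0 ? -1 : 0;
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}
```

The node argument would come from the device's numa_node attribute; a value of -1 there means there is nothing to bind to.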

2008-01-31 13:42:26

by Brice Goglin

Subject: Re: Purpose of numa_node?

Paul Mundt wrote:
> On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
>
>> While pondering ways to optimize I/O and swapping on large NUMA machines, I
>> noticed that the numa_node field in struct device isn't actually used
>> anywhere. We just have a couple dozen lines of code to conditionally
>> create a sysfs file that will always return -1. Is anyone even working on
>> code to actually use this field? I think it's a good piece of information
>> to keep track of, so I'm not suggesting we remove it, but I want to make
>> sure I'm not stepping on toes or duplicating effort if I try to make it
>> useful.
>>
> It's manipulated with accessors. If you look at the users of
> dev_to_node()/set_dev_node() you can see where it's being used. It's
> primarily used in allocation paths for node locality, and the existing
> set_dev_node() callsites are places where node locality information
> already exists (ie, which node a given controller sits on). You can see
> this in places like PCI (pcibus_to_node()) and USB, with node allocation
> hints used in places like the dmapool and skb alloc paths.
>
> The in-kernel use looks perfectly sane in that regard, though I'm not
> sure what the point of exporting this as a RO attribute to userspace is.
> Presumably someone has a tool somewhere that cares about this.
>

I added the numa_node sysfs attribute in the beginning to make it easier
to bind processes near some devices. So yes I have some user-space tool
using it. It is much easier to use than the local_cpus field on large
machines, especially when you use the libnuma interface to bind things,
since you don't have to translate numa_node from/to cpumasks.

It works fine on regular machines such as dual opterons. However, I
noticed recently that it was wrong on some quad-opteron machines (see
http://marc.info/?l=linux-pci&m=119072400008538&w=2) because something
is not initialized in the right order. But I haven't tested 2.6.24 on
this hardware yet, and I don't know if things have changed regarding this.

Brice
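
A user-space tool of the kind described above mostly needs to read one integer from sysfs. The following is a minimal sketch under that assumption; read_numa_node() is a hypothetical name, and the path is supplied by the caller (for example a device's numa_node attribute under /sys/devices).

```c
#include <stdio.h>

/* Read the integer stored in a sysfs-style numa_node file.
 * On success, stores the node in *node and returns 0.
 * Returns -1 if the file cannot be opened or does not hold an integer. */
static int read_numa_node(const char *path, int *node)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    int ok = fscanf(f, "%d", node) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

A non-negative result could then be handed to libnuma, for instance numa_run_on_node(node), to run the calling process near the device. A result of -1 means the kernel has no locality information for that device, so no binding should be attempted.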

2008-01-31 21:30:09

by Yinghai Lu

Subject: Re: Purpose of numa_node?

On Jan 31, 2008 5:42 AM, Brice Goglin wrote:
> Paul Mundt wrote:
> > On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> >
> >> While pondering ways to optimize I/O and swapping on large NUMA machines, I
> >> noticed that the numa_node field in struct device isn't actually used
> >> anywhere. We just have a couple dozen lines of code to conditionally
> >> create a sysfs file that will always return -1. Is anyone even working on
> >> code to actually use this field? I think it's a good piece of information
> >> to keep track of, so I'm not suggesting we remove it, but I want to make
> >> sure I'm not stepping on toes or duplicating effort if I try to make it
> >> useful.
> >>
> > It's manipulated with accessors. If you look at the users of
> > dev_to_node()/set_dev_node() you can see where it's being used. It's
> > primarily used in allocation paths for node locality, and the existing
> > set_dev_node() callsites are places where node locality information
> > already exists (ie, which node a given controller sits on). You can see
> > this in places like PCI (pcibus_to_node()) and USB, with node allocation
> > hints used in places like the dmapool and skb alloc paths.
> >
> > The in-kernel use looks perfectly sane in that regard, though I'm not
> > sure what the point of exporting this as a RO attribute to userspace is.
> > Presumably someone has a tool somewhere that cares about this.
> >
>
> I added the numa_node sysfs attribute in the beginning to make it easier
> to bind processes near some devices. So yes I have some user-space tool
> using it. It is much easier to use than the local_cpus field on large
> machines, especially when you use the libnuma interface to bind things,
> since you don't have to translate numa_node from/to cpumasks.
>
> It works fine on regular machines such as dual opterons. However, I
> noticed recently that it was wrong on some quad-opteron machines (see
> http://marc.info/?l=linux-pci&m=119072400008538&w=2) because something
> is not initialized in the right order. But I haven't tested 2.6.24 on
> this hardware yet, and I don't know if things have changed regarding this.

That depends on whether your DSDT has _PXM for your PCI root bus.
Otherwise you will get -1 everywhere.

I have a local patchset, called bus_numa, that can get this information
from PCI config space on AMD64-based machines. You can use it on AMD64
systems without _PXM for the PCI root bus, or even with acpi=off.

Let me know if you want to test it.

YH

2008-01-31 21:35:35

by Brice Goglin

Subject: Re: Purpose of numa_node?

Yinghai Lu wrote:
> On Jan 31, 2008 5:42 AM, Brice Goglin wrote:
>
>> It works fine on regular machines such as dual opterons. However, I
>> noticed recently that it was wrong on some quad-opteron machines (see
>> http://marc.info/?l=linux-pci&m=119072400008538&w=2) because something
>> is not initialized in the right order. But I haven't tested 2.6.24 on
>> this hardware yet, and I don't know if things have changed regarding this.
>>
>
> that will depend if you dsdt have _PXM for your pci root bus.
> otherwise you will get all -1
>

Have a look at the above link. I don't get -1. I get 0 everywhere, while
I should get 1 for some devices. And if I unplug/replug a device using
fakephp, numa_node becomes correct (1 instead of 0). This just looks
like the code is there but things are initialized in the wrong order.

Brice

2008-01-31 21:42:19

by Yinghai Lu

Subject: Re: Purpose of numa_node?

On Jan 31, 2008 1:35 PM, Brice Goglin wrote:
> Yinghai Lu wrote:
> > On Jan 31, 2008 5:42 AM, Brice Goglin wrote:
> >
> >> It works fine on regular machines such as dual opterons. However, I
> >> noticed recently that it was wrong on some quad-opteron machines (see
> >> http://marc.info/?l=linux-pci&m=119072400008538&w=2) because something
> >> is not initialized in the right order. But I haven't tested 2.6.24 on
> >> this hardware yet, and I don't know if things have changed regarding this.
> >>
> >
> > that will depend if you dsdt have _PXM for your pci root bus.
> > otherwise you will get all -1
> >
>
> Have a look at the above link. I don't get -1. I get 0 everywhere, while
> I should get 1 for some devices. And if I unplug/replug a device using
> fakephp, numa_node becomes correct (1 instead of 0). This just looks
> like the code is there but things are initialized in the wrong order.

do you have
...
bus 00 -> pxm 0 -> node 0
...
bus 40 -> pxm 1 -> node 1
...
bus 80 -> pxm 1 -> node 1

in your boot msg or dmesg?

If not, your DSDT doesn't have _PXM for the PCI root bus. You either need
to ask your HW vendor to add it to their BIOS, or use my patchset.

YH
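
Checking a saved dmesg for the lines quoted above can be automated. This sketch assumes the messages look exactly like the samples in the message ("bus 40 -> pxm 1 -> node 1"), possibly preceded by other text on the same line, and that the bus number is printed in hex; struct bus_node and parse_bus_pxm_line() are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

struct bus_node {
    unsigned bus;   /* PCI root bus number, printed in hex in the log */
    int pxm;        /* ACPI proximity domain */
    int node;       /* NUMA node the kernel mapped it to */
};

/* Look for "bus XX -> pxm N -> node M" anywhere in a log line.
 * Returns 1 and fills *out on a full match, 0 otherwise.
 * On failure the contents of *out are unspecified. */
static int parse_bus_pxm_line(const char *line, struct bus_node *out)
{
    for (const char *p = strstr(line, "bus "); p; p = strstr(p + 1, "bus ")) {
        if (sscanf(p, "bus %x -> pxm %d -> node %d",
                   &out->bus, &out->pxm, &out->node) == 3)
            return 1;
    }
    return 0;
}
```

If no line of the boot log matches, that is consistent with the case described above where the DSDT provides no _PXM for the PCI root bus.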

2008-01-31 23:35:28

by Yinghai Lu

Subject: Re: Purpose of numa_node?

On Jan 31, 2008 1:42 PM, Yinghai Lu wrote:
>
> On Jan 31, 2008 1:35 PM, Brice Goglin wrote:
> > Yinghai Lu wrote:
> > > On Jan 31, 2008 5:42 AM, Brice Goglin wrote:
> > >
> > >> It works fine on regular machines such as dual opterons. However, I
> > >> noticed recently that it was wrong on some quad-opteron machines (see
> > >> http://marc.info/?l=linux-pci&m=119072400008538&w=2) because something
> > >> is not initialized in the right order. But I haven't tested 2.6.24 on
> > >> this hardware yet, and I don't know if things have changed regarding this.
> > >>
> > >
> > > that will depend if you dsdt have _PXM for your pci root bus.
> > > otherwise you will get all -1
> > >
> >
> > Have a look at the above link. I don't get -1. I get 0 everywhere, while
> > I should get 1 for some devices. And if I unplug/replug a device using
> > fakephp, numa_node becomes correct (1 instead of 0). This just looks
> > like the code is there but things are initialized in the wrong order.
>
> do you have
> ...
> bus 00 -> pxm 0 -> node 0
> ...
> bus 40 -> pxm 1 -> node 1
> ...
> bus 80 -> pxm 1 -> node 1
>
> in your boot msg or dmesg?
>
> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
> ask your HW vendor to add that in their BIOS, or use my patchset.

Please try the attached patchset.

Please get x86.git, then use quilt to apply the patches.

http://people.redhat.com/mingo/x86.git/README

YH


Attachments:
(No filename) (1.45 kB)
patches_01312008_mm_bus_numa.tar.bz2 (12.07 kB)

2008-02-13 18:52:32

by Brice Goglin

Subject: Re: Purpose of numa_node?

Yinghai Lu wrote:
>>> Have a look at the above link. I don't get -1. I get 0 everywhere, while
>>> I should get 1 for some devices. And if I unplug/replug a device using
>>> fakephp, numa_node becomes correct (1 instead of 0). This just looks
>>> like the code is there but things are initialized in the wrong order.
>>>
>> do you have
>> ...
>> bus 00 -> pxm 0 -> node 0
>> ...
>> bus 40 -> pxm 1 -> node 1
>> ...
>> bus 80 -> pxm 1 -> node 1
>>
>> in your boot msg or dmesg?
>>
>> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
>> ask your HW vendor to add that in their BIOS, or use my patchset.
>>
>
> please try the attached patchset
>
> please get x86.git then use quilt apply the patch
>
> http://people.redhat.com/mingo/x86.git/README
>

I finally managed to test this and it seems to work. I now get the
following numa_node attributes:
/sys/devices/pci0000:00/0000:00:01.0/numa_node 0
/sys/devices/pci0000:00/0000:00:07.0/numa_node 0
/sys/devices/pci0000:00/0000:00:07.0/0000:38:0d.0/numa_node 0
/sys/devices/pci0000:00/0000:00:08.0/numa_node 0
/sys/devices/pci0000:00/0000:00:08.1/numa_node 0
/sys/devices/pci0000:00/0000:00:08.2/numa_node 0
/sys/devices/pci0000:00/0000:00:09.0/numa_node 0
/sys/devices/pci0000:00/0000:00:09.1/numa_node 0
/sys/devices/pci0000:00/0000:00:09.2/numa_node 0
/sys/devices/pci0000:00/0000:00:0a.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0a.0/0000:22:00.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0b.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0c.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0c.0/0000:0c:00.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0c.0/0000:0c:00.0/0000:0d:00.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0d.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0d.0/0000:01:00.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0e.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0e.0/0000:17:00.0/numa_node 0
/sys/devices/pci0000:00/0000:00:0e.0/0000:17:00.0/0000:18:00.0/numa_node 0
/sys/devices/pci0000:00/0000:00:18.0/numa_node 0
/sys/devices/pci0000:00/0000:00:18.1/numa_node 0
/sys/devices/pci0000:00/0000:00:18.2/numa_node 0
/sys/devices/pci0000:00/0000:00:18.3/numa_node 0
/sys/devices/pci0000:00/0000:00:19.0/numa_node 0
/sys/devices/pci0000:00/0000:00:19.1/numa_node 0
/sys/devices/pci0000:00/0000:00:19.2/numa_node 0
/sys/devices/pci0000:00/0000:00:19.3/numa_node 0
/sys/devices/pci0000:00/0000:00:1a.0/numa_node 0
/sys/devices/pci0000:00/0000:00:1a.1/numa_node 0
/sys/devices/pci0000:00/0000:00:1a.2/numa_node 0
/sys/devices/pci0000:00/0000:00:1a.3/numa_node 0
/sys/devices/pci0000:00/0000:00:1b.0/numa_node 0
/sys/devices/pci0000:00/0000:00:1b.1/numa_node 0
/sys/devices/pci0000:00/0000:00:1b.2/numa_node 0
/sys/devices/pci0000:00/0000:00:1b.3/numa_node 0
/sys/devices/pci0000:40/0000:40:0f.0/numa_node 1
/sys/devices/pci0000:40/0000:40:10.0/numa_node 1
/sys/devices/pci0000:40/0000:40:11.0/numa_node 1
/sys/devices/pci0000:40/0000:40:12.0/numa_node 1
/sys/devices/pci0000:40/0000:40:12.0/0000:51:00.0/numa_node 1
/sys/devices/pci0000:40/0000:40:13.0/numa_node 1

The 5 last lines above would report 0 instead of 1 with an older kernel.
Everything looks correct now (0000:40 is the second PCIe bus and it is
attached to socket #1).

Thanks a lot, Yinghai! Are you planning to merge these patches in the
near future? 2.6.26?

Brice

PS: I saved the corresponding dmesg. If you want to look at it, please
let me know.

2008-02-13 21:52:26

by Yinghai Lu

Subject: Re: Purpose of numa_node?

On Feb 13, 2008 10:52 AM, Brice Goglin wrote:
> Yinghai Lu wrote:
> >>> Have a look at the above link. I don't get -1. I get 0 everywhere, while
> >>> I should get 1 for some devices. And if I unplug/replug a device using
> >>> fakephp, numa_node becomes correct (1 instead of 0). This just looks
> >>> like the code is there but things are initialized in the wrong order.
> >>>
> >> do you have
> >> ...
> >> bus 00 -> pxm 0 -> node 0
> >> ...
> >> bus 40 -> pxm 1 -> node 1
> >> ...
> >> bus 80 -> pxm 1 -> node 1
> >>
> >> in your boot msg or dmesg?
> >>
> >> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
> >> ask your HW vendor to add that in their BIOS, or use my patchset.
> >>
> >
> > please try the attached patchset
> >
> > please get x86.git then use quilt apply the patch
> >
> > http://people.redhat.com/mingo/x86.git/README
> >
>
> I finally managed to test this and it seems to work. I now get the
> following numa_node attributes:
> /sys/devices/pci0000:00/0000:00:01.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:07.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:07.0/0000:38:0d.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:08.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:08.1/numa_node 0
> /sys/devices/pci0000:00/0000:00:08.2/numa_node 0
> /sys/devices/pci0000:00/0000:00:09.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:09.1/numa_node 0
> /sys/devices/pci0000:00/0000:00:09.2/numa_node 0
> /sys/devices/pci0000:00/0000:00:0a.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0a.0/0000:22:00.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0b.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0c.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0c.0/0000:0c:00.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0c.0/0000:0c:00.0/0000:0d:00.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0d.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0d.0/0000:01:00.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0e.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0e.0/0000:17:00.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:0e.0/0000:17:00.0/0000:18:00.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:18.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:18.1/numa_node 0
> /sys/devices/pci0000:00/0000:00:18.2/numa_node 0
> /sys/devices/pci0000:00/0000:00:18.3/numa_node 0
> /sys/devices/pci0000:00/0000:00:19.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:19.1/numa_node 0
> /sys/devices/pci0000:00/0000:00:19.2/numa_node 0
> /sys/devices/pci0000:00/0000:00:19.3/numa_node 0
> /sys/devices/pci0000:00/0000:00:1a.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:1a.1/numa_node 0
> /sys/devices/pci0000:00/0000:00:1a.2/numa_node 0
> /sys/devices/pci0000:00/0000:00:1a.3/numa_node 0
> /sys/devices/pci0000:00/0000:00:1b.0/numa_node 0
> /sys/devices/pci0000:00/0000:00:1b.1/numa_node 0
> /sys/devices/pci0000:00/0000:00:1b.2/numa_node 0
> /sys/devices/pci0000:00/0000:00:1b.3/numa_node 0
> /sys/devices/pci0000:40/0000:40:0f.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:10.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:11.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:12.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:12.0/0000:51:00.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:13.0/numa_node 1
>
> The 5 last lines above would report 0 instead of 1 with an older kernel.
> Everything looks correct now (0000:40 is the second PCIe bus and it is
> attached to socket #1).
>
> Thanks a lot, Yinghai! Are you planning to merge these patches in the
> near future? 2.6.26?

Andi thought that was too hardware-related...
They have stayed in -mm for a while.

This patchset is only useful when you have several HT chains and the BIOS
doesn't have pxm->node in the DSDT, or doesn't allocate I/O resources to
some of the add-on cards.

YH

2008-02-20 21:55:48

by Yinghai Lu

Subject: Re: Purpose of numa_node?

On Wed, Feb 13, 2008 at 10:52 AM, Brice Goglin wrote:
> /sys/devices/pci0000:40/0000:40:0f.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:10.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:11.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:12.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:12.0/0000:51:00.0/numa_node 1
> /sys/devices/pci0000:40/0000:40:13.0/numa_node 1
>
> The 5 last lines above would report 0 instead of 1 with an older kernel.
> Everything looks correct now (0000:40 is the second PCIe bus and it is
> attached to socket #1).
>
> Thanks a lot, Yinghai! Are you planning to merge these patches in the
> near future? 2.6.26?
>
Ingo put them in x86.git#testing.

Please check
http://people.redhat.com/mingo/x86.git/README
to get them.

YH