Subject: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information

In case of AMD CPU northbridge functions this NUMA information might
differ.

Here is an example from a 4-socket system.

Currently Linux shows

root@hagen:/sys/devices/pci0000:00/0000:00:1a.4# cat numa_node
0
root@hagen:/sys/devices/pci0000:00/0000:00:1a.4# cat local_cpu*
0-3
00000000,0000000f

which is not correct for northbridge functions as the local CPUs
are those of the same socket.

With this patch and a quirk for AMD CPU NB functions Linux can
do better and correctly show

root@hagen:/sys/devices/pci0000:00/0000:00:1a.4# cat numa_node
2
root@hagen:/sys/devices/pci0000:00/0000:00:1a.4# cat local_cpu*
8-11
00000000,00000f00

Signed-off-by: Andreas Herrmann <[email protected]>
---
drivers/pci/pci-sysfs.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

The quirk for AMD CPU NB functions is contained in another patch
that I'll send to x86-maintainers for inclusion into tip tree.

Please apply.

Thanks,
Andreas

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index a7eb1b4..9360f3d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -74,7 +74,11 @@ static ssize_t local_cpus_show(struct device *dev,
const struct cpumask *mask;
int len;

+#ifdef CONFIG_NUMA
+ mask = cpumask_of_node(dev_to_node(dev));
+#else
mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
+#endif
len = cpumask_scnprintf(buf, PAGE_SIZE-2, mask);
buf[len++] = '\n';
buf[len] = '\0';
@@ -88,7 +92,11 @@ static ssize_t local_cpulist_show(struct device *dev,
const struct cpumask *mask;
int len;

+#ifdef CONFIG_NUMA
+ mask = cpumask_of_node(dev_to_node(dev));
+#else
mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
+#endif
len = cpulist_scnprintf(buf, PAGE_SIZE-2, mask);
buf[len++] = '\n';
buf[len] = '\0';
--
1.6.2



2009-04-17 16:21:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information


* Andreas Herrmann <[email protected]> wrote:

> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index a7eb1b4..9360f3d 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -74,7 +74,11 @@ static ssize_t local_cpus_show(struct device *dev,
> const struct cpumask *mask;
> int len;
>
> +#ifdef CONFIG_NUMA
> + mask = cpumask_of_node(dev_to_node(dev));
> +#else
> mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
> +#endif
> len = cpumask_scnprintf(buf, PAGE_SIZE-2, mask);
> buf[len++] = '\n';
> buf[len] = '\0';
> @@ -88,7 +92,11 @@ static ssize_t local_cpulist_show(struct device *dev,
> const struct cpumask *mask;
> int len;
>
> +#ifdef CONFIG_NUMA
> + mask = cpumask_of_node(dev_to_node(dev));
> +#else
> mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
> +#endif

No objections against the change (at all), but this pattern cries
out for a different, cleaner solution.

Shouldnt there be a cpumask_of_pcidev(dev) helper instead, which
[recognizing that most PCI devices dont get their node info
initialized in practice] would do something like:

const struct cpumask * cpumask_of_pcidev(struct pci_dev *dev)
{
if (dev->numa_node == -1)
return cpumask_of_pcibus(to_pci_dev(dev)->bus);

return cpumask_of_node(dev_to_node(dev));
}

? This would work fine in all cases.

Which you could thus use in both cases above, cleanly.

Ingo

2009-04-17 19:27:11

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information

On Fri, Apr 17, 2009 at 9:21 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andreas Herrmann <[email protected]> wrote:
>
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index a7eb1b4..9360f3d 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -74,7 +74,11 @@ static ssize_t local_cpus_show(struct device *dev,
>> ? ? ? const struct cpumask *mask;
>> ? ? ? int len;
>>
>> +#ifdef CONFIG_NUMA
>> + ? ? mask = cpumask_of_node(dev_to_node(dev));
>> +#else
>> ? ? ? mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
>> +#endif
>> ? ? ? len = cpumask_scnprintf(buf, PAGE_SIZE-2, mask);
>> ? ? ? buf[len++] = '\n';
>> ? ? ? buf[len] = '\0';
>> @@ -88,7 +92,11 @@ static ssize_t local_cpulist_show(struct device *dev,
>> ? ? ? const struct cpumask *mask;
>> ? ? ? int len;
>>
>> +#ifdef CONFIG_NUMA
>> + ? ? mask = cpumask_of_node(dev_to_node(dev));
>> +#else
>> ? ? ? mask = cpumask_of_pcibus(to_pci_dev(dev)->bus);
>> +#endif
>
> No objections against the change (at all), but this pattern cries
> out for a different, cleaner solution.
>
> Shouldnt there be a cpumask_of_pcidev(dev) helper instead, which
> [recognizing that most PCI devices dont get their node info
> initialized in practice] would do something like:
>
> const struct cpumask * cpumask_of_pcidev(struct pci_dev *dev)
> {
> ? ? ? ?if (dev->numa_node == -1)
> ? ? ? ? ? ? ? ?return cpumask_of_pcibus(to_pci_dev(dev)->bus);
>
> ? ? ? ?return cpumask_of_node(dev_to_node(dev));
> }
>
> ? This would work fine in all cases.

you are right, dev_to_node(dev) could return -1 on 64bit, if there is
no memory on that node.

YH

Subject: Re: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information

On Fri, Apr 17, 2009 at 12:26:54PM -0700, Yinghai Lu wrote:
> On Fri, Apr 17, 2009 at 9:21 AM, Ingo Molnar <[email protected]> wrote:
> > const struct cpumask * cpumask_of_pcidev(struct pci_dev *dev)
> > {
> > ? ? ? ?if (dev->numa_node == -1)
> > ? ? ? ? ? ? ? ?return cpumask_of_pcibus(to_pci_dev(dev)->bus);
> >
> > ? ? ? ?return cpumask_of_node(dev_to_node(dev));
> > }
> >
> > ? This would work fine in all cases.

Yes, I think so. That's the general solution w/o additional
"ifdefing".

> you are right, dev_to_node(dev) could return -1 on 64bit, if there is
> no memory on that node.

Hmm, I thought just in the CONFIG_NUMA=n case -1 is returned.

During initialization the struct device's numa_node is set to -1 and
later on the information is inherited from the parent numa_node.

So what do I miss?


Thanks,
Andreas

--
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. M?nchen, Germany
Research | Gesch?ftsf?hrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis M?nchen
(OSRC) | Registergericht M?nchen, HRB Nr. 43632

2009-04-20 20:03:54

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information

On Mon, 20 Apr 2009 10:47:47 +0200
Andreas Herrmann <[email protected]> wrote:

> On Fri, Apr 17, 2009 at 12:26:54PM -0700, Yinghai Lu wrote:
> > On Fri, Apr 17, 2009 at 9:21 AM, Ingo Molnar <[email protected]> wrote:
> > > const struct cpumask * cpumask_of_pcidev(struct pci_dev *dev)
> > > {
> > >        if (dev->numa_node == -1)
> > >                return cpumask_of_pcibus(to_pci_dev(dev)->bus);
> > >
> > >        return cpumask_of_node(dev_to_node(dev));
> > > }
> > >
> > > ? This would work fine in all cases.
>
> Yes, I think so. That's the general solution w/o additional
> "ifdefing".
>
> > you are right, dev_to_node(dev) could return -1 on 64bit, if there
> > is no memory on that node.
>
> Hmm, I thought just in the CONFIG_NUMA=n case -1 is returned.
>
> During initialization the struct device's numa_node is set to -1 and
> later on the information is inherited from the parent numa_node.
>
> So what do I miss?

I like the idea of cpumask_of_pcidev(), but it seems like
cpumask_of_pcibus should return the same value. So if the node is
unassigned or "equadistant" (there's code that treats -1 as both I
think), cpumask_of_pcibus should figure out what the nearest CPUs are
and return that, right?

--
Jesse Barnes, Intel Open Source Technology Center

2009-04-20 21:24:41

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information

On Mon, Apr 20, 2009 at 1:47 AM, Andreas Herrmann
<[email protected]> wrote:
> On Fri, Apr 17, 2009 at 12:26:54PM -0700, Yinghai Lu wrote:
>> On Fri, Apr 17, 2009 at 9:21 AM, Ingo Molnar <[email protected]> wrote:
>> > const struct cpumask * cpumask_of_pcidev(struct pci_dev *dev)
>> > {
>> > ? ? ? ?if (dev->numa_node == -1)
>> > ? ? ? ? ? ? ? ?return cpumask_of_pcibus(to_pci_dev(dev)->bus);
>> >
>> > ? ? ? ?return cpumask_of_node(dev_to_node(dev));
>> > }
>> >
>> > ? This would work fine in all cases.
>
> Yes, I think so. That's the general solution w/o additional
> "ifdefing".
>
>> you are right, dev_to_node(dev) could return -1 on 64bit, if there is
>> no memory on that node.
>
> Hmm, I thought just in the CONFIG_NUMA=n case -1 is returned.
>
> During initialization the struct device's numa_node is set to -1 and
> later on the information is inherited from the parent numa_node.
>
parent numa_node could be -1 too.

in amd_bus.c
int get_mp_bus_to_node(int busnum)
{
int node = -1;

if (busnum < 0 || busnum > (BUS_NR - 1))
return node;

node = mp_bus_to_node[busnum];

/*
* let numa_node_id to decide it later in dma_alloc_pages
* if there is no ram on that node
*/
if (node != -1 && !node_online(node))
node = -1;

return node;
}


YH

Subject: Re: [PATCH] pci: derive nearby CPUs from device's instead of bus' NUMA information

On Mon, Apr 20, 2009 at 02:23:47PM -0700, Yinghai Lu wrote:
> On Mon, Apr 20, 2009 at 1:47 AM, Andreas Herrmann
> <[email protected]> wrote:
> > During initialization the struct device's numa_node is set to -1 and
> > later on the information is inherited from the parent numa_node.
> >
> parent numa_node could be -1 too.
>
> in amd_bus.c
> int get_mp_bus_to_node(int busnum)
> {
> int node = -1;
>
> if (busnum < 0 || busnum > (BUS_NR - 1))
> return node;
>
> node = mp_bus_to_node[busnum];
>
> /*
> * let numa_node_id to decide it later in dma_alloc_pages
> * if there is no ram on that node
> */
> if (node != -1 && !node_online(node))
> node = -1;
>
> return node;
> }

Ok, I see.
Thanks,

Andreas