2006-10-30 14:15:28

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH 2/3] add dev_to_node()

Davem suggested to get the node-affinity information directly from
struct device instead of having the caller extreact it from the
pci_dev. This patch adds dev_to_node() to the topology API for that.
The implementation is rather ugly as we need to compare the bus
operations which we can't do inline in a header without pulling all
kinds of mess in.

Thus provide an out of line dev_to_node for ppc and let everyone else
use the dummy variant in asm-generic.h for now.


Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/include/asm-generic/topology.h
===================================================================
--- linux-2.6.orig/include/asm-generic/topology.h 2006-10-10 14:53:52.000000000 +0200
+++ linux-2.6/include/asm-generic/topology.h 2006-10-30 13:42:22.000000000 +0100
@@ -45,11 +45,14 @@
#define pcibus_to_node(node) (-1)
#endif

+#ifndef dev_to_node
+#define dev_to_node(dev) (-1)
+#endif
+
#ifndef pcibus_to_cpumask
#define pcibus_to_cpumask(bus) (pcibus_to_node(bus) == -1 ? \
CPU_MASK_ALL : \
node_to_cpumask(pcibus_to_node(bus)) \
)
#endif
-
#endif /* _ASM_GENERIC_TOPOLOGY_H */
Index: linux-2.6/include/asm-powerpc/topology.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/topology.h 2006-10-10 14:53:52.000000000 +0200
+++ linux-2.6/include/asm-powerpc/topology.h 2006-10-30 14:03:44.000000000 +0100
@@ -5,6 +5,7 @@

struct sys_device;
struct device_node;
+struct device;

#ifdef CONFIG_NUMA

@@ -33,6 +34,7 @@

struct pci_bus;
extern int pcibus_to_node(struct pci_bus *bus);
+int dev_to_node(struct device *dev);

#define pcibus_to_cpumask(bus) (pcibus_to_node(bus) == -1 ? \
CPU_MASK_ALL : \
Index: linux-2.6/arch/powerpc/kernel/pci_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/pci_64.c 2006-10-23 17:21:43.000000000 +0200
+++ linux-2.6/arch/powerpc/kernel/pci_64.c 2006-10-30 14:02:40.000000000 +0100
@@ -1424,4 +1424,12 @@
return phb->node;
}
EXPORT_SYMBOL(pcibus_to_node);
+
+int dev_to_node(struct device *dev)
+{
+ if (dev->bus == &pci_bus_type)
+ return pcibus_to_node(to_pci_dev(dev)->bus);
+ return -1;
+}
+EXPORT_SYMBOL(dev_to_node);
#endif


2006-10-30 22:34:06

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

From: Christoph Hellwig <[email protected]>
Date: Mon, 30 Oct 2006 15:15:01 +0100

> Davem suggested to get the node-affinity information directly from
> struct device instead of having the caller extreact it from the
> pci_dev. This patch adds dev_to_node() to the topology API for that.
> The implementation is rather ugly as we need to compare the bus
> operations which we can't do inline in a header without pulling all
> kinds of mess in.
>
> Thus provide an out of line dev_to_node for ppc and let everyone else
> use the dummy variant in asm-generic.h for now.
>
> Signed-off-by: Christoph Hellwig <[email protected]>

It may be a bit much to be calling all the way through up to the PCI
layer just to pluck out a simple integer, don't you think? The PCI
bus pointer comparison is just a symptom of how silly this is.

Especially since this will be used for every packet allocation a
device makes.

So, please add some sanity to this situation and just put the node
into the generic struct device. :-)

2006-11-01 00:11:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Mon, 30 Oct 2006, David Miller wrote:

> So, please add some sanity to this situation and just put the node
> into the generic struct device. :-)

Good. Then we can remove the node from the pci structure and get rid of
pcibus_to_node?

2006-11-01 00:53:18

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

From: Christoph Lameter <[email protected]>
Date: Tue, 31 Oct 2006 16:10:48 -0800 (PST)

> On Mon, 30 Oct 2006, David Miller wrote:
>
> > So, please add some sanity to this situation and just put the node
> > into the generic struct device. :-)
>
> Good. Then we can remove the node from the pci structure and get rid of
> pcibus_to_node?

Yes, that's possible, because the idea is that the arch specific
bus layer code would initialize the node value. Therefore, there
would be no need for things like pcibus_to_node() any longer.

2006-11-01 01:58:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Tue, 31 Oct 2006, David Miller wrote:

> Yes, that's possible, because the idea is that the arch specific
> bus layer code would initialize the node value. Therefore, there
> would be no need for things like pcibus_to_node() any longer.

Then lets rename pcibus_to_node to dev_to_node() throughout the kernel.
Provide a -1 default. Then other device layers that are not based on pci
will also be able to exploit NUMA locality.

2006-11-04 22:56:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Mon, Oct 30, 2006 at 02:33:57PM -0800, David Miller wrote:
> It may be a bit much to be calling all the way through up to the PCI
> layer just to pluck out a simple integer, don't you think? The PCI
> bus pointer comparison is just a symptom of how silly this is.
>
> Especially since this will be used for every packet allocation a
> device makes.
>
> So, please add some sanity to this situation and just put the node
> into the generic struct device. :-)

I was concerned about growing struct device, on smaller system it
already eats up a lot of memory. But we can make the node member
conditional on CONFIG_NUMA, as I did in the patch below.

This directly replaces PATCH 2/2 (the one we're replying to), all
others remain unmodified.


Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h 2006-10-29 16:02:38.000000000 +0100
+++ linux-2.6/include/linux/device.h 2006-11-02 12:47:17.000000000 +0100
@@ -347,6 +347,9 @@
BIOS data),reserved for device core*/
struct dev_pm_info power;

+#ifdef CONFIG_NUMA
+ int numa_node; /* NUMA node this device is close to */
+#endif
u64 *dma_mask; /* dma mask (if dma'able device) */
u64 coherent_dma_mask;/* Like dma_mask, but for
alloc_coherent mappings as
@@ -368,6 +371,12 @@
void (*release)(struct device * dev);
};

+#ifdef CONFIG_NUMA
+#define dev_to_node(dev) ((dev)->numa_node)
+#else
+#define dev_to_node(dev) (-1)
+#endif
+
static inline void *
dev_get_drvdata (struct device *dev)
{
Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c 2006-10-23 17:21:44.000000000 +0200
+++ linux-2.6/drivers/base/core.c 2006-11-02 12:48:12.000000000 +0100
@@ -381,6 +381,7 @@
INIT_LIST_HEAD(&dev->node);
init_MUTEX(&dev->sem);
device_init_wakeup(dev, 0);
+ dev->numa_node = -1;
}

/**
Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c 2006-10-23 17:21:46.000000000 +0200
+++ linux-2.6/drivers/pci/probe.c 2006-11-02 12:47:35.000000000 +0100
@@ -846,6 +846,7 @@
dev->dev.release = pci_release_dev;
pci_dev_get(dev);

+ dev->dev.numa_node = pcibus_to_node(bus);
dev->dev.dma_mask = &dev->dma_mask;
dev->dev.coherent_dma_mask = 0xffffffffull;

2006-11-04 23:07:07

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:

This will break the compile for !NUMA if someone ends up doing a bisect
and lands here as a bisect point.

You introduce this nice wrapper..

> +#ifdef CONFIG_NUMA
> +#define dev_to_node(dev) ((dev)->numa_node)
> +#else
> +#define dev_to_node(dev) (-1)
> +#endif
> +
> static inline void *
> dev_get_drvdata (struct device *dev)
> {


And then don't use it here..

> Index: linux-2.6/drivers/base/core.c
> ===================================================================
> --- linux-2.6.orig/drivers/base/core.c 2006-10-23 17:21:44.000000000 +0200
> +++ linux-2.6/drivers/base/core.c 2006-11-02 12:48:12.000000000 +0100
> @@ -381,6 +381,7 @@
> INIT_LIST_HEAD(&dev->node);
> init_MUTEX(&dev->sem);
> device_init_wakeup(dev, 0);
> + dev->numa_node = -1;
> }
>
> /**

and here.

> Index: linux-2.6/drivers/pci/probe.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/probe.c 2006-10-23 17:21:46.000000000 +0200
> +++ linux-2.6/drivers/pci/probe.c 2006-11-02 12:47:35.000000000 +0100
> @@ -846,6 +846,7 @@
> dev->dev.release = pci_release_dev;
> pci_dev_get(dev);
>
> + dev->dev.numa_node = pcibus_to_node(bus);
> dev->dev.dma_mask = &dev->dma_mask;
> dev->dev.coherent_dma_mask = 0xffffffffull;


Dave


--
http://www.codemonkey.org.uk

2006-11-04 23:09:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
>
> This will break the compile for !NUMA if someone ends up doing a bisect
> and lands here as a bisect point.
>
> You introduce this nice wrapper..

Yes, I'm stupid :) Updated version will follow ASAP.

2006-11-04 23:53:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
>
> This will break the compile for !NUMA if someone ends up doing a bisect
> and lands here as a bisect point.
>
> You introduce this nice wrapper..

The dev_to_node wrapper is not enough as we can't assign to (-1) for
the non-NUMA case. So I added a second macro, set_dev_node for that.

The patch below compiles and works on numa and non-NUMA platforms.


Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/include/linux/device.h
===================================================================
--- linux-2.6.orig/include/linux/device.h 2006-11-05 00:16:09.000000000 +0100
+++ linux-2.6/include/linux/device.h 2006-11-05 00:39:22.000000000 +0100
@@ -347,6 +347,9 @@
BIOS data),reserved for device core*/
struct dev_pm_info power;

+#ifdef CONFIG_NUMA
+ int numa_node; /* NUMA node this device is close to */
+#endif
u64 *dma_mask; /* dma mask (if dma'able device) */
u64 coherent_dma_mask;/* Like dma_mask, but for
alloc_coherent mappings as
@@ -368,6 +371,14 @@
void (*release)(struct device * dev);
};

+#ifdef CONFIG_NUMA
+#define dev_to_node(dev) ((dev)->numa_node)
+#define set_dev_node(dev, node) ((dev)->numa_node = node)
+#else
+#define dev_to_node(dev) (-1)
+#define set_dev_node(dev, node) do { } while (0)
+#endif
+
static inline void *
dev_get_drvdata (struct device *dev)
{
Index: linux-2.6/drivers/base/core.c
===================================================================
--- linux-2.6.orig/drivers/base/core.c 2006-11-05 00:16:09.000000000 +0100
+++ linux-2.6/drivers/base/core.c 2006-11-05 00:40:01.000000000 +0100
@@ -381,6 +381,7 @@
INIT_LIST_HEAD(&dev->node);
init_MUTEX(&dev->sem);
device_init_wakeup(dev, 0);
+ set_dev_node(dev, -1);
}

/**
Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c 2006-11-05 00:16:09.000000000 +0100
+++ linux-2.6/drivers/pci/probe.c 2006-11-05 00:39:55.000000000 +0100
@@ -846,6 +846,7 @@
dev->dev.release = pci_release_dev;
pci_dev_get(dev);

+ set_dev_node(&dev->dev, pcibus_to_node(bus));
dev->dev.dma_mask = &dev->dma_mask;
dev->dev.coherent_dma_mask = 0xffffffffull;

2006-11-05 08:22:32

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

From: Christoph Hellwig <[email protected]>
Date: Sun, 5 Nov 2006 00:53:23 +0100

> On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> > On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
> >
> > This will break the compile for !NUMA if someone ends up doing a bisect
> > and lands here as a bisect point.
> >
> > You introduce this nice wrapper..
>
> The dev_to_node wrapper is not enough as we can't assign to (-1) for
> the non-NUMA case. So I added a second macro, set_dev_node for that.
>
> The patch below compiles and works on numa and non-NUMA platforms.
>
> Signed-off-by: Christoph Hellwig <[email protected]>

Looks good to me.

2006-11-06 23:39:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Sun, Nov 05, 2006 at 12:22:37AM -0800, David Miller wrote:
> Looks good to me.

So what's the right path to get this in? There's one patch touching
MM code, one adding something to the driver core and then finally a
networking patch depending on the previous two. Do you want to take
them all and send them in through the networking tree? Or should
we put the burden on Andrew?

2006-11-07 06:26:05

by Ravikiran G Thirumalai

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Sun, Nov 05, 2006 at 12:53:23AM +0100, Christoph Hellwig wrote:
> On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> > On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
> >
> > This will break the compile for !NUMA if someone ends up doing a bisect
> > and lands here as a bisect point.
> >
> > You introduce this nice wrapper..
>
> The dev_to_node wrapper is not enough as we can't assign to (-1) for
> the non-NUMA case. So I added a second macro, set_dev_node for that.
>
> The patch below compiles and works on numa and non-NUMA platforms.
>
>

Hi Christoph,
dev_to_node does not work as expected on x86_64 (and i386). This is because
node value returned by pcibus_to_node is initialized after a struct device
is created with current x86_64 code.

We need the node value initialized before the call to pci_scan_bus_parented,
as the generic devices are allocated and initialized
off pci_scan_child_bus, which gets called from pci_scan_bus_parented
The following patch does that using "pci_sysdata" introduced by the PCI
domain patches in -mm.

Signed-off-by: Alok N Kataria <[email protected]>
Signed-off-by: Ravikiran Thirumalai <[email protected]>
Signed-off-by: Shai Fultheim <[email protected]>

Index: linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c
===================================================================
--- linux-2.6.19-rc4mm2.orig/arch/i386/pci/acpi.c 2006-11-06 11:03:50.000000000 -0800
+++ linux-2.6.19-rc4mm2/arch/i386/pci/acpi.c 2006-11-06 22:04:14.000000000 -0800
@@ -9,6 +9,7 @@ struct pci_bus * __devinit pci_acpi_scan
{
struct pci_bus *bus;
struct pci_sysdata *sd;
+ int pxm;

/* Allocate per-root-bus (not per bus) arch-specific data.
* TODO: leak; this memory is never freed.
@@ -30,15 +31,21 @@ struct pci_bus * __devinit pci_acpi_scan
}
#endif /* CONFIG_PCI_DOMAINS */

+ sd->node = -1;
+
+ pxm = acpi_get_pxm(device->handle);
+#ifdef CONFIG_ACPI_NUMA
+ if (pxm >= 0)
+ sd->node = pxm_to_node(pxm);
+#endif
+
bus = pci_scan_bus_parented(NULL, busnum, &pci_root_ops, sd);
if (!bus)
kfree(sd);

#ifdef CONFIG_ACPI_NUMA
if (bus != NULL) {
- int pxm = acpi_get_pxm(device->handle);
if (pxm >= 0) {
- sd->node = pxm_to_node(pxm);
printk("bus %d -> pxm %d -> node %d\n",
busnum, pxm, sd->node);
}

2006-11-07 10:16:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Mon, Nov 06, 2006 at 10:25:36PM -0800, Ravikiran G Thirumalai wrote:
> On Sun, Nov 05, 2006 at 12:53:23AM +0100, Christoph Hellwig wrote:
> > On Sat, Nov 04, 2006 at 06:06:48PM -0500, Dave Jones wrote:
> > > On Sat, Nov 04, 2006 at 11:56:29PM +0100, Christoph Hellwig wrote:
> > >
> > > This will break the compile for !NUMA if someone ends up doing a bisect
> > > and lands here as a bisect point.
> > >
> > > You introduce this nice wrapper..
> >
> > The dev_to_node wrapper is not enough as we can't assign to (-1) for
> > the non-NUMA case. So I added a second macro, set_dev_node for that.
> >
> > The patch below compiles and works on numa and non-NUMA platforms.
> >
> >
>
> Hi Christoph,
> dev_to_node does not work as expected on x86_64 (and i386). This is because
> node value returned by pcibus_to_node is initialized after a struct device
> is created with current x86_64 code.
>
> We need the node value initialized before the call to pci_scan_bus_parented,
> as the generic devices are allocated and initialized
> off pci_scan_child_bus, which gets called from pci_scan_bus_parented
> The following patch does that using "pci_sysdata" introduced by the PCI
> domain patches in -mm.

A nice, that some non-cell folks actually care for this patch. As far
as my x86_64 pci code knowledge is concerned that patch look fine to me.

2006-11-08 02:37:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

Hi, I have a question.

On Sat, 4 Nov 2006 23:56:29 +0100
Christoph Hellwig <[email protected]> wrote:
> Index: linux-2.6/include/linux/device.h
> ===================================================================
> --- linux-2.6.orig/include/linux/device.h 2006-10-29 16:02:38.000000000 +0100
> +++ linux-2.6/include/linux/device.h 2006-11-02 12:47:17.000000000 +0100
> @@ -347,6 +347,9 @@
> BIOS data),reserved for device core*/
> struct dev_pm_info power;
>
> +#ifdef CONFIG_NUMA
> + int numa_node; /* NUMA node this device is close to */
> +#endif

> + dev->dev.numa_node = pcibus_to_node(bus);

Does this "node" is guaranteed to be online ?

if node is not online, NODE_DATA(node) is NULL or not initialized.
Then, alloc_pages_node() at el. will panic.

I wonder there are no code for creating NODE_DATA() for device-only-node.

-Kame

2006-11-10 18:16:53

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Wed, 8 Nov 2006, KAMEZAWA Hiroyuki wrote:

> I wonder there are no code for creating NODE_DATA() for device-only-node.

On IA64 we remap nodes with no memory / cpus to the nearest node with
memory. I think that is sufficient.


2006-11-10 18:28:17

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Fri, 2006-11-10 at 10:16 -0800, Christoph Lameter wrote:
> On Wed, 8 Nov 2006, KAMEZAWA Hiroyuki wrote:
>
> > I wonder there are no code for creating NODE_DATA() for device-only-node.
>
> On IA64 we remap nodes with no memory / cpus to the nearest node with
> memory. I think that is sufficient.

I don't think this happens anymore. Back in the ~2.6.5 days, when we
would configure our numa platforms with 100% of memory interleaved [in
hardware at cache line granularity], the cpus would move to the
interleaved "pseudo-node" and the memoryless nodes would be removed.
numactl --hardware would show something like this:

# uname -r
2.6.5-7.244-default
# numactl --hardware
available: 1 nodes (0-0)
node 0 size: 65443 MB
node 0 free: 64506 MB

I started seeing different behavior about the time SPARSEMEM went in.
Now, with a 2.6.16 base kernel [same platform, hardware interleaved
memory], I see:

# uname -r# numactl --hardware
available: 5 nodes (0-4)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 0 MB
node 2 free: 0 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 65439 MB
node 4 free: 64492 MB
node distances:
node 0 1 2 3 4
0: 10 17 17 17 14
1: 17 10 17 17 14
2: 17 17 10 17 14
3: 17 17 17 10 14
4: 14 14 14 14 10
2.6.16.21-0.8-default

[Aside: The firmware/SLIT says that the interleaved memory is closer to
all nodes that other nodes' memory. This has interesting implications
for the "overflow" zone lists...]

Lee

2006-11-11 00:09:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/3] add dev_to_node()

On Fri, 10 Nov 2006 13:28:25 -0500
Lee Schermerhorn <[email protected]> wrote:

> On Fri, 2006-11-10 at 10:16 -0800, Christoph Lameter wrote:
> > On Wed, 8 Nov 2006, KAMEZAWA Hiroyuki wrote:
> >
> > > I wonder there are no code for creating NODE_DATA() for device-only-node.
> >
> > On IA64 we remap nodes with no memory / cpus to the nearest node with
> > memory. I think that is sufficient.
>
> I don't think this happens anymore.

In my understanding , from drivers/acpi/numa.c,
a node is created by a pxm found in SRAT table at boot time.

the node-number for the pxm which was not found in SRAT at boot time is "-1".
please check how acpi_map_pxm_to_node() is used.

If pci's node-id is based on pxm, checking return vaule of pxm_to_node()
will be good.

-Kame