2007-08-03 22:10:31

by Chuck Ebbert

[permalink] [raw]
Subject: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859

at line 74:

muli@62829:
muli@62829: sd = bus->sysdata;
muli@62829: sd->node = node; <=====

bus->sysdata is NULL.

Last changed by this hunk of
"x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata":

@@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
continue;
if (!node_online(node))
node = 0;
- bus->sysdata = (void *)node;
+
+ sd = bus->sysdata;
+ sd->node = node;
}
}
}


2007-08-03 22:51:43

by Andrew Morton

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Fri, 03 Aug 2007 18:10:03 -0400
Chuck Ebbert <[email protected]> wrote:

> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
>
> at line 74:
>
> muli@62829:
> muli@62829: sd = bus->sysdata;
> muli@62829: sd->node = node; <=====
>
> bus->sysdata is NULL.
>
> Last changed by this hunk of
> "x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata":
>
> @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> continue;
> if (!node_online(node))
> node = 0;
> - bus->sysdata = (void *)node;
> +
> + sd = bus->sysdata;
> + sd->node = node;
> }
> }
> }
>

Andy keeps trotting out a patch which will probably fix this, but for some
reason it doesn't seem to make progress.

2007-08-04 06:17:47

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Fri, Aug 03, 2007 at 03:50:35PM -0700, Andrew Morton wrote:
> On Fri, 03 Aug 2007 18:10:03 -0400
> Chuck Ebbert <[email protected]> wrote:
>
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
> >
> > at line 74:
> >
> > muli@62829:
> > muli@62829: sd = bus->sysdata;
> > muli@62829: sd->node = node; <=====
> >
> > bus->sysdata is NULL.
> >
> > Last changed by this hunk of
> > "x86-64: introduce struct pci_sysdata to facilitate sharing of ->sysdata":
> >
> > @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> > continue;
> > if (!node_online(node))
> > node = 0;
> > - bus->sysdata = (void *)node;
> > +
> > + sd = bus->sysdata;
> > + sd->node = node;
> > }
> > }
> > }
> >
>
> Andy keeps trotting out a patch which will probably fix this, but
> for some reason it doesn't seem to make progress.

Hmm, looks like we're missing one of the paths where sd gets
allocated.

Andy can you please point me to the latest version of your patch? I
recall thinking that it papers over the bug rather than fixing it but
would like to take a second look.

Thanks,
Muli

2007-08-04 09:32:24

by Andi Kleen

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Saturday 04 August 2007 00:50, Andrew Morton wrote:
> On Fri, 03 Aug 2007 18:10:03 -0400
>
> Chuck Ebbert <[email protected]> wrote:
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
> >
> > at line 74:
> >
> > muli@62829:
> > muli@62829: sd = bus->sysdata;
> > muli@62829: sd->node = node; <=====
> >
> > bus->sysdata is NULL.
> >
> > Last changed by this hunk of
> > "x86-64: introduce struct pci_sysdata to facilitate sharing of
> > ->sysdata":

Hmm, will double check. Perhaps Muli's conversion was incomplete.

> > @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> > continue;
> > if (!node_online(node))
> > node = 0;
> > - bus->sysdata = (void *)node;
> > +
> > + sd = bus->sysdata;
> > + sd->node = node;
> > }
> > }
> > }
>
> Andy keeps trotting out a patch which will probably fix this,

What patch do you mean? I don't have anything sysdata related
left over.

-Andi

2007-08-04 16:33:57

by Andrew Morton

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sat, 4 Aug 2007 11:30:41 +0200 Andi Kleen <[email protected]> wrote:

> On Saturday 04 August 2007 00:50, Andrew Morton wrote:
> > On Fri, 03 Aug 2007 18:10:03 -0400
> >
> > Chuck Ebbert <[email protected]> wrote:
> > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
> > >
> > > at line 74:
> > >
> > > muli@62829:
> > > muli@62829: sd = bus->sysdata;
> > > muli@62829: sd->node = node; <=====
> > >
> > > bus->sysdata is NULL.
> > >
> > > Last changed by this hunk of
> > > "x86-64: introduce struct pci_sysdata to facilitate sharing of
> > > ->sysdata":
>
> Hmm, will double check. Perhaps Muli's conversion was incomplete.

hm.

> > > @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> > > continue;
> > > if (!node_online(node))
> > > node = 0;
> > > - bus->sysdata = (void *)node;
> > > +
> > > + sd = bus->sysdata;
> > > + sd->node = node;
> > > }
> > > }
> > > }
> >
> > Andy keeps trotting out a patch which will probably fix this,
>
> What patch do you mean? I don't have anything sysdata related
> left over.
>

"pci device ensure sysdata initialised", now at version 4.



From: Andy Whitcroft <[email protected]>

We have been seeing panic's on NUMA systems in pci_call_probe() in
2.6.19-rc1-mm1 and later. This is related to the changes introduced in the
commit below:

[x86, PCI] Switch pci_bus::sysdata from NUMA node integer to a pointer
0a247a58fc3e2ecfc17654301033e8b8d08df2a2

In this change the sysdata has changed from directly representing a value
(the node number in NUMA) to a pointer to a structure. However, it seems
that we do not always initialise this sysdata before we probe the device.

Prior to the changes above the node was defaulted to 'NULL' allocating the
devices to node 0 unconditionally. This patch adds a default sysdata entry
(pci_default_sysdata), this is then used where 'NULL' was used previously.
pci_default_sysdata defaults the node to unknown (-1). This is a more
accurate assignment, mirroring the value returned where no topology support
is provided and no locality information is available.

There are only two uses of this value in the affected architectures
(x86, x86_64) and generic code:

1) in x86_64, dma_alloc_pages() looks up the node in order to
allocate node local memory. Here if the node is invalid we
will default to the first online node. Behaviour here should
be unchanged.
2) in generic, pci_call_probe() looks up the node in order to
restrict execution of the probe on the card local node, to
favor node local allocation. Where this is unknown previously
we would force execution (and thereby allocation) to node 0,
this is arguably wrong and using -1 releases this restriction.

In an ideal world we should be supplying a sysdata for the
appropriate node where it is known. Where it is not known defaulting
to -1 seems a better course, and would help us where node 0 is
short of memory.

Signed-off-by: Andy Whitcroft <[email protected]>
Acked-by: Yinghai Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jeff Garzik <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/pci/common.c | 2 ++
arch/i386/pci/fixup.c | 8 +++++---
arch/i386/pci/numa.c | 8 +++++---
arch/i386/pci/visws.c | 4 ++--
include/asm-i386/pci.h | 1 +
include/asm-x86_64/pci.h | 1 +
6 files changed, 16 insertions(+), 8 deletions(-)

diff -puN arch/i386/pci/common.c~pci-device-ensure-sysdata-initialised-v4 arch/i386/pci/common.c
--- a/arch/i386/pci/common.c~pci-device-ensure-sysdata-initialised-v4
+++ a/arch/i386/pci/common.c
@@ -27,6 +27,8 @@ unsigned long pirq_table_addr;
struct pci_bus *pci_root_bus;
struct pci_raw_ops *raw_pci_ops;

+struct pci_sysdata pci_default_sysdata = { .node = -1 };
+
static int pci_read(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *value)
{
return raw_pci_ops->read(0, bus->number, devfn, where, size, value);
diff -puN arch/i386/pci/fixup.c~pci-device-ensure-sysdata-initialised-v4 arch/i386/pci/fixup.c
--- a/arch/i386/pci/fixup.c~pci-device-ensure-sysdata-initialised-v4
+++ a/arch/i386/pci/fixup.c
@@ -25,9 +25,11 @@ static void __devinit pci_fixup_i450nx(s
pci_read_config_byte(d, reg++, &subb);
DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb);
if (busno)
- pci_scan_bus(busno, &pci_root_ops, NULL); /* Bus A */
+ pci_scan_bus(busno, &pci_root_ops,
+ &pci_default_sysdata); /* Bus A */
if (suba < subb)
- pci_scan_bus(suba+1, &pci_root_ops, NULL); /* Bus B */
+ pci_scan_bus(suba+1, &pci_root_ops,
+ &pci_default_sysdata); /* Bus B */
}
pcibios_last_bus = -1;
}
@@ -42,7 +44,7 @@ static void __devinit pci_fixup_i450gx(s
u8 busno;
pci_read_config_byte(d, 0x4a, &busno);
printk(KERN_INFO "PCI: i440KX/GX host bridge %s: secondary bus %02x\n", pci_name(d), busno);
- pci_scan_bus(busno, &pci_root_ops, NULL);
+ pci_scan_bus(busno, &pci_root_ops, &pci_default_sysdata);
pcibios_last_bus = -1;
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82454GX, pci_fixup_i450gx);
diff -puN arch/i386/pci/numa.c~pci-device-ensure-sysdata-initialised-v4 arch/i386/pci/numa.c
--- a/arch/i386/pci/numa.c~pci-device-ensure-sysdata-initialised-v4
+++ a/arch/i386/pci/numa.c
@@ -97,9 +97,11 @@ static void __devinit pci_fixup_i450nx(s
pci_read_config_byte(d, reg++, &subb);
DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb);
if (busno)
- pci_scan_bus(QUADLOCAL2BUS(quad,busno), &pci_root_ops, NULL); /* Bus A */
+ pci_scan_bus(QUADLOCAL2BUS(quad,busno), &pci_root_ops,
+ &pci_default_sysdata); /* Bus A */
if (suba < subb)
- pci_scan_bus(QUADLOCAL2BUS(quad,suba+1), &pci_root_ops, NULL); /* Bus B */
+ pci_scan_bus(QUADLOCAL2BUS(quad,suba+1), &pci_root_ops,
+ &pci_default_sysdata); /* Bus B */
}
pcibios_last_bus = -1;
}
@@ -124,7 +126,7 @@ static int __init pci_numa_init(void)
printk("Scanning PCI bus %d for quad %d\n",
QUADLOCAL2BUS(quad,0), quad);
pci_scan_bus(QUADLOCAL2BUS(quad,0),
- &pci_root_ops, NULL);
+ &pci_root_ops, &pci_default_sysdata);
}
return 0;
}
diff -puN arch/i386/pci/visws.c~pci-device-ensure-sysdata-initialised-v4 arch/i386/pci/visws.c
--- a/arch/i386/pci/visws.c~pci-device-ensure-sysdata-initialised-v4
+++ a/arch/i386/pci/visws.c
@@ -101,8 +101,8 @@ static int __init pcibios_init(void)
"bridge B (PIIX4) bus: %u\n", pci_bus1, pci_bus0);

raw_pci_ops = &pci_direct_conf1;
- pci_scan_bus(pci_bus0, &pci_root_ops, NULL);
- pci_scan_bus(pci_bus1, &pci_root_ops, NULL);
+ pci_scan_bus(pci_bus0, &pci_root_ops, &pci_default_sysdata);
+ pci_scan_bus(pci_bus1, &pci_root_ops, &pci_default_sysdata);
pci_fixup_irqs(visws_swizzle, visws_map_irq);
pcibios_resource_survey();
return 0;
diff -puN include/asm-i386/pci.h~pci-device-ensure-sysdata-initialised-v4 include/asm-i386/pci.h
--- a/include/asm-i386/pci.h~pci-device-ensure-sysdata-initialised-v4
+++ a/include/asm-i386/pci.h
@@ -7,6 +7,7 @@
struct pci_sysdata {
int node; /* NUMA node */
};
+extern struct pci_sysdata pci_default_sysdata;

#include <linux/mm.h> /* for struct page */

diff -puN include/asm-x86_64/pci.h~pci-device-ensure-sysdata-initialised-v4 include/asm-x86_64/pci.h
--- a/include/asm-x86_64/pci.h~pci-device-ensure-sysdata-initialised-v4
+++ a/include/asm-x86_64/pci.h
@@ -9,6 +9,7 @@ struct pci_sysdata {
int node; /* NUMA node */
void* iommu; /* IOMMU private data */
};
+extern struct pci_sysdata pci_default_sysdata;

#ifdef CONFIG_CALGARY_IOMMU
static inline void* pci_iommu(struct pci_bus *bus)
_

2007-08-04 17:45:42

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Andrew Morton <[email protected]> wrote:
> On Sat, 4 Aug 2007 11:30:41 +0200 Andi Kleen <[email protected]> wrote:
>
> > On Saturday 04 August 2007 00:50, Andrew Morton wrote:
> > > On Fri, 03 Aug 2007 18:10:03 -0400
> > >
> > > Chuck Ebbert <[email protected]> wrote:
> > > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
> > > >
> > > > at line 74:
> > > >
> > > > muli@62829:
> > > > muli@62829: sd = bus->sysdata;
> > > > muli@62829: sd->node = node; <=====
> > > >
> > > > bus->sysdata is NULL.
> > > >
> > > > Last changed by this hunk of
> > > > "x86-64: introduce struct pci_sysdata to facilitate sharing of
> > > > ->sysdata":
> >
> > Hmm, will double check. Perhaps Muli's conversion was incomplete.
>
> hm.
>
> > > > @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> > > > continue;
> > > > if (!node_online(node))
> > > > node = 0;
> > > > - bus->sysdata = (void *)node;
> > > > +
> > > > + sd = bus->sysdata;
> > > > + sd->node = node;
> > > > }
> > > > }
> > > > }
> > >
> > > Andy keeps trotting out a patch which will probably fix this,
> >
> > What patch do you mean? I don't have anything sysdata related
> > left over.
> >
>
> "pci device ensure sysdata initialised", now at version 4.
>
>
>
> From: Andy Whitcroft <[email protected]>
>
> We have been seeing panic's on NUMA systems in pci_call_probe() in
> 2.6.19-rc1-mm1 and later. This is related to the changes introduced in the
> commit below:
>
> [x86, PCI] Switch pci_bus::sysdata from NUMA node integer to a pointer
> 0a247a58fc3e2ecfc17654301033e8b8d08df2a2
>
> In this change the sysdata has changed from directly representing a value
> (the node number in NUMA) to a pointer to a structure. However, it seems
> that we do not always initialise this sysdata before we probe the device.
>
> Prior to the changes above the node was defaulted to 'NULL' allocating the
> devices to node 0 unconditionally. This patch adds a default sysdata entry
> (pci_default_sysdata), this is then used where 'NULL' was used previously.
> pci_default_sysdata defaults the node to unknown (-1). This is a more
> accurate assignment, mirroring the value returned where no topology support
> is provided and no locality information is available.
>
> There are only two uses of this value in the affected architectures
> (x86, x86_64) and generic code:
>
> 1) in x86_64, dma_alloc_pages() looks up the node in order to
> allocate node local memory. Here if the node is invalid we
> will default to the first online node. Behaviour here should
> be unchanged.
> 2) in generic, pci_call_probe() looks up the node in order to
> restrict execution of the probe on the card local node, to
> favor node local allocation. Where this is unknown previously
> we would force execution (and thereby allocation) to node 0,
> this is arguably wrong and using -1 releases this restriction.
>
> In an ideal world we should be supplying a sysdata for the
> appropriate node where it is known. Where it is not known defaulting
> to -1 seems a better course, and would help us where node 0 is
> short of memory.
>
> Signed-off-by: Andy Whitcroft <[email protected]>
> Acked-by: Yinghai Lu <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Jeff Garzik <[email protected]>
> Cc: Greg KH <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Andrew,

still need
x86_64-get-mp_bus_to_node-as-early-v2.patch in the -mm
it fix
diff -puN arch/i386/pci/irq.c~x86_64-get-mp_bus_to_node-as-early-v2
arch/i386/pci/irq.c
--- a/arch/i386/pci/irq.c~x86_64-get-mp_bus_to_node-as-early-v2
+++ a/arch/i386/pci/irq.c
@@ -136,10 +136,26 @@ static void __init pirq_peer_trick(void)
busmap[e->bus] = 1;
}
for(i = 1; i < 256; i++) {
+ struct pci_bus *bus = NULL;
+ struct pci_sysdata *sd;
if (!busmap[i] || pci_find_bus(0, i))
continue;
- if (pci_scan_bus(i, &pci_root_ops, NULL))
+ /* Allocate per-root-bus (not per bus) arch-specific data.
+ * TODO: leak; this memory is never freed.
+ * It's arguable whether it's worth the trouble to care.
+ */
+ sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+ if (!sd) {
+ printk(KERN_ERR "PCI: OOM, not probing PCI bus %02x\n",
+ i);
+ continue;
+ }
+ sd->node = get_mp_bus_to_node(i);
+ bus = pci_scan_bus(i, &pci_root_ops, sd);
+ if (bus)
printk(KERN_INFO "PCI: Discovered primary peer
bus %02x [IRQ]\n", i);
+ else
+ kfree(sd);
}
pcibios_last_bus = -1;
}
diff -puN arch/i386/pci/legacy.c~x86_64-get-mp_bus_to_node-as-early-v2
arch/i386/pci/legacy.c
--- a/arch/i386/pci/legacy.c~x86_64-get-mp_bus_to_node-as-early-v2
+++ a/arch/i386/pci/legacy.c
@@ -12,6 +12,9 @@
static void __devinit pcibios_fixup_peer_bridges(void)
{
int n, devfn;
+ struct pci_bus *bus = NULL;
+ struct pci_sysdata *sd;
+ long node;

if (pcibios_last_bus <= 0 || pcibios_last_bus >= 0xff)
return;
@@ -21,12 +24,29 @@ static void __devinit pcibios_fixup_peer
u32 l;
if (pci_find_bus(0, n))
continue;
+ node = get_mp_bus_to_node(n);
for (devfn = 0; devfn < 256; devfn += 8) {
if (!raw_pci_ops->read(0, n, devfn,
PCI_VENDOR_ID, 2, &l) &&
l != 0x0000 && l != 0xffff) {
DBG("Found device at %02x:%02x
[%04x]\n", n, devfn, l);
printk(KERN_INFO "PCI: Discovered peer
bus %02x\n", n);
- pci_scan_bus(n, &pci_root_ops, NULL);
+ /* Allocate per-root-bus (not per bus)
+ * arch-specific data.
+ * TODO: leak; this memory is never freed.
+ * It's arguable whether it's worth the trouble
+ * to care.
+ */
+ sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+ if (!sd) {
+ printk(KERN_ERR
+ "PCI: OOM, not probing PCI bus %02x\n",
+ n);
+ break;
+ }
+ sd->node = node;
+ bus = pci_scan_bus(n, &pci_root_ops, sd);
+ if (!bus)
+ kfree(sd);
break;
}
}

YH

2007-08-04 18:16:38

by Andrew Morton

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sat, 4 Aug 2007 10:45:31 -0700 "Yinghai Lu" <[email protected]> wrote:

> Andrew,
>
> still need
> x86_64-get-mp_bus_to_node-as-early-v2.patch in the -mm
> it fix

What does it fix? Much more detail, please.

> diff -puN arch/i386/pci/irq.c~x86_64-get-mp_bus_to_node-as-early-v2
> arch/i386/pci/irq.c
> --- a/arch/i386/pci/irq.c~x86_64-get-mp_bus_to_node-as-early-v2
> +++ a/arch/i386/pci/irq.c
> @@ -136,10 +136,26 @@ static void __init pirq_peer_trick(void)
> busmap[e->bus] = 1;
> }
> for(i = 1; i < 256; i++) {
> + struct pci_bus *bus = NULL;
> + struct pci_sysdata *sd;
> if (!busmap[i] || pci_find_bus(0, i))
> continue;
> - if (pci_scan_bus(i, &pci_root_ops, NULL))
> + /* Allocate per-root-bus (not per bus) arch-specific data.
> + * TODO: leak; this memory is never freed.
> + * It's arguable whether it's worth the trouble to care.
> + */
> + sd = kzalloc(sizeof(*sd), GFP_KERNEL);
> + if (!sd) {
> + printk(KERN_ERR "PCI: OOM, not probing PCI bus %02x\n",
> + i);
> + continue;
> + }
> + sd->node = get_mp_bus_to_node(i);
> + bus = pci_scan_bus(i, &pci_root_ops, sd);
> + if (bus)
> printk(KERN_INFO "PCI: Discovered primary peer
> bus %02x [IRQ]\n", i);

Wordwrapped, tabs replaced with spaces.

Please, for once and for all, fix your email client?

Thanks.

2007-08-04 19:02:51

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Andrew Morton <[email protected]> wrote:
> On Sat, 4 Aug 2007 10:45:31 -0700 "Yinghai Lu" <[email protected]> wrote:
>
> > Andrew,
> >
> > still need
x86_64-get-mp_bus_to_node-as-early-v2.patch is already in the -mm

it fixs
pirq_peer_trick(void) in irq.c
pcibios_fixup_peer_bridges in legacy.c

by allocate sd.

YH

2007-08-04 23:41:07

by Andi Kleen

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Saturday 04 August 2007 18:32:22 Andrew Morton wrote:
> On Sat, 4 Aug 2007 11:30:41 +0200 Andi Kleen <[email protected]> wrote:
>
> > On Saturday 04 August 2007 00:50, Andrew Morton wrote:
> > > On Fri, 03 Aug 2007 18:10:03 -0400
> > >
> > > Chuck Ebbert <[email protected]> wrote:
> > > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
> > > >
> > > > at line 74:
> > > >
> > > > muli@62829:
> > > > muli@62829: sd = bus->sysdata;
> > > > muli@62829: sd->node = node; <=====
> > > >
> > > > bus->sysdata is NULL.
> > > >
> > > > Last changed by this hunk of
> > > > "x86-64: introduce struct pci_sysdata to facilitate sharing of
> > > > ->sysdata":
> >
> > Hmm, will double check. Perhaps Muli's conversion was incomplete.
>
> hm.
>
> > > > @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> > > > continue;
> > > > if (!node_online(node))
> > > > node = 0;
> > > > - bus->sysdata = (void *)node;
> > > > +
> > > > + sd = bus->sysdata;
> > > > + sd->node = node;
> > > > }
> > > > }
> > > > }
> > >
> > > Andy keeps trotting out a patch which will probably fix this,
> >
> > What patch do you mean? I don't have anything sysdata related
> > left over.
> >
>
> "pci device ensure sysdata initialised", now at version 4.

Oh what a mess. I think I'll ask Linus to revert the sysdata patch
instead. Clearly the stuff is half-baked

-Andi

2007-08-05 04:15:44

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sun, Aug 05, 2007 at 01:40:49AM +0200, Andi Kleen wrote:

> > "pci device ensure sysdata initialised", now at version 4.
>
> Oh what a mess. I think I'll ask Linus to revert the sysdata patch
> instead. Clearly the stuff is half-baked

Personally I don't care that much, but I think you'll make jgarzik
very unhappy. The idea is also clearly the right way forward.

So far we know of a single regression and only with 'pci=noacpi'. How
about giving us a little bit of time to get to the bottom of it?
Unfortunately it's not reproducable on any of my machines, but I'm
looking into it.

Cheers,
Muli

2007-08-05 04:31:53

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Andi Kleen <[email protected]> wrote:
> On Saturday 04 August 2007 18:32:22 Andrew Morton wrote:
> > On Sat, 4 Aug 2007 11:30:41 +0200 Andi Kleen <[email protected]> wrote:
> >
> > > On Saturday 04 August 2007 00:50, Andrew Morton wrote:
> > > > On Fri, 03 Aug 2007 18:10:03 -0400
> > > >
> > > > Chuck Ebbert <[email protected]> wrote:
> > > > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=250859
> > > > >
> > > > > at line 74:
> > > > >
> > > > > muli@62829:
> > > > > muli@62829: sd = bus->sysdata;
> > > > > muli@62829: sd->node = node; <=====
> > > > >
> > > > > bus->sysdata is NULL.
> > > > >
> > > > > Last changed by this hunk of
> > > > > "x86-64: introduce struct pci_sysdata to facilitate sharing of
> > > > > ->sysdata":
> > >
> > > Hmm, will double check. Perhaps Muli's conversion was incomplete.
> >
> > hm.
> >
> > > > > @@ -67,7 +69,9 @@ fill_mp_bus_to_cpumask(void)
> > > > > continue;
> > > > > if (!node_online(node))
> > > > > node = 0;
> > > > > - bus->sysdata = (void *)node;
> > > > > +
> > > > > + sd = bus->sysdata;
> > > > > + sd->node = node;
> > > > > }
> > > > > }
> > > > > }
> > > >
> > > > Andy keeps trotting out a patch which will probably fix this,
> > >
> > > What patch do you mean? I don't have anything sysdata related
> > > left over.
> > >
> >
> > "pci device ensure sysdata initialised", now at version 4.
>
> Oh what a mess. I think I'll ask Linus to revert the sysdata patch
> instead. Clearly the stuff is half-baked
>
i noticed that you didn't sign off that patch ( from Muli). So Linus
picked it up directly?

YH

2007-08-05 04:34:13

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Muli Ben-Yehuda <[email protected]> wrote:
> On Sun, Aug 05, 2007 at 01:40:49AM +0200, Andi Kleen wrote:
>
> > > "pci device ensure sysdata initialised", now at version 4.
> >
> > Oh what a mess. I think I'll ask Linus to revert the sysdata patch
> > instead. Clearly the stuff is half-baked
>
> Personally I don't care that much, but I think you'll make jgarzik
> very unhappy. The idea is also clearly the right way forward.
>
> So far we know of a single regression and only with 'pci=noacpi'. How
> about giving us a little bit of time to get to the bottom of it?
> Unfortunately it's not reproducable on any of my machines, but I'm
> looking into it.
>

I hope we can use .node and .iommu in pci_bus...

YH

2007-08-05 05:00:49

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sat, Aug 04, 2007 at 09:33:58PM -0700, Yinghai Lu wrote:

> I hope we can use .node and .iommu in pci_bus...

pci_bus is shared between different architectures, and not all
architectures need .node and .iommu. Why is this better than using the
(arch specific) ->sysdata?

Cheers,
Muli

2007-08-05 05:04:29

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sun, Aug 05, 2007 at 01:40:49AM +0200, Andi Kleen wrote:

> Oh what a mess. I think I'll ask Linus to revert the sysdata patch
> instead. Clearly the stuff is half-baked

By the way, Andi, just making sure you're aware the issue Andy's patch
addresses is separate from the 2.6.23-rc1 ->sysdata breakage?
Reverting my and jgarzik's sysdata patch won't fix Andy's issue.

Cheers,
Muli

2007-08-05 05:39:00

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Muli Ben-Yehuda <[email protected]> wrote:
> On Sun, Aug 05, 2007 at 01:40:49AM +0200, Andi Kleen wrote:
>
> > Oh what a mess. I think I'll ask Linus to revert the sysdata patch
> > instead. Clearly the stuff is half-baked
>
> By the way, Andi, just making sure you're aware the issue Andy's patch
> addresses is separate from the 2.6.23-rc1 ->sysdata breakage?
> Reverting my and jgarzik's sysdata patch won't fix Andy's issue.
>


OK at this time, you can take my patch
x86_64-get-mp_bus_to_node-as-early-v2.patch
to fix the chuck the problem.

Chuck,
can you try my patch in -mm.
or you can get that from following link
http://lkml.org/lkml/2007/7/26/377

I like to see acpi=off work without problem on AMD64 platform.

YH

2007-08-05 05:53:29

by Andrew Morton

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sat, 4 Aug 2007 12:02:35 -0700 "Yinghai Lu" <[email protected]> wrote:

> On 8/4/07, Andrew Morton <[email protected]> wrote:
> > On Sat, 4 Aug 2007 10:45:31 -0700 "Yinghai Lu" <[email protected]> wrote:
> >
> > > Andrew,
> > >
> > > still need
> x86_64-get-mp_bus_to_node-as-early-v2.patch is already in the -mm
>
> it fixs
> pirq_peer_trick(void) in irq.c
> pcibios_fixup_peer_bridges in legacy.c
>
> by allocate sd.
>

I don't understand you much, sorry. Are you saying that
x86_64-get-mp_bus_to_node-as-early-v2.patch should be merged in 2.6.23?

2007-08-05 06:02:58

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sat, Aug 04, 2007 at 10:52:22PM -0700, Andrew Morton wrote:
> On Sat, 4 Aug 2007 12:02:35 -0700 "Yinghai Lu" <[email protected]> wrote:
>
> > On 8/4/07, Andrew Morton <[email protected]> wrote:
> > > On Sat, 4 Aug 2007 10:45:31 -0700 "Yinghai Lu" <[email protected]> wrote:
> > >
> > > > Andrew,
> > > >
> > > > still need
> > x86_64-get-mp_bus_to_node-as-early-v2.patch is already in the -mm
> >
> > it fixs
> > pirq_peer_trick(void) in irq.c
> > pcibios_fixup_peer_bridges in legacy.c
> >
> > by allocate sd.
> >
>
> I don't understand you much, sorry. Are you saying that
> x86_64-get-mp_bus_to_node-as-early-v2.patch should be merged in
> 2.6.23?

If I may try to elaborate, yhlu is pointing out that the ->sysdata
conversion missed a couple of spots in i386 where we need to allocate
a pci_sysdata object as well. Unfortunately I don't think
x86_64-get-mp_bus_to_node-as-early-v2 is a suitable fix as it does a
whole bunch of other stuff too. I am now preparing a minimal patch
(based on x86_64-get-mp_bus_to_node-as-early-v2) that only takes care
of the missing ->sysdata bits. If that patch cures the original bug
and doesn't cause any more regressions, it's definitely 2.6.23
material.

Cheers,
Muli

2007-08-05 06:04:44

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Andrew Morton <[email protected]> wrote:
> On Sat, 4 Aug 2007 12:02:35 -0700 "Yinghai Lu" <[email protected]> wrote:
>
> > On 8/4/07, Andrew Morton <[email protected]> wrote:
> > > On Sat, 4 Aug 2007 10:45:31 -0700 "Yinghai Lu" <[email protected]> wrote:
> > >
> > > > Andrew,
> > > >
> > > > still need
> > x86_64-get-mp_bus_to_node-as-early-v2.patch is already in the -mm
> >
> > it fixs
> > pirq_peer_trick(void) in irq.c
> > pcibios_fixup_peer_bridges in legacy.c
> >
> > by allocate sd.
> >
>
> I don't understand you much, sorry. Are you saying that
> x86_64-get-mp_bus_to_node-as-early-v2.patch should be merged in 2.6.23?

Yes.

YH

2007-08-05 06:07:21

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Muli Ben-Yehuda <[email protected]> wrote:
> On Sat, Aug 04, 2007 at 10:52:22PM -0700, Andrew Morton wrote:
> > On Sat, 4 Aug 2007 12:02:35 -0700 "Yinghai Lu" <[email protected]> wrote:
> >
> > > On 8/4/07, Andrew Morton <[email protected]> wrote:
> > > > On Sat, 4 Aug 2007 10:45:31 -0700 "Yinghai Lu" <[email protected]> wrote:
> > > >
> > > > > Andrew,
> > > > >
> > > > > still need
> > > x86_64-get-mp_bus_to_node-as-early-v2.patch is already in the -mm
> > >
> > > it fixs
> > > pirq_peer_trick(void) in irq.c
> > > pcibios_fixup_peer_bridges in legacy.c
> > >
> > > by allocate sd.
> > >
> >
> > I don't understand you much, sorry. Are you saying that
> > x86_64-get-mp_bus_to_node-as-early-v2.patch should be merged in
> > 2.6.23?
>
> If I may try to elaborate, yhlu is pointing out that the ->sysdata
> conversion missed a couple of spots in i386 where we need to allocate
> a pci_sysdata object as well. Unfortunately I don't think
> x86_64-get-mp_bus_to_node-as-early-v2 is a suitable fix as it does a
> whole bunch of other stuff too. I am now preparing a minimal patch
> (based on x86_64-get-mp_bus_to_node-as-early-v2) that only takes care
> of the missing ->sysdata bits. If that patch cures the original bug
> and doesn't cause any more regressions, it's definitely 2.6.23
> material.

You just use my patch plus Andy's patch together...

YH

2007-08-05 06:11:47

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On Sat, Aug 04, 2007 at 11:07:04PM -0700, Yinghai Lu wrote:

> You just use my patch plus Andy's patch together...

Your patch does other things that in my opinion are not appropriate at
this stage for 2.6.23. I am aiming for the minimal change that will
fix the regression, and your patch should be considered for 2.6.24.

Cheers,
Muli

2007-08-05 06:24:43

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Muli Ben-Yehuda <[email protected]> wrote:
> On Sat, Aug 04, 2007 at 11:07:04PM -0700, Yinghai Lu wrote:
>
> > You just use my patch plus Andy's patch together...
>
> Your patch does other things that in my opinion are not appropriate at
> this stage for 2.6.23. I am aiming for the minimal change that will
> fix the regression, and your patch should be considered for 2.6.24.
>
if Andi picked my batch v1 version before your sysdata patch, then
there would be this regression

YH..

2007-08-05 06:27:36

by Yinghai Lu

[permalink] [raw]
Subject: Re: Oops in 2.6.23-rc1-git9, arch/x86_64/pci/k8-bus.c::fill_mp_bus_to_cpumask()

On 8/4/07, Yinghai Lu <[email protected]> wrote:
> On 8/4/07, Muli Ben-Yehuda <[email protected]> wrote:
> > On Sat, Aug 04, 2007 at 11:07:04PM -0700, Yinghai Lu wrote:
> >
> > > You just use my patch plus Andy's patch together...
> >
> > Your patch does other things that in my opinion are not appropriate at
> > this stage for 2.6.23. I am aiming for the minimal change that will
> > fix the regression, and your patch should be considered for 2.6.24.
> >
> if Andi picked my batch v1 version before your sysdata patch, then
> there would be this regression
i mean:
if Andi picked up my batch v1 version before your sysdata patch, then
there would not be this regression
YH

2007-08-05 07:53:33

by Muli Ben-Yehuda

[permalink] [raw]
Subject: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

This patch finishes the i386 and x86-64 ->sysdata conversion and
hopefully also fixes Riku's and Andy's observed bugs. It is based on
Yinghai Lu's and Andy Whitcroft's patches (thanks!) with some changes:

- introduce pci_scan_bus_with_sysdata() and use it instead of
pci_scan_bus() where appropriate. pci_scan_bus_with_sysdata() will
allocate the sysdata structure and then call pci_scan_bus().
- always allocate pci_sysdata dynamically. The whole point of this
sysdata work is to make it easy to do root-bus specific things
(e.g., support PCI domains and IOMMU's). I dislike using a default
struct pci_sysdata in some places and a dynamically allocated
pci_sysdata elsewhere - the potential for someone indavertantly
changing the default structure is too high.
- this patch only makes the minimal changes necessary, i.e., the NUMA node is
always initialized to -1. Patches to do the right thing with regards
to the NUMA node can build on top of this (either add a 'node'
parameter to pci_scan_bus_with_sysdata() or just update the node
when it becomes known).

The patch was compile tested with various configurations (e.g., NUMAQ,
VISWS) and run-time tested on i386 and x86-64. Unfortunately none of
my machines exhibited the bugs so caveat emptor.

Andy, could you please see if this fixes the NUMA issues you've seen?
Riku, does this fix "pci=noacpi" on your laptop?

Comments appreciated. If this looks ok it should go into 2.6.23.

Signed-off-by: Muli Ben-Yehuda <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Chuck Ebbert <[email protected]>,
Cc: [email protected]
Cc: Andy Whitcroft <[email protected]>
Cc: Jeff Garzik <[email protected]>
---
arch/i386/pci/common.c | 23 +++++++++++++++++++++++
arch/i386/pci/fixup.c | 6 +++---
arch/i386/pci/irq.c | 5 +++--
arch/i386/pci/legacy.c | 2 +-
arch/i386/pci/numa.c | 15 +++++++++------
arch/i386/pci/visws.c | 4 ++--
include/asm-i386/pci.h | 3 +++
include/asm-x86_64/pci.h | 2 ++
8 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/arch/i386/pci/common.c b/arch/i386/pci/common.c
index 85503de..ebc6f3c 100644
--- a/arch/i386/pci/common.c
+++ b/arch/i386/pci/common.c
@@ -455,3 +455,26 @@ void pcibios_disable_device (struct pci_dev *dev)
if (!dev->msi_enabled && pcibios_disable_irq)
pcibios_disable_irq(dev);
}
+
+struct pci_bus *pci_scan_bus_with_sysdata(int busno)
+{
+ struct pci_bus *bus = NULL;
+ struct pci_sysdata *sd;
+
+ /*
+ * Allocate per-root-bus (not per bus) arch-specific data.
+ * TODO: leak; this memory is never freed.
+ * It's arguable whether it's worth the trouble to care.
+ */
+ sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+ if (!sd) {
+ printk(KERN_ERR "PCI: OOM, skipping PCI bus %02x\n", busno);
+ return NULL;
+ }
+ sd->node = -1;
+ bus = pci_scan_bus(busno, &pci_root_ops, sd);
+ if (!bus)
+ kfree(sd);
+
+ return bus;
+}
diff --git a/arch/i386/pci/fixup.c b/arch/i386/pci/fixup.c
index e7306db..c82cbf4 100644
--- a/arch/i386/pci/fixup.c
+++ b/arch/i386/pci/fixup.c
@@ -25,9 +25,9 @@ static void __devinit pci_fixup_i450nx(struct pci_dev *d)
pci_read_config_byte(d, reg++, &subb);
DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb);
if (busno)
- pci_scan_bus(busno, &pci_root_ops, NULL); /* Bus A */
+ pci_scan_bus_with_sysdata(busno); /* Bus A */
if (suba < subb)
- pci_scan_bus(suba+1, &pci_root_ops, NULL); /* Bus B */
+ pci_scan_bus_with_sysdata(suba+1); /* Bus B */
}
pcibios_last_bus = -1;
}
@@ -42,7 +42,7 @@ static void __devinit pci_fixup_i450gx(struct pci_dev *d)
u8 busno;
pci_read_config_byte(d, 0x4a, &busno);
printk(KERN_INFO "PCI: i440KX/GX host bridge %s: secondary bus %02x\n", pci_name(d), busno);
- pci_scan_bus(busno, &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(busno);
pcibios_last_bus = -1;
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82454GX, pci_fixup_i450gx);
diff --git a/arch/i386/pci/irq.c b/arch/i386/pci/irq.c
index f2cb942..665db06 100644
--- a/arch/i386/pci/irq.c
+++ b/arch/i386/pci/irq.c
@@ -138,8 +138,9 @@ static void __init pirq_peer_trick(void)
for(i = 1; i < 256; i++) {
if (!busmap[i] || pci_find_bus(0, i))
continue;
- if (pci_scan_bus(i, &pci_root_ops, NULL))
- printk(KERN_INFO "PCI: Discovered primary peer bus %02x [IRQ]\n", i);
+ if (pci_scan_bus_with_sysdata(i))
+ printk(KERN_INFO "PCI: Discovered primary peer "
+ "bus %02x [IRQ]\n", i);
}
pcibios_last_bus = -1;
}
diff --git a/arch/i386/pci/legacy.c b/arch/i386/pci/legacy.c
index 149a958..5565d70 100644
--- a/arch/i386/pci/legacy.c
+++ b/arch/i386/pci/legacy.c
@@ -26,7 +26,7 @@ static void __devinit pcibios_fixup_peer_bridges(void)
l != 0x0000 && l != 0xffff) {
DBG("Found device at %02x:%02x [%04x]\n", n, devfn, l);
printk(KERN_INFO "PCI: Discovered peer bus %02x\n", n);
- pci_scan_bus(n, &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(n);
break;
}
}
diff --git a/arch/i386/pci/numa.c b/arch/i386/pci/numa.c
index adbe17a..f5f165f 100644
--- a/arch/i386/pci/numa.c
+++ b/arch/i386/pci/numa.c
@@ -96,10 +96,14 @@ static void __devinit pci_fixup_i450nx(struct pci_dev *d)
pci_read_config_byte(d, reg++, &suba);
pci_read_config_byte(d, reg++, &subb);
DBG("i450NX PXB %d: %02x/%02x/%02x\n", pxb, busno, suba, subb);
- if (busno)
- pci_scan_bus(QUADLOCAL2BUS(quad,busno), &pci_root_ops, NULL); /* Bus A */
- if (suba < subb)
- pci_scan_bus(QUADLOCAL2BUS(quad,suba+1), &pci_root_ops, NULL); /* Bus B */
+ if (busno) {
+ /* Bus A */
+ pci_scan_bus_with_sysdata(QUADLOCAL2BUS(quad, busno));
+ }
+ if (suba < subb) {
+ /* Bus B */
+ pci_scan_bus_with_sysdata(QUADLOCAL2BUS(quad, suba+1));
+ }
}
pcibios_last_bus = -1;
}
@@ -123,8 +127,7 @@ static int __init pci_numa_init(void)
continue;
printk("Scanning PCI bus %d for quad %d\n",
QUADLOCAL2BUS(quad,0), quad);
- pci_scan_bus(QUADLOCAL2BUS(quad,0),
- &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(QUADLOCAL2BUS(quad, 0));
}
return 0;
}
diff --git a/arch/i386/pci/visws.c b/arch/i386/pci/visws.c
index f1b486d..8ecb1c7 100644
--- a/arch/i386/pci/visws.c
+++ b/arch/i386/pci/visws.c
@@ -101,8 +101,8 @@ static int __init pcibios_init(void)
"bridge B (PIIX4) bus: %u\n", pci_bus1, pci_bus0);

raw_pci_ops = &pci_direct_conf1;
- pci_scan_bus(pci_bus0, &pci_root_ops, NULL);
- pci_scan_bus(pci_bus1, &pci_root_ops, NULL);
+ pci_scan_bus_with_sysdata(pci_bus0);
+ pci_scan_bus_with_sysdata(pci_bus1);
pci_fixup_irqs(visws_swizzle, visws_map_irq);
pcibios_resource_survey();
return 0;
diff --git a/include/asm-i386/pci.h b/include/asm-i386/pci.h
index d790343..4fcacc7 100644
--- a/include/asm-i386/pci.h
+++ b/include/asm-i386/pci.h
@@ -8,6 +8,9 @@ struct pci_sysdata {
int node; /* NUMA node */
};

+/* scan a bus after allocating a pci_sysdata for it */
+extern struct pci_bus *pci_scan_bus_with_sysdata(int busno);
+
#include <linux/mm.h> /* for struct page */

/* Can be used to override the logic in pci_scan_bus for skipping
diff --git a/include/asm-x86_64/pci.h b/include/asm-x86_64/pci.h
index 88926eb..5da8cb0 100644
--- a/include/asm-x86_64/pci.h
+++ b/include/asm-x86_64/pci.h
@@ -10,6 +10,8 @@ struct pci_sysdata {
void* iommu; /* IOMMU private data */
};

+extern struct pci_bus *pci_scan_bus_with_sysdata(int busno);
+
#ifdef CONFIG_CALGARY_IOMMU
static inline void* pci_iommu(struct pci_bus *bus)
{
--
1.5.2



2007-08-05 08:50:12

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On 8/5/07, Muli Ben-Yehuda <[email protected]> wrote:
> This patch finishes the i386 and x86-64 ->sysdata conversion and
> hopefully also fixes Riku's and Andy's observed bugs. It is based on
> Yinghai Lu's and Andy Whitcroft's patches (thanks!) with some changes:
>
> - introduce pci_scan_bus_with_sysdata() and use it instead of
> pci_scan_bus() where appropriate. pci_scan_bus_with_sysdata() will
> allocate the sysdata structure and then call pci_scan_bus().
> - always allocate pci_sysdata dynamically. The whole point of this
> sysdata work is to make it easy to do root-bus specific things
> (e.g., support PCI domains and IOMMU's). I dislike using a default
> struct pci_sysdata in some places and a dynamically allocated
> pci_sysdata elsewhere - the potential for someone indavertantly
> changing the default structure is too high.
> - this patch only makes the minimal changes necessary, i.e., the NUMA node is
> always initialized to -1. Patches to do the right thing with regards
> to the NUMA node can build on top of this (either add a 'node'
> parameter to pci_scan_bus_with_sysdata() or just update the node
> when it becomes known).
>
> The patch was compile tested with various configurations (e.g., NUMAQ,
> VISWS) and run-time tested on i386 and x86-64. Unfortunately none of
> my machines exhibited the bugs so caveat emptor.
>
> Andy, could you please see if this fixes the NUMA issues you've seen?
> Riku, does this fix "pci=noacpi" on your laptop?
>
> Comments appreciated. If this looks ok it should go into 2.6.23.
>
> Signed-off-by: Muli Ben-Yehuda <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Chuck Ebbert <[email protected]>,
> Cc: [email protected]
> Cc: Andy Whitcroft <[email protected]>
> Cc: Jeff Garzik <[email protected]>
> ---

sounds good.

Can you change
pci_scan_bus_with_sysdata(int busno)
to
pci_scan_bus_on_node(int bus, struct pci_ops *ops, int node)?

pci_scan_bus_with_sysdata(int busno) make me feel that i need feed one
sysdata as on param for it.

I only need to update irq.c and legacy.c in my patch.

YH

2007-08-05 11:55:32

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On Sun, Aug 05, 2007 at 01:49:57AM -0700, Yinghai Lu wrote:

> Can you change
> pci_scan_bus_with_sysdata(int busno)
> to
> pci_scan_bus_on_node(int bus, struct pci_ops *ops, int node)?

Do you anticipate passing in a different pci_ops or node?
In any case please remember I am aiming for the minimal "obviously
correc" change for 2.6.23...

> pci_scan_bus_with_sysdata(int busno) make me feel that i need feed one
> sysdata as on param for it.

Yeah, lousy name, but the best I came up with. Runner up was
'x86_pci_scan_bus', which is I think worse?

Cheers,
Muli

2007-08-05 16:40:11

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On 8/5/07, Muli Ben-Yehuda <[email protected]> wrote:
> On Sun, Aug 05, 2007 at 01:49:57AM -0700, Yinghai Lu wrote:
>
> > Can you change
> > pci_scan_bus_with_sysdata(int busno)
> > to
> > pci_scan_bus_on_node(int bus, struct pci_ops *ops, int node)?
>
> Do you anticipate passing in a different pci_ops or node?
> In any case please remember I am aiming for the minimal "obviously
> correc" change for 2.6.23...
>
> > pci_scan_bus_with_sysdata(int busno) make me feel that i need feed one
> > sysdata as on param for it.
>
> Yeah, lousy name, but the best I came up with. Runner up was
> 'x86_pci_scan_bus', which is I think worse?
>
or

pci_scan_bus_on_node(int bus, struct pci_ops *ops, int node)
x86_pci_scan_root_bus(int bus)
{
pci_scan_bus_on_node(bus, &pci_root_ops, -1);
}

i need node as one param for my patch later in irq.c and legacy.c

YH

2007-08-05 17:36:41

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

Yinghai Lu wrote:
> pci_scan_bus_on_node(int bus, struct pci_ops *ops, int node)
> x86_pci_scan_root_bus(int bus)
> {
> pci_scan_bus_on_node(bus, &pci_root_ops, -1);
> }
>
> i need node as one param for my patch later in irq.c and legacy.c


It is a mistake to start coding NUMA details into pci scan functions.

Anywhere the current code does not set the NUMA node, set it to -1 or
some other default value.

Jeff


2007-08-05 20:41:59

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On 8/5/07, Jeff Garzik <[email protected]> wrote:
> Yinghai Lu wrote:
> > pci_scan_bus_on_node(int bus, struct pci_ops *ops, int node)
> > x86_pci_scan_root_bus(int bus)
> > {
> > pci_scan_bus_on_node(bus, &pci_root_ops, -1);
> > }
> >
> > i need node as one param for my patch later in irq.c and legacy.c
>
>
> It is a mistake to start coding NUMA details into pci scan functions.
>
> Anywhere the current code does not set the NUMA node, set it to -1 or
> some other default value.

Can you check
http://lkml.org/lkml/2007/7/26/377
http://lkml.org/lkml/2007/7/26/378
http://lkml.org/lkml/2007/7/26/379

it will make sure numa_node on device get correct value after pci scan.
esp for k8 system with second peer root bus on second node.

Thanks

Yinghai Lu

2007-08-07 22:51:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On Sun, 5 Aug 2007 10:53:07 +0300
Muli Ben-Yehuda <[email protected]> wrote:

> This patch finishes the i386 and x86-64 ->sysdata conversion and
> hopefully also fixes Riku's and Andy's observed bugs. It is based on
> Yinghai Lu's and Andy Whitcroft's patches (thanks!) with some changes:
>
> - introduce pci_scan_bus_with_sysdata() and use it instead of
> pci_scan_bus() where appropriate. pci_scan_bus_with_sysdata() will
> allocate the sysdata structure and then call pci_scan_bus().
> - always allocate pci_sysdata dynamically. The whole point of this
> sysdata work is to make it easy to do root-bus specific things
> (e.g., support PCI domains and IOMMU's). I dislike using a default
> struct pci_sysdata in some places and a dynamically allocated
> pci_sysdata elsewhere - the potential for someone indavertantly
> changing the default structure is too high.
> - this patch only makes the minimal changes necessary, i.e., the NUMA node is
> always initialized to -1. Patches to do the right thing with regards
> to the NUMA node can build on top of this (either add a 'node'
> parameter to pci_scan_bus_with_sysdata() or just update the node
> when it becomes known).
>
> The patch was compile tested with various configurations (e.g., NUMAQ,
> VISWS) and run-time tested on i386 and x86-64. Unfortunately none of
> my machines exhibited the bugs so caveat emptor.
>
> Andy, could you please see if this fixes the NUMA issues you've seen?
> Riku, does this fix "pci=noacpi" on your laptop?

I am sooooooo tired of this thing. Andi, someone, can we for heaven's
sake please just get it all sorted out?

I dropped two of Yinghai's patches which conflicted with this, and then
two more which looked like they depended on the dropped ones.

2007-08-07 22:57:27

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On Tue, Aug 07, 2007 at 03:49:11PM -0700, Andrew Morton wrote:

> I am sooooooo tired of this thing. Andi, someone, can we for heaven's
> sake please just get it all sorted out?

With regards to the sysdata conversion: Riku says he cannot test new
kernel. I haven't heard anything from Andy Whitcroft. It passes all of
my tests and what we have now is obviously broken... I think we should
put the fix in.

Cheers,
Muli

2007-08-08 00:44:19

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

Muli Ben-Yehuda wrote:
> On Tue, Aug 07, 2007 at 03:49:11PM -0700, Andrew Morton wrote:
>
>> I am sooooooo tired of this thing. Andi, someone, can we for heaven's
>> sake please just get it all sorted out?
>
> With regards to the sysdata conversion: Riku says he cannot test new
> kernel. I haven't heard anything from Andy Whitcroft. It passes all of
> my tests and what we have now is obviously broken... I think we should
> put the fix in.

Strongly agreed. It overall fixes bugs that existing _before_ the
sysdata stuff went in.

It was clear the NUMA node was /not initialized/ in all cases, and I
think it's quite unfair to revert the sysdata stuff because of
pre-existing bugs (that we are fixing right now anyway).

Jeff



2007-08-08 01:09:59

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On 8/7/07, Jeff Garzik <[email protected]> wrote:
> Muli Ben-Yehuda wrote:
>
> Strongly agreed. It overall fixes bugs that existing _before_ the
> sysdata stuff went in.

what are those bugs?

the sysdata stuff try to fix:
when calgary iommu code was introduced, it is trying to share sysdata
that orginally is initialized k8_bus.c and used by pcibus_to_node. but
that only happen with AMD K8 platform. and have nothing to do with
calgary.
Then my patch get_mp_bus_to_node_as_early will use that sysdata as
node as early as possible before the pci_scan bus is called.

then Mul came out his patch ...to split the sharing that will never happen.

But andi picked muli patch even my patch stay a while in -mm.

Also i don't think anyone test muli sysdata patch with amd64 platform
with acpi=off. even muli himself.

calgary support bus numa at this time? i mean mult peer root bus on
different node..

anyway I will produce my patch in v3...

YH

2007-08-08 01:23:37

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

Yinghai Lu wrote:
> the sysdata stuff try to fix:
> when calgary iommu code was introduced, it is trying to share sysdata
> that orginally is initialized k8_bus.c and used by pcibus_to_node. but
> that only happen with AMD K8 platform. and have nothing to do with
> calgary.

incorrect: sysdata is initialized when the structure is created, which
clearly not only in k8_bus.c


> Then my patch get_mp_bus_to_node_as_early will use that sysdata as
> node as early as possible before the pci_scan bus is called.

that's incorrect, since it assumes NUMA owns ->sysdata


> then Mul came out his patch ...to split the sharing that will never happen.

that's incorrect, ->sysdata will be shared among Calgary, NUMA, and
PCI-domain support.


> Also i don't think anyone test muli sysdata patch with amd64 platform
> with acpi=off. even muli himself.

I did. I did all PCI domain support development on AMD64 platform.


> calgary support bus numa at this time? i mean mult peer root bus on
> different node..

My PCI domains patch add support for multiple peer root buses. That's
the definition of PCI domain support.

Branch 'pciseg' of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git

Jeff


2007-08-08 01:29:03

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On 8/7/07, Jeff Garzik <[email protected]> wrote:
> Yinghai Lu wrote:
> > the sysdata stuff try to fix:
> > when calgary iommu code was introduced, it is trying to share sysdata
> > that orginally is initialized k8_bus.c and used by pcibus_to_node. but
> > that only happen with AMD K8 platform. and have nothing to do with
> > calgary.
>
> incorrect: sysdata is initialized when the structure is created, which
> clearly not only in k8_bus.c

before his first patch, numa juse use sysdata pointer for node.

>
>
> > Then my patch get_mp_bus_to_node_as_early will use that sysdata as
> > node as early as possible before the pci_scan bus is called.
>
> that's incorrect, since it assumes NUMA owns ->sysdata

At that time, only it use that.

>
>
> > then Mul came out his patch ...to split the sharing that will never happen.
>
> that's incorrect, ->sysdata will be shared among Calgary, NUMA, and
> PCI-domain support.

but not the same time on run time system with Calgary, NUMA.

>
>
> > Also i don't think anyone test muli sysdata patch with amd64 platform
> > with acpi=off. even muli himself.
>
> I did. I did all PCI domain support development on AMD64 platform.

really, but his first patch is broken with acpi=off.
I guess you tested his patch plus your patch. instead of test his
patch barely. I tested his patch + my patch, and it works well, and
even test his patch only.

>
>
> > calgary support bus numa at this time? i mean mult peer root bus on
> > different node..
>
> My PCI domains patch add support for multiple peer root buses. That's
> the definition of PCI domain support.
>
> Branch 'pciseg' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git

good, i will look at it.

YH

2007-08-08 02:59:28

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH/RFT] finish i386 and x86-64 sysdata conversion

On 8/7/07, Jeff Garzik <[email protected]> wrote:
>
> My PCI domains patch add support for multiple peer root buses. That's
> the definition of PCI domain support.

one one segment, we can not have multi peer root buses?

>
> Branch 'pciseg' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git

is your test system based AMD opteron..., and have io device on other
node/link, and use mmconfig to access that root bus on different
segment?

I used to think that we need bigSMP to support pci segments on AMD platfrom.

YH