2006-01-09 20:37:55

by Greg KH

Subject: [GIT PATCH] PCI patches for 2.6.15 - retry

Here are some PCI patches against your latest git tree. They have all
been in the -mm tree for a while with no problems. I've pulled out all
of the offending patches that people objected to, or ones that crashed
older machines from the last series I sent you.

The thing that touches so many different files is the change from the
pci_module_init() to pci_register_driver() that was done by Richard
Knutsson. Other big stuff is the addition of the pci error recovery
framework, after many different revisions and reworks.
There are also some pci hotplug fixes, and quirks added.
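
As a rough illustration of the rename described above: most of the driver files in the diffstat change by two lines, which is consistent with a one-line substitution in each driver's init function. The sketch below is a userspace mock, not kernel code. The `struct pci_driver` and `pci_register_driver()` here are stand-ins written for this example, and `example_pci` is a hypothetical driver name.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in so the sketch compiles outside a kernel tree.
 * In a real driver this type comes from <linux/pci.h>. */
struct pci_driver {
    const char *name;
    int registered;
};

/* Stand-in for the kernel's pci_register_driver(): returns 0 on
 * success and a negative value on failure. */
static int pci_register_driver(struct pci_driver *drv)
{
    assert(drv != NULL);        /* programming error in this sketch */
    if (drv->name == NULL)
        return -22;             /* mirrors -EINVAL */
    drv->registered = 1;
    return 0;
}

/* Hypothetical driver, not one of the files listed in the diffstat. */
static struct pci_driver example_driver = { .name = "example_pci" };

/* Before the series: return pci_module_init(&example_driver);
 * After the series:  the call below. */
static int example_init(void)
{
    return pci_register_driver(&example_driver);
}
```

Calling `example_init()` registers the mock driver and returns 0, so the change is visible only at the call site.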

Please pull from:
rsync://rsync.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6.git/
or if master.kernel.org hasn't synced up yet:
master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6.git/

The full patches will be sent to the linux-pci mailing list, if anyone
wants to see them.

thanks,

greg k-h

Documentation/filesystems/sysfs-pci.txt | 21 +-
Documentation/pci-error-recovery.txt | 246 +++++++++++++++++++++++++++
MAINTAINERS | 7
arch/alpha/kernel/sys_alcor.c | 3
arch/alpha/kernel/sys_sio.c | 6
arch/frv/mb93090-mb00/pci-frv.c | 8
arch/frv/mb93090-mb00/pci-irq.c | 4
arch/i386/kernel/scx200.c | 2
arch/i386/pci/acpi.c | 2
arch/i386/pci/fixup.c | 7
arch/i386/pci/irq.c | 42 ++--
arch/mips/vr41xx/common/vrc4173.c | 2
arch/ppc/kernel/pci.c | 21 +-
arch/ppc/platforms/85xx/mpc85xx_cds_common.c | 11 -
arch/sparc64/kernel/ebus.c | 15 -
drivers/acpi/pci_irq.c | 7
drivers/block/DAC960.c | 2
drivers/block/cciss.c | 2
drivers/block/sx8.c | 2
drivers/block/umem.c | 2
drivers/hwmon/vt8231.c | 2
drivers/media/radio/radio-gemtek-pci.c | 2
drivers/media/radio/radio-maxiradio.c | 2
drivers/media/video/bttv-driver.c | 2
drivers/media/video/saa7134/saa7134-core.c | 2
drivers/parport/parport_serial.c | 2
drivers/pci/hotplug/acpiphp_glue.c | 6
drivers/pci/hotplug/cpqphp.h | 8
drivers/pci/hotplug/cpqphp_core.c | 127 +++++++------
drivers/pci/hotplug/cpqphp_ctrl.c | 28 ---
drivers/pci/hotplug/cpqphp_sysfs.c | 138 ++++++++++++---
drivers/pci/hotplug/ibmphp_pci.c | 2
drivers/pci/hotplug/pciehp_core.c | 92 +++++-----
drivers/pci/hotplug/pciehp_hpc.c | 19 +-
drivers/pci/hotplug/pciehp_pci.c | 52 +++--
drivers/pci/hotplug/pciehprm_acpi.c | 13 -
drivers/pci/hotplug/rpadlpar_core.c | 27 --
drivers/pci/hotplug/rpaphp_pci.c | 47 -----
drivers/pci/hotplug/shpchp.h | 4
drivers/pci/hotplug/shpchp_core.c | 16 +
drivers/pci/hotplug/shpchp_ctrl.c | 37 ----
drivers/pci/hotplug/shpchp_hpc.c | 138 +++++++++------
drivers/pci/hotplug/shpchp_pci.c | 19 +-
drivers/pci/pci.c | 7
drivers/pci/pci.h | 5
drivers/pci/pcie/portdrv_core.c | 4
drivers/pci/probe.c | 49 ++++-
drivers/pci/proc.c | 3
drivers/pci/quirks.c | 26 ++
drivers/pci/remove.c | 3
drivers/pcmcia/vrc4173_cardu.c | 2
drivers/serial/serial_txx9.c | 2
drivers/video/cyblafb.c | 1
include/linux/pci.h | 69 +++++++
sound/oss/ad1889.c | 2
sound/oss/btaudio.c | 2
sound/oss/cmpci.c | 2
sound/oss/cs4281/cs4281m.c | 2
sound/oss/cs46xx.c | 2
sound/oss/emu10k1/main.c | 2
sound/oss/es1370.c | 2
sound/oss/es1371.c | 2
sound/oss/ite8172.c | 2
sound/oss/kahlua.c | 2
sound/oss/maestro.c | 2
sound/oss/nec_vrc5477.c | 2
sound/oss/nm256_audio.c | 2
sound/oss/rme96xx.c | 2
sound/oss/sonicvibes.c | 2
sound/oss/ymfpci.c | 2
70 files changed, 956 insertions(+), 444 deletions(-)

Adrian Bunk:
PCI Hotplug: cpqphp_ctrl.c: remove dead code
PCI: drivers/pci: some cleanups

Benjamin Herrenschmidt:
PCI: Export pci_cfg_space_size

Daniel Marjamäki:
PCI: irq.c: trivial printk and DBG updates

Daniel Yeisley:
PCI Quirk: 1K I/O space granularity on Intel P64H2

Dominik Brodowski:
PCI: use bus numbers sparsely, if necessary

Greg Kroah-Hartman:
PCI Hotplug: fix up the sysfs file in the compaq pci hotplug driver
drivers/sound/oss: Replace pci_module_init() with pci_register_driver()

Hanna Linder:
PCI: arch/i386/pci/acpi.c: use for_each_pci_dev

Jesper Juhl:
PCI: Reduce nr of ptr derefs in drivers/pci/hotplug/cpqphp_core.c
PCI: Reduce nr of ptr derefs in drivers/pci/hotplug/rpaphp_pci.c
PCI: Reduce nr of ptr derefs in drivers/pci/hotplug/pciehp_core.c
PCI: Reduce nr of ptr derefs in drivers/pci/hotplug/pciehprm_acpi.c

Jesse Barnes:
PCI: document sysfs rom file interface
PCI: update Toshiba ohci quirk DMI table

Jiri Slaby:
PCI: pci_find_device remove (ppc/platforms/85xx/mpc85xx_cds_common.c)
PCI: pci_find_device remove (alpha/kernel/sys_sio.c)
PCI: pci_find_device remove (alpha/kernel/sys_alcor.c)
PCI: pci_find_device remove (ppc/kernel/pci.c)
PCI: arch: pci_find_device remove (frv/mb93090-mb00/pci-irq.c)
PCI: pci_find_device remove (frv/mb93090-mb00/pci-frv.c)
PCI: pci_find_device remove (sparc64/kernel/ebus.c)

Jordan, William P:
PCI Hotplug: ibmphp_pci.c copy-n-paste fix

Kenji Kaneshige:
shpchp: fix improper reference to Slot Avail Register
shpchp: fix improper reference to Mode 1 ECC Capability bit
shpchp: replace pci_find_slot() with pci_get_slot()
shpchp: fix improper mmio mapping
shpchp: fix improper wait for command completion
shpchp: fix improper write to Command Completion Detect bit
shpchp: Implement get_address callback

Kristen Accardi:
pci: use pin stored in pci_dev
acpi: use pin stored in pci_dev
pci: store PCI_INTERRUPT_PIN in pci_dev
pci: call pci_read_irq for bridges
acpiphp: only size new bus

linas:
PCI Error Recovery: header file patch

[email protected]:
PCI Hotplug/powerpc: remove duplicated code
PCI Hotplug/powerpc: more removal of duplicated code
PCI Error Recovery: documentation

Rajesh Shah:
pciehp: allow bridged card hotplug

Richard Knutsson:
arch: Replace pci_module_init() with pci_register_driver()
drivers/block: Replace pci_module_init() with pci_register_driver()
drivers/*rest*: Replace pci_module_init() with pci_register_driver()

Sergey Vlasov:
PCIE: make bus_id for PCI Express devices unique

Thomas Schaefer:
pciehp: handle sticky power-fault status


2006-01-10 00:00:43

by Linus Torvalds

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry



On Mon, 9 Jan 2006, Greg KH wrote:
>
> Here are some PCI patches against your latest git tree. They have all
> been in the -mm tree for a while with no problems. I've pulled out all
> of the offending patches that people objected to, or ones that crashed
> older machines from the last series I sent you.

Before I pull this, I'd like to get some confirmation that some of the
other problems that seem to be PCI-related in the -mm tree are also
understood, or at least known to be part of the stuff that you're _not_
sending me..

[ There's at least a pci_call_probe() NULL ptr dereference report by
Martin Bligh, I think Andrew has a few others he's tracked.. ]

Linus

2006-01-10 00:44:30

by Andrew Morton

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

Linus Torvalds wrote:
>
>
>
> On Mon, 9 Jan 2006, Greg KH wrote:
> >
> > Here are some PCI patches against your latest git tree. They have all
> > been in the -mm tree for a while with no problems. I've pulled out all
> > of the offending patches that people objected to, or ones that crashed
> > older machines from the last series I sent you.
>
> Before I pull this, I'd like to get some confirmation that some of the
> other problems that seem to be PCI-related in the -mm tree are also
> understood, or at least known to be part of the stuff that you're _not_
> sending me..

It's really hard to keep track of all this, so it's likely that some things
will still sneak through.

- Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
or driver core.

- A few problems with ehci. For example Grant Coady went oops loading
the module. Probably USB, maybe solved now, but there are
interactions...

- gregkh-pci-x86-pci-domain-support-the-meat.patch is a problem, but
wasn't in this tree.

> [ There's at least a pci_call_probe() NULL ptr dereference report by
> Martin Bligh, I think Andrew has a few others he's tracked.. ]

Yes, Martin is reporting failures on a few machines. Hopefully he's
working out whether gregkh-pci-x86-pci-domain-support-the-meat.patch was
the culprit here. If so, I'd say we're good to go. If that's _not_ the
source then we just don't know where the failure is coming from.

All very vague, sorry.

2006-01-10 01:46:37

by Alan

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
> - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
> or driver core.

libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
pata driver.


2006-01-10 01:50:10

by Andrew Morton

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

Alan Cox wrote:
>
> On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
> > - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
> > or driver core.
>
> libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
> pata driver.

Well that's all merged up now. Reuben, could you please test 2.6.15git6
tomorrow?

2006-01-10 02:28:25

by Greg KH

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

On Mon, Jan 09, 2006 at 04:44:10PM -0800, Andrew Morton wrote:
> Linus Torvalds wrote:
> >
> > On Mon, 9 Jan 2006, Greg KH wrote:
> > >
> > > Here are some PCI patches against your latest git tree. They have all
> > > been in the -mm tree for a while with no problems. I've pulled out all
> > > of the offending patches that people objected to, or ones that crashed
> > > older machines from the last series I sent you.
> >
> > Before I pull this, I'd like to get some confirmation that some of the
> > other problems that seem to be PCI-related in the -mm tree are also
> > understood, or at least known to be part of the stuff that you're _not_
> > sending me..
>
> It's really hard to keep track of all this, so it's likely that some things
> will still sneak through.
>
> - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
> or driver core.

Haven't heard of this one before, but it shouldn't be a pci issue.

> - A few problems with ehci. For example Grant Coady went oops loading
> the module. Probably USB, maybe solved now, but there are
> interactions...

People are still working on this one, but it shouldn't be a pci issue.

> - gregkh-pci-x86-pci-domain-support-the-meat.patch is a problem, but
> wasn't in this tree.
>
> > [ There's at least a pci_call_probe() NULL ptr dereference report by
> > Martin Bligh, I think Andrew has a few others he's tracked.. ]

Yes, that's the x86-* patches in my tree, from Jeff, and I'm not going
to be sending them to you until all of the breakage is fixed up (he
created them for machines that aren't public yet, so I don't think
there's a rush for them to get in anytime soon...)

> Yes, Martin is reporting failures on a few machines. Hopefully he's
> working out whether gregkh-pci-x86-pci-domain-support-the-meat.patch was
> the culprit here. If so, I'd say we're good to go. If that's _not_ the
> source then we just don't know where the failure is coming from.

It sure looks like it's the reason why, as we are suddenly failing with
a NULL pointer problem after we change an integer field into a pointer :)

Linus, it should all be safe for you to pull this tree, as I took
everything that people objected to out of my last attempt :)

thanks,

greg k-h

2006-01-10 10:04:21

by Reuben Farrelly

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry



On 10/01/2006 2:49 p.m., Andrew Morton wrote:
> Alan Cox wrote:
>> On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
>>> - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
>>> or driver core.
>> libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
>> pata driver.
>
> Well that's all merged up now. Reuben, could you please test 2.6.15git6
> tomorrow?

A couple of reboots later with git6 and at this stage it seems all OK, no oopses.

I'm still having 100% repeatable "soft" hangs when booting up though, both with
-mm2 (-mm1 seems OK in this regard) and git6. It's enough to make git6 and mm2
unusable because the machine never finishes booting userspace. I'll put more
details of that in another email following up to the original -mm2 thread, as
it's unrelated to the oops above (but probably equally as nasty).

But it means I can't test the git6 fixes much more because every time I boot it
I have to alt-sysrq S+U+B or uncleanly kill the box by hitting the reset button.

reuben

2006-01-12 03:55:59

by Reuben Farrelly

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry



On 10/01/2006 2:49 p.m., Andrew Morton wrote:
> Alan Cox wrote:
>> On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
>>> - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
>>> or driver core.
>> libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
>> pata driver.
>
> Well that's all merged up now. Reuben, could you please test 2.6.15git6
> tomorrow?

Seemingly not fixed after all. I've been doing many reboots lately getting to
the bottom of the barrier/md bug and just before I hit this with -mm3
(linus.patch -git7) which I believe is the same bug (the call trace looks very
similar).


Real Time Clock Driver v1.12ac
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
?serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ACPI: PCI Interrupt 0000:06:02.0[A] -> GSI 18 (level, low) -> IRQ 185
0000:06:02.0: ttyS1 at I/O 0xbc00 (irq = 185) is a 16550A
0000:06:02.0: ttyS2 at I/O 0xbc08 (irq = 185) is a 16550A
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ahci: probe of 0000:00:1f.2 failed with error -12
ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c023c873
*pde = 00000000
Oops: 0000 [#1]
SMP
last sysfs file:
Modules linked in:
CPU: 0
EIP: 0060:[<c023c873>] Not tainted VLI
EFLAGS: 00010206 (2.6.15-mm3)
EIP is at make_class_name+0x28/0x8d
eax: 00000000 ebx: ffffffff ecx: ffffffff edx: c1a12224
esi: 00000009 edi: 00000000 ebp: c1921d2c esp: c1921d1c
ds: 007b es: 007b ss: 0068
Process swapper (pid: 1, threadinfo=c1921000 task=c1920a90)
Stack: <0>c1a12224 c03913f8 c1a12224 c03913f8 c1921d54 c023cabd c1921d58 c0391380
00000000 c1af39c0 c0391400 c1a12224 c1a12000 c1a12030 c1921d60 c023cb7b
c1a120e4 c1921d74 c0255dbf c1a122c0 c1a43a40 00000000 c1921d80 c025e393
Call Trace:
[<c0103c5d>] show_stack+0x9b/0xc0
[<c0103de4>] show_registers+0x162/0x1e7
[<c0103f8f>] die+0x126/0x231
[<c01140db>] do_page_fault+0x271/0x5b9
[<c01037df>] error_code+0x4f/0x54
[<c023cabd>] class_device_del+0xa3/0x156
[<c023cb7b>] class_device_unregister+0xb/0x15
[<c0255dbf>] scsi_remove_host+0xb4/0xef
[<c025e393>] ata_host_remove+0x11/0x1c
[<c0260ec6>] ata_device_add+0x2e4/0xb7b
[<c0261cd6>] ata_pci_init_one+0x322/0x387
[<c0265b34>] piix_init_one+0x18c/0x338
[<c01f4f4f>] pci_device_probe+0x44/0x5f
[<c023bf62>] driver_probe_device+0x3e/0xb0
[<c023c0df>] __driver_attach+0x8e/0x90
[<c023b9f3>] bus_for_each_dev+0x44/0x62
[<c023bece>] driver_attach+0x19/0x1b
[<c023b687>] bus_add_driver+0x6d/0x126
[<c023c350>] driver_register+0x6b/0x9b
[<c01f50fb>] __pci_register_driver+0x6a/0x95
[<c03e8ea8>] piix_init+0xf/0x22
[<c01003cc>] init+0xff/0x325
[<c0100d25>] kernel_thread_helper+0x5/0xb
Code: c8 5d c3 55 89 e5 57 56 53 83 ec 04 89 45 f0 89 c2 8b 40 48 8b 38 bb ff ff
ff ff 89 d9 31 c0 f2 ae f7 d1 49 89 ce 8b 7a 08 89 d9 <f2> ae f7 d1 49 89 ca 8d
4e 02 8d 04 0a ba d0 00 00 00 e8 3b 72
<0>Kernel panic - not syncing: Attempted to kill init!

reuben




2006-01-12 04:29:36

by Andrew Morton

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

Reuben Farrelly wrote:
>
>
>
> On 10/01/2006 2:49 p.m., Andrew Morton wrote:
> > Alan Cox wrote:
> >> On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
> >>> - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
> >>> or driver core.
> >> libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
> >> pata driver.
> >
> > Well that's all merged up now. Reuben, could you please test 2.6.15git6
> > tomorrow?
>
> Seemingly not fixed after all. I've been doing many reboots lately getting to
> the bottom of the barrier/md bug and just before I hit this with -mm3
> (linus.patch -git7) which I believe is the same bug (the call trace looks very
> similar).
>
> ...

I'm getting my bugs confused now - there are so many. Were you the person
who reported this before?

> serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> ACPI: PCI Interrupt 0000:06:02.0[A] -> GSI 18 (level, low) -> IRQ 185
> 0000:06:02.0: ttyS1 at I/O 0xbc00 (irq = 185) is a 16550A
> 0000:06:02.0: ttyS2 at I/O 0xbc08 (irq = 185) is a 16550A
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a post-1991 82077
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> ahci: probe of 0000:00:1f.2 failed with error -12
> ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
> ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> c023c873
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file:
> Modules linked in:
> CPU: 0
> EIP: 0060:[<c023c873>] Not tainted VLI
> EFLAGS: 00010206 (2.6.15-mm3)
> EIP is at make_class_name+0x28/0x8d
> eax: 00000000 ebx: ffffffff ecx: ffffffff edx: c1a12224
> esi: 00000009 edi: 00000000 ebp: c1921d2c esp: c1921d1c
> ds: 007b es: 007b ss: 0068
> Process swapper (pid: 1, threadinfo=c1921000 task=c1920a90)
> Stack: <0>c1a12224 c03913f8 c1a12224 c03913f8 c1921d54 c023cabd c1921d58 c0391380
> 00000000 c1af39c0 c0391400 c1a12224 c1a12000 c1a12030 c1921d60 c023cb7b
> c1a120e4 c1921d74 c0255dbf c1a122c0 c1a43a40 00000000 c1921d80 c025e393
> Call Trace:
> [<c0103c5d>] show_stack+0x9b/0xc0
> [<c0103de4>] show_registers+0x162/0x1e7
> [<c0103f8f>] die+0x126/0x231
> [<c01140db>] do_page_fault+0x271/0x5b9
> [<c01037df>] error_code+0x4f/0x54
> [<c023cabd>] class_device_del+0xa3/0x156
> [<c023cb7b>] class_device_unregister+0xb/0x15
> [<c0255dbf>] scsi_remove_host+0xb4/0xef
> [<c025e393>] ata_host_remove+0x11/0x1c
> [<c0260ec6>] ata_device_add+0x2e4/0xb7b
> [<c0261cd6>] ata_pci_init_one+0x322/0x387
> [<c0265b34>] piix_init_one+0x18c/0x338
> [<c01f4f4f>] pci_device_probe+0x44/0x5f
> [<c023bf62>] driver_probe_device+0x3e/0xb0
> [<c023c0df>] __driver_attach+0x8e/0x90
> [<c023b9f3>] bus_for_each_dev+0x44/0x62
> [<c023bece>] driver_attach+0x19/0x1b
> [<c023b687>] bus_add_driver+0x6d/0x126
> [<c023c350>] driver_register+0x6b/0x9b
> [<c01f50fb>] __pci_register_driver+0x6a/0x95
> [<c03e8ea8>] piix_init+0xf/0x22
> [<c01003cc>] init+0xff/0x325
> [<c0100d25>] kernel_thread_helper+0x5/0xb
> Code: c8 5d c3 55 89 e5 57 56 53 83 ec 04 89 45 f0 89 c2 8b 40 48 8b 38 bb ff ff
> ff ff 89 d9 31 c0 f2 ae f7 d1 49 89 ce 8b 7a 08 89 d9 <f2> ae f7 d1 49 89 ca 8d
> 4e 02 8d 04 0a ba d0 00 00 00 e8 3b 72

Jeff, I believe this is a sata bug. ata_device_add() called
ata_host_remove() and something under there is not yet sufficiently
initialised.
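
To illustrate the failure shape described above: in the trace, class_device_del() ends up in make_class_name(), and the faulting bytes in the Code: line (`f2 ae`, a `repnz scasb` string scan) with edi at 0 point to a string-length computation on a NULL pointer. The following is a minimal userspace sketch of building a "class:name" label with a guard for a name that was never set. It is not the kernel's code; `model_class_dev` and `model_make_class_name()` are hypothetical names, and the field layout is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a class device: only the two
 * strings that a "class:name" label is built from. */
struct model_class_dev {
    const char *class_name;  /* e.g. "scsi_host" */
    const char *kobj_name;   /* NULL until the device is fully set up */
};

/* Builds "class:name". Returns NULL instead of calling strlen(NULL)
 * when either string is missing, or when allocation fails.
 * The caller frees the result. */
static char *model_make_class_name(const struct model_class_dev *cd)
{
    size_t size;
    char *name;

    if (cd == NULL || cd->class_name == NULL || cd->kobj_name == NULL)
        return NULL;

    /* +2: one byte for the ':' separator, one for the trailing NUL. */
    size = strlen(cd->class_name) + strlen(cd->kobj_name) + 2;
    name = malloc(size);
    if (name == NULL)
        return NULL;

    strcpy(name, cd->class_name);
    strcat(name, ":");
    strcat(name, cd->kobj_name);
    assert(strlen(name) + 1 == size);
    return name;
}
```

With a half-initialised device (class name set, object name still NULL) the sketch returns NULL, so a teardown path can skip the label rather than fault.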

2006-01-12 04:55:47

by Reuben Farrelly

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry



On 12/01/2006 5:29 p.m., Andrew Morton wrote:
> Reuben Farrelly wrote:
>>
>>
>> On 10/01/2006 2:49 p.m., Andrew Morton wrote:
>>> Alan Cox wrote:
>>>> On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
>>>>> - Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
>>>>> or driver core.
>>>> libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
>>>> pata driver.
>>> Well that's all merged up now. Reuben, could you please test 2.6.15git6
>>> tomorrow?
>> Seemingly not fixed after all. I've been doing many reboots lately getting to
>> the bottom of the barrier/md bug and just before I hit this with -mm3
>> (linus.patch -git7) which I believe is the same bug (the call trace looks very
>> similar).
>>
>> ...
>
> I'm getting my bugs confused now - there are so many. Were you the person
> who reported this before?

Yes. It was suggested I try -git6. I reported that it seemed to be OK, but
clearly it isn't. Then again, I've done a hell of a lot of reboots in the last
couple of days.

I've updated my list at

http://www.reub.net/files/kernel/outstanding-kernel-bugs.txt
and
http://www.reub.net/files/kernel/

This is a very basic text file which outlines the details of the various bugs
that I have on the go at any given point in time and where they're at. There
are various postings on LKML reporting almost all of them.

Thread starts http://www.ussg.iu.edu/hypermail/linux/kernel/0601.1/0619.html
Greg KH (Mon Jan 09 2006 - 15:36:39 EST)


>> serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
>> ACPI: PCI Interrupt 0000:06:02.0[A] -> GSI 18 (level, low) -> IRQ 185
>> 0000:06:02.0: ttyS1 at I/O 0xbc00 (irq = 185) is a 16550A
>> 0000:06:02.0: ttyS2 at I/O 0xbc08 (irq = 185) is a 16550A
>> Floppy drive(s): fd0 is 1.44M
>> FDC 0 is a post-1991 82077
>> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
>> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
>> ahci: probe of 0000:00:1f.2 failed with error -12
>> ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
>> ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
>> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>> printing eip:
>> c023c873
>> *pde = 00000000
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file:
>> Modules linked in:
>> CPU: 0
>> EIP: 0060:[<c023c873>] Not tainted VLI
>> EFLAGS: 00010206 (2.6.15-mm3)
>> EIP is at make_class_name+0x28/0x8d
>> eax: 00000000 ebx: ffffffff ecx: ffffffff edx: c1a12224
>> esi: 00000009 edi: 00000000 ebp: c1921d2c esp: c1921d1c
>> ds: 007b es: 007b ss: 0068
>> Process swapper (pid: 1, threadinfo=c1921000 task=c1920a90)
>> Stack: <0>c1a12224 c03913f8 c1a12224 c03913f8 c1921d54 c023cabd c1921d58 c0391380
>> 00000000 c1af39c0 c0391400 c1a12224 c1a12000 c1a12030 c1921d60 c023cb7b
>> c1a120e4 c1921d74 c0255dbf c1a122c0 c1a43a40 00000000 c1921d80 c025e393
>> Call Trace:
>> [<c0103c5d>] show_stack+0x9b/0xc0
>> [<c0103de4>] show_registers+0x162/0x1e7
>> [<c0103f8f>] die+0x126/0x231
>> [<c01140db>] do_page_fault+0x271/0x5b9
>> [<c01037df>] error_code+0x4f/0x54
>> [<c023cabd>] class_device_del+0xa3/0x156
>> [<c023cb7b>] class_device_unregister+0xb/0x15
>> [<c0255dbf>] scsi_remove_host+0xb4/0xef
>> [<c025e393>] ata_host_remove+0x11/0x1c
>> [<c0260ec6>] ata_device_add+0x2e4/0xb7b
>> [<c0261cd6>] ata_pci_init_one+0x322/0x387
>> [<c0265b34>] piix_init_one+0x18c/0x338
>> [<c01f4f4f>] pci_device_probe+0x44/0x5f
>> [<c023bf62>] driver_probe_device+0x3e/0xb0
>> [<c023c0df>] __driver_attach+0x8e/0x90
>> [<c023b9f3>] bus_for_each_dev+0x44/0x62
>> [<c023bece>] driver_attach+0x19/0x1b
>> [<c023b687>] bus_add_driver+0x6d/0x126
>> [<c023c350>] driver_register+0x6b/0x9b
>> [<c01f50fb>] __pci_register_driver+0x6a/0x95
>> [<c03e8ea8>] piix_init+0xf/0x22
>> [<c01003cc>] init+0xff/0x325
>> [<c0100d25>] kernel_thread_helper+0x5/0xb
>> Code: c8 5d c3 55 89 e5 57 56 53 83 ec 04 89 45 f0 89 c2 8b 40 48 8b 38 bb ff ff
>> ff ff 89 d9 31 c0 f2 ae f7 d1 49 89 ce 8b 7a 08 89 d9 <f2> ae f7 d1 49 89 ca 8d
>> 4e 02 8d 04 0a ba d0 00 00 00 e8 3b 72
>
> Jeff, I believe this is a sata bug. ata_device_add() called
> ata_host_remove() and something under there is not yet sufficiently
> initialised.

reuben

2006-01-12 11:40:08

by Alan

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

On Iau, 2006-01-12 at 16:55 +1300, Reuben Farrelly wrote:
> ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
> ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
> Unable to handle kernel NULL pointer dereference at virtual address 00000000

That is the critical bit. The SATA ports have no PCI resources assigned
for bus mastering (BAR 4). libata should have driven the device PIO in
this case but the resource should have been assigned.

2006-01-12 20:55:35

by Jeff Garzik

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

Alan Cox wrote:
> On Llu, 2006-01-09 at 16:44 -0800, Andrew Morton wrote:
>
>>- Reuben Farrelly's oops in make_class_name(). Could be libata, or scsi
>> or driver core.
>
>
> libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
> pata driver.


Any additional info? How can I reproduce?

Jeff

2006-01-12 20:59:40

by Jeff Garzik

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

Alan Cox wrote:
> On Iau, 2006-01-12 at 16:55 +1300, Reuben Farrelly wrote:
>
>>ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
>>ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
>>Unable to handle kernel NULL pointer dereference at virtual address 00000000
>
>
> That is the critical bit. The SATA ports have no PCI resources assigned
> for bus mastering (BAR 4). libata should have driven the device PIO in
> this case but the resource should have been assigned.

Agreed. This appears to be BIOS assigning bad values to SATA hardware.

However, libata should recognize this and not attempt to iomap or drive
the hardware, in that case, rather than oops.
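
The two points above (fall back to PIO when the bus-master resource is missing, and refuse to drive the hardware at all when the values are bad) can be sketched as a small decision function. This is a userspace illustration only, not libata code. The struct, the enum and `plan_port()` are hypothetical, and treating a BAR start of 0 as "not assigned" is an assumption of the sketch, chosen because the logged ports report cmd 0x0 and bmdma 0x0 / 0x8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the BAR start addresses a port depends on.
 * This sketch treats 0 as "not assigned". */
struct model_port_bars {
    unsigned long cmd_bar;    /* command block registers */
    unsigned long bmdma_bar;  /* bus-master DMA registers (BAR 4) */
};

enum port_plan {
    PORT_REFUSE,    /* no usable registers: fail the probe cleanly */
    PORT_PIO_ONLY,  /* registers present, but no bus-master DMA */
    PORT_DMA        /* everything assigned */
};

/* Decide how far to go with a port before touching any hardware. */
static enum port_plan plan_port(const struct model_port_bars *bars)
{
    assert(bars != NULL);

    if (bars->cmd_bar == 0)
        return PORT_REFUSE;
    if (bars->bmdma_bar == 0)
        return PORT_PIO_ONLY;
    return PORT_DMA;
}
```

Under these assumptions the ports in the log would land in PORT_REFUSE, since both the command and bus-master bases are unassigned.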

Jeff


2006-01-13 00:14:24

by Alan

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry

On Iau, 2006-01-12 at 15:55 -0500, Jeff Garzik wrote:
> > libata I think. I reproduced it on 2.6.14-mm2 by accident with a buggy
> > pata driver.
>
>
> Any additional info? How can I reproduce?

In my case I'm fairly sure (waves arms frantically) that it was
registering a controller which then failed to add any drives so it got
cleaned back up early

2006-01-16 13:11:35

by Reuben Farrelly

Subject: Re: [GIT PATCH] PCI patches for 2.6.15 - retry



On 13/01/2006 9:59 a.m., Jeff Garzik wrote:
> Alan Cox wrote:
>> On Iau, 2006-01-12 at 16:55 +1300, Reuben Farrelly wrote:
>>
>>> ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
>>> ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
>>> Unable to handle kernel NULL pointer dereference at virtual address
>>> 00000000
>>
>>
>> That is the critical bit. The SATA ports have no PCI resources assigned
>> for bus mastering (BAR 4). libata should have driven the device PIO in
>> this case but the resource should have been assigned.
>
> Agreed. This appears to be BIOS assigning bad values to SATA hardware.
>
> However, libata should recognize this and not attempt to iomap or drive
> the hardware, in that case, rather than oops.
>
> Jeff

Some testing tonight has shown up a bit more about where this regression crept in.

In the table below, the release is on the left-hand side and the result of each
reboot of that release is on the right-hand side, after the dash. I had to do so
many reboots to be sure whether the bug was there or not - as you can see from
the -mm1 test it doesn't always show up.

2.6.15 - OK OK OK OK OK
2.6.15-git1 - OK OK OK OK OK OK OK OK
2.6.15-git2 - OK
2.6.15-git6 - OK OK OK OK OK OK OK OK
2.6.15-git12 - OK OK OK OK OK OK OK

2.6.15-rc5-mm3 - OK OK OK(but oopsed in usb) OK OK(but oopsed in usb)
Those oopses in USB were only seen in this release so looks likely whatever
was causing them was fixed soon after.
2.6.15-mm1 - OK OK OOPSED OOPSED OOPSED all in ATA
2.6.15-mm2 + mm3 - [known to OOPS on this bug frequently but not tested in this
round]
2.6.15-mm4 - OOPSED OK OOPSED TIMEOUT TIMEOUT OOPS OK
2.6.15-mm1 with no git-acpi.patch - TIMEOUT TIMEOUT OOPSED TIMEOUT OK

OK means the system booted up to single user mode without issue, TIMEOUT means
that the controllers were assigned IRQ 50 and then failed to find any ATA disks
and OOPSED means that the SATA ports were not assigned IRQs at all and hence the
system oopsed out like this:

ahci: probe of 0000:00:1f.2 failed with error -12
ata1: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x0 irq 0
ata2: SATA max UDMA/133 cmd 0x0 ctl 0x2 bmdma 0x8 irq 0
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c023c873
*pde = 00000000
Oops: 0000 [#1]
<plus a trace and a whole lot more>

Full output on http://www.reub.net/files/kernel/outstanding-kernel-bugs.txt (as
usual)

In summary the good news is that 2.6.15-git12 (which is the latest linus tree)
is GOOD and does not seem to exhibit this problem. Not a single -git release
crapped out. It seems some regression was introduced into 2.6.15-mm1 which has
been carried forward through to -mm4 so far, but was never pushed to Linus.
I guess it also suggests that it's not a hardware or bios issue given the sheer
number of successful reboots in a row, and I think reverting the git-acpi.patch
suggests that ACPI is not the cause of it, at least in this instance. But
that's about as far as I have gotten.
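
A back-of-the-envelope check of how much weight a run of clean boots carries, given that the bug is intermittent. The sketch assumes boots are independent and that an affected kernel fails at the rate seen in the plain -mm1 row above (3 failures in 5 boots). Both are assumptions, and the sample is small. `chance_all_clean()` is a hypothetical helper written for this example.

```c
#include <assert.h>

/* Probability that `clean_boots` consecutive boots all succeed when
 * each boot independently fails with probability failures/boots.
 * Returns -1.0 for invalid input. */
static double chance_all_clean(unsigned failures, unsigned boots,
                               unsigned clean_boots)
{
    double ok_rate, p = 1.0;
    unsigned i;

    if (boots == 0 || failures > boots)
        return -1.0;

    ok_rate = (double)(boots - failures) / (double)boots;
    for (i = 0; i < clean_boots; i++)
        p *= ok_rate;

    assert(p >= 0.0 && p <= 1.0);
    return p;
}
```

For the seven clean -git12 boots, `chance_all_clean(3, 5, 7)` is 0.4 to the seventh power, about 0.0016, so seven clean boots in a row would be unlikely if -git12 carried the same bug at the -mm1 rate.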

45 reboots later I'm finishing for tonight, but before I go back and hit it with
git bisect to narrow it down any further, Andrew/Jeff does this make it any
easier to pinpoint, and/or do you have any preliminary patches to test or ideas
as to what other subsystems could be involved?

Thanks,
Reuben