2006-05-10 18:32:47

by Andrew Vasquez

[permalink] [raw]
Subject: OOPS during FC-aware-driver module reload...

Recent linux-2.6.git trees introduced some oddities while performing a
simple load/unload/load of the qla2xxx (a FC transport aware driver)
driver.

Basically on module reload, I'd consistently hit the following

# insmod qla2xxx
# rmmod qla2xxx
# insmod qla2xxx

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip: 00000000
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: qla2xxx scsi_transport_fc
CPU: 1
EIP: 0060:[<00000000>] Not tainted VLI
EFLAGS: 00010246 (2.6.17-rc3 #32)
EIP is at _stext+0x3feffd68/0x23
eax: 00000000 ebx: f7c6c448 ecx: f3e92128 edx: 00000000
esi: f3e92128 edi: c02fe420 ebp: 6b6b6b6b esp: ea290dcc
ds: 007b es: 007b ss: 0068
Process insmod (pid: 10001, threadinfo=ea290000 task=f40ada90)
Stack: <0>c0226794 6b6b6b6b f3e92128 f8b216e0 f3e92128 f3e92000 00000000 000000d0
c0226cc3 f3e92128 c0226c90 c02598e4 f3e92128 f3e920f1 f3e92000 c02520c7
f3e92000 ffff5e05 f3e92358 f3e93b14 f3e92000 f8b04d29 f3e92000 c1b2eb5c
Call Trace:
<c0226794> attribute_container_add_device+0x4b/0x135 <c0226cc3> transport_setup_device+0xe/0x11
<c0226c90> transport_setup_classdev+0x0/0x25 <c02598e4> scsi_sysfs_add_host+0x9e/0xac
<c02520c7> scsi_add_host+0x129/0x179 <f8b04d29> qla2x00_probe_one+0xa33/0xb34 [qla2xxx]
<c0125eb3> call_usermodehelper_keys+0xf7/0x104 <c0125d78> __call_usermodehelper+0x0/0x44
<f8b05e1b> qla2xxx_probe_one+0xe/0x11 [qla2xxx] <c01e4ae6> pci_call_probe+0xf/0x12
<c01e4b1c> __pci_device_probe+0x33/0x47 <c01e4b4f> pci_device_probe+0x1f/0x34
<c0223ec4> driver_probe_device+0x43/0xa4 <c0223f95> __driver_attach+0x0/0x84
<c0223fee> __driver_attach+0x59/0x84 <c02235aa> bus_for_each_dev+0x47/0x6d
<c01dc91e> kobject_add+0xae/0xf7 <c022402d> driver_attach+0x14/0x18
<c0223f95> __driver_attach+0x0/0x84 <c02239d1> bus_add_driver+0x57/0x8d
<c02244b6> driver_register+0xb9/0xbe <c01e4d4e> __pci_register_driver+0x85/0x96
<f883c07b> qla2x00_module_init+0x7b/0xa2 [qla2xxx] <c01308c1> sys_init_module+0x8d/0x171
<c010266f> sysenter_past_esp+0x54/0x75
Code: Bad EIP value.
EIP: [<00000000>] _stext+0x3feffd68/0x23 SS:ESP 0068:ea290dcc

after some churning, git-bisect pointed to commit:

commit 56cf6504fc1c0c221b82cebc16a444b684140fb7
Author: Russell King <[email protected]>
Date: Fri May 5 17:57:52 2006 +0100

[BLOCK] Fix oops on removal of SD/MMC card

The block layer keeps a reference (driverfs_dev) to the struct
device associated with the block device, and uses it internally
for generating uevents in block_uevent.

Block device uevents include umounting the partition, which can
occur after the backing device has been removed.

Unfortunately, this reference is not counted. This means that
if the struct device is removed from the device tree, the block
layers reference will become stale.

Guard against this by holding a reference to the struct device
in add_disk(), and only drop the reference when we're releasing
the gendisk kobject - in other words when we can be sure that no
further uevents will be generated for this block device.

Signed-off-by: Russell King <[email protected]>
Acked-by: Jens Axboe <[email protected]>

after reverting the commit, all is fine again (no more panics during
reload). My fear though is that we're simply masking an underlying
problem which the commit attempted to correct.

Ideas?

--
av


2006-05-10 19:05:05

by Russell King

[permalink] [raw]
Subject: Re: OOPS during FC-aware-driver module reload...

On Wed, May 10, 2006 at 11:31:44AM -0700, Andrew Vasquez wrote:
> Recent linux-2.6.git trees introduced some oddities while performing a
> simple load/unload/load of the qla2xxx (a FC transport aware driver)
> driver.
>
> Basically on module reload, I'd consistently hit the following
>
> # insmod qla2xxx
> # rmmod qla2xxx
> # insmod qla2xxx
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip: 00000000
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> Modules linked in: qla2xxx scsi_transport_fc
> CPU: 1
> EIP: 0060:[<00000000>] Not tainted VLI
> EFLAGS: 00010246 (2.6.17-rc3 #32)
> EIP is at _stext+0x3feffd68/0x23
> eax: 00000000 ebx: f7c6c448 ecx: f3e92128 edx: 00000000
> esi: f3e92128 edi: c02fe420 ebp: 6b6b6b6b esp: ea290dcc
> ds: 007b es: 007b ss: 0068
> Process insmod (pid: 10001, threadinfo=ea290000 task=f40ada90)
> Stack: <0>c0226794 6b6b6b6b f3e92128 f8b216e0 f3e92128 f3e92000 00000000 000000d0
> c0226cc3 f3e92128 c0226c90 c02598e4 f3e92128 f3e920f1 f3e92000 c02520c7
> f3e92000 ffff5e05 f3e92358 f3e93b14 f3e92000 f8b04d29 f3e92000 c1b2eb5c
> Call Trace:
> <c0226794> attribute_container_add_device+0x4b/0x135 <c0226cc3> transport_setup_device+0xe/0x11
> <c0226c90> transport_setup_classdev+0x0/0x25 <c02598e4> scsi_sysfs_add_host+0x9e/0xac
> <c02520c7> scsi_add_host+0x129/0x179 <f8b04d29> qla2x00_probe_one+0xa33/0xb34 [qla2xxx]
> <c0125eb3> call_usermodehelper_keys+0xf7/0x104 <c0125d78> __call_usermodehelper+0x0/0x44
> <f8b05e1b> qla2xxx_probe_one+0xe/0x11 [qla2xxx] <c01e4ae6> pci_call_probe+0xf/0x12
> <c01e4b1c> __pci_device_probe+0x33/0x47 <c01e4b4f> pci_device_probe+0x1f/0x34
> <c0223ec4> driver_probe_device+0x43/0xa4 <c0223f95> __driver_attach+0x0/0x84
> <c0223fee> __driver_attach+0x59/0x84 <c02235aa> bus_for_each_dev+0x47/0x6d
> <c01dc91e> kobject_add+0xae/0xf7 <c022402d> driver_attach+0x14/0x18
> <c0223f95> __driver_attach+0x0/0x84 <c02239d1> bus_add_driver+0x57/0x8d
> <c02244b6> driver_register+0xb9/0xbe <c01e4d4e> __pci_register_driver+0x85/0x96
> <f883c07b> qla2x00_module_init+0x7b/0xa2 [qla2xxx] <c01308c1> sys_init_module+0x8d/0x171
> <c010266f> sysenter_past_esp+0x54/0x75
> Code: Bad EIP value.
> EIP: [<00000000>] _stext+0x3feffd68/0x23 SS:ESP 0068:ea290dcc
>
> after some churning, git-bisect pointed to commit:
>
> commit 56cf6504fc1c0c221b82cebc16a444b684140fb7
> Author: Russell King <[email protected]>
> Date: Fri May 5 17:57:52 2006 +0100
>
> [BLOCK] Fix oops on removal of SD/MMC card
>
> The block layer keeps a reference (driverfs_dev) to the struct
> device associated with the block device, and uses it internally
> for generating uevents in block_uevent.
>
> Block device uevents include umounting the partition, which can
> occur after the backing device has been removed.
>
> Unfortunately, this reference is not counted. This means that
> if the struct device is removed from the device tree, the block
> layers reference will become stale.
>
> Guard against this by holding a reference to the struct device
> in add_disk(), and only drop the reference when we're releasing
> the gendisk kobject - in other words when we can be sure that no
> further uevents will be generated for this block device.
>
> Signed-off-by: Russell King <[email protected]>
> Acked-by: Jens Axboe <[email protected]>
>
> after reverting the commit, all is fine again (no more panics during
> reload). My fear though is that we're simply masking an underlying
> problem which the commit attempted to correct.

Not sure why that would be, but I think your assumption is correct (that
this fix has uncovered something else which is wrong.)

I'm not sure what because I don't do well when trying to decode x86
oopsen. In fact, I'm not sure how it's getting from
attribute_container_add_device() to a NULL pointer.

The obvious thing is this:

if (!cont->match(cont, dev))
continue;

but that means that "cont" must be invalid. I don't know this code.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core