2012-05-02 11:21:42

by Martin Mokrejs

[permalink] [raw]
Subject: Re: linux-3.4-rc5: Oops in klist_del()



Yinghai Lu wrote:
> On Mon, Apr 30, 2012 at 2:58 PM, Martin Mokrejs
> <[email protected]> wrote:
>> Martin Mokrejs wrote:
>>> Hi,
>>> I was removing an ExpressCard with a NEC chipset used by xhci_hcd and the kernel crashed
>>> on my Dell Vostro 3550 laptop. Its express slot chipset somehow does not report properly
>>> that a card was unplugged, as discussed earlier here on linux-pci under Subject
>>> "Re: 3.2.11: PCI Express card cannot be re-detected withing cca 60sec timeframe".
>>>
>>> Maybe this helps somebody to find the cause? Here is a bit of the attached Oops snapshot
>>> stacktrace:
>>>
>>> klist_del
>>> fdevice_del
>>> device_unregister
>>> pci_stop_bus_device
>>> pci_stop_and_remove_bus_device
>>> devehp_unconfigure_device
>>> pciehp_disable_slot
>>> ? pciehp_disable_slot
>>> pciehp_power_thread
>>> process_one_work
>>> worker_thriad
>>> ? manage_workers.clone
>>> ? manage_workers.clone
>>> kthrerd
>>> ...
>>
>> This is 100% reproducible. It happens with 3.4-rc5 and 3.3.4 kernels while not with 3.2.14.
>> I did not have the external enclosure earlier but repeated unplugs/re-plugs of the NEC-based
>> expresscard were/are just not enough to trigger the Oops. One needs the ASMedia2105 in addition.
>>
>> To hit the Ooops, the disk has to be *RE-detected* and unplugged (so not its very first plugin).
>> Express card removal after the very first plugin says usually something about the surprise
>> removal. When I re-plug the disk, it gets detected, and it removal hits the Ooops. I am attaching
>> dmesg from 3.2.14 kernel which does not have the problem. You can see how many times I did remove
>> and re-plug the card. If I waited long ebough for xhci_hcd to timeout waiting for the unplugged
>> device, it says
>>
>> xhci_hcd 0000:11:00.0: Timeout while waiting for configure endpoint command
>> xHCI xhci_drop_endpoint called for root hub
>> xHCI xhci_check_bandwidth called for root hub
>> xhci_hcd 0000:11:00.0: USB bus 5 deregistered
>> xhci_hcd 0000:11:00.0: PCI INT A disabled
>>
>> and then, subsequent re-plug of the card with the disk attached results in detected SCSI device.
>> In contrast, if I re-plug the express card earlier express card slot change is reported but the
>> disc is not detected (something is blocked in xhci_hcd I think).
>>
>> The above stacktrace (of actually the camera snapshot is not the first stacktrace, unfortunately).
>> I see there are 2-3 screen scrolled away. Hmm, but I do not have anything what could be used as
>> a remote console ...
>
> So if you use "echo 1 > /sys/.../0000:11.00.0/remove" before surprise
> remove the card,
>
> the oops would not happen?

You are right, it does NOT happen. I managed to get somehow a piece of the stacktrace before the usbserial
on the affected host died:

[ 508.458604] pciehp 0000:00:1c.7:pcie04: Card present on Slot(7)
[ 508.558838] pciehp 0000:00:1c.7:pcie04: Device 0000:11:00.0 already exists at 0000:11:00, cannot hot-add
[ 508.558880] pciehp 0000:00:1c.7:pcie04: Cannot add device at 0000:11:00
[ 521.584091] pciehp 0000:00:1c.7:pcie04: Card not present on Slot(7)
[ 521.584319] ------------[ cut here ]------------
[ 521.584325] WARNING: at fs/sysfs/group.c:142 sysfs_remove_group+0x4f/0x83()
[ 521.584326] Hardware name: Vostro 3550
[ 521.584328] sysfs group ffffffff8194f980 not found for kobject '0000:11:00.0'
[ 521.584330] Modules linked in: iwlwifi usb_storage i2c_i801
[ 521.584335] Pid: 1589, comm: kworker/0:2 Not tainted 3.4.0-rc5-default #5
[ 521.584337] Call Trace:
[ 521.584343] [<ffffffff8107af23>] warn_slowpath_common+0x80/0x98
[ 521.584345] [<ffffffff8107afcf>] warn_slowpath_fmt+0x41/0x43
[ 521.584348] [<ffffffff8115e0fd>] ? sysfs_get_dirent+0x47/0x51
[ 521.584351] [<ffffffff8115f5dc>] sysfs_remove_group+0x4f/0x83
[ 521.584354] [<ffffffff8132f36a>] dpm_sysfs_remove+0x33/0x39
[ 521.584357] [<ffffffff81328486>] device_del+0x47/0x17d
[ 521.584359] [<ffffffff813285cd>] device_unregister+0x11/0x1e
[ 521.584362] [<ffffffff81242782>] pci_stop_bus_device+0x5e/0x80
[ 521.584364] [<ffffffff8124290f>] pci_stop_and_remove_bus_device+



Also, there is a message

[ 525.302615] sysfs group ffffffff8194f980 not found for kobject '0000:11:00.0'

when I was rebooting the computer. I think this might be pointing to the place of crash as well.


Once again, does not happen with 3.2.14 but does with 3.3.4 as well.

I will learn to use the echo trick always for this USB3.0 express card and and we came to that
already, it only happens when xhci_hcd was loaded and claimed the device.

In the attached traces are actually two bootups. In the former I got the Ooops.
In the second I used carefully the echo command to "unplug" the card from xhci_hcd
before physically removing it. But at the very end I few times omitted teh echo trick.
So at some point you will see different messages logged. And that is why at the very
end when system was to reboot, sysfs complained.

Thanks,
Martin


Attachments:
capture.txt (55.54 kB)