2013-10-14 22:47:59

by Andreas Noever

[permalink] [raw]
Subject: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
crashes a few seconds later. Using
echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
to remove a bridge two levels above the device triggers the fault immediately:

pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
pci_bus 0000:0a: busn_res: [bus 0a] is released
pci_bus 0000:09: busn_res: [bus 09-0a] is released
general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
....
Workqueue: events pci_pme_list_scan
task: ffff88044b0b8000 ti: ffff88044ac62000 task.ti: ffff88044ac62000
RIP: 0010:[<ffffffff812bdc8c>] [<ffffffff812bdc8c>] pci_pme_list_scan+0x3c/0xe0
RSP: 0018:ffff88044ac63e10 EFLAGS: 00010202
RAX: ffff88045601e7b0 RBX: ffffffff8187b070 RCX: 0000000000000000
RDX: 6b6b6b6b6b6b6b6b RSI: ffff88044ac63da0 RDI: ffff880453250ca8
RBP: ffff88044ac63e20 R08: ffff88044ac63da0 R09: 0001f9e0c287afc0
R10: 0001f9e0c287afc0 R11: 0000000000000000 R12: ffff880453250ca8
R13: ffff88046d053d00 R14: ffff88046d058200 R15: ffffffff8187afc8
FS: 0000000000000000(0000) GS:ffff88046d040000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd301d57000 CR3: 000000000280d000 CR4: 00000000001407e0
Stack:
ffffffff8187afc0 ffff88044a920e40 ffff88044ac63e68 ffffffff8107ddd6
000000006d053d00 0000000000000000 ffff88046d053d00 ffff88046d053d18
ffff88044a920e70 ffff88044b0b8000 ffff88044a920e40 ffff88044ac63ec8
Call Trace:
[<ffffffff8107ddd6>] process_one_work+0x176/0x470
[<ffffffff8107e79b>] worker_thread+0x11b/0x3a0
[<ffffffff8107e680>] ? manage_workers.isra.21+0x2b0/0x2b0
[<ffffffff810855e0>] kthread+0xc0/0xd0
[<ffffffff81085520>] ? kthread_create_on_node+0x110/0x110
[<ffffffff814f4c2c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085520>] ? kthread_create_on_node+0x110/0x110
Code: 54 53 e8 f8 c6 22 00 4c 8b 25 01 d4 5b 00 49 81 fc 70 b0 87 81
0f 84 98 00 00 00 49 8b 1c 24 4c 89 e7 eb 36 0f 1f 00 48 8b 50 10 <48>
8b 52 38 48 85 d2 74 07 8b 4a 78 85 c9 75 0a 31 f6 48 89 c7
RIP [<ffffffff812bdc8c>] pci_pme_list_scan+0x3c/0xe0
RSP <ffff88044ac63e10>
---[ end trace 3905f90a7dacf7b3 ]---

The offending line is:
(gdb) list *(pci_pme_list_scan+0x3c)
0xffffffff812bdc8c is in pci_pme_list_scan (drivers/pci/pci.c:1551).
1546 if (!list_empty(&pci_pme_list)) {
1547 list_for_each_entry_safe(pme_dev, n,
&pci_pme_list, list) {
1548 if (pme_dev->dev->pme_poll) {
1549 struct pci_dev *bridge;
1550
1551 bridge = pme_dev->dev->bus->self;
1552 /*
1553 * If bridge is in low power state, the
1554 * configuration space of
subordinate devices
1555 * may be not accessible
If I read the disassembly correctly then the deref of bus seems to
cause the oops.

An almost identical bug was reported (and fixed) some time ago:
http://lkml.indiana.edu/hypermail/linux/kernel/1302.1/01165.html

Andreas


2013-10-14 23:51:00

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

[+cc Rafael, Mika, Kirill, linux-pci]

On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
<[email protected]> wrote:
> When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
> crashes a few seconds later. Using
> echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
> to remove a bridge two levels above the device triggers the fault immediately:

There have been significant changes in acpiphp related to Thunderbolt
since v3.11. Any chance you can try reproduce this problem on a
current kernel, e.g., v3.12-rc5? If it still happens, can you collect
a complete dmesg log, acpidump, and "lspci -vv" output, and attach
them to a new http://bugzilla.kernel.org report?

Since you're doing a remove two levels above the Thunderbolt device,
and it looks like pciehp is handling this part, you might be seeing
something new, but the info above will still be a good start in
looking at it.

Bjorn

> pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
> pci_bus 0000:0a: busn_res: [bus 0a] is released
> pci_bus 0000:09: busn_res: [bus 09-0a] is released
> general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> ....
> Workqueue: events pci_pme_list_scan
> task: ffff88044b0b8000 ti: ffff88044ac62000 task.ti: ffff88044ac62000
> RIP: 0010:[<ffffffff812bdc8c>] [<ffffffff812bdc8c>] pci_pme_list_scan+0x3c/0xe0
> RSP: 0018:ffff88044ac63e10 EFLAGS: 00010202
> RAX: ffff88045601e7b0 RBX: ffffffff8187b070 RCX: 0000000000000000
> RDX: 6b6b6b6b6b6b6b6b RSI: ffff88044ac63da0 RDI: ffff880453250ca8
> RBP: ffff88044ac63e20 R08: ffff88044ac63da0 R09: 0001f9e0c287afc0
> R10: 0001f9e0c287afc0 R11: 0000000000000000 R12: ffff880453250ca8
> R13: ffff88046d053d00 R14: ffff88046d058200 R15: ffffffff8187afc8
> FS: 0000000000000000(0000) GS:ffff88046d040000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd301d57000 CR3: 000000000280d000 CR4: 00000000001407e0
> Stack:
> ffffffff8187afc0 ffff88044a920e40 ffff88044ac63e68 ffffffff8107ddd6
> 000000006d053d00 0000000000000000 ffff88046d053d00 ffff88046d053d18
> ffff88044a920e70 ffff88044b0b8000 ffff88044a920e40 ffff88044ac63ec8
> Call Trace:
> [<ffffffff8107ddd6>] process_one_work+0x176/0x470
> [<ffffffff8107e79b>] worker_thread+0x11b/0x3a0
> [<ffffffff8107e680>] ? manage_workers.isra.21+0x2b0/0x2b0
> [<ffffffff810855e0>] kthread+0xc0/0xd0
> [<ffffffff81085520>] ? kthread_create_on_node+0x110/0x110
> [<ffffffff814f4c2c>] ret_from_fork+0x7c/0xb0
> [<ffffffff81085520>] ? kthread_create_on_node+0x110/0x110
> Code: 54 53 e8 f8 c6 22 00 4c 8b 25 01 d4 5b 00 49 81 fc 70 b0 87 81
> 0f 84 98 00 00 00 49 8b 1c 24 4c 89 e7 eb 36 0f 1f 00 48 8b 50 10 <48>
> 8b 52 38 48 85 d2 74 07 8b 4a 78 85 c9 75 0a 31 f6 48 89 c7
> RIP [<ffffffff812bdc8c>] pci_pme_list_scan+0x3c/0xe0
> RSP <ffff88044ac63e10>
> ---[ end trace 3905f90a7dacf7b3 ]---
>
> The offending line is:
> (gdb) list *(pci_pme_list_scan+0x3c)
> 0xffffffff812bdc8c is in pci_pme_list_scan (drivers/pci/pci.c:1551).
> 1546 if (!list_empty(&pci_pme_list)) {
> 1547 list_for_each_entry_safe(pme_dev, n,
> &pci_pme_list, list) {
> 1548 if (pme_dev->dev->pme_poll) {
> 1549 struct pci_dev *bridge;
> 1550
> 1551 bridge = pme_dev->dev->bus->self;
> 1552 /*
> 1553 * If bridge is in low power state, the
> 1554 * configuration space of
> subordinate devices
> 1555 * may be not accessible
> If I read the disassembly correctly then the deref of bus seems to
> cause the oops.
>
> An almost identical bug was reported (and fixed) some time ago:
> http://lkml.indiana.edu/hypermail/linux/kernel/1302.1/01165.html
>
> Andreas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-10-15 02:44:59

by Matthew Garrett

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
> [+cc Rafael, Mika, Kirill, linux-pci]
>
> On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
> <[email protected]> wrote:
> > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
> > crashes a few seconds later. Using
> > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
> > to remove a bridge two levels above the device triggers the fault immediately:
>
> There have been significant changes in acpiphp related to Thunderbolt
> since v3.11.

Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
I'd be surprised if acpiphp makes a difference here.

(Whine whine Intel continuing to refuse to provide documentation for a
widely shipped part whine whine)

--
Matthew Garrett | [email protected]

2013-10-16 20:21:29

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
> > [+cc Rafael, Mika, Kirill, linux-pci]
> >
> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
> > <[email protected]> wrote:
> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
> > > crashes a few seconds later. Using
> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
> > > to remove a bridge two levels above the device triggers the fault immediately:
> >
> > There have been significant changes in acpiphp related to Thunderbolt
> > since v3.11.
>
> Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
> I'd be surprised if acpiphp makes a difference here.

Yeah, you're right; I wasn't paying attention.

We save a pci_dev pointer in the pci_pme_list, which of course has a
longer lifetime than the pci_dev itself, but we don't acquire a reference
on it, so I suspect the pci_dev got released before we got around to
doing the pci_pme_list_scan().

Andreas, can you try the patch below? It's against v3.12-rc2, but it
should apply to v3.11, too.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ad7fc72..8b0a2f3 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1580,6 +1580,7 @@ static void pci_pme_list_scan(struct work_struct *work)
pci_pme_wakeup(pme_dev->dev, NULL);
} else {
list_del(&pme_dev->list);
+ pci_dev_put(pme_dev->dev);
kfree(pme_dev);
}
}
@@ -1640,7 +1641,7 @@ void pci_pme_active(struct pci_dev *dev, bool enable)
GFP_KERNEL);
if (!pme_dev)
goto out;
- pme_dev->dev = dev;
+ pme_dev->dev = pci_dev_get(dev);
mutex_lock(&pci_pme_list_mutex);
list_add(&pme_dev->list, &pci_pme_list);
if (list_is_singular(&pci_pme_list))
@@ -1652,6 +1653,7 @@ void pci_pme_active(struct pci_dev *dev, bool enable)
list_for_each_entry(pme_dev, &pci_pme_list, list) {
if (pme_dev->dev == dev) {
list_del(&pme_dev->list);
+ pci_dev_put(pme_dev->dev);
kfree(pme_dev);
break;
}

2013-10-17 13:59:33

by Andreas Noever

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Wed, Oct 16, 2013 at 10:21 PM, Bjorn Helgaas <[email protected]> wrote:
> On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
>> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
>> > [+cc Rafael, Mika, Kirill, linux-pci]
>> >
>> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
>> > <[email protected]> wrote:
>> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
>> > > crashes a few seconds later. Using
>> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
>> > > to remove a bridge two levels above the device triggers the fault immediately:
>> >
>> > There have been significant changes in acpiphp related to Thunderbolt
>> > since v3.11.
>>
>> Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
>> I'd be surprised if acpiphp makes a difference here.
>
> Yeah, you're right; I wasn't paying attention.
>
> We save a pci_dev pointer in the pci_pme_list, which of course has a
> longer lifetime than the pci_dev itself, but we don't acquire a reference
> on it, so I suspect the pci_dev got released before we got around to
> doing the pci_pme_list_scan().
>
> Andreas, can you try the patch below? It's against v3.12-rc2, but it
> should apply to v3.11, too.

I have tested your patch against 3.11 where it solves the problem. Thanks!

Unfortunately I could not reproduce the problem in 3.12-rc5. I only
get the following warning (and no crash):

tg3 0000:0a:00.0: PME# disabled
pcieport 0000:09:00.0: PME# disabled
pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
pci_bus 0000:0a: dev 00, dec refcount to 0
pci_bus 0000:0a: dev 00, released physical slot 9
------------[ cut here ]------------
WARNING: CPU: 0 PID: 122 at drivers/pci/pci.c:1430
pci_disable_device+0x84/0x90()
Device pcieport
disabling already-disabled device
Modules linked in:
btusb bluetooth joydev hid_apple bcm5974 nls_utf8 nls_cp437 hfsplus
vfat fat snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm cfg80211 uvcvideo crc32_pclmul crc32c_intel
videobuf2_vmalloc ghash_clmulni_intel aesni_intel videobuf2_memops
aes_x86_64 glue_helper videobuf2_core tg3 videodev lrw gf128mul
ablk_helper iTCO_wdt hid_generic iTCO_vendor_support cryptd media
applesmc input_polldev usbhid ptp microcode snd_hda_codec_cirrus hid
pps_core libphy rfkill i2c_i801 pcspkr snd_hda_intel apple_gmux
lib80211 snd_hda_codec acpi_cpufreq snd_hwdep snd_pcm snd_page_alloc
snd_timer mei_me snd mei processor soundcore lpc_ich evdev mfd_core
apple_bl ac battery ext4 crc16 mbcache jbd2 sd_mod ahci libahci libata
xhci_hcd ehci_pci sdhci_pci ehci_hcd sdhci scsi_mod mmc_core
usbcore usb_common nouveau mxm_wmi wmi ttm i915 video button
i2c_algo_bit intel_agp intel_gtt drm_kms_helper drm i2c_core
CPU: 0 PID: 122 Comm: kworker/u16:5 Not tainted 3.12.0-1-dirty #30
Hardware name: Apple Inc. MacBookPro10,1/Mac-C3EC7CD22292981F, BIOS
MBP101.88Z.00EE.B03.1212211437 12/21/2012
Workqueue: sysfsd sysfs_schedule_callback_work
0000000000000009 ffff88044c021c00 ffffffff814c4288 ffff88044c021c48
ffff88044c021c38 ffffffff81061b7d ffff880458a5c000 ffffffff8187c5c0
ffff880458a5c000 ffff880458a5b098 0000000000000000 ffff88044c021c98
Call Trace:
[<ffffffff814c4288>] dump_stack+0x54/0x8d
[<ffffffff81061b7d>] warn_slowpath_common+0x7d/0xa0
[<ffffffff81061bec>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff812bdd92>] ? do_pci_disable_device+0x52/0x60
[<ffffffff813097f3>] ? acpi_pci_irq_disable+0x4c/0x8d
[<ffffffff812bde24>] pci_disable_device+0x84/0x90
[<ffffffff812cc62a>] pcie_portdrv_remove+0x1a/0x20
[<ffffffff812bfcdb>] pci_device_remove+0x3b/0xb0
[<ffffffff81381caf>] __device_release_driver+0x7f/0xf0
[<ffffffff81381d43>] device_release_driver+0x23/0x30
[<ffffffff813814d8>] bus_remove_device+0x108/0x180
[<ffffffff8137de75>] device_del+0x135/0x1d0
[<ffffffff812ba394>] pci_stop_bus_device+0x94/0xa0
[<ffffffff812ba33b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812ba4a2>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffff812c15c5>] remove_callback+0x25/0x40
[<ffffffff81212ad4>] sysfs_schedule_callback_work+0x14/0x80
[<ffffffff8107c9e8>] process_one_work+0x178/0x470
[<ffffffff8107d3b1>] worker_thread+0x121/0x3a0
[<ffffffff8107d290>] ? manage_workers.isra.21+0x2b0/0x2b0
[<ffffffff810840f0>] kthread+0xc0/0xd0
[<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
[<ffffffff814d2dfc>] ret_from_fork+0x7c/0xb0
[<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
---[ end trace b39a15fa94fbb2a2 ]---


Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>From this commit on the pci_pme_list_scan crash disappears and the
warning appears.

Since this commit seems to just mask the problem I went ahead and
tested your patch on 3.12-rc5 as well. It seems to work (not crash)
but the warning is still there.

The above warning was triggered by removing the 08 bridge via sysfs.
The same warning can be triggered by unplugging the adapter (dmesg
below). The ethernet card is removed immediately. The bridges follow
15 seconds later together with the warning. The topology is:
06:03.0 -- 08 -- 09 -- 0a (tg3)
(full lspci -vv is attached)

[ 25.077577] pciehp 0000:06:03.0:pcie24: Card not present on Slot(3-1)
[ 25.077626] tg3 0000:0a:00.0: PME# disabled
[ 26.284664] tg3 0000:0a:00.0: tg3_abort_hw timed out,
TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
[ 27.669942] tg3 0000:0a:00.0 ens9: No firmware running
[ 38.661674] tg3 0000:0a:00.0 ens9: Link is down
[ 40.094609] pcieport 0000:09:00.0: PME# disabled
[ 40.094771] pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
[ 40.094781] pci_bus 0000:0a: dev 00, dec refcount to 0
[ 40.094795] pci_bus 0000:0a: dev 00, released physical slot 9
[ 40.094981] ------------[ cut here ]------------
[ 40.094992] WARNING: CPU: 0 PID: 53 at drivers/pci/pci.c:1430
pci_disable_device+0x84/0x90()
[ 40.094995] Device pcieport
disabling already-disabled device
[ 40.094997] Modules linked in:
[ 40.094999] btusb bluetooth joydev hid_apple bcm5974
lib80211_crypt_tkip nls_cp437 vfat fat snd_hda_codec_hdmi nls_utf8
x86_pkg_temp_thermal intel_powerclamp hfsplus coretemp wl(O) kvm_intel
kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
aes_x86_64 glue_helper lrw gf128mul iTCO_wdt ablk_helper tg3 cryptd
cfg80211 hid_generic applesmc iTCO_vendor_support input_polldev usbhid
ptp hid snd_hda_codec_cirrus microcode pps_core libphy i2c_i801 pcspkr
snd_hda_intel rfkill snd_hda_codec lib80211 uvcvideo snd_hwdep
videobuf2_vmalloc videobuf2_memops snd_pcm videobuf2_core videodev
acpi_cpufreq mei_me apple_gmux snd_page_alloc mei snd_timer lpc_ich
mfd_core snd media battery apple_bl soundcore evdev processor ac ext4
crc16 mbcache jbd2 sd_mod ahci libahci libata xhci_hcd ehci_pci
sdhci_pci ehci_hcd
[ 40.095212] sdhci scsi_mod mmc_core usbcore usb_common nouveau
mxm_wmi wmi ttm i915 video button i2c_algo_bit intel_agp intel_gtt
drm_kms_helper drm i2c_core
[ 40.095242] CPU: 0 PID: 53 Comm: kworker/0:1 Tainted: G W O
3.12.0-1-dirty #31
[ 40.095246] Hardware name: Apple Inc.
MacBookPro10,1/Mac-C3EC7CD22292981F, BIOS
MBP101.88Z.00EE.B03.1212211437 12/21/2012
[ 40.095253] Workqueue: pciehp-3 pciehp_power_thread
[ 40.095256] 0000000000000009 ffff880458ab5b98 ffffffff814c42b8
ffff880458ab5be0
[ 40.095262] ffff880458ab5bd0 ffffffff81061b7d ffff880458a5c000
ffffffff8187c5c0
[ 40.095268] ffff880458a5c000 ffff880458a5b098 0000000000000000
ffff880458ab5c30
[ 40.095287] Call Trace:
[ 40.095293] [<ffffffff814c42b8>] dump_stack+0x54/0x8d
[ 40.095298] [<ffffffff81061b7d>] warn_slowpath_common+0x7d/0xa0
[ 40.095302] [<ffffffff81061bec>] warn_slowpath_fmt+0x4c/0x50
[ 40.095306] [<ffffffff812bddb2>] ? do_pci_disable_device+0x52/0x60
[ 40.095310] [<ffffffff81309823>] ? acpi_pci_irq_disable+0x4c/0x8d
[ 40.095313] [<ffffffff812bde44>] pci_disable_device+0x84/0x90
[ 40.095317] [<ffffffff812cc65a>] pcie_portdrv_remove+0x1a/0x20
[ 40.095321] [<ffffffff812bfd0b>] pci_device_remove+0x3b/0xb0
[ 40.095325] [<ffffffff81381cdf>] __device_release_driver+0x7f/0xf0
[ 40.095328] [<ffffffff81381d73>] device_release_driver+0x23/0x30
[ 40.095331] [<ffffffff81381508>] bus_remove_device+0x108/0x180
[ 40.095336] [<ffffffff8137dea5>] device_del+0x135/0x1d0
[ 40.095350] [<ffffffff812ba394>] pci_stop_bus_device+0x94/0xa0
[ 40.095353] [<ffffffff812ba33b>] pci_stop_bus_device+0x3b/0xa0
[ 40.095357] [<ffffffff812ba4a2>] pci_stop_and_remove_bus_device+0x12/0x20
[ 40.095361] [<ffffffff812d2e48>] pciehp_unconfigure_device+0xa8/0x1b0
[ 40.095364] [<ffffffff812d27a8>] pciehp_disable_slot+0x68/0x200
[ 40.095368] [<ffffffff812d29c3>] pciehp_power_thread+0x83/0xf0
[ 40.095372] [<ffffffff8107c9e8>] process_one_work+0x178/0x470
[ 40.095375] [<ffffffff8107d3b1>] worker_thread+0x121/0x3a0
[ 40.095379] [<ffffffff8107d290>] ? manage_workers.isra.21+0x2b0/0x2b0
[ 40.095382] [<ffffffff810840f0>] kthread+0xc0/0xd0
[ 40.095385] [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
[ 40.095389] [<ffffffff814d2e3c>] ret_from_fork+0x7c/0xb0
[ 40.095392] [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
[ 40.095404] ---[ end trace 12862498ad48cb36 ]---
[ 40.095513] pcieport 0000:08:00.0: PME# disabled
[ 40.096296] pci_bus 0000:0a: busn_res: [bus 0a] is released
[ 40.096367] pci_bus 0000:09: busn_res: [bus 09-0a] is released


Attachments:
lspcivv (74.61 kB)

2013-10-23 03:32:39

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

[+cc Yinghai]

On Thu, Oct 17, 2013 at 7:59 AM, Andreas Noever
<[email protected]> wrote:
> On Wed, Oct 16, 2013 at 10:21 PM, Bjorn Helgaas <[email protected]> wrote:
>> On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
>>> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
>>> > [+cc Rafael, Mika, Kirill, linux-pci]
>>> >
>>> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
>>> > <[email protected]> wrote:
>>> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
>>> > > crashes a few seconds later. Using
>>> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
>>> > > to remove a bridge two levels above the device triggers the fault immediately:
>>> >
>>> > There have been significant changes in acpiphp related to Thunderbolt
>>> > since v3.11.
>>>
>>> Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
>>> I'd be surprised if acpiphp makes a difference here.
>>
>> Yeah, you're right; I wasn't paying attention.
>>
>> We save a pci_dev pointer in the pci_pme_list, which of course has a
>> longer lifetime than the pci_dev itself, but we don't acquire a reference
>> on it, so I suspect the pci_dev got released before we got around to
>> doing the pci_pme_list_scan().
>>
>> Andreas, can you try the patch below? It's against v3.12-rc2, but it
>> should apply to v3.11, too.
>
> I have tested your patch against 3.11 where it solves the problem. Thanks!
>
> Unfortunately I could not reproduce the problem in 3.12-rc5. I only
> get the following warning (and no crash):
>
> tg3 0000:0a:00.0: PME# disabled
> pcieport 0000:09:00.0: PME# disabled
> pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
> pci_bus 0000:0a: dev 00, dec refcount to 0
> pci_bus 0000:0a: dev 00, released physical slot 9
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 122 at drivers/pci/pci.c:1430
> pci_disable_device+0x84/0x90()
> Device pcieport
> disabling already-disabled device
> Modules linked in:
> btusb bluetooth joydev hid_apple bcm5974 nls_utf8 nls_cp437 hfsplus
> vfat fat snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp
> coretemp kvm_intel kvm cfg80211 uvcvideo crc32_pclmul crc32c_intel
> videobuf2_vmalloc ghash_clmulni_intel aesni_intel videobuf2_memops
> aes_x86_64 glue_helper videobuf2_core tg3 videodev lrw gf128mul
> ablk_helper iTCO_wdt hid_generic iTCO_vendor_support cryptd media
> applesmc input_polldev usbhid ptp microcode snd_hda_codec_cirrus hid
> pps_core libphy rfkill i2c_i801 pcspkr snd_hda_intel apple_gmux
> lib80211 snd_hda_codec acpi_cpufreq snd_hwdep snd_pcm snd_page_alloc
> snd_timer mei_me snd mei processor soundcore lpc_ich evdev mfd_core
> apple_bl ac battery ext4 crc16 mbcache jbd2 sd_mod ahci libahci libata
> xhci_hcd ehci_pci sdhci_pci ehci_hcd sdhci scsi_mod mmc_core
> usbcore usb_common nouveau mxm_wmi wmi ttm i915 video button
> i2c_algo_bit intel_agp intel_gtt drm_kms_helper drm i2c_core
> CPU: 0 PID: 122 Comm: kworker/u16:5 Not tainted 3.12.0-1-dirty #30
> Hardware name: Apple Inc. MacBookPro10,1/Mac-C3EC7CD22292981F, BIOS
> MBP101.88Z.00EE.B03.1212211437 12/21/2012
> Workqueue: sysfsd sysfs_schedule_callback_work
> 0000000000000009 ffff88044c021c00 ffffffff814c4288 ffff88044c021c48
> ffff88044c021c38 ffffffff81061b7d ffff880458a5c000 ffffffff8187c5c0
> ffff880458a5c000 ffff880458a5b098 0000000000000000 ffff88044c021c98
> Call Trace:
> [<ffffffff814c4288>] dump_stack+0x54/0x8d
> [<ffffffff81061b7d>] warn_slowpath_common+0x7d/0xa0
> [<ffffffff81061bec>] warn_slowpath_fmt+0x4c/0x50
> [<ffffffff812bdd92>] ? do_pci_disable_device+0x52/0x60
> [<ffffffff813097f3>] ? acpi_pci_irq_disable+0x4c/0x8d
> [<ffffffff812bde24>] pci_disable_device+0x84/0x90
> [<ffffffff812cc62a>] pcie_portdrv_remove+0x1a/0x20
> [<ffffffff812bfcdb>] pci_device_remove+0x3b/0xb0
> [<ffffffff81381caf>] __device_release_driver+0x7f/0xf0
> [<ffffffff81381d43>] device_release_driver+0x23/0x30
> [<ffffffff813814d8>] bus_remove_device+0x108/0x180
> [<ffffffff8137de75>] device_del+0x135/0x1d0
> [<ffffffff812ba394>] pci_stop_bus_device+0x94/0xa0
> [<ffffffff812ba33b>] pci_stop_bus_device+0x3b/0xa0
> [<ffffffff812ba4a2>] pci_stop_and_remove_bus_device+0x12/0x20
> [<ffffffff812c15c5>] remove_callback+0x25/0x40
> [<ffffffff81212ad4>] sysfs_schedule_callback_work+0x14/0x80
> [<ffffffff8107c9e8>] process_one_work+0x178/0x470
> [<ffffffff8107d3b1>] worker_thread+0x121/0x3a0
> [<ffffffff8107d290>] ? manage_workers.isra.21+0x2b0/0x2b0
> [<ffffffff810840f0>] kthread+0xc0/0xd0
> [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
> [<ffffffff814d2dfc>] ret_from_fork+0x7c/0xb0
> [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
> ---[ end trace b39a15fa94fbb2a2 ]---
>
>
> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .

This is "PCI: Delay enabling bridges until they're needed" by Yinghai.

Yinghai, please comment.

> From this commit on the pci_pme_list_scan crash disappears and the
> warning appears.
>
> Since this commit seems to just mask the problem I went ahead and
> tested your patch on 3.12-rc5 as well. It seems to work (not crash)
> but the warning is still there.
>
> The above warning was triggered by removing the 08 bridge via sysfs.
> The same warning can be triggered by unplugging the adapter (dmesg
> below). The ethernet card is removed immediately. The bridges follow
> 15 seconds later together with the warning. The topology is:
> 06:03.0 -- 08 -- 09 -- 0a (tg3)
> (full lspci -vv is attached)
>
> [ 25.077577] pciehp 0000:06:03.0:pcie24: Card not present on Slot(3-1)
> [ 25.077626] tg3 0000:0a:00.0: PME# disabled
> [ 26.284664] tg3 0000:0a:00.0: tg3_abort_hw timed out,
> TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
> [ 27.669942] tg3 0000:0a:00.0 ens9: No firmware running
> [ 38.661674] tg3 0000:0a:00.0 ens9: Link is down
> [ 40.094609] pcieport 0000:09:00.0: PME# disabled
> [ 40.094771] pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
> [ 40.094781] pci_bus 0000:0a: dev 00, dec refcount to 0
> [ 40.094795] pci_bus 0000:0a: dev 00, released physical slot 9
> [ 40.094981] ------------[ cut here ]------------
> [ 40.094992] WARNING: CPU: 0 PID: 53 at drivers/pci/pci.c:1430
> pci_disable_device+0x84/0x90()
> [ 40.094995] Device pcieport
> disabling already-disabled device
> [ 40.094997] Modules linked in:
> [ 40.094999] btusb bluetooth joydev hid_apple bcm5974
> lib80211_crypt_tkip nls_cp437 vfat fat snd_hda_codec_hdmi nls_utf8
> x86_pkg_temp_thermal intel_powerclamp hfsplus coretemp wl(O) kvm_intel
> kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
> aes_x86_64 glue_helper lrw gf128mul iTCO_wdt ablk_helper tg3 cryptd
> cfg80211 hid_generic applesmc iTCO_vendor_support input_polldev usbhid
> ptp hid snd_hda_codec_cirrus microcode pps_core libphy i2c_i801 pcspkr
> snd_hda_intel rfkill snd_hda_codec lib80211 uvcvideo snd_hwdep
> videobuf2_vmalloc videobuf2_memops snd_pcm videobuf2_core videodev
> acpi_cpufreq mei_me apple_gmux snd_page_alloc mei snd_timer lpc_ich
> mfd_core snd media battery apple_bl soundcore evdev processor ac ext4
> crc16 mbcache jbd2 sd_mod ahci libahci libata xhci_hcd ehci_pci
> sdhci_pci ehci_hcd
> [ 40.095212] sdhci scsi_mod mmc_core usbcore usb_common nouveau
> mxm_wmi wmi ttm i915 video button i2c_algo_bit intel_agp intel_gtt
> drm_kms_helper drm i2c_core
> [ 40.095242] CPU: 0 PID: 53 Comm: kworker/0:1 Tainted: G W O
> 3.12.0-1-dirty #31
> [ 40.095246] Hardware name: Apple Inc.
> MacBookPro10,1/Mac-C3EC7CD22292981F, BIOS
> MBP101.88Z.00EE.B03.1212211437 12/21/2012
> [ 40.095253] Workqueue: pciehp-3 pciehp_power_thread
> [ 40.095256] 0000000000000009 ffff880458ab5b98 ffffffff814c42b8
> ffff880458ab5be0
> [ 40.095262] ffff880458ab5bd0 ffffffff81061b7d ffff880458a5c000
> ffffffff8187c5c0
> [ 40.095268] ffff880458a5c000 ffff880458a5b098 0000000000000000
> ffff880458ab5c30
> [ 40.095287] Call Trace:
> [ 40.095293] [<ffffffff814c42b8>] dump_stack+0x54/0x8d
> [ 40.095298] [<ffffffff81061b7d>] warn_slowpath_common+0x7d/0xa0
> [ 40.095302] [<ffffffff81061bec>] warn_slowpath_fmt+0x4c/0x50
> [ 40.095306] [<ffffffff812bddb2>] ? do_pci_disable_device+0x52/0x60
> [ 40.095310] [<ffffffff81309823>] ? acpi_pci_irq_disable+0x4c/0x8d
> [ 40.095313] [<ffffffff812bde44>] pci_disable_device+0x84/0x90
> [ 40.095317] [<ffffffff812cc65a>] pcie_portdrv_remove+0x1a/0x20
> [ 40.095321] [<ffffffff812bfd0b>] pci_device_remove+0x3b/0xb0
> [ 40.095325] [<ffffffff81381cdf>] __device_release_driver+0x7f/0xf0
> [ 40.095328] [<ffffffff81381d73>] device_release_driver+0x23/0x30
> [ 40.095331] [<ffffffff81381508>] bus_remove_device+0x108/0x180
> [ 40.095336] [<ffffffff8137dea5>] device_del+0x135/0x1d0
> [ 40.095350] [<ffffffff812ba394>] pci_stop_bus_device+0x94/0xa0
> [ 40.095353] [<ffffffff812ba33b>] pci_stop_bus_device+0x3b/0xa0
> [ 40.095357] [<ffffffff812ba4a2>] pci_stop_and_remove_bus_device+0x12/0x20
> [ 40.095361] [<ffffffff812d2e48>] pciehp_unconfigure_device+0xa8/0x1b0
> [ 40.095364] [<ffffffff812d27a8>] pciehp_disable_slot+0x68/0x200
> [ 40.095368] [<ffffffff812d29c3>] pciehp_power_thread+0x83/0xf0
> [ 40.095372] [<ffffffff8107c9e8>] process_one_work+0x178/0x470
> [ 40.095375] [<ffffffff8107d3b1>] worker_thread+0x121/0x3a0
> [ 40.095379] [<ffffffff8107d290>] ? manage_workers.isra.21+0x2b0/0x2b0
> [ 40.095382] [<ffffffff810840f0>] kthread+0xc0/0xd0
> [ 40.095385] [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
> [ 40.095389] [<ffffffff814d2e3c>] ret_from_fork+0x7c/0xb0
> [ 40.095392] [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
> [ 40.095404] ---[ end trace 12862498ad48cb36 ]---
> [ 40.095513] pcieport 0000:08:00.0: PME# disabled
> [ 40.096296] pci_bus 0000:0a: busn_res: [bus 0a] is released
> [ 40.096367] pci_bus 0000:09: busn_res: [bus 09-0a] is released

2013-10-23 23:53:58

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Thu, Oct 17, 2013 at 7:59 AM, Andreas Noever
<[email protected]> wrote:
> On Wed, Oct 16, 2013 at 10:21 PM, Bjorn Helgaas <[email protected]> wrote:
>> On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
>>> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
>>> > [+cc Rafael, Mika, Kirill, linux-pci]
>>> >
>>> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
>>> > <[email protected]> wrote:
>>> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
>>> > > crashes a few seconds later. Using
>>> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
>>> > > to remove a bridge two levels above the device triggers the fault immediately:
>>> >
>>> > There have been significant changes in acpiphp related to Thunderbolt
>>> > since v3.11.
>>>
>>> Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
>>> I'd be surprised if acpiphp makes a difference here.
>>
>> Yeah, you're right; I wasn't paying attention.
>>
>> We save a pci_dev pointer in the pci_pme_list, which of course has a
>> longer lifetime than the pci_dev itself, but we don't acquire a reference
>> on it, so I suspect the pci_dev got released before we got around to
>> doing the pci_pme_list_scan().
>>
>> Andreas, can you try the patch below? It's against v3.12-rc2, but it
>> should apply to v3.11, too.
>
> I have tested your patch against 3.11 where it solves the problem. Thanks!

Hi Andreas, sorry for the delay here. I'm still trying to understand
exactly why my patch fixes the problem, since I don't see a relevant
refcounting change between v3.11 and v3.12-rc5. And I don't actually
see the hole yet from inspection. It seems like we should be safe
even without my patch.

But maybe it's a case of releasing the pci_bus before releasing a
pci_dev on the bus. I thought we recently fixed a hole there, but
maybe not. I'll look more carefully at that path.

Can I trouble you to collect a complete dmesg log from v3.11 without
my patch? Maybe if I stare long enough at that and the lspci you
supplied, I can figure out what's going on. If you were really
gung-ho, you could add instrumentation to print out the pci_dev and
pci_bus pointers as we enumerate them. My guess is that we'd see one
of those pointers in the GPF register dump.

Bjorn

2013-10-24 05:53:45

by Yinghai Lu

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Tue, Oct 22, 2013 at 8:32 PM, Bjorn Helgaas <[email protected]> wrote:
> [+cc Yinghai]
>
> On Thu, Oct 17, 2013 at 7:59 AM, Andreas Noever
> <[email protected]> wrote:
>> On Wed, Oct 16, 2013 at 10:21 PM, Bjorn Helgaas <[email protected]> wrote:
>>> On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
>>>> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
>>>> > [+cc Rafael, Mika, Kirill, linux-pci]
>>>> >
>>>> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
>>>> > <[email protected]> wrote:
>>>> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
>>>> > > crashes a few seconds later. Using
>>>> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
>>>> > > to remove a bridge two levels above the device triggers the fault immediately:
>>>> >
>>>> > There have been significant changes in acpiphp related to Thunderbolt
>>>> > since v3.11.
>>>>
>>>> Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
>>>> I'd be surprised if acpiphp makes a difference here.
>>>
>>> Yeah, you're right; I wasn't paying attention.
>>>
>>> We save a pci_dev pointer in the pci_pme_list, which of course has a
>>> longer lifetime than the pci_dev itself, but we don't acquire a reference
>>> on it, so I suspect the pci_dev got released before we got around to
>>> doing the pci_pme_list_scan().
>>>
>>> Andreas, can you try the patch below? It's against v3.12-rc2, but it
>>> should apply to v3.11, too.
>>
>> I have tested your patch against 3.11 where it solves the problem. Thanks!
>>
>> Unfortunately I could not reproduce the problem in 3.12-rc5. I only
>> get the following warning (and no crash):
>>
>> tg3 0000:0a:00.0: PME# disabled
>> pcieport 0000:09:00.0: PME# disabled
>> pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
>> pci_bus 0000:0a: dev 00, dec refcount to 0
>> pci_bus 0000:0a: dev 00, released physical slot 9
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 122 at drivers/pci/pci.c:1430
>> pci_disable_device+0x84/0x90()
>> Device pcieport
>> disabling already-disabled device
>> Modules linked in:
>> btusb bluetooth joydev hid_apple bcm5974 nls_utf8 nls_cp437 hfsplus
>> vfat fat snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp
>> coretemp kvm_intel kvm cfg80211 uvcvideo crc32_pclmul crc32c_intel
>> videobuf2_vmalloc ghash_clmulni_intel aesni_intel videobuf2_memops
>> aes_x86_64 glue_helper videobuf2_core tg3 videodev lrw gf128mul
>> ablk_helper iTCO_wdt hid_generic iTCO_vendor_support cryptd media
>> applesmc input_polldev usbhid ptp microcode snd_hda_codec_cirrus hid
>> pps_core libphy rfkill i2c_i801 pcspkr snd_hda_intel apple_gmux
>> lib80211 snd_hda_codec acpi_cpufreq snd_hwdep snd_pcm snd_page_alloc
>> snd_timer mei_me snd mei processor soundcore lpc_ich evdev mfd_core
>> apple_bl ac battery ext4 crc16 mbcache jbd2 sd_mod ahci libahci libata
>> xhci_hcd ehci_pci sdhci_pci ehci_hcd sdhci scsi_mod mmc_core
>> usbcore usb_common nouveau mxm_wmi wmi ttm i915 video button
>> i2c_algo_bit intel_agp intel_gtt drm_kms_helper drm i2c_core
>> CPU: 0 PID: 122 Comm: kworker/u16:5 Not tainted 3.12.0-1-dirty #30
>> Hardware name: Apple Inc. MacBookPro10,1/Mac-C3EC7CD22292981F, BIOS
>> MBP101.88Z.00EE.B03.1212211437 12/21/2012
>> Workqueue: sysfsd sysfs_schedule_callback_work
>> 0000000000000009 ffff88044c021c00 ffffffff814c4288 ffff88044c021c48
>> ffff88044c021c38 ffffffff81061b7d ffff880458a5c000 ffffffff8187c5c0
>> ffff880458a5c000 ffff880458a5b098 0000000000000000 ffff88044c021c98
>> Call Trace:
>> [<ffffffff814c4288>] dump_stack+0x54/0x8d
>> [<ffffffff81061b7d>] warn_slowpath_common+0x7d/0xa0
>> [<ffffffff81061bec>] warn_slowpath_fmt+0x4c/0x50
>> [<ffffffff812bdd92>] ? do_pci_disable_device+0x52/0x60
>> [<ffffffff813097f3>] ? acpi_pci_irq_disable+0x4c/0x8d
>> [<ffffffff812bde24>] pci_disable_device+0x84/0x90
>> [<ffffffff812cc62a>] pcie_portdrv_remove+0x1a/0x20
>> [<ffffffff812bfcdb>] pci_device_remove+0x3b/0xb0
>> [<ffffffff81381caf>] __device_release_driver+0x7f/0xf0
>> [<ffffffff81381d43>] device_release_driver+0x23/0x30
>> [<ffffffff813814d8>] bus_remove_device+0x108/0x180
>> [<ffffffff8137de75>] device_del+0x135/0x1d0
>> [<ffffffff812ba394>] pci_stop_bus_device+0x94/0xa0
>> [<ffffffff812ba33b>] pci_stop_bus_device+0x3b/0xa0
>> [<ffffffff812ba4a2>] pci_stop_and_remove_bus_device+0x12/0x20
>> [<ffffffff812c15c5>] remove_callback+0x25/0x40
>> [<ffffffff81212ad4>] sysfs_schedule_callback_work+0x14/0x80
>> [<ffffffff8107c9e8>] process_one_work+0x178/0x470
>> [<ffffffff8107d3b1>] worker_thread+0x121/0x3a0
>> [<ffffffff8107d290>] ? manage_workers.isra.21+0x2b0/0x2b0
>> [<ffffffff810840f0>] kthread+0xc0/0xd0
>> [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
>> [<ffffffff814d2dfc>] ret_from_fork+0x7c/0xb0
>> [<ffffffff81084030>] ? kthread_create_on_node+0x120/0x120
>> ---[ end trace b39a15fa94fbb2a2 ]---
>>
>>
>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>
> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.

that double disabling should be addressed by:

https://lkml.org/lkml/2013/4/25/608

[PATCH] PCI: Remove duplicate pci_disable_device for pcie port

Thanks

Yinghai

2013-10-25 03:34:13

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Wed, Oct 23, 2013 at 11:53 PM, Yinghai Lu <[email protected]> wrote:
> On Tue, Oct 22, 2013 at 8:32 PM, Bjorn Helgaas <[email protected]> wrote:
>> On Thu, Oct 17, 2013 at 7:59 AM, Andreas Noever <[email protected]> wrote:
>>> On Wed, Oct 16, 2013 at 10:21 PM, Bjorn Helgaas <[email protected]> wrote:
>>>> On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
>>>>> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
>>>>> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever <[email protected]> wrote:
>>>>> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
>>>>> > > crashes a few seconds later. Using
>>>>> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
>>>>> > > to remove a bridge two levels above the device triggers the fault immediately:

>>>> We save a pci_dev pointer in the pci_pme_list, which of course has a
>>>> longer lifetime than the pci_dev itself, but we don't acquire a reference
>>>> on it, so I suspect the pci_dev got released before we got around to
>>>> doing the pci_pme_list_scan().
>>>>
>>>> Andreas, can you try the patch below? It's against v3.12-rc2, but it
>>>> should apply to v3.11, too.
>>>
>>> I have tested your patch against 3.11 where it solves the problem. Thanks!
>>>
>>> Unfortunately I could not reproduce the problem in 3.12-rc5. I only
>>> get the following warning (and no crash):
>>>
>>> tg3 0000:0a:00.0: PME# disabled
>>> pcieport 0000:09:00.0: PME# disabled
>>> pciehp 0000:09:00.0:pcie24: unloading service driver pciehp
>>> pci_bus 0000:0a: dev 00, dec refcount to 0
>>> pci_bus 0000:0a: dev 00, released physical slot 9
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 0 PID: 122 at drivers/pci/pci.c:1430
>>> pci_disable_device+0x84/0x90()
>>> Device pcieport
>>> disabling already-disabled device
>>> ...

>>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>>
>> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.
>
> that double disabling should be addressed by:
>
> https://lkml.org/lkml/2013/4/25/608
>
> [PATCH] PCI: Remove duplicate pci_disable_device for pcie port

I'll look at that patch again. I had some questions about it the
first time, but perhaps it makes more sense after 928bea9648 has been
applied.

Andreas originally reported a GPF oops in pci_pme_list_scan(). I
posted a refcounting patch, which made the problem go away, but I
can't explain why, and I don't want to apply it without understanding
that. Decoding his oops shows this:

24: 0f 1f 00 nopl (%rax)
27: 48 8b 50 10 mov 0x10(%rax),%rdx
2b:* 48 8b 52 38 mov 0x38(%rdx),%rdx <-- trapping instruction
2f: 48 85 d2 test %rdx,%rdx

%rax is the pci_dev pointer, so 0x10(%rax) is the dev->bus pointer,
which we put in %rdx. The oops says %rdx = 6b6b6b6b6b6b6b6b, which is
POISON_FREE, so I think we loaded dev->bus out of a struct pci_dev
that has already been freed.

pci_pme_list_scan() holds pci_pme_list_mutex while it traverses
pci_pme_list, and the pci_stop_and_remove_bus_device() path removes
the pci_dev by calling pci_pme_active(), which also holds
pci_pme_list_mutex, so I don't understand how pci_pme_list_scan() can
see a pci_dev that has already been freed.

If I understand Andreas correctly, 928bea9648 also fixes the crash,
even without my refcounting change. Can you explain why?

Bjorn

2013-10-25 05:13:22

by Yinghai Lu

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Thu, Oct 24, 2013 at 8:33 PM, Bjorn Helgaas <[email protected]> wrote:
>
>>>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>>>
>>> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.
>>
>> that double disabling should be addressed by:
>>
>> https://lkml.org/lkml/2013/4/25/608
>>
>> [PATCH] PCI: Remove duplicate pci_disable_device for pcie port
>
> I'll look at that patch again. I had some questions about it the
> first time, but perhaps it makes more sense after 928bea9648 has been
> applied.
>
> Andreas originally reported a GPF oops in pci_pme_list_scan(). I
> posted a refcounting patch, which made the problem go away, but I
> can't explain why, and I don't want to apply it without understanding
> that. Decoding his oops shows this:
>
> 24: 0f 1f 00 nopl (%rax)
> 27: 48 8b 50 10 mov 0x10(%rax),%rdx
> 2b:* 48 8b 52 38 mov 0x38(%rdx),%rdx <-- trapping instruction
> 2f: 48 85 d2 test %rdx,%rdx
>
> %rax is the pci_dev pointer, so 0x10(%rax) is the dev->bus pointer,
> which we put in %rdx. The oops says %rdx = 6b6b6b6b6b6b6b6b, which is
> POISON_FREE, so I think we loaded dev->bus out of a struct pci_dev
> that has already been freed.
>
> pci_pme_list_scan() holds pci_pme_list_mutex while it traverses
> pci_pme_list, and the pci_stop_and_remove_bus_device() path removes
> the pci_dev by calling pci_pme_active(), which also holds
> pci_pme_list_mutex, so I don't understand how pci_pme_list_scan() can
> see a pci_dev that has already been freed.
>
> If I understand Andreas correctly, 928bea9648 also fixes the crash,
> even without my refcounting change. Can you explain why?

928bea will make the dev->enable_cnt increase wrongly, as we have
pci_enable_device for child
pci_enable_bridge for parent
pci_enable_bridge for grandparent
pci_enable_device for grandparent
pci_enable_device for parent
pci_enable_brdige for grandparent
pci_enable_device for grandparent.
...

in that case grandprent will be enabled two times, and will enable_cnt will have
extra increase.

so later pci_disable_device will not really call do_pci_disable_device
do the really work, as enable_cnt still big.

solution could be:
let pci_enable_bridge call __pci_enable_device.
and __pci_enable_device will not call pci_enable_bridge.

Thanks

Yinghai

2013-10-25 05:28:47

by Yinghai Lu

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Thu, Oct 24, 2013 at 10:13 PM, Yinghai Lu <[email protected]> wrote:
> On Thu, Oct 24, 2013 at 8:33 PM, Bjorn Helgaas <[email protected]> wrote:
>>
>>>>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>>>>
>>>> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.
>>>
>>> that double disabling should be addressed by:
>>>
>>> https://lkml.org/lkml/2013/4/25/608
>>>
>>> [PATCH] PCI: Remove duplicate pci_disable_device for pcie port
>>
>> I'll look at that patch again. I had some questions about it the
>> first time, but perhaps it makes more sense after 928bea9648 has been
>> applied.
>>
>> Andreas originally reported a GPF oops in pci_pme_list_scan(). I
>> posted a refcounting patch, which made the problem go away, but I
>> can't explain why, and I don't want to apply it without understanding
>> that. Decoding his oops shows this:
>>
>> 24: 0f 1f 00 nopl (%rax)
>> 27: 48 8b 50 10 mov 0x10(%rax),%rdx
>> 2b:* 48 8b 52 38 mov 0x38(%rdx),%rdx <-- trapping instruction
>> 2f: 48 85 d2 test %rdx,%rdx
>>
>> %rax is the pci_dev pointer, so 0x10(%rax) is the dev->bus pointer,
>> which we put in %rdx. The oops says %rdx = 6b6b6b6b6b6b6b6b, which is
>> POISON_FREE, so I think we loaded dev->bus out of a struct pci_dev
>> that has already been freed.
>>
>> pci_pme_list_scan() holds pci_pme_list_mutex while it traverses
>> pci_pme_list, and the pci_stop_and_remove_bus_device() path removes
>> the pci_dev by calling pci_pme_active(), which also holds
>> pci_pme_list_mutex, so I don't understand how pci_pme_list_scan() can
>> see a pci_dev that has already been freed.
>>
>> If I understand Andreas correctly, 928bea9648 also fixes the crash,
>> even without my refcounting change. Can you explain why?
>
> 928bea will make the dev->enable_cnt increase wrongly, as we have
> pci_enable_device for child
> pci_enable_bridge for parent
> pci_enable_bridge for grandparent
> pci_enable_device for grandparent
> pci_enable_device for parent
> pci_enable_brdige for grandparent
> pci_enable_device for grandparent. ===> looks like i read the code wrongly. This one is skipped as we have pci_is_enabled checking.

928bea delay bridge enabling, and it's pci_is_enabled checking will prevent
pci bridge get enabled second time, so enable_cnt will be only 1. instead of
2. if we enable bridge at first and then pcie port driver.

Yinghai

2013-10-25 23:01:29

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Thu, Oct 24, 2013 at 11:13 PM, Yinghai Lu <[email protected]> wrote:
> On Thu, Oct 24, 2013 at 8:33 PM, Bjorn Helgaas <[email protected]> wrote:
>>
>>>>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>>>>
>>>> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.
>>>
>>> that double disabling should be addressed by:
>>>
>>> https://lkml.org/lkml/2013/4/25/608
>>>
>>> [PATCH] PCI: Remove duplicate pci_disable_device for pcie port
>>
>> I'll look at that patch again. I had some questions about it the
>> first time, but perhaps it makes more sense after 928bea9648 has been
>> applied.
>>
>> Andreas originally reported a GPF oops in pci_pme_list_scan(). I
>> posted a refcounting patch, which made the problem go away, but I
>> can't explain why, and I don't want to apply it without understanding
>> that. Decoding his oops shows this:
>>
>> 24: 0f 1f 00 nopl (%rax)
>> 27: 48 8b 50 10 mov 0x10(%rax),%rdx
>> 2b:* 48 8b 52 38 mov 0x38(%rdx),%rdx <-- trapping instruction
>> 2f: 48 85 d2 test %rdx,%rdx
>>
>> %rax is the pci_dev pointer, so 0x10(%rax) is the dev->bus pointer,
>> which we put in %rdx. The oops says %rdx = 6b6b6b6b6b6b6b6b, which is
>> POISON_FREE, so I think we loaded dev->bus out of a struct pci_dev
>> that has already been freed.
>>
>> pci_pme_list_scan() holds pci_pme_list_mutex while it traverses
>> pci_pme_list, and the pci_stop_and_remove_bus_device() path removes
>> the pci_dev by calling pci_pme_active(), which also holds
>> pci_pme_list_mutex, so I don't understand how pci_pme_list_scan() can
>> see a pci_dev that has already been freed.
>>
>> If I understand Andreas correctly, 928bea9648 also fixes the crash,
>> even without my refcounting change. Can you explain why?
>
> 928bea will make the dev->enable_cnt increase wrongly, as we have
> pci_enable_device for child
> pci_enable_bridge for parent
> pci_enable_bridge for grandparent
> pci_enable_device for grandparent
> pci_enable_device for parent
> pci_enable_brdige for grandparent
> pci_enable_device for grandparent.
> ...
>
> in that case grandprent will be enabled two times, and will enable_cnt will have
> extra increase.
>
> so later pci_disable_device will not really call do_pci_disable_device
> do the really work, as enable_cnt still big.
>
> solution could be:
> let pci_enable_bridge call __pci_enable_device.
> and __pci_enable_device will not call pci_enable_bridge.

Sorry, I didn't understand this. Is this supposed to be an
explanation of how 928bea fixes the oops that Andreas saw? If so, can
you be a little more explicit about when the pci_dev got freed and
when pci_pme_list_scan() walked the list and accessed the freed area?

Bjorn

2013-10-27 00:39:54

by Andreas Noever

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

> Sorry, I didn't understand this. Is this supposed to be an
> explanation of how 928bea fixes the oops that Andreas saw? If so, can
> you be a little more explicit about when the pci_dev got freed and
> when pci_pme_list_scan() walked the list and accessed the freed area?

I did some more debugging and it seems that 928bea is innocent after
all. I added some debugging statements to pci_pme_active. The
additional delay seems to make the oops easier to trigger and I can
now replicate it up to
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5137a2ee2007d9cbbbeebd14abe08357a079b607
which makes much more sense.

Here is what's going on (in 3.11). First of all pci_pme_activate is
only ever called with false as the second paramter during boot. Now
when I unplug the adapter, the first call is:
[<ffffffff814b1cc7>] dump_stack+0x54/0x8d
[<ffffffff812ae970>] pci_pme_active+0x30/0x210
[<ffffffff813bf2bc>] ? pci_read+0x2c/0x30 (this should be pci_stop_dev imho)
[<ffffffff812ac8ae>] pci_stop_bus_device+0x4e/0xa0
[<ffffffff812ac89b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812ac89b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812aca02>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffff812c4698>] pciehp_unconfigure_device+0xa8/0x1b0
[<ffffffff812c3ff8>] pciehp_disable_slot+0x68/0x200
[<ffffffff812c4213>] pciehp_power_thread+0x83/0xf0
[<ffffffff8107b5b8>] process_one_work+0x178/0x470
[<ffffffff8107bf81>] worker_thread+0x121/0x3a0
[<ffffffff8107be60>] ? manage_workers.isra.21+0x2b0/0x2b0
[<ffffffff81082d80>] kthread+0xc0/0xd0
[<ffffffff81060000>] ? SyS_unshare+0x220/0x280
[<ffffffff81082cc0>] ? kthread_create_on_node+0x120/0x120
[<ffffffff814c07ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81082cc0>] ? kthread_create_on_node+0x120/0x120
tg3 0000:0a:00.0: PME# disabled

This is still fine. But then it gets interesting. The next call is:
[<ffffffff814b1cc7>] dump_stack+0x54/0x8d
[<ffffffff812ae970>] pci_pme_active+0x30/0x210
[<ffffffff812aebb5>] __pci_enable_wake+0x65/0x160
[<ffffffff812aecd5>] pci_wake_from_d3+0x25/0x40
[<ffffffffa0511c29>] tg3_power_down+0x29/0x40 [tg3]
[<ffffffffa0511d4c>] tg3_close+0x10c/0x1d0 [tg3]
[<ffffffff813d67b5>] __dev_close_many+0x85/0xd0
[<ffffffff813d68cb>] dev_close_many+0x8b/0x100
[<ffffffff813d8dd8>] rollback_registered_many+0xd8/0x250
[<ffffffff813d8f7d>] rollback_registered+0x2d/0x40
[<ffffffff813da828>] unregister_netdevice_queue+0x58/0xb0
[<ffffffff813da89c>] unregister_netdev+0x1c/0x30
[<ffffffffa050104b>] tg3_remove_one+0x6b/0x120 [tg3]
[<ffffffff812b1b0b>] pci_device_remove+0x3b/0xb0
[<ffffffff81371c1f>] __device_release_driver+0x7f/0xf0
[<ffffffff81371cb3>] device_release_driver+0x23/0x30
[<ffffffff81371484>] bus_remove_device+0xf4/0x170
[<ffffffff8136df45>] device_del+0x135/0x1d0
[<ffffffff812ac8f4>] pci_stop_bus_device+0x94/0xa0
[<ffffffff812ac89b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812ac89b>] pci_stop_bus_device+0x3b/0xa0
[<ffffffff812aca02>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffff812c4698>] pciehp_unconfigure_device+0xa8/0x1b0
[<ffffffff812c3ff8>] pciehp_disable_slot+0x68/0x200
[<ffffffff812c4213>] pciehp_power_thread+0x83/0xf0
[<ffffffff8107b5b8>] process_one_work+0x178/0x470
[<ffffffff8107bf81>] worker_thread+0x121/0x3a0
[<ffffffff8107be60>] ? manage_workers.isra.21+0x2b0/0x2b0
[<ffffffff81082d80>] kthread+0xc0/0xd0
[<ffffffff81060000>] ? SyS_unshare+0x220/0x280
[<ffffffff81082cc0>] ? kthread_create_on_node+0x120/0x120
[<ffffffff814c07ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81082cc0>] ? kthread_create_on_node+0x120/0x120
tg3 0000:0a:00.0: PME# enabled

On removal tg3 calls pci_wake_from_d3 to enable/disable wake-on-lan.
This then calls pci_pme_activate(dev, true) for a device which is
about to be deleted. The linked commit does no longer call
pci_wake_from_d3, which "fixes" the problem.

Andreas

2013-10-29 03:30:54

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Wed, Oct 16, 2013 at 2:21 PM, Bjorn Helgaas <[email protected]> wrote:
> On Tue, Oct 15, 2013 at 03:44:52AM +0100, Matthew Garrett wrote:
>> On Mon, Oct 14, 2013 at 05:50:38PM -0600, Bjorn Helgaas wrote:
>> > [+cc Rafael, Mika, Kirill, linux-pci]
>> >
>> > On Mon, Oct 14, 2013 at 4:47 PM, Andreas Noever
>> > <[email protected]> wrote:
>> > > When I unplug the Thunderbolt ethernet adapter on my MacBookPro Linux
>> > > crashes a few seconds later. Using
>> > > echo 1 > /sys/bus/pci/devices/0000:08:00.0/remove
>> > > to remove a bridge two levels above the device triggers the fault immediately:
>> >
>> > There have been significant changes in acpiphp related to Thunderbolt
>> > since v3.11.
>>
>> Apple don't expose Thunderbolt via ACPI, so it appears as native PCIe.
>> I'd be surprised if acpiphp makes a difference here.
>
> Yeah, you're right; I wasn't paying attention.
>
> We save a pci_dev pointer in the pci_pme_list, which of course has a
> longer lifetime than the pci_dev itself, but we don't acquire a reference
> on it, so I suspect the pci_dev got released before we got around to
> doing the pci_pme_list_scan().
>
> Andreas, can you try the patch below? It's against v3.12-rc2, but it
> should apply to v3.11, too.
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ad7fc72..8b0a2f3 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1580,6 +1580,7 @@ static void pci_pme_list_scan(struct work_struct *work)
> pci_pme_wakeup(pme_dev->dev, NULL);
> } else {
> list_del(&pme_dev->list);
> + pci_dev_put(pme_dev->dev);
> kfree(pme_dev);
> }
> }
> @@ -1640,7 +1641,7 @@ void pci_pme_active(struct pci_dev *dev, bool enable)
> GFP_KERNEL);
> if (!pme_dev)
> goto out;
> - pme_dev->dev = dev;
> + pme_dev->dev = pci_dev_get(dev);
> mutex_lock(&pci_pme_list_mutex);
> list_add(&pme_dev->list, &pci_pme_list);
> if (list_is_singular(&pci_pme_list))
> @@ -1652,6 +1653,7 @@ void pci_pme_active(struct pci_dev *dev, bool enable)
> list_for_each_entry(pme_dev, &pci_pme_list, list) {
> if (pme_dev->dev == dev) {
> list_del(&pme_dev->list);
> + pci_dev_put(pme_dev->dev);
> kfree(pme_dev);
> break;
> }

The patch above covered up the problem, but is incorrect. The topology is:

08:00.0 PCI bridge to [bus 09-0a] Thunderbolt Upstream Port
09:00.0 PCI bridge to [bus 0a] Thunderbolt Downstream Port
0a:00.0 tg3 NIC

and the sequence leading to the crash is (edited for brevity):

remove_store(08:00.0)
pci_stop_and_remove_bus_device(08:00.0) # Upstream Port
pci_stop_bus_device(08:00.0)
pci_stop_bus_device(09:00.0) # Downstream Port
pci_stop_bus_device(0a:00.0) # tg3 device
pci_stop_dev(0a:00.0)
pci_pme_active(0a:00.0, false) # remove from pci_pme_list
device_del(0a:00.0)
device_release_driver
tg3_remove_one
unregister_netdev
dev->netdev_ops->ndo_stop # tg3_close
tg3_close
pci_wake_from_d3
pci_pme_active(dev, true) # add to pci_pme_list
pci_remove_bus_device(08:00.0)
pci_remove_bus_device(09:00.0)
pci_remove_bus_device(0a:00.0)
pci_destroy_dev(0a:00.0)
put_device(0a:00.0) # drop last tg3
pci_dev reference
...
pci_release_dev # pci_dev release function
kfree(0a:00.0)
...
pci_pme_list_scan
0a:00.0 still on list => use-after-free

The patch above avoids the crash by acquiring a reference when adding
to pci_pme_list, so when pci_destroy_dev() drops the reference, it's
not the *last* reference, so the pci_dev is not released.

The problem is that the reference acquired when we add to pci_pme_list
will *never* be dropped, so we leaked the pci_dev.

Bjorn

2013-10-30 08:00:33

by Yijing Wang

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

>>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>>
>> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.
>
> that double disabling should be addressed by:
>
> https://lkml.org/lkml/2013/4/25/608
>
> [PATCH] PCI: Remove duplicate pci_disable_device for pcie port

Hi Yinghai and Bjorn,
I found a related issue in the latest Bjorn/pci-next branch.

Now if we remove the pcie port device in the system, there is a warning occured.
It seems introduced after commit 928bea9 "PCI: Delay enabling bridges until they're needed".

[ 2124.129478] ------------[ cut here ]------------
[ 2124.129490] WARNING: CPU: 3 PID: 7 at drivers/pci/pci.c:1424 pci_disable_device+0x90/0xa0()
[ 2124.129492] Device pcieport
[ 2124.129492] disabling already-disabled device
[ 2124.129494] Modules linked in: binfmt_misc cpufreq_conservative cpufreq_userspace cpufreq_powersave loop bnx2 igb kvm_intel kvm dca i2c_algo_bit ptp pps_core i2c_i801 i7core_edac iTCO_wdt iTCO_vendor_support lpc_ich mfd_core edac_core acpi_cpufreq serio_raw sg button pcspkr microcode autofs4 processor thermal_sys scsi_dh_rdac scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh ata_generic ata_piix megaraid_sas
[ 2124.129530] CPU: 3 PID: 7 Comm: kworker/u49:0 Not tainted 3.12.0-rc2-2.10-desktop+ #22
[ 2124.129533] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 /BC11BTSA , BIOS CTSAV036 04/27/2011
[ 2124.129540] Workqueue: sysfsd sysfs_schedule_callback_work
[ 2124.129543] 0000000000000009 ffff880532cd9bb8 ffffffff8162f51c 0000000000000007
[ 2124.129547] ffff880532cd9c08 ffff880532cd9bf8 ffffffff8105060c 0000000000000286
[ 2124.129552] ffff8805329f2000 ffffffff81c67bc0 ffff8805329f2000 0000000000000000
[ 2124.129556] Call Trace:
[ 2124.129564] [<ffffffff8162f51c>] dump_stack+0x55/0x86
[ 2124.129572] [<ffffffff8105060c>] warn_slowpath_common+0x8c/0xc0
[ 2124.129576] [<ffffffff810506f6>] warn_slowpath_fmt+0x46/0x50
[ 2124.129580] [<ffffffff81345ef0>] pci_disable_device+0x90/0xa0
[ 2124.129587] [<ffffffff8135524e>] pcie_portdrv_remove+0x1e/0x30
[ 2124.129592] [<ffffffff813491d6>] pci_device_remove+0x46/0xc0
[ 2124.129598] [<ffffffff814139cf>] __device_release_driver+0x7f/0xf0
[ 2124.129602] [<ffffffff81413c5c>] device_release_driver+0x2c/0x40
[ 2124.129606] [<ffffffff81413368>] bus_remove_device+0x108/0x170
[ 2124.129610] [<ffffffff814102f0>] device_del+0x130/0x1c0
[ 2124.129614] [<ffffffff813427dc>] pci_stop_bus_device+0x9c/0xb0
[ 2124.129618] [<ffffffff81342986>] pci_stop_and_remove_bus_device+0x16/0x30
[ 2124.129622] [<ffffffff8134a599>] remove_callback+0x29/0x40
[ 2124.129625] [<ffffffff811fd7d8>] sysfs_schedule_callback_work+0x18/0x80
[ 2124.129632] [<ffffffff8106bbbd>] process_one_work+0x17d/0x470
[ 2124.129635] [<ffffffff8106c342>] worker_thread+0x122/0x380
[ 2124.129639] [<ffffffff8106c220>] ? rescuer_thread+0x330/0x330
[ 2124.129643] [<ffffffff81073330>] kthread+0xc0/0xd0
[ 2124.129647] [<ffffffff81073270>] ? kthread_create_on_node+0x130/0x130
[ 2124.129653] [<ffffffff8163dbec>] ret_from_fork+0x7c/0xb0
[ 2124.129657] [<ffffffff81073270>] ? kthread_create_on_node+0x130/0x130
[ 2124.129660] ---[ end trace 5b020b35ec6adb4c ]---


>
> Thanks
>
> Yinghai
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> .
>


--
Thanks!
Yijing

2013-10-31 06:48:27

by Yinghai Lu

[permalink] [raw]
Subject: Re: [3.11.4] Thunderbolt/PCI unplug oops in pci_pme_list_scan

On Wed, Oct 30, 2013 at 12:57 AM, Yijing Wang <[email protected]> wrote:
>>>> Bisection points to 928bea964827d7824b548c1f8e06eccbbc4d0d7d .
>>>
>>> This is "PCI: Delay enabling bridges until they're needed" by Yinghai.
>>
>> that double disabling should be addressed by:
>>
>> https://lkml.org/lkml/2013/4/25/608
>>
>> [PATCH] PCI: Remove duplicate pci_disable_device for pcie port
>
> Hi Yinghai and Bjorn,
> I found a related issue in the latest Bjorn/pci-next branch.
>
> Now if we remove the pcie port device in the system, there is a warning occured.
> It seems introduced after commit 928bea9 "PCI: Delay enabling bridges until they're needed".
>
> [ 2124.129478] ------------[ cut here ]------------
> [ 2124.129490] WARNING: CPU: 3 PID: 7 at drivers/pci/pci.c:1424 pci_disable_device+0x90/0xa0()
> [ 2124.129492] Device pcieport
> [ 2124.129492] disabling already-disabled device

yes. that is pcie port driver problem.

| 928bea delay bridge enabling, and it's pci_is_enabled checking will prevent
| pci bridge get enabled second time, so enable_cnt will be only 1. instead of
| 2. if we enable bridge at first and then pcie port driver.