LinuxLists.cc - [BUG] 4.11.0-rc1 panic on shutdown X61s

2017-03-12 05:36:55

Subject: [BUG] 4.11.0-rc1 panic on shutdown X61s

Hello list,

Here's a photo of the panic, on imgur to be kind to vger:
http://imgur.com/a/wZI32

I'm out on a sailboat so can't really do much, but had a chance with internet
to send this FYI. I don't even know if this happens always or not yet.

Never seen this before, up to and including 4.10.0.

Regards,
Vito Caputo

2017-03-12 11:57:19

by Borislav Petkov

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Sat, Mar 11, 2017 at 09:37:23PM -0800, [email protected] wrote:
> Hello list,
>
> Here's a photo of the panic, on imgur to be kind to vger:
> http://imgur.com/a/wZI32
>
> I'm out on a sailboat so can't really do much, but had a chance with internet

So you didn't bring another box with you on the sailboat to connect it to the
laptop over netconsole to catch full dmesg, did you?

:-)))

Because there's a taint W flag in that splat which means there was a
warning which fired before that. Which could be the important piece of
dmesg...
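
For context on the W flag: the kernel keeps its taint state as a bitmask and prints one letter per set bit in the oops header. Below is a minimal Python sketch of that decoding, assuming the bit positions documented for /proc/sys/kernel/tainted; the helper name decode_taint is made up for illustration.

```python
# Taint bits as documented for /proc/sys/kernel/tainted.
# The list index is the bit number, so "W" sits at bit 9 (value 512).
TAINT_FLAGS = [
    ("P", "proprietary module loaded"),
    ("F", "module force-loaded"),
    ("S", "CPU/SMP out of spec"),
    ("R", "module force-unloaded"),
    ("M", "machine check exception"),
    ("B", "bad page reference"),
    ("U", "taint requested by userspace"),
    ("D", "kernel died (oops or BUG)"),
    ("A", "ACPI table overridden"),
    ("W", "kernel issued a warning"),
    ("C", "staging driver loaded"),
    ("I", "firmware bug workaround"),
    ("O", "out-of-tree module loaded"),
    ("E", "unsigned module loaded"),
    ("L", "soft lockup occurred"),
    ("K", "kernel live-patched"),
]


def decode_taint(mask):
    """Return (letter, description) for every taint bit set in mask."""
    return [
        (letter, desc)
        for bit, (letter, desc) in enumerate(TAINT_FLAGS)
        if mask & (1 << bit)
    ]
```

With this table, a mask of 512 decodes to the single W entry, which is the case described above: an earlier WARN fired before the final splat.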

Other than that, this is the BUG() in pci_msi_shutdown(). Hmm, maybe
linux-pci has an idea, CCed.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2017-03-12 12:27:19

by Borislav Petkov

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Sun, Mar 12, 2017 at 12:57:03PM +0100, Borislav Petkov wrote:
> On Sat, Mar 11, 2017 at 09:37:23PM -0800, [email protected] wrote:
> > Hello list,
> >
> > Here's a photo of the panic, on imgur to be kind to vger:
> > http://imgur.com/a/wZI32
> >
> > I'm out on a sailboat so can't really do much, but had a chance with internet
>
> So you didn't bring another box with you on the sailboat to connect it to the
> laptop over netconsole to catch full dmesg, did you?

Hahah, you're so in luck: I just sent this mail and hibernated my laptop
and got the same BUG. What's the chance of that happening?! Apparently
big enough.

But I was able to catch the warning before it too. So the question is,
do you have an e1000e eth controller in that machine too?

Because the symptoms below are consistent with the observed behavior:
e1000e fails to initialize MSI interrupts for whatever reason and falls
back to legacy interrupts.

Then, PCI core shuts down and BUGs because the msi_list is empty.
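
Put as a toy model: the shutdown path appears to assume that a device flagged as MSI-enabled always has at least one entry on its MSI descriptor list. The Python sketch below illustrates that invariant only. It is not kernel code; the class and function names are invented, and the exact check is an assumption drawn from the description above.

```python
class KernelBug(Exception):
    """Stands in for BUG(): an unrecoverable assertion failure."""


class ToyPciDev:
    """Hypothetical stand-in for the two pieces of PCI device state involved."""

    def __init__(self):
        self.msi_enabled = False
        self.msi_list = []


def toy_pci_msi_shutdown(dev):
    """Model of the shutdown-time check: enabled implies a non-empty list."""
    if not dev.msi_enabled:
        return "nothing to do"
    if not dev.msi_list:
        # The inconsistent state: flag says MSI is on, but no descriptor exists.
        raise KernelBug("msi_enabled set but msi_list empty")
    dev.msi_list.clear()
    dev.msi_enabled = False
    return "msi torn down"
```

A device whose flag and list agree shuts down cleanly in this model; only the mismatched state, flag set and list empty, reaches the BUG.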

Anyway, lemme add e1000e people too to the fun thread.

[ 8295.723895] hib.sh (19178): drop_caches: 3
[ 8295.961695] PM: Syncing filesystems ...
[ 8295.972559] done.
[ 8295.974699] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 8295.978484] PM: Preallocating image memory... done (allocated 197963 pages)
[ 8296.150973] PM: Allocated 791852 kbytes in 0.17 seconds (4657.95 MB/s)
[ 8296.151832] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 8296.155781] ACPI : EC: event blocked
[ 8296.449268] PM: freeze of devices complete after 294.716 msecs
[ 8296.469566] PM: late freeze of devices complete after 20.241 msecs
[ 8296.473369] ACPI : EC: interrupt blocked
[ 8296.477572] PM: noirq freeze of devices complete after 7.949 msecs
[ 8296.477605] Disabling non-boot CPUs ...
[ 8296.502106] Broke affinity for irq 16
[ 8296.502115] Broke affinity for irq 18
[ 8296.502124] Broke affinity for irq 23
[ 8296.503211] smpboot: CPU 1 is now offline
[ 8296.538391] Broke affinity for irq 16
[ 8296.538402] Broke affinity for irq 18
[ 8296.538412] Broke affinity for irq 23
[ 8296.538423] Broke affinity for irq 26
[ 8296.539527] smpboot: CPU 2 is now offline
[ 8296.578171] Broke affinity for irq 1
[ 8296.578180] Broke affinity for irq 8
[ 8296.578185] Broke affinity for irq 9
[ 8296.578191] Broke affinity for irq 12
[ 8296.578198] Broke affinity for irq 16
[ 8296.578204] Broke affinity for irq 18
[ 8296.578211] Broke affinity for irq 23
[ 8296.578218] Broke affinity for irq 26
[ 8296.578223] Broke affinity for irq 27
[ 8296.578228] Broke affinity for irq 28
[ 8296.579287] smpboot: CPU 3 is now offline
[ 8296.590166] PM: Creating hibernation image:
[ 8296.853347] PM: Need to copy 196464 pages
[ 8296.591954] Suspended for 284.763 seconds
[ 8296.592077] Enabling non-boot CPUs ...
[ 8296.603399] x86: Booting SMP configuration:
[ 8296.603423] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 8296.607663] cache: parent cpu1 should not be sleeping
[ 8296.608456] CPU1 is up
[ 8296.627426] smpboot: Booting Node 0 Processor 2 APIC 0x2
[ 8296.631362] cache: parent cpu2 should not be sleeping
[ 8296.632124] CPU2 is up
[ 8296.647451] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 8296.651451] cache: parent cpu3 should not be sleeping
[ 8296.652258] CPU3 is up
[ 8296.659200] ACPI : EC: interrupt unblocked
[ 8296.662221] sdhci-pci 0000:02:00.0: MMC controller base frequency changed to 50Mhz.
[ 8296.703214] PM: noirq restore of devices complete after 44.358 msecs
[ 8296.704523] PM: early restore of devices complete after 1.246 msecs
[ 8296.777628] usb usb1: root hub lost power or was reset
[ 8296.777758] ACPI : EC: event unblocked
[ 8296.778121] usb usb2: root hub lost power or was reset
[ 8296.779447] rtc_cmos 00:02: System wakeup disabled by ACPI
[ 8296.779671] usb usb3: root hub lost power or was reset
[ 8296.779685] usb usb4: root hub lost power or was reset
[ 8296.781545] ehci-pci 0000:00:1a.0: cache line size of 64 is not supported
[ 8296.782034] ehci-pci 0000:00:1d.0: cache line size of 64 is not supported
[ 8296.788359] sd 0:0:0:0: [sda] Starting disk
[ 8296.788439] sd 2:0:0:0: [sdb] Starting disk
[ 8296.934415] ------------[ cut here ]------------
[ 8296.934428] WARNING: CPU: 0 PID: 19229 at drivers/pci/msi.c:1052 __pci_enable_msi_range+0x39e/0x3f0
[ 8296.934440] Modules linked in: ctr ccm fuse ntfs msdos ext2 msr cpufreq_powersave cpufreq_userspace cpufreq_conservative binfmt_misc uinput vfat fat loop dm_crypt dm_mod hid_generic usbhid hid snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel iTCO_wdt aes_x86_64 iTCO_vendor_support crypto_simd cryptd glue_helper intel_cstate arc4 intel_rapl_perf pcspkr i2c_i801 snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core sdhci_pci sdhci sg snd_pcm ehci_pci lpc_ich e1000e xhci_pci mmc_core snd_timer mfd_core ehci_hcd xhci_hcd wmi thinkpad_acpi nvram snd soundcore led_class ac battery serio_raw thermal [last unloaded: cfg80211]
[ 8296.934584] CPU: 0 PID: 19229 Comm: kworker/u8:49 Not tainted 4.10.0+ #2
[ 8296.934593] Hardware name: LENOVO 2320CTO/2320CTO, BIOS G2ET86WW (2.06 ) 11/13/2012
[ 8296.934605] Workqueue: events_unbound async_run_entry_fn
[ 8296.934613] Call Trace:
[ 8296.934621] dump_stack+0x67/0x92
[ 8296.934629] __warn+0xcb/0xf0
[ 8296.934636] ? pci_pm_suspend_noirq+0x190/0x190
[ 8296.934644] warn_slowpath_null+0x1d/0x20
[ 8296.934650] __pci_enable_msi_range+0x39e/0x3f0
[ 8296.934662] ? e1000_get_phy_info_82577+0x21/0x110 [e1000e]
[ 8296.934671] ? pci_pm_suspend_noirq+0x190/0x190
[ 8296.934678] pci_enable_msi+0x1a/0x30
[ 8296.934686] e1000e_set_interrupt_capability+0x3c/0x110 [e1000e]
[ 8296.934697] e1000e_pm_thaw+0x22/0x60 [e1000e]
[ 8296.934707] e1000e_pm_resume+0x25/0x30 [e1000e]
[ 8296.934714] pci_pm_restore+0x79/0xb0
[ 8296.934723] dpm_run_callback+0x4e/0x2d0
[ 8296.934730] device_resume+0x9d/0x1a0
[ 8296.934737] async_resume+0x1d/0x50
[ 8296.934743] async_run_entry_fn+0x37/0xe0
[ 8296.934752] process_one_work+0x1e8/0x730
[ 8296.934759] ? process_one_work+0x169/0x730
[ 8296.934768] worker_thread+0x48/0x4e0
[ 8296.934777] kthread+0x101/0x140
[ 8296.934783] ? process_one_work+0x730/0x730
[ 8296.934790] ? kthread_create_on_node+0x40/0x40
[ 8296.934798] ret_from_fork+0x2e/0x40
[ 8296.934809] ---[ end trace 6ab732218e829ce9 ]---
[ 8296.934875] e1000e 0000:00:19.0 eth0: Failed to initialize MSI interrupts. Falling back to legacy interrupts.
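
The location in a WARNING: line like the one in the splat above can be pulled apart mechanically, which helps when comparing reports from different machines. A minimal Python sketch, assuming the line layout seen in this log; the function name parse_warning is illustrative.

```python
import re

# Matches e.g. "WARNING: CPU: 0 PID: 19229 at drivers/pci/msi.c:1052 func+0x39e/0x3f0"
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_warning(text):
    """Extract source location and symbol offset from a WARNING line, or None."""
    m = WARN_RE.search(text)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "pid": int(m.group("pid")),
        "file": m.group("file"),
        "line": int(m.group("line")),
        "func": m.group("func"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
    }
```

Two reports pointing at the same file, line and function are likely hitting the same check even when the offsets differ, since offsets shift with compiler and config.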

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2017-03-12 13:55:29

by Andy Shevchenko

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Sun, Mar 12, 2017 at 2:26 PM, Borislav Petkov <[email protected]> wrote:
> On Sun, Mar 12, 2017 at 12:57:03PM +0100, Borislav Petkov wrote:
>> On Sat, Mar 11, 2017 at 09:37:23PM -0800, [email protected] wrote:
>> > Hello list,
>> >
>> > Here's a photo of the panic, on imgur to be kind to vger:
>> > http://imgur.com/a/wZI32
>> >
>> > I'm out on a sailboat so can't really do much, but had a chance with internet
>>
>> So you didn't bring another box with you on the sailboat to connect it to the
>> laptop over netconsole to catch full dmesg, did you?
>
> Hahah, you're so in luck: I just sent this mail and hibernated my laptop
> and got the same BUG. What's the chance of that happening?! Apparently
> big enough.
>
> But I was able to catch the warning before it too. So the question is,
> do you have an e1000e eth controller in that machine too?
>
> Because the symptoms below are consistent with the observed behavior:
> e1000e fails to initialize MSI interrupts for whatever reason and falls
> back to legacy interrupts.
>
> Then, PCI core shuts down and BUGs because the msi_list is empty.
>
> Anyway, lemme add e1000e people too to the fun thread.
>

The only change between v4.10 and v4.11-rc1 that IMHO matters is this:

@@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
 		/* Quiesce the device without resetting the hardware */
 		e1000e_down(adapter, false);
 		e1000_free_irq(adapter);
+		e1000e_reset_interrupt_capability(adapter);
 	}
-	e1000e_reset_interrupt_capability(adapter);

So it apparently misses something for the other case, like a
pci_disable_msi() call or so.
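
To make the moved line concrete, here is a small Python model of the two variants of the freeze path. It is a sketch under assumptions, not driver code: the hunk does not show the condition guarding the block, so the model assumes it is "interface is running"; it also assumes thaw always tries to enable MSI, as the call trace suggests. ToyAdapter and its fields are invented for illustration.

```python
class ToyAdapter:
    """Hypothetical stand-in for the adapter's MSI state."""

    def __init__(self, running):
        self.running = running
        self.msi_enabled = True  # assume MSI was set up earlier, e.g. at probe
        self.warnings = []

    def reset_interrupt_capability(self):
        self.msi_enabled = False

    def set_interrupt_capability(self):
        # Enabling MSI on a device that still has it enabled is the
        # condition this model treats as the WARNING.
        if self.msi_enabled:
            self.warnings.append("MSI already enabled")
        self.msi_enabled = True


def freeze_v4_10(adapter):
    if adapter.running:
        pass  # quiesce the device and free the irq
    adapter.reset_interrupt_capability()  # unconditional


def freeze_v4_11_rc1(adapter):
    if adapter.running:
        # quiesce the device and free the irq, then:
        adapter.reset_interrupt_capability()  # now only when running


def hibernate_cycle(freeze, running):
    """Run freeze followed by thaw and return any warnings raised."""
    adapter = ToyAdapter(running)
    freeze(adapter)
    adapter.set_interrupt_capability()  # thaw
    return adapter.warnings
```

In this model only the v4.11-rc1 variant with the condition false trips the warning; whether that matches the affected machines is not established by the hunk alone.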

P.S. I'm not PCI or e1000e guy :-)

--
With Best Regards,
Andy Shevchenko

2017-03-12 18:23:35

by lkml

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Sun, Mar 12, 2017 at 01:26:21PM +0100, Borislav Petkov wrote:
> On Sun, Mar 12, 2017 at 12:57:03PM +0100, Borislav Petkov wrote:
> > On Sat, Mar 11, 2017 at 09:37:23PM -0800, [email protected] wrote:
> > > Hello list,
> > >
> > > Here's a photo of the panic, on imgur to be kind to vger:
> > > http://imgur.com/a/wZI32
> > >
> > > I'm out on a sailboat so can't really do much, but had a chance with internet
> >
> > So you didn't bring another box with you on the sailboat to connect it to the
> > laptop over netconsole to catch full dmesg, did you?
>
> Hahah, you're so in luck: I just sent this mail and hibernated my laptop
> and got the same BUG. What's the chance of that happening?! Apparently
> big enough.
>
> But I was able to catch the warning before it too. So the question is,
> do you have an e1000e eth controller in that machine too?
>
> Because the symptoms below are consistent with the observed behavior:
> e1000e fails to initialize MSI interrupts for whatever reason and falls
> back to legacy interrupts.
>
> Then, PCI core shuts down and BUGs because the msi_list is empty.
>
> Anyway, lemme add e1000e people too to the fun thread.
>
<snip>

Highly likely, apparently: this machine does have e1000e, and after a single
suspend+resume cycle this appears in dmesg:

[28539.220131] ------------[ cut here ]------------
[28539.220131] WARNING: CPU: 1 PID: 1432 at drivers/pci/msi.c:1052 __pci_enable_msi_range+0x39c/0x3f0
[28539.220131] CPU: 1 PID: 1432 Comm: kworker/u4:40 Not tainted 4.11.0-rc1 #51
[28539.220131] Hardware name: LENOVO 7668CTO/7668CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
[28539.220131] Workqueue: events_unbound async_run_entry_fn
[28539.220131] Call Trace:
[28539.220131] dump_stack+0x4d/0x72
[28539.220131] __warn+0xc7/0xf0
[28539.220131] warn_slowpath_null+0x18/0x20
[28539.220131] __pci_enable_msi_range+0x39c/0x3f0
[28539.220131] ? e1000e_get_phy_info_igp+0x1c/0xf0
[28539.220131] pci_enable_msi+0x15/0x30
[28539.220131] e1000e_set_interrupt_capability+0xe0/0x130
[28539.220131] e1000e_pm_thaw+0x1d/0x50
[28539.220131] e1000e_pm_resume+0x20/0x30
[28539.220131] pci_pm_resume+0x5f/0x90
[28539.220131] dpm_run_callback+0x44/0x170
[28539.220131] ? pci_pm_thaw+0x90/0x90
[28539.220131] device_resume+0xce/0x1e0
[28539.220131] async_resume+0x18/0x40
[28539.220131] async_run_entry_fn+0x32/0xe0
[28539.220131] process_one_work+0x13b/0x3e0
[28539.220131] worker_thread+0x64/0x4a0
[28539.220131] kthread+0x10f/0x150
[28539.220131] ? process_one_work+0x3e0/0x3e0
[28539.220131] ? __kthread_create_on_node+0x150/0x150
[28539.220131] ret_from_fork+0x29/0x40
[28539.220131] ---[ end trace e7beefda13ba724f ]---
[28539.220131] e1000e 0000:00:19.0 eth3: Failed to initialize MSI interrupts. Falling back to legacy interrupts.

Regards,
Vito Caputo

2017-03-12 22:23:53

by Borislav Petkov

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> On Sun, Mar 12, 2017 at 2:26 PM, Borislav Petkov <[email protected]> wrote:
> > On Sun, Mar 12, 2017 at 12:57:03PM +0100, Borislav Petkov wrote:
> >> On Sat, Mar 11, 2017 at 09:37:23PM -0800, [email protected] wrote:
> >> > Hello list,
> >> >
> >> > Here's a photo of the panic, on imgur to be kind to vger:
> >> > http://imgur.com/a/wZI32
> >> >
> >> > I'm out on a sailboat so can't really do much, but had a chance with internet
> >>
> >> So you didn't bring another box with you on the sailboat to connect it to the
> >> laptop over netconsole to catch full dmesg, did you?
> >
> > Hahah, you're so in luck: I just sent this mail and hibernated my laptop
> > and got the same BUG. What's the chance of that happening?! Apparently
> > big enough.
> >
> > But I was able to catch the warning before it too. So the question is,
> > do you have an e1000e eth controller in that machine too?
> >
> > Because the symptoms below are consistent with the observed behavior:
> > e1000e fails to initialize MSI interrupts for whatever reason and falls
> > back to legacy interrupts.
> >
> > Then, PCI core shuts down and BUGs because the msi_list is empty.
> >
> > Anyway, lemme add e1000e people too to the fun thread.
> >
>
> The only change that IMHO matters happened between v4.10 and v4.11-rc1 is this:
>
> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
> /* Quiesce the device without resetting the hardware */
> e1000e_down(adapter, false);
> e1000_free_irq(adapter);
> + e1000e_reset_interrupt_capability(adapter);
> }
> - e1000e_reset_interrupt_capability(adapter);
>
> So, it apparently misses something for the other case, like
> pci_disable_msi() call or so.

Well, lemme add the people from

7e54d9d063fa ("e1000e: driver trying to free already-free irq")

to CC then. :-)

Thanks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2017-03-13 16:51:45

by Bjørn Mork

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

Borislav Petkov <[email protected]> writes:
> On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
>
>> The only change that IMHO matters happened between v4.10 and v4.11-rc1 is this:
>>
>> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
>> /* Quiesce the device without resetting the hardware */
>> e1000e_down(adapter, false);
>> e1000_free_irq(adapter);
>> + e1000e_reset_interrupt_capability(adapter);
>> }
>> - e1000e_reset_interrupt_capability(adapter);
>>
>> So, it apparently misses something for the other case, like
>> pci_disable_msi() call or so.
>
> Well, lemme add the people from
>
> 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
>
> to CC then. :-)

Already did that a week ago:
https://www.spinics.net/lists/netdev/msg423379.html

Haven't heard anything back yet. Wondering if they are waiting for
someone else to submit the pretty obvious revert? Don't understand why
that should take more than a minute to figure out. It's not like they
are testing these changes anyway...

Bjørn

2017-03-14 01:20:37

by Brown, Aaron F

Subject: RE: [BUG] 4.11.0-rc1 panic on shutdown X61s

> From: Bjørn Mork [mailto:[email protected]]
> Sent: Monday, March 13, 2017 9:46 AM
> To: Borislav Petkov <[email protected]>
> Cc: Andy Shevchenko <[email protected]>; [email protected];
> linux-kernel <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; khalidm
> <[email protected]>; David Singleton <[email protected]>; Brown, Aaron
> F <[email protected]>; Kirsher, Jeffrey T
> <[email protected]>
> Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s
>
> Borislav Petkov <[email protected]> writes:
> > On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> >
> >> The only change that IMHO matters happened between v4.10 and v4.11-
> rc1 is this:
> >>
> >> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
> >> /* Quiesce the device without resetting the hardware */
> >> e1000e_down(adapter, false);
> >> e1000_free_irq(adapter);
> >> + e1000e_reset_interrupt_capability(adapter);
> >> }
> >> - e1000e_reset_interrupt_capability(adapter);
> >>
> >> So, it apparently misses something for the other case, like
> >> pci_disable_msi() call or so.
> >
> > Well, lemme add the people from
> >
> > 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
> >
> > to CC then. :-)
>
> Already did that a week ago:
> https://www.spinics.net/lists/netdev/msg423379.html
>
> Haven't heard anything back yet. Wondering if they are waiting for
> someone else to submit the pretty obvious revert? Don't understand why
> that should take more than a minute to figure out. It's not like they
> are testing these changes anyway...

Believe it or not we actually do test these changes. This one was tested by me and I did not have the same results you and the other people reporting this trace did. I made it back in the lab today and have spent a good part of the day attempting to reproduce this bug without success. Freeze / resume works for me on all the systems I have tried, which includes a sampling of all the current parts and many older ones. Given there are several other reports of this it is obviously an issue and I would like to be able to reproduce it in case another patch to resolve the issue this attempts to fix comes back in another form. So I want to know what's different between the systems that hit this and my bank of systems that don't.

What exact part (or parts) are we looking at (lspci|grep -i eth) that trigger this? Could it be a difference in .config files? The trace says it is falling back to legacy interrupts, does the system continue to work and does the network continue to function in that mode? In case it's related to user space what is the base distro? Any other information you think can help me reproduce the issue would be appreciated.

Thanks,
Aaron

>
>
> Bjørn

2017-03-14 01:43:53

by Borislav Petkov

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Tue, Mar 14, 2017 at 01:20:27AM +0000, Brown, Aaron F wrote:
> Believe it or not we actually do test these changes. This one was
> tested by me and I did not have the same results you and the other
> people reporting this trace did. I made it back in the lab today and
> have spent a good part of the day attempting to reproduce this bug
> without success. Freeze / resume works for me on all the systems I
> have tried, which includes a sampling of all the current parts and
> many older ones.

Yeah, tell me about it.

> Given there are several other reports of this it is obviously an issue
> and I would like to be able to reproduce it in case another patch to
> resolve the issue this attempts to fix comes back in another form. So
> I want to know what's different between the systems that hit this and
> my bank of systems that don't.

So mine is not the newest anymore: thinkpad x230.

> What exact part (or parts) are we looking at (lspci|grep -i eth)

Lemme give you the gory details (PCI cfg space etc):

$ lspci -xxx -vvvv | grep -i eth -A 36
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
Subsystem: Lenovo 82579LM Gigabit Network Connection
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 30
Region 0: Memory at f1500000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at f153b000 (32-bit, non-prefetchable) [size=4K]
Region 2: I/O ports at 4080 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee002d8 Data: 0000
Capabilities: [e0] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: e1000e
Kernel modules: e1000e
00: 86 80 02 15 07 04 10 00 04 00 00 02 00 00 00 00
10: 00 00 50 f1 00 b0 53 f1 81 40 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 aa 17 f3 21
30: 00 00 00 00 c8 00 00 00 00 00 00 00 07 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 01 d0 22 c8 00 20 00 07
d0: 05 e0 81 00 d8 02 e0 fe 00 00 00 00 00 00 00 00
e0: 13 00 06 03 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
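
The d0: row of the dump above is the MSI capability that lspci summarises as "Enable+ Count=1/1 Maskable- 64bit+". A minimal Python sketch of that decoding, assuming the standard MSI capability layout (ID 0x05, next pointer, 16-bit message control, then address and data); the function names are illustrative.

```python
def parse_row(row):
    """Turn an `lspci -xxx` row such as 'd0: 05 e0 ...' into (offset, bytes)."""
    offset, _, data = row.partition(":")
    return int(offset, 16), bytes(int(b, 16) for b in data.split())


def decode_msi_cap(row):
    """Decode an MSI capability that starts at the beginning of the row."""
    _, raw = parse_row(row)
    if raw[0] != 0x05:
        raise ValueError("not an MSI capability")
    ctrl = int.from_bytes(raw[2:4], "little")  # message control register
    is_64bit = bool(ctrl & 0x0080)
    address = int.from_bytes(raw[4:8], "little")
    data_off = 8
    if is_64bit:
        # 64-bit capable devices carry an upper address dword before the data.
        address |= int.from_bytes(raw[8:12], "little") << 32
        data_off = 12
    return {
        "next_cap": raw[1],
        "enabled": bool(ctrl & 0x0001),
        "vectors_capable": 1 << ((ctrl >> 1) & 0x7),
        "vectors_enabled": 1 << ((ctrl >> 4) & 0x7),
        "is_64bit": is_64bit,
        "maskable": bool(ctrl & 0x0100),
        "address": address,
        "data": int.from_bytes(raw[data_off:data_off + 2], "little"),
    }


ROW_D0 = "d0: 05 e0 81 00 d8 02 e0 fe 00 00 00 00 00 00 00 00"
msi = decode_msi_cap(ROW_D0)
```

For this row the sketch yields the enable bit set, one vector, 64-bit capable, not maskable, and address 0xfee002d8, in line with the lspci text.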

> that trigger this? Could it be a difference in .config files? The

.config attached.

> trace says it is falling back to legacy interrupts, does the system
> continue to work and does the network continue to function in that
> mode?

Not really. I tried halting it after the splat but it started powering
down and deadlocked on something. Had to cold-reset.

> Any other information you think can help me reproduce the issue would
> be appreciated.

So the real question is why does it fail setting up MSI interrupts. I'd
look into that part of the driver...

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Attachments:

(No filename) (3.64 kB)
config-4.10.0+.gz (25.60 kB)

2017-03-14 02:39:54

by lkml

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

On Tue, Mar 14, 2017 at 01:20:27AM +0000, Brown, Aaron F wrote:
> > Borislav Petkov <[email protected]> writes:
> > > On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> > >
> > >> The only change that IMHO matters happened between v4.10 and v4.11-
> > rc1 is this:
> > >>
> > >> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
> > >> /* Quiesce the device without resetting the hardware */
> > >> e1000e_down(adapter, false);
> > >> e1000_free_irq(adapter);
> > >> + e1000e_reset_interrupt_capability(adapter);
> > >> }
> > >> - e1000e_reset_interrupt_capability(adapter);
> > >>
> > >> So, it apparently misses something for the other case, like
> > >> pci_disable_msi() call or so.
> > >
> > > Well, lemme add the people from
> > >
> > > 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
> > >
> > > to CC then. :-)
> >
> > Already did that a week ago:
> > https://www.spinics.net/lists/netdev/msg423379.html
> >
> > Haven't heard anything back yet. Wondering if they are waiting for
> > someone else to submit the pretty obvious revert? Don't understand why
> > that should take more than a minute to figure out. It's not like they
> > are testing these changes anyway...
>
<snip>
>
> What exact part (or parts) are we looking at (lspci|grep -i eth) that trigger this? Could it be a difference in .config files? The trace says it is falling back to legacy interrupts, does the system continue to work and does the network continue to function in that mode? In case it's related to user space what is the base distro? Any other information you think can help me reproduce the issue would be appreciated.
>

Config attached, the machine is a Thinkpad X61s 1.8GHz with no onboard wireless
devices (rtl8192cu usb wifi is used).

# lspci| grep -i eth
00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network Connection (rev 03)

Debian jessie amd64 is the distro.

I'll have to get back to you on whether the e1000e continues functioning; the
machine itself continues to function until the shutdown panic.

There were however some occurrences of subsequent suspend/resume cycles hanging
the machine hard leaving the display off, which prompted me to resume using
4.10 before digging any further as it's my only system right now.

Will try to get around to testing 4.11 with 7e54d9d063fa reverted soon.

Regards,
Vito Caputo

Attachments:

(No filename) (2.40 kB)
linux-4.11.0-rc1-x61s.config (104.05 kB)

2017-03-14 07:49:33

by Sasha Neftin

Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

On 3/14/2017 03:20, Brown, Aaron F wrote:
>> From: Bjørn Mork [mailto:[email protected]]
>> Sent: Monday, March 13, 2017 9:46 AM
>> To: Borislav Petkov <[email protected]>
>> Cc: Andy Shevchenko <[email protected]>; [email protected];
>> linux-kernel <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; khalidm
>> <[email protected]>; David Singleton <[email protected]>; Brown, Aaron
>> F <[email protected]>; Kirsher, Jeffrey T
>> <[email protected]>
>> Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s
>>
>> Borislav Petkov <[email protected]> writes:
>>> On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
>>>
>>>> The only change that IMHO matters happened between v4.10 and v4.11-
>> rc1 is this:
>>>> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device *dev)
>>>> /* Quiesce the device without resetting the hardware */
>>>> e1000e_down(adapter, false);
>>>> e1000_free_irq(adapter);
>>>> + e1000e_reset_interrupt_capability(adapter);
>>>> }
>>>> - e1000e_reset_interrupt_capability(adapter);
>>>>
>>>> So, it apparently misses something for the other case, like
>>>> pci_disable_msi() call or so.
>>> Well, lemme add the people from
>>>
>>> 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
>>>
>>> to CC then. :-)
>> Already did that a week ago:
>> https://www.spinics.net/lists/netdev/msg423379.html
>>
>> Haven't heard anything back yet. Wondering if they are waiting for
>> someone else to submit the pretty obvious revert? Don't understand why
>> that should take more than a minute to figure out. It's not like they
>> are testing these changes anyway...
> Believe it or not we actually do test these changes. This one was tested by me and I did not have the same results you and the other people reporting this trace did. I made it back in the lab today and have spent a good part of the day attempting to reproduce this bug without success. Freeze / resume works for me on all the systems I have tried, which includes a sampling of all the current parts and many older ones. Given there are several other reports of this it is obviously an issue and I would like to be able to reproduce it in case another patch to resolve the issue this attempts to fix comes back in another form. So I want to know what's different between the systems that hit this and my bank of systems that don't.
>
> What exact part (or parts) are we looking at (lspci|grep -i eth) that trigger this? Could it be a difference in .config files? The trace says it is falling back to legacy interrupts, does the system continue to work and does the network continue to function in that mode? In case it's related to user space what is the base distro? Any other information you think can help me reproduce the issue would be appreciated.
>
> Thanks,
> Aaron
>
>>
>> Bjørn

Hello,

I suggest reverting the commit of this patch. We recommended not applying
this change.

Thanks,

Sasha

2017-03-14 08:28:54

by Bjørn Mork

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

"Brown, Aaron F" <[email protected]> writes:
>> From: Bjørn Mork [mailto:[email protected]]
>>
>> Already did that a week ago:
>> https://www.spinics.net/lists/netdev/msg423379.html
>>
>> Haven't heard anything back yet. Wondering if they are waiting for
>> someone else to submit the pretty obvious revert? Don't understand why
>> that should take more than a minute to figure out. It's not like they
>> are testing these changes anyway...
>
> Believe it or not we actually do test these changes.

Yes, I know you did. Sorry. I should stop throwing such comments around.

>This one was tested by me and I did not have the same results you and
>the other people reporting this trace did. I made it back in the lab
>today and have spent a good part of the day attempting to reproduce
>this bug without success. Freeze / resume works for me on all the
>systems I have tried, which includes a sampling of all the current
>parts and many older ones. Given there are several other reports of
>this it is obviously an issue and I would like to be able to reproduce
>it in case another patch to resolve the issue this attempts to fix
>comes back in another form. So I want to know what's different between
>the systems that hit this and my bank of systems that don't.
>
> What exact part (or parts) are we looking at (lspci|grep -i eth) that
> trigger this? Could it be a difference in .config files? The trace
> says it is falling back to legacy interrupts, does the system continue
> to work and does the network continue to function in that mode? In
> case it's related to user space what is the base distro? Any other
> information you think can help me reproduce the issue would be
> appreciated.

I have a somewhat newer laptop than Borislav. But mine is also a Lenovo
Thinkpad, so that might be a thing. My system is a Lenovo Thinkpad X1
Carbon Gen4, which is a Skylake generation laptop. lspci output as
configured in v4.9:

root@miraculix:/tmp# lspci -vvvnnxxxs 1f.6
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection I219-LM [8086:156f] (rev 21)
Subsystem: Lenovo Ethernet Connection I219-LM [17aa:2233]
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 134
Region 0: Memory at e1300000 (32-bit, non-prefetchable) [disabled] [size=128K]
Capabilities: [c8] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D3 NoSoftRst+ PME-Enable+ DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00378 Data: 0000
Capabilities: [e0] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: e1000e
Kernel modules: e1000e
00: 86 80 6f 15 00 04 10 00 21 00 00 02 00 00 00 00
10: 00 00 30 e1 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 aa 17 33 22
30: 00 00 00 00 c8 00 00 00 00 00 00 00 ff 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 28 00 00 00 08 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 01 00 00 00 03 10 03 10 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 01 d0 23 c8 0b 21 00 00
d0: 05 e0 81 00 78 03 e0 fe 00 00 00 00 00 00 00 00
e0: 13 00 06 03 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
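
For anyone reading the hex dump by hand, the capability list that lspci prints can be recovered from the raw bytes. A minimal Python sketch, assuming that only the four rows that matter are copied from the dump and everything else is treated as zero; the helper names are made up for illustration and are not part of any tool:

```python
# Rows copied from the lspci -xxx dump above; omitted rows are left as zero.
DUMP = """\
30: 00 00 00 00 c8 00 00 00 00 00 00 00 ff 01 00 00
c0: 00 00 00 00 00 00 00 00 01 d0 23 c8 0b 21 00 00
d0: 05 e0 81 00 78 03 e0 fe 00 00 00 00 00 00 00 00
e0: 13 00 06 03 00 00 00 00 00 00 00 00 00 00 00 00
"""

# Standard PCI capability IDs that appear in this dump.
CAP_NAMES = {0x01: "Power Management", 0x05: "MSI", 0x13: "PCI Advanced Features"}


def parse_dump(text):
    """Turn 'offset: bytes' rows into a 256-byte config space image."""
    cfg = bytearray(256)
    for line in text.splitlines():
        offset, _, data = line.partition(":")
        row = bytes.fromhex(data.strip())
        base = int(offset, 16)
        cfg[base:base + len(row)] = row
    return cfg


def walk_capabilities(cfg):
    """Follow the linked list that starts at the capabilities pointer (0x34)."""
    caps = []
    seen = set()
    ptr = cfg[0x34] & 0xFC
    while ptr and ptr not in seen:   # stop at 0 or on a malformed loop
        seen.add(ptr)
        caps.append((ptr, cfg[ptr]))  # (offset, capability ID)
        ptr = cfg[ptr + 1] & 0xFC     # next-pointer byte
    return caps


def msi_enabled(cfg, ptr):
    """Bit 0 of the MSI Message Control word (capability offset + 2)."""
    control = int.from_bytes(cfg[ptr + 2:ptr + 4], "little")
    return bool(control & 0x0001)


def power_state(cfg, ptr):
    """Bits 1:0 of PMCSR (capability offset + 4): 0 = D0 ... 3 = D3hot."""
    pmcsr = int.from_bytes(cfg[ptr + 4:ptr + 6], "little")
    return pmcsr & 0x3


cfg = parse_dump(DUMP)
for ptr, cap_id in walk_capabilities(cfg):
    print(f"[{ptr:02x}] {CAP_NAMES.get(cap_id, hex(cap_id))}")
print("MSI enabled:", msi_enabled(cfg, 0xD0))
print("Power state: D%d" % power_state(cfg, 0xC8))
```

Run against these rows it walks [c8], [d0] and [e0], and reports the MSI enable bit as set while the power-state field reads 3, which agrees with the "MSI: Enable+" and "Status: D3" lines in the decoded output.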

Attaching the .config. The laptop is running Debian sid, which is more
or less "stretch" now. But I usually build my kernels on a Debian
stable ("jessie") system if that matters.

As for whether it works or not, I must admit that I haven't tested.
Wired ethernet is not something I use on a daily basis on this laptop.
Will test it when I find time, but sending this now so you have
something to start working with.

Thanks

Bjørn

Attachments:

config-v4.11.0-rc1.gz (28.79 kB)

2017-03-14 08:44:00

by Bjørn Mork

Subject: Re: [BUG] 4.11.0-rc1 panic on shutdown X61s

Bjørn Mork <[email protected]> writes:
> "Brown, Aaron F" <[email protected]> writes:
>>> From: Bjørn Mork [mailto:[email protected]]
>>>
>>> Already did that a week ago:
>>> https://www.spinics.net/lists/netdev/msg423379.html
>>>
>>> Haven't heard anything back yet. Wondering if they are waiting for
>>> someone else to submit the pretty obvious revert? Don't understand why
>>> that should take more than a minute to figure out. It's not like they
>>> are testing these changes anyway...
>>
>> Believe it or not we actually do test these changes.
>
> Yes, I know you did. Sorry. I should stop throwing such comments around.

So I have made that mistake before, and never learn. But the good thing
about that is that it now rang a bell. I once had suspend issues with a
completely different Intel device on another Lenovo laptop, which turned
out to be caused by the Lenovo firmware accessing the device while it is
in D3: https://lkml.org/lkml/2015/3/2/183

I don't have a clue about how these things work so this might be
completely unrelated. Just thought I'd mention it as an example of how
the Lenovo firmware can do unexpected things.

From the reports so far, it looks like the issue at hand affects most
(all?) Lenovo Thinkpads with an e1000e part. If you haven't tried
already, it might be interesting to see if you can reproduce it on a
Thinkpad.

Bjørn

2017-03-16 20:13:46

by Bowers, AndrewX

Subject: RE: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

Tested this on a Thinkpad T420i after verifying that it also has an e1000e NIC, and was unable to reproduce it. The issue might be limited to the particular model/firmware version you're using. I was not able to track one down here, although there is another person I could ask, so I might be able to come up with one yet.

> -----Original Message-----
> From: Intel-wired-lan [mailto:[email protected]] On
> Behalf Of [email protected]
> Sent: Monday, March 13, 2017 7:41 PM
> To: Brown, Aaron F <[email protected]>
> Cc: [email protected]; [email protected]; David Singleton
> <[email protected]>; linux-kernel <[email protected]>;
> khalidm <[email protected]>; Andy Shevchenko
> <[email protected]>; Borislav Petkov <[email protected]>; intel-
> [email protected]; Bjørn Mork <[email protected]>
> Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s
>
> On Tue, Mar 14, 2017 at 01:20:27AM +0000, Brown, Aaron F wrote:
> > > Borislav Petkov <[email protected]> writes:
> > > > On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> > > >
> > > >> The only change that IMHO matters happened between v4.10 and
> > > >> v4.11-
> > > rc1 is this:
> > > >>
> > > >> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device
> *dev)
> > > >> /* Quiesce the device without resetting the hardware */
> > > >> e1000e_down(adapter, false);
> > > >> e1000_free_irq(adapter);
> > > >> + e1000e_reset_interrupt_capability(adapter);
> > > >> }
> > > >> - e1000e_reset_interrupt_capability(adapter);
> > > >>
> > > >> So, it apparently misses something for the other case, like
> > > >> pci_disable_msi() call or so.
> > > >
> > > > Well, lemme add the people from
> > > >
> > > > 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
> > > >
> > > > to CC then. :-)
> > >
> > > Already did that a week ago:
> > > https://www.spinics.net/lists/netdev/msg423379.html
> > >
> > > Haven't heard anything back yet. Wondering if they are waiting for
> > > someone else to submit the pretty obvious revert? Don't understand
> > > why that should take more than a minute to figure out. It's not
> > > like they are testing these changes anyway...
> >
> <snip>
> >
> > What exact part (or parts) are we looking at (lspci|grep -i eth) that trigger
> this? Could it be a difference in .config files? The trace says it is falling back
> to legacy interrupts, does the system continue to work and does the
> network continue to function in that mode? In case it's related to user space
> what is the base distro? Any other information you think can help me
> reproduce the issue would be appreciated.
> >
>
> Config attached, the machine is a Thinkpad X61s 1.8Ghz with no onboard
> wireless devices (rtl8192cu usb wifi is used).
>
> # lspci| grep -i eth
> 00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network
> Connection (rev 03)
>
> Debian jessie amd64 is the distro.
>
> I'll have to get back to you on if the e1000e continues functioning, the
> machine continues to function until the shutdown panic.
>
> There were however some occurrences of subsequent suspend/resume
> cycles hanging the machine hard leaving the display off, which prompted me
> to resume using
> 4.10 before digging any further as it's my only system right now.
>
> Will try get around to testing 4.11 with 7e54d9d063fa reverted soon.
>
> Regards,
> Vito Caputo

2017-03-22 02:04:24

by lkml

Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

On Thu, Mar 16, 2017 at 08:13:40PM +0000, Bowers, AndrewX wrote:
> Tested this on a Thinkpad T420i, after verifying it also has an e1000e NIC, unable to reproduce. Might be limited to that particular model/firmware version you're using, which I was not able to track down here although there is another person I could ask, might be able to come up with one yet.
>
>
> > -----Original Message-----
> > From: Intel-wired-lan [mailto:[email protected]] On
> > Behalf Of [email protected]
> > Sent: Monday, March 13, 2017 7:41 PM
> > To: Brown, Aaron F <[email protected]>
> > Cc: [email protected]; [email protected]; David Singleton
> > <[email protected]>; linux-kernel <[email protected]>;
> > khalidm <[email protected]>; Andy Shevchenko
> > <[email protected]>; Borislav Petkov <[email protected]>; intel-
> > [email protected]; Bjørn Mork <[email protected]>
> > Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s
> >
> > On Tue, Mar 14, 2017 at 01:20:27AM +0000, Brown, Aaron F wrote:
> > > > Borislav Petkov <[email protected]> writes:
> > > > > On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> > > > >
> > > > >> The only change that IMHO matters happened between v4.10 and
> > > > >> v4.11-
> > > > rc1 is this:
> > > > >>
> > > > >> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device
> > *dev)
> > > > >> /* Quiesce the device without resetting the hardware */
> > > > >> e1000e_down(adapter, false);
> > > > >> e1000_free_irq(adapter);
> > > > >> + e1000e_reset_interrupt_capability(adapter);
> > > > >> }
> > > > >> - e1000e_reset_interrupt_capability(adapter);
> > > > >>
> > > > >> So, it apparently misses something for the other case, like
> > > > >> pci_disable_msi() call or so.
> > > > >
> > > > > Well, lemme add the people from
> > > > >
> > > > > 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
> > > > >
> > > > > to CC then. :-)
> > > >
> > > > Already did that a week ago:
> > > > https://www.spinics.net/lists/netdev/msg423379.html
> > > >
> > > > Haven't heard anything back yet. Wondering if they are waiting for
> > > > someone else to submit the pretty obvious revert? Don't understand
> > > > why that should take more than a minute to figure out. It's not
> > > > like they are testing these changes anyway...
> > >
> > <snip>
> > >
> > > What exact part (or parts) are we looking at (lspci|grep -i eth) that trigger
> > this? Could it be a difference in .config files? The trace says it is falling back
> > to legacy interrupts, does the system continue to work and does the
> > network continue to function in that mode? In case it's related to user space
> > what is the base distro? Any other information you think can help me
> > reproduce the issue would be appreciated.
> > >
> >
> > Config attached, the machine is a Thinkpad X61s 1.8Ghz with no onboard
> > wireless devices (rtl8192cu usb wifi is used).
> >
> > # lspci| grep -i eth
> > 00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network
> > Connection (rev 03)
> >
> > Debian jessie amd64 is the distro.
> >
> > I'll have to get back to you on if the e1000e continues functioning, the
> > machine continues to function until the shutdown panic.
> >
> > There were however some occurrences of subsequent suspend/resume
> > cycles hanging the machine hard leaving the display off, which prompted me
> > to resume using
> > 4.10 before digging any further as it's my only system right now.
> >
> > Will try get around to testing 4.11 with 7e54d9d063fa reverted soon.
> >
> > Regards,
> > Vito Caputo

This is still broken as of 4.11.0-rc3 FYI.

Upon resume:
[ 45.828344] ------------[ cut here ]------------
[ 45.828352] WARNING: CPU: 0 PID: 807 at drivers/pci/msi.c:1052 __pci_enable_msi_range+0x39c/0x3f0
[ 45.828355] CPU: 0 PID: 807 Comm: kworker/u4:29 Not tainted 4.11.0-rc3 #52
[ 45.828356] Hardware name: LENOVO 7668CTO/7668CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
[ 45.828360] Workqueue: events_unbound async_run_entry_fn
[ 45.828362] Call Trace:
[ 45.828366] dump_stack+0x4d/0x72
[ 45.828369] __warn+0xc7/0xf0
[ 45.828371] warn_slowpath_null+0x18/0x20
[ 45.828372] __pci_enable_msi_range+0x39c/0x3f0
[ 45.828375] ? e1000e_get_phy_info_igp+0x1c/0xf0
[ 45.828377] pci_enable_msi+0x15/0x30
[ 45.828379] e1000e_set_interrupt_capability+0xe0/0x130
[ 45.828381] e1000e_pm_thaw+0x1d/0x50
[ 45.828383] e1000e_pm_resume+0x20/0x30
[ 45.828386] pci_pm_resume+0x5f/0x90
[ 45.828389] dpm_run_callback+0x44/0x170
[ 45.828391] ? pci_pm_thaw+0x90/0x90
[ 45.828393] device_resume+0xce/0x1e0
[ 45.828395] async_resume+0x18/0x40
[ 45.828396] async_run_entry_fn+0x32/0xe0
[ 45.828399] process_one_work+0x13b/0x3e0
[ 45.828400] worker_thread+0x64/0x4a0
[ 45.828402] kthread+0x10f/0x150
[ 45.828404] ? process_one_work+0x3e0/0x3e0
[ 45.828406] ? __kthread_create_on_node+0x150/0x150
[ 45.828409] ret_from_fork+0x29/0x40
[ 45.828411] ---[ end trace 56fad2d83af13529 ]---
[ 45.828469] e1000e 0000:00:19.0 eth3: Failed to initialize MSI interrupts. Falling back to legacy interrupts.
[ 45.835944] PM: resume of devices complete after 364.406 msecs
[ 45.836001] usb 2-1:1.0: rebind failed: -517
[ 45.836316] PM: Finishing wakeup.
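
One way to read this trace together with the e1000e_pm_freeze() hunk quoted above is as an enable/disable imbalance: if freeze no longer resets the interrupt capability on some path, the thaw path asks for MSI on a device that still has it enabled. The Python below is a toy model of that reading, not driver code. It assumes that the block the hunk moves the call into is only entered when the interface is up (the condition is not visible in the quoted hunk) and that the PCI core warns and refuses when MSI is already enabled. All names are hypothetical.

```python
class ToyDevice:
    """Stand-in for per-device MSI bookkeeping (hypothetical, not the PCI core)."""

    def __init__(self):
        self.msi_enabled = False
        self.warnings = 0

    def enable_msi(self):
        # Assumed behaviour: enabling MSI twice warns and fails.
        if self.msi_enabled:
            self.warnings += 1
            return False
        self.msi_enabled = True
        return True

    def disable_msi(self):
        self.msi_enabled = False


def freeze(dev, interface_up, patched):
    """Model of the freeze path before (patched=False) and after the hunk."""
    if interface_up:
        # Quiescing the device and freeing the IRQ would happen here.
        if patched:
            dev.disable_msi()   # call now lives inside the conditional block
    if not patched:
        dev.disable_msi()       # old placement: runs unconditionally


def thaw(dev):
    """Model of the thaw path: always asks for MSI again."""
    return dev.enable_msi()


def suspend_resume_cycle(interface_up, patched):
    dev = ToyDevice()
    dev.enable_msi()            # initial setup at probe time
    freeze(dev, interface_up, patched)
    thaw(dev)
    return dev.warnings


for up in (True, False):
    for patched in (False, True):
        print(f"interface_up={up} patched={patched} "
              f"warnings={suspend_resume_cycle(up, patched)}")
```

In this model only the combination of the moved call and an interface that is down produces a warning on resume, which would fit machines where the wired port is unused. Whether that is what actually happens in the driver is for the e1000e maintainers to confirm.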

Regards,
Vito Caputo

2017-03-22 02:12:49

by lkml

Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

On Tue, Mar 21, 2017 at 07:04:45PM -0700, [email protected] wrote:
> On Thu, Mar 16, 2017 at 08:13:40PM +0000, Bowers, AndrewX wrote:
> > Tested this on a Thinkpad T420i, after verifying it also has an e1000e NIC, unable to reproduce. Might be limited to that particular model/firmware version you're using, which I was not able to track down here although there is another person I could ask, might be able to come up with one yet.
> >
> >
> > > -----Original Message-----
> > > From: Intel-wired-lan [mailto:[email protected]] On
> > > Behalf Of [email protected]
> > > Sent: Monday, March 13, 2017 7:41 PM
> > > To: Brown, Aaron F <[email protected]>
> > > Cc: [email protected]; [email protected]; David Singleton
> > > <[email protected]>; linux-kernel <[email protected]>;
> > > khalidm <[email protected]>; Andy Shevchenko
> > > <[email protected]>; Borislav Petkov <[email protected]>; intel-
> > > [email protected]; Bjørn Mork <[email protected]>
> > > Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s
> > >
> > > On Tue, Mar 14, 2017 at 01:20:27AM +0000, Brown, Aaron F wrote:
> > > > > Borislav Petkov <[email protected]> writes:
> > > > > > On Sun, Mar 12, 2017 at 03:55:08PM +0200, Andy Shevchenko wrote:
> > > > > >
> > > > > >> The only change that IMHO matters happened between v4.10 and
> > > > > >> v4.11-
> > > > > rc1 is this:
> > > > > >>
> > > > > >> @@ -6276,8 +6274,8 @@ static int e1000e_pm_freeze(struct device
> > > *dev)
> > > > > >> /* Quiesce the device without resetting the hardware */
> > > > > >> e1000e_down(adapter, false);
> > > > > >> e1000_free_irq(adapter);
> > > > > >> + e1000e_reset_interrupt_capability(adapter);
> > > > > >> }
> > > > > >> - e1000e_reset_interrupt_capability(adapter);
> > > > > >>
> > > > > >> So, it apparently misses something for the other case, like
> > > > > >> pci_disable_msi() call or so.
> > > > > >
> > > > > > Well, lemme add the people from
> > > > > >
> > > > > > 7e54d9d063fa ("e1000e: driver trying to free already-free irq")
> > > > > >
> > > > > > to CC then. :-)
> > > > >
> > > > > Already did that a week ago:
> > > > > https://www.spinics.net/lists/netdev/msg423379.html
> > > > >
> > > > > Haven't heard anything back yet. Wondering if they are waiting for
> > > > > someone else to submit the pretty obvious revert? Don't understand
> > > > > why that should take more than a minute to figure out. It's not
> > > > > like they are testing these changes anyway...
> > > >
> > > <snip>
> > > >
> > > > What exact part (or parts) are we looking at (lspci|grep -i eth) that trigger
> > > this? Could it be a difference in .config files? The trace says it is falling back
> > > to legacy interrupts, does the system continue to work and does the
> > > network continue to function in that mode? In case it's related to user space
> > > what is the base distro? Any other information you think can help me
> > > reproduce the issue would be appreciated.
> > > >
> > >
> > > Config attached, the machine is a Thinkpad X61s 1.8Ghz with no onboard
> > > wireless devices (rtl8192cu usb wifi is used).
> > >
> > > # lspci| grep -i eth
> > > 00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network
> > > Connection (rev 03)
> > >
> > > Debian jessie amd64 is the distro.
> > >
> > > I'll have to get back to you on if the e1000e continues functioning, the
> > > machine continues to function until the shutdown panic.
> > >
> > > There were however some occurrences of subsequent suspend/resume
> > > cycles hanging the machine hard leaving the display off, which prompted me
> > > to resume using
> > > 4.10 before digging any further as it's my only system right now.
> > >
> > > Will try get around to testing 4.11 with 7e54d9d063fa reverted soon.
> > >
> > > Regards,
> > > Vito Caputo
>
>
> This is still broken as of 4.11.0-rc3 FYI.
>
> Upon resume:
> [ 45.828344] ------------[ cut here ]------------
> [ 45.828352] WARNING: CPU: 0 PID: 807 at drivers/pci/msi.c:1052 __pci_enable_msi_range+0x39c/0x3f0
> [ 45.828355] CPU: 0 PID: 807 Comm: kworker/u4:29 Not tainted 4.11.0-rc3 #52
> [ 45.828356] Hardware name: LENOVO 7668CTO/7668CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
> [ 45.828360] Workqueue: events_unbound async_run_entry_fn
> [ 45.828362] Call Trace:
> [ 45.828366] dump_stack+0x4d/0x72
> [ 45.828369] __warn+0xc7/0xf0
> [ 45.828371] warn_slowpath_null+0x18/0x20
> [ 45.828372] __pci_enable_msi_range+0x39c/0x3f0
> [ 45.828375] ? e1000e_get_phy_info_igp+0x1c/0xf0
> [ 45.828377] pci_enable_msi+0x15/0x30
> [ 45.828379] e1000e_set_interrupt_capability+0xe0/0x130
> [ 45.828381] e1000e_pm_thaw+0x1d/0x50
> [ 45.828383] e1000e_pm_resume+0x20/0x30
> [ 45.828386] pci_pm_resume+0x5f/0x90
> [ 45.828389] dpm_run_callback+0x44/0x170
> [ 45.828391] ? pci_pm_thaw+0x90/0x90
> [ 45.828393] device_resume+0xce/0x1e0
> [ 45.828395] async_resume+0x18/0x40
> [ 45.828396] async_run_entry_fn+0x32/0xe0
> [ 45.828399] process_one_work+0x13b/0x3e0
> [ 45.828400] worker_thread+0x64/0x4a0
> [ 45.828402] kthread+0x10f/0x150
> [ 45.828404] ? process_one_work+0x3e0/0x3e0
> [ 45.828406] ? __kthread_create_on_node+0x150/0x150
> [ 45.828409] ret_from_fork+0x29/0x40
> [ 45.828411] ---[ end trace 56fad2d83af13529 ]---
> [ 45.828469] e1000e 0000:00:19.0 eth3: Failed to initialize MSI interrupts. Falling back to legacy interrupts.
> [ 45.835944] PM: resume of devices complete after 364.406 msecs
> [ 45.836001] usb 2-1:1.0: rebind failed: -517
> [ 45.836316] PM: Finishing wakeup.
>

I never reported back on the results of reverting 7e54d9d063fa: it seems to fix
the problem on my machine as well.

Regards,
Vito Caputo

2017-03-22 19:00:46

by Borislav Petkov

Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

(Readding Jeff Kirsher who got dropped from the CC-list at some point.)

On Tue, Mar 21, 2017 at 07:13:43PM -0700, [email protected] wrote:
> > This is still broken as of 4.11.0-rc3 FYI.
> >
> > Upon resume:
> > [ 45.828344] ------------[ cut here ]------------
> > [ 45.828352] WARNING: CPU: 0 PID: 807 at drivers/pci/msi.c:1052 __pci_enable_msi_range+0x39c/0x3f0
> > [ 45.828355] CPU: 0 PID: 807 Comm: kworker/u4:29 Not tainted 4.11.0-rc3 #52
> > [ 45.828356] Hardware name: LENOVO 7668CTO/7668CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
> > [ 45.828360] Workqueue: events_unbound async_run_entry_fn
> > [ 45.828362] Call Trace:
> > [ 45.828366] dump_stack+0x4d/0x72
> > [ 45.828369] __warn+0xc7/0xf0
> > [ 45.828371] warn_slowpath_null+0x18/0x20
> > [ 45.828372] __pci_enable_msi_range+0x39c/0x3f0
> > [ 45.828375] ? e1000e_get_phy_info_igp+0x1c/0xf0
> > [ 45.828377] pci_enable_msi+0x15/0x30
> > [ 45.828379] e1000e_set_interrupt_capability+0xe0/0x130
> > [ 45.828381] e1000e_pm_thaw+0x1d/0x50
> > [ 45.828383] e1000e_pm_resume+0x20/0x30
> > [ 45.828386] pci_pm_resume+0x5f/0x90
> > [ 45.828389] dpm_run_callback+0x44/0x170
> > [ 45.828391] ? pci_pm_thaw+0x90/0x90
> > [ 45.828393] device_resume+0xce/0x1e0
> > [ 45.828395] async_resume+0x18/0x40
> > [ 45.828396] async_run_entry_fn+0x32/0xe0
> > [ 45.828399] process_one_work+0x13b/0x3e0
> > [ 45.828400] worker_thread+0x64/0x4a0
> > [ 45.828402] kthread+0x10f/0x150
> > [ 45.828404] ? process_one_work+0x3e0/0x3e0
> > [ 45.828406] ? __kthread_create_on_node+0x150/0x150
> > [ 45.828409] ret_from_fork+0x29/0x40
> > [ 45.828411] ---[ end trace 56fad2d83af13529 ]---
> > [ 45.828469] e1000e 0000:00:19.0 eth3: Failed to initialize MSI interrupts. Falling back to legacy interrupts.
> > [ 45.835944] PM: resume of devices complete after 364.406 msecs
> > [ 45.836001] usb 2-1:1.0: rebind failed: -517
> > [ 45.836316] PM: Finishing wakeup.
> >
>
> I never reported back on the results of reverting 7e54d9d063fa, it seems to fix
> the problem on my machine as well.

Right, so I think we should revert soonish. That is, if you guys don't
have a fix yet. You can always try again during the next merge window.
Right now, I'm not testing -rc kernels on this box because of this. I
can always blacklist the networking driver but what good is that box
then...

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2017-03-24 06:19:17

by Jeff Kirsher

Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

On Wed, Mar 22, 2017 at 12:00 PM, Borislav Petkov <[email protected]> wrote:
> (Readding Jeff Kirsher who got dropped from the CC-list at some point.)
>
> On Tue, Mar 21, 2017 at 07:13:43PM -0700, [email protected] wrote:
>> > This is still broken as of 4.11.0-rc3 FYI.
>> >
>> > Upon resume:
>> > [ 45.828344] ------------[ cut here ]------------
>> > [ 45.828352] WARNING: CPU: 0 PID: 807 at drivers/pci/msi.c:1052 __pci_enable_msi_range+0x39c/0x3f0
>> > [ 45.828355] CPU: 0 PID: 807 Comm: kworker/u4:29 Not tainted 4.11.0-rc3 #52
>> > [ 45.828356] Hardware name: LENOVO 7668CTO/7668CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
>> > [ 45.828360] Workqueue: events_unbound async_run_entry_fn
>> > [ 45.828362] Call Trace:
>> > [ 45.828366] dump_stack+0x4d/0x72
>> > [ 45.828369] __warn+0xc7/0xf0
>> > [ 45.828371] warn_slowpath_null+0x18/0x20
>> > [ 45.828372] __pci_enable_msi_range+0x39c/0x3f0
>> > [ 45.828375] ? e1000e_get_phy_info_igp+0x1c/0xf0
>> > [ 45.828377] pci_enable_msi+0x15/0x30
>> > [ 45.828379] e1000e_set_interrupt_capability+0xe0/0x130
>> > [ 45.828381] e1000e_pm_thaw+0x1d/0x50
>> > [ 45.828383] e1000e_pm_resume+0x20/0x30
>> > [ 45.828386] pci_pm_resume+0x5f/0x90
>> > [ 45.828389] dpm_run_callback+0x44/0x170
>> > [ 45.828391] ? pci_pm_thaw+0x90/0x90
>> > [ 45.828393] device_resume+0xce/0x1e0
>> > [ 45.828395] async_resume+0x18/0x40
>> > [ 45.828396] async_run_entry_fn+0x32/0xe0
>> > [ 45.828399] process_one_work+0x13b/0x3e0
>> > [ 45.828400] worker_thread+0x64/0x4a0
>> > [ 45.828402] kthread+0x10f/0x150
>> > [ 45.828404] ? process_one_work+0x3e0/0x3e0
>> > [ 45.828406] ? __kthread_create_on_node+0x150/0x150
>> > [ 45.828409] ret_from_fork+0x29/0x40
>> > [ 45.828411] ---[ end trace 56fad2d83af13529 ]---
>> > [ 45.828469] e1000e 0000:00:19.0 eth3: Failed to initialize MSI interrupts. Falling back to legacy interrupts.
>> > [ 45.835944] PM: resume of devices complete after 364.406 msecs
>> > [ 45.836001] usb 2-1:1.0: rebind failed: -517
>> > [ 45.836316] PM: Finishing wakeup.
>> >
>>
>> I never reported back on the results of reverting 7e54d9d063fa, it seems to fix
>> the problem on my machine as well.
>
> Right, so I think we should revert soonish. That is, if you guys don't
> have a fix yet. You can always try again during the next merge window.
> Right now, I'm not testing -rc kernels on this box because of this. I
> can always blacklist the networking driver but what good is that box
> then...
>

I have sent a patch to revert the offending commit through David
Miller's net tree. Sorry for the delay on this; I thought I had seen a
patch to revert the commit earlier, which is why I did not send this
sooner.

--
Cheers,
Jeff

2017-03-24 08:48:30

by Borislav Petkov

Subject: Re: [Intel-wired-lan] [BUG] 4.11.0-rc1 panic on shutdown X61s

On Thu, Mar 23, 2017 at 11:18:31PM -0700, Jeff Kirsher wrote:
> I have sent a patch to revert this offending commit through David
> Miller's net tree, sorry for the delay on this, I thought I had seen a
> patch to revert the offending commit earlier which is why I did not
> send this earlier.

Thanks Jeff. If you need me to test the real fix just let me know.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.