Hi,
Stable kernel has stability problems on r8169 that were not present in 4.17.3:
[ 0.000000] Linux version 4.18.8 (kbuild@lofw) (gcc version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #70 SMP
PREEMPT Mon Sep 17 17:56:57 CEST 2018
[ 0.000000] Command line: BOOT_IMAGE=/boot/linux-4.18.8 root=LABEL=ROOT ro rootfstype=ext4
net.ifnames=0 pci=nomsi
[ 1.772849] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 1.772852] r8169 0000:07:00.0: can't disable ASPM; OS doesn't have ASPM control
[ 1.784948] r8169 0000:07:00.0 eth4: RTL8168h/8111h, 50:9a:4c:2e:92:be, XID 54100800, IRQ 16
[ 1.784949] r8169 0000:07:00.0 eth4: jumbo features [frames: 9200 bytes, tx checksumming: ko]
We saw the interface unresponsive twice during the last 3 days with:
[Mon Sep 24 11:35:56 2018] ------------[ cut here ]------------
[Mon Sep 24 11:35:56 2018] NETDEV WATCHDOG: wan (r8169): transmit queue 0 timed out
[Mon Sep 24 11:35:56 2018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461
dev_watchdog+0x215/0x220
[Mon Sep 24 11:35:56 2018] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.8 #70
[Mon Sep 24 11:35:56 2018] Hardware name: Dell Inc. OptiPlex 3050/0W0CHX, BIOS 1.6.5 09/09/2017
[Mon Sep 24 11:35:56 2018] RIP: 0010:dev_watchdog+0x215/0x220
[Mon Sep 24 11:35:56 2018] Code: 49 63 4c 24 e8 eb 8c 4c 89 ef c6 05 1a 19 ca 00 01 e8 5f 52 fd ff
89 d9 4c 89 ee 48 c7 c7 78 ab 67 89 48 89 c2 e8 1b 2b 49 ff <0f> 0b eb be 0f 1f 80 00 00 00 00 41 57
45 89 cf 41 56 49 89 d6 41
[Mon Sep 24 11:35:56 2018] RSP: 0018:ffff96f05dd03e98 EFLAGS: 00010282
[Mon Sep 24 11:35:56 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[Mon Sep 24 11:35:56 2018] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff96f05dd15350
[Mon Sep 24 11:35:56 2018] RBP: ffff96f0462ee41c R08: 0000000000000001 R09: 000000000000133d
[Mon Sep 24 11:35:56 2018] R10: 0000000000000202 R11: 0000000000000000 R12: ffff96f0462ee438
[Mon Sep 24 11:35:56 2018] R13: ffff96f0462ee000 R14: 0000000000000001 R15: ffff96f0455eaa80
[Mon Sep 24 11:35:56 2018] FS: 0000000000000000(0000) GS:ffff96f05dd00000(0000) knlGS:0000000000000000
[Mon Sep 24 11:35:56 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Sep 24 11:35:56 2018] CR2: 000055c9498766e0 CR3: 00000000bb80a006 CR4: 00000000003606e0
[Mon Sep 24 11:35:56 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Mon Sep 24 11:35:56 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Mon Sep 24 11:35:56 2018] Call Trace:
[Mon Sep 24 11:35:56 2018] <IRQ>
[Mon Sep 24 11:35:56 2018] ? pfifo_fast_reset+0x130/0x130
[Mon Sep 24 11:35:56 2018] ? pfifo_fast_reset+0x130/0x130
[Mon Sep 24 11:35:56 2018] call_timer_fn+0x11/0x70
[Mon Sep 24 11:35:56 2018] expire_timers+0x8e/0xa0
[Mon Sep 24 11:35:56 2018] run_timer_softirq+0xb9/0x160
[Mon Sep 24 11:35:56 2018] ? __hrtimer_run_queues+0x135/0x1a0
[Mon Sep 24 11:35:56 2018] ? hw_breakpoint_pmu_read+0x10/0x10
[Mon Sep 24 11:35:56 2018] ? ktime_get+0x39/0x90
[Mon Sep 24 11:35:56 2018] ? lapic_next_event+0x20/0x20
[Mon Sep 24 11:35:56 2018] __do_softirq+0xcb/0x1f8
[Mon Sep 24 11:35:56 2018] irq_exit+0xa9/0xb0
[Mon Sep 24 11:35:56 2018] smp_apic_timer_interrupt+0x59/0x90
[Mon Sep 24 11:35:56 2018] apic_timer_interrupt+0xf/0x20
[Mon Sep 24 11:35:56 2018] </IRQ>
[Mon Sep 24 11:35:56 2018] RIP: 0010:cpuidle_enter_state+0x129/0x200
[Mon Sep 24 11:35:56 2018] Code: 45 00 89 c3 e8 d8 3b 55 ff 65 8b 3d b1 eb 45 77 e8 8c 3a 55 ff 31
ff 49 89 c4 e8 72 43 55 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1 4c 29 e9 48 89 c8 48 c1 f9 3f
48 f7 ea b8 ff ff ff 7f 48
[Mon Sep 24 11:35:56 2018] RSP: 0018:ffff9a93c06e7ea8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[Mon Sep 24 11:35:56 2018] RAX: ffff96f05dd1f800 RBX: 0000000000000003 RCX: 000000000000001f
[Mon Sep 24 11:35:56 2018] RDX: 20c49ba5e353f7cf RSI: 00000000258f0602 RDI: 0000000000000000
[Mon Sep 24 11:35:56 2018] RBP: ffff96f05dd25ee0 R08: 00000000000002b4 R09: 00000000ffffffff
[Mon Sep 24 11:35:56 2018] R10: ffff9a93c06e7e90 R11: 0000000000000142 R12: 00012ec849a182b9
[Mon Sep 24 11:35:56 2018] R13: 00012ec8499ddf88 R14: 0000000000000003 R15: 0000000000000000
[Mon Sep 24 11:35:56 2018] ? cpuidle_enter_state+0x11e/0x200
[Mon Sep 24 11:35:56 2018] do_idle+0x1c0/0x200
[Mon Sep 24 11:35:56 2018] cpu_startup_entry+0x6a/0x70
[Mon Sep 24 11:35:56 2018] start_secondary+0x18a/0x1c0
[Mon Sep 24 11:35:56 2018] secondary_startup_64+0xa5/0xb0
[Mon Sep 24 11:35:56 2018] ---[ end trace 327bd9c035abe307 ]---
This is the built-in ethernet port on a Dell main board:
07:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller [10ec:8168] (rev 15)
Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [1028:07a3]
Flags: bus master, fast devsel, latency 0, IRQ 16
I/O ports at e000 [size=256]
Memory at f7404000 (64-bit, non-prefetchable) [size=4K]
Memory at f7400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Capabilities: [178] L1 PM Substates
Kernel driver in use: r8169
The box has an extra 4-way ethernet card that uses the same driver. We had to set pci=nomsi because
the card frequently behaved erratic with msi on.
Thanks,
Ortwin
On 24.09.2018 14:00, Ortwin Glück wrote:
> Hi,
>
> Stable kernel has stability problems on r8169 that were not present in 4.17.3:
>
> [ 0.000000] Linux version 4.18.8 (kbuild@lofw) (gcc version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #70 SMP PREEMPT Mon Sep 17 17:56:57 CEST 2018
> [ 0.000000] Command line: BOOT_IMAGE=/boot/linux-4.18.8 root=LABEL=ROOT ro rootfstype=ext4 net.ifnames=0 pci=nomsi
>
> [ 1.772849] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [ 1.772852] r8169 0000:07:00.0: can't disable ASPM; OS doesn't have ASPM control
> [ 1.784948] r8169 0000:07:00.0 eth4: RTL8168h/8111h, 50:9a:4c:2e:92:be, XID 54100800, IRQ 16
> [ 1.784949] r8169 0000:07:00.0 eth4: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>
> We saw the interface unresponsive twice during the last 3 days with:
>
> [Mon Sep 24 11:35:56 2018] ------------[ cut here ]------------
> [Mon Sep 24 11:35:56 2018] NETDEV WATCHDOG: wan (r8169): transmit queue 0 timed out
> [Mon Sep 24 11:35:56 2018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x215/0x220
> [Mon Sep 24 11:35:56 2018] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.8 #70
> [Mon Sep 24 11:35:56 2018] Hardware name: Dell Inc. OptiPlex 3050/0W0CHX, BIOS 1.6.5 09/09/2017
> [Mon Sep 24 11:35:56 2018] RIP: 0010:dev_watchdog+0x215/0x220
> [Mon Sep 24 11:35:56 2018] Code: 49 63 4c 24 e8 eb 8c 4c 89 ef c6 05 1a 19 ca 00 01 e8 5f 52 fd ff 89 d9 4c 89 ee 48 c7 c7 78 ab 67 89 48 89 c2 e8 1b 2b 49 ff <0f> 0b eb be 0f 1f 80 00 00 00 00 41 57 45 89 cf 41 56 49 89 d6 41
> [Mon Sep 24 11:35:56 2018] RSP: 0018:ffff96f05dd03e98 EFLAGS: 00010282
> [Mon Sep 24 11:35:56 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
> [Mon Sep 24 11:35:56 2018] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff96f05dd15350
> [Mon Sep 24 11:35:56 2018] RBP: ffff96f0462ee41c R08: 0000000000000001 R09: 000000000000133d
> [Mon Sep 24 11:35:56 2018] R10: 0000000000000202 R11: 0000000000000000 R12: ffff96f0462ee438
> [Mon Sep 24 11:35:56 2018] R13: ffff96f0462ee000 R14: 0000000000000001 R15: ffff96f0455eaa80
> [Mon Sep 24 11:35:56 2018] FS: 0000000000000000(0000) GS:ffff96f05dd00000(0000) knlGS:0000000000000000
> [Mon Sep 24 11:35:56 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Mon Sep 24 11:35:56 2018] CR2: 000055c9498766e0 CR3: 00000000bb80a006 CR4: 00000000003606e0
> [Mon Sep 24 11:35:56 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [Mon Sep 24 11:35:56 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [Mon Sep 24 11:35:56 2018] Call Trace:
> [Mon Sep 24 11:35:56 2018] <IRQ>
> [Mon Sep 24 11:35:56 2018] ? pfifo_fast_reset+0x130/0x130
> [Mon Sep 24 11:35:56 2018] ? pfifo_fast_reset+0x130/0x130
> [Mon Sep 24 11:35:56 2018] call_timer_fn+0x11/0x70
> [Mon Sep 24 11:35:56 2018] expire_timers+0x8e/0xa0
> [Mon Sep 24 11:35:56 2018] run_timer_softirq+0xb9/0x160
> [Mon Sep 24 11:35:56 2018] ? __hrtimer_run_queues+0x135/0x1a0
> [Mon Sep 24 11:35:56 2018] ? hw_breakpoint_pmu_read+0x10/0x10
> [Mon Sep 24 11:35:56 2018] ? ktime_get+0x39/0x90
> [Mon Sep 24 11:35:56 2018] ? lapic_next_event+0x20/0x20
> [Mon Sep 24 11:35:56 2018] __do_softirq+0xcb/0x1f8
> [Mon Sep 24 11:35:56 2018] irq_exit+0xa9/0xb0
> [Mon Sep 24 11:35:56 2018] smp_apic_timer_interrupt+0x59/0x90
> [Mon Sep 24 11:35:56 2018] apic_timer_interrupt+0xf/0x20
> [Mon Sep 24 11:35:56 2018] </IRQ>
> [Mon Sep 24 11:35:56 2018] RIP: 0010:cpuidle_enter_state+0x129/0x200
> [Mon Sep 24 11:35:56 2018] Code: 45 00 89 c3 e8 d8 3b 55 ff 65 8b 3d b1 eb 45 77 e8 8c 3a 55 ff 31 ff 49 89 c4 e8 72 43 55 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> [Mon Sep 24 11:35:56 2018] RSP: 0018:ffff9a93c06e7ea8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
> [Mon Sep 24 11:35:56 2018] RAX: ffff96f05dd1f800 RBX: 0000000000000003 RCX: 000000000000001f
> [Mon Sep 24 11:35:56 2018] RDX: 20c49ba5e353f7cf RSI: 00000000258f0602 RDI: 0000000000000000
> [Mon Sep 24 11:35:56 2018] RBP: ffff96f05dd25ee0 R08: 00000000000002b4 R09: 00000000ffffffff
> [Mon Sep 24 11:35:56 2018] R10: ffff9a93c06e7e90 R11: 0000000000000142 R12: 00012ec849a182b9
> [Mon Sep 24 11:35:56 2018] R13: 00012ec8499ddf88 R14: 0000000000000003 R15: 0000000000000000
> [Mon Sep 24 11:35:56 2018] ? cpuidle_enter_state+0x11e/0x200
> [Mon Sep 24 11:35:56 2018] do_idle+0x1c0/0x200
> [Mon Sep 24 11:35:56 2018] cpu_startup_entry+0x6a/0x70
> [Mon Sep 24 11:35:56 2018] start_secondary+0x18a/0x1c0
> [Mon Sep 24 11:35:56 2018] secondary_startup_64+0xa5/0xb0
> [Mon Sep 24 11:35:56 2018] ---[ end trace 327bd9c035abe307 ]---
>
> This is the built-in ethernet port on a Dell main board:
> 07:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
> Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [1028:07a3]
> Flags: bus master, fast devsel, latency 0, IRQ 16
> I/O ports at e000 [size=256]
> Memory at f7404000 (64-bit, non-prefetchable) [size=4K]
> Memory at f7400000 (64-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
> Capabilities: [170] Latency Tolerance Reporting
> Capabilities: [178] L1 PM Substates
> Kernel driver in use: r8169
>
> The box has an extra 4-way ethernet card that uses the same driver. We had to set pci=nomsi because the card frequently behaved erratic with msi on.
>
> Thanks,
>
> Ortwin
>
Thanks for the report. Here come a few inquiries:
You say the box has one on-board network port and four network ports on
an extension card, all five driven by r8169. The on-board chip is a
RTL8168h, what's the type of the chips on the extension card?
I'm asking because r8169 supports ~ 50 chip variants of the RTL8169/8
family.
Are the problems the same on all five ports?
Can you reproduce the problem (how)? Any specific network usage
triggering the problem?
The root cause of the problem not necessarily is in r8169, some other
change could have broken it too. Can you test using r8169 from 4.18
on top of 4.17?
When stating "behaves erratic" you refer to the network hangs
mentioned before? Or to some other issue?
A similar report is here:
https://bugzilla.kernel.org/show_bug.cgi?id=201109
There the problem seems to start with the upgrade from 4.18.4 to 4.18.5.
Can you try with 4.18.4 ?
The diff between 4.18.4 and 4.18.5 shows nothing related to r8169.
Rgds, Heiner
On 24.09.18 22:21, Heiner Kallweit wrote:
> Thanks for the report. Here come a few inquiries:
>
> You say the box has one on-board network port and four network ports on
> an extension card, all five driven by r8169. The on-board chip is a
> RTL8168h, what's the type of the chips on the extension card?
This is the card:
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet
Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 17
I/O ports at d000 [size=256]
Memory at f7300000 (64-bit, non-prefetchable) [size=4K]
Memory at f0300000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169
04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet
Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 18
I/O ports at c000 [size=256]
Memory at f7200000 (64-bit, non-prefetchable) [size=4K]
Memory at f0200000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169
05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet
Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 19
I/O ports at b000 [size=256]
Memory at f7100000 (64-bit, non-prefetchable) [size=4K]
Memory at f0100000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169
06:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet
Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 16
I/O ports at a000 [size=256]
Memory at f7000000 (64-bit, non-prefetchable) [size=4K]
Memory at f0000000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169
> I'm asking because r8169 supports ~ 50 chip variants of the RTL8169/8
> family.
> Are the problems the same on all five ports?
Yes, I see the WARN_ON also on the other ports. Here is one from the separate card:
[Mon Sep 24 15:47:21 2018] ------------[ cut here ]------------
[Mon Sep 24 15:47:21 2018] NETDEV WATCHDOG: lan (r8169): transmit queue 0 timed out
[Mon Sep 24 15:47:21 2018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461
dev_watchdog+0x215/0x220
[Mon Sep 24 15:47:21 2018] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.8 #70
[Mon Sep 24 15:47:21 2018] Hardware name: Dell Inc. OptiPlex 3050/0W0CHX, BIOS 1.6.5 09/09/2017
[Mon Sep 24 15:47:21 2018] RIP: 0010:dev_watchdog+0x215/0x220
[Mon Sep 24 15:47:21 2018] Code: 49 63 4c 24 e8 eb 8c 4c 89 ef c6 05 1a 19 ca 00 01 e8 5f 52 fd ff
89 d9 4c 89 ee 48 c7 c7 78 ab 67 af 48 89 c2 e8 1b 2b 49 ff <0f> 0b eb be 0f 1f 80 00 00 00 00 41 57
45 89 cf 41 56 49 89 d6 41
[Mon Sep 24 15:47:21 2018] RSP: 0018:ffff8a205dd03e98 EFLAGS: 00010282
[Mon Sep 24 15:47:21 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[Mon Sep 24 15:47:21 2018] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8a205dd15350
[Mon Sep 24 15:47:21 2018] RBP: ffff8a2045de041c R08: 0000000000000001 R09: 0000000000000763
[Mon Sep 24 15:47:21 2018] R10: 0000000000000082 R11: 0000000000000000 R12: ffff8a2045de0438
[Mon Sep 24 15:47:21 2018] R13: ffff8a2045de0000 R14: 0000000000000001 R15: ffff8a20455e1480
[Mon Sep 24 15:47:21 2018] FS: 0000000000000000(0000) GS:ffff8a205dd00000(0000) knlGS:0000000000000000
[Mon Sep 24 15:47:21 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Sep 24 15:47:21 2018] CR2: 00007ff598642c38 CR3: 000000001f80a004 CR4: 00000000003606e0
[Mon Sep 24 15:47:21 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Mon Sep 24 15:47:21 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Mon Sep 24 15:47:21 2018] Call Trace:
[Mon Sep 24 15:47:21 2018] <IRQ>
[Mon Sep 24 15:47:21 2018] ? pfifo_fast_reset+0x130/0x130
[Mon Sep 24 15:47:21 2018] ? pfifo_fast_reset+0x130/0x130
[Mon Sep 24 15:47:21 2018] call_timer_fn+0x11/0x70
[Mon Sep 24 15:47:21 2018] expire_timers+0x8e/0xa0
[Mon Sep 24 15:47:21 2018] run_timer_softirq+0xb9/0x160
[Mon Sep 24 15:47:21 2018] ? __hrtimer_run_queues+0x135/0x1a0
[Mon Sep 24 15:47:21 2018] ? hw_breakpoint_pmu_read+0x10/0x10
[Mon Sep 24 15:47:21 2018] ? ktime_get+0x39/0x90
[Mon Sep 24 15:47:21 2018] ? lapic_next_event+0x20/0x20
[Mon Sep 24 15:47:21 2018] __do_softirq+0xcb/0x1f8
[Mon Sep 24 15:47:21 2018] irq_exit+0xa9/0xb0
[Mon Sep 24 15:47:21 2018] smp_apic_timer_interrupt+0x59/0x90
[Mon Sep 24 15:47:21 2018] apic_timer_interrupt+0xf/0x20
[Mon Sep 24 15:47:21 2018] </IRQ>
[Mon Sep 24 15:47:21 2018] RIP: 0010:cpuidle_enter_state+0x129/0x200
[Mon Sep 24 15:47:21 2018] Code: 45 00 89 c3 e8 d8 3b 55 ff 65 8b 3d b1 eb 45 51 e8 8c 3a 55 ff 31
ff 49 89 c4 e8 72 43 55 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1 4c 29 e9 48 89 c8 48 c1 f9 3f
48 f7 ea b8 ff ff ff 7f 48
[Mon Sep 24 15:47:21 2018] RSP: 0018:ffff9f46806e7ea8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[Mon Sep 24 15:47:21 2018] RAX: ffff8a205dd1f800 RBX: 0000000000000004 RCX: 000000000000001f
[Mon Sep 24 15:47:21 2018] RDX: 20c49ba5e353f7cf RSI: 00000000258f0602 RDI: 0000000000000000
[Mon Sep 24 15:47:21 2018] RBP: ffff8a205dd25ee0 R08: 00000000000003ee R09: 00000000ffffffff
[Mon Sep 24 15:47:21 2018] R10: ffff9f46806e7e90 R11: 00000000000003d5 R12: 00000752028ec2d0
[Mon Sep 24 15:47:21 2018] R13: 000007520282ea18 R14: 0000000000000004 R15: 0000000000000000
[Mon Sep 24 15:47:21 2018] ? cpuidle_enter_state+0x11e/0x200
[Mon Sep 24 15:47:21 2018] do_idle+0x1c0/0x200
[Mon Sep 24 15:47:21 2018] cpu_startup_entry+0x6a/0x70
[Mon Sep 24 15:47:21 2018] start_secondary+0x18a/0x1c0
[Mon Sep 24 15:47:21 2018] secondary_startup_64+0xa5/0xb0
[Mon Sep 24 15:47:21 2018] ---[ end trace 7ea59db719c41930 ]---
[Mon Sep 24 15:47:21 2018] r8169 0000:06:00.0 lan: link up
> Can you reproduce the problem (how)? Any specific network usage
> triggering the problem?
Actually the WARN_ON seems not to be causing the hang but to be rather a side effect of an
"unstable" port. The WARN_ON occurs every few days but the port works fine most of the time. But
when the port hangs (no more traffic) only a reboot fixes it.
These looks suspicious:
[ 8048.829832] r8169 0000:06:00.0 lan: link up
[64114.535455] r8169 0000:07:00.0 wan: link up
[74726.396319] r8169 0000:07:00.0 wan: link up
The ports seem to go up at random times (without "link down" message). This is also what we saw with
MSI enabled: links thrashing in down/up cycles.
> The root cause of the problem not necessarily is in r8169, some other
> change could have broken it too. Can you test using r8169 from 4.18
> on top of 4.17?
Unfortunately I am not able to do that.
> When stating "behaves erratic" you refer to the network hangs
> mentioned before? Or to some other issue?
please see above: links thrashing in down/up cycles.
I see that MSI/MSI-X has been disabled on quite a few chip versions (e.g. lately commit 72a579b8).
Does pci=nomsi disable MSI-X as well, or just MSI?
I could try adding my chip versions to these quirk sections. I couldn't find a way to disable
MSI/MSI-X per device.
> A similar report is here:
> https://bugzilla.kernel.org/show_bug.cgi?id=201109
Also that report has the "link up" message after the WARN_ON.
> There the problem seems to start with the upgrade from 4.18.4 to 4.18.5.
> Can you try with 4.18.4 ?
>
> The diff between 4.18.4 and 4.18.5 shows nothing related to r8169.
Exactly. I think the incident is just so rare that correlating it with a specific upgrade is hard.
Thanks,
Ortwin
On 25.09.2018 11:53, Ortwin Glück wrote:
> On 24.09.18 22:21, Heiner Kallweit wrote:
>> Thanks for the report. Here come a few inquiries:
>>
>> You say the box has one on-board network port and four network ports on
>> an extension card, all five driven by r8169. The on-board chip is a
>> RTL8168h, what's the type of the chips on the extension card?
>
> This is the card:
> 03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
> Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
> Flags: bus master, fast devsel, latency 0, IRQ 17
> I/O ports at d000 [size=256]
> Memory at f7300000 (64-bit, non-prefetchable) [size=4K]
> Memory at f0300000 (64-bit, prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
> Capabilities: [d0] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
> Capabilities: [170] Latency Tolerance Reporting
> Kernel driver in use: r8169
>
> 04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
> Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
> Flags: bus master, fast devsel, latency 0, IRQ 18
> I/O ports at c000 [size=256]
> Memory at f7200000 (64-bit, non-prefetchable) [size=4K]
> Memory at f0200000 (64-bit, prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
> Capabilities: [d0] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
> Capabilities: [170] Latency Tolerance Reporting
> Kernel driver in use: r8169
>
> 05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
> Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
> Flags: bus master, fast devsel, latency 0, IRQ 19
> I/O ports at b000 [size=256]
> Memory at f7100000 (64-bit, non-prefetchable) [size=4K]
> Memory at f0100000 (64-bit, prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
> Capabilities: [d0] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
> Capabilities: [170] Latency Tolerance Reporting
> Kernel driver in use: r8169
>
> 06:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
> Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
> Flags: bus master, fast devsel, latency 0, IRQ 16
> I/O ports at a000 [size=256]
> Memory at f7000000 (64-bit, non-prefetchable) [size=4K]
> Memory at f0000000 (64-bit, prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
> Capabilities: [d0] Vital Product Data
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Virtual Channel
> Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
> Capabilities: [170] Latency Tolerance Reporting
> Kernel driver in use: r8169
>
Unfortunately the lspci info doesn't allow to determine the exact chip type.
Relevant would be the syslog info as you provided it in your first mail:
r8169 0000:07:00.0 eth4: RTL8168h/8111h, 50:9a:4c:2e:92:be, XID 54100800, IRQ 16
Are the chips on the extension card the same as the on-board chip?
>
>
>> I'm asking because r8169 supports ~ 50 chip variants of the RTL8169/8
>> family.
>> Are the problems the same on all five ports?
>
> Yes, I see the WARN_ON also on the other ports. Here is one from the separate card:
> [Mon Sep 24 15:47:21 2018] ------------[ cut here ]------------
> [Mon Sep 24 15:47:21 2018] NETDEV WATCHDOG: lan (r8169): transmit queue 0 timed out
> [Mon Sep 24 15:47:21 2018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x215/0x220
> [Mon Sep 24 15:47:21 2018] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.8 #70
> [Mon Sep 24 15:47:21 2018] Hardware name: Dell Inc. OptiPlex 3050/0W0CHX, BIOS 1.6.5 09/09/2017
> [Mon Sep 24 15:47:21 2018] RIP: 0010:dev_watchdog+0x215/0x220
> [Mon Sep 24 15:47:21 2018] Code: 49 63 4c 24 e8 eb 8c 4c 89 ef c6 05 1a 19 ca 00 01 e8 5f 52 fd ff 89 d9 4c 89 ee 48 c7 c7 78 ab 67 af 48 89 c2 e8 1b 2b 49 ff <0f> 0b eb be 0f 1f 80 00 00 00 00 41 57 45 89 cf 41 56 49 89 d6 41
> [Mon Sep 24 15:47:21 2018] RSP: 0018:ffff8a205dd03e98 EFLAGS: 00010282
> [Mon Sep 24 15:47:21 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
> [Mon Sep 24 15:47:21 2018] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8a205dd15350
> [Mon Sep 24 15:47:21 2018] RBP: ffff8a2045de041c R08: 0000000000000001 R09: 0000000000000763
> [Mon Sep 24 15:47:21 2018] R10: 0000000000000082 R11: 0000000000000000 R12: ffff8a2045de0438
> [Mon Sep 24 15:47:21 2018] R13: ffff8a2045de0000 R14: 0000000000000001 R15: ffff8a20455e1480
> [Mon Sep 24 15:47:21 2018] FS: 0000000000000000(0000) GS:ffff8a205dd00000(0000) knlGS:0000000000000000
> [Mon Sep 24 15:47:21 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Mon Sep 24 15:47:21 2018] CR2: 00007ff598642c38 CR3: 000000001f80a004 CR4: 00000000003606e0
> [Mon Sep 24 15:47:21 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [Mon Sep 24 15:47:21 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [Mon Sep 24 15:47:21 2018] Call Trace:
> [Mon Sep 24 15:47:21 2018] <IRQ>
> [Mon Sep 24 15:47:21 2018] ? pfifo_fast_reset+0x130/0x130
> [Mon Sep 24 15:47:21 2018] ? pfifo_fast_reset+0x130/0x130
> [Mon Sep 24 15:47:21 2018] call_timer_fn+0x11/0x70
> [Mon Sep 24 15:47:21 2018] expire_timers+0x8e/0xa0
> [Mon Sep 24 15:47:21 2018] run_timer_softirq+0xb9/0x160
> [Mon Sep 24 15:47:21 2018] ? __hrtimer_run_queues+0x135/0x1a0
> [Mon Sep 24 15:47:21 2018] ? hw_breakpoint_pmu_read+0x10/0x10
> [Mon Sep 24 15:47:21 2018] ? ktime_get+0x39/0x90
> [Mon Sep 24 15:47:21 2018] ? lapic_next_event+0x20/0x20
> [Mon Sep 24 15:47:21 2018] __do_softirq+0xcb/0x1f8
> [Mon Sep 24 15:47:21 2018] irq_exit+0xa9/0xb0
> [Mon Sep 24 15:47:21 2018] smp_apic_timer_interrupt+0x59/0x90
> [Mon Sep 24 15:47:21 2018] apic_timer_interrupt+0xf/0x20
> [Mon Sep 24 15:47:21 2018] </IRQ>
> [Mon Sep 24 15:47:21 2018] RIP: 0010:cpuidle_enter_state+0x129/0x200
> [Mon Sep 24 15:47:21 2018] Code: 45 00 89 c3 e8 d8 3b 55 ff 65 8b 3d b1 eb 45 51 e8 8c 3a 55 ff 31 ff 49 89 c4 e8 72 43 55 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> [Mon Sep 24 15:47:21 2018] RSP: 0018:ffff9f46806e7ea8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
> [Mon Sep 24 15:47:21 2018] RAX: ffff8a205dd1f800 RBX: 0000000000000004 RCX: 000000000000001f
> [Mon Sep 24 15:47:21 2018] RDX: 20c49ba5e353f7cf RSI: 00000000258f0602 RDI: 0000000000000000
> [Mon Sep 24 15:47:21 2018] RBP: ffff8a205dd25ee0 R08: 00000000000003ee R09: 00000000ffffffff
> [Mon Sep 24 15:47:21 2018] R10: ffff9f46806e7e90 R11: 00000000000003d5 R12: 00000752028ec2d0
> [Mon Sep 24 15:47:21 2018] R13: 000007520282ea18 R14: 0000000000000004 R15: 0000000000000000
> [Mon Sep 24 15:47:21 2018] ? cpuidle_enter_state+0x11e/0x200
> [Mon Sep 24 15:47:21 2018] do_idle+0x1c0/0x200
> [Mon Sep 24 15:47:21 2018] cpu_startup_entry+0x6a/0x70
> [Mon Sep 24 15:47:21 2018] start_secondary+0x18a/0x1c0
> [Mon Sep 24 15:47:21 2018] secondary_startup_64+0xa5/0xb0
> [Mon Sep 24 15:47:21 2018] ---[ end trace 7ea59db719c41930 ]---
> [Mon Sep 24 15:47:21 2018] r8169 0000:06:00.0 lan: link up
>
>
>> Can you reproduce the problem (how)? Any specific network usage
>> triggering the problem?
>
> Actually the WARN_ON seems not to be causing the hang but to be rather a side effect of an "unstable" port. The WARN_ON occurs every few days but the port works fine most of the time. But when the port hangs (no more traffic) only a reboot fixes it.
>
> These looks suspicious:
>
> [ 8048.829832] r8169 0000:06:00.0 lan: link up
> [64114.535455] r8169 0000:07:00.0 wan: link up
> [74726.396319] r8169 0000:07:00.0 wan: link up
>
> The ports seem to go up at random times (without "link down" message). This is also what we saw with MSI enabled: links thrashing in down/up cycles.
>
>> The root cause of the problem not necessarily is in r8169, some other
>> change could have broken it too. Can you test using r8169 from 4.18
>> on top of 4.17?
>
> Unfortunately I am not able to do that.
>
>> When stating "behaves erratic" you refer to the network hangs
>> mentioned before? Or to some other issue?
>
> please see above: links thrashing in down/up cycles.
> I see that MSI/MSI-X has been disabled on quite a few chip versions (e.g. lately commit 72a579b8).
> Does pci=nomsi disable MSI-X as well, or just MSI?
Kernel treats MSI-X as a sub-feature of MSI, therefore nomsi disables MSI-X too.
> I could try adding my chip versions to these quirk sections. I couldn't find a way to disable MSI/MSI-X per device.
>
It turned out that the MSI-X problems are caused by a nasty small bug in certain PCI bridges.
As soon as the fix for it appears in mainline we'll revert the workarounds in r8169.
And the MSI-X issue resulted in non-functional network after resume from suspend, therefore
I think your problem isn't related.
>> A similar report is here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=201109
>
> Also that report has the "link up" message after the WARN_ON.
>
>> There the problem seems to start with the upgrade from 4.18.4 to 4.18.5.
>> Can you try with 4.18.4 ?
>>
>> The diff between 4.18.4 and 4.18.5 shows nothing related to r8169.
>
> Exactly. I think the incident is just so rare that correlating it with a specific upgrade is hard.
>
As I said, you could try running 4.18.4 instead of 4.18.8 for a while (if you have this option).
You faced the hang roughly once per day, so after a few days you should be sure.
> Thanks,
>
> Ortwin
>
Rgds, Heiner
Am Di., 25. Sep. 2018 um 20:44 Uhr schrieb Heiner Kallweit
<[email protected]>:
> As I said, you could try running 4.18.4 instead of 4.18.8 for a while (if you have this option).
> You faced the hang roughly once per day, so after a few days you should be sure.
I don't think this will fix anything.
See: https://lkml.org/lkml/2018/9/4/174
Probably all broken by 4fd48c4ac0a0 ("r8169: move common
initializations to tp->hw_start").
BR
On 25.09.18 20:44, Heiner Kallweit wrote:
> Unfortunately the lspci info doesn't allow to determine the exact chip type.
> Relevant would be the syslog info as you provided it in your first mail:
> r8169 0000:07:00.0 eth4: RTL8168h/8111h, 50:9a:4c:2e:92:be, XID 54100800, IRQ 16
> Are the chips on the extension card the same as the on-board chip?
They are different: G and H versions.
[ 1.696997] r8169 0000:03:00.0 eth0: RTL8168g/8111g at 0x (ptrval), 00:0a:cd:31:60:1f, XID
0c000800 IRQ 17
[ 1.712910] r8169 0000:04:00.0 eth1: RTL8168g/8111g at 0x (ptrval), 00:0a:cd:31:60:20, XID
0c000800 IRQ 18
[ 1.724969] r8169 0000:05:00.0 eth2: RTL8168g/8111g at 0x (ptrval), 00:0a:cd:31:60:21, XID
0c000800 IRQ 19
[ 1.736923] r8169 0000:06:00.0 eth3: RTL8168g/8111g at 0x (ptrval), 00:0a:cd:31:60:22, XID
0c000800 IRQ 16
[ 1.753031] r8169 0000:07:00.0 eth4: RTL8168h/8111h at 0x (ptrval), 50:9a:4c:2e:92:be, XID
14100800 IRQ 16
I am going to try your patch against 4.18.10 and report back!
Thanks,
Ortwin