2015-08-12 19:30:03

by Sander Eikelenboom

Subject: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

Hi,

On my box running Xen with a 4.2-rc6 kernel I still get this splat in
dom0, which crashes the box.
(I reported a similar splat before, at rc4, here:
http://www.spinics.net/lists/netdev/msg337570.html)

I have never seen this one on 4.1, so it seems to be a regression.

--
Sander


[81133.193439] general protection fault: 0000 [#1] SMP
[81133.204284] Modules linked in:
[81133.214934] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.2.0-rc6-20150811-linus-doflr+ #1
[81133.225632] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010
[81133.236237] task: ffff880059b91580 ti: ffff880059bb4000 task.ti: ffff880059bb4000
[81133.246808] RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80
[81133.257354] RSP: e02b:ffff880059bb7848 EFLAGS: 00010086
[81133.267749] RAX: ffff88004eddc7f0 RBX: ffff88000e20ae08 RCX: dead000000200200
[81133.278201] RDX: 0000000000000000 RSI: ffff88005f60e600 RDI: ffff88000e20ae08
[81133.288723] RBP: ffff880059bb7848 R08: 0000000000000001 R09: 0000000000000001
[81133.298930] R10: 0000000000000003 R11: ffff88000e20ad68 R12: 0000000000000000
[81133.308875] R13: 0000000101735569 R14: 0000000000015f90 R15: ffff88005f60e600
[81133.318845] FS: 00007f28c6f7c800(0000) GS:ffff88005f600000(0000) knlGS:0000000000000000
[81133.328864] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[81133.338693] CR2: ffff8000007f6800 CR3: 000000003d55c000 CR4: 0000000000000660
[81133.348462] Stack:
[81133.358005] ffff880059bb7898 ffffffff8110fe3f ffffffff810fc261 0000000000000200
[81133.367682] 0000000000000003 ffff88000e20ad68 0000000000000000 ffff88005854d400
[81133.377064] 0000000000015f90 0000000000000000 ffff880059bb78c8 ffffffff819b5243
[81133.386374] Call Trace:
[81133.395596] [<ffffffff8110fe3f>] mod_timer_pending+0x3f/0xe0
[81133.404999] [<ffffffff810fc261>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[81133.414255] [<ffffffff819b5243>] __nf_ct_refresh_acct+0xa3/0xb0
[81133.423137] [<ffffffff819bbe8b>] tcp_packet+0xb3b/0x1290
[81133.431894] [<ffffffff810cb8ca>] ? __local_bh_enable_ip+0x2a/0x90
[81133.440622] [<ffffffff819b4939>] ? __nf_conntrack_find_get+0x129/0x2a0
[81133.449339] [<ffffffff819b682c>] nf_conntrack_in+0x29c/0x7c0
[81133.457940] [<ffffffff81a67181>] ipv4_conntrack_in+0x21/0x30
[81133.466296] [<ffffffff819aea1c>] nf_iterate+0x4c/0x80
[81133.474401] [<ffffffff819aeab4>] nf_hook_slow+0x64/0xc0
[81133.482615] [<ffffffff81a211ec>] ip_rcv+0x2ec/0x380
[81133.490781] [<ffffffff81a209f0>] ? ip_local_deliver_finish+0x130/0x130
[81133.498790] [<ffffffff8197e140>] __netif_receive_skb_core+0x2a0/0x970
[81133.506714] [<ffffffff81a56db8>] ? inet_gro_receive+0x1c8/0x200
[81133.514609] [<ffffffff81980705>] __netif_receive_skb+0x15/0x70
[81133.522333] [<ffffffff8198077e>] netif_receive_skb_internal+0x1e/0x80
[81133.529840] [<ffffffff81980f3b>] napi_gro_receive+0x6b/0x90
[81133.537173] [<ffffffff81740fb6>] rtl8169_poll+0x2e6/0x600
[81133.544444] [<ffffffff810fc261>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[81133.551566] [<ffffffff81981ad7>] net_rx_action+0x1f7/0x300
[81133.558412] [<ffffffff810cb6c3>] __do_softirq+0x103/0x210
[81133.565353] [<ffffffff810cb807>] run_ksoftirqd+0x37/0x60
[81133.572359] [<ffffffff810e4de0>] smpboot_thread_fn+0x130/0x190
[81133.579215] [<ffffffff810e4cb0>] ? sort_range+0x20/0x20
[81133.586042] [<ffffffff810e1fae>] kthread+0xee/0x110
[81133.592792] [<ffffffff810e1ec0>] ? kthread_create_on_node+0x1b0/0x1b0
[81133.599694] [<ffffffff81af92df>] ret_from_fork+0x3f/0x70
[81133.606662] [<ffffffff810e1ec0>] ? kthread_create_on_node+0x1b0/0x1b0
[81133.613445] Code: 77 28 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 08 55 48 89 e5 48 85 c0 74 6a 48 8b 0f 48 85 c9 48 89 08 74 04 <48> 89 41 08 84 d2 74 08 48 c7 47 08 00 00 00 00 f6 47 2a 10 48
[81133.627196] RIP [<ffffffff8110fb18>] detach_if_pending+0x18/0x80
[81133.634036] RSP <ffff880059bb7848>
[81133.640817] ---[ end trace eaf596e1fcf6a591 ]---
[81133.647521] Kernel panic - not syncing: Fatal exception in interrupt
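
A note on reading the register dump above: the value in RCX, dead000000200200, looks like a list-poison pointer rather than a real address. The sketch below is one way to check that. It assumes the x86_64 poison base 0xdead000000000000 and the 4.2-era LIST_POISON1/LIST_POISON2 offsets from include/linux/poison.h; those constants, and the helper names parse_registers, is_canonical and describe, are illustrative assumptions and do not come from this thread.

```python
import re

# Assumed constants (not stated in the thread): x86_64 poison base and the
# list-poison offsets used by 4.2-era kernels in include/linux/poison.h.
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = 0x00100100 + POISON_POINTER_DELTA
LIST_POISON2 = 0x00200200 + POISON_POINTER_DELTA

# Matches general-purpose registers such as "RCX: dead000000200200".
# \s+ tolerates a value that an email client wrapped onto the next line.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s+([0-9a-f]{16})\b")

# A few lines copied from the oops in this message.
SAMPLE = """\
[81133.267749] RAX: ffff88004eddc7f0 RBX: ffff88000e20ae08 RCX: dead000000200200
[81133.278201] RDX: 0000000000000000 RSI: ffff88005f60e600 RDI: ffff88000e20ae08
"""


def parse_registers(oops_text):
    """Return {register name: integer value} for every register found."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


def is_canonical(addr):
    """True if bits 63..47 are all equal (48-bit x86_64 virtual addressing)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def describe(addr):
    """Label an address as a known poison value and as canonical or not."""
    if addr == LIST_POISON1:
        label = "LIST_POISON1"
    elif addr == LIST_POISON2:
        label = "LIST_POISON2"
    else:
        label = "no known poison value"
    kind = "canonical" if is_canonical(addr) else "non-canonical"
    return f"{label}, {kind}"


if __name__ == "__main__":
    for name, value in sorted(parse_registers(SAMPLE).items()):
        print(f"{name}: {value:016x}  ({describe(value)})")
```

Under those assumptions RCX matches LIST_POISON2 and is a non-canonical address. The faulting bytes `<48> 89 41 08` decode to a store through RCX+8, and a store to a non-canonical address raises a general protection fault on x86_64. This would be consistent with the timer's list entry having already been unlinked once, but the thread itself does not confirm that reading.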


2015-08-12 20:41:39

by Eric Dumazet

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On Wed, 2015-08-12 at 21:19 +0200, [email protected] wrote:
> Hi,
>
> On my box running Xen with a 4.2-rc6 kernel i still get this splat in
> dom0,
> which crashes the box.
> (i reported a similar splat before (at rc4) here,
> http://www.spinics.net/lists/netdev/msg337570.html)
>
> Never seen this one on 4.1, so it seems a regression.
>
> --
> Sander
>
>

This looks like the bug fixed in David Miller's net tree:

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=2235f2ac75fd2501c251b0b699a9632e80239a6d



2015-08-12 20:55:58

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 2015-08-12 22:41, Eric Dumazet wrote:
> On Wed, 2015-08-12 at 21:19 +0200, [email protected] wrote:
>> Hi,
>>
>> On my box running Xen with a 4.2-rc6 kernel i still get this splat in
>> dom0,
>> which crashes the box.
>> (i reported a similar splat before (at rc4) here,
>> http://www.spinics.net/lists/netdev/msg337570.html)
>>
>> Never seen this one on 4.1, so it seems a regression.
>>
>> --
>> Sander
>>
>>
>
> This looks like the bug fixed in David Miller net tree :
>
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=2235f2ac75fd2501c251b0b699a9632e80239a6d

Will pull the net-tree in and re-test.
But since it only seems to crash after a day or two, that will take some
time.

Thanks,

Sander

2015-08-12 21:40:43

by David Miller

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

From: [email protected]
Date: Wed, 12 Aug 2015 22:50:42 +0200

> On 2015-08-12 22:41, Eric Dumazet wrote:
>> On Wed, 2015-08-12 at 21:19 +0200, [email protected] wrote:
>>> Hi,
>>> On my box running Xen with a 4.2-rc6 kernel i still get this splat in
>>> dom0,
>>> which crashes the box.
>>> (i reported a similar splat before (at rc4) here,
>>> http://www.spinics.net/lists/netdev/msg337570.html)
>>> Never seen this one on 4.1, so it seems a regression.
>>> --
>>> Sander
>> This looks like the bug fixed in David Miller net tree :
>> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=2235f2ac75fd2501c251b0b699a9632e80239a6d
>
> Will pull the net-tree in and re-test.

You should not pull the 'net-next', but rather the 'net' one.

'net' is not necessarily included in 'net-next'.

2015-08-12 21:52:06

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 2015-08-12 23:40, David Miller wrote:
> From: [email protected]
> Date: Wed, 12 Aug 2015 22:50:42 +0200
>
>> On 2015-08-12 22:41, Eric Dumazet wrote:
>>> On Wed, 2015-08-12 at 21:19 +0200, [email protected] wrote:
>>>> Hi,
>>>> On my box running Xen with a 4.2-rc6 kernel i still get this splat
>>>> in
>>>> dom0,
>>>> which crashes the box.
>>>> (i reported a similar splat before (at rc4) here,
>>>> http://www.spinics.net/lists/netdev/msg337570.html)
>>>> Never seen this one on 4.1, so it seems a regression.
>>>> --
>>>> Sander
>>> This looks like the bug fixed in David Miller net tree :
>>> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=2235f2ac75fd2501c251b0b699a9632e80239a6d
>>
>> Will pull the net-tree in and re-test.
>
> You should not pull the 'net-next', but rather the 'net' one.
>
> 'net' is not necessarily included in 'net-next'.

Thanks for the reminder, but luckily I was already aware of that;
I have seen enough of your replies asking for patches to be resubmitted
against "the other tree" ;)
The kernel with the patch is currently running, so fingers crossed.

--
Sander

2015-08-12 22:41:23

by Eric Dumazet

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:

> Thanks for the reminder, but luckily i was aware of that,
> seen enough of your replies asking for patches to be resubmitted
> against "the other tree" ;)
> Kernel with patch is currently running so fingers crossed.

Thanks for testing. I am definitely interested in knowing your results.

2015-08-14 22:14:44

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 2015-08-13 00:41, Eric Dumazet wrote:
> On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
>
>> Thanks for the reminder, but luckily i was aware of that,
>> seen enough of your replies asking for patches to be resubmitted
>> against "the other tree" ;)
>> Kernel with patch is currently running so fingers crossed.
>
> Thanks for testing. I am definitely interested knowing your results.

Hmm, it now seems that commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
is breaking things (I still have to test whether a revert helps).
I get this in some guests:

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 506s! [swapper/0:0]
[ 6620.282805] Modules linked in:
[ 6620.282805] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150814-linus-doflr-apicrevert+ #1
[ 6620.282805] task: ffffffff8221a580 ti: ffffffff82200000 task.ti: ffffffff82200000
[ 6620.282805] RIP: e030:[<ffffffff8100122a>] [<ffffffff8100122a>] xen_hypercall_xen_version+0xa/0x20
[ 6620.282805] RSP: e02b:ffff88000fc03d48 EFLAGS: 00000246
[ 6620.282805] RAX: 0000000000040006 RBX: 0000000000000200 RCX: ffffffff8100122a
[ 6620.282805] RDX: 0000000000000001 RSI: 00000000deadbeef RDI: 00000000deadbeef
[ 6620.282805] RBP: ffff88000fc03d60 R08: ffff88000fc03ee0 R09: 00000000000000ee
[ 6620.282805] R10: ffffffff8220a0c0 R11: 0000000000000246 R12: 00000000ffffffff
[ 6620.282805] R13: 0000000000000001 R14: ffff880003b53054 R15: 0000000000000005
[ 6620.282805] FS: 00007fec747ad800(0000) GS:ffff88000fc00000(0000) knlGS:0000000000000000
[ 6620.282805] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6620.282805] CR2: 00007ffcb7a7a6d8 CR3: 0000000003164000 CR4: 0000000000000660
[ 6620.282805] Stack:
[ 6620.282805] 0000000000000068 0000000000000007 ffffffff81008dbd ffff88000fc03dd8
[ 6620.282805] ffffffff81009592 0000000000000068 ffffffff8220a0c0 00000000000000ee
[ 6620.282805] ffff88000fc03ee0 0000000000000200 0000000000000200 0000000000000001
[ 6620.282805] Call Trace:
[ 6620.282805] <IRQ>
[ 6620.282805] [<ffffffff81008dbd>] ? xen_force_evtchn_callback+0xd/0x10
[ 6620.282805] [<ffffffff81009592>] check_events+0x12/0x20
[ 6620.282805] [<ffffffff8100957f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 6620.282805] [<ffffffff81af79a5>] ? _raw_spin_unlock_irqrestore+0x25/0x30
[ 6620.282805] [<ffffffff8110ed43>] try_to_del_timer_sync+0x43/0x60
[ 6620.282805] [<ffffffff8110eda7>] del_timer_sync+0x47/0x60
[ 6620.282805] [<ffffffff81a2b698>] inet_csk_reqsk_queue_drop+0x118/0x1f0
[ 6620.282805] [<ffffffff81a2b8c6>] reqsk_timer_handler+0x156/0x260
[ 6620.282805] [<ffffffff81a2b770>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[ 6620.282805] [<ffffffff8110f3c7>] call_timer_fn.isra.27+0x17/0x80
[ 6620.282805] [<ffffffff81a2b770>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[ 6620.282805] [<ffffffff8110f55d>] run_timer_softirq+0x12d/0x200
[ 6620.282805] [<ffffffff810ca6c3>] __do_softirq+0x103/0x210
[ 6620.282805] [<ffffffff810ca9cb>] irq_exit+0x4b/0xa0
[ 6620.282805] [<ffffffff814f05d4>] xen_evtchn_do_upcall+0x34/0x50
[ 6620.282805] [<ffffffff81af932e>] xen_do_hypervisor_callback+0x1e/0x40
[ 6620.282805] <EOI>
[ 6620.282805] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 6620.282805] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 6620.282805] [<ffffffff81008d60>] ? xen_safe_halt+0x10/0x20
[ 6620.282805] [<ffffffff810188d3>] ? default_idle+0x13/0x20
[ 6620.282805] [<ffffffff81018e1a>] ? arch_cpu_idle+0xa/0x10
[ 6620.282805] [<ffffffff810f8e7e>] ? default_idle_call+0x2e/0x50
[ 6620.282805] [<ffffffff810f9112>] ? cpu_startup_entry+0x272/0x2e0
[ 6620.282805] [<ffffffff81ae7967>] ? rest_init+0x77/0x80
[ 6620.282805] [<ffffffff82312f58>] ? start_kernel+0x43b/0x448
[ 6620.282805] [<ffffffff823124ef>] ? x86_64_start_reservations+0x2a/0x2c
[ 6620.282805] [<ffffffff82316008>] ? xen_start_kernel+0x550/0x55c
[ 6620.282805] Code: cc 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc

2015-08-14 22:22:25

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 2015-08-15 00:09, Sander Eikelenboom wrote:
> On 2015-08-13 00:41, Eric Dumazet wrote:
>> On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
>>
>>> Thanks for the reminder, but luckily i was aware of that,
>>> seen enough of your replies asking for patches to be resubmitted
>>> against "the other tree" ;)
>>> Kernel with patch is currently running so fingers crossed.
>>
>> Thanks for testing. I am definitely interested knowing your results.
>
> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
> breaking things
> (have to test if a revert helps) i get this in some guests:

Should have done that before, because it wasn't in yet .. and likely to
fix the issue,
also pulled and compiling now.

--
Sander




2015-08-14 22:39:29

by Eric Dumazet

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
> On 2015-08-13 00:41, Eric Dumazet wrote:
> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
> >
> >> Thanks for the reminder, but luckily i was aware of that,
> >> seen enough of your replies asking for patches to be resubmitted
> >> against "the other tree" ;)
> >> Kernel with patch is currently running so fingers crossed.
> >
> > Thanks for testing. I am definitely interested knowing your results.
>
> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
> breaking things
> (have to test if a revert helps) i get this in some guests:


Yes, this was fixed by :
http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

2015-08-17 09:11:12

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80


Saturday, August 15, 2015, 12:39:25 AM, you wrote:

> On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
>> On 2015-08-13 00:41, Eric Dumazet wrote:
>> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
>> >
>> >> Thanks for the reminder, but luckily i was aware of that,
>> >> seen enough of your replies asking for patches to be resubmitted
>> >> against "the other tree" ;)
>> >> Kernel with patch is currently running so fingers crossed.
>> >
>> > Thanks for testing. I am definitely interested knowing your results.
>>
>> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
>> breaking things
>> (have to test if a revert helps) i get this in some guests:


> Yes, this was fixed by :
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af


Hi Eric,

With that patch i had a crash again this night, see below.

--
Sander

[177459.188808] general protection fault: 0000 [#1] SMP
[177459.199746] Modules linked in:
[177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150815-linus-doflr-net+ #1
[177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010
[177459.232247] task: ffffffff8221a580 ti: ffffffff82200000 task.ti: ffffffff82200000
[177459.242931] RIP: e030:[<ffffffff8110eb58>] [<ffffffff8110eb58>] detach_if_pending+0x18/0x80
[177459.253503] RSP: e02b:ffff88005f6039d8 EFLAGS: 00010086
[177459.264051] RAX: ffff8800584d6580 RBX: ffff880004901420 RCX: dead000000200200
[177459.274599] RDX: 0000000000000000 RSI: ffff88005f60e5c0 RDI: ffff880004901420
[177459.285122] RBP: ffff88005f6039d8 R08: 0000000000000001 R09: 0000000000000000
[177459.295286] R10: 0000000000000003 R11: ffff880004901394 R12: 0000000000000003
[177459.305388] R13: 000000010ae47040 R14: 0000000007b98a00 R15: ffff88005f60e5c0
[177459.315345] FS: 00007f51317ec700(0000) GS:ffff88005f600000(0000) knlGS:0000000000000000
[177459.325340] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[177459.335217] CR2: 00000000010f8000 CR3: 000000002a154000 CR4: 0000000000000660
[177459.345129] Stack:
[177459.354783] ffff88005f603a28 ffffffff8110ee7f ffffffff810fb261 0000000000000200
[177459.364505] 0000000000000003 ffff880004901380 0000000000000003 ffff8800567d0d00
[177459.374064] 0000000007b98a00 0000000000000000 ffff88005f603a58 ffffffff819b3eb3
[177459.383532] Call Trace:
[177459.392878] <IRQ>
[177459.392935] [<ffffffff8110ee7f>] mod_timer_pending+0x3f/0xe0
[177459.411058] [<ffffffff810fb261>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[177459.419876] [<ffffffff819b3eb3>] __nf_ct_refresh_acct+0xa3/0xb0
[177459.428642] [<ffffffff819baafb>] tcp_packet+0xb3b/0x1290
[177459.437285] [<ffffffff81a2535e>] ? ip_output+0x5e/0xc0
[177459.445845] [<ffffffff810ca8ca>] ? __local_bh_enable_ip+0x2a/0x90
[177459.454331] [<ffffffff819b35a9>] ? __nf_conntrack_find_get+0x129/0x2a0
[177459.462642] [<ffffffff819b549c>] nf_conntrack_in+0x29c/0x7c0
[177459.470711] [<ffffffff81a65e9c>] ipv4_conntrack_local+0x4c/0x50
[177459.478753] [<ffffffff819ad67c>] nf_iterate+0x4c/0x80
[177459.486726] [<ffffffff81102437>] ? generic_handle_irq+0x27/0x40
[177459.494634] [<ffffffff819ad714>] nf_hook_slow+0x64/0xc0
[177459.502486] [<ffffffff81a22d40>] __ip_local_out_sk+0x90/0xa0
[177459.510248] [<ffffffff81a22c40>] ? ip_forward_options+0x1a0/0x1a0
[177459.517782] [<ffffffff81a22d66>] ip_local_out_sk+0x16/0x40
[177459.525044] [<ffffffff81a2343d>] ip_queue_xmit+0x14d/0x350
[177459.532247] [<ffffffff81a3ae7e>] tcp_transmit_skb+0x48e/0x960
[177459.539413] [<ffffffff81a3cddb>] tcp_xmit_probe_skb+0xdb/0xf0
[177459.546389] [<ffffffff81a3dffb>] tcp_write_wakeup+0x5b/0x150
[177459.553061] [<ffffffff81a3e51b>] tcp_keepalive_timer+0x1fb/0x230
[177459.559761] [<ffffffff81a3e320>] ? tcp_init_xmit_timers+0x20/0x20
[177459.566447] [<ffffffff8110f3c7>] call_timer_fn.isra.27+0x17/0x80
[177459.573121] [<ffffffff81a3e320>] ? tcp_init_xmit_timers+0x20/0x20
[177459.579778] [<ffffffff8110f55d>] run_timer_softirq+0x12d/0x200
[177459.586448] [<ffffffff810ca6c3>] __do_softirq+0x103/0x210
[177459.593138] [<ffffffff810ca9cb>] irq_exit+0x4b/0xa0
[177459.599783] [<ffffffff814f05d4>] xen_evtchn_do_upcall+0x34/0x50
[177459.606300] [<ffffffff81af93ae>] xen_do_hypervisor_callback+0x1e/0x40
[177459.612583] <EOI>
[177459.612637] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[177459.625010] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[177459.631157] [<ffffffff81008d60>] ? xen_safe_halt+0x10/0x20
[177459.637158] [<ffffffff810188d3>] ? default_idle+0x13/0x20
[177459.643072] [<ffffffff81018e1a>] ? arch_cpu_idle+0xa/0x10
[177459.648809] [<ffffffff810f8e7e>] ? default_idle_call+0x2e/0x50
[177459.654650] [<ffffffff810f9112>] ? cpu_startup_entry+0x272/0x2e0
[177459.660488] [<ffffffff81ae79f7>] ? rest_init+0x77/0x80
[177459.666297] [<ffffffff82312f58>] ? start_kernel+0x43b/0x448
[177459.672092] [<ffffffff823124ef>] ? x86_64_start_reservations+0x2a/0x2c
[177459.677800] [<ffffffff82316008>] ? xen_start_kernel+0x550/0x55c
[177459.683451] Code: 77 28 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 08 55 48 89 e5 48 85 c0 74 6a 48 8b 0f 48 85 c9 48 89 08 74 04 <48> 89 41 08 84 d2 74 08 48 c7 47 08 00 00 00 00 f6 47 2a 10 48
[177459.695332] RIP [<ffffffff8110eb58>] detach_if_pending+0x18/0x80
[177459.701154] RSP <ffff88005f6039d8>
(XEN) [2015-08-17 00:11:51.426] Hardware Dom0 crashed: rebooting machine in 5 seconds.

2015-08-17 13:37:18

by Eric Dumazet

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On Mon, 2015-08-17 at 11:09 +0200, Sander Eikelenboom wrote:
> Saturday, August 15, 2015, 12:39:25 AM, you wrote:
>
> > On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
> >> On 2015-08-13 00:41, Eric Dumazet wrote:
> >> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
> >> >
> >> >> Thanks for the reminder, but luckily i was aware of that,
> >> >> seen enough of your replies asking for patches to be resubmitted
> >> >> against "the other tree" ;)
> >> >> Kernel with patch is currently running so fingers crossed.
> >> >
> >> > Thanks for testing. I am definitely interested knowing your results.
> >>
> >> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
> >> breaking things
> >> (have to test if a revert helps) i get this in some guests:
>
>
> > Yes, this was fixed by :
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>
>
> Hi Eric,
>
> With that patch i had a crash again this night, see below.
>
> --
> Sander
>
> [177459.188808] general protection fault: 0000 [#1] SMP
> [177459.199746] Modules linked in:
> [177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150815-linus-doflr-net+ #1
> [177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010
> [177459.232247] task: ffffffff8221a580 ti: ffffffff82200000 task.ti: ffffffff82200000
> [177459.242931] RIP: e030:[<ffffffff8110eb58>] [<ffffffff8110eb58>] detach_if_pending+0x18/0x80
> [177459.253503] RSP: e02b:ffff88005f6039d8 EFLAGS: 00010086
> [177459.264051] RAX: ffff8800584d6580 RBX: ffff880004901420 RCX: dead000000200200
> [177459.274599] RDX: 0000000000000000 RSI: ffff88005f60e5c0 RDI: ffff880004901420
> [177459.285122] RBP: ffff88005f6039d8 R08: 0000000000000001 R09: 0000000000000000
> [177459.295286] R10: 0000000000000003 R11: ffff880004901394 R12: 0000000000000003
> [177459.305388] R13: 000000010ae47040 R14: 0000000007b98a00 R15: ffff88005f60e5c0
> [177459.315345] FS: 00007f51317ec700(0000) GS:ffff88005f600000(0000) knlGS:0000000000000000
> [177459.325340] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [177459.335217] CR2: 00000000010f8000 CR3: 000000002a154000 CR4: 0000000000000660
> [177459.345129] Stack:
> [177459.354783] ffff88005f603a28 ffffffff8110ee7f ffffffff810fb261 0000000000000200
> [177459.364505] 0000000000000003 ffff880004901380 0000000000000003 ffff8800567d0d00
> [177459.374064] 0000000007b98a00 0000000000000000 ffff88005f603a58 ffffffff819b3eb3
> [177459.383532] Call Trace:
> [177459.392878] <IRQ>
> [177459.392935] [<ffffffff8110ee7f>] mod_timer_pending+0x3f/0xe0
> [177459.411058] [<ffffffff810fb261>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [177459.419876] [<ffffffff819b3eb3>] __nf_ct_refresh_acct+0xa3/0xb0
> [177459.428642] [<ffffffff819baafb>] tcp_packet+0xb3b/0x1290
> [177459.437285] [<ffffffff81a2535e>] ? ip_output+0x5e/0xc0
> [177459.445845] [<ffffffff810ca8ca>] ? __local_bh_enable_ip+0x2a/0x90
> [177459.454331] [<ffffffff819b35a9>] ? __nf_conntrack_find_get+0x129/0x2a0
> [177459.462642] [<ffffffff819b549c>] nf_conntrack_in+0x29c/0x7c0
> [177459.470711] [<ffffffff81a65e9c>] ipv4_conntrack_local+0x4c/0x50
> [177459.478753] [<ffffffff819ad67c>] nf_iterate+0x4c/0x80
> [177459.486726] [<ffffffff81102437>] ? generic_handle_irq+0x27/0x40
> [177459.494634] [<ffffffff819ad714>] nf_hook_slow+0x64/0xc0
> [177459.502486] [<ffffffff81a22d40>] __ip_local_out_sk+0x90/0xa0
> [177459.510248] [<ffffffff81a22c40>] ? ip_forward_options+0x1a0/0x1a0
> [177459.517782] [<ffffffff81a22d66>] ip_local_out_sk+0x16/0x40
> [177459.525044] [<ffffffff81a2343d>] ip_queue_xmit+0x14d/0x350
> [177459.532247] [<ffffffff81a3ae7e>] tcp_transmit_skb+0x48e/0x960
> [177459.539413] [<ffffffff81a3cddb>] tcp_xmit_probe_skb+0xdb/0xf0
> [177459.546389] [<ffffffff81a3dffb>] tcp_write_wakeup+0x5b/0x150
> [177459.553061] [<ffffffff81a3e51b>] tcp_keepalive_timer+0x1fb/0x230
> [177459.559761] [<ffffffff81a3e320>] ? tcp_init_xmit_timers+0x20/0x20
> [177459.566447] [<ffffffff8110f3c7>] call_timer_fn.isra.27+0x17/0x80
> [177459.573121] [<ffffffff81a3e320>] ? tcp_init_xmit_timers+0x20/0x20
> [177459.579778] [<ffffffff8110f55d>] run_timer_softirq+0x12d/0x200
> [177459.586448] [<ffffffff810ca6c3>] __do_softirq+0x103/0x210
> [177459.593138] [<ffffffff810ca9cb>] irq_exit+0x4b/0xa0
> [177459.599783] [<ffffffff814f05d4>] xen_evtchn_do_upcall+0x34/0x50
> [177459.606300] [<ffffffff81af93ae>] xen_do_hypervisor_callback+0x1e/0x40
> [177459.612583] <EOI>
> [177459.612637] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [177459.625010] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [177459.631157] [<ffffffff81008d60>] ? xen_safe_halt+0x10/0x20
> [177459.637158] [<ffffffff810188d3>] ? default_idle+0x13/0x20
> [177459.643072] [<ffffffff81018e1a>] ? arch_cpu_idle+0xa/0x10
> [177459.648809] [<ffffffff810f8e7e>] ? default_idle_call+0x2e/0x50
> [177459.654650] [<ffffffff810f9112>] ? cpu_startup_entry+0x272/0x2e0
> [177459.660488] [<ffffffff81ae79f7>] ? rest_init+0x77/0x80
> [177459.666297] [<ffffffff82312f58>] ? start_kernel+0x43b/0x448
> [177459.672092] [<ffffffff823124ef>] ? x86_64_start_reservations+0x2a/0x2c
> [177459.677800] [<ffffffff82316008>] ? xen_start_kernel+0x550/0x55c
> [177459.683451] Code: 77 28 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 08 55 48 89 e5 48 85 c0 74 6a 48 8b 0f 48 85 c9 48 89 08 74 04 <48> 89 41 08 84 d2 74 08 48 c7 47 08 00 00 00 00 f6 47 2a 10 48
> [177459.695332] RIP [<ffffffff8110eb58>] detach_if_pending+0x18/0x80
> [177459.701154] RSP <ffff88005f6039d8>
> (XEN) [2015-08-17 00:11:51.426] Hardware Dom0 crashed: rebooting machine in 5 seconds.
>


might be conntracking related then.
You might try :

1) reproduce the issue without conntracking.

2) bisect the bug

Thanks.

2015-08-17 13:48:11

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80


Monday, August 17, 2015, 3:37:13 PM, you wrote:

> On Mon, 2015-08-17 at 11:09 +0200, Sander Eikelenboom wrote:
>> Saturday, August 15, 2015, 12:39:25 AM, you wrote:
>>
>> > On Sat, 2015-08-15 at 00:09 +0200, Sander Eikelenboom wrote:
>> >> On 2015-08-13 00:41, Eric Dumazet wrote:
>> >> > On Wed, 2015-08-12 at 23:46 +0200, Sander Eikelenboom wrote:
>> >> >
>> >> >> Thanks for the reminder, but luckily i was aware of that,
>> >> >> seen enough of your replies asking for patches to be resubmitted
>> >> >> against "the other tree" ;)
>> >> >> Kernel with patch is currently running so fingers crossed.
>> >> >
>> >> > Thanks for testing. I am definitely interested knowing your results.
>> >>
>> >> Hmm it seems now commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af is
>> >> breaking things
>> >> (have to test if a revert helps) i get this in some guests:
>>
>>
>> > Yes, this was fixed by :
>> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>>
>>
>> Hi Eric,
>>
>> With that patch i had a crash again this night, see below.
>>
>> --
>> Sander
>>
>> [177459.188808] general protection fault: 0000 [#1] SMP
>> [177459.199746] Modules linked in:
>> [177459.210540] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-20150815-linus-doflr-net+ #1
>> [177459.221441] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640) , BIOS V1.8B1 09/13/2010
>> [177459.232247] task: ffffffff8221a580 ti: ffffffff82200000 task.ti: ffffffff82200000
>> [177459.242931] RIP: e030:[<ffffffff8110eb58>] [<ffffffff8110eb58>] detach_if_pending+0x18/0x80
>> [177459.253503] RSP: e02b:ffff88005f6039d8 EFLAGS: 00010086
>> [177459.264051] RAX: ffff8800584d6580 RBX: ffff880004901420 RCX: dead000000200200
>> [177459.274599] RDX: 0000000000000000 RSI: ffff88005f60e5c0 RDI: ffff880004901420
>> [177459.285122] RBP: ffff88005f6039d8 R08: 0000000000000001 R09: 0000000000000000
>> [177459.295286] R10: 0000000000000003 R11: ffff880004901394 R12: 0000000000000003
>> [177459.305388] R13: 000000010ae47040 R14: 0000000007b98a00 R15: ffff88005f60e5c0
>> [177459.315345] FS: 00007f51317ec700(0000) GS:ffff88005f600000(0000) knlGS:0000000000000000
>> [177459.325340] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [177459.335217] CR2: 00000000010f8000 CR3: 000000002a154000 CR4: 0000000000000660
>> [177459.345129] Stack:
>> [177459.354783] ffff88005f603a28 ffffffff8110ee7f ffffffff810fb261 0000000000000200
>> [177459.364505] 0000000000000003 ffff880004901380 0000000000000003 ffff8800567d0d00
>> [177459.374064] 0000000007b98a00 0000000000000000 ffff88005f603a58 ffffffff819b3eb3
>> [177459.383532] Call Trace:
>> [177459.392878] <IRQ>
>> [177459.392935] [<ffffffff8110ee7f>] mod_timer_pending+0x3f/0xe0
>> [177459.411058] [<ffffffff810fb261>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
>> [177459.419876] [<ffffffff819b3eb3>] __nf_ct_refresh_acct+0xa3/0xb0
>> [177459.428642] [<ffffffff819baafb>] tcp_packet+0xb3b/0x1290
>> [177459.437285] [<ffffffff81a2535e>] ? ip_output+0x5e/0xc0
>> [177459.445845] [<ffffffff810ca8ca>] ? __local_bh_enable_ip+0x2a/0x90
>> [177459.454331] [<ffffffff819b35a9>] ? __nf_conntrack_find_get+0x129/0x2a0
>> [177459.462642] [<ffffffff819b549c>] nf_conntrack_in+0x29c/0x7c0
>> [177459.470711] [<ffffffff81a65e9c>] ipv4_conntrack_local+0x4c/0x50
>> [177459.478753] [<ffffffff819ad67c>] nf_iterate+0x4c/0x80
>> [177459.486726] [<ffffffff81102437>] ? generic_handle_irq+0x27/0x40
>> [177459.494634] [<ffffffff819ad714>] nf_hook_slow+0x64/0xc0
>> [177459.502486] [<ffffffff81a22d40>] __ip_local_out_sk+0x90/0xa0
>> [177459.510248] [<ffffffff81a22c40>] ? ip_forward_options+0x1a0/0x1a0
>> [177459.517782] [<ffffffff81a22d66>] ip_local_out_sk+0x16/0x40
>> [177459.525044] [<ffffffff81a2343d>] ip_queue_xmit+0x14d/0x350
>> [177459.532247] [<ffffffff81a3ae7e>] tcp_transmit_skb+0x48e/0x960
>> [177459.539413] [<ffffffff81a3cddb>] tcp_xmit_probe_skb+0xdb/0xf0
>> [177459.546389] [<ffffffff81a3dffb>] tcp_write_wakeup+0x5b/0x150
>> [177459.553061] [<ffffffff81a3e51b>] tcp_keepalive_timer+0x1fb/0x230
>> [177459.559761] [<ffffffff81a3e320>] ? tcp_init_xmit_timers+0x20/0x20
>> [177459.566447] [<ffffffff8110f3c7>] call_timer_fn.isra.27+0x17/0x80
>> [177459.573121] [<ffffffff81a3e320>] ? tcp_init_xmit_timers+0x20/0x20
>> [177459.579778] [<ffffffff8110f55d>] run_timer_softirq+0x12d/0x200
>> [177459.586448] [<ffffffff810ca6c3>] __do_softirq+0x103/0x210
>> [177459.593138] [<ffffffff810ca9cb>] irq_exit+0x4b/0xa0
>> [177459.599783] [<ffffffff814f05d4>] xen_evtchn_do_upcall+0x34/0x50
>> [177459.606300] [<ffffffff81af93ae>] xen_do_hypervisor_callback+0x1e/0x40
>> [177459.612583] <EOI>
>> [177459.612637] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> [177459.625010] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>> [177459.631157] [<ffffffff81008d60>] ? xen_safe_halt+0x10/0x20
>> [177459.637158] [<ffffffff810188d3>] ? default_idle+0x13/0x20
>> [177459.643072] [<ffffffff81018e1a>] ? arch_cpu_idle+0xa/0x10
>> [177459.648809] [<ffffffff810f8e7e>] ? default_idle_call+0x2e/0x50
>> [177459.654650] [<ffffffff810f9112>] ? cpu_startup_entry+0x272/0x2e0
>> [177459.660488] [<ffffffff81ae79f7>] ? rest_init+0x77/0x80
>> [177459.666297] [<ffffffff82312f58>] ? start_kernel+0x43b/0x448
>> [177459.672092] [<ffffffff823124ef>] ? x86_64_start_reservations+0x2a/0x2c
>> [177459.677800] [<ffffffff82316008>] ? xen_start_kernel+0x550/0x55c
>> [177459.683451] Code: 77 28 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 08 55 48 89 e5 48 85 c0 74 6a 48 8b 0f 48 85 c9 48 89 08 74 04 <48> 89 41 08 84 d2 74 08 48 c7 47 08 00 00 00 00 f6 47 2a 10 48
>> [177459.695332] RIP [<ffffffff8110eb58>] detach_if_pending+0x18/0x80
>> [177459.701154] RSP <ffff88005f6039d8>
>> (XEN) [2015-08-17 00:11:51.426] Hardware Dom0 crashed: rebooting machine in 5 seconds.
>>


> might be conntracking related then.
> You might try :

> 1) reproduce the issue without conntracking.
Will see if i can do that.

> 2) bisect the bug
Hmm, that's going to be quite painful, since i don't have an immediate
and reliable testcase (running for "about two days" doesn't qualify).
Especially since there are all kinds of other known bugs in between.

> Thanks.

--
Sander


2015-08-17 14:21:53

by Eric Dumazet

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
> This is very similar to the behavior I am seeing in this bug:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=102911

OK, but have you applied the fix ?

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

It will be part of net iteration from David Miller to Linus Torvalds.

2015-08-17 14:25:38

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80


Monday, August 17, 2015, 4:21:47 PM, you wrote:

> On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
>> This is very similar to the behavior I am seeing in this bug:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=102911

> OK, but have you applied the fix ?

> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af

> It will be part of net iteration from David Miller to Linus Torvalds.


I did have that patch in for my last report.
But i don't think he had (looking at the second part of his oops).

--
Sander

2015-08-17 15:16:15

by Jon Christopherson

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 08/17/2015 09:25 AM, Sander Eikelenboom wrote:
>
> > OK, but have you applied the fix ?
>
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>
> > It will be part of net iteration from David Miller to Linus Torvalds.
>
>
I did not have that fix applied, but will apply and test.

Thanks,

Jon

2015-08-17 17:18:52

by Eric Dumazet

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

From: Eric Dumazet <[email protected]>

On Mon, 2015-08-17 at 16:25 +0200, Sander Eikelenboom wrote:
> Monday, August 17, 2015, 4:21:47 PM, you wrote:
>
> > On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
> >> This is very similar to the behavior I am seeing in this bug:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=102911
>
> > OK, but have you applied the fix ?
>
> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>
> > It will be part of net iteration from David Miller to Linus Torvalds.
>
>
> I did have that patch in for my last report.
> But i don't think he had (looking at the second part of his oops).
>

Then can you try the following fix as well ?

Thanks !

[PATCH] timer: fix a race in __mod_timer()

lock_timer_base() cannot catch the following:

CPU1 (in __mod_timer())
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  see timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                          return base; // oops, wrong base
timer->flags |= base->cpu // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/time/timer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5e097fa9faf7..84190f02b521 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags &= ~TIMER_BASEMASK;
-			timer->flags |= base->cpu;
+			WRITE_ONCE(timer->flags,
+				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
 		}
 	}


2015-08-17 18:33:31

by Sander Eikelenboom

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 2015-08-17 19:18, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>
>
> On Mon, 2015-08-17 at 16:25 +0200, Sander Eikelenboom wrote:
>> Monday, August 17, 2015, 4:21:47 PM, you wrote:
>>
>> > On Mon, 2015-08-17 at 09:02 -0500, Jon Christopherson wrote:
>> >> This is very similar to the behavior I am seeing in this bug:
>> >>
>> >> https://bugzilla.kernel.org/show_bug.cgi?id=102911
>>
>> > OK, but have you applied the fix ?
>>
>> > http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>>
>> > It will be part of net iteration from David Miller to Linus Torvalds.
>>
>>
>> I did have that patch in for my last report.
>> But i don't think he had (looking at the second part of his oops).
>>
>
> Then can you try following fix as well ?
>
> Thanks !

Running now :)


>
> [PATCH] timer: fix a race in __mod_timer()
>
> lock_timer_base() can not catch following :
>
> CPU1 (in __mod_timer())
> timer->flags |= TIMER_MIGRATING;
> spin_unlock(&base->lock);
> base = new_base;
> spin_lock(&base->lock);
> timer->flags &= ~TIMER_BASEMASK;
>                                   CPU2 (in lock_timer_base())
>                                   see timer base is cpu0 base
>                                   spin_lock_irqsave(&base->lock, *flags);
>                                   if (timer->flags == tf)
>                                           return base; // oops, wrong base
> timer->flags |= base->cpu // too late
>
> We must write timer->flags in one go, otherwise we can fool other cpus.
>
> Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if
> disabled")
> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> ---
> kernel/time/timer.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 5e097fa9faf7..84190f02b521 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
>  			spin_unlock(&base->lock);
>  			base = new_base;
>  			spin_lock(&base->lock);
> -			timer->flags &= ~TIMER_BASEMASK;
> -			timer->flags |= base->cpu;
> +			WRITE_ONCE(timer->flags,
> +				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
>  		}
>  	}

2015-08-17 19:14:10

by Thomas Gleixner

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On Mon, 17 Aug 2015, Eric Dumazet wrote:
> [PATCH] timer: fix a race in __mod_timer()
>
> lock_timer_base() can not catch following :
>
> CPU1 (in __mod_timer())
> timer->flags |= TIMER_MIGRATING;
> spin_unlock(&base->lock);
> base = new_base;
> spin_lock(&base->lock);
> timer->flags &= ~TIMER_BASEMASK;
>                                   CPU2 (in lock_timer_base())
>                                   see timer base is cpu0 base
>                                   spin_lock_irqsave(&base->lock, *flags);
>                                   if (timer->flags == tf)
>                                           return base; // oops, wrong base
> timer->flags |= base->cpu // too late
>
> We must write timer->flags in one go, otherwise we can fool other cpus.
>
> Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> ---
> kernel/time/timer.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 5e097fa9faf7..84190f02b521 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
>  			spin_unlock(&base->lock);
>  			base = new_base;
>  			spin_lock(&base->lock);
> -			timer->flags &= ~TIMER_BASEMASK;
> -			timer->flags |= base->cpu;
> +			WRITE_ONCE(timer->flags,
> +				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);

Duh, yes. Picking it up for timers/urgent.

Thanks for spotting it.

tglx

2015-08-18 00:05:45

by Jon Christopherson

Subject: Re: Linux 4.2-rc6 regression: RIP: e030:[<ffffffff8110fb18>] [<ffffffff8110fb18>] detach_if_pending+0x18/0x80

On 08/17/2015 12:18 PM, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>

<snip>

>
> Then can you try following fix as well ?
>
> Thanks !
>
> [PATCH] timer: fix a race in __mod_timer()
>

<snip>

I have been running the latest code from git with the 2 patches in this
thread applied. No issues so far.

-Jon

Subject: [tip:timers/urgent] timer: Write timer->flags atomically

Commit-ID: d0023a1448abdcc892b8bca631e74bb1888efd02
Gitweb: http://git.kernel.org/tip/d0023a1448abdcc892b8bca631e74bb1888efd02
Author: Eric Dumazet <[email protected]>
AuthorDate: Mon, 17 Aug 2015 10:18:48 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Tue, 18 Aug 2015 15:31:16 +0200

timer: Write timer->flags atomically

lock_timer_base() cannot prevent the following :

CPU1 (in __mod_timer())
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
// The next line clears TIMER_MIGRATING
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  see timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                          return base; // oops, wrong base
timer->flags |= base->cpu // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jon Christopherson <[email protected]>
Cc: David Miller <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Sander Eikelenboom <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/time/timer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5e097fa..84190f0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -807,8 +807,8 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
 			spin_unlock(&base->lock);
 			base = new_base;
 			spin_lock(&base->lock);
-			timer->flags &= ~TIMER_BASEMASK;
-			timer->flags |= base->cpu;
+			WRITE_ONCE(timer->flags,
+				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
 		}
 	}