LinuxLists.cc - kernel panic in skb_copy

Subject: Re: kernel panic in skb_copy_bits

On Thu, 2013-06-27 at 10:58 +0800, Joe Jin wrote:
> Hi,
>
> When we run a failover test with iSCSI + multipath by resetting the
> switches on OVM (2.6.39), we hit this panic:
>
> BUG: unable to handle kernel paging request at ffff88006d9e8d48
> IP: [<ffffffff812605bb>] memcpy+0xb/0x120
> PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0
> Oops: 0000 [#1] SMP
> CPU 7
> Modules linked in: dm_nfs tun nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio dm_round_robin dm_multipath libiscsi_tcp libiscsi scsi_transport_iscsi xenfs xen_privcmd video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport ixgbe dca sr_mod cdrom bnx2 radeon ttm drm_kms_helper drm snd_seq_dummy i2c_algo_bit i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt pcspkr iTCO_vendor_support pata_acpi dcdbas i5k_amb ata_generic hwmon floppy ghes i5000_edac edac_core hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage lpfc scsi_transport_fc scsi_tgt ata_piix sg shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod crc_t10dif ext3 jbd mbcache
>
>
> Pid: 0, comm: swapper Tainted: G W 2.6.39-300.32.1.el5uek #1 Dell Inc. PowerEdge 2950/0DP246
> RIP: e030:[<ffffffff812605bb>] [<ffffffff812605bb>] memcpy+0xb/0x120
> RSP: e02b:ffff8801003c3d58 EFLAGS: 00010246
> RAX: ffff880076b9e280 RBX: ffff8800714d2c00 RCX: 0000000000000057
> RDX: 0000000000000000 RSI: ffff88006d9e8d48 RDI: ffff880076b9e280
> RBP: ffff8801003c3dc0 R08: 00000000000bf723 R09: 0000000000000000
> R10: 0000000000000000 R11: 000000000000000a R12: 0000000000000034
> R13: 0000000000000034 R14: 00000000000002b8 R15: 00000000000005a8
> FS: 00007fc1e852a6e0(0000) GS:ffff8801003c0000(0000) knlGS:0000000000000000
> CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: ffff88006d9e8d48 CR3: 000000006370b000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffff880077ac0000, task ffff880077abe240)
> Stack:
> ffffffff8142db21 0000000000000000 ffff880076b9e280 ffff8800637097f0
> 000002ec00000000 00000000000002b8 ffff880077ac0000 0000000000000000
> ffff8800637097f0 ffff880066c9a7c0 00000000fffffdb4 000000000000024c
> Call Trace:
> <IRQ>
> [<ffffffff8142db21>] ? skb_copy_bits+0x1c1/0x2e0
> [<ffffffff8142f173>] skb_copy+0xf3/0x120
> [<ffffffff81447fbc>] neigh_timer_handler+0x1ac/0x350
> [<ffffffff810573fe>] ? account_idle_ticks+0xe/0x10
> [<ffffffff81447e10>] ? neigh_alloc+0x180/0x180
> [<ffffffff8107dbaa>] call_timer_fn+0x4a/0x110
> [<ffffffff81447e10>] ? neigh_alloc+0x180/0x180
> [<ffffffff8107f82a>] run_timer_softirq+0x13a/0x220
> [<ffffffff81075c39>] __do_softirq+0xb9/0x1d0
> [<ffffffff810d9678>] ? handle_percpu_irq+0x48/0x70
> [<ffffffff81511d3c>] call_softirq+0x1c/0x30
> [<ffffffff810172e5>] do_softirq+0x65/0xa0
> [<ffffffff8107656b>] irq_exit+0xab/0xc0
> [<ffffffff812f97d5>] xen_evtchn_do_upcall+0x35/0x50
> [<ffffffff81511d8e>] xen_do_hypervisor_callback+0x1e/0x30
> <EOI>
> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff8100a0b0>] ? xen_safe_halt+0x10/0x20
> [<ffffffff8101dfeb>] ? default_idle+0x5b/0x170
> [<ffffffff81014ac6>] ? cpu_idle+0xc6/0xf0
> [<ffffffff8100a8c9>] ? xen_irq_enable_direct_reloc+0x4/0x4
> [<ffffffff814f7bbe>] ? cpu_bringup_and_idle+0xe/0x10
> Code: 01 c6 43 4c 04 19 c0 4c 8b 65 f0 4c 8b 6d f8 83 e0 fc 83 c0 08 88 43 4d 48 8b 5d e8 c9 c3 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c
> RIP [<ffffffff812605bb>] memcpy+0xb/0x120
> RSP <ffff8801003c3d58>
> CR2: ffff88006d9e8d48
>
> Reviewing the vmcore, I found that skb->users was 1 at the time of the
> crash. Checking the history of the network neighbour code, I found that
> skb_get() was replaced by skb_copy() in commit 7e36763b2c:
>
> commit 7e36763b2c204d59de4e88087f84a2c0c8421f25
> Author: Frank Blaschka <[email protected]>
> Date: Mon Mar 3 12:16:04 2008 -0800
>
> [NET]: Fix race in generic address resolution.
>
> neigh_update sends skb from neigh->arp_queue while neigh_timer_handler
> has increased skbs refcount and calls solicit with the
> skb. neigh_timer_handler should not increase skbs refcount but make a
> copy of the skb and do solicit with the copy.
>
> Signed-off-by: Frank Blaschka <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
>
> Could you please give some details of the race? Per the vmcore, it looks
> like the skb data had been freed; I suspect an skb_get() was lost somewhere.
> With the above commit reverted, the panic did not occur during our testing.
>
> Any input would be appreciated!

Well, the fact is that your crash is happening in skb_copy().

Frank's patch is OK. I suspect using skb_clone() would work too,
and if these skbs were fclone-ready, the chance of a GFP_ATOMIC
allocation error would be smaller.

So something is providing a bad skb at the very beginning.

You could try doing an early skb_copy() to catch the bug and see in
the stack trace what produced this buggy skb.

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 5c56b21..a7a51fd 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1010,6 +1010,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 				NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards);
 			}
 			skb_dst_force(skb);
+			kfree_skb(skb_copy(skb, GFP_ATOMIC));
 			__skb_queue_tail(&neigh->arp_queue, skb);
 			neigh->arp_queue_len_bytes += skb->truesize;
 		}
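
The commit message and the reply above turn on the difference between taking another reference to an skb (skb_get()), sharing its payload under a new header (skb_clone()), and duplicating both header and payload (skb_copy()). As a rough illustration only, here is a user-space toy model of those three operations. It is not kernel code: the toy_* names and both structs are invented for this sketch, and the real struct sk_buff is far more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy payload, shared between headers (hypothetical, not kernel code). */
struct toy_data {
	int dataref;            /* number of headers sharing this payload */
	size_t len;
	unsigned char bytes[];  /* payload bytes */
};

/* Toy header, loosely playing the role of struct sk_buff. */
struct toy_skb {
	int users;              /* references to this header */
	struct toy_data *data;
};

static struct toy_skb *toy_alloc(const void *src, size_t len)
{
	struct toy_skb *skb = malloc(sizeof(*skb));
	struct toy_data *d = malloc(sizeof(*d) + len);

	if (!skb || !d) {
		free(skb);
		free(d);
		return NULL;
	}
	d->dataref = 1;
	d->len = len;
	memcpy(d->bytes, src, len);
	skb->users = 1;
	skb->data = d;
	return skb;
}

/* Like skb_get(): same header, same payload, one more user. */
static struct toy_skb *toy_get(struct toy_skb *skb)
{
	assert(skb->users > 0);
	skb->users++;
	return skb;
}

/* Like skb_clone(): new header, payload shared (never read here). */
static struct toy_skb *toy_clone(struct toy_skb *skb)
{
	struct toy_skb *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->users = 1;
	c->data = skb->data;
	c->data->dataref++;
	return c;
}

/* Like skb_copy(): new header and a private payload; reads every byte. */
static struct toy_skb *toy_copy(const struct toy_skb *skb)
{
	return toy_alloc(skb->data->bytes, skb->data->len);
}

/* Like kfree_skb(): drop one user; free the payload with its last header. */
static void toy_free(struct toy_skb *skb)
{
	if (!skb || --skb->users > 0)
		return;
	if (--skb->data->dataref == 0)
		free(skb->data);
	free(skb);
}
```

In this model only toy_copy() reads the payload bytes, which is consistent with the panic surfacing in memcpy() under skb_copy() rather than at the point where the skb was first queued.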

2013-06-27 07:15:51

Subject: Re: kernel panic in skb_copy_bits

Hi Eric,

Thanks for your response. I will test it and get back to you.

Regards,
Joe
On 06/27/13 13:31, Eric Dumazet wrote:
>
> Well, the fact is that your crash is happening in skb_copy().
>
> Frank's patch is OK. I suspect using skb_clone() would work too,
> and if these skbs were fclone-ready, the chance of a GFP_ATOMIC
> allocation error would be smaller.
>
> So something is providing a bad skb at the very beginning.
>
> You could try doing an early skb_copy() to catch the bug and see in
> the stack trace what produced this buggy skb.
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 5c56b21..a7a51fd 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -1010,6 +1010,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>  				NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards);
>  			}
>  			skb_dst_force(skb);
> +			kfree_skb(skb_copy(skb, GFP_ATOMIC));
>  			__skb_queue_tail(&neigh->arp_queue, skb);
>  			neigh->arp_queue_len_bytes += skb->truesize;
>  		}
>
>

2013-06-28 04:18:12

Subject: Re: kernel panic in skb_copy_bits

I found a similar issue: http://www.gossamer-threads.com/lists/xen/devel/265611
I have copied the Xen developers as well.

On 06/27/13 13:31, Eric Dumazet wrote:
>
> Well, the fact is that your crash is happening in skb_copy().
>
> Frank's patch is OK. I suspect using skb_clone() would work too,
> and if these skbs were fclone-ready, the chance of a GFP_ATOMIC
> allocation error would be smaller.
>
> So something is providing a bad skb at the very beginning.
>
> You could try doing an early skb_copy() to catch the bug and see in
> the stack trace what produced this buggy skb.
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 5c56b21..a7a51fd 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -1010,6 +1010,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>  				NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards);
>  			}
>  			skb_dst_force(skb);
> +			kfree_skb(skb_copy(skb, GFP_ATOMIC));
>  			__skb_queue_tail(&neigh->arp_queue, skb);
>  			neigh->arp_queue_len_bytes += skb->truesize;
>  		}
>
>

BUG: unable to handle kernel paging request at ffff8800488db8dc
IP: [<ffffffff812605bb>] memcpy+0xb/0x120
PGD 1796067 PUD 20e5067 PMD 212a067 PTE 0
Oops: 0000 [#1] SMP
CPU 13
Modules linked in: ocfs2 jbd2 xen_blkback xen_netback xen_gntdev xen_evtchn netconsole i2c_dev i2c_core ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs lockd sunrpc dm_round_robin dm_multipath bridge stp llc bonding be2iscsi iscsi_boot_sysfs iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi xenfs xen_privcmd video sbs sbshc hed acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport serio_raw ixgbe hpilo tg3 hpwdt dca snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd iTCO_wdt iTCO_vendor_support soundcore snd_page_alloc pcspkr pata_acpi ata_generic dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage ata_piix sg shpchp hpsa cciss sd_mod crc_t10dif ext3 jbd mbcache

Pid: 0, comm: swapper Not tainted 2.6.39-300.32.1.el5uek.bug16929255v5 #1 HP ProLiant DL360p Gen8
RIP: e030:[<ffffffff812605bb>] [<ffffffff812605bb>] memcpy+0xb/0x120
RSP: e02b:ffff88005a9a3b68 EFLAGS: 00010202
RAX: ffff8800200f0280 RBX: 0000000000000724 RCX: 00000000000000e4
RDX: 0000000000000004 RSI: ffff8800488db8dc RDI: ffff8800200f0280
RBP: ffff88005a9a3bd0 R08: 0000000000000004 R09: ffff880052824980
R10: 0000000000000000 R11: 0000000000015048 R12: 0000000000000034
R13: 0000000000000034 R14: 00000000000022f4 R15: ffff880021208ab0
FS: 00007fe8737c96e0(0000) GS:ffff88005a9a0000(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: ffff8800488db8dc CR3: 000000004fb38000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff880054d36000, task ffff880054d343c0)
Stack:
ffffffff8142dac7 0000000000000000 00000000ffffffff ffff8800200f0280
0000075800000000 0000000000000724 ffff880054d36000 0000000000000000
00000000fffffdb4 ffff880052824980 ffff880021208ab0 000000000000024c
Call Trace:
<IRQ>
[<ffffffff8142dac7>] ? skb_copy_bits+0x167/0x290
[<ffffffff8142f0b5>] skb_copy+0x85/0xb0
[<ffffffff8144864d>] __neigh_event_send+0x18d/0x200
[<ffffffff81449a42>] neigh_resolve_output+0x162/0x1b0
[<ffffffff81477046>] ip_finish_output+0x146/0x320
[<ffffffff814754a5>] ip_output+0x85/0xd0
[<ffffffff814758d9>] ip_local_out+0x29/0x30
[<ffffffff814761e0>] ip_queue_xmit+0x1c0/0x3d0
[<ffffffff8148d3ef>] tcp_transmit_skb+0x40f/0x520
[<ffffffff8148e5ff>] tcp_retransmit_skb+0x16f/0x2e0
[<ffffffff814908c0>] ? tcp_retransmit_timer+0x4a0/0x4a0
[<ffffffff814905ad>] tcp_retransmit_timer+0x18d/0x4a0
[<ffffffff814908c0>] ? tcp_retransmit_timer+0x4a0/0x4a0
[<ffffffff81490994>] tcp_write_timer+0xd4/0x100
[<ffffffff8107dbaa>] call_timer_fn+0x4a/0x110
[<ffffffff814908c0>] ? tcp_retransmit_timer+0x4a0/0x4a0
[<ffffffff8107f82a>] run_timer_softirq+0x13a/0x220
[<ffffffff81075c39>] __do_softirq+0xb9/0x1d0
[<ffffffff810d9678>] ? handle_percpu_irq+0x48/0x70
[<ffffffff81511b7c>] call_softirq+0x1c/0x30
[<ffffffff810172e5>] do_softirq+0x65/0xa0
[<ffffffff8107656b>] irq_exit+0xab/0xc0
[<ffffffff812f97d5>] xen_evtchn_do_upcall+0x35/0x50
[<ffffffff81511bce>] xen_do_hypervisor_callback+0x1e/0x30
<EOI>
[<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[<ffffffff8100a0d0>] ? xen_safe_halt+0x10/0x20
[<ffffffff8101dfeb>] ? default_idle+0x5b/0x170
[<ffffffff81014ac6>] ? cpu_idle+0xc6/0xf0
[<ffffffff8100a8e9>] ? xen_irq_enable_direct_reloc+0x4/0x4
[<ffffffff814f7a2e>] ? cpu_bringup_and_idle+0xe/0x10
Code: 01 c6 43 4c 04 19 c0 4c 8b 65 f0 4c 8b 6d f8 83 e0 fc 83 c0 08 88 43 4d 48 8b 5d e8 c9 c3 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c
RIP [<ffffffff812605bb>] memcpy+0xb/0x120
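
The Code: bytes around the faulting instruction in both oopses decode to "mov %rdi,%rax; mov %edx,%ecx; shr $3,%ecx; and $7,%edx; rep movsq; mov %edx,%ecx; rep movsb; ret", with the fault on the rep movsq at memcpy+0xb. Under that reading, RAX holds the original destination, RCX the quadwords still to copy, and RDX the trailing byte count, so the length passed to memcpy() can be recovered from the register dump. A small sketch of that arithmetic follows; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Register values as printed in an x86-64 oops that faulted in the
 * "rep movsq" of the memcpy() shown in the Code: line above.
 * (Hypothetical helper type, not a kernel structure.)
 */
struct memcpy_fault_regs {
	uint64_t rax; /* original destination (saved by mov %rdi,%rax) */
	uint64_t rcx; /* quadwords still to copy, including the faulting one */
	uint64_t rdx; /* trailing bytes, i.e. len & 7 */
	uint64_t rdi; /* current destination pointer */
};

/* Bytes successfully copied before the fault. */
static uint64_t memcpy_bytes_done(const struct memcpy_fault_regs *r)
{
	assert(r->rdi >= r->rax);
	return r->rdi - r->rax;
}

/* Total length originally passed to memcpy(). */
static uint64_t memcpy_total_len(const struct memcpy_fault_regs *r)
{
	return memcpy_bytes_done(r) + r->rcx * 8 + r->rdx;
}
```

Plugging in the registers from the two oopses in this thread gives 0x2b8 and 0x724 bytes, matching R14 in the first dump and RBX in the second. In both cases RDI still equals RAX, so the fault hit on the very first read from the source address held in RSI and reported in CR2.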

Per the vmcore, the socket info is as follows:
------------------------------------------------------------------------------
<struct tcp_sock 0xffff88004d344e00> TCP
tcp 10.1.1.11:42147 10.1.1.21:3260 FIN_WAIT1
windows: rcv=122124, snd=65535 advmss=8948 rcv_ws=1 snd_ws=0
nonagle=1 sack_ok=0 tstamp_ok=1
rmem_alloc=0, wmem_alloc=10229
rx_queue=0, tx_queue=149765
rcvbuf=262142, sndbuf=262142
rcv_tstamp=51.4 s, lsndtime=0.0 s ago
-- Retransmissions --
retransmits=7, ca_state=TCP_CA_Disorder
------------------------------------------------------------------------------

When the socket state moves to FIN_WAIT1, does it clean up all of its skbs or not?

Thanks,
Joe

2013-06-28 06:52:26

Subject: Re: kernel panic in skb_copy_bits

On Fri, 2013-06-28 at 12:17 +0800, Joe Jin wrote:
> I found a similar issue: http://www.gossamer-threads.com/lists/xen/devel/265611
> I have copied the Xen developers as well.
>
> When the socket state moves to FIN_WAIT1, will all skbs be cleaned up or not?

I get crashes as well using a UDP application. It's not related to TCP.

There is some corruption going on in neighbour code.

[ 942.319645] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 942.327510] IP: [<ffffffff814e4558>] __neigh_event_send+0x1a8/0x240
[ 942.333799] PGD c5a125067 PUD c603e1067 PMD 0
[ 942.338292] Oops: 0002 [#1] SMP
[ 942.341819] gsmi: Log Shutdown Reason 0x03
[ 942.364995] CPU: 8 PID: 13760 Comm: netperf Tainted: G W 3.10.0-smp-DEV #155
[ 942.380212] task: ffff88065b54b000 ti: ffff8806498fc000 task.ti: ffff8806498fc000
[ 942.387689] RIP: 0010:[<ffffffff814e4558>] [<ffffffff814e4558>] __neigh_event_send+0x1a8/0x240
[ 942.396402] RSP: 0018:ffff8806498fd9d8 EFLAGS: 00010206
[ 942.401709] RAX: 0000000000000000 RBX: ffff88065a8f9000 RCX: ffff88065fdf61c0
[ 942.408837] RDX: 0000000000000000 RSI: ffff880c5d5b3080 RDI: ffff880c5b9c0ac0
[ 942.415966] RBP: ffff8806498fd9f8 R08: ffff88064cb00000 R09: ffff8806498fda70
[ 942.423095] R10: ffff880c5ffbead0 R11: ffffffff815137d0 R12: ffff88065a8f9030
[ 942.430232] R13: ffff880c5d5b3080 R14: 0000000000000000 R15: ffff88065b4af940
[ 942.437362] FS: 00007fd613190700(0000) GS:ffff880c7fc40000(0000) knlGS:0000000000000000
[ 942.445452] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 942.451193] CR2: 0000000000000008 CR3: 0000000c59b60000 CR4: 00000000000007e0
[ 942.458324] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 942.465460] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 942.472597] Stack:
[ 942.475997] ffff880c5d5b3080 ffff88065a8f9000 ffff880c59ac43c0 0000000000000088
[ 942.483473] ffff8806498fda48 ffffffff814e50db ffff880c5d5b3080 ffffffff81514c60
[ 942.490947] 0000000000000088 ffff88064cb00000 ffff880c5d5b3080 ffff880c59ac43c0
[ 942.498415] Call Trace:
[ 942.500873] [<ffffffff814e50db>] neigh_resolve_output+0x14b/0x1f0
[ 942.507056] [<ffffffff81514c60>] ? __ip_append_data.isra.39+0x9e0/0x9e0
[ 942.515138] [<ffffffff81514ddf>] ip_finish_output+0x17f/0x380
[ 942.520972] [<ffffffff81515bb3>] ip_output+0x53/0x90
[ 942.526030] [<ffffffff815167d6>] ? ip_make_skb+0xf6/0x120
[ 942.532897] [<ffffffff81515379>] ip_local_out+0x29/0x30
[ 942.538215] [<ffffffff81516649>] ip_send_skb+0x19/0x50
[ 942.544825] [<ffffffff8153a65e>] udp_send_skb+0x2ce/0x3a0
[ 942.551439] [<ffffffff815137d0>] ? ip_setup_cork+0x110/0x110

2013-06-28 09:37:45

Subject: Re: kernel panic in skb_copy_bits

OK please try the following patch

[PATCH] neighbour: fix a race in neigh_destroy()

There is a race in neighbour code, because neigh_destroy() uses
skb_queue_purge(&neigh->arp_queue) without holding neighbour lock,
while other parts of the code assume neighbour rwlock is what
protects arp_queue

Convert all skb_queue_purge() calls to the __skb_queue_purge() variant

Use __skb_queue_head_init() instead of skb_queue_head_init()
to make clear we do not use arp_queue.lock

And hold neigh->lock in neigh_destroy() to close the race.

Reported-by: Joe Jin <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---
net/core/neighbour.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2569ab2..b7de821 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -231,7 +231,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
we must kill timers etc. and move
it to safe state.
*/
- skb_queue_purge(&n->arp_queue);
+ __skb_queue_purge(&n->arp_queue);
n->arp_queue_len_bytes = 0;
n->output = neigh_blackhole;
if (n->nud_state & NUD_VALID)
@@ -286,7 +286,7 @@ static struct neighbour *neigh_alloc(struct neigh_table *tbl, struct net_device
if (!n)
goto out_entries;

- skb_queue_head_init(&n->arp_queue);
+ __skb_queue_head_init(&n->arp_queue);
rwlock_init(&n->lock);
seqlock_init(&n->ha_lock);
n->updated = n->used = now;
@@ -708,7 +708,9 @@ void neigh_destroy(struct neighbour *neigh)
if (neigh_del_timer(neigh))
pr_warn("Impossible event\n");

- skb_queue_purge(&neigh->arp_queue);
+ write_lock_bh(&neigh->lock);
+ __skb_queue_purge(&neigh->arp_queue);
+ write_unlock_bh(&neigh->lock);
neigh->arp_queue_len_bytes = 0;

if (dev->netdev_ops->ndo_neigh_destroy)
@@ -858,7 +860,7 @@ static void neigh_invalidate(struct neighbour *neigh)
neigh->ops->error_report(neigh, skb);
write_lock(&neigh->lock);
}
- skb_queue_purge(&neigh->arp_queue);
+ __skb_queue_purge(&neigh->arp_queue);
neigh->arp_queue_len_bytes = 0;
}

@@ -1210,7 +1212,7 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,

write_lock_bh(&neigh->lock);
}
- skb_queue_purge(&neigh->arp_queue);
+ __skb_queue_purge(&neigh->arp_queue);
neigh->arp_queue_len_bytes = 0;
}
out:
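
For readers following the locking argument in the patch description, here is a minimal user-space sketch of the rule it enforces: the queue has no lock of its own, so every path that modifies it, including the destroy path, must hold the owner's lock. This is only an illustration under stated assumptions, not kernel code. The names (struct pkt, struct neigh_model, neigh_model_*) are hypothetical, and a pthread mutex stands in for neigh->lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct sk_buff and struct neighbour. */
struct pkt {
    struct pkt *next;
};

struct neigh_model {
    pthread_mutex_t lock;   /* plays the role of neigh->lock */
    struct pkt *arp_queue;  /* protected by lock, not by a lock of its own */
    unsigned int qlen;
};

static void neigh_model_init(struct neigh_model *n)
{
    pthread_mutex_init(&n->lock, NULL);
    n->arp_queue = NULL;
    n->qlen = 0;
}

/* Enqueue path: modifies the queue only while holding the owner's lock. */
static int neigh_model_enqueue(struct neigh_model *n)
{
    struct pkt *p = malloc(sizeof(*p));

    if (!p)
        return -1;
    pthread_mutex_lock(&n->lock);
    p->next = n->arp_queue;
    n->arp_queue = p;
    n->qlen++;
    pthread_mutex_unlock(&n->lock);
    return 0;
}

/* Lockless purge, analogous to __skb_queue_purge(): caller holds n->lock. */
static unsigned int neigh_model_purge_locked(struct neigh_model *n)
{
    unsigned int freed = 0;

    while (n->arp_queue) {
        struct pkt *p = n->arp_queue;

        n->arp_queue = p->next;
        free(p);
        n->qlen--;
        freed++;
    }
    assert(n->qlen == 0);   /* accounting must agree with the list */
    return freed;
}

/* Destroy path: takes the same lock as the enqueue path before purging. */
static unsigned int neigh_model_destroy(struct neigh_model *n)
{
    unsigned int freed;

    pthread_mutex_lock(&n->lock);
    freed = neigh_model_purge_locked(n);
    pthread_mutex_unlock(&n->lock);
    return freed;
}
```

If neigh_model_destroy() called the purge without taking the lock, a concurrent neigh_model_enqueue() could modify the list while it is being walked, which is the kind of race the patch closes in neigh_destroy().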

2013-06-28 11:33:30

Subject: Re: kernel panic in skb_copy_bits

Hi Eric,

Thanks for your patch, I'll test it then get back to you.

Regards,
Joe

2013-06-28 23:37:25

Subject: Re: kernel panic in skb_copy_bits

Hi Eric,

The patch does not fix the issue; the panic is the same as the one I posted earlier:
> BUG: unable to handle kernel paging request at ffff88006d9e8d48
> IP: [<ffffffff812605bb>] memcpy+0xb/0x120
> PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0
> Oops: 0000 [#1] SMP
> CPU 7
> Modules linked in: dm_nfs tun nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio dm_round_robin dm_multipath libiscsi_tcp libiscsi scsi_transport_iscsi xenfs xen_privcmd video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport ixgbe dca sr_mod cdrom bnx2 radeon ttm drm_kms_helper drm snd_seq_dummy i2c_algo_bit i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt pcspkr iTCO_vendor_support pata_acpi dcdbas i5k_amb ata_generic hwmon floppy ghes i5000_edac edac_core hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage lpfc scsi_transport_fc scsi_tgt ata_piix sg shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod crc_t10dif ext3 jbd mbcache
>
>
> Pid: 0, comm: swapper Tainted: G W 2.6.39-300.32.1.el5uek #1 Dell Inc. PowerEdge 2950/0DP246
> RIP: e030:[<ffffffff812605bb>] [<ffffffff812605bb>] memcpy+0xb/0x120
> RSP: e02b:ffff8801003c3d58 EFLAGS: 00010246
> RAX: ffff880076b9e280 RBX: ffff8800714d2c00 RCX: 0000000000000057
> RDX: 0000000000000000 RSI: ffff88006d9e8d48 RDI: ffff880076b9e280
> RBP: ffff8801003c3dc0 R08: 00000000000bf723 R09: 0000000000000000
> R10: 0000000000000000 R11: 000000000000000a R12: 0000000000000034
> R13: 0000000000000034 R14: 00000000000002b8 R15: 00000000000005a8
> FS: 00007fc1e852a6e0(0000) GS:ffff8801003c0000(0000) knlGS:0000000000000000
> CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: ffff88006d9e8d48 CR3: 000000006370b000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffff880077ac0000, task ffff880077abe240)
> Stack:
> ffffffff8142db21 0000000000000000 ffff880076b9e280 ffff8800637097f0
> 000002ec00000000 00000000000002b8 ffff880077ac0000 0000000000000000
> ffff8800637097f0 ffff880066c9a7c0 00000000fffffdb4 000000000000024c
> Call Trace:
> <IRQ>
> [<ffffffff8142db21>] ? skb_copy_bits+0x1c1/0x2e0
> [<ffffffff8142f173>] skb_copy+0xf3/0x120
> [<ffffffff81447fbc>] neigh_timer_handler+0x1ac/0x350
> [<ffffffff810573fe>] ? account_idle_ticks+0xe/0x10
> [<ffffffff81447e10>] ? neigh_alloc+0x180/0x180
> [<ffffffff8107dbaa>] call_timer_fn+0x4a/0x110
> [<ffffffff81447e10>] ? neigh_alloc+0x180/0x180
> [<ffffffff8107f82a>] run_timer_softirq+0x13a/0x220
> [<ffffffff81075c39>] __do_softirq+0xb9/0x1d0
> [<ffffffff810d9678>] ? handle_percpu_irq+0x48/0x70
> [<ffffffff81511d3c>] call_softirq+0x1c/0x30
> [<ffffffff810172e5>] do_softirq+0x65/0xa0
> [<ffffffff8107656b>] irq_exit+0xab/0xc0
> [<ffffffff812f97d5>] xen_evtchn_do_upcall+0x35/0x50
> [<ffffffff81511d8e>] xen_do_hypervisor_callback+0x1e/0x30
> <EOI>
> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff8100a0b0>] ? xen_safe_halt+0x10/0x20
> [<ffffffff8101dfeb>] ? default_idle+0x5b/0x170
> [<ffffffff81014ac6>] ? cpu_idle+0xc6/0xf0
> [<ffffffff8100a8c9>] ? xen_irq_enable_direct_reloc+0x4/0x4
> [<ffffffff814f7bbe>] ? cpu_bringup_and_idle+0xe/0x10
> Code: 01 c6 43 4c 04 19 c0 4c 8b 65 f0 4c 8b 6d f8 83 e0 fc 83 c0 08 88 43 4d 48 8b 5d e8 c9 c3 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c
> RIP [<ffffffff812605bb>] memcpy+0xb/0x120
> RSP <ffff8801003c3d58>
> CR2: ffff88006d9e8d48

Thanks,
Joe

--
Oracle <http://www.oracle.com>
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing

2013-06-29 07:05:04

Subject: Re: kernel panic in skb_copy_bits

On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote:
> Hi Eric,
>
> The patch does not fix the issue; the panic is the same as the one I posted earlier:

At least it fixes my own panics ;)

My test bed was:

Launch 24 concurrent "netperf -t UDP_STREAM -H destination -- -m 128"

Then on "destination" disconnect the ethernet port.

While the link flaps, I get a panic within a few seconds.

Thanks

2013-06-29 07:20:34

Subject: Re: kernel panic in skb_copy_bits

On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote:
> Hi Eric,
>
> The patch does not fix the issue; the panic is the same as the one I posted earlier:
> > BUG: unable to handle kernel paging request at ffff88006d9e8d48
> > IP: [<ffffffff812605bb>] memcpy+0xb/0x120
> > PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0
> > Oops: 0000 [#1] SMP
> > CPU 7
> > Modules linked in: dm_nfs tun nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio dm_round_robin dm_multipath libiscsi_tcp libiscsi scsi_transport_iscsi xenfs xen_privcmd video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport ixgbe dca sr_mod cdrom bnx2 radeon ttm drm_kms_helper drm snd_seq_dummy i2c_algo_bit i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt pcspkr iTCO_vendor_support pata_acpi dcdbas i5k_amb ata_generic hwmon floppy ghes i5000_edac edac_core hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage lpfc scsi_transport_fc scsi_tgt ata_piix sg shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod crc_t10dif ext3 jbd mbcache
> >
> >
> > Pid: 0, comm: swapper Tainted: G W 2.6.39-300.32.1.el5uek #1 Dell Inc. PowerEdge 2950/0DP246

By the way my patch was for current kernels, not for 2.6.39

For instance, I was not able to reproduce the crash with 3.3

RCU in neighbour code was added in 2.6.37, but it looks like this code
is a bit fragile because all the kfree_skb() are done while neighbour
locks are held.

So if a skb destructor triggers a new call to neighbour code, I presume
some bad things can happen. LOCKDEP could possibly help to detect
this.

You could try to replace these kfree_skb() calls with dev_kfree_skb_irq()
just in case.

(Do not forget the __skb_queue_purge() ones)

Try a LOCKDEP build as well.
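
To illustrate the concern about freeing skbs while a lock is held, here is a minimal user-space sketch of the general "detach under the lock, free after unlocking" pattern. It is an analogy only: dev_kfree_skb_irq() defers the free to softirq context rather than doing what is shown here, and all names (struct pkt, struct queue_model, demo_q, relock_destructor) are hypothetical. The destructor deliberately takes the queue lock, standing in for a skb destructor that re-enters the neighbour code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb with a destructor callback. */
struct pkt {
    struct pkt *next;
    void (*destructor)(struct pkt *);   /* may take the queue lock itself */
};

struct queue_model {
    pthread_mutex_t lock;   /* non-recursive, like a kernel spinlock */
    struct pkt *head;
};

static struct queue_model demo_q = { PTHREAD_MUTEX_INITIALIZER, NULL };
static int destructor_runs;

/* A destructor that re-enters code which needs the queue lock. */
static void relock_destructor(struct pkt *p)
{
    (void)p;
    pthread_mutex_lock(&demo_q.lock);
    destructor_runs++;
    pthread_mutex_unlock(&demo_q.lock);
}

static int queue_model_push(struct queue_model *q, void (*dtor)(struct pkt *))
{
    struct pkt *p = malloc(sizeof(*p));

    assert(q != NULL);
    if (!p)
        return -1;
    p->destructor = dtor;
    pthread_mutex_lock(&q->lock);
    p->next = q->head;
    q->head = p;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Detach the whole list while holding the lock; run no callbacks here. */
static struct pkt *queue_model_detach(struct queue_model *q)
{
    struct pkt *list;

    pthread_mutex_lock(&q->lock);
    list = q->head;
    q->head = NULL;
    pthread_mutex_unlock(&q->lock);
    return list;
}

/* Free a detached list with the lock released, so destructors may lock. */
static unsigned int free_detached(struct pkt *list)
{
    unsigned int n = 0;

    while (list) {
        struct pkt *next = list->next;

        if (list->destructor)
            list->destructor(list);
        free(list);
        list = next;
        n++;
    }
    return n;
}
```

Had the destructors been invoked inside queue_model_detach() while the mutex was held, relock_destructor() would block forever on the non-recursive lock. That is the user-space counterpart of the re-entrancy problem a LOCKDEP build is meant to flag.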

2013-06-29 16:11:49

by Ben Greear

Subject: Re: kernel panic in skb_copy_bits

On 06/29/2013 12:20 AM, Eric Dumazet wrote:
> On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote:
>> Hi Eric,
>>
>> The patch does not fix the issue; the panic is the same as the one I posted earlier:
>>> BUG: unable to handle kernel paging request at ffff88006d9e8d48
>>> IP: [<ffffffff812605bb>] memcpy+0xb/0x120
>>> PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0
>>> Oops: 0000 [#1] SMP
>>> CPU 7
>>> Modules linked in: dm_nfs tun nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio dm_round_robin dm_multipath libiscsi_tcp libiscsi scsi_transport_iscsi xenfs xen_privcmd video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport ixgbe dca sr_mod cdrom bnx2 radeon ttm drm_kms_helper drm snd_seq_dummy i2c_algo_bit i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt pcspkr iTCO_vendor_support pata_acpi dcdbas i5k_amb ata_generic hwmon floppy ghes i5000_edac edac_core hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage lpfc scsi_transport_fc scsi_tgt ata_piix sg shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod crc_t10dif ext3 jbd mbcache
>>>
>>>
>>> Pid: 0, comm: swapper Tainted: G W 2.6.39-300.32.1.el5uek #1 Dell Inc. PowerEdge 2950/0DP246
>
>
> By the way my patch was for current kernels, not for 2.6.39

Do you know if your patch should go in 3.9?

Your test case sounds a bit like what gives us the rare crash in tcp_collapse
(we have lots of bouncing wifi interfaces running slow-speed TCP traffic). But,
it takes days for us to hit the problem most of the time.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2013-06-29 16:26:08

Subject: Re: kernel panic in skb_copy_bits

On Sat, 2013-06-29 at 09:11 -0700, Ben Greear wrote:

> Do you know if your patch should go in 3.9?
>

Yes it should.

> Your test case sounds a bit like what gives us the rare crash in tcp_collapse
> (we have lots of bouncing wifi interfaces running slow-speed TCP traffic). But,
> it takes days for us to hit the problem most of the time.

Well, unfortunately that's a different problem :(

2013-06-29 16:32:14

by Ben Greear

Subject: Re: kernel panic in skb_copy_bits

On 06/29/2013 09:26 AM, Eric Dumazet wrote:
> On Sat, 2013-06-29 at 09:11 -0700, Ben Greear wrote:
>
>> Do you know if your patch should go in 3.9?
>>
>
> Yes it should.

Ok, I'll add that to my tree.

>> Your test case sounds a bit like what gives us the rare crash in tcp_collapse
>> (we have lots of bouncing wifi interfaces running slow-speed TCP traffic). But,
>> it takes days for us to hit the problem most of the time.
>
> Well, unfortunately that's a different problem :(

For what it's worth, I added this patch to my tree. We haven't hit the problem
since, but perhaps on the over-the-weekend run we'll see it.

commit 0286716b36a0e5b82c385052a0971f44bc3c3442
Author: Ben Greear <[email protected]>
Date: Tue Jun 25 15:49:52 2013 -0700

tcp: Try to work around crash in tcp_collapse.

And print out some info about why it crashed.

Signed-off-by: Ben Greear <[email protected]>

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a2f267a..63f7704 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4810,7 +4810,15 @@ restart:
int offset = start - TCP_SKB_CB(skb)->seq;
int size = TCP_SKB_CB(skb)->end_seq - start;

- BUG_ON(offset < 0);
+ if (WARN_ON(offset < 0)) {
+ /* We see a crash here (when using BUG_ON) every few days under
+ * some torture tests. I'm not sure how to clean this up properly,
+ * so just return and hope things keep muddling through. --Ben
+ */
+ printk("offset: %i start: %i seq: %i size: %i copy: %i\n",
+ offset, start, TCP_SKB_CB(skb)->seq, size, copy);
+ return;
+ }
if (size > 0) {
size = min(copy, size);
if (skb_copy_bits(skb, offset, skb_put(nskb, size), size))

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2013-06-30 00:26:55

Subject: Re: kernel panic in skb_copy_bits

On 06/29/13 15:20, Eric Dumazet wrote:
> On Sat, 2013-06-29 at 07:36 +0800, Joe Jin wrote:
>> Hi Eric,
>>
>> The patch does not fix the issue; the panic is the same as the one I posted earlier:
>>> BUG: unable to handle kernel paging request at ffff88006d9e8d48
>>> IP: [<ffffffff812605bb>] memcpy+0xb/0x120
>>> PGD 1798067 PUD 1fd2067 PMD 213f067 PTE 0
>>> Oops: 0000 [#1] SMP
>>> CPU 7
>>> Modules linked in: dm_nfs tun nfs fscache auth_rpcgss nfs_acl xen_blkback xen_netback xen_gntdev xen_evtchn lockd sunrpc bridge stp llc bonding be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio dm_round_robin dm_multipath libiscsi_tcp libiscsi scsi_transport_iscsi xenfs xen_privcmd video sbs sbshc acpi_memhotplug acpi_ipmi ipmi_msghandler parport_pc lp parport ixgbe dca sr_mod cdrom bnx2 radeon ttm drm_kms_helper drm snd_seq_dummy i2c_algo_bit i2c_core snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt pcspkr iTCO_vendor_support pata_acpi dcdbas i5k_amb ata_generic hwmon floppy ghes i5000_edac edac_core hed dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage lpfc scsi_transport_fc scsi_tgt ata_piix sg shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod crc_t10dif ext3 jbd mbcache
>>>
>>>
>>> Pid: 0, comm: swapper Tainted: G W 2.6.39-300.32.1.el5uek #1 Dell Inc. PowerEdge 2950/0DP246
>
>
> By the way my patch was for current kernels, not for 2.6.39
>
> For instance, I was not able to reproduce the crash with 3.3
>
> RCU in neighbour code was added in 2.6.37, but it looks like this code
> is a bit fragile because all the kfree_skb() are done while neighbour
> locks are held.
>
> So if a skb destructor triggers a new call to neighbour code, I presume
> some bad things can happen. LOCKDEP could possibly help to detect
> this.
>
> You could try to replace these kfree_skb() calls with dev_kfree_skb_irq()
> just in case.
>
> (Do not forget the __skb_queue_purge() ones)
>
> Try a LOCKDEP build as well.

So far we suspect it is caused by iscsi calling sendpage(): the page is
later unmapped while the stack is still trying to copy the skb. We'll try
disabling sg to see whether it helps.

Thanks,
Joe

2013-06-30 07:50:57

Subject: Re: kernel panic in skb_copy_bits

On Sun, 2013-06-30 at 08:26 +0800, Joe Jin wrote:

> So far we suspect it is caused by iscsi calling sendpage(): the page is
> later unmapped while the stack is still trying to copy the skb. We'll try
> disabling sg to see whether it helps.

sendpage() should increment page refcounts for every page frag of an
skb, therefore page should not be unmapped.

Of course userland can either rewrite the content or munmap() the page,
but the underlying page cannot be freed as long as the skb is not freed.
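
The refcount argument above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel's struct page handling: struct page_model, struct skb_model and the page_model_*/skb_model_* helpers are hypothetical names, and "freeing" is recorded in a flag so the effect can be observed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of a page with a reference count. */
struct page_model {
    int refcount;
    int freed;      /* set when the last reference is dropped */
    char data[64];
};

static void page_model_init(struct page_model *pg, const char *text)
{
    pg->refcount = 1;   /* the owner's reference */
    pg->freed = 0;
    strncpy(pg->data, text, sizeof(pg->data) - 1);
    pg->data[sizeof(pg->data) - 1] = '\0';
}

static void page_model_get(struct page_model *pg)
{
    assert(pg->refcount > 0);   /* cannot revive a freed page */
    pg->refcount++;
}

static void page_model_put(struct page_model *pg)
{
    assert(pg->refcount > 0);
    if (--pg->refcount == 0)
        pg->freed = 1;
}

/* Hypothetical skb with a single page fragment. */
struct skb_model {
    struct page_model *frag;
};

/* sendpage()-style attach: the skb takes its own reference on the page. */
static void skb_model_attach(struct skb_model *skb, struct page_model *pg)
{
    page_model_get(pg);
    skb->frag = pg;
}

/* Freeing the skb drops the fragment's reference. */
static void skb_model_free(struct skb_model *skb)
{
    page_model_put(skb->frag);
    skb->frag = NULL;
}
```

In this model the owner can drop its reference as soon as it likes. The page stays valid for a later retransmit until skb_model_free() drops the last reference.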

2013-06-30 09:24:02

Subject: Re: kernel panic in skb_copy_bits

--On 28 June 2013 12:17:43 +0800 Joe Jin <[email protected]> wrote:

> Found a similar issue:
> http://www.gossamer-threads.com/lists/xen/devel/265611 so I have copied the Xen
> developers as well.

I thought this sounded familiar. I haven't got the start of this
thread, but what version of Xen are you running and what device
model? If before 4.3, there is a page lifetime bug in the kernel
(not the xen code) which can affect anything where the guest accesses
the host's block stack and that in turn accesses the networking
stack (it may in fact be wider than that). So, e.g. domU on
iSCSI will do it. It tends to get triggered by a TCP retransmit
or (on NFS) the RPC equivalent. Essentially block operation
is considered complete, returning through xen and freeing the
grant table entry, and yet something in the kernel (e.g. tcp
retransmit) can still access the data. The nature of the bug
is extensively discussed in that thread - you'll also find
a reference to a thread on linux-nfs which concludes it
isn't an nfs problem, and even some patches to fix it in the
kernel adding reference counting.

A workaround is to turn off O_DIRECT use by Xen as that ensures
the pages are copied. Xen 4.3 does this by default.

I believe fixes for this are in 4.3 and 4.2.2 if using the
qemu upstream DM. Note these aren't real fixes, just a workaround
of a kernel bug.

To fix on a local build of xen you will need something like this:
https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
and something like this (NB: obviously insert your own git
repo and commit numbers)
https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca

Also note those fixes are (technically) unsafe for live migration
unless there is an ordering change made in qemu's block open
call.

Of course this might be something completely different.

--
Alex Bligh

2013-06-30 09:36:17

Subject: Re: kernel panic in skb_copy_bits

--On 30 June 2013 10:13:35 +0100 Alex Bligh <[email protected]> wrote:

> The nature of the bug
> is extensively discussed in that thread - you'll also find
> a reference to a thread on linux-nfs which concludes it
> isn't an nfs problem, and even some patches to fix it in the
> kernel adding reference counting.

Some more links for anyone interested in fixing the kernel bug:

http://lists.xen.org/archives/html/xen-devel/2013-01/msg01618.html
http://www.spinics.net/lists/linux-nfs/msg34913.html
http://www.spinics.net/lists/netdev/msg224106.html

--
Alex Bligh

2013-07-01 03:18:52

Subject: Re: kernel panic in skb_copy_bits

On 06/30/13 17:13, Alex Bligh wrote:
>
>
> --On 28 June 2013 12:17:43 +0800 Joe Jin <[email protected]> wrote:
>
>> Found a similar issue:
>> http://www.gossamer-threads.com/lists/xen/devel/265611 so I have copied the Xen
>> developers as well.
>
> I thought this sounded familiar. I haven't got the start of this
> thread, but what version of Xen are you running and what device
> model? If before 4.3, there is a page lifetime bug in the kernel
> (not the xen code) which can affect anything where the guest accesses
> the host's block stack and that in turn accesses the networking
> stack (it may in fact be wider than that). So, e.g. domU on
> iSCSI will do it. It tends to get triggered by a TCP retransmit
> or (on NFS) the RPC equivalent. Essentially block operation
> is considered complete, returning through xen and freeing the
> grant table entry, and yet something in the kernel (e.g. tcp
> retransmit) can still access the data. The nature of the bug
> is extensively discussed in that thread - you'll also find
> a reference to a thread on linux-nfs which concludes it
> isn't an nfs problem, and even some patches to fix it in the
> kernel adding reference counting.

Do you know if there is a fix for the above? So far we also suspect the
grant page is being unmapped too early; we are using 4.1 stable in our tests.

>
> A workaround is to turn off O_DIRECT use by Xen as that ensures
> the pages are copied. Xen 4.3 does this by default.
>
> I believe fixes for this are in 4.3 and 4.2.2 if using the
> qemu upstream DM. Note these aren't real fixes, just a workaround
> of a kernel bug.

The guest is PVM and the disk model is xvbd; the guest config file is below:

vif = ['mac=00:21:f6:00:00:01,bridge=c0a80b00']
OVM_simple_name = 'Guest#1'
disk = ['file:/OVS/Repositories/0004fb000003000091e9eae94d1e907c/VirtualDisks/0004fb0000120000f78799dad800ef47.img,xvda,w', 'phy:/dev/mapper/360060e8010141870058b415700000002,xvdb,w', 'phy:/dev/mapper/360060e8010141870058b415700000003,xvdc,w']
bootargs = ''
uuid = '0004fb00-0006-0000-2b00-77a4766001ed'
on_reboot = 'restart'
cpu_weight = 27500
OVM_os_type = 'Oracle Linux 5'
cpu_cap = 0
maxvcpus = 8
OVM_high_availability = False
memory = 4096
OVM_description = ''
on_poweroff = 'destroy'
on_crash = 'restart'
bootloader = '/usr/bin/pygrub'
guest_os_type = 'linux'
name = '0004fb00000600002b0077a4766001ed'
vfb = ['type=vnc,vncunused=1,vnclisten=127.0.0.1,keymap=en-us']
vcpus = 8
OVM_cpu_compat_group = ''
OVM_domain_type = 'xen_pvm'

>
> To fix on a local build of xen you will need something like this:
> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
> and something like this (NB: obviously insert your own git
> repo and commit numbers)
> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
>

I think this is only for pvhvm/hvm?

Thanks,
Joe
> Also note those fixes are (technically) unsafe for live migration
> unless there is an ordering change made in qemu's block open
> call.
>
> Of course this might be something completely different.
>

2013-07-01 08:11:30

by Ian Campbell

Subject: Re: kernel panic in skb_copy_bits

On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
> > A workaround is to turn off O_DIRECT use by Xen as that ensures
> > the pages are copied. Xen 4.3 does this by default.
> >
> > I believe fixes for this are in 4.3 and 4.2.2 if using the
> > qemu upstream DM. Note these aren't real fixes, just a workaround
> > of a kernel bug.
>
> The guest is pvm, and disk model is xvbd, guest config file as below:

Do you know which disk backend? The workaround Alex refers to went into
qdisk but I think blkback could still suffer from a variant of the
retransmit issue if you run it over iSCSI.

> > To fix on a local build of xen you will need something like this:
> > https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
> > and something like this (NB: obviously insert your own git
> > repo and commit numbers)
> > https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
> >
>
> I think this only for pvhvm/hvm?

No, the underlying issue affects any PV device which is run over a
network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
cross over the delayed ack and cause I/O to be completed while
retransmits are pending, such as is described in
http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
variant). The problem is that because Xen PV drivers often unmap the
page on I/O completion you get a crash (page fault) on the retransmit.

The issue also affects native but in that case the symptom is "just" a
corrupt packet on the wire. I tried to address this with my "skb
destructor" series but unfortunately I got bogged down on the details,
then I had to take time out to look into some other stuff and never
managed to get back into it. I'd be very grateful if there was someone
who could pick up that work (Alex gave some useful references in another
reply to this thread)

Some PV disk backends (e.g. blktap2) have worked around this by using
grant copy instead of grant map, others (e.g. qdisk) have disabled
O_DIRECT so that the pages are copied into the dom0 page cache and
transmitted from there.

We were discussing recently the possibility of mapping all ballooned out
pages to a single read-only scratch page instead of leaving them empty
in the page tables, this would cause the Xen case to revert to the
native case. I think Thanos was going to take a look into this.

Ian.

2013-07-01 08:29:18

Subject: Re: kernel panic in skb_copy_bits

Joe,

> Do you know if have a fix for above? so far we also suspected the
> grant page be unmapped earlier, we using 4.1 stable during our test.

A true fix? No, but I posted a patch set (see later email message
for a link) that you could forward port. The workaround is:

>> A workaround is to turn off O_DIRECT use by Xen as that ensures
>> the pages are copied. Xen 4.3 does this by default.
>>
>> I believe fixes for this are in 4.3 and 4.2.2 if using the
>> qemu upstream DM. Note these aren't real fixes, just a workaround
>> of a kernel bug.
>
> The guest is pvm, and disk model is xvbd, guest config file as below:
...
> I think this only for pvhvm/hvm?

I don't have much experience outside pvhvm/hvm, but I believe it
should work for any device.

Testing was simple - just find all (*) the references to O_DIRECT
in your device model and remove them!

(*)=you could be less lazy than me and find the right ones.

I am guessing it will be the same ones though.

--
Alex Bligh

2013-07-01 13:01:15

Subject: Re: kernel panic in skb_copy_bits

On 07/01/13 16:11, Ian Campbell wrote:
> On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
>>> A workaround is to turn off O_DIRECT use by Xen as that ensures
>>> the pages are copied. Xen 4.3 does this by default.
>>>
>>> I believe fixes for this are in 4.3 and 4.2.2 if using the
>>> qemu upstream DM. Note these aren't real fixes, just a workaround
>>> of a kernel bug.
>>
>> The guest is pvm, and disk model is xvbd, guest config file as below:
>
> Do you know which disk backend? The workaround Alex refers to went into
> qdisk but I think blkback could still suffer from a variant of the
> retransmit issue if you run it over iSCSI.

The backend is xen-blkback on iSCSI storage.

>
>>> To fix on a local build of xen you will need something like this:
>>> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
>>> and something like this (NB: obviously insert your own git
>>> repo and commit numbers)
>>> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
>>>
>>
>> I think this only for pvhvm/hvm?
>
> No, the underlying issue affects any PV device which is run over a
> network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
> cross over the delayed ack and cause I/O to be completed while
> retransmits are pending, such as is described in
> http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
> variant). The problem is that because Xen PV drivers often unmap the
> page on I/O completion you get a crash (page fault) on the retransmit.
>
To prevent iSCSI's sendpage() call from reusing the page, we disabled
scatter-gather (sg) on the NIC; in our tests the panic went away. This also
confirms that the page had been unmapped by the grant system; the symptom is
the same as in the NFS panic.

> The issue also affects native but in that case the symptom is "just" a
> corrupt packet on the wire. I tried to address this with my "skb
> destructor" series but unfortunately I got bogged down on the details,
> then I had to take time out to look into some other stuff and never
> managed to get back into it. I'd be very grateful if there was someone
> who could pick up that work (Alex gave some useful references in another
> reply to this thread)
>
> Some PV disk backends (e.g. blktap2) have worked around this by using
> grant copy instead of grant map, others (e.g. qdisk) have disabled
> O_DIRECT so that the pages are copied into the dom0 page cache and
> transmitted from there.

That workaround has the same effect as disabling sg on the NIC (with sg
disabled, sendpage() makes its own copy of the page rather than reusing it).

Thanks,
Joe
>
> We were discussing recently the possibility of mapping all ballooned out
> pages to a single read-only scratch page instead of leaving them empty
> in the page tables, this would cause the Xen case to revert to the
> native case. I think Thanos was going to take a look into this.
>
> Ian.
>

2013-07-01 20:36:18

by David Miller

Subject: Re: kernel panic in skb_copy_bits

From: Eric Dumazet
Date: Fri, 28 Jun 2013 02:37:42 -0700

> [PATCH] neighbour: fix a race in neigh_destroy()
>
> There is a race in neighbour code, because neigh_destroy() uses
> skb_queue_purge(&neigh->arp_queue) without holding neighbour lock,
> while other parts of the code assume neighbour rwlock is what
> protects arp_queue
>
> Convert all skb_queue_purge() calls to the __skb_queue_purge() variant
>
> Use __skb_queue_head_init() instead of skb_queue_head_init()
> to make clear we do not use arp_queue.lock
>
> And hold neigh->lock in neigh_destroy() to close the race.
>
> Reported-by: Joe Jin
> Signed-off-by: Eric Dumazet

Applied and queued up for -stable, thanks Eric.

2013-07-04 08:56:28

Subject: Re: kernel panic in skb_copy_bits

On 07/01/13 16:11, Ian Campbell wrote:
> On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
>>> A workaround is to turn off O_DIRECT use by Xen as that ensures
>>> the pages are copied. Xen 4.3 does this by default.
>>>
>>> I believe fixes for this are in 4.3 and 4.2.2 if using the
>>> qemu upstream DM. Note these aren't real fixes, just a workaround
>>> of a kernel bug.
>>
>> The guest is pvm, and disk model is xvbd, guest config file as below:
>
> Do you know which disk backend? The workaround Alex refers to went into
> qdisk but I think blkback could still suffer from a variant of the
> retransmit issue if you run it over iSCSI.
>
>>> To fix on a local build of xen you will need something like this:
>>> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
>>> and something like this (NB: obviously insert your own git
>>> repo and commit numbers)
>>> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
>>>
>>
>> I think this only for pvhvm/hvm?
>
> No, the underlying issue affects any PV device which is run over a
> network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
> cross over the delayed ack and cause I/O to be completed while
> retransmits are pending, such as is described in
> http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
> variant). The problem is that because Xen PV drivers often unmap the
> page on I/O completion you get a crash (page fault) on the retransmit.
>

Can we do it by remembering the grant page's refcount when mapping, and
on unmap checking whether the page refcount is the same as it was at
mapping? This change would be limited to xen-blkback.

Another way is to add a new page flag like PG_send: when sendpage() is
called, set the bit; when the page is put, clear the bit. Then xen-blkback
can wait on the pagequeue.

Thanks,
Joe

> The issue also affects native but in that case the symptom is "just" a
> corrupt packet on the wire. I tried to address this with my "skb
> destructor" series but unfortunately I got bogged down on the details,
> then I had to take time out to look into some other stuff and never
> managed to get back into it. I'd be very grateful if there was someone
> who could pick up that work (Alex gave some useful references in another
> reply to this thread)
>
> Some PV disk backends (e.g. blktap2) have worked around this by using
> grant copy instead of grant map, others (e.g. qdisk) have disabled
> O_DIRECT so that the pages are copied into the dom0 page cache and
> transmitted from there.
>
> We were discussing recently the possibility of mapping all ballooned out
> pages to a single read-only scratch page instead of leaving them empty
> in the page tables, this would cause the Xen case to revert to the
> native case. I think Thanos was going to take a look into this.
>
> Ian.
>

--
Oracle <http://www.oracle.com>
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing

2013-07-04 08:59:50

by Ian Campbell

Subject: Re: kernel panic in skb_copy_bits

On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote:
> On 07/01/13 16:11, Ian Campbell wrote:
> > On Mon, 2013-07-01 at 11:18 +0800, Joe Jin wrote:
> >>> A workaround is to turn off O_DIRECT use by Xen as that ensures
> >>> the pages are copied. Xen 4.3 does this by default.
> >>>
> >>> I believe fixes for this are in 4.3 and 4.2.2 if using the
> >>> qemu upstream DM. Note these aren't real fixes, just a workaround
> >>> of a kernel bug.
> >>
> >> The guest is pvm, and disk model is xvbd, guest config file as below:
> >
> > Do you know which disk backend? The workaround Alex refers to went into
> > qdisk but I think blkback could still suffer from a variant of the
> > retransmit issue if you run it over iSCSI.
> >
> >>> To fix on a local build of xen you will need something like this:
> >>> https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
> >>> and something like this (NB: obviously insert your own git
> >>> repo and commit numbers)
> >>> https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca
> >>>
> >>
> >> I think this only for pvhvm/hvm?
> >
> > No, the underlying issue affects any PV device which is run over a
> > network protocol (NFS, iSCSI etc). In effect a delayed retransmit can
> > cross over the delayed ack and cause I/O to be completed while
> > retransmits are pending, such as is described in
> > http://www.spinics.net/lists/linux-nfs/msg34913.html (the original NFS
> > variant). The problem is that because Xen PV drivers often unmap the
> > page on I/O completion you get a crash (page fault) on the retransmit.
> >
>
> Can we do it by remember grant page refcount when mapping, and when unmap
> check if page refcount as same as mapping? This change will limited in
> xen-blkback.
>
> Another way is add new page flag like PG_send, when sendpage() be called,
> set the bit, when page be put, clear the bit. Then xen-blkback can wait
> on the pagequeue.

These schemes don't work when you have multiple simultaneous I/Os
referencing the same underlying page.
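
One way to see why the "remember the refcount at map time" check breaks down is a toy model. The sketch below is not kernel code: `toy_page`, `toy_io` and every function name are hypothetical, and the page is reduced to a bare counter. It assumes two I/Os (A and B) share one page and that the network stack takes one reference per queued skb.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only a reference count (hypothetical). */
struct toy_page { int refcount; };

/* One in-flight I/O that remembers the refcount it saw when it mapped. */
struct toy_io { struct toy_page *page; int baseline; };

static void toy_io_map(struct toy_io *io, struct toy_page *page)
{
    io->page = page;
    io->baseline = page->refcount;
}

/* The network stack takes and drops a reference per queued skb. */
static void toy_net_get(struct toy_page *p) { p->refcount++; }

static void toy_net_put(struct toy_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* The proposed check: "refcount is back to what I saw at map time". */
static bool toy_io_looks_idle(const struct toy_io *io)
{
    return io->page->refcount == io->baseline;
}

/* Returns true if the check reports "idle" while B's skb is still queued. */
static bool toy_refcount_scheme_is_fooled(void)
{
    struct toy_page page = { .refcount = 1 };
    struct toy_io a, b;

    toy_io_map(&a, &page);  /* A records baseline 1                 */
    toy_net_get(&page);     /* A's skb holds a reference: count 2   */
    toy_io_map(&b, &page);  /* B records baseline 2 (includes A's)  */
    toy_net_get(&page);     /* B's skb holds a reference: count 3   */
    toy_net_put(&page);     /* A's skb is freed: count 2            */

    /* B's skb is still waiting for a possible retransmit here. */
    return toy_io_looks_idle(&b);
}
```

In this model B's baseline already includes A's network reference, so once A's skb is freed the count equals B's baseline even though B's own skb still pins the page. Calling `toy_refcount_scheme_is_fooled()` returns true for that interleaving.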

>
> Thanks,
> Joe
>
> > The issue also affects native but in that case the symptom is "just" a
> > corrupt packet on the wire. I tried to address this with my "skb
> > destructor" series but unfortunately I got bogged down on the details,
> > then I had to take time out to look into some other stuff and never
> > managed to get back into it. I'd be very grateful if there was someone
> > who could pick up that work (Alex gave some useful references in another
> > reply to this thread)
> >
> > Some PV disk backends (e.g. blktap2) have worked around this by using
> > grant copy instead of grant map, others (e.g. qdisk) have disabled
> > O_DIRECT so that the pages are copied into the dom0 page cache and
> > transmitted from there.
> >
> > We were discussing recently the possibility of mapping all ballooned out
> > pages to a single read-only scratch page instead of leaving them empty
> > in the page tables, this would cause the Xen case to revert to the
> > native case. I think Thanos was going to take a look into this.
> >
> > Ian.
> >
>
>

2013-07-04 09:34:30

Subject: Re: kernel panic in skb_copy_bits

On Thu, 2013-07-04 at 09:59 +0100, Ian Campbell wrote:
> On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote:
> >
> > Another way is add new page flag like PG_send, when sendpage() be called,
> > set the bit, when page be put, clear the bit. Then xen-blkback can wait
> > on the pagequeue.
>
> These schemes don't work when you have multiple simultaneous I/Os
> referencing the same underlying page.

So this is a page property, still the patches I saw tried to address
this problem adding networking stuff (destructors) in the skbs.

Given that a page refcount can be transferred between entities, say using
splice() system call, I do not really understand why the fix would imply
networking only.

Let's try to fix it properly, or else we must disable zero copies
because they are not reliable.

Why doesn't sendfile() have the problem, while vmsplice()+splice() does
have this issue?

As soon as a page fragment reference is taken somewhere, the only way to
properly reuse the page is to rely on put_page() and page being freed.

Adding workarounds in the TCP stack to always copy the page fragments in
case of a retransmit is a partial solution, as the remote peer could be
malicious and send an ACK _before_ the page content is actually read by
the NIC.

So if we rely on networking stacks to give the signal for page reuse, we
can have a major security issue.

2013-07-04 09:52:51

by Ian Campbell

Subject: Re: kernel panic in skb_copy_bits

On Thu, 2013-07-04 at 02:34 -0700, Eric Dumazet wrote:
> On Thu, 2013-07-04 at 09:59 +0100, Ian Campbell wrote:
> > On Thu, 2013-07-04 at 16:55 +0800, Joe Jin wrote:
> > >
> > > Another way is add new page flag like PG_send, when sendpage() be called,
> > > set the bit, when page be put, clear the bit. Then xen-blkback can wait
> > > on the pagequeue.
> >
> > These schemes don't work when you have multiple simultaneous I/Os
> > referencing the same underlying page.
>
> So this is a page property, still the patches I saw tried to address
> this problem adding networking stuff (destructors) in the skbs.
>
> Given that a page refcount can be transferred between entities, say using
> splice() system call, I do not really understand why the fix would imply
> networking only.
>
> Let's try to fix it properly, or else we must disable zero copies
> because they are not reliable.
>
> Why sendfile() doesn't have the problem, but vmsplice()+splice() do have
> this issue ?

Might just be that no one has observed it with vmsplice()+splice()? Most
of the time this happens silently and you'll probably never notice, it's
just the behaviour of Xen which escalates the issue into one you can
see.

> As soon as a page fragment reference is taken somewhere, the only way to
> properly reuse the page is to rely on put_page() and page being freed.

Xen's out of tree netback used to fix this by a destructor call back on
page free, but that was a core mm patch in the hot memory free path
which wasn't popular, and it doesn't solve anything for the non-Xen
instances of this issue.

> Adding workarounds in TCP stack to always copy the page fragments in
> case of a retransmit is partial solution, as the remote peer could be
> malicious and send ACK _before_ page content is actually read by the
> NIC.
>
> So if we rely on networking stacks to give the signal for page reuse, we
> can have major security issue.

If you ignore the Xen case and consider just the native case then the
issue isn't page reuse in the sense of getting mapped into another
process, it's the same page in the same process but the process has
written something new to the buffer, e.g.
memset(buf, 0xaa, 4096);
write(fd, buf, 4096);
memset(buf, 0x55, 4096);
(where fd is O_DIRECT on NFS) Can result in 0x55 being seen on the wire
in the TCP retransmit.
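
The sequence above can be replayed in a small user-space model. This is only a sketch under stated assumptions: `toy_segment` and the `toy_*` functions are hypothetical, no real socket or NFS mount is involved, and the "retransmit" is just a late read of whatever the queued segment points at. It contrasts a zero-copy queue with one that copies the payload first, which is what the page-cache and grant-copy workarounds amount to.

```c
#include <assert.h>
#include <string.h>

#define TOY_LEN 4096

/* A queued segment: points at the caller's buffer or at its own copy. */
struct toy_segment {
    const unsigned char *payload;
    unsigned char copy[TOY_LEN];
};

/* Zero-copy path: the segment only references the caller's memory. */
static void toy_queue_zero_copy(struct toy_segment *seg,
                                const unsigned char *buf)
{
    assert(seg && buf);
    seg->payload = buf;
}

/* Copying path: the segment keeps a private snapshot of the data. */
static void toy_queue_copied(struct toy_segment *seg,
                             const unsigned char *buf)
{
    assert(seg && buf);
    memcpy(seg->copy, buf, TOY_LEN);
    seg->payload = seg->copy;
}

/* First byte a late retransmit would put on the wire. */
static unsigned char toy_retransmit_first_byte(const struct toy_segment *seg)
{
    return seg->payload[0];
}

/* Replays: fill 0xaa, queue, "write returns", fill 0x55, retransmit. */
static unsigned char toy_replay(int zero_copy)
{
    static unsigned char buf[TOY_LEN];
    static struct toy_segment seg;

    memset(buf, 0xaa, TOY_LEN);
    if (zero_copy)
        toy_queue_zero_copy(&seg, buf);
    else
        toy_queue_copied(&seg, buf);

    /* The write is reported complete here, so the caller reuses buf. */
    memset(buf, 0x55, TOY_LEN);

    return toy_retransmit_first_byte(&seg);
}
```

In this model `toy_replay(1)` yields 0x55 (the retransmit sees the new contents), while `toy_replay(0)` yields 0xaa because the queued segment no longer depends on the caller's buffer.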

If the retransmit is at the RPC layer then you get a resend of the NFS
write RPC, but the XDR sequence stuff catches that case (I think, memory
is fuzzy).

If the retransmit is at the TCP level then the TCP sequence/ack will
cause the receiver to ignore the corrupt version, but if you replace the
second memset with write_critical_secret_key(buf), then you have an
information leak.

Ian.

2013-07-04 10:12:15

Subject: Re: kernel panic in skb_copy_bits

On Thu, 2013-07-04 at 10:52 +0100, Ian Campbell wrote:

> Might just be that no one has observed it with vmsplice()+splice()? Most
> of the time this happens silently and you'll probably never notice, it's
> just the behaviour of Xen which escalates the issue into one you can
> see.

The point I wanted to make is that nobody can seriously use vmsplice(),
unless the memory is never reused by the application, or the application
doesn't care about the security implications.

Because an application has no way to know when it's safe to reuse the
area for another usage.

[ Unless it uses the obscure and complex pagemap stuff
(Documentation/vm/pagemap.txt), but it's not asynchronous signaling and
not pluggable into epoll()/poll()/select() ]

> Xen's out of tree netback used to fix this by a destructor call back on
> page free, but that was a core mm patch in the hot memory free path
> which wasn't popular, and it doesn't solve anything for the non-Xen
> instances of this issue.

It _is_ a mm core patch which is needed, if we ever want to fix this
problem.

It looks like a typical COW issue to me.

If the page content is written while there is still a reference on this
page, we should allocate a new page and copy the previous content.

And this has little to do with networking.
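
The copy-on-write rule described above can be illustrated with a minimal user-space sketch. Nothing here is an mm patch or an existing kernel interface: `toy_cow_page`, `toy_buf` and the helper functions are hypothetical names, and "a reference held by the network stack" is modelled simply as a refcount above one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy page: a reference count plus its contents (hypothetical). */
struct toy_cow_page {
    int refcount;
    unsigned char data[TOY_PAGE_SIZE];
};

/* The writer's handle on whichever page currently backs its buffer. */
struct toy_buf { struct toy_cow_page *page; };

static struct toy_cow_page *toy_page_alloc(void)
{
    struct toy_cow_page *p = calloc(1, sizeof(*p));
    if (p)
        p->refcount = 1;
    return p;
}

static void toy_page_put(struct toy_cow_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0)
        free(p);
}

/*
 * Write len bytes of value at the start of the buffer. If anyone else
 * still holds a reference, break sharing first: allocate a new page,
 * copy the old contents, and leave the old page to the other holders.
 * Returns 0 on success, -1 on a bad length or allocation failure.
 */
static int toy_buf_fill(struct toy_buf *buf, unsigned char value, size_t len)
{
    if (len > TOY_PAGE_SIZE)
        return -1;

    if (buf->page->refcount > 1) {
        struct toy_cow_page *copy = toy_page_alloc();
        if (!copy)
            return -1;
        memcpy(copy->data, buf->page->data, TOY_PAGE_SIZE);
        toy_page_put(buf->page);  /* drop only the writer's reference */
        buf->page = copy;
    }

    memset(buf->page->data, value, len);
    return 0;
}
```

With this rule, a holder that took its reference before the second write keeps seeing the old bytes, and the writer never needs a signal from the networking stack to know when reuse is safe. The cost is an extra page and a copy whenever a write races with an outstanding reference.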

2013-07-04 12:57:34