2015-04-07 22:57:56

by Tuan Bui

Subject: [BUG REPORT] kernel panic in tcp_sendpage() on null pointer dereference

Hi all,

I am consistently seeing this kernel panic on a 16-socket machine
running a Spark PageRank workload under Docker. I am running the stock
RHEL 7.0 kernel, 3.10.0-123.el7.x86_64.

I believe __skb_insert() may be dereferencing a NULL prev pointer.
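
To make the hypothesis concrete, here is a minimal user-space Python model of a tail insert into a circular doubly linked queue. It is only a sketch: the names Node, queue_init, insert and queue_tail are made up for illustration and this is not the kernel's code. It assumes the queue head's prev pointer has somehow become NULL (None here), which is what RAX = 0 at the faulting store would suggest.

```python
class Node:
    """Hypothetical stand-in for the next/prev pointers of a queue entry."""

    def __init__(self, name):
        self.name = name
        self.next = None
        self.prev = None


def queue_init(head):
    # An empty circular queue points at itself in both directions.
    head.next = head
    head.prev = head


def insert(newsk, prev, nxt):
    # Store order mirrors the three final instructions in the disassembly:
    # newsk->prev, then newsk->next, then prev->next.
    newsk.prev = prev
    newsk.next = nxt
    prev.next = newsk  # fails here if prev is None (the modelled NULL deref)
    nxt.prev = newsk


def queue_tail(head, newsk):
    # Appending at the tail means inserting between head.prev and head.
    insert(newsk, head.prev, head)


if __name__ == "__main__":
    head = Node("head")
    queue_init(head)
    queue_tail(head, Node("a"))   # healthy queue: works
    head.prev = None              # assumed corruption of the tail pointer
    try:
        queue_tail(head, Node("b"))
    except AttributeError as exc:
        print("modelled NULL dereference:", exc)
```

In this model the first two stores succeed and only the third one faults, which is the same pattern as the oops: the two writes into the new entry complete and the write through prev is the one that traps.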

A complete dmesg and a disassembly log are attached.

Stack Trace:
[ 6169.148712] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 6169.157531] IP: [<ffffffff8151829d>] tcp_sendpage+0x44d/0x6d0
[ 6169.163995] PGD 49bcfb83067 PUD 49bcfb82067 PMD 0
[ 6169.169520] Oops: 0002 [#1] SMP
[ 6169.173230] Modules linked in: veth xt_addrtype ipt_MASQUERADE
dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio loop ext4 mbcache
jbd2 ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat
ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack
iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg
iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul
glue_helper ablk_helper cryptd pcspkr ses enclosure ixgbe ptp hpilo
hpwdt pps_core mdio sb_edac ioatdma lpc_ich edac_core mfd_core
[ 6169.255799] dca shpchp ipmi_si ipmi_msghandler mperf vfat fat btrfs
zlib_deflate raid6_pq xor xfs libcrc32c dm_service_time sd_mod
crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit qla2xxx drm_kms_helper ttm scsi_transport_fc drm scsi_tgt
i2c_core dm_mirror dm_region_hash dm_log dm_multipath dm_mod
[ 6169.289059] CPU: 87 PID: 205310 Comm: java Not tainted
3.10.0-123.el7.x86_64 #1
[ 6169.297161] Hardware name: HP Superdome2 16s x86, BIOS Bundle:
005.073.000 SFW: 015.082.000 08/08/2014
[ 6169.307538] task: ffff8c9bccf38000 ti: ffff8c94c506c000 task.ti:
ffff8c94c506c000
[ 6169.315890] RIP: 0010:[<ffffffff8151829d>] [<ffffffff8151829d>]
tcp_sendpage+0x44d/0x6d0
[ 6169.325229] RSP: 0018:ffff8c94c506dbe0 EFLAGS: 00010202
[ 6169.331182] RAX: 0000000000000000 RBX: ffff918a6dfda800 RCX:
ffff918a6dfda938
[ 6169.339149] RDX: 0000000000000110 RSI: ffff918a6dfda938 RDI:
0000000000000000
[ 6169.347141] RBP: ffff8c94c506dc58 R08: 00000000000002c0 R09:
0000000000000500
[ 6169.355100] R10: ffff88bd7f406e80 R11: 0000000000000000 R12:
0000000000020040
[ 6169.363074] R13: 0000000000000000 R14: 0000000000000219 R15:
ffffea1253a1c800
[ 6169.371045] FS: 00007f648fefe700(0000) GS:ffff8c9cffb80000(0000)
knlGS:0000000000000000
[ 6169.380075] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6169.386507] CR2: 0000000000000000 CR3: 0000049bd848c000 CR4:
00000000001407e0
[ 6169.394540] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 6169.402612] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 6169.410621] Stack:
[ 6169.412984] 0000000100000000 ffff8ed00a267c00 ffffffff00000de7
0000000000000de7
[ 6169.422080] 0000000000000001 ffff918a6dfda938 00000de7000005a8
000010f853ab6ec0
[ 6169.431181] 0000000000000000 00000000a54958f0 ffff918a6dfda800
ffff8c9bd5e4fd00
[ 6169.440262] Call Trace:
[ 6169.443190] [<ffffffff811dc940>] ? splice_from_pipe_feed+0x120/0x120
[ 6169.450512] [<ffffffff81542ade>] inet_sendpage+0x6e/0xe0
[ 6169.456766] [<ffffffff814b668b>] kernel_sendpage+0x1b/0x30
[ 6169.463193] [<ffffffff814b66c7>] sock_sendpage+0x27/0x30
[ 6169.469387] [<ffffffff811dc9a3>] pipe_to_sendpage+0x63/0xa0
[ 6169.475864] [<ffffffff811dc89e>] splice_from_pipe_feed+0x7e/0x120
[ 6169.482900] [<ffffffff811dc940>] ? splice_from_pipe_feed+0x120/0x120
[ 6169.490215] [<ffffffff811dcc1e>] __splice_from_pipe+0x6e/0x90
[ 6169.496881] [<ffffffff811dc940>] ? splice_from_pipe_feed+0x120/0x120
[ 6169.504198] [<ffffffff811de82e>] splice_from_pipe+0x5e/0x90
[ 6169.510659] [<ffffffff811de860>] ? splice_from_pipe+0x90/0x90
[ 6169.517342] [<ffffffff811de875>] generic_splice_sendpage+0x15/0x20
[ 6169.524464] [<ffffffff811dd361>] do_splice_from+0x91/0x100
[ 6169.530900] [<ffffffff811dd3f0>] direct_splice_actor+0x20/0x30
[ 6169.537686] [<ffffffff811dd114>] splice_direct_to_actor+0xd4/0x200
[ 6169.544915] [<ffffffff811dd3d0>] ? do_splice_from+0x100/0x100
[ 6169.551594] [<ffffffff811de912>] do_splice_direct+0x62/0x90
[ 6169.558070] [<ffffffff811b0193>] do_sendfile+0x1c3/0x340
[ 6169.564248] [<ffffffff811b130e>] SyS_sendfile64+0x5e/0xb0
[ 6169.570521] [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[ 6169.577273] Code: 10 88 41 7c 8b 81 dc 00 00 00 48 03 81 e0 00 00 00
f0 81 40 24 00 00 01 00 48 8b 83 40 01 00 00 48 8b 75 b0 48 89 41 08 48
89 31 <48> 89 08 83 83 48 01 00 00 01 48 83 bb 68 02 00 00 00 48 89 8b
[ 6169.606298] RIP [<ffffffff8151829d>] tcp_sendpage+0x44d/0x6d0
[ 6169.612956] RSP <ffff8c94c506dbe0>
[ 6169.616925] CR2: 0000000000000000
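
For readers decoding the trace: on x86 the number after "Oops:" is the page-fault error code, whose low bits are defined by the hardware. The following Python sketch decodes it; the function name decode_oops and the table name PF_BITS are hypothetical helpers written for this note, not part of any kernel tool.

```python
import re

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_oops(line):
    """Return (code, [descriptions]) for a dmesg line containing 'Oops: XXXX'."""
    match = re.search(r"Oops:\s*([0-9a-fA-F]{4})", line)
    if match is None:
        raise ValueError("no Oops error code found")
    code = int(match.group(1), 16)
    meanings = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text:
            meanings.append(text)
    return code, meanings


if __name__ == "__main__":
    print(decode_oops("[ 6169.169520] Oops: 0002 [#1] SMP"))
```

For the line in this report the result is (2, ['page not present', 'write access', 'kernel mode']), that is, a kernel-mode write to an unmapped address, which agrees with CR2 being 0.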

Partial disassembly up to the crash (the full one is attached):
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 879
0xffffffff8151823d <tcp_sendpage+1005>: test %rax,%rax
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 878
0xffffffff81518240 <tcp_sendpage+1008>: mov %rax,%rcx
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 879
0xffffffff81518243 <tcp_sendpage+1011>: je 0xffffffff815181b0 <tcp_sendpage+864>
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 604
0xffffffff81518249 <tcp_sendpage+1017>: movl $0x0,0x74(%rax)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 605
0xffffffff81518250 <tcp_sendpage+1024>: mov 0x650(%rbx),%eax
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 606
0xffffffff81518256 <tcp_sendpage+1030>: movb $0x10,0x4c(%rcx)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 607
0xffffffff8151825a <tcp_sendpage+1034>: movb $0x0,0x4d(%rcx)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp.c: 605
0xffffffff8151825e <tcp_sendpage+1038>: mov %eax,0x44(%rcx)
0xffffffff81518261 <tcp_sendpage+1041>: mov %eax,0x40(%rcx)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 963
0xffffffff81518264 <tcp_sendpage+1044>: movzbl 0x7c(%rcx),%eax
0xffffffff81518268 <tcp_sendpage+1048>: test $0x10,%al
0xffffffff8151826a <tcp_sendpage+1050>: jne 0xffffffff815184da <tcp_sendpage+1674>
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 964
0xffffffff81518270 <tcp_sendpage+1056>: or $0x10,%eax
0xffffffff81518273 <tcp_sendpage+1059>: mov %al,0x7c(%rcx)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 792
0xffffffff81518276 <tcp_sendpage+1062>: mov 0xdc(%rcx),%eax
0xffffffff8151827c <tcp_sendpage+1068>: add 0xe0(%rcx),%rax
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/arch/x86/include/asm/atomic.h: 49
0xffffffff81518283 <tcp_sendpage+1075>: lock addl $0x10000,0x24(%rax)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 1271
0xffffffff8151828b <tcp_sendpage+1083>: mov 0x140(%rbx),%rax
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 1163
0xffffffff81518292 <tcp_sendpage+1090>: mov -0x50(%rbp),%rsi
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 1164
0xffffffff81518296 <tcp_sendpage+1094>: mov %rax,0x8(%rcx)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 1163
0xffffffff8151829a <tcp_sendpage+1098>: mov %rsi,(%rcx)
/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/include/linux/skbuff.h: 1165
0xffffffff8151829d <tcp_sendpage+1101>: mov %rcx,(%rax)
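
As a cross-check, the byte marked with angle brackets in the "Code:" line of the oops should be the start of the instruction at the faulting RIP. The Python sketch below pulls out those bytes and decodes them by hand. It deliberately handles only one encoding (REX.W + opcode 0x89 with a plain register-indirect destination); the helper names faulting_bytes and decode_mov_store are hypothetical, and this is not a general disassembler.

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_text, count=3):
    """Return `count` bytes starting at the <xx> marker in an oops Code: dump."""
    tokens = code_text.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:start + count])


def decode_mov_store(code):
    """Decode 'REX.W 89 /r' with mod=00 into AT&T syntax; reject anything else."""
    rex, opcode, modrm = code[0], code[1], code[2]
    if rex != 0x48 or opcode != 0x89:
        raise ValueError("only REX.W + 0x89 (mov r64 -> r/m64) is handled")
    mod = modrm >> 6
    reg = (modrm >> 3) & 7
    rm = modrm & 7
    if mod != 0 or rm in (4, 5):
        # rm=4 needs a SIB byte and rm=5 is RIP-relative; both are out of scope.
        raise ValueError("only plain register-indirect destinations are handled")
    return f"mov %{REG64[reg]},(%{REG64[rm]})"


if __name__ == "__main__":
    tail = "89 31 <48> 89 08 83 83 48 01 00 00 01"
    print(decode_mov_store(faulting_bytes(tail)))
```

For the bytes 48 89 08 this gives mov %rcx,(%rax), matching the last line of the disassembly. With RAX = 0 in the register dump, that store goes to address 0, consistent with CR2. RAX was loaded from 0x140(%rbx) a few instructions earlier, so the NULL value appears to come from the socket structure itself rather than from the newly allocated skb in RCX.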


Attachments:
dis.txt (25.96 kB)
dmesg.txt (368.17 kB)

2015-04-07 23:33:45

by Eric Dumazet

Subject: Re: [BUG REPORT] kernel panic in tcp_sendpage() on null pointer dereference

On Tue, 2015-04-07 at 15:57 -0700, Tuan Bui wrote:
> Hi all,
>
> I am consistently seeing this kernel panic on a 16-socket machine
> running a Spark PageRank workload under Docker. I am running the stock
> RHEL 7.0 kernel, 3.10.0-123.el7.x86_64.

Have you tried a recent upstream kernel?


2015-04-08 00:01:28

by Tuan Bui

Subject: Re: [BUG REPORT] kernel panic in tcp_sendpage() on null pointer dereference

On Tue, 2015-04-07 at 16:33 -0700, Eric Dumazet wrote:
> On Tue, 2015-04-07 at 15:57 -0700, Tuan Bui wrote:
> > Hi all,
> >
> > I am consistently seeing this kernel panic on a 16-socket machine
> > running a Spark PageRank workload under Docker. I am running the stock
> > RHEL 7.0 kernel, 3.10.0-123.el7.x86_64.
>
> Have you tried a recent upstream kernel?
>

Yes, I have tried an upstream v4.0-rc4 kernel and still get the same
kernel panic.



2015-04-09 09:20:42

by Eric Dumazet

Subject: Re: [BUG REPORT] kernel panic in tcp_sendpage() on null pointer dereference

On Tue, 2015-04-07 at 17:01 -0700, Tuan Bui wrote:
> On Tue, 2015-04-07 at 16:33 -0700, Eric Dumazet wrote:
> > On Tue, 2015-04-07 at 15:57 -0700, Tuan Bui wrote:
> > > Hi all,
> > >
> > > I am consistently seeing this kernel panic on a 16-socket machine
> > > running a Spark PageRank workload under Docker. I am running the stock
> > > RHEL 7.0 kernel, 3.10.0-123.el7.x86_64.
> >
> > Have you tried a recent upstream kernel?
> >
>
> Yes, I have tried an upstream v4.0-rc4 kernel and still get the same
> kernel panic.

This looks like a use-after-free on an skb, but I see no smoking gun in
the sendpage() paths.

You might try a debugging kernel build to detect this (ASAN, maybe).