LinuxLists.cc - 3.9-rc1 NULL pointer crash at find_pid

2013-03-07 09:37:17

Subject: 3.9-rc1 NULL pointer crash at find_pid_ns

Just came across this during LTP run on a ppc64 system. Still trying to
reproduce and possible bisect, but want to give an early head-up to see
if anyone see anything obvious.

CAI Qian

[ 6476.040024] Unable to handle kernel paging request for data at address 0xfffffffffffffff0
[ 6476.040051] Faulting instruction address: 0xc0000000000af8ac
[ 6476.040060] Oops: Kernel access of bad area, sig: 11 [#1]
[ 6476.040067] SMP NR_CPUS=1024 NUMA pSeries
[ 6476.040077] Modules linked in: tun binfmt_misc hidp cmtp kernelcapi rfcomm l2tp_ppp l2tp_netlink l2tp_core bnep nfc af_802154 pppoe pppox ppp_generic slhc rds af_key atm sctp ip6table_filter ip6_tables iptable_filter ip_tables btrfs raid6_pq xor vfat fat nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache nfnetlink_log nfnetlink bluetooth rfkill arc4 md4 nls_utf8 cifs dns_resolver nf_tproxy_core nls_koi8_u nls_cp932 ts_kmp fuse sg ehea xfs libcrc32c sd_mod crc_t10dif ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipt_REJECT]
[ 6476.040204] NIP: c0000000000af8ac LR: c0000000000b07e0 CTR: 0000000000000000
[ 6476.040213] REGS: c00000011ae73480 TRAP: 0300 Not tainted (3.9.0-rc1)
[ 6476.040221] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 88008488 XER: 20000000
[ 6476.040243] SOFTE: 1
[ 6476.040248] CFAR: c000000000005f1c
[ 6476.040253] DAR: fffffffffffffff0, DSISR: 40000000
[ 6476.040260] TASK = c00000006be719e0[26514] 'ps' THREAD: c00000011ae70000 CPU: 26
GPR00: c000000000299c34 c00000011ae73700 c0000000010f3a18 00000000000050d5
GPR04: c000000001047ec0 0000000000000011 a000000000000000 9e97fbecc2b0cf95
GPR08: c0000001472521c0 fffffffffffffff0 c0000000014d0000 c000000144886026
GPR12: 0000000024008422 c00000000ed96800 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c00000011ae73ab0
GPR24: c00000014488602b 0000000000000000 0000000000000004 fffffffffffff000
GPR28: c0000001fd040040 c0000001fd040040 c0000001f2ed3540 0000000000000011
[ 6476.040371] NIP [c0000000000af8ac] .find_pid_ns+0x8c/0xd0
[ 6476.040379] LR [c0000000000b07e0] .find_task_by_pid_ns+0x10/0x50
[ 6476.040385] Call Trace:
[ 6476.040394] [c00000011ae73700] [c00000011ae737c0] 0xc00000011ae737c0 (unreliable)
[ 6476.040407] [c00000011ae73770] [c000000000299c34] .proc_pid_lookup+0xc4/0x1a0
[ 6476.040416] [c00000011ae73800] [c0000000002942f4] .proc_root_lookup+0x44/0x80
[ 6476.040427] [c00000011ae73890] [c00000000021b300] .lookup_real+0x40/0x90
[ 6476.040437] [c00000011ae73910] [c00000000021bd00] .__lookup_hash+0x40/0x60
[ 6476.040446] [c00000011ae739a0] [c00000000021c7d0] .lookup_slow+0x60/0x100
[ 6476.040456] [c00000011ae73a30] [c00000000021da08] .link_path_walk+0x8d8/0xaa0
[ 6476.040466] [c00000011ae73b40] [c000000000221a98] .path_openat+0xc8/0x5c0
[ 6476.040476] [c00000011ae73c60] [c0000000002223f0] .do_filp_open+0x40/0xb0
[ 6476.040486] [c00000011ae73d80] [c00000000020d470] .do_sys_open+0x140/0x250
[ 6476.040497] [c00000011ae73e30] [c000000000009c54] syscall_exit+0x0/0x98
[ 6476.040504] Instruction dump:
[ 6476.040510] 7d2a482a 3929fff0 48000020 60000000 60000000 e9490010 2faa0000 419e0048
[ 6476.040528] e9290010 3929fff0 2fa90000 419e0038 <81490000> 7f8a1800 409effdc e9490008
[ 6476.040551] ---[ end trace 5c3fc2ac5c10d1e8 ]---
[ 6476.042165]

2013-03-07 09:42:26

by Zefan Li

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

Cc: [email protected]
Cc: "Eric W. Biederman" <[email protected]>
Cc: container

This is a second report... and the same address: 0xfffffffffffffff0

On 2013/3/7 17:37, CAI Qian wrote:
> Just came across this during LTP run on a ppc64 system. Still trying to
> reproduce and possible bisect, but want to give an early head-up to see
> if anyone see anything obvious.
>
> CAI Qian
>
> [ 6476.040024] Unable to handle kernel paging request for data at address 0xfffffffffffffff0
> [ 6476.040051] Faulting instruction address: 0xc0000000000af8ac
> [ 6476.040060] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 6476.040067] SMP NR_CPUS=1024 NUMA pSeries
> [ 6476.040077] Modules linked in: tun binfmt_misc hidp cmtp kernelcapi rfcomm l2tp_ppp l2tp_netlink l2tp_core bnep nfc af_802154 pppoe pppox ppp_generic slhc rds af_key atm sctp ip6table_filter ip6_tables iptable_filter ip_tables btrfs raid6_pq xor vfat fat nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache nfnetlink_log nfnetlink bluetooth rfkill arc4 md4 nls_utf8 cifs dns_resolver nf_tproxy_core nls_koi8_u nls_cp932 ts_kmp fuse sg ehea xfs libcrc32c sd_mod crc_t10dif ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipt_REJECT]
> [ 6476.040204] NIP: c0000000000af8ac LR: c0000000000b07e0 CTR: 0000000000000000
> [ 6476.040213] REGS: c00000011ae73480 TRAP: 0300 Not tainted (3.9.0-rc1)
> [ 6476.040221] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 88008488 XER: 20000000
> [ 6476.040243] SOFTE: 1
> [ 6476.040248] CFAR: c000000000005f1c
> [ 6476.040253] DAR: fffffffffffffff0, DSISR: 40000000
> [ 6476.040260] TASK = c00000006be719e0[26514] 'ps' THREAD: c00000011ae70000 CPU: 26
> GPR00: c000000000299c34 c00000011ae73700 c0000000010f3a18 00000000000050d5
> GPR04: c000000001047ec0 0000000000000011 a000000000000000 9e97fbecc2b0cf95
> GPR08: c0000001472521c0 fffffffffffffff0 c0000000014d0000 c000000144886026
> GPR12: 0000000024008422 c00000000ed96800 0000000000000000 0000000000000000
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 0000000000000000 0000000000000000 0000000000000000 c00000011ae73ab0
> GPR24: c00000014488602b 0000000000000000 0000000000000004 fffffffffffff000
> GPR28: c0000001fd040040 c0000001fd040040 c0000001f2ed3540 0000000000000011
> [ 6476.040371] NIP [c0000000000af8ac] .find_pid_ns+0x8c/0xd0
> [ 6476.040379] LR [c0000000000b07e0] .find_task_by_pid_ns+0x10/0x50
> [ 6476.040385] Call Trace:
> [ 6476.040394] [c00000011ae73700] [c00000011ae737c0] 0xc00000011ae737c0 (unreliable)
> [ 6476.040407] [c00000011ae73770] [c000000000299c34] .proc_pid_lookup+0xc4/0x1a0
> [ 6476.040416] [c00000011ae73800] [c0000000002942f4] .proc_root_lookup+0x44/0x80
> [ 6476.040427] [c00000011ae73890] [c00000000021b300] .lookup_real+0x40/0x90
> [ 6476.040437] [c00000011ae73910] [c00000000021bd00] .__lookup_hash+0x40/0x60
> [ 6476.040446] [c00000011ae739a0] [c00000000021c7d0] .lookup_slow+0x60/0x100
> [ 6476.040456] [c00000011ae73a30] [c00000000021da08] .link_path_walk+0x8d8/0xaa0
> [ 6476.040466] [c00000011ae73b40] [c000000000221a98] .path_openat+0xc8/0x5c0
> [ 6476.040476] [c00000011ae73c60] [c0000000002223f0] .do_filp_open+0x40/0xb0
> [ 6476.040486] [c00000011ae73d80] [c00000000020d470] .do_sys_open+0x140/0x250
> [ 6476.040497] [c00000011ae73e30] [c000000000009c54] syscall_exit+0x0/0x98
> [ 6476.040504] Instruction dump:
> [ 6476.040510] 7d2a482a 3929fff0 48000020 60000000 60000000 e9490010 2faa0000 419e0048
> [ 6476.040528] e9290010 3929fff0 2fa90000 419e0038 <81490000> 7f8a1800 409effdc e9490008
> [ 6476.040551] ---[ end trace 5c3fc2ac5c10d1e8 ]---
> [ 6476.042165]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
> .
>

2013-03-07 09:59:41

by Eric W. Biederman

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

Li Zefan <[email protected]> writes:

> Cc: [email protected]
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: container
>
> This is a second report... and the same address: 0xfffffffffffffff0

Actually this is the third report I have seen with that address, and the
others were on x86_64.

The obvious answer is that there is something subtlely wrong with:

commit b67bfe0d42cac56c512dd5da4b1b347a23f4b70a
Author: Sasha Levin <[email protected]>
Date: Wed Feb 27 17:06:00 2013 -0800

hlist: drop the node parameter from iterators

This is the only change the pid namespace that I am aware of in 3.9-rc1.

If you can reproduce this somewhat readily can you please revert the
hlist change and see if this continues to happen. Right now there are
no other code changes that I can see. And the address
0xfffffffffffffff0 is consistent with a bug in hlist_for_each_entry_rcu.

Eric

> On 2013/3/7 17:37, CAI Qian wrote:
>> Just came across this during LTP run on a ppc64 system. Still trying to
>> reproduce and possible bisect, but want to give an early head-up to see
>> if anyone see anything obvious.
>>
>> CAI Qian
>>
>> [ 6476.040024] Unable to handle kernel paging request for data at address 0xfffffffffffffff0
>> [ 6476.040051] Faulting instruction address: 0xc0000000000af8ac
>> [ 6476.040060] Oops: Kernel access of bad area, sig: 11 [#1]
>> [ 6476.040067] SMP NR_CPUS=1024 NUMA pSeries
>> [ 6476.040077] Modules linked in: tun binfmt_misc hidp cmtp kernelcapi rfcomm l2tp_ppp l2tp_netlink l2tp_core bnep nfc af_802154 pppoe pppox ppp_generic slhc rds af_key atm sctp ip6table_filter ip6_tables iptable_filter ip_tables btrfs raid6_pq xor vfat fat nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache nfnetlink_log nfnetlink bluetooth rfkill arc4 md4 nls_utf8 cifs dns_resolver nf_tproxy_core nls_koi8_u nls_cp932 ts_kmp fuse sg ehea xfs libcrc32c sd_mod crc_t10dif ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipt_REJECT]
>> [ 6476.040204] NIP: c0000000000af8ac LR: c0000000000b07e0 CTR: 0000000000000000
>> [ 6476.040213] REGS: c00000011ae73480 TRAP: 0300 Not tainted (3.9.0-rc1)
>> [ 6476.040221] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 88008488 XER: 20000000
>> [ 6476.040243] SOFTE: 1
>> [ 6476.040248] CFAR: c000000000005f1c
>> [ 6476.040253] DAR: fffffffffffffff0, DSISR: 40000000
>> [ 6476.040260] TASK = c00000006be719e0[26514] 'ps' THREAD: c00000011ae70000 CPU: 26
>> GPR00: c000000000299c34 c00000011ae73700 c0000000010f3a18 00000000000050d5
>> GPR04: c000000001047ec0 0000000000000011 a000000000000000 9e97fbecc2b0cf95
>> GPR08: c0000001472521c0 fffffffffffffff0 c0000000014d0000 c000000144886026
>> GPR12: 0000000024008422 c00000000ed96800 0000000000000000 0000000000000000
>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> GPR20: 0000000000000000 0000000000000000 0000000000000000 c00000011ae73ab0
>> GPR24: c00000014488602b 0000000000000000 0000000000000004 fffffffffffff000
>> GPR28: c0000001fd040040 c0000001fd040040 c0000001f2ed3540 0000000000000011
>> [ 6476.040371] NIP [c0000000000af8ac] .find_pid_ns+0x8c/0xd0
>> [ 6476.040379] LR [c0000000000b07e0] .find_task_by_pid_ns+0x10/0x50
>> [ 6476.040385] Call Trace:
>> [ 6476.040394] [c00000011ae73700] [c00000011ae737c0] 0xc00000011ae737c0 (unreliable)
>> [ 6476.040407] [c00000011ae73770] [c000000000299c34] .proc_pid_lookup+0xc4/0x1a0
>> [ 6476.040416] [c00000011ae73800] [c0000000002942f4] .proc_root_lookup+0x44/0x80
>> [ 6476.040427] [c00000011ae73890] [c00000000021b300] .lookup_real+0x40/0x90
>> [ 6476.040437] [c00000011ae73910] [c00000000021bd00] .__lookup_hash+0x40/0x60
>> [ 6476.040446] [c00000011ae739a0] [c00000000021c7d0] .lookup_slow+0x60/0x100
>> [ 6476.040456] [c00000011ae73a30] [c00000000021da08] .link_path_walk+0x8d8/0xaa0
>> [ 6476.040466] [c00000011ae73b40] [c000000000221a98] .path_openat+0xc8/0x5c0
>> [ 6476.040476] [c00000011ae73c60] [c0000000002223f0] .do_filp_open+0x40/0xb0
>> [ 6476.040486] [c00000011ae73d80] [c00000000020d470] .do_sys_open+0x140/0x250
>> [ 6476.040497] [c00000011ae73e30] [c000000000009c54] syscall_exit+0x0/0x98
>> [ 6476.040504] Instruction dump:
>> [ 6476.040510] 7d2a482a 3929fff0 48000020 60000000 60000000 e9490010 2faa0000 419e0048
>> [ 6476.040528] e9290010 3929fff0 2fa90000 419e0038 <81490000> 7f8a1800 409effdc e9490008
>> [ 6476.040551] ---[ end trace 5c3fc2ac5c10d1e8 ]---
>> [ 6476.042165]
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>> .
>>

2013-03-07 17:35:51

by Sasha Levin

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On 03/07/2013 04:59 AM, [email protected] wrote:
> Li Zefan <[email protected]> writes:
>
>> Cc: [email protected]
>> Cc: "Eric W. Biederman" <[email protected]>
>> Cc: container
>>
>> This is a second report... and the same address: 0xfffffffffffffff0
>
> Actually this is the third report I have seen with that address, and the
> others were on x86_64.
>
> The obvious answer is that there is something subtlely wrong with:
>
> commit b67bfe0d42cac56c512dd5da4b1b347a23f4b70a
> Author: Sasha Levin <[email protected]>
> Date: Wed Feb 27 17:06:00 2013 -0800
>
> hlist: drop the node parameter from iterators
>
>
> This is the only change the pid namespace that I am aware of in 3.9-rc1.
>
> If you can reproduce this somewhat readily can you please revert the
> hlist change and see if this continues to happen. Right now there are
> no other code changes that I can see. And the address
> 0xfffffffffffffff0 is consistent with a bug in hlist_for_each_entry_rcu.

Looks like the hlist change is probably the issue, though it specifically
uses:

#define hlist_entry_safe(ptr, type, member) \
(ptr) ? hlist_entry(ptr, type, member) : NULL

I'm still looking at the code in question and it's assembly, but I can't
figure out what's going wrong. I was also trying to see what's so special
about this loop in find_pid_ns as opposed to the rest of the kernel code
that uses hlist_for_each_entry_rcu() but couldn't find out why.

Is it somehow possible that if we rcu_dereference_raw() the same thing twice
inside the same rcu_read_lock() section we'll get different results? That's
really the only reason for this crash that comes to mind at the moment, very
unlikely - but that's all I have right now.

Is this bug reproducible easily on your setup? I've managed to reproduce it
exactly 3 times in the past month or so, twice when I reported it and only
once since then - at some point I thought that it was a freak compiler issue
that went away when the code changed but since you're reporting it again
I guess that it isn't the case.

Paul, any chance you can give hlist_for_each_entry_rcu() a second look please?
I know you've already acked it before, but is it possible I missed a subtle
detail with RCU that causes this?

Thanks,
Sasha

2013-03-07 17:46:15

by Eric Dumazet

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:

> Looks like the hlist change is probably the issue, though it specifically
> uses:
>
> #define hlist_entry_safe(ptr, type, member) \
> (ptr) ? hlist_entry(ptr, type, member) : NULL
>
> I'm still looking at the code in question and it's assembly, but I can't
> figure out what's going wrong. I was also trying to see what's so special
> about this loop in find_pid_ns as opposed to the rest of the kernel code
> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>
> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
> inside the same rcu_read_lock() section we'll get different results? That's
> really the only reason for this crash that comes to mind at the moment, very
> unlikely - but that's all I have right now.
>

Yep

#define hlist_entry_safe(ptr, type, member) \
(ptr) ? hlist_entry(ptr, type, member) : NULL

Is not safe, as ptr can be evaluated twice, and thats not good at all...

> Is this bug reproducible easily on your setup? I've managed to reproduce it
> exactly 3 times in the past month or so, twice when I reported it and only
> once since then - at some point I thought that it was a freak compiler issue
> that went away when the code changed but since you're reporting it again
> I guess that it isn't the case.
>
>
> Paul, any chance you can give hlist_for_each_entry_rcu() a second look please?
> I know you've already acked it before, but is it possible I missed a subtle
> detail with RCU that causes this?
>

2013-03-07 17:50:38

by Sasha Levin

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On 03/07/2013 12:46 PM, Eric Dumazet wrote:
> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
>
>> Looks like the hlist change is probably the issue, though it specifically
>> uses:
>>
>> #define hlist_entry_safe(ptr, type, member) \
>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>
>> I'm still looking at the code in question and it's assembly, but I can't
>> figure out what's going wrong. I was also trying to see what's so special
>> about this loop in find_pid_ns as opposed to the rest of the kernel code
>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>>
>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
>> inside the same rcu_read_lock() section we'll get different results? That's
>> really the only reason for this crash that comes to mind at the moment, very
>> unlikely - but that's all I have right now.
>>
>
> Yep
>
> #define hlist_entry_safe(ptr, type, member) \
> (ptr) ? hlist_entry(ptr, type, member) : NULL
>
> Is not safe, as ptr can be evaluated twice, and thats not good at all...

ptr is being evaluated twice, but in this case this is an
rcu_dereference_raw() value within the same rcu_read_lock() section.

Is it still problematic?

Thanks,
Sasha

2013-03-07 18:04:43

by Paul E. McKenney

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Thu, Mar 07, 2013 at 12:50:47PM -0500, Sasha Levin wrote:
> On 03/07/2013 12:46 PM, Eric Dumazet wrote:
> > On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
> >
> >> Looks like the hlist change is probably the issue, though it specifically
> >> uses:
> >>
> >> #define hlist_entry_safe(ptr, type, member) \
> >> (ptr) ? hlist_entry(ptr, type, member) : NULL
> >>
> >> I'm still looking at the code in question and it's assembly, but I can't
> >> figure out what's going wrong. I was also trying to see what's so special
> >> about this loop in find_pid_ns as opposed to the rest of the kernel code
> >> that uses hlist_for_each_entry_rcu() but couldn't find out why.
> >>
> >> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
> >> inside the same rcu_read_lock() section we'll get different results? That's
> >> really the only reason for this crash that comes to mind at the moment, very
> >> unlikely - but that's all I have right now.
> >
> > Yep
> >
> > #define hlist_entry_safe(ptr, type, member) \
> > (ptr) ? hlist_entry(ptr, type, member) : NULL
> >
> > Is not safe, as ptr can be evaluated twice, and thats not good at all...
>
> ptr is being evaluated twice, but in this case this is an
> rcu_dereference_raw() value within the same rcu_read_lock() section.
>
> Is it still problematic?

Sigh! Eric is right -- if "ptr" involves side effects, evaluating
it twice is completely broken.

Thanx, Paul

2013-03-07 18:05:49

by Eric W. Biederman

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

Sasha Levin <[email protected]> writes:

> On 03/07/2013 12:46 PM, Eric Dumazet wrote:
>> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
>>
>>> Looks like the hlist change is probably the issue, though it specifically
>>> uses:
>>>
>>> #define hlist_entry_safe(ptr, type, member) \
>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>
>>> I'm still looking at the code in question and it's assembly, but I can't
>>> figure out what's going wrong. I was also trying to see what's so special
>>> about this loop in find_pid_ns as opposed to the rest of the kernel code
>>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>>>
>>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
>>> inside the same rcu_read_lock() section we'll get different results? That's
>>> really the only reason for this crash that comes to mind at the moment, very
>>> unlikely - but that's all I have right now.
>>>
>>
>> Yep
>>
>> #define hlist_entry_safe(ptr, type, member) \
>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>
>> Is not safe, as ptr can be evaluated twice, and thats not good at all...
>
> ptr is being evaluated twice, but in this case this is an
> rcu_dereference_raw() value within the same rcu_read_lock() section.
>
> Is it still problematic?

Definitely.

Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]

And the crash clearly shows that when hilst_entry is being evaluated the
HEAD is NULL.

Perhaps this shows up in proc because the hash chains are short and
frequently NULL?

Eric

2013-03-07 18:14:33

by Sasha Levin

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On 03/07/2013 01:05 PM, [email protected] wrote:
> Sasha Levin <[email protected]> writes:
>
>> On 03/07/2013 12:46 PM, Eric Dumazet wrote:
>>> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
>>>
>>>> Looks like the hlist change is probably the issue, though it specifically
>>>> uses:
>>>>
>>>> #define hlist_entry_safe(ptr, type, member) \
>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>
>>>> I'm still looking at the code in question and it's assembly, but I can't
>>>> figure out what's going wrong. I was also trying to see what's so special
>>>> about this loop in find_pid_ns as opposed to the rest of the kernel code
>>>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>>>>
>>>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
>>>> inside the same rcu_read_lock() section we'll get different results? That's
>>>> really the only reason for this crash that comes to mind at the moment, very
>>>> unlikely - but that's all I have right now.
>>>>
>>>
>>> Yep
>>>
>>> #define hlist_entry_safe(ptr, type, member) \
>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>
>>> Is not safe, as ptr can be evaluated twice, and thats not good at all...
>>
>> ptr is being evaluated twice, but in this case this is an
>> rcu_dereference_raw() value within the same rcu_read_lock() section.
>>
>> Is it still problematic?
>
> Definitely.
>
> Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]
>
> And the crash clearly shows that when hilst_entry is being evaluated the
> HEAD is NULL.

Okay, I'm even more confused now.

The expression in question is:

hlist_entry_safe(rcu_dereference_bh(hlist_first_rcu(head)))

You're saying that "rcu_dereference_bh(hlist_first_rcu(head))" can change between
the two evaluations we do. That would mean that 'head' has changed in between, right?

In that case, the list itself has changed - which means that RCU has changed the
list underneath us.

hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
so the only thing that can change is 'head'. Why is it allowed to change if the list
is protected by RCU?

Thanks,
Sasha

2013-03-07 18:17:33

by Paul E. McKenney

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Thu, Mar 07, 2013 at 10:05:34AM -0800, Eric W. Biederman wrote:
> Sasha Levin <[email protected]> writes:
>
> > On 03/07/2013 12:46 PM, Eric Dumazet wrote:
> >> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
> >>
> >>> Looks like the hlist change is probably the issue, though it specifically
> >>> uses:
> >>>
> >>> #define hlist_entry_safe(ptr, type, member) \
> >>> (ptr) ? hlist_entry(ptr, type, member) : NULL
> >>>
> >>> I'm still looking at the code in question and it's assembly, but I can't
> >>> figure out what's going wrong. I was also trying to see what's so special
> >>> about this loop in find_pid_ns as opposed to the rest of the kernel code
> >>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
> >>>
> >>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
> >>> inside the same rcu_read_lock() section we'll get different results? That's
> >>> really the only reason for this crash that comes to mind at the moment, very
> >>> unlikely - but that's all I have right now.
> >>>
> >>
> >> Yep
> >>
> >> #define hlist_entry_safe(ptr, type, member) \
> >> (ptr) ? hlist_entry(ptr, type, member) : NULL
> >>
> >> Is not safe, as ptr can be evaluated twice, and thats not good at all...
> >
> > ptr is being evaluated twice, but in this case this is an
> > rcu_dereference_raw() value within the same rcu_read_lock() section.
> >
> > Is it still problematic?
>
> Definitely.
>
> Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]
>
> And the crash clearly shows that when hilst_entry is being evaluated the
> HEAD is NULL.
>
> Perhaps this shows up in proc because the hash chains are short and
> frequently NULL?

So it should be possible to do something like the following:

#define hlist_entry_safe(ptr, type, member) \
({ typeof(ptr) ____ptr = ACCESS_ONCE(ptr); \
____ptr ? hlist_entry(____ptr, type, member) : NULL; \
})

Does that help?

Thanx, Paul

2013-03-07 18:18:48

by Eric Dumazet

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Thu, 2013-03-07 at 13:14 -0500, Sasha Levin wrote:

> Okay, I'm even more confused now.
>
> The expression in question is:
>
> hlist_entry_safe(rcu_dereference_bh(hlist_first_rcu(head)))
>
> You're saying that "rcu_dereference_bh(hlist_first_rcu(head))" can change between
> the two evaluations we do. That would mean that 'head' has changed in between, right?
>
> In that case, the list itself has changed - which means that RCU has changed the
> list underneath us.
>
> hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
> so the only thing that can change is 'head'. Why is it allowed to change if the list
> is protected by RCU?

Thats the point of RCU ;)

Things can change under us.

2013-03-07 18:22:05

by Eric W. Biederman

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

Sasha Levin <[email protected]> writes:

> On 03/07/2013 01:05 PM, [email protected] wrote:
>> Sasha Levin <[email protected]> writes:
>>
>>> On 03/07/2013 12:46 PM, Eric Dumazet wrote:
>>>> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
>>>>
>>>>> Looks like the hlist change is probably the issue, though it specifically
>>>>> uses:
>>>>>
>>>>> #define hlist_entry_safe(ptr, type, member) \
>>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>>
>>>>> I'm still looking at the code in question and it's assembly, but I can't
>>>>> figure out what's going wrong. I was also trying to see what's so special
>>>>> about this loop in find_pid_ns as opposed to the rest of the kernel code
>>>>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>>>>>
>>>>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
>>>>> inside the same rcu_read_lock() section we'll get different results? That's
>>>>> really the only reason for this crash that comes to mind at the moment, very
>>>>> unlikely - but that's all I have right now.
>>>>>
>>>>
>>>> Yep
>>>>
>>>> #define hlist_entry_safe(ptr, type, member) \
>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>
>>>> Is not safe, as ptr can be evaluated twice, and thats not good at all...
>>>
>>> ptr is being evaluated twice, but in this case this is an
>>> rcu_dereference_raw() value within the same rcu_read_lock() section.
>>>
>>> Is it still problematic?
>>
>> Definitely.
>>
>> Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]
>>
>> And the crash clearly shows that when hilst_entry is being evaluated the
>> HEAD is NULL.
>
> Okay, I'm even more confused now.
>
> The expression in question is:
>
> hlist_entry_safe(rcu_dereference_bh(hlist_first_rcu(head)))
>
> You're saying that "rcu_dereference_bh(hlist_first_rcu(head))" can change between
> the two evaluations we do. That would mean that 'head' has changed in between, right?
>
> In that case, the list itself has changed - which means that RCU has changed the
> list underneath us.
>
> hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
> so the only thing that can change is 'head'. Why is it allowed to change if the list
> is protected by RCU?

The pointer to the first element of the list goes to NULL.

With RCU pointers can change and the guranateee that is made is that if
you follow a stale pointer the storage pointed to by the stale pointer
does not become invalid until you exit the rcu critical section.

Eric

2013-03-07 18:27:29

by Sasha Levin

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On 03/07/2013 01:21 PM, [email protected] wrote:
> Sasha Levin <[email protected]> writes:
>
>> On 03/07/2013 01:05 PM, [email protected] wrote:
>>> Sasha Levin <[email protected]> writes:
>>>
>>>> On 03/07/2013 12:46 PM, Eric Dumazet wrote:
>>>>> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
>>>>>
>>>>>> Looks like the hlist change is probably the issue, though it specifically
>>>>>> uses:
>>>>>>
>>>>>> #define hlist_entry_safe(ptr, type, member) \
>>>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>>>
>>>>>> I'm still looking at the code in question and it's assembly, but I can't
>>>>>> figure out what's going wrong. I was also trying to see what's so special
>>>>>> about this loop in find_pid_ns as opposed to the rest of the kernel code
>>>>>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>>>>>>
>>>>>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
>>>>>> inside the same rcu_read_lock() section we'll get different results? That's
>>>>>> really the only reason for this crash that comes to mind at the moment, very
>>>>>> unlikely - but that's all I have right now.
>>>>>>
>>>>>
>>>>> Yep
>>>>>
>>>>> #define hlist_entry_safe(ptr, type, member) \
>>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>>
>>>>> Is not safe, as ptr can be evaluated twice, and thats not good at all...
>>>>
>>>> ptr is being evaluated twice, but in this case this is an
>>>> rcu_dereference_raw() value within the same rcu_read_lock() section.
>>>>
>>>> Is it still problematic?
>>>
>>> Definitely.
>>>
>>> Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]
>>>
>>> And the crash clearly shows that when hilst_entry is being evaluated the
>>> HEAD is NULL.
>>
>> Okay, I'm even more confused now.
>>
>> The expression in question is:
>>
>> hlist_entry_safe(rcu_dereference_bh(hlist_first_rcu(head)))
>>
>> You're saying that "rcu_dereference_bh(hlist_first_rcu(head))" can change between
>> the two evaluations we do. That would mean that 'head' has changed in between, right?
>>
>> In that case, the list itself has changed - which means that RCU has changed the
>> list underneath us.
>>
>> hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
>> so the only thing that can change is 'head'. Why is it allowed to change if the list
>> is protected by RCU?
>
> The pointer to the first element of the list goes to NULL.
>
> With RCU pointers can change and the guranateee that is made is that if
> you follow a stale pointer the storage pointed to by the stale pointer
> does not become invalid until you exit the rcu critical section.

But there's nothing that guarantees that the pointers themselves won't change like
I thought. That would explain this issue.

Let me run with trinity for a day with the change proposed by Paul just to see
everything looks sane, and if it is I can send a patch to fix it.

Thanks,
Sasha

2013-03-07 18:29:54

by Paul E. McKenney

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Thu, Mar 07, 2013 at 01:14:10PM -0500, Sasha Levin wrote:
> On 03/07/2013 01:05 PM, [email protected] wrote:
> > Sasha Levin <[email protected]> writes:
> >
> >> On 03/07/2013 12:46 PM, Eric Dumazet wrote:
> >>> On Thu, 2013-03-07 at 12:36 -0500, Sasha Levin wrote:
> >>>
> >>>> Looks like the hlist change is probably the issue, though it specifically
> >>>> uses:
> >>>>
> >>>> #define hlist_entry_safe(ptr, type, member) \
> >>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
> >>>>
> >>>> I'm still looking at the code in question and it's assembly, but I can't
> >>>> figure out what's going wrong. I was also trying to see what's so special
> >>>> about this loop in find_pid_ns as opposed to the rest of the kernel code
> >>>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
> >>>>
> >>>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
> >>>> inside the same rcu_read_lock() section we'll get different results? That's
> >>>> really the only reason for this crash that comes to mind at the moment, very
> >>>> unlikely - but that's all I have right now.
> >>>>
> >>>
> >>> Yep
> >>>
> >>> #define hlist_entry_safe(ptr, type, member) \
> >>> (ptr) ? hlist_entry(ptr, type, member) : NULL
> >>>
> >>> Is not safe, as ptr can be evaluated twice, and thats not good at all...
> >>
> >> ptr is being evaluated twice, but in this case this is an
> >> rcu_dereference_raw() value within the same rcu_read_lock() section.
> >>
> >> Is it still problematic?
> >
> > Definitely.
> >
> > Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]
> >
> > And the crash clearly shows that when hilst_entry is being evaluated the
> > HEAD is NULL.
>
> Okay, I'm even more confused now.
>
> The expression in question is:
>
> hlist_entry_safe(rcu_dereference_bh(hlist_first_rcu(head)))
>
> You're saying that "rcu_dereference_bh(hlist_first_rcu(head))" can change between
> the two evaluations we do. That would mean that 'head' has changed in between, right?
>
> In that case, the list itself has changed - which means that RCU has changed the
> list underneath us.
>
> hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
> so the only thing that can change is 'head'. Why is it allowed to change if the list
> is protected by RCU?

RCU does not prevent the list from changing. Instead, it prevents anything
that was in the list from being freed during a given RCU read-side critical
section. Here is how it is supposed to happen:

head---->A

Task 1 picks up the pointer from head to A, and sees that it is non-NULL.

Task 2 removes A from the list, so that the pointer from head is now NULL:

head A
|
|
V
NULL

Now task 1 refetches from head, and is fatally disappointed to get a
NULL pointer.

Now, had task 1 avoided the refetch, it would be still working with
a pointer to A. Since A won't be freed until the end of an RCU grace
period, all would have been well. Again, one way to handle this is
as follows:

#define hlist_entry_safe(ptr, type, member) \
({ typeof(ptr) ____ptr = (ptr); \
____ptr ? hlist_entry(____ptr, type, member) : NULL; \
})

This way, "ptr" is executed exactly once, and the check and the
hlist_entry() are both using the same value.

Thanx, Paul

2013-03-09 08:02:37

by Zefan Li

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

>>>>>> Looks like the hlist change is probably the issue, though it specifically
>>>>>> uses:
>>>>>>
>>>>>> #define hlist_entry_safe(ptr, type, member) \
>>>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>>>
>>>>>> I'm still looking at the code in question and it's assembly, but I can't
>>>>>> figure out what's going wrong. I was also trying to see what's so special
>>>>>> about this loop in find_pid_ns as opposed to the rest of the kernel code
>>>>>> that uses hlist_for_each_entry_rcu() but couldn't find out why.
>>>>>>
>>>>>> Is it somehow possible that if we rcu_dereference_raw() the same thing twice
>>>>>> inside the same rcu_read_lock() section we'll get different results? That's
>>>>>> really the only reason for this crash that comes to mind at the moment, very
>>>>>> unlikely - but that's all I have right now.
>>>>>>
>>>>>
>>>>> Yep
>>>>>
>>>>> #define hlist_entry_safe(ptr, type, member) \
>>>>> (ptr) ? hlist_entry(ptr, type, member) : NULL
>>>>>
>>>>> Is not safe, as ptr can be evaluated twice, and thats not good at all...
>>>>
>>>> ptr is being evaluated twice, but in this case this is an
>>>> rcu_dereference_raw() value within the same rcu_read_lock() section.
>>>>
>>>> Is it still problematic?
>>>
>>> Definitely.
>>>
>>> Head in this instance the expression: &pid_hash[pid_hashfn(nr, ns)]
>>>
>>> And the crash clearly shows that when hilst_entry is being evaluated the
>>> HEAD is NULL.
>>
>> Okay, I'm even more confused now.
>>
>> The expression in question is:
>>
>> hlist_entry_safe(rcu_dereference_bh(hlist_first_rcu(head)))
>>
>> You're saying that "rcu_dereference_bh(hlist_first_rcu(head))" can change between
>> the two evaluations we do. That would mean that 'head' has changed in between, right?
>>
>> In that case, the list itself has changed - which means that RCU has changed the
>> list underneath us.
>>
>> hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
>> so the only thing that can change is 'head'. Why is it allowed to change if the list
>> is protected by RCU?
>
> RCU does not prevent the list from changing. Instead, it prevents anything
> that was in the list from being freed during a given RCU read-side critical
> section. Here is how it is supposed to happen:
>
> head---->A
>
> Task 1 picks up the pointer from head to A, and sees that it is non-NULL.
>
> Task 2 removes A from the list, so that the pointer from head is now NULL:
>
> head A
> |
> |
> V
> NULL
>
> Now task 1 refetches from head, and is fatally disappointed to get a
> NULL pointer.
>
> Now, had task 1 avoided the refetch, it would be still working with
> a pointer to A. Since A won't be freed until the end of an RCU grace
> period, all would have been well. Again, one way to handle this is
> as follows:
>
> #define hlist_entry_safe(ptr, type, member) \
> ({ typeof(ptr) ____ptr = (ptr); \
> ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
> })
>
> This way, "ptr" is executed exactly once, and the check and the
> hlist_entry() are both using the same value.
>

I just played with trinity, and triggered this bug in just a few mins,
and I tried Paul's proposed fix and it works.

2013-03-09 18:38:26

by Paul E. McKenney

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Sat, Mar 09, 2013 at 04:01:41PM +0800, Li Zefan wrote:

[ . . . ]

> >> hlist_first_rcu() doesn't have any side-effects, it doesn't modify the list whatsoever,
> >> so the only thing that can change is 'head'. Why is it allowed to change if the list
> >> is protected by RCU?
> >
> > RCU does not prevent the list from changing. Instead, it prevents anything
> > that was in the list from being freed during a given RCU read-side critical
> > section. Here is how it is supposed to happen:
> >
> > head---->A
> >
> > Task 1 picks up the pointer from head to A, and sees that it is non-NULL.
> >
> > Task 2 removes A from the list, so that the pointer from head is now NULL:
> >
> > head A
> > |
> > |
> > V
> > NULL
> >
> > Now task 1 refetches from head, and is fatally disappointed to get a
> > NULL pointer.
> >
> > Now, had task 1 avoided the refetch, it would be still working with
> > a pointer to A. Since A won't be freed until the end of an RCU grace
> > period, all would have been well. Again, one way to handle this is
> > as follows:
> >
> > #define hlist_entry_safe(ptr, type, member) \
> > ({ typeof(ptr) ____ptr = (ptr); \
> > ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
> > })
> >
> > This way, "ptr" is executed exactly once, and the check and the
> > hlist_entry() are both using the same value.
>
> I just played with trinity, and triggered this bug in just a few mins,
> and I tried Paul's proposed fix and it works.

Thank you for testing this! Please see below for the patch.

Sasha, will you be pushing this or would you like me to do so?

Thanx, Paul

------------------------------------------------------------------------

list: Fix double fetch of pointer in hlist_entry_safe()

The current version of hlist_entry_safe() fetches the pointer twice,
once to test for NULL and the other to compute the offset back to the
enclosing structure. This is OK for normal lock-based use because in
that case, the pointer cannot change. However, when the pointer is
protected by RCU (as in "rcu_dereference(p)"), then the pointer can
change at any time. This use case can result in the following sequence
of events:

1. CPU 0 invokes hlist_entry_safe(), fetches the RCU-protected
pointer as sees that it is non-NULL.

2. CPU 1 invokes hlist_del_rcu(), deleting the entry that CPU 0
just fetched a pointer to. Because this is the last entry
in the list, the pointer fetched by CPU 0 is now NULL.

3. CPU 0 refetches the pointer, obtains NULL, and then gets a
NULL-pointer crash.

This commit therefore applies gcc's "({ })" statement expression to
create a temporary variable so that the specified pointer is fetched
only once, avoiding the above sequence of events. Please note that
it is the caller's responsibility to use rcu_dereference() as needed.
This allows RCU-protected uses to work correctly without imposing
any additional overhead on the non-RCU case.

Many thanks to Eric Dumazet for spotting root cause!

Reported-by: CAI Qian <[email protected]>
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Tested-by: Li Zefan <[email protected]>

diff --git a/include/linux/list.h b/include/linux/list.h
index d991cc1..6a1f8df 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -667,7 +667,9 @@ static inline void hlist_move_list(struct hlist_head *old,
pos = n)

#define hlist_entry_safe(ptr, type, member) \
- (ptr) ? hlist_entry(ptr, type, member) : NULL
+ ({ typeof(ptr) ____ptr = (ptr); \
+ ____ptr ? hlist_entry(____ptr, type, member) : NULL; \
+ })

/**
* hlist_for_each_entry - iterate over list of given type

2013-03-14 20:02:10

by Dave Jones

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Sat, Mar 09, 2013 at 07:51:46AM -0800, Paul E. McKenney wrote:
> On Sat, Mar 09, 2013 at 04:01:41PM +0800, Li Zefan wrote:
>
> [ . . . ]
>
> > > This way, "ptr" is executed exactly once, and the check and the
> > > hlist_entry() are both using the same value.
> >
> > I just played with trinity, and triggered this bug in just a few mins,
> > and I tried Paul's proposed fix and it works.
>
> Thank you for testing this! Please see below for the patch.
>
> Sasha, will you be pushing this or would you like me to do so?

[crickets]

Given this has been hanging for the better part of another week,
Paul why not send it Linus so hopefully rc3 doesn't have this bug ?

Dave

2013-03-14 21:01:26

by Paul E. McKenney

[permalink] [raw]

Subject: Re: 3.9-rc1 NULL pointer crash at find_pid_ns

On Thu, Mar 14, 2013 at 04:00:54PM -0400, Dave Jones wrote:
> On Sat, Mar 09, 2013 at 07:51:46AM -0800, Paul E. McKenney wrote:
> > On Sat, Mar 09, 2013 at 04:01:41PM +0800, Li Zefan wrote:
> >
> > [ . . . ]
> >
> > > > This way, "ptr" is executed exactly once, and the check and the
> > > > hlist_entry() are both using the same value.
> > >
> > > I just played with trinity, and triggered this bug in just a few mins,
> > > and I tried Paul's proposed fix and it works.
> >
> > Thank you for testing this! Please see below for the patch.
> >
> > Sasha, will you be pushing this or would you like me to do so?
>
> [crickets]
>
> Given this has been hanging for the better part of another week,
> Paul why not send it Linus so hopefully rc3 doesn't have this bug ?

Good point! I sent a pull request.

Thanx, Paul