2013-07-31 11:24:06

by vinayak menon

Subject: ipv4: crash at leaf_walk_rcu

Hi,

A crash was seen on a 3.4.5 kernel during some random wlan operations.

CPU: Single core ARM Cortex A9.

fib_route_seq_next was called with its second argument (void *v) set to
0xd6e3e360, which is a freed object from the "ip_fib_trie" cache. I
confirmed with the crash utility that the object had been freed.

Sequence: fib_route_seq_next->trie_nextleaf->leaf_walk_rcu

Because "v" was a freed object, node_parent_rcu() inside trie_nextleaf()
returned an invalid tnode. Slab poisoning was enabled and the object had
already been freed, so the tnode was 0x6b6b6b6b. This was passed to
leaf_walk_rcu and resulted in the crash.
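
To make the poison pattern concrete, here is a minimal userspace sketch of
how a poisoned parent field would produce these values. It is not the kernel
code: struct toy_node and the toy_* helpers are hypothetical stand-ins, the
32-bit field width is assumed from the ARM target, and the one-bit type mask
is an assumption about how fib_trie tags its parent pointer. 0x6b is the
byte that slab poisoning writes into freed objects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POISON_FREE    0x6bU  /* byte written over freed slab objects */
#define NODE_TYPE_MASK 0x1U   /* assumed: bit 0 of ->parent tags the node type */

/* Hypothetical stand-in for a trie node on a 32-bit machine. */
struct toy_node {
    uint32_t parent;  /* parent pointer with the type tag in bit 0 */
    uint32_t key;
};

/* Simulate what slab poisoning does to an object when it is freed. */
static void toy_poison(struct toy_node *n)
{
    memset(n, POISON_FREE, sizeof(*n));
}

/* Model of the parent lookup: strip the type bit from ->parent. */
static uint32_t toy_node_parent(const struct toy_node *n)
{
    return n->parent & ~NODE_TYPE_MASK;
}

/* Poison a node, then look up its "parent" as a stale reader would. */
static uint32_t toy_freed_parent(void)
{
    struct toy_node n = { 0, 0 };

    toy_poison(&n);
    assert(n.parent == 0x6b6b6b6bU);  /* raw field is pure poison */
    return toy_node_parent(&n);       /* poison with bit 0 cleared */
}
```

Under these assumptions the raw field reads 0x6b6b6b6b and the masked value
is 0x6b6b6b6a, which is consistent with r3 and r0 in the register dump
below.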

fib_route_seq_start takes rcu_read_lock(), but free_leaf calls
call_rcu_bh. Can this be the problem?
Should rcu_read_lock() in fib_route_seq_start be changed to rcu_read_lock_bh()?

----------------------------------------------------------------------------
PC is at leaf_walk_rcu+0x10/0xa0
LR is at fib_route_seq_next+0x58/0x74
pc : [<c0500e5c>] lr : [<c050108c>] psr: a0000013
sp : c150bee0 ip : 00000000 fp : 00000000
r10: 00000400 r9 : 53701020 r8 : c32345c0
r7 : 00000000 r6 : 00000001 r5 : 00000000 r4 : 00000002
r3 : 6b6b6b6b r2 : 00000001 r1 : d6e3e360 r0 : 6b6b6b6a
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c53c7d Table: 835dc059 DAC: 00000015

Backtrace:
[<c0500e5c>] (leaf_walk_rcu+0x10/0xa0) from [<c050108c>] (fib_route_seq_next+0x58/0x74)
[<c050108c>] (fib_route_seq_next+0x58/0x74) from [<c011c06c>] (seq_read+0x2cc/0x438)
[<c011c06c>] (seq_read+0x2cc/0x438) from [<c0145734>] (proc_reg_read+0xb0/0xcc)
[<c0145734>] (proc_reg_read+0xb0/0xcc) from [<c0100798>] (vfs_read+0xac/0x124)
[<c0100798>] (vfs_read+0xac/0x124) from [<c0100848>] (sys_read+0x38/0x64)
[<c0100848>] (sys_read+0x38/0x64) from [<c000e100>] (ret_fast_syscall+0x0/0x48)

Thanks,
Vinayak


2013-07-31 12:55:23

by Paul E. McKenney

Subject: Re: ipv4: crash at leaf_walk_rcu

On Wed, Jul 31, 2013 at 04:40:47PM +0530, vinayak menon wrote:
> Hi,
>
> A crash was seen on 3.4.5 kernel during some random wlan operations.
>
> CPU: Single core ARM Cortex A9.
>
> fib_route_seq_next was called with second argument (void *v) as 0xd6e3e360
> which is a "freed" object of the "ip_fib_trie" cache. I confirmed that the
> object was freed with crash utility.
>
> Sequence: fib_route_seq_next->trie_nextleaf->leaf_walk_rcu
>
> As "v" was a freed object, inside trie_nextleaf(), node_parent_rcu()
> returned an invalid tnode. But as I had enabled slab poisoning and the
> object was already freed, the tnode was 0x6b6b6b6b. And this was passed to
> leaf_walk_rcu and resulted in the crash.
>
> fib_route_seq_start, takes rcu_read_lock(), but free_leaf
> calls call_rcu_bh. Can this be the problem ?
> Should rcu_read_lock() in fib_route_seq_start be changed to rcu_read_lock_bh()
> ?

One way or the other, the RCU read-side primitives need to match the RCU
update-side primitives. Adding netdev...
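
As a rough illustration of why the flavors have to match, here is a toy
single-threaded model in userspace C. It is not how RCU is implemented: the
toy_* names are hypothetical, and real grace periods only wait for readers
that started before the update. The sketch only shows that a deferred free
queued under one flavor never consults readers of the other flavor.

```c
#include <assert.h>
#include <stdbool.h>

/* Two independent flavors, loosely modelled on classic RCU and RCU-bh. */
enum toy_flavor { TOY_RCU, TOY_RCU_BH, TOY_NFLAVORS };

/* Number of active read-side critical sections per flavor. */
static int toy_readers[TOY_NFLAVORS];

static void toy_read_lock(enum toy_flavor f)
{
    toy_readers[f]++;
}

static void toy_read_unlock(enum toy_flavor f)
{
    assert(toy_readers[f] > 0);  /* unlock must pair with a lock */
    toy_readers[f]--;
}

/*
 * A deferred free queued with flavor f may run once that flavor has no
 * active readers. Readers of the other flavor are not checked at all.
 */
static bool toy_may_free(enum toy_flavor f)
{
    return toy_readers[f] == 0;
}
```

With a reader inside toy_read_lock(TOY_RCU), toy_may_free(TOY_RCU_BH) still
returns true, so in this model a bh-flavored free can run while the reader
holds a pointer to the object. toy_may_free(TOY_RCU) returns false until the
reader calls toy_read_unlock(TOY_RCU).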

Thanx, Paul

> ----------------------------------------------------------------------------
> PC is at leaf_walk_rcu+0x10/0xa0
> LR is at fib_route_seq_next+0x58/0x74
> pc : [<c0500e5c>] lr : [<c050108c>] psr: a0000013
> sp : c150bee0 ip : 00000000 fp : 00000000
> r10: 00000400 r9 : 53701020 r8 : c32345c0
> r7 : 00000000 r6 : 00000001 r5 : 00000000 r4 : 00000002
> r3 : 6b6b6b6b r2 : 00000001 r1 : d6e3e360 r0 : 6b6b6b6a
> Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> Control: 10c53c7d Table: 835dc059 DAC: 00000015
>
> Backtrace:
> [<c0500e5c>] (leaf_walk_rcu+0x10/0xa0) from [<c050108c>]
> (fib_route_seq_next+0x58/0x74)
> [<c050108c>] (fib_route_seq_next+0x58/0x74) from [<c011c06c>]
> (seq_read+0x2cc/0x438)
> [<c011c06c>] (seq_read+0x2cc/0x438) from [<c0145734>]
> (proc_reg_read+0xb0/0xcc)
> [<c0145734>] (proc_reg_read+0xb0/0xcc) from [<c0100798>]
> (vfs_read+0xac/0x124)
> [<c0100798>] (vfs_read+0xac/0x124) from [<c0100848>] (sys_read+0x38/0x64)
> [<c0100848>] (sys_read+0x38/0x64) from [<c000e100>]
> (ret_fast_syscall+0x0/0x48)
>
> Thanks,
> Vinayak

2013-07-31 13:13:26

by Hannes Frederic Sowa

Subject: Re: ipv4: crash at leaf_walk_rcu

On Wed, Jul 31, 2013 at 05:55:13AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 31, 2013 at 04:40:47PM +0530, vinayak menon wrote:
> > Hi,
> >
> > A crash was seen on 3.4.5 kernel during some random wlan operations.
> >
> > CPU: Single core ARM Cortex A9.
> >
> > fib_route_seq_next was called with second argument (void *v) as 0xd6e3e360
> > which is a "freed" object of the "ip_fib_trie" cache. I confirmed that the
> > object was freed with crash utility.
> >
> > Sequence: fib_route_seq_next->trie_nextleaf->leaf_walk_rcu
> >
> > As "v" was a freed object, inside trie_nextleaf(), node_parent_rcu()
> > returned an invalid tnode. But as I had enabled slab poisoning and the
> > object was already freed, the tnode was 0x6b6b6b6b. And this was passed to
> > leaf_walk_rcu and resulted in the crash.
> >
> > fib_route_seq_start, takes rcu_read_lock(), but free_leaf
> > calls call_rcu_bh. Can this be the problem ?
> > Should rcu_read_lock() in fib_route_seq_start be changed to rcu_read_lock_bh()
> > ?
>
> One way or the other, the RCU read-side primitives need to match the RCU
> update-side primitives. Adding netdev...

Already fixed by:

commit 0c03eca3d995e73d691edea8c787e25929ec156d
Author: Eric Dumazet <[email protected]>
Date: Tue Aug 7 00:47:11 2012 +0000

net: fib: fix incorrect call_rcu_bh()

After IP route cache removal, I believe rcu_bh() has very little use and
we should remove this RCU variant, since it adds some cycles in fast
path.

Anyway, the call_rcu_bh() use in fib_trie is obviously wrong, since
some users only assert rcu_read_lock().

2013-07-31 13:31:29

by vinayak menon

Subject: Re: ipv4: crash at leaf_walk_rcu

On Wed, Jul 31, 2013 at 6:43 PM, Hannes Frederic Sowa
<[email protected]> wrote:
> On Wed, Jul 31, 2013 at 05:55:13AM -0700, Paul E. McKenney wrote:
>> On Wed, Jul 31, 2013 at 04:40:47PM +0530, vinayak menon wrote:
>> > Hi,
>> >
>> > A crash was seen on 3.4.5 kernel during some random wlan operations.
>> >
>> > CPU: Single core ARM Cortex A9.
>> >
>> > fib_route_seq_next was called with second argument (void *v) as 0xd6e3e360
>> > which is a "freed" object of the "ip_fib_trie" cache. I confirmed that the
>> > object was freed with crash utility.
>> >
>> > Sequence: fib_route_seq_next->trie_nextleaf->leaf_walk_rcu
>> >
>> > As "v" was a freed object, inside trie_nextleaf(), node_parent_rcu()
>> > returned an invalid tnode. But as I had enabled slab poisoning and the
>> > object was already freed, the tnode was 0x6b6b6b6b. And this was passed to
>> > leaf_walk_rcu and resulted in the crash.
>> >
>> > fib_route_seq_start, takes rcu_read_lock(), but free_leaf
>> > calls call_rcu_bh. Can this be the problem ?
>> > Should rcu_read_lock() in fib_route_seq_start be changed to rcu_read_lock_bh()
>> > ?
>>
>> One way or the other, the RCU read-side primitives need to match the RCU
>> update-side primitives. Adding netdev...
>
> Already fixed by:
>
> commit 0c03eca3d995e73d691edea8c787e25929ec156d
> Author: Eric Dumazet <[email protected]>
> Date: Tue Aug 7 00:47:11 2012 +0000
>
> net: fib: fix incorrect call_rcu_bh()
>
> After IP route cache removal, I believe rcu_bh() has very little use and
> we should remove this RCU variant, since it adds some cycles in fast
> path.
>
> Anyway, the call_rcu_bh() use in fib_trie is obviously wrong, since
> some users only assert rcu_read_lock().
>

Thanks. I missed this somehow.

2013-07-31 14:13:16

by Paul E. McKenney

Subject: Re: ipv4: crash at leaf_walk_rcu

On Wed, Jul 31, 2013 at 03:13:23PM +0200, Hannes Frederic Sowa wrote:
> On Wed, Jul 31, 2013 at 05:55:13AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 31, 2013 at 04:40:47PM +0530, vinayak menon wrote:
> > > Hi,
> > >
> > > A crash was seen on 3.4.5 kernel during some random wlan operations.
> > >
> > > CPU: Single core ARM Cortex A9.
> > >
> > > fib_route_seq_next was called with second argument (void *v) as 0xd6e3e360
> > > which is a "freed" object of the "ip_fib_trie" cache. I confirmed that the
> > > object was freed with crash utility.
> > >
> > > Sequence: fib_route_seq_next->trie_nextleaf->leaf_walk_rcu
> > >
> > > As "v" was a freed object, inside trie_nextleaf(), node_parent_rcu()
> > > returned an invalid tnode. But as I had enabled slab poisoning and the
> > > object was already freed, the tnode was 0x6b6b6b6b. And this was passed to
> > > leaf_walk_rcu and resulted in the crash.
> > >
> > > fib_route_seq_start, takes rcu_read_lock(), but free_leaf
> > > calls call_rcu_bh. Can this be the problem ?
> > > Should rcu_read_lock() in fib_route_seq_start be changed to rcu_read_lock_bh()
> > > ?
> >
> > One way or the other, the RCU read-side primitives need to match the RCU
> > update-side primitives. Adding netdev...
>
> Already fixed by:
>
> commit 0c03eca3d995e73d691edea8c787e25929ec156d
> Author: Eric Dumazet <[email protected]>
> Date: Tue Aug 7 00:47:11 2012 +0000
>
> net: fib: fix incorrect call_rcu_bh()
>
> After IP route cache removal, I believe rcu_bh() has very little use and
> we should remove this RCU variant, since it adds some cycles in fast
> path.
>
> Anyway, the call_rcu_bh() use in fib_trie is obviously wrong, since
> some users only assert rcu_read_lock().

Even better! ;-)

Thanx, Paul