2018-12-10 07:51:06

by syzbot

[permalink] [raw]
Subject: INFO: rcu detected stall in xfrm_hash_rebuild

Hello,

syzbot found the following crash on:

HEAD commit: 74c4a24df7ca Add linux-next specific files for 20181207
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=17dbb3d5400000
kernel config: https://syzkaller.appspot.com/x/.config?x=6e9413388bf37bed
dashboard link: https://syzkaller.appspot.com/bug?extid=592df0494801b6648ec6
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]

device bridge_slave_1 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): veth0_to_bridge: link is not ready
IPv6: ADDRCONF(NETDEV_UP): veth1_to_bridge: link is not ready
bond0: Enslaving bond_slave_0 as an active interface with an up link
bond0: Enslaving bond_slave_1 as an active interface with an up link
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 1-....: (1 GPs behind) idle=3e2/1/0x4000000000000002
softirq=136882/136885 fqs=5250
rcu: (t=10501 jiffies g=189801 q=2008)
NMI backtrace for cpu 1
CPU: 1 PID: 6200 Comm: kworker/1:2 Not tainted 4.20.0-rc5-next-20181207+
#163
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: events xfrm_hash_rebuild
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x244/0x39d lib/dump_stack.c:113
nmi_cpu_backtrace.cold.2+0x5c/0xa1 lib/nmi_backtrace.c:101
nmi_trigger_cpumask_backtrace+0x1e8/0x22a lib/nmi_backtrace.c:62
arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
rcu_dump_cpu_stacks+0x16f/0x1bc kernel/rcu/tree.c:1211
print_cpu_stall.cold.70+0x218/0x40a kernel/rcu/tree.c:1348
check_cpu_stall kernel/rcu/tree.c:1422 [inline]
rcu_pending kernel/rcu/tree.c:3018 [inline]
rcu_check_callbacks+0xf3b/0x13f0 kernel/rcu/tree.c:2521
update_process_times+0x2d/0x70 kernel/time/timer.c:1635
tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:161
tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1271
__run_hrtimer kernel/time/hrtimer.c:1389 [inline]
__hrtimer_run_queues+0x41c/0x10d0 kernel/time/hrtimer.c:1451
hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1509
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1034 [inline]
smp_apic_timer_interrupt+0x1a1/0x760 arch/x86/kernel/apic/apic.c:1059
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:804
</IRQ>
RIP: 0010:__read_once_size include/linux/compiler.h:182 [inline]
RIP: 0010:check_kcov_mode kernel/kcov.c:69 [inline]
RIP: 0010:write_comp_data+0x22/0x70 kernel/kcov.c:122
Code: 90 90 90 90 90 90 90 90 55 48 89 e5 65 4c 8b 04 25 40 ee 01 00 65 8b
05 7c f6 81 7e a9 00 01 1f 00 75 51 41 8b 80 d8 12 00 00 <83> f8 03 75 45
49 8b 80 e0 12 00 00 45 8b 80 dc 12 00 00 4c 8b 08
RSP: 0018:ffff8881872f7320 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff86d1d8db
RDX: 0000000000000003 RSI: 000000000000000e RDI: 0000000000000005
RBP: ffff8881872f7320 R08: ffff8881872ea1c0 R09: 0000000000000008
R10: 0000000000000005 R11: ffff8881872ea1c0 R12: dffffc0000000000
R13: ffff8881b628ca14 R14: ffff8881b4050114 R15: 0000000000000000
__sanitizer_cov_trace_const_cmp4+0x16/0x20 kernel/kcov.c:188
selector_cmp net/xfrm/xfrm_policy.c:1390 [inline]
xfrm_policy_insert_list+0x62b/0x1020 net/xfrm/xfrm_policy.c:1534
xfrm_policy_inexact_insert+0x166/0xee0 net/xfrm/xfrm_policy.c:1195
xfrm_hash_rebuild+0xfba/0x1380 net/xfrm/xfrm_policy.c:1317
process_one_work+0xc90/0x1c40 kernel/workqueue.c:2153
worker_thread+0x17f/0x1390 kernel/workqueue.c:2296
kthread+0x35a/0x440 kernel/kthread.c:246
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.


2018-12-10 12:54:51

by Florian Westphal

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

syzbot <[email protected]> wrote:
> Hello,
>
> syzbot found the following crash on:
[..]

> Workqueue: events xfrm_hash_rebuild

Ignoring this report for a second -- I think it makes sense to see
if we can just remove the entire hash table rebuild/resize code.

After recent tree conversion, we could probably make the exact policies
part of the 'inexact tree' (which would be renamed to 'policy tree' or
some such).

Special-casing the exact policies made a lot of sense when we had
a single list for the inexact policies (to keep its length down).

But now I think we could try to unify all of this and only maintain
the existing tree-based storage.

Would also remove the need to do lookups in two different
data structures (bydst-hash-then-inexact-tree).

What do you think?

2018-12-10 19:55:34

by David Miller

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

From: Florian Westphal <[email protected]>
Date: Mon, 10 Dec 2018 13:47:24 +0100

> After recent tree conversion, we could probably make the exact policies
> part of the 'inexact tree' (which would be renamed to 'policy tree' or
> some such).
>
> Special-casing the exact policies made a lot of sense when we had
> a single list for the inexact policies (to keep its length down).
>
> But now I think we could try to unify all of this and only maintain
> the existing tree-based storage.
>
> Would also remove the need to do lookups in two different
> data structures (bydst-hash-then-inexact-tree).
>
> What do you think?

I think this makes a lot of sense.

2018-12-14 13:18:20

by Wolfgang Walter

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

Am Montag, 10. Dezember 2018, 09:58:56 schrieb David Miller:
> From: Florian Westphal <[email protected]>
> Date: Mon, 10 Dec 2018 13:47:24 +0100
>
> > After recent tree conversion, we could probably make the exact policies
> > part of the 'inexact tree' (which would be renamed to 'policy tree' or
> > some such).
> >
> > Special-casing the exact policies made a lot of sense when we had
> > a single list for the inexact policies (to keep its length down).
> >
> > But now I think we could try to unify all of this and only maintain
> > the existing tree-based storage.
> >
> > Would also remove the need to do lookups in two different
> > data structures (bydst-hash-then-inexact-tree).
> >
> > What do you think?
>
> I think this makes a lot of sense.

Sites mainly using tunnel mode this certainly makes sense.

I'm not so sure for transport mode. With transport mode the netmask usually is
/32 or /128, respectively (there may also be trap-rules). So a site only using
transport mode (road warrior scenario, for example) may see a large
performance regression if this is changed. They may do not have many entries
in the inexact list if any at all.

Maybe there are a lot more transport mode users than tunnel mode users, this
would explain why the removal of the flowcache did not hit that many people.

We do not use transport mode, so I'm not familiar how strongswan for example
handles that. I think that since 5.3 or so strongwans allows a catch rule
(inexact) and then inserts exact policy rules on the fly. But I don't know for
sure. There are a lot of tests on strongswan for different scenarios which
also demonstrate how policy and state table finally will look like on all
hosts.

Here is one with such a scenario (transport mode trap policy on a gateway,
three road warriors):

https://www.strongswan.org/testing/testresults/ikev2/trap-any/

So I would try to find users who are heavy users of transport mode and see how
this change would impact there performance.

Regards,
--
Wolfgang Walter
Studentenwerk M?nchen
Anstalt des ?ffentlichen Rechts

2018-12-14 14:37:52

by Florian Westphal

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

Wolfgang Walter <[email protected]> wrote:

[ CCing Christophe ]

> Am Montag, 10. Dezember 2018, 09:58:56 schrieb David Miller:
> > From: Florian Westphal <[email protected]>
> > Date: Mon, 10 Dec 2018 13:47:24 +0100
> >
> > > After recent tree conversion, we could probably make the exact policies
> > > part of the 'inexact tree' (which would be renamed to 'policy tree' or
> > > some such).
> > >
> > > Special-casing the exact policies made a lot of sense when we had
> > > a single list for the inexact policies (to keep its length down).
> > >
> > > But now I think we could try to unify all of this and only maintain
> > > the existing tree-based storage.
> > >
> > > Would also remove the need to do lookups in two different
> > > data structures (bydst-hash-then-inexact-tree).
> > >
> > > What do you think?
> >
> > I think this makes a lot of sense.
>
> Sites mainly using tunnel mode this certainly makes sense.

Ok. An alternative would be to remove the support for
policy hash table thresholds (which decide what kinds of policies
go to exact table and which ones go into inexact ones), i.e.
partially revert 880a6fab8f6ba5b5abe59ea6
("xfrm: configure policy hash table thresholds by netlink").

This would remove the need for the rehashing support that
re-sorts the policies into either exact/inexact lists) when the
those tunables are changed.

We could also easily convert the exact table to an rhashtable
then if we wanted to.

I guess we should probably wait to get some operational feedback on the
inexact storage first to see if it really improves things enough to
make threshold tuning unneccessary.

Christophe, whats your take?

2018-12-14 14:58:14

by Herbert Xu

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

On Fri, Dec 14, 2018 at 03:35:32PM +0100, Florian Westphal wrote:
>
> Ok. An alternative would be to remove the support for
> policy hash table thresholds (which decide what kinds of policies
> go to exact table and which ones go into inexact ones), i.e.
> partially revert 880a6fab8f6ba5b5abe59ea6
> ("xfrm: configure policy hash table thresholds by netlink").
>
> This would remove the need for the rehashing support that
> re-sorts the policies into either exact/inexact lists) when the
> those tunables are changed.
>
> We could also easily convert the exact table to an rhashtable
> then if we wanted to.

We could also do both. In fact that was the reason why I started
working on rhashtable in the first place. The idea is to extend
the run-time rehashing to include both parts of the database.

So you would look up the old table/list combo and then move onto
the new one.

Of course I never had time to finish this and I think the entity
asking for this has moved onto something else.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2018-12-14 16:07:26

by Christophe Gouault

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

Le ven. 14 déc. 2018 à 15:35, Florian Westphal <[email protected]> a écrit :
> Ok. An alternative would be to remove the support for
> policy hash table thresholds (which decide what kinds of policies
> go to exact table and which ones go into inexact ones), i.e.
> partially revert 880a6fab8f6ba5b5abe59ea6
> ("xfrm: configure policy hash table thresholds by netlink").
>
> This would remove the need for the rehashing support that
> re-sorts the policies into either exact/inexact lists) when the
> those tunables are changed.
>
> We could also easily convert the exact table to an rhashtable
> then if we wanted to.
>
> I guess we should probably wait to get some operational feedback on the
> inexact storage first to see if it really improves things enough to
> make threshold tuning unneccessary.
>
> Christophe, whats your take?

Hi Florian,

The main use cases I have encountered and tried to address with the
hash-based lookup were network operator use cases:
- a lot of dynamic /32 <=> /32 policies (protecting GTP tunnels)
- or a lot of dynamic policies with the same prefix lengths (e.g. /16 <=> /24)
and a few non-hashed policies stored in the linked list.

This solutions gives good performance for both lookup *and*
configuration with a big number of SPs. The lookup time is essential,
but so is the IKE negotiation rate. And the configuration time has a
direct impact on it.

I have not yet had time to benchmark the new tree-based implementation.

Could you verify how the configuration time would behave in such use
cases if the hash-based lookup was replaced by a tree-based lookup?

Regards
Christophe

2018-12-14 16:25:09

by Florian Westphal

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

Christophe Gouault <[email protected]> wrote:
> The main use cases I have encountered and tried to address with the
> hash-based lookup were network operator use cases:
> - a lot of dynamic /32 <=> /32 policies (protecting GTP tunnels)
> - or a lot of dynamic policies with the same prefix lengths (e.g. /16 <=> /24)
> and a few non-hashed policies stored in the linked list.

Thanks for the detailed information.

[..]

> Could you verify how the configuration time would behave in such use
> cases if the hash-based lookup was replaced by a tree-based lookup?

I won't send a patch to remove your work, at least not at this time.

In case I'd do this removal (thresholds, hash table, or both)
i will make these tests to see how large the impact is.

2018-12-14 16:32:23

by Christophe Gouault

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

Le ven. 14 déc. 2018 à 17:23, Florian Westphal <[email protected]> a écrit :
> I won't send a patch to remove your work, at least not at this time.
>
> In case I'd do this removal (thresholds, hash table, or both)
> i will make these tests to see how large the impact is.

Perfect, thanks
Christophe

2018-12-14 19:08:40

by David Miller

[permalink] [raw]
Subject: Re: INFO: rcu detected stall in xfrm_hash_rebuild

From: Florian Westphal <[email protected]>
Date: Fri, 14 Dec 2018 17:23:33 +0100

> Christophe Gouault <[email protected]> wrote:
>> The main use cases I have encountered and tried to address with the
>> hash-based lookup were network operator use cases:
>> - a lot of dynamic /32 <=> /32 policies (protecting GTP tunnels)
>> - or a lot of dynamic policies with the same prefix lengths (e.g. /16 <=> /24)
>> and a few non-hashed policies stored in the linked list.
>
> Thanks for the detailed information.
>
> [..]
>
>> Could you verify how the configuration time would behave in such use
>> cases if the hash-based lookup was replaced by a tree-based lookup?
>
> I won't send a patch to remove your work, at least not at this time.
>
> In case I'd do this removal (thresholds, hash table, or both)
> i will make these tests to see how large the impact is.

Florian, thanks for working so hard on this.

The operational feedback is absolutely essential for figuring out how
to move forward on this issue, otherwise we'll just keep going back
and forth on the approach.