LinuxLists.cc - [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Kajetan Puchalski <[email protected]> wrote:
> While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> Jade platform) I encountered a kernel panic caused by NULL pointer
> dereference within nf_conntrack.
>
> The issue is present in the latest mainline (5.19-rc4), latest stable
> (5.18.8), as well as multiple older stable versions. The last working
> stable version I found was 5.15.40.

Do I need a special setup for conntrack?

No crashes after more than one hour of stress-ng on
1. 4 core amd64 Fedora 5.17 kernel
2. 16 core amd64, linux stable 5.17.15
3. 12 core intel, Fedora 5.18 kernel
4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64

I used standard firewalld ruleset for all of these and manually tuned
conntrack settings to make sure the early evict path (as per backtrace)
gets exercised.

2022-07-02 11:11:30

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> Kajetan Puchalski <[email protected]> wrote:
> > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > Jade platform) I encountered a kernel panic caused by NULL pointer
> > dereference within nf_conntrack.
> >
> > The issue is present in the latest mainline (5.19-rc4), latest stable
> > (5.18.8), as well as multiple older stable versions. The last working
> > stable version I found was 5.15.40.
>
> Do I need a special setup for conntrack?

I don't think there was any special setup involved, the config I started
from was a generic distribution config and I didn't change any
networking-specific options. In case that's helpful here's the .config I
used.

https://pastebin.com/Bb2wttdx

>
> No crashes after more than one hour of stress-ng on
> 1. 4 core amd64 Fedora 5.17 kernel
> 2. 16 core amd64, linux stable 5.17.15
> 3. 12 core intel, Fedora 5.18 kernel
> 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
>

That would make sense, from further experiments I ran it somehow seems
to be related to the number of workers being spawned by stress-ng along
with the CPUs/cores involved.

For instance, running the test with <=25 workers (--udp-flood 25 etc.)
results in the test running fine for at least 15 minutes.
Running the test with 30 workers results in a panic sometime before it
hits the 15 minute mark.
Based on observations there seems to be a corellation between the number
of workers and how quickly the panic occurs, ie with 30 it takes a few
minutes, with 160 it consistently happens almost immediately. That also
holds for various numbers of workers in between.

On the CPU/core side of things, the machine in question has two CPU
sockets with 80 identical cores each. All the panics I've encountered
happened when stress-ng was ran directly and unbound.
When I tried using hwloc-bind to bind the process to one of the CPU
sockets, the test ran for 15 mins with 80 and 160 workers with no issues,
no matter which CPU it was bound to.

Ie the specific circumstances under which it seems to occur are when the
test is able to run across multiple CPU sockets with a large number
of workers being spawned.

> I used standard firewalld ruleset for all of these and manually tuned
> conntrack settings to make sure the early evict path (as per backtrace)
> gets exercised.

2022-07-02 12:46:36

by Willy Tarreau

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Sat, Jul 02, 2022 at 12:08:46PM +0100, Kajetan Puchalski wrote:
> On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> > Kajetan Puchalski <[email protected]> wrote:
> > > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > > Jade platform) I encountered a kernel panic caused by NULL pointer
> > > dereference within nf_conntrack.
> > >
> > > The issue is present in the latest mainline (5.19-rc4), latest stable
> > > (5.18.8), as well as multiple older stable versions. The last working
> > > stable version I found was 5.15.40.
> >
> > Do I need a special setup for conntrack?
>
> I don't think there was any special setup involved, the config I started
> from was a generic distribution config and I didn't change any
> networking-specific options. In case that's helpful here's the .config I
> used.
>
> https://pastebin.com/Bb2wttdx
>
> >
> > No crashes after more than one hour of stress-ng on
> > 1. 4 core amd64 Fedora 5.17 kernel
> > 2. 16 core amd64, linux stable 5.17.15
> > 3. 12 core intel, Fedora 5.18 kernel
> > 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
> >
>
> That would make sense, from further experiments I ran it somehow seems
> to be related to the number of workers being spawned by stress-ng along
> with the CPUs/cores involved.
>
> For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> results in the test running fine for at least 15 minutes.

Another point to keep in mind is that modern ARM processors (ARMv8.1 and
above) have a more relaxed memory model than older ones (and x86), that
can easily exhibit a missing barrier somewhere. I faced this situation
already in the past the first time I ran my code on Graviton2, which
caused crashes that would never happen on A53/A72/A73 cores nor x86.

ARMv8.1 SoCs are not yet widely available for end users like us. A76
is only coming, and A55 has now been available for a bit more than a
year. So testing on regular ARM devices like RPi etc may not exhibit
such differences.

> Running the test with 30 workers results in a panic sometime before it
> hits the 15 minute mark.
> Based on observations there seems to be a corellation between the number
> of workers and how quickly the panic occurs, ie with 30 it takes a few
> minutes, with 160 it consistently happens almost immediately. That also
> holds for various numbers of workers in between.
>
> On the CPU/core side of things, the machine in question has two CPU
> sockets with 80 identical cores each. All the panics I've encountered
> happened when stress-ng was ran directly and unbound.
> When I tried using hwloc-bind to bind the process to one of the CPU
> sockets, the test ran for 15 mins with 80 and 160 workers with no issues,
> no matter which CPU it was bound to.
>
> Ie the specific circumstances under which it seems to occur are when the
> test is able to run across multiple CPU sockets with a large number
> of workers being spawned.

This could further fuel the possibliity explained above.

Regards,
Willy

2022-07-02 21:02:31

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Kajetan Puchalski <[email protected]> wrote:
> On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> > Kajetan Puchalski <[email protected]> wrote:
> > > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > > Jade platform) I encountered a kernel panic caused by NULL pointer
> > > dereference within nf_conntrack.
> > >
> > > The issue is present in the latest mainline (5.19-rc4), latest stable
> > > (5.18.8), as well as multiple older stable versions. The last working
> > > stable version I found was 5.15.40.
> >
> > Do I need a special setup for conntrack?
>
> I don't think there was any special setup involved, the config I started
> from was a generic distribution config and I didn't change any
> networking-specific options. In case that's helpful here's the .config I
> used.
>
> https://pastebin.com/Bb2wttdx
>
> >
> > No crashes after more than one hour of stress-ng on
> > 1. 4 core amd64 Fedora 5.17 kernel
> > 2. 16 core amd64, linux stable 5.17.15
> > 3. 12 core intel, Fedora 5.18 kernel
> > 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
> >
>
> That would make sense, from further experiments I ran it somehow seems
> to be related to the number of workers being spawned by stress-ng along
> with the CPUs/cores involved.
>
> For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> results in the test running fine for at least 15 minutes.

Ok. I will let it run for longer on the machines I have access to.

In mean time, you could test attached patch, its simple s/refcount_/atomic_/
in nf_conntrack.

If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
but works with attached patch someone who understands aarch64 memory ordering
would have to look more closely at refcount_XXX functions to see where they
might differ from atomic_ ones.

If it still crashes, please try below hunk in addition, although I don't see
how it would make a difference.

This is the one spot where the original conversion replaced atomic_inc()
with refcount_set(), this is on allocation, refcount is expected to be 0 so
refcount_inc() triggers a warning hinting at a use-after free.

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

/* Now it is going to be associated with an sk_buff, set refcount to 1. */
- atomic_set(&ct->ct_general.use, 1);
+ atomic_inc(&ct->ct_general.use);

if (exp) {
if (exp->expectfn)

Attachments:

(No filename) (2.71 kB)
0001-netfilter-conntrack-revert-to-atomic_t-api.patch (10.46 kB)
Download all attachments

2022-07-04 09:41:53

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > That would make sense, from further experiments I ran it somehow seems
> > to be related to the number of workers being spawned by stress-ng along
> > with the CPUs/cores involved.
> >
> > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > results in the test running fine for at least 15 minutes.
>
> Ok. I will let it run for longer on the machines I have access to.
>
> In mean time, you could test attached patch, its simple s/refcount_/atomic_/
> in nf_conntrack.
>
> If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> but works with attached patch someone who understands aarch64 memory ordering
> would have to look more closely at refcount_XXX functions to see where they
> might differ from atomic_ ones.

I can confirm that the patch seems to solve the issue.
With it applied on top of the 5.19-rc5 tag the test runs fine for at
least 15 minutes which was not the case before so it looks like it is
that aarch64 memory ordering problem.

>
> If it still crashes, please try below hunk in addition, although I don't see
> how it would make a difference.
>
> This is the one spot where the original conversion replaced atomic_inc()
> with refcount_set(), this is on allocation, refcount is expected to be 0 so
> refcount_inc() triggers a warning hinting at a use-after free.
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> - atomic_set(&ct->ct_general.use, 1);
> + atomic_inc(&ct->ct_general.use);
>
> if (exp) {
> if (exp->expectfn)

> From 4234018dff486bdc30f4fe4625c8da1a8e30c2f6 Mon Sep 17 00:00:00 2001
> From: Florian Westphal <[email protected]>
> Date: Sat, 2 Jul 2022 22:42:57 +0200
> Subject: [PATCH 1/1] netfilter: conntrack: revert to atomic_t api
>
> Just for testing.
> ---
> include/linux/netfilter/nf_conntrack_common.h | 6 ++---
> include/net/netfilter/nf_conntrack.h | 2 +-
> net/netfilter/nf_conntrack_core.c | 24 +++++++++----------
> net/netfilter/nf_conntrack_expect.c | 2 +-
> net/netfilter/nf_conntrack_netlink.c | 6 ++---
> net/netfilter/nf_conntrack_standalone.c | 4 ++--
> net/netfilter/nf_flow_table_core.c | 2 +-
> net/netfilter/nft_ct.c | 4 ++--
> net/netfilter/xt_CT.c | 2 +-
> 9 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
> index 2770db2fa080..48a78944182d 100644
> --- a/include/linux/netfilter/nf_conntrack_common.h
> +++ b/include/linux/netfilter/nf_conntrack_common.h
> @@ -25,7 +25,7 @@ struct ip_conntrack_stat {
> #define NFCT_PTRMASK ~(NFCT_INFOMASK)
>
> struct nf_conntrack {
> - refcount_t use;
> + atomic_t use;
> };
>
> void nf_conntrack_destroy(struct nf_conntrack *nfct);
> @@ -33,13 +33,13 @@ void nf_conntrack_destroy(struct nf_conntrack *nfct);
> /* like nf_ct_put, but without module dependency on nf_conntrack */
> static inline void nf_conntrack_put(struct nf_conntrack *nfct)
> {
> - if (nfct && refcount_dec_and_test(&nfct->use))
> + if (nfct && atomic_dec_and_test(&nfct->use))
> nf_conntrack_destroy(nfct);
> }
> static inline void nf_conntrack_get(struct nf_conntrack *nfct)
> {
> if (nfct)
> - refcount_inc(&nfct->use);
> + atomic_inc(&nfct->use);
> }
>
> #endif /* _NF_CONNTRACK_COMMON_H */
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index a32be8aa7ed2..9fab0c8835bb 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -180,7 +180,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct);
> /* decrement reference count on a conntrack */
> static inline void nf_ct_put(struct nf_conn *ct)
> {
> - if (ct && refcount_dec_and_test(&ct->ct_general.use))
> + if (ct && atomic_dec_and_test(&ct->ct_general.use))
> nf_ct_destroy(&ct->ct_general);
> }
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..4469e49d78a7 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -554,7 +554,7 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
> tmpl->status = IPS_TEMPLATE;
> write_pnet(&tmpl->ct_net, net);
> nf_ct_zone_add(tmpl, zone);
> - refcount_set(&tmpl->ct_general.use, 1);
> + atomic_set(&tmpl->ct_general.use, 1);
>
> return tmpl;
> }
> @@ -586,7 +586,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct)
> struct nf_conn *ct = (struct nf_conn *)nfct;
>
> pr_debug("%s(%p)\n", __func__, ct);
> - WARN_ON(refcount_read(&nfct->use) != 0);
> + WARN_ON(atomic_read(&nfct->use) != 0);
>
> if (unlikely(nf_ct_is_template(ct))) {
> nf_ct_tmpl_free(ct);
> @@ -726,7 +726,7 @@ nf_ct_match(const struct nf_conn *ct1, const struct nf_conn *ct2)
> /* caller must hold rcu readlock and none of the nf_conntrack_locks */
> static void nf_ct_gc_expired(struct nf_conn *ct)
> {
> - if (!refcount_inc_not_zero(&ct->ct_general.use))
> + if (!atomic_inc_not_zero(&ct->ct_general.use))
> return;
>
> if (nf_ct_should_gc(ct))
> @@ -794,7 +794,7 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
> * in, try to obtain a reference and re-check tuple
> */
> ct = nf_ct_tuplehash_to_ctrack(h);
> - if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
> + if (likely(atomic_inc_not_zero(&ct->ct_general.use))) {
> if (likely(nf_ct_key_equal(h, tuple, zone, net)))
> goto found;
>
> @@ -923,7 +923,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
>
> smp_wmb();
> /* The caller holds a reference to this object */
> - refcount_set(&ct->ct_general.use, 2);
> + atomic_set(&ct->ct_general.use, 2);
> __nf_conntrack_hash_insert(ct, hash, reply_hash);
> nf_conntrack_double_unlock(hash, reply_hash);
> NF_CT_STAT_INC(net, insert);
> @@ -981,7 +981,7 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
> {
> struct nf_conn_tstamp *tstamp;
>
> - refcount_inc(&ct->ct_general.use);
> + atomic_inc(&ct->ct_general.use);
>
> /* set conntrack timestamp, if enabled. */
> tstamp = nf_conn_tstamp_find(ct);
> @@ -1384,7 +1384,7 @@ static unsigned int early_drop_list(struct net *net,
> nf_ct_is_dying(tmp))
> continue;
>
> - if (!refcount_inc_not_zero(&tmp->ct_general.use))
> + if (!atomic_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> /* kill only if still in same netns -- might have moved due to
> @@ -1533,7 +1533,7 @@ static void gc_worker(struct work_struct *work)
> continue;
>
> /* need to take reference to avoid possible races */
> - if (!refcount_inc_not_zero(&tmp->ct_general.use))
> + if (!atomic_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> if (gc_worker_skip_ct(tmp)) {
> @@ -1640,7 +1640,7 @@ __nf_conntrack_alloc(struct net *net,
> /* Because we use RCU lookups, we set ct_general.use to zero before
> * this is inserted in any list.
> */
> - refcount_set(&ct->ct_general.use, 0);
> + atomic_set(&ct->ct_general.use, 0);
> return ct;
> out:
> atomic_dec(&cnet->count);
> @@ -1665,7 +1665,7 @@ void nf_conntrack_free(struct nf_conn *ct)
> /* A freed object has refcnt == 0, that's
> * the golden rule for SLAB_TYPESAFE_BY_RCU
> */
> - WARN_ON(refcount_read(&ct->ct_general.use) != 0);
> + WARN_ON(atomic_read(&ct->ct_general.use) != 0);
>
> if (ct->status & IPS_SRC_NAT_DONE) {
> const struct nf_nat_hook *nat_hook;
> @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> - refcount_set(&ct->ct_general.use, 1);
> + atomic_set(&ct->ct_general.use, 1);
>
> if (exp) {
> if (exp->expectfn)
> @@ -2390,7 +2390,7 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
>
> return NULL;
> found:
> - refcount_inc(&ct->ct_general.use);
> + atomic_inc(&ct->ct_general.use);
> spin_unlock(lockp);
> local_bh_enable();
> return ct;
> diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
> index 96948e98ec53..84cb05eae410 100644
> --- a/net/netfilter/nf_conntrack_expect.c
> +++ b/net/netfilter/nf_conntrack_expect.c
> @@ -208,7 +208,7 @@ nf_ct_find_expectation(struct net *net,
> * can be sure the ct cannot disappear underneath.
> */
> if (unlikely(nf_ct_is_dying(exp->master) ||
> - !refcount_inc_not_zero(&exp->master->ct_general.use)))
> + !atomic_inc_not_zero(&exp->master->ct_general.use)))
> return NULL;
>
> if (exp->flags & NF_CT_EXPECT_PERMANENT) {
> diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
> index 722af5e309ba..d5de0e580e6c 100644
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -514,7 +514,7 @@ static int ctnetlink_dump_id(struct sk_buff *skb, const struct nf_conn *ct)
>
> static int ctnetlink_dump_use(struct sk_buff *skb, const struct nf_conn *ct)
> {
> - if (nla_put_be32(skb, CTA_USE, htonl(refcount_read(&ct->ct_general.use))))
> + if (nla_put_be32(skb, CTA_USE, htonl(atomic_read(&ct->ct_general.use))))
> goto nla_put_failure;
> return 0;
>
> @@ -1204,7 +1204,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
> ct = nf_ct_tuplehash_to_ctrack(h);
> if (nf_ct_is_expired(ct)) {
> if (i < ARRAY_SIZE(nf_ct_evict) &&
> - refcount_inc_not_zero(&ct->ct_general.use))
> + atomic_inc_not_zero(&ct->ct_general.use))
> nf_ct_evict[i++] = ct;
> continue;
> }
> @@ -1747,7 +1747,7 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
> NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
> ct, dying, 0);
> if (res < 0) {
> - if (!refcount_inc_not_zero(&ct->ct_general.use))
> + if (!atomic_inc_not_zero(&ct->ct_general.use))
> return 0;
>
> ctx->last = ct;
> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> index 6ad7bbc90d38..badd3f219533 100644
> --- a/net/netfilter/nf_conntrack_standalone.c
> +++ b/net/netfilter/nf_conntrack_standalone.c
> @@ -303,7 +303,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
> int ret = 0;
>
> WARN_ON(!ct);
> - if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
> return 0;
>
> if (nf_ct_should_gc(ct)) {
> @@ -370,7 +370,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
> ct_show_zone(s, ct, NF_CT_DEFAULT_ZONE_DIR);
> ct_show_delta_time(s, ct);
>
> - seq_printf(s, "use=%u\n", refcount_read(&ct->ct_general.use));
> + seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use));
>
> if (seq_has_overflowed(s))
> goto release;
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index f2def06d1070..8b3f91a60ba2 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -54,7 +54,7 @@ struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
> struct flow_offload *flow;
>
> if (unlikely(nf_ct_is_dying(ct) ||
> - !refcount_inc_not_zero(&ct->ct_general.use)))
> + !atomic_inc_not_zero(&ct->ct_general.use)))
> return NULL;
>
> flow = kzalloc(sizeof(*flow), GFP_ATOMIC);
> diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
> index d8e1614918a1..1b6ead61a8f1 100644
> --- a/net/netfilter/nft_ct.c
> +++ b/net/netfilter/nft_ct.c
> @@ -260,8 +260,8 @@ static void nft_ct_set_zone_eval(const struct nft_expr *expr,
>
> ct = this_cpu_read(nft_ct_pcpu_template);
>
> - if (likely(refcount_read(&ct->ct_general.use) == 1)) {
> - refcount_inc(&ct->ct_general.use);
> + if (likely(atomic_read(&ct->ct_general.use) == 1)) {
> + atomic_inc(&ct->ct_general.use);
> nf_ct_zone_add(ct, &zone);
> } else {
> /* previous skb got queued to userspace, allocate temporary
> diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
> index 267757b0392a..cf2f8c1d4fb5 100644
> --- a/net/netfilter/xt_CT.c
> +++ b/net/netfilter/xt_CT.c
> @@ -24,7 +24,7 @@ static inline int xt_ct_target(struct sk_buff *skb, struct nf_conn *ct)
> return XT_CONTINUE;
>
> if (ct) {
> - refcount_inc(&ct->ct_general.use);
> + atomic_inc(&ct->ct_general.use);
> nf_ct_set(skb, ct, IP_CT_NEW);
> } else {
> nf_ct_set(skb, ct, IP_CT_UNTRACKED);
> --
> 2.35.1
>

2022-07-05 11:04:54

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > That would make sense, from further experiments I ran it somehow seems
> > > to be related to the number of workers being spawned by stress-ng along
> > > with the CPUs/cores involved.
> > >
> > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > results in the test running fine for at least 15 minutes.
> >
> > Ok. I will let it run for longer on the machines I have access to.
> >
> > In mean time, you could test attached patch, its simple s/refcount_/atomic_/
> > in nf_conntrack.
> >
> > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > but works with attached patch someone who understands aarch64 memory ordering
> > would have to look more closely at refcount_XXX functions to see where they
> > might differ from atomic_ ones.
>
> I can confirm that the patch seems to solve the issue.
> With it applied on top of the 5.19-rc5 tag the test runs fine for at
> least 15 minutes which was not the case before so it looks like it is
> that aarch64 memory ordering problem.

I'm CCing some people who should be able to help with aarch64 memory
ordering, maybe they could take a look.

(re-sending due to a typo in CC, sorry for duplicate emails!)

>
> >
> > If it still crashes, please try below hunk in addition, although I don't see
> > how it would make a difference.
> >
> > This is the one spot where the original conversion replaced atomic_inc()
> > with refcount_set(), this is on allocation, refcount is expected to be 0 so
> > refcount_inc() triggers a warning hinting at a use-after free.
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> > __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
> >
> > /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> > - atomic_set(&ct->ct_general.use, 1);
> > + atomic_inc(&ct->ct_general.use);
> >
> > if (exp) {
> > if (exp->expectfn)
>
> > From 4234018dff486bdc30f4fe4625c8da1a8e30c2f6 Mon Sep 17 00:00:00 2001
> > From: Florian Westphal <[email protected]>
> > Date: Sat, 2 Jul 2022 22:42:57 +0200
> > Subject: [PATCH 1/1] netfilter: conntrack: revert to atomic_t api
> >
> > Just for testing.
> > ---
> > include/linux/netfilter/nf_conntrack_common.h | 6 ++---
> > include/net/netfilter/nf_conntrack.h | 2 +-
> > net/netfilter/nf_conntrack_core.c | 24 +++++++++----------
> > net/netfilter/nf_conntrack_expect.c | 2 +-
> > net/netfilter/nf_conntrack_netlink.c | 6 ++---
> > net/netfilter/nf_conntrack_standalone.c | 4 ++--
> > net/netfilter/nf_flow_table_core.c | 2 +-
> > net/netfilter/nft_ct.c | 4 ++--
> > net/netfilter/xt_CT.c | 2 +-
> > 9 files changed, 26 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
> > index 2770db2fa080..48a78944182d 100644
> > --- a/include/linux/netfilter/nf_conntrack_common.h
> > +++ b/include/linux/netfilter/nf_conntrack_common.h
> > @@ -25,7 +25,7 @@ struct ip_conntrack_stat {
> > #define NFCT_PTRMASK ~(NFCT_INFOMASK)
> >
> > struct nf_conntrack {
> > - refcount_t use;
> > + atomic_t use;
> > };
> >
> > void nf_conntrack_destroy(struct nf_conntrack *nfct);
> > @@ -33,13 +33,13 @@ void nf_conntrack_destroy(struct nf_conntrack *nfct);
> > /* like nf_ct_put, but without module dependency on nf_conntrack */
> > static inline void nf_conntrack_put(struct nf_conntrack *nfct)
> > {
> > - if (nfct && refcount_dec_and_test(&nfct->use))
> > + if (nfct && atomic_dec_and_test(&nfct->use))
> > nf_conntrack_destroy(nfct);
> > }
> > static inline void nf_conntrack_get(struct nf_conntrack *nfct)
> > {
> > if (nfct)
> > - refcount_inc(&nfct->use);
> > + atomic_inc(&nfct->use);
> > }
> >
> > #endif /* _NF_CONNTRACK_COMMON_H */
> > diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> > index a32be8aa7ed2..9fab0c8835bb 100644
> > --- a/include/net/netfilter/nf_conntrack.h
> > +++ b/include/net/netfilter/nf_conntrack.h
> > @@ -180,7 +180,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct);
> > /* decrement reference count on a conntrack */
> > static inline void nf_ct_put(struct nf_conn *ct)
> > {
> > - if (ct && refcount_dec_and_test(&ct->ct_general.use))
> > + if (ct && atomic_dec_and_test(&ct->ct_general.use))
> > nf_ct_destroy(&ct->ct_general);
> > }
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 082a2fd8d85b..4469e49d78a7 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -554,7 +554,7 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
> > tmpl->status = IPS_TEMPLATE;
> > write_pnet(&tmpl->ct_net, net);
> > nf_ct_zone_add(tmpl, zone);
> > - refcount_set(&tmpl->ct_general.use, 1);
> > + atomic_set(&tmpl->ct_general.use, 1);
> >
> > return tmpl;
> > }
> > @@ -586,7 +586,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct)
> > struct nf_conn *ct = (struct nf_conn *)nfct;
> >
> > pr_debug("%s(%p)\n", __func__, ct);
> > - WARN_ON(refcount_read(&nfct->use) != 0);
> > + WARN_ON(atomic_read(&nfct->use) != 0);
> >
> > if (unlikely(nf_ct_is_template(ct))) {
> > nf_ct_tmpl_free(ct);
> > @@ -726,7 +726,7 @@ nf_ct_match(const struct nf_conn *ct1, const struct nf_conn *ct2)
> > /* caller must hold rcu readlock and none of the nf_conntrack_locks */
> > static void nf_ct_gc_expired(struct nf_conn *ct)
> > {
> > - if (!refcount_inc_not_zero(&ct->ct_general.use))
> > + if (!atomic_inc_not_zero(&ct->ct_general.use))
> > return;
> >
> > if (nf_ct_should_gc(ct))
> > @@ -794,7 +794,7 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
> > * in, try to obtain a reference and re-check tuple
> > */
> > ct = nf_ct_tuplehash_to_ctrack(h);
> > - if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
> > + if (likely(atomic_inc_not_zero(&ct->ct_general.use))) {
> > if (likely(nf_ct_key_equal(h, tuple, zone, net)))
> > goto found;
> >
> > @@ -923,7 +923,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
> >
> > smp_wmb();
> > /* The caller holds a reference to this object */
> > - refcount_set(&ct->ct_general.use, 2);
> > + atomic_set(&ct->ct_general.use, 2);
> > __nf_conntrack_hash_insert(ct, hash, reply_hash);
> > nf_conntrack_double_unlock(hash, reply_hash);
> > NF_CT_STAT_INC(net, insert);
> > @@ -981,7 +981,7 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
> > {
> > struct nf_conn_tstamp *tstamp;
> >
> > - refcount_inc(&ct->ct_general.use);
> > + atomic_inc(&ct->ct_general.use);
> >
> > /* set conntrack timestamp, if enabled. */
> > tstamp = nf_conn_tstamp_find(ct);
> > @@ -1384,7 +1384,7 @@ static unsigned int early_drop_list(struct net *net,
> > nf_ct_is_dying(tmp))
> > continue;
> >
> > - if (!refcount_inc_not_zero(&tmp->ct_general.use))
> > + if (!atomic_inc_not_zero(&tmp->ct_general.use))
> > continue;
> >
> > /* kill only if still in same netns -- might have moved due to
> > @@ -1533,7 +1533,7 @@ static void gc_worker(struct work_struct *work)
> > continue;
> >
> > /* need to take reference to avoid possible races */
> > - if (!refcount_inc_not_zero(&tmp->ct_general.use))
> > + if (!atomic_inc_not_zero(&tmp->ct_general.use))
> > continue;
> >
> > if (gc_worker_skip_ct(tmp)) {
> > @@ -1640,7 +1640,7 @@ __nf_conntrack_alloc(struct net *net,
> > /* Because we use RCU lookups, we set ct_general.use to zero before
> > * this is inserted in any list.
> > */
> > - refcount_set(&ct->ct_general.use, 0);
> > + atomic_set(&ct->ct_general.use, 0);
> > return ct;
> > out:
> > atomic_dec(&cnet->count);
> > @@ -1665,7 +1665,7 @@ void nf_conntrack_free(struct nf_conn *ct)
> > /* A freed object has refcnt == 0, that's
> > * the golden rule for SLAB_TYPESAFE_BY_RCU
> > */
> > - WARN_ON(refcount_read(&ct->ct_general.use) != 0);
> > + WARN_ON(atomic_read(&ct->ct_general.use) != 0);
> >
> > if (ct->status & IPS_SRC_NAT_DONE) {
> > const struct nf_nat_hook *nat_hook;
> > @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> > __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
> >
> > /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> > - refcount_set(&ct->ct_general.use, 1);
> > + atomic_set(&ct->ct_general.use, 1);
> >
> > if (exp) {
> > if (exp->expectfn)
> > @@ -2390,7 +2390,7 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
> >
> > return NULL;
> > found:
> > - refcount_inc(&ct->ct_general.use);
> > + atomic_inc(&ct->ct_general.use);
> > spin_unlock(lockp);
> > local_bh_enable();
> > return ct;
> > diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
> > index 96948e98ec53..84cb05eae410 100644
> > --- a/net/netfilter/nf_conntrack_expect.c
> > +++ b/net/netfilter/nf_conntrack_expect.c
> > @@ -208,7 +208,7 @@ nf_ct_find_expectation(struct net *net,
> > * can be sure the ct cannot disappear underneath.
> > */
> > if (unlikely(nf_ct_is_dying(exp->master) ||
> > - !refcount_inc_not_zero(&exp->master->ct_general.use)))
> > + !atomic_inc_not_zero(&exp->master->ct_general.use)))
> > return NULL;
> >
> > if (exp->flags & NF_CT_EXPECT_PERMANENT) {
> > diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
> > index 722af5e309ba..d5de0e580e6c 100644
> > --- a/net/netfilter/nf_conntrack_netlink.c
> > +++ b/net/netfilter/nf_conntrack_netlink.c
> > @@ -514,7 +514,7 @@ static int ctnetlink_dump_id(struct sk_buff *skb, const struct nf_conn *ct)
> >
> > static int ctnetlink_dump_use(struct sk_buff *skb, const struct nf_conn *ct)
> > {
> > - if (nla_put_be32(skb, CTA_USE, htonl(refcount_read(&ct->ct_general.use))))
> > + if (nla_put_be32(skb, CTA_USE, htonl(atomic_read(&ct->ct_general.use))))
> > goto nla_put_failure;
> > return 0;
> >
> > @@ -1204,7 +1204,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
> > ct = nf_ct_tuplehash_to_ctrack(h);
> > if (nf_ct_is_expired(ct)) {
> > if (i < ARRAY_SIZE(nf_ct_evict) &&
> > - refcount_inc_not_zero(&ct->ct_general.use))
> > + atomic_inc_not_zero(&ct->ct_general.use))
> > nf_ct_evict[i++] = ct;
> > continue;
> > }
> > @@ -1747,7 +1747,7 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
> > NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
> > ct, dying, 0);
> > if (res < 0) {
> > - if (!refcount_inc_not_zero(&ct->ct_general.use))
> > + if (!atomic_inc_not_zero(&ct->ct_general.use))
> > return 0;
> >
> > ctx->last = ct;
> > diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> > index 6ad7bbc90d38..badd3f219533 100644
> > --- a/net/netfilter/nf_conntrack_standalone.c
> > +++ b/net/netfilter/nf_conntrack_standalone.c
> > @@ -303,7 +303,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
> > int ret = 0;
> >
> > WARN_ON(!ct);
> > - if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
> > + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
> > return 0;
> >
> > if (nf_ct_should_gc(ct)) {
> > @@ -370,7 +370,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
> > ct_show_zone(s, ct, NF_CT_DEFAULT_ZONE_DIR);
> > ct_show_delta_time(s, ct);
> >
> > - seq_printf(s, "use=%u\n", refcount_read(&ct->ct_general.use));
> > + seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use));
> >
> > if (seq_has_overflowed(s))
> > goto release;
> > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > index f2def06d1070..8b3f91a60ba2 100644
> > --- a/net/netfilter/nf_flow_table_core.c
> > +++ b/net/netfilter/nf_flow_table_core.c
> > @@ -54,7 +54,7 @@ struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
> > struct flow_offload *flow;
> >
> > if (unlikely(nf_ct_is_dying(ct) ||
> > - !refcount_inc_not_zero(&ct->ct_general.use)))
> > + !atomic_inc_not_zero(&ct->ct_general.use)))
> > return NULL;
> >
> > flow = kzalloc(sizeof(*flow), GFP_ATOMIC);
> > diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
> > index d8e1614918a1..1b6ead61a8f1 100644
> > --- a/net/netfilter/nft_ct.c
> > +++ b/net/netfilter/nft_ct.c
> > @@ -260,8 +260,8 @@ static void nft_ct_set_zone_eval(const struct nft_expr *expr,
> >
> > ct = this_cpu_read(nft_ct_pcpu_template);
> >
> > - if (likely(refcount_read(&ct->ct_general.use) == 1)) {
> > - refcount_inc(&ct->ct_general.use);
> > + if (likely(atomic_read(&ct->ct_general.use) == 1)) {
> > + atomic_inc(&ct->ct_general.use);
> > nf_ct_zone_add(ct, &zone);
> > } else {
> > /* previous skb got queued to userspace, allocate temporary
> > diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
> > index 267757b0392a..cf2f8c1d4fb5 100644
> > --- a/net/netfilter/xt_CT.c
> > +++ b/net/netfilter/xt_CT.c
> > @@ -24,7 +24,7 @@ static inline int xt_ct_target(struct sk_buff *skb, struct nf_conn *ct)
> > return XT_CONTINUE;
> >
> > if (ct) {
> > - refcount_inc(&ct->ct_general.use);
> > + atomic_inc(&ct->ct_general.use);
> > nf_ct_set(skb, ct, IP_CT_NEW);
> > } else {
> > nf_ct_set(skb, ct, IP_CT_UNTRACKED);
> > --
> > 2.35.1
> >
>

2022-07-05 11:18:37

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> > On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > > That would make sense, from further experiments I ran it somehow seems
> > > > > to be related to the number of workers being spawned by stress-ng along
> > > > > with the CPUs/cores involved.
> > > > >
> > > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > > results in the test running fine for at least 15 minutes.
> > > >
> > > > Ok. I will let it run for longer on the machines I have access to.
> > > >
> > > > In mean time, you could test attached patch, its simple s/refcount_/atomic_/
> > > > in nf_conntrack.
> > > >
> > > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > > but works with attached patch someone who understands aarch64 memory ordering
> > > > would have to look more closely at refcount_XXX functions to see where they
> > > > might differ from atomic_ ones.
> > >
> > > I can confirm that the patch seems to solve the issue.
> > > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > > least 15 minutes which was not the case before so it looks like it is
> > > that aarch64 memory ordering problem.
> >
> > I'm CCing some people who should be able to help with aarch64 memory
> > ordering, maybe they could take a look.
> >
> > (re-sending due to a typo in CC, sorry for duplicate emails!)
>
> Sorry, but I have absolutely no context here. We have a handy document
> describing the differences between atomic_t and refcount_t:
>
> Documentation/core-api/refcount-vs-atomic.rst
>
> What else do you need to know?

Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
("netfilter: conntrack: convert to refcount_t api") which mean that you
no longer have ordering to subsequent reads in the absence of an address
dependency.

Will

2022-07-05 11:23:32

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > That would make sense, from further experiments I ran it somehow seems
> > > > to be related to the number of workers being spawned by stress-ng along
> > > > with the CPUs/cores involved.
> > > >
> > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > results in the test running fine for at least 15 minutes.
> > >
> > > Ok. I will let it run for longer on the machines I have access to.
> > >
> > > In mean time, you could test attached patch, its simple s/refcount_/atomic_/
> > > in nf_conntrack.
> > >
> > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > but works with attached patch someone who understands aarch64 memory ordering
> > > would have to look more closely at refcount_XXX functions to see where they
> > > might differ from atomic_ ones.
> >
> > I can confirm that the patch seems to solve the issue.
> > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > least 15 minutes which was not the case before so it looks like it is
> > that aarch64 memory ordering problem.
>
> I'm CCing some people who should be able to help with aarch64 memory
> ordering, maybe they could take a look.
>
> (re-sending due to a typo in CC, sorry for duplicate emails!)

Sorry, but I have absolutely no context here. We have a handy document
describing the differences between atomic_t and refcount_t:

Documentation/core-api/refcount-vs-atomic.rst

What else do you need to know?

Will

2022-07-05 11:33:21

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 12:07:25PM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
> > On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> > > On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > > > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > > > That would make sense, from further experiments I ran it somehow seems
> > > > > > to be related to the number of workers being spawned by stress-ng along
> > > > > > with the CPUs/cores involved.
> > > > > >
> > > > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > > > results in the test running fine for at least 15 minutes.
> > > > >
> > > > > Ok. I will let it run for longer on the machines I have access to.
> > > > >
> > > > > In mean time, you could test attached patch, its simple s/refcount_/atomic_/
> > > > > in nf_conntrack.
> > > > >
> > > > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > > > but works with attached patch someone who understands aarch64 memory ordering
> > > > > would have to look more closely at refcount_XXX functions to see where they
> > > > > might differ from atomic_ ones.
> > > >
> > > > I can confirm that the patch seems to solve the issue.
> > > > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > > > least 15 minutes which was not the case before so it looks like it is
> > > > that aarch64 memory ordering problem.
> > >
> > > I'm CCing some people who should be able to help with aarch64 memory
> > > ordering, maybe they could take a look.
> > >
> > > (re-sending due to a typo in CC, sorry for duplicate emails!)
> >
> > Sorry, but I have absolutely no context here. We have a handy document
> > describing the differences between atomic_t and refcount_t:
> >
> > Documentation/core-api/refcount-vs-atomic.rst
> >
> > What else do you need to know?
>
> Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> ("netfilter: conntrack: convert to refcount_t api") which mean that you
> no longer have ordering to subsequent reads in the absence of an address
> dependency.

I think the patch above needs auditing with the relaxed behaviour in mind,
but for the specific crash reported here possibly something like the diff
below?

Will

--->8

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
* already fired or someone else deleted it. Just drop ref
* and move to next entry.
*/
+ smp_rmb(); /* XXX: Why? */
if (net_eq(nf_ct_net(tmp), net) &&
nf_ct_is_confirmed(tmp) &&
nf_ct_delete(tmp, 0, 0))

2022-07-05 16:03:24

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

> > > Sorry, but I have absolutely no context here. We have a handy document
> > > describing the differences between atomic_t and refcount_t:
> > >
> > > Documentation/core-api/refcount-vs-atomic.rst
> > >
> > > What else do you need to know?
> >
> > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > no longer have ordering to subsequent reads in the absence of an address
> > dependency.
>
> I think the patch above needs auditing with the relaxed behaviour in mind,
> but for the specific crash reported here possibly something like the diff
> below?
>
> Will
>
> --->8
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..5ad9fcc84269 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> * already fired or someone else deleted it. Just drop ref
> * and move to next entry.
> */
> + smp_rmb(); /* XXX: Why? */
> if (net_eq(nf_ct_net(tmp), net) &&
> nf_ct_is_confirmed(tmp) &&
> nf_ct_delete(tmp, 0, 0))
>

With this patch applied the issue goes away as well. The test runs fine
well beyond where it would crash previously so looks good, thanks!

2022-07-06 10:43:54

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > Sorry, but I have absolutely no context here. We have a handy document
> > > describing the differences between atomic_t and refcount_t:
> > >
> > > Documentation/core-api/refcount-vs-atomic.rst
> > >
> > > What else do you need to know?
> >
> > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > no longer have ordering to subsequent reads in the absence of an address
> > dependency.
>
> I think the patch above needs auditing with the relaxed behaviour in mind,
> but for the specific crash reported here possibly something like the diff
> below?
>
> Will
>
> --->8
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..5ad9fcc84269 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> * already fired or someone else deleted it. Just drop ref
> * and move to next entry.
> */
> + smp_rmb(); /* XXX: Why? */
> if (net_eq(nf_ct_net(tmp), net) &&
> nf_ct_is_confirmed(tmp) &&
> nf_ct_delete(tmp, 0, 0))
>

Just to follow up, I think you're right, the patch in question should be
audited further for other missing memory barrier issues.
While this one smp_rmb() helps a lot, ie lets the test run for at least
an hour or two, an overnight 6 hour test still resulted in the same
crash somewhere along the way so it looks like it's not the only one
that's needed.

2022-07-06 12:27:23

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Kajetan Puchalski <[email protected]> wrote:
> On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > > Sorry, but I have absolutely no context here. We have a handy document
> > > > describing the differences between atomic_t and refcount_t:
> > > >
> > > > Documentation/core-api/refcount-vs-atomic.rst
> > > >
> > > > What else do you need to know?
> > >
> > > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > > no longer have ordering to subsequent reads in the absence of an address
> > > dependency.
> >
> > I think the patch above needs auditing with the relaxed behaviour in mind,
> > but for the specific crash reported here possibly something like the diff
> > below?
> >
> > Will
> >
> > --->8
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 082a2fd8d85b..5ad9fcc84269 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> > * already fired or someone else deleted it. Just drop ref
> > * and move to next entry.
> > */
> > + smp_rmb(); /* XXX: Why? */
> > if (net_eq(nf_ct_net(tmp), net) &&
> > nf_ct_is_confirmed(tmp) &&
> > nf_ct_delete(tmp, 0, 0))
> >
>
> Just to follow up, I think you're right, the patch in question should be
> audited further for other missing memory barrier issues.
> While this one smp_rmb() helps a lot, ie lets the test run for at least
> an hour or two, an overnight 6 hour test still resulted in the same
> crash somewhere along the way so it looks like it's not the only one
> that's needed.

Yes, I don't think that refcount_inc_not_zero is useable as-is for conntrack.
Here is a patch, I hope this will get things back to a working order without
a revert to atomic_t api.

Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1 CPU2
enounters 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
refcount_inc_not_zero(ct); /* might fail */

/* If set, ct is public/in the hash table */
test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.

If not, the object must be skipped; CPU2 calls nf_ct_put() to
un-do the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.

This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because other CPU might still 'see' the object, refcount_set(1)
"resurrects" the object, so we need to make sure that other CPUs will
also observe the right contents. In particular, the CONFIRMED bit test
must only pass once the object is fully initialised and either in the
hash or about to be inserted (with locks held to delay possible unlink from
early_drop or gc worker).

I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.

Reported-by: Kajetan Puchalski <[email protected]>
Diagnosed-by: Will Deacon <[email protected]>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <[email protected]>
---
include/net/netfilter/nf_conntrack.h | 3 +++
net/netfilter/nf_conntrack_core.c | 19 +++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..3dc3646ffba2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
/* use after obtaining a reference count */
static inline bool nf_ct_should_gc(const struct nf_conn *ct)
{
+ /* ->status and ->timeout loads must happen after refcount increase */
+ smp_rmb();
+
return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
!nf_ct_is_dying(ct);
}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..072cabf1b296 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
*/
ct = nf_ct_tuplehash_to_ctrack(h);
if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+ /* re-check key after refcount */
+ smp_rmb();
+
if (likely(nf_ct_key_equal(h, tuple, zone, net)))
goto found;

@@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
* We steal the timer reference. If that fails timer has
* already fired or someone else deleted it. Just drop ref
* and move to next entry.
+ *
+ * smp_rmb to ensure ->ct_net and ->status are loaded after
+ * refcount increase.
*/
+ smp_rmb();
if (net_eq(nf_ct_net(tmp), net) &&
nf_ct_is_confirmed(tmp) &&
nf_ct_delete(tmp, 0, 0))
@@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
if (!refcount_inc_not_zero(&tmp->ct_general.use))
continue;

+ /* load ct->status after refcount */
+ smp_rmb();
if (gc_worker_skip_ct(tmp)) {
nf_ct_put(tmp);
continue;
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
if (!exp)
__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+ /* Other CPU might have obtained a pointer to this object before it was
+ * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+ *
+ * After refcount_set(1) it will succeed; ensure that zeroing of
+ * ct->status and the correct ct->net pointer are visible; else other
+ * core might observe CONFIRMED bit which means the entry is valid and
+ * in the hash table, but its not (anymore).
+ */
+ smp_wmb();
+
/* Now it is going to be associated with an sk_buff, set refcount to 1. */
refcount_set(&ct->ct_general.use, 1);

--
2.35.1

2022-07-06 12:34:41

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:

> Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
>
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> __nf_ct_delete_from_lists
> nf_ct_delete
> early_drop
> __nf_conntrack_alloc
>
> Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
> conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
> allocated object is still in use on another CPU:
>
> CPU1 CPU2
> enounters 'ct' during hlist walk
> delete_from_lists
> refcount drops to 0
> kmem_cache_free(ct);
> __nf_conntrack_alloc() // returns same object
> refcount_inc_not_zero(ct); /* might fail */
>
> /* If set, ct is public/in the hash table */
> test_bit(IPS_CONFIRMED_BIT, &ct->status);
>
> In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
> will succeed.
>
> The expected possibilities for a CPU that obtained the object 'ct'
> (but no reference so far) are:
>
> 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
> the next entry in the list. This happens for objects that are about
> to be free'd, that have been free'd, or that have been reallocated
> by __nf_conntrack_alloc(), but where the refcount has not been
> increased back to 1 yet.
>
> 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
> in ct->status. If set, the object is public/in the table.
>
> If not, the object must be skipped; CPU2 calls nf_ct_put() to
> un-do the refcount increment and moves to the next object.
>
> Parallel deletion from the hlists is prevented by a
> 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
> cpu will do the unlink, the other one will only drop its reference count.
>
> Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
> delete an object that is not on any list:
>
> 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
> 2. CONFIRMED test also successful (load was reordered or zeroing
> of ct->status not yet visible)
> 3. delete_from_lists unlinks entry not on the hlist, because
> IPS_DYING_BIT is 0 (already cleared).
>
> 2) is already wrong: CPU2 will handle a partially initited object
> that is supposed to be private to CPU1.
>
> This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
>
> It also inserts a smp_wmb() before the refcount is set to 1 during
> allocation.
>
> Because other CPU might still 'see' the object, refcount_set(1)
> "resurrects" the object, so we need to make sure that other CPUs will
> also observe the right contents. In particular, the CONFIRMED bit test
> must only pass once the object is fully initialised and either in the
> hash or about to be inserted (with locks held to delay possible unlink from
> early_drop or gc worker).
>
> I did not change flow_offload_alloc(), as far as I can see it should call
> refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
> the skb so its refcount should be >= 1 in all cases.
>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> include/net/netfilter/nf_conntrack.h | 3 +++
> net/netfilter/nf_conntrack_core.c | 19 +++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index a32be8aa7ed2..3dc3646ffba2 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> /* use after obtaining a reference count */
> static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> {
> + /* ->status and ->timeout loads must happen after refcount increase */
> + smp_rmb();
> +
> return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
> !nf_ct_is_dying(ct);
> }
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..072cabf1b296 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
> */
> ct = nf_ct_tuplehash_to_ctrack(h);
> if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
> + /* re-check key after refcount */
> + smp_rmb();
> +
> if (likely(nf_ct_key_equal(h, tuple, zone, net)))
> goto found;
>
> @@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
> * We steal the timer reference. If that fails timer has
> * already fired or someone else deleted it. Just drop ref
> * and move to next entry.
> + *
> + * smp_rmb to ensure ->ct_net and ->status are loaded after
> + * refcount increase.
> */
> + smp_rmb();
> if (net_eq(nf_ct_net(tmp), net) &&
> nf_ct_is_confirmed(tmp) &&
> nf_ct_delete(tmp, 0, 0))
> @@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
> if (!refcount_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> + /* load ct->status after refcount */
> + smp_rmb();
> if (gc_worker_skip_ct(tmp)) {
> nf_ct_put(tmp);
> continue;
> @@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> if (!exp)
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> + /* Other CPU might have obtained a pointer to this object before it was
> + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> + *
> + * After refcount_set(1) it will succeed; ensure that zeroing of
> + * ct->status and the correct ct->net pointer are visible; else other
> + * core might observe CONFIRMED bit which means the entry is valid and
> + * in the hash table, but its not (anymore).
> + */
> + smp_wmb();
> +
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> refcount_set(&ct->ct_general.use, 1);

FWIW, the old code, that used atomic_inc() instead of refcount_set()
would have had the exact sample problem. There was no implied order vs
the earlier stores.

2022-07-06 12:41:06

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
> Kajetan Puchalski <[email protected]> wrote:
> > On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > > > Sorry, but I have absolutely no context here. We have a handy document
> > > > > describing the differences between atomic_t and refcount_t:
> > > > >
> > > > > Documentation/core-api/refcount-vs-atomic.rst
> > > > >
> > > > > What else do you need to know?
> > > >
> > > > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > > > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > > > no longer have ordering to subsequent reads in the absence of an address
> > > > dependency.
> > >
> > > I think the patch above needs auditing with the relaxed behaviour in mind,
> > > but for the specific crash reported here possibly something like the diff
> > > below?
> > >
> > > Will
> > >
> > > --->8
> > >
> > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > index 082a2fd8d85b..5ad9fcc84269 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> > > @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> > > * already fired or someone else deleted it. Just drop ref
> > > * and move to next entry.
> > > */
> > > + smp_rmb(); /* XXX: Why? */
> > > if (net_eq(nf_ct_net(tmp), net) &&
> > > nf_ct_is_confirmed(tmp) &&
> > > nf_ct_delete(tmp, 0, 0))
> > >
> >
> > Just to follow up, I think you're right, the patch in question should be
> > audited further for other missing memory barrier issues.
> > While this one smp_rmb() helps a lot, ie lets the test run for at least
> > an hour or two, an overnight 6 hour test still resulted in the same
> > crash somewhere along the way so it looks like it's not the only one
> > that's needed.
>
> Yes, I don't think that refcount_inc_not_zero is useable as-is for conntrack.
> Here is a patch, I hope this will get things back to a working order without
> a revert to atomic_t api.
>
> Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
>
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> __nf_ct_delete_from_lists
> nf_ct_delete
> early_drop
> __nf_conntrack_alloc
>
> Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
> conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
> allocated object is still in use on another CPU:
>
> CPU1 CPU2
> enounters 'ct' during hlist walk
> delete_from_lists
> refcount drops to 0
> kmem_cache_free(ct);
> __nf_conntrack_alloc() // returns same object
> refcount_inc_not_zero(ct); /* might fail */
>
> /* If set, ct is public/in the hash table */
> test_bit(IPS_CONFIRMED_BIT, &ct->status);
>
> In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
> will succeed.
>
> The expected possibilities for a CPU that obtained the object 'ct'
> (but no reference so far) are:
>
> 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
> the next entry in the list. This happens for objects that are about
> to be free'd, that have been free'd, or that have been reallocated
> by __nf_conntrack_alloc(), but where the refcount has not been
> increased back to 1 yet.
>
> 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
> in ct->status. If set, the object is public/in the table.
>
> If not, the object must be skipped; CPU2 calls nf_ct_put() to
> un-do the refcount increment and moves to the next object.
>
> Parallel deletion from the hlists is prevented by a
> 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
> cpu will do the unlink, the other one will only drop its reference count.
>
> Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
> delete an object that is not on any list:
>
> 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
> 2. CONFIRMED test also successful (load was reordered or zeroing
> of ct->status not yet visible)
> 3. delete_from_lists unlinks entry not on the hlist, because
> IPS_DYING_BIT is 0 (already cleared).
>
> 2) is already wrong: CPU2 will handle a partially initited object
> that is supposed to be private to CPU1.
>
> This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
>
> It also inserts a smp_wmb() before the refcount is set to 1 during
> allocation.
>
> Because other CPU might still 'see' the object, refcount_set(1)
> "resurrects" the object, so we need to make sure that other CPUs will
> also observe the right contents. In particular, the CONFIRMED bit test
> must only pass once the object is fully initialised and either in the
> hash or about to be inserted (with locks held to delay possible unlink from
> early_drop or gc worker).
>
> I did not change flow_offload_alloc(), as far as I can see it should call
> refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
> the skb so its refcount should be >= 1 in all cases.
>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> include/net/netfilter/nf_conntrack.h | 3 +++
> net/netfilter/nf_conntrack_core.c | 19 +++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index a32be8aa7ed2..3dc3646ffba2 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> /* use after obtaining a reference count */
> static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> {
> + /* ->status and ->timeout loads must happen after refcount increase */
> + smp_rmb();

Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are
for upgrading the ordering from refcount_inc_not_zero() then you should
use smp_acquire__after_ctrl_dep() instead. It's the same under the hood,
but it illustrates what's going on a bit better.

> @@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> if (!exp)
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> + /* Other CPU might have obtained a pointer to this object before it was
> + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> + *
> + * After refcount_set(1) it will succeed; ensure that zeroing of
> + * ct->status and the correct ct->net pointer are visible; else other
> + * core might observe CONFIRMED bit which means the entry is valid and
> + * in the hash table, but its not (anymore).
> + */
> + smp_wmb();
> +
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> refcount_set(&ct->ct_general.use, 1);

Ideally that refcount_set() would be a release, but this is definitely
(ab)using refcount_t in way that isn't anticipated by the API! It looks
like a similar pattern exists in net/core/sock.c as well, so I wonder if
it's worth extending the API.

Peter, what do you think?

Will

2022-07-06 13:30:37

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Will Deacon <[email protected]> wrote:
> On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
> > + /* ->status and ->timeout loads must happen after refcount increase */
> > + smp_rmb();
>
> Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are
> for upgrading the ordering from refcount_inc_not_zero() then you should
> use smp_acquire__after_ctrl_dep() instead. It's the same under the hood,
> but it illustrates what's going on a bit better.

Ok, I can replace it and send a v2, no problem.

2022-07-06 14:04:24

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Wed, Jul 06, 2022 at 01:22:50PM +0100, Will Deacon wrote:

> > @@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> > /* use after obtaining a reference count */
> > static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> > {
> > + /* ->status and ->timeout loads must happen after refcount increase */
> > + smp_rmb();
>
> Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are
> for upgrading the ordering from refcount_inc_not_zero() then you should
> use smp_acquire__after_ctrl_dep() instead. It's the same under the hood,
> but it illustrates what's going on a bit better.

But in that case if had better also be near an actual condition,
otherwise things become too murky for words :/

That is, why is this sprinkled all over instead of right after
an successfull refcount_inc_not_zero() ?

Code like:

if (!refcount_inc_not_zero())
return;

smp_acquire__after_ctrl_dep();

is fairly self-evident, whereas encountering an
smp_acquire__after_ctrl_dep() in a different function, completely
unrelated to any condition is quite crazy.

> > @@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> > if (!exp)
> > __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
> >
> > + /* Other CPU might have obtained a pointer to this object before it was
> > + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> > + *
> > + * After refcount_set(1) it will succeed; ensure that zeroing of
> > + * ct->status and the correct ct->net pointer are visible; else other
> > + * core might observe CONFIRMED bit which means the entry is valid and
> > + * in the hash table, but its not (anymore).
> > + */
> > + smp_wmb();
> > +
> > /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> > refcount_set(&ct->ct_general.use, 1);
>
> Ideally that refcount_set() would be a release, but this is definitely
> (ab)using refcount_t in way that isn't anticipated by the API! It looks
> like a similar pattern exists in net/core/sock.c as well, so I wonder if
> it's worth extending the API.
>
> Peter, what do you think?

Bah; you have reminded me that I have a fairly sizable amount of
refcount patches from when Linus complained about it last that don't
seem to have gone anywhere :/

Anyway, I suppose we could do a refcount_set_release(), but it had
better get a fairly big comment on how you're on your own if you use it.

2022-07-06 15:19:32

[permalink] [raw]

Subject: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1 CPU2
encounter 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
refcount_inc_not_zero(ct); /* might fail */

/* If set, ct is public/in the hash table */
test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.

If not, the object must be skipped; CPU2 calls nf_ct_put() to
un-do the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.

Add needed barriers when refcount_inc_not_zero() is successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because other CPU might still 'see' the object, refcount_set(1)
"resurrects" the object, so we need to make sure that other CPUs will
also observe the right contents. In particular, the CONFIRMED bit test
must only pass once the object is fully initialised and either in the
hash or about to be inserted (with locks held to delay possible unlink from
early_drop or gc worker).

I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
add comment in nf_conntrack_netlink, no control dependency there
due to locks.

Cc: Peter Zijlstra <[email protected]>
Reported-by: Kajetan Puchalski <[email protected]>
Diagnosed-by: Will Deacon <[email protected]>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <[email protected]>
---
net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
net/netfilter/nf_conntrack_netlink.c | 1 +
net/netfilter/nf_conntrack_standalone.c | 3 +++
3 files changed, 26 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..369aeabb94fe 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -729,6 +729,9 @@ static void nf_ct_gc_expired(struct nf_conn *ct)
if (!refcount_inc_not_zero(&ct->ct_general.use))
return;

+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
if (nf_ct_should_gc(ct))
nf_ct_kill(ct);

@@ -795,6 +798,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
*/
ct = nf_ct_tuplehash_to_ctrack(h);
if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+ /* re-check key after refcount */
+ smp_acquire__after_ctrl_dep();
+
if (likely(nf_ct_key_equal(h, tuple, zone, net)))
goto found;

@@ -1387,6 +1393,9 @@ static unsigned int early_drop_list(struct net *net,
if (!refcount_inc_not_zero(&tmp->ct_general.use))
continue;

+ /* load ->ct_net and ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
/* kill only if still in same netns -- might have moved due to
* SLAB_TYPESAFE_BY_RCU rules.
*
@@ -1536,6 +1545,9 @@ static void gc_worker(struct work_struct *work)
if (!refcount_inc_not_zero(&tmp->ct_general.use))
continue;

+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
if (gc_worker_skip_ct(tmp)) {
nf_ct_put(tmp);
continue;
@@ -1775,6 +1787,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
if (!exp)
__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+ /* Other CPU might have obtained a pointer to this object before it was
+ * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+ *
+ * After refcount_set(1) it will succeed; ensure that zeroing of
+ * ct->status and the correct ct->net pointer are visible; else other
+ * core might observe CONFIRMED bit which means the entry is valid and
+ * in the hash table, but its not (anymore).
+ */
+ smp_wmb();
+
/* Now it is going to be associated with an sk_buff, set refcount to 1. */
refcount_set(&ct->ct_general.use, 1);

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 722af5e309ba..f5905b5201a7 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1203,6 +1203,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
hnnode) {
ct = nf_ct_tuplehash_to_ctrack(h);
if (nf_ct_is_expired(ct)) {
+ /* need to defer nf_ct_kill() until lock is released */
if (i < ARRAY_SIZE(nf_ct_evict) &&
refcount_inc_not_zero(&ct->ct_general.use))
nf_ct_evict[i++] = ct;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 6ad7bbc90d38..05895878610c 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -306,6 +306,9 @@ static int ct_seq_show(struct seq_file *s, void *v)
if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
return 0;

+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
if (nf_ct_should_gc(ct)) {
nf_ct_kill(ct);
goto release;
--
2.35.1

2022-07-07 08:47:02

[permalink] [raw]

Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

On Wed, Jul 06, 2022 at 04:50:04PM +0200, Florian Westphal wrote:
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> __nf_ct_delete_from_lists
> nf_ct_delete
> early_drop
> __nf_conntrack_alloc
>
> Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
> conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
> allocated object is still in use on another CPU:
>
> CPU1 CPU2
> encounter 'ct' during hlist walk
> delete_from_lists
> refcount drops to 0
> kmem_cache_free(ct);
> __nf_conntrack_alloc() // returns same object
> refcount_inc_not_zero(ct); /* might fail */
>
> /* If set, ct is public/in the hash table */
> test_bit(IPS_CONFIRMED_BIT, &ct->status);
>
> In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
> will succeed.
>
> The expected possibilities for a CPU that obtained the object 'ct'
> (but no reference so far) are:
>
> 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
> the next entry in the list. This happens for objects that are about
> to be free'd, that have been free'd, or that have been reallocated
> by __nf_conntrack_alloc(), but where the refcount has not been
> increased back to 1 yet.
>
> 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
> in ct->status. If set, the object is public/in the table.
>
> If not, the object must be skipped; CPU2 calls nf_ct_put() to
> un-do the refcount increment and moves to the next object.
>
> Parallel deletion from the hlists is prevented by a
> 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
> cpu will do the unlink, the other one will only drop its reference count.
>
> Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
> delete an object that is not on any list:
>
> 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
> 2. CONFIRMED test also successful (load was reordered or zeroing
> of ct->status not yet visible)
> 3. delete_from_lists unlinks entry not on the hlist, because
> IPS_DYING_BIT is 0 (already cleared).
>
> 2) is already wrong: CPU2 will handle a partially initited object
> that is supposed to be private to CPU1.
>
> Add needed barriers when refcount_inc_not_zero() is successful.
>
> It also inserts a smp_wmb() before the refcount is set to 1 during
> allocation.
>
> Because other CPU might still 'see' the object, refcount_set(1)
> "resurrects" the object, so we need to make sure that other CPUs will
> also observe the right contents. In particular, the CONFIRMED bit test
> must only pass once the object is fully initialised and either in the
> hash or about to be inserted (with locks held to delay possible unlink from
> early_drop or gc worker).
>
> I did not change flow_offload_alloc(), as far as I can see it should call
> refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
> the skb so its refcount should be >= 1 in all cases.
>
> v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
> v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
> add comment in nf_conntrack_netlink, no control dependency there
> due to locks.
>
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
> net/netfilter/nf_conntrack_netlink.c | 1 +
> net/netfilter/nf_conntrack_standalone.c | 3 +++
> 3 files changed, 26 insertions(+)

Acked-by: Will Deacon <[email protected]>

Will

2022-07-07 10:29:38

by Thorsten Leemhuis

[permalink] [raw]

Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

On 06.07.22 16:50, Florian Westphal wrote:
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> [...]
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>

If you need to respin this patch for one reason or another, could you do
me a favor and add proper 'Link:' tags pointing to all reports about
this issue? e.g. like this:

Link:
https://lore.kernel.org/all/[email protected]/

These tags are considered important by Linus[1] and others, as they
allow anyone to look into the backstory weeks or years from now. That is
why they should be placed in cases like this, as
Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst explain in more detail. I care
personally, because these tags make my regression tracking efforts a
whole lot easier, as they allow my tracking bot 'regzbot' to
automatically connect reports with patches posted or committed to fix
tracked regressions.

[1] see for example:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

2022-07-07 19:38:29

[permalink] [raw]

Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

Will Deacon <[email protected]> wrote:
> On Wed, Jul 06, 2022 at 04:50:04PM +0200, Florian Westphal wrote:
> > net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
> > net/netfilter/nf_conntrack_netlink.c | 1 +
> > net/netfilter/nf_conntrack_standalone.c | 3 +++
> > 3 files changed, 26 insertions(+)
>
> Acked-by: Will Deacon <[email protected]>

Thanks, I pushed this patch to nf.git.

2022-07-11 17:48:14