2022-07-01 11:27:48

by Kajetan Puchalski

Subject: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Hi,

While running the udp-flood test from stress-ng on Ampere Altra (Mt.
Jade platform) I encountered a kernel panic caused by NULL pointer
dereference within nf_conntrack.

The issue is present in the latest mainline (5.19-rc4), latest stable
(5.18.8), as well as multiple older stable versions. The last working
stable version I found was 5.15.40.

Through bisecting I've traced the issue back to mainline commit
719774377622bc4025d2a74f551b5dc2158c6c30 (netfilter: conntrack: convert to refcount_t api);
on kernels from before this commit, the test runs fine. As far as I can tell, this commit was
included in stable with version 5.15.41, thus causing the regression
compared to 5.15.40. It was included in the mainline with version 5.16.

The issue is very consistently reproducible as well: running this
command resulted in the same kernel panic every time I tried it on
different kernels from after the change in question was merged.

stress-ng --udp-flood 0 -t 1m --metrics-brief --perf

The commit was not easily revertible, so I can't say whether reverting it
on the latest mainline would fix the problem or not.

Here's the decoded kernel panic in question for reference:

[ 869.372868] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 869.381708] Mem abort info:
[ 869.384490] ESR = 0x0000000096000044
[ 869.388245] EC = 0x25: DABT (current EL), IL = 32 bits
[ 869.393681] SET = 0, FnV = 0
[ 869.396723] EA = 0, S1PTW = 0
[ 869.399859] FSC = 0x04: level 0 translation fault
[ 869.404731] Data abort info:
[ 869.407606] ISV = 0, ISS = 0x00000044
[ 869.411437] CM = 0, WnR = 1
[ 869.414398] user pgtable: 4k pages, 48-bit VAs, pgdp=000008002f6b9000
[ 869.420834] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 869.427621] Internal error: Oops: 96000044 [#1] SMP
[ 869.432488] Modules linked in: sctp ip6_udp_tunnel udp_tunnel dccp_ipv4 dccp xt_conntrack xt_MASQUERADE xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc ipmi_devintf ipmi_msghandler overlay cppc_cpufreq drm ip_tables x_tables btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress nvme mlx5_core crct10dif_ce nvme_core mlxfw
[ 869.466986] CPU: 13 PID: 100864 Comm: stress-ng-udp-f Not tainted 5.19.0-rc4-custom-kajpuc01-teo #15
[ 869.476107] Hardware name: WIWYNN Mt.Jade Server System B81.03001.0005/Mt.Jade Motherboard, BIOS 1.08.20220218 (SCP: 1.08.20220218) 2022/02/18
[ 869.488872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 869.495821] pc : __nf_ct_delete_from_lists (/home/kajpuc01/linux/./include/linux/list_nulls.h:108) nf_conntrack
[ 869.502174] lr : __nf_ct_delete_from_lists (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:628) nf_conntrack
[ 869.508523] sp : ffff800055acb6d0
[ 869.511825] x29: ffff800055acb6d0 x28: ffffa1a42f9c8ac0 x27: 0000000000000000
[ 869.518949] x26: ffffa1a3f531c780 x25: 000000000000ef6d x24: ffff4005eddfc500
[ 869.526072] x23: ffff4005eddfc520 x22: ffff4005eddfc558 x21: ffffa1a42f9c8ac0
[ 869.533195] x20: 0000000000000000 x19: 0000000000023923 x18: 0000000000000000
[ 869.540319] x17: 0000000000000000 x16: ffffa1a42d8c3bb0 x15: 0000ffffffec5ff0
[ 869.547442] x14: 4e4e4e4e4e4e4e4e x13: 4e4e4e4e4e4e4e4e x12: 4e4e4e4e4e4e4e4e
[ 869.554565] x11: 0000000000000000 x10: 000000000100007f x9 : ffffa1a3f5304074
[ 869.561688] x8 : 0000000000000000 x7 : 00113ec6b282834e x6 : ffff800055acb6c8
[ 869.568811] x5 : dd7051c45787c585 x4 : abfc59e3b0a2b492 x3 : 0000000000000000
[ 869.575934] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 000000000001dedb
[ 869.583058] Call trace:
[ 869.585492] __nf_ct_delete_from_lists (/home/kajpuc01/linux/./include/linux/list_nulls.h:108) nf_conntrack
[ 869.591494] nf_ct_delete (/home/kajpuc01/linux/./include/linux/bottom_half.h:33) nf_conntrack
[ 869.596368] early_drop (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1400) nf_conntrack
[ 869.601154] __nf_conntrack_alloc (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1610) nf_conntrack
[ 869.606808] init_conntrack.isra.0 (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1716) nf_conntrack
[ 869.612549] nf_conntrack_in (/home/kajpuc01/linux/net/netfilter/nf_conntrack_core.c:1832) nf_conntrack
[ 869.617769] ipv4_conntrack_local (/home/kajpuc01/linux/net/netfilter/nf_conntrack_proto.c:214) nf_conntrack
[ 869.623249] nf_hook_slow (/home/kajpuc01/linux/net/netfilter/core.c:621)
[ 869.626814] __ip_local_out (/home/kajpuc01/linux/net/ipv4/ip_output.c:118)
[ 869.630640] ip_local_out (/home/kajpuc01/linux/net/ipv4/ip_output.c:125)
[ 869.634117] ip_send_skb (/home/kajpuc01/linux/net/ipv4/ip_output.c:1572)
[ 869.637508] udp_send_skb.isra.0 (/home/kajpuc01/linux/net/ipv4/udp.c:968)
[ 869.641767] udp_sendmsg (/home/kajpuc01/linux/net/ipv4/udp.c:1254)
[ 869.645329] inet_sendmsg (/home/kajpuc01/linux/net/ipv4/af_inet.c:821)
[ 869.648807] sock_sendmsg (/home/kajpuc01/linux/net/socket.c:717)
[ 869.652284] __sys_sendto (/home/kajpuc01/linux/net/socket.c:2119)
[ 869.655847] __arm64_sys_sendto (/home/kajpuc01/linux/net/socket.c:2127)
[ 869.659845] invoke_syscall (/home/kajpuc01/linux/./arch/arm64/include/asm/current.h:19)
[ 869.663584] el0_svc_common.constprop.0 (/home/kajpuc01/linux/arch/arm64/kernel/syscall.c:75)
[ 869.668449] do_el0_svc (/home/kajpuc01/linux/arch/arm64/kernel/syscall.c:207)
[ 869.671753] el0_svc (/home/kajpuc01/linux/./arch/arm64/include/asm/daifflags.h:28)
[ 869.674796] el0t_64_sync_handler (/home/kajpuc01/linux/arch/arm64/kernel/entry-common.c:643)
[ 869.678966] el0t_64_sync (/home/kajpuc01/linux/arch/arm64/kernel/entry.S:581)
[ 869.682618] Code: 72001c1f 54fffc01 d503201f a9410700 (f9000020)
All code
========
0: 72001c1f tst w0, #0xff
4: 54fffc01 b.ne 0xffffffffffffff84 // b.any
8: d503201f nop
c: a9410700 ldp x0, x1, [x24, #16]
10:* f9000020 str x0, [x1] <-- trapping instruction

Code starting with the faulting instruction
===========================================
0: f9000020 str x0, [x1]
[ 869.688700] ---[ end trace 0000000000000000 ]---
[ 869.693306] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 869.700168] SMP: stopping secondary CPUs
[ 869.704219] Kernel Offset: 0x21a4251b0000 from 0xffff800008000000
[ 869.710300] PHYS_OFFSET: 0x80000000
[ 869.713775] CPU features: 0x000,0042e015,19805c82
[ 869.718467] Memory Limit: none
[ 869.721509] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

The contents of /proc/cpuinfo:

processor : 0
BogoMIPS : 50.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1

(The same type of CPU is repeated 160 times, only including one for brevity)

/proc/version:
Linux version 5.19.0-rc4-custom-kajpuc01-teo (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #37 SMP Thu Jun 30 14:53:25 UTC 2022

The distribution is Ubuntu 20.04.3 LTS; the architecture is aarch64.

Please let me know if I can provide any more details or try any more
tests.

Regards,
Kajetan


2022-07-01 12:07:45

by Thorsten Leemhuis

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra #forregzbot

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]

TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

Hi, this is your Linux kernel regression tracker.

On 01.07.22 13:11, Kajetan Puchalski wrote:
> Hi,
>
> While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> Jade platform) I encountered a kernel panic caused by NULL pointer
> dereference within nf_conntrack.
>
> The issue is present in the latest mainline (5.19-rc4), latest stable
> (5.18.8), as well as multiple older stable versions. The last working
> stable version I found was 5.15.40.
>
> Through bisecting I've traced the issue back to mainline commit
> 719774377622bc4025d2a74f551b5dc2158c6c30 (netfilter: conntrack: convert to refcount_t api);
> on kernels from before this commit, the test runs fine. As far as I can tell, this commit was
> included in stable with version 5.15.41, thus causing the regression
> compared to 5.15.40. It was included in the mainline with version 5.16.

FWIW, looks like it was merged for v5.17-rc1
$ git describe --contains --tags 719774377622bc4025

v5.17-rc1~170^2~24^2~19

> The issue is very consistently reproducible as well: running this
> command resulted in the same kernel panic every time I tried it on
> different kernels from after the change in question was merged.
>
> stress-ng --udp-flood 0 -t 1m --metrics-brief --perf
>
> The commit was not easily revertible, so I can't say whether reverting it
> on the latest mainline would fix the problem or not.
>
> [...]
>
> The distribution is Ubuntu 20.04.3 LTS; the architecture is aarch64.
>
> Please let me know if I can provide any more details or try any more
> tests.

Thanks for the report. To be sure the issue below doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced 719774377622bc402
#regzbot title net: netfilter: stress-ng udp-flood causes kernel panic
on Ampere Altra

This isn't a regression? This issue or a fix for it is already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained
in the Linux kernel's documentation; the webpage above explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

2022-07-01 20:51:45

by Florian Westphal

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Kajetan Puchalski <[email protected]> wrote:
> While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> Jade platform) I encountered a kernel panic caused by NULL pointer
> dereference within nf_conntrack.
>
> The issue is present in the latest mainline (5.19-rc4), latest stable
> (5.18.8), as well as multiple older stable versions. The last working
> stable version I found was 5.15.40.

Do I need a special setup for conntrack?

No crashes after more than one hour of stress-ng on
1. 4 core amd64 Fedora 5.17 kernel
2. 16 core amd64, linux stable 5.17.15
3. 12 core intel, Fedora 5.18 kernel
4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64

I used standard firewalld ruleset for all of these and manually tuned
conntrack settings to make sure the early evict path (as per backtrace)
gets exercised.

2022-07-02 11:11:30

by Kajetan Puchalski

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> Kajetan Puchalski <[email protected]> wrote:
> > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > Jade platform) I encountered a kernel panic caused by NULL pointer
> > dereference within nf_conntrack.
> >
> > The issue is present in the latest mainline (5.19-rc4), latest stable
> > (5.18.8), as well as multiple older stable versions. The last working
> > stable version I found was 5.15.40.
>
> Do I need a special setup for conntrack?

I don't think there was any special setup involved; the config I started
from was a generic distribution config and I didn't change any
networking-specific options. In case that's helpful, here's the .config I
used.

https://pastebin.com/Bb2wttdx

>
> No crashes after more than one hour of stress-ng on
> 1. 4 core amd64 Fedora 5.17 kernel
> 2. 16 core amd64, linux stable 5.17.15
> 3. 12 core intel, Fedora 5.18 kernel
> 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
>

That would make sense; from further experiments I ran, it somehow seems
to be related to the number of workers being spawned by stress-ng along
with the CPUs/cores involved.

For instance, running the test with <=25 workers (--udp-flood 25 etc.)
results in the test running fine for at least 15 minutes.
Running the test with 30 workers results in a panic sometime before it
hits the 15 minute mark.
Based on observations, there seems to be a correlation between the number
of workers and how quickly the panic occurs, i.e. with 30 it takes a few
minutes, with 160 it consistently happens almost immediately. That also
holds for various numbers of workers in between.

On the CPU/core side of things, the machine in question has two CPU
sockets with 80 identical cores each. All the panics I've encountered
happened when stress-ng was run directly and unbound.
When I tried using hwloc-bind to bind the process to one of the CPU
sockets, the test ran for 15 mins with 80 and 160 workers with no issues,
no matter which CPU it was bound to.

I.e. the specific circumstances under which it seems to occur are when the
test is able to run across multiple CPU sockets with a large number
of workers being spawned.

> I used standard firewalld ruleset for all of these and manually tuned
> conntrack settings to make sure the early evict path (as per backtrace)
> gets exercised.

2022-07-02 12:46:36

by Willy Tarreau

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Sat, Jul 02, 2022 at 12:08:46PM +0100, Kajetan Puchalski wrote:
> On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> > Kajetan Puchalski <[email protected]> wrote:
> > > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > > Jade platform) I encountered a kernel panic caused by NULL pointer
> > > dereference within nf_conntrack.
> > >
> > > The issue is present in the latest mainline (5.19-rc4), latest stable
> > > (5.18.8), as well as multiple older stable versions. The last working
> > > stable version I found was 5.15.40.
> >
> > Do I need a special setup for conntrack?
>
> I don't think there was any special setup involved; the config I started
> from was a generic distribution config and I didn't change any
> networking-specific options. In case that's helpful, here's the .config I
> used.
>
> https://pastebin.com/Bb2wttdx
>
> >
> > No crashes after more than one hour of stress-ng on
> > 1. 4 core amd64 Fedora 5.17 kernel
> > 2. 16 core amd64, linux stable 5.17.15
> > 3. 12 core intel, Fedora 5.18 kernel
> > 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
> >
>
> That would make sense; from further experiments I ran, it somehow seems
> to be related to the number of workers being spawned by stress-ng along
> with the CPUs/cores involved.
>
> For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> results in the test running fine for at least 15 minutes.

Another point to keep in mind is that modern ARM processors (ARMv8.1 and
above) have a more relaxed memory model than older ones (and x86), which
can easily expose a missing barrier somewhere. I already faced this
situation in the past, the first time I ran my code on Graviton2: it
caused crashes that would never happen on A53/A72/A73 cores nor on x86.

ARMv8.1 SoCs are not yet widely available for end users like us. A76
is only coming, and A55 has now been available for a bit more than a
year. So testing on regular ARM devices like RPi etc. may not exhibit
such differences.
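
For illustration, here is a minimal self-contained C11 litmus test of this
kind of reordering (a sketch, not from this thread; the file name, loop
count and variable names are arbitrary). On x86 (TSO) the two relaxed
stores below are never observed out of order; a weakly ordered arm64 core
is allowed to reorder them, which is exactly the class of bug a missing
barrier produces:

/* mp.c -- message-passing litmus test with fully relaxed atomics.
 * Build: cc -O2 -pthread mp.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data, flag;

static void *writer(void *arg)
{
	(void)arg;
	atomic_store_explicit(&data, 1, memory_order_relaxed);
	/* Nothing orders these two relaxed stores; a weakly ordered
	 * CPU may make the flag visible before the data.  A release
	 * store on the flag (paired with an acquire load) fixes it. */
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	long stale = 0;

	for (int i = 0; i < 1000000; i++) {
		pthread_t t;

		atomic_store(&data, 0);
		atomic_store(&flag, 0);
		pthread_create(&t, NULL, writer, NULL);
		while (!atomic_load_explicit(&flag, memory_order_relaxed))
			;	/* wait for the writer to raise the flag */
		if (atomic_load_explicit(&data, memory_order_relaxed) == 0)
			stale++;	/* saw flag == 1 but data == 0 */
		pthread_join(t, NULL);
	}
	printf("stale observations: %ld\n", stale);
	return 0;
}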

> Running the test with 30 workers results in a panic sometime before it
> hits the 15 minute mark.
> Based on observations, there seems to be a correlation between the number
> of workers and how quickly the panic occurs, i.e. with 30 it takes a few
> minutes, with 160 it consistently happens almost immediately. That also
> holds for various numbers of workers in between.
>
> On the CPU/core side of things, the machine in question has two CPU
> sockets with 80 identical cores each. All the panics I've encountered
> happened when stress-ng was run directly and unbound.
> When I tried using hwloc-bind to bind the process to one of the CPU
> sockets, the test ran for 15 mins with 80 and 160 workers with no issues,
> no matter which CPU it was bound to.
>
> I.e. the specific circumstances under which it seems to occur are when the
> test is able to run across multiple CPU sockets with a large number
> of workers being spawned.

This could further fuel the possibility explained above.

Regards,
Willy

2022-07-02 21:02:31

by Florian Westphal

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Kajetan Puchalski <[email protected]> wrote:
> On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> > Kajetan Puchalski <[email protected]> wrote:
> > > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > > Jade platform) I encountered a kernel panic caused by NULL pointer
> > > dereference within nf_conntrack.
> > >
> > > The issue is present in the latest mainline (5.19-rc4), latest stable
> > > (5.18.8), as well as multiple older stable versions. The last working
> > > stable version I found was 5.15.40.
> >
> > Do I need a special setup for conntrack?
>
> I don't think there was any special setup involved; the config I started
> from was a generic distribution config and I didn't change any
> networking-specific options. In case that's helpful, here's the .config I
> used.
>
> https://pastebin.com/Bb2wttdx
>
> >
> > No crashes after more than one hour of stress-ng on
> > 1. 4 core amd64 Fedora 5.17 kernel
> > 2. 16 core amd64, linux stable 5.17.15
> > 3. 12 core intel, Fedora 5.18 kernel
> > 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
> >
>
> That would make sense; from further experiments I ran, it somehow seems
> to be related to the number of workers being spawned by stress-ng along
> with the CPUs/cores involved.
>
> For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> results in the test running fine for at least 15 minutes.

Ok. I will let it run for longer on the machines I have access to.

In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/
in nf_conntrack.

If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
but works with the attached patch, someone who understands aarch64 memory ordering
would have to look more closely at refcount_XXX functions to see where they
might differ from atomic_ ones.

If it still crashes, please try the hunk below in addition, although I don't see
how it would make a difference.

This is the one spot where the original conversion replaced atomic_inc()
with refcount_set(); this is on allocation, where the refcount is expected
to be 0, so refcount_inc() would trigger a warning hinting at a use-after-free.

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

/* Now it is going to be associated with an sk_buff, set refcount to 1. */
- atomic_set(&ct->ct_general.use, 1);
+ atomic_inc(&ct->ct_general.use);

if (exp) {
if (exp->expectfn)


Attachments:
0001-netfilter-conntrack-revert-to-atomic_t-api.patch (10.46 kB)

2022-07-04 09:41:53

by Kajetan Puchalski

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > That would make sense; from further experiments I ran, it somehow seems
> > to be related to the number of workers being spawned by stress-ng along
> > with the CPUs/cores involved.
> >
> > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > results in the test running fine for at least 15 minutes.
>
> Ok. I will let it run for longer on the machines I have access to.
>
> In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/
> in nf_conntrack.
>
> If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> but works with the attached patch, someone who understands aarch64 memory ordering
> would have to look more closely at refcount_XXX functions to see where they
> might differ from atomic_ ones.

I can confirm that the patch seems to solve the issue.
With it applied on top of the 5.19-rc5 tag the test runs fine for at
least 15 minutes, which was not the case before, so it looks like it is
that aarch64 memory ordering problem.

>
> If it still crashes, please try the hunk below in addition, although I don't see
> how it would make a difference.
>
> This is the one spot where the original conversion replaced atomic_inc()
> with refcount_set(); this is on allocation, where the refcount is expected
> to be 0, so refcount_inc() would trigger a warning hinting at a use-after-free.
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> - atomic_set(&ct->ct_general.use, 1);
> + atomic_inc(&ct->ct_general.use);
>
> if (exp) {
> if (exp->expectfn)

> From 4234018dff486bdc30f4fe4625c8da1a8e30c2f6 Mon Sep 17 00:00:00 2001
> From: Florian Westphal <[email protected]>
> Date: Sat, 2 Jul 2022 22:42:57 +0200
> Subject: [PATCH 1/1] netfilter: conntrack: revert to atomic_t api
>
> Just for testing.
> ---
> include/linux/netfilter/nf_conntrack_common.h | 6 ++---
> include/net/netfilter/nf_conntrack.h | 2 +-
> net/netfilter/nf_conntrack_core.c | 24 +++++++++----------
> net/netfilter/nf_conntrack_expect.c | 2 +-
> net/netfilter/nf_conntrack_netlink.c | 6 ++---
> net/netfilter/nf_conntrack_standalone.c | 4 ++--
> net/netfilter/nf_flow_table_core.c | 2 +-
> net/netfilter/nft_ct.c | 4 ++--
> net/netfilter/xt_CT.c | 2 +-
> 9 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
> index 2770db2fa080..48a78944182d 100644
> --- a/include/linux/netfilter/nf_conntrack_common.h
> +++ b/include/linux/netfilter/nf_conntrack_common.h
> @@ -25,7 +25,7 @@ struct ip_conntrack_stat {
> #define NFCT_PTRMASK ~(NFCT_INFOMASK)
>
> struct nf_conntrack {
> - refcount_t use;
> + atomic_t use;
> };
>
> void nf_conntrack_destroy(struct nf_conntrack *nfct);
> @@ -33,13 +33,13 @@ void nf_conntrack_destroy(struct nf_conntrack *nfct);
> /* like nf_ct_put, but without module dependency on nf_conntrack */
> static inline void nf_conntrack_put(struct nf_conntrack *nfct)
> {
> - if (nfct && refcount_dec_and_test(&nfct->use))
> + if (nfct && atomic_dec_and_test(&nfct->use))
> nf_conntrack_destroy(nfct);
> }
> static inline void nf_conntrack_get(struct nf_conntrack *nfct)
> {
> if (nfct)
> - refcount_inc(&nfct->use);
> + atomic_inc(&nfct->use);
> }
>
> #endif /* _NF_CONNTRACK_COMMON_H */
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index a32be8aa7ed2..9fab0c8835bb 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -180,7 +180,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct);
> /* decrement reference count on a conntrack */
> static inline void nf_ct_put(struct nf_conn *ct)
> {
> - if (ct && refcount_dec_and_test(&ct->ct_general.use))
> + if (ct && atomic_dec_and_test(&ct->ct_general.use))
> nf_ct_destroy(&ct->ct_general);
> }
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..4469e49d78a7 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -554,7 +554,7 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
> tmpl->status = IPS_TEMPLATE;
> write_pnet(&tmpl->ct_net, net);
> nf_ct_zone_add(tmpl, zone);
> - refcount_set(&tmpl->ct_general.use, 1);
> + atomic_set(&tmpl->ct_general.use, 1);
>
> return tmpl;
> }
> @@ -586,7 +586,7 @@ void nf_ct_destroy(struct nf_conntrack *nfct)
> struct nf_conn *ct = (struct nf_conn *)nfct;
>
> pr_debug("%s(%p)\n", __func__, ct);
> - WARN_ON(refcount_read(&nfct->use) != 0);
> + WARN_ON(atomic_read(&nfct->use) != 0);
>
> if (unlikely(nf_ct_is_template(ct))) {
> nf_ct_tmpl_free(ct);
> @@ -726,7 +726,7 @@ nf_ct_match(const struct nf_conn *ct1, const struct nf_conn *ct2)
> /* caller must hold rcu readlock and none of the nf_conntrack_locks */
> static void nf_ct_gc_expired(struct nf_conn *ct)
> {
> - if (!refcount_inc_not_zero(&ct->ct_general.use))
> + if (!atomic_inc_not_zero(&ct->ct_general.use))
> return;
>
> if (nf_ct_should_gc(ct))
> @@ -794,7 +794,7 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
> * in, try to obtain a reference and re-check tuple
> */
> ct = nf_ct_tuplehash_to_ctrack(h);
> - if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
> + if (likely(atomic_inc_not_zero(&ct->ct_general.use))) {
> if (likely(nf_ct_key_equal(h, tuple, zone, net)))
> goto found;
>
> @@ -923,7 +923,7 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
>
> smp_wmb();
> /* The caller holds a reference to this object */
> - refcount_set(&ct->ct_general.use, 2);
> + atomic_set(&ct->ct_general.use, 2);
> __nf_conntrack_hash_insert(ct, hash, reply_hash);
> nf_conntrack_double_unlock(hash, reply_hash);
> NF_CT_STAT_INC(net, insert);
> @@ -981,7 +981,7 @@ static void __nf_conntrack_insert_prepare(struct nf_conn *ct)
> {
> struct nf_conn_tstamp *tstamp;
>
> - refcount_inc(&ct->ct_general.use);
> + atomic_inc(&ct->ct_general.use);
>
> /* set conntrack timestamp, if enabled. */
> tstamp = nf_conn_tstamp_find(ct);
> @@ -1384,7 +1384,7 @@ static unsigned int early_drop_list(struct net *net,
> nf_ct_is_dying(tmp))
> continue;
>
> - if (!refcount_inc_not_zero(&tmp->ct_general.use))
> + if (!atomic_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> /* kill only if still in same netns -- might have moved due to
> @@ -1533,7 +1533,7 @@ static void gc_worker(struct work_struct *work)
> continue;
>
> /* need to take reference to avoid possible races */
> - if (!refcount_inc_not_zero(&tmp->ct_general.use))
> + if (!atomic_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> if (gc_worker_skip_ct(tmp)) {
> @@ -1640,7 +1640,7 @@ __nf_conntrack_alloc(struct net *net,
> /* Because we use RCU lookups, we set ct_general.use to zero before
> * this is inserted in any list.
> */
> - refcount_set(&ct->ct_general.use, 0);
> + atomic_set(&ct->ct_general.use, 0);
> return ct;
> out:
> atomic_dec(&cnet->count);
> @@ -1665,7 +1665,7 @@ void nf_conntrack_free(struct nf_conn *ct)
> /* A freed object has refcnt == 0, that's
> * the golden rule for SLAB_TYPESAFE_BY_RCU
> */
> - WARN_ON(refcount_read(&ct->ct_general.use) != 0);
> + WARN_ON(atomic_read(&ct->ct_general.use) != 0);
>
> if (ct->status & IPS_SRC_NAT_DONE) {
> const struct nf_nat_hook *nat_hook;
> @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> - refcount_set(&ct->ct_general.use, 1);
> + atomic_set(&ct->ct_general.use, 1);
>
> if (exp) {
> if (exp->expectfn)
> @@ -2390,7 +2390,7 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
>
> return NULL;
> found:
> - refcount_inc(&ct->ct_general.use);
> + atomic_inc(&ct->ct_general.use);
> spin_unlock(lockp);
> local_bh_enable();
> return ct;
> diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c
> index 96948e98ec53..84cb05eae410 100644
> --- a/net/netfilter/nf_conntrack_expect.c
> +++ b/net/netfilter/nf_conntrack_expect.c
> @@ -208,7 +208,7 @@ nf_ct_find_expectation(struct net *net,
> * can be sure the ct cannot disappear underneath.
> */
> if (unlikely(nf_ct_is_dying(exp->master) ||
> - !refcount_inc_not_zero(&exp->master->ct_general.use)))
> + !atomic_inc_not_zero(&exp->master->ct_general.use)))
> return NULL;
>
> if (exp->flags & NF_CT_EXPECT_PERMANENT) {
> diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
> index 722af5e309ba..d5de0e580e6c 100644
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -514,7 +514,7 @@ static int ctnetlink_dump_id(struct sk_buff *skb, const struct nf_conn *ct)
>
> static int ctnetlink_dump_use(struct sk_buff *skb, const struct nf_conn *ct)
> {
> - if (nla_put_be32(skb, CTA_USE, htonl(refcount_read(&ct->ct_general.use))))
> + if (nla_put_be32(skb, CTA_USE, htonl(atomic_read(&ct->ct_general.use))))
> goto nla_put_failure;
> return 0;
>
> @@ -1204,7 +1204,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
> ct = nf_ct_tuplehash_to_ctrack(h);
> if (nf_ct_is_expired(ct)) {
> if (i < ARRAY_SIZE(nf_ct_evict) &&
> - refcount_inc_not_zero(&ct->ct_general.use))
> + atomic_inc_not_zero(&ct->ct_general.use))
> nf_ct_evict[i++] = ct;
> continue;
> }
> @@ -1747,7 +1747,7 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
> NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
> ct, dying, 0);
> if (res < 0) {
> - if (!refcount_inc_not_zero(&ct->ct_general.use))
> + if (!atomic_inc_not_zero(&ct->ct_general.use))
> return 0;
>
> ctx->last = ct;
> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> index 6ad7bbc90d38..badd3f219533 100644
> --- a/net/netfilter/nf_conntrack_standalone.c
> +++ b/net/netfilter/nf_conntrack_standalone.c
> @@ -303,7 +303,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
> int ret = 0;
>
> WARN_ON(!ct);
> - if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
> + if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
> return 0;
>
> if (nf_ct_should_gc(ct)) {
> @@ -370,7 +370,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
> ct_show_zone(s, ct, NF_CT_DEFAULT_ZONE_DIR);
> ct_show_delta_time(s, ct);
>
> - seq_printf(s, "use=%u\n", refcount_read(&ct->ct_general.use));
> + seq_printf(s, "use=%u\n", atomic_read(&ct->ct_general.use));
>
> if (seq_has_overflowed(s))
> goto release;
> diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> index f2def06d1070..8b3f91a60ba2 100644
> --- a/net/netfilter/nf_flow_table_core.c
> +++ b/net/netfilter/nf_flow_table_core.c
> @@ -54,7 +54,7 @@ struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
> struct flow_offload *flow;
>
> if (unlikely(nf_ct_is_dying(ct) ||
> - !refcount_inc_not_zero(&ct->ct_general.use)))
> + !atomic_inc_not_zero(&ct->ct_general.use)))
> return NULL;
>
> flow = kzalloc(sizeof(*flow), GFP_ATOMIC);
> diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
> index d8e1614918a1..1b6ead61a8f1 100644
> --- a/net/netfilter/nft_ct.c
> +++ b/net/netfilter/nft_ct.c
> @@ -260,8 +260,8 @@ static void nft_ct_set_zone_eval(const struct nft_expr *expr,
>
> ct = this_cpu_read(nft_ct_pcpu_template);
>
> - if (likely(refcount_read(&ct->ct_general.use) == 1)) {
> - refcount_inc(&ct->ct_general.use);
> + if (likely(atomic_read(&ct->ct_general.use) == 1)) {
> + atomic_inc(&ct->ct_general.use);
> nf_ct_zone_add(ct, &zone);
> } else {
> /* previous skb got queued to userspace, allocate temporary
> diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
> index 267757b0392a..cf2f8c1d4fb5 100644
> --- a/net/netfilter/xt_CT.c
> +++ b/net/netfilter/xt_CT.c
> @@ -24,7 +24,7 @@ static inline int xt_ct_target(struct sk_buff *skb, struct nf_conn *ct)
> return XT_CONTINUE;
>
> if (ct) {
> - refcount_inc(&ct->ct_general.use);
> + atomic_inc(&ct->ct_general.use);
> nf_ct_set(skb, ct, IP_CT_NEW);
> } else {
> nf_ct_set(skb, ct, IP_CT_UNTRACKED);
> --
> 2.35.1
>

2022-07-05 11:04:54

by Kajetan Puchalski

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > That would make sense; from further experiments I ran, it somehow seems
> > > to be related to the number of workers being spawned by stress-ng along
> > > with the CPUs/cores involved.
> > >
> > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > results in the test running fine for at least 15 minutes.
> >
> > Ok. I will let it run for longer on the machines I have access to.
> >
> > In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/
> > in nf_conntrack.
> >
> > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > but works with the attached patch, someone who understands aarch64 memory ordering
> > would have to look more closely at refcount_XXX functions to see where they
> > might differ from atomic_ ones.
>
> I can confirm that the patch seems to solve the issue.
> With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > least 15 minutes, which was not the case before, so it looks like it is
> that aarch64 memory ordering problem.

I'm CCing some people who should be able to help with aarch64 memory
ordering, maybe they could take a look.

(re-sending due to a typo in CC, sorry for duplicate emails!)

>
> >
> > If it still crashes, please try the hunk below in addition, although I don't see
> > how it would make a difference.
> >
> > This is the one spot where the original conversion replaced atomic_inc()
> > with refcount_set(); this is on allocation, where the refcount is expected
> > to be 0, so refcount_inc() would trigger a warning hinting at a use-after-free.
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1776,7 +1776,7 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> > __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
> >
> > /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> > - atomic_set(&ct->ct_general.use, 1);
> > + atomic_inc(&ct->ct_general.use);
> >
> > if (exp) {
> > if (exp->expectfn)
>
> > [...]
>

2022-07-05 11:18:37

by Will Deacon

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> > On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > > That would make sense; from further experiments I ran, it somehow seems
> > > > > to be related to the number of workers being spawned by stress-ng along
> > > > > with the CPUs/cores involved.
> > > > >
> > > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > > results in the test running fine for at least 15 minutes.
> > > >
> > > > Ok. I will let it run for longer on the machines I have access to.
> > > >
> > > > In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/
> > > > in nf_conntrack.
> > > >
> > > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > > but works with the attached patch, someone who understands aarch64 memory ordering
> > > > would have to look more closely at refcount_XXX functions to see where they
> > > > might differ from atomic_ ones.
> > >
> > > I can confirm that the patch seems to solve the issue.
> > > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > > > least 15 minutes, which was not the case before, so it looks like it is
> > > that aarch64 memory ordering problem.
> >
> > I'm CCing some people who should be able to help with aarch64 memory
> > ordering, maybe they could take a look.
> >
> > (re-sending due to a typo in CC, sorry for duplicate emails!)
>
> Sorry, but I have absolutely no context here. We have a handy document
> describing the differences between atomic_t and refcount_t:
>
> Documentation/core-api/refcount-vs-atomic.rst
>
> What else do you need to know?

Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
("netfilter: conntrack: convert to refcount_t api") which mean that you
no longer have ordering to subsequent reads in the absence of an address
dependency.

Will
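
To spell out the difference Will is pointing at: refcount_inc_not_zero()
is built on a relaxed cmpxchg loop, so a successful increment leaves only
a control dependency, which orders later stores but not later loads,
whereas atomic_inc_not_zero() is fully ordered on success (see
Documentation/core-api/refcount-vs-atomic.rst). Below is a userspace C11
model of the conntrack pattern; a sketch for illustration only, the
struct and function names merely mirror the kernel's, this is not kernel
code:

#include <stdatomic.h>
#include <stdbool.h>

struct ct {
	atomic_int use;     /* models refcount_t ct_general.use     */
	atomic_int status;  /* models ct->status, bit 0 = CONFIRMED */
};

/* Model of refcount_inc_not_zero(): a relaxed cmpxchg loop. */
static bool inc_not_zero(struct ct *ct)
{
	int old = atomic_load_explicit(&ct->use, memory_order_relaxed);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak_explicit(&ct->use,
			&old, old + 1,
			memory_order_relaxed, memory_order_relaxed));
	return true;
}

bool grab_if_confirmed(struct ct *ct)
{
	if (!inc_not_zero(ct))
		return false;	/* object is (being) freed, skip it */

	/* Nothing orders the ->status load below against the ->use
	 * load above, so the CPU may satisfy it early and observe the
	 * CONFIRMED bit from the object's previous life (the memory is
	 * reused without a grace period under SLAB_TYPESAFE_BY_RCU).
	 * An acquire fence here would forbid that reordering:
	 *	atomic_thread_fence(memory_order_acquire);
	 */
	return atomic_load_explicit(&ct->status, memory_order_relaxed) & 1;
}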

2022-07-05 11:23:32

by Will Deacon

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > That would make sense; from further experiments I ran, it somehow seems
> > > > to be related to the number of workers being spawned by stress-ng along
> > > > with the CPUs/cores involved.
> > > >
> > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > results in the test running fine for at least 15 minutes.
> > >
> > > Ok. I will let it run for longer on the machines I have access to.
> > >
> > > In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/
> > > in nf_conntrack.
> > >
> > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > but works with the attached patch, someone who understands aarch64 memory ordering
> > > would have to look more closely at refcount_XXX functions to see where they
> > > might differ from atomic_ ones.
> >
> > I can confirm that the patch seems to solve the issue.
> > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > least 15 minutes, which was not the case before, so it looks like it is
> > that aarch64 memory ordering problem.
>
> I'm CCing some people who should be able to help with aarch64 memory
> ordering, maybe they could take a look.
>
> (re-sending due to a typo in CC, sorry for duplicate emails!)

Sorry, but I have absolutely no context here. We have a handy document
describing the differences between atomic_t and refcount_t:

Documentation/core-api/refcount-vs-atomic.rst

What else do you need to know?

Will

2022-07-05 11:33:21

by Will Deacon

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 12:07:25PM +0100, Will Deacon wrote:
> On Tue, Jul 05, 2022 at 11:57:49AM +0100, Will Deacon wrote:
> > On Tue, Jul 05, 2022 at 11:53:22AM +0100, Kajetan Puchalski wrote:
> > > On Mon, Jul 04, 2022 at 10:22:24AM +0100, Kajetan Puchalski wrote:
> > > > On Sat, Jul 02, 2022 at 10:56:51PM +0200, Florian Westphal wrote:
> > > > > > That would make sense; from further experiments I ran, it somehow seems
> > > > > > to be related to the number of workers being spawned by stress-ng along
> > > > > > with the CPUs/cores involved.
> > > > > >
> > > > > > For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> > > > > > results in the test running fine for at least 15 minutes.
> > > > >
> > > > > Ok. I will let it run for longer on the machines I have access to.
> > > > >
> > > > > In the meantime, you could test the attached patch; it's a simple s/refcount_/atomic_/
> > > > > in nf_conntrack.
> > > > >
> > > > > If mainline (patch vs. HEAD 69cb6c6556ad89620547318439) crashes for you
> > > > > but works with the attached patch, someone who understands aarch64 memory ordering
> > > > > would have to look more closely at refcount_XXX functions to see where they
> > > > > might differ from atomic_ ones.
> > > >
> > > > I can confirm that the patch seems to solve the issue.
> > > > With it applied on top of the 5.19-rc5 tag the test runs fine for at
> > > > least 15 minutes, which was not the case before, so it looks like it is
> > > > that aarch64 memory ordering problem.
> > >
> > > I'm CCing some people who should be able to help with aarch64 memory
> > > ordering, maybe they could take a look.
> > >
> > > (re-sending due to a typo in CC, sorry for duplicate emails!)
> >
> > Sorry, but I have absolutely no context here. We have a handy document
> > describing the differences between atomic_t and refcount_t:
> >
> > Documentation/core-api/refcount-vs-atomic.rst
> >
> > What else do you need to know?
>
> Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> ("netfilter: conntrack: convert to refcount_t api") which mean that you
> no longer have ordering to subsequent reads in the absence of an address
> dependency.

I think the patch above needs auditing with the relaxed behaviour in mind,
but for the specific crash reported here possibly something like the diff
below?

Will

--->8

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..5ad9fcc84269 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
* already fired or someone else deleted it. Just drop ref
* and move to next entry.
*/
+ smp_rmb(); /* XXX: Why? */
if (net_eq(nf_ct_net(tmp), net) &&
nf_ct_is_confirmed(tmp) &&
nf_ct_delete(tmp, 0, 0))

2022-07-05 16:03:24

by Kajetan Puchalski

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

> > > Sorry, but I have absolutely no context here. We have a handy document
> > > describing the differences between atomic_t and refcount_t:
> > >
> > > Documentation/core-api/refcount-vs-atomic.rst
> > >
> > > What else do you need to know?
> >
> > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > no longer have ordering to subsequent reads in the absence of an address
> > dependency.
>
> I think the patch above needs auditing with the relaxed behaviour in mind,
> but for the specific crash reported here possibly something like the diff
> below?
>
> Will
>
> --->8
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..5ad9fcc84269 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> * already fired or someone else deleted it. Just drop ref
> * and move to next entry.
> */
> + smp_rmb(); /* XXX: Why? */
> if (net_eq(nf_ct_net(tmp), net) &&
> nf_ct_is_confirmed(tmp) &&
> nf_ct_delete(tmp, 0, 0))
>

With this patch applied, the issue goes away as well. The test runs fine
well beyond where it would crash previously, so it looks good, thanks!

2022-07-06 10:43:54

by Kajetan Puchalski

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > Sorry, but I have absolutely no context here. We have a handy document
> > > describing the differences between atomic_t and refcount_t:
> > >
> > > Documentation/core-api/refcount-vs-atomic.rst
> > >
> > > What else do you need to know?
> >
> > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > no longer have ordering to subsequent reads in the absence of an address
> > dependency.
>
> I think the patch above needs auditing with the relaxed behaviour in mind,
> but for the specific crash reported here possibly something like the diff
> below?
>
> Will
>
> --->8
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..5ad9fcc84269 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> * already fired or someone else deleted it. Just drop ref
> * and move to next entry.
> */
> + smp_rmb(); /* XXX: Why? */
> if (net_eq(nf_ct_net(tmp), net) &&
> nf_ct_is_confirmed(tmp) &&
> nf_ct_delete(tmp, 0, 0))
>

Just to follow up, I think you're right: the patch in question should be
audited further for other missing memory barrier issues.
While this one smp_rmb() helps a lot, i.e. it lets the test run for at
least an hour or two, an overnight 6-hour test still resulted in the same
crash somewhere along the way, so it looks like it's not the only one
that's needed.

2022-07-06 12:27:23

by Florian Westphal

Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Kajetan Puchalski <[email protected]> wrote:
> On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > > Sorry, but I have absolutely no context here. We have a handy document
> > > > describing the differences between atomic_t and refcount_t:
> > > >
> > > > Documentation/core-api/refcount-vs-atomic.rst
> > > >
> > > > What else do you need to know?
> > >
> > > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > > no longer have ordering to subsequent reads in the absence of an address
> > > dependency.
> >
> > I think the patch above needs auditing with the relaxed behaviour in mind,
> > but for the specific crash reported here possibly something like the diff
> > below?
> >
> > Will
> >
> > --->8
> >
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 082a2fd8d85b..5ad9fcc84269 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> > * already fired or someone else deleted it. Just drop ref
> > * and move to next entry.
> > */
> > + smp_rmb(); /* XXX: Why? */
> > if (net_eq(nf_ct_net(tmp), net) &&
> > nf_ct_is_confirmed(tmp) &&
> > nf_ct_delete(tmp, 0, 0))
> >
>
> Just to follow up, I think you're right: the patch in question should be
> audited further for other missing memory barriers.
> While this one smp_rmb() helps a lot, i.e. it lets the test run for at
> least an hour or two, an overnight 6-hour test still resulted in the same
> crash somewhere along the way, so it looks like it's not the only barrier
> that's needed.

Yes, I don't think that refcount_inc_not_zero is usable as-is for conntrack.
Here is a patch; I hope this will get things back to working order without
a revert to the atomic_t api.

Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1 CPU2
encounters 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
refcount_inc_not_zero(ct); /* might fail */

/* If set, ct is public/in the hash table */
test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.

If not, the object must be skipped; CPU2 calls nf_ct_put() to
undo the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initialised object
that is supposed to be private to CPU1.

This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because another CPU might still 'see' the object, refcount_set(1)
"resurrects" the object, so we need to make sure that other CPUs will
also observe the right contents. In particular, the CONFIRMED bit test
must only pass once the object is fully initialised and either in the
hash or about to be inserted (with locks held to delay possible unlink from
early_drop or gc worker).

I did not change flow_offload_alloc(); as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb, so its refcount should be >= 1 in all cases.

Reported-by: Kajetan Puchalski <[email protected]>
Diagnosed-by: Will Deacon <[email protected]>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <[email protected]>
---
include/net/netfilter/nf_conntrack.h | 3 +++
net/netfilter/nf_conntrack_core.c | 19 +++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index a32be8aa7ed2..3dc3646ffba2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
/* use after obtaining a reference count */
static inline bool nf_ct_should_gc(const struct nf_conn *ct)
{
+ /* ->status and ->timeout loads must happen after refcount increase */
+ smp_rmb();
+
return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
!nf_ct_is_dying(ct);
}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..072cabf1b296 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
*/
ct = nf_ct_tuplehash_to_ctrack(h);
if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+ /* re-check key after refcount */
+ smp_rmb();
+
if (likely(nf_ct_key_equal(h, tuple, zone, net)))
goto found;

@@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
* We steal the timer reference. If that fails timer has
* already fired or someone else deleted it. Just drop ref
* and move to next entry.
+ *
+ * smp_rmb to ensure ->ct_net and ->status are loaded after
+ * refcount increase.
*/
+ smp_rmb();
if (net_eq(nf_ct_net(tmp), net) &&
nf_ct_is_confirmed(tmp) &&
nf_ct_delete(tmp, 0, 0))
@@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
if (!refcount_inc_not_zero(&tmp->ct_general.use))
continue;

+ /* load ct->status after refcount */
+ smp_rmb();
if (gc_worker_skip_ct(tmp)) {
nf_ct_put(tmp);
continue;
@@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
if (!exp)
__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+ /* Another CPU might have obtained a pointer to this object before it was
+ * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+ *
+ * After refcount_set(1) it will succeed; ensure that the zeroing of
+ * ct->status and the correct ct->net pointer are visible; else another
+ * core might observe the CONFIRMED bit, which means the entry is valid
+ * and in the hash table, but it's not (anymore).
+ */
+ smp_wmb();
+
/* Now it is going to be associated with an sk_buff, set refcount to 1. */
refcount_set(&ct->ct_general.use, 1);

--
2.35.1
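
The pairing the patch relies on can be condensed into a short sketch;
struct obj and the obj_* helpers below are illustrative stand-ins for
struct nf_conn and the conntrack code, not kernel API:

#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/bitops.h>
#include <asm/barrier.h>

struct obj {
	refcount_t use;
	unsigned long status;		/* stand-in for ct->status */
};

/* Allocation side, cf. init_conntrack(): re-initialise the
 * SLAB_TYPESAFE_BY_RCU object in private, then "resurrect" it.
 */
static void obj_publish(struct obj *o)
{
	o->status = 0;			/* private (re)initialisation */
	smp_wmb();			/* init must be visible before the refcount store */
	refcount_set(&o->use, 1);	/* readers may now win inc_not_zero */
}

static void obj_put(struct obj *o)
{
	if (refcount_dec_and_test(&o->use))
		kfree(o);		/* conntrack frees to a TYPESAFE_BY_RCU cache */
}

/* Walker side, cf. early_drop_list(): take a reference, then
 * re-validate; the smp_rmb() pairs with the smp_wmb() above.
 */
static bool obj_get_valid(struct obj *o)
{
	if (!refcount_inc_not_zero(&o->use))
		return false;		/* freed, dying or not yet resurrected */

	smp_rmb();			/* ->status loads ordered after the increment */

	if (!test_bit(0, &o->status)) {	/* stand-in for the CONFIRMED check */
		obj_put(o);		/* lost the race with a re-allocator */
		return false;
	}
	return true;
}

Later messages in this thread keep the same shape but spell the
reader-side smp_rmb() as smp_acquire__after_ctrl_dep().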

2022-07-06 12:34:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:

> Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
>
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> __nf_ct_delete_from_lists
> nf_ct_delete
> early_drop
> __nf_conntrack_alloc
>
> Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
> conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
> allocated object is still in use on another CPU:
>
> CPU1 CPU2
> encounters 'ct' during hlist walk
> delete_from_lists
> refcount drops to 0
> kmem_cache_free(ct);
> __nf_conntrack_alloc() // returns same object
> refcount_inc_not_zero(ct); /* might fail */
>
> /* If set, ct is public/in the hash table */
> test_bit(IPS_CONFIRMED_BIT, &ct->status);
>
> In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
> will succeed.
>
> The expected possibilities for a CPU that obtained the object 'ct'
> (but no reference so far) are:
>
> 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
> the next entry in the list. This happens for objects that are about
> to be free'd, that have been free'd, or that have been reallocated
> by __nf_conntrack_alloc(), but where the refcount has not been
> increased back to 1 yet.
>
> 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
> in ct->status. If set, the object is public/in the table.
>
> If not, the object must be skipped; CPU2 calls nf_ct_put() to
> undo the refcount increment and moves to the next object.
>
> Parallel deletion from the hlists is prevented by a
> 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
> cpu will do the unlink, the other one will only drop its reference count.
>
> Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
> delete an object that is not on any list:
>
> 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
> 2. CONFIRMED test also successful (load was reordered or zeroing
> of ct->status not yet visible)
> 3. delete_from_lists unlinks entry not on the hlist, because
> IPS_DYING_BIT is 0 (already cleared).
>
> 2) is already wrong: CPU2 will handle a partially initialised object
> that is supposed to be private to CPU1.
>
> This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
>
> It also inserts a smp_wmb() before the refcount is set to 1 during
> allocation.
>
> Because another CPU might still 'see' the object, refcount_set(1)
> "resurrects" the object, so we need to make sure that other CPUs will
> also observe the right contents. In particular, the CONFIRMED bit test
> must only pass once the object is fully initialised and either in the
> hash or about to be inserted (with locks held to delay possible unlink from
> early_drop or gc worker).
>
> I did not change flow_offload_alloc(); as far as I can see it should call
> refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
> the skb, so its refcount should be >= 1 in all cases.
>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> include/net/netfilter/nf_conntrack.h | 3 +++
> net/netfilter/nf_conntrack_core.c | 19 +++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index a32be8aa7ed2..3dc3646ffba2 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> /* use after obtaining a reference count */
> static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> {
> + /* ->status and ->timeout loads must happen after refcount increase */
> + smp_rmb();
> +
> return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
> !nf_ct_is_dying(ct);
> }
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..072cabf1b296 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -795,6 +795,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
> */
> ct = nf_ct_tuplehash_to_ctrack(h);
> if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
> + /* re-check key after refcount */
> + smp_rmb();
> +
> if (likely(nf_ct_key_equal(h, tuple, zone, net)))
> goto found;
>
> @@ -1393,7 +1396,11 @@ static unsigned int early_drop_list(struct net *net,
> * We steal the timer reference. If that fails timer has
> * already fired or someone else deleted it. Just drop ref
> * and move to next entry.
> + *
> + * smp_rmb to ensure ->ct_net and ->status are loaded after
> + * refcount increase.
> */
> + smp_rmb();
> if (net_eq(nf_ct_net(tmp), net) &&
> nf_ct_is_confirmed(tmp) &&
> nf_ct_delete(tmp, 0, 0))
> @@ -1536,6 +1543,8 @@ static void gc_worker(struct work_struct *work)
> if (!refcount_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> + /* load ct->status after refcount */
> + smp_rmb();
> if (gc_worker_skip_ct(tmp)) {
> nf_ct_put(tmp);
> continue;
> @@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> if (!exp)
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> + /* Another CPU might have obtained a pointer to this object before it was
> + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> + *
> + * After refcount_set(1) it will succeed; ensure that the zeroing of
> + * ct->status and the correct ct->net pointer are visible; else another
> + * core might observe the CONFIRMED bit, which means the entry is valid
> + * and in the hash table, but it's not (anymore).
> + */
> + smp_wmb();
> +
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> refcount_set(&ct->ct_general.use, 1);

FWIW, the old code, which used atomic_inc() instead of refcount_set(),
would have had the exact same problem. There was no implied ordering vs
the earlier stores.

2022-07-06 12:41:06

by Will Deacon

[permalink] [raw]
Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
> Kajetan Puchalski <[email protected]> wrote:
> > On Tue, Jul 05, 2022 at 12:24:49PM +0100, Will Deacon wrote:
> > > > > Sorry, but I have absolutely no context here. We have a handy document
> > > > > describing the differences between atomic_t and refcount_t:
> > > > >
> > > > > Documentation/core-api/refcount-vs-atomic.rst
> > > > >
> > > > > What else do you need to know?
> > > >
> > > > Hmm, and I see a tonne of *_inc_not_zero() conversions in 719774377622
> > > > ("netfilter: conntrack: convert to refcount_t api") which mean that you
> > > > no longer have ordering to subsequent reads in the absence of an address
> > > > dependency.
> > >
> > > I think the patch above needs auditing with the relaxed behaviour in mind,
> > > but for the specific crash reported here possibly something like the diff
> > > below?
> > >
> > > Will
> > >
> > > --->8
> > >
> > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > index 082a2fd8d85b..5ad9fcc84269 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> > > @@ -1394,6 +1394,7 @@ static unsigned int early_drop_list(struct net *net,
> > > * already fired or someone else deleted it. Just drop ref
> > > * and move to next entry.
> > > */
> > > + smp_rmb(); /* XXX: Why? */
> > > if (net_eq(nf_ct_net(tmp), net) &&
> > > nf_ct_is_confirmed(tmp) &&
> > > nf_ct_delete(tmp, 0, 0))
> > >
> >
> > Just to follow up, I think you're right: the patch in question should be
> > audited further for other missing memory barriers.
> > While this one smp_rmb() helps a lot, i.e. it lets the test run for at
> > least an hour or two, an overnight 6-hour test still resulted in the same
> > crash somewhere along the way, so it looks like it's not the only barrier
> > that's needed.
>
> Yes, I don't think that refcount_inc_not_zero is usable as-is for conntrack.
> Here is a patch; I hope this will get things back to working order without
> a revert to the atomic_t api.
>
> Subject: [nf] netfilter: conntrack: fix crash due to confirmed bit load reordering
>
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> __nf_ct_delete_from_lists
> nf_ct_delete
> early_drop
> __nf_conntrack_alloc
>
> Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
> conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
> allocated object is still in use on another CPU:
>
> CPU1 CPU2
> encounters 'ct' during hlist walk
> delete_from_lists
> refcount drops to 0
> kmem_cache_free(ct);
> __nf_conntrack_alloc() // returns same object
> refcount_inc_not_zero(ct); /* might fail */
>
> /* If set, ct is public/in the hash table */
> test_bit(IPS_CONFIRMED_BIT, &ct->status);
>
> In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
> will succeed.
>
> The expected possibilities for a CPU that obtained the object 'ct'
> (but no reference so far) are:
>
> 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
> the next entry in the list. This happens for objects that are about
> to be free'd, that have been free'd, or that have been reallocated
> by __nf_conntrack_alloc(), but where the refcount has not been
> increased back to 1 yet.
>
> 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
> in ct->status. If set, the object is public/in the table.
>
> If not, the object must be skipped; CPU2 calls nf_ct_put() to
> undo the refcount increment and moves to the next object.
>
> Parallel deletion from the hlists is prevented by a
> 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
> cpu will do the unlink, the other one will only drop its reference count.
>
> Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
> delete an object that is not on any list:
>
> 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
> 2. CONFIRMED test also successful (load was reordered or zeroing
> of ct->status not yet visible)
> 3. delete_from_lists unlinks entry not on the hlist, because
> IPS_DYING_BIT is 0 (already cleared).
>
> 2) is already wrong: CPU2 will handle a partially initialised object
> that is supposed to be private to CPU1.
>
> This change adds smp_rmb() whenever refcount_inc_not_zero() was successful.
>
> It also inserts a smp_wmb() before the refcount is set to 1 during
> allocation.
>
> Because another CPU might still 'see' the object, refcount_set(1)
> "resurrects" the object, so we need to make sure that other CPUs will
> also observe the right contents. In particular, the CONFIRMED bit test
> must only pass once the object is fully initialised and either in the
> hash or about to be inserted (with locks held to delay possible unlink from
> early_drop or gc worker).
>
> I did not change flow_offload_alloc(); as far as I can see it should call
> refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
> the skb, so its refcount should be >= 1 in all cases.
>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> include/net/netfilter/nf_conntrack.h | 3 +++
> net/netfilter/nf_conntrack_core.c | 19 +++++++++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index a32be8aa7ed2..3dc3646ffba2 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> /* use after obtaining a reference count */
> static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> {
> + /* ->status and ->timeout loads must happen after refcount increase */
> + smp_rmb();

Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are
for upgrading the ordering from refcount_inc_not_zero() then you should
use smp_acquire__after_ctrl_dep() instead. It's the same under the hood,
but it illustrates what's going on a bit better.
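
(For reference, the generic definition makes the "same under the hood"
explicit; a lightly paraphrased excerpt from include/asm-generic/barrier.h:)

/* ACQUIRE ordering after a control dependency, e.g. a branch on the
 * result of refcount_inc_not_zero(): the same barrier instruction as
 * smp_rmb(), but it documents the intent.
 */
#define smp_acquire__after_ctrl_dep()	smp_rmb()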

> @@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> if (!exp)
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> > + /* Another CPU might have obtained a pointer to this object before it was
> > + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> > + *
> > + * After refcount_set(1) it will succeed; ensure that the zeroing of
> > + * ct->status and the correct ct->net pointer are visible; else another
> > + * core might observe the CONFIRMED bit, which means the entry is valid
> > + * and in the hash table, but it's not (anymore).
> + */
> + smp_wmb();
> +
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> refcount_set(&ct->ct_general.use, 1);

Ideally that refcount_set() would be a release, but this is definitely
(ab)using refcount_t in a way that isn't anticipated by the API! It looks
like a similar pattern exists in net/core/sock.c as well, so I wonder if
it's worth extending the API.

Peter, what do you think?

Will

2022-07-06 13:30:37

by Florian Westphal

[permalink] [raw]
Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

Will Deacon <[email protected]> wrote:
> On Wed, Jul 06, 2022 at 02:02:01PM +0200, Florian Westphal wrote:
> > + /* ->status and ->timeout loads must happen after refcount increase */
> > + smp_rmb();
>
> Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are
> for upgrading the ordering from refcount_inc_not_zero() then you should
> use smp_acquire__after_ctrl_dep() instead. It's the same under the hood,
> but it illustrates what's going on a bit better.

Ok, I can replace it and send a v2, no problem.

2022-07-06 14:04:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra

On Wed, Jul 06, 2022 at 01:22:50PM +0100, Will Deacon wrote:

> > @@ -300,6 +300,9 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> > /* use after obtaining a reference count */
> > static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> > {
> > + /* ->status and ->timeout loads must happen after refcount increase */
> > + smp_rmb();
>
> Sorry I didn't suggest this earlier, but if all of these smp_rmb()s are
> for upgrading the ordering from refcount_inc_not_zero() then you should
> use smp_acquire__after_ctrl_dep() instead. It's the same under the hood,
> but it illustrates what's going on a bit better.

But in that case it had better also be near an actual condition,
otherwise things become too murky for words :/

That is, why is this sprinkled all over instead of right after
a successful refcount_inc_not_zero()?

Code like:

	if (!refcount_inc_not_zero())
		return;

	smp_acquire__after_ctrl_dep();

is fairly self-evident, whereas encountering an
smp_acquire__after_ctrl_dep() in a different function, completely
unrelated to any condition, is quite crazy.

> > @@ -1775,6 +1784,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> > if (!exp)
> > __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
> >
> > + /* Another CPU might have obtained a pointer to this object before it was
> > + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> > + *
> > + * After refcount_set(1) it will succeed; ensure that the zeroing of
> > + * ct->status and the correct ct->net pointer are visible; else another
> > + * core might observe the CONFIRMED bit, which means the entry is valid
> > + * and in the hash table, but it's not (anymore).
> > + */
> > + smp_wmb();
> > +
> > /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> > refcount_set(&ct->ct_general.use, 1);
>
> Ideally that refcount_set() would be a release, but this is definitely
> (ab)using refcount_t in way that isn't anticipated by the API! It looks
> like a similar pattern exists in net/core/sock.c as well, so I wonder if
> it's worth extending the API.
>
> Peter, what do you think?

Bah; you have reminded me that I have a fairly sizable amount of
refcount patches from when Linus complained about it last that don't
seem to have gone anywhere :/

Anyway, I suppose we could do a refcount_set_release(), but it had
better get a fairly big comment on how you're on your own if you use it.
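
A rough sketch of what such a helper could look like (hypothetical:
refcount_set_release() did not exist in the tree at the time of this
thread, and it would need exactly the kind of comment described above):

#include <linux/atomic.h>
#include <linux/refcount.h>

/* Hypothetical: refcount_set() with release semantics. The store that
 * "resurrects" the object then also publishes all prior initialisation;
 * readers still need their own ordering for loads performed after a
 * successful refcount_inc_not_zero().
 */
static inline void refcount_set_release(refcount_t *r, int n)
{
	atomic_set_release(&r->refs, n);
}

With such a helper, the smp_wmb() + refcount_set(1) pair in
init_conntrack() could collapse into a single
refcount_set_release(&ct->ct_general.use, 1).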

2022-07-06 15:19:32

by Florian Westphal

[permalink] [raw]
Subject: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1 CPU2
encounters 'ct' during hlist walk
delete_from_lists
refcount drops to 0
kmem_cache_free(ct);
__nf_conntrack_alloc() // returns same object
refcount_inc_not_zero(ct); /* might fail */

/* If set, ct is public/in the hash table */
test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
the next entry in the list. This happens for objects that are about
to be free'd, that have been free'd, or that have been reallocated
by __nf_conntrack_alloc(), but where the refcount has not been
increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
in ct->status. If set, the object is public/in the table.

If not, the object must be skipped; CPU2 calls nf_ct_put() to
undo the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initialised object
that is supposed to be private to CPU1.

Add the needed barriers when refcount_inc_not_zero() is successful.

Also insert an smp_wmb() before the refcount is set to 1 during
allocation.

Because another CPU might still 'see' the object, refcount_set(1)
"resurrects" the object, so we need to make sure that other CPUs will
also observe the right contents. In particular, the CONFIRMED bit test
must only pass once the object is fully initialised and either in the
hash or about to be inserted (with locks held to delay possible unlink from
early_drop or gc worker).

I did not change flow_offload_alloc(); as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb, so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
add comment in nf_conntrack_netlink, no control dependency there
due to locks.

Cc: Peter Zijlstra <[email protected]>
Reported-by: Kajetan Puchalski <[email protected]>
Diagnosed-by: Will Deacon <[email protected]>
Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <[email protected]>
---
net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
net/netfilter/nf_conntrack_netlink.c | 1 +
net/netfilter/nf_conntrack_standalone.c | 3 +++
3 files changed, 26 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 082a2fd8d85b..369aeabb94fe 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -729,6 +729,9 @@ static void nf_ct_gc_expired(struct nf_conn *ct)
if (!refcount_inc_not_zero(&ct->ct_general.use))
return;

+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
if (nf_ct_should_gc(ct))
nf_ct_kill(ct);

@@ -795,6 +798,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
*/
ct = nf_ct_tuplehash_to_ctrack(h);
if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
+ /* re-check key after refcount */
+ smp_acquire__after_ctrl_dep();
+
if (likely(nf_ct_key_equal(h, tuple, zone, net)))
goto found;

@@ -1387,6 +1393,9 @@ static unsigned int early_drop_list(struct net *net,
if (!refcount_inc_not_zero(&tmp->ct_general.use))
continue;

+ /* load ->ct_net and ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
/* kill only if still in same netns -- might have moved due to
* SLAB_TYPESAFE_BY_RCU rules.
*
@@ -1536,6 +1545,9 @@ static void gc_worker(struct work_struct *work)
if (!refcount_inc_not_zero(&tmp->ct_general.use))
continue;

+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
if (gc_worker_skip_ct(tmp)) {
nf_ct_put(tmp);
continue;
@@ -1775,6 +1787,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
if (!exp)
__nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);

+ /* Another CPU might have obtained a pointer to this object before it was
+ * released. Because refcount is 0, refcount_inc_not_zero() will fail.
+ *
+ * After refcount_set(1) it will succeed; ensure that the zeroing of
+ * ct->status and the correct ct->net pointer are visible; else another
+ * core might observe the CONFIRMED bit, which means the entry is valid
+ * and in the hash table, but it's not (anymore).
+ */
+ smp_wmb();
+
/* Now it is going to be associated with an sk_buff, set refcount to 1. */
refcount_set(&ct->ct_general.use, 1);

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 722af5e309ba..f5905b5201a7 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1203,6 +1203,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
hnnode) {
ct = nf_ct_tuplehash_to_ctrack(h);
if (nf_ct_is_expired(ct)) {
+ /* need to defer nf_ct_kill() until lock is released */
if (i < ARRAY_SIZE(nf_ct_evict) &&
refcount_inc_not_zero(&ct->ct_general.use))
nf_ct_evict[i++] = ct;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 6ad7bbc90d38..05895878610c 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -306,6 +306,9 @@ static int ct_seq_show(struct seq_file *s, void *v)
if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
return 0;

+ /* load ->status after refcount increase */
+ smp_acquire__after_ctrl_dep();
+
if (nf_ct_should_gc(ct)) {
nf_ct_kill(ct);
goto release;
--
2.35.1

2022-07-07 08:47:02

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

On Wed, Jul 06, 2022 at 04:50:04PM +0200, Florian Westphal wrote:
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> __nf_ct_delete_from_lists
> nf_ct_delete
> early_drop
> __nf_conntrack_alloc
>
> Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
> conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
> allocated object is still in use on another CPU:
>
> CPU1 CPU2
> encounters 'ct' during hlist walk
> delete_from_lists
> refcount drops to 0
> kmem_cache_free(ct);
> __nf_conntrack_alloc() // returns same object
> refcount_inc_not_zero(ct); /* might fail */
>
> /* If set, ct is public/in the hash table */
> test_bit(IPS_CONFIRMED_BIT, &ct->status);
>
> In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
> will succeed.
>
> The expected possibilities for a CPU that obtained the object 'ct'
> (but no reference so far) are:
>
> 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to
> the next entry in the list. This happens for objects that are about
> to be free'd, that have been free'd, or that have been reallocated
> by __nf_conntrack_alloc(), but where the refcount has not been
> increased back to 1 yet.
>
> 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
> in ct->status. If set, the object is public/in the table.
>
> If not, the object must be skipped; CPU2 calls nf_ct_put() to
> undo the refcount increment and moves to the next object.
>
> Parallel deletion from the hlists is prevented by a
> 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
> cpu will do the unlink, the other one will only drop its reference count.
>
> Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
> delete an object that is not on any list:
>
> 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
> 2. CONFIRMED test also successful (load was reordered or zeroing
> of ct->status not yet visible)
> 3. delete_from_lists unlinks entry not on the hlist, because
> IPS_DYING_BIT is 0 (already cleared).
>
> 2) is already wrong: CPU2 will handle a partially initialised object
> that is supposed to be private to CPU1.
>
> Add the needed barriers when refcount_inc_not_zero() is successful.
>
> Also insert an smp_wmb() before the refcount is set to 1 during
> allocation.
>
> Because another CPU might still 'see' the object, refcount_set(1)
> "resurrects" the object, so we need to make sure that other CPUs will
> also observe the right contents. In particular, the CONFIRMED bit test
> must only pass once the object is fully initialised and either in the
> hash or about to be inserted (with locks held to delay possible unlink from
> early_drop or gc worker).
>
> I did not change flow_offload_alloc(); as far as I can see it should call
> refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
> the skb, so its refcount should be >= 1 in all cases.
>
> v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
> v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
> add comment in nf_conntrack_netlink, no control dependency there
> due to locks.
>
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
> net/netfilter/nf_conntrack_netlink.c | 1 +
> net/netfilter/nf_conntrack_standalone.c | 3 +++
> 3 files changed, 26 insertions(+)

Acked-by: Will Deacon <[email protected]>

Will

2022-07-07 10:29:38

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

On 06.07.22 16:50, Florian Westphal wrote:
> Kajetan Puchalski reports crash on ARM, with backtrace of:
>
> [...]
> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>

If you need to respin this patch for one reason or another, could you do
me a favor and add proper 'Link:' tags pointing to all reports about
this issue? e.g. like this:

Link: https://lore.kernel.org/all/[email protected]/

These tags are considered important by Linus[1] and others, as they
allow anyone to look into the backstory weeks or years from now. That is
why they should be placed in cases like this, as
Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst explain in more detail. I care
personally, because these tags make my regression tracking efforts a
whole lot easier, as they allow my tracking bot 'regzbot' to
automatically connect reports with patches posted or committed to fix
tracked regressions.

[1] see for example:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

2022-07-07 19:38:29

by Florian Westphal

[permalink] [raw]
Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

Will Deacon <[email protected]> wrote:
> On Wed, Jul 06, 2022 at 04:50:04PM +0200, Florian Westphal wrote:
> > net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
> > net/netfilter/nf_conntrack_netlink.c | 1 +
> > net/netfilter/nf_conntrack_standalone.c | 3 +++
> > 3 files changed, 26 insertions(+)
>
> Acked-by: Will Deacon <[email protected]>

Thanks, I pushed this patch to nf.git.

2022-07-11 17:48:14

by Kajetan Puchalski

[permalink] [raw]
Subject: Re: [PATCH nf v3] netfilter: conntrack: fix crash due to confirmed bit load reordering

> v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
> add comment in nf_conntrack_netlink, no control dependency there
> due to locks.

Just to follow up on that, I tested v3 for 24 hours with the workload in
question and found no issues, so it looks like the fix is stable.

In case someone is interested in the performance differences: since I
was running benchmarks anyway, I thought I might share the numbers on
how using refcount vs atomic here seems to affect networking workloads.

The results were collected using mmtests; the means marked with asterisks
are the ones that the framework considered statistically significant.

netperf-udp
atomic v3
Hmean send-64 189.36 ( 0.00%) 227.14 * 19.95%*
Hmean send-128 378.77 ( 0.00%) 387.94 ( 2.42%)
Hmean send-256 925.96 ( 0.00%) 922.77 ( -0.34%)
Hmean send-1024 3550.03 ( 0.00%) 3528.63 ( -0.60%)
Hmean send-2048 6545.45 ( 0.00%) 6655.64 * 1.68%*
Hmean send-3312 10282.12 ( 0.00%) 10388.78 * 1.04%*
Hmean send-4096 11902.15 ( 0.00%) 12052.30 * 1.26%*
Hmean send-8192 19369.15 ( 0.00%) 20363.82 * 5.14%*
Hmean send-16384 32610.44 ( 0.00%) 33080.30 ( 1.44%)
Hmean recv-64 189.36 ( 0.00%) 226.34 * 19.53%*
Hmean recv-128 378.77 ( 0.00%) 386.81 ( 2.12%)
Hmean recv-256 925.95 ( 0.00%) 922.77 ( -0.34%)
Hmean recv-1024 3549.90 ( 0.00%) 3528.51 ( -0.60%)
Hmean recv-2048 6542.82 ( 0.00%) 6653.05 * 1.68%*
Hmean recv-3312 10278.46 ( 0.00%) 10385.45 * 1.04%*
Hmean recv-4096 11892.86 ( 0.00%) 12041.68 * 1.25%*
Hmean recv-8192 19345.14 ( 0.00%) 20343.76 * 5.16%*
Hmean recv-16384 32574.38 ( 0.00%) 33030.53 ( 1.40%)

netperf-tcp
atomic v3
Hmean 64 1324.25 ( 0.00%) 1328.90 * 0.35%*
Hmean 128 2576.89 ( 0.00%) 2579.71 ( 0.11%)
Hmean 256 4882.34 ( 0.00%) 4889.49 ( 0.15%)
Hmean 1024 14560.89 ( 0.00%) 14423.39 * -0.94%*
Hmean 2048 20995.91 ( 0.00%) 20818.49 * -0.85%*
Hmean 3312 25440.20 ( 0.00%) 25318.16 * -0.48%*
Hmean 4096 27309.32 ( 0.00%) 27282.26 ( -0.10%)
Hmean 8192 31204.34 ( 0.00%) 31326.23 * 0.39%*
Hmean 16384 34370.49 ( 0.00%) 34298.25 ( -0.21%)

Additionally, the reason I bumped into this issue in the first place was
running benchmarks on different CPUIdle governors, so below are the
results for what happens if, in addition to changing from atomic to the
v3 refcount, I also switch the idle governor from menu to TEO.

netperf-udp
atomic v3
menu teo
Hmean send-64 189.36 ( 0.00%) 248.79 * 31.38%*
Hmean send-128 378.77 ( 0.00%) 439.06 ( 15.92%)
Hmean send-256 925.96 ( 0.00%) 1101.20 * 18.93%*
Hmean send-1024 3550.03 ( 0.00%) 3298.19 ( -7.09%)
Hmean send-2048 6545.45 ( 0.00%) 7714.21 * 17.86%*
Hmean send-3312 10282.12 ( 0.00%) 12090.56 * 17.59%*
Hmean send-4096 11902.15 ( 0.00%) 13766.56 * 15.66%*
Hmean send-8192 19369.15 ( 0.00%) 22943.77 * 18.46%*
Hmean send-16384 32610.44 ( 0.00%) 37370.44 * 14.60%*
Hmean recv-64 189.36 ( 0.00%) 248.79 * 31.38%*
Hmean recv-128 378.77 ( 0.00%) 439.06 ( 15.92%)
Hmean recv-256 925.95 ( 0.00%) 1101.19 * 18.92%*
Hmean recv-1024 3549.90 ( 0.00%) 3298.16 ( -7.09%)
Hmean recv-2048 6542.82 ( 0.00%) 7711.59 * 17.86%*
Hmean recv-3312 10278.46 ( 0.00%) 12087.81 * 17.60%*
Hmean recv-4096 11892.86 ( 0.00%) 13755.48 * 15.66%*
Hmean recv-8192 19345.14 ( 0.00%) 22933.98 * 18.55%*
Hmean recv-16384 32574.38 ( 0.00%) 37332.10 * 14.61%*

netperf-tcp
atomic v3
menu teo
Hmean 64 1324.25 ( 0.00%) 1351.86 * 2.08%*
Hmean 128 2576.89 ( 0.00%) 2629.08 * 2.03%*
Hmean 256 4882.34 ( 0.00%) 5003.19 * 2.48%*
Hmean 1024 14560.89 ( 0.00%) 15237.15 * 4.64%*
Hmean 2048 20995.91 ( 0.00%) 22804.40 * 8.61%*
Hmean 3312 25440.20 ( 0.00%) 27815.23 * 9.34%*
Hmean 4096 27309.32 ( 0.00%) 30171.81 * 10.48%*
Hmean 8192 31204.34 ( 0.00%) 37112.55 * 18.93%*
Hmean 16384 34370.49 ( 0.00%) 42952.01 * 24.97%*

The absolute values might be skewed by the characteristics of the
machine in question, but I thought the comparative differences between
the different patches were interesting enough to share.

> Cc: Peter Zijlstra <[email protected]>
> Reported-by: Kajetan Puchalski <[email protected]>
> Diagnosed-by: Will Deacon <[email protected]>
> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api")
> Signed-off-by: Florian Westphal <[email protected]>
> ---
> net/netfilter/nf_conntrack_core.c | 22 ++++++++++++++++++++++
> net/netfilter/nf_conntrack_netlink.c | 1 +
> net/netfilter/nf_conntrack_standalone.c | 3 +++
> 3 files changed, 26 insertions(+)
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 082a2fd8d85b..369aeabb94fe 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -729,6 +729,9 @@ static void nf_ct_gc_expired(struct nf_conn *ct)
> if (!refcount_inc_not_zero(&ct->ct_general.use))
> return;
>
> + /* load ->status after refcount increase */
> + smp_acquire__after_ctrl_dep();
> +
> if (nf_ct_should_gc(ct))
> nf_ct_kill(ct);
>
> @@ -795,6 +798,9 @@ __nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
> */
> ct = nf_ct_tuplehash_to_ctrack(h);
> if (likely(refcount_inc_not_zero(&ct->ct_general.use))) {
> + /* re-check key after refcount */
> + smp_acquire__after_ctrl_dep();
> +
> if (likely(nf_ct_key_equal(h, tuple, zone, net)))
> goto found;
>
> @@ -1387,6 +1393,9 @@ static unsigned int early_drop_list(struct net *net,
> if (!refcount_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> + /* load ->ct_net and ->status after refcount increase */
> + smp_acquire__after_ctrl_dep();
> +
> /* kill only if still in same netns -- might have moved due to
> * SLAB_TYPESAFE_BY_RCU rules.
> *
> @@ -1536,6 +1545,9 @@ static void gc_worker(struct work_struct *work)
> if (!refcount_inc_not_zero(&tmp->ct_general.use))
> continue;
>
> + /* load ->status after refcount increase */
> + smp_acquire__after_ctrl_dep();
> +
> if (gc_worker_skip_ct(tmp)) {
> nf_ct_put(tmp);
> continue;
> @@ -1775,6 +1787,16 @@ init_conntrack(struct net *net, struct nf_conn *tmpl,
> if (!exp)
> __nf_ct_try_assign_helper(ct, tmpl, GFP_ATOMIC);
>
> + /* Another CPU might have obtained a pointer to this object before it was
> + * released. Because refcount is 0, refcount_inc_not_zero() will fail.
> + *
> + * After refcount_set(1) it will succeed; ensure that the zeroing of
> + * ct->status and the correct ct->net pointer are visible; else another
> + * core might observe the CONFIRMED bit, which means the entry is valid
> + * and in the hash table, but it's not (anymore).
> + */
> + smp_wmb();
> +
> /* Now it is going to be associated with an sk_buff, set refcount to 1. */
> refcount_set(&ct->ct_general.use, 1);
>
> diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
> index 722af5e309ba..f5905b5201a7 100644
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -1203,6 +1203,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
> hnnode) {
> ct = nf_ct_tuplehash_to_ctrack(h);
> if (nf_ct_is_expired(ct)) {
> + /* need to defer nf_ct_kill() until lock is released */
> if (i < ARRAY_SIZE(nf_ct_evict) &&
> refcount_inc_not_zero(&ct->ct_general.use))
> nf_ct_evict[i++] = ct;
> diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
> index 6ad7bbc90d38..05895878610c 100644
> --- a/net/netfilter/nf_conntrack_standalone.c
> +++ b/net/netfilter/nf_conntrack_standalone.c
> @@ -306,6 +306,9 @@ static int ct_seq_show(struct seq_file *s, void *v)
> if (unlikely(!refcount_inc_not_zero(&ct->ct_general.use)))
> return 0;
>
> + /* load ->status after refcount increase */
> + smp_acquire__after_ctrl_dep();
> +
> if (nf_ct_should_gc(ct)) {
> nf_ct_kill(ct);
> goto release;
> --
> 2.35.1
>