2020-03-21 02:52:39

by Jason A. Donenfeld

[permalink] [raw]
Subject: refcount underflow in nfsd41_destroy_cb

Hello,

A user erroneously sent me a refcount UaF in NFS with 5.6-rc6. I
thought I should forward it onward here in case this is not already
known. The original reporter is CC'd in case you have any questions.

Regards,
Jason

-------8<------------------------

Mar 20 21:43:34 agro kernel: ------------[ cut here ]------------
Mar 20 21:43:34 agro kernel: refcount_t: underflow; use-after-free.
Mar 20 21:43:34 agro kernel: WARNING: CPU: 1 PID: 9334 at
lib/refcount.c:28 refcount_warn_saturate+0xd9/0xe0
Mar 20 21:43:34 agro kernel: Modules linked in: md5 wireguard
curve25519_x86_64 libcurve25519_generic libchacha20poly1305
chacha_x86_64 libchacha poly1305_x86_64 ip6_udp_tunnel udp_tunnel
libblake2s blake2s_x86_64 libblake2s_generic xt_mac xt_nat
xt_MASQUERADE xt_REDIRECT xt_owner xt_comment ipt_REJECT
nf_reject_ipv4 xt_mark xt_hashlimit xt_multiport nfnetlink_log
xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_sip
nf_nat_h323 nf_nat_ftp nf_conntrack_tftp nf_conntrack_sip
nf_conntrack_netlink nfnetlink nf_conntrack_h323 nf_conntrack_ftp nfsd
auth_rpcgss nfs_acl lockd grace sunrpc cpufreq_ondemand msr bridge stp
llc xt_tcpudp xt_conntrack iptable_mangle iptable_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_filter
ip_tables x_tables btrfs blake2b_generic xor raid6_pq libcrc32c
zstd_compress zstd_decompress zlib_deflate zlib_inflate uas
usb_storage snd_hda_codec_realtek snd_hda_codec_generic matroxfb_base
matroxfb_g450 matroxfb_Ti3026 snd_hda_int
el matroxfb_accel
Mar 20 21:43:34 agro kernel: snd_intel_dspcfg cfbfillrect cfbimgblt
atlantic snd_hda_codec r8169 cfbcopyarea e1000e snd_hda_core snd_hwdep
matroxfb_DAC1064 g450_pll matroxfb_misc ptp realtek fb snd_pcm libphy
snd_timer mxm_wmi xhci_pci pcspkr i2c_piix4 input_leds pps_core fbdev
k10temp ohci_pci snd xhci_hcd ohci_hcd ehci_pci soundcore i2c_core
ehci_hcd wmi acpi_cpufreq vhci_hcd usbip_core usbcore usb_common
vhost_net vhost kvm_amd kvm irqbypass loop tun fuse it87 hwmon_vid
hwmon
Mar 20 21:43:34 agro kernel: CPU: 1 PID: 9334 Comm: kworker/u16:3 Not
tainted 5.6.0-rc6 #1
Mar 20 21:43:34 agro kernel: Hardware name: Gigabyte Technology Co.,
Ltd. GA-990FXA-UD5/GA-990FXA-UD5, BIOS F12 10/03/2013
Mar 20 21:43:34 agro kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
Mar 20 21:43:34 agro kernel: RIP: 0010:refcount_warn_saturate+0xd9/0xe0
Mar 20 21:43:34 agro kernel: Code: ff 48 c7 c7 e8 ff d5 81 c6 05 cd 49
b5 00 01 e8 4d 04 cf ff 0f 0b c3 48 c7 c7 90 ff d5 81 c6 05 b9 49 b5
00 01 e8 37 04 cf ff <0f> 0b c3 0f 1f 40 00 8b 07 3d 00 00 00 c0 74 12
83 f8 01 74 46 8d
Mar 20 21:43:34 agro kernel: RSP: 0018:ffffc900010dfde0 EFLAGS: 00010282
Mar 20 21:43:34 agro kernel: RAX: 0000000000000026 RBX:
0000000000000e81 RCX: 0000000000000007
Mar 20 21:43:34 agro kernel: RDX: 0000000000000007 RSI:
0000000000000092 RDI: ffff88881fc587c0
Mar 20 21:43:34 agro kernel: RBP: ffff8887eeb40470 R08:
00000000000004d9 R09: 0000000000000001
Mar 20 21:43:34 agro kernel: R10: 0000000000000000 R11:
0000000000000001 R12: ffff8884f1e28510
Mar 20 21:43:34 agro kernel: R13: ffff88837dc49f30 R14:
0000000000000001 R15: ffff888780b8a840
Mar 20 21:43:34 agro kernel: FS: 0000000000000000(0000)
GS:ffff88881fc40000(0000) knlGS:0000000000000000
Mar 20 21:43:34 agro kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 20 21:43:34 agro kernel: CR2: 000000000421efb0 CR3:
0000000510564000 CR4: 00000000000006e0
Mar 20 21:43:34 agro kernel: Call Trace:
Mar 20 21:43:34 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
Mar 20 21:43:34 agro kernel: rpc_free_task+0x3c/0x70 [sunrpc]
Mar 20 21:43:34 agro kernel: __rpc_execute+0x3a5/0x3c0 [sunrpc]
Mar 20 21:43:34 agro kernel: ? finish_task_switch+0x7f/0x250
Mar 20 21:43:34 agro kernel: rpc_async_schedule+0x2f/0x50 [sunrpc]
Mar 20 21:43:34 agro kernel: process_one_work+0x1ca/0x3c0
Mar 20 21:43:34 agro kernel: worker_thread+0x45/0x3d0
Mar 20 21:43:34 agro kernel: kthread+0xf3/0x130
Mar 20 21:43:34 agro kernel: ? process_one_work+0x3c0/0x3c0
Mar 20 21:43:34 agro kernel: ? kthread_park+0x80/0x80
Mar 20 21:43:34 agro kernel: ret_from_fork+0x1f/0x30
Mar 20 21:43:34 agro kernel: ---[ end trace 99765c8e28c46274 ]---


2020-03-21 15:06:20

by Jan Psota

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

Hi! Previous and first trace for problem that Jason forwarded to you.


Attachments:
(No filename) (71.00 B)
nfsd-trace.txt (3.91 kB)
Download all attachments

2020-03-21 17:26:25

by Jan Psota

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

Hi! 3-rd trace. If it is enough, so you want me not to send next, just
let me know.


Attachments:
(No filename) (86.00 B)
nfsd-trace3.txt (3.43 kB)
Download all attachments

2020-03-21 19:53:12

by Chuck Lever III

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb



> On Mar 20, 2020, at 10:51 PM, Jason A. Donenfeld <[email protected]> wrote:
>
> Hello,
>
> A user erroneously sent me a refcount UaF in NFS with 5.6-rc6. I
> thought I should forward it onward here in case this is not already
> known. The original reporter is CC'd in case you have any questions.

Jan, how are you reproducing this?


> Regards,
> Jason
>
> -------8<------------------------
>
> Mar 20 21:43:34 agro kernel: ------------[ cut here ]------------
> Mar 20 21:43:34 agro kernel: refcount_t: underflow; use-after-free.
> Mar 20 21:43:34 agro kernel: WARNING: CPU: 1 PID: 9334 at
> lib/refcount.c:28 refcount_warn_saturate+0xd9/0xe0
> Mar 20 21:43:34 agro kernel: Modules linked in: md5 wireguard
> curve25519_x86_64 libcurve25519_generic libchacha20poly1305
> chacha_x86_64 libchacha poly1305_x86_64 ip6_udp_tunnel udp_tunnel
> libblake2s blake2s_x86_64 libblake2s_generic xt_mac xt_nat
> xt_MASQUERADE xt_REDIRECT xt_owner xt_comment ipt_REJECT
> nf_reject_ipv4 xt_mark xt_hashlimit xt_multiport nfnetlink_log
> xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_sip
> nf_nat_h323 nf_nat_ftp nf_conntrack_tftp nf_conntrack_sip
> nf_conntrack_netlink nfnetlink nf_conntrack_h323 nf_conntrack_ftp nfsd
> auth_rpcgss nfs_acl lockd grace sunrpc cpufreq_ondemand msr bridge stp
> llc xt_tcpudp xt_conntrack iptable_mangle iptable_nat nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_filter
> ip_tables x_tables btrfs blake2b_generic xor raid6_pq libcrc32c
> zstd_compress zstd_decompress zlib_deflate zlib_inflate uas
> usb_storage snd_hda_codec_realtek snd_hda_codec_generic matroxfb_base
> matroxfb_g450 matroxfb_Ti3026 snd_hda_int
> el matroxfb_accel
> Mar 20 21:43:34 agro kernel: snd_intel_dspcfg cfbfillrect cfbimgblt
> atlantic snd_hda_codec r8169 cfbcopyarea e1000e snd_hda_core snd_hwdep
> matroxfb_DAC1064 g450_pll matroxfb_misc ptp realtek fb snd_pcm libphy
> snd_timer mxm_wmi xhci_pci pcspkr i2c_piix4 input_leds pps_core fbdev
> k10temp ohci_pci snd xhci_hcd ohci_hcd ehci_pci soundcore i2c_core
> ehci_hcd wmi acpi_cpufreq vhci_hcd usbip_core usbcore usb_common
> vhost_net vhost kvm_amd kvm irqbypass loop tun fuse it87 hwmon_vid
> hwmon
> Mar 20 21:43:34 agro kernel: CPU: 1 PID: 9334 Comm: kworker/u16:3 Not
> tainted 5.6.0-rc6 #1
> Mar 20 21:43:34 agro kernel: Hardware name: Gigabyte Technology Co.,
> Ltd. GA-990FXA-UD5/GA-990FXA-UD5, BIOS F12 10/03/2013
> Mar 20 21:43:34 agro kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
> Mar 20 21:43:34 agro kernel: RIP: 0010:refcount_warn_saturate+0xd9/0xe0
> Mar 20 21:43:34 agro kernel: Code: ff 48 c7 c7 e8 ff d5 81 c6 05 cd 49
> b5 00 01 e8 4d 04 cf ff 0f 0b c3 48 c7 c7 90 ff d5 81 c6 05 b9 49 b5
> 00 01 e8 37 04 cf ff <0f> 0b c3 0f 1f 40 00 8b 07 3d 00 00 00 c0 74 12
> 83 f8 01 74 46 8d
> Mar 20 21:43:34 agro kernel: RSP: 0018:ffffc900010dfde0 EFLAGS: 00010282
> Mar 20 21:43:34 agro kernel: RAX: 0000000000000026 RBX:
> 0000000000000e81 RCX: 0000000000000007
> Mar 20 21:43:34 agro kernel: RDX: 0000000000000007 RSI:
> 0000000000000092 RDI: ffff88881fc587c0
> Mar 20 21:43:34 agro kernel: RBP: ffff8887eeb40470 R08:
> 00000000000004d9 R09: 0000000000000001
> Mar 20 21:43:34 agro kernel: R10: 0000000000000000 R11:
> 0000000000000001 R12: ffff8884f1e28510
> Mar 20 21:43:34 agro kernel: R13: ffff88837dc49f30 R14:
> 0000000000000001 R15: ffff888780b8a840
> Mar 20 21:43:34 agro kernel: FS: 0000000000000000(0000)
> GS:ffff88881fc40000(0000) knlGS:0000000000000000
> Mar 20 21:43:34 agro kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Mar 20 21:43:34 agro kernel: CR2: 000000000421efb0 CR3:
> 0000000510564000 CR4: 00000000000006e0
> Mar 20 21:43:34 agro kernel: Call Trace:
> Mar 20 21:43:34 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
> Mar 20 21:43:34 agro kernel: rpc_free_task+0x3c/0x70 [sunrpc]
> Mar 20 21:43:34 agro kernel: __rpc_execute+0x3a5/0x3c0 [sunrpc]
> Mar 20 21:43:34 agro kernel: ? finish_task_switch+0x7f/0x250
> Mar 20 21:43:34 agro kernel: rpc_async_schedule+0x2f/0x50 [sunrpc]
> Mar 20 21:43:34 agro kernel: process_one_work+0x1ca/0x3c0
> Mar 20 21:43:34 agro kernel: worker_thread+0x45/0x3d0
> Mar 20 21:43:34 agro kernel: kthread+0xf3/0x130
> Mar 20 21:43:34 agro kernel: ? process_one_work+0x3c0/0x3c0
> Mar 20 21:43:34 agro kernel: ? kthread_park+0x80/0x80
> Mar 20 21:43:34 agro kernel: ret_from_fork+0x1f/0x30
> Mar 20 21:43:34 agro kernel: ---[ end trace 99765c8e28c46274 ]---

--
Chuck Lever



2020-03-22 03:45:02

by Jan Psota

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

Chuck Lever <[email protected]> napisa?(a):
> Jan, how are you reproducing this?

It looks like it's taking place on server on high NFS load and about
a day after boot! (as I noticed looking into last -x results, below)
Then system runs all right for a month (to be rebooted on new kernel
[not always] or something like this).

We have some NFS-rooted machines:
/systemd on / type nfs4 (rw,relatime,vers=4.2,rsize=4096,wsize=4096,namlen=255,hard,proto=tcp,
timeo=10,retrans=2,sec=sys,clientaddr=192.168.1.18,local_lock=none,addr=192.168.1.1)

Server has 10Gb Aquantia AQC107 card connected to Mikrotik CSS326
switch. Clients running distcc (aside from acting as workstations)
are connected on 1Gb ethernet. Server runs Gentoo Linux on OpenRC
(stations have Systemd) with recent gcc-9.3, binutils-2.34 and
glibc-2.30, has 32 GB RAM and AMD Phenom II X6 1090T CPU.

/var/tmp/portage, where compilation takes place, normally is on client
tmpfs, but when there is not enough space to compile huge program, I
switch it to server exported NFS
(/etc/exports opts: -rw,async,no_root_squash,no_subtree_check)

# "grep nfs.*destroy /var/log/messages" mixed with "last -x"

reboot system boot 5.5.1-gentoo Mon Feb 3 00:20 - 15:22 (25+15:01)
Feb 4 17:44:39 agro kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
rust compilation, kernel 5.5.1-gentoo

reboot system boot 5.5.6-gentoo Fri Feb 28 15:23 - 16:25 (14+01:02)
Feb 29 13:51:49 agro kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
rust compilation, kernel 5.5.6-gentoo

reboot system boot 5.5.9-gentoo Fri Mar 13 16:27 - 00:04 (4+07:36)
Mar 14 18:03:49 agro kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
libpciaccess compilation, kernel

reboot system boot 5.6.0-rc6 Wed Mar 18 00:06 - 20:39 (2+20:32)
Mar 19 11:08:07 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
linux-firmware merge
*
reboot system boot 5.6.0-rc6 Fri Mar 20 20:40 - 02:40 (05:59)
Mar 20 21:43:34 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
zstd compilation
*
reboot system boot 5.6.0-rc6 Sat Mar 21 02:42 still running
Mar 21 17:34:43 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
nodejs compilation

* - I noticed kernel fault looking for a reason, why WireGuard refused
to connect with _some_ remote peers so I rebooted the server and it helped.

2020-03-23 14:14:07

by Chuck Lever III

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb



> On Mar 21, 2020, at 11:43 PM, Jan Psota <[email protected]> wrote:
>
> Chuck Lever <[email protected]> napisaƂ(a):
>> Jan, how are you reproducing this?
>
> It looks like it's taking place on server on high NFS load and about
> a day after boot! (as I noticed looking into last -x results, below)
> Then system runs all right for a month (to be rebooted on new kernel
> [not always] or something like this).
>
> We have some NFS-rooted machines:
> /systemd on / type nfs4 (rw,relatime,vers=4.2,rsize=4096,wsize=4096,namlen=255,hard,proto=tcp,
> timeo=10,retrans=2,sec=sys,clientaddr=192.168.1.18,local_lock=none,addr=192.168.1.1)
>
> Server has 10Gb Aquantia AQC107 card connected to Mikrotik CSS326
> switch. Clients running distcc (aside from acting as workstations)
> are connected on 1Gb ethernet. Server runs Gentoo Linux on OpenRC
> (stations have Systemd) with recent gcc-9.3, binutils-2.34 and
> glibc-2.30, has 32 GB RAM and AMD Phenom II X6 1090T CPU.
>
> /var/tmp/portage, where compilation takes place, normally is on client
> tmpfs, but when there is not enough space to compile huge program, I
> switch it to server exported NFS
> (/etc/exports opts: -rw,async,no_root_squash,no_subtree_check)
>
> # "grep nfs.*destroy /var/log/messages" mixed with "last -x"

I thought I read in the initial report that you were seeing this
problem only on v5.6-rc6. What is the earliest kernel release
where you saw refcount UaF warnings from nfsd4_destroy_cb?


> reboot system boot 5.5.1-gentoo Mon Feb 3 00:20 - 15:22 (25+15:01)
> Feb 4 17:44:39 agro kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> rust compilation, kernel 5.5.1-gentoo
>
> reboot system boot 5.5.6-gentoo Fri Feb 28 15:23 - 16:25 (14+01:02)
> Feb 29 13:51:49 agro kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> rust compilation, kernel 5.5.6-gentoo
>
> reboot system boot 5.5.9-gentoo Fri Mar 13 16:27 - 00:04 (4+07:36)
> Mar 14 18:03:49 agro kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> libpciaccess compilation, kernel
>
> reboot system boot 5.6.0-rc6 Wed Mar 18 00:06 - 20:39 (2+20:32)
> Mar 19 11:08:07 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
> linux-firmware merge
> *
> reboot system boot 5.6.0-rc6 Fri Mar 20 20:40 - 02:40 (05:59)
> Mar 20 21:43:34 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
> zstd compilation
> *
> reboot system boot 5.6.0-rc6 Sat Mar 21 02:42 still running
> Mar 21 17:34:43 agro kernel: nfsd41_destroy_cb+0x36/0x50 [nfsd]
> nodejs compilation
>
> * - I noticed kernel fault looking for a reason, why WireGuard refused
> to connect with _some_ remote peers so I rebooted the server and it helped.

--
Chuck Lever



2020-03-23 15:10:04

by Jan Psota

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

> I thought I read in the initial report that you were seeing this
> problem only on v5.6-rc6. What is the earliest kernel release
> where you saw refcount UaF warnings from nfsd4_destroy_cb?
>
I didn't noticed that earlier too, because until connection breakage on
WireGuard I did not have any problems related. But when you are asking,
I found it in my Pentium G2020 system too since 5.5.4 kernel and 5.4.2
looks not affected (I have logs since 01 Jan and fault begin to appear
on Feb 21, when I switched from 5.4.2 to 5.5.4 kernel a day before)

$ journalctl | grep nfsd41_destroy_cb
lut 21 01:07:58 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
lut 27 01:01:12 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 03 00:59:01 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 03 23:03:02 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 11 11:52:42 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 13 01:12:02 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 14 14:31:39 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 15 20:56:56 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 17 15:58:32 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
mar 22 15:24:03 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]

I attach NFS part of my .config and a screen dump of menuconfig.


Attachments:
(No filename) (1.33 kB)
nfs.config (1.01 kB)
nfs-config.txt (2.61 kB)
Download all attachments

2020-03-24 13:51:49

by Chuck Lever III

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

Trond,

> On Mar 23, 2020, at 11:09 AM, Jan Psota <[email protected]> wrote:
>
>> I thought I read in the initial report that you were seeing this
>> problem only on v5.6-rc6. What is the earliest kernel release
>> where you saw refcount UaF warnings from nfsd4_destroy_cb?
>>
> I didn't noticed that earlier too, because until connection breakage on
> WireGuard I did not have any problems related. But when you are asking,
> I found it in my Pentium G2020 system too since 5.5.4 kernel and 5.4.2
> looks not affected (I have logs since 01 Jan and fault begin to appear
> on Feb 21, when I switched from 5.4.2 to 5.5.4 kernel a day before)
>
> $ journalctl | grep nfsd41_destroy_cb
> lut 21 01:07:58 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> lut 27 01:01:12 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 03 00:59:01 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 03 23:03:02 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 11 11:52:42 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 13 01:12:02 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 14 14:31:39 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 15 20:56:56 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 17 15:58:32 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> mar 22 15:24:03 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>
> I attach NFS part of my .config and a screen dump of menuconfig.
> <nfs.config><nfs-config.txt>

I'm wondering if

2bbfed98a4d8 ("nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()")

or

12357f1b2c8e ("nfsd: minor 4.1 callback cleanup")

might be related to this issue (see down-thread for details and backtraces).

--
Chuck Lever



2020-03-31 18:04:09

by Chuck Lever III

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

Trond, any thoughts?

> On Mar 24, 2020, at 9:50 AM, Chuck Lever <[email protected]> wrote:
>
> Trond,
>
>> On Mar 23, 2020, at 11:09 AM, Jan Psota <[email protected]> wrote:
>>
>>> I thought I read in the initial report that you were seeing this
>>> problem only on v5.6-rc6. What is the earliest kernel release
>>> where you saw refcount UaF warnings from nfsd4_destroy_cb?
>>>
>> I didn't noticed that earlier too, because until connection breakage on
>> WireGuard I did not have any problems related. But when you are asking,
>> I found it in my Pentium G2020 system too since 5.5.4 kernel and 5.4.2
>> looks not affected (I have logs since 01 Jan and fault begin to appear
>> on Feb 21, when I switched from 5.4.2 to 5.5.4 kernel a day before)
>>
>> $ journalctl | grep nfsd41_destroy_cb
>> lut 21 01:07:58 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> lut 27 01:01:12 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 03 00:59:01 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 03 23:03:02 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 11 11:52:42 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 13 01:12:02 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 14 14:31:39 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 15 20:56:56 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 17 15:58:32 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>> mar 22 15:24:03 mordimer kernel: nfsd41_destroy_cb+0x2c/0x40 [nfsd]
>>
>> I attach NFS part of my .config and a screen dump of menuconfig.
>> <nfs.config><nfs-config.txt>
>
> I'm wondering if
>
> 2bbfed98a4d8 ("nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()")
>
> or
>
> 12357f1b2c8e ("nfsd: minor 4.1 callback cleanup")
>
> might be related to this issue (see down-thread for details and backtraces).
>
> --
> Chuck Lever

--
Chuck Lever



2020-05-10 23:58:48

by Jan Psota

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

Hi!
Today's nfsd trace in attachment. My NFS-root-ed client worked on
qemu-5.0.0 and tap interface (qemu has nothing to do with that effect
- on another server, physical machine client is connected through gigabit
ethernet). Error appeared on intensive NFS operations (Gentoo Linux on
client was installing Libre Office from binary package, moving files
from /var/tmp/portage to root filesystem).

Server /proc/version:
Linux version 5.7.0-rc2 (root@mordimer)
(gcc version 9.3.0 (Gentoo 9.3.0 p2), GNU ld (Gentoo 2.34 p1) 2.34.0)
#1 SMP PREEMPT Tue Apr 21 21:52:13 CEST 2020

server /proc/cmdline:
BOOT_IMAGE=/vmlinuz root=/dev/md0p4 panic=30 rw
md=0,/dev/sda,/dev/sdb rootdelay=1 ipv6.disable=1

Client side errors were only some:
nfs: server 192.168.2.1 not responding, still trying
nfs: server 192.168.2.1 OK
but it is normal.


Attachments:
(No filename) (858.00 B)
nfsd-trace-20200511.txt (3.51 kB)
Download all attachments

2020-05-15 22:02:48

by J. Bruce Fields

[permalink] [raw]
Subject: Re: refcount underflow in nfsd41_destroy_cb

On Mon, May 11, 2020 at 01:23:48AM +0200, Jan Psota wrote:
> Hi!
> Today's nfsd trace in attachment. My NFS-root-ed client worked on
> qemu-5.0.0 and tap interface (qemu has nothing to do with that effect
> - on another server, physical machine client is connected through gigabit
> ethernet). Error appeared on intensive NFS operations (Gentoo Linux on
> client was installing Libre Office from binary package, moving files
> from /var/tmp/portage to root filesystem).
>
> Server /proc/version:
> Linux version 5.7.0-rc2 (root@mordimer)
> (gcc version 9.3.0 (Gentoo 9.3.0 p2), GNU ld (Gentoo 2.34 p1) 2.34.0)
> #1 SMP PREEMPT Tue Apr 21 21:52:13 CEST 2020
>
> server /proc/cmdline:
> BOOT_IMAGE=/vmlinuz root=/dev/md0p4 panic=30 rw
> md=0,/dev/sda,/dev/sdb rootdelay=1 ipv6.disable=1
>
> Client side errors were only some:
> nfs: server 192.168.2.1 not responding, still trying
> nfs: server 192.168.2.1 OK
> but it is normal.

> [17957.413775] ------------[ cut here ]------------
> [17957.413804] refcount_t: underflow; use-after-free.

Looks like this is in the !cb->cb_need restart case of nfsd4_cb_release,
which is ->release method of nfsd4_cb_ops.

So, an rpc task for a callback is being freed and decrementing
clp->cl_cb_inflight, and discovering that cl_cb_inflight is already less
than or equal to 0. Either we've got unbalanced increments and
decrements of cl_cb_inflight, or clp has already been freed out from
under us.

I'm looking but can't quite see yet how that can happen.

--b.


> [17957.413835] WARNING: CPU: 1 PID: 26127 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0
> [17957.413872] Modules linked in: md5 nfsd auth_rpcgss nfs_acl lockd grace sunrpc fuse vhost_net vhost vhost_iotlb xt_mac xt_socket nf_socket_ipv4 xt_MASQUERADE xt_REDIRECT xt_TPROXY nf_tproxy_ipv4 xt_comment ipt_REJECT nf_reject_ipv4 xt_mark xt_multiport nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat_tftp nf_nat_sip nf_nat_h323 nf_nat_ftp nf_conntrack_tftp nf_conntrack_sip nf_conntrack_netlink nfnetlink nf_conntrack_h323 nf_conntrack_ftp tun bridge stp llc xt_tcpudp xt_conntrack iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_filter ip_tables x_tables rc_it913x_v1 it913x_fe hid_generic usbhid hid dvb_usb_it913x dvb_usb_v2 dvb_core videobuf2_vmalloc videobuf2_memops videobuf2_common rc_core btrfs blake2b_generic xor raid6_pq libcrc32c zlib_deflate zlib_inflate snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal kvm_intel kvm ppdev irqbypass i915 crct10dif_pclmul crc32_pclmul sr_mod intel_cstate iosf_mbi
> [17957.413896] intel_rapl_perf input_leds drm_kms_helper snd_hda_intel syscopyarea snd_intel_dspcfg sysfillrect pcspkr cdrom sysimgblt i2c_i801 fb_sys_fops intel_gtt r8169 snd_hda_codec i2c_algo_bit ehci_pci cfbfillrect cfbimgblt ehci_hcd realtek lpc_ich mei_me cfbcopyarea snd_hda_core usbcore libphy sky2 fb snd_hwdep mei mfd_core usb_common fbdev thermal parport_pc parport video sch_fq_codel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer drm snd drm_panel_orientation_quirks soundcore agpgart backlight i2c_core coretemp hwmon
> [17957.414516] CPU: 1 PID: 26127 Comm: kworker/u4:2 Not tainted 5.7.0-rc2 #1
> [17957.414545] Hardware name: MSI MS-7788/H61M-P20 (G3) (MS-7788), BIOS V1.9 01/10/2013
> [17957.414587] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [17957.414612] RIP: 0010:refcount_warn_saturate+0xd8/0xe0
> [17957.414633] Code: ff 48 c7 c7 f0 7a d7 81 c6 05 d1 35 d7 00 01 e8 5b 9b d0 ff 0f 0b c3 48 c7 c7 98 7a d7 81 c6 05 bd 35 d7 00 01 e8 45 9b d0 ff <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 46
> [17957.414716] RSP: 0018:ffffc90000bafde0 EFLAGS: 00010282
> [17957.414738] RAX: 0000000000000026 RBX: 0000000000000e81 RCX: 0000000000000007
> [17957.414767] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff888216718800
> [17957.414798] RBP: ffff8881c5270c70 R08: 0000000000000338 R09: ffffc90010054024
> [17957.414828] R10: 0000000000aaaaaa R11: 0000000000000000 R12: ffff88805b0a0510
> [17957.414857] R13: ffff888011707b30 R14: 0000000000000001 R15: ffff888130c9d000
> [17957.414889] FS: 0000000000000000(0000) GS:ffff888216700000(0000) knlGS:0000000000000000
> [17957.414924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [17957.414948] CR2: 00005600ba4586e8 CR3: 000000015a210004 CR4: 00000000001626e0
> [17957.414978] Call Trace:
> [17957.414996] nfsd41_destroy_cb+0x2c/0x40 [nfsd]
> [17957.415020] rpc_free_task+0x31/0x50 [sunrpc]
> [17957.415042] __rpc_execute+0x38f/0x3a0 [sunrpc]
> [17957.415063] ? pick_next_task_fair+0x295/0x2b0
> [17957.415082] ? finish_task_switch+0x70/0x230
> [17957.415103] rpc_async_schedule+0x24/0x40 [sunrpc]
> [17957.416404] process_one_work+0x1cd/0x3c0
> [17957.417680] worker_thread+0x45/0x3c0
> [17957.418947] kthread+0x10b/0x150
> [17957.420195] ? process_one_work+0x3c0/0x3c0
> [17957.421420] ? kthread_park+0x80/0x80
> [17957.422620] ret_from_fork+0x1f/0x30
> [17957.423790] ---[ end trace d52e90aa624996f1 ]---