This started happening quite recently, after some kernel updates within
the 5.4 stable series.
What seems to trigger it is an application (singularity doing a pull)
on an (RDMA-connected) NFS client calling copy_file_range(), which then
gets stuck indefinitely.
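For reference, what singularity ends up doing boils down to roughly the
following minimal sketch (the paths are made up and the real caller is
singularity itself, so take it only as an illustration of the call that
hangs):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical paths, both on the NFS mount */
	int in = open("/mnt/nfs/src.sif", O_RDONLY);
	int out = open("/mnt/nfs/dst.sif", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;

	if (in < 0 || out < 0 || fstat(in, &st) < 0) {
		perror("setup");
		return 1;
	}

	ssize_t left = st.st_size;
	while (left > 0) {
		/* this is the call that never returns once the problem hits */
		ssize_t n = copy_file_range(in, NULL, out, NULL, left, 0);
		if (n <= 0) {
			perror("copy_file_range");
			return 1;
		}
		left -= n;
	}

	close(in);
	close(out);
	return 0;
}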
In dmesg on the server:
> [80245.802613] ------------[ cut here ]------------
> [80245.802653] refcount_t: underflow; use-after-free.
> [80245.802700] WARNING: CPU: 28 PID: 73131 at lib/refcount.c:190 refcount_sub_and_test_checked+0x55/0x60
> [80245.802749] Modules linked in: zfs(PO) zunicode(PO) zavl(PO) icp(PO) zlua(PO) zcommon(PO) znvpair(PO) spl(O)
> [80245.802810] CPU: 28 PID: 73131 Comm: kworker/u64:0 Tainted: P O 5.4.80 #1
> [80245.802854] Hardware name: Supermicro AS -1014S-WTRT/H12SSW-NT, BIOS 1.1a 05/28/2020
> [80245.802900] Workqueue: rpciod rpc_async_schedule
> [80245.802928] RIP: 0010:refcount_sub_and_test_checked+0x55/0x60
> [80245.802960] Code: e0 41 5c c3 44 89 e0 41 5c c3 44 0f b6 25 97 d4 53 01 45 84 e4 75 e4 48 c7 c7 78 5c ce a4 c6 05 84 d4 53 01 01 e8 c9 ed ab ff <0f> 0b eb d0 0f 1f 80 00 00 00 00 48 89 fe bf 01 00 00 00 eb 96 66
> [80245.803059] RSP: 0018:ffffa28cc13f3dd8 EFLAGS: 00010286
> [80245.803089] RAX: 0000000000000000 RBX: 0000000000002a81 RCX: 0000000000000006
> [80245.803127] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8e228bf11a50
> [80245.803166] RBP: ffff8e20d4812000 R08: 0000000000000000 R09: 0000000000000873
> [80245.804042] R10: ffff8e227e48e820 R11: ffffffffa4ab3e40 R12: 0000000000000000
> [80245.804953] R13: ffff8e226556c930 R14: 0000000000000001 R15: 0000000000000000
> [80245.805821] FS: 0000000000000000(0000) GS:ffff8e228bf00000(0000) knlGS:0000000000000000
> [80245.806695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [80245.807540] CR2: 00007fed60557490 CR3: 0000003e45eae000 CR4: 00000000003406e0
> [80245.808430] Call Trace:
> [80245.809284] nfs4_put_copy+0x15/0x30
> [80245.810157] rpc_free_task+0x39/0x70
> [80245.811026] __rpc_execute+0x354/0x370
> [80245.811918] rpc_async_schedule+0x29/0x40
> [80245.812789] process_one_work+0x1df/0x3a0
> [80245.813655] worker_thread+0x4a/0x3c0
> [80245.814551] kthread+0xfb/0x130
> [80245.815448] ? process_one_work+0x3a0/0x3a0
> [80245.816332] ? kthread_park+0x90/0x90
> [80245.817229] ret_from_fork+0x22/0x40
> [80245.818138] ---[ end trace 41f9b2bba9f6f614 ]---
I don't know for sure whether changes in the 5.4 kernel triggered this,
or changes in singularity, which now makes use of copy_file_range.
But restarting the NFS server and then doing a "singularity pull"
reliably triggers that dmesg message.
I found commit 49a361327332c9221438397059067f9b205f690d, but it seems
related to the inter-SSC code, which isn't even present in 5.4.
Any insights/help/fixes to backport would be highly appreciated.
Timo
On 30.11.2020 13:53, Timo Rothenpieler wrote:
> I don't know for sure whether changes in the 5.4 kernel triggered this,
> or changes in singularity, which now makes use of copy_file_range.
> But restarting the NFS server and then doing a "singularity pull"
> reliably triggers that dmesg message.
>
> I found commit 49a361327332c9221438397059067f9b205f690d, but it seems
> related to the inter-SSC code, which isn't even present in 5.4.
Quick update: I applied that commit on top of the 5.4.80 kernel and have
not been able to reproduce the issue in a quick test.
I will leave it running like this for a bit and see if it's fixed.
I don't know if this is in any way correct, though.
On 30.11.2020 14:12, Timo Rothenpieler wrote:
> Quick update: I applied that commit on top of the 5.4.80 kernel and have
> not been able to reproduce the issue in a quick test.
> I will leave it running like this for a bit and see if it's fixed.
>
> I don't know if this is in any way correct, though.
Partial success, I guess.
The use-after-free did not occur again, but users still get their
singularity pulls stuck, waiting on copy_file_range() to return.
It must have just been a coincidence that the two always happened at the
same time.
Everything works fine on a fresh mount, until at some random point all
attempts start getting stuck.
Remounting, rebooting the client or rebooting the server all seem to
resolve it temporarily, until it happens again.