2021-01-03 15:38:39

by Nix

Subject: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

So I bought a game or two on Steam a fortnight ago to while away a
Christmas spent stuck indoors with no family, decorations, or presents
(stuck at my sister's and she got a perfectly-timed case of covid-19).
It turned out this was a very bad idea!

I keep my Steam directory on an ext4 fs on the local fileserver and
access it via NFSv4.2 over 10GbE: it's actually faster than local
spinning-rust storage would be (five-spindle RAID-6 and bcache and
128GiB RAM on the server for caching probably have something to do with
that!) though likely slower than the NVMe the local machine actually
has: but it doesn't have enough local storage, so I keep all the Steam
stuff remotely.

(I keep my $HOME on the same fileserver, though that's on xfs: I'd use
xfs for Steam, too, but there are still a lot of games out there that
fail on filesystems with 64-bit inode numbers.)
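(If you want to check whether a given filesystem hands out inode numbers that overflow 32 bits, the thing those games choke on, it's easy enough to stat a file on it. A rough sketch; the /tmp path here is just a placeholder, substitute a path on the mount you actually care about:)

```shell
# Create a scratch file on the filesystem under test and check whether
# its inode number fits in 32 bits; 32-bit games typically break when
# it does not.
f=$(mktemp /tmp/inode-check.XXXXXX)   # substitute a path on the fs in question
ino=$(stat -c '%i' "$f")
rm -f "$f"
if [ "$ino" -gt 4294967295 ]; then
    echo "inode $ino does not fit in 32 bits"
else
    echo "inode $ino fits in 32 bits"
fi
```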

Unfortunately, what I observe now when kicking up Steam is a spasm of
downloading I/O for the game I just bought and then a complete
client-side hang when it starts installing: just about all I can do is
hard-reboot. This of course stops me using Steam for anything, which
didn't make me very happy :P

The server hums onwards as though nothing went wrong, but the client is
in a bad way indeed: not only has NFS hung but there are clearly some
important locks taken out since even an ls on a non-NFS dir (with no NFS
dirs in the PATH) hangs, as does a "while :; do ps -e; done" loop. I can
rebuild with lockdep and do a lock dump if that would be useful. (The
kernel is not quite stock: it has DTrace built in, but the hangs happen
even if the kernel module has never been loaded, and I'm pretty sure the
hangs can't be associated with the not-very-substantial built-in DTrace
changes. I can try rebuilding without that as well, but I'll have to rip
chunks out of my monitoring machinery to do that so I've avoided doing
it so far.)

I finally got round to doing a bit of debugging today. NFS debugging
dumps covering the last fifteen seconds or so before the hang and a
couple of minutes afterwards (taken on the client, logged without
incident to the server's syslog, as proof that not *everything* is
stalled) are a few megabytes long so are at
<http://www.esperi.org.uk/~nix/temporary/nfs-steam-hang.log>. Everything
seems to go wrong after an especially large bunch of NFS commits.
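(For the record, the dumps were produced with the usual sunrpc debug switches; from memory, roughly this, as root, using rpcdebug from nfs-utils:)

```shell
# Enable client-side NFS and SUNRPC debugging (very noisy: everything
# lands in the kernel log / syslog). Run as root.
rpcdebug -m nfs -s all
rpcdebug -m rpc -s all

# ... reproduce the hang ...

# Turn it off again afterwards (assuming the box survives):
rpcdebug -m nfs -c all
rpcdebug -m rpc -c all
```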

Both client (silk, 192.168.16.21) and server (loom, 192.168.16.8) are
running 5.10.4 with the same NFS config, though the client has no NFS
server exports at the time and the server isn't acting as an NFS client
at all:

CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_PNFS_FILE_LAYOUT=y
CONFIG_PNFS_BLOCK=y
CONFIG_PNFS_FLEXFILE_LAYOUT=m
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
# CONFIG_NFS_V4_1_MIGRATION is not set
# CONFIG_ROOT_NFS is not set
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
# CONFIG_NFS_V4_2_READ_PLUS is not set
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
# CONFIG_NFSD_BLOCKLAYOUT is not set
# CONFIG_NFSD_SCSILAYOUT is not set
# CONFIG_NFSD_FLEXFILELAYOUT is not set
CONFIG_NFSD_V4_2_INTER_SSC=y
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_SUNRPC_DEBUG=y

The server (loom) has a large number of exports: many of the filesystems
being exported are moved around via bind mounts and the like. The steam
datadir is in /pkg/non-free/steam, which is bind-mounted onto
/home/nix/.steam on the client: /home/nix itself is bind-mounted from
/home/.loom.srvr.nix/steam, and is NFS-exported from the server under
that pathname.

Relevant exports, from /proc/fs/nfs/exports:

/ 192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,sync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=390003:390004:390005:1)
/home/.loom.srvr.nix *.srvr.nix,fold.srvr.nix(rw,root_squash,sync,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=1)
/pkg/non-free/steam 192.168.16.0/24,silk.wkstn.nix(rw,root_squash,sync,wdelay,uuid=11b94016:cb6e4b64:beb6c134:9ea3d6b3,sec=1)

Blocked tasks after the hang suggest that the client is hung waiting
indefinitely for the server to reply, and I'd guess the server doesn't
think it needs to. (I can do some server-side debugging dumps too if you
think the problem might be there.)
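(The blocked-task dump below is SysRq's show-blocked-state; for reference, roughly this, as root, with sysrq enabled:)

```shell
# Dump all tasks in uninterruptible (D) sleep to the kernel log.
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 300    # the traces below come out here
```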

[ 322.843962] sysrq: Show Blocked State
[ 322.844058] task:tee state:D stack: 0 pid: 1176 ppid: 1173 flags:0x00000000
[ 322.844064] Call Trace:
[ 322.844074] __schedule+0x3db/0x7ee
[ 322.844079] ? io_schedule+0x70/0x6d
[ 322.844082] schedule+0x68/0xdc
[ 322.844085] io_schedule+0x46/0x6d
[ 322.844089] bit_wait_io+0x11/0x52
[ 322.844092] __wait_on_bit+0x31/0x90
[ 322.844096] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844100] ? var_wake_function+0x30/0x27
[ 322.844104] nfs_wait_on_request.part.0+0x31/0x33
[ 322.844107] nfs_page_group_lock_head+0x31/0x78
[ 322.844110] nfs_lock_and_join_requests+0x9d/0x26e
[ 322.844114] nfs_updatepage+0x1be/0x985
[ 322.844118] nfs_write_end+0x128/0x52a
[ 322.844122] ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
[ 322.844127] generic_perform_write+0xed/0x17c
[ 322.844131] nfs_file_write+0x14a/0x29f
[ 322.844135] new_sync_write+0xfb/0x16b
[ 322.844139] vfs_write+0x1ee/0x29a
[ 322.844142] ? vfs_read+0x17b/0x198
[ 322.844145] ksys_write+0x61/0xd0
[ 322.844148] __x64_sys_write+0x1a/0x1c
[ 322.844151] do_syscall_64+0x32/0x45
[ 322.844154] entry_SYSCALL_64_after_hwframe+0x44/0x0
[ 322.844158] RIP: 0033:0x7f4781f323a3
[ 322.844160] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 322.844164] RAX: ffffffffffffffda RBX: 0000000000000092 RCX: 00007f4781f323a3
[ 322.844166] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI: 0000000000000003
[ 322.844168] RBP: 00007ffcf8220530 R08: 0000000000000000 R09: 0000000000000001
[ 322.844170] R10: 0000000000400ad2 R11: 0000000000000246 R12: 0000000000000092
[ 322.844172] R13: 0000000002361580 R14: 0000000000000092 R15: 00007f4782201720
[ 322.844202] task:CIPCServer::Thr state:D stack: 0 pid: 1801 ppid: 1540 flags:0x20020000
[ 322.844207] Call Trace:
[ 322.844212] __schedule+0x3db/0x7ee
[ 322.844219] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844222] schedule+0x68/0xdc
[ 322.844227] rpc_wait_bit_killable+0x11/0x52
[ 322.844230] __wait_on_bit+0x31/0x90
[ 322.844235] ? __rpc_atrun+0x20/0x1a
[ 322.844238] ? __rpc_atrun+0x20/0x1a
[ 322.844242] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844246] ? var_wake_function+0x30/0x27
[ 322.844250] __rpc_execute+0xe5/0x485
[ 322.844254] ? sugov_get_util+0xf0/0xeb
[ 322.844258] rpc_execute+0xa5/0xaa
[ 322.844263] rpc_run_task+0x148/0x187
[ 322.844267] nfs4_do_call_sync+0x61/0x81
[ 322.844272] _nfs4_proc_getattr+0x10a/0x120
[ 322.844278] nfs4_proc_getattr+0x67/0xee
[ 322.844283] __nfs_revalidate_inode+0xb7/0x2df
[ 322.844287] nfs_access_get_cached+0x14d/0x1df
[ 322.844291] nfs_do_access+0x48/0x28e
[ 322.844296] ? step_into+0x2bc/0x6a1
[ 322.844301] nfs_permission+0xa2/0x1b8
[ 322.844305] inode_permission+0x96/0xca
[ 322.844308] ? unlazy_walk+0x4c/0x80
[ 322.844312] link_path_walk.part.0+0x2a5/0x320
[ 322.844315] ? path_init+0x2c2/0x3e5
[ 322.844318] path_lookupat+0x3f/0x1b1
[ 322.844322] filename_lookup+0x97/0x17f
[ 322.844326] ? kmem_cache_alloc+0x32/0x200
[ 322.844331] user_path_at_empty+0x59/0x8a
[ 322.844333] do_faccessat+0x70/0x227
[ 322.844337] __ia32_sys_access+0x1c/0x1e
[ 322.844340] __do_fast_syscall_32+0x5f/0x8e
[ 322.844342] do_fast_syscall_32+0x3d/0x80
[ 322.844346] entry_SYSCALL_compat_after_hwframe+0x45/0x0
[ 322.844348] RIP: 0023:0xf7ee6549
[ 322.844350] RSP: 002b:00000000eb763b28 EFLAGS: 00200292 ORIG_RAX: 0000000000000021
[ 322.844353] RAX: ffffffffffffffda RBX: 000000005754d9e0 RCX: 0000000000000004
[ 322.844355] RDX: 00000000f7ab3e1c RSI: 000000005754d9e0 RDI: 00000000eb763bc8
[ 322.844357] RBP: 00000000f78a91c0 R08: 0000000000000000 R09: 0000000000000000
[ 322.844359] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 322.844360] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 322.844364] task:CFileWriterThre state:D stack: 0 pid: 1810 ppid: 1540 flags:0x20024000
[ 322.844367] Call Trace:
[ 322.844371] __schedule+0x3db/0x7ee
[ 322.844375] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844379] schedule+0x68/0xdc
[ 322.844382] rpc_wait_bit_killable+0x11/0x52
[ 322.844385] __wait_on_bit+0x31/0x90
[ 322.844389] ? call_reserveresult+0xa0/0x9f
[ 322.844391] ? call_transmit_status+0x160/0x159
[ 322.844395] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844398] ? var_wake_function+0x30/0x27
[ 322.844401] __rpc_execute+0xe5/0x485
[ 322.844404] ? nfs_check_verifier+0x5b/0xa4
[ 322.844407] rpc_execute+0xa5/0xaa
[ 322.844410] rpc_run_task+0x148/0x187
[ 322.844414] nfs4_do_call_sync+0x61/0x81
[ 322.844417] _nfs4_proc_getattr+0x10a/0x120
[ 322.844422] nfs4_proc_getattr+0x67/0xee
[ 322.844426] __nfs_revalidate_inode+0xb7/0x2df
[ 322.844430] nfs_getattr+0x159/0x374
[ 322.844433] vfs_getattr_nosec+0x96/0xa4
[ 322.844435] vfs_statx+0x7a/0xe5
[ 322.844438] vfs_fstatat+0x1f/0x21
[ 322.844443] __do_compat_sys_ia32_lstat64+0x2b/0x4e
[ 322.844448] __ia32_compat_sys_ia32_lstat64+0x14/0x16
[ 322.844450] __do_fast_syscall_32+0x5f/0x8e
[ 322.844453] do_fast_syscall_32+0x3d/0x80
[ 322.844457] entry_SYSCALL_compat_after_hwframe+0x45/0x0
[ 322.844459] RIP: 0023:0xf7ee6549
[ 322.844460] RSP: 002b:00000000eb561be8 EFLAGS: 00000292 ORIG_RAX: 00000000000000c4
[ 322.844463] RAX: ffffffffffffffda RBX: 00000000c521bf60 RCX: 00000000eb561e70
[ 322.844465] RDX: 00000000f7ab3e1c RSI: 00000000eb561e70 RDI: 00000000c521bf60
[ 322.844467] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 322.844468] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 322.844470] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 322.844477] task:Steam - DirWatc state:D stack: 0 pid: 1924 ppid: 1540 flags:0x20024000
[ 322.844480] Call Trace:
[ 322.844484] __schedule+0x3db/0x7ee
[ 322.844488] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844492] schedule+0x68/0xdc
[ 322.844495] rpc_wait_bit_killable+0x11/0x52
[ 322.844498] __wait_on_bit+0x31/0x90
[ 322.844501] ? rpc_sleep_on_timeout+0x64/0x9c
[ 322.844504] ? rpc_check_timeout+0x1e0/0x1e0
[ 322.844507] ? call_connect_status+0x180/0x17a
[ 322.844511] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844513] ? var_wake_function+0x30/0x27
[ 322.844516] __rpc_execute+0xe5/0x485
[ 322.844520] rpc_execute+0xa5/0xaa
[ 322.844523] rpc_run_task+0x148/0x187
[ 322.844526] nfs4_do_call_sync+0x61/0x81
[ 322.844530] _nfs4_proc_getattr+0x10a/0x120
[ 322.844534] nfs4_proc_getattr+0x67/0xee
[ 322.844538] __nfs_revalidate_inode+0xb7/0x2df
[ 322.844542] nfs_lookup_verify_inode+0x67/0x82
[ 322.844545] nfs_do_lookup_revalidate+0x19b/0x29c
[ 322.844548] nfs4_do_lookup_revalidate+0x5d/0xc5
[ 322.844552] nfs4_lookup_revalidate+0x6a/0x8d
[ 322.844555] lookup_fast+0xb8/0x122
[ 322.844559] path_openat+0xfd/0xfd8
[ 322.844562] ? unlazy_walk+0x4c/0x80
[ 322.844565] ? terminate_walk+0x62/0xed
[ 322.844569] ? putname+0x4b/0x53
[ 322.844572] do_filp_open+0x86/0x110
[ 322.844576] ? kmem_cache_alloc+0x32/0x200
[ 322.844579] ? __alloc_fd+0xb2/0x143
[ 322.844583] do_sys_openat2+0x8d/0x13a
[ 322.844585] ? do_faccessat+0x108/0x227
[ 322.844588] __ia32_compat_sys_openat+0x48/0x63
[ 322.844591] __do_fast_syscall_32+0x5f/0x8e
[ 322.844594] do_fast_syscall_32+0x3d/0x80
[ 322.844597] entry_SYSCALL_compat_after_hwframe+0x45/0x0
[ 322.844600] RIP: 0023:0xf7ee6549
[ 322.844601] RSP: 002b:000000008c8fbcb0 EFLAGS: 00200287 ORIG_RAX: 0000000000000127
[ 322.844604] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 000000008c8fbffc
[ 322.844606] RDX: 0000000000098800 RSI: 0000000000000000 RDI: 00000000f7ab3e1c
[ 322.844608] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 322.844609] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 322.844611] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 322.844614] task:CGenericAsyncFi state:D stack: 0 pid: 1927 ppid: 1540 flags:0x20020000
[ 322.844617] Call Trace:
[ 322.844621] __schedule+0x3db/0x7ee
[ 322.844625] schedule+0x68/0xdc
[ 322.844628] io_schedule+0x46/0x6d
[ 322.844631] wait_on_page_bit_common+0xf8/0x325
[ 322.844634] ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
[ 322.844638] wait_on_page_bit+0x3f/0x41
[ 322.844642] wait_on_page_writeback.part.0+0x35/0x80
[ 322.844645] wait_on_page_writeback+0x27/0x29
[ 322.844647] __filemap_fdatawait_range+0x88/0xd9
[ 322.844651] filemap_write_and_wait_range+0x45/0x80
[ 322.844655] nfs_wb_all+0x30/0x13d
[ 322.844658] nfs4_file_flush+0x85/0xc0
[ 322.844661] filp_close+0x3e/0x78
[ 322.844664] __close_fd+0x23/0x30
[ 322.844667] __ia32_sys_close+0x22/0x50
[ 322.844670] __do_fast_syscall_32+0x5f/0x8e
[ 322.844673] do_fast_syscall_32+0x3d/0x80
[ 322.844676] entry_SYSCALL_compat_after_hwframe+0x45/0x0
[ 322.844678] RIP: 0023:0xf7ee6549
[ 322.844680] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286 ORIG_RAX: 0000000000000006
[ 322.844682] RAX: ffffffffffffffda RBX: 000000000000004d RCX: 0000000000000002
[ 322.844684] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI: 000000000000004d
[ 322.844686] RBP: 000000008c5fe110 R08: 0000000000000000 R09: 0000000000000000
[ 322.844688] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 322.844689] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 322.844699] task:ThreadPoolForeg state:D stack: 0 pid: 1768 ppid: 1735 flags:0x00004000
[ 322.844702] Call Trace:
[ 322.844705] __schedule+0x3db/0x7ee
[ 322.844709] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844712] schedule+0x68/0xdc
[ 322.844716] rpc_wait_bit_killable+0x11/0x52
[ 322.844719] __wait_on_bit+0x31/0x90
[ 322.844722] ? rpc_sleep_on_timeout+0x64/0x9c
[ 322.844725] ? rpc_check_timeout+0x1e0/0x1e0
[ 322.844728] ? call_connect_status+0x180/0x17a
[ 322.844731] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844734] ? var_wake_function+0x30/0x27
[ 322.844737] __rpc_execute+0xe5/0x485
[ 322.844740] ? nfs_check_verifier+0x5b/0xa4
[ 322.844743] rpc_execute+0xa5/0xaa
[ 322.844746] rpc_run_task+0x148/0x187
[ 322.844750] nfs4_do_call_sync+0x61/0x81
[ 322.844753] _nfs4_proc_getattr+0x10a/0x120
[ 322.844758] nfs4_proc_getattr+0x67/0xee
[ 322.844762] __nfs_revalidate_inode+0xb7/0x2df
[ 322.844766] nfs_getattr+0x159/0x374
[ 322.844769] vfs_getattr_nosec+0x96/0xa4
[ 322.844771] vfs_statx+0x7a/0xe5
[ 322.844774] __do_sys_newstat+0x31/0x4c
[ 322.844778] __x64_sys_newstat+0x16/0x18
[ 322.844781] do_syscall_64+0x32/0x45
[ 322.844783] entry_SYSCALL_64_after_hwframe+0x44/0x0
[ 322.844786] RIP: 0033:0x7fe70ebf19e6
[ 322.844787] RSP: 002b:00007fe6c15b2068 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 322.844790] RAX: ffffffffffffffda RBX: 00000bd4193e1910 RCX: 00007fe70ebf19e6
[ 322.844792] RDX: 00007fe6c15b2078 RSI: 00007fe6c15b2078 RDI: 00000bd4193e1910
[ 322.844794] RBP: 00007fe6c15b2160 R08: 0000000000000001 R09: 00007ffd971ed090
[ 322.844796] R10: 0000000000056764 R11: 0000000000000246 R12: 00007fe6c15b2078
[ 322.844797] R13: 00000bd4192f6f60 R14: 00007fe6c15b2108 R15: 00007fe6c15b2120
[ 322.844801] task:ThreadPoolForeg state:D stack: 0 pid: 1769 ppid: 1735 flags:0x00000000
[ 322.844804] Call Trace:
[ 322.844807] __schedule+0x3db/0x7ee
[ 322.844811] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844814] schedule+0x68/0xdc
[ 322.844817] rpc_wait_bit_killable+0x11/0x52
[ 322.844820] __wait_on_bit+0x31/0x90
[ 322.844823] ? __rpc_atrun+0x20/0x1a
[ 322.844826] ? __rpc_atrun+0x20/0x1a
[ 322.844829] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844831] ? var_wake_function+0x30/0x27
[ 322.844834] __rpc_execute+0xe5/0x485
[ 322.844837] ? update_cfs_group+0x9a/0x9e
[ 322.844840] rpc_execute+0xa5/0xaa
[ 322.844842] rpc_run_task+0x148/0x187
[ 322.844845] nfs4_do_call_sync+0x61/0x81
[ 322.844849] _nfs4_proc_getattr+0x10a/0x120
[ 322.844853] nfs4_proc_getattr+0x67/0xee
[ 322.844856] __nfs_revalidate_inode+0xb7/0x2df
[ 322.844859] nfs_file_write+0xfc/0x29f
[ 322.844862] new_sync_write+0xfb/0x16b
[ 322.844866] vfs_write+0x1ee/0x29a
[ 322.844868] ? __fget_light+0x31/0x72
[ 322.844871] ksys_write+0x61/0xd0
[ 322.844874] __x64_sys_write+0x1a/0x1c
[ 322.844877] do_syscall_64+0x32/0x45
[ 322.844880] entry_SYSCALL_64_after_hwframe+0x44/0x0
[ 322.844881] RIP: 0033:0x7fe71859ab7f
[ 322.844883] RSP: 002b:00007fe6bd5b0d30 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 322.844885] RAX: ffffffffffffffda RBX: 00007fe6bd5b0d68 RCX: 00007fe71859ab7f
[ 322.844887] RDX: 0000000000000007 RSI: 00007fe6bd5b1090 RDI: 000000000000005b
[ 322.844888] RBP: 00007fe6bd5b0de0 R08: 0000000000000000 R09: 00007ffd971ed090
[ 322.844890] R10: 0000000000065fd2 R11: 0000000000000293 R12: 00000bd418d95020
[ 322.844891] R13: 00007fe6bd5b1090 R14: 0000000000000007 R15: 0000000000000000
[ 322.844900] task:ThreadPoolForeg state:D stack: 0 pid: 1807 ppid: 1773 flags:0x00000000
[ 322.844902] Call Trace:
[ 322.844904] __schedule+0x3db/0x7ee
[ 322.844907] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844908] schedule+0x68/0xdc
[ 322.844911] rpc_wait_bit_killable+0x11/0x52
[ 322.844913] __wait_on_bit+0x31/0x90
[ 322.844915] ? __rpc_atrun+0x20/0x1a
[ 322.844916] ? __rpc_atrun+0x20/0x1a
[ 322.844918] out_of_line_wait_on_bit+0x7e/0x80
[ 322.844920] ? var_wake_function+0x30/0x27
[ 322.844922] __rpc_execute+0xe5/0x485
[ 322.844924] rpc_execute+0xa5/0xaa
[ 322.844926] rpc_run_task+0x148/0x187
[ 322.844928] nfs4_do_call_sync+0x61/0x81
[ 322.844930] _nfs4_proc_getattr+0x10a/0x120
[ 322.844933] nfs4_proc_getattr+0x67/0xee
[ 322.844936] __nfs_revalidate_inode+0xb7/0x2df
[ 322.844938] nfs_access_get_cached+0x14d/0x1df
[ 322.844940] nfs_do_access+0x48/0x28e
[ 322.844942] ? nfs4_lookup_revalidate+0x38/0x8d
[ 322.844944] nfs_permission+0xa2/0x1b8
[ 322.844946] inode_permission+0x96/0xca
[ 322.844948] ? unlazy_walk+0x4c/0x80
[ 322.844950] link_path_walk.part.0+0x2a5/0x320
[ 322.844952] ? path_init+0x2c2/0x3e5
[ 322.844954] path_lookupat+0x3f/0x1b1
[ 322.844956] filename_lookup+0x97/0x17f
[ 322.844959] ? do_unlk+0x9a/0xca
[ 322.844960] ? kmem_cache_alloc+0x32/0x200
[ 322.844963] user_path_at_empty+0x59/0x8a
[ 322.844965] vfs_statx+0x64/0xe5
[ 322.844967] __do_sys_newstat+0x31/0x4c
[ 322.844968] ? do_fcntl+0x113/0x582
[ 322.844971] ? fput+0x13/0x15
[ 322.844972] ? __x64_sys_fcntl+0x81/0xa8
[ 322.844974] __x64_sys_newstat+0x16/0x18
[ 322.844976] do_syscall_64+0x32/0x45
[ 322.844977] entry_SYSCALL_64_after_hwframe+0x44/0x0
[ 322.844979] RIP: 0033:0x7f0d1fd0c9e6
[ 322.844980] RSP: 002b:00007f0d08e12e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 322.844981] RAX: ffffffffffffffda RBX: 00007f0d08e12edc RCX: 00007f0d1fd0c9e6
[ 322.844983] RDX: 00007f0d08e12e28 RSI: 00007f0d08e12e28 RDI: 00001605c2ff1169
[ 322.844984] RBP: 00007f0d08e12ec0 R08: 0000000000000001 R09: 00007fffaddd9090
[ 322.844985] R10: 00001605c3133480 R11: 0000000000000246 R12: 0000000000000000
[ 322.844986] R13: 0000000000000000 R14: 00001605c2ff0f00 R15: 0000000000000000
[ 322.844988] task:ThreadPoolForeg state:D stack: 0 pid: 2031 ppid: 1773 flags:0x00004000
[ 322.844990] Call Trace:
[ 322.844992] __schedule+0x3db/0x7ee
[ 322.844995] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.844997] schedule+0x68/0xdc
[ 322.844999] rpc_wait_bit_killable+0x11/0x52
[ 322.845001] __wait_on_bit+0x31/0x90
[ 322.845003] ? rpc_sleep_on_timeout+0x64/0x9c
[ 322.845005] ? rpc_check_timeout+0x1e0/0x1e0
[ 322.845006] ? call_connect_status+0x180/0x17a
[ 322.845009] out_of_line_wait_on_bit+0x7e/0x80
[ 322.845010] ? var_wake_function+0x30/0x27
[ 322.845012] __rpc_execute+0xe5/0x485
[ 322.845014] rpc_execute+0xa5/0xaa
[ 322.845016] rpc_run_task+0x148/0x187
[ 322.845018] nfs4_do_call_sync+0x61/0x81
[ 322.845020] _nfs4_proc_getattr+0x10a/0x120
[ 322.845023] nfs4_proc_getattr+0x67/0xee
[ 322.845026] __nfs_revalidate_inode+0xb7/0x2df
[ 322.845028] nfs_access_get_cached+0x14d/0x1df
[ 322.845030] nfs_do_access+0x48/0x28e
[ 322.845032] ? nfs4_lookup_revalidate+0x38/0x8d
[ 322.845034] nfs_permission+0xa2/0x1b8
[ 322.845036] inode_permission+0x96/0xca
[ 322.845038] ? unlazy_walk+0x4c/0x80
[ 322.845040] link_path_walk.part.0+0x2a5/0x320
[ 322.845042] ? path_init+0x2c2/0x3e5
[ 322.845044] path_lookupat+0x3f/0x1b1
[ 322.845046] filename_lookup+0x97/0x17f
[ 322.845049] ? kmem_cache_alloc+0x32/0x200
[ 322.845051] user_path_at_empty+0x59/0x8a
[ 322.845054] ? mntput_no_expire+0x4f/0x264
[ 322.845055] vfs_statx+0x64/0xe5
[ 322.845057] ? putname+0x4b/0x53
[ 322.845059] __do_sys_newstat+0x31/0x4c
[ 322.845062] ? __x64_sys_futex+0x13a/0x159
[ 322.845064] ? switch_fpu_return+0x58/0xf2
[ 322.845066] __x64_sys_newstat+0x16/0x18
[ 322.845068] do_syscall_64+0x32/0x45
[ 322.845069] entry_SYSCALL_64_after_hwframe+0x44/0x0
[ 322.845071] RIP: 0033:0x7f0d1fd0c9e6
[ 322.845072] RSP: 002b:00007f0d00331108 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[ 322.845073] RAX: ffffffffffffffda RBX: 00001605c2fdc0a0 RCX: 00007f0d1fd0c9e6
[ 322.845074] RDX: 00007f0d00331118 RSI: 00007f0d00331118 RDI: 00001605c2fdc0a0
[ 322.845076] RBP: 00007f0d00331200 R08: 0000000000000001 R09: 00007fffaddd9090
[ 322.845077] R10: 000000000005771c R11: 0000000000000246 R12: 00007f0d00331118
[ 322.845078] R13: 00001605c3309c40 R14: 00007f0d003311a8 R15: 00007f0d003311c0
[ 322.845087] task:192.168.16.8-ma state:D stack: 0 pid: 2044 ppid: 2 flags:0x00004000
[ 322.845089] Call Trace:
[ 322.845091] __schedule+0x3db/0x7ee
[ 322.845094] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[ 322.845095] schedule+0x68/0xdc
[ 322.845098] rpc_wait_bit_killable+0x11/0x52
[ 322.845100] __wait_on_bit+0x31/0x90
[ 322.845102] ? __rpc_atrun+0x20/0x1a
[ 322.845103] ? __rpc_atrun+0x20/0x1a
[ 322.845105] out_of_line_wait_on_bit+0x7e/0x80
[ 322.845107] ? var_wake_function+0x30/0x27
[ 322.845109] __rpc_execute+0xe5/0x485
[ 322.845111] rpc_execute+0xa5/0xaa
[ 322.845113] rpc_run_task+0x148/0x187
[ 322.845115] nfs4_do_call_sync+0x61/0x81
[ 322.845117] _nfs4_proc_statfs+0xc1/0xcf
[ 322.845120] nfs4_proc_statfs+0x55/0x74
[ 322.845122] nfs_statfs+0x65/0x152
[ 322.845124] ? _nfs4_proc_delegreturn+0x21e/0x2e5
[ 322.845127] statfs_by_dentry+0x55/0x74
[ 322.845129] vfs_statfs+0x24/0x43
[ 322.845132] check_free_space+0x3c/0xe5
[ 322.845135] do_acct_process+0x5d/0x5e1
[ 322.845136] ? nfs_end_delegation_return+0x1cb/0x3aa
[ 322.845138] ? complete+0x49/0x57
[ 322.845141] acct_process+0xe5/0x105
[ 322.845143] do_exit+0x781/0xa1b
[ 322.845145] ? refcount_dec_and_lock+0x22/0x8c
[ 322.845148] __module_put_and_exit+0x1b/0x1b
[ 322.845150] nfs4_run_state_manager+0x293/0x933
[ 322.845152] ? nfs4_do_reclaim+0x820/0x81c
[ 322.845155] kthread+0x142/0x15f
[ 322.845157] ? __kthread_create_worker+0x130/0x122
[ 322.845159] ret_from_fork+0x1f/0x2a
[ 369.859385] INFO: task tee:1176 blocked for more than 122 seconds.
[ 369.859391] Not tainted 5.10.4-00021-gde017294a539-dirty #3
[ 369.859393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 369.859395] task:tee state:D stack: 0 pid: 1176 ppid: 1173 flags:0x00000000
[ 369.859399] Call Trace:
[ 369.859407] __schedule+0x3db/0x7ee
[ 369.859410] ? io_schedule+0x70/0x6d
[ 369.859412] schedule+0x68/0xdc
[ 369.859415] io_schedule+0x46/0x6d
[ 369.859417] bit_wait_io+0x11/0x52
[ 369.859420] __wait_on_bit+0x31/0x90
[ 369.859422] out_of_line_wait_on_bit+0x7e/0x80
[ 369.859426] ? var_wake_function+0x30/0x27
[ 369.859429] nfs_wait_on_request.part.0+0x31/0x33
[ 369.859431] nfs_page_group_lock_head+0x31/0x78
[ 369.859433] nfs_lock_and_join_requests+0x9d/0x26e
[ 369.859436] nfs_updatepage+0x1be/0x985
[ 369.859439] nfs_write_end+0x128/0x52a
[ 369.859443] ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
[ 369.859446] generic_perform_write+0xed/0x17c
[ 369.859449] nfs_file_write+0x14a/0x29f
[ 369.859452] new_sync_write+0xfb/0x16b
[ 369.859455] vfs_write+0x1ee/0x29a
[ 369.859457] ? vfs_read+0x17b/0x198
[ 369.859459] ksys_write+0x61/0xd0
[ 369.859462] __x64_sys_write+0x1a/0x1c
[ 369.859464] do_syscall_64+0x32/0x45
[ 369.859467] entry_SYSCALL_64_after_hwframe+0x44/0x0
[ 369.859469] RIP: 0033:0x7f4781f323a3
[ 369.859471] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 369.859474] RAX: ffffffffffffffda RBX: 0000000000000092 RCX: 00007f4781f323a3
[ 369.859475] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI: 0000000000000003
[ 369.859477] RBP: 00007ffcf8220530 R08: 0000000000000000 R09: 0000000000000001
[ 369.859478] R10: 0000000000400ad2 R11: 0000000000000246 R12: 0000000000000092
[ 369.859480] R13: 0000000002361580 R14: 0000000000000092 R15: 00007f4782201720
[ 369.859495] INFO: task CGenericAsyncFi:1927 blocked for more than 122 seconds.
[ 369.859497] Not tainted 5.10.4-00021-gde017294a539-dirty #3
[ 369.859498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 369.859499] task:CGenericAsyncFi state:D stack: 0 pid: 1927 ppid: 1540 flags:0x20020000
[ 369.859501] Call Trace:
[ 369.859504] __schedule+0x3db/0x7ee
[ 369.859506] schedule+0x68/0xdc
[ 369.859508] io_schedule+0x46/0x6d
[ 369.859510] wait_on_page_bit_common+0xf8/0x325
[ 369.859512] ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
[ 369.859515] wait_on_page_bit+0x3f/0x41
[ 369.859517] wait_on_page_writeback.part.0+0x35/0x80
[ 369.859519] wait_on_page_writeback+0x27/0x29
[ 369.859521] __filemap_fdatawait_range+0x88/0xd9
[ 369.859523] filemap_write_and_wait_range+0x45/0x80
[ 369.859525] nfs_wb_all+0x30/0x13d
[ 369.859528] nfs4_file_flush+0x85/0xc0
[ 369.859530] filp_close+0x3e/0x78
[ 369.859532] __close_fd+0x23/0x30
[ 369.859534] __ia32_sys_close+0x22/0x50
[ 369.859536] __do_fast_syscall_32+0x5f/0x8e
[ 369.859538] do_fast_syscall_32+0x3d/0x80
[ 369.859540] entry_SYSCALL_compat_after_hwframe+0x45/0x0
[ 369.859541] RIP: 0023:0xf7ee6549
[ 369.859543] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286 ORIG_RAX: 0000000000000006
[ 369.859545] RAX: ffffffffffffffda RBX: 000000000000004d RCX: 0000000000000002
[ 369.859546] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI: 000000000000004d
[ 369.859547] RBP: 000000008c5fe110 R08: 0000000000000000 R09: 0000000000000000
[ 369.859548] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 369.859549] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 394.423093] nfs: server loom not responding, still trying

--
NULL && (void)


2021-02-13 16:26:24

by Nix

Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

(I can't get References: right on this mail due to the original aging
out of my mailbox: archive URL,
https://www.spinics.net/lists/linux-nfs/msg81430.html).

I now have a little lockdep info from this hang (and reports from at
least two others that they've seen similar-looking hangs dating back to
4.19, though much harder to reproduce, taking many hours rather than
five minutes: in one case they report not using NFS in production any
more because of this).

Unfortunately the lockdep info isn't much use:

Feb 13 14:13:12 silk warning: : [ 888.834464] Showing all locks held in the system:
Feb 13 14:13:12 silk warning: : [ 888.834501] 1 lock held by dmesg/1152:
Feb 13 14:13:12 silk warning: : [ 888.834508] #0: ffff980c3b7200d0 (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x49/0x2d1
Feb 13 14:13:12 silk warning: : [ 888.834540] 2 locks held by tee/1322:
Feb 13 14:13:12 silk warning: : [ 888.834546] #0: ffff980c0809a430 (sb_writers#12){.+.+}-{0:0}, at: ksys_write+0x6a/0xdc
Feb 13 14:13:12 silk warning: : [ 888.834573] #1: ffff980c3ca7b5e8 (&sb->s_type->i_mutex_key#16){++++}-{3:3}, at: nfs_start_io_write+0x1a/0x45
Feb 13 14:13:12 silk warning: : [ 888.834632] 1 lock held by 192.168.16.8-ma/2302:
Feb 13 14:13:12 silk warning: : [ 888.834638] #0: ffff980c0fe6b700 (&acct->lock#2){+.+.}-{3:3}, at: acct_process+0x102/0x2bc

The first of those is my ongoing dmesg -w. The last is process
accounting. The middle one is an ongoing, always-active Xsession-errors
tee over the same NFSv4 connection, which says nothing more than that
writes to this NFS server from this client have hung, which we already
know. There are no signs of locks held by the Steam client which has
hung in the middle of installation.

So whateverthehell this is, it's not blocked on a lock. The NFS client
is hanging all on its own. (I have no idea how clients can block in the
middle of writing if a lock is *not* involved somehow, but that is what
it looks like from the lockdep output.)
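(The lock dump above is lockdep's show-all-locks, for reference; I believe this is roughly how to get one, as root, on a kernel with lock debugging built in:)

```shell
# With lock debugging built in (CONFIG_PROVE_LOCKING /
# CONFIG_DEBUG_LOCK_ALLOC), SysRq-d dumps every lock currently held
# to the kernel log.
echo d > /proc/sysrq-trigger
dmesg | sed -n '/Showing all locks held/,$p'
```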

Does anyone know how I might start debugging this sod?

2021-02-24 00:53:51

by J. Bruce Fields

Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
> So I bought a game or two on Steam a fortnight ago to while away a
> Christmas spent stuck indoors with no family, decorations, or presents
> (stuck at my sister's and she got a perfectly-timed case of covid-19).
> It turned out this was a very bad idea!
>
> I keep my Steam directory on an ext4 fs on the local fileserver and
> access it via NFSv4.2 over 10GbE: it's actually faster than local
> spinning-rust storage would be (five-spindle RAID-6 and bcache and
> 128GiB RAM on the server for caching probably have something to do with
> that!) though likely slower than the NVMe the local machine actually
> has: but it doesn't have enough local storage, so I keep all the Steam
> stuff remotely.
>
> (I keep my $HOME on the same fileserver, though that's on xfs: I'd use
> xfs for Steam, too, but there are still a lot of games out there that
> fail on filesystems with 64-bit inode numbers.)
>
> Unfortunately, what I observe now when kicking up Steam is a spasm of
> downloading I/O for the game I just bought and then a complete
> client-side hang when it starts installing: just about all I can do is
> hard-reboot. This of course stops me using Steam for anything, which
> didn't make me very happy :P
>
> The server hums onwards as though nothing went wrong, but the client is
> in a bad way indeed: not only has NFS hung but there are clearly some
> important locks taken out since even an ls on a non-NFS dir (with no NFS
> dirs in the PATH) hangs, as does a "while :; do ps -e; done" loop. I can
> rebuild with lockdep and do a lock dump if that would be useful. (The
> kernel is not quite stock: it has DTrace built in, but the hangs happen
> even if the kernel module has never been loaded, and I'm pretty sure the
> hangs can't be associated with the not-very-substantial built-in DTrace
> changes. I can try rebuilding without that as well, but I'll have to rip
> chunks out of my monitoring machinery to do that so I've avoided doing
> it so far.)
>
> I finally got round to doing a bit of debugging today. NFS debugging
> dumps covering the last fifteen seconds or so before the hang and a
> couple of minutes afterwards (taken on the client, logged without
> incident to the server's syslog, as proof that not *everything* is
> stalled) are a few megabytes long so are at
> <http://www.esperi.org.uk/~nix/temporary/nfs-steam-hang.log>. Everything
> seems to go wrong after an especially large bunch of NFS commits.
>
> Both client (silk, 192.168.16.21) and server (loom, 192.168.16.8) are
> running 5.10.4 with the same NFS config, though the client has no NFS
> server exports at the time and the server isn't acting as an NFS client
> at all:
>
> CONFIG_NFS_FS=y
> # CONFIG_NFS_V2 is not set
> CONFIG_NFS_V3=y
> CONFIG_NFS_V3_ACL=y
> CONFIG_NFS_V4=y
> # CONFIG_NFS_SWAP is not set
> CONFIG_NFS_V4_1=y
> CONFIG_NFS_V4_2=y
> CONFIG_PNFS_FILE_LAYOUT=y
> CONFIG_PNFS_BLOCK=y
> CONFIG_PNFS_FLEXFILE_LAYOUT=m
> CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
> # CONFIG_NFS_V4_1_MIGRATION is not set
> # CONFIG_ROOT_NFS is not set
> # CONFIG_NFS_USE_LEGACY_DNS is not set
> CONFIG_NFS_USE_KERNEL_DNS=y
> CONFIG_NFS_DEBUG=y
> CONFIG_NFS_DISABLE_UDP_SUPPORT=y
> # CONFIG_NFS_V4_2_READ_PLUS is not set
> CONFIG_NFSD=y
> CONFIG_NFSD_V2_ACL=y
> CONFIG_NFSD_V3=y
> CONFIG_NFSD_V3_ACL=y
> CONFIG_NFSD_V4=y
> # CONFIG_NFSD_BLOCKLAYOUT is not set
> # CONFIG_NFSD_SCSILAYOUT is not set
> # CONFIG_NFSD_FLEXFILELAYOUT is not set
> CONFIG_NFSD_V4_2_INTER_SSC=y
> CONFIG_GRACE_PERIOD=y
> CONFIG_LOCKD=y
> CONFIG_LOCKD_V4=y
> CONFIG_NFS_ACL_SUPPORT=y
> CONFIG_NFS_COMMON=y
> CONFIG_SUNRPC=y
> CONFIG_SUNRPC_GSS=y
> CONFIG_SUNRPC_BACKCHANNEL=y
> CONFIG_SUNRPC_DEBUG=y
>
> The server (loom) has a large number of exports: many of the filesystems
> being exported are moved around via bind mounts and the like. The steam
> datadir is in /pkg/non-free/steam, which is bind-mounted onto
> /home/nix/.steam on the client: /home/nix itself is bind-mounted from
> /home/.loom.srvr.nix/steam, and is NFS-exported from the server under
> that pathname.
>
> Relevant exports, from /proc/fs/nfs/exports:
>
> / 192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,sync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=390003:390004:390005:1)
> /home/.loom.srvr.nix *.srvr.nix,fold.srvr.nix(rw,root_squash,sync,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=1)
> /pkg/non-free/steam 192.168.16.0/24,silk.wkstn.nix(rw,root_squash,sync,wdelay,uuid=11b94016:cb6e4b64:beb6c134:9ea3d6b3,sec=1)
>
> Blocked tasks after the hang suggest that the client is hung waiting
> indefinitely for the server to reply, and I'd guess the server doesn't
> think it needs to. (I can do some server-side debugging dumps too if you
> think the problem might be there.)

I seem to remember there have been some subtle changes to how dropped
replies are handled. If I remember right: a v4 server is never supposed
to drop a reply without also closing the TCP connection, but it may have
done so in the past. (I *think* all those cases are fixed?)

And the client may have been changed to be more unforgiving about that
requirement in recent years.

But I don't know if that's where the issue is.

I'm not sure how best to go about debugging. I suppose you could try to
capture a network trace and work out if there's an RPC that isn't
getting a reply.
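One way to do that call/reply matching, sketched here with tcpdump and tshark (the interface name and capture filename are assumptions; the capture commands are shown as comments, and the final set-difference step is demonstrated on stand-in XID lists, not a real trace):

```shell
# Capture NFS traffic on the client while reproducing the hang
# (eth0 and nfs-hang.pcap are assumed names):
#   tcpdump -i eth0 -s 0 -w nfs-hang.pcap host 192.168.16.8 and port 2049
# Extract the XIDs of RPC calls (rpc.msgtyp == 0) and replies (== 1):
#   tshark -r nfs-hang.pcap -Y 'rpc.msgtyp == 0' -T fields -e rpc.xid | sort -u > calls.txt
#   tshark -r nfs-hang.pcap -Y 'rpc.msgtyp == 1' -T fields -e rpc.xid | sort -u > replies.txt
# Any XID present among the calls but missing from the replies never got
# an answer. The set difference, shown here on stand-in XID lists:
printf '0x0000000a\n0x0000000b\n0x0000000c\n' > calls.txt
printf '0x0000000a\n0x0000000c\n' > replies.txt
comm -23 calls.txt replies.txt    # lines only in calls.txt: the unanswered XIDs
```

With real capture files, the last command lists every RPC the server never answered, which should point straight at the request the client is stuck on.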

Maybe try turning on server debugging and looking for "svc: svc_process
dropit" messages timed with the start of the hang. I think what you
want is "rpcdebug -m rpc -s svcdsp".
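For reference, that server-side debugging workflow might look like the following (the root-only commands are shown as comments since they need the live server; the log filtering is demonstrated on a stand-in line matching the message quoted above):

```shell
# On the server (loom), as root:
#   rpcdebug -m rpc -s svcdsp     # set the svc-dispatch debug flag
#   ... reproduce the hang ...
#   rpcdebug -m rpc -c svcdsp     # clear the flag again when done
# Then search the kernel log for dropped requests:
#   dmesg | grep 'svc_process dropit'
# Demonstrated here on a stand-in log line:
echo 'svc: svc_process dropit' | grep 'svc_process dropit'
```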

Does rpc.mountd look healthy?

--b.


>
> [ 322.843962] sysrq: Show Blocked State
> [ 322.844058] task:tee state:D stack: 0 pid: 1176 ppid: 1173 flags:0x00000000
> [ 322.844064] Call Trace:
> [ 322.844074] __schedule+0x3db/0x7ee
> [ 322.844079] ? io_schedule+0x70/0x6d
> [ 322.844082] schedule+0x68/0xdc
> [ 322.844085] io_schedule+0x46/0x6d
> [ 322.844089] bit_wait_io+0x11/0x52
> [ 322.844092] __wait_on_bit+0x31/0x90
> [ 322.844096] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844100] ? var_wake_function+0x30/0x27
> [ 322.844104] nfs_wait_on_request.part.0+0x31/0x33
> [ 322.844107] nfs_page_group_lock_head+0x31/0x78
> [ 322.844110] nfs_lock_and_join_requests+0x9d/0x26e
> [ 322.844114] nfs_updatepage+0x1be/0x985
> [ 322.844118] nfs_write_end+0x128/0x52a
> [ 322.844122] ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
> [ 322.844127] generic_perform_write+0xed/0x17c
> [ 322.844131] nfs_file_write+0x14a/0x29f
> [ 322.844135] new_sync_write+0xfb/0x16b
> [ 322.844139] vfs_write+0x1ee/0x29a
> [ 322.844142] ? vfs_read+0x17b/0x198
> [ 322.844145] ksys_write+0x61/0xd0
> [ 322.844148] __x64_sys_write+0x1a/0x1c
> [ 322.844151] do_syscall_64+0x32/0x45
> [ 322.844154] entry_SYSCALL_64_after_hwframe+0x44/0x0
> [ 322.844158] RIP: 0033:0x7f4781f323a3
> [ 322.844160] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 322.844164] RAX: ffffffffffffffda RBX: 0000000000000092 RCX: 00007f4781f323a3
> [ 322.844166] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI: 0000000000000003
> [ 322.844168] RBP: 00007ffcf8220530 R08: 0000000000000000 R09: 0000000000000001
> [ 322.844170] R10: 0000000000400ad2 R11: 0000000000000246 R12: 0000000000000092
> [ 322.844172] R13: 0000000002361580 R14: 0000000000000092 R15: 00007f4782201720
> [ 322.844202] task:CIPCServer::Thr state:D stack: 0 pid: 1801 ppid: 1540 flags:0x20020000
> [ 322.844207] Call Trace:
> [ 322.844212] __schedule+0x3db/0x7ee
> [ 322.844219] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844222] schedule+0x68/0xdc
> [ 322.844227] rpc_wait_bit_killable+0x11/0x52
> [ 322.844230] __wait_on_bit+0x31/0x90
> [ 322.844235] ? __rpc_atrun+0x20/0x1a
> [ 322.844238] ? __rpc_atrun+0x20/0x1a
> [ 322.844242] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844246] ? var_wake_function+0x30/0x27
> [ 322.844250] __rpc_execute+0xe5/0x485
> [ 322.844254] ? sugov_get_util+0xf0/0xeb
> [ 322.844258] rpc_execute+0xa5/0xaa
> [ 322.844263] rpc_run_task+0x148/0x187
> [ 322.844267] nfs4_do_call_sync+0x61/0x81
> [ 322.844272] _nfs4_proc_getattr+0x10a/0x120
> [ 322.844278] nfs4_proc_getattr+0x67/0xee
> [ 322.844283] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.844287] nfs_access_get_cached+0x14d/0x1df
> [ 322.844291] nfs_do_access+0x48/0x28e
> [ 322.844296] ? step_into+0x2bc/0x6a1
> [ 322.844301] nfs_permission+0xa2/0x1b8
> [ 322.844305] inode_permission+0x96/0xca
> [ 322.844308] ? unlazy_walk+0x4c/0x80
> [ 322.844312] link_path_walk.part.0+0x2a5/0x320
> [ 322.844315] ? path_init+0x2c2/0x3e5
> [ 322.844318] path_lookupat+0x3f/0x1b1
> [ 322.844322] filename_lookup+0x97/0x17f
> [ 322.844326] ? kmem_cache_alloc+0x32/0x200
> [ 322.844331] user_path_at_empty+0x59/0x8a
> [ 322.844333] do_faccessat+0x70/0x227
> [ 322.844337] __ia32_sys_access+0x1c/0x1e
> [ 322.844340] __do_fast_syscall_32+0x5f/0x8e
> [ 322.844342] do_fast_syscall_32+0x3d/0x80
> [ 322.844346] entry_SYSCALL_compat_after_hwframe+0x45/0x0
> [ 322.844348] RIP: 0023:0xf7ee6549
> [ 322.844350] RSP: 002b:00000000eb763b28 EFLAGS: 00200292 ORIG_RAX: 0000000000000021
> [ 322.844353] RAX: ffffffffffffffda RBX: 000000005754d9e0 RCX: 0000000000000004
> [ 322.844355] RDX: 00000000f7ab3e1c RSI: 000000005754d9e0 RDI: 00000000eb763bc8
> [ 322.844357] RBP: 00000000f78a91c0 R08: 0000000000000000 R09: 0000000000000000
> [ 322.844359] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 322.844360] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 322.844364] task:CFileWriterThre state:D stack: 0 pid: 1810 ppid: 1540 flags:0x20024000
> [ 322.844367] Call Trace:
> [ 322.844371] __schedule+0x3db/0x7ee
> [ 322.844375] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844379] schedule+0x68/0xdc
> [ 322.844382] rpc_wait_bit_killable+0x11/0x52
> [ 322.844385] __wait_on_bit+0x31/0x90
> [ 322.844389] ? call_reserveresult+0xa0/0x9f
> [ 322.844391] ? call_transmit_status+0x160/0x159
> [ 322.844395] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844398] ? var_wake_function+0x30/0x27
> [ 322.844401] __rpc_execute+0xe5/0x485
> [ 322.844404] ? nfs_check_verifier+0x5b/0xa4
> [ 322.844407] rpc_execute+0xa5/0xaa
> [ 322.844410] rpc_run_task+0x148/0x187
> [ 322.844414] nfs4_do_call_sync+0x61/0x81
> [ 322.844417] _nfs4_proc_getattr+0x10a/0x120
> [ 322.844422] nfs4_proc_getattr+0x67/0xee
> [ 322.844426] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.844430] nfs_getattr+0x159/0x374
> [ 322.844433] vfs_getattr_nosec+0x96/0xa4
> [ 322.844435] vfs_statx+0x7a/0xe5
> [ 322.844438] vfs_fstatat+0x1f/0x21
> [ 322.844443] __do_compat_sys_ia32_lstat64+0x2b/0x4e
> [ 322.844448] __ia32_compat_sys_ia32_lstat64+0x14/0x16
> [ 322.844450] __do_fast_syscall_32+0x5f/0x8e
> [ 322.844453] do_fast_syscall_32+0x3d/0x80
> [ 322.844457] entry_SYSCALL_compat_after_hwframe+0x45/0x0
> [ 322.844459] RIP: 0023:0xf7ee6549
> [ 322.844460] RSP: 002b:00000000eb561be8 EFLAGS: 00000292 ORIG_RAX: 00000000000000c4
> [ 322.844463] RAX: ffffffffffffffda RBX: 00000000c521bf60 RCX: 00000000eb561e70
> [ 322.844465] RDX: 00000000f7ab3e1c RSI: 00000000eb561e70 RDI: 00000000c521bf60
> [ 322.844467] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 322.844468] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 322.844470] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 322.844477] task:Steam - DirWatc state:D stack: 0 pid: 1924 ppid: 1540 flags:0x20024000
> [ 322.844480] Call Trace:
> [ 322.844484] __schedule+0x3db/0x7ee
> [ 322.844488] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844492] schedule+0x68/0xdc
> [ 322.844495] rpc_wait_bit_killable+0x11/0x52
> [ 322.844498] __wait_on_bit+0x31/0x90
> [ 322.844501] ? rpc_sleep_on_timeout+0x64/0x9c
> [ 322.844504] ? rpc_check_timeout+0x1e0/0x1e0
> [ 322.844507] ? call_connect_status+0x180/0x17a
> [ 322.844511] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844513] ? var_wake_function+0x30/0x27
> [ 322.844516] __rpc_execute+0xe5/0x485
> [ 322.844520] rpc_execute+0xa5/0xaa
> [ 322.844523] rpc_run_task+0x148/0x187
> [ 322.844526] nfs4_do_call_sync+0x61/0x81
> [ 322.844530] _nfs4_proc_getattr+0x10a/0x120
> [ 322.844534] nfs4_proc_getattr+0x67/0xee
> [ 322.844538] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.844542] nfs_lookup_verify_inode+0x67/0x82
> [ 322.844545] nfs_do_lookup_revalidate+0x19b/0x29c
> [ 322.844548] nfs4_do_lookup_revalidate+0x5d/0xc5
> [ 322.844552] nfs4_lookup_revalidate+0x6a/0x8d
> [ 322.844555] lookup_fast+0xb8/0x122
> [ 322.844559] path_openat+0xfd/0xfd8
> [ 322.844562] ? unlazy_walk+0x4c/0x80
> [ 322.844565] ? terminate_walk+0x62/0xed
> [ 322.844569] ? putname+0x4b/0x53
> [ 322.844572] do_filp_open+0x86/0x110
> [ 322.844576] ? kmem_cache_alloc+0x32/0x200
> [ 322.844579] ? __alloc_fd+0xb2/0x143
> [ 322.844583] do_sys_openat2+0x8d/0x13a
> [ 322.844585] ? do_faccessat+0x108/0x227
> [ 322.844588] __ia32_compat_sys_openat+0x48/0x63
> [ 322.844591] __do_fast_syscall_32+0x5f/0x8e
> [ 322.844594] do_fast_syscall_32+0x3d/0x80
> [ 322.844597] entry_SYSCALL_compat_after_hwframe+0x45/0x0
> [ 322.844600] RIP: 0023:0xf7ee6549
> [ 322.844601] RSP: 002b:000000008c8fbcb0 EFLAGS: 00200287 ORIG_RAX: 0000000000000127
> [ 322.844604] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 000000008c8fbffc
> [ 322.844606] RDX: 0000000000098800 RSI: 0000000000000000 RDI: 00000000f7ab3e1c
> [ 322.844608] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 322.844609] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 322.844611] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 322.844614] task:CGenericAsyncFi state:D stack: 0 pid: 1927 ppid: 1540 flags:0x20020000
> [ 322.844617] Call Trace:
> [ 322.844621] __schedule+0x3db/0x7ee
> [ 322.844625] schedule+0x68/0xdc
> [ 322.844628] io_schedule+0x46/0x6d
> [ 322.844631] wait_on_page_bit_common+0xf8/0x325
> [ 322.844634] ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
> [ 322.844638] wait_on_page_bit+0x3f/0x41
> [ 322.844642] wait_on_page_writeback.part.0+0x35/0x80
> [ 322.844645] wait_on_page_writeback+0x27/0x29
> [ 322.844647] __filemap_fdatawait_range+0x88/0xd9
> [ 322.844651] filemap_write_and_wait_range+0x45/0x80
> [ 322.844655] nfs_wb_all+0x30/0x13d
> [ 322.844658] nfs4_file_flush+0x85/0xc0
> [ 322.844661] filp_close+0x3e/0x78
> [ 322.844664] __close_fd+0x23/0x30
> [ 322.844667] __ia32_sys_close+0x22/0x50
> [ 322.844670] __do_fast_syscall_32+0x5f/0x8e
> [ 322.844673] do_fast_syscall_32+0x3d/0x80
> [ 322.844676] entry_SYSCALL_compat_after_hwframe+0x45/0x0
> [ 322.844678] RIP: 0023:0xf7ee6549
> [ 322.844680] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286 ORIG_RAX: 0000000000000006
> [ 322.844682] RAX: ffffffffffffffda RBX: 000000000000004d RCX: 0000000000000002
> [ 322.844684] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI: 000000000000004d
> [ 322.844686] RBP: 000000008c5fe110 R08: 0000000000000000 R09: 0000000000000000
> [ 322.844688] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 322.844689] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 322.844699] task:ThreadPoolForeg state:D stack: 0 pid: 1768 ppid: 1735 flags:0x00004000
> [ 322.844702] Call Trace:
> [ 322.844705] __schedule+0x3db/0x7ee
> [ 322.844709] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844712] schedule+0x68/0xdc
> [ 322.844716] rpc_wait_bit_killable+0x11/0x52
> [ 322.844719] __wait_on_bit+0x31/0x90
> [ 322.844722] ? rpc_sleep_on_timeout+0x64/0x9c
> [ 322.844725] ? rpc_check_timeout+0x1e0/0x1e0
> [ 322.844728] ? call_connect_status+0x180/0x17a
> [ 322.844731] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844734] ? var_wake_function+0x30/0x27
> [ 322.844737] __rpc_execute+0xe5/0x485
> [ 322.844740] ? nfs_check_verifier+0x5b/0xa4
> [ 322.844743] rpc_execute+0xa5/0xaa
> [ 322.844746] rpc_run_task+0x148/0x187
> [ 322.844750] nfs4_do_call_sync+0x61/0x81
> [ 322.844753] _nfs4_proc_getattr+0x10a/0x120
> [ 322.844758] nfs4_proc_getattr+0x67/0xee
> [ 322.844762] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.844766] nfs_getattr+0x159/0x374
> [ 322.844769] vfs_getattr_nosec+0x96/0xa4
> [ 322.844771] vfs_statx+0x7a/0xe5
> [ 322.844774] __do_sys_newstat+0x31/0x4c
> [ 322.844778] __x64_sys_newstat+0x16/0x18
> [ 322.844781] do_syscall_64+0x32/0x45
> [ 322.844783] entry_SYSCALL_64_after_hwframe+0x44/0x0
> [ 322.844786] RIP: 0033:0x7fe70ebf19e6
> [ 322.844787] RSP: 002b:00007fe6c15b2068 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> [ 322.844790] RAX: ffffffffffffffda RBX: 00000bd4193e1910 RCX: 00007fe70ebf19e6
> [ 322.844792] RDX: 00007fe6c15b2078 RSI: 00007fe6c15b2078 RDI: 00000bd4193e1910
> [ 322.844794] RBP: 00007fe6c15b2160 R08: 0000000000000001 R09: 00007ffd971ed090
> [ 322.844796] R10: 0000000000056764 R11: 0000000000000246 R12: 00007fe6c15b2078
> [ 322.844797] R13: 00000bd4192f6f60 R14: 00007fe6c15b2108 R15: 00007fe6c15b2120
> [ 322.844801] task:ThreadPoolForeg state:D stack: 0 pid: 1769 ppid: 1735 flags:0x00000000
> [ 322.844804] Call Trace:
> [ 322.844807] __schedule+0x3db/0x7ee
> [ 322.844811] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844814] schedule+0x68/0xdc
> [ 322.844817] rpc_wait_bit_killable+0x11/0x52
> [ 322.844820] __wait_on_bit+0x31/0x90
> [ 322.844823] ? __rpc_atrun+0x20/0x1a
> [ 322.844826] ? __rpc_atrun+0x20/0x1a
> [ 322.844829] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844831] ? var_wake_function+0x30/0x27
> [ 322.844834] __rpc_execute+0xe5/0x485
> [ 322.844837] ? update_cfs_group+0x9a/0x9e
> [ 322.844840] rpc_execute+0xa5/0xaa
> [ 322.844842] rpc_run_task+0x148/0x187
> [ 322.844845] nfs4_do_call_sync+0x61/0x81
> [ 322.844849] _nfs4_proc_getattr+0x10a/0x120
> [ 322.844853] nfs4_proc_getattr+0x67/0xee
> [ 322.844856] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.844859] nfs_file_write+0xfc/0x29f
> [ 322.844862] new_sync_write+0xfb/0x16b
> [ 322.844866] vfs_write+0x1ee/0x29a
> [ 322.844868] ? __fget_light+0x31/0x72
> [ 322.844871] ksys_write+0x61/0xd0
> [ 322.844874] __x64_sys_write+0x1a/0x1c
> [ 322.844877] do_syscall_64+0x32/0x45
> [ 322.844880] entry_SYSCALL_64_after_hwframe+0x44/0x0
> [ 322.844881] RIP: 0033:0x7fe71859ab7f
> [ 322.844883] RSP: 002b:00007fe6bd5b0d30 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [ 322.844885] RAX: ffffffffffffffda RBX: 00007fe6bd5b0d68 RCX: 00007fe71859ab7f
> [ 322.844887] RDX: 0000000000000007 RSI: 00007fe6bd5b1090 RDI: 000000000000005b
> [ 322.844888] RBP: 00007fe6bd5b0de0 R08: 0000000000000000 R09: 00007ffd971ed090
> [ 322.844890] R10: 0000000000065fd2 R11: 0000000000000293 R12: 00000bd418d95020
> [ 322.844891] R13: 00007fe6bd5b1090 R14: 0000000000000007 R15: 0000000000000000
> [ 322.844900] task:ThreadPoolForeg state:D stack: 0 pid: 1807 ppid: 1773 flags:0x00000000
> [ 322.844902] Call Trace:
> [ 322.844904] __schedule+0x3db/0x7ee
> [ 322.844907] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844908] schedule+0x68/0xdc
> [ 322.844911] rpc_wait_bit_killable+0x11/0x52
> [ 322.844913] __wait_on_bit+0x31/0x90
> [ 322.844915] ? __rpc_atrun+0x20/0x1a
> [ 322.844916] ? __rpc_atrun+0x20/0x1a
> [ 322.844918] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.844920] ? var_wake_function+0x30/0x27
> [ 322.844922] __rpc_execute+0xe5/0x485
> [ 322.844924] rpc_execute+0xa5/0xaa
> [ 322.844926] rpc_run_task+0x148/0x187
> [ 322.844928] nfs4_do_call_sync+0x61/0x81
> [ 322.844930] _nfs4_proc_getattr+0x10a/0x120
> [ 322.844933] nfs4_proc_getattr+0x67/0xee
> [ 322.844936] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.844938] nfs_access_get_cached+0x14d/0x1df
> [ 322.844940] nfs_do_access+0x48/0x28e
> [ 322.844942] ? nfs4_lookup_revalidate+0x38/0x8d
> [ 322.844944] nfs_permission+0xa2/0x1b8
> [ 322.844946] inode_permission+0x96/0xca
> [ 322.844948] ? unlazy_walk+0x4c/0x80
> [ 322.844950] link_path_walk.part.0+0x2a5/0x320
> [ 322.844952] ? path_init+0x2c2/0x3e5
> [ 322.844954] path_lookupat+0x3f/0x1b1
> [ 322.844956] filename_lookup+0x97/0x17f
> [ 322.844959] ? do_unlk+0x9a/0xca
> [ 322.844960] ? kmem_cache_alloc+0x32/0x200
> [ 322.844963] user_path_at_empty+0x59/0x8a
> [ 322.844965] vfs_statx+0x64/0xe5
> [ 322.844967] __do_sys_newstat+0x31/0x4c
> [ 322.844968] ? do_fcntl+0x113/0x582
> [ 322.844971] ? fput+0x13/0x15
> [ 322.844972] ? __x64_sys_fcntl+0x81/0xa8
> [ 322.844974] __x64_sys_newstat+0x16/0x18
> [ 322.844976] do_syscall_64+0x32/0x45
> [ 322.844977] entry_SYSCALL_64_after_hwframe+0x44/0x0
> [ 322.844979] RIP: 0033:0x7f0d1fd0c9e6
> [ 322.844980] RSP: 002b:00007f0d08e12e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> [ 322.844981] RAX: ffffffffffffffda RBX: 00007f0d08e12edc RCX: 00007f0d1fd0c9e6
> [ 322.844983] RDX: 00007f0d08e12e28 RSI: 00007f0d08e12e28 RDI: 00001605c2ff1169
> [ 322.844984] RBP: 00007f0d08e12ec0 R08: 0000000000000001 R09: 00007fffaddd9090
> [ 322.844985] R10: 00001605c3133480 R11: 0000000000000246 R12: 0000000000000000
> [ 322.844986] R13: 0000000000000000 R14: 00001605c2ff0f00 R15: 0000000000000000
> [ 322.844988] task:ThreadPoolForeg state:D stack: 0 pid: 2031 ppid: 1773 flags:0x00004000
> [ 322.844990] Call Trace:
> [ 322.844992] __schedule+0x3db/0x7ee
> [ 322.844995] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.844997] schedule+0x68/0xdc
> [ 322.844999] rpc_wait_bit_killable+0x11/0x52
> [ 322.845001] __wait_on_bit+0x31/0x90
> [ 322.845003] ? rpc_sleep_on_timeout+0x64/0x9c
> [ 322.845005] ? rpc_check_timeout+0x1e0/0x1e0
> [ 322.845006] ? call_connect_status+0x180/0x17a
> [ 322.845009] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.845010] ? var_wake_function+0x30/0x27
> [ 322.845012] __rpc_execute+0xe5/0x485
> [ 322.845014] rpc_execute+0xa5/0xaa
> [ 322.845016] rpc_run_task+0x148/0x187
> [ 322.845018] nfs4_do_call_sync+0x61/0x81
> [ 322.845020] _nfs4_proc_getattr+0x10a/0x120
> [ 322.845023] nfs4_proc_getattr+0x67/0xee
> [ 322.845026] __nfs_revalidate_inode+0xb7/0x2df
> [ 322.845028] nfs_access_get_cached+0x14d/0x1df
> [ 322.845030] nfs_do_access+0x48/0x28e
> [ 322.845032] ? nfs4_lookup_revalidate+0x38/0x8d
> [ 322.845034] nfs_permission+0xa2/0x1b8
> [ 322.845036] inode_permission+0x96/0xca
> [ 322.845038] ? unlazy_walk+0x4c/0x80
> [ 322.845040] link_path_walk.part.0+0x2a5/0x320
> [ 322.845042] ? path_init+0x2c2/0x3e5
> [ 322.845044] path_lookupat+0x3f/0x1b1
> [ 322.845046] filename_lookup+0x97/0x17f
> [ 322.845049] ? kmem_cache_alloc+0x32/0x200
> [ 322.845051] user_path_at_empty+0x59/0x8a
> [ 322.845054] ? mntput_no_expire+0x4f/0x264
> [ 322.845055] vfs_statx+0x64/0xe5
> [ 322.845057] ? putname+0x4b/0x53
> [ 322.845059] __do_sys_newstat+0x31/0x4c
> [ 322.845062] ? __x64_sys_futex+0x13a/0x159
> [ 322.845064] ? switch_fpu_return+0x58/0xf2
> [ 322.845066] __x64_sys_newstat+0x16/0x18
> [ 322.845068] do_syscall_64+0x32/0x45
> [ 322.845069] entry_SYSCALL_64_after_hwframe+0x44/0x0
> [ 322.845071] RIP: 0033:0x7f0d1fd0c9e6
> [ 322.845072] RSP: 002b:00007f0d00331108 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
> [ 322.845073] RAX: ffffffffffffffda RBX: 00001605c2fdc0a0 RCX: 00007f0d1fd0c9e6
> [ 322.845074] RDX: 00007f0d00331118 RSI: 00007f0d00331118 RDI: 00001605c2fdc0a0
> [ 322.845076] RBP: 00007f0d00331200 R08: 0000000000000001 R09: 00007fffaddd9090
> [ 322.845077] R10: 000000000005771c R11: 0000000000000246 R12: 00007f0d00331118
> [ 322.845078] R13: 00001605c3309c40 R14: 00007f0d003311a8 R15: 00007f0d003311c0
> [ 322.845087] task:192.168.16.8-ma state:D stack: 0 pid: 2044 ppid: 2 flags:0x00004000
> [ 322.845089] Call Trace:
> [ 322.845091] __schedule+0x3db/0x7ee
> [ 322.845094] ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> [ 322.845095] schedule+0x68/0xdc
> [ 322.845098] rpc_wait_bit_killable+0x11/0x52
> [ 322.845100] __wait_on_bit+0x31/0x90
> [ 322.845102] ? __rpc_atrun+0x20/0x1a
> [ 322.845103] ? __rpc_atrun+0x20/0x1a
> [ 322.845105] out_of_line_wait_on_bit+0x7e/0x80
> [ 322.845107] ? var_wake_function+0x30/0x27
> [ 322.845109] __rpc_execute+0xe5/0x485
> [ 322.845111] rpc_execute+0xa5/0xaa
> [ 322.845113] rpc_run_task+0x148/0x187
> [ 322.845115] nfs4_do_call_sync+0x61/0x81
> [ 322.845117] _nfs4_proc_statfs+0xc1/0xcf
> [ 322.845120] nfs4_proc_statfs+0x55/0x74
> [ 322.845122] nfs_statfs+0x65/0x152
> [ 322.845124] ? _nfs4_proc_delegreturn+0x21e/0x2e5
> [ 322.845127] statfs_by_dentry+0x55/0x74
> [ 322.845129] vfs_statfs+0x24/0x43
> [ 322.845132] check_free_space+0x3c/0xe5
> [ 322.845135] do_acct_process+0x5d/0x5e1
> [ 322.845136] ? nfs_end_delegation_return+0x1cb/0x3aa
> [ 322.845138] ? complete+0x49/0x57
> [ 322.845141] acct_process+0xe5/0x105
> [ 322.845143] do_exit+0x781/0xa1b
> [ 322.845145] ? refcount_dec_and_lock+0x22/0x8c
> [ 322.845148] __module_put_and_exit+0x1b/0x1b
> [ 322.845150] nfs4_run_state_manager+0x293/0x933
> [ 322.845152] ? nfs4_do_reclaim+0x820/0x81c
> [ 322.845155] kthread+0x142/0x15f
> [ 322.845157] ? __kthread_create_worker+0x130/0x122
> [ 322.845159] ret_from_fork+0x1f/0x2a
> [ 369.859385] INFO: task tee:1176 blocked for more than 122 seconds.
> [ 369.859391] Not tainted 5.10.4-00021-gde017294a539-dirty #3
> [ 369.859393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 369.859395] task:tee state:D stack: 0 pid: 1176 ppid: 1173 flags:0x00000000
> [ 369.859399] Call Trace:
> [ 369.859407] __schedule+0x3db/0x7ee
> [ 369.859410] ? io_schedule+0x70/0x6d
> [ 369.859412] schedule+0x68/0xdc
> [ 369.859415] io_schedule+0x46/0x6d
> [ 369.859417] bit_wait_io+0x11/0x52
> [ 369.859420] __wait_on_bit+0x31/0x90
> [ 369.859422] out_of_line_wait_on_bit+0x7e/0x80
> [ 369.859426] ? var_wake_function+0x30/0x27
> [ 369.859429] nfs_wait_on_request.part.0+0x31/0x33
> [ 369.859431] nfs_page_group_lock_head+0x31/0x78
> [ 369.859433] nfs_lock_and_join_requests+0x9d/0x26e
> [ 369.859436] nfs_updatepage+0x1be/0x985
> [ 369.859439] nfs_write_end+0x128/0x52a
> [ 369.859443] ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
> [ 369.859446] generic_perform_write+0xed/0x17c
> [ 369.859449] nfs_file_write+0x14a/0x29f
> [ 369.859452] new_sync_write+0xfb/0x16b
> [ 369.859455] vfs_write+0x1ee/0x29a
> [ 369.859457] ? vfs_read+0x17b/0x198
> [ 369.859459] ksys_write+0x61/0xd0
> [ 369.859462] __x64_sys_write+0x1a/0x1c
> [ 369.859464] do_syscall_64+0x32/0x45
> [ 369.859467] entry_SYSCALL_64_after_hwframe+0x44/0x0
> [ 369.859469] RIP: 0033:0x7f4781f323a3
> [ 369.859471] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 369.859474] RAX: ffffffffffffffda RBX: 0000000000000092 RCX: 00007f4781f323a3
> [ 369.859475] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI: 0000000000000003
> [ 369.859477] RBP: 00007ffcf8220530 R08: 0000000000000000 R09: 0000000000000001
> [ 369.859478] R10: 0000000000400ad2 R11: 0000000000000246 R12: 0000000000000092
> [ 369.859480] R13: 0000000002361580 R14: 0000000000000092 R15: 00007f4782201720
> [ 369.859495] INFO: task CGenericAsyncFi:1927 blocked for more than 122 seconds.
> [ 369.859497] Not tainted 5.10.4-00021-gde017294a539-dirty #3
> [ 369.859498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 369.859499] task:CGenericAsyncFi state:D stack: 0 pid: 1927 ppid: 1540 flags:0x20020000
> [ 369.859501] Call Trace:
> [ 369.859504] __schedule+0x3db/0x7ee
> [ 369.859506] schedule+0x68/0xdc
> [ 369.859508] io_schedule+0x46/0x6d
> [ 369.859510] wait_on_page_bit_common+0xf8/0x325
> [ 369.859512] ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
> [ 369.859515] wait_on_page_bit+0x3f/0x41
> [ 369.859517] wait_on_page_writeback.part.0+0x35/0x80
> [ 369.859519] wait_on_page_writeback+0x27/0x29
> [ 369.859521] __filemap_fdatawait_range+0x88/0xd9
> [ 369.859523] filemap_write_and_wait_range+0x45/0x80
> [ 369.859525] nfs_wb_all+0x30/0x13d
> [ 369.859528] nfs4_file_flush+0x85/0xc0
> [ 369.859530] filp_close+0x3e/0x78
> [ 369.859532] __close_fd+0x23/0x30
> [ 369.859534] __ia32_sys_close+0x22/0x50
> [ 369.859536] __do_fast_syscall_32+0x5f/0x8e
> [ 369.859538] do_fast_syscall_32+0x3d/0x80
> [ 369.859540] entry_SYSCALL_compat_after_hwframe+0x45/0x0
> [ 369.859541] RIP: 0023:0xf7ee6549
> [ 369.859543] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286 ORIG_RAX: 0000000000000006
> [ 369.859545] RAX: ffffffffffffffda RBX: 000000000000004d RCX: 0000000000000002
> [ 369.859546] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI: 000000000000004d
> [ 369.859547] RBP: 000000008c5fe110 R08: 0000000000000000 R09: 0000000000000000
> [ 369.859548] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 369.859549] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 394.423093] nfs: server loom not responding, still trying
>
> --
> NULL && (void)

2021-02-24 01:10:09

by Trond Myklebust

Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
> On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
> > [...]
> > Relevant exports, from /proc/fs/nfs/exports:
> >
> > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,sync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=390003:390004:390005:1)
> > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,sync,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=1)

Isn't that trying to export the same filesystem as '/' on the line
above using conflicting export options?


> > /pkg/non-free/steam     192.168.16.0/24,silk.wkstn.nix(rw,root_squash,sync,wdelay,uuid=11b94016:cb6e4b64:beb6c134:9ea3d6b3,sec=1)
> >
> [...]
> > [  322.844227]  rpc_wait_bit_killable+0x11/0x52
> > [  322.844230]  __wait_on_bit+0x31/0x90
> > [  322.844235]  ? __rpc_atrun+0x20/0x1a
> > [  322.844238]  ? __rpc_atrun+0x20/0x1a
> > [  322.844242]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.844246]  ? var_wake_function+0x30/0x27
> > [  322.844250]  __rpc_execute+0xe5/0x485
> > [  322.844254]  ? sugov_get_util+0xf0/0xeb
> > [  322.844258]  rpc_execute+0xa5/0xaa
> > [  322.844263]  rpc_run_task+0x148/0x187
> > [  322.844267]  nfs4_do_call_sync+0x61/0x81
> > [  322.844272]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.844278]  nfs4_proc_getattr+0x67/0xee
> > [  322.844283]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.844287]  nfs_access_get_cached+0x14d/0x1df
> > [  322.844291]  nfs_do_access+0x48/0x28e
> > [  322.844296]  ? step_into+0x2bc/0x6a1
> > [  322.844301]  nfs_permission+0xa2/0x1b8
> > [  322.844305]  inode_permission+0x96/0xca
> > [  322.844308]  ? unlazy_walk+0x4c/0x80
> > [  322.844312]  link_path_walk.part.0+0x2a5/0x320
> > [  322.844315]  ? path_init+0x2c2/0x3e5
> > [  322.844318]  path_lookupat+0x3f/0x1b1
> > [  322.844322]  filename_lookup+0x97/0x17f
> > [  322.844326]  ? kmem_cache_alloc+0x32/0x200
> > [  322.844331]  user_path_at_empty+0x59/0x8a
> > [  322.844333]  do_faccessat+0x70/0x227
> > [  322.844337]  __ia32_sys_access+0x1c/0x1e
> > [  322.844340]  __do_fast_syscall_32+0x5f/0x8e
> > [  322.844342]  do_fast_syscall_32+0x3d/0x80
> > [  322.844346]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > [  322.844348] RIP: 0023:0xf7ee6549
> > [  322.844350] RSP: 002b:00000000eb763b28 EFLAGS: 00200292
> > ORIG_RAX: 0000000000000021
> > [  322.844353] RAX: ffffffffffffffda RBX: 000000005754d9e0 RCX:
> > 0000000000000004
> > [  322.844355] RDX: 00000000f7ab3e1c RSI: 000000005754d9e0 RDI:
> > 00000000eb763bc8
> > [  322.844357] RBP: 00000000f78a91c0 R08: 0000000000000000 R09:
> > 0000000000000000
> > [  322.844359] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000000
> > [  322.844360] R13: 0000000000000000 R14: 0000000000000000 R15:
> > 0000000000000000
> > [  322.844364] task:CFileWriterThre state:D stack:    0 pid: 1810
> > ppid:  1540 flags:0x20024000
> > [  322.844367] Call Trace:
> > [  322.844371]  __schedule+0x3db/0x7ee
> > [  322.844375]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.844379]  schedule+0x68/0xdc
> > [  322.844382]  rpc_wait_bit_killable+0x11/0x52
> > [  322.844385]  __wait_on_bit+0x31/0x90
> > [  322.844389]  ? call_reserveresult+0xa0/0x9f
> > [  322.844391]  ? call_transmit_status+0x160/0x159
> > [  322.844395]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.844398]  ? var_wake_function+0x30/0x27
> > [  322.844401]  __rpc_execute+0xe5/0x485
> > [  322.844404]  ? nfs_check_verifier+0x5b/0xa4
> > [  322.844407]  rpc_execute+0xa5/0xaa
> > [  322.844410]  rpc_run_task+0x148/0x187
> > [  322.844414]  nfs4_do_call_sync+0x61/0x81
> > [  322.844417]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.844422]  nfs4_proc_getattr+0x67/0xee
> > [  322.844426]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.844430]  nfs_getattr+0x159/0x374
> > [  322.844433]  vfs_getattr_nosec+0x96/0xa4
> > [  322.844435]  vfs_statx+0x7a/0xe5
> > [  322.844438]  vfs_fstatat+0x1f/0x21
> > [  322.844443]  __do_compat_sys_ia32_lstat64+0x2b/0x4e
> > [  322.844448]  __ia32_compat_sys_ia32_lstat64+0x14/0x16
> > [  322.844450]  __do_fast_syscall_32+0x5f/0x8e
> > [  322.844453]  do_fast_syscall_32+0x3d/0x80
> > [  322.844457]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > [  322.844459] RIP: 0023:0xf7ee6549
> > [  322.844460] RSP: 002b:00000000eb561be8 EFLAGS: 00000292
> > ORIG_RAX: 00000000000000c4
> > [  322.844463] RAX: ffffffffffffffda RBX: 00000000c521bf60 RCX:
> > 00000000eb561e70
> > [  322.844465] RDX: 00000000f7ab3e1c RSI: 00000000eb561e70 RDI:
> > 00000000c521bf60
> > [  322.844467] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > 0000000000000000
> > [  322.844468] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000000
> > [  322.844470] R13: 0000000000000000 R14: 0000000000000000 R15:
> > 0000000000000000
> > [  322.844477] task:Steam - DirWatc state:D stack:    0 pid: 1924
> > ppid:  1540 flags:0x20024000
> > [  322.844480] Call Trace:
> > [  322.844484]  __schedule+0x3db/0x7ee
> > [  322.844488]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.844492]  schedule+0x68/0xdc
> > [  322.844495]  rpc_wait_bit_killable+0x11/0x52
> > [  322.844498]  __wait_on_bit+0x31/0x90
> > [  322.844501]  ? rpc_sleep_on_timeout+0x64/0x9c
> > [  322.844504]  ? rpc_check_timeout+0x1e0/0x1e0
> > [  322.844507]  ? call_connect_status+0x180/0x17a
> > [  322.844511]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.844513]  ? var_wake_function+0x30/0x27
> > [  322.844516]  __rpc_execute+0xe5/0x485
> > [  322.844520]  rpc_execute+0xa5/0xaa
> > [  322.844523]  rpc_run_task+0x148/0x187
> > [  322.844526]  nfs4_do_call_sync+0x61/0x81
> > [  322.844530]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.844534]  nfs4_proc_getattr+0x67/0xee
> > [  322.844538]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.844542]  nfs_lookup_verify_inode+0x67/0x82
> > [  322.844545]  nfs_do_lookup_revalidate+0x19b/0x29c
> > [  322.844548]  nfs4_do_lookup_revalidate+0x5d/0xc5
> > [  322.844552]  nfs4_lookup_revalidate+0x6a/0x8d
> > [  322.844555]  lookup_fast+0xb8/0x122
> > [  322.844559]  path_openat+0xfd/0xfd8
> > [  322.844562]  ? unlazy_walk+0x4c/0x80
> > [  322.844565]  ? terminate_walk+0x62/0xed
> > [  322.844569]  ? putname+0x4b/0x53
> > [  322.844572]  do_filp_open+0x86/0x110
> > [  322.844576]  ? kmem_cache_alloc+0x32/0x200
> > [  322.844579]  ? __alloc_fd+0xb2/0x143
> > [  322.844583]  do_sys_openat2+0x8d/0x13a
> > [  322.844585]  ? do_faccessat+0x108/0x227
> > [  322.844588]  __ia32_compat_sys_openat+0x48/0x63
> > [  322.844591]  __do_fast_syscall_32+0x5f/0x8e
> > [  322.844594]  do_fast_syscall_32+0x3d/0x80
> > [  322.844597]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > [  322.844600] RIP: 0023:0xf7ee6549
> > [  322.844601] RSP: 002b:000000008c8fbcb0 EFLAGS: 00200287
> > ORIG_RAX: 0000000000000127
> > [  322.844604] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX:
> > 000000008c8fbffc
> > [  322.844606] RDX: 0000000000098800 RSI: 0000000000000000 RDI:
> > 00000000f7ab3e1c
> > [  322.844608] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > 0000000000000000
> > [  322.844609] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000000
> > [  322.844611] R13: 0000000000000000 R14: 0000000000000000 R15:
> > 0000000000000000
> > [  322.844614] task:CGenericAsyncFi state:D stack:    0 pid: 1927
> > ppid:  1540 flags:0x20020000
> > [  322.844617] Call Trace:
> > [  322.844621]  __schedule+0x3db/0x7ee
> > [  322.844625]  schedule+0x68/0xdc
> > [  322.844628]  io_schedule+0x46/0x6d
> > [  322.844631]  wait_on_page_bit_common+0xf8/0x325
> > [  322.844634]  ?
> > trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
> > [  322.844638]  wait_on_page_bit+0x3f/0x41
> > [  322.844642]  wait_on_page_writeback.part.0+0x35/0x80
> > [  322.844645]  wait_on_page_writeback+0x27/0x29
> > [  322.844647]  __filemap_fdatawait_range+0x88/0xd9
> > [  322.844651]  filemap_write_and_wait_range+0x45/0x80
> > [  322.844655]  nfs_wb_all+0x30/0x13d
> > [  322.844658]  nfs4_file_flush+0x85/0xc0
> > [  322.844661]  filp_close+0x3e/0x78
> > [  322.844664]  __close_fd+0x23/0x30
> > [  322.844667]  __ia32_sys_close+0x22/0x50
> > [  322.844670]  __do_fast_syscall_32+0x5f/0x8e
> > [  322.844673]  do_fast_syscall_32+0x3d/0x80
> > [  322.844676]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > [  322.844678] RIP: 0023:0xf7ee6549
> > [  322.844680] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286
> > ORIG_RAX: 0000000000000006
> > [  322.844682] RAX: ffffffffffffffda RBX: 000000000000004d RCX:
> > 0000000000000002
> > [  322.844684] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI:
> > 000000000000004d
> > [  322.844686] RBP: 000000008c5fe110 R08: 0000000000000000 R09:
> > 0000000000000000
> > [  322.844688] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000000
> > [  322.844689] R13: 0000000000000000 R14: 0000000000000000 R15:
> > 0000000000000000
> > [  322.844699] task:ThreadPoolForeg state:D stack:    0 pid: 1768
> > ppid:  1735 flags:0x00004000
> > [  322.844702] Call Trace:
> > [  322.844705]  __schedule+0x3db/0x7ee
> > [  322.844709]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.844712]  schedule+0x68/0xdc
> > [  322.844716]  rpc_wait_bit_killable+0x11/0x52
> > [  322.844719]  __wait_on_bit+0x31/0x90
> > [  322.844722]  ? rpc_sleep_on_timeout+0x64/0x9c
> > [  322.844725]  ? rpc_check_timeout+0x1e0/0x1e0
> > [  322.844728]  ? call_connect_status+0x180/0x17a
> > [  322.844731]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.844734]  ? var_wake_function+0x30/0x27
> > [  322.844737]  __rpc_execute+0xe5/0x485
> > [  322.844740]  ? nfs_check_verifier+0x5b/0xa4
> > [  322.844743]  rpc_execute+0xa5/0xaa
> > [  322.844746]  rpc_run_task+0x148/0x187
> > [  322.844750]  nfs4_do_call_sync+0x61/0x81
> > [  322.844753]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.844758]  nfs4_proc_getattr+0x67/0xee
> > [  322.844762]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.844766]  nfs_getattr+0x159/0x374
> > [  322.844769]  vfs_getattr_nosec+0x96/0xa4
> > [  322.844771]  vfs_statx+0x7a/0xe5
> > [  322.844774]  __do_sys_newstat+0x31/0x4c
> > [  322.844778]  __x64_sys_newstat+0x16/0x18
> > [  322.844781]  do_syscall_64+0x32/0x45
> > [  322.844783]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > [  322.844786] RIP: 0033:0x7fe70ebf19e6
> > [  322.844787] RSP: 002b:00007fe6c15b2068 EFLAGS: 00000246
> > ORIG_RAX: 0000000000000004
> > [  322.844790] RAX: ffffffffffffffda RBX: 00000bd4193e1910 RCX:
> > 00007fe70ebf19e6
> > [  322.844792] RDX: 00007fe6c15b2078 RSI: 00007fe6c15b2078 RDI:
> > 00000bd4193e1910
> > [  322.844794] RBP: 00007fe6c15b2160 R08: 0000000000000001 R09:
> > 00007ffd971ed090
> > [  322.844796] R10: 0000000000056764 R11: 0000000000000246 R12:
> > 00007fe6c15b2078
> > [  322.844797] R13: 00000bd4192f6f60 R14: 00007fe6c15b2108 R15:
> > 00007fe6c15b2120
> > [  322.844801] task:ThreadPoolForeg state:D stack:    0 pid: 1769
> > ppid:  1735 flags:0x00000000
> > [  322.844804] Call Trace:
> > [  322.844807]  __schedule+0x3db/0x7ee
> > [  322.844811]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.844814]  schedule+0x68/0xdc
> > [  322.844817]  rpc_wait_bit_killable+0x11/0x52
> > [  322.844820]  __wait_on_bit+0x31/0x90
> > [  322.844823]  ? __rpc_atrun+0x20/0x1a
> > [  322.844826]  ? __rpc_atrun+0x20/0x1a
> > [  322.844829]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.844831]  ? var_wake_function+0x30/0x27
> > [  322.844834]  __rpc_execute+0xe5/0x485
> > [  322.844837]  ? update_cfs_group+0x9a/0x9e
> > [  322.844840]  rpc_execute+0xa5/0xaa
> > [  322.844842]  rpc_run_task+0x148/0x187
> > [  322.844845]  nfs4_do_call_sync+0x61/0x81
> > [  322.844849]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.844853]  nfs4_proc_getattr+0x67/0xee
> > [  322.844856]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.844859]  nfs_file_write+0xfc/0x29f
> > [  322.844862]  new_sync_write+0xfb/0x16b
> > [  322.844866]  vfs_write+0x1ee/0x29a
> > [  322.844868]  ? __fget_light+0x31/0x72
> > [  322.844871]  ksys_write+0x61/0xd0
> > [  322.844874]  __x64_sys_write+0x1a/0x1c
> > [  322.844877]  do_syscall_64+0x32/0x45
> > [  322.844880]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > [  322.844881] RIP: 0033:0x7fe71859ab7f
> > [  322.844883] RSP: 002b:00007fe6bd5b0d30 EFLAGS: 00000293
> > ORIG_RAX: 0000000000000001
> > [  322.844885] RAX: ffffffffffffffda RBX: 00007fe6bd5b0d68 RCX:
> > 00007fe71859ab7f
> > [  322.844887] RDX: 0000000000000007 RSI: 00007fe6bd5b1090 RDI:
> > 000000000000005b
> > [  322.844888] RBP: 00007fe6bd5b0de0 R08: 0000000000000000 R09:
> > 00007ffd971ed090
> > [  322.844890] R10: 0000000000065fd2 R11: 0000000000000293 R12:
> > 00000bd418d95020
> > [  322.844891] R13: 00007fe6bd5b1090 R14: 0000000000000007 R15:
> > 0000000000000000
> > [  322.844900] task:ThreadPoolForeg state:D stack:    0 pid: 1807
> > ppid:  1773 flags:0x00000000
> > [  322.844902] Call Trace:
> > [  322.844904]  __schedule+0x3db/0x7ee
> > [  322.844907]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.844908]  schedule+0x68/0xdc
> > [  322.844911]  rpc_wait_bit_killable+0x11/0x52
> > [  322.844913]  __wait_on_bit+0x31/0x90
> > [  322.844915]  ? __rpc_atrun+0x20/0x1a
> > [  322.844916]  ? __rpc_atrun+0x20/0x1a
> > [  322.844918]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.844920]  ? var_wake_function+0x30/0x27
> > [  322.844922]  __rpc_execute+0xe5/0x485
> > [  322.844924]  rpc_execute+0xa5/0xaa
> > [  322.844926]  rpc_run_task+0x148/0x187
> > [  322.844928]  nfs4_do_call_sync+0x61/0x81
> > [  322.844930]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.844933]  nfs4_proc_getattr+0x67/0xee
> > [  322.844936]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.844938]  nfs_access_get_cached+0x14d/0x1df
> > [  322.844940]  nfs_do_access+0x48/0x28e
> > [  322.844942]  ? nfs4_lookup_revalidate+0x38/0x8d
> > [  322.844944]  nfs_permission+0xa2/0x1b8
> > [  322.844946]  inode_permission+0x96/0xca
> > [  322.844948]  ? unlazy_walk+0x4c/0x80
> > [  322.844950]  link_path_walk.part.0+0x2a5/0x320
> > [  322.844952]  ? path_init+0x2c2/0x3e5
> > [  322.844954]  path_lookupat+0x3f/0x1b1
> > [  322.844956]  filename_lookup+0x97/0x17f
> > [  322.844959]  ? do_unlk+0x9a/0xca
> > [  322.844960]  ? kmem_cache_alloc+0x32/0x200
> > [  322.844963]  user_path_at_empty+0x59/0x8a
> > [  322.844965]  vfs_statx+0x64/0xe5
> > [  322.844967]  __do_sys_newstat+0x31/0x4c
> > [  322.844968]  ? do_fcntl+0x113/0x582
> > [  322.844971]  ? fput+0x13/0x15
> > [  322.844972]  ? __x64_sys_fcntl+0x81/0xa8
> > [  322.844974]  __x64_sys_newstat+0x16/0x18
> > [  322.844976]  do_syscall_64+0x32/0x45
> > [  322.844977]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > [  322.844979] RIP: 0033:0x7f0d1fd0c9e6
> > [  322.844980] RSP: 002b:00007f0d08e12e18 EFLAGS: 00000246
> > ORIG_RAX: 0000000000000004
> > [  322.844981] RAX: ffffffffffffffda RBX: 00007f0d08e12edc RCX:
> > 00007f0d1fd0c9e6
> > [  322.844983] RDX: 00007f0d08e12e28 RSI: 00007f0d08e12e28 RDI:
> > 00001605c2ff1169
> > [  322.844984] RBP: 00007f0d08e12ec0 R08: 0000000000000001 R09:
> > 00007fffaddd9090
> > [  322.844985] R10: 00001605c3133480 R11: 0000000000000246 R12:
> > 0000000000000000
> > [  322.844986] R13: 0000000000000000 R14: 00001605c2ff0f00 R15:
> > 0000000000000000
> > [  322.844988] task:ThreadPoolForeg state:D stack:    0 pid: 2031
> > ppid:  1773 flags:0x00004000
> > [  322.844990] Call Trace:
> > [  322.844992]  __schedule+0x3db/0x7ee
> > [  322.844995]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.844997]  schedule+0x68/0xdc
> > [  322.844999]  rpc_wait_bit_killable+0x11/0x52
> > [  322.845001]  __wait_on_bit+0x31/0x90
> > [  322.845003]  ? rpc_sleep_on_timeout+0x64/0x9c
> > [  322.845005]  ? rpc_check_timeout+0x1e0/0x1e0
> > [  322.845006]  ? call_connect_status+0x180/0x17a
> > [  322.845009]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.845010]  ? var_wake_function+0x30/0x27
> > [  322.845012]  __rpc_execute+0xe5/0x485
> > [  322.845014]  rpc_execute+0xa5/0xaa
> > [  322.845016]  rpc_run_task+0x148/0x187
> > [  322.845018]  nfs4_do_call_sync+0x61/0x81
> > [  322.845020]  _nfs4_proc_getattr+0x10a/0x120
> > [  322.845023]  nfs4_proc_getattr+0x67/0xee
> > [  322.845026]  __nfs_revalidate_inode+0xb7/0x2df
> > [  322.845028]  nfs_access_get_cached+0x14d/0x1df
> > [  322.845030]  nfs_do_access+0x48/0x28e
> > [  322.845032]  ? nfs4_lookup_revalidate+0x38/0x8d
> > [  322.845034]  nfs_permission+0xa2/0x1b8
> > [  322.845036]  inode_permission+0x96/0xca
> > [  322.845038]  ? unlazy_walk+0x4c/0x80
> > [  322.845040]  link_path_walk.part.0+0x2a5/0x320
> > [  322.845042]  ? path_init+0x2c2/0x3e5
> > [  322.845044]  path_lookupat+0x3f/0x1b1
> > [  322.845046]  filename_lookup+0x97/0x17f
> > [  322.845049]  ? kmem_cache_alloc+0x32/0x200
> > [  322.845051]  user_path_at_empty+0x59/0x8a
> > [  322.845054]  ? mntput_no_expire+0x4f/0x264
> > [  322.845055]  vfs_statx+0x64/0xe5
> > [  322.845057]  ? putname+0x4b/0x53
> > [  322.845059]  __do_sys_newstat+0x31/0x4c
> > [  322.845062]  ? __x64_sys_futex+0x13a/0x159
> > [  322.845064]  ? switch_fpu_return+0x58/0xf2
> > [  322.845066]  __x64_sys_newstat+0x16/0x18
> > [  322.845068]  do_syscall_64+0x32/0x45
> > [  322.845069]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > [  322.845071] RIP: 0033:0x7f0d1fd0c9e6
> > [  322.845072] RSP: 002b:00007f0d00331108 EFLAGS: 00000246
> > ORIG_RAX: 0000000000000004
> > [  322.845073] RAX: ffffffffffffffda RBX: 00001605c2fdc0a0 RCX:
> > 00007f0d1fd0c9e6
> > [  322.845074] RDX: 00007f0d00331118 RSI: 00007f0d00331118 RDI:
> > 00001605c2fdc0a0
> > [  322.845076] RBP: 00007f0d00331200 R08: 0000000000000001 R09:
> > 00007fffaddd9090
> > [  322.845077] R10: 000000000005771c R11: 0000000000000246 R12:
> > 00007f0d00331118
> > [  322.845078] R13: 00001605c3309c40 R14: 00007f0d003311a8 R15:
> > 00007f0d003311c0
> > [  322.845087] task:192.168.16.8-ma state:D stack:    0 pid: 2044
> > ppid:     2 flags:0x00004000
> > [  322.845089] Call Trace:
> > [  322.845091]  __schedule+0x3db/0x7ee
> > [  322.845094]  ?
> > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > [  322.845095]  schedule+0x68/0xdc
> > [  322.845098]  rpc_wait_bit_killable+0x11/0x52
> > [  322.845100]  __wait_on_bit+0x31/0x90
> > [  322.845102]  ? __rpc_atrun+0x20/0x1a
> > [  322.845103]  ? __rpc_atrun+0x20/0x1a
> > [  322.845105]  out_of_line_wait_on_bit+0x7e/0x80
> > [  322.845107]  ? var_wake_function+0x30/0x27
> > [  322.845109]  __rpc_execute+0xe5/0x485
> > [  322.845111]  rpc_execute+0xa5/0xaa
> > [  322.845113]  rpc_run_task+0x148/0x187
> > [  322.845115]  nfs4_do_call_sync+0x61/0x81
> > [  322.845117]  _nfs4_proc_statfs+0xc1/0xcf
> > [  322.845120]  nfs4_proc_statfs+0x55/0x74
> > [  322.845122]  nfs_statfs+0x65/0x152
> > [  322.845124]  ? _nfs4_proc_delegreturn+0x21e/0x2e5
> > [  322.845127]  statfs_by_dentry+0x55/0x74
> > [  322.845129]  vfs_statfs+0x24/0x43
> > [  322.845132]  check_free_space+0x3c/0xe5
> > [  322.845135]  do_acct_process+0x5d/0x5e1
> > [  322.845136]  ? nfs_end_delegation_return+0x1cb/0x3aa
> > [  322.845138]  ? complete+0x49/0x57
> > [  322.845141]  acct_process+0xe5/0x105
> > [  322.845143]  do_exit+0x781/0xa1b
> > [  322.845145]  ? refcount_dec_and_lock+0x22/0x8c
> > [  322.845148]  __module_put_and_exit+0x1b/0x1b
> > [  322.845150]  nfs4_run_state_manager+0x293/0x933
> > [  322.845152]  ? nfs4_do_reclaim+0x820/0x81c
> > [  322.845155]  kthread+0x142/0x15f
> > [  322.845157]  ? __kthread_create_worker+0x130/0x122
> > [  322.845159]  ret_from_fork+0x1f/0x2a
> > [  369.859385] INFO: task tee:1176 blocked for more than 122
> > seconds.
> > [  369.859391]       Not tainted 5.10.4-00021-gde017294a539-dirty
> > #3
> > [  369.859393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  369.859395] task:tee             state:D stack:    0 pid: 1176
> > ppid:  1173 flags:0x00000000
> > [  369.859399] Call Trace:
> > [  369.859407]  __schedule+0x3db/0x7ee
> > [  369.859410]  ? io_schedule+0x70/0x6d
> > [  369.859412]  schedule+0x68/0xdc
> > [  369.859415]  io_schedule+0x46/0x6d
> > [  369.859417]  bit_wait_io+0x11/0x52
> > [  369.859420]  __wait_on_bit+0x31/0x90
> > [  369.859422]  out_of_line_wait_on_bit+0x7e/0x80
> > [  369.859426]  ? var_wake_function+0x30/0x27
> > [  369.859429]  nfs_wait_on_request.part.0+0x31/0x33
> > [  369.859431]  nfs_page_group_lock_head+0x31/0x78
> > [  369.859433]  nfs_lock_and_join_requests+0x9d/0x26e
> > [  369.859436]  nfs_updatepage+0x1be/0x985
> > [  369.859439]  nfs_write_end+0x128/0x52a
> > [  369.859443]  ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
> > [  369.859446]  generic_perform_write+0xed/0x17c
> > [  369.859449]  nfs_file_write+0x14a/0x29f
> > [  369.859452]  new_sync_write+0xfb/0x16b
> > [  369.859455]  vfs_write+0x1ee/0x29a
> > [  369.859457]  ? vfs_read+0x17b/0x198
> > [  369.859459]  ksys_write+0x61/0xd0
> > [  369.859462]  __x64_sys_write+0x1a/0x1c
> > [  369.859464]  do_syscall_64+0x32/0x45
> > [  369.859467]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > [  369.859469] RIP: 0033:0x7f4781f323a3
> > [  369.859471] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246
> > ORIG_RAX: 0000000000000001
> > [  369.859474] RAX: ffffffffffffffda RBX: 0000000000000092 RCX:
> > 00007f4781f323a3
> > [  369.859475] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI:
> > 0000000000000003
> > [  369.859477] RBP: 00007ffcf8220530 R08: 0000000000000000 R09:
> > 0000000000000001
> > [  369.859478] R10: 0000000000400ad2 R11: 0000000000000246 R12:
> > 0000000000000092
> > [  369.859480] R13: 0000000002361580 R14: 0000000000000092 R15:
> > 00007f4782201720
> > [  369.859495] INFO: task CGenericAsyncFi:1927 blocked for more
> > than 122 seconds.
> > [  369.859497]       Not tainted 5.10.4-00021-gde017294a539-dirty
> > #3
> > [  369.859498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  369.859499] task:CGenericAsyncFi state:D stack:    0 pid: 1927
> > ppid:  1540 flags:0x20020000
> > [  369.859501] Call Trace:
> > [  369.859504]  __schedule+0x3db/0x7ee
> > [  369.859506]  schedule+0x68/0xdc
> > [  369.859508]  io_schedule+0x46/0x6d
> > [  369.859510]  wait_on_page_bit_common+0xf8/0x325
> > [  369.859512]  ?
> > trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
> > [  369.859515]  wait_on_page_bit+0x3f/0x41
> > [  369.859517]  wait_on_page_writeback.part.0+0x35/0x80
> > [  369.859519]  wait_on_page_writeback+0x27/0x29
> > [  369.859521]  __filemap_fdatawait_range+0x88/0xd9
> > [  369.859523]  filemap_write_and_wait_range+0x45/0x80
> > [  369.859525]  nfs_wb_all+0x30/0x13d
> > [  369.859528]  nfs4_file_flush+0x85/0xc0
> > [  369.859530]  filp_close+0x3e/0x78
> > [  369.859532]  __close_fd+0x23/0x30
> > [  369.859534]  __ia32_sys_close+0x22/0x50
> > [  369.859536]  __do_fast_syscall_32+0x5f/0x8e
> > [  369.859538]  do_fast_syscall_32+0x3d/0x80
> > [  369.859540]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > [  369.859541] RIP: 0023:0xf7ee6549
> > [  369.859543] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286
> > ORIG_RAX: 0000000000000006
> > [  369.859545] RAX: ffffffffffffffda RBX: 000000000000004d RCX:
> > 0000000000000002
> > [  369.859546] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI:
> > 000000000000004d
> > [  369.859547] RBP: 000000008c5fe110 R08: 0000000000000000 R09:
> > 0000000000000000
> > [  369.859548] R10: 0000000000000000 R11: 0000000000000000 R12:
> > 0000000000000000
> > [  369.859549] R13: 0000000000000000 R14: 0000000000000000 R15:
> > 0000000000000000
> > [  394.423093] nfs: server loom not responding, still trying
> >
> > --
> > NULL && (void)

--
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

http://www.hammer.space
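Sifting SysRq-W output like the dump quoted above by hand is tedious; a short script can group each D-state task under the first NFS/RPC frame it is blocked in. This is a rough sketch, not part of the original thread: the regexes are guesses tuned to the dump format shown here and assume the `> >` quote prefixes have been stripped.

```python
import re
from collections import defaultdict

def summarize_blocked(dump: str) -> dict:
    """Group 'task:NAME ... state:D ... pid: N' entries from a
    'Show Blocked State' dump by the first nfs*/rpc* frame in each
    task's call trace."""
    tasks = defaultdict(list)
    current = None
    for line in dump.splitlines():
        m = re.search(r'task:(\S+)\s+state:D.*?pid:\s*(\d+)', line)
        if m:
            current = (m.group(1), int(m.group(2)))
            continue
        # Match a real call-trace frame, e.g. ']  nfs_updatepage+0x1be/0x985';
        # speculative '? frame' entries deliberately fail to match.
        f = re.search(r'\]\s+(_*(?:nfs|rpc)[\w.]*)\+0x', line)
        if current and f:
            tasks[f.group(1)].append(current)
            current = None  # record only the topmost nfs/rpc frame per task
    return dict(tasks)
```

Feeding it the dump above would show most tasks parked under the same `nfs4_proc_getattr`-driven RPC wait, which is the pattern worth noticing.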

2021-02-24 02:03:19

by J. Bruce Fields

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On Tue, Feb 23, 2021 at 11:58:51PM +0000, Trond Myklebust wrote:
> On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
> > On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
> > > Relevant exports, from /proc/fs/nfs/exports:
> > >
> > > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,sync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=390003:390004:390005:1)
> > > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,sync,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=1)
>
> Isn't that trying to export the same filesystem as '/' on the line
> above using conflicting export options?

Yes, and exporting something on the root filesystem is generally a bad
idea for the usual reasons.
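For comparison, a layout that avoids the conflicting options keeps the NFSv4 pseudo-root on a dedicated tree instead of exporting '/' itself. A hypothetical /etc/exports sketch with made-up paths, not the poster's actual configuration:

```
# Hypothetical /etc/exports: dedicated pseudo-root, no export of '/'
/export        192.168.16.0/24(ro,fsid=0,crossmnt,no_subtree_check,sec=sys)
/export/home   *.srvr.nix(rw,root_squash,sync,no_subtree_check,sec=sys)
/export/steam  192.168.16.0/24(rw,root_squash,sync,no_subtree_check,sec=sys)
```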

I don't see any explanation for the write hang there, though.

Unless maybe it leaves mountd unable to answer an upcall for some
reason, hm.

I think that would show up in /proc/net/rpc/nfsd.fh/content. Try
cat'ing that file after 'rpcdebug -m rpc -s cache' and that should show
if there's an upcall that's not getting answered.

--b.
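Pulling the suggestions from this thread together, the server-side checks amount to roughly the following. These must run as root on the NFS server; the network interface name is a placeholder:

```
# Log filehandle-cache activity, then look for an upcall stuck unanswered
rpcdebug -m rpc -s cache
cat /proc/net/rpc/nfsd.fh/content

# Log dropped replies; watch for "svc: svc_process dropit" as the hang starts
rpcdebug -m rpc -s svcdsp
dmesg --follow | grep -i dropit

# Capture traffic to find an RPC that never gets a reply (interface is a guess)
tcpdump -i eth0 -s 0 -w nfs-hang.pcap port 2049
```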

>
>
> > > /pkg/non-
> > > free/steam     192.168.16.0/24,silk.wkstn.nix(rw,root_squash,sync,w
> > > delay,uuid=11b94016:cb6e4b64:beb6c134:9ea3d6b3,sec=1)
> > >
> > > Blocked tasks after the hang suggest that the client is hung
> > > waiting
> > > indefinitely for the server to reply, and I'd guess the server
> > > doesn't
> > > think it needs to. (I can do some server-side debugging dumps too
> > > if you
> > > think the problem might be there.)
> >
> > I seem to remember there have been some subtle changes to how dropped
> > replies are handled.  If I remember right: a v4 server is never
> > supposed
> > to drop a reply without also closing the tcp connection, but it may
> > have
> > done so in the past.  (I *think* all those cases are fixed?)
> >
> > And the client may have been changed to be more unforgiving about
> > that
> > requirement in recent years.
> >
> > But I don't know if that's where the issue is.
> >
> > I'm not sure how best to go about debugging.  I suppose you could try
> > to
> > capture a network trace and work out if there's an RPC that isn't
> > getting a reply.
> >
> > Maybe turning on server debugging and looking for "svc: svc_process
> > dropit" messages timed with the start of the hang.  I think what you
> > want is "rpdebug -m rpc -s svcdsp".
> >
> > Does rpc.mountd look healthy?
> >
> > --b.
> >
> >
> > >
> > > [  322.843962] sysrq: Show Blocked State
> > > [  322.844058] task:tee             state:D stack:    0 pid: 1176
> > > ppid:  1173 flags:0x00000000
> > > [  322.844064] Call Trace:
> > > [  322.844074]  __schedule+0x3db/0x7ee
> > > [  322.844079]  ? io_schedule+0x70/0x6d
> > > [  322.844082]  schedule+0x68/0xdc
> > > [  322.844085]  io_schedule+0x46/0x6d
> > > [  322.844089]  bit_wait_io+0x11/0x52
> > > [  322.844092]  __wait_on_bit+0x31/0x90
> > > [  322.844096]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844100]  ? var_wake_function+0x30/0x27
> > > [  322.844104]  nfs_wait_on_request.part.0+0x31/0x33
> > > [  322.844107]  nfs_page_group_lock_head+0x31/0x78
> > > [  322.844110]  nfs_lock_and_join_requests+0x9d/0x26e
> > > [  322.844114]  nfs_updatepage+0x1be/0x985
> > > [  322.844118]  nfs_write_end+0x128/0x52a
> > > [  322.844122]  ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
> > > [  322.844127]  generic_perform_write+0xed/0x17c
> > > [  322.844131]  nfs_file_write+0x14a/0x29f
> > > [  322.844135]  new_sync_write+0xfb/0x16b
> > > [  322.844139]  vfs_write+0x1ee/0x29a
> > > [  322.844142]  ? vfs_read+0x17b/0x198
> > > [  322.844145]  ksys_write+0x61/0xd0
> > > [  322.844148]  __x64_sys_write+0x1a/0x1c
> > > [  322.844151]  do_syscall_64+0x32/0x45
> > > [  322.844154]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > > [  322.844158] RIP: 0033:0x7f4781f323a3
> > > [  322.844160] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246
> > > ORIG_RAX: 0000000000000001
> > > [  322.844164] RAX: ffffffffffffffda RBX: 0000000000000092 RCX:
> > > 00007f4781f323a3
> > > [  322.844166] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI:
> > > 0000000000000003
> > > [  322.844168] RBP: 00007ffcf8220530 R08: 0000000000000000 R09:
> > > 0000000000000001
> > > [  322.844170] R10: 0000000000400ad2 R11: 0000000000000246 R12:
> > > 0000000000000092
> > > [  322.844172] R13: 0000000002361580 R14: 0000000000000092 R15:
> > > 00007f4782201720
> > > [  322.844202] task:CIPCServer::Thr state:D stack:    0 pid: 1801
> > > ppid:  1540 flags:0x20020000
> > > [  322.844207] Call Trace:
> > > [  322.844212]  __schedule+0x3db/0x7ee
> > > [  322.844219]  ?
> > > trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844222]  schedule+0x68/0xdc
> > > [  322.844227]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.844230]  __wait_on_bit+0x31/0x90
> > > [  322.844235]  ? __rpc_atrun+0x20/0x1a
> > > [  322.844238]  ? __rpc_atrun+0x20/0x1a
> > > [  322.844242]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844246]  ? var_wake_function+0x30/0x27
> > > [  322.844250]  __rpc_execute+0xe5/0x485
> > > [  322.844254]  ? sugov_get_util+0xf0/0xeb
> > > [  322.844258]  rpc_execute+0xa5/0xaa
> > > [  322.844263]  rpc_run_task+0x148/0x187
> > > [  322.844267]  nfs4_do_call_sync+0x61/0x81
> > > [  322.844272]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.844278]  nfs4_proc_getattr+0x67/0xee
> > > [  322.844283]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.844287]  nfs_access_get_cached+0x14d/0x1df
> > > [  322.844291]  nfs_do_access+0x48/0x28e
> > > [  322.844296]  ? step_into+0x2bc/0x6a1
> > > [  322.844301]  nfs_permission+0xa2/0x1b8
> > > [  322.844305]  inode_permission+0x96/0xca
> > > [  322.844308]  ? unlazy_walk+0x4c/0x80
> > > [  322.844312]  link_path_walk.part.0+0x2a5/0x320
> > > [  322.844315]  ? path_init+0x2c2/0x3e5
> > > [  322.844318]  path_lookupat+0x3f/0x1b1
> > > [  322.844322]  filename_lookup+0x97/0x17f
> > > [  322.844326]  ? kmem_cache_alloc+0x32/0x200
> > > [  322.844331]  user_path_at_empty+0x59/0x8a
> > > [  322.844333]  do_faccessat+0x70/0x227
> > > [  322.844337]  __ia32_sys_access+0x1c/0x1e
> > > [  322.844340]  __do_fast_syscall_32+0x5f/0x8e
> > > [  322.844342]  do_fast_syscall_32+0x3d/0x80
> > > [  322.844346]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > > [  322.844348] RIP: 0023:0xf7ee6549
> > > [  322.844350] RSP: 002b:00000000eb763b28 EFLAGS: 00200292
> > > ORIG_RAX: 0000000000000021
> > > [  322.844353] RAX: ffffffffffffffda RBX: 000000005754d9e0 RCX:
> > > 0000000000000004
> > > [  322.844355] RDX: 00000000f7ab3e1c RSI: 000000005754d9e0 RDI:
> > > 00000000eb763bc8
> > > [  322.844357] RBP: 00000000f78a91c0 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [  322.844359] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000000
> > > [  322.844360] R13: 0000000000000000 R14: 0000000000000000 R15:
> > > 0000000000000000
> > > [  322.844364] task:CFileWriterThre state:D stack:    0 pid: 1810
> > > ppid:  1540 flags:0x20024000
> > > [  322.844367] Call Trace:
> > > [  322.844371]  __schedule+0x3db/0x7ee
> > > [  322.844375]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844379]  schedule+0x68/0xdc
> > > [  322.844382]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.844385]  __wait_on_bit+0x31/0x90
> > > [  322.844389]  ? call_reserveresult+0xa0/0x9f
> > > [  322.844391]  ? call_transmit_status+0x160/0x159
> > > [  322.844395]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844398]  ? var_wake_function+0x30/0x27
> > > [  322.844401]  __rpc_execute+0xe5/0x485
> > > [  322.844404]  ? nfs_check_verifier+0x5b/0xa4
> > > [  322.844407]  rpc_execute+0xa5/0xaa
> > > [  322.844410]  rpc_run_task+0x148/0x187
> > > [  322.844414]  nfs4_do_call_sync+0x61/0x81
> > > [  322.844417]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.844422]  nfs4_proc_getattr+0x67/0xee
> > > [  322.844426]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.844430]  nfs_getattr+0x159/0x374
> > > [  322.844433]  vfs_getattr_nosec+0x96/0xa4
> > > [  322.844435]  vfs_statx+0x7a/0xe5
> > > [  322.844438]  vfs_fstatat+0x1f/0x21
> > > [  322.844443]  __do_compat_sys_ia32_lstat64+0x2b/0x4e
> > > [  322.844448]  __ia32_compat_sys_ia32_lstat64+0x14/0x16
> > > [  322.844450]  __do_fast_syscall_32+0x5f/0x8e
> > > [  322.844453]  do_fast_syscall_32+0x3d/0x80
> > > [  322.844457]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > > [  322.844459] RIP: 0023:0xf7ee6549
> > > [  322.844460] RSP: 002b:00000000eb561be8 EFLAGS: 00000292
> > > ORIG_RAX: 00000000000000c4
> > > [  322.844463] RAX: ffffffffffffffda RBX: 00000000c521bf60 RCX:
> > > 00000000eb561e70
> > > [  322.844465] RDX: 00000000f7ab3e1c RSI: 00000000eb561e70 RDI:
> > > 00000000c521bf60
> > > [  322.844467] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [  322.844468] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000000
> > > [  322.844470] R13: 0000000000000000 R14: 0000000000000000 R15:
> > > 0000000000000000
> > > [  322.844477] task:Steam - DirWatc state:D stack:    0 pid: 1924
> > > ppid:  1540 flags:0x20024000
> > > [  322.844480] Call Trace:
> > > [  322.844484]  __schedule+0x3db/0x7ee
> > > [  322.844488]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844492]  schedule+0x68/0xdc
> > > [  322.844495]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.844498]  __wait_on_bit+0x31/0x90
> > > [  322.844501]  ? rpc_sleep_on_timeout+0x64/0x9c
> > > [  322.844504]  ? rpc_check_timeout+0x1e0/0x1e0
> > > [  322.844507]  ? call_connect_status+0x180/0x17a
> > > [  322.844511]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844513]  ? var_wake_function+0x30/0x27
> > > [  322.844516]  __rpc_execute+0xe5/0x485
> > > [  322.844520]  rpc_execute+0xa5/0xaa
> > > [  322.844523]  rpc_run_task+0x148/0x187
> > > [  322.844526]  nfs4_do_call_sync+0x61/0x81
> > > [  322.844530]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.844534]  nfs4_proc_getattr+0x67/0xee
> > > [  322.844538]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.844542]  nfs_lookup_verify_inode+0x67/0x82
> > > [  322.844545]  nfs_do_lookup_revalidate+0x19b/0x29c
> > > [  322.844548]  nfs4_do_lookup_revalidate+0x5d/0xc5
> > > [  322.844552]  nfs4_lookup_revalidate+0x6a/0x8d
> > > [  322.844555]  lookup_fast+0xb8/0x122
> > > [  322.844559]  path_openat+0xfd/0xfd8
> > > [  322.844562]  ? unlazy_walk+0x4c/0x80
> > > [  322.844565]  ? terminate_walk+0x62/0xed
> > > [  322.844569]  ? putname+0x4b/0x53
> > > [  322.844572]  do_filp_open+0x86/0x110
> > > [  322.844576]  ? kmem_cache_alloc+0x32/0x200
> > > [  322.844579]  ? __alloc_fd+0xb2/0x143
> > > [  322.844583]  do_sys_openat2+0x8d/0x13a
> > > [  322.844585]  ? do_faccessat+0x108/0x227
> > > [  322.844588]  __ia32_compat_sys_openat+0x48/0x63
> > > [  322.844591]  __do_fast_syscall_32+0x5f/0x8e
> > > [  322.844594]  do_fast_syscall_32+0x3d/0x80
> > > [  322.844597]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > > [  322.844600] RIP: 0023:0xf7ee6549
> > > [  322.844601] RSP: 002b:000000008c8fbcb0 EFLAGS: 00200287
> > > ORIG_RAX: 0000000000000127
> > > [  322.844604] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX:
> > > 000000008c8fbffc
> > > [  322.844606] RDX: 0000000000098800 RSI: 0000000000000000 RDI:
> > > 00000000f7ab3e1c
> > > [  322.844608] RBP: 0000000000000000 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [  322.844609] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000000
> > > [  322.844611] R13: 0000000000000000 R14: 0000000000000000 R15:
> > > 0000000000000000
> > > [  322.844614] task:CGenericAsyncFi state:D stack:    0 pid: 1927
> > > ppid:  1540 flags:0x20020000
> > > [  322.844617] Call Trace:
> > > [  322.844621]  __schedule+0x3db/0x7ee
> > > [  322.844625]  schedule+0x68/0xdc
> > > [  322.844628]  io_schedule+0x46/0x6d
> > > [  322.844631]  wait_on_page_bit_common+0xf8/0x325
> > > [  322.844634]  ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
> > > [  322.844638]  wait_on_page_bit+0x3f/0x41
> > > [  322.844642]  wait_on_page_writeback.part.0+0x35/0x80
> > > [  322.844645]  wait_on_page_writeback+0x27/0x29
> > > [  322.844647]  __filemap_fdatawait_range+0x88/0xd9
> > > [  322.844651]  filemap_write_and_wait_range+0x45/0x80
> > > [  322.844655]  nfs_wb_all+0x30/0x13d
> > > [  322.844658]  nfs4_file_flush+0x85/0xc0
> > > [  322.844661]  filp_close+0x3e/0x78
> > > [  322.844664]  __close_fd+0x23/0x30
> > > [  322.844667]  __ia32_sys_close+0x22/0x50
> > > [  322.844670]  __do_fast_syscall_32+0x5f/0x8e
> > > [  322.844673]  do_fast_syscall_32+0x3d/0x80
> > > [  322.844676]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > > [  322.844678] RIP: 0023:0xf7ee6549
> > > [  322.844680] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286
> > > ORIG_RAX: 0000000000000006
> > > [  322.844682] RAX: ffffffffffffffda RBX: 000000000000004d RCX:
> > > 0000000000000002
> > > [  322.844684] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI:
> > > 000000000000004d
> > > [  322.844686] RBP: 000000008c5fe110 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [  322.844688] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000000
> > > [  322.844689] R13: 0000000000000000 R14: 0000000000000000 R15:
> > > 0000000000000000
> > > [  322.844699] task:ThreadPoolForeg state:D stack:    0 pid: 1768
> > > ppid:  1735 flags:0x00004000
> > > [  322.844702] Call Trace:
> > > [  322.844705]  __schedule+0x3db/0x7ee
> > > [  322.844709]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844712]  schedule+0x68/0xdc
> > > [  322.844716]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.844719]  __wait_on_bit+0x31/0x90
> > > [  322.844722]  ? rpc_sleep_on_timeout+0x64/0x9c
> > > [  322.844725]  ? rpc_check_timeout+0x1e0/0x1e0
> > > [  322.844728]  ? call_connect_status+0x180/0x17a
> > > [  322.844731]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844734]  ? var_wake_function+0x30/0x27
> > > [  322.844737]  __rpc_execute+0xe5/0x485
> > > [  322.844740]  ? nfs_check_verifier+0x5b/0xa4
> > > [  322.844743]  rpc_execute+0xa5/0xaa
> > > [  322.844746]  rpc_run_task+0x148/0x187
> > > [  322.844750]  nfs4_do_call_sync+0x61/0x81
> > > [  322.844753]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.844758]  nfs4_proc_getattr+0x67/0xee
> > > [  322.844762]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.844766]  nfs_getattr+0x159/0x374
> > > [  322.844769]  vfs_getattr_nosec+0x96/0xa4
> > > [  322.844771]  vfs_statx+0x7a/0xe5
> > > [  322.844774]  __do_sys_newstat+0x31/0x4c
> > > [  322.844778]  __x64_sys_newstat+0x16/0x18
> > > [  322.844781]  do_syscall_64+0x32/0x45
> > > [  322.844783]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > > [  322.844786] RIP: 0033:0x7fe70ebf19e6
> > > [  322.844787] RSP: 002b:00007fe6c15b2068 EFLAGS: 00000246
> > > ORIG_RAX: 0000000000000004
> > > [  322.844790] RAX: ffffffffffffffda RBX: 00000bd4193e1910 RCX:
> > > 00007fe70ebf19e6
> > > [  322.844792] RDX: 00007fe6c15b2078 RSI: 00007fe6c15b2078 RDI:
> > > 00000bd4193e1910
> > > [  322.844794] RBP: 00007fe6c15b2160 R08: 0000000000000001 R09:
> > > 00007ffd971ed090
> > > [  322.844796] R10: 0000000000056764 R11: 0000000000000246 R12:
> > > 00007fe6c15b2078
> > > [  322.844797] R13: 00000bd4192f6f60 R14: 00007fe6c15b2108 R15:
> > > 00007fe6c15b2120
> > > [  322.844801] task:ThreadPoolForeg state:D stack:    0 pid: 1769
> > > ppid:  1735 flags:0x00000000
> > > [  322.844804] Call Trace:
> > > [  322.844807]  __schedule+0x3db/0x7ee
> > > [  322.844811]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844814]  schedule+0x68/0xdc
> > > [  322.844817]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.844820]  __wait_on_bit+0x31/0x90
> > > [  322.844823]  ? __rpc_atrun+0x20/0x1a
> > > [  322.844826]  ? __rpc_atrun+0x20/0x1a
> > > [  322.844829]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844831]  ? var_wake_function+0x30/0x27
> > > [  322.844834]  __rpc_execute+0xe5/0x485
> > > [  322.844837]  ? update_cfs_group+0x9a/0x9e
> > > [  322.844840]  rpc_execute+0xa5/0xaa
> > > [  322.844842]  rpc_run_task+0x148/0x187
> > > [  322.844845]  nfs4_do_call_sync+0x61/0x81
> > > [  322.844849]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.844853]  nfs4_proc_getattr+0x67/0xee
> > > [  322.844856]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.844859]  nfs_file_write+0xfc/0x29f
> > > [  322.844862]  new_sync_write+0xfb/0x16b
> > > [  322.844866]  vfs_write+0x1ee/0x29a
> > > [  322.844868]  ? __fget_light+0x31/0x72
> > > [  322.844871]  ksys_write+0x61/0xd0
> > > [  322.844874]  __x64_sys_write+0x1a/0x1c
> > > [  322.844877]  do_syscall_64+0x32/0x45
> > > [  322.844880]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > > [  322.844881] RIP: 0033:0x7fe71859ab7f
> > > [  322.844883] RSP: 002b:00007fe6bd5b0d30 EFLAGS: 00000293
> > > ORIG_RAX: 0000000000000001
> > > [  322.844885] RAX: ffffffffffffffda RBX: 00007fe6bd5b0d68 RCX:
> > > 00007fe71859ab7f
> > > [  322.844887] RDX: 0000000000000007 RSI: 00007fe6bd5b1090 RDI:
> > > 000000000000005b
> > > [  322.844888] RBP: 00007fe6bd5b0de0 R08: 0000000000000000 R09:
> > > 00007ffd971ed090
> > > [  322.844890] R10: 0000000000065fd2 R11: 0000000000000293 R12:
> > > 00000bd418d95020
> > > [  322.844891] R13: 00007fe6bd5b1090 R14: 0000000000000007 R15:
> > > 0000000000000000
> > > [  322.844900] task:ThreadPoolForeg state:D stack:    0 pid: 1807
> > > ppid:  1773 flags:0x00000000
> > > [  322.844902] Call Trace:
> > > [  322.844904]  __schedule+0x3db/0x7ee
> > > [  322.844907]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844908]  schedule+0x68/0xdc
> > > [  322.844911]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.844913]  __wait_on_bit+0x31/0x90
> > > [  322.844915]  ? __rpc_atrun+0x20/0x1a
> > > [  322.844916]  ? __rpc_atrun+0x20/0x1a
> > > [  322.844918]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.844920]  ? var_wake_function+0x30/0x27
> > > [  322.844922]  __rpc_execute+0xe5/0x485
> > > [  322.844924]  rpc_execute+0xa5/0xaa
> > > [  322.844926]  rpc_run_task+0x148/0x187
> > > [  322.844928]  nfs4_do_call_sync+0x61/0x81
> > > [  322.844930]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.844933]  nfs4_proc_getattr+0x67/0xee
> > > [  322.844936]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.844938]  nfs_access_get_cached+0x14d/0x1df
> > > [  322.844940]  nfs_do_access+0x48/0x28e
> > > [  322.844942]  ? nfs4_lookup_revalidate+0x38/0x8d
> > > [  322.844944]  nfs_permission+0xa2/0x1b8
> > > [  322.844946]  inode_permission+0x96/0xca
> > > [  322.844948]  ? unlazy_walk+0x4c/0x80
> > > [  322.844950]  link_path_walk.part.0+0x2a5/0x320
> > > [  322.844952]  ? path_init+0x2c2/0x3e5
> > > [  322.844954]  path_lookupat+0x3f/0x1b1
> > > [  322.844956]  filename_lookup+0x97/0x17f
> > > [  322.844959]  ? do_unlk+0x9a/0xca
> > > [  322.844960]  ? kmem_cache_alloc+0x32/0x200
> > > [  322.844963]  user_path_at_empty+0x59/0x8a
> > > [  322.844965]  vfs_statx+0x64/0xe5
> > > [  322.844967]  __do_sys_newstat+0x31/0x4c
> > > [  322.844968]  ? do_fcntl+0x113/0x582
> > > [  322.844971]  ? fput+0x13/0x15
> > > [  322.844972]  ? __x64_sys_fcntl+0x81/0xa8
> > > [  322.844974]  __x64_sys_newstat+0x16/0x18
> > > [  322.844976]  do_syscall_64+0x32/0x45
> > > [  322.844977]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > > [  322.844979] RIP: 0033:0x7f0d1fd0c9e6
> > > [  322.844980] RSP: 002b:00007f0d08e12e18 EFLAGS: 00000246
> > > ORIG_RAX: 0000000000000004
> > > [  322.844981] RAX: ffffffffffffffda RBX: 00007f0d08e12edc RCX:
> > > 00007f0d1fd0c9e6
> > > [  322.844983] RDX: 00007f0d08e12e28 RSI: 00007f0d08e12e28 RDI:
> > > 00001605c2ff1169
> > > [  322.844984] RBP: 00007f0d08e12ec0 R08: 0000000000000001 R09:
> > > 00007fffaddd9090
> > > [  322.844985] R10: 00001605c3133480 R11: 0000000000000246 R12:
> > > 0000000000000000
> > > [  322.844986] R13: 0000000000000000 R14: 00001605c2ff0f00 R15:
> > > 0000000000000000
> > > [  322.844988] task:ThreadPoolForeg state:D stack:    0 pid: 2031
> > > ppid:  1773 flags:0x00004000
> > > [  322.844990] Call Trace:
> > > [  322.844992]  __schedule+0x3db/0x7ee
> > > [  322.844995]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.844997]  schedule+0x68/0xdc
> > > [  322.844999]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.845001]  __wait_on_bit+0x31/0x90
> > > [  322.845003]  ? rpc_sleep_on_timeout+0x64/0x9c
> > > [  322.845005]  ? rpc_check_timeout+0x1e0/0x1e0
> > > [  322.845006]  ? call_connect_status+0x180/0x17a
> > > [  322.845009]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.845010]  ? var_wake_function+0x30/0x27
> > > [  322.845012]  __rpc_execute+0xe5/0x485
> > > [  322.845014]  rpc_execute+0xa5/0xaa
> > > [  322.845016]  rpc_run_task+0x148/0x187
> > > [  322.845018]  nfs4_do_call_sync+0x61/0x81
> > > [  322.845020]  _nfs4_proc_getattr+0x10a/0x120
> > > [  322.845023]  nfs4_proc_getattr+0x67/0xee
> > > [  322.845026]  __nfs_revalidate_inode+0xb7/0x2df
> > > [  322.845028]  nfs_access_get_cached+0x14d/0x1df
> > > [  322.845030]  nfs_do_access+0x48/0x28e
> > > [  322.845032]  ? nfs4_lookup_revalidate+0x38/0x8d
> > > [  322.845034]  nfs_permission+0xa2/0x1b8
> > > [  322.845036]  inode_permission+0x96/0xca
> > > [  322.845038]  ? unlazy_walk+0x4c/0x80
> > > [  322.845040]  link_path_walk.part.0+0x2a5/0x320
> > > [  322.845042]  ? path_init+0x2c2/0x3e5
> > > [  322.845044]  path_lookupat+0x3f/0x1b1
> > > [  322.845046]  filename_lookup+0x97/0x17f
> > > [  322.845049]  ? kmem_cache_alloc+0x32/0x200
> > > [  322.845051]  user_path_at_empty+0x59/0x8a
> > > [  322.845054]  ? mntput_no_expire+0x4f/0x264
> > > [  322.845055]  vfs_statx+0x64/0xe5
> > > [  322.845057]  ? putname+0x4b/0x53
> > > [  322.845059]  __do_sys_newstat+0x31/0x4c
> > > [  322.845062]  ? __x64_sys_futex+0x13a/0x159
> > > [  322.845064]  ? switch_fpu_return+0x58/0xf2
> > > [  322.845066]  __x64_sys_newstat+0x16/0x18
> > > [  322.845068]  do_syscall_64+0x32/0x45
> > > [  322.845069]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > > [  322.845071] RIP: 0033:0x7f0d1fd0c9e6
> > > [  322.845072] RSP: 002b:00007f0d00331108 EFLAGS: 00000246
> > > ORIG_RAX: 0000000000000004
> > > [  322.845073] RAX: ffffffffffffffda RBX: 00001605c2fdc0a0 RCX:
> > > 00007f0d1fd0c9e6
> > > [  322.845074] RDX: 00007f0d00331118 RSI: 00007f0d00331118 RDI:
> > > 00001605c2fdc0a0
> > > [  322.845076] RBP: 00007f0d00331200 R08: 0000000000000001 R09:
> > > 00007fffaddd9090
> > > [  322.845077] R10: 000000000005771c R11: 0000000000000246 R12:
> > > 00007f0d00331118
> > > [  322.845078] R13: 00001605c3309c40 R14: 00007f0d003311a8 R15:
> > > 00007f0d003311c0
> > > [  322.845087] task:192.168.16.8-ma state:D stack:    0 pid: 2044
> > > ppid:     2 flags:0x00004000
> > > [  322.845089] Call Trace:
> > > [  322.845091]  __schedule+0x3db/0x7ee
> > > [  322.845094]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
> > > [  322.845095]  schedule+0x68/0xdc
> > > [  322.845098]  rpc_wait_bit_killable+0x11/0x52
> > > [  322.845100]  __wait_on_bit+0x31/0x90
> > > [  322.845102]  ? __rpc_atrun+0x20/0x1a
> > > [  322.845103]  ? __rpc_atrun+0x20/0x1a
> > > [  322.845105]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  322.845107]  ? var_wake_function+0x30/0x27
> > > [  322.845109]  __rpc_execute+0xe5/0x485
> > > [  322.845111]  rpc_execute+0xa5/0xaa
> > > [  322.845113]  rpc_run_task+0x148/0x187
> > > [  322.845115]  nfs4_do_call_sync+0x61/0x81
> > > [  322.845117]  _nfs4_proc_statfs+0xc1/0xcf
> > > [  322.845120]  nfs4_proc_statfs+0x55/0x74
> > > [  322.845122]  nfs_statfs+0x65/0x152
> > > [  322.845124]  ? _nfs4_proc_delegreturn+0x21e/0x2e5
> > > [  322.845127]  statfs_by_dentry+0x55/0x74
> > > [  322.845129]  vfs_statfs+0x24/0x43
> > > [  322.845132]  check_free_space+0x3c/0xe5
> > > [  322.845135]  do_acct_process+0x5d/0x5e1
> > > [  322.845136]  ? nfs_end_delegation_return+0x1cb/0x3aa
> > > [  322.845138]  ? complete+0x49/0x57
> > > [  322.845141]  acct_process+0xe5/0x105
> > > [  322.845143]  do_exit+0x781/0xa1b
> > > [  322.845145]  ? refcount_dec_and_lock+0x22/0x8c
> > > [  322.845148]  __module_put_and_exit+0x1b/0x1b
> > > [  322.845150]  nfs4_run_state_manager+0x293/0x933
> > > [  322.845152]  ? nfs4_do_reclaim+0x820/0x81c
> > > [  322.845155]  kthread+0x142/0x15f
> > > [  322.845157]  ? __kthread_create_worker+0x130/0x122
> > > [  322.845159]  ret_from_fork+0x1f/0x2a
> > > [  369.859385] INFO: task tee:1176 blocked for more than 122
> > > seconds.
> > > [  369.859391]       Not tainted 5.10.4-00021-gde017294a539-dirty
> > > #3
> > > [  369.859393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [  369.859395] task:tee             state:D stack:    0 pid: 1176
> > > ppid:  1173 flags:0x00000000
> > > [  369.859399] Call Trace:
> > > [  369.859407]  __schedule+0x3db/0x7ee
> > > [  369.859410]  ? io_schedule+0x70/0x6d
> > > [  369.859412]  schedule+0x68/0xdc
> > > [  369.859415]  io_schedule+0x46/0x6d
> > > [  369.859417]  bit_wait_io+0x11/0x52
> > > [  369.859420]  __wait_on_bit+0x31/0x90
> > > [  369.859422]  out_of_line_wait_on_bit+0x7e/0x80
> > > [  369.859426]  ? var_wake_function+0x30/0x27
> > > [  369.859429]  nfs_wait_on_request.part.0+0x31/0x33
> > > [  369.859431]  nfs_page_group_lock_head+0x31/0x78
> > > [  369.859433]  nfs_lock_and_join_requests+0x9d/0x26e
> > > [  369.859436]  nfs_updatepage+0x1be/0x985
> > > [  369.859439]  nfs_write_end+0x128/0x52a
> > > [  369.859443]  ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
> > > [  369.859446]  generic_perform_write+0xed/0x17c
> > > [  369.859449]  nfs_file_write+0x14a/0x29f
> > > [  369.859452]  new_sync_write+0xfb/0x16b
> > > [  369.859455]  vfs_write+0x1ee/0x29a
> > > [  369.859457]  ? vfs_read+0x17b/0x198
> > > [  369.859459]  ksys_write+0x61/0xd0
> > > [  369.859462]  __x64_sys_write+0x1a/0x1c
> > > [  369.859464]  do_syscall_64+0x32/0x45
> > > [  369.859467]  entry_SYSCALL_64_after_hwframe+0x44/0x0
> > > [  369.859469] RIP: 0033:0x7f4781f323a3
> > > [  369.859471] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246
> > > ORIG_RAX: 0000000000000001
> > > [  369.859474] RAX: ffffffffffffffda RBX: 0000000000000092 RCX:
> > > 00007f4781f323a3
> > > [  369.859475] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI:
> > > 0000000000000003
> > > [  369.859477] RBP: 00007ffcf8220530 R08: 0000000000000000 R09:
> > > 0000000000000001
> > > [  369.859478] R10: 0000000000400ad2 R11: 0000000000000246 R12:
> > > 0000000000000092
> > > [  369.859480] R13: 0000000002361580 R14: 0000000000000092 R15:
> > > 00007f4782201720
> > > [  369.859495] INFO: task CGenericAsyncFi:1927 blocked for more
> > > than 122 seconds.
> > > [  369.859497]       Not tainted 5.10.4-00021-gde017294a539-dirty
> > > #3
> > > [  369.859498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [  369.859499] task:CGenericAsyncFi state:D stack:    0 pid: 1927
> > > ppid:  1540 flags:0x20020000
> > > [  369.859501] Call Trace:
> > > [  369.859504]  __schedule+0x3db/0x7ee
> > > [  369.859506]  schedule+0x68/0xdc
> > > [  369.859508]  io_schedule+0x46/0x6d
> > > [  369.859510]  wait_on_page_bit_common+0xf8/0x325
> > > [  369.859512]  ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
> > > [  369.859515]  wait_on_page_bit+0x3f/0x41
> > > [  369.859517]  wait_on_page_writeback.part.0+0x35/0x80
> > > [  369.859519]  wait_on_page_writeback+0x27/0x29
> > > [  369.859521]  __filemap_fdatawait_range+0x88/0xd9
> > > [  369.859523]  filemap_write_and_wait_range+0x45/0x80
> > > [  369.859525]  nfs_wb_all+0x30/0x13d
> > > [  369.859528]  nfs4_file_flush+0x85/0xc0
> > > [  369.859530]  filp_close+0x3e/0x78
> > > [  369.859532]  __close_fd+0x23/0x30
> > > [  369.859534]  __ia32_sys_close+0x22/0x50
> > > [  369.859536]  __do_fast_syscall_32+0x5f/0x8e
> > > [  369.859538]  do_fast_syscall_32+0x3d/0x80
> > > [  369.859540]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
> > > [  369.859541] RIP: 0023:0xf7ee6549
> > > [  369.859543] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286
> > > ORIG_RAX: 0000000000000006
> > > [  369.859545] RAX: ffffffffffffffda RBX: 000000000000004d RCX:
> > > 0000000000000002
> > > [  369.859546] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI:
> > > 000000000000004d
> > > [  369.859547] RBP: 000000008c5fe110 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [  369.859548] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000000
> > > [  369.859549] R13: 0000000000000000 R14: 0000000000000000 R15:
> > > 0000000000000000
> > > [  394.423093] nfs: server loom not responding, still trying
> > >
> > > --
> > > NULL && (void)
>
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4984 El Camino Real, Suite 208
> Los Altos, CA 94022
> http://www.hammer.space
>
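
[Editorial note: nearly every blocked task in the dump above is parked in
__rpc_execute beneath a synchronous nfs4_proc_getattr (or statfs) call,
waiting on a server reply that never comes. When reading a long hung-task
dump like this, a small throwaway parser that groups D-state tasks by the
innermost real nfs_*/nfs4_* frame makes the common blocking path obvious.
This is a hypothetical helper, not part of any kernel tooling; it assumes
only the dmesg format shown above.]

```python
import re
from collections import Counter

def blocked_nfs_paths(dmesg: str) -> Counter:
    """Count D-state tasks in a hung-task dump by the innermost
    nfs_*/nfs4_* call-trace frame.

    Lines containing ' ? ' are stale stack entries printed by the
    unwinder, so they are skipped; only the first real NFS frame
    after each 'task:... state:D' header is recorded.
    """
    counts = Counter()
    in_task, entry = False, None
    for line in dmesg.splitlines():
        if re.search(r'task:\S+.*state:D', line):
            if in_task:
                counts[entry or '(no nfs frame)'] += 1
            in_task, entry = True, None
        elif in_task and entry is None and ' ? ' not in line:
            m = re.search(r'\b(nfs4?_\w+)\+0x', line)
            if m:
                entry = m.group(1)
    if in_task:
        counts[entry or '(no nfs frame)'] += 1
    return counts
```

Run over the dump above it would show almost every task funnelling through
nfs4_do_call_sync, with the remainder stuck flushing pages in nfs_wb_all.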

2021-04-01 18:13:49

by J. Bruce Fields

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On Thu, Apr 01, 2021 at 02:33:02PM +0100, Nix wrote:
> [Sorry about the huge delay: your replies got accidentally marked as
> read in a MUA snafu. I'll be getting some more debugging dumps -- and
> seeing if this still happens! -- when I reboot into 5.11 this weekend.]
>
> On 24 Feb 2021, [email protected] said:
>
> > On Tue, Feb 23, 2021 at 11:58:51PM +0000, Trond Myklebust wrote:
> >> On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
> >> > On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
> >> > > Relevant exports, from /proc/fs/nfs/exports:
> >> > >
> >> > > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,s
> >> > > ync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033
> >> > > :8c827c0e:989cf534,sec=390003:390004:390005:1)
> >> > > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,syn
> >> > > c,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,
> >> > > sec=1)
> >>
> >> Isn't that trying to export the same filesystem as '/' on the line
> >> above using conflicting export options?
>
> Hmm. I don't actually have a / mount in /etc/exports (and haven't had
> one since I finished building this machine, last year), and looking at
> /proc/fs/nfs/exports on the server now, it's not there.

Right, but even though you're not exporting /, you're exporting
/home/.loom.srvr.nix, and that's on the same filesystem as /, isn't it?

--b.

>
> The 'v4root' makes me wonder if this was automatically synthesised?
>
> Exports explicitly named in /etc/exports for that machine are:
>
> /home/.loom.srvr.nix -rw,no_subtree_check,sync fold.srvr.nix(root_squash) mutilate(no_root_squash) silk(no_root_squash)
> /.nfs/nix/Graphics/Private -no_root_squash,sync,no_subtree_check mutilate(rw) silk(rw)
> /.nfs/nix/share/phones -root_squash,async,no_subtree_check fold.srvr.nix(rw) mutilate(rw) silk(rw)
> /.nfs/nix/.cache -root_squash,async,no_subtree_check fold.srvr.nix(rw) mutilate(rw) silk(rw)
> /.nfs/nix/Mail/nnmh -root_squash,async,no_subtree_check fold.srvr.nix(ro) mutilate(rw) silk(rw)
> /.nfs/compiler/.ccache -root_squash,async,no_subtree_check mutilate(rw) silk(rw)
> /.nfs/compiler/.cache -root_squash,async,no_subtree_check fold.srvr.nix(rw) mutilate(rw) silk(rw)
> /usr/doc -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
> /usr/info -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
> /usr/share/texlive -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /usr/share/xemacs -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /usr/share/xplanet -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
> /usr/share/nethack -root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /pkg/non-free -no_root_squash,no_subtree_check,async mutilate(rw) chronix.wkstn.nix(rw) silk(rw)
> /usr/lib/X11/fonts -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
> /usr/share/wine -root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /usr/share/clamav -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
> /usr/share/emacs/site-lisp -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /usr/archive -root_squash,async,subtree_check mutilate(rw,root_squash,insecure) cinema.srvr.nix(ro,root_squash,insecure) chronix.wkstn.nix(ro) silk(rw,root_squash,insecure)
> /usr/archive/music/Pete -root_squash,async,subtree_check mutilate(rw,root_squash,insecure) cinema.srvr.nix(ro,root_squash,insecure) chronix.wkstn.nix(ro) silk(rw,root_squash,insecure)
> /var/log.real -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
> /etc/shai-hulud -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /usr/src -no_root_squash,no_subtree_check,async mutilate(rw) oracle.vm.nix(rw) 192.168.20.0/24(ro) scratch.vm.nix(rw) ubuntu.vm.nix(rw) cinema.srvr.nix(rw,root_squash) chronix.wkstn.nix(rw,root_squash) silk(rw)
> /var/cache/CPAN -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
> /usr/share/flightgear -root_squash,async,no_subtree_check mutilate(ro) silk(ro)
> /usr/local/tmp/encoding/mkv -no_subtree_check,root_squash,ro mutilate.wkstn.nix silk.wkstn.nix
> /pkg/non-free/steam -rw,subtree_check,root_squash mutilate.wkstn.nix loom.srvr.nix chronix.wkstn.nix silk.wkstn.nix(insecure)
> /.transient/workstations/silk -no_root_squash,async,subtree_check silk(rw)
> /trees/mirrors/mutilate silk.wkstn.nix(no_root_squash,async,no_subtree_check,rw)
> /trees/mirrors/silk silk.wkstn.nix(no_root_squash,async,no_subtree_check,rw)
>
> ... and, well, it's a bit of a mess and there are a lot of them, but
> there are no nested mounts there, at least, not outside /usr/archive
> which is probably not relevant for this. (The /.nfs stuff is
> particularly ugly: if I try to export e.g. /home/nix/.cache rather than
> /.nfs/nix/.cache the export simply never appears: /.nfs/$foo is just
> built out of bind mounts for this purpose, but then /home/$name is also
> a bind mount from /home/.loom.srvr.nix/$name on this network in any
> case. /.nfs/$foo is usually bind-mounted *back* under /home on the
> client, too, but still from NFS's perspective I guess this means they
> are distinct? I hope so...)
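
[Editorial note on the conflicting-options question Trond raised earlier in
the thread: knfsd keys exports by the underlying filesystem, so every
export entry that resolves to the same filesystem — including the
synthesized NFSv4 pseudo-root — needs compatible options. A hypothetical
exports(5) sketch (hostnames and paths invented) that avoids the ambiguity
by pinning the pseudo-root down explicitly might look like:]

```
# Hypothetical /etc/exports sketch. An explicit fsid=0 entry defines the
# NFSv4 pseudo-root instead of letting nfs-utils synthesize one with its
# own defaults; entries on the same filesystem should agree on options
# such as sync/async and wdelay/no_wdelay.
/               client.example(ro,sync,wdelay,no_subtree_check,fsid=0)
/home/exported  client.example(rw,sync,wdelay,no_subtree_check)
```

[Whether that would clear the hang is a separate question; it only removes
the conflicting-options ambiguity.]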

2021-04-01 19:27:10

by Nix

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

[Sorry about the huge delay: your replies got accidentally marked as
read in a MUA snafu. I'll be getting some more debugging dumps -- and
seeing if this still happens! -- when I reboot into 5.11 this weekend.]

On 24 Feb 2021, [email protected] said:

> On Tue, Feb 23, 2021 at 11:58:51PM +0000, Trond Myklebust wrote:
>> On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
>> > On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
>> > > Relevant exports, from /proc/fs/nfs/exports:
>> > >
>> > > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,s
>> > > ync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033
>> > > :8c827c0e:989cf534,sec=390003:390004:390005:1)
>> > > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,syn
>> > > c,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,
>> > > sec=1)
>>
>> Isn't that trying to export the same filesystem as '/' on the line
>> above using conflicting export options?

Hmm. I don't actually have a / mount in /etc/exports (and haven't had
one since I finished building this machine, last year), and looking at
/proc/fs/nfs/exports on the server now, it's not there.

The 'v4root' makes me wonder if this was automatically synthesised?

Exports explicitly named in /etc/exports for that machine are:

/home/.loom.srvr.nix -rw,no_subtree_check,sync fold.srvr.nix(root_squash) mutilate(no_root_squash) silk(no_root_squash)
/.nfs/nix/Graphics/Private -no_root_squash,sync,no_subtree_check mutilate(rw) silk(rw)
/.nfs/nix/share/phones -root_squash,async,no_subtree_check fold.srvr.nix(rw) mutilate(rw) silk(rw)
/.nfs/nix/.cache -root_squash,async,no_subtree_check fold.srvr.nix(rw) mutilate(rw) silk(rw)
/.nfs/nix/Mail/nnmh -root_squash,async,no_subtree_check fold.srvr.nix(ro) mutilate(rw) silk(rw)
/.nfs/compiler/.ccache -root_squash,async,no_subtree_check mutilate(rw) silk(rw)
/.nfs/compiler/.cache -root_squash,async,no_subtree_check fold.srvr.nix(rw) mutilate(rw) silk(rw)
/usr/doc -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
/usr/info -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
/usr/share/texlive -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/usr/share/xemacs -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/usr/share/xplanet -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
/usr/share/nethack -root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/pkg/non-free -no_root_squash,no_subtree_check,async mutilate(rw) chronix.wkstn.nix(rw) silk(rw)
/usr/lib/X11/fonts -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
/usr/share/wine -root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/usr/share/clamav -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
/usr/share/emacs/site-lisp -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/usr/archive -root_squash,async,subtree_check mutilate(rw,root_squash,insecure) cinema.srvr.nix(ro,root_squash,insecure) chronix.wkstn.nix(ro) silk(rw,root_squash,insecure)
/usr/archive/music/Pete -root_squash,async,subtree_check mutilate(rw,root_squash,insecure) cinema.srvr.nix(ro,root_squash,insecure) chronix.wkstn.nix(ro) silk(rw,root_squash,insecure)
/var/log.real -root_squash,no_subtree_check,async mutilate(ro) silk(ro)
/etc/shai-hulud -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/usr/src -no_root_squash,no_subtree_check,async mutilate(rw) oracle.vm.nix(rw) 192.168.20.0/24(ro) scratch.vm.nix(rw) ubuntu.vm.nix(rw) cinema.srvr.nix(rw,root_squash) chronix.wkstn.nix(rw,root_squash) silk(rw)
/var/cache/CPAN -no_root_squash,no_subtree_check,async mutilate(rw) silk(rw)
/usr/share/flightgear -root_squash,async,no_subtree_check mutilate(ro) silk(ro)
/usr/local/tmp/encoding/mkv -no_subtree_check,root_squash,ro mutilate.wkstn.nix silk.wkstn.nix
/pkg/non-free/steam -rw,subtree_check,root_squash mutilate.wkstn.nix loom.srvr.nix chronix.wkstn.nix silk.wkstn.nix(insecure)
/.transient/workstations/silk -no_root_squash,async,subtree_check silk(rw)
/trees/mirrors/mutilate silk.wkstn.nix(no_root_squash,async,no_subtree_check,rw)
/trees/mirrors/silk silk.wkstn.nix(no_root_squash,async,no_subtree_check,rw)

... and, well, it's a bit of a mess and there are a lot of them, but
there are no nested mounts there, at least, not outside /usr/archive
which is probably not relevant for this. (The /.nfs stuff is
particularly ugly: if I try to export e.g. /home/nix/.cache rather than
/.nfs/nix/.cache the export simply never appears: /.nfs/$foo is just
built out of bind mounts for this purpose, but then /home/$name is also
a bind mount from /home/.loom.srvr.nix/$name on this network in any
case. /.nfs/$foo is usually bind-mounted *back* under /home on the
client, too, but still from NFS's perspective I guess this means they
are distinct? I hope so...)
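(Incidentally, whether two such paths actually land on the same
underlying filesystem, which is what matters for export identity, can
be checked by comparing device numbers. A minimal sketch, assuming GNU
coreutils stat; the helper name is mine:)

```shell
#!/bin/sh
# Hypothetical helper: succeed if two paths live on the same underlying
# filesystem (same st_dev). A bind mount shares its source's device
# number, so paths joined by bind mounts are the *same* fs to NFS.
same_fs() {
    [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# e.g. on the setup described above one would expect
#   same_fs /home/nix/.cache /.nfs/nix/.cache
# to succeed, since /.nfs/nix/.cache is a bind mount of the former.
```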

2021-04-01 21:53:11

by Nix

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On 1 Apr 2021, [email protected] outgrape:

> On Thu, Apr 01, 2021 at 02:33:02PM +0100, Nix wrote:
>> [Sorry about the huge delay: your replies got accidentally marked as
>> read in a MUA snafu. I'll be getting some more debugging dumps -- and
>> seeing if this still happens! -- when I reboot into 5.11 this weekend.]
>>
>> On 24 Feb 2021, [email protected] said:
>>
>> > On Tue, Feb 23, 2021 at 11:58:51PM +0000, Trond Myklebust wrote:
>> >> On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
>> >> > On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
>> >> > > Relevant exports, from /proc/fs/nfs/exports:
>> >> > >
>> >> > > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,s
>> >> > > ync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033
>> >> > > :8c827c0e:989cf534,sec=390003:390004:390005:1)
>> >> > > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,syn
>> >> > > c,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,
>> >> > > sec=1)
>> >>
>> >> Isn't that trying to export the same filesystem as '/' on the line
>> >> above using conflicting export options?
>>
>> Hmm. I don't actually have a / mount in /etc/exports (and haven't had
>> one since I finished building this machine, last year), and looking at
>> /proc/fs/nfs/exports on the server now, it's not there.
>
> Right, but even though you're not exporting /, you're exporting
> /home/.loom.srvr.nix, and that's on the same filesystem as /, isn't it?

Yes, but I'm *not* exporting /. (I just checked my backups, and no such
export existed at the time I sent the original mail, nor was I importing
it on the client).

This export is presumably automatically generated, and likely indicates
nothing more than that I am exporting from different subtrees off the
root (which I am: various subdirectories of /home, /usr, /trees,
/.transient, /.nfs, and /pkg are exported).

2021-04-02 19:21:46

by J. Bruce Fields

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On Thu, Apr 01, 2021 at 10:52:47PM +0100, Nix wrote:
> On 1 Apr 2021, [email protected] outgrape:
>
> > On Thu, Apr 01, 2021 at 02:33:02PM +0100, Nix wrote:
> >> [Sorry about the huge delay: your replies got accidentally marked as
> >> read in a MUA snafu. I'll be getting some more debugging dumps -- and
> >> seeing if this still happens! -- when I reboot into 5.11 this weekend.]
> >>
> >> On 24 Feb 2021, [email protected] said:
> >>
> >> > On Tue, Feb 23, 2021 at 11:58:51PM +0000, Trond Myklebust wrote:
> >> >> On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
> >> >> > On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
> >> >> > > Relevant exports, from /proc/fs/nfs/exports:
> >> >> > >
> >> >> > > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,s
> >> >> > > ync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033
> >> >> > > :8c827c0e:989cf534,sec=390003:390004:390005:1)
> >> >> > > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,syn
> >> >> > > c,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,
> >> >> > > sec=1)
> >> >>
> >> >> Isn't that trying to export the same filesystem as '/' on the line
> >> >> above using conflicting export options?
> >>
> >> Hmm. I don't actually have a / mount in /etc/exports (and haven't had
> >> one since I finished building this machine, last year), and looking at
> >> /proc/fs/nfs/exports on the server now, it's not there.
> >
> > Right, but even though you're not exporting /, you're exporting
> > /home/.loom.srvr.nix, and that's on the same filesystem as /, isn't it?
>
> Yes, but I'm *not* exporting /. (I just checked my backups, and no such
> export existed at the time I sent the original mail, nor was I importing
> it on the client).
>
> This export is presumably automatically generated, and likely indicates
> nothing more than that I am exporting from different subtrees off the
> root (which I am: various subdirectories of /home, /usr, /trees,
> /.transient, /.nfs, and /pkg are exported).

Right, the "/" export is automatically generated.

I don't have any real hypothesis here, I'm just thinking if the server's
failing to reply, one possible culprit is that it's waiting for
rpc.mountd. The server depends on rpc.mountd for export information.
And it sounds like you have a relatively complicated export setup, so
there's perhaps more that could go wrong there.

Might be worth, on the server:

rpcdebug -m rpc -s cache

and then dumping the contents of all the files /proc/net/rpc/*/content.
That should show us whether the server's waiting on userspace helpers
for anything.
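In script form, that collection step might look something like this (a
sketch: the cache directory is parameterized so it is easy to try
against a copy, but on a live server it would be /proc/net/rpc, and
rpcdebug needs root):

```shell
#!/bin/sh
# Sketch: dump every RPC cache's content file under the given
# directory, with a header per file so the output is attributable.
dump_rpc_caches() {
    dir=${1:-/proc/net/rpc}
    for f in "$dir"/*/content; do
        [ -r "$f" ] || continue
        printf '==> %s <==\n' "$f"
        cat "$f"
    done
}

# On the server, as root:
#   rpcdebug -m rpc -s cache
#   dump_rpc_caches /proc/net/rpc
```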

Sorry, did you say whether nfsd threads or rpc.mountd are blocked?

--b.

2021-04-03 22:46:16

by Nix

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On 2 Apr 2021, [email protected] said:
> Sorry, did you say whether nfsd threads or rpc.mountd are blocked?

I'm not sure. I will endeavour to find out (I think I did look to see if
anything was blocked, but I can't remember the result: it's been a long
time now).

I'll probably do some more debugging on Monday.

2021-04-05 18:12:11

by Nix

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

On 2 Apr 2021, [email protected] said:
> Sorry, did you say whether nfsd threads or rpc.mountd are blocked?

... just about to switch into debugging this, but it does seem to me
that if nfsd threads or (especially) mountd on the server side were
blocked, I'd see misbehaviour with mounts from every client, not just a
few of them. This doesn't happen.

While this is going on, my firewall and other clients not engaging in
the problematic Steam-related activity can talk NFSv4 to the server
perfectly happily: indeed this is actually a problem when debugging
because I have to quiesce the bloody things as much as I can to stop
their RPC traffic flooding the log with irrelevant junk :)

Recovery from this consists only of rebooting the stuck client: the
server and all other clients don't need touching (indeed, I'm typing
this in an emacs on that server, and since it was last rebooted it's
been hit by a client experiencing this hang at least five times; the
mailserver keeps its mailspool on that server too, and there have been
no problems there either).

(The server also has fairly silly amounts of RAM compared to the load
it's placed under. I'm not concerned about the possibility of rpc.mountd
getting swapped out. It just doesn't happen. Even things like git gc of
the entire Chromium git repo proceed without swapping.)


btw, the filesystem layout on this machine is, in part:

/dev/main/root xfs 4294993920 738953092 3556040828 18% /
/dev/mapper/main-vms xfs 1073231868 406045460 667186408 38% /vm
/dev/mapper/main-steam ext4 1055852896 85367140 916781564 9% /pkg/non-free/steam
/dev/mapper/main-archive xfs 3219652608 2761922796 457729812 86% /usr/archive
/dev/mapper/main-pete xfs 2468405656 2216785448 251620208 90% /usr/archive/music/Pete
/dev/mapper/main-phones xfs 52411388 4354092 48057296 9% /.nfs/nix/share/phones
/dev/mapper/main-unifi xfs 10491804 1130228 9361576 11% /var/lib/unifi
/dev/mapper/oracrypt-plain 2147510784 144030636 2003480148 7% /home/oranix/oracle/private

... and you'll note that the exported fs I'm seeing hangs on is actually
*not* the $HOME on the root fs: it's /pkg/non-free/steam, which is ext4
purely because so many games on x86 still fail horribly when 64-bit
inodes are used, and ext4 can emit 32-bit inodes on biggish fses without
horrible performance consequences, unlike xfs. The relevant import line:

loom:/pkg/non-free/steam/.steam /home/nix/.steam nfs defaults,_netdev

(so it is imported to a *subdirectory* of a directory which is a mounted
NFS export, and *that* one is exported from /). The hang also happens
when using nfusr as the NFS client for the .steam import, so whatever it
is isn't just down to the client...
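(For reference, which NFS mounts a given client path sits under can be
read straight out of /proc/self/mounts. A rough sketch: the helper name
is mine, the mounts file is parameterized for testing, and the prefix
match is naive about cases like /home/ni vs /home/nix:)

```shell
#!/bin/sh
# Sketch: list NFS mounts whose mount point is a prefix of the given
# path, reading a /proc/self/mounts-format file.
nfs_mounts_over() {
    path=$1; mounts=${2:-/proc/self/mounts}
    awk -v p="$path" '$3 ~ /^nfs/ && index(p, $2) == 1 { print $2, $3 }' "$mounts"
}

# e.g. nfs_mounts_over /home/nix/.steam would be expected to show both
# the /home import and the nested .steam import described above.
```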

The primary reason I'm using one big fs for almost everything on this
server build is, uh, NFSv4. My last machine had lots of little
filesystems, and the result somehow confused the NFS pseudoroot
construction process so badly that most of the things I tried to export
never appeared on NFSv4, only on v3: only those exports which *were* on
the root filesystem were ever available for NFSv4 mounting, so I was
stuck with v3 on that machine. At (IIRC) Chuck Lever's suggestion (many
years ago, so he probably won't remember) I varied things when I built a
new server and was happy to find that with a less baroque setup and a
bigger rootfs with more stuff on it, NFSv4 seemed perfectly happy and
the pseudoroot was populated fine.

OK let's collect some logs so we're not reasoning in the absence of data
any more. Back soon! (I hope.)

2021-04-06 05:06:29

by Nix

[permalink] [raw]
Subject: Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)

[linux-net added: a stubborn NFS client hang doing uncompression of some
Steam games over NFS now looks like it might be TCP-stack or even
network-device-related rather than NFS-related]

On 24 Feb 2021, [email protected] said:

> On Tue, Feb 23, 2021 at 11:58:51PM +0000, Trond Myklebust wrote:
>> On Tue, 2021-02-23 at 17:57 -0500, J. Bruce Fields wrote:
>> > On Sun, Jan 03, 2021 at 02:27:51PM +0000, Nick Alcock wrote:
>> > > Relevant exports, from /proc/fs/nfs/exports:
>> > >
>> > > /       192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,s
>> > > ync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033
>> > > :8c827c0e:989cf534,sec=390003:390004:390005:1)
>> > > /home/.loom.srvr.nix    *.srvr.nix,fold.srvr.nix(rw,root_squash,syn
>> > > c,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,
>> > > sec=1)
>>
>> Isn't that trying to export the same filesystem as '/' on the line
>> above using conflicting export options?
>
> Yes, and exporting something on the root filesystem is generally a bad
> idea for the usual reasons.
>
> I don't see any explanation for the write hang there, though.
>
> Unless maybe if it leaves mountd unable to answer an upcall for some
> reason, hm.
>
> I think that would show up in /proc/net/rpc/nfsd.fh/content. Try
> cat'ing that file after 'rpcdebug -m rpc -s cache' and that should show
> if there's an upcall that's not getting answered.

OK, I got some debugging: this is still happening on 5.11, but it's
starting to look more like a problem below NFS.

There are no unanswered upcalls, and though I did eventually manage to
get a multimegabyte pile of NFS debugging info (after fighting with the
problem that the Steam client has internal timeouts that are silently
exceeded and break things if you just leave debugging on)... I suspect I
don't need to bother you with it, because the packet capture was more
informative.

I have a complete capture of everything on the wire from the moment
Steam started, but it's nearly 150MiB xz'ed and includes a lot of boring
nonsense related to Steam's revalidation of everything because the last
startup crashed: it probably also includes things like my Steam account
password, getdents of my home directory etc so if you want that I can
send you the details privately: but this capture of only the last few
thousand NFS packets is interesting enough:
<http://www.esperi.org.uk/~nix/temporary/nfs-trouble.cap>.

Looking at this in Wireshark, everything goes fine (ignoring the usual
offloading-associated checksum whining) until packet 1644, when we
suddenly start seeing out-of-order packets and retransmissions (on an
otherwise totally idle 10GbE subnet). They get more and more common, and
at packet 1711 the client loses patience and hits the host with an RST.
This is... not likely to do NFS any good at all, and with $HOME served
over the same connection I'm not surprised the client goes fairly
unresponsive after that and the hangcheck timer starts screaming about
processes blocked on I/O to some of the odder filesystems I'm getting
over NFS on that client, like /var/tmp and /var/account/acct (so, yes,
all process termination likely blocks forever on this). So it looks like
NFS is being betrayed by TCP and/or some lower layer. (Also, NFS doesn't
seem to try too hard to recover -- or perhaps, when it does, the new
session goes just as wrong).
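(For anyone following along with the capture: the suspicious packets
can be pulled out without clicking through Wireshark. A sketch; the
grep works on any one-line-per-packet text export, and the tshark
filter in the comment is the direct equivalent on the capture file:)

```shell
#!/bin/sh
# Sketch: pick out retransmissions, out-of-order packets and RSTs from
# a one-line-per-packet summary (e.g. tshark's default text output).
tcp_trouble() {
    grep -E 'Retransmission|Out-Of-Order|\[RST' "$@"
}

# With tshark installed, the direct query on the capture above would be
# something like:
#   tshark -r nfs-trouble.cap -Y \
#     'tcp.analysis.retransmission || tcp.analysis.out_of_order || tcp.flags.reset == 1'
```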

The socket statistics on the server report 22 rx drops since boot (out
of 11903K) and no errors of any kind. Possibly each of these drops
corresponds to one of the test runs I've been doing, but that would
mean no more than one or two drops per test (I must have crashed the
client over ten times trying to get dumps), which, if true, seems more
fragile than I'd expect NFS to be :) so they might well just be
coincidental.

I simplified my network for this test run, so the link has literally no
hosts on it right now other than the NFS server and client at issue and
a 10GbE Netgear GS110EMX switch, which reports zero errors and no
packets dropped since it was rebooted months ago. Both ends are the same
Intel X540-AT2 driven by ixgbe. The network cabling has never given me
the least cause for concern, and I've never seen things getting hit by
RSTs when talking to this host before, and I've certainly never seen
*NFS* getting an RST. It happens consistently, anyway, which argues
strongly against the cabling :)
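(To quantify drops at the NIC rather than the socket layer, something
like this sketch over ethtool -S style "name: value" output might
help; the interface name in the comment is an assumption, and counter
names vary by driver:)

```shell
#!/bin/sh
# Sketch: print any non-zero drop/error/discard counters from
# "name: value" lines on stdin. Exits 0 if any were found, 1 if clean,
# so it can gate further debugging in a script.
nic_trouble() {
    awk -F': *' '/(drop|error|discard)/ && $2+0 > 0 { print; bad=1 }
                 END { exit bad ? 0 : 1 }'
}

# Live usage on the server (interface name is a guess):
#   ethtool -S eth2 | nic_trouble
```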

Neither client nor server are using any sort of iptables (it's not even
compiled in), and while the server is doing a bit of policy-based
routing and running a couple of bridges for VMs it's not doing anything
that should cause this (and this was replicated when the bridges had
nothing on them other than the host in any case). Syncookies are
compiled out on both hosts (they're still on on the firewall, but that's
not participating in this and isn't even on the same subnet).

Just in case: complete network-related sysctls changed from default on
the client:

net.ipv6.conf.default.keep_addr_on_down = 1

and on the server:

net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv4.ping_group_range = 73 73
net.core.bpf_jit_enable = 1

None of these seem too likely to cause *this*. I wonder how I could
track down where this mess is coming from...