On Wed, 2021-03-03 at 08:10 -0500, Benjamin Coddington wrote:
> Hi Trond,
>
> I'd like to go back to setting sk->sk_allocation = GFP_NOIO (see
> a1231fda7e94). That would cover sync tasks as well as async, but I'm
> not sure what memalloc_nofs_save/restore gives us, or whether we
> should just apply that to all tasks.
>
> We're getting some folks deadlocked on the xprt_sending queue with
> stacks like:
>
> #0 [ffffacab45f17108] __schedule at ffffffffae4d1826
> #1 [ffffacab45f171a0] schedule at ffffffffae4d1cb8
> #2 [ffffacab45f171b0] rpc_wait_bit_killable at ffffffffc067d44e [sunrpc]
> #3 [ffffacab45f171c8] __wait_on_bit at ffffffffae4d216c
> #4 [ffffacab45f17200] out_of_line_wait_on_bit at ffffffffae4d2211
> #5 [ffffacab45f17250] __rpc_execute at ffffffffc067f3fc [sunrpc]
> #6 [ffffacab45f172a8] rpc_run_task at ffffffffc06732c4 [sunrpc]
> #7 [ffffacab45f172e8] nfs4_proc_layoutreturn at ffffffffc08f5d44 [nfsv4]
> #8 [ffffacab45f17388] pnfs_send_layoutreturn at ffffffffc091946e [nfsv4]
> #9 [ffffacab45f173d8] _pnfs_return_layout at ffffffffc091ba8b [nfsv4]
> #10 [ffffacab45f17450] nfs4_evict_inode at ffffffffc0906a05 [nfsv4]
> #11 [ffffacab45f17460] evict at ffffffffadef8592
> #12 [ffffacab45f17480] dispose_list at ffffffffadef86a8
> #13 [ffffacab45f174a0] prune_icache_sb at ffffffffadef99a2
> #14 [ffffacab45f174c8] super_cache_scan at ffffffffadede183
> #15 [ffffacab45f17518] do_shrink_slab at ffffffffade3d5d8
> #16 [ffffacab45f17588] shrink_slab at ffffffffade3dab5
> #17 [ffffacab45f17608] shrink_node at ffffffffade42a8c
> #18 [ffffacab45f17678] do_try_to_free_pages at ffffffffade42e43
> #19 [ffffacab45f176c8] try_to_free_pages at ffffffffade431c8
> #20 [ffffacab45f17768] __alloc_pages_slowpath at ffffffffade81981
> #21 [ffffacab45f17868] __alloc_pages_nodemask at ffffffffade82555
> #22 [ffffacab45f178c8] skb_page_frag_refill at ffffffffae31bea7
> #23 [ffffacab45f178e0] sk_page_frag_refill at ffffffffae31c71d
> #24 [ffffacab45f178f8] tcp_sendmsg_locked at ffffffffae3cbe65
> #25 [ffffacab45f179a0] tcp_sendmsg at ffffffffae3cc8f7
> #26 [ffffacab45f179c0] sock_sendmsg at ffffffffae317cce
> #27 [ffffacab45f179d8] xs_sendpages at ffffffffc0679741 [sunrpc]
> #28 [ffffacab45f17ac8] xs_tcp_send_request at ffffffffc067adb4 [sunrpc]
> #29 [ffffacab45f17b20] xprt_transmit at ffffffffc067674c [sunrpc]
> #30 [ffffacab45f17b90] call_transmit at ffffffffc0672064 [sunrpc]
> #31 [ffffacab45f17ba0] __rpc_execute at ffffffffc067f365 [sunrpc]
> #32 [ffffacab45f17bf8] rpc_run_task at ffffffffc06732c4 [sunrpc]
> #33 [ffffacab45f17c38] nfs4_call_sync_custom at ffffffffc08e50bb [nfsv4]
> #34 [ffffacab45f17c48] nfs4_call_sync_sequence at ffffffffc08e5143 [nfsv4]
> #35 [ffffacab45f17cb8] _nfs4_proc_getattr at ffffffffc08e7f08 [nfsv4]
> #36 [ffffacab45f17d78] nfs4_proc_getattr at ffffffffc08f200a [nfsv4]
> #37 [ffffacab45f17de8] __nfs_revalidate_inode at ffffffffc08741d7 [nfs]
> #38 [ffffacab45f17e18] nfs_getattr at ffffffffc0874458 [nfs]
> #39 [ffffacab45f17e60] vfs_statx_fd at ffffffffadedf8a4
> #40 [ffffacab45f17e98] __do_sys_newfstat at ffffffffadedfedd
> #41 [ffffacab45f17f38] do_syscall_64 at ffffffffadc0419b
> #42 [ffffacab45f17f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
>     RIP: 00007f721ddcdd37  RSP: 00007ffc0d54cab8  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: 00007f721e09b3c0  RCX: 00007f721ddcdd37
>     RDX: 00007ffc0d54cac0  RSI: 00007ffc0d54cac0  RDI: 0000000000000001
>     RBP: 00007f721e09f6c0  R8:  00007f721f87cf00  R9:  0000000000000000
>     R10: 00007ffc0d54a32a  R11: 0000000000000246  R12: 00007f721e09b3c0
>     R13: 000055606b81376e  R14: 0000000000000013  R15: 00007f721e09b3c0
>     ORIG_RAX: 0000000000000005  CS: 0033  SS: 002b
>
> Ben
>
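To make the alternatives concrete: the per-socket option amounts to
marking the transport socket itself, along these lines (a rough sketch;
xs_tcp_finish_connecting() in net/sunrpc/xprtsock.c is where the socket
callbacks get set up):

	/* Per-socket variant: every allocation done on behalf of this
	 * socket then avoids recursing into reclaim-driven I/O. Note
	 * that GFP_NOIO is stronger than GFP_NOFS: it forbids block
	 * I/O during reclaim as well, not just filesystem re-entry.
	 */
	sk->sk_allocation = GFP_NOIO;

memalloc_nofs_save()/memalloc_nofs_restore(), by contrast, is per-task,
so it only protects the tasks we remember to wrap.
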
Please just wrap that call to __rpc_execute() in rpc_execute() with a
memalloc_nofs_save()/memalloc_nofs_restore() pair. That should cause
the mm layer to do the correct thing here, and prevent re-entry into
the NFS code.
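
Something like this (a minimal sketch, assuming the current shape of
rpc_execute() in net/sunrpc/sched.c):

	#include <linux/sched/mm.h>	/* memalloc_nofs_save/restore */

	void rpc_execute(struct rpc_task *task)
	{
		bool is_async = RPC_IS_ASYNC(task);

		rpc_set_active(task);
		rpc_make_runnable(rpciod_workqueue, task);
		if (!is_async) {
			/* Mark the synchronous caller PF_MEMALLOC_NOFS
			 * so that any allocation under __rpc_execute()
			 * (including the sk_page_frag_refill() in the
			 * trace above) is treated as GFP_NOFS and
			 * cannot recurse into NFS inode eviction.
			 */
			unsigned int pflags = memalloc_nofs_save();

			__rpc_execute(task);
			memalloc_nofs_restore(pflags);
		}
	}

The async path already runs __rpc_execute() under the same save/restore
pair in rpciod, so only the synchronous call here needs the wrapper.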
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]