LinuxLists.cc - CPU stall, eventual host hang with BTRFS + NFS under heavy load

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Forgot to add -- sometimes, right before the core stall and backtrace, we see messages similar to the following:

[16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2e0c9b7a
[16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2f0c9b7a
[16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 300c9b7a

We're not sure if they are related or not.

----- Original Message -----
> From: "Timothy Pearson" <[email protected]>
> To: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>
> Cc: "linux-nfs" <[email protected]>
> Sent: Monday, July 5, 2021 4:44:29 AM
> Subject: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> We've been dealing with a fairly nasty NFS-related problem off and on for the
> past couple of years. The host is a large POWER server with several external
> SAS arrays attached, using BTRFS for cold storage of large amounts of data.
> The main symptom is that under heavy sustained NFS write traffic using certain
> file types (see below) a core will suddenly lock up, continually spewing a
> backtrace similar to the one I've pasted below. While this immediately halts
> all NFS traffic to the affected client (which is never the same client as the
> machine doing the large file transfer), the larger issue is that over the next
> few minutes / hours the entire host will gradually degrade in responsiveness
> until it grinds to a complete halt. Once the core stall occurs we have been
> unable to find any way to restore the machine to full functionality or avoid
> the degradation and eventual hang short of a hard power down and restart.
>
> Tens of GB of compressed data in a single file seems to be fairly good at
> triggering the problem, whereas raw disk images or other regularly patterned
> data tend not to be. The underlying hardware is functioning perfectly with no
> problems noted, and moving the files without NFS avoids the bug.
>
> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
> transfer process on the client as soon as other clients start to show a
> slowdown. This hack avoids the bug entirely provided the host is allowed to
> catch back up prior to resuming (SIGCONT) the file transfer process. From
> this, it seems something is going very wrong within the NFS stack under high
> storage I/O pressure and high storage write latency (timeout?) -- it should
> simply pause transfers while the storage subsystem catches up, not lock up a
> core and force a host restart. Interesting, sometimes it does exactly what it
> is supposed to and does pause and wait for the storage subsystem, but around
> 20% of the time it just triggers this bug and stalls a core.
>
> This bug has been present since at least 4.14 and is still present in the latest
> 5.12.14 version.
>
> As the machine is in production, it is difficult to gather further information
> or test patches, however we would be able to apply patches to the kernel that
> would potentially restore stability with enough advance scheduling.
>
> Sample backtrace below:
>
> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> [16846.426202] rcu: 32-....: (5249 ticks this GP)
> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> [16846.426273] NMI backtrace for cpu 32
> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [16846.426406] Call Trace:
> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
> (unreliable)
> [16846.426483] [c000200010823290] [c00000000075aebc]
> nmi_cpu_backtrace+0xfc/0x150
> [16846.426506] [c000200010823310] [c00000000075b0a8]
> nmi_trigger_cpumask_backtrace+0x198/0x1f0
> [16846.426577] [c0002000108233b0] [c000000000072818]
> arch_trigger_cpumask_backtrace+0x28/0x40
> [16846.426621] [c0002000108233d0] [c000000000202db8]
> rcu_dump_cpu_stacks+0x158/0x1b8
> [16846.426667] [c000200010823470] [c000000000201828]
> rcu_sched_clock_irq+0x908/0xb10
> [16846.426708] [c000200010823560] [c0000000002141d0]
> update_process_times+0xc0/0x140
> [16846.426768] [c0002000108235a0] [c00000000022dd34]
> tick_sched_handle.isra.18+0x34/0xd0
> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> [16846.426856] [c000200010823610] [c00000000021577c]
> __hrtimer_run_queues+0x16c/0x370
> [16846.426903] [c000200010823690] [c000000000216378]
> hrtimer_interrupt+0x128/0x2f0
> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> [16846.426989] [c0002000108237a0] [c000000000016c54]
> replay_soft_interrupts+0x124/0x2e0
> [16846.427045] [c000200010823990] [c000000000016f14]
> arch_local_irq_restore+0x104/0x170
> [16846.427103] [c0002000108239c0] [c00000000017247c]
> mod_delayed_work_on+0x8c/0xe0
> [16846.427149] [c000200010823a20] [c00800000819fe04]
> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> [16846.427234] [c000200010823a40] [c0080000081a096c]
> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> [16846.427324] [c000200010823a90] [c0080000081a3080]
> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
> [nfsd]
> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0
> [sunrpc]
> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
> [sunrpc]
> [16846.427598] [c000200010823c30] [c0080000081a2b18]
> rpc_async_schedule+0x40/0x70 [sunrpc]
> [16846.427687] [c000200010823c60] [c000000000170bf0]
> process_one_work+0x290/0x580
> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
> [16846.427865] [c000200010823e10] [c00000000000d6ec]
> ret_from_kernel_thread+0x5c/0x70
> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s!
> [kworker/u130:25:10624]
> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod
> target_core_user uio target_core_pscsi target_core_file target_core_iblock
> target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
> vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi
> hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci
> xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy
> xfrm_algo mdio libphy aacraid igb raid_cl$
> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
> [16873.870073] REGS: c0002000108237d0 TRAP: 0900 Not tainted (5.12.14)
> [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
> 24004842 XER: 00000000
> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
> [16873.870507] LR [c0080000081a0708]
> rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-07-23 21:00:58

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Sorry, took me a while to get to this because I'm not sure what to
suggest:

On Mon, Jul 05, 2021 at 04:44:29AM -0500, Timothy Pearson wrote:
> We've been dealing with a fairly nasty NFS-related problem off and on
> for the past couple of years. The host is a large POWER server with
> several external SAS arrays attached, using BTRFS for cold storage of
> large amounts of data. The main symptom is that under heavy sustained
> NFS write traffic using certain file types (see below) a core will
> suddenly lock up, continually spewing a backtrace similar to the one
> I've pasted below.

Is this the very *first* backtrace you get in the log? Is there
anything suspicious right before it? (Other than the warnings you
mention in the next message?)

> While this immediately halts all NFS traffic to
> the affected client (which is never the same client as the machine
> doing the large file transfer), the larger issue is that over the next
> few minutes / hours the entire host will gradually degrade in
> responsiveness until it grinds to a complete halt. Once the core
> stall occurs we have been unable to find any way to restore the
> machine to full functionality or avoid the degradation and eventual
> hang short of a hard power down and restart.

>
> Tens of GB of compressed data in a single file seems to be fairly good at triggering the problem, whereas raw disk images or other regularly patterned data tend not to be. The underlying hardware is functioning perfectly with no problems noted, and moving the files without NFS avoids the bug.
>
> We've been using a workaround involving purposefully pausing (SIGSTOP) the file transfer process on the client as soon as other clients start to show a slowdown. This hack avoids the bug entirely provided the host is allowed to catch back up prior to resuming (SIGCONT) the file transfer process. From this, it seems something is going very wrong within the NFS stack under high storage I/O pressure and high storage write latency (timeout?) -- it should simply pause transfers while the storage subsystem catches up, not lock up a core and force a host restart. Interesting, sometimes it does exactly what it is supposed to and does pause and wait for the storage subsystem, but around 20% of the time it just triggers this bug and stalls a core.
>
> This bug has been present since at least 4.14 and is still present in the latest 5.12.14 version.
>
> As the machine is in production, it is difficult to gather further information or test patches, however we would be able to apply patches to the kernel that would potentially restore stability with enough advance scheduling.
>
> Sample backtrace below:
>
> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> [16846.426202] rcu: 32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> [16846.426273] NMI backtrace for cpu 32
> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [16846.426406] Call Trace:
> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]

I think this has to be the rpc_delay in the -NFS4ERR_DELAY case of
nfsd4_cb_sequence_done. I don't know what would cause a lockup there.
Maybe the rpc_task we've passed in is corrupted somehow?

We could try to add some instrumentation in that case. I don't think
that should be a common error case. I guess the client has probably hit
one of the NFS4ERR_DELAY cases in
fs/nfs/callback_proc.c:nfs4_callback_sequence?

There might be a way to get some more information out of this with some
tracing that you could run in production, but I'm not sure what to
suggest off the top of my head.

--b.

> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
> [16846.427687] [c000200010823c60] [c000000000170bf0] process_one_work+0x290/0x580
> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
> [16846.427865] [c000200010823e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s! [kworker/u130:25:10624]
> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod target_core_user uio target_core_pscsi target_core_file target_core_iblock target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy xfrm_algo mdio libphy aacraid igb raid_cl$
> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
> [16873.870073] REGS: c0002000108237d0 TRAP: 0900 Not tainted (5.12.14)
> [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24004842 XER: 00000000
> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
> [16873.870507] LR [c0080000081a0708] rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-07-23 21:01:49

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Mon, Jul 05, 2021 at 04:47:01AM -0500, Timothy Pearson wrote:
> Forgot to add -- sometimes, right before the core stall and backtrace, we see messages similar to the following:
>
> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2e0c9b7a
> [16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2f0c9b7a
> [16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7 xid 300c9b7a

That suggests either the server has a bug or its receiving tcp data
that's corrupted somehow.

--b.

>
> We're not sure if they are related or not.
>
> ----- Original Message -----
> > From: "Timothy Pearson" <[email protected]>
> > To: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>
> > Cc: "linux-nfs" <[email protected]>
> > Sent: Monday, July 5, 2021 4:44:29 AM
> > Subject: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
> > We've been dealing with a fairly nasty NFS-related problem off and on for the
> > past couple of years. The host is a large POWER server with several external
> > SAS arrays attached, using BTRFS for cold storage of large amounts of data.
> > The main symptom is that under heavy sustained NFS write traffic using certain
> > file types (see below) a core will suddenly lock up, continually spewing a
> > backtrace similar to the one I've pasted below. While this immediately halts
> > all NFS traffic to the affected client (which is never the same client as the
> > machine doing the large file transfer), the larger issue is that over the next
> > few minutes / hours the entire host will gradually degrade in responsiveness
> > until it grinds to a complete halt. Once the core stall occurs we have been
> > unable to find any way to restore the machine to full functionality or avoid
> > the degradation and eventual hang short of a hard power down and restart.
> >
> > Tens of GB of compressed data in a single file seems to be fairly good at
> > triggering the problem, whereas raw disk images or other regularly patterned
> > data tend not to be. The underlying hardware is functioning perfectly with no
> > problems noted, and moving the files without NFS avoids the bug.
> >
> > We've been using a workaround involving purposefully pausing (SIGSTOP) the file
> > transfer process on the client as soon as other clients start to show a
> > slowdown. This hack avoids the bug entirely provided the host is allowed to
> > catch back up prior to resuming (SIGCONT) the file transfer process. From
> > this, it seems something is going very wrong within the NFS stack under high
> > storage I/O pressure and high storage write latency (timeout?) -- it should
> > simply pause transfers while the storage subsystem catches up, not lock up a
> > core and force a host restart. Interesting, sometimes it does exactly what it
> > is supposed to and does pause and wait for the storage subsystem, but around
> > 20% of the time it just triggers this bug and stalls a core.
> >
> > This bug has been present since at least 4.14 and is still present in the latest
> > 5.12.14 version.
> >
> > As the machine is in production, it is difficult to gather further information
> > or test patches, however we would be able to apply patches to the kernel that
> > would potentially restore stability with enough advance scheduling.
> >
> > Sample backtrace below:
> >
> > [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> > [16846.426202] rcu: 32-....: (5249 ticks this GP)
> > idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> > [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> > [16846.426273] NMI backtrace for cpu 32
> > [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> > [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [16846.426406] Call Trace:
> > [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
> > (unreliable)
> > [16846.426483] [c000200010823290] [c00000000075aebc]
> > nmi_cpu_backtrace+0xfc/0x150
> > [16846.426506] [c000200010823310] [c00000000075b0a8]
> > nmi_trigger_cpumask_backtrace+0x198/0x1f0
> > [16846.426577] [c0002000108233b0] [c000000000072818]
> > arch_trigger_cpumask_backtrace+0x28/0x40
> > [16846.426621] [c0002000108233d0] [c000000000202db8]
> > rcu_dump_cpu_stacks+0x158/0x1b8
> > [16846.426667] [c000200010823470] [c000000000201828]
> > rcu_sched_clock_irq+0x908/0xb10
> > [16846.426708] [c000200010823560] [c0000000002141d0]
> > update_process_times+0xc0/0x140
> > [16846.426768] [c0002000108235a0] [c00000000022dd34]
> > tick_sched_handle.isra.18+0x34/0xd0
> > [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> > [16846.426856] [c000200010823610] [c00000000021577c]
> > __hrtimer_run_queues+0x16c/0x370
> > [16846.426903] [c000200010823690] [c000000000216378]
> > hrtimer_interrupt+0x128/0x2f0
> > [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> > [16846.426989] [c0002000108237a0] [c000000000016c54]
> > replay_soft_interrupts+0x124/0x2e0
> > [16846.427045] [c000200010823990] [c000000000016f14]
> > arch_local_irq_restore+0x104/0x170
> > [16846.427103] [c0002000108239c0] [c00000000017247c]
> > mod_delayed_work_on+0x8c/0xe0
> > [16846.427149] [c000200010823a20] [c00800000819fe04]
> > rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> > [16846.427234] [c000200010823a40] [c0080000081a096c]
> > __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> > [16846.427324] [c000200010823a90] [c0080000081a3080]
> > rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> > [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
> > [nfsd]
> > [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0
> > [sunrpc]
> > [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
> > [sunrpc]
> > [16846.427598] [c000200010823c30] [c0080000081a2b18]
> > rpc_async_schedule+0x40/0x70 [sunrpc]
> > [16846.427687] [c000200010823c60] [c000000000170bf0]
> > process_one_work+0x290/0x580
> > [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
> > [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
> > [16846.427865] [c000200010823e10] [c00000000000d6ec]
> > ret_from_kernel_thread+0x5c/0x70
> > [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s!
> > [kworker/u130:25:10624]
> > [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod
> > target_core_user uio target_core_pscsi target_core_file target_core_iblock
> > target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
> > vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
> > [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi
> > hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci
> > xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy
> > xfrm_algo mdio libphy aacraid igb raid_cl$
> > [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> > [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
> > [16873.870073] REGS: c0002000108237d0 TRAP: 0900 Not tainted (5.12.14)
> > [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
> > 24004842 XER: 00000000
> > [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
> > GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
> > GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
> > GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
> > GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
> > GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
> > GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
> > GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
> > [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
> > [16873.870507] LR [c0080000081a0708]
> > rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-07-23 21:24:25

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

----- Original Message -----
> From: "J. Bruce Fields" <[email protected]>
> To: "Timothy Pearson" <[email protected]>
> Cc: "Chuck Lever" <[email protected]>, "linux-nfs" <[email protected]>
> Sent: Friday, July 23, 2021 4:00:12 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> Sorry, took me a while to get to this because I'm not sure what to
> suggest:
>
> On Mon, Jul 05, 2021 at 04:44:29AM -0500, Timothy Pearson wrote:
>> We've been dealing with a fairly nasty NFS-related problem off and on
>> for the past couple of years. The host is a large POWER server with
>> several external SAS arrays attached, using BTRFS for cold storage of
>> large amounts of data. The main symptom is that under heavy sustained
>> NFS write traffic using certain file types (see below) a core will
>> suddenly lock up, continually spewing a backtrace similar to the one
>> I've pasted below.
>
> Is this the very *first* backtrace you get in the log? Is there
> anything suspicious right before it? (Other than the warnings you
> mention in the next message?)

Nothing in the logs before that.

>> While this immediately halts all NFS traffic to
>> the affected client (which is never the same client as the machine
>> doing the large file transfer), the larger issue is that over the next
>> few minutes / hours the entire host will gradually degrade in
>> responsiveness until it grinds to a complete halt. Once the core
>> stall occurs we have been unable to find any way to restore the
>> machine to full functionality or avoid the degradation and eventual
>> hang short of a hard power down and restart.
>
>>
>> Tens of GB of compressed data in a single file seems to be fairly good at
>> triggering the problem, whereas raw disk images or other regularly patterned
>> data tend not to be. The underlying hardware is functioning perfectly with no
>> problems noted, and moving the files without NFS avoids the bug.
>>
>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
>> transfer process on the client as soon as other clients start to show a
>> slowdown. This hack avoids the bug entirely provided the host is allowed to
>> catch back up prior to resuming (SIGCONT) the file transfer process. From
>> this, it seems something is going very wrong within the NFS stack under high
>> storage I/O pressure and high storage write latency (timeout?) -- it should
>> simply pause transfers while the storage subsystem catches up, not lock up a
>> core and force a host restart. Interesting, sometimes it does exactly what it
>> is supposed to and does pause and wait for the storage subsystem, but around
>> 20% of the time it just triggers this bug and stalls a core.
>>
>> This bug has been present since at least 4.14 and is still present in the latest
>> 5.12.14 version.
>>
>> As the machine is in production, it is difficult to gather further information
>> or test patches, however we would be able to apply patches to the kernel that
>> would potentially restore stability with enough advance scheduling.
>>
>> Sample backtrace below:
>>
>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>> [16846.426273] NMI backtrace for cpu 32
>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>> [16846.426406] Call Trace:
>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
>> (unreliable)
>> [16846.426483] [c000200010823290] [c00000000075aebc]
>> nmi_cpu_backtrace+0xfc/0x150
>> [16846.426506] [c000200010823310] [c00000000075b0a8]
>> nmi_trigger_cpumask_backtrace+0x198/0x1f0
>> [16846.426577] [c0002000108233b0] [c000000000072818]
>> arch_trigger_cpumask_backtrace+0x28/0x40
>> [16846.426621] [c0002000108233d0] [c000000000202db8]
>> rcu_dump_cpu_stacks+0x158/0x1b8
>> [16846.426667] [c000200010823470] [c000000000201828]
>> rcu_sched_clock_irq+0x908/0xb10
>> [16846.426708] [c000200010823560] [c0000000002141d0]
>> update_process_times+0xc0/0x140
>> [16846.426768] [c0002000108235a0] [c00000000022dd34]
>> tick_sched_handle.isra.18+0x34/0xd0
>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>> [16846.426856] [c000200010823610] [c00000000021577c]
>> __hrtimer_run_queues+0x16c/0x370
>> [16846.426903] [c000200010823690] [c000000000216378]
>> hrtimer_interrupt+0x128/0x2f0
>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
>> [16846.426989] [c0002000108237a0] [c000000000016c54]
>> replay_soft_interrupts+0x124/0x2e0
>> [16846.427045] [c000200010823990] [c000000000016f14]
>> arch_local_irq_restore+0x104/0x170
>> [16846.427103] [c0002000108239c0] [c00000000017247c]
>> mod_delayed_work_on+0x8c/0xe0
>> [16846.427149] [c000200010823a20] [c00800000819fe04]
>> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>> [16846.427234] [c000200010823a40] [c0080000081a096c]
>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>> [16846.427324] [c000200010823a90] [c0080000081a3080]
>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
>> [nfsd]
>
> I think this has to be the rpc_delay in the -NFS4ERR_DELAY case of
> nfsd4_cb_sequence_done. I don't know what would cause a lockup there.
> Maybe the rpc_task we've passed in is corrupted somehow?
>
> We could try to add some instrumentation in that case. I don't think
> that should be a common error case. I guess the client has probably hit
> one of the NFS4ERR_DELAY cases in
> fs/nfs/callback_proc.c:nfs4_callback_sequence?
>
> There might be a way to get some more information out of this with some
> tracing that you could run in production, but I'm not sure what to
> suggest off the top of my head.

Appreciate the response regardless!

2021-07-28 19:52:06

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Just happened again, this time seemingly at random (unknown network load). This is on kernel 5.13.4 now, and we're going to be breaking off the most critical production services to another host at this point due to the continued issues:

[305943.277482] rcu: INFO: rcu_sched self-detected stall on CPU
[305943.277508] rcu: 45-....: (68259 ticks this GP) idle=bc6/1/0x4000000000000002 softirq=26727261/26727261 fqs=27492
[305943.277551] (t=68263 jiffies g=45124409 q=1158662)
[305943.277568] Sending NMI from CPU 45 to CPUs 16:
[305943.277600] NMI backtrace for cpu 16
[305943.277637] CPU: 16 PID: 15327 Comm: nfsd Tainted: G D W L 5.13.4 #1
[305943.277666] NIP: c0000000001cdcd0 LR: c000000000c024b4 CTR: c000000000c02450
[305943.277694] REGS: c0002000ab2d35f0 TRAP: 0e80 Tainted: G D W L (5.13.4)
[305943.277712] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 22002444 XER: 20040000
[305943.277771] CFAR: c0000000001cdce4 IRQMASK: 0
[305943.277771] GPR00: c000000000c024b4 c0002000ab2d3890 c000000001511f00 c008000010b248f0
[305943.277771] GPR04: 0000000000000000 0000000000000008 0000000000000008 0000000000000010
[305943.277771] GPR08: 0000000000000001 0000000000000001 0000000000000001 c008000010b04c78
[305943.277771] GPR12: c000000000c02450 c000000ffffde000 c00000000017a528 c000200006f30bc0
[305943.277771] GPR16: 0000000000000000 c000000017f45dd0 c0002000891f8000 0000000000000000
[305943.277771] GPR20: c000200099068000 0000000000000000 c008000010b248f0 c0002000993091a0
[305943.277771] GPR24: c00000093aa55698 c008000010b24060 00000000000004e8 c0002000891f8030
[305943.277771] GPR28: 000000000000009d c0002000993091a0 0000000000000000 c0002000993091a0
[305943.278108] NIP [c0000000001cdcd0] native_queued_spin_lock_slowpath+0x70/0x360
[305943.278158] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
[305943.278183] Call Trace:
[305943.278204] [c0002000ab2d3890] [c0002000891f8030] 0xc0002000891f8030 (unreliable)
[305943.278252] [c0002000ab2d38b0] [c008000010aec0ac] nfsd4_process_open2+0x864/0x1910 [nfsd]
[305943.278311] [c0002000ab2d3a20] [c008000010ad0df4] nfsd4_open+0x40c/0x8f0 [nfsd]
[305943.278360] [c0002000ab2d3ad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
[305943.278418] [c0002000ab2d3b80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
[305943.278474] [c0002000ab2d3bd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
[305943.278525] [c0002000ab2d3ca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
[305943.278585] [c0002000ab2d3d10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
[305943.278639] [c0002000ab2d3da0] [c00000000017a6a4] kthread+0x184/0x190
[305943.278667] [c0002000ab2d3e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[305943.278706] Instruction dump:
[305943.278727] 790a0020 2f890000 409e02b8 2faa0000 419e003c 81230000 5529063e 2f890000
[305943.278782] 419e0028 60000000 60000000 60000000 <7c210b78> 7c421378 81230000 5529063e
[305943.278871] NMI backtrace for cpu 45
[305943.278892] CPU: 45 PID: 15323 Comm: nfsd Tainted: G D W L 5.13.4 #1
[305943.278932] Call Trace:
[305943.278950] [c0002000ab2e30d0] [c00000000075d080] dump_stack+0xc4/0x114 (unreliable)
[305943.279002] [c0002000ab2e3110] [c00000000076996c] nmi_cpu_backtrace+0xfc/0x150
[305943.279054] [c0002000ab2e3190] [c000000000769b78] nmi_trigger_cpumask_backtrace+0x1b8/0x220
[305943.279112] [c0002000ab2e3230] [c000000000070788] arch_trigger_cpumask_backtrace+0x28/0x40
[305943.279167] [c0002000ab2e3250] [c0000000001fe6b0] rcu_dump_cpu_stacks+0x158/0x1b8
[305943.279218] [c0002000ab2e32f0] [c0000000001fd164] rcu_sched_clock_irq+0x954/0xb40
[305943.279276] [c0002000ab2e33e0] [c000000000210030] update_process_times+0xc0/0x140
[305943.279316] [c0002000ab2e3420] [c000000000229e04] tick_sched_handle.isra.18+0x34/0xd0
[305943.279369] [c0002000ab2e3450] [c00000000022a2b8] tick_sched_timer+0x68/0xe0
[305943.279416] [c0002000ab2e3490] [c00000000021158c] __hrtimer_run_queues+0x16c/0x370
[305943.279456] [c0002000ab2e3510] [c0000000002121b8] hrtimer_interrupt+0x128/0x2f0
[305943.279508] [c0002000ab2e35c0] [c000000000027904] timer_interrupt+0x134/0x310
[305943.279551] [c0002000ab2e3620] [c000000000009c04] decrementer_common_virt+0x1a4/0x1b0
[305943.279591] --- interrupt: 900 at native_queued_spin_lock_slowpath+0x200/0x360
[305943.279635] NIP: c0000000001cde60 LR: c000000000c024b4 CTR: c000000000711850
[305943.279678] REGS: c0002000ab2e3690 TRAP: 0900 Tainted: G D W L (5.13.4)
[305943.279727] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44000822 XER: 20040000
[305943.279787] CFAR: c0000000001cde70 IRQMASK: 0
[305943.279787] GPR00: 0000000000000000 c0002000ab2e3930 c000000001511f00 c008000010b248f0
[305943.279787] GPR04: c000200ffbf75380 0000000000b80000 c000200ffbf75380 c000000001085380
[305943.279787] GPR08: c00000000154e060 0000000000000000 0000000ffe410000 0000000000000000
[305943.279787] GPR12: 0000000000b80000 c000200fff6d5400 c00000000017a528 c000200006f30bc0
[305943.279787] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[305943.279787] GPR20: 0000000000000000 0000000000000000 c000200cd3c90014 c000000ba2670d90
[305943.279787] GPR24: c000000ba2670a30 c008000010b2319c c008000010db8634 c0002000ab2e3a40
[305943.279787] GPR28: c000200099331a44 c008000010b248f0 0000000000000000 0000000000b80101
[305943.280122] NIP [c0000000001cde60] native_queued_spin_lock_slowpath+0x200/0x360
[305943.280150] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
[305943.280185] --- interrupt: 900
[305943.280209] [c0002000ab2e3930] [c0002000ab2e3970] 0xc0002000ab2e3970 (unreliable)
[305943.280255] [c0002000ab2e3950] [c0000000007118b8] refcount_dec_and_lock+0x68/0x130
[305943.280298] [c0002000ab2e3990] [c008000010ae5458] put_nfs4_file+0x40/0x140 [nfsd]
[305943.280355] [c0002000ab2e39d0] [c008000010aeee68] nfsd4_close+0x1f0/0x470 [nfsd]
[305943.280410] [c0002000ab2e3ad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
[305943.280466] [c0002000ab2e3b80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
[305943.280514] [c0002000ab2e3bd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
[305943.280582] [c0002000ab2e3ca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
[305943.280656] [c0002000ab2e3d10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
[305943.280710] [c0002000ab2e3da0] [c00000000017a6a4] kthread+0x184/0x190
[305943.280749] [c0002000ab2e3e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[305943.280805] Sending NMI from CPU 45 to CPUs 58:
[305943.280842] NMI backtrace for cpu 58
[305943.280865] CPU: 58 PID: 15325 Comm: nfsd Tainted: G D W L 5.13.4 #1
[305943.280904] NIP: c0000000001cde98 LR: c000000000c024b4 CTR: c000000000711850
[305943.280953] REGS: c0002000ab2ff690 TRAP: 0e80 Tainted: G D W L (5.13.4)
[305943.280992] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24000824 XER: 20040000
[305943.281062] CFAR: c0000000001cdeac IRQMASK: 0
[305943.281062] GPR00: 0000000000000000 c0002000ab2ff930 c000000001511f00 c008000010b248f0
[305943.281062] GPR04: c000200ffc795380 0000000000ec0000 c000200ffc795380 c000000001085380
[305943.281062] GPR08: 0000000000e00101 0000000000e00101 0000000000000101 0000000000000000
[305943.281062] GPR12: 0000000000ec0000 c000200fff695e00 c00000000017a528 c000200006f30bc0
[305943.281062] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[305943.281062] GPR20: 0000000000000000 0000000000000000 c000200223680014 c000000ba2670d90
[305943.281062] GPR24: c000000ba2670a30 c008000010b2319c c008000010db8634 c0002000ab2ffa40
[305943.281062] GPR28: c000200099341a44 c008000010b248f0 0000000000000000 0000000000ec0101
[305943.281737] NIP [c0000000001cde98] native_queued_spin_lock_slowpath+0x238/0x360
[305943.281842] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
[305943.281902] Call Trace:
[305943.281940] [c0002000ab2ff930] [c0002000ab2ff970] 0xc0002000ab2ff970 (unreliable)
[305943.282040] [c0002000ab2ff950] [c0000000007118b8] refcount_dec_and_lock+0x68/0x130
[305943.282154] [c0002000ab2ff990] [c008000010ae5458] put_nfs4_file+0x40/0x140 [nfsd]
[305943.282330] [c0002000ab2ff9d0] [c008000010aeee68] nfsd4_close+0x1f0/0x470 [nfsd]
[305943.282436] [c0002000ab2ffad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
[305943.282520] [c0002000ab2ffb80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
[305943.282670] [c0002000ab2ffbd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
[305943.282781] [c0002000ab2ffca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
[305943.282882] [c0002000ab2ffd10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
[305943.282963] [c0002000ab2ffda0] [c00000000017a6a4] kthread+0x184/0x190
[305943.283057] [c0002000ab2ffe10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
[305943.283136] Instruction dump:
[305943.283178] 81240008 2f890000 419efff0 7c2004ac 7d66002a 2fab0000 419e0018 7c0059ec
[305943.283288] 48000010 60000000 7c210b78 7c421378 <81230000> 79280020 7d2907b4 550a043e
[305943.283415] Sending NMI from CPU 45 to CPUs 62:
[305943.283467] NMI backtrace for cpu 62
[305943.283507] CPU: 62 PID: 15326 Comm: nfsd Tainted: G D W L 5.13.4 #1
[305943.283625] NIP: c0000000001cde64 LR: c000000000c024b4 CTR: c000000000711850
[305943.283713] REGS: c0002000ab2f7690 TRAP: 0e80 Tainted: G D W L (5.13.4)
[305943.283805] MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 44000822 XER: 20040000
[305943.283912] CFAR: c0000000001cde70 IRQMASK: 0
[305943.283912] GPR00: 0000000000000000 c0002000ab2f7930 c000000001511f00 c008000010b248f0
[305943.283912] GPR04: c000200ffca15380 0000000000fc0000 c000200ffca15380 c000000001085380
[305943.283912] GPR08: c00000000154e060 0000000000000000 0000200ffb710000 0000000000000000
[305943.283912] GPR12: 0000000000fc0000 c000200fff691600 c00000000017a528 c000200006f30bc0
[305943.283912] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[305943.283912] GPR20: 0000000000000000 0000000000000000 c00020009c030014 c000000ba2670d90
[305943.283912] GPR24: c000000ba2670a30 c008000010b2319c c008000010db8634 c0002000ab2f7a40
[305943.283912] GPR28: c000200099321a44 c008000010b248f0 0000000000000000 0000000000fc0101
[305943.284785] NIP [c0000000001cde64] native_queued_spin_lock_slowpath+0x204/0x360
[305943.284867] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
[305943.284934] Call Trace:
[305943.284966] [c0002000ab2f7930] [c0002000ab2f7970] 0xc0002000ab2f7970 (unreliable)
[305943.285042] [c0002000ab2f7950] [c0000000007118b8] refcount_dec_and_lock+0x68/0x130
[305943.285158] [c0002000ab2f7990] [c008000010ae5458] put_nfs4_file+0x40/0x140 [nfsd]
[305943.285249] [c0002000ab2f79d0] [c008000010aeee68] nfsd4_close+0x1f0/0x470 [nfsd]
[305943.285336] [c0002000ab2f7ad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
[305943.285451] [c0002000ab2f7b80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
[305943.285542] [c0002000ab2f7bd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
[305943.285639] [c0002000ab2f7ca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
[305943.285751] [c0002000ab2f7d10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
[305943.285830] [c0002000ab2f7da0] [c00000000017a6a4] kthread+0x184/0x190
[305943.285903] [c0002000ab2f7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70

----- Original Message -----
> From: "Timothy Pearson" <[email protected]>
> To: "J. Bruce Fields" <[email protected]>
> Cc: "Chuck Lever" <[email protected]>, "linux-nfs" <[email protected]>
> Sent: Friday, July 23, 2021 4:22:27 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> ----- Original Message -----
>> From: "J. Bruce Fields" <[email protected]>
>> To: "Timothy Pearson" <[email protected]>
>> Cc: "Chuck Lever" <[email protected]>, "linux-nfs"
>> <[email protected]>
>> Sent: Friday, July 23, 2021 4:00:12 PM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
>> Sorry, took me a while to get to this because I'm not sure what to
>> suggest:
>>
>> On Mon, Jul 05, 2021 at 04:44:29AM -0500, Timothy Pearson wrote:
>>> We've been dealing with a fairly nasty NFS-related problem off and on
>>> for the past couple of years. The host is a large POWER server with
>>> several external SAS arrays attached, using BTRFS for cold storage of
>>> large amounts of data. The main symptom is that under heavy sustained
>>> NFS write traffic using certain file types (see below) a core will
>>> suddenly lock up, continually spewing a backtrace similar to the one
>>> I've pasted below.
>>
>> Is this the very *first* backtrace you get in the log? Is there
>> anything suspicious right before it? (Other than the warnings you
>> mention in the next message?)
>
> Nothing in the logs before that.
>
>>> While this immediately halts all NFS traffic to
>>> the affected client (which is never the same client as the machine
>>> doing the large file transfer), the larger issue is that over the next
>>> few minutes / hours the entire host will gradually degrade in
>>> responsiveness until it grinds to a complete halt. Once the core
>>> stall occurs we have been unable to find any way to restore the
>>> machine to full functionality or avoid the degradation and eventual
>>> hang short of a hard power down and restart.
>>
>>>
>>> Tens of GB of compressed data in a single file seems to be fairly good at
>>> triggering the problem, whereas raw disk images or other regularly patterned
>>> data tend not to be. The underlying hardware is functioning perfectly with no
>>> problems noted, and moving the files without NFS avoids the bug.
>>>
>>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
>>> transfer process on the client as soon as other clients start to show a
>>> slowdown. This hack avoids the bug entirely provided the host is allowed to
>>> catch back up prior to resuming (SIGCONT) the file transfer process. From
>>> this, it seems something is going very wrong within the NFS stack under high
>>> storage I/O pressure and high storage write latency (timeout?) -- it should
>>> simply pause transfers while the storage subsystem catches up, not lock up a
>>> core and force a host restart. Interesting, sometimes it does exactly what it
>>> is supposed to and does pause and wait for the storage subsystem, but around
>>> 20% of the time it just triggers this bug and stalls a core.
>>>
>>> This bug has been present since at least 4.14 and is still present in the latest
>>> 5.12.14 version.
>>>
>>> As the machine is in production, it is difficult to gather further information
>>> or test patches, however we would be able to apply patches to the kernel that
>>> would potentially restore stability with enough advance scheduling.
>>>
>>> Sample backtrace below:
>>>
>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
>>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>>> [16846.426273] NMI backtrace for cpu 32
>>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>> [16846.426406] Call Trace:
>>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
>>> (unreliable)
>>> [16846.426483] [c000200010823290] [c00000000075aebc]
>>> nmi_cpu_backtrace+0xfc/0x150
>>> [16846.426506] [c000200010823310] [c00000000075b0a8]
>>> nmi_trigger_cpumask_backtrace+0x198/0x1f0
>>> [16846.426577] [c0002000108233b0] [c000000000072818]
>>> arch_trigger_cpumask_backtrace+0x28/0x40
>>> [16846.426621] [c0002000108233d0] [c000000000202db8]
>>> rcu_dump_cpu_stacks+0x158/0x1b8
>>> [16846.426667] [c000200010823470] [c000000000201828]
>>> rcu_sched_clock_irq+0x908/0xb10
>>> [16846.426708] [c000200010823560] [c0000000002141d0]
>>> update_process_times+0xc0/0x140
>>> [16846.426768] [c0002000108235a0] [c00000000022dd34]
>>> tick_sched_handle.isra.18+0x34/0xd0
>>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>>> [16846.426856] [c000200010823610] [c00000000021577c]
>>> __hrtimer_run_queues+0x16c/0x370
>>> [16846.426903] [c000200010823690] [c000000000216378]
>>> hrtimer_interrupt+0x128/0x2f0
>>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
>>> [16846.426989] [c0002000108237a0] [c000000000016c54]
>>> replay_soft_interrupts+0x124/0x2e0
>>> [16846.427045] [c000200010823990] [c000000000016f14]
>>> arch_local_irq_restore+0x104/0x170
>>> [16846.427103] [c0002000108239c0] [c00000000017247c]
>>> mod_delayed_work_on+0x8c/0xe0
>>> [16846.427149] [c000200010823a20] [c00800000819fe04]
>>> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>>> [16846.427234] [c000200010823a40] [c0080000081a096c]
>>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>>> [16846.427324] [c000200010823a90] [c0080000081a3080]
>>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
>>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
>>> [nfsd]
>>
>> I think this has to be the rpc_delay in the -NFS4ERR_DELAY case of
>> nfsd4_cb_sequence_done. I don't know what would cause a lockup there.
>> Maybe the rpc_task we've passed in is corrupted somehow?
>>
>> We could try to add some instrumentation in that case. I don't think
>> that should be a common error case. I guess the client has probably hit
>> one of the NFS4ERR_DELAY cases in
>> fs/nfs/callback_proc.c:nfs4_callback_sequence?
>>
>> There might be a way to get some more information out of this with some
>> tracing that you could run in production, but I'm not sure what to
>> suggest off the top of my head.
>
> Appreciate the response regardless!

2021-08-02 19:29:05

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Wed, Jul 28, 2021 at 02:51:36PM -0500, Timothy Pearson wrote:
> Just happened again, this time seemingly at random (unknown network load). This is on kernel 5.13.4 now, and we're going to be breaking off the most critical production services to another host at this point due to the continued issues:

If you wanted to try something, it might also be worth turning off
delegations (write a 0 to /proc/sys/fs/leases-enable before starting the
server).

--b.

> [305943.277482] rcu: INFO: rcu_sched self-detected stall on CPU
> [305943.277508] rcu: 45-....: (68259 ticks this GP) idle=bc6/1/0x4000000000000002 softirq=26727261/26727261 fqs=27492
> [305943.277551] (t=68263 jiffies g=45124409 q=1158662)
> [305943.277568] Sending NMI from CPU 45 to CPUs 16:
> [305943.277600] NMI backtrace for cpu 16
> [305943.277637] CPU: 16 PID: 15327 Comm: nfsd Tainted: G D W L 5.13.4 #1
> [305943.277666] NIP: c0000000001cdcd0 LR: c000000000c024b4 CTR: c000000000c02450
> [305943.277694] REGS: c0002000ab2d35f0 TRAP: 0e80 Tainted: G D W L (5.13.4)
> [305943.277712] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 22002444 XER: 20040000
> [305943.277771] CFAR: c0000000001cdce4 IRQMASK: 0
> [305943.277771] GPR00: c000000000c024b4 c0002000ab2d3890 c000000001511f00 c008000010b248f0
> [305943.277771] GPR04: 0000000000000000 0000000000000008 0000000000000008 0000000000000010
> [305943.277771] GPR08: 0000000000000001 0000000000000001 0000000000000001 c008000010b04c78
> [305943.277771] GPR12: c000000000c02450 c000000ffffde000 c00000000017a528 c000200006f30bc0
> [305943.277771] GPR16: 0000000000000000 c000000017f45dd0 c0002000891f8000 0000000000000000
> [305943.277771] GPR20: c000200099068000 0000000000000000 c008000010b248f0 c0002000993091a0
> [305943.277771] GPR24: c00000093aa55698 c008000010b24060 00000000000004e8 c0002000891f8030
> [305943.277771] GPR28: 000000000000009d c0002000993091a0 0000000000000000 c0002000993091a0
> [305943.278108] NIP [c0000000001cdcd0] native_queued_spin_lock_slowpath+0x70/0x360
> [305943.278158] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
> [305943.278183] Call Trace:
> [305943.278204] [c0002000ab2d3890] [c0002000891f8030] 0xc0002000891f8030 (unreliable)
> [305943.278252] [c0002000ab2d38b0] [c008000010aec0ac] nfsd4_process_open2+0x864/0x1910 [nfsd]
> [305943.278311] [c0002000ab2d3a20] [c008000010ad0df4] nfsd4_open+0x40c/0x8f0 [nfsd]
> [305943.278360] [c0002000ab2d3ad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
> [305943.278418] [c0002000ab2d3b80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
> [305943.278474] [c0002000ab2d3bd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
> [305943.278525] [c0002000ab2d3ca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
> [305943.278585] [c0002000ab2d3d10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
> [305943.278639] [c0002000ab2d3da0] [c00000000017a6a4] kthread+0x184/0x190
> [305943.278667] [c0002000ab2d3e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [305943.278706] Instruction dump:
> [305943.278727] 790a0020 2f890000 409e02b8 2faa0000 419e003c 81230000 5529063e 2f890000
> [305943.278782] 419e0028 60000000 60000000 60000000 <7c210b78> 7c421378 81230000 5529063e
> [305943.278871] NMI backtrace for cpu 45
> [305943.278892] CPU: 45 PID: 15323 Comm: nfsd Tainted: G D W L 5.13.4 #1
> [305943.278932] Call Trace:
> [305943.278950] [c0002000ab2e30d0] [c00000000075d080] dump_stack+0xc4/0x114 (unreliable)
> [305943.279002] [c0002000ab2e3110] [c00000000076996c] nmi_cpu_backtrace+0xfc/0x150
> [305943.279054] [c0002000ab2e3190] [c000000000769b78] nmi_trigger_cpumask_backtrace+0x1b8/0x220
> [305943.279112] [c0002000ab2e3230] [c000000000070788] arch_trigger_cpumask_backtrace+0x28/0x40
> [305943.279167] [c0002000ab2e3250] [c0000000001fe6b0] rcu_dump_cpu_stacks+0x158/0x1b8
> [305943.279218] [c0002000ab2e32f0] [c0000000001fd164] rcu_sched_clock_irq+0x954/0xb40
> [305943.279276] [c0002000ab2e33e0] [c000000000210030] update_process_times+0xc0/0x140
> [305943.279316] [c0002000ab2e3420] [c000000000229e04] tick_sched_handle.isra.18+0x34/0xd0
> [305943.279369] [c0002000ab2e3450] [c00000000022a2b8] tick_sched_timer+0x68/0xe0
> [305943.279416] [c0002000ab2e3490] [c00000000021158c] __hrtimer_run_queues+0x16c/0x370
> [305943.279456] [c0002000ab2e3510] [c0000000002121b8] hrtimer_interrupt+0x128/0x2f0
> [305943.279508] [c0002000ab2e35c0] [c000000000027904] timer_interrupt+0x134/0x310
> [305943.279551] [c0002000ab2e3620] [c000000000009c04] decrementer_common_virt+0x1a4/0x1b0
> [305943.279591] --- interrupt: 900 at native_queued_spin_lock_slowpath+0x200/0x360
> [305943.279635] NIP: c0000000001cde60 LR: c000000000c024b4 CTR: c000000000711850
> [305943.279678] REGS: c0002000ab2e3690 TRAP: 0900 Tainted: G D W L (5.13.4)
> [305943.279727] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 44000822 XER: 20040000
> [305943.279787] CFAR: c0000000001cde70 IRQMASK: 0
> [305943.279787] GPR00: 0000000000000000 c0002000ab2e3930 c000000001511f00 c008000010b248f0
> [305943.279787] GPR04: c000200ffbf75380 0000000000b80000 c000200ffbf75380 c000000001085380
> [305943.279787] GPR08: c00000000154e060 0000000000000000 0000000ffe410000 0000000000000000
> [305943.279787] GPR12: 0000000000b80000 c000200fff6d5400 c00000000017a528 c000200006f30bc0
> [305943.279787] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [305943.279787] GPR20: 0000000000000000 0000000000000000 c000200cd3c90014 c000000ba2670d90
> [305943.279787] GPR24: c000000ba2670a30 c008000010b2319c c008000010db8634 c0002000ab2e3a40
> [305943.279787] GPR28: c000200099331a44 c008000010b248f0 0000000000000000 0000000000b80101
> [305943.280122] NIP [c0000000001cde60] native_queued_spin_lock_slowpath+0x200/0x360
> [305943.280150] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
> [305943.280185] --- interrupt: 900
> [305943.280209] [c0002000ab2e3930] [c0002000ab2e3970] 0xc0002000ab2e3970 (unreliable)
> [305943.280255] [c0002000ab2e3950] [c0000000007118b8] refcount_dec_and_lock+0x68/0x130
> [305943.280298] [c0002000ab2e3990] [c008000010ae5458] put_nfs4_file+0x40/0x140 [nfsd]
> [305943.280355] [c0002000ab2e39d0] [c008000010aeee68] nfsd4_close+0x1f0/0x470 [nfsd]
> [305943.280410] [c0002000ab2e3ad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
> [305943.280466] [c0002000ab2e3b80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
> [305943.280514] [c0002000ab2e3bd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
> [305943.280582] [c0002000ab2e3ca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
> [305943.280656] [c0002000ab2e3d10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
> [305943.280710] [c0002000ab2e3da0] [c00000000017a6a4] kthread+0x184/0x190
> [305943.280749] [c0002000ab2e3e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [305943.280805] Sending NMI from CPU 45 to CPUs 58:
> [305943.280842] NMI backtrace for cpu 58
> [305943.280865] CPU: 58 PID: 15325 Comm: nfsd Tainted: G D W L 5.13.4 #1
> [305943.280904] NIP: c0000000001cde98 LR: c000000000c024b4 CTR: c000000000711850
> [305943.280953] REGS: c0002000ab2ff690 TRAP: 0e80 Tainted: G D W L (5.13.4)
> [305943.280992] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24000824 XER: 20040000
> [305943.281062] CFAR: c0000000001cdeac IRQMASK: 0
> [305943.281062] GPR00: 0000000000000000 c0002000ab2ff930 c000000001511f00 c008000010b248f0
> [305943.281062] GPR04: c000200ffc795380 0000000000ec0000 c000200ffc795380 c000000001085380
> [305943.281062] GPR08: 0000000000e00101 0000000000e00101 0000000000000101 0000000000000000
> [305943.281062] GPR12: 0000000000ec0000 c000200fff695e00 c00000000017a528 c000200006f30bc0
> [305943.281062] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [305943.281062] GPR20: 0000000000000000 0000000000000000 c000200223680014 c000000ba2670d90
> [305943.281062] GPR24: c000000ba2670a30 c008000010b2319c c008000010db8634 c0002000ab2ffa40
> [305943.281062] GPR28: c000200099341a44 c008000010b248f0 0000000000000000 0000000000ec0101
> [305943.281737] NIP [c0000000001cde98] native_queued_spin_lock_slowpath+0x238/0x360
> [305943.281842] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
> [305943.281902] Call Trace:
> [305943.281940] [c0002000ab2ff930] [c0002000ab2ff970] 0xc0002000ab2ff970 (unreliable)
> [305943.282040] [c0002000ab2ff950] [c0000000007118b8] refcount_dec_and_lock+0x68/0x130
> [305943.282154] [c0002000ab2ff990] [c008000010ae5458] put_nfs4_file+0x40/0x140 [nfsd]
> [305943.282330] [c0002000ab2ff9d0] [c008000010aeee68] nfsd4_close+0x1f0/0x470 [nfsd]
> [305943.282436] [c0002000ab2ffad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
> [305943.282520] [c0002000ab2ffb80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
> [305943.282670] [c0002000ab2ffbd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
> [305943.282781] [c0002000ab2ffca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
> [305943.282882] [c0002000ab2ffd10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
> [305943.282963] [c0002000ab2ffda0] [c00000000017a6a4] kthread+0x184/0x190
> [305943.283057] [c0002000ab2ffe10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
> [305943.283136] Instruction dump:
> [305943.283178] 81240008 2f890000 419efff0 7c2004ac 7d66002a 2fab0000 419e0018 7c0059ec
> [305943.283288] 48000010 60000000 7c210b78 7c421378 <81230000> 79280020 7d2907b4 550a043e
> [305943.283415] Sending NMI from CPU 45 to CPUs 62:
> [305943.283467] NMI backtrace for cpu 62
> [305943.283507] CPU: 62 PID: 15326 Comm: nfsd Tainted: G D W L 5.13.4 #1
> [305943.283625] NIP: c0000000001cde64 LR: c000000000c024b4 CTR: c000000000711850
> [305943.283713] REGS: c0002000ab2f7690 TRAP: 0e80 Tainted: G D W L (5.13.4)
> [305943.283805] MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 44000822 XER: 20040000
> [305943.283912] CFAR: c0000000001cde70 IRQMASK: 0
> [305943.283912] GPR00: 0000000000000000 c0002000ab2f7930 c000000001511f00 c008000010b248f0
> [305943.283912] GPR04: c000200ffca15380 0000000000fc0000 c000200ffca15380 c000000001085380
> [305943.283912] GPR08: c00000000154e060 0000000000000000 0000200ffb710000 0000000000000000
> [305943.283912] GPR12: 0000000000fc0000 c000200fff691600 c00000000017a528 c000200006f30bc0
> [305943.283912] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [305943.283912] GPR20: 0000000000000000 0000000000000000 c00020009c030014 c000000ba2670d90
> [305943.283912] GPR24: c000000ba2670a30 c008000010b2319c c008000010db8634 c0002000ab2f7a40
> [305943.283912] GPR28: c000200099321a44 c008000010b248f0 0000000000000000 0000000000fc0101
> [305943.284785] NIP [c0000000001cde64] native_queued_spin_lock_slowpath+0x204/0x360
> [305943.284867] LR [c000000000c024b4] _raw_spin_lock+0x64/0xa0
> [305943.284934] Call Trace:
> [305943.284966] [c0002000ab2f7930] [c0002000ab2f7970] 0xc0002000ab2f7970 (unreliable)
> [305943.285042] [c0002000ab2f7950] [c0000000007118b8] refcount_dec_and_lock+0x68/0x130
> [305943.285158] [c0002000ab2f7990] [c008000010ae5458] put_nfs4_file+0x40/0x140 [nfsd]
> [305943.285249] [c0002000ab2f79d0] [c008000010aeee68] nfsd4_close+0x1f0/0x470 [nfsd]
> [305943.285336] [c0002000ab2f7ad0] [c008000010ad16bc] nfsd4_proc_compound+0x3e4/0x730 [nfsd]
> [305943.285451] [c0002000ab2f7b80] [c008000010aadb10] nfsd_dispatch+0x148/0x2d8 [nfsd]
> [305943.285542] [c0002000ab2f7bd0] [c008000010d644e0] svc_process_common+0x498/0xa70 [sunrpc]
> [305943.285639] [c0002000ab2f7ca0] [c008000010d64b70] svc_process+0xb8/0x140 [sunrpc]
> [305943.285751] [c0002000ab2f7d10] [c008000010aad198] nfsd+0x150/0x1e0 [nfsd]
> [305943.285830] [c0002000ab2f7da0] [c00000000017a6a4] kthread+0x184/0x190
> [305943.285903] [c0002000ab2f7e10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
>
> ----- Original Message -----
> > From: "Timothy Pearson" <[email protected]>
> > To: "J. Bruce Fields" <[email protected]>
> > Cc: "Chuck Lever" <[email protected]>, "linux-nfs" <[email protected]>
> > Sent: Friday, July 23, 2021 4:22:27 PM
> > Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
> > ----- Original Message -----
> >> From: "J. Bruce Fields" <[email protected]>
> >> To: "Timothy Pearson" <[email protected]>
> >> Cc: "Chuck Lever" <[email protected]>, "linux-nfs"
> >> <[email protected]>
> >> Sent: Friday, July 23, 2021 4:00:12 PM
> >> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
> >
> >> Sorry, took me a while to get to this because I'm not sure what to
> >> suggest:
> >>
> >> On Mon, Jul 05, 2021 at 04:44:29AM -0500, Timothy Pearson wrote:
> >>> We've been dealing with a fairly nasty NFS-related problem off and on
> >>> for the past couple of years. The host is a large POWER server with
> >>> several external SAS arrays attached, using BTRFS for cold storage of
> >>> large amounts of data. The main symptom is that under heavy sustained
> >>> NFS write traffic using certain file types (see below) a core will
> >>> suddenly lock up, continually spewing a backtrace similar to the one
> >>> I've pasted below.
> >>
> >> Is this the very *first* backtrace you get in the log? Is there
> >> anything suspicious right before it? (Other than the warnings you
> >> mention in the next message?)
> >
> > Nothing in the logs before that.
> >
> >>> While this immediately halts all NFS traffic to
> >>> the affected client (which is never the same client as the machine
> >>> doing the large file transfer), the larger issue is that over the next
> >>> few minutes / hours the entire host will gradually degrade in
> >>> responsiveness until it grinds to a complete halt. Once the core
> >>> stall occurs we have been unable to find any way to restore the
> >>> machine to full functionality or avoid the degradation and eventual
> >>> hang short of a hard power down and restart.
> >>
> >>>
> >>> Tens of GB of compressed data in a single file seems to be fairly good at
> >>> triggering the problem, whereas raw disk images or other regularly patterned
> >>> data tend not to be. The underlying hardware is functioning perfectly with no
> >>> problems noted, and moving the files without NFS avoids the bug.
> >>>
> >>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
> >>> transfer process on the client as soon as other clients start to show a
> >>> slowdown. This hack avoids the bug entirely provided the host is allowed to
> >>> catch back up prior to resuming (SIGCONT) the file transfer process. From
> >>> this, it seems something is going very wrong within the NFS stack under high
> >>> storage I/O pressure and high storage write latency (timeout?) -- it should
> >>> simply pause transfers while the storage subsystem catches up, not lock up a
> >>> core and force a host restart. Interesting, sometimes it does exactly what it
> >>> is supposed to and does pause and wait for the storage subsystem, but around
> >>> 20% of the time it just triggers this bug and stalls a core.
> >>>
> >>> This bug has been present since at least 4.14 and is still present in the latest
> >>> 5.12.14 version.
> >>>
> >>> As the machine is in production, it is difficult to gather further information
> >>> or test patches, however we would be able to apply patches to the kernel that
> >>> would potentially restore stability with enough advance scheduling.
> >>>
> >>> Sample backtrace below:
> >>>
> >>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> >>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
> >>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> >>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> >>> [16846.426273] NMI backtrace for cpu 32
> >>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> >>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> >>> [16846.426406] Call Trace:
> >>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
> >>> (unreliable)
> >>> [16846.426483] [c000200010823290] [c00000000075aebc]
> >>> nmi_cpu_backtrace+0xfc/0x150
> >>> [16846.426506] [c000200010823310] [c00000000075b0a8]
> >>> nmi_trigger_cpumask_backtrace+0x198/0x1f0
> >>> [16846.426577] [c0002000108233b0] [c000000000072818]
> >>> arch_trigger_cpumask_backtrace+0x28/0x40
> >>> [16846.426621] [c0002000108233d0] [c000000000202db8]
> >>> rcu_dump_cpu_stacks+0x158/0x1b8
> >>> [16846.426667] [c000200010823470] [c000000000201828]
> >>> rcu_sched_clock_irq+0x908/0xb10
> >>> [16846.426708] [c000200010823560] [c0000000002141d0]
> >>> update_process_times+0xc0/0x140
> >>> [16846.426768] [c0002000108235a0] [c00000000022dd34]
> >>> tick_sched_handle.isra.18+0x34/0xd0
> >>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> >>> [16846.426856] [c000200010823610] [c00000000021577c]
> >>> __hrtimer_run_queues+0x16c/0x370
> >>> [16846.426903] [c000200010823690] [c000000000216378]
> >>> hrtimer_interrupt+0x128/0x2f0
> >>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> >>> [16846.426989] [c0002000108237a0] [c000000000016c54]
> >>> replay_soft_interrupts+0x124/0x2e0
> >>> [16846.427045] [c000200010823990] [c000000000016f14]
> >>> arch_local_irq_restore+0x104/0x170
> >>> [16846.427103] [c0002000108239c0] [c00000000017247c]
> >>> mod_delayed_work_on+0x8c/0xe0
> >>> [16846.427149] [c000200010823a20] [c00800000819fe04]
> >>> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> >>> [16846.427234] [c000200010823a40] [c0080000081a096c]
> >>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> >>> [16846.427324] [c000200010823a90] [c0080000081a3080]
> >>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> >>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
> >>> [nfsd]
> >>
> >> I think this has to be the rpc_delay in the -NFS4ERR_DELAY case of
> >> nfsd4_cb_sequence_done. I don't know what would cause a lockup there.
> >> Maybe the rpc_task we've passed in is corrupted somehow?
> >>
> >> We could try to add some instrumentation in that case. I don't think
> >> that should be a common error case. I guess the client has probably hit
> >> one of the NFS4ERR_DELAY cases in
> >> fs/nfs/callback_proc.c:nfs4_callback_sequence?
> >>
> >> There might be a way to get some more information out of this with some
> >> tracing that you could run in production, but I'm not sure what to
> >> suggest off the top of my head.
> >
> > Appreciate the response regardless!

2021-08-09 17:10:19

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Did you see anything else at all on the terminal? The inability to log in is sadly familiar, our boxes are configured to dump a trace over serial every 120 seconds or so if they lock up (/proc/sys/kernel/hung_task_timeout_secs) and I'm not sure you'd see anything past the callback messages without that active.

FWIW we ended up (mostly) working around the problem by moving the critical systems (which are all NFSv3) to a new server, but that's a stopgap measure as we were looking to deploy NFSv4 on a broader scale. My gut feeling is the failure occurs under heavy load where too many NFSv4 requests from a single client are pending due to a combination of storage and network saturation, but it's proven very difficult to debug -- even splitting the v3 hosts from the larger NFS server (reducing traffic + storage load) seems to have temporarily stabilized things.

----- Original Message -----
> From: [email protected]
> To: "Timothy Pearson" <[email protected]>
> Cc: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Monday, August 9, 2021 11:31:25 AM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> Incidentally, we’re a computer science department. We have such a variety of
> students and researchers that it’s impossible to know what they are all doing.
> Historically, if there’s a bug in anyth9ing, we’ll see it, and usually enough
> for it to be fatal.
>
> question: is backing off to 4.0 or disabling delegations likely to have more of
> an impact on performance?
>
>> On Aug 9, 2021, at 12:17 PM, [email protected] wrote:
>>
>> I just found this because we’ve been dealing with hangs of our primary NFS
>> server. This is ubuntu 20.04, which is 5.10.
>>
>> Right before the hang:
>>
>> Aug 8 21:50:46 communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>> kernel: [294852.644801] receive_cb_reply: Got unrecognized reply: calldir 0x1
>> xpt_bc_xprt 00000000b260cf95 xid e3faa54e
>> Aug 8 21:51:54 communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>> kernel: [294921.252531] receive_cb_reply: Got unrecognized reply: calldir 0x1
>> xpt_bc_xprt 00000000b260cf95 xid f0faa54e
>>
>>
>> I looked at the code, and this seems to be an NFS4.1 callback. We just started
>> seeing the problem after upgrading most of our hosts in a way that caused them
>> to move from NFS 4.0 to 4.2. I assume 4.2 is using the 4.1 callback. Rather
>> than disabling delegations, we’re moving back to NFS 4.0 on the clients (except
>> ESXi).
>>
>> We’re using ZFS, so this isn’t just btrfs.
>>
>> I’m afraid I don’t have any backtrace. I was going to get more information, but
>> it happened late at night and we were unable to get into the system to gather
>> information. Just had to reboot.
>>
>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Forgot to add -- sometimes, right before the core stall and backtrace, we see
>>> messages similar to the following:
>>>
>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>> 0000000051f43ff7 xid 2e0c9b7a
>>> [16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>> 0000000051f43ff7 xid 2f0c9b7a
>>> [16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>> 0000000051f43ff7 xid 300c9b7a
>>>
>>> We're not sure if they are related or not.
>>>
>>> ----- Original Message -----
>>>> From: "Timothy Pearson" <[email protected]
>>>> <mailto:[email protected]>>
>>>> To: "J. Bruce Fields" <[email protected] <mailto:[email protected]>>,
>>>> "Chuck Lever" <[email protected] <mailto:[email protected]>>
>>>> Cc: "linux-nfs" <[email protected] <mailto:[email protected]>>
>>>> Sent: Monday, July 5, 2021 4:44:29 AM
>>>> Subject: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>>
>>>> We've been dealing with a fairly nasty NFS-related problem off and on for the
>>>> past couple of years. The host is a large POWER server with several external
>>>> SAS arrays attached, using BTRFS for cold storage of large amounts of data.
>>>> The main symptom is that under heavy sustained NFS write traffic using certain
>>>> file types (see below) a core will suddenly lock up, continually spewing a
>>>> backtrace similar to the one I've pasted below. While this immediately halts
>>>> all NFS traffic to the affected client (which is never the same client as the
>>>> machine doing the large file transfer), the larger issue is that over the next
>>>> few minutes / hours the entire host will gradually degrade in responsiveness
>>>> until it grinds to a complete halt. Once the core stall occurs we have been
>>>> unable to find any way to restore the machine to full functionality or avoid
>>>> the degradation and eventual hang short of a hard power down and restart.
>>>>
>>>> Tens of GB of compressed data in a single file seems to be fairly good at
>>>> triggering the problem, whereas raw disk images or other regularly patterned
>>>> data tend not to be. The underlying hardware is functioning perfectly with no
>>>> problems noted, and moving the files without NFS avoids the bug.
>>>>
>>>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
>>>> transfer process on the client as soon as other clients start to show a
>>>> slowdown. This hack avoids the bug entirely provided the host is allowed to
>>>> catch back up prior to resuming (SIGCONT) the file transfer process. From
>>>> this, it seems something is going very wrong within the NFS stack under high
>>>> storage I/O pressure and high storage write latency (timeout?) -- it should
>>>> simply pause transfers while the storage subsystem catches up, not lock up a
>>>> core and force a host restart. Interesting, sometimes it does exactly what it
>>>> is supposed to and does pause and wait for the storage subsystem, but around
>>>> 20% of the time it just triggers this bug and stalls a core.
>>>>
>>>> This bug has been present since at least 4.14 and is still present in the latest
>>>> 5.12.14 version.
>>>>
>>>> As the machine is in production, it is difficult to gather further information
>>>> or test patches, however we would be able to apply patches to the kernel that
>>>> would potentially restore stability with enough advance scheduling.
>>>>
>>>> Sample backtrace below:
>>>>
>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>>>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
>>>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>>>> [16846.426273] NMI backtrace for cpu 32
>>>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>> [16846.426406] Call Trace:
>>>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
>>>> (unreliable)
>>>> [16846.426483] [c000200010823290] [c00000000075aebc]
>>>> nmi_cpu_backtrace+0xfc/0x150
>>>> [16846.426506] [c000200010823310] [c00000000075b0a8]
>>>> nmi_trigger_cpumask_backtrace+0x198/0x1f0
>>>> [16846.426577] [c0002000108233b0] [c000000000072818]
>>>> arch_trigger_cpumask_backtrace+0x28/0x40
>>>> [16846.426621] [c0002000108233d0] [c000000000202db8]
>>>> rcu_dump_cpu_stacks+0x158/0x1b8
>>>> [16846.426667] [c000200010823470] [c000000000201828]
>>>> rcu_sched_clock_irq+0x908/0xb10
>>>> [16846.426708] [c000200010823560] [c0000000002141d0]
>>>> update_process_times+0xc0/0x140
>>>> [16846.426768] [c0002000108235a0] [c00000000022dd34]
>>>> tick_sched_handle.isra.18+0x34/0xd0
>>>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>>>> [16846.426856] [c000200010823610] [c00000000021577c]
>>>> __hrtimer_run_queues+0x16c/0x370
>>>> [16846.426903] [c000200010823690] [c000000000216378]
>>>> hrtimer_interrupt+0x128/0x2f0
>>>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
>>>> [16846.426989] [c0002000108237a0] [c000000000016c54]
>>>> replay_soft_interrupts+0x124/0x2e0
>>>> [16846.427045] [c000200010823990] [c000000000016f14]
>>>> arch_local_irq_restore+0x104/0x170
>>>> [16846.427103] [c0002000108239c0] [c00000000017247c]
>>>> mod_delayed_work_on+0x8c/0xe0
>>>> [16846.427149] [c000200010823a20] [c00800000819fe04]
>>>> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>>>> [16846.427234] [c000200010823a40] [c0080000081a096c]
>>>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>>>> [16846.427324] [c000200010823a90] [c0080000081a3080]
>>>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
>>>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
>>>> [nfsd]
>>>> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0
>>>> [sunrpc]
>>>> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
>>>> [sunrpc]
>>>> [16846.427598] [c000200010823c30] [c0080000081a2b18]
>>>> rpc_async_schedule+0x40/0x70 [sunrpc]
>>>> [16846.427687] [c000200010823c60] [c000000000170bf0]
>>>> process_one_work+0x290/0x580
>>>> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
>>>> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec]
>>>> ret_from_kernel_thread+0x5c/0x70
>>>> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s!
>>>> [kworker/u130:25:10624]
>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod
>>>> target_core_user uio target_core_pscsi target_core_file target_core_iblock
>>>> target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
>>>> vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
>>>> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi
>>>> hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci
>>>> xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy
>>>> xfrm_algo mdio libphy aacraid igb raid_cl$
>>>> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>> [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
>>>> [16873.870073] REGS: c0002000108237d0 TRAP: 0900 Not tainted (5.12.14)
>>>> [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
>>>> 24004842 XER: 00000000
>>>> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
>>>> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
>>>> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
>>>> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
>>>> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
>>>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
>>>> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
>>>> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
>>>> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
>>>> [16873.870507] LR [c0080000081a0708]
>>>> rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-08-09 17:16:33

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

There seems to be a soft lockup message on the console, but that’s all I can find.

I’m currently considering whether it’s best to move to NFS 4.0, which seems not to cause the issue, or 4.2 with delegations disabled. This is the primary server for the department. If it fails, everything fails, VMs because read-only, user jobs fai, etc.

We ran for a year before this showed up, so I’m pretty sure going to 4.0 will fix it. But I have use cases for ACLs that will only work with 4.2. Since the problem seems to be in the callback mechanism, and as far as I can tell that’s only used for delegations, I assume turning off delegations will fix it.

We’ve also had a history of issues with 4.2 problems on clients. That’s why we backed off to 4.0 initially. Clients were seeing hangs.

It’s discouraging to hear that even the most recent kernel has problems.

> On Aug 9, 2021, at 1:06 PM, Timothy Pearson <[email protected]> wrote:
>
> Did you see anything else at all on the terminal? The inability to log in is sadly familiar, our boxes are configured to dump a trace over serial every 120 seconds or so if they lock up (/proc/sys/kernel/hung_task_timeout_secs) and I'm not sure you'd see anything past the callback messages without that active.
>
> FWIW we ended up (mostly) working around the problem by moving the critical systems (which are all NFSv3) to a new server, but that's a stopgap measure as we were looking to deploy NFSv4 on a broader scale. My gut feeling is the failure occurs under heavy load where too many NFSv4 requests from a single client are pending due to a combination of storage and network saturation, but it's proven very difficult to debug -- even splitting the v3 hosts from the larger NFS server (reducing traffic + storage load) seems to have temporarily stabilized things.
>
> ----- Original Message -----
>> From: [email protected]
>> To: "Timothy Pearson" <[email protected]>
>> Cc: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
>> <[email protected]>
>> Sent: Monday, August 9, 2021 11:31:25 AM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
>> Incidentally, we’re a computer science department. We have such a variety of
>> students and researchers that it’s impossible to know what they are all doing.
>> Historically, if there’s a bug in anyth9ing, we’ll see it, and usually enough
>> for it to be fatal.
>>
>> question: is backing off to 4.0 or disabling delegations likely to have more of
>> an impact on performance?
>>
>>> On Aug 9, 2021, at 12:17 PM, [email protected] wrote:
>>>
>>> I just found this because we’ve been dealing with hangs of our primary NFS
>>> server. This is ubuntu 20.04, which is 5.10.
>>>
>>> Right before the hang:
>>>
>>> Aug 8 21:50:46 communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>>> kernel: [294852.644801] receive_cb_reply: Got unrecognized reply: calldir 0x1
>>> xpt_bc_xprt 00000000b260cf95 xid e3faa54e
>>> Aug 8 21:51:54 communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>>> kernel: [294921.252531] receive_cb_reply: Got unrecognized reply: calldir 0x1
>>> xpt_bc_xprt 00000000b260cf95 xid f0faa54e
>>>
>>>
>>> I looked at the code, and this seems to be an NFS4.1 callback. We just started
>>> seeing the problem after upgrading most of our hosts in a way that caused them
>>> to move from NFS 4.0 to 4.2. I assume 4.2 is using the 4.1 callback. Rather
>>> than disabling delegations, we’re moving back to NFS 4.0 on the clients (except
>>> ESXi).
>>>
>>> We’re using ZFS, so this isn’t just btrfs.
>>>
>>> I’m afraid I don’t have any backtrace. I was going to get more information, but
>>> it happened late at night and we were unable to get into the system to gather
>>> information. Just had to reboot.
>>>
>>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson <[email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>
>>>> Forgot to add -- sometimes, right before the core stall and backtrace, we see
>>>> messages similar to the following:
>>>>
>>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>> 0000000051f43ff7 xid 2e0c9b7a
>>>> [16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>> 0000000051f43ff7 xid 2f0c9b7a
>>>> [16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>> 0000000051f43ff7 xid 300c9b7a
>>>>
>>>> We're not sure if they are related or not.
>>>>
>>>> ----- Original Message -----
>>>>> From: "Timothy Pearson" <[email protected]
>>>>> <mailto:[email protected]>>
>>>>> To: "J. Bruce Fields" <[email protected] <mailto:[email protected]>>,
>>>>> "Chuck Lever" <[email protected] <mailto:[email protected]>>
>>>>> Cc: "linux-nfs" <[email protected] <mailto:[email protected]>>
>>>>> Sent: Monday, July 5, 2021 4:44:29 AM
>>>>> Subject: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>>>
>>>>> We've been dealing with a fairly nasty NFS-related problem off and on for the
>>>>> past couple of years. The host is a large POWER server with several external
>>>>> SAS arrays attached, using BTRFS for cold storage of large amounts of data.
>>>>> The main symptom is that under heavy sustained NFS write traffic using certain
>>>>> file types (see below) a core will suddenly lock up, continually spewing a
>>>>> backtrace similar to the one I've pasted below. While this immediately halts
>>>>> all NFS traffic to the affected client (which is never the same client as the
>>>>> machine doing the large file transfer), the larger issue is that over the next
>>>>> few minutes / hours the entire host will gradually degrade in responsiveness
>>>>> until it grinds to a complete halt. Once the core stall occurs we have been
>>>>> unable to find any way to restore the machine to full functionality or avoid
>>>>> the degradation and eventual hang short of a hard power down and restart.
>>>>>
>>>>> Tens of GB of compressed data in a single file seems to be fairly good at
>>>>> triggering the problem, whereas raw disk images or other regularly patterned
>>>>> data tend not to be. The underlying hardware is functioning perfectly with no
>>>>> problems noted, and moving the files without NFS avoids the bug.
>>>>>
>>>>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
>>>>> transfer process on the client as soon as other clients start to show a
>>>>> slowdown. This hack avoids the bug entirely provided the host is allowed to
>>>>> catch back up prior to resuming (SIGCONT) the file transfer process. From
>>>>> this, it seems something is going very wrong within the NFS stack under high
>>>>> storage I/O pressure and high storage write latency (timeout?) -- it should
>>>>> simply pause transfers while the storage subsystem catches up, not lock up a
>>>>> core and force a host restart. Interesting, sometimes it does exactly what it
>>>>> is supposed to and does pause and wait for the storage subsystem, but around
>>>>> 20% of the time it just triggers this bug and stalls a core.
>>>>>
>>>>> This bug has been present since at least 4.14 and is still present in the latest
>>>>> 5.12.14 version.
>>>>>
>>>>> As the machine is in production, it is difficult to gather further information
>>>>> or test patches, however we would be able to apply patches to the kernel that
>>>>> would potentially restore stability with enough advance scheduling.
>>>>>
>>>>> Sample backtrace below:
>>>>>
>>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>>>>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
>>>>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>>>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>>>>> [16846.426273] NMI backtrace for cpu 32
>>>>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>> [16846.426406] Call Trace:
>>>>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
>>>>> (unreliable)
>>>>> [16846.426483] [c000200010823290] [c00000000075aebc]
>>>>> nmi_cpu_backtrace+0xfc/0x150
>>>>> [16846.426506] [c000200010823310] [c00000000075b0a8]
>>>>> nmi_trigger_cpumask_backtrace+0x198/0x1f0
>>>>> [16846.426577] [c0002000108233b0] [c000000000072818]
>>>>> arch_trigger_cpumask_backtrace+0x28/0x40
>>>>> [16846.426621] [c0002000108233d0] [c000000000202db8]
>>>>> rcu_dump_cpu_stacks+0x158/0x1b8
>>>>> [16846.426667] [c000200010823470] [c000000000201828]
>>>>> rcu_sched_clock_irq+0x908/0xb10
>>>>> [16846.426708] [c000200010823560] [c0000000002141d0]
>>>>> update_process_times+0xc0/0x140
>>>>> [16846.426768] [c0002000108235a0] [c00000000022dd34]
>>>>> tick_sched_handle.isra.18+0x34/0xd0
>>>>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>>>>> [16846.426856] [c000200010823610] [c00000000021577c]
>>>>> __hrtimer_run_queues+0x16c/0x370
>>>>> [16846.426903] [c000200010823690] [c000000000216378]
>>>>> hrtimer_interrupt+0x128/0x2f0
>>>>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
>>>>> [16846.426989] [c0002000108237a0] [c000000000016c54]
>>>>> replay_soft_interrupts+0x124/0x2e0
>>>>> [16846.427045] [c000200010823990] [c000000000016f14]
>>>>> arch_local_irq_restore+0x104/0x170
>>>>> [16846.427103] [c0002000108239c0] [c00000000017247c]
>>>>> mod_delayed_work_on+0x8c/0xe0
>>>>> [16846.427149] [c000200010823a20] [c00800000819fe04]
>>>>> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>>>>> [16846.427234] [c000200010823a40] [c0080000081a096c]
>>>>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>>>>> [16846.427324] [c000200010823a90] [c0080000081a3080]
>>>>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
>>>>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
>>>>> [nfsd]
>>>>> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0
>>>>> [sunrpc]
>>>>> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
>>>>> [sunrpc]
>>>>> [16846.427598] [c000200010823c30] [c0080000081a2b18]
>>>>> rpc_async_schedule+0x40/0x70 [sunrpc]
>>>>> [16846.427687] [c000200010823c60] [c000000000170bf0]
>>>>> process_one_work+0x290/0x580
>>>>> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
>>>>> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
>>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec]
>>>>> ret_from_kernel_thread+0x5c/0x70
>>>>> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s!
>>>>> [kworker/u130:25:10624]
>>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod
>>>>> target_core_user uio target_core_pscsi target_core_file target_core_iblock
>>>>> target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
>>>>> vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
>>>>> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi
>>>>> hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci
>>>>> xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy
>>>>> xfrm_algo mdio libphy aacraid igb raid_cl$
>>>>> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>>> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>> [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
>>>>> [16873.870073] REGS: c0002000108237d0 TRAP: 0900 Not tainted (5.12.14)
>>>>> [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
>>>>> 24004842 XER: 00000000
>>>>> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
>>>>> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
>>>>> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
>>>>> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
>>>>> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
>>>>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>>> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
>>>>> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
>>>>> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
>>>>> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
>>>>> [16873.870507] LR [c0080000081a0708]
>>>>> rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-08-09 17:26:57

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

It is a bit frustrating, but if there's a workaround with disabling delegations that would at least be welcome news. We have a Kerberized environment and ACLs are a huge selling point for v4.

To the latter point, I get the impression that either there aren't too many organizations using NFS at the scale that both yours and ours are, or they haven't yet figured out that the random lockup issue they're seeing originates in the NFS stack. My suspicion is the latter, but time will tell.

----- Original Message -----
> From: "hedrick" <[email protected]>
> To: "Timothy Pearson" <[email protected]>
> Cc: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Monday, August 9, 2021 12:15:33 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> There seems to be a soft lockup message on the console, but that’s all I can
> find.
>
> I’m currently considering whether it’s best to move to NFS 4.0, which seems not
> to cause the issue, or 4.2 with delegations disabled. This is the primary
> server for the department. If it fails, everything fails, VMs because
> read-only, user jobs fai, etc.
>
> We ran for a year before this showed up, so I’m pretty sure going to 4.0 will
> fix it. But I have use cases for ACLs that will only work with 4.2. Since the
> problem seems to be in the callback mechanism, and as far as I can tell that’s
> only used for delegations, I assume turning off delegations will fix it.
>
> We’ve also had a history of issues with 4.2 problems on clients. That’s why we
> backed off to 4.0 initially. Clients were seeing hangs.
>
> It’s discouraging to hear that even the most recent kernel has problems.
>
>> On Aug 9, 2021, at 1:06 PM, Timothy Pearson <[email protected]>
>> wrote:
>>
>> Did you see anything else at all on the terminal? The inability to log in is
>> sadly familiar, our boxes are configured to dump a trace over serial every 120
>> seconds or so if they lock up (/proc/sys/kernel/hung_task_timeout_secs) and I'm
>> not sure you'd see anything past the callback messages without that active.
>>
>> FWIW we ended up (mostly) working around the problem by moving the critical
>> systems (which are all NFSv3) to a new server, but that's a stopgap measure as
>> we were looking to deploy NFSv4 on a broader scale. My gut feeling is the
>> failure occurs under heavy load where too many NFSv4 requests from a single
>> client are pending due to a combination of storage and network saturation, but
>> it's proven very difficult to debug -- even splitting the v3 hosts from the
>> larger NFS server (reducing traffic + storage load) seems to have temporarily
>> stabilized things.
>>
>> ----- Original Message -----
>>> From: [email protected]
>>> To: "Timothy Pearson" <[email protected]>
>>> Cc: "J. Bruce Fields" <[email protected]>, "Chuck Lever"
>>> <[email protected]>, "linux-nfs"
>>> <[email protected]>
>>> Sent: Monday, August 9, 2021 11:31:25 AM
>>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>
>>> Incidentally, we’re a computer science department. We have such a variety of
>>> students and researchers that it’s impossible to know what they are all doing.
>>> Historically, if there’s a bug in anyth9ing, we’ll see it, and usually enough
>>> for it to be fatal.
>>>
>>> question: is backing off to 4.0 or disabling delegations likely to have more of
>>> an impact on performance?
>>>
>>>> On Aug 9, 2021, at 12:17 PM, [email protected] wrote:
>>>>
>>>> I just found this because we’ve been dealing with hangs of our primary NFS
>>>> server. This is ubuntu 20.04, which is 5.10.
>>>>
>>>> Right before the hang:
>>>>
>>>> Aug 8 21:50:46 communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>>>> kernel: [294852.644801] receive_cb_reply: Got unrecognized reply: calldir 0x1
>>>> xpt_bc_xprt 00000000b260cf95 xid e3faa54e
>>>> Aug 8 21:51:54 communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>>>> kernel: [294921.252531] receive_cb_reply: Got unrecognized reply: calldir 0x1
>>>> xpt_bc_xprt 00000000b260cf95 xid f0faa54e
>>>>
>>>>
>>>> I looked at the code, and this seems to be an NFS4.1 callback. We just started
>>>> seeing the problem after upgrading most of our hosts in a way that caused them
>>>> to move from NFS 4.0 to 4.2. I assume 4.2 is using the 4.1 callback. Rather
>>>> than disabling delegations, we’re moving back to NFS 4.0 on the clients (except
>>>> ESXi).
>>>>
>>>> We’re using ZFS, so this isn’t just btrfs.
>>>>
>>>> I’m afraid I don’t have any backtrace. I was going to get more information, but
>>>> it happened late at night and we were unable to get into the system to gather
>>>> information. Just had to reboot.
>>>>
>>>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> Forgot to add -- sometimes, right before the core stall and backtrace, we see
>>>>> messages similar to the following:
>>>>>
>>>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>>> 0000000051f43ff7 xid 2e0c9b7a
>>>>> [16825.414070] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>>> 0000000051f43ff7 xid 2f0c9b7a
>>>>> [16825.414360] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>>> 0000000051f43ff7 xid 300c9b7a
>>>>>
>>>>> We're not sure if they are related or not.
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Timothy Pearson" <[email protected]
>>>>>> <mailto:[email protected]>>
>>>>>> To: "J. Bruce Fields" <[email protected] <mailto:[email protected]>>,
>>>>>> "Chuck Lever" <[email protected] <mailto:[email protected]>>
>>>>>> Cc: "linux-nfs" <[email protected] <mailto:[email protected]>>
>>>>>> Sent: Monday, July 5, 2021 4:44:29 AM
>>>>>> Subject: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>>>>
>>>>>> We've been dealing with a fairly nasty NFS-related problem off and on for the
>>>>>> past couple of years. The host is a large POWER server with several external
>>>>>> SAS arrays attached, using BTRFS for cold storage of large amounts of data.
>>>>>> The main symptom is that under heavy sustained NFS write traffic using certain
>>>>>> file types (see below) a core will suddenly lock up, continually spewing a
>>>>>> backtrace similar to the one I've pasted below. While this immediately halts
>>>>>> all NFS traffic to the affected client (which is never the same client as the
>>>>>> machine doing the large file transfer), the larger issue is that over the next
>>>>>> few minutes / hours the entire host will gradually degrade in responsiveness
>>>>>> until it grinds to a complete halt. Once the core stall occurs we have been
>>>>>> unable to find any way to restore the machine to full functionality or avoid
>>>>>> the degradation and eventual hang short of a hard power down and restart.
>>>>>>
>>>>>> Tens of GB of compressed data in a single file seems to be fairly good at
>>>>>> triggering the problem, whereas raw disk images or other regularly patterned
>>>>>> data tend not to be. The underlying hardware is functioning perfectly with no
>>>>>> problems noted, and moving the files without NFS avoids the bug.
>>>>>>
>>>>>> We've been using a workaround involving purposefully pausing (SIGSTOP) the file
>>>>>> transfer process on the client as soon as other clients start to show a
>>>>>> slowdown. This hack avoids the bug entirely provided the host is allowed to
>>>>>> catch back up prior to resuming (SIGCONT) the file transfer process. From
>>>>>> this, it seems something is going very wrong within the NFS stack under high
>>>>>> storage I/O pressure and high storage write latency (timeout?) -- it should
>>>>>> simply pause transfers while the storage subsystem catches up, not lock up a
>>>>>> core and force a host restart. Interesting, sometimes it does exactly what it
>>>>>> is supposed to and does pause and wait for the storage subsystem, but around
>>>>>> 20% of the time it just triggers this bug and stalls a core.
>>>>>>
>>>>>> This bug has been present since at least 4.14 and is still present in the latest
>>>>>> 5.12.14 version.
>>>>>>
>>>>>> As the machine is in production, it is difficult to gather further information
>>>>>> or test patches, however we would be able to apply patches to the kernel that
>>>>>> would potentially restore stability with enough advance scheduling.
>>>>>>
>>>>>> Sample backtrace below:
>>>>>>
>>>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>>>>>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
>>>>>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>>>>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>>>>>> [16846.426273] NMI backtrace for cpu 32
>>>>>> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>>> [16846.426406] Call Trace:
>>>>>> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114
>>>>>> (unreliable)
>>>>>> [16846.426483] [c000200010823290] [c00000000075aebc]
>>>>>> nmi_cpu_backtrace+0xfc/0x150
>>>>>> [16846.426506] [c000200010823310] [c00000000075b0a8]
>>>>>> nmi_trigger_cpumask_backtrace+0x198/0x1f0
>>>>>> [16846.426577] [c0002000108233b0] [c000000000072818]
>>>>>> arch_trigger_cpumask_backtrace+0x28/0x40
>>>>>> [16846.426621] [c0002000108233d0] [c000000000202db8]
>>>>>> rcu_dump_cpu_stacks+0x158/0x1b8
>>>>>> [16846.426667] [c000200010823470] [c000000000201828]
>>>>>> rcu_sched_clock_irq+0x908/0xb10
>>>>>> [16846.426708] [c000200010823560] [c0000000002141d0]
>>>>>> update_process_times+0xc0/0x140
>>>>>> [16846.426768] [c0002000108235a0] [c00000000022dd34]
>>>>>> tick_sched_handle.isra.18+0x34/0xd0
>>>>>> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>>>>>> [16846.426856] [c000200010823610] [c00000000021577c]
>>>>>> __hrtimer_run_queues+0x16c/0x370
>>>>>> [16846.426903] [c000200010823690] [c000000000216378]
>>>>>> hrtimer_interrupt+0x128/0x2f0
>>>>>> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
>>>>>> [16846.426989] [c0002000108237a0] [c000000000016c54]
>>>>>> replay_soft_interrupts+0x124/0x2e0
>>>>>> [16846.427045] [c000200010823990] [c000000000016f14]
>>>>>> arch_local_irq_restore+0x104/0x170
>>>>>> [16846.427103] [c0002000108239c0] [c00000000017247c]
>>>>>> mod_delayed_work_on+0x8c/0xe0
>>>>>> [16846.427149] [c000200010823a20] [c00800000819fe04]
>>>>>> rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>>>>>> [16846.427234] [c000200010823a40] [c0080000081a096c]
>>>>>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>>>>>> [16846.427324] [c000200010823a90] [c0080000081a3080]
>>>>>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
>>>>>> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
>>>>>> [nfsd]
>>>>>> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0
>>>>>> [sunrpc]
>>>>>> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
>>>>>> [sunrpc]
>>>>>> [16846.427598] [c000200010823c30] [c0080000081a2b18]
>>>>>> rpc_async_schedule+0x40/0x70 [sunrpc]
>>>>>> [16846.427687] [c000200010823c60] [c000000000170bf0]
>>>>>> process_one_work+0x290/0x580
>>>>>> [16846.427736] [c000200010823d00] [c000000000170f68] worker_thread+0x88/0x620
>>>>>> [16846.427813] [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
>>>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec]
>>>>>> ret_from_kernel_thread+0x5c/0x70
>>>>>> [16873.869180] watchdog: BUG: soft lockup - CPU#32 stuck for 49s!
>>>>>> [kworker/u130:25:10624]
>>>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5 iscsi_target_mod
>>>>>> target_core_user uio target_core_pscsi target_core_file target_core_iblock
>>>>>> target_core_mod tun nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
>>>>>> vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
>>>>>> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod sd_mod t10_pi
>>>>>> hid_generic usbhid hid ses enclosure crct10dif_vpmsum crc32c_vpmsum xhci_pci
>>>>>> xhci_hcd ixgbe mlx4_core mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy
>>>>>> xfrm_algo mdio libphy aacraid igb raid_cl$
>>>>>> [16873.869889] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>>>> [16873.869966] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>>> [16873.870023] NIP: c000000000711300 LR: c0080000081a0708 CTR: c0000000007112a0
>>>>>> [16873.870073] REGS: c0002000108237d0 TRAP: 0900 Not tainted (5.12.14)
>>>>>> [16873.870109] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
>>>>>> 24004842 XER: 00000000
>>>>>> [16873.870146] CFAR: c0080000081d8054 IRQMASK: 0
>>>>>> GPR00: c0080000081a0748 c000200010823a70 c0000000015c0700 c0000000e2227a40
>>>>>> GPR04: c0000000e2227a40 c0000000e2227a40 c000200ffb6cc0a8 0000000000000018
>>>>>> GPR08: 0000000000000000 5deadbeef0000122 c0080000081ffd18 c0080000081d8040
>>>>>> GPR12: c0000000007112a0 c000200fff7fee00 c00000000017b6c8 c000000090d9ccc0
>>>>>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>>>>>> GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000040
>>>>>> GPR24: 0000000000000000 0000000000000000 fffffffffffffe00 0000000000000001
>>>>>> GPR28: c00000001a62f000 c0080000081a0988 c0080000081ffd10 c0000000e2227a00
>>>>>> [16873.870452] NIP [c000000000711300] __list_del_entry_valid+0x60/0x100
>>>>>> [16873.870507] LR [c0080000081a0708]
> >>>>> rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-08-09 17:38:10

by Chuck Lever III

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> On Aug 9, 2021, at 1:15 PM, [email protected] wrote:
>
> There seems to be a soft lockup message on the console, but that’s all I can find.

Then when you say "server hangs" you mean that the entire NFS server
system deadlocks. It's not just unresponsive on one or more exports.

A soft lockup is typically caused by a segmentation fault in code
that is not running in process context.

> I’m currently considering whether it’s best to move to NFS 4.0, which seems not to cause the issue, or 4.2 with delegations disabled. This is the primary server for the department. If it fails, everything fails, VMs because read-only, user jobs fai, etc.
>
> We ran for a year before this showed up, so I’m pretty sure going to 4.0 will fix it. But I have use cases for ACLs that will only work with 4.2. Since the problem seems to be in the callback mechanism, and as far as I can tell that’s only used for delegations, I assume turning off delegations will fix it.

In NFSv4.1 and later, the callback channel is also used for pNFS. It
can also be used for lock notification in all minor versions.

Disabling delegation can have a performance impact, but it depends on
the nature of your workloads and whether files are shared amongst
your client population.

> We’ve also had a history of issues with 4.2 problems on clients. That’s why we backed off to 4.0 initially. Clients were seeing hangs.

Let's stick with the server issue for the moment.

Enabling some tracepoints might give us more insight, though if the
server then crashes we would be hard pressed to examine the trace
records. If it's pretty common to get multiple receive_cb_reply
error messages in a short time space, you might enable a triggered
tracepoint in that function to start a 60-second tcpdump capture to
a file.

--
Chuck Lever

2021-08-09 18:31:32

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Mon, Aug 09, 2021 at 01:15:33PM -0400, [email protected] wrote:
> There seems to be a soft lockup message on the console, but that’s all
> I can find.
>
> I’m currently considering whether it’s best to move to NFS 4.0, which
> seems not to cause the issue, or 4.2 with delegations disabled. This
> is the primary server for the department. If it fails, everything
> fails, VMs because read-only, user jobs fai, etc.
>
> We ran for a year before this showed up, so I’m pretty sure going to
> 4.0 will fix it.

I thought you also upgraded the kernel at the same time? (What were the
two kernels involved?) So we don't know whether it's a new kernel bug,
or an NFSv4.2-specific bug, or something else.

> But I have use cases for ACLs that will only work
> with 4.2.

NFSv4 ACLs on the Linux server are the same in 4.0 and 4.2.

> Since the problem seems to be in the callback mechanism, and
> as far as I can tell that’s only used for delegations, I assume
> turning off delegations will fix it.

It could be. Though I asked mainly as a way to help narrow down where
the problem is.

--b.

> We’ve also had a history of issues with 4.2 problems on clients.
> That’s why we backed off to 4.0 initially. Clients were seeing hangs.
>
> It’s discouraging to hear that even the most recent kernel has
> problems.
>
> > On Aug 9, 2021, at 1:06 PM, Timothy Pearson
> > <[email protected]> wrote:
> >
> > Did you see anything else at all on the terminal? The inability to
> > log in is sadly familiar, our boxes are configured to dump a trace
> > over serial every 120 seconds or so if they lock up
> > (/proc/sys/kernel/hung_task_timeout_secs) and I'm not sure you'd see
> > anything past the callback messages without that active.
> >
> > FWIW we ended up (mostly) working around the problem by moving the
> > critical systems (which are all NFSv3) to a new server, but that's a
> > stopgap measure as we were looking to deploy NFSv4 on a broader
> > scale. My gut feeling is the failure occurs under heavy load where
> > too many NFSv4 requests from a single client are pending due to a
> > combination of storage and network saturation, but it's proven very
> > difficult to debug -- even splitting the v3 hosts from the larger
> > NFS server (reducing traffic + storage load) seems to have
> > temporarily stabilized things.
> >
> > ----- Original Message -----
> >> From: [email protected] To: "Timothy Pearson"
> >> <[email protected]> Cc: "J. Bruce Fields"
> >> <[email protected]>, "Chuck Lever" <[email protected]>,
> >> "linux-nfs" <[email protected]> Sent: Monday, August 9,
> >> 2021 11:31:25 AM Subject: Re: CPU stall, eventual host hang with
> >> BTRFS + NFS under heavy load
> >
> >> Incidentally, we’re a computer science department. We have such a
> >> variety of students and researchers that it’s impossible to know
> >> what they are all doing. Historically, if there’s a bug in
> >> anyth9ing, we’ll see it, and usually enough for it to be fatal.
> >>
> >> question: is backing off to 4.0 or disabling delegations likely to
> >> have more of an impact on performance?
> >>
> >>> On Aug 9, 2021, at 12:17 PM, [email protected] wrote:
> >>>
> >>> I just found this because we’ve been dealing with hangs of our
> >>> primary NFS server. This is ubuntu 20.04, which is 5.10.
> >>>
> >>> Right before the hang:
> >>>
> >>> Aug 8 21:50:46 communis.lcsr.rutgers.edu
> >>> <http://communis.lcsr.rutgers.edu/> kernel: [294852.644801]
> >>> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
> >>> 00000000b260cf95 xid e3faa54e Aug 8 21:51:54
> >>> communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
> >>> kernel: [294921.252531] receive_cb_reply: Got unrecognized reply:
> >>> calldir 0x1 xpt_bc_xprt 00000000b260cf95 xid f0faa54e
> >>>
> >>>
> >>> I looked at the code, and this seems to be an NFS4.1 callback. We
> >>> just started seeing the problem after upgrading most of our hosts
> >>> in a way that caused them to move from NFS 4.0 to 4.2. I assume
> >>> 4.2 is using the 4.1 callback. Rather than disabling delegations,
> >>> we’re moving back to NFS 4.0 on the clients (except ESXi).
> >>>
> >>> We’re using ZFS, so this isn’t just btrfs.
> >>>
> >>> I’m afraid I don’t have any backtrace. I was going to get more
> >>> information, but it happened late at night and we were unable to
> >>> get into the system to gather information. Just had to reboot.
> >>>
> >>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson
> >>>> <[email protected]
> >>>> <mailto:[email protected]>> wrote:
> >>>>
> >>>> Forgot to add -- sometimes, right before the core stall and
> >>>> backtrace, we see messages similar to the following:
> >>>>
> >>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir
> >>>> 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2e0c9b7a [16825.414070]
> >>>> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
> >>>> 0000000051f43ff7 xid 2f0c9b7a [16825.414360] receive_cb_reply:
> >>>> Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7
> >>>> xid 300c9b7a
> >>>>
> >>>> We're not sure if they are related or not.
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Timothy Pearson" <[email protected]
> >>>>> <mailto:[email protected]>> To: "J. Bruce Fields"
> >>>>> <[email protected] <mailto:[email protected]>>, "Chuck
> >>>>> Lever" <[email protected] <mailto:[email protected]>>
> >>>>> Cc: "linux-nfs" <[email protected]
> >>>>> <mailto:[email protected]>> Sent: Monday, July 5, 2021
> >>>>> 4:44:29 AM Subject: CPU stall, eventual host hang with BTRFS +
> >>>>> NFS under heavy load
> >>>>
> >>>>> We've been dealing with a fairly nasty NFS-related problem off
> >>>>> and on for the past couple of years. The host is a large POWER
> >>>>> server with several external SAS arrays attached, using BTRFS
> >>>>> for cold storage of large amounts of data. The main symptom is
> >>>>> that under heavy sustained NFS write traffic using certain file
> >>>>> types (see below) a core will suddenly lock up, continually
> >>>>> spewing a backtrace similar to the one I've pasted below. While
> >>>>> this immediately halts all NFS traffic to the affected client
> >>>>> (which is never the same client as the machine doing the large
> >>>>> file transfer), the larger issue is that over the next few
> >>>>> minutes / hours the entire host will gradually degrade in
> >>>>> responsiveness until it grinds to a complete halt. Once the
> >>>>> core stall occurs we have been unable to find any way to restore
> >>>>> the machine to full functionality or avoid the degradation and
> >>>>> eventual hang short of a hard power down and restart.
> >>>>>
> >>>>> Tens of GB of compressed data in a single file seems to be
> >>>>> fairly good at triggering the problem, whereas raw disk images
> >>>>> or other regularly patterned data tend not to be. The
> >>>>> underlying hardware is functioning perfectly with no problems
> >>>>> noted, and moving the files without NFS avoids the bug.
> >>>>>
> >>>>> We've been using a workaround involving purposefully pausing
> >>>>> (SIGSTOP) the file transfer process on the client as soon as
> >>>>> other clients start to show a slowdown. This hack avoids the
> >>>>> bug entirely provided the host is allowed to catch back up prior
> >>>>> to resuming (SIGCONT) the file transfer process. From this, it
> >>>>> seems something is going very wrong within the NFS stack under
> >>>>> high storage I/O pressure and high storage write latency
> >>>>> (timeout?) -- it should simply pause transfers while the storage
> >>>>> subsystem catches up, not lock up a core and force a host
> >>>>> restart. Interesting, sometimes it does exactly what it is
> >>>>> supposed to and does pause and wait for the storage subsystem,
> >>>>> but around 20% of the time it just triggers this bug and stalls
> >>>>> a core.
> >>>>>
> >>>>> This bug has been present since at least 4.14 and is still
> >>>>> present in the latest 5.12.14 version.
> >>>>>
> >>>>> As the machine is in production, it is difficult to gather
> >>>>> further information or test patches, however we would be able to
> >>>>> apply patches to the kernel that would potentially restore
> >>>>> stability with enough advance scheduling.
> >>>>>
> >>>>> Sample backtrace below:
> >>>>>
> >>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
> >>>>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> >>>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> >>>>> [16846.426273] NMI backtrace for cpu 32 [16846.426298] CPU: 32
> >>>>> PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> >>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> >>>>> [16846.426406] Call Trace: [16846.426429] [c000200010823250]
> >>>>> [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> >>>>> [16846.426483] [c000200010823290] [c00000000075aebc]
> >>>>> nmi_cpu_backtrace+0xfc/0x150 [16846.426506] [c000200010823310]
> >>>>> [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> >>>>> [16846.426577] [c0002000108233b0] [c000000000072818]
> >>>>> arch_trigger_cpumask_backtrace+0x28/0x40 [16846.426621]
> >>>>> [c0002000108233d0] [c000000000202db8]
> >>>>> rcu_dump_cpu_stacks+0x158/0x1b8 [16846.426667]
> >>>>> [c000200010823470] [c000000000201828]
> >>>>> rcu_sched_clock_irq+0x908/0xb10 [16846.426708]
> >>>>> [c000200010823560] [c0000000002141d0]
> >>>>> update_process_times+0xc0/0x140 [16846.426768]
> >>>>> [c0002000108235a0] [c00000000022dd34]
> >>>>> tick_sched_handle.isra.18+0x34/0xd0 [16846.426808]
> >>>>> [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> >>>>> [16846.426856] [c000200010823610] [c00000000021577c]
> >>>>> __hrtimer_run_queues+0x16c/0x370 [16846.426903]
> >>>>> [c000200010823690] [c000000000216378]
> >>>>> hrtimer_interrupt+0x128/0x2f0 [16846.426947] [c000200010823740]
> >>>>> [c000000000029494] timer_interrupt+0x134/0x310 [16846.426989]
> >>>>> [c0002000108237a0] [c000000000016c54]
> >>>>> replay_soft_interrupts+0x124/0x2e0 [16846.427045]
> >>>>> [c000200010823990] [c000000000016f14]
> >>>>> arch_local_irq_restore+0x104/0x170 [16846.427103]
> >>>>> [c0002000108239c0] [c00000000017247c]
> >>>>> mod_delayed_work_on+0x8c/0xe0 [16846.427149] [c000200010823a20]
> >>>>> [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> >>>>> [16846.427234] [c000200010823a40] [c0080000081a096c]
> >>>>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> >>>>> [16846.427324] [c000200010823a90] [c0080000081a3080]
> >>>>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc] [16846.427388]
> >>>>> [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
> >>>>> [nfsd] [16846.427457] [c000200010823b60] [c0080000081a0a0c]
> >>>>> rpc_exit_task+0x84/0x1d0 [sunrpc] [16846.427520]
> >>>>> [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
> >>>>> [sunrpc] [16846.427598] [c000200010823c30] [c0080000081a2b18]
> >>>>> rpc_async_schedule+0x40/0x70 [sunrpc] [16846.427687]
> >>>>> [c000200010823c60] [c000000000170bf0]
> >>>>> process_one_work+0x290/0x580 [16846.427736] [c000200010823d00]
> >>>>> [c000000000170f68] worker_thread+0x88/0x620 [16846.427813]
> >>>>> [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
> >>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec]
> >>>>> ret_from_kernel_thread+0x5c/0x70 [16873.869180] watchdog: BUG:
> >>>>> soft lockup - CPU#32 stuck for 49s! [kworker/u130:25:10624]
> >>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5
> >>>>> iscsi_target_mod target_core_user uio target_core_pscsi
> >>>>> target_core_file target_core_iblock target_core_mod tun
> >>>>> nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
> >>>>> vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
> >>>>> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod
> >>>>> sd_mod t10_pi hid_generic usbhid hid ses enclosure
> >>>>> crct10dif_vpmsum crc32c_vpmsum xhci_pci xhci_hcd ixgbe mlx4_core
> >>>>> mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy xfrm_algo mdio
> >>>>> libphy aacraid igb raid_cl$ [16873.869889] CPU: 32 PID: 10624
> >>>>> Comm: kworker/u130:25 Not tainted 5.12.14 #1 [16873.869966]
> >>>>> Workqueue: rpciod rpc_async_schedule [sunrpc] [16873.870023]
> >>>>> NIP: c000000000711300 LR: c0080000081a0708 CTR:
> >>>>> c0000000007112a0 [16873.870073] REGS: c0002000108237d0 TRAP:
> >>>>> 0900 Not tainted (5.12.14) [16873.870109] MSR:
> >>>>> 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
> >>>>> 24004842 XER: 00000000 [16873.870146] CFAR: c0080000081d8054
> >>>>> IRQMASK: 0 GPR00: c0080000081a0748 c000200010823a70
> >>>>> c0000000015c0700 c0000000e2227a40 GPR04: c0000000e2227a40
> >>>>> c0000000e2227a40 c000200ffb6cc0a8 0000000000000018 GPR08:
> >>>>> 0000000000000000 5deadbeef0000122 c0080000081ffd18
> >>>>> c0080000081d8040 GPR12: c0000000007112a0 c000200fff7fee00
> >>>>> c00000000017b6c8 c000000090d9ccc0 GPR16: 0000000000000000
> >>>>> 0000000000000000 0000000000000000 0000000000000000 GPR20:
> >>>>> 0000000000000000 0000000000000000 0000000000000000
> >>>>> 0000000000000040 GPR24: 0000000000000000 0000000000000000
> >>>>> fffffffffffffe00 0000000000000001 GPR28: c00000001a62f000
> >>>>> c0080000081a0988 c0080000081ffd10 c0000000e2227a00
> >>>>> [16873.870452] NIP [c000000000711300]
> >>>>> __list_del_entry_valid+0x60/0x100 [16873.870507] LR
> >>>>> [c0080000081a0708]
> >>>>> rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-08-09 18:32:37

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

FWIW that's *exactly* what we see. Eventually, if the server is left alone for enough time, even the login system stops responding -- it's as if the I/O subsystem degrades and eventually blocks entirely.

----- Original Message -----
> From: "hedrick" <[email protected]>
> To: "Chuck Lever" <[email protected]>
> Cc: "Timothy Pearson" <[email protected]>, "J. Bruce Fields" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Monday, August 9, 2021 1:29:30 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> Evidence is ambiguous. It seems that NFS activity hangs. The first time this
> occurred I saw a process at 100% running rpciod. I tried to do a “sync” and
> reboot, but the sync hung.
>
> The last time I couldn’t get data, but the kernel was running and responding to
> ping. An ssh session responded to CR but when I tried to sudo it hung. Attempt
> to login hung. Oddly, even though the ssh session responded to CR, syslog
> entries on the local system stopped until the reboot. However we also send
> syslog entries to a central server. Those continued and showed a continuing set
> of mounts and unmounts happening through the reboot.
>
> I was goiog to get a stack trace of the 100% process if that happened again, but
> last time I wasn’t in a situation to do that. I don’t think users will put up
> with further attempts to debug, so for the moment I’m going to try disabling
> delegations.
>
>> On Aug 9, 2021, at 1:37 PM, Chuck Lever III <[email protected]> wrote:
>>
>> Then when you say "server hangs" you mean that the entire NFS server
> > system deadlocks. It's not just unresponsive on one or more exports.

2021-08-09 18:35:19

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

No, I backed down from Ubuntu 5.4.0-80 to 5.4.0-74. We had been running 74 safely for months, so I thought it might be an issue with 80. But after rebooting I pulled the source. There were no changes to NFS, and I’m pretty sure no changes to ZFS.

This server has been running 5.4.0 throughout its life. It was 74 for about 3 months. Not sure what it was running before.

> On Aug 9, 2021, at 2:30 PM, J. Bruce Fields <[email protected]> wrote:
>
> On Mon, Aug 09, 2021 at 01:15:33PM -0400, [email protected] wrote:
>> There seems to be a soft lockup message on the console, but that’s all
>> I can find.
>>
>> I’m currently considering whether it’s best to move to NFS 4.0, which
>> seems not to cause the issue, or 4.2 with delegations disabled. This
>> is the primary server for the department. If it fails, everything
>> fails, VMs because read-only, user jobs fai, etc.
>>
>> We ran for a year before this showed up, so I’m pretty sure going to
>> 4.0 will fix it.
>
> I thought you also upgraded the kernel at the same time? (What were the
> two kernels involved?) So we don't know whether it's a new kernel bug,
> or an NFSv4.2-specific bug, or something else.
>
>> But I have use cases for ACLs that will only work
>> with 4.2.
>
> NFSv4 ACLs on the Linux server are the same in 4.0 and 4.2.
>
>> Since the problem seems to be in the callback mechanism, and
>> as far as I can tell that’s only used for delegations, I assume
>> turning off delegations will fix it.
>
> It could be. Though I asked mainly as a way to help narrow down where
> the problem is.
>
> --b.
>
>> We’ve also had a history of issues with 4.2 problems on clients.
>> That’s why we backed off to 4.0 initially. Clients were seeing hangs.
>>
>> It’s discouraging to hear that even the most recent kernel has
>> problems.
>>
>>> On Aug 9, 2021, at 1:06 PM, Timothy Pearson
>>> <[email protected]> wrote:
>>>
>>> Did you see anything else at all on the terminal? The inability to
>>> log in is sadly familiar, our boxes are configured to dump a trace
>>> over serial every 120 seconds or so if they lock up
>>> (/proc/sys/kernel/hung_task_timeout_secs) and I'm not sure you'd see
>>> anything past the callback messages without that active.
>>>
>>> FWIW we ended up (mostly) working around the problem by moving the
>>> critical systems (which are all NFSv3) to a new server, but that's a
>>> stopgap measure as we were looking to deploy NFSv4 on a broader
>>> scale. My gut feeling is the failure occurs under heavy load where
>>> too many NFSv4 requests from a single client are pending due to a
>>> combination of storage and network saturation, but it's proven very
>>> difficult to debug -- even splitting the v3 hosts from the larger
>>> NFS server (reducing traffic + storage load) seems to have
>>> temporarily stabilized things.
>>>
>>> ----- Original Message -----
>>>> From: [email protected] To: "Timothy Pearson"
>>>> <[email protected]> Cc: "J. Bruce Fields"
>>>> <[email protected]>, "Chuck Lever" <[email protected]>,
>>>> "linux-nfs" <[email protected]> Sent: Monday, August 9,
>>>> 2021 11:31:25 AM Subject: Re: CPU stall, eventual host hang with
>>>> BTRFS + NFS under heavy load
>>>
>>>> Incidentally, we’re a computer science department. We have such a
>>>> variety of students and researchers that it’s impossible to know
>>>> what they are all doing. Historically, if there’s a bug in
>>>> anyth9ing, we’ll see it, and usually enough for it to be fatal.
>>>>
>>>> question: is backing off to 4.0 or disabling delegations likely to
>>>> have more of an impact on performance?
>>>>
>>>>> On Aug 9, 2021, at 12:17 PM, [email protected] wrote:
>>>>>
>>>>> I just found this because we’ve been dealing with hangs of our
>>>>> primary NFS server. This is ubuntu 20.04, which is 5.10.
>>>>>
>>>>> Right before the hang:
>>>>>
>>>>> Aug 8 21:50:46 communis.lcsr.rutgers.edu
>>>>> <http://communis.lcsr.rutgers.edu/> kernel: [294852.644801]
>>>>> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>>> 00000000b260cf95 xid e3faa54e Aug 8 21:51:54
>>>>> communis.lcsr.rutgers.edu <http://communis.lcsr.rutgers.edu/>
>>>>> kernel: [294921.252531] receive_cb_reply: Got unrecognized reply:
>>>>> calldir 0x1 xpt_bc_xprt 00000000b260cf95 xid f0faa54e
>>>>>
>>>>>
>>>>> I looked at the code, and this seems to be an NFS4.1 callback. We
>>>>> just started seeing the problem after upgrading most of our hosts
>>>>> in a way that caused them to move from NFS 4.0 to 4.2. I assume
>>>>> 4.2 is using the 4.1 callback. Rather than disabling delegations,
>>>>> we’re moving back to NFS 4.0 on the clients (except ESXi).
>>>>>
>>>>> We’re using ZFS, so this isn’t just btrfs.
>>>>>
>>>>> I’m afraid I don’t have any backtrace. I was going to get more
>>>>> information, but it happened late at night and we were unable to
>>>>> get into the system to gather information. Just had to reboot.
>>>>>
>>>>>> On Jul 5, 2021, at 5:47 AM, Timothy Pearson
>>>>>> <[email protected]
>>>>>> <mailto:[email protected]>> wrote:
>>>>>>
>>>>>> Forgot to add -- sometimes, right before the core stall and
>>>>>> backtrace, we see messages similar to the following:
>>>>>>
>>>>>> [16825.408854] receive_cb_reply: Got unrecognized reply: calldir
>>>>>> 0x1 xpt_bc_xprt 0000000051f43ff7 xid 2e0c9b7a [16825.414070]
>>>>>> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt
>>>>>> 0000000051f43ff7 xid 2f0c9b7a [16825.414360] receive_cb_reply:
>>>>>> Got unrecognized reply: calldir 0x1 xpt_bc_xprt 0000000051f43ff7
>>>>>> xid 300c9b7a
>>>>>>
>>>>>> We're not sure if they are related or not.
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Timothy Pearson" <[email protected]
>>>>>>> <mailto:[email protected]>> To: "J. Bruce Fields"
>>>>>>> <[email protected] <mailto:[email protected]>>, "Chuck
>>>>>>> Lever" <[email protected] <mailto:[email protected]>>
>>>>>>> Cc: "linux-nfs" <[email protected]
>>>>>>> <mailto:[email protected]>> Sent: Monday, July 5, 2021
>>>>>>> 4:44:29 AM Subject: CPU stall, eventual host hang with BTRFS +
>>>>>>> NFS under heavy load
>>>>>>
>>>>>>> We've been dealing with a fairly nasty NFS-related problem off
>>>>>>> and on for the past couple of years. The host is a large POWER
>>>>>>> server with several external SAS arrays attached, using BTRFS
>>>>>>> for cold storage of large amounts of data. The main symptom is
>>>>>>> that under heavy sustained NFS write traffic using certain file
>>>>>>> types (see below) a core will suddenly lock up, continually
>>>>>>> spewing a backtrace similar to the one I've pasted below. While
>>>>>>> this immediately halts all NFS traffic to the affected client
>>>>>>> (which is never the same client as the machine doing the large
>>>>>>> file transfer), the larger issue is that over the next few
>>>>>>> minutes / hours the entire host will gradually degrade in
>>>>>>> responsiveness until it grinds to a complete halt. Once the
>>>>>>> core stall occurs we have been unable to find any way to restore
>>>>>>> the machine to full functionality or avoid the degradation and
>>>>>>> eventual hang short of a hard power down and restart.
>>>>>>>
>>>>>>> Tens of GB of compressed data in a single file seems to be
>>>>>>> fairly good at triggering the problem, whereas raw disk images
>>>>>>> or other regularly patterned data tend not to be. The
>>>>>>> underlying hardware is functioning perfectly with no problems
>>>>>>> noted, and moving the files without NFS avoids the bug.
>>>>>>>
>>>>>>> We've been using a workaround involving purposefully pausing
>>>>>>> (SIGSTOP) the file transfer process on the client as soon as
>>>>>>> other clients start to show a slowdown. This hack avoids the
>>>>>>> bug entirely provided the host is allowed to catch back up prior
>>>>>>> to resuming (SIGCONT) the file transfer process. From this, it
>>>>>>> seems something is going very wrong within the NFS stack under
>>>>>>> high storage I/O pressure and high storage write latency
>>>>>>> (timeout?) -- it should simply pause transfers while the storage
>>>>>>> subsystem catches up, not lock up a core and force a host
>>>>>>> restart. Interesting, sometimes it does exactly what it is
>>>>>>> supposed to and does pause and wait for the storage subsystem,
>>>>>>> but around 20% of the time it just triggers this bug and stalls
>>>>>>> a core.
>>>>>>>
>>>>>>> This bug has been present since at least 4.14 and is still
>>>>>>> present in the latest 5.12.14 version.
>>>>>>>
>>>>>>> As the machine is in production, it is difficult to gather
>>>>>>> further information or test patches, however we would be able to
>>>>>>> apply patches to the kernel that would potentially restore
>>>>>>> stability with enough advance scheduling.
>>>>>>>
>>>>>>> Sample backtrace below:
>>>>>>>
>>>>>>> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
>>>>>>> [16846.426202] rcu: 32-....: (5249 ticks this GP)
>>>>>>> idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
>>>>>>> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
>>>>>>> [16846.426273] NMI backtrace for cpu 32 [16846.426298] CPU: 32
>>>>>>> PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
>>>>>>> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>>>>>> [16846.426406] Call Trace: [16846.426429] [c000200010823250]
>>>>>>> [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
>>>>>>> [16846.426483] [c000200010823290] [c00000000075aebc]
>>>>>>> nmi_cpu_backtrace+0xfc/0x150 [16846.426506] [c000200010823310]
>>>>>>> [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
>>>>>>> [16846.426577] [c0002000108233b0] [c000000000072818]
>>>>>>> arch_trigger_cpumask_backtrace+0x28/0x40 [16846.426621]
>>>>>>> [c0002000108233d0] [c000000000202db8]
>>>>>>> rcu_dump_cpu_stacks+0x158/0x1b8 [16846.426667]
>>>>>>> [c000200010823470] [c000000000201828]
>>>>>>> rcu_sched_clock_irq+0x908/0xb10 [16846.426708]
>>>>>>> [c000200010823560] [c0000000002141d0]
>>>>>>> update_process_times+0xc0/0x140 [16846.426768]
>>>>>>> [c0002000108235a0] [c00000000022dd34]
>>>>>>> tick_sched_handle.isra.18+0x34/0xd0 [16846.426808]
>>>>>>> [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
>>>>>>> [16846.426856] [c000200010823610] [c00000000021577c]
>>>>>>> __hrtimer_run_queues+0x16c/0x370 [16846.426903]
>>>>>>> [c000200010823690] [c000000000216378]
>>>>>>> hrtimer_interrupt+0x128/0x2f0 [16846.426947] [c000200010823740]
>>>>>>> [c000000000029494] timer_interrupt+0x134/0x310 [16846.426989]
>>>>>>> [c0002000108237a0] [c000000000016c54]
>>>>>>> replay_soft_interrupts+0x124/0x2e0 [16846.427045]
>>>>>>> [c000200010823990] [c000000000016f14]
>>>>>>> arch_local_irq_restore+0x104/0x170 [16846.427103]
>>>>>>> [c0002000108239c0] [c00000000017247c]
>>>>>>> mod_delayed_work_on+0x8c/0xe0 [16846.427149] [c000200010823a20]
>>>>>>> [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
>>>>>>> [16846.427234] [c000200010823a40] [c0080000081a096c]
>>>>>>> __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
>>>>>>> [16846.427324] [c000200010823a90] [c0080000081a3080]
>>>>>>> rpc_sleep_on_timeout+0x88/0x110 [sunrpc] [16846.427388]
>>>>>>> [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530
>>>>>>> [nfsd] [16846.427457] [c000200010823b60] [c0080000081a0a0c]
>>>>>>> rpc_exit_task+0x84/0x1d0 [sunrpc] [16846.427520]
>>>>>>> [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760
>>>>>>> [sunrpc] [16846.427598] [c000200010823c30] [c0080000081a2b18]
>>>>>>> rpc_async_schedule+0x40/0x70 [sunrpc] [16846.427687]
>>>>>>> [c000200010823c60] [c000000000170bf0]
>>>>>>> process_one_work+0x290/0x580 [16846.427736] [c000200010823d00]
>>>>>>> [c000000000170f68] worker_thread+0x88/0x620 [16846.427813]
>>>>>>> [c000200010823da0] [c00000000017b860] kthread+0x1a0/0x1b0
>>>>>>> [16846.427865] [c000200010823e10] [c00000000000d6ec]
>>>>>>> ret_from_kernel_thread+0x5c/0x70 [16873.869180] watchdog: BUG:
>>>>>>> soft lockup - CPU#32 stuck for 49s! [kworker/u130:25:10624]
>>>>>>> [16873.869245] Modules linked in: rpcsec_gss_krb5
>>>>>>> iscsi_target_mod target_core_user uio target_core_pscsi
>>>>>>> target_core_file target_core_iblock target_core_mod tun
>>>>>>> nft_counter nf_tables nfnetlink vfio_pci vfio_virqfd
>>>>>>> vfio_iommu_spapr_tce vfio vfio_spapr_eeh i2c_dev bridg$
>>>>>>> [16873.869413] linear mlx4_ib ib_uverbs ib_core raid1 md_mod
>>>>>>> sd_mod t10_pi hid_generic usbhid hid ses enclosure
>>>>>>> crct10dif_vpmsum crc32c_vpmsum xhci_pci xhci_hcd ixgbe mlx4_core
>>>>>>> mpt3sas usbcore tg3 mdio_devres of_mdio fixed_phy xfrm_algo mdio
>>>>>>> libphy aacraid igb raid_cl$ [16873.869889] CPU: 32 PID: 10624
>>>>>>> Comm: kworker/u130:25 Not tainted 5.12.14 #1 [16873.869966]
>>>>>>> Workqueue: rpciod rpc_async_schedule [sunrpc] [16873.870023]
>>>>>>> NIP: c000000000711300 LR: c0080000081a0708 CTR:
>>>>>>> c0000000007112a0 [16873.870073] REGS: c0002000108237d0 TRAP:
>>>>>>> 0900 Not tainted (5.12.14) [16873.870109] MSR:
>>>>>>> 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
>>>>>>> 24004842 XER: 00000000 [16873.870146] CFAR: c0080000081d8054
>>>>>>> IRQMASK: 0 GPR00: c0080000081a0748 c000200010823a70
>>>>>>> c0000000015c0700 c0000000e2227a40 GPR04: c0000000e2227a40
>>>>>>> c0000000e2227a40 c000200ffb6cc0a8 0000000000000018 GPR08:
>>>>>>> 0000000000000000 5deadbeef0000122 c0080000081ffd18
>>>>>>> c0080000081d8040 GPR12: c0000000007112a0 c000200fff7fee00
>>>>>>> c00000000017b6c8 c000000090d9ccc0 GPR16: 0000000000000000
>>>>>>> 0000000000000000 0000000000000000 0000000000000000 GPR20:
>>>>>>> 0000000000000000 0000000000000000 0000000000000000
>>>>>>> 0000000000000040 GPR24: 0000000000000000 0000000000000000
>>>>>>> fffffffffffffe00 0000000000000001 GPR28: c00000001a62f000
>>>>>>> c0080000081a0988 c0080000081ffd10 c0000000e2227a00
>>>>>>> [16873.870452] NIP [c000000000711300]
>>>>>>> __list_del_entry_valid+0x60/0x100 [16873.870507] LR
>>>>>>> [c0080000081a0708]
>>>>>>> rpc_wake_up_task_on_wq_queue_action_locked+0x330/0x400 [sunrpc]

2021-08-09 18:40:08

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Does setting /proc/sys/fs/leases-enable to 0 work while the system is up? I was expecting to see lslocks | grep DELE | wc go down. It’s not. It’s staying around 1850.

> On Aug 9, 2021, at 2:30 PM, Timothy Pearson <[email protected]> wrote:
>
> FWIW that's *exactly* what we see. Eventually, if the server is left alone for enough time, even the login system stops responding -- it's as if the I/O subsystem degrades and eventually blocks entirely.
>
> ----- Original Message -----
>> From: "hedrick" <[email protected]>
>> To: "Chuck Lever" <[email protected]>
>> Cc: "Timothy Pearson" <[email protected]>, "J. Bruce Fields" <[email protected]>, "linux-nfs"
>> <[email protected]>
>> Sent: Monday, August 9, 2021 1:29:30 PM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
>> Evidence is ambiguous. It seems that NFS activity hangs. The first time this
>> occurred I saw a process at 100% running rpciod. I tried to do a “sync” and
>> reboot, but the sync hung.
>>
>> The last time I couldn’t get data, but the kernel was running and responding to
>> ping. An ssh session responded to CR but when I tried to sudo it hung. Attempt
>> to login hung. Oddly, even though the ssh session responded to CR, syslog
>> entries on the local system stopped until the reboot. However we also send
>> syslog entries to a central server. Those continued and showed a continuing set
>> of mounts and unmounts happening through the reboot.
>>
>> I was goiog to get a stack trace of the 100% process if that happened again, but
>> last time I wasn’t in a situation to do that. I don’t think users will put up
>> with further attempts to debug, so for the moment I’m going to try disabling
>> delegations.
>>
>>> On Aug 9, 2021, at 1:37 PM, Chuck Lever III <[email protected]> wrote:
>>>
>>> Then when you say "server hangs" you mean that the entire NFS server
>>> system deadlocks. It's not just unresponsive on one or more exports.

2021-08-09 18:49:33

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Mon, Aug 09, 2021 at 02:38:33PM -0400, [email protected] wrote:
> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
> It’s staying around 1850.

All it should do is prevent giving out *new* delegations.

Best is to set that sysctl on system startup before nfsd starts.

> > On Aug 9, 2021, at 2:30 PM, Timothy Pearson
> > <[email protected]> wrote:
> >
> > FWIW that's *exactly* what we see. Eventually, if the server is
> > left alone for enough time, even the login system stops responding
> > -- it's as if the I/O subsystem degrades and eventually blocks
> > entirely.

That's pretty common behavior across a variety of kernel bugs. So on
its own it doesn't mean the root cause is the same.

--b.

2021-08-09 18:57:00

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

Can confirm -- same general backtrace I sent in earlier.

That means the bug is:
1.) Not architecture specific
2.) Not filesystem specific

I was originally concerned it was related to BTRFS or POWER-specific, good to see it is not.

----- Original Message -----
> From: "hedrick" <[email protected]>
> To: "J. Bruce Fields" <[email protected]>
> Cc: "Timothy Pearson" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Monday, August 9, 2021 1:51:05 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> I have. I was trying to avoid a reboot.
>
> By the way, after the first failure, during reboot, syslog showed the following.
> I’m unclear what it means, bu tit looks ike it might be from the failure
>
>
>
>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <[email protected]> wrote:
>>
>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, [email protected] wrote:
>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>> It’s staying around 1850.
>>
>> All it should do is prevent giving out *new* delegations.
>>
>> Best is to set that sysctl on system startup before nfsd starts.
>>
>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>> <[email protected]> wrote:
>>>>
>>>> FWIW that's *exactly* what we see. Eventually, if the server is
>>>> left alone for enough time, even the login system stops responding
>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>> entirely.
>>
>> That's pretty common behavior across a variety of kernel bugs. So on
>> its own it doesn't mean the root cause is the same.
>>
> > --b.

2021-08-09 20:37:03

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

IIRC most of the NFS server tuning options require a NFS service restart to take effect.

----- Original Message -----
> From: "hedrick" <[email protected]>
> To: "Timothy Pearson" <[email protected]>
> Cc: "Chuck Lever" <[email protected]>, "J. Bruce Fields" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Monday, August 9, 2021 1:38:33 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> Does setting /proc/sys/fs/leases-enable to 0 work while the system is up? I was
> expecting to see lslocks | grep DELE | wc go down. It’s not. It’s staying
> around 1850.
>
>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson <[email protected]>
>> wrote:
>>
>> FWIW that's *exactly* what we see. Eventually, if the server is left alone for
>> enough time, even the login system stops responding -- it's as if the I/O
>> subsystem degrades and eventually blocks entirely.
>>
>> ----- Original Message -----
>>> From: "hedrick" <[email protected]>
>>> To: "Chuck Lever" <[email protected]>
>>> Cc: "Timothy Pearson" <[email protected]>, "J. Bruce Fields"
>>> <[email protected]>, "linux-nfs"
>>> <[email protected]>
>>> Sent: Monday, August 9, 2021 1:29:30 PM
>>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>
>>> Evidence is ambiguous. It seems that NFS activity hangs. The first time this
>>> occurred I saw a process at 100% running rpciod. I tried to do a “sync” and
>>> reboot, but the sync hung.
>>>
>>> The last time I couldn’t get data, but the kernel was running and responding to
>>> ping. An ssh session responded to CR but when I tried to sudo it hung. Attempt
>>> to login hung. Oddly, even though the ssh session responded to CR, syslog
>>> entries on the local system stopped until the reboot. However we also send
>>> syslog entries to a central server. Those continued and showed a continuing set
>>> of mounts and unmounts happening through the reboot.
>>>
>>> I was goiog to get a stack trace of the 100% process if that happened again, but
>>> last time I wasn’t in a situation to do that. I don’t think users will put up
>>> with further attempts to debug, so for the moment I’m going to try disabling
>>> delegations.
>>>
>>>> On Aug 9, 2021, at 1:37 PM, Chuck Lever III <[email protected]> wrote:
>>>>
>>>> Then when you say "server hangs" you mean that the entire NFS server
> >>> system deadlocks. It's not just unresponsive on one or more exports.

2021-08-09 22:50:48

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

I just realized there’s one thing you should know. We run Cisco’s AMP for Endpoints on the server. The goal is to detect malware that our users might put on the file system. Typically one is worried about malware installed n client, but we’re concerned that developers may be using java and python libraries with known issues, and those will commonly be stored on the server.

If AMP is doing its job, it will check most new files. I’m not sure whether that creates atypical usage or not.

> On Aug 9, 2021, at 2:56:15 PM, Timothy Pearson <[email protected]> wrote:
>
> Can confirm -- same general backtrace I sent in earlier.
>
> That means the bug is:
> 1.) Not architecture specific
> 2.) Not filesystem specific
>
> I was originally concerned it was related to BTRFS or POWER-specific, good to see it is not.
>
> ----- Original Message -----
>> From: "hedrick" <[email protected]>
>> To: "J. Bruce Fields" <[email protected]>
>> Cc: "Timothy Pearson" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
>> <[email protected]>
>> Sent: Monday, August 9, 2021 1:51:05 PM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
>> I have. I was trying to avoid a reboot.
>>
>> By the way, after the first failure, during reboot, syslog showed the following.
>> I’m unclear what it means, bu tit looks ike it might be from the failure
>>
>>
>>
>>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <[email protected]> wrote:
>>>
>>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, [email protected] wrote:
>>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>>> It’s staying around 1850.
>>>
>>> All it should do is prevent giving out *new* delegations.
>>>
>>> Best is to set that sysctl on system startup before nfsd starts.
>>>
>>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>>> <[email protected]> wrote:
>>>>>
>>>>> FWIW that's *exactly* what we see. Eventually, if the server is
>>>>> left alone for enough time, even the login system stops responding
>>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>>> entirely.
>>>
>>> That's pretty common behavior across a variety of kernel bugs. So on
>>> its own it doesn't mean the root cause is the same.
>>>
>>> --b.

2021-08-10 02:07:46

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

I'm not sure that is much different than the load patterns we end up generating, with mixed remote and local I/O. I'd think that such a scenario is fairly typical, especially when factoring in backup processes.

----- Original Message -----
> From: "hedrick" <[email protected]>
> To: "Timothy Pearson" <[email protected]>
> Cc: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Monday, August 9, 2021 3:54:17 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> I just realized there’s one thing you should know. We run Cisco’s AMP for
> Endpoints on the server. The goal is to detect malware that our users might put
> on the file system. Typically one is worried about malware installed n client,
> but we’re concerned that developers may be using java and python libraries with
> known issues, and those will commonly be stored on the server.
>
> If AMP is doing its job, it will check most new files. I’m not sure whether that
> creates atypical usage or not.
>
>> On Aug 9, 2021, at 2:56:15 PM, Timothy Pearson <[email protected]>
>> wrote:
>>
>> Can confirm -- same general backtrace I sent in earlier.
>>
>> That means the bug is:
>> 1.) Not architecture specific
>> 2.) Not filesystem specific
>>
>> I was originally concerned it was related to BTRFS or POWER-specific, good to
>> see it is not.
>>
>> ----- Original Message -----
>>> From: "hedrick" <[email protected]>
>>> To: "J. Bruce Fields" <[email protected]>
>>> Cc: "Timothy Pearson" <[email protected]>, "Chuck Lever"
>>> <[email protected]>, "linux-nfs"
>>> <[email protected]>
>>> Sent: Monday, August 9, 2021 1:51:05 PM
>>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>
>>> I have. I was trying to avoid a reboot.
>>>
>>> By the way, after the first failure, during reboot, syslog showed the following.
>>> I’m unclear what it means, bu tit looks ike it might be from the failure
>>>
>>>
>>>
>>>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <[email protected]> wrote:
>>>>
>>>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, [email protected] wrote:
>>>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>>>> It’s staying around 1850.
>>>>
>>>> All it should do is prevent giving out *new* delegations.
>>>>
>>>> Best is to set that sysctl on system startup before nfsd starts.
>>>>
>>>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> FWIW that's *exactly* what we see. Eventually, if the server is
>>>>>> left alone for enough time, even the login system stops responding
>>>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>>>> entirely.
>>>>
>>>> That's pretty common behavior across a variety of kernel bugs. So on
>>>> its own it doesn't mean the root cause is the same.
>>>>
> >>> --b.

2021-08-10 02:08:24

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

yes, but the timing may be different. When a new file is created, inotify will tell AMP about it, and AMP will immediately read it.

> On Aug 9, 2021, at 5:49:30 PM, Timothy Pearson <[email protected]> wrote:
>
> I'm not sure that is much different than the load patterns we end up generating, with mixed remote and local I/O. I'd think that such a scenario is fairly typical, especially when factoring in backup processes.
>
> ----- Original Message -----
>> From: "hedrick" <[email protected]>
>> To: "Timothy Pearson" <[email protected]>
>> Cc: "J. Bruce Fields" <[email protected]>, "Chuck Lever" <[email protected]>, "linux-nfs"
>> <[email protected]>
>> Sent: Monday, August 9, 2021 3:54:17 PM
>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>
>> I just realized there’s one thing you should know. We run Cisco’s AMP for
>> Endpoints on the server. The goal is to detect malware that our users might put
>> on the file system. Typically one is worried about malware installed n client,
>> but we’re concerned that developers may be using java and python libraries with
>> known issues, and those will commonly be stored on the server.
>>
>> If AMP is doing its job, it will check most new files. I’m not sure whether that
>> creates atypical usage or not.
>>
>>> On Aug 9, 2021, at 2:56:15 PM, Timothy Pearson <[email protected]>
>>> wrote:
>>>
>>> Can confirm -- same general backtrace I sent in earlier.
>>>
>>> That means the bug is:
>>> 1.) Not architecture specific
>>> 2.) Not filesystem specific
>>>
>>> I was originally concerned it was related to BTRFS or POWER-specific, good to
>>> see it is not.
>>>
>>> ----- Original Message -----
>>>> From: "hedrick" <[email protected]>
>>>> To: "J. Bruce Fields" <[email protected]>
>>>> Cc: "Timothy Pearson" <[email protected]>, "Chuck Lever"
>>>> <[email protected]>, "linux-nfs"
>>>> <[email protected]>
>>>> Sent: Monday, August 9, 2021 1:51:05 PM
>>>> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load
>>>
>>>> I have. I was trying to avoid a reboot.
>>>>
>>>> By the way, after the first failure, during reboot, syslog showed the following.
>>>> I’m unclear what it means, bu tit looks ike it might be from the failure
>>>>
>>>>
>>>>
>>>>> On Aug 9, 2021, at 2:49 PM, J. Bruce Fields <[email protected]> wrote:
>>>>>
>>>>> On Mon, Aug 09, 2021 at 02:38:33PM -0400, [email protected] wrote:
>>>>>> Does setting /proc/sys/fs/leases-enable to 0 work while the system is
>>>>>> up? I was expecting to see lslocks | grep DELE | wc go down. It’s not.
>>>>>> It’s staying around 1850.
>>>>>
>>>>> All it should do is prevent giving out *new* delegations.
>>>>>
>>>>> Best is to set that sysctl on system startup before nfsd starts.
>>>>>
>>>>>>> On Aug 9, 2021, at 2:30 PM, Timothy Pearson
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> FWIW that's *exactly* what we see. Eventually, if the server is
>>>>>>> left alone for enough time, even the login system stops responding
>>>>>>> -- it's as if the I/O subsystem degrades and eventually blocks
>>>>>>> entirely.
>>>>>
>>>>> That's pretty common behavior across a variety of kernel bugs. So on
>>>>> its own it doesn't mean the root cause is the same.
>>>>>
>>>>> --b.

2021-08-10 04:20:53

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Mon, 05 Jul 2021, Timothy Pearson wrote:
>
> Sample backtrace below:
>
> [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> [16846.426202] rcu: 32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> [16846.426273] NMI backtrace for cpu 32
> [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [16846.426406] Call Trace:
> [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]

Time to play the sleuth .....
"rpc_async_schedule" - so this is clearly an async task.
It is running in __rpc_execute(), and calls rpc_exit_task().

rpc_exit_task() is a possible value for ->tk_action, which is set in
several places.
1/ in call_bc_transmit_status(), but only after generating a kernel
message
RPC: Could not send backchannel reply......
You didn't report that message, so I'll assume it didn't happen.

2/ In call_decode() if ->p_decode is NULL. This implies a message
which didn't expect a reply. All nfs4 callback procedures
(nfs4_cb_procedures[]) do set p_decode, so it cannot be here.

3/ In call_decode() if the reply was successfully decoded.
I cannot rule this out quite so easily, but it seems unlikely as this
is a normal pattern and I wouldn't expect it to cause a soft-lockup.

4/ In rpc_exit(). This is my guess. All the places that rpc_exit() can
be called by nfsd (and nfsd appears in the call stack) are for handling
errors.

So GUESS: rpc_exit() is getting called.
Not only that, but it is getting called *often*. The call to
rpc_exit_task() (which is not the same as rpc_exit() - be careful) sets
tk_action to NULL. So rpc_exit() must get called again and again and
again to keept setting tk_action back to rpc_exit_task, resulting in the
soft lockup.

After setting ->tk_action to NULL, rpc_exit_task() calls
->rpc_call_done, which we see in the stack trace is nfsd4_cb_done().

nfsd4_cb_done() in turn calls ->done which is one of
nfsd4_cb_probe_done()
nfsd4_cb_sequence_done()
nfsd4_cb_layout_done()
or
nfsd4_cb_notify_lock_done()

Some of these call rpc_delay(task,...) and return 0, causing
nfsd4_cb_done() to call rpc_restart_call_prepare() This means the task
can be requeued, but only after a delay.

This doesn't yet explain the spin, but now let's look back at
__rpc_execute().
After calling do_action() (which is rpc_exit_task() in the call trace)
it checks if the task is queued. If rpc_delay_task() wasn't call, it
won't be queued and tk_action will be NULL, so it will loop around,
do_action will be NULL, and the task aborts.

But if rpc_delay_task() was called, then the task will be queued (on the
delay queue), and we continue in __rpc_execute().

The next test if is RPC_SIGNALLED(). If so, then rpc_exit() is called.
Aha! We though that must have been getting called repeatedly. It
*might* not be here, but I think it is. Let's assume so.
rpc_exit() will set ->tk_action to rpc_exit_task, dequeue the task and
(as it is async) schedule it for handling by rpc_async_schedule (that is
in rpc_make_runnable()).

__rpc_execute_task continues down to
if (task_is_async)
return;
and
rpc_async_schedule() returns. But the task has already been queued to
be handled again, so the whole process loops.

The problem here appears to be that a signalled task is being retried
without clearing the SIGNALLED flag. That is causing the infinite loop
and the soft lockup.

This bug appears to have been introduced in Linux 5.2 by
Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")

Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
cleared.

A fix might be to clear RPC_TASK_SIGNALLED in
rpc_reset_task_statistics(), but I'll leave that decision to someone
else.

This analysis doesn't completely gel with your claim that the bug has
been present since at least 4.14, and the bug I think I found appeared
in Linux 5.2.
Maybe you previously had similar symptoms from a different bug?

I'll leave to Bruce, Chuck, and Trond to work out the best fix.

NeilBrown

2021-08-10 04:23:06

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> On Mon, 05 Jul 2021, Timothy Pearson wrote:
> >
> > Sample backtrace below:
> >
> > [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> > [16846.426202] rcu: 32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> > [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> > [16846.426273] NMI backtrace for cpu 32
> > [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> > [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [16846.426406] Call Trace:
> > [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> > [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> > [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> > [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> > [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> > [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> > [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> > [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> > [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> > [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> > [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> > [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> > [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> > [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> > [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> > [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> > [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> > [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> > [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> > [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> > [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> > [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
>
> Time to play the sleuth .....
> "rpc_async_schedule" - so this is clearly an async task.
> It is running in __rpc_execute(), and calls rpc_exit_task().
>
> rpc_exit_task() is a possible value for ->tk_action, which is set in
> several places.
> 1/ in call_bc_transmit_status(), but only after generating a kernel
> message
> RPC: Could not send backchannel reply......
> You didn't report that message, so I'll assume it didn't happen.
>
> 2/ In call_decode() if ->p_decode is NULL. This implies a message
> which didn't expect a reply. All nfs4 callback procedures
> (nfs4_cb_procedures[]) do set p_decode, so it cannot be here.
>
> 3/ In call_decode() if the reply was successfully decoded.
> I cannot rule this out quite so easily, but it seems unlikely as this
> is a normal pattern and I wouldn't expect it to cause a soft-lockup.
>
> 4/ In rpc_exit(). This is my guess. All the places that rpc_exit() can
> be called by nfsd (and nfsd appears in the call stack) are for handling
> errors.
>
> So GUESS: rpc_exit() is getting called.
> Not only that, but it is getting called *often*. The call to
> rpc_exit_task() (which is not the same as rpc_exit() - be careful) sets
> tk_action to NULL. So rpc_exit() must get called again and again and
> again to keept setting tk_action back to rpc_exit_task, resulting in the
> soft lockup.
>
> After setting ->tk_action to NULL, rpc_exit_task() calls
> ->rpc_call_done, which we see in the stack trace is nfsd4_cb_done().
>
> nfsd4_cb_done() in turn calls ->done which is one of
> nfsd4_cb_probe_done()
> nfsd4_cb_sequence_done()
> nfsd4_cb_layout_done()
> or
> nfsd4_cb_notify_lock_done()
>
> Some of these call rpc_delay(task,...) and return 0, causing
> nfsd4_cb_done() to call rpc_restart_call_prepare() This means the task
> can be requeued, but only after a delay.
>
> This doesn't yet explain the spin, but now let's look back at
> __rpc_execute().
> After calling do_action() (which is rpc_exit_task() in the call trace)
> it checks if the task is queued. If rpc_delay_task() wasn't call, it
> won't be queued and tk_action will be NULL, so it will loop around,
> do_action will be NULL, and the task aborts.
>
> But if rpc_delay_task() was called, then the task will be queued (on the
> delay queue), and we continue in __rpc_execute().
>
> The next test if is RPC_SIGNALLED(). If so, then rpc_exit() is called.
> Aha! We though that must have been getting called repeatedly. It
> *might* not be here, but I think it is. Let's assume so.
> rpc_exit() will set ->tk_action to rpc_exit_task, dequeue the task and
> (as it is async) schedule it for handling by rpc_async_schedule (that is
> in rpc_make_runnable()).
>
> __rpc_execute_task continues down to
> if (task_is_async)
> return;
> and
> rpc_async_schedule() returns. But the task has already been queued to
> be handled again, so the whole process loops.
>
> The problem here appears to be that a signalled task is being retried
> without clearing the SIGNALLED flag. That is causing the infinite loop
> and the soft lockup.
>
> This bug appears to have been introduced in Linux 5.2 by
> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")

Wow, that's a lot farther than I got. I'll take a look tomorrow....

> Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> cleared.
>
> A fix might be to clear RPC_TASK_SIGNALLED in
> rpc_reset_task_statistics(), but I'll leave that decision to someone
> else.
>
> This analysis doesn't completely gel with your claim that the bug has
> been present since at least 4.14, and the bug I think I found appeared
> in Linux 5.2.
> Maybe you previously had similar symptoms from a different bug?

I think it's possible there's more than one problem getting mixed into
this discussion.

--b.

2021-08-10 15:06:56

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Tue, Aug 10, 2021 at 10:57:10AM -0400, [email protected] wrote:
> No. NFS 4.2 has a new feature that sends the umask. That means that
> ACLs specifying default permissions actually work. They don’t work in
> 4.0. Since that’s what I want to use ACLs for, effectively they don’t
> work in 4.0 Nothing you can do about that: it’s in the protocol
> definition. But it’s a reason we really want to use 4.2.

D'oh, I forgot the umask change. Got it, makes sense.

--b.

2021-08-10 16:47:11

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

To chime in on the client topic, we have a pretty wide mix of kernels in use and haven't really seen any issues aside from some oddities specific to 4.14 / 5.3 on armhf (embedded dev systems). I know 5.13.4 is fine on the client side.

----- Original Message -----
> From: "hedrick" <[email protected]>
> To: "Chuck Lever" <[email protected]>
> Cc: "Timothy Pearson" <[email protected]>, "J. Bruce Fields" <[email protected]>, "linux-nfs"
> <[email protected]>
> Sent: Tuesday, August 10, 2021 10:03:48 AM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> FYI, we’re reasonably sure 5.8 is OK on the client side. We don’t have much
> testing with 5.4, but we’re hoping current patch levels of 5.4 are as well. So
> we’re not really reporting anything here about the client side. (I’ve looked at
> both upstream and Ubuntu kernel source. They’re pretty good about following the
> upstream NFS source.)
>
>> On Aug 9, 2021, at 1:37 PM, Chuck Lever III <[email protected]> wrote:
>>
>>> We’ve also had a history of issues with 4.2 problems on clients. That’s why we
>>> backed off to 4.0 initially. Clients were seeing hangs.
>>
> > Let's stick with the server issue for the moment.

2021-08-12 15:16:09

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> On Mon, 05 Jul 2021, Timothy Pearson wrote:
> >
> > Sample backtrace below:
> >
> > [16846.426141] rcu: INFO: rcu_sched self-detected stall on CPU
> > [16846.426202] rcu: 32-....: (5249 ticks this GP) idle=78a/1/0x4000000000000002 softirq=1663878/1663878 fqs=1986
> > [16846.426241] (t=5251 jiffies g=2720809 q=756724)
> > [16846.426273] NMI backtrace for cpu 32
> > [16846.426298] CPU: 32 PID: 10624 Comm: kworker/u130:25 Not tainted 5.12.14 #1
> > [16846.426342] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [16846.426406] Call Trace:
> > [16846.426429] [c000200010823250] [c00000000074e630] dump_stack+0xc4/0x114 (unreliable)
> > [16846.426483] [c000200010823290] [c00000000075aebc] nmi_cpu_backtrace+0xfc/0x150
> > [16846.426506] [c000200010823310] [c00000000075b0a8] nmi_trigger_cpumask_backtrace+0x198/0x1f0
> > [16846.426577] [c0002000108233b0] [c000000000072818] arch_trigger_cpumask_backtrace+0x28/0x40
> > [16846.426621] [c0002000108233d0] [c000000000202db8] rcu_dump_cpu_stacks+0x158/0x1b8
> > [16846.426667] [c000200010823470] [c000000000201828] rcu_sched_clock_irq+0x908/0xb10
> > [16846.426708] [c000200010823560] [c0000000002141d0] update_process_times+0xc0/0x140
> > [16846.426768] [c0002000108235a0] [c00000000022dd34] tick_sched_handle.isra.18+0x34/0xd0
> > [16846.426808] [c0002000108235d0] [c00000000022e1e8] tick_sched_timer+0x68/0xe0
> > [16846.426856] [c000200010823610] [c00000000021577c] __hrtimer_run_queues+0x16c/0x370
> > [16846.426903] [c000200010823690] [c000000000216378] hrtimer_interrupt+0x128/0x2f0
> > [16846.426947] [c000200010823740] [c000000000029494] timer_interrupt+0x134/0x310
> > [16846.426989] [c0002000108237a0] [c000000000016c54] replay_soft_interrupts+0x124/0x2e0
> > [16846.427045] [c000200010823990] [c000000000016f14] arch_local_irq_restore+0x104/0x170
> > [16846.427103] [c0002000108239c0] [c00000000017247c] mod_delayed_work_on+0x8c/0xe0
> > [16846.427149] [c000200010823a20] [c00800000819fe04] rpc_set_queue_timer+0x5c/0x80 [sunrpc]
> > [16846.427234] [c000200010823a40] [c0080000081a096c] __rpc_sleep_on_priority_timeout+0x194/0x1b0 [sunrpc]
> > [16846.427324] [c000200010823a90] [c0080000081a3080] rpc_sleep_on_timeout+0x88/0x110 [sunrpc]
> > [16846.427388] [c000200010823ad0] [c0080000071f7220] nfsd4_cb_done+0x468/0x530 [nfsd]
> > [16846.427457] [c000200010823b60] [c0080000081a0a0c] rpc_exit_task+0x84/0x1d0 [sunrpc]
> > [16846.427520] [c000200010823ba0] [c0080000081a2448] __rpc_execute+0xd0/0x760 [sunrpc]
> > [16846.427598] [c000200010823c30] [c0080000081a2b18] rpc_async_schedule+0x40/0x70 [sunrpc]
>
> Time to play the sleuth .....
> "rpc_async_schedule" - so this is clearly an async task.
> It is running in __rpc_execute(), and calls rpc_exit_task().
>
> rpc_exit_task() is a possible value for ->tk_action, which is set in
> several places.
> 1/ in call_bc_transmit_status(), but only after generating a kernel
> message
> RPC: Could not send backchannel reply......
> You didn't report that message, so I'll assume it didn't happen.
>
> 2/ In call_decode() if ->p_decode is NULL. This implies a message
> which didn't expect a reply. All nfs4 callback procedures
> (nfs4_cb_procedures[]) do set p_decode, so it cannot be here.
>
> 3/ In call_decode() if the reply was successfully decoded.
> I cannot rule this out quite so easily, but it seems unlikely as this
> is a normal pattern and I wouldn't expect it to cause a soft-lockup.
>
> 4/ In rpc_exit(). This is my guess. All the places that rpc_exit() can
> be called by nfsd (and nfsd appears in the call stack) are for handling
> errors.
>
> So GUESS: rpc_exit() is getting called.
> Not only that, but it is getting called *often*. The call to
> rpc_exit_task() (which is not the same as rpc_exit() - be careful) sets
> tk_action to NULL. So rpc_exit() must get called again and again and
> again to keept setting tk_action back to rpc_exit_task, resulting in the
> soft lockup.
>
> After setting ->tk_action to NULL, rpc_exit_task() calls
> ->rpc_call_done, which we see in the stack trace is nfsd4_cb_done().
>
> nfsd4_cb_done() in turn calls ->done which is one of
> nfsd4_cb_probe_done()
> nfsd4_cb_sequence_done()
> nfsd4_cb_layout_done()
> or
> nfsd4_cb_notify_lock_done()
>
> Some of these call rpc_delay(task,...) and return 0, causing
> nfsd4_cb_done() to call rpc_restart_call_prepare() This means the task
> can be requeued, but only after a delay.
>
> This doesn't yet explain the spin, but now let's look back at
> __rpc_execute().
> After calling do_action() (which is rpc_exit_task() in the call trace)
> it checks if the task is queued. If rpc_delay_task() wasn't call, it
> won't be queued and tk_action will be NULL, so it will loop around,
> do_action will be NULL, and the task aborts.
>
> But if rpc_delay_task() was called, then the task will be queued (on the
> delay queue), and we continue in __rpc_execute().
>
> The next test if is RPC_SIGNALLED(). If so, then rpc_exit() is called.
> Aha! We though that must have been getting called repeatedly. It
> *might* not be here, but I think it is. Let's assume so.
> rpc_exit() will set ->tk_action to rpc_exit_task, dequeue the task and
> (as it is async) schedule it for handling by rpc_async_schedule (that is
> in rpc_make_runnable()).
>
> __rpc_execute_task continues down to
> if (task_is_async)
> return;
> and
> rpc_async_schedule() returns. But the task has already been queued to
> be handled again, so the whole process loops.
>
> The problem here appears to be that a signalled task is being retried
> without clearing the SIGNALLED flag. That is causing the infinite loop
> and the soft lockup.
>
> This bug appears to have been introduced in Linux 5.2 by
> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")

I wonder how we arrived here. Does it require that an rpc task returns
from one of those rpc_delay() calls just as rpc_shutdown_client() is
signalling it? That's the only way async tasks get signalled, I think.

> Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> cleared.
>
> A fix might be to clear RPC_TASK_SIGNALLED in
> rpc_reset_task_statistics(), but I'll leave that decision to someone
> else.

Might be worth testing with that change just to verify that this is
what's happening.

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c045f63d11fa..caa931888747 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -813,7 +813,8 @@ static void
rpc_reset_task_statistics(struct rpc_task *task)
{
task->tk_timeouts = 0;
- task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
+ task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
+ RPC_TASK_SENT);
rpc_init_task_statistics(task);
}

--b.

2021-08-12 22:48:24

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, 13 Aug 2021, J. Bruce Fields wrote:
> On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> >
> > The problem here appears to be that a signalled task is being retried
> > without clearing the SIGNALLED flag. That is causing the infinite loop
> > and the soft lockup.
> >
> > This bug appears to have been introduced in Linux 5.2 by
> > Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
>
> I wonder how we arrived here. Does it require that an rpc task returns
> from one of those rpc_delay() calls just as rpc_shutdown_client() is
> signalling it? That's the only way async tasks get signalled, I think.

I don't think "just as" is needed.
I think it could only happen if rpc_shutdown_client() were called when
there were active tasks - presumably from nfsd4_process_cb_update(), but
I don't know the callback code well.
If any of those active tasks has a ->done handler which might try to
reschedule the task when tk_status == -ERESTARTSYS, then you get into
the infinite loop.

>
> > Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> > cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> > After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> > cleared.
> >
> > A fix might be to clear RPC_TASK_SIGNALLED in
> > rpc_reset_task_statistics(), but I'll leave that decision to someone
> > else.
>
> Might be worth testing with that change just to verify that this is
> what's happening.
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c045f63d11fa..caa931888747 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -813,7 +813,8 @@ static void
> rpc_reset_task_statistics(struct rpc_task *task)
> {
> task->tk_timeouts = 0;
> - task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> + task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
> + RPC_TASK_SENT);

NONONONONO.
RPC_TASK_SIGNALLED is a flag in tk_runstate.
So you need
clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);

NeilBrown

2021-08-16 14:47:19

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

FYI, we’re been running for a week now with delegation disabled. I think the system is stable, though with intermittent failures it’s hard to be sure.

It also seems to have gotten rid of errors that we’ve been seeing all along, even when the system was stable (which in retrospect is because we were using NFS 4.0 rather than 4.2):

Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.030257] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000fa7d3f12!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.031110] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000fa7d3f12!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.031977] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000fa7d3f12!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.032808] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000fa7d3f12!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.059923] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000b6b6f424!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.060802] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000b6b6f424!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.061671] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000b6b6f424!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.062478] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000b6b6f424!
Aug 10 18:12:17 daytona.rutgers.edu kernel: [634473.068624] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence 00000000b6b6f424!

2021-10-08 20:28:10

by Scott Mayhew

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, 13 Aug 2021, NeilBrown wrote:

> On Fri, 13 Aug 2021, J. Bruce Fields wrote:
> > On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> > >
> > > The problem here appears to be that a signalled task is being retried
> > > without clearing the SIGNALLED flag. That is causing the infinite loop
> > > and the soft lockup.
> > >
> > > This bug appears to have been introduced in Linux 5.2 by
> > > Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
> >
> > I wonder how we arrived here. Does it require that an rpc task returns
> > from one of those rpc_delay() calls just as rpc_shutdown_client() is
> > signalling it? That's the only way async tasks get signalled, I think.
>
> I don't think "just as" is needed.
> I think it could only happen if rpc_shutdown_client() were called when
> there were active tasks - presumably from nfsd4_process_cb_update(), but
> I don't know the callback code well.
> If any of those active tasks has a ->done handler which might try to
> reschedule the task when tk_status == -ERESTARTSYS, then you get into
> the infinite loop.

This thread seems to have fizzled out, but I'm pretty sure I hit this
during the Virtual Bakeathon yesterday. My server was unresponsive but
I eventually managed to get a vmcore.

[182411.119788] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000f2f40905 xid 5d83adfb
[182437.775113] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u4:1:216458]
[182437.775633] Modules linked in: nfs_layout_flexfiles nfsv3 nfs_layout_nfsv41_files bluetooth ecdh_generic ecc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core tun rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib isofs cdrom nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink xfs intel_rapl_msr intel_rapl_common libcrc32c kvm_intel qxl kvm drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass cec joydev virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_net serio_raw ata_piix net_failover libata virtio_blk virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
[182437.780157] CPU: 1 PID: 216458 Comm: kworker/u4:1 Kdump: loaded Not tainted 5.14.0-5.el9.x86_64 #1
[182437.780894] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
[182437.781567] Workqueue: rpciod rpc_async_schedule [sunrpc]
[182437.782500] RIP: 0010:try_to_grab_pending+0x12/0x160
[182437.783104] Code: e7 e8 72 f3 ff ff e9 6e ff ff ff 48 89 df e8 65 f3 ff ff eb b7 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 d5 53 48 89 fb 9c <58> fa 48 89 02 40 84 f6 0f 85 92 00 00 00 f0 48 0f ba 2b 00 72 09
[182437.784261] RSP: 0018:ffffb5b24066fd30 EFLAGS: 00000246
[182437.785052] RAX: 0000000000000000 RBX: ffffffffc05e0768 RCX: 00000000000007d0
[182437.785760] RDX: ffffb5b24066fd60 RSI: 0000000000000001 RDI: ffffffffc05e0768
[182437.786399] RBP: ffffb5b24066fd60 R08: ffffffffc05e0708 R09: ffffffffc05e0708
[182437.787010] R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
[182437.787621] R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
[182437.788235] FS: 0000000000000000(0000) GS:ffff9daa5bd00000(0000) knlGS:0000000000000000
[182437.788859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[182437.789483] CR2: 00007f8f73d5d828 CR3: 000000008a010003 CR4: 00000000001706e0
[182437.790188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[182437.790831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[182437.791522] Call Trace:
[182437.792183] mod_delayed_work_on+0x3c/0x90
[182437.792866] __rpc_sleep_on_priority_timeout+0x107/0x110 [sunrpc]
[182437.793553] rpc_delay+0x56/0x90 [sunrpc]
[182437.794236] nfsd4_cb_sequence_done+0x202/0x290 [nfsd]
[182437.794910] nfsd4_cb_done+0x18/0xf0 [nfsd]
[182437.795974] rpc_exit_task+0x58/0x100 [sunrpc]
[182437.796955] ? rpc_do_put_task+0x60/0x60 [sunrpc]
[182437.797645] __rpc_execute+0x5e/0x250 [sunrpc]
[182437.798375] rpc_async_schedule+0x29/0x40 [sunrpc]
[182437.799043] process_one_work+0x1e6/0x380
[182437.799703] worker_thread+0x53/0x3d0
[182437.800393] ? process_one_work+0x380/0x380
[182437.801029] kthread+0x10f/0x130
[182437.801686] ? set_kthread_struct+0x40/0x40
[182437.802333] ret_from_fork+0x22/0x30

The process causing the soft lockup warnings:

crash> set 216458
PID: 216458
COMMAND: "kworker/u4:1"
TASK: ffff9da94281e200 [THREAD_INFO: ffff9da94281e200]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
crash> bt
PID: 216458 TASK: ffff9da94281e200 CPU: 0 COMMAND: "kworker/u4:1"
#0 [fffffe000000be50] crash_nmi_callback at ffffffffb3055c31
#1 [fffffe000000be58] nmi_handle at ffffffffb30268f8
#2 [fffffe000000bea0] default_do_nmi at ffffffffb3a36d42
#3 [fffffe000000bec8] exc_nmi at ffffffffb3a36f49
#4 [fffffe000000bef0] end_repeat_nmi at ffffffffb3c013cb
[exception RIP: add_timer]
RIP: ffffffffb317c230 RSP: ffffb5b24066fd58 RFLAGS: 00000046
RAX: 000000010b3585fc RBX: 0000000000000000 RCX: 00000000000007d0
RDX: ffffffffc05e0768 RSI: ffff9daa40312400 RDI: ffffffffc05e0788
RBP: 0000000000002000 R8: ffffffffc05e0770 R9: ffffffffc05e0788
R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffb5b24066fd58] add_timer at ffffffffb317c230
#6 [ffffb5b24066fd58] mod_delayed_work_on at ffffffffb3103247
#7 [ffffb5b24066fd98] __rpc_sleep_on_priority_timeout at ffffffffc0580547 [sunrpc]
#8 [ffffb5b24066fdc8] rpc_delay at ffffffffc0589ed6 [sunrpc]
#9 [ffffb5b24066fde8] nfsd4_cb_sequence_done at ffffffffc06731b2 [nfsd]
#10 [ffffb5b24066fe10] nfsd4_cb_done at ffffffffc0673258 [nfsd]
#11 [ffffb5b24066fe30] rpc_exit_task at ffffffffc05800a8 [sunrpc]
#12 [ffffb5b24066fe40] __rpc_execute at ffffffffc0589fee [sunrpc]
#13 [ffffb5b24066fe70] rpc_async_schedule at ffffffffc058a209 [sunrpc]
#14 [ffffb5b24066fe88] process_one_work at ffffffffb31026c6
#15 [ffffb5b24066fed0] worker_thread at ffffffffb31028b3
#16 [ffffb5b24066ff10] kthread at ffffffffb310960f
#17 [ffffb5b24066ff50] ret_from_fork at ffffffffb30034f2

Looking at the rpc_task being executed:

crash> rpc_task.tk_status,tk_callback,tk_action,tk_runstate,tk_client,tk_flags ffff9da94120bd00
tk_status = 0x0
tk_callback = 0xffffffffc057bc60 <__rpc_atrun>
tk_action = 0xffffffffc0571f20 <call_start>
tk_runstate = 0x47
tk_client = 0xffff9da958909c00
tk_flags = 0x2281

tk_runstate has the following flags set: RPC_TASK_SIGNALLED, RPC_TASK_ACTIVE,
RPC_TASK_QUEUED, and RPC_TASK_RUNNING.

tk_flags is RPC_TASK_NOCONNECT|RPC_TASK_SOFT|RPC_TASK_DYNAMIC|RPC_TASK_ASYNC.

There's another kworker thread calling rpc_shutdown_client() via
nfsd4_process_cb_update():

crash> bt 0x342a3
PID: 213667 TASK: ffff9daa4fde9880 CPU: 1 COMMAND: "kworker/u4:4"
#0 [ffffb5b24077bbe0] __schedule at ffffffffb3a40ec6
#1 [ffffb5b24077bc60] schedule at ffffffffb3a4124c
#2 [ffffb5b24077bc78] schedule_timeout at ffffffffb3a45058
#3 [ffffb5b24077bcd0] rpc_shutdown_client at ffffffffc056fbb3 [sunrpc]
#4 [ffffb5b24077bd20] nfsd4_process_cb_update at ffffffffc0672c6c [nfsd]
#5 [ffffb5b24077be68] nfsd4_run_cb_work at ffffffffc0672f0f [nfsd]
#6 [ffffb5b24077be88] process_one_work at ffffffffb31026c6
#7 [ffffb5b24077bed0] worker_thread at ffffffffb31028b3
#8 [ffffb5b24077bf10] kthread at ffffffffb310960f
#9 [ffffb5b24077bf50] ret_from_fork at ffffffffb30034f2

The rpc_clnt being shut down is:

crash> nfs4_client.cl_cb_client ffff9daa454db808
cl_cb_client = 0xffff9da958909c00

Which is the same as the tk_client for the rpc_task being executed by the
thread triggering the soft lockup warnings.

-Scott

>
> >
> > > Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> > > cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> > > After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> > > cleared.
> > >
> > > A fix might be to clear RPC_TASK_SIGNALLED in
> > > rpc_reset_task_statistics(), but I'll leave that decision to someone
> > > else.
> >
> > Might be worth testing with that change just to verify that this is
> > what's happening.
> >
> > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > index c045f63d11fa..caa931888747 100644
> > --- a/net/sunrpc/sched.c
> > +++ b/net/sunrpc/sched.c
> > @@ -813,7 +813,8 @@ static void
> > rpc_reset_task_statistics(struct rpc_task *task)
> > {
> > task->tk_timeouts = 0;
> > - task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> > + task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
> > + RPC_TASK_SENT);
>
> NONONONONO.
> RPC_TASK_SIGNALLED is a flag in tk_runstate.
> So you need
> clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
>
> NeilBrown
>

2021-10-08 21:00:50

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

As a data point, disabling NFSv4 has completely resolved the stability problems on our servers (tested for some months), but that's fairly large sledgehammer to use.

----- Original Message -----
> From: "Scott Mayhew" <[email protected]>
> To: "NeilBrown" <[email protected]>
> Cc: "J. Bruce Fields" <[email protected]>, "Timothy Pearson" <[email protected]>, "Chuck Lever"
> <[email protected]>, "linux-nfs" <[email protected]>, "Trond Myklebust" <[email protected]>
> Sent: Friday, October 8, 2021 3:27:10 PM
> Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> On Fri, 13 Aug 2021, NeilBrown wrote:
>
>> On Fri, 13 Aug 2021, J. Bruce Fields wrote:
>> > On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
>> > >
>> > > The problem here appears to be that a signalled task is being retried
>> > > without clearing the SIGNALLED flag. That is causing the infinite loop
>> > > and the soft lockup.
>> > >
>> > > This bug appears to have been introduced in Linux 5.2 by
>> > > Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
>> >
>> > I wonder how we arrived here. Does it require that an rpc task returns
>> > from one of those rpc_delay() calls just as rpc_shutdown_client() is
>> > signalling it? That's the only way async tasks get signalled, I think.
>>
>> I don't think "just as" is needed.
>> I think it could only happen if rpc_shutdown_client() were called when
>> there were active tasks - presumably from nfsd4_process_cb_update(), but
>> I don't know the callback code well.
>> If any of those active tasks has a ->done handler which might try to
>> reschedule the task when tk_status == -ERESTARTSYS, then you get into
>> the infinite loop.
>
> This thread seems to have fizzled out, but I'm pretty sure I hit this
> during the Virtual Bakeathon yesterday. My server was unresponsive but
> I eventually managed to get a vmcore.
>
> [182411.119788] receive_cb_reply: Got unrecognized reply: calldir 0x1
> xpt_bc_xprt 00000000f2f40905 xid 5d83adfb
> [182437.775113] watchdog: BUG: soft lockup - CPU#1 stuck for 26s!
> [kworker/u4:1:216458]
> [182437.775633] Modules linked in: nfs_layout_flexfiles nfsv3
> nfs_layout_nfsv41_files bluetooth ecdh_generic ecc rpcsec_gss_krb5 nfsv4
> dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core tun rfkill
> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib isofs cdrom nft_reject_inet
> nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink xfs
> intel_rapl_msr intel_rapl_common libcrc32c kvm_intel qxl kvm drm_ttm_helper ttm
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass cec
> joydev virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace
> sunrpc drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul
> crc32c_intel ghash_clmulni_intel virtio_net serio_raw ata_piix net_failover
> libata virtio_blk virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
> [182437.780157] CPU: 1 PID: 216458 Comm: kworker/u4:1 Kdump: loaded Not tainted
> 5.14.0-5.el9.x86_64 #1
> [182437.780894] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
> [182437.781567] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [182437.782500] RIP: 0010:try_to_grab_pending+0x12/0x160
> [182437.783104] Code: e7 e8 72 f3 ff ff e9 6e ff ff ff 48 89 df e8 65 f3 ff ff
> eb b7 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 d5 53 48 89 fb 9c <58> fa 48
> 89 02 40 84 f6 0f 85 92 00 00 00 f0 48 0f ba 2b 00 72 09
> [182437.784261] RSP: 0018:ffffb5b24066fd30 EFLAGS: 00000246
> [182437.785052] RAX: 0000000000000000 RBX: ffffffffc05e0768 RCX:
> 00000000000007d0
> [182437.785760] RDX: ffffb5b24066fd60 RSI: 0000000000000001 RDI:
> ffffffffc05e0768
> [182437.786399] RBP: ffffb5b24066fd60 R08: ffffffffc05e0708 R09:
> ffffffffc05e0708
> [182437.787010] R10: 0000000000000003 R11: 0000000000000003 R12:
> ffffffffc05e0768
> [182437.787621] R13: ffff9daa40312400 R14: 00000000000007d0 R15:
> 0000000000000000
> [182437.788235] FS: 0000000000000000(0000) GS:ffff9daa5bd00000(0000)
> knlGS:0000000000000000
> [182437.788859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [182437.789483] CR2: 00007f8f73d5d828 CR3: 000000008a010003 CR4:
> 00000000001706e0
> [182437.790188] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [182437.790831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [182437.791522] Call Trace:
> [182437.792183] mod_delayed_work_on+0x3c/0x90
> [182437.792866] __rpc_sleep_on_priority_timeout+0x107/0x110 [sunrpc]
> [182437.793553] rpc_delay+0x56/0x90 [sunrpc]
> [182437.794236] nfsd4_cb_sequence_done+0x202/0x290 [nfsd]
> [182437.794910] nfsd4_cb_done+0x18/0xf0 [nfsd]
> [182437.795974] rpc_exit_task+0x58/0x100 [sunrpc]
> [182437.796955] ? rpc_do_put_task+0x60/0x60 [sunrpc]
> [182437.797645] __rpc_execute+0x5e/0x250 [sunrpc]
> [182437.798375] rpc_async_schedule+0x29/0x40 [sunrpc]
> [182437.799043] process_one_work+0x1e6/0x380
> [182437.799703] worker_thread+0x53/0x3d0
> [182437.800393] ? process_one_work+0x380/0x380
> [182437.801029] kthread+0x10f/0x130
> [182437.801686] ? set_kthread_struct+0x40/0x40
> [182437.802333] ret_from_fork+0x22/0x30
>
> The process causing the soft lockup warnings:
>
> crash> set 216458
> PID: 216458
> COMMAND: "kworker/u4:1"
> TASK: ffff9da94281e200 [THREAD_INFO: ffff9da94281e200]
> CPU: 0
> STATE: TASK_RUNNING (ACTIVE)
> crash> bt
> PID: 216458 TASK: ffff9da94281e200 CPU: 0 COMMAND: "kworker/u4:1"
> #0 [fffffe000000be50] crash_nmi_callback at ffffffffb3055c31
> #1 [fffffe000000be58] nmi_handle at ffffffffb30268f8
> #2 [fffffe000000bea0] default_do_nmi at ffffffffb3a36d42
> #3 [fffffe000000bec8] exc_nmi at ffffffffb3a36f49
> #4 [fffffe000000bef0] end_repeat_nmi at ffffffffb3c013cb
> [exception RIP: add_timer]
> RIP: ffffffffb317c230 RSP: ffffb5b24066fd58 RFLAGS: 00000046
> RAX: 000000010b3585fc RBX: 0000000000000000 RCX: 00000000000007d0
> RDX: ffffffffc05e0768 RSI: ffff9daa40312400 RDI: ffffffffc05e0788
> RBP: 0000000000002000 R8: ffffffffc05e0770 R9: ffffffffc05e0788
> R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #5 [ffffb5b24066fd58] add_timer at ffffffffb317c230
> #6 [ffffb5b24066fd58] mod_delayed_work_on at ffffffffb3103247
> #7 [ffffb5b24066fd98] __rpc_sleep_on_priority_timeout at ffffffffc0580547
> [sunrpc]
> #8 [ffffb5b24066fdc8] rpc_delay at ffffffffc0589ed6 [sunrpc]
> #9 [ffffb5b24066fde8] nfsd4_cb_sequence_done at ffffffffc06731b2 [nfsd]
> #10 [ffffb5b24066fe10] nfsd4_cb_done at ffffffffc0673258 [nfsd]
> #11 [ffffb5b24066fe30] rpc_exit_task at ffffffffc05800a8 [sunrpc]
> #12 [ffffb5b24066fe40] __rpc_execute at ffffffffc0589fee [sunrpc]
> #13 [ffffb5b24066fe70] rpc_async_schedule at ffffffffc058a209 [sunrpc]
> #14 [ffffb5b24066fe88] process_one_work at ffffffffb31026c6
> #15 [ffffb5b24066fed0] worker_thread at ffffffffb31028b3
> #16 [ffffb5b24066ff10] kthread at ffffffffb310960f
> #17 [ffffb5b24066ff50] ret_from_fork at ffffffffb30034f2
>
> Looking at the rpc_task being executed:
>
> crash> rpc_task.tk_status,tk_callback,tk_action,tk_runstate,tk_client,tk_flags
> ffff9da94120bd00
> tk_status = 0x0
> tk_callback = 0xffffffffc057bc60 <__rpc_atrun>
> tk_action = 0xffffffffc0571f20 <call_start>
> tk_runstate = 0x47
> tk_client = 0xffff9da958909c00
> tk_flags = 0x2281
>
> tk_runstate has the following flags set: RPC_TASK_SIGNALLED, RPC_TASK_ACTIVE,
> RPC_TASK_QUEUED, and RPC_TASK_RUNNING.
>
> tk_flags is RPC_TASK_NOCONNECT|RPC_TASK_SOFT|RPC_TASK_DYNAMIC|RPC_TASK_ASYNC.
>
> There's another kworker thread calling rpc_shutdown_client() via
> nfsd4_process_cb_update():
>
> crash> bt 0x342a3
> PID: 213667 TASK: ffff9daa4fde9880 CPU: 1 COMMAND: "kworker/u4:4"
> #0 [ffffb5b24077bbe0] __schedule at ffffffffb3a40ec6
> #1 [ffffb5b24077bc60] schedule at ffffffffb3a4124c
> #2 [ffffb5b24077bc78] schedule_timeout at ffffffffb3a45058
> #3 [ffffb5b24077bcd0] rpc_shutdown_client at ffffffffc056fbb3 [sunrpc]
> #4 [ffffb5b24077bd20] nfsd4_process_cb_update at ffffffffc0672c6c [nfsd]
> #5 [ffffb5b24077be68] nfsd4_run_cb_work at ffffffffc0672f0f [nfsd]
> #6 [ffffb5b24077be88] process_one_work at ffffffffb31026c6
> #7 [ffffb5b24077bed0] worker_thread at ffffffffb31028b3
> #8 [ffffb5b24077bf10] kthread at ffffffffb310960f
> #9 [ffffb5b24077bf50] ret_from_fork at ffffffffb30034f2
>
> The rpc_clnt being shut down is:
>
> crash> nfs4_client.cl_cb_client ffff9daa454db808
> cl_cb_client = 0xffff9da958909c00
>
> Which is the same as the tk_client for the rpc_task being executed by the
> thread triggering the soft lockup warnings.
>
> -Scott
>
>>
>> >
>> > > Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
>> > > cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
>> > > After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
>> > > cleared.
>> > >
>> > > A fix might be to clear RPC_TASK_SIGNALLED in
>> > > rpc_reset_task_statistics(), but I'll leave that decision to someone
>> > > else.
>> >
>> > Might be worth testing with that change just to verify that this is
>> > what's happening.
>> >
>> > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> > index c045f63d11fa..caa931888747 100644
>> > --- a/net/sunrpc/sched.c
>> > +++ b/net/sunrpc/sched.c
>> > @@ -813,7 +813,8 @@ static void
>> > rpc_reset_task_statistics(struct rpc_task *task)
>> > {
>> > task->tk_timeouts = 0;
>> > - task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
>> > + task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
>> > + RPC_TASK_SENT);
>>
>> NONONONONO.
>> RPC_TASK_SIGNALLED is a flag in tk_runstate.
>> So you need
>> clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
>>
>> NeilBrown

2021-10-08 21:12:47

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, Oct 08, 2021 at 04:27:10PM -0400, Scott Mayhew wrote:
> On Fri, 13 Aug 2021, NeilBrown wrote:
>
> > On Fri, 13 Aug 2021, J. Bruce Fields wrote:
> > > On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> > > >
> > > > The problem here appears to be that a signalled task is being retried
> > > > without clearing the SIGNALLED flag. That is causing the infinite loop
> > > > and the soft lockup.
> > > >
> > > > This bug appears to have been introduced in Linux 5.2 by
> > > > Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
> > >
> > > I wonder how we arrived here. Does it require that an rpc task returns
> > > from one of those rpc_delay() calls just as rpc_shutdown_client() is
> > > signalling it? That's the only way async tasks get signalled, I think.
> >
> > I don't think "just as" is needed.
> > I think it could only happen if rpc_shutdown_client() were called when
> > there were active tasks - presumably from nfsd4_process_cb_update(), but
> > I don't know the callback code well.
> > If any of those active tasks has a ->done handler which might try to
> > reschedule the task when tk_status == -ERESTARTSYS, then you get into
> > the infinite loop.
>
> This thread seems to have fizzled out, but I'm pretty sure I hit this
> during the Virtual Bakeathon yesterday. My server was unresponsive but
> I eventually managed to get a vmcore.

Ugh, apologies for dropping this. I spent some time trying to figure
out whether there was a fix in the NFS level--I'm skeptical of the logic
in nfs4callback.c. But I didn't come up with a fix.

--b.

>
> [182411.119788] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000f2f40905 xid 5d83adfb
> [182437.775113] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u4:1:216458]
> [182437.775633] Modules linked in: nfs_layout_flexfiles nfsv3 nfs_layout_nfsv41_files bluetooth ecdh_generic ecc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core tun rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib isofs cdrom nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink xfs intel_rapl_msr intel_rapl_common libcrc32c kvm_intel qxl kvm drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass cec joydev virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_net serio_raw ata_piix net_failover libata virtio_blk virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
> [182437.780157] CPU: 1 PID: 216458 Comm: kworker/u4:1 Kdump: loaded Not tainted 5.14.0-5.el9.x86_64 #1
> [182437.780894] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
> [182437.781567] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [182437.782500] RIP: 0010:try_to_grab_pending+0x12/0x160
> [182437.783104] Code: e7 e8 72 f3 ff ff e9 6e ff ff ff 48 89 df e8 65 f3 ff ff eb b7 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 d5 53 48 89 fb 9c <58> fa 48 89 02 40 84 f6 0f 85 92 00 00 00 f0 48 0f ba 2b 00 72 09
> [182437.784261] RSP: 0018:ffffb5b24066fd30 EFLAGS: 00000246
> [182437.785052] RAX: 0000000000000000 RBX: ffffffffc05e0768 RCX: 00000000000007d0
> [182437.785760] RDX: ffffb5b24066fd60 RSI: 0000000000000001 RDI: ffffffffc05e0768
> [182437.786399] RBP: ffffb5b24066fd60 R08: ffffffffc05e0708 R09: ffffffffc05e0708
> [182437.787010] R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> [182437.787621] R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> [182437.788235] FS: 0000000000000000(0000) GS:ffff9daa5bd00000(0000) knlGS:0000000000000000
> [182437.788859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [182437.789483] CR2: 00007f8f73d5d828 CR3: 000000008a010003 CR4: 00000000001706e0
> [182437.790188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [182437.790831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [182437.791522] Call Trace:
> [182437.792183] mod_delayed_work_on+0x3c/0x90
> [182437.792866] __rpc_sleep_on_priority_timeout+0x107/0x110 [sunrpc]
> [182437.793553] rpc_delay+0x56/0x90 [sunrpc]
> [182437.794236] nfsd4_cb_sequence_done+0x202/0x290 [nfsd]
> [182437.794910] nfsd4_cb_done+0x18/0xf0 [nfsd]
> [182437.795974] rpc_exit_task+0x58/0x100 [sunrpc]
> [182437.796955] ? rpc_do_put_task+0x60/0x60 [sunrpc]
> [182437.797645] __rpc_execute+0x5e/0x250 [sunrpc]
> [182437.798375] rpc_async_schedule+0x29/0x40 [sunrpc]
> [182437.799043] process_one_work+0x1e6/0x380
> [182437.799703] worker_thread+0x53/0x3d0
> [182437.800393] ? process_one_work+0x380/0x380
> [182437.801029] kthread+0x10f/0x130
> [182437.801686] ? set_kthread_struct+0x40/0x40
> [182437.802333] ret_from_fork+0x22/0x30
>
> The process causing the soft lockup warnings:
>
> crash> set 216458
> PID: 216458
> COMMAND: "kworker/u4:1"
> TASK: ffff9da94281e200 [THREAD_INFO: ffff9da94281e200]
> CPU: 0
> STATE: TASK_RUNNING (ACTIVE)
> crash> bt
> PID: 216458 TASK: ffff9da94281e200 CPU: 0 COMMAND: "kworker/u4:1"
> #0 [fffffe000000be50] crash_nmi_callback at ffffffffb3055c31
> #1 [fffffe000000be58] nmi_handle at ffffffffb30268f8
> #2 [fffffe000000bea0] default_do_nmi at ffffffffb3a36d42
> #3 [fffffe000000bec8] exc_nmi at ffffffffb3a36f49
> #4 [fffffe000000bef0] end_repeat_nmi at ffffffffb3c013cb
> [exception RIP: add_timer]
> RIP: ffffffffb317c230 RSP: ffffb5b24066fd58 RFLAGS: 00000046
> RAX: 000000010b3585fc RBX: 0000000000000000 RCX: 00000000000007d0
> RDX: ffffffffc05e0768 RSI: ffff9daa40312400 RDI: ffffffffc05e0788
> RBP: 0000000000002000 R8: ffffffffc05e0770 R9: ffffffffc05e0788
> R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #5 [ffffb5b24066fd58] add_timer at ffffffffb317c230
> #6 [ffffb5b24066fd58] mod_delayed_work_on at ffffffffb3103247
> #7 [ffffb5b24066fd98] __rpc_sleep_on_priority_timeout at ffffffffc0580547 [sunrpc]
> #8 [ffffb5b24066fdc8] rpc_delay at ffffffffc0589ed6 [sunrpc]
> #9 [ffffb5b24066fde8] nfsd4_cb_sequence_done at ffffffffc06731b2 [nfsd]
> #10 [ffffb5b24066fe10] nfsd4_cb_done at ffffffffc0673258 [nfsd]
> #11 [ffffb5b24066fe30] rpc_exit_task at ffffffffc05800a8 [sunrpc]
> #12 [ffffb5b24066fe40] __rpc_execute at ffffffffc0589fee [sunrpc]
> #13 [ffffb5b24066fe70] rpc_async_schedule at ffffffffc058a209 [sunrpc]
> #14 [ffffb5b24066fe88] process_one_work at ffffffffb31026c6
> #15 [ffffb5b24066fed0] worker_thread at ffffffffb31028b3
> #16 [ffffb5b24066ff10] kthread at ffffffffb310960f
> #17 [ffffb5b24066ff50] ret_from_fork at ffffffffb30034f2
>
> Looking at the rpc_task being executed:
>
> crash> rpc_task.tk_status,tk_callback,tk_action,tk_runstate,tk_client,tk_flags ffff9da94120bd00
> tk_status = 0x0
> tk_callback = 0xffffffffc057bc60 <__rpc_atrun>
> tk_action = 0xffffffffc0571f20 <call_start>
> tk_runstate = 0x47
> tk_client = 0xffff9da958909c00
> tk_flags = 0x2281
>
> tk_runstate has the following flags set: RPC_TASK_SIGNALLED, RPC_TASK_ACTIVE,
> RPC_TASK_QUEUED, and RPC_TASK_RUNNING.
>
> tk_flags is RPC_TASK_NOCONNECT|RPC_TASK_SOFT|RPC_TASK_DYNAMIC|RPC_TASK_ASYNC.
>
> There's another kworker thread calling rpc_shutdown_client() via
> nfsd4_process_cb_update():
>
> crash> bt 0x342a3
> PID: 213667 TASK: ffff9daa4fde9880 CPU: 1 COMMAND: "kworker/u4:4"
> #0 [ffffb5b24077bbe0] __schedule at ffffffffb3a40ec6
> #1 [ffffb5b24077bc60] schedule at ffffffffb3a4124c
> #2 [ffffb5b24077bc78] schedule_timeout at ffffffffb3a45058
> #3 [ffffb5b24077bcd0] rpc_shutdown_client at ffffffffc056fbb3 [sunrpc]
> #4 [ffffb5b24077bd20] nfsd4_process_cb_update at ffffffffc0672c6c [nfsd]
> #5 [ffffb5b24077be68] nfsd4_run_cb_work at ffffffffc0672f0f [nfsd]
> #6 [ffffb5b24077be88] process_one_work at ffffffffb31026c6
> #7 [ffffb5b24077bed0] worker_thread at ffffffffb31028b3
> #8 [ffffb5b24077bf10] kthread at ffffffffb310960f
> #9 [ffffb5b24077bf50] ret_from_fork at ffffffffb30034f2
>
> The rpc_clnt being shut down is:
>
> crash> nfs4_client.cl_cb_client ffff9daa454db808
> cl_cb_client = 0xffff9da958909c00
>
> Which is the same as the tk_client for the rpc_task being executed by the
> thread triggering the soft lockup warnings.
>
> -Scott
>
> >
> > >
> > > > Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
> > > > cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
> > > > After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
> > > > cleared.
> > > >
> > > > A fix might be to clear RPC_TASK_SIGNALLED in
> > > > rpc_reset_task_statistics(), but I'll leave that decision to someone
> > > > else.
> > >
> > > Might be worth testing with that change just to verify that this is
> > > what's happening.
> > >
> > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > index c045f63d11fa..caa931888747 100644
> > > --- a/net/sunrpc/sched.c
> > > +++ b/net/sunrpc/sched.c
> > > @@ -813,7 +813,8 @@ static void
> > > rpc_reset_task_statistics(struct rpc_task *task)
> > > {
> > > task->tk_timeouts = 0;
> > > - task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> > > + task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
> > > + RPC_TASK_SENT);
> >
> > NONONONONO.
> > RPC_TASK_SIGNALLED is a flag in tk_runstate.
> > So you need
> > clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
> >
> > NeilBrown
> >

2021-10-09 17:34:13

by Chuck Lever III

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> On Oct 8, 2021, at 4:27 PM, Scott Mayhew <[email protected]> wrote:
>
> On Fri, 13 Aug 2021, NeilBrown wrote:
>
>> On Fri, 13 Aug 2021, J. Bruce Fields wrote:
>>> On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
>>>>
>>>> The problem here appears to be that a signalled task is being retried
>>>> without clearing the SIGNALLED flag. That is causing the infinite loop
>>>> and the soft lockup.
>>>>
>>>> This bug appears to have been introduced in Linux 5.2 by
>>>> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
>>>
>>> I wonder how we arrived here. Does it require that an rpc task returns
>>> from one of those rpc_delay() calls just as rpc_shutdown_client() is
>>> signalling it? That's the only way async tasks get signalled, I think.
>>
>> I don't think "just as" is needed.
>> I think it could only happen if rpc_shutdown_client() were called when
>> there were active tasks - presumably from nfsd4_process_cb_update(), but
>> I don't know the callback code well.
>> If any of those active tasks has a ->done handler which might try to
>> reschedule the task when tk_status == -ERESTARTSYS, then you get into
>> the infinite loop.
>
> This thread seems to have fizzled out, but I'm pretty sure I hit this
> during the Virtual Bakeathon yesterday. My server was unresponsive but
> I eventually managed to get a vmcore.
>
> [182411.119788] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000f2f40905 xid 5d83adfb
> [182437.775113] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u4:1:216458]
> [182437.775633] Modules linked in: nfs_layout_flexfiles nfsv3 nfs_layout_nfsv41_files bluetooth ecdh_generic ecc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core tun rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib isofs cdrom nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink xfs intel_rapl_msr intel_rapl_common libcrc32c kvm_intel qxl kvm drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass cec joydev virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_net serio_raw ata_piix net_failover libata virtio_blk virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
> [182437.780157] CPU: 1 PID: 216458 Comm: kworker/u4:1 Kdump: loaded Not tainted 5.14.0-5.el9.x86_64 #1
> [182437.780894] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
> [182437.781567] Workqueue: rpciod rpc_async_schedule [sunrpc]
> [182437.782500] RIP: 0010:try_to_grab_pending+0x12/0x160
> [182437.783104] Code: e7 e8 72 f3 ff ff e9 6e ff ff ff 48 89 df e8 65 f3 ff ff eb b7 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 d5 53 48 89 fb 9c <58> fa 48 89 02 40 84 f6 0f 85 92 00 00 00 f0 48 0f ba 2b 00 72 09
> [182437.784261] RSP: 0018:ffffb5b24066fd30 EFLAGS: 00000246
> [182437.785052] RAX: 0000000000000000 RBX: ffffffffc05e0768 RCX: 00000000000007d0
> [182437.785760] RDX: ffffb5b24066fd60 RSI: 0000000000000001 RDI: ffffffffc05e0768
> [182437.786399] RBP: ffffb5b24066fd60 R08: ffffffffc05e0708 R09: ffffffffc05e0708
> [182437.787010] R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> [182437.787621] R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> [182437.788235] FS: 0000000000000000(0000) GS:ffff9daa5bd00000(0000) knlGS:0000000000000000
> [182437.788859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [182437.789483] CR2: 00007f8f73d5d828 CR3: 000000008a010003 CR4: 00000000001706e0
> [182437.790188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [182437.790831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [182437.791522] Call Trace:
> [182437.792183] mod_delayed_work_on+0x3c/0x90
> [182437.792866] __rpc_sleep_on_priority_timeout+0x107/0x110 [sunrpc]
> [182437.793553] rpc_delay+0x56/0x90 [sunrpc]
> [182437.794236] nfsd4_cb_sequence_done+0x202/0x290 [nfsd]
> [182437.794910] nfsd4_cb_done+0x18/0xf0 [nfsd]
> [182437.795974] rpc_exit_task+0x58/0x100 [sunrpc]
> [182437.796955] ? rpc_do_put_task+0x60/0x60 [sunrpc]
> [182437.797645] __rpc_execute+0x5e/0x250 [sunrpc]
> [182437.798375] rpc_async_schedule+0x29/0x40 [sunrpc]
> [182437.799043] process_one_work+0x1e6/0x380
> [182437.799703] worker_thread+0x53/0x3d0
> [182437.800393] ? process_one_work+0x380/0x380
> [182437.801029] kthread+0x10f/0x130
> [182437.801686] ? set_kthread_struct+0x40/0x40
> [182437.802333] ret_from_fork+0x22/0x30
>
> The process causing the soft lockup warnings:
>
> crash> set 216458
> PID: 216458
> COMMAND: "kworker/u4:1"
> TASK: ffff9da94281e200 [THREAD_INFO: ffff9da94281e200]
> CPU: 0
> STATE: TASK_RUNNING (ACTIVE)
> crash> bt
> PID: 216458 TASK: ffff9da94281e200 CPU: 0 COMMAND: "kworker/u4:1"
> #0 [fffffe000000be50] crash_nmi_callback at ffffffffb3055c31
> #1 [fffffe000000be58] nmi_handle at ffffffffb30268f8
> #2 [fffffe000000bea0] default_do_nmi at ffffffffb3a36d42
> #3 [fffffe000000bec8] exc_nmi at ffffffffb3a36f49
> #4 [fffffe000000bef0] end_repeat_nmi at ffffffffb3c013cb
> [exception RIP: add_timer]
> RIP: ffffffffb317c230 RSP: ffffb5b24066fd58 RFLAGS: 00000046
> RAX: 000000010b3585fc RBX: 0000000000000000 RCX: 00000000000007d0
> RDX: ffffffffc05e0768 RSI: ffff9daa40312400 RDI: ffffffffc05e0788
> RBP: 0000000000002000 R8: ffffffffc05e0770 R9: ffffffffc05e0788
> R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #5 [ffffb5b24066fd58] add_timer at ffffffffb317c230
> #6 [ffffb5b24066fd58] mod_delayed_work_on at ffffffffb3103247
> #7 [ffffb5b24066fd98] __rpc_sleep_on_priority_timeout at ffffffffc0580547 [sunrpc]
> #8 [ffffb5b24066fdc8] rpc_delay at ffffffffc0589ed6 [sunrpc]
> #9 [ffffb5b24066fde8] nfsd4_cb_sequence_done at ffffffffc06731b2 [nfsd]
> #10 [ffffb5b24066fe10] nfsd4_cb_done at ffffffffc0673258 [nfsd]
> #11 [ffffb5b24066fe30] rpc_exit_task at ffffffffc05800a8 [sunrpc]
> #12 [ffffb5b24066fe40] __rpc_execute at ffffffffc0589fee [sunrpc]
> #13 [ffffb5b24066fe70] rpc_async_schedule at ffffffffc058a209 [sunrpc]
> #14 [ffffb5b24066fe88] process_one_work at ffffffffb31026c6
> #15 [ffffb5b24066fed0] worker_thread at ffffffffb31028b3
> #16 [ffffb5b24066ff10] kthread at ffffffffb310960f
> #17 [ffffb5b24066ff50] ret_from_fork at ffffffffb30034f2
>
> Looking at the rpc_task being executed:
>
> crash> rpc_task.tk_status,tk_callback,tk_action,tk_runstate,tk_client,tk_flags ffff9da94120bd00
> tk_status = 0x0
> tk_callback = 0xffffffffc057bc60 <__rpc_atrun>
> tk_action = 0xffffffffc0571f20 <call_start>
> tk_runstate = 0x47
> tk_client = 0xffff9da958909c00
> tk_flags = 0x2281
>
> tk_runstate has the following flags set: RPC_TASK_SIGNALLED, RPC_TASK_ACTIVE,
> RPC_TASK_QUEUED, and RPC_TASK_RUNNING.
>
> tk_flags is RPC_TASK_NOCONNECT|RPC_TASK_SOFT|RPC_TASK_DYNAMIC|RPC_TASK_ASYNC.
>
> There's another kworker thread calling rpc_shutdown_client() via
> nfsd4_process_cb_update():
>
> crash> bt 0x342a3
> PID: 213667 TASK: ffff9daa4fde9880 CPU: 1 COMMAND: "kworker/u4:4"
> #0 [ffffb5b24077bbe0] __schedule at ffffffffb3a40ec6
> #1 [ffffb5b24077bc60] schedule at ffffffffb3a4124c
> #2 [ffffb5b24077bc78] schedule_timeout at ffffffffb3a45058
> #3 [ffffb5b24077bcd0] rpc_shutdown_client at ffffffffc056fbb3 [sunrpc]
> #4 [ffffb5b24077bd20] nfsd4_process_cb_update at ffffffffc0672c6c [nfsd]
> #5 [ffffb5b24077be68] nfsd4_run_cb_work at ffffffffc0672f0f [nfsd]
> #6 [ffffb5b24077be88] process_one_work at ffffffffb31026c6
> #7 [ffffb5b24077bed0] worker_thread at ffffffffb31028b3
> #8 [ffffb5b24077bf10] kthread at ffffffffb310960f
> #9 [ffffb5b24077bf50] ret_from_fork at ffffffffb30034f2
>
> The rpc_clnt being shut down is:
>
> crash> nfs4_client.cl_cb_client ffff9daa454db808
> cl_cb_client = 0xffff9da958909c00
>
> Which is the same as the tk_client for the rpc_task being executed by the
> thread triggering the soft lockup warnings.

I've seen a similar issue before.

There is a race between shutting down the client (which kills
running RPC tasks) and some process starting another RPC task
under this client.

For example, killing the last RPC task when there is a GSS
context results in a GSS_CTX_DESTROY request being started
on the same rpc_clnt, and that can deadlock.

I'm not sure quite why this would happen in the backchannel,
but its possible the nfsd_cb logic is kicking off another
CB operation (like a probe) subsequent to shutting down the
rpc_clnt that is associated with the backchannel.

> -Scott
>
>>
>>>
>>>> Prior to this commit a flag RPC_TASK_KILLED was used, and it gets
>>>> cleared by rpc_reset_task_statistics() (called from rpc_exit_task()).
>>>> After this commit a new flag RPC_TASK_SIGNALLED is used, and it is never
>>>> cleared.
>>>>
>>>> A fix might be to clear RPC_TASK_SIGNALLED in
>>>> rpc_reset_task_statistics(), but I'll leave that decision to someone
>>>> else.
>>>
>>> Might be worth testing with that change just to verify that this is
>>> what's happening.
>>>
>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>> index c045f63d11fa..caa931888747 100644
>>> --- a/net/sunrpc/sched.c
>>> +++ b/net/sunrpc/sched.c
>>> @@ -813,7 +813,8 @@ static void
>>> rpc_reset_task_statistics(struct rpc_task *task)
>>> {
>>> task->tk_timeouts = 0;
>>> - task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
>>> + task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SIGNALLED|
>>> + RPC_TASK_SENT);
>>
>> NONONONONO.
>> RPC_TASK_SIGNALLED is a flag in tk_runstate.
>> So you need
>> clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
>>
>> NeilBrown
>>
>

--
Chuck Lever

2021-10-11 14:34:35

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Sat, Oct 09, 2021 at 05:33:18PM +0000, Chuck Lever III wrote:
>
>
> > On Oct 8, 2021, at 4:27 PM, Scott Mayhew <[email protected]> wrote:
> >
> > On Fri, 13 Aug 2021, NeilBrown wrote:
> >
> >> On Fri, 13 Aug 2021, J. Bruce Fields wrote:
> >>> On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
> >>>>
> >>>> The problem here appears to be that a signalled task is being retried
> >>>> without clearing the SIGNALLED flag. That is causing the infinite loop
> >>>> and the soft lockup.
> >>>>
> >>>> This bug appears to have been introduced in Linux 5.2 by
> >>>> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
> >>>
> >>> I wonder how we arrived here. Does it require that an rpc task returns
> >>> from one of those rpc_delay() calls just as rpc_shutdown_client() is
> >>> signalling it? That's the only way async tasks get signalled, I think.
> >>
> >> I don't think "just as" is needed.
> >> I think it could only happen if rpc_shutdown_client() were called when
> >> there were active tasks - presumably from nfsd4_process_cb_update(), but
> >> I don't know the callback code well.
> >> If any of those active tasks has a ->done handler which might try to
> >> reschedule the task when tk_status == -ERESTARTSYS, then you get into
> >> the infinite loop.
> >
> > This thread seems to have fizzled out, but I'm pretty sure I hit this
> > during the Virtual Bakeathon yesterday. My server was unresponsive but
> > I eventually managed to get a vmcore.
> >
> > [182411.119788] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000f2f40905 xid 5d83adfb
> > [182437.775113] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u4:1:216458]
> > [182437.775633] Modules linked in: nfs_layout_flexfiles nfsv3 nfs_layout_nfsv41_files bluetooth ecdh_generic ecc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core tun rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib isofs cdrom nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink xfs intel_rapl_msr intel_rapl_common libcrc32c kvm_intel qxl kvm drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass cec joydev virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_net serio_raw ata_piix net_failover libata virtio_blk virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
> > [182437.780157] CPU: 1 PID: 216458 Comm: kworker/u4:1 Kdump: loaded Not tainted 5.14.0-5.el9.x86_64 #1
> > [182437.780894] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
> > [182437.781567] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > [182437.782500] RIP: 0010:try_to_grab_pending+0x12/0x160
> > [182437.783104] Code: e7 e8 72 f3 ff ff e9 6e ff ff ff 48 89 df e8 65 f3 ff ff eb b7 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 d5 53 48 89 fb 9c <58> fa 48 89 02 40 84 f6 0f 85 92 00 00 00 f0 48 0f ba 2b 00 72 09
> > [182437.784261] RSP: 0018:ffffb5b24066fd30 EFLAGS: 00000246
> > [182437.785052] RAX: 0000000000000000 RBX: ffffffffc05e0768 RCX: 00000000000007d0
> > [182437.785760] RDX: ffffb5b24066fd60 RSI: 0000000000000001 RDI: ffffffffc05e0768
> > [182437.786399] RBP: ffffb5b24066fd60 R08: ffffffffc05e0708 R09: ffffffffc05e0708
> > [182437.787010] R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> > [182437.787621] R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> > [182437.788235] FS: 0000000000000000(0000) GS:ffff9daa5bd00000(0000) knlGS:0000000000000000
> > [182437.788859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [182437.789483] CR2: 00007f8f73d5d828 CR3: 000000008a010003 CR4: 00000000001706e0
> > [182437.790188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [182437.790831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [182437.791522] Call Trace:
> > [182437.792183] mod_delayed_work_on+0x3c/0x90
> > [182437.792866] __rpc_sleep_on_priority_timeout+0x107/0x110 [sunrpc]
> > [182437.793553] rpc_delay+0x56/0x90 [sunrpc]
> > [182437.794236] nfsd4_cb_sequence_done+0x202/0x290 [nfsd]
> > [182437.794910] nfsd4_cb_done+0x18/0xf0 [nfsd]
> > [182437.795974] rpc_exit_task+0x58/0x100 [sunrpc]
> > [182437.796955] ? rpc_do_put_task+0x60/0x60 [sunrpc]
> > [182437.797645] __rpc_execute+0x5e/0x250 [sunrpc]
> > [182437.798375] rpc_async_schedule+0x29/0x40 [sunrpc]
> > [182437.799043] process_one_work+0x1e6/0x380
> > [182437.799703] worker_thread+0x53/0x3d0
> > [182437.800393] ? process_one_work+0x380/0x380
> > [182437.801029] kthread+0x10f/0x130
> > [182437.801686] ? set_kthread_struct+0x40/0x40
> > [182437.802333] ret_from_fork+0x22/0x30
> >
> > The process causing the soft lockup warnings:
> >
> > crash> set 216458
> > PID: 216458
> > COMMAND: "kworker/u4:1"
> > TASK: ffff9da94281e200 [THREAD_INFO: ffff9da94281e200]
> > CPU: 0
> > STATE: TASK_RUNNING (ACTIVE)
> > crash> bt
> > PID: 216458 TASK: ffff9da94281e200 CPU: 0 COMMAND: "kworker/u4:1"
> > #0 [fffffe000000be50] crash_nmi_callback at ffffffffb3055c31
> > #1 [fffffe000000be58] nmi_handle at ffffffffb30268f8
> > #2 [fffffe000000bea0] default_do_nmi at ffffffffb3a36d42
> > #3 [fffffe000000bec8] exc_nmi at ffffffffb3a36f49
> > #4 [fffffe000000bef0] end_repeat_nmi at ffffffffb3c013cb
> > [exception RIP: add_timer]
> > RIP: ffffffffb317c230 RSP: ffffb5b24066fd58 RFLAGS: 00000046
> > RAX: 000000010b3585fc RBX: 0000000000000000 RCX: 00000000000007d0
> > RDX: ffffffffc05e0768 RSI: ffff9daa40312400 RDI: ffffffffc05e0788
> > RBP: 0000000000002000 R8: ffffffffc05e0770 R9: ffffffffc05e0788
> > R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
> > R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
> > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > --- <NMI exception stack> ---
> > #5 [ffffb5b24066fd58] add_timer at ffffffffb317c230
> > #6 [ffffb5b24066fd58] mod_delayed_work_on at ffffffffb3103247
> > #7 [ffffb5b24066fd98] __rpc_sleep_on_priority_timeout at ffffffffc0580547 [sunrpc]
> > #8 [ffffb5b24066fdc8] rpc_delay at ffffffffc0589ed6 [sunrpc]
> > #9 [ffffb5b24066fde8] nfsd4_cb_sequence_done at ffffffffc06731b2 [nfsd]
> > #10 [ffffb5b24066fe10] nfsd4_cb_done at ffffffffc0673258 [nfsd]
> > #11 [ffffb5b24066fe30] rpc_exit_task at ffffffffc05800a8 [sunrpc]
> > #12 [ffffb5b24066fe40] __rpc_execute at ffffffffc0589fee [sunrpc]
> > #13 [ffffb5b24066fe70] rpc_async_schedule at ffffffffc058a209 [sunrpc]
> > #14 [ffffb5b24066fe88] process_one_work at ffffffffb31026c6
> > #15 [ffffb5b24066fed0] worker_thread at ffffffffb31028b3
> > #16 [ffffb5b24066ff10] kthread at ffffffffb310960f
> > #17 [ffffb5b24066ff50] ret_from_fork at ffffffffb30034f2
> >
> > Looking at the rpc_task being executed:
> >
> > crash> rpc_task.tk_status,tk_callback,tk_action,tk_runstate,tk_client,tk_flags ffff9da94120bd00
> > tk_status = 0x0
> > tk_callback = 0xffffffffc057bc60 <__rpc_atrun>
> > tk_action = 0xffffffffc0571f20 <call_start>
> > tk_runstate = 0x47
> > tk_client = 0xffff9da958909c00
> > tk_flags = 0x2281
> >
> > tk_runstate has the following flags set: RPC_TASK_SIGNALLED, RPC_TASK_ACTIVE,
> > RPC_TASK_QUEUED, and RPC_TASK_RUNNING.
> >
> > tk_flags is RPC_TASK_NOCONNECT|RPC_TASK_SOFT|RPC_TASK_DYNAMIC|RPC_TASK_ASYNC.
> >
> > There's another kworker thread calling rpc_shutdown_client() via
> > nfsd4_process_cb_update():
> >
> > crash> bt 0x342a3
> > PID: 213667 TASK: ffff9daa4fde9880 CPU: 1 COMMAND: "kworker/u4:4"
> > #0 [ffffb5b24077bbe0] __schedule at ffffffffb3a40ec6
> > #1 [ffffb5b24077bc60] schedule at ffffffffb3a4124c
> > #2 [ffffb5b24077bc78] schedule_timeout at ffffffffb3a45058
> > #3 [ffffb5b24077bcd0] rpc_shutdown_client at ffffffffc056fbb3 [sunrpc]
> > #4 [ffffb5b24077bd20] nfsd4_process_cb_update at ffffffffc0672c6c [nfsd]
> > #5 [ffffb5b24077be68] nfsd4_run_cb_work at ffffffffc0672f0f [nfsd]
> > #6 [ffffb5b24077be88] process_one_work at ffffffffb31026c6
> > #7 [ffffb5b24077bed0] worker_thread at ffffffffb31028b3
> > #8 [ffffb5b24077bf10] kthread at ffffffffb310960f
> > #9 [ffffb5b24077bf50] ret_from_fork at ffffffffb30034f2
> >
> > The rpc_clnt being shut down is:
> >
> > crash> nfs4_client.cl_cb_client ffff9daa454db808
> > cl_cb_client = 0xffff9da958909c00
> >
> > Which is the same as the tk_client for the rpc_task being executed by the
> > thread triggering the soft lockup warnings.
>
> I've seen a similar issue before.
>
> There is a race between shutting down the client (which kills
> running RPC tasks) and some process starting another RPC task
> under this client.

Neil's analysis looked pretty convincing:

https://lore.kernel.org/linux-nfs/[email protected]/T/#m9c84d4c8f71422f4f10b1e4b0fae442af449366a

Assuming this is the same thing--he thought it was a regression due to
ae67bd3821bb ("SUNRPC: Fix up task signalling"). I'm not sure if the
bug is in that patch or if it's uncovering a preexisting bug in how nfsd
reschedules callbacks.

--b.

2021-10-11 16:41:40

by Chuck Lever III

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

> On Oct 11, 2021, at 10:30 AM, Bruce Fields <[email protected]> wrote:
>
> On Sat, Oct 09, 2021 at 05:33:18PM +0000, Chuck Lever III wrote:
>>
>>
>>> On Oct 8, 2021, at 4:27 PM, Scott Mayhew <[email protected]> wrote:
>>>
>>> On Fri, 13 Aug 2021, NeilBrown wrote:
>>>
>>>> On Fri, 13 Aug 2021, J. Bruce Fields wrote:
>>>>> On Tue, Aug 10, 2021 at 10:43:31AM +1000, NeilBrown wrote:
>>>>>>
>>>>>> The problem here appears to be that a signalled task is being retried
>>>>>> without clearing the SIGNALLED flag. That is causing the infinite loop
>>>>>> and the soft lockup.
>>>>>>
>>>>>> This bug appears to have been introduced in Linux 5.2 by
>>>>>> Commit: ae67bd3821bb ("SUNRPC: Fix up task signalling")
>>>>>
>>>>> I wonder how we arrived here. Does it require that an rpc task returns
>>>>> from one of those rpc_delay() calls just as rpc_shutdown_client() is
>>>>> signalling it? That's the only way async tasks get signalled, I think.
>>>>
>>>> I don't think "just as" is needed.
>>>> I think it could only happen if rpc_shutdown_client() were called when
>>>> there were active tasks - presumably from nfsd4_process_cb_update(), but
>>>> I don't know the callback code well.
>>>> If any of those active tasks has a ->done handler which might try to
>>>> reschedule the task when tk_status == -ERESTARTSYS, then you get into
>>>> the infinite loop.
>>>
>>> This thread seems to have fizzled out, but I'm pretty sure I hit this
>>> during the Virtual Bakeathon yesterday. My server was unresponsive but
>>> I eventually managed to get a vmcore.
>>>
>>> [182411.119788] receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt 00000000f2f40905 xid 5d83adfb
>>> [182437.775113] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kworker/u4:1:216458]
>>> [182437.775633] Modules linked in: nfs_layout_flexfiles nfsv3 nfs_layout_nfsv41_files bluetooth ecdh_generic ecc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core tun rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib isofs cdrom nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink xfs intel_rapl_msr intel_rapl_common libcrc32c kvm_intel qxl kvm drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops irqbypass cec joydev virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm fuse ext4 mbcache jbd2 ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_net serio_raw ata_piix net_failover libata virtio_blk virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
>>> [182437.780157] CPU: 1 PID: 216458 Comm: kworker/u4:1 Kdump: loaded Not tainted 5.14.0-5.el9.x86_64 #1
>>> [182437.780894] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017
>>> [182437.781567] Workqueue: rpciod rpc_async_schedule [sunrpc]
>>> [182437.782500] RIP: 0010:try_to_grab_pending+0x12/0x160
>>> [182437.783104] Code: e7 e8 72 f3 ff ff e9 6e ff ff ff 48 89 df e8 65 f3 ff ff eb b7 0f 1f 00 0f 1f 44 00 00 41 55 41 54 55 48 89 d5 53 48 89 fb 9c <58> fa 48 89 02 40 84 f6 0f 85 92 00 00 00 f0 48 0f ba 2b 00 72 09
>>> [182437.784261] RSP: 0018:ffffb5b24066fd30 EFLAGS: 00000246
>>> [182437.785052] RAX: 0000000000000000 RBX: ffffffffc05e0768 RCX: 00000000000007d0
>>> [182437.785760] RDX: ffffb5b24066fd60 RSI: 0000000000000001 RDI: ffffffffc05e0768
>>> [182437.786399] RBP: ffffb5b24066fd60 R08: ffffffffc05e0708 R09: ffffffffc05e0708
>>> [182437.787010] R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
>>> [182437.787621] R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
>>> [182437.788235] FS: 0000000000000000(0000) GS:ffff9daa5bd00000(0000) knlGS:0000000000000000
>>> [182437.788859] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [182437.789483] CR2: 00007f8f73d5d828 CR3: 000000008a010003 CR4: 00000000001706e0
>>> [182437.790188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [182437.790831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [182437.791522] Call Trace:
>>> [182437.792183] mod_delayed_work_on+0x3c/0x90
>>> [182437.792866] __rpc_sleep_on_priority_timeout+0x107/0x110 [sunrpc]
>>> [182437.793553] rpc_delay+0x56/0x90 [sunrpc]
>>> [182437.794236] nfsd4_cb_sequence_done+0x202/0x290 [nfsd]
>>> [182437.794910] nfsd4_cb_done+0x18/0xf0 [nfsd]
>>> [182437.795974] rpc_exit_task+0x58/0x100 [sunrpc]
>>> [182437.796955] ? rpc_do_put_task+0x60/0x60 [sunrpc]
>>> [182437.797645] __rpc_execute+0x5e/0x250 [sunrpc]
>>> [182437.798375] rpc_async_schedule+0x29/0x40 [sunrpc]
>>> [182437.799043] process_one_work+0x1e6/0x380
>>> [182437.799703] worker_thread+0x53/0x3d0
>>> [182437.800393] ? process_one_work+0x380/0x380
>>> [182437.801029] kthread+0x10f/0x130
>>> [182437.801686] ? set_kthread_struct+0x40/0x40
>>> [182437.802333] ret_from_fork+0x22/0x30
>>>
>>> The process causing the soft lockup warnings:
>>>
>>> crash> set 216458
>>> PID: 216458
>>> COMMAND: "kworker/u4:1"
>>> TASK: ffff9da94281e200 [THREAD_INFO: ffff9da94281e200]
>>> CPU: 0
>>> STATE: TASK_RUNNING (ACTIVE)
>>> crash> bt
>>> PID: 216458 TASK: ffff9da94281e200 CPU: 0 COMMAND: "kworker/u4:1"
>>> #0 [fffffe000000be50] crash_nmi_callback at ffffffffb3055c31
>>> #1 [fffffe000000be58] nmi_handle at ffffffffb30268f8
>>> #2 [fffffe000000bea0] default_do_nmi at ffffffffb3a36d42
>>> #3 [fffffe000000bec8] exc_nmi at ffffffffb3a36f49
>>> #4 [fffffe000000bef0] end_repeat_nmi at ffffffffb3c013cb
>>> [exception RIP: add_timer]
>>> RIP: ffffffffb317c230 RSP: ffffb5b24066fd58 RFLAGS: 00000046
>>> RAX: 000000010b3585fc RBX: 0000000000000000 RCX: 00000000000007d0
>>> RDX: ffffffffc05e0768 RSI: ffff9daa40312400 RDI: ffffffffc05e0788
>>> RBP: 0000000000002000 R8: ffffffffc05e0770 R9: ffffffffc05e0788
>>> R10: 0000000000000003 R11: 0000000000000003 R12: ffffffffc05e0768
>>> R13: ffff9daa40312400 R14: 00000000000007d0 R15: 0000000000000000
>>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>>> --- <NMI exception stack> ---
>>> #5 [ffffb5b24066fd58] add_timer at ffffffffb317c230
>>> #6 [ffffb5b24066fd58] mod_delayed_work_on at ffffffffb3103247
>>> #7 [ffffb5b24066fd98] __rpc_sleep_on_priority_timeout at ffffffffc0580547 [sunrpc]
>>> #8 [ffffb5b24066fdc8] rpc_delay at ffffffffc0589ed6 [sunrpc]
>>> #9 [ffffb5b24066fde8] nfsd4_cb_sequence_done at ffffffffc06731b2 [nfsd]
>>> #10 [ffffb5b24066fe10] nfsd4_cb_done at ffffffffc0673258 [nfsd]
>>> #11 [ffffb5b24066fe30] rpc_exit_task at ffffffffc05800a8 [sunrpc]
>>> #12 [ffffb5b24066fe40] __rpc_execute at ffffffffc0589fee [sunrpc]
>>> #13 [ffffb5b24066fe70] rpc_async_schedule at ffffffffc058a209 [sunrpc]
>>> #14 [ffffb5b24066fe88] process_one_work at ffffffffb31026c6
>>> #15 [ffffb5b24066fed0] worker_thread at ffffffffb31028b3
>>> #16 [ffffb5b24066ff10] kthread at ffffffffb310960f
>>> #17 [ffffb5b24066ff50] ret_from_fork at ffffffffb30034f2
>>>
>>> Looking at the rpc_task being executed:
>>>
>>> crash> rpc_task.tk_status,tk_callback,tk_action,tk_runstate,tk_client,tk_flags ffff9da94120bd00
>>> tk_status = 0x0
>>> tk_callback = 0xffffffffc057bc60 <__rpc_atrun>
>>> tk_action = 0xffffffffc0571f20 <call_start>
>>> tk_runstate = 0x47
>>> tk_client = 0xffff9da958909c00
>>> tk_flags = 0x2281
>>>
>>> tk_runstate has the following flags set: RPC_TASK_SIGNALLED, RPC_TASK_ACTIVE,
>>> RPC_TASK_QUEUED, and RPC_TASK_RUNNING.
>>>
>>> tk_flags is RPC_TASK_NOCONNECT|RPC_TASK_SOFT|RPC_TASK_DYNAMIC|RPC_TASK_ASYNC.
>>>
>>> There's another kworker thread calling rpc_shutdown_client() via
>>> nfsd4_process_cb_update():
>>>
>>> crash> bt 0x342a3
>>> PID: 213667 TASK: ffff9daa4fde9880 CPU: 1 COMMAND: "kworker/u4:4"
>>> #0 [ffffb5b24077bbe0] __schedule at ffffffffb3a40ec6
>>> #1 [ffffb5b24077bc60] schedule at ffffffffb3a4124c
>>> #2 [ffffb5b24077bc78] schedule_timeout at ffffffffb3a45058
>>> #3 [ffffb5b24077bcd0] rpc_shutdown_client at ffffffffc056fbb3 [sunrpc]
>>> #4 [ffffb5b24077bd20] nfsd4_process_cb_update at ffffffffc0672c6c [nfsd]
>>> #5 [ffffb5b24077be68] nfsd4_run_cb_work at ffffffffc0672f0f [nfsd]
>>> #6 [ffffb5b24077be88] process_one_work at ffffffffb31026c6
>>> #7 [ffffb5b24077bed0] worker_thread at ffffffffb31028b3
>>> #8 [ffffb5b24077bf10] kthread at ffffffffb310960f
>>> #9 [ffffb5b24077bf50] ret_from_fork at ffffffffb30034f2
>>>
>>> The rpc_clnt being shut down is:
>>>
>>> crash> nfs4_client.cl_cb_client ffff9daa454db808
>>> cl_cb_client = 0xffff9da958909c00
>>>
>>> Which is the same as the tk_client for the rpc_task being executed by the
>>> thread triggering the soft lockup warnings.
>>
>> I've seen a similar issue before.
>>
>> There is a race between shutting down the client (which kills
>> running RPC tasks) and some process starting another RPC task
>> under this client.
>
> Neil's analysis looked pretty convincing:
>
> https://lore.kernel.org/linux-nfs/[email protected]/T/#m9c84d4c8f71422f4f10b1e4b0fae442af449366a

That lines up with my (scant) experience in this area.

> Assuming this is the same thing--he thought it was a regression due to
> ae67bd3821bb ("SUNRPC: Fix up task signalling"). I'm not sure if the
> bug is in that patch or if it's uncovering a preexisting bug in how nfsd
> reschedules callbacks.

Post-mortem analysis has been great, but it seems to have hit a
dead end. We understand where we end up, but not yet how we got
there.

Scott seems well positioned to identify a reproducer. Maybe we
can give him some likely candidates for possible bugs to explore
first.

--
Chuck Lever

2021-10-15 03:02:44

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Tue, 12 Oct 2021, Chuck Lever III wrote:
>
> Scott seems well positioned to identify a reproducer. Maybe we
> can give him some likely candidates for possible bugs to explore
> first.

Has this patch been tried?

NeilBrown

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c045f63d11fa..308f5961cb78 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -814,6 +814,7 @@ rpc_reset_task_statistics(struct rpc_task *task)
{
task->tk_timeouts = 0;
task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
+ clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
rpc_init_task_statistics(task);
}

2021-10-15 05:58:50

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Tue, 2021-10-12 at 08:57 +1100, NeilBrown wrote:
> On Tue, 12 Oct 2021, Chuck Lever III wrote:
> >
> > Scott seems well positioned to identify a reproducer. Maybe we
> > can give him some likely candidates for possible bugs to explore
> > first.
>
> Has this patch been tried?
>
> NeilBrown
>
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index c045f63d11fa..308f5961cb78 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -814,6 +814,7 @@ rpc_reset_task_statistics(struct rpc_task *task)
> {
>         task->tk_timeouts = 0;
>         task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> +       clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
>         rpc_init_task_st

We shouldn't automatically "unsignal" a task once it has been told to
die. The correct thing to do here should rather be to change
rpc_restart_call() to exit early if the task was signalled.

> atistics(task);
> }
>

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2021-10-15 06:00:19

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, 15 Oct 2021, Trond Myklebust wrote:
> On Tue, 2021-10-12 at 08:57 +1100, NeilBrown wrote:
> > On Tue, 12 Oct 2021, Chuck Lever III wrote:
> > >
> > > Scott seems well positioned to identify a reproducer. Maybe we
> > > can give him some likely candidates for possible bugs to explore
> > > first.
> >
> > Has this patch been tried?
> >
> > NeilBrown
> >
> >
> > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > index c045f63d11fa..308f5961cb78 100644
> > --- a/net/sunrpc/sched.c
> > +++ b/net/sunrpc/sched.c
> > @@ -814,6 +814,7 @@ rpc_reset_task_statistics(struct rpc_task *task)
> > {
> >         task->tk_timeouts = 0;
> >         task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> > +       clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
> >         rpc_init_task_st
>
> We shouldn't automatically "unsignal" a task once it has been told to
> die. The correct thing to do here should rather be to change
> rpc_restart_call() to exit early if the task was signalled.
>

Maybe. It depends on exactly what the signal meant (rpc_killall_tasks()
is a bit different from getting a SIGKILL), and exactly what the task is
trying to achieve.

Before Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
that is exactly what we did.
If we want to change the behaviour of a task responding to
rpc_killall_tasks(), we should clearly justify it in a patch doing
exactly that.

NeilBrown

2021-10-15 09:20:56

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, 2021-10-15 at 09:51 +1100, NeilBrown wrote:
> On Fri, 15 Oct 2021, Trond Myklebust wrote:
> > On Tue, 2021-10-12 at 08:57 +1100, NeilBrown wrote:
> > > On Tue, 12 Oct 2021, Chuck Lever III wrote:
> > > >
> > > > Scott seems well positioned to identify a reproducer. Maybe we
> > > > can give him some likely candidates for possible bugs to
> > > > explore
> > > > first.
> > >
> > > Has this patch been tried?
> > >
> > > NeilBrown
> > >
> > >
> > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > index c045f63d11fa..308f5961cb78 100644
> > > --- a/net/sunrpc/sched.c
> > > +++ b/net/sunrpc/sched.c
> > > @@ -814,6 +814,7 @@ rpc_reset_task_statistics(struct rpc_task
> > > *task)
> > > {
> > >         task->tk_timeouts = 0;
> > >         task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> > > +       clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
> > >         rpc_init_task_st
> >
> > We shouldn't automatically "unsignal" a task once it has been told
> > to
> > die. The correct thing to do here should rather be to change
> > rpc_restart_call() to exit early if the task was signalled.
> >
>
> Maybe. It depends on exactly what the signal meant
> (rpc_killall_tasks()
> is a bit different from getting a SIGKILL), and exactly what the task
> is
> trying to achieve.
>
> Before Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
> that is exactly what we did.
> If we want to change the behaviour of a task responding to
> rpc_killall_tasks(), we should clearly justify it in a patch doing
> exactly that.
>

The intention behind rpc_killall_tasks() never changed, which is why it
is listed in nfs_error_is_fatal(). I'm not aware of any case where we
deliberately override in order to restart the RPC call on an
ERESTARTSYS error.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2021-10-15 09:27:17

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, 2021-10-15 at 08:03 +0000, Trond Myklebust wrote:
> On Fri, 2021-10-15 at 09:51 +1100, NeilBrown wrote:
> > On Fri, 15 Oct 2021, Trond Myklebust wrote:
> > > On Tue, 2021-10-12 at 08:57 +1100, NeilBrown wrote:
> > > > On Tue, 12 Oct 2021, Chuck Lever III wrote:
> > > > >
> > > > > Scott seems well positioned to identify a reproducer. Maybe
> > > > > we
> > > > > can give him some likely candidates for possible bugs to
> > > > > explore
> > > > > first.
> > > >
> > > > Has this patch been tried?
> > > >
> > > > NeilBrown
> > > >
> > > >
> > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > > index c045f63d11fa..308f5961cb78 100644
> > > > --- a/net/sunrpc/sched.c
> > > > +++ b/net/sunrpc/sched.c
> > > > @@ -814,6 +814,7 @@ rpc_reset_task_statistics(struct rpc_task
> > > > *task)
> > > > {
> > > >         task->tk_timeouts = 0;
> > > >         task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> > > > +       clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
> > > >         rpc_init_task_st
> > >
> > > We shouldn't automatically "unsignal" a task once it has been
> > > told
> > > to
> > > die. The correct thing to do here should rather be to change
> > > rpc_restart_call() to exit early if the task was signalled.
> > >
> >
> > Maybe. It depends on exactly what the signal meant
> > (rpc_killall_tasks()
> > is a bit different from getting a SIGKILL), and exactly what the
> > task
> > is
> > trying to achieve.
> >
> > Before Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
> > that is exactly what we did.
> > If we want to change the behaviour of a task responding to
> > rpc_killall_tasks(), we should clearly justify it in a patch doing
> > exactly that.
> >
>
> The intention behind rpc_killall_tasks() never changed, which is why
> it

("it" being the error ERESTARTSYS)

> is listed in nfs_error_is_fatal(). I'm not aware of any case where we
> deliberately override in order to restart the RPC call on an
> ERESTARTSYS error.
>
>

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2021-12-01 18:37:13

by Scott Mayhew

[permalink] [raw]

Subject: Re: CPU stall, eventual host hang with BTRFS + NFS under heavy load

On Fri, 15 Oct 2021, Trond Myklebust wrote:

> On Fri, 2021-10-15 at 08:03 +0000, Trond Myklebust wrote:
> > On Fri, 2021-10-15 at 09:51 +1100, NeilBrown wrote:
> > > On Fri, 15 Oct 2021, Trond Myklebust wrote:
> > > > On Tue, 2021-10-12 at 08:57 +1100, NeilBrown wrote:
> > > > > On Tue, 12 Oct 2021, Chuck Lever III wrote:
> > > > > >
> > > > > > Scott seems well positioned to identify a reproducer. Maybe
> > > > > > we
> > > > > > can give him some likely candidates for possible bugs to
> > > > > > explore
> > > > > > first.
> > > > >
> > > > > Has this patch been tried?
> > > > >
> > > > > NeilBrown
> > > > >
> > > > >
> > > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > > > index c045f63d11fa..308f5961cb78 100644
> > > > > --- a/net/sunrpc/sched.c
> > > > > +++ b/net/sunrpc/sched.c
> > > > > @@ -814,6 +814,7 @@ rpc_reset_task_statistics(struct rpc_task
> > > > > *task)
> > > > > ?{
> > > > > ????????task->tk_timeouts = 0;
> > > > > ????????task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
> > > > > +???????clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
> > > > > ????????rpc_init_task_st
> > > >
> > > > We shouldn't automatically "unsignal" a task once it has been
> > > > told
> > > > to
> > > > die. The correct thing to do here should rather be to change
> > > > rpc_restart_call() to exit early if the task was signalled.
> > > >
> > >
> > > Maybe.? It depends on exactly what the signal meant
> > > (rpc_killall_tasks()
> > > is a bit different from getting a SIGKILL), and exactly what the
> > > task
> > > is
> > > trying to achieve.
> > >
> > > Before Commit ae67bd3821bb ("SUNRPC: Fix up task signalling")
> > > that is exactly what we did.
> > > If we want to change the behaviour of a task responding to
> > > rpc_killall_tasks(), we should clearly justify it in a patch doing
> > > exactly that.
> > >
> >
> > The intention behind rpc_killall_tasks() never changed, which is why
> > it
>
> ("it" being the error ERESTARTSYS)
>
> > is listed in nfs_error_is_fatal(). I'm not aware of any case where we
> > deliberately override in order to restart the RPC call on an
> > ERESTARTSYS error.
> >
Update: I'm not able to reproduce this with an upstream kernel. I
bisected it down to commit 2ba5acfb3495 "SUNRPC: fix sign error causing
rpcsec_gss drops" as the commit that "fixed" the issue (but really just
makes the issue less likely to occur, I think).

I also tested commit 10b9d99a3dbb "SUNRPC: Augment server-side rpcgss
tracepoints" (the commit in the Fixes: tag of 2ba5acfb3495) as well as
commit 0e885e846d96 "nfsd: add fattr support for user extended attributes"
(the parent of commit 10b9d99a3dbb) and verified that commit
10b9d99a3dbb is where the issue started occurring.

I think what is happening is that the NFS server gets a request that it
thinks is outside of the GSS sequence window and drops the request,
closes the connection and calls nfsd4_conn_lost(), which calls
nfsd4_probe_callback() which sets NFSD4_CLIENT_CB_UPDATE in
clp->cl_flags. Then the client reestablishes the connection on that
port, sends another request which receives
NFS4ERR_CONN_NOT_BOUND_TO_SESSION. The client runs the state manager
which calls nfs4_bind_conn_to_session(), which calls
nfs4_begin_drain_session(), which sets NFS4_SLOT_TBL_DRAINING in
tbl->slot_tbl_state. Meanwhile a conflicting request comes in that
causes the server to recall the delegation. Since
NFS4_SLOT_TBL_DRAINING is set, the client responds to the CB_SEQUENCE
with NFS4ERR_DELAY. At the same time, the BIND_CONN_TO_SESSION requests
from the client are causing the server to call
nfsd4_process_cb_update(), since NFSD4_CLIENT_CB_UPDATE flag is set.
nfsd4_process_cb_update() calls rpc_shutdown_client() which signals the
CB_RECALL task, which the server is trying re-send due to the
NFS4ERR_DELAY, and we get into the soft-lockup.

I tried this patch

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 20db98679d6b..187f7f1cc02a 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -803,6 +803,7 @@ rpc_reset_task_statistics(struct rpc_task *task)
{
task->tk_timeouts = 0;
task->tk_flags &= ~(RPC_CALL_MAJORSEEN|RPC_TASK_SENT);
+ clear_bit(RPC_TASK_SIGNALLED, &task->tk_runstate);
rpc_init_task_statistics(task);
}

but instead of fixing the soft-lockup I just wind up with a hung task:

INFO: task nfsd:1367 blocked for more than 120 seconds.
[ 3195.902559] Not tainted 4.18.0-353.el8.jsm.test.1.x86_64 #1
[ 3195.905411] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3195.908076] task:nfsd state:D stack: 0 pid: 1367 ppid: 2 flags:0x80004080
[ 3195.910906] Call Trace:
[ 3195.911915] __schedule+0x2d1/0x830
[ 3195.913211] schedule+0x35/0xa0
[ 3195.914377] schedule_timeout+0x274/0x300
[ 3195.915919] ? check_preempt_wakeup+0x113/0x230
[ 3195.916907] wait_for_completion+0x96/0x100
[ 3195.917629] flush_workqueue+0x14d/0x440
[ 3195.918342] nfsd4_destroy_session+0x198/0x230 [nfsd]
[ 3195.919277] nfsd4_proc_compound+0x388/0x6d0 [nfsd]
[ 3195.920144] nfsd_dispatch+0x108/0x210 [nfsd]
[ 3195.920922] svc_process_common+0x2b3/0x700 [sunrpc]
[ 3195.921871] ? svc_xprt_received+0x45/0x80 [sunrpc]
[ 3195.922722] ? nfsd_svc+0x2e0/0x2e0 [nfsd]
[ 3195.923441] ? nfsd_destroy+0x50/0x50 [nfsd]
[ 3195.924199] svc_process+0xb7/0xf0 [sunrpc]
[ 3195.924971] nfsd+0xe3/0x140 [nfsd]
[ 3195.925596] kthread+0x10a/0x120
[ 3195.926383] ? set_kthread_struct+0x40/0x40
[ 3195.927100] ret_from_fork+0x35/0x40

I then tried this patch:

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 0e212ac0fe44..5667fd15f157 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1573,6 +1573,8 @@ __rpc_restart_call(struct rpc_task *task, void (*action)(struct rpc_task *))
int
rpc_restart_call(struct rpc_task *task)
{
+ if (RPC_SIGNALLED(task))
+ return 0;
return __rpc_restart_call(task, call_start);
}
EXPORT_SYMBOL_GPL(rpc_restart_call);

and that seems to work.

-Scott
> >
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2021-12-01 19:35:54