2024-02-15 20:40:37

by Perry, Daniel

Subject: Backporting NFSD filecache improvements to longterm maintenance kernel release

Hello,

Context: We manage a large number of Linux file servers running nfsd on LTS kernel 5.10, supporting high aggregate throughput (up to 10 GB/s) and reasonably high IOPS (up to ~1 million).

For some time now we have seen CPU lockups during nfsd laundromat cleanup/garbage collection/shutdown, and generally poor performance, when there is a 'large' number of open NFSv4 files. After some investigation, it's easy to find that these issues were addressed by the "Overhaul NFSD filecache" patch series (https://lore.kernel.org/linux-nfs/165730437087.28142.6731645688073512500.stgit@klimt.1015granger.net/), which is currently present in the 6.x kernels.

Another stability impact that I have not seen mentioned in other reports is the effect on nfs-server shutdown. The time it takes to shut down nfsd scales non-linearly with the number of open NFSv4 files. Once I pass the 75,000 open NFSv4 file mark I am practically guaranteed to trip the 'soft lockup' watchdog threshold during shutdown on my machine.
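
If it helps, here is a minimal sketch of the kind of load I use to reproduce this (illustrative only, not our actual tooling; the mount point and file count are assumptions): hold N files open from an NFSv4 client, then time 'systemctl stop nfs-server' on the server.

#!/usr/bin/env python3
# Hypothetical reproducer sketch: hold N files open over an NFSv4 mount so the
# server accumulates open state, then time nfsd shutdown on the server side.
# MOUNTPOINT and NUM_FILES are assumptions; adjust for your environment.
import os
import sys
import time

MOUNTPOINT = "/mnt/nfs"   # assumed NFSv4 client mount of the server under test
NUM_FILES = 100_000       # vary this to reproduce the scaling shown below

def hold_open(n):
    fds = []
    for i in range(n):
        path = os.path.join(MOUNTPOINT, f"open-state-{i:06d}")
        fds.append(os.open(path, os.O_CREAT | os.O_RDWR, 0o644))
    return fds

if __name__ == "__main__":
    count = int(sys.argv[1]) if len(sys.argv) > 1 else NUM_FILES
    fds = hold_open(count)
    print(f"{len(fds)} files held open; now time 'systemctl stop nfs-server' on the server.")
    time.sleep(3600)      # keep the opens (and the server's NFSv4 state) alive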

Shutdown Times:
# 50,000 Open Files: ~13 seconds
Feb 13 14:23:51 ip-198-19-24-243.ec2.internal systemd[1]: Stopping NFS server and services...
Feb 13 14:24:04 ip-198-19-24-243.ec2.internal systemd[1]: Stopped NFS server and services.
# 75,000 Open Files: 31 seconds
Feb 13 14:43:47 ip-198-19-24-243.ec2.internal systemd[1]: Stopping NFS server and services...
Feb 13 14:44:18 ip-198-19-24-243.ec2.internal systemd[1]: Stopped NFS server and services.
# 100,000 Open Files: 55 seconds
Feb 13 13:47:39 ip-198-19-24-243.ec2.internal systemd[1]: Stopping NFS server and services...
Feb 13 13:48:34 ip-198-19-24-243.ec2.internal systemd[1]: Stopped NFS server and services.
# 125,000 Open Files: 89 seconds
Feb 13 15:01:13 ip-198-19-24-243.ec2.internal systemd[1]: Stopping NFS server and services...
Feb 13 15:02:42 ip-198-19-24-243.ec2.internal systemd[1]: Stopped NFS server and services.
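
For what it's worth, these timings track much closer to quadratic than linear growth in the number of open files. A quick illustrative check against the numbers above (arithmetic only, using the 50,000-file run as the baseline):

# Compare the measured shutdown times above against linear and quadratic
# scaling in the number of open files.
measurements = [(50_000, 13), (75_000, 31), (100_000, 55), (125_000, 89)]

base_n, base_t = measurements[0]
for n, t in measurements:
    linear = base_t * (n / base_n)            # expected if shutdown were O(n)
    quadratic = base_t * (n / base_n) ** 2    # expected if shutdown were O(n^2)
    print(f"{n:>7} files: measured {t:>3}s, linear ~{linear:.0f}s, quadratic ~{quadratic:.0f}s")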

Cpu lockup message:
Feb 13 13:48:31 ip-198-19-24-243 kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nfsd:31932]

Stack trace during cpu lockup:
Feb 13 13:48:31 ip-198-19-24-243 kernel: RIP: 0010:__list_lru_walk_one+0xa8/0x150
Feb 13 13:48:31 ip-198-19-24-243 kernel: Code: 85 c0 0f 84 b0 00 00 00 41 8b 04 24 85 c0 0f 84 b5 00 00 00 48 83 44 24 10 01 49 83 6c 24 28 01 eb a5 83 f8 03 75 2f 48 8b 03 <49> 89 dd 49 39 df 74 0c 48 89 c3 48 8b 45 00 48 85 c0 75 9e 48 8b
Feb 13 13:48:31 ip-198-19-24-243 kernel: RSP: 0018:ffffb76c46e7fbf8 EFLAGS: 00000246
Feb 13 13:48:31 ip-198-19-24-243 kernel: RAX: ffff98f4e7e8aef0 RBX: ffff98f4e7b52c50 RCX: ffffb76c46e7fc98
Feb 13 13:48:31 ip-198-19-24-243 kernel: RDX: ffff98f47cc5ff40 RSI: ffff98f47cc5ff48 RDI: ffff98f47cc5ff48
Feb 13 13:48:31 ip-198-19-24-243 kernel: RBP: ffffb76c46e7fc90 R08: ffff98f4e7b527f0 R09: ffffb76c46e7fc98
Feb 13 13:48:31 ip-198-19-24-243 kernel: R10: 0000000000000001 R11: 0000000000038400 R12: ffff98f47cc5ff40
Feb 13 13:48:31 ip-198-19-24-243 kernel: R13: ffff98f4e7b527f0 R14: ffffb76c46e7fc98 R15: ffff98f47cc5ff48
Feb 13 13:48:31 ip-198-19-24-243 kernel: FS: 0000000000000000(0000) GS:ffff98f52ee00000(0000) knlGS:0000000000000000
Feb 13 13:48:31 ip-198-19-24-243 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 13 13:48:31 ip-198-19-24-243 kernel: CR2: 00007fddf92fafcc CR3: 000000010b34c004 CR4: 00000000007706f0
Feb 13 13:48:31 ip-198-19-24-243 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 13 13:48:31 ip-198-19-24-243 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 13 13:48:31 ip-198-19-24-243 kernel: PKRU: 55555554
Feb 13 13:48:31 ip-198-19-24-243 kernel: Call Trace:
Feb 13 13:48:31 ip-198-19-24-243 kernel: <IRQ>
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? show_trace_log_lvl+0x1c1/0x2d9
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? show_trace_log_lvl+0x1c1/0x2d9
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? list_lru_walk_node+0x56/0xe0
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? lockup_detector_update_enable+0x50/0x50
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? watchdog_timer_fn+0x1bb/0x210
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? __run_hrtimer+0x5c/0x190
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? __hrtimer_run_queues+0x86/0xe0
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? hrtimer_interrupt+0x110/0x2c0
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? __sysvec_apic_timer_interrupt+0x5c/0xe0
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? asm_call_irq_on_stack+0xf/0x20
Feb 13 13:48:31 ip-198-19-24-243 kernel: </IRQ>
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? sysvec_apic_timer_interrupt+0x72/0x80
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? __list_lru_walk_one+0xa8/0x150
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? __list_lru_walk_one+0x74/0x150
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? nfsd_file_lru_count+0xa0/0xa0 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? nfsd_file_lru_count+0xa0/0xa0 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: list_lru_walk_node+0x56/0xe0
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfsd_file_lru_walk_list+0x168/0x190 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: release_all_access+0x6a/0x80 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfs4_free_ol_stateid+0x22/0x60 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: free_ol_stateid_reaplist+0x59/0xa0 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: release_openowner+0x178/0x1b0 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: __destroy_client+0x157/0x230 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfs4_state_destroy_net+0x82/0x190 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfs4_state_shutdown_net+0x129/0x160 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfsd_last_thread+0x102/0x130 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfsd_destroy+0x3c/0x60 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: nfsd+0x126/0x140 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
Feb 13 13:48:31 ip-198-19-24-243 kernel: kthread+0x118/0x140
Feb 13 13:48:31 ip-198-19-24-243 kernel: ? __kthread_bind_mask+0x60/0x60
Feb 13 13:48:31 ip-198-19-24-243 kernel: ret_from_fork+0x1f/0x30
***

Before we spend more time investigating, I thought I'd first ask whether the maintainers would be open to reviewing a set of patches that backport the NFSD filecache improvements to LTS kernel 5.10. From my perspective, these patches are core to nfsd being performant and stable with NFSv4. The changes in the original patch series are large, but from what I can tell they have been relatively bug-free since being merged into mainline.

I believe we would not be the only ones to benefit if these changes were backported to a 5.x LTS kernel. It appears others have attempted to backport some of these changes to their own 5.x kernels (see https://marc.info/?l=linux-kernel&m=167286008910652&w=2 and https://marc.info/?l=linux-nfs&m=169269659416487&w=2). Both submissions indicate that issues were encountered after the backport; the latter mentions that a later patch (https://marc.info/?l=linux-nfs&m=167293078213110&w=2) resolved them. However, I'm unsure whether that later patch is needed, since LTS kernel 6.1 still does not carry that commit. These two examples give us some hesitation about backporting these changes without assistance/guidance.

Also, an obligatory thank you to Chuck Lever and the others who implemented these filecache improvements in the first place.

Regards,
Daniel Perry



2024-02-15 22:41:18

by Chuck Lever

Subject: Re: Backporting NFSD filecache improvements to longterm maintenance kernel release

On Thu, Feb 15, 2024 at 08:40:06PM +0000, Perry, Daniel wrote:
> Before we spend more time investigating, I thought I'd first ask whether the maintainers would be open to reviewing a set of patches that backport the NFSD filecache improvements to LTS kernel 5.10. From my perspective, these patches are core to nfsd being performant and stable with NFSv4. The changes in the original patch series are large, but from what I can tell they have been relatively bug-free since being merged into mainline.
>
> I believe we would not be the only ones to benefit if these changes were backported to a 5.x LTS kernel. It appears others have attempted to backport some of these changes to their own 5.x kernels (see https://marc.info/?l=linux-kernel&m=167286008910652&w=2 and https://marc.info/?l=linux-nfs&m=169269659416487&w=2). Both submissions indicate that issues were encountered after the backport; the latter mentions that a later patch (https://marc.info/?l=linux-nfs&m=167293078213110&w=2) resolved them. However, I'm unsure whether that later patch is needed, since LTS kernel 6.1 still does not carry that commit. These two examples give us some hesitation about backporting these changes without assistance/guidance.

We (Oracle) have been discussing this internally as well.

I'm not a big fan of backporting large patch series. Generally, if a
stable kernel is not working for you, the best course of action is
for you to upgrade. But I know this is not always feasible.

In this case Jeff and I never found an adequate reproducer, so we
can't nail down exactly where in the series the problem was finally
resolved. And I think the community would be better off if we had an
upstream-tested backport rather than every distribution rolling
their own.

Further, the upstream community now has more standardized CI that
covers not just the upstream kernel but also the 5.x stable
kernels.

And, I now have some branches in my kernel.org repo where we can
collect patches specific to each stable kernel, to organize the
testing and review process before we send pull requests to Greg
and Sasha.

(Perhaps) the bad news is I would like to see the performance and
stability issues addressed for all stable kernels between 5.4,
where the filecache was introduced, and 6.1, the kernel release
just before things stabilized again. Maybe 5.4 is not practical?
But I think fixing only 5.10.y is not good enough.

As long as the community, and especially the author of these
patches, is involved I think we can make this happen. Can we start
with v6.1.y, which should be simpler? Do you have testing or CI in
place to tell when nfsd is working satisfactorily?


--
Chuck Lever

2024-02-15 23:04:27

by Chuck Lever

Subject: Re: Backporting NFSD filecache improvements to longterm maintenance kernel release

On Thu, Feb 15, 2024 at 05:33:36PM -0500, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 08:40:06PM +0000, Perry, Daniel wrote:
> > Before we spend more time investigating, I thought I'd first ask whether the maintainers would be open to reviewing a set of patches that backport the NFSD filecache improvements to LTS kernel 5.10. From my perspective, these patches are core to nfsd being performant and stable with NFSv4. The changes in the original patch series are large, but from what I can tell they have been relatively bug-free since being merged into mainline.
> >
> > I believe we would not be the only ones to benefit if these changes were backported to a 5.x LTS kernel. It appears others have attempted to backport some of these changes to their own 5.x kernels (see https://marc.info/?l=linux-kernel&m=167286008910652&w=2 and https://marc.info/?l=linux-nfs&m=169269659416487&w=2). Both submissions indicate that issues were encountered after the backport; the latter mentions that a later patch (https://marc.info/?l=linux-nfs&m=167293078213110&w=2) resolved them. However, I'm unsure whether that later patch is needed, since LTS kernel 6.1 still does not carry that commit. These two examples give us some hesitation about backporting these changes without assistance/guidance.
>
> We (Oracle) have been discussing this internally as well.
>
> I'm not a big fan of backporting large patch series. Generally, if a
> stable kernel is not working for you, the best course of action is
> for you to upgrade. But I know this is not always feasible.
>
> In this case Jeff and I never found an adequate reproducer, so we
> can't nail down exactly where in the series the problem was finally
> resolved. And I think the community would be better off if we had an
> upstream-tested backport rather than every distribution rolling
> their own.
>
> Further, the upstream community now has more standardized CI that
> covers not just the upstream kernel but also the 5.x stable
> kernels.
>
> And, I now have some branches in my kernel.org repo where we can
> collect patches specific to each stable kernel, to organize the
> testing and review process before we send pull requests to Greg
> and Sasha.
>
> (Perhaps) the bad news is I would like to see the performance and
> stability issues addressed for all stable kernels between 5.4,
> where the filecache was introduced, and 6.1, the kernel release
> just before things stabilized again. Maybe 5.4 is not practical?
> But I think fixing only 5.10.y is not good enough.
>
> As long as the community, and especially the author of these
> patches, is involved I think we can make this happen. Can we start
> with v6.1.y, which should be simpler? Do you have testing or CI in
> place to tell when nfsd is working satisfactorily?

Looking at 6.1.y's filecache.c, it already has most or all of the
fixes it should have. The only thing we might want to do there is
confirm that with some testing.


--
Chuck Lever