FYI, modern kernels (anything newer than 5.10 LTS, up to and excluding
bleeding-edge mainline kernels) are looping forever in a livelock or
deadlock when running generic/476 on NFS, both in a loopback and
external export configuration. This *may* be an ENOSPC related issue.
See the referenced discussion on [email protected] for more
details.
- Ted
Hi Ted-
It's not clear from your report whether the kernel range applies
to the client's kernel or the server's kernel (in the non-loopback
case).
Since a scratch device is involved, I suspect the livelock might
be due to a problem with the NFSD filecache code introduced on or
about v5.10. There are patches pending in the NFSD for-next branch
that should address this issue. Is there a way that your tester
can try these out to confirm?
> On Jul 21, 2022, at 10:50 AM, Theodore Ts'o <[email protected]> wrote:
>
> FYI, modern kernels (anything newer than 5.10 LTS, up to and excluding
> bleeding-edge mainline kernels) are looping forever in a livelock or
> deadlock when running generic/476 on NFS, both in a loopback and
> external export configuration. This *may* be an ENOSPC related issue.
>
> See the referenced discussion on [email protected] for more
> details.
>
> - Ted
>
>
> From: "Theodore Ts'o" <[email protected]>
> Subject: Re: [PATCH v1] generic/476: requires 27GB scratch size
> Date: July 21, 2022 at 10:03:45 AM EDT
> To: Boyang Xue <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>, [email protected]
>
>
> Following up, using NFS loopback with a 5GB scratch device on a Google
> Compute Engine VM, generic/476 passes using a 4.14 LTS, 4.19 LTS, and
> 5.4 LTS kernel. So this looks like it's a regression which is in 5.10
> LTS and newer kernels, and so instead of patching it out of the test,
> I think the right thing to do is to add it to a kernel
> version-specific exclude file and then file a bug with the NFS
> folks.
>
> KERNEL: kernel 4.14.284-xfstests #8 SMP Tue Jul 5 08:21:37 EDT 2022 x86_64
> CMDLINE: -c nfs/default generic/476
> CPUS: 2
> MEM: 7680
>
> nfs/loopback: 1 tests, 597 seconds
> generic/476 Pass 595s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 595s
>
> ---
> KERNEL: kernel 4.19.248-xfstests #4 SMP Sat Jun 25 10:43:45 EDT 2022 x86_64
> CMDLINE: -c nfs/default generic/476
> CPUS: 2
> MEM: 7680
>
> nfs/loopback: 1 tests, 407 seconds
> generic/476 Pass 407s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 407s
>
> ----
> KERNEL: kernel 5.4.199-xfstests #21 SMP Sun Jul 3 12:15:15 EDT 2022 x86_64
> CMDLINE: -c nfs/default generic/476
> CPUS: 2
> MEM: 7680
>
> nfs/loopback: 1 tests, 404 seconds
> generic/476 Pass 404s
> Totals: 1 tests, 0 skipped, 0 failures, 0 errors, 404s
>
>
> See below for what I'm checking into xfstests-bld for
> {kvm,gce}-xfstests. I don't believe we should be changing xfstests's
> generic/476, since it *does* pass with a smaller scratch device on
> older kernels, and presumably, RHEL customers would be cranky if this
> issue caused their production systems to lock up, and so it
> should be considered a kernel bug as opposed to a test bug.
>
> - Ted
>
>
> commit 4a33b6721d5db9c07f295a10a8ad65d2a0021406
> Author: Theodore Ts'o <[email protected]>
> Date: Thu Jul 21 09:54:50 2022 -0400
>
> test-appliance: add an nfs test exclusion for kernels newer than 5.4
>
> This is apparently an NFS bug which is visible in 5.10 LTS and newer
> kernels, and likely appeared sometime after 5.4. Since it causes the
> test VM to spin forever (or at least for days), let's exclude it for
> now.
>
> Link: https://lore.kernel.org/all/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@mail.gmail.com/
> Signed-off-by: Theodore Ts'o <[email protected]>
>
> diff --git a/test-appliance/files/root/fs/nfs/exclude b/test-appliance/files/root/fs/nfs/exclude
> index 184750fb..ef4b19bc 100644
> --- a/test-appliance/files/root/fs/nfs/exclude
> +++ b/test-appliance/files/root/fs/nfs/exclude
> @@ -10,3 +10,14 @@ generic/477
> // failing in the expected output of the linux-nfs Wiki page. So we'll
> // suppress this failure for now.
> generic/294
> +
> +#if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)
> +// There appears to be a regression that shows up sometime after 5.4.
> +// LTS kernels for 4.14, 4.19, and 5.4 will terminate successfully,
> +// but newer kernels will spin forever in some kind of deadlock or livelock
> +// This apparently does not happen if the scratch device is > 27GB, so it
> +// may be some kind of ENOSPC-related bug.
> +// For more information see the e-mail thread starting at:
> +// https://lore.kernel.org/r/CAHLe9YaAVyBmmM8T27dudvoeAxbJ_JMQmkz7tdM1ZLnpeQW4UQ@mail.gmail.com/
> +generic/476
> +#endif
>
>
--
Chuck Lever
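
For readers following the quoted exclude-file change, here is a rough
sketch of how a guard like
"#if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)" evaluates for the
kernels named in the test logs. The packing is the one used by the
kernel's KERNEL_VERSION macro (recent headers also clamp the sublevel to
255). Everything else is an assumption made for illustration: the helper
names release_to_code and guard_matches are hypothetical, the "5.10.0"
release string is a made-up example, and the thread does not show how
xfstests-bld derives LINUX_VERSION_CODE, in particular whether the
sublevel is included. The include_sublevel switch exists only to show
that this choice matters.

```python
import re


def kernel_version(major, minor, patch):
    # Same packing as the kernel's KERNEL_VERSION(a, b, c) macro. Recent
    # kernel headers clamp the sublevel to 255; that is mirrored here.
    return (major << 16) + (minor << 8) + min(patch, 255)


def release_to_code(release, include_sublevel=False):
    # Hypothetical helper: accepts release strings such as
    # "4.14.284-xfstests", as seen on the KERNEL: lines of the logs.
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    # Assumption: whether the real tooling keeps the sublevel is unknown.
    patch = int(m.group(3) or 0) if include_sublevel else 0
    return kernel_version(major, minor, patch)


def guard_matches(release, include_sublevel=False):
    # Models: #if LINUX_VERSION_CODE > KERNEL_VERSION(5,4,0)
    return release_to_code(release, include_sublevel) > kernel_version(5, 4, 0)


if __name__ == "__main__":
    for rel in ("4.14.284-xfstests", "4.19.248-xfstests",
                "5.4.199-xfstests", "5.10.0"):
        print(rel, guard_matches(rel))
```

With the sublevel ignored, the guard is false for the 4.14, 4.19, and
5.4 releases and true for 5.10, which matches the intent stated in the
commit message. If the sublevel were included, 5.4.199 would compare
greater than KERNEL_VERSION(5,4,0), so generic/476 would be excluded on
that kernel as well.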