2024-03-30 15:26:34

by Jan Schunk

Subject: Re: Re: [External] : nfsd: memory leak when client does many file operations

Full test result:

$ git bisect start v6.6 v6.5
Bisecting: 7882 revisions left to test after this (roughly 13 steps)
[a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
--
$ git bisect good
Bisecting: 3935 revisions left to test after this (roughly 12 steps)
[e4f1b8202fb59c56a3de7642d50326923670513f] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
--
$ git bisect bad
Bisecting: 2014 revisions left to test after this (roughly 11 steps)
[e0152e7481c6c63764d6ea8ee41af5cf9dfac5e9] Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
--
$ git bisect bad
Bisecting: 975 revisions left to test after this (roughly 10 steps)
[4a3b1007eeb26b2bb7ae4d734cc8577463325165] Merge tag 'pinctrl-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
--
$ git bisect good
Bisecting: 476 revisions left to test after this (roughly 9 steps)
[4debf77169ee459c46ec70e13dc503bc25efd7d2] Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
--
$ git bisect good
Bisecting: 237 revisions left to test after this (roughly 8 steps)
[e7e9423db459423d3dcb367217553ad9ededadc9] Merge tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
--
$ git bisect good
Bisecting: 141 revisions left to test after this (roughly 7 steps)
[8ae5d298ef2005da5454fc1680f983e85d3e1622] Merge tag '6.6-rc-ksmbd-fixes-part1' of git://git.samba.org/ksmbd
--
$ git bisect good
Bisecting: 61 revisions left to test after this (roughly 6 steps)
[99d99825fc075fd24b60cc9cf0fb1e20b9c16b0f] Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
--
$ git bisect bad
Bisecting: 39 revisions left to test after this (roughly 5 steps)
[7b719e2bf342a59e88b2b6215b98ca4cf824bc58] SUNRPC: change svc_recv() to return void.
--
$ git bisect bad
Bisecting: 19 revisions left to test after this (roughly 4 steps)
[e7421ce71437ec8e4d69cc6bdf35b6853adc5050] NFSD: Rename struct svc_cacherep
--
$ git bisect good
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[baabf59c24145612e4a975f459a5024389f13f5d] SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
--
$ git bisect bad
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[be2be5f7f4436442d8f6bffbb97a6f438df2896b] lockd: nlm_blocked list race fixes
--
$ git bisect good
Bisecting: 2 revisions left to test after this (roughly 1 step)
[d424797032c6e24b44037e6c7a2d32fd958300f0] nfsd: inherit required unset default acls from effective set
--
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4] SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
--
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[2eb2b93581813b74c7174961126f6ec38eadb5a7] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
--
$ git bisect good
e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4 is the first bad commit
commit e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4
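
For anyone who wants to re-check this result, the transcript above can be turned into a log for `git bisect replay`. This is only a minimal sketch: it assumes the transcript was saved to a text file in exactly the format shown above, and the function name `transcript_to_replay` is hypothetical, not part of git.

```shell
#!/bin/sh
# Convert a bisect transcript (format as pasted above) into a replay log.
# Each "$ git bisect good|bad" verdict applies to the commit whose
# "[<sha>] <subject>" line was printed just before it.
transcript_to_replay() {
    awk '
        # Keep the start command, minus the "$ " prompt.
        /^\$ git bisect start/ { sub(/^\$ /, ""); print; next }
        # Remember the commit that is currently checked out for testing.
        /^\[[0-9a-f]+\]/ { sha = substr($1, 2, length($1) - 2); next }
        # Attach the verdict to the remembered commit.
        /^\$ git bisect (good|bad)$/ { if (sha != "") print "git bisect " $4 " " sha }
    ' "$1"
}
```

Usage would be something like `transcript_to_replay bisect.txt > replay.log` followed by `git bisect replay replay.log` inside a kernel tree (both file names are placeholders).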

In /proc/meminfo, the memory loss shows up only in MemAvailable.
MemTotal: 346948 kB
On a bad test run it looks like this:
-MemAvailable: 210820 kB
+MemAvailable: 26608 kB
On a good test run it looks like this:
-MemAvailable: 215872 kB
+MemAvailable: 221128 kB
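
The comparison above can be scripted roughly as follows. This is a sketch, not the exact procedure used for the numbers above: the helper names (`mem_available_kb`, `delta_kb`, `delta_percent`) are made up for illustration, and the optional file argument exists only so the parser can be tried on a saved copy of /proc/meminfo.

```shell
#!/bin/sh
# Print the MemAvailable value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2; exit }' "${1:-/proc/meminfo}"
}

# Print the signed change in kB between two samples (after - before).
delta_kb() {
    echo $(( $2 - $1 ))
}

# Print the change as a percentage of MemTotal, with one decimal place.
# Arguments: before_kB after_kB memtotal_kB
delta_percent() {
    awk -v b="$1" -v a="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", (a - b) * 100 / t }'
}
```

With the values above, `delta_kb 210820 26608` gives -184212 for the bad run and `delta_kb 215872 221128` gives 5256 for the good run. In other words, the bad run loses a bit more than half of MemTotal, while the good run stays roughly flat.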


> Sent: Friday, 29.03.2024 at 01:25
> From: "Chuck Lever III" <[email protected]>
> To: "Jan Schunk" <[email protected]>, "Benjamin Coddington" <[email protected]>
> Cc: "Jeff Layton" <[email protected]>, "Neil Brown" <[email protected]>, "Olga Kornievskaia" <[email protected]>, "Dai Ngo" <[email protected]>, "Tom Talpey" <[email protected]>, "Linux NFS Mailing List" <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: [External] : nfsd: memory leak when client does many file operations
>
>
>
> > On Mar 28, 2024, at 6:03 PM, Jan Schunk <[email protected]> wrote:
> >
> > Inside the VM I was not able to reproduce the issue on v6.5.x so I keep concentrating on v6.6.x.
> >
> > Current status:
> >
> > $ git bisect start v6.6 v6.5
> > Bisecting: 7882 revisions left to test after this (roughly 13 steps)
> > [a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
> >
> > --
> > $ git bisect good
> > Bisecting: 3935 revisions left to test after this (roughly 12 steps)
> > [e4f1b8202fb59c56a3de7642d50326923670513f] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
> >
> > --
> > $ git bisect bad
> > Bisecting: 2014 revisions left to test after this (roughly 11 steps)
> > [e0152e7481c6c63764d6ea8ee41af5cf9dfac5e9] Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
> >
> > --
> > $ git bisect bad
> > Bisecting: 975 revisions left to test after this (roughly 10 steps)
> > [4a3b1007eeb26b2bb7ae4d734cc8577463325165] Merge tag 'pinctrl-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> >
> > --
> > $ git bisect good
> > Bisecting: 476 revisions left to test after this (roughly 9 steps)
> > [4debf77169ee459c46ec70e13dc503bc25efd7d2] Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
> >
> > --
> > $ git bisect good
> > Bisecting: 237 revisions left to test after this (roughly 8 steps)
> > [e7e9423db459423d3dcb367217553ad9ededadc9] Merge tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
>
> Good, keep going.
>
> I've tried replicating the free memory loss here, using the
> git regression suite on my nfsd-fixes branch. Taking a
> meminfo sample between each of four test runs, the only
> clear downward trend I see is:
>
> free:3019839 < start
> free:2858438 < after first run
> free:2836058 < after second run
> free:2822077 < after third run
> free:2797143 < after fourth run
>
> All other metrics seem to vary arbitrarily.
>
> The only slightly suspicious slab I see is buffer_head.
> /sys/kernel/debug/kmemleak has a single entry in it, not
> related to NFSD.
>
> At this point I'm kind of suspecting that the issue will
> not be related to NFSD or SUNRPC or any particular slab
> cache, but will be orphaned whole pages. Your bisect
> still seems like the best shot at localizing the
> misbehavior.
>
>
> --
> Chuck Lever
>
>


2024-03-30 16:27:25

by Chuck Lever

Subject: Re: Re: [External] : nfsd: memory leak when client does many file operations

On Sat, Mar 30, 2024 at 04:26:09PM +0100, Jan Schunk wrote:
> Full test result:
>
> $ git bisect start v6.6 v6.5
> Bisecting: 7882 revisions left to test after this (roughly 13 steps)
> [a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
> --
> $ git bisect good
> Bisecting: 3935 revisions left to test after this (roughly 12 steps)
> [e4f1b8202fb59c56a3de7642d50326923670513f] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
> --
> $ git bisect bad
> Bisecting: 2014 revisions left to test after this (roughly 11 steps)
> [e0152e7481c6c63764d6ea8ee41af5cf9dfac5e9] Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
> --
> $ git bisect bad
> Bisecting: 975 revisions left to test after this (roughly 10 steps)
> [4a3b1007eeb26b2bb7ae4d734cc8577463325165] Merge tag 'pinctrl-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> --
> $ git bisect good
> Bisecting: 476 revisions left to test after this (roughly 9 steps)
> [4debf77169ee459c46ec70e13dc503bc25efd7d2] Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
> --
> $ git bisect good
> Bisecting: 237 revisions left to test after this (roughly 8 steps)
> [e7e9423db459423d3dcb367217553ad9ededadc9] Merge tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
> --
> $ git bisect good
> Bisecting: 141 revisions left to test after this (roughly 7 steps)
> [8ae5d298ef2005da5454fc1680f983e85d3e1622] Merge tag '6.6-rc-ksmbd-fixes-part1' of git://git.samba.org/ksmbd
> --
> $ git bisect good
> Bisecting: 61 revisions left to test after this (roughly 6 steps)
> [99d99825fc075fd24b60cc9cf0fb1e20b9c16b0f] Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
> --
> $ git bisect bad
> Bisecting: 39 revisions left to test after this (roughly 5 steps)
> [7b719e2bf342a59e88b2b6215b98ca4cf824bc58] SUNRPC: change svc_recv() to return void.
> --
> $ git bisect bad
> Bisecting: 19 revisions left to test after this (roughly 4 steps)
> [e7421ce71437ec8e4d69cc6bdf35b6853adc5050] NFSD: Rename struct svc_cacherep
> --
> $ git bisect good
> Bisecting: 9 revisions left to test after this (roughly 3 steps)
> [baabf59c24145612e4a975f459a5024389f13f5d] SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
> --
> $ git bisect bad
> Bisecting: 4 revisions left to test after this (roughly 2 steps)
> [be2be5f7f4436442d8f6bffbb97a6f438df2896b] lockd: nlm_blocked list race fixes
> --
> $ git bisect good
> Bisecting: 2 revisions left to test after this (roughly 1 step)
> [d424797032c6e24b44037e6c7a2d32fd958300f0] nfsd: inherit required unset default acls from effective set
> --
> $ git bisect good
> Bisecting: 0 revisions left to test after this (roughly 1 step)
> [e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4] SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
> --
> $ git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [2eb2b93581813b74c7174961126f6ec38eadb5a7] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
> --
> $ git bisect good
> e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4 is the first bad commit
> commit e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4

This is a plausible bisect result for this behavior, so nice work.

David (cc'd), can you have a brief look at this? What did we miss?
I'm guessing it's a page reference count issue that might occur
only when the XDR head and tail buffers are in the same page. Or
it might occur if two entries in the XDR page array point to the
same page...?

/me stabs in the darkness


> In /proc/meminfo, the memory loss shows up only in MemAvailable.
> MemTotal: 346948 kB
> On a bad test run it looks like this:
> -MemAvailable: 210820 kB
> +MemAvailable: 26608 kB
> On a good test run it looks like this:
> -MemAvailable: 215872 kB
> +MemAvailable: 221128 kB