Hi Linus,
The following changes since commit 53ab78cd6d5aba25575a7cfb95729336ba9497d8:
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux (2022-02-24 17:35:22 -0800)
are available in the Git repository at:
git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.18-1
for you to fetch changes up to 7c9d845f0612e5bcd23456a2ec43be8ac43458f1:
NFSv4/pNFS: Fix another issue with a list iterator pointing to the head (2022-03-28 08:36:34 -0400)
Please note that Stephen Rothwell has reported a conflict between
this branch and upstream:
On Tue, 15 Mar 2022 20:45:40 +1100 Stephen Rothwell <[email protected]> wrote:
>
> Today's linux-next merge of the folio tree got a conflict in:
>
> fs/nfs/file.c
>
> between commit:
>
> 8786fde8421c ("Convert NFS from readpages to readahead")
>
> from the nfs tree and commit:
>
> 821405cf3ebb ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
>
> from the folio tree.
The fix should be trivial.
Thanks,
Trond
----------------------------------------------------------------
NFS client updates for Linux 5.18
Highlights include:
Features:
- Switch NFS to use readahead instead of the obsolete readpages.
- Readdir fixes to improve cacheability of large directories when there
are multiple readers and writers.
- Readdir performance improvements when doing a seekdir() immediately
after opening the directory (common when re-exporting NFS).
- NFS swap improvements from Neil Brown.
- Loosen up memory allocation to permit direct reclaim and write back
in cases where there is no danger of deadlocking the writeback code or
NFS swap.
- Avoid sillyrename when the NFSv4 server claims to support the
necessary features to recover the unlinked but open file after reboot.
Bugfixes:
- Patch from Olga to add a mount option to control NFSv4.1 session
trunking discovery, and default it to being off.
- Fix a lockup in nfs_do_recoalesce().
- Two fixes for list iterator variables being used when pointing to the
list head.
- Fix a kernel memory scribble when reading from a non-socket transport
in /sys/kernel/sunrpc.
- Fix a race where reconnecting to a server could leave the TCP socket
stuck forever in the connecting state.
- Patch from Neil to fix a shutdown race which can leave the SUNRPC
transport timer primed after we free the struct xprt itself.
- Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
offload.
- Sunrpc patch from Olga to avoid resending a task on an offlined
transport.
Cleanups:
- Patches from Dave Wysochanski to clean up the fscache code
----------------------------------------------------------------
Alexey Khoroshilov (1):
NFS: remove unneeded check in decode_devicenotify_args()
Benjamin Coddington (1):
NFSv4: use unique client identifiers in network namespaces
Colin Ian King (1):
SUNRPC: remove redundant pointer plainhdr
Dave Wysochanski (4):
NFS: Cleanup usage of nfs_inode in fscache interface
NFS: Rename fscache read and write pages functions
NFS: Replace dfprintks with tracepoints in fscache read and write page functions
NFS: Remove remaining dfprintks related to fscache and remove NFSDBG_FSCACHE
Jakob Koschel (1):
NFS: replace usage of found with dedicated list iterator variable
Matthew Wilcox (Oracle) (1):
Convert NFS from readpages to readahead
NeilBrown (12):
NFS: remove IS_SWAPFILE hack
SUNRPC/call_alloc: async tasks mustn't block waiting for memory
SUNRPC/auth: async tasks mustn't block waiting for memory
SUNRPC/xprt: async tasks mustn't block waiting for memory
SUNRPC: remove scheduling boost for "SWAPPER" tasks.
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
NFSv4: keep state manager thread active if swap is enabled
NFS: swap IO handling is slightly different for O_DIRECT IO
NFS: swap-out must always use STABLE writes.
SUNRPC: change locking for xs_swap_enable/disable
SUNRPC: avoid race between mod_timer() and del_timer_sync()
Olga Kornievskaia (5):
NFSv4.1 support for NFS4_RESULT_PRESERVER_UNLINKED
NFSv4.1 restrict GETATTR fs_location query to the main transport
NFSv4.1 provide mount option to toggle trunking discovery
SUNRPC don't resend a task on an offlined transport
NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
Tom Rix (1):
NFS: simplify check for freeing cn_resp
Trond Myklebust (63):
NFSv4: Protect the state recovery thread against direct reclaim
NFS: Charge open/lock file contexts to kmemcg
NFSv4: Charge NFSv4 open state trackers to kmemcg
NFSv4.2: Fix up an invalid combination of memory allocation flags
NFS: Convert GFP_NOFS to GFP_KERNEL
NFSv4/flexfiles: Convert GFP_NOFS to GFP_KERNEL
NFSv4.2/copyoffload: Convert GFP_NOFS to GFP_KERNEL
SUNRPC: Convert GFP_NOFS to GFP_KERNEL
SUNRPC/auth_gss: Convert GFP_NOFS to GFP_KERNEL
SUNRPC/xprtrdma: Convert GFP_NOFS to GFP_KERNEL
NFS: Replace last uses of NFS_INO_REVAL_PAGECACHE
NFS: Remove unused flag NFS_INO_REVAL_PAGECACHE
NFS: NFSv2/v3 clients should never be setting NFS_CAP_XATTR
NFS: Remove unnecessary XATTR cache invalidation in nfs_fhget()
NFS: Clean up NFSv4.2 xattrs
NFS: Use of mapping_set_error() results in spurious errors
Revert "NFSv4: use unique client identifiers in network namespaces"
NFS: Return valid errors from nfs2/3_decode_dirent()
NFS: constify nfs_server_capable() and nfs_have_writebacks()
NFS: Trace lookup revalidation failure
NFS: Initialise the readdir verifier as best we can in nfs_opendir()
NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
NFS: Calculate page offsets algorithmically
NFS: Store the change attribute in the directory page cache
NFS: Don't re-read the entire page cache to find the next cookie
NFS: Don't advance the page pointer unless the page is full
NFS: Adjust the amount of readahead performed by NFS readdir
NFS: If the cookie verifier changes, we must invalidate the page cache
NFS: Simplify nfs_readdir_xdr_to_array()
NFS: Reduce use of uncached readdir
NFS: Improve heuristic for readdirplus
NFS: Don't ask for readdirplus unless it can help nfs_getattr()
NFSv4: Ask for a full XDR buffer of readdir goodness
NFS: Readdirplus can't help lookup for case insensitive filesystems
NFS: Don't request readdirplus when revalidation was forced
NFS: Add basic readdir tracing
NFS: Trace effects of readdirplus on the dcache
NFS: Trace effects of the readdirplus heuristic
NFS: Clean up page array initialisation/free
NFS: Convert readdir page cache to use a cookie based index
NFS: Fix up forced readdirplus
NFS: Optimise away the previous cookie field
NFS: Cache all entries in the readdirplus reply
NFS: Don't deadlock when cookie hashes collide
NFS: Fix revalidation of empty readdir pages
SUNRPC: Don't call connect() more than once on a TCP socket
SUNRPC: Only save the TCP source port after the connection is complete
SUNRPC: Fix socket waits for write buffer space
SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
SUNRPC: Improve accuracy of socket ENOBUFS determination
NFS: Fix memory allocation in rpc_malloc()
NFS: Fix memory allocation in rpc_alloc_task()
SUNRPC: Fix unx_lookup_cred() allocation
SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
NFS: nfsiod should not block forever in mempool_alloc()
NFS: Avoid writeback threads getting stuck in mempool_alloc()
NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
SUNRPC: Do not dereference non-socket transports in sysfs
SUNRPC: Don't return error values in sysfs read of closed files
NFS: Don't loop forever in nfs_do_recoalesce()
NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
Xin Xiong (1):
NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()
fs/nfs/Kconfig | 4 +
fs/nfs/callback_proc.c | 29 +-
fs/nfs/callback_xdr.c | 4 -
fs/nfs/client.c | 3 +-
fs/nfs/delegation.c | 2 +-
fs/nfs/dir.c | 626 +++++++++++++++++++-------------
fs/nfs/direct.c | 48 ++-
fs/nfs/file.c | 26 +-
fs/nfs/filelayout/filelayout.c | 2 +-
fs/nfs/flexfilelayout/flexfilelayout.c | 53 ++-
fs/nfs/fs_context.c | 8 +
fs/nfs/fscache.c | 53 +--
fs/nfs/fscache.h | 45 +--
fs/nfs/inode.c | 86 ++---
fs/nfs/internal.h | 25 +-
fs/nfs/nfs2xdr.c | 3 +-
fs/nfs/nfs3xdr.c | 30 +-
fs/nfs/nfs42proc.c | 34 +-
fs/nfs/nfs42xattr.c | 7 +-
fs/nfs/nfs4_fs.h | 1 +
fs/nfs/nfs4file.c | 8 +-
fs/nfs/nfs4proc.c | 62 +++-
fs/nfs/nfs4state.c | 59 ++-
fs/nfs/nfs4xdr.c | 7 +-
fs/nfs/nfstrace.h | 221 ++++++++++-
fs/nfs/pagelist.c | 11 +-
fs/nfs/pnfs.c | 50 +--
fs/nfs/pnfs.h | 2 +
fs/nfs/pnfs_nfs.c | 8 +-
fs/nfs/proc.c | 1 +
fs/nfs/read.c | 29 +-
fs/nfs/write.c | 43 ++-
include/linux/nfs_fs.h | 45 +--
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 5 +-
include/linux/sunrpc/auth.h | 1 +
include/linux/sunrpc/sched.h | 2 +-
include/linux/sunrpc/xprt.h | 3 +
include/linux/sunrpc/xprtsock.h | 3 +-
include/trace/events/sunrpc.h | 1 -
include/uapi/linux/nfs4.h | 1 +
include/uapi/linux/nfs_fs.h | 2 +-
net/sunrpc/auth.c | 8 +-
net/sunrpc/auth_gss/auth_gss.c | 26 +-
net/sunrpc/auth_gss/auth_gss_internal.h | 2 +-
net/sunrpc/auth_gss/gss_krb5_crypto.c | 10 +-
net/sunrpc/auth_gss/gss_krb5_seqnum.c | 4 +-
net/sunrpc/auth_gss/gss_krb5_wrap.c | 4 +-
net/sunrpc/auth_unix.c | 16 +-
net/sunrpc/backchannel_rqst.c | 8 +-
net/sunrpc/clnt.c | 13 +-
net/sunrpc/rpcb_clnt.c | 4 +-
net/sunrpc/sched.c | 56 ++-
net/sunrpc/socklib.c | 3 +-
net/sunrpc/sysfs.c | 76 ++--
net/sunrpc/xprt.c | 23 +-
net/sunrpc/xprtrdma/frwr_ops.c | 2 +-
net/sunrpc/xprtrdma/transport.c | 10 +-
net/sunrpc/xprtrdma/verbs.c | 4 +-
net/sunrpc/xprtsock.c | 207 ++++++-----
60 files changed, 1317 insertions(+), 813 deletions(-)
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
The pull request you sent on Tue, 29 Mar 2022 19:36:00 +0000:
> git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.18-1
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/965181d7ef7e1a863477536dc328c23a7ebc8a1d
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
On Tue, Mar 29, 2022 at 12:36 PM Trond Myklebust
<[email protected]> wrote:
>
> - Readdir fixes to improve cacheability of large directories when there
> are multiple readers and writers.
So I only took a look at this part now. I've obviously already pulled
it, but that use of 'xxhash()' just made me go "Whaaa?"
It's claimed that it's used because of its extreme performance, but
the performance numbers come from using it as a block hash.
That's not what nfs does.
The nfs code just does
xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
where that "cookie" is just a 64-bit entity. And then it masks off
everything but 18 bits.
Is that *really* appropriate use of a new hash function?
Why is this not just doing
#include <hash.h>
hash_64(cookie, 18);
which is a lot more obvious than xxhash().
If there really are some serious problems with the perfectly standard
hash() functionality, I think you should document them.
Because just randomly picking xxhash() without explaining _why_ you
can't just use the same simple thing we use elsewhere is very odd.
Or rather, when the only documentation is "performance", then I think
the regular "hash_64()" is the obvious and trivial choice.
Linus
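For reference, a minimal kernel-style sketch of the two alternatives being compared. The xxhash() expression and the hash_64(cookie, 18) call are the ones quoted in the mail above; the function names and the local NFS_READDIR_COOKIE_MASK definition are illustrative scaffolding, not the actual code in fs/nfs/dir.c:

#include <linux/hash.h>
#include <linux/types.h>
#include <linux/xxhash.h>

/* Illustrative: an 18-bit index, matching the masking described above. */
#define NFS_READDIR_COOKIE_MASK	((1UL << 18) - 1)

/* What the merged readdir code does: run the 64-bit cookie through
 * xxhash() and keep the low 18 bits of the result. */
static u32 cookie_index_xxhash(u64 cookie)
{
	return xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
}

/* Linus's suggestion: hash_64() already folds a 64-bit value down to the
 * requested number of bits (multiply by GOLDEN_RATIO_64, keep top bits). */
static u32 cookie_index_hash64(u64 cookie)
{
	return hash_64(cookie, 18);
}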
On Wed, Mar 30, 2022 at 3:22 PM Trond Myklebust <[email protected]> wrote:
>
> With 9175 ext4 offsets, I see 157 collisions (== hash buckets with > 1
> entry). So hash_64() does perform less well when you're hashing a value
> that is already a hash.
No collisions with xxhash? Because xxhash() really seems to do pretty
similar things in the end (multiply by a prime, shift bits down and
xor them).
In fact, the main difference seems to be that xxhash() will do a
"rotl()" by 27 before doing the prime multiplication, and then it will
finish the thing by a few more multiples mixed with shifting the high
bits down a few times.
Our regular fast hash doesn't do the "shift bits down", because it
relies on only using the upper bits anyway (and it is pretty heavily
geared towards "fast and good enough").
But if the *source* of the hash has a lot of low bits clear, I can
imagine that the "rotl" that xxhash does improves on the bit
distribution of the multiplication (which will only move bits
upwards).
And if it turns out our default hash has some bad cases, I'd prefer to
fix _that_ regardless..
Linus
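To make the structural comparison concrete, here is a standalone sketch of the two mixing strategies described above. The constants are the kernel's GOLDEN_RATIO_64 and xxh64's published primes, but the two fold functions are simplifications written for illustration, not the real hash_64() or xxh64() code:

#include <stdint.h>

static inline uint64_t rotl64(uint64_t v, int r)
{
	return (v << r) | (v >> (64 - r));
}

/* hash_64()-style fold: one multiply by the 64-bit golden-ratio prime,
 * then keep only the top 'bits' bits -- the multiply only moves bits
 * upward, so the low bits of the input never come back down. */
static uint64_t golden_ratio_fold(uint64_t v, unsigned int bits)
{
	return (v * 0x61C8864680B583EBULL) >> (64 - bits);
}

/* xxh64-style finish (simplified): rotate left by 27 before the prime
 * multiply, then an avalanche that repeatedly xors the high bits back
 * down before the low 'bits' bits are taken. */
static uint64_t xxh64_style_fold(uint64_t v, unsigned int bits)
{
	uint64_t h = rotl64(v, 27) * 0x9E3779B185EBCA87ULL;

	h ^= h >> 33; h *= 0xC2B2AE3D27D4EB4FULL;
	h ^= h >> 29; h *= 0x165667B19E3779F9ULL;
	h ^= h >> 32;
	return h & ((1ULL << bits) - 1);
}

In other words, golden_ratio_fold() depends entirely on taking the upper bits of the product, so an input whose entropy sits in awkward places is handled by the single multiply alone, while the xxh64-style finish spends a few extra operations pushing the high bits back down before the caller masks off the low 18.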
On Wed, 2022-03-30 at 16:01 -0400, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 11:17 -0700, Linus Torvalds wrote:
> > On Tue, Mar 29, 2022 at 12:36 PM Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > - Readdir fixes to improve cacheability of large directories when
> > > there
> > > are multiple readers and writers.
> >
> > So I only took a look at this part now. I've obviously already
> > pulled
> > it, but that use of 'xxhash()' just made me go "Whaaa?"
> >
> > It's claimed that it's used because of its extreme performance, but
> > the performance numbers come from using it as a block hash.
> >
> > That's not what nfs does.
> >
> > The nfs code just does
> >
> > xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
> >
> > where that "cookie" is just a 64-bit entity. And then it masks off
> > everything but 18 bits.
> >
> > Is that *really* appropriate use of a new hash function?
> >
> > Why is this not just doing
> >
> > #include <hash.h>
> >
> > hash_64(cookie, 18);
> >
> > which is a lot more obvious than xxhash().
> >
> > If there really are some serious problems with the perfectly
> > standard
> > hash() functionality, I think you should document them.
> >
> > Because just randomly picking xxhash() without explaining _why_ you
> > can't just use the same simple thing we use elsewhere is very odd.
> >
> > Or rather, when the only documentation is "performance", then I
> > think
> > the regular "hash_64()" is the obvious and trivial choice.
> >
> > Linus
>
> My main worry was that hash_64() would have too many collisions.
> Since
> this is using cuckoo nesting, that would be a problem.
>
> I did some quick studies after I got your email, and it seems as if
> my
> concerns were unfounded. I've tested both a linear index and a sample
> of ext4 getdents offsets.
> While the sample of ext4 offsets did show a larger number of
> collisions
> than a simple linear index, it wasn't too terrible (3 collisions in a
> sample of 9000 entries).
Actually, let me correct that.
With 9175 ext4 offsets, I see 157 collisions (== hash buckets with > 1
entry). So hash_64() does perform less well when you're hashing a value
that is already a hash.
> The linear index showed no collisions up to about 100000 entries.
This is unchanged, so with XFS and btrfs as the exported filesystems,
we should not have a problem.
>
> So, I'd be OK with changing to hash_64() and will queue up a bugfix
> for
> it. I should have done this study earlier...
>
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, 2022-03-30 at 11:17 -0700, Linus Torvalds wrote:
> On Tue, Mar 29, 2022 at 12:36 PM Trond Myklebust
> <[email protected]> wrote:
> >
> > - Readdir fixes to improve cacheability of large directories when
> > there
> > are multiple readers and writers.
>
> So I only took a look at this part now. I've obviously already pulled
> it, but that use of 'xxhash()' just made me go "Whaaa?"
>
> It's claimed that it's used because of its extreme performance, but
> the performance numbers come from using it as a block hash.
>
> That's not what nfs does.
>
> The nfs code just does
>
> xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
>
> where that "cookie" is just a 64-bit entity. And then it masks off
> everything but 18 bits.
>
> Is that *really* appropriate use of a new hash function?
>
> Why is this not just doing
>
> #include <hash.h>
>
> hash_64(cookie, 18);
>
> which is a lot more obvious than xxhash().
>
> If there really are some serious problems with the perfectly standard
> hash() functionality, I think you should document them.
>
> Because just randomly picking xxhash() without explaining _why_ you
> can't just use the same simple thing we use elsewhere is very odd.
>
> Or rather, when the only documentation is "performance", then I think
> the regular "hash_64()" is the obvious and trivial choice.
>
> Linus
My main worry was that hash_64() would have too many collisions. Since
this is using cuckoo nesting, that would be a problem.
I did some quick studies after I got your email, and it seems as if my
concerns were unfounded. I've tested both a linear index and a sample
of ext4 getdents offsets.
While the sample of ext4 offsets did show a larger number of collisions
than a simple linear index, it wasn't too terrible (3 collisions in a
sample of 9000 entries).
The linear index showed no collisions up to about 100000 entries.
So, I'd be OK with changing to hash_64() and will queue up a bugfix for
it. I should have done this study earlier...
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, 2022-03-30 at 15:45 -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2022 at 3:22 PM Trond Myklebust
> <[email protected]> wrote:
> >
> > With 9175 ext4 offsets, I see 157 collisions (== hash buckets with
> > > 1
> > entry). So hash_64() does perform less well when you're hashing a
> > value
> > that is already a hash.
>
> No collisions with xxhash? Because xxhash() really seems to do
> pretty
> similar things in the end (multiply by a prime, shift bits down and
> xor them).
>
> In fact, the main difference seems to be that xxhash() will do a
> "rotl()" by 27 before doing the prime multiplication, and then it
> will
> finish the thing by a few more multiples mixed with shifting the high
> bits down a few times.
>
> Our regular fast hash doesn't do the "shift bits down", because it
> relies on only using the upper bits anyway (and it is pretty heavily
> geared towards "fast and good enough").
>
> But if the *source* of the hash has a lot of low bits clear, I can
> imagine that the "rotl" that xxhash does improves on the bit
> distribution of the multiplication (which will only move bits
> upwards).
>
> And if it turns out our default hash has some bad cases, I'd prefer
> to
> fix _that_ regardless..
>
Hmm... No there doesn't appear to be a huge difference between the two.
With both test programs running on the same data set of ext4 getdents
offsets, I see the following.
With xxhash64 reduced to 18 bits, I see:
read 57654 entries
min = 0, max = 5, collisions = 5501, avg = 1
read 98978 entries
min = 0, max = 6, collisions = 14730, avg = 1
..and with hash_64() reduced to 18 bits:
read 57654 entries
min = 0, max = 4, collisions = 5538, avg = 1
read 98978 entries
min = 0, max = 5, collisions = 14623, avg = 1
So they both appear to be seeing similar collision rates with the same
data sets.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]