Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6A734C61DA3 for ; Thu, 26 Jan 2023 20:36:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229486AbjAZUg6 (ORCPT ); Thu, 26 Jan 2023 15:36:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45066 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232176AbjAZUg4 (ORCPT ); Thu, 26 Jan 2023 15:36:56 -0500 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BB874728E5 for ; Thu, 26 Jan 2023 12:36:50 -0800 (PST) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 3A0CC2001F; Thu, 26 Jan 2023 20:36:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1674765409; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CpmG2hokCZTHLBaVXWXG+/hz+JzpFTO1uFR56DSyjns=; b=HjT8VaMr9CbxWMRVyZde9R6e9LnUUVbMRLLaAA6apTTQ1b3FIv2UXSiHKO0VpH8lSqJ/Po 0gaQL3Pj1dT/A+hikz5uDQHulqLbF1ChEcQhzzs0vnBc6waruykLu8pCHAptxtXw9Qx7nE O/baaipnK+IKDf1Yotp+MMYhL6jlym0= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1674765409; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CpmG2hokCZTHLBaVXWXG+/hz+JzpFTO1uFR56DSyjns=; b=LKp+C9ERmdp2a8LMpROzHDKYxJa/ejTju6MJ9tLk0/W+PlXbNsax7Xf3tRvjzE5V9L5NTr xJa0tEbzLNSh0BAQ== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 5EFEC13A09; Thu, 26 Jan 2023 20:36:47 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id VIpjAV/k0mOBPgAAMHmgww (envelope-from ); Thu, 26 Jan 2023 20:36:47 +0000 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 From: "NeilBrown" To: "Jeff Layton" Cc: "Trond Myklebust" , "Linux NFS Mailing List" Subject: Re: [PATCH/RFC] NFSv3 handle out-of-order write replies In-reply-to: References: <167461221711.23017.1840413589310764555@noble.neil.brown.name>, Date: Fri, 27 Jan 2023 07:36:41 +1100 Message-id: <167476540120.23017.11089540386030477339@noble.neil.brown.name> Precedence: bulk List-ID: X-Mailing-List: linux-nfs@vger.kernel.org On Thu, 26 Jan 2023, Jeff Layton wrote: > On Wed, 2023-01-25 at 13:03 +1100, NeilBrown wrote: > > NFSv3 includes pre/post wcc attributes which allow the client to > > determine if all changes to the file have been made by itself, or any > > might have been made by some other client. IF there are gaps in the > > pre/post ctime sequence it must be assumed that some other client caused > > that change in that gap and the local cache must be suspect. The next > > time the file is opened, the cache should be invalidated. > >=20 > > Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for > > close-to-open consistency violations") (I think) in linux 5.3 the Linux > > client has been triggering this invalidation. The chunk in > > nfs_update_inode() in particularly triggers. > >=20 > > Unfortunately Linux NFS assumes that all replies will be processed in > > the order sent, and will arrive in the order processed. This is not > > true in general. Linux might ignore the wcc info in a WRITE reply - > > even though the pre-ctime might match the current iversion - because the > > reply is for a request that was sent before some other request for which > > a reply has already been seen. This is detected by Linux using the > > gencount tests in nfs_inode_attr_cmp(). > >=20 > > Also, when the gencount tests pass it is still possible that the > > requests were processed on the service in a different order, and a gap > > in the ctime sequence might be filled in by a subsequent reply. Linux > > NFS does not try to detect this. > >=20 > > The net result is that writing to a server and then reading the file > > back can result in going to the server for the read rather than serving > > it from cache - all because a couple of replies arrived out-of-order. > > This is a performance regression over kernels before 5.3 (though that > > change is a correctness improvement). > >=20 > > This has been seen with Linux writing to a Netapp server which can > > occasionally re-order requests. I have also demonstrated it with a > > modified Linux server which adds pre/post attributes to V3 write > > replies, and occasionally adds a delay to force reply reordering. > >=20 > > This patch attempts to address the problem by replacing the > > NFS_INO_DATA_INVAL_DEFER flag with a small array of segments in the > > ctime sequence which have not yet been seen in replies. Rather than > > setting the flag, we add the missing segment to the array. Rather > > than testing the flag, we check if the array is empty. > >=20 > > The array contains A,B pairs when the local iversion is advanced from A > > to B, and B,A pairs when a reply tells us that the ctime (aka the > > change-id) has advanced from A to B on the server. > > Overlapping segments are merged and inverted segments cancel (which > > effectively comes to the same thing). Empty segments are removed. > >=20 > > An expected reply (probably the common case) will have the server say > > that ctime went from A to B, and Linux will update the iversion from A > > to B. Rather than adding A,B and the B,A which will cancel out, we > > simply don't bother. > >=20 >=20 > So your cache remains intact as long as you're not checking validity > before all of the replies come in. Exactly - and we primarily check validity at 'open' (or lock). >=20 > Clever! I like this idea.=20 Thanks :-) >=20 > > One difficulty in this is that in some circumstances, > > NFS_ATTR_FATTR_PRECHANGE is cleared. This causes the gap tracking to > > fail because we can only add pre/post attributes to the table if bother > > NFS_ATTR_FATTR_PRECHANGE and NFS_ATTR_FATTR_CHANGE are set. If one is > > cleared, we have a problem. > >=20 > > This patch works around that by introducing a new > > NFS_ATTR_FATTR_PRECHANGE_RAW whcih is set when the normal PRECHANGE is > > cleared. If either PRECHANGE or PRECHANGE_RAW a set then we can use the > > prechange value. Obviously this is a hack. I would be keen to > > understand why PRECHANGE is being cleared so I can find a better > > resolution. > >=20 > > I haven't given much thought to directories. A number of places call > > nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA); > >=20 > > only on directories and I don't understand why. As directories still > > have pre/post attributes even in NFSv4 it would be good to understand > > how to get this right. > >=20 > > Any help would be most appreciated. > >=20 > > Thanks, > > NeilBrown > >=20 > >=20 > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > > index f594dac436a7..8535372861e9 100644 > > --- a/fs/nfs/dir.c > > +++ b/fs/nfs/dir.c > > @@ -84,7 +84,7 @@ alloc_nfs_open_dir_context(struct inode *dir) > > ctx->dtsize =3D NFS_INIT_DTSIZE; > > spin_lock(&dir->i_lock); > > if (list_empty(&nfsi->open_files) && > > - (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)) > > + nfs_ooo_test(nfsi)) > > nfs_set_cache_invalid(dir, > > NFS_INO_INVALID_DATA | > > NFS_INO_REVAL_FORCED); > > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c > > index 6b2cfa59a1a2..4c9ad2e6586e 100644 > > --- a/fs/nfs/inode.c > > +++ b/fs/nfs/inode.c > > @@ -208,11 +208,12 @@ void nfs_set_cache_invalid(struct inode *inode, uns= igned long flags) > > =20 > > nfsi->cache_validity |=3D flags; > > =20 > > - if (inode->i_mapping->nrpages =3D=3D 0) > > - nfsi->cache_validity &=3D ~(NFS_INO_INVALID_DATA | > > - NFS_INO_DATA_INVAL_DEFER); > > - else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) > > - nfsi->cache_validity &=3D ~NFS_INO_DATA_INVAL_DEFER; > > + if (inode->i_mapping->nrpages =3D=3D 0) { > > + nfsi->cache_validity &=3D ~NFS_INO_INVALID_DATA; > > + nfs_ooo_clear(nfsi); > > + } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) { > > + nfs_ooo_clear(nfsi); > > + } > > trace_nfs_set_cache_invalid(inode, 0); > > } > > EXPORT_SYMBOL_GPL(nfs_set_cache_invalid); > > @@ -677,9 +678,10 @@ static int nfs_vmtruncate(struct inode * inode, loff= _t offset) > > trace_nfs_size_truncate(inode, offset); > > i_size_write(inode, offset); > > /* Optimisation */ > > - if (offset =3D=3D 0) > > - NFS_I(inode)->cache_validity &=3D ~(NFS_INO_INVALID_DATA | > > - NFS_INO_DATA_INVAL_DEFER); > > + if (offset =3D=3D 0) { > > + NFS_I(inode)->cache_validity &=3D ~NFS_INO_INVALID_DATA; > > + nfs_ooo_clear(NFS_I(inode)); > > + } > > NFS_I(inode)->cache_validity &=3D ~NFS_INO_INVALID_SIZE; > > =20 > > spin_unlock(&inode->i_lock); > > @@ -1101,7 +1103,7 @@ void nfs_inode_attach_open_context(struct nfs_open_= context *ctx) > > =20 > > spin_lock(&inode->i_lock); > > if (list_empty(&nfsi->open_files) && > > - (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)) > > + nfs_ooo_test(nfsi)) > > nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA | > > NFS_INO_REVAL_FORCED); > > list_add_tail_rcu(&ctx->list, &nfsi->open_files); > > @@ -1344,8 +1346,8 @@ int nfs_clear_invalid_mapping(struct address_space = *mapping) > > =20 > > set_bit(NFS_INO_INVALIDATING, bitlock); > > smp_wmb(); > > - nfsi->cache_validity &=3D > > - ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER); > > + nfsi->cache_validity &=3D ~NFS_INO_INVALID_DATA; > > + nfs_ooo_clear(nfsi); > > spin_unlock(&inode->i_lock); > > trace_nfs_invalidate_mapping_enter(inode); > > ret =3D nfs_invalidate_mapping(inode, mapping); > > @@ -1807,6 +1809,38 @@ static int nfs_inode_finish_partial_attr_update(co= nst struct nfs_fattr *fattr, > > return 0; > > } > > =20 > > +static void nfs_ooo_merge(struct nfs_inode *nfsi, > > + u64 start, u64 end) > > +{ > > + int i, cnt; > > + > > + /* add this range, merging if possible */ > > + cnt =3D nfsi->ooo_cnt; > > + for (i =3D 0; i < cnt; i++) { > > + if (end =3D=3D nfsi->ooo[i][0]) > > + end =3D nfsi->ooo[i][1]; > > + else if (start =3D=3D nfsi->ooo[i][1]) > > + start =3D nfsi->ooo[i][0]; > > + else > > + continue; > > + /* Remove 'i' from table and loop to insert the new range */ > > + cnt -=3D 1; > > + nfsi->ooo[i][0] =3D nfsi->ooo[cnt][0]; > > + nfsi->ooo[i][1] =3D nfsi->ooo[cnt][1]; > > + i =3D -1; > > + } > > + if (start !=3D end && cnt < ARRAY_SIZE(nfsi->ooo)) { > > + nfsi->ooo[cnt][0] =3D start; > > + nfsi->ooo[cnt][1] =3D end; > > + cnt +=3D 1; > > + } > > + if (cnt > nfsi->ooo_max) { > > + printk("ooo_max -> %d\n", cnt); > > + nfsi->ooo_max =3D cnt; > > + } > > + nfsi->ooo_cnt =3D cnt; > > +} > > + > > static int nfs_refresh_inode_locked(struct inode *inode, > > struct nfs_fattr *fattr) > > { > > @@ -1817,8 +1851,17 @@ static int nfs_refresh_inode_locked(struct inode *= inode, > > =20 > > if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode)) > > ret =3D nfs_update_inode(inode, fattr); > > - else if (attr_cmp =3D=3D 0) > > - ret =3D nfs_check_inode_attributes(inode, fattr); > > + else { > > + if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) && > > + (fattr->valid & (NFS_ATTR_FATTR_PRECHANGE | > > + NFS_ATTR_FATTR_PRECHANGE_RAW))) > > + nfs_ooo_merge(NFS_I(inode), > > + fattr->change_attr, > > + fattr->pre_change_attr); > > + > > + if (attr_cmp =3D=3D 0) > > + ret =3D nfs_check_inode_attributes(inode, fattr); > > + } > > =20 > > trace_nfs_refresh_inode_exit(inode, ret); > > return ret; > > @@ -1909,6 +1952,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struc= t inode *inode, struct nfs_fa > > if (attr_cmp < 0) > > return 0; > > if ((fattr->valid & NFS_ATTR_FATTR) =3D=3D 0 || !attr_cmp) { > > + if (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) > > + fattr->valid |=3D NFS_ATTR_FATTR_PRECHANGE_RAW; > > fattr->valid &=3D ~(NFS_ATTR_FATTR_PRECHANGE > > | NFS_ATTR_FATTR_PRESIZE > > | NFS_ATTR_FATTR_PREMTIME > > @@ -2084,8 +2129,13 @@ static int nfs_update_inode(struct inode *inode, s= truct nfs_fattr *fattr) > > dprintk("NFS: change_attr change on server for file %s/%ld\n", > > inode->i_sb->s_id, > > inode->i_ino); > > - } else if (!have_delegation) > > - nfsi->cache_validity |=3D NFS_INO_DATA_INVAL_DEFER; > > + } else if (!have_delegation) { > > + if (fattr->valid & NFS_ATTR_FATTR_PRECHANGE) > > + nfs_ooo_merge(nfsi, fattr->change_attr, > > + fattr->pre_change_attr); > > + nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode), > > + fattr->change_attr); > > + } > > inode_set_iversion_raw(inode, fattr->change_attr); > > } > > } else { > > @@ -2239,6 +2289,8 @@ struct inode *nfs_alloc_inode(struct super_block *s= b) > > return NULL; > > nfsi->flags =3D 0UL; > > nfsi->cache_validity =3D 0UL; > > + nfsi->ooo_cnt =3D 0; > > + nfsi->ooo_max =3D 0; > > #if IS_ENABLED(CONFIG_NFS_V4) > > nfsi->nfs4_acl =3D NULL; > > #endif /* CONFIG_NFS_V4 */ > > diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h > > index 8c6cc58679ff..cdf77f49ef8f 100644 > > --- a/fs/nfs/nfstrace.h > > +++ b/fs/nfs/nfstrace.h > > @@ -28,7 +28,6 @@ > > { NFS_INO_INVALID_MTIME, "INVALID_MTIME" }, \ > > { NFS_INO_INVALID_SIZE, "INVALID_SIZE" }, \ > > { NFS_INO_INVALID_OTHER, "INVALID_OTHER" }, \ > > - { NFS_INO_DATA_INVAL_DEFER, "DATA_INVAL_DEFER" }, \ > > { NFS_INO_INVALID_BLOCKS, "INVALID_BLOCKS" }, \ > > { NFS_INO_INVALID_XATTR, "INVALID_XATTR" }, \ > > { NFS_INO_INVALID_NLINK, "INVALID_NLINK" }, \ > > diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h > > index 7931fa472561..8a83d6d204ed 100644 > > --- a/include/linux/nfs_fs.h > > +++ b/include/linux/nfs_fs.h > > @@ -190,6 +190,33 @@ struct nfs_inode { > > /* Open contexts for shared mmap writes */ > > struct list_head open_files; > > =20 > > + /* Keep track of out-of-order replies. > > + * The ooo array contains start/end pairs of > > + * number from the changeid sequence when > > + * the inodes iversion has been updated. > > + * It also contains end/start pair (i.e. reverse order) > > + * of sections of the changeid sequence that have > > + * been seen in replies from the server. > > + * Normally these should match and when both > > + * A:B and B:A are found in ooo, they are both removed. > > + * And if a reply with A:B causes an iversion update > > + * of A:B, then neither are added. > > + * When a reply has pre_change that doesn't match > > + * iversion, then the changeid pair, and any consequent > > + * change in iversion ARE added. Later replies > > + * might fill in the gaps, or possibly a gap is caused > > + * by a change from another client. > > + * When a file or directory is opened, if the ooo table > > + * is not empty, then we assume the gaps were due to > > + * another client and we invalidate the cached data. > > + * > > + * We can only track a limited number of concurrent gaps. > > + * Currently that limit is 16. > > + */ > > + int ooo_cnt; > > + int ooo_max; // TRACING > > + unsigned long ooo[16][2]; >=20 > Why unsigned longs here? Shouldn't these be u64? Yes, they should be u64. Thanks. >=20 > I guess you could argue that when we have 32-bit longs that the most > significant bits don't matter, but that's also the case with 64-bit > longs, and 32 bits would halve the space requirements. >=20 > Also, this grows each inode by 2k on a 64-bit arch! Maybe we should > dynamically allocate these things instead? If the allocation fails, then > we could just go back to marking the cache invalid and move on. 2K? 8*16*2+4*2 =3D=3D 264, not 2048. But I agree that allocating on demand would make sense. 16 is probably more than needed. I don't have proper testing results from the customer yet, but I wouldn't be surprised if a smaller number would suffice. But if we are allocating only on demand, it wouldn't hurt to allocate 16. Thanks, NeilBrown >=20 > > + > > #if IS_ENABLED(CONFIG_NFS_V4) > > struct nfs4_cached_acl *nfs4_acl; > > /* NFSv4 state */ > > @@ -253,8 +280,6 @@ struct nfs4_copy_state { > > #define NFS_INO_INVALID_MTIME BIT(10) /* cached mtime is invalid */ > > #define NFS_INO_INVALID_SIZE BIT(11) /* cached size is invalid */ > > #define NFS_INO_INVALID_OTHER BIT(12) /* other attrs are invalid */ > > -#define NFS_INO_DATA_INVAL_DEFER \ > > - BIT(13) /* Deferred cache invalidation */ > > #define NFS_INO_INVALID_BLOCKS BIT(14) /* cached blocks are inva= lid */ > > #define NFS_INO_INVALID_XATTR BIT(15) /* xattrs are invalid */ > > #define NFS_INO_INVALID_NLINK BIT(16) /* cached nlinks is invalid */ > > @@ -615,6 +640,19 @@ nfs_fileid_to_ino_t(u64 fileid) > > return ino; > > } > > =20 > > +static inline void nfs_ooo_clear(struct nfs_inode *nfsi) > > +{ > > + nfsi->ooo_cnt =3D 0; > > +} > > + > > +static inline bool nfs_ooo_test(struct nfs_inode *nfsi) > > +{ > > + if (nfsi->ooo_cnt =3D=3D 0) > > + return false; > > + printk("nfs_ooo_test_and_clear() found %d\n", nfsi->ooo_cnt); > > + return true; > > +} > > + > > #define NFS_JUKEBOX_RETRY_TIME (5 * HZ) > > =20 > > /* We need to block new opens while a file is being unlinked. > > diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h > > index e86cf6642d21..b18a877b16eb 100644 > > --- a/include/linux/nfs_xdr.h > > +++ b/include/linux/nfs_xdr.h > > @@ -106,6 +106,7 @@ struct nfs_fattr { > > #define NFS_ATTR_FATTR_OWNER_NAME (1U << 23) > > #define NFS_ATTR_FATTR_GROUP_NAME (1U << 24) > > #define NFS_ATTR_FATTR_V4_SECURITY_LABEL (1U << 25) > > +#define NFS_ATTR_FATTR_PRECHANGE_RAW (1U << 26) > > =20 > > #define NFS_ATTR_FATTR (NFS_ATTR_FATTR_TYPE \ > > | NFS_ATTR_FATTR_MODE \ > >=20 >=20 > --=20 > Jeff Layton >=20