Date: Sat, 14 Nov 2020 12:57:24 +0000 (GMT)
From: Daire Byrne
To: bfields
Cc: Trond Myklebust, linux-cachefs, linux-nfs
Message-ID: <217712894.87456370.1605358643862.JavaMail.zimbra@dneg.com>
In-Reply-To: <20201113222600.GC1299@fieldses.org>
Subject: Re: Adventures in NFS re-exporting
X-Mailing-List: linux-nfs@vger.kernel.org

----- On 13 Nov, 2020, at 22:26, bfields bfields@fieldses.org wrote:

> On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote:
>> On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote:
>> > So, I can't lay claim to identifying the exact optimisation/hack that
>> > improves the retention of the re-export server's client cache when
>> > re-exporting an NFSv3 server (which is then read by many clients).
>> > We were working with an engineer at the time who showed an interest in
>> > our use case and after we supplied a reproducer he suggested modifying
>> > the nfs/inode.c
>> >
>> > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
>> > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr)
>> > {
>> >
>> > His reasoning at the time was:
>> >
>> > "Fixes inode invalidation caused by read access. The least important
>> > bit is ORed with 1 and causes the inode version to differ from the one
>> > seen on the NFS share. This in turn causes unnecessary re-download
>> > impacting the performance significantly. This fix makes it only
>> > re-fetch file content if inode version seen on the server is newer
>> > than the one on the client."
>> >
>> > But I've always been puzzled by why this only seems to be the case
>> > when using knfsd to re-export the (NFSv3) client mount. Using multiple
>> > processes on a standard client mount never causes any similar
>> > re-validations. And this happens with a completely read-only share
>> > which is why I started to think it has something to do with atimes as
>> > that could perhaps still cause a "write" modification even when
>> > read-only?
>>
>> Ah-hah! So, it's inode_query_iversion() that's modifying a nfs inode's
>> i_version. That's a special thing that only nfsd would do.
>>
>> I think that's totally fixable, we'll just have to think a little about
>> how....
>
> I wonder if something like this helps?--b.
>
> commit 0add88a9ccc5
> Author: J. Bruce Fields
> Date: Fri Nov 13 17:03:04 2020 -0500
>
> nfs: don't mangle i_version on NFS
>
> The i_version on NFS has pretty much opaque to the client, so we don't
> want to give the low bit any special interpretation.
>
> Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the
> i_version on their own.
>
> Signed-off-by: J. Bruce Fields
>
> diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c
> index 29ec8b09a52d..9b8dd5b713a7 100644
> --- a/fs/nfs/fs_context.c
> +++ b/fs/nfs/fs_context.c
> @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = {
>          .init_fs_context = nfs_init_fs_context,
>          .parameters = nfs_fs_parameters,
>          .kill_sb = nfs_kill_super,
> -        .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +        .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +                    FS_PRIVATE_I_VERSION,
>  };
>  MODULE_ALIAS_FS("nfs");
>  EXPORT_SYMBOL_GPL(nfs_fs_type);
> @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = {
>          .init_fs_context = nfs_init_fs_context,
>          .parameters = nfs_fs_parameters,
>          .kill_sb = nfs_kill_super,
> -        .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA,
> +        .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA|
> +                    FS_PRIVATE_I_VERSION,
>  };
>  MODULE_ALIAS_FS("nfs4");
>  MODULE_ALIAS("nfs4");
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 21cc971fd960..c5bb4268228b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,7 @@ struct file_system_type {
>  #define FS_HAS_SUBTYPE 4
>  #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
>  #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
> +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */
>  #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
>  #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename()
> internally. */
>          int (*init_fs_context)(struct fs_context *);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 2917ef990d43..52c790a847de 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode)
>          u64 cur, old, new;
>
>          cur = inode_peek_iversion_raw(inode);
> +        if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION)
> +                return cur;
>          for (;;) {
>                  /* If flag is already set, then no need to swap */
>                  if (cur & I_VERSION_QUERIED) {

Yes, I can confirm that this absolutely helps! I replaced our (brute
force) iversion patch with this (much nicer) patch and we got the same
improvement; nfsd and its clients no longer cause the re-export server's
client cache to be constantly re-validated. The re-export server can now
serve the same results to many clients from cache. Thanks so much for
spending the time to track this down. If merged, future (crazy) NFS
re-exporters will benefit from the metadata performance
improvement/acceleration!

Now, if anyone has any ideas why all the read calls to the originating
server are limited to a maximum of 128k (with rsize=1M) when coming via
the re-export server's nfsd threads, I see that as the next biggest
performance issue. Reading directly on the re-export server with a
userspace process issues 1MB reads as expected. It doesn't happen for
writes (wsize=1MB all the way through), but I'm not sure if that has more
to do with async and write-back caching helping to build up the size
before commit?

I figure the other remaining items on my (wish) list are probably more in
the "won't fix" or "can't fix" category (except maybe the NFSv4.0
input/output errors?).

Daire
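
Footnote on the i_version behaviour discussed above: a minimal userspace
sketch, not the kernel code, of why a query from nfsd can make the NFS
client's equality check fail. All sim_* names are hypothetical. The real
inode_query_iversion() sets the flag with an atomic compare-and-swap loop,
which is simplified here to a plain OR, and the private_i_version field
merely stands in for the FS_PRIVATE_I_VERSION check proposed in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; every sim_* name is an assumption for illustration. */
#define SIM_QUERIED_SHIFT 1
#define SIM_QUERIED       (1ULL << (SIM_QUERIED_SHIFT - 1)) /* low bit */

struct sim_inode {
    uint64_t i_version;      /* raw value; NFS stores the server's change attr here */
    bool private_i_version;  /* models the proposed FS_PRIVATE_I_VERSION flag */
};

/*
 * Models inode_query_iversion(): for ordinary filesystems the low bit is
 * set to record that the value was queried. With the private flag set,
 * the raw value is returned and left untouched.
 */
static uint64_t sim_query_iversion(struct sim_inode *inode)
{
    uint64_t cur = inode->i_version;

    if (inode->private_i_version)
        return cur;

    inode->i_version = cur | SIM_QUERIED; /* simplified: no cmpxchg loop */
    return cur >> SIM_QUERIED_SHIFT;
}

/*
 * Models the NFS client's raw equality check against the change attribute
 * reported by the server. false means the cached data would be invalidated.
 */
static bool sim_cache_valid(const struct sim_inode *inode, uint64_t change_attr)
{
    return inode->i_version == change_attr;
}
```

With a stored change attribute of 100, one query on a non-private inode
leaves the raw value at 101, so a later comparison against the server's
100 fails and the cache is dropped. With the private flag set, the query
returns 100 unchanged and the comparison still passes.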