Date: Wed, 23 Sep 2020 13:07:35 -0400
From: "bfields@fieldses.org"
To: Trond Myklebust
Cc: "linux-cachefs@redhat.com", "linux-nfs@vger.kernel.org", "daire@dneg.com"
Subject: Re: Adventures in NFS re-exporting
Message-ID: <20200923170735.GC4691@fieldses.org>
References: <943482310.31162206.1599499860595.JavaMail.zimbra@dneg.com>
 <1155061727.42788071.1600777874179.JavaMail.zimbra@dneg.com>
 <20200923124038.GA4691@fieldses.org>
X-Mailing-List: linux-nfs@vger.kernel.org

On Wed, Sep 23, 2020 at 01:09:01PM +0000, Trond Myklebust wrote:
> On Wed, 2020-09-23 at 08:40 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 22, 2020 at 01:52:25PM +0000, Trond Myklebust wrote:
> > > On Tue, 2020-09-22 at 13:31 +0100, Daire Byrne wrote:
> > > > Hi,
> > > >
> > > > I just thought I'd flesh out the other two issues I have found
> > > > with re-exporting that are ultimately responsible for the
> > > > biggest performance bottlenecks. And both of them revolve
> > > > around the caching of metadata file lookups in the NFS client.
> > > >
> > > > Especially for the case where we are re-exporting a server many
> > > > milliseconds away (i.e. on-premise -> cloud), we want to be
> > > > able to control how much the client caches metadata and file
> > > > data so that its many LAN clients all benefit from the
> > > > re-export server only having to do the WAN lookups once (within
> > > > a specified coherency time).
> > > >
> > > > Keeping the file data in the vfs page cache or on disk using
> > > > fscache/cachefiles is fairly straightforward, but keeping the
> > > > metadata cached is particularly difficult. And without the
> > > > cached metadata we introduce long delays before we can serve
> > > > the already present and locally cached file data to many
> > > > waiting clients.
> > > >
> > > > ----- On 7 Sep, 2020, at 18:31, Daire Byrne daire@dneg.com wrote:
> > > > > 2) If we cache metadata on the re-export server using
> > > > > actimeo=3600,nocto we can cut the network packets back to the
> > > > > origin server to zero for repeated lookups.
> > > > > However, if a client of the re-export server walks paths and
> > > > > memory maps those files (i.e. loading an application), the
> > > > > re-export server starts issuing unexpected calls back to the
> > > > > origin server again, ignoring/invalidating the re-export
> > > > > server's NFS client cache. We worked around this by patching
> > > > > an inode/iversion validity check in inode.c so that the NFS
> > > > > client cache on the re-export server is used. I'm not sure
> > > > > about the correctness of this patch but it works for our
> > > > > corner case.
> > > >
> > > > If we use actimeo=3600,nocto (say) to mount a remote software
> > > > volume on the re-export server, we can successfully cache the
> > > > loading of applications and walking of paths directly on the
> > > > re-export server such that after a couple of runs, there are
> > > > practically zero packets back to the originating NFS server
> > > > (great!). But, if we then do the same thing on a client which
> > > > is mounting that re-export server, the re-export server now
> > > > starts issuing lots of calls back to the originating server and
> > > > invalidating its client cache (bad!).
> > > >
> > > > I'm not exactly sure why, but the iversion of the inode gets
> > > > changed locally (due to atime modification?), most likely via
> > > > invocation of inode_inc_iversion_raw. Each time it gets
> > > > incremented, the following call to validate attributes detects
> > > > changes, causing it to be reloaded from the originating server.
> > > >
> > > > This patch helps to avoid this when applied to the re-export
> > > > server, but there may be other places where this happens too.
> > > > I accept that this patch is probably not the right/general way
> > > > to do this, but it helps to highlight the issue when
> > > > re-exporting and it works well for our use case:
> > > >
> > > > --- linux-5.5.0-1.el7.x86_64/fs/nfs/inode.c	2020-01-27 00:23:03.000000000 +0000
> > > > +++ new/fs/nfs/inode.c	2020-02-13 16:32:09.013055074 +0000
> > > > @@ -1869,7 +1869,7 @@
> > > >
> > > >  	/* More cache consistency checks */
> > > >  	if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
> > > > -		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
> > > > +		if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
> > > >  			/* Could it be a race with writeback? */
> > > >  			if (!(have_writers || have_delegation)) {
> > > >  				invalid |= NFS_INO_INVALID_DATA
> > >
> > > There is nothing in the base NFSv4 and NFSv4.1 specs that allows
> > > you to make assumptions about how the change attribute behaves
> > > over time.
> > >
> > > The only safe way to do something like the above is if the server
> > > supports NFSv4.2 and also advertises support for the
> > > 'change_attr_type' attribute. In that case, you can check at
> > > mount time whether or not the change attribute on this filesystem
> > > is one of the monotonic types which would allow the above
> > > optimisation.
> >
> > Looking at https://tools.ietf.org/html/rfc7862#section-12.2.3 ....
> > I think that would be anything but NFS4_CHANGE_TYPE_IS_UNDEFINED?
> >
> > The Linux server's ctime is monotonic and will advertise that with
> > change_attr_type since 4.19.
> >
> > So I think it would be easy to patch the client to check
> > change_attr_type and set an NFS_CAP_MONOTONIC_CHANGE flag in
> > server->caps; the hard part would be figuring out which
> > optimisations are OK.
>
> The ctime is *not* monotonic. It can regress under server reboots,
> and it can regress if someone deliberately changes the time.
So, anything other than IS_UNDEFINED or IS_TIME_METADATA?

Though the Linux server is susceptible to some of that even when it
returns MONOTONIC_INCR. If the admin replaces the filesystem by an
older snapshot, there's not much we can do. I'm not sure what degree
of guarantee we need.

--b.

> We have code that tries to handle all these issues (see
> fattr->gencount and nfsi->attr_gencount) because we've hit those
> issues before...