Date: Mon, 19 Oct 2020 17:19:43 +0100 (BST)
From: Daire Byrne
To: Trond Myklebust
Cc: bfields, linux-cachefs, linux-nfs
Subject: Re: Adventures in NFS re-exporting

----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@dneg.com wrote:

> Trond/Bruce,
>
> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@hammerspace.com wrote:
>
>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>> > (under memory cache churn) when those filehandles are still being
>>> > used by the knfsd's remote clients resulting in sporadic and random
>>> > stale filehandles. This seems to be mostly for directories from
>>> > what I've seen. Does the NFS client not know that knfsd is still
>>> > using those files/dirs? The workaround is to never drop inode &
>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>> > also helps to ensure that we actually make the most of our
>>> > actimeo=3600,nocto mount options for the full specified time.
>>>
>>> I thought reexport worked by embedding the original server's filehandles
>>> in the filehandles given out by the reexporting server.
>>>
>>> So, even if nothing's cached, when the reexporting server gets a
>>> filehandle, it should be able to extract the original filehandle from it
>>> and use that.
>>>
>>> I wonder why that's not working?
>>
>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>> callback for it.
>
> So in terms of the ESTALE counter on the reexport server, we see it increase if
> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
> difference in the client experience in that with NFSv3 we quickly get
> input/output errors but with NFSv4 we don't. But it does seem like the
> performance drops significantly which makes me think that NFSv4 retries the
> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>
> This is the simplest reproducer I could come up with but it may still be
> specific to our workloads/applications and hard to replicate exactly.
>
> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done
>
> reexport-server # sysctl -w vm.vfs_cache_pressure=100
> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done
>
> Where "application" is some big application with lots of paths to scan and libs
> to memory map, and "/vol/software" is an NFS mount on the reexport-server from
> another originating NFS server. I don't know why this application loading
> workload shows this best, but perhaps the access patterns of memory mapped
> binaries and libs are particularly susceptible to ESTALE?
>
> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
> The ESTALE count increases, and the client running the application reports
> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>
> As soon as we switch to vfs_cache_pressure=0, the repeated drop_caches on the
> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
> no longer increases, and the client experiences no issues (NFSv3 & NFSv4).

I don't suppose anyone has any more thoughts on this one? This is likely the
first problem that anyone trying to NFS re-export is going to encounter. If they
re-export NFSv3 they'll just get lots of ESTALE errors as the nfs inodes are
dropped from cache (with the default vfs_cache_pressure=100), and if they
re-export NFSv4 the lookup performance will drop significantly as each ESTALE
triggers re-lookups.

For our particular use case, it is actually desirable to have
vfs_cache_pressure=0 to keep the nfs client inode and dentry caches in memory
and help with expensive metadata lookups, but it would still be nice to have
the option of a less drastic setting (such as vfs_cache_pressure=1) to help
avoid OOM conditions.

Daire
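
P.S. In case it helps anyone trying to reproduce this, here is a minimal sketch
of a polling loop for the knfsd stale filehandle counter. It assumes the stale
count is the second field of the "fh" line in /proc/net/rpc/nfsd (as in the awk
one-liner above); the exact field layout may vary between kernel versions, so
treat it as illustrative rather than definitive.

  #!/bin/sh
  # Poll the knfsd filehandle stats every 10 seconds and print how much the
  # stale filehandle (ESTALE) counter has grown since the previous sample.
  prev=0
  while true; do
      # Assumption: the stale count is field 2 of the "fh" line.
      cur=$(awk '/^fh/ {print $2}' /proc/net/rpc/nfsd)
      echo "$(date '+%H:%M:%S') stale=${cur} delta=$((cur - prev))"
      prev=$cur
      sleep 10
  done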