From: "Benjamin Coddington"
To: "Trond Myklebust"
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH v9 23/27] NFS: Convert readdir page cache to use a cookie based index
Date: Fri, 11 Mar 2022 11:14:03 -0500
On 11 Mar 2022, at 9:02, Trond Myklebust wrote:

> On Fri, 2022-03-11 at 06:58 -0500, Benjamin Coddington wrote:
>> On 10 Mar 2022, at 16:07, Trond Myklebust wrote:
>>
>>> On Wed, 2022-03-09 at 15:01 -0500, Benjamin Coddington wrote:
>>>> On 27 Feb 2022, at 18:12, trondmy@kernel.org wrote:
>>>>
>>>>> From: Trond Myklebust
>>>>>
>>>>> Instead of using a linear index to address the pages, use the
>>>>> cookie of the first entry, since that is what we use to match the
>>>>> page anyway.
>>>>>
>>>>> This allows us to avoid re-reading the entire cache on a seekdir()
>>>>> type of operation. The latter is very common when re-exporting
>>>>> NFS, and is a major performance drain.
>>>>>
>>>>> The change does affect our duplicate cookie detection, since we
>>>>> can no longer rely on the page index as a linear offset for
>>>>> detecting whether we looped backwards. However, since we no longer
>>>>> do a linear search through all the pages on each call to
>>>>> nfs_readdir(), this is less of a concern than it was previously.
>>>>> The other downside is that invalidate_mapping_pages() can no
>>>>> longer use the page index to avoid clearing pages that have been
>>>>> read. A subsequent patch will restore the functionality this
>>>>> provides to the 'ls -l' heuristic.
>>>>
>>>> I didn't realize the approach was to also hash out the
>>>> linearly-cached entries. I thought we'd do something like flag the
>>>> context for hashed page indexes after a seekdir event, and if there
>>>> are collisions with the linear entries, they'll get fixed up when
>>>> found.
>>>
>>> Why? What's the point of using 2 models where 1 will do?
>>
>> I don't think the hashed model is quite as simple and efficient
>> overall, and it may have impacts on the system beyond NFS.
>>
>>>> Doesn't that mean that with this approach seekdir() only hits the
>>>> same pages when the entry offset is page-aligned? That's 1 in 127
>>>> odds.
>>>
>>> The point is not to stomp all over the pages that contain aligned
>>> data when the application does call seekdir().
>>>
>>> IOW: we always optimise for the case where we do a linear read of
>>> the directory, but we support random seekdir() + read too.
>>
>> And that could be done just by bumping the seekdir users to some
>> constant offset (index 262144 ?), or something else equally dead-nuts
>> simple. That keeps tightly clustered page indexes, so walking the
>> cache is faster. That reduces the "buckshot" effect the hashing has
>> of eating up pagecache pages they'll never use again. That doesn't
>> cap our caching ability at 33 million entries.
>
> What you say would make sense if readdir cookies truly were offsets,
> but in general they're not. Cookies are unstructured data, and should
> be treated as unstructured data.
>
> Let's say I do cache more than 33 million entries and I have to find a
> cookie. I have to linearly read through at least 1GB of cached data
> before I can give up and start a new readdir. Either that, or I need
> to have a heuristic that tells me when to stop searching, and then
> another heuristic that tells me where to store the data in a way that
> doesn't trash the page cache.
>
> With the hashing, I seek to the page matching the hash, and I either
> immediately find what I need, or I immediately know to start a
> readdir. There is no need for any additional heuristic.
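For anyone following along, that lookup step reads to me like the
sketch below. This is my own userspace paraphrase, not the literal
kernel code: the 18-bit hash width is an assumption inferred from the
numbers above (2^18 pages at 127 entries per page is the ~33 million
entry cap, and 2^18 4k pages is the 1GB of cached data), and hash_64()
is rebuilt here with the multiplicative constant from
include/linux/hash.h:

#include <stdint.h>
#include <stdio.h>

#define GOLDEN_RATIO_64 0x61C8864680B583EBull	/* include/linux/hash.h */
#define COOKIE_HASH_BITS 18			/* assumed, see above */

/* Same shape as the kernel's hash_64(): a multiplicative hash that
 * keeps the top 'bits' bits of the product. */
static uint64_t hash_64(uint64_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_64) >> (64 - bits);
}

/* Cookie 0 always means "start of the directory", so it keeps page
 * index 0; every other cookie is scattered across a 2^18-page window.
 * A seekdir() reader jumps straight to this index: either the page is
 * there and validates, or a fresh READDIR is issued.  No searching,
 * no stop-early heuristic. */
static uint64_t cookie_to_page_index(uint64_t cookie)
{
	if (cookie == 0)
		return 0;
	return hash_64(cookie, COOKIE_HASH_BITS);
}

int main(void)
{
	uint64_t cookies[] = { 0, 1, 127, 262144 };
	unsigned int i;

	for (i = 0; i < 4; i++)
		printf("cookie %llu -> page index %llu\n",
		       (unsigned long long)cookies[i],
		       (unsigned long long)cookie_to_page_index(cookies[i]));
	return 0;
}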
The scenario where we want to find a cookie while not doing a linear
pass through the directory will be the seekdir() case. In a linear
walk, we have the cached page index to help. So in the seekdir case,
the chances of having someone already fill a page and also having the
cookie be the 1 in 127 that are page-aligned (and so match an already
cached page) are small, I think. Unless your use-case will often hit
the exact same offsets over and over.

So with the hashing in the seekdir case, I think the cache will end up
pretty heavily filled with the same data duplicated at various offsets,
and rarely useful. That's why I wondered if you'd tested your use-case
for it and found it to be advantageous. I think what we've got is going
to work fine, but I wonder if you've seen it to work well.

The major pain point most of our users complain about is not being able
to perform a complete walk in linear time with respect to size with
invalidations at play. This series fixes that, and is a huge bonus.
Other, smaller performance improvements pale in comparison for us, and
might just keep us forever chasing one or two minor optimizations that
have trade-offs. There are a lot of variables at play. For some
client/server setups (like some low-latency RDMA), and very large
directories and cache sizes, it might be more performant to just do the
READDIR every time, walking local caches be damned.
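To be concrete about the constant-offset alternative I floated above,
I mean something like this hypothetical helper (untested, and the names
are made up for illustration), instead of a per-cookie hash:

#include <stdbool.h>
#include <stdio.h>

#define SEEKDIR_BASE_INDEX 262144UL	/* the arbitrary constant above */

/* Linear readers keep the dense 0..N page indexes they have today.  A
 * context that has done a seekdir() is flagged, and from then on fills
 * pages in its own window starting at the constant base.  The cache
 * stays tightly clustered instead of buckshot across 2^18 pages. */
static unsigned long readdir_page_index(unsigned long linear_index,
					bool ctx_did_seekdir)
{
	if (!ctx_did_seekdir)
		return linear_index;
	return SEEKDIR_BASE_INDEX + linear_index;
}

int main(void)
{
	printf("linear reader, page 5  -> index %lu\n",
	       readdir_page_index(5, false));
	printf("seekdir reader, page 5 -> index %lu\n",
	       readdir_page_index(5, true));
	return 0;
}

Where exactly the seekdir window lives doesn't much matter; the point
is just that it's constant and clustered, so the cache walk stays short
and the pages remain findable.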
>> It's weird to me that we're doing exactly what XArray says not to
>> do, hash the index, when we don't have to.
>>
>>>> It also means we're amplifying the pagecache's usage for slightly
>>>> changing directories - rather than re-using the same pages we're
>>>> scattering our usage across the index. Eh, maybe not a big deal if
>>>> we just expect the page cache's LRU to do the work.
>>>
>>> I don't understand your point about 'not reusing'. If the user
>>> seeks to the same cookie, we reuse the page. However, I don't know
>>> how you would go about setting up a schema that allows you to seek
>>> to an arbitrary cookie without doing a linear search.
>>
>> So when I was talking about 'reusing' a page, that's about
>> re-filling the same pages rather than constantly conjuring new ones,
>> which requires less of the pagecache's resources in total. Maybe the
>> pagecache can handle that without negatively impacting other users
>> of the cache that /will/ re-use their cached pages, but I worry it
>> might be irresponsible of us to fill the pagecache with pages we
>> know we're never going to find again.
>
> In the case where the processes are reading linearly through a
> directory that is not changing (or at least where the beginning of
> the directory is not changing), we will reuse the cached data,
> because just like in the linearly indexed case, each process ends up
> reading the exact same sequence of cookies, and looking up the exact
> same sequence of hashes.
>
> The sequences start to diverge only if they hit a part of the
> directory that is being modified. At that point, we're going to be
> invalidating page cache entries anyway, with the last reader being
> more likely to be following the new sequence of cookies.

I don't think we clean up behind ourselves anymore. Now that we are
going to validate each page before using it, we don't invalidate the
whole cache at any point. That means a divergence duplicates the
pagecache usage beyond the point of divergence.

Ben