From: Amir Goldstein
Date: Mon, 8 Mar 2021 12:35:27 +0200
Subject: Re: fscache: Redesigning the on-disk cache
To: David Howells
Cc: linux-cachefs@redhat.com, Jeff Layton, David Wysochanski,
    "Matthew Wilcox (Oracle)", "J. Bruce Fields", Christoph Hellwig,
    Dave Chinner, Alexander Viro, linux-afs@lists.infradead.org,
    Linux NFS Mailing List, CIFS, ceph-devel,
    v9fs-developer@lists.sourceforge.net, linux-fsdevel, linux-kernel,
    Miklos Szeredi
In-Reply-To: <517184.1615194835@warthog.procyon.org.uk>

On Mon, Mar 8, 2021 at 11:14 AM David Howells wrote:
>
> Amir Goldstein wrote:
>
> > >  (0a) As (0) but using SEEK_DATA/SEEK_HOLE instead of bmap and opening the
> > >       file for every whole operation (which may combine reads and writes).
> >
> > I read that NFSv4 supports hole punching, so when using ->bmap() or SEEK_DATA
> > to keep track of present data, it's hard to distinguish between an
> > invalid cached range and a valid "cached hole".
>
> I wasn't exactly intending to permit caching over NFS.  That leads to fun
> making sure that the superblock you're caching isn't the one that has the
> cache in it.
>
> However, we will need to handle hole-punching being done on a cached netfs,
> even if that's just to completely invalidate the cache for that file.
>
> > With ->fiemap() you can at least make the distinction between a non-existing
> > and an UNWRITTEN extent.
>
> I can't use that for XFS, Ext4 or btrfs, I suspect.  Christoph and Dave's
> assertion is that the cache can't rely on the backing filesystem's metadata
> because these can arbitrarily insert or remove blocks of zeros to bridge or
> split extents.
>
> > You didn't say much about crash consistency or durability requirements of the
> > cache.  Since cachefiles only syncs the cache on shutdown, I guess you
> > rely on the hosting filesystem to provide the required ordering guarantees.
>
> There's an xattr on each file in the cache to record the state.  I use this
> to mark a cache file "open".  If, when I look up a file, the file is marked
> open, it is just discarded at the moment.
>
> Now, there are two types of data stored in the cache: data that has to be
> stored as a single complete blob and is replaced as such (e.g. symlinks and
> AFS dirs), and data that might be randomly modified (e.g. regular files).
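Side note on the "open" xattr above: in userspace terms the ordering
protocol is roughly the sketch below.  The xattr name and the values are
made up purely for illustration - this is not the actual cachefiles
on-disk format, just the shape of the open/seal dance:

#include <string.h>
#include <unistd.h>
#include <sys/xattr.h>

#define STATE_XATTR	"user.cache.state"	/* illustrative name only */

static int mark_open(int fd)
{
	/* Persist the "open" mark before any data is modified. */
	if (fsetxattr(fd, STATE_XATTR, "open", 4, 0) < 0)
		return -1;
	return fsync(fd);
}

static int mark_sealed(int fd)
{
	/* Seal only after the cached data itself is stable. */
	if (fsync(fd) < 0)
		return -1;
	return fsetxattr(fd, STATE_XATTR, "sealed", 6, 0);
}

static int is_sealed(int fd)
{
	char buf[16] = "";

	if (fgetxattr(fd, STATE_XATTR, buf, sizeof(buf) - 1) < 0)
		return 0;	/* missing state => treat as open/invalid */
	return strcmp(buf, "sealed") == 0;
}

Each transition needs its own explicit flush here, which is why batching
the "seal" step and piggybacking on periodic writeback (more below) looks
attractive.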
>
> For the former, I have code, though in yet another branch, that writes this in
> a tmpfile, sets the xattrs and then uses vfs_link(LINK_REPLACE) to cut over.
>
> For the latter, that's harder to do as it would require copying the data to
> the tmpfile before we're allowed to modify it.  However, if it's possible to
> create a tmpfile that's a CoW version of a data file, I could go down that
> route.
>
> But after I've written and sync'd the data, I set the xattr to mark the file
> not open.  At the moment I'm doing this too lazily, only doing it when a netfs
> file gets evicted or when the cache gets withdrawn, but I really need to add a
> queue of objects to be sealed as they're closed.  The balance is working out
> how often to do the sealing, as something like a shell script can do a lot of
> consecutive open/write/close ops.
>

You could add an internal vfs API wait_for_multiple_inodes_to_be_synced().
For example, xfs keeps the "LSN" on each inode, so once the transaction with
some LSN has been committed, all the relevant inodes, if not dirty, can be
declared as synced, without having to call fsync() on any file and without
having to force a transaction commit or any IO at all.

Since fscache takes care of submitting the IO, and it shouldn't care about
any specific time that the data/metadata hits the disk(?), you can make use
of the existing periodic writeback and rolling transaction commit and only
ever need to wait for that to happen before marking cache files "closed".

There was a discussion about fsyncing a range of files on LSFMM [1].  In the
last comment on the article dchinner argues why we already have that API
(and now also with io_uring()), but AFAIK we do not have a useful
wait_for_sync() API, and it doesn't need to be exposed to userspace at all.

[1] https://lwn.net/Articles/789024/

> > Anyway, how are those ordering requirements going to be handled when entire
> > indexing is in a file? You'd practically need to re-implement a filesystem
>
> Yes, the thought has occurred to me too.  I would be implementing a "simple"
> filesystem - and we have lots of those :-/.  The most obvious solution is to
> use the backing filesystem's metadata - except that that's not possible.
>
> > journal or only write cache updates to a temp file that can be discarded at
> > any time?
>
> It might involve keeping a bitmap of "open" blocks.  Those blocks get
> invalidated when the cache restarts.  The simplest solution would be to wipe
> the entire cache in such a situation, but that goes against one of the
> important features I want out of it.
>
> Actually, a journal of open and closed blocks might be better, though all I
> really need to store for each block is a 32-bit number.
>
> It's a particular problem if I'm doing DIO to the data storage area but
> buffering the changes to the metadata.  Further, the metadata and data might
> be on different media, just to add to the complexity.
>
> Another possibility is only to cull blocks when the parent file is culled.
> That probably makes more sense as, as long as the file is registered culled
> on disk first and I don't reuse the file slot too quickly, I can write to
> the data store before updating the metadata.
>

If I were you, I would try to avoid re-implementing a journaled filesystem
or a database for fscache and try to make use of the crash consistency
guarantees that filesystems already provide.  Namely, use the data dependency
already provided by temp files (rough sketch in the P.S. below).  It doesn't
need to be one temp file per cached file.

Always easier said than done ;-)

Thanks,
Amir.
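P.S.  To make "the data dependency already provided by temp files" concrete,
here is a rough userspace sketch of the blob cut-over pattern (paths, names
and error handling are illustrative only - cachefiles would do this with
in-kernel helpers such as the vfs_link() call mentioned above, not with
these syscalls):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace dirfd/name with a freshly written blob. */
static int replace_blob(int dirfd, const char *name,
			const void *data, size_t len)
{
	char procpath[64], tmpname[256];
	int fd;

	/* Unnamed file on the same filesystem as the cache directory. */
	fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0600);
	if (fd < 0)
		return -1;

	/* The data must be stable before the new name becomes visible. */
	if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0)
		goto err;

	/* linkat() cannot replace an existing name, so link the tmpfile
	 * to a temporary name and rename over the target, which is atomic. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	snprintf(tmpname, sizeof(tmpname), "%s.new", name);
	if (linkat(AT_FDCWD, procpath, dirfd, tmpname, AT_SYMLINK_FOLLOW) < 0)
		goto err;
	if (renameat(dirfd, tmpname, dirfd, name) < 0) {
		unlinkat(dirfd, tmpname, 0);
		goto err;
	}

	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}

After a crash you either see the old blob or the complete new one, and any
"*.new" leftovers can be discarded at any time - which is the property I'd
try to lean on instead of a private journal.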