From: "luodaowen.backend"
To: jefflexu@linux.alibaba.com
Cc: bo.liu@linux.alibaba.com, chao@kernel.org, dhowells@redhat.com,
    eguan@linux.alibaba.com, gerry@linux.alibaba.com,
    gregkh@linuxfoundation.org, joseph.qi@linux.alibaba.com,
    linux-cachefs@redhat.com, linux-erofs@lists.ozlabs.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    tao.peng@linux.alibaba.com, torvalds@linux-foundation.org,
    willy@infradead.org, xiang@kernel.org
Subject: Re: [PATCH v4 00/21] fscache,erofs: fscache-based on-demand read semantics
Date: Mon, 14 Mar 2022 17:24:02 +0800
Message-Id: <20220314092402.43044-1-luodaowen.backend@bytedance.com>
In-Reply-To: <20220307123305.79520-1-jefflexu@linux.alibaba.com>
References:
 <20220307123305.79520-1-jefflexu@linux.alibaba.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

We're also interested in this approach and hope a formal solution lands
upstream so that we can make use of it as well.

Thanks,
daowen

On Mon, 7 Mar 2022 20:32:44 +0800 Jeffle Xu wrote:
>
> changes since v3:
> - cachefiles: The current implementation relies on the anonymous fd
>   mechanism to avoid depending on the format of the cache file. When a
>   cache file is opened for the first time, an anon_fd associated with
>   the cache file is sent to the user daemon. The user daemon can then
>   fetch data and write it to the cache file through the given anon_fd.
>   A write to the anon_fd ends up in the cachefiles kernel module, which
>   writes the data to the cache file in the latest cache file format.
>   Thus the on-demand read mode keeps working no matter how the cache
>   file format changes in the future. (patch 4)
> - cachefiles: the on-demand read mode reuses the existing
>   "/dev/cachefiles" devnode (patch 3)
> - erofs: squash several commits implementing readahead into a single
>   commit (patch 20)
> - erofs: refactor the readahead routine, so that it can read multiple
>   pages each round (patch 20)
> - patch 1 and 7 have already been cherry-picked by the maintainers, but
>   have not been merged to the master. Keep them here for completeness.
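
The anon_fd indirection in the first changelog item can be pictured with a
small toy model. This is only a sketch in Python under loud assumptions:
CacheFileV1, CacheFileV2, AnonFd and daemon_fill are hypothetical names made
up for illustration, the two "layouts" are invented, and nothing here is the
cachefiles interface. It only shows why a daemon that writes through a handle
does not need to know the cache file layout.

```python
# Toy model only: none of these names exist in the kernel or in any daemon.
import struct


class CacheFileV1:
    """Hypothetical cache file layout: raw data, no header."""

    def __init__(self):
        self.disk = bytearray()

    def store(self, offset, data):
        end = offset + len(data)
        if len(self.disk) < end:
            # Grow the backing buffer with zeroes up to the end of the write.
            self.disk.extend(b"\0" * (end - len(self.disk)))
        self.disk[offset:end] = data

    def load(self, offset, length):
        return bytes(self.disk[offset:offset + length])


class CacheFileV2(CacheFileV1):
    """Hypothetical newer layout: an 8-byte header precedes the data."""

    HEADER = struct.pack("<4sI", b"CFv2", 2)

    def __init__(self):
        super().__init__()
        self.disk = bytearray(self.HEADER)

    def store(self, offset, data):
        super().store(offset + len(self.HEADER), data)

    def load(self, offset, length):
        return super().load(offset + len(self.HEADER), length)


class AnonFd:
    """Stands in for the anon_fd: the daemon only ever sees pwrite()."""

    def __init__(self, cache_file):
        self._cache_file = cache_file

    def pwrite(self, data, offset):
        # The layout is decided behind the handle, not by the daemon.
        self._cache_file.store(offset, data)
        return len(data)


def daemon_fill(fd, offset, data):
    """Daemon-side code: identical for every cache file layout."""
    return fd.pwrite(data, offset)


for layout in (CacheFileV1(), CacheFileV2()):
    daemon_fill(AnonFd(layout), 4, b"erofs")
    print(type(layout).__name__, layout.load(4, 5))
```

In this model daemon_fill() is unchanged when the layout class changes, which
is the property the changelog item describes for the real anon_fd.
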
>
>
> RFC: https://lore.kernel.org/all/YbRL2glGzjfZkVbH@B-P7TQMD6M-0146.local/t/
> v1: https://lore.kernel.org/lkml/47831875-4bdd-8398-9f2d-0466b31a4382@linux.alibaba.com/T/
> v2: https://lore.kernel.org/all/2946d871-b9e1-cf29-6d39-bcab30f2854f@linux.alibaba.com/t/
> v3: https://lore.kernel.org/lkml/20220209060108.43051-1-jefflexu@linux.alibaba.com/T/
>
> [Background]
> ============
> Nydus [1] is a container image distribution service specially optimised
> for distribution over network. Nydus is an excellent container image
> acceleration solution, since it only pulls data from remote when it's
> really needed, a.k.a. on-demand reading.
>
> erofs (Enhanced Read-Only File System) is a filesystem specially
> optimised for read-only scenarios. (Documentation/filesystems/erofs.rst)
>
> Recently we have been focusing on erofs in the container image
> distribution scenario [2], trying to combine it with nydus. In this case,
> erofs can be mounted from one bootstrap file (metadata) with (optional)
> multiple data blob files (data) stored on another local filesystem. (All
> these files are actually image files in erofs disk format.)
>
> To accelerate container startup (fetching the container image from remote
> and then starting the container), we do hope that the bootstrap blob file
> can support demand read. That is, erofs can be mounted and accessed
> even when the bootstrap/data blob files have not been fully downloaded.
>
> That means we have to manage the cache state of the bootstrap/data blob
> files (on a cache hit, read directly from the local cache; on a cache
> miss, fetch the data somehow). It would be painful and may be dumb for
> erofs to implement the cache management itself. Thus we prefer
> fscache/cachefiles to do the cache management. Besides, the demand-read
> feature shall be general, and it can benefit other use cases if it is
> implemented at the fscache level.
>
> [1] https://nydus.dev
> [2] https://sched.co/pcdL
>
>
> [Overall Design]
> ================
>
> Please refer to patch 6 ("cachefiles: document on-demand read mode") for
> more details.
>
> When working in original mode, cachefiles mainly serves as a local cache
> for a remote networking fs, while in on-demand read mode, cachefiles can
> boost scenarios where on-demand read semantics are needed, e.g. container
> image distribution.
>
> The essential difference between these two modes is that, in original
> mode, on a cache miss, netfs itself fetches the data from remote and then
> writes the fetched data into the cache file, while in on-demand read mode,
> a user daemon is responsible for fetching the data and then writing it to
> the cache file.
>
> The on-demand read mode relies on a simple protocol used for communication
> between the kernel and the user daemon.
>
> The current implementation relies on the anonymous fd mechanism to avoid
> depending on the format of the cache file. When a cache file is opened
> for the first time, an anon_fd associated with the cache file is sent to
> the user daemon. With the given anon_fd, the user daemon can fetch data
> and write it into the cache file in the background, even when the kernel
> has not triggered a cache miss. Besides, a write() syscall to the anon_fd
> ends up in the cachefiles kernel module, which writes the data to the
> cache file in the latest cache file format.
>
> 1. cache miss
> On a cache miss, the cachefiles kernel module sends the user daemon the
> anon_fd, along with the requested file range. When notified, the user
> daemon needs to fetch the data of the requested file range, and then write
> the fetched data into the cache file with the given anonymous fd. When it
> has finished processing the request, the user daemon needs to notify the
> kernel.
>
> After notifying the user daemon, the kernel read routine hangs there
> until the request is handled by the user daemon. When it is woken by the
> notification from the user daemon, i.e. the corresponding hole has been
> filled by the user daemon, it retries the read from the same file range.
>
> 2. cache hit
> Once the data is ready in the cache file, netfs reads from the cache file
> directly.
>
>
> [Advantage of fscache-based demand-read]
> ========================================
> 1. Asynchronous Prefetch
> In the current mechanism, fscache is responsible for cache state
> management, while the data plane (fetching data from local/remote on a
> cache miss) is handled on the user daemon side.
>
> If the data is already in the backing file, the upper fs (e.g. erofs)
> reads from the backing file directly and is not trapped to user space
> anymore. Thus the user daemon can fetch data (from remote) asynchronously
> in the background, and thus accelerate backing file access to some degree.
>
> 2. Support massive blob files
> Besides, this mechanism supports a large number of backing files, and
> thus can benefit densely deployed scenarios.
>
> In our use case, one container image can correspond to one bootstrap file
> (required) and multiple data blob files (optional). For example, one
> container image for node.js corresponds to ~20 files in total. In a
> densely deployed environment, there could be as many as hundreds of
> containers and thus thousands of backing files on one machine.
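
The cache miss / cache hit flow and the asynchronous prefetch quoted above can
be modeled roughly as below. This is a toy in-memory sketch in Python, not the
cachefiles protocol: ToyCache, BLOCK, daemon_write and daemon_handle are
hypothetical names, the 4096-byte block size is an assumption, and the
blocking wait in the kernel read path is replaced by a plain function call.

```python
# Toy model only: every name below is hypothetical, not the cachefiles API.
BLOCK = 4096  # assumed granularity of the cache file, for illustration


class ToyCache:
    def __init__(self, remote):
        self.remote = remote   # bytes that live on the remote side
        self.blocks = {}       # block index -> data already in the cache file
        self.requests = []     # (offset, length) requests sent to the daemon

    # --- user daemon side ---
    def daemon_write(self, index):
        """Fetch one block from remote and write it into the cache file."""
        start = index * BLOCK
        self.blocks[index] = self.remote[start:start + BLOCK]

    def daemon_handle(self, offset, length):
        """Fill every hole in the requested range; returning means 'done'."""
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        for index in range(first, last + 1):
            if index not in self.blocks:
                self.daemon_write(index)

    # --- kernel side ---
    def read(self, offset, length):
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        if any(i not in self.blocks for i in range(first, last + 1)):
            # Cache miss: notify the daemon with the requested range and
            # wait (here: a synchronous call) until it reports completion.
            self.requests.append((offset, length))
            self.daemon_handle(offset, length)
        # Cache hit, or the retry after the hole has been filled.
        data = b"".join(self.blocks[i] for i in range(first, last + 1))
        skip = offset - first * BLOCK
        return data[skip:skip + length]


remote = bytes(range(256)) * 64   # 16 KiB of sample data, i.e. 4 blocks
cache = ToyCache(remote)
cache.read(5000, 100)             # miss: one request goes to the daemon
cache.read(5000, 100)             # hit: served from the cache file
cache.daemon_write(3)             # prefetch without any kernel request
cache.read(3 * BLOCK, 10)         # hit, thanks to the prefetch
print(len(cache.requests))
```

Only the first read reaches the daemon in this model. The prefetched block is
served without a request, which is the effect described under "Asynchronous
Prefetch".
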
>
>
> [Test]
> ==========
> You could start a quick test by
> https://github.com/lostjeffle/demand-read-cachefilesd
>
>
>
> Jeffle Xu (21):
>   fscache: export fscache_end_operation()
>   cachefiles: export write routine
>   cachefiles: introduce on-demand read mode
>   cachefiles: notify user daemon with anon_fd when opening cache file
>   cachefiles: implement on-demand read
>   cachefiles: document on-demand read mode
>   erofs: use meta buffers for erofs_read_superblock()
>   erofs: export erofs_map_blocks()
>   erofs: add mode checking helper
>   erofs: register global fscache volume
>   erofs: add cookie context helper functions
>   erofs: add anonymous inode managing page cache of blob file
>   erofs: add erofs_fscache_read_pages() helper
>   erofs: register cookie context for bootstrap blob
>   erofs: implement fscache-based metadata read
>   erofs: implement fscache-based data read for non-inline layout
>   erofs: implement fscache-based data read for inline layout
>   erofs: register cookie context for data blobs
>   erofs: implement fscache-based data read for data blobs
>   erofs: implement fscache-based data readahead
>   erofs: add 'uuid' mount option
>
>  .../filesystems/caching/cachefiles.rst        | 159 +++++
>  fs/cachefiles/Kconfig                         |  11 +
>  fs/cachefiles/daemon.c                        | 576 +++++++++++++++++-
>  fs/cachefiles/internal.h                      |  48 ++
>  fs/cachefiles/io.c                            |  72 ++-
>  fs/cachefiles/namei.c                         |  16 +-
>  fs/erofs/Makefile                             |   3 +-
>  fs/erofs/data.c                               |  18 +-
>  fs/erofs/fscache.c                            | 496 +++++++++++++++
>  fs/erofs/inode.c                              |   6 +-
>  fs/erofs/internal.h                           |  30 +
>  fs/erofs/super.c                              | 106 +++-
>  fs/fscache/internal.h                         |  11 -
>  fs/nfs/fscache.c                              |   8 -
>  include/linux/fscache.h                       |  15 +
>  include/linux/netfs.h                         |   1 +
>  include/trace/events/cachefiles.h             |   2 +
>  include/uapi/linux/cachefiles.h               |  48 ++
>  18 files changed, 1526 insertions(+), 100 deletions(-)
>  create mode 100644 fs/erofs/fscache.c
>  create mode 100644 include/uapi/linux/cachefiles.h
>
> --
> 2.27.0