From: Jeffle Xu
To: dhowells@redhat.com, linux-cachefs@redhat.com, xiang@kernel.org, chao@kernel.org, linux-erofs@lists.ozlabs.org
Cc: torvalds@linux-foundation.org, gregkh@linuxfoundation.org, willy@infradead.org,
    linux-fsdevel@vger.kernel.org, joseph.qi@linux.alibaba.com, bo.liu@linux.alibaba.com, tao.peng@linux.alibaba.com, gerry@linux.alibaba.com, eguan@linux.alibaba.com, linux-kernel@vger.kernel.org, luodaowen.backend@bytedance.com, tianzichen@kuaishou.com, yinxin.x@bytedance.com, zhangjiachen.jaycee@bytedance.com, zhujia.zj@bytedance.com
Subject: [PATCH v11 00/22] fscache,erofs: fscache-based on-demand read semantics
Date: Mon, 9 May 2022 15:40:06 +0800
Message-Id: <20220509074028.74954-1-jefflexu@linux.alibaba.com>
X-Mailing-List: linux-kernel@vger.kernel.org

changes since v10:
- rebase to 5.18-rc5
- append the patchset with a patch from Xin Yin, implementing the
  asynchronous readahead (patch 22)


Kernel Patchset
---------------
Git tree: https://github.com/lostjeffle/linux.git jingbo/dev-erofs-fscache-v11
Gitweb: https://github.com/lostjeffle/linux/commits/jingbo/dev-erofs-fscache-v11


User Guide for E2E Container Use Case
-------------------------------------
User guide: https://github.com/dragonflyoss/image-service/blob/fscache/docs/nydus-fscache.md
Video: https://youtu.be/F4IF2_DENXo


User Daemon for Quick Test
--------------------------
Git tree: https://github.com/lostjeffle/demand-read-cachefilesd.git main
Gitweb: https://github.com/lostjeffle/demand-read-cachefilesd


Tested-by: Zichen Tian
Tested-by: Jia Zhu


RFC: https://lore.kernel.org/all/YbRL2glGzjfZkVbH@B-P7TQMD6M-0146.local/t/
v1: https://lore.kernel.org/lkml/47831875-4bdd-8398-9f2d-0466b31a4382@linux.alibaba.com/T/
v2: https://lore.kernel.org/all/2946d871-b9e1-cf29-6d39-bcab30f2854f@linux.alibaba.com/t/
v3: https://lore.kernel.org/lkml/20220209060108.43051-1-jefflexu@linux.alibaba.com/T/
v4: https://lore.kernel.org/lkml/20220307123305.79520-1-jefflexu@linux.alibaba.com/T/#t
v5: https://lore.kernel.org/lkml/202203170912.gk2sqkaK-lkp@intel.com/T/
v6: https://lore.kernel.org/lkml/202203260720.uA5o7k5w-lkp@intel.com/T/
v7: https://lore.kernel.org/lkml/557bcf75-2334-5fbb-d2e0-c65e96da566d@linux.alibaba.com/T/
v8: https://lore.kernel.org/all/ac8571b8-0935-1f4f-e9f1-e424f059b5ed@linux.alibaba.com/T/
v9: https://lore.kernel.org/lkml/2067a5c7-4e24-f449-4676-811d12e9ab72@linux.alibaba.com/T/
v10: https://lore.kernel.org/all/20220425122143.56815-21-jefflexu@linux.alibaba.com/t/


[Background]
============
Nydus [1] is an image distribution service especially optimized for
distribution over the network. Nydus is an excellent container image
acceleration solution, since it only pulls data from remote when needed
(a.k.a. on-demand reading), and it also supports chunk-based
deduplication, compression, etc.

erofs (Enhanced Read-Only File System) is a filesystem designed for
read-only scenarios. (Documentation/filesystems/erofs.rst)

Over the past months we've been focusing on supporting the Nydus image
service with the in-kernel erofs format [2]. In that case, each container
image is organized as one bootstrap (metadata) and (optionally) multiple
data blobs in erofs format. A massive number of container images will be
stored on one machine.

To accelerate container startup (fetching container images from remote
and then starting the container), we do hope that the bootstrap & blob
files could support on-demand read. That is, erofs can be mounted and
accessed even when the bootstrap/data blob files have not been fully
downloaded. Then it'll have native performance once the data is
available locally.

That means we have to manage the cache state of the bootstrap/data blob
files (on a cache hit, read directly from the local cache; on a cache
miss, fetch the data somehow).
It would be painful and may be dumb for erofs to implement the cache
management itself. Thus we prefer fscache/cachefiles to do the cache
management instead.

The fscache on-demand read feature is implemented in a generic way in
the fscache subsystem, so that it can benefit other use cases and/or
filesystems.

[1] https://nydus.dev
[2] https://sched.co/pcdL


[Overall Design]
================
Please refer to patch 8 ("cachefiles: document on-demand read mode") for
more details.

When working in the original mode, cachefiles mainly serves as a local
cache for a remote networking fs, while in on-demand read mode,
cachefiles can work in scenarios where on-demand read semantics are
needed, e.g. container image distribution.

The essential difference between these two modes is what happens on a
cache miss. In the original mode, netfs itself fetches the data from
remote and then writes the fetched data into the cache file. In
on-demand read mode, a user daemon is responsible for fetching the data
and then feeding it to the kernel fscache side.

The on-demand read mode relies on a simple protocol used for
communication between the kernel and the user daemon.

The proposed implementation relies on the anonymous fd mechanism to
avoid depending on the format of the cache file. When a fscache
cachefile is opened for the first time, an anon_fd associated with the
cache file is sent to the user daemon. With the given anon_fd, the user
daemon can fetch and write data into the cache file in the background,
even when the kernel has not triggered a cache miss. Besides, the
write() syscall on the anon_fd ends up calling into the cachefiles
kernel module, which writes the data to the cache file in the latest
cache file format.

1. cache miss

On a cache miss, the cachefiles kernel module notifies the user daemon
with the anon_fd, along with the requested file range.
When notified, the user daemon needs to fetch the data of the requested
file range, and then write the fetched data into the cache file through
the given anonymous fd. When it has finished processing the request, the
user daemon needs to notify the kernel.

After notifying the user daemon, the kernel read routine waits until the
request has been handled by the user daemon. When it is woken up by the
notification from the user daemon, i.e. the corresponding hole has been
filled by the user daemon, it retries the read on the same file range.

2. cache hit

Once the data is ready in the cache file, netfs reads from the cache
file directly.


[Advantage of fscache-based on-demand read]
===========================================
1. Asynchronous prefetch

In the current mechanism, fscache is responsible for cache state
management, while the data plane (fetching data from local/remote on a
cache miss) is handled on the user daemon side, and can run even without
being driven by any file system request. In addition, if the cached data
is already available locally, fscache uses it instead of trapping to
user space again.

Therefore, unlike event-driven approaches, the fscache on-demand user
daemon can also fetch data (from remote) asynchronously in the
background, just like most multi-threaded HTTP downloaders.

2. Flexible request amplification

Since the data plane can be independently controlled by the user daemon,
the user daemon can also fetch more data from remote than the file
system actually requests for small I/O sizes. The data fetched in bulk
then becomes available at once, and fscache won't trap into the user
daemon again for it.

3. Support massive blobs

This mechanism can naturally support a large number of backing files,
and thus can benefit densely deployed scenarios.

In our use cases, one container image can be formed of one bootstrap
(required) and multiple chunk-deduplicated data blobs (optional).

For example, one container image for node.js will correspond to ~20
files in total.
In a densely deployed environment, there could be hundreds of containers
and thus thousands of backing files on one machine.


[Following Steps]
=================
The following improvements are on our TODO list, and will take shape as
development proceeds:

- Data blobs can be shared between multiple filesystems. In the current
  implementation, each filesystem registers a unique fscache_volume, so
  the backing file for a data blob cannot be shared between different
  erofs filesystems. Later we need to introduce a shared domain in order
  to share fscache_volume, so that data blobs can be shared between
  container images to some degree.

- In-memory extent-based data sharing, e.g., different files can share
  the same chunk of the data blob. In the current implementation, each
  erofs file maintains its own page cache, thus the page caches for the
  same chunk may be duplicated among multiple files sharing that chunk.

- Other useful features, including multiple cachefiles daemon support,
  etc.
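
To make the cache miss / cache hit flow from [Overall Design] and the
request amplification point more concrete, here is a toy user-space
model in Python. It is only a sketch under stated assumptions: it does
not use the real cachefiles interface, and every name in it (CHUNK,
RemoteBlob, CacheFile, UserDaemon, ondemand_read, the amplify parameter)
is hypothetical and not taken from the patchset. A direct method call
stands in for the kernel/daemon notification protocol.

```python
CHUNK = 4096  # hypothetical cache granularity, not taken from the patchset


class RemoteBlob:
    """Stands in for the remote data source; counts fetch round trips."""

    def __init__(self, data):
        self.data = data
        self.fetches = 0

    def fetch(self, off, length):
        self.fetches += 1
        return self.data[off:off + length]


class CacheFile:
    """Sparse local cache file: each chunk is either present or a hole."""

    def __init__(self, size):
        self.size = size
        self.chunks = {}  # chunk index -> bytes

    def missing(self, off, length):
        """Return the indices of the holes overlapping [off, off + length)."""
        first, last = off // CHUNK, (off + length - 1) // CHUNK
        return [i for i in range(first, last + 1) if i not in self.chunks]

    def write(self, off, data):
        """Fill holes; models the daemon writing through the anon_fd."""
        assert off % CHUNK == 0
        for pos in range(0, len(data), CHUNK):
            self.chunks[(off + pos) // CHUNK] = data[pos:pos + CHUNK]

    def read(self, off, length):
        first, last = off // CHUNK, (off + length - 1) // CHUNK
        buf = b"".join(self.chunks[i] for i in range(first, last + 1))
        start = off - first * CHUNK
        return buf[start:start + length]


class UserDaemon:
    """Fetches data from remote and feeds it into the cache file."""

    def __init__(self, remote, cache, amplify=1):
        self.remote = remote
        self.cache = cache
        self.amplify = amplify  # minimum number of chunks per fetch

    def fill(self, off, length):
        """Handle a miss (or prefetch); returning models the completion
        notification sent back to the kernel."""
        start = off // CHUNK * CHUNK
        end = (off + length + CHUNK - 1) // CHUNK * CHUNK
        # Request amplification: fetch more than was asked for.
        end = max(end, start + self.amplify * CHUNK)
        end = min(end, self.cache.size)
        self.cache.write(start, self.remote.fetch(start, end - start))


def ondemand_read(cache, daemon, off, length):
    """Kernel-side read routine in this model."""
    if cache.missing(off, length):   # cache miss: notify the daemon, wait
        daemon.fill(off, length)
    return cache.read(off, length)   # cache hit, or retry after the fill
```

With amplify=2, a 50-byte read at offset 100 triggers a single fetch
covering chunks 0 and 1, so a later read at offset 5000 is served from
the cache without involving the daemon. Calling daemon.fill() directly,
without any read, models the asynchronous prefetch case.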


Jeffle Xu (21):
  cachefiles: extract write routine
  cachefiles: notify the user daemon when looking up cookie
  cachefiles: unbind cachefiles gracefully in on-demand mode
  cachefiles: notify the user daemon when withdrawing cookie
  cachefiles: implement on-demand read
  cachefiles: enable on-demand read mode
  cachefiles: add tracepoints for on-demand read mode
  cachefiles: document on-demand read mode
  erofs: make erofs_map_blocks() generally available
  erofs: add fscache mode check helper
  erofs: register fscache volume
  erofs: add fscache context helper functions
  erofs: add anonymous inode caching metadata for data blobs
  erofs: add erofs_fscache_read_folios() helper
  erofs: register fscache context for primary data blob
  erofs: register fscache context for extra data blobs
  erofs: implement fscache-based metadata read
  erofs: implement fscache-based data read for non-inline layout
  erofs: implement fscache-based data read for inline layout
  erofs: implement fscache-based data readahead
  erofs: add 'fsid' mount option

Xin Yin (1):
  erofs: change to use asynchronous io for fscache readpage/readahead

 .../filesystems/caching/cachefiles.rst   | 178 ++++++
 fs/cachefiles/Kconfig                    |  12 +
 fs/cachefiles/Makefile                   |   1 +
 fs/cachefiles/daemon.c                   | 117 +++-
 fs/cachefiles/interface.c                |   2 +
 fs/cachefiles/internal.h                 |  78 +++
 fs/cachefiles/io.c                       |  76 ++-
 fs/cachefiles/namei.c                    |  16 +-
 fs/cachefiles/ondemand.c                 | 503 +++++++++++++++++
 fs/erofs/Kconfig                         |  10 +
 fs/erofs/Makefile                        |   1 +
 fs/erofs/data.c                          |  26 +-
 fs/erofs/fscache.c                       | 522 ++++++++++++++++++
 fs/erofs/inode.c                         |   4 +
 fs/erofs/internal.h                      |  49 ++
 fs/erofs/super.c                         | 105 +++-
 fs/erofs/sysfs.c                         |   4 +-
 include/linux/fscache.h                  |   1 +
 include/linux/netfs.h                    |   1 +
 include/trace/events/cachefiles.h        | 176 ++++++
 include/uapi/linux/cachefiles.h          |  68 +++
 21 files changed, 1871 insertions(+), 79 deletions(-)
 create mode 100644 fs/cachefiles/ondemand.c
 create mode 100644 fs/erofs/fscache.c
 create mode 100644 include/uapi/linux/cachefiles.h

-- 
2.27.0