Message-ID: <941aff31-6aa4-4c37-bb94-547c46250304@linux.alibaba.com>
Date: Tue, 12 Dec 2023 08:50:56 +0800
Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem
To: Eric Curtin, Linux Kernel Mailing List, linux-unionfs@vger.kernel.org, linux-erofs@lists.ozlabs.org
Cc: Daan De Meyer, Stephen Smoogen, Yariv Rachmani, Daniel Walsh, Douglas Landgraf, Alexander Larsson, Colin Walters, Brian Masney, Eric Chanudet, Pavol Brilla, Lokesh Mandvekar, Petr Šabata, Lennart Poettering, Luca Boccassi, Neal Gompa, nvdimm@lists.linux.dev
From: Gao Xiang

Hi,

On 2023/12/11 21:45, Eric Curtin wrote:
> Hi All,
>
> We have recently been working on something called initoverlayfs, which
> we sent an RFC email to the systemd and dracut mailing lists to gather
> feedback. This is an exploratory email as we are unsure if a solution
> like this fits in userspace or kernelspace and we would like to gather
> feedback from the community.
>
> To describe this briefly, the idea is to use erofs+overlayfs as an
> initial filesystem rather than an initramfs. The benefits are, we can
> start userspace significantly faster as we do not have to unpack,
> decompress and populate a tmpfs upfront, instead we can rely on
> transparent decompression like lz4hc instead. What we believe is the
> greater benefit, is that we can have less fear of initial filesystem
> bloat, as when you are using transparent decompression you only pay
> for decompressing the bytes you actually use.
>
> We implemented the first version of this, by creating a small
> initramfs that only contains storage drivers, udev and a couple of 100
> lines of C code, just enough userspace to mount an erofs with
> transient overlay. Then we build a second initramfs which has all the
> contents of a normal everyday initramfs with all the bells and
> whistles and convert this into an erofs.
>
> Then at boot time you basically transition to this erofs+overlayfs in
> userspace and everything works as normal as it would in a traditional
> initramfs.
>
> The current implementation looks like this:
>
> ```
> From the filesystem perspective (roughly):
>
> fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
>
> From the process perspective (roughly):
>
> fw -> bootloader -> kernel -> storage-init -> init ----------------->
> ```
>
> But we have been asking the question whether we should be implementing
> this in kernelspace so it looks more like:
>
> ```
> From the filesystem perspective (roughly):
>
> fw -> bootloader -> kernel -> initoverlayfs -> rootfs
>
> From the process perspective (roughly):
>
> fw -> bootloader -> kernel -> init ----------------->
> ```
>
> The kind of questions we are asking are: Would it be possible to
> implement this in kernelspace so we could just mount the initial
> filesystem data as an erofs+overlayfs filesystem without unpacking,
> decompressing, copying the data to a tmpfs, etc.? Could we memmap the
> initramfs buffer and mount it like an erofs? What other considerations
> should be taken into account?

Since Linux 5.15, EROFS has supported the FSDAX feature, so it can be
mounted from persistent memory devices with `-o dax`.
That is already used for virtualization cases such as VM rootfs and
container image passthrough with virtio-pmem [1], to share page cache
memory between host and guest.

For non-virtualization cases, I guess you could try the `memmap` kernel
option [2], so that the bootloader specifies a memory region containing
an EROFS rootfs and a customized init that boots as erofs+overlayfs, at
least for `initoverlayfs`. The main benefit is that the memory region
specified by the bootloader can be used directly for mounting. But I
have never tried whether this option actually works.

Furthermore, compared to traditional ramdisks, direct access can avoid
the page cache entirely for uncompressed files, since the unencoded data
can be used directly as mmapped memory. For compressed files, it still
needs the page cache to support mmapped access, but we could adapt more
for persistent memory scenarios, such as disabling cached decompression,
compared to previous block devices.

I'm not sure if it's worth implementing this in kernelspace since it's
out of scope of an individual filesystem anyway.

[1] https://www.qemu.org/docs/master/system/devices/virtio-pmem.html
[2] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap

Thanks,
Gao Xiang

>
> Echo'ing Lennart we must also "keep in mind from the beginning how
> authentication of every component of your process shall work" as
> that's essential to a couple of different Linux distributions today.
>
> We kept this email short because we want people to read it and avoid
> duplicating information from elsewhere. The effort is described from
> different perspectives in the systemd/dracut RFC email and github
> README.md if you'd like to learn more, it's worth reading the
> discussion in the systemd mailing list:
>
> https://marc.info/?l=systemd-devel&m=170214639006704&w=2
>
> https://github.com/containers/initoverlayfs/blob/main/README.md
>
> We also received feedback informally in the community that it would be
> nice if we could optionally use btrfs as an alternative.
>
> Is mise le meas/Regards,
>
> Eric Curtin
>
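
A minimal sketch of the steps suggested in the reply above: reserving a
memory region with the `memmap` kernel option, mounting EROFS from it
with `-o dax`, and layering a transient overlay on top. It only builds
the strings involved and does not touch the system. The device
`/dev/pmem0`, the mount points, and all function names are assumptions
for illustration, and as the reply notes, this path is untested.

```python
# Sketch only: builds the kernel parameter and mount command lines.
# Device names, paths and function names are hypothetical.

_UNITS = (("G", 1 << 30), ("M", 1 << 20), ("K", 1 << 10))


def _fmt(n):
    """Format a byte count with the largest K/M/G suffix that divides it evenly."""
    for suffix, factor in _UNITS:
        if n and n % factor == 0:
            return f"{n // factor}{suffix}"
    return str(n)


def memmap_pmem_param(size, start):
    """Return memmap=nn[KMG]!ss[KMG]: mark `size` bytes at offset `start`
    as a persistent-memory region on the kernel command line."""
    if size <= 0 or start < 0:
        raise ValueError("size must be positive and start non-negative")
    return f"memmap={_fmt(size)}!{_fmt(start)}"


def erofs_dax_mount_cmd(device="/dev/pmem0", target="/run/initoverlay/lower"):
    """Mount command for an EROFS image on a pmem device with `-o dax`.
    Assumes the reserved region shows up as `device`."""
    return ["mount", "-t", "erofs", "-o", "dax", device, target]


def overlay_mount_cmd(lower, upper, work, target):
    """Mount command for a transient overlay over the read-only EROFS.
    `upper` and `work` are assumed to live on the same tmpfs."""
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]


if __name__ == "__main__":
    print(memmap_pmem_param(4 << 30, 12 << 30))  # memmap=4G!12G
    print(" ".join(erofs_dax_mount_cmd()))
    print(" ".join(overlay_mount_cmd(
        "/run/initoverlay/lower",
        "/run/initoverlay/upper",
        "/run/initoverlay/work",
        "/sysroot",
    )))
```

The bootloader would still have to place the EROFS image at the physical
address named in the `memmap` parameter; that step is outside this
sketch.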