2023-12-11 13:46:55

by Eric Curtin

Subject: [RFC KERNEL] initoverlayfs - a scalable initial filesystem

Hi All,

We have recently been working on something called initoverlayfs, and
we sent an RFC email about it to the systemd and dracut mailing lists
to gather feedback. This is an exploratory email, as we are unsure
whether a solution like this belongs in userspace or kernelspace, and
we would like feedback from the community.

To describe this briefly, the idea is to use erofs+overlayfs as an
initial filesystem rather than an initramfs. The first benefit is that
we can start userspace significantly faster: we do not have to unpack,
decompress and populate a tmpfs upfront, and can instead rely on
transparent decompression with an algorithm such as lz4hc. What we
believe is the greater benefit is that we have less to fear from
initial filesystem bloat, because with transparent decompression you
only pay for decompressing the bytes you actually use.

We implemented the first version of this by creating a small
initramfs that contains only storage drivers, udev and a couple of
hundred lines of C code: just enough userspace to mount an erofs with
a transient overlay. We then build a second initramfs, which has all
the contents of a normal everyday initramfs with all the bells and
whistles, and convert it into an erofs.

Then at boot time you basically transition to this erofs+overlayfs in
userspace and everything works as normal as it would in a traditional
initramfs.
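
As a rough illustration of what "just enough userspace to mount an
erofs with transient overlay" can involve, here is a minimal sketch
built on mount(2). It is not the initoverlayfs code: the mount points
under /run/initoverlayfs, the function names and the choice of a tmpfs
for the writable layer are all assumptions made for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

/* Hypothetical mount points; none of these paths come from initoverlayfs. */
#define LOWER  "/run/initoverlayfs/lower"   /* read-only erofs image */
#define RW     "/run/initoverlayfs/rw"      /* tmpfs holding upper + work */
#define MERGED "/run/initoverlayfs/merged"  /* resulting overlay */

/* Build the overlayfs option string.
 * Returns 0 on success, -1 if the string does not fit in buf. */
int overlay_opts(char *buf, size_t len, const char *lower,
                 const char *upper, const char *work)
{
    int n = snprintf(buf, len, "lowerdir=%s,upperdir=%s,workdir=%s",
                     lower, upper, work);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount the erofs image read-only, then a tmpfs, then the overlay on top.
 * Assumes the three mount points already exist and that the caller has
 * CAP_SYS_ADMIN. Returns 0 on success or a negative errno value. */
int mount_initial_fs(const char *erofs_dev)
{
    char opts[512];

    if (mount(erofs_dev, LOWER, "erofs", MS_RDONLY, NULL) < 0)
        return -errno;
    if (mount("tmpfs", RW, "tmpfs", 0, "mode=0755") < 0)
        return -errno;
    /* upperdir and workdir must exist, on the same filesystem,
     * before the overlay is mounted. */
    if (mkdir(RW "/upper", 0755) < 0 || mkdir(RW "/work", 0755) < 0)
        return -errno;
    if (overlay_opts(opts, sizeof(opts), LOWER, RW "/upper", RW "/work") < 0)
        return -ENAMETOOLONG;
    if (mount("overlay", MERGED, "overlay", 0, opts) < 0)
        return -errno;
    return 0;
}
```

Because the upper layer lives on a tmpfs, writes made during early
boot are discarded once the system switches to the real rootfs, which
is one way to read "transient" here.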

The current implementation looks like this:

```
From the filesystem perspective (roughly):

fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs

From the process perspective (roughly):

fw -> bootloader -> kernel -> storage-init -> init ----------------->
```

But we have been asking the question whether we should be implementing
this in kernelspace so it looks more like:

```
From the filesystem perspective (roughly):

fw -> bootloader -> kernel -> initoverlayfs -> rootfs

From the process perspective (roughly):

fw -> bootloader -> kernel -> init ----------------->
```

The kind of questions we are asking are: Would it be possible to
implement this in kernelspace so we could just mount the initial
filesystem data as an erofs+overlayfs filesystem without unpacking,
decompressing, copying the data to a tmpfs, etc.? Could we memmap the
initramfs buffer and mount it like an erofs? What other considerations
should be taken into account?

Echoing Lennart, we must also "keep in mind from the beginning how
authentication of every component of your process shall work" as
that's essential to a couple of different Linux distributions today.

We kept this email short because we want people to read it and to
avoid duplicating information from elsewhere. The effort is described
from different perspectives in the systemd/dracut RFC email and the
GitHub README.md if you'd like to learn more. The discussion on the
systemd mailing list is also worth reading:

https://marc.info/?l=systemd-devel&m=170214639006704&w=2

https://github.com/containers/initoverlayfs/blob/main/README.md

We also received feedback informally in the community that it would be
nice if we could optionally use btrfs as an alternative.

Is mise le meas/Regards,

Eric Curtin


2023-12-11 14:21:00

by Neal Gompa

Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem

On Mon, Dec 11, 2023 at 8:46 AM Eric Curtin wrote:
>
> Hi All,
>
> We have recently been working on something called initoverlayfs, which
> we sent an RFC email to the systemd and dracut mailing lists to gather
> feedback. This is an exploratory email as we are unsure if a solution
> like this fits in userspace or kernelspace and we would like to gather
> feedback from the community.
>
> To describe this briefly, the idea is to use erofs+overlayfs as an
> initial filesystem rather than an initramfs. The benefits are, we can
> start userspace significantly faster as we do not have to unpack,
> decompress and populate a tmpfs upfront, instead we can rely on
> transparent decompression like lz4hc instead. What we believe is the
> greater benefit, is that we can have less fear of initial filesystem
> bloat, as when you are using transparent decompression you only pay
> for decompressing the bytes you actually use.
>
> We implemented the first version of this, by creating a small
> initramfs that only contains storage drivers, udev and a couple of hundred
> lines of C code, just enough userspace to mount an erofs with
> transient overlay. Then we build a second initramfs which has all the
> contents of a normal everyday initramfs with all the bells and
> whistles and convert this into an erofs.
>
> Then at boot time you basically transition to this erofs+overlayfs in
> userspace and everything works as normal as it would in a traditional
> initramfs.
>
> The current implementation looks like this:
>
> ```
> From the filesystem perspective (roughly):
>
> fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
>
> From the process perspective (roughly):
>
> fw -> bootloader -> kernel -> storage-init -> init ----------------->
> ```
>
> But we have been asking the question whether we should be implementing
> this in kernelspace so it looks more like:
>
> ```
> From the filesystem perspective (roughly):
>
> fw -> bootloader -> kernel -> initoverlayfs -> rootfs
>
> From the process perspective (roughly):
>
> fw -> bootloader -> kernel -> init ----------------->
> ```
>
> The kind of questions we are asking are: Would it be possible to
> implement this in kernelspace so we could just mount the initial
> filesystem data as an erofs+overlayfs filesystem without unpacking,
> decompressing, copying the data to a tmpfs, etc.? Could we memmap the
> initramfs buffer and mount it like an erofs? What other considerations
> should be taken into account?
>
> Echoing Lennart, we must also "keep in mind from the beginning how
> authentication of every component of your process shall work" as
> that's essential to a couple of different Linux distributions today.
>
> We kept this email short because we want people to read it and avoid
> duplicating information from elsewhere. The effort is described from
> different perspectives in the systemd/dracut RFC email and github
> README.md if you'd like to learn more, it's worth reading the
> discussion in the systemd mailing list:
>
> https://marc.info/?l=systemd-devel&m=170214639006704&w=2
>
> https://github.com/containers/initoverlayfs/blob/main/README.md
>
> We also received feedback informally in the community that it would be
> nice if we could optionally use btrfs as an alternative.
>
> Is mise le meas/Regards,
>
> Eric Curtin
>

Adding linux-btrfs@ to the discussion, because I think it would be
useful to have their input on what handling btrfs as an alternative
to erofs+overlayfs would look like.



--
真実はいつも一つ!/ Always, there's only one truth!

2023-12-12 00:51:41

by Gao Xiang

Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem

Hi,

On 2023/12/11 21:45, Eric Curtin wrote:
> Hi All,
>
> We have recently been working on something called initoverlayfs, which
> we sent an RFC email to the systemd and dracut mailing lists to gather
> feedback. This is an exploratory email as we are unsure if a solution
> like this fits in userspace or kernelspace and we would like to gather
> feedback from the community.
>
> To describe this briefly, the idea is to use erofs+overlayfs as an
> initial filesystem rather than an initramfs. The benefits are, we can
> start userspace significantly faster as we do not have to unpack,
> decompress and populate a tmpfs upfront, instead we can rely on
> transparent decompression like lz4hc instead. What we believe is the
> greater benefit, is that we can have less fear of initial filesystem
> bloat, as when you are using transparent decompression you only pay
> for decompressing the bytes you actually use.
>
> We implemented the first version of this, by creating a small
> initramfs that only contains storage drivers, udev and a couple of hundred
> lines of C code, just enough userspace to mount an erofs with
> transient overlay. Then we build a second initramfs which has all the
> contents of a normal everyday initramfs with all the bells and
> whistles and convert this into an erofs.
>
> Then at boot time you basically transition to this erofs+overlayfs in
> userspace and everything works as normal as it would in a traditional
> initramfs.
>
> The current implementation looks like this:
>
> ```
> From the filesystem perspective (roughly):
>
> fw -> bootloader -> kernel -> mini-initramfs -> initoverlayfs -> rootfs
>
> From the process perspective (roughly):
>
> fw -> bootloader -> kernel -> storage-init -> init ----------------->
> ```
>
> But we have been asking the question whether we should be implementing
> this in kernelspace so it looks more like:
>
> ```
> From the filesystem perspective (roughly):
>
> fw -> bootloader -> kernel -> initoverlayfs -> rootfs
>
> From the process perspective (roughly):
>
> fw -> bootloader -> kernel -> init ----------------->
> ```
>
> The kind of questions we are asking are: Would it be possible to
> implement this in kernelspace so we could just mount the initial
> filesystem data as an erofs+overlayfs filesystem without unpacking,
> decompressing, copying the data to a tmpfs, etc.? Could we memmap the
> initramfs buffer and mount it like an erofs? What other considerations
> should be taken into account?

Since Linux 5.15, EROFS has supported the FSDAX feature, so it can
mount from persistent memory devices with `-o dax`.

That is already used for virtualization cases like VM rootfs and
container image passthrough with virtio-pmem [1] to share page cache
memory between host and guest.

For non-virtualization cases, I guess you could try using the `memmap`
kernel option [2] to have the bootloader specify a memory region that
contains an EROFS rootfs and a customized init, booting as
erofs+overlayfs, at least for `initoverlayfs`. The main benefit is
that the memory region specified by the bootloader can be used
directly for mounting. I have never tested whether this option
actually works, though.
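
To make the suggestion concrete, here is a small sketch of the two
pieces involved: the `memmap=<size>!<start>` kernel parameter, whose
`!` form marks a region as persistent memory, and a read-only EROFS
mount with the `dax` option. The function names, the MiB units and the
device name in the usage note are assumptions made for the example,
and how the bootloader gets the image into that region is left out,
since that is the untested part.

```c
#include <stdio.h>
#include <sys/mount.h>

/* Format a "memmap=<size>M!<start>M" kernel parameter. The '!' form
 * asks the kernel to treat [start, start + size) as persistent memory.
 * Sizes are given in MiB here purely for convenience.
 * Returns 0 on success, -1 if the string does not fit in buf. */
int memmap_param(char *buf, size_t len,
                 unsigned long long size_mib, unsigned long long start_mib)
{
    int n = snprintf(buf, len, "memmap=%lluM!%lluM", size_mib, start_mib);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount an EROFS image that lives on a pmem device with DAX enabled,
 * the mount(2) equivalent of `mount -t erofs -o ro,dax`.
 * Returns 0 on success, -1 with errno set on failure. */
int mount_erofs_dax(const char *pmem_dev, const char *target)
{
    return mount(pmem_dev, target, "erofs", MS_RDONLY, "dax");
}
```

For example, `memmap_param(buf, sizeof(buf), 256, 4096)` produces
`memmap=256M!4096M`, and a region reserved that way would be expected
to show up as a pmem block device (assumed here to be `/dev/pmem0`)
that can be passed to `mount_erofs_dax()`.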

Furthermore, compared to traditional ramdisks, direct addressing can
avoid the page cache entirely for uncompressed files, since the
unencoded data can be used directly as mmapped memory. Compressed
files still need the page cache to support mmapped access, but we
could adapt further to persistent memory scenarios, for example by
disabling cached decompression, unlike on block devices.

I'm not sure it's worth implementing this in kernelspace, since
it's outside the scope of an individual filesystem anyway.

[1] https://www.qemu.org/docs/master/system/devices/virtio-pmem.html
[2] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap

Thanks,
Gao Xiang

>
> Echoing Lennart, we must also "keep in mind from the beginning how
> authentication of every component of your process shall work" as
> that's essential to a couple of different Linux distributions today.
>
> We kept this email short because we want people to read it and avoid
> duplicating information from elsewhere. The effort is described from
> different perspectives in the systemd/dracut RFC email and github
> README.md if you'd like to learn more, it's worth reading the
> discussion in the systemd mailing list:
>
> https://marc.info/?l=systemd-devel&m=170214639006704&w=2
>
> https://github.com/containers/initoverlayfs/blob/main/README.md
>
> We also received feedback informally in the community that it would be
> nice if we could optionally use btrfs as an alternative.
>
> Is mise le meas/Regards,
>
> Eric Curtin
>

2023-12-12 07:36:14

by Christoph Hellwig

Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem

On Tue, Dec 12, 2023 at 08:50:56AM +0800, Gao Xiang wrote:
> For non-virtualization cases, I guess you could try to use `memmap`
> kernel option [2] to specify a memory region by bootloaders which
> contains an EROFS rootfs and a customized init for booting as
> erofs+overlayfs at least for `initoverlayfs`. The main benefit is
> that the memory region specified by the bootloader can be directly
> used for mounting. But I never tried if this option actually works.
>
> Furthermore, compared to traditional ramdisks, using direct address
> can avoid page cache totally for uncompressed files like it can
> just use unencoded data as mmaped memory. For compressed files, it
> still needs page cache to support mmaped access but we could adapt
> more for persistent memory scenarios such as disable cache
> decompression compared to previous block devices.
>
> I'm not sure if it's worth implementing this in kernelspace since
> it's out of scope of an individual filesystem anyway.

IFF the use case turns out to be generally useful (it looks quite
convoluted and odd to me), we could easily do an initdax concept where
a chunk of memory passed by the bootloader is presented as a DAX device
properly without memmap hacks.

2023-12-12 07:51:03

by Gao Xiang

Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem



On 2023/12/12 15:35, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 08:50:56AM +0800, Gao Xiang wrote:
>> For non-virtualization cases, I guess you could try to use `memmap`
>> kernel option [2] to specify a memory region by bootloaders which
>> contains an EROFS rootfs and a customized init for booting as
>> erofs+overlayfs at least for `initoverlayfs`. The main benefit is
>> that the memory region specified by the bootloader can be directly
>> used for mounting. But I never tried if this option actually works.
>>
>> Furthermore, compared to traditional ramdisks, using direct address
>> can avoid page cache totally for uncompressed files like it can
>> just use unencoded data as mmaped memory. For compressed files, it
>> still needs page cache to support mmaped access but we could adapt
>> more for persistent memory scenarios such as disable cache
>> decompression compared to previous block devices.
>>
>> I'm not sure if it's worth implementing this in kernelspace since
>> it's out of scope of an individual filesystem anyway.
>
> IFF the use case turns out to be generally useful (it looks quite
> convoluted and odd to me), we could easily do an initdax concept where
> a chunk of memory passed by the bootloader is presented as a DAX device
> properly without memmap hacks.

I have no idea how it's faster than the current initramfs or initrd.
So if it's really useful, maybe some numbers can be posted first
with the current `memmap` hack to see whether it's worth going further with
some new infrastructure like initdax.

Thanks,
Gao Xiang


2023-12-12 13:06:41

by Christoph Hellwig

Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem

On Tue, Dec 12, 2023 at 03:50:25PM +0800, Gao Xiang wrote:
> I have no idea how it's faster than the current initramfs or initrd.
> So if it's really useful, maybe some numbers can be posted first
> with the current `memmap` hack to see whether it's worth going further with
> some new infrastructure like initdax.

Agreed.

2023-12-12 21:18:31

by Eric Curtin

Subject: Re: [RFC KERNEL] initoverlayfs - a scalable initial filesystem

On Tue, 12 Dec 2023 at 13:06, Christoph Hellwig wrote:
>
> On Tue, Dec 12, 2023 at 03:50:25PM +0800, Gao Xiang wrote:
> > I have no idea how it's faster than the current initramfs or initrd.
> > So if it's really useful, maybe some numbers can be posted first
> > with the current `memmap` hack to see whether it's worth going further with
> > some new infrastructure like initdax.
>
> Agreed.
>

I was politely poked this morning to highlight the graphs on the
initoverlayfs page, so, as promised, I am highlighting them here.
That's not to say whether this is kernelspace's or userspace's role
to optimize, but it does prove there are benefits if we put some
effort into optimizing early boot.

https://github.com/containers/initoverlayfs

With this approach systemd starts ~300ms faster on a Raspberry Pi 4
with an SD card, and this systemd instance has access to all the files
that a traditional initramfs would provide. I also ran this test on a
Raspberry Pi 4 with an NVMe drive over USB, and the results were
closer to a 500ms benefit in systemd start time.

Is mise le meas/Regards,

Eric Curtin