2023-01-09 09:03:07

by Gao Xiang

Subject: [LSF/MM/BPF TOPIC] Image-based read-only filesystem: further use cases & directions

Hi folks,

* Background *

We've been continuously working on a useful read-only (immutable)
image solution since the end of 2017 (as a part of our work), which
many of you may already know: EROFS.

It has already successfully landed on billions of Android-related
devices, other types of embedded devices and containers with many
vendors involved, and we've always been seeking more use cases such
as incremental immutable rootfs, app sandboxes or packages (Android
APKs with many duplicated libraries?), dataset packages, etc.

The reasons why we believe immutable images can benefit various use
cases are:

 - much easier for all vendors to ship/distribute/keep originally
   signed (golden) images for each instance;

 - (combined with a writable layer such as overlayfs) easy to roll
   back to the original shipped state or do incremental updates;

 - easy to detect data corruption or do data recovery (whether due
   to physical device or network errors);

 - easy for real storage devices to do hardware write-protection for
   immutable images;

 - can apply various offline algorithms (such as reduced metadata,
   content-defined rolling hash deduplication, compression) to
   minimize image sizes (a rough chunking sketch follows this list);

 - initrd with FSDAX to avoid double caching, on top of the
   advantages above;

 - and more.
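
As a rough illustration of the content-defined deduplication point
above, here is a minimal user-space sketch of a Gear-style rolling
hash chunker. The table seeding, mask and size bounds below are
made-up parameters for illustration, not actual erofs-utils code:

/*
 * Illustrative sketch only: a Gear-style rolling hash chunker of the
 * kind used for content-defined deduplication.  All parameters here
 * are made up for illustration.
 */
#include <stdint.h>
#include <stddef.h>

static uint64_t gear_table[256];

#define CHUNK_MASK      ((1ULL << 13) - 1)      /* ~8KiB average chunk */
#define MIN_CHUNK       2048
#define MAX_CHUNK       65536

static void init_gear_table(void)
{
        uint64_t x = 0x9e3779b97f4a7c15ULL;     /* arbitrary nonzero seed */
        int i;

        for (i = 0; i < 256; i++) {
                /* xorshift64 to fill the table with pseudo-random values */
                x ^= x << 13; x ^= x >> 7; x ^= x << 17;
                gear_table[i] = x;
        }
}

/* Return the length of the next chunk starting at data[0]. */
static size_t next_chunk(const uint8_t *data, size_t len)
{
        uint64_t fp = 0;
        size_t i;

        for (i = 0; i < len && i < MAX_CHUNK; i++) {
                fp = (fp << 1) + gear_table[data[i]];
                if (i + 1 >= MIN_CHUNK && !(fp & CHUNK_MASK))
                        return i + 1;           /* cut point found */
        }
        return i;                               /* tail or max-sized chunk */
}

Since cut points depend only on the local content, identical regions
in different files (or different images) yield identical chunks,
which can then be deduplicated by indexing each chunk by its digest.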

In 2019, an LSF/MM/BPF topic was put forward to present the initial
EROFS use case [1]: the read-only Android rootfs of a single instance
on resource-limited devices, where effective compression was quite
important at that time.


* Problem *

In addition to enhancing data compression for single-instance
deployment as a self-contained approach (so that all use cases can
share only _one_ signed image), in recent years we've also been
focusing on multiple instances (such as containers or apps, where
each image represents a complete filesystem tree) running together
on one device with similar data, so that effective data
deduplication, on-demand lazy pulling and page cache sharing among
such different golden images have become vital as well.


* Current progress *

In order to resolve the challenges above, we've worked out:

 - (v5.15) chunk-based inodes (to form inode extents) to do data
   deduplication within a single image (see the mapping sketch right
   after this list);

 - (v5.16) multiple shared blobs (to keep content-defined data) in
   addition to the primary blob (to keep filesystem metadata) for
   wider deduplication across different images;

 - (v5.19) file-based distribution by introducing in-kernel local
   caching (fscache) with an on-demand lazy pulling feature [2];

 - (v6.1) shared domains to share such multiple shared blobs in
   fscache mode [3];

 - [RFC] preliminary page cache sharing between different images [4].
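
To make the chunk-based model above concrete, here is a conceptual
user-space sketch of how a chunk-indexed inode may map a logical
offset to a (blob, offset) pair. The structure and function names are
hypothetical and heavily simplified, not the actual EROFS on-disk
format:

/*
 * Conceptual model only; all names below are hypothetical, not the
 * real EROFS on-disk format.
 */
#include <stdint.h>

struct chunk_slot {
        uint32_t blob_id;       /* 0: primary blob; >0: a shared blob */
        uint32_t blob_chunk;    /* chunk index inside that blob */
};

struct chunked_inode {
        uint32_t chunk_bits;    /* log2 of the chunk size */
        uint32_t nr_chunks;
        struct chunk_slot *slots;       /* one slot per logical chunk */
};

/* Map a logical file offset to a (blob, physical offset) pair. */
static int map_offset(const struct chunked_inode *vi, uint64_t pos,
                      uint32_t *blob_id, uint64_t *blob_off)
{
        uint64_t idx = pos >> vi->chunk_bits;
        const struct chunk_slot *s;

        if (idx >= vi->nr_chunks)
                return -1;
        s = &vi->slots[idx];
        *blob_id = s->blob_id;
        *blob_off = ((uint64_t)s->blob_chunk << vi->chunk_bits) +
                    (pos & (((uint64_t)1 << vi->chunk_bits) - 1));
        return 0;
}

Deduplication falls out naturally: inodes within one image, or across
different images, can point their slots at the same chunk of the same
shared blob; with fscache, each shared blob is then also downloaded
and cached only once.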


* Potential topics to discuss *

 - data verification of different images with thousands (or more)
   shared blobs [5] (a simplified verification sketch follows this
   list);

 - encryption with per-extent keys for confidential containers
   [5][6];

 - current page cache sharing limitations due to mm reverse mapping,
   and finer (folio- or page-based) page cache sharing among
   images/blobs [4][7];

 - more effective in-kernel local caching features for fscache such
   as failover and daemonless modes;

 - (a wild preliminary idea, maybe) overlayfs partial copy-up with
   fscache as the upper layer in order to form a unified caching
   subsystem for better space saving?

 - FSDAX enhancements for the initial ramdisk or other use cases;

 - other issues when landing.
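
For the verification topic, here is a deliberately simplified sketch
of checking shared blobs against a signed manifest of per-blob
digests. verify_signature() and sha256() are assumed placeholder
helpers, not existing kernel or library APIs:

/*
 * Illustrative only: check one shared blob against a signed manifest
 * of per-blob digests.  All names here are hypothetical.
 */
#include <stdint.h>
#include <string.h>

struct blob_entry {
        char    name[64];
        uint8_t digest[32];     /* expected SHA-256 of the blob */
};

struct manifest {
        struct blob_entry *entries;
        unsigned int nr;
        /* plus a detached signature over the entry table */
};

/* Assumed helpers, not real APIs. */
int verify_signature(const struct manifest *m);
void sha256(const void *data, size_t len, uint8_t out[32]);

int verify_blob(const struct manifest *m, const char *name,
                const void *data, size_t len)
{
        uint8_t got[32];
        unsigned int i;

        if (verify_signature(m))        /* trust the manifest first */
                return -1;
        sha256(data, len, got);
        for (i = 0; i < m->nr; i++)
                if (!strcmp(m->entries[i].name, name))
                        return memcmp(m->entries[i].digest, got, 32) ?
                                -1 : 0;
        return -1;                      /* unknown blob */
}

Note that whole-blob digests sit awkwardly with on-demand lazy
pulling, since a blob may never be fetched in full; that is one
reason finer-grained (Merkle-tree style, as fs-verity does per file)
verification seems worth discussing.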


Finally, if our efforts (or plans) also make sense to you, we do
hope more people can join us.  Thanks!

[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/Yoj1AcHoBPqir++H@debian
[3] https://lore.kernel.org/r/[email protected]
[4] https://lore.kernel.org/r/[email protected]
[5] https://lore.kernel.org/r/[email protected]
[6] https://lwn.net/SubscriberLink/918893/4d389217f9b8d679
[7] https://lwn.net/Articles/895907

Thanks,
Gao Xiang


2023-02-23 10:39:53

by Xin Yin

Subject: Re: [LSF/MM/BPF TOPIC] Image-based read-only filesystem: further use cases & directions

On 2023/1/9 16:43, Gao Xiang wrote:
> [...]
>
> - current page cache sharing limitations due to mm reverse mapping,
> and finer (folio- or page-based) page cache sharing among
> images/blobs [4][7];
>
> - more effective in-kernel local caching features for fscache such
> as failover and daemonless modes;
>
> - (a wild preliminary idea, maybe) overlayfs partial copy-up with
> fscache as the upper layer in order to form a unified caching
> subsystem for better space saving?
>

We are also interested in these topics. Page cache sharing is an
exciting feature which may save a lot of memory in high-density
deployment scenarios, since we can already share blobs.

We hope to discuss the failover, multiple daemons/dirs and daemonless
features of fscache & cachefiles further, so that we can have a
better form for our production use.

Looking forward to the opportunity to discuss online if I can't
attend in person.

Thanks,
Xin Yin


2023-02-24 03:10:19

by Zhang Yi

Subject: Re: [LSF/MM/BPF TOPIC] Image-based read-only filesystem: further use cases & directions

On 2023/1/9 16:43, Gao Xiang wrote:
> [...]
>
>  - current page cache sharing limitations due to mm reverse mapping,
>    and finer (folio- or page-based) page cache sharing among
>    images/blobs [4][7];
>
>  - (a wild preliminary idea, maybe) overlayfs partial copy-up with
>    fscache as the upper layer in order to form a unified caching
>    subsystem for better space saving?
>

Hello Xiang and all,

We are interested in these topics too. Our cloud products also want
to use erofs + overlayfs as the container base image, and we want to
do more research on deduplication, page cache sharing and disk space
saving. I have also studied the overlayfs partial copy-up feature. I
hope we can discuss these topics further in person.

Thanks,
Yi.


2023-02-28 06:12:15

by Jingbo Xu

Subject: Re: [LSF/MM/BPF TOPIC] Image-based read-only filesystem: further use cases & directions

On 1/9/23 4:43 PM, Gao Xiang wrote:
> [...]

Over the past year, many promising features have landed for erofs,
such as file-based distribution and lazy pulling with fscache, and
the shared domain feature to reduce the disk usage of fscache's cache
files.

However, there are still many features to be done to make it a
production-ready and stable system for image distribution, such as
page cache sharing, and failover and daemonless modes for fscache.

It would be great if I could join the discussion on these topics :)


--
Thanks,
Jingbo