2023-01-20 15:57:19

by Alexander Larsson

Subject: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Giuseppe Scrivano and I have recently been working on a new project we
call composefs. This is the first time we propose this publicly, and
we would like some feedback on it.

At its core, composefs is a way to construct and use read-only images
that are used similarly to how you would use e.g. loop-back mounted
squashfs images. On top of this, composefs has two fundamental
features. First, it allows sharing of file data (both on disk and in
page cache) between images, and secondly, it has dm-verity-like
validation on read.

Let me first start with a minimal example of how this can be used,
before going into the details:

Suppose we have this source for an image:

rootfs/
├── dir
│   └── another_a
├── file_a
└── file_b

We can then use this to generate an image file and a set of
content-addressed backing files:

# mkcomposefs --digest-store=objects rootfs/ rootfs.img
# ls -l rootfs.img objects/*/*
-rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
-rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
-rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img

The rootfs.img file contains all information about directory and file
metadata plus references to the backing files by name. We can now
mount this and look at the result:

# mount -t composefs rootfs.img -o basedir=objects /mnt
# ls /mnt/
dir file_a file_b
# cat /mnt/file_a
content_a

When reading this file the kernel is actually reading the backing
file, in a fashion similar to overlayfs. Since the backing file is
content-addressed, the objects directory can be shared for multiple
images, and any files that happen to have the same content are
shared. I refer to this as opportunistic sharing, as it is different
from the more coarse-grained explicit sharing used by e.g. container
base images.
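
As an illustration, a second image built against the same object store
(rootfs2/ here is a hypothetical source tree that shares some file
content with rootfs/) only adds objects for content that is not
already present:

# mkcomposefs --digest-store=objects rootfs2/ rootfs2.img
# ls objects/*/* | wc -l

Each unique file content ends up as a single object in the store, no
matter how many images reference it.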

The next step is the validation. Note how the object files have
fs-verity enabled. In fact, they are named by their fs-verity digest:

# fsverity digest objects/*/*
sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

The generated filesystem image may contain the expected digest for the
backing files. When the backing file digest is incorrect, the open
will fail, and if the open succeeds, any other on-disk file changes
will be detected by fs-verity:

# cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
content_a
# rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# cat /mnt/file_a
WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
cat: /mnt/file_a: Input/output error

This re-uses the existing fs-verity functionality to protect against
changes in file contents, while adding on top of it protection against
changes in filesystem metadata and structure, i.e. protecting against
replacing an fs-verity enabled file or modifying file permissions or
xattrs.

To be fully verified we need another step: we use fs-verity on the
image itself. Then we pass the expected digest on the mount command
line (which will be verified at mount time):

# fsverity enable rootfs.img
# fsverity digest rootfs.img
sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
# mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

So, given a trusted set of mount options (say, unlocked from a TPM), we
have a fully verified filesystem tree mounted, with opportunistic
fine-grained sharing of identical files.

So, why do we want this? There are two initial users. First of all, we
want to use the opportunistic sharing for the podman container image
baselayer. The idea is to use a composefs mount as the lower directory
in an overlay mount, with the upper directory being the container work
dir. This will allow automatic file-level disk and page-cache
sharing between any two images, independent of details like the
permissions and timestamps of the files.
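
A rough sketch of that setup (the /lower, /upper, /work and /merged
mount points are illustrative):

# mount -t composefs rootfs.img -o basedir=objects /lower
# mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged

The overlay upper and work directories hold the container's writable
state, while all image content is read through the shared composefs
objects.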

Secondly we are interested in using the verification aspects of
composefs in the ostree project. Ostree already supports a
content-addressed object store, but it is currently referenced by
hardlink farms. The object store and the trees that reference it are
signed and verified at download time, but there is no runtime
verification. If we replace the hardlink farm with a composefs image
that points into the existing object store we can use the verification
to implement runtime verification.

In fact, the tooling to create composefs images is 100% reproducible,
so all we need to do is add the composefs image's fs-verity digest to
the ostree commit. Then the image can be reconstructed from the ostree
commit info, generating a file with the same fs-verity digest.
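
As a quick check of the reproducibility claim (reusing the rootfs/
example from above, with an illustrative .rebuilt name):

# mkcomposefs --digest-store=objects rootfs/ rootfs.img.rebuilt
# cmp rootfs.img rootfs.img.rebuilt
# fsverity digest rootfs.img.rebuilt

Since the rebuild is bit-for-bit identical, cmp reports no difference
and the printed digest is the same value that would be recorded in the
ostree commit.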

These are the use cases we're currently interested in, but there seems
to be a breadth of other possible uses. For example, many systems use
loopback mounts for images (like lxc or snap), and these could take
advantage of the opportunistic sharing. We've also talked about using
fuse to implement a local cache for the backing files, i.e. you would
have the second basedir be a fuse filesystem. On a lookup failure in
the first basedir, the fuse filesystem downloads the file and saves it
in the first basedir for later lookups. There are many interesting
possibilities here.

The patch series contains some documentation on the file format and
how to use the filesystem.

The userspace tools (and a standalone kernel module) are available
here:
https://github.com/containers/composefs

Initial work on ostree integration is here:
https://github.com/ostreedev/ostree/pull/2640

Changes since v2:
- Simplified filesystem format to use fixed size inodes. This resulted
in simpler (now < 2k lines) code as well as higher performance at
the cost of slightly (~40%) larger images.
- We now use multi-page mappings from the page cache, which removes
limits on sizes of xattrs and makes the dirent handling code simpler.
- Added more documentation about the on-disk file format.
- General cleanups based on review comments.

Changes since v1:
- Fixed some minor compiler warnings
- Fixed build with !CONFIG_MMU
- Documentation fixes from review by Bagas Sanjaya
- Code style and cleanup from review by Brian Masney
- Use existing kernel helpers for hex digit conversion
- Use kmap_local_page() instead of deprecated kmap()

Alexander Larsson (6):
fsverity: Export fsverity_get_digest
composefs: Add on-disk layout header
composefs: Add descriptor parsing code
composefs: Add filesystem implementation
composefs: Add documentation
composefs: Add kconfig and build support

Documentation/filesystems/composefs.rst | 159 +++++
Documentation/filesystems/index.rst | 1 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/composefs/Kconfig | 18 +
fs/composefs/Makefile | 5 +
fs/composefs/cfs-internals.h | 55 ++
fs/composefs/cfs-reader.c | 720 +++++++++++++++++++++++
fs/composefs/cfs.c | 750 ++++++++++++++++++++++++
fs/composefs/cfs.h | 172 ++++++
fs/verity/measure.c | 1 +
11 files changed, 1883 insertions(+)
create mode 100644 Documentation/filesystems/composefs.rst
create mode 100644 fs/composefs/Kconfig
create mode 100644 fs/composefs/Makefile
create mode 100644 fs/composefs/cfs-internals.h
create mode 100644 fs/composefs/cfs-reader.c
create mode 100644 fs/composefs/cfs.c
create mode 100644 fs/composefs/cfs.h

--
2.39.0


2023-01-20 16:14:20

by Alexander Larsson

Subject: [PATCH v3 3/6] composefs: Add descriptor parsing code

This adds the code to load and decode the filesystem descriptor file
format.

We open the descriptor at mount time and keep the struct file *
around. Most accesses to it happen via cfs_get_buf(), which reads the
descriptor data directly from the page cache, although in a few cases
(like when we need to directly copy data) we use kernel_read()
instead.

Signed-off-by: Alexander Larsson <[email protected]>
Co-developed-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Giuseppe Scrivano <[email protected]>
---
fs/composefs/cfs-internals.h | 55 +++
fs/composefs/cfs-reader.c | 720 +++++++++++++++++++++++++++++++++++
2 files changed, 775 insertions(+)
create mode 100644 fs/composefs/cfs-internals.h
create mode 100644 fs/composefs/cfs-reader.c

diff --git a/fs/composefs/cfs-internals.h b/fs/composefs/cfs-internals.h
new file mode 100644
index 000000000000..3524b977c8a8
--- /dev/null
+++ b/fs/composefs/cfs-internals.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _CFS_INTERNALS_H
+#define _CFS_INTERNALS_H
+
+#include "cfs.h"
+
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
+
+struct cfs_inode_extra_data {
+ char *path_payload; /* Real pathname for files, target for symlinks */
+
+ u64 xattrs_offset;
+ u32 xattrs_len;
+
+ u64 dirents_offset;
+ u32 dirents_len;
+
+ bool has_digest;
+ u8 digest[SHA256_DIGEST_SIZE]; /* fs-verity digest */
+};
+
+struct cfs_context {
+ struct file *descriptor;
+ u64 vdata_offset;
+ u32 num_inodes;
+
+ u64 descriptor_len;
+};
+
+int cfs_init_ctx(const char *descriptor_path, const u8 *required_digest,
+ struct cfs_context *ctx);
+
+void cfs_ctx_put(struct cfs_context *ctx);
+
+int cfs_init_inode(struct cfs_context *ctx, u32 inode_num, struct inode *inode,
+ struct cfs_inode_extra_data *data);
+
+ssize_t cfs_list_xattrs(struct cfs_context *ctx,
+ struct cfs_inode_extra_data *inode_data, char *names,
+ size_t size);
+int cfs_get_xattr(struct cfs_context *ctx, struct cfs_inode_extra_data *inode_data,
+ const char *name, void *value, size_t size);
+
+typedef bool (*cfs_dir_iter_cb)(void *private, const char *name, int namelen,
+ u64 ino, unsigned int dtype);
+
+int cfs_dir_iterate(struct cfs_context *ctx, u64 index,
+ struct cfs_inode_extra_data *inode_data, loff_t first,
+ cfs_dir_iter_cb cb, void *private);
+
+int cfs_dir_lookup(struct cfs_context *ctx, u64 index,
+ struct cfs_inode_extra_data *inode_data, const char *name,
+ size_t name_len, u64 *index_out);
+
+#endif
diff --git a/fs/composefs/cfs-reader.c b/fs/composefs/cfs-reader.c
new file mode 100644
index 000000000000..6ff7d3e70d39
--- /dev/null
+++ b/fs/composefs/cfs-reader.c
@@ -0,0 +1,720 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * composefs
+ *
+ * Copyright (C) 2021 Giuseppe Scrivano
+ * Copyright (C) 2022 Alexander Larsson
+ *
+ * This file is released under the GPL.
+ */
+
+#include "cfs-internals.h"
+
+#include <linux/file.h>
+#include <linux/fsverity.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+
+/* When mapping buffers via page arrays this is an "arbitrary" limit
+ * to ensure we're not ballooning memory use for the page array and
+ * mapping. On a 4k page, 64bit machine this limit will make the page
+ * array fit in one page, and will allow a mapping of 2MB. When
+ * applied to e.g. dirents this will allow more than 27000 filenames
+ * of length 64, which seems ok. If we need to support more, at that
+ * point we should probably fall back to an approach that maps pages
+ * incrementally.
+ */
+#define CFS_BUF_MAXPAGES 512
+
+#define CFS_BUF_PREALLOC_SIZE 4
+
+/* Check if the element, which is supposed to be offset from section_start
+ * actually fits in the section starting at section_start ending at section_end,
+ * and doesn't wrap.
+ */
+static bool cfs_is_in_section(u64 section_start, u64 section_end,
+ u64 element_offset, u64 element_size)
+{
+ u64 element_end;
+ u64 element_start;
+
+ element_start = section_start + element_offset;
+ if (element_start < section_start || element_start >= section_end)
+ return false;
+
+ element_end = element_start + element_size;
+ if (element_end < element_start || element_end > section_end)
+ return false;
+
+ return true;
+}
+
+struct cfs_buf {
+ struct page **pages;
+ size_t n_pages;
+ void *base;
+
+ /* Used as "pages" above to avoid allocation for small buffers */
+ struct page *prealloc[CFS_BUF_PREALLOC_SIZE];
+};
+
+static void cfs_buf_put(struct cfs_buf *buf)
+{
+ if (buf->pages) {
+ if (buf->n_pages == 1)
+ kunmap_local(buf->base);
+ else
+ vm_unmap_ram(buf->base, buf->n_pages);
+ for (size_t i = 0; i < buf->n_pages; i++)
+ put_page(buf->pages[i]);
+ if (buf->n_pages > CFS_BUF_PREALLOC_SIZE)
+ kfree(buf->pages);
+ buf->pages = NULL;
+ }
+}
+
+/* Map data from anywhere in the descriptor */
+static void *cfs_get_buf(struct cfs_context *ctx, u64 offset, u32 size,
+ struct cfs_buf *buf)
+{
+ struct inode *inode = ctx->descriptor->f_inode;
+ struct address_space *const mapping = inode->i_mapping;
+ size_t n_pages, read_pages;
+ u64 index, last_index;
+ struct page **pages;
+ void *base;
+
+ if (buf->pages)
+ return ERR_PTR(-EINVAL);
+
+ if (!cfs_is_in_section(0, ctx->descriptor_len, offset, size) || size == 0)
+ return ERR_PTR(-EFSCORRUPTED);
+
+ index = offset >> PAGE_SHIFT;
+ last_index = (offset + size - 1) >> PAGE_SHIFT;
+ n_pages = last_index - index + 1;
+
+ if (n_pages > CFS_BUF_MAXPAGES)
+ return ERR_PTR(-ENOMEM);
+
+ if (n_pages > CFS_BUF_PREALLOC_SIZE) {
+ pages = kmalloc_array(n_pages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+ } else {
+ /* Avoid allocation in common (small) cases */
+ pages = buf->prealloc;
+ }
+
+ for (read_pages = 0; read_pages < n_pages; read_pages++) {
+ struct page *page =
+ read_cache_page(mapping, index + read_pages, NULL, NULL);
+ if (IS_ERR(page))
+ goto nomem;
+ pages[read_pages] = page;
+ }
+
+ if (n_pages == 1) {
+ base = kmap_local_page(pages[0]);
+ } else {
+ base = vm_map_ram(pages, n_pages, -1);
+ if (!base)
+ goto nomem;
+ }
+
+ buf->pages = pages;
+ buf->n_pages = n_pages;
+ buf->base = base;
+
+ return base + (offset & (PAGE_SIZE - 1));
+
+nomem:
+ for (size_t i = 0; i < read_pages; i++)
+ put_page(pages[i]);
+ if (n_pages > CFS_BUF_PREALLOC_SIZE)
+ kfree(pages);
+
+ return ERR_PTR(-ENOMEM);
+}
+
+/* Map data from the inode table */
+static void *cfs_get_inode_buf(struct cfs_context *ctx, u64 offset, u32 len,
+ struct cfs_buf *buf)
+{
+ if (!cfs_is_in_section(CFS_INODE_TABLE_OFFSET, ctx->vdata_offset, offset, len))
+ return ERR_PTR(-EINVAL);
+
+ return cfs_get_buf(ctx, CFS_INODE_TABLE_OFFSET + offset, len, buf);
+}
+
+/* Map data from the variable data section */
+static void *cfs_get_vdata_buf(struct cfs_context *ctx, u64 offset, u32 len,
+ struct cfs_buf *buf)
+{
+ if (!cfs_is_in_section(ctx->vdata_offset, ctx->descriptor_len, offset, len))
+ return ERR_PTR(-EINVAL);
+
+ return cfs_get_buf(ctx, ctx->vdata_offset + offset, len, buf);
+}
+
+/* Read data from anywhere in the descriptor */
+static void *cfs_read_data(struct cfs_context *ctx, u64 offset, u32 size, u8 *dest)
+{
+ loff_t pos = offset;
+ size_t copied;
+
+ if (!cfs_is_in_section(0, ctx->descriptor_len, offset, size))
+ return ERR_PTR(-EFSCORRUPTED);
+
+ copied = 0;
+ while (copied < size) {
+ ssize_t bytes;
+
+ bytes = kernel_read(ctx->descriptor, dest + copied,
+ size - copied, &pos);
+ if (bytes < 0)
+ return ERR_PTR(bytes);
+ if (bytes == 0)
+ return ERR_PTR(-EINVAL);
+
+ copied += bytes;
+ }
+
+ if (copied != size)
+ return ERR_PTR(-EFSCORRUPTED);
+ return dest;
+}
+
+/* Read data from the variable data section */
+static void *cfs_read_vdata(struct cfs_context *ctx, u64 offset, u32 len, char *buf)
+{
+ void *res;
+
+ if (!cfs_is_in_section(ctx->vdata_offset, ctx->descriptor_len, offset, len))
+ return ERR_PTR(-EINVAL);
+
+ res = cfs_read_data(ctx, ctx->vdata_offset + offset, len, buf);
+ if (IS_ERR(res))
+ return ERR_CAST(res);
+
+ return buf;
+}
+
+/* Allocate, read and null-terminate paths from the variable data section */
+static char *cfs_read_vdata_path(struct cfs_context *ctx, u64 offset, u32 len)
+{
+ char *path;
+ void *res;
+
+ if (len > PATH_MAX)
+ return ERR_PTR(-EINVAL);
+
+ path = kmalloc(len + 1, GFP_KERNEL);
+ if (!path)
+ return ERR_PTR(-ENOMEM);
+
+ res = cfs_read_vdata(ctx, offset, len, path);
+ if (IS_ERR(res)) {
+ kfree(path);
+ return ERR_CAST(res);
+ }
+
+ /* zero terminate */
+ path[len] = 0;
+
+ return path;
+}
+
+int cfs_init_ctx(const char *descriptor_path, const u8 *required_digest,
+ struct cfs_context *ctx_out)
+{
+ u8 verity_digest[FS_VERITY_MAX_DIGEST_SIZE];
+ struct cfs_superblock superblock_buf;
+ struct cfs_superblock *superblock;
+ enum hash_algo verity_algo;
+ struct cfs_context ctx;
+ struct file *descriptor;
+ u64 num_inodes;
+ loff_t i_size;
+ int res;
+
+ descriptor = filp_open(descriptor_path, O_RDONLY, 0);
+ if (IS_ERR(descriptor))
+ return PTR_ERR(descriptor);
+
+ if (required_digest) {
+ res = fsverity_get_digest(d_inode(descriptor->f_path.dentry),
+ verity_digest, &verity_algo);
+ if (res < 0) {
+ pr_err("ERROR: composefs descriptor has no fs-verity digest\n");
+ goto fail;
+ }
+ if (verity_algo != HASH_ALGO_SHA256 ||
+ memcmp(required_digest, verity_digest, SHA256_DIGEST_SIZE) != 0) {
+ pr_err("ERROR: composefs descriptor has wrong fs-verity digest\n");
+ res = -EINVAL;
+ goto fail;
+ }
+ }
+
+ i_size = i_size_read(file_inode(descriptor));
+ if (i_size <= CFS_DESCRIPTOR_MIN_SIZE) {
+ res = -EINVAL;
+ goto fail;
+ }
+
+ /* Need this temporary ctx for cfs_read_data() */
+ ctx.descriptor = descriptor;
+ ctx.descriptor_len = i_size;
+
+ superblock = cfs_read_data(&ctx, CFS_SUPERBLOCK_OFFSET,
+ sizeof(struct cfs_superblock),
+ (u8 *)&superblock_buf);
+ if (IS_ERR(superblock)) {
+ res = PTR_ERR(superblock);
+ goto fail;
+ }
+
+ ctx.vdata_offset = le64_to_cpu(superblock->vdata_offset);
+
+ /* Some basic validation of the format */
+ if (le32_to_cpu(superblock->version) != CFS_VERSION ||
+ le32_to_cpu(superblock->magic) != CFS_MAGIC ||
+ /* vdata is in file */
+ ctx.vdata_offset > ctx.descriptor_len ||
+ ctx.vdata_offset <= CFS_INODE_TABLE_OFFSET ||
+ /* vdata is aligned */
+ ctx.vdata_offset % 4 != 0) {
+ res = -EFSCORRUPTED;
+ goto fail;
+ }
+
+ num_inodes = (ctx.vdata_offset - CFS_INODE_TABLE_OFFSET) / CFS_INODE_SIZE;
+ if (num_inodes > U32_MAX) {
+ res = -EFSCORRUPTED;
+ goto fail;
+ }
+ ctx.num_inodes = num_inodes;
+
+ *ctx_out = ctx;
+ return 0;
+
+fail:
+ fput(descriptor);
+ return res;
+}
+
+void cfs_ctx_put(struct cfs_context *ctx)
+{
+ if (ctx->descriptor) {
+ fput(ctx->descriptor);
+ ctx->descriptor = NULL;
+ }
+}
+
+static bool cfs_validate_filename(const char *name, size_t name_len)
+{
+ if (name_len == 0)
+ return false;
+
+ if (name_len == 1 && name[0] == '.')
+ return false;
+
+ if (name_len == 2 && name[0] == '.' && name[1] == '.')
+ return false;
+
+ if (memchr(name, '/', name_len))
+ return false;
+
+ return true;
+}
+
+int cfs_init_inode(struct cfs_context *ctx, u32 inode_num, struct inode *inode,
+ struct cfs_inode_extra_data *inode_data)
+{
+ struct cfs_buf vdata_buf = { NULL };
+ struct cfs_inode_data *disk_data;
+ char *path_payload = NULL;
+ void *res;
+ int ret = 0;
+ u64 variable_data_off;
+ u32 variable_data_len;
+ u64 digest_off;
+ u32 digest_len;
+ u32 st_type;
+ u64 size;
+
+ if (inode_num >= ctx->num_inodes)
+ return -EFSCORRUPTED;
+
+ disk_data = cfs_get_inode_buf(ctx, inode_num * CFS_INODE_SIZE,
+ CFS_INODE_SIZE, &vdata_buf);
+ if (IS_ERR(disk_data))
+ return PTR_ERR(disk_data);
+
+ inode->i_ino = inode_num;
+
+ inode->i_mode = le32_to_cpu(disk_data->st_mode);
+ set_nlink(inode, le32_to_cpu(disk_data->st_nlink));
+ inode->i_uid = make_kuid(current_user_ns(), le32_to_cpu(disk_data->st_uid));
+ inode->i_gid = make_kgid(current_user_ns(), le32_to_cpu(disk_data->st_gid));
+ inode->i_rdev = le32_to_cpu(disk_data->st_rdev);
+
+ size = le64_to_cpu(disk_data->st_size);
+ i_size_write(inode, size);
+ inode_set_bytes(inode, size);
+
+ inode->i_mtime.tv_sec = le64_to_cpu(disk_data->st_mtim_sec);
+ inode->i_mtime.tv_nsec = le32_to_cpu(disk_data->st_mtim_nsec);
+ inode->i_ctime.tv_sec = le64_to_cpu(disk_data->st_ctim_sec);
+ inode->i_ctime.tv_nsec = le32_to_cpu(disk_data->st_ctim_nsec);
+ inode->i_atime = inode->i_mtime;
+
+ variable_data_off = le64_to_cpu(disk_data->variable_data.off);
+ variable_data_len = le32_to_cpu(disk_data->variable_data.len);
+
+ st_type = inode->i_mode & S_IFMT;
+ if (st_type == S_IFDIR) {
+ inode_data->dirents_offset = variable_data_off;
+ inode_data->dirents_len = variable_data_len;
+ } else if ((st_type == S_IFLNK || st_type == S_IFREG) &&
+ variable_data_len > 0) {
+ path_payload = cfs_read_vdata_path(ctx, variable_data_off,
+ variable_data_len);
+ if (IS_ERR(path_payload)) {
+ ret = PTR_ERR(path_payload);
+ goto fail;
+ }
+ inode_data->path_payload = path_payload;
+ }
+
+ if (st_type == S_IFLNK) {
+ /* Symbolic link must have a non-empty target */
+ if (!inode_data->path_payload || *inode_data->path_payload == 0) {
+ ret = -EFSCORRUPTED;
+ goto fail;
+ }
+ } else if (st_type == S_IFREG) {
+ /* Regular file must have backing file except empty files */
+ if ((inode_data->path_payload && size == 0) ||
+ (!inode_data->path_payload && size > 0)) {
+ ret = -EFSCORRUPTED;
+ goto fail;
+ }
+ }
+
+ inode_data->xattrs_offset = le64_to_cpu(disk_data->xattrs.off);
+ inode_data->xattrs_len = le32_to_cpu(disk_data->xattrs.len);
+
+ if (inode_data->xattrs_len != 0) {
+ /* Validate xattr size */
+ if (inode_data->xattrs_len < sizeof(struct cfs_xattr_header)) {
+ ret = -EFSCORRUPTED;
+ goto fail;
+ }
+ }
+
+ digest_off = le64_to_cpu(disk_data->digest.off);
+ digest_len = le32_to_cpu(disk_data->digest.len);
+
+ if (digest_len > 0) {
+ if (digest_len != SHA256_DIGEST_SIZE) {
+ ret = -EFSCORRUPTED;
+ goto fail;
+ }
+
+ res = cfs_read_vdata(ctx, digest_off, digest_len, inode_data->digest);
+ if (IS_ERR(res)) {
+ ret = PTR_ERR(res);
+ goto fail;
+ }
+ inode_data->has_digest = true;
+ }
+
+ cfs_buf_put(&vdata_buf);
+ return 0;
+
+fail:
+ cfs_buf_put(&vdata_buf);
+ return ret;
+}
+
+ssize_t cfs_list_xattrs(struct cfs_context *ctx,
+ struct cfs_inode_extra_data *inode_data, char *names,
+ size_t size)
+{
+ const struct cfs_xattr_header *xattrs;
+ struct cfs_buf vdata_buf = { NULL };
+ size_t n_xattrs = 0;
+ u8 *data, *data_end;
+ ssize_t copied = 0;
+
+ if (inode_data->xattrs_len == 0)
+ return 0;
+
+ /* xattrs_len basic size req was verified in cfs_init_inode() */
+
+ xattrs = cfs_get_vdata_buf(ctx, inode_data->xattrs_offset,
+ inode_data->xattrs_len, &vdata_buf);
+ if (IS_ERR(xattrs))
+ return PTR_ERR(xattrs);
+
+ n_xattrs = le16_to_cpu(xattrs->n_attr);
+ if (n_xattrs == 0 || n_xattrs > CFS_MAX_XATTRS ||
+ inode_data->xattrs_len < cfs_xattr_header_size(n_xattrs)) {
+ copied = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ data = ((u8 *)xattrs) + cfs_xattr_header_size(n_xattrs);
+ data_end = ((u8 *)xattrs) + inode_data->xattrs_len;
+
+ for (size_t i = 0; i < n_xattrs; i++) {
+ const struct cfs_xattr_element *e = &xattrs->attr[i];
+ u16 this_value_len = le16_to_cpu(e->value_length);
+ u16 this_key_len = le16_to_cpu(e->key_length);
+ const char *this_key;
+
+ if (this_key_len > XATTR_NAME_MAX ||
+ /* key and data needs to fit in data */
+ data_end - data < this_key_len + this_value_len) {
+ copied = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ this_key = data;
+ data += this_key_len + this_value_len;
+
+ if (size) {
+ if (size - copied < this_key_len + 1) {
+ copied = -E2BIG;
+ goto exit;
+ }
+
+ memcpy(names + copied, this_key, this_key_len);
+ names[copied + this_key_len] = '\0';
+ }
+
+ copied += this_key_len + 1;
+ }
+
+exit:
+ cfs_buf_put(&vdata_buf);
+
+ return copied;
+}
+
+int cfs_get_xattr(struct cfs_context *ctx, struct cfs_inode_extra_data *inode_data,
+ const char *name, void *value, size_t size)
+{
+ struct cfs_xattr_header *xattrs;
+ struct cfs_buf vdata_buf = { NULL };
+ size_t name_len = strlen(name);
+ size_t n_xattrs = 0;
+ u8 *data, *data_end;
+ int res;
+
+ if (inode_data->xattrs_len == 0)
+ return -ENODATA;
+
+ /* xattrs_len minimal size req was verified in cfs_init_inode() */
+
+ xattrs = cfs_get_vdata_buf(ctx, inode_data->xattrs_offset,
+ inode_data->xattrs_len, &vdata_buf);
+ if (IS_ERR(xattrs))
+ return PTR_ERR(xattrs);
+
+ n_xattrs = le16_to_cpu(xattrs->n_attr);
+ if (n_xattrs == 0 || n_xattrs > CFS_MAX_XATTRS ||
+ inode_data->xattrs_len < cfs_xattr_header_size(n_xattrs)) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ data = ((u8 *)xattrs) + cfs_xattr_header_size(n_xattrs);
+ data_end = ((u8 *)xattrs) + inode_data->xattrs_len;
+
+ for (size_t i = 0; i < n_xattrs; i++) {
+ const struct cfs_xattr_element *e = &xattrs->attr[i];
+ u16 this_value_len = le16_to_cpu(e->value_length);
+ u16 this_key_len = le16_to_cpu(e->key_length);
+ const char *this_key, *this_value;
+
+ if (this_key_len > XATTR_NAME_MAX ||
+ /* key and data needs to fit in data */
+ data_end - data < this_key_len + this_value_len) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ this_key = data;
+ this_value = data + this_key_len;
+ data += this_key_len + this_value_len;
+
+ if (this_key_len != name_len || memcmp(this_key, name, name_len) != 0)
+ continue;
+
+ if (size > 0) {
+ if (size < this_value_len) {
+ res = -E2BIG;
+ goto exit;
+ }
+ memcpy(value, this_value, this_value_len);
+ }
+
+ res = this_value_len;
+ goto exit;
+ }
+
+ res = -ENODATA;
+
+exit:
+ cfs_buf_put(&vdata_buf);
+ return res;
+}
+
+/* This is essentially strcmp() for non-null-terminated strings */
+static inline int memcmp2(const void *a, const size_t a_size, const void *b,
+ size_t b_size)
+{
+ size_t common_size = min(a_size, b_size);
+ int res;
+
+ res = memcmp(a, b, common_size);
+ if (res != 0 || a_size == b_size)
+ return res;
+
+ return a_size < b_size ? -1 : 1;
+}
+
+int cfs_dir_iterate(struct cfs_context *ctx, u64 index,
+ struct cfs_inode_extra_data *inode_data, loff_t first,
+ cfs_dir_iter_cb cb, void *private)
+{
+ struct cfs_buf vdata_buf = { NULL };
+ const struct cfs_dir_header *dir;
+ u32 n_dirents;
+ char *namedata, *namedata_end;
+ loff_t pos;
+ int res;
+
+ if (inode_data->dirents_len == 0)
+ return 0;
+
+ dir = cfs_get_vdata_buf(ctx, inode_data->dirents_offset,
+ inode_data->dirents_len, &vdata_buf);
+ if (IS_ERR(dir))
+ return PTR_ERR(dir);
+
+ n_dirents = le32_to_cpu(dir->n_dirents);
+ if (n_dirents == 0 || n_dirents > CFS_MAX_DIRENTS ||
+ inode_data->dirents_len < cfs_dir_header_size(n_dirents)) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ if (first >= n_dirents) {
+ res = 0;
+ goto exit;
+ }
+
+ namedata = ((u8 *)dir) + cfs_dir_header_size(n_dirents);
+ namedata_end = ((u8 *)dir) + inode_data->dirents_len;
+ pos = 0;
+ for (size_t i = 0; i < n_dirents; i++) {
+ const struct cfs_dirent *dirent = &dir->dirents[i];
+ char *dirent_name =
+ (char *)namedata + le32_to_cpu(dirent->name_offset);
+ size_t dirent_name_len = dirent->name_len;
+
+ /* name needs to fit in namedata */
+ if (dirent_name >= namedata_end ||
+ namedata_end - dirent_name < dirent_name_len) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ if (!cfs_validate_filename(dirent_name, dirent_name_len)) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ if (pos++ < first)
+ continue;
+
+ if (!cb(private, dirent_name, dirent_name_len,
+ le32_to_cpu(dirent->inode_num), dirent->d_type)) {
+ break;
+ }
+ }
+
+ res = 0;
+exit:
+ cfs_buf_put(&vdata_buf);
+ return res;
+}
+
+int cfs_dir_lookup(struct cfs_context *ctx, u64 index,
+ struct cfs_inode_extra_data *inode_data, const char *name,
+ size_t name_len, u64 *index_out)
+{
+ struct cfs_buf vdata_buf = { NULL };
+ const struct cfs_dir_header *dir;
+ u32 start_dirent, end_dirent, n_dirents;
+ char *namedata, *namedata_end;
+ int cmp, res;
+
+ if (inode_data->dirents_len == 0)
+ return 0;
+
+ dir = cfs_get_vdata_buf(ctx, inode_data->dirents_offset,
+ inode_data->dirents_len, &vdata_buf);
+ if (IS_ERR(dir))
+ return PTR_ERR(dir);
+
+ n_dirents = le32_to_cpu(dir->n_dirents);
+ if (n_dirents == 0 || n_dirents > CFS_MAX_DIRENTS ||
+ inode_data->dirents_len < cfs_dir_header_size(n_dirents)) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ namedata = ((u8 *)dir) + cfs_dir_header_size(n_dirents);
+ namedata_end = ((u8 *)dir) + inode_data->dirents_len;
+
+ start_dirent = 0;
+ end_dirent = n_dirents - 1;
+ while (start_dirent <= end_dirent) {
+ int mid_dirent = start_dirent + (end_dirent - start_dirent) / 2;
+ const struct cfs_dirent *dirent = &dir->dirents[mid_dirent];
+ char *dirent_name =
+ (char *)namedata + le32_to_cpu(dirent->name_offset);
+ size_t dirent_name_len = dirent->name_len;
+
+ /* name needs to fit in namedata */
+ if (dirent_name >= namedata_end ||
+ namedata_end - dirent_name < dirent_name_len) {
+ res = -EFSCORRUPTED;
+ goto exit;
+ }
+
+ cmp = memcmp2(name, name_len, dirent_name, dirent_name_len);
+ if (cmp == 0) {
+ *index_out = le32_to_cpu(dirent->inode_num);
+ res = 1;
+ goto exit;
+ }
+
+ if (cmp > 0)
+ start_dirent = mid_dirent + 1;
+ else
+ end_dirent = mid_dirent - 1;
+ }
+
+ /* not found */
+ res = 0;
+
+exit:
+ cfs_buf_put(&vdata_buf);
+ return res;
+}
--
2.39.0

2023-01-20 19:49:31

by Amir Goldstein

Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>
> Giuseppe Scrivano and I have recently been working on a new project we
> call composefs. This is the first time we propose this publically and
> we would like some feedback on it.
>
> At its core, composefs is a way to construct and use read only images
> that are used similar to how you would use e.g. loop-back mounted
> squashfs images. On top of this composefs has two fundamental
> features. First it allows sharing of file data (both on disk and in
> page cache) between images, and secondly it has dm-verity like
> validation on read.
>
> [...]
>

Hi Alexander,

I must say that I am a little bit puzzled by this v3.
Gao, Christian and I asked you questions on v2
that are not mentioned in v3 at all.

To sum it up, please do not propose composefs without explaining
what the barriers are to achieving the exact same outcome with
a read-only overlayfs with two lower layers -
the uppermost an erofs containing the metadata files, which include
trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
to the lowermost layer containing the content files.
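
A rough sketch of that setup (image and mount-point names here are
illustrative):

# mount -t erofs -o loop metadata.erofs /metadata
# mount -t overlay overlay -o metacopy=on,redirect_dir=follow,lowerdir=/metadata:/objects /mnt

where /objects holds the content files and the erofs layer carries only
the metadata plus the overlay xattrs.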

Any current functionality gap in erofs and/or in overlayfs
cannot be considered as a reason to maintain a new filesystem
driver unless you come up with an explanation why closing that
functionality gap is not possible or why the erofs+overlayfs alternative
would be inferior to maintaining a new filesystem driver.

From the conversations so far, it does not seem like Gao thinks
that the functionality gap in erofs cannot be closed and I don't
see why the functionality gap in overlayfs cannot be closed.

Are we missing something?

Thanks,
Amir.

2023-01-20 22:28:31

by Giuseppe Scrivano

Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Hi Amir,

Amir Goldstein <[email protected]> writes:

> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>
>> Giuseppe Scrivano and I have recently been working on a new project we
>> call composefs. This is the first time we propose this publically and
>> we would like some feedback on it.
>>
>> [...]
>>
>
> Hi Alexander,
>
> I must say that I am a little bit puzzled by this v3.
> Gao, Christian and myself asked you questions on v2
> that are not mentioned in v3 at all.
>
> To sum it up, please do not propose composefs without explaining
> what are the barriers for achieving the exact same outcome with
> the use of a read-only overlayfs with two lower layer -
> uppermost with erofs containing the metadata files, which include
> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
> to the lowermost layer containing the content files.

I think Dave explained quite well why using overlay is not comparable to
what composefs does.

One big difference is that overlay still requires at least a syscall for
each file in the image, and then we need the equivalent of "rm -rf" to
clean it up. That is somewhat acceptable for long-running services, but
it is not for "serverless" containers where images/containers are
created and destroyed frequently. So even in the case where we already
have all the image files available locally, we still need to create a
checkout with the final structure we need for the image.

I also don't see how overlay would solve the verified image problem. We
would have the same problem we have today with fs-verity as it can only
validate a single file but not the entire directory structure. Changes
that affect the layer containing the trusted.overlay.{metacopy,redirect}
xattrs won't be noticed.

There are at the moment two ways to handle container images, both
somewhat guided by the available file systems in the kernel.

- A single image mounted as a block device.
- A list of tarballs (OCI image) that are unpacked and mounted as
overlay layers.

One big advantage of the block device model is that you can use
dm-verity; this is something we miss today with OCI container images
that use overlay.

What we are proposing with composefs is a way to have "dm-verity" style
validation based on fs-verity and the possibility to share individual
files instead of layers. These files can also be on different file
systems, which is something not possible with the block device model.

The composefs manifest blob could be generated remotely and signed. A
client would just need to validate the signature for the manifest blob,
and from there retrieve the files that are not in the local CAS (even
from an insecure source) and directly mount the manifest file.
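
As a rough sketch of that flow (the CAS path and the digest value are
placeholders): after verifying the signature over the manifest's
fs-verity digest with whatever trust root the client has, the client
only needs

# mount -t composefs rootfs.img -o basedir=/var/lib/cas/objects,digest=<verified digest> /mnt

and any missing object can be fetched even from an untrusted source,
since (assuming the image records per-file digests) a backing file is
only accepted if its fs-verity digest matches what the manifest says.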

Regards,
Giuseppe


2023-01-21 03:37:34

by Gao Xiang

Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem



On 2023/1/21 06:18, Giuseppe Scrivano wrote:
> Hi Amir,
>
> Amir Goldstein <[email protected]> writes:
>
>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:

...

>>>
>>
>> Hi Alexander,
>>
>> I must say that I am a little bit puzzled by this v3.
>> Gao, Christian and myself asked you questions on v2
>> that are not mentioned in v3 at all.
>>
>> To sum it up, please do not propose composefs without explaining
>> what are the barriers for achieving the exact same outcome with
>> the use of a read-only overlayfs with two lower layer -
>> uppermost with erofs containing the metadata files, which include
>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>> to the lowermost layer containing the content files.
>
> I think Dave explained quite well why using overlay is not comparable to
> what composefs does.
>
> One big difference is that overlay still requires at least a syscall for
> each file in the image, and then we need the equivalent of "rm -rf" to
> clean it up. It is somehow acceptable for long-running services, but it
> is not for "serverless" containers where images/containers are created
> and destroyed frequently. So even in the case we already have all the
> image files available locally, we still need to create a checkout with
> the final structure we need for the image.
>
> I also don't see how overlay would solve the verified image problem. We
> would have the same problem we have today with fs-verity as it can only
> validate a single file but not the entire directory structure. Changes
> that affect the layer containing the trusted.overlay.{metacopy,redirect}
> xattrs won't be noticed.
>
> There are at the moment two ways to handle container images, both somehow
> guided by the available file systems in the kernel.
>
> - A single image mounted as a block device.
> - A list of tarballs (OCI image) that are unpacked and mounted as
> overlay layers.
>
> One big advantage of the block devices model is that you can use
> dm-verity, this is something we miss today with OCI container images
> that use overlay.
>
> What we are proposing with composefs is a way to have "dm-verity" style
> validation based on fs-verity and the possibility to share individual
> files instead of layers. These files can also be on different file
> systems, which is something not possible with the block device model.

Honestly, that is not a new idea, including the chain of trust. Even
the out-of-tree incremental fs later used fs-verity for this as well,
except in a truly self-contained way.

>
> The composefs manifest blob could be generated remotely and signed. A
> client would need just to validate the signature for the manifest blob
> and from there retrieve the files that are not in the local CAS (even
> from an insecure source) and mount directly the manifest file.


Back to the topic: after thinking about it some more, I have to add
some background for reference.

First, EROFS had the same internal discussion and made the same
decision almost _two years ago_ (June 2021), which means:

a) Some internal people really suggested EROFS could develop
an entire new file-based in-kernel local cache subsystem
(what you call a local CAS) with a stackable file
interface, so that the existing Nydus image service [1] (like
ostree, and maybe ostree could use it as well) doesn't need to
modify anything to use existing blobs;

b) Reuse the existing fscache/cachefiles;

The reason why we (especially I) finally selected b) is:

- see the discussion of Google's original Incremental
FS topic [2] [3] in 2019, as Amir already mentioned. At
that time all the fs folks really preferred to reuse an existing
subsystem for in-kernel caching rather than reinvent another new
in-kernel wheel for a local cache.

[ Reinventing a new wheel is not hard (fs or caching), it just
makes Linux more fragmented. Especially when a new filesystem
is proposed just to generate images full of massive numbers of
new magical symlinks with *overridden* uid/gid/permissions
to replace regular files. ]

- in-kernel cache implementations usually hit several common
potential security issues; reusing an existing subsystem lets
all fses address them and benefit from the fixes.

- Usually an existing widely-used userspace implementation is
never an excuse for a new in-kernel feature.

David Howells has been quite busy these past months developing the
new netfs interface; otherwise (we think) we would already support
failover, multiple daemons/dirs, daemonless mode and more.

I know that you guys repeatedly say it's a self-contained
stackable fs with little code (the same words the incfs
folks [3] said four years ago already), but four reasons make that
argument weak IMHO:

- I think core EROFS is about 2~3 kLOC as well if the
compression, sysfs and fscache code are all stripped out.

Also, it's always welcome for anyone to submit
patches for cleaning up. I do such cleanups
from time to time and make it better.

- "Few lines of code" is a somewhat weak argument because people do
develop new features and layouts after upstreaming.

Such a claim is usually _NOT_ true in the future if you
guys do more to optimize performance, add new layouts or even
do your own lazy pulling with your local CAS codebase,
unless you *promise* that you dump the code once and then do
bugfixes only, like Christian said [4].

From the LWN.net comments, I do see the opposite
possibility: that you'd like to develop new features
later.

- In the past, all in-tree kernel filesystems were
designed and implemented without being tied to one specific
userspace implementation, Nydus and ostree included (I did
see a lot of discussion between folks before in the ociv2
brainstorm [5]).

That is why EROFS selected the existing in-kernel fscache and
made userspace Nydus adapt to it:

even the (here so-called) manifest on-disk format ---
what EROFS calls the primary device ---
is what Nydus calls the bootstrap;

I'm not sure why it becomes impossible for ... ($$$$).

In addition, if fscache is used, it can also use
fsverity_get_digest() to enable fs-verity for non-on-demand
files.

But again, I think even Google's folks consider that
(somewhat) broken, so they added fs-verity to their incFS
in a self-contained way in Feb 2021 [6].

Finally, again, I do hope for an LSF/MM discussion of this new
overlay model (full of massive magical symlinks that override
permissions).

[1] https://github.com/dragonflyoss/image-service
[2] https://lore.kernel.org/r/[email protected]om/
[3] https://lore.kernel.org/r/[email protected]om/
[4] https://lore.kernel.org/r/[email protected]/
[5] https://hackmd.io/@cyphar/ociv2-brainstorm
[6] https://android-review.googlesource.com/c/kernel/common/+/1444521

Thanks,
Gao Xiang


2023-01-21 11:27:57

by Amir Goldstein

Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <[email protected]> wrote:
>
> Hi Amir,
>
> Amir Goldstein <[email protected]> writes:
>
> > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
> >>
> >> Giuseppe Scrivano and I have recently been working on a new project we
> >> call composefs. This is the first time we propose this publically and
> >> we would like some feedback on it.
> >>
> >> [...]
> >> so all we need is to add the composefs image fs-verity digest into the
> >> ostree commit. Then the image can be reconstructed from the ostree
> >> commit info, generating a file with the same fs-verity digest.
> >>
> >> These are the usecases we're currently interested in, but there seems
> >> to be a breadth of other possible uses. For example, many systems use
> >> loopback mounts for images (like lxc or snap), and these could take
> >> advantage of the opportunistic sharing. We've also talked about using
> >> fuse to implement a local cache for the backing files. I.e. you would
> >> have the second basedir be a fuse filesystem. On lookup failure in the
> >> first basedir it downloads the file and saves it in the first basedir
> >> for later lookups. There are many interesting possibilities here.
> >>
> >> The patch series contains some documentation on the file format and
> >> how to use the filesystem.
> >>
> >> The userspace tools (and a standalone kernel module) is available
> >> here:
> >> https://github.com/containers/composefs
> >>
> >> Initial work on ostree integration is here:
> >> https://github.com/ostreedev/ostree/pull/2640
> >>
> >> Changes since v2:
> >> - Simplified filesystem format to use fixed size inodes. This resulted
> >> in simpler (now < 2k lines) code as well as higher performance at
> >> the cost of slightly (~40%) larger images.
> >> - We now use multi-page mappings from the page cache, which removes
> >> limits on sizes of xattrs and makes the dirent handling code simpler.
> >> - Added more documentation about the on-disk file format.
> >> - General cleanups based on review comments.
> >>
> >
> > Hi Alexander,
> >
> > I must say that I am a little bit puzzled by this v3.
> > Gao, Christian and myself asked you questions on v2
> > that are not mentioned in v3 at all.
> >
> > To sum it up, please do not propose composefs without explaining
> > what are the barriers for achieving the exact same outcome with
> > the use of a read-only overlayfs with two lower layer -
> > uppermost with erofs containing the metadata files, which include
> > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
> > to the lowermost layer containing the content files.
>
> I think Dave explained quite well why using overlay is not comparable to
> what composefs does.
>

Where? Can I get a link please?
If there are good reasons why composefs is superior to erofs+overlayfs,
please include them in the submission, since several developers keep
raising the same questions - that is all I ask.

> One big difference is that overlay still requires at least a syscall for
> each file in the image, and then we need the equivalent of "rm -rf" to
> clean it up. It is somehow acceptable for long-running services, but it
> is not for "serverless" containers where images/containers are created
> and destroyed frequently. So even in the case we already have all the
> image files available locally, we still need to create a checkout with
> the final structure we need for the image.
>

I think you did not understand my suggestion:

overlay read-only mount:
layer 1: erofs mount of a precomposed image (same as mkcomposefs)
layer 2: any pre-existing fs path with /blocks repository
layer 3: any pre-existing fs path with /blocks repository
...

The mkcomposefs flow is exactly the same in this suggestion:
the upper layer image is created without any syscalls and
removed without any syscalls.

Overlayfs already has the feature of redirecting from upper layer
to relative paths in lower layers.
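
As a rough illustration (the path names are hypothetical, reusing one
of the objects from the example above), a metadata-only file in the
uppermost erofs layer would carry a trusted.overlay.metacopy xattr
plus a redirect into the lower objects layer, e.g.:

# getfattr -n trusted.overlay.redirect metadata/file_a
# file: metadata/file_a
trusted.overlay.redirect="/objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f"

With metacopy=on, overlayfs serves the metadata from this inode and
the file data from the redirect target in the lower layer.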

> I also don't see how overlay would solve the verified image problem. We
> would have the same problem we have today with fs-verity as it can only
> validate a single file but not the entire directory structure. Changes
> that affect the layer containing the trusted.overlay.{metacopy,redirect}
> xattrs won't be noticed.
>

The entire erofs image would be fs-verity verified, including the overlayfs xattrs.
That is exactly the same model as composefs.
I am not even saying that your model is wrong, only that you are within
reach of implementing it with existing subsystems.

> There are at the moment two ways to handle container images, both somehow
> guided by the available file systems in the kernel.
>
> - A single image mounted as a block device.
> - A list of tarballs (OCI image) that are unpacked and mounted as
> overlay layers.
>
> One big advantage of the block devices model is that you can use
> dm-verity, this is something we miss today with OCI container images
> that use overlay.
>
> What we are proposing with composefs is a way to have "dm-verity" style
> validation based on fs-verity and the possibility to share individual
> files instead of layers. These files can also be on different file
> systems, which is something not possible with the block device model.
>
> The composefs manifest blob could be generated remotely and signed. A
> client would need just to validate the signature for the manifest blob
> and from there retrieve the files that are not in the local CAS (even
> from an insecure source) and mount directly the manifest file.
>

Excellent description of the problem.
I agree that we need a hybrid solution between the block
and tarball image model.

All I am saying is that this solution can use existing kernel
components and existing established on-disk formats
(erofs+overlayfs).

What was missing all along was the userspace component
(i.e. composefs) and I am very happy that you guys are
working on this project.

These userspace tools could be useful for other use cases.
For example, overlayfs has been able to describe a large directory
rename with a redirect xattr since v4.9, but image composing tools do
not make use of that, so an OCI image describing a large dir rename
will currently contain all the files within.
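
As a hypothetical illustration (directory names made up), a layer
describing a rename of /data to /data-v2 only needs an empty directory
carrying a redirect xattr instead of shipping all the files again,
roughly:

# mkdir data-v2
# setfattr -n trusted.overlay.redirect -v "/data" data-v2

With redirect_dir=on, overlayfs will then look up the directory
contents under /data in the lower layers.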

Once again, you may or may not be able to use erofs and
overlayfs out of the box for your needs, but so far I have not
seen any functionality gap that is not possible to close.

Please let me know if you know of such gaps or if my
proposal does not meet the goals of composefs.

Thanks,
Amir.

2023-01-21 15:36:34

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Amir Goldstein <[email protected]> writes:

> On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <[email protected]> wrote:
>>
>> Hi Amir,
>>
>> Amir Goldstein <[email protected]> writes:
>>
>> > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>> >>
>> >> Giuseppe Scrivano and I have recently been working on a new project we
>> >> call composefs. This is the first time we propose this publically and
>> >> we would like some feedback on it.
>> >>
>> >> At its core, composefs is a way to construct and use read only images
>> >> that are used similar to how you would use e.g. loop-back mounted
>> >> squashfs images. On top of this composefs has two fundamental
>> >> features. First it allows sharing of file data (both on disk and in
>> >> page cache) between images, and secondly it has dm-verity like
>> >> validation on read.
>> >>
>> >> Let me first start with a minimal example of how this can be used,
>> >> before going into the details:
>> >>
>> >> Suppose we have this source for an image:
>> >>
>> >> rootfs/
>> >> ├── dir
>> >> │ └── another_a
>> >> ├── file_a
>> >> └── file_b
>> >>
>> >> We can then use this to generate an image file and a set of
>> >> content-addressed backing files:
>> >>
>> >> # mkcomposefs --digest-store=objects rootfs/ rootfs.img
>> >> # ls -l rootfs.img objects/*/*
>> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
>> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>> >> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img
>> >>
>> >> The rootfs.img file contains all information about directory and file
>> >> metadata plus references to the backing files by name. We can now
>> >> mount this and look at the result:
>> >>
>> >> # mount -t composefs rootfs.img -o basedir=objects /mnt
>> >> # ls /mnt/
>> >> dir file_a file_b
>> >> # cat /mnt/file_a
>> >> content_a
>> >>
>> >> When reading this file the kernel is actually reading the backing
>> >> file, in a fashion similar to overlayfs. Since the backing file is
>> >> content-addressed, the objects directory can be shared for multiple
>> >> images, and any files that happen to have the same content are
>> >> shared. I refer to this as opportunistic sharing, as it is different
>> >> than the more course-grained explicit sharing used by e.g. container
>> >> base images.
>> >>
>> >> The next step is the validation. Note how the object files have
>> >> fs-verity enabled. In fact, they are named by their fs-verity digest:
>> >>
>> >> # fsverity digest objects/*/*
>> >> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
>> >> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>> >>
>> >> The generated filesystm image may contain the expected digest for the
>> >> backing files. When the backing file digest is incorrect, the open
>> >> will fail, and if the open succeeds, any other on-disk file-changes
>> >> will be detected by fs-verity:
>> >>
>> >> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>> >> content_a
>> >> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>> >> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>> >> # cat /mnt/file_a
>> >> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
>> >> cat: /mnt/file_a: Input/output error
>> >>
>> >> This re-uses the existing fs-verity functionallity to protect against
>> >> changes in file contents, while adding on top of it protection against
>> >> changes in filesystem metadata and structure. I.e. protecting against
>> >> replacing a fs-verity enabled file or modifying file permissions or
>> >> xattrs.
>> >>
>> >> To be fully verified we need another step: we use fs-verity on the
>> >> image itself. Then we pass the expected digest on the mount command
>> >> line (which will be verified at mount time):
>> >>
>> >> # fsverity enable rootfs.img
>> >> # fsverity digest rootfs.img
>> >> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
>> >> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>> >>
>> >> So, given a trusted set of mount options (say unlocked from TPM), we
>> >> have a fully verified filesystem tree mounted, with opportunistic
>> >> finegrained sharing of identical files.
>> >>
>> >> So, why do we want this? There are two initial users. First of all we
>> >> want to use the opportunistic sharing for the podman container image
>> >> baselayer. The idea is to use a composefs mount as the lower directory
>> >> in an overlay mount, with the upper directory being the container work
>> >> dir. This will allow automatical file-level disk and page-cache
>> >> sharning between any two images, independent of details like the
>> >> permissions and timestamps of the files.
>> >>
>> >> Secondly we are interested in using the verification aspects of
>> >> composefs in the ostree project. Ostree already supports a
>> >> content-addressed object store, but it is currently referenced by
>> >> hardlink farms. The object store and the trees that reference it are
>> >> signed and verified at download time, but there is no runtime
>> >> verification. If we replace the hardlink farm with a composefs image
>> >> that points into the existing object store we can use the verification
>> >> to implement runtime verification.
>> >>
>> >> In fact, the tooling to create composefs images is 100% reproducible,
>> >> so all we need is to add the composefs image fs-verity digest into the
>> >> ostree commit. Then the image can be reconstructed from the ostree
>> >> commit info, generating a file with the same fs-verity digest.
>> >>
>> >> These are the usecases we're currently interested in, but there seems
>> >> to be a breadth of other possible uses. For example, many systems use
>> >> loopback mounts for images (like lxc or snap), and these could take
>> >> advantage of the opportunistic sharing. We've also talked about using
>> >> fuse to implement a local cache for the backing files. I.e. you would
>> >> have the second basedir be a fuse filesystem. On lookup failure in the
>> >> first basedir it downloads the file and saves it in the first basedir
>> >> for later lookups. There are many interesting possibilities here.
>> >>
>> >> The patch series contains some documentation on the file format and
>> >> how to use the filesystem.
>> >>
>> >> The userspace tools (and a standalone kernel module) is available
>> >> here:
>> >> https://github.com/containers/composefs
>> >>
>> >> Initial work on ostree integration is here:
>> >> https://github.com/ostreedev/ostree/pull/2640
>> >>
>> >> Changes since v2:
>> >> - Simplified filesystem format to use fixed size inodes. This resulted
>> >> in simpler (now < 2k lines) code as well as higher performance at
>> >> the cost of slightly (~40%) larger images.
>> >> - We now use multi-page mappings from the page cache, which removes
>> >> limits on sizes of xattrs and makes the dirent handling code simpler.
>> >> - Added more documentation about the on-disk file format.
>> >> - General cleanups based on review comments.
>> >>
>> >
>> > Hi Alexander,
>> >
>> > I must say that I am a little bit puzzled by this v3.
>> > Gao, Christian and myself asked you questions on v2
>> > that are not mentioned in v3 at all.
>> >
>> > To sum it up, please do not propose composefs without explaining
>> > what are the barriers for achieving the exact same outcome with
>> > the use of a read-only overlayfs with two lower layer -
>> > uppermost with erofs containing the metadata files, which include
>> > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>> > to the lowermost layer containing the content files.
>>
>> I think Dave explained quite well why using overlay is not comparable to
>> what composefs does.
>>
>
> Where? Can I get a link please?

I am referring to this message: https://lore.kernel.org/lkml/[email protected]/

> If there are good reasons why composefs is superior to erofs+overlayfs
> Please include them in the submission, since several developers keep
> raising the same questions - that is all I ask.
>
>> One big difference is that overlay still requires at least a syscall for
>> each file in the image, and then we need the equivalent of "rm -rf" to
>> clean it up. It is somehow acceptable for long-running services, but it
>> is not for "serverless" containers where images/containers are created
>> and destroyed frequently. So even in the case we already have all the
>> image files available locally, we still need to create a checkout with
>> the final structure we need for the image.
>>
>
> I think you did not understand my suggestion:
>
> overlay read-only mount:
> layer 1: erofs mount of a precomposed image (same as mkcomposefs)
> layer 2: any pre-existing fs path with /blocks repository
> layer 3: any per-existing fs path with /blocks repository
> ...
>
> The mkcomposefs flow is exactly the same in this suggestion
> the upper layer image is created without any syscalls and
> removed without any syscalls.

mkcomposefs is supposed to be used server side, when the image is built.
The clients that will mount the image don't have to create it (at least
for images that will provide the manifest).

So this is quite different, as in the overlay model we must create
the layout, that is the equivalent of the composefs manifest, on any
node the image is pulled to.

> Overlayfs already has the feature of redirecting from upper layer
> to relative paths in lower layers.

Could you please provide more information on how you would compose the
overlay image first?

From what I can see, it still requires at least one syscall for each
file in the image to be created and these images are not portable to a
different machine.

Should we always make "/blocks" a whiteout to prevent it from being
leaked into the container?

And what prevents files under "/blocks" from being replaced with a
different version? I think fs-verity on the EROFS image itself won't
cover it.

>> I also don't see how overlay would solve the verified image problem. We
>> would have the same problem we have today with fs-verity as it can only
>> validate a single file but not the entire directory structure. Changes
>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>> xattrs won't be noticed.
>>
>
> The entire erofs image would be fsverified including the overlayfs xattrs.
> That is exactly the same model as composefs.
> I am not even saying that your model is wrong, only that you are within
> reach of implementing it with existing subsystems.

Now we can do:

mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

that is quite useful for mounting the OS image, as in the OSTree case.

How would that be possible with the setup you are proposing? Would
overlay gain a new "digest=" kind of option to validate its first layer?

>> There are at the moment two ways to handle container images, both somehow
>> guided by the available file systems in the kernel.
>>
>> - A single image mounted as a block device.
>> - A list of tarballs (OCI image) that are unpacked and mounted as
>> overlay layers.
>>
>> One big advantage of the block devices model is that you can use
>> dm-verity, this is something we miss today with OCI container images
>> that use overlay.
>>
>> What we are proposing with composefs is a way to have "dm-verity" style
>> validation based on fs-verity and the possibility to share individual
>> files instead of layers. These files can also be on different file
>> systems, which is something not possible with the block device model.
>>
>> The composefs manifest blob could be generated remotely and signed. A
>> client would need just to validate the signature for the manifest blob
>> and from there retrieve the files that are not in the local CAS (even
>> from an insecure source) and mount directly the manifest file.
>>
>
> Excellent description of the problem.
> I agree that we need a hybrid solution between the block
> and tarball image model.
>
> All I am saying is that this solution can use existing kernel
> components and existing established on-disk formats
> (erofs+overlayfs).
>
> What was missing all along was the userspace component
> (i.e. composefs) and I am very happy that you guys are
> working on this project.
>
> These userspace tools could be useful for other use cases.
> For example, overlayfs is able to describe a large directory
> rename with redirect xattr since v4.9, but image composing
> tools do not make use of that, so an OCI image describing a
> large dir rename will currently contain all the files within.
>
> Once again, you may or may not be able to use erofs and
> overlayfs out of the box for your needs, but so far I did not
> see any functionality gap that is not possible to close.
>
> Please let me know if you know of such gaps or if my
> proposal does not meet the goals of composefs.

Thanks for your helpful comments.

Regards,
Giuseppe

2023-01-21 16:26:50

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

On Sat, Jan 21, 2023 at 5:01 PM Giuseppe Scrivano <[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <[email protected]> wrote:
> >>
> >> Hi Amir,
> >>
> >> Amir Goldstein <[email protected]> writes:
> >>
> >> > On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
> >> >>
> >> >> Giuseppe Scrivano and I have recently been working on a new project we
> >> >> call composefs. This is the first time we propose this publically and
> >> >> we would like some feedback on it.
> >> >>
> >> >> At its core, composefs is a way to construct and use read only images
> >> >> that are used similar to how you would use e.g. loop-back mounted
> >> >> squashfs images. On top of this composefs has two fundamental
> >> >> features. First it allows sharing of file data (both on disk and in
> >> >> page cache) between images, and secondly it has dm-verity like
> >> >> validation on read.
> >> >>
> >> >> Let me first start with a minimal example of how this can be used,
> >> >> before going into the details:
> >> >>
> >> >> Suppose we have this source for an image:
> >> >>
> >> >> rootfs/
> >> >> ├── dir
> >> >> │ └── another_a
> >> >> ├── file_a
> >> >> └── file_b
> >> >>
> >> >> We can then use this to generate an image file and a set of
> >> >> content-addressed backing files:
> >> >>
> >> >> # mkcomposefs --digest-store=objects rootfs/ rootfs.img
> >> >> # ls -l rootfs.img objects/*/*
> >> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> >> >> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> >> >> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img
> >> >>
> >> >> The rootfs.img file contains all information about directory and file
> >> >> metadata plus references to the backing files by name. We can now
> >> >> mount this and look at the result:
> >> >>
> >> >> # mount -t composefs rootfs.img -o basedir=objects /mnt
> >> >> # ls /mnt/
> >> >> dir file_a file_b
> >> >> # cat /mnt/file_a
> >> >> content_a
> >> >>
> >> >> When reading this file the kernel is actually reading the backing
> >> >> file, in a fashion similar to overlayfs. Since the backing file is
> >> >> content-addressed, the objects directory can be shared for multiple
> >> >> images, and any files that happen to have the same content are
> >> >> shared. I refer to this as opportunistic sharing, as it is different
> >> >> than the more course-grained explicit sharing used by e.g. container
> >> >> base images.
> >> >>
> >> >> The next step is the validation. Note how the object files have
> >> >> fs-verity enabled. In fact, they are named by their fs-verity digest:
> >> >>
> >> >> # fsverity digest objects/*/*
> >> >> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> >> >> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> >> >>
> >> >> The generated filesystm image may contain the expected digest for the
> >> >> backing files. When the backing file digest is incorrect, the open
> >> >> will fail, and if the open succeeds, any other on-disk file-changes
> >> >> will be detected by fs-verity:
> >> >>
> >> >> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> >> >> content_a
> >> >> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> >> >> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> >> >> # cat /mnt/file_a
> >> >> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
> >> >> cat: /mnt/file_a: Input/output error
> >> >>
> >> >> This re-uses the existing fs-verity functionallity to protect against
> >> >> changes in file contents, while adding on top of it protection against
> >> >> changes in filesystem metadata and structure. I.e. protecting against
> >> >> replacing a fs-verity enabled file or modifying file permissions or
> >> >> xattrs.
> >> >>
> >> >> To be fully verified we need another step: we use fs-verity on the
> >> >> image itself. Then we pass the expected digest on the mount command
> >> >> line (which will be verified at mount time):
> >> >>
> >> >> # fsverity enable rootfs.img
> >> >> # fsverity digest rootfs.img
> >> >> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
> >> >> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
> >> >>
> >> >> So, given a trusted set of mount options (say unlocked from TPM), we
> >> >> have a fully verified filesystem tree mounted, with opportunistic
> >> >> finegrained sharing of identical files.
> >> >>
> >> >> So, why do we want this? There are two initial users. First of all we
> >> >> want to use the opportunistic sharing for the podman container image
> >> >> baselayer. The idea is to use a composefs mount as the lower directory
> >> >> in an overlay mount, with the upper directory being the container work
> >> >> dir. This will allow automatical file-level disk and page-cache
> >> >> sharning between any two images, independent of details like the
> >> >> permissions and timestamps of the files.
> >> >>
> >> >> Secondly we are interested in using the verification aspects of
> >> >> composefs in the ostree project. Ostree already supports a
> >> >> content-addressed object store, but it is currently referenced by
> >> >> hardlink farms. The object store and the trees that reference it are
> >> >> signed and verified at download time, but there is no runtime
> >> >> verification. If we replace the hardlink farm with a composefs image
> >> >> that points into the existing object store we can use the verification
> >> >> to implement runtime verification.
> >> >>
> >> >> In fact, the tooling to create composefs images is 100% reproducible,
> >> >> so all we need is to add the composefs image fs-verity digest into the
> >> >> ostree commit. Then the image can be reconstructed from the ostree
> >> >> commit info, generating a file with the same fs-verity digest.
> >> >>
> >> >> These are the usecases we're currently interested in, but there seems
> >> >> to be a breadth of other possible uses. For example, many systems use
> >> >> loopback mounts for images (like lxc or snap), and these could take
> >> >> advantage of the opportunistic sharing. We've also talked about using
> >> >> fuse to implement a local cache for the backing files. I.e. you would
> >> >> have the second basedir be a fuse filesystem. On lookup failure in the
> >> >> first basedir it downloads the file and saves it in the first basedir
> >> >> for later lookups. There are many interesting possibilities here.
> >> >>
> >> >> The patch series contains some documentation on the file format and
> >> >> how to use the filesystem.
> >> >>
> >> >> The userspace tools (and a standalone kernel module) is available
> >> >> here:
> >> >> https://github.com/containers/composefs
> >> >>
> >> >> Initial work on ostree integration is here:
> >> >> https://github.com/ostreedev/ostree/pull/2640
> >> >>
> >> >> Changes since v2:
> >> >> - Simplified filesystem format to use fixed size inodes. This resulted
> >> >> in simpler (now < 2k lines) code as well as higher performance at
> >> >> the cost of slightly (~40%) larger images.
> >> >> - We now use multi-page mappings from the page cache, which removes
> >> >> limits on sizes of xattrs and makes the dirent handling code simpler.
> >> >> - Added more documentation about the on-disk file format.
> >> >> - General cleanups based on review comments.
> >> >>
> >> >
> >> > Hi Alexander,
> >> >
> >> > I must say that I am a little bit puzzled by this v3.
> >> > Gao, Christian and myself asked you questions on v2
> >> > that are not mentioned in v3 at all.
> >> >
> >> > To sum it up, please do not propose composefs without explaining
> >> > what are the barriers for achieving the exact same outcome with
> >> > the use of a read-only overlayfs with two lower layer -
> >> > uppermost with erofs containing the metadata files, which include
> >> > trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
> >> > to the lowermost layer containing the content files.
> >>
> >> I think Dave explained quite well why using overlay is not comparable to
> >> what composefs does.
> >>
> >
> > Where? Can I get a link please?
>
> I am referring to this message: https://lore.kernel.org/lkml/[email protected]/
>

That is a good explanation of why the current container runtime
overlay storage driver is inadequate: the orchestration requires an
untar of the OCI tarball image before mounting overlayfs.

It is not a kernel issue, it is a userspace issue, because userspace
does not utilize overlayfs driver features that are now 6 years
old (redirect_dir) and 4 years old (metacopy).

I completely agree that reflink and hardlinks are not a viable
solution for ephemeral containers.

> > If there are good reasons why composefs is superior to erofs+overlayfs
> > Please include them in the submission, since several developers keep
> > raising the same questions - that is all I ask.
> >
> >> One big difference is that overlay still requires at least a syscall for
> >> each file in the image, and then we need the equivalent of "rm -rf" to
> >> clean it up. It is somehow acceptable for long-running services, but it
> >> is not for "serverless" containers where images/containers are created
> >> and destroyed frequently. So even in the case we already have all the
> >> image files available locally, we still need to create a checkout with
> >> the final structure we need for the image.
> >>
> >
> > I think you did not understand my suggestion:
> >
> > overlay read-only mount:
> > layer 1: erofs mount of a precomposed image (same as mkcomposefs)
> > layer 2: any pre-existing fs path with /blocks repository
> > layer 3: any per-existing fs path with /blocks repository
> > ...
> >
> > The mkcomposefs flow is exactly the same in this suggestion
> > the upper layer image is created without any syscalls and
> > removed without any syscalls.
>
> mkcomposefs is supposed to be used server side, when the image is built.
> The clients that will mount the image don't have to create it (at least
> for images that will provide the manifest).
>
> So this is quite different as in the overlay model we must create the
> layout, that is the equivalent of the composefs manifest, on any node
> the image is pulled to.
>

You don't need to re-create the erofs manifest on the client.
Unless I am completely missing something, the flow that I am
suggesting is a drop-in replacement for what you have done.

IIUC, you invented an on-disk format for the composefs manifest.
Is there anything preventing you from using the existing
erofs on-disk format to pack the manifest file?
The files in the manifest would be inodes with no blocks, only
with size and attributes and overlay xattrs with references to
the real object blocks, same as you would do with mkcomposefs.
Is it not?
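
For example, a very rough sketch of that packing step (the paths are
hypothetical, and it assumes mkfs.erofs is run as root so that the
trusted.* xattrs of the source tree are preserved in the image):

# mkdir -p metadata
# touch metadata/file_a
# truncate -s 10 metadata/file_a    # record the size; the data comes from the redirect target
# setfattr -n trusted.overlay.metacopy metadata/file_a
# setfattr -n trusted.overlay.redirect \
      -v "/objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f" \
      metadata/file_a
# mkfs.erofs rootfs.metadata.erofs metadata/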

Maybe what I am missing is how the blob objects are distributed.
Are they also shipped as composefs image bundles?
That can still be the case with erofs images that may contain both
blobs with data and metadata files referencing blobs in older images.

> > Overlayfs already has the feature of redirecting from upper layer
> > to relative paths in lower layers.
>
> Could you please provide more information on how you would compose the
> overlay image first?
>
> From what I can see, it still requires at least one syscall for each
> file in the image to be created and these images are not portable to a
> different machine.

Terminology nuance - you do not create an overlayfs image on the
server; you create an erofs image on the server, exactly as you would
create a composefs image on the server.

The shipped overlay "image" would then be the erofs image, with
references to the prerequisite images that contain the blobs, together
with the digest of the erofs image.

So instead of:

# mount -t composefs rootfs.img -o basedir=objects /mnt

the client will do:

# mount -t erofs rootfs.img -o digest=da.... /metadata
# mount -t overlay -o ro,metacopy=on,lowerdir=/metadata:/objects /mnt

>
> Should we always make "/blocks" a whiteout to prevent it is leaked in
> the container?

That would be the simplest option, yes.
If needed we can also make it a hidden layer whose objects
never appear in the namespace and can only be referenced
from an upper layer redirection.

>
> And what prevents files under "/blocks" to be replaced with a different
> version? I think fs-verity on the EROFS image itself won't cover it.
>

I think that part should be added to the overlayfs kernel driver.
We could enhance overlayfs to include an optional "overlay.verity"
digest on the metacopy upper files, to be fed into fs-verity when
opening lower blob files that reside on an fs-verity-capable
filesystem.

I am not an expert in trust chains, but I think this is equivalent to
how the composefs driver was going to solve the same problem?
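
A rough sketch of what that could look like (note that
"overlay.verity" is only the proposal above, not an existing overlayfs
xattr): the digest recorded next to the metacopy/redirect xattrs would
be the fs-verity digest of the lower blob, e.g.:

# fsverity digest objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
# setfattr -n trusted.overlay.verity \
      -v "sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f" \
      metadata/file_a

On open, overlayfs would compare this value against
fsverity_get_digest() of the redirected blob and fail the open on a
mismatch, similar to how composefs rejects a backing file with the
wrong digest.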

> >> I also don't see how overlay would solve the verified image problem. We
> >> would have the same problem we have today with fs-verity as it can only
> >> validate a single file but not the entire directory structure. Changes
> >> that affect the layer containing the trusted.overlay.{metacopy,redirect}
> >> xattrs won't be noticed.
> >>
> >
> > The entire erofs image would be fsverified including the overlayfs xattrs.
> > That is exactly the same model as composefs.
> > I am not even saying that your model is wrong, only that you are within
> > reach of implementing it with existing subsystems.
>
> now we can do:
>
> mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>
> that is quite useful for mounting the OS image, as is the OSTree case.
>
> How would that be possible with the setup you are proposing? Would
> overlay gain a new "digest=" kind of option to validate its first layer?
>

Overlayfs's job is to merge the layers.
The first layer would first need to be mounted as erofs,
so I think that the option digest= would need to be added to erofs.

Then, any content in the erofs mount (which is the first overlay
layer) would be verified by fs-verity, and overlayfs's job would be to
feed the digest found in the "overlay.verity" xattrs inside the erofs
layer to fs-verity when accessing files in the blob lower (or hidden)
layer.

Does this make sense to you?
Or is there still something that I am missing or
misunderstanding about the use case?

Thanks,
Amir.

2023-01-21 16:50:07

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Gao Xiang <[email protected]> writes:

> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>> Hi Amir,
>> Amir Goldstein <[email protected]> writes:
>>
>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>
> ...
>
>>>>
>>>
>>> Hi Alexander,
>>>
>>> I must say that I am a little bit puzzled by this v3.
>>> Gao, Christian and myself asked you questions on v2
>>> that are not mentioned in v3 at all.
>>>
>>> To sum it up, please do not propose composefs without explaining
>>> what are the barriers for achieving the exact same outcome with
>>> the use of a read-only overlayfs with two lower layer -
>>> uppermost with erofs containing the metadata files, which include
>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>> to the lowermost layer containing the content files.
>> I think Dave explained quite well why using overlay is not
>> comparable to
>> what composefs does.
>> One big difference is that overlay still requires at least a syscall
>> for
>> each file in the image, and then we need the equivalent of "rm -rf" to
>> clean it up. It is somehow acceptable for long-running services, but it
>> is not for "serverless" containers where images/containers are created
>> and destroyed frequently. So even in the case we already have all the
>> image files available locally, we still need to create a checkout with
>> the final structure we need for the image.
>> I also don't see how overlay would solve the verified image problem.
>> We
>> would have the same problem we have today with fs-verity as it can only
>> validate a single file but not the entire directory structure. Changes
>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>> xattrs won't be noticed.
>> There are at the moment two ways to handle container images, both
>> somehow
>> guided by the available file systems in the kernel.
>> - A single image mounted as a block device.
>> - A list of tarballs (OCI image) that are unpacked and mounted as
>> overlay layers.
>> One big advantage of the block devices model is that you can use
>> dm-verity, this is something we miss today with OCI container images
>> that use overlay.
>> What we are proposing with composefs is a way to have "dm-verity"
>> style
>> validation based on fs-verity and the possibility to share individual
>> files instead of layers. These files can also be on different file
>> systems, which is something not possible with the block device model.
>
> That is not a new idea honestly, including chain of trust. Even laterly
> out-of-tree incremental fs using fs-verity for this as well, except that
> it's in a real self-contained way.
>
>> The composefs manifest blob could be generated remotely and signed.
>> A
>> client would need just to validate the signature for the manifest blob
>> and from there retrieve the files that are not in the local CAS (even
>> from an insecure source) and mount directly the manifest file.
>
>
> Back to the topic, after thinking something I have to make a
> compliment for reference.
>
> First, EROFS had the same internal dissussion and decision at
> that time almost _two years ago_ (June 2021), it means:
>
> a) Some internal people really suggested EROFS could develop
> an entire new file-based in-kernel local cache subsystem
> (as you called local CAS, whatever) with stackable file
> interface so that the exist Nydus image service [1] (as
> ostree, and maybe ostree can use it as well) don't need to
> modify anything to use exist blobs;
>
> b) Reuse exist fscache/cachefiles;
>
> The reason why we (especially me) finally selected b) because:
>
> - see the people discussion of Google's original Incremental
> FS topic [2] [3] in 2019, as Amir already mentioned. At
> that time all fs folks really like to reuse exist subsystem
> for in-kernel caching rather than reinvent another new
> in-kernel wheel for local cache.
>
> [ Reinventing a new wheel is not hard (fs or caching), just
> makes Linux more fragmented. Especially a new filesystem
> is just proposed to generate images full of massive massive
> new magical symlinks with *overriden* uid/gid/permissions
> to replace regular files. ]
>
> - in-kernel cache implementation usually met several common
> potential security issues; reusing exist subsystem can
> make all fses addressed them and benefited from it.
>
> - Usually an exist widely-used userspace implementation is
> never an excuse for a new in-kernel feature.
>
> Although David Howells is always quite busy these months to
> develop new netfs interface, otherwise (we think) we should
> already support failover, multiple daemon/dirs, daemonless and
> more.

We have not added any new cache system. Overlay does "layer
deduplication" and, in a similar way, composefs does "file
deduplication". That is not a built-in feature, it is just a side
effect of how things are packed together.

Using fscache seems like a good idea and it has many advantages but it
is a centralized cache mechanism and it looks like a potential problem
when you think about allowing mounts from a user namespace.

As you know, since I contacted you about it, I've looked at EROFS in
the past and tried to get our use cases to work with it before
thinking about submitting composefs upstream.

From what I could see EROFS and composefs use two different approaches
to solve a similar problem, but it is not possible to do exactly with
EROFS what we are trying to do. To oversimplify it: I see EROFS as a
block device that uses fscache, and composefs as an overlay for files
instead of directories.

Sure composefs is quite simple and you could embed the composefs
features in EROFS and let EROFS behave as composefs when provided a
similar manifest file. But how is that any better than having a
separate implementation that does just one thing well instead of merging
different paradigms together?

> I know that you guys repeatedly say it's a self-contained
> stackable fs and has few code (the same words as Incfs
> folks [3] said four years ago already), four reasons make it
> weak IMHO:
>
> - I think core EROFS is about 2~3 kLOC as well if
> compression, sysfs and fscache are all code-truncated.
>
> Also, it's always welcome that all people could submit
> patches for cleaning up. I always do such cleanups
> from time to time and makes it better.
>
> - "Few code lines" is somewhat weak because people do
> develop new features, layout after upstream.
>
> Such claim is usually _NOT_ true in the future if you
> guys do more to optimize performance, new layout or even
> do your own lazy pulling with your local CAS codebase in
> the future unless
> you *promise* you once dump the code, and do bugfix
> only like Christian said [4].
>
> From LWN.net comments, I do see the opposite
> possibility that you'd like to develop new features
> later.
>
> - In the past, all in-tree kernel filesystems were
> designed and implemented without some user-space
> specific indication, including Nydus and ostree (I did
> see a lot of discussion between folks before in ociv2
> brainstorm [5]).

Since you are mentioning OCI:

Potentially composefs can be the file system that enables something very
close to "ociv2", but it won't need to be called v2 since it is
completely compatible with the current OCI image format.

It won't require a different image format, just a seekable tarball
that is compatible with old "v1" clients; in addition we need to
provide the composefs manifest file.

The seekable tarball allows individual files to be retrieved. OCI
clients will not need to pull the entire tarball, but only the individual
files that are not already present in the local CAS. They also won't
need to create the overlay layout at all, as we do today, since it is
already described by the composefs manifest file.

The manifest is portable across different machines with different
configurations, as you can use multiple CAS directories when mounting
composefs.

Some users might have a local CAS, others could have a secondary CAS
on a network file system, and composefs supports all these
configurations with the same signed manifest file.
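
For instance, a sketch of such a mount (assuming basedir= accepts a
colon-separated list of object directories searched in order; the
paths are hypothetical):

# mount -t composefs rootfs.img \
     -o basedir=/var/lib/objects:/mnt/netfs/objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

Files missing from the local store would then be looked up in the
secondary store on the network file system.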

> That is why EROFS selected exist in-kernel fscache and
> made userspace Nydus adapt it:
>
> even (here called) manifest on-disk format ---
> EROFS call primary device ---
> they call Nydus bootstrap;
>
> I'm not sure why it becomes impossible for ... ($$$$).

I am not sure what you mean, care to elaborate?

> In addition, if fscache is used, it can also use
> fsverity_get_digest() to enable fsverity for non-on-demand
> files.
>
> But again I think even Google's folks think that is
> (somewhat) broken so that they added fs-verity to its incFS
> in a self-contained way in Feb 2021 [6].
>
> Finally, again, I do hope a LSF/MM discussion for this new
> overlay model (full of massive magical symlinks to override
> permission.)

You keep pointing it out, but nobody is overriding any permissions.
The "symlinks", as you call them, are just a way to refer to the
payload files so they can be shared among different mounts. It is the
same idea used by "overlay metacopy" and nobody is complaining about
it being a security issue (because it is not).

The files in the CAS are owned by the user that creates the mount, so
there is no need to circumvent any permission check to access them.
We use fs-verity for these files to make sure they are not modified by a
malicious user that could get access to them (e.g. a container breakout).
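
For example, with the object from the earlier example (which already
has fs-verity enabled), any attempt to modify it in place is refused:

# echo tampered >> objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

The write fails because fs-verity files cannot be opened for writing,
and any out-of-band change to the on-disk data would be caught by
fs-verity at read time.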

Regards,
Giuseppe

>
> [1] https://github.com/dragonflyoss/image-service
> [2] https://lore.kernel.org/r/CAK8JDrFZW1jwOmhq+YVDPJi[email protected]/
> [3] https://lore.kernel.org/r/[email protected]om/
> [4] https://lore.kernel.org/r/[email protected]/
> [5] https://hackmd.io/@cyphar/ociv2-brainstorm
> [6] https://android-review.googlesource.com/c/kernel/common/+/1444521
>
> Thanks,
> Gao Xiang
>
>> Regards,
>> Giuseppe
>>
>>> Any current functionality gap in erofs and/or in overlayfs
>>> cannot be considered as a reason to maintain a new filesystem
>>> driver unless you come up with an explanation why closing that
>>> functionality gap is not possible or why the erofs+overlayfs alternative
>>> would be inferior to maintaining a new filesystem driver.
>>>
>>> From the conversations so far, it does not seem like Gao thinks
>>> that the functionality gap in erofs cannot be closed and I don't
>>> see why the functionality gap in overlayfs cannot be closed.
>>>
>>> Are we missing something?
>>>
>>> Thanks,
>>> Amir.
>>

2023-01-21 16:55:04

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem



On 2023/1/21 23:54, Amir Goldstein wrote:
> On Sat, Jan 21, 2023 at 5:01 PM Giuseppe Scrivano <[email protected]> wrote:
>>
>> Amir Goldstein <[email protected]> writes:
>>
>>> On Sat, Jan 21, 2023 at 12:18 AM Giuseppe Scrivano <[email protected]> wrote:
>>>>
>>>> Hi Amir,
>>>>
>>>> Amir Goldstein <[email protected]> writes:
>>>>
>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>>>>>
>>>>>> Giuseppe Scrivano and I have recently been working on a new project we
>>>>>> call composefs. This is the first time we propose this publically and
>>>>>> we would like some feedback on it.
>>>>>>
>>>>>> At its core, composefs is a way to construct and use read only images
>>>>>> that are used similar to how you would use e.g. loop-back mounted
>>>>>> squashfs images. On top of this composefs has two fundamental
>>>>>> features. First it allows sharing of file data (both on disk and in
>>>>>> page cache) between images, and secondly it has dm-verity like
>>>>>> validation on read.
>>>>>>
>>>>>> Let me first start with a minimal example of how this can be used,
>>>>>> before going into the details:
>>>>>>
>>>>>> Suppose we have this source for an image:
>>>>>>
>>>>>> rootfs/
>>>>>> ├── dir
>>>>>> │ └── another_a
>>>>>> ├── file_a
>>>>>> └── file_b
>>>>>>
>>>>>> We can then use this to generate an image file and a set of
>>>>>> content-addressed backing files:
>>>>>>
>>>>>> # mkcomposefs --digest-store=objects rootfs/ rootfs.img
>>>>>> # ls -l rootfs.img objects/*/*
>>>>>> -rw-------. 1 root root 10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
>>>>>> -rw-------. 1 root root 10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>>>>>> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img
>>>>>>
>>>>>> The rootfs.img file contains all information about directory and file
>>>>>> metadata plus references to the backing files by name. We can now
>>>>>> mount this and look at the result:
>>>>>>
>>>>>> # mount -t composefs rootfs.img -o basedir=objects /mnt
>>>>>> # ls /mnt/
>>>>>> dir file_a file_b
>>>>>> # cat /mnt/file_a
>>>>>> content_a
>>>>>>
>>>>>> When reading this file the kernel is actually reading the backing
>>>>>> file, in a fashion similar to overlayfs. Since the backing file is
>>>>>> content-addressed, the objects directory can be shared for multiple
>>>>>> images, and any files that happen to have the same content are
>>>>>> shared. I refer to this as opportunistic sharing, as it is different
>>>>>> than the more course-grained explicit sharing used by e.g. container
>>>>>> base images.
>>>>>>
>>>>>> The next step is the validation. Note how the object files have
>>>>>> fs-verity enabled. In fact, they are named by their fs-verity digest:
>>>>>>
>>>>>> # fsverity digest objects/*/*
>>>>>> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
>>>>>> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>>>>>>
>>>>>> The generated filesystm image may contain the expected digest for the
>>>>>> backing files. When the backing file digest is incorrect, the open
>>>>>> will fail, and if the open succeeds, any other on-disk file-changes
>>>>>> will be detected by fs-verity:
>>>>>>
>>>>>> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>>>>>> content_a
>>>>>> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>>>>>> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>>>>>> # cat /mnt/file_a
>>>>>> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
>>>>>> cat: /mnt/file_a: Input/output error
>>>>>>
>>>>>> This re-uses the existing fs-verity functionallity to protect against
>>>>>> changes in file contents, while adding on top of it protection against
>>>>>> changes in filesystem metadata and structure. I.e. protecting against
>>>>>> replacing a fs-verity enabled file or modifying file permissions or
>>>>>> xattrs.
>>>>>>
>>>>>> To be fully verified we need another step: we use fs-verity on the
>>>>>> image itself. Then we pass the expected digest on the mount command
>>>>>> line (which will be verified at mount time):
>>>>>>
>>>>>> # fsverity enable rootfs.img
>>>>>> # fsverity digest rootfs.img
>>>>>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
>>>>>> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>>>>>>
>>>>>> So, given a trusted set of mount options (say unlocked from TPM), we
>>>>>> have a fully verified filesystem tree mounted, with opportunistic
>>>>>> finegrained sharing of identical files.
>>>>>>
>>>>>> So, why do we want this? There are two initial users. First of all we
>>>>>> want to use the opportunistic sharing for the podman container image
>>>>>> baselayer. The idea is to use a composefs mount as the lower directory
>>>>>> in an overlay mount, with the upper directory being the container work
>>>>>> dir. This will allow automatical file-level disk and page-cache
>>>>>> sharning between any two images, independent of details like the
>>>>>> permissions and timestamps of the files.
>>>>>>
>>>>>> Secondly we are interested in using the verification aspects of
>>>>>> composefs in the ostree project. Ostree already supports a
>>>>>> content-addressed object store, but it is currently referenced by
>>>>>> hardlink farms. The object store and the trees that reference it are
>>>>>> signed and verified at download time, but there is no runtime
>>>>>> verification. If we replace the hardlink farm with a composefs image
>>>>>> that points into the existing object store we can use the verification
>>>>>> to implement runtime verification.
>>>>>>
>>>>>> In fact, the tooling to create composefs images is 100% reproducible,
>>>>>> so all we need is to add the composefs image fs-verity digest into the
>>>>>> ostree commit. Then the image can be reconstructed from the ostree
>>>>>> commit info, generating a file with the same fs-verity digest.
>>>>>>
>>>>>> These are the usecases we're currently interested in, but there seems
>>>>>> to be a breadth of other possible uses. For example, many systems use
>>>>>> loopback mounts for images (like lxc or snap), and these could take
>>>>>> advantage of the opportunistic sharing. We've also talked about using
>>>>>> fuse to implement a local cache for the backing files. I.e. you would
>>>>>> have the second basedir be a fuse filesystem. On lookup failure in the
>>>>>> first basedir it downloads the file and saves it in the first basedir
>>>>>> for later lookups. There are many interesting possibilities here.
>>>>>>
>>>>>> The patch series contains some documentation on the file format and
>>>>>> how to use the filesystem.
>>>>>>
>>>>>> The userspace tools (and a standalone kernel module) is available
>>>>>> here:
>>>>>> https://github.com/containers/composefs
>>>>>>
>>>>>> Initial work on ostree integration is here:
>>>>>> https://github.com/ostreedev/ostree/pull/2640
>>>>>>
>>>>>> Changes since v2:
>>>>>> - Simplified filesystem format to use fixed size inodes. This resulted
>>>>>> in simpler (now < 2k lines) code as well as higher performance at
>>>>>> the cost of slightly (~40%) larger images.
>>>>>> - We now use multi-page mappings from the page cache, which removes
>>>>>> limits on sizes of xattrs and makes the dirent handling code simpler.
>>>>>> - Added more documentation about the on-disk file format.
>>>>>> - General cleanups based on review comments.
>>>>>>
>>>>>
>>>>> Hi Alexander,
>>>>>
>>>>> I must say that I am a little bit puzzled by this v3.
>>>>> Gao, Christian and myself asked you questions on v2
>>>>> that are not mentioned in v3 at all.
>>>>>
>>>>> To sum it up, please do not propose composefs without explaining
>>>>> what are the barriers for achieving the exact same outcome with
>>>>> the use of a read-only overlayfs with two lower layer -
>>>>> uppermost with erofs containing the metadata files, which include
>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>> to the lowermost layer containing the content files.
>>>>
>>>> I think Dave explained quite well why using overlay is not comparable to
>>>> what composefs does.
>>>>
>>>
>>> Where? Can I get a link please?
>>
>> I am referring to this message: https://lore.kernel.org/lkml/[email protected]/
>>
>
> That is a good explanation why the current container runtime
> overlay storage driver is inadequate, because the orchestration
> requires untar of OCI tarball image before mounting overlayfs.
>
> It is not a kernel issue, it is a userspace issue, because userspace
> does not utilize overlayfs driver features that are now 6 years
> old (redirect_dir) and 4 years old (metacopy).
>
> I completely agree that reflink and hardlinks are not a viable solution
> to ephemeral containers.
>
>>> If there are good reasons why composefs is superior to erofs+overlayfs
>>> Please include them in the submission, since several developers keep
>>> raising the same questions - that is all I ask.
>>>
>>>> One big difference is that overlay still requires at least a syscall for
>>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>>> clean it up. It is somehow acceptable for long-running services, but it
>>>> is not for "serverless" containers where images/containers are created
>>>> and destroyed frequently. So even in the case we already have all the
>>>> image files available locally, we still need to create a checkout with
>>>> the final structure we need for the image.
>>>>
>>>
>>> I think you did not understand my suggestion:
>>>
>>> overlay read-only mount:
>>> layer 1: erofs mount of a precomposed image (same as mkcomposefs)
>>> layer 2: any pre-existing fs path with /blocks repository
>>> layer 3: any per-existing fs path with /blocks repository
>>> ...
>>>
>>> The mkcomposefs flow is exactly the same in this suggestion
>>> the upper layer image is created without any syscalls and
>>> removed without any syscalls.
>>
>> mkcomposefs is supposed to be used server side, when the image is built.
>> The clients that will mount the image don't have to create it (at least
>> for images that will provide the manifest).
>>
>> So this is quite different as in the overlay model we must create the
>> layout, that is the equivalent of the composefs manifest, on any node
>> the image is pulled to.
>>
>
> You don't need to re-create the erofs manifest on the client.
> Unless I am completely missing something, the flow that I am
> suggesting is drop-in replacement to what you have done.
>
> IIUC, you invented an on-disk format for composefs manifest.
> Is there anything preventing you from using the existing
> erofs on-disk format to pack the manifest file?
> The files in the manifest would be inodes with no blocks, only
> with size and attributes and overlay xattrs with references to
> the real object blocks, same as you would do with mkcomposefs.
> Is it not?

Yes, a special EROFS image in which all regular files are empty and
carry the relevant overlay "trusted" xattrs would work fine as the
lower dir.

>
> Maybe what I am missing is how are the blob objects distributed?
> Are they also shipped as composefs image bundles?
> That can still be the case with erofs images that may contain both
> blobs with data and metadata files referencing blobs in older images.

Maybe just empty regular files in EROFS (or whatever other fs) with
a magic "trusted.overlay.blablabla" xattr pointing to the real file.

>
>>> Overlayfs already has the feature of redirecting from upper layer
>>> to relative paths in lower layers.
>>
>> Could you please provide more information on how you would compose the
>> overlay image first?
>>
>> From what I can see, it still requires at least one syscall for each
>> file in the image to be created and these images are not portable to a
>> different machine.
>
> Terminology nuance - you do not create an overlayfs image on the server
> you create an erofs image on the server, exactly as you would create
> a composefs image on the server.
>
> The shipped overlay "image" would then be the erofs image with
> references to prerequisite images that contain the blobs and the digest
> of the erofs image.
>
> # mount -t composefs rootfs.img -o basedir=objects /mnt
>
> client will do:
>
> # mount -t erofs rootfs.img -o digest=da.... /metadata
> # mount -t overlay -o ro,metacopy=on,lowerdir=/metadata:/objects /mnt

Currently we may not even need to introduce "-o digest"; plain
loop + dm-verity for such a manifest is already enough.
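
For example, a rough sketch of that flow (device names, file names and
the root hash are placeholders; the overlay options follow Amir's
example above):

# losetup -f --show rootfs-meta.erofs
/dev/loop0
# veritysetup format /dev/loop0 rootfs-meta.hashtree
# veritysetup open /dev/loop0 meta-verity rootfs-meta.hashtree <root-hash-printed-by-format>
# mount -t erofs -o ro /dev/mapper/meta-verity /metadata
# mount -t overlay overlay -o ro,metacopy=on,redirect_dir=follow,lowerdir=/metadata:/objects /mnt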

>
>>
>> Should we always make "/blocks" a whiteout to prevent it is leaked in
>> the container?
>
> That would be the simplest option, yes.
> If needed we can also make it a hidden layer whose objects
> never appear in the namespace and can only be referenced
> from an upper layer redirection.
>
>>
>> And what prevents files under "/blocks" to be replaced with a different
>> version? I think fs-verity on the EROFS image itself won't cover it.
>>
>
> I think that part should be added to the overlayfs kernel driver.
> We could enhance overlayfs to include optional "overlay.verity" digest
> on the metacopy upper files to be fed into fsverity when opening lower
> blob files that reside on an fsverity supported filesystem.

Agreed, another overlayfs "trusted.overlay.verity" xattr in EROFS (or
whatever other fs) on each empty regular file could do the same
fsverity_get_digest() trick. That would have the same impact IMO.
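
To make that concrete (just a sketch of the proposal being discussed,
not something overlayfs supports at this point; the xattr name, its hex
encoding and $BLOB_PATH are placeholders), the image build step could
record the expected digest of each backing file:

# fsverity digest objects/$BLOB_PATH
sha256:<64-hex-digit digest> objects/$BLOB_PATH
# setfattr -n trusted.overlay.verity -v "<64-hex-digit digest>" meta/file_a

At open time overlayfs would then compare the stored value against
fsverity_get_digest() on the lower blob and fail the open on a mismatch
(or when the blob has no fs-verity digest at all).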

Thanks,
Gao Xiang

...

>
> Thanks,
> Amir.

2023-01-21 17:35:36

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem



On 2023/1/22 00:19, Giuseppe Scrivano wrote:
> Gao Xiang <[email protected]> writes:
>
>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>> Hi Amir,
>>> Amir Goldstein <[email protected]> writes:
>>>
>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>
>> ...
>>
>>>>>
>>>>
>>>> Hi Alexander,
>>>>
>>>> I must say that I am a little bit puzzled by this v3.
>>>> Gao, Christian and myself asked you questions on v2
>>>> that are not mentioned in v3 at all.
>>>>
>>>> To sum it up, please do not propose composefs without explaining
>>>> what are the barriers for achieving the exact same outcome with
>>>> the use of a read-only overlayfs with two lower layer -
>>>> uppermost with erofs containing the metadata files, which include
>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>> to the lowermost layer containing the content files.
>>> I think Dave explained quite well why using overlay is not
>>> comparable to
>>> what composefs does.
>>> One big difference is that overlay still requires at least a syscall
>>> for
>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>> clean it up. It is somehow acceptable for long-running services, but it
>>> is not for "serverless" containers where images/containers are created
>>> and destroyed frequently. So even in the case we already have all the
>>> image files available locally, we still need to create a checkout with
>>> the final structure we need for the image.
>>> I also don't see how overlay would solve the verified image problem.
>>> We
>>> would have the same problem we have today with fs-verity as it can only
>>> validate a single file but not the entire directory structure. Changes
>>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>>> xattrs won't be noticed.
>>> There are at the moment two ways to handle container images, both
>>> somehow
>>> guided by the available file systems in the kernel.
>>> - A single image mounted as a block device.
>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>> overlay layers.
>>> One big advantage of the block devices model is that you can use
>>> dm-verity, this is something we miss today with OCI container images
>>> that use overlay.
>>> What we are proposing with composefs is a way to have "dm-verity"
>>> style
>>> validation based on fs-verity and the possibility to share individual
>>> files instead of layers. These files can also be on different file
>>> systems, which is something not possible with the block device model.
>>
>> That is not a new idea honestly, including chain of trust. Even laterly
>> out-of-tree incremental fs using fs-verity for this as well, except that
>> it's in a real self-contained way.
>>
>>> The composefs manifest blob could be generated remotely and signed.
>>> A
>>> client would need just to validate the signature for the manifest blob
>>> and from there retrieve the files that are not in the local CAS (even
>>> from an insecure source) and mount directly the manifest file.
>>
>>
>> Back to the topic, after thinking something I have to make a
>> compliment for reference.
>>
>> First, EROFS had the same internal dissussion and decision at
>> that time almost _two years ago_ (June 2021), it means:
>>
>> a) Some internal people really suggested EROFS could develop
>> an entire new file-based in-kernel local cache subsystem
>> (as you called local CAS, whatever) with stackable file
>> interface so that the exist Nydus image service [1] (as
>> ostree, and maybe ostree can use it as well) don't need to
>> modify anything to use exist blobs;
>>
>> b) Reuse exist fscache/cachefiles;
>>
>> The reason why we (especially me) finally selected b) because:
>>
>> - see the people discussion of Google's original Incremental
>> FS topic [2] [3] in 2019, as Amir already mentioned. At
>> that time all fs folks really like to reuse exist subsystem
>> for in-kernel caching rather than reinvent another new
>> in-kernel wheel for local cache.
>>
>> [ Reinventing a new wheel is not hard (fs or caching), just
>> makes Linux more fragmented. Especially a new filesystem
>> is just proposed to generate images full of massive massive
>> new magical symlinks with *overriden* uid/gid/permissions
>> to replace regular files. ]
>>
>> - in-kernel cache implementation usually met several common
>> potential security issues; reusing exist subsystem can
>> make all fses addressed them and benefited from it.
>>
>> - Usually an exist widely-used userspace implementation is
>> never an excuse for a new in-kernel feature.
>>
>> Although David Howells is always quite busy these months to
>> develop new netfs interface, otherwise (we think) we should
>> already support failover, multiple daemon/dirs, daemonless and
>> more.
>
> we have not added any new cache system. overlay does "layer
> deduplication" and in similar way composefs does "file deduplication".
> That is not a built-in feature, it is just a side effect of how things
> are packed together.
>
> Using fscache seems like a good idea and it has many advantages but it
> is a centralized cache mechanism and it looks like a potential problem
> when you think about allowing mounts from a user namespace.

I think Christian [1] had the same feeling as I did at that time:

"I'm pretty skeptical of this plan whether we should add more filesystems
that are mountable by unprivileged users. FUSE and Overlayfs are
adventurous enough and they don't have their own on-disk format. The
track record of bugs exploitable due to userns isn't making this
very attractive."

Yes, you could add fs-verity, and EROFS could add fs-verity (or just use
dm-verity) as well, but it doesn't change _anything_ about the concerns
around "allowing mounts from a user namespace".

>
> As you know as I've contacted you, I've looked at EROFS in the past
> and tried to get our use cases to work with it before thinking about
> submitting composefs upstream.
>
> From what I could see EROFS and composefs use two different approaches
> to solve a similar problem, but it is not possible to do exactly with
> EROFS what we are trying to do. To oversimplify it: I see EROFS as a
> block device that uses fscache, and composefs as an overlay for files
> instead of directories.

I don't think so, honestly. The EROFS "multiple device" feature is
actually a "multiple blobs" feature if you assume that "device" here
means a block device.

Primary device -- primary blob -- "composefs manifest blob"
Blob device -- data blobs -- "composefs backing files"

any difference?

>
> Sure composefs is quite simple and you could embed the composefs
> features in EROFS and let EROFS behave as composefs when provided a
> similar manifest file. But how is that any better than having a

EROFS has had such a feature since v5.16; we call it the primary device,
which matches the Nydus concept of the "bootstrap file".

> separate implementation that does just one thing well instead of merging
> different paradigms together?

It's compatible with an existing fs on-disk format (people can deploy
the same image to a wider range of scenarios), or you could
modify/enhance any in-kernel local fs to do the same, as I already
suggested, such as enhancing "fs/romfs" and making it maintained again
thanks to this magic symlink feature

(because composefs has no on-disk requirements other than a symlink path
and a SHA256 verity digest in its original requirement; any local fs can
be enhanced like this.)

>
>> I know that you guys repeatedly say it's a self-contained
>> stackable fs and has few code (the same words as Incfs
>> folks [3] said four years ago already), four reasons make it
>> weak IMHO:
>>
>> - I think core EROFS is about 2~3 kLOC as well if
>> compression, sysfs and fscache are all code-truncated.
>>
>> Also, it's always welcome that all people could submit
>> patches for cleaning up. I always do such cleanups
>> from time to time and makes it better.
>>
>> - "Few code lines" is somewhat weak because people do
>> develop new features, layout after upstream.
>>
>> Such claim is usually _NOT_ true in the future if you
>> guys do more to optimize performance, new layout or even
>> do your own lazy pulling with your local CAS codebase in
>> the future unless
>> you *promise* you once dump the code, and do bugfix
>> only like Christian said [4].
>>
>> From LWN.net comments, I do see the opposite
>> possibility that you'd like to develop new features
>> later.
>>
>> - In the past, all in-tree kernel filesystems were
>> designed and implemented without some user-space
>> specific indication, including Nydus and ostree (I did
>> see a lot of discussion between folks before in ociv2
>> brainstorm [5]).
>
> Since you are mentioning OCI:
>
> Potentially composefs can be the file system that enables something very
> close to "ociv2", but it won't need to be called v2 since it is
> completely compatible with the current OCI image format.
>
> It won't require a different image format, just a seekable tarball that
> is compatible with old "v1" clients and we need to provide the composefs
> manifest file.

May I ask whether you really looked into what Nydus + EROFS already do
(as you mentioned, we discussed this before)?

Your "composefs manifest file" is exactly the "Nydus bootstrap file", see:
https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md

"Rafs is a filesystem image containing a separated metadata blob and
several data-deduplicated content-addressable data blobs. In a typical
rafs filesystem, the metadata is stored in bootstrap while the data
is stored in blobfile.
...

bootstrap: The metadata is a merkle tree (I think that is a typo; it should be
filesystem tree) whose nodes represents a regular filesystem's
directory/file a leaf node refers to a file and contains hash value of
its file data.

Root node and internal nodes refer to directories and contain the hash value
of their children nodes."

Nydus already supports the "It won't require a different image format, just
a seekable tarball that is compatible with old "v1" clients and we need to
provide the composefs manifest file" feature in v2.2, which will be
released later.

>
> The seekable tarball allows individual files to be retrieved. OCI
> clients will not need to pull the entire tarball, but only the individual
> files that are not already present in the local CAS. They won't also need
> to create the overlay layout at all, as we do today, since it is already
> described with the composefs manifest file.
>
> The manifest is portable on different machines with different
> configurations, as you can use multiple CAS when mounting composefs.
>
> Some users might have a local CAS, some others could have a secondary
> CAS on a network file system and composefs support all these
> configurations with the same signed manifest file.
>
>> That is why EROFS selected exist in-kernel fscache and
>> made userspace Nydus adapt it:
>>
>> even (here called) manifest on-disk format ---
>> EROFS call primary device ---
>> they call Nydus bootstrap;
>>
>> I'm not sure why it becomes impossible for ... ($$$$).
>
> I am not sure what you mean, care to elaborate?

I just meant that these are actually the same concept with
different names, and:
Nydus is a 2020 thing;
EROFS + primary device is a mid-2021 thing.

>
>> In addition, if fscache is used, it can also use
>> fsverity_get_digest() to enable fsverity for non-on-demand
>> files.
>>
>> But again I think even Google's folks think that is
>> (somewhat) broken so that they added fs-verity to its incFS
>> in a self-contained way in Feb 2021 [6].
>>
>> Finally, again, I do hope a LSF/MM discussion for this new
>> overlay model (full of massive magical symlinks to override
>> permission.)
>
> you keep pointing it out but nobody is overriding any permission. The
> "symlinks" as you call them are just a way to refer to the payload files
> so they can be shared among different mounts. It is the same idea used
> by "overlay metacopy" and nobody is complaining about it being a
> security issue (because it is not).

See the overlay documentation, which clearly describes such metacopy behavior:
https://docs.kernel.org/filesystems/overlayfs.html

"
Do not use metacopy=on with untrusted upper/lower directories.
Otherwise it is possible that an attacker can create a handcrafted file
with appropriate REDIRECT and METACOPY xattrs, and gain access to file
on lower pointed by REDIRECT. This should not be possible on local
system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
it should be possible for untrusted layers like from a pen drive.
"

Do we really need such behavior in another fs, especially one with its
own on-disk format? At least Christian said,
"FUSE and Overlayfs are adventurous enough and they don't have their
own on-disk format."

>
> The files in the CAS are owned by the user that creates the mount, so
> there is no need to circumvent any permission check to access them.
> We use fs-verity for these files to make sure they are not modified by a
> malicious user that could get access to them (e.g. a container breakout).

fs-verity is not always enforced, and it's broken here if fsverity is not
supported by the underlying fses; that is another point I'd argue.

Thanks,
Gao Xiang

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/

>
> Regards,
> Giuseppe
>
>>

2023-01-21 22:58:58

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Gao Xiang <[email protected]> writes:

> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>> Gao Xiang <[email protected]> writes:
>>
>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>> Hi Amir,
>>>> Amir Goldstein <[email protected]> writes:
>>>>
>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>>
>>> ...
>>>
>>>>>>
>>>>>
>>>>> Hi Alexander,
>>>>>
>>>>> I must say that I am a little bit puzzled by this v3.
>>>>> Gao, Christian and myself asked you questions on v2
>>>>> that are not mentioned in v3 at all.
>>>>>
>>>>> To sum it up, please do not propose composefs without explaining
>>>>> what are the barriers for achieving the exact same outcome with
>>>>> the use of a read-only overlayfs with two lower layer -
>>>>> uppermost with erofs containing the metadata files, which include
>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>> to the lowermost layer containing the content files.
>>>> I think Dave explained quite well why using overlay is not
>>>> comparable to
>>>> what composefs does.
>>>> One big difference is that overlay still requires at least a syscall
>>>> for
>>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>>> clean it up. It is somehow acceptable for long-running services, but it
>>>> is not for "serverless" containers where images/containers are created
>>>> and destroyed frequently. So even in the case we already have all the
>>>> image files available locally, we still need to create a checkout with
>>>> the final structure we need for the image.
>>>> I also don't see how overlay would solve the verified image problem.
>>>> We
>>>> would have the same problem we have today with fs-verity as it can only
>>>> validate a single file but not the entire directory structure. Changes
>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>>>> xattrs won't be noticed.
>>>> There are at the moment two ways to handle container images, both
>>>> somehow
>>>> guided by the available file systems in the kernel.
>>>> - A single image mounted as a block device.
>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>> overlay layers.
>>>> One big advantage of the block devices model is that you can use
>>>> dm-verity, this is something we miss today with OCI container images
>>>> that use overlay.
>>>> What we are proposing with composefs is a way to have "dm-verity"
>>>> style
>>>> validation based on fs-verity and the possibility to share individual
>>>> files instead of layers. These files can also be on different file
>>>> systems, which is something not possible with the block device model.
>>>
>>> That is not a new idea honestly, including chain of trust. Even laterly
>>> out-of-tree incremental fs using fs-verity for this as well, except that
>>> it's in a real self-contained way.
>>>
>>>> The composefs manifest blob could be generated remotely and signed.
>>>> A
>>>> client would need just to validate the signature for the manifest blob
>>>> and from there retrieve the files that are not in the local CAS (even
>>>> from an insecure source) and mount directly the manifest file.
>>>
>>>
>>> Back to the topic, after thinking something I have to make a
>>> compliment for reference.
>>>
>>> First, EROFS had the same internal dissussion and decision at
>>> that time almost _two years ago_ (June 2021), it means:
>>>
>>> a) Some internal people really suggested EROFS could develop
>>> an entire new file-based in-kernel local cache subsystem
>>> (as you called local CAS, whatever) with stackable file
>>> interface so that the exist Nydus image service [1] (as
>>> ostree, and maybe ostree can use it as well) don't need to
>>> modify anything to use exist blobs;
>>>
>>> b) Reuse exist fscache/cachefiles;
>>>
>>> The reason why we (especially me) finally selected b) because:
>>>
>>> - see the people discussion of Google's original Incremental
>>> FS topic [2] [3] in 2019, as Amir already mentioned. At
>>> that time all fs folks really like to reuse exist subsystem
>>> for in-kernel caching rather than reinvent another new
>>> in-kernel wheel for local cache.
>>>
>>> [ Reinventing a new wheel is not hard (fs or caching), just
>>> makes Linux more fragmented. Especially a new filesystem
>>> is just proposed to generate images full of massive massive
>>> new magical symlinks with *overriden* uid/gid/permissions
>>> to replace regular files. ]
>>>
>>> - in-kernel cache implementation usually met several common
>>> potential security issues; reusing exist subsystem can
>>> make all fses addressed them and benefited from it.
>>>
>>> - Usually an exist widely-used userspace implementation is
>>> never an excuse for a new in-kernel feature.
>>>
>>> Although David Howells is always quite busy these months to
>>> develop new netfs interface, otherwise (we think) we should
>>> already support failover, multiple daemon/dirs, daemonless and
>>> more.
>> we have not added any new cache system. overlay does "layer
>> deduplication" and in similar way composefs does "file deduplication".
>> That is not a built-in feature, it is just a side effect of how things
>> are packed together.
>> Using fscache seems like a good idea and it has many advantages but
>> it
>> is a centralized cache mechanism and it looks like a potential problem
>> when you think about allowing mounts from a user namespace.
>
> I think Christian [1] had the same feeling of my own at that time:
>
> "I'm pretty skeptical of this plan whether we should add more filesystems
> that are mountable by unprivileged users. FUSE and Overlayfs are
> adventurous enough and they don't have their own on-disk format. The
> track record of bugs exploitable due to userns isn't making this
> very attractive."
>
> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use
> dm-verity) as well, but it doesn't change _anything_ about concerns of
> "allowing mounts from a user namespace".

I've mentioned that as a potential feature we could add in the future,
given the simplicity of the format and the fact that it uses a CAS for
its data instead of fscache. Each user can have and use their own store
to mount the images.

At this point it is just a wish from userspace, as it would improve a
few real use cases we have.

Having the possibility to run containers without root privileges is a
big deal for many users; look at Flatpak apps, for example, or rootless
Podman. Mounting and validating images would be a big security
improvement. It is something that is not possible at the moment, as
fs-verity doesn't cover the directory structure and dm-verity seems out
of reach from a user namespace.

Composefs delegates the entire logic of dealing with files to the
underlying file system in a similar way to overlay.

Forging the inode metadata from a user namespace mount doesn't look
like an insurmountable problem either, since it is already possible
with a FUSE filesystem.

So the proposal/wish here is to have a very simple format, that at some
point could be considered safe to mount from a user namespace, in
addition to overlay and FUSE.


>> As you know as I've contacted you, I've looked at EROFS in the past
>> and tried to get our use cases to work with it before thinking about
>> submitting composefs upstream.
>> From what I could see EROFS and composefs use two different
>> approaches
>> to solve a similar problem, but it is not possible to do exactly with
>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a
>> block device that uses fscache, and composefs as an overlay for files
>> instead of directories.
>
> I don't think so honestly. EROFS "Multiple device" feature is
> actually "multiple blobs" feature if you really think "device"
> is block device.
>
> Primary device -- primary blob -- "composefs manifest blob"
> Blob device -- data blobs -- "composefs backing files"
>
> any difference?

I wouldn't expect any substantial difference between two RO file
systems.

Please correct me if I am wrong: EROFS uses 16 bits for the blob device
ID, so if we map each file to a single blob device we are rather limited
in how many files we can have.
Sure, this is just an artificial limit and can be bumped in a future
version, but the major difference remains: EROFS uses the blob device
through fscache while the composefs files are looked up in the specified
repositories.

>> Sure composefs is quite simple and you could embed the composefs
>> features in EROFS and let EROFS behave as composefs when provided a
>> similar manifest file. But how is that any better than having a
>
> EROFS always has such feature since v5.16, we called primary device,
> or Nydus concept --- "bootstrap file".
>
>> separate implementation that does just one thing well instead of merging
>> different paradigms together?
>
> It's exist fs on-disk compatible (people can deploy the same image
> to wider scenarios), or you could modify/enhacnce any in-kernel local
> fs to do so like I already suggested, such as enhancing "fs/romfs" and
> make it maintained again due to this magic symlink feature
>
> (because composefs don't have other on-disk requirements other than
> a symlink path and a SHA256 verity digest from its original
> requirement. Any local fs can be enhanced like this.)
>
>>
>>> I know that you guys repeatedly say it's a self-contained
>>> stackable fs and has few code (the same words as Incfs
>>> folks [3] said four years ago already), four reasons make it
>>> weak IMHO:
>>>
>>> - I think core EROFS is about 2~3 kLOC as well if
>>> compression, sysfs and fscache are all code-truncated.
>>>
>>> Also, it's always welcome that all people could submit
>>> patches for cleaning up. I always do such cleanups
>>> from time to time and makes it better.
>>>
>>> - "Few code lines" is somewhat weak because people do
>>> develop new features, layout after upstream.
>>>
>>> Such claim is usually _NOT_ true in the future if you
>>> guys do more to optimize performance, new layout or even
>>> do your own lazy pulling with your local CAS codebase in
>>> the future unless
>>> you *promise* you once dump the code, and do bugfix
>>> only like Christian said [4].
>>>
>>> From LWN.net comments, I do see the opposite
>>> possibility that you'd like to develop new features
>>> later.
>>>
>>> - In the past, all in-tree kernel filesystems were
>>> designed and implemented without some user-space
>>> specific indication, including Nydus and ostree (I did
>>> see a lot of discussion between folks before in ociv2
>>> brainstorm [5]).
>> Since you are mentioning OCI:
>> Potentially composefs can be the file system that enables something
>> very
>> close to "ociv2", but it won't need to be called v2 since it is
>> completely compatible with the current OCI image format.
>> It won't require a different image format, just a seekable tarball
>> that
>> is compatible with old "v1" clients and we need to provide the composefs
>> manifest file.
>
> May I ask did you really look into what Nydus + EROFS already did (as you
> mentioned we discussed before)?
>
> Your "composefs manifest file" is exactly "Nydus bootstrap file", see:
> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>
> "Rafs is a filesystem image containing a separated metadata blob and
> several data-deduplicated content-addressable data blobs. In a typical
> rafs filesystem, the metadata is stored in bootstrap while the data
> is stored in blobfile.
> ...
>
> bootstrap: The metadata is a merkle tree (I think that is typo, should be
> filesystem tree) whose nodes represents a regular filesystem's
> directory/file a leaf node refers to a file and contains hash value of
> its file data.
> Root node and internal nodes refer to directories and contain the
> hash value
> of their children nodes."
>
> Nydus is already supported "It won't require a different image format, just
> a seekable tarball that is compatible with old "v1" clients and we need to
> provide the composefs manifest file." feature in v2.2 and will be released
> later.

Nydus is not using a tarball compatible with OCI v1.

It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1", which
means it is not compatible with existing clients that don't know about
it, and you need special handling for that.

Anyway, let's not bother LKML folks with these userspace details. It
has no relevance to the kernel and what file systems do.


>> The seekable tarball allows individual files to be retrieved. OCI
>> clients will not need to pull the entire tarball, but only the individual
>> files that are not already present in the local CAS. They won't also need
>> to create the overlay layout at all, as we do today, since it is already
>> described with the composefs manifest file.
>> The manifest is portable on different machines with different
>> configurations, as you can use multiple CAS when mounting composefs.
>> Some users might have a local CAS, some others could have a
>> secondary
>> CAS on a network file system and composefs support all these
>> configurations with the same signed manifest file.
>>
>>> That is why EROFS selected exist in-kernel fscache and
>>> made userspace Nydus adapt it:
>>>
>>> even (here called) manifest on-disk format ---
>>> EROFS call primary device ---
>>> they call Nydus bootstrap;
>>>
>>> I'm not sure why it becomes impossible for ... ($$$$).
>> I am not sure what you mean, care to elaborate?
>
> I just meant these concepts are actually the same concept with
> different names and:
> Nydus is a 2020 stuff;

CRFS[1] is 2019 stuff.

> EROFS + primary device is a 2021-mid stuff.
>
>>> In addition, if fscache is used, it can also use
>>> fsverity_get_digest() to enable fsverity for non-on-demand
>>> files.
>>>
>>> But again I think even Google's folks think that is
>>> (somewhat) broken so that they added fs-verity to its incFS
>>> in a self-contained way in Feb 2021 [6].
>>>
>>> Finally, again, I do hope a LSF/MM discussion for this new
>>> overlay model (full of massive magical symlinks to override
>>> permission.)
>> you keep pointing it out but nobody is overriding any permission.
>> The
>> "symlinks" as you call them are just a way to refer to the payload files
>> so they can be shared among different mounts. It is the same idea used
>> by "overlay metacopy" and nobody is complaining about it being a
>> security issue (because it is not).
>
> See overlay documentation clearly wrote such metacopy behavior:
> https://docs.kernel.org/filesystems/overlayfs.html
>
> "
> Do not use metacopy=on with untrusted upper/lower directories.
> Otherwise it is possible that an attacker can create a handcrafted file
> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
> on lower pointed by REDIRECT. This should not be possible on local
> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
> it should be possible for untrusted layers like from a pen drive.
> "
>
> Do we really need such behavior working on another fs especially with
> on-disk format? At least Christian said,
> "FUSE and Overlayfs are adventurous enough and they don't have their
> own on-disk format."

If users want to do something really weird they can always find a
way, but the composefs lookup is limited to the directories specified
at mount time, so it is not possible to access any file outside the
repository.


>> The files in the CAS are owned by the user that creates the mount,
>> so
>> there is no need to circumvent any permission check to access them.
>> We use fs-verity for these files to make sure they are not modified by a
>> malicious user that could get access to them (e.g. a container breakout).
>
> fs-verity is not always enforcing and it's broken here if fsverity is not
> supported in underlay fses, that is another my arguable point.

It is a trade-off. It is up to the user to pick a configuration that
allows using fs-verity if they care about this feature.

Regards,
Giuseppe

[1] https://github.com/google/crfs


> Thanks,
> Gao Xiang
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
>> Regards,
>> Giuseppe

2023-01-22 01:28:32

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem



On 2023/1/22 06:34, Giuseppe Scrivano wrote:
> Gao Xiang <[email protected]> writes:
>
>> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>>> Gao Xiang <[email protected]> writes:
>>>
>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>>> Hi Amir,
>>>>> Amir Goldstein <[email protected]> writes:
>>>>>
>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>>>
>>>> ...
>>>>
>>>>>>>
>>>>>>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> I must say that I am a little bit puzzled by this v3.
>>>>>> Gao, Christian and myself asked you questions on v2
>>>>>> that are not mentioned in v3 at all.
>>>>>>
>>>>>> To sum it up, please do not propose composefs without explaining
>>>>>> what are the barriers for achieving the exact same outcome with
>>>>>> the use of a read-only overlayfs with two lower layer -
>>>>>> uppermost with erofs containing the metadata files, which include
>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>>> to the lowermost layer containing the content files.
>>>>> I think Dave explained quite well why using overlay is not
>>>>> comparable to
>>>>> what composefs does.
>>>>> One big difference is that overlay still requires at least a syscall
>>>>> for
>>>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>>>> clean it up. It is somehow acceptable for long-running services, but it
>>>>> is not for "serverless" containers where images/containers are created
>>>>> and destroyed frequently. So even in the case we already have all the
>>>>> image files available locally, we still need to create a checkout with
>>>>> the final structure we need for the image.
>>>>> I also don't see how overlay would solve the verified image problem.
>>>>> We
>>>>> would have the same problem we have today with fs-verity as it can only
>>>>> validate a single file but not the entire directory structure. Changes
>>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>>>>> xattrs won't be noticed.
>>>>> There are at the moment two ways to handle container images, both
>>>>> somehow
>>>>> guided by the available file systems in the kernel.
>>>>> - A single image mounted as a block device.
>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>>> overlay layers.
>>>>> One big advantage of the block devices model is that you can use
>>>>> dm-verity, this is something we miss today with OCI container images
>>>>> that use overlay.
>>>>> What we are proposing with composefs is a way to have "dm-verity"
>>>>> style
>>>>> validation based on fs-verity and the possibility to share individual
>>>>> files instead of layers. These files can also be on different file
>>>>> systems, which is something not possible with the block device model.
>>>>
>>>> That is not a new idea honestly, including chain of trust. Even laterly
>>>> out-of-tree incremental fs using fs-verity for this as well, except that
>>>> it's in a real self-contained way.
>>>>
>>>>> The composefs manifest blob could be generated remotely and signed.
>>>>> A
>>>>> client would need just to validate the signature for the manifest blob
>>>>> and from there retrieve the files that are not in the local CAS (even
>>>>> from an insecure source) and mount directly the manifest file.
>>>>
>>>>
>>>> Back to the topic, after thinking something I have to make a
>>>> compliment for reference.
>>>>
>>>> First, EROFS had the same internal dissussion and decision at
>>>> that time almost _two years ago_ (June 2021), it means:
>>>>
>>>> a) Some internal people really suggested EROFS could develop
>>>> an entire new file-based in-kernel local cache subsystem
>>>> (as you called local CAS, whatever) with stackable file
>>>> interface so that the exist Nydus image service [1] (as
>>>> ostree, and maybe ostree can use it as well) don't need to
>>>> modify anything to use exist blobs;
>>>>
>>>> b) Reuse exist fscache/cachefiles;
>>>>
>>>> The reason why we (especially me) finally selected b) because:
>>>>
>>>> - see the people discussion of Google's original Incremental
>>>> FS topic [2] [3] in 2019, as Amir already mentioned. At
>>>> that time all fs folks really like to reuse exist subsystem
>>>> for in-kernel caching rather than reinvent another new
>>>> in-kernel wheel for local cache.
>>>>
>>>> [ Reinventing a new wheel is not hard (fs or caching), just
>>>> makes Linux more fragmented. Especially a new filesystem
>>>> is just proposed to generate images full of massive massive
>>>> new magical symlinks with *overriden* uid/gid/permissions
>>>> to replace regular files. ]
>>>>
>>>> - in-kernel cache implementation usually met several common
>>>> potential security issues; reusing exist subsystem can
>>>> make all fses addressed them and benefited from it.
>>>>
>>>> - Usually an exist widely-used userspace implementation is
>>>> never an excuse for a new in-kernel feature.
>>>>
>>>> Although David Howells is always quite busy these months to
>>>> develop new netfs interface, otherwise (we think) we should
>>>> already support failover, multiple daemon/dirs, daemonless and
>>>> more.
>>> we have not added any new cache system. overlay does "layer
>>> deduplication" and in similar way composefs does "file deduplication".
>>> That is not a built-in feature, it is just a side effect of how things
>>> are packed together.
>>> Using fscache seems like a good idea and it has many advantages but
>>> it
>>> is a centralized cache mechanism and it looks like a potential problem
>>> when you think about allowing mounts from a user namespace.
>>
>> I think Christian [1] had the same feeling of my own at that time:
>>
>> "I'm pretty skeptical of this plan whether we should add more filesystems
>> that are mountable by unprivileged users. FUSE and Overlayfs are
>> adventurous enough and they don't have their own on-disk format. The
>> track record of bugs exploitable due to userns isn't making this
>> very attractive."
>>
>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use
>> dm-verity) as well, but it doesn't change _anything_ about concerns of
>> "allowing mounts from a user namespace".
>
> I've mentioned that as a potential feature we could add in future, given
> the simplicity of the format and that it uses a CAS for its data instead
> of fscache. Each user can have and use their own store to mount the
> images.
>
> At this point it is just a wish from userspace, as it would improve a
> few real use cases we have.
>
> Having the possibility to run containers without root privileges is a
> big deal for many users, look at Flatpak apps for example, or rootless
> Podman. Mounting and validating images would be a a big security
> improvement. It is something that is not possible at the moment as
> fs-verity doesn't cover the directory structure and dm-verity seems out
> of reach from a user namespace.
>
> Composefs delegates the entire logic of dealing with files to the
> underlying file system in a similar way to overlay.
>
> Forging the inode metadata from a user namespace mount doesn't look
> like an insurmountable problem as well since it is already possible
> with a FUSE filesystem.
>
> So the proposal/wish here is to have a very simple format, that at some
> point could be considered safe to mount from a user namespace, in
> addition to overlay and FUSE.

My response is quite similar to
https://lore.kernel.org/r/[email protected]om/

>
>
>>> As you know as I've contacted you, I've looked at EROFS in the past
>>> and tried to get our use cases to work with it before thinking about
>>> submitting composefs upstream.
>>> From what I could see EROFS and composefs use two different
>>> approaches
>>> to solve a similar problem, but it is not possible to do exactly with
>>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a
>>> block device that uses fscache, and composefs as an overlay for files
>>> instead of directories.
>>
>> I don't think so honestly. EROFS "Multiple device" feature is
>> actually "multiple blobs" feature if you really think "device"
>> is block device.
>>
>> Primary device -- primary blob -- "composefs manifest blob"
>> Blob device -- data blobs -- "composefs backing files"
>>
>> any difference?
>
> I wouldn't expect any substancial difference between two RO file
> systems.
>
> Please correct me if I am wrong: EROFS uses 16 bits for the blob device
> ID, so if we map each file to a single blob device we are kind of
> limited on how many files we can have.

I brought this up just to represent the "composefs manifest file"
concept, not the device ID.

> Sure this is just an artificial limit and can be bumped in a future
> version but the major difference remains: EROFS uses the blob device
> through fscache while the composefs files are looked up in the specified
> repositories.

No, fscache can also open any cookie when opening a file. Again, even with
fscache, EROFS doesn't need to modify _any_ on-disk format to:

- record a "cookie id" for such a special "magical symlink" with a similar
symlink on-disk format (or whatever on-disk format with data, just with
a new on-disk flag);

- open such a "cookie id" on demand when opening such an EROFS file, just as
any other network fs does. I don't think the blob device is a limitation here.

Any difference now?

>
>>> Sure composefs is quite simple and you could embed the composefs
>>> features in EROFS and let EROFS behave as composefs when provided a
>>> similar manifest file. But how is that any better than having a
>>
>> EROFS always has such feature since v5.16, we called primary device,
>> or Nydus concept --- "bootstrap file".
>>
>>> separate implementation that does just one thing well instead of merging
>>> different paradigms together?
>>
>> It's exist fs on-disk compatible (people can deploy the same image
>> to wider scenarios), or you could modify/enhacnce any in-kernel local
>> fs to do so like I already suggested, such as enhancing "fs/romfs" and
>> make it maintained again due to this magic symlink feature
>>
>> (because composefs don't have other on-disk requirements other than
>> a symlink path and a SHA256 verity digest from its original
>> requirement. Any local fs can be enhanced like this.)
>>
>>>
>>>> I know that you guys repeatedly say it's a self-contained
>>>> stackable fs and has few code (the same words as Incfs
>>>> folks [3] said four years ago already), four reasons make it
>>>> weak IMHO:
>>>>
>>>> - I think core EROFS is about 2~3 kLOC as well if
>>>> compression, sysfs and fscache are all code-truncated.
>>>>
>>>> Also, it's always welcome that all people could submit
>>>> patches for cleaning up. I always do such cleanups
>>>> from time to time and makes it better.
>>>>
>>>> - "Few code lines" is somewhat weak because people do
>>>> develop new features, layout after upstream.
>>>>
>>>> Such claim is usually _NOT_ true in the future if you
>>>> guys do more to optimize performance, new layout or even
>>>> do your own lazy pulling with your local CAS codebase in
>>>> the future unless
>>>> you *promise* you once dump the code, and do bugfix
>>>> only like Christian said [4].
>>>>
>>>> From LWN.net comments, I do see the opposite
>>>> possibility that you'd like to develop new features
>>>> later.
>>>>
>>>> - In the past, all in-tree kernel filesystems were
>>>> designed and implemented without some user-space
>>>> specific indication, including Nydus and ostree (I did
>>>> see a lot of discussion between folks before in ociv2
>>>> brainstorm [5]).
>>> Since you are mentioning OCI:
>>> Potentially composefs can be the file system that enables something
>>> very
>>> close to "ociv2", but it won't need to be called v2 since it is
>>> completely compatible with the current OCI image format.
>>> It won't require a different image format, just a seekable tarball
>>> that
>>> is compatible with old "v1" clients and we need to provide the composefs
>>> manifest file.
>>
>> May I ask did you really look into what Nydus + EROFS already did (as you
>> mentioned we discussed before)?
>>
>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see:
>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>>
>> "Rafs is a filesystem image containing a separated metadata blob and
>> several data-deduplicated content-addressable data blobs. In a typical
>> rafs filesystem, the metadata is stored in bootstrap while the data
>> is stored in blobfile.
>> ...
>>
>> bootstrap: The metadata is a merkle tree (I think that is typo, should be
>> filesystem tree) whose nodes represents a regular filesystem's
>> directory/file a leaf node refers to a file and contains hash value of
>> its file data.
>> Root node and internal nodes refer to directories and contain the
>> hash value
>> of their children nodes."
>>
>> Nydus is already supported "It won't require a different image format, just
>> a seekable tarball that is compatible with old "v1" clients and we need to
>> provide the composefs manifest file." feature in v2.2 and will be released
>> later.
>
> Nydus is not using a tarball compatible with OCI v1.
>
> It defines a media type "application/vnd.oci.image.layer.nydus.blob.v1", that
> means it is not compatible with existing clients that don't know about
> it and you need special handling for that.

I am not sure what you're saying: "media type" is quite off topic here.

If you say "mkcomposefs" is done on the server side, what is the media
type of such manifest files?

And why can't Nydus do the same thing?
https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md

>
> Anyway, let's not bother LKML folks with these userspace details. It
> has no relevance to the kernel and what file systems do.

I'd like to avoid that too; I didn't say anything about userspace
details. I just wanted to say that
"a merged filesystem tree is also _not_ a new idea of composefs",
not to talk about "media type", etc.

>
>
>>> The seekable tarball allows individual files to be retrieved. OCI
>>> clients will not need to pull the entire tarball, but only the individual
>>> files that are not already present in the local CAS. They won't also need
>>> to create the overlay layout at all, as we do today, since it is already
>>> described with the composefs manifest file.
>>> The manifest is portable on different machines with different
>>> configurations, as you can use multiple CAS when mounting composefs.
>>> Some users might have a local CAS, some others could have a
>>> secondary
>>> CAS on a network file system and composefs support all these
>>> configurations with the same signed manifest file.
>>>
>>>> That is why EROFS selected exist in-kernel fscache and
>>>> made userspace Nydus adapt it:
>>>>
>>>> even (here called) manifest on-disk format ---
>>>> EROFS call primary device ---
>>>> they call Nydus bootstrap;
>>>>
>>>> I'm not sure why it becomes impossible for ... ($$$$).
>>> I am not sure what you mean, care to elaborate?
>>
>> I just meant these concepts are actually the same concept with
>> different names and:
>> Nydus is a 2020 stuff;
>
> CRFS[1] is 2019 stuff.

Does CRFS have anything similar to a merged filesystem tree?

Here we are talking about a local CAS:
I have no idea whether CRFS has anything similar to it.

>
>> EROFS + primary device is a 2021-mid stuff.
>>
>>>> In addition, if fscache is used, it can also use
>>>> fsverity_get_digest() to enable fsverity for non-on-demand
>>>> files.
>>>>
>>>> But again I think even Google's folks think that is
>>>> (somewhat) broken so that they added fs-verity to its incFS
>>>> in a self-contained way in Feb 2021 [6].
>>>>
>>>> Finally, again, I do hope a LSF/MM discussion for this new
>>>> overlay model (full of massive magical symlinks to override
>>>> permission.)
>>> you keep pointing it out but nobody is overriding any permission.
>>> The
>>> "symlinks" as you call them are just a way to refer to the payload files
>>> so they can be shared among different mounts. It is the same idea used
>>> by "overlay metacopy" and nobody is complaining about it being a
>>> security issue (because it is not).
>>
>> See overlay documentation clearly wrote such metacopy behavior:
>> https://docs.kernel.org/filesystems/overlayfs.html
>>
>> "
>> Do not use metacopy=on with untrusted upper/lower directories.
>> Otherwise it is possible that an attacker can create a handcrafted file
>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
>> on lower pointed by REDIRECT. This should not be possible on local
>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
>> it should be possible for untrusted layers like from a pen drive.
>> "
>>
>> Do we really need such behavior working on another fs especially with
>> on-disk format? At least Christian said,
>> "FUSE and Overlayfs are adventurous enough and they don't have their
>> own on-disk format."
>
> If users want to do something really weird then they can always find a
> way but the composefs lookup is limited under the directories specified
> at mount time, so it is not possible to access any file outside the
> repository.
>
>
>>> The files in the CAS are owned by the user that creates the mount,
>>> so
>>> there is no need to circumvent any permission check to access them.
>>> We use fs-verity for these files to make sure they are not modified by a
>>> malicious user that could get access to them (e.g. a container breakout).
>>
>> fs-verity is not always enforcing and it's broken here if fsverity is not
>> supported in underlay fses, that is another my arguable point.
>
> It is a trade-off. It is up to the user to pick a configuration that
> allows using fs-verity if they care about this feature.

I don't think fsverity is optional in your plan.

I wrote all of this because it seems I didn't mention the original
motivation for using fscache in v2: the kernel already has such an
in-kernel local cache, and people preferred to use it in 2019 rather
than another stackable approach (as mentioned in the Incremental FS
thread).

Thanks,
Gao Xiang

>
> Regards,
> Giuseppe
>
> [1] https://github.com/google/crfs

2023-01-22 09:16:34

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Gao Xiang <[email protected]> writes:

> On 2023/1/22 06:34, Giuseppe Scrivano wrote:
>> Gao Xiang <[email protected]> writes:
>>
>>> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>>>> Gao Xiang <[email protected]> writes:
>>>>
>>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>>>> Hi Amir,
>>>>>> Amir Goldstein <[email protected]> writes:
>>>>>>
>>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>>>>
>>>>> ...
>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Hi Alexander,
>>>>>>>
>>>>>>> I must say that I am a little bit puzzled by this v3.
>>>>>>> Gao, Christian and myself asked you questions on v2
>>>>>>> that are not mentioned in v3 at all.
>>>>>>>
>>>>>>> To sum it up, please do not propose composefs without explaining
>>>>>>> what are the barriers for achieving the exact same outcome with
>>>>>>> the use of a read-only overlayfs with two lower layer -
>>>>>>> uppermost with erofs containing the metadata files, which include
>>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>>>> to the lowermost layer containing the content files.
>>>>>> I think Dave explained quite well why using overlay is not
>>>>>> comparable to
>>>>>> what composefs does.
>>>>>> One big difference is that overlay still requires at least a syscall
>>>>>> for
>>>>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>>>>> clean it up. It is somehow acceptable for long-running services, but it
>>>>>> is not for "serverless" containers where images/containers are created
>>>>>> and destroyed frequently. So even in the case we already have all the
>>>>>> image files available locally, we still need to create a checkout with
>>>>>> the final structure we need for the image.
>>>>>> I also don't see how overlay would solve the verified image problem.
>>>>>> We
>>>>>> would have the same problem we have today with fs-verity as it can only
>>>>>> validate a single file but not the entire directory structure. Changes
>>>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>>>>>> xattrs won't be noticed.
>>>>>> There are at the moment two ways to handle container images, both
>>>>>> somehow
>>>>>> guided by the available file systems in the kernel.
>>>>>> - A single image mounted as a block device.
>>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>>>> overlay layers.
>>>>>> One big advantage of the block devices model is that you can use
>>>>>> dm-verity, this is something we miss today with OCI container images
>>>>>> that use overlay.
>>>>>> What we are proposing with composefs is a way to have "dm-verity"
>>>>>> style
>>>>>> validation based on fs-verity and the possibility to share individual
>>>>>> files instead of layers. These files can also be on different file
>>>>>> systems, which is something not possible with the block device model.
>>>>>
>>>>> That is not a new idea honestly, including chain of trust. Even laterly
>>>>> out-of-tree incremental fs using fs-verity for this as well, except that
>>>>> it's in a real self-contained way.
>>>>>
>>>>>> The composefs manifest blob could be generated remotely and signed.
>>>>>> A
>>>>>> client would need just to validate the signature for the manifest blob
>>>>>> and from there retrieve the files that are not in the local CAS (even
>>>>>> from an insecure source) and mount directly the manifest file.
>>>>>
>>>>>
>>>>> Back to the topic, after thinking something I have to make a
>>>>> compliment for reference.
>>>>>
>>>>> First, EROFS had the same internal discussion and decision at
>>>>> that time almost _two years ago_ (June 2021), it means:
>>>>>
>>>>> a) Some internal people really suggested EROFS could develop
>>>>> an entire new file-based in-kernel local cache subsystem
>>>>> (as you called local CAS, whatever) with stackable file
>>>>> interface so that the exist Nydus image service [1] (as
>>>>> ostree, and maybe ostree can use it as well) don't need to
>>>>> modify anything to use exist blobs;
>>>>>
>>>>> b) Reuse exist fscache/cachefiles;
>>>>>
>>>>> The reason why we (especially me) finally selected b) because:
>>>>>
>>>>> - see the people discussion of Google's original Incremental
>>>>> FS topic [2] [3] in 2019, as Amir already mentioned. At
>>>>> that time all fs folks really like to reuse exist subsystem
>>>>> for in-kernel caching rather than reinvent another new
>>>>> in-kernel wheel for local cache.
>>>>>
>>>>> [ Reinventing a new wheel is not hard (fs or caching), just
>>>>> makes Linux more fragmented. Especially a new filesystem
>>>>> is just proposed to generate images full of massive massive
>>>>> new magical symlinks with *overridden* uid/gid/permissions
>>>>> to replace regular files. ]
>>>>>
>>>>> - in-kernel cache implementation usually met several common
>>>>> potential security issues; reusing exist subsystem can
>>>>> make all fses addressed them and benefited from it.
>>>>>
>>>>> - Usually an exist widely-used userspace implementation is
>>>>> never an excuse for a new in-kernel feature.
>>>>>
>>>>> Although David Howells is always quite busy these months to
>>>>> develop new netfs interface, otherwise (we think) we should
>>>>> already support failover, multiple daemon/dirs, daemonless and
>>>>> more.
>>>> we have not added any new cache system. overlay does "layer
>>>> deduplication" and in similar way composefs does "file deduplication".
>>>> That is not a built-in feature, it is just a side effect of how things
>>>> are packed together.
>>>> Using fscache seems like a good idea and it has many advantages but
>>>> it
>>>> is a centralized cache mechanism and it looks like a potential problem
>>>> when you think about allowing mounts from a user namespace.
>>>
>>> I think Christian [1] had the same feeling of my own at that time:
>>>
>>> "I'm pretty skeptical of this plan whether we should add more filesystems
>>> that are mountable by unprivileged users. FUSE and Overlayfs are
>>> adventurous enough and they don't have their own on-disk format. The
>>> track record of bugs exploitable due to userns isn't making this
>>> very attractive."
>>>
>>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use
>>> dm-verity) as well, but it doesn't change _anything_ about concerns of
>>> "allowing mounts from a user namespace".
>> I've mentioned that as a potential feature we could add in future,
>> given
>> the simplicity of the format and that it uses a CAS for its data instead
>> of fscache. Each user can have and use their own store to mount the
>> images.
>> At this point it is just a wish from userspace, as it would improve
>> a
>> few real use cases we have.
>> Having the possibility to run containers without root privileges is
>> a
>> big deal for many users, look at Flatpak apps for example, or rootless
>> Podman. Mounting and validating images would be a big security
>> improvement. It is something that is not possible at the moment as
>> fs-verity doesn't cover the directory structure and dm-verity seems out
>> of reach from a user namespace.
>> Composefs delegates the entire logic of dealing with files to the
>> underlying file system in a similar way to overlay.
>> Forging the inode metadata from a user namespace mount doesn't look
>> like an insurmountable problem as well since it is already possible
>> with a FUSE filesystem.
>> So the proposal/wish here is to have a very simple format, that at
>> some
>> point could be considered safe to mount from a user namespace, in
>> addition to overlay and FUSE.
>
> My response is quite similar to
> https://lore.kernel.org/r/[email protected]om/

I don't see how that applies to what I said about unprivileged mounts,
except for the part about lazy download, where I agree with Miklos that
it should be handled through FUSE; that is something composefs can do:

mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs

where /mnt/fuse is handled by a FUSE file system that takes care of
loading the files from the remote server, and possibly writes them to
/path/to/store once they are complete.

So each user could have their "lazy download" without interfering with
other users or the centralized cache.
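
To be concrete, the setup could look something like the sketch below.
The FUSE helper name and the URL are made up for illustration; only the
final mount line is the one shown above:

  (hypothetical FUSE daemon that fetches missing objects from the remote
   server and writes them into the local store once they are complete)
  # remote-cas-fuse https://registry.example.com /path/to/store /mnt/fuse

  (lookups try the local store first and fall back to the FUSE mount,
   which triggers the download on first access)
  # mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs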

>>
>>>> As you know as I've contacted you, I've looked at EROFS in the past
>>>> and tried to get our use cases to work with it before thinking about
>>>> submitting composefs upstream.
>>>> From what I could see EROFS and composefs use two different
>>>> approaches
>>>> to solve a similar problem, but it is not possible to do exactly with
>>>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a
>>>> block device that uses fscache, and composefs as an overlay for files
>>>> instead of directories.
>>>
>>> I don't think so honestly. EROFS "Multiple device" feature is
>>> actually "multiple blobs" feature if you really think "device"
>>> is block device.
>>>
>>> Primary device -- primary blob -- "composefs manifest blob"
>>> Blob device -- data blobs -- "composefs backing files"
>>>
>>> any difference?
>> I wouldn't expect any substantial difference between two RO file
>> systems.
>> Please correct me if I am wrong: EROFS uses 16 bits for the blob
>> device
>> ID, so if we map each file to a single blob device we are kind of
>> limited on how many files we can have.
>
> I was here just to represent "composefs manifest file" concept rather than
> device ID.
>
>> Sure this is just an artificial limit and can be bumped in a future
>> version but the major difference remains: EROFS uses the blob device
>> through fscache while the composefs files are looked up in the specified
>> repositories.
>
> No, fscache can also open any cookie when opening file. Again, even with
> fscache, EROFS doesn't need to modify _any_ on-disk format to:
>
> - record a "cookie id" for such special "magical symlink" with a similar
> symlink on-disk format (or whatever on-disk format with data, just with
> a new on-disk flag);
>
> - open such "cookie id" on demand when opening such EROFS file just as
> any other network fses. I don't think blob device is limited here.
>
> some difference now?

Recording the "cookie id" is done by a singleton userspace daemon that
controls the cachefiles device, and it requires one operation for each
file before the image can be mounted.

Is that the case, or did I misunderstand something?

>>
>>>> Sure composefs is quite simple and you could embed the composefs
>>>> features in EROFS and let EROFS behave as composefs when provided a
>>>> similar manifest file. But how is that any better than having a
>>>
>>> EROFS always has such feature since v5.16, we called primary device,
>>> or Nydus concept --- "bootstrap file".
>>>
>>>> separate implementation that does just one thing well instead of merging
>>>> different paradigms together?
>>>
>>> It's exist fs on-disk compatible (people can deploy the same image
>>> to wider scenarios), or you could modify/enhance any in-kernel local
>>> fs to do so like I already suggested, such as enhancing "fs/romfs" and
>>> make it maintained again due to this magic symlink feature
>>>
>>> (because composefs don't have other on-disk requirements other than
>>> a symlink path and a SHA256 verity digest from its original
>>> requirement. Any local fs can be enhanced like this.)
>>>
>>>>
>>>>> I know that you guys repeatedly say it's a self-contained
>>>>> stackable fs and has few code (the same words as Incfs
>>>>> folks [3] said four years ago already), four reasons make it
>>>>> weak IMHO:
>>>>>
>>>>> - I think core EROFS is about 2~3 kLOC as well if
>>>>> compression, sysfs and fscache are all code-truncated.
>>>>>
>>>>> Also, it's always welcome that all people could submit
>>>>> patches for cleaning up. I always do such cleanups
>>>>> from time to time and makes it better.
>>>>>
>>>>> - "Few code lines" is somewhat weak because people do
>>>>> develop new features, layout after upstream.
>>>>>
>>>>> Such claim is usually _NOT_ true in the future if you
>>>>> guys do more to optimize performance, new layout or even
>>>>> do your own lazy pulling with your local CAS codebase in
>>>>> the future unless
>>>>> you *promise* you once dump the code, and do bugfix
>>>>> only like Christian said [4].
>>>>>
>>>>> From LWN.net comments, I do see the opposite
>>>>> possibility that you'd like to develop new features
>>>>> later.
>>>>>
>>>>> - In the past, all in-tree kernel filesystems were
>>>>> designed and implemented without some user-space
>>>>> specific indication, including Nydus and ostree (I did
>>>>> see a lot of discussion between folks before in ociv2
>>>>> brainstorm [5]).
>>>> Since you are mentioning OCI:
>>>> Potentially composefs can be the file system that enables something
>>>> very
>>>> close to "ociv2", but it won't need to be called v2 since it is
>>>> completely compatible with the current OCI image format.
>>>> It won't require a different image format, just a seekable tarball
>>>> that
>>>> is compatible with old "v1" clients and we need to provide the composefs
>>>> manifest file.
>>>
>>> May I ask did you really look into what Nydus + EROFS already did (as you
>>> mentioned we discussed before)?
>>>
>>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see:
>>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>>>
>>> "Rafs is a filesystem image containing a separated metadata blob and
>>> several data-deduplicated content-addressable data blobs. In a typical
>>> rafs filesystem, the metadata is stored in bootstrap while the data
>>> is stored in blobfile.
>>> ...
>>>
>>> bootstrap: The metadata is a merkle tree (I think that is typo, should be
>>> filesystem tree) whose nodes represents a regular filesystem's
>>> directory/file a leaf node refers to a file and contains hash value of
>>> its file data.
>>> Root node and internal nodes refer to directories and contain the
>>> hash value
>>> of their children nodes."
>>>
>>> Nydus is already supported "It won't require a different image format, just
>>> a seekable tarball that is compatible with old "v1" clients and we need to
>>> provide the composefs manifest file." feature in v2.2 and will be released
>>> later.
>> Nydus is not using a tarball compatible with OCI v1.
>> It defines a media type
>> "application/vnd.oci.image.layer.nydus.blob.v1", that
>> means it is not compatible with existing clients that don't know about
>> it and you need special handling for that.
>
> I am not sure what you're saying: "media type" is quite out of topic here.
>
> If you said "mkcomposefs" is done in the server side, what is the media
> type of such manifest files?
>
> And why can't Nydus do it in the same way?
> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md
>

I am not talking about the manifest or the bootstrap file, I am talking
about the data blobs.

>> Anyway, let's not bother LKML folks with these userspace details.
>> It
>> has no relevance to the kernel and what file systems do.
>
> I'd like to avoid, I didn't say anything about userspace details, I just would
> like to say
> "merged filesystem tree is also _not_ a new idea of composefs"
> not "media type", etc.
>
>>
>>>> The seekable tarball allows individual files to be retrieved. OCI
>>>> clients will not need to pull the entire tarball, but only the individual
>>>> files that are not already present in the local CAS. They won't also need
>>>> to create the overlay layout at all, as we do today, since it is already
>>>> described with the composefs manifest file.
>>>> The manifest is portable on different machines with different
>>>> configurations, as you can use multiple CAS when mounting composefs.
>>>> Some users might have a local CAS, some others could have a
>>>> secondary
>>>> CAS on a network file system and composefs support all these
>>>> configurations with the same signed manifest file.
>>>>
>>>>> That is why EROFS selected exist in-kernel fscache and
>>>>> made userspace Nydus adapt it:
>>>>>
>>>>> even (here called) manifest on-disk format ---
>>>>> EROFS call primary device ---
>>>>> they call Nydus bootstrap;
>>>>>
>>>>> I'm not sure why it becomes impossible for ... ($$$$).
>>>> I am not sure what you mean, care to elaborate?
>>>
>>> I just meant these concepts are actually the same concept with
>>> different names and:
>>> Nydus is a 2020 stuff;
>> CRFS[1] is 2019 stuff.
>
> Does CRFS have anything similar to a merged filesystem tree?
>
> Here we talked about local CAS:
> I have no idea CRFS has anything similar to it.

Yes, it does, and it uses it with a FUSE file system. So neither
composefs nor EROFS has invented anything here.

Anyway, does it really matter who made what first? I don't see how it
helps in deciding whether composefs has enough relevant differences to
justify its presence in the kernel.

>>
>>> EROFS + primary device is a 2021-mid stuff.
>>>
>>>>> In addition, if fscache is used, it can also use
>>>>> fsverity_get_digest() to enable fsverity for non-on-demand
>>>>> files.
>>>>>
>>>>> But again I think even Google's folks think that is
>>>>> (somewhat) broken so that they added fs-verity to its incFS
>>>>> in a self-contained way in Feb 2021 [6].
>>>>>
>>>>> Finally, again, I do hope a LSF/MM discussion for this new
>>>>> overlay model (full of massive magical symlinks to override
>>>>> permission.)
>>>> you keep pointing it out but nobody is overriding any permission.
>>>> The
>>>> "symlinks" as you call them are just a way to refer to the payload files
>>>> so they can be shared among different mounts. It is the same idea used
>>>> by "overlay metacopy" and nobody is complaining about it being a
>>>> security issue (because it is not).
>>>
>>> See overlay documentation clearly wrote such metacopy behavior:
>>> https://docs.kernel.org/filesystems/overlayfs.html
>>>
>>> "
>>> Do not use metacopy=on with untrusted upper/lower directories.
>>> Otherwise it is possible that an attacker can create a handcrafted file
>>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
>>> on lower pointed by REDIRECT. This should not be possible on local
>>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
>>> it should be possible for untrusted layers like from a pen drive.
>>> "
>>>
>>> Do we really need such behavior working on another fs especially with
>>> on-disk format? At least Christian said,
>>> "FUSE and Overlayfs are adventurous enough and they don't have their
>>> own on-disk format."
>> If users want to do something really weird then they can always find
>> a
>> way but the composefs lookup is limited under the directories specified
>> at mount time, so it is not possible to access any file outside the
>> repository.
>>
>>>> The files in the CAS are owned by the user that creates the mount,
>>>> so
>>>> there is no need to circumvent any permission check to access them.
>>>> We use fs-verity for these files to make sure they are not modified by a
>>>> malicious user that could get access to them (e.g. a container breakout).
>>>
>>> fs-verity is not always enforcing, and it's broken here if fsverity is not
>>> supported in the underlying file systems; that is another of my arguable points.
>> It is a trade-off. It is up to the user to pick a configuration
>> that
>> allows using fs-verity if they care about this feature.
>
> I don't think fsverity is optional with your plan.

Yes, it is optional. Without fs-verity it behaves the same as overlay
mounts do today without any fs-verity.
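
To be clear about the opt-in part: enabling it is just the standard
fs-verity tooling on the store side. A rough sketch, assuming an
ext4-backed object store (device name and object path are made up):

  (one-time: turn on the verity feature of the file system that backs
   the object store)
  # tune2fs -O verity /dev/vdb

  (per backing file, done by whatever tool populates the store)
  # fsverity enable objects/aa/1122...

Skipping these steps still gives a working mount, just without the
per-file validation.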

How does validation work in EROFS for files that are served from fscache
and live on a remote file system?

> I wrote all of this because it seems I didn't mention the original motivation
> for using fscache in v2: the kernel already has such an in-kernel local cache,
> and in 2019 people preferred to reuse it rather than add another stackable
> approach (as mentioned in the Incremental FS thread).

Still, for us the stackable approach works better.

> Thanks,
> Gao Xiang
>
>> Regards,
>> Giuseppe
>> [1] https://github.com/google/crfs

2023-01-22 10:12:58

by Giuseppe Scrivano

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

Giuseppe Scrivano <[email protected]> writes:

> Gao Xiang <[email protected]> writes:
>
>> On 2023/1/22 06:34, Giuseppe Scrivano wrote:
>>> Gao Xiang <[email protected]> writes:
>>>
>>>> On 2023/1/22 00:19, Giuseppe Scrivano wrote:
>>>>> Gao Xiang <[email protected]> writes:
>>>>>
>>>>>> On 2023/1/21 06:18, Giuseppe Scrivano wrote:
>>>>>>> Hi Amir,
>>>>>>> Amir Goldstein <[email protected]> writes:
>>>>>>>
>>>>>>>> On Fri, Jan 20, 2023 at 5:30 PM Alexander Larsson <[email protected]> wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Alexander,
>>>>>>>>
>>>>>>>> I must say that I am a little bit puzzled by this v3.
>>>>>>>> Gao, Christian and myself asked you questions on v2
>>>>>>>> that are not mentioned in v3 at all.
>>>>>>>>
>>>>>>>> To sum it up, please do not propose composefs without explaining
>>>>>>>> what are the barriers for achieving the exact same outcome with
>>>>>>>> the use of a read-only overlayfs with two lower layer -
>>>>>>>> uppermost with erofs containing the metadata files, which include
>>>>>>>> trusted.overlay.metacopy and trusted.overlay.redirect xattrs that refer
>>>>>>>> to the lowermost layer containing the content files.
>>>>>>> I think Dave explained quite well why using overlay is not
>>>>>>> comparable to
>>>>>>> what composefs does.
>>>>>>> One big difference is that overlay still requires at least a syscall
>>>>>>> for
>>>>>>> each file in the image, and then we need the equivalent of "rm -rf" to
>>>>>>> clean it up. It is somehow acceptable for long-running services, but it
>>>>>>> is not for "serverless" containers where images/containers are created
>>>>>>> and destroyed frequently. So even in the case we already have all the
>>>>>>> image files available locally, we still need to create a checkout with
>>>>>>> the final structure we need for the image.
>>>>>>> I also don't see how overlay would solve the verified image problem.
>>>>>>> We
>>>>>>> would have the same problem we have today with fs-verity as it can only
>>>>>>> validate a single file but not the entire directory structure. Changes
>>>>>>> that affect the layer containing the trusted.overlay.{metacopy,redirect}
>>>>>>> xattrs won't be noticed.
>>>>>>> There are at the moment two ways to handle container images, both
>>>>>>> somehow
>>>>>>> guided by the available file systems in the kernel.
>>>>>>> - A single image mounted as a block device.
>>>>>>> - A list of tarballs (OCI image) that are unpacked and mounted as
>>>>>>> overlay layers.
>>>>>>> One big advantage of the block devices model is that you can use
>>>>>>> dm-verity, this is something we miss today with OCI container images
>>>>>>> that use overlay.
>>>>>>> What we are proposing with composefs is a way to have "dm-verity"
>>>>>>> style
>>>>>>> validation based on fs-verity and the possibility to share individual
>>>>>>> files instead of layers. These files can also be on different file
>>>>>>> systems, which is something not possible with the block device model.
>>>>>>
>>>>>> That is not a new idea honestly, including chain of trust. Even laterly
>>>>>> out-of-tree incremental fs using fs-verity for this as well, except that
>>>>>> it's in a real self-contained way.
>>>>>>
>>>>>>> The composefs manifest blob could be generated remotely and signed.
>>>>>>> A
>>>>>>> client would need just to validate the signature for the manifest blob
>>>>>>> and from there retrieve the files that are not in the local CAS (even
>>>>>>> from an insecure source) and mount directly the manifest file.
>>>>>>
>>>>>>
>>>>>> Back to the topic, after thinking something I have to make a
>>>>>> compliment for reference.
>>>>>>
>>>>>> First, EROFS had the same internal discussion and decision at
>>>>>> that time almost _two years ago_ (June 2021), it means:
>>>>>>
>>>>>> a) Some internal people really suggested EROFS could develop
>>>>>> an entire new file-based in-kernel local cache subsystem
>>>>>> (as you called local CAS, whatever) with stackable file
>>>>>> interface so that the exist Nydus image service [1] (as
>>>>>> ostree, and maybe ostree can use it as well) don't need to
>>>>>> modify anything to use exist blobs;
>>>>>>
>>>>>> b) Reuse exist fscache/cachefiles;
>>>>>>
>>>>>> The reason why we (especially me) finally selected b) because:
>>>>>>
>>>>>> - see the people discussion of Google's original Incremental
>>>>>> FS topic [2] [3] in 2019, as Amir already mentioned. At
>>>>>> that time all fs folks really like to reuse exist subsystem
>>>>>> for in-kernel caching rather than reinvent another new
>>>>>> in-kernel wheel for local cache.
>>>>>>
>>>>>> [ Reinventing a new wheel is not hard (fs or caching), just
>>>>>> makes Linux more fragmented. Especially a new filesystem
>>>>>> is just proposed to generate images full of massive massive
>>>>>> new magical symlinks with *overridden* uid/gid/permissions
>>>>>> to replace regular files. ]
>>>>>>
>>>>>> - in-kernel cache implementation usually met several common
>>>>>> potential security issues; reusing exist subsystem can
>>>>>> make all fses addressed them and benefited from it.
>>>>>>
>>>>>> - Usually an exist widely-used userspace implementation is
>>>>>> never an excuse for a new in-kernel feature.
>>>>>>
>>>>>> Although David Howells is always quite busy these months to
>>>>>> develop new netfs interface, otherwise (we think) we should
>>>>>> already support failover, multiple daemon/dirs, daemonless and
>>>>>> more.
>>>>> we have not added any new cache system. overlay does "layer
>>>>> deduplication" and in similar way composefs does "file deduplication".
>>>>> That is not a built-in feature, it is just a side effect of how things
>>>>> are packed together.
>>>>> Using fscache seems like a good idea and it has many advantages but
>>>>> it
>>>>> is a centralized cache mechanism and it looks like a potential problem
>>>>> when you think about allowing mounts from a user namespace.
>>>>
>>>> I think Christian [1] had the same feeling of my own at that time:
>>>>
>>>> "I'm pretty skeptical of this plan whether we should add more filesystems
>>>> that are mountable by unprivileged users. FUSE and Overlayfs are
>>>> adventurous enough and they don't have their own on-disk format. The
>>>> track record of bugs exploitable due to userns isn't making this
>>>> very attractive."
>>>>
>>>> Yes, you could add fs-verity, but EROFS could add fs-verity (or just use
>>>> dm-verity) as well, but it doesn't change _anything_ about concerns of
>>>> "allowing mounts from a user namespace".
>>> I've mentioned that as a potential feature we could add in future,
>>> given
>>> the simplicity of the format and that it uses a CAS for its data instead
>>> of fscache. Each user can have and use their own store to mount the
>>> images.
>>> At this point it is just a wish from userspace, as it would improve
>>> a
>>> few real use cases we have.
>>> Having the possibility to run containers without root privileges is
>>> a
>>> big deal for many users, look at Flatpak apps for example, or rootless
>>> Podman. Mounting and validating images would be a big security
>>> improvement. It is something that is not possible at the moment as
>>> fs-verity doesn't cover the directory structure and dm-verity seems out
>>> of reach from a user namespace.
>>> Composefs delegates the entire logic of dealing with files to the
>>> underlying file system in a similar way to overlay.
>>> Forging the inode metadata from a user namespace mount doesn't look
>>> like an insurmountable problem as well since it is already possible
>>> with a FUSE filesystem.
>>> So the proposal/wish here is to have a very simple format, that at
>>> some
>>> point could be considered safe to mount from a user namespace, in
>>> addition to overlay and FUSE.
>>
>> My response is quite similar to
>> https://lore.kernel.org/r/[email protected]om/
>
> I don't see how that applies to what I said about unprivileged mounts,
> except for the part about lazy download, where I agree with Miklos that
> it should be handled through FUSE; that is something composefs can do:
>
> mount -t composefs composefs -obasedir=/path/to/store:/mnt/fuse /mnt/cfs
>
> where /mnt/fuse is handled by a FUSE file system that takes care of
> loading the files from the remote server, and possibly writes them to
> /path/to/store once they are complete.
>
> So each user could have their "lazy download" without interfering with
> other users or the centralized cache.
>
>>>
>>>>> As you know as I've contacted you, I've looked at EROFS in the past
>>>>> and tried to get our use cases to work with it before thinking about
>>>>> submitting composefs upstream.
>>>>> From what I could see EROFS and composefs use two different
>>>>> approaches
>>>>> to solve a similar problem, but it is not possible to do exactly with
>>>>> EROFS what we are trying to do. To oversimplify it: I see EROFS as a
>>>>> block device that uses fscache, and composefs as an overlay for files
>>>>> instead of directories.
>>>>
>>>> I don't think so honestly. EROFS "Multiple device" feature is
>>>> actually "multiple blobs" feature if you really think "device"
>>>> is block device.
>>>>
>>>> Primary device -- primary blob -- "composefs manifest blob"
>>>> Blob device -- data blobs -- "composefs backing files"
>>>>
>>>> any difference?
>>> I wouldn't expect any substantial difference between two RO file
>>> systems.
>>> Please correct me if I am wrong: EROFS uses 16 bits for the blob
>>> device
>>> ID, so if we map each file to a single blob device we are kind of
>>> limited on how many files we can have.
>>
>> I was here just to represent "composefs manifest file" concept rather than
>> device ID.
>>
>>> Sure this is just an artificial limit and can be bumped in a future
>>> version but the major difference remains: EROFS uses the blob device
>>> through fscache while the composefs files are looked up in the specified
>>> repositories.
>>
>> No, fscache can also open any cookie when opening file. Again, even with
>> fscache, EROFS doesn't need to modify _any_ on-disk format to:
>>
>> - record a "cookie id" for such special "magical symlink" with a similar
>> symlink on-disk format (or whatever on-disk format with data, just with
>> a new on-disk flag);
>>
>> - open such "cookie id" on demand when opening such EROFS file just as
>> any other network fses. I don't think blob device is limited here.
>>
>> some difference now?
>
> Recording the "cookie id" is done by a singleton userspace daemon that
> controls the cachefiles device, and it requires one operation for each
> file before the image can be mounted.
>
> Is that the case, or did I misunderstand something?
>
>>>
>>>>> Sure composefs is quite simple and you could embed the composefs
>>>>> features in EROFS and let EROFS behave as composefs when provided a
>>>>> similar manifest file. But how is that any better than having a
>>>>
>>>> EROFS always has such feature since v5.16, we called primary device,
>>>> or Nydus concept --- "bootstrap file".
>>>>
>>>>> separate implementation that does just one thing well instead of merging
>>>>> different paradigms together?
>>>>
>>>> It's exist fs on-disk compatible (people can deploy the same image
>>>> to wider scenarios), or you could modify/enhance any in-kernel local
>>>> fs to do so like I already suggested, such as enhancing "fs/romfs" and
>>>> make it maintained again due to this magic symlink feature
>>>>
>>>> (because composefs don't have other on-disk requirements other than
>>>> a symlink path and a SHA256 verity digest from its original
>>>> requirement. Any local fs can be enhanced like this.)
>>>>
>>>>>
>>>>>> I know that you guys repeatedly say it's a self-contained
>>>>>> stackable fs and has few code (the same words as Incfs
>>>>>> folks [3] said four years ago already), four reasons make it
>>>>>> weak IMHO:
>>>>>>
>>>>>> - I think core EROFS is about 2~3 kLOC as well if
>>>>>> compression, sysfs and fscache are all code-truncated.
>>>>>>
>>>>>> Also, it's always welcome that all people could submit
>>>>>> patches for cleaning up. I always do such cleanups
>>>>>> from time to time and makes it better.
>>>>>>
>>>>>> - "Few code lines" is somewhat weak because people do
>>>>>> develop new features, layout after upstream.
>>>>>>
>>>>>> Such claim is usually _NOT_ true in the future if you
>>>>>> guys do more to optimize performance, new layout or even
>>>>>> do your own lazy pulling with your local CAS codebase in
>>>>>> the future unless
>>>>>> you *promise* you once dump the code, and do bugfix
>>>>>> only like Christian said [4].
>>>>>>
>>>>>> From LWN.net comments, I do see the opposite
>>>>>> possibility that you'd like to develop new features
>>>>>> later.
>>>>>>
>>>>>> - In the past, all in-tree kernel filesystems were
>>>>>> designed and implemented without some user-space
>>>>>> specific indication, including Nydus and ostree (I did
>>>>>> see a lot of discussion between folks before in ociv2
>>>>>> brainstorm [5]).
>>>>> Since you are mentioning OCI:
>>>>> Potentially composefs can be the file system that enables something
>>>>> very
>>>>> close to "ociv2", but it won't need to be called v2 since it is
>>>>> completely compatible with the current OCI image format.
>>>>> It won't require a different image format, just a seekable tarball
>>>>> that
>>>>> is compatible with old "v1" clients and we need to provide the composefs
>>>>> manifest file.
>>>>
>>>> May I ask did you really look into what Nydus + EROFS already did (as you
>>>> mentioned we discussed before)?
>>>>
>>>> Your "composefs manifest file" is exactly "Nydus bootstrap file", see:
>>>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md
>>>>
>>>> "Rafs is a filesystem image containing a separated metadata blob and
>>>> several data-deduplicated content-addressable data blobs. In a typical
>>>> rafs filesystem, the metadata is stored in bootstrap while the data
>>>> is stored in blobfile.
>>>> ...
>>>>
>>>> bootstrap: The metadata is a merkle tree (I think that is typo, should be
>>>> filesystem tree) whose nodes represents a regular filesystem's
>>>> directory/file a leaf node refers to a file and contains hash value of
>>>> its file data.
>>>> Root node and internal nodes refer to directories and contain the
>>>> hash value
>>>> of their children nodes."
>>>>
>>>> Nydus is already supported "It won't require a different image format, just
>>>> a seekable tarball that is compatible with old "v1" clients and we need to
>>>> provide the composefs manifest file." feature in v2.2 and will be released
>>>> later.
>>> Nydus is not using a tarball compatible with OCI v1.
>>> It defines a media type
>>> "application/vnd.oci.image.layer.nydus.blob.v1", that
>>> means it is not compatible with existing clients that don't know about
>>> it and you need special handling for that.
>>
>> I am not sure what you're saying: "media type" is quite out of topic here.
>>
>> If you said "mkcomposefs" is done in the server side, what is the media
>> type of such manifest files?
>>
>> And why can't Nydus do it in the same way?
>> https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-zran.md
>>
>
> I am not talking about the manifest or the bootstrap file, I am talking
> about the data blobs.
>
>>> Anyway, let's not bother LKML folks with these userspace details.
>>> It
>>> has no relevance to the kernel and what file systems do.
>>
>> I'd like to avoid, I didn't say anything about userspace details, I just would
>> like to say
>> "merged filesystem tree is also _not_ a new idea of composefs"
>> not "media type", etc.
>>
>>>
>>>>> The seekable tarball allows individual files to be retrieved. OCI
>>>>> clients will not need to pull the entire tarball, but only the individual
>>>>> files that are not already present in the local CAS. They won't also need
>>>>> to create the overlay layout at all, as we do today, since it is already
>>>>> described with the composefs manifest file.
>>>>> The manifest is portable on different machines with different
>>>>> configurations, as you can use multiple CAS when mounting composefs.
>>>>> Some users might have a local CAS, some others could have a
>>>>> secondary
>>>>> CAS on a network file system and composefs support all these
>>>>> configurations with the same signed manifest file.
>>>>>
>>>>>> That is why EROFS selected exist in-kernel fscache and
>>>>>> made userspace Nydus adapt it:
>>>>>>
>>>>>> even (here called) manifest on-disk format ---
>>>>>> EROFS call primary device ---
>>>>>> they call Nydus bootstrap;
>>>>>>
>>>>>> I'm not sure why it becomes impossible for ... ($$$$).
>>>>> I am not sure what you mean, care to elaborate?
>>>>
>>>> I just meant these concepts are actually the same concept with
>>>> different names and:
>>>> Nydus is a 2020 stuff;
>>> CRFS[1] is 2019 stuff.
>>
>> Does CRFS have anything similar to a merged filesystem tree?
>>
>> Here we talked about local CAS:
>> I have no idea CRFS has anything similar to it.
>
> Yes, it does, and it uses it with a FUSE file system. So neither
> composefs nor EROFS has invented anything here.
>
> Anyway, does it really matter who made what first? I don't see how it
> helps in deciding whether composefs has enough relevant differences to
> justify its presence in the kernel.
>
>>>
>>>> EROFS + primary device is a 2021-mid stuff.
>>>>
>>>>>> In addition, if fscache is used, it can also use
>>>>>> fsverity_get_digest() to enable fsverity for non-on-demand
>>>>>> files.
>>>>>>
>>>>>> But again I think even Google's folks think that is
>>>>>> (somewhat) broken so that they added fs-verity to its incFS
>>>>>> in a self-contained way in Feb 2021 [6].
>>>>>>
>>>>>> Finally, again, I do hope a LSF/MM discussion for this new
>>>>>> overlay model (full of massive magical symlinks to override
>>>>>> permission.)
>>>>> you keep pointing it out but nobody is overriding any permission.
>>>>> The
>>>>> "symlinks" as you call them are just a way to refer to the payload files
>>>>> so they can be shared among different mounts. It is the same idea used
>>>>> by "overlay metacopy" and nobody is complaining about it being a
>>>>> security issue (because it is not).
>>>>
>>>> See overlay documentation clearly wrote such metacopy behavior:
>>>> https://docs.kernel.org/filesystems/overlayfs.html
>>>>
>>>> "
>>>> Do not use metacopy=on with untrusted upper/lower directories.
>>>> Otherwise it is possible that an attacker can create a handcrafted file
>>>> with appropriate REDIRECT and METACOPY xattrs, and gain access to file
>>>> on lower pointed by REDIRECT. This should not be possible on local
>>>> system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But
>>>> it should be possible for untrusted layers like from a pen drive.
>>>> "
>>>>
>>>> Do we really need such behavior working on another fs especially with
>>>> on-disk format? At least Christian said,
>>>> "FUSE and Overlayfs are adventurous enough and they don't have their
>>>> own on-disk format."
>>> If users want to do something really weird then they can always find
>>> a
>>> way but the composefs lookup is limited under the directories specified
>>> at mount time, so it is not possible to access any file outside the
>>> repository.
>>>
>>>>> The files in the CAS are owned by the user that creates the mount,
>>>>> so
>>>>> there is no need to circumvent any permission check to access them.
>>>>> We use fs-verity for these files to make sure they are not modified by a
>>>>> malicious user that could get access to them (e.g. a container breakout).
>>>>
>>>> fs-verity is not always enforcing, and it's broken here if fsverity is not
>>>> supported in the underlying file systems; that is another of my arguable points.
>>> It is a trade-off. It is up to the user to pick a configuration
>>> that
>>> allows using fs-verity if they care about this feature.
>>
>> I don't think fsverity is optional with your plan.
>
> Yes, it is optional. Without fs-verity it behaves the same as overlay
> mounts do today without any fs-verity.
>
> How does validation work in EROFS for files that are served from fscache
> and live on a remote file system?

Never mind my last question: I guess it would still go through the block
device in EROFS.
This is clearly a point in favor of the block device approach, and one
that a stacking file system like overlay or composefs cannot match
without support from the underlying file system.
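
For comparison, the block device route looks roughly like this with
dm-verity (just a sketch; the image and hash tree file names are made up):

  (build the hash tree; veritysetup prints the root hash, which is the
   piece that gets signed and distributed)
  # veritysetup format image.erofs image.hashtree

  (set up the verified mapping and mount it; every block of data and
   metadata is checked below the file system)
  # veritysetup open image.erofs verified image.hashtree <root hash>
  # mount -t erofs /dev/mapper/verified /mnt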

>
>> I wrote all of this because it seems I didn't mention the original motivation
>> for using fscache in v2: the kernel already has such an in-kernel local cache,
>> and in 2019 people preferred to reuse it rather than add another stackable
>> approach (as mentioned in the Incremental FS thread).
>
> Still, for us the stackable approach works better.
>
>> Thanks,
>> Gao Xiang
>>
>>> Regards,
>>> Giuseppe
>>> [1] https://github.com/google/crfs