LinuxLists.cc - [PATCH 0/8] loopfs

2020-04-08 17:23:42

Subject: [PATCH 0/8] loopfs

Hey everyone,

After having been pinged about this by various people recently, here's loopfs.

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we have had good experiences overall so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner and more complete
approach.

To experiment, the patchset can be found in the following locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
https://gitlab.com/brauner/linux/-/commits/loopfs
https://github.com/brauner/linux/tree/loopfs

One of the use-cases for loopfs is to allow loop devices to be allocated
dynamically in sandboxed workloads without exposing /dev or
/dev/loop-control to the workload in question and without having to
implement a complex and also racy protocol to send around file
descriptors for loop devices. With loopfs each mount is a new instance,
i.e. loop devices created in one loopfs instance are independent of any
loop devices created in another loopfs instance. This allows
sufficiently privileged tools to have their own private stash of loop
device instances. In a private discussion, Dmitry has expressed interest
in using this for syzkaller, and various parties that want to use it are
Cced here too.

In addition, the loopfs filesystem can be mounted by user namespace root
and is thus suitable for use in containers. Combined with syscall
interception this makes it possible to securely delegate mounting of
images on loop devices, i.e. when a user calls mount -o loop <image>
<mountpoint> it will be possible to set up the loop device completely.
The final mount syscall to actually perform the mount will be handled
through syscall interception and be performed by a sufficiently
privileged process. Syscall interception is already supported through a
new seccomp feature we implemented in [1] and extended in [2] and is
actively used in production workloads. The additional loopfs work will
be used there and in various other workloads too. You'll find a short
illustration of how this works with syscall interception below in [4].

The number of loop devices available to a loopfs instance can be limited
by setting the "max" mount option to a positive integer. This allows,
for example, sufficiently privileged processes to enforce a limit on the
number of devices at runtime. This limit is dynamic in contrast to the
max_loop module option in that a sufficiently privileged process can
update it with a simple remount operation.
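As a rough user-space sketch of the above (not part of the patchset): the filesystem type "loop" and the "max" option are taken from this cover letter, while the helper names loopfs_format_max() and loopfs_set_max() are made up for illustration. It assumes a kernel with this series applied and a caller that is privileged over the mount.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Format the loopfs "max" mount option, e.g. "max=8".
 * Hypothetical helper. Returns 0 on success, -1 if max is not a
 * positive integer or the buffer is too small.
 */
int loopfs_format_max(char *buf, size_t len, int max)
{
	int n;

	assert(buf != NULL);
	if (max <= 0)
		return -1;
	n = snprintf(buf, len, "max=%d", max);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount a new loopfs instance on target with the given limit or, when
 * remount is non-zero, update the limit of the instance already mounted
 * there. Hypothetical helper. Returns 0 on success or -errno.
 */
int loopfs_set_max(const char *target, int max, int remount)
{
	char opts[32];
	unsigned long flags = remount ? MS_REMOUNT : 0;

	if (loopfs_format_max(opts, sizeof(opts), max) < 0)
		return -EINVAL;
	/* "loop" is the filesystem type used in the example in [4]. */
	if (mount("loop", target, "loop", flags, opts) < 0)
		return -errno;
	return 0;
}
```

A caller would do something like loopfs_set_max("/dev/loopfs", 8, 0) once and later loopfs_set_max("/dev/loopfs", 16, 1) to raise the limit, which is the part the max_loop module option cannot do.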

The loopfs filesystem is placed under a new config option and special
care has been taken to not introduce any new code when users do not
select this config option.

Thanks!
Christian

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]:
root@f1:~# cat /proc/self/uid_map
0 100000 1000000000
root@f1:~# cat /proc/self/gid_map
0 100000 1000000000
root@f1:~# mkdir /dev/loopfs
root@f1:~# mount -t loop loop /dev/loopfs/
root@f1:~# ln -sf /dev/loopfs/loop-control /dev/loop-control
root@f1:~# losetup -f
/dev/loop9
root@f1:~# ln -sf /dev/loopfs/loop9 /dev/loop9
root@f1:~# ls -al /sys/class/block/loop9
lrwxrwxrwx 1 root root 0 Apr 8 14:53 /sys/class/block/loop9 -> ../../devices/virtual/block/loop9
root@f1:~# ls -al /sys/class/block/loop9/
total 0
drwxr-xr-x 9 root root 0 Apr 8 14:53 .
drwxr-xr-x 13 nobody nogroup 0 Apr 8 14:53 ..
-r--r--r-- 1 root root 4096 Apr 8 14:53 alignment_offset
lrwxrwxrwx 1 nobody nogroup 0 Apr 8 14:53 bdi -> ../../bdi/7:9
-r--r--r-- 1 root root 4096 Apr 8 14:53 capability
-r--r--r-- 1 root root 4096 Apr 8 14:53 dev
-r--r--r-- 1 root root 4096 Apr 8 14:53 discard_alignment
-r--r--r-- 1 root root 4096 Apr 8 14:53 events
-r--r--r-- 1 root root 4096 Apr 8 14:53 events_async
-rw-r--r-- 1 root root 4096 Apr 8 14:53 events_poll_msecs
-r--r--r-- 1 root root 4096 Apr 8 14:53 ext_range
-r--r--r-- 1 root root 4096 Apr 8 14:53 hidden
drwxr-xr-x 2 nobody nogroup 0 Apr 8 14:53 holders
-r--r--r-- 1 root root 4096 Apr 8 14:53 inflight
drwxr-xr-x 2 nobody nogroup 0 Apr 8 14:53 integrity
drwxr-xr-x 3 nobody nogroup 0 Apr 8 14:53 mq
drwxr-xr-x 2 root root 0 Apr 8 14:53 power
drwxr-xr-x 3 nobody nogroup 0 Apr 8 14:53 queue
-r--r--r-- 1 root root 4096 Apr 8 14:53 range
-r--r--r-- 1 root root 4096 Apr 8 14:53 removable
-r--r--r-- 1 root root 4096 Apr 8 14:53 ro
-r--r--r-- 1 root root 4096 Apr 8 14:53 size
drwxr-xr-x 2 nobody nogroup 0 Apr 8 14:53 slaves
-r--r--r-- 1 root root 4096 Apr 8 14:53 stat
lrwxrwxrwx 1 nobody nogroup 0 Apr 8 14:53 subsystem -> ../../../../class/block
drwxr-xr-x 2 root root 0 Apr 8 14:53 trace
-rw-r--r-- 1 root root 4096 Apr 8 14:53 uevent
root@f1:~#
root@f1:~# stat --file-system /bla.img
File: "/bla.img"
ID: 4396dc4f5f3ffe1b Namelen: 255 Type: btrfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 11230468 Free: 10851929 Available: 10738585
Inodes: Total: 0 Free: 0
root@f1:~# mount -o loop /bla.img /opt
root@f1:~# findmnt | grep opt
└─/opt /dev/loop9 btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/

Christian Brauner (8):
kobject_uevent: remove unneeded netlink_ns check
loopfs: implement loopfs
loop: use ns_capable for some loop operations
kernfs: handle multiple namespace tags
kernfs: let objects opt-in to propagating from the initial namespace
genhd: add minimal namespace infrastructure
loopfs: start attaching correct namespace during loop_add()
loopfs: only show devices in their correct instance

Documentation/filesystems/sysfs-tagging.txt | 1 -
MAINTAINERS | 5 +
block/genhd.c | 79 ++++
drivers/base/devtmpfs.c | 4 +-
drivers/block/Kconfig | 4 +
drivers/block/Makefile | 1 +
drivers/block/loop.c | 186 +++++++--
drivers/block/loop.h | 8 +-
drivers/block/loopfs/Makefile | 3 +
drivers/block/loopfs/loopfs.c | 429 ++++++++++++++++++++
drivers/block/loopfs/loopfs.h | 35 ++
fs/kernfs/dir.c | 38 +-
fs/kernfs/kernfs-internal.h | 26 +-
fs/kernfs/mount.c | 11 +-
fs/sysfs/mount.c | 14 +-
include/linux/device.h | 3 +
include/linux/genhd.h | 3 +
include/linux/kernfs.h | 44 +-
include/linux/kobject_ns.h | 7 +-
include/linux/sysfs.h | 8 +-
include/uapi/linux/magic.h | 1 +
lib/kobject.c | 17 +-
lib/kobject_uevent.c | 2 +-
net/core/net-sysfs.c | 6 -
24 files changed, 834 insertions(+), 101 deletions(-)
create mode 100644 drivers/block/loopfs/Makefile
create mode 100644 drivers/block/loopfs/loopfs.c
create mode 100644 drivers/block/loopfs/loopfs.h

base-commit: 7111951b8d4973bda27ff663f2cf18b663d15b48
--
2.26.0

2020-04-08 17:24:04

by Christian Brauner

Subject: [PATCH 8/8] loopfs: only show devices in their correct instance

Since loopfs devices belong to a loopfs instance they have no business
polluting the host's devtmpfs mount and should not propagate out of the
namespace they belong to.

Cc: Jens Axboe <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
drivers/base/devtmpfs.c | 4 ++--
drivers/block/loop.c | 4 +++-
include/linux/device.h | 3 +++
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9017e0584c0..77371ceb88fa 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -111,7 +111,7 @@ int devtmpfs_create_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;

-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;

 	req.mode = 0;
@@ -138,7 +138,7 @@ int devtmpfs_delete_node(struct device *dev)
 	const char *tmp = NULL;
 	struct req req;

-	if (!thread)
+	if (!thread || dev->no_devnode)
 		return 0;

 	req.name = device_get_devnode(dev, NULL, NULL, NULL, &tmp);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7a14fd3e4329..df75ca4ac040 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2155,8 +2155,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
 	disk->queue = lo->lo_queue;
 	sprintf(disk->disk_name, "loop%d", i);
 #ifdef CONFIG_BLK_DEV_LOOPFS
-	if (loopfs_i_sb(inode))
+	if (loopfs_i_sb(inode)) {
 		disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+		disk_to_dev(disk)->no_devnode = true;
+	}
 #endif

 	add_disk(disk);
diff --git a/include/linux/device.h b/include/linux/device.h
index fa04dfd22bbc..9fa438e3e4ca 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -525,6 +525,8 @@ struct dev_links_info {
  *		sync_state() callback.
  * @dma_coherent: this particular device is dma coherent, even if the
  *		architecture supports non-coherent devices.
+ * @no_devnode: whether device nodes associated with this device are kept out
+ *		of devtmpfs (e.g. due to separate filesystem)
  *
  * At the lowest level, every device in a Linux system is represented by an
  * instance of struct device. The device structure contains the information
@@ -625,6 +627,7 @@ struct device {
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	bool dma_coherent:1;
 #endif
+	bool no_devnode:1;
 };

 static inline struct device *kobj_to_dev(struct kobject *kobj)
--
2.26.0

2020-04-08 17:24:28

by Christian Brauner

Subject: [PATCH 6/8] genhd: add minimal namespace infrastructure

This lets the block_class properly support loopfs devices by introducing
the minimal infrastructure needed to support different sysfs views for
devices belonging to the block_class. This is similar to how network
devices work. Note that nothing changes with this patch since
all block_class devices are tagged explicitly with init_user_ns whereas
they were tagged implicitly with init_user_ns before. No code is added
if CONFIG_BLK_DEV_LOOPFS is not set.

Cc: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
block/genhd.c | 79 +++++++++++++++++++++++++++++++++++++
fs/kernfs/kernfs-internal.h | 3 ++
fs/sysfs/mount.c | 4 ++
include/linux/genhd.h | 3 ++
include/linux/kobject_ns.h | 1 +
5 files changed, 90 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 9c2e13ce0d19..a6d51d9a94f6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1127,11 +1127,81 @@ static struct kobject *base_probe(dev_t devt, int *partno, void *data)
 	return NULL;
 }

+#ifdef CONFIG_BLK_DEV_LOOPFS
+static void *user_grab_current_ns(void)
+{
+	struct user_namespace *ns = current_user_ns();
+	return get_user_ns(ns);
+}
+
+static const void *user_initial_ns(void)
+{
+	return &init_user_ns;
+}
+
+static void user_put_ns(void *p)
+{
+	struct user_namespace *ns = p;
+	put_user_ns(ns);
+}
+
+static bool user_current_may_mount(void)
+{
+	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
+}
+
+const struct kobj_ns_type_operations user_ns_type_operations = {
+	.type = KOBJ_NS_TYPE_USER,
+	.current_may_mount = user_current_may_mount,
+	.grab_current_ns = user_grab_current_ns,
+	.initial_ns = user_initial_ns,
+	.drop_ns = user_put_ns,
+};
+
+static const void *block_class_user_namespace(struct device *dev)
+{
+	struct gendisk *disk;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	return disk->user_ns;
+}
+
+static void block_class_get_ownership(struct device *dev, kuid_t *uid, kgid_t *gid)
+{
+	struct gendisk *disk;
+	struct user_namespace *ns;
+
+	if (dev->type == &part_type)
+		disk = part_to_disk(dev_to_part(dev));
+	else
+		disk = dev_to_disk(dev);
+
+	ns = disk->user_ns;
+	if (ns && ns != &init_user_ns) {
+		kuid_t ns_root_uid = make_kuid(ns, 0);
+		kgid_t ns_root_gid = make_kgid(ns, 0);
+
+		if (uid_valid(ns_root_uid))
+			*uid = ns_root_uid;
+
+		if (gid_valid(ns_root_gid))
+			*gid = ns_root_gid;
+	}
+}
+#endif /* CONFIG_BLK_DEV_LOOPFS */
+
 static int __init genhd_device_init(void)
 {
 	int error;

 	block_class.dev_kobj = sysfs_dev_block_kobj;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	kobj_ns_type_register(&user_ns_type_operations);
+#endif
 	error = class_register(&block_class);
 	if (unlikely(error))
 		return error;
@@ -1369,8 +1439,14 @@ static void disk_release(struct device *dev)
 		blk_put_queue(disk->queue);
 	kfree(disk);
 }
+
 struct class block_class = {
 	.name = "block",
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	.ns_type = &user_ns_type_operations,
+	.namespace = block_class_user_namespace,
+	.get_ownership = block_class_get_ownership,
+#endif
 };

 static char *block_devnode(struct device *dev, umode_t *mode,
@@ -1550,6 +1626,9 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 		disk_to_dev(disk)->class = &block_class;
 		disk_to_dev(disk)->type = &disk_type;
 		device_initialize(disk_to_dev(disk));
+#ifdef CONFIG_BLK_DEV_LOOPFS
+		disk->user_ns = &init_user_ns;
+#endif
 	}
 	return disk;
 }
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 4ba7b36103de..699b7b67f9e0 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -79,12 +79,15 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
 }

 extern struct net init_net;
+extern struct user_namespace init_user_ns;

 static inline const void *kernfs_init_ns(enum kobj_ns_type ns_type)
 {
 	switch (ns_type) {
 	case KOBJ_NS_TYPE_NET:
 		return &init_net;
+	case KOBJ_NS_TYPE_USER:
+		return &init_user_ns;
 	default:
 		pr_debug("Unsupported namespace type %d for kernfs\n", ns_type);
 	}
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 5e2ec88a709e..99b82a0ae7ea 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -43,6 +43,8 @@ static void sysfs_fs_context_free(struct fs_context *fc)

 	if (kfc->ns_tag[KOBJ_NS_TYPE_NET])
 		kobj_ns_drop(KOBJ_NS_TYPE_NET, kfc->ns_tag[KOBJ_NS_TYPE_NET]);
+	if (kfc->ns_tag[KOBJ_NS_TYPE_USER])
+		kobj_ns_drop(KOBJ_NS_TYPE_USER, kfc->ns_tag[KOBJ_NS_TYPE_USER]);
 	kernfs_free_fs_context(fc);
 	kfree(kfc);
 }
@@ -67,6 +69,7 @@ static int sysfs_init_fs_context(struct fs_context *fc)
 		return -ENOMEM;

 	kfc->ns_tag[KOBJ_NS_TYPE_NET] = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
+	kfc->ns_tag[KOBJ_NS_TYPE_USER] = kobj_ns_grab_current(KOBJ_NS_TYPE_USER);
 	kfc->root = sysfs_root;
 	kfc->magic = SYSFS_MAGIC;
 	fc->fs_private = kfc;
@@ -85,6 +88,7 @@ static void sysfs_kill_sb(struct super_block *sb)

 	kernfs_kill_sb(sb);
 	kobj_ns_drop(KOBJ_NS_TYPE_NET, ns[KOBJ_NS_TYPE_NET]);
+	kobj_ns_drop(KOBJ_NS_TYPE_USER, ns[KOBJ_NS_TYPE_USER]);
 }

 static struct file_system_type sysfs_fs_type = {
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 07dc91835b98..e5cf5caea345 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -219,6 +219,9 @@ struct gendisk {
 	int node_id;
 	struct badblocks *bb;
 	struct lockdep_map lockdep_map;
+#ifdef CONFIG_BLK_DEV_LOOPFS
+	struct user_namespace *user_ns;
+#endif
 };

 static inline struct gendisk *part_to_disk(struct hd_struct *part)
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 216f9112ee1d..a9c45bcce235 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -26,6 +26,7 @@ struct kobject;
 enum kobj_ns_type {
 	KOBJ_NS_TYPE_NONE = 0,
 	KOBJ_NS_TYPE_NET,
+	KOBJ_NS_TYPE_USER,
 	KOBJ_NS_TYPES
 };

--
2.26.0
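
To illustrate what block_class_get_ownership() in this patch ends up computing, here is a small user-space model of the id lookup, assuming the usual uid_map semantics (each line maps a range of ids inside the namespace to a range outside it). The names id_extent, map_id_down() and model_get_ownership() are only for this sketch and are not the kernel's types. With the uid_map from the cover letter ("0 100000 1000000000"), namespace root maps to host uid 100000, which is why those sysfs entries show up as owned by root inside the container.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/*
 * One line of /proc/<pid>/uid_map: ids [first, first + count) inside the
 * namespace correspond to [lower_first, lower_first + count) outside it.
 */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/*
 * User-space model of the lookup behind make_kuid(ns, id): translate an
 * id as seen inside the namespace to the id outside it, or INVALID_ID if
 * no extent covers it.
 */
uint32_t map_id_down(const struct id_extent *map, size_t n, uint32_t id)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].lower_first + (id - map[i].first);
	}
	return INVALID_ID;
}

/*
 * Model of the ownership decision: the owner is replaced only when the
 * namespace root maps to a valid id; otherwise the default is kept.
 */
void model_get_ownership(const struct id_extent *map, size_t n, uint32_t *uid)
{
	uint32_t root = map_id_down(map, n, 0);

	if (root != INVALID_ID)
		*uid = root;
}
```

The patch additionally skips the lookup for init_user_ns, where the mapping is the identity, so nothing changes for devices that do not belong to a loopfs instance.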

2020-04-08 18:27:07

by Jann Horn

Subject: Re: [PATCH 0/8] loopfs

On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
<[email protected]> wrote:
> One of the use-cases for loopfs is to allow to dynamically allocate loop
> devices in sandboxed workloads without exposing /dev or
> /dev/loop-control to the workload in question and without having to
> implement a complex and also racy protocol to send around file
> descriptors for loop devices. With loopfs each mount is a new instance,
> i.e. loop devices created in one loopfs instance are independent of any
> loop devices created in another loopfs instance. This allows
> sufficiently privileged tools to have their own private stash of loop
> device instances. Dmitry has expressed his desire to use this for
> syzkaller in a private discussion. And various parties that want to use
> it are Cced here too.
>
> In addition, the loopfs filesystem can be mounted by user namespace root
> and is thus suitable for use in containers. Combined with syscall
> interception this makes it possible to securely delegate mounting of
> images on loop devices, i.e. when a user calls mount -o loop <image>
> <mountpoint> it will be possible to completely setup the loop device.
> The final mount syscall to actually perform the mount will be handled
> through syscall interception and be performed by a sufficiently
> privileged process. Syscall interception is already supported through a
> new seccomp feature we implemented in [1] and extended in [2] and is
> actively used in production workloads. The additional loopfs work will
> be used there and in various other workloads too. You'll find a short
> illustration how this works with syscall interception below in [4].

Would that privileged process then allow you to mount your filesystem
images with things like ext4? As far as I know, the filesystem
maintainers don't generally consider "untrusted filesystem image" to
be a strongly enforced security boundary; and worse, if an attacker
has access to a loop device from which something like ext4 is mounted,
things like "struct ext4_dir_entry_2" will effectively be in shared
memory, and an attacker can trivially bypass e.g.
ext4_check_dir_entry(). At the moment, that's not a huge problem (for
anything other than kernel lockdown) because only root normally has
access to loop devices.

Ubuntu carries an out-of-tree patch that afaik blocks the shared
memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>

But even with that patch, I'm not super excited about exposing
filesystem image parsing attack surface to containers unless you run
the filesystem in a sandboxed environment (at which point you don't
need a loop device anymore either).

2020-04-08 18:31:21

by Stéphane Graber

Subject: Re: [PATCH 0/8] loopfs

On Wed, Apr 8, 2020 at 12:24 PM Jann Horn <[email protected]> wrote:
>
> On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> <[email protected]> wrote:
> > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > devices in sandboxed workloads without exposing /dev or
> > /dev/loop-control to the workload in question and without having to
> > implement a complex and also racy protocol to send around file
> > descriptors for loop devices. With loopfs each mount is a new instance,
> > i.e. loop devices created in one loopfs instance are independent of any
> > loop devices created in another loopfs instance. This allows
> > sufficiently privileged tools to have their own private stash of loop
> > device instances. Dmitry has expressed his desire to use this for
> > syzkaller in a private discussion. And various parties that want to use
> > it are Cced here too.
> >
> > In addition, the loopfs filesystem can be mounted by user namespace root
> > and is thus suitable for use in containers. Combined with syscall
> > interception this makes it possible to securely delegate mounting of
> > images on loop devices, i.e. when a user calls mount -o loop <image>
> > <mountpoint> it will be possible to completely setup the loop device.
> > The final mount syscall to actually perform the mount will be handled
> > through syscall interception and be performed by a sufficiently
> > privileged process. Syscall interception is already supported through a
> > new seccomp feature we implemented in [1] and extended in [2] and is
> > actively used in production workloads. The additional loopfs work will
> > be used there and in various other workloads too. You'll find a short
> > illustration how this works with syscall interception below in [4].
>
> Would that privileged process then allow you to mount your filesystem
> images with things like ext4? As far as I know, the filesystem
> maintainers don't generally consider "untrusted filesystem image" to
> be a strongly enforced security boundary; and worse, if an attacker
> has access to a loop device from which something like ext4 is mounted,
> things like "struct ext4_dir_entry_2" will effectively be in shared
> memory, and an attacker can trivially bypass e.g.
> ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> anything other than kernel lockdown) because only root normally has
> access to loop devices.
>
> Ubuntu carries an out-of-tree patch that afaik blocks the shared
> memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>
>
> But even with that patch, I'm not super excited about exposing
> filesystem image parsing attack surface to containers unless you run
> the filesystem in a sandboxed environment (at which point you don't
> need a loop device anymore either).

So in general we certainly agree that you should never expose someone
that you wouldn't trust with root on the host to syscall interception
mounting of real kernel filesystems.

But that's not all that our syscall interception logic can do. We have
support for rewriting a normal filesystem mount attempt to instead use
an available FUSE implementation. As far as the user is concerned,
they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
on /mnt as requested, except that the container manager intercepted
the mount attempt and instead spawned fuse2fs for that mount. This
requires absolutely no change to the software the user is running.

loopfs, with that interception mode, will let us also handle all cases
where a loop would be used, similarly without needing any change to
the software being run. If a piece of software calls the command
"mount -o loop blah.img /mnt", the "mount" command will setup a loop
device as it normally would (doing so through loopfs) and then will
call the "mount" syscall, which will get intercepted and redirected to
a FUSE implementation if so configured, resulting in the expected
filesystem being mounted for the user.

LXD with syscall interception offers both straight up privileged
mounting using the kernel fs or using a FUSE based implementation.
This is configurable on a per-filesystem and per-container basis.
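
To make the per-filesystem choice concrete, here is a minimal sketch of such a policy lookup. This is not LXD's code: the struct, the enum and the table contents are invented for illustration, and only fuse2fs as the ext4 FUSE helper comes from the description above.

```c
#include <assert.h>
#include <string.h>

/* How an intercepted mount request is carried out (hypothetical). */
enum mount_backend { BACKEND_DENY, BACKEND_KERNEL, BACKEND_FUSE };

struct fs_policy {
	const char *fstype;
	enum mount_backend backend;
	const char *fuse_helper;	/* only meaningful for BACKEND_FUSE */
};

/* Example per-container table; the entries are assumptions. */
static const struct fs_policy example_policy[] = {
	{ "ext4",  BACKEND_FUSE,   "fuse2fs" },
	{ "btrfs", BACKEND_KERNEL, NULL },
};

/* Return the policy entry for fstype, or NULL if none is configured. */
const struct fs_policy *policy_lookup(const struct fs_policy *tbl, size_t n,
				      const char *fstype)
{
	assert(fstype != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].fstype, fstype) == 0)
			return &tbl[i];
	}
	return NULL;
}

/* Filesystems without an entry are refused rather than mounted. */
enum mount_backend policy_backend(const struct fs_policy *tbl, size_t n,
				  const char *fstype)
{
	const struct fs_policy *p = policy_lookup(tbl, n, fstype);

	return p ? p->backend : BACKEND_DENY;
}
```

Defaulting to a refusal for anything not listed keeps the kernel filesystem parsers out of reach unless the administrator opted in for that container.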

I hope that clarifies what we're doing here :)

Stéphane

2020-04-09 07:04:28

by Dmitry Vyukov

Subject: Re: [PATCH 0/8] loopfs

On Wed, Apr 8, 2020 at 6:41 PM Stéphane Graber <[email protected]> wrote:
>
> On Wed, Apr 8, 2020 at 12:24 PM Jann Horn <[email protected]> wrote:
> >
> > On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> > <[email protected]> wrote:
> > > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > > devices in sandboxed workloads without exposing /dev or
> > > /dev/loop-control to the workload in question and without having to
> > > implement a complex and also racy protocol to send around file
> > > descriptors for loop devices. With loopfs each mount is a new instance,
> > > i.e. loop devices created in one loopfs instance are independent of any
> > > loop devices created in another loopfs instance. This allows
> > > sufficiently privileged tools to have their own private stash of loop
> > > device instances. Dmitry has expressed his desire to use this for
> > > syzkaller in a private discussion. And various parties that want to use
> > > it are Cced here too.
> > >
> > > In addition, the loopfs filesystem can be mounted by user namespace root
> > > and is thus suitable for use in containers. Combined with syscall
> > > interception this makes it possible to securely delegate mounting of
> > > images on loop devices, i.e. when a user calls mount -o loop <image>
> > > <mountpoint> it will be possible to completely setup the loop device.
> > > The final mount syscall to actually perform the mount will be handled
> > > through syscall interception and be performed by a sufficiently
> > > privileged process. Syscall interception is already supported through a
> > > new seccomp feature we implemented in [1] and extended in [2] and is
> > > actively used in production workloads. The additional loopfs work will
> > > be used there and in various other workloads too. You'll find a short
> > > illustration how this works with syscall interception below in [4].
> >
> > Would that privileged process then allow you to mount your filesystem
> > images with things like ext4? As far as I know, the filesystem
> > maintainers don't generally consider "untrusted filesystem image" to
> > be a strongly enforced security boundary; and worse, if an attacker
> > has access to a loop device from which something like ext4 is mounted,
> > things like "struct ext4_dir_entry_2" will effectively be in shared
> > memory, and an attacker can trivially bypass e.g.
> > ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> > anything other than kernel lockdown) because only root normally has
> > access to loop devices.
> >
> > Ubuntu carries an out-of-tree patch that afaik blocks the shared
> > memory thing: <https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/eoan/commit?id=4bc428fdf5500b7366313f166b7c9c50ee43f2c4>
> >
> > But even with that patch, I'm not super excited about exposing
> > filesystem image parsing attack surface to containers unless you run
> > the filesystem in a sandboxed environment (at which point you don't
> > need a loop device anymore either).
>
> So in general we certainly agree that you should never expose someone
> that you wouldn't trust with root on the host to syscall interception
> mounting of real kernel filesystems.
>
> But that's not all that our syscall interception logic can do. We have
> support for rewriting a normal filesystem mount attempt to instead use
> an available FUSE implementation. As far as the user is concerned,
> they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
> on /mnt as requested, except that the container manager intercepted
> the mount attempt and instead spawned fuse2fs for that mount. This
> requires absolutely no change to the software the user is running.
>
> loopfs, with that interception mode, will let us also handle all cases
> where a loop would be used, similarly without needing any change to
> the software being run. If a piece of software calls the command
> "mount -o loop blah.img /mnt", the "mount" command will setup a loop
> device as it normally would (doing so through loopfs) and then will
> call the "mount" syscall, which will get intercepted and redirected to
> a FUSE implementation if so configured, resulting in the expected
> filesystem being mounted for the user.
>
> LXD with syscall interception offers both straight up privileged
> mounting using the kernel fs or using a FUSE based implementation.
> This is configurable on a per-filesystem and per-container basis.
>
> I hope that clarifies what we're doing here :)
>
> Stéphane

Hi Christian,

Our use case for loopfs in syzkaller would be isolation of several
test processes from each other.
Currently all loop devices and loop-control are global and cause test
processes to collide, which in turn causes non-reproducible coverage
and non-reproducible crashes. Ideally we give each test process its
own loopfs instance.
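
Roughly what that could look like on our side, as a sketch only: the directory layout and the helper names are made up, the "loop" filesystem type is the one from the cover letter, and it assumes the test process is privileged enough to mount.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Build "<base>/loopfs-<id>" as the private mount point of test process
 * number id. Hypothetical layout. Returns 0, or -1 if base is empty or
 * the result does not fit into buf.
 */
int loopfs_instance_path(char *buf, size_t len, const char *base, int id)
{
	int n;

	assert(buf != NULL && base != NULL);
	if (strlen(base) == 0)
		return -1;
	n = snprintf(buf, len, "%s/loopfs-%d", base, id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Create the directory and mount a fresh loopfs instance on it, so that
 * loop devices allocated by this test process do not collide with those
 * of other test processes. Returns 0 on success or -errno.
 */
int loopfs_instance_create(const char *base, int id)
{
	char path[256];

	if (loopfs_instance_path(path, sizeof(path), base, id) < 0)
		return -ENAMETOOLONG;
	if (mkdir(path, 0700) < 0 && errno != EEXIST)
		return -errno;
	if (mount("loop", path, "loop", 0, NULL) < 0)
		return -errno;
	return 0;
}
```

Each test process would then open loop-control under its own mount point instead of the global /dev/loop-control.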

2020-04-15 02:08:37

by Tejun Heo

Subject: Re: [PATCH 6/8] genhd: add minimal namespace infrastructure

Hello,

On Wed, Apr 08, 2020 at 05:21:49PM +0200, Christian Brauner wrote:
> This lets the block_class properly support loopfs device by introducing
> the minimal infrastructure needed to support different sysfs views for
> devices belonging to the block_class. This is similar to how network
> devices work. Note, that nothing changes with this patch since

I was hoping that all devices on the system would be visible at the root level
as administration at system level becomes pretty tricky otherwise. Is it just
me who thinks this way?

Thanks.

--
tejun

2020-04-15 02:27:27

by Christian Brauner

Subject: Re: [PATCH 6/8] genhd: add minimal namespace infrastructure

On Mon, Apr 13, 2020 at 03:04:52PM -0400, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 08, 2020 at 05:21:49PM +0200, Christian Brauner wrote:
> > This lets the block_class properly support loopfs device by introducing
> > the minimal infrastructure needed to support different sysfs views for
> > devices belonging to the block_class. This is similar to how network
> > devices work. Note, that nothing changes with this patch since
>
> I was hoping that all devices on the system would be visible at the root level
> as administration at system level becomes pretty tricky otherwise. Is it just
> me who thinks this way?

Hey Tejun,

I think this is the same question, in a different form, that you had in
https://lore.kernel.org/lkml/20200413193950.tokh5m7wsyrous3c@wittgenstein/T/#m20b396a29c8d499d9dc073e6aef38f38c08f8bbe
and I tried to answer it there.

Thanks!
Christian