Hey everyone,
This is v2 of loopfs.
I've added a few more people to the Cc that want to make use of this,
added the missing ucount part that David pointed out, and expanded a
little more on how this is used.
This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we had overall good experiences so far. Parts of it are also
based on [3] but it's mostly a new, imho cleaner approach.
Loopfs allows applications to create private loop device instances
for various use-cases. It covers the use-case that was expressed on-list
and in-person to get programmatic access to private loop devices for
image building in sandboxes. An illustration for this is provided in
[4].
Loopfs is also intended to provide loop devices to privileged and
unprivileged containers, which has been a frequent request from various
major tools (Chromium, Kubernetes, LXD, Moby/Docker, systemd). I'm
providing a non-exhaustive list of issues and requests (cf. [5]) around
this feature mainly to illustrate that I'm not making the use-cases up.
Currently none of this can be done safely since handing a loop device
from the host into a container means that the container can see anything
that the host is doing with that loop device and what other containers
are doing with that device too. And (bind-)mounting devtmpfs inside of
containers is not secure at all so also not an option (though sometimes
done out of despair apparently).
The workloads people run in containers are supposed to be indiscernible
from workloads run on the host, and apart from containerization tools
themselves, the tools inside of the container should not need to be
aware that they are running inside a container. This is especially true
when running older distros in containers that existed before containers
were as ubiquitous as they are today. With loopfs, users can call
mount -o loop and in a correctly set up container things work
the same way they would on the host. The filesystem representation
allows us to do this in a very simple way. At container setup, a
container manager can mount a private instance of loopfs somewhere, e.g.
at /dev/loopfs and then bind-mount or symlink /dev/loopfs/loop-control
to /dev/loop-control, pre-allocate and symlink the number of standard
devices into their standard location and have a service file or rules in
place that symlink additionally allocated loop devices through losetup
into place as well.
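To make the container setup step concrete, here is a minimal userspace sketch of what a container manager might do. It assumes the filesystem type is "loop" (as in the sample output later in this series) and uses /dev/loopfs as the mount point from the example above. The helper names loopfs_link_paths() and loopfs_setup() are made up for illustration, and the code needs a kernel with this series applied plus sufficient privileges.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* Mount point of the private instance, as in the example above. */
#define LOOPFS_DIR "/dev/loopfs"

/*
 * Hypothetical helper: build the source path inside the loopfs instance
 * and the standard /dev destination for a node such as "loop-control"
 * or "loop0". Returns 0 on success, -1 on a bad name or truncation.
 */
int loopfs_link_paths(const char *name, char *src, char *dst, size_t len)
{
	int n;

	assert(name && src && dst);
	/* The name must be a single, non-empty path component. */
	if (name[0] == '\0' || strchr(name, '/'))
		return -1;
	n = snprintf(src, len, "%s/%s", LOOPFS_DIR, name);
	if (n < 0 || (size_t)n >= len)
		return -1;
	n = snprintf(dst, len, "/dev/%s", name);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}

/*
 * Hypothetical helper: mount a private instance and symlink its
 * loop-control node to /dev/loop-control. The fs type "loop" is an
 * assumption taken from the sample output in this series.
 */
int loopfs_setup(void)
{
	char src[64], dst[64];

	if (mkdir(LOOPFS_DIR, 0755) < 0 && errno != EEXIST)
		return -1;
	if (mount("loop", LOOPFS_DIR, "loop", 0, NULL) < 0)
		return -1;
	if (loopfs_link_paths("loop-control", src, dst, sizeof(src)) < 0)
		return -1;
	/* Drop a stale node or link, if any, before creating ours. */
	if (unlink(dst) < 0 && errno != ENOENT)
		return -1;
	return symlink(src, dst);
}
```

The same loopfs_link_paths() call with "loop0", "loop1", and so on would cover the pre-allocated devices; a bind-mount instead of a symlink works equally well for loop-control.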
With the new syscall interception logic this is also possible for
unprivileged containers. In these cases when a user calls mount -o loop
<image> <mountpoint> it will be possible to completely set up the loop
device in the container. The final mount syscall is handled through
syscall interception which we already implemented and released in
earlier kernels (see [1] and [2]) and is actively used in production
workloads. The mount is often rewritten to a fuse binary to provide safe
access for unprivileged containers.
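For readers unfamiliar with the interception mechanism from [1], the following is a rough sketch of the classic BPF program a container manager could load to have mount() trapped to userspace. It only builds the filter; the helper name mount_notify_filter() is made up for illustration, and a real filter should also validate seccomp_data.arch before looking at the syscall number, which is omitted here for brevity.

```c
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper: copy a seccomp filter into f that sends mount()
 * to the userspace notifier and allows every other syscall. Returns the
 * number of instructions written, or 0 if cap is too small.
 * NOTE: no architecture check is done; add one for production use.
 */
size_t mount_notify_filter(struct sock_filter *f, size_t cap)
{
	const struct sock_filter prog[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* mount()? Fall through to the notify return, else skip it. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mount, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	size_t n = sizeof(prog) / sizeof(prog[0]);

	assert(f != NULL);
	if (cap < n)
		return 0;
	memcpy(f, prog, sizeof(prog));
	return n;
}
```

The resulting program would be wrapped in a struct sock_fprog and installed with the seccomp() syscall using SECCOMP_SET_MODE_FILTER and SECCOMP_FILTER_FLAG_NEW_LISTENER, which returns the notification fd the container manager listens on.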
Loopfs also allows the creation of hidden/detached dynamic loop devices
and associated mounts which was also an often-issued request. With the
old mount api this can be achieved by creating a temporary loopfs and
stashing a file descriptor to the mount point and the loop-control
device and immediately unmounting the loopfs instance. With the new
mount api a detached mount can be created directly (i.e. a mount not
visible anywhere in the filesystem). New loop devices can then be
allocated and configured. They can be mounted through
/proc/self/fd/<nr> with the old mount api or by using the fd directly
with the new mount api. Combined with a mount namespace this allows for
fully auto-cleaned up loop devices on program crash. This ties back to
various use-cases and is illustrated in [4].
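The old mount api variant described above could look roughly like the sketch below. It assumes the fs type "loop" from the sample output in this series, assumes that the loopfs loop-control node accepts the existing LOOP_CTL_GET_FREE ioctl, and uses a lazy unmount (MNT_DETACH) since a plain umount would fail while fds are held. The helper names are hypothetical; see [4] for the author's own illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <unistd.h>

/*
 * Hypothetical helper: path to loop device nr as reached through an fd
 * that refers to the root of the (detached) loopfs instance. Returns the
 * path length on success, -1 on truncation.
 */
int loopfs_proc_path(int dirfd, int nr, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "/proc/self/fd/%d/loop%d", dirfd, nr);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: mount a temporary instance at tmpdir, stash fds
 * to its root and loop-control, detach the mount, and allocate a device.
 * Returns the new device index and stores the root fd in *dirfd.
 */
int loopfs_detached_alloc(const char *tmpdir, int *dirfd)
{
	int ctl, nr;

	if (mount("loop", tmpdir, "loop", 0, NULL) < 0)
		return -1;
	*dirfd = open(tmpdir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (*dirfd < 0) {
		umount2(tmpdir, MNT_DETACH);
		return -1;
	}
	ctl = openat(*dirfd, "loop-control", O_RDWR | O_CLOEXEC);
	/* From here on the instance is only reachable through the fds. */
	umount2(tmpdir, MNT_DETACH);
	if (ctl < 0) {
		close(*dirfd);
		return -1;
	}
	nr = ioctl(ctl, LOOP_CTL_GET_FREE);
	close(ctl);
	if (nr < 0)
		close(*dirfd);
	return nr;
}
```

After configuring the device, the path from loopfs_proc_path() can be passed as the source to mount(). When the process exits or crashes, the fds are closed and nothing of the instance remains visible.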
The filesystem representation requires the standard boilerplate
filesystem code we know from other tiny filesystems. And all of
the loopfs code is hidden under a config option that defaults to false.
This specifically means that none of the code even exists when users do
not have any use-case for loopfs.
In addition, the loopfs code does not alter how loop devices behave at
all, i.e. there are no changes to any existing workloads and I've taken
care to ifdef all loopfs specific things out.
Each loopfs mount is a separate instance. As such loop devices created
in one instance are independent of loop devices created in another
instance. This specifically entails that loop devices are only visible
in the loopfs instance they belong to.
The number of loop devices available in loopfs instances is
hierarchically limited through /proc/sys/user/max_loop_devices via the
ucount infrastructure (Thanks to David Rheinsberg for pointing out that
missing piece.). An administrator could e.g. set
echo 3 > /proc/sys/user/max_loop_devices at which point any loopfs
instance mounted by uid x can only create 3 loop devices no matter how
many loopfs instances they mount. This limit applies hierarchically to
all user namespaces.
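To illustrate what "hierarchically" means here, below is a small userspace model of the accounting. It is not the kernel's ucount code: it ignores the per-uid dimension and locking, and struct ns_count, loop_charge() and loop_uncharge() are names invented for this sketch. The point is only that allocating a device charges the owning user namespace and all of its ancestors, and fails if any level is already at its limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for one level of the user namespace hierarchy. */
struct ns_count {
	struct ns_count *parent;	/* NULL for the initial namespace */
	int max;			/* this level's max_loop_devices */
	int count;			/* devices currently charged here */
};

/*
 * Charge one loop device to ns and every ancestor. If any level is at
 * its limit, undo the charges made so far and fail.
 */
bool loop_charge(struct ns_count *ns)
{
	struct ns_count *iter, *undo;

	assert(ns != NULL);
	for (iter = ns; iter; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return true;

fail:
	/* Roll back only the levels below the one that refused. */
	for (undo = ns; undo != iter; undo = undo->parent)
		undo->count--;
	return false;
}

/* Release one device from ns and every ancestor. */
void loop_uncharge(struct ns_count *ns)
{
	for (; ns; ns = ns->parent) {
		assert(ns->count > 0);
		ns->count--;
	}
}
```

With a host limit of 3 and two child namespaces limited to 2 each, the children can hold at most 3 devices between them, regardless of how many loopfs instances they mount.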
In addition, loopfs has a "max" mount option which allows setting a limit
on the number of loop devices for a given loopfs instance. This is
mainly to cover use-cases where a single loopfs mount is shared as a
bind-mount between multiple parties that are prevented from creating
other loopfs mounts and is equivalent to the semantics of the binderfs
and devpts "max" mount option.
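As a rough sketch of how the per-instance limit might be requested from C: the "max=<n>" spelling below is an assumption by analogy with binderfs and devpts, the fs type "loop" is taken from the sample output in this series, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: format the mount data string for a per-instance
 * device limit. The "max=<n>" syntax is assumed, not confirmed by this
 * cover letter. Returns the string length, or -1 on truncation.
 */
int loopfs_max_opt(unsigned int max, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len, "max=%u", max);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: mount an instance limited to max loop devices. */
int loopfs_mount_limited(const char *target, unsigned int max)
{
	char data[32];

	if (loopfs_max_opt(max, data, sizeof(data)) < 0)
		return -1;
	return mount("loop", target, "loop", 0, data);
}
```

A container manager would call something like loopfs_mount_limited("/dev/loopfs", 4) once and then bind-mount the result into each party that shares it.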
Thanks!
Christian
[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/[email protected]
[4]: https://gist.github.com/brauner/dcaf15e6977cc1bfadfb3965f126c02f
[5]: https://github.com/kubernetes-sigs/kind/issues/1333
https://github.com/kubernetes-sigs/kind/issues/1248
https://lists.freedesktop.org/archives/systemd-devel/2017-August/039453.html
https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md#loop-mount
https://gitlab.com/gitlab-com/support-forum/issues/3732
https://github.com/moby/moby/issues/27886
https://twitter.com/_AkihiroSuda_/status/1249664478267854848
https://serverfault.com/questions/701384/loop-device-in-a-linux-container
https://discuss.linuxcontainers.org/t/providing-access-to-loop-and-other-devices-in-containers/1352
https://discuss.concourse-ci.org/t/exposing-dev-loop-devices-in-privileged-mode/813
Christian Brauner (7):
kobject_uevent: remove unneeded netlink_ns check
loopfs: implement loopfs
loop: use ns_capable for some loop operations
kernfs: handle multiple namespace tags
loop: preserve sysfs backwards compatibility
loopfs: start attaching correct namespace during loop_add()
loopfs: only show devices in their correct instance
Documentation/filesystems/sysfs-tagging.txt | 1 -
MAINTAINERS | 5 +
block/genhd.c | 79 ++++
drivers/base/devtmpfs.c | 4 +-
drivers/block/Kconfig | 4 +
drivers/block/Makefile | 1 +
drivers/block/loop.c | 226 +++++++--
drivers/block/loop.h | 12 +-
drivers/block/loopfs/Makefile | 3 +
drivers/block/loopfs/loopfs.c | 494 ++++++++++++++++++++
drivers/block/loopfs/loopfs.h | 36 ++
fs/kernfs/dir.c | 38 +-
fs/kernfs/kernfs-internal.h | 33 +-
fs/kernfs/mount.c | 11 +-
fs/sysfs/mount.c | 14 +-
include/linux/device.h | 3 +
include/linux/genhd.h | 3 +
include/linux/kernfs.h | 44 +-
include/linux/kobject_ns.h | 7 +-
include/linux/sysfs.h | 8 +-
include/linux/user_namespace.h | 3 +
include/uapi/linux/magic.h | 1 +
kernel/ucount.c | 3 +
lib/kobject.c | 17 +-
lib/kobject_uevent.c | 2 +-
net/core/net-sysfs.c | 6 -
26 files changed, 953 insertions(+), 105 deletions(-)
create mode 100644 drivers/block/loopfs/Makefile
create mode 100644 drivers/block/loopfs/loopfs.c
create mode 100644 drivers/block/loopfs/loopfs.h
base-commit: ae83d0b416db002fe95601e7f97f64b59514d936
--
2.26.1
Back when I rewrote large chunks of uevent sending I should have removed
the .netlink_ns method completely after having removed its last user in
[1]. Let's remove it now and also remove the helper associated with it,
which is unused too.
Fixes: a3498436b3a0 ("netns: restrict uevents") /* No backport needed. */
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
unchanged
---
Documentation/filesystems/sysfs-tagging.txt | 1 -
include/linux/kobject_ns.h | 3 ---
lib/kobject.c | 13 -------------
lib/kobject_uevent.c | 2 +-
net/core/net-sysfs.c | 6 ------
5 files changed, 1 insertion(+), 24 deletions(-)
diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt
index c7c8e6438958..51d28dd8b84f 100644
--- a/Documentation/filesystems/sysfs-tagging.txt
+++ b/Documentation/filesystems/sysfs-tagging.txt
@@ -37,6 +37,5 @@ Users of this interface:
- define a type in the kobj_ns_type enumeration.
- call kobj_ns_type_register() with its kobj_ns_type_operations which has
- current_ns() which returns current's namespace
- - netlink_ns() which returns a socket's namespace
- initial_ns() which returns the initial namesapce
- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/include/linux/kobject_ns.h b/include/linux/kobject_ns.h
index 069aa2ebef90..991a9286bcea 100644
--- a/include/linux/kobject_ns.h
+++ b/include/linux/kobject_ns.h
@@ -32,7 +32,6 @@ enum kobj_ns_type {
/*
* Callbacks so sysfs can determine namespaces
* @grab_current_ns: return a new reference to calling task's namespace
- * @netlink_ns: return namespace to which a sock belongs (right?)
* @initial_ns: return the initial namespace (i.e. init_net_ns)
* @drop_ns: drops a reference to namespace
*/
@@ -40,7 +39,6 @@ struct kobj_ns_type_operations {
enum kobj_ns_type type;
bool (*current_may_mount)(void);
void *(*grab_current_ns)(void);
- const void *(*netlink_ns)(struct sock *sk);
const void *(*initial_ns)(void);
void (*drop_ns)(void *);
};
@@ -52,7 +50,6 @@ const struct kobj_ns_type_operations *kobj_ns_ops(struct kobject *kobj);
bool kobj_ns_current_may_mount(enum kobj_ns_type type);
void *kobj_ns_grab_current(enum kobj_ns_type type);
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk);
const void *kobj_ns_initial(enum kobj_ns_type type);
void kobj_ns_drop(enum kobj_ns_type type, void *ns);
diff --git a/lib/kobject.c b/lib/kobject.c
index 83198cb37d8d..6f07083cc111 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -1092,19 +1092,6 @@ void *kobj_ns_grab_current(enum kobj_ns_type type)
}
EXPORT_SYMBOL_GPL(kobj_ns_grab_current);
-const void *kobj_ns_netlink(enum kobj_ns_type type, struct sock *sk)
-{
- const void *ns = NULL;
-
- spin_lock(&kobj_ns_type_lock);
- if ((type > KOBJ_NS_TYPE_NONE) && (type < KOBJ_NS_TYPES) &&
- kobj_ns_ops_tbl[type])
- ns = kobj_ns_ops_tbl[type]->netlink_ns(sk);
- spin_unlock(&kobj_ns_type_lock);
-
- return ns;
-}
-
const void *kobj_ns_initial(enum kobj_ns_type type)
{
const void *ns = NULL;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 7998affa45d4..a45b3eeaa2b9 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -400,7 +400,7 @@ static int kobject_uevent_net_broadcast(struct kobject *kobj,
* are the only tag relevant here since we want to decide which
* network namespaces to broadcast the uevent into.
*/
- if (ops && ops->netlink_ns && kobj->ktype->namespace)
+ if (ops && kobj->ktype->namespace)
if (ops->type == KOBJ_NS_TYPE_NET)
net = kobj->ktype->namespace(kobj);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4773ad6ec111..3fa35a3c843a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1694,16 +1694,10 @@ static const void *net_initial_ns(void)
return &init_net;
}
-static const void *net_netlink_ns(struct sock *sk)
-{
- return sock_net(sk);
-}
-
const struct kobj_ns_type_operations net_ns_type_operations = {
.type = KOBJ_NS_TYPE_NET,
.current_may_mount = net_current_may_mount,
.grab_current_ns = net_grab_current_ns,
- .netlink_ns = net_netlink_ns,
.initial_ns = net_initial_ns,
.drop_ns = net_drop_ns,
};
--
2.26.1
Tag loop devices with the namespace the loopfs instance was mounted in.
This has the consequence that loopfs devices carry the correct sysfs
permissions for all their core files. All other device files will
continue to be correctly owned by the initial namespaces. Here is sample
output:
root@b1:~# mount -t loop loop /mnt
root@b1:~# ln -sf /mnt/loop-control /dev/loop-control
root@b1:~# losetup -f
/dev/loop8
root@b1:~# ln -sf /mnt/loop8 /dev/loop8
root@b1:~# ls -al /sys/class/block/loop8
lrwxrwxrwx 1 root root 0 Apr 7 13:06 /sys/class/block/loop8 -> ../../devices/virtual/block/loop8
root@b1:~# ls -al /sys/class/block/loop8/
total 0
drwxr-xr-x 9 root root 0 Apr 7 13:06 .
drwxr-xr-x 18 nobody nogroup 0 Apr 7 13:07 ..
-r--r--r-- 1 root root 4096 Apr 7 13:06 alignment_offset
lrwxrwxrwx 1 nobody nogroup 0 Apr 7 13:07 bdi -> ../../bdi/7:8
-r--r--r-- 1 root root 4096 Apr 7 13:06 capability
-r--r--r-- 1 root root 4096 Apr 7 13:06 dev
-r--r--r-- 1 root root 4096 Apr 7 13:06 discard_alignment
-r--r--r-- 1 root root 4096 Apr 7 13:06 events
-r--r--r-- 1 root root 4096 Apr 7 13:06 events_async
-rw-r--r-- 1 root root 4096 Apr 7 13:06 events_poll_msecs
-r--r--r-- 1 root root 4096 Apr 7 13:06 ext_range
-r--r--r-- 1 root root 4096 Apr 7 13:06 hidden
drwxr-xr-x 2 nobody nogroup 0 Apr 7 13:07 holders
-r--r--r-- 1 root root 4096 Apr 7 13:06 inflight
drwxr-xr-x 2 nobody nogroup 0 Apr 7 13:07 integrity
drwxr-xr-x 3 nobody nogroup 0 Apr 7 13:07 mq
drwxr-xr-x 2 root root 0 Apr 7 13:06 power
drwxr-xr-x 3 nobody nogroup 0 Apr 7 13:07 queue
-r--r--r-- 1 root root 4096 Apr 7 13:06 range
-r--r--r-- 1 root root 4096 Apr 7 13:06 removable
-r--r--r-- 1 root root 4096 Apr 7 13:06 ro
-r--r--r-- 1 root root 4096 Apr 7 13:06 size
drwxr-xr-x 2 nobody nogroup 0 Apr 7 13:07 slaves
-r--r--r-- 1 root root 4096 Apr 7 13:06 stat
lrwxrwxrwx 1 nobody nogroup 0 Apr 7 13:07 subsystem -> ../../../../class/block
drwxr-xr-x 2 root root 0 Apr 7 13:06 trace
-rw-r--r-- 1 root root 4096 Apr 7 13:06 uevent
root@b1:~#
Cc: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
unchanged
- Christian Brauner <[email protected]>:
- Adapted commit message otherwise unchanged.
---
drivers/block/loop.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8e21d4b33e01..2dc53bad4b48 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2212,6 +2212,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
disk->private_data = lo;
disk->queue = lo->lo_queue;
sprintf(disk->disk_name, "loop%d", i);
+#ifdef CONFIG_BLK_DEV_LOOPFS
+ if (loopfs_i_sb(inode))
+ disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
+#endif
add_disk(disk);
--
2.26.1
On Wed, Apr 22, 2020 at 04:54:31PM +0200, Christian Brauner wrote:
> Back when I rewrote large chunks of uevent sending I should have removed
> the .netlink_ns method completely after having removed it's last user in
> [1]. Let's remove it now and also remove the helper associated with it
> that is unused too.
>
> Fixes: a3498436b3a0 ("netns: restrict uevents") /* No backport needed. */
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
On Wed, Apr 22, 2020 at 04:54:36PM +0200, Christian Brauner wrote:
> Tag loop devices with the namespace the loopfs instance was mounted in.
> This has the consequence that loopfs devices carry the correct sysfs
> permissions for all their core files. All other devices files will
> continue to be correctly owned by the initial namespaces. Here is sample
> output:
>
> root@b1:~# mount -t loop loop /mnt
> root@b1:~# ln -sf /mnt/loop-control /dev/loop-control
> root@b1:~# losetup -f
> /dev/loop8
> root@b1:~# ln -sf /mnt/loop8 /dev/loop8
> root@b1:~# ls -al /sys/class/block/loop8
> lrwxrwxrwx 1 root root 0 Apr 7 13:06 /sys/class/block/loop8 -> ../../devices/virtual/block/loop8
> root@b1:~# ls -al /sys/class/block/loop8/
> total 0
> drwxr-xr-x 9 root root 0 Apr 7 13:06 .
> drwxr-xr-x 18 nobody nogroup 0 Apr 7 13:07 ..
> -r--r--r-- 1 root root 4096 Apr 7 13:06 alignment_offset
> lrwxrwxrwx 1 nobody nogroup 0 Apr 7 13:07 bdi -> ../../bdi/7:8
> -r--r--r-- 1 root root 4096 Apr 7 13:06 capability
> -r--r--r-- 1 root root 4096 Apr 7 13:06 dev
> -r--r--r-- 1 root root 4096 Apr 7 13:06 discard_alignment
> -r--r--r-- 1 root root 4096 Apr 7 13:06 events
> -r--r--r-- 1 root root 4096 Apr 7 13:06 events_async
> -rw-r--r-- 1 root root 4096 Apr 7 13:06 events_poll_msecs
> -r--r--r-- 1 root root 4096 Apr 7 13:06 ext_range
> -r--r--r-- 1 root root 4096 Apr 7 13:06 hidden
> drwxr-xr-x 2 nobody nogroup 0 Apr 7 13:07 holders
> -r--r--r-- 1 root root 4096 Apr 7 13:06 inflight
> drwxr-xr-x 2 nobody nogroup 0 Apr 7 13:07 integrity
> drwxr-xr-x 3 nobody nogroup 0 Apr 7 13:07 mq
> drwxr-xr-x 2 root root 0 Apr 7 13:06 power
> drwxr-xr-x 3 nobody nogroup 0 Apr 7 13:07 queue
> -r--r--r-- 1 root root 4096 Apr 7 13:06 range
> -r--r--r-- 1 root root 4096 Apr 7 13:06 removable
> -r--r--r-- 1 root root 4096 Apr 7 13:06 ro
> -r--r--r-- 1 root root 4096 Apr 7 13:06 size
> drwxr-xr-x 2 nobody nogroup 0 Apr 7 13:07 slaves
> -r--r--r-- 1 root root 4096 Apr 7 13:06 stat
> lrwxrwxrwx 1 nobody nogroup 0 Apr 7 13:07 subsystem -> ../../../../class/block
> drwxr-xr-x 2 root root 0 Apr 7 13:06 trace
> -rw-r--r-- 1 root root 4096 Apr 7 13:06 uevent
> root@b1:~#
>
> Cc: Jens Axboe <[email protected]>
> Signed-off-by: Christian Brauner <[email protected]>
I was a *bit* worried about not taking a reference to the
user namespace, but it doesn't look like the chain of
loop_remove() -> del_gendisk() -> device_del() will allow any later
access through sysfs, so I guess it's fine.
Reviewed-by: Serge Hallyn <[email protected]>
> ---
> /* v2 */
> unchanged
> - Christian Brauner <[email protected]>:
> - Adapted commit message otherwise unchanged.
> ---
> drivers/block/loop.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 8e21d4b33e01..2dc53bad4b48 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2212,6 +2212,10 @@ static int loop_add(struct loop_device **l, int i, struct inode *inode)
> disk->private_data = lo;
> disk->queue = lo->lo_queue;
> sprintf(disk->disk_name, "loop%d", i);
> +#ifdef CONFIG_BLK_DEV_LOOPFS
> + if (loopfs_i_sb(inode))
> + disk->user_ns = loopfs_i_sb(inode)->s_user_ns;
> +#endif
>
> add_disk(disk);
>
> --
> 2.26.1