2023-08-14 14:58:18

by Michael Weiß

Subject: [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace

Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD,
which marks a cgroup device program as a device guard.
This may be used to guard actions on device nodes, e.g., mknod,
in a non-initial userns.

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is no longer necessary to deny mknod.
User space applications may then map devices to different locations
in the file system by calling mknod() inside the container.
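
As a rough illustration of the kind of device cgroup policy a container manager might install, the following user space model mirrors the decision a cgroup device program makes. The struct and constants are local stand-ins modeled on struct bpf_cgroup_dev_ctx and the BPF_DEVCG_* values from the UAPI header. The policy itself (allow only the char device 1:3, i.e. /dev/null) is an assumption chosen for the example and is not the GyroidOS policy.

```c
#include <stdint.h>

/* Local stand-ins modeled on the BPF_DEVCG_* UAPI constants. */
enum { DEVCG_ACC_MKNOD = 1 << 0, DEVCG_ACC_READ = 1 << 1, DEVCG_ACC_WRITE = 1 << 2 };
enum { DEVCG_DEV_BLOCK = 1 << 0, DEVCG_DEV_CHAR = 1 << 1 };

/* Local stand-in modeled on struct bpf_cgroup_dev_ctx. */
struct dev_ctx {
	uint32_t access_type;	/* (access << 16) | device type */
	uint32_t major;
	uint32_t minor;
};

/* Hypothetical policy: permit any access to the char device 1:3 only. */
static int devcg_allow(const struct dev_ctx *ctx)
{
	uint32_t type = ctx->access_type & 0xFFFF;

	if (type == DEVCG_DEV_CHAR && ctx->major == 1 && ctx->minor == 3)
		return 1;	/* allow */
	return 0;		/* deny */
}
```

With such an allow list in place, a mknod request for 1:3 passes the device cgroup check, while a request for any other device is denied regardless of the capability check in the VFS.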

A use case for this, which we also rely on in GyroidOS, is running
virsh for VMs inside an unprivileged container. virsh creates device
nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which currently
fails in a non-initial userns even if a cgroup device whitelist with
the corresponding major and minor of /dev/null exists. Thus, in this
case the usual bind mounts or pre-populated device nodes under /dev
are not sufficient.

To circumvent this limitation, we allow mknod() in the VFS if a
bpf cgroup device guard is enabled for the current task and check
CAP_MKNOD for the current user namespace instead of the init userns.
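
The resulting capability decision can be summarized by the following user space model. It is only a sketch of the logic described here; all parameters are hypothetical stand-ins for kernel state and none of the names exist in the kernel.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the device-node capability check.
 *   is_dev         - node is a char or block device (and not a whiteout)
 *   guarded        - a cgroup device guard is active for the task
 *   cap_init_ns    - task holds CAP_MKNOD in the initial user namespace
 *   cap_current_ns - task holds CAP_MKNOD in its own user namespace
 */
static int mknod_cap_check(bool is_dev, bool guarded,
			   bool cap_init_ns, bool cap_current_ns)
{
	if (!is_dev)
		return 0;	/* FIFOs, sockets, regular files: no capability needed */
	if (guarded)
		return cap_current_ns ? 0 : -EPERM;
	return cap_init_ns ? 0 : -EPERM;
}
```

The only behavioral change is in the guarded branch: a task that holds CAP_MKNOD only in its own user namespace is refused today and would be allowed once a device guard is active.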

To avoid unusable device nodes on file systems mounted in a
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
flag for cgroup device guarded tasks.
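
The effect on may_open_dev() can be modeled in user space as follows. The flag bits are hypothetical values defined only for this sketch, not the kernel's MNT_NODEV and SB_I_NODEV constants.

```c
#include <stdbool.h>

/* Hypothetical flag bits for the model; not the kernel's values. */
#define MODEL_MNT_NODEV   (1u << 0)	/* mount was made with nodev */
#define MODEL_SB_I_NODEV  (1u << 1)	/* superblock-level nodev marker */

static bool model_may_open_dev(unsigned int flags, bool guarded)
{
	if (flags & MODEL_MNT_NODEV)
		return false;	/* nodev mounts are always honored */
	if (guarded)
		return true;	/* SB_I_NODEV is ignored for guarded tasks */
	return !(flags & MODEL_SB_I_NODEV);
}
```

Note that an explicit nodev mount option still blocks device opens in both cases; only the superblock-level flag is bypassed for guarded tasks.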

Tested for a GyroidOS container generated by the cmld using the
following user space patch: https://github.com/gyroidos/cml/pull/394

I discussed this internally with Christian in the UAPI group earlier.
I am now posting it to the public list, since the LXC/LXD folks have
also expressed interest in this.

This series applies to the latest mainline v6.5-rc6 tag.

Signed-off-by: Michael Weiß <[email protected]>
---
Michael Weiß (4):
      bpf: add cgroup device guard to flag a cgroup device prog
      bpf: provide cgroup_device_guard in bpf_prog_info to user space
      device_cgroup: wrapper for bpf cgroup device guard
      fs: allow mknod in non-initial userns using cgroup device guard

 fs/namei.c                     | 19 ++++++++++++++++---
 include/linux/bpf-cgroup.h     |  7 +++++++
 include/linux/bpf.h            |  1 +
 include/linux/device_cgroup.h  |  7 +++++++
 include/uapi/linux/bpf.h       |  8 +++++++-
 kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  6 +++++-
 security/device_cgroup.c       | 10 ++++++++++
 tools/bpf/bpftool/prog.c       |  2 ++
 tools/include/uapi/linux/bpf.h |  8 +++++++-
 10 files changed, 92 insertions(+), 6 deletions(-)
---
base-commit: 2ccdd1b13c591d306f0401d98dedc4bdcd02b421
change-id: 20230814-devcg_guard-5398ef84bf7b

Best regards,
--
Michael Weiß <[email protected]>



2023-08-14 14:58:39

by Michael Weiß

Subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is no longer necessary to deny mknod.
User space applications may then map devices to different locations
in the file system by calling mknod() inside the container.

A use case for this, which we also rely on in GyroidOS, is running
virsh for VMs inside an unprivileged container. virsh creates device
nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which currently
fails in a non-initial userns even if a cgroup device whitelist with
the corresponding major and minor of /dev/null exists. Thus, in this
case the usual bind mounts or pre-populated device nodes under /dev
are not sufficient.

To circumvent this limitation, we allow mknod() in fs/namei.c if a
bpf cgroup device guard is enabled for the current task, as reported
by devcgroup_task_is_guarded(), and check CAP_MKNOD for the current
user namespace with ns_capable() instead of the global CAP_MKNOD.

To avoid unusable device nodes on file systems mounted in a
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
flag for cgroup device guarded tasks.

Signed-off-by: Michael Weiß <[email protected]>
---
fs/namei.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..ef4f22b9575c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
 
 bool may_open_dev(const struct path *path)
 {
+	if (devcgroup_task_is_guarded(current))
+		return !(path->mnt->mnt_flags & MNT_NODEV);
+
 	return !(path->mnt->mnt_flags & MNT_NODEV) &&
 		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
 }
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
-	    !capable(CAP_MKNOD))
-		return -EPERM;
+	/*
+	 * In case of a device cgroup restriction allow mknod in user
+	 * namespace. Otherwise just check global capability; thus,
+	 * mknod is also disabled for user namespace other than the
+	 * initial one.
+	 */
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+		if (devcgroup_task_is_guarded(current)) {
+			if (!ns_capable(current_user_ns(), CAP_MKNOD))
+				return -EPERM;
+		} else if (!capable(CAP_MKNOD))
+			return -EPERM;
+	}
 
 	if (!dir->i_op->mknod)
 		return -EPERM;

--
2.30.2


2023-08-14 15:45:30

by Michael Weiß

Subject: [PATCH RFC 2/4] bpf: provide cgroup_device_guard in bpf_prog_info to user space

To allow user space tools to check if a device guard is active,
we extend struct bpf_prog_info with a cgroup_device_guard field.
bpftool then uses this field in its print_prog_header_*() functions.

Output of bpftool, here for the bpf prog of a GyroidOS container:

  # ./bpftool prog show id 37
  37: cgroup_device tag 1824c08482acee1b gpl cgdev_guard
          loaded_at 2023-08-14T13:47:10+0200 uid 0
          xlated 456B jited 311B memlock 4096B
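
The new field is taken from the existing padding bits, so the layout of the struct should not change. The following sketch checks this on reduced stand-in structs. The struct names info_before and info_after are made up for the example and contain only the members adjacent to the bitfield, not the full struct bpf_prog_info.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the relevant part of struct bpf_prog_info. */
struct info_before {
	uint32_t ifindex;
	uint32_t gpl_compatible:1;
	uint32_t :31;			/* alignment pad */
	uint64_t netns_dev;
};

struct info_after {
	uint32_t ifindex;
	uint32_t gpl_compatible:1;
	uint32_t cgroup_device_guard:1;	/* taken from the former padding */
	uint32_t :30;			/* alignment pad */
	uint64_t netns_dev;
};

/* Size and member offsets must be identical before and after. */
_Static_assert(sizeof(struct info_before) == sizeof(struct info_after),
	       "adding the flag must not change the struct size");
_Static_assert(offsetof(struct info_before, netns_dev) ==
	       offsetof(struct info_after, netns_dev),
	       "adding the flag must not move later members");
```

Because the bit was previously padding, older user space built against the old header simply does not look at it.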

Signed-off-by: Michael Weiß <[email protected]>
---
 include/uapi/linux/bpf.h       | 3 ++-
 kernel/bpf/syscall.c           | 1 +
 tools/bpf/bpftool/prog.c       | 2 ++
 tools/include/uapi/linux/bpf.h | 3 ++-
 4 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3be57f7957b1..7b383665d5f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6331,7 +6331,8 @@ struct bpf_prog_info {
 	char name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
 	__u32 gpl_compatible:1;
-	__u32 :31; /* alignment pad */
+	__u32 cgroup_device_guard:1;
+	__u32 :30; /* alignment pad */
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 nr_jited_ksyms;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 33ea67c702c1..9bc6d5dd2e90 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4062,6 +4062,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 	info.created_by_uid = from_kuid_munged(current_user_ns(),
 					       prog->aux->user->uid);
 	info.gpl_compatible = prog->gpl_compatible;
+	info.cgroup_device_guard = prog->aux->cgroup_device_guard;
 
 	memcpy(info.tag, prog->tag, sizeof(prog->tag));
 	memcpy(info.name, prog->aux->name, sizeof(prog->aux->name));
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 8443a149dd17..66d21794b641 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -434,6 +434,7 @@ static void print_prog_header_json(struct bpf_prog_info *info, int fd)
 		     info->tag[4], info->tag[5], info->tag[6], info->tag[7]);
 
 	jsonw_bool_field(json_wtr, "gpl_compatible", info->gpl_compatible);
+	jsonw_bool_field(json_wtr, "cgroup_device_guard", info->cgroup_device_guard);
 	if (info->run_time_ns) {
 		jsonw_uint_field(json_wtr, "run_time_ns", info->run_time_ns);
 		jsonw_uint_field(json_wtr, "run_cnt", info->run_cnt);
@@ -519,6 +520,7 @@ static void print_prog_header_plain(struct bpf_prog_info *info, int fd)
 	fprint_hex(stdout, info->tag, BPF_TAG_SIZE, "");
 	print_dev_plain(info->ifindex, info->netns_dev, info->netns_ino);
 	printf("%s", info->gpl_compatible ? " gpl" : "");
+	printf("%s", info->cgroup_device_guard ? " cgdev_guard" : "");
 	if (info->run_time_ns)
 		printf(" run_time_ns %lld run_cnt %lld",
 		       info->run_time_ns, info->run_cnt);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3be57f7957b1..7b383665d5f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6331,7 +6331,8 @@ struct bpf_prog_info {
 	char name[BPF_OBJ_NAME_LEN];
 	__u32 ifindex;
 	__u32 gpl_compatible:1;
-	__u32 :31; /* alignment pad */
+	__u32 cgroup_device_guard:1;
+	__u32 :30; /* alignment pad */
 	__u64 netns_dev;
 	__u64 netns_ino;
 	__u32 nr_jited_ksyms;

--
2.30.2


2023-08-14 16:44:58

by Alexander Mikhalitsyn

Subject: Re: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard

+CC Stéphane Graber <[email protected]>


On Mon, Aug 14, 2023 at 4:26 PM Michael Weiß
<[email protected]> wrote:
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, we also use in GyroidOS, is to run virsh for
> VMs inside an unprivileged container. virsh creates device nodes,
> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
> in a non-initial userns, even if a cgroup device white list with the
> corresponding major, minor of /dev/null exists. Thus, in this case
> the usual bind mounts or pre populated device nodes under /dev are
> not sufficient.
>
> To circumvent this limitation, we allow mknod() in fs/namei.c if a
> bpf cgroup device guard is enabled for the current task using
> devcgroup_task_is_guarded() and check CAP_MKNOD for the current user
> namespace by ns_capable() instead of the global CAP_MKNOD.
>
> To avoid unusable device nodes on file systems mounted in
> non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
> for cgroup device guarded tasks.
>
> Signed-off-by: Michael Weiß <[email protected]>
> ---
> fs/namei.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e56ff39a79bc..ef4f22b9575c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
>
>  bool may_open_dev(const struct path *path)
>  {
> +	if (devcgroup_task_is_guarded(current))
> +		return !(path->mnt->mnt_flags & MNT_NODEV);
> +
>  	return !(path->mnt->mnt_flags & MNT_NODEV) &&
>  		!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
>  }
> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>
> -	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> -	    !capable(CAP_MKNOD))
> -		return -EPERM;
> +	/*
> +	 * In case of a device cgroup restriction allow mknod in user
> +	 * namespace. Otherwise just check global capability; thus,
> +	 * mknod is also disabled for user namespace other than the
> +	 * initial one.
> +	 */
> +	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> +		if (devcgroup_task_is_guarded(current)) {
> +			if (!ns_capable(current_user_ns(), CAP_MKNOD))
> +				return -EPERM;
> +		} else if (!capable(CAP_MKNOD))
> +			return -EPERM;
> +	}
>
>  	if (!dir->i_op->mknod)
>  		return -EPERM;
>
> --
> 2.30.2
>