LinuxLists.cc - [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

2023-10-25 09:45:01

Subject: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod
in non-initial user namespaces.

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod()
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, we also use in GyroidOS, is to run virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre populated device nodes under /dev are
not sufficient.

To circumvent this limitation, allow mknod() by checking CAP_MKNOD
in the userns by implementing the security_inode_mknod_nscap(). The
hook implementation checks if the corresponding permission flag
BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
To avoid to create unusable inodes in user space the hook also
checks SB_I_NODEV on the corresponding super block.

Further, the security_sb_alloc_userns() hook is implemented using
cgroup_bpf_current_enabled() to allow usage of device nodes on super
blocks mounted by a guarded task.

Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM

Patch 4 to 8 rework explicit calls to devcgroup_check_permission
also as LSM hooks and finalize the conversion of the device_cgroup
subsystem to a LSM.

Patch 9 and 10 introduce new generic security hooks to be used
for the actual mknod device guard implementation.

Patch 11 wires up the security hooks in the vfs

Patch 12 and 13 provide helper functions in the bpf cgroup
subsystem.

Patch 14 finally implement the LSM hooks to grand access

Signed-off-by: Michael Weiß <[email protected]>
---
Changes in v2:
- Integrate this as LSM (Christian, Paul)
- Switched to a device cgroup specific flag instead of a generic
bpf program flag (Christian)
- do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
sb_alloc_super in fs/super.c
- Link to v1: https://lore.kernel.org/r/[email protected]

Michael Weiß (14):
device_cgroup: Implement devcgroup hooks as lsm security hooks
vfs: Remove explicit devcgroup_inode calls
device_cgroup: Remove explicit devcgroup_inode hooks
lsm: Add security_dev_permission() hook
device_cgroup: Implement dev_permission() hook
block: Switch from devcgroup_check_permission to security hook
drm/amdkfd: Switch from devcgroup_check_permission to security hook
device_cgroup: Hide devcgroup functionality completely in lsm
lsm: Add security_inode_mknod_nscap() hook
lsm: Add security_sb_alloc_userns() hook
vfs: Wire up security hooks for lsm-based device guard in userns
bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
device_cgroup: Allow mknod in non-initial userns if guarded

block/bdev.c | 9 +-
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 +-
fs/namei.c | 24 ++--
fs/super.c | 6 +-
include/linux/bpf-cgroup.h | 2 +
include/linux/device_cgroup.h | 67 -----------
include/linux/lsm_hook_defs.h | 4 +
include/linux/security.h | 18 +++
include/uapi/linux/bpf.h | 1 +
init/Kconfig | 4 +
kernel/bpf/cgroup.c | 14 +++
security/Kconfig | 1 +
security/Makefile | 2 +-
security/device_cgroup/Kconfig | 7 ++
security/device_cgroup/Makefile | 4 +
security/{ => device_cgroup}/device_cgroup.c | 3 +-
security/device_cgroup/device_cgroup.h | 20 ++++
security/device_cgroup/lsm.c | 114 +++++++++++++++++++
security/security.c | 75 ++++++++++++
19 files changed, 294 insertions(+), 88 deletions(-)
delete mode 100644 include/linux/device_cgroup.h
create mode 100644 security/device_cgroup/Kconfig
create mode 100644 security/device_cgroup/Makefile
rename security/{ => device_cgroup}/device_cgroup.c (99%)
create mode 100644 security/device_cgroup/device_cgroup.h
create mode 100644 security/device_cgroup/lsm.c

base-commit: 58720809f52779dc0f08e53e54b014209d13eebb
--
2.30.2

2023-10-25 09:45:54

by Michael Weiß

[permalink] [raw]

Subject: [RESEND RFC PATCH v2 14/14] device_cgroup: Allow mknod in non-initial userns if guarded

If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is not necessary to deny mknod()
anymore. Thus, user space applications may map devices on different
locations in the file system by using mknod() inside the container.

A use case for this, we also use in GyroidOS, is to run virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre populated device nodes under /dev are
not sufficient.

To circumvent this limitation, allow mknod() by checking CAP_MKNOD
in the userns by implementing the security_inode_mknod_nscap(). The
hook implementation checks if the corresponding permission flag
BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
To avoid to create unusable inodes in user space the hook also checks
SB_I_NODEV on the corresponding super block.

Further, the security_sb_alloc_userns() hook is implemented using
cgroup_bpf_current_enabled() to allow usage of device nodes on super
blocks mounted by a guarded task.

Signed-off-by: Michael Weiß <[email protected]>
---
security/device_cgroup/lsm.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/security/device_cgroup/lsm.c b/security/device_cgroup/lsm.c
index a963536d0a15..6bc984d9c9d1 100644
--- a/security/device_cgroup/lsm.c
+++ b/security/device_cgroup/lsm.c
@@ -66,10 +66,37 @@ static int devcg_inode_mknod(struct inode *dir, struct dentry *dentry,
return __devcg_inode_mknod(mode, dev, DEVCG_ACC_MKNOD);
}

+#ifdef CONFIG_CGROUP_BPF
+static int devcg_sb_alloc_userns(struct super_block *sb)
+{
+ if (cgroup_bpf_current_enabled(CGROUP_DEVICE))
+ return 0;
+
+ return -EPERM;
+}
+
+static int devcg_inode_mknod_nscap(struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t dev)
+{
+ if (!cgroup_bpf_current_enabled(CGROUP_DEVICE))
+ return -EPERM;
+
+ // avoid to create unusable inodes in user space
+ if (dentry->d_sb->s_iflags & SB_I_NODEV)
+ return -EPERM;
+
+ return __devcg_inode_mknod(mode, dev, BPF_DEVCG_ACC_MKNOD_UNS);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
static struct security_hook_list devcg_hooks[] __ro_after_init = {
LSM_HOOK_INIT(inode_permission, devcg_inode_permission),
LSM_HOOK_INIT(inode_mknod, devcg_inode_mknod),
LSM_HOOK_INIT(dev_permission, devcg_dev_permission),
+#ifdef CONFIG_CGROUP_BPF
+ LSM_HOOK_INIT(sb_alloc_userns, devcg_sb_alloc_userns),
+ LSM_HOOK_INIT(inode_mknod_nscap, devcg_inode_mknod_nscap),
+#endif
};

static int __init devcgroup_init(void)
--
2.30.2

2023-10-25 09:45:54

by Michael Weiß

[permalink] [raw]

Subject: [RESEND RFC PATCH v2 13/14] bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()

This helper can be used to check if a cgroup-bpf specific program is
active for the current task.

Signed-off-by: Michael Weiß <[email protected]>
Reviewed-by: Alexander Mikhalitsyn <[email protected]>
---
include/linux/bpf-cgroup.h | 2 ++
kernel/bpf/cgroup.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 8506690dbb9c..655697c2a620 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -184,6 +184,8 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
return array != &bpf_empty_prog_array.hdr;
}

+bool cgroup_bpf_current_enabled(enum cgroup_bpf_attach_type type);
+
/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb) \
({ \
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 03b3d4492980..19ae3d037db7 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -24,6 +24,20 @@
DEFINE_STATIC_KEY_ARRAY_FALSE(cgroup_bpf_enabled_key, MAX_CGROUP_BPF_ATTACH_TYPE);
EXPORT_SYMBOL(cgroup_bpf_enabled_key);

+bool cgroup_bpf_current_enabled(enum cgroup_bpf_attach_type type)
+{
+ struct cgroup *cgrp;
+ struct bpf_prog_array *array;
+
+ rcu_read_lock();
+ cgrp = task_dfl_cgroup(current);
+ rcu_read_unlock();
+
+ array = rcu_access_pointer(cgrp->bpf.effective[type]);
+ return array != &bpf_empty_prog_array.hdr;
+}
+EXPORT_SYMBOL(cgroup_bpf_current_enabled);
+
/* __always_inline is necessary to prevent indirect call through run_prog
* function pointer.
*/
--
2.30.2

2023-10-25 09:46:21

by Michael Weiß

[permalink] [raw]

Subject: [RESEND RFC PATCH v2 12/14] bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access

With this new flag for bpf cgroup device programs, it should be
possible to guard mknod() access in non-initial user namespaces
later on.

Signed-off-by: Michael Weiß <[email protected]>
---
include/uapi/linux/bpf.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0448700890f7..0196b9c72d3e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6927,6 +6927,7 @@ enum {
BPF_DEVCG_ACC_MKNOD = (1ULL << 0),
BPF_DEVCG_ACC_READ = (1ULL << 1),
BPF_DEVCG_ACC_WRITE = (1ULL << 2),
+ BPF_DEVCG_ACC_MKNOD_UNS = (1ULL << 3),
};

enum {
--
2.30.2

2023-10-25 09:46:22

by Michael Weiß

[permalink] [raw]

Subject: [RESEND RFC PATCH v2 10/14] lsm: Add security_sb_alloc_userns() hook

Provide a new lsm hook which may be used to allow access to device
nodes for super blocks created in unprivileged namespaces if some
sort of device guard to control access is implemented.

By default this will return -EPERM if no lsm implements the hook.
A first lsm to use this will be the lately converted cgroup_device
module.

Signed-off-by: Michael Weiß <[email protected]>
---
include/linux/lsm_hook_defs.h | 1 +
include/linux/security.h | 5 +++++
security/security.c | 26 ++++++++++++++++++++++++++
3 files changed, 32 insertions(+)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index f4fa01182910..0f734a0a5ebc 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -278,6 +278,7 @@ LSM_HOOK(int, 0, inode_getsecctx, struct inode *inode, void **ctx,
LSM_HOOK(int, 0, dev_permission, umode_t mode, dev_t dev, int mask)
LSM_HOOK(int, -EPERM, inode_mknod_nscap, struct inode *dir, struct dentry *dentry,
umode_t mode, dev_t dev)
+LSM_HOOK(int, -EPERM, sb_alloc_userns, struct super_block *sb)

#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
LSM_HOOK(int, 0, post_notification, const struct cred *w_cred,
diff --git a/include/linux/security.h b/include/linux/security.h
index bad6992877f4..0f66be1ed1ed 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -487,6 +487,7 @@ int security_locked_down(enum lockdown_reason what);
int security_dev_permission(umode_t mode, dev_t dev, int mask);
int security_inode_mknod_nscap(struct inode *dir, struct dentry *dentry,
umode_t mode, dev_t dev);
+int security_sb_alloc_userns(struct super_block *sb);
#else /* CONFIG_SECURITY */

static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1408,6 +1409,10 @@ static inline int security_inode_mknod_nscap(struct inode *dir,
{
return -EPERM;
}
+static inline int security_sb_alloc_userns(struct super_block *sb)
+{
+ return -EPERM;
+}
#endif /* CONFIG_SECURITY */

#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/security/security.c b/security/security.c
index 7708374b6d7e..9d5d4ec28e62 100644
--- a/security/security.c
+++ b/security/security.c
@@ -4065,6 +4065,32 @@ int security_inode_mknod_nscap(struct inode *dir, struct dentry *dentry,
}
EXPORT_SYMBOL(security_inode_mknod_nscap);

+/**
+ * security_sb_alloc_userns() - Grand access to device nodes on sb in userns
+ *
+ * If device access is provided elsewere, this hook will grand access to device nodes
+ * on the allocated sb for unprivileged user namespaces.
+ *
+ * Return: Returns 0 on success, error on failure.
+ */
+int security_sb_alloc_userns(struct super_block *sb)
+{
+ int thisrc;
+ int rc = LSM_RET_DEFAULT(sb_alloc_userns);
+ struct security_hook_list *hp;
+
+ hlist_for_each_entry(hp, &security_hook_heads.sb_alloc_userns, list) {
+ thisrc = hp->hook.sb_alloc_userns(sb);
+ if (thisrc != LSM_RET_DEFAULT(sb_alloc_userns)) {
+ rc = thisrc;
+ if (thisrc != 0)
+ break;
+ }
+ }
+ return rc;
+}
+EXPORT_SYMBOL(security_sb_alloc_userns);
+
#ifdef CONFIG_WATCH_QUEUE
/**
* security_post_notification() - Check if a watch notification can be posted
--
2.30.2

2023-10-25 09:46:26

by Michael Weiß

[permalink] [raw]

Subject: [RESEND RFC PATCH v2 09/14] lsm: Add security_inode_mknod_nscap() hook

Provide a new lsm hook which may be used to allow mknod in
non-initial userns. If access to the device is guarded by this
hook, access to mknod may be granted by checking cap mknod for
unprivileged user namespaces.

By default this will return -EPERM if no lsm implements the
hook. A first lsm to use this will be the lately converted
cgroup_device module.

Signed-off-by: Michael Weiß <[email protected]>
---
include/linux/lsm_hook_defs.h | 2 ++
include/linux/security.h | 8 ++++++++
security/security.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 41 insertions(+)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index a868982725a9..f4fa01182910 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -276,6 +276,8 @@ LSM_HOOK(int, 0, inode_setsecctx, struct dentry *dentry, void *ctx, u32 ctxlen)
LSM_HOOK(int, 0, inode_getsecctx, struct inode *inode, void **ctx,
u32 *ctxlen)
LSM_HOOK(int, 0, dev_permission, umode_t mode, dev_t dev, int mask)
+LSM_HOOK(int, -EPERM, inode_mknod_nscap, struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t dev)

#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
LSM_HOOK(int, 0, post_notification, const struct cred *w_cred,
diff --git a/include/linux/security.h b/include/linux/security.h
index 8bc6ac8816c6..bad6992877f4 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -485,6 +485,8 @@ int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
int security_locked_down(enum lockdown_reason what);
int security_dev_permission(umode_t mode, dev_t dev, int mask);
+int security_inode_mknod_nscap(struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t dev);
#else /* CONFIG_SECURITY */

static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1400,6 +1402,12 @@ static inline int security_dev_permission(umode_t mode, dev_t dev, int mask)
{
return 0;
}
+static inline int security_inode_mknod_nscap(struct inode *dir,
+ struct dentry *dentry,
+ umode_t mode, dev_t dev);
+{
+ return -EPERM;
+}
#endif /* CONFIG_SECURITY */

#if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE)
diff --git a/security/security.c b/security/security.c
index 40f6787df3b1..7708374b6d7e 100644
--- a/security/security.c
+++ b/security/security.c
@@ -4034,6 +4034,37 @@ int security_dev_permission(umode_t mode, dev_t dev, int mask)
}
EXPORT_SYMBOL(security_dev_permission);

+/**
+ * security_inode_mknod_nscap() - Check if device is guarded
+ * @dir: parent directory
+ * @dentry: new file
+ * @mode: new file mode
+ * @dev: device number
+ *
+ * If access to the device is guarded by this hook, access to mknod may be granted by
+ * checking cap mknod for unprivileged user namespaces.
+ *
+ * Return: Returns 0 on success, error on failure.
+ */
+int security_inode_mknod_nscap(struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t dev)
+{
+ int thisrc;
+ int rc = LSM_RET_DEFAULT(inode_mknod_nscap);
+ struct security_hook_list *hp;
+
+ hlist_for_each_entry(hp, &security_hook_heads.inode_mknod_nscap, list) {
+ thisrc = hp->hook.inode_mknod_nscap(dir, dentry, mode, dev);
+ if (thisrc != LSM_RET_DEFAULT(inode_mknod_nscap)) {
+ rc = thisrc;
+ if (thisrc != 0)
+ break;
+ }
+ }
+ return rc;
+}
+EXPORT_SYMBOL(security_inode_mknod_nscap);
+
#ifdef CONFIG_WATCH_QUEUE
/**
* security_post_notification() - Check if a watch notification can be posted
--
2.30.2

2023-10-25 09:46:26

by Michael Weiß

[permalink] [raw]

Subject: [RESEND RFC PATCH v2 11/14] vfs: Wire up security hooks for lsm-based device guard in userns

Wire up security_inode_mknod_capns() in fs/namei.c. If implemented
and access is granted by an lsm, check ns_capable() instead of the
global CAP_MKNOD.

Wire up security_sb_alloc_userns() in fs/super.c. If implemented
and access is granted by an lsm, the created super block will allow
access to device nodes also if it was created in a non-inital userns.

Signed-off-by: Michael Weiß <[email protected]>
---
fs/namei.c | 16 +++++++++++++++-
fs/super.c | 6 +++++-
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f601fcbdc4d2..1f68d160e2c0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3949,6 +3949,20 @@ inline struct dentry *user_path_create(int dfd, const char __user *pathname,
}
EXPORT_SYMBOL(user_path_create);

+static bool mknod_capable(struct inode *dir, struct dentry *dentry,
+ umode_t mode, dev_t dev)
+{
+ /*
+ * In case of a security hook implementation check mknod in user
+ * namespace. Otherwise just check global capability.
+ */
+ int error = security_inode_mknod_nscap(dir, dentry, mode, dev);
+ if (!error)
+ return ns_capable(current_user_ns(), CAP_MKNOD);
+ else
+ return capable(CAP_MKNOD);
+}
+
/**
* vfs_mknod - create device node or file
* @idmap: idmap of the mount the inode was found from
@@ -3975,7 +3989,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
return error;

if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
- !capable(CAP_MKNOD))
+ !mknod_capable(dir, dentry, mode, dev))
return -EPERM;

if (!dir->i_op->mknod)
diff --git a/fs/super.c b/fs/super.c
index 2d762ce67f6e..bb01db6d9986 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -362,7 +362,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
}
s->s_bdi = &noop_backing_dev_info;
s->s_flags = flags;
- if (s->s_user_ns != &init_user_ns)
+ /*
+ * We still have to think about this here. Several concerns exist
+ * about the security model, especially about malicious fuse.
+ */
+ if (s->s_user_ns != &init_user_ns && security_sb_alloc_userns(s))
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_roots);
--
2.30.2

2023-10-25 13:18:07

by Paul Moore

[permalink] [raw]

Subject: Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

On Wed, Oct 25, 2023 at 5:42 AM Michael Weiß
<[email protected]> wrote:
>
> Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
> BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod
> in non-initial user namespaces.
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod()
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, we also use in GyroidOS, is to run virsh for
> VMs inside an unprivileged container. virsh creates device nodes,
> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
> in a non-initial userns, even if a cgroup device white list with the
> corresponding major, minor of /dev/null exists. Thus, in this case
> the usual bind mounts or pre populated device nodes under /dev are
> not sufficient.
>
> To circumvent this limitation, allow mknod() by checking CAP_MKNOD
> in the userns by implementing the security_inode_mknod_nscap(). The
> hook implementation checks if the corresponding permission flag
> BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
> To avoid to create unusable inodes in user space the hook also
> checks SB_I_NODEV on the corresponding super block.
>
> Further, the security_sb_alloc_userns() hook is implemented using
> cgroup_bpf_current_enabled() to allow usage of device nodes on super
> blocks mounted by a guarded task.
>
> Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM
>
> Patch 4 to 8 rework explicit calls to devcgroup_check_permission
> also as LSM hooks and finalize the conversion of the device_cgroup
> subsystem to a LSM.
>
> Patch 9 and 10 introduce new generic security hooks to be used
> for the actual mknod device guard implementation.
>
> Patch 11 wires up the security hooks in the vfs
>
> Patch 12 and 13 provide helper functions in the bpf cgroup
> subsystem.
>
> Patch 14 finally implement the LSM hooks to grand access
>
> Signed-off-by: Michael Weiß <[email protected]>
> ---
> Changes in v2:
> - Integrate this as LSM (Christian, Paul)
> - Switched to a device cgroup specific flag instead of a generic
> bpf program flag (Christian)
> - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
> sb_alloc_super in fs/super.c
> - Link to v1: https://lore.kernel.org/r/[email protected]
>
> Michael Weiß (14):
> device_cgroup: Implement devcgroup hooks as lsm security hooks
> vfs: Remove explicit devcgroup_inode calls
> device_cgroup: Remove explicit devcgroup_inode hooks
> lsm: Add security_dev_permission() hook
> device_cgroup: Implement dev_permission() hook
> block: Switch from devcgroup_check_permission to security hook
> drm/amdkfd: Switch from devcgroup_check_permission to security hook
> device_cgroup: Hide devcgroup functionality completely in lsm
> lsm: Add security_inode_mknod_nscap() hook
> lsm: Add security_sb_alloc_userns() hook
> vfs: Wire up security hooks for lsm-based device guard in userns
> bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
> bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
> device_cgroup: Allow mknod in non-initial userns if guarded
>
> block/bdev.c | 9 +-
> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 +-
> fs/namei.c | 24 ++--
> fs/super.c | 6 +-
> include/linux/bpf-cgroup.h | 2 +
> include/linux/device_cgroup.h | 67 -----------
> include/linux/lsm_hook_defs.h | 4 +
> include/linux/security.h | 18 +++
> include/uapi/linux/bpf.h | 1 +
> init/Kconfig | 4 +
> kernel/bpf/cgroup.c | 14 +++
> security/Kconfig | 1 +
> security/Makefile | 2 +-
> security/device_cgroup/Kconfig | 7 ++
> security/device_cgroup/Makefile | 4 +
> security/{ => device_cgroup}/device_cgroup.c | 3 +-
> security/device_cgroup/device_cgroup.h | 20 ++++
> security/device_cgroup/lsm.c | 114 +++++++++++++++++++
> security/security.c | 75 ++++++++++++
> 19 files changed, 294 insertions(+), 88 deletions(-)
> delete mode 100644 include/linux/device_cgroup.h
> create mode 100644 security/device_cgroup/Kconfig
> create mode 100644 security/device_cgroup/Makefile
> rename security/{ => device_cgroup}/device_cgroup.c (99%)
> create mode 100644 security/device_cgroup/device_cgroup.h
> create mode 100644 security/device_cgroup/lsm.c

Hi Michael,

I think this was lost because it wasn't CC'd to the LSM list (see
below). I've CC'd the list on my reply, but future patch submissions
that involve the LSM must be posted to the LSM list if you would like
them to be considered.

http://vger.kernel.org/vger-lists.html#linux-security-module

--
paul-moore.com

2023-10-25 18:12:23

by Michael Weiß

[permalink] [raw]

Subject: Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

On 25.10.23 15:17, Paul Moore wrote:
> On Wed, Oct 25, 2023 at 5:42 AM Michael Weiß
> <[email protected]> wrote:
>>
>> Introduce the flag BPF_DEVCG_ACC_MKNOD_UNS for bpf programs of type
>> BPF_PROG_TYPE_CGROUP_DEVICE which allows to guard access to mknod
>> in non-initial user namespaces.
>>
>> If a container manager restricts its unprivileged (user namespaced)
>> children by a device cgroup, it is not necessary to deny mknod()
>> anymore. Thus, user space applications may map devices on different
>> locations in the file system by using mknod() inside the container.
>>
>> A use case for this, we also use in GyroidOS, is to run virsh for
>> VMs inside an unprivileged container. virsh creates device nodes,
>> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
>> in a non-initial userns, even if a cgroup device white list with the
>> corresponding major, minor of /dev/null exists. Thus, in this case
>> the usual bind mounts or pre populated device nodes under /dev are
>> not sufficient.
>>
>> To circumvent this limitation, allow mknod() by checking CAP_MKNOD
>> in the userns by implementing the security_inode_mknod_nscap(). The
>> hook implementation checks if the corresponding permission flag
>> BPF_DEVCG_ACC_MKNOD_UNS is set for the device in the bpf program.
>> To avoid to create unusable inodes in user space the hook also
>> checks SB_I_NODEV on the corresponding super block.
>>
>> Further, the security_sb_alloc_userns() hook is implemented using
>> cgroup_bpf_current_enabled() to allow usage of device nodes on super
>> blocks mounted by a guarded task.
>>
>> Patch 1 to 3 rework the current devcgroup_inode hooks as an LSM
>>
>> Patch 4 to 8 rework explicit calls to devcgroup_check_permission
>> also as LSM hooks and finalize the conversion of the device_cgroup
>> subsystem to a LSM.
>>
>> Patch 9 and 10 introduce new generic security hooks to be used
>> for the actual mknod device guard implementation.
>>
>> Patch 11 wires up the security hooks in the vfs
>>
>> Patch 12 and 13 provide helper functions in the bpf cgroup
>> subsystem.
>>
>> Patch 14 finally implement the LSM hooks to grand access
>>
>> Signed-off-by: Michael Weiß <[email protected]>
>> ---
>> Changes in v2:
>> - Integrate this as LSM (Christian, Paul)
>> - Switched to a device cgroup specific flag instead of a generic
>> bpf program flag (Christian)
>> - do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
>> sb_alloc_super in fs/super.c
>> - Link to v1: https://lore.kernel.org/r/[email protected]
>>
>> Michael Weiß (14):
>> device_cgroup: Implement devcgroup hooks as lsm security hooks
>> vfs: Remove explicit devcgroup_inode calls
>> device_cgroup: Remove explicit devcgroup_inode hooks
>> lsm: Add security_dev_permission() hook
>> device_cgroup: Implement dev_permission() hook
>> block: Switch from devcgroup_check_permission to security hook
>> drm/amdkfd: Switch from devcgroup_check_permission to security hook
>> device_cgroup: Hide devcgroup functionality completely in lsm
>> lsm: Add security_inode_mknod_nscap() hook
>> lsm: Add security_sb_alloc_userns() hook
>> vfs: Wire up security hooks for lsm-based device guard in userns
>> bpf: Add flag BPF_DEVCG_ACC_MKNOD_UNS for device access
>> bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
>> device_cgroup: Allow mknod in non-initial userns if guarded
>>
>> block/bdev.c | 9 +-
>> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 7 +-
>> fs/namei.c | 24 ++--
>> fs/super.c | 6 +-
>> include/linux/bpf-cgroup.h | 2 +
>> include/linux/device_cgroup.h | 67 -----------
>> include/linux/lsm_hook_defs.h | 4 +
>> include/linux/security.h | 18 +++
>> include/uapi/linux/bpf.h | 1 +
>> init/Kconfig | 4 +
>> kernel/bpf/cgroup.c | 14 +++
>> security/Kconfig | 1 +
>> security/Makefile | 2 +-
>> security/device_cgroup/Kconfig | 7 ++
>> security/device_cgroup/Makefile | 4 +
>> security/{ => device_cgroup}/device_cgroup.c | 3 +-
>> security/device_cgroup/device_cgroup.h | 20 ++++
>> security/device_cgroup/lsm.c | 114 +++++++++++++++++++
>> security/security.c | 75 ++++++++++++
>> 19 files changed, 294 insertions(+), 88 deletions(-)
>> delete mode 100644 include/linux/device_cgroup.h
>> create mode 100644 security/device_cgroup/Kconfig
>> create mode 100644 security/device_cgroup/Makefile
>> rename security/{ => device_cgroup}/device_cgroup.c (99%)
>> create mode 100644 security/device_cgroup/device_cgroup.h
>> create mode 100644 security/device_cgroup/lsm.c
>
> Hi Michael,
>
> I think this was lost because it wasn't CC'd to the LSM list (see
> below). I've CC'd the list on my reply, but future patch submissions
> that involve the LSM must be posted to the LSM list if you would like
> them to be considered.
>
> http://vger.kernel.org/vger-lists.html#linux-security-module
>
Hi Paul,

thanks, I'll keep this in mind for the next submissions.

I have also resend because, I realized that some spam filters my
have swallowed the last submission as I used my private smtp server
from another domain in the gitconfig. Sorry for that. I hope now
every one received it.

Thanks,
Michael

2023-11-24 16:48:12

by Christian Brauner

[permalink] [raw]

Subject: Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

> - Integrate this as LSM (Christian, Paul)

Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
thing...

So I dislike this less than the initial version that just worked around
SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:

(1) current device access management
I summarized the current places where that's done very very briefly in
https://lore.kernel.org/all/20230815-feigling-kopfsache-56c2d31275bd@brauner

* inode_permission()
-> devcgroup_inode_permission()

* vfs_mknod()
-> devcgroup_inode_mknod()

* blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
-> devcgroup_check_permission()

* drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
-> devcgroup_check_permission()

but that's not enough. What we need is a summary of how device node
creation and device node opening currently interact.

Because it is subtle. Currently you cannot even create device nodes
without capable(CAP_SYS_ADMIN) and you can't open any existing ones
if you lack capable(CAP_SYS_ADMIN).

Years ago we tried that insane model where we enabled userspace to
create device nodes but not open them. IOW, the capability check
for device node creation was lifted but the SB_I_NODEV limitation
wasn't lifted. That broke the whole world and had to be reverted.

(2) LSM device access management

I really want to be able to see how you envision the permission
checking to work in the new model. Specifically:

* How do device node creation and device node opening interact.
* The consequences of allowing to remove the SB_I_NODEV restriction.
* Permission checking for users without and without a bpf guarded
profile.

If you write this down we'll add it to documentation as well or to the
commit messages. It won't be lost. It doesn't have to be some really
long thing. I just want to better understand what you think this is
going to do and what the code does.

2023-11-24 17:06:58

by Christian Brauner

[permalink] [raw]

Subject: Re: [RESEND RFC PATCH v2 11/14] vfs: Wire up security hooks for lsm-based device guard in userns

On Wed, Oct 25, 2023 at 11:42:21AM +0200, Michael Weiß wrote:
> Wire up security_inode_mknod_capns() in fs/namei.c. If implemented
> and access is granted by an lsm, check ns_capable() instead of the
> global CAP_MKNOD.
>
> Wire up security_sb_alloc_userns() in fs/super.c. If implemented
> and access is granted by an lsm, the created super block will allow
> access to device nodes also if it was created in a non-inital userns.
>
> Signed-off-by: Michael Weiß <[email protected]>
> ---
> fs/namei.c | 16 +++++++++++++++-
> fs/super.c | 6 +++++-
> 2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index f601fcbdc4d2..1f68d160e2c0 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3949,6 +3949,20 @@ inline struct dentry *user_path_create(int dfd, const char __user *pathname,
> }
> EXPORT_SYMBOL(user_path_create);
>
> +static bool mknod_capable(struct inode *dir, struct dentry *dentry,
> + umode_t mode, dev_t dev)
> +{
> + /*
> + * In case of a security hook implementation check mknod in user
> + * namespace. Otherwise just check global capability.
> + */
> + int error = security_inode_mknod_nscap(dir, dentry, mode, dev);
> + if (!error)
> + return ns_capable(current_user_ns(), CAP_MKNOD);
> + else
> + return capable(CAP_MKNOD);
> +}
> +
> /**
> * vfs_mknod - create device node or file
> * @idmap: idmap of the mount the inode was found from
> @@ -3975,7 +3989,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
> return error;
>
> if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> - !capable(CAP_MKNOD))
> + !mknod_capable(dir, dentry, mode, dev))
> return -EPERM;
>
> if (!dir->i_op->mknod)
> diff --git a/fs/super.c b/fs/super.c
> index 2d762ce67f6e..bb01db6d9986 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -362,7 +362,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> }
> s->s_bdi = &noop_backing_dev_info;
> s->s_flags = flags;
> - if (s->s_user_ns != &init_user_ns)
> + /*
> + * We still have to think about this here. Several concerns exist
> + * about the security model, especially about malicious fuse.
> + */
> + if (s->s_user_ns != &init_user_ns && security_sb_alloc_userns(s))
> s->s_iflags |= SB_I_NODEV;

Hm, no.

We dont want to have security hooks called in alloc_super(). That's just
the wrong layer for this. This is deeply internal stuff where we should
avoid interfacing with other subsystems.

Removing SB_I_NODEV here is also problematic or at least overly broad
because you allow to circumvent this for _every_ filesystems including
stuff like proc and so on where that doesn't make any sense.

2023-11-24 18:09:11

by Christian Brauner

[permalink] [raw]

Subject: Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
> > - Integrate this as LSM (Christian, Paul)
>
> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
> thing...
>
> So I dislike this less than the initial version that just worked around

Hm, I wonder if we're being to timid or too complex in how we want to
solve this problem.

The device cgroup management logic is hacked into multiple layers and is
frankly pretty appalling.

What I think device access management wants to look like is that you can
implement a policy in an LSM - be it bpf or regular selinux - and have
this guarded by the main hooks:

security_file_open()
security_inode_mknod()

So, look at:

vfs_get_tree()
-> security_sb_set_mnt_opts()
-> bpf_sb_set_mnt_opts()

A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
today via bpf_sb_set_mnt_opts() without any kernel changes at all.

I assume that a bpf LSM can also keep state in sb->s_security just like
selinux et al? If so then a device access management program or whatever
can be stored in sb->s_security.

That device access management program would then be run on each call to:

security_file_open()
-> bpf_file_open()

and

security_inode_mknod()
-> bpf_sb_set_mnt_opts()

and take access decisions.

This obviously makes device access management something that's tied
completely to a filesystem. So, you could have the same device node on
two tmpfs filesystems both mounted in the same userns.

The first tmpfs has SB_I_NODEV and doesn't allow you to open that
device. The second tmpfs has a bpf LSM program attached to it that has
stripped SB_I_NODEV and manages device access and allows callers to open
that device.

I guess it's even possible to restrict this on a caller basis by marking
them with a "container id" when the container is started. That can be
done with that task storage thing also via a bpf LSM hook. And then
you can further restrict device access to only those tasks that have a
specific container id in some range or some token or something.

I might just be fantasizing abilities into bpf that it doesn't have so
anyone with the knowledge please speak up.

If this is feasible then the only thing we need to figure out is what to
do with the legacy cgroup access management and specifically the
capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

So, we could introduce a sysctl that makes it possible to turn this
check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
SB_I_NODEV it is inherently safe to do that. It's just that a lot of
container runtimes need to have time to adapt to a world where you may
be able to create a device but not be able to then open it. This isn't
rocket science but it will take time.

But in the end this will mean we get away with minimal kernel changes
and using a lot of existing infrastructure.

Thoughts?

2023-11-28 20:55:06

by Michael Weiß

[permalink] [raw]

Subject: Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

On 24.11.23 19:08, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
>>> - Integrate this as LSM (Christian, Paul)
>>
>> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
>> thing...
>>
>> So I dislike this less than the initial version that just worked around
>> SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:

Since we have some new Ideas, I also will try to better describe
the vision and current device node interaction when submitting the next
version.

>
> Hm, I wonder if we're being to timid or too complex in how we want to
> solve this problem.
>
> The device cgroup management logic is hacked into multiple layers and is
> frankly pretty appalling.
>
> What I think device access management wants to look like is that you can
> implement a policy in an LSM - be it bpf or regular selinux - and have
> this guarded by the main hooks:
>
> security_file_open()
> security_inode_mknod()
>
> So, look at:
>
> vfs_get_tree()
> -> security_sb_set_mnt_opts()
> -> bpf_sb_set_mnt_opts()
>
> A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
> today via bpf_sb_set_mnt_opts() without any kernel changes at all.
> > I assume that a bpf LSM can also keep state in sb->s_security just like
> selinux et al? If so then a device access management program or whatever
> can be stored in sb->s_security.
>
> That device access management program would then be run on each call to:
>
> security_file_open()
> -> bpf_file_open()
>
> and
>
> security_inode_mknod()
> -> bpf_sb_set_mnt_opts()
>
> and take access security_sb_set_mnt_optsdecisions.
>
> This obviously makes device access management something that's tied
> completely to a filesystem. So, you could have the same device node on
> two tmpfs filesystems both mounted in the same userns.
>
> The first tmpfs has SB_I_NODEV and doesn't allow you to open that
> device. The second tmpfs has a bpf LSM program attached to it that has
> stripped SB_I_NODEV and manages device access and allows callers to open
> that device.

I like the approach to clear SB_I_NODEV in security_sb_set_mnt_opts().
I still have to sort this out but I think that was the missing piece in
the current patch set.

>
> I guess it's even possible to restrict this on a caller basis by marking
> them with a "container id" when the container is started. That can be
> done with that task storage thing also via a bpf LSM hook. And then
> you can further restrict device access to only those tasks that have a
> specific container id in some range or some token or something.
>
> I might just be fantasizing abilities into bpf that it doesn't have so
> anyone with the knowledge please speak up.
>
> If this is feasible then the only thing we need to figure out is what to
> do with the legacy cgroup access management and specifically the
> capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

First to make this clear, we speak about CAP_SYS_MKNOD.

My approach was to restructure the cgroup_device in an own cgroup_device
lsm not in the current bpf lsm, to also be able to handle the legacy calls.
Especially, the remaining direct calls to devcgroup_check_permission() are
transformed to a new security_hook security_dev_permission() which is
similar to security_inode_permission() but could be called in such places
where only the dev_t representation is available. However, if we
implement it that way you sketched up above, we can just leave the
devcgroup_check_permission() in its current place and only implement
the devcgroup_inode_permission() and devcgroup_mknode and let the blk
and amd/gpu drivers as is for now, or just leave all the devcgroup_*()
hooks as is.

>
> So, we could introduce a sysctl that makes it possible to turn this
> check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
> SB_I_NODEV it is inherently safe to do that. It's just that a lot of
> container runtimes need to have time to adapt to a world where you may
> be able to create a device but not be able to then open it. This isn't
> rocket science but it will take time.

True. I think a sysctl would be a good option.

>
> But in the end this will mean we get away with minimal kernel changes
> and using a lot of existing infrastructure.
>
> Thoughts?

For the whole bpf lsm part I'm not confident enough to make any
proposition, yet.
But I think an own simple devnode lsm would be feasible with the above
described security_sb_set_mnt_opts() handling to get this idea realized.
Maybe we go that way to implement a simple lsm without any changes to
the device_cgroup and keep the devcgroup hooks in place. To implement
it as bpf lsm with all full blown conatiner_id could then be done
orthogonally.
So from a simple container runtime perspective one could just use the
simple lsm and the existing bpf device cgroup program without any change
only having to activate the sysctl. A more complex cloud setup Kubernetes
what so ever, could then use bpf lsm approach.