If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is no longer necessary to deny
mknod(). User space applications may then map devices at different
locations in the file system by using mknod() inside the container.
A use case for this, which we also rely on in GyroidOS, is running
virsh for VMs inside an unprivileged container. virsh creates device
nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which
currently fails in a non-initial userns, even if a device cgroup
whitelist with the corresponding major and minor numbers of
/dev/null exists. In this case, the usual bind mounts or
pre-populated device nodes under /dev are not sufficient.
Following the discussion with Christian on v2, I agree that the
previous approach was too complex. Actually, we just want working
device nodes in a user namespace if we have a device cgroup in place
which handles access decisions.
Patch 1 provides a helper function to check if the current task
is guarded by a bpf-device cgroup program.
Thanks to Alexander Mikhalitsyn for reviewing.
Patch 2 implements the ns_capable check including the sysctl as
proposed by Christian. I provide a short overview of device node
creation and access decisions in the commit message there.
Patch 3 provides devguard, a small LSM which actually strips out
SB_I_NODEV.
---
Changes in v3:
- Small LSM to just implement security_inode_mknod() hook
- Leave devcgroup as is
- Strip SB_I_NODEV in security_inode_mknod hook as suggested by
Christian
- Do not change bpf or cgroup access decision at all
- ns_capable(sb->s_user_ns, CAP_MKNOD) in vfs_mknod()
- Link to v2: https://lore.kernel.org/lkml/[email protected]/
Changes in v2:
- Integrate this as LSM (Christian, Paul)
- Switched to a device cgroup specific flag instead of a generic
bpf program flag (Christian)
- Do not ignore SB_I_NODEV in fs/namei.c but use LSM hook in
sb_alloc_super in fs/super.c
- Link to v1: https://lore.kernel.org/lkml/[email protected]

Michael Weiß (3):
  bpf: cgroup: Introduce helper cgroup_bpf_current_enabled()
  fs: Make vfs_mknod() to check CAP_MKNOD in user namespace of sb
  devguard: added device guard for mknod in non-initial userns

 fs/namei.c                   | 30 +++++++++++++++++++++++-
 include/linux/bpf-cgroup.h   |  2 ++
 kernel/bpf/cgroup.c          | 14 ++++++++++++
 security/Kconfig             | 11 +++++----
 security/Makefile            |  1 +
 security/devguard/Kconfig    | 12 ++++++++++
 security/devguard/Makefile   |  2 ++
 security/devguard/devguard.c | 44 ++++++++++++++++++++++++++++++++++++
 8 files changed, 110 insertions(+), 6 deletions(-)
 create mode 100644 security/devguard/Kconfig
 create mode 100644 security/devguard/Makefile
 create mode 100644 security/devguard/devguard.c

base-commit: a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
--
2.30.2
Check CAP_MKNOD for the user namespace of the sb with ns_capable() in
fs/namei.c. This will allow LSM-based guarding of device node
creation in a non-initial user namespace by stripping out SB_I_NODEV
for mounts in its own namespace.
Currently, may_open_dev() blocks device access unconditionally if
SB_I_NODEV is set, and mounts inside unprivileged user namespaces get
SB_I_NODEV set in sb->s_iflags, causing open() to fail with -EACCES.
Device access by cgroups is mediated in the following places:
1) fs/namei.c:
     inode_permission() -> devcgroup_inode_permission
     vfs_mknod() -> devcgroup_inode_mknod
2) block/bdev.c:
     blkdev_get_by_dev() -> devcgroup_check_permission
3) drivers/gpu/drm/amd/amdkfd/kfd_priv.h:
     kfd_devcgroup_check_permission -> devcgroup_check_permission
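To illustrate the kind of decision these hooks delegate to the device
cgroup, the following user-space sketch models a policy that allows
only /dev/null. It is not kernel or BPF code: struct dev_request, the
DEV_* and ACC_* constants and allow_dev_null_only() are hypothetical
names, loosely modeled on the device type, access type, major and
minor that a device cgroup program gets to inspect.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants, not the kernel's definitions. */
enum { DEV_BLOCK = 1, DEV_CHAR = 2 };
enum { ACC_MKNOD = 1, ACC_READ = 2, ACC_WRITE = 4 };

/* Hypothetical model of one access request seen by the policy. */
struct dev_request {
	int type;	/* DEV_BLOCK or DEV_CHAR */
	int access;	/* bitmask of ACC_* */
	uint32_t major;
	uint32_t minor;
};

/* Allow mknod, read and write on character device 1:3 only. */
static bool allow_dev_null_only(const struct dev_request *req)
{
	if (req->type != DEV_CHAR)
		return false;
	if (req->major != 1 || req->minor != 3)
		return false;
	/* Reject any access bit outside the allowed set. */
	return (req->access & ~(ACC_MKNOD | ACC_READ | ACC_WRITE)) == 0;
}
```

In this model, a mknod request for 1:3 passes, while a request for any
block device or any other character device is denied.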
We leave all of this in place. However, an LSM can now implement the
security hook security_inode_mknod(), which is called directly after
devcgroup_inode_mknod() in vfs_mknod(), and remove SB_I_NODEV there.
This will let the call to may_open_dev() during open() succeed.
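The interplay between the hook and may_open_dev() can be pictured with
a small user-space model. This is a toy sketch, not the devguard code:
struct toy_sb, TOY_SB_I_NODEV and both functions are hypothetical
stand-ins, the flag value is arbitrary, and the condition that the
task is guarded by a device cgroup is an assumption about when such a
hook would strip the flag. The real may_open_dev() additionally checks
the MNT_NODEV mount flag, which this model ignores.

```c
#include <stdbool.h>

/* Arbitrary illustrative bit standing in for SB_I_NODEV. */
#define TOY_SB_I_NODEV 0x4UL

/* Hypothetical stand-in for the relevant part of a super block. */
struct toy_sb {
	unsigned long s_iflags;
};

/* Models only the super block flag test done when opening a device. */
static bool toy_may_open_dev(const struct toy_sb *sb)
{
	return !(sb->s_iflags & TOY_SB_I_NODEV);
}

/*
 * Models an LSM mknod hook: strip the flag only if the calling task
 * is guarded by a device cgroup (assumed condition). Returns 0 so
 * that the node creation proceeds in both cases.
 */
static int toy_inode_mknod_hook(struct toy_sb *sb, bool cgroup_guarded)
{
	if (cgroup_guarded)
		sb->s_iflags &= ~TOY_SB_I_NODEV;
	return 0;
}
```

In the model, a super block that starts with the flag set stays
unopenable after an unguarded mknod and becomes openable once a
guarded task has passed through the hook.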
Turning the check from capable(CAP_MKNOD) into
ns_capable(sb->s_user_ns, CAP_MKNOD) is inherently safe due to
SB_I_NODEV. However, this may allow the creation of device nodes
which then cannot be opened.
To give user space some time to adapt, we introduce a sysctl knob
which must be explicitly set to "1" to activate the use of
ns_capable(). Otherwise, we just check the global capability for the
current task as before.
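User space can check whether the knob is active before relying on the
new behavior. The sketch below assumes that the entry shows up as
/proc/sys/fs/nscap_mknod, next to the existing protected_* entries of
the same table; the helper name read_nscap_mknod() is made up for
illustration.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read an integer sysctl value from @path.
 * Returns the stored value, or -1 if the file cannot be opened or
 * does not contain an integer (e.g. on kernels without the knob).
 */
static int read_nscap_mknod(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

A container manager could call
read_nscap_mknod("/proc/sys/fs/nscap_mknod") and treat any result
other than 1 as "global CAP_MKNOD is still required".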
I tested this approach in a GyroidOS container using the small
devguard LSM of the follow-up commit.
Signed-off-by: Michael Weiß <[email protected]>
---
fs/namei.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 71c13b2990b4..cc61545e02ce 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1032,6 +1032,7 @@ static int sysctl_protected_symlinks __read_mostly;
static int sysctl_protected_hardlinks __read_mostly;
static int sysctl_protected_fifos __read_mostly;
static int sysctl_protected_regular __read_mostly;
+static int sysctl_nscap_mknod __read_mostly;
#ifdef CONFIG_SYSCTL
static struct ctl_table namei_sysctls[] = {
@@ -1071,6 +1072,15 @@ static struct ctl_table namei_sysctls[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
+ {
+ .procname = "nscap_mknod",
+ .data = &sysctl_nscap_mknod,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
{ }
};
@@ -3940,6 +3950,24 @@ inline struct dentry *user_path_create(int dfd, const char __user *pathname,
}
EXPORT_SYMBOL(user_path_create);
+/**
+ * sb_mknod_capable - check userns of sb for CAP_MKNOD
+ * @sb: super block to which userns CAP_MKNOD should be checked
+ *
+ * Check userns of sb for CAP_MKNOD
+ *
+ * Check CAP_MKNOD for owning user namespace of sb if corresponding sysctl is set.
+ * Otherwise just check global capability for current task. This allows
+ * lsm-based guarding of device node creation in non-initial user namespace.
+ */
+static bool sb_mknod_capable(struct super_block *sb)
+{
+ struct user_namespace *user_ns;
+
+ user_ns = sysctl_nscap_mknod ? sb->s_user_ns : &init_user_ns;
+ return ns_capable(user_ns, CAP_MKNOD);
+}
+
/**
* vfs_mknod - create device node or file
* @idmap: idmap of the mount the inode was found from
@@ -3966,7 +3994,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
return error;
if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
- !capable(CAP_MKNOD))
+ !sb_mknod_capable(dentry->d_sb))
return -EPERM;
if (!dir->i_op->mknod)
--
2.30.2