2019-07-26 12:06:38

by Christian Brauner

[permalink] [raw]
Subject: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

Hey everyone,

We have another mount api regression. With current 5.3-rc1 it is not
possible anymore to mount filesystems that have FS_USERNS_MOUNT set and
their fs_context's global member set to true. At least sysfs is
affected, likely also cgroup{2}fs.

The commit that introduced the regression is:

commit 0ce0cf12fc4c6a089717ff613d76457052cf4303
Author: Al Viro <[email protected]>
Date: Sun May 12 15:42:48 2019 -0400

consolidate the capability checks in sget_{fc,userns}()

... into a common helper - mount_capable(type, userns)

Signed-off-by: Al Viro <[email protected]>

mount_capable() will select the user namespace in which to check for
CAP_SYS_ADMIN based on the global property of the filesystem's
fs_context.

Since sysfs has global set to true mount_capable() will check for
CAP_SYS_ADMIN in init_user_ns and fail the mount with EPERM for any
non-init userns root. The same check is present in sget_fc().

To me it looks like that global is overriding FS_USERNS_MOUNT which
seems odd. Afaict, there are two ways to fix this:
- remove global from sysfs
- remove the global check from mount_capable() and possibly sget_fc()

The latter feels more correct but I'm not sure *why* that global thing
got introduced. Seems there could be an additional flag on affected
filesystems instead of this "global" thing. But not sure.

I can whip up a patch in case that does make sense.
And it would probably be a good thing if we had some sort of test (if
there isn't one already) so that this doesn't happen again. It could be
as simple as:

unshare -U -m --map-root -n
mkdir whatever
mount -t sysfs sysfs ./whatever

Thanks!
Christian


2019-07-26 23:21:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

On Fri, Jul 26, 2019 at 5:00 AM Christian Brauner <[email protected]> wrote:
>
> The commit that introduced the regression is:
>
> commit 0ce0cf12fc4c6a089717ff613d76457052cf4303
> Author: Al Viro <[email protected]>
> Date: Sun May 12 15:42:48 2019 -0400
>
> consolidate the capability checks in sget_{fc,userns}()
>
> ... into a common helper - mount_capable(type, userns)

The commit message there tries to imply that it's just consolidating
existing checks, but you're right - that's not at all the case.

In sget_fc(), the tests are all the exact same tests, but it uses a
different 'user_ns' after the commit. It *used* to use fc->user_ns,
now it uses 'user_ns' which depends on that 'global' bit.

And in sget_userns(), the userns is the same, but the tests are
different. Before that commit, it *always* checked for capability in
user_ns, and then (for non-FS_USERNS_MOUNT) it checked for capability
in the init namespace.

I guess the semantic change in sget_userns() is immaterial - if you
have CAP_SYS_ADMIN in the init namespace, you will have it in user_ns
too.

But the sget_fc() semantic change is a more serious change. Maybe that
was just unintentional, and Al _meant_ to pass in "fc->user_ns", but
didn't?

Of course, then later on, commit 20284ab7427f ("switch mount_capable()
to fs_context") drops that argument entirely, and hardcodes the
decision to look at fc->global.

But that fc->global decision wasn't there originally, and is incorrect
since it breaks existing users.

What gets much more confusing about this is that the two different
users then moved around. The sget_userns() case got moved to
legacy_get_tree(), and then joined together in vfs_get_tree(), and
then split and moved out to do_new_mount() and vfs_fsconfig_locked().

And that "joined together into vfs_get_tree()" must be wrong, because
the two cases used two different namespace rules. The sget_userns()
case *did* have that "global" flag check, while the sget_fc() did not.

Messy. Al?

Linus

2019-07-26 23:27:58

by Al Viro

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

On Fri, Jul 26, 2019 at 03:47:02PM -0700, Linus Torvalds wrote:

> Of course, then later on, commit 20284ab7427f ("switch mount_capable()
> to fs_context") drops that argument entirely, and hardcodes the
> decision to look at fc->global.
>
> But that fc->global decision wasn't there originally, and is incorrect
> since it breaks existing users.
>
> What gets much more confusing about this is that the two different
> users then moved around. The sget_userns() case got moved to
> legacy_get_tree(), and then joined together in vfs_get_tree(), and
> then split and moved out to do_new_mount() and vfs_fsconfig_locked().
>
> And that "joined together into vfs_get_tree()" must be wrong, because
> the two cases used two different namespace rules. The sget_userns()
> case *did* have that "global" flag check, while the sget_fc() did not.
>
> Messy. Al?

Digging through that mess... It's my fuckup, and we obviously need to
restore the old behaviour, but I really hope to manage that with
checks _not_ in superblock allocator ;-/

2019-07-27 01:42:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

Al Viro <[email protected]> writes:

> On Fri, Jul 26, 2019 at 03:47:02PM -0700, Linus Torvalds wrote:
>
>> Of course, then later on, commit 20284ab7427f ("switch mount_capable()
>> to fs_context") drops that argument entirely, and hardcodes the
>> decision to look at fc->global.
>>
>> But that fc->global decision wasn't there originally, and is incorrect
>> since it breaks existing users.
>>
>> What gets much more confusing about this is that the two different
>> users then moved around. The sget_userns() case got moved to
>> legacy_get_tree(), and then joined together in vfs_get_tree(), and
>> then split and moved out to do_new_mount() and vfs_fsconfig_locked().
>>
>> And that "joined together into vfs_get_tree()" must be wrong, because
>> the two cases used two different namespace rules. The sget_userns()
>> case *did* have that "global" flag check, while the sget_fc() did not.
>>
>> Messy. Al?
>
> Digging through that mess... It's my fuckup, and we obviously need to
> restore the old behaviour, but I really hope to manage that with
> checks _not_ in superblock allocator ;-/

If someone had bothered to actually look at how I was proposing to clean
things up before the new mount api we would already have that. Sigh.

You should be able to get away with something like this which moves the
checks earlier and makes things clearer. My old patch against the pre
new mount api code.

I am running at undependable speed due to the new baby so it is probably
better for someone else to forward port this, but I will attempt it
otherwise.

Eric

From: "Eric W. Biederman" <[email protected]>
Date: Wed, 21 Nov 2018 11:17:01 -0600
Subject: [PATCH] vfs: Replace FS_USERNS_MOUNT with file_system_type->permission

Permission checking of the user to see if the can mount an individual
filesystem using FS_USERNS_MOUNT and checks in sget is not very
comprehensible. Further by pushing the logic down into sget the
attack surface on filesystems that don't support unprivilged mounts is
much larger than it should be.

Now that it is understood what the permission checks need to be refactor the
checks into a simple per filesystme permission check. If no permission check is
implemented the default check becomes a simple capable(CAP_SYS_ADMIN).

The result is code that is much simpler to understand and much easier to maintain.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/devpts/inode.c | 2 +-
fs/fuse/inode.c | 3 ++-
fs/namespace.c | 15 +++++++++++++++
fs/proc/root.c | 8 +++++++-
fs/ramfs/inode.c | 2 +-
fs/super.c | 20 ++++++--------------
fs/sysfs/mount.c | 13 +++++++------
include/linux/fs.h | 4 +++-
ipc/mqueue.c | 8 +++++++-
kernel/cgroup/cgroup.c | 16 ++++++++--------
mm/shmem.c | 4 ++--
11 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c53814539070..1418912efc7d 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -519,9 +519,9 @@ static void devpts_kill_sb(struct super_block *sb)

static struct file_system_type devpts_fs_type = {
.name = "devpts",
+ .permission = userns_mount_permission,
.mount = devpts_mount,
.kill_sb = devpts_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
};

/*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0b94b23b02d4..e9f6aa9974f8 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1259,7 +1259,8 @@ static void fuse_kill_sb_anon(struct super_block *sb)
static struct file_system_type fuse_fs_type = {
.owner = THIS_MODULE,
.name = "fuse",
- .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
+ .fs_flags = FS_HAS_SUBTYPE,
+ .permission = userns_mount_permission,
.mount = fuse_mount,
.kill_sb = fuse_kill_sb_anon,
};
diff --git a/fs/namespace.c b/fs/namespace.c
index 74f64294a410..44935dbdb162 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2448,6 +2448,16 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)

static bool mount_too_revealing(struct vfsmount *mnt, int *new_mnt_flags);

+static int new_mount_permission(struct file_system_type *type)
+{
+ int err = 0;
+ if (type->permission)
+ err = type->permission();
+ else if (!capable(CAP_SYS_ADMIN))
+ err = -EPERM;
+ return err;
+}
+
/*
* create a new mount for userspace and request it to be added into the
* namespace's tree
@@ -2466,6 +2476,11 @@ static int do_new_mount(struct path *path, const char *fstype, int sb_flags,
if (!type)
return -ENODEV;

+ /* Verify the mounter has permission to mount the filesystem */
+ err = new_mount_permission(type);
+ if (err)
+ return err;
+
mnt = vfs_kern_mount(type, sb_flags, name, data);
if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
!mnt->mnt_sb->s_subtype)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index f4b1a9d2eca6..b870eacef952 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -86,6 +86,12 @@ int proc_remount(struct super_block *sb, int *flags, char *data)
return !proc_parse_options(data, pid);
}

+static int proc_mount_permission(void)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}
+
static struct dentry *proc_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
@@ -116,9 +122,9 @@ static void proc_kill_sb(struct super_block *sb)

static struct file_system_type proc_fs_type = {
.name = "proc",
+ .permission = proc_mount_permission,
.mount = proc_mount,
.kill_sb = proc_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
};

void __init proc_root_init(void)
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 11201b2d06b9..6710cbe5e087 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -261,9 +261,9 @@ static void ramfs_kill_sb(struct super_block *sb)

static struct file_system_type ramfs_fs_type = {
.name = "ramfs",
+ .permission = userns_mount_permission,
.mount = ramfs_mount,
.kill_sb = ramfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
};

int __init init_ramfs_fs(void)
diff --git a/fs/super.c b/fs/super.c
index ca53a08497ed..0783794e6173 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -474,6 +474,12 @@ void generic_shutdown_super(struct super_block *sb)

EXPORT_SYMBOL(generic_shutdown_super);

+int userns_mount_permission(void)
+{
+ return ns_capable(current_user_ns(), CAP_SYS_ADMIN) ? 0 : -EPERM;
+}
+EXPORT_SYMBOL(userns_mount_permission);
+
/**
* sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
@@ -493,10 +499,6 @@ struct super_block *sget_userns(struct file_system_type *type,
struct super_block *old;
int err;

- if (!(flags & (SB_KERNMOUNT|SB_SUBMOUNT)) &&
- !(type->fs_flags & FS_USERNS_MOUNT) &&
- !capable(CAP_SYS_ADMIN))
- return ERR_PTR(-EPERM);
retry:
spin_lock(&sb_lock);
if (test) {
@@ -563,10 +565,6 @@ struct super_block *sget(struct file_system_type *type,
if (flags & SB_SUBMOUNT)
user_ns = &init_user_ns;

- /* Ensure the requestor has permissions over the target filesystem */
- if (!(flags & (SB_KERNMOUNT|SB_SUBMOUNT)) && !ns_capable(user_ns, CAP_SYS_ADMIN))
- return ERR_PTR(-EPERM);
-
return sget_userns(type, test, set, flags, user_ns, data);
}

@@ -1059,12 +1057,6 @@ struct dentry *mount_ns(struct file_system_type *fs_type,
{
struct super_block *sb;

- /* Don't allow mounting unless the caller has CAP_SYS_ADMIN
- * over the namespace.
- */
- if (!(flags & SB_KERNMOUNT) && !ns_capable(user_ns, CAP_SYS_ADMIN))
- return ERR_PTR(-EPERM);
-
sb = sget_userns(fs_type, ns_test_super, ns_set_super, flags,
user_ns, ns);
if (IS_ERR(sb))
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 015a8d95952b..058ebd28c413 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -21,6 +21,12 @@
static struct kernfs_root *sysfs_root;
struct kernfs_node *sysfs_root_kn;

+static int sysfs_mount_permission(void)
+{
+ struct net *net = current->nsproxy->net_ns;
+ return ns_capable(net->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}
+
static struct dentry *sysfs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
@@ -28,11 +34,6 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
struct dentry *root;
bool new_sb = false;

- if (!(flags & SB_KERNMOUNT)) {
- if (!ns_capable(net->user_ns, CAP_SYS_ADMIN))
- return ERR_PTR(-EPERM);
- }
-
net = hold_net(net);
root = kernfs_mount_ns(fs_type, flags, sysfs_root,
SYSFS_MAGIC, &new_sb, net);
@@ -54,9 +55,9 @@ static void sysfs_kill_sb(struct super_block *sb)

static struct file_system_type sysfs_fs_type = {
.name = "sysfs",
+ .permission = sysfs_mount_permission,
.mount = sysfs_mount,
.kill_sb = sysfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
};

int __init sysfs_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c0807471f..61dcd9b98118 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2165,8 +2165,8 @@ struct file_system_type {
#define FS_REQUIRES_DEV 1
#define FS_BINARY_MOUNTDATA 2
#define FS_HAS_SUBTYPE 4
-#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
+ int (*permission)(void);
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
@@ -2186,6 +2186,8 @@ struct file_system_type {

#define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)

+int userns_mount_permission(void);
+
extern struct dentry *mount_ns(struct file_system_type *fs_type,
int flags, void *data, void *ns, struct user_namespace *user_ns,
int (*fill_super)(struct super_block *, void *, int));
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c595bed7bfcb..2c9533082110 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -343,6 +343,12 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
return 0;
}

+static int mqueue_mount_permission(void)
+{
+ struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+ return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}
+
static struct dentry *mqueue_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
void *data)
@@ -1524,9 +1530,9 @@ static const struct super_operations mqueue_super_ops = {

static struct file_system_type mqueue_fs_type = {
.name = "mqueue",
+ .permission = mqueue_mount_permission,
.mount = mqueue_mount,
.kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
};

int mq_init_ns(struct ipc_namespace *ns)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6aaf5dd5383b..82539ac4fa28 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2030,6 +2030,12 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
return dentry;
}

+static int cgroup_mount_permission(void)
+{
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+ return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}
+
static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
@@ -2040,12 +2046,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,

get_cgroup_ns(ns);

- /* Check if the caller has permission to mount. */
- if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
- put_cgroup_ns(ns);
- return ERR_PTR(-EPERM);
- }
-
/*
* The first time anyone tries to mount a cgroup, enable the list
* linking each css_set to its tasks and fix up all existing tasks.
@@ -2101,16 +2101,16 @@ static void cgroup_kill_sb(struct super_block *sb)

struct file_system_type cgroup_fs_type = {
.name = "cgroup",
+ .permission = cgroup_mount_permission,
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
};

static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
+ .permission = cgroup_mount_permission,
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
};

int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
diff --git a/mm/shmem.c b/mm/shmem.c
index ea26d7a0342d..02f3bdaf7dc2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3654,9 +3654,9 @@ static struct dentry *shmem_mount(struct file_system_type *fs_type,
static struct file_system_type shmem_fs_type = {
.owner = THIS_MODULE,
.name = "tmpfs",
+ .permission = userns_mount_permission,
.mount = shmem_mount,
.kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
};

int __init shmem_init(void)
@@ -3799,9 +3799,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)

static struct file_system_type shmem_fs_type = {
.name = "tmpfs",
+ .permission = userns_mount_permission,
.mount = ramfs_mount,
.kill_sb = kill_litter_super,
- .fs_flags = FS_USERNS_MOUNT,
};

int __init shmem_init(void)
--
2.21.0.dirty



2019-07-27 02:28:22

by Al Viro

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

On Sat, Jul 27, 2019 at 12:22:20AM +0100, Al Viro wrote:
> On Fri, Jul 26, 2019 at 03:47:02PM -0700, Linus Torvalds wrote:
>
> > Of course, then later on, commit 20284ab7427f ("switch mount_capable()
> > to fs_context") drops that argument entirely, and hardcodes the
> > decision to look at fc->global.
> >
> > But that fc->global decision wasn't there originally, and is incorrect
> > since it breaks existing users.
> >
> > What gets much more confusing about this is that the two different
> > users then moved around. The sget_userns() case got moved to
> > legacy_get_tree(), and then joined together in vfs_get_tree(), and
> > then split and moved out to do_new_mount() and vfs_fsconfig_locked().
> >
> > And that "joined together into vfs_get_tree()" must be wrong, because
> > the two cases used two different namespace rules. The sget_userns()
> > case *did* have that "global" flag check, while the sget_fc() did not.
> >
> > Messy. Al?
>
> Digging through that mess... It's my fuckup, and we obviously need to
> restore the old behaviour, but I really hope to manage that with
> checks _not_ in superblock allocator ;-/

It shouldn't have looked at fc->global for those checks. In any cases.
sget_fc() should indeed have been passing fc->user_ns, not userns.
And as for sget_userns(), by the time of 20284ab7427f
its checks had been moved to legacy_get_tree(). In form of
if (!mount_capable(fc->fs_type, fc->user_ns))
as it bloody well ought to.

So the first mistake (wrong argument passed to mount_capable() by sget_fc()
in 0ce0cf12fc4c) has been completed by 20284ab7427f - that conversion was,
actually, an equivalent transformation (callers of legacy_get_tree() never
have fc->global set, so it's all the same). However, the bug introduced in
the earlier commit was now spelled out in mount_capable() itself.

IOW, the minimal fix should be as below. In principle, I'm not against
Eric's "add a method instead of setting FS_USERNS_MOUNT", but note that
in *all* cases the instances of his method end up being equivalent to
return ns_capable(fc->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;

Anyway, AFAICS the regression fix should be simply this:

Unbreak mount_capable()

In "consolidate the capability checks in sget_{fc,userns}())" the
wrong argument had been passed to mount_capable() by sget_fc().
That mistake had been further obscured later, when switching
mount_capable() to fs_context has moved the calculation of
bogus argument from sget_fc() to mount_capable() itself. It
should've been fc->user_ns all along.

Screwed-up-by: Al Viro <[email protected]>
Reported-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/super.c b/fs/super.c
index 113c58f19425..5960578a4076 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -478,13 +478,10 @@ EXPORT_SYMBOL(generic_shutdown_super);

bool mount_capable(struct fs_context *fc)
{
- struct user_namespace *user_ns = fc->global ? &init_user_ns
- : fc->user_ns;
-
if (!(fc->fs_type->fs_flags & FS_USERNS_MOUNT))
return capable(CAP_SYS_ADMIN);
else
- return ns_capable(user_ns, CAP_SYS_ADMIN);
+ return ns_capable(fc->user_ns, CAP_SYS_ADMIN);
}

/**


2019-07-27 02:31:33

by Al Viro

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

On Fri, Jul 26, 2019 at 07:46:18PM -0500, Eric W. Biederman wrote:

> If someone had bothered to actually look at how I was proposing to clean
> things up before the new mount api we would already have that. Sigh.
>
> You should be able to get away with something like this which moves the
> checks earlier and makes things clearer. My old patch against the pre
> new mount api code.

Check your instances of ->permission(); AFAICS in all cases it's (in
current terms)
return ns_capable(fc->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;

In principle I like killing FS_USERNS_MOUNT flag, but when a method
is always either NULL or exact same function...

2019-07-27 11:25:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

Al Viro <[email protected]> writes:

> On Fri, Jul 26, 2019 at 07:46:18PM -0500, Eric W. Biederman wrote:
>
>> If someone had bothered to actually look at how I was proposing to clean
>> things up before the new mount api we would already have that. Sigh.
>>
>> You should be able to get away with something like this which moves the
>> checks earlier and makes things clearer. My old patch against the pre
>> new mount api code.
>
> Check your instances of ->permission(); AFAICS in all cases it's (in
> current terms)
> return ns_capable(fc->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;


Yes. Because I refactored all of the logic to be in terms of current
before we even get to the filesystem. The idea is on a per filesystem
basis to know which namespaces for the filesystem will be selected
and to check those.

Since all that version of the patch converts is the old API we know
from only looking at current what needs to be checked.

> In principle I like killing FS_USERNS_MOUNT flag, but when a method
> is always either NULL or exact same function...

Either you are being dramatic or you read the patch much too quickly.

userns_mount_permission covers the common case of FS_USERNS_MOUNT.
Then there are the cases where you need to know how the filesystem is
going to map current into the filesystem that will be mounted. Those
are: proc_mount_permission, sysfs_mount_permission,
mqueue_mount_permission, cgroup_mount_permission,

So yes I agree the function of interest is always capable in some form,
we just need the filesystem specific logic to check to see if we will
have capable over the filesystem that will be mounted.

I don't doubt that the new mount api has added a few new complexities.

*Shrug* I have done my best to keep it simple, and to help avoid
breaking changes. When you never post your patches for public review,
and don't take any feedback it is difficult to give help.

Eric

2019-07-27 12:38:52

by Al Viro

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

On Sat, Jul 27, 2019 at 06:20:18AM -0500, Eric W. Biederman wrote:

> > In principle I like killing FS_USERNS_MOUNT flag, but when a method
> > is always either NULL or exact same function...
>
> Either you are being dramatic or you read the patch much too quickly.

Or you've done the same to my reply. I'm not saying that in the
posted form the instances are the same; I'm saying that they are
all equivalent to exact same code - ns_capable(fc->user_ns, CAP_SYS_ADMIN),
that is.

> userns_mount_permission covers the common case of FS_USERNS_MOUNT.
> Then there are the cases where you need to know how the filesystem is
> going to map current into the filesystem that will be mounted. Those
> are: proc_mount_permission, sysfs_mount_permission,
> mqueue_mount_permission, cgroup_mount_permission,

+static int proc_mount_permission(void)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

compare with
ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
in proc_init_fs_context(). Notice anything? With those 'orrible
API changes proc_mount_permissions() and userns_mount_permission()
can actually become identical. Because the default case is
to leave fc->user_ns set to current_user_ns().

cgroup case:
+static int cgroup_mount_permission(void)
+{
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+ return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

with
ctx->ns = current->nsproxy->cgroup_ns;
...
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->ns->user_ns);
in cgroup_init_fs_context(). IOW, again your instances fold together.

mqueue:
ctx->ipc_ns = get_ipc_ns(current->nsproxy->ipc_ns);
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(ctx->ipc_ns->user_ns);
in mqueue_init_fs_context() and
+static int mqueue_mount_permission(void)
+{
+ struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+ return ns_capable(ns->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

Same situation.

+static int sysfs_mount_permission(void)
+{
+ struct net *net = current->nsproxy->net_ns;
+ return ns_capable(net->user_ns, CAP_SYS_ADMIN) ? 0 : -EPERM;
+}

with
kfc->ns_tag = netns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
...
if (netns) {
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(netns->user_ns);
}
Now, _that_ is interesting. Here the variants diverge - in case
USER_NS && !NET_NS fc->user_ns is left at current_user_ns().

That, BTW, is the only case where the old checks are left in:
if (!(fc->sb_flags & SB_KERNMOUNT)) {
if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
return -EPERM;
}
in sysfs_init_fs_context() papers over that case and I'd love to
get rid of that irregularity. Which, AFAICS, can be done by
unconditional
put_user_ns(fc->user_ns);
fc->user_ns = get_user_ns(current->nsproxy->net_ns->user_ns);
whatever the .config we have. On NET_NS ones it'll be identical
to ->user_ns of what kobj_ns_grab_current(KOBJ_NS_TYPE_NET) returns,
on !NET_NS it'll get mount_capable() do the right thing without
that wart with init_fs_context() doing that extra check.

> So yes I agree the function of interest is always capable in some form,
> we just need the filesystem specific logic to check to see if we will
> have capable over the filesystem that will be mounted.
>
> I don't doubt that the new mount api has added a few new complexities.

So far it looks like *in this particular case* complexities would be
reduced - with one exception all your ->permission() instances become
identical.

Moreover, even in that case we still get the right overall behaviour
with the same instance...

So do you have any specific objections to behaviour in vfs.git #fixes
and/or to adding

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index db81cfbab9d6..c5b3c7d4d360 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -57,11 +57,6 @@ static int sysfs_init_fs_context(struct fs_context *fc)
struct kernfs_fs_context *kfc;
struct net *netns;

- if (!(fc->sb_flags & SB_KERNMOUNT)) {
- if (!kobj_ns_current_may_mount(KOBJ_NS_TYPE_NET))
- return -EPERM;
- }
-
kfc = kzalloc(sizeof(struct kernfs_fs_context), GFP_KERNEL);
if (!kfc)
return -ENOMEM;
@@ -71,10 +66,8 @@ static int sysfs_init_fs_context(struct fs_context *fc)
kfc->magic = SYSFS_MAGIC;
fc->fs_private = kfc;
fc->ops = &sysfs_fs_context_ops;
- if (netns) {
- put_user_ns(fc->user_ns);
- fc->user_ns = get_user_ns(netns->user_ns);
- }
+ put_user_ns(fc->user_ns);
+ fc->user_ns = get_user_ns(current->nsproxy->net_ns->user_ns);
fc->global = true;
return 0;
}

on top of that? The last one shouldn't change the behaviour, but it
certainly looks like a nice wartectomy; not #fixes material, though.

2019-07-27 13:20:47

by Al Viro

[permalink] [raw]
Subject: Re: Regression in 5.3 for some FS_USERNS_MOUNT (aka user-namespace-mountable) filesystems

On Sat, Jul 27, 2019 at 01:37:05PM +0100, Al Viro wrote:

> > So yes I agree the function of interest is always capable in some form,
> > we just need the filesystem specific logic to check to see if we will
> > have capable over the filesystem that will be mounted.
> >
> > I don't doubt that the new mount api has added a few new complexities.
>
> So far it looks like *in this particular case* complexities would be
> reduced - with one exception all your ->permission() instances become
> identical.
>
> Moreover, even in that case we still get the right overall behaviour
> with the same instance...

PS: For the record

* I obviously agree with your reasoning behind making those checks
fs-dependent (they have to) and with putting them (back then) into
->mount() instances (since that was the first method called)
* I agree (violently) with not liking them done inside ->mount().
* in principle I agree that the stuff like "can that thing
be mounted in non-initial userns" might better off as a method rather
than a flag.

However
* these days filesystem *can* have "which userns should the
capabilities be checked for?" handled outside ->mount(). Setting
fc->user_ns in ->init_fs_context() does just that; the thing is
called first in all cases.
* with that done we get the same logics for all FS_USERNS_MOUNT
filesystems. IOW, all your ->permission() methods would either become
NULL (for !FS_USERNS_MOUNT) or, for all non-NULL, identical to each other.
All variability between them is already taken care of when we set fc->user_ns.

The last one is what makes me somewhat dubious re having that method -
it's literally one bit of information encoded into a function pointer.
Do you anticipate any cases where the thing would *NOT* be of the same
form? I.e. when something is userns-mountable, but the check is not
ns_capable(some userns, CAP_SYS_ADMIN)?

While we are at it, kobj_ns_...() look like preparations to something
that has never fully materialized. What would sysfs mount checks be
supposed to do if we'd ever grown more than one struct kobj_ns_type_operations
instance? Because that looks like the most plausible case of "we might
need trickier ->permission()"...