2018-05-23 23:23:53

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts


Very slowly the work has been progressing to ensure the vfs has the
necessary support for mounting filesystems without privilege.

This patchset contains one more core piece of that work, ensuring a few
more operations that would write back an inode and confuse an exisiting
filesystem are denied.

The rest of the changes actually enable userns root to do things with
filesystems that the userns root has mounted. Most of these have been
waiting in the wings a long time, held back because I wanted the core
of the patchset to be solid before I started allowing additional
behavor.

It is definitely time for these changes so the effect of s_user_ns
becomes less theoretical.

The change to allow mknod is new, but consistent with everything else
and harmless as device nodes on filesystems mounted without privilege
are ignored.

Unless problems show up in the during review I plan to merge these changes.

These changes are also available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-test

Eric W. Biederman (5):
vfs: Don't allow changing the link count of an inode with an invalid uid or gid
vfs: Allow userns root to call mknod on owned filesystems.
fs: Allow superblock owner to replace invalid owners of inodes
fs: Allow superblock owner to access do_remount_sb()
capabilities: Allow privileged user in s_user_ns to set security.* xattrs

Seth Forshee (1):
fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

fs/attr.c | 36 ++++++++++++++++++++++++++++--------
fs/ioctl.c | 4 ++--
fs/namei.c | 16 ++++++++++++----
fs/namespace.c | 4 ++--
security/commoncap.c | 8 ++++++--
5 files changed, 50 insertions(+), 18 deletions(-)

Eric


2018-05-23 23:26:59

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 1/6] vfs: Don't allow changing the link count of an inode with an invalid uid or gid

Changing the link count of an inode via unlink or link will cause a
write back of that inode. If the uids or gids are invalid (aka not known
to the kernel) writing the inode back may change the uid or gid in the
filesystem. To prevent possible filesystem and to avoid the need for
filesystem maintainers to worry about it don't allow operations on
inodes with an invalid uid or gid.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namei.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 186bd2464fd5..942c1f096f6b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -984,13 +984,15 @@ static bool safe_hardlink_source(struct inode *inode)
*/
static int may_linkat(struct path *link)
{
- struct inode *inode;
+ struct inode *inode = link->dentry->d_inode;
+
+ /* Inode writeback is not safe when the uid or gid are invalid. */
+ if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+ return -EOVERFLOW;

if (!sysctl_protected_hardlinks)
return 0;

- inode = link->dentry->d_inode;
-
/* Source inode owner (or CAP_FOWNER) can hardlink all they like,
* otherwise, it must be a safe source.
*/
@@ -2749,6 +2751,11 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
BUG_ON(!inode);

BUG_ON(victim->d_parent->d_inode != dir);
+
+ /* Inode writeback is not safe when the uid or gid are invalid. */
+ if (!uid_valid(inode->i_uid) || !gid_valid(inode->i_gid))
+ return -EOVERFLOW;
+
audit_inode_child(dir, victim, AUDIT_TYPE_CHILD_DELETE);

error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
--
2.14.1


2018-05-23 23:27:25

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

These filesystems already always set SB_I_NODEV so mknod will not be
useful for gaining control of any devices no matter their permissions.
This will allow overlayfs and applications to fakeroot to use device
nodes to represent things on disk.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/namei.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 942c1f096f6b..20335896dcce 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3679,7 +3679,8 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
if (error)
return error;

- if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+ if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
+ !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD))
return -EPERM;

if (!dir->i_op->mknod)
--
2.14.1


2018-05-23 23:27:35

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 5/6] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.

Signed-off-by: Seth Forshee <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
security/commoncap.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 1ce701fcb3f3..f4c33abd9959 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -919,6 +919,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
int cap_inode_setxattr(struct dentry *dentry, const char *name,
const void *value, size_t size, int flags)
{
+ struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
/* Ignore non-security xattrs */
if (strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -931,7 +933,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
if (strcmp(name, XATTR_NAME_CAPS) == 0)
return 0;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
@@ -949,6 +951,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
*/
int cap_inode_removexattr(struct dentry *dentry, const char *name)
{
+ struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
/* Ignore non-security xattrs */
if (strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -964,7 +968,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
return 0;
}

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
--
2.14.1


2018-05-23 23:27:49

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 4/6] fs: Allow superblock owner to access do_remount_sb()

Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.

Signed-off-by: Seth Forshee <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/namespace.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 5f75969adff1..8ddd14806799 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1590,7 +1590,7 @@ static int do_umount(struct mount *mnt, int flags)
* Special case for "unmounting" root ...
* we just try to remount it readonly.
*/
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
@@ -2333,7 +2333,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
down_write(&sb->s_umount);
if (ms_flags & MS_BIND)
err = change_mount_flags(path->mnt, ms_flags);
- else if (!capable(CAP_SYS_ADMIN))
+ else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
err = -EPERM;
else
err = do_remount_sb(sb, sb_flags, data, 0);
--
2.14.1


2018-05-23 23:28:06

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 3/6] fs: Allow superblock owner to replace invalid owners of inodes

Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
chown files when inode owner is invalid. Ordinarily the
capable_wrt_inode_uidgid check is sufficient to allow access to files
but when the underlying filesystem has uids or gids that don't map to
the current user namespace it is not enough, so the chown permission
checks need to be extended to allow this case.

Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.

Once chown has been called the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.

An ordinary filesystem mountable by a userns root will limit all uids
and gids in s_user_ns or the INVALID_UID and INVALID_GID to flag all
others. So having this added permission limited to just INVALID_UID
and INVALID_GID is sufficient to handle every case on an ordinary filesystem.

Of the virtual filesystems at least proc is known to set s_user_ns to
something other than &init_user_ns, while at the same time presenting
some files owned by GLOBAL_ROOT_UID. Those files the mounter of proc
in a user namespace should not be able to chown to get access to.
Limiting the relaxation in permission to just the minimum of allowing
changing INVALID_UID and INVALID_GID prevents problems with cases like
that.

The original version of this patch was written by: Seth Forshee. I
have rewritten and rethought this patch enough so it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, and finding the potential gotcha's and putting up with my
semi-paranoid feedback.

Inspired-by: Seth Forshee <[email protected]>
Acked-by: Seth Forshee <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/attr.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6fb63c..4220c98f41a5 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,32 @@
#include <linux/evm.h>
#include <linux/ima.h>

+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+ if (uid_eq(current_fsuid(), inode->i_uid) &&
+ uid_eq(uid, inode->i_uid))
+ return true;
+ if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ return true;
+ if (uid_eq(inode->i_uid, INVALID_UID) &&
+ ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN));
+ return true;
+ return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+ if (uid_eq(current_fsuid(), inode->i_uid) &&
+ (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+ return true;
+ if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ return true;
+ if (gid_eq(inode->i_gid, INVALID_GID) &&
+ ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN));
+ return true;
+ return false;
+}
+
/**
* setattr_prepare - check if attribute changes to a dentry are allowed
* @dentry: dentry to check
@@ -52,17 +78,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
goto kill_priv;

/* Make sure a caller can chown. */
- if ((ia_valid & ATTR_UID) &&
- (!uid_eq(current_fsuid(), inode->i_uid) ||
- !uid_eq(attr->ia_uid, inode->i_uid)) &&
- !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
return -EPERM;

/* Make sure caller can chgrp. */
- if ((ia_valid & ATTR_GID) &&
- (!uid_eq(current_fsuid(), inode->i_uid) ||
- (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
- !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
return -EPERM;

/* Make sure a caller can chmod. */
--
2.14.1


2018-05-23 23:28:08

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH 6/6] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

From: Seth Forshee <[email protected]>

The user in control of a super block should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.

Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/ioctl.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 4823431d1c9d..b445b13fc59b 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
{
struct super_block *sb = file_inode(filp)->i_sb;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;

/* If filesystem doesn't support freeze feature, return. */
@@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
{
struct super_block *sb = file_inode(filp)->i_sb;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;

/* Thaw */
--
2.14.1


2018-05-23 23:44:01

by Eric W. Biederman

[permalink] [raw]
Subject: [REVIEW][PATCH v2 3/6] fs: Allow superblock owner to replace invalid owners of inodes


Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
chown files when inode owner is invalid. Ordinarily the
capable_wrt_inode_uidgid check is sufficient to allow access to files
but when the underlying filesystem has uids or gids that don't map to
the current user namespace it is not enough, so the chown permission
checks need to be extended to allow this case.

Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.

Once chown has been called the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.

An ordinary filesystem mountable by a userns root will limit all uids
and gids in s_user_ns or the INVALID_UID and INVALID_GID to flag all
others. So having this added permission limited to just INVALID_UID
and INVALID_GID is sufficient to handle every case on an ordinary filesystem.

Of the virtual filesystems at least proc is known to set s_user_ns to
something other than &init_user_ns, while at the same time presenting
some files owned by GLOBAL_ROOT_UID. Those files the mounter of proc
in a user namespace should not be able to chown to get access to.
Limiting the relaxation in permission to just the minimum of allowing
changing INVALID_UID and INVALID_GID prevents problems with cases like
that.

The original version of this patch was written by: Seth Forshee. I
have rewritten and rethought this patch enough so it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, and finding the potential gotcha's and putting up with my
semi-paranoid feedback.

Inspired-by: Seth Forshee <[email protected]>
Acked-by: Seth Forshee <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---

Sigh. In simplifying this change so it would not require a change to
proc (or any other similar filesystem) I accidentally introduced some
badly placed semicolons. The kbuild test robot was very nice and found
those for me. Resend with those unnecessary semicolons removed.

fs/attr.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6fb63c..d0b4d34878fb 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,32 @@
#include <linux/evm.h>
#include <linux/ima.h>

+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+ if (uid_eq(current_fsuid(), inode->i_uid) &&
+ uid_eq(uid, inode->i_uid))
+ return true;
+ if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ return true;
+ if (uid_eq(inode->i_uid, INVALID_UID) &&
+ ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+ return true;
+ return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+ if (uid_eq(current_fsuid(), inode->i_uid) &&
+ (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+ return true;
+ if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ return true;
+ if (gid_eq(inode->i_gid, INVALID_GID) &&
+ ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+ return true;
+ return false;
+}
+
/**
* setattr_prepare - check if attribute changes to a dentry are allowed
* @dentry: dentry to check
@@ -52,17 +78,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
goto kill_priv;

/* Make sure a caller can chown. */
- if ((ia_valid & ATTR_UID) &&
- (!uid_eq(current_fsuid(), inode->i_uid) ||
- !uid_eq(attr->ia_uid, inode->i_uid)) &&
- !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
return -EPERM;

/* Make sure caller can chgrp. */
- if ((ia_valid & ATTR_GID) &&
- (!uid_eq(current_fsuid(), inode->i_uid) ||
- (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
- !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
return -EPERM;

/* Make sure a caller can chmod. */

2018-05-24 22:55:15

by Seth Forshee

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 1/6] vfs: Don't allow changing the link count of an inode with an invalid uid or gid

On Wed, May 23, 2018 at 06:25:33PM -0500, Eric W. Biederman wrote:
> Changing the link count of an inode via unlink or link will cause a
> write back of that inode. If the uids or gids are invalid (aka not known
> to the kernel) writing the inode back may change the uid or gid in the
> filesystem. To prevent possible filesystem and to avoid the need for
> filesystem maintainers to worry about it don't allow operations on
> inodes with an invalid uid or gid.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Acked-by: Seth Forshee <[email protected]>

2018-05-25 00:11:13

by Seth Forshee

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote:
> These filesystems already always set SB_I_NODEV so mknod will not be
> useful for gaining control of any devices no matter their permissions.
> This will allow overlayfs and applications to fakeroot to use device
> nodes to represent things on disk.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>

For a normal filesystem this does seem safe enough.

However, I'd also like to see us allow unprivileged mounting for
overlayfs, and there we need to worry about whether this would allow a
mknod in an underlying filesystem which should not be allowed. That
mknod will be subject to this same check in the underlying filesystem
using the credentials of the user that mounted the overaly fs, which
should be sufficient to ensure that the mknod is permitted.

Thus this looks okay to me.

Acked-by: Seth Forshee <[email protected]>

2018-05-25 02:24:38

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 5/6] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

On Wed, May 23, 2018 at 06:25:37PM -0500, Eric W. Biederman wrote:
> A privileged user in s_user_ns will generally have the ability to
> manipulate the backing store and insert security.* xattrs into
> the filesystem directly. Therefore the kernel must be prepared to
> handle these xattrs from unprivileged mounts, and it makes little
> sense for commoncap to prevent writing these xattrs to the
> filesystem. The capability and LSM code have already been updated
> to appropriately handle xattrs from unprivileged mounts, so it
> is safe to loosen this restriction on setting xattrs.
>
> The exception to this logic is that writing xattrs to a mounted
> filesystem may also cause the LSM inode_post_setxattr or
> inode_setsecurity callbacks to be invoked. SELinux will deny the
> xattr update by virtue of applying mountpoint labeling to
> unprivileged userns mounts, and Smack will deny the writes for
> any user without global CAP_MAC_ADMIN, so loosening the
> capability check in commoncap is safe in this respect as well.

Acked-by: Christian Brauner <[email protected]>

>
> Signed-off-by: Seth Forshee <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>

Note, I just talked to Serge. This should be Acked-by: Serge Hallyn <[email protected]>

> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> security/commoncap.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 1ce701fcb3f3..f4c33abd9959 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -919,6 +919,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
> int cap_inode_setxattr(struct dentry *dentry, const char *name,
> const void *value, size_t size, int flags)
> {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
> /* Ignore non-security xattrs */
> if (strncmp(name, XATTR_SECURITY_PREFIX,
> sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -931,7 +933,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> if (strcmp(name, XATTR_NAME_CAPS) == 0)
> return 0;
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> return 0;
> }
> @@ -949,6 +951,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> */
> int cap_inode_removexattr(struct dentry *dentry, const char *name)
> {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
> /* Ignore non-security xattrs */
> if (strncmp(name, XATTR_SECURITY_PREFIX,
> sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -964,7 +968,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> return 0;
> }
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> return 0;
> }
> --
> 2.14.1
>

2018-05-25 02:25:56

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 4/6] fs: Allow superblock owner to access do_remount_sb()

On Wed, May 23, 2018 at 06:25:36PM -0500, Eric W. Biederman wrote:
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.

Acked-by: Christian Brauner <[email protected]>

>
> Signed-off-by: Seth Forshee <[email protected]>
> Acked-by: "Eric W. Biederman" <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>

Note, I just talked to Serge. This should be Acked-by: Serge Hallyn <[email protected]>

> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/namespace.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 5f75969adff1..8ddd14806799 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1590,7 +1590,7 @@ static int do_umount(struct mount *mnt, int flags)
> * Special case for "unmounting" root ...
> * we just try to remount it readonly.
> */
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> down_write(&sb->s_umount);
> if (!sb_rdonly(sb))
> @@ -2333,7 +2333,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
> down_write(&sb->s_umount);
> if (ms_flags & MS_BIND)
> err = change_mount_flags(path->mnt, ms_flags);
> - else if (!capable(CAP_SYS_ADMIN))
> + else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> err = -EPERM;
> else
> err = do_remount_sb(sb, sb_flags, data, 0);
> --
> 2.14.1
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2018-05-25 02:26:17

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 6/6] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

On Wed, May 23, 2018 at 06:25:38PM -0500, Eric W. Biederman wrote:
> From: Seth Forshee <[email protected]>
>
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.

Acked-by: Christian Brauner <[email protected]>

>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/ioctl.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 4823431d1c9d..b445b13fc59b 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
> {
> struct super_block *sb = file_inode(filp)->i_sb;
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> /* If filesystem doesn't support freeze feature, return. */
> @@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
> {
> struct super_block *sb = file_inode(filp)->i_sb;
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> /* Thaw */
> --
> 2.14.1
>

2018-05-25 02:30:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 4/6] fs: Allow superblock owner to access do_remount_sb()

Christian Brauner <[email protected]> writes:

> On Wed, May 23, 2018 at 06:25:36PM -0500, Eric W. Biederman wrote:
>> Superblock level remounts are currently restricted to global
>> CAP_SYS_ADMIN, as is the path for changing the root mount to
>> read only on umount. Loosen both of these permission checks to
>> also allow CAP_SYS_ADMIN in any namespace which is privileged
>> towards the userns which originally mounted the filesystem.
>
> Acked-by: Christian Brauner <[email protected]>
>
>>
>> Signed-off-by: Seth Forshee <[email protected]>
>> Acked-by: "Eric W. Biederman" <[email protected]>
>> Acked-by: Serge Hallyn <[email protected]>
>
> Note, I just talked to Serge. This should be Acked-by: Serge Hallyn <[email protected]>

Now you know how long these patches have been sitting waiting to get
merged.

Eric

2018-05-25 02:32:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

Seth Forshee <[email protected]> writes:

> On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote:
>> These filesystems already always set SB_I_NODEV so mknod will not be
>> useful for gaining control of any devices no matter their permissions.
>> This will allow overlayfs and applications to fakeroot to use device
>> nodes to represent things on disk.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> For a normal filesystem this does seem safe enough.
>
> However, I'd also like to see us allow unprivileged mounting for
> overlayfs, and there we need to worry about whether this would allow a
> mknod in an underlying filesystem which should not be allowed. That
> mknod will be subject to this same check in the underlying filesystem
> using the credentials of the user that mounted the overaly fs, which
> should be sufficient to ensure that the mknod is permitted.

Sufficient to ensure the mknod is not permitted on the underlying
filesystem. I believe you mean.

> Thus this looks okay to me.
>
> Acked-by: Seth Forshee <[email protected]>

Eric


2018-05-25 02:33:21

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 4/6] fs: Allow superblock owner to access do_remount_sb()

On Thu, May 24, 2018 at 11:45:06AM -0500, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Wed, May 23, 2018 at 06:25:36PM -0500, Eric W. Biederman wrote:
> >> Superblock level remounts are currently restricted to global
> >> CAP_SYS_ADMIN, as is the path for changing the root mount to
> >> read only on umount. Loosen both of these permission checks to
> >> also allow CAP_SYS_ADMIN in any namespace which is privileged
> >> towards the userns which originally mounted the filesystem.
> >
> > Acked-by: Christian Brauner <[email protected]>
> >
> >>
> >> Signed-off-by: Seth Forshee <[email protected]>
> >> Acked-by: "Eric W. Biederman" <[email protected]>
> >> Acked-by: Serge Hallyn <[email protected]>
> >
> > Note, I just talked to Serge. This should be Acked-by: Serge Hallyn <[email protected]>
>
> Now you know how long these patches have been sitting waiting to get
> merged.

Indeed. :)

Christian

2018-05-25 02:34:03

by Seth Forshee

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

On Thu, May 24, 2018 at 11:55:45AM -0500, Eric W. Biederman wrote:
> Seth Forshee <[email protected]> writes:
>
> > On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote:
> >> These filesystems already always set SB_I_NODEV so mknod will not be
> >> useful for gaining control of any devices no matter their permissions.
> >> This will allow overlayfs and applications to fakeroot to use device
> >> nodes to represent things on disk.
> >>
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >
> > For a normal filesystem this does seem safe enough.
> >
> > However, I'd also like to see us allow unprivileged mounting for
> > overlayfs, and there we need to worry about whether this would allow a
> > mknod in an underlying filesystem which should not be allowed. That
> > mknod will be subject to this same check in the underlying filesystem
> > using the credentials of the user that mounted the overaly fs, which
> > should be sufficient to ensure that the mknod is permitted.
>
> Sufficient to ensure the mknod is not permitted on the underlying
> filesystem. I believe you mean.

Right, or in other words with the relaxed capability check a user still
could not use an overlayfs mount in a user namespace to mknod in a
filesystem when that user couldn't otherwise mknod in that filesystem.
Sorry if I wasn't clear.

>
> > Thus this looks okay to me.
> >
> > Acked-by: Seth Forshee <[email protected]>
>
> Eric
>

2018-05-25 02:39:40

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 2/6] vfs: Allow userns root to call mknod on owned filesystems.

On Wed, May 23, 2018 at 06:25:34PM -0500, Eric W. Biederman wrote:
> These filesystems already always set SB_I_NODEV so mknod will not be
> useful for gaining control of any devices no matter their permissions.
> This will allow overlayfs and applications to fakeroot to use device
> nodes to represent things on disk.

Excellent.

Acked-by: Christian Brauner <[email protected]>

>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/namei.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 942c1f096f6b..20335896dcce 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3679,7 +3679,8 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
> if (error)
> return error;
>
> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
> + if ((S_ISCHR(mode) || S_ISBLK(mode)) &&
> + !ns_capable(dentry->d_sb->s_user_ns, CAP_MKNOD))
> return -EPERM;
>
> if (!dir->i_op->mknod)
> --
> 2.14.1
>

2018-05-25 02:46:01

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH v2 3/6] fs: Allow superblock owner to replace invalid owners of inodes

On Wed, May 23, 2018 at 06:41:29PM -0500, Eric W. Biederman wrote:
>
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
> chown files when inode owner is invalid. Ordinarily the
> capable_wrt_inode_uidgid check is sufficient to allow access to files
> but when the underlying filesystem has uids or gids that don't map to
> the current user namespace it is not enough, so the chown permission
> checks need to be extended to allow this case.
>
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
>
> Once chown has been called the existing capable_wrt_inode_uidgid
> checks are sufficient to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
>
> An ordinary filesystem mountable by a userns root will limit all uids
> and gids in s_user_ns or the INVALID_UID and INVALID_GID to flag all
> others. So having this added permission limited to just INVALID_UID
> and INVALID_GID is sufficient to handle every case on an ordinary filesystem.
>
> Of the virtual filesystems at least proc is known to set s_user_ns to
> something other than &init_user_ns, while at the same time presenting
> some files owned by GLOBAL_ROOT_UID. Those files the mounter of proc
> in a user namespace should not be able to chown to get access to.
> Limiting the relaxation in permission to just the minimum of allowing
> changing INVALID_UID and INVALID_GID prevents problems with cases like
> that.
>
> The original version of this patch was written by: Seth Forshee. I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotcha's and putting up with my
> semi-paranoid feedback.

Ok, took me a little longer to reason about this.

Acked-by: Christian Brauner <[email protected]>

>
> Inspired-by: Seth Forshee <[email protected]>
> Acked-by: Seth Forshee <[email protected]>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
>
> Sigh. In simplifying this change so it would not require a change to
> proc (or any other similar filesystem) I accidentally introduced some
> badly placed semicolons. The kbuild test robot was very nice and found
> those for me. Resend with those unnecessary semicolons removed.
>
> fs/attr.c | 36 ++++++++++++++++++++++++++++--------
> 1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6fb63c..d0b4d34878fb 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,32 @@
> #include <linux/evm.h>
> #include <linux/ima.h>
>
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + uid_eq(uid, inode->i_uid))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (uid_eq(inode->i_uid, INVALID_UID) &&
> + ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (gid_eq(inode->i_gid, INVALID_GID) &&
> + ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> /**
> * setattr_prepare - check if attribute changes to a dentry are allowed
> * @dentry: dentry to check
> @@ -52,17 +78,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
> goto kill_priv;
>
> /* Make sure a caller can chown. */
> - if ((ia_valid & ATTR_UID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - !uid_eq(attr->ia_uid, inode->i_uid)) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
> return -EPERM;
>
> /* Make sure caller can chgrp. */
> - if ((ia_valid & ATTR_GID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
> return -EPERM;
>
> /* Make sure a caller can chmod. */

2018-05-25 02:46:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

On Wed, May 23, 2018 at 06:22:56PM -0500, Eric W. Biederman wrote:
>
> Very slowly the work has been progressing to ensure the vfs has the
> necessary support for mounting filesystems without privilege.

What's the thinking behind how system administrators and/or file
systems would configure whether or not a particular file system type
will be allowed to be mounted w/o privilege?

- Ted

2018-05-25 02:46:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

"Theodore Y. Ts'o" <[email protected]> writes:

> On Wed, May 23, 2018 at 06:22:56PM -0500, Eric W. Biederman wrote:
>>
>> Very slowly the work has been progressing to ensure the vfs has the
>> necessary support for mounting filesystems without privilege.
>
> What's the thinking behind how system administrators and/or file
> systems would configure whether or not a particular file system type
> will be allowed to be mounted w/o privilege?

The mechanism is .fs_flags in file_system_type. If the FS_USERNS_MOUNT
flag is set then root in a user namespace (AKA an unprivileged user)
will be allowed to mount to mount the filesystem.

There are very real concerns about attacking a filesystem with an
invalid filesystem image, or by a malicious protocol speaker. So I
don't want to enable anything without the file system maintainers
consent and without a reasonable expecation that neither a system wide
denial of service attack nor a privilege escalation attack is possible
from if the filesystem is enabled.

So at a practical level what we have in the vfs is the non-fuse specific
bits that enable unprivileged mounts of fuse. Things like handling
of unmapped uid and gids, how normally trusted xattrs are dealt with,
etc.

A big practical one for me is that if either the uid or gid is not
mapped the vfs avoids writing to the inode.

Right now my practical goal is to be able to say: "Go run your
filesystem in userspace with fuse if you want stronger security
guarantees." I think that will be enough to make removable media
reasonably safe from privilege escalation attacks.

There is enough code in most filesystems that I don't know what our
chances of locking down very many of them are. But I figure a few more
of them are possible.

Eric


2018-05-25 02:47:08

by Christian Brauner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 1/6] vfs: Don't allow changing the link count of an inode with an invalid uid or gid

On Thu, May 24, 2018 at 07:58:32AM -0500, Seth Forshee wrote:
> On Wed, May 23, 2018 at 06:25:33PM -0500, Eric W. Biederman wrote:
> > Changing the link count of an inode via unlink or link will cause a
> > write back of that inode. If the uids or gids are invalid (aka not known
> > to the kernel) writing the inode back may change the uid or gid in the
> > filesystem. To prevent possible filesystem and to avoid the need for
> > filesystem maintainers to worry about it don't allow operations on
> > inodes with an invalid uid or gid.
> >
> > Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> Acked-by: Seth Forshee <[email protected]>

Acked-by: Christian Brauner <[email protected]>

2018-05-25 03:57:43

by Dave Chinner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

On Thu, May 24, 2018 at 06:23:30PM -0500, Eric W. Biederman wrote:
> "Theodore Y. Ts'o" <[email protected]> writes:
>
> > On Wed, May 23, 2018 at 06:22:56PM -0500, Eric W. Biederman wrote:
> >>
> >> Very slowly the work has been progressing to ensure the vfs has the
> >> necessary support for mounting filesystems without privilege.
> >
> > What's the thinking behind how system administrators and/or file
> > systems would configure whether or not a particular file system type
> > will be allowed to be mounted w/o privilege?
>
> The mechanism is .fs_flags in file_system_type. If the FS_USERNS_MOUNT
> flag is set then root in a user namespace (AKA an unprivileged user)
> will be allowed to mount to mount the filesystem.
>
> There are very real concerns about attacking a filesystem with an
> invalid filesystem image, or by a malicious protocol speaker. So I
> don't want to enable anything without the file system maintainers
> consent and without a reasonable expecation that neither a system wide
> denial of service attack nor a privilege escalation attack is possible
> from if the filesystem is enabled.
>
> So at a practical level what we have in the vfs is the non-fuse specific
> bits that enable unprivileged mounts of fuse. Things like handling
> of unmapped uid and gids, how normally trusted xattrs are dealt with,
> etc.
>
> A big practical one for me is that if either the uid or gid is not
> mapped the vfs avoids writing to the inode.
>
> Right now my practical goal is to be able to say: "Go run your
> filesystem in userspace with fuse if you want stronger security
> guarantees." I think that will be enough to make removable media
> reasonably safe from privilege escalation attacks.
>
> There is enough code in most filesystems that I don't know what our
> chances of locking down very many of them are. But I figure a few more
> of them are possible.

I'm not sure we need to - fusefs-lkl gives users the ability to
mount any of the kernel filesystems via fuse without us needing to
support unprivileged kernel mounts for those filesystems.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-05-25 04:07:09

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

On Fri, May 25, 2018 at 01:57:16PM +1000, Dave Chinner wrote:
> On Thu, May 24, 2018 at 06:23:30PM -0500, Eric W. Biederman wrote:
> > "Theodore Y. Ts'o" <[email protected]> writes:
> >
> > > On Wed, May 23, 2018 at 06:22:56PM -0500, Eric W. Biederman wrote:
> > >>
> > >> Very slowly the work has been progressing to ensure the vfs has the
> > >> necessary support for mounting filesystems without privilege.
> > >
> > > What's the thinking behind how system administrators and/or file
> > > systems would configure whether or not a particular file system type
> > > will be allowed to be mounted w/o privilege?
> >
> > The mechanism is .fs_flags in file_system_type. If the FS_USERNS_MOUNT
> > flag is set then root in a user namespace (AKA an unprivileged user)
> > will be allowed to mount to mount the filesystem.
> >
> > There are very real concerns about attacking a filesystem with an
> > invalid filesystem image, or by a malicious protocol speaker. So I
> > don't want to enable anything without the file system maintainers
> > consent and without a reasonable expecation that neither a system wide
> > denial of service attack nor a privilege escalation attack is possible
> > from if the filesystem is enabled.
> >
> > So at a practical level what we have in the vfs is the non-fuse specific
> > bits that enable unprivileged mounts of fuse. Things like handling
> > of unmapped uid and gids, how normally trusted xattrs are dealt with,
> > etc.
> >
> > A big practical one for me is that if either the uid or gid is not
> > mapped the vfs avoids writing to the inode.
> >
> > Right now my practical goal is to be able to say: "Go run your
> > filesystem in userspace with fuse if you want stronger security
> > guarantees." I think that will be enough to make removable media
> > reasonably safe from privilege escalation attacks.
> >
> > There is enough code in most filesystems that I don't know what our
> > chances of locking down very many of them are. But I figure a few more
> > of them are possible.
>
> I'm not sure we need to - fusefs-lkl gives users the ability to
> mount any of the kernel filesystems via fuse without us needing to
> support unprivileged kernel mounts for those filesystems.

/me wonders, is there a fusefs-lkl package for Linux?

(He says, knowing that freebsd has one... :))

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2018-05-29 13:16:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

Dave Chinner <[email protected]> writes:

> On Thu, May 24, 2018 at 06:23:30PM -0500, Eric W. Biederman wrote:
>> "Theodore Y. Ts'o" <[email protected]> writes:
>>
>> > On Wed, May 23, 2018 at 06:22:56PM -0500, Eric W. Biederman wrote:
>> >>
>> >> Very slowly the work has been progressing to ensure the vfs has the
>> >> necessary support for mounting filesystems without privilege.
>> >
>> > What's the thinking behind how system administrators and/or file
>> > systems would configure whether or not a particular file system type
>> > will be allowed to be mounted w/o privilege?
>>
>> The mechanism is .fs_flags in file_system_type. If the FS_USERNS_MOUNT
>> flag is set then root in a user namespace (AKA an unprivileged user)
>> will be allowed to mount to mount the filesystem.
>>
>> There are very real concerns about attacking a filesystem with an
>> invalid filesystem image, or by a malicious protocol speaker. So I
>> don't want to enable anything without the file system maintainers
>> consent and without a reasonable expecation that neither a system wide
>> denial of service attack nor a privilege escalation attack is possible
>> from if the filesystem is enabled.
>>
>> So at a practical level what we have in the vfs is the non-fuse specific
>> bits that enable unprivileged mounts of fuse. Things like handling
>> of unmapped uid and gids, how normally trusted xattrs are dealt with,
>> etc.
>>
>> A big practical one for me is that if either the uid or gid is not
>> mapped the vfs avoids writing to the inode.
>>
>> Right now my practical goal is to be able to say: "Go run your
>> filesystem in userspace with fuse if you want stronger security
>> guarantees." I think that will be enough to make removable media
>> reasonably safe from privilege escalation attacks.
>>
>> There is enough code in most filesystems that I don't know what our
>> chances of locking down very many of them are. But I figure a few more
>> of them are possible.
>
> I'm not sure we need to - fusefs-lkl gives users the ability to
> mount any of the kernel filesystems via fuse without us needing to
> support unprivileged kernel mounts for those filesystems.

Maybe.

That certainly seems like a good proof of concept for running
ordinary filesystems with fuse. If we are going to rely on it
someone probably needs to do the work to merge arch/lkl into the
main tree. My quick look suggests that the lkl port lags behind
a little bit and has just made it to 4.16.

Is fusefs-lkl valuable for testing filesystems? If xfs-tests were to
have a mode that used that used the fuse protocol for testing and
fuzzing filesystems without the full weight of the kernel in the middle
that might encourage people to suppor this kind of things as well.

Eric

2018-05-29 15:41:54

by Dongsu Park

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

Hi,

On Thu, May 24, 2018 at 1:22 AM, Eric W. Biederman
<[email protected]> wrote:
>
> Very slowly the work has been progressing to ensure the vfs has the
> necessary support for mounting filesystems without privilege.
>
> This patchset contains one more core piece of that work, ensuring a few
> more operations that would write back an inode and confuse an exisiting
> filesystem are denied.
>
> The rest of the changes actually enable userns root to do things with
> filesystems that the userns root has mounted. Most of these have been
> waiting in the wings a long time, held back because I wanted the core
> of the patchset to be solid before I started allowing additional
> behavor.
>
> It is definitely time for these changes so the effect of s_user_ns
> becomes less theoretical.
>
> The change to allow mknod is new, but consistent with everything else
> and harmless as device nodes on filesystems mounted without privilege
> are ignored.
>
> Unless problems show up in the during review I plan to merge these changes.

Thank you for the great work. I have been looking forward to seeing it.
I have just gathered available relevant patches in my branch:

https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-for-4.18

With this branch, I tested sshfs/fuse from non-init user namespace.
It works fine as expected. So you can add:

Tested-by: Dongsu Park <[email protected]>

Thanks!
Dongsu

> These changes are also available at:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-test
>
> Eric W. Biederman (5):
> vfs: Don't allow changing the link count of an inode with an invalid uid or gid
> vfs: Allow userns root to call mknod on owned filesystems.
> fs: Allow superblock owner to replace invalid owners of inodes
> fs: Allow superblock owner to access do_remount_sb()
> capabilities: Allow privileged user in s_user_ns to set security.* xattrs
>
> Seth Forshee (1):
> fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>
> fs/attr.c | 36 ++++++++++++++++++++++++++++--------
> fs/ioctl.c | 4 ++--
> fs/namei.c | 16 ++++++++++++----
> fs/namespace.c | 4 ++--
> security/commoncap.c | 8 ++++++--
> 5 files changed, 50 insertions(+), 18 deletions(-)
>
> Eric
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2018-05-29 22:18:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

On Tue, May 29, 2018 at 08:12:28AM -0500, Eric W. Biederman wrote:
> Dave Chinner <[email protected]> writes:
>
> > On Thu, May 24, 2018 at 06:23:30PM -0500, Eric W. Biederman wrote:
> >> "Theodore Y. Ts'o" <[email protected]> writes:
> >>
> >> > On Wed, May 23, 2018 at 06:22:56PM -0500, Eric W. Biederman wrote:
> >> >>
> >> >> Very slowly the work has been progressing to ensure the vfs has the
> >> >> necessary support for mounting filesystems without privilege.
> >> >
> >> > What's the thinking behind how system administrators and/or file
> >> > systems would configure whether or not a particular file system type
> >> > will be allowed to be mounted w/o privilege?
> >>
> >> The mechanism is .fs_flags in file_system_type. If the FS_USERNS_MOUNT
> >> flag is set then root in a user namespace (AKA an unprivileged user)
> >> will be allowed to mount to mount the filesystem.
> >>
> >> There are very real concerns about attacking a filesystem with an
> >> invalid filesystem image, or by a malicious protocol speaker. So I
> >> don't want to enable anything without the file system maintainers
> >> consent and without a reasonable expecation that neither a system wide
> >> denial of service attack nor a privilege escalation attack is possible
> >> from if the filesystem is enabled.
> >>
> >> So at a practical level what we have in the vfs is the non-fuse specific
> >> bits that enable unprivileged mounts of fuse. Things like handling
> >> of unmapped uid and gids, how normally trusted xattrs are dealt with,
> >> etc.
> >>
> >> A big practical one for me is that if either the uid or gid is not
> >> mapped the vfs avoids writing to the inode.
> >>
> >> Right now my practical goal is to be able to say: "Go run your
> >> filesystem in userspace with fuse if you want stronger security
> >> guarantees." I think that will be enough to make removable media
> >> reasonably safe from privilege escalation attacks.
> >>
> >> There is enough code in most filesystems that I don't know what our
> >> chances of locking down very many of them are. But I figure a few more
> >> of them are possible.
> >
> > I'm not sure we need to - fusefs-lkl gives users the ability to
> > mount any of the kernel filesystems via fuse without us needing to
> > support unprivileged kernel mounts for those filesystems.
>
> Maybe.
>
> That certainly seems like a good proof of concept for running
> ordinary filesystems with fuse. If we are going to rely on it
> someone probably needs to do the work to merge arch/lkl into the
> main tree. My quick look suggests that the lkl port lags behind
> a little bit and has just made it to 4.16.

Yeah, the are some fairly big process and policy things that need
to be decided here. Not just at the kernel level, but at distro and
app infrastructure level too.

I was originally sceptical of supporting kernel filesystems via lkl,
but the desire for unprivileged mounts has not gone away and so I'm
less worried about accessing filesystems that way than I am of
letting the kernel parse untrusted images from untrusted users...

I'm not sure what the correct forum for this is - wasn't this
something the Plumbers conference was supposed to facilitate?

> Is fusefs-lkl valuable for testing filesystems? If xfs-tests were to
> have a mode that used that used the fuse protocol for testing and
> fuzzing filesystems without the full weight of the kernel in the middle
> that might encourage people to suppor this kind of things as well.

Getting lkl-fuse to run under fstests would be a great way to ensure
we have some level of confidence that it will do the right thing and
users can expect that it won't eat their data. I think this would
need to be a part of a recommendation for wider deploy of such a
solution...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-05-30 02:37:06

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

Dave Chinner <[email protected]> writes:

> Yeah, the are some fairly big process and policy things that need
> to be decided here. Not just at the kernel level, but at distro and
> app infrastructure level too.
>
> I was originally sceptical of supporting kernel filesystems via lkl,
> but the desire for unprivileged mounts has not gone away and so I'm
> less worried about accessing filesystems that way than I am of
> letting the kernel parse untrusted images from untrusted users...

There is also the more readily available libguestfs which doesn't
support as many filesystems but does seem available in most
linux distributions already. It already has a fuse option available
with guestmount. I may have to dig in there and see how to make
it available without using fusermount.

> I'm not sure what the correct forum for this is - wasn't this
> something the Plumbers conference was supposed to facilitate?

Yes. If we all need to be in a room and talk about things.
It is early enough in the planning for Plumers that we could
definitely schedule a talk or a BOF for this.

>> Is fusefs-lkl valuable for testing filesystems? If xfs-tests were to
>> have a mode that used that used the fuse protocol for testing and
>> fuzzing filesystems without the full weight of the kernel in the middle
>> that might encourage people to suppor this kind of things as well.
>
> Getting lkl-fuse to run under fstests would be a great way to ensure
> we have some level of confidence that it will do the right thing and
> users can expect that it won't eat their data. I think this would
> need to be a part of a recommendation for wider deploy of such a
> solution...

Good thought. I will have to give that a look. That does sound like a
good practical test.

Eric

2018-05-30 04:35:43

by Dave Chinner

[permalink] [raw]
Subject: Re: [REVIEW][PATCH 0/6] Wrapping up the vfs support for unprivileged mounts

On Tue, May 29, 2018 at 09:34:35PM -0500, Eric W. Biederman wrote:
> Dave Chinner <[email protected]> writes:
>
> > Yeah, the are some fairly big process and policy things that
> > need to be decided here. Not just at the kernel level, but at
> > distro and app infrastructure level too.
> >
> > I was originally sceptical of supporting kernel filesystems via
> > lkl, but the desire for unprivileged mounts has not gone away
> > and so I'm less worried about accessing filesystems that way
> > than I am of letting the kernel parse untrusted images from
> > untrusted users...
>
> There is also the more readily available libguestfs which doesn't
> support as many filesystems but does seem available in most linux
> distributions already. It already has a fuse option available
> with guestmount. I may have to dig in there and see how to make
> it available without using fusermount.

That only provides host access to filesystems mounted inside guest
VMs, right? AFAIA, libguestfs is not providing a FUSE
implementation that mounts and parses raw XFS images. e.g it barely
understands anything XFS, and that which it does is via running and
screen-scraping the output of XFS's userspace management tools...

> > I'm not sure what the correct forum for this is - wasn't this
> > something the Plumbers conference was supposed to facilitate?
>
> Yes. If we all need to be in a room and talk about things.
> It is early enough in the planning for Plumers that we could
> definitely schedule a talk or a BOF for this.

Ok. I have no idea if I'll be at plumbers - it's an awful long way
from where I am....

Cheers,

Dave.
--
Dave Chinner
[email protected]