2017-12-22 14:31:24

by Dongsu Park

Subject: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

This patchset v5 is based on work by Seth Forshee and Eric Biederman.
The latest patchset was v4:
https://www.mail-archive.com/[email protected]/msg1132206.html

At the moment, filesystems backed by a physical medium can only be
mounted by real root in the initial user namespace. This restriction
exists because allowing root in a non-init user namespace to mount such
a filesystem would effectively give that user control over the
underlying source of the filesystem. In the case of FUSE, the source
can be any underlying device.

However, in many use cases, such as containers, it is necessary to allow
filesystems to be mounted from non-init user namespaces. The goal of this
patchset is to allow FUSE filesystems to be mounted from non-init user
namespaces. Support for other filesystems, such as ext4, is not in the
scope of this patchset.

The following steps describe how to test mounting from non-init user
namespaces. The tests use sshfs, a userspace filesystem based on FUSE
with ssh as its backend. The test system is Fedora 27.

====
$ sudo dnf install -y sshfs
$ sudo mkdir -p /mnt/userns

### workaround to get past the sshfs permission checks
$ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies

$ unshare -U -r -m
# sshfs root@localhost: /mnt/userns

### You can see sshfs being mounted from a non-init user namespace
# mount | grep sshfs
root@localhost: on /mnt/userns type fuse.sshfs
(rw,nosuid,nodev,relatime,user_id=0,group_id=0)

# touch /mnt/userns/test
# ls -l /mnt/userns/test
-rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
====

Open another terminal and check the mountpoint from outside the namespace.

====
$ grep userns /proc/$(pidof sshfs)/mountinfo
131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
root@localhost: rw,user_id=0,group_id=0
====

After all tests are done, you can unmount the filesystem
inside the namespace.

====
# fusermount -u /mnt/userns
====
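
For anyone who wants to script the namespace setup instead of calling
unshare(1), here is a minimal sketch in C of roughly what
`unshare -U -r -m` does: create new user and mount namespaces, then map
the caller's uid/gid to 0. The helper names (format_id_map, write_file,
enter_userns_as_root) are made up for this illustration, error handling
is minimal, and mount propagation is left untouched.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for unshare() and CLONE_NEW* */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format a one-entry ID map line: "<inside> <outside> 1\n".
 * Returns the line length, or -1 if buf is too small. */
int format_id_map(char *buf, size_t len, unsigned inside, unsigned outside)
{
	assert(buf != NULL && len > 0);
	int n = snprintf(buf, len, "%u %u 1\n", inside, outside);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write a short string to an existing file; 0 on success, -1 on error. */
int write_file(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ssize_t want = (ssize_t)strlen(data);
	ssize_t got = write(fd, data, (size_t)want);
	close(fd);
	return got == want ? 0 : -1;
}

/* Enter new user and mount namespaces and map the caller to root. */
int enter_userns_as_root(void)
{
	char map[64];
	unsigned uid = (unsigned)geteuid();	/* read before unshare() */
	unsigned gid = (unsigned)getegid();

	if (unshare(CLONE_NEWUSER | CLONE_NEWNS) < 0)
		return -1;
	/* setgroups must be denied before an unprivileged gid_map write */
	if (write_file("/proc/self/setgroups", "deny") < 0)
		return -1;
	if (format_id_map(map, sizeof(map), 0, uid) < 0 ||
	    write_file("/proc/self/uid_map", map) < 0)
		return -1;
	if (format_id_map(map, sizeof(map), 0, gid) < 0 ||
	    write_file("/proc/self/gid_map", map) < 0)
		return -1;
	return 0;
}
```

After enter_userns_as_root() returns 0, the process can exec sshfs (or a
shell) and should see itself as uid 0 inside the new namespace.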

Changes since v4:
* Remove other parts like ext4 to keep the patchset minimal for FUSE
* Add and change commit messages
* Describe how to test non-init user namespaces

TODO:
* Think through potential security implications. Two patches are being
prepared to address security issues. One is "ima: define a new policy
option named force" by Mimi Zohar, which adds an option to specify
that the results should not be cached:
https://marc.info/?l=linux-integrity&m=151275680115856&w=2
The other, which prevents FUSE results from being cached, is still in
progress.

* Test IMA/LSMs. Details are written in
https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md

Patches 1-2 add a permission argument to lookup_bdev() so that access
to the block device inode can be checked.

Patches 3-7 allow the superblock owner to change ownership of inodes, and
deal with additional capability checks w.r.t. user namespaces.

Patches 8-10 allow FUSE filesystems to be mounted outside of the init
user namespace.

Patch 11 handles a corner case of non-root users in EVM.

The patchset is also available in our github repo:
https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1


Eric W. Biederman (1):
fs: Allow superblock owner to change ownership of inodes

Seth Forshee (10):
block_dev: Support checking inode permissions in lookup_bdev()
mtd: Check permissions towards mtd block device inode when mounting
fs: Don't remove suid for CAP_FSETID for userns root
fs: Allow superblock owner to access do_remount_sb()
capabilities: Allow privileged user in s_user_ns to set security.*
xattrs
fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
fuse: Support fuse filesystems outside of init_user_ns
fuse: Restrict allow_other to the superblock's namespace or a
descendant
fuse: Allow user namespace mounts
evm: Don't update hmacs in user ns mounts

drivers/md/bcache/super.c | 2 +-
drivers/md/dm-table.c | 2 +-
drivers/mtd/mtdsuper.c | 6 +++++-
fs/attr.c | 34 ++++++++++++++++++++++++++--------
fs/block_dev.c | 13 ++++++++++---
fs/fuse/cuse.c | 3 ++-
fs/fuse/dev.c | 11 ++++++++---
fs/fuse/dir.c | 16 ++++++++--------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 35 +++++++++++++++++++++--------------
fs/inode.c | 6 ++++--
fs/ioctl.c | 4 ++--
fs/namespace.c | 4 ++--
fs/proc/base.c | 7 +++++++
fs/proc/generic.c | 7 +++++++
fs/proc/proc_sysctl.c | 7 +++++++
fs/quota/quota.c | 2 +-
include/linux/fs.h | 2 +-
kernel/user_namespace.c | 1 +
security/commoncap.c | 8 ++++++--
security/integrity/evm/evm_crypto.c | 3 ++-
21 files changed, 127 insertions(+), 52 deletions(-)

--
2.13.6


2017-12-22 14:31:34

by Dongsu Park

Subject: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()

From: Seth Forshee <[email protected]>

Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.

Patch v4 is available: https://patchwork.kernel.org/patch/8944631/

Cc: [email protected]
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
fs/namespace.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index e158ec6b..830040d7 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
* Special case for "unmounting" root ...
* we just try to remount it readonly.
*/
- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
down_write(&sb->s_umount);
if (!sb_rdonly(sb))
@@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
down_write(&sb->s_umount);
if (ms_flags & MS_BIND)
err = change_mount_flags(path->mnt, ms_flags);
- else if (!capable(CAP_SYS_ADMIN))
+ else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
err = -EPERM;
else
err = do_remount_sb(sb, sb_flags, data, 0);
--
2.13.6
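
To make the difference between capable() and ns_capable() concrete, here
is a small userspace toy model of the namespace walk, loosely following
the logic of cap_capable(). The struct layouts and the name
ns_capable_model are assumptions made for this sketch, not kernel code,
and only a single capability bit is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct userns {
	struct userns *parent;	/* NULL for the initial namespace */
	int level;		/* 0 for the initial namespace */
	unsigned owner;		/* euid of the creator, in the parent ns */
};

struct cred {
	struct userns *user_ns;	/* namespace the task lives in */
	unsigned euid;
	bool cap_sys_admin;	/* capability held within user_ns */
};

/* Does cred have CAP_SYS_ADMIN with respect to the target namespace? */
bool ns_capable_model(const struct cred *cred, const struct userns *target)
{
	assert(cred && cred->user_ns && target);
	for (const struct userns *ns = target; ; ns = ns->parent) {
		/* Same namespace: the task's own capability set decides. */
		if (ns == cred->user_ns)
			return cred->cap_sys_admin;
		/* Walked up to the task's level without meeting its
		 * namespace: the task is not in an ancestor of target. */
		if (ns->level <= cred->user_ns->level)
			return false;
		/* The creator of a namespace is privileged over it. */
		if (ns->parent == cred->user_ns && ns->owner == cred->euid)
			return true;
	}
}
```

In this model the old check, capable(CAP_SYS_ADMIN), corresponds to
ns_capable_model(cred, &init_ns), which root inside a container fails.
The new check passes sb->s_user_ns as the target, so root in the
namespace that mounted the filesystem passes, as does anyone privileged
in an ancestor namespace.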

2017-12-22 14:31:38

by Dongsu Park

Subject: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

From: Seth Forshee <[email protected]>

Expand the check in should_remove_suid() to keep privileges for
CAP_FSETID in s_user_ns rather than init_user_ns.

Patch v4 is available: https://patchwork.kernel.org/patch/8944621/

--EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Cc: [email protected]
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
fs/inode.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fd401028..6459a437 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
*/
int should_remove_suid(struct dentry *dentry)
{
- umode_t mode = d_inode(dentry)->i_mode;
+ struct inode *inode = d_inode(dentry);
+ umode_t mode = inode->i_mode;
int kill = 0;

/* suid always must be killed */
@@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
kill |= ATTR_KILL_SGID;

- if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
+ if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
+ S_ISREG(mode)))
return kill;

return 0;
--
2.13.6

2017-12-22 14:31:45

by Dongsu Park

Subject: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

From: Eric W. Biederman <[email protected]>

Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
chown files. Ordinarily the capable_wrt_inode_uidgid check is
sufficient to allow access to files but when the underlying filesystem
has uids or gids that don't map to the current user namespace it is
not enough, so the chown permission checks need to be extended to
allow this case.

Calling chown on filesystem nodes whose uid or gid don't map is
necessary if those nodes are going to be modified, as writing back
inodes which contain uids or gids that don't map is likely to cause
filesystem corruption of the uid or gid fields.

Once chown has been called, the existing capable_wrt_inode_uidgid
checks are sufficient to allow the owner of a superblock to do anything
the global root user can do with an appropriate set of capabilities.

For the proc filesystem this relaxation of permissions is not safe, as
some files are owned by users (particularly GLOBAL_ROOT_UID) outside
the control of the mounter of proc, and granting chown access to those
files would be unsafe. So update setattr on proc to disallow changing
files whose uids or gids are outside of proc's s_user_ns.

The original version of this patch was written by: Seth Forshee. I
have rewritten and rethought this patch enough so it's really not the
same thing (certainly it needs a different description), but he
deserves credit for getting out there and getting the conversation
started, and finding the potential gotchas and putting up with my
semi-paranoid feedback.

Patch v4 is available: https://patchwork.kernel.org/patch/8944611/

Cc: [email protected]
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: "Luis R. Rodriguez" <[email protected]>
Cc: Kees Cook <[email protected]>
Inspired-by: Seth Forshee <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
[saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
Signed-off-by: Dongsu Park <[email protected]>
---
fs/attr.c | 34 ++++++++++++++++++++++++++--------
fs/proc/base.c | 7 +++++++
fs/proc/generic.c | 7 +++++++
fs/proc/proc_sysctl.c | 7 +++++++
4 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 12ffdb6f..bf8e94f3 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,6 +18,30 @@
#include <linux/evm.h>
#include <linux/ima.h>

+static bool chown_ok(const struct inode *inode, kuid_t uid)
+{
+ if (uid_eq(current_fsuid(), inode->i_uid) &&
+ uid_eq(uid, inode->i_uid))
+ return true;
+ if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ return true;
+ if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+ return true;
+ return false;
+}
+
+static bool chgrp_ok(const struct inode *inode, kgid_t gid)
+{
+ if (uid_eq(current_fsuid(), inode->i_uid) &&
+ (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+ return true;
+ if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ return true;
+ if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
+ return true;
+ return false;
+}
+
/**
* setattr_prepare - check if attribute changes to a dentry are allowed
* @dentry: dentry to check
@@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
goto kill_priv;

/* Make sure a caller can chown. */
- if ((ia_valid & ATTR_UID) &&
- (!uid_eq(current_fsuid(), inode->i_uid) ||
- !uid_eq(attr->ia_uid, inode->i_uid)) &&
- !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
return -EPERM;

/* Make sure caller can chgrp. */
- if ((ia_valid & ATTR_GID) &&
- (!uid_eq(current_fsuid(), inode->i_uid) ||
- (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
- !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
+ if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
return -EPERM;

/* Make sure a caller can chmod. */
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 31934cb9..9d50ec92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
{
int error;
struct inode *inode = d_inode(dentry);
+ struct user_namespace *s_user_ns;

if (attr->ia_valid & ATTR_MODE)
return -EPERM;

+ /* Don't let anyone mess with weird proc files */
+ s_user_ns = inode->i_sb->s_user_ns;
+ if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+ !kgid_has_mapping(s_user_ns, inode->i_gid))
+ return -EPERM;
+
error = setattr_prepare(dentry, attr);
if (error)
return error;
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 793a6757..527d46c8 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
{
struct inode *inode = d_inode(dentry);
struct proc_dir_entry *de = PDE(inode);
+ struct user_namespace *s_user_ns;
int error;

+ /* Don't let anyone mess with weird proc files */
+ s_user_ns = inode->i_sb->s_user_ns;
+ if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+ !kgid_has_mapping(s_user_ns, inode->i_gid))
+ return -EPERM;
+
error = setattr_prepare(dentry, iattr);
if (error)
return error;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c5cbbdff..0f9562d1 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = d_inode(dentry);
+ struct user_namespace *s_user_ns;
int error;

if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
return -EPERM;

+ /* Don't let anyone mess with weird proc files */
+ s_user_ns = inode->i_sb->s_user_ns;
+ if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
+ !kgid_has_mapping(s_user_ns, inode->i_gid))
+ return -EPERM;
+
error = setattr_prepare(dentry, attr);
if (error)
return error;
--
2.13.6
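
The new chown_ok() helper boils down to three independent ways of being
allowed to change the owner. The sketch below models that decision in
userspace with each kernel check reduced to a boolean; the struct and
function names are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct chown_request {
	bool caller_owns_inode;	  /* current_fsuid() == inode->i_uid */
	bool uid_unchanged;	  /* requested uid == inode->i_uid */
	bool cap_chown_wrt_inode; /* capable_wrt_inode_uidgid(.., CAP_CHOWN) */
	bool cap_chown_in_sb_ns;  /* ns_capable(sb->s_user_ns, CAP_CHOWN) */
};

/* Model of chown_ok(): any one of the three conditions is enough. */
bool chown_ok_model(const struct chown_request *r)
{
	assert(r != NULL);
	if (r->caller_owns_inode && r->uid_unchanged)
		return true;	/* owner "changing" the uid to itself */
	if (r->cap_chown_wrt_inode)
		return true;	/* pre-existing capability check */
	return r->cap_chown_in_sb_ns;	/* new: superblock owner */
}

/* Model of the extra guard added to the proc setattr handlers. */
bool proc_setattr_allowed(bool uid_mapped_in_sb_ns, bool gid_mapped_in_sb_ns)
{
	return uid_mapped_in_sb_ns && gid_mapped_in_sb_ns;
}
```

The interesting case is an inode whose uid does not map into the
caller's namespace: cap_chown_wrt_inode is false there, so only the new
cap_chown_in_sb_ns condition lets the superblock owner fix up the
ownership. For proc, the guard runs first and rejects such inodes.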

2017-12-22 14:32:35

by Dongsu Park

Subject: [PATCH 10/11] fuse: Allow user namespace mounts

From: Seth Forshee <[email protected]>

To be able to mount fuse from non-init user namespaces, it's necessary
to set the FS_USERNS_MOUNT flag in fs_flags.

Patch v4 is available: https://patchwork.kernel.org/patch/8944681/

Cc: [email protected]
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
[dongsu: add a simple commit message]
Signed-off-by: Dongsu Park <[email protected]>
---
fs/fuse/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 7f6b2e55..8c98edee 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
static struct file_system_type fuse_fs_type = {
.owner = THIS_MODULE,
.name = "fuse",
- .fs_flags = FS_HAS_SUBTYPE,
+ .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
.mount = fuse_mount,
.kill_sb = fuse_kill_sb_anon,
};
@@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
.name = "fuseblk",
.mount = fuse_mount_blk,
.kill_sb = fuse_kill_sb_blk,
- .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+ .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
};
MODULE_ALIAS_FS("fuseblk");

--
2.13.6

2017-12-22 14:32:41

by Dongsu Park

Subject: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

From: Seth Forshee <[email protected]>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Patch v4 is available: https://patchwork.kernel.org/patch/8944671/

Cc: [email protected]
Cc: [email protected]
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
fs/fuse/dir.c | 2 +-
kernel/user_namespace.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1..d41559a0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
const struct cred *cred;

if (fc->allow_other)
- return 1;
+ return current_in_userns(fc->user_ns);

cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4c..492c255e 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
{
return in_userns(target_ns, current_user_ns());
}
+EXPORT_SYMBOL(current_in_userns);

static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
--
2.13.6
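
The "same userns or a descendant" rule can be illustrated with a small
userspace model of the ancestor walk that in_userns() performs. The
struct below is a stripped-down stand-in for struct user_namespace, and
the *_model names are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct userns {
	struct userns *parent;	/* NULL for the initial namespace */
	int level;		/* depth in the tree; 0 for the initial ns */
};

/* True if ns is ancestor itself or one of its descendants. */
bool in_userns_model(const struct userns *ancestor, const struct userns *ns)
{
	assert(ancestor != NULL && ns != NULL);
	/* climb until we are no deeper than the candidate ancestor */
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Model of the allow_other branch in fuse_allow_current_process(). */
bool allow_other_permits(const struct userns *conn_ns,
			 const struct userns *caller_ns)
{
	return in_userns_model(conn_ns, caller_ns);
}
```

A caller in the mounting namespace or in any namespace nested below it
is admitted; a caller in the parent or in a sibling namespace is not.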

2017-12-22 14:32:44

by Dongsu Park

Subject: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

From: Seth Forshee <[email protected]>

In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:

- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.

- The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the checks in setattr_prepare (formerly inode_change_ok) for
ownership changes.
Either restriction could be relaxed in the future if needed.

For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.

Patch v4 is available: https://patchwork.kernel.org/patch/8944661/

Cc: [email protected]
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
fs/fuse/cuse.c | 3 ++-
fs/fuse/dev.c | 11 ++++++++---
fs/fuse/dir.c | 14 +++++++-------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 31 +++++++++++++++++++------------
5 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803..b1b83259 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/uio.h>
+#include <linux/user_namespace.h>

#include "fuse_i.h"

@@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
if (!cc)
return -ENOMEM;

- fuse_conn_init(&cc->fc);
+ fuse_conn_init(&cc->fc, current_user_ns());

fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 17f0d05b..0f780e16 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)

static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
}

@@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
+ if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
+ fuse_put_request(fc, req);
+ return ERR_PTR(-EOVERFLOW);
+ }

return req;

@@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
in = &req->in;
reqsize = in->h.len;

- if (task_active_pid_ns(current) != fc->pid_ns) {
+ if (task_active_pid_ns(current) != fc->pid_ns ||
+ current_user_ns() != fc->user_ns) {
rcu_read_lock();
in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
rcu_read_unlock();
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382..ad1cfac1 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
- stat->uid = make_kuid(&init_user_ns, attr->uid);
- stat->gid = make_kgid(&init_user_ns, attr->gid);
+ stat->uid = make_kuid(fc->user_ns, attr->uid);
+ stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
}

-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
- bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+ struct fuse_setattr_in *arg, bool trust_local_cmtime)
{
unsigned ivalid = iattr->ia_valid;

if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
- arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+ arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
if (ivalid & ATTR_GID)
- arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+ arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
if (ivalid & ATTR_SIZE)
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,

memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
- iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+ iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
if (file) {
struct fuse_file *ff = file->private_data;
inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d5773ca6..364e65c8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
#include <linux/xattr.h>
#include <linux/pid_namespace.h>
#include <linux/refcount.h>
+#include <linux/user_namespace.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

+ /** The user namespace for this mount */
+ struct user_namespace *user_ns;
+
/** Maximum read size */
unsigned max_read;

@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);

/**
* Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2f504d61..7f6b2e55 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_ino = fuse_squash_ino(attr->ino);
inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
set_nlink(inode, attr->nlink);
- inode->i_uid = make_kuid(&init_user_ns, attr->uid);
- inode->i_gid = make_kgid(&init_user_ns, attr->gid);
+ inode->i_uid = make_kuid(fc->user_ns, attr->uid);
+ inode->i_gid = make_kgid(fc->user_ns, attr->gid);
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
return err;
}

-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+ struct user_namespace *user_ns)
{
char *p;
memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_USER_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->user_id = make_kuid(current_user_ns(), uv);
+ d->user_id = make_kuid(user_ns, uv);
if (!uid_valid(d->user_id))
return 0;
d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_GROUP_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->group_id = make_kgid(current_user_ns(), uv);
+ d->group_id = make_kgid(user_ns, uv);
if (!gid_valid(d->group_id))
return 0;
d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);

- seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+ seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+ seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
if (fc->default_permissions)
seq_puts(m, ",default_permissions");
if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}

-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->user_ns = get_user_ns(user_ns);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
put_pid_ns(fc->pid_ns);
+ put_user_ns(fc->user_ns);
fc->release(fc);
}
}
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)

sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);

- if (!parse_fuse_opt(data, &d, is_bdev))
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
goto err;

if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!file)
goto err;

- if ((file->f_op != &fuse_dev_operations) ||
- (file->f_cred->user_ns != &init_user_ns))
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if (file->f_op != &fuse_dev_operations ||
+ file->f_cred->user_ns != sb->s_user_ns)
goto err_fput;

fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!fc)
goto err_fput;

- fuse_conn_init(fc);
+ fuse_conn_init(fc, sb->s_user_ns);
fc->release = fuse_free_conn;

fud = fuse_dev_alloc(fc);
--
2.13.6
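
The uid/gid translation this patch relies on can be sketched with a
simplified userspace model of an ID map. map_down() plays the role of
make_kuid() (namespace-local id to kernel id) and map_up() the role of
from_kuid() (kernel id to namespace-local id). The data structures and
names here are assumptions for illustration and are much simpler than
the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

struct id_extent {
	uint32_t first;		/* first id inside the namespace */
	uint32_t lower_first;	/* first id in the kernel's view */
	uint32_t count;		/* number of ids in the extent */
};

struct id_map {
	size_t nr;
	struct id_extent ext[5];
};

/* Namespace-local id -> kernel id; INVALID_ID if unmapped. */
uint32_t map_down(const struct id_map *m, uint32_t id)
{
	assert(m != NULL);
	for (size_t i = 0; i < m->nr; i++) {
		const struct id_extent *e = &m->ext[i];
		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return INVALID_ID;
}

/* Kernel id -> namespace-local id; INVALID_ID if unmapped. */
uint32_t map_up(const struct id_map *m, uint32_t kid)
{
	assert(m != NULL);
	for (size_t i = 0; i < m->nr; i++) {
		const struct id_extent *e = &m->ext[i];
		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return INVALID_ID;
}

/* Mirrors the new check in __fuse_get_req(): refuse unmapped callers. */
int fill_request_uid(const struct id_map *conn_map, uint32_t caller_kuid,
		     uint32_t *out)
{
	uint32_t uid = map_up(conn_map, caller_kuid);
	if (uid == INVALID_ID)
		return -EOVERFLOW;
	*out = uid;
	return 0;
}
```

With a map of "0 100000 65536", a request from kernel uid 100005 reaches
the FUSE server as uid 5, while a request from kernel uid 1000 (outside
the map) is rejected with -EOVERFLOW instead of being sent with a
munged id.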

2017-12-22 14:32:54

by Dongsu Park

Subject: [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

From: Seth Forshee <[email protected]>

A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.

The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.

Patch v4 is available: https://patchwork.kernel.org/patch/8944641/

Cc: [email protected]
Cc: [email protected]
Cc: James Morris <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
security/commoncap.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 4f8e0934..dd0afef9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
int cap_inode_setxattr(struct dentry *dentry, const char *name,
const void *value, size_t size, int flags)
{
+ struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
/* Ignore non-security xattrs */
if (strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
if (strcmp(name, XATTR_NAME_CAPS) == 0)
return 0;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
@@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
*/
int cap_inode_removexattr(struct dentry *dentry, const char *name)
{
+ struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
+
/* Ignore non-security xattrs */
if (strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
@@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
return 0;
}

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
--
2.13.6
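
The resulting flow of cap_inode_setxattr() can be modelled in userspace
as below. The namespace-aware capability check is reduced to a boolean
argument, and the function name is invented for this sketch; the LSM
hooks mentioned in the commit message are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define SECURITY_PREFIX	"security."
#define NAME_CAPS	"security.capability"

/*
 * admin_in_sb_userns stands in for
 * ns_capable(dentry->d_sb->s_user_ns, CAP_SYS_ADMIN).
 * Returns 0 if the xattr write may proceed, -EPERM otherwise.
 */
int cap_inode_setxattr_model(const char *name, bool admin_in_sb_userns)
{
	assert(name != NULL);

	/* Ignore non-security xattrs */
	if (strncmp(name, SECURITY_PREFIX, sizeof(SECURITY_PREFIX) - 1) != 0)
		return 0;

	/* security.capability is checked elsewhere */
	if (strcmp(name, NAME_CAPS) == 0)
		return 0;

	return admin_in_sb_userns ? 0 : -EPERM;
}
```

Before the patch the boolean would only be true for CAP_SYS_ADMIN in
init_user_ns; afterwards root in the namespace that owns the superblock
also passes this particular check.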

2017-12-22 14:32:50

by Dongsu Park

Subject: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

From: Seth Forshee <[email protected]>

The user in control of a super block should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.

Cc: [email protected]
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
fs/ioctl.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb..8c628a8d 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
{
struct super_block *sb = file_inode(filp)->i_sb;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;

/* If filesystem doesn't support freeze feature, return. */
@@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
{
struct super_block *sb = file_inode(filp)->i_sb;

- if (!capable(CAP_SYS_ADMIN))
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;

/* Thaw */
--
2.13.6
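
For testing this change from userspace, the two ioctls can be issued
with a few lines of C. This is a minimal sketch; the function name
fs_freeze_or_thaw is made up, and it should only be pointed at a
filesystem you are prepared to have frozen.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Freeze (freeze != 0) or thaw (freeze == 0) the filesystem that
 * contains path. Returns 0 on success or a negative errno value.
 */
int fs_freeze_or_thaw(const char *path, int freeze)
{
	assert(path != NULL);

	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	int ret = 0;
	if (ioctl(fd, freeze ? FIFREEZE : FITHAW, 0) < 0)
		ret = -errno;	/* save errno before close() can change it */

	close(fd);
	return ret;
}
```

Run inside the user namespace that mounted the filesystem, the call is
expected to get past the capability check with this patch applied, and
to return -EPERM without it. Filesystems that do not implement freezing
may still fail with a different error.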

2017-12-22 14:32:30

by Dongsu Park

Subject: [PATCH 11/11] evm: Don't update hmacs in user ns mounts

From: Seth Forshee <[email protected]>

The kernel should not calculate new hmacs for mounts done by
non-root users. Update evm_calc_hmac_or_hash() to refuse to
calculate new hmacs for mounts from non-init user namespaces.

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: James Morris <[email protected]>
Cc: Mimi Zohar <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
security/integrity/evm/evm_crypto.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
index bcd64baf..729f4545 100644
--- a/security/integrity/evm/evm_crypto.c
+++ b/security/integrity/evm/evm_crypto.c
@@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
int error;
int size;

- if (!(inode->i_opflags & IOP_XATTR))
+ if (!(inode->i_opflags & IOP_XATTR) ||
+ inode->i_sb->s_user_ns != &init_user_ns)
return -EOPNOTSUPP;

desc = init_desc(type);
--
2.13.6
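
The guard added to evm_calc_hmac_or_hash() can be modeled in userspace. This is only a minimal sketch: the `model_*` structures and `model_evm_may_calc_hmac()` are hypothetical stand-ins, not kernel definitions, and the boolean `has_xattr_ops` stands in for the `IOP_XATTR` flag test.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel structures; not the real definitions. */
struct model_userns { int id; };
struct model_sb { const struct model_userns *s_user_ns; };
struct model_inode {
	bool has_xattr_ops;            /* models (i_opflags & IOP_XATTR) */
	const struct model_sb *i_sb;
};

/* Plays the role of the kernel's global init_user_ns. */
const struct model_userns model_init_user_ns = { 0 };

/* Returns 0 if an HMAC may be calculated, -EOPNOTSUPP otherwise. */
int model_evm_may_calc_hmac(const struct model_inode *inode)
{
	if (!inode->has_xattr_ops ||
	    inode->i_sb->s_user_ns != &model_init_user_ns)
		return -EOPNOTSUPP;
	return 0;
}
```

In this model, an inode on a superblock owned by any namespace other than the initial one is refused, regardless of its xattr support.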

2017-12-22 14:34:27

by Dongsu Park

Subject: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

From: Seth Forshee <[email protected]>

Unprivileged users should not be able to mount mtd block devices
when they lack sufficient privileges towards the block device
inode. Update mount_mtd() to validate that the user has the
required access to the inode at the specified path. The check
will be skipped for CAP_SYS_ADMIN, so privileged mounts will
continue working as before.

Patch v3 is available: https://patchwork.kernel.org/patch/7640011/

Cc: [email protected]
Cc: [email protected]
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
drivers/mtd/mtdsuper.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 4a4d40c0..3c8734f3 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
#ifdef CONFIG_BLOCK
struct block_device *bdev;
int ret, major;
+ int perm;
#endif
int mtdnr;

@@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
/* try the old way - the hack where we allowed users to mount
* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
*/
- bdev = lookup_bdev(dev_name, 0);
+ perm = MAY_READ;
+ if (!(flags & MS_RDONLY))
+ perm |= MAY_WRITE;
+ bdev = lookup_bdev(dev_name, perm);
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
--
2.13.6
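
The way mount_mtd() derives the permission mask from the mount flags can be sketched as a standalone function. The `MODEL_*` constants are local copies made for illustration (they mirror the values of `MAY_WRITE`, `MAY_READ` and `MS_RDONLY`), and `model_mount_perm()` is a hypothetical name, not a kernel function.

```c
/* Local copies for illustration; the kernel defines the real constants. */
#define MODEL_MAY_WRITE 0x2
#define MODEL_MAY_READ  0x4
#define MODEL_MS_RDONLY 0x1

/* Read access is always required; write access only for read-write mounts. */
int model_mount_perm(int flags)
{
	int perm = MODEL_MAY_READ;

	if (!(flags & MODEL_MS_RDONLY))
		perm |= MODEL_MAY_WRITE;
	return perm;
}
```

A read-only mount therefore needs only read access to the device inode, while a read-write mount needs both read and write access.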

2017-12-22 14:34:41

by Dongsu Park

Subject: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

From: Seth Forshee <[email protected]>

When looking up a block device by path no permission check is
done to verify that the user has access to the block device inode
at the specified path. In some cases it may be necessary to
check permissions towards the inode, such as allowing
unprivileged users to mount block devices in user namespaces.

Add an argument to lookup_bdev() to optionally perform this
permission check. A value of 0 skips the permission check and
behaves the same as before. A non-zero value specifies the mask
of access rights required towards the inode at the specified
path. The check is always skipped if the user has CAP_SYS_ADMIN.

All callers of lookup_bdev() currently pass a mask of 0, so this
patch results in no functional change. Subsequent patches will
add permission checks where appropriate.

Patch v4 is available: https://patchwork.kernel.org/patch/8943601/

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
---
drivers/md/bcache/super.c | 2 +-
drivers/md/dm-table.c | 2 +-
drivers/mtd/mtdsuper.c | 2 +-
fs/block_dev.c | 13 ++++++++++---
fs/quota/quota.c | 2 +-
include/linux/fs.h | 2 +-
6 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d28928..acc9d56c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
sb);
if (IS_ERR(bdev)) {
if (bdev == ERR_PTR(-EBUSY)) {
- bdev = lookup_bdev(strim(path));
+ bdev = lookup_bdev(strim(path), 0);
mutex_lock(&bch_register_lock);
if (!IS_ERR(bdev) && bch_is_open(bdev))
err = "device already registered";
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88130b5d..bca5eaf4 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
dev_t dev;
struct block_device *bdev;

- bdev = lookup_bdev(path);
+ bdev = lookup_bdev(path, 0);
if (IS_ERR(bdev))
dev = name_to_dev_t(path);
else {
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index e43fea89..4a4d40c0 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
/* try the old way - the hack where we allowed users to mount
* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
*/
- bdev = lookup_bdev(dev_name);
+ bdev = lookup_bdev(dev_name, 0);
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb..5ca06095 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
struct block_device *bdev;
int err;

- bdev = lookup_bdev(path);
+ bdev = lookup_bdev(path, 0);
if (IS_ERR(bdev))
return bdev;

@@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
/**
* lookup_bdev - lookup a struct block_device by name
* @pathname: special file representing the block device
+ * @mask: rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
* Get a reference to the blockdevice at @pathname in the current
* namespace if possible and return it. Return ERR_PTR(error)
- * otherwise.
+ * otherwise. If @mask is non-zero, check for access rights to the
+ * inode at @pathname.
*/
-struct block_device *lookup_bdev(const char *pathname)
+struct block_device *lookup_bdev(const char *pathname, int mask)
{
struct block_device *bdev;
struct inode *inode;
@@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
return ERR_PTR(error);

inode = d_backing_inode(path.dentry);
+ if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
+ error = __inode_permission(inode, mask);
+ if (error)
+ goto fail;
+ }
error = -ENOTBLK;
if (!S_ISBLK(inode->i_mode))
goto fail;
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 43612e2a..e5d47955 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)

if (IS_ERR(tmp))
return ERR_CAST(tmp);
- bdev = lookup_bdev(tmp->name);
+ bdev = lookup_bdev(tmp->name, 0);
putname(tmp);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2995a271..fce19c49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2551,7 +2551,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
#define BLKDEV_MAJOR_MAX 512
extern const char *__bdevname(dev_t, char *buffer);
extern const char *bdevname(struct block_device *bdev, char *buffer);
-extern struct block_device *lookup_bdev(const char *);
+extern struct block_device *lookup_bdev(const char *, int mask);
extern void blkdev_show(struct seq_file *,off_t);

#else
--
2.13.6
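
The three cases described in the commit message (mask of 0, caller with CAP_SYS_ADMIN, and an actual permission check) can be modeled as follows. This is a sketch under stated assumptions: `struct model_caller` and `model_lookup_check()` are hypothetical, the rights comparison is a simplification of what `__inode_permission()` does, and `-EACCES` is assumed as the failure code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of the caller; not a kernel structure. */
struct model_caller {
	bool has_cap_sys_admin;  /* models capable(CAP_SYS_ADMIN) */
	int inode_rights;        /* rights the caller holds on the device inode */
};

/* Returns 0 if the lookup may proceed, -EACCES if the mask check fails. */
int model_lookup_check(const struct model_caller *c, int mask)
{
	/* A mask of 0 or a privileged caller skips the check entirely. */
	if (mask != 0 && !c->has_cap_sys_admin) {
		if ((c->inode_rights & mask) != mask)
			return -EACCES;
	}
	return 0;
}
```

Because every existing caller passes a mask of 0, the first condition is never true for them, which matches the statement that the patch causes no functional change.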

2017-12-22 19:00:26

by Coly Li

Subject: Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

On 22/12/2017 10:32 PM, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
>
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
>
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>

Hi Dongsu,

Could you please use a macro like NO_PERMISSION_CHECK to replace the
hard-coded 0? Then, at least for me, there would be no need to check what
0 means in the new lookup_bdev().

Thanks.

Coly Li

> ---
> drivers/md/bcache/super.c | 2 +-
> drivers/md/dm-table.c | 2 +-
> drivers/mtd/mtdsuper.c | 2 +-
> fs/block_dev.c | 13 ++++++++++---
> fs/quota/quota.c | 2 +-
> include/linux/fs.h | 2 +-
> 6 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
> sb);
> if (IS_ERR(bdev)) {
> if (bdev == ERR_PTR(-EBUSY)) {
> - bdev = lookup_bdev(strim(path));
> + bdev = lookup_bdev(strim(path), 0);
> mutex_lock(&bch_register_lock);
> if (!IS_ERR(bdev) && bch_is_open(bdev))
> err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
[snip]


--
Coly Li
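
The suggestion above could look roughly like this. `NO_PERMISSION_CHECK` is the name proposed in this review and is not part of the posted patch; `model_check_requested()` is a hypothetical helper that exists only to show how the constant would read at a call site.

```c
#include <stdbool.h>

/* Name proposed in the review; not defined by the posted patch. */
#define NO_PERMISSION_CHECK 0

/* True if the caller asked for an inode permission check. */
bool model_check_requested(int mask)
{
	return mask != NO_PERMISSION_CHECK;
}
```

With such a constant, a call like `lookup_bdev(path, NO_PERMISSION_CHECK)` would document its intent without the reader having to look up the meaning of the second argument.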

2017-12-22 21:06:18

by Richard Weinberger

Subject: Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

Dongsu,

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> From: Seth Forshee <[email protected]>
>
> Unprivileged users should not be able to mount mtd block devices
> when they lack sufficient privileges towards the block device
> inode. Update mount_mtd() to validate that the user has the
> required access to the inode at the specified path. The check
> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> continue working as before.

What is the big picture here?
Will an unprivileged user be able to just mount UBIFS in the future?
Please note that UBIFS sits on top of a character device, not a block device.

--
Thanks,
//richard

2017-12-23 03:03:14

by Serge E. Hallyn

Subject: Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

On Fri, Dec 22, 2017 at 03:32:25PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> When looking up a block device by path no permission check is
> done to verify that the user has access to the block device inode
> at the specified path. In some cases it may be necessary to
> check permissions towards the inode, such as allowing
> unprivileged users to mount block devices in user namespaces.
>
> Add an argument to lookup_bdev() to optionally perform this
> permission check. A value of 0 skips the permission check and
> behaves the same as before. A non-zero value specifies the mask
> of access rights required towards the inode at the specified
> path. The check is always skipped if the user has CAP_SYS_ADMIN.
>
> All callers of lookup_bdev() currently pass a mask of 0, so this
> patch results in no functional change. Subsequent patches will
> add permission checks where appropriate.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8943601/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Serge Hallyn <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> drivers/md/bcache/super.c | 2 +-
> drivers/md/dm-table.c | 2 +-
> drivers/mtd/mtdsuper.c | 2 +-
> fs/block_dev.c | 13 ++++++++++---
> fs/quota/quota.c | 2 +-
> include/linux/fs.h | 2 +-
> 6 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index b4d28928..acc9d56c 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
> sb);
> if (IS_ERR(bdev)) {
> if (bdev == ERR_PTR(-EBUSY)) {
> - bdev = lookup_bdev(strim(path));
> + bdev = lookup_bdev(strim(path), 0);
> mutex_lock(&bch_register_lock);
> if (!IS_ERR(bdev) && bch_is_open(bdev))
> err = "device already registered";
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 88130b5d..bca5eaf4 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -410,7 +410,7 @@ dev_t dm_get_dev_t(const char *path)
> dev_t dev;
> struct block_device *bdev;
>
> - bdev = lookup_bdev(path);
> + bdev = lookup_bdev(path, 0);
> if (IS_ERR(bdev))
> dev = name_to_dev_t(path);
> else {
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index e43fea89..4a4d40c0 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -180,7 +180,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
> /* try the old way - the hack where we allowed users to mount
> * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
> */
> - bdev = lookup_bdev(dev_name);
> + bdev = lookup_bdev(dev_name, 0);
> if (IS_ERR(bdev)) {
> ret = PTR_ERR(bdev);
> pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb..5ca06095 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1662,7 +1662,7 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
> struct block_device *bdev;
> int err;
>
> - bdev = lookup_bdev(path);
> + bdev = lookup_bdev(path, 0);
> if (IS_ERR(bdev))
> return bdev;
>
> @@ -2052,12 +2052,14 @@ EXPORT_SYMBOL(ioctl_by_bdev);
> /**
> * lookup_bdev - lookup a struct block_device by name
> * @pathname: special file representing the block device
> + * @mask: rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> *
> * Get a reference to the blockdevice at @pathname in the current
> * namespace if possible and return it. Return ERR_PTR(error)
> - * otherwise.
> + * otherwise. If @mask is non-zero, check for access rights to the
> + * inode at @pathname.
> */
> -struct block_device *lookup_bdev(const char *pathname)
> +struct block_device *lookup_bdev(const char *pathname, int mask)
> {
> struct block_device *bdev;
> struct inode *inode;
> @@ -2072,6 +2074,11 @@ struct block_device *lookup_bdev(const char *pathname)
> return ERR_PTR(error);
>
> inode = d_backing_inode(path.dentry);
> + if (mask != 0 && !capable(CAP_SYS_ADMIN)) {
> + error = __inode_permission(inode, mask);
> + if (error)
> + goto fail;
> + }
> error = -ENOTBLK;
> if (!S_ISBLK(inode->i_mode))
> goto fail;
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 43612e2a..e5d47955 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -807,7 +807,7 @@ static struct super_block *quotactl_block(const char __user *special, int cmd)
>
> if (IS_ERR(tmp))
> return ERR_CAST(tmp);
> - bdev = lookup_bdev(tmp->name);
> + bdev = lookup_bdev(tmp->name, 0);
> putname(tmp);
> if (IS_ERR(bdev))
> return ERR_CAST(bdev);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2995a271..fce19c49 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2551,7 +2551,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
> #define BLKDEV_MAJOR_MAX 512
> extern const char *__bdevname(dev_t, char *buffer);
> extern const char *bdevname(struct block_device *bdev, char *buffer);
> -extern struct block_device *lookup_bdev(const char *);
> +extern struct block_device *lookup_bdev(const char *, int mask);
> extern void blkdev_show(struct seq_file *,off_t);
>
> #else
> --
> 2.13.6

2017-12-23 03:05:18

by Serge E. Hallyn

Subject: Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

On Fri, Dec 22, 2017 at 03:32:26PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> Unprivileged users should not be able to mount mtd block devices
> when they lack sufficient privileges towards the block device
> inode. Update mount_mtd() to validate that the user has the
> required access to the inode at the specified path. The check
> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> continue working as before.
>
> Patch v3 is available: https://patchwork.kernel.org/patch/7640011/
>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> ---
> drivers/mtd/mtdsuper.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
> index 4a4d40c0..3c8734f3 100644
> --- a/drivers/mtd/mtdsuper.c
> +++ b/drivers/mtd/mtdsuper.c
> @@ -129,6 +129,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
> #ifdef CONFIG_BLOCK
> struct block_device *bdev;
> int ret, major;
> + int perm;
> #endif
> int mtdnr;
>
> @@ -180,7 +181,10 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
> /* try the old way - the hack where we allowed users to mount
> * /dev/mtdblock$(n) but didn't actually _use_ the blockdev
> */
> - bdev = lookup_bdev(dev_name, 0);
> + perm = MAY_READ;
> + if (!(flags & MS_RDONLY))
> + perm |= MAY_WRITE;
> + bdev = lookup_bdev(dev_name, perm);
> if (IS_ERR(bdev)) {
> ret = PTR_ERR(bdev);
> pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
> --
> 2.13.6
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2017-12-23 03:17:44

by Serge E. Hallyn

Subject: Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> From: Eric W. Biederman <[email protected]>
>
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to

Note that the capability is CAP_CHOWN, not CAP_SYS_CHOWN.

> chown files. Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
>
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.
>
> Once chown has been called the existing capable_wrt_inode_uidgid
> checks are sufficient, to allow the owner of a superblock to do anything
> the global root user can do with an appropriate set of capabilities.
>
> For the proc filesystem this relaxation of permissions is not safe, as
> some files are owned by users (particularly GLOBAL_ROOT_UID) outside
> of the control of the mounter of the proc and that would be unsafe to
> grant chown access to. So update setattr on proc to disallow changing
> files whose uids or gids are outside of proc's s_user_ns.
>
> The original version of this patch was written by: Seth Forshee. I
> have rewritten and rethought this patch enough so it's really not the
> same thing (certainly it needs a different description), but he
> deserves credit for getting out there and getting the conversation
> started, and finding the potential gotcha's and putting up with my
> semi-paranoid feedback.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944611/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: "Luis R. Rodriguez" <[email protected]>
> Cc: Kees Cook <[email protected]>
> Inspired-by: Seth Forshee <[email protected]>
> Signed-off-by: Eric W. Biederman <[email protected]>
> [saf: Resolve conflicts caused by s/inode_change_ok/setattr_prepare/]
> Signed-off-by: Dongsu Park <[email protected]>

Reviewed-by: Serge Hallyn <[email protected]>

> ---
> fs/attr.c | 34 ++++++++++++++++++++++++++--------
> fs/proc/base.c | 7 +++++++
> fs/proc/generic.c | 7 +++++++
> fs/proc/proc_sysctl.c | 7 +++++++
> 4 files changed, 47 insertions(+), 8 deletions(-)
>
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
> #include <linux/evm.h>
> #include <linux/ima.h>
>
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + uid_eq(uid, inode->i_uid))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> /**
> * setattr_prepare - check if attribute changes to a dentry are allowed
> * @dentry: dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
> goto kill_priv;
>
> /* Make sure a caller can chown. */
> - if ((ia_valid & ATTR_UID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - !uid_eq(attr->ia_uid, inode->i_uid)) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
> return -EPERM;
>
> /* Make sure caller can chgrp. */
> - if ((ia_valid & ATTR_GID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
> return -EPERM;
>
> /* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
> {
> int error;
> struct inode *inode = d_inode(dentry);
> + struct user_namespace *s_user_ns;
>
> if (attr->ia_valid & ATTR_MODE)
> return -EPERM;
>
> + /* Don't let anyone mess with weird proc files */
> + s_user_ns = inode->i_sb->s_user_ns;
> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> + !kgid_has_mapping(s_user_ns, inode->i_gid))
> + return -EPERM;
> +
> error = setattr_prepare(dentry, attr);
> if (error)
> return error;
> diff --git a/fs/proc/generic.c b/fs/proc/generic.c
> index 793a6757..527d46c8 100644
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -106,8 +106,15 @@ static int proc_notify_change(struct dentry *dentry, struct iattr *iattr)
> {
> struct inode *inode = d_inode(dentry);
> struct proc_dir_entry *de = PDE(inode);
> + struct user_namespace *s_user_ns;
> int error;
>
> + /* Don't let anyone mess with weird proc files */
> + s_user_ns = inode->i_sb->s_user_ns;
> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> + !kgid_has_mapping(s_user_ns, inode->i_gid))
> + return -EPERM;
> +
> error = setattr_prepare(dentry, iattr);
> if (error)
> return error;
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index c5cbbdff..0f9562d1 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -802,11 +802,18 @@ static int proc_sys_permission(struct inode *inode, int mask)
> static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
> {
> struct inode *inode = d_inode(dentry);
> + struct user_namespace *s_user_ns;
> int error;
>
> if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
> return -EPERM;
>
> + /* Don't let anyone mess with weird proc files */
> + s_user_ns = inode->i_sb->s_user_ns;
> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> + !kgid_has_mapping(s_user_ns, inode->i_gid))
> + return -EPERM;
> +
> error = setattr_prepare(dentry, attr);
> if (error)
> return error;
> --
> 2.13.6
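
The decision order in the quoted chown_ok() helper can be modeled with plain booleans. This is a simplified sketch: `struct model_chown_ctx` and `model_chown_ok()` are hypothetical, and the two capability fields stand for the results of `capable_wrt_inode_uidgid(inode, CAP_CHOWN)` and `ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN)` rather than computing them.

```c
#include <stdbool.h>

/* Hypothetical inputs to the decision; not a kernel structure. */
struct model_chown_ctx {
	unsigned int fsuid;      /* caller's filesystem uid */
	unsigned int inode_uid;  /* current owner of the inode */
	bool cap_wrt_inode;      /* models capable_wrt_inode_uidgid(inode, CAP_CHOWN) */
	bool cap_in_sb_userns;   /* models ns_capable(sb->s_user_ns, CAP_CHOWN) */
};

bool model_chown_ok(const struct model_chown_ctx *c, unsigned int new_uid)
{
	/* The owner may "chown" a file to its existing owner. */
	if (c->fsuid == c->inode_uid && new_uid == c->inode_uid)
		return true;
	/* Existing rule: CAP_CHOWN with the inode's ids mapped. */
	if (c->cap_wrt_inode)
		return true;
	/* New rule: CAP_CHOWN over the superblock's user namespace. */
	if (c->cap_in_sb_userns)
		return true;
	return false;
}
```

The third branch is what lets the superblock owner fix up inodes whose uid or gid does not map into the current namespace, the case in which the second branch fails.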

2017-12-23 03:26:11

by Serge E. Hallyn

Subject: Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> Expand the check in should_remove_suid() to keep privileges for

I realize this description came from Seth, but reading it now,
'Expand' seems wrong. To my mind, expanding a check means making
it stricter, not looser. How about 'Relax the check'?

> CAP_FSETID in s_user_ns rather than init_user_ns.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>
> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid

Why exactly?

This is wrong, because capable_wrt_inode_uidgid() checks against
current_user_ns(), not inode->i_sb->s_user_ns.

>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/inode.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index fd401028..6459a437 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
> */
> int should_remove_suid(struct dentry *dentry)
> {
> - umode_t mode = d_inode(dentry)->i_mode;
> + struct inode *inode = d_inode(dentry);
> + umode_t mode = inode->i_mode;
> int kill = 0;
>
> /* suid always must be killed */
> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
> if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
> kill |= ATTR_KILL_SGID;
>
> - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
> + if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
> + S_ISREG(mode)))
> return kill;
>
> return 0;
> --
> 2.13.6
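
The difference raised in this review can be illustrated with a small userspace model in which the two checks disagree. This is a sketch only: all `model_*` names are hypothetical, the namespace walk ignores the namespace-owner rule of the real capability code, and whether an inode's ids map into the task's namespace is passed in as a boolean instead of being computed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace with a parent link; not the kernel struct. */
struct model_ns { const struct model_ns *parent; };

struct model_task {
	const struct model_ns *user_ns;  /* models current_user_ns() */
	bool has_cap;                    /* capability held in user_ns */
};

/* Simplified ns_capable(): the capability covers the task's own
 * namespace and every namespace below it. */
bool model_ns_capable(const struct model_task *t, const struct model_ns *target)
{
	const struct model_ns *ns;

	if (!t->has_cap)
		return false;
	for (ns = target; ns != NULL; ns = ns->parent)
		if (ns == t->user_ns)
			return true;
	return false;
}

/* Simplified capable_wrt_inode_uidgid(): checked against the task's
 * own namespace, plus a mapping test for the inode's ids. */
bool model_capable_wrt_inode(const struct model_task *t, bool ids_mapped_in_task_ns)
{
	return model_ns_capable(t, t->user_ns) && ids_mapped_in_task_ns;
}
```

In this model, a task that holds the capability in a namespace nested below the superblock's s_user_ns passes the inode-relative check when the ids map, yet fails a check made against s_user_ns itself.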

2017-12-23 03:30:48

by Serge E. Hallyn

Subject: Re: [PATCH 05/11] fs: Allow superblock owner to access do_remount_sb()

On Fri, Dec 22, 2017 at 03:32:29PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> Superblock level remounts are currently restricted to global
> CAP_SYS_ADMIN, as is the path for changing the root mount to
> read only on umount. Loosen both of these permission checks to
> also allow CAP_SYS_ADMIN in any namespace which is privileged
> towards the userns which originally mounted the filesystem.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944631/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Serge Hallyn <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/namespace.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e158ec6b..830040d7 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1589,7 +1589,7 @@ static int do_umount(struct mount *mnt, int flags)
> * Special case for "unmounting" root ...
> * we just try to remount it readonly.
> */
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> down_write(&sb->s_umount);
> if (!sb_rdonly(sb))
> @@ -2327,7 +2327,7 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
> down_write(&sb->s_umount);
> if (ms_flags & MS_BIND)
> err = change_mount_flags(path->mnt, ms_flags);
> - else if (!capable(CAP_SYS_ADMIN))
> + else if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> err = -EPERM;
> else
> err = do_remount_sb(sb, sb_flags, data, 0);
> --
> 2.13.6
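
The effect of replacing capable() with ns_capable(sb->s_user_ns, ...) can be sketched in userspace. In the kernel, capable() is a check against the initial user namespace; the model below keeps that relationship but is otherwise a simplification. All `model_*` names are hypothetical, and the namespace-owner rule of the real capability code is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace with a parent link; not the kernel struct. */
struct model_ns { const struct model_ns *parent; };

struct model_task {
	const struct model_ns *user_ns;
	bool has_cap_sys_admin;  /* CAP_SYS_ADMIN held in user_ns */
};

/* Plays the role of the kernel's init_user_ns. */
const struct model_ns model_init_ns = { NULL };

/* True if the task is privileged towards 'target', i.e. 'target' is the
 * task's own namespace or one of its descendants. */
bool model_ns_capable(const struct model_task *t, const struct model_ns *target)
{
	const struct model_ns *ns;

	if (!t->has_cap_sys_admin)
		return false;
	for (ns = target; ns != NULL; ns = ns->parent)
		if (ns == t->user_ns)
			return true;
	return false;
}

/* The old check: privilege towards the initial namespace only. */
bool model_capable(const struct model_task *t)
{
	return model_ns_capable(t, &model_init_ns);
}
```

Under this model, root inside a container fails the old check but passes the new one for a superblock mounted from its own namespace, while root in the initial namespace passes both.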

2017-12-23 03:33:42

by Serge E. Hallyn

Subject: Re: [PATCH 06/11] capabilities: Allow privileged user in s_user_ns to set security.* xattrs

On Fri, Dec 22, 2017 at 03:32:30PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> A privileged user in s_user_ns will generally have the ability to
> manipulate the backing store and insert security.* xattrs into
> the filesystem directly. Therefore the kernel must be prepared to
> handle these xattrs from unprivileged mounts, and it makes little
> sense for commoncap to prevent writing these xattrs to the
> filesystem. The capability and LSM code have already been updated
> to appropriately handle xattrs from unprivileged mounts, so it
> is safe to loosen this restriction on setting xattrs.
>
> The exception to this logic is that writing xattrs to a mounted
> filesystem may also cause the LSM inode_post_setxattr or
> inode_setsecurity callbacks to be invoked. SELinux will deny the
> xattr update by virtue of applying mountpoint labeling to
> unprivileged userns mounts, and Smack will deny the writes for
> any user without global CAP_MAC_ADMIN, so loosening the
> capability check in commoncap is safe in this respect as well.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944641/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: James Morris <[email protected]>
> Cc: Serge Hallyn <[email protected]>

Reviewed-by: Serge Hallyn <[email protected]>

> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> security/commoncap.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 4f8e0934..dd0afef9 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -920,6 +920,8 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
> int cap_inode_setxattr(struct dentry *dentry, const char *name,
> const void *value, size_t size, int flags)
> {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
> /* Ignore non-security xattrs */
> if (strncmp(name, XATTR_SECURITY_PREFIX,
> sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -932,7 +934,7 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> if (strcmp(name, XATTR_NAME_CAPS) == 0)
> return 0;
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> return 0;
> }
> @@ -950,6 +952,8 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> */
> int cap_inode_removexattr(struct dentry *dentry, const char *name)
> {
> + struct user_namespace *user_ns = dentry->d_sb->s_user_ns;
> +
> /* Ignore non-security xattrs */
> if (strncmp(name, XATTR_SECURITY_PREFIX,
> sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> @@ -965,7 +969,7 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> return 0;
> }
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
> return 0;
> }
> --
> 2.13.6
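
The name filtering visible in the quoted cap_inode_setxattr() hunks can be reproduced in userspace. This is a minimal sketch: the `MODEL_*` strings are local copies of the "security." prefix and the "security.capability" name, `model_cap_inode_setxattr()` is a hypothetical function, and the boolean argument stands for the result of `ns_capable(user_ns, CAP_SYS_ADMIN)`.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Local copies for illustration of the kernel's xattr name constants. */
#define MODEL_XATTR_SECURITY_PREFIX "security."
#define MODEL_XATTR_NAME_CAPS       "security.capability"

/* Returns 0 if the xattr write is allowed by this check, -EPERM otherwise. */
int model_cap_inode_setxattr(const char *name, bool admin_in_s_user_ns)
{
	/* Ignore non-security xattrs. */
	if (strncmp(name, MODEL_XATTR_SECURITY_PREFIX,
		    sizeof(MODEL_XATTR_SECURITY_PREFIX) - 1) != 0)
		return 0;

	/* security.capability is handled elsewhere. */
	if (strcmp(name, MODEL_XATTR_NAME_CAPS) == 0)
		return 0;

	/* Other security.* names need CAP_SYS_ADMIN in s_user_ns. */
	if (!admin_in_s_user_ns)
		return -EPERM;
	return 0;
}
```

Only the last branch changes with this patch: the privilege is evaluated against the superblock's user namespace instead of the initial one.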

2017-12-23 03:39:22

by Serge E. Hallyn

Subject: Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

On Fri, Dec 22, 2017 at 03:32:31PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexander Viro <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>

Reviewed-by: Serge Hallyn <[email protected]>

> ---
> fs/ioctl.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb..8c628a8d 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -549,7 +549,7 @@ static int ioctl_fsfreeze(struct file *filp)
> {
> struct super_block *sb = file_inode(filp)->i_sb;
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> /* If filesystem doesn't support freeze feature, return. */
> @@ -566,7 +566,7 @@ static int ioctl_fsthaw(struct file *filp)
> {
> struct super_block *sb = file_inode(filp)->i_sb;
>
> - if (!capable(CAP_SYS_ADMIN))
> + if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> /* Thaw */
> --
> 2.13.6
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2017-12-23 03:46:37

by Serge E. Hallyn

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Fri, Dec 22, 2017 at 03:32:32PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
> - The userns for the fuse connection is fixed to the namespace
> from which /dev/fuse is opened.
>
> - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>

Acked-by: Serge Hallyn <[email protected]>
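
The uid/gid translation described in the commit message can be sketched in userspace as follows. This is a minimal model, not the kernel implementation: `struct model_idmap` and the `model_*` functions are hypothetical names, the map has a single extent in the style of /proc/<pid>/uid_map, and it assumes a namespace that is a direct child of the initial one, so the "lower" IDs are the kernel-internal ones.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/* Hypothetical single-extent ID map, modeled on /proc/<pid>/uid_map. */
struct model_idmap {
    uint32_t first;       /* first ID as seen inside the namespace */
    uint32_t lower_first; /* first ID as seen outside the namespace */
    uint32_t count;       /* number of consecutive IDs mapped */
};

/* Model of make_kuid(): namespace-local ID -> kernel-internal ID. */
static uint32_t model_make_kuid(const struct model_idmap *map, uint32_t uid)
{
    if (uid >= map->first && uid - map->first < map->count)
        return map->lower_first + (uid - map->first);
    return MODEL_INVALID_ID;
}

/* Model of from_kuid(): kernel-internal ID -> namespace-local ID. */
static uint32_t model_from_kuid(const struct model_idmap *map, uint32_t kuid)
{
    if (kuid >= map->lower_first && kuid - map->lower_first < map->count)
        return map->first + (kuid - map->lower_first);
    return MODEL_INVALID_ID;
}

/*
 * Mirrors the new check in __fuse_get_req(): a caller whose fsuid has no
 * mapping in the connection's namespace gets -EOVERFLOW instead of a
 * request carrying an invalid uid.
 */
static int model_req_init(const struct model_idmap *map, uint32_t fsuid,
                          uint32_t *out_uid)
{
    uint32_t uid = model_from_kuid(map, fsuid);

    if (uid == MODEL_INVALID_ID)
        return -EOVERFLOW;
    *out_uid = uid;
    return 0;
}
```

This also shows why the patch switches from from_kuid_munged() to from_kuid() in fuse_req_init_context(): the non-munged variant yields an invalid ID for unmapped callers, which the request path can then detect and reject.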

> ---
> fs/fuse/cuse.c | 3 ++-
> fs/fuse/dev.c | 11 ++++++++---
> fs/fuse/dir.c | 14 +++++++-------
> fs/fuse/fuse_i.h | 6 +++++-
> fs/fuse/inode.c | 31 +++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
> #include <linux/stat.h>
> #include <linux/module.h>
> #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
> #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
> if (!cc)
> return -ENOMEM;
>
> - fuse_conn_init(&cc->fc);
> + fuse_conn_init(&cc->fc, current_user_ns());
>
> fud = fuse_dev_alloc(&cc->fc);
> if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> __set_bit(FR_WAITING, &req->flags);
> if (for_background)
> __set_bit(FR_BACKGROUND, &req->flags);
> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
>
> return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
> in = &req->in;
> reqsize = in->h.len;
>
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> + if (task_active_pid_ns(current) != fc->pid_ns ||
> + current_user_ns() != fc->user_ns) {
> rcu_read_lock();
> in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
> stat->ino = attr->ino;
> stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> stat->nlink = attr->nlink;
> - stat->uid = make_kuid(&init_user_ns, attr->uid);
> - stat->gid = make_kgid(&init_user_ns, attr->gid);
> + stat->uid = make_kuid(fc->user_ns, attr->uid);
> + stat->gid = make_kgid(fc->user_ns, attr->gid);
> stat->rdev = inode->i_rdev;
> stat->atime.tv_sec = attr->atime;
> stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
> return true;
> }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> - bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> + struct fuse_setattr_in *arg, bool trust_local_cmtime)
> {
> unsigned ivalid = iattr->ia_valid;
>
> if (ivalid & ATTR_MODE)
> arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
> if (ivalid & ATTR_UID)
> - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
> if (ivalid & ATTR_GID)
> - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
> if (ivalid & ATTR_SIZE)
> arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
> if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
> memset(&inarg, 0, sizeof(inarg));
> memset(&outarg, 0, sizeof(outarg));
> - iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
> if (file) {
> struct fuse_file *ff = file->private_data;
> inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
> #include <linux/xattr.h>
> #include <linux/pid_namespace.h>
> #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
> /** Max number of pages that can be used in a single read request */
> #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
> /** The pid namespace for this mount */
> struct pid_namespace *pid_ns;
>
> + /** The user namespace for this mount */
> + struct user_namespace *user_ns;
> +
> /** Maximum read size */
> unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
> /**
> * Initialize fuse_conn
> */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
> /**
> * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> inode->i_ino = fuse_squash_ino(attr->ino);
> inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> set_nlink(inode, attr->nlink);
> - inode->i_uid = make_kuid(&init_user_ns, attr->uid);
> - inode->i_gid = make_kgid(&init_user_ns, attr->gid);
> + inode->i_uid = make_kuid(fc->user_ns, attr->uid);
> + inode->i_gid = make_kgid(fc->user_ns, attr->gid);
> inode->i_blocks = attr->blocks;
> inode->i_atime.tv_sec = attr->atime;
> inode->i_atime.tv_nsec = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
> return err;
> }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> + struct user_namespace *user_ns)
> {
> char *p;
> memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> case OPT_USER_ID:
> if (fuse_match_uint(&args[0], &uv))
> return 0;
> - d->user_id = make_kuid(current_user_ns(), uv);
> + d->user_id = make_kuid(user_ns, uv);
> if (!uid_valid(d->user_id))
> return 0;
> d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> case OPT_GROUP_ID:
> if (fuse_match_uint(&args[0], &uv))
> return 0;
> - d->group_id = make_kgid(current_user_ns(), uv);
> + d->group_id = make_kgid(user_ns, uv);
> if (!gid_valid(d->group_id))
> return 0;
> d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> struct super_block *sb = root->d_sb;
> struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
> if (fc->default_permissions)
> seq_puts(m, ",default_permissions");
> if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
> fpq->connected = 1;
> }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
> {
> memset(fc, 0, sizeof(*fc));
> spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> fc->attr_version = 1;
> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + fc->user_ns = get_user_ns(user_ns);
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (fc->destroy_req)
> fuse_request_free(fc->destroy_req);
> put_pid_ns(fc->pid_ns);
> + put_user_ns(fc->user_ns);
> fc->release(fc);
> }
> }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
> sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> - if (!parse_fuse_opt(data, &d, is_bdev))
> + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
> goto err;
>
> if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!file)
> goto err;
>
> - if ((file->f_op != &fuse_dev_operations) ||
> - (file->f_cred->user_ns != &init_user_ns))
> + /*
> + * Require mount to happen from the same user namespace which
> + * opened /dev/fuse to prevent potential attacks.
> + */
> + if (file->f_op != &fuse_dev_operations ||
> + file->f_cred->user_ns != sb->s_user_ns)
> goto err_fput;
>
> fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!fc)
> goto err_fput;
>
> - fuse_conn_init(fc);
> + fuse_conn_init(fc, sb->s_user_ns);
> fc->release = fuse_free_conn;
>
> fud = fuse_dev_alloc(fc);
> --
> 2.13.6

2017-12-23 03:50:49

by Serge E. Hallyn

Subject: Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

On Fri, Dec 22, 2017 at 03:32:33PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>

Reviewed-by: Serge Hallyn <[email protected]>

> ---
> fs/fuse/dir.c | 2 +-
> kernel/user_namespace.c | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
> const struct cred *cred;
>
> if (fc->allow_other)
> - return 1;
> + return current_in_userns(fc->user_ns);
>
> cred = current_cred();
> if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
> {
> return in_userns(target_ns, current_user_ns());
> }
> +EXPORT_SYMBOL(current_in_userns);

I have to say I'm not happy with this name. I wish it had been
called current_under_userns or something to indicate it may also
be in a child.
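
To make the naming concern concrete, here is a small userspace sketch of the ancestor walk behind in_userns(). `struct demo_userns` and `demo_in_userns()` are hypothetical names used only for this illustration; the loop follows the same idea as the kernel helper, which climbs from the candidate namespace up to the target's level and then compares.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct user_namespace. */
struct demo_userns {
    struct demo_userns *parent; /* NULL for the initial namespace */
    int level;                  /* 0 for the initial namespace */
};

/* True if `child` is `ancestor` itself or any descendant of it. */
static bool demo_in_userns(const struct demo_userns *ancestor,
                           const struct demo_userns *child)
{
    /* Climb until both namespaces are at the same depth. */
    while (child->level > ancestor->level)
        child = child->parent;
    return child == ancestor;
}
```

So "in" here really means "in or below": a task in a grandchild namespace also passes the check, while a task in a sibling or parent namespace does not.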

>
> static inline struct user_namespace *to_user_ns(struct ns_common *ns)
> {
> --
> 2.13.6

2017-12-23 03:51:53

by Serge E. Hallyn

Subject: Re: [PATCH 10/11] fuse: Allow user namespace mounts

On Fri, Dec 22, 2017 at 03:32:34PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set the FS_USERNS_MOUNT flag in fs_flags.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> [dongsu: add a simple commit message]
> Signed-off-by: Dongsu Park <[email protected]>

Reviewed-by: Serge Hallyn <[email protected]>
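
As a rough illustration of what the flag gates, the following userspace sketch models the mount-time decision. It is a simplification under stated assumptions: `model_may_mount()` is a hypothetical helper, the flag values are quoted from memory of include/linux/fs.h and should be treated as illustrative, and the real mount path performs further checks (for example, CAP_SYS_ADMIN in the caller's own mount namespace) that are left out here.

```c
#include <stdbool.h>

/* Flag values as recalled from include/linux/fs.h; illustrative only. */
#define FS_REQUIRES_DEV  1
#define FS_HAS_SUBTYPE   4
#define FS_USERNS_MOUNT  8

/*
 * Hypothetical model: a filesystem type without FS_USERNS_MOUNT can only
 * be mounted by a caller with CAP_SYS_ADMIN in the initial user namespace.
 */
static bool model_may_mount(int fs_flags, bool cap_sys_admin_in_init_ns)
{
    if (!(fs_flags & FS_USERNS_MOUNT) && !cap_sys_admin_in_init_ns)
        return false;
    return true;
}
```

With the old fs_flags of "fuse", a namespace root is refused; with FS_USERNS_MOUNT added, the same caller gets through this gate and the remaining fuse-specific checks from patches 08 and 09 apply.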

> ---
> fs/fuse/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
> static struct file_system_type fuse_fs_type = {
> .owner = THIS_MODULE,
> .name = "fuse",
> - .fs_flags = FS_HAS_SUBTYPE,
> + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> .mount = fuse_mount,
> .kill_sb = fuse_kill_sb_anon,
> };
> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
> .name = "fuseblk",
> .mount = fuse_mount_blk,
> .kill_sb = fuse_kill_sb_blk,
> - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> };
> MODULE_ALIAS_FS("fuseblk");
>
> --
> 2.13.6

2017-12-23 04:03:51

by Serge E. Hallyn

Subject: Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts

On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> From: Seth Forshee <[email protected]>
>
> The kernel should not calculate new hmacs for mounts done by
> non-root users. Update evm_calc_hmac_or_hash() to refuse to
> calculate new hmacs for mounts for non-init user namespaces.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: James Morris <[email protected]>
> Cc: Mimi Zohar <[email protected]>

Hi Mimi,

does this change seem sufficient to you?

> Cc: "Serge E. Hallyn" <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> security/integrity/evm/evm_crypto.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> index bcd64baf..729f4545 100644
> --- a/security/integrity/evm/evm_crypto.c
> +++ b/security/integrity/evm/evm_crypto.c
> @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
> int error;
> int size;
>
> - if (!(inode->i_opflags & IOP_XATTR))
> + if (!(inode->i_opflags & IOP_XATTR) ||
> + inode->i_sb->s_user_ns != &init_user_ns)
> return -EOPNOTSUPP;
>
> desc = init_desc(type);
> --
> 2.13.6

2017-12-23 12:01:00

by Dongsu Park

Subject: Re: [PATCH 01/11] block_dev: Support checking inode permissions in lookup_bdev()

Hi,

On Fri, Dec 22, 2017 at 7:59 PM, Coly Li <[email protected]> wrote:
> On 22/12/2017 10:32 PM, Dongsu Park wrote:
> Hi Dongsu,
>
> Could you please use a macro like NO_PERMISSION_CHECK to replace the
> hard-coded 0? That way I wouldn't need to look up what 0 means in the
> new lookup_bdev().

I see. I'll do that.

Thanks,
Dongsu

> Thanks.
>
> Coly Li
>
>> ---
>> drivers/md/bcache/super.c | 2 +-
>> drivers/md/dm-table.c | 2 +-
>> drivers/mtd/mtdsuper.c | 2 +-
>> fs/block_dev.c | 13 ++++++++++---
>> fs/quota/quota.c | 2 +-
>> include/linux/fs.h | 2 +-
>> 6 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index b4d28928..acc9d56c 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1967,7 +1967,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>> sb);
>> if (IS_ERR(bdev)) {
>> if (bdev == ERR_PTR(-EBUSY)) {
>> - bdev = lookup_bdev(strim(path));
>> + bdev = lookup_bdev(strim(path), 0);
>> mutex_lock(&bch_register_lock);
>> if (!IS_ERR(bdev) && bch_is_open(bdev))
>> err = "device already registered";
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index 88130b5d..bca5eaf4 100644
> [snip]
>
>
> --
> Coly Li

2017-12-23 12:18:33

by Dongsu Park

Subject: Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

Hi,

On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
<[email protected]> wrote:
> Dongsu,
>
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> From: Seth Forshee <[email protected]>
>>
>> Unprivileged users should not be able to mount mtd block devices
>> when they lack sufficient privileges towards the block device
>> inode. Update mount_mtd() to validate that the user has the
>> required access to the inode at the specified path. The check
>> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
>> continue working as before.
>
> What is the big picture of this?
> Will an unprivileged user be able to just mount UBIFS in the future?

I'm not sure I'm aware of all use cases w.r.t. mtd & ubifs.
As I understand it, many container runtimes these days allow
unprivileged users to run containers (docker, lxc, runc, bubblewrap, etc.).
That's why the kernel should deal with additional permission checks
that might not have been necessary in the past.
This MTD patch is one of those special cases.

> Please note that UBIFS sits on top of a character device and not a block device.

Aha, good to know.

Thanks,
Dongsu

> --
> Thanks,
> //richard

2017-12-23 12:38:54

by Dongsu Park

Subject: Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

Hi,

On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <[email protected]> wrote:
> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>> From: Seth Forshee <[email protected]>
>>
>> Expand the check in should_remove_suid() to keep privileges for
>
> I realize this description came from Seth, but reading it now,
> 'Expand' seems wrong. Expanding a check brings to my mind making
> it stricter, not looser. How about 'Relax the check' ?

Makes sense. Will do.

>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>
>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>
> Why exactly?
>
> This is wrong, because capable_wrt_inode_uidgid() does a check
> against current_user_ns, not the inode->i_sb->s_user_ns

Ah, I see.
I suppose it was changed for the sake of the privileged_wrt_inode_uidgid()
check that capable_wrt_inode_uidgid() performs. But as you pointed out, that
checks against current_user_ns, which is wrong. I would just create another
wrapper like capable_userns_wrt_inode_uidgid(), which takes an additional
(struct user_namespace *) parameter, so that it can check both
ns_capable() and privileged_wrt_inode_uidgid().
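
A userspace sketch of what such a wrapper might look like is given below. It rests on several assumptions: the `sketch_*` types and functions are hypothetical models rather than kernel code, the `caller_has_cap` field stands in for the result of ns_capable(ns, cap) for the current task, and a single ID range is used for both uids and gids to keep the example short. The only point is that both checks are made against the namespace passed in rather than current_user_ns.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a user namespace with one mapped ID range. */
struct sketch_ns {
    uint32_t lower_first; /* first kernel-internal ID mapped into the ns */
    uint32_t count;       /* number of consecutive IDs mapped */
    bool caller_has_cap;  /* stand-in for ns_capable(ns, cap) */
};

/* Hypothetical model of the inode fields the check looks at. */
struct sketch_inode {
    uint32_t i_uid;
    uint32_t i_gid;
};

/* Model of kuid_has_mapping()/kgid_has_mapping(). */
static bool sketch_id_has_mapping(const struct sketch_ns *ns, uint32_t kid)
{
    return kid >= ns->lower_first && kid - ns->lower_first < ns->count;
}

/* Model of privileged_wrt_inode_uidgid(): both IDs must be mapped. */
static bool sketch_privileged_wrt_inode_uidgid(const struct sketch_ns *ns,
                                               const struct sketch_inode *inode)
{
    return sketch_id_has_mapping(ns, inode->i_uid) &&
           sketch_id_has_mapping(ns, inode->i_gid);
}

/* The proposed wrapper: same logic, but against an explicit namespace. */
static bool sketch_capable_userns_wrt_inode_uidgid(const struct sketch_ns *ns,
                                                   const struct sketch_inode *inode)
{
    return ns->caller_has_cap &&
           sketch_privileged_wrt_inode_uidgid(ns, inode);
}
```

In should_remove_suid() the namespace argument would presumably be inode->i_sb->s_user_ns, which addresses the objection that the current helper looks at the caller's namespace instead.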

Thanks,
Dongsu

>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Alexander Viro <[email protected]>
>> Cc: Serge Hallyn <[email protected]>
>> Signed-off-by: Seth Forshee <[email protected]>
>> Signed-off-by: Dongsu Park <[email protected]>
>> ---
>> fs/inode.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index fd401028..6459a437 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -1749,7 +1749,8 @@ EXPORT_SYMBOL(touch_atime);
>> */
>> int should_remove_suid(struct dentry *dentry)
>> {
>> - umode_t mode = d_inode(dentry)->i_mode;
>> + struct inode *inode = d_inode(dentry);
>> + umode_t mode = inode->i_mode;
>> int kill = 0;
>>
>> /* suid always must be killed */
>> @@ -1763,7 +1764,8 @@ int should_remove_suid(struct dentry *dentry)
>> if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
>> kill |= ATTR_KILL_SGID;
>>
>> - if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
>> + if (unlikely(kill && !capable_wrt_inode_uidgid(inode, CAP_FSETID) &&
>> + S_ISREG(mode)))
>> return kill;
>>
>> return 0;
>> --
>> 2.13.6

2017-12-23 12:56:17

by Richard Weinberger

Subject: Re: [PATCH 02/11] mtd: Check permissions towards mtd block device inode when mounting

Dongsu,

Am Samstag, 23. Dezember 2017, 13:18:30 CET schrieb Dongsu Park:
> Hi,
>
> On Fri, Dec 22, 2017 at 10:06 PM, Richard Weinberger
>
> <[email protected]> wrote:
> > Dongsu,
> >
> > On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> >> From: Seth Forshee <[email protected]>
> >>
> >> Unprivileged users should not be able to mount mtd block devices
> >> when they lack sufficient privileges towards the block device
> >> inode. Update mount_mtd() to validate that the user has the
> >> required access to the inode at the specified path. The check
> >> will be skipped for CAP_SYS_ADMIN, so privileged mounts will
> >> continue working as before.
> >
> > > What is the big picture of this?
> > > Will an unprivileged user be able to just mount UBIFS in the future?
>
> I'm not sure I'm aware of all use cases w.r.t. mtd & ubifs.
> As I understand it, many container runtimes these days allow
> unprivileged users to run containers (docker, lxc, runc, bubblewrap, etc.).
> That's why the kernel should deal with additional permission checks
> that might not have been necessary in the past.
> This MTD patch is one of those special cases.

My fear is that a corner case is forgotten and all of a sudden someone can do
funky things with MTD in a container...

Thanks,
//richard

2017-12-24 05:12:22

by Mimi Zohar

Subject: Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts

Hi Serge,

On Fri, 2017-12-22 at 22:03 -0600, Serge E. Hallyn wrote:
> On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> > From: Seth Forshee <[email protected]>
> >
> > The kernel should not calculate new hmacs for mounts done by
> > non-root users. Update evm_calc_hmac_or_hash() to refuse to
> > calculate new hmacs for mounts for non-init user namespaces.
> >
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: James Morris <[email protected]>
> > Cc: Mimi Zohar <[email protected]>
>
> Hi Mimi,
>
> does this change seem sufficient to you?

I think this is the correct behavior in the context of fuse file
systems. This patch, the "ima: define a new policy option named
force" patch, and an updated IMA policy should be upstreamed together.
The cover letter should provide the motivation for these patches.

Mimi

>
> > Cc: "Serge E. Hallyn" <[email protected]>
> > Signed-off-by: Seth Forshee <[email protected]>
> > Signed-off-by: Dongsu Park <[email protected]>
> > ---
> > security/integrity/evm/evm_crypto.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/security/integrity/evm/evm_crypto.c b/security/integrity/evm/evm_crypto.c
> > index bcd64baf..729f4545 100644
> > --- a/security/integrity/evm/evm_crypto.c
> > +++ b/security/integrity/evm/evm_crypto.c
> > @@ -190,7 +190,8 @@ static int evm_calc_hmac_or_hash(struct dentry *dentry,
> > int error;
> > int size;
> >
> > - if (!(inode->i_opflags & IOP_XATTR))
> > + if (!(inode->i_opflags & IOP_XATTR) ||
> > + inode->i_sb->s_user_ns != &init_user_ns)
> > return -EOPNOTSUPP;
> >
> > desc = init_desc(type);
> > --
> > 2.13.6
>

2017-12-24 05:56:14

by Mimi Zohar

Subject: Re: [PATCH 11/11] evm: Don't update hmacs in user ns mounts

On Sun, 2017-12-24 at 00:12 -0500, Mimi Zohar wrote:
> Hi Serge,
>
> On Fri, 2017-12-22 at 22:03 -0600, Serge E. Hallyn wrote:
> > On Fri, Dec 22, 2017 at 03:32:35PM +0100, Dongsu Park wrote:
> > > From: Seth Forshee <[email protected]>
> > >
> > > The kernel should not calculate new hmacs for mounts done by
> > > non-root users. Update evm_calc_hmac_or_hash() to refuse to
> > > calculate new hmacs for mounts for non-init user namespaces.
> > >
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: James Morris <[email protected]>
> > > Cc: Mimi Zohar <[email protected]>
> >
> > Hi Mimi,
> >
> > does this change seem sufficient to you?
>
> I think this is the correct behavior in the context of fuse file
> systems. This patch, the "ima: define a new policy option named
> force" patch, and an updated IMA policy should be upstreamed together.
> The cover letter should provide the motivation for these patches.

Ah, this patch is being upstreamed with the fuse mounts patches. I
guess Seth is planning on posting the IMA policy changes for fuse
separately.

Mimi

2017-12-25 07:06:32

by Eric W. Biederman

Subject: Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

Dongsu Park <[email protected]> writes:

> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
> The latest patchset was v4:
> https://www.mail-archive.com/[email protected]/msg1132206.html
>
> At the moment, filesystems backed by a physical medium can only be mounted
> by real root in the initial user namespace. This restriction exists
> because allowing root in a non-init user namespace to mount such a
> filesystem would effectively let that user control the underlying
> source of the filesystem. In the case of FUSE, the source would mean any
> underlying device.
>
> However, in many use cases such as containers, it's necessary to allow
> filesystems to be mounted from non-init user namespaces. The goal of this
> patchset is to allow FUSE filesystems to be mounted from non-init user
> namespaces. Support for other filesystems like ext4 is not in the
> scope of this patchset.
>
> Let me describe how to test mounting from non-init user namespaces. It's
> assumed that tests are done via sshfs, a userspace filesystem based on
> FUSE with ssh as backend. The test system is Fedora 27.

In general I am for this work, and more bodies and more eyes on it is
generally better.

I will review this after the New Year, I am out for the holidays right
now.

Eric


>
> ====
> $ sudo dnf install -y sshfs
> $ sudo mkdir -p /mnt/userns
>
> ### workaround to get the sshfs permission checks
> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>
> $ unshare -U -r -m
> # sshfs root@localhost: /mnt/userns
>
> ### You can see sshfs being mounted from a non-init user namespace
> # mount | grep sshfs
> root@localhost: on /mnt/userns type fuse.sshfs
> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>
> # touch /mnt/userns/test
> # ls -l /mnt/userns/test
> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
> ====
>
> Open another terminal, check the mountpoint from outside the namespace.
>
> ====
> $ grep userns /proc/$(pidof sshfs)/mountinfo
> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
> root@localhost: rw,user_id=0,group_id=0
> ====
>
> After all tests are done, you can unmount the filesystem
> inside the namespace.
>
> ====
> # fusermount -u /mnt/userns
> ====
>
> Changes since v4:
> * Remove other parts like ext4 to keep the patchset minimal for FUSE
> * Add and change commit messages
> * Describe how to test non-init user namespaces
>
> TODO:
> * Think through potential security implications. There are 2 patches
> being prepared for security issues. One is "ima: define a new policy
> option named force" by Mimi Zohar, which adds an option to specify
> that the results should not be cached:
> https://marc.info/?l=linux-integrity&m=151275680115856&w=2
> The other one is to basically prevent FUSE results from being cached,
> which is still in progress.
>
> * Test IMA/LSMs. Details are written in
> https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>
> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
> additional inode permission.
>
> Patches 3-7 allow the superblock owner to change ownership of inodes, and
> deal with additional capability checks w.r.t user namespaces.
>
> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
> user namespace.
>
> Patch 11 handles a corner case of non-root users in EVM.
>
> The patchset is also available in our github repo:
> https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>
>
> Eric W. Biederman (1):
> fs: Allow superblock owner to change ownership of inodes
>
> Seth Forshee (10):
> block_dev: Support checking inode permissions in lookup_bdev()
> mtd: Check permissions towards mtd block device inode when mounting
> fs: Don't remove suid for CAP_FSETID for userns root
> fs: Allow superblock owner to access do_remount_sb()
> capabilities: Allow privileged user in s_user_ns to set security.*
> xattrs
> fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
> fuse: Support fuse filesystems outside of init_user_ns
> fuse: Restrict allow_other to the superblock's namespace or a
> descendant
> fuse: Allow user namespace mounts
> evm: Don't update hmacs in user ns mounts
>
> drivers/md/bcache/super.c | 2 +-
> drivers/md/dm-table.c | 2 +-
> drivers/mtd/mtdsuper.c | 6 +++++-
> fs/attr.c | 34 ++++++++++++++++++++++++++--------
> fs/block_dev.c | 13 ++++++++++---
> fs/fuse/cuse.c | 3 ++-
> fs/fuse/dev.c | 11 ++++++++---
> fs/fuse/dir.c | 16 ++++++++--------
> fs/fuse/fuse_i.h | 6 +++++-
> fs/fuse/inode.c | 35 +++++++++++++++++++++--------------
> fs/inode.c | 6 ++++--
> fs/ioctl.c | 4 ++--
> fs/namespace.c | 4 ++--
> fs/proc/base.c | 7 +++++++
> fs/proc/generic.c | 7 +++++++
> fs/proc/proc_sysctl.c | 7 +++++++
> fs/quota/quota.c | 2 +-
> include/linux/fs.h | 2 +-
> kernel/user_namespace.c | 1 +
> security/commoncap.c | 8 ++++++--
> security/integrity/evm/evm_crypto.c | 3 ++-
> 21 files changed, 127 insertions(+), 52 deletions(-)

2018-01-05 19:24:14

by Luis Chamberlain

Subject: Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> diff --git a/fs/attr.c b/fs/attr.c
> index 12ffdb6f..bf8e94f3 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -18,6 +18,30 @@
> #include <linux/evm.h>
> #include <linux/ima.h>
>
> +static bool chown_ok(const struct inode *inode, kuid_t uid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + uid_eq(uid, inode->i_uid))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
> +{
> + if (uid_eq(current_fsuid(), inode->i_uid) &&
> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
> + return true;
> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + return true;
> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
> + return true;
> + return false;
> +}
> +
> /**
> * setattr_prepare - check if attribute changes to a dentry are allowed
> * @dentry: dentry to check
> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
> goto kill_priv;
>
> /* Make sure a caller can chown. */
> - if ((ia_valid & ATTR_UID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - !uid_eq(attr->ia_uid, inode->i_uid)) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
> return -EPERM;

I think this patch would read much better and be easier to review if it was
split up by first adding the helpers, and then extending them afterwards.

>
> /* Make sure caller can chgrp. */
> - if ((ia_valid & ATTR_GID) &&
> - (!uid_eq(current_fsuid(), inode->i_uid) ||
> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
> return -EPERM;
>
> /* Make sure a caller can chmod. */
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 31934cb9..9d50ec92 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
> {
> int error;
> struct inode *inode = d_inode(dentry);
> + struct user_namespace *s_user_ns;
>
> if (attr->ia_valid & ATTR_MODE)
> return -EPERM;
>
> + /* Don't let anyone mess with weird proc files */
> + s_user_ns = inode->i_sb->s_user_ns;
> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
> + !kgid_has_mapping(s_user_ns, inode->i_gid))
> + return -EPERM;
> +
> error = setattr_prepare(dentry, attr);
> if (error)
> return error;

Are we sure proc is the only special one? How was it first observed that this
was required for proc? Has anyone tried fuzzing by trying this op with a slew
of other filesystems on all files?

Luis

2018-01-09 15:05:46

by Dongsu Park

Subject: Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

Hi,

On Mon, Dec 25, 2017 at 8:05 AM, Eric W. Biederman
<[email protected]> wrote:
> Dongsu Park <[email protected]> writes:
>
>> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
>> The latest patchset was v4:
>> https://www.mail-archive.com/[email protected]/msg1132206.html
>>
>> At the moment, filesystems backed by physical medium can only be mounted
>> by real root in the initial user namespace. This restriction exists
>> because if it's allowed for root user in non-init user namespaces to
>> mount the filesystem, then it effectively allows the user to control the
>> underlying source of the filesystem. In case of FUSE, the source would
>> mean any underlying device.
>>
>> However, in many use cases such as containers, it's necessary to allow
>> filesystems to be mounted from non-init user namespaces. Goal of this
>> patchset is to allow FUSE filesystems to be mounted from non-init user
>> namespaces. Support for other filesystems like ext4 are not in the
>> scope of this patchset.
>>
>> Let me describe how to test mounting from non-init user namespaces. It's
>> assumed that tests are done via sshfs, a userspace filesystem based on
>> FUSE with ssh as backend. Testing system is Fedora 27.
>
> In general I am for this work, and more bodies and more eyes on it is
> generally better.
>
> I will review this after the New Year, I am out for the holidays right
> now.

Thanks. I'll wait for your review.

Dongsu

> Eric
>
>
>>
>> ====
>> $ sudo dnf install -y sshfs
>> $ sudo mkdir -p /mnt/userns
>>
>> ### workaround to get the sshfs permission checks
>> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>>
>> $ unshare -U -r -m
>> # sshfs root@localhost: /mnt/userns
>>
>> ### You can see sshfs being mounted from a non-init user namespace
>> # mount | grep sshfs
>> root@localhost: on /mnt/userns type fuse.sshfs
>> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>
>> # touch /mnt/userns/test
>> # ls -l /mnt/userns/test
>> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
>> ====
>>
>> Open another terminal, check the mountpoint from outside the namespace.
>>
>> ====
>> $ grep userns /proc/$(pidof sshfs)/mountinfo
>> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
>> root@localhost: rw,user_id=0,group_id=0
>> ====
>>
>> After all tests are done, you can unmount the filesystem
>> inside the namespace.
>>
>> ====
>> # fusermount -u /mnt/userns
>> ====
>>
>> Changes since v4:
>> * Remove other parts like ext4 to keep the patchset minimal for FUSE
>> * Add and change commit messages
>> * Describe how to test non-init user namespaces
>>
>> TODO:
>> * Think through potential security implications. There are 2 patches
>> being prepared for security issues. One is "ima: define a new policy
>> option named force" by Mimi Zohar, which adds an option to specify
>> that the results should not be cached:
>> https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>> The other one is to basically prevent FUSE results from being cached,
>> which is still in progress.
>>
>> * Test IMA/LSMs. Details are written in
>> https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>>
>> Patches 3-7 allow the superblock owner to change ownership of inodes, and
>> deal with additional capability checks w.r.t user namespaces.
>>
>> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
>> user namespace.
>>
>> Patch 11 handles a corner case of non-root users in EVM.
>>
>> The patchset is also available in our github repo:
>> https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>>
>>
>> Eric W. Biederman (1):
>> fs: Allow superblock owner to change ownership of inodes
>>
>> Seth Forshee (10):
>> block_dev: Support checking inode permissions in lookup_bdev()
>> mtd: Check permissions towards mtd block device inode when mounting
>> fs: Don't remove suid for CAP_FSETID for userns root
>> fs: Allow superblock owner to access do_remount_sb()
>> capabilities: Allow privileged user in s_user_ns to set security.*
>> xattrs
>> fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>> fuse: Support fuse filesystems outside of init_user_ns
>> fuse: Restrict allow_other to the superblock's namespace or a
>> descendant
>> fuse: Allow user namespace mounts
>> evm: Don't update hmacs in user ns mounts
>>
>> drivers/md/bcache/super.c | 2 +-
>> drivers/md/dm-table.c | 2 +-
>> drivers/mtd/mtdsuper.c | 6 +++++-
>> fs/attr.c | 34 ++++++++++++++++++++++++++--------
>> fs/block_dev.c | 13 ++++++++++---
>> fs/fuse/cuse.c | 3 ++-
>> fs/fuse/dev.c | 11 ++++++++---
>> fs/fuse/dir.c | 16 ++++++++--------
>> fs/fuse/fuse_i.h | 6 +++++-
>> fs/fuse/inode.c | 35 +++++++++++++++++++++--------------
>> fs/inode.c | 6 ++++--
>> fs/ioctl.c | 4 ++--
>> fs/namespace.c | 4 ++--
>> fs/proc/base.c | 7 +++++++
>> fs/proc/generic.c | 7 +++++++
>> fs/proc/proc_sysctl.c | 7 +++++++
>> fs/quota/quota.c | 2 +-
>> include/linux/fs.h | 2 +-
>> kernel/user_namespace.c | 1 +
>> security/commoncap.c | 8 ++++++--
>> security/integrity/evm/evm_crypto.c | 3 ++-
>> 21 files changed, 127 insertions(+), 52 deletions(-)

2018-01-09 15:10:57

by Dongsu Park

Subject: Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

Hi,

On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <[email protected]> wrote:
> On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 12ffdb6f..bf8e94f3 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -18,6 +18,30 @@
>> #include <linux/evm.h>
>> #include <linux/ima.h>
>>
>> +static bool chown_ok(const struct inode *inode, kuid_t uid)
>> +{
>> + if (uid_eq(current_fsuid(), inode->i_uid) &&
>> + uid_eq(uid, inode->i_uid))
>> + return true;
>> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + return true;
>> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> + return true;
>> + return false;
>> +}
>> +
>> +static bool chgrp_ok(const struct inode *inode, kgid_t gid)
>> +{
>> + if (uid_eq(current_fsuid(), inode->i_uid) &&
>> + (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
>> + return true;
>> + if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + return true;
>> + if (ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
>> + return true;
>> + return false;
>> +}
>> +
>> /**
>> * setattr_prepare - check if attribute changes to a dentry are allowed
>> * @dentry: dentry to check
>> @@ -52,17 +76,11 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
>> goto kill_priv;
>>
>> /* Make sure a caller can chown. */
>> - if ((ia_valid & ATTR_UID) &&
>> - (!uid_eq(current_fsuid(), inode->i_uid) ||
>> - !uid_eq(attr->ia_uid, inode->i_uid)) &&
>> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + if ((ia_valid & ATTR_UID) && !chown_ok(inode, attr->ia_uid))
>> return -EPERM;
>
> I think this patch would read much better and easier to review if it was
> split up by first adding the helpers, and then extending them afterwards.

I'm fine with splitting it up into multiple patches, if the original author
Eric agrees.

>> /* Make sure caller can chgrp. */
>> - if ((ia_valid & ATTR_GID) &&
>> - (!uid_eq(current_fsuid(), inode->i_uid) ||
>> - (!in_group_p(attr->ia_gid) && !gid_eq(attr->ia_gid, inode->i_gid))) &&
>> - !capable_wrt_inode_uidgid(inode, CAP_CHOWN))
>> + if ((ia_valid & ATTR_GID) && !chgrp_ok(inode, attr->ia_gid))
>> return -EPERM;
>>
>> /* Make sure a caller can chmod. */
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 31934cb9..9d50ec92 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -665,10 +665,17 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
>> {
>> int error;
>> struct inode *inode = d_inode(dentry);
>> + struct user_namespace *s_user_ns;
>>
>> if (attr->ia_valid & ATTR_MODE)
>> return -EPERM;
>>
>> + /* Don't let anyone mess with weird proc files */
>> + s_user_ns = inode->i_sb->s_user_ns;
>> + if (!kuid_has_mapping(s_user_ns, inode->i_uid) ||
>> + !kgid_has_mapping(s_user_ns, inode->i_gid))
>> + return -EPERM;
>> +
>> error = setattr_prepare(dentry, attr);
>> if (error)
>> return error;
>
> Are we sure proc is the only special one? How was it observed first that this was
> require for proc? Has anyone tried fuzzing by trying this op with a slew of other
> filesystems on all files?

From my limited knowledge of procfs, I suppose that procfs is a little
different from ordinary filesystems. Procfs is not exactly namespaced, and
it has many inconsistencies. Some files under /proc should be owned by the
global root, regardless of user namespaces. That's why we need to handle such
special cases for proc. As it has been like that historically since the
beginning, it's hard to change it fundamentally.

However, you have good points. Other than procfs, there could be other
filesystems that have potential issues when relaxing privileges. The question
is how we can be sure that there are no hidden issues. From my understanding,
we could usually run test suites like LTP
(https://github.com/linux-test-project/ltp.git) to avoid such regressions.
Today I ran the LTP tests for fs & containers, with the patchset included.
They seemed to work fine without failures. Obviously that doesn't mean that
it's completely bug-free, when we are talking about unknown issues.
Please let me know if there are other good ways to figure out potential issues.

Thanks,
Dongsu

> Luis

2018-01-09 17:23:23

by Luis Chamberlain

Subject: Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

On Tue, Jan 09, 2018 at 04:10:54PM +0100, Dongsu Park wrote:
> On Fri, Jan 5, 2018 at 8:24 PM, Luis R. Rodriguez <[email protected]> wrote:
> > On Fri, Dec 22, 2017 at 03:32:27PM +0100, Dongsu Park wrote:
> > I think this patch would read much better and easier to review if it was
> > split up by first adding the helpers, and then extending them afterwards.
>
> I'm fine with splitting it up into multiple patches, if the original author
> Eric agrees.

Great.

> > Are we sure proc is the only special one? How was it observed first that this was
> > require for proc? Has anyone tried fuzzing by trying this op with a slew of other
> > filesystems on all files?
>
> Please let me know if there are other good ways to figure out potential issues.

I think the trick would be to create a test which mimics the issue and then
try to mount and run the test against as many filesystems as we support. So
would developing a test be possible here?

Luis

2018-01-17 10:59:11

by Alban Crequy

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

[Adding Tejun, David, Tom for question about cuse]

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> From: Seth Forshee <[email protected]>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
> - The userns for the fuse connection is fixed to the namespace
> from which /dev/fuse is opened.
>
> - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.

Was a use case discussed for using cuse in a new unprivileged userns?

I ran some tests yesterday with cusexmp [1] and I could add a new char
device as an unprivileged user with:

$ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
--maj=99 --min=30 --name=foo'

where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
Then, I could see the new device:

$ cat /proc/devices | grep foo
99 foo

On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
seems dangerous if the dev node can be provided otherwise and if we
don't have a use case for it.

Thoughts?

[1] https://github.com/fuse4x/fuse/blob/master/example/cusexmp.c#L9

Cheers,
Alban


> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/fuse/cuse.c | 3 ++-
> fs/fuse/dev.c | 11 ++++++++---
> fs/fuse/dir.c | 14 +++++++-------
> fs/fuse/fuse_i.h | 6 +++++-
> fs/fuse/inode.c | 31 +++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
> #include <linux/stat.h>
> #include <linux/module.h>
> #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
> #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
> if (!cc)
> return -ENOMEM;
>
> - fuse_conn_init(&cc->fc);
> + fuse_conn_init(&cc->fc, current_user_ns());
>
> fud = fuse_dev_alloc(&cc->fc);
> if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> __set_bit(FR_WAITING, &req->flags);
> if (for_background)
> __set_bit(FR_BACKGROUND, &req->flags);
> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
>
> return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
> in = &req->in;
> reqsize = in->h.len;
>
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> + if (task_active_pid_ns(current) != fc->pid_ns ||
> + current_user_ns() != fc->user_ns) {
> rcu_read_lock();
> in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
> stat->ino = attr->ino;
> stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> stat->nlink = attr->nlink;
> - stat->uid = make_kuid(&init_user_ns, attr->uid);
> - stat->gid = make_kgid(&init_user_ns, attr->gid);
> + stat->uid = make_kuid(fc->user_ns, attr->uid);
> + stat->gid = make_kgid(fc->user_ns, attr->gid);
> stat->rdev = inode->i_rdev;
> stat->atime.tv_sec = attr->atime;
> stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
> return true;
> }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> - bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> + struct fuse_setattr_in *arg, bool trust_local_cmtime)
> {
> unsigned ivalid = iattr->ia_valid;
>
> if (ivalid & ATTR_MODE)
> arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
> if (ivalid & ATTR_UID)
> - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
> if (ivalid & ATTR_GID)
> - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
> if (ivalid & ATTR_SIZE)
> arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
> if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
> memset(&inarg, 0, sizeof(inarg));
> memset(&outarg, 0, sizeof(outarg));
> - iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
> if (file) {
> struct fuse_file *ff = file->private_data;
> inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
> #include <linux/xattr.h>
> #include <linux/pid_namespace.h>
> #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
> /** Max number of pages that can be used in a single read request */
> #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
> /** The pid namespace for this mount */
> struct pid_namespace *pid_ns;
>
> + /** The user namespace for this mount */
> + struct user_namespace *user_ns;
> +
> /** Maximum read size */
> unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
> /**
> * Initialize fuse_conn
> */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
> /**
> * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> inode->i_ino = fuse_squash_ino(attr->ino);
> inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> set_nlink(inode, attr->nlink);
> - inode->i_uid = make_kuid(&init_user_ns, attr->uid);
> - inode->i_gid = make_kgid(&init_user_ns, attr->gid);
> + inode->i_uid = make_kuid(fc->user_ns, attr->uid);
> + inode->i_gid = make_kgid(fc->user_ns, attr->gid);
> inode->i_blocks = attr->blocks;
> inode->i_atime.tv_sec = attr->atime;
> inode->i_atime.tv_nsec = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
> return err;
> }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> + struct user_namespace *user_ns)
> {
> char *p;
> memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> case OPT_USER_ID:
> if (fuse_match_uint(&args[0], &uv))
> return 0;
> - d->user_id = make_kuid(current_user_ns(), uv);
> + d->user_id = make_kuid(user_ns, uv);
> if (!uid_valid(d->user_id))
> return 0;
> d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> case OPT_GROUP_ID:
> if (fuse_match_uint(&args[0], &uv))
> return 0;
> - d->group_id = make_kgid(current_user_ns(), uv);
> + d->group_id = make_kgid(user_ns, uv);
> if (!gid_valid(d->group_id))
> return 0;
> d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> struct super_block *sb = root->d_sb;
> struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
> if (fc->default_permissions)
> seq_puts(m, ",default_permissions");
> if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
> fpq->connected = 1;
> }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
> {
> memset(fc, 0, sizeof(*fc));
> spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> fc->attr_version = 1;
> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + fc->user_ns = get_user_ns(user_ns);
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (fc->destroy_req)
> fuse_request_free(fc->destroy_req);
> put_pid_ns(fc->pid_ns);
> + put_user_ns(fc->user_ns);
> fc->release(fc);
> }
> }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
> sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> - if (!parse_fuse_opt(data, &d, is_bdev))
> + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
> goto err;
>
> if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!file)
> goto err;
>
> - if ((file->f_op != &fuse_dev_operations) ||
> - (file->f_cred->user_ns != &init_user_ns))
> + /*
> + * Require mount to happen from the same user namespace which
> + * opened /dev/fuse to prevent potential attacks.
> + */
> + if (file->f_op != &fuse_dev_operations ||
> + file->f_cred->user_ns != sb->s_user_ns)
> goto err_fput;
>
> fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!fc)
> goto err_fput;
>
> - fuse_conn_init(fc);
> + fuse_conn_init(fc, sb->s_user_ns);
> fc->release = fuse_free_conn;
>
> fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>

2018-01-17 14:29:46

by Seth Forshee

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> [Adding Tejun, David, Tom for question about cuse]
>
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> > From: Seth Forshee <[email protected]>
> >
> > In order to support mounts from namespaces other than
> > init_user_ns, fuse must translate uids and gids to/from the
> > userns of the process servicing requests on /dev/fuse. This
> > patch does that, with a couple of restrictions on the namespace:
> >
> > - The userns for the fuse connection is fixed to the namespace
> > from which /dev/fuse is opened.
> >
> > - The namespace must be the same as s_user_ns.
> >
> > These restrictions simplify the implementation by avoiding the
> > need to pass around userns references and by allowing fuse to
> > rely on the checks in inode_change_ok for ownership changes.
> > Either restriction could be relaxed in the future if needed.
> >
> > For cuse the namespace used for the connection is also simply
> > current_user_ns() at the time /dev/cuse is opened.
>
> Was a use case discussed for using cuse in a new unprivileged userns?
>
> I ran some tests yesterday with cusexmp [1] and I could add a new char
> device as an unprivileged user with:
>
> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> --maj=99 --min=30 --name=foo
>
> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> Then, I could see the new device:
>
> $ cat /proc/devices | grep foo
> 99 foo
>
> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> seems dangerous if the dev node can be provided otherwise and if we
> don't have a use case for it.
>
> Thoughts?

I can't remember the specific reasons, but I had concluded that letting
unprivileged users use cuse within a user namespace isn't safe. But
having a cuse device node usable by regular users at all is equally
unsafe I suspect, so I don't think your example demonstrates any problem
specific to user namespaces. There shouldn't be any way to use a user
namespace to gain access permissions towards /dev/cuse, otherwise we
have bigger problems than cuse to worry about.

Seth

2018-01-17 18:58:48

by Alban Crequy

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
<[email protected]> wrote:
> On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
>> [Adding Tejun, David, Tom for question about cuse]
>>
>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> > From: Seth Forshee <[email protected]>
>> >
>> > In order to support mounts from namespaces other than
>> > init_user_ns, fuse must translate uids and gids to/from the
>> > userns of the process servicing requests on /dev/fuse. This
>> > patch does that, with a couple of restrictions on the namespace:
>> >
>> > - The userns for the fuse connection is fixed to the namespace
>> > from which /dev/fuse is opened.
>> >
>> > - The namespace must be the same as s_user_ns.
>> >
>> > These restrictions simplify the implementation by avoiding the
>> > need to pass around userns references and by allowing fuse to
>> > rely on the checks in inode_change_ok for ownership changes.
>> > Either restriction could be relaxed in the future if needed.
>> >
>> > For cuse the namespace used for the connection is also simply
>> > current_user_ns() at the time /dev/cuse is opened.
>>
>> Was a use case discussed for using cuse in a new unprivileged userns?
>>
>> I ran some tests yesterday with cusexmp [1] and I could add a new char
>> device as an unprivileged user with:
>>
>> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
>> --maj=99 --min=30 --name=foo
>>
>> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
>> Then, I could see the new device:
>>
>> $ cat /proc/devices | grep foo
>> 99 foo
>>
>> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
>> seems dangerous if the dev node can be provided otherwise and if we
>> don't have a use case for it.
>>
>> Thoughts?
>
> I can't remember the specific reasons, but I had concluded that letting
> unprivileged users use cuse within a user namespace isn't safe. But
> having a cuse device node usable by regular users at all is equally
> unsafe I suspect,

This makes sense.

> so I don't think your example demonstrates any problem
> specific to user namespaces. There shouldn't be any way to use a user
> namespace to gain access permissions towards /dev/cuse, otherwise we
> have bigger problems than cuse to worry about.

From my tests, the patch seems safe but I don't fully understand why that is.

I am not trying to gain more permissions towards /dev/cuse but to
create another cuse char file from within the unprivileged userns. I
tested the scenario by patching the memfs userspace FUSE driver to
generate the char device whenever the file is named "cuse" (turning
the regular file into a char device with the cuse major/minor behind
the scenes):

$ unshare -U -r -m
# memfs /mnt/memfs &
# ls -l /mnt/memfs
# echo -n > /mnt/memfs/cuse
-bash: /mnt/memfs/cuse: Input/output error
# ls -l /mnt/memfs/cuse
crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
# cat /mnt/memfs/cuse
cat: /mnt/memfs/cuse: Permission denied

But then, I could not use that char device, even though it seems to
have the correct major/minor and permissions. The kernel FUSE code
seems to call init_special_inode() to handle character devices. I
don't understand why it seems to be safe.

Thanks!
Alban

2018-01-17 19:33:10

by Seth Forshee

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote:
> On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
> <[email protected]> wrote:
> > On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
> >> [Adding Tejun, David, Tom for question about cuse]
> >>
> >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> >> > From: Seth Forshee <[email protected]>
> >> >
> >> > In order to support mounts from namespaces other than
> >> > init_user_ns, fuse must translate uids and gids to/from the
> >> > userns of the process servicing requests on /dev/fuse. This
> >> > patch does that, with a couple of restrictions on the namespace:
> >> >
> >> > - The userns for the fuse connection is fixed to the namespace
> >> > from which /dev/fuse is opened.
> >> >
> >> > - The namespace must be the same as s_user_ns.
> >> >
> >> > These restrictions simplify the implementation by avoiding the
> >> > need to pass around userns references and by allowing fuse to
> >> > rely on the checks in inode_change_ok for ownership changes.
> >> > Either restriction could be relaxed in the future if needed.
> >> >
> >> > For cuse the namespace used for the connection is also simply
> >> > current_user_ns() at the time /dev/cuse is opened.
> >>
> >> Was a use case discussed for using cuse in a new unprivileged userns?
> >>
> >> I ran some tests yesterday with cusexmp [1] and I could add a new char
> >> device as an unprivileged user with:
> >>
> >> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
> >> --maj=99 --min=30 --name=foo
> >>
> >> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
> >> Then, I could see the new device:
> >>
> >> $ cat /proc/devices | grep foo
> >> 99 foo
> >>
> >> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
> >> seems dangerous if the dev node can be provided otherwise and if we
> >> don't have a use case for it.
> >>
> >> Thoughts?
> >
> > I can't remember the specific reasons, but I had concluded that letting
> > unprivileged users use cuse within a user namespace isn't safe. But
> > having a cuse device node usable by regular users at all is equally
> > unsafe I suspect,
>
> This makes sense.
>
> > so I don't think your example demonstrates any problem
> > specific to user namespaces. There shouldn't be any way to use a user
> > namespace to gain access permissions towards /dev/cuse, otherwise we
> > have bigger problems than cuse to worry about.
>
> From my tests, the patch seems safe, but I don't fully understand why that is.
>
> I am not trying to gain more permissions towards /dev/cuse but to
> create another cuse char file from within the unprivileged userns. I
> tested the scenario by patching the memfs userspace FUSE driver to
> generate the char device whenever the file is named "cuse" (turning
> the regular file into a char device with the cuse major/minor behind
> the scenes):
>
> $ unshare -U -r -m
> # memfs /mnt/memfs &
> # ls -l /mnt/memfs
> # echo -n > /mnt/memfs/cuse
> -bash: /mnt/memfs/cuse: Input/output error
> # ls -l /mnt/memfs/cuse
> crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
> # cat /mnt/memfs/cuse
> cat: /mnt/memfs/cuse: Permission denied
>
> But then, I could not use that char device, even though it seems to
> have the correct major/minor and permissions. The kernel FUSE code
> seems to call init_special_inode() to handle character devices. I
> don't understand why it seems to be safe.

Because for new mounts in non-init user namespaces, alloc_super() sets
the SB_I_NODEV flag in s_iflags, which disallows opening device nodes on
that filesystem.
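
The effect of that flag can be modelled in user space. What follows is a
minimal sketch, not kernel source: the names (model_sb, model_mount,
model_may_open, and so on) and the numeric flag values are assumptions made
for illustration only. It shows why a character device node on a filesystem
whose superblock carries the nodev flag fails to open with EACCES even though
its major/minor and mode bits look correct, while regular files on the same
filesystem are unaffected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values for this model; the real flags live in kernel headers. */
#define MODEL_MNT_NODEV  0x02u  /* per-mount "nodev" option */
#define MODEL_SB_I_NODEV 0x04u  /* per-superblock internal flag */

enum model_ftype { MODEL_REG, MODEL_CHR, MODEL_BLK };

struct model_sb    { unsigned int s_iflags; };
struct model_mount { unsigned int mnt_flags; const struct model_sb *sb; };

/* Device nodes are usable only if neither the mount nor the superblock forbids them. */
static bool model_may_open_dev(const struct model_mount *m)
{
    assert(m != NULL && m->sb != NULL);
    return !(m->mnt_flags & MODEL_MNT_NODEV) &&
           !(m->sb->s_iflags & MODEL_SB_I_NODEV);
}

/* Returns 0 if the open may proceed, or a negative errno value. */
static int model_may_open(const struct model_mount *m, enum model_ftype type)
{
    if ((type == MODEL_CHR || type == MODEL_BLK) && !model_may_open_dev(m))
        return -EACCES;
    return 0;
}
```

In this model, a superblock created in a non-init user namespace would have
MODEL_SB_I_NODEV set, so model_may_open() returns -EACCES for MODEL_CHR,
which corresponds to the "Permission denied" seen in the memfs test.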

Seth

2018-01-18 10:32:21

by Alban Crequy

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Jan 17, 2018 at 8:31 PM, Seth Forshee
<[email protected]> wrote:
> On Wed, Jan 17, 2018 at 07:56:59PM +0100, Alban Crequy wrote:
>> On Wed, Jan 17, 2018 at 3:29 PM, Seth Forshee
>> <[email protected]> wrote:
>> > On Wed, Jan 17, 2018 at 11:59:06AM +0100, Alban Crequy wrote:
>> >> [Adding Tejun, David, Tom for question about cuse]
>> >>
>> >> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> >> > From: Seth Forshee <[email protected]>
>> >> >
>> >> > In order to support mounts from namespaces other than
>> >> > init_user_ns, fuse must translate uids and gids to/from the
>> >> > userns of the process servicing requests on /dev/fuse. This
>> >> > patch does that, with a couple of restrictions on the namespace:
>> >> >
>> >> > - The userns for the fuse connection is fixed to the namespace
>> >> > from which /dev/fuse is opened.
>> >> >
>> >> > - The namespace must be the same as s_user_ns.
>> >> >
>> >> > These restrictions simplify the implementation by avoiding the
>> >> > need to pass around userns references and by allowing fuse to
>> >> > rely on the checks in inode_change_ok for ownership changes.
>> >> > Either restriction could be relaxed in the future if needed.
>> >> >
>> >> > For cuse the namespace used for the connection is also simply
>> >> > current_user_ns() at the time /dev/cuse is opened.
>> >>
>> >> Was a use case discussed for using cuse in a new unprivileged userns?
>> >>
>> >> I ran some tests yesterday with cusexmp [1] and I could add a new char
>> >> device as an unprivileged user with:
>> >>
>> >> $ unshare -U -r -m sh -c 'mount --bind /mnt/cuse /dev/cuse ; cusexmp
>> >> --maj=99 --min=30 --name=foo'
>> >>
>> >> where /mnt/cuse is previously mknod'ed correctly and chmod'ed 777.
>> >> Then, I could see the new device:
>> >>
>> >> $ cat /proc/devices | grep foo
>> >> 99 foo
>> >>
>> >> On normal distros, we don't have a /mnt/cuse chmod'ed 777 but still it
>> >> seems dangerous if the dev node can be provided otherwise and if we
>> >> don't have a use case for it.
>> >>
>> >> Thoughts?
>> >
>> > I can't remember the specific reasons, but I had concluded that letting
>> > unprivileged users use cuse within a user namespace isn't safe. But
>> > having a cuse device node usable by regular users at all is equally
>> > unsafe I suspect,
>>
>> This makes sense.
>>
>> > so I don't think your example demonstrates any problem
>> > specific to user namespaces. There shouldn't be any way to use a user
>> > namespace to gain access permissions towards /dev/cuse, otherwise we
>> > have bigger problems than cuse to worry about.
>>
>> From my tests, the patch seems safe, but I don't fully understand why that is.
>>
>> I am not trying to gain more permissions towards /dev/cuse but to
>> create another cuse char file from within the unprivileged userns. I
>> tested the scenario by patching the memfs userspace FUSE driver to
>> generate the char device whenever the file is named "cuse" (turning
>> the regular file into a char device with the cuse major/minor behind
>> the scenes):
>>
>> $ unshare -U -r -m
>> # memfs /mnt/memfs &
>> # ls -l /mnt/memfs
>> # echo -n > /mnt/memfs/cuse
>> -bash: /mnt/memfs/cuse: Input/output error
>> # ls -l /mnt/memfs/cuse
>> crwxrwxrwx. 1 root root 10, 203 Jan 17 18:24 /mnt/memfs/cuse
>> # cat /mnt/memfs/cuse
>> cat: /mnt/memfs/cuse: Permission denied
>>
>> But then, I could not use that char device, even though it seems to
>> have the correct major/minor and permissions. The kernel FUSE code
>> seems to call init_special_inode() to handle character devices. I
>> don't understand why it seems to be safe.
>
> Because for new mounts in non-init user namespaces alloc_super() sets
> SB_I_NODEV flag in s_iflags, which disallows opening device nodes in
> that filesystem.

I see. Thanks for the explanation!

2018-01-18 15:02:09

by Alban Crequy

Subject: Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

On Tue, Jan 9, 2018 at 4:05 PM, Dongsu Park <[email protected]> wrote:
> Hi,
>
> On Mon, Dec 25, 2017 at 8:05 AM, Eric W. Biederman
> <[email protected]> wrote:
>> Dongsu Park <[email protected]> writes:
>>
>>> This patchset v5 is based on work by Seth Forshee and Eric Biederman.
>>> The latest patchset was v4:
>>> https://www.mail-archive.com/[email protected]/msg1132206.html
>>>
>>> At the moment, filesystems backed by physical medium can only be mounted
>>> by real root in the initial user namespace. This restriction exists
>>> because if it's allowed for root user in non-init user namespaces to
>>> mount the filesystem, then it effectively allows the user to control the
>>> underlying source of the filesystem. In case of FUSE, the source would
>>> mean any underlying device.
>>>
>>> However, in many use cases such as containers, it's necessary to allow
>>> filesystems to be mounted from non-init user namespaces. The goal of this
>>> patchset is to allow FUSE filesystems to be mounted from non-init user
>>> namespaces. Support for other filesystems like ext4 is not in the
>>> scope of this patchset.
>>>
>>> Let me describe how to test mounting from non-init user namespaces. It's
>>> assumed that tests are done via sshfs, a userspace filesystem based on
>>> FUSE with ssh as backend. Testing system is Fedora 27.
>>
>> In general I am for this work, and more bodies and more eyes on it is
>> generally better.
>>
>> I will review this after the New Year, I am out for the holidays right
>> now.
>
> Thanks. I'll wait for your review.

Hi Eric,

Do you have some cycles for this now that it is the new year?

A review on the associated ima issue would also be appreciated:
https://www.mail-archive.com/[email protected]/msg1587678.html

Cheers,
Alban

>>> ====
>>> $ sudo dnf install -y sshfs
>>> $ sudo mkdir -p /mnt/userns
>>>
>>> ### workaround to get the sshfs permission checks
>>> $ sudo chown -R $UID:$UID /etc/ssh/ssh_config.d /usr/share/crypto-policies
>>>
>>> $ unshare -U -r -m
>>> # sshfs root@localhost: /mnt/userns
>>>
>>> ### You can see sshfs being mounted from a non-init user namespace
>>> # mount | grep sshfs
>>> root@localhost: on /mnt/userns type fuse.sshfs
>>> (rw,nosuid,nodev,relatime,user_id=0,group_id=0)
>>>
>>> # touch /mnt/userns/test
>>> # ls -l /mnt/userns/test
>>> -rw-r--r-- 1 root root 0 Dec 11 19:01 /mnt/userns/test
>>> ====
>>>
>>> Open another terminal, check the mountpoint from outside the namespace.
>>>
>>> ====
>>> $ grep userns /proc/$(pidof sshfs)/mountinfo
>>> 131 102 0:35 / /mnt/userns rw,nosuid,nodev,relatime - fuse.sshfs
>>> root@localhost: rw,user_id=0,group_id=0
>>> ====
>>>
>>> After all tests are done, you can unmount the filesystem
>>> inside the namespace.
>>>
>>> ====
>>> # fusermount -u /mnt/userns
>>> ====
>>>
>>> Changes since v4:
>>> * Remove other parts like ext4 to keep the patchset minimal for FUSE
>>> * Add and change commit messages
>>> * Describe how to test non-init user namespaces
>>>
>>> TODO:
>>> * Think through potential security implications. There are 2 patches
>>> being prepared for security issues. One is "ima: define a new policy
>>> option named force" by Mimi Zohar, which adds an option to specify
>>> that the results should not be cached:
>>> https://marc.info/?l=linux-integrity&m=151275680115856&w=2
>>> The other one is to basically prevent FUSE results from being cached,
>>> which is still in progress.
>>>
>>> * Test IMA/LSMs. Details are written in
>>> https://github.com/kinvolk/fuse-userns-patches/blob/master/tests/TESTING_INTEGRITY.md
>>>
>>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>>> additional inode permission.
>>>
>>> Patches 3-7 allow the superblock owner to change ownership of inodes, and
>>> deal with additional capability checks w.r.t user namespaces.
>>>
>>> Patches 8-10 allow FUSE filesystems to be mounted outside of the init
>>> user namespace.
>>>
>>> Patch 11 handles a corner case of non-root users in EVM.
>>>
>>> The patchset is also available in our github repo:
>>> https://github.com/kinvolk/linux/tree/dongsu/fuse-userns-v5-1
>>>
>>>
>>> Eric W. Biederman (1):
>>> fs: Allow superblock owner to change ownership of inodes
>>>
>>> Seth Forshee (10):
>>> block_dev: Support checking inode permissions in lookup_bdev()
>>> mtd: Check permissions towards mtd block device inode when mounting
>>> fs: Don't remove suid for CAP_FSETID for userns root
>>> fs: Allow superblock owner to access do_remount_sb()
>>> capabilities: Allow privileged user in s_user_ns to set security.*
>>> xattrs
>>> fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
>>> fuse: Support fuse filesystems outside of init_user_ns
>>> fuse: Restrict allow_other to the superblock's namespace or a
>>> descendant
>>> fuse: Allow user namespace mounts
>>> evm: Don't update hmacs in user ns mounts
>>>
>>> drivers/md/bcache/super.c | 2 +-
>>> drivers/md/dm-table.c | 2 +-
>>> drivers/mtd/mtdsuper.c | 6 +++++-
>>> fs/attr.c | 34 ++++++++++++++++++++++++++--------
>>> fs/block_dev.c | 13 ++++++++++---
>>> fs/fuse/cuse.c | 3 ++-
>>> fs/fuse/dev.c | 11 ++++++++---
>>> fs/fuse/dir.c | 16 ++++++++--------
>>> fs/fuse/fuse_i.h | 6 +++++-
>>> fs/fuse/inode.c | 35 +++++++++++++++++++++--------------
>>> fs/inode.c | 6 ++++--
>>> fs/ioctl.c | 4 ++--
>>> fs/namespace.c | 4 ++--
>>> fs/proc/base.c | 7 +++++++
>>> fs/proc/generic.c | 7 +++++++
>>> fs/proc/proc_sysctl.c | 7 +++++++
>>> fs/quota/quota.c | 2 +-
>>> include/linux/fs.h | 2 +-
>>> kernel/user_namespace.c | 1 +
>>> security/commoncap.c | 8 ++++++--
>>> security/integrity/evm/evm_crypto.c | 3 ++-
>>> 21 files changed, 127 insertions(+), 52 deletions(-)

2018-02-12 16:36:30

by Eric W. Biederman

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

Miklos Szeredi <[email protected]> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> From: Seth Forshee <[email protected]>
>>
>> In order to support mounts from namespaces other than
>> init_user_ns, fuse must translate uids and gids to/from the
>> userns of the process servicing requests on /dev/fuse. This
>> patch does that, with a couple of restrictions on the namespace:
>>
>> - The userns for the fuse connection is fixed to the namespace
>> from which /dev/fuse is opened.
>>
>> - The namespace must be the same as s_user_ns.
>>
>> These restrictions simplify the implementation by avoiding the
>> need to pass around userns references and by allowing fuse to
>> rely on the checks in inode_change_ok for ownership changes.
>> Either restriction could be relaxed in the future if needed.
>
> Can we not introduce potential userspace interface regressions?
>
> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
> allow server to run in different pid_ns") will probably bite us here
> as well.

Maybe, but unlike with the pid namespace, no one has been able to mount
fuse outside of init_user_ns, so we are much less exposed. I agree we
should be careful.

> We basically need two modes of operation:
>
> a) old, backward compatible (not introducing any new failure modes),
> created with privileged mount
> b) new, non-backward compatible, created with unprivileged mount
>
> Technically there would still be a risk of breaking userspace, since
> we are using the same entry point for both, but let's hope that no
> practical problems come from that.

Answering from a 10,000 foot perspective:

There are two cases. The first is requests to read/write the filesystem
from outside of s_user_ns. These run no risk of breaking userspace, as
this mode has not been implemented before.

The second is restrictions at mount time to ensure we are not dealing
with a crazy mix of namespaces. This has a small chance of breaking
someone's crazy setup.


Dropping requests to read/write the filesystem when the requester does
not map into s_user_ns should not be a problem to enable universally. If
s_user_ns is init_user_ns, everything maps, so there is no restriction.
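
To make the "does not map" case concrete, here is a small user-space model
of the id translation. It is only a sketch under stated assumptions, not
kernel code: the types and helpers (model_extent, model_userns,
model_from_kuid, model_req_init) are hypothetical names, only uids are
modelled (gids would be handled the same way with their own map), and the
example maps are made up. It mirrors the check the patch adds to
__fuse_get_req(): a requester whose id has no mapping in the connection's
namespace gets -EOVERFLOW, while an identity map like the one assumed here
for init_user_ns accepts every valid id.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID ((uint32_t)-1)

/*
 * One extent of a uid map: ids [ns_first, ns_first + count) inside the
 * namespace correspond to [kernel_first, kernel_first + count) globally.
 */
struct model_extent { uint32_t ns_first, kernel_first, count; };

struct model_userns { const struct model_extent *extents; size_t nr; };

/* Translate a global uid into the namespace; MODEL_INVALID_ID if unmapped. */
static uint32_t model_from_kuid(const struct model_userns *ns, uint32_t kuid)
{
    assert(ns != NULL);
    for (size_t i = 0; i < ns->nr; i++) {
        const struct model_extent *e = &ns->extents[i];
        if (kuid >= e->kernel_first && kuid - e->kernel_first < e->count)
            return e->ns_first + (kuid - e->kernel_first);
    }
    return MODEL_INVALID_ID;
}

/* Fill in the uid a request would carry, or refuse an unmapped requester. */
static int model_req_init(const struct model_userns *ns, uint32_t kuid,
                          uint32_t *req_uid)
{
    uint32_t uid = model_from_kuid(ns, kuid);

    if (uid == MODEL_INVALID_ID)
        return -EOVERFLOW;
    *req_uid = uid;
    return 0;
}
```

With a single-extent map { 0, 1000, 1 }, as one might get from
"unshare -U -r" run by a user whose uid is assumed to be 1000, only global
uid 1000 is accepted (and seen as uid 0 by the server); every other
requester is turned away.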



If we want to ensure maximum backwards compatibility, there is one more
thing we can do: when the fuse filesystem is mounted in init_user_ns but
the device for the communication channel is opened in some other user
namespace, we can just force the communication channel to operate in
init_user_ns.

That will be 100% backwards compatible in all cases and, as far as I can
see, removes the need for having different "modes" of operation.
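
The rule proposed here can be written down as a small decision function.
This is a hedged user-space sketch of the idea, not a patch: model_userns,
model_init_user_ns and model_pick_conn_ns are hypothetical names, and the
second branch simply reuses the restriction from the patch under review
(the namespace that opened /dev/fuse must equal s_user_ns).

```c
#include <assert.h>
#include <stddef.h>

struct model_userns { const char *name; };

/* Stand-in for the initial user namespace. */
static const struct model_userns model_init_user_ns = { "init_user_ns" };

/*
 * Pick the namespace a connection should translate ids in, given the
 * superblock's namespace and the namespace that opened /dev/fuse.
 * Returns NULL when the mount should be refused.
 */
static const struct model_userns *
model_pick_conn_ns(const struct model_userns *sb_ns,
                   const struct model_userns *opener_ns)
{
    assert(sb_ns != NULL && opener_ns != NULL);

    /* Privileged mount: behave as before, whoever opened the device. */
    if (sb_ns == &model_init_user_ns)
        return &model_init_user_ns;

    /* Unprivileged mount: the opener must live in the superblock's namespace. */
    if (opener_ns != sb_ns)
        return NULL;

    return sb_ns;
}
```

Under this model a privileged mount never fails because of where the device
was opened, so existing setups keep working, and the stricter matching rule
only applies to mounts that could not exist before.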



This does look like the time to give all of this a hard look and see if
we can get these patches in shape to be merged.

Eric



>> For cuse the namespace used for the connection is also simply
>> current_user_ns() at the time /dev/cuse is opened.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Miklos Szeredi <[email protected]>
>> Signed-off-by: Seth Forshee <[email protected]>
>> Signed-off-by: Dongsu Park <[email protected]>
>> ---
>> fs/fuse/cuse.c | 3 ++-
>> fs/fuse/dev.c | 11 ++++++++---
>> fs/fuse/dir.c | 14 +++++++-------
>> fs/fuse/fuse_i.h | 6 +++++-
>> fs/fuse/inode.c | 31 +++++++++++++++++++------------
>> 5 files changed, 41 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
>> index e9e97803..b1b83259 100644
>> --- a/fs/fuse/cuse.c
>> +++ b/fs/fuse/cuse.c
>> @@ -48,6 +48,7 @@
>> #include <linux/stat.h>
>> #include <linux/module.h>
>> #include <linux/uio.h>
>> +#include <linux/user_namespace.h>
>>
>> #include "fuse_i.h"
>>
>> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
>> if (!cc)
>> return -ENOMEM;
>>
>> - fuse_conn_init(&cc->fc);
>> + fuse_conn_init(&cc->fc, current_user_ns());
>>
>> fud = fuse_dev_alloc(&cc->fc);
>> if (!fud) {
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 17f0d05b..0f780e16 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>>
>> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> {
>> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
>> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
>> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>> }
>>
>> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>> __set_bit(FR_WAITING, &req->flags);
>> if (for_background)
>> __set_bit(FR_BACKGROUND, &req->flags);
>> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
>> + fuse_put_request(fc, req);
>> + return ERR_PTR(-EOVERFLOW);
>> + }
>>
>> return req;
>>
>> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
>> in = &req->in;
>> reqsize = in->h.len;
>>
>> - if (task_active_pid_ns(current) != fc->pid_ns) {
>> + if (task_active_pid_ns(current) != fc->pid_ns ||
>> + current_user_ns() != fc->user_ns) {
>
> I don't get it. Why recalculate the pid if the user_ns does not match?
>
>> rcu_read_lock();
>> in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
>> rcu_read_unlock();
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 24967382..ad1cfac1 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>> stat->ino = attr->ino;
>> stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>> stat->nlink = attr->nlink;
>> - stat->uid = make_kuid(&init_user_ns, attr->uid);
>> - stat->gid = make_kgid(&init_user_ns, attr->gid);
>> + stat->uid = make_kuid(fc->user_ns, attr->uid);
>> + stat->gid = make_kgid(fc->user_ns, attr->gid);
>> stat->rdev = inode->i_rdev;
>> stat->atime.tv_sec = attr->atime;
>> stat->atime.tv_nsec = attr->atimensec;
>> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
>> return true;
>> }
>>
>> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
>> - bool trust_local_cmtime)
>> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
>> + struct fuse_setattr_in *arg, bool trust_local_cmtime)
>> {
>> unsigned ivalid = iattr->ia_valid;
>>
>> if (ivalid & ATTR_MODE)
>> arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
>> if (ivalid & ATTR_UID)
>> - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
>> + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
>> if (ivalid & ATTR_GID)
>> - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
>> + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
>> if (ivalid & ATTR_SIZE)
>> arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
>> if (ivalid & ATTR_ATIME) {
>> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>>
>> memset(&inarg, 0, sizeof(inarg));
>> memset(&outarg, 0, sizeof(outarg));
>> - iattr_to_fattr(attr, &inarg, trust_local_cmtime);
>> + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
>> if (file) {
>> struct fuse_file *ff = file->private_data;
>> inarg.valid |= FATTR_FH;
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index d5773ca6..364e65c8 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -26,6 +26,7 @@
>> #include <linux/xattr.h>
>> #include <linux/pid_namespace.h>
>> #include <linux/refcount.h>
>> +#include <linux/user_namespace.h>
>>
>> /** Max number of pages that can be used in a single read request */
>> #define FUSE_MAX_PAGES_PER_REQ 32
>> @@ -466,6 +467,9 @@ struct fuse_conn {
>> /** The pid namespace for this mount */
>> struct pid_namespace *pid_ns;
>>
>> + /** The user namespace for this mount */
>> + struct user_namespace *user_ns;
>> +
>> /** Maximum read size */
>> unsigned max_read;
>>
>> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
>> /**
>> * Initialize fuse_conn
>> */
>> -void fuse_conn_init(struct fuse_conn *fc);
>> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>>
>> /**
>> * Release reference to fuse_conn
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 2f504d61..7f6b2e55 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>> inode->i_ino = fuse_squash_ino(attr->ino);
>> inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
>> set_nlink(inode, attr->nlink);
>> - inode->i_uid = make_kuid(&init_user_ns, attr->uid);
>> - inode->i_gid = make_kgid(&init_user_ns, attr->gid);
>> + inode->i_uid = make_kuid(fc->user_ns, attr->uid);
>> + inode->i_gid = make_kgid(fc->user_ns, attr->gid);
>> inode->i_blocks = attr->blocks;
>> inode->i_atime.tv_sec = attr->atime;
>> inode->i_atime.tv_nsec = attr->atimensec;
>> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
>> return err;
>> }
>>
>> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
>> + struct user_namespace *user_ns)
>> {
>> char *p;
>> memset(d, 0, sizeof(struct fuse_mount_data));
>> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>> case OPT_USER_ID:
>> if (fuse_match_uint(&args[0], &uv))
>> return 0;
>> - d->user_id = make_kuid(current_user_ns(), uv);
>> + d->user_id = make_kuid(user_ns, uv);
>> if (!uid_valid(d->user_id))
>> return 0;
>> d->user_id_present = 1;
>> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
>> case OPT_GROUP_ID:
>> if (fuse_match_uint(&args[0], &uv))
>> return 0;
>> - d->group_id = make_kgid(current_user_ns(), uv);
>> + d->group_id = make_kgid(user_ns, uv);
>> if (!gid_valid(d->group_id))
>> return 0;
>> d->group_id_present = 1;
>> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
>> struct super_block *sb = root->d_sb;
>> struct fuse_conn *fc = get_fuse_conn_super(sb);
>>
>> - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
>> - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
>> + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
>> + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
>> if (fc->default_permissions)
>> seq_puts(m, ",default_permissions");
>> if (fc->allow_other)
>> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
>> fpq->connected = 1;
>> }
>>
>> -void fuse_conn_init(struct fuse_conn *fc)
>> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
>> {
>> memset(fc, 0, sizeof(*fc));
>> spin_lock_init(&fc->lock);
>> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>> fc->attr_version = 1;
>> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>> fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
>> + fc->user_ns = get_user_ns(user_ns);
>> }
>> EXPORT_SYMBOL_GPL(fuse_conn_init);
>>
>> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>> if (fc->destroy_req)
>> fuse_request_free(fc->destroy_req);
>> put_pid_ns(fc->pid_ns);
>> + put_user_ns(fc->user_ns);
>> fc->release(fc);
>> }
>> }
>> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>>
>> sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>>
>> - if (!parse_fuse_opt(data, &d, is_bdev))
>> + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
>> goto err;
>>
>> if (is_bdev) {
>> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>> if (!file)
>> goto err;
>>
>> - if ((file->f_op != &fuse_dev_operations) ||
>> - (file->f_cred->user_ns != &init_user_ns))
>> + /*
>> + * Require mount to happen from the same user namespace which
>> + * opened /dev/fuse to prevent potential attacks.
>> + */
>> + if (file->f_op != &fuse_dev_operations ||
>> + file->f_cred->user_ns != sb->s_user_ns)
>> goto err_fput;
>>
>> fc = kmalloc(sizeof(*fc), GFP_KERNEL);
>> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>> if (!fc)
>> goto err_fput;
>>
>> - fuse_conn_init(fc);
>> + fuse_conn_init(fc, sb->s_user_ns);
>> fc->release = fuse_free_conn;
>>
>> fud = fuse_dev_alloc(fc);
>> --
>> 2.13.6
>>

2018-02-12 16:58:08

by Miklos Szeredi

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> From: Seth Forshee <[email protected]>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
> - The userns for the fuse connection is fixed to the namespace
> from which /dev/fuse is opened.
>
> - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.

Can we not introduce potential userspace interface regressions?

The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
allow server to run in different pid_ns") will probably bite us here
as well.

We basically need two modes of operation:

a) old, backward compatible (not introducing any new failure modes),
created with privileged mount
b) new, non-backward compatible, created with unprivileged mount

Technically there would still be a risk of breaking userspace, since
we are using the same entry point for both, but let's hope that no
practical problems come from that.

> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/fuse/cuse.c | 3 ++-
> fs/fuse/dev.c | 11 ++++++++---
> fs/fuse/dir.c | 14 +++++++-------
> fs/fuse/fuse_i.h | 6 +++++-
> fs/fuse/inode.c | 31 +++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
> #include <linux/stat.h>
> #include <linux/module.h>
> #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
> #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
> if (!cc)
> return -ENOMEM;
>
> - fuse_conn_init(&cc->fc);
> + fuse_conn_init(&cc->fc, current_user_ns());
>
> fud = fuse_dev_alloc(&cc->fc);
> if (!fud) {
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> __set_bit(FR_WAITING, &req->flags);
> if (for_background)
> __set_bit(FR_BACKGROUND, &req->flags);
> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
>
> return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
> in = &req->in;
> reqsize = in->h.len;
>
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> + if (task_active_pid_ns(current) != fc->pid_ns ||
> + current_user_ns() != fc->user_ns) {

I don't get it. Why recalculate the pid if the user_ns does not match?

> rcu_read_lock();
> in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> rcu_read_unlock();
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 24967382..ad1cfac1 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
> stat->ino = attr->ino;
> stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> stat->nlink = attr->nlink;
> - stat->uid = make_kuid(&init_user_ns, attr->uid);
> - stat->gid = make_kgid(&init_user_ns, attr->gid);
> + stat->uid = make_kuid(fc->user_ns, attr->uid);
> + stat->gid = make_kgid(fc->user_ns, attr->gid);
> stat->rdev = inode->i_rdev;
> stat->atime.tv_sec = attr->atime;
> stat->atime.tv_nsec = attr->atimensec;
> @@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
> return true;
> }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> - bool trust_local_cmtime)
> +static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> + struct fuse_setattr_in *arg, bool trust_local_cmtime)
> {
> unsigned ivalid = iattr->ia_valid;
>
> if (ivalid & ATTR_MODE)
> arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
> if (ivalid & ATTR_UID)
> - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> + arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
> if (ivalid & ATTR_GID)
> - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> + arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
> if (ivalid & ATTR_SIZE)
> arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
> if (ivalid & ATTR_ATIME) {
> @@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
>
> memset(&inarg, 0, sizeof(inarg));
> memset(&outarg, 0, sizeof(outarg));
> - iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> + iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
> if (file) {
> struct fuse_file *ff = file->private_data;
> inarg.valid |= FATTR_FH;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d5773ca6..364e65c8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -26,6 +26,7 @@
> #include <linux/xattr.h>
> #include <linux/pid_namespace.h>
> #include <linux/refcount.h>
> +#include <linux/user_namespace.h>
>
> /** Max number of pages that can be used in a single read request */
> #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -466,6 +467,9 @@ struct fuse_conn {
> /** The pid namespace for this mount */
> struct pid_namespace *pid_ns;
>
> + /** The user namespace for this mount */
> + struct user_namespace *user_ns;
> +
> /** Maximum read size */
> unsigned max_read;
>
> @@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
> /**
> * Initialize fuse_conn
> */
> -void fuse_conn_init(struct fuse_conn *fc);
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
>
> /**
> * Release reference to fuse_conn
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2f504d61..7f6b2e55 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> inode->i_ino = fuse_squash_ino(attr->ino);
> inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> set_nlink(inode, attr->nlink);
> - inode->i_uid = make_kuid(&init_user_ns, attr->uid);
> - inode->i_gid = make_kgid(&init_user_ns, attr->gid);
> + inode->i_uid = make_kuid(fc->user_ns, attr->uid);
> + inode->i_gid = make_kgid(fc->user_ns, attr->gid);
> inode->i_blocks = attr->blocks;
> inode->i_atime.tv_sec = attr->atime;
> inode->i_atime.tv_nsec = attr->atimensec;
> @@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
> return err;
> }
>
> -static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> +static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
> + struct user_namespace *user_ns)
> {
> char *p;
> memset(d, 0, sizeof(struct fuse_mount_data));
> @@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> case OPT_USER_ID:
> if (fuse_match_uint(&args[0], &uv))
> return 0;
> - d->user_id = make_kuid(current_user_ns(), uv);
> + d->user_id = make_kuid(user_ns, uv);
> if (!uid_valid(d->user_id))
> return 0;
> d->user_id_present = 1;
> @@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> case OPT_GROUP_ID:
> if (fuse_match_uint(&args[0], &uv))
> return 0;
> - d->group_id = make_kgid(current_user_ns(), uv);
> + d->group_id = make_kgid(user_ns, uv);
> if (!gid_valid(d->group_id))
> return 0;
> d->group_id_present = 1;
> @@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> struct super_block *sb = root->d_sb;
> struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> + seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
> + seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
> if (fc->default_permissions)
> seq_puts(m, ",default_permissions");
> if (fc->allow_other)
> @@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
> fpq->connected = 1;
> }
>
> -void fuse_conn_init(struct fuse_conn *fc)
> +void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
> {
> memset(fc, 0, sizeof(*fc));
> spin_lock_init(&fc->lock);
> @@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> fc->attr_version = 1;
> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + fc->user_ns = get_user_ns(user_ns);
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (fc->destroy_req)
> fuse_request_free(fc->destroy_req);
> put_pid_ns(fc->pid_ns);
> + put_user_ns(fc->user_ns);
> fc->release(fc);
> }
> }
> @@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>
> sb->s_flags &= ~(MS_NOSEC | SB_I_VERSION);
>
> - if (!parse_fuse_opt(data, &d, is_bdev))
> + if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
> goto err;
>
> if (is_bdev) {
> @@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!file)
> goto err;
>
> - if ((file->f_op != &fuse_dev_operations) ||
> - (file->f_cred->user_ns != &init_user_ns))
> + /*
> + * Require mount to happen from the same user namespace which
> + * opened /dev/fuse to prevent potential attacks.
> + */
> + if (file->f_op != &fuse_dev_operations ||
> + file->f_cred->user_ns != sb->s_user_ns)
> goto err_fput;
>
> fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> @@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!fc)
> goto err_fput;
>
> - fuse_conn_init(fc);
> + fuse_conn_init(fc, sb->s_user_ns);
> fc->release = fuse_free_conn;
>
> fud = fuse_dev_alloc(fc);
> --
> 2.13.6
>

2018-02-13 10:21:10

by Miklos Szeredi

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
<[email protected]> wrote:
> Miklos Szeredi <[email protected]> writes:
>
>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>>> From: Seth Forshee <[email protected]>
>>>
>>> In order to support mounts from namespaces other than
>>> init_user_ns, fuse must translate uids and gids to/from the
>>> userns of the process servicing requests on /dev/fuse. This
>>> patch does that, with a couple of restrictions on the namespace:
>>>
>>> - The userns for the fuse connection is fixed to the namespace
>>> from which /dev/fuse is opened.
>>>
>>> - The namespace must be the same as s_user_ns.
>>>
>>> These restrictions simplify the implementation by avoiding the
>>> need to pass around userns references and by allowing fuse to
>>> rely on the checks in inode_change_ok for ownership changes.
>>> Either restriction could be relaxed in the future if needed.
>>
>> Can we not introduce potential userspace interface regressions?
>>
>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>> allow server to run in different pid_ns") will probably bite us here
>> as well.
>
> Maybe, but unlike the pid namespace no one has been able to mount
> fuse outside of init_user_ns so we are much less exposed. I agree we
> should be careful.

Have to wrap my head around all the rules here.

There's the may_mount() one:

ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)

Um, first of all, why isn't it checking current->cred->user_ns?

Ah, there it is in sget():

ns_capable(user_ns, CAP_SYS_ADMIN)

I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
doesn't have FS_USERNS_MOUNT. This is the one that prevents fuse
mounts from being created when (current->cred->user_ns !=
&init_user_ns).

Maybe there's a logic to this web of namespaces, but I don't yet see
it. Is it documented somewhere?

>> We basically need two modes of operation:
>>
>> a) old, backward compatible (not introducing any new failure modes),
>> created with privileged mount
>> b) new, non-backward compatible, created with unprivileged mount
>>
>> Technically there would still be a risk from breaking userspace, since
>> we are using the same entry point for both, but let's hope that no
>> practical problems come from that.
>
> Answering from a 10,000 foot perspective:
>
> There are two cases. Requests to read/write the filesystem from outside
> of s_user_ns. These run no risk of breaking userspace as this mode has
> not been implemented before.

This comes from the fact that (s_user_ns == &init_user_ns) and all
user namespaces are "inside" init_user_ns, right?

One question: why does current code use the from_[ug]id_munged()
variant, when the conversion can never fail. Or can it?

> Restrictions at mount time to ensure we are not dealing with a crazy mix
> of namespaces. This has a small chance of breaking someone's crazy
> setup.
>
>
> Dropping requests to read/write the filesystem when the requester does
> not map into s_user_ns should not be a problem to enable universally. If
> s_user_ns is init_user_ns everything maps so there is no restriction.
>
>
>
> What we can do if we want to ensure maximum backwards compatibility
> is if the fuse filesystem is mounted in init_user_ns but if device for
> the communication channel is opened in some other user namespace we
> can just force the communication channel to operate in init_user_ns.
>
> That will be 100% backwards compatible in all cases and as far as I can
> see remove the need for having different ``modes'' of operation.

Okay.

Thanks,
Miklos

2018-02-13 11:33:02

by Miklos Szeredi

Subject: Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:

> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
> additional inode permission.

fuse_blk is less suitable for unprivileged mounting than plain fuse.
fusermount doesn't allow mounting fuse_blk unprivileged, so there's
little data about that usecase (IIRC ntfs3g guys did that, or at least
tried to do it, but I don't remember the details).

As such, I think we should leave it out of the initial version. Which
means you can drop patches 1-2 from this series. Unless there's a
strong use case for this. In which case we should look hard at the
differences between fuse_blk and fuse and how that affects
unprivileged operation. There are a few assumptions about fuse_blk
filesystem being more "well behaved", I think.

Thanks,
Miklos

2018-02-13 13:20:27

by Miklos Szeredi

Subject: Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> From: Eric W. Biederman <[email protected]>
>
> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
> chown files. Ordinarily the capable_wrt_inode_uidgid check is
> sufficient to allow access to files but when the underlying filesystem
> has uids or gids that don't map to the current user namespace it is
> not enough, so the chown permission checks need to be extended to
> allow this case.
>
> Calling chown on filesystem nodes whose uid or gid don't map is
> necessary if those nodes are going to be modified as writing back
> inodes which contain uids or gids that don't map is likely to cause
> filesystem corruption of the uid or gid fields.

How can the filesystem be corrupted if chown is denied?

It is not clear to me what the purpose of this patch is or what the
exact usecase this is fixing.

Thanks,
Miklos

2018-02-13 13:38:18

by Miklos Szeredi

Subject: Re: [PATCH 04/11] fs: Don't remove suid for CAP_FSETID for userns root

On Sat, Dec 23, 2017 at 1:38 PM, Dongsu Park <[email protected]> wrote:
> Hi,
>
> On Sat, Dec 23, 2017 at 4:26 AM, Serge E. Hallyn <[email protected]> wrote:
>> On Fri, Dec 22, 2017 at 03:32:28PM +0100, Dongsu Park wrote:
>>> From: Seth Forshee <[email protected]>
>>>
>>> Expand the check in should_remove_suid() to keep privileges for
>>
>> I realize this description came from Seth, but reading it now,
>> 'Expand' seems wrong. Expanding a check brings to my mind making
>> it stricter, not looser. How about 'Relax the check' ?
>
> Makes sense. Will do.
>
>>> CAP_FSETID in s_user_ns rather than init_user_ns.
>>>
>>> Patch v4 is available: https://patchwork.kernel.org/patch/8944621/
>>>
>>> --EWB Changed from ns_capable(sb->s_user_ns, ) to capable_wrt_inode_uidgid
>>
>> Why exactly?
>>
>> This is wrong, because capable_wrt_inode_uidgid() does a check
>> against current_user_ns, not the inode->i_sb->s_user_ns

I'm thoroughly confused. s_user_ns is supposed to be about the
usernamespace the filesystem perceives to be in, right? How does that
come into play when checking permissions to do something?

Thanks,
Miklos

2018-02-14 12:30:42

by Miklos Szeredi

Subject: Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> From: Seth Forshee <[email protected]>
>
> The user in control of a super block should be allowed to freeze
> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
> ioctls to require CAP_SYS_ADMIN in s_user_ns.

Why is this required for unprivileged fuse?

Fuse doesn't support freeze, so this seems to make no sense in the
context of this patchset.

Thanks,
Miklos

2018-02-14 13:46:00

by Miklos Szeredi

Subject: Re: [PATCH 10/11] fuse: Allow user namespace mounts

On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
> From: Seth Forshee <[email protected]>
>
> To be able to mount fuse from non-init user namespaces, it's necessary
> to set FS_USERNS_MOUNT flag to fs_flags.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> [dongsu: add a simple commit messasge]
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/fuse/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 7f6b2e55..8c98edee 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
> static struct file_system_type fuse_fs_type = {
> .owner = THIS_MODULE,
> .name = "fuse",
> - .fs_flags = FS_HAS_SUBTYPE,
> + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> .mount = fuse_mount,
> .kill_sb = fuse_kill_sb_anon,
> };

I think enabling FS_USERNS_MOUNT should be pretty safe.

I was thinking opting out should be as simple as "chmod o-rw
/dev/fuse". But that breaks libfuse, even though fusermount opens
/dev/fuse in privileged mode, so it shouldn't. That can be fixed in
libfuse, but it's an unfortunate bug and it also means /dev/fuse is
configured with "crw-rw-rw-" in most cases. Which means it will be
opting out, not opting in, which is the less safe version.

> @@ -1244,7 +1244,7 @@ static struct file_system_type fuseblk_fs_type = {
> .name = "fuseblk",
> .mount = fuse_mount_blk,
> .kill_sb = fuse_kill_sb_blk,
> - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> };
> MODULE_ALIAS_FS("fuseblk");

As I said, this hunk should be dropped from the first version, because
it's possibly unsafe.

Thanks,
Miklos

2018-02-15 08:47:55

by Miklos Szeredi

Subject: Re: [PATCH 10/11] fuse: Allow user namespace mounts

On Wed, Feb 14, 2018 at 2:44 PM, Miklos Szeredi <[email protected]> wrote:
> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> From: Seth Forshee <[email protected]>
>>
>> To be able to mount fuse from non-init user namespaces, it's necessary
>> to set FS_USERNS_MOUNT flag to fs_flags.
>>
>> Patch v4 is available: https://patchwork.kernel.org/patch/8944681/
>>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Miklos Szeredi <[email protected]>
>> Signed-off-by: Seth Forshee <[email protected]>
>> [dongsu: add a simple commit messasge]
>> Signed-off-by: Dongsu Park <[email protected]>
>> ---
>> fs/fuse/inode.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 7f6b2e55..8c98edee 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -1212,7 +1212,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>> static struct file_system_type fuse_fs_type = {
>> .owner = THIS_MODULE,
>> .name = "fuse",
>> - .fs_flags = FS_HAS_SUBTYPE,
>> + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>> .mount = fuse_mount,
>> .kill_sb = fuse_kill_sb_anon,
>> };
>
> I think enabling FS_USERNS_MOUNT should be pretty safe.
>
> I was thinking opting out should be as simple as "chmod o-rw
> /dev/fuse". But that breaks libfuse, even though fusermount opens
> /dev/fuse in privileged mode, so it shouldn't.

I'm talking rubbish, /dev/fuse is opened without privs in fusermount as well.

So there's no way to differentiate user_ns unpriv mounts from suid
fusermount unpriv mounts.

Maybe that's just as well...

Thanks,
Miklos

2018-02-16 21:53:57

by Eric W. Biederman

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

Miklos Szeredi <[email protected]> writes:

> On Mon, Feb 12, 2018 at 5:35 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Miklos Szeredi <[email protected]> writes:
>>
>>> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>>>> From: Seth Forshee <[email protected]>
>>>>
>>>> In order to support mounts from namespaces other than
>>>> init_user_ns, fuse must translate uids and gids to/from the
>>>> userns of the process servicing requests on /dev/fuse. This
>>>> patch does that, with a couple of restrictions on the namespace:
>>>>
>>>> - The userns for the fuse connection is fixed to the namespace
>>>> from which /dev/fuse is opened.
>>>>
>>>> - The namespace must be the same as s_user_ns.
>>>>
>>>> These restrictions simplify the implementation by avoiding the
>>>> need to pass around userns references and by allowing fuse to
>>>> rely on the checks in inode_change_ok for ownership changes.
>>>> Either restriction could be relaxed in the future if needed.
>>>
>>> Can we not introduce potential userspace interface regressions?
>>>
>>> The issue with pid namespaces fixed in commit 5d6d3a301c4e ("fuse:
>>> allow server to run in different pid_ns") will probably bite us here
>>> as well.
>>
>> Maybe, but unlike the pid namespace no one has been able to mount
>> fuse outside of init_user_ns so we are much less exposed. I agree we
>> should be careful.
>
> Have to wrap my head around all the rules here.
>
> There's the may_mount() one:
>
> ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)
>
> Um, first of all, why isn't it checking current->cred->user_ns?
>
> Ah, there it is in sget():
>
> ns_capable(user_ns, CAP_SYS_ADMIN)
>
> I get the plain capable(CAP_SYS_ADMIN) check in sget_userns() if fs
> doesn't have FS_USERNS_MOUNT. This is the one that prevents fuse
> mounts from being created when (current->cred->user_ns !=
> &init_user_ns).
>
> Maybe there's a logic to this web of namespaces, but I don't yet see
> it. Is it documented somewhere?

I think this is a bit simpler than the fiddly details in the
implementation might make it look.

The fundamental idea is that permission to have full control over
a mount namespace is different from permission to have full control
over an instance of a filesystem.

Implementing that separation of permission checks gets a little bit
fiddly. The first challenge is that there are several filesystems like
sysfs and proc whose internal mount is created outside of a process.
Then there are the file systems like nfs and afs that have ``referral
points'' that transition you to other instances of those filesystems
when you transition over them. That is the reason why there are
exceptions for SB_KERNMOUNT and SB_SUBMOUNT.

may_mount is just the permission check for the mount namespace. It
checks that the current process has CAP_SYS_ADMIN in the user namespace
that owns the current mount namespace. AKA is the process allowed to
change the mount namespace.

sget is just the permission check for mounting a filesystem. It checks
that the mounter has CAP_SYS_ADMIN over the user namespace that will own
the newly mounted filesystem.

By the time execution gets to sget_userns, in general all of the
permission checks have been made. But if the filesystem is not one
that supports mounting within a user namespace, the code checks
capable(CAP_SYS_ADMIN).

That is more convoluted than I would like but the checks derive from the
definition of what we are doing.
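
The three checks described above can be modelled in a small userspace
sketch. Everything here is a simplified stand-in, not kernel code: the
struct userns and struct task types, the ns_capable_admin() helper and
the sget_allowed() wrapper are hypothetical names, and the rule that the
owner of a child namespace gets capabilities in it is deliberately left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace: only the parent link. */
struct userns {
	const struct userns *parent;	/* NULL for the initial namespace */
};

/* Hypothetical stand-in for the mounting task. */
struct task {
	const struct userns *cred_ns;	/* namespace of the task's credentials */
	bool cap_sys_admin;		/* CAP_SYS_ADMIN raised in cred_ns */
	const struct userns *mnt_owner;	/* owner of the task's mount namespace */
};

/*
 * Simplified ns_capable(target, CAP_SYS_ADMIN): the capability counts
 * in the task's own namespace and in every descendant of it.
 */
static bool ns_capable_admin(const struct task *t, const struct userns *target)
{
	assert(t != NULL);
	for (const struct userns *ns = target; ns != NULL; ns = ns->parent)
		if (ns == t->cred_ns)
			return t->cap_sys_admin;
	return false;
}

/* may_mount(): is the task allowed to change its mount namespace? */
static bool may_mount(const struct task *t)
{
	return ns_capable_admin(t, t->mnt_owner);
}

/*
 * sget() followed by sget_userns(): admin over the namespace that will
 * own the superblock, plus a plain capable() check (modelled as admin
 * over init_ns) when the filesystem lacks FS_USERNS_MOUNT.
 */
static bool sget_allowed(const struct task *t, const struct userns *sb_owner,
			 bool fs_userns_mount, const struct userns *init_ns)
{
	if (!ns_capable_admin(t, sb_owner))
		return false;
	if (!fs_userns_mount && !ns_capable_admin(t, init_ns))
		return false;
	return true;
}
```

In this model a task that is root only inside a child namespace passes
may_mount() and the first sget check, and is stopped by the last check
unless the filesystem type sets FS_USERNS_MOUNT.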

>
>>> We basically need two modes of operation:
>>>
>>> a) old, backward compatible (not introducing any new failure modes),
>>> created with privileged mount
>>> b) new, non-backward compatible, created with unprivileged mount
>>>
>>> Technically there would still be a risk from breaking userspace, since
>>> we are using the same entry point for both, but let's hope that no
>>> practical problems come from that.
>>
>> Answering from a 10,000 foot perspective:
>>
>> There are two cases. Requests to read/write the filesystem from outside
>> of s_user_ns. These run no risk of breaking userspace as this mode has
>> not been implemented before.
>
> This comes from the fact that (s_user_ns == &init_user_ns) and all
> user namespaces are "inside" init_user_ns, right?

Yes.

> One question: why does current code use the from_[ug]id_munged()
> variant, when the conversion can never fail. Or can it?

There is always at least (uid_t)-1 that can fail if it shows up on a
filesystem. As far as I can tell no one was using it for a uid, there
were already uses of (uid_t)-1 as a special case, and I just grabbed it
to become INVALID_UID.

In practice the mapping can't fail unless someone malicious starts using
that id.

I believe I picked the _munged variant so that, if that case is ever
hit, we are guaranteed to return the 16-bit nobody user.
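
To make the difference between the plain and the _munged conversions
concrete, here is a minimal userspace sketch. It assumes a single
mapping extent per namespace, and the names (struct id_extent,
map_down(), map_down_munged()) are hypothetical stand-ins rather than
the kernel's from_kuid()/from_kuid_munged(). The value 65534 is the
kernel's default overflow uid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID	((uint32_t)-1)	/* plays the role of INVALID_UID */
#define OVERFLOW_ID	65534u		/* default overflow ("nobody") id */

/* One mapping extent: ns ids [first, first+count) <-> kernel ids
 * [lower_first, lower_first+count). */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Like from_kuid(): kernel id -> id seen in the namespace, or
 * INVALID_ID when the kernel id has no mapping. */
static uint32_t map_down(const struct id_extent *e, uint32_t kid)
{
	assert(e != NULL);
	if (kid != INVALID_ID && kid >= e->lower_first &&
	    kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return INVALID_ID;
}

/* Like from_kuid_munged(): never fails, falls back to the overflow id. */
static uint32_t map_down_munged(const struct id_extent *e, uint32_t kid)
{
	uint32_t id = map_down(e, kid);

	return id == INVALID_ID ? OVERFLOW_ID : id;
}
```

With an identity extent covering 4294967295 ids, as the initial
namespace uses, the only id that cannot be mapped is (uid_t)-1 itself,
which is the case discussed above.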

Eric

2018-02-16 21:55:33

by Eric W. Biederman

Subject: Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

Miklos Szeredi <[email protected]> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>
>> Patches 1-2 deal with an additional flag of lookup_bdev() to check for
>> additional inode permission.
>
> fuse_blk is less suitable for unprivileged mounting than plain fuse.
> fusermount doesn't allow mounting fuse_blk unprivileged, so there's
> little data about that usecase (IIRC ntfs3g guys did that, or at least
> tried to do it, but I don't remember the details).
>
> As such, I think we should leave it out of the initial version. Which
> means you can drop patches 1-2 from this series. Unless there's a
> strong use case for this. In which case we should look hard at the
> differences between fuse_blk and fuse and how that affects
> unprivileged operation. There are a few assumptions about fuse_blk
> filesystem being more "well behaved", I think.

Especially to start with I am fine with that.

It makes a lot of sense to get the obvious cases first.

Eric

2018-02-16 22:02:22

by Eric W. Biederman

Subject: Re: [PATCH 03/11] fs: Allow superblock owner to change ownership of inodes

Miklos Szeredi <[email protected]> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> From: Eric W. Biederman <[email protected]>
>>
>> Allow users with CAP_SYS_CHOWN over the superblock of a filesystem to
>> chown files. Ordinarily the capable_wrt_inode_uidgid check is
>> sufficient to allow access to files but when the underlying filesystem
>> has uids or gids that don't map to the current user namespace it is
>> not enough, so the chown permission checks need to be extended to
>> allow this case.
>>
>> Calling chown on filesystem nodes whose uid or gid don't map is
>> necessary if those nodes are going to be modified as writing back
>> inodes which contain uids or gids that don't map is likely to cause
>> filesystem corruption of the uid or gid fields.
>
> How can the filesystem be corrupted if chown is denied?
>
> It is not clear to me what the purpose of this patch is or what the
> exact usecase this is fixing.

It isn't a fix and we can delay this one and similar patches
that enable things until we are certain all of the necessary
restrictions are in place. This is not essential for safely getting
fully unprivileged mounting of fuse to work.

The overall strategy has been to handle as many of the generic concerns
at the vfs level as possible to separate filesystem concerns and generic
concerns.

In this case the generic concern is what happens when the uid is read
from the filesystem and it gets mapped to INVALID_UID and then the inode
for that file is written back.

That is a trap for the unwary filesystem implementation and not a case
that I think anyone will actually care about. It is just not useful
to mount a filesystem and not map some of its ids. So the generic
vfs code just denies writes to files that show up with a uid of
INVALID_UID or a gid of INVALID_GID, just to ensure that problems don't
show up.

This patch gets through those defenses.
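
A rough userspace model of that defense and of the exception this patch
adds, under stated assumptions: struct inode_sketch, may_write() and
chown_allowed() are hypothetical names, the two booleans stand in for
the capable_wrt_inode_uidgid() check and for CAP_CHOWN over s_user_ns,
and the ordinary owner-may-chown-to-self rule is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)	/* stands in for INVALID_UID/INVALID_GID */

/* Hypothetical inode: only the ownership fields matter here. */
struct inode_sketch {
	uint32_t uid;
	uint32_t gid;
};

/* Generic defense: refuse writes to inodes whose ids did not map. */
static int may_write(const struct inode_sketch *inode)
{
	assert(inode != NULL);
	if (inode->uid == INVALID_ID || inode->gid == INVALID_ID)
		return -EACCES;
	return 0;
}

/*
 * chown check with the proposed relaxation: the usual capability check
 * relative to the inode, or, only for an unmapped owner, CAP_CHOWN over
 * the namespace that owns the superblock.
 */
static bool chown_allowed(const struct inode_sketch *inode,
			  bool cap_chown_wrt_inode, bool cap_chown_over_sb)
{
	assert(inode != NULL);
	if (cap_chown_wrt_inode)
		return true;
	return inode->uid == INVALID_ID && cap_chown_over_sb;
}
```

The point of the relaxation in this model is that an inode with an
unmapped owner can first be chowned to a mapped id by the superblock
owner, after which may_write() no longer rejects it.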

Eric


2018-02-19 22:58:21

by Eric W. Biederman

Subject: Re: [PATCH 07/11] fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems

Miklos Szeredi <[email protected]> writes:

> On Fri, Dec 22, 2017 at 3:32 PM, Dongsu Park <[email protected]> wrote:
>> From: Seth Forshee <[email protected]>
>>
>> The user in control of a super block should be allowed to freeze
>> and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
>> ioctls to require CAP_SYS_ADMIN in s_user_ns.
>
> Why is this required for unprivileged fuse?
>
> Fuse doesn't support freeze, so this seems to make no sense in the
> context of this patchset.

This isn't required to support fuse. It is a relaxation in permissions
so it isn't strictly necessary for anything.

Until just recently Seth and I were working through the vfs looking at
what we need in general for unprivileged mounts, with fuse as our focus
but without limiting ourselves to fuse.

I have been putting off relaxation of permissions like this because they
are not necessary for safety. But in general they do make sense.

In practice I think all we need to worry about for fuse is the last 4 patches.


Eric


2018-02-19 23:11:12

by Eric W. Biederman

Subject: Re: [PATCH v5 00/11] FUSE mounts from non-init user namespaces

Alban Crequy <[email protected]> writes:

> Hi Eric,
>
> Do you have some cycles for this now that it is the new year?
>
> A review on the associated ima issue would also be appreciated:
> https://www.mail-archive.com/[email protected]/msg1587678.html

It has taken me longer than I expected but I do have time now. I am
moving through these patches and issues a little slowly, but I do intend
to get us through the fuse issues this development cycle if at all
possible.

I think for starters we should restrict ourselves to the last 4 patches
aka (8, 9, 10, 11).

In particular we should concentrate on
[8/11] fuse: Support fuse filesystems outside of init_user_ns
[9/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

The tricky issues are handled in the vfs, and I think the remaining
tricky issues are evm and ima. Which are close enough to be resolved
that we can count them as resolved.

Once we have 8 & 9 reviewed and merged we can double check there isn't
some silly reason not to set FS_USERNS_MOUNT on fuse and then enable it.

I would like to double check and ensure there are not silly issues with
posix acls or anything else in the vfs. But I think except for a silly
oversight we are good.

I should probably also add a patch to Documentation/filesystems that
explains what the vfs does for unprivileged mounts, so that I can point
people who work on filesystems and are thinking about enabling user
namespace mounts at the documentation for what the vfs does. That would
also provide a good checklist to ensure the way the vfs handles things
is sufficient for fuse.

As for the earlier patches that enable things. Overall they are
good. They are slightly dangerous as they enable more code paths
to unprivileged users. But mostly I think they are not immediately
necessary and as such a distraction to getting this code in.

That said, once we get the fuse bits reviewed and merged I will be more
than happy to merge the relaxation of permission checks that we can
perform now that s_user_ns exists.

Eric

2018-02-19 23:19:02

by Eric W. Biederman

Subject: Re: [PATCH 09/11] fuse: Restrict allow_other to the superblock's namespace or a descendant

Dongsu Park <[email protected]> writes:

> From: Seth Forshee <[email protected]>
>
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Restrict allow_other to apply to users in the same
> userns used at mount or a descendant of that namespace. Also
> export current_in_userns() for use by fuse when built as a
> module.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944671/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Serge Hallyn <[email protected]>
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>

Reviewed-by: "Eric W. Biederman" <[email protected]>

> ---
> fs/fuse/dir.c | 2 +-
> kernel/user_namespace.c | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ad1cfac1..d41559a0 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
> const struct cred *cred;
>
> if (fc->allow_other)
> - return 1;
> + return current_in_userns(fc->user_ns);
>
> cred = current_cred();
> if (uid_eq(cred->euid, fc->user_id) &&
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 246d4d4c..492c255e 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
> {
> return in_userns(target_ns, current_user_ns());
> }
> +EXPORT_SYMBOL(current_in_userns);
>
> static inline struct user_namespace *to_user_ns(struct ns_common *ns)
> {
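
For readers following along, the effect of the fs/fuse/dir.c hunk can be
sketched in userspace as an ancestor walk. This is only a model:
struct userns_node, in_userns_sketch() and allow_current() are
hypothetical names, and the non-allow_other path is reduced to a single
boolean instead of the real credential comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace: only the parent link. */
struct userns_node {
	const struct userns_node *parent;	/* NULL for the initial namespace */
};

/* True when ns is target itself or a descendant of target. */
static bool in_userns_sketch(const struct userns_node *target,
			     const struct userns_node *ns)
{
	assert(target != NULL);
	for (; ns != NULL; ns = ns->parent)
		if (ns == target)
			return true;
	return false;
}

/*
 * Model of fuse_allow_current_process() after the patch: allow_other
 * only opens the mount to callers inside the mount's namespace or one
 * of its descendants.
 */
static bool allow_current(bool allow_other,
			  const struct userns_node *mount_ns,
			  const struct userns_node *caller_ns,
			  bool caller_is_mount_owner)
{
	if (allow_other)
		return in_userns_sketch(mount_ns, caller_ns);
	return caller_is_mount_owner;
}
```

In this model a caller in the parent of the mount's namespace, or in a
sibling namespace, is refused even when allow_other is set.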

2018-02-20 02:25:01

by Eric W. Biederman

Subject: Re: [PATCH 08/11] fuse: Support fuse filesystems outside of init_user_ns

Dongsu Park <[email protected]> writes:

> From: Seth Forshee <[email protected]>
>
> In order to support mounts from namespaces other than
> init_user_ns, fuse must translate uids and gids to/from the
> userns of the process servicing requests on /dev/fuse. This
> patch does that, with a couple of restrictions on the namespace:
>
> - The userns for the fuse connection is fixed to the namespace
> from which /dev/fuse is opened.
>
> - The namespace must be the same as s_user_ns.
>
> These restrictions simplify the implementation by avoiding the
> need to pass around userns references and by allowing fuse to
> rely on the checks in inode_change_ok for ownership changes.
> Either restriction could be relaxed in the future if needed.
>
> For cuse the namespace used for the connection is also simply
> current_user_ns() at the time /dev/cuse is opened.
>
> Patch v4 is available: https://patchwork.kernel.org/patch/8944661/
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Miklos Szeredi <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> Signed-off-by: Dongsu Park <[email protected]>
> ---
> fs/fuse/cuse.c | 3 ++-
> fs/fuse/dev.c | 11 ++++++++---
> fs/fuse/dir.c | 14 +++++++-------
> fs/fuse/fuse_i.h | 6 +++++-
> fs/fuse/inode.c | 31 +++++++++++++++++++------------
> 5 files changed, 41 insertions(+), 24 deletions(-)
>
> diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
> index e9e97803..b1b83259 100644
> --- a/fs/fuse/cuse.c
> +++ b/fs/fuse/cuse.c
> @@ -48,6 +48,7 @@
> #include <linux/stat.h>
> #include <linux/module.h>
> #include <linux/uio.h>
> +#include <linux/user_namespace.h>
>
> #include "fuse_i.h"
>
> @@ -498,7 +499,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
> if (!cc)
> return -ENOMEM;
>
As noticed in the review this should probably say:

	if (current_user_ns() != &init_user_ns)
		return -EINVAL;

Just so we don't need to think about cuse being opened in a user
namespace at this point. It is probably harmless. But it isn't
what we are focusing on.

> - fuse_conn_init(&cc->fc);
> + fuse_conn_init(&cc->fc, current_user_ns());
>
> fud = fuse_dev_alloc(&cc->fc);
> if (!fud) {


> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 17f0d05b..0f780e16 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> }
>
> @@ -167,6 +167,10 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> __set_bit(FR_WAITING, &req->flags);
> if (for_background)
> __set_bit(FR_BACKGROUND, &req->flags);
> + if (req->in.h.uid == (uid_t)-1 || req->in.h.gid == (gid_t)-1) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
>
> return req;
>
> @@ -1260,7 +1264,8 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
> in = &req->in;
> reqsize = in->h.len;
>
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> + if (task_active_pid_ns(current) != fc->pid_ns ||
> + current_user_ns() != fc->user_ns) {
> rcu_read_lock();
> in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> rcu_read_unlock();

The hunk above is a rebase error. I believe it started out by erroring
out in the same case the pid namespace case errored out. Miklos has a
good point that we need to handle the case where we have servers running
in jails of one sort or another because at least sandstorm runs
applications in that fashion, and we have previously had error reports
about that configuration breaking.

I think we can easily fix that. Either by adding extra translation as
we did for the pid namespace or changing the user namespace used on the
connection. I believe extra translation like we did with the pid
namespace will be more consistent. And again it won't be a special
case except possibly during mount. Of course there is weirdness there.

Eric


2018-02-21 20:25:55

by Eric W. Biederman

Subject: [PATCH v6 0/6] fuse: mounts from non-init user namespaces


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse. The non-fuse
patches are far enough along we can ignore them except possibly for the
question of when does FS_USERNS_MOUNT get set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

I had to change the core of this patchset around some as the previous
patches were showing signs of bitrot. Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Miklos can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

My apologies if I have lost someone's ack or review somewhere. Let me
know and I will fix it.

These changes are also available at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v6

Eric W. Biederman (4):
fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
fuse: Fail all requests with invalid uids or gids
fuse: Support fuse filesystems outside of init_user_ns
fuse: Ensure posix acls are translated outside of init_user_ns

Seth Forshee (1):
fuse: Restrict allow_other to the superblock's namespace or a descendant

fs/fuse/acl.c | 4 ++--
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 26 +++++++++++++-------------
fs/fuse/dir.c | 16 ++++++++--------
fs/fuse/fuse_i.h | 7 ++++++-
fs/fuse/inode.c | 38 ++++++++++++++++++++++++++------------
fs/fuse/xattr.c | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/user_namespace.c | 1 +
8 files changed, 105 insertions(+), 37 deletions(-)

Eric

2018-02-21 20:31:33

by Eric W. Biederman

Subject: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist. The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid. Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read. That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow. So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/dev.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
in = &req->in;
reqsize = in->h.len;

- if (task_active_pid_ns(current) != fc->pid_ns) {
- rcu_read_lock();
- in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
- rcu_read_unlock();
- }
-
/* If request is too large, reply with an error and restart the read */
if (nbytes < reqsize) {
req->out.h.error = -EIO;
--
2.14.1
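
The failure mode described in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct task` and `retranslate_pid` are hypothetical names, not kernel APIs, and a flat array stands in for the per-namespace pid tables.

```c
#include <stddef.h>

/* Hypothetical model: one live task with a pid number in each namespace. */
struct task {
	int pid_in_fs_ns;     /* number recorded when the request was created */
	int pid_in_server_ns; /* number the FUSE server would see */
};

/*
 * Late translation: look the recorded number up again at the time the
 * server reads the request. Returns 0 when no live task owns that number.
 */
static int retranslate_pid(const struct task *live, size_t n, int pid_in_fs_ns)
{
	for (size_t i = 0; i < n; i++)
		if (live[i].pid_in_fs_ns == pid_in_fs_ns)
			return live[i].pid_in_server_ns;
	return 0;
}
```

While the requester is alive the lookup gives the expected number. Once it has exited the same lookup yields 0, and if the number has since been handed to another task the result is that unrelated task's pid in the server's namespace, which is the unreliability the patch removes.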


2018-02-21 20:32:02

by Eric W. Biederman

Subject: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation. Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid. But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dev.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..216db3f51a31 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
refcount_dec(&req->count);
}

-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+ return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
}

void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
wake_up(&fc->blocked_waitq);
goto out;
}
-
- fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
-
+ if (unlikely(!fuse_req_init_context(fc, req))) {
+ fuse_put_request(fc, req);
+ return ERR_PTR(-EOVERFLOW);
+ }
return req;

out:
@@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
if (!req)
req = get_reserved_req(fc, file);

- fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
__clear_bit(FR_BACKGROUND, &req->flags);
+ if (unlikely(!fuse_req_init_context(fc, req))) {
+ fuse_put_request(fc, req);
+ return ERR_PTR(-EOVERFLOW);
+ }
return req;
}

--
2.14.1
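
As a rough userspace illustration of the check this patch adds, the sketch below translates a kernel-side id into a namespace-local id and reports failure when either id has no mapping. `struct id_extent`, `id_from_kernel` and `req_init_context` are hypothetical stand-ins for the kernel's id maps, from_kuid()/from_kgid() and fuse_req_init_context(); the only detail taken from the patch is that an unmapped id shows up as -1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* Hypothetical stand-in for one extent of a namespace id map. */
struct id_extent {
	uint32_t ns_first;     /* first id as seen inside the namespace */
	uint32_t kernel_first; /* first id as seen by the kernel */
	uint32_t count;        /* number of consecutive ids covered */
};

/* Kernel id -> namespace id, or INVALID_ID when no extent covers it. */
static uint32_t id_from_kernel(const struct id_extent *map, size_t n,
			       uint32_t kid)
{
	for (size_t i = 0; i < n; i++)
		if (kid >= map[i].kernel_first &&
		    kid - map[i].kernel_first < map[i].count)
			return map[i].ns_first + (kid - map[i].kernel_first);
	return INVALID_ID;
}

struct req_header {
	uint32_t uid;
	uint32_t gid;
};

/* Fill in the header; false means the request must not be sent. */
static bool req_init_context(struct req_header *h,
			     const struct id_extent *map, size_t n,
			     uint32_t fsuid, uint32_t fsgid)
{
	h->uid = id_from_kernel(map, n, fsuid);
	h->gid = id_from_kernel(map, n, fsgid);
	return h->uid != INVALID_ID && h->gid != INVALID_ID;
}
```

In this model a caller would turn a false return into an error such as -EOVERFLOW, as the patch does. The failure path presumably only becomes reachable in practice once the translation is done against a namespace that maps a subset of ids, which is what patch 3/5 introduces.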


2018-02-21 20:32:07

by Eric W. Biederman

Subject: [PATCH v6 3/5] fuse: Support fuse filesystems outside of init_user_ns

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.

- The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes. Either restriction
could be relaxed in the future if needed.

For cuse the userns used is the opener of /dev/cuse. Semantically the
cuse support does not appear safe for unprivileged users. Practically
the permissions on /dev/cuse only make it accessible to the global root
user. If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem. Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: [email protected]
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Cc: <[email protected]>
Cc: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/acl.c | 4 ++--
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 4 ++--
fs/fuse/dir.c | 14 +++++++-------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 31 +++++++++++++++++++------------
6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..5a48cee6d7d3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
return ERR_PTR(-ENOMEM);
size = fuse_getxattr(inode, name, value, PAGE_SIZE);
if (size > 0)
- acl = posix_acl_from_xattr(&init_user_ns, value, size);
+ acl = posix_acl_from_xattr(fc->user_ns, value, size);
else if ((size == 0) || (size == -ENODATA) ||
(size == -EOPNOTSUPP && fc->no_getxattr))
acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
if (!value)
return -ENOMEM;

- ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+ ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
if (ret < 0) {
kfree(value);
return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/uio.h>
+#include <linux/user_namespace.h>

#include "fuse_i.h"

@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
if (!cc)
return -ENOMEM;

- fuse_conn_init(&cc->fc);
+ /*
+ * Limit the cuse channel to requests that can
+ * be represented in file->f_cred->user_ns.
+ */
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns);

fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 216db3f51a31..338cfda3eb8f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)

static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);

return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..ad1cfac1942f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
- stat->uid = make_kuid(&init_user_ns, attr->uid);
- stat->gid = make_kgid(&init_user_ns, attr->gid);
+ stat->uid = make_kuid(fc->user_ns, attr->uid);
+ stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
}

-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
- bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+ struct fuse_setattr_in *arg, bool trust_local_cmtime)
{
unsigned ivalid = iattr->ia_valid;

if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
- arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+ arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
if (ivalid & ATTR_GID)
- arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+ arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
if (ivalid & ATTR_SIZE)
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,

memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
- iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+ iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
if (file) {
struct fuse_file *ff = file->private_data;
inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..7772e2b4057e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
#include <linux/xattr.h>
#include <linux/pid_namespace.h>
#include <linux/refcount.h>
+#include <linux/user_namespace.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

+ /** The user namespace for this mount */
+ struct user_namespace *user_ns;
+
/** Maximum read size */
unsigned max_read;

@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);

/**
* Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..e018dc3999f4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_ino = fuse_squash_ino(attr->ino);
inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
set_nlink(inode, attr->nlink);
- inode->i_uid = make_kuid(&init_user_ns, attr->uid);
- inode->i_gid = make_kgid(&init_user_ns, attr->gid);
+ inode->i_uid = make_kuid(fc->user_ns, attr->uid);
+ inode->i_gid = make_kgid(fc->user_ns, attr->gid);
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
return err;
}

-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+ struct user_namespace *user_ns)
{
char *p;
memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_USER_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->user_id = make_kuid(current_user_ns(), uv);
+ d->user_id = make_kuid(user_ns, uv);
if (!uid_valid(d->user_id))
return 0;
d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_GROUP_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->group_id = make_kgid(current_user_ns(), uv);
+ d->group_id = make_kgid(user_ns, uv);
if (!gid_valid(d->group_id))
return 0;
d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);

- seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+ seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+ seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
if (fc->default_permissions)
seq_puts(m, ",default_permissions");
if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}

-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->user_ns = get_user_ns(user_ns);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
put_pid_ns(fc->pid_ns);
+ put_user_ns(fc->user_ns);
fc->release(fc);
}
}
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)

sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);

- if (!parse_fuse_opt(data, &d, is_bdev))
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
goto err;

if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!file)
goto err;

- if ((file->f_op != &fuse_dev_operations) ||
- (file->f_cred->user_ns != &init_user_ns))
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if (file->f_op != &fuse_dev_operations ||
+ file->f_cred->user_ns != sb->s_user_ns)
goto err_fput;

fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!fc)
goto err_fput;

- fuse_conn_init(fc);
+ fuse_conn_init(fc, sb->s_user_ns);
fc->release = fuse_free_conn;

fud = fuse_dev_alloc(fc);
--
2.14.1
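
The other direction of the translation in this patch, from an id reported by the fuse server to a kernel-side id, can be sketched in userspace as follows. This is an illustration only: `struct id_extent` and `id_to_kernel` are hypothetical names modelling a namespace id map and make_kuid()/make_kgid(), and the 100000-based map is an assumed example of a typical container mapping.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* Hypothetical stand-in for one extent of a namespace id map. */
struct id_extent {
	uint32_t ns_first;     /* first id as seen inside the namespace */
	uint32_t kernel_first; /* first id as seen by the kernel */
	uint32_t count;        /* number of consecutive ids covered */
};

/*
 * Namespace id (e.g. attr->uid sent by the server) -> kernel id.
 * INVALID_ID models an invalid kuid for ids outside the map.
 */
static uint32_t id_to_kernel(const struct id_extent *map, size_t n,
			     uint32_t nsid)
{
	for (size_t i = 0; i < n; i++)
		if (nsid >= map[i].ns_first &&
		    nsid - map[i].ns_first < map[i].count)
			return map[i].kernel_first + (nsid - map[i].ns_first);
	return INVALID_ID;
}
```

With a map of namespace ids 0..65535 onto kernel ids 100000..165535, a server that reports uid 0 for a file ends up owning it as kernel uid 100000 rather than as global root, and an id the namespace does not map stays invalid.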


2018-02-21 20:33:37

by Eric W. Biederman

Subject: [PATCH v6 5/5] fuse: Restrict allow_other to the superblock's namespace or a descendant

From: Seth Forshee <[email protected]>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Cc: [email protected]
Cc: [email protected]
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Acked-by: Miklos Szeredi <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Reviewed-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dir.c | 2 +-
kernel/user_namespace.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1942f..d41559a0aa6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
const struct cred *cred;

if (fc->allow_other)
- return 1;
+ return current_in_userns(fc->user_ns);

cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
{
return in_userns(target_ns, current_user_ns());
}
+EXPORT_SYMBOL(current_in_userns);

static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
--
2.14.1
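
The rule this patch applies to allow_other can be modelled in userspace as a walk up the namespace parent chain. The sketch is a simplification with hypothetical names (`struct userns`, `ns_is_same_or_descendant`, `allow_other_permits`); the kernel's in_userns() is not reproduced here, only the "same namespace or a descendant" result it is described as giving.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: only the parent link matters. */
struct userns {
	const struct userns *parent; /* NULL for the initial namespace */
};

/* True if child is ancestor itself or is nested somewhere below it. */
static bool ns_is_same_or_descendant(const struct userns *ancestor,
				     const struct userns *child)
{
	for (; child; child = child->parent)
		if (child == ancestor)
			return true;
	return false;
}

/* allow_other grants access only to callers inside the mount's namespace. */
static bool allow_other_permits(const struct userns *mount_ns,
				const struct userns *caller_ns)
{
	return ns_is_same_or_descendant(mount_ns, caller_ns);
}
```

A caller in the mounting namespace or in one nested below it is admitted; a caller in the parent namespace or in a sibling namespace is not.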


2018-02-21 20:34:24

by Eric W. Biederman

Subject: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.

This ensures that modern cached posix acl support is available
and used when dealing with posix acls. This is important
because only that path has the code to convert the uids and
gids in posix acls into the user namespace of a fuse filesystem.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/fuse_i.h | 1 +
fs/fuse/inode.c | 7 +++++++
fs/fuse/xattr.c | 43 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7772e2b4057e..986fa2b043ab 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
int fuse_removexattr(struct inode *inode, const char *name);
extern const struct xattr_handler *fuse_xattr_handlers[];
extern const struct xattr_handler *fuse_acl_xattr_handlers[];
+extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];

struct posix_acl;
struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e018dc3999f4..a52cf2019a58 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
file->f_cred->user_ns != sb->s_user_ns)
goto err_fput;

+ /*
+ * If we are not in the initial user namespace posix
+ * acls must be translated.
+ */
+ if (sb->s_user_ns != &init_user_ns)
+ sb->s_xattr = fuse_no_acl_xattr_handlers;
+
fc = kmalloc(sizeof(*fc), GFP_KERNEL);
err = -ENOMEM;
if (!fc)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..433717640f78 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
return fuse_setxattr(inode, name, value, size, flags);
}

+static bool no_xattr_list(struct dentry *dentry)
+{
+ return false;
+}
+
+static int no_xattr_get(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, void *value, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+
+static int no_xattr_set(const struct xattr_handler *handler,
+ struct dentry *dentry, struct inode *inode,
+ const char *name, const void *value,
+ size_t size, int flags)
+{
+ return -EOPNOTSUPP;
+}
+
static const struct xattr_handler fuse_xattr_handler = {
.prefix = "",
.get = fuse_xattr_get,
@@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
&fuse_xattr_handler,
NULL
};
+
+static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
+ .name = XATTR_NAME_POSIX_ACL_ACCESS,
+ .flags = ACL_TYPE_ACCESS,
+ .list = no_xattr_list,
+ .get = no_xattr_get,
+ .set = no_xattr_set,
+};
+
+static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
+ .name = XATTR_NAME_POSIX_ACL_DEFAULT,
+ .flags = ACL_TYPE_DEFAULT,
+ .list = no_xattr_list,
+ .get = no_xattr_get,
+ .set = no_xattr_set,
+};
+
+const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
+ &fuse_no_acl_access_xattr_handler,
+ &fuse_no_acl_default_xattr_handler,
+ &fuse_xattr_handler,
+ NULL
+};
--
2.14.1
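
How the handler table in this patch blocks raw access to the ACL xattrs can be sketched in userspace. The model assumes that the first handler whose name or prefix matches is the one used, so the two stub entries shadow the catch-all entry; `struct handler`, `dispatch_get` and the other names are hypothetical and do not mirror the kernel's struct xattr_handler in detail.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced handler: exact name or prefix, plus a getter. */
struct handler {
	const char *name;   /* exact match when non-NULL */
	const char *prefix; /* prefix match when name is NULL */
	int (*get)(const char *name);
};

/* Stub used for the ACL names: always refuse. */
static int stub_get(const char *name)
{
	(void)name;
	return -EOPNOTSUPP;
}

/* Stand-in for forwarding the request to the fuse server. */
static int passthrough_get(const char *name)
{
	(void)name;
	return 0;
}

static const struct handler acl_access = { "system.posix_acl_access", NULL, stub_get };
static const struct handler acl_default = { "system.posix_acl_default", NULL, stub_get };
static const struct handler catch_all = { NULL, "", passthrough_get };

/* Order matters: the stubs must come before the catch-all entry. */
static const struct handler *const no_acl_handlers[] = {
	&acl_access, &acl_default, &catch_all, NULL
};

/* Use the first handler whose name or prefix matches. */
static int dispatch_get(const struct handler *const *table, const char *name)
{
	for (; *table; table++) {
		const struct handler *h = *table;

		if (h->name ? strcmp(name, h->name) == 0
			    : strncmp(name, h->prefix, strlen(h->prefix)) == 0)
			return h->get(name);
	}
	return -EOPNOTSUPP;
}
```

In this model a lookup of either ACL name is refused with -EOPNOTSUPP before it can reach the pass-through handler, while any other attribute name still falls through to it.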


2018-02-22 10:16:34

by Miklos Szeredi

Subject: Re: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<[email protected]> wrote:
> At the point of fuse_dev_do_read the user space process that initiated the
> action on the fuse filesystem may no longer exist. The process may have been
> killed or may have fired an asynchronous request and exited.
>
> If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
> fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
> the pid has been reallocated it can return practically any pid. Any pid is
> possible as the pid allocator allocates pid numbers in different pid
> namespaces independently.
>
> The only way to make translation in fuse_dev_do_read reliable is to call
> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
> fuse_dev_do_read. That reference counting in other contexts has been shown
> to bounce cache lines between processors and in general be slow. So that is
> not desirable.
>
> The only known user of running the fuse server in a different pid namespace
> from the filesystem does not care what the pids are in the fuse messages
> so removing this code should not matter.

Shouldn't we at least zero out the pid in that case?

Thanks,
Miklos


>
> Getting the translation to a server running outside of the pid namespace
> of a container can still be achieved by playing setns games at mount time.
> It is also possible to add an option to pass a pid namespace into the fuse
> filesystem at mount time.
>
> Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/fuse/dev.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 5d06384c2cae..0fb58f364fa6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
> in = &req->in;
> reqsize = in->h.len;
>
> - if (task_active_pid_ns(current) != fc->pid_ns) {
> - rcu_read_lock();
> - in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
> - rcu_read_unlock();
> - }
> -
> /* If request is too large, reply with an error and restart the read */
> if (nbytes < reqsize) {
> req->out.h.error = -EIO;
> --
> 2.14.1
>

2018-02-22 10:27:28

by Miklos Szeredi

Subject: Re: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<[email protected]> wrote:
> Upon a cursory examination the uid and gid of a fuse request are
> necessary for correct operation. Failing a fuse request where those
> values are not reliable seems a straightforward and reliable means of
> ensuring that fuse requests with bad data are not sent or processed.
>
> In most cases the vfs will avoid actions it suspects will cause
> an inode write back of an inode with an invalid uid or gid. But that does
> not map precisely to what fuse is doing, so test for this and solve
> this at the fuse level as well.
>
> Performing this work in fuse_req_init_context is cheap as the code is
> already performing the translation here and only needs to check the
> result of the translation to see if things are not representable in
> a form the fuse server can handle.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/fuse/dev.c | 20 +++++++++++++-------
> 1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0fb58f364fa6..216db3f51a31 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
> refcount_dec(&req->count);
> }
>
> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
> + req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> +
> + return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
> }
>
> void fuse_set_initialized(struct fuse_conn *fc)
> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> wake_up(&fc->blocked_waitq);
> goto out;
> }
> -
> - fuse_req_init_context(fc, req);
> __set_bit(FR_WAITING, &req->flags);
> if (for_background)
> __set_bit(FR_BACKGROUND, &req->flags);
> -
> + if (unlikely(!fuse_req_init_context(fc, req))) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }
> return req;
>
> out:
> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
> if (!req)
> req = get_reserved_req(fc, file);
>
> - fuse_req_init_context(fc, req);
> __set_bit(FR_WAITING, &req->flags);
> __clear_bit(FR_BACKGROUND, &req->flags);
> + if (unlikely(!fuse_req_init_context(fc, req))) {
> + fuse_put_request(fc, req);
> + return ERR_PTR(-EOVERFLOW);
> + }

I think failing the "_nofail" variant is the wrong thing to do. This
is called to allocate a FLUSH request on close() and in readdirplus to
allocate a FORGET request. Failing the latter results in refcount
leak in userspace. Failing the former results in missing unlock on
close() of posix locks.

Thanks,
Miklos

2018-02-22 11:41:53

by Miklos Szeredi

Subject: Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
<[email protected]> wrote:
> Ensure the translation happens by failing to read or write
> posix acls when the filesystem has not indicated it supports
> posix acls.

For the first iteration this is fine, but we could convert the raw
xattrs as well, if we later want to, right?

Thanks,
Miklos

>
> This ensures that modern cached posix acl support is available
> and used when dealing with posix acls. This is important
> because only that path has the code to convert the uids and
> gids in posix acls into the user namespace of a fuse filesystem.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/fuse/fuse_i.h | 1 +
> fs/fuse/inode.c | 7 +++++++
> fs/fuse/xattr.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 51 insertions(+)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 7772e2b4057e..986fa2b043ab 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -979,6 +979,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
> int fuse_removexattr(struct inode *inode, const char *name);
> extern const struct xattr_handler *fuse_xattr_handlers[];
> extern const struct xattr_handler *fuse_acl_xattr_handlers[];
> +extern const struct xattr_handler *fuse_no_acl_xattr_handlers[];
>
> struct posix_acl;
> struct posix_acl *fuse_get_acl(struct inode *inode, int type);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e018dc3999f4..a52cf2019a58 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1097,6 +1097,13 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> file->f_cred->user_ns != sb->s_user_ns)
> goto err_fput;
>
> + /*
> + * If we are not in the initial user namespace posix
> + * acls must be translated.
> + */
> + if (sb->s_user_ns != &init_user_ns)
> + sb->s_xattr = fuse_no_acl_xattr_handlers;
> +
> fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> err = -ENOMEM;
> if (!fc)
> diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
> index 3caac46b08b0..433717640f78 100644
> --- a/fs/fuse/xattr.c
> +++ b/fs/fuse/xattr.c
> @@ -192,6 +192,26 @@ static int fuse_xattr_set(const struct xattr_handler *handler,
> return fuse_setxattr(inode, name, value, size, flags);
> }
>
> +static bool no_xattr_list(struct dentry *dentry)
> +{
> + return false;
> +}
> +
> +static int no_xattr_get(const struct xattr_handler *handler,
> + struct dentry *dentry, struct inode *inode,
> + const char *name, void *value, size_t size)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int no_xattr_set(const struct xattr_handler *handler,
> + struct dentry *dentry, struct inode *inode,
> + const char *name, const void *value,
> + size_t size, int flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> static const struct xattr_handler fuse_xattr_handler = {
> .prefix = "",
> .get = fuse_xattr_get,
> @@ -209,3 +229,26 @@ const struct xattr_handler *fuse_acl_xattr_handlers[] = {
> &fuse_xattr_handler,
> NULL
> };
> +
> +static const struct xattr_handler fuse_no_acl_access_xattr_handler = {
> + .name = XATTR_NAME_POSIX_ACL_ACCESS,
> + .flags = ACL_TYPE_ACCESS,
> + .list = no_xattr_list,
> + .get = no_xattr_get,
> + .set = no_xattr_set,
> +};
> +
> +static const struct xattr_handler fuse_no_acl_default_xattr_handler = {
> + .name = XATTR_NAME_POSIX_ACL_DEFAULT,
> + .flags = ACL_TYPE_DEFAULT,
> + .list = no_xattr_list,
> + .get = no_xattr_get,
> + .set = no_xattr_set,
> +};
> +
> +const struct xattr_handler *fuse_no_acl_xattr_handlers[] = {
> + &fuse_no_acl_access_xattr_handler,
> + &fuse_no_acl_default_xattr_handler,
> + &fuse_xattr_handler,
> + NULL
> +};
> --
> 2.14.1
>

2018-02-22 18:16:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 2/5] fuse: Fail all requests with invalid uids or gids

Miklos Szeredi <[email protected]> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Upon a cursory examination the uid and gid of a fuse request are
>> necessary for correct operation. Failing a fuse request where those
>> values are not reliable seems a straightforward and reliable means of
>> ensuring that fuse requests with bad data are not sent or processed.
>>
>> In most cases the vfs will avoid actions it suspects will cause
>> an inode write back of an inode with an invalid uid or gid. But that does
>> not map precisely to what fuse is doing, so test for this and solve
>> this at the fuse level as well.
>>
>> Performing this work in fuse_req_init_context is cheap as the code is
>> already performing the translation here and only needs to check the
>> result of the translation to see if things are not representable in
>> a form the fuse server can handle.
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> fs/fuse/dev.c | 20 +++++++++++++-------
>> 1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 0fb58f364fa6..216db3f51a31 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -112,11 +112,13 @@ static void __fuse_put_request(struct fuse_req *req)
>> refcount_dec(&req->count);
>> }
>>
>> -static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> +static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> {
>> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> + req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
>> + req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
>> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>> +
>> + return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
>> }
>>
>> void fuse_set_initialized(struct fuse_conn *fc)
>> @@ -162,12 +164,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>> wake_up(&fc->blocked_waitq);
>> goto out;
>> }
>> -
>> - fuse_req_init_context(fc, req);
>> __set_bit(FR_WAITING, &req->flags);
>> if (for_background)
>> __set_bit(FR_BACKGROUND, &req->flags);
>> -
>> + if (unlikely(!fuse_req_init_context(fc, req))) {
>> + fuse_put_request(fc, req);
>> + return ERR_PTR(-EOVERFLOW);
>> + }
>> return req;
>>
>> out:
>> @@ -256,9 +259,12 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>> if (!req)
>> req = get_reserved_req(fc, file);
>>
>> - fuse_req_init_context(fc, req);
>> __set_bit(FR_WAITING, &req->flags);
>> __clear_bit(FR_BACKGROUND, &req->flags);
>> + if (unlikely(!fuse_req_init_context(fc, req))) {
>> + fuse_put_request(fc, req);
>> + return ERR_PTR(-EOVERFLOW);
>> + }
>
> I think failing the "_nofail" variant is the wrong thing to do. This
> is called to allocate a FLUSH request on close() and in readdirplus to
> allocate a FORGET request. Failing the latter results in refcount
> leak in userspace. Failing the former results in missing unlock on
> close() of posix locks.

Doh! You are quite correct.

Modifying fuse_get_req_nofail_nopages to fail is a bug.

I am thinking the proper solution is to write:

static void fuse_req_init_context_nofail(struct fuse_req *req)
{
req->in.h.uid = 0;
req->in.h.gid = 0;
req->in.h.pid = 0;
}

And use that in the nofail case, as it appears that neither flush nor
the eviction of inodes is a user space triggered action, and as such
user space identifiers are nonsense in those cases.
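
To make the split between the two paths concrete, here is a minimal
user-space sketch. It is not the kernel code: struct model_req_ctx,
model_req_init_context and model_req_init_context_nofail are
hypothetical names, and the ids are assumed to have been translated
already, with (uint32_t)-1 standing for "could not be mapped".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: (uint32_t)-1 marks an id that has no mapping. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/* Hypothetical stand-in for the uid/gid/pid fields of a request header. */
struct model_req_ctx {
	uint32_t uid;
	uint32_t gid;
	uint32_t pid;
};

/* Fallible path: record the caller and report whether both ids are valid. */
static bool model_req_init_context(struct model_req_ctx *ctx,
				   uint32_t mapped_uid, uint32_t mapped_gid,
				   uint32_t pid)
{
	assert(ctx);
	ctx->uid = mapped_uid;
	ctx->gid = mapped_gid;
	ctx->pid = pid;
	return ctx->uid != MODEL_INVALID_ID && ctx->gid != MODEL_INVALID_ID;
}

/* Nofail path: no caller identity is recorded, so there is nothing to reject. */
static void model_req_init_context_nofail(struct model_req_ctx *ctx)
{
	assert(ctx);
	ctx->uid = 0;
	ctx->gid = 0;
	ctx->pid = 0;
}
```

The point of the second helper is that it has no return value at all, so
a caller such as the nofail allocator cannot be made to fail by it.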

I will respin this patch shortly.

Eric


2018-02-22 19:06:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 1/5] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read

Miklos Szeredi <[email protected]> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <[email protected]> wrote:
>> At the point of fuse_dev_do_read the user space process that initiated the
>> action on the fuse filesystem may no longer exist. The process may have been
>> killed or may have fired an asynchronous request and exited.
>>
>> If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
>> fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
>> the pid has been reallocated it can return practically any pid. Any pid is
>> possible as the pid allocator allocates pid numbers in different pid
>> namespaces independently.
>>
>> The only way to make translation in fuse_dev_do_read reliable is to call
>> get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
>> fuse_dev_do_read. That reference counting in other contexts has been shown
>> to bounce cache lines between processors and in general be slow. So that is
>> not desirable.
>>
>> The only known user of running the fuse server in a different pid namespace
>> from the filesystem does not care what the pids are in the fuse messages
>> so removing this code should not matter.
>
> Shouldn't we at least zero out the pid in that case?

This is an explicit case of passing a file descriptor between pid
namespaces. So I think there are plenty of buyer beware signs out.
So I don't know if there are any real world advantages of zeroing the
pid.

I can see a case for using the pid namespace of the opener of /dev/fuse
instead of the pid namespace of the mounter of the fuse filesystem.
Although in practice I would be surprised if they were different.

I am very leery about caring during a read operation. Caring about the
current processes during read/write tends to break caching, is error prone
as the need for this patch demonstrates, and is generally likely to be
slower than not caring.

So yes we can zero the pid. I don't think it is wise to zero the pid
unless we zero the pid in fuse_req_init_context.

Eric

2018-02-22 19:20:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

Miklos Szeredi <[email protected]> writes:

> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Ensure the translation happens by failing to read or write
>> posix acls when the filesystem has not indicated it supports
>> posix acls.
>
> For the first iteration this is fine, but we could convert the raw
> xattrs as well, if we later want to, right?

I will say maybe. This is tricky. The code would not be too hard,
and the function to do the work, posix_acl_fix_xattr_userns, already
exists in fs/posix_acl.c.

I don't actually expect that to work long term. I expect the direction
the kernel internals are moving is that all filesystems that implement
posix acls will be expected to implement .get_acl and .set_acl.

I would have to reread the old thread that got us to this point with
posix acls before I could really understand the backwards compatible
fuse use case, and I would have to reread the rest of the acl processing
in the kernel before I could recall exactly what makes sense.

If there was an obvious way to whitelist xattrs that fuse can support
for user namespaces I think I would go for that. Just to avoid future
problems with future xattrs.

Eric

2018-02-22 22:52:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

[email protected] (Eric W. Biederman) writes:

> Miklos Szeredi <[email protected]> writes:
>
>> On Wed, Feb 21, 2018 at 9:29 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Ensure the translation happens by failing to read or write
>>> posix acls when the filesystem has not indicated it supports
>>> posix acls.
>>
>> For the first iteration this is fine, but we could convert the raw
>> xattrs as well, if we later want to, right?
>
> I will say maybe. This is tricky. The code would not be too hard,
> and the function to do the work, posix_acl_fix_xattr_userns, already
> exists in fs/posix_acl.c.
>
> I don't actually expect that to work long term. I expect the direction
> the kernel internals are moving is that all filesystems that implement
> posix acls will be expected to implement .get_acl and .set_acl.
>
> I would have to reread the old thread that got us to this point with
> posix acls before I could really understand the backwards compatible
> fuse use case, and I would have to reread the rest of the acl processing
> in the kernel before I could recall exactly what makes sense.
>
> If there was an obvious way to whitelist xattrs that fuse can support
> for user namespaces I think I would go for that. Just to avoid future
> problems with future xattrs.

I am remembering why this is such a sticky issue.

Today when a posix acl is read from user space the code does:
posix_acl_to_xattr(&init_user_ns, ...) in posix_acl_xattr_get
posix_acl_fix_xattr_to_user() in getxattr

Similarly when a posix acl is written from user space the code does:
posix_acl_fix_xattr_from_user() in setxattr
posix_acl_from_xattr(&init_user_ns, ...) in posix_acl_xattr_set

If every posix acl supporting filesystem in the kernel would use
posix_acl_access_xattr_handler and posix_acl_default_xattr_handler the
functions posix_acl_fix_xattr_to_user, posix_acl_fix_xattr_from_user
and posix_acl_fix_xattr_userns could all be removed and the posix acl
handling could be that little bit simpler and faster.

So if we could figure out how to use the generic acl support for the old
brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
easier to support them long term.

Eric




2018-02-26 07:49:26

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
<[email protected]> wrote:

> So if we could figure out how to use the generic acl support for the old
> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
> easier to support them long term.

Simplest and most robust way seems to be to do everything the same (as
with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Thanks,
Miklos

2018-02-26 16:37:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

Miklos Szeredi <[email protected]> writes:

> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
> <[email protected]> wrote:
>
>> So if we could figure out how to use the generic acl support for the old
>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>> easier to support them long term.
>
> Simplest and most robust way seems to be to do everything the same (as
> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.

Good point. That sounds like for the !fc->posix_acl case we just
need a careful use of "forget_all_cached_acls(inode)".

I will take a quick look at that, and see if that is easy/sufficient to
cover the legacy fuse case. Otherwise I will go with what I already
have here.

That feels like a better path. And internally I would rename what is
today fc->posix_acl to fc->cached_posix_acl, to better convey the intent.
Fingers crossed.

Eric

2018-02-26 21:53:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 4/5] fuse: Ensure posix acls are translated outside of init_user_ns

[email protected] (Eric W. Biederman) writes:

> Miklos Szeredi <[email protected]> writes:
>
>> On Thu, Feb 22, 2018 at 11:50 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>
>>> So if we could figure out how to use the generic acl support for the old
>>> brand of fuse filesystems that don't set FUSE_POSIX_ACL it would be much
>>> easier to support them long term.
>>
>> Simplest and most robust way seems to be to do everything the same (as
>> with FUSE_POSIX_ACL) but tell the vfs not to cache the acl.
>
> Good point. That sounds like for the !fc->posix_acl case we just
> need a careful use of "forget_all_cached_acls(inode)".
>
> I will take a quick look at that, and see if that is easy/sufficient to
> cover the legacy fuse case. Otherwise I will go with what I already
> have here.
>
> That feels like a better path. And internally I would rename what is
> today fc->posix_acl to fc->cached_posix_acl, to better convey the intent.
> Fingers crossed.

It looks like simply setting
"inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;" is the secret
sauce needed to disable caching in the legacy case and make everything
work.

I had to tweak the calls to forget_all_cached_acls so that they won't clear
the ACL_DONT_CACHE status but otherwise that was an absolutely trivial
change to combine those two code paths.

I will post my updated patches shortly.

Eric


2018-02-26 23:53:55

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 0/7] fuse: mounts from non-init user namespaces


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse. The non-fuse
patches are far enough along that we can ignore them except possibly for
the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot. Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
added.
- Updated fuse to always use posix_acl_access_xattr_handler, and
posix_acl_default_xattr_handler, by teaching fuse to set
ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Miklos can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

These changes are also available at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v7

Eric W. Biederman (6):
fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
fuse: Fail all requests with invalid uids or gids
fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE
fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS
fuse: Simplfiy the posix acl handling logic.
fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
fuse: Restrict allow_other to the superblock's namespace or a descendant

fs/fuse/acl.c | 10 +++++-----
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 30 +++++++++++++++++-------------
fs/fuse/dir.c | 27 +++++++++++++--------------
fs/fuse/fuse_i.h | 11 ++++++++---
fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++---------------
fs/fuse/xattr.c | 6 +-----
fs/posix_acl.c | 7 +++++--
kernel/user_namespace.c | 1 +
9 files changed, 85 insertions(+), 58 deletions(-)

Eric

2018-02-26 23:55:07

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 1/7] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist. The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid. Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.
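
The failure mode described above can be illustrated with a small
user-space model. This is only a sketch under simplifying assumptions:
struct model_task and model_retranslate are hypothetical, each task just
carries one pid number per namespace, and a lookup by number finds
whichever live task currently owns that number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record: one pid number per pid namespace. */
struct model_task {
	int pid_in_fs_ns;     /* number stored in the request header */
	int pid_in_server_ns; /* number the fuse server would see */
	bool alive;
};

/*
 * Late translation by number: find whichever live task currently owns
 * fs_pid and report its number in the server's namespace. Returns 0
 * when no live task owns that number.
 */
static int model_retranslate(const struct model_task *tasks, size_t n,
			     int fs_pid)
{
	assert(tasks || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tasks[i].alive && tasks[i].pid_in_fs_ns == fs_pid)
			return tasks[i].pid_in_server_ns;
	return 0;
}
```

In this model, a request recorded while the original task was alive
translates correctly, translates to 0 once that task has exited, and
translates to an unrelated task's number if the pid number is handed out
again before the request is read.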

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read. That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow. So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/dev.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
in = &req->in;
reqsize = in->h.len;

- if (task_active_pid_ns(current) != fc->pid_ns) {
- rcu_read_lock();
- in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
- rcu_read_unlock();
- }
-
/* If request is too large, reply with an error and restart the read */
if (nbytes < reqsize) {
req->out.h.error = -EIO;
--
2.14.1


2018-02-26 23:55:35

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE

Fuse is about to join overlayfs in relying on get_acl respecting
ACL_DONT_CACHE so update the documentation in get_acl to reflect that
fact. The comment and this change description should give people a
clue that respecting ACL_DONT_CACHE in get_acl is important, and they
should audit the filesystems before removing that support.

Additionally update the comment above the call to get_acl itself and
remove the wrong information that an implementation of get_acl can
prevent caching by calling forget_cached_acl. Replace that with the
correct information that to prevent caching all that is necessary is
to set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE when the
inode is initialized.
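
For readers who want to see why ACL_DONT_CACHE survives a lookup, here
is a single-threaded user-space sketch of the caching scheme. It is a
model under stated assumptions, not the code in fs/posix_acl.c: the
model_* names and the marker objects are hypothetical, and
model_cmpxchg is a plain compare-and-swap with no atomicity.

```c
#include <assert.h>
#include <stdbool.h>

struct model_acl { int id; };

/* Marker objects standing in for ACL_NOT_CACHED, ACL_DONT_CACHE, sentinel. */
static struct model_acl not_cached_tag, dont_cache_tag, sentinel_tag;
#define MODEL_NOT_CACHED (&not_cached_tag)
#define MODEL_DONT_CACHE (&dont_cache_tag)
#define MODEL_SENTINEL   (&sentinel_tag)

static struct model_acl backing_acl = { 7 };
static int fetch_count; /* how often the "filesystem" was asked */

static struct model_acl *model_fetch_from_fs(void)
{
	fetch_count++;
	return &backing_acl;
}

/* Non-atomic stand-in for cmpxchg: swap only if *slot == old. */
static struct model_acl *model_cmpxchg(struct model_acl **slot,
				       struct model_acl *old,
				       struct model_acl *new_val)
{
	struct model_acl *cur = *slot;

	if (cur == old)
		*slot = new_val;
	return cur;
}

static bool model_is_uncached(const struct model_acl *acl)
{
	return acl == MODEL_NOT_CACHED || acl == MODEL_DONT_CACHE ||
	       acl == MODEL_SENTINEL;
}

static struct model_acl *model_get_acl(struct model_acl **slot)
{
	struct model_acl *acl;

	assert(slot);
	acl = *slot;
	if (!model_is_uncached(acl))
		return acl; /* cache hit */

	/* Only a NOT_CACHED slot receives the sentinel; DONT_CACHE stays. */
	model_cmpxchg(slot, MODEL_NOT_CACHED, MODEL_SENTINEL);
	acl = model_fetch_from_fs();
	/* Store the result only if our sentinel is still in place. */
	model_cmpxchg(slot, MODEL_SENTINEL, acl);
	return acl;
}
```

Because the first compare-and-swap only matches the "not cached" marker,
a slot holding the "don't cache" marker never receives the sentinel, so
the second compare-and-swap never stores into it and every lookup goes
back to the filesystem.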

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/posix_acl.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..3c24fc263401 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -121,14 +121,17 @@ struct posix_acl *get_acl(struct inode *inode, int type)
* could wait for that other task to complete its job, but it's easier
* to just call ->get_acl to fetch the ACL ourself. (This is going to
* be an unlikely race.)
+ *
+ * ACL_DONT_CACHE is treated as another task updating the acl and
+ * remains set.
*/
if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
/* fall through */ ;

/*
* Normally, the ACL returned by ->get_acl will be cached.
- * A filesystem can prevent that by calling
- * forget_cached_acl(inode, type) in ->get_acl.
+ * A filesystem can prevent that by setting
+ * inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE.
*
* If the filesystem doesn't have a get_acl() function at all, we'll
* just create the negative cache entry.
--
2.14.1


2018-02-26 23:56:06

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 4/7] fuse: Cache a NULL acl when FUSE_GETXATTR returns -ENOSYS

When FUSE_GETXATTR will never return anything call cache_no_acl to
cache that state in the vfs as well as in fuse with fc->no_getxattr.

The only code paths this affects are the code paths that call
fuse_get_acl, and caching a NULL or returning it immediately
has exactly the same effect, so this should not affect anything.

This keeps the vfs from wasting its time calling down into fuse
when fuse isn't going to do anything, and it makes it clear
when a NULL should be cached for optimal performance.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/xattr.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..0520a4f47226 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -82,6 +82,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
if (ret == -ENOSYS) {
fc->no_getxattr = 1;
+ cache_no_acl(inode);
ret = -EOPNOTSUPP;
}
return ret;
--
2.14.1


2018-02-26 23:56:09

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.

Rename the fuse connection flag posix_acl to cached_posix_acl as that
is what it actually means: that fuse will cache and operate on the
cached value of the posix acl.

When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
so that get_acl and friends won't cache the acl values even if they
are called.

Replace forget_all_cached_acls with fuse_forget_cached_acls. This
wrapper only takes effect when cached_posix_acl is true to prevent
losing the nocache or noxattr status when posix acls are not
cached.

Always use posix_acl_access_xattr_handler so the fuse code
benefits from the generic posix acl handlers as much as possible.
This will become important as the code works on translation
of uid and gid in the posix acls when fuse is not mounted in
the initial user namespace.
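
The reason for the wrapper can be shown with a tiny user-space model.
This is a sketch, not the fuse code: struct model_inode, the marker
macros and both functions are hypothetical, and the connection flag is
folded into the inode for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Marker objects standing in for ACL_NOT_CACHED and ACL_DONT_CACHE. */
static char not_cached_tag, dont_cache_tag;
#define MODEL_ACL_NOT_CACHED ((void *)&not_cached_tag)
#define MODEL_ACL_DONT_CACHE ((void *)&dont_cache_tag)

struct model_inode {
	void *i_acl;
	void *i_default_acl;
	bool cached_posix_acl; /* stands in for fc->cached_posix_acl */
};

/* Unconditional forget: both slots go back to "not cached". */
static void model_forget_all(struct model_inode *inode)
{
	assert(inode);
	inode->i_acl = MODEL_ACL_NOT_CACHED;
	inode->i_default_acl = MODEL_ACL_NOT_CACHED;
}

/* Wrapper: only forget when acls are actually being cached. */
static void model_fuse_forget(struct model_inode *inode)
{
	assert(inode);
	if (inode->cached_posix_acl)
		model_forget_all(inode);
}
```

Calling the unconditional version on an inode marked "don't cache" would
silently turn it back into an ordinary cacheable inode, which is the
state loss the wrapper is meant to avoid.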

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/acl.c | 6 +++---
fs/fuse/dir.c | 11 +++++------
fs/fuse/fuse_i.h | 5 +++--
fs/fuse/inode.c | 13 ++++++++++---
fs/fuse/xattr.c | 5 -----
5 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..8fb2153dbf50 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
void *value = NULL;
struct posix_acl *acl;

- if (!fc->posix_acl || fc->no_getxattr)
+ if (fc->no_getxattr)
return NULL;

if (type == ACL_TYPE_ACCESS)
@@ -53,7 +53,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
const char *name;
int ret;

- if (!fc->posix_acl || fc->no_setxattr)
+ if (fc->no_setxattr)
return -EOPNOTSUPP;

if (type == ACL_TYPE_ACCESS)
@@ -92,7 +92,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
} else {
ret = fuse_removexattr(inode, name);
}
- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);
fuse_invalidate_attr(inode);

return ret;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..a44ca509db4f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -237,7 +237,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (ret || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
goto invalid;

- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);
fuse_change_attributes(inode, &outarg.attr,
entry_attr_timeout(&outarg),
attr_version);
@@ -930,7 +930,7 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file,
int err = 0;

if (time_before64(fi->i_time, get_jiffies_64())) {
- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);
err = fuse_do_getattr(inode, stat, file);
} else if (stat) {
generic_fillattr(inode, stat);
@@ -1076,7 +1076,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
if (mask & MAY_NOT_BLOCK)
return -ECHILD;

- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);
return fuse_do_getattr(inode, NULL, NULL);
}

@@ -1246,7 +1246,7 @@ static int fuse_direntplus_link(struct file *file,
fi->nlookup++;
spin_unlock(&fc->lock);

- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);
fuse_change_attributes(inode, &o->attr,
entry_attr_timeout(o),
attr_version);
@@ -1764,8 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr)
* If filesystem supports acls it may have updated acl xattrs in
* the filesystem, so forget cached acls for the inode.
*/
- if (fc->posix_acl)
- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);

/* Directory mode changed, may need to revalidate access */
if (d_is_dir(entry) && (attr->ia_valid & ATTR_MODE))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..3cf296d60bc0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,7 +619,7 @@ struct fuse_conn {
unsigned no_lseek:1;

/** Does the filesystem support posix acls? */
- unsigned posix_acl:1;
+ unsigned cached_posix_acl:1;

/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
@@ -913,6 +913,8 @@ void fuse_release_nowrite(struct inode *inode);

u64 fuse_get_attr_version(struct fuse_conn *fc);

+void fuse_forget_cached_acls(struct inode *inode);
+
/**
* File-system tells the kernel to invalidate cache for the given node id.
*/
@@ -974,7 +976,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
int fuse_removexattr(struct inode *inode, const char *name);
extern const struct xattr_handler *fuse_xattr_handlers[];
-extern const struct xattr_handler *fuse_acl_xattr_handlers[];

struct posix_acl;
struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..0c3ccca7c554 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -313,6 +313,8 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
if (!fc->writeback_cache || !S_ISREG(attr->mode))
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
+ if (!fc->cached_posix_acl)
+ inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
@@ -331,6 +333,12 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
return inode;
}

+void fuse_forget_cached_acls(struct inode *inode)
+{
+ if (get_fuse_conn(inode)->cached_posix_acl)
+ forget_all_cached_acls(inode);
+}
+
int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
loff_t offset, loff_t len)
{
@@ -343,7 +351,7 @@ int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
return -ENOENT;

fuse_invalidate_attr(inode);
- forget_all_cached_acls(inode);
+ fuse_forget_cached_acls(inode);
if (offset >= 0) {
pg_start = offset >> PAGE_SHIFT;
if (len <= 0)
@@ -915,8 +923,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->sb->s_time_gran = arg->time_gran;
if ((arg->flags & FUSE_POSIX_ACL)) {
fc->default_permissions = 1;
- fc->posix_acl = 1;
- fc->sb->s_xattr = fuse_acl_xattr_handlers;
+ fc->cached_posix_acl = 1;
}
} else {
ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 0520a4f47226..48a95e1bb020 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -200,11 +200,6 @@ static const struct xattr_handler fuse_xattr_handler = {
};

const struct xattr_handler *fuse_xattr_handlers[] = {
- &fuse_xattr_handler,
- NULL
-};
-
-const struct xattr_handler *fuse_acl_xattr_handlers[] = {
&posix_acl_access_xattr_handler,
&posix_acl_default_xattr_handler,
&fuse_xattr_handler,
--
2.14.1


2018-02-26 23:56:11

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 7/7] fuse: Restrict allow_other to the superblock's namespace or a descendant

From: Seth Forshee <[email protected]>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
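
As an illustration of the "same userns or a descendant" rule, here is a
user-space sketch of the ancestor walk. The struct and function names
are hypothetical, the namespace tree is modelled with a parent pointer
and a depth, and the non-allow_other credential check is reduced to a
single boolean supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_userns {
	const struct model_userns *parent; /* NULL for the initial namespace */
	int level;                         /* 0 for the initial namespace */
};

/* True if child is ancestor itself or one of its descendants. */
static bool model_in_userns(const struct model_userns *ancestor,
			    const struct model_userns *child)
{
	const struct model_userns *ns;

	assert(ancestor && child);
	/* Walk up until we reach the ancestor's depth, then compare. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/* Sketch of the access decision with the restriction applied. */
static bool model_allow_current_process(bool allow_other,
					const struct model_userns *mount_ns,
					const struct model_userns *current_ns,
					bool creds_match_owner)
{
	if (allow_other)
		return model_in_userns(mount_ns, current_ns);
	return creds_match_owner;
}
```

With this rule a mount made in a container's namespace is reachable
from that namespace and anything nested inside it, but not from the
initial namespace or from a sibling container.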

Cc: [email protected]
Cc: [email protected]
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Acked-by: Miklos Szeredi <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Reviewed-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dir.c | 2 +-
kernel/user_namespace.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 79cca1687457..0cbd1ff3dd48 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
const struct cred *cred;

if (fc->allow_other)
- return 1;
+ return current_in_userns(fc->user_ns);

cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
{
return in_userns(target_ns, current_user_ns());
}
+EXPORT_SYMBOL(current_in_userns);

static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
--
2.14.1


2018-02-26 23:58:24

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 6/7] fuse: Support fuse filesystems outside of init_user_ns

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.

- The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes. Either restriction
could be relaxed in the future if needed.
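
A rough picture of what "translate uids and gids to/from the userns"
means, as a user-space sketch. This is an assumption-laden model, not
the kernel's from_kuid/make_kuid: the namespace map is reduced to a
single extent, and struct model_id_extent, model_from_kid and
model_make_kid are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: (uint32_t)-1 marks an id that has no mapping. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/* One mapping extent, in the spirit of a uid_map line. */
struct model_id_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first id as the kernel sees it */
	uint32_t count;       /* number of consecutive ids mapped */
};

/* Kernel-side id -> id as seen inside the namespace. */
static uint32_t model_from_kid(const struct model_id_extent *map, uint32_t kid)
{
	assert(map);
	if (kid >= map->lower_first && kid - map->lower_first < map->count)
		return map->first + (kid - map->lower_first);
	return MODEL_INVALID_ID;
}

/* Id as seen inside the namespace -> kernel-side id. */
static uint32_t model_make_kid(const struct model_id_extent *map, uint32_t id)
{
	assert(map);
	if (id >= map->first && id - map->first < map->count)
		return map->lower_first + (id - map->first);
	return MODEL_INVALID_ID;
}
```

For a map such as { 0, 100000, 65536 }, a process running as kernel id
101000 appears to the fuse server as id 1000, while a process running as
kernel id 0 has no representation in the namespace at all, which is the
case the invalid-id check in the request path is there to reject.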

For cuse the userns used is the opener of /dev/cuse. Semantically the
cuse support does not appear safe for unprivileged users. Practically
the permissions on /dev/cuse only make it accessible to the global root
user. If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem. Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: [email protected]
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Cc: <[email protected]>
Cc: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/acl.c | 4 ++--
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 4 ++--
fs/fuse/dir.c | 14 +++++++-------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 31 +++++++++++++++++++------------
6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 8fb2153dbf50..5a67c80e21d6 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
return ERR_PTR(-ENOMEM);
size = fuse_getxattr(inode, name, value, PAGE_SIZE);
if (size > 0)
- acl = posix_acl_from_xattr(&init_user_ns, value, size);
+ acl = posix_acl_from_xattr(fc->user_ns, value, size);
else if ((size == 0) || (size == -ENODATA) ||
(size == -EOPNOTSUPP && fc->no_getxattr))
acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
if (!value)
return -ENOMEM;

- ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+ ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
if (ret < 0) {
kfree(value);
return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/uio.h>
+#include <linux/user_namespace.h>

#include "fuse_i.h"

@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
if (!cc)
return -ENOMEM;

- fuse_conn_init(&cc->fc);
+ /*
+ * Limit the cuse channel to requests that can
+ * be represented in file->f_cred->user_ns.
+ */
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns);

fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)

static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);

return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index a44ca509db4f..79cca1687457 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
- stat->uid = make_kuid(&init_user_ns, attr->uid);
- stat->gid = make_kgid(&init_user_ns, attr->gid);
+ stat->uid = make_kuid(fc->user_ns, attr->uid);
+ stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
}

-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
- bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+ struct fuse_setattr_in *arg, bool trust_local_cmtime)
{
unsigned ivalid = iattr->ia_valid;

if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
- arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+ arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
if (ivalid & ATTR_GID)
- arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+ arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
if (ivalid & ATTR_SIZE)
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,

memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
- iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+ iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
if (file) {
struct fuse_file *ff = file->private_data;
inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3cf296d60bc0..eba0beea8634 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
#include <linux/xattr.h>
#include <linux/pid_namespace.h>
#include <linux/refcount.h>
+#include <linux/user_namespace.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

+ /** The user namespace for this mount */
+ struct user_namespace *user_ns;
+
/** Maximum read size */
unsigned max_read;

@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);

/**
* Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0c3ccca7c554..cd3d29610688 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_ino = fuse_squash_ino(attr->ino);
inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
set_nlink(inode, attr->nlink);
- inode->i_uid = make_kuid(&init_user_ns, attr->uid);
- inode->i_gid = make_kgid(&init_user_ns, attr->gid);
+ inode->i_uid = make_kuid(fc->user_ns, attr->uid);
+ inode->i_gid = make_kgid(fc->user_ns, attr->gid);
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
@@ -485,7 +485,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
return err;
}

-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+ struct user_namespace *user_ns)
{
char *p;
memset(d, 0, sizeof(struct fuse_mount_data));
@@ -521,7 +522,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_USER_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->user_id = make_kuid(current_user_ns(), uv);
+ d->user_id = make_kuid(user_ns, uv);
if (!uid_valid(d->user_id))
return 0;
d->user_id_present = 1;
@@ -530,7 +531,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_GROUP_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->group_id = make_kgid(current_user_ns(), uv);
+ d->group_id = make_kgid(user_ns, uv);
if (!gid_valid(d->group_id))
return 0;
d->group_id_present = 1;
@@ -573,8 +574,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);

- seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+ seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+ seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
if (fc->default_permissions)
seq_puts(m, ",default_permissions");
if (fc->allow_other)
@@ -605,7 +606,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}

-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -629,6 +630,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->user_ns = get_user_ns(user_ns);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -638,6 +640,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
put_pid_ns(fc->pid_ns);
+ put_user_ns(fc->user_ns);
fc->release(fc);
}
}
@@ -1068,7 +1071,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)

sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);

- if (!parse_fuse_opt(data, &d, is_bdev))
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
goto err;

if (is_bdev) {
@@ -1093,8 +1096,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!file)
goto err;

- if ((file->f_op != &fuse_dev_operations) ||
- (file->f_cred->user_ns != &init_user_ns))
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if (file->f_op != &fuse_dev_operations ||
+ file->f_cred->user_ns != sb->s_user_ns)
goto err_fput;

fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1102,7 +1109,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!fc)
goto err_fput;

- fuse_conn_init(fc);
+ fuse_conn_init(fc, sb->s_user_ns);
fc->release = fuse_free_conn;

fud = fuse_dev_alloc(fc);
--
2.14.1
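
The translation this patch switches from init_user_ns to fc->user_ns can be pictured with a small userspace model. This is only a sketch under stated assumptions: `toy_userns`, `id_extent`, `toy_make_kid` and `toy_from_kid` are hypothetical names that mimic the behaviour of make_kuid()/from_kuid() for a single kind of id, and they are not the kernel implementation. The point it illustrates is that an id with no mapping in the connection's namespace comes back as (uid_t)-1, which the fuse code then has to treat as invalid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of one uid_map/gid_map extent: ids
 * [first, first + count) inside the namespace correspond to
 * [lower_first, lower_first + count) in the kernel-wide id space.
 */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Hypothetical stand-in for the id mapping of a user namespace. */
struct toy_userns {
	const struct id_extent *extents;
	size_t nr_extents;
};

#define TOY_INVALID_ID ((uint32_t)-1)

/* Models make_kuid(): namespace-local id -> kernel-wide id. */
static uint32_t toy_make_kid(const struct toy_userns *ns, uint32_t id)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return TOY_INVALID_ID;	/* not representable in this namespace */
}

/* Models from_kuid(): kernel-wide id -> namespace-local id. */
static uint32_t toy_from_kid(const struct toy_userns *ns, uint32_t kid)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct id_extent *e = &ns->extents[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return TOY_INVALID_ID;	/* caller must treat this as an error */
}
```

With a single extent mapping namespace ids 0..65535 onto kernel ids 100000..165535, namespace uid 0 translates to 100000 and back, while kernel uid 0 (real root) has no representation inside the namespace and yields TOY_INVALID_ID.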


2018-02-26 23:58:29

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v7 2/7] fuse: Fail all requests with invalid uids or gids

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation. Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid. But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dev.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
refcount_dec(&req->count);
}

-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+ return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+ req->in.h.uid = 0;
+ req->in.h.gid = 0;
+ req->in.h.pid = 0;
}

void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
wake_up(&fc->blocked_waitq);
goto out;
}
-
- fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
-
+ if (unlikely(!fuse_req_init_context(fc, req))) {
+ fuse_put_request(fc, req);
+ return ERR_PTR(-EOVERFLOW);
+ }
return req;

out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
if (!req)
req = get_reserved_req(fc, file);

- fuse_req_init_context(fc, req);
+ fuse_req_init_context_nofail(req);
__set_bit(FR_WAITING, &req->flags);
__clear_bit(FR_BACKGROUND, &req->flags);
return req;
--
2.14.1


2018-02-27 01:14:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE

On Mon, Feb 26, 2018 at 3:52 PM, Eric W. Biederman
<[email protected]> wrote:
>
> Additionaly update the comment above the call to get_acl itself and
> remove the wrong information that an implementation of get_acl can
> prevent caching by calling forget_cached_acl.

This part is just confusing.

First off, that comment is correct: a filesystem _can_ prevent the
returning of cached data by just calling forget_cached_acl().

Note that there are two different cases: saying that you _never_ want
to cache things (ACL_DONT_CACHE) and saying that there _currently_ is
no cached data (ACL_NOT_CACHED).

forget_cached_acl() just removes the current cache.

You're just replacing one case of "no cached" information with the other.

Just explain the two cases, don't try to muddy the waters even more..

PLUS you are just confusing things entirely. That whole new comment of yours:

+ * ACL_DONT_CACHE is treated as another task updating the acl and
+ * remains set.

is just garbage.

The code is very clear - it will only replace an ACL_NOT_CACHED entry.
The code is clear:

if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
/* fall through */ ;

this is basically just an atomic "if *p == ACL_NOT_CACHED then replace
it with 'sentinel'".

Your comment does not add any clarity at all, and only confuses
things. It has nothing to do with "treated as another task updating
the acl".

The fact is, ACL_DONT_CACHE is treated as if the cache is simply
already filled - it's just filled with "no cache".

So the only thing special is ACL_NOT_CACHED, which is the only thing
we will try to _replace_.

So NAK on this patch entirely. It's just adding confusion, not adding
clarifications.

Linus

2018-02-27 02:54:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE


So the purpose for having a patch in the first place is that
2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
which added ACL_DONT_CACHE did not result in any comment updates
to get_acl.

Which means that if you read the comments in get_acl() you
don't even think of ACL_DONT_CACHE.

Which means that this comment:
/*
* If the ACL isn't being read yet, set our sentinel. Otherwise, the
* current value of the ACL will not be ACL_NOT_CACHED and so our own
* sentinel will not be set; another task will update the cache. We
* could wait for that other task to complete its job, but it's easier
* to just call ->get_acl to fetch the ACL ourself. (This is going to
* be an unlikely race.)
*/

Which presumes the only reason the acl could be anything other than
ACL_NOT_CACHED is because get_acl() is already being called upon it in
another task.

I wanted something to mention ACL_DONT_CACHE so someone would at least
think about that case if they ever step up to modify the code.

The code is perfectly clear, the comment is not. That scares me.

And I had to read the code about a dozen times before I realized the
ACL_DONT_CACHE case even exists. Not useful when I need to use
that to preserve historical fuse semantics.

So something is missing here even if my wording does not improve things.



Then we get this comment:
/*
* Normally, the ACL returned by ->get_acl will be cached.
* A filesystem can prevent that by calling
* forget_cached_acl(inode, type) in ->get_acl.
*/

Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes")
That comment is and always has been rubbish.

I don't have a clue what it is trying to say but it is not something
a person can use to write filesystem code with.


Truths:
- forget_cached_acl(inode, type) can be used to invalidate the acl
cache.

- Calling forget_cached_acl from within the filesystem's ->get_acl
method won't prevent a cached value from being returned because
->get_acl will be set.

- Calling forget_cached_acl from within the filesystem's ->get_acl
method won't prevent a returned value from being cached
because the caching happens after ->get_acl returns.

- Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
a value from ->get_acl from being cached.


In summary I only care about two things.
1) ACL_DONT_CACHE being mentioned somewhere in get_acl so people looking
at the code, and people updating the code will have a hint that they
need to consider that case.

2) That misleading completely bogus comment being removed/fixed.


And yes I agree the code is clear. The comments are not.


Does this look better as a comment updating patch?

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..5453094b8828 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
struct posix_acl **p;
struct posix_acl *acl;

+ /*
+ * To avoid caching the result of ->get_acl
+ * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
+ */
+
/*
* The sentinel is used to detect when another operation like
* set_cached_acl() or forget_cached_acl() races with get_acl().
@@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
/* fall through */ ;

/*
- * Normally, the ACL returned by ->get_acl will be cached.
- * A filesystem can prevent that by calling
- * forget_cached_acl(inode, type) in ->get_acl.
+ * The ACL returned by ->get_acl will be cached.
*
* If the filesystem doesn't have a get_acl() function at all, we'll
* just create the negative cache entry.

Eric

2018-02-27 03:30:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE

[email protected] (Eric W. Biederman) writes:

> So the purpose for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
> which addded ACL_DONT_CACHED did not result in any comment updates
> to get_acl.
>
> Which mean that if you read the comments in get_acl() that you
> don't even think of ACL_DONT_CACHED.
>
> Which means that this comment:
> /*
> * If the ACL isn't being read yet, set our sentinel. Otherwise, the
> * current value of the ACL will not be ACL_NOT_CACHED and so our own
> * sentinel will not be set; another task will update the cache. We
> * could wait for that other task to complete its job, but it's easier
> * to just call ->get_acl to fetch the ACL ourself. (This is going to
> * be an unlikely race.)
> */
>
> Which presumes the only reason the acl could be anything other
> ACL_NOT_CACHED is because get_acl() is already being called upon it in
> another task.
>
> I wanted something to mention ACL_DONT_CACHED so someone would at least
> think about that case if they ever step up to modify the code.
>
> The code is perfectly clear, the comment is not. That scares me.
>
> And I had to read the code about a dozen times before I realized the
> ACL_DONT_CACHED case even exists. Not useful when I am need to use
> that to preserve historical fuse semantics.
>
> So something is missing here even if my wording does not improve things.
>
>
>
> Then we get this comment:
> /*
> * Normally, the ACL returned by ->get_acl will be cached.
> * A filesystem can prevent that by calling
> * forget_cached_acl(inode, type) in ->get_acl.
> */
>
> Which was added in b8a7a3a66747 ("posix_acl: Inode acl caching fixes")
> That comment is and always has been rubbish.
>
> I don't have a clue what it is trying to say but it is not something
> a person can use to write filesystem code with.
>
>
> Truths:
> - forget_cached_acl(inode, type) can be used to invalidate the acl
> cache.
>
> - Calling forget_cached_acl from within the filesystems ->get_acl
> method won't prevent a cached value from being returend because
> ->get_acl will be set.
>
> - Calling forget_cached_acl from within the filesystems ->get_acl
> method won't prevent a returned value from being cached
> because it the caching happens after ->get_acl returns.

Sigh. Yes it will because we set the special sentinel value,
and forget_cached_acl will replace the sentinel value with
ACL_NOT_CACHED.

It is a terribly brittle and racy thing to do, and it probably won't
work to say cache this acl but not this one on a case-by-case basis
in ->get_acl.

As such I believe that usage of forget_cached_acl should be subsumed by
using ACL_NOT_CACHED. If not we should really come up with a different
helper function name to call from ->get_acl. Preferably one that does
"cmpxchg(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.


> - Setting inode->i_acl = ACL_DONT_CACHE is the only way to prevent
> a value from ->get_acl from being cached.
>
>
> In summary I only care about two things.
> 1) ACL_NOT_CACHED being mentioned somewhere in get_acl so people looking
> at the code, and people updating the code will have a hint that they
> need to consider that case.
>
> 2) That misleading completely bogus comment being removed/fixed.
>
>
> And yes I agree the code is clear. The comments are not.
>
>
> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
> struct posix_acl **p;
> struct posix_acl *acl;
>
> + /*
> + * To avoid caching the result of ->get_acl
> + * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> + */
> +
> /*
> * The sentinel is used to detect when another operation like
> * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
> /* fall through */ ;
>
> /*
> - * Normally, the ACL returned by ->get_acl will be cached.
> - * A filesystem can prevent that by calling
> - * forget_cached_acl(inode, type) in ->get_acl.
> + * The ACL returned by ->get_acl will be cached.
> *
> * If the filesystem doesn't have a get_acl() function at all, we'll
> * just create the negative cache entry.
>
> Eric

2018-02-27 03:37:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE

On Mon, Feb 26, 2018 at 6:53 PM, Eric W. Biederman
<[email protected]> wrote:
>
> So the purpose for having a patch in the first place is that
> 2a3a2a3f3524 ("ovl: don't cache acl on overlay layer")
> which addded ACL_DONT_CACHED did not result in any comment updates
> to get_acl.

I'm not opposed to just updating the comments.

I just think your updates were somewhat misleading.

> Which mean that if you read the comments in get_acl() that you
> don't even think of ACL_DONT_CACHED.

Right. By all means add a comment about ACL_DONT_CACHE disabling the
cache entirely.

But don't _remove_ the other valid way to flush the cache, and don't
make that comment above cmpxchg() be even more confusing than the code
is.

> Does this look better as a comment updating patch?
>
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 2fd0fde16fe1..5453094b8828 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -98,6 +98,11 @@ struct posix_acl *get_acl(struct inode *inode, int type)
> struct posix_acl **p;
> struct posix_acl *acl;
>
> + /*
> + * To avoid caching the result of ->get_acl
> + * set inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
> + */
> +
> /*
> * The sentinel is used to detect when another operation like
> * set_cached_acl() or forget_cached_acl() races with get_acl().
> @@ -126,9 +131,7 @@ struct posix_acl *get_acl(struct inode *inode, int type)
> /* fall through */ ;
>
> /*
> - * Normally, the ACL returned by ->get_acl will be cached.
> - * A filesystem can prevent that by calling
> - * forget_cached_acl(inode, type) in ->get_acl.
> + * The ACL returned by ->get_acl will be cached.

Why do you hate forget_cached_acl()?

It's perfectly valid too. Don't remove that comment. Maybe reword it
to talk not about "preventing", but about "invalidating the cache".

But the old comment that you remove isn't _wrong_, it's just that the
"preventing" from returning the cached state with forget_cached_acl()
is just a one-time thing.

So forget_cached_acl() exists, and it works, and it does exactly what
its name says. It is a perfectly valid way to prevent the current
entry from being used in the future.

See? I object to you removing that, and trying to make it be like
ACL_DONT_CACHE is the *only* way to not cache something.

Because honestly, that's what your comment updates do. They take the
comments about _one_ case, and switch it over to be about the _other_
case.

But dammit, there are _two_ ways to not cache things.

"Fixing" the comment to talk about one and removing the other isn't a
fix. It's just a stupid change that now has the problem the other way
around!

So fix the comment to really just talk about both things.

First: talk about how to avoid caching entirely (ACL_DONT_CACHE).
Then, talk about how to invalidate the cache once it has been
instantiated (forget_cached_acl()).

Don't do this idiotic "remove the valid comment just because you
happened to care about the _other_ case"


Linus

2018-02-27 03:42:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 3/7] fs/posix_acl: Document that get_acl respects ACL_DONT_CACHE

On Mon, Feb 26, 2018 at 7:14 PM, Eric W. Biederman
<[email protected]> wrote:
>
> As such I believe that usage of forget_cached_acl should be subsumed by
> using ACL_NOT_CACHED. If not we should really come up with a different
> helper function name to call from ->get_acl. Preferably one that does
> "cmpxchng(p, sentinel, ACL_NOT_CACHED)" so that we remove the races.

You make your bias very clear, by simply trying to hide the other case.

But for chrissake, that's not the state right now. That other case
exists. You can't - and shouldn't - try to just hide it.

Besides, that "forget_cached_acl()" approach actually has a valid use
case. Maybe you _do_ want to cache ACL's, but with a timeout or
revalidation.

ACL_DONT_CACHE really is a big hammer that makes caching not work at
all. It's not necessarily the right thing to do at all.

Linus

2018-02-27 09:01:26

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.

On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
<[email protected]> wrote:
> Rename the fuse connection flag posix_acl to cached_posix_acl as that
> is what it actually means. That fuse will cache and operate on the
> cached value of the posix acl.
>
> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
> so that get_acl and friends won't cache the acl values even if they
> are called.
>
> Replace forget_all_cached_acls with fuse_forget_cached_acls. This
> wrapper only takes effect when cached_posix_acl is true to prevent
> losing the nocache or noxattr status in when posix acls are not
> cached.

Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE? I think
it makes sense to generally not clear ACL_DONT_CACHE, since it's not
an actual acl value that needs forgetting.

Thanks,
Miklos

2018-03-02 21:46:02

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH] fs/posix_acl: Update the comments and support lightweight cache skipping


The code has been missing a way for a ->get_acl method to not cache
a return value without risking invalidating a cached value
that was set while get_acl() was returning.

Add that support by implementing to_uncacheable_acl, to_cacheable_acl,
is_uncacheable_acl, and dealing with uncacheable acls in get_acl().

Update the comments so that they are a little clearer about
what is going on in get_acl()

Signed-off-by: "Eric W. Biederman" <[email protected]>
---

Linus, my issue with the forget_cached_acl case was really that it was
too big of a hammer. If you care about caching acls only sometimes,
forget_cached_acl called from ->get_acl can stomp that acl you
explicitly cached with set_cached_acl.

With this change I can unify the legacy horrible fuse posix acl case
that requires not caching acls with a single if statement in the get_acl
method. AKA:

+ if (!IS_ERR(acl) && !fc->posix_acl)
+ acl = to_uncacheable_acl(acl);
return acl;

That code I know is locally correct even if later fuse decides to cache
negative acls when the underlying filesystem does not support xattrs.

fs/posix_acl.c | 56 ++++++++++++++++++++++++++++++++++-------------
include/linux/posix_acl.h | 17 ++++++++++++++
2 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..e58a68e18603 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -96,12 +96,16 @@ struct posix_acl *get_acl(struct inode *inode, int type)
{
void *sentinel;
struct posix_acl **p;
- struct posix_acl *acl;
+ struct posix_acl *acl, *to_cache;

/*
* The sentinel is used to detect when another operation like
* set_cached_acl() or forget_cached_acl() races with get_acl().
* It is guaranteed that is_uncached_acl(sentinel) is true.
+ *
+ * This is sufficient to prevent races between ->set_acl
+ * calling set_cached_acl (outside of filesystem specific
+ * locking) and get_acl() caching the returned acl.
*/

acl = get_cached_acl(inode, type);
@@ -126,12 +130,18 @@ struct posix_acl *get_acl(struct inode *inode, int type)
/* fall through */ ;

/*
- * Normally, the ACL returned by ->get_acl will be cached.
- * A filesystem can prevent that by calling
- * forget_cached_acl(inode, type) in ->get_acl.
+ * Normally, the ACL returned by ->get_acl() will be cached.
+ *
+ * A filesystem can prevent the acl returned by ->get_acl()
+ * from being cached by wrapping it with to_uncachable_acl().
+ *
+ * A filesystem can at anytime effect the cache directly and
+ * cause in process calls of get_acl() not to update the cache
+ * by calling forget_cache_acl(inode, type) or
+ * set_cached_acl(inode, type, acl).
*
- * If the filesystem doesn't have a get_acl() function at all, we'll
- * just create the negative cache entry.
+ * If the filesystem doesn't have a ->get_acl() function at
+ * all, we'll just create the negative cache entry.
*/
if (!inode->i_op->get_acl) {
set_cached_acl(inode, type, NULL);
@@ -139,21 +149,37 @@ struct posix_acl *get_acl(struct inode *inode, int type)
}
acl = inode->i_op->get_acl(inode, type);

+
+ /* To keep the logic simple default to not caching an acl when
+ * the sentinel is cleared.
+ */
+ to_cache = ACL_NOT_CACHED;
if (IS_ERR(acl)) {
- /*
- * Remove our sentinel so that we don't block future attempts
- * to cache the ACL.
+ /* Clears the sentinel so that we don't block future
+ * attempts to cache the ACL, and return an error.
*/
- cmpxchg(p, sentinel, ACL_NOT_CACHED);
- return acl;
+ }
+ else if (is_uncacheable_acl(acl)) {
+ /* Clears the sentinel so that we don't block future
+ * attempts to cache the ACL, and return a valid ACL.
+ */
+ acl = to_cacheable_acl(acl);
+ }
+ else {
+ to_cache = acl;
+ posix_acl_dup(to_cache);
}

/*
- * Cache the result, but only if our sentinel is still in place.
+ * Remove the sentinel and replace it with the value that
+ * needs to be cached, but only if the sentinel is still in
+ * place.
*/
- posix_acl_dup(acl);
- if (unlikely(cmpxchg(p, sentinel, acl) != sentinel))
- posix_acl_release(acl);
+ if (unlikely(cmpxchg(p, sentinel, to_cache) != sentinel)) {
+ if (!is_uncached_acl(to_cache))
+ posix_acl_release(to_cache);
+ }
+
return acl;
}
EXPORT_SYMBOL(get_acl);
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 540595a321a7..3be8929b9f48 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -56,6 +56,23 @@ posix_acl_release(struct posix_acl *acl)
kfree_rcu(acl, a_rcu);
}

+/*
+ * Allow for acls returned from ->get_acl() to not be cached.
+ */
+static inline bool is_uncacheable_acl(struct posix_acl *acl)
+{
+ return ((unsigned long)acl) & 1UL;
+}
+
+static inline struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
+{
+ return (struct posix_acl *)(((unsigned long)acl) | 1UL);
+}
+
+static inline struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
+{
+ return (struct posix_acl *)(((unsigned long)acl) & ~1UL);
+}

/* posix_acl.c */

--
2.14.1
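
The helpers added to posix_acl.h rely on tagging the low bit of a pointer. A userspace sketch of the same technique follows; `toy_acl` and the `toy_*` functions are hypothetical names, and the sketch assumes the tagged objects are at least 2-byte aligned so that bit 0 of a valid pointer is always clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct posix_acl. */
struct toy_acl {
	int count;
};

/* True if the pointer carries the "do not cache" tag in bit 0. */
static bool toy_is_uncacheable(const struct toy_acl *acl)
{
	return ((uintptr_t)acl & 1u) != 0;
}

/* Set the tag. The result must not be dereferenced while tagged. */
static struct toy_acl *toy_to_uncacheable(struct toy_acl *acl)
{
	assert(((uintptr_t)acl & 1u) == 0);	/* needs an aligned pointer */
	return (struct toy_acl *)((uintptr_t)acl | 1u);
}

/* Clear the tag and recover the original, usable pointer. */
static struct toy_acl *toy_to_cacheable(struct toy_acl *acl)
{
	return (struct toy_acl *)((uintptr_t)acl & ~(uintptr_t)1);
}
```

A ->get_acl implementation would return the tagged pointer and the caller would strip the tag before handing the acl on. Other special values can also have an odd low bit, which is presumably why the RFC tests IS_ERR() before is_uncacheable_acl().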


2018-03-02 22:07:18

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v7 5/7] fuse: Simplfiy the posix acl handling logic.

Miklos Szeredi <[email protected]> writes:

> On Tue, Feb 27, 2018 at 12:53 AM, Eric W. Biederman
> <[email protected]> wrote:
>> Rename the fuse connection flag posix_acl to cached_posix_acl as that
>> is what it actually means. That fuse will cache and operate on the
>> cached value of the posix acl.
>>
>> When fc->cached_posix_acl is not set, set ACL_DONT_CACHE on the inode
>> so that get_acl and friends won't cache the acl values even if they
>> are called.
>>
>> Replace forget_all_cached_acls with fuse_forget_cached_acls. This
>> wrapper only takes effect when cached_posix_acl is true to prevent
>> losing the nocache or noxattr status when posix acls are not
>> cached.
>
> Shouldn't forget_cached_acl() be taught about ACL_DONT_CACHE? I think
> it makes sense to generally not clear ACL_DONT_CACHE, since it's not
> an actual acl value that needs forgetting.

After stopping to make certain I understand the issues, I don't think
it makes sense to teach forget_cached_acl about ACL_DONT_CACHE.

If you are forgetting a cached attribute, ACL_DONT_CACHE simply doesn't
make sense.

Further it makes sense to cache a negative result for fuse when
!fc->no_getxattr. Even if you would ordinarily not cache posix acls.

So I think the better plan is to teach the posix acl code how to not
cache results on a case by case basis, as I did in the rfc patch I
posted a little earlier today. That works with forget_cached_acl and it
supports local reasoning. Further, while the performance might not be as
good as with ACL_DONT_CACHE, I don't think that matters, as always going
to the fuse server to get acls is almost certainly going to dominate the
acl costs.

Eric

2018-03-02 22:08:50

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 0/6] fuse: mounts from non-init user namespaces


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse. The non-fuse
vfs patches are far enough along that we can ignore them, except possibly
for the question of when FS_USERNS_MOUNT gets set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot. Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
added.
- Updated fuse to always use posix_acl_access_xattr_handler and
posix_acl_default_xattr_handler, by teaching fuse to set
ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Since v7 I have:
- Rethought and reworked how I am unifying the cached and the non-cached
posix acl case so the code is cleaner and simpler.
- Dropped enhancements to caching negative acls when
fc->no_getxattr is set.
- Removed the need to wrap forget_all_cached_acls in fuse.
- Reordered the patches so the posix acl work comes first.

Miklos can you take a look and see what you think?

I think this much of the fuse changes is ready, and as such I would
like to get them in this development cycle if possible.

These changes are also available at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v8

Eric W. Biederman (5):
fs/posix_acl: Update the comments and support lightweight cache skipping
fuse: Simplfiy the posix acl handling logic.
fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
fuse: Fail all requests with invalid uids or gids
fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
fuse: Restrict allow_other to the superblock's namespace or a descendant


fs/fuse/acl.c | 10 ++++++----
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 30 ++++++++++++++++------------
fs/fuse/dir.c | 18 ++++++++---------
fs/fuse/fuse_i.h | 9 ++++++---
fs/fuse/inode.c | 34 +++++++++++++++++++-------------
fs/fuse/xattr.c | 5 -----
fs/posix_acl.c | 50 ++++++++++++++++++++++++++++++++---------------
include/linux/posix_acl.h | 17 ++++++++++++++++
kernel/user_namespace.c | 1 +
10 files changed, 116 insertions(+), 65 deletions(-)

2018-03-02 22:08:52

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 2/6] fuse: Simplfiy the posix acl handling logic.

Rename the fuse connection flag posix_acl to cached_posix_acl as that
is what it actually means: fuse will cache and operate on the
cached value of the posix acl.

Always use posix_acl_access_xattr_handler so the fuse code benefits
from the generic posix acl handlers as much as possible. This will
become important as the code works on translation of uid and gid in
the posix acls when fuse is not mounted in the initial user namespace.

Update fuse_get_acl so that it marks the acl it returns as uncacheable
when fc->cached_posix_acl is not set. This is all that is needed to
ensure that fuse_getxattr calls down into the fuse server when
posix_acl_xattr_get is called. The updated code goes through
fuse_get_acl, and as such has posix acl specific sanity checks and
attribute handling, but there is no real difference from the previous
code that skipped it.

It can safely be assumed that fuse filesystems where acls are not
cached in the kernel do not set fc->default_permissions, as
default_permissions only checked posix acls if .get_acl was defined,
and before the cached acl flag was introduced fuse did not implement a
get_acl method.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/acl.c | 6 ++++--
fs/fuse/dir.c | 2 +-
fs/fuse/fuse_i.h | 3 +--
fs/fuse/inode.c | 3 +--
fs/fuse/xattr.c | 5 -----
5 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..cfa58ee0c10b 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -19,7 +19,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
void *value = NULL;
struct posix_acl *acl;

- if (!fc->posix_acl || fc->no_getxattr)
+ if (fc->no_getxattr)
return NULL;

if (type == ACL_TYPE_ACCESS)
@@ -44,6 +44,8 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
acl = ERR_PTR(size);

kfree(value);
+ if (!IS_ERR(acl) && !fc->cached_posix_acl)
+ acl = to_uncacheable_acl(acl);
return acl;
}

@@ -53,7 +55,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
const char *name;
int ret;

- if (!fc->posix_acl || fc->no_setxattr)
+ if (fc->no_setxattr)
return -EOPNOTSUPP;

if (type == ACL_TYPE_ACCESS)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..43a45e83d313 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1764,7 +1764,7 @@ static int fuse_setattr(struct dentry *entry, struct iattr *attr)
* If filesystem supports acls it may have updated acl xattrs in
* the filesystem, so forget cached acls for the inode.
*/
- if (fc->posix_acl)
+ if (fc->cached_posix_acl)
forget_all_cached_acls(inode);

/* Directory mode changed, may need to revalidate access */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..74ce02fb16d6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -619,7 +619,7 @@ struct fuse_conn {
unsigned no_lseek:1;

/** Does the filesystem support posix acls? */
- unsigned posix_acl:1;
+ unsigned cached_posix_acl:1;

/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
@@ -974,7 +974,6 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size);
int fuse_removexattr(struct inode *inode, const char *name);
extern const struct xattr_handler *fuse_xattr_handlers[];
-extern const struct xattr_handler *fuse_acl_xattr_handlers[];

struct posix_acl;
struct posix_acl *fuse_get_acl(struct inode *inode, int type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..507f780046c5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -915,8 +915,7 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->sb->s_time_gran = arg->time_gran;
if ((arg->flags & FUSE_POSIX_ACL)) {
fc->default_permissions = 1;
- fc->posix_acl = 1;
- fc->sb->s_xattr = fuse_acl_xattr_handlers;
+ fc->cached_posix_acl = 1;
}
} else {
ra_pages = fc->max_read / PAGE_SIZE;
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 3caac46b08b0..ed64c508585a 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -199,11 +199,6 @@ static const struct xattr_handler fuse_xattr_handler = {
};

const struct xattr_handler *fuse_xattr_handlers[] = {
- &fuse_xattr_handler,
- NULL
-};
-
-const struct xattr_handler *fuse_acl_xattr_handlers[] = {
&posix_acl_access_xattr_handler,
&posix_acl_default_xattr_handler,
&fuse_xattr_handler,
--
2.14.1
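
The tagging that fuse_get_acl relies on in this patch can be modelled in userspace. The following is only a minimal sketch, assuming a toy struct and hypothetical toy_* names that are not part of the kernel; it mirrors the low-bit helpers from patch 1/6 and a ->get_acl body that tags its result whenever the connection does not cache acls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct posix_acl; only its alignment matters here. */
struct toy_acl {
	int refcount;
};

/*
 * Userspace analogues of is_uncacheable_acl(), to_uncacheable_acl() and
 * to_cacheable_acl(). They assume objects are at least 2-byte aligned,
 * so bit 0 of a real pointer is always clear and can carry a flag.
 */
static inline bool toy_is_uncacheable(const struct toy_acl *acl)
{
	return ((uintptr_t)acl) & 1u;
}

static inline struct toy_acl *toy_to_uncacheable(struct toy_acl *acl)
{
	return (struct toy_acl *)(((uintptr_t)acl) | 1u);
}

static inline struct toy_acl *toy_to_cacheable(struct toy_acl *acl)
{
	return (struct toy_acl *)(((uintptr_t)acl) & ~(uintptr_t)1);
}

/*
 * Hypothetical ->get_acl body: when the filesystem does not cache acls,
 * tag whatever was fetched (including NULL) so the caller skips caching.
 * Error pointers are left out of this sketch.
 */
static struct toy_acl *toy_get_acl(struct toy_acl *from_server,
				   bool cached_posix_acl)
{
	if (!cached_posix_acl)
		return toy_to_uncacheable(from_server);
	return from_server;
}
```

A caller is expected to test the tag with toy_is_uncacheable() and strip it with toy_to_cacheable() before dereferencing the pointer.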


2018-03-02 22:09:00

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping

The code has been missing a way for a ->get_acl method to not cache
a return value without risking invalidating a cached value
that was set while get_acl() was returning.

Add that support by implementing to_uncacheable_acl, to_cacheable_acl,
is_uncacheable_acl, and dealing with uncacheable acls in get_acl().

Update the comments so that they are a little clearer about
what is going on in get_acl().

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/posix_acl.c | 50 ++++++++++++++++++++++++++++++++---------------
include/linux/posix_acl.h | 17 ++++++++++++++++
2 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 2fd0fde16fe1..00281bc30854 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -96,12 +96,16 @@ struct posix_acl *get_acl(struct inode *inode, int type)
{
void *sentinel;
struct posix_acl **p;
- struct posix_acl *acl;
+ struct posix_acl *acl, *to_cache;

/*
* The sentinel is used to detect when another operation like
* set_cached_acl() or forget_cached_acl() races with get_acl().
* It is guaranteed that is_uncached_acl(sentinel) is true.
+ *
+ * This is sufficient to prevent races between ->set_acl
+ * calling set_cached_acl (outside of filesystem specific
+ * locking) and get_acl() caching the returned acl.
*/

acl = get_cached_acl(inode, type);
@@ -126,12 +130,18 @@ struct posix_acl *get_acl(struct inode *inode, int type)
/* fall through */ ;

/*
- * Normally, the ACL returned by ->get_acl will be cached.
- * A filesystem can prevent that by calling
- * forget_cached_acl(inode, type) in ->get_acl.
+ * Normally, the ACL returned by ->get_acl() will be cached.
+ *
+ * A filesystem can prevent the acl returned by ->get_acl()
+ * from being cached by wrapping it with to_uncachable_acl().
*
- * If the filesystem doesn't have a get_acl() function at all, we'll
- * just create the negative cache entry.
+ * A filesystem can at anytime effect the cache directly and
+ * cause in process calls of get_acl() not to update the cache
+ * by calling forget_cache_acl(inode, type) or
+ * set_cached_acl(inode, type, acl).
+ *
+ * If the filesystem doesn't have a ->get_acl() function at
+ * all, we'll just create the negative cache entry.
*/
if (!inode->i_op->get_acl) {
set_cached_acl(inode, type, NULL);
@@ -140,20 +150,28 @@ struct posix_acl *get_acl(struct inode *inode, int type)
acl = inode->i_op->get_acl(inode, type);

if (IS_ERR(acl)) {
- /*
- * Remove our sentinel so that we don't block future attempts
- * to cache the ACL.
- */
- cmpxchg(p, sentinel, ACL_NOT_CACHED);
- return acl;
+ /* Don't cache an acl just return an error. */
+ to_cache = ACL_NOT_CACHED;
+ }
+ else if (is_uncacheable_acl(acl)) {
+ /* Don't cache an acl, but return one. */
+ to_cache = ACL_NOT_CACHED;
+ acl = to_cacheable_acl(acl);
+ }
+ else {
+ /* Cache and return the acl. */
+ to_cache = posix_acl_dup(acl);
}

/*
- * Cache the result, but only if our sentinel is still in place.
+ * Remove the sentinel and replace it with the value to
+ * cache, but only if the sentinel is still in place.
*/
- posix_acl_dup(acl);
- if (unlikely(cmpxchg(p, sentinel, acl) != sentinel))
- posix_acl_release(acl);
+ if (unlikely(cmpxchg(p, sentinel, to_cache) != sentinel)) {
+ if (!is_uncached_acl(to_cache))
+ posix_acl_release(to_cache);
+ }
+
return acl;
}
EXPORT_SYMBOL(get_acl);
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 540595a321a7..3be8929b9f48 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -56,6 +56,23 @@ posix_acl_release(struct posix_acl *acl)
kfree_rcu(acl, a_rcu);
}

+/*
+ * Allow for acls returned from ->get_acl() to not be cached.
+ */
+static inline bool is_uncacheable_acl(struct posix_acl *acl)
+{
+ return ((unsigned long)acl) & 1UL;
+}
+
+static inline struct posix_acl *to_uncacheable_acl(struct posix_acl *acl)
+{
+ return (struct posix_acl *)(((unsigned long)acl) | 1UL);
+}
+
+static inline struct posix_acl *to_cacheable_acl(struct posix_acl *acl)
+{
+ return (struct posix_acl *)(((unsigned long)acl) & ~1UL);
+}

/* posix_acl.c */

--
2.14.1
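
The tail of get_acl() in this patch can be approximated in userspace with C11 atomics. This is a rough sketch rather than the kernel code: the toy_* names and TOY_NOT_CACHED are made up for illustration, error pointers are omitted, and a plain integer stands in for the acl reference count. It shows the three outcomes: cache and return, return without caching, and losing the race for the cache slot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_acl {
	int refcount;
};

/* Marker for "nothing cached"; like the sentinel, it has bit 0 set. */
#define TOY_NOT_CACHED ((void *)(uintptr_t)-1)

/* True for markers and sentinels, false for real (aligned) acl pointers. */
static bool toy_is_uncached(const void *v)
{
	return ((uintptr_t)v) & 1u;
}

/*
 * Replace our sentinel in *slot with the value to cache, but only if the
 * sentinel is still there. 'acl' may carry the "uncacheable" tag in bit 0.
 * Returns the untagged acl for the caller.
 */
static struct toy_acl *toy_finish_get_acl(_Atomic(void *) *slot,
					  void *sentinel,
					  struct toy_acl *acl)
{
	void *to_cache;
	void *expected = sentinel;

	if (((uintptr_t)acl) & 1u) {
		/* Don't cache the acl, but return one. */
		to_cache = TOY_NOT_CACHED;
		acl = (struct toy_acl *)(((uintptr_t)acl) & ~(uintptr_t)1);
	} else {
		/* Cache and return the acl; the cache takes a reference. */
		to_cache = acl;
		if (acl)
			acl->refcount++;
	}

	if (!atomic_compare_exchange_strong(slot, &expected, to_cache)) {
		/* Someone else updated the slot; drop the cache's reference. */
		if (to_cache && !toy_is_uncached(to_cache))
			((struct toy_acl *)to_cache)->refcount--;
	}
	return acl;
}
```

In every case the sentinel ends up removed by this call or by whoever raced with it, so later lookups are not blocked from caching.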


2018-03-02 22:09:06

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 5/6] fuse: Support fuse filesystems outside of init_user_ns

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.

- The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes. Either restriction
could be relaxed in the future if needed.

For cuse the userns used is the opener of /dev/cuse. Semantically the
cuse support does not appear safe for unprivileged users. Practically
the permissions on /dev/cuse only make it accessible to the global root
user. If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem. Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: [email protected]
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Cc: <[email protected]>
Cc: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/acl.c | 4 ++--
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 4 ++--
fs/fuse/dir.c | 14 +++++++-------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 31 +++++++++++++++++++------------
6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index cfa58ee0c10b..0472735a89c3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
return ERR_PTR(-ENOMEM);
size = fuse_getxattr(inode, name, value, PAGE_SIZE);
if (size > 0)
- acl = posix_acl_from_xattr(&init_user_ns, value, size);
+ acl = posix_acl_from_xattr(fc->user_ns, value, size);
else if ((size == 0) || (size == -ENODATA) ||
(size == -EOPNOTSUPP && fc->no_getxattr))
acl = NULL;
@@ -83,7 +83,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
if (!value)
return -ENOMEM;

- ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+ ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
if (ret < 0) {
kfree(value);
return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/uio.h>
+#include <linux/user_namespace.h>

#include "fuse_i.h"

@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
if (!cc)
return -ENOMEM;

- fuse_conn_init(&cc->fc);
+ /*
+ * Limit the cuse channel to requests that can
+ * be represented in file->f_cred->user_ns.
+ */
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns);

fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)

static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);

return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 43a45e83d313..c749a4bd4ea3 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
- stat->uid = make_kuid(&init_user_ns, attr->uid);
- stat->gid = make_kgid(&init_user_ns, attr->gid);
+ stat->uid = make_kuid(fc->user_ns, attr->uid);
+ stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
}

-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
- bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+ struct fuse_setattr_in *arg, bool trust_local_cmtime)
{
unsigned ivalid = iattr->ia_valid;

if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
- arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+ arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
if (ivalid & ATTR_GID)
- arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+ arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
if (ivalid & ATTR_SIZE)
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,

memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
- iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+ iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
if (file) {
struct fuse_file *ff = file->private_data;
inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 74ce02fb16d6..dbb1d4ef1a0b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
#include <linux/xattr.h>
#include <linux/pid_namespace.h>
#include <linux/refcount.h>
+#include <linux/user_namespace.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

+ /** The user namespace for this mount */
+ struct user_namespace *user_ns;
+
/** Maximum read size */
unsigned max_read;

@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);

/**
* Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 507f780046c5..b5b2e1fc5bfd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_ino = fuse_squash_ino(attr->ino);
inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
set_nlink(inode, attr->nlink);
- inode->i_uid = make_kuid(&init_user_ns, attr->uid);
- inode->i_gid = make_kgid(&init_user_ns, attr->gid);
+ inode->i_uid = make_kuid(fc->user_ns, attr->uid);
+ inode->i_gid = make_kgid(fc->user_ns, attr->gid);
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
return err;
}

-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+ struct user_namespace *user_ns)
{
char *p;
memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_USER_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->user_id = make_kuid(current_user_ns(), uv);
+ d->user_id = make_kuid(user_ns, uv);
if (!uid_valid(d->user_id))
return 0;
d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_GROUP_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->group_id = make_kgid(current_user_ns(), uv);
+ d->group_id = make_kgid(user_ns, uv);
if (!gid_valid(d->group_id))
return 0;
d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);

- seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+ seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+ seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
if (fc->default_permissions)
seq_puts(m, ",default_permissions");
if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}

-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->user_ns = get_user_ns(user_ns);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
put_pid_ns(fc->pid_ns);
+ put_user_ns(fc->user_ns);
fc->release(fc);
}
}
@@ -1060,7 +1063,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)

sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);

- if (!parse_fuse_opt(data, &d, is_bdev))
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
goto err;

if (is_bdev) {
@@ -1085,8 +1088,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!file)
goto err;

- if ((file->f_op != &fuse_dev_operations) ||
- (file->f_cred->user_ns != &init_user_ns))
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if (file->f_op != &fuse_dev_operations ||
+ file->f_cred->user_ns != sb->s_user_ns)
goto err_fput;

fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1094,7 +1101,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!fc)
goto err_fput;

- fuse_conn_init(fc);
+ fuse_conn_init(fc, sb->s_user_ns);
fc->release = fuse_free_conn;

fud = fuse_dev_alloc(fc);
--
2.14.1
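
The effect of switching from &init_user_ns to fc->user_ns can be illustrated with a toy id map. This is a simplified sketch under stated assumptions: real user namespaces can hold several map extents and use distinct kuid_t/kgid_t types, whereas the hypothetical toy_userns below has a single extent and plain integers. toy_make_kid plays the role of make_kuid (id as seen by the fuse server to kernel id) and toy_from_kid the role of from_kuid (kernel id back to the server's view).

```c
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* One extent of an id map, as would be written to /proc/<pid>/uid_map. */
struct toy_userns {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* kernel-global id it corresponds to */
	uint32_t count;       /* number of consecutive ids mapped */
};

/* Namespace-relative id -> kernel id, or TOY_INVALID_ID if unmapped. */
static uint32_t toy_make_kid(const struct toy_userns *ns, uint32_t id)
{
	if (id >= ns->first && id - ns->first < ns->count)
		return ns->lower_first + (id - ns->first);
	return TOY_INVALID_ID;
}

/* Kernel id -> namespace-relative id, or TOY_INVALID_ID if unmapped. */
static uint32_t toy_from_kid(const struct toy_userns *ns, uint32_t kid)
{
	if (kid >= ns->lower_first && kid - ns->lower_first < ns->count)
		return ns->first + (kid - ns->lower_first);
	return TOY_INVALID_ID;
}
```

With a container map of first=0, lower_first=100000, count=65536, an attr->uid of 0 reported by the fuse server becomes kernel id 100000, while kernel id 0 (global root) has no representation inside that namespace.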


2018-03-02 22:09:07

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 4/6] fuse: Fail all requests with invalid uids or gids

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation. Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid. But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dev.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
refcount_dec(&req->count);
}

-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+ return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+ req->in.h.uid = 0;
+ req->in.h.gid = 0;
+ req->in.h.pid = 0;
}

void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
wake_up(&fc->blocked_waitq);
goto out;
}
-
- fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
-
+ if (unlikely(!fuse_req_init_context(fc, req))) {
+ fuse_put_request(fc, req);
+ return ERR_PTR(-EOVERFLOW);
+ }
return req;

out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
if (!req)
req = get_reserved_req(fc, file);

- fuse_req_init_context(fc, req);
+ fuse_req_init_context_nofail(req);
__set_bit(FR_WAITING, &req->flags);
__clear_bit(FR_BACKGROUND, &req->flags);
return req;
--
2.14.1
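
The difference between the munged and the strict translation, and why the strict one lets the request be failed, can be sketched as follows. This is an illustrative userspace model only: the toy_* names and the single-extent toy_idmap are assumptions, and 65534 is used as the overflow id because that is the kernel's default.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_ID  ((uint32_t)-1)
#define TOY_OVERFLOW_ID 65534u

/* Single-extent id map standing in for a user namespace. */
struct toy_idmap {
	uint32_t first, lower_first, count;
};

/* Stand-in for the uid/gid fields of the fuse request header. */
struct toy_req {
	uint32_t uid, gid;
};

/* Strict translation: unmapped ids are reported as invalid. */
static uint32_t toy_from_kid(const struct toy_idmap *m, uint32_t kid)
{
	if (kid >= m->lower_first && kid - m->lower_first < m->count)
		return m->first + (kid - m->lower_first);
	return TOY_INVALID_ID;
}

/* Munged translation: unmapped ids silently become the overflow id. */
static uint32_t toy_from_kid_munged(const struct toy_idmap *m, uint32_t kid)
{
	uint32_t id = toy_from_kid(m, kid);

	return id == TOY_INVALID_ID ? TOY_OVERFLOW_ID : id;
}

/* Fill in the request and report whether both ids were representable. */
static bool toy_req_init_context(const struct toy_idmap *m,
				 struct toy_req *req,
				 uint32_t fsuid, uint32_t fsgid)
{
	req->uid = toy_from_kid(m, fsuid);
	req->gid = toy_from_kid(m, fsgid);
	return req->uid != TOY_INVALID_ID && req->gid != TOY_INVALID_ID;
}

/* Returns 0 on success or -EOVERFLOW, mirroring the patched __fuse_get_req. */
static int toy_get_req(const struct toy_idmap *m, struct toy_req *req,
		       uint32_t fsuid, uint32_t fsgid)
{
	if (!toy_req_init_context(m, req, fsuid, fsgid))
		return -EOVERFLOW;
	return 0;
}
```

With the munged variant the server would see the overflow id and could not tell it apart from a real user with that id; the strict variant makes the failure visible to the caller instead.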


2018-03-02 22:10:06

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 6/6] fuse: Restrict allow_other to the superblock's namespace or a descendant

From: Seth Forshee <[email protected]>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.

Cc: [email protected]
Cc: [email protected]
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Acked-by: Miklos Szeredi <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Reviewed-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dir.c | 2 +-
kernel/user_namespace.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index c749a4bd4ea3..5461b63bb2a4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
const struct cred *cred;

if (fc->allow_other)
- return 1;
+ return current_in_userns(fc->user_ns);

cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
{
return in_userns(target_ns, current_user_ns());
}
+EXPORT_SYMBOL(current_in_userns);

static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
--
2.14.1
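
The "same userns or a descendant" test can be pictured with a toy namespace tree. The sketch below is a userspace approximation, assuming hypothetical toy_* names and a namespace reduced to a parent pointer and a nesting level; owner_matches stands in for the uid and gid comparisons that the real function performs when allow_other is not set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user namespace: just enough to walk towards the root. */
struct toy_userns {
	const struct toy_userns *parent; /* NULL for the initial namespace */
	int level;                       /* 0 for the initial namespace */
};

/* True if 'child' is 'ancestor' itself or one of its descendants. */
static bool toy_in_userns(const struct toy_userns *ancestor,
			  const struct toy_userns *child)
{
	const struct toy_userns *ns = child;

	/* Climb until we are no deeper than the candidate ancestor. */
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Access decision: allow_other only opens the mount to the mounter's subtree. */
static bool toy_allow_current(bool allow_other,
			      const struct toy_userns *mount_ns,
			      const struct toy_userns *current_ns,
			      bool owner_matches)
{
	if (allow_other)
		return toy_in_userns(mount_ns, current_ns);
	return owner_matches;
}
```

A process in the initial namespace therefore gets no access through allow_other to a mount created inside a container's namespace, while processes in that container or in namespaces nested below it do.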


2018-03-03 03:35:40

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH v8 3/6] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist. The process may have been
killed or may have fired an asynchronous request and exited.

If the initial process has exited the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid. Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read. That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow. So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/dev.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
in = &req->in;
reqsize = in->h.len;

- if (task_active_pid_ns(current) != fc->pid_ns) {
- rcu_read_lock();
- in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
- rcu_read_unlock();
- }
-
/* If request is too large, reply with an error and restart the read */
if (nbytes < reqsize) {
req->out.h.error = -EIO;
--
2.14.1


2018-03-05 09:58:57

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping

On Fri, Mar 2, 2018 at 10:59 PM, Eric W. Biederman
<[email protected]> wrote:
> The code has been missing a way for a ->get_acl method to not cache
> a return value without risking invalidating a cached value
> that was set while get_acl() was returning.
>
> Add that support by implementing to_uncachable_acl, to_cachable_acl,
> is_uncacheable_acl, and dealing with uncachable acls in get_acl().

I don't like the pointer magic here. Can't the uncachable bit just be
added to struct posix_acl?

AFAICS that can be done even without increasing the size of that
struct (e.g. by unioning it with the rcu_head).

Thanks,
Miklos

2018-03-05 14:21:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v8 1/6] fs/posix_acl: Update the comments and support lightweight cache skipping

Miklos Szeredi <[email protected]> writes:

> On Fri, Mar 2, 2018 at 10:59 PM, Eric W. Biederman
> <[email protected]> wrote:
>> The code has been missing a way for a ->get_acl method to not cache
>> a return value without risking invalidating a cached value
>> that was set while get_acl() was returning.
>>
>> Add that support by implementing to_uncachable_acl, to_cachable_acl,
>> is_uncacheable_acl, and dealing with uncachable acls in get_acl().
>
> I don't like the pointer magic here. Can't the uncachable bit just be
> added to struct posix_acl?
>
> AFAICS that can be done even without increasing the size of that
> struct (e.g. by unioning it with the rcu_head).

Except that would:
- add a possible cache line miss.
- make it unusable for overlayfs.

I am after very light-weight semantics that say don't cache this return
value but don't have any effects elsewhere.

We are already playing pointer magic games in this code. This just uses
those games for the last piece of information to keep the logic clean.

I see two possible implementation alternatives:
- Make get_acl return a struct that returns the acl and cachability flag
- Add a helper that does "cmpxchg(p, sentinel, ACL_NOT_CACHED)".
Such a helper function seems like a waste: it does side-effect magic,
which is never particularly pleasant, and it is more code to execute
in practice. Though honestly it is my second choice.

void dont_cache_my_return_acl(struct inode *inode, int type)
{
	/* Valid only inside ->get_acl implementations */
	struct posix_acl **p = get_acl_type(inode, type);
	struct posix_acl *sentinel = uncached_acl_sentinel(current);
	cmpxchg(p, sentinel, ACL_NOT_CACHED);
}
EXPORT_SYMBOL(dont_cache_my_return_acl);

It is just a few instructions more so I guess it isn't that bad.
Especially for something that is not a common case.
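
To make the helper above concrete, here is a minimal userspace sketch of
the sentinel handshake. It assumes that get_acl() claims the cache slot
with a per-task sentinel before calling ->get_acl and installs the result
only if the sentinel is still in place. Every toy_* name is hypothetical,
and C11 atomic_compare_exchange_strong stands in for the kernel's cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's definitions. */
struct toy_acl { int id; };

/* Plays the role of ACL_NOT_CACHED: a distinguished pointer value. */
static struct toy_acl not_cached_marker;
#define TOY_NOT_CACHED (&not_cached_marker)

struct toy_inode {
	_Atomic(struct toy_acl *) cached_acl;	/* the per-inode cache slot */
};

void toy_inode_init(struct toy_inode *inode)
{
	assert(inode != NULL);
	atomic_init(&inode->cached_acl, TOY_NOT_CACHED);
}

/* Step 1 (caller): claim the slot, NOT_CACHED -> sentinel. */
bool toy_claim_slot(struct toy_inode *inode, struct toy_acl *sentinel)
{
	struct toy_acl *expected = TOY_NOT_CACHED;

	return atomic_compare_exchange_strong(&inode->cached_acl,
					      &expected, sentinel);
}

/* Step 2 (optional, inside ->get_acl): sentinel -> NOT_CACHED. */
void toy_dont_cache_my_return_acl(struct toy_inode *inode,
				  struct toy_acl *sentinel)
{
	struct toy_acl *expected = sentinel;

	atomic_compare_exchange_strong(&inode->cached_acl,
				       &expected, TOY_NOT_CACHED);
}

/* Step 3 (caller): publish the result only if the sentinel survived. */
bool toy_publish(struct toy_inode *inode, struct toy_acl *sentinel,
		 struct toy_acl *acl)
{
	struct toy_acl *expected = sentinel;

	return atomic_compare_exchange_strong(&inode->cached_acl,
					      &expected, acl);
}

struct toy_acl *toy_peek(struct toy_inode *inode)
{
	return atomic_load(&inode->cached_acl);
}
```

If step 2 runs, step 3 fails and the slot is left in the not-cached
state, so the returned acl is simply never installed and no other path
has to know about it.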

Do you think you could live with dont_cache_my_return_acl?

Otherwise I think I will respin this patch set without the acl
unification. There is plenty of evidence what it will look like
now. We can deal with the rest of the patches. Then we can come back
to exactly what acl unification in fuse should look like.

Eric

2018-03-08 21:25:17

by Eric W. Biederman

Subject: [PATCH v9 0/4] fuse: mounts from non-init user namespaces


This patchset builds on the work by Dongsu Park and Seth Forshee and is
reduced to the set of patches that just affect fuse. The non-fuse
vfs patches are far enough along we can ignore them except possibly for the
question of when does FS_USERNS_MOUNT get set in fuse_fs_type.

Fuse with a block device has been left as an exercise for a later time.

Since v5 I changed the core of this patchset around as the previous
patches were showing signs of bitrot. Some important explanations were
missing, some important functionality was missing, and xattr handling
was completely absent.

Since v6 I have:
- Removed the failure case from fuse_get_req_nofail_nopages that I
added.
- Updated fuse to always use posix_acl_access_xattr_handler, and
posix_acl_default_xattr_handler, by teaching fuse to set
ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.

Since v7 I have:
- Rethought and reworked how I am unifying the cached and the non-cached
posix acl case so the code is cleaner and simpler.
- I have dropped enhancements to caching negative acls when
fc->no_getxattr is set.
- Removed the need to wrap forget_all_cached_acls in fuse.
- Reorder the patches so the posix acl work comes first

Since v8 I have:
- Dropped and postponed the unification of the uncached and the cached
posix acls case. The code is not hard but tricky enough it needs
to be considered on its own merits.

Miklos can you take a look and see what you think?

Miklos if you could pick these up I would appreciate it. If not I can
merge these through the userns tree.

These changes are also available at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-fuse-v9

Eric W. Biederman (3):
fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
fuse: Fail all requests with invalid uids or gids
fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee (1):
fuse: Restrict allow_other to the superblock's namespace or a descendant

fs/fuse/acl.c | 4 ++--
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 30 +++++++++++++++++-------------
fs/fuse/dir.c | 16 ++++++++--------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 31 +++++++++++++++++++------------
kernel/user_namespace.c | 1 +
7 files changed, 58 insertions(+), 37 deletions(-)

2018-03-08 21:29:52

by Eric W. Biederman

Subject: [PATCH v9 1/4] fuse: Remove the buggy retranslation of pids in fuse_dev_do_read

At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist. The process may have
been killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid. Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.
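
The failure mode can be illustrated with a small userspace model. This is
only a sketch: toy_proc, toy_table and toy_retranslate are hypothetical
names, and the table lookup stands in for looking the number up again in
fc->pid_ns at read time.

```c
#include <assert.h>
#include <stddef.h>

/* One live process as seen from two pid namespaces (toy model). */
struct toy_proc {
	int nr_in_mount_ns;	/* number in the namespace recorded at mount */
	int nr_in_server_ns;	/* number as seen by the fuse server */
};

/* The set of processes that are alive when the server reads the request. */
struct toy_table {
	const struct toy_proc *procs;
	size_t count;
};

/*
 * Models the removed retranslation: take the number stored in the
 * request, look it up again among the live processes, and report the
 * number the server would see.  Returns 0 if nothing matches.
 */
int toy_retranslate(const struct toy_table *live, int nr_in_mount_ns)
{
	assert(live != NULL);
	for (size_t i = 0; i < live->count; i++) {
		if (live->procs[i].nr_in_mount_ns == nr_in_mount_ns)
			return live->procs[i].nr_in_server_ns;
	}
	return 0;	/* the requester is gone */
}
```

A request stored with number 42 translates correctly while the requester
is alive, yields 0 once it has exited, and yields the number of an
unrelated process if 42 has been handed out again in the meantime.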

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read. That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow. So that is
not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages
so removing this code should not matter.

Getting the translation to a server running outside of the pid namespace
of a container can still be achieved by playing setns games at mount time.
It is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/fuse/dev.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5d06384c2cae..0fb58f364fa6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1260,12 +1260,6 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
in = &req->in;
reqsize = in->h.len;

- if (task_active_pid_ns(current) != fc->pid_ns) {
- rcu_read_lock();
- in->h.pid = pid_vnr(find_pid_ns(in->h.pid, fc->pid_ns));
- rcu_read_unlock();
- }
-
/* If request is too large, reply with an error and restart the read */
if (nbytes < reqsize) {
req->out.h.error = -EIO;
--
2.14.1


2018-03-08 21:29:52

by Eric W. Biederman

Subject: [PATCH v9 4/4] fuse: Restrict allow_other to the superblock's namespace or a descendant

From: Seth Forshee <[email protected]>

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
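
As an illustration of the "same userns or a descendant" rule, here is a
minimal userspace sketch. It assumes each namespace records its parent
and its depth; struct toy_userns and both toy_* functions are
hypothetical stand-ins, not the kernel's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user namespace: parent link plus depth below the initial one. */
struct toy_userns {
	const struct toy_userns *parent;	/* NULL for the initial namespace */
	int level;				/* 0 for the initial namespace */
};

/* True if child is ancestor itself or one of its descendants. */
bool toy_in_userns(const struct toy_userns *ancestor,
		   const struct toy_userns *child)
{
	const struct toy_userns *ns = child;

	assert(ancestor != NULL && child != NULL);
	/* Walk up until we reach the ancestor's depth, then compare. */
	while (ns->level > ancestor->level)
		ns = ns->parent;
	return ns == ancestor;
}

/* Models the allow_other branch after this patch. */
bool toy_allow_other_permits(const struct toy_userns *mount_ns,
			     const struct toy_userns *caller_ns)
{
	return toy_in_userns(mount_ns, caller_ns);
}
```

With a mount made in a container namespace, callers in that namespace
and in namespaces nested below it pass the check, while callers in the
initial namespace or in a sibling namespace do not.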

Cc: [email protected]
Cc: [email protected]
Cc: "Eric W. Biederman" <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Acked-by: Miklos Szeredi <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
Reviewed-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
Signed-off-by: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dir.c | 2 +-
kernel/user_namespace.c | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index ad1cfac1942f..d41559a0aa6b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1030,7 +1030,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
const struct cred *cred;

if (fc->allow_other)
- return 1;
+ return current_in_userns(fc->user_ns);

cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..492c255e6c5a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1235,6 +1235,7 @@ bool current_in_userns(const struct user_namespace *target_ns)
{
return in_userns(target_ns, current_user_ns());
}
+EXPORT_SYMBOL(current_in_userns);

static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
--
2.14.1


2018-03-08 21:30:19

by Eric W. Biederman

Subject: [PATCH v9 2/4] fuse: Fail all requests with invalid uids or gids

Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation. Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid. But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/dev.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0fb58f364fa6..2886a56d5f61 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -112,11 +112,20 @@ static void __fuse_put_request(struct fuse_req *req)
refcount_dec(&req->count);
}

-static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
+static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
+
+ return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
+}
+
+static void fuse_req_init_context_nofail(struct fuse_req *req)
+{
+ req->in.h.uid = 0;
+ req->in.h.gid = 0;
+ req->in.h.pid = 0;
}

void fuse_set_initialized(struct fuse_conn *fc)
@@ -162,12 +171,13 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
wake_up(&fc->blocked_waitq);
goto out;
}
-
- fuse_req_init_context(fc, req);
__set_bit(FR_WAITING, &req->flags);
if (for_background)
__set_bit(FR_BACKGROUND, &req->flags);
-
+ if (unlikely(!fuse_req_init_context(fc, req))) {
+ fuse_put_request(fc, req);
+ return ERR_PTR(-EOVERFLOW);
+ }
return req;

out:
@@ -256,7 +266,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
if (!req)
req = get_reserved_req(fc, file);

- fuse_req_init_context(fc, req);
+ fuse_req_init_context_nofail(req);
__set_bit(FR_WAITING, &req->flags);
__clear_bit(FR_BACKGROUND, &req->flags);
return req;
--
2.14.1


2018-03-08 21:31:06

by Eric W. Biederman

Subject: [PATCH v9 3/4] fuse: Support fuse filesystems outside of init_user_ns

In order to support mounts from namespaces other than init_user_ns,
fuse must translate uids and gids to/from the userns of the process
servicing requests on /dev/fuse. This patch does that, with a couple
of restrictions on the namespace:

- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.

- The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to
pass around userns references and by allowing fuse to rely on the
checks in setattr_prepare for ownership changes. Either restriction
could be relaxed in the future if needed.
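
The two restrictions and the direction of translation at mount time can
be sketched in userspace as follows. This is an illustration only: the
namespace is reduced to a single-extent id map, toy_fill_super and the
other toy_* names are hypothetical, and -EINVAL is just a convenient
error value for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_ID ((uint32_t)-1)

/* Toy user namespace: ids [first, first + count) map to kernel ids
 * [kfirst, kfirst + count). */
struct toy_userns {
	uint32_t first;
	uint32_t kfirst;
	uint32_t count;
};

/* Namespace id -> kernel id (the direction used for mount options
 * and for attributes coming back from the server). */
uint32_t toy_make_kid(const struct toy_userns *ns, uint32_t id)
{
	if (id >= ns->first && id - ns->first < ns->count)
		return ns->kfirst + (id - ns->first);
	return TOY_INVALID_ID;
}

/* Kernel id -> namespace id (the direction used for requests sent
 * to the server). */
uint32_t toy_from_kid(const struct toy_userns *ns, uint32_t kid)
{
	if (kid >= ns->kfirst && kid - ns->kfirst < ns->count)
		return ns->first + (kid - ns->kfirst);
	return TOY_INVALID_ID;
}

/*
 * Mount-time checks: the namespace that opened the device must be the
 * superblock's namespace, and the user_id= option must map in it.
 */
int toy_fill_super(const struct toy_userns *dev_opener_ns,
		   const struct toy_userns *sb_ns,
		   uint32_t user_id_opt, uint32_t *kuid_out)
{
	uint32_t kuid;

	assert(kuid_out != NULL);
	if (dev_opener_ns != sb_ns)
		return -EINVAL;
	kuid = toy_make_kid(sb_ns, user_id_opt);
	if (kuid == TOY_INVALID_ID)
		return -EINVAL;
	*kuid_out = kuid;
	return 0;
}
```

Because one namespace is fixed for the lifetime of the connection, the
same map is used in both directions and an id that went out to the
server translates back to the kernel id it came from.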

For cuse the userns used is the opener of /dev/cuse. Semantically the
cuse support does not appear safe for unprivileged users. Practically
the permissions on /dev/cuse only make it accessible to the global root
user. If something slips through the cracks in a user namespace the only
users who will be able to use the cuse device are those users mapped into
the user namespace.

Translation in the posix acl is updated to use the user namespace of
the filesystem. Avoiding cases which might bypass this translation is
handled in a following change.

This change is strongly based on a similar change from Seth Forshee
and Dongsu Park.

Cc: [email protected]
Cc: [email protected]
Cc: Miklos Szeredi <[email protected]>
Cc: <[email protected]>
Cc: Dongsu Park <[email protected]>
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fuse/acl.c | 4 ++--
fs/fuse/cuse.c | 7 ++++++-
fs/fuse/dev.c | 4 ++--
fs/fuse/dir.c | 14 +++++++-------
fs/fuse/fuse_i.h | 6 +++++-
fs/fuse/inode.c | 31 +++++++++++++++++++------------
6 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index ec85765502f1..5a48cee6d7d3 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -34,7 +34,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type)
return ERR_PTR(-ENOMEM);
size = fuse_getxattr(inode, name, value, PAGE_SIZE);
if (size > 0)
- acl = posix_acl_from_xattr(&init_user_ns, value, size);
+ acl = posix_acl_from_xattr(fc->user_ns, value, size);
else if ((size == 0) || (size == -ENODATA) ||
(size == -EOPNOTSUPP && fc->no_getxattr))
acl = NULL;
@@ -81,7 +81,7 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type)
if (!value)
return -ENOMEM;

- ret = posix_acl_to_xattr(&init_user_ns, acl, value, size);
+ ret = posix_acl_to_xattr(fc->user_ns, acl, value, size);
if (ret < 0) {
kfree(value);
return ret;
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index e9e97803442a..036ee477669e 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -48,6 +48,7 @@
#include <linux/stat.h>
#include <linux/module.h>
#include <linux/uio.h>
+#include <linux/user_namespace.h>

#include "fuse_i.h"

@@ -498,7 +499,11 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
if (!cc)
return -ENOMEM;

- fuse_conn_init(&cc->fc);
+ /*
+ * Limit the cuse channel to requests that can
+ * be represented in file->f_cred->user_ns.
+ */
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns);

fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2886a56d5f61..fce7915aea13 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -114,8 +114,8 @@ static void __fuse_put_request(struct fuse_req *req)

static bool fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid(fc->user_ns, current_fsuid());
+ req->in.h.gid = from_kgid(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);

return (req->in.h.uid != ((uid_t)-1)) && (req->in.h.gid != ((gid_t)-1));
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 24967382a7b1..ad1cfac1942f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -858,8 +858,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
- stat->uid = make_kuid(&init_user_ns, attr->uid);
- stat->gid = make_kgid(&init_user_ns, attr->gid);
+ stat->uid = make_kuid(fc->user_ns, attr->uid);
+ stat->gid = make_kgid(fc->user_ns, attr->gid);
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -1475,17 +1475,17 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
}

-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
- bool trust_local_cmtime)
+static void iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+ struct fuse_setattr_in *arg, bool trust_local_cmtime)
{
unsigned ivalid = iattr->ia_valid;

if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
if (ivalid & ATTR_UID)
- arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
+ arg->valid |= FATTR_UID, arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
if (ivalid & ATTR_GID)
- arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+ arg->valid |= FATTR_GID, arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
if (ivalid & ATTR_SIZE)
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
if (ivalid & ATTR_ATIME) {
@@ -1646,7 +1646,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,

memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
- iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+ iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
if (file) {
struct fuse_file *ff = file->private_data;
inarg.valid |= FATTR_FH;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4c093bbf456..7772e2b4057e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -26,6 +26,7 @@
#include <linux/xattr.h>
#include <linux/pid_namespace.h>
#include <linux/refcount.h>
+#include <linux/user_namespace.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -466,6 +467,9 @@ struct fuse_conn {
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

+ /** The user namespace for this mount */
+ struct user_namespace *user_ns;
+
/** Maximum read size */
unsigned max_read;

@@ -870,7 +874,7 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);

/**
* Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 624f18bbfd2b..e018dc3999f4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -171,8 +171,8 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_ino = fuse_squash_ino(attr->ino);
inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
set_nlink(inode, attr->nlink);
- inode->i_uid = make_kuid(&init_user_ns, attr->uid);
- inode->i_gid = make_kgid(&init_user_ns, attr->gid);
+ inode->i_uid = make_kuid(fc->user_ns, attr->uid);
+ inode->i_gid = make_kgid(fc->user_ns, attr->gid);
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
@@ -477,7 +477,8 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
return err;
}

-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
+static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+ struct user_namespace *user_ns)
{
char *p;
memset(d, 0, sizeof(struct fuse_mount_data));
@@ -513,7 +514,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_USER_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->user_id = make_kuid(current_user_ns(), uv);
+ d->user_id = make_kuid(user_ns, uv);
if (!uid_valid(d->user_id))
return 0;
d->user_id_present = 1;
@@ -522,7 +523,7 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
case OPT_GROUP_ID:
if (fuse_match_uint(&args[0], &uv))
return 0;
- d->group_id = make_kgid(current_user_ns(), uv);
+ d->group_id = make_kgid(user_ns, uv);
if (!gid_valid(d->group_id))
return 0;
d->group_id_present = 1;
@@ -565,8 +566,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);

- seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+ seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
+ seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
if (fc->default_permissions)
seq_puts(m, ",default_permissions");
if (fc->allow_other)
@@ -597,7 +598,7 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}

-void fuse_conn_init(struct fuse_conn *fc)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -621,6 +622,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->user_ns = get_user_ns(user_ns);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -630,6 +632,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
put_pid_ns(fc->pid_ns);
+ put_user_ns(fc->user_ns);
fc->release(fc);
}
}
@@ -1061,7 +1064,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)

sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);

- if (!parse_fuse_opt(data, &d, is_bdev))
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
goto err;

if (is_bdev) {
@@ -1086,8 +1089,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!file)
goto err;

- if ((file->f_op != &fuse_dev_operations) ||
- (file->f_cred->user_ns != &init_user_ns))
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if (file->f_op != &fuse_dev_operations ||
+ file->f_cred->user_ns != sb->s_user_ns)
goto err_fput;

fc = kmalloc(sizeof(*fc), GFP_KERNEL);
@@ -1095,7 +1102,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!fc)
goto err_fput;

- fuse_conn_init(fc);
+ fuse_conn_init(fc, sb->s_user_ns);
fc->release = fuse_free_conn;

fud = fuse_dev_alloc(fc);
--
2.14.1


2018-03-20 16:26:32

by Miklos Szeredi

Subject: Re: [PATCH v9 0/4] fuse: mounts from non-init user namespaces

On Thu, Mar 8, 2018 at 10:23 PM, Eric W. Biederman
<[email protected]> wrote:
>
> This patchset builds on the work by Dongsu Park and Seth Forshee and is
> reduced to the set of patches that just affect fuse. The non-fuse
> vfs patches are far enough along we can ignore them except possibly for the
> question of when does FS_USERNS_MOUNT get set in fuse_fs_type.
>
> Fuse with a block device has been left as an exercise for a later time.
>
> Since v5 I changed the core of this patchset around as the previous
> patches were showing signs of bitrot. Some important explanations were
> missing, some important functionality was missing, and xattr handling
> was completely absent.
>
> Since v6 I have:
> - Removed the failure case from fuse_get_req_nofail_nopages that I
> added.
> - Updated fuse to always use posix_acl_access_xattr_handler, and
> posix_acl_default_xattr_handler, by teaching fuse to set
> ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.
>
> Since v7 I have:
> - Rethought and reworked how I am unifying the cached and the non-cached
> posix acl case so the code is cleaner and simpler.
> - I have dropped enhancements to caching negative acls when
> fc->no_getxattr is set.
> - Removed the need to wrap forget_all_cached_acls in fuse.
> - Reorder the patches so the posix acl work comes first
>
> Since v8 I have:
> - Dropped and postponed the unification of the uncached and the cached
> posix acls case. The code is not hard but tricky enough it needs
> to be considered on its own merits.
>
> Miklos can you take a look and see what you think?
>
> Miklos if you could pick these up I would appreciate it. If not I can
> merge these through the userns tree.

Thank you Eric for moving this along. Patches pushed to:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

I did just one modification to "fuse: Fail all requests with invalid
uids or gids": instead of zeroing out the context for the nofail case,
continue to use the "_munged" variants. I don't think this hurts and
is better for backward compatibility (I guess the only relevant use
would be for debugging output, but we don't want to regress even for
that if not necessary).

Thanks,
Miklos

2018-03-20 18:30:25

by Eric W. Biederman

Subject: Re: [PATCH v9 0/4] fuse: mounts from non-init user namespaces

Miklos Szeredi <[email protected]> writes:

> On Thu, Mar 8, 2018 at 10:23 PM, Eric W. Biederman
> <[email protected]> wrote:
>>
>> This patchset builds on the work by Dongsu Park and Seth Forshee and is
>> reduced to the set of patches that just affect fuse. The non-fuse
>> vfs patches are far enough along we can ignore them except possibly for the
>> question of when does FS_USERNS_MOUNT get set in fuse_fs_type.
>>
>> Fuse with a block device has been left as an exercise for a later time.
>>
>> Since v5 I changed the core of this patchset around as the previous
>> patches were showing signs of bitrot. Some important explanations were
>> missing, some important functionality was missing, and xattr handling
>> was completely absent.
>>
>> Since v6 I have:
>> - Removed the failure case from fuse_get_req_nofail_nopages that I
>> added.
>> - Updated fuse to always use posix_acl_access_xattr_handler, and
>> posix_acl_default_xattr_handler, by teaching fuse to set
>> ACL_DONT_CACHE when FUSE_POSIX_ACL is not set.
>>
>> Since v7 I have:
>> - Rethought and reworked how I am unifying the cached and the non-cached
>> posix acl case so the code is cleaner and simpler.
>> - I have dropped enhancements to caching negative acls when
>> fc->no_getxattr is set.
>> - Removed the need to wrap forget_all_cached_acls in fuse.
>> - Reorder the patches so the posix acl work comes first
>>
>> Since v8 I have:
>> - Dropped and postponed the unification of the uncached and the cached
>> posix acls case. The code is not hard but tricky enough it needs
>> to be considered on its own merits.
>>
>> Miklos can you take a look and see what you think?
>>
>> Miklos if you could pick these up I would appreciate it. If not I can
>> merge these through the userns tree.
>
> Thank you Eric for moving this along. Patches pushed to:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> I did just one modification to "fuse: Fail all requests with invalid
> uids or gids": instead of zeroing out the context for the nofail case,
> continue to use the "_munged" variants. I don't think this hurts and
> is better for backward compatibility (I guess the only relevant use
> would be for debugging output, but we don't want to regress even for
> that if not necessary)

Hmm...

The thing is the failure doesn't come in the difference between the
_munged and the normal variants. The difference between the munged and
non-munged variants is how they handle failure: (uid16_t)-2 aka 0xfffe
for the munged case and -1 for the non-munged case.

The failures are introduced by changing &init_user_ns to fc->user_ns.

The operations in question are iop->flush and fuse_force_forget (on an
error). I don't know what value having ids on those paths will have;
they are operations that must succeed, and they should not change the
on-disk ids. I was thinking that saying the most privileged id was
asking for the operation would make sense.

With the munged variants we will get (uid16_t)-2 aka 0xfffe aka
nobody asking for the operation if things don't map. In practice
the don't map case is new.

Since the id's should not be looked at anyway I don't see it makes
much difference which ids we use so the munged case seems at least
plausible.

It might be better to use the non-munged variant and do:
	if (req->in.h.uid == (uid_t)-1)
		req->in.h.uid = 0;
	if (req->in.h.gid == (gid_t)-1)
		req->in.h.gid = 0;

That might be less surprising to userspace, as I don't think the
unmapped case has ever occurred in practice yet. The vfs will work hard
to keep the unmapped case from happening, but only in the context of
i_uid and i_gid, not current_fsuid and current_fsgid.

Eric

2018-03-21 08:40:17

by Miklos Szeredi

Subject: Re: [PATCH v9 0/4] fuse: mounts from non-init user namespaces

On Tue, Mar 20, 2018 at 7:27 PM, Eric W. Biederman
<[email protected]> wrote:
> Miklos Szeredi <[email protected]> writes:

>> I did just one modification to "fuse: Fail all requests with invalid
>> uids or gids": instead of zeroing out the context for the nofail case,
>> continue to use the "_munged" variants. I don't think this hurts and
>> is better for backward compatibility (I guess the only relevant use
>> would be for debugging output, but we don't want to regress even for
>> that if not necessary)
>
> Hmm...
>
> The thing is the failure doesn't come in the difference between the
> _munged and the normal variants. The difference between the munged and
> non-munged variants is how they handle failure: (uid16_t)-2 aka 0xfffe
> for the munged case and -1 for the non-munged case.
>
> The failures are introduced by changing &init_user_ns to fc->user_ns.

Right.

> The operations in question are iop->flush and fuse_force_forget (on an
> error). I don't know what value having ids on those paths will have;
> they are operations that must succeed, and they should not change the
> on-disk ids. I was thinking that saying the most privileged id was
> asking for the operation would make sense.

I don't think anybody should actually *care* about the id's in flush,
but I'd still not change the current behavior for change's sake.

>
> With the munged variants we will get (uid16_t)-2 aka 0xfffe aka
> nobody asking for the operation if things don't map. In practice
> the don't map case is new.
>
> Since the id's should not be looked at anyway I don't see it makes
> much difference which ids we use so the munged case seems at least
> plausible.
>
> It might be better to use the non-munged variant and do:
>	if (req->in.h.uid == (uid_t)-1)
>		req->in.h.uid = 0;
>	if (req->in.h.gid == (gid_t)-1)
>		req->in.h.gid = 0;
>
> That might be less surprising to userspace, as I don't think the
> unmapped case has ever occurred in practice yet.

Right, that would work too, but I don't think it actually matters, so
unless you can think of an actual security issue arising from using
the munged variants, I'd just leave it as it is.

Thanks,
Miklos