2014-10-22 21:24:55

by Seth Forshee

[permalink] [raw]
Subject: [PATCH v5 0/4] fuse: Add support for mounts from pid/user namespaces

Here's another update to the patches to enable userns mounts in fuse.
The changes this time around center on xattrs and allow_other. I'm
considering the following patch to be a prerequisite for this series:

http://lkml.kernel.org/g/252a4d87d99fc2b5fe4411c838f65b312c4e13cd.1413330857.git.luto@amacapital.net

With this patch I'm dropping the xattr restrictions from the v4 patches.
This leaves the situation with xattrs unchanged in init_user_ns. A fuse
userns mount could still supply xattrs which the user would normally be
unable to set, but since the mount will be inaccessible outside of that
userns or its descendants those xattrs will never be seen by the host.

I'm also in agreement with Andy that some decision regarding this patch
should be made before merging the fuse userns support as well:

http://lkml.kernel.org/g/2686c32f00b14148379e8cfee9c028c794d4aa1a.1407974494.git.luto@amacapital.net

Changes since v4:
- Drop patch to restrict xattrs
- Update allow_other to restrict based on userns of current rather than
uid/gid

Thanks,
Seth


Seth Forshee (4):
fuse: Add support for pid namespaces
fuse: Support fuse filesystems outside of init_user_ns
fuse: Restrict allow_other to the superblock's namespace or a
descendant
fuse: Allow user namespace mounts

fs/fuse/dev.c | 13 ++++----
fs/fuse/dir.c | 91 +++++++++++++++++++++++++++++++++++++-------------------
fs/fuse/file.c | 38 ++++++++++++++++-------
fs/fuse/fuse_i.h | 16 +++++++---
fs/fuse/inode.c | 81 ++++++++++++++++++++++++++++++++++++++-----------
5 files changed, 171 insertions(+), 68 deletions(-)


2014-10-22 21:24:59

by Seth Forshee

[permalink] [raw]
Subject: [PATCH v5 3/4] fuse: Restrict allow_other to the superblock's namespace or a descendant

Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Therefore access with allow_other should be
restricted to users in the userns as the superblock or a
descendant of that namespace.

Cc: Eric W. Biederman <[email protected]>
Cc: Serge H. Hallyn <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
---
fs/fuse/dir.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 123db1e06c78..b23ec5c1ff18 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1091,8 +1091,14 @@ int fuse_allow_current_process(struct fuse_conn *fc)
{
const struct cred *cred;

- if (fc->flags & FUSE_ALLOW_OTHER)
- return 1;
+ if (fc->flags & FUSE_ALLOW_OTHER) {
+ struct user_namespace *ns;
+ for (ns = current_user_ns(); ns; ns = ns->parent) {
+ if (ns == fc->user_ns)
+ return 1;
+ }
+ return 0;
+ }

cred = current_cred();
if (uid_eq(cred->euid, fc->user_id) &&
--
1.9.1

2014-10-22 21:24:58

by Seth Forshee

[permalink] [raw]
Subject: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

Update fuse to translate uids and gids to/from the user namspace
of the process servicing requests on /dev/fuse. Any ids which do
not map into the namespace will result in errors. inodes will
also be marked bad when unmappable ids are received from
userspace.

Due to security concerns the namespace used should be fixed,
otherwise a user might be able to gain elevated privileges or
influence processes that the user would otherwise be unable to
manipulate. Thus the namespace of the mounting process is used
for all translations, and this namespace is required to be the
same as the one in use when /dev/fuse was opened.

Cc: Eric W. Biederman <[email protected]>
Cc: Serge H. Hallyn <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
---
fs/fuse/dev.c | 4 +--
fs/fuse/dir.c | 81 ++++++++++++++++++++++++++++++++++++--------------------
fs/fuse/fuse_i.h | 12 ++++++---
fs/fuse/inode.c | 73 +++++++++++++++++++++++++++++++++++++++-----------
4 files changed, 121 insertions(+), 49 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 839caebd34f1..03c8785ed731 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -127,8 +127,8 @@ static void __fuse_put_request(struct fuse_req *req)

static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
- req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
- req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
+ req->in.h.uid = from_kuid_munged(fc->user_ns, current_fsuid());
+ req->in.h.gid = from_kgid_munged(fc->user_ns, current_fsgid());
req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
}

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index de1d84af9f7c..123db1e06c78 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -253,9 +253,12 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (err || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
goto invalid;

- fuse_change_attributes(inode, &outarg.attr,
- entry_attr_timeout(&outarg),
- attr_version);
+ err = fuse_change_attributes(inode, &outarg.attr,
+ entry_attr_timeout(&outarg),
+ attr_version);
+ if (err)
+ goto invalid;
+
fuse_change_entry_timeout(entry, &outarg);
} else if (inode) {
fi = get_fuse_inode(inode);
@@ -340,8 +343,9 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, struct qstr *name,
*inode = fuse_iget(sb, outarg->nodeid, outarg->generation,
&outarg->attr, entry_attr_timeout(outarg),
attr_version);
- err = -ENOMEM;
- if (!*inode) {
+ if (IS_ERR(*inode)) {
+ err = PTR_ERR(*inode);
+ *inode = NULL;
fuse_queue_forget(fc, forget, outarg->nodeid, 1);
goto out;
}
@@ -473,11 +477,11 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
ff->open_flags = outopen.open_flags;
inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
&outentry.attr, entry_attr_timeout(&outentry), 0);
- if (!inode) {
+ if (IS_ERR(inode)) {
flags &= ~(O_CREAT | O_EXCL | O_TRUNC);
fuse_sync_release(ff, flags);
fuse_queue_forget(fc, forget, outentry.nodeid, 1);
- err = -ENOMEM;
+ err = PTR_ERR(inode);
goto out_err;
}
kfree(forget);
@@ -588,9 +592,9 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_req *req,

inode = fuse_iget(dir->i_sb, outarg.nodeid, outarg.generation,
&outarg.attr, entry_attr_timeout(&outarg), 0);
- if (!inode) {
+ if (IS_ERR(inode)) {
fuse_queue_forget(fc, forget, outarg.nodeid, 1);
- return -ENOMEM;
+ return PTR_ERR(inode);
}
kfree(forget);

@@ -905,8 +909,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->ino = attr->ino;
stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
stat->nlink = attr->nlink;
- stat->uid = make_kuid(&init_user_ns, attr->uid);
- stat->gid = make_kgid(&init_user_ns, attr->gid);
+ stat->uid = inode->i_uid;
+ stat->gid = inode->i_gid;
stat->rdev = inode->i_rdev;
stat->atime.tv_sec = attr->atime;
stat->atime.tv_nsec = attr->atimensec;
@@ -969,10 +973,10 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
make_bad_inode(inode);
err = -EIO;
} else {
- fuse_change_attributes(inode, &outarg.attr,
- attr_timeout(&outarg),
- attr_version);
- if (stat)
+ err = fuse_change_attributes(inode, &outarg.attr,
+ attr_timeout(&outarg),
+ attr_version);
+ if (!err && stat)
fuse_fillattr(inode, &outarg.attr, stat);
}
}
@@ -1302,9 +1306,11 @@ static int fuse_direntplus_link(struct file *file,
fi->nlookup++;
spin_unlock(&fc->lock);

- fuse_change_attributes(inode, &o->attr,
- entry_attr_timeout(o),
- attr_version);
+ err = fuse_change_attributes(inode, &o->attr,
+ entry_attr_timeout(o),
+ attr_version);
+ if (err)
+ goto out;

/*
* The other branch to 'found' comes via fuse_iget()
@@ -1322,8 +1328,10 @@ static int fuse_direntplus_link(struct file *file,

inode = fuse_iget(dir->i_sb, o->nodeid, o->generation,
&o->attr, entry_attr_timeout(o), attr_version);
- if (!inode)
+ if (IS_ERR(inode)) {
+ err = PTR_ERR(inode);
goto out;
+ }

alias = d_materialise_unique(dentry, inode);
err = PTR_ERR(alias);
@@ -1556,17 +1564,25 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
return true;
}

-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
- bool trust_local_cmtime)
+static int iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
+ struct fuse_setattr_in *arg, bool trust_local_cmtime)
{
unsigned ivalid = iattr->ia_valid;

if (ivalid & ATTR_MODE)
arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
- if (ivalid & ATTR_UID)
- arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
- if (ivalid & ATTR_GID)
- arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
+ if (ivalid & ATTR_UID) {
+ arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
+ if (arg->uid == (uid_t)-1)
+ return -EINVAL;
+ arg->valid |= FATTR_UID;
+ }
+ if (ivalid & ATTR_GID) {
+ arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
+ if (arg->gid == (gid_t)-1)
+ return -EINVAL;
+ arg->valid |= FATTR_GID;
+ }
if (ivalid & ATTR_SIZE)
arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
if (ivalid & ATTR_ATIME) {
@@ -1588,6 +1604,8 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
arg->ctime = iattr->ia_ctime.tv_sec;
arg->ctimensec = iattr->ia_ctime.tv_nsec;
}
+
+ return 0;
}

/*
@@ -1741,7 +1759,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,

memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
- iattr_to_fattr(attr, &inarg, trust_local_cmtime);
+ err = iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
+ if (err)
+ goto error;
if (file) {
struct fuse_file *ff = file->private_data;
inarg.valid |= FATTR_FH;
@@ -1778,8 +1798,13 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
/* FIXME: clear I_DIRTY_SYNC? */
}

- fuse_change_attributes_common(inode, &outarg.attr,
- attr_timeout(&outarg));
+ err = fuse_change_attributes_common(inode, &outarg.attr,
+ attr_timeout(&outarg));
+ if (err) {
+ spin_unlock(&fc->lock);
+ goto error;
+ }
+
oldsize = inode->i_size;
/* see the comment in fuse_change_attributes() */
if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a3ded071e2c6..81187ba04e4a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -22,6 +22,7 @@
#include <linux/rbtree.h>
#include <linux/poll.h>
#include <linux/workqueue.h>
+#include <linux/user_namespace.h>
#include <linux/pid_namespace.h>

/** Max number of pages that can be used in a single read request */
@@ -387,6 +388,9 @@ struct fuse_conn {
/** The group id for this mount */
kgid_t group_id;

+ /** The user namespace for this mount */
+ struct user_namespace *user_ns;
+
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

@@ -713,11 +717,11 @@ void fuse_init_symlink(struct inode *inode);
/**
* Change attributes of an inode
*/
-void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version);
+int fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid, u64 attr_version);

-void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid);
+int fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid);

/**
* Initialize the client device
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e137969815a3..b88b5a780228 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -155,11 +155,22 @@ static ino_t fuse_squash_ino(u64 ino64)
return ino;
}

-void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid)
+int fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ kuid_t uid;
+ kgid_t gid;
+
+ uid = make_kuid(fc->user_ns, attr->uid);
+ gid = make_kgid(fc->user_ns, attr->gid);
+ if (!uid_valid(uid) || !gid_valid(gid)) {
+ make_bad_inode(inode);
+ return -EIO;
+ }
+ inode->i_uid = uid;
+ inode->i_gid = gid;

fi->attr_version = ++fc->attr_version;
fi->i_time = attr_valid;
@@ -167,8 +178,6 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_ino = fuse_squash_ino(attr->ino);
inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
set_nlink(inode, attr->nlink);
- inode->i_uid = make_kuid(&init_user_ns, attr->uid);
- inode->i_gid = make_kgid(&init_user_ns, attr->gid);
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
@@ -195,26 +204,32 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_mode &= ~S_ISVTX;

fi->orig_ino = attr->ino;
+ return 0;
}

-void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version)
+int fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
+ u64 attr_valid, u64 attr_version)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
bool is_wb = fc->writeback_cache;
loff_t oldsize;
struct timespec old_mtime;
+ int err;

spin_lock(&fc->lock);
if ((attr_version != 0 && fi->attr_version > attr_version) ||
test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
spin_unlock(&fc->lock);
- return;
+ return 0;
}

old_mtime = inode->i_mtime;
- fuse_change_attributes_common(inode, attr, attr_valid);
+ err = fuse_change_attributes_common(inode, attr, attr_valid);
+ if (err) {
+ spin_unlock(&fc->lock);
+ return err;
+ }

oldsize = inode->i_size;
/*
@@ -249,6 +264,8 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
if (inval)
invalidate_inode_pages2(inode->i_mapping);
}
+
+ return 0;
}

static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
@@ -302,7 +319,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
retry:
inode = iget5_locked(sb, nodeid, fuse_inode_eq, fuse_inode_set, &nodeid);
if (!inode)
- return NULL;
+ return ERR_PTR(-ENOMEM);

if ((inode->i_state & I_NEW)) {
inode->i_flags |= S_NOATIME;
@@ -319,11 +336,23 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
goto retry;
}

+ /*
+ * Must do this before incrementing nlookup, as the caller will
+ * send a forget for the node if this function fails.
+ */
+ if (fuse_change_attributes(inode, attr, attr_valid, attr_version)) {
+ /*
+ * inode_make_bad() already called by
+ * fuse_change_attributes()
+ */
+ iput(inode);
+ return ERR_PTR(-EIO);
+ }
+
fi = get_fuse_inode(inode);
spin_lock(&fc->lock);
fi->nlookup++;
spin_unlock(&fc->lock);
- fuse_change_attributes(inode, attr, attr_valid, attr_version);

return inode;
}
@@ -496,6 +525,8 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
memset(d, 0, sizeof(struct fuse_mount_data));
d->max_read = ~0;
d->blksize = FUSE_DEFAULT_BLKSIZE;
+ d->user_id = make_kuid(current_user_ns(), 0);
+ d->group_id = make_kgid(current_user_ns(), 0);

while ((p = strsep(&opt, ",")) != NULL) {
int token;
@@ -578,8 +609,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);

- seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
+ seq_printf(m, ",user_id=%u",
+ from_kuid_munged(fc->user_ns, fc->user_id));
+ seq_printf(m, ",group_id=%u",
+ from_kgid_munged(fc->user_ns, fc->group_id));
if (fc->flags & FUSE_DEFAULT_PERMISSIONS)
seq_puts(m, ",default_permissions");
if (fc->flags & FUSE_ALLOW_OTHER)
@@ -618,6 +651,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->user_ns = get_user_ns(current_user_ns());
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -626,6 +660,8 @@ void fuse_conn_put(struct fuse_conn *fc)
if (atomic_dec_and_test(&fc->count)) {
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
+ put_user_ns(fc->user_ns);
+ fc->user_ns = NULL;
put_pid_ns(fc->pid_ns);
fc->pid_ns = NULL;
fc->release(fc);
@@ -643,12 +679,15 @@ EXPORT_SYMBOL_GPL(fuse_conn_get);
static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned mode)
{
struct fuse_attr attr;
+ struct inode *inode;
+
memset(&attr, 0, sizeof(attr));

attr.mode = mode;
attr.ino = FUSE_ROOT_ID;
attr.nlink = 1;
- return fuse_iget(sb, 1, 0, &attr, 0, 0);
+ inode = fuse_iget(sb, 1, 0, &attr, 0, 0);
+ return IS_ERR(inode) ? NULL : inode;
}

struct fuse_inode_handle {
@@ -1043,8 +1082,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (!file)
goto err;

- if ((file->f_op != &fuse_dev_operations) ||
- (file->f_cred->user_ns != &init_user_ns))
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if (file->f_op != &fuse_dev_operations ||
+ file->f_cred->user_ns != current_user_ns())
goto err_fput;

fc = kmalloc(sizeof(*fc), GFP_KERNEL);
--
1.9.1

2014-10-22 21:25:27

by Seth Forshee

[permalink] [raw]
Subject: [PATCH v5 4/4] fuse: Allow user namespace mounts

Cc: Eric W. Biederman <[email protected]>
Cc: Serge H. Hallyn <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
---
fs/fuse/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b88b5a780228..7d0e73e36e7b 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1201,7 +1201,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
static struct file_system_type fuse_fs_type = {
.owner = THIS_MODULE,
.name = "fuse",
- .fs_flags = FS_HAS_SUBTYPE,
+ .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
.mount = fuse_mount,
.kill_sb = fuse_kill_sb_anon,
};
@@ -1233,7 +1233,7 @@ static struct file_system_type fuseblk_fs_type = {
.name = "fuseblk",
.mount = fuse_mount_blk,
.kill_sb = fuse_kill_sb_blk,
- .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
+ .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
};
MODULE_ALIAS_FS("fuseblk");

--
1.9.1

2014-10-22 21:26:09

by Seth Forshee

[permalink] [raw]
Subject: [PATCH v5 1/4] fuse: Add support for pid namespaces

If the userspace process servicing fuse requests is running in
a pid namespace then pids passed via the fuse fd need to be
translated relative to that namespace. Capture the pid namespace
in use when the filesystem is mounted and use this for pid
translation.

File locking changes based on previous work done by Eric
Biederman.

Cc: Eric W. Biederman <[email protected]>
Cc: Serge H. Hallyn <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
---
fs/fuse/dev.c | 9 +++++----
fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
fs/fuse/fuse_i.h | 4 ++++
fs/fuse/inode.c | 4 ++++
4 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ca887314aba9..839caebd34f1 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -20,6 +20,7 @@
#include <linux/swap.h>
#include <linux/splice.h>
#include <linux/aio.h>
+#include <linux/sched.h>

MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
@@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
atomic_dec(&req->count);
}

-static void fuse_req_init_context(struct fuse_req *req)
+static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
{
req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
- req->in.h.pid = current->pid;
+ req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
}

static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
@@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
goto out;
}

- fuse_req_init_context(req);
+ fuse_req_init_context(fc, req);
req->waiting = 1;
req->background = for_background;
return req;
@@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
if (!req)
req = get_reserved_req(fc, file);

- fuse_req_init_context(req);
+ fuse_req_init_context(fc, req);
req->waiting = 1;
req->background = 0;
return req;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index caa8d95b24e8..cb0e40ecc362 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
return generic_file_mmap(file, vma);
}

-static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
+static int convert_fuse_file_lock(struct fuse_conn *fc,
+ const struct fuse_file_lock *ffl,
struct file_lock *fl)
{
switch (ffl->type) {
@@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,

fl->fl_start = ffl->start;
fl->fl_end = ffl->end;
- fl->fl_pid = ffl->pid;
+ rcu_read_lock();
+ fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
+ rcu_read_unlock();
+ if (ffl->pid != 0 && fl->fl_pid == 0)
+ return -EIO;
break;

default:
@@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
return 0;
}

-static void fuse_lk_fill(struct fuse_req *req, struct file *file,
- const struct file_lock *fl, int opcode, pid_t pid,
- int flock)
+static int fuse_lk_fill(struct fuse_req *req, struct file *file,
+ const struct file_lock *fl, int opcode,
+ struct pid *pid, int flock)
{
struct inode *inode = file_inode(file);
struct fuse_conn *fc = get_fuse_conn(inode);
@@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
arg->lk.start = fl->fl_start;
arg->lk.end = fl->fl_end;
arg->lk.type = fl->fl_type;
- arg->lk.pid = pid;
+ arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
+ if (pid && arg->lk.pid == 0)
+ return -EOVERFLOW;
if (flock)
arg->lk_flags |= FUSE_LK_FLOCK;
req->in.h.opcode = opcode;
@@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
req->in.args[0].value = arg;
+
+ return 0;
}

static int fuse_getlk(struct file *file, struct file_lock *fl)
@@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
if (IS_ERR(req))
return PTR_ERR(req);

- fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
+ err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
+ if (err)
+ goto out;
req->out.numargs = 1;
req->out.args[0].size = sizeof(outarg);
req->out.args[0].value = &outarg;
fuse_request_send(fc, req);
err = req->out.h.error;
- fuse_put_request(fc, req);
if (!err)
- err = convert_fuse_file_lock(&outarg.lk, fl);
+ err = convert_fuse_file_lock(fc, &outarg.lk, fl);

+out:
+ fuse_put_request(fc, req);
return err;
}

@@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
- pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
+ struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
int err;

if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
@@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
if (IS_ERR(req))
return PTR_ERR(req);

- fuse_lk_fill(req, file, fl, opcode, pid, flock);
+ err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
+ if (err)
+ goto out;
fuse_request_send(fc, req);
err = req->out.h.error;
/* locking is restartable */
if (err == -EINTR)
err = -ERESTARTSYS;
+
+out:
fuse_put_request(fc, req);
return err;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e8e47a6ab518..a3ded071e2c6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -22,6 +22,7 @@
#include <linux/rbtree.h>
#include <linux/poll.h>
#include <linux/workqueue.h>
+#include <linux/pid_namespace.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -386,6 +387,9 @@ struct fuse_conn {
/** The group id for this mount */
kgid_t group_id;

+ /** The pid namespace for this mount */
+ struct pid_namespace *pid_ns;
+
/** The fuse mount flags for this mount */
unsigned flags;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 03246cd9d47a..e137969815a3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -20,6 +20,7 @@
#include <linux/random.h>
#include <linux/sched.h>
#include <linux/exportfs.h>
+#include <linux/pid_namespace.h>

MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->initialized = 0;
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
+ fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

@@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
if (atomic_dec_and_test(&fc->count)) {
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
+ put_pid_ns(fc->pid_ns);
+ fc->pid_ns = NULL;
fc->release(fc);
}
}
--
1.9.1

2014-10-22 21:47:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Oct 22, 2014 at 2:24 PM, Seth Forshee
<[email protected]> wrote:
> Update fuse to translate uids and gids to/from the user namspace
> of the process servicing requests on /dev/fuse. Any ids which do
> not map into the namespace will result in errors. inodes will
> also be marked bad when unmappable ids are received from
> userspace.
>
> Due to security concerns the namespace used should be fixed,
> otherwise a user might be able to gain elevated privileges or
> influence processes that the user would otherwise be unable to
> manipulate. Thus the namespace of the mounting process is used
> for all translations, and this namespace is required to be the
> same as the one in use when /dev/fuse was opened.

Looks generally sensible to me. I'm not familiar enough with this
code to really review it, though.

>
> Cc: Eric W. Biederman <[email protected]>
> Cc: Serge H. Hallyn <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/fuse/dev.c | 4 +--
> fs/fuse/dir.c | 81 ++++++++++++++++++++++++++++++++++++--------------------
> fs/fuse/fuse_i.h | 12 ++++++---
> fs/fuse/inode.c | 73 +++++++++++++++++++++++++++++++++++++++-----------
> 4 files changed, 121 insertions(+), 49 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 839caebd34f1..03c8785ed731 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -127,8 +127,8 @@ static void __fuse_put_request(struct fuse_req *req)
>
> static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> - req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> - req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> + req->in.h.uid = from_kuid_munged(fc->user_ns, current_fsuid());
> + req->in.h.gid = from_kgid_munged(fc->user_ns, current_fsgid());
> req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> }
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index de1d84af9f7c..123db1e06c78 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -253,9 +253,12 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
> if (err || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
> goto invalid;
>
> - fuse_change_attributes(inode, &outarg.attr,
> - entry_attr_timeout(&outarg),
> - attr_version);
> + err = fuse_change_attributes(inode, &outarg.attr,
> + entry_attr_timeout(&outarg),
> + attr_version);
> + if (err)
> + goto invalid;
> +
> fuse_change_entry_timeout(entry, &outarg);
> } else if (inode) {
> fi = get_fuse_inode(inode);
> @@ -340,8 +343,9 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, struct qstr *name,
> *inode = fuse_iget(sb, outarg->nodeid, outarg->generation,
> &outarg->attr, entry_attr_timeout(outarg),
> attr_version);
> - err = -ENOMEM;
> - if (!*inode) {
> + if (IS_ERR(*inode)) {
> + err = PTR_ERR(*inode);
> + *inode = NULL;
> fuse_queue_forget(fc, forget, outarg->nodeid, 1);
> goto out;
> }
> @@ -473,11 +477,11 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> ff->open_flags = outopen.open_flags;
> inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
> &outentry.attr, entry_attr_timeout(&outentry), 0);
> - if (!inode) {
> + if (IS_ERR(inode)) {
> flags &= ~(O_CREAT | O_EXCL | O_TRUNC);
> fuse_sync_release(ff, flags);
> fuse_queue_forget(fc, forget, outentry.nodeid, 1);
> - err = -ENOMEM;
> + err = PTR_ERR(inode);
> goto out_err;
> }
> kfree(forget);
> @@ -588,9 +592,9 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_req *req,
>
> inode = fuse_iget(dir->i_sb, outarg.nodeid, outarg.generation,
> &outarg.attr, entry_attr_timeout(&outarg), 0);
> - if (!inode) {
> + if (IS_ERR(inode)) {
> fuse_queue_forget(fc, forget, outarg.nodeid, 1);
> - return -ENOMEM;
> + return PTR_ERR(inode);
> }
> kfree(forget);
>
> @@ -905,8 +909,8 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
> stat->ino = attr->ino;
> stat->mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> stat->nlink = attr->nlink;
> - stat->uid = make_kuid(&init_user_ns, attr->uid);
> - stat->gid = make_kgid(&init_user_ns, attr->gid);
> + stat->uid = inode->i_uid;
> + stat->gid = inode->i_gid;
> stat->rdev = inode->i_rdev;
> stat->atime.tv_sec = attr->atime;
> stat->atime.tv_nsec = attr->atimensec;
> @@ -969,10 +973,10 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
> make_bad_inode(inode);
> err = -EIO;
> } else {
> - fuse_change_attributes(inode, &outarg.attr,
> - attr_timeout(&outarg),
> - attr_version);
> - if (stat)
> + err = fuse_change_attributes(inode, &outarg.attr,
> + attr_timeout(&outarg),
> + attr_version);
> + if (!err && stat)
> fuse_fillattr(inode, &outarg.attr, stat);
> }
> }
> @@ -1302,9 +1306,11 @@ static int fuse_direntplus_link(struct file *file,
> fi->nlookup++;
> spin_unlock(&fc->lock);
>
> - fuse_change_attributes(inode, &o->attr,
> - entry_attr_timeout(o),
> - attr_version);
> + err = fuse_change_attributes(inode, &o->attr,
> + entry_attr_timeout(o),
> + attr_version);
> + if (err)
> + goto out;
>
> /*
> * The other branch to 'found' comes via fuse_iget()
> @@ -1322,8 +1328,10 @@ static int fuse_direntplus_link(struct file *file,
>
> inode = fuse_iget(dir->i_sb, o->nodeid, o->generation,
> &o->attr, entry_attr_timeout(o), attr_version);
> - if (!inode)
> + if (IS_ERR(inode)) {
> + err = PTR_ERR(inode);
> goto out;
> + }
>
> alias = d_materialise_unique(dentry, inode);
> err = PTR_ERR(alias);
> @@ -1556,17 +1564,25 @@ static bool update_mtime(unsigned ivalid, bool trust_local_mtime)
> return true;
> }
>
> -static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> - bool trust_local_cmtime)
> +static int iattr_to_fattr(struct fuse_conn *fc, struct iattr *iattr,
> + struct fuse_setattr_in *arg, bool trust_local_cmtime)
> {
> unsigned ivalid = iattr->ia_valid;
>
> if (ivalid & ATTR_MODE)
> arg->valid |= FATTR_MODE, arg->mode = iattr->ia_mode;
> - if (ivalid & ATTR_UID)
> - arg->valid |= FATTR_UID, arg->uid = from_kuid(&init_user_ns, iattr->ia_uid);
> - if (ivalid & ATTR_GID)
> - arg->valid |= FATTR_GID, arg->gid = from_kgid(&init_user_ns, iattr->ia_gid);
> + if (ivalid & ATTR_UID) {
> + arg->uid = from_kuid(fc->user_ns, iattr->ia_uid);
> + if (arg->uid == (uid_t)-1)
> + return -EINVAL;
> + arg->valid |= FATTR_UID;
> + }
> + if (ivalid & ATTR_GID) {
> + arg->gid = from_kgid(fc->user_ns, iattr->ia_gid);
> + if (arg->gid == (gid_t)-1)
> + return -EINVAL;
> + arg->valid |= FATTR_GID;
> + }
> if (ivalid & ATTR_SIZE)
> arg->valid |= FATTR_SIZE, arg->size = iattr->ia_size;
> if (ivalid & ATTR_ATIME) {
> @@ -1588,6 +1604,8 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
> arg->ctime = iattr->ia_ctime.tv_sec;
> arg->ctimensec = iattr->ia_ctime.tv_nsec;
> }
> +
> + return 0;
> }
>
> /*
> @@ -1741,7 +1759,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>
> memset(&inarg, 0, sizeof(inarg));
> memset(&outarg, 0, sizeof(outarg));
> - iattr_to_fattr(attr, &inarg, trust_local_cmtime);
> + err = iattr_to_fattr(fc, attr, &inarg, trust_local_cmtime);
> + if (err)
> + goto error;
> if (file) {
> struct fuse_file *ff = file->private_data;
> inarg.valid |= FATTR_FH;
> @@ -1778,8 +1798,13 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
> /* FIXME: clear I_DIRTY_SYNC? */
> }
>
> - fuse_change_attributes_common(inode, &outarg.attr,
> - attr_timeout(&outarg));
> + err = fuse_change_attributes_common(inode, &outarg.attr,
> + attr_timeout(&outarg));
> + if (err) {
> + spin_unlock(&fc->lock);
> + goto error;
> + }
> +
> oldsize = inode->i_size;
> /* see the comment in fuse_change_attributes() */
> if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index a3ded071e2c6..81187ba04e4a 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -22,6 +22,7 @@
> #include <linux/rbtree.h>
> #include <linux/poll.h>
> #include <linux/workqueue.h>
> +#include <linux/user_namespace.h>
> #include <linux/pid_namespace.h>
>
> /** Max number of pages that can be used in a single read request */
> @@ -387,6 +388,9 @@ struct fuse_conn {
> /** The group id for this mount */
> kgid_t group_id;
>
> + /** The user namespace for this mount */
> + struct user_namespace *user_ns;
> +
> /** The pid namespace for this mount */
> struct pid_namespace *pid_ns;
>
> @@ -713,11 +717,11 @@ void fuse_init_symlink(struct inode *inode);
> /**
> * Change attributes of an inode
> */
> -void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> - u64 attr_valid, u64 attr_version);
> +int fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> + u64 attr_valid, u64 attr_version);
>
> -void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> - u64 attr_valid);
> +int fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> + u64 attr_valid);
>
> /**
> * Initialize the client device
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e137969815a3..b88b5a780228 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -155,11 +155,22 @@ static ino_t fuse_squash_ino(u64 ino64)
> return ino;
> }
>
> -void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> - u64 attr_valid)
> +int fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> + u64 attr_valid)
> {
> struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> + kuid_t uid;
> + kgid_t gid;
> +
> + uid = make_kuid(fc->user_ns, attr->uid);
> + gid = make_kgid(fc->user_ns, attr->gid);
> + if (!uid_valid(uid) || !gid_valid(gid)) {
> + make_bad_inode(inode);
> + return -EIO;
> + }
> + inode->i_uid = uid;
> + inode->i_gid = gid;
>
> fi->attr_version = ++fc->attr_version;
> fi->i_time = attr_valid;
> @@ -167,8 +178,6 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> inode->i_ino = fuse_squash_ino(attr->ino);
> inode->i_mode = (inode->i_mode & S_IFMT) | (attr->mode & 07777);
> set_nlink(inode, attr->nlink);
> - inode->i_uid = make_kuid(&init_user_ns, attr->uid);
> - inode->i_gid = make_kgid(&init_user_ns, attr->gid);
> inode->i_blocks = attr->blocks;
> inode->i_atime.tv_sec = attr->atime;
> inode->i_atime.tv_nsec = attr->atimensec;
> @@ -195,26 +204,32 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
> inode->i_mode &= ~S_ISVTX;
>
> fi->orig_ino = attr->ino;
> + return 0;
> }
>
> -void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> - u64 attr_valid, u64 attr_version)
> +int fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> + u64 attr_valid, u64 attr_version)
> {
> struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> bool is_wb = fc->writeback_cache;
> loff_t oldsize;
> struct timespec old_mtime;
> + int err;
>
> spin_lock(&fc->lock);
> if ((attr_version != 0 && fi->attr_version > attr_version) ||
> test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
> spin_unlock(&fc->lock);
> - return;
> + return 0;
> }
>
> old_mtime = inode->i_mtime;
> - fuse_change_attributes_common(inode, attr, attr_valid);
> + err = fuse_change_attributes_common(inode, attr, attr_valid);
> + if (err) {
> + spin_unlock(&fc->lock);
> + return err;
> + }
>
> oldsize = inode->i_size;
> /*
> @@ -249,6 +264,8 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> if (inval)
> invalidate_inode_pages2(inode->i_mapping);
> }
> +
> + return 0;
> }
>
> static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> @@ -302,7 +319,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
> retry:
> inode = iget5_locked(sb, nodeid, fuse_inode_eq, fuse_inode_set, &nodeid);
> if (!inode)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
>
> if ((inode->i_state & I_NEW)) {
> inode->i_flags |= S_NOATIME;
> @@ -319,11 +336,23 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
> goto retry;
> }
>
> + /*
> + * Must do this before incrementing nlookup, as the caller will
> + * send a forget for the node if this function fails.
> + */
> + if (fuse_change_attributes(inode, attr, attr_valid, attr_version)) {
> + /*
> + * inode_make_bad() already called by
> + * fuse_change_attributes()
> + */
> + iput(inode);
> + return ERR_PTR(-EIO);
> + }
> +
> fi = get_fuse_inode(inode);
> spin_lock(&fc->lock);
> fi->nlookup++;
> spin_unlock(&fc->lock);
> - fuse_change_attributes(inode, attr, attr_valid, attr_version);
>
> return inode;
> }
> @@ -496,6 +525,8 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
> memset(d, 0, sizeof(struct fuse_mount_data));
> d->max_read = ~0;
> d->blksize = FUSE_DEFAULT_BLKSIZE;
> + d->user_id = make_kuid(current_user_ns(), 0);
> + d->group_id = make_kgid(current_user_ns(), 0);
>
> while ((p = strsep(&opt, ",")) != NULL) {
> int token;
> @@ -578,8 +609,10 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
> struct super_block *sb = root->d_sb;
> struct fuse_conn *fc = get_fuse_conn_super(sb);
>
> - seq_printf(m, ",user_id=%u", from_kuid_munged(&init_user_ns, fc->user_id));
> - seq_printf(m, ",group_id=%u", from_kgid_munged(&init_user_ns, fc->group_id));
> + seq_printf(m, ",user_id=%u",
> + from_kuid_munged(fc->user_ns, fc->user_id));
> + seq_printf(m, ",group_id=%u",
> + from_kgid_munged(fc->user_ns, fc->group_id));
> if (fc->flags & FUSE_DEFAULT_PERMISSIONS)
> seq_puts(m, ",default_permissions");
> if (fc->flags & FUSE_ALLOW_OTHER)
> @@ -618,6 +651,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> fc->attr_version = 1;
> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + fc->user_ns = get_user_ns(current_user_ns());
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -626,6 +660,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (atomic_dec_and_test(&fc->count)) {
> if (fc->destroy_req)
> fuse_request_free(fc->destroy_req);
> + put_user_ns(fc->user_ns);
> + fc->user_ns = NULL;
> put_pid_ns(fc->pid_ns);
> fc->pid_ns = NULL;
> fc->release(fc);
> @@ -643,12 +679,15 @@ EXPORT_SYMBOL_GPL(fuse_conn_get);
> static struct inode *fuse_get_root_inode(struct super_block *sb, unsigned mode)
> {
> struct fuse_attr attr;
> + struct inode *inode;
> +
> memset(&attr, 0, sizeof(attr));
>
> attr.mode = mode;
> attr.ino = FUSE_ROOT_ID;
> attr.nlink = 1;
> - return fuse_iget(sb, 1, 0, &attr, 0, 0);
> + inode = fuse_iget(sb, 1, 0, &attr, 0, 0);
> + return IS_ERR(inode) ? NULL : inode;
> }
>
> struct fuse_inode_handle {
> @@ -1043,8 +1082,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
> if (!file)
> goto err;
>
> - if ((file->f_op != &fuse_dev_operations) ||
> - (file->f_cred->user_ns != &init_user_ns))
> + /*
> + * Require mount to happen from the same user namespace which
> + * opened /dev/fuse to prevent potential attacks.
> + */
> + if (file->f_op != &fuse_dev_operations ||
> + file->f_cred->user_ns != current_user_ns())
> goto err_fput;
>
> fc = kmalloc(sizeof(*fc), GFP_KERNEL);
> --
> 1.9.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-22 21:49:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] fuse: Restrict allow_other to the superblock's namespace or a descendant

On Wed, Oct 22, 2014 at 2:24 PM, Seth Forshee
<[email protected]> wrote:
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_other should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Therefore access with allow_other should be
> restricted to users in the userns as the superblock or a
> descendant of that namespace.

Looks good to me.

Reviewed-by: Andy Lutomirski <[email protected]>

>
> Cc: Eric W. Biederman <[email protected]>
> Cc: Serge H. Hallyn <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/fuse/dir.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 123db1e06c78..b23ec5c1ff18 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1091,8 +1091,14 @@ int fuse_allow_current_process(struct fuse_conn *fc)
> {
> const struct cred *cred;
>
> - if (fc->flags & FUSE_ALLOW_OTHER)
> - return 1;
> + if (fc->flags & FUSE_ALLOW_OTHER) {
> + struct user_namespace *ns;
> + for (ns = current_user_ns(); ns; ns = ns->parent) {
> + if (ns == fc->user_ns)
> + return 1;
> + }
> + return 0;
> + }
>
> cred = current_cred();
> if (uid_eq(cred->euid, fc->user_id) &&
> --
> 1.9.1
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-22 21:52:20

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] fuse: Allow user namespace mounts

On Wed, Oct 22, 2014 at 2:24 PM, Seth Forshee
<[email protected]> wrote:
> Cc: Eric W. Biederman <[email protected]>
> Cc: Serge H. Hallyn <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/fuse/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b88b5a780228..7d0e73e36e7b 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1201,7 +1201,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
> static struct file_system_type fuse_fs_type = {
> .owner = THIS_MODULE,
> .name = "fuse",
> - .fs_flags = FS_HAS_SUBTYPE,
> + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> .mount = fuse_mount,
> .kill_sb = fuse_kill_sb_anon,
> };
> @@ -1233,7 +1233,7 @@ static struct file_system_type fuseblk_fs_type = {
> .name = "fuseblk",
> .mount = fuse_mount_blk,
> .kill_sb = fuse_kill_sb_blk,
> - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> };
> MODULE_ALIAS_FS("fuseblk");
>
> --
> 1.9.1
>

This is mostly a sign of my ignorance, but how does this actually end
up working? I assume that the mounter opens /dev/fuse and then passes
the fd to the mount call. Which userns is captured? The opener of
/dev/fuse or the mounter of the fs?

--Andy

2014-10-23 00:22:26

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] fuse: Allow user namespace mounts

On Wed, Oct 22, 2014 at 02:51:56PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 22, 2014 at 2:24 PM, Seth Forshee
> <[email protected]> wrote:
> > Cc: Eric W. Biederman <[email protected]>
> > Cc: Serge H. Hallyn <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Signed-off-by: Seth Forshee <[email protected]>
> > ---
> > fs/fuse/inode.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index b88b5a780228..7d0e73e36e7b 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1201,7 +1201,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
> > static struct file_system_type fuse_fs_type = {
> > .owner = THIS_MODULE,
> > .name = "fuse",
> > - .fs_flags = FS_HAS_SUBTYPE,
> > + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> > .mount = fuse_mount,
> > .kill_sb = fuse_kill_sb_anon,
> > };
> > @@ -1233,7 +1233,7 @@ static struct file_system_type fuseblk_fs_type = {
> > .name = "fuseblk",
> > .mount = fuse_mount_blk,
> > .kill_sb = fuse_kill_sb_blk,
> > - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
> > + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
> > };
> > MODULE_ALIAS_FS("fuseblk");
> >
> > --
> > 1.9.1
> >
>
> This is mostly a sign of my ignorance, but how does this actually end
> up working? I assume that the mounter opens /dev/fuse and then passes
> the fd to the mount call. Which userns is captured? The opener of
> /dev/fuse or the mounter of the fs?

You're correct that the mounter passes the fd to /dev/fuse to the mount
call. The namespace of the mounter is used, but there's also a check to
make sure that's the same as that of the opener of /dev/fuse, otherwise
the mount fails.

Seth

2014-10-23 02:20:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] fuse: Allow user namespace mounts

On Wed, Oct 22, 2014 at 5:22 PM, Seth Forshee
<[email protected]> wrote:
> On Wed, Oct 22, 2014 at 02:51:56PM -0700, Andy Lutomirski wrote:
>> On Wed, Oct 22, 2014 at 2:24 PM, Seth Forshee
>> <[email protected]> wrote:
>> > Cc: Eric W. Biederman <[email protected]>
>> > Cc: Serge H. Hallyn <[email protected]>
>> > Cc: Andy Lutomirski <[email protected]>
>> > Signed-off-by: Seth Forshee <[email protected]>
>> > ---
>> > fs/fuse/inode.c | 4 ++--
>> > 1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> > index b88b5a780228..7d0e73e36e7b 100644
>> > --- a/fs/fuse/inode.c
>> > +++ b/fs/fuse/inode.c
>> > @@ -1201,7 +1201,7 @@ static void fuse_kill_sb_anon(struct super_block *sb)
>> > static struct file_system_type fuse_fs_type = {
>> > .owner = THIS_MODULE,
>> > .name = "fuse",
>> > - .fs_flags = FS_HAS_SUBTYPE,
>> > + .fs_flags = FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>> > .mount = fuse_mount,
>> > .kill_sb = fuse_kill_sb_anon,
>> > };
>> > @@ -1233,7 +1233,7 @@ static struct file_system_type fuseblk_fs_type = {
>> > .name = "fuseblk",
>> > .mount = fuse_mount_blk,
>> > .kill_sb = fuse_kill_sb_blk,
>> > - .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE,
>> > + .fs_flags = FS_REQUIRES_DEV | FS_HAS_SUBTYPE | FS_USERNS_MOUNT,
>> > };
>> > MODULE_ALIAS_FS("fuseblk");
>> >
>> > --
>> > 1.9.1
>> >
>>
>> This is mostly a sign of my ignorance, but how does this actually end
>> up working? I assume that the mounter opens /dev/fuse and then passes
>> the fd to the mount call. Which userns is captured? The opener of
>> /dev/fuse or the mounter of the fs?
>
> You're correct that the mounter passes the fd to /dev/fuse to the mount
> call. The namespace of the mounter is used, but there's also a check to
> make sure that's the same as that of the opener of /dev/fuse, otherwise
> the mount fails.

As long as my nodev fix is either applied first or rejected,

Acked-by: Andy Lutomirski <[email protected]>

Although the LSM situation is still a mess. Serge, any thoughts?

--Andy

>
> Seth



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-03 17:15:25

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] fuse: Add support for mounts from pid/user namespaces

On Wed, Oct 22, 2014 at 04:24:16PM -0500, Seth Forshee wrote:
> Here's another update to the patches to enable userns mounts in fuse.
> The changes this time around center on xattrs and allow_other. I'm
> considering the following patch to be a prerequisite for this series:
>
> http://lkml.kernel.org/g/252a4d87d99fc2b5fe4411c838f65b312c4e13cd.1413330857.git.luto@amacapital.net
>
> With this patch I'm dropping the xattr restrictions from the v4 patches.
> This leaves the situation with xattrs unchanged in init_user_ns. A fuse
> userns mount could still supply xattrs which the user would normally be
> unable to set, but since the mount will be inaccessible outside of that
> userns or its descendants those xattrs will never be seen by the host.
>
> I'm also in agreement with Andy that some decision regarding this patch
> should be made before merging the fuse userns support as well:
>
> http://lkml.kernel.org/g/2686c32f00b14148379e8cfee9c028c794d4aa1a.1407974494.git.luto@amacapital.net
>
> Changes since v4:
> - Drop patch to restrict xattrs
> - Update allow_other to restrict based on userns of current rather than
> uid/gid

Miklos, Eric - Any feedback from either of you on these patches?

Thanks,
Seth

2014-11-03 17:18:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] fuse: Add support for mounts from pid/user namespaces

On Mon, Nov 3, 2014 at 9:15 AM, Seth Forshee <[email protected]> wrote:
> On Wed, Oct 22, 2014 at 04:24:16PM -0500, Seth Forshee wrote:
>> Here's another update to the patches to enable userns mounts in fuse.
>> The changes this time around center on xattrs and allow_other. I'm
>> considering the following patch to be a prerequisite for this series:
>>
>> http://lkml.kernel.org/g/252a4d87d99fc2b5fe4411c838f65b312c4e13cd.1413330857.git.luto@amacapital.net
>>
>> With this patch I'm dropping the xattr restrictions from the v4 patches.
>> This leaves the situation with xattrs unchanged in init_user_ns. A fuse
>> userns mount could still supply xattrs which the user would normally be
>> unable to set, but since the mount will be inaccessible outside of that
>> userns or its descendants those xattrs will never be seen by the host.
>>
>> I'm also in agreement with Andy that some decision regarding this patch
>> should be made before merging the fuse userns support as well:
>>
>> http://lkml.kernel.org/g/2686c32f00b14148379e8cfee9c028c794d4aa1a.1407974494.git.luto@amacapital.net
>>
>> Changes since v4:
>> - Drop patch to restrict xattrs
>> - Update allow_other to restrict based on userns of current rather than
>> uid/gid
>
> Miklos, Eric - Any feedback from either of you on these patches?
>

[off-list to avoid further clutter]

Eric's moving, I think, and will be mostly unavailable for a week or two.

--Andy

> Thanks,
> Seth



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-11 13:27:38

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

On Wed, Oct 22, 2014 at 04:24:17PM -0500, Seth Forshee wrote:
> If the userspace process servicing fuse requests is running in
> a pid namespace then pids passed via the fuse fd need to be
> translated relative to that namespace. Capture the pid namespace
> in use when the filesystem is mounted and use this for pid
> translation.

But WHY?

I know it's been discussed, but the reasons should be spelled out here:

- capturing the namespace makes the code less complex (the strongest argument
that I've heard)

- there's a security concern with translating to current namespace (that I
still don't fully understand)

- changing the pid namespace after mounting is unlikely to be of any use

- if it turns out to be useful, we can still fix this later (can we?)

What happens when the server does indeed change pid namespace after mounting?
Will just receive bogus pid values? Shouldn't it receive an error instead?

More comments inline.

> File locking changes based on previous work done by Eric
> Biederman.
>
> Cc: Eric W. Biederman <[email protected]>
> Cc: Serge H. Hallyn <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/fuse/dev.c | 9 +++++----
> fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
> fs/fuse/fuse_i.h | 4 ++++
> fs/fuse/inode.c | 4 ++++
> 4 files changed, 40 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index ca887314aba9..839caebd34f1 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -20,6 +20,7 @@
> #include <linux/swap.h>
> #include <linux/splice.h>
> #include <linux/aio.h>
> +#include <linux/sched.h>
>
> MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> MODULE_ALIAS("devname:fuse");
> @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
> atomic_dec(&req->count);
> }
>
> -static void fuse_req_init_context(struct fuse_req *req)
> +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> {
> req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> - req->in.h.pid = current->pid;
> + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> }
>
> static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
> @@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> goto out;
> }
>
> - fuse_req_init_context(req);
> + fuse_req_init_context(fc, req);
> req->waiting = 1;
> req->background = for_background;
> return req;
> @@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
> if (!req)
> req = get_reserved_req(fc, file);
>
> - fuse_req_init_context(req);
> + fuse_req_init_context(fc, req);
> req->waiting = 1;
> req->background = 0;
> return req;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index caa8d95b24e8..cb0e40ecc362 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
> return generic_file_mmap(file, vma);
> }
>
> -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> +static int convert_fuse_file_lock(struct fuse_conn *fc,
> + const struct fuse_file_lock *ffl,
> struct file_lock *fl)
> {
> switch (ffl->type) {
> @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>
> fl->fl_start = ffl->start;
> fl->fl_end = ffl->end;
> - fl->fl_pid = ffl->pid;
> + rcu_read_lock();
> + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
> + rcu_read_unlock();
> + if (ffl->pid != 0 && fl->fl_pid == 0)
> + return -EIO;

This needs some comments: is this trying to translate the pid backwards? Why is
it not checking the return of find_pid_ns() then? The man page documents l_pid
value of -1 but not of 0, so why are we checking for "ffl->pid != 0"? Or is the
man page incomplete and in practice we get l_pid == 0 values?


> break;
>
> default:
> @@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> return 0;
> }
>
> -static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> - const struct file_lock *fl, int opcode, pid_t pid,
> - int flock)
> +static int fuse_lk_fill(struct fuse_req *req, struct file *file,
> + const struct file_lock *fl, int opcode,
> + struct pid *pid, int flock)
> {
> struct inode *inode = file_inode(file);
> struct fuse_conn *fc = get_fuse_conn(inode);
> @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> arg->lk.start = fl->fl_start;
> arg->lk.end = fl->fl_end;
> arg->lk.type = fl->fl_type;
> - arg->lk.pid = pid;
> + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
> + if (pid && arg->lk.pid == 0)
> + return -EOVERFLOW;

Could have done the conversion immediately after getting 'pid' with task_tgid(),
then the changes would have been smaller and more localized.


> if (flock)
> arg->lk_flags |= FUSE_LK_FLOCK;
> req->in.h.opcode = opcode;
> @@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> req->in.numargs = 1;
> req->in.args[0].size = sizeof(*arg);
> req->in.args[0].value = arg;
> +
> + return 0;
> }
>
> static int fuse_getlk(struct file *file, struct file_lock *fl)
> @@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
> if (IS_ERR(req))
> return PTR_ERR(req);
>
> - fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
> + err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
> + if (err)
> + goto out;
> req->out.numargs = 1;
> req->out.args[0].size = sizeof(outarg);
> req->out.args[0].value = &outarg;
> fuse_request_send(fc, req);
> err = req->out.h.error;
> - fuse_put_request(fc, req);
> if (!err)
> - err = convert_fuse_file_lock(&outarg.lk, fl);
> + err = convert_fuse_file_lock(fc, &outarg.lk, fl);
>
> +out:
> + fuse_put_request(fc, req);
> return err;
> }
>
> @@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_req *req;
> int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
> - pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
> + struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
> int err;
>
> if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
> @@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> if (IS_ERR(req))
> return PTR_ERR(req);
>
> - fuse_lk_fill(req, file, fl, opcode, pid, flock);
> + err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
> + if (err)
> + goto out;
> fuse_request_send(fc, req);
> err = req->out.h.error;
> /* locking is restartable */
> if (err == -EINTR)
> err = -ERESTARTSYS;
> +
> +out:
> fuse_put_request(fc, req);
> return err;
> }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e8e47a6ab518..a3ded071e2c6 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -22,6 +22,7 @@
> #include <linux/rbtree.h>
> #include <linux/poll.h>
> #include <linux/workqueue.h>
> +#include <linux/pid_namespace.h>
>
> /** Max number of pages that can be used in a single read request */
> #define FUSE_MAX_PAGES_PER_REQ 32
> @@ -386,6 +387,9 @@ struct fuse_conn {
> /** The group id for this mount */
> kgid_t group_id;
>
> + /** The pid namespace for this mount */
> + struct pid_namespace *pid_ns;
> +
> /** The fuse mount flags for this mount */
> unsigned flags;
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 03246cd9d47a..e137969815a3 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -20,6 +20,7 @@
> #include <linux/random.h>
> #include <linux/sched.h>
> #include <linux/exportfs.h>
> +#include <linux/pid_namespace.h>
>
> MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
> MODULE_DESCRIPTION("Filesystem in Userspace");
> @@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> fc->initialized = 0;
> fc->attr_version = 1;
> get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> + fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (atomic_dec_and_test(&fc->count)) {
> if (fc->destroy_req)
> fuse_request_free(fc->destroy_req);
> + put_pid_ns(fc->pid_ns);
> + fc->pid_ns = NULL;

We don't usually zero out fields before freeing. There are debugging config
options to do that for us.

> fc->release(fc);
> }
> }
> --
> 1.9.1
>

2014-11-11 14:04:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Oct 22, 2014 at 04:24:18PM -0500, Seth Forshee wrote:
> Update fuse to translate uids and gids to/from the user namspace
> of the process servicing requests on /dev/fuse. Any ids which do
> not map into the namespace will result in errors. inodes will
> also be marked bad when unmappable ids are received from
> userspace.

Okay.

> Due to security concerns the namespace used should be fixed,
> otherwise a user might be able to gain elevated privileges or
> influence processes that the user would otherwise be unable to
> manipulate. Thus the namespace of the mounting process is used
> for all translations, and this namespace is required to be the
> same as the one in use when /dev/fuse was opened.

Maybe I'm being dense, but can someone give a concrete example of such an
attack?

That might also help me understand how exactly user/pid namespaces work...

Patch otherwise looks okay.

Thanks,
Miklos

2014-11-11 15:24:35

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

On Tue, Nov 11, 2014 at 02:27:34PM +0100, Miklos Szeredi wrote:
> On Wed, Oct 22, 2014 at 04:24:17PM -0500, Seth Forshee wrote:
> > If the userspace process servicing fuse requests is running in
> > a pid namespace then pids passed via the fuse fd need to be
> > translated relative to that namespace. Capture the pid namespace
> > in use when the filesystem is mounted and use this for pid
> > translation.
>
> But WHY?
>
> I know it's been discussed, but the reasons should be spelled out here:

The reasons relate much more to user namespaces rather than pid
namespaces, but in my opinion it doesn't make much sense to handle them
differently. It might make more sense to reorder the patches to add user
namespace support first, give the justification there, and then for this
one just say "to be consistent with userns support."

> - capturing the namespace makes the code less complex (the strongest argument
> that I've heard)

For pid namespaces it's not so bad, except maybe for file locking (I
can't remember for sure about that). For user namespaces though it does
get much more complex to use the namespace from the /dev/fuse IO path,
because frequently the raw data from userspace is being handed off to
another thread before it is interpreted. That means we either have to
capture the userns and pass it along with the data or restructure the
code to do all of this in the IO paths. Either way having a static
namespace in fuse_conn works out to be much simpler.

> - there's a security concern with translating to current namespace (that I
> still don't fully understand)

There's the one I explained with suid, which in reality should be
mitigated by some of the other protections in fuse. Other than that I
don't know of anything specific Eric had in mind other than the fact
that other sources of userns security bugs have been "time of open
versus time of use" sorts of problems, and he believes that capturing
the userns at IO time instead of mount time is likely to be similarly
vulnerable.

Otherwise I can't add any specifics beyond what Eric has already said in
http://lkml.kernel.org/g/[email protected]. Maybe
Eric or Andy could elaborate further. I'll also look at some of the past
vulnerabilities and see if I get any ideas for specific attacks.

> - changing the pid namespace after mounting is unlikely to be of any use

This sounds like trying to prove a negative. All I can say is that no
one has stepped forward to voice any specific use cases which would
require it.

> - if it turns out to be useful, we can still fix this later (can we?)

I don't see why we couldn't as long as nothing in userspace came to rely
upon the behavior.

> What happens when the server does indeed change pid namespace after mounting?
> Will just receive bogus pid values? Shouldn't it receive an error instead?

Yeah, I suppose it does receive bogus pids and userid values. About all
we could do to prevent this is make the /dev/fuse read/write paths
return an error if the current namespace isn't the same as the one at
mount time. But then requests could get stuck in the queue forever, so
maybe we should also fail all requests in the queue when this happens.
Unless you have a better idea?

>
> More comments inline.
>
> > File locking changes based on previous work done by Eric
> > Biederman.
> >
> > Cc: Eric W. Biederman <[email protected]>
> > Cc: Serge H. Hallyn <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Signed-off-by: Seth Forshee <[email protected]>
> > ---
> > fs/fuse/dev.c | 9 +++++----
> > fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
> > fs/fuse/fuse_i.h | 4 ++++
> > fs/fuse/inode.c | 4 ++++
> > 4 files changed, 40 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index ca887314aba9..839caebd34f1 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -20,6 +20,7 @@
> > #include <linux/swap.h>
> > #include <linux/splice.h>
> > #include <linux/aio.h>
> > +#include <linux/sched.h>
> >
> > MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> > MODULE_ALIAS("devname:fuse");
> > @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
> > atomic_dec(&req->count);
> > }
> >
> > -static void fuse_req_init_context(struct fuse_req *req)
> > +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> > {
> > req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> > req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> > - req->in.h.pid = current->pid;
> > + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> > }
> >
> > static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
> > @@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> > goto out;
> > }
> >
> > - fuse_req_init_context(req);
> > + fuse_req_init_context(fc, req);
> > req->waiting = 1;
> > req->background = for_background;
> > return req;
> > @@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
> > if (!req)
> > req = get_reserved_req(fc, file);
> >
> > - fuse_req_init_context(req);
> > + fuse_req_init_context(fc, req);
> > req->waiting = 1;
> > req->background = 0;
> > return req;
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index caa8d95b24e8..cb0e40ecc362 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
> > return generic_file_mmap(file, vma);
> > }
> >
> > -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> > +static int convert_fuse_file_lock(struct fuse_conn *fc,
> > + const struct fuse_file_lock *ffl,
> > struct file_lock *fl)
> > {
> > switch (ffl->type) {
> > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> >
> > fl->fl_start = ffl->start;
> > fl->fl_end = ffl->end;
> > - fl->fl_pid = ffl->pid;
> > + rcu_read_lock();
> > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
> > + rcu_read_unlock();
> > + if (ffl->pid != 0 && fl->fl_pid == 0)
> > + return -EIO;
>
> This needs some comments: is this trying to translate the pid backwards? Why is
> it not checking the return of find_pid_ns() then? The man page documents l_pid
> value of -1 but not of 0, so why are we checking for "ffl->pid != 0"? Or is the
> man page incomplete and in practice we get l_pid == 0 values?

I'll add comments. It's converting the pid in the fuse_file_lock into
the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
the pid can't be translated into the namespace, thus we return the
error.

pid_vnr's return values don't necessarily conform to the expectations of
the fcntl syscall in all cases, and as far as I can tell it should never
return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
return that value then doesn't it makes sense for fuse to return an
error in cases where this would happen?

Fwiw, this is also an example of a case where it's simpler to have a
static namespace. If we were to use the current ns from the /dev/fuse IO
path then we either have to process the request there or grab a
reference to the ns there and pass it alongside the request (and we
have to do it for all requests unless the IO path is going to look at
the request type and decide whether or not it requires a reference to
the namespace).

>
>
> > break;
> >
> > default:
> > @@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> > return 0;
> > }
> >
> > -static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > - const struct file_lock *fl, int opcode, pid_t pid,
> > - int flock)
> > +static int fuse_lk_fill(struct fuse_req *req, struct file *file,
> > + const struct file_lock *fl, int opcode,
> > + struct pid *pid, int flock)
> > {
> > struct inode *inode = file_inode(file);
> > struct fuse_conn *fc = get_fuse_conn(inode);
> > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > arg->lk.start = fl->fl_start;
> > arg->lk.end = fl->fl_end;
> > arg->lk.type = fl->fl_type;
> > - arg->lk.pid = pid;
> > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
> > + if (pid && arg->lk.pid == 0)
> > + return -EOVERFLOW;
>
> Could have done the conversion immediately after getting 'pid' with task_tgid(),
> then the changes would have been smaller and more localized.

The changes would be very marginally smaller since currently only one
caller of fuse_lk_fill which passes a non-zero pid. If additional
callers were ever added with non-zero pids then there would be
duplication of this code. But I'll do it whichever way you prefer, just
let me know.

> > if (flock)
> > arg->lk_flags |= FUSE_LK_FLOCK;
> > req->in.h.opcode = opcode;
> > @@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > req->in.numargs = 1;
> > req->in.args[0].size = sizeof(*arg);
> > req->in.args[0].value = arg;
> > +
> > + return 0;
> > }
> >
> > static int fuse_getlk(struct file *file, struct file_lock *fl)
> > @@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
> > if (IS_ERR(req))
> > return PTR_ERR(req);
> >
> > - fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
> > + err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
> > + if (err)
> > + goto out;
> > req->out.numargs = 1;
> > req->out.args[0].size = sizeof(outarg);
> > req->out.args[0].value = &outarg;
> > fuse_request_send(fc, req);
> > err = req->out.h.error;
> > - fuse_put_request(fc, req);
> > if (!err)
> > - err = convert_fuse_file_lock(&outarg.lk, fl);
> > + err = convert_fuse_file_lock(fc, &outarg.lk, fl);
> >
> > +out:
> > + fuse_put_request(fc, req);
> > return err;
> > }
> >
> > @@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> > struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_req *req;
> > int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
> > - pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
> > + struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
> > int err;
> >
> > if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
> > @@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> > if (IS_ERR(req))
> > return PTR_ERR(req);
> >
> > - fuse_lk_fill(req, file, fl, opcode, pid, flock);
> > + err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
> > + if (err)
> > + goto out;
> > fuse_request_send(fc, req);
> > err = req->out.h.error;
> > /* locking is restartable */
> > if (err == -EINTR)
> > err = -ERESTARTSYS;
> > +
> > +out:
> > fuse_put_request(fc, req);
> > return err;
> > }
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index e8e47a6ab518..a3ded071e2c6 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -22,6 +22,7 @@
> > #include <linux/rbtree.h>
> > #include <linux/poll.h>
> > #include <linux/workqueue.h>
> > +#include <linux/pid_namespace.h>
> >
> > /** Max number of pages that can be used in a single read request */
> > #define FUSE_MAX_PAGES_PER_REQ 32
> > @@ -386,6 +387,9 @@ struct fuse_conn {
> > /** The group id for this mount */
> > kgid_t group_id;
> >
> > + /** The pid namespace for this mount */
> > + struct pid_namespace *pid_ns;
> > +
> > /** The fuse mount flags for this mount */
> > unsigned flags;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 03246cd9d47a..e137969815a3 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -20,6 +20,7 @@
> > #include <linux/random.h>
> > #include <linux/sched.h>
> > #include <linux/exportfs.h>
> > +#include <linux/pid_namespace.h>
> >
> > MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
> > MODULE_DESCRIPTION("Filesystem in Userspace");
> > @@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> > fc->initialized = 0;
> > fc->attr_version = 1;
> > get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> > + fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> > }
> > EXPORT_SYMBOL_GPL(fuse_conn_init);
> >
> > @@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> > if (atomic_dec_and_test(&fc->count)) {
> > if (fc->destroy_req)
> > fuse_request_free(fc->destroy_req);
> > + put_pid_ns(fc->pid_ns);
> > + fc->pid_ns = NULL;
>
> We don't usually zero out fields before freeing. There are debugging config
> options to do that for us.

Okay, I'll remove that line.

Thanks,
Seth

>
> > fc->release(fc);
> > }
> > }
> > --
> > 1.9.1
> >

2014-11-11 15:27:34

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Tue, Nov 11, 2014 at 03:04:54PM +0100, Miklos Szeredi wrote:
> On Wed, Oct 22, 2014 at 04:24:18PM -0500, Seth Forshee wrote:
> > Update fuse to translate uids and gids to/from the user namspace
> > of the process servicing requests on /dev/fuse. Any ids which do
> > not map into the namespace will result in errors. inodes will
> > also be marked bad when unmappable ids are received from
> > userspace.
>
> Okay.
>
> > Due to security concerns the namespace used should be fixed,
> > otherwise a user might be able to gain elevated privileges or
> > influence processes that the user would otherwise be unable to
> > manipulate. Thus the namespace of the mounting process is used
> > for all translations, and this namespace is required to be the
> > same as the one in use when /dev/fuse was opened.
>
> Maybe I'm being dense, but can someone give a concrete example of such an
> attack?

I'm repeating myself, but the only specific example I'm aware of is the
suid example with is prevented by other mechanisms (both in fuse and in
Andy's proposed patch). Perhaps Eric or Andy could give examples of
other potential problems, and I'll also see what I can come up with.

> That might also help me understand how exactly user/pid namespaces work...
>
> Patch otherwise looks okay.

Great, thanks for the review.

Seth

>
> Thanks,
> Miklos

2014-11-11 15:27:41

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] fuse: Restrict allow_other to the superblock's namespace or a descendant

On Wed, Oct 22, 2014 at 04:24:19PM -0500, Seth Forshee wrote:
> Unprivileged users are normally restricted from mounting with the
> allow_other option by system policy, but this could be bypassed
> for a mount done with user namespace root permissions. In such
> cases allow_oth er should not allow users outside the userns
> to access the mount as doing so would give the unprivileged user
> the ability to manipulate processes it would otherwise be unable
> to manipulate. Therefore access with allow_other should be
> restricted to users in the userns as the superblock or a
> descendant of that namespace.

Fine.

But aren't this kind of thing supposed to be prevented anyway by having private
mount namespace coupled with the pid-user-whatever namespace?

It seems like being a bit too careful (not to say that that's a bad thing).

Thanks,
Miklos

2014-11-11 15:37:38

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] fuse: Restrict allow_other to the superblock's namespace or a descendant

On Tue, Nov 11, 2014 at 04:27:37PM +0100, Miklos Szeredi wrote:
> On Wed, Oct 22, 2014 at 04:24:19PM -0500, Seth Forshee wrote:
> > Unprivileged users are normally restricted from mounting with the
> > allow_other option by system policy, but this could be bypassed
> > for a mount done with user namespace root permissions. In such
> > cases allow_oth er should not allow users outside the userns
> > to access the mount as doing so would give the unprivileged user
> > the ability to manipulate processes it would otherwise be unable
> > to manipulate. Therefore access with allow_other should be
> > restricted to users in the userns as the superblock or a
> > descendant of that namespace.
>
> Fine.
>
> But aren't this kind of thing supposed to be prevented anyway by having private
> mount namespace coupled with the pid-user-whatever namespace?
>
> It seems like being a bit too careful (not to say that that's a bad thing).

A userns mount should be in a "private" mount namespace; specifically
the user performing the mount must have CAP_SYS_ADMIN in
mnt_ns->user_ns. The mount may still be accessible via /proc/pid/root
though, and doing this ensures that in any case the user can never use
the mount to manipulate processes that it can't already manipulate.

Thanks,
Seth

2014-11-11 15:38:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

Miklos Szeredi <[email protected]> writes:

> On Wed, Oct 22, 2014 at 04:24:18PM -0500, Seth Forshee wrote:
>> Update fuse to translate uids and gids to/from the user namspace
>> of the process servicing requests on /dev/fuse. Any ids which do
>> not map into the namespace will result in errors. inodes will
>> also be marked bad when unmappable ids are received from
>> userspace.
>
> Okay.
>
>> Due to security concerns the namespace used should be fixed,
>> otherwise a user might be able to gain elevated privileges or
>> influence processes that the user would otherwise be unable to
>> manipulate. Thus the namespace of the mounting process is used
>> for all translations, and this namespace is required to be the
>> same as the one in use when /dev/fuse was opened.
>
> Maybe I'm being dense, but can someone give a concrete example of such an
> attack?

There are two variants of things at play here.

There is the classic if you don't freeze your context at open time when
you pass that file descriptor to another process unexpected things can
happen.

An essentially harmless but extremely confusing example is what happens
to a partial read when it stops halfway through a uid value and the next
read on the same file descriptor is from a process in a different user
namespace. Which uid value should be returned to userspace.

If the kernel is just reporting values all that happens is confusion
with a correct implementation. Confusion however is a breeding ground
for little unexpected behaviors that trigger bugs in the code.

If the kernel is reading values things can get ugly. As root in a user
a namespace I am allowed to call setuid and setgid to any uid or gid
that is mapped into my user namespace but no others.

Now if I am in a nefarious mood I can create a unprivileged user
namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
descriptor to /dev/fuse to a processes that is in the default user
namespace (and thus can use any uid/gid). With that file desctipor
report that there is a setuid 0 exectuable on that file system.

The version of fuse in these patches has two different defenses against
that kind of nefariousness. The first defense restrictions /dev/fuse
to speak the uids and gids that the opener of /dev/fuse has in their
user namespace. The second defense restricts who can access a fuse file
system by requiring them to be in the user namespace of the opener of
/dev/fuse and the mounter of the fuse filesystem.

I find the first defense stronger and easier to analyze. If non-sense
uids can't get past fuse into the vfs we simply don't have to worry
about them.

I hope that helps.

> That might also help me understand how exactly user/pid namespaces work...

The idea of user/pid namespaces is to translate uid, gids and pids at
the edge of userspace into a kernel internal form that can be use
everywhere. In this case we get into the subtlties of which
translations make sense.

> Patch otherwise looks okay.

Eric

2014-11-11 15:39:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

On Tue, Nov 11, 2014 at 7:24 AM, Seth Forshee
<[email protected]> wrote:
> On Tue, Nov 11, 2014 at 02:27:34PM +0100, Miklos Szeredi wrote:
>> On Wed, Oct 22, 2014 at 04:24:17PM -0500, Seth Forshee wrote:
>> > If the userspace process servicing fuse requests is running in
>> > a pid namespace then pids passed via the fuse fd need to be
>> > translated relative to that namespace. Capture the pid namespace
>> > in use when the filesystem is mounted and use this for pid
>> > translation.
>>
>> But WHY?
>>
>> I know it's been discussed, but the reasons should be spelled out here:
>
> The reasons relate much more to user namespaces rather than pid
> namespaces, but in my opinion it doesn't make much sense to handle them
> differently. It might make more sense to reorder the patches to add user
> namespace support first, give the justification there, and then for this
> one just say "to be consistent with userns support."
>
>> - capturing the namespace makes the code less complex (the strongest argument
>> that I've heard)
>
> For pid namespaces it's not so bad, except maybe for file locking (I
> can't remember for sure about that). For user namespaces though it does
> get much more complex to use the namespace from the /dev/fuse IO path,
> because frequently the raw data from userspace is being handed off to
> another thread before it is interpreted. That means we either have to
> capture the userns and pass it along with the data or restructure the
> code to do all of this in the IO paths. Either way having a static
> namespace in fuse_conn works out to be much simpler.
>
>> - there's a security concern with translating to current namespace (that I
>> still don't fully understand)
>
> There's the one I explained with suid, which in reality should be
> mitigated by some of the other protections in fuse. Other than that I
> don't know of anything specific Eric had in mind other than the fact
> that other sources of userns security bugs have been "time of open
> versus time of use" sorts of problems, and he believes that capturing
> the userns at IO time instead of mount time is likely to be similarly
> vulnerable.
>

The general concern is that read(2) and write(2) really have no
business looking at creds. ioctl is considerably less dangerous. The
risk is that you get an fd and use at as stderr for a setuid process
or something along those lines.

> Otherwise I can't add any specifics beyond what Eric has already said in
> http://lkml.kernel.org/g/[email protected]. Maybe
> Eric or Andy could elaborate further. I'll also look at some of the past
> vulnerabilities and see if I get any ideas for specific attacks.
>
>> - changing the pid namespace after mounting is unlikely to be of any use
>
> This sounds like trying to prove a negative. All I can say is that no
> one has stepped forward to voice any specific use cases which would
> require it.
>
>> - if it turns out to be useful, we can still fix this later (can we?)
>
> I don't see why we couldn't as long as nothing in userspace came to rely
> upon the behavior.

Just to be clear, the current behavior is to reject /dev/fuse io
attempts with namespaces that don't match the opener, right? If so,
that seems like the conservative option, and changing it won't break
things.

>
>> What happens when the server does indeed change pid namespace after mounting?
>> Will just receive bogus pid values? Shouldn't it receive an error instead?
>
> Yeah, I suppose it does receive bogus pids and userid values. About all
> we could do to prevent this is make the /dev/fuse read/write paths
> return an error if the current namespace isn't the same as the one at
> mount time. But then requests could get stuck in the queue forever, so
> maybe we should also fail all requests in the queue when this happens.
> Unless you have a better idea?
>

If this is actually read(2)/write(2), then I think you need to either
error out or use the namespaces at the time of open. Anything else is
asking for trouble when the caller of read(2) or write(2) doesn't
realize that the fd is a /dev/fuse fd.

--Andy

>>
>> More comments inline.
>>
>> > File locking changes based on previous work done by Eric
>> > Biederman.
>> >
>> > Cc: Eric W. Biederman <[email protected]>
>> > Cc: Serge H. Hallyn <[email protected]>
>> > Cc: Andy Lutomirski <[email protected]>
>> > Signed-off-by: Seth Forshee <[email protected]>
>> > ---
>> > fs/fuse/dev.c | 9 +++++----
>> > fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
>> > fs/fuse/fuse_i.h | 4 ++++
>> > fs/fuse/inode.c | 4 ++++
>> > 4 files changed, 40 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> > index ca887314aba9..839caebd34f1 100644
>> > --- a/fs/fuse/dev.c
>> > +++ b/fs/fuse/dev.c
>> > @@ -20,6 +20,7 @@
>> > #include <linux/swap.h>
>> > #include <linux/splice.h>
>> > #include <linux/aio.h>
>> > +#include <linux/sched.h>
>> >
>> > MODULE_ALIAS_MISCDEV(FUSE_MINOR);
>> > MODULE_ALIAS("devname:fuse");
>> > @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
>> > atomic_dec(&req->count);
>> > }
>> >
>> > -static void fuse_req_init_context(struct fuse_req *req)
>> > +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
>> > {
>> > req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
>> > req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
>> > - req->in.h.pid = current->pid;
>> > + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
>> > }
>> >
>> > static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
>> > @@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
>> > goto out;
>> > }
>> >
>> > - fuse_req_init_context(req);
>> > + fuse_req_init_context(fc, req);
>> > req->waiting = 1;
>> > req->background = for_background;
>> > return req;
>> > @@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
>> > if (!req)
>> > req = get_reserved_req(fc, file);
>> >
>> > - fuse_req_init_context(req);
>> > + fuse_req_init_context(fc, req);
>> > req->waiting = 1;
>> > req->background = 0;
>> > return req;
>> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> > index caa8d95b24e8..cb0e40ecc362 100644
>> > --- a/fs/fuse/file.c
>> > +++ b/fs/fuse/file.c
>> > @@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
>> > return generic_file_mmap(file, vma);
>> > }
>> >
>> > -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>> > +static int convert_fuse_file_lock(struct fuse_conn *fc,
>> > + const struct fuse_file_lock *ffl,
>> > struct file_lock *fl)
>> > {
>> > switch (ffl->type) {
>> > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>> >
>> > fl->fl_start = ffl->start;
>> > fl->fl_end = ffl->end;
>> > - fl->fl_pid = ffl->pid;
>> > + rcu_read_lock();
>> > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
>> > + rcu_read_unlock();
>> > + if (ffl->pid != 0 && fl->fl_pid == 0)
>> > + return -EIO;
>>
>> This needs some comments: is this trying to translate the pid backwards? Why is
>> it not checking the return of find_pid_ns() then? The man page documents l_pid
>> value of -1 but not of 0, so why are we checking for "ffl->pid != 0"? Or is the
>> man page incomplete and in practice we get l_pid == 0 values?
>
> I'll add comments. It's converting the pid in the fuse_file_lock into
> the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
> the pid can't be translated into the namespace, thus we return the
> error.
>
> pid_vnr's return values don't necessarily conform to the expectations of
> the fcntl syscall in all cases, and as far as I can tell it should never
> return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
> return that value then doesn't it makes sense for fuse to return an
> error in cases where this would happen?
>
> Fwiw, this is also an example of a case where it's simpler to have a
> static namespace. If we were to use the current ns from the /dev/fuse IO
> path then we either have to process the request there or grab a
> reference to the ns there and pass it alongside the request (and we
> have to do it for all requests unless the IO path is going to look at
> the request type and decide whether or not it requires a reference to
> the namespace).
>
>>
>>
>> > break;
>> >
>> > default:
>> > @@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
>> > return 0;
>> > }
>> >
>> > -static void fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > - const struct file_lock *fl, int opcode, pid_t pid,
>> > - int flock)
>> > +static int fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > + const struct file_lock *fl, int opcode,
>> > + struct pid *pid, int flock)
>> > {
>> > struct inode *inode = file_inode(file);
>> > struct fuse_conn *fc = get_fuse_conn(inode);
>> > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > arg->lk.start = fl->fl_start;
>> > arg->lk.end = fl->fl_end;
>> > arg->lk.type = fl->fl_type;
>> > - arg->lk.pid = pid;
>> > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
>> > + if (pid && arg->lk.pid == 0)
>> > + return -EOVERFLOW;
>>
>> Could have done the conversion immediately after getting 'pid' with task_tgid(),
>> then the changes would have been smaller and more localized.
>
> The changes would be very marginally smaller since currently only one
> caller of fuse_lk_fill which passes a non-zero pid. If additional
> callers were ever added with non-zero pids then there would be
> duplication of this code. But I'll do it whichever way you prefer, just
> let me know.
>
>> > if (flock)
>> > arg->lk_flags |= FUSE_LK_FLOCK;
>> > req->in.h.opcode = opcode;
>> > @@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
>> > req->in.numargs = 1;
>> > req->in.args[0].size = sizeof(*arg);
>> > req->in.args[0].value = arg;
>> > +
>> > + return 0;
>> > }
>> >
>> > static int fuse_getlk(struct file *file, struct file_lock *fl)
>> > @@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
>> > if (IS_ERR(req))
>> > return PTR_ERR(req);
>> >
>> > - fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
>> > + err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
>> > + if (err)
>> > + goto out;
>> > req->out.numargs = 1;
>> > req->out.args[0].size = sizeof(outarg);
>> > req->out.args[0].value = &outarg;
>> > fuse_request_send(fc, req);
>> > err = req->out.h.error;
>> > - fuse_put_request(fc, req);
>> > if (!err)
>> > - err = convert_fuse_file_lock(&outarg.lk, fl);
>> > + err = convert_fuse_file_lock(fc, &outarg.lk, fl);
>> >
>> > +out:
>> > + fuse_put_request(fc, req);
>> > return err;
>> > }
>> >
>> > @@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
>> > struct fuse_conn *fc = get_fuse_conn(inode);
>> > struct fuse_req *req;
>> > int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
>> > - pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
>> > + struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
>> > int err;
>> >
>> > if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
>> > @@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
>> > if (IS_ERR(req))
>> > return PTR_ERR(req);
>> >
>> > - fuse_lk_fill(req, file, fl, opcode, pid, flock);
>> > + err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
>> > + if (err)
>> > + goto out;
>> > fuse_request_send(fc, req);
>> > err = req->out.h.error;
>> > /* locking is restartable */
>> > if (err == -EINTR)
>> > err = -ERESTARTSYS;
>> > +
>> > +out:
>> > fuse_put_request(fc, req);
>> > return err;
>> > }
>> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> > index e8e47a6ab518..a3ded071e2c6 100644
>> > --- a/fs/fuse/fuse_i.h
>> > +++ b/fs/fuse/fuse_i.h
>> > @@ -22,6 +22,7 @@
>> > #include <linux/rbtree.h>
>> > #include <linux/poll.h>
>> > #include <linux/workqueue.h>
>> > +#include <linux/pid_namespace.h>
>> >
>> > /** Max number of pages that can be used in a single read request */
>> > #define FUSE_MAX_PAGES_PER_REQ 32
>> > @@ -386,6 +387,9 @@ struct fuse_conn {
>> > /** The group id for this mount */
>> > kgid_t group_id;
>> >
>> > + /** The pid namespace for this mount */
>> > + struct pid_namespace *pid_ns;
>> > +
>> > /** The fuse mount flags for this mount */
>> > unsigned flags;
>> >
>> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> > index 03246cd9d47a..e137969815a3 100644
>> > --- a/fs/fuse/inode.c
>> > +++ b/fs/fuse/inode.c
>> > @@ -20,6 +20,7 @@
>> > #include <linux/random.h>
>> > #include <linux/sched.h>
>> > #include <linux/exportfs.h>
>> > +#include <linux/pid_namespace.h>
>> >
>> > MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
>> > MODULE_DESCRIPTION("Filesystem in Userspace");
>> > @@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
>> > fc->initialized = 0;
>> > fc->attr_version = 1;
>> > get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
>> > + fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
>> > }
>> > EXPORT_SYMBOL_GPL(fuse_conn_init);
>> >
>> > @@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
>> > if (atomic_dec_and_test(&fc->count)) {
>> > if (fc->destroy_req)
>> > fuse_request_free(fc->destroy_req);
>> > + put_pid_ns(fc->pid_ns);
>> > + fc->pid_ns = NULL;
>>
>> We don't usually zero out fields before freeing. There are debugging config
>> options to do that for us.
>
> Okay, I'll remove that line.
>
> Thanks,
> Seth
>
>>
>> > fc->release(fc);
>> > }
>> > }
>> > --
>> > 1.9.1
>> >



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-11 16:26:15

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

On Tue, Nov 11, 2014 at 07:39:09AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 11, 2014 at 7:24 AM, Seth Forshee
> <[email protected]> wrote:
> > On Tue, Nov 11, 2014 at 02:27:34PM +0100, Miklos Szeredi wrote:
> >> On Wed, Oct 22, 2014 at 04:24:17PM -0500, Seth Forshee wrote:
> >> > If the userspace process servicing fuse requests is running in
> >> > a pid namespace then pids passed via the fuse fd need to be
> >> > translated relative to that namespace. Capture the pid namespace
> >> > in use when the filesystem is mounted and use this for pid
> >> > translation.
> >>
> >> But WHY?
> >>
> >> I know it's been discussed, but the reasons should be spelled out here:
> >
> > The reasons relate much more to user namespaces rather than pid
> > namespaces, but in my opinion it doesn't make much sense to handle them
> > differently. It might make more sense to reorder the patches to add user
> > namespace support first, give the justification there, and then for this
> > one just say "to be consistent with userns support."
> >
> >> - capturing the namespace makes the code less complex (the strongest argument
> >> that I've heard)
> >
> > For pid namespaces it's not so bad, except maybe for file locking (I
> > can't remember for sure about that). For user namespaces though it does
> > get much more complex to use the namespace from the /dev/fuse IO path,
> > because frequently the raw data from userspace is being handed off to
> > another thread before it is interpreted. That means we either have to
> > capture the userns and pass it along with the data or restructure the
> > code to do all of this in the IO paths. Either way having a static
> > namespace in fuse_conn works out to be much simpler.
> >
> >> - there's a security concern with translating to current namespace (that I
> >> still don't fully understand)
> >
> > There's the one I explained with suid, which in reality should be
> > mitigated by some of the other protections in fuse. Other than that I
> > don't know of anything specific Eric had in mind other than the fact
> > that other sources of userns security bugs have been "time of open
> > versus time of use" sorts of problems, and he believes that capturing
> > the userns at IO time instead of mount time is likely to be similarly
> > vulnerable.
> >
>
> The general concern is that read(2) and write(2) really have no
> business looking at creds. ioctl is considerably less dangerous. The
> risk is that you get an fd and use at as stderr for a setuid process
> or something along those lines.
>
> > Otherwise I can't add any specifics beyond what Eric has already said in
> > http://lkml.kernel.org/g/[email protected]. Maybe
> > Eric or Andy could elaborate further. I'll also look at some of the past
> > vulnerabilities and see if I get any ideas for specific attacks.
> >
> >> - changing the pid namespace after mounting is unlikely to be of any use
> >
> > This sounds like trying to prove a negative. All I can say is that no
> > one has stepped forward to voice any specific use cases which would
> > require it.
> >
> >> - if it turns out to be useful, we can still fix this later (can we?)
> >
> > I don't see why we couldn't as long as nothing in userspace came to rely
> > upon the behavior.
>
> Just to be clear, the current behavior is to reject /dev/fuse io
> attempts with namespaces that don't match the opener, right? If so,
> that seems like the conservative option, and changing it won't break
> things.
>
> >
> >> What happens when the server does indeed change pid namespace after mounting?
> >> Will just receive bogus pid values? Shouldn't it receive an error instead?
> >
> > Yeah, I suppose it does receive bogus pids and userid values. About all
> > we could do to prevent this is make the /dev/fuse read/write paths
> > return an error if the current namespace isn't the same as the one at
> > mount time. But then requests could get stuck in the queue forever, so
> > maybe we should also fail all requests in the queue when this happens.
> > Unless you have a better idea?
> >
>
> If this is actually read(2)/write(2), then I think you need to either
> error out or use the namespaces at the time of open. Anything else is
> asking for trouble when the caller of read(2) or write(2) doesn't
> realize that the fd is a /dev/fuse fd.

I'll lay things out explicitly just to be sure there's no confusion.

There are two sides of the transaction. The first is the process
accessing the fuse mount, whose access attempt results in a request
being queued for reading on /dev/fuse. I'll call this side the client.
The other side is that of the process which reads the request from
/dev/fuse, processes it, and writes the response back to /dev/fuse.
I'll call this process the server.

In my current patches the pid and user namespaces are captured at mount
time, and the userns is also required to be the same as when opening
/dev/fuse or else the mount attempt fails. These namespaces are used for
all subsequent translations for IO on /dev/fuse, even if the server
switches to a different namespace.

One of the things Miklos is asking for is more convincing arguments that
we must use the namespaces from /dev/fuse open time and not the current
namespace of the server at IO time.

His other question is, assuming that we must use the namespaces from
/dev/fuse open, should the server recieve an error when it tries to do
IO on /dev/fuse from a different namespace? In this case the server will
know that the fd is a /dev/fuse fd.

But if the server cannot process requests due to switching namespaces it
could cause clients who have pending requests to block indefinitely. The
clients do not have fds to /dev/fuse but to some file within the fuse
mount. My suggestion was that whenever we refuse /dev/fuse IO for a
server which has switched namespaces we should assume that pending
requests aren't going to get processed and return errors to the clients
for pending IO operations on files within the fuse mount.

Does that make sense?

Thanks,
Seth

>
> --Andy
>
> >>
> >> More comments inline.
> >>
> >> > File locking changes based on previous work done by Eric
> >> > Biederman.
> >> >
> >> > Cc: Eric W. Biederman <[email protected]>
> >> > Cc: Serge H. Hallyn <[email protected]>
> >> > Cc: Andy Lutomirski <[email protected]>
> >> > Signed-off-by: Seth Forshee <[email protected]>
> >> > ---
> >> > fs/fuse/dev.c | 9 +++++----
> >> > fs/fuse/file.c | 38 +++++++++++++++++++++++++++-----------
> >> > fs/fuse/fuse_i.h | 4 ++++
> >> > fs/fuse/inode.c | 4 ++++
> >> > 4 files changed, 40 insertions(+), 15 deletions(-)
> >> >
> >> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> >> > index ca887314aba9..839caebd34f1 100644
> >> > --- a/fs/fuse/dev.c
> >> > +++ b/fs/fuse/dev.c
> >> > @@ -20,6 +20,7 @@
> >> > #include <linux/swap.h>
> >> > #include <linux/splice.h>
> >> > #include <linux/aio.h>
> >> > +#include <linux/sched.h>
> >> >
> >> > MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> >> > MODULE_ALIAS("devname:fuse");
> >> > @@ -124,11 +125,11 @@ static void __fuse_put_request(struct fuse_req *req)
> >> > atomic_dec(&req->count);
> >> > }
> >> >
> >> > -static void fuse_req_init_context(struct fuse_req *req)
> >> > +static void fuse_req_init_context(struct fuse_conn *fc, struct fuse_req *req)
> >> > {
> >> > req->in.h.uid = from_kuid_munged(&init_user_ns, current_fsuid());
> >> > req->in.h.gid = from_kgid_munged(&init_user_ns, current_fsgid());
> >> > - req->in.h.pid = current->pid;
> >> > + req->in.h.pid = pid_nr_ns(task_pid(current), fc->pid_ns);
> >> > }
> >> >
> >> > static bool fuse_block_alloc(struct fuse_conn *fc, bool for_background)
> >> > @@ -168,7 +169,7 @@ static struct fuse_req *__fuse_get_req(struct fuse_conn *fc, unsigned npages,
> >> > goto out;
> >> > }
> >> >
> >> > - fuse_req_init_context(req);
> >> > + fuse_req_init_context(fc, req);
> >> > req->waiting = 1;
> >> > req->background = for_background;
> >> > return req;
> >> > @@ -257,7 +258,7 @@ struct fuse_req *fuse_get_req_nofail_nopages(struct fuse_conn *fc,
> >> > if (!req)
> >> > req = get_reserved_req(fc, file);
> >> >
> >> > - fuse_req_init_context(req);
> >> > + fuse_req_init_context(fc, req);
> >> > req->waiting = 1;
> >> > req->background = 0;
> >> > return req;
> >> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >> > index caa8d95b24e8..cb0e40ecc362 100644
> >> > --- a/fs/fuse/file.c
> >> > +++ b/fs/fuse/file.c
> >> > @@ -2131,7 +2131,8 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
> >> > return generic_file_mmap(file, vma);
> >> > }
> >> >
> >> > -static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> >> > +static int convert_fuse_file_lock(struct fuse_conn *fc,
> >> > + const struct fuse_file_lock *ffl,
> >> > struct file_lock *fl)
> >> > {
> >> > switch (ffl->type) {
> >> > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> >> >
> >> > fl->fl_start = ffl->start;
> >> > fl->fl_end = ffl->end;
> >> > - fl->fl_pid = ffl->pid;
> >> > + rcu_read_lock();
> >> > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
> >> > + rcu_read_unlock();
> >> > + if (ffl->pid != 0 && fl->fl_pid == 0)
> >> > + return -EIO;
> >>
> >> This needs some comments: is this trying to translate the pid backwards? Why is
> >> it not checking the return of find_pid_ns() then? The man page documents l_pid
> >> value of -1 but not of 0, so why are we checking for "ffl->pid != 0"? Or is the
> >> man page incomplete and in practice we get l_pid == 0 values?
> >
> > I'll add comments. It's converting the pid in the fuse_file_lock into
> > the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
> > the pid can't be translated into the namespace, thus we return the
> > error.
> >
> > pid_vnr's return values don't necessarily conform to the expectations of
> > the fcntl syscall in all cases, and as far as I can tell it should never
> > return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
> > return that value then doesn't it makes sense for fuse to return an
> > error in cases where this would happen?
> >
> > Fwiw, this is also an example of a case where it's simpler to have a
> > static namespace. If we were to use the current ns from the /dev/fuse IO
> > path then we either have to process the request there or grab a
> > reference to the ns there and pass it alongside the request (and we
> > have to do it for all requests unless the IO path is going to look at
> > the request type and decide whether or not it requires a reference to
> > the namespace).
> >
> >>
> >>
> >> > break;
> >> >
> >> > default:
> >> > @@ -2156,9 +2161,9 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> >> > return 0;
> >> > }
> >> >
> >> > -static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> >> > - const struct file_lock *fl, int opcode, pid_t pid,
> >> > - int flock)
> >> > +static int fuse_lk_fill(struct fuse_req *req, struct file *file,
> >> > + const struct file_lock *fl, int opcode,
> >> > + struct pid *pid, int flock)
> >> > {
> >> > struct inode *inode = file_inode(file);
> >> > struct fuse_conn *fc = get_fuse_conn(inode);
> >> > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> >> > arg->lk.start = fl->fl_start;
> >> > arg->lk.end = fl->fl_end;
> >> > arg->lk.type = fl->fl_type;
> >> > - arg->lk.pid = pid;
> >> > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
> >> > + if (pid && arg->lk.pid == 0)
> >> > + return -EOVERFLOW;
> >>
> >> Could have done the conversion immediately after getting 'pid' with task_tgid(),
> >> then the changes would have been smaller and more localized.
> >
> > The changes would be very marginally smaller since currently only one
> > caller of fuse_lk_fill which passes a non-zero pid. If additional
> > callers were ever added with non-zero pids then there would be
> > duplication of this code. But I'll do it whichever way you prefer, just
> > let me know.
> >
> >> > if (flock)
> >> > arg->lk_flags |= FUSE_LK_FLOCK;
> >> > req->in.h.opcode = opcode;
> >> > @@ -2178,6 +2185,8 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> >> > req->in.numargs = 1;
> >> > req->in.args[0].size = sizeof(*arg);
> >> > req->in.args[0].value = arg;
> >> > +
> >> > + return 0;
> >> > }
> >> >
> >> > static int fuse_getlk(struct file *file, struct file_lock *fl)
> >> > @@ -2192,16 +2201,19 @@ static int fuse_getlk(struct file *file, struct file_lock *fl)
> >> > if (IS_ERR(req))
> >> > return PTR_ERR(req);
> >> >
> >> > - fuse_lk_fill(req, file, fl, FUSE_GETLK, 0, 0);
> >> > + err = fuse_lk_fill(req, file, fl, FUSE_GETLK, NULL, 0);
> >> > + if (err)
> >> > + goto out;
> >> > req->out.numargs = 1;
> >> > req->out.args[0].size = sizeof(outarg);
> >> > req->out.args[0].value = &outarg;
> >> > fuse_request_send(fc, req);
> >> > err = req->out.h.error;
> >> > - fuse_put_request(fc, req);
> >> > if (!err)
> >> > - err = convert_fuse_file_lock(&outarg.lk, fl);
> >> > + err = convert_fuse_file_lock(fc, &outarg.lk, fl);
> >> >
> >> > +out:
> >> > + fuse_put_request(fc, req);
> >> > return err;
> >> > }
> >> >
> >> > @@ -2211,7 +2223,7 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> >> > struct fuse_conn *fc = get_fuse_conn(inode);
> >> > struct fuse_req *req;
> >> > int opcode = (fl->fl_flags & FL_SLEEP) ? FUSE_SETLKW : FUSE_SETLK;
> >> > - pid_t pid = fl->fl_type != F_UNLCK ? current->tgid : 0;
> >> > + struct pid *pid = fl->fl_type != F_UNLCK ? task_tgid(current) : NULL;
> >> > int err;
> >> >
> >> > if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
> >> > @@ -2227,12 +2239,16 @@ static int fuse_setlk(struct file *file, struct file_lock *fl, int flock)
> >> > if (IS_ERR(req))
> >> > return PTR_ERR(req);
> >> >
> >> > - fuse_lk_fill(req, file, fl, opcode, pid, flock);
> >> > + err = fuse_lk_fill(req, file, fl, opcode, pid, flock);
> >> > + if (err)
> >> > + goto out;
> >> > fuse_request_send(fc, req);
> >> > err = req->out.h.error;
> >> > /* locking is restartable */
> >> > if (err == -EINTR)
> >> > err = -ERESTARTSYS;
> >> > +
> >> > +out:
> >> > fuse_put_request(fc, req);
> >> > return err;
> >> > }
> >> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> >> > index e8e47a6ab518..a3ded071e2c6 100644
> >> > --- a/fs/fuse/fuse_i.h
> >> > +++ b/fs/fuse/fuse_i.h
> >> > @@ -22,6 +22,7 @@
> >> > #include <linux/rbtree.h>
> >> > #include <linux/poll.h>
> >> > #include <linux/workqueue.h>
> >> > +#include <linux/pid_namespace.h>
> >> >
> >> > /** Max number of pages that can be used in a single read request */
> >> > #define FUSE_MAX_PAGES_PER_REQ 32
> >> > @@ -386,6 +387,9 @@ struct fuse_conn {
> >> > /** The group id for this mount */
> >> > kgid_t group_id;
> >> >
> >> > + /** The pid namespace for this mount */
> >> > + struct pid_namespace *pid_ns;
> >> > +
> >> > /** The fuse mount flags for this mount */
> >> > unsigned flags;
> >> >
> >> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> >> > index 03246cd9d47a..e137969815a3 100644
> >> > --- a/fs/fuse/inode.c
> >> > +++ b/fs/fuse/inode.c
> >> > @@ -20,6 +20,7 @@
> >> > #include <linux/random.h>
> >> > #include <linux/sched.h>
> >> > #include <linux/exportfs.h>
> >> > +#include <linux/pid_namespace.h>
> >> >
> >> > MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
> >> > MODULE_DESCRIPTION("Filesystem in Userspace");
> >> > @@ -616,6 +617,7 @@ void fuse_conn_init(struct fuse_conn *fc)
> >> > fc->initialized = 0;
> >> > fc->attr_version = 1;
> >> > get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
> >> > + fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
> >> > }
> >> > EXPORT_SYMBOL_GPL(fuse_conn_init);
> >> >
> >> > @@ -624,6 +626,8 @@ void fuse_conn_put(struct fuse_conn *fc)
> >> > if (atomic_dec_and_test(&fc->count)) {
> >> > if (fc->destroy_req)
> >> > fuse_request_free(fc->destroy_req);
> >> > + put_pid_ns(fc->pid_ns);
> >> > + fc->pid_ns = NULL;
> >>
> >> We don't usually zero out fields before freeing. There are debugging config
> >> options to do that for us.
> >
> > Okay, I'll remove that line.
> >
> > Thanks,
> > Seth
> >
> >>
> >> > fc->release(fc);
> >> > }
> >> > }
> >> > --
> >> > 1.9.1
> >> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-11-12 12:07:44

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

On Tue, Nov 11, 2014 at 09:24:29AM -0600, Seth Forshee wrote:
> > What happens when the server does indeed change pid namespace after mounting?
> > Will just receive bogus pid values? Shouldn't it receive an error instead?
>
> Yeah, I suppose it does receive bogus pids and userid values. About all
> we could do to prevent this is make the /dev/fuse read/write paths
> return an error if the current namespace isn't the same as the one at
> mount time. But then requests could get stuck in the queue forever, so
> maybe we should also fail all requests in the queue when this happens.
> Unless you have a better idea?

In fuse_dev_do_read() just after dequeuing the request check if the namespaces
match, and if not, reject with EIO.

> > > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> > >
> > > fl->fl_start = ffl->start;
> > > fl->fl_end = ffl->end;
> > > - fl->fl_pid = ffl->pid;
> > > + rcu_read_lock();
> > > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
> > > + rcu_read_unlock();
> > > + if (ffl->pid != 0 && fl->fl_pid == 0)
> > > + return -EIO;
> >
> > This needs some comments: is this trying to translate the pid backwards?
> > Why is it not checking the return of find_pid_ns() then? The man page
> > documents l_pid value of -1 but not of 0, so why are we checking for
> > "ffl->pid != 0"? Or is the man page incomplete and in practice we get l_pid
> > == 0 values?
>
> I'll add comments. It's converting the pid in the fuse_file_lock into
> the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
> the pid can't be translated into the namespace, thus we return the
> error.
>
> pid_vnr's return values don't necessarily conform to the expectations of
> the fcntl syscall in all cases, and as far as I can tell it should never
> return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
> return that value then doesn't it makes sense for fuse to return an
> error in cases where this would happen?

Not necessarily. The conflicting lock might be held by a process whose pid is
not visible from the client's namespace. That doesn't mean that the GETLK
should fail, it just means that l_pid can't be filled in (same as in NFS when a
lock held on a different client). AFAICS, NFS fills l_pid with zero in that
case. Makes sense, not sure why the man page doesn't document that.

> > > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > > arg->lk.start = fl->fl_start;
> > > arg->lk.end = fl->fl_end;
> > > arg->lk.type = fl->fl_type;
> > > - arg->lk.pid = pid;
> > > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
> > > + if (pid && arg->lk.pid == 0)
> > > + return -EOVERFLOW;
> >
> > Could have done the conversion immediately after getting 'pid' with task_tgid(),
> > then the changes would have been smaller and more localized.
>
> The changes would be very marginally smaller since currently only one
> caller of fuse_lk_fill which passes a non-zero pid. If additional
> callers were ever added with non-zero pids then there would be
> duplication of this code. But I'll do it whichever way you prefer, just
> let me know.

I prefer the simpler (even if only marginally) one.

Thanks,
Miklos

2014-11-12 13:09:17

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Tue, Nov 11, 2014 at 09:37:10AM -0600, Eric W. Biederman wrote:

> > Maybe I'm being dense, but can someone give a concrete example of such an
> > attack?
>
> There are two variants of things at play here.
>
> There is the classic if you don't freeze your context at open time when
> you pass that file descriptor to another process unexpected things can
> happen.
>
> An essentially harmless but extremely confusing example is what happens
> to a partial read when it stops halfway through a uid value and the next
> read on the same file descriptor is from a process in a different user
> namespace. Which uid value should be returned to userspace.

Fuse device doesn't currently do partial reads, so that's a non-issue.

> Now if I am in a nefarious mood I can create a unprivileged user
> namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
> descriptor to /dev/fuse to a processes that is in the default user
> namespace (and thus can use any uid/gid). With that file desctipor
> report that there is a setuid 0 exectuable on that file system.

Yes, and this would also be prevented by MNT_NOSUID, which would be a good idea
anyway. I just don't see the reason we'd want to allow clearing MNT_NOSUID in a
private namespace.

So we don't currently see a use case for relaxing either the MNT_NOSUID
restriction or for relaxing the requirement on the user namespace the fuse
server is in. Is that correct?

If so, we should leave both restrictions in place since that allows the greatest
flexibility in the future, is either of those needs to be relaxed.

> > That might also help me understand how exactly user/pid namespaces work...
>
> The idea of user/pid namespaces is to translate uid, gids and pids at
> the edge of userspace into a kernel internal form that can be use
> everywhere. In this case we get into the subtlties of which
> translations make sense.

I mean, what's the point of translating uid, gids and pids? What are the use
cases?

What are the rules on the translations between parent and child namespaces?

Is all this documented anywhere?

Thanks,
Miklos

2014-11-12 14:33:24

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] fuse: Add support for pid namespaces

On Wed, Nov 12, 2014 at 01:07:42PM +0100, Miklos Szeredi wrote:
> On Tue, Nov 11, 2014 at 09:24:29AM -0600, Seth Forshee wrote:
> > > What happens when the server does indeed change pid namespace after mounting?
> > > Will just receive bogus pid values? Shouldn't it receive an error instead?
> >
> > Yeah, I suppose it does receive bogus pids and userid values. About all
> > we could do to prevent this is make the /dev/fuse read/write paths
> > return an error if the current namespace isn't the same as the one at
> > mount time. But then requests could get stuck in the queue forever, so
> > maybe we should also fail all requests in the queue when this happens.
> > Unless you have a better idea?
>
> In fuse_dev_do_read() just after dequeuing the request check if the namespaces
> match, and if not, reject with EIO.

Okay, I'll do that.

> > > > @@ -2146,7 +2147,11 @@ static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
> > > >
> > > > fl->fl_start = ffl->start;
> > > > fl->fl_end = ffl->end;
> > > > - fl->fl_pid = ffl->pid;
> > > > + rcu_read_lock();
> > > > + fl->fl_pid = pid_vnr(find_pid_ns(ffl->pid, fc->pid_ns));
> > > > + rcu_read_unlock();
> > > > + if (ffl->pid != 0 && fl->fl_pid == 0)
> > > > + return -EIO;
> > >
> > > This needs some comments: is this trying to translate the pid backwards?
> > > Why is it not checking the return of find_pid_ns() then? The man page
> > > documents l_pid value of -1 but not of 0, so why are we checking for
> > > "ffl->pid != 0"? Or is the man page incomplete and in practice we get l_pid
> > > == 0 values?
> >
> > I'll add comments. It's converting the pid in the fuse_file_lock into
> > the current pid namespace. pid_vnr calls pid_nr_ns, which returns 0 if
> > the pid can't be translated into the namespace, thus we return the
> > error.
> >
> > pid_vnr's return values don't necessarily conform to the expectations of
> > the fcntl syscall in all cases, and as far as I can tell it should never
> > return <0. But if fcntl doesn't expect l_pid == 0 and pid_vnr could
> > return that value then doesn't it makes sense for fuse to return an
> > error in cases where this would happen?
>
> Not necessarily. The conflicting lock might be held by a process whose pid is
> not visible from the client's namespace. That doesn't mean that the GETLK
> should fail, it just means that l_pid can't be filled in (same as in NFS when a
> lock held on a different client). AFAICS, NFS fills l_pid with zero in that
> case. Makes sense, not sure why the man page doesn't document that.

Seems reasonable to follow the precedence set by NFS then.

>
> > > > @@ -2170,7 +2175,9 @@ static void fuse_lk_fill(struct fuse_req *req, struct file *file,
> > > > arg->lk.start = fl->fl_start;
> > > > arg->lk.end = fl->fl_end;
> > > > arg->lk.type = fl->fl_type;
> > > > - arg->lk.pid = pid;
> > > > + arg->lk.pid = pid_nr_ns(pid, fc->pid_ns);
> > > > + if (pid && arg->lk.pid == 0)
> > > > + return -EOVERFLOW;
> > >
> > > Could have done the conversion immediately after getting 'pid' with task_tgid(),
> > > then the changes would have been smaller and more localized.
> >
> > The changes would be very marginally smaller since currently only one
> > caller of fuse_lk_fill which passes a non-zero pid. If additional
> > callers were ever added with non-zero pids then there would be
> > duplication of this code. But I'll do it whichever way you prefer, just
> > let me know.
>
> I prefer the simpler (even if only marginally) one.

I'll change it then.

Thanks,
Seth

2014-11-12 16:23:03

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Nov 12, 2014 at 02:09:15PM +0100, Miklos Szeredi wrote:
> On Tue, Nov 11, 2014 at 09:37:10AM -0600, Eric W. Biederman wrote:
>
> > > Maybe I'm being dense, but can someone give a concrete example of such an
> > > attack?
> >
> > There are two variants of things at play here.
> >
> > There is the classic if you don't freeze your context at open time when
> > you pass that file descriptor to another process unexpected things can
> > happen.
> >
> > An essentially harmless but extremely confusing example is what happens
> > to a partial read when it stops halfway through a uid value and the next
> > read on the same file descriptor is from a process in a different user
> > namespace. Which uid value should be returned to userspace.
>
> Fuse device doesn't currently do partial reads, so that's a non-issue.
>
> > Now if I am in a nefarious mood I can create a unprivileged user
> > namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
> > descriptor to /dev/fuse to a processes that is in the default user
> > namespace (and thus can use any uid/gid). With that file desctipor
> > report that there is a setuid 0 exectuable on that file system.
>
> Yes, and this would also be prevented by MNT_NOSUID, which would be a good idea
> anyway. I just don't see the reason we'd want to allow clearing MNT_NOSUID in a
> private namespace.
>
> So we don't currently see a use case for relaxing either the MNT_NOSUID
> restriction or for relaxing the requirement on the user namespace the fuse
> server is in. Is that correct?
>
> If so, we should leave both restrictions in place since that allows the greatest
> flexibility in the future, is either of those needs to be relaxed.

I'm not aware of specific use cases for either at this point. However,
Andy's patch [1] will limit suid to the set of namespaces where the user
who mounted the filesystem already has privileges. Enforcing MNT_NOSUID
will require enforcement in the vfs, and in that case we definitely need
to decide whether the policy is to implicitly add the flag or fail the
mount attempt if the flag is not present [2].

> > > That might also help me understand how exactly user/pid namespaces work...
> >
> > The idea of user/pid namespaces is to translate uid, gids and pids at
> > the edge of userspace into a kernel internal form that can be use
> > everywhere. In this case we get into the subtlties of which
> > translations make sense.
>
> I mean, what's the point of translating uid, gids and pids? What are the use
> cases?

Do you mean in general, or for fuse specifically? In general user/pid
namespaces are primarily used to implement containers with isolated sets
of resources (if you're unfamiliar with containers, think of something
which looks more or less like a VM from within but runs under the same
kernel as the host).

For fuse: an unprivileged user has a regular file containing a
filesystem image which they wish to mount inside a container using fuse.
Assume that in this container uid 0 maps to uid 100000 in the host, etc.
The filesystem image is likely to be using ids like 0, 1000, etc. If the
kernel translates these to kuid 0, 1000, ... then these will map to
overflowuid in the container, and the mount won't be very useful to the
user. What the user expects is that uid 0 in the filesystem will map to
uid 0 within the container (kuid 100000 in this example).

The pids aren't nearly so user-visible, but if the userspace fuse driver
is running in a pid namespace then pids must be translated into the
namespace to be useful to the driver.

Does that answer your questions?

> What are the rules on the translations between parent and child namespaces?
>
> Is all this documented anywhere?

I haven't found any documentation. Eric?

As far as I can tell though the most important rules are to translate
to/from the kernel's internal representation as close to the
userspace/kernel boundary as possible and to work with kernel-internal
representations within the kernel (e.g. kuid_t, kgid_t, etc.).

The series of articles starting with [3] also serve as a good
introduction.

Thanks,
Seth

[1] http://lkml.kernel.org/g/252a4d87d99fc2b5fe4411c838f65b312c4e13cd.1413330857.git.luto@amacapital.net
[2] http://lkml.kernel.org/g/2686c32f00b14148379e8cfee9c028c794d4aa1a.1407974494.git.luto@amacapital.net
[3] http://lwn.net/Articles/531114/

2014-11-18 15:22:02

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Nov 12, 2014 at 10:22:54AM -0600, Seth Forshee wrote:
> On Wed, Nov 12, 2014 at 02:09:15PM +0100, Miklos Szeredi wrote:
> > On Tue, Nov 11, 2014 at 09:37:10AM -0600, Eric W. Biederman wrote:
> >
> > > > Maybe I'm being dense, but can someone give a concrete example of such an
> > > > attack?
> > >
> > > There are two variants of things at play here.
> > >
> > > There is the classic if you don't freeze your context at open time when
> > > you pass that file descriptor to another process unexpected things can
> > > happen.
> > >
> > > An essentially harmless but extremely confusing example is what happens
> > > to a partial read when it stops halfway through a uid value and the next
> > > read on the same file descriptor is from a process in a different user
> > > namespace. Which uid value should be returned to userspace.
> >
> > Fuse device doesn't currently do partial reads, so that's a non-issue.
> >
> > > Now if I am in a nefarious mood I can create a unprivileged user
> > > namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
> > > descriptor to /dev/fuse to a processes that is in the default user
> > > namespace (and thus can use any uid/gid). With that file desctipor
> > > report that there is a setuid 0 exectuable on that file system.
> >
> > Yes, and this would also be prevented by MNT_NOSUID, which would be a good idea
> > anyway. I just don't see the reason we'd want to allow clearing MNT_NOSUID in a
> > private namespace.
> >
> > So we don't currently see a use case for relaxing either the MNT_NOSUID
> > restriction or for relaxing the requirement on the user namespace the fuse
> > server is in. Is that correct?
> >
> > If so, we should leave both restrictions in place since that allows the greatest
> > flexibility in the future, is either of those needs to be relaxed.
>
> I'm not aware of specific use cases for either at this point. However,
> Andy's patch [1] will limit suid to the set of namespaces where the user
> who mounted the filesystem already has privileges. Enforcing MNT_NOSUID
> will require enforcement in the vfs, and in that case we definitely need
> to decide whether the policy is to implicitly add the flag or fail the
> mount attempt if the flag is not present [2].

I asked around a bit, and it turns out there are use cases for nested
containers (i.e. a container within a container) where the rootfs for
the outer container mounts a filesystem containing the rootfs for the
inner container. If that mount is nosuid then suid utilities like ping
aren't going to work in the inner container.

So since there's a use case for suid in a userns mount and we have what
we belive are sufficient protections against using this as a vector to
get privileges outside the container, I'm planning to move ahead without
the MNT_NOSUID restriction. Any objections?

Thanks,
Seth

2014-11-18 17:09:58

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Tue, Nov 18, 2014 at 7:21 AM, Seth Forshee
<[email protected]> wrote:
> On Wed, Nov 12, 2014 at 10:22:54AM -0600, Seth Forshee wrote:
>> On Wed, Nov 12, 2014 at 02:09:15PM +0100, Miklos Szeredi wrote:
>> > On Tue, Nov 11, 2014 at 09:37:10AM -0600, Eric W. Biederman wrote:
>> >
>> > > > Maybe I'm being dense, but can someone give a concrete example of such an
>> > > > attack?
>> > >
>> > > There are two variants of things at play here.
>> > >
>> > > There is the classic if you don't freeze your context at open time when
>> > > you pass that file descriptor to another process unexpected things can
>> > > happen.
>> > >
>> > > An essentially harmless but extremely confusing example is what happens
>> > > to a partial read when it stops halfway through a uid value and the next
>> > > read on the same file descriptor is from a process in a different user
>> > > namespace. Which uid value should be returned to userspace.
>> >
>> > Fuse device doesn't currently do partial reads, so that's a non-issue.
>> >
>> > > Now if I am in a nefarious mood I can create a unprivileged user
>> > > namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
>> > > descriptor to /dev/fuse to a processes that is in the default user
>> > > namespace (and thus can use any uid/gid). With that file desctipor
>> > > report that there is a setuid 0 exectuable on that file system.
>> >
>> > Yes, and this would also be prevented by MNT_NOSUID, which would be a good idea
>> > anyway. I just don't see the reason we'd want to allow clearing MNT_NOSUID in a
>> > private namespace.
>> >
>> > So we don't currently see a use case for relaxing either the MNT_NOSUID
>> > restriction or for relaxing the requirement on the user namespace the fuse
>> > server is in. Is that correct?
>> >
>> > If so, we should leave both restrictions in place since that allows the greatest
>> > flexibility in the future, is either of those needs to be relaxed.
>>
>> I'm not aware of specific use cases for either at this point. However,
>> Andy's patch [1] will limit suid to the set of namespaces where the user
>> who mounted the filesystem already has privileges. Enforcing MNT_NOSUID
>> will require enforcement in the vfs, and in that case we definitely need
>> to decide whether the policy is to implicitly add the flag or fail the
>> mount attempt if the flag is not present [2].
>
> I asked around a bit, and it turns out there are use cases for nested
> containers (i.e. a container within a container) where the rootfs for
> the outer container mounts a filesystem containing the rootfs for the
> inner container. If that mount is nosuid then suid utilities like ping
> aren't going to work in the inner container.
>
> So since there's a use case for suid in a userns mount and we have what
> we belive are sufficient protections against using this as a vector to
> get privileges outside the container, I'm planning to move ahead without
> the MNT_NOSUID restriction. Any objections?

Are you talking about MNT_NOSUID the flag or my ns-dependent thing?

--Andy

>
> Thanks,
> Seth
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-18 17:13:27

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Tue, Nov 18, 2014 at 09:09:34AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 18, 2014 at 7:21 AM, Seth Forshee
> <[email protected]> wrote:
> > On Wed, Nov 12, 2014 at 10:22:54AM -0600, Seth Forshee wrote:
> >> On Wed, Nov 12, 2014 at 02:09:15PM +0100, Miklos Szeredi wrote:
> >> > On Tue, Nov 11, 2014 at 09:37:10AM -0600, Eric W. Biederman wrote:
> >> >
> >> > > > Maybe I'm being dense, but can someone give a concrete example of such an
> >> > > > attack?
> >> > >
> >> > > There are two variants of things at play here.
> >> > >
> >> > > There is the classic if you don't freeze your context at open time when
> >> > > you pass that file descriptor to another process unexpected things can
> >> > > happen.
> >> > >
> >> > > An essentially harmless but extremely confusing example is what happens
> >> > > to a partial read when it stops halfway through a uid value and the next
> >> > > read on the same file descriptor is from a process in a different user
> >> > > namespace. Which uid value should be returned to userspace.
> >> >
> >> > Fuse device doesn't currently do partial reads, so that's a non-issue.
> >> >
> >> > > Now if I am in a nefarious mood I can create a unprivileged user
> >> > > namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
> >> > > descriptor to /dev/fuse to a processes that is in the default user
> >> > > namespace (and thus can use any uid/gid). With that file desctipor
> >> > > report that there is a setuid 0 exectuable on that file system.
> >> >
> >> > Yes, and this would also be prevented by MNT_NOSUID, which would be a good idea
> >> > anyway. I just don't see the reason we'd want to allow clearing MNT_NOSUID in a
> >> > private namespace.
> >> >
> >> > So we don't currently see a use case for relaxing either the MNT_NOSUID
> >> > restriction or for relaxing the requirement on the user namespace the fuse
> >> > server is in. Is that correct?
> >> >
> >> > If so, we should leave both restrictions in place since that allows the greatest
> >> > flexibility in the future, is either of those needs to be relaxed.
> >>
> >> I'm not aware of specific use cases for either at this point. However,
> >> Andy's patch [1] will limit suid to the set of namespaces where the user
> >> who mounted the filesystem already has privileges. Enforcing MNT_NOSUID
> >> will require enforcement in the vfs, and in that case we definitely need
> >> to decide whether the policy is to implicitly add the flag or fail the
> >> mount attempt if the flag is not present [2].
> >
> > I asked around a bit, and it turns out there are use cases for nested
> > containers (i.e. a container within a container) where the rootfs for
> > the outer container mounts a filesystem containing the rootfs for the
> > inner container. If that mount is nosuid then suid utilities like ping
> > aren't going to work in the inner container.
> >
> > So since there's a use case for suid in a userns mount and we have what
> > we belive are sufficient protections against using this as a vector to
> > get privileges outside the container, I'm planning to move ahead without
> > the MNT_NOSUID restriction. Any objections?
>
> Are you talking about MNT_NOSUID the flag or my ns-dependent thing?

I'm talking about dropping the proposed requirement from Miklos that all
fuse userns mounts are required to have the MNT_NOSUID flag. I intend to
keep your ns-dependent thing.

Thanks,
Seth

2014-11-18 17:20:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Tue, Nov 18, 2014 at 9:13 AM, Seth Forshee
<[email protected]> wrote:
> On Tue, Nov 18, 2014 at 09:09:34AM -0800, Andy Lutomirski wrote:
>> On Tue, Nov 18, 2014 at 7:21 AM, Seth Forshee
>> <[email protected]> wrote:
>> > On Wed, Nov 12, 2014 at 10:22:54AM -0600, Seth Forshee wrote:
>> >> On Wed, Nov 12, 2014 at 02:09:15PM +0100, Miklos Szeredi wrote:
>> >> > On Tue, Nov 11, 2014 at 09:37:10AM -0600, Eric W. Biederman wrote:
>> >> >
>> >> > > > Maybe I'm being dense, but can someone give a concrete example of such an
>> >> > > > attack?
>> >> > >
>> >> > > There are two variants of things at play here.
>> >> > >
>> >> > > There is the classic if you don't freeze your context at open time when
>> >> > > you pass that file descriptor to another process unexpected things can
>> >> > > happen.
>> >> > >
>> >> > > An essentially harmless but extremely confusing example is what happens
>> >> > > to a partial read when it stops halfway through a uid value and the next
>> >> > > read on the same file descriptor is from a process in a different user
>> >> > > namespace. Which uid value should be returned to userspace.
>> >> >
>> >> > Fuse device doesn't currently do partial reads, so that's a non-issue.
>> >> >
>> >> > > Now if I am in a nefarious mood I can create a unprivileged user
>> >> > > namespace, open /dev/fuse and mount a fuse filesystem. Pass the file
>> >> > > descriptor to /dev/fuse to a processes that is in the default user
>> >> > > namespace (and thus can use any uid/gid). With that file desctipor
>> >> > > report that there is a setuid 0 exectuable on that file system.
>> >> >
>> >> > Yes, and this would also be prevented by MNT_NOSUID, which would be a good idea
>> >> > anyway. I just don't see the reason we'd want to allow clearing MNT_NOSUID in a
>> >> > private namespace.
>> >> >
>> >> > So we don't currently see a use case for relaxing either the MNT_NOSUID
>> >> > restriction or for relaxing the requirement on the user namespace the fuse
>> >> > server is in. Is that correct?
>> >> >
>> >> > If so, we should leave both restrictions in place since that allows the greatest
>> >> > flexibility in the future, is either of those needs to be relaxed.
>> >>
>> >> I'm not aware of specific use cases for either at this point. However,
>> >> Andy's patch [1] will limit suid to the set of namespaces where the user
>> >> who mounted the filesystem already has privileges. Enforcing MNT_NOSUID
>> >> will require enforcement in the vfs, and in that case we definitely need
>> >> to decide whether the policy is to implicitly add the flag or fail the
>> >> mount attempt if the flag is not present [2].
>> >
>> > I asked around a bit, and it turns out there are use cases for nested
>> > containers (i.e. a container within a container) where the rootfs for
>> > the outer container mounts a filesystem containing the rootfs for the
>> > inner container. If that mount is nosuid then suid utilities like ping
>> > aren't going to work in the inner container.
>> >
>> > So since there's a use case for suid in a userns mount and we have what
>> > we belive are sufficient protections against using this as a vector to
>> > get privileges outside the container, I'm planning to move ahead without
>> > the MNT_NOSUID restriction. Any objections?
>>
>> Are you talking about MNT_NOSUID the flag or my ns-dependent thing?
>
> I'm talking about dropping the proposed requirement from Miklos that all
> fuse userns mounts are required to have the MNT_NOSUID flag. I intend to
> keep your ns-dependent thing.
>

In that case, I agree completely. There are certainly uses for
non-nosuid mounts in containers, and I don't see why fuse should be
any different.

--Andy

> Thanks,
> Seth
>



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-19 08:51:00

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
<[email protected]> wrote:
>> I asked around a bit, and it turns out there are use cases for nested
> containers (i.e. a container within a container) where the rootfs for
> the outer container mounts a filesystem containing the rootfs for the
> inner container. If that mount is nosuid then suid utilities like ping
> aren't going to work in the inner container.
>
> So since there's a use case for suid in a userns mount and we have what
> we belive are sufficient protections against using this as a vector to
> get privileges outside the container, I'm planning to move ahead without
> the MNT_NOSUID restriction. Any objections?

In the general case how'd we prevent suid executable being tricked to
do something it shouldn't do by unprivileged mounting into sensitive
places (i.e. config files) inside the container?

Allowing SUID looks like a slippery slope to me. And there are plenty
of solutions to the "ping" problem, AFAICS, that don't involve the
suid bit.

Disclaimer: I'm pretty ignorant in the field of computer security...

Thanks,
Miklos

2014-11-19 10:38:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
> On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
> <[email protected]> wrote:
>>> I asked around a bit, and it turns out there are use cases for nested
>> containers (i.e. a container within a container) where the rootfs for
>> the outer container mounts a filesystem containing the rootfs for the
>> inner container. If that mount is nosuid then suid utilities like ping
>> aren't going to work in the inner container.
>>
>> So since there's a use case for suid in a userns mount and we have what
>> we belive are sufficient protections against using this as a vector to
>> get privileges outside the container, I'm planning to move ahead without
>> the MNT_NOSUID restriction. Any objections?
>
> In the general case how'd we prevent suid executable being tricked to
> do something it shouldn't do by unprivileged mounting into sensitive
> places (i.e. config files) inside the container?
>
> Allowing SUID looks like a slippery slope to me. And there are plenty
> of solutions to the "ping" problem, AFAICS, that don't involve the
> suid bit.

ping isn't even suid on my system, it has security.capability xattr instead.

Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
worth the complexity and potential problems arising from that
complexity.

Thanks,
Miklos

2014-11-19 14:09:16

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

Quoting Miklos Szeredi ([email protected]):
> On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
> > On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
> > <[email protected]> wrote:
> >>> I asked around a bit, and it turns out there are use cases for nested
> >> containers (i.e. a container within a container) where the rootfs for
> >> the outer container mounts a filesystem containing the rootfs for the
> >> inner container. If that mount is nosuid then suid utilities like ping
> >> aren't going to work in the inner container.
> >>
> >> So since there's a use case for suid in a userns mount and we have what
> >> we belive are sufficient protections against using this as a vector to
> >> get privileges outside the container, I'm planning to move ahead without
> >> the MNT_NOSUID restriction. Any objections?
> >
> > In the general case how'd we prevent suid executable being tricked to
> > do something it shouldn't do by unprivileged mounting into sensitive
> > places (i.e. config files) inside the container?

The design of the namespaces would prevent that. You cannot manipulate your
mounts namespace unless you own it. You cannot manipulate the mounts namespace
for a task whose user namespace you do not own. If you can, for instance,
bind mount $HOME/shadow onto /etc/shadow, then you already own your user
namespace and are root there, so any suid-root program which you mount through
fuse will only subjegate your own namespace. Any task which running in the
parent user-ns (and therefore parent mount-ns) will not see your bind mount.

> > Allowing SUID looks like a slippery slope to me. And there are plenty
> > of solutions to the "ping" problem, AFAICS, that don't involve the
> > suid bit.
>
> ping isn't even suid on my system, it has security.capability xattr instead.

security.capability xattrs that will have the exact same concerns wrt
confusion through bind mounts as suid.

> Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
> worth the complexity and potential problems arising from that
> complexity.

Oh boy, I don't know which side to sit on here :) I'm all for replacing
suid with some use of file capabilities, but realistically there are reasons
why that hasn't happened more widely than it has - tar, package managers,
cpio, nfs, etc.

-serge

2014-11-21 16:44:45

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Wed, Nov 19, 2014 at 03:09:11PM +0100, Serge E. Hallyn wrote:
> Quoting Miklos Szeredi ([email protected]):
> > On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
> > > On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
> > > <[email protected]> wrote:
> > >>> I asked around a bit, and it turns out there are use cases for nested
> > >> containers (i.e. a container within a container) where the rootfs for
> > >> the outer container mounts a filesystem containing the rootfs for the
> > >> inner container. If that mount is nosuid then suid utilities like ping
> > >> aren't going to work in the inner container.
> > >>
> > >> So since there's a use case for suid in a userns mount and we have what
> > >> we belive are sufficient protections against using this as a vector to
> > >> get privileges outside the container, I'm planning to move ahead without
> > >> the MNT_NOSUID restriction. Any objections?
> > >
> > > In the general case how'd we prevent suid executable being tricked to
> > > do something it shouldn't do by unprivileged mounting into sensitive
> > > places (i.e. config files) inside the container?
>
> The design of the namespaces would prevent that. You cannot manipulate your
> mounts namespace unless you own it. You cannot manipulate the mounts namespace
> for a task whose user namespace you do not own. If you can, for instance,
> bind mount $HOME/shadow onto /etc/shadow, then you already own your user
> namespace and are root there, so any suid-root program which you mount through
> fuse will only subjegate your own namespace. Any task which running in the
> parent user-ns (and therefore parent mount-ns) will not see your bind mount.
>
> > > Allowing SUID looks like a slippery slope to me. And there are plenty
> > > of solutions to the "ping" problem, AFAICS, that don't involve the
> > > suid bit.
> >
> > ping isn't even suid on my system, it has security.capability xattr instead.
>
> security.capability xattrs that will have the exact same concerns wrt
> confusion through bind mounts as suid.
>
> > Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
> > worth the complexity and potential problems arising from that
> > complexity.
>
> Oh boy, I don't know which side to sit on here :) I'm all for replacing
> suid with some use of file capabilities, but realistically there are reasons
> why that hasn't happened more widely than it has - tar, package managers,
> cpio, nfs, etc.

Miklos: I we're all generally in agreement here that suid/sgid is not
the best solution, but as Serge points out we are unfortunately not yet
in a place where it can be completely dropped in favor of capabilities.
In light of this can I convince you to reconsider your position?

Thanks,
Seth

2014-11-21 17:19:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Fri, Nov 21, 2014 at 8:44 AM, Seth Forshee
<[email protected]> wrote:
> On Wed, Nov 19, 2014 at 03:09:11PM +0100, Serge E. Hallyn wrote:
>> Quoting Miklos Szeredi ([email protected]):
>> > On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
>> > > On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
>> > > <[email protected]> wrote:
>> > >>> I asked around a bit, and it turns out there are use cases for nested
>> > >> containers (i.e. a container within a container) where the rootfs for
>> > >> the outer container mounts a filesystem containing the rootfs for the
>> > >> inner container. If that mount is nosuid then suid utilities like ping
>> > >> aren't going to work in the inner container.
>> > >>
>> > >> So since there's a use case for suid in a userns mount and we have what
>> > >> we belive are sufficient protections against using this as a vector to
>> > >> get privileges outside the container, I'm planning to move ahead without
>> > >> the MNT_NOSUID restriction. Any objections?
>> > >
>> > > In the general case how'd we prevent suid executable being tricked to
>> > > do something it shouldn't do by unprivileged mounting into sensitive
>> > > places (i.e. config files) inside the container?
>>
>> The design of the namespaces would prevent that. You cannot manipulate your
>> mounts namespace unless you own it. You cannot manipulate the mounts namespace
>> for a task whose user namespace you do not own. If you can, for instance,
>> bind mount $HOME/shadow onto /etc/shadow, then you already own your user
>> namespace and are root there, so any suid-root program which you mount through
>> fuse will only subjegate your own namespace. Any task which running in the
>> parent user-ns (and therefore parent mount-ns) will not see your bind mount.
>>
>> > > Allowing SUID looks like a slippery slope to me. And there are plenty
>> > > of solutions to the "ping" problem, AFAICS, that don't involve the
>> > > suid bit.
>> >
>> > ping isn't even suid on my system, it has security.capability xattr instead.
>>
>> security.capability xattrs that will have the exact same concerns wrt
>> confusion through bind mounts as suid.
>>
>> > Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
>> > worth the complexity and potential problems arising from that
>> > complexity.
>>
>> Oh boy, I don't know which side to sit on here :) I'm all for replacing
>> suid with some use of file capabilities, but realistically there are reasons
>> why that hasn't happened more widely than it has - tar, package managers,
>> cpio, nfs, etc.
>
> Miklos: I we're all generally in agreement here that suid/sgid is not
> the best solution, but as Serge points out we are unfortunately not yet
> in a place where it can be completely dropped in favor of capabilities.
> In light of this can I convince you to reconsider your position?
>

I would go one step further: all the things that gain privilege on
exec (suig/sgid, fscaps, and LSM transitions) are not just "not the
best" but are in fact disasters. They made sense when systems had a
few KB of RAM.

suid/sgid is at least a /standardized/ disaster, though, and
namespaced code should be able to use it.

Miklos, I'm not sure whether you saw it (it was a bit buried, I
think), but this series is intended to depend on a patch of mine that
makes all mounts that belong to foreign namespaces act as though
they're MNT_NOSUID. That means that, in order for suid/sgid to do
anything, the namespace owner needs to indicate their trust in the fs
by explicitly mounting it.

--Andy

2014-11-21 18:15:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

Seth Forshee <[email protected]> writes:

> On Wed, Nov 19, 2014 at 03:09:11PM +0100, Serge E. Hallyn wrote:
>> Quoting Miklos Szeredi ([email protected]):
>> > On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
>> > > On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
>> > > <[email protected]> wrote:
>> > >>> I asked around a bit, and it turns out there are use cases for nested
>> > >> containers (i.e. a container within a container) where the rootfs for
>> > >> the outer container mounts a filesystem containing the rootfs for the
>> > >> inner container. If that mount is nosuid then suid utilities like ping
>> > >> aren't going to work in the inner container.
>> > >>
>> > >> So since there's a use case for suid in a userns mount and we have what
>> > >> we belive are sufficient protections against using this as a vector to
>> > >> get privileges outside the container, I'm planning to move ahead without
>> > >> the MNT_NOSUID restriction. Any objections?
>> > >
>> > > In the general case how'd we prevent suid executable being tricked to
>> > > do something it shouldn't do by unprivileged mounting into sensitive
>> > > places (i.e. config files) inside the container?
>>
>> The design of the namespaces would prevent that. You cannot manipulate your
>> mounts namespace unless you own it. You cannot manipulate the mounts namespace
>> for a task whose user namespace you do not own. If you can, for instance,
>> bind mount $HOME/shadow onto /etc/shadow, then you already own your user
>> namespace and are root there, so any suid-root program which you mount through
>> fuse will only subjegate your own namespace. Any task which running in the
>> parent user-ns (and therefore parent mount-ns) will not see your bind mount.
>>
>> > > Allowing SUID looks like a slippery slope to me. And there are plenty
>> > > of solutions to the "ping" problem, AFAICS, that don't involve the
>> > > suid bit.
>> >
>> > ping isn't even suid on my system, it has security.capability xattr instead.
>>
>> security.capability xattrs that will have the exact same concerns wrt
>> confusion through bind mounts as suid.
>>
>> > Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
>> > worth the complexity and potential problems arising from that
>> > complexity.
>>
>> Oh boy, I don't know which side to sit on here :) I'm all for replacing
>> suid with some use of file capabilities, but realistically there are reasons
>> why that hasn't happened more widely than it has - tar, package managers,
>> cpio, nfs, etc.
>
> Miklos: I we're all generally in agreement here that suid/sgid is not
> the best solution, but as Serge points out we are unfortunately not yet
> in a place where it can be completely dropped in favor of capabilities.
> In light of this can I convince you to reconsider your position?

Regardless of what fuse does user namespaces must support mounting
filesystems that have the setuid and setgid bits set. Likewise we need
to handle capabilities.

There is a parallel bit of work to the fuse patches that I think at this
point should be completed first.

- Add s_user_ns to struct super. So we can have filesystems whose
labels are not interpreted at a global scope.

- Tweak the file capability code to look at s_user_ns and treat it
properly.

- Tweak the lsms to look at s_user_ns and ignore security labels that
don't come from init_user_ns. (The lsms at their discrection can
be more trusting but the default should be for them to ignore those
labels).

- Tweak the security checks to allow setting file capabilities and
other security xattrs if we have the appropriate capabilities in
s_user_ns.

- Update tmpfs and ramfs to set s_user_ns when being mounted.


When those bits are done we can tweak the fuse patches to also set
s_user_ns.

As for MNT_NO_SUID if fuse wants to enforce that in some way. I don't
particularly care, but I don't think that makes sense as a vfs property.

Eric

2014-11-21 18:25:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Fri, Nov 21, 2014 at 10:14 AM, Eric W. Biederman
<[email protected]> wrote:
> Seth Forshee <[email protected]> writes:
>
>> On Wed, Nov 19, 2014 at 03:09:11PM +0100, Serge E. Hallyn wrote:
>>> Quoting Miklos Szeredi ([email protected]):
>>> > On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
>>> > > On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
>>> > > <[email protected]> wrote:
>>> > >>> I asked around a bit, and it turns out there are use cases for nested
>>> > >> containers (i.e. a container within a container) where the rootfs for
>>> > >> the outer container mounts a filesystem containing the rootfs for the
>>> > >> inner container. If that mount is nosuid then suid utilities like ping
>>> > >> aren't going to work in the inner container.
>>> > >>
>>> > >> So since there's a use case for suid in a userns mount and we have what
>>> > >> we belive are sufficient protections against using this as a vector to
>>> > >> get privileges outside the container, I'm planning to move ahead without
>>> > >> the MNT_NOSUID restriction. Any objections?
>>> > >
>>> > > In the general case how'd we prevent suid executable being tricked to
>>> > > do something it shouldn't do by unprivileged mounting into sensitive
>>> > > places (i.e. config files) inside the container?
>>>
>>> The design of the namespaces would prevent that. You cannot manipulate your
>>> mounts namespace unless you own it. You cannot manipulate the mounts namespace
>>> for a task whose user namespace you do not own. If you can, for instance,
>>> bind mount $HOME/shadow onto /etc/shadow, then you already own your user
>>> namespace and are root there, so any suid-root program which you mount through
>>> fuse will only subjegate your own namespace. Any task which running in the
>>> parent user-ns (and therefore parent mount-ns) will not see your bind mount.
>>>
>>> > > Allowing SUID looks like a slippery slope to me. And there are plenty
>>> > > of solutions to the "ping" problem, AFAICS, that don't involve the
>>> > > suid bit.
>>> >
>>> > ping isn't even suid on my system, it has security.capability xattr instead.
>>>
>>> security.capability xattrs that will have the exact same concerns wrt
>>> confusion through bind mounts as suid.
>>>
>>> > Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
>>> > worth the complexity and potential problems arising from that
>>> > complexity.
>>>
>>> Oh boy, I don't know which side to sit on here :) I'm all for replacing
>>> suid with some use of file capabilities, but realistically there are reasons
>>> why that hasn't happened more widely than it has - tar, package managers,
>>> cpio, nfs, etc.
>>
>> Miklos: I we're all generally in agreement here that suid/sgid is not
>> the best solution, but as Serge points out we are unfortunately not yet
>> in a place where it can be completely dropped in favor of capabilities.
>> In light of this can I convince you to reconsider your position?
>
> Regardless of what fuse does user namespaces must support mounting
> filesystems that have the setuid and setgid bits set. Likewise we need
> to handle capabilities.
>
> There is a parallel bit of work to the fuse patches that I think at this
> point should be completed first.
>
> - Add s_user_ns to struct super. So we can have filesystems whose
> labels are not interpreted at a global scope.
>

I agree with this. Although I'm not sure why fuse needs to wait for it.

> - Tweak the file capability code to look at s_user_ns and treat it
> properly.
>
> - Tweak the lsms to look at s_user_ns and ignore security labels that
> don't come from init_user_ns. (The lsms at their discrection can
> be more trusting but the default should be for them to ignore those
> labels).
>

These two make me a bit nervous. Suppose that I take filesystem with
s_user_ns == &init_user_ns and bind mount it somewhere in my
namespace. Then I pass an fd to it back out to the init ns. Under
this logic, LSMs and fscaps will trust that fd. If we look at the
mount namespace instead, fscps will not trust that fd. (I haven't
updated LSMs, but I think that LSMs should consider the mount ns, not
the super ns, as well.)

It's not clear to me that allowing unprivileged users to mess around
like this is safe. It ought to be okay for fscaps and maybe even for
selinux, but apparmor may get very confused, since the unprivileged
user has essentially full control over the pathnames (at least if
they're relative to the fd's mount ns).

It may be safe, but I'm less confident in it than I am in trusting the
path's mount ns.

> - Tweak the security checks to allow setting file capabilities and
> other security xattrs if we have the appropriate capabilities in
> s_user_ns.
>
> - Update tmpfs and ramfs to set s_user_ns when being mounted.
>
>
> When those bits are done we can tweak the fuse patches to also set
> s_user_ns.
>
> As for MNT_NO_SUID if fuse wants to enforce that in some way. I don't
> particularly care, but I don't think that makes sense as a vfs property.
>


--Andy

2014-11-21 18:27:35

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Fri, Nov 21, 2014 at 12:14:19PM -0600, Eric W. Biederman wrote:
> Seth Forshee <[email protected]> writes:
>
> > On Wed, Nov 19, 2014 at 03:09:11PM +0100, Serge E. Hallyn wrote:
> >> Quoting Miklos Szeredi ([email protected]):
> >> > On Wed, Nov 19, 2014 at 9:50 AM, Miklos Szeredi <[email protected]> wrote:
> >> > > On Tue, Nov 18, 2014 at 4:21 PM, Seth Forshee
> >> > > <[email protected]> wrote:
> >> > >>> I asked around a bit, and it turns out there are use cases for nested
> >> > >> containers (i.e. a container within a container) where the rootfs for
> >> > >> the outer container mounts a filesystem containing the rootfs for the
> >> > >> inner container. If that mount is nosuid then suid utilities like ping
> >> > >> aren't going to work in the inner container.
> >> > >>
> >> > >> So since there's a use case for suid in a userns mount and we have what
> >> > >> we belive are sufficient protections against using this as a vector to
> >> > >> get privileges outside the container, I'm planning to move ahead without
> >> > >> the MNT_NOSUID restriction. Any objections?
> >> > >
> >> > > In the general case how'd we prevent suid executable being tricked to
> >> > > do something it shouldn't do by unprivileged mounting into sensitive
> >> > > places (i.e. config files) inside the container?
> >>
> >> The design of the namespaces would prevent that. You cannot manipulate your
> >> mounts namespace unless you own it. You cannot manipulate the mounts namespace
> >> for a task whose user namespace you do not own. If you can, for instance,
> >> bind mount $HOME/shadow onto /etc/shadow, then you already own your user
> >> namespace and are root there, so any suid-root program which you mount through
> >> fuse will only subjegate your own namespace. Any task which running in the
> >> parent user-ns (and therefore parent mount-ns) will not see your bind mount.
> >>
> >> > > Allowing SUID looks like a slippery slope to me. And there are plenty
> >> > > of solutions to the "ping" problem, AFAICS, that don't involve the
> >> > > suid bit.
> >> >
> >> > ping isn't even suid on my system, it has security.capability xattr instead.
> >>
> >> security.capability xattrs that will have the exact same concerns wrt
> >> confusion through bind mounts as suid.
> >>
> >> > Please just get rid of SUID/SGID. It's a legacy, it's a hack, not
> >> > worth the complexity and potential problems arising from that
> >> > complexity.
> >>
> >> Oh boy, I don't know which side to sit on here :) I'm all for replacing
> >> suid with some use of file capabilities, but realistically there are reasons
> >> why that hasn't happened more widely than it has - tar, package managers,
> >> cpio, nfs, etc.
> >
> > Miklos: I we're all generally in agreement here that suid/sgid is not
> > the best solution, but as Serge points out we are unfortunately not yet
> > in a place where it can be completely dropped in favor of capabilities.
> > In light of this can I convince you to reconsider your position?
>
> Regardless of what fuse does user namespaces must support mounting
> filesystems that have the setuid and setgid bits set. Likewise we need
> to handle capabilities.
>
> There is a parallel bit of work to the fuse patches that I think at this
> point should be completed first.
>
> - Add s_user_ns to struct super. So we can have filesystems whose
> labels are not interpreted at a global scope.
>
> - Tweak the file capability code to look at s_user_ns and treat it
> properly.
>
> - Tweak the lsms to look at s_user_ns and ignore security labels that
> don't come from init_user_ns. (The lsms at their discrection can
> be more trusting but the default should be for them to ignore those
> labels).
>
> - Tweak the security checks to allow setting file capabilities and
> other security xattrs if we have the appropriate capabilities in
> s_user_ns.
>
> - Update tmpfs and ramfs to set s_user_ns when being mounted.

Okay. Is someone already working on these items? If not I'll work up
some patches.

> When those bits are done we can tweak the fuse patches to also set
> s_user_ns.
>
> As for MNT_NO_SUID if fuse wants to enforce that in some way. I don't
> particularly care, but I don't think that makes sense as a vfs property.

I don't think fuse can enforce this since the flag applies to the mount
and not the super block. It would have to be enforced by the vfs.

Thanks,
Seth

2014-11-21 18:38:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] fuse: Support fuse filesystems outside of init_user_ns

On Fri, Nov 21, 2014 at 10:14 AM, Eric W. Biederman
<[email protected]> wrote:
> - Tweak the file capability code to look at s_user_ns and treat it
> properly.
>

> - Tweak the security checks to allow setting file capabilities and
> other security xattrs if we have the appropriate capabilities in
> s_user_ns.
>

Thinking about this some more, what do you mean by tweaking the file
capability code to look at s_user_ns and treat it properly?

I think that the semantics should be that cap_inode_setxattr should
check ns_capable wrt s_user_ns, but that the fscap *consumer* should
check the mount as in my may_suid patch (and maybe also check
s_user_ns). There is legacy code that starts a FUSE server as global
root, mounts the thing in a mount namespace belonging to an
unprivileged user ns, and (I think) hands the /dev/fuse fd to that
unprivileged code.

Without the mount ns check, that FUSE server can take over the system.
With the mount ns check, it's safe.

--Andy

>
> When those bits are done we can tweak the fuse patches to also set
> s_user_ns.
>
> As for MNT_NO_SUID if fuse wants to enforce that in some way. I don't
> particularly care, but I don't think that makes sense as a vfs property.
>
> Eric



--
Andy Lutomirski
AMA Capital Management, LLC