LinuxLists.cc - [PATCH v2 0/7] Initial support for user namespace owned mounts

2015-08-10 21:05:43

Subject: [PATCH v2 0/7] Initial support for user namespace owned mounts

This series adds support for the idea of superblocks owned by
unprivileged user namespaces. This is initially used to simplify the
MNT_NODEV handling for unpivileged mounts, and the ultimate goal is to
allow mounting of additional filesystem types from unprivileged
containers. As such these are the first in a larger set of patches, with
the full series (so far) available at:

git://kernel.ubuntu.com/sforshee/linux.git userns-mounts

The strategy for the series as a whole is to do as much of the heavy
lifting as possible in the vfs to avoid the need to handle any weird
edge cases in the code for individual filesystems. These patches lay the
initial groundwork and fall into a two broad groups:

1. Patches 1-2 add s_ser_ns and simplify MNT_NODEV handling.

2. Patches 3-7 tighten down security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and updates for SELinux and Smack to avoid using security
labels from untrusted mounts.

Note that no updates are included for the LSMs which do not store
security labels on disk. As far as I can tell these security modules are
not susceptible to attack via unprivileged mounts because no security
metadata can be injected into the system via these mounts. I have cc-ed
the maintainers of these LSMs and would appreciate confirmation that
this is the case.

Also note that this only addresses security at the vfs level. As has
been discussed previously, individual filesystems may still be
vulnerable to attacks via maliciuos metadata in the backing store. The
goal is to find a small set of filesystems that can be hardened against
attacks from below. At minimum fuse has been designed to resist such
attacks.

Changes since v1:
- Check current_user_ns instead of mount ns owner for MNT_LOCK_NODEV
check in fs_fully_visible.
- Add check of s_user_ns in mnt_may_suid.
- Improve handling of superblocks from unprivileged users in SELinux
and Smack.
- Add permission checks for block device paths passed to mount by
unprivileged users.

Andy Lutomirski (1):
fs: Treat foreign mounts as nosuid

Eric W. Biederman (1):
userns: Simpilify MNT_NODEV handling.

Seth Forshee (5):
fs: Add user namesapace member to struct super_block
fs: Verify access of user towards block device file when mounting
fs: Limit file caps to the user namespace of the super block
Smack: Add support for unprivileged mounts from user namespaces
selinux: Add support for unprivileged mounts from user namespaces

drivers/mtd/mtdsuper.c | 7 +++++-
fs/block_dev.c | 54 +++++++++++++++++++++++++++++++++---------
fs/exec.c | 2 +-
fs/namei.c | 9 ++++++-
fs/namespace.c | 34 +++++++++++++++-----------
fs/proc/root.c | 3 ++-
fs/super.c | 38 +++++++++++++++++++++++++----
include/linux/fs.h | 11 ++++++++-
include/linux/mount.h | 1 +
include/linux/user_namespace.h | 8 +++++++
kernel/user_namespace.c | 14 +++++++++++
security/commoncap.c | 4 +++-
security/selinux/hooks.c | 25 ++++++++++++++++++-
security/smack/smack.h | 6 +++++
security/smack/smack_lsm.c | 35 ++++++++++++++++++++-------
15 files changed, 206 insertions(+), 45 deletions(-)

2015-08-10 21:07:36

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 1/7] fs: Add user namesapace member to struct super_block

Initially this will be used to eliminate the implicit MNT_NODEV
flag for mounts from user namespaces. In the future it will also
be used for translating ids and checking capabilities for
filesystems mounted from user namespaces.

s_user_ns is initialized in alloc_super() and is generally set to
current_user_ns(). To avoid security and corruption issues, two
additional mount checks are also added:

- do_new_mount() gains a check that the user has CAP_SYS_ADMIN
in current_user_ns().

- sget() will fail with EBUSY when the filesystem it's looking
for is already mounted from another user namespace.

proc requires some special handling. The user namespace of
current isn't appropriate when forking as a result of clone (2)
with CLONE_NEWPID|CLONE_NEWUSER, as it will set s_user_ns to the
namespace of the parent and make proc unmountable in the new user
namespace. Instead, the user namespace which owns the new pid
namespace is used. sget_userns() is allowed to allow passing in
a namespace other than that of current, and sget becomes a
wrapper around sget_userns() which passes current_user_ns().

Signed-off-by: Seth Forshee <[email protected]>
---
fs/namespace.c | 3 +++
fs/proc/root.c | 3 ++-
fs/super.c | 38 +++++++++++++++++++++++++++++++++-----
include/linux/fs.h | 9 ++++++++-
4 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 0570729c87fd..d023a353dc63 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2381,6 +2381,9 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
struct vfsmount *mnt;
int err;

+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return -EPERM;
+
if (!fstype)
return -EINVAL;

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 361ab4ee42fc..4b302cbf13f9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -117,7 +117,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
return ERR_PTR(-EPERM);
}

- sb = sget(fs_type, proc_test_super, proc_set_super, flags, ns);
+ sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+ ns->user_ns, ns);
if (IS_ERR(sb))
return ERR_CAST(sb);

diff --git a/fs/super.c b/fs/super.c
index b61372354f2b..b5f171aadbf7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -33,6 +33,7 @@
#include <linux/cleancache.h>
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
+#include <linux/user_namespace.h>
#include "internal.h"

@@ -148,6 +149,7 @@ static void destroy_super(struct super_block *s)
list_lru_destroy(&s->s_inode_lru);
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_counter_destroy(&s->s_writers.counter[i]);
+ put_user_ns(s->s_user_ns);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
@@ -163,7 +165,8 @@ static void destroy_super(struct super_block *s)
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags)
+static struct super_block *alloc_super(struct file_system_type *type, int flags,
+ struct user_namespace *user_ns)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -231,6 +234,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+
+ s->s_user_ns = get_user_ns(user_ns);
return s;

fail:
@@ -427,17 +432,17 @@ void generic_shutdown_super(struct super_block *sb)
EXPORT_SYMBOL(generic_shutdown_super);

/**
- * sget - find or create a superblock
+ * sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
* @test: comparison callback
* @set: setup callback
* @flags: mount flags
* @data: argument to each of them
*/
-struct super_block *sget(struct file_system_type *type,
+struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
- int flags,
+ int flags, struct user_namespace *user_ns,
void *data)
{
struct super_block *s = NULL;
@@ -450,6 +455,10 @@ retry:
hlist_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
+ if (user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ return ERR_PTR(-EBUSY);
+ }
if (!grab_super(old))
goto retry;
if (s) {
@@ -462,7 +471,7 @@ retry:
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, flags);
+ s = alloc_super(type, flags, user_ns);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -485,6 +494,25 @@ retry:
return s;
}

+EXPORT_SYMBOL(sget_userns);
+
+/**
+ * sget - find or create a superblock
+ * @type: filesystem type superblock should belong to
+ * @test: comparison callback
+ * @set: setup callback
+ * @flags: mount flags
+ * @data: argument to each of them
+ */
+struct super_block *sget(struct file_system_type *type,
+ int (*test)(struct super_block *,void *),
+ int (*set)(struct super_block *,void *),
+ int flags,
+ void *data)
+{
+ return sget_userns(type, test, set, flags, current_user_ns(), data);
+}
+
EXPORT_SYMBOL(sget);

void drop_super(struct super_block *sb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fbd780c33c5f..6242e9629fe6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/user_namespace.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1369,6 +1370,8 @@ struct super_block {
struct workqueue_struct *s_dio_done_wq;
struct hlist_head s_pins;

+ struct user_namespace *s_user_ns;
+
/*
* Keep the lru lists last in the structure so they always sit on their
* own individual cachelines.
@@ -1499,7 +1502,6 @@ static inline void sb_start_intwrite(struct super_block *sb)
__sb_start_write(sb, SB_FREEZE_FS, true);
}

-
extern bool inode_owner_or_capable(const struct inode *inode);

/*
@@ -1975,6 +1977,11 @@ void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_userns(struct file_system_type *type,
+ int (*test)(struct super_block *,void *),
+ int (*set)(struct super_block *,void *),
+ int flags, struct user_namespace *user_ns,
+ void *data);
struct super_block *sget(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
--
1.9.1

2015-08-10 21:07:34

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 2/7] userns: Simpilify MNT_NODEV handling.

From: "Eric W. Biederman" <[email protected]>

- Consolidate the testing if a device node may be opened in a new
function may_open_dev.

- Move the check for allowing access to device nodes on filesystems
not mounted in the initial user namespace from mount time to open
time and include it in may_open_dev.

This set of changes removes the implicit adding of MNT_NODEV which
simplifies the logic in fs/namespace.c and removes a potentially
problematic user visible difference in how normal and unprivileged
mount namespaces work.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/block_dev.c | 2 +-
fs/namei.c | 9 ++++++++-
fs/namespace.c | 18 ++++--------------
include/linux/fs.h | 1 +
4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 198243717da5..f8ce371c437c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1729,7 +1729,7 @@ struct block_device *lookup_bdev(const char *pathname)
if (!S_ISBLK(inode->i_mode))
goto fail;
error = -EACCES;
- if (path.mnt->mnt_flags & MNT_NODEV)
+ if (!may_open_dev(&path))
goto fail;
error = -ENOMEM;
bdev = bd_acquire(inode);
diff --git a/fs/namei.c b/fs/namei.c
index fbbcf0993312..59444c066f47 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2640,6 +2640,13 @@ int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
}
EXPORT_SYMBOL(vfs_create);

+bool may_open_dev(const struct path *path)
+{
+ return !(path->mnt->mnt_flags & MNT_NODEV) &&
+ ((path->mnt->mnt_sb->s_user_ns == &init_user_ns) ||
+ (path->mnt->mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT));
+}
+
static int may_open(struct path *path, int acc_mode, int flag)
{
struct dentry *dentry = path->dentry;
@@ -2662,7 +2669,7 @@ static int may_open(struct path *path, int acc_mode, int flag)
break;
case S_IFBLK:
case S_IFCHR:
- if (path->mnt->mnt_flags & MNT_NODEV)
+ if (!may_open_dev(path))
return -EACCES;
/*FALLTHRU*/
case S_IFIFO:
diff --git a/fs/namespace.c b/fs/namespace.c
index d023a353dc63..e48fa1c23378 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2177,13 +2177,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
!(mnt_flags & MNT_NODEV)) {
- /* Was the nodev implicitly added in mount? */
- if ((mnt->mnt_ns->user_ns != &init_user_ns) &&
- !(sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) {
- mnt_flags |= MNT_NODEV;
- } else {
- return -EPERM;
- }
+ return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
!(mnt_flags & MNT_NOSUID)) {
@@ -2396,13 +2390,6 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
put_filesystem(type);
return -EPERM;
}
- /* Only in special cases allow devices from mounts
- * created outside the initial user namespace.
- */
- if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
- flags |= MS_NODEV;
- mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
- }
if (type->fs_flags & FS_USERNS_VISIBLE) {
if (!fs_fully_visible(type, &mnt_flags))
return -EPERM;
@@ -3238,6 +3225,9 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
mnt_flags = mnt->mnt.mnt_flags;
if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
+ if (current_user_ns() != &init_user_ns &&
+ !(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT))
+ mnt_flags &= ~(MNT_LOCK_NODEV);

/* Verify the mount flags are equal to or more permissive
* than the proposed new mount.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6242e9629fe6..4a60e76f86ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1527,6 +1527,7 @@ extern void dentry_unhash(struct dentry *dentry);
*/
extern void inode_init_owner(struct inode *inode, const struct inode *dir,
umode_t mode);
+extern bool may_open_dev(const struct path *path);
/*
* VFS FS_IOC_FIEMAP helper definitions.
*/
--
1.9.1

2015-08-10 21:07:07

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 3/7] fs: Verify access of user towards block device file when mounting

When mounting a filesystem on a block device there is currently
no verification that the user has appropriate access to the
device file passed to mount. This has not been an issue so far
since the user in question has always been root, but this must
be changed before allowing unprivileged users to mount in user
namespaces.

To do this, a new version of lookup_bdev() is added named
lookup_bdev_perm(). Both of these functions become wrappers
around a common inner fucntion. The behavior of lookup_bdev() is
unchanged, but calling lookup_bdev_perm() will fail if the user
does not have the specified access rights to the supplied path.
The permission check is skipped if the user has CAP_SYS_ADMIN to
avoid any possible regressions in behavior.

blkdev_get_by_path() is updated to use lookup_bdev_perm(). This
is used by mount_bdev() and mount_mtd(), so this will cause
mounts on block devices to fail when the user lacks the required
permissions. Other calls to blkdev_get_by_path() will all happen
with root privileges, so these calls will be unaffected.

Signed-off-by: Seth Forshee <[email protected]>
---
drivers/mtd/mtdsuper.c | 7 ++++++-
fs/block_dev.c | 52 ++++++++++++++++++++++++++++++++++++++++----------
include/linux/fs.h | 1 +
3 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index 20c02a3b7417..a3970b0506b9 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -125,6 +125,7 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
#ifdef CONFIG_BLOCK
struct block_device *bdev;
int ret, major;
+ int perm = 0;
#endif
int mtdnr;

@@ -176,7 +177,11 @@ struct dentry *mount_mtd(struct file_system_type *fs_type, int flags,
/* try the old way - the hack where we allowed users to mount
* /dev/mtdblock$(n) but didn't actually _use_ the blockdev
*/
- bdev = lookup_bdev(dev_name);
+ if (flags & FMODE_READ)
+ perm |= MAY_READ;
+ if (flags & FMODE_WRITE)
+ perm |= MAY_WRITE;
+ bdev = lookup_bdev_perm(dev_name, perm);
if (IS_ERR(bdev)) {
ret = PTR_ERR(bdev);
pr_debug("MTDSB: lookup_bdev() returned %d\n", ret);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index f8ce371c437c..11979b9e7086 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1393,9 +1393,14 @@ struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
void *holder)
{
struct block_device *bdev;
+ int perm = 0;
int err;

- bdev = lookup_bdev(path);
+ if (mode & FMODE_READ)
+ perm |= MAY_READ;
+ if (mode & FMODE_WRITE)
+ perm |= MAY_WRITE;
+ bdev = lookup_bdev_perm(path, perm);
if (IS_ERR(bdev))
return bdev;

@@ -1702,15 +1707,8 @@ int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)

EXPORT_SYMBOL(ioctl_by_bdev);

-/**
- * lookup_bdev - lookup a struct block_device by name
- * @pathname: special file representing the block device
- *
- * Get a reference to the blockdevice at @pathname in the current
- * namespace if possible and return it. Return ERR_PTR(error)
- * otherwise.
- */
-struct block_device *lookup_bdev(const char *pathname)
+static struct block_device *__lookup_bdev(const char *pathname, int mask,
+ bool check_perm)
{
struct block_device *bdev;
struct inode *inode;
@@ -1725,6 +1723,11 @@ struct block_device *lookup_bdev(const char *pathname)
return ERR_PTR(error);

inode = d_backing_inode(path.dentry);
+ if (check_perm && !capable(CAP_SYS_ADMIN)) {
+ error = inode_permission(inode, mask);
+ if (error)
+ goto fail;
+ }
error = -ENOTBLK;
if (!S_ISBLK(inode->i_mode))
goto fail;
@@ -1742,6 +1745,35 @@ fail:
bdev = ERR_PTR(error);
goto out;
}
+
+/*
+ * lookup_bdev_perm - lookup a struct block_device by name and check for
+ * access rights
+ * @pathname: special file representing the block device
+ * @mask: rights to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Get a reference to the blockdevice at @pathname in the current
+ * namespace if possible, check for the specified rights to the device
+ * at @pathname, and return it. Return ERR_PTR(error) otherwise.
+ */
+struct block_device *lookup_bdev_perm(const char *pathname, int mask)
+{
+ return __lookup_bdev(pathname, mask, true);
+}
+EXPORT_SYMBOL(lookup_bdev_perm);
+
+/**
+ * lookup_bdev - lookup a struct block_device by name
+ * @pathname: special file representing the block device
+ *
+ * Get a reference to the blockdevice at @pathname in the current
+ * namespace if possible and return it. Return ERR_PTR(error)
+ * otherwise.
+ */
+struct block_device *lookup_bdev(const char *pathname)
+{
+ return __lookup_bdev(pathname, 0, false);
+}
EXPORT_SYMBOL(lookup_bdev);

int __invalidate_device(struct block_device *bdev, bool kill_dirty)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a60e76f86ac..99417367f3b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2374,6 +2374,7 @@ static inline void unregister_chrdev(unsigned int major, const char *name)
#define BLKDEV_MAJOR_HASH_SIZE 255
extern const char *__bdevname(dev_t, char *buffer);
extern const char *bdevname(struct block_device *bdev, char *buffer);
+extern struct block_device *lookup_bdev_perm(const char *, int);
extern struct block_device *lookup_bdev(const char *);
extern void blkdev_show(struct seq_file *,off_t);

--
1.9.1

2015-08-10 21:05:51

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 4/7] fs: Limit file caps to the user namespace of the super block

Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.

Add a new helper function, in_user_ns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
---
include/linux/user_namespace.h | 8 ++++++++
kernel/user_namespace.c | 14 ++++++++++++++
security/commoncap.c | 2 ++
3 files changed, 24 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8297e5b341d8..a43faa727124 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
extern int proc_setgroups_show(struct seq_file *m, void *v);
extern bool userns_may_setgroups(const struct user_namespace *ns);
+extern bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns);
#else

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
{
return true;
}
+
+static inline bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ return true;
+}
#endif

#endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4109f8320684..2b043876d5f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -944,6 +944,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
return allowed;
}

+/*
+ * Returns true if @ns is the same namespace as or a descendant of
+ * @target_ns.
+ */
+bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ for (; ns; ns = ns->parent) {
+ if (ns == target_ns)
+ return true;
+ }
+ return false;
+}
+
static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
return container_of(ns, struct user_namespace, ns);
diff --git a/security/commoncap.c b/security/commoncap.c
index d103f5a4043d..175ab497e810 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -439,6 +439,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c

if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
return 0;
+ if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
+ return 0;

rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
--
1.9.1

2015-08-10 21:06:41

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 5/7] fs: Treat foreign mounts as nosuid

From: Andy Lutomirski <[email protected]>

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem. Prevent
this by treating mounts from other mount namespaces and those not
owned by current_user_ns() or an ancestor as nosuid.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendents. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileges.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.

Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Seth Forshee <[email protected]>
---
fs/exec.c | 2 +-
fs/namespace.c | 13 +++++++++++++
include/linux/mount.h | 1 +
security/commoncap.c | 2 +-
security/selinux/hooks.c | 2 +-
5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index b06623a9347f..ea7311d72cc3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1295,7 +1295,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return;

if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index e48fa1c23378..710aaa05cdaa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3276,6 +3276,19 @@ found:
return visible;
}

+bool mnt_may_suid(struct vfsmount *mnt)
+{
+ /*
+ * Foreign mounts (accessed via fchdir or through /proc
+ * symlinks) are always treated as if they are nosuid. This
+ * prevents namespaces from trusting potentially unsafe
+ * suid/sgid bits, file caps, or security labels that originate
+ * in other namespaces.
+ */
+ return !(mnt->mnt_flags & MNT_NOSUID) && check_mnt(real_mount(mnt)) &&
+ in_userns(current_user_ns(), mnt->mnt_sb->s_user_ns);
+}
+
static struct ns_common *mntns_get(struct task_struct *task)
{
struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..54a594d49733 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
extern struct vfsmount *mnt_clone_internal(struct path *path);
extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);

struct path;
extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/security/commoncap.c b/security/commoncap.c
index 175ab497e810..858d86a1b73c 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -437,7 +437,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (!file_caps_enabled)
return 0;

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c5c49d..459e71ddbc9d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2137,7 +2137,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
const struct task_security_struct *new_tsec)
{
int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
- int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+ int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
int rc;

if (!nnp && !nosuid)
--
1.9.1

2015-08-10 21:06:35

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 6/7] Smack: Add support for unprivileged mounts from user namespaces

Security labels from unprivileged mounts cannot be trusted.
Ideally for these mounts we would assign the objects in the
filesystem the same label as the inode for the backing device
passed to mount. Unfortunately it's currently impossible to
determine which inode this is from the LSM mount hooks, so we
settle for the label of the process doing the mount.

This label is assigned to s_root, and also to smk_default to
ensure that new inodes receive this label. The transmute property
is also set on s_root to make this behavior more explicit, even
though it is technically not necessary.

If a filesystem has existing security labels, access to inodes is
permitted if the label is the same as smk_root, otherwise access
is denied. The SMACK64EXEC xattr is completely ignored.

Explicit setting of security labels continues to require
CAP_MAC_ADMIN in init_user_ns.

Altogether, this ensures that filesystem objects are not
accessible to subjects which cannot already access the backing
store, that MAC is not violated for any objects in the fileystem
which are already labeled, and that a user cannot use an
unprivileged mount to gain elevated MAC privileges.

sysfs, tmpfs, and ramfs are already mountable from user
namespaces and support security labels. We can't rule out the
possibility that these filesystems may already be used in mounts
from user namespaces with security lables set from the init
namespace, so failing to trust lables in these filesystems may
introduce regressions. It is safe to trust labels from these
filesystems, since the unprivileged user does not control the
backing store and thus cannot supply security labels, so an
explicit exception is made to trust labels from these
filesystems.

Signed-off-by: Seth Forshee <[email protected]>
---
security/smack/smack.h | 6 ++++++
security/smack/smack_lsm.c | 35 +++++++++++++++++++++++++++--------
2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/security/smack/smack.h b/security/smack/smack.h
index 244e035e5a99..473cfc355a8d 100644
--- a/security/smack/smack.h
+++ b/security/smack/smack.h
@@ -76,8 +76,14 @@ struct superblock_smack {
struct smack_known *smk_hat;
struct smack_known *smk_default;
int smk_initialized;
+ int smk_flags;
};

+/*
+ * Superblock flags
+ */
+#define SMK_SB_UNTRUSTED 0x01
+
struct socket_smack {
struct smack_known *smk_out; /* outbound label */
struct smack_known *smk_in; /* inbound label */
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..70ccc7796bee 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,17 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
skp = smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
+ /*
+ * For a handful of fs types with no user-controlled
+ * backing store it's okay to trust security labels
+ * in the filesystem. The rest are untrusted.
+ */
+ if (sb->s_user_ns != &init_user_ns &&
+ sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+ sb->s_magic != RAMFS_MAGIC) {
+ transmute = 1;
+ sp->smk_flags |= SMK_SB_UNTRUSTED;
+ }
}
/*
* Initialize the root inode.
@@ -1012,6 +1023,7 @@ static int smack_inode_rename(struct inode *old_inode,
*/
static int smack_inode_permission(struct inode *inode, int mask)
{
+ struct superblock_smack *sbsp = inode->i_sb->s_security;
struct smk_audit_info ad;
int no_block = mask & MAY_NOT_BLOCK;
int rc;
@@ -1023,6 +1035,11 @@ static int smack_inode_permission(struct inode *inode, int mask)
if (mask == 0)
return 0;

+ if (sbsp->smk_flags & SMK_SB_UNTRUSTED) {
+ if (smk_of_inode(inode) != sbsp->smk_root)
+ return -EACCES;
+ }
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
@@ -3220,14 +3237,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- /*
- * Don't let the exec or mmap label be "*" or "@".
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
+ if (!(sbsp->smk_flags & SMK_SB_UNTRUSTED)) {
+ /*
+ * Don't let the exec or mmap label be "*" or "@".
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;
+ }

skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||
--
1.9.1

2015-08-10 21:06:15

by Seth Forshee

[permalink] [raw]

Subject: [PATCH v2 7/7] selinux: Add support for unprivileged mounts from user namespaces

Security labels from unprivileged mounts in user namespaces must
be ignored. Force superblocks from user namespaces whose labeling
behavior is to use xattrs to use mountpoint labeling instead.
For the mountpoint label, default to converting the current task
context into a form suitable for file objects, but also allow the
policy writer to specify a different label through policy
transition rules.

Pieced together from code snippets provided by Stephen Smalley.

Signed-off-by: Seth Forshee <[email protected]>
---
security/selinux/hooks.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 459e71ddbc9d..242dac0b8b24 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -745,6 +745,28 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ rc = security_transition_sid(current_sid(), current_sid(),
+ SECCLASS_FILE, NULL,
+ &sbsec->mntpoint_sid);
+ if (rc)
+ goto out;
+ }
+ goto out_set_opts;
+ }
+
/* sets the context of the superblock for the fs being mounted. */
if (fscontext_sid) {
rc = may_context_mount_sb_relabel(fscontext_sid, sbsec, cred);
@@ -813,6 +835,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}

+out_set_opts:
rc = sb_finish_set_opts(sb);
out:
mutex_unlock(&sbsec->lock);
--
1.9.1

2015-08-12 16:01:39

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH v2 0/7] Initial support for user namespace owned mounts

Seth Forshee <[email protected]> writes:

> This series adds support for the idea of superblocks owned by
> unprivileged user namespaces. This is initially used to simplify the
> MNT_NODEV handling for unpivileged mounts, and the ultimate goal is to
> allow mounting of additional filesystem types from unprivileged
> containers. As such these are the first in a larger set of patches, with
> the full series (so far) available at:
>
> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts

Seth just a quick note to let you know this isn't forgotten. I have a
big pile of things to work through this week, so probably the first
chance I will have to look at your updated patches is during the boring
bits of Linux Plumbers Conf.

Eric

2015-08-17 16:35:23

by Seth Forshee

[permalink] [raw]

Subject: Re: [PATCH v2 2/7] userns: Simpilify MNT_NODEV handling.

On Mon, Aug 10, 2015 at 04:05:13PM -0500, Seth Forshee wrote:
> From: "Eric W. Biederman" <[email protected]>
>
> - Consolidate the testing if a device node may be opened in a new
> function may_open_dev.
>
> - Move the check for allowing access to device nodes on filesystems
> not mounted in the initial user namespace from mount time to open
> time and include it in may_open_dev.
>
> This set of changes removes the implicit adding of MNT_NODEV which
> simplifies the logic in fs/namespace.c and removes a potentially
> problematic user visible difference in how normal and unprivileged
> mount namespaces work.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/block_dev.c | 2 +-
> fs/namei.c | 9 ++++++++-
> fs/namespace.c | 18 ++++--------------
> include/linux/fs.h | 1 +
> 4 files changed, 14 insertions(+), 16 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 198243717da5..f8ce371c437c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1729,7 +1729,7 @@ struct block_device *lookup_bdev(const char *pathname)
> if (!S_ISBLK(inode->i_mode))
> goto fail;
> error = -EACCES;
> - if (path.mnt->mnt_flags & MNT_NODEV)
> + if (!may_open_dev(&path))
> goto fail;
> error = -ENOMEM;
> bdev = bd_acquire(inode);
> diff --git a/fs/namei.c b/fs/namei.c
> index fbbcf0993312..59444c066f47 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2640,6 +2640,13 @@ int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
> }
> EXPORT_SYMBOL(vfs_create);
>
> +bool may_open_dev(const struct path *path)
> +{
> + return !(path->mnt->mnt_flags & MNT_NODEV) &&
> + ((path->mnt->mnt_sb->s_user_ns == &init_user_ns) ||
> + (path->mnt->mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT));
> +}
> +
> static int may_open(struct path *path, int acc_mode, int flag)
> {
> struct dentry *dentry = path->dentry;
> @@ -2662,7 +2669,7 @@ static int may_open(struct path *path, int acc_mode, int flag)
> break;
> case S_IFBLK:
> case S_IFCHR:
> - if (path->mnt->mnt_flags & MNT_NODEV)
> + if (!may_open_dev(path))
> return -EACCES;
> /*FALLTHRU*/
> case S_IFIFO:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d023a353dc63..e48fa1c23378 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2177,13 +2177,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
> }
> if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
> !(mnt_flags & MNT_NODEV)) {
> - /* Was the nodev implicitly added in mount? */
> - if ((mnt->mnt_ns->user_ns != &init_user_ns) &&
> - !(sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) {
> - mnt_flags |= MNT_NODEV;
> - } else {
> - return -EPERM;
> - }
> + return -EPERM;
> }
> if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
> !(mnt_flags & MNT_NOSUID)) {
> @@ -2396,13 +2390,6 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
> put_filesystem(type);
> return -EPERM;
> }
> - /* Only in special cases allow devices from mounts
> - * created outside the initial user namespace.
> - */
> - if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
> - flags |= MS_NODEV;
> - mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
> - }
> if (type->fs_flags & FS_USERNS_VISIBLE) {
> if (!fs_fully_visible(type, &mnt_flags))
> return -EPERM;
> @@ -3238,6 +3225,9 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
> mnt_flags = mnt->mnt.mnt_flags;
> if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
> mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
> + if (current_user_ns() != &init_user_ns &&
> + !(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT))
> + mnt_flags &= ~(MNT_LOCK_NODEV);

Oops, this should be s_user_ns and not current_user_ns(). I think I had
to change it when I had made changes to make sysfs, etc. use s_user_ns =
&init_user_ns and then frogot to change it back.

Seth