2015-07-15 19:47:18

by Seth Forshee

Subject: [PATCH 0/7] Initial support for user namespace owned mounts

These are the first in a larger set of patches that I've been working on
(with help from Eric Biederman) to support mounting ext4 and fuse
filesystems from within user namespaces. I've pushed the full series to:

git://kernel.ubuntu.com/sforshee/linux.git userns-mounts

Taking the series as a whole, the strategy is to handle as much of the
heavy lifting as possible in the vfs so the filesystems don't have to
handle weird edge cases. If you look at the full series you'll find that
the changes in ext4 to support user namespace mounts turn out to be
fairly minimal (fuse is a bit more complicated though as it must deal
with translating ids for a userspace process which is running in pid and
user namespaces).

The patches I'm sending today lay some of the groundwork in the vfs and
related code. They fall into two broad groups:

1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
pretty straightforward, and Eric has expressed interest in merging
these patches soon. Note that patch 2 won't apply cleanly without
Eric's noexec patches for proc and sys [1].

2. Patches 3-7 tighten security for mounts with s_user_ns !=
&init_user_ns. This includes updates to how file caps and suid are
handled and LSM updates to ignore security labels on superblocks
from non-init namespaces.

The LSM changes in particular may not be optimal, as I don't have a
lot of familiarity with this code, so I'd be especially appreciative
of review of these changes and suggestions on how to improve them.

Subsequent patches will update the vfs for id translation, handling
various corner cases, giving privileges to the user namespace which owns
a superblock, and finally supporting user namespace mounts for ext4 and
fuse.

Thanks,
Seth

[1] http://lkml.kernel.org/r/[email protected]


Andy Lutomirski (1):
fs: Treat foreign mounts as nosuid

Eric W. Biederman (1):
userns: Simpilify MNT_NODEV handling.

Seth Forshee (5):
fs: Add user namesapace member to struct super_block
fs: Ignore file caps in mounts from other user namespaces
security: Restrict security attribute updates for userns mounts
selinux: Ignore security labels on user namespace mounts
smack: Don't use security labels for user namespace mounts

fs/block_dev.c | 2 +-
fs/exec.c | 2 +-
fs/namei.c | 9 ++++++++-
fs/namespace.c | 34 ++++++++++++++++++++--------------
fs/proc/root.c | 3 ++-
fs/super.c | 38 +++++++++++++++++++++++++++++++++-----
include/linux/fs.h | 9 +++++++++
include/linux/mount.h | 1 +
include/linux/user_namespace.h | 8 ++++++++
kernel/user_namespace.c | 14 ++++++++++++++
security/commoncap.c | 4 +++-
security/security.c | 10 +++++++++-
security/selinux/hooks.c | 16 +++++++++++++++-
security/smack/smack_lsm.c | 12 ++++++++++--
14 files changed, 134 insertions(+), 28 deletions(-)


2015-07-15 19:49:18

by Seth Forshee

Subject: [PATCH 1/7] fs: Add user namesapace member to struct super_block

Initially this will be used to eliminate the implicit MNT_NODEV
flag for mounts from user namespaces. In the future it will also
be used for translating ids and checking capabilities for
filesystems mounted from user namespaces.

s_user_ns is initialized in alloc_super() and is generally set to
current_user_ns(). To avoid security and corruption issues, two
additional mount checks are also added:

- do_new_mount() gains a check that the user has CAP_SYS_ADMIN
in current_user_ns().

- sget() will fail with EBUSY when the filesystem it's looking
for is already mounted from another user namespace.

proc needs some special handling here. The user namespace of
current isn't appropriate when forking as a result of clone(2)
with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
from within the new user namespace. Instead, the user namespace
which owns the new pid namespace should be used. sget_userns() is
added to allow passing of a user namespace other than that of
current, and this is used by proc_mount(). sget() becomes a
wrapper around sget_userns() which passes current_user_ns().
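
As a rough illustration of the sget() check described above, here is a
userspace sketch. The struct names, the integer key standing in for the
test() callback, and find_super() itself are stand-ins invented for this
example; this is not the kernel code, and it leaves out locking and
allocation entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real structs. */
struct user_ns { int id; };

struct super {
	int key;                          /* what test() would compare */
	const struct user_ns *s_user_ns;  /* namespace that first mounted it */
	struct super *next;
};

/*
 * Model of the lookup loop in sget_userns(): returns 0 and stores the
 * matching superblock in *out, -EBUSY if a match is owned by a different
 * user namespace, or -ENOENT if nothing matches (the point at which the
 * kernel would allocate a new superblock instead).
 */
static int find_super(struct super *head, int key,
		      const struct user_ns *user_ns, struct super **out)
{
	assert(user_ns != NULL && out != NULL);

	for (struct super *s = head; s; s = s->next) {
		if (s->key != key)
			continue;
		if (s->s_user_ns != user_ns)
			return -EBUSY;
		*out = s;
		return 0;
	}
	return -ENOENT;
}
```

In this model a second mount of the same filesystem from the owning
namespace shares the superblock, while the same request from any other
namespace is refused rather than silently sharing it.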

Signed-off-by: Seth Forshee <[email protected]>
---
fs/namespace.c | 3 +++
fs/proc/root.c | 3 ++-
fs/super.c | 38 +++++++++++++++++++++++++++++++++-----
include/linux/fs.h | 8 ++++++++
4 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ce428cadd41f..f1f67d663d49 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2357,6 +2357,9 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
struct vfsmount *mnt;
int err;

+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return -EPERM;
+
if (!fstype)
return -EINVAL;

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 361ab4ee42fc..4b302cbf13f9 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -117,7 +117,8 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
return ERR_PTR(-EPERM);
}

- sb = sget(fs_type, proc_test_super, proc_set_super, flags, ns);
+ sb = sget_userns(fs_type, proc_test_super, proc_set_super, flags,
+ ns->user_ns, ns);
if (IS_ERR(sb))
return ERR_CAST(sb);

diff --git a/fs/super.c b/fs/super.c
index b61372354f2b..b5f171aadbf7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -33,6 +33,7 @@
#include <linux/cleancache.h>
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
+#include <linux/user_namespace.h>
#include "internal.h"


@@ -148,6 +149,7 @@ static void destroy_super(struct super_block *s)
list_lru_destroy(&s->s_inode_lru);
for (i = 0; i < SB_FREEZE_LEVELS; i++)
percpu_counter_destroy(&s->s_writers.counter[i]);
+ put_user_ns(s->s_user_ns);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
@@ -163,7 +165,8 @@ static void destroy_super(struct super_block *s)
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type, int flags)
+static struct super_block *alloc_super(struct file_system_type *type, int flags,
+ struct user_namespace *user_ns)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -231,6 +234,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+
+ s->s_user_ns = get_user_ns(user_ns);
return s;

fail:
@@ -427,17 +432,17 @@ void generic_shutdown_super(struct super_block *sb)
EXPORT_SYMBOL(generic_shutdown_super);

/**
- * sget - find or create a superblock
+ * sget_userns - find or create a superblock
* @type: filesystem type superblock should belong to
* @test: comparison callback
* @set: setup callback
* @flags: mount flags
* @data: argument to each of them
*/
-struct super_block *sget(struct file_system_type *type,
+struct super_block *sget_userns(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
- int flags,
+ int flags, struct user_namespace *user_ns,
void *data)
{
struct super_block *s = NULL;
@@ -450,6 +455,10 @@ retry:
hlist_for_each_entry(old, &type->fs_supers, s_instances) {
if (!test(old, data))
continue;
+ if (user_ns != old->s_user_ns) {
+ spin_unlock(&sb_lock);
+ return ERR_PTR(-EBUSY);
+ }
if (!grab_super(old))
goto retry;
if (s) {
@@ -462,7 +471,7 @@ retry:
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type, flags);
+ s = alloc_super(type, flags, user_ns);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -485,6 +494,25 @@ retry:
return s;
}

+EXPORT_SYMBOL(sget_userns);
+
+/**
+ * sget - find or create a superblock
+ * @type: filesystem type superblock should belong to
+ * @test: comparison callback
+ * @set: setup callback
+ * @flags: mount flags
+ * @data: argument to each of them
+ */
+struct super_block *sget(struct file_system_type *type,
+ int (*test)(struct super_block *,void *),
+ int (*set)(struct super_block *,void *),
+ int flags,
+ void *data)
+{
+ return sget_userns(type, test, set, flags, current_user_ns(), data);
+}
+
EXPORT_SYMBOL(sget);

void drop_super(struct super_block *sb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42912f8d286e..1876477ac9f8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/user_namespace.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1353,6 +1354,8 @@ struct super_block {
struct workqueue_struct *s_dio_done_wq;
struct hlist_head s_pins;

+ struct user_namespace *s_user_ns;
+
/*
* Keep the lru lists last in the structure so they always sit on their
* own individual cachelines.
@@ -1959,6 +1962,11 @@ void deactivate_locked_super(struct super_block *sb);
int set_anon_super(struct super_block *s, void *data);
int get_anon_bdev(dev_t *);
void free_anon_bdev(dev_t);
+struct super_block *sget_userns(struct file_system_type *type,
+ int (*test)(struct super_block *,void *),
+ int (*set)(struct super_block *,void *),
+ int flags, struct user_namespace *user_ns,
+ void *data);
struct super_block *sget(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
--
1.9.1

2015-07-15 19:47:22

by Seth Forshee

Subject: [PATCH 2/7] userns: Simpilify MNT_NODEV handling.

From: "Eric W. Biederman" <[email protected]>

- Consolidate the testing if a device node may be opened in a new
function may_open_dev.

- Move the check for allowing access to device nodes on filesystems
not mounted in the initial user namespace from mount time to open
time and include it in may_open_dev.

This set of changes removes the implicit adding of MNT_NODEV which
simplifies the logic in fs/namespace.c and removes a potentially
problematic user visible difference in how normal and unprivileged
mount namespaces work.
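
The open-time decision can be sketched in userspace as follows. The flag
values and the mount_model struct are made up for the example (the
kernel's real values and structures differ), and the initial namespace is
passed in explicitly instead of being the global init_user_ns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; the kernel's actual values differ. */
#define MNT_NODEV           0x1u
#define FS_USERNS_DEV_MOUNT 0x2u

struct user_ns { int id; };

/* Hypothetical flattened view of a vfsmount plus its superblock. */
struct mount_model {
	unsigned int mnt_flags;           /* per-mount flags */
	unsigned int fs_flags;            /* filesystem type flags */
	const struct user_ns *s_user_ns;  /* owner of the superblock */
};

/*
 * Device nodes may be opened only if the mount is not nodev and the
 * superblock either belongs to the initial user namespace or its
 * filesystem type is explicitly trusted with devices.
 */
static bool may_open_dev_model(const struct mount_model *m,
			       const struct user_ns *init_ns)
{
	assert(m != NULL && init_ns != NULL);

	return !(m->mnt_flags & MNT_NODEV) &&
	       (m->s_user_ns == init_ns ||
		(m->fs_flags & FS_USERNS_DEV_MOUNT));
}
```

The point of the sketch is that the namespace test no longer needs a
hidden MNT_NODEV flag; the visible mount flags stay exactly as the user
set them.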

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/block_dev.c | 2 +-
fs/namei.c | 9 ++++++++-
fs/namespace.c | 18 ++++--------------
include/linux/fs.h | 1 +
4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 198243717da5..f8ce371c437c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1729,7 +1729,7 @@ struct block_device *lookup_bdev(const char *pathname)
if (!S_ISBLK(inode->i_mode))
goto fail;
error = -EACCES;
- if (path.mnt->mnt_flags & MNT_NODEV)
+ if (!may_open_dev(&path))
goto fail;
error = -ENOMEM;
bdev = bd_acquire(inode);
diff --git a/fs/namei.c b/fs/namei.c
index ae4e4c18b2ac..87c54cb34dce 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2635,6 +2635,13 @@ int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
}
EXPORT_SYMBOL(vfs_create);

+bool may_open_dev(const struct path *path)
+{
+ return !(path->mnt->mnt_flags & MNT_NODEV) &&
+ ((path->mnt->mnt_sb->s_user_ns == &init_user_ns) ||
+ (path->mnt->mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT));
+}
+
static int may_open(struct path *path, int acc_mode, int flag)
{
struct dentry *dentry = path->dentry;
@@ -2657,7 +2664,7 @@ static int may_open(struct path *path, int acc_mode, int flag)
break;
case S_IFBLK:
case S_IFCHR:
- if (path->mnt->mnt_flags & MNT_NODEV)
+ if (!may_open_dev(path))
return -EACCES;
/*FALLTHRU*/
case S_IFIFO:
diff --git a/fs/namespace.c b/fs/namespace.c
index f1f67d663d49..423001de32a2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2153,13 +2153,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
!(mnt_flags & MNT_NODEV)) {
- /* Was the nodev implicitly added in mount? */
- if ((mnt->mnt_ns->user_ns != &init_user_ns) &&
- !(sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT)) {
- mnt_flags |= MNT_NODEV;
- } else {
- return -EPERM;
- }
+ return -EPERM;
}
if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
!(mnt_flags & MNT_NOSUID)) {
@@ -2372,13 +2366,6 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
put_filesystem(type);
return -EPERM;
}
- /* Only in special cases allow devices from mounts
- * created outside the initial user namespace.
- */
- if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
- flags |= MS_NODEV;
- mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
- }
if (type->fs_flags & FS_USERNS_VISIBLE) {
if (!fs_fully_visible(type, &mnt_flags))
return -EPERM;
@@ -3214,6 +3201,9 @@ static bool fs_fully_visible(struct file_system_type *type, int *new_mnt_flags)
mnt_flags = mnt->mnt.mnt_flags;
if (mnt->mnt.mnt_sb->s_iflags & SB_I_NOEXEC)
mnt_flags &= ~(MNT_LOCK_NOSUID | MNT_LOCK_NOEXEC);
+ if (mnt->mnt.mnt_sb->s_user_ns != &init_user_ns &&
+ !(mnt->mnt.mnt_sb->s_type->fs_flags & FS_USERNS_DEV_MOUNT))
+ mnt_flags &= ~(MNT_LOCK_NODEV);

/* Verify the mount flags are equal to or more permissive
* than the proposed new mount.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1876477ac9f8..a0db522196ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1512,6 +1512,7 @@ extern void dentry_unhash(struct dentry *dentry);
*/
extern void inode_init_owner(struct inode *inode, const struct inode *dir,
umode_t mode);
+extern bool may_open_dev(const struct path *path);
/*
* VFS FS_IOC_FIEMAP helper definitions.
*/
--
1.9.1

2015-07-15 19:48:48

by Seth Forshee

Subject: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

Capability sets attached to files must be ignored except in the
user namespaces where the mounter is privileged, i.e. s_user_ns
and its descendants. Otherwise a vector exists for gaining
privileges in namespaces where a user is not already privileged.

Add a new helper function, in_userns(), to test whether a user
namespace is the same as or a descendant of another namespace.
Use this helper to determine whether a file's capability set
should be applied to the caps constructed during exec.
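
For illustration, the ancestor walk can be modelled in userspace like
this. The struct below carries only a parent pointer and the _model
function names are invented for the example; the walk itself mirrors the
helper added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: only the parent link matters for this check. */
struct user_ns { const struct user_ns *parent; };

/* True if ns is target_ns itself or one of its descendants. */
static bool in_userns_model(const struct user_ns *ns,
			    const struct user_ns *target_ns)
{
	assert(target_ns != NULL);

	for (; ns; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/*
 * File capabilities are honoured only when the task's namespace is the
 * superblock's s_user_ns or a descendant of it.
 */
static bool file_caps_apply_model(const struct user_ns *task_ns,
				  const struct user_ns *s_user_ns)
{
	return in_userns_model(task_ns, s_user_ns);
}
```

Note the direction of the test: a task in a child namespace honours caps
on a filesystem mounted by an ancestor, but a task in the parent ignores
caps on a filesystem mounted from the child.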

Signed-off-by: Seth Forshee <[email protected]>
---
include/linux/user_namespace.h | 8 ++++++++
kernel/user_namespace.c | 14 ++++++++++++++
security/commoncap.c | 2 ++
3 files changed, 24 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8297e5b341d8..a43faa727124 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
extern int proc_setgroups_show(struct seq_file *m, void *v);
extern bool userns_may_setgroups(const struct user_namespace *ns);
+extern bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns);
#else

static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
@@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
{
return true;
}
+
+static inline bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ return true;
+}
#endif

#endif /* _LINUX_USER_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 4109f8320684..2b043876d5f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -944,6 +944,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
return allowed;
}

+/*
+ * Returns true if @ns is the same namespace as or a descendant of
+ * @target_ns.
+ */
+bool in_userns(const struct user_namespace *ns,
+ const struct user_namespace *target_ns)
+{
+ for (; ns; ns = ns->parent) {
+ if (ns == target_ns)
+ return true;
+ }
+ return false;
+}
+
static inline struct user_namespace *to_user_ns(struct ns_common *ns)
{
return container_of(ns, struct user_namespace, ns);
diff --git a/security/commoncap.c b/security/commoncap.c
index d103f5a4043d..175ab497e810 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -439,6 +439,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c

if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
return 0;
+ if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
+ return 0;

rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
--
1.9.1

2015-07-15 19:48:26

by Seth Forshee

Subject: [PATCH 4/7] fs: Treat foreign mounts as nosuid

From: Andy Lutomirski <[email protected]>

If a process gets access to a mount from a different user
namespace, that process should not be able to take advantage of
setuid files or selinux entrypoints from that filesystem.
Technically, trusting mounts created by the same or ancestor user
namespaces ought to be safe, but it's simpler to distrust all
foreign mounts.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace were to clear nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendants. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileged.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

As a hardening measure, this would have made CVE-2014-5207 much
more difficult to exploit.
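
A userspace sketch of the resulting check, with made-up types and an
illustrative flag value (the caller's mount namespace is passed in
explicitly here, where the kernel reads it from current):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative value only; the kernel's MNT_NOSUID differs. */
#define MNT_NOSUID 0x1u

/* Hypothetical stand-ins for the mount namespace and the mount. */
struct mnt_ns { int id; };

struct mount_model {
	unsigned int mnt_flags;
	const struct mnt_ns *mnt_ns;  /* namespace the mount belongs to */
};

/*
 * suid/sgid bits, file caps and security label transitions are trusted
 * only if the mount lives in the caller's own mount namespace and is not
 * marked nosuid.
 */
static bool mnt_may_suid_model(const struct mount_model *mnt,
			       const struct mnt_ns *current_mnt_ns)
{
	assert(mnt != NULL && current_mnt_ns != NULL);

	return mnt->mnt_ns == current_mnt_ns &&
	       !(mnt->mnt_flags & MNT_NOSUID);
}
```

A mount reached through another namespace is therefore treated as nosuid
regardless of its flags.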

Signed-off-by: Andy Lutomirski <[email protected]>
[ saf: Forward ported to 4.2 ]
Signed-off-by: Seth Forshee <[email protected]>
---
fs/exec.c | 2 +-
fs/namespace.c | 13 +++++++++++++
include/linux/mount.h | 1 +
security/commoncap.c | 2 +-
security/selinux/hooks.c | 2 +-
5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index b06623a9347f..ea7311d72cc3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1295,7 +1295,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return;

if (task_no_new_privs(current))
diff --git a/fs/namespace.c b/fs/namespace.c
index 423001de32a2..2bfd7ca92247 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3252,6 +3252,19 @@ found:
return visible;
}

+bool mnt_may_suid(struct vfsmount *mnt)
+{
+ /*
+ * Foreign mounts (accessed via fchdir or through /proc
+ * symlinks) are always treated as if they are nosuid. This
+ * prevents namespaces from trusting potentially unsafe
+ * suid/sgid bits, file caps, or security labels that originate
+ * in other namespaces.
+ */
+ return real_mount(mnt)->mnt_ns == current->nsproxy->mnt_ns &&
+ !(mnt->mnt_flags & MNT_NOSUID);
+}
+
static struct ns_common *mntns_get(struct task_struct *task)
{
struct ns_common *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index f822c3c11377..54a594d49733 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
extern struct vfsmount *mntget(struct vfsmount *mnt);
extern struct vfsmount *mnt_clone_internal(struct path *path);
extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);

struct path;
extern struct vfsmount *clone_private_mount(struct path *path);
diff --git a/security/commoncap.c b/security/commoncap.c
index 175ab497e810..858d86a1b73c 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -437,7 +437,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (!file_caps_enabled)
return 0;

- if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+ if (!mnt_may_suid(bprm->file->f_path.mnt))
return 0;
if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
return 0;
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c5c49d..459e71ddbc9d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2137,7 +2137,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
const struct task_security_struct *new_tsec)
{
int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
- int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
+ int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
int rc;

if (!nnp && !nosuid)
--
1.9.1

2015-07-15 19:48:01

by Seth Forshee

Subject: [PATCH 5/7] security: Restrict security attribute updates for userns mounts

Respecting security labels for mounts from user namespaces may
allow unprivileged users to introduce security labels into the
system. To stop this from happening prevent calling the
inode_post_setxattr, inode_setsecurity, inode_notifysecctx, and
inode_setsecctx hooks when s_user_ns != init_user_ns. There's no
purpose in actually blocking setting of these xattrs, as (for rw
mounts at least) the user must have write access to the
underlying filesystem and could set the xattrs by other means.
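
The gating pattern, sketched in userspace. The types, the hook typedef,
and counting_hook() are all invented for this example; it only shows that
the LSM hook is never reached for a superblock owned by a non-init
namespace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct user_ns { int id; };
struct sb_model { const struct user_ns *s_user_ns; };

/* Simplified shape of an LSM hook for this example. */
typedef int (*setsec_hook)(const char *name);

/* Example hook that just records how often it was reached. */
static int hook_calls;

static int counting_hook(const char *name)
{
	(void)name;
	hook_calls++;
	return 0;
}

/*
 * Model of the wrapper: refuse with -EOPNOTSUPP before the hook runs
 * when the superblock is not owned by the initial user namespace.
 */
static int setsecurity_model(const struct sb_model *sb,
			     const struct user_ns *init_ns,
			     setsec_hook hook, const char *name)
{
	assert(sb != NULL && init_ns != NULL && hook != NULL);

	if (sb->s_user_ns != init_ns)
		return -EOPNOTSUPP;
	return hook(name);
}
```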

Signed-off-by: Seth Forshee <[email protected]>
---
security/security.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..980710baa8f9 100644
--- a/security/security.c
+++ b/security/security.c
@@ -653,7 +653,9 @@ void security_inode_post_setxattr(struct dentry *dentry, const char *name,
{
if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
return;
- call_void_hook(inode_post_setxattr, dentry, name, value, size, flags);
+ if (dentry->d_inode->i_sb->s_user_ns == &init_user_ns)
+ call_void_hook(inode_post_setxattr, dentry, name, value, size,
+ flags);
evm_inode_post_setxattr(dentry, name, value, size);
}

@@ -712,6 +714,8 @@ int security_inode_getsecurity(const struct inode *inode, const char *name, void

int security_inode_setsecurity(struct inode *inode, const char *name, const void *value, size_t size, int flags)
{
+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -EOPNOTSUPP;
if (unlikely(IS_PRIVATE(inode)))
return -EOPNOTSUPP;
return call_int_hook(inode_setsecurity, -EOPNOTSUPP, inode, name,
@@ -1168,12 +1172,16 @@ EXPORT_SYMBOL(security_release_secctx);

int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen)
{
+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -EOPNOTSUPP;
return call_int_hook(inode_notifysecctx, 0, inode, ctx, ctxlen);
}
EXPORT_SYMBOL(security_inode_notifysecctx);

int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen)
{
+ if (dentry->d_inode->i_sb->s_user_ns != &init_user_ns)
+ return -EOPNOTSUPP;
return call_int_hook(inode_setsecctx, 0, dentry, ctx, ctxlen);
}
EXPORT_SYMBOL(security_inode_setsecctx);
--
1.9.1

2015-07-15 19:47:33

by Seth Forshee

Subject: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

Unprivileged users should not be able to supply security labels
in filesystems, nor should they be able to supply security
contexts in unprivileged mounts. For any mount where s_user_ns is
not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
and return EPERM if any contexts are supplied in the mount
options.
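
The intended control flow, as a userspace sketch. The enum, the mnt_opts
struct, and the fallback to xattr labeling are placeholders for this
example; the real function consults policy to pick a behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative labeling behaviors; not the kernel's constants. */
enum fs_use { FS_USE_UNSET = 0, FS_USE_XATTR, FS_USE_NONE };

struct user_ns { int id; };

/* Hypothetical parsed mount options: a nonzero SID means "supplied". */
struct mnt_opts {
	unsigned int context_sid;
	unsigned int fscontext_sid;
	unsigned int rootcontext_sid;
	unsigned int defcontext_sid;
};

/*
 * For a user namespace mount, reject any supplied context with -EPERM
 * and otherwise force "no labeling". Init-namespace mounts fall through
 * to the normal path, reduced here to a placeholder default.
 */
static int set_mnt_opts_model(const struct user_ns *s_user_ns,
			      const struct user_ns *init_ns,
			      const struct mnt_opts *opts,
			      enum fs_use *behavior)
{
	assert(s_user_ns && init_ns && opts && behavior);

	if (s_user_ns != init_ns) {
		if (opts->context_sid || opts->fscontext_sid ||
		    opts->rootcontext_sid || opts->defcontext_sid)
			return -EPERM;
		*behavior = FS_USE_NONE;
		return 0;
	}

	if (*behavior == FS_USE_UNSET)
		*behavior = FS_USE_XATTR;  /* placeholder for policy lookup */
	return 0;
}
```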

Signed-off-by: Seth Forshee <[email protected]>
---
security/selinux/hooks.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 459e71ddbc9d..eeb71e45ab82 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -732,6 +732,19 @@ static int selinux_set_mnt_opts(struct super_block *sb,
!strcmp(sb->s_type->name, "pstore"))
sbsec->flags |= SE_SBGENFS;

+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid)
+ return -EPERM;
+ sbsec->behavior = SECURITY_FS_USE_NONE;
+ goto out_set_opts;
+ }
+
+
if (!sbsec->behavior) {
/*
* Determine the labeling behavior to use for this
@@ -813,6 +826,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}

+out_set_opts:
rc = sb_finish_set_opts(sb);
out:
mutex_unlock(&sbsec->lock);
--
1.9.1

2015-07-15 19:47:31

by Seth Forshee

Subject: [PATCH 7/7] smack: Don't use security labels for user namespace mounts

Avoid use of untrusted security labels when s_user_ns !=
init_user_ns:
- smk_fetch: refuse to read labels from disk
- smack_inode_init_security: return -ENOTSUPP
- smack_d_instantiate: don't use security xattrs from disk

Signed-off-by: Seth Forshee <[email protected]>
---
security/smack/smack_lsm.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..6a849da94f47 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,9 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
char *buffer;
struct smack_known *skp = NULL;

+ if (ip->i_sb->s_user_ns != &init_user_ns)
+ return NULL;
+
if (ip->i_op->getxattr == NULL)
return ERR_PTR(-EOPNOTSUPP);

@@ -833,6 +836,9 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir,
struct smack_known *dsp = smk_of_inode(dir);
int may;

+ if (inode->i_sb->s_user_ns != &init_user_ns)
+ return -ENOTSUPP;
+
if (name)
*name = XATTR_SMACK_SUFFIX;

@@ -3176,11 +3182,13 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
}
/*
* No xattr support means, alas, no SMACK label.
- * Use the aforeapplied default.
+ * Use the aforeapplied default. Also don't use
+ * xattrs from userns mounts.
* It would be curious if the label of the task
* does not match that assigned.
*/
- if (inode->i_op->getxattr == NULL)
+ if (inode->i_sb->s_user_ns != &init_user_ns ||
+ inode->i_op->getxattr == NULL)
break;
/*
* Get the dentry for xattr.
--
1.9.1

2015-07-15 20:36:29

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/15/2015 12:46 PM, Seth Forshee wrote:
> These are the first in a larger set of patches that I've been working on
> (with help from Eric Biederman) to support mounting ext4 and fuse
> filesystems from within user namespaces. I've pushed the full series to:
>
> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>
> Taking the series as a whole, the strategy is to handle as much of the
> heavy lifting as possible in the vfs so the filesystems don't have to
> handle weird edge cases. If you look at the full series you'll find that
> the changes in ext4 to support user namespace mounts turn out to be
> fairly minimal (fuse is a bit more complicated though as it must deal
> with translating ids for a userspace process which is running in pid and
> user namespaces).
>
> The patches I'm sending today lay some of the groundwork in the vfs and
> related code. They fall into two broad groups:
>
> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
> pretty straightforward, and Eric has expressed interest in merging
> these patches soon. Note that patch 2 won't apply cleanly without
> Eric's noexec patches for proc and sys [1].
>
> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
> &init_user_ns. This includes updates to how file caps and suid are
> handled and LSM updates to ignore security labels on superblocks
> from non-init namespaces.
>
> The LSM changes in particular may not be optimal, as I don't have a
> lot of familiarity with this code, so I'd be especially appreciative
> of review of these changes and suggestions on how to improve them.

Lukasz Pawelczyk <[email protected]> proposed
LSM support in user namespaces ([RFC] lsm: namespace hooks)
that make a whole lot more sense than just turning off
the option of using labels on files. Gutting the ability
to use MAC in a namespace is a step down the road of
making MAC and namespaces incompatible.



>
> Subsequent patches will update the vfs for id translation, handling
> various corner cases, giving privileges to the user namsepace which owns
> a superblock, and finally supporting user namespace mounts for ext4 and
> fuse.
>
> Thanks,
> Seth
>
> [1] http://lkml.kernel.org/r/[email protected]
>
>
> Andy Lutomirski (1):
> fs: Treat foreign mounts as nosuid
>
> Eric W. Biederman (1):
> userns: Simpilify MNT_NODEV handling.
>
> Seth Forshee (5):
> fs: Add user namesapace member to struct super_block
> fs: Ignore file caps in mounts from other user namespaces
> security: Restrict security attribute updates for userns mounts
> selinux: Ignore security labels on user namespace mounts
> smack: Don't use security labels for user namespace mounts
>
> fs/block_dev.c | 2 +-
> fs/exec.c | 2 +-
> fs/namei.c | 9 ++++++++-
> fs/namespace.c | 34 ++++++++++++++++++++--------------
> fs/proc/root.c | 3 ++-
> fs/super.c | 38 +++++++++++++++++++++++++++++++++-----
> include/linux/fs.h | 9 +++++++++
> include/linux/mount.h | 1 +
> include/linux/user_namespace.h | 8 ++++++++
> kernel/user_namespace.c | 14 ++++++++++++++
> security/commoncap.c | 4 +++-
> security/security.c | 10 +++++++++-
> security/selinux/hooks.c | 16 +++++++++++++++-
> security/smack/smack_lsm.c | 12 ++++++++++--
> 14 files changed, 134 insertions(+), 28 deletions(-)
>

2015-07-15 20:43:41

by Casey Schaufler

Subject: Re: [PATCH 7/7] smack: Don't use security labels for user namespace mounts

On 7/15/2015 12:46 PM, Seth Forshee wrote:
> Avoid use of untrusted security labels when s_user_ns !=
> init_user_ns:
> - smk_fetch: refuse to read labels from disk
> - smack_inode_init_security: return -ENOTSUPP
> - smack_d_instantiate: don't use security xattrs from disk
>
> Signed-off-by: Seth Forshee <[email protected]>

I do not like this at all at all. Pretending that Smack
doesn't exist in a user namespace can lead to all sorts
of blatant security violations, both while the filesystem
is mounted in the namespace and in the init namespace.

> ---
> security/smack/smack_lsm.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a143328f75eb..6a849da94f47 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -255,6 +255,9 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> char *buffer;
> struct smack_known *skp = NULL;
>
> + if (ip->i_sb->s_user_ns != &init_user_ns)
> + return NULL;
> +
> if (ip->i_op->getxattr == NULL)
> return ERR_PTR(-EOPNOTSUPP);
>
> @@ -833,6 +836,9 @@ static int smack_inode_init_security(struct inode *inode, struct inode *dir,
> struct smack_known *dsp = smk_of_inode(dir);
> int may;
>
> + if (inode->i_sb->s_user_ns != &init_user_ns)
> + return -ENOTSUPP;
> +
> if (name)
> *name = XATTR_SMACK_SUFFIX;
>
> @@ -3176,11 +3182,13 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
> }
> /*
> * No xattr support means, alas, no SMACK label.
> - * Use the aforeapplied default.
> + * Use the aforeapplied default. Also don't use
> + * xattrs from userns mounts.
> * It would be curious if the label of the task
> * does not match that assigned.
> */
> - if (inode->i_op->getxattr == NULL)
> + if (inode->i_sb->s_user_ns != &init_user_ns ||
> + inode->i_op->getxattr == NULL)
> break;
> /*
> * Get the dentry for xattr.

2015-07-15 21:13:06

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Casey Schaufler <[email protected]> writes:

> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>> These are the first in a larger set of patches that I've been working on
>> (with help from Eric Biederman) to support mounting ext4 and fuse
>> filesystems from within user namespaces. I've pushed the full series to:
>>
>> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>
>> Taking the series as a whole, the strategy is to handle as much of the
>> heavy lifting as possible in the vfs so the filesystems don't have to
>> handle weird edge cases. If you look at the full series you'll find that
>> the changes in ext4 to support user namespace mounts turn out to be
>> fairly minimal (fuse is a bit more complicated though as it must deal
>> with translating ids for a userspace process which is running in pid and
>> user namespaces).
>>
>> The patches I'm sending today lay some of the groundwork in the vfs and
>> related code. They fall into two broad groups:
>>
>> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>> pretty straightforward, and Eric has expressed interest in merging
>> these patches soon. Note that patch 2 won't apply cleanly without
>> Eric's noexec patches for proc and sys [1].
>>
>> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>> &init_user_ns. This includes updates to how file caps and suid are
>> handled and LSM updates to ignore security labels on superblocks
>> from non-init namespaces.
>>
>> The LSM changes in particular may not be optimal, as I don't have a
>> lot of familiarity with this code, so I'd be especially appreciative
>> of review of these changes and suggestions on how to improve them.
>
> Lukasz Pawelczyk <[email protected]> proposed
> LSM support in user namespaces ([RFC] lsm: namespace hooks)
> that make a whole lot more sense than just turning off
> the option of using labels on files. Gutting the ability
> to use MAC in a namespace is a step down the road of
> making MAC and namespaces incompatible.

This is not "turning off the option to use labels on files".

This is supporting mounting filesystems like ext4 by unprivileged users
and not trusting the labels they set in the same way as we trust labels
on filesystems mounted by privileged users.

The first step needs to be not trusting those labels and treating such
filesystems as filesystems without label support. I hope that is what Seth
has implemented.

In the long run we can do more interesting things with such filesystems
once the appropriate LSM policy is in place.

Getting s_user_ns present on struct super_block, properly set, and all of the
appropriate checks against it present in the vfs so that filesystems
don't need to duplicate logic is important if we are going to do more
interesting things with user namespaces (as users have been asking for).

It is important for things as small as making it safe to allow
truly unprivileged users to mount fuse filesystems.

I am on the fence with Lukasz Pawelczyk's patches. Some parts I liked,
some parts I had issues with. As I recall, one of my issues was that
those patches conflicted in detail if not in principle with this
approach.

If these patches do not do a good job of laying the groundwork for
supporting security labels that unprivileged users can set, then Seth
could really use some feedback. Figuring out how to properly deal with
the LSMs has been one of his challenges.

I am hoping I can finish working through the patches to fix the
semantics of rename and bind mounts before the next merge window opens,
so I can have enough cycles to lift the feature freeze on user
namespaces. Except for maybe his first two patches (which fix a small
userspace API breakage) none of Seth's patches get to go in until I lift
the freeze.

Which is probably too much information but I hope this makes it clear
that the point of this work is as an enabler for future developments,
not as something to make user namespaces and LSMs incompatible.

Eric

2015-07-15 21:49:18

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <[email protected]> writes:
>
> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
> >> These are the first in a larger set of patches that I've been working on
> >> (with help from Eric Biederman) to support mounting ext4 and fuse
> >> filesystems from within user namespaces. I've pushed the full series to:
> >>
> >> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
> >>
> >> Taking the series as a whole, the strategy is to handle as much of the
> >> heavy lifting as possible in the vfs so the filesystems don't have to
> >> handle weird edge cases. If you look at the full series you'll find that
> >> the changes in ext4 to support user namespace mounts turn out to be
> >> fairly minimal (fuse is a bit more complicated though as it must deal
> >> with translating ids for a userspace process which is running in pid and
> >> user namespaces).
> >>
> >> The patches I'm sending today lay some of the groundwork in the vfs and
> >> related code. They fall into two broad groups:
> >>
> >> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
> >> pretty straightforward, and Eric has expressed interest in merging
> >> these patches soon. Note that patch 2 won't apply cleanly without
> >> Eric's noexec patches for proc and sys [1].
> >>
> >> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
> >> &init_user_ns. This includes updates to how file caps and suid are
> >> handled and LSM updates to ignore security labels on superblocks
> >> from non-init namespaces.
> >>
> >> The LSM changes in particular may not be optimal, as I don't have a
> >> lot of familiarity with this code, so I'd be especially appreciative
> >> of review of these changes and suggestions on how to improve them.
> >
> > Lukasz Pawelczyk <[email protected]> proposed
> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
> > that make a whole lot more sense than just turning off
> > the option of using labels on files. Gutting the ability
> > to use MAC in a namespace is a step down the road of
> > making MAC and namespaces incompatible.
>
> This is not "turning off the option to use labels on files".
>
> This is supporting mounting filesystems like ext4 by unprivileged users
> and not trusting the labels they set in the same way as we trust labels
> on filesystems mounted by privileged users.
>
> The first step needs to be not trusting those labels and treating such
> filesystems as filesystems without label support. I hope that is what Seth
> has implemented.
>
> In the long run we can do more interesting things with such filesystems
> once the appropriate LSM policy is in place.

Yes, this exactly. Right now it looks to me like the only safe thing to
do with mounts from unprivileged users is to ignore the security labels,
so that's what I'm trying to do with these changes. If there's some
better thing to do, or some better way to do it, I'm more than happy to
receive that feedback.

Seth

2015-07-15 21:48:52

by Serge E. Hallyn

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 02:46:04PM -0500, Seth Forshee wrote:
> Capability sets attached to files must be ignored except in the
> user namespaces where the mounter is privileged, i.e. s_user_ns
> and its descendants. Otherwise a vector exists for gaining
> privileges in namespaces where a user is not already privileged.
>
> Add a new helper function, in_userns(), to test whether a user
> namespace is the same as or a descendant of another namespace.
> Use this helper to determine whether a file's capability set
> should be applied to the caps constructed during exec.
>
> Signed-off-by: Seth Forshee <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

I think it's an ok behavior, though let's just go over the
alternatives.

It might actually be ok to simply require that the user_ns be
equal. If I unshare a new userns in which a different uid is
mapped to root, I may not want file capabilities to be granted
to tasks in that ns. (On the other hand, I might be creating
a new user_ns specifically to not have a uid 0 mapped into it
at all, and only have file capabilities grant privilege)

Conversely, suppose I unshare one user_ns with a MS_SHARED mnt_ns, mount
an ext4fs, and then (from the parent shell) unshare another user_ns
with the same mapping, intending it to be a "peer" to the first one
I'd unshared and to be able to use the ext4fs it mounted. This won't
work here. That's probably best - the appropriate thing to do was
to attach to the existing user_ns. But it could end up being
limiting in some special cases, so I'm bringing it up here.

Again I think what you have here is the simplest and most sensible
choice, so ack.
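
To make the alternatives concrete, here is a minimal userspace sketch,
not kernel code. struct toy_userns and both toy_ functions are
hypothetical stand-ins; only the parent walk is taken from the
in_userns() helper in the patch quoted below. It contrasts the
descendant-or-equal test the patch uses with the stricter equality test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct user_namespace: only the parent link matters. */
struct toy_userns {
	const struct toy_userns *parent;	/* NULL for the initial namespace */
};

/* Descendant-or-equal test, mirroring the walk in the patch's in_userns(). */
bool toy_in_userns(const struct toy_userns *ns,
		   const struct toy_userns *target_ns)
{
	for (; ns; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/* Stricter alternative: honor file caps only in the mounting namespace. */
bool toy_same_userns(const struct toy_userns *ns,
		     const struct toy_userns *target_ns)
{
	return ns == target_ns;
}
```

Given a mounter namespace, a child of it, and a peer created from the
same parent, toy_in_userns() accepts the mounter and the child but
rejects the peer, which is the "peer" limitation described above.
toy_same_userns() would reject the child as well.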

> ---
> include/linux/user_namespace.h | 8 ++++++++
> kernel/user_namespace.c | 14 ++++++++++++++
> security/commoncap.c | 2 ++
> 3 files changed, 24 insertions(+)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 8297e5b341d8..a43faa727124 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -72,6 +72,8 @@ extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t,
> extern ssize_t proc_setgroups_write(struct file *, const char __user *, size_t, loff_t *);
> extern int proc_setgroups_show(struct seq_file *m, void *v);
> extern bool userns_may_setgroups(const struct user_namespace *ns);
> +extern bool in_userns(const struct user_namespace *ns,
> + const struct user_namespace *target_ns);
> #else
>
> static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
> @@ -100,6 +102,12 @@ static inline bool userns_may_setgroups(const struct user_namespace *ns)
> {
> return true;
> }
> +
> +static inline bool in_userns(const struct user_namespace *ns,
> + const struct user_namespace *target_ns)
> +{
> + return true;
> +}
> #endif
>
> #endif /* _LINUX_USER_H */
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 4109f8320684..2b043876d5f0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -944,6 +944,20 @@ bool userns_may_setgroups(const struct user_namespace *ns)
> return allowed;
> }
>
> +/*
> + * Returns true if @ns is the same namespace as or a descendant of
> + * @target_ns.
> + */
> +bool in_userns(const struct user_namespace *ns,
> + const struct user_namespace *target_ns)
> +{
> + for (; ns; ns = ns->parent) {
> + if (ns == target_ns)
> + return true;
> + }
> + return false;
> +}
> +
> static inline struct user_namespace *to_user_ns(struct ns_common *ns)
> {
> return container_of(ns, struct user_namespace, ns);
> diff --git a/security/commoncap.c b/security/commoncap.c
> index d103f5a4043d..175ab497e810 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -439,6 +439,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>
> if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> return 0;
> + if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
> + return 0;
>
> rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> if (rc < 0) {
> --
> 1.9.1

2015-07-15 21:51:10

by Andy Lutomirski

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 2:48 PM, Serge E. Hallyn <[email protected]> wrote:
> On Wed, Jul 15, 2015 at 02:46:04PM -0500, Seth Forshee wrote:
>> Capability sets attached to files must be ignored except in the
>> user namespaces where the mounter is privileged, i.e. s_user_ns
>> and its descendants. Otherwise a vector exists for gaining
>> privileges in namespaces where a user is not already privileged.
>>
>> Add a new helper function, in_userns(), to test whether a user
>> namespace is the same as or a descendant of another namespace.
>> Use this helper to determine whether a file's capability set
>> should be applied to the caps constructed during exec.
>>
>> Signed-off-by: Seth Forshee <[email protected]>
>
> Acked-by: Serge Hallyn <[email protected]>
>
> I think it's an ok behavior, though let's just go over the
> alternatives.
>
> It might actually be ok to simply require that the user_ns be
> equal. If I unshare a new userns in which a different uid is
> mapped to root, I may not want file capabilities to be granted
> to tasks in that ns. (On the other hand, I might be creating
> a new user_ns specifically to not have a uid 0 mapped into it
> at all, and only have file capabilities grant privilege)
>
> Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
> an ext4fs, and then (from the parent shell) unshare another user_ns
> with the same mapping, intending it to be a "peer" to the first one
> I'd unshared and be able to use the ext4fs it mounted. This won't
> work here. That's probably best - the appropriate thing to do was
> to attach to the existing user_ns. But it could end up being
> limiting in some special cases, so I'm bringing it up here.
>
> Again I think what you have here is the simplest and most sensible
> choice, so ack.
>

I think I'm missing something. Why is this separate from mount_may_suid?

I can see why it would make sense to check s_user_ns (or maybe
s_user_ns *and* the vfsmount namespace) in mount_may_suid, but I don't
see why we need separate checks.

--Andy

2015-07-15 22:34:34

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Seth Forshee <[email protected]> writes:

> On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <[email protected]> writes:
>>
>> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
>> >> These are the first in a larger set of patches that I've been working on
>> >> (with help from Eric Biederman) to support mounting ext4 and fuse
>> >> filesystems from within user namespaces. I've pushed the full series to:
>> >>
>> >> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>> >>
>> >> Taking the series as a whole, the strategy is to handle as much of the
>> >> heavy lifting as possible in the vfs so the filesystems don't have to
>> >> handle weird edge cases. If you look at the full series you'll find that
>> >> the changes in ext4 to support user namespace mounts turn out to be
>> >> fairly minimal (fuse is a bit more complicated though as it must deal
>> >> with translating ids for a userspace process which is running in pid and
>> >> user namespaces).
>> >>
>> >> The patches I'm sending today lay some of the groundwork in the vfs and
>> >> related code. They fall into two broad groups:
>> >>
>> >> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>> >> pretty straightforward, and Eric has expressed interest in merging
>> >> these patches soon. Note that patch 2 won't apply cleanly without
>> >> Eric's noexec patches for proc and sys [1].
>> >>
>> >> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>> >> &init_user_ns. This includes updates to how file caps and suid are
>> >> handled and LSM updates to ignore security labels on superblocks
>> >> from non-init namespaces.
>> >>
>> >> The LSM changes in particular may not be optimal, as I don't have a
>> >> lot of familiarity with this code, so I'd be especially appreciative
>> >> of review of these changes and suggestions on how to improve them.
>> >
>> > Lukasz Pawelczyk <[email protected]> proposed
>> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
>> > that make a whole lot more sense than just turning off
>> > the option of using labels on files. Gutting the ability
>> > to use MAC in a namespace is a step down the road of
>> > making MAC and namespaces incompatible.
>>
>> This is not "turning off the option to use labels on files".
>>
>> This is supporting mounting filesystems like ext4 by unprivileged users
>> and not trusting the labels they set in the same way as we trust labels
>> on filesystems mounted by privileged users.
>>
>> The first step needs to be not trusting those labels and treating such
>> filesystems as filesystems without label support. I hope that is what Seth
>> has implemented.
>>
>> In the long run we can do more interesting things with such filesystems
>> once the appropriate LSM policy is in place.
>
> Yes, this exactly. Right now it looks to me like the only safe thing to
> do with mounts from unprivileged users is to ignore the security labels,
> so that's what I'm trying to do with these changes. If there's some
> better thing to do, or some better way to do it, I'm more than happy to
> receive that feedback.

Ugh.

This made me realize that we have an interesting problem here. An
unprivileged mount of tmpfs probably needs to have
s_user_ns == &init_user_ns.

Otherwise we will break security labels on tmpfs for no good reason.
ramfs and sysfs also seem to have similar concerns.

Because they have no backing store we can trust those filesystems with
security labels. Plus for at least sysfs there is the security label
bleed through issue, that we need to make certain works.

Perhaps these filesystems with trusted backing store need to call
"sget_userns(..., &init_user_ns)".

If we don't get this right we will have significant regressions with
respect to security labels, and that is not ok.
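
One way to picture this suggestion is the userspace sketch below. It is
an illustration under assumptions, not the series' implementation:
struct toy_fs_type, its trusted_backing_store flag, and both helper
functions are hypothetical names, and in the kernel the choice would be
made by whatever namespace each filesystem passes when it sets up its
superblock.

```c
#include <stdbool.h>

/* Toy stand-in for struct user_namespace. */
struct toy_userns {
	int id;
};

/* Toy stand-in for init_user_ns. */
struct toy_userns toy_init_user_ns = { 0 };

/* Hypothetical per-filesystem-type description. */
struct toy_fs_type {
	const char *name;
	bool trusted_backing_store;	/* e.g. tmpfs, ramfs, sysfs: no user-controlled disk */
};

/*
 * Pick the namespace recorded as s_user_ns. Filesystems whose contents
 * cannot be forged by the mounter keep the initial namespace, so their
 * security labels stay trusted even when mounted from a user namespace.
 */
const struct toy_userns *toy_pick_s_user_ns(const struct toy_fs_type *type,
					    const struct toy_userns *mounter_ns)
{
	return type->trusted_backing_store ? &toy_init_user_ns : mounter_ns;
}

/* The check the LSM patches in this series apply to a superblock. */
bool toy_labels_trusted(const struct toy_userns *s_user_ns)
{
	return s_user_ns == &toy_init_user_ns;
}
```

Under this model a tmpfs mounted from inside a container still has its
labels honored, while an ext4 image mounted from the same container
does not.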

Eric

2015-07-15 22:41:55

by Eric W. Biederman

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

Andy Lutomirski <[email protected]> writes:

> On Wed, Jul 15, 2015 at 2:48 PM, Serge E. Hallyn <[email protected]> wrote:
>> On Wed, Jul 15, 2015 at 02:46:04PM -0500, Seth Forshee wrote:
>>> Capability sets attached to files must be ignored except in the
>>> user namespaces where the mounter is privileged, i.e. s_user_ns
>>> and its descendants. Otherwise a vector exists for gaining
>>> privileges in namespaces where a user is not already privileged.
>>>
>>> Add a new helper function, in_userns(), to test whether a user
>>> namespace is the same as or a descendant of another namespace.
>>> Use this helper to determine whether a file's capability set
>>> should be applied to the caps constructed during exec.
>>>
>>> Signed-off-by: Seth Forshee <[email protected]>
>>
>> Acked-by: Serge Hallyn <[email protected]>
>>
>> I think it's an ok behavior, though let's just go over the
>> alternatives.
>>
>> It might actually be ok to simply require that the user_ns be
>> equal. If I unshare a new userns in which a different uid is
>> mapped to root, I may not want file capabilities to be granted
>> to tasks in that ns. (On the other hand, I might be creating
>> a new user_ns specifically to not have a uid 0 mapped into it
>> at all, and only have file capabilities grant privilege)
>>
>> Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
>> an ext4fs, and then (from the parent shell) unshare another user_ns
>> with the same mapping, intending it to be a "peer" to the first one
>> I'd unshared and be able to use the ext4fs it mounted. This won't
>> work here. That's probably best - the appropriate thing to do was
>> to attach to the existing user_ns. But it could end up being
>> limiting in some special cases, so I'm bringing it up here.
>>
>> Again I think what you have here is the simplest and most sensible
>> choice, so ack.
>>
>
> I think I'm missing something. Why is this separate from mount_may_suid?
>
> I can see why it would make sense to check s_user_ns (or maybe
> s_user_ns *and* the vfsmount namespace) in mount_may_suid, but I don't
> see why we need separate checks.

So I don't quite understand the concerns that led to your mnt_may_suid
patch. But in my limited understanding there are two distinct issues.

1) What do file capabilities mean on a filesystem mounted with user
namespace privileges, where the unprivileged user can control what
resides on disk?

That is what this patch should be about: defining that meaning and
restricting those permissions to unprivileged users.

2) The second issue, which I think your mnt_may_suid patch is about, seems
to be what to do if a mount winds up in a place we never intended,
a.k.a. leaks. I don't think any changes to mnt_may_suid are necessary
in that sense. However, they may be useful.

So I think your mnt_may_suid change may be worth having but so far it
seems unnecessary.

Which is a long way of saying this patch is fundamentally necessary,
and I am not certain about the mnt_may_suid patch.

Am I right in understanding its purpose? Or does this patch actually
succeed in obsoleting it?

Eric

2015-07-15 22:39:05

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
> Casey Schaufler <[email protected]> writes:
>
>> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>>> These are the first in a larger set of patches that I've been working on
>>> (with help from Eric Biederman) to support mounting ext4 and fuse
>>> filesystems from within user namespaces. I've pushed the full series to:
>>>
>>> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>>
>>> Taking the series as a whole, the strategy is to handle as much of the
>>> heavy lifting as possible in the vfs so the filesystems don't have to
>>> handle weird edge cases. If you look at the full series you'll find that
>>> the changes in ext4 to support user namespace mounts turn out to be
>>> fairly minimal (fuse is a bit more complicated though as it must deal
>>> with translating ids for a userspace process which is running in pid and
>>> user namespaces).
>>>
>>> The patches I'm sending today lay some of the groundwork in the vfs and
>>> related code. They fall into two broad groups:
>>>
>>> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>> pretty straightforward, and Eric has expressed interest in merging
>>> these patches soon. Note that patch 2 won't apply cleanly without
>>> Eric's noexec patches for proc and sys [1].
>>>
>>> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>>> &init_user_ns. This includes updates to how file caps and suid are
>>> handled and LSM updates to ignore security labels on superblocks
>>> from non-init namespaces.
>>>
>>> The LSM changes in particular may not be optimal, as I don't have a
>>> lot of familiarity with this code, so I'd be especially appreciative
>>> of review of these changes and suggestions on how to improve them.
>> Lukasz Pawelczyk <[email protected]> proposed
>> LSM support in user namespaces ([RFC] lsm: namespace hooks)
>> that make a whole lot more sense than just turning off
>> the option of using labels on files. Gutting the ability
>> to use MAC in a namespace is a step down the road of
>> making MAC and namespaces incompatible.
> This is not "turning off the option to use labels on files".

It gives an unprivileged user the ability to ignore
the Smack labels that are on files and to create files
with labels that do not match the rules laid down by the
security module.

> This is supporting mounting filesystems like ext4 by unprivileged users
> and not trusting the labels they set in the same way as we trust labels
> on filesystems mounted by privileged users.

OK, you don't trust the metadata on a filesystem mounted by an untrusted
user. That's fair.


> The first step needs to be not trusting those labels and treating such
> filesystems as filesystems without label support. I hope that is what Seth
> has implemented.

A filesystem with Smack labels gets mounted in a namespace. The labels
are ignored. Instead, the filesystem defaults (potentially specified as
mount options smackfsdef="something", but usually the floor label ("_"))
are used, giving the user the ability to read everything and (usually)
change nothing. This is both dangerous (unintended read access to files)
and pointless (can't make changes).
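
The effect can be sketched with a toy userspace model. This is a
simplified illustration, not Smack's implementation: the function and
enum names are hypothetical, only two of Smack's built-in rules are
modeled (same label grants any access; objects with the floor label "_"
are readable by everyone), and the special "*" and "^" labels and any
loaded rule set are left out.

```c
#include <stdbool.h>
#include <string.h>

enum toy_access { TOY_READ, TOY_WRITE };

/* Simplified subset of Smack's built-in rules, with no loaded rule set. */
bool toy_smack_allows(const char *subject, const char *object,
		      enum toy_access want)
{
	if (strcmp(subject, object) == 0)
		return true;	/* same label: any access */
	if (strcmp(object, "_") == 0 && want == TOY_READ)
		return true;	/* floor-labeled objects are readable by all */
	return false;
}

/*
 * Label used for the access decision: the on-disk label when the
 * superblock is trusted, otherwise the floor default.
 */
const char *toy_effective_label(const char *on_disk, bool labels_trusted)
{
	return labels_trusted ? on_disk : "_";
}
```

With the on-disk label ignored, a task labeled "User" can read a file
whose stored label is "Secret" but cannot write it; with the label
honored, the same task cannot read it at all.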

I can't speak authoritatively for SELinux, but it looks to me like you
may have similar issues there.

> In the long run we can do more interesting things with such filesystems
> once the appropriate LSM policy is in place.

The problem is not that the short term behavior is uninteresting,
it's that it is broken. Mounting a filesystem with xattrs and ignoring
those xattrs results in incorrect access control decisions.

> Getting s_user_ns present on struct super_block, properly set, and all of the
> appropriate checks against it present in the vfs so that filesystems
> don't need to duplicate logic is important if we are going to do more
> interesting things with user namespaces (as users have been asking for).

OK, but the fact that someone wants to do something they shouldn't
doesn't mean you get to break things that work now to accommodate
them. There are reasons why mounting filesystems requires privilege!

> It is important for things as small as making it safe to allow
> truly unprivileged users to mount fuse filesystems.

If it isn't safe you shouldn't be doing it, even if it's "small"
and something that would make life easier for some set of users.

> I am on the fence with Lukasz Pawelczyk's patches. Some parts I liked,
> some parts I had issues with. As I recall, one of my issues was that
> those patches conflicted in detail if not in principle with this
> approach.
>
> If these patches do not do a good job of laying the groundwork for
> supporting security labels that unprivileged users can set, then Seth
> could really use some feedback. Figuring out how to properly deal with
> the LSMs has been one of his challenges.

The feedback is that you can't pick and
choose when you are going to pay attention to the security attributes
on a filesystem. It's possible that it will work out the way you want
it, but it probably won't. Smack doesn't allow you to choose if you're
using xattrs. SELinux does, but certainly doesn't expect you to be
flipping it on and off. I'm not convinced that it's safe to do for
capability sets, either, but I'm not up to arguing PIxFE+ vector
calculations just now.

> I am hoping I can finish working through the patches to fix the
> semantics of rename and bind mounts before the next merge window opens,
> so I can have enough cycles to lift the feature freeze on user
> namespaces. Except for maybe his first two patches (which fix a small
> userspace API breakage) none of Seth's patches get to go in until I lift
> the freeze.

Thanks. I know (believe me, I know) how frustrating it can be when
you get the big NAK on something that seems like it's addressed.
Unfortunately, the proposed approach (not just the specifics of
implementation) does not work.

> Which is probably too much information but I hope this makes it clear
> that the point of this work is as an enabler for future developments,
> not as something to make user namespaces and LSMs incompatible.

I am paranoid, but not to the extent that I think anyone
is trying to break the interaction between security modules
and namespaces. Having worked with Lukasz on his security
namespace patches it is clear to me that this is not a simple
problem and that it is unlikely to have the simple solution
everyone would like to see. I also don't see an intermediate
state that works while the "real" solution is being refined.
As always, I'm willing to be proven wrong.

> Eric

2015-07-15 23:04:43

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/15/2015 2:48 PM, Seth Forshee wrote:
> On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <[email protected]> writes:
>>
>>> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>>>> These are the first in a larger set of patches that I've been working on
>>>> (with help from Eric Biederman) to support mounting ext4 and fuse
>>>> filesystems from within user namespaces. I've pushed the full series to:
>>>>
>>>> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>>>
>>>> Taking the series as a whole, the strategy is to handle as much of the
>>>> heavy lifting as possible in the vfs so the filesystems don't have to
>>>> handle weird edge cases. If you look at the full series you'll find that
>>>> the changes in ext4 to support user namespace mounts turn out to be
>>>> fairly minimal (fuse is a bit more complicated though as it must deal
>>>> with translating ids for a userspace process which is running in pid and
>>>> user namespaces).
>>>>
>>>> The patches I'm sending today lay some of the groundwork in the vfs and
>>>> related code. They fall into two broad groups:
>>>>
>>>> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>>> pretty straightforward, and Eric has expressed interest in merging
>>>> these patches soon. Note that patch 2 won't apply cleanly without
>>>> Eric's noexec patches for proc and sys [1].
>>>>
>>>> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>>>> &init_user_ns. This includes updates to how file caps and suid are
>>>> handled and LSM updates to ignore security labels on superblocks
>>>> from non-init namespaces.
>>>>
>>>> The LSM changes in particular may not be optimal, as I don't have a
>>>> lot of familiarity with this code, so I'd be especially appreciative
>>>> of review of these changes and suggestions on how to improve them.
>>> Lukasz Pawelczyk <[email protected]> proposed
>>> LSM support in user namespaces ([RFC] lsm: namespace hooks)
>>> that make a whole lot more sense than just turning off
>>> the option of using labels on files. Gutting the ability
>>> to use MAC in a namespace is a step down the road of
>>> making MAC and namespaces incompatible.
>> This is not "turning off the option to use labels on files".
>>
>> This is supporting mounting filesystems like ext4 by unprivileged users
>> and not trusting the labels they set in the same way as we trust labels
>> on filesystems mounted by privileged users.
>>
>> The first step needs to be not trusting those labels and treating such
>> filesystems as filesystems without label support. I hope that is what Seth
>> has implemented.
>>
>> In the long run we can do more interesting things with such filesystems
>> once the appropriate LSM policy is in place.
> Yes, this exactly. Right now it looks to me like the only safe thing to
> do with mounts from unprivileged users is to ignore the security labels,
> so that's what I'm trying to do with these changes. If there's some
> better thing to do, or some better way to do it, I'm more than happy to
> receive that feedback.

If you ignore Smack labels you get a system that is broken.
Without specifying Smack mount options (requires CAP_MAC_ADMIN)
all your files will be labeled with the floor ("_") label. Unless
you're running with the floor label (Smack systems generally don't)
there won't be anything you can write to. You will be able to read
everything, which is also something you're unlikely to want. Like
I said, broken.

Personally, I don't believe that the goal of supporting
unprivileged mounts is especially sane. I am willing to
be educated, but I don't see a rational solution.

> Seth

2015-07-16 01:06:15

by Andy Lutomirski

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Jul 15, 2015 3:34 PM, "Eric W. Biederman" <[email protected]> wrote:
>
> Seth Forshee <[email protected]> writes:
>
> > On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
> >> Casey Schaufler <[email protected]> writes:
> >>
> >> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
> >> >> These are the first in a larger set of patches that I've been working on
> >> >> (with help from Eric Biederman) to support mounting ext4 and fuse
> >> >> filesystems from within user namespaces. I've pushed the full series to:
> >> >>
> >> >> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
> >> >>
> >> >> Taking the series as a whole, the strategy is to handle as much of the
> >> >> heavy lifting as possible in the vfs so the filesystems don't have to
> >> >> handle weird edge cases. If you look at the full series you'll find that
> >> >> the changes in ext4 to support user namespace mounts turn out to be
> >> >> fairly minimal (fuse is a bit more complicated though as it must deal
> >> >> with translating ids for a userspace process which is running in pid and
> >> >> user namespaces).
> >> >>
> >> >> The patches I'm sending today lay some of the groundwork in the vfs and
> >> >> related code. They fall into two broad groups:
> >> >>
> >> >> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
> >> >> pretty straightforward, and Eric has expressed interest in merging
> >> >> these patches soon. Note that patch 2 won't apply cleanly without
> >> >> Eric's noexec patches for proc and sys [1].
> >> >>
> >> >> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
> >> >> &init_user_ns. This includes updates to how file caps and suid are
> >> >> handled and LSM updates to ignore security labels on superblocks
> >> >> from non-init namespaces.
> >> >>
> >> >> The LSM changes in particular may not be optimal, as I don't have a
> >> >> lot of familiarity with this code, so I'd be especially appreciative
> >> >> of review of these changes and suggestions on how to improve them.
> >> >
> >> > Lukasz Pawelczyk <[email protected]> proposed
> >> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
> >> > that make a whole lot more sense than just turning off
> >> > the option of using labels on files. Gutting the ability
> >> > to use MAC in a namespace is a step down the road of
> >> > making MAC and namespaces incompatible.
> >>
> >> This is not "turning off the option to use labels on files".
> >>
> >> This is supporting mounting filesystems like ext4 by unprivileged users
> >> and not trusting the labels they set in the same way as we trust labels
> >> on filesystems mounted by privileged users.
> >>
> >> The first step needs to be not trusting those labels and treating such
> >> filesystems as filesystems without label support. I hope that is what Seth
> >> has implemented.
> >>
> >> In the long run we can do more interesting things with such filesystems
> >> once the appropriate LSM policy is in place.
> >
> > Yes, this exactly. Right now it looks to me like the only safe thing to
> > do with mounts from unprivileged users is to ignore the security labels,
> > so that's what I'm trying to do with these changes. If there's some
> > better thing to do, or some better way to do it, I'm more than happy to
> > receive that feedback.
>
> Ugh.
>
> This made me realize that we have an interesting problem here. An
> unprivileged mount of tmpfs probably needs to have
> s_user_ns == &init_user_ns.
>
> Otherwise we will break security labels on tmpfs for no good reason.
> ramfs and sysfs also seem to have similar concerns.
>
> Because they have no backing store we can trust those filesystems with
> security labels. Plus for at least sysfs there is the security label
> bleed through issue, that we need to make certain works.
>
> Perhaps these filesystems with trusted backing store need to call
> "sget_userns(..., &init_user_ns)".
>
> If we don't get this right we will have significant regressions with
> respect to security labels, and that is not ok.

That's only a problem if there's anyone who sets security labels on
such a mount. You need global caps to do that (I hope), which
requires someone outside the userns to help, which means there's a
good chance that literally no one does this.

--Andy

2015-07-16 01:09:24

by Andy Lutomirski

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 15, 2015 at 3:39 PM, Casey Schaufler <[email protected]> wrote:
> On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
>> Casey Schaufler <[email protected]> writes:
>
>> The first step needs to be not trusting those labels and treating such
>> filesystems as filesystems without label support. I hope that is what Seth
>> has implemented.
>
> A filesystem with Smack labels gets mounted in a namespace. The labels
> are ignored. Instead, the filesystem defaults (potentially specified as
> mount options smackfsdef="something", but usually the floor label ("_"))
> are used, giving the user the ability to read everything and (usually)
> change nothing. This is both dangerous (unintended read access to files)
> and pointless (can't make changes).

I don't get it.

If I mount an unprivileged filesystem, then either the contents were
put there *by me*, in which case letting me access them is fine, or
(with Seth's patches and then some) I control the backing store, in
which case I can do whatever I want regardless of what LSM thinks.

So I don't see the problem. Why would Smack or any other LSM care at
all, unless it wants to prevent me from mounting the fs in the first
place?

--Andy

2015-07-16 01:15:17

by Seth Forshee

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 05:35:24PM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <[email protected]> writes:
>
> > On Wed, Jul 15, 2015 at 2:48 PM, Serge E. Hallyn <[email protected]> wrote:
> >> On Wed, Jul 15, 2015 at 02:46:04PM -0500, Seth Forshee wrote:
> >>> Capability sets attached to files must be ignored except in the
> >>> user namespaces where the mounter is privileged, i.e. s_user_ns
> >>> and its descendants. Otherwise a vector exists for gaining
> >>> privileges in namespaces where a user is not already privileged.
> >>>
> >>> Add a new helper function, in_user_ns(), to test whether a user
> >>> namespace is the same as or a descendant of another namespace.
> >>> Use this helper to determine whether a file's capability set
> >>> should be applied to the caps constructed during exec.
> >>>
> >>> Signed-off-by: Seth Forshee <[email protected]>
> >>
> >> Acked-by: Serge Hallyn <[email protected]>
> >>
> >> I think it's an ok behavior, though let's just go over the
> >> alternatives.
> >>
> >> It might actually be ok to simply require that the user_ns be
> >> equal. If I unshare a new userns in which a different uid is
> >> mapped to root, I may not want file capabilities to be granted
> >> to tasks in that ns. (On the other hand, I might be creating
> >> a new user_ns specifically to not have a uid 0 mapped into it
> >> at all, and only have file capabilities grant privilege)
> >>
> >> Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
> >> an ext4fs, and then (from the parent shell) unshare another user_ns
> >> with the same mapping, intending it to be a "peer" to the first one
> >> I'd unshared and be able to use the ext4fs it mounted. This won't
> >> work here. That's probably best - the appropriate thing to do was
> >> to attach to the existing user_ns. But it could end up being
> >> limiting in some special cases, so I'm bringing it up here.
> >>
> >> Again I think what you have here is the simplest and most sensible
> >> choice, so ack.
> >>
> >
> > I think I'm missing something. Why is this separate from mount_may_suid?
> >
> > I can see why it would make sense to check s_user_ns (or maybe
> > s_user_ns *and* the vfsmount namespace) in mount_may_suid, but I don't
> > see why we need separate checks.
>
> So I don't quite understand your concerns that led to the mnt_may_suid
> patch. But in my limited understanding there are two distinct issues.
>
> 1) What do file capabilities mean on a filesystem mounted with user
> namespace privileges. Where the unprivileged user can control what
> resides on disk.
>
> That is what this patch should be about.
>
> Meaning and restricting those permissions to unprivileged users.
>
> 2) The second issue that I think your mnt_may_suid patch is about seems
> to be what to do if a mount winds up in a place we never intended.
>
> Aka leaks. I don't think any changes to mnt_may_suid are necessary
> in that sense. However they may be useful.
>
> So I think your mnt_may_suid change may be worth having but so far it
> seems unnecessary.
>
> Which is a long way of saying this patch is fundamentally necessary,
> and I am not certain about the mnt_may_suid patch.
>
> Am I right in understanding its purpose? Or does this patch actually
> succeed in obsoleting it?

The only part that's absolutely needed is the restriction on file caps,
otherwise it will be trivial to get root through a user namespace mount.
I've become convinced that the safest and most logical thing to do is to
restrict file capabilities to the user namespaces where the mounter
already has privileges, which is what the patch does.
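
As a rough illustration of that restriction, the sketch below models the
"same namespace or a descendant" test in plain C. It is not the code from
the patch: struct toy_user_ns, toy_in_user_ns and toy_file_caps_honored
are hypothetical stand-ins, and the only assumption is that each
namespace records its parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a user namespace: only the parent link is modeled. */
struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial namespace */
};

/*
 * Return true if ns is target_ns itself or one of its descendants,
 * found by walking the parent chain upward from ns.
 */
static bool toy_in_user_ns(const struct toy_user_ns *ns,
			   const struct toy_user_ns *target_ns)
{
	assert(target_ns != NULL);

	for (; ns != NULL; ns = ns->parent) {
		if (ns == target_ns)
			return true;
	}
	return false;
}

/* File caps count only when the exec'ing task's namespace is inside s_user_ns. */
static bool toy_file_caps_honored(const struct toy_user_ns *task_ns,
				  const struct toy_user_ns *s_user_ns)
{
	return toy_in_user_ns(task_ns, s_user_ns);
}
```

With this rule a task in the mounter's namespace or below it sees the
file caps, while a task in the parent namespace or in a sibling namespace
does not.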

mnt_may_suid would also restrict the namespaces where the capabilities
would be honored, but not to only namespaces where the mounter is
already privileged. Of course it does require a user privileged in
another namespace to perform a mount, but that still leaves me feeling a
bit uncomfortable.

suid doesn't require quite so strict a check because (jumping ahead to
the patches I haven't sent yet) ids in a user namespace mount of a
normal filesystem are constrained to ids in that namespace. So users
could only exploit this to suid to ids they already control, or if they
managed to somehow bypass other kernel protections they could possibly
gain access to user ns mounts belonging to another user.

So if we have the s_user_ns check in get_file_caps the mnt_may_suid patch
isn't strictly necessary, but I still think it is useful as a mitigation
to the "leaks" Eric mentions. It _should_ be impossible for a user to
gain access to another user's mount namespace, and it _should_ be
impossible for a user to clear MNT_NOSUID in a bind mount from
init_user_ns. But if someone does find a way to do either then the patch
stops them from being able to gain privileges via suid, and I think
that's worth adding the check.

Andy alludes to the possibility of checking s_user_ns or both s_user_ns
and the mount namespace in mnt_may_suid, and those are certainly
possibilities that would work equally well (though checking both is
probably unnecessary). One thing I came away with from conversing with
Eric though is that he wants to see a clear and explicit check in
get_file_caps, not something implicit from mnt_may_suid. And I can see
his point - there is a concern with file capabilities independent of the
question of whether suid is allowed, and having a separate check does
make that clearer.

Seth

2015-07-16 01:20:11

by Andy Lutomirski

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 3:35 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Wed, Jul 15, 2015 at 2:48 PM, Serge E. Hallyn <[email protected]> wrote:
>>> On Wed, Jul 15, 2015 at 02:46:04PM -0500, Seth Forshee wrote:
>>>> Capability sets attached to files must be ignored except in the
>>>> user namespaces where the mounter is privileged, i.e. s_user_ns
>>>> and its descendants. Otherwise a vector exists for gaining
>>>> privileges in namespaces where a user is not already privileged.
>>>>
>>>> Add a new helper function, in_user_ns(), to test whether a user
>>>> namespace is the same as or a descendant of another namespace.
>>>> Use this helper to determine whether a file's capability set
>>>> should be applied to the caps constructed during exec.
>>>>
>>>> Signed-off-by: Seth Forshee <[email protected]>
>>>
>>> Acked-by: Serge Hallyn <[email protected]>
>>>
>>> I think it's an ok behavior, though let's just go over the
>>> alternatives.
>>>
>>> It might actually be ok to simply require that the user_ns be
>>> equal. If I unshare a new userns in which a different uid is
>>> mapped to root, I may not want file capabilities to be granted
>>> to tasks in that ns. (On the other hand, I might be creating
>>> a new user_ns specifically to not have a uid 0 mapped into it
>>> at all, and only have file capabilities grant privilege)
>>>
>>> Conversely, if I unshare one user_ns with a MS_SHARED mnt_ns, mount
>>> an ext4fs, and then (from the parent shell) unshare another user_ns
>>> with the same mapping, intending it to be a "peer" to the first one
>>> I'd unshared and be able to use the ext4fs it mounted. This won't
>>> work here. That's probably best - the appropriate thing to do was
>>> to attach to the existing user_ns. But it could end up being
>>> limiting in some special cases, so I'm bringing it up here.
>>>
>>> Again I think what you have here is the simplest and most sensible
>>> choice, so ack.
>>>
>>
>> I think I'm missing something. Why is this separate from mount_may_suid?
>>
>> I can see why it would make sense to check s_user_ns (or maybe
>> s_user_ns *and* the vfsmount namespace) in mount_may_suid, but I don't
>> see why we need separate checks.
>
> So I don't quite understand your concerns that led to the mnt_may_suid
> patch. But in my limited understanding there are two distinct issues.

The issue is that we need some kind of control for whether a given
operation should trust a given mounted filesystem. There are two
kinds of trust: trusting the fs for execve security context (nosuid
controls this) and trusting it for LSM access restrictions. I think
that, in an unprivileged namespace context, the latter is a bit silly
-- the creator of the fs owns it, full stop. I'm talking about the
former.

In particular, if I unshare everything, mount a fresh FUSE filesystem,
shove a setuid, fcapped, LSM-labeled thing in it, pass a file descriptor
out, and have someone in the root ns execve it, the root ns is *pwned*.

My suggestion is to use a single function to control this, and I
called it mnt_may_suid. We can certainly debate when that function
should return true, but I'm unconvinced that the conditions for LSM
and for regular setuid should be different.

>
> 1) What do file capabilities mean on a filesystem mounted with user
> namespace privileges. Where the unprivileged user can control what
> resides on disk.
>
> That is what this patch should be about.
>
> Meaning and restricting those permissions to unprivileged users.

I think that file caps should mean what they usually do if the execve
caller's userns should trust the file. Otherwise file caps should do
nothing.

My original idea was that a namespace trusts a vfsmount if the
namespace or one of its ancestors created the mount. Doing the same
thing but with s_user_ns might also make sense.
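
A minimal sketch of that trust rule, assuming a single predicate that
combines the nosuid flag with both ownership checks. This is a toy model
and not the actual patch: struct toy_mount, TOY_MNT_NOSUID and
toy_mnt_may_suid are hypothetical names, and whether both checks are
needed is exactly what is being debated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MNT_NOSUID 0x1u	/* hypothetical flag value for this sketch */

struct toy_userns {
	struct toy_userns *parent;	/* NULL for the initial namespace */
};

struct toy_mount {
	unsigned int flags;
	struct toy_userns *mnt_owner;	/* userns that created the mount */
	struct toy_userns *s_user_ns;	/* userns that owns the superblock */
};

/* True if creator is ns itself or one of its ancestors. */
static bool toy_ns_or_ancestor(const struct toy_userns *ns,
			       const struct toy_userns *creator)
{
	for (; ns != NULL; ns = ns->parent) {
		if (ns == creator)
			return true;
	}
	return false;
}

/* Trust the mount for execve only if it is not nosuid and both owners are trusted. */
static bool toy_mnt_may_suid(const struct toy_mount *mnt,
			     const struct toy_userns *caller_ns)
{
	assert(mnt != NULL);

	if (mnt->flags & TOY_MNT_NOSUID)
		return false;
	return toy_ns_or_ancestor(caller_ns, mnt->mnt_owner) &&
	       toy_ns_or_ancestor(caller_ns, mnt->s_user_ns);
}
```

In this model a FUSE mount created inside a sandbox namespace is trusted
by tasks in that sandbox but not by a task in the root namespace that
receives a file descriptor from it.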

>
> 2) The second issue that I think your mnt_may_suid patch is about seems
> to be what to do if a mount winds up in a place we never intended.
>
> Aka leaks. I don't think any changes to mnt_may_suid are necessary
> in that sense. However they may be useful.
>
> So I think your mnt_may_suid change may be worth having but so far it
> seems unnecessary.

There's that, too. For one thing, with my mnt_may_suid patch (or a
variant that checks the vfsmount and s_user_ns), we could drop the
bind-mount nosuid restrictions. If you want to bind-mount an
MS_NOSUID mount without MS_NOSUID, then that's fine -- you can't do
any harm.

>
> Which is a long way of saying this patch is fundamentally necessary,
> and I am not certain about the mnt_may_suid patch.
>
> Am I right in understanding its purpose? Or does this patch actually
> succeed in obsoleting it?

Other way around. I think that an improved mnt_may_suid patch might
render this patch unnecessary.

--Andy

2015-07-16 01:23:25

by Andy Lutomirski

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 6:14 PM, Seth Forshee
<[email protected]> wrote:
> mnt_may_suid would also restrict the namespaces where the capabilities
> would be honored, but not to only namespaces where the mounter is
> already privileged. Of course it does require a user privileged in
> another namespace to perform a mount, but that still leaves me feeling a
> bit uncomfortable.

Right. I think mnt_may_suid should check s_user_ns in addition.

>
> suid doesn't require quite so strict a check because (jumping ahead to
> the patches I haven't sent yet) ids in a user namespace mount of a
> normal filesystem are constrained to ids in that namespace. So users
> could only exploit this to suid to ids they already control, or if they
> managed to somehow bypass other kernel protections they could possibly
> gain access to user ns mounts belonging to another user.

True. But LSM labels probably want the same protection as file caps,
and the mnt_may_suid approach handles that, too. (Your patches also do
this, but maybe we'd want to relax that some day for LSMs that are
scoped sensibly.)

>
> So if we have the s_user_ns check in get_file_caps the mnt_may_suid patch
> isn't strictly necessary, but I still think it is useful as a mitigation
> to the "leaks" Eric mentions. It _should_ be impossible for a user to
> gain access to another user's mount namespace,

No, it's very easy with SCM_RIGHTS. We should make sure it's safe.

> Andy alludes to the possibility of checking s_user_ns or both s_user_ns
> and the mount namespace in mnt_may_suid, and those are certainly
> possibilities that would work equally well (though checking both is
> probably unnecessary). One thing I came away with from conversing with
> Eric though is that he wants to see a clear and explicit check in
> get_file_caps, not something implicit from mnt_may_suid. And I can see
> his point - there is a concern with file capabilities independent of the
> question of whether suid is allowed, and having a separate check does
> make that clearer.

But we absolutely need MS_NOSUID to block file caps, and it does. Why
not just use the existing mechanism with an expanded sense of
"nosuid"?

--Andy

2015-07-16 02:26:54

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Andy Lutomirski <[email protected]> writes:

> On Jul 15, 2015 3:34 PM, "Eric W. Biederman" <[email protected]> wrote:
>>
>> Seth Forshee <[email protected]> writes:
>>
>> > On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>> >> Casey Schaufler <[email protected]> writes:
>> >>
>> >> > On 7/15/2015 12:46 PM, Seth Forshee wrote:
>> >> >> These are the first in a larger set of patches that I've been working on
>> >> >> (with help from Eric Biederman) to support mounting ext4 and fuse
>> >> >> filesystems from within user namespaces. I've pushed the full series to:
>> >> >>
>> >> >> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>> >> >>
>> >> >> Taking the series as a whole, the strategy is to handle as much of the
>> >> >> heavy lifting as possible in the vfs so the filesystems don't have to
>> >> >> handle weird edge cases. If you look at the full series you'll find that
>> >> >> the changes in ext4 to support user namespace mounts turn out to be
>> >> >> fairly minimal (fuse is a bit more complicated though as it must deal
>> >> >> with translating ids for a userspace process which is running in pid and
>> >> >> user namespaces).
>> >> >>
>> >> >> The patches I'm sending today lay some of the groundwork in the vfs and
>> >> >> related code. They fall into two broad groups:
>> >> >>
>> >> >> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>> >> >> pretty straightforward, and Eric has expressed interest in merging
>> >> >> these patches soon. Note that patch 2 won't apply cleanly without
>> >> >> Eric's noexec patches for proc and sys [1].
>> >> >>
>> >> >> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>> >> >> &init_user_ns. This includes updates to how file caps and suid are
>> >> >> handled and LSM updates to ignore security labels on superblocks
>> >> >> from non-init namespaces.
>> >> >>
>> >> >> The LSM changes in particular may not be optimal, as I don't have a
>> >> >> lot of familiarity with this code, so I'd be especially appreciative
>> >> >> of review of these changes and suggestions on how to improve them.
>> >> >
>> >> > Lukasz Pawelczyk <[email protected]> proposed
>> >> > LSM support in user namespaces ([RFC] lsm: namespace hooks)
>> >> > that make a whole lot more sense than just turning off
>> >> > the option of using labels on files. Gutting the ability
>> >> > to use MAC in a namespace is a step down the road of
>> >> > making MAC and namespaces incompatible.
>> >>
>> >> This is not "turning off the option to use labels on files".
>> >>
>> >> This is supporting mounting filesystems like ext4 by unprivileged users
>> >> and not trusting the labels they set in the same way as we trust labels
>> >> on filesystems mounted by privileged users.
>> >>
>> >> The first step needs to be not trusting those labels and treating such
>> >> filesystems as filesystems without label support. I hope that is what Seth
>> >> has implemented.
>> >>
>> >> In the long run we can do more interesting things with such filesystems
>> >> once the appropriate LSM policy is in place.
>> >
>> > Yes, this exactly. Right now it looks to me like the only safe thing to
>> > do with mounts from unprivileged users is to ignore the security labels,
>> > so that's what I'm trying to do with these changes. If there's some
>> > better thing to do, or some better way to do it, I'm more than happy to
>> > receive that feedback.
>>
>> Ugh.
>>
>> This made me realize that we have an interesting problem here. An
>> unprivileged mount of tmpfs probably needs to have
>> s_user_ns == &init_user_ns.
>>
>> Otherwise we will break security labels on tmpfs for no good reason.
>> ramfs and sysfs also seem to have similar concerns.
>>
>> Because they have no backing store we can trust those filesystems with
>> security labels. Plus for at least sysfs there is the security label
>> bleed through issue, that we need to make certain works.
>>
>> Perhaps these filesystems with trusted backing store need to call
>> "sget_userns(..., &init_user_ns)".
>>
>> If we don't get this right we will have significant regressions with
>> respect to security labels, and that is not ok.
>
> That's only a problem if there's anyone who sets security labels on
> such a mount. You need global caps to do that (I hope), which
> requires someone outside the userns to help, which means there's a
> good chance that literally no one does this.

Fair enough. That is, however, something we need to test. If no one
puts security labels or file caps on such a mount we can change things.
If someone does, we can't, because it would introduce regressions.

Eric

2015-07-16 02:53:39

by Eric W. Biederman

Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

Seth Forshee <[email protected]> writes:

> Initially this will be used to eliminate the implicit MNT_NODEV
> flag for mounts from user namespaces. In the future it will also
> be used for translating ids and checking capabilities for
> filesystems mounted from user namespaces.
>
> s_user_ns is initialized in alloc_super() and is generally set to
> current_user_ns(). To avoid security and corruption issues, two
> additional mount checks are also added:
>
> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> in current_user_ns().
>
> - sget() will fail with EBUSY when the filesystem it's looking
> for is already mounted from another user namespace.
>
> proc needs some special handling here. The user namespace of
> current isn't appropriate when forking as a result of clone (2)
> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> from within the new user namespace. Instead, the user namespace
> which owns the new pid namespace should be used. sget_userns() is
> added to allow passing of a user namespace other than that of
> current, and this is used by proc_mount(). sget() becomes a
> wrapper around sget_userns() which passes current_user_ns().

From bits of the previous conversation:

We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
xattrs can travel from one mount of sysfs to another via the sysfs
backing store.

For tmpfs and any other filesystems we support mounting without
privilege that support xattrs, we need to identify them and
see if userspace is taking advantage of the ability to set
xattrs and file caps (unlikely). If it is, we need to call
sget_userns(..., &init_user_ns) on those filesystems as well.

Possibly/Probably we should just do that for all of the interesting
filesystems to start with and then change back to an ordinary old sget
after we have done the testing and confirmed we will not be introducing
userspace regressions.
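
The ownership choice can be pictured with a toy model in C. This is only
a sketch of the behavior described in the patch summary (sget() wraps
sget_userns() with the current namespace, and a superblock already owned
by another namespace yields EBUSY); the toy_ names and the simplified
signatures are assumptions for illustration and do not match the real
functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_ns { int id; };

struct toy_super {
	int in_use;				/* nonzero once mounted */
	const struct toy_ns *s_user_ns;		/* owning user namespace */
};

static struct toy_ns toy_init_user_ns = { 0 };

/* Stand-in for current_user_ns(); tests switch it to simulate callers. */
static const struct toy_ns *toy_current_ns = &toy_init_user_ns;

/* Claim sb for user_ns, or fail if another namespace already owns it. */
static int toy_sget_userns(struct toy_super *sb, const struct toy_ns *user_ns)
{
	assert(sb != NULL && user_ns != NULL);

	if (sb->in_use && sb->s_user_ns != user_ns)
		return -EBUSY;
	sb->in_use = 1;
	sb->s_user_ns = user_ns;
	return 0;
}

/* Default path: the superblock is owned by the caller's namespace. */
static int toy_sget(struct toy_super *sb)
{
	return toy_sget_userns(sb, toy_current_ns);
}
```

A filesystem with a trusted backing store would call the explicit variant
with the initial namespace, so its labels stay trusted even when the
mount is requested from inside a container.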

Eric

2015-07-16 02:54:28

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> On Wed, Jul 15, 2015 at 3:39 PM, Casey Schaufler <[email protected]> wrote:
>> On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
>>> Casey Schaufler <[email protected]> writes:
>>> The first step needs to be not trusting those labels and treating such
>>> filesystems as filesystems without label support. I hope that is what Seth
>>> has implemented.
>> A filesystem with Smack labels gets mounted in a namespace. The labels
>> are ignored. Instead, the filesystem defaults (potentially specified as
>> mount options smackfsdef="something", but usually the floor label ("_"))
>> are used, giving the user the ability to read everything and (usually)
>> change nothing. This is both dangerous (unintended read access to files)
>> and pointless (can't make changes).
> I don't get it.
>
> If I mount an unprivileged filesystem, then either the contents were
> put there *by me*, in which case letting me access them is fine, or
> (with Seth's patches and then some) I control the backing store, in
> which case I can do whatever I want regardless of what LSM thinks.
>
> So I don't see the problem. Why would Smack or any other LSM care at
> all, unless it wants to prevent me from mounting the fs in the first
> place?

First off, I don't cotton to the notion that you should be able
to mount filesystems without privilege. But it seems I'm being
outvoted on that. I suspect that there are cases where it might
be safe, but I can't think of one off the top of my head.

If you do mount a filesystem it needs to behave according to the
rules of the system. If you have a security module that uses
attributes on the filesystem you can't ignore them just because
it's "your data". Mandatory access control schemes, including
Smack and SELinux don't give a fig about who you are. It's the
label on the data and the process that matter. If "you" get to
muck the labels up, you've broken the mandatory access control.

> --Andy

2015-07-16 03:21:48

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts


Seth I think for the LSMs we should start with:

diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..5b6ece92a8e5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
 int security_sb_mount(const char *dev_name, struct path *path,
                        const char *type, unsigned long flags, void *data)
 {
+	if (current_user_ns() != &init_user_ns)
+		return -EPERM;
 	return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
 }


Then we should push this down into all of the lsms.
Then we should remove or relax or change the check as appropriate
in each lsm.

The point is this is good enough to see that it is trivially safe,
and this allows us to focus on the core issues, and stop worrying about
the lsms for a bit.

Then we can focus on each lsm one at a time and take the time to really
understand them and talk with their maintainers etc to make certain
we get things correct.

This should remove the need for your patches 5, 6 and 7 for the
immediate future.

Eric

2015-07-16 04:30:30

by Eric W. Biederman

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces


Ok. Andy I have stopped and really looked at your patch that is 4/7 in
this series. Something I had not done before since it sounded totally
wrong.

With that, combined with your earlier comments, I think I can say
something meaningful.

Andy, as I read your patch, the threat you are primarily worried about is
chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
deal with that case is reasonable, and is unlikely to break userspace.
It is one of those hairy security things so we need to be careful not to
introduce a regression.

I think a top down enhancement of nosuid to just block funny cases that
no one cares about is completely sensible. Removing goofy corner cases
that no one cares about and that are only good for security exploits
seems reasonable.

I am a little concerned that smack does not seem to respect nosuid
on filesystems. But that is an issue with nosuid not with your enhanced
nosuid.




Now this patch 3/7 really should be entitled:
"Limit file caps to the userns of the super block".

It really really is doing something different. This change is about a
bottom up understanding of what file caps means on a filesystem mounted
by a user namespace root.

That is, file caps should only apply in the user namespace of the
root user who mounted the filesystem, because those are all the privileges
the mounter of the filesystem had.

This guarantees that even if the filesystem somehow propagates with
mount propagation that there will be no issues. I think I know how to
make that happen...




But deeply and fundamentally limiting a filesystem to only the
privileges of its user namespace root, and enhancing nosuid
protections are rather different things.


The approaches show up differently for dealing with uids and gids,
as mappings are required. The approaches will likely continue to
show up differently for file caps when Serge implements a version
of file caps with a user namespace root in them.

The approaches fundamentally will need to do different things with
security xattrs, as mnt_may_suid can just treat the filesystem as one
without labels, while ultimately the lsms will have to do something
meaningful.



So while in the very narrow case of today's file caps the two approaches
are the same, enhancing nosuid is something very different from
limiting a filesystem to its mounter's user namespace.

Eric

2015-07-16 04:53:39

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Casey Schaufler <[email protected]> writes:

> On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
>> On Wed, Jul 15, 2015 at 3:39 PM, Casey Schaufler <[email protected]> wrote:
>>> On 7/15/2015 2:06 PM, Eric W. Biederman wrote:
>>>> Casey Schaufler <[email protected]> writes:
>>>> The first step needs to be not trusting those labels and treating such
>>>> filesystems as filesystems without label support. I hope that is what Seth
>>>> has implemented.
>>> A filesystem with Smack labels gets mounted in a namespace. The labels
>>> are ignored. Instead, the filesystem defaults (potentially specified as
>>> mount options smackfsdef="something", but usually the floor label ("_"))
>>> are used, giving the user the ability to read everything and (usually)
>>> change nothing. This is both dangerous (unintended read access to files)
>>> and pointless (can't make changes).
>> I don't get it.
>>
>> If I mount an unprivileged filesystem, then either the contents were
>> put there *by me*, in which case letting me access them is fine, or
>> (with Seth's patches and then some) I control the backing store, in
>> which case I can do whatever I want regardless of what LSM thinks.
>>
>> So I don't see the problem. Why would Smack or any other LSM care at
>> all, unless it wants to prevent me from mounting the fs in the first
>> place?
>
> First off, I don't cotton to the notion that you should be able
> to mount filesystems without privilege. But it seems I'm being
> outvoted on that. I suspect that there are cases where it might
> be safe, but I can't think of one off the top of my head.

There are two fundamental issues with mounting filesystems without privilege,
by which I actually mean mounting filesystems as the root user in a user
namespace.

- Are the semantics safe?
- Is the extra attack surface a problem?

Figuring out how to make semantics safe is what we are talking about.

Once we sort out the semantics we can look at the handful of filesystems
like fuse where the extra attack surface is not a concern.

With that said desktop environments have for a long time been
automatically mounting whichever filesystem you place in your computer,
so in practice what this is really about is trying to align the kernel
with how people use filesystems.

I haven't looked closely but I think docker is just about as bad as
those desktop environments when it comes to mounting filesystems.

> If you do mount a filesystem it needs to behave according to the
> rules of the system.

I agree.

> If you have a security module that uses
> attributes on the filesystem you can't ignore them just because
> it's "your data". Mandatory access control schemes, including
> Smack and SELinux don't give a fig about who you are. It's the
> label on the data and the process that matter. If "you" get to
> muck the labels up, you've broken the mandatory access control.

So there are filesystems like fat and minix that cannot store a label.
Since it is not possible to store labels securely in filesystems mounted
by unprivileged users (at least in the normal sense) the intent would be
to treat a filesystem mounted without the privileges of the global root
user as a filesystem that does not support xattrs.

Treating such a filesystem as a filesystem that does not support xattrs
is the only possible way to support such a filesystem securely, because as
you have said someone who can muck up the labels breaks mandatory access
control.

Given how non-trivial it is to grasp the nuances of different LSMs'
mandatory access control semantics, I am asking Seth for the first pass
to simply forbid mounting of filesystems with just user namespace
permissions when there is an LSM active.

Once we get that far smack may never need to support such systems.

Eric
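
The first-pass policy described in this message can be sketched as a
small userspace model. This is only an illustration under stated
assumptions: the struct layouts and the helpers labels_trusted() and
userns_mount_permitted() are hypothetical names, not kernel code; only
s_user_ns and init_user_ns come from the patch series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; only the fields used here. */
struct user_namespace { const char *name; };
struct super_block { const struct user_namespace *s_user_ns; };

static const struct user_namespace init_user_ns = { "init" };

/* Labels are trusted only when the mounter had global root privileges. */
static bool labels_trusted(const struct super_block *sb)
{
    assert(sb != NULL);
    return sb->s_user_ns == &init_user_ns;
}

/*
 * Hypothetical first-pass policy: a mount owned by a non-init user
 * namespace is refused while an LSM is active; otherwise it is allowed
 * and would be treated as a filesystem without xattr label support.
 */
static int userns_mount_permitted(const struct super_block *sb, bool lsm_active)
{
    if (labels_trusted(sb))
        return 0;
    return lsm_active ? -EPERM : 0;
}
```

In this model a superblock owned by init_user_ns is always permitted,
while one owned by any other namespace is rejected with -EPERM as soon
as lsm_active is true.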

2015-07-16 04:49:39

by Andy Lutomirski

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 9:23 PM, Eric W. Biederman
<[email protected]> wrote:
>
> Ok. Andy I have stopped and really looked at your patch that is 4/7 in
> this series. Something I had not done before since it sounded totally
> wrong.
>
> That combined with your earlier comments I think I can say something
> meaningful.
>
> Andy as I read your patch the thread you are primarily worried about is
> chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
> deal with that case is reasonable, and is unlikely to break userspace.
> It is one of those hairy security things so we need to be careful not to
> introduce a regression.
>

Indeed. It's plausible this could regress something, but it would be
really weird.

> I think a top down enhancement of nosuid to just block funny cases that
> no one cares about is completely sensible. Removing goofy corner
> that no one cares about and that are only good for security exploits
> seems reasonable.
>

Agreed.

> I am a little concerned that smack does not seem to respect nosuid
> on filesystems. But that is an issue with nosuid not with your enhanced
> nosuid.
>
>
>
>
> Now this patch 3/7 really should be entitled:
> "Limit file caps to the userns of the super block".
>
> It really really is doing something different. This change is about a
> bottom up understanding of what file caps means on a filesystem mounted
> by a user namespace root.
>
> That is file caps should only apply to the user namespace root of the
> root user who mounted the filesystem, because that is all the privileges
> the mounter of the filesystem had.
>
> This guarantees that even if the filesystem somehow propagates with
> mount propagation that there will be no issues. I think I know how to
> make that happen...
>
>
>
>
> But deeply and fundamentally limiting a filesystem to only the
> privilieges of it's user namespace root, and enhancing nosuid
> protections are rather different things.
>

So here's the semantic question:

Suppose an unprivileged user (uid 1000) creates a user namespace and a
mount namespace. They stick a file (owned by uid 1000 as seen by
init_user_ns) in there and mark it setuid root and give it fcaps.

Then global root gets an fd to this filesystem. If they execve the
file directly, then, with my patch 4, it won't act as setuid 1000 and
the fcaps will be ignored. Even with my patch 4, though, if they bind
mount the fs and execve the file from their bind mount, it will act as
setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
fcaps will (correctly) not be honored.

I tend to think that, if we're not honoring the fcaps, we shouldn't be
honoring the setuid bit either. After all, it's really not a trusted
file, even though the only user who could have messed with it really
is the apparent owner.

And, if we're going to say we don't trust the file and shouldn't honor
setuid or fcaps, then merging all the functionality into mnt_may_suid
could make sense. Yes, these two things do different things, but they
could hook in to the same place.

--Andy

2015-07-16 05:11:11

by Eric W. Biederman

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

Andy Lutomirski <[email protected]> writes:

> On Wed, Jul 15, 2015 at 9:23 PM, Eric W. Biederman
> <[email protected]> wrote:
>>
>> Ok. Andy I have stopped and really looked at your patch that is 4/7 in
>> this series. Something I had not done before since it sounded totally
>> wrong.
>>
>> That combined with your earlier comments I think I can say something
>> meaningful.
>>
>> Andy as I read your patch the thread you are primarily worried about is
>> chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
>> deal with that case is reasonable, and is unlikely to break userspace.
>> It is one of those hairy security things so we need to be careful not to
>> introduce a regression.
>>
>
> Indeed. It's plausible this could regress something, but it would be
> really weird.
>
>> I think a top down enhancement of nosuid to just block funny cases that
>> no one cares about is completely sensible. Removing goofy corner
>> that no one cares about and that are only good for security exploits
>> seems reasonable.
>>
>
> Agreed.
>
>> I am a little concerned that smack does not seem to respect nosuid
>> on filesystems. But that is an issue with nosuid not with your enhanced
>> nosuid.
>>
>>
>>
>>
>> Now this patch 3/7 really should be entitled:
>> "Limit file caps to the userns of the super block".
>>
>> It really really is doing something different. This change is about a
>> bottom up understanding of what file caps means on a filesystem mounted
>> by a user namespace root.
>>
>> That is file caps should only apply to the user namespace root of the
>> root user who mounted the filesystem, because that is all the privileges
>> the mounter of the filesystem had.
>>
>> This guarantees that even if the filesystem somehow propagates with
>> mount propagation that there will be no issues. I think I know how to
>> make that happen...
>>
>>
>>
>>
>> But deeply and fundamentally limiting a filesystem to only the
>> privilieges of it's user namespace root, and enhancing nosuid
>> protections are rather different things.
>>
>
> So here's the semantic question:
>
> Suppose an unprivileged user (uid 1000) creates a user namespace and a
> mount namespace. They stick a file (owned by uid 1000 as seen by
> init_user_ns) in there and mark it setuid root and give it fcaps.

To make this make sense I have to ask, is this file on a filesystem
where uid 1000 as seen by the init_user_ns is stored as uid 1000 on
the filesystem? Or is this uid 0 as seen by the filesystem?

I assume this is uid 0 on the filesystem in question or else your
unprivileged user would not have sufficient privileges over the
filesystem to set up fcaps.

> Then global root gets an fd to this filesystem. If they execve the
> file directly, then, with my patch 4, it won't act as setuid 1000 and
> the fcaps will be ignored. Even with my patch 4, though, if they bind
> mount the fs and execve the file from their bind mount, it will act as
> setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
> fcaps will (correctly) not be honored.

With patch 3 you can also think of it as fcaps being honored and you
get all the caps in the appropriate user namespace, but since you are
not in that user namespace and so don't have a place to store them
in struct cred you don't get the file caps.

From the philosophy of interpreting the file as defined by the
filesystem in principle we could extend struct cred so you actually
get the creds just in uid 1000's user namespace, but that is very
unlikely to be worth it.

> I tend to thing that, if we're not honoring the fcaps, we shouldn't be
> honoring the setuid bit either. After all, it's really not a trusted
> file, even though the only user who could have messed with it really
> is the apparent owner.

For the file caps we can't honor them because you don't have the bits
in struct cred.

For setuid we can honor it, and setuid is something that the user
namespace allows.

> And, if we're going to say we don't trust the file and shouldn't honor
> setuid or fcaps, then merging all the functionality into mnt_may_suid
> could make sense. Yes, these two things do different things, but they
> could hook in to the same place.

There are really two separate questions:
- Do we trust this filesystem?
- Do you have the bits to implement this concept?

Even if in this specific context the two questions wind up looking
exactly the same, I think it makes a lot of sense to ask the two
questions separately, as future maintenance changes may cause the
implementation of the questions to diverge.

Eric
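
The "do you have the bits" question for file caps can be modelled in a
few lines of userspace C. This is a sketch, not the code from patch 3:
the parent-pointer struct and the helpers in_userns() and
file_caps_apply() are hypothetical, and the ancestry-based test (the
task's namespace is the superblock's namespace or a descendant of it)
is an assumption about how "in that user namespace" would be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user namespace: just a link to the parent namespace. */
struct user_namespace { const struct user_namespace *parent; };

/* True if ns is target itself or one of target's descendants. */
static bool in_userns(const struct user_namespace *ns,
                      const struct user_namespace *target)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == target)
            return true;
    return false;
}

/*
 * File caps on a superblock only mean something relative to the user
 * namespace that owns the superblock. A task outside that namespace
 * has nowhere in its credentials to hold them, so they are dropped.
 * Whether setuid is honored is a separate decision, not modelled here.
 */
static bool file_caps_apply(const struct user_namespace *task_ns,
                            const struct user_namespace *sb_user_ns)
{
    assert(task_ns != NULL && sb_user_ns != NULL);
    return in_userns(task_ns, sb_user_ns);
}
```

With a container namespace whose parent is the initial namespace, the
container root gets the file caps, and a task in the initial namespace
(the "global root gets an fd" case) does not.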

2015-07-16 05:15:55

by Andy Lutomirski

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Wed, Jul 15, 2015 at 9:23 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>>
>>> Ok. Andy I have stopped and really looked at your patch that is 4/7 in
>>> this series. Something I had not done before since it sounded totally
>>> wrong.
>>>
>>> That combined with your earlier comments I think I can say something
>>> meaningful.
>>>
>>> Andy as I read your patch the thread you are primarily worried about is
>>> chdir(/some/directory/in/another/mnt/ns). I think enhancing nosuid to
>>> deal with that case is reasonable, and is unlikely to break userspace.
>>> It is one of those hairy security things so we need to be careful not to
>>> introduce a regression.
>>>
>>
>> Indeed. It's plausible this could regress something, but it would be
>> really weird.
>>
>>> I think a top down enhancement of nosuid to just block funny cases that
>>> no one cares about is completely sensible. Removing goofy corner
>>> that no one cares about and that are only good for security exploits
>>> seems reasonable.
>>>
>>
>> Agreed.
>>
>>> I am a little concerned that smack does not seem to respect nosuid
>>> on filesystems. But that is an issue with nosuid not with your enhanced
>>> nosuid.
>>>
>>>
>>>
>>>
>>> Now this patch 3/7 really should be entitled:
>>> "Limit file caps to the userns of the super block".
>>>
>>> It really really is doing something different. This change is about a
>>> bottom up understanding of what file caps means on a filesystem mounted
>>> by a user namespace root.
>>>
>>> That is file caps should only apply to the user namespace root of the
>>> root user who mounted the filesystem, because that is all the privileges
>>> the mounter of the filesystem had.
>>>
>>> This guarantees that even if the filesystem somehow propagates with
>>> mount propagation that there will be no issues. I think I know how to
>>> make that happen...
>>>
>>>
>>>
>>>
>>> But deeply and fundamentally limiting a filesystem to only the
>>> privilieges of it's user namespace root, and enhancing nosuid
>>> protections are rather different things.
>>>
>>
>> So here's the semantic question:
>>
>> Suppose an unprivileged user (uid 1000) creates a user namespace and a
>> mount namespace. They stick a file (owned by uid 1000 as seen by
>> init_user_ns) in there and mark it setuid root and give it fcaps.
>
> To make this make sense I have to ask, is this file on a filesystem
> where uid 1000 as seen by the init_user_ns stored as uid 1000 on
> the filesystem? Or is this uid 0 as seen by the filesystem?
>
> I assume this is uid 0 on the filesystem in question or else your
> unprivileged user would not have sufficient privileges over the
> filesystem to setup fcaps.

I was thinking uid 0 as seen by the filesystem. But even if it were
uid 1000, the unprivileged user can still set whatever mode and xattrs
they want -- they control the backing store.

>
>> Then global root gets an fd to this filesystem. If they execve the
>> file directly, then, with my patch 4, it won't act as setuid 1000 and
>> the fcaps will be ignored. Even with my patch 4, though, if they bind
>> mount the fs and execve the file from their bind mount, it will act as
>> setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
>> fcaps will (correctly) not be honored.
>
> With patch 3 you can also think of it as fcaps being honored and you
> get all the caps in the appropriate user namespace, but since you are
> not in that user namespace and so don't have a place to store them
> in struct cred you don't get the file caps.
>
> From the philosophy of interpreting the file as defined by the
> filesystem in principle we could extend struct cred so you actually
> get the creds just in uid 1000s user namespace, but that is very
> unlikely to be worth it.

I agree.

>
>> I tend to thing that, if we're not honoring the fcaps, we shouldn't be
>> honoring the setuid bit either. After all, it's really not a trusted
>> file, even though the only user who could have messed with it really
>> is the apparent owner.
>
> For the file caps we can't honor them because you don't have the bits
> in struct cred.
>
> For setuid we can honor it, and setuid is something that the user
> namespace allows.
>

We certainly *can* honor it. But why should we? I'd be more
comfortable with this if the contents of an untrusted filesystem were
really treated as just data.

>> And, if we're going to say we don't trust the file and shouldn't honor
>> setuid or fcaps, then merging all the functionality into mnt_may_suid
>> could make sense. Yes, these two things do different things, but they
>> could hook in to the same place.
>
> There are really two separate questions:
> - Do we trust this filesystem?
> - Do you have the bits to implement this concept?
>
> Even if in this specific context the two questions wind up looking
> exactly the same. I think it makes a lot of sense to ask the two
> questions separately. As future maintenance changes may cause the
> implementation of the questions to diverge.
>

Agreed.

Unless someone thinks of an argument to the contrary, I'd say "no, we
don't trust this filesystem". I could be convinced otherwise.

--Andy

2015-07-16 05:51:44

by Eric W. Biederman

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

Andy Lutomirski <[email protected]> writes:

> On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>>
>>> So here's the semantic question:
>>>
>>> Suppose an unprivileged user (uid 1000) creates a user namespace and a
>>> mount namespace. They stick a file (owned by uid 1000 as seen by
>>> init_user_ns) in there and mark it setuid root and give it fcaps.
>>
>> To make this make sense I have to ask, is this file on a filesystem
>> where uid 1000 as seen by the init_user_ns stored as uid 1000 on
>> the filesystem? Or is this uid 0 as seen by the filesystem?
>>
>> I assume this is uid 0 on the filesystem in question or else your
>> unprivileged user would not have sufficient privileges over the
>> filesystem to setup fcaps.
>
> I was thinking uid 0 as seen by the filesystem. But even if it were
> uid 1000, the unprivileged user can still set whatever mode and xattrs
> they want -- they control the backing store.

Yes. And that is what I was really asking. Are we talking about a
filesystem where the user controls the backing store?

>>> Then global root gets an fd to this filesystem. If they execve the
>>> file directly, then, with my patch 4, it won't act as setuid 1000 and
>>> the fcaps will be ignored. Even with my patch 4, though, if they bind
>>> mount the fs and execve the file from their bind mount, it will act as
>>> setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
>>> fcaps will (correctly) not be honored.
>>
>> With patch 3 you can also think of it as fcaps being honored and you
>> get all the caps in the appropriate user namespace, but since you are
>> not in that user namespace and so don't have a place to store them
>> in struct cred you don't get the file caps.
>>
>> From the philosophy of interpreting the file as defined by the
>> filesystem in principle we could extend struct cred so you actually
>> get the creds just in uid 1000s user namespace, but that is very
>> unlikely to be worth it.
>
> I agree.
>
>>
>>> I tend to thing that, if we're not honoring the fcaps, we shouldn't be
>>> honoring the setuid bit either. After all, it's really not a trusted
>>> file, even though the only user who could have messed with it really
>>> is the apparent owner.
>>
>> For the file caps we can't honor them because you don't have the bits
>> in struct cred.
>>
>> For setuid we can honor it, and setuid is something that the user
>> namespace allows.
>>
>
> We certainly *can* honor it. But why should we? I'd be more
> comfortable with this if the contents of an untrusted filesystem were
> really treated as just data.

In these weird bleed-through situations I don't know that we should.
But extending nosuid protections in this way is a bit like yama:
a bit of gratuitous stomping on don't-care cases in the semantics to
make bugs harder to exploit.

>>> And, if we're going to say we don't trust the file and shouldn't honor
>>> setuid or fcaps, then merging all the functionality into mnt_may_suid
>>> could make sense. Yes, these two things do different things, but they
>>> could hook in to the same place.
>>
>> There are really two separate questions:
>> - Do we trust this filesystem?
>> - Do you have the bits to implement this concept?
>>
>> Even if in this specific context the two questions wind up looking
>> exactly the same. I think it makes a lot of sense to ask the two
>> questions separately. As future maintenance changes may cause the
>> implementation of the questions to diverge.
>>
>
> Agreed.
>
> Unless someone thinks of an argument to the contrary, I'd say "no, we
> don't trust this filesystem". I could be convinced otherwise.

But this is context dependent. From the perspective of the container
we really do want to trust the filesystem, as the container root set it
up and, if he isn't being hostile, likely has a use for setfcaps files
and setuid files and all of the rest.

Perhaps I should phrase it as:
- In this context do we trust the code? AKA mnt_may_suid?
- What do these bits mean in this context? (Usually something more complicated).

Which says to me we want both patches 3 and 4 (even if 4 uses s_user_ns)
because 3 is different than 4.

And now I better context switch back to fixing bind mounts.

Eric

2015-07-16 11:16:51

by Lukasz Pawelczyk

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, 2015-07-15 at 16:06 -0500, Eric W. Biederman wrote:
>
> I am on the fence with Lukasz Pawelczyk's patches. Some parts I
> liked
> some parts I had issues with. As I recall one of my issues was that
> those patches conflicted in detail if not in principle with this
> appropach.
>
> If these patches do not do a good job of laying the ground work for
> supporting security labels that unprivileged users can set than Seth
> could really use some feedback. Figuring out how to properly deal
> with
> the LSMs has been one of his challenges.

I fail to see how those 2 are in any conflict. Smack namespace is just
a means of limiting the view of Smack labels within a user namespace, to
be able to give some limited capabilities to processes in the namespace
to make it possible to partially administer Smack there. It doesn't
change Smack behaviour or mode of operation in any way.

If your approach here is to treat user ns mounted filesystems as if they
didn't support xattrs at all then my patches don't conflict here any
more than Smack itself already does.

If the filesystem gets a default label (e.g. via the smack* mount
options) then this label will work together with Smack namespaces.


--
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics


2015-07-16 13:07:15

by Seth Forshee

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 15, 2015 at 06:23:01PM -0700, Andy Lutomirski wrote:
> > So if we have the s_user_ns check in get_file_caps the mnt_may_suid pass
> > isn't strictly necessary, but I still think it is useful as a mitigation
> > to the "leaks" Eric mentions. It _should_ be impossible for a user to
> > gain access to another user's mount namespace,
>
> No, it's very easy with SCM_RIGHTS. We should make sure it's safe.

Sure, what I really meant was that an attacker shouldn't be able to do
so without cooperation from the other user's processes. But I think
we're all in agreement that making it safe is a good idea.

Seth
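
For readers unfamiliar with the mechanism mentioned above, here is a
minimal sketch of passing a file descriptor between processes with
SCM_RIGHTS over a UNIX-domain socket. The helper names send_fd() and
recv_fd() are made up for this example; the calls themselves
(sendmsg, recvmsg, the CMSG_* macros) are the standard socket API. A
descriptor for a directory inside one mount namespace could be handed
to a cooperating process in another namespace in this way.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send descriptor fd over UNIX socket sock. Returns 0 on success. */
static int send_fd(int sock, int fd)
{
    char byte = 'x';                 /* one byte of ordinary payload */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new fd, or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver ends up with its own descriptor number referring to the
same open file, which is why the checks discussed in this thread have
to hold no matter how a task came to hold the descriptor.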

2015-07-16 13:14:20

by Stephen Smalley

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 07/15/2015 09:05 PM, Andy Lutomirski wrote:
> On Jul 15, 2015 3:34 PM, "Eric W. Biederman" <[email protected]> wrote:
>>
>> Seth Forshee <[email protected]> writes:
>>
>>> On Wed, Jul 15, 2015 at 04:06:35PM -0500, Eric W. Biederman wrote:
>>>> Casey Schaufler <[email protected]> writes:
>>>>
>>>>> On 7/15/2015 12:46 PM, Seth Forshee wrote:
>>>>>> These are the first in a larger set of patches that I've been working on
>>>>>> (with help from Eric Biederman) to support mounting ext4 and fuse
>>>>>> filesystems from within user namespaces. I've pushed the full series to:
>>>>>>
>>>>>> git://kernel.ubuntu.com/sforshee/linux.git userns-mounts
>>>>>>
>>>>>> Taking the series as a whole, the strategy is to handle as much of the
>>>>>> heavy lifting as possible in the vfs so the filesystems don't have to
>>>>>> handle weird edge cases. If you look at the full series you'll find that
>>>>>> the changes in ext4 to support user namespace mounts turn out to be
>>>>>> fairly minimal (fuse is a bit more complicated though as it must deal
>>>>>> with translating ids for a userspace process which is running in pid and
>>>>>> user namespaces).
>>>>>>
>>>>>> The patches I'm sending today lay some of the groundwork in the vfs and
>>>>>> related code. They fall into two broad groups:
>>>>>>
>>>>>> 1. Patches 1-2 add s_user_ns and simplify MNT_NODEV handling. These are
>>>>>> pretty straightforward, and Eric has expressed interest in merging
>>>>>> these patches soon. Note that patch 2 won't apply cleanly without
>>>>>> Eric's noexec patches for proc and sys [1].
>>>>>>
>>>>>> 2. Patches 2-7 tighten down security for mounts with s_user_ns !=
>>>>>> &init_user_ns. This includes updates to how file caps and suid are
>>>>>> handled and LSM updates to ignore security labels on superblocks
>>>>>> from non-init namespaces.
>>>>>>
>>>>>> The LSM changes in particular may not be optimal, as I don't have a
>>>>>> lot of familiarity with this code, so I'd be especially appreciative
>>>>>> of review of these changes and suggestions on how to improve them.
>>>>>
>>>>> Lukasz Pawelczyk <[email protected]> proposed
>>>>> LSM support in user namespaces ([RFC] lsm: namespace hooks)
>>>>> that make a whole lot more sense than just turning off
>>>>> the option of using labels on files. Gutting the ability
>>>>> to use MAC in a namespace is a step down the road of
>>>>> making MAC and namespaces incompatible.
>>>>
>>>> This is not "turning off the option to use labels on files".
>>>>
>>>> This is supporting mounting filesystems like ext4 by unprivileged users
>>>> and not trusting the labels they set in the same way as we trust labels
>>>> on filesystems mounted by privileged users.
>>>>
>>>> The first step needs to be not trusting those labels and treating such
>>>> filesystems as filesystems without label support. I hope that is Seth
>>>> has implemented.
>>>>
>>>> In the long run we can do more interesting things with such filesystems
>>>> once the appropriate LSM policy is in place.
>>>
>>> Yes, this exactly. Right now it looks to me like the only safe thing to
>>> do with mounts from unprivileged users is to ignore the security labels,
>>> so that's what I'm trying to do with these changes. If there's some
>>> better thing to do, or some better way to do it, I'm more than happy to
>>> receive that feedback.
>>
>> Ugh.
>>
>> This made me realize that we have an interesting problem here. An
>> unprivileged mount of tmpfs probably needs to have
>> s_user_ns == &init_user_ns.
>>
>> Otherwise we will break security labels on tmpfs for no good reason.
>> ramfs and sysfs also seem to have similar concerns.
>>
>> Because they have no backing store we can trust those filesystems with
>> security labels. Plus for at least sysfs there is the security label
>> bleed through issue, that we need to make certain works.
>>
>> Perhaps these filesystems with trusted backing store need to call
>> "sget_userns(..., &init_user_ns)".
>>
>> If we don't get this right we will have significant regressions with
>> respect to security labels, and that is not ok.
>
> That's only a problem if there's anyone who sets security labels on
> such a mount. You need global caps to do that (I hope), which
> requires someone outside the userns to help, which means there's a
> good chance that literally no one does this.

Setting of security.selinux attributes is governed by SELinux permission
checks, not by capabilities.

Also, files are always assigned a label at creation time; a tmpfs inode
will be labeled based on its creator without any userspace entity ever
calling setxattr() at all.
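
One way to picture the idea floated in the quoted text (filesystems
whose backing store the kernel controls keep init_user_ns as the
superblock owner, so their labels stay trusted) is the toy model below.
It is an assumption-laden sketch: struct fs_type, its
user_controls_backing_store field, and owner_ns_for_mount() are
hypothetical and do not correspond to an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct user_namespace { const char *name; };

static const struct user_namespace init_user_ns = { "init" };

/* Hypothetical description of a filesystem type. */
struct fs_type {
    const char *name;
    bool user_controls_backing_store;  /* e.g. true for a block-device fs */
};

/*
 * Pick the namespace recorded as the superblock owner. When the
 * mounter cannot tamper with the backing store (tmpfs, ramfs, sysfs
 * in this model), the superblock stays owned by init_user_ns and its
 * security labels remain trusted.
 */
static const struct user_namespace *
owner_ns_for_mount(const struct fs_type *type,
                   const struct user_namespace *mounter_ns)
{
    assert(type != NULL && mounter_ns != NULL);
    return type->user_controls_backing_store ? mounter_ns : &init_user_ns;
}
```

Under this model a tmpfs mounted from inside a container would still
have its inodes labeled at creation time and those labels honored,
while an ext4 image supplied by the container would not.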

2015-07-16 13:14:16

by Seth Forshee

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Thu, Jul 16, 2015 at 12:44:49AM -0500, Eric W. Biederman wrote:
> Andy Lutomirski <[email protected]> writes:
>
> > On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
> > <[email protected]> wrote:
> >> Andy Lutomirski <[email protected]> writes:
> >>
> >>>
> >>> So here's the semantic question:
> >>>
> >>> Suppose an unprivileged user (uid 1000) creates a user namespace and a
> >>> mount namespace. They stick a file (owned by uid 1000 as seen by
> >>> init_user_ns) in there and mark it setuid root and give it fcaps.
> >>
> >> To make this make sense I have to ask, is this file on a filesystem
> >> where uid 1000 as seen by the init_user_ns stored as uid 1000 on
> >> the filesystem? Or is this uid 0 as seen by the filesystem?
> >>
> >> I assume this is uid 0 on the filesystem in question or else your
> >> unprivileged user would not have sufficient privileges over the
> >> filesystem to setup fcaps.
> >
> > I was thinking uid 0 as seen by the filesystem. But even if it were
> > uid 1000, the unprivileged user can still set whatever mode and xattrs
> > they want -- they control the backing store.
>
> Yes. And that is what I was really asking. Are we taking about a
> filesystem where the user controls the backing store?
>
> >>> Then global root gets an fd to this filesystem. If they execve the
> >>> file directly, then, with my patch 4, it won't act as setuid 1000 and
> >>> the fcaps will be ignored. Even with my patch 4, though, if they bind
> >>> mount the fs and execve the file from their bind mount, it will act as
> >>> setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
> >>> fcaps will (correctly) not be honored.
> >>
> >> With patch 3 you can also think of it as fcaps being honored and you
> >> get all the caps in the appropriate user namespace, but since you are
> >> not in that user namespace and so don't have a place to store them
> >> in struct cred you don't get the file caps.
> >>
> >> From the philosophy of interpreting the file as defined by the
> >> filesystem in principle we could extend struct cred so you actually
> >> get the creds just in uid 1000s user namespace, but that is very
> >> unlikely to be worth it.
> >
> > I agree.
> >
> >>
> >>> I tend to thing that, if we're not honoring the fcaps, we shouldn't be
> >>> honoring the setuid bit either. After all, it's really not a trusted
> >>> file, even though the only user who could have messed with it really
> >>> is the apparent owner.
> >>
> >> For the file caps we can't honor them because you don't have the bits
> >> in struct cred.
> >>
> >> For setuid we can honor it, and setuid is something that the user
> >> namespace allows.
> >>
> >
> > We certainly *can* honor it. But why should we? I'd be more
> > comfortable with this if the contents of an untrusted filesystem were
> > really treated as just data.
>
> In these weird bleed through situtations I don't know that we should.
> But extending nosuid protections in this way is a bit like yama
> a bit gratuitious stomping don't care cases in the semantics to
> make bugs harder to exploit.
>
> >>> And, if we're going to say we don't trust the file and shouldn't honor
> >>> setuid or fcaps, then merging all the functionality into mnt_may_suid
> >>> could make sense. Yes, these two things do different things, but they
> >>> could hook in to the same place.
> >>
> >> There are really two separate questions:
> >> - Do we trust this filesystem?
> >> - Do you have the bits to implement this concept?
> >>
> >> Even if in this specific context the two questions wind up looking
> >> exactly the same. I think it makes a lot of sense to ask the two
> >> questions separately. As future maintenance changes may cause the
> >> implementation of the questions to diverge.
> >>
> >
> > Agreed.
> >
> > Unless someone thinks of an argument to the contrary, I'd say "no, we
> > don't trust this filesystem". I could be convinced otherwise.
>
> But this is context dependent. From the perspective of the container
> we really do want to trust the filesystem. As the container root set it
> up, and if he isn't being hostile likely has a use for setfcaps files
> and setuid files and all of the rest.
>
> Perhaps I should phrase it as:
> - In this context do we trust the code? AKA mnt_may_suid?
> - What do these bits mean in this context? (Usually something more complicated).
>
> Which says to me we want both patches 3 and 4 (even if 4 uses s_user_ns)
> because 3 is different than 4.

So what I'll do is:

- Add an s_user_ns check to mnt_may_suid
- Keep the (now redundant) s_user_ns check in get_file_caps

I'm on the fence about having both the mnt and user ns checks in
mnt_may_suid - it might be overkill, but it still adds the protection
against clearing MNT_NOSUID in a bind mount. So I guess I'll keep the
mnt ns check.
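
To make that concrete, here is a rough user-space model of the combined
check. The struct layouts, the `struct task` argument, and the name
`mnt_may_suid_sketch` are stand-ins invented for illustration. Reading the
s_user_ns check as "the caller's user namespace is the superblock's owning
namespace or a descendant of it" is an assumption, not something settled in
this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the real kernel definitions. */
struct user_namespace { struct user_namespace *parent; };
struct mnt_namespace  { int id; };
struct super_block    { struct user_namespace *s_user_ns; };
struct vfsmount {
	unsigned int          mnt_flags;
	struct super_block   *mnt_sb;
	struct mnt_namespace *mnt_ns;
};
struct task {
	struct mnt_namespace  *mnt_ns;
	struct user_namespace *user_ns;
};

#define MNT_NOSUID 0x01u /* value chosen for this sketch only */

/* True if @ns is @target or one of @target's descendants. */
static bool in_userns(const struct user_namespace *ns,
		      const struct user_namespace *target)
{
	for (; ns != NULL; ns = ns->parent)
		if (ns == target)
			return true;
	return false;
}

/*
 * Honor suid/fcaps only if all three conditions hold:
 *  - the mount is not flagged nosuid,
 *  - the mount belongs to the caller's mount namespace (mnt ns check),
 *  - the caller lives in the superblock's user namespace (s_user_ns check).
 */
static bool mnt_may_suid_sketch(const struct task *t,
				const struct vfsmount *mnt)
{
	assert(t != NULL && mnt != NULL && mnt->mnt_sb != NULL);
	return !(mnt->mnt_flags & MNT_NOSUID) &&
	       mnt->mnt_ns == t->mnt_ns &&
	       in_userns(t->user_ns, mnt->mnt_sb->s_user_ns);
}
```

In this model a task inside the owning namespace passes, while a task in
the initial namespace looking at the same mount sees it as nosuid. The mnt
ns comparison still catches a bind mount in another mount namespace that
has had MNT_NOSUID cleared.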

Seth

2015-07-16 13:24:37

by Stephen Smalley

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/15/2015 03:46 PM, Seth Forshee wrote:
> Unprivileged users should not be able to supply security labels
> in filesystems, nor should they be able to supply security
> contexts in unprivileged mounts. For any mount where s_user_ns is
> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
> and return EPERM if any contexts are supplied in the mount
> options.
>
> Signed-off-by: Seth Forshee <[email protected]>

I think this is obsoleted by the subsequent discussion, but just for the
record: this patch would cause the files in the userns mount to be left
with the "unlabeled" label, and therefore under typical policies,
completely inaccessible to any process in a confined domain.

> ---
> security/selinux/hooks.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 459e71ddbc9d..eeb71e45ab82 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -732,6 +732,19 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> !strcmp(sb->s_type->name, "pstore"))
> sbsec->flags |= SE_SBGENFS;
>
> + /*
> + * If this is a user namespace mount, no contexts are allowed
> + * on the command line and security labels must be ignored.
> + */
> + if (sb->s_user_ns != &init_user_ns) {
> + if (context_sid || fscontext_sid || rootcontext_sid ||
> + defcontext_sid)
> + return -EPERM;
> + sbsec->behavior = SECURITY_FS_USE_NONE;
> + goto out_set_opts;
> + }
> +
> +
> if (!sbsec->behavior) {
> /*
> * Determine the labeling behavior to use for this
> @@ -813,6 +826,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> sbsec->def_sid = defcontext_sid;
> }
>
> +out_set_opts:
> rc = sb_finish_set_opts(sb);
> out:
> mutex_unlock(&sbsec->lock);
>
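
For reference, the control flow this hunk adds can be modelled in user
space roughly as follows. The types and names are simplified stand-ins
rather than SELinux's own, and the normal behavior selection for
init-namespace mounts is elided.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the labeling behaviors; not the real SELinux constants. */
enum labeling_model { LABEL_UNSET, LABEL_NONE };

struct mount_opts_model {
	/* True if any of context=, fscontext=, rootcontext= or
	 * defcontext= was supplied with the mount. */
	bool has_context;
};

struct sb_model {
	bool owned_by_init_userns; /* models s_user_ns == &init_user_ns */
	enum labeling_model behavior;
};

/*
 * Models the added hunk: a user namespace mount may not supply contexts,
 * and on-disk labels are ignored by forcing the "none" behavior.
 */
static int set_mnt_opts_model(struct sb_model *sb,
			      const struct mount_opts_model *opts)
{
	assert(sb != NULL && opts != NULL);
	if (!sb->owned_by_init_userns) {
		if (opts->has_context)
			return -EPERM;
		sb->behavior = LABEL_NONE;
		return 0;
	}
	/* Mounts owned by the initial namespace: unchanged path, elided. */
	return 0;
}
```

The consequence noted above follows from the forced "none" behavior: the
model never consults on-disk labels for such a mount, so the files end up
with whatever label policy assigns to unlabeled objects.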

2015-07-16 14:00:55

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
>
> Seth I think for the LSMs we should start with:
>
> diff --git a/security/security.c b/security/security.c
> index 062f3c997fdc..5b6ece92a8e5 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
> int security_sb_mount(const char *dev_name, struct path *path,
> const char *type, unsigned long flags, void *data)
> {
> + if (current_user_ns() != &init_user_ns)
> + return -EPERM;
> return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
> }

This just makes it impossible to mount from a user namespace. Every
mount from current_user_ns() != &init_user_ns will fail.

> Then we should push this down into all of the lsms.
> Then we should remove or relax or change the check as appropriate
> in each lsm.
>
> The point is this is good enough to see that it is trivially safe,
> and this allows us to focus on the core issues, and stop worrying about
> the lsms for a bit.
>
> Then we can focus on each lsm one at a time and take the time to really
> understand them and talk with their maintainers etc to make certain
> we get things correct.
>
> This should remove the need for your patches 5, 6 and 7. For the
> immediate future.

I'm still not entirely sure what you were trying to do, maybe refuse to
mount whenever a security module is loaded? I think this could be a good
option to start, but couldn't we restrict it to only the LSMs which use
xattrs for security labels? In situations where the filesystem cannot
supply security policy metadata I can't think of any reason to disallow
the mounts.

Seth

2015-07-16 15:09:23

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/16/2015 6:59 AM, Seth Forshee wrote:
> On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
>> Seth I think for the LSMs we should start with:
>>
>> diff --git a/security/security.c b/security/security.c
>> index 062f3c997fdc..5b6ece92a8e5 100644
>> --- a/security/security.c
>> +++ b/security/security.c
>> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
>> int security_sb_mount(const char *dev_name, struct path *path,
>> const char *type, unsigned long flags, void *data)
>> {
>> + if (current_user_ns() != &init_user_ns)
>> + return -EPERM;
>> return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
>> }
> This just makes it impossible to mount from a user namespace. Every
> mount from current_user_ns() != &init_user_ns will fail.
>
>> Then we should push this down into all of the lsms.
>> Then we should remove or relax or change the check as appropriate
>> in each lsm.
>>
>> The point is this is good enough to see that it is trivially safe,
>> and this allows us to focus on the core issues, and stop worrying about
>> the lsms for a bit.

Given the extent to which LSMs are deployed I find it a bit
worrisome that they might not be considered a "core issue".

>> Then we can focus on each lsm one at a time and take the time to really
>> understand them and talk with their maintainers etc to make certain
>> we get things correct.

The "Do the easy stuff, fix the hard stuff after we've sold the product"
approach works really well until you get to the point of fixing the hard
stuff. This is the origin of the 90/90 rule of software development.

>>
>> This should remove the need for your patches 5, 6 and 7. For the
>> immediate future.
> I'm still not entirely sure what you were trying to do, maybe refuse to
> mount whenever a security module is loaded? I think this could be a good
> option to start, but couldn't we restrict it to only the LSMs which use
> xattrs for security labels? In situations where the filesystem cannot
> supply security policy metadata I can't think of any reason to disallow
> the mounts.

This whole notion of mounting a generic filesystem (e.g. ext4) that
is "owned" by a user (as opposed to the system) has lots of implications,
and I seriously doubt that many of them have been accounted for.

Think back to the "negative group access" issue. You can't just
ignore issues that are inconvenient, or claim that you have a reasonable
system just because *you* can't think of a problem.

> Seth

2015-07-16 16:00:12

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 08:59:47AM -0500, Seth Forshee wrote:
> On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
> >
> > Seth I think for the LSMs we should start with:
> >
> > diff --git a/security/security.c b/security/security.c
> > index 062f3c997fdc..5b6ece92a8e5 100644
> > --- a/security/security.c
> > +++ b/security/security.c
> > @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
> > int security_sb_mount(const char *dev_name, struct path *path,
> > const char *type, unsigned long flags, void *data)
> > {
> > + if (current_user_ns() != &init_user_ns)
> > + return -EPERM;
> > return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
> > }
>
> This just makes it impossible to mount from a user namespace. Every
> mount from current_user_ns() != &init_user_ns will fail.

What might work instead is to add a check in security_sb_kern_mount.
Then it would need to check s_user_ns, that way if proc, sysfs, etc.
use sget_userns(..., &init_user_ns) they can still be mounted in
containers.
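
As a sketch of the difference, assuming simplified stand-in types (none of
the names below are real kernel symbols), keying the check on the
superblock's owner rather than on the caller would behave like this:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects. */
struct user_ns_model { int id; };
struct sb_owner_model { const struct user_ns_model *s_user_ns; };

static const struct user_ns_model init_ns_model = { 0 };

/* The check as first proposed: refuse based on who is mounting. */
static int check_by_caller(const struct user_ns_model *caller_ns)
{
	assert(caller_ns != NULL);
	return caller_ns != &init_ns_model ? -EPERM : 0;
}

/* The alternative: refuse based on which namespace owns the superblock. */
static int check_by_sb_owner(const struct sb_owner_model *sb)
{
	assert(sb != NULL && sb->s_user_ns != NULL);
	return sb->s_user_ns != &init_ns_model ? -EPERM : 0;
}
```

With the second form, a container task mounting a filesystem whose
superblock is still owned by the initial namespace (the proc and sysfs
case) is allowed, while a superblock owned by the container's namespace is
refused.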

It would be nicer to have a hook after sget but before fill_super so
that a bunch of work doesn't have to be done and then undone. Right now
there doesn't seem to be any suitable hook.

> > Then we should push this down into all of the lsms.
> > Then we should remove or relax or change the check as appropriate
> > in each lsm.
> >
> > The point is this is good enough to see that it is trivially safe,
> > and this allows us to focus on the core issues, and stop worrying about
> > the lsms for a bit.
> >
> > Then we can focus on each lsm one at a time and take the time to really
> > understand them and talk with their maintainers etc to make certain
> > we get things correct.
> >
> > This should remove the need for your patches 5, 6 and 7. For the
> > immediate future.
>
> I'm still not entirely sure what you were trying to do, maybe refuse to
> mount whenever a security module is loaded? I think this could be a good
> option to start, but couldn't we restrict it to only the LSMs which use
> xattrs for security labels? In situations where the filesystem cannot
> supply security policy metadata I can't think of any reason to disallow
> the mounts.
>
> Seth

2015-07-16 18:58:58

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 08:09:20AM -0700, Casey Schaufler wrote:
> On 7/16/2015 6:59 AM, Seth Forshee wrote:
> > On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
> >> Seth I think for the LSMs we should start with:
> >>
> >> diff --git a/security/security.c b/security/security.c
> >> index 062f3c997fdc..5b6ece92a8e5 100644
> >> --- a/security/security.c
> >> +++ b/security/security.c
> >> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
> >> int security_sb_mount(const char *dev_name, struct path *path,
> >> const char *type, unsigned long flags, void *data)
> >> {
> >> + if (current_user_ns() != &init_user_ns)
> >> + return -EPERM;
> >> return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
> >> }
> > This just makes it impossible to mount from a user namespace. Every
> > mount from current_user_ns() != &init_user_ns will fail.
> >
> >> Then we should push this down into all of the lsms.
> >> Then we should remove or relax or change the check as appropriate
> >> in each lsm.
> >>
> >> The point is this is good enough to see that it is trivially safe,
> >> and this allows us to focus on the core issues, and stop worrying about
> >> the lsms for a bit.
>
> Given the extent to which LSMs are deployed I find it a bit
> worrisome that they might not be considered a "core issue".
>
> >> Then we can focus on each lsm one at a time and take the time to really
> >> understand them and talk with their maintainers etc to make certain
> >> we get things correct.
>
> The "Do the easy stuff, fix the hard stuff after we've sold the product"
> approach works really well until you get to the point of fixing the hard
> stuff. This is the origin of the 90/90 rule of software development.
>
> >>
> >> This should remove the need for your patches 5, 6 and 7. For the
> >> immediate future.
> > I'm still not entirely sure what you were trying to do, maybe refuse to
> > mount whenever a security module is loaded? I think this could be a good
> > option to start, but couldn't we restrict it to only the LSMs which use
> > xattrs for security labels? In situations where the filesystem cannot
> > supply security policy metadata I can't think of any reason to disallow
> > the mounts.
>
> This whole notion of mounting a generic filesystem (e.g. ext4) that
> is "owned" by a user (as opposed to the system) has lots of implications,
> and I seriously doubt that many of them have been accounted for.
>
> Think back to the "negative group access" issue. You can't just
> ignore issues that are inconvenient, or claim that you have a reasonable
> system just because *you* can't think of a problem.

I've spent a lot of time considering the implications and previous
vulnerabilities, and I've addressed everything I turned up. Now I'm
asking for review from those with more experience with and expertise in
the code in question. I'm not sure what more I should be doing.

I welcome feedback about anything I've missed, but stating generally
that you think I probably missed something isn't very helpful.

The LSM issue is thornier than the rest of it though, which is why I
specifically asked for review there in the cover letter. There's a lot
of complexity and nuance, and I still don't have a grasp on all the
subtleties. One such subtlety is the full impact of simply ignoring the
security labels on disk (but I am still confused as to why this is
different from filesystems which don't support xattrs at all).

I was unaware of Lukasz's patches until yesterday, and I will have a
look at them. But since we don't have the LSM support for user
namespaces yet, I don't see the problem with doing something safe for
LSMs initially and evolving the LSM integration for user ns mounts along
with the rest of the user ns integration.

Your point is taken about my less-than-expert opinion about the other
security modules. We should at minimum get acks from the maintainers of
those modules that unprivileged mounts will not compromise MAC.

For Smack specifically, I believe my only concern was the SMACK64EXEC
attribute, as all the other attributes only affected subjects' access to
the files. So maybe it would be possible to simply ignore this attribute
in unprivileged mounts and respect the others, even lacking more
complete LSM support for user namespaces.

Seth

2015-07-16 21:42:22

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/16/2015 11:57 AM, Seth Forshee wrote:
> On Thu, Jul 16, 2015 at 08:09:20AM -0700, Casey Schaufler wrote:
>> On 7/16/2015 6:59 AM, Seth Forshee wrote:
>>> On Wed, Jul 15, 2015 at 10:15:21PM -0500, Eric W. Biederman wrote:
>>>> Seth I think for the LSMs we should start with:
>>>>
>>>> diff --git a/security/security.c b/security/security.c
>>>> index 062f3c997fdc..5b6ece92a8e5 100644
>>>> --- a/security/security.c
>>>> +++ b/security/security.c
>>>> @@ -310,6 +310,8 @@ int security_sb_statfs(struct dentry *dentry)
>>>> int security_sb_mount(const char *dev_name, struct path *path,
>>>> const char *type, unsigned long flags, void *data)
>>>> {
>>>> + if (current_user_ns() != &init_user_ns)
>>>> + return -EPERM;
>>>> return call_int_hook(sb_mount, 0, dev_name, path, type, flags, data);
>>>> }
>>> This just makes it impossible to mount from a user namespace. Every
>>> mount from current_user_ns() != &init_user_ns will fail.
>>>
>>>> Then we should push this down into all of the lsms.
>>>> Then we should remove or relax or change the check as appropriate
>>>> in each lsm.
>>>>
>>>> The point is this is good enough to see that it is trivially safe,
>>>> and this allows us to focus on the core issues, and stop worrying about
>>>> the lsms for a bit.
>> Given the extent to which LSMs are deployed I find it a bit
>> worrisome that they might not be considered a "core issue".
>>
>>>> Then we can focus on each lsm one at a time and take the time to really
>>>> understand them and talk with their maintainers etc to make certain
>>>> we get things correct.
>> The "Do the easy stuff, fix the hard stuff after we've sold the product"
>> approach works really well until you get to the point of fixing the hard
>> stuff. This is the origin of the 90/90 rule of software development.
>>
>>>> This should remove the need for your patches 5, 6 and 7. For the
>>>> immediate future.
>>> I'm still not entirely sure what you were trying to do, maybe refuse to
>>> mount whenever a security module is loaded? I think this could be a good
>>> option to start, but couldn't we restrict it to only the LSMs which use
>>> xattrs for security labels? In situations where the filesystem cannot
>>> supply security policy metadata I can't think of any reason to disallow
>>> the mounts.
>> This whole notion of mounting a generic filesystem (e.g. ext4) that
>> is "owned" by a user (as opposed to the system) has lots of implications,
>> and I seriously doubt that many of them have been accounted for.
>>
>> Think back to the "negative group access" issue. You can't just
>> ignore issues that are inconvenient, or claim that you have a reasonable
>> system just because *you* can't think of a problem.
> I've spent a lot of time considering the implications and previous
> vulnerabilities, and I've addressed everything I turned up. Now I'm
> asking for review from those with more experience with and expertise in
> the code in question. I'm not sure what more I should be doing.

Part of the problem I see is that you're looking at the details
when there's an architectural issue. That's OK, it happens all
the time, but we have to pull the issue up slightly higher in
order to address the underlying difficulties.

You want to provide a mechanism whereby an unprivileged user (Seth)
can mount a filesystem for his own use. You want full filesystem
semantics, but you're willing to accept restrictions on certain
filesystem features to avoid opening security holes. You are not
willing to accept restrictions that make the filesystem unusable,
such as making it read-only.

I am going to present a suggestion. Feel free to correct my
assumptions and my reasoning. For simplicity let's use loop-back
mounting of a filesystem contained in a file as an example. The
principles should apply to newly created memory based filesystems
or disk partitions "owned" by Seth.

Seth wants to mount a file (~seth/myfs) which contains an ext4
filesystem. There is already a filesystem object, with security
attributes, that the system knows how to deal with. If Seth mounts
this as a filesystem he, and potentially other people, will be
able to access the content of this object without accessing the
object itself.

seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
seth$ chmod 777 /tmp/seth
seth$ ls -la /tmp/seth
drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
drwxrwxrwt 18 root root 4069 Jul 16 11:13 ..
seth$

Everything's fine at this point. Wilma is also using the system,
being the sort who likes to hide things in out of the way places

wilma$ cp ~/scandals /tmp/seth
wilma$ chmod 600 /tmp/seth/scandals

puts her list of scandals on the unsuspecting filesystem, and changes
the mode to ensure that no one can find out what went on after the
office party.

Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
happened at the office party, and the story goes from there.

Wilma did everything correctly according to the system security policy,
but the system security policy did not protect her as advertised. The
system was tricked into behaving as if it was in control of the content
of the filesystem when in fact it was not.

One way to fix this problem is for unprivileged mounts to recognize the
attributes of the object mounted and to propagate those attributes to all
the objects they present. All files on /tmp/seth would be owned by seth
and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.
Opening a file on /tmp/seth would require the same permissions as opening
the file containing the mounted filesystem. These attributes would have to
be immutable, or at least demonstrably more restrictive (chmod might be
allowed in some cases, but chown would never be) when changed. I don't see
how a user other than seth could create a new file, as you'd either have
a magical change in ownership or a false sense of security.
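
One way to read that rule, as a minimal sketch: every object on the mount
presents the backing file's owner and mode, chmod may only take permission
bits away, and chown always fails. The types, function names, and the
choice of EPERM are all assumptions made for illustration, and ACLs and
LSM attributes are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute pair; ACL and LSM labels are omitted. */
struct attrs_model {
	unsigned int uid;
	unsigned int mode; /* permission bits only, e.g. 0600 */
};

/* Every object on the mount presents the backing file's attributes. */
static struct attrs_model presented_attrs(const struct attrs_model *backing)
{
	assert(backing != NULL);
	return *backing;
}

/* chmod succeeds only if it grants no bit the backing file lacks. */
static int try_chmod(const struct attrs_model *backing,
		     struct attrs_model *file, unsigned int new_mode)
{
	assert(backing != NULL && file != NULL);
	if (new_mode & ~backing->mode & 0777u)
		return -EPERM;
	file->mode = new_mode;
	return 0;
}

/* chown is never allowed: ownership always follows the backing file. */
static int try_chown(struct attrs_model *file, unsigned int new_uid)
{
	(void)file;
	(void)new_uid;
	return -EPERM;
}
```

Under this reading wilma could not end up owning /tmp/seth/scandals at
all, so the chmod 600 in the story could no longer give her a false sense
of privacy.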

I don't see that the presence of user namespaces changes anything. You
may reduce the set of uids available, but the problem with putting a
uid into someone else's file is just as real.

> I welcome feedback about anything I've missed, but stating generally
> that you think I probably missed something isn't very helpful.

True enough. I hope I've explained myself above.

> The LSM issue is thornier than the rest of it though, which is why I
> specifically asked for review there in the cover letter. There's a lot
> of complexity and nuance, and I still don't have a grasp on all the
> subtleties. One such subtlety is the full impact of simply ignoring the
> security labels on disk (but I am still confused as to why this is
> different from filesystems which don't support xattrs at all).

If you can mount a filesystem such that the labels are ignored you
are effectively specifying that the Smack label on the files be
determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
Without it, it's not.

> I was unaware of Lukasz's patches until yesterday, and I will have a
> look at them. But since we don't have the LSM support for user
> namespaces yet, I don't see the problem with doing something safe for
> LSMs initially and evolving the LSM integration for user ns mounts along
> with the rest of the user ns integration.

Ignoring the security attributes is not safe!

> Your point is taken about my less-than-expert opinion about the other
> security modules. We should at minimum get acks from the maintainers of
> those modules that unprivileged mounts will not compromise MAC.

I am the Smack maintainer. Unprivileged mounts as you have
described them compromise MAC. They compromise DAC, too.


> For Smack specifically, I believe my only concern was the SMACK64EXEC
> attribute, as all the other attributes only affected subjects' access to
> the files. So maybe it would be possible to simply ignore this attribute
> in unprivileged mounts and respect the others, even lacking more
> complete LSM support for user namespaces.

SMACK64EXEC is analogous to the setuid bit, but I would rather see
exec() of programs with this attribute refused than for it to be
blindly ignored.
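
The two options can be put side by side in a small model. Everything here
is hypothetical: the enum, the struct, the function name, and the use of
EPERM as the refusal code are illustration only and are not Smack's actual
interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum exec_label_policy {
	EXEC_LABEL_IGNORE, /* run the program, drop the label transition */
	EXEC_LABEL_REFUSE  /* fail the exec() outright */
};

struct exec_file_model {
	bool on_unprivileged_mount;
	bool has_exec_label; /* models a SMACK64EXEC attribute being set */
};

/*
 * Returns 0 if exec may proceed, with *apply_label saying whether the
 * file's exec label takes effect, or -EPERM if the exec is refused.
 */
static int check_exec_model(const struct exec_file_model *f,
			    enum exec_label_policy policy,
			    bool *apply_label)
{
	assert(f != NULL && apply_label != NULL);
	*apply_label = false;
	if (!f->has_exec_label)
		return 0;
	if (!f->on_unprivileged_mount) {
		*apply_label = true;
		return 0;
	}
	return policy == EXEC_LABEL_REFUSE ? -EPERM : 0;
}
```

Refusing makes the restriction visible to whoever tries to run the
program, whereas ignoring lets it run silently under a label its author
did not intend.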

> Seth
>

2015-07-16 22:28:14

by Andy Lutomirski

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <[email protected]> wrote:
> You want to provide a mechanism whereby an unprivileged user (Seth)
> can mount a filesystem for his own use. You want full filesystem
> semantics, but you're willing to accept restrictions on certain
> filesystem features to avoid opening security holes. You are not
> willing to accept restrictions that make the filesystem unusable,
> such as making it read-only.
>
> I am going to present a suggestion. Feel free to correct my
> assumptions and my reasoning. For simplicity let's use loop-back
> mounting of a filesystem contained in a file as an example. The
> principles should apply to newly created memory based filesystems
> or disk partitions "owned" by Seth.
>
> Seth wants to mount a file (~seth/myfs) which contains an ext4
> filesystem. There is already a filesystem object, with security
> attributes, that the system knows how to deal with. If Seth mounts
> this as a filesystem he, and potentially other people, will be
> able to access the content of this object without accessing the
> object itself.
>
> seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
> seth$ chmod 777 /tmp/seth
> seth$ ls -la /tmp/seth
> drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
> drwxrwxrwt 18 root root 4069 Jul 16 11:13 ..
> seth$
>
> Everything's fine at this point. Wilma is also using the system,
> being the sort who likes to hide things in out of the way places
>
> wilma$ cp ~/scandals /tmp/seth
> wilma$ chmod 600 /tmp/seth/scandals

This is already impossible as described. Seth can only mount the
filesystem in a private mount namespace inside a user namespace that
he created. Wilma can't see it unless Seth passes an fd to Wilma and
Wilma accepts and uses it.

>
> puts her list of scandals on the unsuspecting filesystem, and changes
> the mode to ensure that no one can find out what went on after the
> office party.
>
> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
> happened at the office party, and the story goes from there.
>
> Wilma did everything correctly according to the system security policy,
> but the system security policy did not protect her as advertised. The
> system was tricked into behaving as if it was in control of the content
> of the filesystem when in fact it was not.


I would argue that, if Wilma writes to some place described by an fd
and doesn't verify where she's writing to, then she has no expectation
of privacy. After all, she could just *tell* Seth directly whatever
she wants (assuming she can communicate with Seth in the first place).

>
> One way to fix this problem is for unprivileged mounts to recognize the
> attributes of the object mounted and to propagate those attributes to all
> the objects they present. All files on /tmp/seth would be owned by seth
> and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.

This is impossible to enforce, because Seth could use FUSE instead of ext4.

> Opening a file on /tmp/seth would require the same permissions as opening
> the file containing the mounted filesystem. These attributes would have to
> be immutable, or at least demonstrably more restrictive (chmod might be
> allowed in some cases, but chown would never be) when changed. I don't see
> how a user other than seth could create a new file, as you'd either have
> a magical change in ownership or a false sense of security.

This would be a very harsh restriction. Seth might legitimately want
to give a user access to a file on backing store he owns without
giving that user access to the backing store. Root on a normal system
does that all the time.

> If you can mount a filesystem such that the labels are ignored you
> are effectively specifying that the Smack label on the files be
> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
> Without it, it's not.

Can you explain what the threat model is here? I don't see what it is
that you're trying to prevent.

>> Your point is taken about my less-than-expert opinion about the other
>> security modules. We should at minimum get acks from the maintainers of
>> those modules that unprivileged mounts will not compromise MAC.
>
> I am the Smack maintainer. Unprivileged mounts as you have
> described them compromise MAC. They compromise DAC, too.
>

How do they compromise DAC?

--Andy

2015-07-16 23:09:00

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/16/2015 3:27 PM, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <[email protected]> wrote:
>> You want to provide a mechanism whereby an unprivileged user (Seth)
>> can mount a filesystem for his own use. You want full filesystem
>> semantics, but you're willing to accept restrictions on certain
>> filesystem features to avoid opening security holes. You are not
>> willing to accept restrictions that make the filesystem unusable,
>> such as making it read-only.
>>
>> I am going to present a suggestion. Feel free to correct my
>> assumptions and my reasoning. For simplicity let's use loop-back
>> mounting of a filesystem contained in a file as an example. The
>> principles should apply to newly created memory based filesystems
>> or disk partitions "owned" by Seth.
>>
>> Seth wants to mount a file (~seth/myfs) which contains an ext4
>> filesystem. There is already a filesystem object, with security
>> attributes, that the system knows how to deal with. If Seth mounts
>> this as a filesystem he, and potentially other people, will be
>> able to access the content of this object without accessing the
>> object itself.
>>
>> seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>> seth$ chmod 777 /tmp/seth
>> seth$ ls -la /tmp/seth
>> drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
>> drwxrwxrwt 18 root root 4069 Jul 16 11:13 ..
>> seth$
>>
>> Everything's fine at this point. Wilma is also using the system,
>> being the sort who likes to hide things in out of the way places
>>
>> wilma$ cp ~/scandals /tmp/seth
>> wilma$ chmod 600 /tmp/seth/scandals
> This is already impossible as described. Seth can only mount the
> filesystem in a private mount namespace inside a user namespace that
> he created. Wilma can't see it unless Seth passes an fd to Wilma and
> Wilma accepts and uses it.

But you do have multiple UIDs within your user namespace, right?
There are processes running as someone other than seth, right?

>
>> puts her list of scandals on the unsuspecting filesystem, and changes
>> the mode to ensure that no one can find out what went on after the
>> office party.
>>
>> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
>> happened at the office party, and the story goes from there.
>>
>> Wilma did everything correctly according to the system security policy,
>> but the system security policy did not protect her as advertised. The
>> system was tricked into behaving as if it was in control of the content
>> of the filesystem when in fact it was not.
>
> I would argue that, if Wilma writes to some place described by an fd
> and doesn't verify where she's writing to, then she has no expectation
> of privacy. After all, she could just *tell* Seth directly whatever
> she wants (assuming she can communicate with Seth in the first place).

Don't ascribe either wisdom or good intentions to Wilma.

>> One way to fix this problem is for unprivileged mounts to recognize the
>> attributes of the object mounted and to propagate those attributes to all
>> the objects they present. All files on /tmp/seth would be owned by seth
>> and protected by the mode bits, ACL and LSM requirements of ~seth/myfs.
> This is impossible to enforce, because Seth could use FUSE instead of ext4.

I never said that things aren't already broken. And, if you want
to ignore the potential DAC issues (read, negative groups) just
do it for the LSM xattrs.


>
>> Opening a file on /tmp/seth would require the same permissions as opening
>> the file containing the mounted filesystem. These attributes would have to
>> be immutable, or at least demonstrably more restrictive (chmod might be
>> allowed in some cases, but chown would never be) when changed. I don't see
>> how a user other than seth could create a new file, as you'd either have
>> a magical change in ownership or a false sense of security.
> This would be a very harsh restriction. Seth might legitimately want
> to give a user access to a file on backing store he owns without
> giving that user access to the backing store. Root on a normal system
> does that all the time.

You already said that it was impossible for Wilma to get
access, so how is this more restrictive? Besides, Seth can
always set the mode on ~/seth so that Wilma can't read the
files it contains. This isn't an old problem or a novel
solution.

>> If you can mount a filesystem such that the labels are ignored you
>> are effectively specifying that the Smack label on the files be
>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>> Without it, it's not.
> Can you explain what the threat model is here? I don't see what it is
> that you're trying to prevent.

Um, OK.
The filesystem has files with a hundred different Smack labels on it.
I mount it as an unlabeled filesystem and everything is readable by
everyone. Bad jojo.

>
>>> Your point is taken about my less-than-expert opinion about the other
>>> security modules. We should at minimum get acks from the maintainers of
>>> those modules that unprivileged mounts will not compromise MAC.
>> I am the Smack maintainer. Unprivileged mounts as you have
>> described them compromise MAC. They compromise DAC, too.
>>
> How do they compromise DAC?

Wilma's expectation (or the application running with a mapped UID)
that chmod will keep Seth out of the file.

> --Andy
>

2015-07-16 23:30:04

by Andy Lutomirski

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 4:08 PM, Casey Schaufler <[email protected]> wrote:
> On 7/16/2015 3:27 PM, Andy Lutomirski wrote:
>> On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <[email protected]> wrote:
>>> You want to provide a mechanism whereby an unprivileged user (Seth)
>>> can mount a filesystem for his own use. You want full filesystem
>>> semantics, but you're willing to accept restrictions on certain
>>> filesystem features to avoid opening security holes. You are not
>>> willing to accept restrictions that make the filesystem unusable,
>>> such as making it read-only.
>>>
>>> I am going to present a suggestion. Feel free to correct my
>>> assumptions and my reasoning. For simplicity let's use loop-back
>>> mounting of a filesystem contained in a file as an example. The
>>> principles should apply to newly created memory based filesystems
>>> or disk partitions "owned" by Seth.
>>>
>>> Seth wants to mount a file (~seth/myfs) which contains an ext4
>>> filesystem. There is already a filesystem object, with security
>>> attributes, that the system knows how to deal with. If Seth mounts
>>> this as a filesystem he, and potentially other people, will be
>>> able to access the content of this object without accessing the
>>> object itself.
>>>
>>> seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>>> seth$ chmod 777 /tmp/seth
>>> seth$ ls -la /tmp/seth
>>> drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
>>> drwxrwxrwxt 18 root root 4069 Jul 16 11:13 ..
>>> seth$
>>>
>>> Everything's fine at this point. Wilma is also using the system,
>>> being the sort who likes to hide things in out of the way places
>>>
>>> wilma$ cp ~/scandals /tmp/seth
>>> wilma$ chmod 600 /tmp/seth/scandals
>> This is already impossible as described. Seth can only mount the
>> filesystem in a private mount namespace inside a user namespace that
>> he created. Wilma can't see it unless Seth passes an fd to Wilma and
>> Wilma accepts and uses it.
>
> But you do have multiple UIDs within your user namespace, right?
> There are processes running as someone other than seth, right?
>

Only if root set it up that way. For example, root could set up
"subuids" (this is a userspace concept) that belong to Seth. These
would be uids that Seth controls and that represent subsets of Seth's
authority. Wilma wouldn't be one of these subuids unless she was
somehow part of Seth (or if root completely screwed up).
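
To make the subuid idea concrete, here is a minimal sketch in Python. It
assumes the usual shadow-utils /etc/subuid line format
(name:first_uid:count); the sample range and the uids 1000 and 1001 are
made up for illustration, and controls_uid() is a hypothetical helper,
not an existing API.

```python
def parse_subuid_line(line):
    """Parse one 'name:first_uid:count' line into (name, range of uids)."""
    name, start, count = line.strip().split(":")
    first = int(start)
    return name, range(first, first + int(count))


def controls_uid(owner, owner_uid, subuid_lines, uid):
    """True if `uid` is the owner's own uid or one of the owner's subuids."""
    if uid == owner_uid:
        return True
    for line in subuid_lines:
        name, uids = parse_subuid_line(line)
        if name == owner and uid in uids:
            return True
    return False


# Illustrative data only: Seth is uid 1000 and root delegated one range to him.
SUBUID_LINES = ["seth:100000:65536"]
```

With this data, controls_uid("seth", 1000, SUBUID_LINES, 100005) is true,
while Wilma's (assumed) uid 1001 falls outside anything Seth controls.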

>>
>>> puts her list of scandals on the unsuspecting filesystem, and changes
>>> the mode to ensure that no one can find out what went on after the
>>> office party.
>>>
>>> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
>>> happened at the office party, and the story goes from there.
>>>
>>> Wilma did everything correctly according to the system security policy,
>>> but the system security policy did not protect her as advertised. The
>>> system was tricked into behaving as if it was in control of the content
>>> of the filesystem when in fact it was not.
>>
>> I would argue that, if Wilma writes to some place described by an fd
>> and doesn't verify where she's writing to, then she has no expectation
>> of privacy. After all, she could just *tell* Seth directly whatever
>> she wants (assuming she can communicate with Seth in the first place).
>
> Don't ascribe either wisdom or good intentions to Wilma.

In that case, I'll mention the futility of solving the problem, even
without user namespaces. If Wilma tells Seth something, he's going to
find out. If Wilma pokes it (in whatever form) into an fd provided by
Seth, then Seth is extremely likely to find out, regardless of what
root or the MAC owner tries to do.

If Wilma writes to a path that's mounted in her namespace, then, sure,
overall policy associated with her namespace (which, in your example,
is the root namespace) must apply. But Seth can't mount things into
Wilma's namespace without having CAP_SYS_ADMIN in that namespace and,
if he has CAP_SYS_ADMIN, it's already game over.

>
>>> One way to fix this problem is for unprivileged mounts to recognize the
>>> attributes of the object mounted and to propagate those attributes to all
>>> the objects they present. All files on /tmp/seth would be owned by seth
>>> and protected by the mode bits, ACL and LSM requirements of ~/seth/myfs.
>> This is impossible to enforce, because Seth could use FUSE instead of ext4.
>
> I never said that things aren't already broken. And, if you want
> to ignore the potential DAC issues (read, negative groups) just
> do it for the LSM xattrs.
>

Negative groups are a solved problem, I believe.

>
>>
>>> opening a file on /tmp/seth would require the same permissions as opening
>>> the file containing the mounted filesystem. These attributes would have to
>>> be immutable, or at least demonstrably more restrictive (chmod might be
>>> allowed in some cases, but chown would never be) when changed. I don't see
>>> how a user other than seth could create a new file, as you'd either have
>>> a magical change in ownership or a false sense of security.
>> This would be a very harsh restriction. Seth might legitimately want
>> to give a user access to a file on backing store he owns without
>> giving that user access to the backing store. Root on a normal system
>> does that all the time.
>
> You already said that it was impossible for Wilma to get
> access, so how is this more restrictive? Besides, Seth can
> always set the mode on ~/seth so that Wilma can't read the
> files it contains. This isn't an old problem or a novel
> solution.

Seth can pass an fd around. This is actually a plausible thing to do:
Seth creates a userns to sandbox himself, mounts some FUSE thing in
there, and passes an fd out for the benefit of some daemon. That
daemon had better validate the thing before using it, though.

I really don't see the benefit of making up extra rules that apply to
users outside a userns who try to access specifically a filesystem
with backing store. They wouldn't make sense for filesystems without
backing store.

>
>>> If you can mount a filesystem such that the labels are ignored you
>>> are effectively specifying that the Smack label on the files be
>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>>> Without it, it's not.
>> Can you explain what the threat model is here? I don't see what it is
>> that you're trying to prevent.
>
> Um, OK.
> The filesystem has files with a hundred different Smack labels on it.
> I mount it as an unlabeled filesystem and everything is readable by
> everyone. Bad jojo.

I still don't understand. If it's a filesystem backed by a file that
Seth has RW access to, then Seth can read everything on it, full stop.
The security labels in the filesystem are irrelevant.

This is like saying that, if you put restrictive labels in the
filesystem that lives on /dev/sda2 and give Seth ownership of
/dev/sda2, then you expect Seth to be unable to bypass the policy
specified by your labels.

Or maybe I'm misunderstanding you.

>
>>
>>>> Your point is taken about my less-than-expert opinion about the other
>>>> security modules. We should at minimum get acks from the maintainers of
>>>> those modules that unprivileged mounts will not compromise MAC.
>>> I am the Smack maintainer. Unprivileged mounts as you have
>>> described them compromise MAC. They compromise DAC, too.
>>>
>> How do they compromise DAC?
>
> Wilma's expectation (or the application running with a mapped UID)
> that chmod will keep Seth out of the file.

That was never true. If Seth has an open fd, Wilma can chmod all day
and it won't matter. In this example, Seth owns the entire filesystem
along with its backing store.
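
The open-fd point can be checked from userspace. The following is a small
Python sketch, not kernel code: it relies only on the fact that permission
bits are checked at open() time, so a descriptor obtained earlier keeps
working after chmod. The helper name read_after_chmod() is made up, and a
single process plays both roles here.

```python
import os
import tempfile


def read_after_chmod(secret: bytes) -> bytes:
    """Write `secret`, drop all mode bits, then read it back via the old fd."""
    fd, path = tempfile.mkstemp()  # fd is opened read/write before the chmod
    try:
        os.write(fd, secret)
        os.chmod(path, 0o000)      # nobody may open the file from now on
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(secret))  # the existing fd is unaffected
    finally:
        os.close(fd)
        os.unlink(path)
```

A fresh open() of the same path by an unprivileged user would fail after
the chmod; only the descriptor that already exists keeps its access.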

--Andy

2015-07-17 00:12:51

by Dave Chinner

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <[email protected]> writes:
> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> >> If I mount an unprivileged filesystem, then either the contents were
> >> put there *by me*, in which case letting me access them is fine, or
> >> (with Seth's patches and then some) I control the backing store, in
> >> which case I can do whatever I want regardless of what LSM thinks.
> >>
> >> So I don't see the problem. Why would Smack or any other LSM care at
> >> all, unless it wants to prevent me from mounting the fs in the first
> >> place?
> >
> > First off, I don't cotton to the notion that you should be able
> > to mount filesystems without privilege. But it seems I'm being
> > outvoted on that. I suspect that there are cases where it might
> > be safe, but I can't think of one off the top of my head.
>
> There are two fundamental issues with mounting filesystems without privilege,
> by which I actually mean mounting filesystems as the root user in a user
> namespace.
>
> - Are the semantics safe.
> - Is the extra attack surface a problem.

I think the attack surface this exposes is the biggest problem
facing this proposal.

> Figuring out how to make semantics safe is what we are talking about.
>
> Once we sort out the semantics we can look at the handful of filesystems
> like fuse where the extra attack surface is not a concern.
>
> With that said desktop environments have for a long time been
> automatically mounting whichever filesystem you place in your computer,
> so in practice what this is really about is trying to align the kernel
> with how people use filesystems.

The key difference is that desktops only do this when you physically
plug in a device. With unprivileged mounts, a hostile attacker
doesn't need physical access to the machine to exploit lurking
kernel filesystem bugs. i.e. they can just use loopback mounts, and
they can keep mounting corrupted images until they find something
that works.

User namespaces are supposed to provide trust separation. The
kernel filesystems simply aren't hardened against unprivileged
attacks from below - there is a trust relationship between root and
the filesystem in that they are the only things that can write to
the disk. Mounts from within a userns destroy this relationship as
the userns root, by definition, is not a trusted actor.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-17 00:17:04

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Lukasz Pawelczyk <[email protected]> writes:

> On śro, 2015-07-15 at 16:06 -0500, Eric W. Biederman wrote:
>>
>> I am on the fence with Lukasz Pawelczyk's patches. Some parts I
>> liked
>> some parts I had issues with. As I recall one of my issues was that
>> those patches conflicted in detail if not in principle with this
>> approach.
>>
>> If these patches do not do a good job of laying the ground work for
>> supporting security labels that unprivileged users can set then Seth
>> could really use some feedback. Figuring out how to properly deal
>> with
>> the LSMs has been one of his challenges.
>
> I fail to see how those 2 are in any conflict.

Like I said. They don't really conflict, and actually to really support
things well for smack we probably need something like your patches.

At the same time a patch written without dealing with s_user_ns is
going to fail to take a lot of important details into account.

Right now after fixing the mount namespace issues the top priority is to
work through the details and get s_user_ns implemented. By that I mean
some version of patch 1 of Seth's series.

s_user_ns fundamentally changes how the concepts are represented in the
kernel in a way that is easier to secure, and that fundamentally better
matches things. And sigh. This review has shown we don't quite have
all of the details worked out.

> If your approach here is to treat user ns mounted filesystem as if they
> didn't support xattrs at all then my patches don't conflict here any
> more than Smack itself already does.

The end game if people developing smack choose to play, is to figure out
how to store your unmapped labels in a filesystem contained by a
user namespace and a smack label namespace root.

> If the filesystem will get a default (e.g. by smack* mount options)
> label then this label will co-work with Smack namespaces.

A default, but I don't know if it will be smack mount options that will
give that default. The devil is in the details and there are a lot
of details.

Eric

2015-07-17 00:48:31

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Dave Chinner <[email protected]> writes:

> On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <[email protected]> writes:
>> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
>> >> If I mount an unprivileged filesystem, then either the contents were
>> >> put there *by me*, in which case letting me access them is fine, or
>> >> (with Seth's patches and then some) I control the backing store, in
>> >> which case I can do whatever I want regardless of what LSM thinks.
>> >>
>> >> So I don't see the problem. Why would Smack or any other LSM care at
>> >> all, unless it wants to prevent me from mounting the fs in the first
>> >> place?
>> >
>> > First off, I don't cotton to the notion that you should be able
>> > to mount filesystems without privilege. But it seems I'm being
>> > outvoted on that. I suspect that there are cases where it might
>> > be safe, but I can't think of one off the top of my head.
>>
>> There are two fundamental issues with mounting filesystems without privilege,
>> by which I actually mean mounting filesystems as the root user in a user
>> namespace.
>>
>> - Are the semantics safe.
>> - Is the extra attack surface a problem.
>
> I think the attack surface this exposes is the biggest problem
> facing this proposal.

I completely agree.

>> Figuring out how to make semantics safe is what we are talking about.
>>
>> Once we sort out the semantics we can look at the handful of filesystems
>> like fuse where the extra attack surface is not a concern.
>>
>> With that said desktop environments have for a long time been
>> automatically mounting whichever filesystem you place in your computer,
>> so in practice what this is really about is trying to align the kernel
>> with how people use filesystems.
>
> The key difference is that desktops only do this when you physically
> plug in a device. With unprivileged mounts, a hostile attacker
> doesn't need physical access to the machine to exploit lurking
> kernel filesystem bugs. i.e. they can just use loopback mounts, and
> they can keep mounting corrupted images until they find something
> that works.

Yep. That magnifies the problem quite a bit.

> User namespaces are supposed to provide trust separation. The
> kernel filesystems simply aren't hardened against unprivileged
> attacks from below - there is a trust relationship between root and
> the filesystem in that they are the only things that can write to
> the disk. Mounts from within a userns destroy this relationship as
> the userns root, by definition, is not a trusted actor.

I talked to Ted Tso a while back and ext4 is at least in principle
already hardened against that kind of attack. I am not certain I
believe it, but if it is true I think it is fantastic.

At this point any setting of the FS_USER_MOUNT flag I figure needs to go
through the filesystem maintainers tree and they need to be aware of and
agree to deal with the attack from below issue.

The one filesystem I truly expect we can make work is fuse. fuse has
been designed to deal with some variation of the attack from below issue
since day one. We looked at what the patches to fuse would look like
with the current state of the vfs and it was not pretty.

We very much need to sort through as much as possible at the vfs layer,
and in generic code. That will allow everyone to see what is going on and
how it works before proceeding with enabling any filesystems.



I truly hope we can find a small set of block device filesystems that we
can harden from attack below. That would allow linux to have serious
defenses against evil usb stick attacks. I think that is going to take
a lot of careful coding, testing and validation and advancing the state
of the art to get there.

Eric

2015-07-17 00:49:56

by Eric W. Biederman

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

Seth Forshee <[email protected]> writes:

> On Thu, Jul 16, 2015 at 12:44:49AM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>> > On Wed, Jul 15, 2015 at 10:04 PM, Eric W. Biederman
>> > <[email protected]> wrote:
>> >> Andy Lutomirski <[email protected]> writes:
>> >>
>> >>>
>> >>> So here's the semantic question:
>> >>>
>> >>> Suppose an unprivileged user (uid 1000) creates a user namespace and a
>> >>> mount namespace. They stick a file (owned by uid 1000 as seen by
>> >>> init_user_ns) in there and mark it setuid root and give it fcaps.
>> >>
>> >> To make this make sense I have to ask, is this file on a filesystem
>> >> where uid 1000 as seen by the init_user_ns stored as uid 1000 on
>> >> the filesystem? Or is this uid 0 as seen by the filesystem?
>> >>
>> >> I assume this is uid 0 on the filesystem in question or else your
>> >> unprivileged user would not have sufficient privileges over the
>> >> filesystem to setup fcaps.
>> >
>> > I was thinking uid 0 as seen by the filesystem. But even if it were
>> > uid 1000, the unprivileged user can still set whatever mode and xattrs
>> > they want -- they control the backing store.
>>
>> Yes. And that is what I was really asking. Are we talking about a
>> filesystem where the user controls the backing store?
>>
>> >>> Then global root gets an fd to this filesystem. If they execve the
>> >>> file directly, then, with my patch 4, it won't act as setuid 1000 and
>> >>> the fcaps will be ignored. Even with my patch 4, though, if they bind
>> >>> mount the fs and execve the file from their bind mount, it will act as
>> >>> setuid 1000. Maybe this is odd. However, with Seth's patch 3, the
>> >>> fcaps will (correctly) not be honored.
>> >>
>> >> With patch 3 you can also think of it as fcaps being honored and you
>> >> get all the caps in the appropriate user namespace, but since you are
>> >> not in that user namespace and so don't have a place to store them
>> >> in struct cred you don't get the file caps.
>> >>
>> >> From the philosophy of interpreting the file as defined by the
>> >> filesystem in principle we could extend struct cred so you actually
>> >> get the creds just in uid 1000's user namespace, but that is very
>> >> unlikely to be worth it.
>> >
>> > I agree.
>> >
>> >>
>> >>> I tend to think that, if we're not honoring the fcaps, we shouldn't be
>> >>> honoring the setuid bit either. After all, it's really not a trusted
>> >>> file, even though the only user who could have messed with it really
>> >>> is the apparent owner.
>> >>
>> >> For the file caps we can't honor them because you don't have the bits
>> >> in struct cred.
>> >>
>> >> For setuid we can honor it, and setuid is something that the user
>> >> namespace allows.
>> >>
>> >
>> > We certainly *can* honor it. But why should we? I'd be more
>> > comfortable with this if the contents of an untrusted filesystem were
>> > really treated as just data.
>>
>> In these weird bleed-through situations I don't know that we should.
>> But extending nosuid protections in this way is a bit like yama:
>> somewhat gratuitous stomping on don't-care cases in the semantics to
>> make bugs harder to exploit.
>>
>> >>> And, if we're going to say we don't trust the file and shouldn't honor
>> >>> setuid or fcaps, then merging all the functionality into mnt_may_suid
>> >>> could make sense. Yes, these two things do different things, but they
>> >>> could hook in to the same place.
>> >>
>> >> There are really two separate questions:
>> >> - Do we trust this filesystem?
>> >> - Do you have the bits to implement this concept?
>> >>
>> >> Even if in this specific context the two questions wind up looking
>> >> exactly the same. I think it makes a lot of sense to ask the two
>> >> questions separately. As future maintenance changes may cause the
>> >> implementation of the questions to diverge.
>> >>
>> >
>> > Agreed.
>> >
>> > Unless someone thinks of an argument to the contrary, I'd say "no, we
>> > don't trust this filesystem". I could be convinced otherwise.
>>
>> But this is context dependent. From the perspective of the container
>> we really do want to trust the filesystem. As the container root set it
>> up, and if he isn't being hostile likely has a use for setfcaps files
>> and setuid files and all of the rest.
>>
>> Perhaps I should phrase it as:
>> - In this context do we trust the code? AKA mnt_may_suid?
>> - What do these bits mean in this context? (Usually something more complicated).
>>
>> Which says to me we want both patches 3 and 4 (even if 4 uses s_user_ns)
>> because 3 is different than 4.
>
> So what I'll do is:
>
> - Add a s_user_ns check to mnt_may_suid
> - Keep the (now redundant) s_user_ns check in get_file_caps
>
> I'm on the fence about having both the mnt and user ns checks in
> mnt_may_suid - it might be overkill, but it still adds the protection
> against clearing MNT_NOSUID in a bind mount. So I guess I'll keep the
> mnt ns check.

That sounds like a plan.
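
As a toy model of the plan above (Python rather than kernel C, and only a
sketch): the two questions stay separate functions even though they share
the s_user_ns test. The classes, the flag value, and the exact form of the
namespace check ("the caller's user namespace is s_user_ns or a descendant
of it") are assumptions made for illustration, not the code from the
patches.

```python
from dataclasses import dataclass
from typing import Optional

MNT_NOSUID = 0x01  # flag value chosen for this sketch only


@dataclass(eq=False)
class UserNS:
    name: str
    parent: Optional["UserNS"] = None


@dataclass
class Mount:
    flags: int
    mnt_ns: str        # mount namespace the mount lives in
    s_user_ns: UserNS  # user namespace that owns the superblock


@dataclass
class Task:
    mnt_ns: str
    user_ns: UserNS


def in_userns(ns, target):
    """True if `ns` is `target` or one of its descendants."""
    while ns is not None:
        if ns is target:
            return True
        ns = ns.parent
    return False


def mnt_may_suid(task, mnt):
    """Question 1: in this context, do we trust code from this mount?"""
    return (not (mnt.flags & MNT_NOSUID)
            and mnt.mnt_ns == task.mnt_ns                # mnt ns check
            and in_userns(task.user_ns, mnt.s_user_ns))  # added s_user_ns check


def file_caps_honored(task, mnt):
    """Question 2: do file caps mean anything to this caller?"""
    return in_userns(task.user_ns, mnt.s_user_ns)
```

In this model a bind mount of the container's filesystem into global root's
mount namespace passes the mnt ns check but still fails the s_user_ns check,
so neither setuid nor fcaps take effect for global root.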

Eric

2015-07-17 00:45:56

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 4:08 PM, Casey Schaufler <[email protected]> wrote:
>> On 7/16/2015 3:27 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 16, 2015 at 2:42 PM, Casey Schaufler <[email protected]> wrote:
>>>> You want to provide a mechanism whereby an unprivileged user (Seth)
>>>> can mount a filesystem for his own use. You want full filesystem
>>>> semantics, but you're willing to accept restrictions on certain
>>>> filesystem features to avoid opening security holes. You are not
>>>> willing to accept restrictions that make the filesystem unusable,
>>>> such as making it read-only.
>>>>
>>>> I am going to present a suggestion. Feel free to correct my
>>>> assumptions and my reasoning. For simplicity let's use loop-back
>>>> mounting of a filesystem contained in a file as an example. The
>>>> principles should apply to newly created memory based filesystems
>>>> or disk partitions "owned" by Seth.
>>>>
>>>> Seth wants to mount a file (~seth/myfs) which contains an ext4
>>>> filesystem. There is already a filesystem object, with security
>>>> attributes, that the system knows how to deal with. If Seth mounts
>>>> this as a filesystem he, and potentially other people, will be
>>>> able to access the content of this object without accessing the
>>>> object itself.
>>>>
>>>> seth$ mount --justforme -t ext4 ~seth/myfs /tmp/seth
>>>> seth$ chmod 777 /tmp/seth
>>>> seth$ ls -la /tmp/seth
>>>> drwxrwxrwx. 3 seth seth 260 Jul 16 12:59 .
>>>> drwxrwxrwxt 18 root root 4069 Jul 16 11:13 ..
>>>> seth$
>>>>
>>>> Everything's fine at this point. Wilma is also using the system,
>>>> being the sort who likes to hide things in out of the way places
>>>>
>>>> wilma$ cp ~/scandals /tmp/seth
>>>> wilma$ chmod 600 /tmp/seth/scandals
>>> This is already impossible as described. Seth can only mount the
>>> filesystem in a private mount namespace inside a user namespace that
>>> he created. Wilma can't see it unless Seth passes an fd to Wilma and
>>> Wilma accepts and uses it.
>> But you do have multiple UIDs within your user namespace, right?
>> There are processes running as someone other than seth, right?
>>
> Only if root set it up that way. For example, root could set up
> "subuids" (this is a userspace concept) that belong to Seth. These
> would be uids that Seth controls and that represent subsets of Seth's
> authority. Wilma wouldn't be one of these subuids unless she was
> somehow part of Seth (or if root completely screwed up).

Or if root had some really unexpected and inappropriate ideas
on what qualifies as "clever". But I'll back off. It looks like
this particular objection of mine is covered.

>
>>>> puts her list of scandals on the unsuspecting filesystem, and changes
>>>> the mode to ensure that no one can find out what went on after the
>>>> office party.
>>>>
>>>> Seth unmounts /tmp/seth. He looks in ~seth/myfs, finds out what really
>>>> happened at the office party, and the story goes from there.
>>>>
>>>> Wilma did everything correctly according to the system security policy,
>>>> but the system security policy did not protect her as advertised. The
>>>> system was tricked into behaving as if it was in control of the content
>>>> of the filesystem when in fact it was not.
>>> I would argue that, if Wilma writes to some place described by an fd
>>> and doesn't verify where she's writing to, then she has no expectation
>>> of privacy. After all, she could just *tell* Seth directly whatever
>>> she wants (assuming she can communicate with Seth in the first place).
>> Don't ascribe either wisdom or good intentions to Wilma.
> In that case, I'll mention the futility of solving the problem, even
> without user namespaces. If Wilma tells Seth something, he's going to
> find out. If Wilma pokes it (in whatever form) into an fd provided by
> Seth, then Seth is extremely likely to find out, regardless of what
> root or the MAC owner tries to do.

I'll buy that, too. I still get queasy every time someone
tells me that passing file descriptors is a security feature.

> If Wilma writes to a path that's mounted in her namespace, then, sure,
> overall policy associated with her namespace (which, in your example,
> is the root namespace) must apply. But Seth can't mount things into
> Wilma's namespace without having CAP_SYS_ADMIN in that namespace and,
> if he has CAP_SYS_ADMIN, it's already game over.

And so long as it's restricted to the namespace ...
I'm starting to get it now.

>>>> One way to fix this problem is for unprivileged mounts to recognize the
>>>> attributes of the object mounted and to propagate those attributes to all
>>>> the objects they present. All files on /tmp/seth would be owned by seth
>>>> and protected by the mode bits, ACL and LSM requirements of ~/seth/myfs.
>>> This is impossible to enforce, because Seth could use FUSE instead of ext4.
>> I never said that things aren't already broken. And, if you want
>> to ignore the potential DAC issues (read, negative groups) just
>> do it for the LSM xattrs.
>>
> Negative groups are a solved problem, I believe.

My position is that there's a workaround but that the
design is still fundamentally flawed.

>
>>>> opening a file on /tmp/seth would require the same permissions as opening
>>>> the file containing the mounted filesystem. These attributes would have to
>>>> be immutable, or at least demonstrably more restrictive (chmod might be
>>>> allowed in some cases, but chown would never be) when changed. I don't see
>>>> how a user other than seth could create a new file, as you'd either have
>>>> a magical change in ownership or a false sense of security.
>>> This would be a very harsh restriction. Seth might legitimately want
>>> to give a user access to a file on backing store he owns without
>>> giving that user access to the backing store. Root on a normal system
>>> does that all the time.
>> You already said that it was impossible for Wilma to get
>> access, so how is this more restrictive? Besides, Seth can
>> always set the mode on ~/seth so that Wilma can't read the
>> files it contains. This isn't an old problem or a novel
>> solution.
> Seth can pass an fd around. This is actually a plausible thing to do:
> Seth creates a userns to sandbox himself, mounts some FUSE thing in
> there, and passes an fd out for the benefit of some daemon. That
> daemon had better validate the thing before using it, though.

Point. It won't, but it should.

> I really don't see the benefit of making up extra rules that apply to
> users outside a userns who try to access specifically a filesystem
> with backing store. They wouldn't make sense for filesystems without
> backing store.

Sure it would. For Smack, it would be the label a file would be
created with, which would be the label of the process creating
the memory based filesystem. For SELinux the rules are a
touch more sophisticated, but I'm sure that Paul or Stephen could
come up with how to determine it.

The point, looping all the way back to the beginning, where we
were talking about just ignoring the labels on the filesystem,
is that if you use the same Smack label on the files in the
filesystem as the backing store file has, we'll all be happy.
If that label isn't something the user can write to, he won't be
able to write to the mounted objects, either. If there is no
backing store then use the label of the process creating the
filesystem, which will be the user, which will mean everything
will work hunky dory.

Yes, there's work involved, but I doubt there's a lot. Getting
the label from the backing store or the creating process is
simple enough.
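
Roughly, the rule I have in mind looks like the sketch below. It is Python
for illustration only, not a Smack patch; the function names and the
example labels are made up.

```python
from typing import Optional


def mount_label(creator_label: str, backing_store_label: Optional[str]) -> str:
    """Label for every object on an unprivileged mount.

    Use the backing store's label when there is one; otherwise fall back
    to the label of the process that created the filesystem.
    """
    if backing_store_label is not None:
        return backing_store_label
    return creator_label


def presented_label(on_disk_label: Optional[str],
                    creator_label: str,
                    backing_store_label: Optional[str]) -> str:
    """Label the LSM would see for one file; the on-disk label is ignored."""
    return mount_label(creator_label, backing_store_label)
```

Whatever labels are written inside the image, every file comes out with
the one label taken from the backing store or the creator.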


>>>> If you can mount a filesystem such that the labels are ignored you
>>>> are effectively specifying that the Smack label on the files be
>>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>>>> Without it, it's not.
>>> Can you explain what the threat model is here? I don't see what it is
>>> that you're trying to prevent.
>> Um, OK.
>> The filesystem has files with a hundred different Smack labels on it.
>> I mount it as an unlabeled filesystem and everything is readable by
>> everyone. Bad jojo.
> I still don't understand. If it's a filesystem backed by a file that
> Seth has RW access to, then Seth can read everything on it, full stop.
> The security labels in the filesystem are irrelevant.

Well, they can't be trusted, if that's what you mean.
That's why I'm saying that the objects exposed by mounting
this backing store need to be treated with the same security
attributes as the backing store. Fudge it for DAC if you are
so inclined, but I think it's the right way to go for MAC.

> This is like saying that, if you put restrictive labels in the
> filesystem that lives on /dev/sda2 and give Seth ownership of
> /dev/sda2, then you expect Seth to be unable to bypass the policy
> specified by your labels.

Consider the Smack label on /dev/sda2. Smack does not care
who owns it, just what the Smack label is. Just like on
~/seth/myfs. The backing store "object" is /dev/sda2 in the
one case, ~/seth/myfs in the other, and something in the ether
for a memory based filesystem. So long as the labels of the
files exposed on the mount point match those of the backing
store "object", Smack is going to be happy. Since you're
running without privilege, you can't change the labels on
the files.

Now Seth, being the sneaky person that he is, could change
the Smack labels on the files in the backing store while it's
offline. Since he has access to the backing store, he can't
give himself more access by changing the labels within the
filesystem. He can give himself less, but I'm OK with that.

> Or maybe I'm misunderstanding you.

Probably, but I'm undoubtedly doing the same.

If you're going to be at LinuxCon in Seattle we should
continue this discussion over the beverage of your choice.

>>>>> Your point is taken about my less-than-expert opinion about the other
>>>>> security modules. We should at minimum get acks from the maintainers of
>>>>> those modules that unprivileged mounts will not compromise MAC.
>>>> I am the Smack maintainer. Unprivileged mounts as you have
>>>> described them compromise MAC. They compromise DAC, too.
>>>>
>>> How do they compromise DAC?
>> Wilma's expectation (or the application running with a mapped UID)
>> that chmod will keep Seth out of the file.
> That was never true. If Seth has an open fd, Wilma can chmod all day
> and it won't matter. In this example, Seth owns the entire filesystem
> along with its backing store.
>
> --Andy

2015-07-17 00:59:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>> I really don't see the benefit of making up extra rules that apply to
>> users outside a userns who try to access specifically a filesystem
>> with backing store. They wouldn't make sense for filesystems without
>> backing store.
>
> Sure it would. For Smack, it would be the label a file would be
> created with, which would be the label of the process creating
> the memory based filesystem. For SELinux the rules are a
> touch more sophisticated, but I'm sure that Paul or Stephen could
> come up with how to determine it.
>
> The point, looping all the way back to the beginning, where we
> were talking about just ignoring the labels on the filesystem,
> is that if you use the same Smack label on the files in the
> filesystem as the backing store file has, we'll all be happy.
> If that label isn't something user can write to, he won't be
> able to write to the mounted objects, either. If there is no
> backing store then use the label of the process creating the
> filesystem, which will be the user, which will mean everything
> will work hunky dory.
>
> Yes, there's work involved, but I doubt there's a lot. Getting
> the label from the backing store or the creating process is
> simple enough.
>

So what if Smack used the label of the user creating the filesystem
even for filesystems with backing store? IMO this ought to be doable
with the LSM hooks -- it certainly seems reasonable for the LSM to be
aware of who created a filesystem. In fact, I'd argue that if Smack
can't do this with the proposed LSM hooks, then the hooks are
insufficient.

Presumably Smack could also figure out what was mounted, but keep in
mind that there are filesystems like ntfs-3g out there. While ntfs-3g
logically has backing store, I don't think the kernel actually knows
about it.

>
>>>>> If you can mount a filesystem such that the labels are ignored you
>>>>> are effectively specifying that the Smack label on the files be
>>>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>>>>> Without it, it's not.
>>>> Can you explain what the threat model is here? I don't see what it is
>>>> that you're trying to prevent.
>>> Um, OK.
>>> The filesystem has files with a hundred different Smack labels on it.
>>> I mount it as an unlabeled filesystem and everything is readable by
>>> everyone. Bad jojo.
>> I still don't understand. If it's a filesystem backed by a file that
>> Seth has RW access to, then Seth can read everything on it, full stop.
>> The security labels in the filesystem are irrelevant.
>
> Well, they can't be trusted, if that's what you mean.
> That's why I'm saying that the objects exposed by mounting
> this backing store need to be treated with the same security
> attributes as the backing store. Fudge it for DAC if you are
> so inclined, but I think it's the right way to go for MAC.
>
>> This is like saying that, if you put restrictive labels in the
>> filesystem that lives on /dev/sda2 and give Seth ownership of
>> /dev/sda2, then you expect Seth to be unable to bypass the policy
>> specified by your labels.
>
> Consider the Smack label on /dev/sda2. Smack does not care
> who owns it, just what the Smack label is. Just like on
> ~/seth/myfs. The backing store "object" is /dev/sda2 in the
> one case, ~/seth/myfs in the other, and something in the ether
> for a memory based filesystem. So long as the labels of the
> files exposed on the mount point match those of the backing
> store "object", Smack is going to be happy. Since you're
> running without privilege, you can't change the labels on
> the files.
>
> Now Seth, being the sneaky person that he is, could change
> the Smack labels on the files in the backing store while it's
> offline. Since he has access to the backing store, he can't
> give himself more access by changing the labels within the
> filesystem. He can give himself less, but I'm OK with that.
>
>> Or maybe I'm misunderstanding you.
>
> Probably, but I'm undoubtedly doing the same.
>
> If you're going to be at LinuxCon in Seattle we should
> continue this discussion over the beverage of your choice.

There's a small but not quite zero chance I'll be there. I'll
probably be in Seoul. It's too bad that LSS and KS are in different
places this year.

--Andy

2015-07-17 02:48:09

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> Dave Chinner <[email protected]> writes:
>
> > On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> >> Casey Schaufler <[email protected]> writes:
> >> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> >> >> If I mount an unprivileged filesystem, then either the contents were
> >> >> put there *by me*, in which case letting me access them is fine, or
> >> >> (with Seth's patches and then some) I control the backing store, in
> >> >> which case I can do whatever I want regardless of what LSM thinks.
> >> >>
> >> >> So I don't see the problem. Why would Smack or any other LSM care at
> >> >> all, unless it wants to prevent me from mounting the fs in the first
> >> >> place?
> >> >
> >> > First off, I don't cotton to the notion that you should be able
> >> > to mount filesystems without privilege. But it seems I'm being
> >> > outvoted on that. I suspect that there are cases where it might
> >> > be safe, but I can't think of one off the top of my head.
> >>
> >> There are two fundamental issues mounting filesystems without privilege,
> >> by which I actually mean mounting filesystems as the root user in a user
> >> namespace.
> >>
> >> - Are the semantics safe.
> >> - Is the extra attack surface a problem.
> >
> > I think the attack surface this exposes is the biggest problem
> > facing this proposal.
>
> I completely agree.
>
> >> Figuring out how to make semantics safe is what we are talking about.
> >>
> >> Once we sort out the semantics we can look at the handful of filesystems
> >> like fuse where the extra attack surface is not a concern.
> >>
> >> With that said desktop environments have for a long time been
> >> automatically mounting whichever filesystem you place in your computer,
> >> so in practice what this is really about is trying to align the kernel
> >> with how people use filesystems.
> >
> > The key difference is that desktops only do this when you physically
> > plug in a device. With unprivileged mounts, a hostile attacker
> > doesn't need physical access to the machine to exploit lurking
> > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > they can keep mounting corrupted images until they find something
> > that works.
>
> Yep. That magnifies the problem quite a bit.
>
> > User namespaces are supposed to provide trust separation. The
> > kernel filesystems simply aren't hardened against unprivileged
> > attacks from below - there is a trust relationship between root and
> > the filesystem in that they are the only things that can write to
> > the disk. Mounts from within a userns destroys this relationship as
> > the userns root, by definition, is not a trusted actor.
>
> I talked to Ted Tso a while back and ext4 is at least in principle
> already hardened against that kind of attack. I am not certain I
> believe it, but if it is true I think it is fantastic.

No, it's not. No filesystem is, because to harden against such
attacks requires complete verification of all metadata when it is
read from disk, before it is used, or some method of ensuring the
block was not tampered with. CRCs are not sufficient, because they
can be tampered with, too.

The only way a filesystem would be able to trust that what it reads from
disk has not been tampered with in a system with untrusted mounts is
if it has some kind of cryptographically secure signature in the
metadata and the attacker is unable to access the key for that
signature. No filesystem we have has that capability and AFAIA there
are no plans for any filesystem to implement such tamper detection.
And no, ext4 encryption does not provide this because it only stores
the values and data in encrypted format and does not protect
metadata from tampering when it is not mounted.
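
To make the CRC point concrete, here is a minimal sketch in C. The
meta_block layout and all helper names are invented for illustration
and do not correspond to any real filesystem's on-disk format. It shows
that a checksum catches accidental corruption, but anyone who can write
the backing store can simply recompute it after tampering, because no
secret is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk metadata block: two fields plus their CRC32. */
struct meta_block {
	uint32_t inode_count;
	uint32_t root_inode;
	uint32_t crc;		/* CRC32 over the fields above */
};

/* Plain bitwise CRC32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_calc(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= p[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return ~crc;
}

/* What the filesystem does when it writes the block. */
static void meta_seal(struct meta_block *b)
{
	b->crc = crc32_calc(b, offsetof(struct meta_block, crc));
}

/* What the filesystem does when it reads the block back. */
static int meta_verify(const struct meta_block *b)
{
	return b->crc == crc32_calc(b, offsetof(struct meta_block, crc));
}

/* What an attacker with write access to the backing store can do. */
static void attacker_tamper(struct meta_block *b, uint32_t bogus_root)
{
	b->root_inode = bogus_root;
	meta_seal(b);	/* no key is needed to recompute the checksum */
}
```

A keyed signature would differ only in that the sealing step needs a
secret the attacker does not hold, which is the capability described
above as missing from current filesystems.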

If we don't have crypto signatures in metadata, then XFS is probably
the most robust against tampering as it does a lot more checking of
the on-disk metadata before it is used than any other filesystem
(i.e. see the verifier infrastructure that does corruption checks
after read (in io completion) and before write (in io submission)
to catch bad metadata before it is used by the kernel, or before it
is written to disk by the kernel).

However, these checks are far from comprehensive. We can only check
internal consistency of the metadata objects in the block, and even
then we really only can check for values within range rather than
absolute correctness. e.g. we can check a dirent has a valid name,
length, ftype and inode number, but we can't validate that the inode
is actually allocated or not because that requires a lookup in the
allocated inode btree. We *trust* that inode number to be
allocated and valid because it is in metadata the filesystem wrote.

For inode numbers that come from untrusted sources (NFS,
open-by-handle, etc) we have a flag that does inode number
validation on lookup (XFS_IGET_UNTRUSTED) to check against trusted
metadata (i.e. the allocated inode btrees), but that is expensive
and so not done on inodes that we pull directly from metadata that
has come from disk. Indeed, we still trust on-disk metadata to be
correct to validate that other metadata can be trusted, so if one
structure can be tampered with, so can others.
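
The gap between range checks and absolute correctness can be sketched
as follows. Everything here (the model_* types, the limits, the flag
value) is a made-up userspace model loosely following the description
above; it is not XFS code. The verifier accepts any dirent whose fields
are in range, while only the "untrusted" lookup path pays for a check
against the allocation state.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_INODES	64u	/* hypothetical filesystem size */
#define MODEL_MAX_NAMELEN	255u
#define MODEL_FTYPE_MAX		7u	/* hypothetical number of file types */
#define MODEL_IGET_UNTRUSTED	0x1u	/* modeled on the idea of XFS_IGET_UNTRUSTED */

struct model_dirent {
	uint64_t ino;
	uint16_t namelen;
	uint8_t  ftype;
};

struct model_fs {
	bool allocated[MODEL_MAX_INODES];	/* stands in for the inode btree */
};

/* Verifier-style check: every field is within range, nothing more. */
static bool dirent_in_range(const struct model_dirent *d)
{
	return d->namelen > 0 && d->namelen <= MODEL_MAX_NAMELEN &&
	       d->ftype < MODEL_FTYPE_MAX &&
	       d->ino > 0 && d->ino < MODEL_MAX_INODES;
}

/*
 * Lookup: inode numbers taken from on-disk metadata are trusted, so the
 * (expensive) allocation check only runs when the caller asks for it.
 */
static bool model_iget(const struct model_fs *fs, uint64_t ino,
		       unsigned int flags)
{
	if (ino == 0 || ino >= MODEL_MAX_INODES)
		return false;
	if ((flags & MODEL_IGET_UNTRUSTED) && !fs->allocated[ino])
		return false;
	return true;
}
```

In this model a forged dirent pointing at an unallocated but in-range
inode passes dirent_in_range() and a trusted model_iget(), and is only
rejected when MODEL_IGET_UNTRUSTED is passed.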

IOWs, if we cannot trust one part of the filesystem metadata to be
correct, then we cannot trust that filesystem *at all*, *for
anything*. And even running fsck doesn't restore trust - all it does
is tell us that any modification that was made is not a detectable
inconsistency that needs fixing.

> At this point any setting of the FS_USER_MOUNT flag I figure needs to go
> through the filesystem maintainers tree and they need to be aware of and
> agree to deal with the attack from below issue.
>
> The one filesystem I truly expect we can make work is fuse. fuse has
> been designed to deal with some variation of the attack from below issue
> since day one. We looked at what the patches to fuse would look like
> with the current state of the vfs and it was not pretty.
>
> We very much need to sort through as much as possible at the vfs layer,
> and in generic code. Allow everyone to see what is going on and how
> it works before proceeding with enabling any filesystems.

The VFS protects us from attacks from above the filesystem, not
below. The VFS plays no part in validating the on-disk structure of
a filesystem which is what attacks from below will be attempting to
exploit.

> I truly hope we can find a small set of block device filesystems that we
> can harden from attack below. That would allow linux to have serious
> defenses against evil usb stick attacks. I think that is going to take
> a lot of careful coding, testing and validation and advancing the state
> of the art to get there.

Somehow, I just can't see that happening.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-17 06:46:20

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCH 4/7] fs: Treat foreign mounts as nosuid



On 07/15/2015 10:46 PM, Seth Forshee wrote:
> From: Andy Lutomirski <[email protected]>
>
> If a process gets access to a mount from a different user
> namespace, that process should not be able to take advantage of
> setuid files or selinux entrypoints from that filesystem.
> Technically, trusting mounts created by the same or ancestor user
> namespaces ought to be safe, but it's simpler to distrust all
> foreign mounts.
>
> This will make it safer to allow more complex filesystems to be
> mounted in non-root user namespaces.
>
> This does not remove the need for MNT_LOCK_NOSUID. The setuid,
> setgid, and file capability bits can no longer be abused if code in
> a user namespace were to clear nosuid on an untrusted filesystem,
> but this patch, by itself, is insufficient to protect the system
> from abuse of files that, when execed, would increase MAC privilege.
>
> As a more concrete explanation, any task that can manipulate a
> vfsmount associated with a given user namespace already has
> capabilities in that namespace and all of its descendents. If they
> can cause a malicious setuid, setgid, or file-caps executable to
> appear in that mount, then that executable will only allow them to
> elevate privileges in exactly the set of namespaces in which they
> are already privileged.
>
> On the other hand, if they can cause a malicious executable to
> appear with a dangerous MAC label, running it could change the
> caller's security context in a way that should not have been
> possible, even inside the namespace in which the task is confined.
>
> As a hardening measure, this would have made CVE-2014-5207 much
> more difficult to exploit.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> [ saf: Forward ported to 4.2 ]
> Signed-off-by: Seth Forshee <[email protected]>
> ---
> fs/exec.c | 2 +-
> fs/namespace.c | 13 +++++++++++++
> include/linux/mount.h | 1 +
> security/commoncap.c | 2 +-
> security/selinux/hooks.c | 2 +-
> 5 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index b06623a9347f..ea7311d72cc3 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1295,7 +1295,7 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
> bprm->cred->euid = current_euid();
> bprm->cred->egid = current_egid();
>
> - if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> + if (!mnt_may_suid(bprm->file->f_path.mnt))
> return;
>
> if (task_no_new_privs(current))
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 423001de32a2..2bfd7ca92247 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3252,6 +3252,19 @@ found:
> return visible;
> }
>
> +bool mnt_may_suid(struct vfsmount *mnt)
> +{
> + /*
> + * Foreign mounts (accessed via fchdir or through /proc
> + * symlinks) are always treated as if they are nosuid. This
> + * prevents namespaces from trusting potentially unsafe
> + * suid/sgid bits, file caps, or security labels that originate
> + * in other namespaces.
> + */
> + return real_mount(mnt)->mnt_ns == current->nsproxy->mnt_ns &&
> + !(mnt->mnt_flags & MNT_NOSUID);

Maybe check_mnt() from fs/namespace.c can be exported and used here,
instead of open coding it.

> +}
> +
> static struct ns_common *mntns_get(struct task_struct *task)
> {
> struct ns_common *ns = NULL;
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index f822c3c11377..54a594d49733 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -81,6 +81,7 @@ extern void mntput(struct vfsmount *mnt);
> extern struct vfsmount *mntget(struct vfsmount *mnt);
> extern struct vfsmount *mnt_clone_internal(struct path *path);
> extern int __mnt_is_readonly(struct vfsmount *mnt);
> +extern bool mnt_may_suid(struct vfsmount *mnt);
>
> struct path;
> extern struct vfsmount *clone_private_mount(struct path *path);
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 175ab497e810..858d86a1b73c 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -437,7 +437,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> if (!file_caps_enabled)
> return 0;
>
> - if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> + if (!mnt_may_suid(bprm->file->f_path.mnt))
> return 0;
> if (!in_userns(current_user_ns(), bprm->file->f_path.mnt->mnt_sb->s_user_ns))
> return 0;
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 564079c5c49d..459e71ddbc9d 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2137,7 +2137,7 @@ static int check_nnp_nosuid(const struct linux_binprm *bprm,
> const struct task_security_struct *new_tsec)
> {
> int nnp = (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS);
> - int nosuid = (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID);
> + int nosuid = !mnt_may_suid(bprm->file->f_path.mnt);
> int rc;
>
> if (!nnp && !nosuid)
>
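
As a rough illustration of the check in the quoted fs/namespace.c hunk,
and of the suggestion to reuse check_mnt() rather than open code it,
here is a userspace model in C. The model_* structures and the flag
value are stand-ins, not kernel definitions, and the sketch assumes
check_mnt() performs the same mount-namespace comparison that the patch
writes out by hand.

```c
#include <stdbool.h>

#define MODEL_MNT_NOSUID 0x01	/* stand-in for the kernel's MNT_NOSUID */

/* Stand-in for a mount namespace; only its identity matters here. */
struct model_mnt_ns {
	int id;
};

/* Stand-in for a mount: owning mount namespace plus flags. */
struct model_mount {
	const struct model_mnt_ns *mnt_ns;
	int mnt_flags;
};

/* The namespace comparison, factored out as the review suggests. */
static bool model_check_mnt(const struct model_mount *m,
			    const struct model_mnt_ns *current_ns)
{
	return m->mnt_ns == current_ns;
}

/*
 * suid/sgid bits and file caps are honoured only when the mount belongs
 * to the caller's mount namespace and is not mounted nosuid.
 */
static bool model_mnt_may_suid(const struct model_mount *m,
			       const struct model_mnt_ns *current_ns)
{
	return model_check_mnt(m, current_ns) &&
	       !(m->mnt_flags & MODEL_MNT_NOSUID);
}
```

The callers in the patch (bprm_fill_uid(), get_file_caps() and
check_nnp_nosuid()) all use the negated result, so a foreign mount
behaves exactly like one mounted with nosuid.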

2015-07-17 10:13:42

by Lukasz Pawelczyk

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, 2015-07-16 at 19:10 -0500, Eric W. Biederman wrote:
> Lukasz Pawelczyk <[email protected]> writes:
> >
> > I fail to see how those 2 are in any conflict.
>
> Like I said. They don't really conflict, and actually to really support
> things well for smack we probably need something like your patches.

As far as I can see now from the discussion, the best thing to do would
be to inherit the label from a backing store object, or something along
these lines.

> At the same time a patch written without dealing with s_user_ns is going
> to fail to take a lot of important details into account.

I don't touch anything that would need to deal with s_user_ns. I also
don't change Smack's mounting logic in any way. My patches are
orthogonal to that.

> Right now after fixing the mount namespace issues the top priority is to
> work through the details and get s_user_ns implemented. By that I mean
> some version of patch 1 of Seth's series.

My priority is to make the Smack namespace work. This is functionality
that has a perfectly valid use case now. Without it, Smack is
impossible to operate in a container.

> s_user_ns fundamentally changes how the concepts are represented in the
> kernel in a way that is easier to secure, and that fundamentally better
> matches things. And sigh. This review has shown we don't quite have
> all of the details worked out.
>
> > If your approach here is to treat user ns mounted filesystem as if they
> > didn't support xattrs at all then my patches don't conflict here any
> > more than Smack itself already does.
>
> The end game if people developing smack choose to play, is to figure out
> how to store your unmapped labels in a filesystem contained by a
> user namespace and a smack label namespace root.

Storing an unmapped label (read: real label) in Smack namespace is
exactly the same as it is now without the namespace. I always store the
real label.

The problem here is: what real label should be "read" and eventually
stored in that filesystem (see my first comment here). Again, Smack
namespace doesn't touch that logic.

> > If the filesystem will get a default (e.g. by smack* mount options)
> > label then this label will co-work with Smack namespaces.
>
> A default, but I don't know if it will be smack mount options that will
> give that default. The devil is in the details and there are a lot
> of details.

Right now Smack gives the default. If someone modifies Smack to give a
different label because of s_user_ns support, the Smack namespace will
not cause any hindrance here.

The Smack namespace's main role is only to make it possible to operate
Smack within a container. All the other LSMs can do that already as they
don't require caps to operate normally. Smack does. Hence it had to be
namespaced in some way to give limited capabilities in a container
(user ns).

This really has nothing to do with the way Smack mounts, assigns
labels, decides what is allowed and what is not, etc.

What this discussion is about is how to modify or even bend the way LSMs
work to make unprivileged user ns mounts work under an LSM (or not).
The Smack namespace here is just a utility within Smack itself. Maybe
it can be used to help with this at some point, but beyond that it's
orthogonal to the problem.



--
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics


2015-07-17 13:22:54

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 02:42:22PM -0700, Casey Schaufler wrote:

<snip>

> > I welcome feedback about anything I've missed, but stating generally
> > that you think I probably missed something isn't very helpful.
>
> True enough. I hope I've explained myself above.

Thanks, that definitely clarified where we were having a disconnect.
Andy's done a fantastic job explaining how those concerns are addressed.

> > The LSM issue is thornier than the rest of it though, which is why I
> > specifically asked for review there in the cover letter. There's a lot
> > of complexity and nuance, and I still don't have a grasp on all the
> > subtleties. One such subtlety is the full impact of simply ignoring the
> > security labels on disk (but I am still confused as to why this is
> > different from filesystems which don't support xattrs at all).
>
> If you can mount a filesystem such that the labels are ignored you
> are effectively specifying that the Smack label on the files be
> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
> Without it, it's not.
>
> > I was unaware of Lukasz's patches until yesterday, and I will have a
> > look at them. But since we don't have the LSM support for user
> > namespaces yet, I don't see the problem with doing something safe for
> > LSMs initially and evolving the LSM integration for user ns mounts along
> > with the rest of the user ns integration.
>
> Ignoring the security attributes is not safe!

Understood. It's surely safe for each LSM to deny such mounts until it
has a way to handle them safely though.

I'm not trying to completely punt on the issue of security modules, just
break this down into more manageable chunks. You've given good guidance
for Smack (thanks very much for that), so I can plan to work on that
soon.

> > Your point is taken about my less-than-expert opinion about the other
> > security modules. We should at minimum get acks from the maintainers of
> > those modules that unprivileged mounts will not compromise MAC.
>
> I am the Smack maintainer. Unprivileged mounts as you have
> described them compromise MAC. They compromise DAC, too.

It looks like Andy's more or less convinced you that DAC isn't
(additionally?) compromised. And there's a plan for MAC, that the
security module can deny mounts from user namespaces until it has a
solution for allowing them safely.

> > For Smack specifically, I believe my only concern was the SMACK64EXEC
> > attribute, as all the other attributes only affected subjects' access to
> > the files. So maybe it would be possible to simply ignore this attribute
> > in unprivileged mounts and respect the others, even lacking more
> > complete LSM support for user namespaces.
>
> SMACK64EXEC is analogous to the setuid bit, but I would rather see
> exec() of programs with this attribute refused than for it to be
> blindly ignored.

That's fine, it's your call.

Thanks,
Seth

2015-07-17 14:28:36

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
> > On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >> I really don't see the benefit of making up extra rules that apply to
> >> users outside a userns who try to access specifically a filesystem
> >> with backing store. They wouldn't make sense for filesystems without
> >> backing store.
> >
> > Sure it would. For Smack, it would be the label a file would be
> > created with, which would be the label of the process creating
> > the memory based filesystem. For SELinux the rules are a
> > touch more sophisticated, but I'm sure that Paul or Stephen could
> > come up with how to determine it.
> >
> > The point, looping all the way back to the beginning, where we
> > were talking about just ignoring the labels on the filesystem,
> > is that if you use the same Smack label on the files in the
> > filesystem as the backing store file has, we'll all be happy.
> > If that label isn't something user can write to, he won't be
> > able to write to the mounted objects, either. If there is no
> > backing store then use the label of the process creating the
> > filesystem, which will be the user, which will mean everything
> > will work hunky dory.
> >
> > Yes, there's work involved, but I doubt there's a lot. Getting
> > the label from the backing store or the creating process is
> > simple enough.
> >
>
> So what if Smack used the label of the user creating the filesystem
> even for filesystems with backing store? IMO this ought to be doable

The more usual LSM-ish way to handle this would be to ask the LSM, at
mount time, with a new security_mount_bdev_in_userns() hook, passing
it the user's label and the backing store's label (if any), and storing
the label to be used for the files. Even more LSM-ish (though risking
performance hit) would be to then have the LSM at each inode_init_security
decide whether to use that label or trust what's in the fs anyway (or
do something else). That could allow the LSM to use policy to decide
that.

Because I don't know that for all LSMs it makes sense for a 'subject'
label to be assigned to an object.
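
The defaulting rule being discussed (take the label from the backing
store, and fall back to the mounting process's label when there is no
backing store) could be modeled roughly like this. The structure and
function names are hypothetical and exist only for this sketch; no such
hook is part of the posted patches, and a real LSM might apply policy
instead of a fixed rule.

```c
#include <stddef.h>

/* Inputs a mount-time hook might be handed (hypothetical). */
struct model_mount_labels {
	const char *creator_label;	/* label of the mounting process */
	const char *backing_label;	/* label of the backing store, NULL if none */
};

/*
 * Pick the label to apply to every file exposed by the mount: the
 * backing store's label when there is one, else the creator's label.
 */
static const char *model_pick_file_label(const struct model_mount_labels *m)
{
	if (m->backing_label != NULL)
		return m->backing_label;
	return m->creator_label;
}
```

Under this rule the files can never carry a label that grants more
access than the mounting user already had to the backing store itself.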

> with the LSM hooks -- it certainly seems reasonable for the LSM to be
> aware of who created a filesystem. In fact, I'd argue that if Smack
> can't do this with the proposed LSM hooks, then the hooks are
> insufficient.
>
> Presumably Smack could also figure out what was mounted, but keep in
> mind that there are filesystems like ntfs-3g out there. While ntfs-3g
> logically has backing store, I don't think the kernel actually knows
> about it.
>
> >
> >>>>> If you can mount a filesystem such that the labels are ignored you
> >>>>> are effectively specifying that the Smack label on the files be
> >>>>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
> >>>>> Without it, it's not.
> >>>> Can you explain what the threat model is here? I don't see what it is
> >>>> that you're trying to prevent.
> >>> Um, OK.
> >>> The filesystem has files with a hundred different Smack labels on it.
> >>> I mount it as an unlabeled filesystem and everything is readable by
> >>> everyone. Bad jojo.
> >> I still don't understand. If it's a filesystem backed by a file that
> >> Seth has RW access to, then Seth can read everything on it, full stop.
> >> The security labels in the filesystem are irrelevant.
> >
> > Well, they can't be trusted, if that's what you mean.
> > That's why I'm saying that the objects exposed by mounting
> > this backing store need to be treated with the same security
> > attributes as the backing store. Fudge it for DAC if you are
> > so inclined, but I think it's the right way to go for MAC.
> >
> >> This is like saying that, if you put restrictive labels in the
> >> filesystem that lives on /dev/sda2 and give Seth ownership of
> >> /dev/sda2, then you expect Seth to be unable to bypass the policy
> >> specified by your labels.
> >
> > Consider the Smack label on /dev/sda2. Smack does not care
> > who owns it, just what the Smack label is. Just like on
> > ~/seth/myfs. The backing store "object" is /dev/sda2 in the
> > one case, ~/seth/myfs in the other, and something in the ether
> > for a memory based filesystem. So long as the labels of the
> > files exposed on the mount point match those of the backing
> > store "object", Smack is going to be happy. Since you're
> > running without privilege, you can't change the labels on
> > the files.
> >
> > Now Seth, being the sneaky person that he is, could change
> > the Smack labels on the files in the backing store while it's
> > offline. Since he has access to the backing store, he can't
> > give himself more access by changing the labels within the
> > filesystem. He can give himself less, but I'm OK with that.
> >
> >> Or maybe I'm misunderstanding you.
> >
> > Probably, but I'm undoubtedly doing the same.
> >
> > If you're going to be at LinuxCon in Seattle we should
> > continue this discussion over the beverage of your choice.
>
> There's a small but not quite zero chance I'll be there. I'll
> probably be in Seoul. It's too bad that LSS and KS are in different
> places this year.

FWIW I'll be there and happy to discuss.

-serge

2015-07-17 14:58:06

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Fri, Jul 17, 2015 at 09:28:32AM -0500, Serge E. Hallyn wrote:
> > > If you're going to be at LinuxCon in Seattle we should
> > > continue this discussion over the beverage of your choice.
> >
> > There's a small but not quite zero chance I'll be there. I'll
> > probably be in Seoul. It's too bad that LSS and KS are in different
> > places this year.
>
> FWIW I'll be there and happy to discuss.

I'll also be in Seattle and happy to discuss.

Seth

2015-07-17 17:14:16

by Casey Schaufler

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/17/2015 6:21 AM, Seth Forshee wrote:
> On Thu, Jul 16, 2015 at 02:42:22PM -0700, Casey Schaufler wrote:
>
> <snip>
>
>>> I welcome feedback about anything I've missed, but stating generally
>>> that you think I probably missed something isn't very helpful.
>> True enough. I hope I've explained myself above.
> Thanks, that definitely clarified where we were having a disconnect.
> Andy's done a fantastic job explaining how those concerns are addressed.
>
>>> The LSM issue is thornier than the rest of it though, which is why I
>>> specifically asked for review there in the cover letter. There's a lot
>>> of complexity and nuance, and I still don't have a grasp on all the
>>> subtleties. One such subtlety is the full impact of simply ignoring the
>>> security labels on disk (but I am still confused as to why this is
>>> different from filesystems which don't support xattrs at all).
>> If you can mount a filesystem such that the labels are ignored you
>> are effectively specifying that the Smack label on the files be
>> determined by the defaulting rules. With CAP_MAC_ADMIN that's fine.
>> Without it, it's not.
>>
>>> I was unaware of Lukasz's patches until yesterday, and I will have a
>>> look at them. But since we don't have the LSM support for user
>>> namespaces yet, I don't see the problem with doing something safe for
>>> LSMs initially and evolving the LSM integration for user ns mounts along
>>> with the rest of the user ns integration.
>> Ignoring the security attributes is not safe!
> Understood. It's surely safe for each LSM to deny such mounts until it
> has a way to handle them safely though.
>
> I'm not trying to completely punt on the issue of security modules, just
> break this down into more manageable chunks. You've given good guidance
> for Smack (thanks very much for that), so I can plan to work on that
> soon.
>
>>> Your point is taken about my less-than-expert opinion about the other
>>> security modules. We should at minimum get acks from the maintainers of
>>> those modules that unprivileged mounts will not compromise MAC.
>> I am the Smack maintainer. Unprivileged mounts as you have
>> described them compromise MAC. They compromise DAC, too.
> It looks like Andy's more or less convinced you that DAC isn't
> (additionally?) compromised. And there's a plan for MAC, that the
> security module can deny mounts from user namespaces until it has a
> solution for allowing them safely.

I wouldn't say that Andy has me convinced on DAC. I would say that
he's taken me deeper into the details of namespaces than I feel
comfortable making arguments about. I don't know that he's right,
I just don't know how to argue that he isn't. Part of what bothers
me is the dependence on namespaces. If you could come up with a
mechanism that wasn't dependent on namespaces it would be much
easier for dinosaurs like me to comprehend.

As far as declaring that MAC and namespace owned mounts are
incompatible goes, I think that I said early on that wasn't
going to fly. Too much of the Linux population (Fedora, Android,
Tizen, ...) uses MAC for the feature to be considered ready
for general consumption without it. And no, I don't believe in
partial implementations. You wouldn't get away with putting this
in if it only worked on s370 processors.

>>> For Smack specifically, I believe my only concern was the SMACK64EXEC
>>> attribute, as all the other attributes only affected subjects' access to
>>> the files. So maybe it would be possible to simply ignore this attribute
>>> in unprivileged mounts and respect the others, even lacking more
>>> complete LSM support for user namespaces.
>> SMACK64EXEC is analogous to the setuid bit, but I would rather see
>> exec() of programs with this attribute refused than for it to be
>> blindly ignored.
> That's fine, it's your call.

I said it, but on reflection the current NOSETUID behavior is
as you described it, so I wouldn't change that.

>
> Thanks,
> Seth

2015-07-18 00:07:06

by Serge E. Hallyn

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> Dave Chinner <[email protected]> writes:
>
> > On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> >> Casey Schaufler <[email protected]> writes:
> >> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> >> >> If I mount an unprivileged filesystem, then either the contents were
> >> >> put there *by me*, in which case letting me access them is fine, or
> >> >> (with Seth's patches and then some) I control the backing store, in
> >> >> which case I can do whatever I want regardless of what LSM thinks.
> >> >>
> >> >> So I don't see the problem. Why would Smack or any other LSM care at
> >> >> all, unless it wants to prevent me from mounting the fs in the first
> >> >> place?
> >> >
> >> > First off, I don't cotton to the notion that you should be able
> >> > to mount filesystems without privilege. But it seems I'm being
> >> > outvoted on that. I suspect that there are cases where it might
> >> > be safe, but I can't think of one off the top of my head.
> >>
> >> There are two fundamental issues with mounting filesystems without privilege,
> >> by which I actually mean mounting filesystems as the root user in a user
> >> namespace.
> >>
> >> - Are the semantics safe.
> >> - Is the extra attack surface a problem.
> >
> > I think the attack surface this exposes is the biggest problem
> > facing this proposal.
>
> I completely agree.
>
> >> Figuring out how to make semantics safe is what we are talking about.
> >>
> >> Once we sort out the semantics we can look at the handful of filesystems
> >> like fuse where the extra attack surface is not a concern.
> >>
> >> With that said desktop environments have for a long time been
> >> automatically mounting whichever filesystem you place in your computer,
> >> so in practice what this is really about is trying to align the kernel
> >> with how people use filesystems.
> >
> > The key difference is that desktops only do this when you physically
> > plug in a device. With unprivileged mounts, a hostile attacker
> > doesn't need physical access to the machine to exploit lurking
> > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > they can keep mounting corrupted images until they find something
> > that works.
>
> Yep. That magnifies the problem quite a bit.
>
> > User namespaces are supposed to provide trust separation. The
> > kernel filesystems simply aren't hardened against unprivileged
> > attacks from below - there is a trust relationship between root and
> > the filesystem in that they are the only things that can write to
> > the disk. Mounts from within a userns destroys this relationship as
> > the userns root, by definition, is not a trusted actor.
>
> I talked to Ted Tso a while back and ext4 is at least in principle
> already hardened against that kind of attack. I am not certain I
> believe it, but if it is true I think it is fantastic.

Not sure what he said in private, but at the kernel summit last year
what he said was not that it was "hardened", but that any bug
resulting from mounting a garbage image (i.e. an unprivileged user
fuzzing) would be deemed by him a real bug, as opposed to saying
"don't do that".

To the best of my knowledge that's so far only the case with Ted/ext4,
which I assume is why Seth started with ext4.

-serge

2015-07-20 17:55:04

by Colin Walters

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015, at 12:47 AM, Eric W. Biederman wrote:

> With that said desktop environments have for a long time been
> automatically mounting whichever filesystem you place in your computer,
> so in practice what this is really about is trying to align the kernel
> with how people use filesystems.

There is a large attack surface difference between mounting a device
that someone physically plugged into the computer (and note typically
it's required that the active console be unlocked as well[1]) versus
allowing any "unprivileged" process at any time to do it.

Many server setups use "unprivileged" uids that otherwise wouldn't
be able to exploit bugs in filesystem code.

[1] https://bugzilla.gnome.org/show_bug.cgi?id=653520
"AutomountManager also keeps track of the current session availability
(using the ConsoleKit and gnome-screensaver DBus interfaces) and
inhibits mounting if the current session is locked, or another session
is in use instead."

2015-07-21 17:37:26

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> > Dave Chinner <[email protected]> writes:
> >
> > > On Wed, Jul 15, 2015 at 11:47:08PM -0500, Eric W. Biederman wrote:
> > >> Casey Schaufler <[email protected]> writes:
> > >> > On 7/15/2015 6:08 PM, Andy Lutomirski wrote:
> > >> >> If I mount an unprivileged filesystem, then either the contents were
> > >> >> put there *by me*, in which case letting me access them is fine, or
> > >> >> (with Seth's patches and then some) I control the backing store, in
> > >> >> which case I can do whatever I want regardless of what LSM thinks.
> > >> >>
> > >> >> So I don't see the problem. Why would Smack or any other LSM care at
> > >> >> all, unless it wants to prevent me from mounting the fs in the first
> > >> >> place?
> > >> >
> > >> > First off, I don't cotton to the notion that you should be able
> > >> > to mount filesystems without privilege. But it seems I'm being
> > >> > outvoted on that. I suspect that there are cases where it might
> > >> > be safe, but I can't think of one off the top of my head.
> > >>
> > >> There are two fundamental issues with mounting filesystems without privilege,
> > >> by which I actually mean mounting filesystems as the root user in a user
> > >> namespace.
> > >>
> > >> - Are the semantics safe.
> > >> - Is the extra attack surface a problem.
> > >
> > > I think the attack surface this exposes is the biggest problem
> > > facing this proposal.
> >
> > I completely agree.
> >
> > >> Figuring out how to make semantics safe is what we are talking about.
> > >>
> > >> Once we sort out the semantics we can look at the handful of filesystems
> > >> like fuse where the extra attack surface is not a concern.
> > >>
> > >> With that said desktop environments have for a long time been
> > >> automatically mounting whichever filesystem you place in your computer,
> > >> so in practice what this is really about is trying to align the kernel
> > >> with how people use filesystems.
> > >
> > > The key difference is that desktops only do this when you physically
> > > plug in a device. With unprivileged mounts, a hostile attacker
> > > doesn't need physical access to the machine to exploit lurking
> > > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > > they can keep mounting corrupted images until they find something
> > > that works.
> >
> > Yep. That magnifies the problem quite a bit.
> >
> > > User namespaces are supposed to provide trust separation. The
> > > kernel filesystems simply aren't hardened against unprivileged
> > > attacks from below - there is a trust relationship between root and
> > > the filesystem in that they are the only things that can write to
> > > the disk. Mounts from within a userns destroys this relationship as
> > > the userns root, by definition, is not a trusted actor.
> >
> > I talked to Ted Tso a while back and ext4 is at least in principle
> > already hardened against that kind of attack. I am not certain I
> > believe it, but if it is true I think it is fantastic.
>
> No, it's not. No filesystem is, because to harden against such
> attacks requires complete verification of all metadata when it is
> read from disk, before it is used, or some method of ensuring the
> block was not tampered with. CRCs are not sufficient, because they
> can be tampered with, too.
>
> The only way a filesystem would be able to trust what it reads from
> disk has not been tampered with in a system with untrusted mounts is
> if it has some kind of cryptographically secure signature in the
> metadata and the attacker is unable to access the key for that
> signature.

Preventing tampering is a little different from protecting the kernel
from attack, isn't it? I thought the latter was what people were asking
about.

So, for example, a screwed up on-disk directory structure shouldn't
result in creating a cycle in the dcache and then deadlocking.

--b.

> No filesystem we have has that capability and AFAIA there
> are no plans for any filesystem to implement such tamper detection.
> And no, ext4 encryption does not provide this because it only stores
> the values and data in encrypted format and does not protect
> metadata from tampering when it is not mounted.
>
> If we don't have crypto signatures in metadata, then XFS is probably
> the most robust against tampering as it does a lot more checking of
> the on-disk metadata before it is used than any other filesystem
> (i.e. see the verifier infrastructure that does corruption checks
> after read (in io completion) and before write (in io submission)
> to catch bad metadata before it is used by the kernel, or before it
> is written to disk by the kernel).
>
> However, these checks are far from comprehensive. We can only check
> internal consistency of the metadata objects in the block, and even
> then we really only can check for values within range rather than
> absolute correctness. e.g. we can check a dirent has a valid name,
> length, ftype and inode number, but we can't validate that the inode
> is actually allocated or not because that requires a lookup in the
> allocated inode btree. We *trust* that inode number to be
> allocated and valid because it is in metadata the filesystem wrote.
>
> For inode numbers that come from untrusted sources (NFS,
> open-by-handle, etc) we have a flag that does inode number
> validation on lookup (XFS_IGET_UNTRUSTED) to check against trusted
> metadata (i.e. the allocated inode btrees), but that is expensive
> and so not done on inodes that we pull directly from metadata that
> has come from disk. Indeed, we still trust on-disk metadata to be
> correct to validate that other metadata can be trusted, so if one
> structure can be tampered with, so can others.
>
> IOWs, if we cannot trust one part of the filesystem metadata to be
> correct, then we cannot trust that filesystem *at all*, *for
> anything*. And even running fsck doesn't restore trust - all it does
> is tell us that any modification that was made is not a detectable
> inconsistency that needs fixing.
>
> > At this point any setting of the FS_USER_MOUNT flag I figure needs to go
> > through the filesystem maintainers tree and they need to be aware of and
> > agree to deal with the attack from below issue.
> >
> > The one filesystem I truly expect we can make work is fuse. fuse has
> > been designed to deal with some variation of the attack from below issue
> > since day one. We looked at what the patches to fuse would look like
> > with the current state of the vfs and it was not pretty.
> >
> > We very much need to sort through as much as possible at the vfs layer,
> > and in generic code. Allow everyone to see what is going on and how
> > it works before proceeding with enabling any filesystems.
>
> The VFS protects us from attacks from above the filesystem, not
> below. The VFS plays no part in validating the on-disk structure of
> a filesystem which is what attacks from below will be attempting to
> exploit.
>
> > I truly hope we can find a small set of block device filesystems that we
> > can harden from attack below. That would allow linux to have serious
> > defenses against evil usb stick attacks. I think that is going to take
> > a lot of careful coding, testing and validation and advancing the state
> > of the art to get there.
>
> Somehow, I just can't see that happening.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2015-07-21 20:35:55

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
> > On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >> I really don't see the benefit of making up extra rules that apply to
> >> users outside a userns who try to access specifically a filesystem
> >> with backing store. They wouldn't make sense for filesystems without
> >> backing store.
> >
> > Sure it would. For Smack, it would be the label a file would be
> > created with, which would be the label of the process creating
> > the memory based filesystem. For SELinux the rules are a
> > touch more sophisticated, but I'm sure that Paul or Stephen could
> > come up with how to determine it.
> >
> > The point, looping all the way back to the beginning, where we
> > were talking about just ignoring the labels on the filesystem,
> > is that if you use the same Smack label on the files in the
> > filesystem as the backing store file has, we'll all be happy.
> > If that label isn't something user can write to, he won't be
> > able to write to the mounted objects, either. If there is no
> > backing store then use the label of the process creating the
> > filesystem, which will be the user, which will mean everything
> > will work hunky dory.
> >
> > Yes, there's work involved, but I doubt there's a lot. Getting
> > the label from the backing store or the creating process is
> > simple enough.
> >

So something like the diff below (untested)?

All I'm really doing is setting smk_default as you describe above and
then using it instead of smk_of_current() in
smack_inode_alloc_security() and instead of the label from the disk in
smack_d_instantiate(). Since a user currently needs CAP_MAC_ADMIN in
init_user_ns to store security labels it looks like this should be
sufficient. I'm not even sure that the inode_alloc_security hook changes
are needed.
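
As a rough illustration of the selection logic described above, here is
a userspace C model (a sketch, not kernel code). The struct names, the
`_model` suffixes and pick_default_label() are made up for this example.
Only the comparison against the init namespace mirrors sb_in_userns()
from the diff, and the CAP_MAC_ADMIN check that surrounds this logic in
smack_sb_kern_mount() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct user_ns { int id; };
struct label { const char *name; };

struct super_block_model {
	const struct user_ns *s_user_ns; /* namespace that mounted the fs */
	const struct label *bdev_label;  /* backing store label, or NULL */
};

/* Stand-in for the kernel's init_user_ns. */
static const struct user_ns init_user_ns_model = { 0 };

/* Mirrors sb_in_userns() from the diff: true for non-init namespaces. */
static bool sb_in_userns_model(const struct super_block_model *sb)
{
	assert(sb != NULL);
	return sb->s_user_ns != &init_user_ns_model;
}

/*
 * Pick the default label for a mount: the backing store's label for a
 * user namespace mount that has one, otherwise the mounting task's label.
 */
static const struct label *
pick_default_label(const struct super_block_model *sb,
		   const struct label *task_label)
{
	if (sb_in_userns_model(sb) && sb->bdev_label)
		return sb->bdev_label;
	return task_label;
}
```

In this model a user namespace mount with a backing device takes the
backing store's label, and everything else falls back to the label of
the mounting task.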

We could allow privileged users in s_user_ns to write security labels to
disk since they already control the backing store, as long as Smack
didn't subsequently import them. I didn't do that here.

> So what if Smack used the label of the user creating the filesystem
> even for filesystems with backing store? IMO this ought to be doable
> with the LSM hooks -- it certainly seems reasonable for the LSM to be
> aware of who created a filesystem. In fact, I'd argue that if Smack
> can't do this with the proposed LSM hooks, then the hooks are
> insufficient.

It would be very simple to use the label of the task instead.

Seth

---

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32f598db0b0d..4597420ab933 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
 	__sb_start_write(sb, SB_FREEZE_FS, true);
 }

+static inline bool sb_in_userns(struct super_block *sb)
+{
+	return sb->s_user_ns != &init_user_ns;
+}

 extern bool inode_owner_or_capable(const struct inode *inode);

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..591fd19294e7 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
 	char *buffer;
 	struct smack_known *skp = NULL;

+	/* Should never fetch xattrs from untrusted mounts */
+	if (WARN_ON(sb_in_userns(ip->i_sb)))
+		return ERR_PTR(-EPERM);
+
 	if (ip->i_op->getxattr == NULL)
 		return ERR_PTR(-EOPNOTSUPP);

@@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		 */
 		if (specified)
 			return -EPERM;
+
 		/*
-		 * Unprivileged mounts get root and default from the caller.
+		 * User namespace mounts get root and default from the backing
+		 * store, if there is one. Other unprivileged mounts get them
+		 * from the caller.
 		 */
-		skp = smk_of_current();
+		skp = (sb_in_userns(sb) && sb->s_bdev) ?
+			smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
 	}
@@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
  */
 static int smack_inode_alloc_security(struct inode *inode)
 {
-	struct smack_known *skp = smk_of_current();
+	struct smack_known *skp;
+
+	if (sb_in_userns(inode->i_sb))
+		skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
+	else
+		skp = smk_of_current();

 	inode->i_security = new_inode_smack(skp);
 	if (inode->i_security == NULL)
@@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			break;
 		}
 		/*
+		 * Don't use labels from xattrs for unprivileged mounts.
+		 */
+		if (sb_in_userns(inode->i_sb))
+			break;
+		/*
 		 * No xattr support means, alas, no SMACK label.
 		 * Use the aforeapplied default.
 		 * It would be curious if the label of the task

2015-07-22 01:52:24

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/21/2015 1:35 PM, Seth Forshee wrote:
> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>> I really don't see the benefit of making up extra rules that apply to
>>>> users outside a userns who try to access specifically a filesystem
>>>> with backing store. They wouldn't make sense for filesystems without
>>>> backing store.
>>> Sure it would. For Smack, it would be the label a file would be
>>> created with, which would be the label of the process creating
>>> the memory based filesystem. For SELinux the rules are a
>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>> come up with how to determine it.
>>>
>>> The point, looping all the way back to the beginning, where we
>>> were talking about just ignoring the labels on the filesystem,
>>> is that if you use the same Smack label on the files in the
>>> filesystem as the backing store file has, we'll all be happy.
>>> If that label isn't something user can write to, he won't be
>>> able to write to the mounted objects, either. If there is no
>>> backing store then use the label of the process creating the
>>> filesystem, which will be the user, which will mean everything
>>> will work hunky dory.
>>>
>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>> the label from the backing store or the creating process is
>>> simple enough.
>>>
> So something like the diff below (untested)?

I think that this is close, and quite good for someone
who isn't very familiar with Smack. It's definitely headed
in the right direction.

> All I'm really doing is setting smk_default as you describe above and
> then using it instead of smk_of_current() in
> smack_inode_alloc_security() and instead of the label from the disk in
> smack_d_instantiate().

Let's say your backing store is a file labeled Rubble.

mount -o smackfsroot=Rubble,smackfsdef=Rubble ...

It is completely reasonable for a process labeled Flintstone to
have rwxa access to a file labeled Rubble.

Smack rule: Flintstone Rubble rwxa

In the case of writing to an existing Rubble file, what you
have looks fine. What's not so great is that if the Flintstone
process creates a file, it should be labeled Flintstone. Your
use of smk_default is going to violate the principle
of least astonishment, and break the Smack policy as well.

Let's make a minor change. Instead of using smackfsroot let's
use smackfstransmute and a slightly different access rule:

mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...

Smack rule: Flintstone Rubble rwxat

Now the only change we have to make to the Smack code is
that we don't want to create any files unless either the
process is labeled Rubble or the rule allowing the creation
has the "t" for transmute access. That should ensure that
everything is labeled Rubble. If it isn't, someone has mucked
with the metadata in a detectable way.
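
To make the creation rule above concrete, here is a small userspace C
model (a sketch under stated assumptions, not Smack code). The rule
table, the ACC_* bit values and may_create_model() are all invented for
illustration. The only thing taken from the discussion is the rule
itself: creation is allowed when the process already carries the
mount's label, or when the matching rule includes "t".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative access bits; the values are assumptions for this model. */
#define ACC_RWXA      0x0f
#define ACC_TRANSMUTE 0x1000

/* One "subject object access" rule, e.g. Flintstone Rubble rwxat. */
struct rule_model {
	const char *subject;
	const char *object;
	int access;
};

/* Return the access bits granted to subject on object, or 0 if no rule. */
static int access_for(const struct rule_model *rules, size_t n,
		      const char *subject, const char *object)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(rules[i].subject, subject) == 0 &&
		    strcmp(rules[i].object, object) == 0)
			return rules[i].access;
	return 0;
}

/*
 * Creation check for a user namespace mount whose files must all carry
 * mount_label: allow it only if the task already has that label or the
 * matching rule grants transmute access.
 */
static int may_create_model(const struct rule_model *rules, size_t n,
			    const char *task_label, const char *mount_label)
{
	assert(task_label != NULL && mount_label != NULL);
	if (strcmp(task_label, mount_label) == 0)
		return 0;
	if (access_for(rules, n, task_label, mount_label) & ACC_TRANSMUTE)
		return 0;
	return -EACCES;
}
```

With the rule "Flintstone Rubble rwxa" a Flintstone process is refused,
while "Flintstone Rubble rwxat" lets the creation through, so every new
file can end up labeled Rubble.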


> Since a user currently needs CAP_MAC_ADMIN in
> init_user_ns to store security labels it looks like this should be
> sufficient. I'm not even sure that the inode_alloc_security hook changes
> are needed.
>
> We could allow privileged users in s_user_ns to write security labels to
> disk since they already control the backing store, as long as Smack
> didn't subsequently import them. I didn't do that here.
>
>> So what if Smack used the label of the user creating the filesystem
>> even for filesystems with backing store? IMO this ought to be doable
>> with the LSM hooks -- it certainly seems reasonable for the LSM to be
>> aware of who created a filesystem. In fact, I'd argue that if Smack
>> can't do this with the proposed LSM hooks, then the hooks are
>> insufficient.
> It would be very simple to use the label of the task instead.
>
> Seth
>
> ---
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 32f598db0b0d..4597420ab933 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> __sb_start_write(sb, SB_FREEZE_FS, true);
> }
>
> +static inline bool sb_in_userns(struct super_block *sb)
> +{
> + return sb->s_user_ns != &init_user_ns;
> +}
>
> extern bool inode_owner_or_capable(const struct inode *inode);
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a143328f75eb..591fd19294e7 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> char *buffer;
> struct smack_known *skp = NULL;
>
> + /* Should never fetch xattrs from untrusted mounts */
> + if (WARN_ON(sb_in_userns(ip->i_sb)))
> + return ERR_PTR(-EPERM);
> +

Go ahead and fetch it, we'll check to make sure it's viable later.

> if (ip->i_op->getxattr == NULL)
> return ERR_PTR(-EOPNOTSUPP);
>
> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> */
> if (specified)
> return -EPERM;
> +
> /*
> - * Unprivileged mounts get root and default from the caller.
> + * User namespace mounts get root and default from the backing
> + * store, if there is one. Other unprivileged mounts get them
> + * from the caller.
> */
> - skp = smk_of_current();
> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> sp->smk_root = skp;
> sp->smk_default = skp;

sp->smk_flags |= SMK_INODE_TRANSMUTE;

> }
> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> */
> static int smack_inode_alloc_security(struct inode *inode)
> {
> - struct smack_known *skp = smk_of_current();
> + struct smack_known *skp;
> +
> + if (sb_in_userns(inode->i_sb))
> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> + else
> + skp = smk_of_current();

This should be left alone.
smack_inode_init_security is where you could disallow access that doesn't
legitimately result in a Rubble label on the file. It's something like

	/* ... after the call may = smk_access_entry(...) */
	if (sb_in_userns(inode->i_sb))
		if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
			return -EACCES;

> inode->i_security = new_inode_smack(skp);
> if (inode->i_security == NULL)
> @@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
> break;
> }
> /*
> + * Don't use labels from xattrs for unprivileged mounts.
> + */
> + if (sb_in_userns(inode->i_sb))
> + break;
> + /*

Again, use the label. Just check to make sure it's what you expect.

> * No xattr support means, alas, no SMACK label.
> * Use the aforeapplied default.
> * It would be curious if the label of the task

Also untested.


2015-07-22 07:56:47

by Dave Chinner

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> > > Dave Chinner <[email protected]> writes:
> > > > The key difference is that desktops only do this when you physically
> > > > plug in a device. With unprivileged mounts, a hostile attacker
> > > > doesn't need physical access to the machine to exploit lurking
> > > > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > > > they can keep mounting corrupted images until they find something
> > > > that works.
> > >
> > > Yep. That magnifies the problem quite a bit.
> > >
> > > > User namespaces are supposed to provide trust separation. The
> > > > kernel filesystems simply aren't hardened against unprivileged
> > > > attacks from below - there is a trust relationship between root and
> > > > the filesystem in that they are the only things that can write to
> > > > the disk. Mounts from within a userns destroys this relationship as
> > > > the userns root, by definition, is not a trusted actor.
> > >
> > > I talked to Ted Tso a while back and ext4 is at least in principle
> > > already hardened against that kind of attack. I am not certain I
> > > believe it, but if it is true I think it is fantastic.
> >
> > No, it's not. No filesystem is, because to harden against such
> > attacks requires complete verification of all metadata when it is
> > read from disk, before it is used, or some method of ensuring the
> > block was not tampered with. CRCs are not sufficient, because they
> > can be tampered with, too.
> >
> > The only way a filesystem would be able to trust what it reads from
> > disk has not been tampered with in a system with untrusted mounts is
> > if it has some kind of cryptographically secure signature in the
> > metadata and the attacker is unable to access the key for that
> > signature.
>
> Preventing tampering is a little different from protecting the kernel
> from attack, isn't it? I thought the latter was what people were asking
> about.

People might be asking for the latter, but the only attack vector
that can be made against filesystems from below is via tampering
with the on-disk structure.

An untrusted user in an untrusted container can construct arbitrary
untrusted filesystem structures and get them parsed by a context
running as $DEITY that assumes the structure is from a trusted
source. What can possibly go wrong?

IOWs, to protect the kernel against attack from untrusted filesystem
images, we either have to be able to guarantee the image cannot be
modified by untrusted parties (i.e. needs to be created with
signed tools, contain only signed filesystem metadata and
signed/encrypted data), or we have to sandbox the filesystem parsing
code completely (i.e. fuse).

> So, for example, a screwed up on-disk directory structure shouldn't
> result in creating a cycle in the dcache and then deadlocking.

Therein lies the problem: how do you detect such structural defects
without doing a full structure validation? e.g. cyclic links may
only manifest when completely unrelated pieces of metadata are linked
together in a specific way.
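
A toy userspace C model of that point (purely illustrative; the flat
parent-index layout and both function names are assumptions, not any
real on-disk format): every entry can pass a local range check while
the links between entries still form a cycle, and finding that cycle
requires walking the structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical layout: parent[i] is the index of directory i's parent,
 * and directory 0 is the root.
 */

/* Local check: every stored parent index lies inside the table. */
static bool parents_in_range(const size_t *parent, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (parent[i] >= n)
			return false;
	return true;
}

/*
 * Global check: follow parent links from dir. With n directories a
 * valid path reaches the root within n visits; otherwise the walk is
 * stuck in a cycle (or has hit an out-of-range link).
 */
static bool reaches_root(const size_t *parent, size_t n, size_t dir)
{
	for (size_t steps = 0; steps < n; steps++) {
		if (dir >= n)
			return false; /* out-of-range link */
		if (dir == 0)
			return true;
		dir = parent[dir];
	}
	return false;
}
```

A table such as { 0, 0, 3, 2 } passes parents_in_range(), yet
directories 2 and 3 point at each other and never reach the root. Only
the walk in reaches_root() notices, and doing that for every object on
every cold read is the cost being discussed here.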

Further, the problem is not restricted to validation at mount time -
if the user can write to the filesystem image file, then they can
modify it after it has been mounted, too. That means the attacker
may be someone who has broken into a container, not necessarily the
user you trusted with unprivileged mounts. That means every cold
metadata read needs to be treated with suspicion, not just at mount
time.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-22 14:09:27

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > On Thu, Jul 16, 2015 at 07:42:03PM -0500, Eric W. Biederman wrote:
> > > > Dave Chinner <[email protected]> writes:
> > > > > The key difference is that desktops only do this when you physically
> > > > > plug in a device. With unprivileged mounts, a hostile attacker
> > > > > doesn't need physical access to the machine to exploit lurking
> > > > > kernel filesystem bugs. i.e. they can just use loopback mounts, and
> > > > > they can keep mounting corrupted images until they find something
> > > > > that works.
> > > >
> > > > Yep. That magnifies the problem quite a bit.
> > > >
> > > > > User namespaces are supposed to provide trust separation. The
> > > > > kernel filesystems simply aren't hardened against unprivileged
> > > > > attacks from below - there is a trust relationship between root and
> > > > > the filesystem in that they are the only things that can write to
> > > > > the disk. Mounts from within a userns destroys this relationship as
> > > > > the userns root, by definition, is not a trusted actor.
> > > >
> > > > I talked to Ted Tso a while back and ext4 is at least in principle
> > > > already hardened against that kind of attack. I am not certain I
> > > > believe it, but if it is true I think it is fantastic.
> > >
> > > No, it's not. No filesystem is, because to harden against such
> > > attacks requires complete verification of all metadata when it is
> > > read from disk, before it is used, or some method of ensuring the
> > > block was not tampered with. CRCs are not sufficient, because they
> > > can be tampered with, too.
> > >
> > > The only way a filesystem would be able to trust what it reads from
> > > disk has not been tampered with in a system with untrusted mounts is
> > > if it has some kind of cryptographically secure signature in the
> > > metadata and the attacker is unable to access the key for that
> > > signature.
> >
> > Preventing tampering is a little different from protecting the kernel
> > from attack, isn't it? I thought the latter was what people were asking
> > about.
>
> People might be asking for the latter, but the only attack vector
> that can be made against filesystems from below is via tampering
> with the on-disk structure.
>
> An untrusted user in an untrusted container can construct arbitrary
> untrusted filesystem structures and get them parsed by a context
> running as $DEITY that assumes the structure is from a trusted
> source. What can possibly go wrong?
>
> IOWs, to protect the kernel against attack from untrusted filesystem
> images, we either have to be able to guarantee the image can not be
> modified by untrusted parties (i.e. needs to be created with
> signed tools, contain only signed filesystem metadata and
> signed/encrypted data),

I don't think that works--who exactly would be the "trusted party"? It
can't be this kernel or this hardware--users expect to be able to mount
filesystems created by older kernels, on other machines, running other
distributions (even other operating systems). It can't be the
user--then any user could compromise the kernel by signing a bad
filesystem.

Authenticating the creator of the filesystem might be useful for other
reasons, but it sounds to me like at best only very weak protection
against corrupted filesystems.

As a similar example, browser makers are stuck both implementing SSL and
hardening their code against malicious content. Those address separate
problems.

> or we have to sandbox the filesystem parsing
> code completely (i.e. fuse).
>
> > So, for example, a screwed up on-disk directory structure shouldn't
> > result in creating a cycle in the dcache and then deadlocking.
>
> Therein lies the problem: how do you detect such structural defects
> without doing a full structure validation?

You can prevent cycles in a graph if you can prevent adding an edge
which would be part of a cycle.

For the dcache, it's d_splice_alias that does that (using d_ancestor).

(And I believe the main motivation for that was NFS, where you don't
need a filesystem cycle, just a server-side race that can briefly make
it look like there's one--an example of the changing filesystem problem
that you point out below.)
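
To make the "refuse the edge" idea concrete, here is a minimal user-space
sketch. It is not the kernel's code: `struct node`, `is_ancestor()` and
`attach()` are hypothetical stand-ins, and the only thing borrowed from the
dcache is the idea of walking parent pointers before linking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: only the parent link matters here. */
struct node {
	struct node *parent;	/* NULL for the root */
};

/* Walk up from 'n'; true if 'candidate' is a proper ancestor of 'n'. */
static bool is_ancestor(const struct node *candidate, const struct node *n)
{
	for (n = n->parent; n != NULL; n = n->parent)
		if (n == candidate)
			return true;
	return false;
}

/*
 * Attach 'child' under 'parent' unless that edge would close a cycle,
 * i.e. unless 'child' is 'parent' itself or already one of its ancestors.
 * Returns 0 on success, -1 if the edge was refused.
 */
static int attach(struct node *child, struct node *parent)
{
	assert(child != NULL && parent != NULL);

	if (child == parent || is_ancestor(child, parent))
		return -1;
	child->parent = parent;
	return 0;
}
```

The walk costs one step per level of depth, and it catches cycles of any
length, since it only asks whether the new child already sits above the new
parent.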

> e.g. cyclic links may
> only manifest when completely unrelated pieces of metadata are linked
> together in a specific way.
>
> Further, the problem is not restricted to validation at mount time -
> if the user can write to the filesystem image file, then they can
> modify it after it has been mounted, too. That means the attacker
> may be someone who has broken into a container, not necessarily the
> user you trusted with unprivileged mounts. That means every cold
> metadata read needs to be treated with suspicion, not just at mount
> time.

Yes. Agreed that this is difficult. (I can't actually give an example
of an existing problem of this sort, but I'd be surprised if none
exist.)

--b.

2015-07-22 15:56:44

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
> On 7/21/2015 1:35 PM, Seth Forshee wrote:
> > On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
> >>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >>>> I really don't see the benefit of making up extra rules that apply to
> >>>> users outside a userns who try to access specifically a filesystem
> >>>> with backing store. They wouldn't make sense for filesystems without
> >>>> backing store.
> >>> Sure it would. For Smack, it would be the label a file would be
> >>> created with, which would be the label of the process creating
> >>> the memory based filesystem. For SELinux the rules are a
> >>> touch more sophisticated, but I'm sure that Paul or Stephen could
> >>> come up with how to determine it.
> >>>
> >>> The point, looping all the way back to the beginning, where we
> >>> were talking about just ignoring the labels on the filesystem,
> >>> is that if you use the same Smack label on the files in the
> >>> filesystem as the backing store file has, we'll all be happy.
> >>> If that label isn't something the user can write to, he won't be
> >>> able to write to the mounted objects, either. If there is no
> >>> backing store then use the label of the process creating the
> >>> filesystem, which will be the user, which will mean everything
> >>> will work hunky dory.
> >>>
> >>> Yes, there's work involved, but I doubt there's a lot. Getting
> >>> the label from the backing store or the creating process is
> >>> simple enough.
> >>>
> > So something like the diff below (untested)?
>
> I think that this is close, and quite good for someone
> who isn't very familiar with Smack. It's definitely headed
> in the right direction.
>
> > All I'm really doing is setting smk_default as you describe above and
> > then using it instead of smk_of_current() in
> > smack_inode_alloc_security() and instead of the label from the disk in
> > smack_d_instantiate().
>
> Let's say your backing store is a file labeled Rubble.
>
> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>
> It is completely reasonable for a process labeled Flintstone to
> have rwxa access to a file labeled Rubble.
>
> Smack rule: Flintstone Rubble rwxa
>
> In the case of writing to an existing Rubble file, what you
> have looks fine. What's not so great is that if the Flintstone
> process creates a file, it should be labeled Flintstone. Your
> use of smk_default is going to violate the principle
> of least astonishment, and break the Smack policy as well.
>
> Let's make a minor change. Instead of using smackfsroot let's
> use smackfstransmute and a slightly different access rule:
>
> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>
> Smack rule: Flintstone Rubble rwxat
>
> Now the only change we have to make to the Smack code is
> that we don't want to create any files unless either the
> process is labeled Rubble or the rule allowing the creation
> has the "t" for transmute access. That should ensure that
> everything is labeled Rubble. If it isn't, someone has mucked
> with the metadata in a detectable way.

All right, that kind of makes sense, but I'm still missing some pieces.
Questions follow.
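
For reference, the creation rule proposed in the quoted text can be modeled
roughly as below. This is a toy sketch, not Smack's code:
`userns_create_label()` is a hypothetical name, labels are plain strings, and
the parent directory is assumed to already carry the mount's label with the
transmute attribute set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * 'task' is the label of the creating process, 'dir' the label of the
 * parent directory (Rubble in the example), and 'rule_has_t' says whether
 * the rule "task dir ..." grants transmute ("t") access.
 *
 * Returns the label the new file would get, or NULL if creation should be
 * refused because the file would not end up with the directory's label.
 */
static const char *userns_create_label(const char *task, const char *dir,
				       bool rule_has_t)
{
	assert(task != NULL && dir != NULL);

	if (strcmp(task, dir) == 0)
		return dir;	/* process label already matches */
	if (rule_has_t)
		return dir;	/* transmute: inherit the directory label */
	return NULL;		/* would introduce a foreign label: refuse */
}
```

With "Flintstone Rubble rwxa" a Flintstone process gets NULL (creation
refused); with "Flintstone Rubble rwxat" the new file comes out labeled
Rubble.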

> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 32f598db0b0d..4597420ab933 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> > __sb_start_write(sb, SB_FREEZE_FS, true);
> > }
> >
> > +static inline bool sb_in_userns(struct super_block *sb)
> > +{
> > + return sb->s_user_ns != &init_user_ns;
> > +}
> >
> > extern bool inode_owner_or_capable(const struct inode *inode);
> >
> > diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> > index a143328f75eb..591fd19294e7 100644
> > --- a/security/smack/smack_lsm.c
> > +++ b/security/smack/smack_lsm.c
> > @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> > char *buffer;
> > struct smack_known *skp = NULL;
> >
> > + /* Should never fetch xattrs from untrusted mounts */
> > + if (WARN_ON(sb_in_userns(ip->i_sb)))
> > + return ERR_PTR(-EPERM);
> > +
>
> Go ahead and fetch it, we'll check to make sure it's viable later.
>
> > if (ip->i_op->getxattr == NULL)
> > return ERR_PTR(-EOPNOTSUPP);
> >
> > @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> > */
> > if (specified)
> > return -EPERM;
> > +
> > /*
> > - * Unprivileged mounts get root and default from the caller.
> > + * User namespace mounts get root and default from the backing
> > + * store, if there is one. Other unprivileged mounts get them
> > + * from the caller.
> > */
> > - skp = smk_of_current();
> > + skp = (sb_in_userns(sb) && sb->s_bdev) ?
> > + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> > sp->smk_root = skp;
> > sp->smk_default = skp;
>
> sp->smk_flags |= SMK_INODE_TRANSMUTE;

I assume that you meant skp and not sp here.

>
> > }
> > @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> > */
> > static int smack_inode_alloc_security(struct inode *inode)
> > {
> > - struct smack_known *skp = smk_of_current();
> > + struct smack_known *skp;
> > +
> > + if (sb_in_userns(inode->i_sb))
> > + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> > + else
> > + skp = smk_of_current();
>
> This should be left alone.
> smack_inode_init_security is where you could disallow access that doesn't
> legitimately result in a Rubble label on the file. It's something like
>
> ... after the call may = smk_access_entry(...)
> if (sb_in_userns(inode->i_sb))
> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
> return -EACCES;

I'm not getting how this covers all cases.

So we've set the transmute flag on the root inode. Files and directories
created in the root directory get the same label, and directories also
get the transmute attribute. That's all fine.

What about an existing directory in the filesystem that already has a
Slate label? I'm not getting what happens with this directory, or for
new files created in this directory, which also relates to my other
questions below.

Also an aside - smk_access_entry looks weird. may is initialized to
-ENOENT, and then rule_list is searched for a rule which matches the
object and subject labels. Presumably it's possible that no rule could
be found, otherwise the prior initialization of may is pointless. If
this happens the following code treats it as though it always contains
access flags even though it might contain -ENOENT. Nothing bad actually
happens with a two's complement representation of -ENOENT since it will
just set a bit that's already set, but it still seems like it should
have a may > 0 condition, for clarity if for no other reason.
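
A small user-space check of that claim, as a sketch. The flag values are
assumptions copied from memory (MAY_WRITE from include/linux/fs.h, MAY_LOCK
from security/smack/smack.h) and should be checked against the tree; the two
helper functions are hypothetical and mirror only the tail of
smk_access_entry.

```c
#include <assert.h>
#include <errno.h>

/* Assumed values; verify against the kernel headers before relying on them. */
#define MAY_WRITE 0x00000002
#define MAY_LOCK  0x00002000

/* Current behavior: MAY_WRITE implies MAY_LOCK, applied unconditionally. */
static int apply_write_implies_lock(int may)
{
	if ((may & MAY_WRITE) == MAY_WRITE)
		may |= MAY_LOCK;
	return may;
}

/* Same logic with the suggested guard, so error values are skipped. */
static int apply_write_implies_lock_guarded(int may)
{
	if (may > 0 && (may & MAY_WRITE) == MAY_WRITE)
		may |= MAY_LOCK;
	return may;
}

/* -ENOENT is -2: every bit except bit 0 is set, so OR-ing in a flag is a no-op. */
static void enoent_passes_through(void)
{
	assert(apply_write_implies_lock(-ENOENT) == -ENOENT);
	assert(apply_write_implies_lock_guarded(-ENOENT) == -ENOENT);
}
```

Both versions return the same values for every input used here; the guard
only makes it explicit that an error code is never treated as a set of
access flags.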

>
> > inode->i_security = new_inode_smack(skp);
> > if (inode->i_security == NULL)
> > @@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
> > break;
> > }
> > /*
> > + * Don't use labels from xattrs for unprivileged mounts.
> > + */
> > + if (sb_in_userns(inode->i_sb))
> > + break;
> > + /*
>
> Again, use the label. Just check to make sure it's what you expect.

What happens if it's not what I expect? smack_d_instantiate cannot fail
... so just use the default label? In that case why bother reading it at
all? Or would we actually want to change the on-disk label if it didn't
match?

>
> > * No xattr support means, alas, no SMACK label.
> > * Use the aforeapplied default.
> > * It would be curious if the label of the task
>
> Also untested.
>

2015-07-22 16:03:22

by Stephen Smalley

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/16/2015 09:23 AM, Stephen Smalley wrote:
> On 07/15/2015 03:46 PM, Seth Forshee wrote:
>> Unprivileged users should not be able to supply security labels
>> in filesystems, nor should they be able to supply security
>> contexts in unprivileged mounts. For any mount where s_user_ns is
>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
>> and return EPERM if any contexts are supplied in the mount
>> options.
>>
>> Signed-off-by: Seth Forshee <[email protected]>
>
> I think this is obsoleted by the subsequent discussion, but just for the
> record: this patch would cause the files in the userns mount to be left
> with the "unlabeled" label, and therefore under typical policies,
> completely inaccessible to any process in a confined domain.

The right way to handle this for SELinux would be to automatically use
mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
specifying a context= mount option), with the sbsec->mntpoint_sid set
from some related object (e.g. the block device file context, as in your
patches for Smack). That will cause SELinux to use that value instead
of any xattr value from the filesystem and will cause attempts by
userspace to set the security.selinux xattr to fail on that filesystem.
That is how SELinux normally deals with untrusted filesystems, except
that it is normally specified as a mount option by a trusted mounting
process, whereas in your case you need to automatically set it.

>
>> ---
>> security/selinux/hooks.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>> index 459e71ddbc9d..eeb71e45ab82 100644
>> --- a/security/selinux/hooks.c
>> +++ b/security/selinux/hooks.c
>> @@ -732,6 +732,19 @@ static int selinux_set_mnt_opts(struct super_block *sb,
>> !strcmp(sb->s_type->name, "pstore"))
>> sbsec->flags |= SE_SBGENFS;
>>
>> + /*
>> + * If this is a user namespace mount, no contexts are allowed
>> + * on the command line and security labels must be ignored.
>> + */
>> + if (sb->s_user_ns != &init_user_ns) {
>> + if (context_sid || fscontext_sid || rootcontext_sid ||
>> + defcontext_sid)
>> + return -EPERM;
>> + sbsec->behavior = SECURITY_FS_USE_NONE;
>> + goto out_set_opts;
>> + }
>> +
>> +
>> if (!sbsec->behavior) {
>> /*
>> * Determine the labeling behavior to use for this
>> @@ -813,6 +826,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
>> sbsec->def_sid = defcontext_sid;
>> }
>>
>> +out_set_opts:
>> rc = sb_finish_set_opts(sb);
>> out:
>> mutex_unlock(&sbsec->lock);
>>
>

2015-07-22 16:14:28

by Seth Forshee

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
> > On 07/15/2015 03:46 PM, Seth Forshee wrote:
> >> Unprivileged users should not be able to supply security labels
> >> in filesystems, nor should they be able to supply security
> >> contexts in unprivileged mounts. For any mount where s_user_ns is
> >> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
> >> and return EPERM if any contexts are supplied in the mount
> >> options.
> >>
> >> Signed-off-by: Seth Forshee <[email protected]>
> >
> > I think this is obsoleted by the subsequent discussion, but just for the
> > record: this patch would cause the files in the userns mount to be left
> > with the "unlabeled" label, and therefore under typical policies,
> > completely inaccessible to any process in a confined domain.
>
> The right way to handle this for SELinux would be to automatically use
> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
> specifying a context= mount option), with the sbsec->mntpoint_sid set
> from some related object (e.g. the block device file context, as in your
> patches for Smack). That will cause SELinux to use that value instead
> of any xattr value from the filesystem and will cause attempts by
> userspace to set the security.selinux xattr to fail on that filesystem.
> That is how SELinux normally deals with untrusted filesystems, except
> that it is normally specified as a mount option by a trusted mounting
> process, whereas in your case you need to automatically set it.

Excellent, thank you for the advice. I'll start on this when I've
finished with Smack.

Seth

2015-07-22 16:53:11

by Austin S Hemmelgarn

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 2015-07-22 10:09, J. Bruce Fields wrote:
> On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
>> On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
>>> On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
>>> So, for example, a screwed up on-disk directory structure shouldn't
>>> result in creating a cycle in the dcache and then deadlocking.
>>
>> Therein lies the problem: how do you detect such structural defects
>> without doing a full structure validation?
>
> You can prevent cycles in a graph if you can prevent adding an edge
> which would be part of a cycle.
>
Except if the user can write to the filesystem's backing storage (be it
a device or a file), and has sufficient knowledge of the on-disk
structures, they can create all the cycles they want in the metadata.
So unless the kernel builds the graph internally by parsing the metadata
_and_ has some way to detect that the on-disk metadata has hit a cycle
(which may not just involve 2 items), then you still have the potential
for a DoS attack.

Trust me, I've done this before (quite a while back, when I was just
starting out with programming on Linux) with hard-link cycles in an ext4
filesystem in a virtual machine, just to see what would happen. IIRC,
something deadlocked, though I can't remember whether it was fsck or
trying to access the file once the FS was mounted. In fact, I think I may
try this again just to see if anything has changed.



2015-07-22 17:41:05

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> On 2015-07-22 10:09, J. Bruce Fields wrote:
> >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> >>>So, for example, a screwed up on-disk directory structure shouldn't
> >>>result in creating a cycle in the dcache and then deadlocking.
> >>
> >>Therein lies the problem: how do you detect such structural defects
> >>without doing a full structure validation?
> >
> >You can prevent cycles in a graph if you can prevent adding an edge
> >which would be part of a cycle.
> >
> Except if the user can write to the filesystem's backing storage (be
> it a device or a file), and has sufficient knowledge of the on-disk
> structures, they can create all the cycles they want in the
> metadata. So unless the kernel builds the graph internally by
> parsing the metadata _and_ has some way to detect that the on-disk
> metadata has hit a cycle (which may not just involve 2 items),

Understood. Again, see the d_ancestor call in d_splice_alias, this is
exactly what it checks for.

> then
> you still have the potential for a DoS attack.

> Trust me, I've done this before (quite a while back, when I was just
> starting out with programming on Linux) with hard-link cycles in an
> ext4 filesystem in a virtual machine, just to see what would happen.
> IIRC, something deadlocked, though I can't remember whether it was fsck
> or trying to access the file once the FS was mounted. In fact, I think
> I may try this again just to see if anything has changed.

I've also seen bugs caused by loops in corrupted ext4 filesystems. As
far as I know, they're fixed as of 95ad5c291313b.

(I mentioned the example of dcache loops because it's something I
happened to run across before. I'm sure there are any number of cases
where we need similar checking to keep internal data structures
consistent in the face of unexpected filesystem content.)

--b.

2015-07-22 18:10:52

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/22/2015 8:56 AM, Seth Forshee wrote:
> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>> with backing store. They wouldn't make sense for filesystems without
>>>>>> backing store.
>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>> created with, which would be the label of the process creating
>>>>> the memory based filesystem. For SELinux the rules are a
>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>> come up with how to determine it.
>>>>>
>>>>> The point, looping all the way back to the beginning, where we
>>>>> were talking about just ignoring the labels on the filesystem,
>>>>> is that if you use the same Smack label on the files in the
>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>> If that label isn't something the user can write to, he won't be
>>>>> able to write to the mounted objects, either. If there is no
>>>>> backing store then use the label of the process creating the
>>>>> filesystem, which will be the user, which will mean everything
>>>>> will work hunky dory.
>>>>>
>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>> the label from the backing store or the creating process is
>>>>> simple enough.
>>>>>
>>> So something like the diff below (untested)?
>> I think that this is close, and quite good for someone
>> who isn't very familiar with Smack. It's definitely headed
>> in the right direction.
>>
>>> All I'm really doing is setting smk_default as you describe above and
>>> then using it instead of smk_of_current() in
>>> smack_inode_alloc_security() and instead of the label from the disk in
>>> smack_d_instantiate().
>> Let's say your backing store is a file labeled Rubble.
>>
>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>
>> It is completely reasonable for a process labeled Flintstone to
>> have rwxa access to a file labeled Rubble.
>>
>> Smack rule: Flintstone Rubble rwxa
>>
>> In the case of writing to an existing Rubble file, what you
>> have looks fine. What's not so great is that if the Flintstone
>> process creates a file, it should be labeled Flintstone. Your
>> use of smk_default is going to violate the principle
>> of least astonishment, and break the Smack policy as well.
>>
>> Let's make a minor change. Instead of using smackfsroot let's
>> use smackfstransmute and a slightly different access rule:
>>
>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>
>> Smack rule: Flintstone Rubble rwxat
>>
>> Now the only change we have to make to the Smack code is
>> that we don't want to create any files unless either the
>> process is labeled Rubble or the rule allowing the creation
>> has the "t" for transmute access. That should ensure that
>> everything is labeled Rubble. If it isn't, someone has mucked
>> with the metadata in a detectable way.
> All right, that kind of makes sense, but I'm still missing some pieces.
> Questions follow.
>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 32f598db0b0d..4597420ab933 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>> __sb_start_write(sb, SB_FREEZE_FS, true);
>>> }
>>>
>>> +static inline bool sb_in_userns(struct super_block *sb)
>>> +{
>>> + return sb->s_user_ns != &init_user_ns;
>>> +}
>>>
>>> extern bool inode_owner_or_capable(const struct inode *inode);
>>>
>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>> index a143328f75eb..591fd19294e7 100644
>>> --- a/security/smack/smack_lsm.c
>>> +++ b/security/smack/smack_lsm.c
>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>> char *buffer;
>>> struct smack_known *skp = NULL;
>>>
>>> + /* Should never fetch xattrs from untrusted mounts */
>>> + if (WARN_ON(sb_in_userns(ip->i_sb)))
>>> + return ERR_PTR(-EPERM);
>>> +
>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>
>>> if (ip->i_op->getxattr == NULL)
>>> return ERR_PTR(-EOPNOTSUPP);
>>>
>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>> */
>>> if (specified)
>>> return -EPERM;
>>> +
>>> /*
>>> - * Unprivileged mounts get root and default from the caller.
>>> + * User namespace mounts get root and default from the backing
>>> + * store, if there is one. Other unprivileged mounts get them
>>> + * from the caller.
>>> */
>>> - skp = smk_of_current();
>>> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>> sp->smk_root = skp;
>>> sp->smk_default = skp;
>> sp->smk_flags |= SMK_INODE_TRANSMUTE;
> I assume that you meant skp and not sp here.

Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
in the smk_flags field of the root inode. That's easy:

transmute = 1;

and the code after "Initialize the root inode" will take care of it.


>>> }
>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>> */
>>> static int smack_inode_alloc_security(struct inode *inode)
>>> {
>>> - struct smack_known *skp = smk_of_current();
>>> + struct smack_known *skp;
>>> +
>>> + if (sb_in_userns(inode->i_sb))
>>> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>> + else
>>> + skp = smk_of_current();
>> This should be left alone.
>> smack_inode_init_security is where you could disallow access that doesn't
>> legitimately result in a Rubble label on the file. It's something like
>>
>> ... after the call may = smk_access_entry(...)
>> if (sb_in_userns(inode->i_sb))
>> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>> return -EACCES;
> I'm not getting how this covers all cases.
>
> So we've set the transmute flag on the root inode. Files and directories
> created in the root directory get the same label, and directories also
> get the transmute attribute. That's all fine.
>
> What about an existing directory in the filesystem that already has a
> Slate label? I'm not getting what happens with this directory, or for
> new files created in this directory, which also relates to my other
> questions below.
>
> Also an aside - smk_access_entry looks weird. may is initialized to
> -ENOENT, and then rule_list is searched for a rule which matches the
> object and subject labels. Presumably it's possible that no rule could
> be found, otherwise the prior initialization of may is pointless. If
> this happens the following code treats it as though it always contains
> access flags even though it might contain -ENOENT. Nothing bad actually
> happens with a two's complement representation of -ENOENT since it will
> just set a bit that's already set, but it still seems like it should
> have a may > 0 condition, for clarity if for no other reason.

My suggested code is just wrong. I wasn't looking at the whole code,
only the patch, and got myself confused. Apologies.

If we want to go straight for the jugular how about this? I'm assuming
that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.

static int smack_inode_permission(struct inode *inode, int mask)
{
struct smk_audit_info ad;
int no_block = mask & MAY_NOT_BLOCK;
int rc;

mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
/*
* No permission to check. Existence test. Yup, it's there.
*/
if (mask == 0)
return 0;

+ if (sb_in_userns(inode->i_sb) &&
+ smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
+ return -EACCES;
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
smk_ad_setfield_u_fs_inode(&ad, inode);
rc = smk_curacc(smk_of_inode(inode), mask, &ad);
rc = smk_bu_inode(inode, mask, rc);
return rc;
}
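
One caveat with the added check: s_bdev is NULL for filesystems that have no
block device, so the dereference needs a guard before this can run on such
mounts. A user-space sketch of the comparison with that guard follows. The
structures and `userns_label_check()` are hypothetical miniatures, not Smack
types, and falling back to the superblock default label when there is no
backing device is an assumption borrowed from the earlier discussion of
smk_default.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature versions of the kernel structures involved. */
struct label { const char *name; };
struct inode_model { const struct label *label; };
struct smk_sb_model {
	bool in_userns;
	const struct inode_model *backing;	/* NULL: no block device */
	const struct label *smk_default;	/* fallback label for the mount */
};

/*
 * Deny access to any inode on a userns mount whose label differs from the
 * label expected for that mount. Labels are compared by pointer, the way
 * Smack compares its known-label entries.
 */
static int userns_label_check(const struct smk_sb_model *sb,
			      const struct inode_model *inode)
{
	const struct label *expected;

	assert(sb != NULL && inode != NULL);

	if (!sb->in_userns)
		return 0;
	expected = sb->backing ? sb->backing->label : sb->smk_default;
	return inode->label == expected ? 0 : -EACCES;
}
```

An inode that turns up with a Slate label on a Rubble-backed mount gets
-EACCES here regardless of what the rule list says, which matches the intent
of the check above.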


>
>>> inode->i_security = new_inode_smack(skp);
>>> if (inode->i_security == NULL)
>>> @@ -3175,6 +3188,11 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
>>> break;
>>> }
>>> /*
>>> + * Don't use labels from xattrs for unprivileged mounts.
>>> + */
>>> + if (sb_in_userns(inode->i_sb))
>>> + break;
>>> + /*
>> Again, use the label. Just check to make sure it's what you expect.
> What happens if it's not what I expect? smack_d_instantiate cannot fail
> ... so just use the default label? In that case why bother reading it at
> all? Or would we actually want to change the on-disk label if it didn't
> match?
>
>>> * No xattr support means, alas, no SMACK label.
>>> * Use the aforeapplied default.
>>> * It would be curious if the label of the task
>> Also untested.
>>

2015-07-22 19:32:30

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
> On 7/22/2015 8:56 AM, Seth Forshee wrote:
> > On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
> >> On 7/21/2015 1:35 PM, Seth Forshee wrote:
> >>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> >>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
> >>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >>>>>> I really don't see the benefit of making up extra rules that apply to
> >>>>>> users outside a userns who try to access specifically a filesystem
> >>>>>> with backing store. They wouldn't make sense for filesystems without
> >>>>>> backing store.
> >>>>> Sure it would. For Smack, it would be the label a file would be
> >>>>> created with, which would be the label of the process creating
> >>>>> the memory based filesystem. For SELinux the rules are a
> >>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
> >>>>> come up with how to determine it.
> >>>>>
> >>>>> The point, looping all the way back to the beginning, where we
> >>>>> were talking about just ignoring the labels on the filesystem,
> >>>>> is that if you use the same Smack label on the files in the
> >>>>> filesystem as the backing store file has, we'll all be happy.
> >>>>> If that label isn't something the user can write to, he won't be
> >>>>> able to write to the mounted objects, either. If there is no
> >>>>> backing store then use the label of the process creating the
> >>>>> filesystem, which will be the user, which will mean everything
> >>>>> will work hunky dory.
> >>>>>
> >>>>> Yes, there's work involved, but I doubt there's a lot. Getting
> >>>>> the label from the backing store or the creating process is
> >>>>> simple enough.
> >>>>>
> >>> So something like the diff below (untested)?
> >> I think that this is close, and quite good for someone
> >> who isn't very familiar with Smack. It's definitely headed
> >> in the right direction.
> >>
> >>> All I'm really doing is setting smk_default as you describe above and
> >>> then using it instead of smk_of_current() in
> >>> smack_inode_alloc_security() and instead of the label from the disk in
> >>> smack_d_instantiate().
> >> Let's say your backing store is a file labeled Rubble.
> >>
> >> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
> >>
> >> It is completely reasonable for a process labeled Flintstone to
> >> have rwxa access to a file labeled Rubble.
> >>
> >> Smack rule: Flintstone Rubble rwxa
> >>
> >> In the case of writing to an existing Rubble file, what you
> >> have looks fine. What's not so great is that if the Flintstone
> >> process creates a file, it should be labeled Flintstone. Your
> >> use of the smk_default, which is going to violate the principle
> >> of least astonishment, and break the Smack policy as well.
> >>
> >> Let's make a minor change. Instead of using smackfsroot let's
> >> use smackfstransmute and a slightly different access rule:
> >>
> >> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
> >>
> >> Smack rule: Flintstone Rubble rwxat
> >>
> >> Now the only change we have to make to the Smack code is
> >> that we don't want to create any files unless either the
> >> process is labeled Rubble or the rule allowing the creation
> >> has the "t" for transmute access. That should ensure that
> >> everything is labeled Rubble. If it isn't, someone has mucked
> >> with the metadata in a detectable way.
> > All right, that kind of makes sense, but I'm still missing some pieces.
> > Questions follow.
> >
> >>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>> index 32f598db0b0d..4597420ab933 100644
> >>> --- a/include/linux/fs.h
> >>> +++ b/include/linux/fs.h
> >>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> >>> __sb_start_write(sb, SB_FREEZE_FS, true);
> >>> }
> >>>
> >>> +static inline bool sb_in_userns(struct super_block *sb)
> >>> +{
> >>> + return sb->s_user_ns != &init_user_ns;
> >>> +}
> >>>
> >>> extern bool inode_owner_or_capable(const struct inode *inode);
> >>>
> >>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> >>> index a143328f75eb..591fd19294e7 100644
> >>> --- a/security/smack/smack_lsm.c
> >>> +++ b/security/smack/smack_lsm.c
> >>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> >>> char *buffer;
> >>> struct smack_known *skp = NULL;
> >>>
> >>> + /* Should never fetch xattrs from untrusted mounts */
> >>> + if (WARN_ON(sb_in_userns(ip->i_sb)))
> >>> + return ERR_PTR(-EPERM);
> >>> +
> >> Go ahead and fetch it, we'll check to make sure it's viable later.
> >>
> >>> if (ip->i_op->getxattr == NULL)
> >>> return ERR_PTR(-EOPNOTSUPP);
> >>>
> >>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> >>> */
> >>> if (specified)
> >>> return -EPERM;
> >>> +
> >>> /*
> >>> - * Unprivileged mounts get root and default from the caller.
> >>> + * User namespace mounts get root and default from the backing
> >>> + * store, if there is one. Other unprivileged mounts get them
> >>> + * from the caller.
> >>> */
> >>> - skp = smk_of_current();
> >>> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
> >>> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> >>> sp->smk_root = skp;
> >>> sp->smk_default = skp;
> >> sp->smk_flags |= SMK_INODE_TRANSMUTE;
> > I assume that you meant skp and not sp here.
>
> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
> in the smk_flags field of the root inode. That's easy:
>
> transmute = 1;
>
> and the code after "Initialize the root inode" will take care of it.

Yeah, that's what I've actually done.

> >>> }
> >>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> >>> */
> >>> static int smack_inode_alloc_security(struct inode *inode)
> >>> {
> >>> - struct smack_known *skp = smk_of_current();
> >>> + struct smack_known *skp;
> >>> +
> >>> + if (sb_in_userns(inode->i_sb))
> >>> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> >>> + else
> >>> + skp = smk_of_current();
> >> This should be left alone.
> >> smack_inode_init_security is where you could disallow access that doesn't
> >> legitimately result in a Rubble label on the file. It's something like
> >>
> >> ... after the call may = smk_access_entry(...)
> >> if (sb_in_userns(inode->i_sb))
> >> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
> >> return -EACCES;
> > I'm not getting how this covers all cases.
> >
> > So we've set the transmute flag on the root inode. Files and directories
> > created in the root directory get the same label, and directories also
> > get the transmute attribute. That's all fine.
> >
> > What about an existing directory in the filesystem that already has a
> > Slate label? I'm not getting what happens with this directory, or for
> > new files created in this directory, which also relates to my other
> > questions below.
> >
> > Also an aside - smk_access_entry looks weird. may is initialized to
> > -ENOENT, and then rule_list is searched for a rule which matches the
> > object and subject labels. Presumably it's possible that no rule could
> > be found, otherwise the prior initialization of may is pointless. If
> > this happens the following code treats it as though it always contains
> > access flags even though it might contain -ENOENT. Nothing bad actually
> > happens with a two's complement representation of -ENOENT since it will
> > just set a bit that's already set, but it still seems like it should
> > have a may > 0 condition, for clarity if for no other reason.
>
> My suggested code is just wrong. I wasn't looking at the whole code,
> only the patch, and got myself confused. Apologies.
>
> If we want to go straight for the jugular how about this? I'm assuming
> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.

Yes.

> static int smack_inode_permission(struct inode *inode, int mask)
> {
> struct smk_audit_info ad;
> int no_block = mask & MAY_NOT_BLOCK;
> int rc;
>
> mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
> /*
> * No permission to check. Existence test. Yup, it's there.
> */
> if (mask == 0)
> return 0;
>
> + if (sb_in_userns(inode->i_sb) &&
> + smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
> + return -EACCES;
> +
> /* May be droppable after audit */
> if (no_block)
> return -ECHILD;
> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
> smk_ad_setfield_u_fs_inode(&ad, inode);
> rc = smk_curacc(smk_of_inode(inode), mask, &ad);
> rc = smk_bu_inode(inode, mask, rc);
> return rc;
> }

Hmm, okay. I think I've been a little confused all this time about how
you want to handle these unprivileged mounts.

Originally I thought you wanted all objects in the filesystem to get the
same label as the backing store. That's what I tried to implement
originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
assign every object (new and existing) smk_default and completely ignore
the labels on disk.

This is what I currently think you want for user ns mounts:

1. smk_root and smk_default are assigned the label of the backing
device.
2. s_root is assigned the transmute property.
3. For existing files:
a. Files with the same label as the backing device are accessible.
b. Files with any other label are not accessible.
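
To make rule 3 concrete, here is a rough userspace model of the check, not
kernel code. The model_* structs and the function name are made up for
illustration and only stand in for the superblock and inode state discussed
above; what to do without a backing store is left to the normal rules here,
which is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for kernel objects; not the real Smack structures. */
struct model_sb {
	bool in_userns;            /* models sb->s_user_ns != &init_user_ns */
	const char *backing_label; /* label of the backing device, NULL if none */
};

struct model_inode {
	const struct model_sb *sb;
	const char *label;         /* label read from the filesystem */
};

/*
 * Rule 3: on a user ns mount, only files carrying the backing device's
 * label pass this pre-check. Ordinary Smack access rules still apply
 * afterwards; this function only models the extra filter.
 */
static bool model_label_acceptable(const struct model_inode *inode)
{
	assert(inode != NULL && inode->sb != NULL && inode->label != NULL);

	if (!inode->sb->in_userns)
		return true;	/* ordinary mount: no extra filter */
	if (inode->sb->backing_label == NULL)
		return true;	/* no backing store: left to the normal rules */
	return strcmp(inode->label, inode->sb->backing_label) == 0;
}
```

With a backing device labeled Rubble, a Rubble file passes and a Slate file
is rejected, while the same Slate file on an init_user_ns mount is untouched
by this filter.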

If this is right, there are a couple lingering questions in my mind.

First, what happens with files created in directories with the same
label as the backing device but without the transmute property set? The
inode for the new file will initially be labeled with smk_of_current(),
but then during d_instantiate it will get smk_default and thus end up
with the label we want. So that seems okay.

The second is whether files with the SMACK64EXEC attribute are still a
problem. It seems they are, for files with the same label as the backing
store at least. I think we can simply skip the code that reads out this
xattr and sets smk_task for user ns mounts, or else skip assigning the
label to the new task in bprm_set_creds. The latter seems more
consistent with the approach you've suggested for dealing with labels
from disk.

So I guess all of that seems okay, though perhaps a bit restrictive
given that the user who mounted the filesystem already has full access
to the backing store.

Please let me know whether or not this matches up with what you are
thinking, then I can proceed with the implementation.

Thanks,
Seth

2015-07-22 20:26:46

by Stephen Smalley

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/22/2015 12:14 PM, Seth Forshee wrote:
> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
>>>> Unprivileged users should not be able to supply security labels
>>>> in filesystems, nor should they be able to supply security
>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
>>>> and return EPERM if any contexts are supplied in the mount
>>>> options.
>>>>
>>>> Signed-off-by: Seth Forshee <[email protected]>
>>>
>>> I think this is obsoleted by the subsequent discussion, but just for the
>>> record: this patch would cause the files in the userns mount to be left
>>> with the "unlabeled" label, and therefore under typical policies,
>>> completely inaccessible to any process in a confined domain.
>>
>> The right way to handle this for SELinux would be to automatically use
>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
>> specifying a context= mount option), with the sbsec->mntpoint_sid set
>> from some related object (e.g. the block device file context, as in your
>> patches for Smack). That will cause SELinux to use that value instead
>> of any xattr value from the filesystem and will cause attempts by
>> userspace to set the security.selinux xattr to fail on that filesystem.
>> That is how SELinux normally deals with untrusted filesystems, except
>> that it is normally specified as a mount option by a trusted mounting
>> process, whereas in your case you need to automatically set it.
>
> Excellent, thank you for the advice. I'll start on this when I've
> finished with Smack.

Not tested, but something like this should work. Note that it should
come after the call to security_fs_use() so we know whether SELinux
would even try to use xattrs supplied by the filesystem in the first place.

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 564079c..84da3a2 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
goto out;
}
}
+
+ /*
+ * If this is a user namespace mount, no contexts are allowed
+ * on the command line and security labels must be ignored.
+ */
+ if (sb->s_user_ns != &init_user_ns) {
+ if (context_sid || fscontext_sid || rootcontext_sid ||
+ defcontext_sid) {
+ rc = -EACCES;
+ goto out;
+ }
+ if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
+ struct block_device *bdev = sb->s_bdev;
+ sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
+ if (bdev) {
+ struct inode_security_struct *isec =
bdev->bd_inode;
+ sbsec->mntpoint_sid = isec->sid;
+ } else {
+ sbsec->mntpoint_sid = current_sid();
+ }
+ }
+ goto out_set_opts;
+ }
+
/* sets the context of the superblock for the fs being mounted. */
if (fscontext_sid) {
rc = may_context_mount_sb_relabel(fscontext_sid, sbsec,
cred);
@@ -813,6 +837,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
sbsec->def_sid = defcontext_sid;
}

+out_set_opts:
rc = sb_finish_set_opts(sb);
out:
mutex_unlock(&sbsec->lock);
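
For anyone who wants to see the decision logic of the hunk above in
isolation, here is a small userspace model of it. It is only a sketch: the
model_* types, the enum values and the function name are invented for the
example and are not the SELinux structures, and the four context options are
collapsed into one flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the labeling behaviors discussed above. */
enum model_behavior {
	MODEL_FS_USE_XATTR,
	MODEL_FS_USE_MNTPOINT,
	MODEL_FS_USE_OTHER,
};

struct model_sbsec {
	enum model_behavior behavior;
	uint32_t mntpoint_sid;
};

struct model_mount {
	bool in_userns;        /* models sb->s_user_ns != &init_user_ns */
	bool has_context_opts; /* any context option given at mount time */
	bool has_bdev;         /* models sb->s_bdev != NULL */
	uint32_t bdev_sid;     /* SID of the block device inode */
	uint32_t current_sid;  /* SID of the mounting task */
};

/*
 * Returns 0 on success or -EACCES when a user ns mount supplies contexts.
 * xattr labeling is replaced by mountpoint labeling, taking the SID from
 * the backing device when there is one and from the mounter otherwise.
 */
static int model_set_mnt_opts(struct model_sbsec *sbsec,
			      const struct model_mount *m)
{
	assert(sbsec != NULL && m != NULL);

	if (!m->in_userns)
		return 0;	/* the normal option handling would run here */
	if (m->has_context_opts)
		return -EACCES;
	if (sbsec->behavior == MODEL_FS_USE_XATTR) {
		sbsec->behavior = MODEL_FS_USE_MNTPOINT;
		sbsec->mntpoint_sid = m->has_bdev ? m->bdev_sid
						  : m->current_sid;
	}
	return 0;
}
```

Filesystems whose behavior is not xattr-based are left as they are, which
is why the check has to come after security_fs_use() has picked the
behavior.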

2015-07-22 20:42:04

by Stephen Smalley

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/22/2015 04:25 PM, Stephen Smalley wrote:
> On 07/22/2015 12:14 PM, Seth Forshee wrote:
>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
>>>>> Unprivileged users should not be able to supply security labels
>>>>> in filesystems, nor should they be able to supply security
>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
>>>>> and return EPERM if any contexts are supplied in the mount
>>>>> options.
>>>>>
>>>>> Signed-off-by: Seth Forshee <[email protected]>
>>>>
>>>> I think this is obsoleted by the subsequent discussion, but just for the
>>>> record: this patch would cause the files in the userns mount to be left
>>>> with the "unlabeled" label, and therefore under typical policies,
>>>> completely inaccessible to any process in a confined domain.
>>>
>>> The right way to handle this for SELinux would be to automatically use
>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
>>> from some related object (e.g. the block device file context, as in your
>>> patches for Smack). That will cause SELinux to use that value instead
>>> of any xattr value from the filesystem and will cause attempts by
>>> userspace to set the security.selinux xattr to fail on that filesystem.
>>> That is how SELinux normally deals with untrusted filesystems, except
>>> that it is normally specified as a mount option by a trusted mounting
>>> process, whereas in your case you need to automatically set it.
>>
>> Excellent, thank you for the advice. I'll start on this when I've
>> finished with Smack.
>
> Not tested, but something like this should work. Note that it should
> come after the call to security_fs_use() so we know whether SELinux
> would even try to use xattrs supplied by the filesystem in the first place.
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 564079c..84da3a2 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> goto out;
> }
> }
> +
> + /*
> + * If this is a user namespace mount, no contexts are allowed
> + * on the command line and security labels must be ignored.
> + */
> + if (sb->s_user_ns != &init_user_ns) {
> + if (context_sid || fscontext_sid || rootcontext_sid ||
> + defcontext_sid) {
> + rc = -EACCES;
> + goto out;
> + }
> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
> + struct block_device *bdev = sb->s_bdev;
> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
> + if (bdev) {
> + struct inode_security_struct *isec =
> bdev->bd_inode;

That should be bdev->bd_inode->i_security.

> + sbsec->mntpoint_sid = isec->sid;
> + } else {
> + sbsec->mntpoint_sid = current_sid();
> + }
> + }
> + goto out_set_opts;
> + }
> +
> /* sets the context of the superblock for the fs being mounted. */
> if (fscontext_sid) {
> rc = may_context_mount_sb_relabel(fscontext_sid, sbsec,
> cred);
> @@ -813,6 +837,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> sbsec->def_sid = defcontext_sid;
> }
>
> +out_set_opts:
> rc = sb_finish_set_opts(sb);
> out:
> mutex_unlock(&sbsec->lock);
>

2015-07-23 00:05:06

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/22/2015 12:32 PM, Seth Forshee wrote:
> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>> with backing store. They wouldn't make sense for filesystems without
>>>>>>>> backing store.
>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>> created with, which would be the label of the process creating
>>>>>>> the memory based filesystem. For SELinux the rules are a
>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>> come up with how to determine it.
>>>>>>>
>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>> If that label isn't something user can write to, he won't be
>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>> backing store then use the label of the process creating the
>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>> will work hunky dory.
>>>>>>>
>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>> the label from the backing store or the creating process is
>>>>>>> simple enough.
>>>>>>>
>>>>> So something like the diff below (untested)?
>>>> I think that this is close, and quite good for someone
>>>> who isn't very familiar with Smack. It's definitely headed
>>>> in the right direction.
>>>>
>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>> then using it instead of smk_of_current() in
>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>> smack_d_instantiate().
>>>> Let's say your backing store is a file labeled Rubble.
>>>>
>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>
>>>> It is completely reasonable for a process labeled Flintstone to
>>>> have rwxa access to a file labeled Rubble.
>>>>
>>>> Smack rule: Flintstone Rubble rwxa
>>>>
>>>> In the case of writing to an existing Rubble file, what you
>>>> have looks fine. What's not so great is that if the Flintstone
>>>> process creates a file, it should be labeled Flintstone. Your
>>>> use of the smk_default, which is going to violate the principle
>>>> of least astonishment, and break the Smack policy as well.
>>>>
>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>> use smackfstransmute and a slightly different access rule:
>>>>
>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>
>>>> Smack rule: Flintstone Rubble rwxat
>>>>
>>>> Now the only change we have to make to the Smack code is
>>>> that we don't want to create any files unless either the
>>>> process is labeled Rubble or the rule allowing the creation
>>>> has the "t" for transmute access. That should ensure that
>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>> with the metadata in a detectable way.
>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>> Questions follow.
>>>
>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>> --- a/include/linux/fs.h
>>>>> +++ b/include/linux/fs.h
>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>>> __sb_start_write(sb, SB_FREEZE_FS, true);
>>>>> }
>>>>>
>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>> +{
>>>>> + return sb->s_user_ns != &init_user_ns;
>>>>> +}
>>>>>
>>>>> extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>
>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>> index a143328f75eb..591fd19294e7 100644
>>>>> --- a/security/smack/smack_lsm.c
>>>>> +++ b/security/smack/smack_lsm.c
>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>>> char *buffer;
>>>>> struct smack_known *skp = NULL;
>>>>>
>>>>> + /* Should never fetch xattrs from untrusted mounts */
>>>>> + if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>> + return ERR_PTR(-EPERM);
>>>>> +
>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>>>
>>>>> if (ip->i_op->getxattr == NULL)
>>>>> return ERR_PTR(-EOPNOTSUPP);
>>>>>
>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>>> */
>>>>> if (specified)
>>>>> return -EPERM;
>>>>> +
>>>>> /*
>>>>> - * Unprivileged mounts get root and default from the caller.
>>>>> + * User namespace mounts get root and default from the backing
>>>>> + * store, if there is one. Other unprivileged mounts get them
>>>>> + * from the caller.
>>>>> */
>>>>> - skp = smk_of_current();
>>>>> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>>>> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>>> sp->smk_root = skp;
>>>>> sp->smk_default = skp;
>>>> sp->smk_flags |= SMK_INODE_TRANSMUTE;
>>> I assume that you meant skp and not sp here.
>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
>> in the smk_flags field of the root inode. That's easy:
>>
>> transmute = 1;
>>
>> and the code after "Initialize the root inode" will take care of it.
> Yeah, that's what I've actually done.
>
>>>>> }
>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>>> */
>>>>> static int smack_inode_alloc_security(struct inode *inode)
>>>>> {
>>>>> - struct smack_known *skp = smk_of_current();
>>>>> + struct smack_known *skp;
>>>>> +
>>>>> + if (sb_in_userns(inode->i_sb))
>>>>> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>>>> + else
>>>>> + skp = smk_of_current();
>>>> This should be left alone.
>>>> smack_inode_init_security is where you could disallow access that doesn't
>>>> legitimately result in a Rubble label on the file. It's something like
>>>>
>>>> ... after the call may = smk_access_entry(...)
>>>> if (sb_in_userns(inode->i_sb))
>>>> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>>>> return -EACCES;
>>> I'm not getting how this covers all cases.
>>>
>>> So we've set the transmute flag on the root inode. Files and directories
>>> created in the root directory get the same label, and directories also
>>> get the transmute attribute. That's all fine.
>>>
>>> What about an existing directory in the filesystem that already has a
>>> Slate label? I'm not getting what happens with this directory, or for
>>> new files created in this directory, which also relates to my other
>>> questions below.
>>>
>>> Also an aside - smk_access_entry looks weird. may is initialized to
>>> -ENOENT, and then rule_list is searched for a rule which matches the
>>> object and subject labels. Presumably it's possible that no rule could
>>> be found, otherwise the prior initialization of may is pointless. If
>>> this happens the following code treats it as though it always contains
>>> access flags even though it might contain -ENOENT. Nothing bad actually
>>> happens with a two's complement representation of -ENOENT since it will
>>> just set a bit that's already set, but it still seems like it should
>>> have a may > 0 condition, for clarity if for no other reason.
>> My suggested code is just wrong. I wasn't looking at the whole code,
>> only the patch, and got myself confused. Apologies.
>>
>> If we want to go straight for the jugular how about this? I'm assuming
>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
> Yes.
>
>> static int smack_inode_permission(struct inode *inode, int mask)
>> {
>> struct smk_audit_info ad;
>> int no_block = mask & MAY_NOT_BLOCK;
>> int rc;
>>
>> mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
>> /*
>> * No permission to check. Existence test. Yup, it's there.
>> */
>> if (mask == 0)
>> return 0;
>>
>> + if (sb_in_userns(inode->i_sb) &&
>> + smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
>> + return -EACCES;
>> +
>> /* May be droppable after audit */
>> if (no_block)
>> return -ECHILD;
>> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
>> smk_ad_setfield_u_fs_inode(&ad, inode);
>> rc = smk_curacc(smk_of_inode(inode), mask, &ad);
>> rc = smk_bu_inode(inode, mask, rc);
>> return rc;
>> }
> Hmm, okay. I think I've been a little confused all this time about how
> you want to handle these unprivileged mounts.

Not your problem. I'm not the most consistent of reviewers.

> Originally I thought you wanted all objects in the filesystem to get the
> same label as the backing store. That's what I tried to implement
> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
> assign every object (new and existing) smk_default and completely ignore
> the labels on disk.

I want everything to have the label of the backing store, but
I don't want to ignore it if it somehow got something else. Because
the only legitimate label for this example is Rubble, I want to
reject anything else that appears. If someone builds a filesystem
by hand with Slate labels I want it treated "safely".

> This is what I currently think you want for user ns mounts:
>
> 1. smk_root and smk_default are assigned the label of the backing
> device.
> 2. s_root is assigned the transmute property.
> 3. For existing files:
> a. Files with the same label as the backing device are accessible.
> b. Files with any other label are not accessible.

That's right. Accept correct data, reject anything that's not right.

> If this is right, there are a couple lingering questions in my mind.
>
> First, what happens with files created in directories with the same
> label as the backing device but without the transmute property set? The
> inode for the new file will initially be labeled with smk_of_current(),
> but then during d_instantiate it will get smk_default and thus end up
> with the label we want. So that seems okay.

Yes.

> The second is whether files with the SMACK64EXEC attribute is still a
> problem. It seems it is, for files with the same label as the backing
> store at least. I think we can simply skip the code that reads out this
> xattr and sets smk_task for user ns mounts, or else skip assigning the
> label to the new task in bprm_set_creds. The latter seems more
> consistent with the approach you've suggested for dealing with labels
> from disk.

Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
smack_d_instantiate for unprivileged mounts would do the trick.
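
Roughly, the effect I'm after looks like the userspace model below. This is
a sketch only: the struct and function names are made up for illustration,
and the real change would live in smack_d_instantiate rather than in a
helper like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an executable file as Smack would see it. */
struct model_exec_file {
	bool on_userns_mount;   /* file lives on an unprivileged mount */
	const char *exec_label; /* SMACK64EXEC value, NULL if absent */
};

/*
 * Label the task runs with after exec. On a user ns mount the
 * SMACK64EXEC value is never consulted, so the task keeps its label.
 */
static const char *model_task_label_after_exec(const char *task_label,
					const struct model_exec_file *file)
{
	assert(task_label != NULL && file != NULL);

	if (file->on_userns_mount)
		return task_label;
	return file->exec_label != NULL ? file->exec_label : task_label;
}
```

So a binary carrying SMACK64EXEC on an unprivileged mount cannot hand the
task a new label, while the same attribute on a trusted mount works as it
does today.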

> So I guess all of that seems okay, though perhaps a bit restrictive
> given that the user who mounted the filesystem already has full access
> to the backing store.

In truth, there is no reason to expect that the "user" who did the
mount will ever have a Smack label that differs from the label of
the backing store. If what we've got here seems restrictive, it's
because you've got access from someone other than the "user".

> Please let me know whether or not this matches up with what you are
> thinking, then I can procede with the implementation.

My current mindset is that, if you're going to allow unprivileged
mounts of user defined backing stores, this is as safe as we can
make it.

>
> Thanks,
> Seth
>

2015-07-23 00:22:03

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Casey Schaufler <[email protected]> writes:

> On 7/22/2015 12:32 PM, Seth Forshee wrote:
>> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
>>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>>> with backing store. They wouldn't make sense for filesystems without
>>>>>>>>> backing store.
>>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>>> created with, which would be the label of the process creating
>>>>>>>> the memory based filesystem. For SELinux the rules are a
>>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>>> come up with how to determine it.
>>>>>>>>
>>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>>> If that label isn't something user can write to, he won't be
>>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>>> backing store then use the label of the process creating the
>>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>>> will work hunky dory.
>>>>>>>>
>>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>>> the label from the backing store or the creating process is
>>>>>>>> simple enough.
>>>>>>>>
>>>>>> So something like the diff below (untested)?
>>>>> I think that this is close, and quite good for someone
>>>>> who isn't very familiar with Smack. It's definitely headed
>>>>> in the right direction.
>>>>>
>>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>>> then using it instead of smk_of_current() in
>>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>>> smack_d_instantiate().
>>>>> Let's say your backing store is a file labeled Rubble.
>>>>>
>>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>>
>>>>> It is completely reasonable for a process labeled Flintstone to
>>>>> have rwxa access to a file labeled Rubble.
>>>>>
>>>>> Smack rule: Flintstone Rubble rwxa
>>>>>
>>>>> In the case of writing to an existing Rubble file, what you
>>>>> have looks fine. What's not so great is that if the Flintstone
>>>>> process creates a file, it should be labeled Flintstone. Your
>>>>> use of the smk_default, which is going to violate the principle
>>>>> of least astonishment, and break the Smack policy as well.
>>>>>
>>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>>> use smackfstransmute and a slightly different access rule:
>>>>>
>>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>>
>>>>> Smack rule: Flintstone Rubble rwxat
>>>>>
>>>>> Now the only change we have to make to the Smack code is
>>>>> that we don't want to create any files unless either the
>>>>> process is labeled Rubble or the rule allowing the creation
>>>>> has the "t" for transmute access. That should ensure that
>>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>>> with the metadata in a detectable way.
>>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>>> Questions follow.
>>>>
>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>>> --- a/include/linux/fs.h
>>>>>> +++ b/include/linux/fs.h
>>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>>>> __sb_start_write(sb, SB_FREEZE_FS, true);
>>>>>> }
>>>>>>
>>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>>> +{
>>>>>> + return sb->s_user_ns != &init_user_ns;
>>>>>> +}
>>>>>>
>>>>>> extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>>
>>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>>> index a143328f75eb..591fd19294e7 100644
>>>>>> --- a/security/smack/smack_lsm.c
>>>>>> +++ b/security/smack/smack_lsm.c
>>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>>>> char *buffer;
>>>>>> struct smack_known *skp = NULL;
>>>>>>
>>>>>> + /* Should never fetch xattrs from untrusted mounts */
>>>>>> + if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>>> + return ERR_PTR(-EPERM);
>>>>>> +
>>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>>>>
>>>>>> if (ip->i_op->getxattr == NULL)
>>>>>> return ERR_PTR(-EOPNOTSUPP);
>>>>>>
>>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>>>> */
>>>>>> if (specified)
>>>>>> return -EPERM;
>>>>>> +
>>>>>> /*
>>>>>> - * Unprivileged mounts get root and default from the caller.
>>>>>> + * User namespace mounts get root and default from the backing
>>>>>> + * store, if there is one. Other unprivileged mounts get them
>>>>>> + * from the caller.
>>>>>> */
>>>>>> - skp = smk_of_current();
>>>>>> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>>>>> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>>>> sp->smk_root = skp;
>>>>>> sp->smk_default = skp;
>>>>> sp->smk_flags |= SMK_INODE_TRANSMUTE;
>>>> I assume that you meant skp and not sp here.
>>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
>>> in the smk_flags field of the root inode. That's easy:
>>>
>>> transmute = 1;
>>>
>>> and the code after "Initialize the root inode" will take care of it.
>> Yeah, that's what I've actually done.
>>
>>>>>> }
>>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>>>> */
>>>>>> static int smack_inode_alloc_security(struct inode *inode)
>>>>>> {
>>>>>> - struct smack_known *skp = smk_of_current();
>>>>>> + struct smack_known *skp;
>>>>>> +
>>>>>> + if (sb_in_userns(inode->i_sb))
>>>>>> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>>>>> + else
>>>>>> + skp = smk_of_current();
>>>>> This should be left alone.
>>>>> smack_inode_init_security is where you could disallow access that doesn't
>>>>> legitimately result in a Rubble label on the file. It's something like
>>>>>
>>>>> ... after the call may = smk_access_entry(...)
>>>>> if (sb_in_userns(inode->i_sb))
>>>>> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>>>>> return -EACCES;
>>>> I'm not getting how this covers all cases.
>>>>
>>>> So we've set the transmute flag on the root inode. Files and directories
>>>> created in the root directory get the same label, and directories also
>>>> get the transmute attribute. That's all fine.
>>>>
>>>> What about an existing directory in the filesystem that already has a
>>>> Slate label? I'm not getting what happens with this directory, or for
>>>> new files created in this directory, which also relates to my other
>>>> questions below.
>>>>
>>>> Also an aside - smk_access_entry looks weird. may is initialized to
>>>> -ENOENT, and then rule_list is searched for a rule which matches the
>>>> object and subject labels. Presumably it's possible that no rule could
>>>> be found, otherwise the prior initialization of may is pointless. If
>>>> this happens the following code treats it as though it always contains
>>>> access flags even though it might contain -ENOENT. Nothing bad actually
>>>> happens with a two's complement representation of -ENOENT since it will
>>>> just set a bit that's already set, but it still seems like it should
>>>> have a may > 0 condition, for clarity if for no other reason.
>>> My suggested code is just wrong. I wasn't looking at the whole code,
>>> only the patch, and got myself confused. Apologies.
>>>
>>> If we want to go straight for the jugular how about this? I'm assuming
>>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
>> Yes.
>>
>>> static int smack_inode_permission(struct inode *inode, int mask)
>>> {
>>> struct smk_audit_info ad;
>>> int no_block = mask & MAY_NOT_BLOCK;
>>> int rc;
>>>
>>> mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
>>> /*
>>> * No permission to check. Existence test. Yup, it's there.
>>> */
>>> if (mask == 0)
>>> return 0;
>>>
>>> + if (sb_in_userns(inode->i_sb) &&
>>> + smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
>>> + return -EACCES;
>>> +
>>> /* May be droppable after audit */
>>> if (no_block)
>>> return -ECHILD;
>>> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
>>> smk_ad_setfield_u_fs_inode(&ad, inode);
>>> rc = smk_curacc(smk_of_inode(inode), mask, &ad);
>>> rc = smk_bu_inode(inode, mask, rc);
>>> return rc;
>>> }
>> Hmm, okay. I think I've been a little confused all this time about how
>> you want to handle these unprivileged mounts.
>
> Not your problem. I'm not the most consistent of reviewers.
>
>> Originally I thought you wanted all objects in the filesystem to get the
>> same label as the backing store. That's what I tried to implement
>> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
>> assign every object (new and existing) smk_default and completely ignore
>> the labels on disk.
>
> I want everything to have the label of the backing store, but
> I don't want to ignore it if it somehow got something else. Because
> the only legitimate label for this example is Rubble, I want to
> reject anything else that appears. If someone builds a filesystem
> by hand with Slate labels I want it treated "safely".
>
>> This is what I currently think you want for user ns mounts:
>>
>> 1. smk_root and smk_default are assigned the label of the backing
>> device.
>> 2. s_root is assigned the transmute property.
>> 3. For existing files:
>> a. Files with the same label as the backing device are accessible.
>> b. Files with any other label are not accessible.
>
> That's right. Accept correct data, reject anything that's not right.
>
>> If this is right, there are a couple lingering questions in my mind.
>>
>> First, what happens with files created in directories with the same
>> label as the backing device but without the transmute property set? The
>> inode for the new file will initially be labeled with smk_of_current(),
>> but then during d_instantiate it will get smk_default and thus end up
>> with the label we want. So that seems okay.
>
> Yes.
>
>> The second is whether files with the SMACK64EXEC attribute are still a
>> problem. It seems it is, for files with the same label as the backing
>> store at least. I think we can simply skip the code that reads out this
>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>> label to the new task in bprm_set_creds. The latter seems more
>> consistent with the approach you've suggested for dealing with labels
>> from disk.
>
> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> smack_d_instantiate for unprivileged mounts would do the trick.
>
>> So I guess all of that seems okay, though perhaps a bit restrictive
>> given that the user who mounted the filesystem already has full access
>> to the backing store.
>
> In truth, there is no reason to expect that the "user" who did the
> mount will ever have a Smack label that differs from the label of
> the backing store. If what we've got here seems restrictive, it's
> because you've got access from someone other than the "user".
>
>> Please let me know whether or not this matches up with what you are
>> thinking, then I can proceed with the implementation.
>
> My current mindset is that, if you're going to allow unprivileged
> mounts of user defined backing stores, this is as safe as we can
> make it.

That actually sounds very reasonable to me. It is essentially what we
do with uids and gids already. I presume the Smack namespace support,
when integrated with all of this, would allow a set of labels to be
set.

Have I missed a part of the conversation where you talk about
filesystems that don't have support for storing labels? Filesystems
like vfat, isofs, etc.

Eric

2015-07-23 01:51:49

by Dave Chinner

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > >>>result in creating a cycle in the dcache and then deadlocking.
> > >>
> > >>Therein lies the problem: how do you detect such structural defects
> > >>without doing a full structure validation?
> > >
> > >You can prevent cycles in a graph if you can prevent adding an edge
> > >which would be part of a cycle.
> > >
> > Except if the user can write to the filesystem's backing storage (be
> > it a device or a file), and has sufficient knowledge of the on-disk
> > structures, they can create all the cycles they want in the
> > metadata. So unless the kernel builds the graph internally by
> > parsing the metadata _and_ has some way to detect that the on-disk
> > metadata has hit a cycle (which may not just involve 2 items),
>
> Understood. Again, see the d_ancestor call in d_splice_alias, this is
> exactly what it checks for.

But that only addresses one type of loop in one specific metadata
structure. There are plenty of other ways you could construct metadata
loops that are essentially undetected and result in either deadlock
or livelock within the filesystem code itself. e.g. just make btree
sibling pointers loop over a range of entries that have the same
index key (e.g. free space extents of the same size). If allocation
then falls into this loop, the kernel will just spin searching the
same blocks for something it will never find. Such resource
consumption attacks are trivial to construct but extremely difficult
to detect because they exploit normal behaviour of the structure and
algorithms by mangling trusted pointers.
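
For reference, the one case that the d_ancestor check in d_splice_alias
does cover can be modelled in user space roughly as below. This is only
a sketch: struct node, is_self_or_ancestor() and attach_child() are
hypothetical names made up for illustration, not the kernel's dcache
code, and the model ignores locking entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a dentry; not the kernel struct. */
struct node {
	struct node *parent;	/* NULL at the root */
};

/* Return true if 'a' is 'n' itself or one of the ancestors of 'n'. */
static bool is_self_or_ancestor(const struct node *a, const struct node *n)
{
	for (; n != NULL; n = n->parent)
		if (n == a)
			return true;
	return false;
}

/*
 * Attach 'child' under 'parent' unless that edge would close a cycle,
 * i.e. unless 'child' is 'parent' itself or already one of its ancestors.
 * Returns 0 on success and -1 if the edge is refused.
 */
static int attach_child(struct node *parent, struct node *child)
{
	if (is_self_or_ancestor(child, parent))
		return -1;
	child->parent = parent;
	return 0;
}
```

The walk is cheap here only because the in-memory tree is acyclic by
construction. The on-disk structures described above offer no such
invariant.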

Of course, this sort of attack will eventually deadlock the
filesystem because it will back up on locks held by the livelocked
search. Once the filesystem is deadlocked, it can then cause sync()
calls to get stuck on the filesystem. And because sync() is a global
operation, a deadlocked filesystem in one container could cause sync()
to hang in a completely unrelated container....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-23 05:15:45

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 22, 2015 at 07:15:19PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <[email protected]> writes:
>
> > On 7/22/2015 12:32 PM, Seth Forshee wrote:
> >> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
> >>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
> >>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
> >>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
> >>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
> >>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
> >>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
> >>>>>>>>> I really don't see the benefit of making up extra rules that apply to
> >>>>>>>>> users outside a userns who try to access specifically a filesystem
> >>>>>>>>> with backing store. They wouldn't make sense for filesystems without
> >>>>>>>>> backing store.
> >>>>>>>> Sure it would. For Smack, it would be the label a file would be
> >>>>>>>> created with, which would be the label of the process creating
> >>>>>>>> the memory based filesystem. For SELinux the rules are a
> >>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
> >>>>>>>> come up with how to determine it.
> >>>>>>>>
> >>>>>>>> The point, looping all the way back to the beginning, where we
> >>>>>>>> were talking about just ignoring the labels on the filesystem,
> >>>>>>>> is that if you use the same Smack label on the files in the
> >>>>>>>> filesystem as the backing store file has, we'll all be happy.
> >>>>>>>> If that label isn't something the user can write to, he won't be
> >>>>>>>> able to write to the mounted objects, either. If there is no
> >>>>>>>> backing store then use the label of the process creating the
> >>>>>>>> filesystem, which will be the user, which will mean everything
> >>>>>>>> will work hunky dory.
> >>>>>>>>
> >>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
> >>>>>>>> the label from the backing store or the creating process is
> >>>>>>>> simple enough.
> >>>>>>>>
> >>>>>> So something like the diff below (untested)?
> >>>>> I think that this is close, and quite good for someone
> >>>>> who isn't very familiar with Smack. It's definitely headed
> >>>>> in the right direction.
> >>>>>
> >>>>>> All I'm really doing is setting smk_default as you describe above and
> >>>>>> then using it instead of smk_of_current() in
> >>>>>> smack_inode_alloc_security() and instead of the label from the disk in
> >>>>>> smack_d_instantiate().
> >>>>> Let's say your backing store is a file labeled Rubble.
> >>>>>
> >>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
> >>>>>
> >>>>> It is completely reasonable for a process labeled Flintstone to
> >>>>> have rwxa access to a file labeled Rubble.
> >>>>>
> >>>>> Smack rule: Flintstone Rubble rwxa
> >>>>>
> >>>>> In the case of writing to an existing Rubble file, what you
> >>>>> have looks fine. What's not so great is that if the Flintstone
> >>>>> process creates a file, it should be labeled Flintstone. Your
> >>>>> use of smk_default is going to violate the principle of least
> >>>>> astonishment, and break the Smack policy as well.
> >>>>>
> >>>>> Let's make a minor change. Instead of using smackfsroot let's
> >>>>> use smackfstransmute and a slightly different access rule:
> >>>>>
> >>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
> >>>>>
> >>>>> Smack rule: Flintstone Rubble rwxat
> >>>>>
> >>>>> Now the only change we have to make to the Smack code is
> >>>>> that we don't want to create any files unless either the
> >>>>> process is labeled Rubble or the rule allowing the creation
> >>>>> has the "t" for transmute access. That should ensure that
> >>>>> everything is labeled Rubble. If it isn't, someone has mucked
> >>>>> with the metadata in a detectable way.
> >>>> All right, that kind of makes sense, but I'm still missing some pieces.
> >>>> Questions follow.
> >>>>
> >>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> >>>>>> index 32f598db0b0d..4597420ab933 100644
> >>>>>> --- a/include/linux/fs.h
> >>>>>> +++ b/include/linux/fs.h
> >>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
> >>>>>> __sb_start_write(sb, SB_FREEZE_FS, true);
> >>>>>> }
> >>>>>>
> >>>>>> +static inline bool sb_in_userns(struct super_block *sb)
> >>>>>> +{
> >>>>>> + return sb->s_user_ns != &init_user_ns;
> >>>>>> +}
> >>>>>>
> >>>>>> extern bool inode_owner_or_capable(const struct inode *inode);
> >>>>>>
> >>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> >>>>>> index a143328f75eb..591fd19294e7 100644
> >>>>>> --- a/security/smack/smack_lsm.c
> >>>>>> +++ b/security/smack/smack_lsm.c
> >>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
> >>>>>> char *buffer;
> >>>>>> struct smack_known *skp = NULL;
> >>>>>>
> >>>>>> + /* Should never fetch xattrs from untrusted mounts */
> >>>>>> + if (WARN_ON(sb_in_userns(ip->i_sb)))
> >>>>>> + return ERR_PTR(-EPERM);
> >>>>>> +
> >>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
> >>>>>
> >>>>>> if (ip->i_op->getxattr == NULL)
> >>>>>> return ERR_PTR(-EOPNOTSUPP);
> >>>>>>
> >>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> >>>>>> */
> >>>>>> if (specified)
> >>>>>> return -EPERM;
> >>>>>> +
> >>>>>> /*
> >>>>>> - * Unprivileged mounts get root and default from the caller.
> >>>>>> + * User namespace mounts get root and default from the backing
> >>>>>> + * store, if there is one. Other unprivileged mounts get them
> >>>>>> + * from the caller.
> >>>>>> */
> >>>>>> - skp = smk_of_current();
> >>>>>> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
> >>>>>> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
> >>>>>> sp->smk_root = skp;
> >>>>>> sp->smk_default = skp;
> >>>>> sp->smk_flags |= SMK_INODE_TRANSMUTE;
> >>>> I assume that you meant skp and not sp here.
> >>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
> >>> in the smk_flags field of the root inode. That's easy:
> >>>
> >>> transmute = 1;
> >>>
> >>> and the code after "Initialize the root inode" will take care of it.
> >> Yeah, that's what I've actually done.
> >>
> >>>>>> }
> >>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
> >>>>>> */
> >>>>>> static int smack_inode_alloc_security(struct inode *inode)
> >>>>>> {
> >>>>>> - struct smack_known *skp = smk_of_current();
> >>>>>> + struct smack_known *skp;
> >>>>>> +
> >>>>>> + if (sb_in_userns(inode->i_sb))
> >>>>>> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
> >>>>>> + else
> >>>>>> + skp = smk_of_current();
> >>>>> This should be left alone.
> >>>>> smack_inode_init_security is where you could disallow access that doesn't
> >>>>> legitimately result in a Rubble label on the file. It's something like
> >>>>>
> >>>>> ... after the call may = smk_access_entry(...)
> >>>>> if (sb_in_userns(inode->i_sb))
> >>>>> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
> >>>>> return -EACCES;
> >>>> I'm not getting how this covers all cases.
> >>>>
> >>>> So we've set the transmute flag on the root inode. Files and directories
> >>>> created in the root directory get the same label, and directories also
> >>>> get the transmute attribute. That's all fine.
> >>>>
> >>>> What about an existing directory in the filesystem that already has a
> >>>> Slate label? I'm not getting what happens with this directory, or for
> >>>> new files created in this directory, which also relates to my other
> >>>> questions below.
> >>>>
> >>>> Also an aside - smk_access_entry looks weird. may is initialized to
> >>>> -ENOENT, and then rule_list is searched for a rule which matches the
> >>>> object and subject labels. Presumably it's possible that no rule could
> >>>> be found, otherwise the prior initialization of may is pointless. If
> >>>> this happens the following code treats it as though it always contains
> >>>> access flags even though it might contain -ENOENT. Nothing bad actually
> >>>> happens with a two's complement representation of -ENOENT since it will
> >>>> just set a bit that's already set, but it still seems like it should
> >>>> have a may > 0 condition, for clarity if for no other reason.
> >>> My suggested code is just wrong. I wasn't looking at the whole code,
> >>> only the patch, and got myself confused. Apologies.
> >>>
> >>> If we want to go straight for the jugular how about this? I'm assuming
> >>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
> >> Yes.
> >>
> >>> static int smack_inode_permission(struct inode *inode, int mask)
> >>> {
> >>> struct smk_audit_info ad;
> >>> int no_block = mask & MAY_NOT_BLOCK;
> >>> int rc;
> >>>
> >>> mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
> >>> /*
> >>> * No permission to check. Existence test. Yup, it's there.
> >>> */
> >>> if (mask == 0)
> >>> return 0;
> >>>
> >>> + if (sb_in_userns(inode->i_sb) &&
> >>> + smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
> >>> + return -EACCES;
> >>> +
> >>> /* May be droppable after audit */
> >>> if (no_block)
> >>> return -ECHILD;
> >>> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
> >>> smk_ad_setfield_u_fs_inode(&ad, inode);
> >>> rc = smk_curacc(smk_of_inode(inode), mask, &ad);
> >>> rc = smk_bu_inode(inode, mask, rc);
> >>> return rc;
> >>> }
> >> Hmm, okay. I think I've been a little confused all this time about how
> >> you want to handle these unprivileged mounts.
> >
> > Not your problem. I'm not the most consistent of reviewers.
> >
> >> Originally I thought you wanted all objects in the filesystem to get the
> >> same label as the backing store. That's what I tried to implement
> >> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
> >> assign every object (new and existing) smk_default and completely ignore
> >> the labels on disk.
> >
> > I want everything to have the label of the backing store, but
> > I don't want to ignore it if it somehow got something else. Because
> > the only legitimate label for this example is Rubble, I want to
> > reject anything else that appears. If someone builds a filesystem
> > by hand with Slate labels I want it treated "safely".
> >
> >> This is what I currently think you want for user ns mounts:
> >>
> >> 1. smk_root and smk_default are assigned the label of the backing
> >> device.
> >> 2. s_root is assigned the transmute property.
> >> 3. For existing files:
> >> a. Files with the same label as the backing device are accessible.
> >> b. Files with any other label are not accessible.
> >
> > That's right. Accept correct data, reject anything that's not right.
> >
> >> If this is right, there are a couple lingering questions in my mind.
> >>
> >> First, what happens with files created in directories with the same
> >> label as the backing device but without the transmute property set? The
> >> inode for the new file will initially be labeled with smk_of_current(),
> >> but then during d_instantiate it will get smk_default and thus end up
> >> with the label we want. So that seems okay.
> >
> > Yes.
> >
> >> The second is whether files with the SMACK64EXEC attribute are still a
> >> problem. It seems it is, for files with the same label as the backing
> >> store at least. I think we can simply skip the code that reads out this
> >> xattr and sets smk_task for user ns mounts, or else skip assigning the
> >> label to the new task in bprm_set_creds. The latter seems more
> >> consistent with the approach you've suggested for dealing with labels
> >> from disk.
> >
> > Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> > smack_d_instantiate for unprivileged mounts would do the trick.
> >
> >> So I guess all of that seems okay, though perhaps a bit restrictive
> >> given that the user who mounted the filesystem already has full access
> >> to the backing store.
> >
> > In truth, there is no reason to expect that the "user" who did the
> > mount will ever have a Smack label that differs from the label of
> > the backing store. If what we've got here seems restrictive, it's
> > because you've got access from someone other than the "user".
> >
> >> Please let me know whether or not this matches up with what you are
> >> thinking, then I can proceed with the implementation.
> >
> > My current mindset is that, if you're going to allow unprivileged
> > mounts of user defined backing stores, this is as safe as we can
> > make it.
>
> That actually sounds very reasonable to me. It is essentially what we
> do with uids and gids already. I presume the Smack namespace support,
> when integrated with all of this, would allow a set of labels to be
> set.
>
> Have I missed a part of the conversation where you talk about
> filesystems that don't have support for storing labels? Filesystems
> like vfat, isofs, etc.

As I read the code they should all end up with the superblock's
smk_default label for the objects in RAM, i.e. the label of the backing
store. The same would be true for existing files in a filesystem which
does support storing labels but has no labels on the files.
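
To make the intended behaviour concrete, here is a small user-space
model of the rules as summarised in this thread. It is a sketch under
assumptions, not the Smack code: struct sb_model, effective_label() and
userns_label_check() are hypothetical names, labels are plain strings,
and default_label stands in for the superblock's smk_default (the
backing store's label for user ns mounts).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the Smack superblock data. */
struct sb_model {
	int in_userns;			/* nonzero for a user namespace mount */
	const char *default_label;	/* models smk_default */
};

/*
 * Label an inode ends up with in RAM. 'disk_label' is the label read
 * from the filesystem, or NULL when the filesystem cannot store labels
 * or the file simply has none; the superblock default applies then.
 */
static const char *effective_label(const struct sb_model *sb,
				   const char *disk_label)
{
	return disk_label ? disk_label : sb->default_label;
}

/*
 * Returns 0 if access may go on to the normal rule check, or -EACCES
 * if a user ns mount carries a label other than the backing store's.
 */
static int userns_label_check(const struct sb_model *sb,
			      const char *disk_label)
{
	if (!sb->in_userns)
		return 0;
	if (strcmp(effective_label(sb, disk_label), sb->default_label) != 0)
		return -EACCES;
	return 0;
}
```

With default_label set to "Rubble", an unlabeled file and a file
labeled Rubble both pass, while a file labeled Slate is rejected.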

Seth

2015-07-23 13:19:38

by J. Bruce Fields

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 23, 2015 at 11:51:35AM +1000, Dave Chinner wrote:
> On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> > On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > > >>>result in creating a cycle in the dcache and then deadlocking.
> > > >>
> > > >>Therein lies the problem: how do you detect such structural defects
> > > >>without doing a full structure validation?
> > > >
> > > >You can prevent cycles in a graph if you can prevent adding an edge
> > > >which would be part of a cycle.
> > > >
> > > Except if the user can write to the filesystem's backing storage (be
> > > it a device or a file), and has sufficient knowledge of the on-disk
> > > structures, they can create all the cycles they want in the
> > > metadata. So unless the kernel builds the graph internally by
> > > parsing the metadata _and_ has some way to detect that the on-disk
> > > metadata has hit a cycle (which may not just involve 2 items),
> >
> > Understood. Again, see the d_ancestor call in d_splice_alias, this is
> > exactly what it checks for.
>
> But that only addresses one type of loop in one specific metadata
> structure.

Yep, agreed!

> There are plenty of other ways you could construct metadata
> loops that are essentially undetected and result in either deadlock
> or livelock within the filesystem code itself. e.g. just make btree
> sibling pointers loop over a range of entries that have the same
> index key (e.g. free space extents of the same size). If allocation
> then falls into this loop, the kernel will just spin searching the
> same blocks for something it will never find. Such resource
> consumption attacks are trivial to construct but extremely difficult
> to detect because they exploit normal behaviour of the structure and
> algorithms by mangling trusted pointers.

Interesting example, thanks! I doubt this particular example would be
*that* hard to detect? But understood that there may be lots of others.
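
As a rough illustration of what detecting that one case might look
like, here is a user-space sketch that walks a sibling chain with two
cursors (Floyd's cycle detection). Everything in it is an assumption
made for the example: the right_sibling[] array layout, the NO_SIBLING
marker and the function name do not come from any real filesystem, and
no claim is made that a filesystem could afford to run this on every
traversal.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_SIBLING UINT32_MAX	/* hypothetical "end of chain" marker */

/*
 * Hypothetical view of one btree level: right_sibling[i] holds the
 * block number of the right sibling of block i, or NO_SIBLING at the
 * end of the level. Returns true if the chain starting at 'start'
 * loops back on itself or points outside the 'nblocks' valid blocks.
 */
static bool sibling_chain_is_corrupt(const uint32_t *right_sibling,
				     uint32_t nblocks, uint32_t start)
{
	uint32_t slow = start, fast = start;

	while (fast != NO_SIBLING) {
		int step;

		/* Move the fast cursor up to two links per iteration. */
		for (step = 0; step < 2 && fast != NO_SIBLING; step++) {
			if (fast >= nblocks)
				return true;	/* pointer outside the device */
			fast = right_sibling[fast];
		}
		if (fast == NO_SIBLING)
			break;
		/* slow trails fast, so its index was already range-checked. */
		slow = right_sibling[slow];
		if (slow == fast)
			return true;	/* the cursors met: the chain loops */
	}
	return false;
}
```

This only covers a single chain in a single structure, which is the
point made above: each other kind of pointer would need its own check.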

--b.

>
> Of course, this sort of attack will eventually deadlock the
> filesystem because it will back up on locks held by the livelocked
> search. Once the filesystem is deadlocked, it can then cause sync()
> calls to get stuck on the filesystem. And because sync() is a global
> operation, a deadlocked filesystem in one container could cause sync()
> to hang in a completely unrelated container....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2015-07-23 13:59:12

by Stephen Smalley

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/22/2015 04:40 PM, Stephen Smalley wrote:
> On 07/22/2015 04:25 PM, Stephen Smalley wrote:
>> On 07/22/2015 12:14 PM, Seth Forshee wrote:
>>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
>>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
>>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
>>>>>> Unprivileged users should not be able to supply security labels
>>>>>> in filesystems, nor should they be able to supply security
>>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
>>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
>>>>>> and return EPERM if any contexts are supplied in the mount
>>>>>> options.
>>>>>>
>>>>>> Signed-off-by: Seth Forshee <[email protected]>
>>>>>
>>>>> I think this is obsoleted by the subsequent discussion, but just for the
>>>>> record: this patch would cause the files in the userns mount to be left
>>>>> with the "unlabeled" label, and therefore under typical policies,
>>>>> completely inaccessible to any process in a confined domain.
>>>>
>>>> The right way to handle this for SELinux would be to automatically use
>>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
>>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
>>>> from some related object (e.g. the block device file context, as in your
>>>> patches for Smack). That will cause SELinux to use that value instead
>>>> of any xattr value from the filesystem and will cause attempts by
>>>> userspace to set the security.selinux xattr to fail on that filesystem.
>>>> That is how SELinux normally deals with untrusted filesystems, except
>>>> that it is normally specified as a mount option by a trusted mounting
>>>> process, whereas in your case you need to automatically set it.
>>>
>>> Excellent, thank you for the advice. I'll start on this when I've
>>> finished with Smack.
>>
>> Not tested, but something like this should work. Note that it should
>> come after the call to security_fs_use() so we know whether SELinux
>> would even try to use xattrs supplied by the filesystem in the first place.
>>
>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>> index 564079c..84da3a2 100644
>> --- a/security/selinux/hooks.c
>> +++ b/security/selinux/hooks.c
>> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
>> goto out;
>> }
>> }
>> +
>> + /*
>> + * If this is a user namespace mount, no contexts are allowed
>> + * on the command line and security labels must be ignored.
>> + */
>> + if (sb->s_user_ns != &init_user_ns) {
>> + if (context_sid || fscontext_sid || rootcontext_sid ||
>> + defcontext_sid) {
>> + rc = -EACCES;
>> + goto out;
>> + }
>> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
>> + struct block_device *bdev = sb->s_bdev;
>> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
>> + if (bdev) {
>> + struct inode_security_struct *isec =
>> bdev->bd_inode;
>
> That should be bdev->bd_inode->i_security.

Sorry, this won't work. bd_inode is not the inode of the block device
file that was passed to mount, and it isn't labeled in any way. It will
just be unlabeled.

So I guess the only real option here as a fallback is
sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
only case where we currently assign task labels to files is for their
/proc/pid inodes, and no current policy will therefore allow create
permission to such files.
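
Put together, the fallback would behave roughly like the user-space
model below. It is a sketch only: enum fs_use, struct sbsec_model and
userns_set_mnt_opts() are hypothetical stand-ins for the real SELinux
types, mounter_sid stands in for current_sid(), and the error code
follows the untested diff quoted in this message.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins; the values are not the kernel's. */
enum fs_use { FS_USE_XATTR, FS_USE_MNTPOINT, FS_USE_OTHER };

struct sbsec_model {
	enum fs_use behavior;	/* models sbsec->behavior */
	uint32_t mntpoint_sid;	/* models sbsec->mntpoint_sid */
};

/*
 * For a user namespace mount: refuse any context supplied in the mount
 * options, and if the filesystem would otherwise use on-disk xattrs,
 * switch to mountpoint labeling with the mounting task's SID.
 * Mounts from the initial namespace are left untouched.
 */
static int userns_set_mnt_opts(struct sbsec_model *sbsec, int in_userns,
			       int contexts_given, uint32_t mounter_sid)
{
	if (!in_userns)
		return 0;
	if (contexts_given)
		return -EACCES;
	if (sbsec->behavior == FS_USE_XATTR) {
		sbsec->behavior = FS_USE_MNTPOINT;
		sbsec->mntpoint_sid = mounter_sid;
	}
	return 0;
}
```

The model does not address the policy problem described above: the
files still end up with a task label that no current policy allows
create permission on.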

>
>> + sbsec->mntpoint_sid = isec->sid;
>> + } else {
>> + sbsec->mntpoint_sid = current_sid();
>> + }
>> + }
>> + goto out_set_opts;
>> + }
>> +
>> /* sets the context of the superblock for the fs being mounted. */
>> if (fscontext_sid) {
>> rc = may_context_mount_sb_relabel(fscontext_sid, sbsec,
>> cred);
>> @@ -813,6 +837,7 @@ static int selinux_set_mnt_opts(struct super_block *sb,
>> sbsec->def_sid = defcontext_sid;
>> }
>>
>> +out_set_opts:
>> rc = sb_finish_set_opts(sb);
>> out:
>> mutex_unlock(&sbsec->lock);
>>

2015-07-23 14:39:33

by Seth Forshee

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On Thu, Jul 23, 2015 at 09:57:20AM -0400, Stephen Smalley wrote:
> On 07/22/2015 04:40 PM, Stephen Smalley wrote:
> > On 07/22/2015 04:25 PM, Stephen Smalley wrote:
> >> On 07/22/2015 12:14 PM, Seth Forshee wrote:
> >>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
> >>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
> >>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
> >>>>>> Unprivileged users should not be able to supply security labels
> >>>>>> in filesystems, nor should they be able to supply security
> >>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
> >>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
> >>>>>> and return EPERM if any contexts are supplied in the mount
> >>>>>> options.
> >>>>>>
> >>>>>> Signed-off-by: Seth Forshee <[email protected]>
> >>>>>
> >>>>> I think this is obsoleted by the subsequent discussion, but just for the
> >>>>> record: this patch would cause the files in the userns mount to be left
> >>>>> with the "unlabeled" label, and therefore under typical policies,
> >>>>> completely inaccessible to any process in a confined domain.
> >>>>
> >>>> The right way to handle this for SELinux would be to automatically use
> >>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
> >>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
> >>>> from some related object (e.g. the block device file context, as in your
> >>>> patches for Smack). That will cause SELinux to use that value instead
> >>>> of any xattr value from the filesystem and will cause attempts by
> >>>> userspace to set the security.selinux xattr to fail on that filesystem.
> >>>> That is how SELinux normally deals with untrusted filesystems, except
> >>>> that it is normally specified as a mount option by a trusted mounting
> >>>> process, whereas in your case you need to automatically set it.
> >>>
> >>> Excellent, thank you for the advice. I'll start on this when I've
> >>> finished with Smack.
> >>
> >> Not tested, but something like this should work. Note that it should
> >> come after the call to security_fs_use() so we know whether SELinux
> >> would even try to use xattrs supplied by the filesystem in the first place.
> >>
> >> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> >> index 564079c..84da3a2 100644
> >> --- a/security/selinux/hooks.c
> >> +++ b/security/selinux/hooks.c
> >> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> >> goto out;
> >> }
> >> }
> >> +
> >> + /*
> >> + * If this is a user namespace mount, no contexts are allowed
> >> + * on the command line and security labels must be ignored.
> >> + */
> >> + if (sb->s_user_ns != &init_user_ns) {
> >> + if (context_sid || fscontext_sid || rootcontext_sid ||
> >> + defcontext_sid) {
> >> + rc = -EACCES;
> >> + goto out;
> >> + }
> >> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
> >> + struct block_device *bdev = sb->s_bdev;
> >> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
> >> + if (bdev) {
> >> + struct inode_security_struct *isec =
> >> bdev->bd_inode;
> >
> > That should be bdev->bd_inode->i_security.
>
> Sorry, this won't work. bd_inode is not the inode of the block device
> file that was passed to mount, and it isn't labeled in any way. It will
> just be unlabeled.
>
> So I guess the only real option here as a fallback is
> sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
> only case where we currently assign task labels to files is for their
> /proc/pid inodes, and no current policy will therefore allow create
> permission to such files.

Darn, you're right, that isn't the inode we want. There really doesn't
seem to be any way to get back to the one we want from the LSM, short of
adding a new hook.

Seth
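
As a reading aid, the flow Stephen describes in the quoted patch can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct sb_model, enum fs_use and model_set_mnt_opts() are hypothetical names invented for illustration, SIDs are reduced to plain integers, and a backing_sid of 0 stands in for "no usable label could be found for the device file", which is the case where the thread falls back to current_sid().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SELinux labeling behaviors. */
enum fs_use { FS_USE_XATTR, FS_USE_MNTPOINT, FS_USE_NONE };

struct sb_model {
	bool in_userns;        /* models sb->s_user_ns != &init_user_ns */
	enum fs_use behavior;  /* models sbsec->behavior */
	uint32_t mntpoint_sid; /* models sbsec->mntpoint_sid */
};

/*
 * Returns 0 on success, or -EACCES when a context option was supplied
 * for a user namespace mount. backing_sid == 0 means no label could be
 * obtained for the backing device file, so the mounter's SID is used.
 */
int model_set_mnt_opts(struct sb_model *sb, bool contexts_supplied,
		       uint32_t backing_sid, uint32_t current_sid)
{
	assert(sb != NULL);

	if (!sb->in_userns)
		return 0;	/* init_user_ns mounts are not changed */

	if (contexts_supplied)
		return -EACCES;	/* no contexts allowed on the command line */

	if (sb->behavior == FS_USE_XATTR) {
		/* Ignore on-disk xattrs: force mountpoint labeling. */
		sb->behavior = FS_USE_MNTPOINT;
		sb->mntpoint_sid = backing_sid ? backing_sid : current_sid;
	}
	return 0;
}
```

With in_userns set and behavior FS_USE_XATTR, a call with no contexts switches the model to FS_USE_MNTPOINT, and the SID comes from the backing file when one is known and from the mounter otherwise.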

2015-07-23 15:37:50

by Stephen Smalley

[permalink] [raw]
Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/23/2015 10:39 AM, Seth Forshee wrote:
> On Thu, Jul 23, 2015 at 09:57:20AM -0400, Stephen Smalley wrote:
>> On 07/22/2015 04:40 PM, Stephen Smalley wrote:
>>> On 07/22/2015 04:25 PM, Stephen Smalley wrote:
>>>> On 07/22/2015 12:14 PM, Seth Forshee wrote:
>>>>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
>>>>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
>>>>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
>>>>>>>> Unprivileged users should not be able to supply security labels
>>>>>>>> in filesystems, nor should they be able to supply security
>>>>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
>>>>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
>>>>>>>> and return EPERM if any contexts are supplied in the mount
>>>>>>>> options.
>>>>>>>>
>>>>>>>> Signed-off-by: Seth Forshee <[email protected]>
>>>>>>>
>>>>>>> I think this is obsoleted by the subsequent discussion, but just for the
>>>>>>> record: this patch would cause the files in the userns mount to be left
>>>>>>> with the "unlabeled" label, and therefore under typical policies,
>>>>>>> completely inaccessible to any process in a confined domain.
>>>>>>
>>>>>> The right way to handle this for SELinux would be to automatically use
>>>>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
>>>>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
>>>>>> from some related object (e.g. the block device file context, as in your
>>>>>> patches for Smack). That will cause SELinux to use that value instead
>>>>>> of any xattr value from the filesystem and will cause attempts by
>>>>>> userspace to set the security.selinux xattr to fail on that filesystem.
>>>>>> That is how SELinux normally deals with untrusted filesystems, except
>>>>>> that it is normally specified as a mount option by a trusted mounting
>>>>>> process, whereas in your case you need to automatically set it.
>>>>>
>>>>> Excellent, thank you for the advice. I'll start on this when I've
>>>>> finished with Smack.
>>>>
>>>> Not tested, but something like this should work. Note that it should
>>>> come after the call to security_fs_use() so we know whether SELinux
>>>> would even try to use xattrs supplied by the filesystem in the first place.
>>>>
>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>>> index 564079c..84da3a2 100644
>>>> --- a/security/selinux/hooks.c
>>>> +++ b/security/selinux/hooks.c
>>>> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
>>>> goto out;
>>>> }
>>>> }
>>>> +
>>>> + /*
>>>> + * If this is a user namespace mount, no contexts are allowed
>>>> + * on the command line and security labels must be ignored.
>>>> + */
>>>> + if (sb->s_user_ns != &init_user_ns) {
>>>> + if (context_sid || fscontext_sid || rootcontext_sid ||
>>>> + defcontext_sid) {
>>>> + rc = -EACCES;
>>>> + goto out;
>>>> + }
>>>> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
>>>> + struct block_device *bdev = sb->s_bdev;
>>>> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
>>>> + if (bdev) {
>>>> + struct inode_security_struct *isec =
>>>> bdev->bd_inode;
>>>
>>> That should be bdev->bd_inode->i_security.
>>
>> Sorry, this won't work. bd_inode is not the inode of the block device
>> file that was passed to mount, and it isn't labeled in any way. It will
>> just be unlabeled.
>>
>> So I guess the only real option here as a fallback is
>> sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
>> only case where we currently assign task labels to files is for their
>> /proc/pid inodes, and no current policy will therefore allow create
>> permission to such files.
>
> Darn, you're right, that isn't the inode we want. There really doesn't
> seem to be any way to get back to the one we want from the LSM, short of
> adding a new hook.

Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
Feels like a layering violation though...
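
For readers unfamiliar with the idiom, list_first_entry() simply takes whatever node currently sits at the front of an intrusive list and converts it back to its containing structure. The userspace sketch below re-implements the macro to show that; struct inode_model, struct bdev_model, list_add_front() and first_inode_label() are hypothetical names, and the model assumes new inodes are linked at the head of the list.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Userspace equivalents of the kernel helpers, for illustration only. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_first_entry(head, type, member) \
	container_of((head)->next, type, member)

struct inode_model {
	int label;                  /* stand-in for the security label */
	struct list_head i_devices; /* link in the device's inode list */
};

struct bdev_model {
	struct list_head bd_inodes; /* all inodes that refer to this device */
};

void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert node directly after head, i.e. at the front of the list. */
void list_add_front(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Label of whichever inode happens to be first; the list must be non-empty. */
int first_inode_label(struct bdev_model *bdev)
{
	assert(bdev->bd_inodes.next != &bdev->bd_inodes);
	return list_first_entry(&bdev->bd_inodes, struct inode_model,
				i_devices)->label;
}
```

In this model, once a second device node with a different label is linked to the same device, the first entry is the newer node, so the result depends on insertion order rather than on which file was passed to mount.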

2015-07-23 16:23:47

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On Thu, Jul 23, 2015 at 11:36:03AM -0400, Stephen Smalley wrote:
> On 07/23/2015 10:39 AM, Seth Forshee wrote:
> > On Thu, Jul 23, 2015 at 09:57:20AM -0400, Stephen Smalley wrote:
> >> On 07/22/2015 04:40 PM, Stephen Smalley wrote:
> >>> On 07/22/2015 04:25 PM, Stephen Smalley wrote:
> >>>> On 07/22/2015 12:14 PM, Seth Forshee wrote:
> >>>>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
> >>>>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
> >>>>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
> >>>>>>>> Unprivileged users should not be able to supply security labels
> >>>>>>>> in filesystems, nor should they be able to supply security
> >>>>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
> >>>>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
> >>>>>>>> and return EPERM if any contexts are supplied in the mount
> >>>>>>>> options.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Seth Forshee <[email protected]>
> >>>>>>>
> >>>>>>> I think this is obsoleted by the subsequent discussion, but just for the
> >>>>>>> record: this patch would cause the files in the userns mount to be left
> >>>>>>> with the "unlabeled" label, and therefore under typical policies,
> >>>>>>> completely inaccessible to any process in a confined domain.
> >>>>>>
> >>>>>> The right way to handle this for SELinux would be to automatically use
> >>>>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
> >>>>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
> >>>>>> from some related object (e.g. the block device file context, as in your
> >>>>>> patches for Smack). That will cause SELinux to use that value instead
> >>>>>> of any xattr value from the filesystem and will cause attempts by
> >>>>>> userspace to set the security.selinux xattr to fail on that filesystem.
> >>>>>> That is how SELinux normally deals with untrusted filesystems, except
> >>>>>> that it is normally specified as a mount option by a trusted mounting
> >>>>>> process, whereas in your case you need to automatically set it.
> >>>>>
> >>>>> Excellent, thank you for the advice. I'll start on this when I've
> >>>>> finished with Smack.
> >>>>
> >>>> Not tested, but something like this should work. Note that it should
> >>>> come after the call to security_fs_use() so we know whether SELinux
> >>>> would even try to use xattrs supplied by the filesystem in the first place.
> >>>>
> >>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> >>>> index 564079c..84da3a2 100644
> >>>> --- a/security/selinux/hooks.c
> >>>> +++ b/security/selinux/hooks.c
> >>>> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> >>>> goto out;
> >>>> }
> >>>> }
> >>>> +
> >>>> + /*
> >>>> + * If this is a user namespace mount, no contexts are allowed
> >>>> + * on the command line and security labels must be ignored.
> >>>> + */
> >>>> + if (sb->s_user_ns != &init_user_ns) {
> >>>> + if (context_sid || fscontext_sid || rootcontext_sid ||
> >>>> + defcontext_sid) {
> >>>> + rc = -EACCES;
> >>>> + goto out;
> >>>> + }
> >>>> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
> >>>> + struct block_device *bdev = sb->s_bdev;
> >>>> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
> >>>> + if (bdev) {
> >>>> + struct inode_security_struct *isec =
> >>>> bdev->bd_inode;
> >>>
> >>> That should be bdev->bd_inode->i_security.
> >>
> >> Sorry, this won't work. bd_inode is not the inode of the block device
> >> file that was passed to mount, and it isn't labeled in any way. It will
> >> just be unlabeled.
> >>
> >> So I guess the only real option here as a fallback is
> >> sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
> >> only case where we currently assign task labels to files is for their
> >> /proc/pid inodes, and no current policy will therefore allow create
> >> permission to such files.
> >
> > Darn, you're right, that isn't the inode we want. There really doesn't
> > seem to be any way to get back to the one we want from the LSM, short of
> > adding a new hook.
>
> Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
> Feels like a layering violation though...

Yeah, and even though that probably works out to be the inode we want in
most cases I don't think we can be absolutely certain that it is. Maybe
there's some way we could walk the list and be sure we've found the
right inode, but I'm not seeing it.

2015-07-23 21:48:30

by Casey Schaufler

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/22/2015 5:15 PM, Eric W. Biederman wrote:
> Casey Schaufler <[email protected]> writes:
>
>> On 7/22/2015 12:32 PM, Seth Forshee wrote:
>>> On Wed, Jul 22, 2015 at 11:10:46AM -0700, Casey Schaufler wrote:
>>>> On 7/22/2015 8:56 AM, Seth Forshee wrote:
>>>>> On Tue, Jul 21, 2015 at 06:52:31PM -0700, Casey Schaufler wrote:
>>>>>> On 7/21/2015 1:35 PM, Seth Forshee wrote:
>>>>>>> On Thu, Jul 16, 2015 at 05:59:22PM -0700, Andy Lutomirski wrote:
>>>>>>>> On Thu, Jul 16, 2015 at 5:45 PM, Casey Schaufler <[email protected]> wrote:
>>>>>>>>> On 7/16/2015 4:29 PM, Andy Lutomirski wrote:
>>>>>>>>>> I really don't see the benefit of making up extra rules that apply to
>>>>>>>>>> users outside a userns who try to access specifically a filesystem
>>>>>>>>>> with backing store. They wouldn't make sense for filesystems without
>>>>>>>>>> backing store.
>>>>>>>>> Sure it would. For Smack, it would be the label a file would be
>>>>>>>>> created with, which would be the label of the process creating
>>>>>>>>> the memory based filesystem. For SELinux the rules are a
>>>>>>>>> touch more sophisticated, but I'm sure that Paul or Stephen could
>>>>>>>>> come up with how to determine it.
>>>>>>>>>
>>>>>>>>> The point, looping all the way back to the beginning, where we
>>>>>>>>> were talking about just ignoring the labels on the filesystem,
>>>>>>>>> is that if you use the same Smack label on the files in the
>>>>>>>>> filesystem as the backing store file has, we'll all be happy.
>>>>>>>>> If that label isn't something user can write to, he won't be
>>>>>>>>> able to write to the mounted objects, either. If there is no
>>>>>>>>> backing store then use the label of the process creating the
>>>>>>>>> filesystem, which will be the user, which will mean everything
>>>>>>>>> will work hunky dory.
>>>>>>>>>
>>>>>>>>> Yes, there's work involved, but I doubt there's a lot. Getting
>>>>>>>>> the label from the backing store or the creating process is
>>>>>>>>> simple enough.
>>>>>>>>>
>>>>>>> So something like the diff below (untested)?
>>>>>> I think that this is close, and quite good for someone
>>>>>> who isn't very familiar with Smack. It's definitely headed
>>>>>> in the right direction.
>>>>>>
>>>>>>> All I'm really doing is setting smk_default as you describe above and
>>>>>>> then using it instead of smk_of_current() in
>>>>>>> smack_inode_alloc_security() and instead of the label from the disk in
>>>>>>> smack_d_instantiate().
>>>>>> Let's say your backing store is a file labeled Rubble.
>>>>>>
>>>>>> mount -o smackfsroot=Rubble,smackfsdef=Rubble ...
>>>>>>
>>>>>> It is completely reasonable for a process labeled Flintstone to
>>>>>> have rwxa access to a file labeled Rubble.
>>>>>>
>>>>>> Smack rule: Flintstone Rubble rwxa
>>>>>>
>>>>>> In the case of writing to an existing Rubble file, what you
>>>>>> have looks fine. What's not so great is that if the Flintstone
>>>>>> process creates a file, it should be labeled Flintstone. Your
>>>>>> use of smk_default is going to violate the principle
>>>>>> of least astonishment, and break the Smack policy as well.
>>>>>>
>>>>>> Let's make a minor change. Instead of using smackfsroot let's
>>>>>> use smackfstransmute and a slightly different access rule:
>>>>>>
>>>>>> mount -o smackfstransmute=Rubble,smackfsdef=Rubble ...
>>>>>>
>>>>>> Smack rule: Flintstone Rubble rwxat
>>>>>>
>>>>>> Now the only change we have to make to the Smack code is
>>>>>> that we don't want to create any files unless either the
>>>>>> process is labeled Rubble or the rule allowing the creation
>>>>>> has the "t" for transmute access. That should ensure that
>>>>>> everything is labeled Rubble. If it isn't, someone has mucked
>>>>>> with the metadata in a detectable way.
>>>>> All right, that kind of makes sense, but I'm still missing some pieces.
>>>>> Questions follow.
>>>>>
>>>>>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>>>>>> index 32f598db0b0d..4597420ab933 100644
>>>>>>> --- a/include/linux/fs.h
>>>>>>> +++ b/include/linux/fs.h
>>>>>>> @@ -1486,6 +1486,10 @@ static inline void sb_start_intwrite(struct super_block *sb)
>>>>>>> __sb_start_write(sb, SB_FREEZE_FS, true);
>>>>>>> }
>>>>>>>
>>>>>>> +static inline bool sb_in_userns(struct super_block *sb)
>>>>>>> +{
>>>>>>> + return sb->s_user_ns != &init_user_ns;
>>>>>>> +}
>>>>>>>
>>>>>>> extern bool inode_owner_or_capable(const struct inode *inode);
>>>>>>>
>>>>>>> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>>>>>>> index a143328f75eb..591fd19294e7 100644
>>>>>>> --- a/security/smack/smack_lsm.c
>>>>>>> +++ b/security/smack/smack_lsm.c
>>>>>>> @@ -255,6 +255,10 @@ static struct smack_known *smk_fetch(const char *name, struct inode *ip,
>>>>>>> char *buffer;
>>>>>>> struct smack_known *skp = NULL;
>>>>>>>
>>>>>>> + /* Should never fetch xattrs from untrusted mounts */
>>>>>>> + if (WARN_ON(sb_in_userns(ip->i_sb)))
>>>>>>> + return ERR_PTR(-EPERM);
>>>>>>> +
>>>>>> Go ahead and fetch it, we'll check to make sure it's viable later.
>>>>>>
>>>>>>> if (ip->i_op->getxattr == NULL)
>>>>>>> return ERR_PTR(-EOPNOTSUPP);
>>>>>>>
>>>>>>> @@ -656,10 +660,14 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>>>>>>> */
>>>>>>> if (specified)
>>>>>>> return -EPERM;
>>>>>>> +
>>>>>>> /*
>>>>>>> - * Unprivileged mounts get root and default from the caller.
>>>>>>> + * User namespace mounts get root and default from the backing
>>>>>>> + * store, if there is one. Other unprivileged mounts get them
>>>>>>> + * from the caller.
>>>>>>> */
>>>>>>> - skp = smk_of_current();
>>>>>>> + skp = (sb_in_userns(sb) && sb->s_bdev) ?
>>>>>>> + smk_of_inode(sb->s_bdev->bd_inode) : smk_of_current();
>>>>>>> sp->smk_root = skp;
>>>>>>> sp->smk_default = skp;
>>>>>> sp->smk_flags |= SMK_INODE_TRANSMUTE;
>>>>> I assume that you meant skp and not sp here.
>>>> Actually, neither is correct. You want to set SMK_INODE_TRANSMUTE
>>>> in the smk_flags field of the root inode. That's easy:
>>>>
>>>> transmute = 1;
>>>>
>>>> and the code after "Initialize the root inode" will take care of it.
>>> Yeah, that's what I've actually done.
>>>
>>>>>>> }
>>>>>>> @@ -792,7 +800,12 @@ static int smack_bprm_secureexec(struct linux_binprm *bprm)
>>>>>>> */
>>>>>>> static int smack_inode_alloc_security(struct inode *inode)
>>>>>>> {
>>>>>>> - struct smack_known *skp = smk_of_current();
>>>>>>> + struct smack_known *skp;
>>>>>>> +
>>>>>>> + if (sb_in_userns(inode->i_sb))
>>>>>>> + skp = ((struct superblock_smack *)(inode->i_sb->s_security))->smk_default;
>>>>>>> + else
>>>>>>> + skp = smk_of_current();
>>>>>> This should be left alone.
>>>>>> smack_inode_init_security is where you could disallow access that doesn't
>>>>>> legitimately result in a Rubble label on the file. It's something like
>>>>>>
>>>>>> ... after the call may = smk_access_entry(...)
>>>>>> if (sb_in_userns(inode->i_sb))
>>>>>> if (skp != dsp && (may & MAY_TRANSMUTE) == 0)
>>>>>> return -EACCES;
>>>>> I'm not getting how this covers all cases.
>>>>>
>>>>> So we've set the transmute flag on the root inode. Files and directories
>>>>> created in the root directory get the same label, and directories also
>>>>> get the transmute attribute. That's all fine.
>>>>>
>>>>> What about an existing directory in the filesystem that already has a
>>>>> Slate label? I'm not getting what happens with this directory, or for
>>>>> new files created in this directory, which also relates to my other
>>>>> questions below.
>>>>>
>>>>> Also an aside - smk_access_entry looks weird. may is initialized to
>>>>> -ENOENT, and then rule_list is searched for a rule which matches the
>>>>> object and subject labels. Presumably it's possible that no rule could
>>>>> be found, otherwise the prior initialization of may is pointless. If
>>>>> this happens the following code treats it as though it always contains
>>>>> access flags even though it might contain -ENOENT. Nothing bad actually
>>>>> happens with a two's complement representation of -ENOENT since it will
>>>>> just set a bit that's already set, but it still seems like it should
>>>>> have a may > 0 condition, for clarity if for no other reason.
>>>> My suggested code is just wrong. I wasn't looking at the whole code,
>>>> only the patch, and got myself confused. Apologies.
>>>>
>>>> If we want to go straight for the jugular how about this? I'm assuming
>>>> that inode->i_sb->s_bdev->bd_inode is the inode of the backing store.
>>> Yes.
>>>
>>>> static int smack_inode_permission(struct inode *inode, int mask)
>>>> {
>>>> struct smk_audit_info ad;
>>>> int no_block = mask & MAY_NOT_BLOCK;
>>>> int rc;
>>>>
>>>> mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);
>>>> /*
>>>> * No permission to check. Existence test. Yup, it's there.
>>>> */
>>>> if (mask == 0)
>>>> return 0;
>>>>
>>>> + if (sb_in_userns(inode->i_sb) &&
>>>> + smk_of_inode(inode) != smk_of_inode(inode->i_sb->s_bdev->bd_inode))
>>>> + return -EACCES;
>>>> +
>>>> /* May be droppable after audit */
>>>> if (no_block)
>>>> return -ECHILD;
>>>> smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_INODE);
>>>> smk_ad_setfield_u_fs_inode(&ad, inode);
>>>> rc = smk_curacc(smk_of_inode(inode), mask, &ad);
>>>> rc = smk_bu_inode(inode, mask, rc);
>>>> return rc;
>>>> }
>>> Hmm, okay. I think I've been a little confused all this time about how
>>> you want to handle these unprivileged mounts.
>> Not your problem. I'm not the most consistent of reviewers.
>>
>>> Originally I thought you wanted all objects in the filesystem to get the
>>> same label as the backing store. That's what I tried to implement
>>> originally, i.e. smk_root=smk_default=smk_of_inode(...->bd_inode), then
>>> assign every object (new and existing) smk_default and completely ignore
>>> the labels on disk.
>> I want everything to have the label of the backing store, but
>> I don't want to ignore it if it somehow got something else. Because
>> the only legitimate label for this example is Rubble, I want to
>> reject anything else that appears. If someone builds a filesystem
>> by hand with Slate labels I want it treated "safely".
>>
>>> This is what I currently think you want for user ns mounts:
>>>
>>> 1. smk_root and smk_default are assigned the label of the backing
>>> device.
>>> 2. s_root is assigned the transmute property.
>>> 3. For existing files:
>>> a. Files with the same label as the backing device are accessible.
>>> b. Files with any other label are not accessible.
>> That's right. Accept correct data, reject anything that's not right.
>>
>>> If this is right, there are a couple lingering questions in my mind.
>>>
>>> First, what happens with files created in directories with the same
>>> label as the backing device but without the transmute property set? The
>>> inode for the new file will initially be labeled with smk_of_current(),
>>> but then during d_instantiate it will get smk_default and thus end up
>>> with the label we want. So that seems okay.
>> Yes.
>>
>>> The second is whether files with the SMACK64EXEC attribute are still a
>>> problem. It seems it is, for files with the same label as the backing
>>> store at least. I think we can simply skip the code that reads out this
>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>>> label to the new task in bprm_set_creds. The latter seems more
>>> consistent with the approach you've suggested for dealing with labels
>>> from disk.
>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> smack_d_instantiate for unprivileged mounts would do the trick.
>>
>>> So I guess all of that seems okay, though perhaps a bit restrictive
>>> given that the user who mounted the filesystem already has full access
>>> to the backing store.
>> In truth, there is no reason to expect that the "user" who did the
>> mount will ever have a Smack label that differs from the label of
>> the backing store. If what we've got here seems restrictive, it's
>> because you've got access from someone other than the "user".
>>
>>> Please let me know whether or not this matches up with what you are
>>> thinking, then I can proceed with the implementation.
>> My current mindset is that, if you're going to allow unprivileged
>> mounts of user defined backing stores, this is as safe as we can
>> make it.
> That actually sounds very reasonable to me. It is essentially what we
> do with uid and gids already. I presume the smack namespace support,
> when integrated with all of this, would allow a set of labels to be
> set.
>
> Have I missed a part of the conversation where you talk about filesystems that
> don't have support for storing labels? Filesystems like vfat, isofs,
> etc.

They are easier. Set smackfsroot=Rubble,smackfsdef=Rubble and all objects
there will get labeled Rubble. Processes with different labels that can
write there will end up creating Rubble objects. For privileged mounts you
can set the values at will. For unprivileged mounts, you should take the
label values from the backing store.

>
> Eric
>
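
The rules agreed on in this exchange (labels taken from the backing store for unprivileged mounts, and objects carrying any other label rejected) can be summarised in a small userspace model. It is a sketch only: struct smack_sb_model, model_kern_mount() and model_inode_permission() are hypothetical names, labels are plain strings, and the fallback to the mounter's label for a privileged mount without options exists only to keep the function total, not to describe real Smack defaults.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct smack_sb_model {
	bool in_userns;            /* unprivileged (user namespace) mount */
	const char *backing_label; /* NULL when there is no backing store */
	const char *smk_root;
	const char *smk_default;
};

/* Choose the root and default labels at mount time. */
void model_kern_mount(struct smack_sb_model *sb, const char *mounter_label,
		      const char *opt_label)
{
	const char *label;

	assert(sb != NULL && mounter_label != NULL);

	if (!sb->in_userns && opt_label)
		label = opt_label;          /* privileged: set at will */
	else if (sb->in_userns && sb->backing_label)
		label = sb->backing_label;  /* unprivileged: backing store */
	else
		label = mounter_label;      /* no backing store: the creator */

	sb->smk_root = label;
	sb->smk_default = label;
}

/* Reject objects whose label differs from that of the backing store. */
int model_inode_permission(const struct smack_sb_model *sb,
			   const char *inode_label)
{
	if (sb->in_userns && sb->backing_label &&
	    strcmp(inode_label, sb->backing_label) != 0)
		return -EACCES;
	return 0; /* the ordinary Smack rule check would follow here */
}
```

For a user namespace mount backed by a file labeled Rubble, both smk_root and smk_default become Rubble, an existing object labeled Rubble passes the check, and one labeled Slate is refused.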

2015-07-23 23:49:01

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 23, 2015 at 09:19:28AM -0400, J. Bruce Fields wrote:
> On Thu, Jul 23, 2015 at 11:51:35AM +1000, Dave Chinner wrote:
> > On Wed, Jul 22, 2015 at 01:41:00PM -0400, J. Bruce Fields wrote:
> > > On Wed, Jul 22, 2015 at 12:52:58PM -0400, Austin S Hemmelgarn wrote:
> > > > On 2015-07-22 10:09, J. Bruce Fields wrote:
> > > > >On Wed, Jul 22, 2015 at 05:56:40PM +1000, Dave Chinner wrote:
> > > > >>On Tue, Jul 21, 2015 at 01:37:21PM -0400, J. Bruce Fields wrote:
> > > > >>>On Fri, Jul 17, 2015 at 12:47:35PM +1000, Dave Chinner wrote:
> > > > >>>So, for example, a screwed up on-disk directory structure shouldn't
> > > > >>>result in creating a cycle in the dcache and then deadlocking.
> > > > >>
> > > > >>Therein lies the problem: how do you detect such structural defects
> > > > >>without doing a full structure validation?
> > > > >
> > > > >You can prevent cycles in a graph if you can prevent adding an edge
> > > > >which would be part of a cycle.
> > > > >
> > > > Except if the user can write to the filesystem's backing storage (be
> > > > it a device or a file), and has sufficient knowledge of the on-disk
> > > > structures, they can create all the cycles they want in the
> > > > metadata. So unless the kernel builds the graph internally by
> > > > parsing the metadata _and_ has some way to detect that the on-disk
> > > > metadata has hit a cycle (which may not just involve 2 items),
> > >
> > > Understood. Again, see the d_ancestor call in d_splice_alias, this is
> > > exactly what it checks for.
> >
> > But that only addresses one type of loop in one specific metadata
> > structure.
>
> Yep, agreed!
>
> > There's plenty of other ways you could construct metadata
> > loops that are essentially undetected and result in either deadlock
> > or livelock within the filesystem code itself. e.g. just make btree
> > sibling pointers loop over a range of entries that have the same
> > index key (e.g. free space extents of the same size). If allocation
> > then falls into this loop, the kernel will just spin searching the
> > same blocks for something it will never find. Such resource
> > consumption attacks are trivial to construct but extremely difficult
> > to detect because they exploit normal behaviour of the structure and
> > algorithms by mangling trusted pointers.
>
> Interesting example, thanks! I doubt this particular example would be
> *that* hard to detect?

Yes, it can be detected, but it's not as easy as it sounds because
of abstractions between tree walking and record parsing.

> But understood that there may be lots of others.

Yeah, that's just one of many, many ways I can think of modifying
on-disk structures to screw up the kernel.

Cheers,

Dave.
--
Dave Chinner
[email protected]
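
To illustrate the sibling-pointer example in isolation: if the chain is viewed as nothing more than an array of "right sibling" indices, a loop can be found with the classic two-pointer (Floyd) walk. This is a simplified sketch and not how any filesystem does it; sibling_chain_has_cycle() and NO_SIBLING are hypothetical names, and the model ignores exactly the layering between tree walking and record parsing that makes the real check awkward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_SIBLING ((size_t)-1)

/*
 * right[i] holds the index of block i's right sibling, or NO_SIBLING.
 * Out-of-range indices are treated as the end of the chain.
 * Returns true if following siblings from start never terminates.
 */
bool sibling_chain_has_cycle(const size_t *right, size_t nblocks, size_t start)
{
	size_t slow = start;
	size_t fast = start;

	assert(right != NULL);

	while (fast < nblocks) {
		fast = right[fast];		/* first step of the fast walker */
		if (fast >= nblocks)
			break;			/* reached the end: no cycle */
		fast = right[fast];		/* second step of the fast walker */
		slow = right[slow];		/* single step of the slow walker */
		if (fast == slow)
			return true;		/* walkers met inside a loop */
	}
	return false;
}
```

The check costs one extra pointer and no extra memory, but it has to be applied at every place a chain is walked, which is where the difficulty described above comes from.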

2015-07-24 15:11:48

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On Thu, Jul 23, 2015 at 11:23:31AM -0500, Seth Forshee wrote:
> On Thu, Jul 23, 2015 at 11:36:03AM -0400, Stephen Smalley wrote:
> > On 07/23/2015 10:39 AM, Seth Forshee wrote:
> > > On Thu, Jul 23, 2015 at 09:57:20AM -0400, Stephen Smalley wrote:
> > >> On 07/22/2015 04:40 PM, Stephen Smalley wrote:
> > >>> On 07/22/2015 04:25 PM, Stephen Smalley wrote:
> > >>>> On 07/22/2015 12:14 PM, Seth Forshee wrote:
> > >>>>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
> > >>>>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
> > >>>>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
> > >>>>>>>> Unprivileged users should not be able to supply security labels
> > >>>>>>>> in filesystems, nor should they be able to supply security
> > >>>>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
> > >>>>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
> > >>>>>>>> and return EPERM if any contexts are supplied in the mount
> > >>>>>>>> options.
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Seth Forshee <[email protected]>
> > >>>>>>>
> > >>>>>>> I think this is obsoleted by the subsequent discussion, but just for the
> > >>>>>>> record: this patch would cause the files in the userns mount to be left
> > >>>>>>> with the "unlabeled" label, and therefore under typical policies,
> > >>>>>>> completely inaccessible to any process in a confined domain.
> > >>>>>>
> > >>>>>> The right way to handle this for SELinux would be to automatically use
> > >>>>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
> > >>>>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
> > >>>>>> from some related object (e.g. the block device file context, as in your
> > >>>>>> patches for Smack). That will cause SELinux to use that value instead
> > >>>>>> of any xattr value from the filesystem and will cause attempts by
> > >>>>>> userspace to set the security.selinux xattr to fail on that filesystem.
> > >>>>>> That is how SELinux normally deals with untrusted filesystems, except
> > >>>>>> that it is normally specified as a mount option by a trusted mounting
> > >>>>>> process, whereas in your case you need to automatically set it.
> > >>>>>
> > >>>>> Excellent, thank you for the advice. I'll start on this when I've
> > >>>>> finished with Smack.
> > >>>>
> > >>>> Not tested, but something like this should work. Note that it should
> > >>>> come after the call to security_fs_use() so we know whether SELinux
> > >>>> would even try to use xattrs supplied by the filesystem in the first place.
> > >>>>
> > >>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > >>>> index 564079c..84da3a2 100644
> > >>>> --- a/security/selinux/hooks.c
> > >>>> +++ b/security/selinux/hooks.c
> > >>>> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> > >>>> goto out;
> > >>>> }
> > >>>> }
> > >>>> +
> > >>>> + /*
> > >>>> + * If this is a user namespace mount, no contexts are allowed
> > >>>> + * on the command line and security labels must be ignored.
> > >>>> + */
> > >>>> + if (sb->s_user_ns != &init_user_ns) {
> > >>>> + if (context_sid || fscontext_sid || rootcontext_sid ||
> > >>>> + defcontext_sid) {
> > >>>> + rc = -EACCES;
> > >>>> + goto out;
> > >>>> + }
> > >>>> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
> > >>>> + struct block_device *bdev = sb->s_bdev;
> > >>>> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
> > >>>> + if (bdev) {
> > >>>> + struct inode_security_struct *isec =
> > >>>> bdev->bd_inode;
> > >>>
> > >>> That should be bdev->bd_inode->i_security.
> > >>
> > >> Sorry, this won't work. bd_inode is not the inode of the block device
> > >> file that was passed to mount, and it isn't labeled in any way. It will
> > >> just be unlabeled.
> > >>
> > >> So I guess the only real option here as a fallback is
> > >> sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
> > >> only case where we currently assign task labels to files is for their
> > >> /proc/pid inodes, and no current policy will therefore allow create
> > >> permission to such files.
> > >
> > > Darn, you're right, that isn't the inode we want. There really doesn't
> > > seem to be any way to get back to the one we want from the LSM, short of
> > > adding a new hook.
> >
> > Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
> > Feels like a layering violation though...
>
> Yeah, and even though that probably works out to be the inode we want in
> most cases I don't think we can be absolutely certain that it is. Maybe
> there's some way we could walk the list and be sure we've found the
> right inode, but I'm not seeing it.

I guess we could do something like this (note that most of the changes
here are just to give a version of blkdev_get_by_path which takes a
struct path * so that the filename lookup doesn't have to be done
twice). Basically add a new hook that informs the security module of the
inode for the backing device file passed to mount and call that from
mount_bdev. The security module could grab a reference to the inode and
stash it away.

Something else to note is that, as I have it here, the hook would end up
getting called for every mount of a given block device, not just the
first. So it's possible the security module could see the hook called a
second time with a different inode that has a different label. The hook
could be changed to return int if you wanted to have the opportunity to
reject such mounts.
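
To make the int-returning variant concrete, here is a minimal userspace
sketch of what a security module might do with such a hook. It is only a
model: the model_* types and the label strings are made up for
illustration and stand in for the kernel structures, and rejecting on a
label mismatch is just one possible policy, not something the patch
below implements.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for kernel objects; not the real kernel types. */
struct model_inode {
	const char *label;	/* security label of the device node */
};

struct model_sb_security {
	const struct model_inode *backing;	/* inode stashed on first mount */
};

/*
 * Hypothetical int-returning variant of the sb_backing_dev hook.
 * The first call records the inode; later calls succeed only if the
 * device node presented carries the same label.
 */
int model_sb_backing_dev(struct model_sb_security *sbsec,
			 const struct model_inode *inode)
{
	assert(sbsec && inode && inode->label);

	if (!sbsec->backing) {
		sbsec->backing = inode;
		return 0;
	}
	if (strcmp(sbsec->backing->label, inode->label) != 0)
		return -EACCES;
	return 0;
}
```

A real implementation would also have to take and drop a reference on
the stashed inode, which this model leaves out.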

Seth

---

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f8ce371c437c..dc2173e24e30 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1372,14 +1372,39 @@ int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder)
}
EXPORT_SYMBOL(blkdev_get);

+static struct block_device *__lookup_bdev(struct path *path);
+
+struct block_device * __blkdev_get_by_path(struct path *path, fmode_t mode,
+ void *holder)
+{
+ struct block_device *bdev;
+ int err;
+
+ bdev = __lookup_bdev(path);
+ if (IS_ERR(bdev))
+ return bdev;
+
+ err = blkdev_get(bdev, mode, holder);
+ if (err)
+ return ERR_PTR(err);
+
+ if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
+ blkdev_put(bdev, mode);
+ return ERR_PTR(-EACCES);
+ }
+
+ return bdev;
+}
+EXPORT_SYMBOL(__blkdev_get_by_path);
+
/**
* blkdev_get_by_path - open a block device by name
- * @path: path to the block device to open
+ * @pathname: path to the block device to open
* @mode: FMODE_* mask
* @holder: exclusive holder identifier
*
- * Open the blockdevice described by the device file at @path. @mode
- * and @holder are identical to blkdev_get().
+ * Open the blockdevice described by the device file at @pathname.
+ * @mode and @holder are identical to blkdev_get().
*
* On success, the returned block_device has reference count of one.
*
@@ -1389,25 +1414,22 @@ EXPORT_SYMBOL(blkdev_get);
* RETURNS:
* Pointer to block_device on success, ERR_PTR(-errno) on failure.
*/
-struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
+struct block_device *blkdev_get_by_path(const char *pathname, fmode_t mode,
void *holder)
{
struct block_device *bdev;
- int err;
-
- bdev = lookup_bdev(path);
- if (IS_ERR(bdev))
- return bdev;
+ struct path path;
+ int error;

- err = blkdev_get(bdev, mode, holder);
- if (err)
- return ERR_PTR(err);
+ if (!pathname || !*pathname)
+ return ERR_PTR(-EINVAL);

- if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
- blkdev_put(bdev, mode);
- return ERR_PTR(-EACCES);
- }
+ error = kern_path(pathname, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);

+ bdev = __blkdev_get_by_path(&path, mode, holder);
+ path_put(&path);
return bdev;
}
EXPORT_SYMBOL(blkdev_get_by_path);
@@ -1702,6 +1724,30 @@ int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)

EXPORT_SYMBOL(ioctl_by_bdev);

+static struct block_device *__lookup_bdev(struct path *path)
+{
+ struct block_device *bdev;
+ struct inode *inode;
+ int error;
+
+ inode = d_backing_inode(path->dentry);
+ error = -ENOTBLK;
+ if (!S_ISBLK(inode->i_mode))
+ goto fail;
+ error = -EACCES;
+ if (!may_open_dev(path))
+ goto fail;
+ error = -ENOMEM;
+ bdev = bd_acquire(inode);
+ if (!bdev)
+ goto fail;
+out:
+ return bdev;
+fail:
+ bdev = ERR_PTR(error);
+ goto out;
+}
+
/**
* lookup_bdev - lookup a struct block_device by name
* @pathname: special file representing the block device
@@ -1713,7 +1759,6 @@ EXPORT_SYMBOL(ioctl_by_bdev);
struct block_device *lookup_bdev(const char *pathname)
{
struct block_device *bdev;
- struct inode *inode;
struct path path;
int error;

@@ -1724,23 +1769,9 @@ struct block_device *lookup_bdev(const char *pathname)
if (error)
return ERR_PTR(error);

- inode = d_backing_inode(path.dentry);
- error = -ENOTBLK;
- if (!S_ISBLK(inode->i_mode))
- goto fail;
- error = -EACCES;
- if (!may_open_dev(&path))
- goto fail;
- error = -ENOMEM;
- bdev = bd_acquire(inode);
- if (!bdev)
- goto fail;
-out:
+ bdev = __lookup_bdev(&path);
path_put(&path);
return bdev;
-fail:
- bdev = ERR_PTR(error);
- goto out;
}
EXPORT_SYMBOL(lookup_bdev);

diff --git a/fs/super.c b/fs/super.c
index 008f938e3ec0..558f7845a171 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <linux/namei.h>
#include "internal.h"


@@ -980,15 +981,26 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
{
struct block_device *bdev;
struct super_block *s;
+ struct path path;
+ struct inode *inode;
fmode_t mode = FMODE_READ | FMODE_EXCL;
int error = 0;

if (!(flags & MS_RDONLY))
mode |= FMODE_WRITE;

- bdev = blkdev_get_by_path(dev_name, mode, fs_type);
- if (IS_ERR(bdev))
- return ERR_CAST(bdev);
+ if (!dev_name || !*dev_name)
+ return ERR_PTR(-EINVAL);
+
+ error = kern_path(dev_name, LOOKUP_FOLLOW, &path);
+ if (error)
+ return ERR_PTR(error);
+
+ bdev = __blkdev_get_by_path(&path, mode, fs_type);
+ if (IS_ERR(bdev)) {
+ error = PTR_ERR(bdev);
+ goto error;
+ }

/*
* once the super is inserted into the list by sget, s_umount
@@ -1040,6 +1052,10 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
bdev->bd_super = s;
}

+ inode = d_backing_inode(path.dentry);
+ security_sb_backing_dev(s, inode);
+ path_put(&path);
+
return dget(s->s_root);

error_s:
@@ -1047,6 +1063,7 @@ error_s:
error_bdev:
blkdev_put(bdev, mode);
error:
+ path_put(&path);
return ERR_PTR(error);
}
EXPORT_SYMBOL(mount_bdev);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4597420ab933..3748945bf0d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2315,6 +2315,8 @@ extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
extern int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
extern int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder);
+extern struct block_device *__blkdev_get_by_path(struct path *path, fmode_t mode,
+ void *holder);
extern struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
void *holder);
extern struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 9429f054c323..52ce1a094e04 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1351,6 +1351,7 @@ union security_list_options {
int (*sb_clone_mnt_opts)(const struct super_block *oldsb,
struct super_block *newsb);
int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+ void (*sb_backing_dev)(struct super_block *sb, struct inode *inode);
int (*dentry_init_security)(struct dentry *dentry, int mode,
struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -1648,6 +1649,7 @@ struct security_hook_heads {
struct list_head sb_set_mnt_opts;
struct list_head sb_clone_mnt_opts;
struct list_head sb_parse_opts_str;
+ struct list_head sb_backing_dev;
struct list_head dentry_init_security;
#ifdef CONFIG_SECURITY_PATH
struct list_head path_unlink;
diff --git a/include/linux/security.h b/include/linux/security.h
index 79d85ddf8093..7a4d8382af20 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -231,6 +231,7 @@ int security_sb_set_mnt_opts(struct super_block *sb,
int security_sb_clone_mnt_opts(const struct super_block *oldsb,
struct super_block *newsb);
int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+void security_sb_backing_dev(struct super_block *sb, struct inode *inode);
int security_dentry_init_security(struct dentry *dentry, int mode,
struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -562,6 +563,10 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
return 0;
}

+static inline void security_sb_backing_dev(struct super_block *sb,
+ struct inode *inode)
+{ }
+
static inline int security_inode_alloc(struct inode *inode)
{
return 0;
diff --git a/security/security.c b/security/security.c
index 062f3c997fdc..f6f89e0f06d8 100644
--- a/security/security.c
+++ b/security/security.c
@@ -347,6 +347,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
}
EXPORT_SYMBOL(security_sb_parse_opts_str);

+void security_sb_backing_dev(struct super_block *sb, struct inode *inode)
+{
+ call_void_hook(sb_backing_dev, sb, inode);
+}
+
int security_inode_alloc(struct inode *inode)
{
inode->i_security = NULL;
@@ -1595,6 +1600,8 @@ struct security_hook_heads security_hook_heads = {
LIST_HEAD_INIT(security_hook_heads.sb_clone_mnt_opts),
.sb_parse_opts_str =
LIST_HEAD_INIT(security_hook_heads.sb_parse_opts_str),
+ .sb_backing_dev =
+ LIST_HEAD_INIT(security_hook_heads.sb_backing_dev),
.dentry_init_security =
LIST_HEAD_INIT(security_hook_heads.dentry_init_security),
#ifdef CONFIG_SECURITY_PATH

2015-07-28 20:40:31

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > This is what I currently think you want for user ns mounts:
> >
> > 1. smk_root and smk_default are assigned the label of the backing
> > device.
> > 2. s_root is assigned the transmute property.
> > 3. For existing files:
> > a. Files with the same label as the backing device are accessible.
> > b. Files with any other label are not accessible.
>
> That's right. Accept correct data, reject anything that's not right.
>
> > If this is right, there are a couple lingering questions in my mind.
> >
> > First, what happens with files created in directories with the same
> > label as the backing device but without the transmute property set? The
> > inode for the new file will initially be labeled with smk_of_current(),
> > but then during d_instantiate it will get smk_default and thus end up
> > with the label we want. So that seems okay.
>
> Yes.
>
> > The second is whether files with the SMACK64EXEC attribute are still a
> > problem. It seems it is, for files with the same label as the backing
> > store at least. I think we can simply skip the code that reads out this
> > xattr and sets smk_task for user ns mounts, or else skip assigning the
> > label to the new task in bprm_set_creds. The latter seems more
> > consistent with the approach you've suggested for dealing with labels
> > from disk.
>
> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> smack_d_instantiate for unprivileged mounts would do the trick.
>
> > So I guess all of that seems okay, though perhaps a bit restrictive
> > given that the user who mounted the filesystem already has full access
> > to the backing store.
>
> In truth, there is no reason to expect that the "user" who did the
> mount will ever have a Smack label that differs from the label of
> the backing store. If what we've got here seems restrictive, it's
> because you've got access from someone other than the "user".
>
> > Please let me know whether or not this matches up with what you are
> > thinking, then I can proceed with the implementation.
>
> My current mindset is that, if you're going to allow unprivileged
> mounts of user defined backing stores, this is as safe as we can
> make it.

All right, I've got a patch which I think does this, and I've managed to
do some testing to confirm that it behaves like I expect. How does this
look?

What's missing is getting the label from the block device inode; as
Stephen discovered the inode that I thought we could get the label from
turned out to be the wrong one. Afaict we would need a new hook in order
to do that, so for now I'm using the label of the process calling
mount.
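
The access rule can be modeled in userspace roughly as follows. The
model_* names are hypothetical stand-ins for the kernel structures; only
the pointer comparison against smk_root mirrors the
smack_inode_permission change in the patch below, and the regular Smack
rule evaluation is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; not the real Smack types. */
struct model_label {
	const char *name;
};

struct model_sb {
	bool in_userns;				/* stands in for sb_in_userns(sb) */
	const struct model_label *smk_root;	/* label assigned at mount time */
};

struct model_inode {
	const struct model_sb *sb;
	const struct model_label *label;	/* stands in for smk_of_inode() */
};

/*
 * On a user namespace mount, only inodes carrying the superblock's
 * root label are accessible; everything else gets -EACCES.
 */
int model_inode_permission(const struct model_inode *inode, int mask)
{
	assert(inode && inode->sb);

	if (mask == 0)
		return 0;
	if (inode->sb->in_userns && inode->label != inode->sb->smk_root)
		return -EACCES;
	return 0;	/* the normal Smack rule check would run here */
}
```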

---

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index a143328f75eb..8e631a66b03c 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
skp = smk_of_current();
sp->smk_root = skp;
sp->smk_default = skp;
+ if (sb_in_userns(sb))
+ transmute = 1;
}
/*
* Initialize the root inode.
@@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
if (mask == 0)
return 0;

+ if (sb_in_userns(inode->i_sb)) {
+ struct superblock_smack *sbsp = inode->i_sb->s_security;
+ if (smk_of_inode(inode) != sbsp->smk_root)
+ return -EACCES;
+ }
+
/* May be droppable after audit */
if (no_block)
return -ECHILD;
@@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
if (rc >= 0)
transflag = SMK_INODE_TRANSMUTE;
}
- /*
- * Don't let the exec or mmap label be "*" or "@".
- */
- skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
- if (IS_ERR(skp) || skp == &smack_known_star ||
- skp == &smack_known_web)
- skp = NULL;
- isp->smk_task = skp;
+ if (!sb_in_userns(inode->i_sb)) {
+ /*
+ * Don't let the exec or mmap label be "*" or "@".
+ */
+ skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
+ if (IS_ERR(skp) || skp == &smack_known_star ||
+ skp == &smack_known_web)
+ skp = NULL;
+ isp->smk_task = skp;
+ }

skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
if (IS_ERR(skp) || skp == &smack_known_star ||

2015-07-29 16:04:55

by Serge E. Hallyn

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Thu, Jul 16, 2015 at 12:04:43AM -0500, Eric W. Biederman wrote:
> > I tend to think that, if we're not honoring the fcaps, we shouldn't be
> > honoring the setuid bit either. After all, it's really not a trusted
> > file, even though the only user who could have messed with it really
> > is the apparent owner.
>
> For the file caps we can't honor them because you don't have the bits
> in struct cred.
>
> For setuid we can honor it, and setuid is something that the user
> namespace allows.

Setuid is something explicitly tied to the user id. File capabilities
are MAC, that is, explicitly orthogonal to user id. So 100% agreed with
honoring setuid in user_ns and, for now, ignoring file caps.

As I've mentioned a few times privately, I'm intending to implement
user-namespaced file capabilities as a new xattr. Design is not 100%
nailed down, but probably it would support a set of userns_fcaps, each
of which lists the k_uid of the root user in the namespace assigning the
filecaps, followed by three sets. Then when exec()ing the file, if
the current->userns->root user has a userns_fcap entry, or there is a -1
entry, then use that, else use nothing. I think this is a very important
thing to support, to remove a barrier to shipping packages with software
using filecaps. Without this, any package, say ping, which wants to
support being installed in an (unprivileged) container would need to also
support use without filecaps, meaning that will likely be the only
supported mode.
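
A rough userspace sketch of the lookup described above, since the design
is not nailed down. Everything here is an assumption made for
illustration: the struct layout, the names of the three sets, the use of
(uint32_t)-1 as the wildcard, and preferring an exact match over the
wildcard entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FCAP_ANY_ROOT ((uint32_t)-1)	/* the "-1 entry" */

/* One hypothetical userns_fcap entry as it might be stored in the xattr. */
struct model_userns_fcap {
	uint32_t root_kuid;	/* k_uid of the namespace root that set it */
	uint64_t permitted;
	uint64_t inheritable;
	uint64_t effective;
};

/*
 * Pick the entry to apply at exec time for a task whose user namespace
 * root has kuid ns_root_kuid. Returns NULL when nothing applies, in
 * which case no file capabilities are used.
 */
const struct model_userns_fcap *
model_pick_fcap(const struct model_userns_fcap *entries, size_t n,
		uint32_t ns_root_kuid)
{
	const struct model_userns_fcap *wildcard = NULL;
	size_t i;

	assert(entries || n == 0);

	for (i = 0; i < n; i++) {
		if (entries[i].root_kuid == ns_root_kuid)
			return &entries[i];
		if (entries[i].root_kuid == MODEL_FCAP_ANY_ROOT && !wildcard)
			wildcard = &entries[i];
	}
	return wildcard;
}
```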

-serge

2015-07-29 16:18:10

by Serge E. Hallyn

Subject: Re: [PATCH 3/7] fs: Ignore file caps in mounts from other user namespaces

On Wed, Jul 29, 2015 at 11:04:50AM -0500, Serge E. Hallyn wrote:
> On Thu, Jul 16, 2015 at 12:04:43AM -0500, Eric W. Biederman wrote:
> > > I tend to think that, if we're not honoring the fcaps, we shouldn't be
> > > honoring the setuid bit either. After all, it's really not a trusted
> > > file, even though the only user who could have messed with it really
> > > is the apparent owner.
> >
> > For the file caps we can't honor them because you don't have the bits
> > in struct cred.
> >
> > For setuid we can honor it, and setuid is something that the user
> > namespace allows.
>
> Setuid is something explicitly tied to the user id. File capabilities
> are MAC, that is, explicitly orthogonal to user id. So 100% agreed with
> honoring setuid in user_ns and, for now, ignoring file caps.

Hm. No. Seems like both should be fine when current is in the mounter's
user_ns, and ignored otherwise.
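
As a sketch of that rule, modeled in userspace: the model_user_ns type
is a made-up stand-in for struct user_namespace, and counting
descendants of the mounter's namespace as being "in" it is an assumption
rather than something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct user_namespace; parent is NULL for the initial ns. */
struct model_user_ns {
	const struct model_user_ns *parent;
};

/* True if ns is target or one of target's descendants. */
bool model_in_userns(const struct model_user_ns *ns,
		     const struct model_user_ns *target)
{
	assert(target);

	for (; ns; ns = ns->parent)
		if (ns == target)
			return true;
	return false;
}

/*
 * Honor setuid bits and file caps only when the task doing the exec is
 * inside the user namespace that owns the superblock.
 */
bool model_honor_suid_and_fcaps(const struct model_user_ns *current_ns,
				const struct model_user_ns *sb_user_ns)
{
	return model_in_userns(current_ns, sb_user_ns);
}
```

With this model a task in the initial namespace that execs a setuid
binary from a container's mount gets no privilege change, while tasks
inside the container still do.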

(The below is still needed :)

> As I've mentioned a few times privately, I'm intending to implement
> user-namespaced file capabilities as a new xattr. Design is not 100%
> nailed down, but probably it would support a set of userns_fcaps, each
> of which lists the k_uid of the root user in the namespace assigning the
> filecaps, followed by three sets. Then when exec()ing the file, if
> the current->userns->root user has a userns_fcap entry, or there is a -1
> entry, then use that, else use nothing. I think this is a very important
> thing to support, to remove a barrier to shipping packages with software
> using filecaps. Without this, any package, say ping, which wants to
> support being installed in an (unprivileged) container would need to also
> support use without filecaps, meaning that will likely be the only
> supported mode.
>
> -serge

2015-07-30 04:24:15

by Amir Goldstein

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
<[email protected]> wrote:
>
> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > This is what I currently think you want for user ns mounts:
> > >
> > > 1. smk_root and smk_default are assigned the label of the backing
> > > device.

Seth,

There were 2 main concerns discussed in this thread:
1. trusting LSM labels outside the namespace
2. trusting the content of the image file/loopdev

While your approach addresses the first concern, I suspect it may be placing
an obstacle in the way of resolving the second concern.

A viable security policy to mitigate the second concern could be:
- Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
- Allow mount only of 'Loopback' images

This should allow the system as a whole to trust unprivileged mounts based on
the trust of the entities that had raw access to the fs layout.

Alas, if you choose to propagate the backing dev label to contained files,
they would all share the designated 'Loopback' label and render the policy above
useless.
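
One possible reading of this concern, as a small userspace model. The
label names other than 'Loopback' and all of the model_* functions are
hypothetical, and the rule for non-Loopback objects is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels; only 'Loopback' is named in the policy above. */
enum model_label {
	LABEL_USER,		/* an ordinary unprivileged task or file */
	LABEL_TRUSTED_TOOL,	/* e.g. mkfs or fsck as labeled by the admin */
	LABEL_LOOPBACK,		/* image files that may be mounted */
};

/* Rule 1: only trusted tools may write to 'Loopback' objects. */
bool model_may_write(enum model_label subject, enum model_label object)
{
	if (object == LABEL_LOOPBACK)
		return subject == LABEL_TRUSTED_TOOL;
	return subject == object;	/* simplification for other labels */
}

/* Rule 2: only 'Loopback' images may be mounted. */
bool model_may_mount(enum model_label image)
{
	return image == LABEL_LOOPBACK;
}

/* Propagation: files in the mount take the label of the backing image. */
enum model_label model_label_in_mount(enum model_label backing)
{
	assert(model_may_mount(backing));	/* only mounted images propagate */
	return backing;
}
```

Under these assumptions every file inside the mount is itself labeled
'Loopback', so rule 1 applies to ordinary file writes as well, and an
unprivileged user could mount the image but not modify anything in it.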

Any thoughts on how to reconcile this conflict?

Amir.



2015-07-30 13:55:55

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
> <[email protected]> wrote:
> >
> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > This is what I currently think you want for user ns mounts:
> > > >
> > > > 1. smk_root and smk_default are assigned the label of the backing
> > > > device.
>
> Seth,
>
> There were 2 main concerns discussed in this thread:
> 1. trusting LSM labels outside the namespace
> 2. trusting the content of the image file/loopdev
>
> While your approach addresses the first concern, I suspect it may be placing
> an obstacle in the way of resolving the second concern.
>
> A viable security policy to mitigate the second concern could be:
> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> - Allow mount only of 'Loopback' images
>
> This should allow the system as a whole to trust unprivileged mounts based on
> the trust of the entities that had raw access to the fs layout.

You don't really say what you mean by "trusted" programs. In a container
context I'd have to assume that you mean suid-root or similar programs
shared into the container by the host. In that case is any new kernel
functionality even required?

That also doesn't work for some of our use cases, where we'd like to be
able to do something like "mount -o loop foo.img /mnt/foo" in an
unprivileged container where foo.img is not created on the local machine
and not fully under control of the host environment.

Agreed though that the "attack from below" problem for untrusted
filesystems is still an open question. At minimum we have fuse, which
has been designed to protect against this threat. Others have mentioned
on this thread that Ted had said something at kernel summit last year
about being willing to support ext4 mounts from unprivileged user
namespaces as well. I've added Ted to the Cc in case he wants to confirm
or deny this rumor.

> Alas, if you choose to propagate the backing dev label to contained files,
> they would all share the designated 'Loopback' label and render the policy above
> useless.
>
> Any thoughts on how to reconcile this conflict?

I'm not seeing what the conflict is here - nothing you proposed says
anything about security labels in the filesystem, and nothing would
prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
label was desired on the backing device. Care to elaborate?

Seth

2015-07-30 13:57:23

by Serge Hallyn

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Quoting Amir Goldstein ([email protected]):
> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
> <[email protected]> wrote:
> >
> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > This is what I currently think you want for user ns mounts:
> > > >
> > > > 1. smk_root and smk_default are assigned the label of the backing
> > > > device.
>
> Seth,
>
> There were 2 main concerns discussed in this thread:
> 1. trusting LSM labels outside the namespace
> 2. trusting the content of the image file/loopdev
>
> While your approach addresses the first concern, I suspect it may be placing
> an obstacle in the way of resolving the second concern.
>
> A viable security policy to mitigate the second concern could be:
> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> - Allow mount only of 'Loopback' images
>
> This should allow the system as a whole to trust unprivileged mounts based on
> the trust of the entities that had raw access to the fs layout.

Just to be sure I understand right, you're looking for a way to let
the host admin trust that the kernel's superblock parsers aren't being
fed trash or an exploit?

> Alas, if you choose to propagate the backing dev label to contained files,
> they would all share the designated 'Loopback' label and render the policy above
> useless.
>
> Any thoughts on how to reconcile this conflict?
>
> Amir.
>
>
> > > > 2. s_root is assigned the transmute property.
> > > > 3. For existing files:
> > > > a. Files with the same label as the backing device are accessible.
> > > > b. Files with any other label are not accessible.
> > >
> > > That's right. Accept correct data, reject anything that's not right.
> > >
> > > > If this is right, there are a couple lingering questions in my mind.
> > > >
> > > > First, what happens with files created in directories with the same
> > > > label as the backing device but without the transmute property set? The
> > > > inode for the new file will initially be labeled with smk_of_current(),
> > > > but then during d_instantiate it will get smk_default and thus end up
> > > > with the label we want. So that seems okay.
> > >
> > > Yes.
> > >
> > > > The second is whether files with the SMACK64EXEC attribute is still a
> > > > problem. It seems it is, for files with the same label as the backing
> > > > store at least. I think we can simply skip the code that reads out this
> > > > xattr and sets smk_task for user ns mounts, or else skip assigning the
> > > > label to the new task in bprm_set_creds. The latter seems more
> > > > consistent with the approach you've suggested for dealing with labels
> > > > from disk.
> > >
> > > Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> > > smack_d_instantiate for unprivileged mounts would do the trick.
> > >
> > > > So I guess all of that seems okay, though perhaps a bit restrictive
> > > > given that the user who mounted the filesystem already has full access
> > > > to the backing store.
> > >
> > > In truth, there is no reason to expect that the "user" who did the
> > > mount will ever have a Smack label that differs from the label of
> > > the backing store. If what we've got here seems restrictive, it's
> > > because you've got access from someone other than the "user".
> > >
> > > > Please let me know whether or not this matches up with what you are
> > > > thinking, then I can procede with the implementation.
> > >
> > > My current mindset is that, if you're going to allow unprivileged
> > > mounts of user defined backing stores, this is as safe as we can
> > > make it.
> >
> > All right, I've got a patch which I think does this, and I've managed to
> > do some testing to confirm that it behaves like I expect. How does this
> > look?
> >
> > What's missing is getting the label from the block device inode; as
> > Stephen discovered the inode that I thought we could get the label from
> > turned out to be the wrong one. Afaict we would need a new hook in order
> > to do that, so for now I'm using the label of the proccess calling
> > mount.
> >
> > ---
> >
> > diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> > index a143328f75eb..8e631a66b03c 100644
> > --- a/security/smack/smack_lsm.c
> > +++ b/security/smack/smack_lsm.c
> > @@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> > skp = smk_of_current();
> > sp->smk_root = skp;
> > sp->smk_default = skp;
> > + if (sb_in_userns(sb))
> > + transmute = 1;
> > }
> > /*
> > * Initialize the root inode.
> > @@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
> > if (mask == 0)
> > return 0;
> >
> > + if (sb_in_userns(inode->i_sb)) {
> > + struct superblock_smack *sbsp = inode->i_sb->s_security;
> > + if (smk_of_inode(inode) != sbsp->smk_root)
> > + return -EACCES;
> > + }
> > +
> > /* May be droppable after audit */
> > if (no_block)
> > return -ECHILD;
> > @@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
> > if (rc >= 0)
> > transflag = SMK_INODE_TRANSMUTE;
> > }
> > - /*
> > - * Don't let the exec or mmap label be "*" or "@".
> > - */
> > - skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> > - if (IS_ERR(skp) || skp == &smack_known_star ||
> > - skp == &smack_known_web)
> > - skp = NULL;
> > - isp->smk_task = skp;
> > + if (!sb_in_userns(inode->i_sb)) {
> > + /*
> > + * Don't let the exec or mmap label be "*" or "@".
> > + */
> > + skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> > + if (IS_ERR(skp) || skp == &smack_known_star ||
> > + skp == &smack_known_web)
> > + skp = NULL;
> > + isp->smk_task = skp;
> > + }
> >
> > skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
> > if (IS_ERR(skp) || skp == &smack_known_star ||
> > --
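The access rule in the quoted Smack patch can be modelled in userspace as a minimal sketch. The struct and function names below are hypothetical stand-ins, not kernel API, and the model compares label strings where the kernel compares interned label pointers. It only illustrates the two behaviours the patch adds for user namespace mounts: inodes whose label differs from smk_root are rejected, and the SMACK64EXEC xattr is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel structures in the patch. */
struct sb_model {
	bool in_userns;        /* models sb_in_userns(sb) */
	const char *smk_root;  /* label given to the superblock at mount time
	                        * (the mounting process's label in this patch) */
};

struct inode_model {
	const struct sb_model *sb;
	const char *label;     /* label as read from the filesystem */
};

/*
 * Models the check added to smack_inode_permission(): on a user namespace
 * mount, an inode whose label differs from smk_root is rejected outright.
 * Returns 0 when the extra check passes, -1 (standing in for -EACCES) if not.
 */
int userns_label_check(const struct inode_model *inode)
{
	assert(inode != NULL && inode->sb != NULL);
	if (!inode->sb->in_userns)
		return 0;
	return strcmp(inode->label, inode->sb->smk_root) == 0 ? 0 : -1;
}

/*
 * Models the smack_d_instantiate() change: the SMACK64EXEC value is only
 * honoured on mounts that are not owned by a user namespace.
 */
const char *effective_exec_label(const struct inode_model *inode,
				 const char *exec_xattr)
{
	assert(inode != NULL && inode->sb != NULL);
	return inode->sb->in_userns ? NULL : exec_xattr;
}
```

In this model a file labeled "Other" on a user namespace mount whose smk_root is "User" fails the check, while the same file on an init namespace mount falls through to the normal Smack rules.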

2015-07-30 14:47:08

by Amir Goldstein

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
<[email protected]> wrote:
>
> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
> > On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
> > <[email protected]> wrote:
> > >
> > > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> > > > > This is what I currently think you want for user ns mounts:
> > > > >
> > > > > 1. smk_root and smk_default are assigned the label of the backing
> > > > > device.
> >
> > Seth,
> >
> > There were 2 main concerns discussed in this thread:
> > 1. trusting LSM labels outside the namespace
> > 2. trusting the content of the image file/loopdev
> >
> > While your approach addresses the first concern, I suspect it may be placing
> > an obstacle in the way of resolving the second concern.
> >
> > A viable security policy to mitigate the second concern could be:
> > - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
> > - Allow mount only of 'Loopback' images
> >
> > This should allow the system as a whole to trust unprivileged mounts based on
> > the trust of the entities that had raw access to the fs layout.
>
> You don't really say what you mean by "trusted" programs. In a container
> context I'd have to assume that you mean suid-root or similar programs
> shared into the container by the host. In that case is any new kernel
> functionality even required?

Sorry I was not clear. I will try to explain better.
I meant that the programs are "trusted" by the LSM security policy.
I envisioned a system where an unprivileged user is allowed to spawn
a container which contains "trusted" programs (e.g. mkfs) that are labeled
as 'FileSystemTools' by the admin of the host.
FileSystemTools are allowed to write into Loopback labeled files.
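The two rules of this policy can be written down as a small userspace sketch. The label names 'FileSystemTools' and 'Loopback' come from the message; the function names are hypothetical, and a real implementation would be expressed as LSM policy rules rather than C code.

```c
#include <stdbool.h>
#include <string.h>

/* Label names taken from the message above; everything else is hypothetical. */
#define LABEL_TOOLS    "FileSystemTools"
#define LABEL_LOOPBACK "Loopback"

/* Rule 1: only programs labeled FileSystemTools may write to Loopback images. */
bool may_write_image(const char *subject_label, const char *image_label)
{
	if (strcmp(image_label, LABEL_LOOPBACK) != 0)
		return true;  /* outside this sketch; other rules would apply */
	return strcmp(subject_label, LABEL_TOOLS) == 0;
}

/* Rule 2: only Loopback-labeled images may be mounted without privilege. */
bool may_mount_image(const char *image_label)
{
	return strcmp(image_label, LABEL_LOOPBACK) == 0;
}
```

Under these assumptions a shell labeled "User" cannot modify a 'Loopback' image directly, so every byte of a mountable image has passed through mkfs or fsck.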

>
> That also doesn't work for some of our use cases, where we'd like to be
> able to do something like "mount -o loop foo.img /mnt/foo" in an
> unprivileged container where foo.img is not created on the local machine
> and not fully under control of the host environment.

That use case will not be addressed by the policy I suggested,
but the more common case of:
- create a loopback file
- mkfs
- mount
will be addressed.

So if the (host) admin of the system trusts that an unprivileged user cannot
create a malicious fs layout using mkfs and fsck alone, then the system is
relatively safe mounting (non-fuse) file systems from loopback files.
IMHO, this statement is going to be easier for Ted to sign.

>
> Agreed though that the "attack from below" problem for untrusted
> filesystems is still an open question. At minimum we have fuse, which
> has been designed to protect against this threat. Others have mentioned
> on this thread that Ted had said something at kernel summit last year
> about being willing to support ext4 mounts from unprivileged user
> namespaces as well. I've added Ted to the Cc in case he wants to confirm
> or deny this rumor.
>
> > Alas, if you choose to propagate the backing dev label to contained files,
> > they would all share the designated 'Loopback' label and render the policy above
> > useless.
> >
> > Any thoughts on how to reconcile this conflict?
>
> I'm not seeing what the conflict is here - nothing you proposed says
> anything about security labels in the filesystem, and nothing would
> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
> label was desired on the backing device. Care to elaborate?
>
> Seth

2015-07-30 15:09:15

by Amir Goldstein

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 30, 2015 at 4:57 PM, Serge Hallyn <[email protected]> wrote:
> Quoting Amir Goldstein ([email protected]):
>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>> <[email protected]> wrote:
>> >
>> > On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>> > > > This is what I currently think you want for user ns mounts:
>> > > >
>> > > > 1. smk_root and smk_default are assigned the label of the backing
>> > > > device.
>>
>> Seth,
>>
>> There were 2 main concerns discussed in this thread:
>> 1. trusting LSM labels outside the namespace
>> 2. trusting the content of the image file/loopdev
>>
>> While your approach addresses the first concern, I suspect it may be placing
>> an obstacle in the way of resolving the second concern.
>>
>> A viable security policy to mitigate the second concern could be:
>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>> - Allow mount only of 'Loopback' images
>>
>> This should allow the system as a whole to trust unprivileged mounts based on
>> the trust of the entities that had raw access to the fs layout.
>
> Just to be sure I understand right, you're looking for a way to let
> the host admin trust that the kernel's superblock parsers aren't being
> fed trash or an exploit?

Correct.
I do not believe in auditing file system code to a vulnerability-free level,
nor do I think that cryptographically signed file system metadata is the only
way to ensure an exploit-free unprivileged mount.


>
>> Alas, if you choose to propagate the backing dev label to contained files,
>> they would all share the designated 'Loopback' label and render the policy above
>> useless.
>>
>> Any thoughts on how to reconcile this conflict?
>>
>> Amir.
>>
>>
>> > > > 2. s_root is assigned the transmute property.
>> > > > 3. For existing files:
>> > > > a. Files with the same label as the backing device are accessible.
>> > > > b. Files with any other label are not accessible.
>> > >
>> > > That's right. Accept correct data, reject anything that's not right.
>> > >
>> > > > If this is right, there are a couple lingering questions in my mind.
>> > > >
>> > > > First, what happens with files created in directories with the same
>> > > > label as the backing device but without the transmute property set? The
>> > > > inode for the new file will initially be labeled with smk_of_current(),
>> > > > but then during d_instantiate it will get smk_default and thus end up
>> > > > with the label we want. So that seems okay.
>> > >
>> > > Yes.
>> > >
>> > > > The second is whether files with the SMACK64EXEC attribute are still a
>> > > > problem. It seems it is, for files with the same label as the backing
>> > > > store at least. I think we can simply skip the code that reads out this
>> > > > xattr and sets smk_task for user ns mounts, or else skip assigning the
>> > > > label to the new task in bprm_set_creds. The latter seems more
>> > > > consistent with the approach you've suggested for dealing with labels
>> > > > from disk.
>> > >
>> > > Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> > > smack_d_instantiate for unprivileged mounts would do the trick.
>> > >
>> > > > So I guess all of that seems okay, though perhaps a bit restrictive
>> > > > given that the user who mounted the filesystem already has full access
>> > > > to the backing store.
>> > >
>> > > In truth, there is no reason to expect that the "user" who did the
>> > > mount will ever have a Smack label that differs from the label of
>> > > the backing store. If what we've got here seems restrictive, it's
>> > > because you've got access from someone other than the "user".
>> > >
>> > > > Please let me know whether or not this matches up with what you are
>> > > > thinking, then I can proceed with the implementation.
>> > >
>> > > My current mindset is that, if you're going to allow unprivileged
>> > > mounts of user defined backing stores, this is as safe as we can
>> > > make it.
>> >
>> > All right, I've got a patch which I think does this, and I've managed to
>> > do some testing to confirm that it behaves like I expect. How does this
>> > look?
>> >
>> > What's missing is getting the label from the block device inode; as
>> > Stephen discovered the inode that I thought we could get the label from
>> > turned out to be the wrong one. Afaict we would need a new hook in order
>> > to do that, so for now I'm using the label of the process calling
>> > mount.
>> >
>> > ---
>> >
>> > diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
>> > index a143328f75eb..8e631a66b03c 100644
>> > --- a/security/smack/smack_lsm.c
>> > +++ b/security/smack/smack_lsm.c
>> > @@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
>> > skp = smk_of_current();
>> > sp->smk_root = skp;
>> > sp->smk_default = skp;
>> > + if (sb_in_userns(sb))
>> > + transmute = 1;
>> > }
>> > /*
>> > * Initialize the root inode.
>> > @@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
>> > if (mask == 0)
>> > return 0;
>> >
>> > + if (sb_in_userns(inode->i_sb)) {
>> > + struct superblock_smack *sbsp = inode->i_sb->s_security;
>> > + if (smk_of_inode(inode) != sbsp->smk_root)
>> > + return -EACCES;
>> > + }
>> > +
>> > /* May be droppable after audit */
>> > if (no_block)
>> > return -ECHILD;
>> > @@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
>> > if (rc >= 0)
>> > transflag = SMK_INODE_TRANSMUTE;
>> > }
>> > - /*
>> > - * Don't let the exec or mmap label be "*" or "@".
>> > - */
>> > - skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
>> > - if (IS_ERR(skp) || skp == &smack_known_star ||
>> > - skp == &smack_known_web)
>> > - skp = NULL;
>> > - isp->smk_task = skp;
>> > + if (!sb_in_userns(inode->i_sb)) {
>> > + /*
>> > + * Don't let the exec or mmap label be "*" or "@".
>> > + */
>> > + skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
>> > + if (IS_ERR(skp) || skp == &smack_known_star ||
>> > + skp == &smack_known_web)
>> > + skp = NULL;
>> > + isp->smk_task = skp;
>> > + }
>> >
>> > skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
>> > if (IS_ERR(skp) || skp == &smack_known_star ||
>> > --

2015-07-30 15:33:35

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/30/2015 7:47 AM, Amir Goldstein wrote:
> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
> <[email protected]> wrote:
>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>> <[email protected]> wrote:
>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>
>>>>>> 1. smk_root and smk_default are assigned the label of the backing
>>>>>> device.
>>> Seth,
>>>
>>> There were 2 main concerns discussed in this thread:
>>> 1. trusting LSM labels outside the namespace
>>> 2. trusting the content of the image file/loopdev
>>>
>>> While your approach addresses the first concern, I suspect it may be placing
>>> an obstacle in the way of resolving the second concern.
>>>
>>> A viable security policy to mitigate the second concern could be:
>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>> - Allow mount only of 'Loopback' images
>>>
>>> This should allow the system as a whole to trust unprivileged mounts based on
>>> the trust of the entities that had raw access to the fs layout.
>> You don't really say what you mean by "trusted" programs. In a container
>> context I'd have to assume that you mean suid-root or similar programs
>> shared into the container by the host. In that case is any new kernel
>> functionality even required?
> Sorry I was not clear. I will try to explain better.
> I meant that the programs are "trusted" by the LSM security policy.
> I envisioned a system where an unprivileged user is allowed to spawn
> a container which contains "trusted" programs (e.g. mkfs) that are labeled
> as 'FileSystemTools' by the admin of the host.
> FileSystemTools are allowed to write into Loopback labeled files.

You could do this on a Smack based system. It would require
CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
to set some SMACK64EXEC labels on your FileSystemTools, and
they would have to be written as carefully as they would if they
had "more" privilege. You'd need to designate a repository for
your loopback files. On the whole, it would be unattractive.
I will pass on providing the details for fear someone will like
it well enough to implement.

>> That also doesn't work for some of our use cases, where we'd like to be
>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>> unprivileged container where foo.img is not created on the local machine
>> and not fully under control of the host environment.
> That use case will not be addressed by the policy I suggested,
> but the more common case of:
> - create a loopback file
> - mkfs
> - mount
> will be addressed.
>
> So if the (host) admin of the system trusts that an unprivileged user cannot
> create a malicious fs layout using mkfs and fsck alone, then the system is
> relatively safe mounting (non-fuse) file systems from loopback files.
> IMHO, this statement is going to be easier for Ted to sign.

But that sort of defeats the purpose of unprivileged mounts.
Or rather, you're trying to place restrictions on what an
unprivileged user can do without calling the ability to
violate those restrictions "privilege".

>
>> Agreed though that the "attack from below" problem for untrusted
>> filesystems is still an open question. At minimum we have fuse, which
>> has been designed to protect against this threat. Others have mentioned
>> on this thread that Ted had said something at kernel summit last year
>> about being willing to support ext4 mounts from unprivileged user
>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>> or deny this rumor.
>>
>>> Alas, if you choose to propagate the backing dev label to contained files,
>>> they would all share the designated 'Loopback' label and render the policy above
>>> useless.
>>>
>>> Any thoughts on how to reconcile this conflict?
>> I'm not seeing what the conflict is here - nothing you proposed says
>> anything about security labels in the filesystem, and nothing would
>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>> label was desired on the backing device. Care to elaborate?
>>
>> Seth

2015-07-30 15:52:11

by Colin Walters

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

It's worth noting here that I think a lot of the use cases
for unprivileged mounts are testing/development type things,
and these are pretty well covered by:

http://libguestfs.org/

Basically it just runs the host kernel in a VM, and the userspace
is a minimal agent that you can talk to over virtio. You can use
the API, or `guestmount` exposes it via FUSE.

It doesn't magically make the kernel filesystems robust against
untrusted input, but in the case of compromise, it's an
"unprivileged" VM. I've used it for several projects and been
quite happy.

2015-07-30 15:59:05

by Stephen Smalley

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On 07/24/2015 11:11 AM, Seth Forshee wrote:
> On Thu, Jul 23, 2015 at 11:23:31AM -0500, Seth Forshee wrote:
>> On Thu, Jul 23, 2015 at 11:36:03AM -0400, Stephen Smalley wrote:
>>> On 07/23/2015 10:39 AM, Seth Forshee wrote:
>>>> On Thu, Jul 23, 2015 at 09:57:20AM -0400, Stephen Smalley wrote:
>>>>> On 07/22/2015 04:40 PM, Stephen Smalley wrote:
>>>>>> On 07/22/2015 04:25 PM, Stephen Smalley wrote:
>>>>>>> On 07/22/2015 12:14 PM, Seth Forshee wrote:
>>>>>>>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
>>>>>>>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
>>>>>>>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
>>>>>>>>>>> Unprivileged users should not be able to supply security labels
>>>>>>>>>>> in filesystems, nor should they be able to supply security
>>>>>>>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
>>>>>>>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
>>>>>>>>>>> and return EPERM if any contexts are supplied in the mount
>>>>>>>>>>> options.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Seth Forshee <[email protected]>
>>>>>>>>>>
>>>>>>>>>> I think this is obsoleted by the subsequent discussion, but just for the
>>>>>>>>>> record: this patch would cause the files in the userns mount to be left
>>>>>>>>>> with the "unlabeled" label, and therefore under typical policies,
>>>>>>>>>> completely inaccessible to any process in a confined domain.
>>>>>>>>>
>>>>>>>>> The right way to handle this for SELinux would be to automatically use
>>>>>>>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
>>>>>>>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
>>>>>>>>> from some related object (e.g. the block device file context, as in your
>>>>>>>>> patches for Smack). That will cause SELinux to use that value instead
>>>>>>>>> of any xattr value from the filesystem and will cause attempts by
>>>>>>>>> userspace to set the security.selinux xattr to fail on that filesystem.
>>>>>>>>> That is how SELinux normally deals with untrusted filesystems, except
>>>>>>>>> that it is normally specified as a mount option by a trusted mounting
>>>>>>>>> process, whereas in your case you need to automatically set it.
>>>>>>>>
>>>>>>>> Excellent, thank you for the advice. I'll start on this when I've
>>>>>>>> finished with Smack.
>>>>>>>
>>>>>>> Not tested, but something like this should work. Note that it should
>>>>>>> come after the call to security_fs_use() so we know whether SELinux
>>>>>>> would even try to use xattrs supplied by the filesystem in the first place.
>>>>>>>
>>>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>>>>>>> index 564079c..84da3a2 100644
>>>>>>> --- a/security/selinux/hooks.c
>>>>>>> +++ b/security/selinux/hooks.c
>>>>>>> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
>>>>>>> goto out;
>>>>>>> }
>>>>>>> }
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * If this is a user namespace mount, no contexts are allowed
>>>>>>> + * on the command line and security labels must be ignored.
>>>>>>> + */
>>>>>>> + if (sb->s_user_ns != &init_user_ns) {
>>>>>>> + if (context_sid || fscontext_sid || rootcontext_sid ||
>>>>>>> + defcontext_sid) {
>>>>>>> + rc = -EACCES;
>>>>>>> + goto out;
>>>>>>> + }
>>>>>>> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
>>>>>>> + struct block_device *bdev = sb->s_bdev;
>>>>>>> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
>>>>>>> + if (bdev) {
>>>>>>> + struct inode_security_struct *isec =
>>>>>>> bdev->bd_inode;
>>>>>>
>>>>>> That should be bdev->bd_inode->i_security.
>>>>>
>>>>> Sorry, this won't work. bd_inode is not the inode of the block device
>>>>> file that was passed to mount, and it isn't labeled in any way. It will
>>>>> just be unlabeled.
>>>>>
>>>>> So I guess the only real option here as a fallback is
>>>>> sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
>>>>> only case where we currently assign task labels to files is for their
>>>>> /proc/pid inodes, and no current policy will therefore allow create
>>>>> permission to such files.
>>>>
>>>> Darn, you're right, that isn't the inode we want. There really doesn't
>>>> seem to be any way to get back to the one we want from the LSM, short of
>>>> adding a new hook.
>>>
>>> Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
>>> Feels like a layering violation though...
>>
>> Yeah, and even though that probably works out to be the inode we want in
>> most cases I don't think we can be absolutely certain that it is. Maybe
>> there's some way we could walk the list and be sure we've found the
>> right inode, but I'm not seeing it.
>
> I guess we could do something like this (note that most of the changes
> here are just to give a version of blkdev_get_by_path which takes a
> struct path * so that the filename lookup doesn't have to be done
> twice). Basically add a new hook that informs the security module of the
> inode for the backing device file passed to mount and call that from
> mount_bdev. The security module could grab a reference to the inode and
> stash it away.
>
> Something else to note is that, as I have it here, the hook would end up
> getting called for every mount of a given block device, not just the
> first. So it's possible the security module could see the hook called a
> second time with a different inode that has a different label. The hook
> could be changed to return int if you wanted to have the opportunity to
> reject such mounts.

I'm not comfortable with this approach due to the aliasing/ambiguity you
mention, as well as being unsure as to whether we truly want to label it
the same as the backing block device (we certainly do not do that for
normal mounts). I was also expecting the vfs folks to veto this patch but
haven't seen that yet.

For now, how about if we just do this to compute the mountpoint label
for SELinux:
	rc = security_transition_sid(current_sid(), current_sid(),
				     SECCLASS_FILE, NULL, &sbsec->mntpoint_sid);
	if (rc)
		goto out;

This will turn the current task context into a form suitable for a file
object, while simultaneously allowing the policy writer to specify a
different label for the files through policy transition rules if desired.
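A simplified userspace model may help show what this computation yields. It assumes the usual SELinux defaults for a file-class transition when source and target are both the task context: the user is kept, the role becomes object_r, and the type stays the task's type unless a type_transition rule names a different one. The structs, the function name, and the type names in the usage note are hypothetical; MLS ranges and default_* policy statements are ignored.

```c
#include <stddef.h>
#include <string.h>

/* Simplified SELinux context: user:role:type (no MLS range). */
struct ctx {
	const char *user;
	const char *role;
	const char *type;
};

/* Hypothetical stand-in for one type_transition rule on class "file". */
struct file_transition {
	const char *source_type;
	const char *target_type;
	const char *new_type;
};

/*
 * Models security_transition_sid(task, task, SECCLASS_FILE, ...):
 * derive a file context from a task context, letting a matching
 * transition rule override the resulting type.
 */
struct ctx file_ctx_for_task(const struct ctx *task,
			     const struct file_transition *rules,
			     size_t nrules)
{
	/* Defaults for an object class: keep user, use object_r, keep type. */
	struct ctx out = { task->user, "object_r", task->type };

	for (size_t i = 0; i < nrules; i++) {
		/* Source and target are both the task, so both must match. */
		if (strcmp(rules[i].source_type, task->type) == 0 &&
		    strcmp(rules[i].target_type, task->type) == 0) {
			out.type = rules[i].new_type;
			break;
		}
	}
	return out;
}
```

With no rules, a task running as user_u:user_r:container_t would produce user_u:object_r:container_t for the mountpoint label; a policy writer who adds a rule mapping container_t to, say, container_file_t gets that type instead.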

>
> Seth
>
> ---
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index f8ce371c437c..dc2173e24e30 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1372,14 +1372,39 @@ int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder)
> }
> EXPORT_SYMBOL(blkdev_get);
>
> +static struct block_device *__lookup_bdev(struct path *path);
> +
> +struct block_device * __blkdev_get_by_path(struct path *path, fmode_t mode,
> + void *holder)
> +{
> + struct block_device *bdev;
> + int err;
> +
> + bdev = __lookup_bdev(path);
> + if (IS_ERR(bdev))
> + return bdev;
> +
> + err = blkdev_get(bdev, mode, holder);
> + if (err)
> + return ERR_PTR(err);
> +
> + if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
> + blkdev_put(bdev, mode);
> + return ERR_PTR(-EACCES);
> + }
> +
> + return bdev;
> +}
> +EXPORT_SYMBOL(__blkdev_get_by_path);
> +
> /**
> * blkdev_get_by_path - open a block device by name
> - * @path: path to the block device to open
> + * @pathname: path to the block device to open
> * @mode: FMODE_* mask
> * @holder: exclusive holder identifier
> *
> - * Open the blockdevice described by the device file at @path. @mode
> - * and @holder are identical to blkdev_get().
> + * Open the blockdevice described by the device file at @pathname.
> + * @mode and @holder are identical to blkdev_get().
> *
> * On success, the returned block_device has reference count of one.
> *
> @@ -1389,25 +1414,22 @@ EXPORT_SYMBOL(blkdev_get);
> * RETURNS:
> * Pointer to block_device on success, ERR_PTR(-errno) on failure.
> */
> -struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
> +struct block_device *blkdev_get_by_path(const char *pathname, fmode_t mode,
> void *holder)
> {
> struct block_device *bdev;
> - int err;
> -
> - bdev = lookup_bdev(path);
> - if (IS_ERR(bdev))
> - return bdev;
> + struct path path;
> + int error;
>
> - err = blkdev_get(bdev, mode, holder);
> - if (err)
> - return ERR_PTR(err);
> + if (!pathname || !*pathname)
> + return ERR_PTR(-EINVAL);
>
> - if ((mode & FMODE_WRITE) && bdev_read_only(bdev)) {
> - blkdev_put(bdev, mode);
> - return ERR_PTR(-EACCES);
> - }
> + error = kern_path(pathname, LOOKUP_FOLLOW, &path);
> + if (error)
> + return ERR_PTR(error);
>
> + bdev = __blkdev_get_by_path(&path, mode, holder);
> + path_put(&path);
> return bdev;
> }
> EXPORT_SYMBOL(blkdev_get_by_path);
> @@ -1702,6 +1724,30 @@ int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
>
> EXPORT_SYMBOL(ioctl_by_bdev);
>
> +static struct block_device *__lookup_bdev(struct path *path)
> +{
> + struct block_device *bdev;
> + struct inode *inode;
> + int error;
> +
> + inode = d_backing_inode(path->dentry);
> + error = -ENOTBLK;
> + if (!S_ISBLK(inode->i_mode))
> + goto fail;
> + error = -EACCES;
> + if (!may_open_dev(path))
> + goto fail;
> + error = -ENOMEM;
> + bdev = bd_acquire(inode);
> + if (!bdev)
> + goto fail;
> +out:
> + return bdev;
> +fail:
> + bdev = ERR_PTR(error);
> + goto out;
> +}
> +
> /**
> * lookup_bdev - lookup a struct block_device by name
> * @pathname: special file representing the block device
> @@ -1713,7 +1759,6 @@ EXPORT_SYMBOL(ioctl_by_bdev);
> struct block_device *lookup_bdev(const char *pathname)
> {
> struct block_device *bdev;
> - struct inode *inode;
> struct path path;
> int error;
>
> @@ -1724,23 +1769,9 @@ struct block_device *lookup_bdev(const char *pathname)
> if (error)
> return ERR_PTR(error);
>
> - inode = d_backing_inode(path.dentry);
> - error = -ENOTBLK;
> - if (!S_ISBLK(inode->i_mode))
> - goto fail;
> - error = -EACCES;
> - if (!may_open_dev(&path))
> - goto fail;
> - error = -ENOMEM;
> - bdev = bd_acquire(inode);
> - if (!bdev)
> - goto fail;
> -out:
> + bdev = __lookup_bdev(&path);
> path_put(&path);
> return bdev;
> -fail:
> - bdev = ERR_PTR(error);
> - goto out;
> }
> EXPORT_SYMBOL(lookup_bdev);
>
> diff --git a/fs/super.c b/fs/super.c
> index 008f938e3ec0..558f7845a171 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -34,6 +34,7 @@
> #include <linux/fsnotify.h>
> #include <linux/lockdep.h>
> #include <linux/user_namespace.h>
> +#include <linux/namei.h>
> #include "internal.h"
>
>
> @@ -980,15 +981,26 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
> {
> struct block_device *bdev;
> struct super_block *s;
> + struct path path;
> + struct inode *inode;
> fmode_t mode = FMODE_READ | FMODE_EXCL;
> int error = 0;
>
> if (!(flags & MS_RDONLY))
> mode |= FMODE_WRITE;
>
> - bdev = blkdev_get_by_path(dev_name, mode, fs_type);
> - if (IS_ERR(bdev))
> - return ERR_CAST(bdev);
> + if (!dev_name || !*dev_name)
> + return ERR_PTR(-EINVAL);
> +
> + error = kern_path(dev_name, LOOKUP_FOLLOW, &path);
> + if (error)
> + return ERR_PTR(error);
> +
> + bdev = __blkdev_get_by_path(&path, mode, fs_type);
> + if (IS_ERR(bdev)) {
> + error = PTR_ERR(bdev);
> + goto error;
> + }
>
> /*
> * once the super is inserted into the list by sget, s_umount
> @@ -1040,6 +1052,10 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
> bdev->bd_super = s;
> }
>
> + inode = d_backing_inode(path.dentry);
> + security_sb_backing_dev(s, inode);
> + path_put(&path);
> +
> return dget(s->s_root);
>
> error_s:
> @@ -1047,6 +1063,7 @@ error_s:
> error_bdev:
> blkdev_put(bdev, mode);
> error:
> + path_put(&path);
> return ERR_PTR(error);
> }
> EXPORT_SYMBOL(mount_bdev);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4597420ab933..3748945bf0d5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2315,6 +2315,8 @@ extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long);
> extern int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
> extern long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
> extern int blkdev_get(struct block_device *bdev, fmode_t mode, void *holder);
> +extern struct block_device *__blkdev_get_by_path(struct path *path, fmode_t mode,
> + void *holder);
> extern struct block_device *blkdev_get_by_path(const char *path, fmode_t mode,
> void *holder);
> extern struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode,
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 9429f054c323..52ce1a094e04 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1351,6 +1351,7 @@ union security_list_options {
> int (*sb_clone_mnt_opts)(const struct super_block *oldsb,
> struct super_block *newsb);
> int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
> + void (*sb_backing_dev)(struct super_block *sb, struct inode *inode);
> int (*dentry_init_security)(struct dentry *dentry, int mode,
> struct qstr *name, void **ctx,
> u32 *ctxlen);
> @@ -1648,6 +1649,7 @@ struct security_hook_heads {
> struct list_head sb_set_mnt_opts;
> struct list_head sb_clone_mnt_opts;
> struct list_head sb_parse_opts_str;
> + struct list_head sb_backing_dev;
> struct list_head dentry_init_security;
> #ifdef CONFIG_SECURITY_PATH
> struct list_head path_unlink;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 79d85ddf8093..7a4d8382af20 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -231,6 +231,7 @@ int security_sb_set_mnt_opts(struct super_block *sb,
> int security_sb_clone_mnt_opts(const struct super_block *oldsb,
> struct super_block *newsb);
> int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
> +void security_sb_backing_dev(struct super_block *sb, struct inode *inode);
> int security_dentry_init_security(struct dentry *dentry, int mode,
> struct qstr *name, void **ctx,
> u32 *ctxlen);
> @@ -562,6 +563,10 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
> return 0;
> }
>
> +static inline void security_sb_backing_dev(struct super_block *sb,
> + struct inode *inode)
> +{ }
> +
> static inline int security_inode_alloc(struct inode *inode)
> {
> return 0;
> diff --git a/security/security.c b/security/security.c
> index 062f3c997fdc..f6f89e0f06d8 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -347,6 +347,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
> }
> EXPORT_SYMBOL(security_sb_parse_opts_str);
>
> +void security_sb_backing_dev(struct super_block *sb, struct inode *inode)
> +{
> + call_void_hook(sb_backing_dev, sb, inode);
> +}
> +
> int security_inode_alloc(struct inode *inode)
> {
> inode->i_security = NULL;
> @@ -1595,6 +1600,8 @@ struct security_hook_heads security_hook_heads = {
> LIST_HEAD_INIT(security_hook_heads.sb_clone_mnt_opts),
> .sb_parse_opts_str =
> LIST_HEAD_INIT(security_hook_heads.sb_parse_opts_str),
> + .sb_backing_dev =
> + LIST_HEAD_INIT(security_hook_heads.sb_backing_dev),
> .dentry_init_security =
> LIST_HEAD_INIT(security_hook_heads.dentry_init_security),
> #ifdef CONFIG_SECURITY_PATH
>
>

2015-07-30 16:22:30

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Colin Walters <[email protected]> writes:

> It's worth noting here that I think a lot of the use cases
> for unprivileged mounts are testing/development type things,
> and these are pretty well covered by:
>
> http://libguestfs.org/
>
> Basically it just runs the host kernel in a VM, and the userspace
> is a minimal agent that you can talk to over virtio. You can use
> the API, or `guestmount` exposes it via FUSE.
>
> It doesn't magically make the kernel filesystems robust against
> untrusted input, but in the case of compromise, it's an
> "unprivileged" VM. I've used it for several projects and been
> quite happy.

Thanks for pointing this out. That makes it clear we only have to get
as far as making fuse work for this work to be useful in practice.

Eric

2015-07-30 16:18:24

by Casey Schaufler

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/28/2015 1:40 PM, Seth Forshee wrote:
> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>> This is what I currently think you want for user ns mounts:
>>>
>>> 1. smk_root and smk_default are assigned the label of the backing
>>> device.
>>> 2. s_root is assigned the transmute property.
>>> 3. For existing files:
>>> a. Files with the same label as the backing device are accessible.
>>> b. Files with any other label are not accessible.
>> That's right. Accept correct data, reject anything that's not right.
>>
>>> If this is right, there are a couple lingering questions in my mind.
>>>
>>> First, what happens with files created in directories with the same
>>> label as the backing device but without the transmute property set? The
>>> inode for the new file will initially be labeled with smk_of_current(),
>>> but then during d_instantiate it will get smk_default and thus end up
>>> with the label we want. So that seems okay.
>> Yes.
>>
>>> The second is whether files with the SMACK64EXEC attribute are still a
>>> problem. It seems it is, for files with the same label as the backing
>>> store at least. I think we can simply skip the code that reads out this
>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>>> label to the new task in bprm_set_creds. The latter seems more
>>> consistent with the approach you've suggested for dealing with labels
>>> from disk.
>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> smack_d_instantiate for unprivileged mounts would do the trick.
>>
>>> So I guess all of that seems okay, though perhaps a bit restrictive
>>> given that the user who mounted the filesystem already has full access
>>> to the backing store.
>> In truth, there is no reason to expect that the "user" who did the
>> mount will ever have a Smack label that differs from the label of
>> the backing store. If what we've got here seems restrictive, it's
>> because you've got access from someone other than the "user".
>>
>>> Please let me know whether or not this matches up with what you are
>>> thinking, then I can proceed with the implementation.
>> My current mindset is that, if you're going to allow unprivileged
>> mounts of user defined backing stores, this is as safe as we can
>> make it.
> All right, I've got a patch which I think does this, and I've managed to
> do some testing to confirm that it behaves like I expect. How does this
> look?
>
> What's missing is getting the label from the block device inode; as
> Stephen discovered the inode that I thought we could get the label from
> turned out to be the wrong one. Afaict we would need a new hook in order
> to do that, so for now I'm using the label of the process calling
> mount.

That will be OK if the mount processing checks for write access to
the backing store. I haven't looked to see if it does. If it doesn't
the problems should be pretty obvious.

>
> ---
>
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a143328f75eb..8e631a66b03c 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -662,6 +662,8 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
> skp = smk_of_current();
> sp->smk_root = skp;
> sp->smk_default = skp;
> + if (sb_in_userns(sb))
> + transmute = 1;
> }
> /*
> * Initialize the root inode.
> @@ -1023,6 +1025,12 @@ static int smack_inode_permission(struct inode *inode, int mask)
> if (mask == 0)
> return 0;
>
> + if (sb_in_userns(inode->i_sb)) {
> + struct superblock_smack *sbsp = inode->i_sb->s_security;
> + if (smk_of_inode(inode) != sbsp->smk_root)
> + return -EACCES;
> + }
> +
> /* May be droppable after audit */
> if (no_block)
> return -ECHILD;
> @@ -3220,14 +3228,16 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
> if (rc >= 0)
> transflag = SMK_INODE_TRANSMUTE;
> }
> - /*
> - * Don't let the exec or mmap label be "*" or "@".
> - */
> - skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> - if (IS_ERR(skp) || skp == &smack_known_star ||
> - skp == &smack_known_web)
> - skp = NULL;
> - isp->smk_task = skp;
> + if (!sb_in_userns(inode->i_sb)) {
> + /*
> + * Don't let the exec or mmap label be "*" or "@".
> + */
> + skp = smk_fetch(XATTR_NAME_SMACKEXEC, inode, dp);
> + if (IS_ERR(skp) || skp == &smack_known_star ||
> + skp == &smack_known_web)
> + skp = NULL;
> + isp->smk_task = skp;
> + }
>
> skp = smk_fetch(XATTR_NAME_SMACKMMAP, inode, dp);
> if (IS_ERR(skp) || skp == &smack_known_star ||
>
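
The access rule added to smack_inode_permission() in the patch above can be modelled in user space. This is only a minimal sketch, assuming labels are plain strings rather than struct smack_known pointers; the names model_sb, model_inode and userns_label_check are hypothetical and do not exist in the kernel, where the real check compares smk_of_inode(inode) against sbsp->smk_root.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct model_sb {
	bool in_userns;        /* models sb_in_userns(sb) */
	const char *smk_root;  /* label assigned to the superblock at mount time */
};

struct model_inode {
	const struct model_sb *sb;
	const char *label;     /* label as read from the on-disk xattr */
};

/*
 * Models the early check from the patch: on a user namespace mount, an
 * inode whose label differs from the superblock's root label is rejected
 * before the normal Smack access rules are consulted.
 * Returns 0 to continue with the normal checks, -EACCES to deny.
 */
int userns_label_check(const struct model_inode *inode)
{
	assert(inode && inode->sb);

	if (!inode->sb->in_userns)
		return 0;
	if (strcmp(inode->label, inode->sb->smk_root) != 0)
		return -EACCES;
	return 0;
}
```

With a superblock labelled "User", an inode labelled "User" passes this check and an inode labelled "System" gets -EACCES; on a mount that is not in a user namespace the check never denies.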

2015-07-30 16:24:38

by Seth Forshee

Subject: Re: [PATCH 6/7] selinux: Ignore security labels on user namespace mounts

On Thu, Jul 30, 2015 at 11:57:24AM -0400, Stephen Smalley wrote:
> On 07/24/2015 11:11 AM, Seth Forshee wrote:
> > On Thu, Jul 23, 2015 at 11:23:31AM -0500, Seth Forshee wrote:
> >> On Thu, Jul 23, 2015 at 11:36:03AM -0400, Stephen Smalley wrote:
> >>> On 07/23/2015 10:39 AM, Seth Forshee wrote:
> >>>> On Thu, Jul 23, 2015 at 09:57:20AM -0400, Stephen Smalley wrote:
> >>>>> On 07/22/2015 04:40 PM, Stephen Smalley wrote:
> >>>>>> On 07/22/2015 04:25 PM, Stephen Smalley wrote:
> >>>>>>> On 07/22/2015 12:14 PM, Seth Forshee wrote:
> >>>>>>>> On Wed, Jul 22, 2015 at 12:02:13PM -0400, Stephen Smalley wrote:
> >>>>>>>>> On 07/16/2015 09:23 AM, Stephen Smalley wrote:
> >>>>>>>>>> On 07/15/2015 03:46 PM, Seth Forshee wrote:
> >>>>>>>>>>> Unprivileged users should not be able to supply security labels
> >>>>>>>>>>> in filesystems, nor should they be able to supply security
> >>>>>>>>>>> contexts in unprivileged mounts. For any mount where s_user_ns is
> >>>>>>>>>>> not init_user_ns, force the use of SECURITY_FS_USE_NONE behavior
> >>>>>>>>>>> and return EPERM if any contexts are supplied in the mount
> >>>>>>>>>>> options.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Seth Forshee <[email protected]>
> >>>>>>>>>>
> >>>>>>>>>> I think this is obsoleted by the subsequent discussion, but just for the
> >>>>>>>>>> record: this patch would cause the files in the userns mount to be left
> >>>>>>>>>> with the "unlabeled" label, and therefore under typical policies,
> >>>>>>>>>> completely inaccessible to any process in a confined domain.
> >>>>>>>>>
> >>>>>>>>> The right way to handle this for SELinux would be to automatically use
> >>>>>>>>> mountpoint labeling (SECURITY_FS_USE_MNTPOINT, normally set by
> >>>>>>>>> specifying a context= mount option), with the sbsec->mntpoint_sid set
> >>>>>>>>> from some related object (e.g. the block device file context, as in your
> >>>>>>>>> patches for Smack). That will cause SELinux to use that value instead
> >>>>>>>>> of any xattr value from the filesystem and will cause attempts by
> >>>>>>>>> userspace to set the security.selinux xattr to fail on that filesystem.
> >>>>>>>>> That is how SELinux normally deals with untrusted filesystems, except
> >>>>>>>>> that it is normally specified as a mount option by a trusted mounting
> >>>>>>>>> process, whereas in your case you need to automatically set it.
> >>>>>>>>
> >>>>>>>> Excellent, thank you for the advice. I'll start on this when I've
> >>>>>>>> finished with Smack.
> >>>>>>>
> >>>>>>> Not tested, but something like this should work. Note that it should
> >>>>>>> come after the call to security_fs_use() so we know whether SELinux
> >>>>>>> would even try to use xattrs supplied by the filesystem in the first place.
> >>>>>>>
> >>>>>>> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> >>>>>>> index 564079c..84da3a2 100644
> >>>>>>> --- a/security/selinux/hooks.c
> >>>>>>> +++ b/security/selinux/hooks.c
> >>>>>>> @@ -745,6 +745,30 @@ static int selinux_set_mnt_opts(struct super_block *sb,
> >>>>>>> goto out;
> >>>>>>> }
> >>>>>>> }
> >>>>>>> +
> >>>>>>> + /*
> >>>>>>> + * If this is a user namespace mount, no contexts are allowed
> >>>>>>> + * on the command line and security labels must be ignored.
> >>>>>>> + */
> >>>>>>> + if (sb->s_user_ns != &init_user_ns) {
> >>>>>>> + if (context_sid || fscontext_sid || rootcontext_sid ||
> >>>>>>> + defcontext_sid) {
> >>>>>>> + rc = -EACCES;
> >>>>>>> + goto out;
> >>>>>>> + }
> >>>>>>> + if (sbsec->behavior == SECURITY_FS_USE_XATTR) {
> >>>>>>> + struct block_device *bdev = sb->s_bdev;
> >>>>>>> + sbsec->behavior = SECURITY_FS_USE_MNTPOINT;
> >>>>>>> + if (bdev) {
> >>>>>>> + struct inode_security_struct *isec =
> >>>>>>> bdev->bd_inode;
> >>>>>>
> >>>>>> That should be bdev->bd_inode->i_security.
> >>>>>
> >>>>> Sorry, this won't work. bd_inode is not the inode of the block device
> >>>>> file that was passed to mount, and it isn't labeled in any way. It will
> >>>>> just be unlabeled.
> >>>>>
> >>>>> So I guess the only real option here as a fallback is
> >>>>> sbsec->mntpoint_sid = current_sid(). Which isn't great either, as the
> >>>>> only case where we currently assign task labels to files is for their
> >>>>> /proc/pid inodes, and no current policy will therefore allow create
> >>>>> permission to such files.
> >>>>
> >>>> Darn, you're right, that isn't the inode we want. There really doesn't
> >>>> seem to be any way to get back to the one we want from the LSM, short of
> >>>> adding a new hook.
> >>>
> >>> Maybe list_first_entry(&sb->s_bdev->bd_inodes, struct inode, i_devices)?
> >>> Feels like a layering violation though...
> >>
> >> Yeah, and even though that probably works out to be the inode we want in
> >> most cases I don't think we can be absolutely certain that it is. Maybe
> >> there's some way we could walk the list and be sure we've found the
> >> right inode, but I'm not seeing it.
> >
> > I guess we could do something like this (note that most of the changes
> > here are just to give a version of blkdev_get_by_path which takes a
> > struct path * so that the filename lookup doesn't have to be done
> > twice). Basically add a new hook that informs the security module of the
> > inode for the backing device file passed to mount and call that from
> > mount_bdev. The security module could grab a reference to the inode and
> > stash it away.
> >
> > Something else to note is that, as I have it here, the hook would end up
> > getting called for every mount of a given block device, not just the
> > first. So it's possible the security module could see the hook called a
> > second time with a different inode that has a different label. The hook
> > could be changed to return int if you wanted to have the opportunity to
> > reject such mounts.
>
> I'm not comfortable with this approach due to the aliasing/ambiguity you
> mention, as well as being unsure as to whether we truly want to label it
> the same as the backing block device (we certainly do not do that for
> normal mounts). Was also expecting the vfs folks to veto this patch but
> haven't seen that yet.

Yeah, I wasn't necessarily suggesting that this was a _good_ way to go,
only that I couldn't find a workable alternative.

> For now, how about if we just do this to compute the mountpoint label
> for SELinux:
> rc = security_transition_sid(current_sid(), current_sid(),
> SECCLASS_FILE, NULL, &sbsec->mntpoint_sid);
> if (rc)
> goto out;
>
> This will turn the current task context into a form suitable for a file
> object, while simultaneously allowing the policy writer to specify a
> different label for the files through policy transition rules if desired.

Great, I'll incorporate this. Thanks!

Seth
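
To make the effect of Stephen's suggestion concrete, here is a simplified user-space model of how a label for a new file-class object is derived when both source and target are the current task's context. It is a sketch under stated assumptions: contexts are reduced to user:role:type, MLS levels and default_* policy statements are ignored, and the names ctx, type_transition and file_transition are hypothetical. The type names used in the usage note are invented as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified security context: user:role:type, no MLS component. */
struct ctx {
	const char *user;
	const char *role;
	const char *type;
};

/* Models a "type_transition stype ttype : file new_type" policy rule. */
struct type_transition {
	const char *stype;
	const char *ttype;
	const char *new_type;
};

/*
 * Compute the context of a new file object. By default the user comes
 * from the source, the role becomes object_r, and the type comes from
 * the target; a matching transition rule overrides the type.
 */
struct ctx file_transition(struct ctx src, struct ctx tgt,
			   const struct type_transition *rules, size_t n)
{
	struct ctx out = { src.user, "object_r", tgt.type };
	size_t i;

	assert(rules != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (strcmp(rules[i].stype, src.type) == 0 &&
		    strcmp(rules[i].ttype, tgt.type) == 0) {
			out.type = rules[i].new_type;
			break;
		}
	}
	return out;
}
```

Passing the mounting task's context as both src and tgt with no rules yields the task's user and type with the object_r role, which is the "form suitable for a file object" described above. Adding one rule for that type pair lets the policy writer pick a different file type.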

2015-07-30 17:12:10

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Casey Schaufler <[email protected]> writes:

> On 7/28/2015 1:40 PM, Seth Forshee wrote:
>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>> This is what I currently think you want for user ns mounts:
>>>>
>>>> 1. smk_root and smk_default are assigned the label of the backing
>>>> device.
>>>> 2. s_root is assigned the transmute property.
>>>> 3. For existing files:
>>>> a. Files with the same label as the backing device are accessible.
>>>> b. Files with any other label are not accessible.
>>> That's right. Accept correct data, reject anything that's not right.
>>>
>>>> If this is right, there are a couple lingering questions in my mind.
>>>>
>>>> First, what happens with files created in directories with the same
>>>> label as the backing device but without the transmute property set? The
>>>> inode for the new file will initially be labeled with smk_of_current(),
>>>> but then during d_instantiate it will get smk_default and thus end up
>>>> with the label we want. So that seems okay.
>>> Yes.
>>>
>>>> The second is whether files with the SMACK64EXEC attribute are still a
>>>> problem. It seems it is, for files with the same label as the backing
>>>> store at least. I think we can simply skip the code that reads out this
>>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>>>> label to the new task in bprm_set_creds. The latter seems more
>>>> consistent with the approach you've suggested for dealing with labels
>>>> from disk.
>>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>>> smack_d_instantiate for unprivileged mounts would do the trick.
>>>
>>>> So I guess all of that seems okay, though perhaps a bit restrictive
>>>> given that the user who mounted the filesystem already has full access
>>>> to the backing store.
>>> In truth, there is no reason to expect that the "user" who did the
>>> mount will ever have a Smack label that differs from the label of
>>> the backing store. If what we've got here seems restrictive, it's
>>> because you've got access from someone other than the "user".
>>>
>>>> Please let me know whether or not this matches up with what you are
>>>> thinking, then I can proceed with the implementation.
>>> My current mindset is that, if you're going to allow unprivileged
>>> mounts of user defined backing stores, this is as safe as we can
>>> make it.
>> All right, I've got a patch which I think does this, and I've managed to
>> do some testing to confirm that it behaves like I expect. How does this
>> look?
>>
>> What's missing is getting the label from the block device inode; as
>> Stephen discovered the inode that I thought we could get the label from
>> turned out to be the wrong one. Afaict we would need a new hook in order
>> to do that, so for now I'm using the label of the process calling
>> mount.
>
> That will be OK if the mount processing checks for write access to
> the backing store. I haven't looked to see if it does. If it doesn't
> the problems should be pretty obvious.


do_new_mount
  vfs_kern_mount
    mount_fs
      ...
        mount_bdev
          blkdev_get_by_path(...,FMODE_READ | FMODE_WRITE | FMODE_EXCL,...)
            lookup_bdev
              kern_path
                filename_lookup
                  path_lookupat
                    lookup_last
                      walk_component
            blkdev_get(...,mode,...)
              __blkdev_get(...,mode,...)
                devcgroup_inode_permission(bdev->bd_inode, perm)

*scratches my head*

It looks like we don't actually check the permissions on the block
device. Tomoyo has a hack for it. nfsd does something. There is
devcgroup silliness.

But overall it looks like we depend on capable(CAP_SYS_ADMIN).

Seth, I do believe we have found another area of the vfs we will need to
shore up before allowing unprivileged mounts of block-device-based
filesystems.

There are enough hacks here that having someone with a clue come through
and make the code make more sense seems like a good idea anyway.

Eric
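
For illustration, the kind of check that appears to be missing from the mount path can be sketched in user space: translate the open mode requested by the mount into a permission mask and test it against the device node's mode bits. This is only a model under simplifying assumptions (classic owner/group/other bits, no capabilities, ACLs, LSM hooks or supplementary groups), and every name prefixed with MODEL_ or model_ is hypothetical rather than a kernel symbol.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical open-mode flags and permission mask bits for this model. */
#define MODEL_FMODE_READ  0x1
#define MODEL_FMODE_WRITE 0x2
#define MODEL_MAY_WRITE   0x2   /* same value as the 'w' permission bit */
#define MODEL_MAY_READ    0x4   /* same value as the 'r' permission bit */

/* Ownership and mode of the device node passed to mount. */
struct model_node {
	uid_t uid;
	gid_t gid;
	mode_t mode;
};

/*
 * Return 0 if a caller with the given uid/gid may open the node with the
 * requested mode, -EACCES otherwise. Only one permission class is
 * consulted: owner, else group, else other.
 */
int model_bdev_permission(const struct model_node *node,
			  uid_t uid, gid_t gid, int fmode)
{
	mode_t mask = 0;
	mode_t bits;

	assert(node);

	if (fmode & MODEL_FMODE_READ)
		mask |= MODEL_MAY_READ;
	if (fmode & MODEL_FMODE_WRITE)
		mask |= MODEL_MAY_WRITE;

	if (uid == node->uid)
		bits = (node->mode >> 6) & 7;
	else if (gid == node->gid)
		bits = (node->mode >> 3) & 7;
	else
		bits = node->mode & 7;

	return (bits & mask) == mask ? 0 : -EACCES;
}
```

For a node owned by root:disk with mode 0660, an unrelated user asking for read and write access is denied, while a caller whose gid matches the node's group is allowed. A read-write mount of a 0640 node by a group member is denied, which is the write-access condition Casey asked about.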

2015-07-30 17:25:40

by Seth Forshee

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 30, 2015 at 12:05:27PM -0500, Eric W. Biederman wrote:
> Casey Schaufler <[email protected]> writes:
>
> > On 7/28/2015 1:40 PM, Seth Forshee wrote:
> >> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
> >>>> This is what I currently think you want for user ns mounts:
> >>>>
> >>>> 1. smk_root and smk_default are assigned the label of the backing
> >>>> device.
> >>>> 2. s_root is assigned the transmute property.
> >>>> 3. For existing files:
> >>>> a. Files with the same label as the backing device are accessible.
> >>>> b. Files with any other label are not accessible.
> >>> That's right. Accept correct data, reject anything that's not right.
> >>>
> >>>> If this is right, there are a couple lingering questions in my mind.
> >>>>
> >>>> First, what happens with files created in directories with the same
> >>>> label as the backing device but without the transmute property set? The
> >>>> inode for the new file will initially be labeled with smk_of_current(),
> >>>> but then during d_instantiate it will get smk_default and thus end up
> >>>> with the label we want. So that seems okay.
> >>> Yes.
> >>>
> >>>> The second is whether files with the SMACK64EXEC attribute are still a
> >>>> problem. It seems it is, for files with the same label as the backing
> >>>> store at least. I think we can simply skip the code that reads out this
> >>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
> >>>> label to the new task in bprm_set_creds. The latter seems more
> >>>> consistent with the approach you've suggested for dealing with labels
> >>>> from disk.
> >>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
> >>> smack_d_instantiate for unprivileged mounts would do the trick.
> >>>
> >>>> So I guess all of that seems okay, though perhaps a bit restrictive
> >>>> given that the user who mounted the filesystem already has full access
> >>>> to the backing store.
> >>> In truth, there is no reason to expect that the "user" who did the
> >>> mount will ever have a Smack label that differs from the label of
> >>> the backing store. If what we've got here seems restrictive, it's
> >>> because you've got access from someone other than the "user".
> >>>
> >>>> Please let me know whether or not this matches up with what you are
> >>>> thinking, then I can proceed with the implementation.
> >>> My current mindset is that, if you're going to allow unprivileged
> >>> mounts of user defined backing stores, this is as safe as we can
> >>> make it.
> >> All right, I've got a patch which I think does this, and I've managed to
> >> do some testing to confirm that it behaves like I expect. How does this
> >> look?
> >>
> >> What's missing is getting the label from the block device inode; as
> >> Stephen discovered the inode that I thought we could get the label from
> >> turned out to be the wrong one. Afaict we would need a new hook in order
> >> to do that, so for now I'm using the label of the process calling
> >> mount.
> >
> > That will be OK if the mount processing checks for write access to
> > the backing store. I haven't looked to see if it does. If it doesn't
> > the problems should be pretty obvious.
>
>
> do_new_mount
>   vfs_kern_mount
>     mount_fs
>       ...
>         mount_bdev
>           blkdev_get_by_path(...,FMODE_READ | FMODE_WRITE | FMODE_EXCL,...)
>             lookup_bdev
>               kern_path
>                 filename_lookup
>                   path_lookupat
>                     lookup_last
>                       walk_component
>             blkdev_get(...,mode,...)
>               __blkdev_get(...,mode,...)
>                 devcgroup_inode_permission(bdev->bd_inode, perm)
>
> *scratches my head*
>
> It looks like we don't actually check the permissions on the block
> device. Tomoyo has a hack for it. nfsd does something. There is
> devcgroup silliness.
>
> But overall it looks like we depend on capable(CAP_SYS_ADMIN).
>
> Seth, I do believe we have found another area of the vfs we will need to
> shore up before allowing unprivileged mounts of block-device-based
> filesystems.
>
> There are enough hacks here that having someone with a clue come through
> and make the code make more sense seems like a good idea anyway.

Yep, I just came to the same conclusion myself, and I also verified the
behavior empirically. That's definitely a problem. I'll get to work on
fixing that.

Seth

2015-07-30 17:40:39

by Eric W. Biederman

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

Seth Forshee <[email protected]> writes:

> On Thu, Jul 30, 2015 at 12:05:27PM -0500, Eric W. Biederman wrote:
>> Casey Schaufler <[email protected]> writes:
>>
>> > On 7/28/2015 1:40 PM, Seth Forshee wrote:
>> >> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>> >>>> This is what I currently think you want for user ns mounts:
>> >>>>
>> >>>> 1. smk_root and smk_default are assigned the label of the backing
>> >>>> device.
>> >>>> 2. s_root is assigned the transmute property.
>> >>>> 3. For existing files:
>> >>>> a. Files with the same label as the backing device are accessible.
>> >>>> b. Files with any other label are not accessible.
>> >>> That's right. Accept correct data, reject anything that's not right.
>> >>>
>> >>>> If this is right, there are a couple lingering questions in my mind.
>> >>>>
>> >>>> First, what happens with files created in directories with the same
>> >>>> label as the backing device but without the transmute property set? The
>> >>>> inode for the new file will initially be labeled with smk_of_current(),
>> >>>> but then during d_instantiate it will get smk_default and thus end up
>> >>>> with the label we want. So that seems okay.
>> >>> Yes.
>> >>>
>> >>>> The second is whether files with the SMACK64EXEC attribute are still a
>> >>>> problem. It seems it is, for files with the same label as the backing
>> >>>> store at least. I think we can simply skip the code that reads out this
>> >>>> xattr and sets smk_task for user ns mounts, or else skip assigning the
>> >>>> label to the new task in bprm_set_creds. The latter seems more
>> >>>> consistent with the approach you've suggested for dealing with labels
>> >>>> from disk.
>> >>> Yes, I think that skipping the smk_fetch(XATTR_NAME_SMACKEXEC, ...) in
>> >>> smack_d_instantiate for unprivileged mounts would do the trick.
>> >>>
>> >>>> So I guess all of that seems okay, though perhaps a bit restrictive
>> >>>> given that the user who mounted the filesystem already has full access
>> >>>> to the backing store.
>> >>> In truth, there is no reason to expect that the "user" who did the
>> >>> mount will ever have a Smack label that differs from the label of
>> >>> the backing store. If what we've got here seems restrictive, it's
>> >>> because you've got access from someone other than the "user".
>> >>>
>> >>>> Please let me know whether or not this matches up with what you are
>> >>>> thinking, then I can proceed with the implementation.
>> >>> My current mindset is that, if you're going to allow unprivileged
>> >>> mounts of user defined backing stores, this is as safe as we can
>> >>> make it.
>> >> All right, I've got a patch which I think does this, and I've managed to
>> >> do some testing to confirm that it behaves like I expect. How does this
>> >> look?
>> >>
>> >> What's missing is getting the label from the block device inode; as
>> >> Stephen discovered the inode that I thought we could get the label from
>> >> turned out to be the wrong one. Afaict we would need a new hook in order
>> >> to do that, so for now I'm using the label of the process calling
>> >> mount.
>> >
>> > That will be OK if the mount processing checks for write access to
>> > the backing store. I haven't looked to see if it does. If it doesn't
>> > the problems should be pretty obvious.
>>
>>
>> do_new_mount
>>   vfs_kern_mount
>>     mount_fs
>>       ...
>>         mount_bdev
>>           blkdev_get_by_path(...,FMODE_READ | FMODE_WRITE | FMODE_EXCL,...)
>>             lookup_bdev
>>               kern_path
>>                 filename_lookup
>>                   path_lookupat
>>                     lookup_last
>>                       walk_component
>>             blkdev_get(...,mode,...)
>>               __blkdev_get(...,mode,...)
>>                 devcgroup_inode_permission(bdev->bd_inode, perm)
>>
>> *scratches my head*
>>
>> It looks like we don't actually check the permissions on the block
>> device. Tomoyo has a hack for it. nfsd does something. There is
>> devcgroup silliness.
>>
>> But overall it looks like we depend on capable(CAP_SYS_ADMIN).
>>
>> Seth, I do believe we have found another area of the vfs we will need to
>> shore up before allowing unprivileged mounts of block-device-based
>> filesystems.
>>
>> There are enough hacks here that having someone with a clue come through
>> and make the code make more sense seems like a good idea anyway.
>
> Yep, I just came to the same conclusion myself, and I also verified the
> behavior empirically. That's definitely a problem. I'll get to work on
> fixing that.

At a quick glance it looks like lookup_bdev, and most of its callers,
need to be modified to potentially do the additional permission
checking.

I expect we could move the devcgroup checks into whatever new checks we
wind up adding.

Fun, fun fun.

Eric

2015-07-31 08:11:32

by Amir Goldstein

Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Thu, Jul 30, 2015 at 6:33 PM, Casey Schaufler <[email protected]> wrote:
> On 7/30/2015 7:47 AM, Amir Goldstein wrote:
>> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
>> <[email protected]> wrote:
>>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>>> <[email protected]> wrote:
>>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>>
>>>>>>> 1. smk_root and smk_default are assigned the label of the backing
>>>>>>> device.
>>>> Seth,
>>>>
>>>> There were 2 main concerns discussed in this thread:
>>>> 1. trusting LSM labels outside the namespace
>>>> 2. trusting the content of the image file/loopdev
>>>>
>>>> While your approach addresses the first concern, I suspect it may be placing
>>>> an obstacle in the way of resolving the second concern.
>>>>
>>>> A viable security policy to mitigate the second concern could be:
>>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>>> - Allow mount only of 'Loopback' images
>>>>
>>>> This should allow the system as a whole to trust unprivileged mounts based on
>>>> the trust of the entities that had raw access to the fs layout.
>>> You don't really say what you mean by "trusted" programs. In a container
>>> context I'd have to assume that you mean suid-root or similar programs
>>> shared into the container by the host. In that case is any new kernel
>>> functionality even required?
>> Sorry I was not clear. I will try to explain better.
>> I meant that the programs are "trusted" by the LSM security policy.
>> I envisioned a system where unprivileged user is allowed to spawn
>> a container which contains "trusted" programs (e.g. mkfs) that are labeled
>> as 'FileSystemTools' by the admin of the host.
>> FileSystemTools are allowed to write into Loopback labeled files.
>
> You could do this on a Smack based system. It would require
> CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
> to set some SMACK64EXEC labels on your FileSystemTools, and
> they would have to be written as carefully as they would if they
> had "more" privilege. You'd need to designate a repository for
> your loopback files. On the whole, it would be unattractive.
> I will pass on providing the details for fear someone will like
> it well enough to implement.
>
>>> That also doesn't work for some of our use cases, where we'd like to be
>>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>>> unprivileged container where foo.img is not created on the local machine
>>> and not fully under control of the host environment.
>> That use case will not be addressed by the policy I suggested,
>> but the more common case of:
>> - create a loopback file
>> - mkfs
>> - mount
>> will be addressed.
>>
>> So if the (host) admin of the system trusts that unprivileged user cannot create
>> a malicious fs layout using mkfs and fsck alone, then the system is
>> relatively safe
>> mounting (non fuse) file systems from loopback files.
>> IMHO, this statement is going to be easier for Ted to sign.
>
> But that sort of defeats the purpose of unprivileged mounts.
> Or rather, you're trying to place restrictions on what an
> unprivileged user can do without calling the ability to
> violate those restrictions "privilege".

I don't understand your concern.
I am saying that an LSM can come to the rescue in a use case that
many have considered unsolvable (i.e. loopback tampering).

Yes, I am trying to place restrictions on what an unprivileged user can do.
As it stands right now, the user is about to gain the ability to mount FUSE.
With some extra care in crafting the policy, and without any extra code,
the user can gain the ability to mount "trusted loopback files".
It does not solve all use cases, but it does solve a handful.

Anyway, the concern I was raising was that if files inside
the loopback mount inherit the label of the loopback file, this policy is
going to be impossible to write.
But Stephen has already proposed an alternative to this implicit inheritance rule
on the [PATCH 6/7] thread, so I withdraw my concern.


>
>>
>>> Agreed though that the "attack from below" problem for untrusted
>>> filesystems is still an open question. At minimum we have fuse, which
>>> has been designed to protect against this threat. Others have mentioned
>>> on this thread that Ted had said something at kernel summit last year
>>> about being willing to support ext4 mounts from unprivileged user
>>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>>> or deny this rumor.
>>>
>>>> Alas, if you choose to propagate the backing dev label to contained files,
>>>> they would all share the designated 'Loopback' label and render the policy above
>>>> useless.
>>>>
>>>> Any thoughts on how to reconcile this conflict?
>>> I'm not seeing what the conflict is here - nothing you proposed says
>>> anything about security labels in the filesystem, and nothing would
>>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>>> label was desired on the backing device. Care to elaborate?
>>>
>>> Seth
>

2015-07-31 08:36:13

by Amir Goldstein

Subject: Re: [PATCH 1/7] fs: Add user namespace member to struct super_block

On Thu, Jul 16, 2015 at 5:47 AM, Eric W. Biederman
<[email protected]> wrote:
> Seth Forshee <[email protected]> writes:
>
>> Initially this will be used to eliminate the implicit MNT_NODEV
>> flag for mounts from user namespaces. In the future it will also
>> be used for translating ids and checking capabilities for
>> filesystems mounted from user namespaces.
>>
>> s_user_ns is initialized in alloc_super() and is generally set to
>> current_user_ns(). To avoid security and corruption issues, two
>> additional mount checks are also added:
>>
>> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>> in current_user_ns().
>>
>> - sget() will fail with EBUSY when the filesystem it's looking
>> for is already mounted from another user namespace.
>>
>> proc needs some special handling here. The user namespace of
>> current isn't appropriate when forking as a result of clone (2)
>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>> from within the new user namespace. Instead, the user namespace
>> which owns the new pid namespace should be used. sget_userns() is
>> added to allow passing of a user namespace other than that of
>> current, and this is used by proc_mount(). sget() becomes a
>> wrapper around sget_userns() which passes current_user_ns().
>
> From bits of the previous conversation.
>
> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
> xattrs can travel from one mount of sysfs to another via the sysfs
> backing store.
>
> For tmpfs and any other filesystems that support xattrs and that we
> allow mounting without privilege, we need to identify them and
> see if userspace is taking advantage of the ability to set
> xattrs and file caps (unlikely). If it is, we need to call
> sget_userns(..., &init_user_ns) on those filesystems as well.
>
> Possibly/Probably we should just do that for all of the interesting
> filesystems to start with and then change back to an ordinary old sget
> after we have done the testing and confirmed we will not be introducing
> userspace regressions.

Eric,

Perhaps it is too soon to discuss this here, but how do you envision
handling filesystem-private mount options in a user ns?

For example, suppose that we get to a point where we can trust
an ext4 loopback mount not to be vulnerable to exploits.
That loopback-mounted fs could very well have errors, so the
errors=panic option would be very much undesirable in an unprivileged user mount.

Do you think this would require extra flags/callbacks from the VFS to
file system code, or would s_user_ns be sufficient?

>
> Eric

2015-07-31 14:41:08

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

Amir Goldstein <[email protected]> writes:

> On Thu, Jul 16, 2015 at 5:47 AM, Eric W. Biederman
> <[email protected]> wrote:
>> Seth Forshee <[email protected]> writes:
>>
>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>> flag for mounts from user namespaces. In the future it will also
>>> be used for translating ids and checking capabilities for
>>> filesystems mounted from user namespaces.
>>>
>>> s_user_ns is initialized in alloc_super() and is generally set to
>>> current_user_ns(). To avoid security and corruption issues, two
>>> additional mount checks are also added:
>>>
>>> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>> in current_user_ns().
>>>
>>> - sget() will fail with EBUSY when the filesystem it's looking
>>> for is already mounted from another user namespace.
>>>
>>> proc needs some special handling here. The user namespace of
>>> current isn't appropriate when forking as a result of clone (2)
>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>> from within the new user namespace. Instead, the user namespace
>>> which owns the new pid namespace should be used. sget_userns() is
>>> added to allow passing of a user namespace other than that of
>>> current, and this is used by proc_mount(). sget() becomes a
>>> wrapper around sget_userns() which passes current_user_ns().
>>
>> From bits of the previous conversation.
>>
>> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
>> xattrs can travel from one mount of sysfs to another via the sysfs
>> backing store.
>>
>> For tmpfs and any other filesystems we support mounting without
>> privilege that support xattrs. We need to identify them and
>> see if userspace is taking advantage of the ability to set
>> xattrs and file caps (unlikely). If they are we need to call
>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>
>> Possibly/Probably we should just do that for all of the interesting
>> filesystems to start with and then change back to an ordinary old sget
>> after we have done the testing and confirmed we will not be introducing
>> userspace regressions.
>
> Eric,
>
> Perhaps it is too soon to discuss here, but how do you envision
> handling of file system private mount options in user ns.
>
> For example, suppose that we get to a point where we can trust
> an ext4 loopback mount to be non vulnerable to exploits.
> That loopback mounted fs could very well have errors and so
> errors=panic option would be very much undesired from an unprivileged user mount.
>
> Do you think this would require extra flags/callbacks from VFS to
> file system code or would s_user_ns be sufficient?

This case is easy. In mount or remount we just need to check
capable(CAP_SYS_ADMIN) if someone sets errors=panic, and if the capable
call fails, don't allow the mount or the remount.

But this corner case is another good reminder that we have to be very
deliberate and very careful before we enable mounting a filesystem this
way.
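
A minimal userspace sketch of the check described above, for
illustration only. The option is spelled errors=panic here, as ext4
spells it. The names parse_fs_options() and struct mount_ctx are
hypothetical, and the has_cap_sys_admin field merely stands in for the
result of capable(CAP_SYS_ADMIN); this is not the ext4 option parser.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the state the kernel would consult. */
struct mount_ctx {
	bool has_cap_sys_admin;	/* models the result of capable(CAP_SYS_ADMIN) */
};

/*
 * Walk a comma-separated option string and refuse "errors=panic" unless
 * the caller is privileged. Returns 0 on success or -EPERM on refusal.
 * The same check would apply to both mount and remount.
 */
int parse_fs_options(const struct mount_ctx *ctx, const char *opts)
{
	static const char panic_opt[] = "errors=panic";
	const size_t panic_len = sizeof(panic_opt) - 1;

	assert(ctx != NULL && opts != NULL);

	while (*opts) {
		/* Length of the current option, up to the next comma. */
		size_t len = strcspn(opts, ",");

		if (len == panic_len && strncmp(opts, panic_opt, len) == 0 &&
		    !ctx->has_cap_sys_admin)
			return -EPERM;

		opts += len;
		if (*opts == ',')
			opts++;
	}
	return 0;
}
```

Other values of the option, such as errors=remount-ro, pass through
untouched in this sketch; only the panic variant is gated on privilege.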

Eric

2015-07-31 19:57:07

by Casey Schaufler

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On 7/31/2015 1:11 AM, Amir Goldstein wrote:
> On Thu, Jul 30, 2015 at 6:33 PM, Casey Schaufler <[email protected]> wrote:
>> On 7/30/2015 7:47 AM, Amir Goldstein wrote:
>>> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
>>> <[email protected]> wrote:
>>>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>>>> <[email protected]> wrote:
>>>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>>>
>>>>>>>> 1. smk_root and smk_default are assigned the label of the backing
>>>>>>>> device.
>>>>> Seth,
>>>>>
>>>>> There were 2 main concerns discussed in this thread:
>>>>> 1. trusting LSM labels outside the namespace
>>>>> 2. trusting the content of the image file/loopdev
>>>>>
>>>>> While your approach addresses the first concern, I suspect it may be placing
>>>>> an obstacle in a way for resolving the second concern.
>>>>>
>>>>> A viable security policy to mitigate the second concern could be:
>>>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>>>> - Allow mount only of 'Loopback' images
>>>>>
>>>>> This should allow the system as a whole to trust unprivileged mounts based on
>>>>> the trust of the entities that had raw access to the fs layout.
>>>> You don't really say what you mean by "trusted" programs. In a container
>>>> context I'd have to assume that you mean suid-root or similar programs
>>>> shared into the container by the host. In that case is any new kernel
>>>> functionality even required?
>>> Sorry I was not clear. I will try to explain better.
>>> I meant that the programs are "trusted" by the LSM security policy.
>>> I envisioned a system where unprivileged user is allowed to spawn
>>> a container which contains "trusted" programs (e.g. mkfs) that are labeled
>>> as 'FileSystemTools' by the admin of the host.
>>> FileSystemTools are allowed to write into Loopback labeled files.
>> You could do this on a Smack based system. It would require
>> CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
>> to set some SMACK64EXEC labels on your FileSystemTools, and
>> they would have to be written as carefully as they would if they
>> had "more" privilege. You'd need to designate a repository for
>> your loopback files. On the whole, it would be unattractive.
>> I will pass on providing the details for fear someone will like
>> it well enough to implement.
>>
>>>> That also doesn't work for some of our use cases, where we'd like to be
>>>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>>>> unprivileged container where foo.img is not created on the local machine
>>>> and not fully under control of the host environment.
>>> That use case will not be addressed by the policy I suggested,
>>> but the more common case of:
>>> - create a loopback file
>>> - mkfs
>>> - mount
>>> will be addressed.
>>>
>>> So if the (host) admin of the system trusts that unprivileged user cannot create
>>> a malicious fs layout using mkfs and fsck alone, then the system is
>>> relatively safe
>>> mounting (non fuse) file systems from loopback files.
>>> IMHO, this statement is going to be easier for Ted to sign.
>> But that sort of defeats the purpose of unprivileged mounts.
>> Or rather, you're trying to place restrictions on what an
>> unprivileged user can do without calling the ability to
>> violate those restrictions "privilege".
> I don't understand your concern.

My concern is that you're playing a shell game. Allow unprivileged
mounts, but only of things that were created using privilege. How
is that better than requiring privilege to do the mount?

> I am saying that LSM can come to the rescue, in a use case that
> many have been considering as unsolvable (i.e. the loopback tampering).
>
> Yes, I am trying to place restrictions on what an unprivileged user can do.
> As it stands right now, user is about to gain the ability to mount FUSE.
> With some extra care on crafting the policy and without any extra code,
> user can gain the ability to mount "trusted loopback files".
> It does not solve all use cases, but it does solve a handful.

As I said, you can do this, but it will be ugly, and people won't
understand how to use it correctly. The distance between the "trusted"
creation of the filesystem and the "untrusted" mount is too great.
Plus, there are too many ways to circumvent the integrity of your
"trusted" filesystem.

> Anyway, the concern I was raising was about the fact that if files inside
> the loopback mount inherit the label of the loopback file, this policy is
> going to be impossible to write.
> But Stephen has already proposed an alternative to this implicit inherit rule
> on [PATCH 6/7] thread, so I withdraw my concern.

What Stephen has proposed is dandy for SELinux.

>
>
>>>> Agreed though that the "attack from below" problem for untrusted
>>>> filesystems is still an open question. At minimum we have fuse, which
>>>> has been designed to protect against this threat. Others have mentioned
>>>> on this thread that Ted had said something at kernel summit last year
>>>> about being willing to support ext4 mounts from unprivileged user
>>>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>>>> or deny this rumor.
>>>>
>>>>> Alas, if you choose to propagate the backing dev label to contained files,
>>>>> they would all share the designated 'Loopback' label and render the policy above
>>>>> useless.
>>>>>
>>>>> Any thoughts on how to reconcile this conflict?
>>>> I'm not seeing what the conflict is here - nothing you proposed says
>>>> anything about security labels in the filesystem, and nothing would
>>>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>>>> label was desired on the backing device. Care to elaborate?
>>>>
>>>> Seth

2015-08-01 17:01:30

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 0/7] Initial support for user namespace owned mounts

On Fri, Jul 31, 2015 at 10:56 PM, Casey Schaufler
<[email protected]> wrote:
> On 7/31/2015 1:11 AM, Amir Goldstein wrote:
>> On Thu, Jul 30, 2015 at 6:33 PM, Casey Schaufler <[email protected]> wrote:
>>> On 7/30/2015 7:47 AM, Amir Goldstein wrote:
>>>> On Thu, Jul 30, 2015 at 4:55 PM, Seth Forshee
>>>> <[email protected]> wrote:
>>>>> On Thu, Jul 30, 2015 at 07:24:11AM +0300, Amir Goldstein wrote:
>>>>>> On Tue, Jul 28, 2015 at 11:40 PM, Seth Forshee
>>>>>> <[email protected]> wrote:
>>>>>>> On Wed, Jul 22, 2015 at 05:05:17PM -0700, Casey Schaufler wrote:
>>>>>>>>> This is what I currently think you want for user ns mounts:
>>>>>>>>>
>>>>>>>>> 1. smk_root and smk_default are assigned the label of the backing
>>>>>>>>> device.
>>>>>> Seth,
>>>>>>
>>>>>> There were 2 main concerns discussed in this thread:
>>>>>> 1. trusting LSM labels outside the namespace
>>>>>> 2. trusting the content of the image file/loopdev
>>>>>>
>>>>>> While your approach addresses the first concern, I suspect it may be placing
>>>>>> an obstacle in a way for resolving the second concern.
>>>>>>
>>>>>> A viable security policy to mitigate the second concern could be:
>>>>>> - Allow only trusted programs (e.g. mkfs, fsck) to write to 'Loopback' images
>>>>>> - Allow mount only of 'Loopback' images
>>>>>>
>>>>>> This should allow the system as a whole to trust unprivileged mounts based on
>>>>>> the trust of the entities that had raw access to the fs layout.
>>>>> You don't really say what you mean by "trusted" programs. In a container
>>>>> context I'd have to assume that you mean suid-root or similar programs
>>>>> shared into the container by the host. In that case is any new kernel
>>>>> functionality even required?
>>>> Sorry I was not clear. I will try to explain better.
>>>> I meant that the programs are "trusted" by the LSM security policy.
>>>> I envisioned a system where unprivileged user is allowed to spawn
>>>> a container which contains "trusted" programs (e.g. mkfs) that are labeled
>>>> as 'FileSystemTools' by the admin of the host.
>>>> FileSystemTools are allowed to write into Loopback labeled files.
>>> You could do this on a Smack based system. It would require
>>> CAP_MAC_ADMIN and CAP_MAC_OVERRIDE to set up. You would need
>>> to set some SMACK64EXEC labels on your FileSystemTools, and
>>> they would have to be written as carefully as they would if they
>>> had "more" privilege. You'd need to designate a repository for
>>> your loopback files. On the whole, it would be unattractive.
>>> I will pass on providing the details for fear someone will like
>>> it well enough to implement.
>>>
>>>>> That also doesn't work for some of our use cases, where we'd like to be
>>>>> able to do something like "mount -o loop foo.img /mnt/foo" in an
>>>>> unprivileged container where foo.img is not created on the local machine
>>>>> and not fully under control of the host environment.
>>>> That use case will not be addressed by the policy I suggested,
>>>> but the more common case of:
>>>> - create a loopback file
>>>> - mkfs
>>>> - mount
>>>> will be addressed.
>>>>
>>>> So if the (host) admin of the system trusts that unprivileged user cannot create
>>>> a malicious fs layout using mkfs and fsck alone, then the system is
>>>> relatively safe
>>>> mounting (non fuse) file systems from loopback files.
>>>> IMHO, this statement is going to be easier for Ted to sign.
>>> But that sort of defeats the purpose of unprivileged mounts.
>>> Or rather, you're trying to place restrictions on what an
>>> unprivileged user can do without calling the ability to
>>> violate those restrictions "privilege".
>> I don't understand your concern.
>
> My concern is that you're playing a shell game. Allow unprivileged
> mounts, but only of things that were created using privilege. How
> is that better than requiring privilege to do the mount?

To me, the ability of an admin to delegate permissions to an unprivileged
user to mkfs/fsck/mount "trusted" loopdevs sounds very useful.
But I am not going to argue that use case any further.

I do agree that it would be much better if a user namespace
could allow unprivileged mounts of certain non-FUSE file systems
without relying on specially crafted security policies, but I do not
see how that can happen.


>
>> I am saying that LSM can come to the rescue, in a use case that
>> many have been considering as unsolvable (i.e. the loopback tampering).
>>
>> Yes, I am trying to place restrictions on what an unprivileged user can do.
>> As it stands right now, user is about to gain the ability to mount FUSE.
>> With some extra care on crafting the policy and without any extra code,
>> user can gain the ability to mount "trusted loopback files".
>> It does not solve all use cases, but it does solve a handful.
>
> As I said, you can do this, but it will be ugly, and people won't
> understand how to use it correctly. The distance between the "trusted"
> creation of the filesystem and the "untrusted" mount is too great.
> Plus, there are too many ways to circumvent the integrity of your
> "trusted" filesystem.
>
>> Anyway, the concern I was raising was about the fact that if files inside
>> the loopback mount inherit the label of the loopback file, this policy is
>> going to be impossible to write.
>> But Stephen has already proposed an alternative to this implicit inherit rule
>> on [PATCH 6/7] thread, so I withdraw my concern.
>
> What Stephen has proposed is dandy for SELinux.
>
>>
>>
>>>>> Agreed though that the "attack from below" problem for untrusted
>>>>> filesystems is still an open question. At minimum we have fuse, which
>>>>> has been designed to protect against this threat. Others have mentioned
>>>>> on this thread that Ted had said something at kernel summit last year
>>>>> about being willing to support ext4 mounts from unprivileged user
>>>>> namespaces as well. I've added Ted to the Cc in case he wants to confirm
>>>>> or deny this rumor.
>>>>>
>>>>>> Alas, if you choose to propagate the backing dev label to contained files,
>>>>>> they would all share the designated 'Loopback' label and render the policy above
>>>>>> useless.
>>>>>>
>>>>>> Any thoughts on how to reconcile this conflict?
>>>>> I'm not seeing what the conflict is here - nothing you proposed says
>>>>> anything about security labels in the filesystem, and nothing would
>>>>> prevent a "trusted" program with CAP_MAC_ADMIN from setting whatever
>>>>> label was desired on the backing device. Care to elaborate?
>>>>>
>>>>> Seth
>

2015-08-05 21:04:03

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> Seth Forshee <[email protected]> writes:
>
> > Initially this will be used to eliminate the implicit MNT_NODEV
> > flag for mounts from user namespaces. In the future it will also
> > be used for translating ids and checking capabilities for
> > filesystems mounted from user namespaces.
> >
> > s_user_ns is initialized in alloc_super() and is generally set to
> > current_user_ns(). To avoid security and corruption issues, two
> > additional mount checks are also added:
> >
> > - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> > in current_user_ns().
> >
> > - sget() will fail with EBUSY when the filesystem it's looking
> > for is already mounted from another user namespace.
> >
> > proc needs some special handling here. The user namespace of
> > current isn't appropriate when forking as a result of clone (2)
> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> > from within the new user namespace. Instead, the user namespace
> > which owns the new pid namespace should be used. sget_userns() is
> > added to allow passing of a user namespace other than that of
> > current, and this is used by proc_mount(). sget() becomes a
> > wrapper around sget_userns() which passes current_user_ns().
>
> From bits of the previous conversation.
>
> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
> xattrs can travel from one mount of sysfs to another via the sysfs
> backing store.
>
> For tmpfs and any other filesystems we support mounting without
> privilege that support xattrs. We need to identify them and
> see if userspace is taking advantage of the ability to set
> xattrs and file caps (unlikely). If they are we need to call
> sget_userns(..., &init_user_ns) on those filesystems as well.
>
> Possibly/Probably we should just do that for all of the interesting
> filesystems to start with and then change back to an ordinary old sget
> after we have done the testing and confirmed we will not be introducing
> userspace regressions.

I was reviewing everything in preparation for sending v2 patches, and I
realized that doing this has an undesirable side effect. In patch 2 the
implicit nodev is removed for unprivileged mounts, and instead s_user_ns
is used to block opening devices in these mounts. When we set s_user_ns
to &init_user_ns, it becomes possible to open device nodes from
unprivileged mounts of these filesystems.

This doesn't pose a real problem today. The only filesystems it will
affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
&init_user_ns for user namespace mounts), and none of these is a
problem. sysfs is okay because kernfs doesn't (currently?) allow device
nodes, and a user would require CAP_MKNOD to create any device nodes in
a tmpfs or ramfs mount.

But for sysfs in particular it does mean that we will need to make sure
that there's no way that device nodes could start appearing in an
unprivileged mount.
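
The side effect described above can be illustrated with a small
userspace model. It assumes, as a simplification of patch 2, that
opening a device node is permitted only when the superblock's s_user_ns
is &init_user_ns and nodev is not set. The struct layouts and
may_open_dev() are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct user_namespace { int level; };	/* opaque for this sketch */

struct user_namespace init_user_ns = { .level = 0 };

struct super_block {
	struct user_namespace *s_user_ns;	/* namespace that owns the mount */
	bool nodev;				/* explicit nodev mount flag */
};

/* Hypothetical helper: may a device node on this superblock be opened? */
bool may_open_dev(const struct super_block *sb)
{
	assert(sb != NULL);

	return !sb->nodev && sb->s_user_ns == &init_user_ns;
}
```

In this model, a superblock whose s_user_ns has been forced to
&init_user_ns passes the check even when the mount was performed
without privilege, so the only remaining protection is that no device
nodes can be created or appear in it.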

2015-08-05 21:25:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

Seth Forshee <[email protected]> writes:

> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>> Seth Forshee <[email protected]> writes:
>>
>> > Initially this will be used to eliminate the implicit MNT_NODEV
>> > flag for mounts from user namespaces. In the future it will also
>> > be used for translating ids and checking capabilities for
>> > filesystems mounted from user namespaces.
>> >
>> > s_user_ns is initialized in alloc_super() and is generally set to
>> > current_user_ns(). To avoid security and corruption issues, two
>> > additional mount checks are also added:
>> >
>> > - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>> > in current_user_ns().
>> >
>> > - sget() will fail with EBUSY when the filesystem it's looking
>> > for is already mounted from another user namespace.
>> >
>> > proc needs some special handling here. The user namespace of
>> > current isn't appropriate when forking as a result of clone (2)
>> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>> > from within the new user namespace. Instead, the user namespace
>> > which owns the new pid namespace should be used. sget_userns() is
>> > added to allow passing of a user namespace other than that of
>> > current, and this is used by proc_mount(). sget() becomes a
>> > wrapper around sget_userns() which passes current_user_ns().
>>
>> From bits of the previous conversation.
>>
>> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
>> xattrs can travel from one mount of sysfs to another via the sysfs
>> backing store.
>>
>> For tmpfs and any other filesystems we support mounting without
>> privilege that support xattrs. We need to identify them and
>> see if userspace is taking advantage of the ability to set
>> xattrs and file caps (unlikely). If they are we need to call
>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>
>> Possibly/Probably we should just do that for all of the interesting
>> filesystems to start with and then change back to an ordinary old sget
>> after we have done the testing and confirmed we will not be introducing
>> userspace regressions.
>
> I was reviewing everything in preparation for sending v2 patches, and I
> realized that doing this has an undesirable side effect. In patch 2 the
> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> is used to block opening devices in these mounts. When we set s_user_ns
> to &init_user_ns, it becomes possible to open device nodes from
> unprivileged mounts of these filesystems.
>
> This doesn't pose a real problem today. The only filesystems it will
> affect is sysfs, tmpfs, and ramfs (no others need s_user_ns =
> &init_user_ns for user namespace mounts), and all of these aren't
> problems. sysfs is okay because kernfs doesn't (currently?) allow device
> nodes, and a user would require CAP_MKNOD to create any device nodes in
> a tmpfs or ramfs mount.
>
> But for sysfs in particular it does mean that we will need to make sure
> that there's no way that device nodes could start appearing in an
> unprivileged mount.

Good point about nodev.

For tmpfs and ramfs and security labels, the smack policy of allowing but
filtering security labels means that smack, once it has those bits, will not
care which user namespace ramfs and tmpfs live in. The labels should
pretty much stay the same in any case.

If the same class of handling will also apply to selinux, and those are
the only two security modules that apply labels, then we can leave tmpfs
and ramfs with the security labels of whoever mounted them.

For sysfs things get a little more interesting. Assuming tmpfs and
ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
with possibly invalid security labels set on a different mount of
selinux. (I am wondering now how all of these labels work in the
context of nfs).

The worst case for sysfs is that we come up with a cousin of
SB_I_NO_EXEC say SB_I_NO_DEV.

But at the moment I am hoping that limited label storage in a user
namespace as you and Casey have been talking about winds up being the
norm and then we can follow the standard rules for setting s_user_ns and
still preserve the current label setting behavior.

Eric

2015-08-06 14:20:44

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> Seth Forshee <[email protected]> writes:
>
> > On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >> Seth Forshee <[email protected]> writes:
> >>
> >> > Initially this will be used to eliminate the implicit MNT_NODEV
> >> > flag for mounts from user namespaces. In the future it will also
> >> > be used for translating ids and checking capabilities for
> >> > filesystems mounted from user namespaces.
> >> >
> >> > s_user_ns is initialized in alloc_super() and is generally set to
> >> > current_user_ns(). To avoid security and corruption issues, two
> >> > additional mount checks are also added:
> >> >
> >> > - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >> > in current_user_ns().
> >> >
> >> > - sget() will fail with EBUSY when the filesystem it's looking
> >> > for is already mounted from another user namespace.
> >> >
> >> > proc needs some special handling here. The user namespace of
> >> > current isn't appropriate when forking as a result of clone (2)
> >> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >> > from within the new user namespace. Instead, the user namespace
> >> > which owns the new pid namespace should be used. sget_userns() is
> >> > added to allow passing of a user namespace other than that of
> >> > current, and this is used by proc_mount(). sget() becomes a
> >> > wrapper around sget_userns() which passes current_user_ns().
> >>
> >> From bits of the previous conversation.
> >>
> >> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
> >> xattrs can travel from one mount of sysfs to another via the sysfs
> >> backing store.
> >>
> >> For tmpfs and any other filesystems we support mounting without
> >> privilege that support xattrs. We need to identify them and
> >> see if userspace is taking advantage of the ability to set
> >> xattrs and file caps (unlikely). If they are we need to call
> >> sget_userns(..., &init_user_ns) on those filesystems as well.
> >>
> >> Possibly/Probably we should just do that for all of the interesting
> >> filesystems to start with and then change back to an ordinary old sget
> >> after we have done the testing and confirmed we will not be introducing
> >> userspace regressions.
> >
> > I was reviewing everything in preparation for sending v2 patches, and I
> > realized that doing this has an undesirable side effect. In patch 2 the
> > implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> > is used to block opening devices in these mounts. When we set s_user_ns
> > to &init_user_ns, it becomes possible to open device nodes from
> > unprivileged mounts of these filesystems.
> >
> > This doesn't pose a real problem today. The only filesystems it will
> > affect is sysfs, tmpfs, and ramfs (no others need s_user_ns =
> > &init_user_ns for user namespace mounts), and all of these aren't
> > problems. sysfs is okay because kernfs doesn't (currently?) allow device
> > nodes, and a user would require CAP_MKNOD to create any device nodes in
> > a tmpfs or ramfs mount.
> >
> > But for sysfs in particular it does mean that we will need to make sure
> > that there's no way that device nodes could start appearing in an
> > unprivileged mount.
>
> Good point about nodev.
>
> For tmpfs and ramfs and security labels the smack policy of allowing but
> filtering security labels mean smack once it has those bits will not
> care which user namespace ramfs and tmpfs live in. The labels should
> pretty much stay the same in any case.

Smack does care which namespace ramfs and tmpfs are in. With the patch
I've got right now, if s_user_ns != &init_user_ns and the label of an
inode does not match that of the root inode then
security_inode_permission() will return EACCES.

So if something with CAP_MAC_ADMIN is changing security labels in such a
mount, suddenly those inodes might become inaccessible. And while it may
be unlikely that anyone is doing this it's impossible for me to prove
that's the case.
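
A rough userspace model of the behavior described above, assuming the
check reduces to a label comparison against the root inode. The types
and smack_ns_inode_permission() are hypothetical; in the kernel the
check is reached through security_inode_permission(), and a return of 0
here only means "continue with the normal Smack access checks".

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct user_namespace { int level; };	/* opaque for this sketch */
struct user_namespace init_user_ns = { .level = 0 };

struct super_block {
	struct user_namespace *s_user_ns;
	const char *root_label;		/* Smack label of the root inode */
};

struct inode {
	const struct super_block *i_sb;
	const char *label;		/* Smack label of this inode */
};

/*
 * For superblocks owned by a non-init user namespace, deny access to
 * any inode whose label differs from the root inode's label.
 */
int smack_ns_inode_permission(const struct inode *inode)
{
	const struct super_block *sb;

	assert(inode != NULL && inode->i_sb != NULL);
	sb = inode->i_sb;

	if (sb->s_user_ns != &init_user_ns &&
	    strcmp(inode->label, sb->root_label) != 0)
		return -EACCES;
	return 0;
}
```

This is why relabeling by a CAP_MAC_ADMIN process matters: an inode that
was accessible while its label matched the root becomes inaccessible as
soon as the label changes.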

> If the same class of handling will also apply to selinux and those are
> the only two security modules that apply labels then we can leave tmpfs
> and ramfs with the security labels of whomever mounted them.

For SELinux I now have a patch which applies mountpoint labeling to
mounts for which s_user_ns != &init_user_ns. I'm less sure than with
Smack how this behavior will differ from what happens today, but my
understanding is that this means that the label of the mountpoint is
used for all objects from that superblock. Afaik it does not have the
Smack behavior of denying access to filesystem objects which have a
different label in the backing store.
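
As a sketch of the mountpoint labeling behavior described above, under
the stated understanding that one label is used for every object from
the superblock. effective_label() and the struct fields are
hypothetical, and real SELinux contexts are not plain strings.

```c
#include <assert.h>
#include <stddef.h>

struct user_namespace { int level; };	/* opaque for this sketch */
struct user_namespace init_user_ns = { .level = 0 };

struct super_block {
	struct user_namespace *s_user_ns;
	const char *mntpoint_label;	/* label applied to the whole mount */
};

/*
 * Return the label to use for an object: the stored label for
 * init-namespace mounts, the mountpoint label otherwise. The stored
 * label is ignored, not treated as grounds for denying access.
 */
const char *effective_label(const struct super_block *sb,
			    const char *stored_label)
{
	assert(sb != NULL);

	if (sb->s_user_ns != &init_user_ns)
		return sb->mntpoint_label;
	return stored_label;
}
```

The contrast with the Smack patch is that a mismatching stored label
is simply overridden here, so objects stay accessible under the
mountpoint's label.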

> For sysfs things get a little more interesting. Assuming tmpfs and
> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
> with possibly invalid security labels set on a different mount of
> selinux. (I am wondering now how all of these labels work in the
> context of nfs).

If someone was using Smack to label sysfs then a mount with s_user_ns !=
&init_user_ns is going to leave inaccessible anything without the same
label as the process which performed the mount.

Again with SELinux I'm less certain, but I think you could end up with a
sysfs superblock that has mountpoint labeling, and thus any labels set
in the mount in the init namespace would be ignored.

> The worst case for sysfs is that we come up with a cousin of
> SB_I_NO_EXEC say SB_I_NO_DEV.

That idea occurred to me. Or else something that indicated to the
security module that the filesystem has no user-controlled backing store
which could be used to inject security labels, thus allowing us to set
s_user_ns to a non-init namespace while still allowing standard MAC
labeling behavior.

> But at the moment I am hoping that limited label storage in a user
> namespace as you and Casey have been talking about winds up being the
> norm and then we can follow the standard rules for setting s_user_ns and
> still preserve the current label setting behavior.

Unfortunately I'm afraid that's not going to work out.

Seth

2015-08-06 14:52:23

by Stephen Smalley

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On 08/06/2015 10:20 AM, Seth Forshee wrote:
> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
>> Seth Forshee <[email protected]> writes:
>>
>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>>>> Seth Forshee <[email protected]> writes:
>>>>
>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>>>> flag for mounts from user namespaces. In the future it will also
>>>>> be used for translating ids and checking capabilities for
>>>>> filesystems mounted from user namespaces.
>>>>>
>>>>> s_user_ns is initialized in alloc_super() and is generally set to
>>>>> current_user_ns(). To avoid security and corruption issues, two
>>>>> additional mount checks are also added:
>>>>>
>>>>> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>>>> in current_user_ns().
>>>>>
>>>>> - sget() will fail with EBUSY when the filesystem it's looking
>>>>> for is already mounted from another user namespace.
>>>>>
>>>>> proc needs some special handling here. The user namespace of
>>>>> current isn't appropriate when forking as a result of clone (2)
>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>>>> from within the new user namespace. Instead, the user namespace
>>>>> which owns the new pid namespace should be used. sget_userns() is
>>>>> added to allow passing of a user namespace other than that of
>>>>> current, and this is used by proc_mount(). sget() becomes a
>>>>> wrapper around sget_userns() which passes current_user_ns().
>>>>
>>>> From bits of the previous conversation.
>>>>
>>>> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
>>>> xattrs can travel from one mount of sysfs to another via the sysfs
>>>> backing store.
>>>>
>>>> For tmpfs and any other filesystems we support mounting without
>>>> privilege that support xattrs. We need to identify them and
>>>> see if userspace is taking advantage of the ability to set
>>>> xattrs and file caps (unlikely). If they are we need to call
>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>>>
>>>> Possibly/Probably we should just do that for all of the interesting
>>>> filesystems to start with and then change back to an ordinary old sget
>>>> after we have done the testing and confirmed we will not be introducing
>>>> userspace regressions.
>>>
>>> I was reviewing everything in preparation for sending v2 patches, and I
>>> realized that doing this has an undesirable side effect. In patch 2 the
>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
>>> is used to block opening devices in these mounts. When we set s_user_ns
>>> to &init_user_ns, it becomes possible to open device nodes from
>>> unprivileged mounts of these filesystems.
>>>
>>> This doesn't pose a real problem today. The only filesystems it will
>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
>>> &init_user_ns for user namespace mounts), and all of these aren't
>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
>>> a tmpfs or ramfs mount.
>>>
>>> But for sysfs in particular it does mean that we will need to make sure
>>> that there's no way that device nodes could start appearing in an
>>> unprivileged mount.
>>
>> Good point about nodev.
>>
>> For tmpfs and ramfs and security labels the smack policy of allowing but
>> filtering security labels means smack once it has those bits will not
>> care which user namespace ramfs and tmpfs live in. The labels should
>> pretty much stay the same in any case.
>
> Smack does care which namespace ramfs and tmpfs are in. With the patch
> I've got right now, if s_user_ns != &init_user_ns and the label of an
> inode does not match that of the root inode then
> security_inode_permission() will return EACCES.
>
> So if something with CAP_MAC_ADMIN is changing security labels in such a
> mount, suddenly those inodes might become inaccessible. And while it may
> be unlikely that anyone is doing this it's impossible for me to prove
> that's the case.
>
>> If the same class of handling will also apply to selinux and those are
>> the only two security modules that apply labels then we can leave tmpfs
>> and ramfs with the security labels of whomever mounted them.
>
> For SELinux I now have a patch which applies mountpoint labeling to
> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> Smack how this behavior will differ from what happens today, but my
> understanding is that this means that the label of the mountpoint is
> used for all objects from that superblock. Afaik it does not have the
> Smack behavior of denying access to filesystem objects which have a
> different label in the backing store.
>
>> For sysfs things get a little more interesting. Assuming tmpfs and
>> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
>> with possibly invalid security labels set on a different mount of
>> selinux. (I am wondering now how all of these labels work in the
>> context of nfs).
>
> If someone was using Smack to label sysfs then a mount with s_user_ns !=
> &init_user_ns is going to leave inaccessible anything without the same
> label as the process which performed the mount.
>
> Again with SELinux I'm less certain, but I think you could end up with a
> sysfs superblock that has mountpoint labeling, and thus any labels set
> in the mount in the init namespace would be ignored.

If you're using the logic I suggested for SELinux, then SELinux will
only use mountpoint labeling if SELinux would otherwise fetch the
extended attribute value from the filesystem via ->getxattr (this is the
SECURITY_FS_USE_XATTR test in the code). As this is not the case for
purely in-memory filesystems like tmpfs, ramfs, or sysfs, SELinux will
still label those filesystems in the usual manner, i.e. it initially
computes a default label for new inodes, and if userspace later performs
a setxattr(), then it updates its internal state at that time from the
relevant hooks (inode_post_setxattr or inode_setsecurity).
So nothing should change for SELinux wrt labeling of tmpfs, ramfs, or
sysfs in userns mounts aside from not allowing the use of the additional
mount options (e.g. context=).
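
As a rough illustration of the rule above, here is a minimal userspace
sketch, not kernel code. The enum and the function name
toy_effective_behavior() are made up for this example; the enum values
only echo the SECURITY_FS_USE_* names and are not the kernel's
definitions. It assumes the only override is "xattr-backed labeling
becomes mountpoint labeling on a userns mount":

```c
#include <stdbool.h>

/* Toy labeling behaviors. The names echo SELinux's SECURITY_FS_USE_*
 * constants, but this enum is illustrative only. */
enum toy_behavior {
    TOY_USE_XATTR,     /* initial label fetched from the fs via getxattr */
    TOY_USE_TRANS,     /* label computed by the security module for new inodes */
    TOY_USE_GENFS,     /* label assigned by policy */
    TOY_USE_MNTPOINT,  /* one label for every object in the superblock */
};

/* Only a filesystem that would otherwise be asked for its xattrs is
 * switched to mountpoint labeling when mounted from a non-init user
 * namespace; every other behavior is left as configured. */
static enum toy_behavior toy_effective_behavior(enum toy_behavior configured,
                                                bool userns_mount)
{
    if (userns_mount && configured == TOY_USE_XATTR)
        return TOY_USE_MNTPOINT;
    return configured;
}
```

Under that assumption, in-memory filesystems keep whatever behavior they
were configured with, whether or not the mount came from a user namespace.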

Also, a superblock can only have a single labeling behavior, so you
can't have different mounts of sysfs, one using mountpoint labeling and
one not. An inode can only have one label, no matter how you reach it.

>> The worst case for sysfs is that we come up with a cousin of
>> SB_I_NO_EXEC say SB_I_NO_DEV.
>
> That idea occurred to me. Or else something that indicated to the
> security module that the filesystem has no user-controlled backing store
> which could be used to inject security labels, thus allowing us to set
> s_user_ns to a non-init namespace while still allowing standard MAC
> labeling behavior.
>
>> But at the moment I am hoping that limited label storage in a user
>> namespace as you and Casey have been talking about winds up being the
>> norm and then we can follow the standard rules for setting s_user_ns and
>> still preserve the current label setting behavior.
>
> Unfortunately I'm afraid that's not going to work out.
>
> Seth
>

2015-08-06 15:44:55

by Seth Forshee

Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On Thu, Aug 06, 2015 at 10:51:16AM -0400, Stephen Smalley wrote:
> On 08/06/2015 10:20 AM, Seth Forshee wrote:
> > On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> >> Seth Forshee <[email protected]> writes:
> >>
> >>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >>>> Seth Forshee <[email protected]> writes:
> >>>>
> >>>>> Initially this will be used to eliminate the implicit MNT_NODEV
> >>>>> flag for mounts from user namespaces. In the future it will also
> >>>>> be used for translating ids and checking capabilities for
> >>>>> filesystems mounted from user namespaces.
> >>>>>
> >>>>> s_user_ns is initialized in alloc_super() and is generally set to
> >>>>> current_user_ns(). To avoid security and corruption issues, two
> >>>>> additional mount checks are also added:
> >>>>>
> >>>>> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >>>>> in current_user_ns().
> >>>>>
> >>>>> - sget() will fail with EBUSY when the filesystem it's looking
> >>>>> for is already mounted from another user namespace.
> >>>>>
> >>>>> proc needs some special handling here. The user namespace of
> >>>>> current isn't appropriate when forking as a result of clone (2)
> >>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >>>>> from within the new user namespace. Instead, the user namespace
> >>>>> which owns the new pid namespace should be used. sget_userns() is
> >>>>> added to allow passing of a user namespace other than that of
> >>>>> current, and this is used by proc_mount(). sget() becomes a
> >>>>> wrapper around sget_userns() which passes current_user_ns().
> >>>>
> >>>> From bits of the previous conversation.
> >>>>
> >>>> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
> >>>> xattrs can travel from one mount of sysfs to another via the sysfs
> >>>> backing store.
> >>>>
> >>>> For tmpfs and any other filesystems we support mounting without
> >>>> privilege that support xattrs. We need to identify them and
> >>>> see if userspace is taking advantage of the ability to set
> >>>> xattrs and file caps (unlikely). If they are we need to call
> >>>> sget_userns(..., &init_user_ns) on those filesystems as well.
> >>>>
> >>>> Possibly/Probably we should just do that for all of the interesting
> >>>> filesystems to start with and then change back to an ordinary old sget
> >>>> after we have done the testing and confirmed we will not be introducing
> >>>> userspace regressions.
> >>>
> >>> I was reviewing everything in preparation for sending v2 patches, and I
> >>> realized that doing this has an undesirable side effect. In patch 2 the
> >>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> >>> is used to block opening devices in these mounts. When we set s_user_ns
> >>> to &init_user_ns, it becomes possible to open device nodes from
> >>> unprivileged mounts of these filesystems.
> >>>
> >>> This doesn't pose a real problem today. The only filesystems it will
> >>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> >>> &init_user_ns for user namespace mounts), and all of these aren't
> >>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
> >>> nodes, and a user would require CAP_MKNOD to create any device nodes in
> >>> a tmpfs or ramfs mount.
> >>>
> >>> But for sysfs in particular it does mean that we will need to make sure
> >>> that there's no way that device nodes could start appearing in an
> >>> unprivileged mount.
> >>
> >> Good point about nodev.
> >>
> >> For tmpfs and ramfs and security labels the smack policy of allowing but
> >> filtering security labels means smack once it has those bits will not
> >> care which user namespace ramfs and tmpfs live in. The labels should
> >> pretty much stay the same in any case.
> >
> > Smack does care which namespace ramfs and tmpfs are in. With the patch
> > I've got right now, if s_user_ns != &init_user_ns and the label of an
> > inode does not match that of the root inode then
> > security_inode_permission() will return EACCES.
> >
> > So if something with CAP_MAC_ADMIN is changing security labels in such a
> > mount, suddenly those inodes might become inaccessible. And while it may
> > be unlikely that anyone is doing this it's impossible for me to prove
> > that's the case.
> >
> >> If the same class of handling will also apply to selinux and those are
> >> the only two security modules that apply labels then we can leave tmpfs
> >> and ramfs with the security labels of whomever mounted them.
> >
> > For SELinux I now have a patch which applies mountpoint labeling to
> > mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> > Smack how this behavior will differ from what happens today, but my
> > understanding is that this means that the label of the mountpoint is
> > used for all objects from that superblock. Afaik it does not have the
> > Smack behavior of denying access to filesystem objects which have a
> > different label in the backing store.
> >
> >> For sysfs things get a little more interesting. Assuming tmpfs and
> >> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
> >> with possibly invalid security labels set on a different mount of
> >> selinux. (I am wondering now how all of these labels work in the
> >> context of nfs).
> >
> > If someone was using Smack to label sysfs then a mount with s_user_ns !=
> > &init_user_ns is going to leave inaccessible anything without the same
> > label as the process which performed the mount.
> >
> > Again with SELinux I'm less certain, but I think you could end up with a
> > sysfs superblock that has mountpoint labeling, and thus any labels set
> > in the mount in the init namespace would be ignored.
>
> If you're using the logic I suggested for SELinux, then SELinux will
> only use mountpoint labeling if SELinux would otherwise fetch the
> extended attribute value from the filesystem via ->getxattr (this is the
> SECURITY_FS_USE_XATTR test in the code). As this is not the case for
> purely in-memory filesystems like tmpfs, ramfs, or sysfs, SELinux will
> still label those filesystems in the usual manner, i.e. it initially
> computes a default label for new inodes, and if userspace later performs
> a setxattr(), then it updates its internal state at that time from the
> relevant hooks (inode_post_setxattr or inode_setsecurity).
> So nothing should change for SELinux wrt labeling of tmpfs, ramfs, or
> sysfs in userns mounts aside from not allowing the use of the additional
> mount options (e.g. context=).

This is the patch I have currently:

http://kernel.ubuntu.com/git/sforshee/linux.git/commit/?h=userns-mounts&id=080e5f5ee58143a56cfc57b4e51dff58b7a3cb1a

I haven't been able to figure out which labeling behavior sysfs would
end up with normally from just inspecting the code. kernfs does support
xattrs, but now that I look at the implementation it handles security
xattrs differently and calls security_inode_setsecurity whenever one is
written. I'm not sure how all of that is going to work out in practice
with SELinux.

> Also, a superblock can only have a single labeling behavior, so you
> can't have different mounts of sysfs, one using mountpoint labeling and
> one not. An inode can only have one label, no matter how you reach it.

There are multiple sysfs superblocks though, see sysfs_mount(). It calls
kernfs_mount_ns(), passing a kobject for the current net ns.
kernfs_test_super() only matches if the net ns matches an existing
superblock, so you end up with a different superblock per net ns.

For kobjects which aren't namespaced, the same path within two different
sysfs superblocks will be backed by the same kernfs node. kernfs stashes
the security context inside the kernfs node, so inodes in different
superblocks backed by the same kernfs node will have the same security
context.

So, with sysfs you can have different superblocks with (partially) the
same backing store, and it would be possible for those superblocks to
end up with different labeling behavior. I think we want to avoid having
security labels applied to sysfs files in the init namespace and have
those get lost in a mount from another namespace.
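
To make the sharing concrete, here is a toy userspace model, not the
kernel's code. All of the toy_* names are invented for this sketch. It
assumes only what is described above: a superblock is reused when the
namespace tag matches, and the security context lives on the shared
backing node rather than on the per-superblock inode:

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; none of these are the kernel's real structures. */
struct toy_node {               /* plays the role of a kernfs node */
    char secctx[64];            /* security context stored on the node */
};

struct toy_super {              /* plays the role of a sysfs superblock */
    const void *ns;             /* net namespace tag it was mounted for */
};

struct toy_inode {              /* an inode belongs to one superblock ... */
    struct toy_super *sb;
    struct toy_node *node;      /* ... but may share its backing node */
};

#define TOY_MAX_SUPERS 8
static struct toy_super toy_supers[TOY_MAX_SUPERS];
static size_t toy_nr_supers;

/* Reuse an existing superblock only when the namespace tag matches;
 * otherwise allocate a new one (NULL if the toy table is full). */
static struct toy_super *toy_mount(const void *ns)
{
    for (size_t i = 0; i < toy_nr_supers; i++)
        if (toy_supers[i].ns == ns)
            return &toy_supers[i];
    if (toy_nr_supers == TOY_MAX_SUPERS)
        return NULL;
    toy_supers[toy_nr_supers].ns = ns;
    return &toy_supers[toy_nr_supers++];
}

/* Labels are written to, and read from, the shared backing node. */
static void toy_set_label(struct toy_inode *inode, const char *ctx)
{
    strncpy(inode->node->secctx, ctx, sizeof(inode->node->secctx) - 1);
    inode->node->secctx[sizeof(inode->node->secctx) - 1] = '\0';
}

static const char *toy_get_label(const struct toy_inode *inode)
{
    return inode->node->secctx;
}
```

In this model, mounting for two different namespace tags yields two
superblocks, yet a label set through an inode in one is what an inode in
the other reads back, as long as both point at the same node.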

Seth

2015-08-06 16:13:03

by Stephen Smalley

Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On 08/06/2015 11:44 AM, Seth Forshee wrote:
> On Thu, Aug 06, 2015 at 10:51:16AM -0400, Stephen Smalley wrote:
>> On 08/06/2015 10:20 AM, Seth Forshee wrote:
>>> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
>>>> Seth Forshee <[email protected]> writes:
>>>>
>>>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
>>>>>> Seth Forshee <[email protected]> writes:
>>>>>>
>>>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
>>>>>>> flag for mounts from user namespaces. In the future it will also
>>>>>>> be used for translating ids and checking capabilities for
>>>>>>> filesystems mounted from user namespaces.
>>>>>>>
>>>>>>> s_user_ns is initialized in alloc_super() and is generally set to
>>>>>>> current_user_ns(). To avoid security and corruption issues, two
>>>>>>> additional mount checks are also added:
>>>>>>>
>>>>>>> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
>>>>>>> in current_user_ns().
>>>>>>>
>>>>>>> - sget() will fail with EBUSY when the filesystem it's looking
>>>>>>> for is already mounted from another user namespace.
>>>>>>>
>>>>>>> proc needs some special handling here. The user namespace of
>>>>>>> current isn't appropriate when forking as a result of clone (2)
>>>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
>>>>>>> from within the new user namespace. Instead, the user namespace
>>>>>>> which owns the new pid namespace should be used. sget_userns() is
>>>>>>> added to allow passing of a user namespace other than that of
>>>>>>> current, and this is used by proc_mount(). sget() becomes a
>>>>>>> wrapper around sget_userns() which passes current_user_ns().
>>>>>>
>>>>>> From bits of the previous conversation.
>>>>>>
>>>>>> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
>>>>>> xattrs can travel from one mount of sysfs to another via the sysfs
>>>>>> backing store.
>>>>>>
>>>>>> For tmpfs and any other filesystems we support mounting without
>>>>>> privilege that support xattrs. We need to identify them and
>>>>>> see if userspace is taking advantage of the ability to set
>>>>>> xattrs and file caps (unlikely). If they are we need to call
>>>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
>>>>>>
>>>>>> Possibly/Probably we should just do that for all of the interesting
>>>>>> filesystems to start with and then change back to an ordinary old sget
>>>>>> after we have done the testing and confirmed we will not be introducing
>>>>>> userspace regressions.
>>>>>
>>>>> I was reviewing everything in preparation for sending v2 patches, and I
>>>>> realized that doing this has an undesirable side effect. In patch 2 the
>>>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
>>>>> is used to block opening devices in these mounts. When we set s_user_ns
>>>>> to &init_user_ns, it becomes possible to open device nodes from
>>>>> unprivileged mounts of these filesystems.
>>>>>
>>>>> This doesn't pose a real problem today. The only filesystems it will
>>>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
>>>>> &init_user_ns for user namespace mounts), and all of these aren't
>>>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
>>>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
>>>>> a tmpfs or ramfs mount.
>>>>>
>>>>> But for sysfs in particular it does mean that we will need to make sure
>>>>> that there's no way that device nodes could start appearing in an
>>>>> unprivileged mount.
>>>>
>>>> Good point about nodev.
>>>>
>>>> For tmpfs and ramfs and security labels the smack policy of allowing but
>>>> filtering security labels means smack once it has those bits will not
>>>> care which user namespace ramfs and tmpfs live in. The labels should
>>>> pretty much stay the same in any case.
>>>
>>> Smack does care which namespace ramfs and tmpfs are in. With the patch
>>> I've got right now, if s_user_ns != &init_user_ns and the label of an
>>> inode does not match that of the root inode then
>>> security_inode_permission() will return EACCES.
>>>
>>> So if something with CAP_MAC_ADMIN is changing security labels in such a
>>> mount, suddenly those inodes might become inaccessible. And while it may
>>> be unlikely that anyone is doing this it's impossible for me to prove
>>> that's the case.
>>>
>>>> If the same class of handling will also apply to selinux and those are
>>>> the only two security modules that apply labels then we can leave tmpfs
>>>> and ramfs with the security labels of whomever mounted them.
>>>
>>> For SELinux I now have a patch which applies mountpoint labeling to
>>> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
>>> Smack how this behavior will differ from what happens today, but my
>>> understanding is that this means that the label of the mountpoint is
>>> used for all objects from that superblock. Afaik it does not have the
>>> Smack behavior of denying access to filesystem objects which have a
>>> different label in the backing store.
>>>
>>>> For sysfs things get a little more interesting. Assuming tmpfs and
>>>> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
>>>> with possibly invalid security labels set on a different mount of
>>>> selinux. (I am wondering now how all of these labels work in the
>>>> context of nfs).
>>>
>>> If someone was using Smack to label sysfs then a mount with s_user_ns !=
>>> &init_user_ns is going to leave inaccessible anything without the same
>>> label as the process which performed the mount.
>>>
>>> Again with SELinux I'm less certain, but I think you could end up with a
>>> sysfs superblock that has mountpoint labeling, and thus any labels set
>>> in the mount in the init namespace would be ignored.
>>
>> If you're using the logic I suggested for SELinux, then SELinux will
>> only use mountpoint labeling if SELinux would otherwise fetch the
>> extended attribute value from the filesystem via ->getxattr (this is the
>> SECURITY_FS_USE_XATTR test in the code). As this is not the case for
>> purely in-memory filesystems like tmpfs, ramfs, or sysfs, SELinux will
>> still label those filesystems in the usual manner, i.e. it initially
>> computes a default label for new inodes, and if userspace later performs
>> a setxattr(), then it updates its internal state at that time from the
>> relevant hooks (inode_post_setxattr or inode_setsecurity).
>> So nothing should change for SELinux wrt labeling of tmpfs, ramfs, or
>> sysfs in userns mounts aside from not allowing the use of the additional
>> mount options (e.g. context=).
>
> This is the patch I have currently:
>
> http://kernel.ubuntu.com/git/sforshee/linux.git/commit/?h=userns-mounts&id=080e5f5ee58143a56cfc57b4e51dff58b7a3cb1a
>
> I haven't been able to figure out which labeling behavior sysfs would
> end up with normally from just inspecting the code. kernfs does support
> xattrs, but now that I look at the implementation it handles security
> xattrs differently and calls security_inode_setsecurity whenever one is
> written. I'm not sure how all of that is going to work out in practice
> with SELinux.

sysfs would have a labeling behavior of SECURITY_FS_USE_GENFS
(policy-driven). It wouldn't make sense to configure sysfs with
SECURITY_FS_USE_XATTR, because that would cause SELinux to ask the
filesystem via ->getxattr for the initial value for the label when the
inode is first instantiated, and sysfs would have no answer there. So,
in practice, sysfs will still get labeled exactly as before, and there
would be no change in behavior. Similarly for tmpfs
(SECURITY_FS_USE_TRANS) or ramfs. The only filesystem types that get
SECURITY_FS_USE_XATTR are the ones that actually support storing SELinux
attributes persistently and therefore could provide an initial value
from backing store.

>> Also, a superblock can only have a single labeling behavior, so you
>> can't have different mounts of sysfs, one using mountpoint labeling and
>> one not. An inode can only have one label, no matter how you reach it.
>
> There are multiple sysfs superblocks though, see sysfs_mount(). It calls
> kernfs_mount_ns(), passing a kobject for the current net ns.
> kernfs_test_super() only matches if the net ns matches an existing
> superblock, so you end up with a different superblock per net ns.
>
> For kobjects which aren't namespaced, the same path within two different
> sysfs superblocks will be backed by the same kernfs node. kernfs stashes
> the security context inside the kernfs node, so inodes in different
> superblocks backed by the same kernfs node will have the same security
> context.
>
> So, with sysfs you can have different superblocks with (partially) the
> same backing store, and it would be possible for those superblocks to
> end up with different labeling behavior. I think we want to avoid having
> security labels applied to sysfs files in the init namespace and have
> those get lost in a mount from another namespace.

As long as we prohibit context= mounts on the userns mounts (which your
patch does), then this shouldn't be possible. Maybe we should just do
that for sysfs always.
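
For illustration only, a minimal userspace sketch of such a check. The
function names and the -EPERM return value are assumptions made for this
example and are not taken from the patch; the four option names are the
usual SELinux context mount options. The parsing is deliberately naive
and splits on every comma, so it does not handle quoted values:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* SELinux context mount option prefixes. */
static const char *const ctx_opts[] = {
    "context=", "fscontext=", "defcontext=", "rootcontext=",
};

/* True if the comma-separated option string contains a context option. */
static bool has_context_opt(const char *opts)
{
    while (opts && *opts) {
        size_t len = strcspn(opts, ",");   /* length of the current token */

        for (size_t i = 0; i < sizeof(ctx_opts) / sizeof(ctx_opts[0]); i++) {
            size_t plen = strlen(ctx_opts[i]);

            if (len >= plen && strncmp(opts, ctx_opts[i], plen) == 0)
                return true;
        }
        opts += len;
        if (*opts == ',')
            opts++;                        /* step over the separator */
    }
    return false;
}

/* Hypothetical policy: refuse context options on userns mounts.
 * The error code is an assumption for this sketch. */
static int check_mount_opts(const char *opts, bool userns_mount)
{
    if (userns_mount && has_context_opt(opts))
        return -EPERM;
    return 0;
}
```

Applying the same check unconditionally for sysfs would amount to passing
true for userns_mount whenever the filesystem being mounted is sysfs.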

2015-08-07 14:16:24

by Seth Forshee

Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On Thu, Aug 06, 2015 at 12:11:53PM -0400, Stephen Smalley wrote:
> On 08/06/2015 11:44 AM, Seth Forshee wrote:
> > On Thu, Aug 06, 2015 at 10:51:16AM -0400, Stephen Smalley wrote:
> >> On 08/06/2015 10:20 AM, Seth Forshee wrote:
> >>> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> >>>> Seth Forshee <[email protected]> writes:
> >>>>
> >>>>> On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> >>>>>> Seth Forshee <[email protected]> writes:
> >>>>>>
> >>>>>>> Initially this will be used to eliminate the implicit MNT_NODEV
> >>>>>>> flag for mounts from user namespaces. In the future it will also
> >>>>>>> be used for translating ids and checking capabilities for
> >>>>>>> filesystems mounted from user namespaces.
> >>>>>>>
> >>>>>>> s_user_ns is initialized in alloc_super() and is generally set to
> >>>>>>> current_user_ns(). To avoid security and corruption issues, two
> >>>>>>> additional mount checks are also added:
> >>>>>>>
> >>>>>>> - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> >>>>>>> in current_user_ns().
> >>>>>>>
> >>>>>>> - sget() will fail with EBUSY when the filesystem it's looking
> >>>>>>> for is already mounted from another user namespace.
> >>>>>>>
> >>>>>>> proc needs some special handling here. The user namespace of
> >>>>>>> current isn't appropriate when forking as a result of clone (2)
> >>>>>>> with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> >>>>>>> from within the new user namespace. Instead, the user namespace
> >>>>>>> which owns the new pid namespace should be used. sget_userns() is
> >>>>>>> added to allow passing of a user namespace other than that of
> >>>>>>> current, and this is used by proc_mount(). sget() becomes a
> >>>>>>> wrapper around sget_userns() which passes current_user_ns().
> >>>>>>
> >>>>>> From bits of the previous conversation.
> >>>>>>
> >>>>>> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
> >>>>>> xattrs can travel from one mount of sysfs to another via the sysfs
> >>>>>> backing store.
> >>>>>>
> >>>>>> For tmpfs and any other filesystems we support mounting without
> >>>>>> privilege that support xattrs. We need to identify them and
> >>>>>> see if userspace is taking advantage of the ability to set
> >>>>>> xattrs and file caps (unlikely). If they are we need to call
> >>>>>> sget_userns(..., &init_user_ns) on those filesystems as well.
> >>>>>>
> >>>>>> Possibly/Probably we should just do that for all of the interesting
> >>>>>> filesystems to start with and then change back to an ordinary old sget
> >>>>>> after we have done the testing and confirmed we will not be introducing
> >>>>>> userspace regressions.
> >>>>>
> >>>>> I was reviewing everything in preparation for sending v2 patches, and I
> >>>>> realized that doing this has an undesirable side effect. In patch 2 the
> >>>>> implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> >>>>> is used to block opening devices in these mounts. When we set s_user_ns
> >>>>> to &init_user_ns, it becomes possible to open device nodes from
> >>>>> unprivileged mounts of these filesystems.
> >>>>>
> >>>>> This doesn't pose a real problem today. The only filesystems it will
> >>>>> affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> >>>>> &init_user_ns for user namespace mounts), and all of these aren't
> >>>>> problems. sysfs is okay because kernfs doesn't (currently?) allow device
> >>>>> nodes, and a user would require CAP_MKNOD to create any device nodes in
> >>>>> a tmpfs or ramfs mount.
> >>>>>
> >>>>> But for sysfs in particular it does mean that we will need to make sure
> >>>>> that there's no way that device nodes could start appearing in an
> >>>>> unprivileged mount.
> >>>>
> >>>> Good point about nodev.
> >>>>
> >>>> For tmpfs and ramfs and security labels the smack policy of allowing but
> >>>> filtering security labels means smack once it has those bits will not
> >>>> care which user namespace ramfs and tmpfs live in. The labels should
> >>>> pretty much stay the same in any case.
> >>>
> >>> Smack does care which namespace ramfs and tmpfs are in. With the patch
> >>> I've got right now, if s_user_ns != &init_user_ns and the label of an
> >>> inode does not match that of the root inode then
> >>> security_inode_permission() will return EACCES.
> >>>
> >>> So if something with CAP_MAC_ADMIN is changing security labels in such a
> >>> mount, suddenly those inodes might become inaccessible. And while it may
> >>> be unlikely that anyone is doing this it's impossible for me to prove
> >>> that's the case.
> >>>
> >>>> If the same class of handling will also apply to selinux and those are
> >>>> the only two security modules that apply labels then we can leave tmpfs
> >>>> and ramfs with the security labels of whomever mounted them.
> >>>
> >>> For SELinux I now have a patch which applies mountpoint labeling to
> >>> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> >>> Smack how this behavior will differ from what happens today, but my
> >>> understanding is that this means that the label of the mountpoint is
> >>> used for all objects from that superblock. Afaik it does not have the
> >>> Smack behavior of denying access to filesystem objects which have a
> >>> different label in the backing store.
> >>>
> >>>> For sysfs things get a little more interesting. Assuming tmpfs and
> >>>> ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
> >>>> with possibly invalid security labels set on a different mount of
> >>>> selinux. (I am wondering now how all of these labels work in the
> >>>> context of nfs).
> >>>
> >>> If someone was using Smack to label sysfs then a mount with s_user_ns !=
> >>> &init_user_ns is going to leave inaccessible anything without the same
> >>> label as the process which performed the mount.
> >>>
> >>> Again with SELinux I'm less certain, but I think you could end up with a
> >>> sysfs superblock that has mountpoint labeling, and thus any labels set
> >>> in the mount in the init namespace would be ignored.
> >>
> >> If you're using the logic I suggested for SELinux, then SELinux will
> >> only use mountpoint labeling if SELinux would otherwise fetch the
> >> extended attribute value from the filesystem via ->getxattr (this is the
> >> SECURITY_FS_USE_XATTR test in the code). As this is not the case for
> >> purely in-memory filesystems like tmpfs, ramfs, or sysfs, SELinux will
> >> still label those filesystems in the usual manner, i.e. it initially
> >> computes a default label for new inodes, and if userspace later performs
> >> a setxattr(), then it updates its internal state at that time from the
> >> relevant hooks (inode_post_setxattr or inode_setsecurity).
> >> So nothing should change for SELinux wrt labeling of tmpfs, ramfs, or
> >> sysfs in userns mounts aside from not allowing the use of the additional
> >> mount options (e.g. context=).
> >
> > This is the patch I have currently:
> >
> > http://kernel.ubuntu.com/git/sforshee/linux.git/commit/?h=userns-mounts&id=080e5f5ee58143a56cfc57b4e51dff58b7a3cb1a
> >
> > I haven't been able to figure out which labeling behavior sysfs would
> > end up with normally from just inspecting the code. kernfs does support
> > xattrs, but now that I look at the implementation it handles security
> > xattrs differently and calls security_inode_setsecurity whenever one is
> > written. I'm not sure how all of that is going to work out in practice
> > with SELinux.
>
> sysfs would have a labeling behavior of SECURITY_FS_USE_GENFS
> (policy-driven). It wouldn't make sense to configure sysfs with
> SECURITY_FS_USE_XATTR, because that would cause SELinux to ask the
> filesystem via ->getxattr for the initial value for the label when the
> inode is first instantiated, and sysfs would have no answer there. So,
> in practice, sysfs will still get labeled exactly as before, and there
> would be no change in behavior. Similarly for tmpfs
> (SECURITY_FS_USE_TRANS) or ramfs. The only filesystem types that get
> SECURITY_FS_USE_XATTR are the ones that actually support storing SELinux
> attributes persistently and therefore could provide an initial value
> from backing store.
>
> >> Also, a superblock can only have a single labeling behavior, so you
> >> can't have different mounts of sysfs, one using mountpoint labeling and
> >> one not. An inode can only have one label, no matter how you reach it.
> >
> > There are multiple sysfs superblocks though, see sysfs_mount(). It calls
> > kernfs_mount_ns(), passing a kobject for the current net ns.
> > kernfs_test_super() only matches if the net ns matches an existing
> > superblock, so you end up with a different superblock per net ns.
> >
> > For kobjects which aren't namespaced, the same path within two different
> > sysfs superblocks will be backed by the same kernfs node. kernfs stashes
> > the security context inside the kernfs node, so inodes in different
> > superblocks backed by the same kernfs node will have the same security
> > context.
> >
> > So, with sysfs you can have different superblocks with (partially) the
> > same backing store, and it would be possible for those superblocks to
> > end up with different labeling behavior. I think we want to avoid having
> > security labels applied to sysfs files in the init namespace and have
> > those get lost in a mount from another namespace.
>
> As long as we prohibit context= mounts on the userns mounts (which your
> patch does), then this shouldn't be possible. Maybe we should just do
> that for sysfs always.
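
To make the sysfs point above concrete, here is a minimal userspace sketch
of the situation, not kernel code. All of the model_* names are hypothetical
stand-ins: model_sb plays the role of a per-net-ns sysfs superblock,
model_kernfs_node the shared kernfs node that stores the security context.
It only illustrates why a label written through one superblock is visible
through another.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernfs node; it owns the security context. */
struct model_kernfs_node {
	const char *secctx;
};

/* Hypothetical stand-in for a sysfs superblock, keyed by its net namespace. */
struct model_sb {
	const void *net_ns;
};

/* An inode belongs to one superblock but is backed by a kernfs node. */
struct model_inode {
	const struct model_sb *sb;
	struct model_kernfs_node *kn;
};

/* Same idea as kernfs_test_super(): reuse a superblock only for the same net ns. */
static bool model_test_super(const struct model_sb *sb, const void *net_ns)
{
	return sb->net_ns == net_ns;
}

/* Writing a security xattr through any inode updates the shared node. */
static void model_set_secctx(struct model_inode *inode, const char *ctx)
{
	inode->kn->secctx = ctx;
}

static const char *model_get_secctx(const struct model_inode *inode)
{
	return inode->kn->secctx;
}

/* Usage: two superblocks (different net ns), one non-namespaced kobject. */
static int model_demo(void)
{
	int ns_a, ns_b;	/* only their addresses are used, as namespace identities */
	struct model_kernfs_node kn = { .secctx = NULL };
	struct model_sb sb_a = { .net_ns = &ns_a };
	struct model_sb sb_b = { .net_ns = &ns_b };
	struct model_inode ia = { .sb = &sb_a, .kn = &kn };
	struct model_inode ib = { .sb = &sb_b, .kn = &kn };

	/* A mount from the second net ns does not match the first superblock. */
	assert(!model_test_super(&sb_a, &ns_b));

	/* A label set via the first superblock shows up via the second. */
	model_set_secctx(&ia, "example_label");
	assert(model_get_secctx(&ib) == model_get_secctx(&ia));
	return 0;
}
```

The context lives in the node rather than in the superblock, which is why
the labeling behavior of the two superblocks has to agree.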

Great. Thanks for your help.

Seth

2015-08-07 14:32:11

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On Thu, Aug 06, 2015 at 09:20:29AM -0500, Seth Forshee wrote:
> On Wed, Aug 05, 2015 at 04:19:03PM -0500, Eric W. Biederman wrote:
> > Seth Forshee <[email protected]> writes:
> >
> > > On Wed, Jul 15, 2015 at 09:47:11PM -0500, Eric W. Biederman wrote:
> > >> Seth Forshee <[email protected]> writes:
> > >>
> > >> > Initially this will be used to eliminate the implicit MNT_NODEV
> > >> > flag for mounts from user namespaces. In the future it will also
> > >> > be used for translating ids and checking capabilities for
> > >> > filesystems mounted from user namespaces.
> > >> >
> > >> > s_user_ns is initialized in alloc_super() and is generally set to
> > >> > current_user_ns(). To avoid security and corruption issues, two
> > >> > additional mount checks are also added:
> > >> >
> > >> > - do_new_mount() gains a check that the user has CAP_SYS_ADMIN
> > >> > in current_user_ns().
> > >> >
> > >> > - sget() will fail with EBUSY when the filesystem it's looking
> > >> > for is already mounted from another user namespace.
> > >> >
> > >> > proc needs some special handling here. The user namespace of
> > >> > current isn't appropriate when forking as a result of clone(2)
> > >> > with CLONE_NEWPID|CLONE_NEWUSER, as it will make proc unmountable
> > >> > from within the new user namespace. Instead, the user namespace
> > >> > which owns the new pid namespace should be used. sget_userns() is
> > >> > added to allow passing of a user namespace other than that of
> > >> > current, and this is used by proc_mount(). sget() becomes a
> > >> > wrapper around sget_userns() which passes current_user_ns().
> > >>
> > >> From bits of the previous conversation.
> > >>
> > >> We need sget_userns(..., &init_user_ns) for sysfs. The sysfs
> > >> xattrs can travel from one mount of sysfs to another via the sysfs
> > >> backing store.
> > >>
> > >> For tmpfs and any other filesystems we support mounting without
> > >> privilege that support xattrs, we need to identify them and
> > >> see if userspace is taking advantage of the ability to set
> > >> xattrs and file caps (unlikely). If they are, we need to call
> > >> sget_userns(..., &init_user_ns) on those filesystems as well.
> > >>
> > >> Possibly/Probably we should just do that for all of the interesting
> > >> filesystems to start with and then change back to an ordinary old sget
> > >> after we have done the testing and confirmed we will not be introducing
> > >> userspace regressions.
> > >
> > > I was reviewing everything in preparation for sending v2 patches, and I
> > > realized that doing this has an undesirable side effect. In patch 2 the
> > > implicit nodev is removed for unprivileged mounts, and instead s_user_ns
> > > is used to block opening devices in these mounts. When we set s_user_ns
> > > to &init_user_ns, it becomes possible to open device nodes from
> > > unprivileged mounts of these filesystems.
> > >
> > > This doesn't pose a real problem today. The only filesystems it will
> > > affect are sysfs, tmpfs, and ramfs (no others need s_user_ns =
> > > &init_user_ns for user namespace mounts), and none of these are
> > > problems. sysfs is okay because kernfs doesn't (currently?) allow device
> > > nodes, and a user would require CAP_MKNOD to create any device nodes in
> > > a tmpfs or ramfs mount.
> > >
> > > But for sysfs in particular it does mean that we will need to make sure
> > > that there's no way that device nodes could start appearing in an
> > > unprivileged mount.
> >
> > Good point about nodev.
> >
> > For tmpfs and ramfs and security labels, the Smack policy of allowing
> > but filtering security labels means that Smack, once it has those bits,
> > will not care which user namespace ramfs and tmpfs live in. The labels
> > should pretty much stay the same in any case.
>
> Smack does care which namespace ramfs and tmpfs are in. With the patch
> I've got right now, if s_user_ns != &init_user_ns and the label of an
> inode does not match that of the root inode then
> security_inode_permission() will return EACCES.
>
> So if something with CAP_MAC_ADMIN is changing security labels in such a
> mount, suddenly those inodes might become inaccessible. And while it may
> be unlikely that anyone is doing this it's impossible for me to prove
> that's the case.
>
> > If the same class of handling will also apply to SELinux and those are
> > the only two security modules that apply labels then we can leave tmpfs
> > and ramfs with the security labels of whomever mounted them.
>
> For SELinux I now have a patch which applies mountpoint labeling to
> mounts for which s_user_ns != &init_user_ns. I'm less sure than with
> Smack how this behavior will differ from what happens today, but my
> understanding is that this means that the label of the mountpoint is
> used for all objects from that superblock. Afaik it does not have the
> Smack behavior of denying access to filesystem objects which have a
> different label in the backing store.
>
> > For sysfs things get a little more interesting. Assuming tmpfs and
> > ramfs don't need s_user_ns == &init_user_ns, sysfs may be fine operating
> > with possibly invalid security labels set on a different mount of
> > selinux. (I am wondering now how all of these labels work in the
> > context of nfs).
>
> If someone was using Smack to label sysfs then a mount with s_user_ns !=
> &init_user_ns is going to leave inaccessible anything without the same
> label as the process which performed the mount.
>
> Again with SELinux I'm less certain, but I think you could end up with a
> sysfs superblock that has mountpoint labeling, and thus any labels set
> in the mount in the init namespace would be ignored.
>
> > The worst case for sysfs is that we come up with a cousin of
> > SB_I_NO_EXEC say SB_I_NO_DEV.
>
> That idea occurred to me. Or else something that indicated to the
> security module that the filesystem has no user-controlled backing store
> which could be used to inject security labels, thus allowing us to set
> s_user_ns to a non-init namespace while still allowing standard MAC
> labeling behavior.
>
> > But at the moment I am hoping that limited label storage in a user
> > namespace as you and Casey have been talking about winds up being the
> > norm and then we can follow the standard rules for setting s_user_ns and
> > still preserve the current label setting behavior.
>
> Unfortunately I'm afraid that's not going to work out.

What I really meant here was that it wasn't going to work out for these
few filesystems. There's no reason why that couldn't be the norm moving
forward.

Casey: Would you have a problem with special-casing Smack for these
filesystems? It's not ideal, but it avoids regressions for those
filesystems that can already be mounted in a user namespace with trusted
labels. Something like this (on top of the changes we've already
discussed).

diff --git a/security/smack/smack.h b/security/smack/smack.h
index 244e035e5a99..473cfc355a8d 100644
--- a/security/smack/smack.h
+++ b/security/smack/smack.h
@@ -76,8 +76,14 @@ struct superblock_smack {
 	struct smack_known	*smk_hat;
 	struct smack_known	*smk_default;
 	int			smk_initialized;
+	int			smk_flags;
 };

+/*
+ * Superblock flags
+ */
+#define SMK_SB_UNTRUSTED	0x01
+
 struct socket_smack {
 	struct smack_known	*smk_out;	/* outbound label */
 	struct smack_known	*smk_in;	/* inbound label */
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 8e631a66b03c..44e27f5f2a43 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -662,8 +662,16 @@ static int smack_sb_kern_mount(struct super_block *sb, int flags, void *data)
 		skp = smk_of_current();
 		sp->smk_root = skp;
 		sp->smk_default = skp;
-		if (sb_in_userns(sb))
+		/*
+		 * For a handful of fs types with no user-controlled
+		 * backing store it's okay to trust security labels
+		 * in the filesystem. The rest are untrusted.
+		 */
+		if (sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+		    sb->s_magic != RAMFS_MAGIC) {
 			transmute = 1;
+			sp->smk_flags |= SMK_SB_UNTRUSTED;
+		}
 	}
 	/*
 	 * Initialize the root inode.
@@ -1014,6 +1022,7 @@ static int smack_inode_rename(struct inode *old_inode,
  */
 static int smack_inode_permission(struct inode *inode, int mask)
 {
+	struct superblock_smack *sbsp = inode->i_sb->s_security;
 	struct smk_audit_info ad;
 	int no_block = mask & MAY_NOT_BLOCK;
 	int rc;
@@ -1025,8 +1034,7 @@ static int smack_inode_permission(struct inode *inode, int mask)
 	if (mask == 0)
 		return 0;

-	if (sb_in_userns(inode->i_sb)) {
-		struct superblock_smack *sbsp = inode->i_sb->s_security;
+	if (sbsp->smk_flags & SMK_SB_UNTRUSTED) {
 		if (smk_of_inode(inode) != sbsp->smk_root)
 			return -EACCES;
 	}
@@ -3228,7 +3236,7 @@ static void smack_d_instantiate(struct dentry *opt_dentry, struct inode *inode)
 			if (rc >= 0)
 				transflag = SMK_INODE_TRANSMUTE;
 		}
-		if (!sb_in_userns(inode->i_sb)) {
+		if (!(sbsp->smk_flags & SMK_SB_UNTRUSTED)) {
 			/*
 			 * Don't let the exec or mmap label be "*" or "@".
 			 */
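
For reviewers who want to poke at the logic outside the kernel, the decision
the patch makes can be sketched as a small userspace model. This is only an
illustration under stated assumptions: the model_* types and functions are
hypothetical, labels are plain strings rather than struct smack_known
pointers, and the from_userns field stands in for whatever condition guards
the enclosing branch in smack_sb_kern_mount(), which the hunk does not show.
The magic numbers are the values from include/uapi/linux/magic.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Values from include/uapi/linux/magic.h. */
#define SYSFS_MAGIC		0x62656572UL
#define TMPFS_MAGIC		0x01021994UL
#define RAMFS_MAGIC		0x858458f6UL
#define EXT4_SUPER_MAGIC	0xEF53UL

#define SMK_SB_UNTRUSTED	0x01

/* Hypothetical, simplified stand-in for super_block + superblock_smack. */
struct model_sb {
	unsigned long s_magic;
	bool from_userns;	/* assumption: guards the branch in the hunk */
	const char *smk_root;	/* label given to the root inode */
	int smk_flags;
};

/* Hypothetical stand-in for an inode with a Smack label. */
struct model_inode {
	const struct model_sb *sb;
	const char *label;
};

/* Mount time: trust labels only on fs types with no user-controlled store. */
static void model_sb_kern_mount(struct model_sb *sb, const char *mounter)
{
	sb->smk_flags = 0;
	if (!sb->from_userns)
		return;
	sb->smk_root = mounter;
	if (sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
	    sb->s_magic != RAMFS_MAGIC)
		sb->smk_flags |= SMK_SB_UNTRUSTED;
}

/*
 * Access time: on an untrusted superblock, deny any inode whose label
 * differs from the root label. Returning 0 here means "fall through to
 * the normal Smack access check", which is not modelled.
 */
static int model_inode_permission(const struct model_inode *inode)
{
	const struct model_sb *sb = inode->sb;

	if ((sb->smk_flags & SMK_SB_UNTRUSTED) &&
	    strcmp(inode->label, sb->smk_root) != 0)
		return -EACCES;
	return 0;
}

/* Usage: the same foreign label is tolerated on tmpfs but not on ext4. */
static int model_demo(void)
{
	struct model_sb tmp = { .s_magic = TMPFS_MAGIC, .from_userns = true };
	struct model_sb ext = { .s_magic = EXT4_SUPER_MAGIC, .from_userns = true };
	struct model_inode ti = { .sb = &tmp, .label = "Other" };
	struct model_inode ei = { .sb = &ext, .label = "Other" };

	model_sb_kern_mount(&tmp, "User");
	model_sb_kern_mount(&ext, "User");
	assert(model_inode_permission(&ti) == 0);
	assert(model_inode_permission(&ei) == -EACCES);
	return 0;
}
```

The point of carrying a flag in the superblock blob, rather than testing
s_magic at every access, is that the fs-type special case is decided once
at mount time and the hot paths only test a bit.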

2015-08-07 18:35:28

by Casey Schaufler

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On 8/7/2015 7:32 AM, Seth Forshee wrote:
> What I really meant here was that it wasn't going to work out for these
> few filesystems. There's no reason why that couldn't be the norm moving
> forward.
>
> Casey: Would you have a problem with special-casing Smack for these
> filesystems? It's not ideal, but it avoids regressions for those
> filesystems that can already be mounted in a user namespace with trusted
> labels. Something like this (on top of the changes we've already
> discussed).

As badly as I want to run away screaming, I can't see a reason
that this approach doesn't make sense. With no backing store there's
no way the untrusted mounter can get untoward access to data, and
the data isn't persistent. If there weren't already filesystem
special casing in Smack I could object to that, but I've already
started down that slope.

So I'm not real happy, but I don't have a better solution.


2015-08-07 18:57:58

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 1/7] fs: Add user namesapace member to struct super_block

On Fri, Aug 07, 2015 at 11:35:31AM -0700, Casey Schaufler wrote:
> On 8/7/2015 7:32 AM, Seth Forshee wrote:
> > What I really meant here was that it wasn't going to work out for these
> > few filesystems. There's no reason why that couldn't be the norm moving
> > forward.
> >
> > Casey: Would you have a problem with special-casing Smack for these
> > filesystems? It's not ideal, but it avoids regressions for those
> > filesystems that can already be mounted in a user namespace with trusted
> > labels. Something like this (on top of the changes we've already
> > discussed).
>
> As badly as I want to run away screaming, I can't see a reason
> that this approach doesn't make sense. With no backing store there's
> no way the untrusted mounter can get untoward access to data, and
> the data isn't persistent. If there weren't already filesystem
> special casing in Smack I could object to that, but I've already
> started down that slope.
>
> So I'm not real happy, but I don't have a better solution.

Yeah, I understand. I had hoped there would be something we could look
at to distinguish these types of filesystems generically, but I couldn't
find anything. So short of adding some flag to the fs type or the
superblock, this was the best I could come up with.
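
For comparison, the flag-based alternative might look roughly like the
userspace sketch below. To be clear, MODEL_FS_NO_USER_BACKING and the
model_* names are hypothetical and exist nowhere in the kernel; this only
shows the shape of the idea, where each fs type declares the property once
and the security module never looks at s_magic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fs type flag: no user-controlled backing store. */
#define MODEL_FS_NO_USER_BACKING	0x1

/* Hypothetical, simplified stand-in for struct file_system_type. */
struct model_fs_type {
	const char *name;
	int fs_flags;
};

/*
 * Labels are trusted for any mount from the init user namespace, and for
 * mounts from other user namespaces only when the fs type declares that
 * users cannot inject labels through a backing store.
 */
static bool model_labels_trusted(const struct model_fs_type *type,
				 bool from_userns)
{
	if (!from_userns)
		return true;
	return (type->fs_flags & MODEL_FS_NO_USER_BACKING) != 0;
}

/* Usage: tmpfs opts in via the flag, ext4 does not. */
static int model_demo(void)
{
	struct model_fs_type tmpfs = { "tmpfs", MODEL_FS_NO_USER_BACKING };
	struct model_fs_type ext4 = { "ext4", 0 };

	assert(model_labels_trusted(&tmpfs, true));
	assert(!model_labels_trusted(&ext4, true));
	assert(model_labels_trusted(&ext4, false));
	return 0;
}
```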

Thanks,
Seth