If a process gets access to a mount from a descendant or unrelated
user namespace, that process should not be able to take advantage of
setuid files or SELinux entrypoints from that filesystem.

This will make it safer to allow more complex filesystems to be
mounted in non-root user namespaces.

This does not remove the need for MNT_LOCK_NOSUID. The setuid,
setgid, and file capability bits can no longer be abused if code in
a user namespace clears nosuid on an untrusted filesystem,
but this patch, by itself, is insufficient to protect the system
from abuse of files that, when execed, would increase MAC privilege.

As a more concrete explanation, any task that can manipulate a
vfsmount associated with a given user namespace already has
capabilities in that namespace and all of its descendants. If they
can cause a malicious setuid, setgid, or file-caps executable to
appear in that mount, then that executable will only allow them to
elevate privileges in exactly the set of namespaces in which they
are already privileged.

On the other hand, if they can cause a malicious executable to
appear with a dangerous MAC label, running it could change the
caller's security context in a way that should not have been
possible, even inside the namespace in which the task is confined.

Signed-off-by: Andy Lutomirski <[email protected]>
---

Seth, this should address a problem that's related to yours. If a
userns creates an untrusted fs (by any means, although admittedly fuse
and user namespaces don't work all that well together right now), then
this prevents shenanigans that could happen when the userns passes an fd
pointing at the filesystem out to the root ns.

 fs/exec.c                |  2 +-
 fs/namespace.c           | 21 +++++++++++++++++++++
 include/linux/mount.h    |  1 +
 security/commoncap.c     |  2 +-
 security/selinux/hooks.c |  4 ++--
 5 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a2b42a98c743..ac0bb22aa3ed 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1267,7 +1267,7 @@ int prepare_binprm(struct linux_binprm *bprm)
 	bprm->cred->euid = current_euid();
 	bprm->cred->egid = current_egid();
 
-	if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
+	if (mnt_may_suid(bprm->file->f_path.mnt) &&
 	    !task_no_new_privs(current) &&
 	    kuid_has_mapping(bprm->cred->user_ns, inode->i_uid) &&
 	    kgid_has_mapping(bprm->cred->user_ns, inode->i_gid)) {
diff --git a/fs/namespace.c b/fs/namespace.c
index a01c7730e9af..53301680ea7e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3011,6 +3011,27 @@ found:
 	return visible;
 }
 
+bool mnt_may_suid(struct vfsmount *mnt)
+{
+	struct user_namespace *mount_userns = real_mount(mnt)->mnt_ns->user_ns;
+	struct user_namespace *ns;
+
+	if (mnt->mnt_flags & MNT_NOSUID)
+		return false;
+
+	/*
+	 * We only trust mounts in our own namespace or its parents; we
+	 * treat untrusted mounts as MNT_NOSUID regardless of whether
+	 * they have MNT_NOSUID set.
+	 */
+	for (ns = current_user_ns(); ns; ns = ns->parent) {
+		if (ns == mount_userns)
+			return true;
+	}
+
+	return false;
+}
+
 static void *mntns_get(struct task_struct *task)
 {
 	struct mnt_namespace *ns = NULL;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 9262e4bf0cc3..b7b84bafe09b 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -80,6 +80,7 @@ extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
+extern bool mnt_may_suid(struct vfsmount *mnt);
 
 struct file_system_type;
 extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
diff --git a/security/commoncap.c b/security/commoncap.c
index bab0611afc1e..52b3eed065e0 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -443,7 +443,7 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	if (!file_caps_enabled)
 		return 0;
 
-	if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+	if (!mnt_may_suid(bprm->file->f_path.mnt))
 		return 0;
 
 	dentry = dget(bprm->file->f_dentry);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index b0e940497e23..2089fd0d539e 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2139,7 +2139,7 @@ static int selinux_bprm_set_creds(struct linux_binprm *bprm)
 		 */
 		if (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS)
 			return -EPERM;
-		if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
+		if (!mnt_may_suid(bprm->file->f_path.mnt))
 			return -EACCES;
 	} else {
 		/* Check for a default transition on this program. */
@@ -2153,7 +2153,7 @@ static int selinux_bprm_set_creds(struct linux_binprm *bprm)
 	ad.type = LSM_AUDIT_DATA_PATH;
 	ad.u.path = bprm->file->f_path;
 
-	if ((bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) ||
+	if (!mnt_may_suid(bprm->file->f_path.mnt) ||
 	    (bprm->unsafe & LSM_UNSAFE_NO_NEW_PRIVS))
 		new_tsec->sid = old_tsec->sid;
--
1.9.3
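
For readers who want to experiment with the trust rule outside the kernel, the namespace walk in mnt_may_suid() can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct model_userns, struct model_mount, MODEL_MNT_NOSUID and model_mnt_may_suid() are hypothetical stand-ins invented for illustration, not kernel definitions, and the model ignores locking, reference counting and mount propagation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's MNT_NOSUID flag. */
#define MODEL_MNT_NOSUID 0x1u

/* Hypothetical model of a user namespace: only the parent link matters. */
struct model_userns {
	const struct model_userns *parent;	/* NULL for the initial namespace */
};

/* Hypothetical model of a mount: its flags and the userns owning its mount namespace. */
struct model_mount {
	unsigned int flags;
	const struct model_userns *owner;
};

/*
 * Mirrors the logic of the posted mnt_may_suid(): a mount is trusted for
 * setuid purposes only if it is not nosuid and its owning userns is the
 * task's userns or one of that userns's ancestors.
 */
static bool model_mnt_may_suid(const struct model_mount *mnt,
			       const struct model_userns *task_ns)
{
	const struct model_userns *ns;

	assert(mnt != NULL && task_ns != NULL);

	if (mnt->flags & MODEL_MNT_NOSUID)
		return false;

	/* Walk from the task's namespace up towards the initial namespace. */
	for (ns = task_ns; ns != NULL; ns = ns->parent) {
		if (ns == mnt->owner)
			return true;
	}

	return false;
}
```

In this model a task in a child namespace still trusts a mount owned by the initial namespace, while a task in the initial namespace (or in a sibling) does not trust a mount owned by the child.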
Andy Lutomirski <[email protected]> writes:
> If a process gets access to a mount from a descendent or unrelated
> user namespace, that process should not be able to take advantage of
> setuid files or selinux entrypoints from that filesystem.
>
> This will make it safer to allow more complex filesystems to be
> mounted in non-root user namespaces.
>
> This does not remove the need for MNT_LOCK_NOSUID. The setuid,
> setgid, and file capability bits can no longer be abused if code in
> a user namespace were to clear nosuid on an untrusted filesystem,
> but this patch, by itself, is insufficient to protect the system
> from abuse of files that, when execed, would increase MAC privilege.
>
> As a more concrete explanation, any task that can manipulate a
> vfsmount associated with a given user namespace already has
> capabilities in that namespace and all of its descendents. If they
> can cause a malicious setuid, setgid, or file-caps executable to
> appear in that mount, then that executable will only allow them to
> elevate privileges in exactly the set of namespaces in which they
> are already privileges.
>
> On the other hand, if they can cause a malicious executable to
> appear with a dangerous MAC label, running it could change the
> caller's security context in a way that should not have been
> possible, even inside the namespace in which the task is confined.
As presented this is complete and total nonsense. Mount propagation
strongly weakens if not completely breaks the assumptions you are making
in this code.
To write any generic code that knows anything we need to capture a user
namespace on struct super.
Further I think all we really want is to filter out security labels from
unprivileged mounts. uids/gids and the like should be completely fine
because of the uid mappings.
Having been down the route of comparing uids as userns uid tuples I am
convinced that anything that requires us to take the user namespace into
account on a routine basis in the core will simply be broken by someone
forgetting somewhere. This looks like a design that has that kind of
susceptibility.
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
>
> Seth, this should address a problem that's related to yours. If a
> userns creates and untrusted fs (by any means, although admittedly fuse
> and user namespaces don't work all that well together right now), then
> this prevents shenanigans that could happen when the userns passes an fd
> pointing at the filesystem out to the root ns.
Andy, for now I really think we are best off not even reading those
capabilities into the vfs from unprivileged mounts.
Eric
On Tue, Oct 14, 2014 at 2:57 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> If a process gets access to a mount from a descendent or unrelated
>> user namespace, that process should not be able to take advantage of
>> setuid files or selinux entrypoints from that filesystem.
>>
>> This will make it safer to allow more complex filesystems to be
>> mounted in non-root user namespaces.
>>
>> This does not remove the need for MNT_LOCK_NOSUID. The setuid,
>> setgid, and file capability bits can no longer be abused if code in
>> a user namespace were to clear nosuid on an untrusted filesystem,
>> but this patch, by itself, is insufficient to protect the system
>> from abuse of files that, when execed, would increase MAC privilege.
>>
>> As a more concrete explanation, any task that can manipulate a
>> vfsmount associated with a given user namespace already has
>> capabilities in that namespace and all of its descendents. If they
>> can cause a malicious setuid, setgid, or file-caps executable to
>> appear in that mount, then that executable will only allow them to
>> elevate privileges in exactly the set of namespaces in which they
>> are already privileges.
>>
>> On the other hand, if they can cause a malicious executable to
>> appear with a dangerous MAC label, running it could change the
>> caller's security context in a way that should not have been
>> possible, even inside the namespace in which the task is confined.
>
> As presented this is complete and total nonsense. Mount propgation
> strongly weakens if not completely breaks the assumptions you are making
> in this code.
Huh? Please elaborate.
>
> To write any generic code that knows anything we need to capture a user
> namespace on struct super.
I disagree, actually. If global root mounts FUSE (somewhere
invisible) and then propagates it into a userns-owned mountns, then I
think that root does *not* want the global userns to trust that mount,
even though the super belongs to the init userns.
In general, the ability to elevate your privileges by following a
/proc symlink into a different userns's mounts (or using fchdir) and
executing a setuid program is, I think, a mistake. I've already
written one root exploit that depends on that ability, and I can't see
any legitimate reason to allow it.
>
> Further I think all we really want is to filter out security labels from
> unprivileged mounts. uids/gids and the like should be completely fine
> because of the uid mappings.
Why? As you mentioned, unprivileged userns mounts are just like
regular nosuid removable media mounts in that respect, except that
they probably won't have the nosuid flag set. This patch completely
closes the issue of security labels taking effect in the wrong
namespace as long as LSMs handle nosuid correctly, and LSMs MUST
handle nosuid correctly in order to avoid being bypassed by regular
FUSE or by removable media.
>
> Having been down the route of comparing uids as userns uid tuples I am
> convinced that anything requires us to take the user namespace into
> account on a routine basis in the core will simply be broken for someone
> forgetting somewhere. This looks like a design that has that kind of
> susceptibility.
A smatch rule would fix that, as would moving MNT_NOSUID into an
internal header.
>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>>
>> Seth, this should address a problem that's related to yours. If a
>> userns creates and untrusted fs (by any means, although admittedly fuse
>> and user namespaces don't work all that well together right now), then
>> this prevents shenanigans that could happen when the userns passes an fd
>> pointing at the filesystem out to the root ns.
>
> Andy for now I really think we are best not even reading those
> capabilities into the vfs from unprivileged mounts.
But won't we want to support letting userns containers create setuid
files and security labels using FUSE and related things for their own
benefit someday? This lets us do that without compromising the init
namespace.
--Andy
Quoting Eric W. Biederman ([email protected]):
> Andy Lutomirski <[email protected]> writes:
>
> > If a process gets access to a mount from a descendent or unrelated
> > user namespace, that process should not be able to take advantage of
> > setuid files or selinux entrypoints from that filesystem.
> >
> > This will make it safer to allow more complex filesystems to be
> > mounted in non-root user namespaces.
> >
> > This does not remove the need for MNT_LOCK_NOSUID. The setuid,
> > setgid, and file capability bits can no longer be abused if code in
> > a user namespace were to clear nosuid on an untrusted filesystem,
> > but this patch, by itself, is insufficient to protect the system
> > from abuse of files that, when execed, would increase MAC privilege.
> >
> > As a more concrete explanation, any task that can manipulate a
> > vfsmount associated with a given user namespace already has
> > capabilities in that namespace and all of its descendents. If they
> > can cause a malicious setuid, setgid, or file-caps executable to
> > appear in that mount, then that executable will only allow them to
> > elevate privileges in exactly the set of namespaces in which they
> > are already privileges.
> >
> > On the other hand, if they can cause a malicious executable to
> > appear with a dangerous MAC label, running it could change the
> > caller's security context in a way that should not have been
> > possible, even inside the namespace in which the task is confined.
>
> As presented this is complete and total nonsense. Mount propgation
> strongly weakens if not completely breaks the assumptions you are making
> in this code.
>
> To write any generic code that knows anything we need to capture a user
> namespace on struct super.
>
> Further I think all we really want is to filter out security labels from
> unprivileged mounts. uids/gids and the like should be completely fine
> because of the uid mappings.
>
> Having been down the route of comparing uids as userns uid tuples I am
> convinced that anything requires us to take the user namespace into
> account on a routine basis in the core will simply be broken for someone
> forgetting somewhere. This looks like a design that has that kind of
> susceptibility.
The above paragraph is very compelling. However Andy's patch is a step
in the right direction from what we've got. I think given what you say
below and given Andy's rationale above, simply tweaking his patch to
ignore the parent-userns loop, and return false if current_user_ns() !=
mount_userns, should be right? It'll prevent a child userns from
setting a selinux/apparmor entrypoint or POSIX file capabilities on a
file and having the parent userns trip over those.
> > Signed-off-by: Andy Lutomirski <[email protected]>
> > ---
> >
> > Seth, this should address a problem that's related to yours. If a
> > userns creates and untrusted fs (by any means, although admittedly fuse
> > and user namespaces don't work all that well together right now), then
> > this prevents shenanigans that could happen when the userns passes an fd
> > pointing at the filesystem out to the root ns.
>
> Andy for now I really think we are best not even reading those
> capabilities into the vfs from unprivileged mounts.
>
> Eric
On Tue, Oct 14, 2014 at 3:07 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Oct 14, 2014 at 2:57 PM, Eric W. Biederman
>>> Seth, this should address a problem that's related to yours. If a
>>> userns creates and untrusted fs (by any means, although admittedly fuse
>>> and user namespaces don't work all that well together right now), then
>>> this prevents shenanigans that could happen when the userns passes an fd
>>> pointing at the filesystem out to the root ns.
>>
>> Andy for now I really think we are best not even reading those
>> capabilities into the vfs from unprivileged mounts.
>
> But won't we want to support letting userns containers create setuid
> files and security labels using FUSE and related things for their own
> benefit someday? This lets us do that without compromising the init
> namespace.
More concretely, root in a userns should be able to have a
setuid-whomever or security-labeled file, and another user in that
userns should be able to exec it and transition. But, if you're
outside the userns, then:
$ /proc/PID_IN_USERNS/root/path/to/labeled/file
shouldn't transition.
--Andy
Quoting Serge E. Hallyn ([email protected]):
> Quoting Eric W. Biederman ([email protected]):
> > Andy Lutomirski <[email protected]> writes:
> >
> > > If a process gets access to a mount from a descendent or unrelated
> > > user namespace, that process should not be able to take advantage of
> > > setuid files or selinux entrypoints from that filesystem.
> > >
> > > This will make it safer to allow more complex filesystems to be
> > > mounted in non-root user namespaces.
> > >
> > > This does not remove the need for MNT_LOCK_NOSUID. The setuid,
> > > setgid, and file capability bits can no longer be abused if code in
> > > a user namespace were to clear nosuid on an untrusted filesystem,
> > > but this patch, by itself, is insufficient to protect the system
> > > from abuse of files that, when execed, would increase MAC privilege.
> > >
> > > As a more concrete explanation, any task that can manipulate a
> > > vfsmount associated with a given user namespace already has
> > > capabilities in that namespace and all of its descendents. If they
> > > can cause a malicious setuid, setgid, or file-caps executable to
> > > appear in that mount, then that executable will only allow them to
> > > elevate privileges in exactly the set of namespaces in which they
> > > are already privileges.
> > >
> > > On the other hand, if they can cause a malicious executable to
> > > appear with a dangerous MAC label, running it could change the
> > > caller's security context in a way that should not have been
> > > possible, even inside the namespace in which the task is confined.
> >
> > As presented this is complete and total nonsense. Mount propgation
> > strongly weakens if not completely breaks the assumptions you are making
> > in this code.
> >
> > To write any generic code that knows anything we need to capture a user
> > namespace on struct super.
> >
> > Further I think all we really want is to filter out security labels from
> > unprivileged mounts. uids/gids and the like should be completely fine
> > because of the uid mappings.
> >
> > Having been down the route of comparing uids as userns uid tuples I am
> > convinced that anything requires us to take the user namespace into
> > account on a routine basis in the core will simply be broken for someone
> > forgetting somewhere. This looks like a design that has that kind of
> > susceptibility.
>
> The above paragraph is very compelling. However Andy's patch is a step
> in the right direction from what we've got. I think given what you say
> below and given Andy's rationale above, simply tweaking his patch to
> ignore the parent-userns loop, and return false if current_user_ns() !=
> mount_userns, should be right? It'll prevent a child userns from
> setting a selinux/apparmor entrypoint or POSIX file capabilities on a
> file and having the parent userns trip over those.
Ok, Andy's fn does the opposite, which will protect the parent userns,
which is good.
I suspect simply insisting that the user_ns's be equal is still better.
It fits better with the idea that POSIX caps (and LSM entrypoints) are
orthogonal to DAC. Kinda.
-serge
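
To make the difference between the two rules under discussion concrete, here is a small userspace sketch. It assumes a simplified model in which a user namespace is just a node with a parent pointer; struct sketch_userns, trust_self_or_ancestor() and trust_equal_only() are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: a node with a parent link. */
struct sketch_userns {
	const struct sketch_userns *parent;	/* NULL for the initial namespace */
};

/*
 * Rule from the posted patch: trust the mount if its userns is the
 * task's userns or any ancestor of it.
 */
static bool trust_self_or_ancestor(const struct sketch_userns *task_ns,
				   const struct sketch_userns *mount_ns)
{
	assert(mount_ns != NULL);

	for (; task_ns != NULL; task_ns = task_ns->parent) {
		if (task_ns == mount_ns)
			return true;
	}
	return false;
}

/*
 * Stricter rule suggested in this reply: trust the mount only if its
 * userns is exactly the task's userns.
 */
static bool trust_equal_only(const struct sketch_userns *task_ns,
			     const struct sketch_userns *mount_ns)
{
	assert(task_ns != NULL && mount_ns != NULL);

	return task_ns == mount_ns;
}
```

Both rules refuse a mount owned by a child namespace when the task runs in the parent. They differ only when a task in a child namespace executes from a mount owned by an ancestor: the loop accepts it, strict equality does not.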
On Tue, Oct 14, 2014 at 3:12 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Eric W. Biederman ([email protected]):
>> Andy Lutomirski <[email protected]> writes:
>>
>> > If a process gets access to a mount from a descendent or unrelated
>> > user namespace, that process should not be able to take advantage of
>> > setuid files or selinux entrypoints from that filesystem.
>> >
>> > This will make it safer to allow more complex filesystems to be
>> > mounted in non-root user namespaces.
>> >
>> > This does not remove the need for MNT_LOCK_NOSUID. The setuid,
>> > setgid, and file capability bits can no longer be abused if code in
>> > a user namespace were to clear nosuid on an untrusted filesystem,
>> > but this patch, by itself, is insufficient to protect the system
>> > from abuse of files that, when execed, would increase MAC privilege.
>> >
>> > As a more concrete explanation, any task that can manipulate a
>> > vfsmount associated with a given user namespace already has
>> > capabilities in that namespace and all of its descendents. If they
>> > can cause a malicious setuid, setgid, or file-caps executable to
>> > appear in that mount, then that executable will only allow them to
>> > elevate privileges in exactly the set of namespaces in which they
>> > are already privileges.
>> >
>> > On the other hand, if they can cause a malicious executable to
>> > appear with a dangerous MAC label, running it could change the
>> > caller's security context in a way that should not have been
>> > possible, even inside the namespace in which the task is confined.
>>
>> As presented this is complete and total nonsense. Mount propgation
>> strongly weakens if not completely breaks the assumptions you are making
>> in this code.
>>
>> To write any generic code that knows anything we need to capture a user
>> namespace on struct super.
>>
>> Further I think all we really want is to filter out security labels from
>> unprivileged mounts. uids/gids and the like should be completely fine
>> because of the uid mappings.
>>
>> Having been down the route of comparing uids as userns uid tuples I am
>> convinced that anything requires us to take the user namespace into
>> account on a routine basis in the core will simply be broken for someone
>> forgetting somewhere. This looks like a design that has that kind of
>> susceptibility.
>
> The above paragraph is very compelling. However Andy's patch is a step
> in the right direction from what we've got. I think given what you say
> below and given Andy's rationale above, simply tweaking his patch to
> ignore the parent-userns loop, and return false if current_user_ns() !=
> mount_userns, should be right? It'll prevent a child userns from
> setting a selinux/apparmor entrypoint or POSIX file capabilities on a
> file and having the parent userns trip over those.
I'm a bit confused. How does removing that loop help? Shouldn't all
usernses trust labels from the root userns?
Admittedly, they should really only be execing things that propagated
in instead of using /proc or fchdir, so I don't think we lose much by
dropping the loop.
--Andy
On Tue, Oct 14, 2014 at 3:14 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Serge E. Hallyn ([email protected]):
>> Quoting Eric W. Biederman ([email protected]):
>> > Andy Lutomirski <[email protected]> writes:
>> >
>> > > If a process gets access to a mount from a descendent or unrelated
>> > > user namespace, that process should not be able to take advantage of
>> > > setuid files or selinux entrypoints from that filesystem.
>> > >
>> > > This will make it safer to allow more complex filesystems to be
>> > > mounted in non-root user namespaces.
>> > >
>> > > This does not remove the need for MNT_LOCK_NOSUID. The setuid,
>> > > setgid, and file capability bits can no longer be abused if code in
>> > > a user namespace were to clear nosuid on an untrusted filesystem,
>> > > but this patch, by itself, is insufficient to protect the system
>> > > from abuse of files that, when execed, would increase MAC privilege.
>> > >
>> > > As a more concrete explanation, any task that can manipulate a
>> > > vfsmount associated with a given user namespace already has
>> > > capabilities in that namespace and all of its descendents. If they
>> > > can cause a malicious setuid, setgid, or file-caps executable to
>> > > appear in that mount, then that executable will only allow them to
>> > > elevate privileges in exactly the set of namespaces in which they
>> > > are already privileges.
>> > >
>> > > On the other hand, if they can cause a malicious executable to
>> > > appear with a dangerous MAC label, running it could change the
>> > > caller's security context in a way that should not have been
>> > > possible, even inside the namespace in which the task is confined.
>> >
>> > As presented this is complete and total nonsense. Mount propgation
>> > strongly weakens if not completely breaks the assumptions you are making
>> > in this code.
>> >
>> > To write any generic code that knows anything we need to capture a user
>> > namespace on struct super.
>> >
>> > Further I think all we really want is to filter out security labels from
>> > unprivileged mounts. uids/gids and the like should be completely fine
>> > because of the uid mappings.
>> >
>> > Having been down the route of comparing uids as userns uid tuples I am
>> > convinced that anything requires us to take the user namespace into
>> > account on a routine basis in the core will simply be broken for someone
>> > forgetting somewhere. This looks like a design that has that kind of
>> > susceptibility.
>>
>> The above paragraph is very compelling. However Andy's patch is a step
>> in the right direction from what we've got. I think given what you say
>> below and given Andy's rationale above, simply tweaking his patch to
>> ignore the parent-userns loop, and return false if current_user_ns() !=
>> mount_userns, should be right? It'll prevent a child userns from
>> setting a selinux/apparmor entrypoint or POSIX file capabilities on a
>> file and having the parent userns trip over those.
>
> Ok, Andy's fn does the opposite, which will protect the parent userns,
> which is good.
>
> I suspect simply insisting that the user_ns's be equal is still better.
> It fits better with the idea that POSIX caps (and LSM entrypoints) are
> orthogonal to DAC. Kinda.
We could tighten it even further if we compared *mount* namespaces
instead of user namespaces. That would benefit Docker, non-userns-lxc
and such, too (sigh).

Actually, I see no good reason to insist on userns equality but not on
mountns equality. If we're not going to trust executables in foreign
namespaces, let's go all the way to distrust executables in all
foreign namespaces, at least unless someone thinks of a reason this
would break existing userspace.

--Andy
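
For readers following the thread, the three trust policies discussed
above can be compared in a rough userspace sketch. This is not the
patch and not kernel code: struct user_ns, struct mnt_ns and every
function name below are hypothetical stand-ins invented for
illustration, and the ancestry rule is only one reading of the commit
message (trust a mount owned by the caller's own user namespace or one
of its ancestors, never a descendant or unrelated one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct user_ns {
	const struct user_ns *parent;	/* NULL for the initial user namespace */
};

struct mnt_ns {
	const struct user_ns *owner;	/* user namespace owning this mount namespace */
};

/*
 * Policy 1 (assumed reading of the original patch): trust the mount if
 * its owning userns is the caller's userns or one of its ancestors.
 */
bool trust_by_ancestry(const struct user_ns *caller,
		       const struct user_ns *mount_owner)
{
	for (const struct user_ns *ns = caller; ns != NULL; ns = ns->parent)
		if (ns == mount_owner)
			return true;
	return false;
}

/* Policy 2 (suggested in review): insist that the user namespaces be equal. */
bool trust_by_userns_equality(const struct user_ns *caller,
			      const struct user_ns *mount_owner)
{
	return caller == mount_owner;
}

/* Policy 3 (proposed above): insist that the mount namespaces be equal. */
bool trust_by_mntns_equality(const struct mnt_ns *caller_mntns,
			     const struct mnt_ns *mount_mntns)
{
	return caller_mntns == mount_mntns;
}

/* Small scenario exercising the three policies. */
void self_check(void)
{
	struct user_ns init_ns = { .parent = NULL };
	struct user_ns child = { .parent = &init_ns };
	/* Two mount namespaces sharing one userns (container without userns). */
	struct mnt_ns host_mnt = { .owner = &init_ns };
	struct mnt_ns container_mnt = { .owner = &init_ns };

	/* Policy 1: child may trust a parent-owned mount, not the reverse. */
	assert(trust_by_ancestry(&child, &init_ns));
	assert(!trust_by_ancestry(&init_ns, &child));

	/* Policy 2: any userns mismatch is rejected in both directions. */
	assert(!trust_by_userns_equality(&child, &init_ns));
	assert(!trust_by_userns_equality(&init_ns, &child));

	/* Same userns but different mountns: policy 2 accepts, policy 3 rejects. */
	assert(trust_by_userns_equality(host_mnt.owner, container_mnt.owner));
	assert(!trust_by_mntns_equality(&host_mnt, &container_mnt));
}
```

The last pair of assertions is the Docker-style case: both mount
namespaces belong to the same user namespace, so only the mountns
comparison distinguishes them.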
Quoting Andy Lutomirski ([email protected]):
> On Tue, Oct 14, 2014 at 3:14 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Serge E. Hallyn ([email protected]):
> >> Quoting Eric W. Biederman ([email protected]):
> >> > Andy Lutomirski <[email protected]> writes:
> >> >
> >> > > If a process gets access to a mount from a descendent or unrelated
> >> > > user namespace, that process should not be able to take advantage of
> >> > > setuid files or selinux entrypoints from that filesystem.
> >> > >
> >> > > This will make it safer to allow more complex filesystems to be
> >> > > mounted in non-root user namespaces.
> >> > >
> >> > > This does not remove the need for MNT_LOCK_NOSUID. The setuid,
> >> > > setgid, and file capability bits can no longer be abused if code in
> >> > > a user namespace were to clear nosuid on an untrusted filesystem,
> >> > > but this patch, by itself, is insufficient to protect the system
> >> > > from abuse of files that, when execed, would increase MAC privilege.
> >> > >
> >> > > As a more concrete explanation, any task that can manipulate a
> >> > > vfsmount associated with a given user namespace already has
> >> > > capabilities in that namespace and all of its descendents. If they
> >> > > can cause a malicious setuid, setgid, or file-caps executable to
> >> > > appear in that mount, then that executable will only allow them to
> >> > > elevate privileges in exactly the set of namespaces in which they
> >> > > are already privileged.
> >> > >
> >> > > On the other hand, if they can cause a malicious executable to
> >> > > appear with a dangerous MAC label, running it could change the
> >> > > caller's security context in a way that should not have been
> >> > > possible, even inside the namespace in which the task is confined.
> >> >
> >> > As presented this is complete and total nonsense. Mount propagation
> >> > strongly weakens if not completely breaks the assumptions you are making
> >> > in this code.
> >> >
> >> > To write any generic code that knows anything we need to capture a user
> >> > namespace on struct super.
> >> >
> >> > Further I think all we really want is to filter out security labels from
> >> > unprivileged mounts. uids/gids and the like should be completely fine
> >> > because of the uid mappings.
> >> >
> >> > Having been down the route of comparing uids as userns uid tuples I am
> >> > convinced that anything that requires us to take the user namespace into
> >> > account on a routine basis in the core will simply be broken for someone
> >> > forgetting somewhere. This looks like a design that has that kind of
> >> > susceptibility.
> >>
> >> The above paragraph is very compelling. However Andy's patch is a step
> >> in the right direction from what we've got. I think given what you say
> >> below and given Andy's rationale above, simply tweaking his patch to
> >> ignore the parent-userns loop, and return false if current_user_ns() !=
> >> mount_userns, should be right? It'll prevent a child userns from
> >> setting a selinux/apparmor entrypoint or POSIX file capabilities on a
> >> file and having the parent userns trip over those.
> >
> > Ok, Andy's fn does the opposite, which will protect the parent userns,
> > which is good.
> >
> > I suspect simply insisting that the user_ns's be equal is still better.
> > It fits better with the idea that POSIX caps (and LSM entrypoints) are
> > orthogonal to DAC. Kinda.
>
> We could tighten it even further if we compared *mount* namespaces
> instead of user namespaces. That would benefit Docker, non-userns-lxc
> and such, too (sigh).
>
> Actually, I see no good reason to insist on userns equality but not on
> mountns equality. If we're not going to trust executables in foreign
> namespaces, let's go all the way to distrust executables in all
> foreign namespaces, at least unless someone thinks of a reason this
> would break existing userspace.
I have no doubt there is code out there in production which ends up
executing /proc/pid/root/sbin/ifconfig etc. Cause, you know, you really
wanna execute whatever garbage is there... Breaking that might be a
good thing.

-serge
On Tue, Oct 14, 2014 at 3:45 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Andy Lutomirski ([email protected]):
>> On Tue, Oct 14, 2014 at 3:14 PM, Serge E. Hallyn <[email protected]> wrote:
>> > Quoting Serge E. Hallyn ([email protected]):
>> >> Quoting Eric W. Biederman ([email protected]):
>> >> > Andy Lutomirski <[email protected]> writes:
>> >> >
>> >> > > If a process gets access to a mount from a descendent or unrelated
>> >> > > user namespace, that process should not be able to take advantage of
>> >> > > setuid files or selinux entrypoints from that filesystem.
>> >> > >
>> >> > > This will make it safer to allow more complex filesystems to be
>> >> > > mounted in non-root user namespaces.
>> >> > >
>> >> > > This does not remove the need for MNT_LOCK_NOSUID. The setuid,
>> >> > > setgid, and file capability bits can no longer be abused if code in
>> >> > > a user namespace were to clear nosuid on an untrusted filesystem,
>> >> > > but this patch, by itself, is insufficient to protect the system
>> >> > > from abuse of files that, when execed, would increase MAC privilege.
>> >> > >
>> >> > > As a more concrete explanation, any task that can manipulate a
>> >> > > vfsmount associated with a given user namespace already has
>> >> > > capabilities in that namespace and all of its descendents. If they
>> >> > > can cause a malicious setuid, setgid, or file-caps executable to
>> >> > > appear in that mount, then that executable will only allow them to
>> >> > > elevate privileges in exactly the set of namespaces in which they
>> >> > > are already privileged.
>> >> > >
>> >> > > On the other hand, if they can cause a malicious executable to
>> >> > > appear with a dangerous MAC label, running it could change the
>> >> > > caller's security context in a way that should not have been
>> >> > > possible, even inside the namespace in which the task is confined.
>> >> >
>> >> > As presented this is complete and total nonsense. Mount propagation
>> >> > strongly weakens if not completely breaks the assumptions you are making
>> >> > in this code.
>> >> >
>> >> > To write any generic code that knows anything we need to capture a user
>> >> > namespace on struct super.
>> >> >
>> >> > Further I think all we really want is to filter out security labels from
>> >> > unprivileged mounts. uids/gids and the like should be completely fine
>> >> > because of the uid mappings.
>> >> >
>> >> > Having been down the route of comparing uids as userns uid tuples I am
>> >> > convinced that anything that requires us to take the user namespace into
>> >> > account on a routine basis in the core will simply be broken for someone
>> >> > forgetting somewhere. This looks like a design that has that kind of
>> >> > susceptibility.
>> >>
>> >> The above paragraph is very compelling. However Andy's patch is a step
>> >> in the right direction from what we've got. I think given what you say
>> >> below and given Andy's rationale above, simply tweaking his patch to
>> >> ignore the parent-userns loop, and return false if current_user_ns() !=
>> >> mount_userns, should be right? It'll prevent a child userns from
>> >> setting a selinux/apparmor entrypoint or POSIX file capabilities on a
>> >> file and having the parent userns trip over those.
>> >
>> > Ok, Andy's fn does the opposite, which will protect the parent userns,
>> > which is good.
>> >
>> > I suspect simply insisting that the user_ns's be equal is still better.
>> > It fits better with the idea that POSIX caps (and LSM entrypoints) are
>> > orthogonal to DAC. Kinda.
>>
>> We could tighten it even further if we compared *mount* namespaces
>> instead of user namespaces. That would benefit Docker, non-userns-lxc
>> and such, too (sigh).
>>
>> Actually, I see no good reason to insist on userns equality but not on
>> mountns equality. If we're not going to trust executables in foreign
>> namespaces, let's go all the way to distrust executables in all
>> foreign namespaces, at least unless someone thinks of a reason this
>> would break existing userspace.
>
> I have no doubt there is code out there in production which ends up
> executing /proc/pid/root/sbin/ifconfig etc. Cause, you know, you really
> wanna execute whatever garbage is there... Breaking that might be a
> good thing.
Heh.

But it's the code that executes /proc/pid/root/sbin/sudo that we'll break :)

I'll send a new patch.

--Andy