Hi,
I've sent a few patches and emails over the past months about supporting
file capabilities in user namespace confined containers. A few of the
requirements as I see them are:
1. Root in a user namespace should be able to set file capabilities on a binary
for use by any user mapped into his namespace.
2. Any uid not mapped into the user namespace whose root user set file
capabilities should not gain privileges when running an executable which only
has file capabilities set by this root user.
3. Existing calls to cap_set_file(3) and cap_get_file(3) as well as
setcap(8) and getcap(8) should transparently work. This would allow
package managers to simply set file capabilities in postinst.
Below is a kernel patch which implements a new security.nscapability
extended attribute. Setting this xattr on a file requires cap_setfcap
against the current user namespace, and for the file to be owned by
a uid and gid mapped into that namespace. When found on a file,
the capabilities will take effect only if the file is owned by the
root uid in the caller's namespace, or the root uid in any ancestor
namespace.
While this design supports nested namespaces, it does not support
use of file capabilities by users in unrelated namespaces. So if
the same file is linked into two namespaces N1 and N2 which do not
share the same root kuid, then the only way for N1 and N2 to both
execute the file while respecting security.nscapability is to have
a common ancestor namespace write the capability. The only reasonable
way we could handle this case would be to use a securityfs interface
to set file capabilities. The capability.ko module could then
do the work of keeping a list of uid ranges for which file capabilities
should be honored. I don't think that flexibility is really called
for.
The kernel patch follows, and can be found at
https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/log/?h=2016-04-22/nsfscaps
The libcap patch can be found at
https://git.kernel.org/cgit/linux/kernel/git/sergeh/libcap.git/log/?h=2016-04-22/nscaps
Comments/conversation/suggestions greatly appreciated.
thanks,
-serge
From: Serge Hallyn <[email protected]>
This can only be set by root in his own namespace, and will
only be respected by namespaces with that same root kuid
mapped as root, or namespaces descended from it.
This allows a simple setxattr to work, allows tar/untar to
work, and allows us to tar in one namespace and untar in
another while preserving the capability, without risking
leaking privilege into a parent namespace.
Signed-off-by: Serge Hallyn <[email protected]>
---
include/linux/capability.h | 5 ++-
include/uapi/linux/capability.h | 18 ++++++++
include/uapi/linux/xattr.h | 3 ++
security/commoncap.c | 91 +++++++++++++++++++++++++++++++++++++--
4 files changed, 112 insertions(+), 5 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 00690ff..cf533ff 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -13,7 +13,7 @@
#define _LINUX_CAPABILITY_H
#include <uapi/linux/capability.h>
-
+#include <linux/uidgid.h>
#define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
#define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3
@@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
kernel_cap_t inheritable;
};
+#define NS_CAPS_VERSION(x) (x & 0xFF)
+#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
+
#define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct))
#define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 12c37a1..f0b4a66 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
#define VFS_CAP_U32_2 2
#define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
+/* version number for security.nscapability xattrs hdr->hdr_info */
+#define VFS_NS_CAP_REVISION 1
+
#define XATTR_CAPS_SZ XATTR_CAPS_SZ_2
#define VFS_CAP_U32 VFS_CAP_U32_2
#define VFS_CAP_REVISION VFS_CAP_REVISION_2
@@ -74,6 +77,21 @@ struct vfs_cap_data {
} data[VFS_CAP_U32];
};
+#define VFS_NS_CAP_EFFECTIVE 0x1
+/*
+ * 32-bit hdr_info contains
+ * 16 leftmost: reserved
+ * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
+ * last 8: version
+ */
+struct vfs_ns_cap_data {
+ __le32 magic_etc;
+ struct {
+ __le32 permitted; /* Little endian */
+ __le32 inheritable; /* Little endian */
+ } data[VFS_CAP_U32];
+};
+
#ifndef __KERNEL__
/*
diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
index 1590c49..67c80ab 100644
--- a/include/uapi/linux/xattr.h
+++ b/include/uapi/linux/xattr.h
@@ -68,6 +68,9 @@
#define XATTR_CAPS_SUFFIX "capability"
#define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
+#define XATTR_NS_CAPS_SUFFIX "nscapability"
+#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
+
#define XATTR_POSIX_ACL_ACCESS "posix_acl_access"
#define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
#define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
diff --git a/security/commoncap.c b/security/commoncap.c
index 48071ed..8f3f34a 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
if (!inode->i_op->getxattr)
return 0;
+ error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
+ if (error > 0)
+ return 1;
+
error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
if (error <= 0)
return 0;
@@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
int cap_inode_killpriv(struct dentry *dentry)
{
struct inode *inode = d_backing_inode(dentry);
+ int ret1, ret2;
if (!inode->i_op->removexattr)
return 0;
- return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+ ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
+ ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
+
+ if (ret1 != 0)
+ return ret1;
+ return ret2;
}
/*
@@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
return 0;
}
+int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
+{
+ struct inode *inode = d_backing_inode(dentry);
+ unsigned i;
+ u32 magic_etc;
+ ssize_t size;
+ struct vfs_ns_cap_data nscap;
+ bool foundroot = false;
+ struct user_namespace *ns;
+
+ memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+
+ if (!inode || !inode->i_op->getxattr)
+ return -ENODATA;
+
+ /* verify that current or ancestor userns root owns this file */
+ for (ns = current_user_ns(); ; ns = ns->parent) {
+ if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
+ foundroot = true;
+ break;
+ }
+ if (ns == &init_user_ns)
+ break;
+ }
+ if (!foundroot)
+ return -ENODATA;
+
+ size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
+ &nscap, sizeof(nscap));
+ if (size == -ENODATA || size == -EOPNOTSUPP)
+ /* no data, that's ok */
+ return -ENODATA;
+ if (size < 0)
+ return size;
+ if (size != sizeof(nscap))
+ return -EINVAL;
+
+ magic_etc = le32_to_cpu(nscap.magic_etc);
+
+ if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
+ return -EINVAL;
+
+ cpu_caps->magic_etc = VFS_CAP_REVISION_2;
+ if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
+ cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
+ /* copy the entry */
+ CAP_FOR_EACH_U32(i) {
+ if (i >= VFS_CAP_U32_2)
+ break;
+ cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
+ cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
+ }
+
+ cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+ cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
+
+ return 0;
+}
+
/*
* Attempt to get the on-exec apply capability sets for an executable file from
* its xattrs and, if present, apply them to the proposed credentials being
@@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
return 0;
- rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+ rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+ if (rc == -ENODATA)
+ rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
if (rc < 0) {
if (rc == -EINVAL)
- printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
- __func__, rc, bprm->filename);
+ printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+ bprm->filename);
else if (rc == -ENODATA)
rc = 0;
goto out;
@@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
return 0;
}
+ if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+ if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+ return -EPERM;
+ return 0;
+ }
+
if (!strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) &&
!capable(CAP_SYS_ADMIN))
@@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
return 0;
}
+ if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
+ if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
+ return -EPERM;
+ return 0;
+ }
+
if (!strncmp(name, XATTR_SECURITY_PREFIX,
sizeof(XATTR_SECURITY_PREFIX) - 1) &&
!capable(CAP_SYS_ADMIN))
--
1.7.9.5
On Fri, Apr 22, 2016 at 12:26:33PM -0500, [email protected] wrote:
> From: Serge Hallyn <[email protected]>
>
> This can only be set by root in his own namespace, and will
> only be respected by namespaces with that same root kuid
> mapped as root, or namespaces descended from it.
>
> This allows a simple setxattr to work, allows tar/untar to
> work, and allows us to tar in one namespace and untar in
> another while preserving the capability, without risking
> leaking privilege into a parent namespace.
>
> Signed-off-by: Serge Hallyn <[email protected]>
Seems like the simplest possible design which meets the requirements.
Acked-by: Seth Forshee <[email protected]>
On Fri, Apr 22, 2016 at 10:26 AM, <[email protected]> wrote:
> From: Serge Hallyn <[email protected]>
>
> This can only be set by root in his own namespace, and will
> only be respected by namespaces with that same root kuid
> mapped as root, or namespaces descended from it.
>
> This allows a simple setxattr to work, allows tar/untar to
> work, and allows us to tar in one namespace and untar in
> another while preserving the capability, without risking
> leaking privilege into a parent namespace.
The concept seems sensible to me. Various notes below...
>
> Signed-off-by: Serge Hallyn <[email protected]>
> ---
> include/linux/capability.h | 5 ++-
> include/uapi/linux/capability.h | 18 ++++++++
> include/uapi/linux/xattr.h | 3 ++
> security/commoncap.c | 91 +++++++++++++++++++++++++++++++++++++--
> 4 files changed, 112 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 00690ff..cf533ff 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -13,7 +13,7 @@
> #define _LINUX_CAPABILITY_H
>
> #include <uapi/linux/capability.h>
> -
> +#include <linux/uidgid.h>
>
> #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> #define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3
> @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> kernel_cap_t inheritable;
> };
>
> +#define NS_CAPS_VERSION(x) (x & 0xFF)
> +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> +
> #define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct))
> #define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
>
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 12c37a1..f0b4a66 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> #define VFS_CAP_U32_2 2
> #define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>
> +/* version number for security.nscapability xattrs hdr->hdr_info */
> +#define VFS_NS_CAP_REVISION 1
> +
> #define XATTR_CAPS_SZ XATTR_CAPS_SZ_2
> #define VFS_CAP_U32 VFS_CAP_U32_2
> #define VFS_CAP_REVISION VFS_CAP_REVISION_2
> @@ -74,6 +77,21 @@ struct vfs_cap_data {
> } data[VFS_CAP_U32];
> };
>
> +#define VFS_NS_CAP_EFFECTIVE 0x1
> +/*
> + * 32-bit hdr_info contains
> + * 16 leftmost: reserved
> + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> + * last 8: version
> + */
> +struct vfs_ns_cap_data {
> + __le32 magic_etc;
> + struct {
> + __le32 permitted; /* Little endian */
> + __le32 inheritable; /* Little endian */
> + } data[VFS_CAP_U32];
> +};
This is identical to vfs_cap_data. Is there a reason not to re-use the
existing one?
> +
> #ifndef __KERNEL__
>
> /*
> diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> index 1590c49..67c80ab 100644
> --- a/include/uapi/linux/xattr.h
> +++ b/include/uapi/linux/xattr.h
> @@ -68,6 +68,9 @@
> #define XATTR_CAPS_SUFFIX "capability"
> #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
>
> +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> +
> #define XATTR_POSIX_ACL_ACCESS "posix_acl_access"
> #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
Are these documented anywhere in Documentation/ or in man pages? This
seems like it'd need a man-page update too.
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 48071ed..8f3f34a 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> if (!inode->i_op->getxattr)
> return 0;
>
> + error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> + if (error > 0)
> + return 1;
> +
> error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> if (error <= 0)
> return 0;
I think this might be more readable if the getxattr calls were
standardized (one returns 1, the other returns 0, with inverted tests
-- hard to read). IIUC, if either returns > 0, return 1, otherwise
return 0.
> @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> int cap_inode_killpriv(struct dentry *dentry)
> {
> struct inode *inode = d_backing_inode(dentry);
> + int ret1, ret2;
>
> if (!inode->i_op->removexattr)
> return 0;
>
> - return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> + ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> + ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> +
> + if (ret1 != 0)
> + return ret1;
> + return ret2;
> }
>
> /*
> @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> return 0;
> }
>
> +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> +{
> + struct inode *inode = d_backing_inode(dentry);
> + unsigned i;
> + u32 magic_etc;
> + ssize_t size;
> + struct vfs_ns_cap_data nscap;
> + bool foundroot = false;
> + struct user_namespace *ns;
> +
> + memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> +
> + if (!inode || !inode->i_op->getxattr)
> + return -ENODATA;
> +
> + /* verify that current or ancestor userns root owns this file */
> + for (ns = current_user_ns(); ; ns = ns->parent) {
> + if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> + foundroot = true;
> + break;
> + }
> + if (ns == &init_user_ns)
> + break;
> + }
> + if (!foundroot)
> + return -ENODATA;
> +
> + size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> + &nscap, sizeof(nscap));
> + if (size == -ENODATA || size == -EOPNOTSUPP)
> + /* no data, that's ok */
> + return -ENODATA;
> + if (size < 0)
> + return size;
> + if (size != sizeof(nscap))
> + return -EINVAL;
> +
> + magic_etc = le32_to_cpu(nscap.magic_etc);
> +
> + if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> + return -EINVAL;
> +
> + cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> + if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> + cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> + /* copy the entry */
> + CAP_FOR_EACH_U32(i) {
> + if (i >= VFS_CAP_U32_2)
> + break;
> + cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> + cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> + }
> +
> + cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> + cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> +
> + return 0;
> +}
> +
> /*
> * Attempt to get the on-exec apply capability sets for an executable file from
> * its xattrs and, if present, apply them to the proposed credentials being
> @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> return 0;
>
> - rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> + rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> + if (rc == -ENODATA)
> + rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
So nscaps overrides a "regular" file cap? That might seem worth
mentioning in the change log or somewhere.
> if (rc < 0) {
> if (rc == -EINVAL)
> - printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> - __func__, rc, bprm->filename);
> + printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> + bprm->filename);
> else if (rc == -ENODATA)
> rc = 0;
> goto out;
> @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> return 0;
> }
>
> + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> + return -EPERM;
> + return 0;
> + }
> +
> if (!strncmp(name, XATTR_SECURITY_PREFIX,
> sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> !capable(CAP_SYS_ADMIN))
> @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> return 0;
> }
>
> + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> + return -EPERM;
> + return 0;
> + }
> +
This looks like userspace must knowingly be aware that it is in a
namespace and to DTRT instead of it being translated by the kernel
when setxattr is called under !init_user_ns?
> if (!strncmp(name, XATTR_SECURITY_PREFIX,
> sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> !capable(CAP_SYS_ADMIN))
> --
> 1.7.9.5
>
--
Kees Cook
Chrome OS & Brillo Security
Quoting Kees Cook ([email protected]):
> On Fri, Apr 22, 2016 at 10:26 AM, <[email protected]> wrote:
> > From: Serge Hallyn <[email protected]>
> >
> > This can only be set by root in his own namespace, and will
> > only be respected by namespaces with that same root kuid
> > mapped as root, or namespaces descended from it.
> >
> > This allows a simple setxattr to work, allows tar/untar to
> > work, and allows us to tar in one namespace and untar in
> > another while preserving the capability, without risking
> > leaking privilege into a parent namespace.
>
> The concept seems sensible to me. Various notes below...
>
> >
> > Signed-off-by: Serge Hallyn <[email protected]>
> > ---
> > include/linux/capability.h | 5 ++-
> > include/uapi/linux/capability.h | 18 ++++++++
> > include/uapi/linux/xattr.h | 3 ++
> > security/commoncap.c | 91 +++++++++++++++++++++++++++++++++++++--
> > 4 files changed, 112 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> > index 00690ff..cf533ff 100644
> > --- a/include/linux/capability.h
> > +++ b/include/linux/capability.h
> > @@ -13,7 +13,7 @@
> > #define _LINUX_CAPABILITY_H
> >
> > #include <uapi/linux/capability.h>
> > -
> > +#include <linux/uidgid.h>
> >
> > #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> > #define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3
> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> > kernel_cap_t inheritable;
> > };
> >
> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> > +
> > #define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct))
> > #define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
> >
> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > index 12c37a1..f0b4a66 100644
> > --- a/include/uapi/linux/capability.h
> > +++ b/include/uapi/linux/capability.h
> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> > #define VFS_CAP_U32_2 2
> > #define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >
> > +/* version number for security.nscapability xattrs hdr->hdr_info */
> > +#define VFS_NS_CAP_REVISION 1
> > +
> > #define XATTR_CAPS_SZ XATTR_CAPS_SZ_2
> > #define VFS_CAP_U32 VFS_CAP_U32_2
> > #define VFS_CAP_REVISION VFS_CAP_REVISION_2
> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
> > } data[VFS_CAP_U32];
> > };
> >
> > +#define VFS_NS_CAP_EFFECTIVE 0x1
> > +/*
> > + * 32-bit hdr_info contains
> > + * 16 leftmost: reserved
> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> > + * last 8: version
> > + */
> > +struct vfs_ns_cap_data {
> > + __le32 magic_etc;
> > + struct {
> > + __le32 permitted; /* Little endian */
> > + __le32 inheritable; /* Little endian */
> > + } data[VFS_CAP_U32];
> > +};
>
> This is identical to vfs_cap_data. Is there a reason not to re-use the
> existing one?
Hm. I used to have them completely different. ATM the only difference
is what goes into the magic_etc, and that not very (different). So
yeah it probably should be re-used.
> > +
> > #ifndef __KERNEL__
> >
> > /*
> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> > index 1590c49..67c80ab 100644
> > --- a/include/uapi/linux/xattr.h
> > +++ b/include/uapi/linux/xattr.h
> > @@ -68,6 +68,9 @@
> > #define XATTR_CAPS_SUFFIX "capability"
> > #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
> >
> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> > +
> > #define XATTR_POSIX_ACL_ACCESS "posix_acl_access"
> > #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> > #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
>
> Are these documented anywhere in Documentation/ or in man pages? This
> seems like it'd need a man-page update too.
Yeah, if we decide we're ok with this strategy I'll update those.
> > diff --git a/security/commoncap.c b/security/commoncap.c
> > index 48071ed..8f3f34a 100644
> > --- a/security/commoncap.c
> > +++ b/security/commoncap.c
> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> > if (!inode->i_op->getxattr)
> > return 0;
> >
> > + error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> > + if (error > 0)
> > + return 1;
> > +
> > error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> > if (error <= 0)
> > return 0;
>
> I think this might be more readable if the getxattr calls were
> standardized (one returns 1, the other returns 0, with inverted tests
> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
> return 0.
Hm. Yeah I can see where that would be confusing. I can change the flow
to make the checks the same.
> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> > int cap_inode_killpriv(struct dentry *dentry)
> > {
> > struct inode *inode = d_backing_inode(dentry);
> > + int ret1, ret2;
> >
> > if (!inode->i_op->removexattr)
> > return 0;
> >
> > - return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> > + ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> > + ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> > +
> > + if (ret1 != 0)
> > + return ret1;
> > + return ret2;
> > }
> >
> > /*
> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> > return 0;
> > }
> >
> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> > +{
> > + struct inode *inode = d_backing_inode(dentry);
> > + unsigned i;
> > + u32 magic_etc;
> > + ssize_t size;
> > + struct vfs_ns_cap_data nscap;
> > + bool foundroot = false;
> > + struct user_namespace *ns;
> > +
> > + memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> > +
> > + if (!inode || !inode->i_op->getxattr)
> > + return -ENODATA;
> > +
> > + /* verify that current or ancestor userns root owns this file */
> > + for (ns = current_user_ns(); ; ns = ns->parent) {
> > + if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> > + foundroot = true;
> > + break;
> > + }
> > + if (ns == &init_user_ns)
> > + break;
> > + }
> > + if (!foundroot)
> > + return -ENODATA;
> > +
> > + size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> > + &nscap, sizeof(nscap));
> > + if (size == -ENODATA || size == -EOPNOTSUPP)
> > + /* no data, that's ok */
> > + return -ENODATA;
> > + if (size < 0)
> > + return size;
> > + if (size != sizeof(nscap))
> > + return -EINVAL;
> > +
> > + magic_etc = le32_to_cpu(nscap.magic_etc);
> > +
> > + if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> > + return -EINVAL;
> > +
> > + cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> > + if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> > + cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> > + /* copy the entry */
> > + CAP_FOR_EACH_U32(i) {
> > + if (i >= VFS_CAP_U32_2)
> > + break;
> > + cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> > + cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> > + }
> > +
> > + cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> > + cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * Attempt to get the on-exec apply capability sets for an executable file from
> > * its xattrs and, if present, apply them to the proposed credentials being
> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> > if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> > return 0;
> >
> > - rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> > + rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> > + if (rc == -ENODATA)
> > + rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>
> So nscaps overrides a "regular" file cap? That might seem worth
> mentioning in the change log or somewhere.
If it applies, yes. Now maybe that's not what we want. Maybe so long as
a regular one exists (which must have been set by the init_user_ns admin)
we should use it? I think that makes sense, what do you think?
(In either case agreed it should be clearly documented)
> > if (rc < 0) {
> > if (rc == -EINVAL)
> > - printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> > - __func__, rc, bprm->filename);
> > + printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> > + bprm->filename);
> > else if (rc == -ENODATA)
> > rc = 0;
> > goto out;
> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> > return 0;
> > }
> >
> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> > + return -EPERM;
> > + return 0;
> > + }
> > +
> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> > !capable(CAP_SYS_ADMIN))
> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> > return 0;
> > }
> >
> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> > + return -EPERM;
> > + return 0;
> > + }
> > +
>
> This looks like userspace must knowingly be aware that it is in a
> namespace and to DTRT instead of it being translated by the kernel
> when setxattr is called under !init_user_ns?
Yes - my libcap2 patch checks /proc/self/uid_map to decide that. If that
shows you are in init_user_ns then it uses security.capability, otherwise
it uses security.nscapability.
I've occasionally considered having the xattr code do the quiet
substitution if need be.
In fact, much of this structure comes from when I was still trying to
do multiple values per xattr. Given what we're doing here, we could
keep the xattr contents exactly the same, just changing the name.
So userspace could just get and set security.capability; if you are
in a non-init user_ns, if security.capability is set then you cannot
set it; if security.capability is not set, then the kernel writes
security.nscapability instead and returns success.
I don't like magic, but this might be just straightforward enough
to not be offensive. Thoughts?
> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> > !capable(CAP_SYS_ADMIN))
> > --
> > 1.7.9.5
> >
>
>
>
> --
> Kees Cook
> Chrome OS & Brillo Security
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers
On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Kees Cook ([email protected]):
>> On Fri, Apr 22, 2016 at 10:26 AM, <[email protected]> wrote:
>> > From: Serge Hallyn <[email protected]>
>> >
>> > This can only be set by root in his own namespace, and will
>> > only be respected by namespaces with that same root kuid
>> > mapped as root, or namespaces descended from it.
>> >
>> > This allows a simple setxattr to work, allows tar/untar to
>> > work, and allows us to tar in one namespace and untar in
>> > another while preserving the capability, without risking
>> > leaking privilege into a parent namespace.
>>
>> The concept seems sensible to me. Various notes below...
>>
>> >
>> > Signed-off-by: Serge Hallyn <[email protected]>
>> > ---
>> > include/linux/capability.h | 5 ++-
>> > include/uapi/linux/capability.h | 18 ++++++++
>> > include/uapi/linux/xattr.h | 3 ++
>> > security/commoncap.c | 91 +++++++++++++++++++++++++++++++++++++--
>> > 4 files changed, 112 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/include/linux/capability.h b/include/linux/capability.h
>> > index 00690ff..cf533ff 100644
>> > --- a/include/linux/capability.h
>> > +++ b/include/linux/capability.h
>> > @@ -13,7 +13,7 @@
>> > #define _LINUX_CAPABILITY_H
>> >
>> > #include <uapi/linux/capability.h>
>> > -
>> > +#include <linux/uidgid.h>
>> >
>> > #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>> > #define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3
>> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
>> > kernel_cap_t inheritable;
>> > };
>> >
>> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
>> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
>> > +
>> > #define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct))
>> > #define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
>> >
>> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
>> > index 12c37a1..f0b4a66 100644
>> > --- a/include/uapi/linux/capability.h
>> > +++ b/include/uapi/linux/capability.h
>> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
>> > #define VFS_CAP_U32_2 2
>> > #define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>> >
>> > +/* version number for security.nscapability xattrs hdr->hdr_info */
>> > +#define VFS_NS_CAP_REVISION 1
>> > +
>> > #define XATTR_CAPS_SZ XATTR_CAPS_SZ_2
>> > #define VFS_CAP_U32 VFS_CAP_U32_2
>> > #define VFS_CAP_REVISION VFS_CAP_REVISION_2
>> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
>> > } data[VFS_CAP_U32];
>> > };
>> >
>> > +#define VFS_NS_CAP_EFFECTIVE 0x1
>> > +/*
>> > + * 32-bit hdr_info contains
>> > + * 16 leftmost: reserved
>> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
>> > + * last 8: version
>> > + */
>> > +struct vfs_ns_cap_data {
>> > + __le32 magic_etc;
>> > + struct {
>> > + __le32 permitted; /* Little endian */
>> > + __le32 inheritable; /* Little endian */
>> > + } data[VFS_CAP_U32];
>> > +};
>>
>> This is identical to vfs_cap_data. Is there a reason not to re-use the
>> existing one?
>
> Hm. I used to have them completely different. ATM the only difference
> is what goes into the magic_etc, and that not very (different). So
> yeah it probably should be re-used.
>
>> > +
>> > #ifndef __KERNEL__
>> >
>> > /*
>> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
>> > index 1590c49..67c80ab 100644
>> > --- a/include/uapi/linux/xattr.h
>> > +++ b/include/uapi/linux/xattr.h
>> > @@ -68,6 +68,9 @@
>> > #define XATTR_CAPS_SUFFIX "capability"
>> > #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
>> >
>> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
>> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
>> > +
>> > #define XATTR_POSIX_ACL_ACCESS "posix_acl_access"
>> > #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
>> > #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
>>
>> Are these documented anywhere in Documentation/ or in man pages? This
>> seems like it'd need a man-page update too.
>
> Yeah, if we decide we're ok with this strategy I'll update those.
>
>> > diff --git a/security/commoncap.c b/security/commoncap.c
>> > index 48071ed..8f3f34a 100644
>> > --- a/security/commoncap.c
>> > +++ b/security/commoncap.c
>> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>> > if (!inode->i_op->getxattr)
>> > return 0;
>> >
>> > + error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
>> > + if (error > 0)
>> > + return 1;
>> > +
>> > error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
>> > if (error <= 0)
>> > return 0;
>>
>> I think this might be more readable if the getxattr calls were
>> standardized (one returns 1, the other returns 0, with inverted tests
>> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
>> return 0.
>
> Hm. Yeah I can see where that would be confusing. I can change the flow
> to make the checks the same.
>
>> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
>> > int cap_inode_killpriv(struct dentry *dentry)
>> > {
>> > struct inode *inode = d_backing_inode(dentry);
>> > + int ret1, ret2;
>> >
>> > if (!inode->i_op->removexattr)
>> > return 0;
>> >
>> > - return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
>> > + ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
>> > + ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
>> > +
>> > + if (ret1 != 0)
>> > + return ret1;
>> > + return ret2;
>> > }
>> >
>> > /*
>> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>> > return 0;
>> > }
>> >
>> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
>> > +{
>> > + struct inode *inode = d_backing_inode(dentry);
>> > + unsigned i;
>> > + u32 magic_etc;
>> > + ssize_t size;
>> > + struct vfs_ns_cap_data nscap;
>> > + bool foundroot = false;
>> > + struct user_namespace *ns;
>> > +
>> > + memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
>> > +
>> > + if (!inode || !inode->i_op->getxattr)
>> > + return -ENODATA;
>> > +
>> > + /* verify that current or ancestor userns root owns this file */
>> > + for (ns = current_user_ns(); ; ns = ns->parent) {
>> > + if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
>> > + foundroot = true;
>> > + break;
>> > + }
>> > + if (ns == &init_user_ns)
>> > + break;
>> > + }
>> > + if (!foundroot)
>> > + return -ENODATA;
>> > +
>> > + size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
>> > + &nscap, sizeof(nscap));
>> > + if (size == -ENODATA || size == -EOPNOTSUPP)
>> > + /* no data, that's ok */
>> > + return -ENODATA;
>> > + if (size < 0)
>> > + return size;
>> > + if (size != sizeof(nscap))
>> > + return -EINVAL;
>> > +
>> > + magic_etc = le32_to_cpu(nscap.magic_etc);
>> > +
>> > + if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
>> > + return -EINVAL;
>> > +
>> > + cpu_caps->magic_etc = VFS_CAP_REVISION_2;
>> > + if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
>> > + cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
>> > + /* copy the entry */
>> > + CAP_FOR_EACH_U32(i) {
>> > + if (i >= VFS_CAP_U32_2)
>> > + break;
>> > + cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
>> > + cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
>> > + }
>> > +
>> > + cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>> > + cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
>> > +
>> > + return 0;
>> > +}
>> > +
>> > /*
>> > * Attempt to get the on-exec apply capability sets for an executable file from
>> > * its xattrs and, if present, apply them to the proposed credentials being
>> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>> > if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
>> > return 0;
>> >
>> > - rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>> > + rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>> > + if (rc == -ENODATA)
>> > + rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>>
>> So nscaps overrides a "regular" file cap? That might seem worth
>> mentioning in the change log or somewhere.
>
> If it applies, yes. Now maybe that's not what we want. Maybe so long as
> a regular one exists (which must have been set by the init_user_ns admin)
> we should use it? I think that makes sense, what do you think?
I struggle to see why the non-ns fscap should be honored... this is
effectively fscap stacking, and there's no way that I see for the cap
to leak "up" to init_user_ns. It only leaks up to nested user_nses...
but they need to already be priv-equiv. Hmmm.
I'm CC'ing Jann who like these mind-benders... :)
>
> (In either case agreed it should be clearly documented)
>
>> > if (rc < 0) {
>> > if (rc == -EINVAL)
>> > - printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
>> > - __func__, rc, bprm->filename);
>> > + printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
>> > + bprm->filename);
>> > else if (rc == -ENODATA)
>> > rc = 0;
>> > goto out;
>> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>> > return 0;
>> > }
>> >
>> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
>> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
>> > + return -EPERM;
>> > + return 0;
>> > + }
>> > +
>> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
>> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>> > !capable(CAP_SYS_ADMIN))
>> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
>> > return 0;
>> > }
>> >
>> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
>> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
>> > + return -EPERM;
>> > + return 0;
>> > + }
>> > +
>>
>> This looks like userspace must knowingly be aware that it is in a
>> namespace and to DTRT instead of it being translated by the kernel
>> when setxattr is called under !init_user_ns?
>
> Yes - my libcap2 patch checks /proc/self/uid_map to decide that. If that
> shows you are in init_user_ns then it uses security.capability, otherwise
> it uses security.nscapability.
>
> I've occasionally considered having the xattr code do the quiet
> substitution if need be.
>
> In fact, much of this structure comes from when I was still trying to
> do multiple values per xattr. Given what we're doing here, we could
> keep the xattr contents exactly the same, just changing the name.
> So userspace could just get and set security.capability; if you are
> in a non-init user_ns, if security.capability is set then you cannot
> set it; if security.capability is not set, then the kernel writes
> security.nscapability instead and returns success.
>
> I don't like magic, but this might be just straightforward enough
> to not be offensive. Thoughts?
Yeah, I think it might be better to have the magic in this case, since
it seems weird to just reject setxattr if a tool didn't realize it was
in a namespace. I'm not sure -- it is also nice to have an explicit
API here.
I would defer to Eric or Michael on that. I keep going back and forth,
though I suspect it's probably best to do what you already have
(explicit API).
-Kees
>
>> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
>> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
>> > !capable(CAP_SYS_ADMIN))
>> > --
>> > 1.7.9.5
>> >
>>
>>
>>
>> --
>> Kees Cook
>> Chrome OS & Brillo Security
>> _______________________________________________
>> Containers mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
--
Kees Cook
Chrome OS & Brillo Security
Quoting Kees Cook ([email protected]):
> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Kees Cook ([email protected]):
> >> On Fri, Apr 22, 2016 at 10:26 AM, <[email protected]> wrote:
> >> > From: Serge Hallyn <[email protected]>
> >> >
> >> > This can only be set by root in his own namespace, and will
> >> > only be respected by namespaces with that same root kuid
> >> > mapped as root, or namespaces descended from it.
> >> >
> >> > This allows a simple setxattr to work, allows tar/untar to
> >> > work, and allows us to tar in one namespace and untar in
> >> > another while preserving the capability, without risking
> >> > leaking privilege into a parent namespace.
> >>
> >> The concept seems sensible to me. Various notes below...
> >>
> >> >
> >> > Signed-off-by: Serge Hallyn <[email protected]>
> >> > ---
> >> > include/linux/capability.h | 5 ++-
> >> > include/uapi/linux/capability.h | 18 ++++++++
> >> > include/uapi/linux/xattr.h | 3 ++
> >> > security/commoncap.c | 91 +++++++++++++++++++++++++++++++++++++--
> >> > 4 files changed, 112 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> >> > index 00690ff..cf533ff 100644
> >> > --- a/include/linux/capability.h
> >> > +++ b/include/linux/capability.h
> >> > @@ -13,7 +13,7 @@
> >> > #define _LINUX_CAPABILITY_H
> >> >
> >> > #include <uapi/linux/capability.h>
> >> > -
> >> > +#include <linux/uidgid.h>
> >> >
> >> > #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> >> > #define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3
> >> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> >> > kernel_cap_t inheritable;
> >> > };
> >> >
> >> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
> >> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> >> > +
> >> > #define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct))
> >> > #define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
> >> >
> >> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> >> > index 12c37a1..f0b4a66 100644
> >> > --- a/include/uapi/linux/capability.h
> >> > +++ b/include/uapi/linux/capability.h
> >> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> >> > #define VFS_CAP_U32_2 2
> >> > #define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >> >
> >> > +/* version number for security.nscapability xattrs hdr->hdr_info */
> >> > +#define VFS_NS_CAP_REVISION 1
> >> > +
> >> > #define XATTR_CAPS_SZ XATTR_CAPS_SZ_2
> >> > #define VFS_CAP_U32 VFS_CAP_U32_2
> >> > #define VFS_CAP_REVISION VFS_CAP_REVISION_2
> >> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
> >> > } data[VFS_CAP_U32];
> >> > };
> >> >
> >> > +#define VFS_NS_CAP_EFFECTIVE 0x1
> >> > +/*
> >> > + * 32-bit hdr_info contains
> >> > + * 16 leftmost: reserved
> >> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> >> > + * last 8: version
> >> > + */
> >> > +struct vfs_ns_cap_data {
> >> > + __le32 magic_etc;
> >> > + struct {
> >> > + __le32 permitted; /* Little endian */
> >> > + __le32 inheritable; /* Little endian */
> >> > + } data[VFS_CAP_U32];
> >> > +};
> >>
> >> This is identical to vfs_cap_data. Is there a reason not to re-use the
> >> existing one?
> >
> > Hm. I used to have them completely different. ATM the only difference
> > is what goes into the magic_etc, and that not very (different). So
> > yeah it probably should be re-used.
> >
> >> > +
> >> > #ifndef __KERNEL__
> >> >
> >> > /*
> >> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> >> > index 1590c49..67c80ab 100644
> >> > --- a/include/uapi/linux/xattr.h
> >> > +++ b/include/uapi/linux/xattr.h
> >> > @@ -68,6 +68,9 @@
> >> > #define XATTR_CAPS_SUFFIX "capability"
> >> > #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
> >> >
> >> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> >> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> >> > +
> >> > #define XATTR_POSIX_ACL_ACCESS "posix_acl_access"
> >> > #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> >> > #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
> >>
> >> Are these documented anywhere in Documentation/ or in man pages? This
> >> seems like it'd need a man-page update too.
> >
> > Yeah, if we decide we're ok with this strategy I'll update those.
> >
> >> > diff --git a/security/commoncap.c b/security/commoncap.c
> >> > index 48071ed..8f3f34a 100644
> >> > --- a/security/commoncap.c
> >> > +++ b/security/commoncap.c
> >> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> > if (!inode->i_op->getxattr)
> >> > return 0;
> >> >
> >> > + error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> >> > + if (error > 0)
> >> > + return 1;
> >> > +
> >> > error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> >> > if (error <= 0)
> >> > return 0;
> >>
> >> I think this might be more readable if the getxattr calls were
> >> standardized (one returns 1, the other returns 0, with inverted tests
> >> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
> >> return 0.
> >
> > Hm. Yeah I can see where that would be confusing. I can change the flow
> > to make the checks the same.
> >
> >> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> > int cap_inode_killpriv(struct dentry *dentry)
> >> > {
> >> > struct inode *inode = d_backing_inode(dentry);
> >> > + int ret1, ret2;
> >> >
> >> > if (!inode->i_op->removexattr)
> >> > return 0;
> >> >
> >> > - return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > + ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > + ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> >> > +
> >> > + if (ret1 != 0)
> >> > + return ret1;
> >> > + return ret2;
> >> > }
> >> >
> >> > /*
> >> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >> > return 0;
> >> > }
> >> >
> >> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> >> > +{
> >> > + struct inode *inode = d_backing_inode(dentry);
> >> > + unsigned i;
> >> > + u32 magic_etc;
> >> > + ssize_t size;
> >> > + struct vfs_ns_cap_data nscap;
> >> > + bool foundroot = false;
> >> > + struct user_namespace *ns;
> >> > +
> >> > + memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> >> > +
> >> > + if (!inode || !inode->i_op->getxattr)
> >> > + return -ENODATA;
> >> > +
> >> > + /* verify that current or ancestor userns root owns this file */
> >> > + for (ns = current_user_ns(); ; ns = ns->parent) {
> >> > + if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> >> > + foundroot = true;
> >> > + break;
> >> > + }
> >> > + if (ns == &init_user_ns)
> >> > + break;
> >> > + }
> >> > + if (!foundroot)
> >> > + return -ENODATA;
> >> > +
> >> > + size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> >> > + &nscap, sizeof(nscap));
> >> > + if (size == -ENODATA || size == -EOPNOTSUPP)
> >> > + /* no data, that's ok */
> >> > + return -ENODATA;
> >> > + if (size < 0)
> >> > + return size;
> >> > + if (size != sizeof(nscap))
> >> > + return -EINVAL;
> >> > +
> >> > + magic_etc = le32_to_cpu(nscap.magic_etc);
> >> > +
> >> > + if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> >> > + return -EINVAL;
> >> > +
> >> > + cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> >> > + if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> >> > + cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> >> > + /* copy the entry */
> >> > + CAP_FOR_EACH_U32(i) {
> >> > + if (i >= VFS_CAP_U32_2)
> >> > + break;
> >> > + cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> >> > + cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> >> > + }
> >> > +
> >> > + cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > + cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > +
> >> > + return 0;
> >> > +}
> >> > +
> >> > /*
> >> > * Attempt to get the on-exec apply capability sets for an executable file from
> >> > * its xattrs and, if present, apply them to the proposed credentials being
> >> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >> > if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> >> > return 0;
> >> >
> >> > - rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > + rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > + if (rc == -ENODATA)
> >> > + rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >>
> >> So nscaps overrides a "regular" file cap? That might seem worth
> >> mentioning in the change log or somewhere.
> >
> > If it applies, yes. Now maybe that's not what we want. Maybe so long as
> > a regular one exists (which must have been set by the init_user_ns admin)
> > we should use it? I think that makes sense, what do you think?
>
> I struggle to see why the non-ns fscap should be honored... this is
Mainly because that's how it is right now - if there is a security.capability
then it is honored in all namespaces. Changing that *could* change the
behavior which some poor schlep is depending on. And in general we don't
do that.
> effectively fscap stacking, and there's no way that I see for the cap
> to leak "up" to init_user_ns. It only leaks up to nested user_nses...
> but they need to already be priv-equiv. Hmmm.
>
> I'm CC'ing Jann who like these mind-benders... :)
>
> >
> > (In either case agreed it should be clearly documented)
> >
> >> > if (rc < 0) {
> >> > if (rc == -EINVAL)
> >> > - printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> >> > - __func__, rc, bprm->filename);
> >> > + printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> >> > + bprm->filename);
> >> > else if (rc == -ENODATA)
> >> > rc = 0;
> >> > goto out;
> >> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >> > return 0;
> >> > }
> >> >
> >> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > + return -EPERM;
> >> > + return 0;
> >> > + }
> >> > +
> >> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> > !capable(CAP_SYS_ADMIN))
> >> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >> > return 0;
> >> > }
> >> >
> >> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > + return -EPERM;
> >> > + return 0;
> >> > + }
> >> > +
> >>
> >> This looks like userspace must knowingly be aware that it is in a
> >> namespace and to DTRT instead of it being translated by the kernel
> >> when setxattr is called under !init_user_ns?
> >
> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that. If that
> > shows you are in init_user_ns then it uses security.capability, otherwise
> > it uses security.nscapability.
> >
> > I've occasionally considered having the xattr code do the quiet
> > substitution if need be.
> >
> > In fact, much of this structure comes from when I was still trying to
> > do multiple values per xattr. Given what we're doing here, we could
> > keep the xattr contents exactly the same, just changing the name.
> > So userspace could just get and set security.capability; if you are
> > in a non-init user_ns, if security.capability is set then you cannot
> > set it; if security.capability is not set, then the kernel writes
> > security.nscapability instead and returns success.
> >
> > I don't like magic, but this might be just straightforward enough
> > to not be offensive. Thoughts?
>
> Yeah, I think it might be better to have the magic in this case, since
> it seems weird to just reject setxattr if a tool didn't realize it was
> in a namespace. I'm not sure -- it is also nice to have an explicit
> API here.
>
> I would defer to Eric or Michael on that. I keep going back and forth,
> though I suspect it's probably best to do what you already have
> (explicit API).
>
> -Kees
>
> >
> >> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> > !capable(CAP_SYS_ADMIN))
> >> > --
> >> > 1.7.9.5
> >> >
> >>
> >>
> >>
> >> --
> >> Kees Cook
> >> Chrome OS & Brillo Security
> >> _______________________________________________
> >> Containers mailing list
> >> [email protected]
> >> https://lists.linuxfoundation.org/mailman/listinfo/containers
>
>
>
> --
> Kees Cook
> Chrome OS & Brillo Security
On Tue, Apr 26, 2016 at 03:39:54PM -0700, Kees Cook wrote:
> On Tue, Apr 26, 2016 at 3:26 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Kees Cook ([email protected]):
> >> On Fri, Apr 22, 2016 at 10:26 AM, <[email protected]> wrote:
> >> > From: Serge Hallyn <[email protected]>
> >> >
> >> > This can only be set by root in his own namespace, and will
> >> > only be respected by namespaces with that same root kuid
> >> > mapped as root, or namespaces descended from it.
> >> >
> >> > This allows a simple setxattr to work, allows tar/untar to
> >> > work, and allows us to tar in one namespace and untar in
> >> > another while preserving the capability, without risking
> >> > leaking privilege into a parent namespace.
> >>
> >> The concept seems sensible to me. Various notes below...
> >>
> >> >
> >> > Signed-off-by: Serge Hallyn <[email protected]>
> >> > ---
> >> > include/linux/capability.h | 5 ++-
> >> > include/uapi/linux/capability.h | 18 ++++++++
> >> > include/uapi/linux/xattr.h | 3 ++
> >> > security/commoncap.c | 91 +++++++++++++++++++++++++++++++++++++--
> >> > 4 files changed, 112 insertions(+), 5 deletions(-)
> >> >
> >> > diff --git a/include/linux/capability.h b/include/linux/capability.h
> >> > index 00690ff..cf533ff 100644
> >> > --- a/include/linux/capability.h
> >> > +++ b/include/linux/capability.h
> >> > @@ -13,7 +13,7 @@
> >> > #define _LINUX_CAPABILITY_H
> >> >
> >> > #include <uapi/linux/capability.h>
> >> > -
> >> > +#include <linux/uidgid.h>
> >> >
> >> > #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
> >> > #define _KERNEL_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_3
> >> > @@ -31,6 +31,9 @@ struct cpu_vfs_cap_data {
> >> > kernel_cap_t inheritable;
> >> > };
> >> >
> >> > +#define NS_CAPS_VERSION(x) (x & 0xFF)
> >> > +#define NS_CAPS_FLAGS(x) ((x >> 8) & 0xFF)
> >> > +
> >> > #define _USER_CAP_HEADER_SIZE (sizeof(struct __user_cap_header_struct))
> >> > #define _KERNEL_CAP_T_SIZE (sizeof(kernel_cap_t))
> >> >
> >> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> >> > index 12c37a1..f0b4a66 100644
> >> > --- a/include/uapi/linux/capability.h
> >> > +++ b/include/uapi/linux/capability.h
> >> > @@ -62,6 +62,9 @@ typedef struct __user_cap_data_struct {
> >> > #define VFS_CAP_U32_2 2
> >> > #define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >> >
> >> > +/* version number for security.nscapability xattrs hdr->hdr_info */
> >> > +#define VFS_NS_CAP_REVISION 1
> >> > +
> >> > #define XATTR_CAPS_SZ XATTR_CAPS_SZ_2
> >> > #define VFS_CAP_U32 VFS_CAP_U32_2
> >> > #define VFS_CAP_REVISION VFS_CAP_REVISION_2
> >> > @@ -74,6 +77,21 @@ struct vfs_cap_data {
> >> > } data[VFS_CAP_U32];
> >> > };
> >> >
> >> > +#define VFS_NS_CAP_EFFECTIVE 0x1
> >> > +/*
> >> > + * 32-bit hdr_info contains
> >> > + * 16 leftmost: reserved
> >> > + * next 8: flags (only VFS_NS_CAP_EFFECTIVE so far)
> >> > + * last 8: version
> >> > + */
> >> > +struct vfs_ns_cap_data {
> >> > + __le32 magic_etc;
> >> > + struct {
> >> > + __le32 permitted; /* Little endian */
> >> > + __le32 inheritable; /* Little endian */
> >> > + } data[VFS_CAP_U32];
> >> > +};
> >>
> >> This is identical to vfs_cap_data. Is there a reason not to re-use the
> >> existing one?
> >
> > Hm. I used to have them completely different. ATM the only difference
> > is what goes into the magic_etc, and that not very (different). So
> > yeah it probably should be re-used.
> >
> >> > +
> >> > #ifndef __KERNEL__
> >> >
> >> > /*
> >> > diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
> >> > index 1590c49..67c80ab 100644
> >> > --- a/include/uapi/linux/xattr.h
> >> > +++ b/include/uapi/linux/xattr.h
> >> > @@ -68,6 +68,9 @@
> >> > #define XATTR_CAPS_SUFFIX "capability"
> >> > #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
> >> >
> >> > +#define XATTR_NS_CAPS_SUFFIX "nscapability"
> >> > +#define XATTR_NAME_NS_CAPS XATTR_SECURITY_PREFIX XATTR_NS_CAPS_SUFFIX
> >> > +
> >> > #define XATTR_POSIX_ACL_ACCESS "posix_acl_access"
> >> > #define XATTR_NAME_POSIX_ACL_ACCESS XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_ACCESS
> >> > #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
> >>
> >> Are these documented anywhere in Documentation/ or in man pages? This
> >> seems like it'd need a man-page update too.
> >
> > Yeah, if we decide we're ok with this strategy I'll update those.
> >
> >> > diff --git a/security/commoncap.c b/security/commoncap.c
> >> > index 48071ed..8f3f34a 100644
> >> > --- a/security/commoncap.c
> >> > +++ b/security/commoncap.c
> >> > @@ -313,6 +313,10 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> > if (!inode->i_op->getxattr)
> >> > return 0;
> >> >
> >> > + error = inode->i_op->getxattr(dentry, XATTR_NAME_NS_CAPS, NULL, 0);
> >> > + if (error > 0)
> >> > + return 1;
> >> > +
> >> > error = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
> >> > if (error <= 0)
> >> > return 0;
> >>
> >> I think this might be more readable if the getxattr calls were
> >> standardized (one returns 1, the other returns 0, with inverted tests
> >> -- hard to read). IIUC, if either returns > 0, return 1, otherwise
> >> return 0.
> >
> > Hm. Yeah I can see where that would be confusing. I can change the flow
> > to make the checks the same.
> >
> >> > @@ -330,11 +334,17 @@ int cap_inode_need_killpriv(struct dentry *dentry)
> >> > int cap_inode_killpriv(struct dentry *dentry)
> >> > {
> >> > struct inode *inode = d_backing_inode(dentry);
> >> > + int ret1, ret2;
> >> >
> >> > if (!inode->i_op->removexattr)
> >> > return 0;
> >> >
> >> > - return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > + ret1 = inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
> >> > + ret2 = inode->i_op->removexattr(dentry, XATTR_NAME_NS_CAPS);
> >> > +
> >> > + if (ret1 != 0)
> >> > + return ret1;
> >> > + return ret2;
> >> > }
> >> >
> >> > /*
> >> > @@ -438,6 +448,65 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >> > return 0;
> >> > }
> >> >
> >> > +int get_vfs_ns_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps)
> >> > +{
> >> > + struct inode *inode = d_backing_inode(dentry);
> >> > + unsigned i;
> >> > + u32 magic_etc;
> >> > + ssize_t size;
> >> > + struct vfs_ns_cap_data nscap;
> >> > + bool foundroot = false;
> >> > + struct user_namespace *ns;
> >> > +
> >> > + memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> >> > +
> >> > + if (!inode || !inode->i_op->getxattr)
> >> > + return -ENODATA;
> >> > +
> >> > + /* verify that current or ancestor userns root owns this file */
> >> > + for (ns = current_user_ns(); ; ns = ns->parent) {
> >> > + if (from_kuid(ns, dentry->d_inode->i_uid) == 0) {
> >> > + foundroot = true;
> >> > + break;
> >> > + }
> >> > + if (ns == &init_user_ns)
> >> > + break;
> >> > + }
> >> > + if (!foundroot)
> >> > + return -ENODATA;
> >> > +
> >> > + size = inode->i_op->getxattr((struct dentry *)dentry, XATTR_NAME_NS_CAPS,
> >> > + &nscap, sizeof(nscap));
> >> > + if (size == -ENODATA || size == -EOPNOTSUPP)
> >> > + /* no data, that's ok */
> >> > + return -ENODATA;
> >> > + if (size < 0)
> >> > + return size;
> >> > + if (size != sizeof(nscap))
> >> > + return -EINVAL;
> >> > +
> >> > + magic_etc = le32_to_cpu(nscap.magic_etc);
> >> > +
> >> > + if (NS_CAPS_VERSION(magic_etc) != VFS_NS_CAP_REVISION)
> >> > + return -EINVAL;
> >> > +
> >> > + cpu_caps->magic_etc = VFS_CAP_REVISION_2;
> >> > + if (NS_CAPS_FLAGS(magic_etc) & VFS_NS_CAP_EFFECTIVE)
> >> > + cpu_caps->magic_etc |= VFS_CAP_FLAGS_EFFECTIVE;
> >> > + /* copy the entry */
> >> > + CAP_FOR_EACH_U32(i) {
> >> > + if (i >= VFS_CAP_U32_2)
> >> > + break;
> >> > + cpu_caps->permitted.cap[i] = le32_to_cpu(nscap.data[i].permitted);
> >> > + cpu_caps->inheritable.cap[i] = le32_to_cpu(nscap.data[i].inheritable);
> >> > + }
> >> > +
> >> > + cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > + cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >> > +
> >> > + return 0;
> >> > +}
> >> > +
> >> > /*
> >> > * Attempt to get the on-exec apply capability sets for an executable file from
> >> > * its xattrs and, if present, apply them to the proposed credentials being
> >> > @@ -456,11 +525,13 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >> > if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
> >> > return 0;
> >> >
> >> > - rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > + rc = get_vfs_ns_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >> > + if (rc == -ENODATA)
> >> > + rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >>
> >> So nscaps overrides a "regular" file cap? That might seem worth
> >> mentioning in the change log or somewhere.
> >
> > If it applies, yes. Now maybe that's not what we want. Maybe so long as
> > a regular one exists (which must have been set by the init_user_ns admin)
> > we should use it? I think that makes sense, what do you think?
>
> I struggle to see why the non-ns fscap should be honored... this is
> effectively fscap stacking, and there's no way that I see for the cap
> to leak "up" to init_user_ns. It only leaks up to nested user_nses...
> but they need to already be priv-equiv. Hmmm.
>
> I'm CC'ing Jann who like these mind-benders... :)
Thanks. :D
I think it's safe, and it might make sense for userspace when e.g.
bind-mounting /bin into a container or so.
Theoretically, it could be an issue if a setcap binary from the init
namespace behaves insecurely when executed in another namespace (because
it is less privileged than expected), in the worst case making it
possible for someone to gain capabilities inside the namespace - but
as long as userspace behaves sanely and does proper error checking, I
think this shouldn't happen.
> > (In either case agreed it should be clearly documented)
> >
> >> > if (rc < 0) {
> >> > if (rc == -EINVAL)
> >> > - printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> >> > - __func__, rc, bprm->filename);
> >> > + printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> >> > + bprm->filename);
> >> > else if (rc == -ENODATA)
> >> > rc = 0;
> >> > goto out;
> >> > @@ -662,6 +733,12 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >> > return 0;
> >> > }
> >> >
> >> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > + return -EPERM;
> >> > + return 0;
> >> > + }
> >> > +
> >> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> > !capable(CAP_SYS_ADMIN))
> >> > @@ -688,6 +765,12 @@ int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >> > return 0;
> >> > }
> >> >
> >> > + if (!strcmp(name, XATTR_NAME_NS_CAPS)) {
> >> > + if (!capable_wrt_inode_uidgid(dentry->d_inode, CAP_SETFCAP))
> >> > + return -EPERM;
> >> > + return 0;
> >> > + }
> >> > +
> >>
> >> This looks like userspace must knowingly be aware that it is in a
> >> namespace and to DTRT instead of it being translated by the kernel
> >> when setxattr is called under !init_user_ns?
> >
> > Yes - my libcap2 patch checks /proc/self/uid_map to decide that. If that
> > shows you are in init_user_ns then it uses security.capability, otherwise
> > it uses security.nscapability.
> >
> > I've occasionally considered having the xattr code do the quiet
> > substitution if need be.
> >
> > In fact, much of this structure comes from when I was still trying to
> > do multiple values per xattr. Given what we're doing here, we could
> > keep the xattr contents exactly the same, just changing the name.
> > So userspace could just get and set security.capability; if you are
> > in a non-init user_ns, if security.capability is set then you cannot
> > set it; if security.capability is not set, then the kernel writes
> > security.nscapability instead and returns success.
> >
> > I don't like magic, but this might be just straightforward enough
> > to not be offensive. Thoughts?
>
> Yeah, I think it might be better to have the magic in this case, since
> it seems weird to just reject setxattr if a tool didn't realize it was
> in a namespace. I'm not sure -- it is also nice to have an explicit
> API here.
>
> I would defer to Eric or Michael on that. I keep going back and forth,
> though I suspect it's probably best to do what you already have
> (explicit API).
>
> -Kees
>
> >
> >> > if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >> > sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >> > !capable(CAP_SYS_ADMIN))
> >> > --
> >> > 1.7.9.5
> >> >
> >>
> >>
> >>
> >> --
> >> Kees Cook
> >> Chrome OS & Brillo Security
> >> _______________________________________________
> >> Containers mailing list
> >> [email protected]
> >> https://lists.linuxfoundation.org/mailman/listinfo/containers
>
>
>
> --
> Kees Cook
> Chrome OS & Brillo Security