2023-11-29 21:51:13

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 00/16] fs: use type-safe uid representation for filesystem capabilities

This series converts filesystem capabilities from passing around raw
xattr data to using a kernel-internal representation with type safe
uids, similar to the conversion done previously for posix ACLs.
Currently fscaps representations in the kernel have two different
instances of unclear or confused types:

- fscaps are generally passed around in the raw xattr form, with the
rootid sometimes containing the user uid value and at other times
containing the filesystem value.

- The existing kernel-internal representation of fscaps,
cpu_vfs_cap_data, uses the kuid_t type, but the value stored is
actually a vfsuid.

This series eliminates this confusion by converting the xattr data to
the kernel representation near the userspace and filesystem boundaries,
using the kernel representation within the vfs and commoncap code. The
internal representation is renamed to vfs_caps to reflect this broader
use, and the rootid is changed to a vfsuid_t to correctly identify the
type of uid which it contains.

New vfs interfaces are added to allow for getting and setting fscaps
using the kernel representation. This requires the addition of new inode
operations to allow overlayfs to handle fscaps properly; all other
filesystems fall back to a generic implementation. The top-level vfs
xattr interfaces will now reject fscaps xattrs, though the lower-level
interfaces continue to accept them for reading and writing the raw xattr
data.

The existing xattr security hooks can continue to be used for fscaps.
There is some awkwardness here, as EVM requires the on-disk fscaps data
to compare with any existing on-disk value. Security checks need to
happen before calling into filesystem inode operations, when the fscaps
are still in the kernel-internal format, so an extra conversion to the
on-disk format is necessary for EVM's setxattr checks.

The remainder of the changes are preparatory work and addition of
helpers for converting between the xattr and kernel fscaps
representation.

I have tested this code with xfstests, ltp, libcap2, and libcap-ng with
no regressions found.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
Seth Forshee (DigitalOcean) (16):
mnt_idmapping: split out core vfs[ug]id_t definitions into vfsid.h
mnt_idmapping: include cred.h
capability: rename cpu_vfs_cap_data to vfs_caps
capability: use vfsuid_t for vfs_caps rootids
capability: provide helpers for converting between xattrs and vfs_caps
capability: provide a helper for converting vfs_caps to xattr for userspace
fs: add inode operations to get/set/remove fscaps
fs: add vfs_get_fscaps()
fs: add vfs_set_fscaps()
fs: add vfs_remove_fscaps()
ovl: add fscaps handlers
ovl: use vfs_{get,set}_fscaps() for copy-up
fs: use vfs interfaces for capabilities xattrs
commoncap: remove cap_inode_getsecurity()
commoncap: use vfs fscaps interfaces for killpriv checks
vfs: return -EOPNOTSUPP for fscaps from vfs_*xattr()

MAINTAINERS | 1 +
fs/overlayfs/copy_up.c | 72 +++---
fs/overlayfs/dir.c | 3 +
fs/overlayfs/inode.c | 84 +++++++
fs/overlayfs/overlayfs.h | 6 +
fs/xattr.c | 286 ++++++++++++++++++++++-
include/linux/capability.h | 23 +-
include/linux/fs.h | 13 ++
include/linux/mnt_idmapping.h | 67 +-----
include/linux/security.h | 5 +-
include/linux/vfsid.h | 74 ++++++
kernel/auditsc.c | 9 +-
security/commoncap.c | 519 ++++++++++++++++++++++--------------------
13 files changed, 802 insertions(+), 360 deletions(-)
---
base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
change-id: 20230512-idmap-fscap-refactor-63b61fa0a36f

Best regards,
--
Seth Forshee (DigitalOcean) <[email protected]>


2023-11-29 21:51:18

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 02/16] mnt_idmapping: include cred.h

mnt_idmapping.h uses declarations from cred.h, so it should include that
file directly.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/mnt_idmapping.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/mnt_idmapping.h b/include/linux/mnt_idmapping.h
index 8b5e00ee6472..65e5d0c32fde 100644
--- a/include/linux/mnt_idmapping.h
+++ b/include/linux/mnt_idmapping.h
@@ -5,6 +5,7 @@
#include <linux/types.h>
#include <linux/uidgid.h>
#include <linux/vfsid.h>
+#include <linux/cred.h>

struct mnt_idmap;
struct user_namespace;

--
2.43.0

2023-11-29 21:51:20

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 01/16] mnt_idmapping: split out core vfs[ug]id_t definitions into vfsid.h

The rootid member of cpu_vfs_cap_data is a kuid_t, but it should be a
vfsuid_t as the id stored there is mapped into the mount idmapping. It's
currently impossible to use vfsuid_t within cred.h though as it is
defined in mnt_idmapping.h, which uses definitions from cred.h.

Split out the core vfsid type definitions into a separate file which can
be included from cred.h.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
MAINTAINERS | 1 +
include/linux/mnt_idmapping.h | 66 +-------------------------------------
include/linux/vfsid.h | 74 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 76 insertions(+), 65 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 012df8ccf34e..8c73081d3dcc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10260,6 +10260,7 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping.git
F: Documentation/filesystems/idmappings.rst
F: include/linux/mnt_idmapping.*
+F: include/linux/vfsid.h
F: tools/testing/selftests/mount_setattr/

IDT VersaClock 5 CLOCK DRIVER
diff --git a/include/linux/mnt_idmapping.h b/include/linux/mnt_idmapping.h
index b8da2db4ecd2..8b5e00ee6472 100644
--- a/include/linux/mnt_idmapping.h
+++ b/include/linux/mnt_idmapping.h
@@ -4,6 +4,7 @@

#include <linux/types.h>
#include <linux/uidgid.h>
+#include <linux/vfsid.h>

struct mnt_idmap;
struct user_namespace;
@@ -11,61 +12,6 @@ struct user_namespace;
extern struct mnt_idmap nop_mnt_idmap;
extern struct user_namespace init_user_ns;

-typedef struct {
- uid_t val;
-} vfsuid_t;
-
-typedef struct {
- gid_t val;
-} vfsgid_t;
-
-static_assert(sizeof(vfsuid_t) == sizeof(kuid_t));
-static_assert(sizeof(vfsgid_t) == sizeof(kgid_t));
-static_assert(offsetof(vfsuid_t, val) == offsetof(kuid_t, val));
-static_assert(offsetof(vfsgid_t, val) == offsetof(kgid_t, val));
-
-#ifdef CONFIG_MULTIUSER
-static inline uid_t __vfsuid_val(vfsuid_t uid)
-{
- return uid.val;
-}
-
-static inline gid_t __vfsgid_val(vfsgid_t gid)
-{
- return gid.val;
-}
-#else
-static inline uid_t __vfsuid_val(vfsuid_t uid)
-{
- return 0;
-}
-
-static inline gid_t __vfsgid_val(vfsgid_t gid)
-{
- return 0;
-}
-#endif
-
-static inline bool vfsuid_valid(vfsuid_t uid)
-{
- return __vfsuid_val(uid) != (uid_t)-1;
-}
-
-static inline bool vfsgid_valid(vfsgid_t gid)
-{
- return __vfsgid_val(gid) != (gid_t)-1;
-}
-
-static inline bool vfsuid_eq(vfsuid_t left, vfsuid_t right)
-{
- return vfsuid_valid(left) && __vfsuid_val(left) == __vfsuid_val(right);
-}
-
-static inline bool vfsgid_eq(vfsgid_t left, vfsgid_t right)
-{
- return vfsgid_valid(left) && __vfsgid_val(left) == __vfsgid_val(right);
-}
-
/**
* vfsuid_eq_kuid - check whether kuid and vfsuid have the same value
* @vfsuid: the vfsuid to compare
@@ -96,16 +42,6 @@ static inline bool vfsgid_eq_kgid(vfsgid_t vfsgid, kgid_t kgid)
return vfsgid_valid(vfsgid) && __vfsgid_val(vfsgid) == __kgid_val(kgid);
}

-/*
- * vfs{g,u}ids are created from k{g,u}ids.
- * We don't allow them to be created from regular {u,g}id.
- */
-#define VFSUIDT_INIT(val) (vfsuid_t){ __kuid_val(val) }
-#define VFSGIDT_INIT(val) (vfsgid_t){ __kgid_val(val) }
-
-#define INVALID_VFSUID VFSUIDT_INIT(INVALID_UID)
-#define INVALID_VFSGID VFSGIDT_INIT(INVALID_GID)
-
/*
* Allow a vfs{g,u}id to be used as a k{g,u}id where we want to compare
* whether the mapped value is identical to value of a k{g,u}id.
diff --git a/include/linux/vfsid.h b/include/linux/vfsid.h
new file mode 100644
index 000000000000..90262944b042
--- /dev/null
+++ b/include/linux/vfsid.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MNT_VFSID_H
+#define _LINUX_MNT_VFSID_H
+
+#include <linux/types.h>
+#include <linux/uidgid.h>
+#include <linux/build_bug.h>
+
+typedef struct {
+ uid_t val;
+} vfsuid_t;
+
+typedef struct {
+ gid_t val;
+} vfsgid_t;
+
+static_assert(sizeof(vfsuid_t) == sizeof(kuid_t));
+static_assert(sizeof(vfsgid_t) == sizeof(kgid_t));
+static_assert(offsetof(vfsuid_t, val) == offsetof(kuid_t, val));
+static_assert(offsetof(vfsgid_t, val) == offsetof(kgid_t, val));
+
+#ifdef CONFIG_MULTIUSER
+static inline uid_t __vfsuid_val(vfsuid_t uid)
+{
+ return uid.val;
+}
+
+static inline gid_t __vfsgid_val(vfsgid_t gid)
+{
+ return gid.val;
+}
+#else
+static inline uid_t __vfsuid_val(vfsuid_t uid)
+{
+ return 0;
+}
+
+static inline gid_t __vfsgid_val(vfsgid_t gid)
+{
+ return 0;
+}
+#endif
+
+static inline bool vfsuid_valid(vfsuid_t uid)
+{
+ return __vfsuid_val(uid) != (uid_t)-1;
+}
+
+static inline bool vfsgid_valid(vfsgid_t gid)
+{
+ return __vfsgid_val(gid) != (gid_t)-1;
+}
+
+static inline bool vfsuid_eq(vfsuid_t left, vfsuid_t right)
+{
+ return vfsuid_valid(left) && __vfsuid_val(left) == __vfsuid_val(right);
+}
+
+static inline bool vfsgid_eq(vfsgid_t left, vfsgid_t right)
+{
+ return vfsgid_valid(left) && __vfsgid_val(left) == __vfsgid_val(right);
+}
+
+/*
+ * vfs{g,u}ids are created from k{g,u}ids.
+ * We don't allow them to be created from regular {u,g}id.
+ */
+#define VFSUIDT_INIT(val) (vfsuid_t){ __kuid_val(val) }
+#define VFSGIDT_INIT(val) (vfsgid_t){ __kgid_val(val) }
+
+#define INVALID_VFSUID VFSUIDT_INIT(INVALID_UID)
+#define INVALID_VFSGID VFSGIDT_INIT(INVALID_GID)
+
+#endif /* _LINUX_MNT_VFSID_H */

--
2.43.0

2023-11-29 21:51:27

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 04/16] capability: use vfsuid_t for vfs_caps rootids

The rootid is a kuid_t, but it contains an id which maped into a mount
idmapping, so it is really a vfsuid. This is confusing and creates
potential for misuse of the value, so change it to vfsuid_t.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/capability.h | 3 ++-
kernel/auditsc.c | 5 +++--
security/commoncap.c | 2 +-
3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c24477e660fc..eb46d346bbbc 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -16,6 +16,7 @@
#include <uapi/linux/capability.h>
#include <linux/uidgid.h>
#include <linux/bits.h>
+#include <linux/vfsid.h>

#define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3

@@ -26,7 +27,7 @@ typedef struct { u64 val; } kernel_cap_t;
/* same as vfs_ns_cap_data but in cpu endian and always filled completely */
struct vfs_caps {
__u32 magic_etc;
- kuid_t rootid;
+ vfsuid_t rootid;
kernel_cap_t permitted;
kernel_cap_t inheritable;
};
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 783d0bf69ca5..65691450b080 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -65,6 +65,7 @@
#include <uapi/linux/netfilter/nf_tables.h>
#include <uapi/linux/openat2.h> // struct open_how
#include <uapi/linux/fanotify.h>
+#include <linux/mnt_idmapping.h>

#include "audit.h"

@@ -2260,7 +2261,7 @@ static inline int audit_copy_fcaps(struct audit_names *name,
name->fcap.permitted = caps.permitted;
name->fcap.inheritable = caps.inheritable;
name->fcap.fE = !!(caps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE);
- name->fcap.rootid = caps.rootid;
+ name->fcap.rootid = AS_KUIDT(caps.rootid);
name->fcap_ver = (caps.magic_etc & VFS_CAP_REVISION_MASK) >>
VFS_CAP_REVISION_SHIFT;

@@ -2816,7 +2817,7 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm,
ax->fcap.permitted = vcaps.permitted;
ax->fcap.inheritable = vcaps.inheritable;
ax->fcap.fE = !!(vcaps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE);
- ax->fcap.rootid = vcaps.rootid;
+ ax->fcap.rootid = AS_KUIDT(vcaps.rootid);
ax->fcap_ver = (vcaps.magic_etc & VFS_CAP_REVISION_MASK) >> VFS_CAP_REVISION_SHIFT;

ax->old_pcap.permitted = old->cap_permitted;
diff --git a/security/commoncap.c b/security/commoncap.c
index cf130d81b8b4..3d045d377e5e 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -710,7 +710,7 @@ int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
cpu_caps->permitted.val &= CAP_VALID_MASK;
cpu_caps->inheritable.val &= CAP_VALID_MASK;

- cpu_caps->rootid = vfsuid_into_kuid(rootvfsuid);
+ cpu_caps->rootid = rootvfsuid;

return 0;
}

--
2.43.0

2023-11-29 21:51:35

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 08/16] fs: add vfs_get_fscaps()

Provide a type-safe interface for retrieving filesystem capabilities and
a generic implementation suitable for most filesystems. Also add an
internal interface, __vfs_get_fscaps(), which skips security checks for
later use from the capability code.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/xattr.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 4 ++++
2 files changed, 70 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 09d927603433..3abaf9bef0a5 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -181,6 +181,72 @@ xattr_supports_user_prefix(struct inode *inode)
}
EXPORT_SYMBOL(xattr_supports_user_prefix);

+static int generic_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps)
+{
+ struct inode *inode = d_inode(dentry);
+ struct vfs_ns_cap_data *nscaps = NULL;
+ int ret;
+
+ ret = (int)vfs_getxattr_alloc(idmap, dentry, XATTR_NAME_CAPS,
+ (char **)&nscaps, 0, GFP_NOFS);
+
+ if (ret >= 0)
+ ret = vfs_caps_from_xattr(idmap, i_user_ns(inode), caps, nscaps, ret);
+
+ kfree(nscaps);
+ return ret;
+}
+
+/**
+ * __vfs_get_fscaps - get filesystem capabilities without security checks
+ * @idmap: idmap of the mount the inode was found from
+ * @dentry: the dentry from which to get filesystem capabilities
+ * @caps: storage in which to return the filesystem capabilities
+ *
+ * This function gets the filesystem capabilities for the dentry and returns
+ * them in @caps. It does not perform security checks.
+ *
+ * Return: 0 on success, a negative errno on error.
+ */
+int __vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps)
+{
+ struct inode *inode = d_inode(dentry);
+
+ if (inode->i_op->get_fscaps)
+ return inode->i_op->get_fscaps(idmap, dentry, caps);
+ return generic_get_fscaps(idmap, dentry, caps);
+}
+
+/**
+ * vfs_get_fscaps - get filesystem capabilities
+ * @idmap: idmap of the mount the inode was found from
+ * @dentry: the dentry from which to get filesystem capabilities
+ * @caps: storage in which to return the filesystem capabilities
+ *
+ * This function gets the filesystem capabilities for the dentry and returns
+ * them in @caps.
+ *
+ * Return: 0 on success, a negative errno on error.
+ */
+int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps)
+{
+ int error;
+
+ /*
+ * The VFS has no restrictions on reading security.* xattrs, so
+ * xattr_permission() isn't needed. Only LSMs get a say.
+ */
+ error = security_inode_getxattr(dentry, XATTR_NAME_CAPS);
+ if (error)
+ return error;
+
+ return __vfs_get_fscaps(idmap, dentry, caps);
+}
+EXPORT_SYMBOL(vfs_get_fscaps);
+
int
__vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *inode, const char *name, const void *value,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a0a77f67b999..e25b39e4017a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2055,6 +2055,10 @@ extern int vfs_dedupe_file_range(struct file *file,
extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
struct file *dst_file, loff_t dst_pos,
loff_t len, unsigned int remap_flags);
+extern int __vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps);
+extern int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps);

enum freeze_holder {
FREEZE_HOLDER_KERNEL = (1U << 0),

--
2.43.0

2023-11-29 21:51:37

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 05/16] capability: provide helpers for converting between xattrs and vfs_caps

To pass around vfs_caps instead of raw xattr data we will need to
convert between the two representations near userspace and disk
boundaries. We already convert xattrs from disks to vfs_caps, so move
that code into a helper, and change get_vfs_caps_from_disk() to use the
helper.

When converting vfs_caps to xattrs we have different considerations
depending on the destination of the xattr data. For xattrs which will be
written to disk we need to reject the xattr if the rootid does not map
into the filesystem's user namespace, whereas xattrs read by userspace
may need to undergo a conversion from v3 to v2 format when the rootid
does not map. So this helper is split into an internal and an external
interface. The internal interface does not return an error if the rootid
has no mapping in the target user namespace and will be used for
conversions targeting userspace. The external interface returns
EOVERFLOW if the rootid has no mapping and will be used for all other
conversions.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/capability.h | 10 ++
security/commoncap.c | 227 +++++++++++++++++++++++++++++++++++----------
2 files changed, 186 insertions(+), 51 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index eb46d346bbbc..cdd7d2d8855e 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -209,6 +209,16 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
ns_capable(ns, CAP_SYS_ADMIN);
}

+/* helpers to convert between xattr and in-kernel representations */
+int vfs_caps_from_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *src_userns,
+ struct vfs_caps *vfs_caps,
+ const void *data, int size);
+int vfs_caps_to_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *dest_userns,
+ const struct vfs_caps *vfs_caps,
+ void *data, int size);
+
/* audit system wants to get cap info from files as well */
int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
const struct dentry *dentry,
diff --git a/security/commoncap.c b/security/commoncap.c
index 3d045d377e5e..ef37966f3522 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -618,54 +618,41 @@ static inline int bprm_caps_from_vfs_caps(struct vfs_caps *caps,
}

/**
- * get_vfs_caps_from_disk - retrieve vfs caps from disk
+ * vfs_caps_from_xattr - convert raw caps xattr data to vfs_caps
*
- * @idmap: idmap of the mount the inode was found from
- * @dentry: dentry from which @inode is retrieved
- * @cpu_caps: vfs capabilities
+ * @idmap: idmap of the mount the inode was found from
+ * @src_userns: user namespace for ids in xattr data
+ * @vfs_caps: destination buffer for vfs_caps data
+ * @data: rax xattr caps data
+ * @size: size of xattr data
*
- * Extract the on-exec-apply capability sets for an executable file.
+ * Converts a raw security.capability xattr into the kernel-internal
+ * capabilities format.
*
- * If the inode has been found through an idmapped mount the idmap of
- * the vfsmount must be passed through @idmap. This function will then
- * take care to map the inode according to @idmap before checking
- * permissions. On non-idmapped mounts or if permission checking is to be
- * performed on the raw inode simply pass @nop_mnt_idmap.
+ * If the xattr is being read or written through an idmapped mount the
+ * idmap of the vfsmount must be passed through @idmap. This function
+ * will then take care to map the rootid according to @idmap.
+ *
+ * Return: On success, return 0; on error, return < 0.
*/
-int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
- const struct dentry *dentry,
- struct vfs_caps *cpu_caps)
+int vfs_caps_from_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *src_userns,
+ struct vfs_caps *vfs_caps,
+ const void *data, int size)
{
- struct inode *inode = d_backing_inode(dentry);
__u32 magic_etc;
- int size;
- struct vfs_ns_cap_data data, *nscaps = &data;
- struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
+ const struct vfs_ns_cap_data *ns_caps = data;
+ struct vfs_cap_data *caps = (struct vfs_cap_data *)ns_caps;
kuid_t rootkuid;
- vfsuid_t rootvfsuid;
- struct user_namespace *fs_ns;
-
- memset(cpu_caps, 0, sizeof(struct vfs_caps));
-
- if (!inode)
- return -ENODATA;

- fs_ns = inode->i_sb->s_user_ns;
- size = __vfs_getxattr((struct dentry *)dentry, inode,
- XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
- if (size == -ENODATA || size == -EOPNOTSUPP)
- /* no data, that's ok */
- return -ENODATA;
-
- if (size < 0)
- return size;
+ memset(vfs_caps, 0, sizeof(*vfs_caps));

if (size < sizeof(magic_etc))
return -EINVAL;

- cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
+ vfs_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);

- rootkuid = make_kuid(fs_ns, 0);
+ rootkuid = make_kuid(src_userns, 0);
switch (magic_etc & VFS_CAP_REVISION_MASK) {
case VFS_CAP_REVISION_1:
if (size != XATTR_CAPS_SZ_1)
@@ -678,39 +665,177 @@ int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
case VFS_CAP_REVISION_3:
if (size != XATTR_CAPS_SZ_3)
return -EINVAL;
- rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
+ rootkuid = make_kuid(src_userns, le32_to_cpu(ns_caps->rootid));
break;

default:
return -EINVAL;
}

- rootvfsuid = make_vfsuid(idmap, fs_ns, rootkuid);
- if (!vfsuid_valid(rootvfsuid))
- return -ENODATA;
+ vfs_caps->rootid = make_vfsuid(idmap, src_userns, rootkuid);
+ if (!vfsuid_valid(vfs_caps->rootid))
+ return -EOVERFLOW;

- /* Limit the caps to the mounter of the filesystem
- * or the more limited uid specified in the xattr.
+ vfs_caps->permitted.val = le32_to_cpu(caps->data[0].permitted);
+ vfs_caps->inheritable.val = le32_to_cpu(caps->data[0].inheritable);
+
+ /*
+ * Rev1 had just a single 32-bit word, later expanded
+ * to a second one for the high bits
*/
- if (!rootid_owns_currentns(rootvfsuid))
- return -ENODATA;
+ if ((magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
+ vfs_caps->permitted.val += (u64)le32_to_cpu(caps->data[1].permitted) << 32;
+ vfs_caps->inheritable.val += (u64)le32_to_cpu(caps->data[1].inheritable) << 32;
+ }
+
+ vfs_caps->permitted.val &= CAP_VALID_MASK;
+ vfs_caps->inheritable.val &= CAP_VALID_MASK;
+
+ return 0;
+}
+
+/*
+ * Inner implementation of vfs_caps_to_xattr() which does not return an
+ * error if the rootid does not map into @dest_userns.
+ */
+static int __vfs_caps_to_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *dest_userns,
+ const struct vfs_caps *vfs_caps,
+ void *data, int size)
+{
+ struct vfs_ns_cap_data *ns_caps = data;
+ struct vfs_cap_data *caps = (struct vfs_cap_data *)ns_caps;
+ kuid_t rootkuid;
+ uid_t rootid;
+
+ memset(ns_caps, 0, size);
+
+ rootid = 0;
+ switch (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) {
+ case VFS_CAP_REVISION_1:
+ if (size < XATTR_CAPS_SZ_1)
+ return -EINVAL;
+ size = XATTR_CAPS_SZ_1;
+ break;
+ case VFS_CAP_REVISION_2:
+ if (size < XATTR_CAPS_SZ_2)
+ return -EINVAL;
+ size = XATTR_CAPS_SZ_2;
+ break;
+ case VFS_CAP_REVISION_3:
+ if (size < XATTR_CAPS_SZ_3)
+ return -EINVAL;
+ size = XATTR_CAPS_SZ_3;
+ rootkuid = from_vfsuid(idmap, dest_userns, vfs_caps->rootid);
+ rootid = from_kuid(dest_userns, rootkuid);
+ ns_caps->rootid = cpu_to_le32(rootid);
+ break;

- cpu_caps->permitted.val = le32_to_cpu(caps->data[0].permitted);
- cpu_caps->inheritable.val = le32_to_cpu(caps->data[0].inheritable);
+ default:
+ return -EINVAL;
+ }
+
+ caps->magic_etc = cpu_to_le32(vfs_caps->magic_etc);
+
+ caps->data[0].permitted = cpu_to_le32(lower_32_bits(vfs_caps->permitted.val));
+ caps->data[0].inheritable = cpu_to_le32(lower_32_bits(vfs_caps->inheritable.val));

/*
* Rev1 had just a single 32-bit word, later expanded
* to a second one for the high bits
*/
- if ((magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
- cpu_caps->permitted.val += (u64)le32_to_cpu(caps->data[1].permitted) << 32;
- cpu_caps->inheritable.val += (u64)le32_to_cpu(caps->data[1].inheritable) << 32;
+ if ((vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
+ caps->data[1].permitted =
+ cpu_to_le32(upper_32_bits(vfs_caps->permitted.val));
+ caps->data[1].inheritable =
+ cpu_to_le32(upper_32_bits(vfs_caps->inheritable.val));
}

- cpu_caps->permitted.val &= CAP_VALID_MASK;
- cpu_caps->inheritable.val &= CAP_VALID_MASK;
+ return size;
+}
+
+
+/**
+ * vfs_caps_to_xattr - convert vfs_caps to raw caps xattr data
+ *
+ * @idmap: idmap of the mount the inode was found from
+ * @dest_userns: user namespace for ids in xattr data
+ * @vfs_caps: source vfs_caps data
+ * @data: destination buffer for rax xattr caps data
+ * @size: size of the @data buffer
+ *
+ * Converts a kernel-interrnal capability into the raw security.capability
+ * xattr format.
+ *
+ * If the xattr is being read or written through an idmapped mount the
+ * idmap of the vfsmount must be passed through @idmap. This function
+ * will then take care to map the rootid according to @idmap.
+ *
+ * Return: On success, return 0; on error, return < 0.
+ */
+int vfs_caps_to_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *dest_userns,
+ const struct vfs_caps *vfs_caps,
+ void *data, int size)
+{
+ struct vfs_ns_cap_data *caps = data;
+ int ret;
+
+ ret = __vfs_caps_to_xattr(idmap, dest_userns, vfs_caps, data, size);
+ if (ret > 0 &&
+ (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3 &&
+ le32_to_cpu(caps->rootid) == (uid_t)-1)
+ return -EOVERFLOW;
+ return ret;
+}
+
+/**
+ * get_vfs_caps_from_disk - retrieve vfs caps from disk
+ *
+ * @idmap: idmap of the mount the inode was found from
+ * @dentry: dentry from which @inode is retrieved
+ * @cpu_caps: vfs capabilities
+ *
+ * Extract the on-exec-apply capability sets for an executable file.
+ *
+ * If the inode has been found through an idmapped mount the idmap of
+ * the vfsmount must be passed through @idmap. This function will then
+ * take care to map the inode according to @idmap before checking
+ * permissions. On non-idmapped mounts or if permission checking is to be
+ * performed on the raw inode simply pass @nop_mnt_idmap.
+ */
+int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
+ const struct dentry *dentry,
+ struct vfs_caps *cpu_caps)
+{
+ struct inode *inode = d_backing_inode(dentry);
+ int size, ret;
+ struct vfs_ns_cap_data data, *nscaps = &data;
+
+ if (!inode)
+ return -ENODATA;

- cpu_caps->rootid = rootvfsuid;
+ size = __vfs_getxattr((struct dentry *)dentry, inode,
+ XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
+ if (size == -ENODATA || size == -EOPNOTSUPP)
+ /* no data, that's ok */
+ return -ENODATA;
+
+ if (size < 0)
+ return size;
+
+ ret = vfs_caps_from_xattr(idmap, inode->i_sb->s_user_ns,
+ cpu_caps, nscaps, size);
+ if (ret == -EOVERFLOW)
+ return -ENODATA;
+ if (ret)
+ return ret;
+
+ /* Limit the caps to the mounter of the filesystem
+ * or the more limited uid specified in the xattr.
+ */
+ if (!rootid_owns_currentns(cpu_caps->rootid))
+ return -ENODATA;

return 0;
}

--
2.43.0

2023-11-29 21:51:42

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 12/16] ovl: use vfs_{get,set}_fscaps() for copy-up

Using vfs_{get,set}xattr() for fscaps will be blocked in a future
commit, so convert ovl to use the new interfaces. Also remove the now
unused ovl_getxattr_value().

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/overlayfs/copy_up.c | 72 ++++++++++++++++++++++++++------------------------
1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 4382881b0709..b43af5ce4b21 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -73,6 +73,23 @@ static int ovl_copy_acl(struct ovl_fs *ofs, const struct path *path,
return err;
}

+static int ovl_copy_fscaps(struct ovl_fs *ofs, const struct path *oldpath,
+ struct dentry *new)
+{
+ struct vfs_caps capability;
+ int err;
+
+ err = vfs_get_fscaps(mnt_idmap(oldpath->mnt), oldpath->dentry,
+ &capability);
+ if (err) {
+ if (err == -ENODATA || err == -EOPNOTSUPP)
+ return 0;
+ return err;
+ }
+
+ return vfs_set_fscaps(ovl_upper_mnt_idmap(ofs), new, &capability, 0);
+}
+
int ovl_copy_xattr(struct super_block *sb, const struct path *oldpath, struct dentry *new)
{
struct dentry *old = oldpath->dentry;
@@ -130,6 +147,14 @@ int ovl_copy_xattr(struct super_block *sb, const struct path *oldpath, struct de
break;
}

+ if (!strcmp(name, XATTR_NAME_CAPS)) {
+ error = ovl_copy_fscaps(OVL_FS(sb), oldpath, new);
+ if (!error)
+ continue;
+ /* fs capabilities must be copied */
+ break;
+ }
+
retry:
size = ovl_do_getxattr(oldpath, name, value, value_size);
if (size == -ERANGE)
@@ -1006,61 +1031,40 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
return true;
}

-static ssize_t ovl_getxattr_value(const struct path *path, char *name, char **value)
-{
- ssize_t res;
- char *buf;
-
- res = ovl_do_getxattr(path, name, NULL, 0);
- if (res == -ENODATA || res == -EOPNOTSUPP)
- res = 0;
-
- if (res > 0) {
- buf = kzalloc(res, GFP_KERNEL);
- if (!buf)
- return -ENOMEM;
-
- res = ovl_do_getxattr(path, name, buf, res);
- if (res < 0)
- kfree(buf);
- else
- *value = buf;
- }
- return res;
-}
-
/* Copy up data of an inode which was copied up metadata only in the past. */
static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
{
struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
struct path upperpath;
int err;
- char *capability = NULL;
- ssize_t cap_size;
+ struct vfs_caps capability;
+ bool has_capability = false;

ovl_path_upper(c->dentry, &upperpath);
if (WARN_ON(upperpath.dentry == NULL))
return -EIO;

if (c->stat.size) {
- err = cap_size = ovl_getxattr_value(&upperpath, XATTR_NAME_CAPS,
- &capability);
- if (cap_size < 0)
+ err = vfs_get_fscaps(mnt_idmap(upperpath.mnt), upperpath.dentry,
+ &capability);
+ if (!err)
+ has_capability = 1;
+ else if (err != -ENODATA && err != EOPNOTSUPP)
goto out;
}

err = ovl_copy_up_data(c, &upperpath);
if (err)
- goto out_free;
+ goto out;

/*
* Writing to upper file will clear security.capability xattr. We
* don't want that to happen for normal copy-up operation.
*/
ovl_start_write(c->dentry);
- if (capability) {
- err = ovl_do_setxattr(ofs, upperpath.dentry, XATTR_NAME_CAPS,
- capability, cap_size, 0);
+ if (has_capability) {
+ err = vfs_set_fscaps(mnt_idmap(upperpath.mnt), upperpath.dentry,
+ &capability, 0);
}
if (!err) {
err = ovl_removexattr(ofs, upperpath.dentry,
@@ -1068,13 +1072,11 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
}
ovl_end_write(c->dentry);
if (err)
- goto out_free;
+ goto out;

ovl_clear_flag(OVL_HAS_DIGEST, d_inode(c->dentry));
ovl_clear_flag(OVL_VERIFIED_DIGEST, d_inode(c->dentry));
ovl_set_upperdata(d_inode(c->dentry));
-out_free:
- kfree(capability);
out:
return err;
}

--
2.43.0

2023-11-29 21:51:44

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 16/16] vfs: return -EOPNOTSUPP for fscaps from vfs_*xattr()

Now that the new vfs-level interfaces are fully supported and all code
has been converted to use them, stop permitting use of the top-level vfs
xattr interfaces for capabilities xattrs. Unlike with ACLs we still need
to be able to work with fscaps xattrs using lower-level interfaces in a
handful of places, so only use of the top-level xattr interfaces is
restricted.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/xattr.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 372644b15457..4b779779ad8c 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -540,6 +540,9 @@ vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
const void *orig_value = value;
int error;

+ if (!strcmp(name, XATTR_NAME_CAPS))
+ return -EOPNOTSUPP;
+
retry_deleg:
inode_lock(inode);
error = __vfs_setxattr_locked(idmap, dentry, name, value, size,
@@ -655,6 +658,9 @@ vfs_getxattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *inode = dentry->d_inode;
int error;

+ if (!strcmp(name, XATTR_NAME_CAPS))
+ return -EOPNOTSUPP;
+
error = xattr_permission(idmap, inode, name, MAY_READ);
if (error)
return error;
@@ -794,6 +800,9 @@ vfs_removexattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *delegated_inode = NULL;
int error;

+ if (!strcmp(name, XATTR_NAME_CAPS))
+ return -EOPNOTSUPP;
+
retry_deleg:
inode_lock(inode);
error = __vfs_removexattr_locked(idmap, dentry,

--
2.43.0

2023-11-29 21:51:48

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 03/16] capability: rename cpu_vfs_cap_data to vfs_caps

vfs_caps is a more generic name which is better suited to the broader
use this struct will see in subsequent commits.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/capability.h | 4 ++--
kernel/auditsc.c | 4 ++--
security/commoncap.c | 8 ++++----
3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 0c356a517991..c24477e660fc 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -24,7 +24,7 @@ extern int file_caps_enabled;
typedef struct { u64 val; } kernel_cap_t;

/* same as vfs_ns_cap_data but in cpu endian and always filled completely */
-struct cpu_vfs_cap_data {
+struct vfs_caps {
__u32 magic_etc;
kuid_t rootid;
kernel_cap_t permitted;
@@ -211,7 +211,7 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
/* audit system wants to get cap info from files as well */
int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
const struct dentry *dentry,
- struct cpu_vfs_cap_data *cpu_caps);
+ struct vfs_caps *cpu_caps);

int cap_convert_nscap(struct mnt_idmap *idmap, struct dentry *dentry,
const void **ivalue, size_t size);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 6f0d6fb6523f..783d0bf69ca5 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2247,7 +2247,7 @@ void __audit_getname(struct filename *name)
static inline int audit_copy_fcaps(struct audit_names *name,
const struct dentry *dentry)
{
- struct cpu_vfs_cap_data caps;
+ struct vfs_caps caps;
int rc;

if (!dentry)
@@ -2800,7 +2800,7 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm,
{
struct audit_aux_data_bprm_fcaps *ax;
struct audit_context *context = audit_context();
- struct cpu_vfs_cap_data vcaps;
+ struct vfs_caps vcaps;

ax = kmalloc(sizeof(*ax), GFP_KERNEL);
if (!ax)
diff --git a/security/commoncap.c b/security/commoncap.c
index 8e8c630ce204..cf130d81b8b4 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -583,7 +583,7 @@ int cap_convert_nscap(struct mnt_idmap *idmap, struct dentry *dentry,
* Calculate the new process capability sets from the capability sets attached
* to a file.
*/
-static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
+static inline int bprm_caps_from_vfs_caps(struct vfs_caps *caps,
struct linux_binprm *bprm,
bool *effective,
bool *has_fcap)
@@ -634,7 +634,7 @@ static inline int bprm_caps_from_vfs_caps(struct cpu_vfs_cap_data *caps,
*/
int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
const struct dentry *dentry,
- struct cpu_vfs_cap_data *cpu_caps)
+ struct vfs_caps *cpu_caps)
{
struct inode *inode = d_backing_inode(dentry);
__u32 magic_etc;
@@ -645,7 +645,7 @@ int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
vfsuid_t rootvfsuid;
struct user_namespace *fs_ns;

- memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
+ memset(cpu_caps, 0, sizeof(struct vfs_caps));

if (!inode)
return -ENODATA;
@@ -724,7 +724,7 @@ static int get_file_caps(struct linux_binprm *bprm, const struct file *file,
bool *effective, bool *has_fcap)
{
int rc = 0;
- struct cpu_vfs_cap_data vcaps;
+ struct vfs_caps vcaps;

cap_clear(bprm->cred->cap_permitted);


--
2.43.0

2023-11-29 21:51:53

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 14/16] commoncap: remove cap_inode_getsecurity()

Reading of fscaps xattrs is now done via vfs_get_fscaps(), so there is
no longer any need to do it from security_inode_getsecurity(). Remove
cap_inode_getsecurity() and its associated helpers which are now unused.

We don't allow reading capabilities xattrs this way anyomre, so remove
the handler and associated helpers.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/security.h | 5 +-
security/commoncap.c | 132 -----------------------------------------------
2 files changed, 1 insertion(+), 136 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index 1d1df326c881..784b85816907 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -158,9 +158,6 @@ int cap_inode_removexattr(struct mnt_idmap *idmap,
struct dentry *dentry, const char *name);
int cap_inode_need_killpriv(struct dentry *dentry);
int cap_inode_killpriv(struct mnt_idmap *idmap, struct dentry *dentry);
-int cap_inode_getsecurity(struct mnt_idmap *idmap,
- struct inode *inode, const char *name, void **buffer,
- bool alloc);
extern int cap_mmap_addr(unsigned long addr);
extern int cap_mmap_file(struct file *file, unsigned long reqprot,
unsigned long prot, unsigned long flags);
@@ -934,7 +931,7 @@ static inline int security_inode_getsecurity(struct mnt_idmap *idmap,
const char *name, void **buffer,
bool alloc)
{
- return cap_inode_getsecurity(idmap, inode, name, buffer, alloc);
+ return -EOPNOTSUPP;
}

static inline int security_inode_setsecurity(struct inode *inode, const char *name, const void *value, size_t size, int flags)
diff --git a/security/commoncap.c b/security/commoncap.c
index bd95b806af2f..ced7a3c9685f 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -352,137 +352,6 @@ static __u32 sansflags(__u32 m)
return m & ~VFS_CAP_FLAGS_EFFECTIVE;
}

-static bool is_v2header(int size, const struct vfs_cap_data *cap)
-{
- if (size != XATTR_CAPS_SZ_2)
- return false;
- return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_2;
-}
-
-static bool is_v3header(int size, const struct vfs_cap_data *cap)
-{
- if (size != XATTR_CAPS_SZ_3)
- return false;
- return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_3;
-}
-
-/*
- * getsecurity: We are called for security.* before any attempt to read the
- * xattr from the inode itself.
- *
- * This gives us a chance to read the on-disk value and convert it. If we
- * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
- *
- * Note we are not called by vfs_getxattr_alloc(), but that is only called
- * by the integrity subsystem, which really wants the unconverted values -
- * so that's good.
- */
-int cap_inode_getsecurity(struct mnt_idmap *idmap,
- struct inode *inode, const char *name, void **buffer,
- bool alloc)
-{
- int size;
- kuid_t kroot;
- vfsuid_t vfsroot;
- u32 nsmagic, magic;
- uid_t root, mappedroot;
- char *tmpbuf = NULL;
- struct vfs_cap_data *cap;
- struct vfs_ns_cap_data *nscap = NULL;
- struct dentry *dentry;
- struct user_namespace *fs_ns;
-
- if (strcmp(name, "capability") != 0)
- return -EOPNOTSUPP;
-
- dentry = d_find_any_alias(inode);
- if (!dentry)
- return -EINVAL;
- size = vfs_getxattr_alloc(idmap, dentry, XATTR_NAME_CAPS, &tmpbuf,
- sizeof(struct vfs_ns_cap_data), GFP_NOFS);
- dput(dentry);
- /* gcc11 complains if we don't check for !tmpbuf */
- if (size < 0 || !tmpbuf)
- goto out_free;
-
- fs_ns = inode->i_sb->s_user_ns;
- cap = (struct vfs_cap_data *) tmpbuf;
- if (is_v2header(size, cap)) {
- root = 0;
- } else if (is_v3header(size, cap)) {
- nscap = (struct vfs_ns_cap_data *) tmpbuf;
- root = le32_to_cpu(nscap->rootid);
- } else {
- size = -EINVAL;
- goto out_free;
- }
-
- kroot = make_kuid(fs_ns, root);
-
- /* If this is an idmapped mount shift the kuid. */
- vfsroot = make_vfsuid(idmap, fs_ns, kroot);
-
- /* If the root kuid maps to a valid uid in current ns, then return
- * this as a nscap. */
- mappedroot = from_kuid(current_user_ns(), vfsuid_into_kuid(vfsroot));
- if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
- size = sizeof(struct vfs_ns_cap_data);
- if (alloc) {
- if (!nscap) {
- /* v2 -> v3 conversion */
- nscap = kzalloc(size, GFP_ATOMIC);
- if (!nscap) {
- size = -ENOMEM;
- goto out_free;
- }
- nsmagic = VFS_CAP_REVISION_3;
- magic = le32_to_cpu(cap->magic_etc);
- if (magic & VFS_CAP_FLAGS_EFFECTIVE)
- nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
- memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
- nscap->magic_etc = cpu_to_le32(nsmagic);
- } else {
- /* use allocated v3 buffer */
- tmpbuf = NULL;
- }
- nscap->rootid = cpu_to_le32(mappedroot);
- *buffer = nscap;
- }
- goto out_free;
- }
-
- if (!rootid_owns_currentns(vfsroot)) {
- size = -EOVERFLOW;
- goto out_free;
- }
-
- /* This comes from a parent namespace. Return as a v2 capability */
- size = sizeof(struct vfs_cap_data);
- if (alloc) {
- if (nscap) {
- /* v3 -> v2 conversion */
- cap = kzalloc(size, GFP_ATOMIC);
- if (!cap) {
- size = -ENOMEM;
- goto out_free;
- }
- magic = VFS_CAP_REVISION_2;
- nsmagic = le32_to_cpu(nscap->magic_etc);
- if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
- magic |= VFS_CAP_FLAGS_EFFECTIVE;
- memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
- cap->magic_etc = cpu_to_le32(magic);
- } else {
- /* use unconverted v2 */
- tmpbuf = NULL;
- }
- *buffer = cap;
- }
-out_free:
- kfree(tmpbuf);
- return size;
-}
-
/**
* rootid_from_vfs_caps - translate root uid of vfs caps
*
@@ -1616,7 +1485,6 @@ static struct security_hook_list capability_hooks[] __ro_after_init = {
LSM_HOOK_INIT(bprm_creds_from_file, cap_bprm_creds_from_file),
LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
- LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
LSM_HOOK_INIT(mmap_file, cap_mmap_file),
LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),

--
2.43.0

2023-11-29 21:51:53

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 06/16] capability: provide a helper for converting vfs_caps to xattr for userspace

cap_inode_getsecurity() implements a handful of policies for capability
xattrs read by userspace:

- It returns EINVAL if the on-disk capability is in v1 format.

- It masks off all bits in magic_etc except for the version and
VFS_CAP_FLAGS_EFFECTIVE.

- v3 capabilities are converted to v2 format if the rootid returned to
userspace would be 0 or if the rootid corresponds to root in an
ancestor user namespace.

- It returns EOVERFLOW for a v3 capability whose rootid does not map to
a valid id in current_user_ns() or to root in an ancestor namespace.

These policies must be maintained when converting vfs_caps to an xattr
for userspace. Provide a vfs_caps_to_user_xattr() helper which will
enforce these policies.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/capability.h | 4 +++
security/commoncap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index cdd7d2d8855e..c0bd9447685b 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -218,6 +218,10 @@ int vfs_caps_to_xattr(struct mnt_idmap *idmap,
struct user_namespace *dest_userns,
const struct vfs_caps *vfs_caps,
void *data, int size);
+int vfs_caps_to_user_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *dest_userns,
+ const struct vfs_caps *vfs_caps,
+ void *data, int size);

/* audit system wants to get cap info from files as well */
int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
diff --git a/security/commoncap.c b/security/commoncap.c
index ef37966f3522..c645330f83a0 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -789,6 +789,74 @@ int vfs_caps_to_xattr(struct mnt_idmap *idmap,
return ret;
}

+/**
+ * vfs_caps_to_user_xattr - convert vfs_caps to caps xattr for userspace
+ *
+ * @idmap: idmap of the mount the inode was found from
+ * @dest_userns: user namespace for ids in xattr data
+ * @vfs_caps: source vfs_caps data
+ * @data: destination buffer for rax xattr caps data
+ * @size: size of the @data buffer
+ *
+ * Converts a kernel-interrnal capability into the raw security.capability
+ * xattr format. Includes permission checking and v2->v3 conversion as
+ * appropriate.
+ *
+ * If the xattr is being read or written through an idmapped mount the
+ * idmap of the vfsmount must be passed through @idmap. This function
+ * will then take care to map the rootid according to @idmap.
+ *
+ * Return: On success, return 0; on error, return < 0.
+ */
+int vfs_caps_to_user_xattr(struct mnt_idmap *idmap,
+ struct user_namespace *dest_userns,
+ const struct vfs_caps *vfs_caps,
+ void *data, int size)
+{
+ struct vfs_ns_cap_data *ns_caps = data;
+ bool is_v3;
+ u32 magic;
+
+ /* Preserve previous behavior of returning EINVAL for v1 caps */
+ if ((vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_1)
+ return -EINVAL;
+
+ size = __vfs_caps_to_xattr(idmap, dest_userns, vfs_caps, data, size);
+ if (size < 0)
+ return size;
+
+ magic = vfs_caps->magic_etc &
+ (VFS_CAP_REVISION_MASK | VFS_CAP_FLAGS_EFFECTIVE);
+ ns_caps->magic_etc = cpu_to_le32(magic);
+
+ /*
+ * If this is a v3 capability with a valid, non-zero rootid, return
+ * the v3 capability to userspace. A v3 capability with a rootid of
+ * 0 will be converted to a v2 capability below for compatibility
+ * with old userspace.
+ */
+ is_v3 = (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3;
+ if (is_v3) {
+ uid_t rootid = le32_to_cpu(ns_caps->rootid);
+ if (rootid != (uid_t)-1 && rootid != (uid_t)0)
+ return size;
+ }
+
+ if (!rootid_owns_currentns(vfs_caps->rootid))
+ return -EOVERFLOW;
+
+ /* This comes from a parent namespace. Return as a v2 capability. */
+ if (is_v3) {
+ magic = VFS_CAP_REVISION_2 |
+ (vfs_caps->magic_etc & VFS_CAP_FLAGS_EFFECTIVE);
+ ns_caps->magic_etc = cpu_to_le32(magic);
+ ns_caps->rootid = cpu_to_le32(0);
+ size = XATTR_CAPS_SZ_2;
+ }
+
+ return size;
+}
+
/**
* get_vfs_caps_from_disk - retrieve vfs caps from disk
*

--
2.43.0

2023-11-29 21:51:54

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 09/16] fs: add vfs_set_fscaps()

Provide a type-safe interface for setting filesystem capabilities and a
generic implementation suitable for most filesystems.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/xattr.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 89 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 3abaf9bef0a5..03cc824e4f87 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -247,6 +247,93 @@ int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
}
EXPORT_SYMBOL(vfs_get_fscaps);

+static int generic_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ const struct vfs_caps *caps, int flags)
+{
+ struct inode *inode = d_inode(dentry);
+ struct vfs_ns_cap_data nscaps;
+ int size;
+
+ size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps,
+ &nscaps, sizeof(nscaps));
+ if (size < 0)
+ return size;
+
+ return __vfs_setxattr_noperm(idmap, dentry, XATTR_NAME_CAPS,
+ &nscaps, size, flags);
+}
+
+/**
+ * vfs_set_fscaps - set filesystem capabilities
+ * @idmap: idmap of the mount the inode was found from
+ * @dentry: the dentry on which to set filesystem capabilities
+ * @caps: the filesystem capabilities to be written
+ * @flags: setxattr flags to use when writing the capabilities xattr
+ *
+ * This function writes the supplied filesystem capabilities to the dentry.
+ *
+ * Return: 0 on success, a negative errno on error.
+ */
+int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ const struct vfs_caps *caps, int flags)
+{
+ struct inode *inode = d_inode(dentry);
+ struct inode *delegated_inode = NULL;
+ struct vfs_ns_cap_data nscaps;
+ int size, error;
+
+ /*
+ * Unfortunately EVM wants to have the raw xattr value to compare to
+ * the on-disk version, so we need to pass the raw xattr to the
+ * security hooks. But we also want to do security checks before
+ * breaking leases, so that means a conversion to the raw xattr here
+ * which will usually be reduntant with the conversion we do for
+ * writing the xattr to disk.
+ */
+ size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
+ sizeof(nscaps));
+ if (size < 0)
+ return size;
+
+retry_deleg:
+ inode_lock(inode);
+
+ error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
+ if (error)
+ goto out_inode_unlock;
+ error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
+ size, flags);
+ if (error)
+ goto out_inode_unlock;
+
+ error = try_break_deleg(inode, &delegated_inode);
+ if (error)
+ goto out_inode_unlock;
+
+ if (inode->i_opflags & IOP_XATTR) {
+ if (inode->i_op->set_fscaps)
+ error = inode->i_op->set_fscaps(idmap, dentry, caps, flags);
+ else
+ error = generic_set_fscaps(idmap, dentry, caps, flags);
+ } else if (unlikely(is_bad_inode(inode))) {
+ error = -EIO;
+ } else {
+ error = -EOPNOTSUPP;
+ }
+
+out_inode_unlock:
+ inode_unlock(inode);
+
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
+
+ return error;
+}
+EXPORT_SYMBOL(vfs_set_fscaps);
+
int
__vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *inode, const char *name, const void *value,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e25b39e4017a..80992e210b83 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2059,6 +2059,8 @@ extern int __vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
struct vfs_caps *caps);
extern int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
struct vfs_caps *caps);
+extern int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ const struct vfs_caps *caps, int flags);

enum freeze_holder {
FREEZE_HOLDER_KERNEL = (1U << 0),

--
2.43.0

2023-11-29 21:51:55

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 07/16] fs: add inode operations to get/set/remove fscaps

Add inode operations for getting, setting and removing filesystem
capabilities rather than passing around raw xattr data. This provides
better type safety for ids contained within xattrs.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
include/linux/fs.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..a0a77f67b999 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2002,6 +2002,11 @@ struct inode_operations {
int);
int (*set_acl)(struct mnt_idmap *, struct dentry *,
struct posix_acl *, int);
+ int (*get_fscaps)(struct mnt_idmap *, struct dentry *,
+ struct vfs_caps *);
+ int (*set_fscaps)(struct mnt_idmap *, struct dentry *,
+ const struct vfs_caps *, int flags);
+ int (*remove_fscaps)(struct mnt_idmap *, struct dentry *);
int (*fileattr_set)(struct mnt_idmap *idmap,
struct dentry *dentry, struct fileattr *fa);
int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);

--
2.43.0

2023-11-29 21:52:01

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 15/16] commoncap: use vfs fscaps interfaces for killpriv checks

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
security/commoncap.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index ced7a3c9685f..15344c86c759 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -295,11 +295,12 @@ int cap_capset(struct cred *new,
*/
int cap_inode_need_killpriv(struct dentry *dentry)
{
- struct inode *inode = d_backing_inode(dentry);
+ struct vfs_caps caps;
int error;

- error = __vfs_getxattr(dentry, inode, XATTR_NAME_CAPS, NULL, 0);
- return error > 0;
+ /* Use nop_mnt_idmap for no mapping here as mapping is unimportant */
+ error= __vfs_get_fscaps(&nop_mnt_idmap, dentry, &caps);
+ return error == 0;
}

/**
@@ -322,7 +323,7 @@ int cap_inode_killpriv(struct mnt_idmap *idmap, struct dentry *dentry)
{
int error;

- error = __vfs_removexattr(idmap, dentry, XATTR_NAME_CAPS);
+ error = __vfs_remove_fscaps(idmap, dentry);
if (error == -EOPNOTSUPP)
error = 0;
return error;

--
2.43.0

2023-11-29 21:52:02

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 13/16] fs: use vfs interfaces for capabilities xattrs

Now that all the plumbing is in place, switch over to using the new
inode operations to get/set fs caps. This pushes all mapping of ids into
the caller's user ns to above the vfs_*() level, making this consistent
with other vfs_*() interfaces.

cap_convert_nscap() is updated to return vfs_caps and moved to be called
from the new code path for setting fscaps. This means that use of
vfs_setxattr() will no longer remap ids in fscap xattrs, but all code
which used vfs_setxattr() for fscaps xattrs has been converted to the
new interfaces.

Removing the mapping of fscaps rootids from vfs_getxattr() is more
invovled and will be addressed in a later commit.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/xattr.c | 49 ++++++++++++++++++++++++----
include/linux/capability.h | 2 +-
security/commoncap.c | 79 +++++++++++++++-------------------------------
3 files changed, 69 insertions(+), 61 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index f60ef2a79dfa..372644b15457 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -540,13 +540,6 @@ vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
const void *orig_value = value;
int error;

- if (size && strcmp(name, XATTR_NAME_CAPS) == 0) {
- error = cap_convert_nscap(idmap, dentry, &value, size);
- if (error < 0)
- return error;
- size = error;
- }
-
retry_deleg:
inode_lock(inode);
error = __vfs_setxattr_locked(idmap, dentry, name, value, size,
@@ -857,6 +850,24 @@ int do_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
return do_set_acl(idmap, dentry, ctx->kname->name,
ctx->kvalue, ctx->size);

+ if (strcmp(ctx->kname->name, XATTR_NAME_CAPS) == 0) {
+ struct vfs_caps caps;
+ int ret;
+
+ /*
+ * rootid is already in the mount idmap, so pass nop_mnt_idmap
+ * so that it won't be mapped.
+ */
+ ret = vfs_caps_from_xattr(&nop_mnt_idmap, current_user_ns(),
+ &caps, ctx->kvalue, ctx->size);
+ if (ret)
+ return ret;
+ ret = cap_convert_nscap(idmap, dentry, &caps);
+ if (ret)
+ return ret;
+ return vfs_set_fscaps(idmap, dentry, &caps, ctx->flags);
+ }
+
return vfs_setxattr(idmap, dentry, ctx->kname->name,
ctx->kvalue, ctx->size, ctx->flags);
}
@@ -955,6 +966,27 @@ do_getxattr(struct mnt_idmap *idmap, struct dentry *d,
ssize_t error;
char *kname = ctx->kname->name;

+ if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
+ struct vfs_caps caps;
+ struct vfs_ns_cap_data data;
+ int ret;
+
+ ret = vfs_get_fscaps(idmap, d, &caps);
+ if (ret)
+ return ret;
+ /*
+ * rootid is already in the mount idmap, so pass nop_mnt_idmap
+ * so that it won't be mapped.
+ */
+ ret = vfs_caps_to_user_xattr(&nop_mnt_idmap, current_user_ns(),
+ &caps, &data, ctx->size);
+ if (ret < 0)
+ return ret;
+ if (ctx->size && copy_to_user(ctx->value, &data, ret))
+ return -EFAULT;
+ return ret;
+ }
+
if (ctx->size) {
if (ctx->size > XATTR_SIZE_MAX)
ctx->size = XATTR_SIZE_MAX;
@@ -1145,6 +1177,9 @@ removexattr(struct mnt_idmap *idmap, struct dentry *d,
if (is_posix_acl_xattr(kname))
return vfs_remove_acl(idmap, d, kname);

+ if (strcmp(kname, XATTR_NAME_CAPS) == 0)
+ return vfs_remove_fscaps(idmap, d);
+
return vfs_removexattr(idmap, d, kname);
}

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c0bd9447685b..563f084e9453 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -229,6 +229,6 @@ int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
struct vfs_caps *cpu_caps);

int cap_convert_nscap(struct mnt_idmap *idmap, struct dentry *dentry,
- const void **ivalue, size_t size);
+ struct vfs_caps *caps);

#endif /* !_LINUX_CAPABILITY_H */
diff --git a/security/commoncap.c b/security/commoncap.c
index c645330f83a0..bd95b806af2f 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -484,27 +484,21 @@ int cap_inode_getsecurity(struct mnt_idmap *idmap,
}

/**
- * rootid_from_xattr - translate root uid of vfs caps
+ * rootid_from_vfs_caps - translate root uid of vfs caps
*
- * @value: vfs caps value which may be modified by this function
- * @size: size of @ivalue
+ * @caps: vfs caps value which may be modified by this function
* @task_ns: user namespace of the caller
+ *
+ * Return the rootid from a v3 fs cap, or the id of root in the task's user
+ * namespace for v1 and v2 fs caps.
*/
-static vfsuid_t rootid_from_xattr(const void *value, size_t size,
- struct user_namespace *task_ns)
+static vfsuid_t rootid_from_vfs_caps(const struct vfs_caps *caps,
+ struct user_namespace *task_ns)
{
- const struct vfs_ns_cap_data *nscap = value;
- uid_t rootid = 0;
-
- if (size == XATTR_CAPS_SZ_3)
- rootid = le32_to_cpu(nscap->rootid);
-
- return VFSUIDT_INIT(make_kuid(task_ns, rootid));
-}
+ if ((caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3)
+ return caps->rootid;

-static bool validheader(size_t size, const struct vfs_cap_data *cap)
-{
- return is_v2header(size, cap) || is_v3header(size, cap);
+ return VFSUIDT_INIT(make_kuid(task_ns, 0));
}

/**
@@ -512,11 +506,10 @@ static bool validheader(size_t size, const struct vfs_cap_data *cap)
*
* @idmap: idmap of the mount the inode was found from
* @dentry: used to retrieve inode to check permissions on
- * @ivalue: vfs caps value which may be modified by this function
- * @size: size of @ivalue
+ * @caps: vfs caps which may be modified by this function
*
- * User requested a write of security.capability. If needed, update the
- * xattr to change from v2 to v3, or to fixup the v3 rootid.
+ * User requested a write of security.capability. Check permissions, and if
+ * needed, update the xattr to change from v2 to v3.
*
* If the inode has been found through an idmapped mount the idmap of
* the vfsmount must be passed through @idmap. This function will then
@@ -524,59 +517,39 @@ static bool validheader(size_t size, const struct vfs_cap_data *cap)
* permissions. On non-idmapped mounts or if permission checking is to be
* performed on the raw inode simply pass @nop_mnt_idmap.
*
- * Return: On success, return the new size; on error, return < 0.
+ * Return: On success, return 0; on error, return < 0.
*/
int cap_convert_nscap(struct mnt_idmap *idmap, struct dentry *dentry,
- const void **ivalue, size_t size)
+ struct vfs_caps *caps)
{
- struct vfs_ns_cap_data *nscap;
- uid_t nsrootid;
- const struct vfs_cap_data *cap = *ivalue;
- __u32 magic, nsmagic;
struct inode *inode = d_backing_inode(dentry);
struct user_namespace *task_ns = current_user_ns(),
*fs_ns = inode->i_sb->s_user_ns;
- kuid_t rootid;
vfsuid_t vfsrootid;
- size_t newsize;
+ __u32 revision;

- if (!*ivalue)
- return -EINVAL;
- if (!validheader(size, cap))
+ revision = sansflags(caps->magic_etc);
+ if (revision != VFS_CAP_REVISION_2 && revision != VFS_CAP_REVISION_3)
return -EINVAL;
if (!capable_wrt_inode_uidgid(idmap, inode, CAP_SETFCAP))
return -EPERM;
- if (size == XATTR_CAPS_SZ_2 && (idmap == &nop_mnt_idmap))
+ if (revision == VFS_CAP_REVISION_2 && (idmap == &nop_mnt_idmap))
if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
/* user is privileged, just write the v2 */
- return size;
+ return 0;

- vfsrootid = rootid_from_xattr(*ivalue, size, task_ns);
+ vfsrootid = rootid_from_vfs_caps(caps, task_ns);
if (!vfsuid_valid(vfsrootid))
return -EINVAL;

- rootid = from_vfsuid(idmap, fs_ns, vfsrootid);
- if (!uid_valid(rootid))
+ if (!vfsuid_has_fsmapping(idmap, fs_ns, vfsrootid))
return -EINVAL;

- nsrootid = from_kuid(fs_ns, rootid);
- if (nsrootid == -1)
- return -EINVAL;
+ caps->rootid = vfsrootid;
+ caps->magic_etc = VFS_CAP_REVISION_3 |
+ (caps->magic_etc & VFS_CAP_FLAGS_EFFECTIVE);

- newsize = sizeof(struct vfs_ns_cap_data);
- nscap = kmalloc(newsize, GFP_ATOMIC);
- if (!nscap)
- return -ENOMEM;
- nscap->rootid = cpu_to_le32(nsrootid);
- nsmagic = VFS_CAP_REVISION_3;
- magic = le32_to_cpu(cap->magic_etc);
- if (magic & VFS_CAP_FLAGS_EFFECTIVE)
- nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
- nscap->magic_etc = cpu_to_le32(nsmagic);
- memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
-
- *ivalue = nscap;
- return newsize;
+ return 0;
}

/*

--
2.43.0

2023-11-29 21:52:13

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 10/16] fs: add vfs_remove_fscaps()

Provide a type-safe interface for removing filesystem capabilities and
a generic implementation suitable for most filesystems. Also add an
internal interface, __vfs_remove_fscaps(), which is called with the
inode lock held and skips security checks for later use from the
capability code.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/xattr.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 79 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 03cc824e4f87..f60ef2a79dfa 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -334,6 +334,83 @@ int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
}
EXPORT_SYMBOL(vfs_set_fscaps);

+static int generic_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
+{
+ return __vfs_removexattr(idmap, dentry, XATTR_NAME_CAPS);
+}
+
+/**
+ * __vfs_remove_fscaps - remove filesystem capabilities without security checks
+ * @idmap: idmap of the mount the inode was found from
+ * @dentry: the dentry from which to remove filesystem capabilities
+ *
+ * This function removes any filesystem capabilities from the specified
+ * dentry. Does not perform any security checks, and callers must hold the
+ * inode lock.
+ *
+ * Return: 0 on success, a negative errno on error.
+ */
+int __vfs_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ int error;
+
+ if (inode->i_op->remove_fscaps)
+ error = inode->i_op->remove_fscaps(idmap, dentry);
+ else
+ error = generic_remove_fscaps(idmap, dentry);
+
+ return error;
+}
+
+/**
+ * vfs_remove_fscaps - remove filesystem capabilities
+ * @idmap: idmap of the mount the inode was found from
+ * @dentry: the dentry from which to remove filesystem capabilities
+ *
+ * This function removes any filesystem capabilities from the specified
+ * dentry.
+ *
+ * Return: 0 on success, a negative errno on error.
+ */
+int vfs_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ struct inode *delegated_inode = NULL;
+ int error;
+
+retry_deleg:
+ inode_lock(inode);
+
+ error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
+ if (error)
+ goto out_inode_unlock;
+
+ error = security_inode_removexattr(idmap, dentry, XATTR_NAME_CAPS);
+ if (error)
+ goto out_inode_unlock;
+
+ error = try_break_deleg(inode, &delegated_inode);
+ if (error)
+ goto out_inode_unlock;
+
+ error = __vfs_remove_fscaps(idmap, dentry);
+ if (!error)
+ fsnotify_xattr(dentry);
+
+out_inode_unlock:
+ inode_unlock(inode);
+
+ if (delegated_inode) {
+ error = break_deleg_wait(&delegated_inode);
+ if (!error)
+ goto retry_deleg;
+ }
+
+ return error;
+}
+EXPORT_SYMBOL(vfs_remove_fscaps);
+
int
__vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
struct inode *inode, const char *name, const void *value,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 80992e210b83..057bad11a4e6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2061,6 +2061,8 @@ extern int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
struct vfs_caps *caps);
extern int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
const struct vfs_caps *caps, int flags);
+extern int __vfs_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry);
+extern int vfs_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry);

enum freeze_holder {
FREEZE_HOLDER_KERNEL = (1U << 0),

--
2.43.0

2023-11-29 21:52:14

by Seth Forshee

[permalink] [raw]
Subject: [PATCH 11/16] ovl: add fscaps handlers

Add handlers which read fs caps from the lower or upper filesystem and
write/remove fs caps to the upper filesystem, performing copy-up as
necessary.

While it doesn't make sense to use fscaps on directories, nothing in the
kernel actually prevents setting or getting fscaps xattrs for directory
inodes. If we omit fscaps handlers in ovl_dir_inode_operations then the
generic handlers will be used. These handlers will use the xattr inode
operations, bypassing any idmapping on lower mounts, so fscaps handlers
are also installed for ovl_dir_inode_operations.

Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
---
fs/overlayfs/dir.c | 3 ++
fs/overlayfs/inode.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/overlayfs/overlayfs.h | 6 ++++
3 files changed, 93 insertions(+)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index aab3f5d93556..d9ab3c9ce10a 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -1303,6 +1303,9 @@ const struct inode_operations ovl_dir_inode_operations = {
.get_inode_acl = ovl_get_inode_acl,
.get_acl = ovl_get_acl,
.set_acl = ovl_set_acl,
+ .get_fscaps = ovl_get_fscaps,
+ .set_fscaps = ovl_set_fscaps,
+ .remove_fscaps = ovl_remove_fscaps,
.update_time = ovl_update_time,
.fileattr_get = ovl_fileattr_get,
.fileattr_set = ovl_fileattr_set,
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index c63b31a460be..82fc6e479d45 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -568,6 +568,87 @@ int ovl_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
}
#endif

+int ovl_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps)
+{
+ int err;
+ const struct cred *old_cred;
+ struct path realpath;
+
+ ovl_path_real(dentry, &realpath);
+ old_cred = ovl_override_creds(dentry->d_sb);
+ err = vfs_get_fscaps(mnt_idmap(realpath.mnt), realpath.dentry, caps);
+ revert_creds(old_cred);
+ return err;
+}
+
+int ovl_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ const struct vfs_caps *caps, int flags)
+{
+ int err;
+ struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
+ struct dentry *upperdentry = ovl_dentry_upper(dentry);
+ struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
+ const struct cred *old_cred;
+
+ err = ovl_want_write(dentry);
+ if (err)
+ goto out;
+
+ if (!upperdentry) {
+ err = ovl_copy_up(dentry);
+ if (err)
+ goto out_drop_write;
+
+ realdentry = ovl_dentry_upper(dentry);
+ }
+
+ old_cred = ovl_override_creds(dentry->d_sb);
+ err = vfs_set_fscaps(ovl_upper_mnt_idmap(ofs), realdentry, caps, flags);
+ revert_creds(old_cred);
+
+ /* copy c/mtime */
+ ovl_copyattr(d_inode(dentry));
+
+out_drop_write:
+ ovl_drop_write(dentry);
+out:
+ return err;
+}
+
+int ovl_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
+{
+ int err;
+ struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
+ struct dentry *upperdentry = ovl_dentry_upper(dentry);
+ struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
+ const struct cred *old_cred;
+
+ err = ovl_want_write(dentry);
+ if (err)
+ goto out;
+
+ if (!upperdentry) {
+ err = ovl_copy_up(dentry);
+ if (err)
+ goto out_drop_write;
+
+ realdentry = ovl_dentry_upper(dentry);
+ }
+
+ old_cred = ovl_override_creds(dentry->d_sb);
+ err = vfs_remove_fscaps(ovl_upper_mnt_idmap(ofs), realdentry);
+ revert_creds(old_cred);
+
+ /* copy c/mtime */
+ ovl_copyattr(d_inode(dentry));
+
+out_drop_write:
+ ovl_drop_write(dentry);
+out:
+ return err;
+}
+
int ovl_update_time(struct inode *inode, int flags)
{
if (flags & S_ATIME) {
@@ -747,6 +828,9 @@ static const struct inode_operations ovl_file_inode_operations = {
.get_inode_acl = ovl_get_inode_acl,
.get_acl = ovl_get_acl,
.set_acl = ovl_set_acl,
+ .get_fscaps = ovl_get_fscaps,
+ .set_fscaps = ovl_set_fscaps,
+ .remove_fscaps = ovl_remove_fscaps,
.update_time = ovl_update_time,
.fiemap = ovl_fiemap,
.fileattr_get = ovl_fileattr_get,
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 05c3dd597fa8..e72ee2374f96 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -764,6 +764,12 @@ static inline struct posix_acl *ovl_get_acl_path(const struct path *path,
}
#endif

+int ovl_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct vfs_caps *caps);
+int ovl_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
+ const struct vfs_caps *caps, int flags);
+int ovl_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry);
+
int ovl_update_time(struct inode *inode, int flags);
bool ovl_is_private_xattr(struct super_block *sb, const char *name);


--
2.43.0

2023-11-30 05:33:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 07/16] fs: add inode operations to get/set/remove fscaps

On Wed, Nov 29, 2023 at 11:51 PM Seth Forshee (DigitalOcean)
<[email protected]> wrote:
>
> Add inode operations for getting, setting and removing filesystem
> capabilities rather than passing around raw xattr data. This provides
> better type safety for ids contained within xattrs.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> include/linux/fs.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 98b7a7a8c42e..a0a77f67b999 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2002,6 +2002,11 @@ struct inode_operations {
> int);
> int (*set_acl)(struct mnt_idmap *, struct dentry *,
> struct posix_acl *, int);
> + int (*get_fscaps)(struct mnt_idmap *, struct dentry *,
> + struct vfs_caps *);
> + int (*set_fscaps)(struct mnt_idmap *, struct dentry *,
> + const struct vfs_caps *, int flags);
> + int (*remove_fscaps)(struct mnt_idmap *, struct dentry *);
> int (*fileattr_set)(struct mnt_idmap *idmap,
> struct dentry *dentry, struct fileattr *fa);
> int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
>

Please document in Documentation/filesystems/{vfs,locking}.rst

Thanks,
Amir.

2023-11-30 05:57:07

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 11/16] ovl: add fscaps handlers

On Wed, Nov 29, 2023 at 11:50 PM Seth Forshee (DigitalOcean)
<[email protected]> wrote:
>
> Add handlers which read fs caps from the lower or upper filesystem and
> write/remove fs caps to the upper filesystem, performing copy-up as
> necessary.
>
> While it doesn't make sense to use fscaps on directories, nothing in the
> kernel actually prevents setting or getting fscaps xattrs for directory
> inodes. If we omit fscaps handlers in ovl_dir_inode_operations then the
> generic handlers will be used. These handlers will use the xattr inode
> operations, bypassing any idmapping on lower mounts, so fscaps handlers
> are also installed for ovl_dir_inode_operations.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> fs/overlayfs/dir.c | 3 ++
> fs/overlayfs/inode.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> fs/overlayfs/overlayfs.h | 6 ++++
> 3 files changed, 93 insertions(+)
>
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index aab3f5d93556..d9ab3c9ce10a 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -1303,6 +1303,9 @@ const struct inode_operations ovl_dir_inode_operations = {
> .get_inode_acl = ovl_get_inode_acl,
> .get_acl = ovl_get_acl,
> .set_acl = ovl_set_acl,
> + .get_fscaps = ovl_get_fscaps,
> + .set_fscaps = ovl_set_fscaps,
> + .remove_fscaps = ovl_remove_fscaps,
> .update_time = ovl_update_time,
> .fileattr_get = ovl_fileattr_get,
> .fileattr_set = ovl_fileattr_set,
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index c63b31a460be..82fc6e479d45 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -568,6 +568,87 @@ int ovl_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
> }
> #endif
>
> +int ovl_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + struct vfs_caps *caps)
> +{
> + int err;
> + const struct cred *old_cred;
> + struct path realpath;
> +
> + ovl_path_real(dentry, &realpath);
> + old_cred = ovl_override_creds(dentry->d_sb);
> + err = vfs_get_fscaps(mnt_idmap(realpath.mnt), realpath.dentry, caps);
> + revert_creds(old_cred);
> + return err;
> +}
> +
> +int ovl_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + const struct vfs_caps *caps, int flags)
> +{
> + int err;
> + struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
> + struct dentry *upperdentry = ovl_dentry_upper(dentry);
> + struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
> + const struct cred *old_cred;
> +
> + err = ovl_want_write(dentry);
> + if (err)
> + goto out;
> +
> + if (!upperdentry) {
> + err = ovl_copy_up(dentry);
> + if (err)
> + goto out_drop_write;
> +
> + realdentry = ovl_dentry_upper(dentry);
> + }
> +
> + old_cred = ovl_override_creds(dentry->d_sb);
> + err = vfs_set_fscaps(ovl_upper_mnt_idmap(ofs), realdentry, caps, flags);
> + revert_creds(old_cred);
> +
> + /* copy c/mtime */
> + ovl_copyattr(d_inode(dentry));
> +
> +out_drop_write:
> + ovl_drop_write(dentry);
> +out:
> + return err;
> +}
> +
> +int ovl_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
> +{
> + int err;
> + struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
> + struct dentry *upperdentry = ovl_dentry_upper(dentry);
> + struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
> + const struct cred *old_cred;
> +
> + err = ovl_want_write(dentry);
> + if (err)
> + goto out;
> +
> + if (!upperdentry) {
> + err = ovl_copy_up(dentry);
> + if (err)
> + goto out_drop_write;
> +
> + realdentry = ovl_dentry_upper(dentry);
> + }
> +

This construct is peculiar.
Most of the operations just do this unconditionally:

err = ovl_copy_up(dentry);
if (err)
goto out_drop_write;

and then use ovl_dentry_upper(dentry) directly, because a modification
will always be done on the upper dentry, regardless of the state before
the operation started.

I was wondering where you copied this from and I found it right above
in ovl_set_or_remove_acl().
In that case, there was also no justification for this construct.

There is also no justification for open coding:
struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
when later on, ovl_path_lower(dentry, &realpath) is called anyway.

The only reason to do anything special in ovl_set_or_remove_acl() is:

/*
* If ACL is to be removed from a lower file, check if it exists in
* the first place before copying it up.
*/

Do you not want to do the same for ovl_remove_fscaps()?

Also, the comparison to remove_acl API bares the question,
why did you need to add a separate method for remove_fscaps?
Why not use set_fscaps(NULL), just like setxattr() and set_acl() APIs?

Thanks,
Amir.

2023-11-30 06:12:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 16/16] vfs: return -EOPNOTSUPP for fscaps from vfs_*xattr()

On Wed, Nov 29, 2023 at 11:51 PM Seth Forshee (DigitalOcean)
<[email protected]> wrote:
>
> Now that the new vfs-level interfaces are fully supported and all code
> has been converted to use them, stop permitting use of the top-level vfs
> xattr interfaces for capabilities xattrs. Unlike with ACLs we still need
> to be able to work with fscaps xattrs using lower-level interfaces in a
> handful of places, so only use of the top-level xattr interfaces is
> restricted.

Can you explain why?
Is there an inherent difference between ACLs and fscaps in that respect
or is it just a matter of more work that needs to be done?

>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> fs/xattr.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 372644b15457..4b779779ad8c 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -540,6 +540,9 @@ vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
> const void *orig_value = value;
> int error;
>
> + if (!strcmp(name, XATTR_NAME_CAPS))
> + return -EOPNOTSUPP;
> +

It this is really not expected, then it should be an assert and
please use an inline helper like is_posix_acl_xattr():

if (WARN_ON_ONCE(is_fscaps_xattr(name)))

It wouldn't hurt to add those assertions to is_posix_acl_xattr()
cases as well, but that is unrelated to your change.

Thanks,
Amir.

2023-11-30 06:24:10

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 12/16] ovl: use vfs_{get,set}_fscaps() for copy-up

On Wed, Nov 29, 2023 at 11:50 PM Seth Forshee (DigitalOcean)
<[email protected]> wrote:
>
> Using vfs_{get,set}xattr() for fscaps will be blocked in a future
> commit, so convert ovl to use the new interfaces. Also remove the now
> unused ovl_getxattr_value().
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>

You may add:

Reviewed-by: Amir Goldstein <[email protected]>

I am assuming that this work is destined to be merged via the vfs tree?
Note that there is already a (non-conflicting) patch to copy_up.c on
Christian's vfs.rw branch.

I think it is best that all the overlayfs patches are tested together by
the vfs maintainer in preparation for the 6.8 merge window, so I have
a feeling that the 6.8 overlayfs PR is going to be merged via the vfs tree ;-)

Thanks,
Amir.

> ---
> fs/overlayfs/copy_up.c | 72 ++++++++++++++++++++++++++------------------------
> 1 file changed, 37 insertions(+), 35 deletions(-)
>
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 4382881b0709..b43af5ce4b21 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -73,6 +73,23 @@ static int ovl_copy_acl(struct ovl_fs *ofs, const struct path *path,
> return err;
> }
>
> +static int ovl_copy_fscaps(struct ovl_fs *ofs, const struct path *oldpath,
> + struct dentry *new)
> +{
> + struct vfs_caps capability;
> + int err;
> +
> + err = vfs_get_fscaps(mnt_idmap(oldpath->mnt), oldpath->dentry,
> + &capability);
> + if (err) {
> + if (err == -ENODATA || err == -EOPNOTSUPP)
> + return 0;
> + return err;
> + }
> +
> + return vfs_set_fscaps(ovl_upper_mnt_idmap(ofs), new, &capability, 0);
> +}
> +
> int ovl_copy_xattr(struct super_block *sb, const struct path *oldpath, struct dentry *new)
> {
> struct dentry *old = oldpath->dentry;
> @@ -130,6 +147,14 @@ int ovl_copy_xattr(struct super_block *sb, const struct path *oldpath, struct de
> break;
> }
>
> + if (!strcmp(name, XATTR_NAME_CAPS)) {
> + error = ovl_copy_fscaps(OVL_FS(sb), oldpath, new);
> + if (!error)
> + continue;
> + /* fs capabilities must be copied */
> + break;
> + }
> +
> retry:
> size = ovl_do_getxattr(oldpath, name, value, value_size);
> if (size == -ERANGE)
> @@ -1006,61 +1031,40 @@ static bool ovl_need_meta_copy_up(struct dentry *dentry, umode_t mode,
> return true;
> }
>
> -static ssize_t ovl_getxattr_value(const struct path *path, char *name, char **value)
> -{
> - ssize_t res;
> - char *buf;
> -
> - res = ovl_do_getxattr(path, name, NULL, 0);
> - if (res == -ENODATA || res == -EOPNOTSUPP)
> - res = 0;
> -
> - if (res > 0) {
> - buf = kzalloc(res, GFP_KERNEL);
> - if (!buf)
> - return -ENOMEM;
> -
> - res = ovl_do_getxattr(path, name, buf, res);
> - if (res < 0)
> - kfree(buf);
> - else
> - *value = buf;
> - }
> - return res;
> -}
> -
> /* Copy up data of an inode which was copied up metadata only in the past. */
> static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> {
> struct ovl_fs *ofs = OVL_FS(c->dentry->d_sb);
> struct path upperpath;
> int err;
> - char *capability = NULL;
> - ssize_t cap_size;
> + struct vfs_caps capability;
> + bool has_capability = false;
>
> ovl_path_upper(c->dentry, &upperpath);
> if (WARN_ON(upperpath.dentry == NULL))
> return -EIO;
>
> if (c->stat.size) {
> - err = cap_size = ovl_getxattr_value(&upperpath, XATTR_NAME_CAPS,
> - &capability);
> - if (cap_size < 0)
> + err = vfs_get_fscaps(mnt_idmap(upperpath.mnt), upperpath.dentry,
> + &capability);
> + if (!err)
> + has_capability = 1;
> + else if (err != -ENODATA && err != EOPNOTSUPP)
> goto out;
> }
>
> err = ovl_copy_up_data(c, &upperpath);
> if (err)
> - goto out_free;
> + goto out;
>
> /*
> * Writing to upper file will clear security.capability xattr. We
> * don't want that to happen for normal copy-up operation.
> */
> ovl_start_write(c->dentry);
> - if (capability) {
> - err = ovl_do_setxattr(ofs, upperpath.dentry, XATTR_NAME_CAPS,
> - capability, cap_size, 0);
> + if (has_capability) {
> + err = vfs_set_fscaps(mnt_idmap(upperpath.mnt), upperpath.dentry,
> + &capability, 0);
> }
> if (!err) {
> err = ovl_removexattr(ofs, upperpath.dentry,
> @@ -1068,13 +1072,11 @@ static int ovl_copy_up_meta_inode_data(struct ovl_copy_up_ctx *c)
> }
> ovl_end_write(c->dentry);
> if (err)
> - goto out_free;
> + goto out;
>
> ovl_clear_flag(OVL_HAS_DIGEST, d_inode(c->dentry));
> ovl_clear_flag(OVL_VERIFIED_DIGEST, d_inode(c->dentry));
> ovl_set_upperdata(d_inode(c->dentry));
> -out_free:
> - kfree(capability);
> out:
> return err;
> }
>
> --
> 2.43.0
>

2023-11-30 08:02:36

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 09/16] fs: add vfs_set_fscaps()

On Wed, Nov 29, 2023 at 11:50 PM Seth Forshee (DigitalOcean)
<[email protected]> wrote:
>
> Provide a type-safe interface for setting filesystem capabilities and a
> generic implementation suitable for most filesystems.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> fs/xattr.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 2 files changed, 89 insertions(+)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 3abaf9bef0a5..03cc824e4f87 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -247,6 +247,93 @@ int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> }
> EXPORT_SYMBOL(vfs_get_fscaps);
>
> +static int generic_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + const struct vfs_caps *caps, int flags)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct vfs_ns_cap_data nscaps;
> + int size;
> +
> + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps,
> + &nscaps, sizeof(nscaps));
> + if (size < 0)
> + return size;
> +
> + return __vfs_setxattr_noperm(idmap, dentry, XATTR_NAME_CAPS,
> + &nscaps, size, flags);
> +}
> +
> +/**
> + * vfs_set_fscaps - set filesystem capabilities
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry on which to set filesystem capabilities
> + * @caps: the filesystem capabilities to be written
> + * @flags: setxattr flags to use when writing the capabilities xattr
> + *
> + * This function writes the supplied filesystem capabilities to the dentry.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + const struct vfs_caps *caps, int flags)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct inode *delegated_inode = NULL;
> + struct vfs_ns_cap_data nscaps;
> + int size, error;
> +
> + /*
> + * Unfortunately EVM wants to have the raw xattr value to compare to
> + * the on-disk version, so we need to pass the raw xattr to the
> + * security hooks. But we also want to do security checks before
> + * breaking leases, so that means a conversion to the raw xattr here
> + * which will usually be reduntant with the conversion we do for
> + * writing the xattr to disk.
> + */
> + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
> + sizeof(nscaps));
> + if (size < 0)
> + return size;
> +
> +retry_deleg:
> + inode_lock(inode);
> +
> + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> + if (error)
> + goto out_inode_unlock;
> + error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
> + size, flags);
> + if (error)
> + goto out_inode_unlock;
> +
> + error = try_break_deleg(inode, &delegated_inode);
> + if (error)
> + goto out_inode_unlock;
> +
> + if (inode->i_opflags & IOP_XATTR) {
> + if (inode->i_op->set_fscaps)
> + error = inode->i_op->set_fscaps(idmap, dentry, caps, flags);
> + else
> + error = generic_set_fscaps(idmap, dentry, caps, flags);

I think the non-generic case is missing fsnotify_xattr().

See vfs_set_acl() for comparison.

Thanks,
Amir.

2023-11-30 15:36:56

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 07/16] fs: add inode operations to get/set/remove fscaps

On Thu, Nov 30, 2023 at 07:32:19AM +0200, Amir Goldstein wrote:
> On Wed, Nov 29, 2023 at 11:51 PM Seth Forshee (DigitalOcean)
> <[email protected]> wrote:
> >
> > Add inode operations for getting, setting and removing filesystem
> > capabilities rather than passing around raw xattr data. This provides
> > better type safety for ids contained within xattrs.
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > ---
> > include/linux/fs.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 98b7a7a8c42e..a0a77f67b999 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2002,6 +2002,11 @@ struct inode_operations {
> > int);
> > int (*set_acl)(struct mnt_idmap *, struct dentry *,
> > struct posix_acl *, int);
> > + int (*get_fscaps)(struct mnt_idmap *, struct dentry *,
> > + struct vfs_caps *);
> > + int (*set_fscaps)(struct mnt_idmap *, struct dentry *,
> > + const struct vfs_caps *, int flags);
> > + int (*remove_fscaps)(struct mnt_idmap *, struct dentry *);
> > int (*fileattr_set)(struct mnt_idmap *idmap,
> > struct dentry *dentry, struct fileattr *fa);
> > int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
> >
>
> Please document in Documentation/filesystems/{vfs,locking}.rst

Done for v2.

2023-11-30 15:38:18

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 09/16] fs: add vfs_set_fscaps()

On Thu, Nov 30, 2023 at 10:01:55AM +0200, Amir Goldstein wrote:
> On Wed, Nov 29, 2023 at 11:50 PM Seth Forshee (DigitalOcean)
> <[email protected]> wrote:
> >
> > Provide a type-safe interface for setting filesystem capabilities and a
> > generic implementation suitable for most filesystems.
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > ---
> > fs/xattr.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/fs.h | 2 ++
> > 2 files changed, 89 insertions(+)
> >
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 3abaf9bef0a5..03cc824e4f87 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -247,6 +247,93 @@ int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > }
> > EXPORT_SYMBOL(vfs_get_fscaps);
> >
> > +static int generic_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + const struct vfs_caps *caps, int flags)
> > +{
> > + struct inode *inode = d_inode(dentry);
> > + struct vfs_ns_cap_data nscaps;
> > + int size;
> > +
> > + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps,
> > + &nscaps, sizeof(nscaps));
> > + if (size < 0)
> > + return size;
> > +
> > + return __vfs_setxattr_noperm(idmap, dentry, XATTR_NAME_CAPS,
> > + &nscaps, size, flags);
> > +}
> > +
> > +/**
> > + * vfs_set_fscaps - set filesystem capabilities
> > + * @idmap: idmap of the mount the inode was found from
> > + * @dentry: the dentry on which to set filesystem capabilities
> > + * @caps: the filesystem capabilities to be written
> > + * @flags: setxattr flags to use when writing the capabilities xattr
> > + *
> > + * This function writes the supplied filesystem capabilities to the dentry.
> > + *
> > + * Return: 0 on success, a negative errno on error.
> > + */
> > +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + const struct vfs_caps *caps, int flags)
> > +{
> > + struct inode *inode = d_inode(dentry);
> > + struct inode *delegated_inode = NULL;
> > + struct vfs_ns_cap_data nscaps;
> > + int size, error;
> > +
> > + /*
> > + * Unfortunately EVM wants to have the raw xattr value to compare to
> > + * the on-disk version, so we need to pass the raw xattr to the
> > + * security hooks. But we also want to do security checks before
> > + * breaking leases, so that means a conversion to the raw xattr here
> > + * which will usually be reduntant with the conversion we do for
> > + * writing the xattr to disk.
> > + */
> > + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
> > + sizeof(nscaps));
> > + if (size < 0)
> > + return size;
> > +
> > +retry_deleg:
> > + inode_lock(inode);
> > +
> > + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> > + if (error)
> > + goto out_inode_unlock;
> > + error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
> > + size, flags);
> > + if (error)
> > + goto out_inode_unlock;
> > +
> > + error = try_break_deleg(inode, &delegated_inode);
> > + if (error)
> > + goto out_inode_unlock;
> > +
> > + if (inode->i_opflags & IOP_XATTR) {
> > + if (inode->i_op->set_fscaps)
> > + error = inode->i_op->set_fscaps(idmap, dentry, caps, flags);
> > + else
> > + error = generic_set_fscaps(idmap, dentry, caps, flags);
>
> I think the non-generic case is missing fsnotify_xattr().
>
> See vfs_set_acl() for comparison.

Good catch. I'm going to have another look at some of this in light of
some of your other feedback, but I'll get it fixed one way or another in
v2.

2023-11-30 16:02:22

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 11/16] ovl: add fscaps handlers

On Thu, Nov 30, 2023 at 07:56:48AM +0200, Amir Goldstein wrote:
> On Wed, Nov 29, 2023 at 11:50 PM Seth Forshee (DigitalOcean)
> <[email protected]> wrote:
> >
> > Add handlers which read fs caps from the lower or upper filesystem and
> > write/remove fs caps to the upper filesystem, performing copy-up as
> > necessary.
> >
> > While it doesn't make sense to use fscaps on directories, nothing in the
> > kernel actually prevents setting or getting fscaps xattrs for directory
> > inodes. If we omit fscaps handlers in ovl_dir_inode_operations then the
> > generic handlers will be used. These handlers will use the xattr inode
> > operations, bypassing any idmapping on lower mounts, so fscaps handlers
> > are also installed for ovl_dir_inode_operations.
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > ---
> > fs/overlayfs/dir.c | 3 ++
> > fs/overlayfs/inode.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/overlayfs/overlayfs.h | 6 ++++
> > 3 files changed, 93 insertions(+)
> >
> > diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> > index aab3f5d93556..d9ab3c9ce10a 100644
> > --- a/fs/overlayfs/dir.c
> > +++ b/fs/overlayfs/dir.c
> > @@ -1303,6 +1303,9 @@ const struct inode_operations ovl_dir_inode_operations = {
> > .get_inode_acl = ovl_get_inode_acl,
> > .get_acl = ovl_get_acl,
> > .set_acl = ovl_set_acl,
> > + .get_fscaps = ovl_get_fscaps,
> > + .set_fscaps = ovl_set_fscaps,
> > + .remove_fscaps = ovl_remove_fscaps,
> > .update_time = ovl_update_time,
> > .fileattr_get = ovl_fileattr_get,
> > .fileattr_set = ovl_fileattr_set,
> > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> > index c63b31a460be..82fc6e479d45 100644
> > --- a/fs/overlayfs/inode.c
> > +++ b/fs/overlayfs/inode.c
> > @@ -568,6 +568,87 @@ int ovl_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
> > }
> > #endif
> >
> > +int ovl_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + struct vfs_caps *caps)
> > +{
> > + int err;
> > + const struct cred *old_cred;
> > + struct path realpath;
> > +
> > + ovl_path_real(dentry, &realpath);
> > + old_cred = ovl_override_creds(dentry->d_sb);
> > + err = vfs_get_fscaps(mnt_idmap(realpath.mnt), realpath.dentry, caps);
> > + revert_creds(old_cred);
> > + return err;
> > +}
> > +
> > +int ovl_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + const struct vfs_caps *caps, int flags)
> > +{
> > + int err;
> > + struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
> > + struct dentry *upperdentry = ovl_dentry_upper(dentry);
> > + struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
> > + const struct cred *old_cred;
> > +
> > + err = ovl_want_write(dentry);
> > + if (err)
> > + goto out;
> > +
> > + if (!upperdentry) {
> > + err = ovl_copy_up(dentry);
> > + if (err)
> > + goto out_drop_write;
> > +
> > + realdentry = ovl_dentry_upper(dentry);
> > + }
> > +
> > + old_cred = ovl_override_creds(dentry->d_sb);
> > + err = vfs_set_fscaps(ovl_upper_mnt_idmap(ofs), realdentry, caps, flags);
> > + revert_creds(old_cred);
> > +
> > + /* copy c/mtime */
> > + ovl_copyattr(d_inode(dentry));
> > +
> > +out_drop_write:
> > + ovl_drop_write(dentry);
> > +out:
> > + return err;
> > +}
> > +
> > +int ovl_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
> > +{
> > + int err;
> > + struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
> > + struct dentry *upperdentry = ovl_dentry_upper(dentry);
> > + struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
> > + const struct cred *old_cred;
> > +
> > + err = ovl_want_write(dentry);
> > + if (err)
> > + goto out;
> > +
> > + if (!upperdentry) {
> > + err = ovl_copy_up(dentry);
> > + if (err)
> > + goto out_drop_write;
> > +
> > + realdentry = ovl_dentry_upper(dentry);
> > + }
> > +
>
> This construct is peculiar.
> Most of the operations just do this unconditionally:
>
> err = ovl_copy_up(dentry);
> if (err)
> goto out_drop_write;
>
> and then use ovl_dentry_upper(dentry) directly, because a modification
> will always be done on the upper dentry, regardless of the state before
> the operation started.
>
> I was wondering where you copied this from and I found it right above
> in ovl_set_or_remove_acl().
> In that case, there was also no justification for this construct.

Yes, I did use ovl_set_or_remove_acl() as a reference here. But looking
around at some of the other code, I see what you mean, so I'll rework
this code.

> There is also no justification for open coding:
> struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
> when later on, ovl_path_lower(dentry, &realpath) is called anyway.
>
> The only reason to do anything special in ovl_set_or_remove_acl() is:
>
> /*
> * If ACL is to be removed from a lower file, check if it exists in
> * the first place before copying it up.
> */
>
> Do you not want to do the same for ovl_remove_fscaps()?

Yes, I will add this.

> Also, the comparison to remove_acl API bares the question,
> why did you need to add a separate method for remove_fscaps?
> Why not use set_fscaps(NULL), just like setxattr() and set_acl() APIs?

I had started on these patches a while back and then picked them up
again recently, and honestly I don't remember why I did that originally.
I don't see a need for it now though, so I can change that.

2023-11-30 16:41:36

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 16/16] vfs: return -EOPNOTSUPP for fscaps from vfs_*xattr()

On Thu, Nov 30, 2023 at 08:10:15AM +0200, Amir Goldstein wrote:
> On Wed, Nov 29, 2023 at 11:51 PM Seth Forshee (DigitalOcean)
> <[email protected]> wrote:
> >
> > Now that the new vfs-level interfaces are fully supported and all code
> > has been converted to use them, stop permitting use of the top-level vfs
> > xattr interfaces for capabilities xattrs. Unlike with ACLs we still need
> > to be able to work with fscaps xattrs using lower-level interfaces in a
> > handful of places, so only use of the top-level xattr interfaces is
> > restricted.
>
> Can you explain why?
> Is there an inherent difference between ACLs and fscaps in that respect
> or is it just a matter of more work that needs to be done?

There are a number of differences. ACLs have caching, require additional
permission checks, and require a lot of filesystem-specific handling.
fscaps are simpler by comparison, and most filesystems can rely on a
common implementation that just converts to/from raw disk xattrs.

So at minimum I think the lowest level interfaces,
__vfs_{get,set,remove}xattr(), need to continue to allow fscaps, and
that's where ACL xattrs are blocked. Allowing some of the others to
still work with them is a matter of convenience (e.g. using
vfs_getxattr_alloc()) and trying to reduce code duplication. But as you
pointed out I did miss at least duplicating fsnotify_xattr(), so I'm
going to have another look at how I implemented these.

>
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > ---
> > fs/xattr.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 372644b15457..4b779779ad8c 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -540,6 +540,9 @@ vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
> > const void *orig_value = value;
> > int error;
> >
> > + if (!strcmp(name, XATTR_NAME_CAPS))
> > + return -EOPNOTSUPP;
> > +
>
> It this is really not expected, then it should be an assert and
> please use an inline helper like is_posix_acl_xattr():
>
> if (WARN_ON_ONCE(is_fscaps_xattr(name)))

Ack, makes sense.

2023-11-30 16:44:13

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 12/16] ovl: use vfs_{get,set}_fscaps() for copy-up

On Thu, Nov 30, 2023 at 08:23:28AM +0200, Amir Goldstein wrote:
> On Wed, Nov 29, 2023 at 11:50 PM Seth Forshee (DigitalOcean)
> <[email protected]> wrote:
> >
> > Using vfs_{get,set}xattr() for fscaps will be blocked in a future
> > commit, so convert ovl to use the new interfaces. Also remove the now
> > unused ovl_getxattr_value().
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
>
> You may add:
>
> Reviewed-by: Amir Goldstein <[email protected]>

Thanks!

> I am assuming that this work is destined to be merged via the vfs tree?
> Note that there is already a (non-conflicting) patch to copy_up.c on
> Christian's vfs.rw branch.

I'll leave that up to Christian. There are also other mnt_idmapping.h
changes on vfs.misc which could cause (probably minor) conflicts.

2023-12-01 15:50:48

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 03/16] capability: rename cpu_vfs_cap_data to vfs_caps

On Wed, Nov 29, 2023 at 03:50:21PM -0600, Seth Forshee (DigitalOcean) wrote:
> vfs_caps is a more generic name which is better suited to the broader
> use this struct will see in subsequent commits.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---

Yep, looks good to me,
Reviewed-by: Christian Brauner <[email protected]>

2023-12-01 16:41:47

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 05/16] capability: provide helpers for converting between xattrs and vfs_caps

On Wed, Nov 29, 2023 at 03:50:23PM -0600, Seth Forshee (DigitalOcean) wrote:
> To pass around vfs_caps instead of raw xattr data we will need to
> convert between the two representations near userspace and disk
> boundaries. We already convert xattrs from disks to vfs_caps, so move
> that code into a helper, and change get_vfs_caps_from_disk() to use the
> helper.
>
> When converting vfs_caps to xattrs we have different considerations
> depending on the destination of the xattr data. For xattrs which will be
> written to disk we need to reject the xattr if the rootid does not map
> into the filesystem's user namespace, whereas xattrs read by userspace
> may need to undergo a conversion from v3 to v2 format when the rootid
> does not map. So this helper is split into an internal and an external
> interface. The internal interface does not return an error if the rootid
> has no mapping in the target user namespace and will be used for
> conversions targeting userspace. The external interface returns
> EOVERFLOW if the rootid has no mapping and will be used for all other
> conversions.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> include/linux/capability.h | 10 ++
> security/commoncap.c | 227 +++++++++++++++++++++++++++++++++++----------
> 2 files changed, 186 insertions(+), 51 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index eb46d346bbbc..cdd7d2d8855e 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -209,6 +209,16 @@ static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
> ns_capable(ns, CAP_SYS_ADMIN);
> }
>
> +/* helpers to convert between xattr and in-kernel representations */
> +int vfs_caps_from_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *src_userns,
> + struct vfs_caps *vfs_caps,
> + const void *data, int size);
> +int vfs_caps_to_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *dest_userns,
> + const struct vfs_caps *vfs_caps,
> + void *data, int size);
> +
> /* audit system wants to get cap info from files as well */
> int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> const struct dentry *dentry,
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 3d045d377e5e..ef37966f3522 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -618,54 +618,41 @@ static inline int bprm_caps_from_vfs_caps(struct vfs_caps *caps,
> }
>
> /**
> - * get_vfs_caps_from_disk - retrieve vfs caps from disk
> + * vfs_caps_from_xattr - convert raw caps xattr data to vfs_caps
> *
> - * @idmap: idmap of the mount the inode was found from
> - * @dentry: dentry from which @inode is retrieved
> - * @cpu_caps: vfs capabilities
> + * @idmap: idmap of the mount the inode was found from
> + * @src_userns: user namespace for ids in xattr data
> + * @vfs_caps: destination buffer for vfs_caps data
> + * @data: rax xattr caps data
> + * @size: size of xattr data
> *
> - * Extract the on-exec-apply capability sets for an executable file.
> + * Converts a raw security.capability xattr into the kernel-internal
> + * capabilities format.
> *
> - * If the inode has been found through an idmapped mount the idmap of
> - * the vfsmount must be passed through @idmap. This function will then
> - * take care to map the inode according to @idmap before checking
> - * permissions. On non-idmapped mounts or if permission checking is to be
> - * performed on the raw inode simply pass @nop_mnt_idmap.
> + * If the xattr is being read or written through an idmapped mount the
> + * idmap of the vfsmount must be passed through @idmap. This function
> + * will then take care to map the rootid according to @idmap.
> + *
> + * Return: On success, return 0; on error, return < 0.
> */
> -int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> - const struct dentry *dentry,
> - struct vfs_caps *cpu_caps)
> +int vfs_caps_from_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *src_userns,
> + struct vfs_caps *vfs_caps,
> + const void *data, int size)
> {
> - struct inode *inode = d_backing_inode(dentry);
> __u32 magic_etc;
> - int size;
> - struct vfs_ns_cap_data data, *nscaps = &data;
> - struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> + const struct vfs_ns_cap_data *ns_caps = data;

The casting here is predicated on the compatibility of these two structs
and I'm kinda surprised to see no static assertions about their layout.
So I would recommend to add some to the header:

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 5bb906098697..0fd75aab9754 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -16,6 +16,10 @@

#include <linux/types.h>

+#ifdef __KERNEL__
+#include <linux/build_bug.h>
+#endif
+
/* User-level do most of the mapping between kernel and user
capabilities based on the version tag given by the kernel. The
kernel might be somewhat backwards compatible, but don't bet on
@@ -100,6 +104,15 @@ struct vfs_ns_cap_data {
#define _LINUX_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_1
#define _LINUX_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_1

+#else
+
+static_assert(offsetof(struct vfs_cap_data, magic_etc) ==
+ offsetof(struct vfs_ns_cap_data, magic_etc));
+static_assert(offsetof(struct vfs_cap_data, data) ==
+ offsetof(struct vfs_ns_cap_data, data));
+static_assert(sizeof(struct vfs_cap_data) ==
+ offsetof(struct vfs_ns_cap_data, rootid));
+
#endif


> + struct vfs_cap_data *caps = (struct vfs_cap_data *)ns_caps;
> kuid_t rootkuid;
> - vfsuid_t rootvfsuid;
> - struct user_namespace *fs_ns;
> -
> - memset(cpu_caps, 0, sizeof(struct vfs_caps));
> -
> - if (!inode)
> - return -ENODATA;
>
> - fs_ns = inode->i_sb->s_user_ns;
> - size = __vfs_getxattr((struct dentry *)dentry, inode,
> - XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
> - if (size == -ENODATA || size == -EOPNOTSUPP)
> - /* no data, that's ok */
> - return -ENODATA;
> -
> - if (size < 0)
> - return size;
> + memset(vfs_caps, 0, sizeof(*vfs_caps));
>
> if (size < sizeof(magic_etc))
> return -EINVAL;
>
> - cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
> + vfs_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
>
> - rootkuid = make_kuid(fs_ns, 0);
> + rootkuid = make_kuid(src_userns, 0);
> switch (magic_etc & VFS_CAP_REVISION_MASK) {
> case VFS_CAP_REVISION_1:
> if (size != XATTR_CAPS_SZ_1)
> @@ -678,39 +665,177 @@ int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> case VFS_CAP_REVISION_3:
> if (size != XATTR_CAPS_SZ_3)
> return -EINVAL;
> - rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
> + rootkuid = make_kuid(src_userns, le32_to_cpu(ns_caps->rootid));
> break;
>
> default:
> return -EINVAL;
> }
>
> - rootvfsuid = make_vfsuid(idmap, fs_ns, rootkuid);
> - if (!vfsuid_valid(rootvfsuid))
> - return -ENODATA;
> + vfs_caps->rootid = make_vfsuid(idmap, src_userns, rootkuid);
> + if (!vfsuid_valid(vfs_caps->rootid))
> + return -EOVERFLOW;
>
> - /* Limit the caps to the mounter of the filesystem
> - * or the more limited uid specified in the xattr.
> + vfs_caps->permitted.val = le32_to_cpu(caps->data[0].permitted);
> + vfs_caps->inheritable.val = le32_to_cpu(caps->data[0].inheritable);
> +
> + /*
> + * Rev1 had just a single 32-bit word, later expanded
> + * to a second one for the high bits
> */
> - if (!rootid_owns_currentns(rootvfsuid))
> - return -ENODATA;
> + if ((magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
> + vfs_caps->permitted.val += (u64)le32_to_cpu(caps->data[1].permitted) << 32;
> + vfs_caps->inheritable.val += (u64)le32_to_cpu(caps->data[1].inheritable) << 32;
> + }
> +
> + vfs_caps->permitted.val &= CAP_VALID_MASK;
> + vfs_caps->inheritable.val &= CAP_VALID_MASK;
> +
> + return 0;
> +}
> +
> +/*
> + * Inner implementation of vfs_caps_to_xattr() which does not return an
> + * error if the rootid does not map into @dest_userns.
> + */
> +static int __vfs_caps_to_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *dest_userns,
> + const struct vfs_caps *vfs_caps,
> + void *data, int size)
> +{
> + struct vfs_ns_cap_data *ns_caps = data;
> + struct vfs_cap_data *caps = (struct vfs_cap_data *)ns_caps;
> + kuid_t rootkuid;
> + uid_t rootid;
> +
> + memset(ns_caps, 0, size);
> +
> + rootid = 0;
> + switch (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) {
> + case VFS_CAP_REVISION_1:
> + if (size < XATTR_CAPS_SZ_1)
> + return -EINVAL;
> + size = XATTR_CAPS_SZ_1;
> + break;
> + case VFS_CAP_REVISION_2:
> + if (size < XATTR_CAPS_SZ_2)
> + return -EINVAL;
> + size = XATTR_CAPS_SZ_2;
> + break;
> + case VFS_CAP_REVISION_3:
> + if (size < XATTR_CAPS_SZ_3)
> + return -EINVAL;
> + size = XATTR_CAPS_SZ_3;
> + rootkuid = from_vfsuid(idmap, dest_userns, vfs_caps->rootid);
> + rootid = from_kuid(dest_userns, rootkuid);
> + ns_caps->rootid = cpu_to_le32(rootid);
> + break;
>
> - cpu_caps->permitted.val = le32_to_cpu(caps->data[0].permitted);
> - cpu_caps->inheritable.val = le32_to_cpu(caps->data[0].inheritable);
> + default:
> + return -EINVAL;
> + }
> +
> + caps->magic_etc = cpu_to_le32(vfs_caps->magic_etc);
> +
> + caps->data[0].permitted = cpu_to_le32(lower_32_bits(vfs_caps->permitted.val));
> + caps->data[0].inheritable = cpu_to_le32(lower_32_bits(vfs_caps->inheritable.val));
>
> /*
> * Rev1 had just a single 32-bit word, later expanded
> * to a second one for the high bits
> */
> - if ((magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
> - cpu_caps->permitted.val += (u64)le32_to_cpu(caps->data[1].permitted) << 32;
> - cpu_caps->inheritable.val += (u64)le32_to_cpu(caps->data[1].inheritable) << 32;
> + if ((vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
> + caps->data[1].permitted =
> + cpu_to_le32(upper_32_bits(vfs_caps->permitted.val));
> + caps->data[1].inheritable =
> + cpu_to_le32(upper_32_bits(vfs_caps->inheritable.val));
> }
>
> - cpu_caps->permitted.val &= CAP_VALID_MASK;
> - cpu_caps->inheritable.val &= CAP_VALID_MASK;
> + return size;
> +}
> +
> +
> +/**
> + * vfs_caps_to_xattr - convert vfs_caps to raw caps xattr data
> + *
> + * @idmap: idmap of the mount the inode was found from
> + * @dest_userns: user namespace for ids in xattr data
> + * @vfs_caps: source vfs_caps data
> + * @data: destination buffer for rax xattr caps data
> + * @size: size of the @data buffer
> + *
> + * Converts a kernel-interrnal capability into the raw security.capability

s/interrnal/internal/

> + * xattr format.
> + *
> + * If the xattr is being read or written through an idmapped mount the
> + * idmap of the vfsmount must be passed through @idmap. This function
> + * will then take care to map the rootid according to @idmap.
> + *
> + * Return: On success, return 0; on error, return < 0.
> + */
> +int vfs_caps_to_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *dest_userns,
> + const struct vfs_caps *vfs_caps,
> + void *data, int size)
> +{
> + struct vfs_ns_cap_data *caps = data;
> + int ret;
> +
> + ret = __vfs_caps_to_xattr(idmap, dest_userns, vfs_caps, data, size);
> + if (ret > 0 &&
> + (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3 &&
> + le32_to_cpu(caps->rootid) == (uid_t)-1)
> + return -EOVERFLOW;

I think the return value situation of these two helper is a bit
confusing. So if they return the size or a negative error pointer the
return value of both functions should likely be ssize_t.

But unless you actually later use the size in the callers of these
helpers it would be easier to just stick with 0 on success, negative
errno on failure. Note that that's what vfs_caps_from_xattr() is doing.

I'll see whether that becomes relevant later in the series though.

> + return ret;
> +}
> +
> +/**
> + * get_vfs_caps_from_disk - retrieve vfs caps from disk
> + *
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: dentry from which @inode is retrieved
> + * @cpu_caps: vfs capabilities
> + *
> + * Extract the on-exec-apply capability sets for an executable file.
> + *
> + * If the inode has been found through an idmapped mount the idmap of
> + * the vfsmount must be passed through @idmap. This function will then
> + * take care to map the inode according to @idmap before checking
> + * permissions. On non-idmapped mounts or if permission checking is to be
> + * performed on the raw inode simply pass @nop_mnt_idmap.
> + */
> +int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> + const struct dentry *dentry,
> + struct vfs_caps *cpu_caps)
> +{
> + struct inode *inode = d_backing_inode(dentry);
> + int size, ret;
> + struct vfs_ns_cap_data data, *nscaps = &data;
> +
> + if (!inode)
> + return -ENODATA;
>
> - cpu_caps->rootid = rootvfsuid;
> + size = __vfs_getxattr((struct dentry *)dentry, inode,
> + XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
> + if (size == -ENODATA || size == -EOPNOTSUPP)
> + /* no data, that's ok */
> + return -ENODATA;
> +
> + if (size < 0)
> + return size;
> +
> + ret = vfs_caps_from_xattr(idmap, inode->i_sb->s_user_ns,
> + cpu_caps, nscaps, size);
> + if (ret == -EOVERFLOW)
> + return -ENODATA;
> + if (ret)
> + return ret;
> +
> + /* Limit the caps to the mounter of the filesystem
> + * or the more limited uid specified in the xattr.
> + */
> + if (!rootid_owns_currentns(cpu_caps->rootid))
> + return -ENODATA;
>
> return 0;
> }
>
> --
> 2.43.0
>

2023-12-01 16:58:01

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 06/16] capability: provide a helper for converting vfs_caps to xattr for userspace

On Wed, Nov 29, 2023 at 03:50:24PM -0600, Seth Forshee (DigitalOcean) wrote:
> cap_inode_getsecurity() implements a handful of policies for capability
> xattrs read by userspace:
>
> - It returns EINVAL if the on-disk capability is in v1 format.
>
> - It masks off all bits in magic_etc except for the version and
> VFS_CAP_FLAGS_EFFECTIVE.
>
> - v3 capabilities are converted to v2 format if the rootid returned to
> userspace would be 0 or if the rootid corresponds to root in an
> ancestor user namespace.
>
> - It returns EOVERFLOW for a v3 capability whose rootid does not map to
> a valid id in current_user_ns() or to root in an ancestor namespace.

Nice. Precise and clear, please just drop these bullet points into the
kernel-doc of that function.

>
> These policies must be maintained when converting vfs_caps to an xattr
> for userspace. Provide a vfs_caps_to_user_xattr() helper which will
> enforce these policies.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> include/linux/capability.h | 4 +++
> security/commoncap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 72 insertions(+)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index cdd7d2d8855e..c0bd9447685b 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -218,6 +218,10 @@ int vfs_caps_to_xattr(struct mnt_idmap *idmap,
> struct user_namespace *dest_userns,
> const struct vfs_caps *vfs_caps,
> void *data, int size);
> +int vfs_caps_to_user_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *dest_userns,
> + const struct vfs_caps *vfs_caps,
> + void *data, int size);
>
> /* audit system wants to get cap info from files as well */
> int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> diff --git a/security/commoncap.c b/security/commoncap.c
> index ef37966f3522..c645330f83a0 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -789,6 +789,74 @@ int vfs_caps_to_xattr(struct mnt_idmap *idmap,
> return ret;
> }
>
> +/**
> + * vfs_caps_to_user_xattr - convert vfs_caps to caps xattr for userspace
> + *
> + * @idmap: idmap of the mount the inode was found from
> + * @dest_userns: user namespace for ids in xattr data
> + * @vfs_caps: source vfs_caps data
> + * @data: destination buffer for rax xattr caps data
> + * @size: size of the @data buffer
> + *
> + * Converts a kernel-interrnal capability into the raw security.capability
> + * xattr format. Includes permission checking and v2->v3 conversion as
> + * appropriate.
> + *
> + * If the xattr is being read or written through an idmapped mount the
> + * idmap of the vfsmount must be passed through @idmap. This function
> + * will then take care to map the rootid according to @idmap.
> + *
> + * Return: On success, return 0; on error, return < 0.
> + */
> +int vfs_caps_to_user_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *dest_userns,
> + const struct vfs_caps *vfs_caps,
> + void *data, int size)
> +{
> + struct vfs_ns_cap_data *ns_caps = data;
> + bool is_v3;
> + u32 magic;
> +
> + /* Preserve previous behavior of returning EINVAL for v1 caps */
> + if ((vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_1)
> + return -EINVAL;
> +
> + size = __vfs_caps_to_xattr(idmap, dest_userns, vfs_caps, data, size);
> + if (size < 0)
> + return size;
> +
> + magic = vfs_caps->magic_etc &
> + (VFS_CAP_REVISION_MASK | VFS_CAP_FLAGS_EFFECTIVE);
> + ns_caps->magic_etc = cpu_to_le32(magic);
> +
> + /*
> + * If this is a v3 capability with a valid, non-zero rootid, return
> + * the v3 capability to userspace. A v3 capability with a rootid of
> + * 0 will be converted to a v2 capability below for compatibility
> + * with old userspace.
> + */
> + is_v3 = (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3;
> + if (is_v3) {
> + uid_t rootid = le32_to_cpu(ns_caps->rootid);
> + if (rootid != (uid_t)-1 && rootid != (uid_t)0)
> + return size;
> + }
> +
> + if (!rootid_owns_currentns(vfs_caps->rootid))
> + return -EOVERFLOW;

For a v2 cap that we read vfs_caps->rootid will be vfsuid 0, right?
So that means we're guaranteed to resolve that in the initial user
namespace. IOW, rootid_owns_currentns() will indeed work with a pure v2
cap. Ok. Just making sure that I understand that this won't cause
EOVERFLOW for v2. But you would've likely seen that in tests right away.

> +
> + /* This comes from a parent namespace. Return as a v2 capability. */
> + if (is_v3) {
> + magic = VFS_CAP_REVISION_2 |
> + (vfs_caps->magic_etc & VFS_CAP_FLAGS_EFFECTIVE);
> + ns_caps->magic_etc = cpu_to_le32(magic);
> + ns_caps->rootid = cpu_to_le32(0);
> + size = XATTR_CAPS_SZ_2;
> + }
> +
> + return size;
> +}
> +
> /**
> * get_vfs_caps_from_disk - retrieve vfs caps from disk
> *
>
> --
> 2.43.0
>

2023-12-01 17:03:18

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 07/16] fs: add inode operations to get/set/remove fscaps

On Wed, Nov 29, 2023 at 03:50:25PM -0600, Seth Forshee (DigitalOcean) wrote:
> Add inode operations for getting, setting and removing filesystem
> capabilities rather than passing around raw xattr data. This provides
> better type safety for ids contained within xattrs.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> include/linux/fs.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 98b7a7a8c42e..a0a77f67b999 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2002,6 +2002,11 @@ struct inode_operations {
> int);
> int (*set_acl)(struct mnt_idmap *, struct dentry *,
> struct posix_acl *, int);
> + int (*get_fscaps)(struct mnt_idmap *, struct dentry *,
> + struct vfs_caps *);
> + int (*set_fscaps)(struct mnt_idmap *, struct dentry *,
> + const struct vfs_caps *, int flags);

If it's really a flags argument, then unsigned int, please,
Reviewed-by: Christian Brauner <[email protected]>

> + int (*remove_fscaps)(struct mnt_idmap *, struct dentry *);
> int (*fileattr_set)(struct mnt_idmap *idmap,
> struct dentry *dentry, struct fileattr *fa);
> int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);

Ofc we managed to add get/set_foo() and bar_get/set().

2023-12-01 17:09:50

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 05/16] capability: provide helpers for converting between xattrs and vfs_caps

On Fri, Dec 01, 2023 at 05:41:19PM +0100, Christian Brauner wrote:
> > /**
> > - * get_vfs_caps_from_disk - retrieve vfs caps from disk
> > + * vfs_caps_from_xattr - convert raw caps xattr data to vfs_caps
> > *
> > - * @idmap: idmap of the mount the inode was found from
> > - * @dentry: dentry from which @inode is retrieved
> > - * @cpu_caps: vfs capabilities
> > + * @idmap: idmap of the mount the inode was found from
> > + * @src_userns: user namespace for ids in xattr data
> > + * @vfs_caps: destination buffer for vfs_caps data
> > + * @data: rax xattr caps data
> > + * @size: size of xattr data
> > *
> > - * Extract the on-exec-apply capability sets for an executable file.
> > + * Converts a raw security.capability xattr into the kernel-internal
> > + * capabilities format.
> > *
> > - * If the inode has been found through an idmapped mount the idmap of
> > - * the vfsmount must be passed through @idmap. This function will then
> > - * take care to map the inode according to @idmap before checking
> > - * permissions. On non-idmapped mounts or if permission checking is to be
> > - * performed on the raw inode simply pass @nop_mnt_idmap.
> > + * If the xattr is being read or written through an idmapped mount the
> > + * idmap of the vfsmount must be passed through @idmap. This function
> > + * will then take care to map the rootid according to @idmap.
> > + *
> > + * Return: On success, return 0; on error, return < 0.
> > */
> > -int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> > - const struct dentry *dentry,
> > - struct vfs_caps *cpu_caps)
> > +int vfs_caps_from_xattr(struct mnt_idmap *idmap,
> > + struct user_namespace *src_userns,
> > + struct vfs_caps *vfs_caps,
> > + const void *data, int size)
> > {
> > - struct inode *inode = d_backing_inode(dentry);
> > __u32 magic_etc;
> > - int size;
> > - struct vfs_ns_cap_data data, *nscaps = &data;
> > - struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> > + const struct vfs_ns_cap_data *ns_caps = data;
>
> The casting here is predicated on the compatibility of these two structs
> and I'm kinda surprised to see no static assertions about their layout.
> So I would recommend to add some to the header:
>
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 5bb906098697..0fd75aab9754 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -16,6 +16,10 @@
>
> #include <linux/types.h>
>
> +#ifdef __KERNEL__
> +#include <linux/build_bug.h>
> +#endif
> +
> /* User-level do most of the mapping between kernel and user
> capabilities based on the version tag given by the kernel. The
> kernel might be somewhat backwards compatible, but don't bet on
> @@ -100,6 +104,15 @@ struct vfs_ns_cap_data {
> #define _LINUX_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_1
> #define _LINUX_CAPABILITY_U32S _LINUX_CAPABILITY_U32S_1
>
> +#else
> +
> +static_assert(offsetof(struct vfs_cap_data, magic_etc) ==
> + offsetof(struct vfs_ns_cap_data, magic_etc));
> +static_assert(offsetof(struct vfs_cap_data, data) ==
> + offsetof(struct vfs_ns_cap_data, data));
> +static_assert(sizeof(struct vfs_cap_data) ==
> + offsetof(struct vfs_ns_cap_data, rootid));
> +
> #endif

It's a little orthogonal to the changes, but I can certainly add a patch
for this.

> > +/**
> > + * vfs_caps_to_xattr - convert vfs_caps to raw caps xattr data
> > + *
> > + * @idmap: idmap of the mount the inode was found from
> > + * @dest_userns: user namespace for ids in xattr data
> > + * @vfs_caps: source vfs_caps data
> > + * @data: destination buffer for rax xattr caps data
> > + * @size: size of the @data buffer
> > + *
> > + * Converts a kernel-interrnal capability into the raw security.capability
>
> s/interrnal/internal/
>
> > + * xattr format.
> > + *
> > + * If the xattr is being read or written through an idmapped mount the
> > + * idmap of the vfsmount must be passed through @idmap. This function
> > + * will then take care to map the rootid according to @idmap.
> > + *
> > + * Return: On success, return 0; on error, return < 0.
> > + */
> > +int vfs_caps_to_xattr(struct mnt_idmap *idmap,
> > + struct user_namespace *dest_userns,
> > + const struct vfs_caps *vfs_caps,
> > + void *data, int size)
> > +{
> > + struct vfs_ns_cap_data *caps = data;
> > + int ret;
> > +
> > + ret = __vfs_caps_to_xattr(idmap, dest_userns, vfs_caps, data, size);
> > + if (ret > 0 &&
> > + (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3 &&
> > + le32_to_cpu(caps->rootid) == (uid_t)-1)
> > + return -EOVERFLOW;
>
> I think the return value situation of these two helper is a bit
> confusing. So if they return the size or a negative error pointer the
> return value of both functions should likely be ssize_t.
>
> But unless you actually later use the size in the callers of these
> helpers it would be easier to just stick with 0 on success, negative
> errno on failure. Note that that's what vfs_caps_from_xattr() is doing.
>
> I'll see whether that becomes relevant later in the series though.

Size is relevant since the different versions have different xattr
sizes, and callers need to know how much data to write to disk or copy
to userspace. ssize_t probably is better though, so I'll change it.

2023-12-01 17:10:03

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 08/16] fs: add vfs_get_fscaps()

On Wed, Nov 29, 2023 at 03:50:26PM -0600, Seth Forshee (DigitalOcean) wrote:
> Provide a type-safe interface for retrieving filesystem capabilities and
> a generic implementation suitable for most filesystems. Also add an
> internal interface, __vfs_get_fscaps(), which skips security checks for
> later use from the capability code.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> fs/xattr.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 4 ++++
> 2 files changed, 70 insertions(+)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 09d927603433..3abaf9bef0a5 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -181,6 +181,72 @@ xattr_supports_user_prefix(struct inode *inode)
> }
> EXPORT_SYMBOL(xattr_supports_user_prefix);
>
> +static int generic_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + struct vfs_caps *caps)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct vfs_ns_cap_data *nscaps = NULL;
> + int ret;
> +
> + ret = (int)vfs_getxattr_alloc(idmap, dentry, XATTR_NAME_CAPS,

I don't think you need that case here.

> + (char **)&nscaps, 0, GFP_NOFS);
> +
> + if (ret >= 0)
> + ret = vfs_caps_from_xattr(idmap, i_user_ns(inode), caps, nscaps, ret);
> +
> + kfree(nscaps);
> + return ret;
> +}
> +
> +/**
> + * __vfs_get_fscaps - get filesystem capabilities without security checks
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry from which to get filesystem capabilities
> + * @caps: storage in which to return the filesystem capabilities
> + *
> + * This function gets the filesystem capabilities for the dentry and returns
> + * them in @caps. It does not perform security checks.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int __vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + struct vfs_caps *caps)

I would rename that to vfs_get_fscaps_nosec(). We do that for
vfs_getxattr_nosec() as well. It's not pretty but it's better than just
slapping underscores onto it imo.

2023-12-01 17:26:47

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 06/16] capability: provide a helper for converting vfs_caps to xattr for userspace

On Fri, Dec 01, 2023 at 05:57:35PM +0100, Christian Brauner wrote:
> On Wed, Nov 29, 2023 at 03:50:24PM -0600, Seth Forshee (DigitalOcean) wrote:
> > cap_inode_getsecurity() implements a handful of policies for capability
> > xattrs read by userspace:
> >
> > - It returns EINVAL if the on-disk capability is in v1 format.
> >
> > - It masks off all bits in magic_etc except for the version and
> > VFS_CAP_FLAGS_EFFECTIVE.
> >
> > - v3 capabilities are converted to v2 format if the rootid returned to
> > userspace would be 0 or if the rootid corresponds to root in an
> > ancestor user namespace.
> >
> > - It returns EOVERFLOW for a v3 capability whose rootid does not map to
> > a valid id in current_user_ns() or to root in an ancestor namespace.
>
> Nice. Precise and clear, please just drop these bullet points into the
> kernel-doc of that function.

Will do.

> > +/**
> > + * vfs_caps_to_user_xattr - convert vfs_caps to caps xattr for userspace
> > + *
> > + * @idmap: idmap of the mount the inode was found from
> > + * @dest_userns: user namespace for ids in xattr data
> > + * @vfs_caps: source vfs_caps data
> > + * @data: destination buffer for rax xattr caps data
> > + * @size: size of the @data buffer
> > + *
> > + * Converts a kernel-interrnal capability into the raw security.capability
> > + * xattr format. Includes permission checking and v2->v3 conversion as
> > + * appropriate.
> > + *
> > + * If the xattr is being read or written through an idmapped mount the
> > + * idmap of the vfsmount must be passed through @idmap. This function
> > + * will then take care to map the rootid according to @idmap.
> > + *
> > + * Return: On success, return 0; on error, return < 0.
> > + */
> > +int vfs_caps_to_user_xattr(struct mnt_idmap *idmap,
> > + struct user_namespace *dest_userns,
> > + const struct vfs_caps *vfs_caps,
> > + void *data, int size)
> > +{
> > + struct vfs_ns_cap_data *ns_caps = data;
> > + bool is_v3;
> > + u32 magic;
> > +
> > + /* Preserve previous behavior of returning EINVAL for v1 caps */
> > + if ((vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_1)
> > + return -EINVAL;
> > +
> > + size = __vfs_caps_to_xattr(idmap, dest_userns, vfs_caps, data, size);
> > + if (size < 0)
> > + return size;
> > +
> > + magic = vfs_caps->magic_etc &
> > + (VFS_CAP_REVISION_MASK | VFS_CAP_FLAGS_EFFECTIVE);
> > + ns_caps->magic_etc = cpu_to_le32(magic);
> > +
> > + /*
> > + * If this is a v3 capability with a valid, non-zero rootid, return
> > + * the v3 capability to userspace. A v3 capability with a rootid of
> > + * 0 will be converted to a v2 capability below for compatibility
> > + * with old userspace.
> > + */
> > + is_v3 = (vfs_caps->magic_etc & VFS_CAP_REVISION_MASK) == VFS_CAP_REVISION_3;
> > + if (is_v3) {
> > + uid_t rootid = le32_to_cpu(ns_caps->rootid);
> > + if (rootid != (uid_t)-1 && rootid != (uid_t)0)
> > + return size;
> > + }
> > +
> > + if (!rootid_owns_currentns(vfs_caps->rootid))
> > + return -EOVERFLOW;
>
> For a v2 cap that we read vfs_caps->rootid will be vfsuid 0, right?
> So that means we're guaranteed to resolve that in the initial user
> namespace. IOW, rootid_owns_currentns() will indeed work with a pure v2
> cap. Ok. Just making sure that I understand that this won't cause
> EOVERFLOW for v2. But you would've likely seen that in tests right away.

Yes, that's all correct.

2023-12-01 17:38:46

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 07/16] fs: add inode operations to get/set/remove fscaps

On Fri, Dec 01, 2023 at 06:02:55PM +0100, Christian Brauner wrote:
> On Wed, Nov 29, 2023 at 03:50:25PM -0600, Seth Forshee (DigitalOcean) wrote:
> > Add inode operations for getting, setting and removing filesystem
> > capabilities rather than passing around raw xattr data. This provides
> > better type safety for ids contained within xattrs.
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > ---
> > include/linux/fs.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 98b7a7a8c42e..a0a77f67b999 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2002,6 +2002,11 @@ struct inode_operations {
> > int);
> > int (*set_acl)(struct mnt_idmap *, struct dentry *,
> > struct posix_acl *, int);
> > + int (*get_fscaps)(struct mnt_idmap *, struct dentry *,
> > + struct vfs_caps *);
> > + int (*set_fscaps)(struct mnt_idmap *, struct dentry *,
> > + const struct vfs_caps *, int flags);
>
> If it's really a flags argument, then unsigned int, please,

This is the flags for setxattr, which is an int everywhere. Or almost
everywhere; I just noticed that it is actually an unsigned int in struct
xattr_ctx. But for consistency I think it makes sense to have it be an
int here too. Though maybe naming it setxattr_flags would be helpful for
clarity.

2023-12-01 17:39:32

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 09/16] fs: add vfs_set_fscaps()

On Wed, Nov 29, 2023 at 03:50:27PM -0600, Seth Forshee (DigitalOcean) wrote:
> Provide a type-safe interface for setting filesystem capabilities and a
> generic implementation suitable for most filesystems.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> fs/xattr.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 2 files changed, 89 insertions(+)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 3abaf9bef0a5..03cc824e4f87 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -247,6 +247,93 @@ int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> }
> EXPORT_SYMBOL(vfs_get_fscaps);
>
> +static int generic_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + const struct vfs_caps *caps, int flags)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct vfs_ns_cap_data nscaps;
> + int size;
> +
> + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps,
> + &nscaps, sizeof(nscaps));
> + if (size < 0)
> + return size;

I see, here the size is relevant. Ok, but then please make the
lower-level helper use ssize_t.

> +
> + return __vfs_setxattr_noperm(idmap, dentry, XATTR_NAME_CAPS,
> + &nscaps, size, flags);
> +}
> +
> +/**
> + * vfs_set_fscaps - set filesystem capabilities
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry on which to set filesystem capabilities
> + * @caps: the filesystem capabilities to be written
> + * @flags: setxattr flags to use when writing the capabilities xattr
> + *
> + * This function writes the supplied filesystem capabilities to the dentry.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> + const struct vfs_caps *caps, int flags)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct inode *delegated_inode = NULL;
> + struct vfs_ns_cap_data nscaps;
> + int size, error;
> +
> + /*
> + * Unfortunately EVM wants to have the raw xattr value to compare to
> + * the on-disk version, so we need to pass the raw xattr to the
> + * security hooks. But we also want to do security checks before
> + * breaking leases, so that means a conversion to the raw xattr here
> + * which will usually be reduntant with the conversion we do for
> + * writing the xattr to disk.
> + */
> + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
> + sizeof(nscaps));
> + if (size < 0)
> + return size;

Oh right, I remember that. Slight eyeroll. See below though...

> +
> +retry_deleg:
> + inode_lock(inode);
> +
> + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> + if (error)
> + goto out_inode_unlock;
> + error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
> + size, flags);
> + if (error)
> + goto out_inode_unlock;

For posix acls I added dedicated security hooks that take the struct
posix_acl stuff and then plumb that down into the security modules. You
could do the same thing here and then just force EVM and others to do
their own conversion from in-kernel to xattr format, instead of forcing
the VFS to do this.

Because right now we make everyone pay the price all the time when
really EVM should pay that price and this whole unpleasantness.

> +
> + error = try_break_deleg(inode, &delegated_inode);
> + if (error)
> + goto out_inode_unlock;
> +
> + if (inode->i_opflags & IOP_XATTR) {

So I'm trying to remember the details how I did this for POSIX ACLs in
commit e499214ce3ef ("acl: don't depend on IOP_XATTR"). I think what you
did here is correct because you need to have an xattr handler for
fscaps currently. IOW, it isn't purely based on inode operations.

And here starts the hate mail in so far as you'll hate me for asking
this:

I think I asked this before when we talked about this but how feasible
would it be to move fscaps completely off of xattr handlers and purely
on inode operations for all filesystems?

Yes, that's a fairly large patchset but it would also be a pretty good
win because we avoid munging this from inode operations through xattr
handlers again which seems a bit ugly and what we really wanted to
avoid desperately with POSIX ACLs.

If this is feasible and you'd be up for it I wouldn't even mind doing
that in two steps. IOW, merge something like this first and them move
everyone off of their individual xattr handlers.

Could you quickly remind me whether there would be any issues with this?

2023-12-01 17:41:47

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 08/16] fs: add vfs_get_fscaps()

On Fri, Dec 01, 2023 at 06:09:36PM +0100, Christian Brauner wrote:
> On Wed, Nov 29, 2023 at 03:50:26PM -0600, Seth Forshee (DigitalOcean) wrote:
> > Provide a type-safe interface for retrieving filesystem capabilities and
> > a generic implementation suitable for most filesystems. Also add an
> > internal interface, __vfs_get_fscaps(), which skips security checks for
> > later use from the capability code.
> >
> > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > ---
> > fs/xattr.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/fs.h | 4 ++++
> > 2 files changed, 70 insertions(+)
> >
> > diff --git a/fs/xattr.c b/fs/xattr.c
> > index 09d927603433..3abaf9bef0a5 100644
> > --- a/fs/xattr.c
> > +++ b/fs/xattr.c
> > @@ -181,6 +181,72 @@ xattr_supports_user_prefix(struct inode *inode)
> > }
> > EXPORT_SYMBOL(xattr_supports_user_prefix);
> >
> > +static int generic_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + struct vfs_caps *caps)
> > +{
> > + struct inode *inode = d_inode(dentry);
> > + struct vfs_ns_cap_data *nscaps = NULL;
> > + int ret;
> > +
> > + ret = (int)vfs_getxattr_alloc(idmap, dentry, XATTR_NAME_CAPS,
>
> I don't think you need that case here.

Yep. I played with a few different implementations of this function, so
I'm guessing I did need it at one point and failed to notice when it was
no longer needed.

>
> > + (char **)&nscaps, 0, GFP_NOFS);
> > +
> > + if (ret >= 0)
> > + ret = vfs_caps_from_xattr(idmap, i_user_ns(inode), caps, nscaps, ret);
> > +
> > + kfree(nscaps);
> > + return ret;
> > +}
> > +
> > +/**
> > + * __vfs_get_fscaps - get filesystem capabilities without security checks
> > + * @idmap: idmap of the mount the inode was found from
> > + * @dentry: the dentry from which to get filesystem capabilities
> > + * @caps: storage in which to return the filesystem capabilities
> > + *
> > + * This function gets the filesystem capabilities for the dentry and returns
> > + * them in @caps. It does not perform security checks.
> > + *
> > + * Return: 0 on success, a negative errno on error.
> > + */
> > +int __vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + struct vfs_caps *caps)
>
> I would rename that to vfs_get_fscaps_nosec(). We do that for
> vfs_getxattr_nosec() as well. It's not pretty but it's better than just
> slapping underscores onto it imo.

Will do.

2023-12-01 18:18:16

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 09/16] fs: add vfs_set_fscaps()

On Fri, Dec 01, 2023 at 06:39:18PM +0100, Christian Brauner wrote:
> > +/**
> > + * vfs_set_fscaps - set filesystem capabilities
> > + * @idmap: idmap of the mount the inode was found from
> > + * @dentry: the dentry on which to set filesystem capabilities
> > + * @caps: the filesystem capabilities to be written
> > + * @flags: setxattr flags to use when writing the capabilities xattr
> > + *
> > + * This function writes the supplied filesystem capabilities to the dentry.
> > + *
> > + * Return: 0 on success, a negative errno on error.
> > + */
> > +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > + const struct vfs_caps *caps, int flags)
> > +{
> > + struct inode *inode = d_inode(dentry);
> > + struct inode *delegated_inode = NULL;
> > + struct vfs_ns_cap_data nscaps;
> > + int size, error;
> > +
> > + /*
> > + * Unfortunately EVM wants to have the raw xattr value to compare to
> > + * the on-disk version, so we need to pass the raw xattr to the
> > + * security hooks. But we also want to do security checks before
> > + * breaking leases, so that means a conversion to the raw xattr here
> > + * which will usually be reduntant with the conversion we do for
> > + * writing the xattr to disk.
> > + */
> > + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
> > + sizeof(nscaps));
> > + if (size < 0)
> > + return size;
>
> Oh right, I remember that. Slight eyeroll. See below though...
>
> > +
> > +retry_deleg:
> > + inode_lock(inode);
> > +
> > + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> > + if (error)
> > + goto out_inode_unlock;
> > + error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
> > + size, flags);
> > + if (error)
> > + goto out_inode_unlock;
>
> For posix acls I added dedicated security hooks that take the struct
> posix_acl stuff and then plumb that down into the security modules. You
> could do the same thing here and then just force EVM and others to do
> their own conversion from in-kernel to xattr format, instead of forcing
> the VFS to do this.
>
> Because right now we make everyone pay the price all the time when
> really EVM should pay that price and this whole unpleasantness.

Good point, I'll do that.

>
> > +
> > + error = try_break_deleg(inode, &delegated_inode);
> > + if (error)
> > + goto out_inode_unlock;
> > +
> > + if (inode->i_opflags & IOP_XATTR) {
>
> So I'm trying to remember the details how I did this for POSIX ACLs in
> commit e499214ce3ef ("acl: don't depend on IOP_XATTR"). I think what you
> did here is correct because you need to have an xattr handler for
> fscaps currently. IOW, it isn't purely based on inode operations.
>
> And here starts the hate mail in so far as you'll hate me for asking
> this:
>
> I think I asked this before when we talked about this but how feasible
> would it be to move fscaps completely off of xattr handlers and purely
> on inode operations for all filesystems?
>
> Yes, that's a fairly large patchset but it would also be a pretty good
> win because we avoid munging this from inode operations through xattr
> handlers again which seems a bit ugly and what we really wanted to
> avoid desperately with POSIX ACLs.
>
> If this is feasible and you'd be up for it I wouldn't even mind doing
> that in two steps. IOW, merge something like this first and them move
> everyone off of their individual xattr handlers.
>
> Could you quickly remind me whether there would be any issues with this?

It's certainly possible to do this. There wouldn't be any issues per se,
but there are some tradoffs to consider.

First, it's really only overlayfs that needs special handling. It seems
pretty unfortunate to make every filesystem provide its own
implementations which are virtually identical, which is what we'd need
to do if we want to completely avoid the xattr handlers. But we could
still provide a generic implementation that uses only
__vfs_{get,set}xattr(), and most filesystems could use those in their
inode ops. How does that sound?

The other drawback I see is needing to duplicate logic from the
{get,set}xattr codepaths into the fscaps codepaths and maintain them in
parallel. I was trying to avoid that as much as possible, but in the end
I had to duplicate some of the logic anyway. And as Amir pointed out I
did miss some things I needed to duplicate from the setxattr logic, so I
already need to revisit that code and probably pull in more of the
setxattr logic, so there may not be as much benefit here as I'd
originally hoped.

2023-12-05 11:51:39

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 07/16] fs: add inode operations to get/set/remove fscaps

On Fri, Dec 01, 2023 at 11:38:33AM -0600, Seth Forshee (DigitalOcean) wrote:
> On Fri, Dec 01, 2023 at 06:02:55PM +0100, Christian Brauner wrote:
> > On Wed, Nov 29, 2023 at 03:50:25PM -0600, Seth Forshee (DigitalOcean) wrote:
> > > Add inode operations for getting, setting and removing filesystem
> > > capabilities rather than passing around raw xattr data. This provides
> > > better type safety for ids contained within xattrs.
> > >
> > > Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> > > ---
> > > include/linux/fs.h | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 98b7a7a8c42e..a0a77f67b999 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2002,6 +2002,11 @@ struct inode_operations {
> > > int);
> > > int (*set_acl)(struct mnt_idmap *, struct dentry *,
> > > struct posix_acl *, int);
> > > + int (*get_fscaps)(struct mnt_idmap *, struct dentry *,
> > > + struct vfs_caps *);
> > > + int (*set_fscaps)(struct mnt_idmap *, struct dentry *,
> > > + const struct vfs_caps *, int flags);
> >
> > If it's really a flags argument, then unsigned int, please,
>
> This is the flags for setxattr, which is an int everywhere. Or almost

Ah right. Ugh, we should clean that up but not necessarily in this
series.

2023-12-05 21:25:49

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 3/16] capability: rename cpu_vfs_cap_data to vfs_caps

On Nov 29, 2023 "Seth Forshee (DigitalOcean)" <[email protected]> wrote:
>
> vfs_caps is a more generic name which is better suited to the broader
> use this struct will see in subsequent commits.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> Reviewed-by: Christian Brauner <[email protected]>
> ---
> include/linux/capability.h | 4 ++--
> kernel/auditsc.c | 4 ++--
> security/commoncap.c | 8 ++++----
> 3 files changed, 8 insertions(+), 8 deletions(-)

Bonus points in that the proposed name is shorter too :)

Technically you'll want to get Serge's ACK as he's the capabilities
maintainer, but with my LSM hat on this looks okay, and is pretty
trivial anyway.

Acked-by: Paul Moore <[email protected]> (Audit,LSM)

--
paul-moore.com

2023-12-05 21:26:24

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 14/16] commoncap: remove cap_inode_getsecurity()

On Nov 29, 2023 "Seth Forshee (DigitalOcean)" <[email protected]> wrote:
>
> Reading of fscaps xattrs is now done via vfs_get_fscaps(), so there is
> no longer any need to do it from security_inode_getsecurity(). Remove
> cap_inode_getsecurity() and its associated helpers which are now unused.
>
> We don't allow reading capabilities xattrs this way anyomre, so remove
> the handler and associated helpers.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> include/linux/security.h | 5 +-
> security/commoncap.c | 132 -----------------------------------------------
> 2 files changed, 1 insertion(+), 136 deletions(-)

Once again, you should get Serge's ACK on the commoncap.c stuff, but
no objections from a LSM perspective.

Acked-by: Paul Moore <[email protected]> (LSM)

--
paul-moore.com

2023-12-05 21:26:27

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 4/16] capability: use vfsuid_t for vfs_caps rootids

On Nov 29, 2023 "Seth Forshee (DigitalOcean)" <[email protected]> wrote:
>
> The rootid is a kuid_t, but it contains an id which maped into a mount
> idmapping, so it is really a vfsuid. This is confusing and creates
> potential for misuse of the value, so change it to vfsuid_t.
>
> Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
> ---
> include/linux/capability.h | 3 ++-
> kernel/auditsc.c | 5 +++--
> security/commoncap.c | 2 +-
> 3 files changed, 6 insertions(+), 4 deletions(-)

It might be nice if AS_KUIDT() and friends were named in such a way
as to indicate that they require a vfsuid_t parameter. At least the
call to __vfsuid_val() should flag a type mismatch if some other type
is used. Regardless, that is more of a general VFS issue and not a
problem specific to this patchset.

With the same understanding about the capabilities code and Serge ...

Acked-by: Paul Moore <[email protected]> (Audit,LSM)

> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index c24477e660fc..eb46d346bbbc 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -16,6 +16,7 @@
> #include <uapi/linux/capability.h>
> #include <linux/uidgid.h>
> #include <linux/bits.h>
> +#include <linux/vfsid.h>
>
> #define _KERNEL_CAPABILITY_VERSION _LINUX_CAPABILITY_VERSION_3
>
> @@ -26,7 +27,7 @@ typedef struct { u64 val; } kernel_cap_t;
> /* same as vfs_ns_cap_data but in cpu endian and always filled completely */
> struct vfs_caps {
> __u32 magic_etc;
> - kuid_t rootid;
> + vfsuid_t rootid;
> kernel_cap_t permitted;
> kernel_cap_t inheritable;
> };
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 783d0bf69ca5..65691450b080 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -65,6 +65,7 @@
> #include <uapi/linux/netfilter/nf_tables.h>
> #include <uapi/linux/openat2.h> // struct open_how
> #include <uapi/linux/fanotify.h>
> +#include <linux/mnt_idmapping.h>
>
> #include "audit.h"
>
> @@ -2260,7 +2261,7 @@ static inline int audit_copy_fcaps(struct audit_names *name,
> name->fcap.permitted = caps.permitted;
> name->fcap.inheritable = caps.inheritable;
> name->fcap.fE = !!(caps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE);
> - name->fcap.rootid = caps.rootid;
> + name->fcap.rootid = AS_KUIDT(caps.rootid);
> name->fcap_ver = (caps.magic_etc & VFS_CAP_REVISION_MASK) >>
> VFS_CAP_REVISION_SHIFT;
>
> @@ -2816,7 +2817,7 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm,
> ax->fcap.permitted = vcaps.permitted;
> ax->fcap.inheritable = vcaps.inheritable;
> ax->fcap.fE = !!(vcaps.magic_etc & VFS_CAP_FLAGS_EFFECTIVE);
> - ax->fcap.rootid = vcaps.rootid;
> + ax->fcap.rootid = AS_KUIDT(vcaps.rootid);
> ax->fcap_ver = (vcaps.magic_etc & VFS_CAP_REVISION_MASK) >> VFS_CAP_REVISION_SHIFT;
>
> ax->old_pcap.permitted = old->cap_permitted;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index cf130d81b8b4..3d045d377e5e 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -710,7 +710,7 @@ int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> cpu_caps->permitted.val &= CAP_VALID_MASK;
> cpu_caps->inheritable.val &= CAP_VALID_MASK;
>
> - cpu_caps->rootid = vfsuid_into_kuid(rootvfsuid);
> + cpu_caps->rootid = rootvfsuid;
>
> return 0;
> }
> --
> 2.43.0

--
paul-moore.com

2023-12-07 14:43:15

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 09/16] fs: add vfs_set_fscaps()

[Adding Mimi for insights on EVM questions]

On Fri, Dec 01, 2023 at 12:18:00PM -0600, Seth Forshee (DigitalOcean) wrote:
> On Fri, Dec 01, 2023 at 06:39:18PM +0100, Christian Brauner wrote:
> > > +/**
> > > + * vfs_set_fscaps - set filesystem capabilities
> > > + * @idmap: idmap of the mount the inode was found from
> > > + * @dentry: the dentry on which to set filesystem capabilities
> > > + * @caps: the filesystem capabilities to be written
> > > + * @flags: setxattr flags to use when writing the capabilities xattr
> > > + *
> > > + * This function writes the supplied filesystem capabilities to the dentry.
> > > + *
> > > + * Return: 0 on success, a negative errno on error.
> > > + */
> > > +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > > + const struct vfs_caps *caps, int flags)
> > > +{
> > > + struct inode *inode = d_inode(dentry);
> > > + struct inode *delegated_inode = NULL;
> > > + struct vfs_ns_cap_data nscaps;
> > > + int size, error;
> > > +
> > > + /*
> > > + * Unfortunately EVM wants to have the raw xattr value to compare to
> > > + * the on-disk version, so we need to pass the raw xattr to the
> > > + * security hooks. But we also want to do security checks before
> > > + * breaking leases, so that means a conversion to the raw xattr here
> > > + * which will usually be reduntant with the conversion we do for
> > > + * writing the xattr to disk.
> > > + */
> > > + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
> > > + sizeof(nscaps));
> > > + if (size < 0)
> > > + return size;
> >
> > Oh right, I remember that. Slight eyeroll. See below though...
> >
> > > +
> > > +retry_deleg:
> > > + inode_lock(inode);
> > > +
> > > + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> > > + if (error)
> > > + goto out_inode_unlock;
> > > + error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
> > > + size, flags);
> > > + if (error)
> > > + goto out_inode_unlock;
> >
> > For posix acls I added dedicated security hooks that take the struct
> > posix_acl stuff and then plumb that down into the security modules. You
> > could do the same thing here and then just force EVM and others to do
> > their own conversion from in-kernel to xattr format, instead of forcing
> > the VFS to do this.
> >
> > Because right now we make everyone pay the price all the time when
> > really EVM should pay that price and this whole unpleasantness.
>
> Good point, I'll do that.

I've been reconsidering various approaches here. One thing I noticed is
that for the non-generic case (iow overlayfs) I missed calling
security_inode_post_setxattr(), where EVM also wants the raw xattr, so
that would require another conversion. That got me wondering whether the
setxattr security hooks really matter when writing fscaps to overlayfs.
And it seems like they might not: the LSMs only look for their own
xattrs, and IMA doesn't do anything with fscaps xattrs. EVM does, but
what it does for a xattr write to an overlayfs indoe seems at least
partially if not completely redundant with what it will do when the
xattr is written to the upper filesystem.

So could we push these security calls down to the generic fscaps
implementations just before/after writing the raw xattr data and just
skip them for overlayfs? If so we can get away with doing the vfs_caps
to xattr conversion only once.

The trade offs are that filesystems which implement fscaps inode
operations become responsible for calling the security hooks if needed,
and if something changes such that we need to call those security hooks
for fscaps on overlayfs this solution would no longer work.

2023-12-10 16:42:24

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 09/16] fs: add vfs_set_fscaps()

On Thu, Dec 7, 2023 at 4:43 PM Seth Forshee (DigitalOcean)
<[email protected]> wrote:
>
> [Adding Mimi for insights on EVM questions]
>
> On Fri, Dec 01, 2023 at 12:18:00PM -0600, Seth Forshee (DigitalOcean) wrote:
> > On Fri, Dec 01, 2023 at 06:39:18PM +0100, Christian Brauner wrote:
> > > > +/**
> > > > + * vfs_set_fscaps - set filesystem capabilities
> > > > + * @idmap: idmap of the mount the inode was found from
> > > > + * @dentry: the dentry on which to set filesystem capabilities
> > > > + * @caps: the filesystem capabilities to be written
> > > > + * @flags: setxattr flags to use when writing the capabilities xattr
> > > > + *
> > > > + * This function writes the supplied filesystem capabilities to the dentry.
> > > > + *
> > > > + * Return: 0 on success, a negative errno on error.
> > > > + */
> > > > +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> > > > + const struct vfs_caps *caps, int flags)
> > > > +{
> > > > + struct inode *inode = d_inode(dentry);
> > > > + struct inode *delegated_inode = NULL;
> > > > + struct vfs_ns_cap_data nscaps;
> > > > + int size, error;
> > > > +
> > > > + /*
> > > > + * Unfortunately EVM wants to have the raw xattr value to compare to
> > > > + * the on-disk version, so we need to pass the raw xattr to the
> > > > + * security hooks. But we also want to do security checks before
> > > > + * breaking leases, so that means a conversion to the raw xattr here
> > > > + * which will usually be reduntant with the conversion we do for
> > > > + * writing the xattr to disk.
> > > > + */
> > > > + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps, &nscaps,
> > > > + sizeof(nscaps));
> > > > + if (size < 0)
> > > > + return size;
> > >
> > > Oh right, I remember that. Slight eyeroll. See below though...
> > >
> > > > +
> > > > +retry_deleg:
> > > > + inode_lock(inode);
> > > > +
> > > > + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> > > > + if (error)
> > > > + goto out_inode_unlock;
> > > > + error = security_inode_setxattr(idmap, dentry, XATTR_NAME_CAPS, &nscaps,
> > > > + size, flags);
> > > > + if (error)
> > > > + goto out_inode_unlock;
> > >
> > > For posix acls I added dedicated security hooks that take the struct
> > > posix_acl stuff and then plumb that down into the security modules. You
> > > could do the same thing here and then just force EVM and others to do
> > > their own conversion from in-kernel to xattr format, instead of forcing
> > > the VFS to do this.
> > >
> > > Because right now we make everyone pay the price all the time when
> > > really EVM should pay that price and this whole unpleasantness.
> >
> > Good point, I'll do that.
>
> I've been reconsidering various approaches here. One thing I noticed is
> that for the non-generic case (iow overlayfs) I missed calling
> security_inode_post_setxattr(), where EVM also wants the raw xattr, so
> that would require another conversion. That got me wondering whether the
> setxattr security hooks really matter when writing fscaps to overlayfs.
> And it seems like they might not: the LSMs only look for their own
> xattrs, and IMA doesn't do anything with fscaps xattrs. EVM does, but
> what it does for a xattr write to an overlayfs indoe seems at least
> partially if not completely redundant with what it will do when the
> xattr is written to the upper filesystem.
>
> So could we push these security calls down to the generic fscaps
> implementations just before/after writing the raw xattr data and just
> skip them for overlayfs? If so we can get away with doing the vfs_caps
> to xattr conversion only once.
>
> The trade offs are that filesystems which implement fscaps inode
> operations become responsible for calling the security hooks if needed,
> and if something changes such that we need to call those security hooks
> for fscaps on overlayfs this solution would no longer work.

Hi Seth,

I was trying to understand the alternative proposals, but TBH,
I cannot wrap my head about overlayfs+IMA/EVM and I do not
fully understand the use case.

Specifically, I do not understand why the IMA/EVM attestation on
the upper and lower fs isn't enough to make overlayfs tamper proof.
I never got an explanation of the threat model for overlayfs+IMA/EVM.

I know that for SELinux and overlayfs a lot of work was done by Vivek.
I was not involved in this work, but AKAIF, it did not involve any conversion
of selinux xattrs.

Thanks,
Amir.

2023-12-11 07:58:39

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 15/16] commoncap: use vfs fscaps interfaces for killpriv checks



Hello,

kernel test robot noticed a -3.4% regression of unixbench.score on:


commit: 4d9674015c6c6b0d3dd2013f7fbff6a8648e59dd ("[PATCH 15/16] commoncap: use vfs fscaps interfaces for killpriv checks")
url: https://github.com/intel-lab-lkp/linux/commits/Seth-Forshee-DigitalOcean/mnt_idmapping-split-out-core-vfs-ug-id_t-definitions-into-vfsid-h/20231130-055846
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH 15/16] commoncap: use vfs fscaps interfaces for killpriv checks

testcase: unixbench
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

runtime: 300s
nr_task: 100%
test: fsbuffer
cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231211/[email protected]

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/300s/lkp-icl-2sp9/fsbuffer/unixbench

commit:
28b9eedcb5 ("commoncap: remove cap_inode_getsecurity()")
4d9674015c ("commoncap: use vfs fscaps interfaces for killpriv checks")

28b9eedcb59f6969 4d9674015c6c6b0d3dd2013f7fb
---------------- ---------------------------
%stddev %change %stddev
\ | \
442.83 ? 8% +13.6% 503.17 ? 5% perf-c2c.DRAM.local
106496 -3.4% 102870 unixbench.score
8.46e+09 -3.3% 8.181e+09 unixbench.workload
22955 ? 9% -15.1% 19480 ? 8% sched_debug.cfs_rq:/.load.avg
52666 ? 8% -20.2% 42050 ? 6% sched_debug.cfs_rq:/.load.stddev
696.27 ? 10% -22.2% 541.75 ? 9% sched_debug.cpu.curr->pid.stddev
0.15 ? 13% -18.8% 0.12 ? 10% perf-sched.wait_and_delay.avg.ms.__cond_resched.__filemap_get_folio.simple_write_begin.generic_perform_write.generic_file_write_iter
0.23 ? 12% -25.8% 0.17 ? 14% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
413.00 ? 4% +17.7% 486.00 ? 7% perf-sched.wait_and_delay.count.__cond_resched.__filemap_get_folio.simple_write_begin.generic_perform_write.generic_file_write_iter
0.15 ? 13% -18.8% 0.12 ? 10% perf-sched.wait_time.avg.ms.__cond_resched.__filemap_get_folio.simple_write_begin.generic_perform_write.generic_file_write_iter
0.22 ? 14% -28.2% 0.16 ? 13% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
3.277e+10 +1.2% 3.316e+10 perf-stat.i.branch-instructions
4.128e+08 -3.2% 3.997e+08 perf-stat.i.cache-references
282058 -4.5% 269378 perf-stat.i.dTLB-load-misses
7803 ? 44% +24.8% 9740 perf-stat.overall.path-length
2.728e+10 ? 44% +21.4% 3.313e+10 perf-stat.ps.branch-instructions
17.86 -1.0 16.83 perf-profile.calltrace.cycles-pp.generic_perform_write.generic_file_write_iter.vfs_write.ksys_write.do_syscall_64
6.26 -0.3 5.93 perf-profile.calltrace.cycles-pp.simple_write_begin.generic_perform_write.generic_file_write_iter.vfs_write.ksys_write
5.75 -0.3 5.43 perf-profile.calltrace.cycles-pp.__filemap_get_folio.simple_write_begin.generic_perform_write.generic_file_write_iter.vfs_write
3.71 -0.2 3.50 perf-profile.calltrace.cycles-pp.copy_page_from_iter_atomic.generic_perform_write.generic_file_write_iter.vfs_write.ksys_write
3.79 -0.2 3.58 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.write
2.55 -0.2 2.34 perf-profile.calltrace.cycles-pp.simple_write_end.generic_perform_write.generic_file_write_iter.vfs_write.ksys_write
3.11 ? 2% -0.2 2.93 perf-profile.calltrace.cycles-pp.filemap_get_entry.__filemap_get_folio.simple_write_begin.generic_perform_write.generic_file_write_iter
2.34 -0.1 2.20 perf-profile.calltrace.cycles-pp.__fsnotify_parent.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.08 -0.1 1.94 perf-profile.calltrace.cycles-pp.file_update_time.__generic_file_write_iter.generic_file_write_iter.vfs_write.ksys_write
2.74 -0.1 2.62 perf-profile.calltrace.cycles-pp.fault_in_iov_iter_readable.generic_perform_write.generic_file_write_iter.vfs_write.ksys_write
1.70 -0.1 1.59 perf-profile.calltrace.cycles-pp.inode_needs_update_time.file_update_time.__generic_file_write_iter.generic_file_write_iter.vfs_write
1.32 ? 2% -0.1 1.23 perf-profile.calltrace.cycles-pp.xas_load.filemap_get_entry.__filemap_get_folio.simple_write_begin.generic_perform_write
1.20 -0.1 1.11 ? 2% perf-profile.calltrace.cycles-pp.down_write.generic_file_write_iter.vfs_write.ksys_write.do_syscall_64
0.57 ? 3% -0.1 0.52 ? 2% perf-profile.calltrace.cycles-pp.xas_descend.xas_load.filemap_get_entry.__filemap_get_folio.simple_write_begin
0.78 ? 2% -0.0 0.74 perf-profile.calltrace.cycles-pp.up_write.generic_file_write_iter.vfs_write.ksys_write.do_syscall_64
1.01 -0.0 0.96 perf-profile.calltrace.cycles-pp.generic_write_checks.generic_file_write_iter.vfs_write.ksys_write.do_syscall_64
0.75 -0.0 0.71 ? 2% perf-profile.calltrace.cycles-pp.folio_unlock.simple_write_end.generic_perform_write.generic_file_write_iter.vfs_write
0.68 ? 5% +0.1 0.77 perf-profile.calltrace.cycles-pp.xas_descend.xas_load.filemap_get_read_batch.filemap_get_pages.filemap_read
1.72 ? 3% +0.1 1.84 perf-profile.calltrace.cycles-pp.xas_load.filemap_get_read_batch.filemap_get_pages.filemap_read.vfs_read
43.59 +0.3 43.87 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
42.78 +0.3 43.10 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
40.44 +0.4 40.84 perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
38.06 +0.5 38.58 perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.00 +0.6 0.55 ? 4% perf-profile.calltrace.cycles-pp.xattr_resolve_name.vfs_getxattr_alloc.__vfs_get_fscaps.cap_inode_need_killpriv.security_inode_need_killpriv
28.82 +0.9 29.74 perf-profile.calltrace.cycles-pp.generic_file_write_iter.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.5 1.47 perf-profile.calltrace.cycles-pp.strncmp.xattr_permission.vfs_getxattr_alloc.__vfs_get_fscaps.cap_inode_need_killpriv
0.00 +1.8 1.84 perf-profile.calltrace.cycles-pp.xattr_permission.vfs_getxattr_alloc.__vfs_get_fscaps.cap_inode_need_killpriv.security_inode_need_killpriv
6.70 +2.2 8.92 perf-profile.calltrace.cycles-pp.__generic_file_write_iter.generic_file_write_iter.vfs_write.ksys_write.do_syscall_64
3.86 +2.4 6.26 perf-profile.calltrace.cycles-pp.__file_remove_privs.__generic_file_write_iter.generic_file_write_iter.vfs_write.ksys_write
2.37 +2.5 4.90 perf-profile.calltrace.cycles-pp.security_inode_need_killpriv.__file_remove_privs.__generic_file_write_iter.generic_file_write_iter.vfs_write
1.94 +2.6 4.52 perf-profile.calltrace.cycles-pp.cap_inode_need_killpriv.security_inode_need_killpriv.__file_remove_privs.__generic_file_write_iter.generic_file_write_iter
0.00 +3.0 2.98 perf-profile.calltrace.cycles-pp.vfs_getxattr_alloc.__vfs_get_fscaps.cap_inode_need_killpriv.security_inode_need_killpriv.__file_remove_privs
0.00 +4.1 4.08 perf-profile.calltrace.cycles-pp.__vfs_get_fscaps.cap_inode_need_killpriv.security_inode_need_killpriv.__file_remove_privs.__generic_file_write_iter
18.27 -1.0 17.22 perf-profile.children.cycles-pp.generic_perform_write
6.36 -0.3 6.03 perf-profile.children.cycles-pp.simple_write_begin
5.96 -0.3 5.64 perf-profile.children.cycles-pp.__filemap_get_folio
2.71 -0.2 2.49 perf-profile.children.cycles-pp.simple_write_end
3.76 -0.2 3.55 perf-profile.children.cycles-pp.copy_page_from_iter_atomic
3.21 ? 2% -0.2 3.02 perf-profile.children.cycles-pp.filemap_get_entry
2.23 -0.1 2.08 perf-profile.children.cycles-pp.file_update_time
1.90 -0.1 1.77 perf-profile.children.cycles-pp.inode_needs_update_time
2.86 -0.1 2.74 perf-profile.children.cycles-pp.fault_in_iov_iter_readable
4.92 -0.1 4.81 perf-profile.children.cycles-pp.entry_SYSCALL_64
2.44 -0.1 2.33 perf-profile.children.cycles-pp.fault_in_readable
1.31 -0.1 1.21 ? 2% perf-profile.children.cycles-pp.down_write
3.93 -0.1 3.86 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.55 ? 3% -0.1 0.49 ? 2% perf-profile.children.cycles-pp.folio_mark_dirty
0.85 ? 3% -0.1 0.79 ? 2% perf-profile.children.cycles-pp.up_write
0.56 -0.1 0.50 perf-profile.children.cycles-pp.balance_dirty_pages_ratelimited_flags
0.80 -0.1 0.75 ? 2% perf-profile.children.cycles-pp.folio_unlock
0.58 ? 2% -0.0 0.53 ? 2% perf-profile.children.cycles-pp.w_test
1.16 -0.0 1.11 perf-profile.children.cycles-pp.generic_write_checks
1.12 -0.0 1.07 perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.54 -0.0 0.50 ? 2% perf-profile.children.cycles-pp.folio_mapping
0.62 ? 2% -0.0 0.59 perf-profile.children.cycles-pp.timestamp_truncate
0.52 -0.0 0.49 ? 2% perf-profile.children.cycles-pp.generic_write_check_limits
0.42 -0.0 0.40 ? 2% perf-profile.children.cycles-pp.folio_wait_stable
0.37 -0.0 0.35 perf-profile.children.cycles-pp.setattr_should_drop_suidgid
0.22 ? 2% -0.0 0.21 ? 3% perf-profile.children.cycles-pp.inode_to_bdi
0.17 ? 2% -0.0 0.15 ? 3% perf-profile.children.cycles-pp.is_bad_inode
0.58 ? 2% +0.1 0.66 ? 4% perf-profile.children.cycles-pp.xattr_resolve_name
0.00 +0.3 0.27 ? 2% perf-profile.children.cycles-pp.kfree
86.62 +0.3 86.93 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
85.38 +0.3 85.71 perf-profile.children.cycles-pp.do_syscall_64
40.67 +0.4 41.06 perf-profile.children.cycles-pp.ksys_write
38.45 +0.5 38.96 perf-profile.children.cycles-pp.vfs_write
29.06 +0.9 29.98 perf-profile.children.cycles-pp.generic_file_write_iter
0.00 +1.5 1.54 perf-profile.children.cycles-pp.strncmp
0.00 +2.0 2.00 perf-profile.children.cycles-pp.xattr_permission
6.86 +2.2 9.07 perf-profile.children.cycles-pp.__generic_file_write_iter
4.03 +2.4 6.42 perf-profile.children.cycles-pp.__file_remove_privs
2.49 +2.5 5.00 perf-profile.children.cycles-pp.security_inode_need_killpriv
2.07 +2.6 4.63 perf-profile.children.cycles-pp.cap_inode_need_killpriv
0.00 +3.3 3.26 perf-profile.children.cycles-pp.vfs_getxattr_alloc
0.00 +4.3 4.27 perf-profile.children.cycles-pp.__vfs_get_fscaps
3.70 -0.2 3.49 perf-profile.self.cycles-pp.copy_page_from_iter_atomic
3.91 -0.2 3.72 perf-profile.self.cycles-pp.vfs_write
1.30 -0.1 1.18 perf-profile.self.cycles-pp.simple_write_end
1.23 -0.1 1.12 ? 2% perf-profile.self.cycles-pp.__file_remove_privs
1.86 -0.1 1.75 perf-profile.self.cycles-pp.generic_perform_write
2.37 -0.1 2.26 perf-profile.self.cycles-pp.fault_in_readable
1.83 -0.1 1.73 perf-profile.self.cycles-pp.write
0.90 -0.1 0.81 ? 3% perf-profile.self.cycles-pp.down_write
1.98 -0.1 1.89 perf-profile.self.cycles-pp.__filemap_get_folio
7.88 -0.1 7.80 perf-profile.self.cycles-pp.__fsnotify_parent
3.80 -0.1 3.73 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.97 -0.1 0.90 ? 2% perf-profile.self.cycles-pp.inode_needs_update_time
0.80 ? 3% -0.1 0.74 ? 2% perf-profile.self.cycles-pp.up_write
0.42 -0.0 0.36 ? 3% perf-profile.self.cycles-pp.security_inode_need_killpriv
0.52 ? 2% -0.0 0.48 ? 2% perf-profile.self.cycles-pp.w_test
0.74 -0.0 0.70 ? 2% perf-profile.self.cycles-pp.folio_unlock
0.96 -0.0 0.91 perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.71 -0.0 0.67 perf-profile.self.cycles-pp.ksys_write
0.83 -0.0 0.79 perf-profile.self.cycles-pp.generic_file_write_iter
0.38 -0.0 0.35 perf-profile.self.cycles-pp.balance_dirty_pages_ratelimited_flags
0.42 ? 2% -0.0 0.39 ? 3% perf-profile.self.cycles-pp.generic_write_check_limits
0.43 ? 2% -0.0 0.40 ? 2% perf-profile.self.cycles-pp.folio_mapping
0.28 ? 5% -0.0 0.25 ? 2% perf-profile.self.cycles-pp.folio_mark_dirty
1.00 -0.0 0.97 perf-profile.self.cycles-pp.__get_task_ioprio
1.22 -0.0 1.20 perf-profile.self.cycles-pp.entry_SYSCALL_64
0.41 -0.0 0.39 perf-profile.self.cycles-pp.simple_write_begin
0.39 -0.0 0.37 ? 2% perf-profile.self.cycles-pp.fault_in_iov_iter_readable
0.36 ? 5% +0.1 0.42 ? 6% perf-profile.self.cycles-pp.xattr_resolve_name
0.18 ? 3% +0.2 0.36 perf-profile.self.cycles-pp.cap_inode_need_killpriv
0.00 +0.2 0.18 ? 2% perf-profile.self.cycles-pp.kfree
0.00 +0.5 0.48 perf-profile.self.cycles-pp.xattr_permission
0.00 +0.8 0.76 perf-profile.self.cycles-pp.vfs_getxattr_alloc
0.00 +0.8 0.83 perf-profile.self.cycles-pp.__vfs_get_fscaps
0.00 +1.4 1.42 perf-profile.self.cycles-pp.strncmp




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki