From: Andi Kleen <[email protected]>
Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.
Why xattr lookup on every write I hear you ask?
write wants to drop suid and security related xattrs that could set o
capabilities for executables. To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.
In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.
Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.
This patch adds a simple per inode to avoid this problem. We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.
I also used the same flag to avoid the suid check, although
that one is pretty cheap.
A file system can also set this flag when it creates the inode,
if it has a cheap way to do so. This is done for some common file systems
in followon patches.
With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Serge E. Hallyn <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
fs/attr.c | 7 +++++++
fs/xattr.c | 7 +++++--
include/linux/fs.h | 7 +++++++
mm/filemap.c | 14 ++++++++++++--
4 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/fs/attr.c b/fs/attr.c
index 91dbe2a..384229c 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -175,6 +175,13 @@ int notify_change(struct dentry * dentry, struct iattr * attr)
return -EPERM;
}
+ if ((ia_valid & ATTR_MODE)) {
+ mode_t amode = attr->ia_mode;
+ /* Flag setting protected by i_mutex */
+ if (is_sgid(amode))
+ inode->i_flags &= ~S_NOSEC;
+ }
+
now = current_fs_time(inode->i_sb);
attr->ia_ctime = now;
diff --git a/fs/xattr.c b/fs/xattr.c
index f1ef949..1f0cb40 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -87,7 +87,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
{
struct inode *inode = dentry->d_inode;
int error = -EOPNOTSUPP;
+ int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
+ XATTR_SECURITY_PREFIX_LEN);
+ if (issec)
+ inode->i_flags &= ~S_NOSEC;
if (inode->i_op->setxattr) {
error = inode->i_op->setxattr(dentry, name, value, size, flags);
if (!error) {
@@ -95,8 +99,7 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
security_inode_post_setxattr(dentry, name, value,
size, flags);
}
- } else if (!strncmp(name, XATTR_SECURITY_PREFIX,
- XATTR_SECURITY_PREFIX_LEN)) {
+ } else if (issec) {
const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
error = security_inode_setsecurity(inode, suffix, value,
size, flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cdf9495..576f07c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ struct inodes_stat_t {
#define S_PRIVATE 512 /* Inode is fs-internal */
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
+#define S_NOSEC 4096 /* no suid or xattr security attributes */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -272,6 +273,7 @@ struct inodes_stat_t {
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
+#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
@@ -2578,5 +2580,10 @@ int __init get_filesystem_list(char *buf);
#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
(flag & __FMODE_NONOTIFY)))
+static inline int is_sgid(mode_t mode)
+{
+ return (mode & S_ISUID) || ((mode & S_ISGID) && (mode & S_IXGRP));
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..5893a88 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1943,16 +1943,26 @@ static int __remove_suid(struct dentry *dentry, int kill)
int file_remove_suid(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
- int killsuid = should_remove_suid(dentry);
- int killpriv = security_inode_need_killpriv(dentry);
+ struct inode *inode = dentry->d_inode;
+ int killsuid;
+ int killpriv;
int error = 0;
+ /* Fast path for nothing security related */
+ if (IS_NOSEC(inode))
+ return 0;
+
+ killsuid = should_remove_suid(dentry);
+ killpriv = security_inode_need_killpriv(dentry);
+
if (killpriv < 0)
return killpriv;
if (killpriv)
error = security_inode_killpriv(dentry);
if (!error && killsuid)
error = __remove_suid(dentry, killsuid);
+ if (!error)
+ inode->i_flags |= S_NOSEC;
return error;
}
--
1.7.4.4
From: Andi Kleen <[email protected]>
This avoids a xattr lookup on every write.
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/ext4/ialloc.c | 5 +++++
fs/ext4/inode.c | 7 +++++++
2 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 21bb2f6..cc7878d 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1012,6 +1012,11 @@ got:
*/
ei->i_flags =
ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED);
+ /*
+ * New inode doesn't have security xattrs.
+ */
+ if (!is_sgid(inode->i_mode))
+ inode->i_flags |= S_NOSEC;
ei->i_file_acl = 0;
ei->i_dtime = 0;
ei->i_block_group = group;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..d03b2b0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4758,6 +4758,13 @@ void ext4_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ /*
+ * Don't know yet if an xattr is really security related, but the first
+ * write will find out.
+ */
+ if (!is_sgid(inode->i_mode) &&
+ !ext4_test_inode_state(inode, EXT4_STATE_XATTR))
+ inode->i_flags |= S_NOSEC;
}
/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
--
1.7.4.4
From: Andi Kleen <[email protected]>
Based on a suggestion from Josef Bacik.
The btrfs inode lookup already does a quick check for xattrs.
If it doesn't find any and the file is not sgid set the NOSEC flag.
This avoids a xattr lookup on the first write.
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/btrfs/inode.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7cd8ab0..b309c30 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2544,8 +2544,11 @@ static void btrfs_read_locked_inode(struct inode *inode)
* any xattrs or acls
*/
maybe_acls = acls_after_inode_item(leaf, path->slots[0], inode->i_ino);
- if (!maybe_acls)
+ if (!maybe_acls) {
cache_no_acl(inode);
+ if (!is_sgid(inode->i_mode))
+ inode->i_flags |= S_NOSEC;
+ }
BTRFS_I(inode)->block_group = btrfs_find_block_group(root, 0,
alloc_group_block, 0);
@@ -4616,6 +4619,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
(BTRFS_I(dir)->flags & BTRFS_INODE_NODATACOW))
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
}
+ if (!is_sgid(mode))
+ inode->i_flags |= S_NOSEC;
insert_inode_hash(inode);
inode_tree_add(inode);
--
1.7.4.4
From: Andi Kleen <[email protected]>
Set the S_NOSEC flag early if the inode has no xattrs for XFS.
This avoids a single xattr lookup for the first write on every
file, if the file has no xattrs.
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/xfs/linux-2.6/xfs_iops.c | 3 +++
fs/xfs/xfs_attr.c | 2 +-
fs/xfs/xfs_attr.h | 1 +
3 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index dd21784..4ed1e4c 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -763,6 +763,9 @@ xfs_setup_inode(
break;
}
+ if (!is_sgid(inode->i_mode) && !xfs_inode_hasattr(ip))
+ inode->i_flags |= S_NOSEC;
+
xfs_iflags_clear(ip, XFS_INEW);
barrier();
diff --git a/fs/xfs/xfs_attr.c b/fs/xfs/xfs_attr.c
index c863753..43140db 100644
--- a/fs/xfs/xfs_attr.c
+++ b/fs/xfs/xfs_attr.c
@@ -99,7 +99,7 @@ xfs_attr_name_to_xname(
return 0;
}
-STATIC int
+int
xfs_inode_hasattr(
struct xfs_inode *ip)
{
diff --git a/fs/xfs/xfs_attr.h b/fs/xfs/xfs_attr.h
index e920d68..20f4038 100644
--- a/fs/xfs/xfs_attr.h
+++ b/fs/xfs/xfs_attr.h
@@ -142,5 +142,6 @@ typedef struct xfs_attr_list_context {
int xfs_attr_inactive(struct xfs_inode *dp);
int xfs_attr_rmtval_get(struct xfs_da_args *args);
int xfs_attr_list_int(struct xfs_attr_list_context *);
+int xfs_inode_hasattr(struct xfs_inode *ip);
#endif /* __XFS_ATTR_H__ */
--
1.7.4.4
Il 28/05/2011 00:54, Andi Kleen ha scritto:
> From: Andi Kleen<[email protected]>
>
> This avoids a xattr lookup on every write.
>
> Cc: [email protected]
> Signed-off-by: Andi Kleen<[email protected]>
Are ext2/3 affected too? It seems to me that more or less the xattr
management for the extN series is the same, isn't it?
Marco
Excerpts from Marco Stornelli's message of 2011-05-28 05:08:01 -0400:
> Il 28/05/2011 00:54, Andi Kleen ha scritto:
> > From: Andi Kleen<[email protected]>
> >
> > This avoids a xattr lookup on every write.
> >
> > Cc: [email protected]
> > Signed-off-by: Andi Kleen<[email protected]>
>
> Are ext2/3 affected too? It seems to me that more or less the xattr
> management for the extN series is the same, isn't it?
Yes, everyone who uses mbcache.
-chris
On Fri, May 27, 2011 at 03:54:02PM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> Some recent benchmarking on btrfs showed that a major scaling bottleneck
> on large systems on btrfs is currently the xattr lookup on every write.
>
> Why xattr lookup on every write I hear you ask?
>
> write wants to drop suid and security related xattrs that could set o
> capabilities for executables. To do that it currently looks up
> security.capability on EVERY write (even for non executables) to decide
> whether to drop it or not.
Hm...
a) is_sgid() is a bad name for that - at the very least s/g/x/, since
anybody would read your variant as "check if it's set-group-id".
b) I'd add a helper for filesystems to use, rather than messing
with the flags directly.
On Sat, May 28, 2011 at 11:08:01AM +0200, Marco Stornelli wrote:
> Il 28/05/2011 00:54, Andi Kleen ha scritto:
> >From: Andi Kleen<[email protected]>
> >
> >This avoids a xattr lookup on every write.
> >
> >Cc: [email protected]
> >Signed-off-by: Andi Kleen<[email protected]>
>
> Are ext2/3 affected too? It seems to me that more or less the xattr
> management for the extN series is the same, isn't it?
Yes, they could be changed the same way.
Note however that the file system specific change is only a relatively
minor optimization (avoids the first lookup). The important one is
the cache which every file system get's automatically without changes.
-Andi
On Fri, 27 May 2011 15:54:02 PDT, Andi Kleen said:
> @@ -2578,5 +2580,10 @@ int __init get_filesystem_list(char *buf);
> #define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
> (flag & __FMODE_NONOTIFY)))
>
> +static inline int is_sgid(mode_t mode)
> +{
> + return (mode & S_ISUID) || ((mode & S_ISGID) && (mode & S_IXGRP));
> +}
> +
There has *got* to be a better name for this function.
And having said that, I'm not convinced it's the *right* check - on an SELinux
system, pretty much *all* the files have a security xattr attached to them, and
very few are actually setuid/setgid. So 98% of the time, or more, this will DTWT.
> And having said that, I'm not convinced it's the *right* check - on an SELinux
> system, pretty much *all* the files have a security xattr attached to them, and
> very few are actually setuid/setgid. So 98% of the time, or more, this will DTWT.
These are not for selinux xattrs, but capability xattrs.
And I think you misunderstand the semantics of the flag.
The flag just signifies the inode has neither suid nor capabilities.
-Andi
On Sun, 29 May 2011 07:24:32 PDT, Andi Kleen said:
> These are not for selinux xattrs, but capability xattrs.
+#define S_NOSEC 4096 /* no suid or xattr security attributes */
Sorry for reading that wrong, since selinux stores stuff under security.* xattr
as well.
+ int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
+ XATTR_SECURITY_PREFIX_LEN);
is going to match *any* security.* attribute, including SELinux ones stored under
security.selinux. If you wanted to be capability-specific, maybe youw anted these
two:
#define XATTR_CAPS_SUFFIX "capability"
#define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX