From: Andi Kleen <[email protected]>
Some recent benchmarking on btrfs showed that a major scaling bottleneck
on large systems on btrfs is currently the xattr lookup on every write.
Why xattr lookup on every write I hear you ask?
write wants to drop suid and security related xattrs that could set o
capabilities for executables. To do that it currently looks up
security.capability on EVERY write (even for non executables) to decide
whether to drop it or not.
In btrfs this causes an additional tree walk, hitting some per file system
locks and quite bad scalability. In a simple read workload on a 8S
system I saw over 90% CPU time in spinlocks related to that.
Chris Mason tells me this is also a problem in ext4, where it hits
the global mbcache lock.
This patch adds a simple per inode to avoid this problem. We only
do the lookup once per file and then if there is no xattr cache
the decision. All xattr changes clear the flag.
I also used the same flag to avoid the suid check, although
that one is pretty cheap.
A file system can also set this flag when it creates the inode,
if it has a cheap way to do so. This is done for some common file systems
in followon patches.
With this patch a major part of the lock contention disappears
for btrfs. Some testing on smaller systems didn't show significant
performance changes, but at least it helps the larger systems
and is generally more efficient.
v2: Rename is_sgid. add file system helper.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Serge E. Hallyn <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
fs/attr.c | 7 +++++++
fs/xattr.c | 7 +++++--
include/linux/fs.h | 13 +++++++++++++
mm/filemap.c | 14 ++++++++++++--
4 files changed, 37 insertions(+), 4 deletions(-)
diff --git a/fs/attr.c b/fs/attr.c
index 91dbe2a..caf2aa5 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -175,6 +175,13 @@ int notify_change(struct dentry * dentry, struct iattr * attr)
return -EPERM;
}
+ if ((ia_valid & ATTR_MODE)) {
+ mode_t amode = attr->ia_mode;
+ /* Flag setting protected by i_mutex */
+ if (is_sxid(amode))
+ inode->i_flags &= ~S_NOSEC;
+ }
+
now = current_fs_time(inode->i_sb);
attr->ia_ctime = now;
diff --git a/fs/xattr.c b/fs/xattr.c
index f1ef949..1f0cb40 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -87,7 +87,11 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
{
struct inode *inode = dentry->d_inode;
int error = -EOPNOTSUPP;
+ int issec = !strncmp(name, XATTR_SECURITY_PREFIX,
+ XATTR_SECURITY_PREFIX_LEN);
+ if (issec)
+ inode->i_flags &= ~S_NOSEC;
if (inode->i_op->setxattr) {
error = inode->i_op->setxattr(dentry, name, value, size, flags);
if (!error) {
@@ -95,8 +99,7 @@ int __vfs_setxattr_noperm(struct dentry *dentry, const char *name,
security_inode_post_setxattr(dentry, name, value,
size, flags);
}
- } else if (!strncmp(name, XATTR_SECURITY_PREFIX,
- XATTR_SECURITY_PREFIX_LEN)) {
+ } else if (issec) {
const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
error = security_inode_setsecurity(inode, suffix, value,
size, flags);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cdf9495..1bca00b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ struct inodes_stat_t {
#define S_PRIVATE 512 /* Inode is fs-internal */
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
+#define S_NOSEC 4096 /* no suid or xattr security attributes */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -272,6 +273,7 @@ struct inodes_stat_t {
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
+#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
@@ -2578,5 +2580,16 @@ int __init get_filesystem_list(char *buf);
#define OPEN_FMODE(flag) ((__force fmode_t)(((flag + 1) & O_ACCMODE) | \
(flag & __FMODE_NONOTIFY)))
+static inline int is_sxid(mode_t mode)
+{
+ return (mode & S_ISUID) || ((mode & S_ISGID) && (mode & S_IXGRP));
+}
+
+static inline void inode_has_no_xattr(struct inode *inode)
+{
+ if (!is_sxid(inode->i_mode))
+ inode->i_flags |= S_NOSEC;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..5893a88 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1943,16 +1943,26 @@ static int __remove_suid(struct dentry *dentry, int kill)
int file_remove_suid(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
- int killsuid = should_remove_suid(dentry);
- int killpriv = security_inode_need_killpriv(dentry);
+ struct inode *inode = dentry->d_inode;
+ int killsuid;
+ int killpriv;
int error = 0;
+ /* Fast path for nothing security related */
+ if (IS_NOSEC(inode))
+ return 0;
+
+ killsuid = should_remove_suid(dentry);
+ killpriv = security_inode_need_killpriv(dentry);
+
if (killpriv < 0)
return killpriv;
if (killpriv)
error = security_inode_killpriv(dentry);
if (!error && killsuid)
error = __remove_suid(dentry, killsuid);
+ if (!error)
+ inode->i_flags |= S_NOSEC;
return error;
}
--
1.7.4.4
From: Andi Kleen <[email protected]>
This avoids a xattr lookup on every write.
v2: Use helper
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/ext4/ialloc.c | 4 ++++
fs/ext4/inode.c | 6 ++++++
2 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 21bb2f6..e0aabdb 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1012,6 +1012,10 @@ got:
*/
ei->i_flags =
ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED);
+ /*
+ * New inode doesn't have security xattrs.
+ */
+ inode_has_no_xattr(inode);
ei->i_file_acl = 0;
ei->i_dtime = 0;
ei->i_block_group = group;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..1079d02 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4758,6 +4758,12 @@ void ext4_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ /*
+ * Don't know yet if an xattr is really security related, but the first
+ * write will find out.
+ */
+ if (!ext4_test_inode_state(inode, EXT4_STATE_XATTR))
+ inode_has_no_xattr(inode);
}
/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
--
1.7.4.4
From: Andi Kleen <[email protected]>
Based on a suggestion from Josef Bacik.
The btrfs inode lookup already does a quick check for xattrs.
If it doesn't find any and the file is not sgid set the NOSEC flag.
This avoids a xattr lookup on the first write.
v2: Use helper
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/btrfs/inode.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7cd8ab0..20cecb8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2544,8 +2544,10 @@ static void btrfs_read_locked_inode(struct inode *inode)
* any xattrs or acls
*/
maybe_acls = acls_after_inode_item(leaf, path->slots[0], inode->i_ino);
- if (!maybe_acls)
+ if (!maybe_acls) {
cache_no_acl(inode);
+ inode_has_no_xattr(inode);
+ }
BTRFS_I(inode)->block_group = btrfs_find_block_group(root, 0,
alloc_group_block, 0);
@@ -4616,6 +4618,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
(BTRFS_I(dir)->flags & BTRFS_INODE_NODATACOW))
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
}
+ /*
+ * A new inode has no xattrs yet.
+ */
+ inode_has_no_xattr(inode);
insert_inode_hash(inode);
inode_tree_add(inode);
--
1.7.4.4
From: Andi Kleen <[email protected]>
Set the S_NOSEC flag early if the inode has no xattrs for XFS.
This avoids a single xattr lookup for the first write on every
file, if the file has no xattrs.
v2: Use helper
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
---
fs/xfs/linux-2.6/xfs_iops.c | 3 +++
fs/xfs/xfs_attr.c | 2 +-
fs/xfs/xfs_attr.h | 1 +
3 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index dd21784..507505b 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -763,6 +763,9 @@ xfs_setup_inode(
break;
}
+ if (!xfs_inode_hasattr(ip))
+ inode_has_no_xattr(inode);
+
xfs_iflags_clear(ip, XFS_INEW);
barrier();
diff --git a/fs/xfs/xfs_attr.c b/fs/xfs/xfs_attr.c
index c863753..43140db 100644
--- a/fs/xfs/xfs_attr.c
+++ b/fs/xfs/xfs_attr.c
@@ -99,7 +99,7 @@ xfs_attr_name_to_xname(
return 0;
}
-STATIC int
+int
xfs_inode_hasattr(
struct xfs_inode *ip)
{
diff --git a/fs/xfs/xfs_attr.h b/fs/xfs/xfs_attr.h
index e920d68..20f4038 100644
--- a/fs/xfs/xfs_attr.h
+++ b/fs/xfs/xfs_attr.h
@@ -142,5 +142,6 @@ typedef struct xfs_attr_list_context {
int xfs_attr_inactive(struct xfs_inode *dp);
int xfs_attr_rmtval_get(struct xfs_da_args *args);
int xfs_attr_list_int(struct xfs_attr_list_context *);
+int xfs_inode_hasattr(struct xfs_inode *ip);
#endif /* __XFS_ATTR_H__ */
--
1.7.4.4
Hi,
On Sat, 2011-05-28 at 08:25 -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> Some recent benchmarking on btrfs showed that a major scaling bottleneck
> on large systems on btrfs is currently the xattr lookup on every write.
>
> Why xattr lookup on every write I hear you ask?
>
> write wants to drop suid and security related xattrs that could set o
> capabilities for executables. To do that it currently looks up
> security.capability on EVERY write (even for non executables) to decide
> whether to drop it or not.
>
It sounds like a good idea, but cluster filesystems will need to clear
the flag when they update their in-core inodes. Without that we could
have:
Node A looks up inode and sets S_NOSEC since its not suid
Node B does chmod +s on the inode
Node A now has S_NOSEC set, but inode is suid, so writes don't clear
suid
For GFS2 it should simply be a case of adjusting gfs2_set_inode_flags()
to update S_NOSEC appropriately, something like this (untested):
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a9f5cbe..3d856e4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -174,7 +174,9 @@ void gfs2_set_inode_flags(struct inode *inode)
struct gfs2_inode *ip = GFS2_I(inode);
unsigned int flags = inode->i_flags;
- flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_NOSEC);
+ if (!is_sxid(inode->i_mode))
+ flags |= S_NOSEC;
if (ip->i_diskflags & GFS2_DIF_IMMUTABLE)
flags |= S_IMMUTABLE;
if (ip->i_diskflags & GFS2_DIF_APPENDONLY)
Note that this also serves the dual purpose of setting the flag for
newly created inodes as well, as per the patches for the other
filesystems,
Steve.
> It sounds like a good idea, but cluster filesystems will need to clear
> the flag when they update their in-core inodes. Without that we could
> have:
>
> Node A looks up inode and sets S_NOSEC since its not suid
> Node B does chmod +s on the inode
> Node A now has S_NOSEC set, but inode is suid, so writes don't clear
> suid
Good point. I assume that's also true for network file systems.
This would essentially argue that for those putting the helper
into the inode read paths is not optional. I'll look into this
later.
> - flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
> + flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_NOSEC);
> + if (!is_sxid(inode->i_mode))
> + flags |= S_NOSEC;
Doesn't that need a check for no xattr too? or do you not support
those currently?
Note I added a helper for this in the latest version:
inode_has_no_xattr()
-Andi
Hi,
On Tue, 2011-05-31 at 11:06 -0700, Andi Kleen wrote:
> > It sounds like a good idea, but cluster filesystems will need to clear
> > the flag when they update their in-core inodes. Without that we could
> > have:
> >
> > Node A looks up inode and sets S_NOSEC since its not suid
> > Node B does chmod +s on the inode
> > Node A now has S_NOSEC set, but inode is suid, so writes don't clear
> > suid
>
> Good point. I assume that's also true for network file systems.
>
> This would essentially argue that for those putting the helper
> into the inode read paths is not optional. I'll look into this
> later.
>
> > - flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
> > + flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_NOSEC);
> > + if (!is_sxid(inode->i_mode))
> > + flags |= S_NOSEC;
>
> Doesn't that need a check for no xattr too? or do you not support
> those currently?
>
> Note I added a helper for this in the latest version:
> inode_has_no_xattr()
>
> -Andi
Yes, it should test for xattr too,
Steve.
On Tue, May 31, 2011 at 07:42:26PM +0100, Steven Whitehouse wrote:
> Yes, it should test for xattr too,
Frankly, I suspect that the sanest way to handle that is this:
* new superblock flag - MS_NOSEC
* S_NOSEC is never set unless we have MS_NOSEC
* mount_bdev() sets it before calling fill_super callback.
* ocfs2 and fuse *clear* it in their fill_super
* btrfs manually sets it in its ->mount()
... and if gfs2 or any other non-trivial fs wants to use that, it'll need
to set MS_NOSEC in its ->mount() and take care of clearing S_NOSEC whenever
we decide it might've gone stale (a-la your patch).
That way we do not have a hole on network filesystems, get rid of duplicate
xattr / SxID checks on local block ones (which is the majority of filesystems
that might care about all that crap) and keep the patch size on saner side...
Something like this:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9b2e7e5..d158b67 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -819,7 +819,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
} else {
char b[BDEVNAME_SIZE];
- s->s_flags = flags;
+ s->s_flags = flags | MS_NOSEC;
strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
error = btrfs_fill_super(s, fs_devices, data,
flags & MS_SILENT ? 1 : 0);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index cc6ec4b..38f84cd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -921,6 +921,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (sb->s_flags & MS_MANDLOCK)
goto err;
+ sb->s_flags &= ~MS_NOSEC;
+
if (!parse_fuse_opt((char *) data, &d, is_bdev))
goto err;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index cdbaf5e..56f6102 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1072,7 +1072,7 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
sb->s_magic = OCFS2_SUPER_MAGIC;
- sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+ sb->s_flags = (sb->s_flags & ~(MS_POSIXACL | MS_NOSEC)) |
((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);
/* Hard readonly mode only if: bdev_read_only, MS_RDONLY,
diff --git a/fs/super.c b/fs/super.c
index c755939..ab3d672 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -822,7 +822,7 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
} else {
char b[BDEVNAME_SIZE];
- s->s_flags = flags;
+ s->s_flags = flags | MS_NOSEC;
s->s_mode = mode;
strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
sb_set_blocksize(s, block_size(bdev));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c55d6b7..646a183 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -208,6 +208,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_NOSEC (1<<28)
#define MS_BORN (1<<29)
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)
@@ -2591,7 +2592,7 @@ static inline int is_sxid(mode_t mode)
static inline void inode_has_no_xattr(struct inode *inode)
{
- if (!is_sxid(inode->i_mode))
+ if (!is_sxid(inode->i_mode) && (inode->i_sb->s_flags & MS_NOSEC))
inode->i_flags |= S_NOSEC;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index d7b1057..a8251a8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2000,7 +2000,7 @@ int file_remove_suid(struct file *file)
error = security_inode_killpriv(dentry);
if (!error && killsuid)
error = __remove_suid(dentry, killsuid);
- if (!error)
+ if (!error && (inode->i_sb->s_flags & MS_NOSEC))
inode->i_flags |= S_NOSEC;
return error;
On Tue, May 31, 2011 at 09:07:50PM +0100, Al Viro wrote:
> On Tue, May 31, 2011 at 07:42:26PM +0100, Steven Whitehouse wrote:
>
> > Yes, it should test for xattr too,
>
> Frankly, I suspect that the sanest way to handle that is this:
> * new superblock flag - MS_NOSEC
> * S_NOSEC is never set unless we have MS_NOSEC
> * mount_bdev() sets it before calling fill_super callback.
> * ocfs2 and fuse *clear* it in their fill_super
gfs2 needs to clear it too (unless Steven's updated patch goes in too)
Steven?
> * btrfs manually sets it in its ->mount()
Looks good. I haven't test it so far though.
Acked-by: Andi Kleen <[email protected]>
-Andi
On Tue, May 31, 2011 at 10:18:34PM +0200, Andi Kleen wrote:
> On Tue, May 31, 2011 at 09:07:50PM +0100, Al Viro wrote:
> > On Tue, May 31, 2011 at 07:42:26PM +0100, Steven Whitehouse wrote:
> >
> > > Yes, it should test for xattr too,
> >
> > Frankly, I suspect that the sanest way to handle that is this:
> > * new superblock flag - MS_NOSEC
> > * S_NOSEC is never set unless we have MS_NOSEC
> > * mount_bdev() sets it before calling fill_super callback.
> > * ocfs2 and fuse *clear* it in their fill_super
>
> gfs2 needs to clear it too (unless Steven's updated patch goes in too)
> Steven?
gfs2 doesn't use mount_bdev(), so there's nothing that would set it (we
never pass it set in flags - mount(2) clears all but user-settable bits).
Hi,
Il 31/05/2011 22:07, Al Viro ha scritto:
> On Tue, May 31, 2011 at 07:42:26PM +0100, Steven Whitehouse wrote:
>
>> Yes, it should test for xattr too,
>
> Frankly, I suspect that the sanest way to handle that is this:
> * new superblock flag - MS_NOSEC
> * S_NOSEC is never set unless we have MS_NOSEC
> * mount_bdev() sets it before calling fill_super callback.
> * ocfs2 and fuse *clear* it in their fill_super
> * btrfs manually sets it in its ->mount()
> ... and if gfs2 or any other non-trivial fs wants to use that, it'll need
> to set MS_NOSEC in its ->mount() and take care of clearing S_NOSEC whenever
> we decide it might've gone stale (a-la your patch).
>
several fs now uses MS_NOSEC (because this flag is set in mount_bdev())
but I don't see any user of the function inode_has_no_xattr() in the
latest version. If I well understand, a fs that wants to manage this
feature has to set MS_NOSEC and calls when needed this function, isn't
it? So at this point, why there aren't any user of this function?
Marco
> several fs now uses MS_NOSEC (because this flag is set in mount_bdev())
> but I don't see any user of the function inode_has_no_xattr() in the
> latest version. If I well understand, a fs that wants to manage this
> feature has to set MS_NOSEC and calls when needed this function, isn't
> it? So at this point, why there aren't any user of this function?
Calling the function is just an optimization to avoid the lookup
for the first write. But most of the benefit you already get
when just the flag is set.
I haven't resent the patches using it to the fs maintainers yet and none
of them took it the first round.
-Andi