2009-09-07 21:02:20

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 0/8] VFS name lookup permission checking cleanup


This is a series of eight trivial patches that I'd like people to take a
look at, because I am hoping to eventually do multiple path component
lookups in one go without taking the per-dentry lock or incrementing (and
then decrementing) the per-dentry atomic count for each component.

The aim would be to try to avoid getting that annoying cacheline ping-pong
on the common top-level dentries that everybody looks up (ie root and home
directories, /usr, /usr/bin etc).

Right now I have some simple (but real) loads that show the contention on
dentry->d_lock to be a roughly 3% performance hit on a single-socket
nehalem, and I assume it can be much worse on multi-socket machines.

And the thing is, it should be entirely possible to do everything but the
last component lookup with just a single read_seqbegin()/read_seqretry()
around the whole lookup. Yes, the last component is special and absolutely
needs locking and counting - but the last component also doesn't tend to
be shared, so locking it is fine.

Now, I may never actually get there, but when looking at it, the biggest
problem is actually not so much the path lookup itself, as the security
tests that are done for each path component. And it should be noted that
in order for a lockless seq-lock only lookup make sense, any such
operations would have to be totally lock-free too. They certainly can't
take mutexes etc, but right now they do.

Those security tests fall into two categories:

- actual security layer callouts: ima_path_check().

This one looks totally pointless. Path component lookup is a horribly
timing-critical path, and we will only do a successful lookup on a
directory (inode needs to have a ->lookup operation), yet in the middle
of that is a call to "ima_path_check()".

Now, it looks like ima_path_check() is very much designed to only check
the _final_ path anyway, and was never meant to be used to check the
directories we hit on the way. In fact, the whole function starts with

if (!ima_initialized || !S_ISREG(inode->i_mode))
return 0;

so it's totally pointless to do that thing on a directory where
that !S_ISREG() test will trigger.

So just remove it. IMA should never have put that check in there to
begin with, it's just way too performance-sensitive.

- the real filesystem permission checks.

We used to do the common case entirely in the VFS layer, but these days
the common case includes POSIX ACL checking, and as a result, the
trivial short-circuit code in the VFS layer almost never triggers in
practice, and we call down to the low-level filesystem for each
component.

We can't fix that by just removing the call, but what we _can_ do is to
at least avoid the silly calling back-and-forth: most filesystems will
just call back to the VFS layer to do the "generic_permission()" with
their own ACL-checking routines.

That way we can flatten the call-chain out a bit, and avoid one
unnecessary indirect call in that timing-critical region. And
eventually, if we make the whole ACL caching thing be something that we
do at a VFS layer (Al Viro already worked on _some_ of that), we'll be
able to avoid the calls entirely when we can see the cached ACL
pointers directly.

So this series of 8 patches do all these preliminary things. As shown by
the diffstat below, it actually reduces the lines of code (mainly by just
removing the silly per-filesystem wrappers around "generic_permission()")
and it also makes it a _lot_ clearer what actually gets called in that
whole 'exec_permission_lite()' function that we use to check the
permission of a pathname lookup.

Comments? Especially from the IMA people (first patch) and from generic
VFS, security and low-level FS people (the 'Simplify exec_permission_lite'
series, and then the check_acl + per-filesystem changes).

Al?

I'm looking to merge these shortly after 2.6.31 is released, but comments
welcome.

Linus

----
Linus Torvalds (8):
Do not call 'ima_path_check()' for each path component
Simplify exec_permission_lite() logic
Simplify exec_permission_lite() further
Simplify exec_permission_lite(), part 3
Make 'check_acl()' a first-class filesystem op
shmfs: use 'check_acl' instead of 'permission'
ext[234]: move over to 'check_acl' permission model
jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'

fs/ext2/acl.c | 8 +----
fs/ext2/acl.h | 4 +-
fs/ext2/file.c | 2 +-
fs/ext2/namei.c | 4 +-
fs/ext3/acl.c | 8 +----
fs/ext3/acl.h | 4 +-
fs/ext3/file.c | 2 +-
fs/ext3/namei.c | 4 +-
fs/ext4/acl.c | 8 +----
fs/ext4/acl.h | 4 +-
fs/ext4/file.c | 2 +-
fs/ext4/namei.c | 4 +-
fs/jffs2/acl.c | 7 +---
fs/jffs2/acl.h | 4 +-
fs/jffs2/dir.c | 2 +-
fs/jffs2/file.c | 2 +-
fs/jffs2/symlink.c | 2 +-
fs/jfs/acl.c | 7 +---
fs/jfs/file.c | 2 +-
fs/jfs/jfs_acl.h | 2 +-
fs/jfs/namei.c | 2 +-
fs/namei.c | 82 +++++++++++++++++++++---------------------
fs/xfs/linux-2.6/xfs_iops.c | 16 ++------
include/linux/fs.h | 1 +
include/linux/shmem_fs.h | 2 +-
mm/shmem.c | 6 ++--
mm/shmem_acl.c | 11 +-----
27 files changed, 79 insertions(+), 123 deletions(-)


2009-09-07 21:03:19

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 1/8] Do not call 'ima_path_check()' for each path component


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 10:05:33 -0700

Not only is that a supremely timing-critical path, but it's hopefully some
day going to be lockless for the common case, and ima can't do that.

Plus the integrity code doesn't even care about non-regular files,
so it was always a total waste of time and effort.

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/namei.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index f3c5b27..164aa15 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -856,9 +856,6 @@ static int __link_path_walk(const char *name, struct nameidata *nd)
if (err == -EAGAIN)
err = inode_permission(nd->path.dentry->d_inode,
MAY_EXEC);
- if (!err)
- err = ima_path_check(&nd->path, MAY_EXEC,
- IMA_COUNT_UPDATE);
if (err)
break;

--
1.6.4.1.209.g74b8

2009-09-07 21:03:52

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 2/8] Simplify exec_permission_lite() logic


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 10:50:37 -0700

Instead of returning EAGAIN and having the caller do something
special for that case, just do the special case directly.

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/namei.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 164aa15..bf8aa95 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -435,7 +435,7 @@ static int exec_permission_lite(struct inode *inode)
umode_t mode = inode->i_mode;

if (inode->i_op->permission)
- return -EAGAIN;
+ return inode_permission(inode, MAY_EXEC);

if (current_fsuid() == inode->i_uid)
mode >>= 6;
@@ -853,9 +853,6 @@ static int __link_path_walk(const char *name, struct nameidata *nd)

nd->flags |= LOOKUP_CONTINUE;
err = exec_permission_lite(inode);
- if (err == -EAGAIN)
- err = inode_permission(nd->path.dentry->d_inode,
- MAY_EXEC);
if (err)
break;

--
1.6.4.1.209.g74b8

2009-09-07 21:04:10

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 3/8] Simplify exec_permission_lite() further


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 10:53:56 -0700

This function is only called for path components that are already known
to be directories (they have a '->lookup' method). So don't bother
doing that whole S_ISDIR() testing, the whole point of the 'lite()'
version is that we know that we are looking at a directory component,
and that we're only checking name lookup permission.

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/namei.c | 8 +-------
1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index bf8aa95..e39e5cb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -445,13 +445,7 @@ static int exec_permission_lite(struct inode *inode)
if (mode & MAY_EXEC)
goto ok;

- if ((inode->i_mode & S_IXUGO) && capable(CAP_DAC_OVERRIDE))
- goto ok;
-
- if (S_ISDIR(inode->i_mode) && capable(CAP_DAC_OVERRIDE))
- goto ok;
-
- if (S_ISDIR(inode->i_mode) && capable(CAP_DAC_READ_SEARCH))
+ if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
goto ok;

return -EACCES;
--
1.6.4.1.209.g74b8

2009-09-07 21:04:43

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 4/8] Simplify exec_permission_lite(), part 3


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 11:08:31 -0700

Don't call down to the generic inode_permission() function just to
call the inode-specific permission function - just do it directly.

The generic inode_permission() code does things like checking MAY_WRITE
and devcgroup_inode_permission(), neither of which are relevant for the
light pathname walk permission checks (we always do just MAY_EXEC, and
the inode is never a special device).

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/namei.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e39e5cb..7959e70 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -434,8 +434,12 @@ static int exec_permission_lite(struct inode *inode)
{
umode_t mode = inode->i_mode;

- if (inode->i_op->permission)
- return inode_permission(inode, MAY_EXEC);
+ if (inode->i_op->permission) {
+ int ret = inode->i_op->permission(inode, MAY_EXEC);
+ if (!ret)
+ goto ok;
+ return ret;
+ }

if (current_fsuid() == inode->i_uid)
mode >>= 6;
--
1.6.4.1.209.g74b8

2009-09-07 21:05:09

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 5/8] Make 'check_acl()' a first-class filesystem op


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 11:51:25 -0700

This is stage one in flattening out the callchains for the common
permission testing. Rather than have most filesystem implemnt their
inode->i_op->permission own function that just calls back down to the
VFS layers 'generic_permission()' with the per-filesystem ACL checking
function, the filesystem can just expose its 'check_acl' function
directly, and let the VFS layer do everything for it.

This is all just preparatory - no filesystem actually enables this yet.

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/namei.c | 62 +++++++++++++++++++++++++++++----------------------
include/linux/fs.h | 1 +
2 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 7959e70..5953913 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -169,19 +169,10 @@ void putname(const char *name)
EXPORT_SYMBOL(putname);
#endif

-
-/**
- * generic_permission - check for access rights on a Posix-like filesystem
- * @inode: inode to check access rights for
- * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
- * @check_acl: optional callback to check for Posix ACLs
- *
- * Used to check for read/write/execute permissions on a file.
- * We use "fsuid" for this, letting us set arbitrary permissions
- * for filesystem access without changing the "normal" uids which
- * are used for other things..
+/*
+ * This does basic POSIX ACL permission checking
*/
-int generic_permission(struct inode *inode, int mask,
+static int acl_permission_check(struct inode *inode, int mask,
int (*check_acl)(struct inode *inode, int mask))
{
umode_t mode = inode->i_mode;
@@ -193,9 +184,7 @@ int generic_permission(struct inode *inode, int mask,
else {
if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
int error = check_acl(inode, mask);
- if (error == -EACCES)
- goto check_capabilities;
- else if (error != -EAGAIN)
+ if (error != -EAGAIN)
return error;
}

@@ -208,8 +197,32 @@ int generic_permission(struct inode *inode, int mask,
*/
if ((mask & ~mode) == 0)
return 0;
+ return -EACCES;
+}
+
+/**
+ * generic_permission - check for access rights on a Posix-like filesystem
+ * @inode: inode to check access rights for
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ * @check_acl: optional callback to check for Posix ACLs
+ *
+ * Used to check for read/write/execute permissions on a file.
+ * We use "fsuid" for this, letting us set arbitrary permissions
+ * for filesystem access without changing the "normal" uids which
+ * are used for other things..
+ */
+int generic_permission(struct inode *inode, int mask,
+ int (*check_acl)(struct inode *inode, int mask))
+{
+ int ret;
+
+ /*
+ * Do the basic POSIX ACL permission checks.
+ */
+ ret = acl_permission_check(inode, mask, check_acl);
+ if (ret != -EACCES)
+ return ret;

- check_capabilities:
/*
* Read/write DACs are always overridable.
* Executable DACs are overridable if at least one exec bit is set.
@@ -262,7 +275,7 @@ int inode_permission(struct inode *inode, int mask)
if (inode->i_op->permission)
retval = inode->i_op->permission(inode, mask);
else
- retval = generic_permission(inode, mask, NULL);
+ retval = generic_permission(inode, mask, inode->i_op->check_acl);

if (retval)
return retval;
@@ -432,27 +445,22 @@ static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name,
*/
static int exec_permission_lite(struct inode *inode)
{
- umode_t mode = inode->i_mode;
+ int ret;

if (inode->i_op->permission) {
- int ret = inode->i_op->permission(inode, MAY_EXEC);
+ ret = inode->i_op->permission(inode, MAY_EXEC);
if (!ret)
goto ok;
return ret;
}
-
- if (current_fsuid() == inode->i_uid)
- mode >>= 6;
- else if (in_group_p(inode->i_gid))
- mode >>= 3;
-
- if (mode & MAY_EXEC)
+ ret = acl_permission_check(inode, MAY_EXEC, inode->i_op->check_acl);
+ if (!ret)
goto ok;

if (capable(CAP_DAC_OVERRIDE) || capable(CAP_DAC_READ_SEARCH))
goto ok;

- return -EACCES;
+ return ret;
ok:
return security_inode_permission(inode, MAY_EXEC);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73e9b64..c1f9935 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1528,6 +1528,7 @@ struct inode_operations {
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int);
+ int (*check_acl)(struct inode *, int);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
--
1.6.4.1.209.g74b8

2009-09-07 21:05:29

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 6/8] shmfs: use 'check_acl' instead of 'permission'


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 12:04:28 -0700

shmfs wants purely standard POSIX ACL semantics, so we can use the new
generic VFS layer POSIX ACL checking rather than cooking our own
'permission()' function.

Signed-off-by: Linus Torvalds <[email protected]>
---
include/linux/shmem_fs.h | 2 +-
mm/shmem.c | 6 +++---
mm/shmem_acl.c | 11 +----------
3 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index abff6c9..6d3f2f4 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -39,7 +39,7 @@ static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
}

#ifdef CONFIG_TMPFS_POSIX_ACL
-int shmem_permission(struct inode *, int);
+int shmem_check_acl(struct inode *, int);
int shmem_acl_init(struct inode *, struct inode *);

extern struct xattr_handler shmem_xattr_acl_access_handler;
diff --git a/mm/shmem.c b/mm/shmem.c
index d713239..5a0b3d4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2446,7 +2446,7 @@ static const struct inode_operations shmem_inode_operations = {
.getxattr = generic_getxattr,
.listxattr = generic_listxattr,
.removexattr = generic_removexattr,
- .permission = shmem_permission,
+ .check_acl = shmem_check_acl,
#endif

};
@@ -2469,7 +2469,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.getxattr = generic_getxattr,
.listxattr = generic_listxattr,
.removexattr = generic_removexattr,
- .permission = shmem_permission,
+ .check_acl = shmem_check_acl,
#endif
};

@@ -2480,7 +2480,7 @@ static const struct inode_operations shmem_special_inode_operations = {
.getxattr = generic_getxattr,
.listxattr = generic_listxattr,
.removexattr = generic_removexattr,
- .permission = shmem_permission,
+ .check_acl = shmem_check_acl,
#endif
};

diff --git a/mm/shmem_acl.c b/mm/shmem_acl.c
index 606a8e7..df2c87f 100644
--- a/mm/shmem_acl.c
+++ b/mm/shmem_acl.c
@@ -157,7 +157,7 @@ shmem_acl_init(struct inode *inode, struct inode *dir)
/**
* shmem_check_acl - check_acl() callback for generic_permission()
*/
-static int
+int
shmem_check_acl(struct inode *inode, int mask)
{
struct posix_acl *acl = shmem_get_acl(inode, ACL_TYPE_ACCESS);
@@ -169,12 +169,3 @@ shmem_check_acl(struct inode *inode, int mask)
}
return -EAGAIN;
}
-
-/**
- * shmem_permission - permission() inode operation
- */
-int
-shmem_permission(struct inode *inode, int mask)
-{
- return generic_permission(inode, mask, shmem_check_acl);
-}
--
1.6.4.1.209.g74b8

2009-09-07 21:06:04

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 7/8] ext[234]: move over to 'check_acl' permission model


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 12:12:24 -0700

Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/ext2/acl.c | 8 +-------
fs/ext2/acl.h | 4 ++--
fs/ext2/file.c | 2 +-
fs/ext2/namei.c | 4 ++--
fs/ext3/acl.c | 8 +-------
fs/ext3/acl.h | 4 ++--
fs/ext3/file.c | 2 +-
fs/ext3/namei.c | 4 ++--
fs/ext4/acl.c | 8 +-------
fs/ext4/acl.h | 4 ++--
fs/ext4/file.c | 2 +-
fs/ext4/namei.c | 4 ++--
12 files changed, 18 insertions(+), 36 deletions(-)

diff --git a/fs/ext2/acl.c b/fs/ext2/acl.c
index d636e12..a63d442 100644
--- a/fs/ext2/acl.c
+++ b/fs/ext2/acl.c
@@ -230,7 +230,7 @@ ext2_set_acl(struct inode *inode, int type, struct posix_acl *acl)
return error;
}

-static int
+int
ext2_check_acl(struct inode *inode, int mask)
{
struct posix_acl *acl = ext2_get_acl(inode, ACL_TYPE_ACCESS);
@@ -246,12 +246,6 @@ ext2_check_acl(struct inode *inode, int mask)
return -EAGAIN;
}

-int
-ext2_permission(struct inode *inode, int mask)
-{
- return generic_permission(inode, mask, ext2_check_acl);
-}
-
/*
* Initialize the ACLs of a new inode. Called from ext2_new_inode.
*
diff --git a/fs/ext2/acl.h b/fs/ext2/acl.h
index ecefe47..3ff6cbb 100644
--- a/fs/ext2/acl.h
+++ b/fs/ext2/acl.h
@@ -54,13 +54,13 @@ static inline int ext2_acl_count(size_t size)
#ifdef CONFIG_EXT2_FS_POSIX_ACL

/* acl.c */
-extern int ext2_permission (struct inode *, int);
+extern int ext2_check_acl (struct inode *, int);
extern int ext2_acl_chmod (struct inode *);
extern int ext2_init_acl (struct inode *, struct inode *);

#else
#include <linux/sched.h>
-#define ext2_permission NULL
+#define ext2_check_acl NULL
#define ext2_get_acl NULL
#define ext2_set_acl NULL

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2b9e47d..a2f3afd 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -85,6 +85,6 @@ const struct inode_operations ext2_file_inode_operations = {
.removexattr = generic_removexattr,
#endif
.setattr = ext2_setattr,
- .permission = ext2_permission,
+ .check_acl = ext2_check_acl,
.fiemap = ext2_fiemap,
};
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e1dedb0..58051df 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -396,7 +396,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.removexattr = generic_removexattr,
#endif
.setattr = ext2_setattr,
- .permission = ext2_permission,
+ .check_acl = ext2_check_acl,
};

const struct inode_operations ext2_special_inode_operations = {
@@ -407,5 +407,5 @@ const struct inode_operations ext2_special_inode_operations = {
.removexattr = generic_removexattr,
#endif
.setattr = ext2_setattr,
- .permission = ext2_permission,
+ .check_acl = ext2_check_acl,
};
diff --git a/fs/ext3/acl.c b/fs/ext3/acl.c
index e167bae..c9b0df3 100644
--- a/fs/ext3/acl.c
+++ b/fs/ext3/acl.c
@@ -238,7 +238,7 @@ ext3_set_acl(handle_t *handle, struct inode *inode, int type,
return error;
}

-static int
+int
ext3_check_acl(struct inode *inode, int mask)
{
struct posix_acl *acl = ext3_get_acl(inode, ACL_TYPE_ACCESS);
@@ -254,12 +254,6 @@ ext3_check_acl(struct inode *inode, int mask)
return -EAGAIN;
}

-int
-ext3_permission(struct inode *inode, int mask)
-{
- return generic_permission(inode, mask, ext3_check_acl);
-}
-
/*
* Initialize the ACLs of a new inode. Called from ext3_new_inode.
*
diff --git a/fs/ext3/acl.h b/fs/ext3/acl.h
index 07d15a3..5973346 100644
--- a/fs/ext3/acl.h
+++ b/fs/ext3/acl.h
@@ -54,13 +54,13 @@ static inline int ext3_acl_count(size_t size)
#ifdef CONFIG_EXT3_FS_POSIX_ACL

/* acl.c */
-extern int ext3_permission (struct inode *, int);
+extern int ext3_check_acl (struct inode *, int);
extern int ext3_acl_chmod (struct inode *);
extern int ext3_init_acl (handle_t *, struct inode *, struct inode *);

#else /* CONFIG_EXT3_FS_POSIX_ACL */
#include <linux/sched.h>
-#define ext3_permission NULL
+#define ext3_check_acl NULL

static inline int
ext3_acl_chmod(struct inode *inode)
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..2992532 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -137,7 +137,7 @@ const struct inode_operations ext3_file_inode_operations = {
.listxattr = ext3_listxattr,
.removexattr = generic_removexattr,
#endif
- .permission = ext3_permission,
+ .check_acl = ext3_check_acl,
.fiemap = ext3_fiemap,
};

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 6ff7b97..aad6400 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2445,7 +2445,7 @@ const struct inode_operations ext3_dir_inode_operations = {
.listxattr = ext3_listxattr,
.removexattr = generic_removexattr,
#endif
- .permission = ext3_permission,
+ .check_acl = ext3_check_acl,
};

const struct inode_operations ext3_special_inode_operations = {
@@ -2456,5 +2456,5 @@ const struct inode_operations ext3_special_inode_operations = {
.listxattr = ext3_listxattr,
.removexattr = generic_removexattr,
#endif
- .permission = ext3_permission,
+ .check_acl = ext3_check_acl,
};
diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index f6d8967..0df88b2 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -236,7 +236,7 @@ ext4_set_acl(handle_t *handle, struct inode *inode, int type,
return error;
}

-static int
+int
ext4_check_acl(struct inode *inode, int mask)
{
struct posix_acl *acl = ext4_get_acl(inode, ACL_TYPE_ACCESS);
@@ -252,12 +252,6 @@ ext4_check_acl(struct inode *inode, int mask)
return -EAGAIN;
}

-int
-ext4_permission(struct inode *inode, int mask)
-{
- return generic_permission(inode, mask, ext4_check_acl);
-}
-
/*
* Initialize the ACLs of a new inode. Called from ext4_new_inode.
*
diff --git a/fs/ext4/acl.h b/fs/ext4/acl.h
index 949789d..9d843d5 100644
--- a/fs/ext4/acl.h
+++ b/fs/ext4/acl.h
@@ -54,13 +54,13 @@ static inline int ext4_acl_count(size_t size)
#ifdef CONFIG_EXT4_FS_POSIX_ACL

/* acl.c */
-extern int ext4_permission(struct inode *, int);
+extern int ext4_check_acl(struct inode *, int);
extern int ext4_acl_chmod(struct inode *);
extern int ext4_init_acl(handle_t *, struct inode *, struct inode *);

#else /* CONFIG_EXT4_FS_POSIX_ACL */
#include <linux/sched.h>
-#define ext4_permission NULL
+#define ext4_check_acl NULL

static inline int
ext4_acl_chmod(struct inode *inode)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3f1873f..27f3c53 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,7 +207,7 @@ const struct inode_operations ext4_file_inode_operations = {
.listxattr = ext4_listxattr,
.removexattr = generic_removexattr,
#endif
- .permission = ext4_permission,
+ .check_acl = ext4_check_acl,
.fallocate = ext4_fallocate,
.fiemap = ext4_fiemap,
};
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index de04013..114abe5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2536,7 +2536,7 @@ const struct inode_operations ext4_dir_inode_operations = {
.listxattr = ext4_listxattr,
.removexattr = generic_removexattr,
#endif
- .permission = ext4_permission,
+ .check_acl = ext4_check_acl,
.fiemap = ext4_fiemap,
};

@@ -2548,5 +2548,5 @@ const struct inode_operations ext4_special_inode_operations = {
.listxattr = ext4_listxattr,
.removexattr = generic_removexattr,
#endif
- .permission = ext4_permission,
+ .check_acl = ext4_check_acl,
};
--
1.6.4.1.209.g74b8

2009-09-07 21:06:33

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 8/8] jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'


From: Linus Torvalds <[email protected]>
Date: Fri, 28 Aug 2009 12:29:03 -0700

This avoids an indirect call in the VFS for each path component lookup.

Well, at least as long as you own the directory in question, and the ACL
check is unnecessary.

Signed-off-by: Linus Torvalds <[email protected]>
---
fs/jffs2/acl.c | 7 +------
fs/jffs2/acl.h | 4 ++--
fs/jffs2/dir.c | 2 +-
fs/jffs2/file.c | 2 +-
fs/jffs2/symlink.c | 2 +-
fs/jfs/acl.c | 7 +------
fs/jfs/file.c | 2 +-
fs/jfs/jfs_acl.h | 2 +-
fs/jfs/namei.c | 2 +-
fs/xfs/linux-2.6/xfs_iops.c | 16 ++++------------
10 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/fs/jffs2/acl.c b/fs/jffs2/acl.c
index 8fcb623..7edb62e 100644
--- a/fs/jffs2/acl.c
+++ b/fs/jffs2/acl.c
@@ -258,7 +258,7 @@ static int jffs2_set_acl(struct inode *inode, int type, struct posix_acl *acl)
return rc;
}

-static int jffs2_check_acl(struct inode *inode, int mask)
+int jffs2_check_acl(struct inode *inode, int mask)
{
struct posix_acl *acl;
int rc;
@@ -274,11 +274,6 @@ static int jffs2_check_acl(struct inode *inode, int mask)
return -EAGAIN;
}

-int jffs2_permission(struct inode *inode, int mask)
-{
- return generic_permission(inode, mask, jffs2_check_acl);
-}
-
int jffs2_init_acl_pre(struct inode *dir_i, struct inode *inode, int *i_mode)
{
struct posix_acl *acl, *clone;
diff --git a/fs/jffs2/acl.h b/fs/jffs2/acl.h
index fc929f2..f0ba63e 100644
--- a/fs/jffs2/acl.h
+++ b/fs/jffs2/acl.h
@@ -26,7 +26,7 @@ struct jffs2_acl_header {

#ifdef CONFIG_JFFS2_FS_POSIX_ACL

-extern int jffs2_permission(struct inode *, int);
+extern int jffs2_check_acl(struct inode *, int);
extern int jffs2_acl_chmod(struct inode *);
extern int jffs2_init_acl_pre(struct inode *, struct inode *, int *);
extern int jffs2_init_acl_post(struct inode *);
@@ -36,7 +36,7 @@ extern struct xattr_handler jffs2_acl_default_xattr_handler;

#else

-#define jffs2_permission (NULL)
+#define jffs2_check_acl (NULL)
#define jffs2_acl_chmod(inode) (0)
#define jffs2_init_acl_pre(dir_i,inode,mode) (0)
#define jffs2_init_acl_post(inode) (0)
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 6f60cc9..7aa4417 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -55,7 +55,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.rmdir = jffs2_rmdir,
.mknod = jffs2_mknod,
.rename = jffs2_rename,
- .permission = jffs2_permission,
+ .check_acl = jffs2_check_acl,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c
index 23c9475..b7b74e2 100644
--- a/fs/jffs2/file.c
+++ b/fs/jffs2/file.c
@@ -56,7 +56,7 @@ const struct file_operations jffs2_file_operations =

const struct inode_operations jffs2_file_inode_operations =
{
- .permission = jffs2_permission,
+ .check_acl = jffs2_check_acl,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
diff --git a/fs/jffs2/symlink.c b/fs/jffs2/symlink.c
index b7339c3..4ec11e8 100644
--- a/fs/jffs2/symlink.c
+++ b/fs/jffs2/symlink.c
@@ -21,7 +21,7 @@ const struct inode_operations jffs2_symlink_inode_operations =
{
.readlink = generic_readlink,
.follow_link = jffs2_follow_link,
- .permission = jffs2_permission,
+ .check_acl = jffs2_check_acl,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
diff --git a/fs/jfs/acl.c b/fs/jfs/acl.c
index a29c7c3..d66477c 100644
--- a/fs/jfs/acl.c
+++ b/fs/jfs/acl.c
@@ -114,7 +114,7 @@ out:
return rc;
}

-static int jfs_check_acl(struct inode *inode, int mask)
+int jfs_check_acl(struct inode *inode, int mask)
{
struct posix_acl *acl = jfs_get_acl(inode, ACL_TYPE_ACCESS);

@@ -129,11 +129,6 @@ static int jfs_check_acl(struct inode *inode, int mask)
return -EAGAIN;
}

-int jfs_permission(struct inode *inode, int mask)
-{
- return generic_permission(inode, mask, jfs_check_acl);
-}
-
int jfs_init_acl(tid_t tid, struct inode *inode, struct inode *dir)
{
struct posix_acl *acl = NULL;
diff --git a/fs/jfs/file.c b/fs/jfs/file.c
index 7f6063a..2b70fa7 100644
--- a/fs/jfs/file.c
+++ b/fs/jfs/file.c
@@ -96,7 +96,7 @@ const struct inode_operations jfs_file_inode_operations = {
.removexattr = jfs_removexattr,
#ifdef CONFIG_JFS_POSIX_ACL
.setattr = jfs_setattr,
- .permission = jfs_permission,
+ .check_acl = jfs_check_acl,
#endif
};

diff --git a/fs/jfs/jfs_acl.h b/fs/jfs/jfs_acl.h
index 88475f1..b07bd41 100644
--- a/fs/jfs/jfs_acl.h
+++ b/fs/jfs/jfs_acl.h
@@ -20,7 +20,7 @@

#ifdef CONFIG_JFS_POSIX_ACL

-int jfs_permission(struct inode *, int);
+int jfs_check_acl(struct inode *, int);
int jfs_init_acl(tid_t, struct inode *, struct inode *);
int jfs_setattr(struct dentry *, struct iattr *);

diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 514ee2e..c79a427 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1543,7 +1543,7 @@ const struct inode_operations jfs_dir_inode_operations = {
.removexattr = jfs_removexattr,
#ifdef CONFIG_JFS_POSIX_ACL
.setattr = jfs_setattr,
- .permission = jfs_permission,
+ .check_acl = jfs_check_acl,
#endif
};

diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 8070b34..6c32f1d 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -485,14 +485,6 @@ xfs_vn_put_link(
}

STATIC int
-xfs_vn_permission(
- struct inode *inode,
- int mask)
-{
- return generic_permission(inode, mask, xfs_check_acl);
-}
-
-STATIC int
xfs_vn_getattr(
struct vfsmount *mnt,
struct dentry *dentry,
@@ -696,7 +688,7 @@ xfs_vn_fiemap(
}

static const struct inode_operations xfs_inode_operations = {
- .permission = xfs_vn_permission,
+ .check_acl = xfs_check_acl,
.truncate = xfs_vn_truncate,
.getattr = xfs_vn_getattr,
.setattr = xfs_vn_setattr,
@@ -724,7 +716,7 @@ static const struct inode_operations xfs_dir_inode_operations = {
.rmdir = xfs_vn_unlink,
.mknod = xfs_vn_mknod,
.rename = xfs_vn_rename,
- .permission = xfs_vn_permission,
+ .check_acl = xfs_check_acl,
.getattr = xfs_vn_getattr,
.setattr = xfs_vn_setattr,
.setxattr = generic_setxattr,
@@ -749,7 +741,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = {
.rmdir = xfs_vn_unlink,
.mknod = xfs_vn_mknod,
.rename = xfs_vn_rename,
- .permission = xfs_vn_permission,
+ .check_acl = xfs_check_acl,
.getattr = xfs_vn_getattr,
.setattr = xfs_vn_setattr,
.setxattr = generic_setxattr,
@@ -762,7 +754,7 @@ static const struct inode_operations xfs_symlink_inode_operations = {
.readlink = generic_readlink,
.follow_link = xfs_vn_follow_link,
.put_link = xfs_vn_put_link,
- .permission = xfs_vn_permission,
+ .check_acl = xfs_check_acl,
.getattr = xfs_vn_getattr,
.setattr = xfs_vn_setattr,
.setxattr = generic_setxattr,
--
1.6.4.1.209.g74b8

2009-09-07 22:13:04

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 2/8] Simplify exec_permission_lite() logic

On Mon, 7 Sep 2009, Linus Torvalds wrote:

>
> From: Linus Torvalds <[email protected]>
> Date: Fri, 28 Aug 2009 10:50:37 -0700
>
> Instead of returning EAGAIN and having the caller do something
> special for that case, just do the special case directly.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Reviewed-by: James Morris <[email protected]>


--
James Morris
<[email protected]>

2009-09-07 22:15:48

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 3/8] Simplify exec_permission_lite() further

On Mon, 7 Sep 2009, Linus Torvalds wrote:

> Signed-off-by: Linus Torvalds <[email protected]>

Looks ok to me, but I'd be interested in any comments from others on the
cc list.

Reviewed-by: James Morris <[email protected]>


--
James Morris
<[email protected]>

2009-09-07 22:19:09

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 4/8] Simplify exec_permission_lite(), part 3

On Mon, 7 Sep 2009, Linus Torvalds wrote:

> The generic inode_permission() code does things like checking MAY_WRITE
> and devcgroup_inode_permission(), neither of which are relevant for the
> light pathname walk permission checks (we always do just MAY_EXEC, and
> the inode is never a special device).
>
> Signed-off-by: Linus Torvalds <[email protected]>

Reviewed-by: James Morris <[email protected]>

--
James Morris
<[email protected]>

2009-09-07 22:21:13

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 5/8] Make 'check_acl()' a first-class filesystem op

On Mon, 7 Sep 2009, Linus Torvalds wrote:

>
> From: Linus Torvalds <[email protected]>
> Date: Fri, 28 Aug 2009 11:51:25 -0700
>
> This is stage one in flattening out the callchains for the common
> permission testing. Rather than have most filesystem implemnt their
> inode->i_op->permission own function that just calls back down to the
> VFS layers 'generic_permission()' with the per-filesystem ACL checking
> function, the filesystem can just expose its 'check_acl' function
> directly, and let the VFS layer do everything for it.
>
> This is all just preparatory - no filesystem actually enables this yet.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Reviewed-by: James Morris <[email protected]>

--
James Morris
<[email protected]>

2009-09-07 22:23:21

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 6/8] shmfs: use 'check_acl' instead of 'permission'

On Mon, 7 Sep 2009, Linus Torvalds wrote:

>
> From: Linus Torvalds <[email protected]>
> Date: Fri, 28 Aug 2009 12:04:28 -0700
>
> shmfs wants purely standard POSIX ACL semantics, so we can use the new
> generic VFS layer POSIX ACL checking rather than cooking our own
> 'permission()' function.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Reviewed-by: James Morris <[email protected]>

--
James Morris
<[email protected]>

2009-09-08 00:04:31

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 6/8] shmfs: use 'check_acl' instead of 'permission'

On Tue, 8 Sep 2009, James Morris wrote:

> Reviewed-by: James Morris <[email protected]>

Ditto for patches 7 & 8.


--
James Morris
<[email protected]>

2009-09-08 01:50:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/8] Do not call 'ima_path_check()' for each path component

On Mon, Sep 07, 2009 at 02:02:16PM -0700, Linus Torvalds wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Fri, 28 Aug 2009 10:05:33 -0700
>
> Not only is that a supremely timing-critical path, but it's hopefully some
> day going to be lockless for the common case, and ima can't do that.
>
> Plus the integrity code doesn't even care about non-regular files,
> so it was always a total waste of time and effort.

Haha, nice one.

2009-09-08 14:21:35

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 0/8] VFS name lookup permission checking cleanup

Quoting Linus Torvalds ([email protected]):
>
> This is a series of eight trivial patches that I'd like people to take a
> look at, because I am hoping to eventually do multiple path component
> lookups in one go without taking the per-dentry lock or incrementing (and
> then decrementing) the per-dentry atomic count for each component.
>
> The aim would be to try to avoid getting that annoying cacheline ping-pong
> on the common top-level dentries that everybody looks up (ie root and home
> directories, /usr, /usr/bin etc).
>
> Right now I have some simple (but real) loads that show the contention on
> dentry->d_lock to be a roughly 3% performance hit on a single-socket
> nehalem, and I assume it can be much worse on multi-socket machines.
>
> And the thing is, it should be entirely possible to do everything but the
> last component lookup with just a single read_seqbegin()/read_seqretry()
> around the whole lookup. Yes, the last component is special and absolutely
> needs locking and counting - but the last component also doesn't tend to
> be shared, so locking it is fine.
>
> Now, I may never actually get there, but when looking at it, the biggest
> problem is actually not so much the path lookup itself, as the security
> tests that are done for each path component. And it should be noted that
> in order for a lockless seq-lock only lookup make sense, any such
> operations would have to be totally lock-free too. They certainly can't
> take mutexes etc, but right now they do.
>
> Those security tests fall into two categories:
>
> - actual security layer callouts: ima_path_check().
>
> This one looks totally pointless. Path component lookup is a horribly
> timing-critical path, and we will only do a successful lookup on a
> directory (inode needs to have a ->lookup operation), yet in the middle
> of that is a call to "ima_path_check()".
>
> Now, it looks like ima_path_check() is very much designed to only check
> the _final_ path anyway, and was never meant to be used to check the
> directories we hit on the way. In fact, the whole function starts with
>
> if (!ima_initialized || !S_ISREG(inode->i_mode))
> return 0;
>
> so it's totally pointless to do that thing on a directory where
> that !S_ISREG() test will trigger.
>
> So just remove it. IMA should never have put that check in there to
> begin with, it's just way too performance-sensitive.
>
> - the real filesystem permission checks.
>
> We used to do the common case entirely in the VFS layer, but these days
> the common case includes POSIX ACL checking, and as a result, the
> trivial short-circuit code in the VFS layer almost never triggers in
> practice, and we call down to the low-level filesystem for each
> component.
>
> We can't fix that by just removing the call, but what we _can_ do is to
> at least avoid the silly calling back-and-forth: most filesystems will
> just call back to the VFS layer to do the "generic_permission()" with
> their own ACL-checking routines.
>
> That way we can flatten the call-chain out a bit, and avoid one
> unnecessary indirect call in that timing-critical region. And
> eventually, if we make the whole ACL caching thing be something that we
> do at a VFS layer (Al Viro already worked on _some_ of that), we'll be
> able to avoid the calls entirely when we can see the cached ACL
> pointers directly.
>
> So this series of 8 patches do all these preliminary things. As shown by
> the diffstat below, it actually reduces the lines of code (mainly by just
> removing the silly per-filesystem wrappers around "generic_permission()")
> and it also makes it a _lot_ clearer what actually gets called in that
> whole 'exec_permission_lite()' function that we use to check the
> permission of a pathname lookup.
>
> Comments? Especially from the IMA people (first patch) and from generic
> VFS, security and low-level FS people (the 'Simplify exec_permission_lite'
> series, and then the check_acl + per-filesystem changes).
>
> Al?
>
> I'm looking to merge these shortly after 2.6.31 is released, but comments
> welcome.

All of them seem good, and I don't see any thinkos, no resulting skipped
checks or anything.

Acked-by: Serge Hallyn <[email protected]>

-serge

2009-09-08 14:40:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH 3/8] Simplify exec_permission_lite() further

Linus Torvalds wrote:
> This function is only called for path components that are already known
> to be directories (they have a '->lookup' method). So don't bother
> doing that whole S_ISDIR() testing,

A few years ago you briefly discussed diddling the VFS to accept
objects which are both files and directories. That is, have
->lookup() and also can be read as regular files.

I don't know if your thinking has changed, and I don't particularly
care, only thought I'd remind in case it's something you'd like to
support at some point.

-- Jamie

2009-09-08 14:58:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 3/8] Simplify exec_permission_lite() further

On Tue, Sep 08, 2009 at 03:40:12PM +0100, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > This function is only called for path components that are already known
> > to be directories (they have a '->lookup' method). So don't bother
> > doing that whole S_ISDIR() testing,
>
> A few years ago you briefly discussed diddling the VFS to accept
> objects which are both files and directories. That is, have
> ->lookup() and also can be read as regular files.
>
> I don't know if your thinking has changed, and I don't particularly
> care, only thought I'd remind in case it's something you'd like to
> support at some point.

If it's being used as a component of a pathname, we're _using_ it as a
directory, so it won't matter for this case.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-09-08 15:04:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/8] Simplify exec_permission_lite() further



On Tue, 8 Sep 2009, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > This function is only called for path components that are already known
> > to be directories (they have a '->lookup' method). So don't bother
> > doing that whole S_ISDIR() testing,
>
> A few years ago you briefly discussed diddling the VFS to accept
> objects which are both files and directories. That is, have
> ->lookup() and also can be read as regular files.
>
> I don't know if your thinking has changed, and I don't particularly
> care, only thought I'd remind in case it's something you'd like to
> support at some point.

No, I explicitly thought about that case too, and the thing is, that even
_if_ we do that some day, the change is actually the RightThing(tm) to do.

In fact, it's even _more_ true in that case. The old code was broken for
that case.

The thing is, if we ever implement the concept of using a regular file as
a path component (say, we look up some attributes or we teach the kernel
to look into tar-files or whatever), then the permission check for using
it as a path component would be _different_ from the permission check for
"executing" the file as a file. When used as a path component, we'd want
to use CAP_DAC_SEARCH, for example, not some "is it ok to execute this
file" permission.

So "permission" for a path component really is fundamentally different
from the same thing as an actual end-result file open/use. They do share
some logic, of course, like just the whole owner/group/other and ACL
logic. But at the same time they have conceptual - and actual -
differences too.

Just think about what "execute" permission means: it's not just the 'x'
bit, there are things like per-filesystem "NOEXEC" bits. We check that at
a completely different level already - exactly because the "execute"
permission for a pathname component is something fundamentally _different_
from executing the result of an actual final lookup.

The "exec_permission_lite()" function makes some of that difference more
explicit - and also makes it explicit what is shared. We do check the 'x'
bit, and we do honor ACL bits, but the whole Integrity check part was very
obviously meant to be done at the "actual final lookup result" level
rather than at the path component level.

In fact, as it was, "ima_path_check()" wouldn't know whether MAY_EXEC
meant that we were going to actually run a program, or look up a pathname.
The way it "solved" that was to just ignore non-regular files. And I'm
convinced that the real solution is to just say that "ima_path_check()" is
always checking the end result of a path, not the components, and then
MAY_EXEC actually has a much more unambiguous meaning.

(Admittedly I think the IMA code could also have disambiguated the cases
by looking at whether MAY_OPEN is set ot not - looking at whether the
inode in question had S_ISDIR set probably had other reasons too).

Anyway, I do suspect that if we ever do the whole "able to treat regular
files as a lookup target" we _will_ hit other issues, and it's going to
require a lot of thought - but I suspect this series is actually going to
help it by separating out the permission handling more clearly.

Linus

2009-09-08 17:52:30

by Mimi Zohar

[permalink] [raw]
Subject: Re: [PATCH 0/8] VFS name lookup permission checking cleanup

Linus Torvalds <[email protected]> wrote on 09/07/2009
05:01:14 PM:

> This is a series of eight trivial patches that I'd like people to take a

> look at, because I am hoping to eventually do multiple path component
> lookups in one go without taking the per-dentry lock or incrementing
(and
> then decrementing) the per-dentry atomic count for each component.
>
> The aim would be to try to avoid getting that annoying cacheline
ping-pong
> on the common top-level dentries that everybody looks up (ie root and
home
> directories, /usr, /usr/bin etc).
>
> Right now I have some simple (but real) loads that show the contention
on
> dentry->d_lock to be a roughly 3% performance hit on a single-socket
> nehalem, and I assume it can be much worse on multi-socket machines.
>
> And the thing is, it should be entirely possible to do everything but
the
> last component lookup with just a single read_seqbegin()/read_seqretry()

> around the whole lookup. Yes, the last component is special and
absolutely
> needs locking and counting - but the last component also doesn't tend to

> be shared, so locking it is fine.
>
> Now, I may never actually get there, but when looking at it, the biggest

> problem is actually not so much the path lookup itself, as the security
> tests that are done for each path component. And it should be noted that

> in order for a lockless seq-lock only lookup make sense, any such
> operations would have to be totally lock-free too. They certainly can't
> take mutexes etc, but right now they do.
>
> Those security tests fall into two categories:
>
> - actual security layer callouts: ima_path_check().
>
> This one looks totally pointless. Path component lookup is a horribly

> timing-critical path, and we will only do a successful lookup on a
> directory (inode needs to have a ->lookup operation), yet in the
middle
> of that is a call to "ima_path_check()".
>
> Now, it looks like ima_path_check() is very much designed to only
check
> the _final_ path anyway, and was never meant to be used to check the
> directories we hit on the way. In fact, the whole function starts
with
>
> if (!ima_initialized || !S_ISREG(inode->i_mode))
> return 0;
>
> so it's totally pointless to do that thing on a directory where
> that !S_ISREG() test will trigger.
>
> So just remove it. IMA should never have put that check in there to
> begin with, it's just way too performance-sensitive.

You're right. We don't need to call ima_path_check() here, as IMA
only measures the integrity of the file itself, and not directories.

Mimi

2009-09-08 18:06:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 6/8] shmfs: use 'check_acl' instead of 'permission'

On Mon, 7 Sep 2009, Linus Torvalds wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Fri, 28 Aug 2009 12:04:28 -0700
>
> shmfs wants purely standard POSIX ACL semantics, so we can use the new
> generic VFS layer POSIX ACL checking rather than cooking our own
> 'permission()' function.
>
> Signed-off-by: Linus Torvalds <[email protected]>

Acked-by: Hugh Dickins <[email protected]>

> ---
> include/linux/shmem_fs.h | 2 +-
> mm/shmem.c | 6 +++---
> mm/shmem_acl.c | 11 +----------
> 3 files changed, 5 insertions(+), 14 deletions(-)

2009-09-08 18:34:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 8/8] jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'

The split of these patches is a bit odd, either do all in one patch or
one patch per filesystem instead of those groups.

That beeing said if we go down this way I would prefer if we go
down all the way, that is convert the remaining few filesystems that
pass a check_acl argument to generic_permission (btrfs, gfs2, ocfs2)
and just kill off that argument.

After that there is another step we can easily go: as we now cache the
ACLs in the generic inode instead of the per-fs one we can move the
get_cached_acl call to your acl_permission_check helper (for gfs2/ocfs2
that don't cache ACLs it will always fail), and not call out to the fs
for the fast path at all.

2009-09-09 17:11:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 8/8] jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'



On Tue, 8 Sep 2009, Christoph Hellwig wrote:
>
> The split of these patches is a bit odd, either do all in one patch or
> one patch per filesystem instead of those groups.

Hmm. I actually did that somewhat on purpose.

The reason? If there is something wrong, I want bisection to specify it
fairly well, and I was thinking that maybe there would be some filesystem
specific issue. I know, it's unlikely, but hey, if I knew when bugs
happened, I'd just make sure they never did..

So I wanted to keep the shmfs one separate, because people use that
filesystem independently of - and generally differently from - the other
on-disk ones, and maybe you could have a shmfs-specific bug but not a
filesystem-specific one (or vice versa). So if some application suddenly
breaks, anybody doing bisection would see which one it is.

Now, I could then have bundled up the rest of the filesystems as one
commit, or done them all as individual ones, and there I don't really have
any huge preferences. There were six filesystems that had "obviously just
the wrapper function" (done by just doing a

git grep -2 generic_permission

in case anybody cares), and quite frankly, if you do that grep, then the
ext* group stands out very clearly (next to each other, same indentation
due to filenames etc etc). So I just did them next.

And the third group was just "the rest". Not standing out in any
particular way, but also not worth doing individually in any particular
way. Bisectability in those groups doesn't much matter, because I don't
think it's all that likely that a machine that shows problems would run a
mix of filesystems, the way you'd mix shmfs with other filesystems and
other patterns.

> That beeing said if we go down this way I would prefer if we go
> down all the way, that is convert the remaining few filesystems that
> pass a check_acl argument to generic_permission (btrfs, gfs2, ocfs2)
> and just kill off that argument.

And I do agree, eventually. But the series was really meant to be a
standalone thing where I didn't force filesystems to change. I also hope
that some filesystems, like btrfs, that don't use the "trivial wrapper",
would look at _why_ they don't just use the trivial wrapper.

For some of them, they seem to have real major reasons not to just use
'generic_permission()' (eg fuse), while others - btrfs - seem to be just
odd, and eg have its own special case of btrfs-specific read-only crap.
Which is a real downer, since that MAY_WRITE case isn't even relevant for
pathname lookup, but even for other inodes I really don't see why btrfs
doesn't just use the generic VFS-layer "S_IMMUTABLE" bit.

Now, for your particular suggestion it doesn't matter (we'd just drop the
last argument, and have "generic_permission()" pick it up from the inode
ops), but the larger point was that I really didn't want to change
filesystem code unnecessarily. And I'd actually hope that eventually _no_
filesystem would use "generic_permission()" at all, and we'd just always
do that whole check_acl thing at a VFS level if the filesystem provides
acls.

Part of that is that I would also move the whole "get_cached_acl()" to
the VFS layer entirely - and I think Al has patches to this. So then we'd
only need to call some filesystem-specific "check_acl" in the cases where
we do _not_ already have cached ACL's - and then we can finally do all the
ACL checks without ever calling into the filesystem at all, which is
pretty much required if we want to do lockless lookups.

> After that there is another step we can easily go: as we now cache the
> ACLs in the generic inode instead of the per-fs one we can move the
> get_cached_acl call to your acl_permission_check helper (for gfs2/ocfs2
> that don't cache ACLs it will always fail), and not call out to the fs
> for the fast path at all.

Yes, see above. Al and I talked about this a couple of months ago, and
that's why he already moved the ACL lists to the generic inode in
preparation of this. The reason for that was that absolutely _all_
filesystems got the locking wrong (too much unnecessary locking for the
common case of a cached "no acl" case), and some of them got the caching
wrong (no caching at all).

The take-away message from this all is generally: "don't make individual
filesystems do even simple things - they'll do it wrong". It's not just
that the VFS layer will do it better, it's that if it's done in the VFS
layer we can be clever about locking in a way that we can _not_ be when we
have to rely on the filesystem writers being aware of subtle issues.

Linus