So since we now do pathname lookup so well, I've been doing profiles
and looking at which parts are still problematic.
And some of them are just because it's real work to look up filenames.
So testing one of my favorite stupid loads ("do 'make' on a kernel
that is basically already made"), I'm not at all upset when I see
profiles that say
25.91% make make [.] 0x1e78
5.41% make [kernel.kallsyms] [k] __d_lookup_rcu
4.45% make [kernel.kallsyms] [k] link_path_walk
although I do wonder what make is doing that makes it so slow.
Whatever. The kernel part looks rather good.
Except looking down, almost 2% of the time is spent in
"acl_permission_check()"! That's just disgusting. The function should
be instantaneous. It's a really trivial function. But it gets called a
_lot_.
Now part of it is that the code generated for it was just bad. We
should inline the call to "current_user_ns()", because immediately
afterwards we do "current_fsuid()", and a lot of it is shared (they
both access "current->cred"). Easy enough (just use
"_current_user_ns()" instead, to inline it). And let's not use
"mode_t", use "unsigned int" and gcc generates better code.
Fine. That gets us down a bit, so now it's not the third-hottest
kernel function, now it's just the fifth-hottest one:
1.67% make [kernel.kallsyms] [k] __strncpy_from_user
1.39% make [kernel.kallsyms] [k] do_lookup
1.28% make [kernel.kallsyms] [k]
acl_permission_check
which is still really quite depressing.
The annotated profile tells the story:
7.36 : ffffffff810bcd52: 65 48 8b 04 25 80 b5 mov
%gs:0xb580,%rax
9.71 : ffffffff810bcd67: 48 8b 80 18 03 00 00 mov
0x318(%rax),%rax
15.34 : ffffffff810bcd71: 48 8b 70 70 mov
0x70(%rax),%rsi
30.57 : ffffffff810bcd79: 48 81 7e 58 00 3b a4 cmpq
$0xffffffff81a43b00,0x58(%rsi)
Those four instructions are about two thirds of the cost of the
function. The last two are about 50% of the cost.
They are the accesses to "current", "->cred", "->user" and "->user_ns"
respectively (the cmp with the big constant is that compare against
"init_ns").
Now, if we got rid of them, we wouldn't improve performance by 2/3rds
on that function, because we do need the two first accesses for
"fsuid" (which is the next check), and the third one (which is
currently "cred->user" ends up doing the cache miss that we'd take for
"cred->fsuid" anyway. So the first three costs are fairly inescapable.
They are also cheaper, probably because those fields tend to be more
often in the cache. So it really is that fourth one that hurts the
most, as shown by it taking almost a third of the cycles of that
function.
And it all comes from that annoying commit e795b71799ff0 ("userns:
userns: check user namespace for task->file uid equivalence checks"),
and I bet nobody involved thought about how expensive that was.
That "user_ns" is _really_ expensive to load. And the fact that it's
after a chain of three other loads makes it all totally serialized,
and makes things much more expensive.
Could we perhaps have "user_ns" directly in the "struct cred"? Or
could we avoid or short-circuit this check entirely somehow, since it
always checks against "init_ns"?
It's sad seeing that long chain of accesses and how nasty it gets (and
interesting but perhaps not surprising how it also clearly gets colder
from the cache the further away from "current" that it is).
It's extra sad, considering that almost nobody gets any _advantage_
from that stupid user_ns test. So it really is wasted time.
Linus
Quoting Linus Torvalds ([email protected]):
> Those four instructions are about two thirds of the cost of the
> function. The last two are about 50% of the cost.
>
> They are the accesses to "current", "->cred", "->user" and "->user_ns"
> respectively (the cmp with the big constant is that compare against
> "init_ns").
>
> Now, if we got rid of them, we wouldn't improve performance by 2/3rds
> on that function, because we do need the two first accesses for
> "fsuid" (which is the next check), and the third one (which is
> currently "cred->user" ends up doing the cache miss that we'd take for
> "cred->fsuid" anyway. So the first three costs are fairly inescapable.
>
> They are also cheaper, probably because those fields tend to be more
> often in the cache. So it really is that fourth one that hurts the
> most, as shown by it taking almost a third of the cycles of that
> function.
>
> And it all comes from that annoying commit e795b71799ff0 ("userns:
> userns: check user namespace for task->file uid equivalence checks"),
> and I bet nobody involved thought about how expensive that was.
>
> That "user_ns" is _really_ expensive to load. And the fact that it's
> after a chain of three other loads makes it all totally serialized,
> and makes things much more expensive.
>
> Could we perhaps have "user_ns" directly in the "struct cred"? Or
The only reason not to put it into struct cred would be to avoid growing
the struct cred. For that matter, esp since you can't unshare the user_ns,
it could also go right into the task_struct.
(Eric's sys_setns patchset will eventually complicate that, but I don't
think it'll be a problem)
> could we avoid or short-circuit this check entirely somehow, since it
> always checks against "init_ns"?
Of course I'm hoping that before fall the check won't be against
init_ns any more :) I was actually hoping to get back to that next
week, so I can start by testing the caching you suggest.
thanks,
-serge
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Linus Torvalds ([email protected]):
>> Those four instructions are about two thirds of the cost of the
>> function. The last two are about 50% of the cost.
>>
>> They are the accesses to "current", "->cred", "->user" and "->user_ns"
>> respectively (the cmp with the big constant is that compare against
>> "init_ns").
>>
>> Now, if we got rid of them, we wouldn't improve performance by 2/3rds
>> on that function, because we do need the two first accesses for
>> "fsuid" (which is the next check), and the third one (which is
>> currently "cred->user" ends up doing the cache miss that we'd take for
>> "cred->fsuid" anyway. So the first three costs are fairly inescapable.
>>
>> They are also cheaper, probably because those fields tend to be more
>> often in the cache. So it really is that fourth one that hurts the
>> most, as shown by it taking almost a third of the cycles of that
>> function.
>>
>> And it all comes from that annoying commit e795b71799ff0 ("userns:
>> userns: check user namespace for task->file uid equivalence checks"),
>> and I bet nobody involved thought about how expensive that was.
>>
>> That "user_ns" is _really_ expensive to load. And the fact that it's
>> after a chain of three other loads makes it all totally serialized,
>> and makes things much more expensive.
>>
>> Could we perhaps have "user_ns" directly in the "struct cred"? Or
>
> The only reason not to put it into struct cred would be to avoid growing
> the struct cred. For that matter, esp since you can't unshare the user_ns,
> it could also go right into the task_struct.
>
> (Eric's sys_setns patchset will eventually complicate that, but I don't
> think it'll be a problem)
>From the perspective of a process the user namespace and the pid
namespace will never change. I expect we will have something that lets
you change the user namespace and the pid namespace experienced by child
processes. So the sys_setns work should not affect this.
>> could we avoid or short-circuit this check entirely somehow, since it
>> always checks against "init_ns"?
>
> Of course I'm hoping that before fall the check won't be against
> init_ns any more :) I was actually hoping to get back to that next
> week, so I can start by testing the caching you suggest.
Linus brings up a good point that we need to be very careful with
the user namespace and performance. That said I think there is
a cheap trick we can do until the user namespace is actually
good for something.
Something like my untested patch below.
Perhaps current_user_ns needs to move into user_namespace.h to get this
to compile. There are some weird circular header dependencies in there.
In any event an inline version of current_user_ns that returns
init_user_ns in the case where user namespaces aren't compiled in should
fix the immediate performance problems by allowing the compiler to
optimize them out.
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 9aeeb0b..09c76c2 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -357,7 +357,17 @@ static inline void put_cred(const struct cred *_cred)
#define _current_user_ns() (current_cred_xxx(user)->user_ns)
#define current_security() (current_cred_xxx(security))
+#if CONFIG_USER_NS
extern struct user_namespace *current_user_ns(void);
+#else
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+static inline struct user_namespace *current_user_ns(void)
+{
+
+ return &init_user_ns;
+}
+#endif
#define current_uid_gid(_uid, _gid) \
do { \
I wonder how much this would help: (only compile-tested)
I will look into how to do some profiling tomorrow.
From: Serge E. Hallyn <[email protected]>
Date: Fri, 13 May 2011 04:27:54 +0100
Subject: [PATCH 1/1] Cache struct cred in acl_permission_check, and cache user_ns
in struct cred.
Signed-off-by: Serge E. Hallyn <[email protected]>
---
fs/namei.c | 5 +++--
include/linux/cred.h | 3 ++-
kernel/cred.c | 6 ++++++
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 54fc993..eb0f4ea 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -180,13 +180,14 @@ static int acl_permission_check(struct inode *inode, int mask, unsigned int flag
int (*check_acl)(struct inode *inode, int mask, unsigned int flags))
{
umode_t mode = inode->i_mode;
+ const struct cred *cred = current_cred();
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
- if (current_user_ns() != inode_userns(inode))
+ if (cred->user_ns != inode_userns(inode))
goto other_perms;
- if (current_fsuid() == inode->i_uid)
+ if (cred->fsuid == inode->i_uid)
mode >>= 6;
else {
if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 9aeeb0b..a2a892c 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -146,6 +146,7 @@ struct cred {
void *security; /* subjective LSM security */
#endif
struct user_struct *user; /* real user ID subscription */
+ struct user_namespace *user_ns; /* cached user->user_ns */
struct group_info *group_info; /* supplementary groups for euid/fsgid */
struct rcu_head rcu; /* RCU deletion hook */
};
@@ -354,7 +355,7 @@ static inline void put_cred(const struct cred *_cred)
#define current_fsgid() (current_cred_xxx(fsgid))
#define current_cap() (current_cred_xxx(cap_effective))
#define current_user() (current_cred_xxx(user))
-#define _current_user_ns() (current_cred_xxx(user)->user_ns)
+#define _current_user_ns() (current_cred_xxx(user_ns))
#define current_security() (current_cred_xxx(security))
extern struct user_namespace *current_user_ns(void);
diff --git a/kernel/cred.c b/kernel/cred.c
index 5557b55..a3dcf28 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -54,6 +54,7 @@ struct cred init_cred = {
.cap_effective = CAP_INIT_EFF_SET,
.cap_bset = CAP_INIT_BSET,
.user = INIT_USER,
+ .user_ns = &init_user_ns,
.group_info = &init_groups,
#ifdef CONFIG_KEYS
.tgcred = &init_tgcred,
@@ -410,6 +411,11 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
goto error_put;
}
+ /* cache user_ns in cred. Doesn't need a refcount because it will
+ * stay pinned by cred->user
+ */
+ new->user_ns = new->user->user_ns;
+
#ifdef CONFIG_KEYS
/* new threads get their own thread keyrings if their parent already
* had one */
--
1.7.0.4
On Thu, May 12, 2011 at 8:52 PM, Eric W. Biederman
<[email protected]> wrote:
>
> In any event an inline version of current_user_ns that returns
> init_user_ns in the case where user namespaces aren't compiled in should
> fix the immediate performance problems by allowing the compiler to
> optimize them out.
Yes. However, then Serge's patch in the next email immediately breaks
that optimization ;^(
Linus
On Thu, May 12, 2011 at 9:02 PM, Serge E. Hallyn <[email protected]> wrote:
>
> I wonder how much this would help: ?(only compile-tested)
So it should help a lot, but it breaks when CONFIG_USER_NS isn't even
set (the case that Eric fixed.
So instead, do this:
(a) get rid of the "_current_user_ns()" thing. There is no reason to
have it, if it's directly off "current->cred", then it's cheaper to
inline it than have a function just for two pointer indirections.
(b) do
#ifdef CONFIG_USER_NS
#define current_user_ns() (&init_user_ns)
#else
#define current_user_ns() (current_cred_xxx(user_ns))
#endif
(c) and then the rest of your patch (to actually initialize and set
"cred->user_ns") should be fine. But don't touch
acl_permission_check() at all - once you've fixed current_user_ns() to
DTRT, acl_permission_check will be ok.
Ok?
Then the compiler will do the right thing: in acl_permission_check()
it will see that current_user_ns() and "current_fsuid()" share 90% of
the code, and will CSE the "current_cred()" part (or, if
CONFIG_USER_NS isn't set, it will see a constant comparison of
&init_user_ns with itself, and generate no code for hat case). There's
no reason for you to do it for it, and when you do it by hand you get
the !CONFIG_USER_NS case wrong.
Hmm? That should fix all the horrible code generation issues in
acl_permission_check. When CONFIG_USER_NS isn't set, it will generate
no code at all, and when it _is_ set, it will just generate two loads
from current->cred (one for user_ns, one for fsuid), which is about as
good as it gets.
Linus
Quoting Linus Torvalds ([email protected]):
> On Thu, May 12, 2011 at 9:02 PM, Serge E. Hallyn <[email protected]> wrote:
> >
> > I wonder how much this would help: ?(only compile-tested)
>
> So it should help a lot, but it breaks when CONFIG_USER_NS isn't even
> set (the case that Eric fixed.
>
> So instead, do this:
>
> (a) get rid of the "_current_user_ns()" thing. There is no reason to
> have it, if it's directly off "current->cred", then it's cheaper to
> inline it than have a function just for two pointer indirections.
Unfortunately that was not there as an optimization but because it
gets around a circular dependency between cred.h and capability.h.
Ideally we would keep capability.h from having to know about struct
cred.
The two simplest solutions are to uninline nsown_capable() into
capability.c, and to move it from capability.h to cred.h, with the
expectation that all callers of nsown_capable() #include cred.h
(which they do). I'm uninlining it here, hoping that the compiler
will inline it for us.
From: Serge E. Hallyn <[email protected]>
Date: Fri, 13 May 2011 04:27:54 +0100
Subject: [PATCH 1/1] Cache user_ns in struct cred (v2)
If !CONFIG_USERNS, have current_user_ns() defined to (&init_user_ns).
Get rid of _current_user_ns. This requires nsown_capable() to be
defined in capability.c rather than as static inline in capability.h,
so do that.
Request_key needs init_user_ns defined at current_user_ns if
!CONFIG_USERNS, so forward-declare that in cred.h if !CONFIG_USERNS
at current_user_ns() define.
Compile-tested with and without CONFIG_USERNS.
Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 13 +------------
include/linux/cred.h | 10 ++++++++--
kernel/capability.c | 12 ++++++++++++
kernel/cred.c | 12 ++++++------
4 files changed, 27 insertions(+), 20 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 16ee8b4..d4675af 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -546,18 +546,7 @@ extern bool has_capability_noaudit(struct task_struct *t, int cap);
extern bool capable(int cap);
extern bool ns_capable(struct user_namespace *ns, int cap);
extern bool task_ns_capable(struct task_struct *t, int cap);
-
-/**
- * nsown_capable - Check superior capability to one's own user_ns
- * @cap: The capability in question
- *
- * Return true if the current task has the given superior capability
- * targeted at its own user namespace.
- */
-static inline bool nsown_capable(int cap)
-{
- return ns_capable(current_user_ns(), cap);
-}
+extern bool nsown_capable(int cap);
/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 9aeeb0b..be16b61 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -146,6 +146,7 @@ struct cred {
void *security; /* subjective LSM security */
#endif
struct user_struct *user; /* real user ID subscription */
+ struct user_namespace *user_ns; /* cached user->user_ns */
struct group_info *group_info; /* supplementary groups for euid/fsgid */
struct rcu_head rcu; /* RCU deletion hook */
};
@@ -354,10 +355,15 @@ static inline void put_cred(const struct cred *_cred)
#define current_fsgid() (current_cred_xxx(fsgid))
#define current_cap() (current_cred_xxx(cap_effective))
#define current_user() (current_cred_xxx(user))
-#define _current_user_ns() (current_cred_xxx(user)->user_ns)
#define current_security() (current_cred_xxx(security))
-extern struct user_namespace *current_user_ns(void);
+#ifdef CONFIG_USER_NS
+#define current_user_ns() (current_cred_xxx(user_ns))
+#else
+extern struct user_namespace init_user_ns;
+#define current_user_ns() (&init_user_ns)
+#endif
+
#define current_uid_gid(_uid, _gid) \
do { \
diff --git a/kernel/capability.c b/kernel/capability.c
index bf0c734..32a80e0 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -399,3 +399,15 @@ bool task_ns_capable(struct task_struct *t, int cap)
return ns_capable(task_cred_xxx(t, user)->user_ns, cap);
}
EXPORT_SYMBOL(task_ns_capable);
+
+/**
+ * nsown_capable - Check superior capability to one's own user_ns
+ * @cap: The capability in question
+ *
+ * Return true if the current task has the given superior capability
+ * targeted at its own user namespace.
+ */
+bool nsown_capable(int cap)
+{
+ return ns_capable(current_user_ns(), cap);
+}
diff --git a/kernel/cred.c b/kernel/cred.c
index 5557b55..8093c16 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -54,6 +54,7 @@ struct cred init_cred = {
.cap_effective = CAP_INIT_EFF_SET,
.cap_bset = CAP_INIT_BSET,
.user = INIT_USER,
+ .user_ns = &init_user_ns,
.group_info = &init_groups,
#ifdef CONFIG_KEYS
.tgcred = &init_tgcred,
@@ -410,6 +411,11 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
goto error_put;
}
+ /* cache user_ns in cred. Doesn't need a refcount because it will
+ * stay pinned by cred->user
+ */
+ new->user_ns = new->user->user_ns;
+
#ifdef CONFIG_KEYS
/* new threads get their own thread keyrings if their parent already
* had one */
@@ -741,12 +747,6 @@ int set_create_files_as(struct cred *new, struct inode *inode)
}
EXPORT_SYMBOL(set_create_files_as);
-struct user_namespace *current_user_ns(void)
-{
- return _current_user_ns();
-}
-EXPORT_SYMBOL(current_user_ns);
-
#ifdef CONFIG_DEBUG_CREDENTIALS
bool creds_are_invalid(const struct cred *cred)
--
1.7.0.4
On Fri, May 13, 2011 at 6:19 AM, Serge E. Hallyn <[email protected]> wrote:
>
> Compile-tested with and without CONFIG_USERNS.
Looks ok to me. And generates good code for acl_permission_check
without CONFIG_USER_NS.
I'll see how much that function drops on the kernel profiles..
Linus
On Fri, May 13, 2011 at 9:16 AM, Linus Torvalds
<[email protected]> wrote:
>
> Looks ok to me. And generates good code for acl_permission_check
> without CONFIG_USER_NS.
>
> I'll see how much that function drops on the kernel profiles..
Yup, looking good.
For my "kernel make with no changes" workload, it dropped from
1.28% make [kernel.kallsyms] [k] acl_permission_check
to
0.88% make [kernel.kallsyms] [k]
acl_permission_check
which is pretty much exactly the expected 30% drop from no longer
having that expensive load of user_ns.
Of course, that 30% improvement is just a 0.4% performance improvement
in the big picture, but hey, almost half a percentage point on a real
load from just one single function in the kernel is definitely worth
doing.
Do you want to carry this for 2.6.40, or should I just apply it?
Linus
Quoting Linus Torvalds ([email protected]):
> On Fri, May 13, 2011 at 9:16 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Looks ok to me. And generates good code for acl_permission_check
> > without CONFIG_USER_NS.
> >
> > I'll see how much that function drops on the kernel profiles..
>
> Yup, looking good.
>
> For my "kernel make with no changes" workload, it dropped from
>
> 1.28% make [kernel.kallsyms] [k] acl_permission_check
>
> to
>
> 0.88% make [kernel.kallsyms] [k]
> acl_permission_check
>
> which is pretty much exactly the expected 30% drop from no longer
> having that expensive load of user_ns.
>
> Of course, that 30% improvement is just a 0.4% performance improvement
> in the big picture, but hey, almost half a percentage point on a real
> load from just one single function in the kernel is definitely worth
> doing.
That's great, thanks for the help.
> Do you want to carry this for 2.6.40, or should I just apply it?
It makes no user-visible difference so I'd say just apply it.
thanks,
-serge