2013-05-21 22:22:47

by Linus Torvalds

Subject: Stupid VFS name lookup interface..

Ok, Al, please tell me why I'm wrong, but I was looking at the hot
code in fs/dcache.c again (__d_lookup_rcu() remains the hottest
function under pathname lookup heavy operations) and that "inode"
argument was mis-commented (it used to be an in-out argument long long
ago) and it just kept bugging me.

And it looks totally pointless anyway. Nobody sane actually wants it.

Yeah, several filesystems use "pinode->i_sb" to look up their
superblock, but they can use "dentry->d_sb" for that instead. You did
most of that long ago, I think.

The *one* insane exception is ncpfs, which actually wants to look at
the parent (ie directory) inode data in order to decide if it should
use a case sensitive hash or not. However, even in that case, I'd
argue that we could just optimistically do a
ACCESS_ONCE(dentry->d_inode) and do the compare using the information
we got from that.

Because we don't care if the dentry->d_inode is unstable: if we got
some stale inode, we would hit the dentry_rcuwalk_barrier() case for
that parent when we later check the sequence numbers. So then we'd
throw away the comparison result anyway. We check both the dentry and
the parent sequence count in lookup_fast(), verifying that they've
been stable over the sequence.

So as far as I can tell, the only thing we should worry about might be
a NULL pointer due to a concurrent rmdir(), but the identity of the
inode itself we really don't care too much about. Take one or the
other, and don't crash on NULL.

There's a similar case going on in proc_sys_compare(). Same logic applies.

Hmm?

Getting rid of those annoying separate dentry inode pointers makes
__d_lookup_rcu() compile into noticeably better code on x86-64,
because there's no stack frame needed any more. And all the
filesystems end up cleaner too, and the crazy cases go where they
belong.

Untested patch attached. It compiles cleanly, looks sane, and most of
it is just making the function prototypes look much nicer. I think it
works.

Linus


Attachments:
patch.diff (33.84 kB)

2013-05-21 22:34:56

by Al Viro

Subject: Re: Stupid VFS name lookup interface..

On Tue, May 21, 2013 at 03:22:44PM -0700, Linus Torvalds wrote:
> The *one* insane exception is ncpfs, which actually wants to look at
> the parent (ie directory) inode data in order to decide if it should
> use a case sensitive hash or not. However, even in that case, I'd
> argue that we could just optimistically do a
> ACCESS_ONCE(dentry->d_inode) and do the compare using the information
> we got from that.
>
> Because we don't care if the dentry->d_inode is unstable: if we got
> some stale inode, we would hit the dentry_rcuwalk_barrier() case for
> that parent when we later check the sequence numbers. So then we'd
> throw away the comparison result anyway. We check both the dentry and
> the parent sequence count in lookup_fast(), verifying that they've
> been stable over the sequence.
>
> So as far as I can tell, the only thing we should worry about might be
> a NULL pointer due to a concurrent rmdir(), but the identity of the
> inode itself we really don't care too much about. Take one or the
> other, and don't crash on NULL.
>
> There's a similar case going on in proc_sys_compare(). Same logic applies.

In principle, yes, but... I wonder if those two cases are actually
safe (especially ncpfs) right now. We dereference the parent inode
there and that could get ugly, whether we've got it from caller or
as ->d_inode. Let me dig around in that code a bit, OK?

Al, enjoying the excuse to take a break from ->readdir() code audit ;-/

2013-05-21 22:38:12

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Tue, May 21, 2013 at 3:34 PM, Al Viro <[email protected]> wrote:
>
> In principle, yes, but... I wonder if those two cases are actually
> safe (especially ncpfs) right now.

Now I can agree that that may well be an issue.

I don't think my patch makes anything worse (because if the inode
isn't stable, we could have hit the before/after cases before, and the
new NULL case is trivial to handle).

But I'm certainly not going to claim that ncpfs doesn't already have a
race as-is. Just claiming that I wouldn't have made it worse ;)

> Let me dig around in that code a bit, OK?

Sure, no problem. This would be 3.11 material anyway, I'd expect.

Linus

2013-05-25 03:21:12

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Tue, May 21, 2013 at 3:22 PM, Linus Torvalds
<[email protected]> wrote:
>
> Untested patch attached. It compiles cleanly, looks sane, and most of
> it is just making the function prototypes look much nicer. I think it
> works.

Ok, here's another patch in the "let's make the VFS go faster series".
This one, sadly, is not a cleanup.

The concept is simple: right now the inode->i_security pointer chasing
kills us on inode security checking with selinux. So let's move two of
the fields from the selinux security fields directly into the inode.
So instead of doing "inode->i_security->{sid,sclass}", we can just do
"inode->{i_sid,i_sclass}" directly.

It's a very mechanical transform, so it should all be good, but the
reason I don't much like it is that I think other security models
might want to do something like this too, and right now it's
selinux-specific. I could imagine making it just an anonymous union of
size 64 bits or something, and just making one of the union entries be
an (anonymous) struct with those two fields. So it's not conceptually
selinux-specific, but right now it's pretty much a selinux hack.

But it's a selinux-specific hack that really does matter. The
inode_has_perm() and selinux_inode_permission() functions show up
pretty high on kernel profiles that do a lot of filename lookup, and
it's pretty much all just that i_security pointer chasing and extra
cache miss.

With this, inode->i_security is not very hot any more, and we could
move the i_security pointer elsewhere in the inode.

Comments? I don't think this is *pretty* (and I do want to repeat that
it's not even tested yet), but I think it's worth it. We've been very
good at avoiding extra pointer dereferences in the path lookup, this
is one of the few remaining ones.

Linus


Attachments:
patch.diff (22.42 kB)

2013-05-25 16:57:14

by Al Viro

Subject: Re: Stupid VFS name lookup interface..

On Fri, May 24, 2013 at 08:21:08PM -0700, Linus Torvalds wrote:
> On Tue, May 21, 2013 at 3:22 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Untested patch attached. It compiles cleanly, looks sane, and most of
> > it is just making the function prototypes look much nicer. I think it
> > works.
>
> Ok, here's another patch in the "let's make the VFS go faster series".
> This one, sadly, is not a cleanup.
>
> The concept is simple: right now the inode->i_security pointer chasing
> kills us on inode security checking with selinux. So let's move two of
> the fields from the selinux security fields directly into the inode.
> So instead of doing "inode->i_security->{sid,sclass}", we can just do
> "inode->{i_sid,i_sclass}" directly.
>
> It's a very mechanical transform, so it should all be good, but the
> reason I don't much like it is that I think other security models
> might want to do something like this too, and right now it's
> selinux-specific. I could imagine making it just an anonymous union of
> size 64 bits or something, and just making one of the union entries be
> an (anonymous) struct with those two fields. So it's not conceptually
> selinux-specific, but right now it's pretty much a selinux hack.
>
> But it's a selinux-specific hack that really does matter. The
> inode_has_perm() and selinux_inode_permission() functions show up
> pretty high on kernel profiles that do a lot of filename lookup, and
> it's pretty much all just that i_security pointer chasing and extra
> cache miss.
>
> With this, inode->i_security is not very hot any more, and we could
> move the i_security pointer elsewhere in the inode.
>
> Comments? I don't think this is *pretty* (and I do want to repeat that
> it's not even tested yet), but I think it's worth it. We've been very
> good at avoiding extra pointer dereferences in the path lookup, this
> is one of the few remaining ones.

Well... The problem I see here is not even selinux per se - it's that
"LSM stacking" insanity. How would your anon union deal with that? Which
LSM gets to play with it when we have more than one of those turds around?

2013-05-25 17:26:31

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Sat, May 25, 2013 at 9:57 AM, Al Viro <[email protected]> wrote:
>
> Well... The problem I see here is not even selinux per se - it's that
> "LSM stacking" insanity. How would your anon union deal with that? Which
> LSM gets to play with it when we have more than one of those turds around?

We don't support stacking anyway. And if we ever do, that will require
some really *major* changes, since right now i_security is owned by
the security module. It's not a list, it's a direct pointer to opaque
per-LSM data.

Of course, if you don't use i_security (only selinux and smack do
right now according to my quick grep), then you could live together
with somebody who does. And that wouldn't change with the new fields,
the same rules would apply. If you can make your security decisions
purely based on standard inode/dentry information (like the common
capabilities code), you'd be ok, and in that sense stacking has
obviously existed for a long while (ie some of the standard unix
capabilities are checked regardless of any LSM ones).

Side note: avoiding the i_security dereference helps, and testing
shows that it drops inode_has_perm and selinux_inode_permission down
the profile list. The bulk of the cost of pathname intensive loads is
now very much the actual path walking itself (__d_lookup_rcu,
path_lookupat, link_path_walk). The security checks are still quite
visible on my "git parallel stat" test-case, with avc_has_perm_noaudit
being 3.5% and the inode_has_perm() and selinux_inode_permission at
around 2.5% each, but I don't really see anything fixable going on any
more. The inode_has_perm() cost is simply because it's one of the
first things that touched the inode itself, so we get the -
unavoidable - cache miss on the second line of the inode.

So the patch is tested, and improves at least profiles. It's probably
not that noticeable in real life, but I like how it makes me go: "I
really don't see any low-hanging fruit any more" on the path lookup.

Linus

2013-05-25 18:40:13

by Casey Schaufler

Subject: Re: Stupid VFS name lookup interface..

On 5/25/2013 9:57 AM, Al Viro wrote:
> On Fri, May 24, 2013 at 08:21:08PM -0700, Linus Torvalds wrote:
>> On Tue, May 21, 2013 at 3:22 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> Untested patch attached. It compiles cleanly, looks sane, and most of
>>> it is just making the function prototypes look much nicer. I think it
>>> works.
>> Ok, here's another patch in the "let's make the VFS go faster series".
>> This one, sadly, is not a cleanup.
>>
>> The concept is simple: right now the inode->i_security pointer chasing
>> kills us on inode security checking with selinux. So let's move two of
>> the fields from the selinux security fields directly into the inode.
>> So instead of doing "inode->i_security->{sid,sclass}", we can just do
>> "inode->{i_sid,i_sclass}" directly.
>>
>> It's a very mechanical transform, so it should all be good, but the
>> reason I don't much like it is that I think other security models
>> might want to do something like this too, and right now it's
>> selinux-specific. I could imagine making it just an anonymous union of
>> size 64 bits or something, and just making one of the union entries be
>> an (anonymous) struct with those two fields. So it's not conceptually
>> selinux-specific, but right now it's pretty much a selinux hack.
>>
>> But it's a selinux-specific hack that really does matter. The
>> inode_has_perm() and selinux_inode_permission() functions show up
>> pretty high on kernel profiles that do a lot of filename lookup, and
>> it's pretty much all just that i_security pointer chasing and extra
>> cache miss.
>>
>> With this, inode->i_security is not very hot any more, and we could
>> move the i_security pointer elsewhere in the inode.
>>
>> Comments? I don't think this is *pretty* (and I do want to repeat that
>> it's not even tested yet), but I think it's worth it. We've been very
>> good at avoiding extra pointer dereferences in the path lookup, this
>> is one of the few remaining ones.
> Well... The problem I see here is not even selinux per se - it's that
> "LSM stacking" insanity. How would your anon union deal with that? Which
> LSM gets to play with it when we have more than one of those turds around?

I don't know that the terms "insanity" and "turds" really do
the situation justice, but Al has a firm grasp on the nut of the issue.

The LSM stacking I've been working on (v14 due "real soon") would
render this change useless, as you'd have to have the multiple
instances of the special fields just as you'd need multiple blob
pointers. That would have to reintroduce the indirection you're
trying to be rid of. I have been working on the assumption that
the single blob pointer was all that could ever go into the inode.
If that's not true stacking could get considerably easier and could
have less performance impact.

Now I'll put on my Smack maintainer hat. Performance improvement is
always welcome, but I would rather see attention to performance of
the LSM architecture than SELinux specific hacks. The LSM blob
pointer scheme is there so that you (Linus) don't have to see the
dreadful things that we security people are doing. Is it time to
get past that level of disassociation? Or, and I really hate asking
this, have you fallen into the SELinux camp?



2013-05-26 03:22:11

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Sat, May 25, 2013 at 11:33 AM, Casey Schaufler
<[email protected]> wrote:
>
> Now I'll put on my Smack maintainer hat. Performance improvement is
> always welcome, but I would rather see attention to performance of
> the LSM architecture than SELinux specific hacks.

I haven't seen huge issues with performance at that level.

> The LSM blob
> pointer scheme is there so that you (Linus) don't have to see the
> dreadful things that we security people are doing. Is it time to
> get past that level of disassociation? Or, and I really hate asking
> this, have you fallen into the SELinux camp?

I only have selinux performance to look at, since I run Fedora. I used
to actually turn it off entirely, because it impacted VFS performance
so horribly. We fixed it. I (and Al) spent time to make sure that we
don't need to drop RCU lookup just because we call into the security
layers etc.

But I haven't even looked at what non-selinux setups do to
performance. Last time I tried Ubuntu (they still use apparmor, no?),
"make modules_install ; make install" didn't work for the kernel, and
if the Ubuntu people don't want to support kernel engineers, I
certainly am not going to bother with them. Who uses smack?

Linus

2013-05-26 04:55:30

by James Morris

Subject: Re: Stupid VFS name lookup interface..

On Sat, 25 May 2013, Al Viro wrote:

> Well... The problem I see here is not even selinux per se - it's that
> "LSM stacking" insanity.

FWIW, I don't see a concrete rationale for merging this stacking work. That
doesn't mean there won't be one in the future, but I wouldn't let the idea
of a future possible change prevent this patch from being merged.




--
James Morris
<[email protected]>

2013-05-26 04:56:13

by James Morris

Subject: Re: Stupid VFS name lookup interface..

On Sat, 25 May 2013, Linus Torvalds wrote:

> But I haven't even looked at what non-selinux setups do to
> performance. Last time I tried Ubuntu (they still use apparmor, no?),
> "make modules_install ; make install" didn't work for the kernel, and
> if the Ubuntu people don't want to support kernel engineers, I
> certainly am not going to bother with them. Who uses smack?

Tizen, perhaps a few others.


--
James Morris
<[email protected]>

2013-05-26 05:19:56

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Sat, May 25, 2013 at 10:04 PM, James Morris <[email protected]> wrote:
> On Sat, 25 May 2013, Linus Torvalds wrote:
>
>> But I haven't even looked at what non-selinux setups do to
>> performance. Last time I tried Ubuntu (they still use apparmor, no?),
>> "make modules_install ; make install" didn't work for the kernel, and
>> if the Ubuntu people don't want to support kernel engineers, I
>> certainly am not going to bother with them. Who uses smack?
>
> Tizen, perhaps a few others.

Btw, it really would be good if security people started realizing that
performance matters. It's annoying to see the security lookups cause
50% performance degradations on pathname lookup (and no, I'm not
exaggerating, that's literally what it was before we fixed it - and
no, by "we" I don't mean security people).

There's a really simple benchmark that is actually fairly relevant:
build a reasonable kernel ("make localmodconfig" or similar - not the
normal distro kernel that has everything enabled) without debugging or
other crud enabled, run that kernel, and then re-build the fully built
kernel to make sure it's all in the disk cache. Then, when you don't
need any IO, and don't need to recompile anything, do a "make -j".

Assuming you have a reasonably modern desktop machine, it should take
something like 5-10 seconds, of which almost everything is just "make"
doing lots of stat() calls to see that everything is fully built. If
it takes any longer, you're doing something wrong.

Once you are at that point, just do "perf record -f -e cycles:pp make
-j" and then "perf report" on the thing.

(The "-e cycles:pp" is not necessary for the rough information, but it
helps if you then want to go and annotate the assembler to see where
the costs come from).

If you see security functions at the top, you know that the security
routines take more time than the real work the kernel is doing, and
should realize that that would be a problem.

Right now (zooming into the kernel only - ignoring the fact that make
really spends a fair amount of time in user space) I get

9.79% make [k] __d_lookup_rcu
5.48% make [k] link_path_walk
2.94% make [k] avc_has_perm_noaudit
2.47% make [k] selinux_inode_permission
2.25% make [k] path_lookupat
1.89% make [k] generic_fillattr
1.50% make [k] lookup_fast
1.27% make [k] copy_user_generic_string
1.17% make [k] generic_permission
1.15% make [k] dput
1.12% make [k] inode_has_perm.constprop.58
1.11% make [k] __inode_permission
1.08% make [k] kmem_cache_alloc
...

so the permission checking is certainly quite noticeable, but it's by
no means dominant. This is with both of the patches I've posted, but
the numbers weren't all that different before (inode_has_perm and
selinux_inode_permission used to be higher up in the list, now
avc_has_perm_noaudit is the top selinux cost - which actually makes
some amount of sense).

So it's easy to have a fairly real-world performance profile that
shows path lookup costs on a real test.

Linus

2013-05-26 12:03:08

by Theodore Ts'o

Subject: Re: Stupid VFS name lookup interface..

On Sat, May 25, 2013 at 11:33:46AM -0700, Casey Schaufler wrote:
> Now I'll put on my Smack maintainer hat. Performance improvement is
> always welcome, but I would rather see attention to performance of
> the LSM architecture than SELinux specific hacks. The LSM blob
> pointer scheme is there so that you (Linus) don't have to see the
> dreadful things that we security people are doing. Is it time to
> get past that level of disassociation? Or, and I really hate asking
> this, have you fallen into the SELinux camp?

What part of the LSM architecture are you proposing be optimized? The
LSM layer is pretty thin, partially because the various different
security approaches don't agree with each other on fairly fundamental
issues. What sort of optimization opportunities are you suggesting?
Are there changes that can be made that all of the major security LSM
maintainers would actually agree with?

I've been re-reading the thread on LKML which was spawned when SMACK
was proposed for upstream inclusion:

http://thread.gmane.org/gmane.linux.kernel/585903/focus=586412

Have any of the arguments over the proper security models changed or
gotten resolved over the past six years, while I haven't been
looking?

- Ted

2013-05-26 17:59:59

by Casey Schaufler

Subject: Re: Stupid VFS name lookup interface..

On 5/25/2013 10:19 PM, Linus Torvalds wrote:
> On Sat, May 25, 2013 at 10:04 PM, James Morris <[email protected]> wrote:
>> On Sat, 25 May 2013, Linus Torvalds wrote:
>>
>>> But I haven't even looked at what non-selinux setups do to
>>> performance. Last time I tried Ubuntu (they still use apparmor, no?),
>>> "make modules_install ; make install" didn't work for the kernel, and
>>> if the Ubuntu people don't want to support kernel engineers, I
>>> certainly am not going to bother with them. Who uses smack?
>> Tizen, perhaps a few others.
> Btw, it really would be good if security people started realizing that
> performance matters. It's annoying to see the security lookups cause
> 50% performance degradations on pathname lookup (and no, I'm not
> exaggerating, that's literally what it was before we fixed it - and
> no, by "we" I don't mean security people).

I think that we have a pretty good idea that performance matters.
I have never, not even once, tried to introduce a "security" feature
that was not subject to an objection based on its performance impact.
We are also extremely tuned into the desire that when our security
features are not present the impact of their potential availability
has to be zero.

The whole secid philosophy comes out of the need to keep security out
of other people's way. It has performance impact. Sure, SELinux
hashes lookups, but a blob pointer gets you right where you want to be.
When we are constrained in unnatural ways there are going to be
consequences. Performance is one. Code complexity is another.

One need look no further than the recent discussions regarding Paul
Moore's suggested changes to sk_buff to see just how seriously
performance considerations impact security development. Because the
security data in the sk_buff is limited to a u32 instead of a blob
pointer, labeled networking performance is seriously worse than
unlabeled. Paul has gone to heroic lengths to come up with a change
that meets all of the performance criteria, and has still been
rejected. Not because the stated issues haven't been addressed, but
because someone else might have changes that "need those cache lines"
in the future.

Two of the recent Smack changes have been performance improvements.
Smack is currently under serious scrutiny for performance as it is
heading into small machines.

I'm not saying we can't do better, or that we (at least I) don't
appreciate any help we can get. We are, however, bombarded with
concern over the performance impact of what we're up to. All too
often it's not constructive criticism. Sometimes it is downright
hostile.

>
> There's a really simple benchmark that is actually fairly relevant:
> build a reasonable kernel ("make localmodconfig" or similar - not the
> normal distro kernel that has everything enabled) without debugging or
> other crud enabled, run that kernel, and then re-build the fully built
> kernel to make sure it's all in the disk cache. Then, when you don't
> need any IO, and don't need to recompile anything, do a "make -j".
>
> Assuming you have a reasonably modern desktop machine, it should take
> something like 5-10 seconds, of which almost everything is just "make"
> doing lots of stat() calls to see that everything is fully built. If
> it takes any longer, you're doing something wrong.
>
> Once you are at that point, just do "perf record -f -e cycles:pp make
> -j" and then "perf report" on the thing.
>
> (The "-e cycles:pp" is not necessary for the rough information, but it
> helps if you then want to go and annotate the assembler to see where
> the costs come from).
>
> If you see security functions at the top, you know that the security
> routines take more time than the real work the kernel is doing, and
> should realize that that would be a problem.
>
> Right now (zooming into the kernel only - ignoring the fact that make
> really spends a fair amount of time in user space) I get
>
> 9.79% make [k] __d_lookup_rcu
> 5.48% make [k] link_path_walk
> 2.94% make [k] avc_has_perm_noaudit
> 2.47% make [k] selinux_inode_permission
> 2.25% make [k] path_lookupat
> 1.89% make [k] generic_fillattr
> 1.50% make [k] lookup_fast
> 1.27% make [k] copy_user_generic_string
> 1.17% make [k] generic_permission
> 1.15% make [k] dput
> 1.12% make [k] inode_has_perm.constprop.58
> 1.11% make [k] __inode_permission
> 1.08% make [k] kmem_cache_alloc
> ...
>
> so the permission checking is certainly quite noticeable, but it's by
> no means dominant. This is with both of the patches I've posted, but
> the numbers weren't all that different before (inode_has_perm and
> selinux_inode_permission used to be higher up in the list, now
> avc_has_perm_noaudit is the top selinux cost - which actually makes
> some amount of sense).
>
> So it's easy to have a fairly real-world performance profile that
> shows path lookup costs on a real test.
>
> Linus
>

2013-05-26 18:17:21

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Sun, May 26, 2013 at 10:59 AM, Casey Schaufler
<[email protected]> wrote:
>
> The whole secid philosophy comes out of the need to keep security out
> of other people's way. It has performance impact. Sure, SELinux
> hashes lookups, but a blob pointer gets you right where you want to be.
> When we are constrained in unnatural ways there are going to be
> consequences. Performance is one. Code complexity is another.

Quite frankly, I'd like to possibly introduce a cache of security
decisions at least for the common filesystem operations (and the
per-pathcomponent lookup is *the* most common one, but the per-stat
one is pretty bad too), and put that cache at the VFS level, so that
the security people can *not* screw it up, and so that we don't call
down to the security layer at all 99% of the time.

Once that happens, we don't care any more what security people do.

It has been how we have fixed performance problems for filesystems
every single time. It's simply not possible to have a generic
interface to 50+ different filesystems and expect that kind of generic
interface to be high-performance - but when we've been able to
abstract it out as a cache in front of the filesystem operations, it
is suddenly quite reasonable to spend a lot of effort making that
cache go fast like a bat out of hell.

It started with the dentry cache and the page cache, but we now do the
POSIX ACL's that way too, because it just wasn't reasonable to call
down to the filesystem to look up ACL's and have all the complex "we
do this with RCU locks held" semantics.

Is there something similar we could do for the security layer? We
don't have 50+ different security models, but we do have several. If
the different security modules could agree on some kind of generic
"security ID" model so that we could cache things (see fs/namei.c and
get_cached_acl_rcu() for example), it would be a great thing.

Then selinux could get rid of its hashed lookups entirely, because
that whole "cache the security ID" would be handled by generic code.

But that *would* require that there would be some abstract notion of
security ID/context that we could use in generic code *WITHOUT* the
need to call down to the security subsystem.

The indirect calls are expensive, but they are expensive not because
an indirect call itself is particularly expensive (although that's
true on some architectures too), but because the whole notion of "I'm
calling down to the lower-level non-generic code" means that we can't
do inlining, we can't optimize locking, we can't do anything clever.

My selinux patch kept the indirect call, but at least made it cheap.
Could we do even better? And keep it generic?

Btw, if we can do something like that, then nested security modules
likely get much easier to do too, because the nesting would all be
behind the cache. Once it's behind the cache, it doesn't matter if
we'd need to traverse lists etc. The hot case would be able to ignore
it all.

Linus

2013-05-26 18:23:07

by Casey Schaufler

Subject: Re: Stupid VFS name lookup interface..

On 5/26/2013 5:02 AM, Theodore Ts'o wrote:
> On Sat, May 25, 2013 at 11:33:46AM -0700, Casey Schaufler wrote:
>> Now I'll put on my Smack maintainer hat. Performance improvement is
>> always welcome, but I would rather see attention to performance of
>> the LSM architecture than SELinux specific hacks. The LSM blob
>> pointer scheme is there so that you (Linus) don't have to see the
>> dreadful things that we security people are doing. Is it time to
>> get past that level of disassociation? Or, and I really hate asking
>> this, have you fallen into the SELinux camp?
> What part of the LSM architecture are you proposing be optimized?

Secids are an inherent performance issue.

This thread is all about a performance problem with
the security blob pointer scheme. I don't know what
would be better and general, but I'm willing to learn.


> The
> LSM layer is pretty thin, partially because the various different
> security approaches don't agree with each other on fairly fundamental
> issues. What sort of optimization opportunities are you suggesting?
> Are there changes that can be made that all of the major security LSM
> maintainers would actually agree with?

As you point out, the various existing LSMs use a variety of
mechanisms to perform their access checks. A big part of what
I see as the "problem" is that the LSM hooks grew organically,
at a time when there was exactly one project being funded.
By the time other LSMs came in to the mainstream we had a
collection of hooks, not an architecture. The LSM architecture
has not been seriously revisited since.

Can we come to agreement? I don't know. I expect so.

>
> I've been re-reading the thread on LKML which was spawned when SMACK
> was proposed for upstream inclusion:
>
> http://thread.gmane.org/gmane.linux.kernel/585903/focus=586412
>
> Have any of the arguments over the proper security models changed or
> gotten resolved over the past six years, while I haven't been
> looking?

I believe that Yama points to a serious change in the way
"operating systems" are being developed. The desktop is not
the sweet spot for Linux development, nor is the enterprise
server. Six years ago the Bell & LaPadula subject/object models
still made sense. Today, we're looking at applications, services
and resources. We don't have LSMs that support those* natively.
We are going to have new LSMs, and soon, if Linux is going to
remain relevant.

---
* SEAndroid is trying. We'll see where that goes.


>
> - Ted
>

2013-05-26 18:48:27

by Casey Schaufler

Subject: Re: Stupid VFS name lookup interface..

On 5/26/2013 11:17 AM, Linus Torvalds wrote:
> On Sun, May 26, 2013 at 10:59 AM, Casey Schaufler
> <[email protected]> wrote:
>> The whole secid philosophy comes out of the need to keep security out
>> of other people's way. It has performance impact. Sure, SELinux
>> hashes lookups, but a blob pointer gets you right where you want to be.
>> When we are constrained in unnatural ways there are going to be
>> consequences. Performance is one. Code complexity is another.
> Quite frankly, I'd like to possibly introduce a cache of security
> decisions at least for the common filesystem operations (and the
> per-pathcomponent lookup is *the* most common one, but the per-stat
> one is pretty bad too), and put that cache at the VFS level, so that
> the security people can *not* screw it up, and so that we don't call
> down to the security layer at all 99% of the time.
>
> Once that happens, we don't care any more what security people do.
>
> It has been how we have fixed performance problems for filesystems
> every single time. It's simply not possible to have a generic
> interface to 50+ different filesystems and expect that kind of generic
> interface to be high-performance - but when we've been able to
> abstract it out as a cache in front of the filesystem operations, it
> is suddenly quite reasonable to spend a lot of effort making that
> cache go fast like a bat out of hell.
>
> It started with the dentry cache and the page cache, but we now do the
> POSIX ACL's that way too, because it just wasn't reasonable to call
> down to the filesystem to look up ACL's and have all the complex "we
> do this with RCU locks held" semantics.
>
> Is there something similar we could do for the security layer? We
> don't have 50+ different security models, but we do have several. If
> the different security modules could agree on some kind of generic
> "security ID" model so that we could cache things (see fs/namei.c and
> get_cached_acl_rcu() for example), it would be a great thing.
>
> Then selinux could get rid of its hashed lookups entirely, because
> that whole "cache the security ID" would be handled by generic code.
>
> But that *would* require that there would be some abstract notion of
> security ID/context that we could use in generic code *WITHOUT* the
> need to call down to the security subsystem.
>
> The indirect calls are expensive, but they are expensive not because
> an indirect call itself is particularly expensive (although that's
> true on some architectures too), but because the whole notion of "I'm
> calling down to the lower-level non-generic code" means that we can't
> do inlining, we can't optimize locking, we can't do anything clever.
>
> My selinux patch kept the indirect call, but at least made it cheap.
> Could we do even better? And keep it generic?

Probably, but we'll need help from people who really understand VFS,
caching and RCU.

> Btw, if we can do something like that, then nested security modules
> likely get much easier to do too, because the nesting would all be
> behind the cache. Once it's behind the cache, it doesn't matter if
> we'd need to traverse lists etc. The hot case would be able to ignore
> it all.

Indeed. It may require significant revision to the existing LSMs.

>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-05-26 19:11:34

by Theodore Ts'o

Subject: Re: Stupid VFS name lookup interface..

On Sun, May 26, 2013 at 11:23:01AM -0700, Casey Schaufler wrote:
> I believe that Yama points to a serious change in the way
> "operating systems" are being developed. The desktop is not
> the sweet spot for Linux development, nor is the enterprise
> server. Six years ago the Bell & LaPadula subject/object models
> still made sense. Today, we're looking at applications, services
> and resources. We don't have LSMs that support those* natively.
> We are going to have new LSMs, and soon, if Linux is going to
> remain relevant.

Oh, I agree, having worked on a TS project involving real-time Linux
that did not even try to use a Bell-LaPadula security model on the
Linux system.

The challenge is that although the world does seem to be moving
towards using air gap firewalls and separate single-level systems
(using trusted message passing routers and/or trusted hypervisors)
instead of a MLS design, I'm sure there will still be some
applications where people will want a Bell-LaPadula security model.
And if we can't rip out that fundamental assumption, it's not obvious
to me it will be possible to simplify the core LSM architecture.

We also have to consider how much sunk cost has already been
invested in systems such as SELinux, which can support (but are
not limited to) Bell-LaPadula. If we simplify the LSM architecture,
but it requires a radical rewrite of systems where massive amounts of
effort have been poured into making them somewhat easier to use
(although personally I'm not convinced that a policy file which is
hundreds of kilobytes qualifies as "easy" to debug or to validate), I
suspect the people who don't feel like throwing all of that heavy
investment away will resist such an initiative mightily. (Not to
mention that if we break backwards compatibility, we will end up
breaking userspace for deployed enterprise distro's.)

And yet, if we don't rip out some of these assumptions, it's not
obvious to me how much even Linus's caching idea is going to buy us.
A generic caching layer still has to include in its hash all of the
possible inputs that a LSM module might want to use as part of its
access decision. If this includes the pathname, then you'll have to
lock a whole bunch of dentries while you calculate its hash ---
something that wouldn't be necessary for SELinux, for example. And
I'm sure SELinux uses some things when making its access decisions
which Smack does not need, and so on. So a generalized caching scheme
might not result in any performance wins; it might even be worse
than what we have today!

So I'm dubious --- but maybe this is something which a Security
working group could maybe try to hash out at a face-to-face meeting.
Given our past track record of coming to consensus, I suspect
face-to-face has a higher chance of working --- who knows, maybe hell
will freeze over or pigs will end up nesting in trees. :-)

Cheers,

- Ted

2013-05-26 19:32:46

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Sun, May 26, 2013 at 12:11 PM, Theodore Ts'o <[email protected]> wrote:
>
> And if we can't rip out that fundamental assumption, it's not obvious
> to me it will be possible to simplify the core LSM architecture.

One thing that may be sufficient is to maintain a complex model as a
*possible* case, but make sure that we handle the simple cases well.

For example, a "security cache" *might* be as simple as a single bit
saying "this inode has no special file permissions outside the normal
UNIX ones".

For example, the biggest win in the whole POSIX ACL cache was not
actually caching POSIX ACL's at all - even though it does that - but
caching the fact that a particular file did *not* have any POSIX ACL's
associated with it. So the POSIX ACL cache was very much designed to
have separate states for "I don't have a cache entry" and "I have an
empty cache entry". And 99% of the time, that negative cache is
sufficient.
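(The tri-state distinction described above can be sketched in plain C. This is a user-space illustration only; the sentinel name and helper function are made up for the sketch, not the exact kernel identifiers.)

```c
/* Sketch of a negative cache with three states: "not cached yet",
 * "cached: definitely no ACL", and "cached: ACL present".  The key
 * trick is a sentinel pointer distinct from both NULL and any real
 * ACL, so that NULL can mean "known to have no ACL" (the fast path). */
#include <stddef.h>
#include <assert.h>

struct posix_acl;				/* opaque for this sketch */

/* Sentinel: never a valid pointer, never NULL. */
#define ACL_NOT_CACHED ((struct posix_acl *)-1L)

enum acl_state { ACL_UNKNOWN, ACL_ABSENT, ACL_PRESENT };

static enum acl_state acl_cache_lookup(struct posix_acl *cached)
{
	if (cached == ACL_NOT_CACHED)
		return ACL_UNKNOWN;	/* slow path: ask the filesystem */
	if (cached == NULL)
		return ACL_ABSENT;	/* negative hit: skip ACL checks */
	return ACL_PRESENT;		/* positive hit: evaluate the ACL */
}
```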

Similarly, for at least the filesystem security case, if we had a "the
owner of this file can do all the normal operations on it" bit, that
would generally take care of 99% of all home directory operations. If,
in addition to that, there's some way to mark other "normal"
directories as having only normal UNIX security semantics (ie /home
and /usr), that would take care of most of the rest.

Having to call into the security layer when you cross some special
boundary is fine. It's doing it for every single path component, and
every single 'stat' of a regular file - *THAT* is what kills us.

So complexity could be ok. As long as the common case isn't complex,
and as long as there's a simple way to check that. We don't
necessarily need to simplify things in the general case.

Linus

2013-05-28 16:26:33

by Casey Schaufler

Subject: Re: Stupid VFS name lookup interface..

On 5/26/2013 12:32 PM, Linus Torvalds wrote:
> On Sun, May 26, 2013 at 12:11 PM, Theodore Ts'o <[email protected]> wrote:
>> And if we can't rip out that fundamental assumption, it's not obvious
>> to me it will be possible to simplify the core LSM architecture.
> One thing that may be sufficient is to maintain a complex model as a
> *possible* case, but make sure that we handle the simple cases well.
>
> For example, a "security cache" *might* be as simple as a single bit
> saying "this inode has no special file permissions outside the normal
> UNIX ones".
>
> For example, the biggest win in the whole POSIX ACL cache was not
> actually caching POSIX ACL's at all - even though it does that - but
> caching the fact that a particular file did *not* have any POSIX ACL's
> associated with it. So the POSIX ACL cache was very much designed to
> have separate states for "I don't have a cache entry" and "I have an
> empty cache entry". And 99% of the time, that negative cache is
> sufficient.

POSIX ACLs work well in this scheme for two reasons. First, they are
optional and only show up on files when someone wants to put one there.
Second, no one seems inclined to use them. Because the existence of an
ACL is rare it makes *lots* of sense to shortcut if there is no ACL.

SELinux contexts and Smack labels are different from ACLs in that
the usual case will be that the label exists, and a file without the
extended attribute will be quite rare. Because every file is expected
to have a context/label, there is no check as simple or convenient
as the one available for ACLs.

> Similarly, for at least the filesystem security case, if we had a "the
> owner of this file can do all the normal operations on it" bit, that
> would generally take care of 99% of all home directory operations. If,
> in addition to that, there's some way to mark other "normal"
> directories as having only normal UNIX security semantics (ie /home
> and /usr), that would take care of most of the rest.

This is pretty close to the Smack "Basic Rule", which says that a process
with the same label as an object can do what it will with it. You only
go looking at the Smack rules if this check fails.
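(A rough user-space sketch of that fast path, assuming string labels; the function names here are invented, and the rule-list walk is just a placeholder, not the real Smack implementation.)

```c
/* Smack "Basic Rule" fast path: if the subject (task) label equals
 * the object label, grant access without consulting the rule list.
 * Only on a mismatch do we fall through to the (slower) rule lookup. */
#include <string.h>
#include <assert.h>

static int smack_rule_check(const char *subj, const char *obj)
{
	/* Placeholder for the explicit rule-list walk. */
	(void)subj;
	(void)obj;
	return 0;	/* deny unless an explicit rule says otherwise */
}

static int smack_may_access(const char *subj_label, const char *obj_label)
{
	if (strcmp(subj_label, obj_label) == 0)
		return 1;	/* Basic Rule: same label, full access */
	return smack_rule_check(subj_label, obj_label);
}
```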

SELinux (Eric, James, correct me where I mislead) may need to do some
calculation based on the process context, containing directory, component
name and (perhaps) other factors before it's possible to determine exactly
what contexts are being compared. It may very well be possible to do some
amount of advanced computation on this, but how that might work is beyond
me.

It might be possible to implement labeling on a filesystem basis, and
achieve performance tricks based on that. Both SELinux and Smack are
capable of providing a single context/label for a filesystem, although
Smack currently always uses extended attributes when they are available.
This would be of limited value in most of the configurations I have seen.


> Having to call into the security layer when you cross some special
> boundary is fine. It's doing it for every single path component, and
> every single 'stat' of a regular file - *THAT* is what kills us.

Hmm. If we associated the access checks with the object rather than the
task we might be able to do something here. That's the VFS caching that
you've been referring to, I expect. On a system with a small number of
contexts/labels I'd expect this to help.

> So complexity could be ok. As long as the common case isn't complex,
> and as long as there's a simple way to check that. We don't
> necessarily need to simplify things in the general case.

Keeping the common case common is the problem with mandatory access
control systems. Once the granularity gremlins get control of a
distribution no one access case is ever going to be the common case.

>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-05-29 00:18:34

by Valdis Klētnieks

Subject: Re: Stupid VFS name lookup interface..

On Sun, 26 May 2013 08:02:51 -0400, "Theodore Ts'o" said:

> Have any of the arguments over the proper security models changed over
> or have gotten resolved over the past six years, while I haven't been
> looking?

Doubtful, because the security models are addressing different threat
models. If you can't agree on what you're trying to secure against,
there can be no agreement on how to do the securing.



2013-05-30 01:28:32

by Eric Paris

Subject: Re: Stupid VFS name lookup interface..

On Sat, May 25, 2013 at 10:19 PM, Linus Torvalds
<[email protected]> wrote:
> On Sat, May 25, 2013 at 10:04 PM, James Morris <[email protected]> wrote:
>> On Sat, 25 May 2013, Linus Torvalds wrote:
>>
>>> But I haven't even looked at what non-selinux setups do to
>>> performance. Last time I tried Ubuntu (they still use apparmor, no?),
>>> "make modules_install ; make install" didn't work for the kernel, and
>>> if the Ubuntu people don't want to support kernel engineers, I
>>> certainly am not going to bother with them. Who uses smack?
>>
>> Tizen, perhaps a few others.
>
> Btw, it really would be good if security people started realizing that
> performance matters. It's annoying to see the security lookups cause
> 50% performance degradations on pathname lookup (and no, I'm not
> exaggerating, that's literally what it was before we fixed it - and
> no, by "we" I don't mean security people).

I take a bit of exception to this. I do care. Stephen Smalley, the
only other person who does SELinux kernel work, cares. I don't speak
for other LSMs, but at least both of us who have done anything with
SELinux in the last years care. I did the RCU work for selinux and
you, sds, and I did a bunch of work to stop wasting so much stack
space which was crapping on performance. And I'm here again :)

> Right now (zooming into the kernel only - ignoring the fact that make
> really spends a fair amount of time in user space) I get
>
> 9.79% make [k] __d_lookup_rcu
> 5.48% make [k] link_path_walk
> 2.94% make [k] avc_has_perm_noaudit
> 2.47% make [k] selinux_inode_permission
> 2.25% make [k] path_lookupat
> 1.89% make [k] generic_fillattr
> 1.50% make [k] lookup_fast
> 1.27% make [k] copy_user_generic_string
> 1.17% make [k] generic_permission
> 1.15% make [k] dput
> 1.12% make [k] inode_has_perm.constprop.58
> 1.11% make [k] __inode_permission
> 1.08% make [k] kmem_cache_alloc
> ...

I tried something else, doing caching of the last successful security
check inside the isec. It isn't race free, so it's not ready for
prime time. But right now my >1% looks like:

7.97% make [k] __d_lookup_rcu
5.79% make [k] link_path_walk
3.67% make [k] selinux_inode_permission
2.02% make [k] lookup_fast
1.90% make [k] system_call
1.76% make [k] path_lookupat
1.68% make [k] inode_has_perm.isra.45.constprop.61
1.53% make [k] copy_user_enhanced_fast_string
1.39% make [k] generic_permission
1.35% make [k] kmem_cache_free
1.30% make [k] __audit_syscall_exit
1.13% make [k] kmem_cache_alloc
1.00% make [k] strncpy_from_user
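(The "cache the last successful check in the inode's security blob" idea might look roughly like this as a user-space sketch. The struct fields and function names are invented for illustration; the real SELinux structures differ, and as noted above this naive version is not race-free without barriers or locking.)

```c
/* Remember the last (task SID, permission mask) that passed a check
 * on this inode, so a repeat check by the same task can short-circuit
 * the full AVC lookup.  A hit requires the cached grant to cover all
 * of the currently requested permission bits. */
#include <stdint.h>
#include <assert.h>

struct inode_security {
	uint32_t sid;		/* object's security ID */
	uint32_t last_task_sid;	/* who last passed a check ... */
	uint32_t last_perms;	/* ... and which permissions were granted */
};

static int cached_perm_check(const struct inode_security *isec,
			     uint32_t task_sid, uint32_t perms)
{
	return isec->last_task_sid == task_sid &&
	       (isec->last_perms & perms) == perms;
}

static void cache_perm_result(struct inode_security *isec,
			      uint32_t task_sid, uint32_t perms)
{
	isec->last_task_sid = task_sid;
	isec->last_perms = perms;
}
```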

How do I tell what is taking time inside selinux_inode_permission?

2013-05-30 03:05:25

by Linus Torvalds

Subject: Re: Stupid VFS name lookup interface..

On Thu, May 30, 2013 at 10:28 AM, Eric Paris <[email protected]> wrote:
>
> How do I tell what is taking time inside selinux_inode_permission?

Go to "annotate" (just press 'a' when the function is highlighted),
which will show you the disassembly and the cost of each instruction.

That's when you really want to use "-e cycles:pp" to get the
instruction-level profile right, though. Otherwise the cost will
usually be assigned to the instructions following the expensive one.
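(In concrete terms, that workflow looks something like the following; the workload and function name are just examples.)

```shell
# Record with the :pp precision modifier so samples land on the
# instruction that actually cost the cycles, not the one after it.
perf record -e cycles:pp make -j8

# Browse the profile; move the cursor to selinux_inode_permission
# and press 'a' to see the annotated disassembly...
perf report

# ...or jump straight to the annotation for one symbol:
perf annotate selinux_inode_permission
```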

And I can tell you that the cost is almost certainly the cache miss on
the inode->i_security accesses. Which was the reason for that second
patch that moved "sid" to the inode->i_sid field and avoided the extra
dereference.

Linus