Just something like open(/usr/include/sys/stat.h) causes five calls of
generic_permission -> acl_permission_check -> in_group_p; if the
compiler first tried /usr/local/include/..., that would be a few
more.
Altogether, on a bog-standard Ubuntu 20.04 install, a workload
consisting of compiling lots of userspace programs (i.e., calling lots
of short-lived programs that all need to get their shared libs mapped
in, and the compilers poking around looking for system headers - lots
of /usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around
0.1% according to perf top. With an artificial load such as
while true ; do find /usr/ -print0 | xargs -0 stat > /dev/null ; done
that jumps to over 0.4%.
System-installed files are almost always 0755 (directories and
binaries) or 0644, so in most cases, we can avoid the binary search
and the cost of pulling the cred->groups array and in_group_p() .text
into the cpu cache.
Signed-off-by: Rasmus Villemoes <[email protected]>
---
fs/namei.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index d81f73ff1a8b..c6f0c6643db5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -303,7 +303,12 @@ static int acl_permission_check(struct inode *inode, int mask)
return error;
}
- if (in_group_p(inode->i_gid))
+ /*
+ * If the "group" and "other" permissions are the same,
+ * there's no point calling in_group_p() to decide which
+ * set to use.
+ */
+ if ((((mode >> 3) ^ mode) & 7) && in_group_p(inode->i_gid))
mode >>= 3;
}
--
2.23.0
On Fri, Jun 5, 2020 at 7:23 AM Rasmus Villemoes
<[email protected]> wrote:
>
> + /*
> + * If the "group" and "other" permissions are the same,
> + * there's no point calling in_group_p() to decide which
> + * set to use.
> + */
> + if ((((mode >> 3) ^ mode) & 7) && in_group_p(inode->i_gid))
> mode >>= 3;
Ugh. Not only is this ugly, but it's not even the best optimization.
We don't care that group and other match exactly. We only care that
they match in the low 3 bits of the "mask" bits.
So if we want this optimization - and it sounds worth it - I think we
should do it right. But I also think it should be written more
legibly.
And the "& 7" is the same "& (MAY_READ | MAY_WRITE | MAY_EXEC)" we do later.
In other words, if we do this, I'd like it to be done even more
aggressively, but I'd also like the end result to be a lot more
readable and have more comments about why we do that odd thing.
Something like this *UNTESTED* patch, perhaps?
I might have gotten something wrong, so this would need
double-checking, but if it's right, I find it a _lot_ more easy to
understand than making one expression that is pretty complicated and
opaque.
Hmm?
Linus
On 05/06/2020 22.18, Linus Torvalds wrote:
> On Fri, Jun 5, 2020 at 7:23 AM Rasmus Villemoes
> <[email protected]> wrote:
>>
>> + /*
>> + * If the "group" and "other" permissions are the same,
>> + * there's no point calling in_group_p() to decide which
>> + * set to use.
>> + */
>> + if ((((mode >> 3) ^ mode) & 7) && in_group_p(inode->i_gid))
>> mode >>= 3;
>
> Ugh. Not only is this ugly, but it's not even the best optimization.
>
> We don't care that group and other match exactly. We only care that
> they match in the low 3 bits of the "mask" bits.
Yes, I did think about that, but I thought this was the more obviously
correct approach, and that in practice one only sees the 0X44 and 0X55
cases.
> So if we want this optimization - and it sounds worth it - I think we
> should do it right. But I also think it should be written more
> legibly.
>
> And the "& 7" is the same "& (MAY_READ | MAY_WRITE | MAY_EXEC)" we do later.
>
> In other words, if we do this, I'd like it to be done even more
> aggressively, but I'd also like the end result to be a lot more
> readable and have more comments about why we do that odd thing.
>
> Something like this *UNTESTED* patch, perhaps?
That will kinda work, except you do that mask &= MAY_RWX before
check_acl(), which cares about MAY_NOT_BLOCK and who knows what other bits.
> I might have gotten something wrong, so this would need
> double-checking, but if it's right, I find it a _lot_ more easy to
> understand than making one expression that is pretty complicated and
> opaque.
Well, I thought this was readable enough with the added comment. There's
already that magic constant 3 in the shifts, so the 7 seemed entirely
sensible, though one could spell it 0007. Whatever.
Perhaps this? As a whole function, I think that's a bit easier for
brain-storming. It's your patch, just with that rwx thing used instead
of mask, except for the call to check_acl().
static int acl_permission_check(struct inode *inode, int mask)
{
unsigned int mode = inode->i_mode;
unsigned int rwx = mask & (MAY_READ | MAY_WRITE | MAY_EXEC);
/* Are we the owner? If so, ACL's don't matter */
if (likely(uid_eq(current_fsuid(), inode->i_uid))) {
if ((rwx << 6) & ~mode)
return -EACCES;
return 0;
}
/* Do we have ACL's? */
if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
int error = check_acl(inode, mask);
if (error != -EAGAIN)
return error;
}
/*
* Are the group permissions different from
* the other permissions in the bits we care
* about? Need to check group ownership if so.
*/
if (rwx & (mode ^ (mode >> 3))) {
if (in_group_p(inode->i_gid))
mode >>= 3;
}
/* Bits in 'mode' clear that we require? */
return (rwx & ~mode) ? -EACCES : 0;
}
Rasmus
On Sun, Jun 7, 2020 at 6:22 AM Rasmus Villemoes
<[email protected]> wrote:
>
> Yes, I did think about that, but I thought this was the more obviously
> correct approach, and that in practice one only sees the 0X44 and 0X55
> cases.
I'm not sure about that - it probably depends on your umask.
Because I see a lot of -rw-rw-r--. in my home directory, and it looks
like I have a umask of 0002.
That's just the Fedora default, I think. Looking at /etc/bashrc, it does
if [ $UID -gt 199 ] && [ "`/usr/bin/id -gn`" = "`/usr/bin/id -un`" ]; then
umask 002
else
umask 022
fi
iow, if you have the same user-name and group name, then umask is 002
by default for regular users.
Honestly, I'm not sure why Fedora has that "each user has its own
group" thing, but it's at least one common setup.
So I think that the system you are looking at just happens to have
umask 0022, which is traditional when you have just a 'user' group.
> That will kinda work, except you do that mask &= MAY_RWX before
> check_acl(), which cares about MAY_NOT_BLOCK and who knows what other bits.
Good catch.
> Perhaps this? As a whole function, I think that's a bit easier for
> brain-storming. It's your patch, just with that rwx thing used instead
> of mask, except for the call to check_acl().
Looks fine to me. Once we have to have rwx/mask separate, I'm not sure
it's worth having that early masking at all (didn't check what the
register pressure is over that "check_acl()" call, but at least it is
fairly easy to follow along.
Send me a patch with commit message etc, and I'll apply it.
Linus
On Sun, Jun 7, 2020 at 9:37 AM Linus Torvalds
<[email protected]> wrote:
>
> > That will kinda work, except you do that mask &= MAY_RWX before
> > check_acl(), which cares about MAY_NOT_BLOCK and who knows what other bits.
>
> Good catch.
With the change to not clear the non-rwx bits in general, the owner
case now wants to do the masking, and then the "shift left by 6"
modification makes no sense since it only makes for a bigger constant
(the only reason to do the shift-right was so that the bitwise not of
the i_mode could be done in parallel with the shift, but with the
masking that instruction scheduling optimization becomes kind of
immaterial too). So I modified that patch to not bother, and add a
comment about MAY_NOT_BLOCK.
And since I was looking at the MAY_NOT_BLOCK logic, it was not only
not mentioned in comments, it also had some really confusing code
around it.
The posix_acl_permission() looked like it tried to conserve that bit,
which is completely wrong. It wasn't a bug only for the simple reason
that the only two call-sites had either explicitly cleared the bit
when calling, or had tested that the bit wasn't set in the first
place.
So as a result, I wrote a second patch to clear that confusion up.
Rasmus, say the word and I'll mark you for authorship on the first one.
Comments? Can you find something else wrong here, or some other fixup to do?
Al, any reaction?
Linus
On Sun, Jun 07, 2020 at 12:48:53PM -0700, Linus Torvalds wrote:
> Rasmus, say the word and I'll mark you for authorship on the first one.
>
> Comments? Can you find something else wrong here, or some other fixup to do?
>
> Al, any reaction?
It's correct, but this
> + if (mask & (mode ^ (mode >> 3))) {
> + if (in_group_p(inode->i_gid))
> + mode >>= 3;
> + }
> +
> + /* Bits in 'mode' clear that we require? */
> + return (mask & ~mode) ? -EACCES : 0;
might be easier to follow if we had, from the very beginning done
unsigned int deny = ~inode->i_mode;
and turned that into
// for group the bits 3..5 apply, for others - 0..2
// we only care which to use when they do not
// agree anyway.
if (mask & (deny ^ (deny >> 3))) // mask & deny != mask & (deny >> 3)
if (in_...
deny >>= 3;
return mask & deny ? -EACCES : 0;
Hell knows...
On 07/06/2020 21.48, Linus Torvalds wrote:
> On Sun, Jun 7, 2020 at 9:37 AM Linus Torvalds
> <[email protected]> wrote:
>>
>>> That will kinda work, except you do that mask &= MAY_RWX before
>>> check_acl(), which cares about MAY_NOT_BLOCK and who knows what other bits.
>>
>> Good catch.
>
> With the change to not clear the non-rwx bits in general, the owner
> case now wants to do the masking, and then the "shift left by 6"
> modification makes no sense since it only makes for a bigger constant
> (the only reason to do the shift-right was so that the bitwise not of
> the i_mode could be done in parallel with the shift, but with the
> masking that instruction scheduling optimization becomes kind of
> immaterial too). So I modified that patch to not bother, and add a
> comment about MAY_NOT_BLOCK.
>
> And since I was looking at the MAY_NOT_BLOCK logic, it was not only
> not mentioned in comments, it also had some really confusing code
> around it.
>
> The posix_acl_permission() looked like it tried to conserve that bit,
> which is completely wrong. It wasn't a bug only for the simple reason
> that the only two call-sites had either explicitly cleared the bit
> when calling, or had tested that the bit wasn't set in the first
> place.
>
> So as a result, I wrote a second patch to clear that confusion up.
>
> Rasmus, say the word and I'll mark you for authorship on the first one.
It might be a bit confusing with me mentioned in the third person and
then also author, and it's really mostly your patch, so reported-by is
fine with me. But it's up to you.
> Comments? Can you find something else wrong here, or some other fixup to do?
No, I think it's ok. I took a look at the disassembly and it looks fine.
There's an extra push/pop of %r14 [that's where gcc computes mode>>3,
then CSE allows it to do cmovne %r14d,%ebx after in_group_p), so the
owner case gets slightly penalized. I think/hope the savings from
avoiding the in_group_p should compensate for that - any absolute path
open() by non-root saves at least two in_group_p. YMMV.
Rasmus
On Sun, Jun 7, 2020 at 7:05 PM Al Viro <[email protected]> wrote:
>
> return mask & deny ? -EACCES : 0;
I agree that 'deny' would be simpler to read here, but in other places
it would look odd. ie the IS_POSIXACL() thing that checks for "are
group bits set" still wants the mode.
And I'd hate to have us use code that then mixes 'deny' and 'mode'
when they are directly related to each other.
Anyway, I merged the patch as-is, I guess we can make future changes
to this if you feel strongly about it.
Linus