2021-09-13 17:25:12

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()



Le 13/09/2021 à 18:21, Eric W. Biederman a écrit :
> [email protected] (Eric W. Biederman) writes:
>
>> Christophe Leroy <[email protected]> writes:
>>
>>> Use unsafe_copy_siginfo_to_user() in order to do the copy
>>> within the user access block.
>>>
>>> On an mpc 8321 (book3s/32) the improvment is about 5% on a process
>>> sending a signal to itself.
>
> If you can't make function calls from an unsafe macro there is another
> way to handle this that doesn't require everything to be inline.
>
> From a safety perspective it is probably even a better approach.

Yes but that's exactly what I wanted to avoid for the native ppc32 case:
this double hop means useless pressure on the cache. The siginfo_t
structure is 128 bytes large, that means 8 lines of cache on powerpc 8xx.

But maybe it is acceptable to do that only for the compat case. Let me
think about it, it might be quite easy.

Christophe


2021-09-14 00:58:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()

Christophe Leroy <[email protected]> writes:

> Le 13/09/2021 à 18:21, Eric W. Biederman a écrit :
>> [email protected] (Eric W. Biederman) writes:
>>
>>> Christophe Leroy <[email protected]> writes:
>>>
>>>> Use unsafe_copy_siginfo_to_user() in order to do the copy
>>>> within the user access block.
>>>>
>>>> On an mpc 8321 (book3s/32) the improvment is about 5% on a process
>>>> sending a signal to itself.
>>
>> If you can't make function calls from an unsafe macro there is another
>> way to handle this that doesn't require everything to be inline.
>>
>> From a safety perspective it is probably even a better approach.
>
> Yes but that's exactly what I wanted to avoid for the native ppc32 case: this
> double hop means useless pressure on the cache. The siginfo_t structure is 128
> bytes large, that means 8 lines of cache on powerpc 8xx.
>
> But maybe it is acceptable to do that only for the compat case. Let me think
> about it, it might be quite easy.

The places get_signal is called tend to be well known. So I think we
are safe from a capacity standpoint.

I am not certain it makes a difference in capacity as there is a high
probability that the stack was deeper recently than it is now which
suggests the cache blocks might already be in the cache.

My sense it is worth benchmarking before optimizing out the extra copy
like that.

On the extreme side there is simply building the entire sigframe on the
stack and then just calling it copy_to_user. As the stack cache lines
are likely to be hot, and copy_to_user is quite well optimized
there is a real possibility that it is faster to build everything
on the kernel stack, and then copy it to the user space stack.

It is also possible that I am wrong and we may want to figure out how
far up we can push the conversion to the 32bit siginfo format.

If could move the work into collect_signal we could guarantee there
would be no extra work. That would require adjusting the sigframe
generation code on all of the architectures.

There is a lot we can do but we need benchmarking to tell if it is
worth it.

Eric





2021-09-14 14:03:07

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RESEND v3 6/6] powerpc/signal: Use unsafe_copy_siginfo_to_user()



Le 13/09/2021 à 21:11, Eric W. Biederman a écrit :
> Christophe Leroy <[email protected]> writes:
>
>> Le 13/09/2021 à 18:21, Eric W. Biederman a écrit :
>>> [email protected] (Eric W. Biederman) writes:
>>>
>>>> Christophe Leroy <[email protected]> writes:
>>>>
>>>>> Use unsafe_copy_siginfo_to_user() in order to do the copy
>>>>> within the user access block.
>>>>>
>>>>> On an mpc 8321 (book3s/32) the improvment is about 5% on a process
>>>>> sending a signal to itself.
>>>
>>> If you can't make function calls from an unsafe macro there is another
>>> way to handle this that doesn't require everything to be inline.
>>>
>>> From a safety perspective it is probably even a better approach.
>>
>> Yes but that's exactly what I wanted to avoid for the native ppc32 case: this
>> double hop means useless pressure on the cache. The siginfo_t structure is 128
>> bytes large, that means 8 lines of cache on powerpc 8xx.
>>
>> But maybe it is acceptable to do that only for the compat case. Let me think
>> about it, it might be quite easy.
>
> The places get_signal is called tend to be well known. So I think we
> are safe from a capacity standpoint.
>
> I am not certain it makes a difference in capacity as there is a high
> probability that the stack was deeper recently than it is now which
> suggests the cache blocks might already be in the cache.
>
> My sense it is worth benchmarking before optimizing out the extra copy
> like that.
>
> On the extreme side there is simply building the entire sigframe on the
> stack and then just calling it copy_to_user. As the stack cache lines
> are likely to be hot, and copy_to_user is quite well optimized
> there is a real possibility that it is faster to build everything
> on the kernel stack, and then copy it to the user space stack.
>
> It is also possible that I am wrong and we may want to figure out how
> far up we can push the conversion to the 32bit siginfo format.
>
> If could move the work into collect_signal we could guarantee there
> would be no extra work. That would require adjusting the sigframe
> generation code on all of the architectures.
>
> There is a lot we can do but we need benchmarking to tell if it is
> worth it.
>


Sure, I'm benchmarking all the work I have been doing on signal code
with the following simple app that I run with 'perf stat':

#include <stdlib.h>
#include <signal.h>

void sigusr1(int sig) { }

int main(int argc, char **argv)
{
int i = 100000;

signal(SIGUSR1, sigusr1);
for (;i--;)
raise(SIGUSR1);
exit(0);
}


On an mpc8321 a 32 bits powerpc with KUAP enabled (KUAP is equivalent of
x86 SMAP)

Before changing copy_siginfo_to_user() to unsafe_copy_siginfo_to_user(),
'perf stat' reports 1983 msec (task-clock)

After my change I get 1900 msec.

With your approach I get 1930 msec, so we are loosing 36% of the benefit
of converting to the 'unsafe_' alternative.

So I think it is worth it.

Christophe