2022-02-24 00:40:25

by Gabriel Paubert

Subject: Re: [PATCH] powerpc/32: Clear volatile regs on syscall exit

On Wed, Feb 23, 2022 at 06:11:36PM +0100, Christophe Leroy wrote:
> Commit a82adfd5c7cb ("hardening: Introduce CONFIG_ZERO_CALL_USED_REGS")
> added zeroing of used registers at function exit.
>
> Currently, PPC64 clears volatile registers on syscall exit but
> PPC32 doesn't do it, for performance reasons.
>
> Add that clearing in PPC32 syscall exit as well, but only when
> CONFIG_ZERO_CALL_USED_REGS is selected.
>
> On an 8xx, the null_syscall selftest gives:
> - Without CONFIG_ZERO_CALL_USED_REGS : 288 cycles
> - With CONFIG_ZERO_CALL_USED_REGS : 305 cycles
> - With CONFIG_ZERO_CALL_USED_REGS + this patch : 319 cycles
>
> Note that (independent of this patch), with pmac32_defconfig,
> vmlinux size is as follows with/without CONFIG_ZERO_CALL_USED_REGS:
>
> text data bss dec hex filename
> 9578869 2525210 194400 12298479 bba8ef vmlinux.without
> 10318045 2525210 194400 13037655 c6f057 vmlinux.with
>
> That is a 7.7% increase in text size and 6.0% in overall size.
>
> Signed-off-by: Christophe Leroy <[email protected]>
> ---
> arch/powerpc/kernel/entry_32.S | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index 7748c278d13c..199f23092c02 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -151,6 +151,21 @@ syscall_exit_finish:
> bne 3f
> mtcr r5
>
> +#ifdef CONFIG_ZERO_CALL_USED_REGS
> + /* Zero volatile regs that may contain sensitive kernel data */
> + li r0,0
> + li r4,0
> + li r5,0
> + li r6,0
> + li r7,0
> + li r8,0
> + li r9,0
> + li r10,0
> + li r11,0
> + li r12,0
> + mtctr r0
> + mtxer r0

Here, I'm almost sure that on some processors, it would be better to
separate mtctr from mtxer. mtxer is typically very expensive (pipeline
flush), but I don't know what the best ordering is for the average core.
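For instance, something along these lines (purely illustrative, keeping
r3 intact since it holds the syscall return value; how much it buys, if
anything, is of course core-dependent):

	li	r0,0
	mtctr	r0	/* start the first SPR write early */
	li	r4,0
	li	r5,0
	li	r6,0
	li	r7,0
	li	r8,0
	li	r9,0
	li	r10,0
	li	r11,0
	li	r12,0
	mtxer	r0	/* now well separated from the mtctr */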

And what about lr? Should it also be cleared?

Gabriel

> +#endif
> 1: lwz r2,GPR2(r1)
> lwz r1,GPR1(r1)
> rfi
> --
> 2.34.1
>



2022-02-24 00:54:49

by Segher Boessenkool

Subject: Re: [PATCH] powerpc/32: Clear volatile regs on syscall exit

On Wed, Feb 23, 2022 at 09:48:09PM +0100, Gabriel Paubert wrote:
> On Wed, Feb 23, 2022 at 06:11:36PM +0100, Christophe Leroy wrote:
> > + /* Zero volatile regs that may contain sensitive kernel data */
> > + li r0,0
> > + li r4,0
> > + li r5,0
> > + li r6,0
> > + li r7,0
> > + li r8,0
> > + li r9,0
> > + li r10,0
> > + li r11,0
> > + li r12,0
> > + mtctr r0
> > + mtxer r0
>
> Here, I'm almost sure that on some processors, it would be better to
> separate mtctr from mtxer. mtxer is typically very expensive (pipeline
> flush), but I don't know what the best ordering is for the average core.

mtxer is cheaper than mtctr on many cores :-)

On p9, mtxer is cracked into two latency-3 ops (which run in parallel),
while mtctr has latency 5.

On p8, mtxer was horrible indeed (but nowhere near as bad as a pipeline
flush).


Segher

2022-02-24 08:30:55

by Christophe Leroy

Subject: Re: [PATCH] powerpc/32: Clear volatile regs on syscall exit



On 23/02/2022 at 21:48, Gabriel Paubert wrote:
> On Wed, Feb 23, 2022 at 06:11:36PM +0100, Christophe Leroy wrote:
>> Commit a82adfd5c7cb ("hardening: Introduce CONFIG_ZERO_CALL_USED_REGS")
>> added zeroing of used registers at function exit.
>>
>> Currently, PPC64 clears volatile registers on syscall exit but
>> PPC32 doesn't do it, for performance reasons.
>>
>> Add that clearing in PPC32 syscall exit as well, but only when
>> CONFIG_ZERO_CALL_USED_REGS is selected.
>>
>> On an 8xx, the null_syscall selftest gives:
>> - Without CONFIG_ZERO_CALL_USED_REGS : 288 cycles
>> - With CONFIG_ZERO_CALL_USED_REGS : 305 cycles
>> - With CONFIG_ZERO_CALL_USED_REGS + this patch : 319 cycles
>>
>> Note that (independent of this patch), with pmac32_defconfig,
>> vmlinux size is as follows with/without CONFIG_ZERO_CALL_USED_REGS:
>>
>> text data bss dec hex filename
>> 9578869 2525210 194400 12298479 bba8ef vmlinux.without
>> 10318045 2525210 194400 13037655 c6f057 vmlinux.with
>>
>> That is a 7.7% increase in text size and 6.0% in overall size.
>>
>> Signed-off-by: Christophe Leroy <[email protected]>
>> ---
>> arch/powerpc/kernel/entry_32.S | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
>> index 7748c278d13c..199f23092c02 100644
>> --- a/arch/powerpc/kernel/entry_32.S
>> +++ b/arch/powerpc/kernel/entry_32.S
>> @@ -151,6 +151,21 @@ syscall_exit_finish:
>> bne 3f
>> mtcr r5
>>
>> +#ifdef CONFIG_ZERO_CALL_USED_REGS
>> + /* Zero volatile regs that may contain sensitive kernel data */
>> + li r0,0
>> + li r4,0
>> + li r5,0
>> + li r6,0
>> + li r7,0
>> + li r8,0
>> + li r9,0
>> + li r10,0
>> + li r11,0
>> + li r12,0
>> + mtctr r0
>> + mtxer r0
>
> Here, I'm almost sure that on some processors, it would be better to
> separate mtctr from mtxer. mtxer is typically very expensive (pipeline
> flush), but I don't know what the best ordering is for the average core.

On the 8xx, CTR and LR are handled by the BPU like any other reg (latency
1, blockage 1).
AFAIU, XER is serialized + 1.

>
> And what about lr? Should it also be cleared?

LR is restored from stack.
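Something like this (illustrative sketch; see entry_32.S for the exact
registers and offsets used):

	lwz	r4,_LINK(r1)	/* reload saved LR from the exception frame */
	mtlr	r4		/* so no kernel LR value reaches userspace */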

Christophe

2022-02-24 08:57:38

by Gabriel Paubert

Subject: Re: [PATCH] powerpc/32: Clear volatile regs on syscall exit

On Wed, Feb 23, 2022 at 05:27:39PM -0600, Segher Boessenkool wrote:
> On Wed, Feb 23, 2022 at 09:48:09PM +0100, Gabriel Paubert wrote:
> > On Wed, Feb 23, 2022 at 06:11:36PM +0100, Christophe Leroy wrote:
> > > + /* Zero volatile regs that may contain sensitive kernel data */
> > > + li r0,0
> > > + li r4,0
> > > + li r5,0
> > > + li r6,0
> > > + li r7,0
> > > + li r8,0
> > > + li r9,0
> > > + li r10,0
> > > + li r11,0
> > > + li r12,0
> > > + mtctr r0
> > > + mtxer r0
> >
> > Here, I'm almost sure that on some processors, it would be better to
> > separate mtctr from mtxer. mtxer is typically very expensive (pipeline
> > flush), but I don't know what the best ordering is for the average core.
>
> mtxer is cheaper than mtctr on many cores :-)

We're speaking of 32-bit here, I believe; in my (admittedly old) paper
copy of the PowerPC 604 user's manual, I read in a footnote:

"The mtspr (XER) instruction causes instructions to be flushed when it
executes."

There is also a paragraph about "PostDispatch Serialization Mode", which reads:
"All instructions following the postdispatch serialization instruction
are flushed, refetched, and reexecuted."

Then it goes on to list the affected instructions, which starts with:
mtspr(xer), mcrxr, isync, ...

I know there are probably very few 604s left in the field, but in this
case mtspr(xer) looks very much like a superset of isync.

I also just had a look at the documentation of a more widespread core:

https://www.nxp.com/docs/en/reference-manual/MPC7450UM.pdf

and mtspr(xer) is marked as both execution and refetch serialized; in fact,
it is the only instruction to have both.

Maybe there is a subtle difference between "refetch serialization" and
"pipeline flush", but in this case please educate me.

Besides that, the back-to-back mtctr/mtspr(xer) may limit instruction
decoding and issuing bandwidth. I'd rather move one of them up by a few
lines, since they can only go to one of the execution units on some
(or even most?) cores. This was my main point initially.

Gabriel

>
> On p9, mtxer is cracked into two latency-3 ops (which run in parallel),
> while mtctr has latency 5.
>
> On p8, mtxer was horrible indeed (but nowhere near as bad as a pipeline
> flush).
>
>
> Segher


2022-02-24 15:45:57

by Segher Boessenkool

Subject: Re: [PATCH] powerpc/32: Clear volatile regs on syscall exit

Hi!

On Thu, Feb 24, 2022 at 09:29:55AM +0100, Gabriel Paubert wrote:
> On Wed, Feb 23, 2022 at 05:27:39PM -0600, Segher Boessenkool wrote:
> > On Wed, Feb 23, 2022 at 09:48:09PM +0100, Gabriel Paubert wrote:
> > > On Wed, Feb 23, 2022 at 06:11:36PM +0100, Christophe Leroy wrote:
> > > > + /* Zero volatile regs that may contain sensitive kernel data */
> > > > + li r0,0
> > > > + li r4,0
> > > > + li r5,0
> > > > + li r6,0
> > > > + li r7,0
> > > > + li r8,0
> > > > + li r9,0
> > > > + li r10,0
> > > > + li r11,0
> > > > + li r12,0
> > > > + mtctr r0
> > > > + mtxer r0
> > >
> > > Here, I'm almost sure that on some processors, it would be better to
> > > separate mtctr from mtxer. mtxer is typically very expensive (pipeline
> > > flush), but I don't know what the best ordering is for the average core.
> >
> > mtxer is cheaper than mtctr on many cores :-)
>
> We're speaking of 32-bit here, I believe;

32-bit userland, yes. Which runs fine on non-ancient cores, too.

> in my (admittedly old) paper
> copy of the PowerPC 604 user's manual, I read in a footnote:
>
> "The mtspr (XER) instruction causes instructions to be flushed when it
> executes."

And the 604 has a trivially shallow pipeline anyway.

> I know there are probably very few 604s left in the field, but in this
> case mtspr(xer) looks very much like a superset of isync.

It hasn't been like that for decades. On the 750, for example, mtxer was
already only execution-synchronised.

> I also just had a look at the documentation of a more widespread core:
>
> https://www.nxp.com/docs/en/reference-manual/MPC7450UM.pdf
>
> and mtspr(xer) is marked as execution and refetch serialized, actually
> it is the only instruction to have both.

This looks like a late addition (it messes up the table, for example,
being put after "mtspr (other)"). It is also different from the 7400, the
750, and everything else. A late bugfix? Curious :-)

> Maybe there is a subtle difference between "refetch serialization" and
> "pipeline flush", but in this case please educate me.

There is a subtle difference, but it goes the other way: refetch
serialisation doesn't stop fetch or flush everything after it; only when
the instruction completes does it reject everything after it. So it can
waste a bit more :-)

> Besides that, the back-to-back mtctr/mtspr(xer) may limit instruction
> decoding and issuing bandwidth.

It doesn't limit decode or dispatch (not issue, fwiw) bandwidth on any
core I have ever heard of.

> I'd rather move one of them up by a few
> lines, since they can only go to one of the execution units on some
> (or even most?) cores. This was my main point initially.

I think it is much more beneficial to *not* do these insns than to
shift them back and forth a cycle.


Segher