With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
(for instance mvme5100_defconfig and ps3_defconfig), gcc 5
generates a call to _restgpr_31_x.
Until recently it went unnoticed, but
commit 42ed6d56ade2 ("powerpc/vdso: Block R_PPC_REL24 relocations")
made it rise to the surface.
Provide that function (copied from lib/crtsavres.S) in
gettimeofday.S
Fixes: ab037dd87a2f ("powerpc/vdso: Switch VDSO to generic C implementation.")
Signed-off-by: Christophe Leroy <[email protected]>
---
I don't know if there is a way to tell GCC not to emit that call, because at the end we get more instructions than needed.
---
arch/powerpc/kernel/vdso32/gettimeofday.S | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S b/arch/powerpc/kernel/vdso32/gettimeofday.S
index a6e29f880e0e..d21d08140a5e 100644
--- a/arch/powerpc/kernel/vdso32/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
@@ -65,3 +65,14 @@ V_FUNCTION_END(__kernel_clock_getres)
V_FUNCTION_BEGIN(__kernel_time)
cvdso_call_time __c_kernel_time
V_FUNCTION_END(__kernel_time)
+
+/* Routines for restoring integer registers, called by the compiler. */
+/* Called with r11 pointing to the stack header word of the caller of the */
+/* function, just beyond the end of the integer restore area. */
+_GLOBAL(_restgpr_31_x)
+_GLOBAL(_rest32gpr_31_x)
+ lwz r0,4(r11)
+ lwz r31,-4(r11)
+ mtlr r0
+ mr r1,r11
+ blr
--
2.25.0
Hi!
On Tue, Mar 09, 2021 at 06:19:30AM +0000, Christophe Leroy wrote:
> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> generates a call to _restgpr_31_x.
> I don't know if there is a way to tell GCC not to emit that call, because at the end we get more instructions than needed.
The function is required by the ABI, you need to have it.
You get *fewer* insns statically, and that is what -Os is about: reduce
the size of the binaries.
(The only reason you get such problems is because Linux does not have
all of libgcc. You can have that *and* have some symbols cause link
errors, it isn't rocket science).
Segher
Le 09/03/2021 à 07:19, Christophe Leroy a écrit :
> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> generates a call to _restgpr_31_x.
>
> Until recently it went unnoticed, but
> commit 42ed6d56ade2 ("powerpc/vdso: Block R_PPC_REL24 relocations")
> made it rise to the surface.
>
> Provide that function (copied from lib/crtsavres.S) in
> gettimeofday.S
>
> Fixes: ab037dd87a2f ("powerpc/vdso: Switch VDSO to generic C implementation.")
> Signed-off-by: Christophe Leroy <[email protected]>
Fixes the following builds:
http://kisskb.ellerman.id.au/kisskb/buildresult/14492138/
http://kisskb.ellerman.id.au/kisskb/buildresult/14492041/
Christophe
> ---
> I don't know if there is a way to tell GCC not to emit that call, because at the end we get more instructions than needed.
> ---
> arch/powerpc/kernel/vdso32/gettimeofday.S | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S b/arch/powerpc/kernel/vdso32/gettimeofday.S
> index a6e29f880e0e..d21d08140a5e 100644
> --- a/arch/powerpc/kernel/vdso32/gettimeofday.S
> +++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
> @@ -65,3 +65,14 @@ V_FUNCTION_END(__kernel_clock_getres)
> V_FUNCTION_BEGIN(__kernel_time)
> cvdso_call_time __c_kernel_time
> V_FUNCTION_END(__kernel_time)
> +
> +/* Routines for restoring integer registers, called by the compiler. */
> +/* Called with r11 pointing to the stack header word of the caller of the */
> +/* function, just beyond the end of the integer restore area. */
> +_GLOBAL(_restgpr_31_x)
> +_GLOBAL(_rest32gpr_31_x)
> + lwz r0,4(r11)
> + lwz r31,-4(r11)
> + mtlr r0
> + mr r1,r11
> + blr
>
On Tue, 9 Mar 2021 06:19:30 +0000 (UTC), Christophe Leroy wrote:
> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> generates a call to _restgpr_31_x.
>
> Until recently it went unnoticed, but
> commit 42ed6d56ade2 ("powerpc/vdso: Block R_PPC_REL24 relocations")
> made it rise to the surface.
>
> [...]
Applied to powerpc/fixes.
[1/1] powerpc/vdso32: Add missing _restgpr_31_x to fix build failure
https://git.kernel.org/powerpc/c/08c18b63d9656e0389087d1956d2b37fd7019172
cheers
On 12/03/2021 03.29, Segher Boessenkool wrote:
> Hi!
>
> On Tue, Mar 09, 2021 at 06:19:30AM +0000, Christophe Leroy wrote:
>> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
>> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
>> generates a call to _restgpr_31_x.
>
>> I don't know if there is a way to tell GCC not to emit that call, because at the end we get more instructions than needed.
>
> The function is required by the ABI, you need to have it.
>
> You get *fewer* insns statically, and that is what -Os is about: reduce
> the size of the binaries.
Is there any reason to not just always build the vdso with -O2? It's one
page/one VMA either way, and the vdso is about making certain system
calls cheaper, so if unconditional -O2 could save a few cycles compared
to -Os, why not? (And if, as it seems, there's only one user within the
DSO of _restgpr_31_x, yes, the overall size of the .text segment
probably increases slightly).
Rasmus
From: Rasmus Villemoes
> Sent: 15 March 2021 16:24
>
> On 12/03/2021 03.29, Segher Boessenkool wrote:
> > Hi!
> >
> > On Tue, Mar 09, 2021 at 06:19:30AM +0000, Christophe Leroy wrote:
> >> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> >> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> >> generates a call to _restgpr_31_x.
> >
> >> I don't know if there is a way to tell GCC not to emit that call, because at the end we get more
> instructions than needed.
> >
> > The function is required by the ABI, you need to have it.
> >
> > You get *fewer* insns statically, and that is what -Os is about: reduce
> > the size of the binaries.
>
> Is there any reason to not just always build the vdso with -O2? It's one
> page/one VMA either way, and the vdso is about making certain system
> calls cheaper, so if unconditional -O2 could save a few cycles compared
> to -Os, why not? (And if, as it seems, there's only one user within the
> DSO of _restgpr_31_x, yes, the overall size of the .text segment
> probably increases slightly).
Sometimes -Os generates such horrid code you really never want to use it.
A classic is on x86 where it replaces 'load register with byte constant'
with 'push byte' 'pop register'.
The code is actually smaller but the execution time is horrid.
There are also cases where -O2 actually generates smaller code.
Although you may need to disable loop unrolling (often dubious at best)
and either force or disable some function inlining.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Mar 15, 2021 at 05:23:44PM +0100, Rasmus Villemoes wrote:
> On 12/03/2021 03.29, Segher Boessenkool wrote:
> > On Tue, Mar 09, 2021 at 06:19:30AM +0000, Christophe Leroy wrote:
> >> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> >> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> >> generates a call to _restgpr_31_x.
> >
> >> I don't know if there is a way to tell GCC not to emit that call, because at the end we get more instructions than needed.
> >
> > The function is required by the ABI, you need to have it.
> >
> > You get *fewer* insns statically, and that is what -Os is about: reduce
> > the size of the binaries.
>
> Is there any reason to not just always build the vdso with -O2? It's one
> page/one VMA either way, and the vdso is about making certain system
> calls cheaper, so if unconditional -O2 could save a few cycles compared
> to -Os, why not? (And if, as it seems, there's only one user within the
> DSO of _restgpr_31_x, yes, the overall size of the .text segment
> probably increases slightly).
You can use exactly the same reasoning for using -O2 instead of -Os
anywhere else.
-Os doesn't mean "smaller code, but only where that is reasonable". It
means "smaller code".
Segher
On Mon, Mar 15, 2021 at 04:38:52PM +0000, David Laight wrote:
> From: Rasmus Villemoes
> > Sent: 15 March 2021 16:24
> > On 12/03/2021 03.29, Segher Boessenkool wrote:
> > > On Tue, Mar 09, 2021 at 06:19:30AM +0000, Christophe Leroy wrote:
> > >> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> > >> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> > >> generates a call to _restgpr_31_x.
> > >
> > >> I don't know if there is a way to tell GCC not to emit that call, because at the end we get more
> > instructions than needed.
> > >
> > > The function is required by the ABI, you need to have it.
> > >
> > > You get *fewer* insns statically, and that is what -Os is about: reduce
> > > the size of the binaries.
> >
> > Is there any reason to not just always build the vdso with -O2? It's one
> > page/one VMA either way, and the vdso is about making certain system
> > calls cheaper, so if unconditional -O2 could save a few cycles compared
> > to -Os, why not? (And if, as it seems, there's only one user within the
> > DSO of _restgpr_31_x, yes, the overall size of the .text segment
> > probably increases slightly).
>
> Sometimes -Os generates such horrid code you really never want to use it.
> A classic is on x86 where it replaces 'load register with byte constant'
> with 'push byte' 'pop register'.
> The code is actually smaller but the execution time is horrid.
>
> There are also cases where -O2 actually generates smaller code.
Yes, as with all heuristics it doesn't always work out. But usually -Os
is smaller.
> Although you may need to disable loop unrolling (often dubious at best)
> and either force or disable some function inlining.
The cases where GCC does loop unrolling at -O2 always help quite a lot.
Or, do you have a counter-example? We'd love to see one.
And yup, inlining is hard. GCC's heuristics there are very good
nowadays, but any single decision has big effects. Doing the important
spots manually (always_inline or noinline) has good payoff.
Segher
From: Segher Boessenkool
> Sent: 16 March 2021 00:00
...
> > Although you may need to disable loop unrolling (often dubious at best)
> > and either force or disable some function inlining.
>
> The cases where GCC does loop unrolling at -O2 always help quite a lot.
> Or, do you have a counter-example? We'd love to see one.
The real problem with loop unrolling is that quite often a modern
out-of-order superscaler processor actually has 'spare' execution
cycles where the loop control can be done 'for free'.
Sometimes you do need to unroll (or interleave) a couple of
times to get enough spare execution cycles.
But the unrolled loop has to read a lot more code into cache
- so unless the code is 'hot cache' (that is usually arranged
for benchmarking) those delays apply as well.
The larger code footprint also displaces other code.
My real annoyance with gcc is unrolling (and vectorizing)
loops that I know are never executed as many times as even one
copy of the unrolled loop.
As an example intel (ivy bridge onwards) cpu execute the
following code (the middle of the ip checksum) at 8 bytes/clock.
(Limited by the carry flag.)
It just doesn't need any further unrolling.
+ "10: jecxz 20f\n"
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"
Annoyingly that loop is slow on my 8-core atom.
The existing code only does 4 bytes/clock on intel cpu prior
to either broadwell or haswell (forgotten which) in spite
of much more unroling.
> And yup, inlining is hard. GCC's heuristics there are very good
> nowadays, but any single decision has big effects. Doing the important
> spots manually (always_inline or noinline) has good payoff.
Latest inline gripe was a function replicated about 20 times
when the non-inline version was a register load and 'tail call'.
The inlining is just bloat.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)