2015-07-05 20:08:28

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support

As per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, GCC only
allows -mpreferred-stack-boundary=3 on x86_64 if -mno-sse is set.
That means that cc-option will not detect
-mpreferred-stack-boundary=3 support, because we test for it before
setting -mno-sse.

Fix it by reordering the Makefile bits.

Compile-tested only. This should help avoid code generation issues
such as the one that was worked around in b96fecbfa8c8 ("x86/fpu:
Fix boot crash in the early FPU code").

I'm a bit concerned that we could still have problems on older GCC
versions given that our asm code does not respect GCC's idea of the
ABI-required stack alignment.

Cc: Linus Torvalds <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/Makefile | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 118e6debc483..344dd2110b2a 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -39,6 +39,12 @@ ifdef CONFIG_X86_NEED_RELOCS
LDFLAGS_vmlinux := --emit-relocs
endif

+# prevent gcc from generating any FP code by mistake
+# This must be before we try -mpreferred-stack-boundary -- see
+# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
+KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
+KBUILD_CFLAGS += $(call cc-option,-mno-avx,)
+
ifeq ($(CONFIG_X86_32),y)
BITS := 32
UTS_MACHINE := i386
@@ -167,9 +173,6 @@ KBUILD_CFLAGS += -pipe
KBUILD_CFLAGS += -Wno-sign-compare
#
KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
-# prevent gcc from generating any FP code by mistake
-KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
-KBUILD_CFLAGS += $(call cc-option,-mno-avx,)

KBUILD_CFLAGS += $(mflags-y)
KBUILD_AFLAGS += $(mflags-y)
--
2.4.3


2015-07-06 13:44:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support


* Andy Lutomirski <[email protected]> wrote:

> As per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, GCC only
> allows -mpreferred-stack-boundary=3 on x86_64 if -mno-sse is set.
> That means that cc-option will not detect
> -mpreferred-stack-boundary=3 support, because we test for it before
> setting -mno-sse.
>
> Fix it by reordering the Makefile bits.
>
> Compile-tested only. This should help avoid code generation issues
> such as the one that was worked around in b96fecbfa8c8 ("x86/fpu:
> Fix boot crash in the early FPU code").
>
> I'm a bit concerned that we could still have problems on older GCC
> versions given that our asm code does not respect GCC's idea of the
> ABI-required stack alignment.
>
> Cc: Linus Torvalds <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/Makefile | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 118e6debc483..344dd2110b2a 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -39,6 +39,12 @@ ifdef CONFIG_X86_NEED_RELOCS
> LDFLAGS_vmlinux := --emit-relocs
> endif
>
> +# prevent gcc from generating any FP code by mistake
> +# This must be before we try -mpreferred-stack-boundary -- see
> +# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383
> +KBUILD_CFLAGS += -mno-sse -mno-mmx -mno-sse2 -mno-3dnow
> +KBUILD_CFLAGS += $(call cc-option,-mno-avx,)
> +

So the 'stack boundary' is the RSP that GCC generates before it calls another
function from within an existing function, right?

So looking at this I question the choice of -mpreferred-stack-boundary=3. Why not
do -mpreferred-stack-boundary=2?

My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
64-bit variables, only if they cross 64-byte cache lines, which should be rare
with a chance of 1:16. This small penalty (of at most +1 cycle in some
circumstances IIRC) should be more than counterbalanced by the compression of the
stack by 5% on average.

... using stack-boundary=1 or stack-boundary=0 would probably be
counterproductive, as these more exotic misalignments get treated progressively
worse by x86 CPUs.

... but I have not measured any of this and even the 5% is just a possibly
overoptimistic guess.

Thanks,

Ingo

2015-07-06 16:59:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support

On Mon, Jul 6, 2015 at 6:44 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> As per https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, GCC only
>> allows -mpreferred-stack-boundary=3 on x86_64 if -mno-sse is set.
>> That means that cc-option will not detect
>> -mpreferred-stack-boundary=3 support, because we test for it before
>> setting -mno-sse.
>>
>> Fix it by reordering the Makefile bits.

...

>
> So the 'stack boundary' is the RSP that GCC generates before it calls another
> function from within an existing function, right?
>

I think so. Certainly the "incoming stack boundary" (which is exactly
the same as the preferred stack boundary unless explicitly changed) is
the RSP alignment that GCC expects on entry.

> So looking at this I question the choice of -mpreferred-stack-boundary=3. Why not
> do -mpreferred-stack-boundary=2?
>

Easy answer: we can't:

$ gcc -c -mno-sse -mpreferred-stack-boundary=2 empty.c
empty.c:1:0: error: -mpreferred-stack-boundary=2 is not between 3 and 12

> My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
> 64-bit variables, only if they cross 64-byte cache lines, which should be rare
> with a chance of 1:16. This small penalty (of at most +1 cycle in some
> circumstances IIRC) should be more than counterbalanced by the compression of the
> stack by 5% on average.
>

I'll counter with: what's the benefit? There are no operations that
will naturally change RSP by anything that isn't a multiple of 8
(there's no pushl in 64-bit mode, or at least not on AMD chips -- the
Intel manual is a bit vague on this point), so we'll end up with RSP
being a multiple of 8 regardless. Even if we somehow shaved 4 bytes
off in asm, that still wouldn't buy us anything, as a dangling 4 bytes
at the bottom of the stack isn't useful for anything.

--Andy

2015-07-06 17:10:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support

On Mon, Jul 6, 2015 at 6:44 AM, Ingo Molnar <[email protected]> wrote:
>
> So looking at this I question the choice of -mpreferred-stack-boundary=3. Why not
> do -mpreferred-stack-boundary=2?

It wouldn't make sense anyway - it would only make code worse (if it
worked) and not any better.

The reason the "=3" value is good is because 8-byte alignment is the
"natural" alignment - it's what you get with a normal call sequence,
simply because the return address is 8 bytes in size.

That means that with "=3" you don't get extra code to align the stack
for the simple functions that don't need a frame.

Anything smaller than 3 wouldn't help even if it worked, because none
of the normal stack operations (pushing/popping registers to
save/restore them) would be any smaller anyway.

But bigger values than 3 result in the compiler having to generate
extra stack adjustments just to align the stack after a call that very
naturally mis-aligned it. And it doesn't help anyway, since in the
kernel we don't put stuff on the stack that needs bigger alignment
(of, the fxsave buffer is a counter-example, but it's a very odd one
that we _shouldn't_ have put on the stack).

Linus

2015-07-06 17:32:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support


* Linus Torvalds <[email protected]> wrote:

> On Mon, Jul 6, 2015 at 6:44 AM, Ingo Molnar <[email protected]> wrote:
> >
> > So looking at this I question the choice of -mpreferred-stack-boundary=3. Why
> > not do -mpreferred-stack-boundary=2?
>
> It wouldn't make sense anyway - it would only make code worse (if it worked) and
> not any better.
>
> The reason the "=3" value is good is because 8-byte alignment is the "natural"
> alignment - it's what you get with a normal call sequence, simply because the
> return address is 8 bytes in size.
>
> That means that with "=3" you don't get extra code to align the stack for the
> simple functions that don't need a frame.
>
> Anything smaller than 3 wouldn't help even if it worked, because none of the
> normal stack operations (pushing/popping registers to save/restore them) would
> be any smaller anyway.
>
> But bigger values than 3 result in the compiler having to generate extra stack
> adjustments just to align the stack after a call that very naturally mis-aligned
> it. And it doesn't help anyway, since in the kernel we don't put stuff on the
> stack that needs bigger alignment (of, the fxsave buffer is a counter-example,
> but it's a very odd one that we _shouldn't_ have put on the stack).

Ok, so it's all moot, but my (quite possibly flawed) thinking was that for deeper
call chains, using 4 byte RSP alignment (as opposed to 8 bytes) would allow, in
about 50% of the cases, the stack frame to be narrower by 4 bytes. (depending on
whether the 'natural' stack boundary is properly aligned to 8 bytes or not.)

For a 10 deep call chain that's a 20 bytes more compact stack on average
(10*4*0.5), resulting in a tiny bit denser D$.

My assumptions were:

- no extra code is generated by GCC. (If it causes any extra code to be generated
then it's an obvious loss.)

- mis-aligning an 8 byte variable by 4 bytes is being handled quite well by most
x86 uarchs, without penalty in most cases.

But ... it's all moot and even in the best case if both my assumptions are fully
met (which is not a given), the advantages are pretty marginal, so consider the
idea dead by multiple mortal wounds.

Thanks,

Ingo

2015-07-06 17:40:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support


* Andy Lutomirski <[email protected]> wrote:

> > My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
> > 64-bit variables, only if they cross 64-byte cache lines, which should be rare
> > with a chance of 1:16. This small penalty (of at most +1 cycle in some
> > circumstances IIRC) should be more than counterbalanced by the compression of
> > the stack by 5% on average.
>
> I'll counter with: what's the benefit? There are no operations that will
> naturally change RSP by anything that isn't a multiple of 8 (there's no pushl in
> 64-bit mode, or at least not on AMD chips -- the Intel manual is a bit vague on
> this point), so we'll end up with RSP being a multiple of 8 regardless. Even if
> we somehow shaved 4 bytes off in asm, that still wouldn't buy us anything, as a
> dangling 4 bytes at the bottom of the stack isn't useful for anything.

Yeah, so it might be utilized in frame-pointer less builds (which we might be able
to utilize in the future if sane Dwarf code comes around), which does not use
push/pop to manage the stack but often has patterns like:

ffffffff8102aa90 <SyS_getpriority>:
ffffffff8102aa90: 48 83 ec 18 sub $0x18,%rsp

and uses MOVs to manage the stack. Those kinds of stack frames could be 4-byte
granular as well.

But yeah ... it's pretty marginal.

Thanks,

Ingo

2015-07-06 17:59:43

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support

On Mon, Jul 6, 2015 at 10:40 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> > My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
>> > 64-bit variables, only if they cross 64-byte cache lines, which should be rare
>> > with a chance of 1:16. This small penalty (of at most +1 cycle in some
>> > circumstances IIRC) should be more than counterbalanced by the compression of
>> > the stack by 5% on average.
>>
>> I'll counter with: what's the benefit? There are no operations that will
>> naturally change RSP by anything that isn't a multiple of 8 (there's no pushl in
>> 64-bit mode, or at least not on AMD chips -- the Intel manual is a bit vague on
>> this point), so we'll end up with RSP being a multiple of 8 regardless. Even if
>> we somehow shaved 4 bytes off in asm, that still wouldn't buy us anything, as a
>> dangling 4 bytes at the bottom of the stack isn't useful for anything.
>
> Yeah, so it might be utilized in frame-pointer less builds (which we might be able
> to utilize in the future if sane Dwarf code comes around), which does not use
> push/pop to manage the stack but often has patterns like:
>
> ffffffff8102aa90 <SyS_getpriority>:
> ffffffff8102aa90: 48 83 ec 18 sub $0x18,%rsp
>
> and uses MOVs to manage the stack. Those kinds of stack frames could be 4-byte
> granular as well.
>
> But yeah ... it's pretty marginal.

To get even that, we'd need an additional ABI-changing GCC flag to
change GCC's idea of the alignment of long from 8 to 4. (I just
checked: g++ thinks that alignof(long) == 8. I was too lazy to look
up how to ask the equivalent question in C.)

--Andy

2015-07-07 04:02:02

by Raymond Jennings

[permalink] [raw]
Subject: Re: [PATCH] x86: Fix detection of GCC -mpreferred-stack-boundary support

On Mon, 2015-07-06 at 10:59 -0700, Andy Lutomirski wrote:
> On Mon, Jul 6, 2015 at 10:40 AM, Ingo Molnar <[email protected]> wrote:
> >
> > * Andy Lutomirski <[email protected]> wrote:
> >
> >> > My reasoning: on modern uarchs there's no penalty for 32-bit misalignment of
> >> > 64-bit variables, only if they cross 64-byte cache lines, which should be rare
> >> > with a chance of 1:16. This small penalty (of at most +1 cycle in some
> >> > circumstances IIRC) should be more than counterbalanced by the compression of
> >> > the stack by 5% on average.
> >>
> >> I'll counter with: what's the benefit? There are no operations that will
> >> naturally change RSP by anything that isn't a multiple of 8 (there's no pushl in
> >> 64-bit mode, or at least not on AMD chips -- the Intel manual is a bit vague on
> >> this point), so we'll end up with RSP being a multiple of 8 regardless. Even if
> >> we somehow shaved 4 bytes off in asm, that still wouldn't buy us anything, as a
> >> dangling 4 bytes at the bottom of the stack isn't useful for anything.
> >
> > Yeah, so it might be utilized in frame-pointer less builds (which we might be able
> > to utilize in the future if sane Dwarf code comes around), which does not use
> > push/pop to manage the stack but often has patterns like:
> >
> > ffffffff8102aa90 <SyS_getpriority>:
> > ffffffff8102aa90: 48 83 ec 18 sub $0x18,%rsp
> >
> > and uses MOVs to manage the stack. Those kinds of stack frames could be 4-byte
> > granular as well.
> >
> > But yeah ... it's pretty marginal.
>
> To get even that, we'd need an additional ABI-changing GCC flag to
> change GCC's idea of the alignment of long from 8 to 4. (I just
> checked: g++ thinks that alignof(long) == 8. I was too lazy to look
> up how to ask the equivalent question in C.)

I just want to point out that long itself is 8 bytes on 64-bit x86, but
only 4 bytes on 32-bit x86.

Perhaps we should keep in mind sizeof(long) and not just alignof(long)?

My opinion btw, is that if long is 8 bytes wide, it should also be 8
bytes aligned.

> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/