2021-12-29 02:13:31

by Bill Wendling

Subject: [PATCH v2] x86: use builtins to read eflags

GCC and Clang both have builtins to read and write the EFLAGS register.
This allows the compiler to determine the best way to generate this
code, which can improve code generation.

This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations, and so uses memory
instead of registers. This is a known issue, which is currently being
addressed.

However, using builtins is beneficial in general, because it removes the
burden of determining the best way to read the flags register from the
programmer and places it onto the compiler, which has the information
needed to make that decision. Indeed, this piece of code has had several
changes over the years, some of which were pinging back and forth to
determine the correct constraints to use.

With this change, Clang generates better code:

Original code:
movq $0, -48(%rbp)
#APP
# __raw_save_flags
pushfq
popq -48(%rbp)
#NO_APP
movq -48(%rbp), %rbx

New code:
pushfq
popq %rbx
#APP

Note that the stack slot in the original code is no longer needed in the
new code, saving a small amount of stack space.

Signed-off-by: Bill Wendling <[email protected]>
---
v2: - Kept the original function to retain the out-of-line symbol.
- Improved the commit message.
- Note that I couldn't use Nick's suggestion of

return IS_ENABLED(CONFIG_X86_64) ? ...

because Clang complains about using __builtin_ia32_readeflags_u32 in
64-bit mode.
---
arch/x86/include/asm/irqflags.h | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index c5ce9845c999..27f919ea7ac3 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,20 +19,11 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
- unsigned long flags;
-
- /*
- * "=rm" is safe here, because "pop" adjusts the stack before
- * it evaluates its effective address -- this is part of the
- * documented behavior of the "pop" instruction.
- */
- asm volatile("# __raw_save_flags\n\t"
- "pushf ; pop %0"
- : "=rm" (flags)
- : /* no input */
- : "memory");
-
- return flags;
+#ifdef CONFIG_X86_64
+ return __builtin_ia32_readeflags_u64();
+#else
+ return __builtin_ia32_readeflags_u32();
+#endif
}

static __always_inline void native_irq_disable(void)
--
2.34.1.448.ga2b2bfdf31-goog



2022-01-28 17:23:19

by Bill Wendling

Subject: Re: [PATCH v2] x86: use builtins to read eflags

Bump for review.


On Tue, Dec 28, 2021 at 6:13 PM Bill Wendling <[email protected]> wrote:
>
> GCC and Clang both have builtins to read and write the EFLAGS register.
> This allows the compiler to determine the best way to generate this
> code, which can improve code generation.
>
> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations, and so uses memory
> instead of registers. This is a known issue, which is currently being
> addressed.
>
> However, using builtins is benefiical in general, because it removes the
> burden of determining what's the way to read the flags register from the
> programmer and places it on to the compiler, which has the information
> needed to make that decision. Indeed, this piece of code has had several
> changes over the years, some of which were pinging back and forth to
> determine the correct constraints to use.
>
> With this change, Clang generates better code:
>
> Original code:
> movq $0, -48(%rbp)
> #APP
> # __raw_save_flags
> pushfq
> popq -48(%rbp)
> #NO_APP
> movq -48(%rbp), %rbx
>
> New code:
> pushfq
> popq %rbx
> #APP
>
> Note that the stack slot in the original code is no longer needed in the
> new code, saving a small amount of stack space.
>
> Signed-off-by: Bill Wendling <[email protected]>
> ---
> v2: - Kept the original function to retain the out-of-line symbol.
> - Improved the commit message.
> - Note that I couldn't use Nick's suggestion of
>
> return IS_ENABLED(CONFIG_X86_64) ? ...
>
> because Clang complains about using __builtin_ia32_readeflags_u32 in
> 64-bit mode.
> ---
> arch/x86/include/asm/irqflags.h | 19 +++++--------------
> 1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index c5ce9845c999..27f919ea7ac3 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -19,20 +19,11 @@
> extern inline unsigned long native_save_fl(void);
> extern __always_inline unsigned long native_save_fl(void)
> {
> - unsigned long flags;
> -
> - /*
> - * "=rm" is safe here, because "pop" adjusts the stack before
> - * it evaluates its effective address -- this is part of the
> - * documented behavior of the "pop" instruction.
> - */
> - asm volatile("# __raw_save_flags\n\t"
> - "pushf ; pop %0"
> - : "=rm" (flags)
> - : /* no input */
> - : "memory");
> -
> - return flags;
> +#ifdef CONFIG_X86_64
> + return __builtin_ia32_readeflags_u64();
> +#else
> + return __builtin_ia32_readeflags_u32();
> +#endif
> }
>
> static __always_inline void native_irq_disable(void)
> --
> 2.34.1.448.ga2b2bfdf31-goog
>

2022-02-04 23:56:50

by Bill Wendling

Subject: [PATCH v3] x86: use builtins to read eflags

GCC and Clang both have builtins to read and write the EFLAGS register.
This allows the compiler to determine the best way to generate this
code, which can improve code generation.

This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations, and so uses memory
instead of registers. This is a known issue, which is currently being
addressed.

However, using builtins is beneficial in general, because it removes the
burden of determining the best way to read the flags register from the
programmer and places it onto the compiler, which has the information
needed to make that decision. Indeed, this piece of code has had several
changes over the years, some of which were pinging back and forth to
determine the correct constraints to use.

With this change, Clang generates better code:

Original code:
movq $0, -48(%rbp)
#APP
# __raw_save_flags
pushfq
popq -48(%rbp)
#NO_APP
movq -48(%rbp), %rbx

New code:
pushfq
popq %rbx
#APP

Note that the stack slot in the original code is no longer needed in the
new code, saving a small amount of stack space.

There is no change to GCC's output:

Original code:

# __raw_save_flags
pushf ; pop %r13 # flags

New code:

pushfq
popq %r13 # _23

Signed-off-by: Bill Wendling <[email protected]>
---
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
- Improved the commit message.
- Note that I couldn't use Nick's suggestion of

return IS_ENABLED(CONFIG_X86_64) ? ...

because Clang complains about using __builtin_ia32_readeflags_u32 in
64-bit mode.
---
arch/x86/include/asm/irqflags.h | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc0..f31a035f3c6a9 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,20 +19,11 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
- unsigned long flags;
-
- /*
- * "=rm" is safe here, because "pop" adjusts the stack before
- * it evaluates its effective address -- this is part of the
- * documented behavior of the "pop" instruction.
- */
- asm volatile("# __raw_save_flags\n\t"
- "pushf ; pop %0"
- : "=rm" (flags)
- : /* no input */
- : "memory");
-
- return flags;
+#ifdef CONFIG_X86_64
+ return __builtin_ia32_readeflags_u64();
+#else
+ return __builtin_ia32_readeflags_u32();
+#endif
}

static __always_inline void native_irq_disable(void)
--
2.35.0.263.gb82422642f-goog

2022-02-07 08:33:44

by Bill Wendling

Subject: Re: [PATCH v2] x86: use builtins to read eflags

On Thu, Feb 3, 2022 at 4:16 PM Thomas Gleixner <[email protected]> wrote:
>
> On Tue, Dec 28 2021 at 18:12, Bill Wendling wrote:
> > GCC and Clang both have builtins to read and write the EFLAGS register.
> > This allows the compiler to determine the best way to generate this
> > code, which can improve code generation.
> >
> > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > chooses to be conservative in these situations, and so uses memory
> > instead of registers. This is a known issue, which is currently being
> > addressed.
> >
> > However, using builtins is benefiical in general, because it removes the
> > burden of determining what's the way to read the flags register from the
> > programmer and places it on to the compiler, which has the information
> > needed to make that decision. Indeed, this piece of code has had several
> > changes over the years, some of which were pinging back and forth to
> > determine the correct constraints to use.
> >
> > With this change, Clang generates better code:
> >
> > Original code:
> > movq $0, -48(%rbp)
> > #APP
> > # __raw_save_flags
> > pushfq
> > popq -48(%rbp)
> > #NO_APP
> > movq -48(%rbp), %rbx
> >
> > New code:
> > pushfq
> > popq %rbx
> > #APP
> >
> > Note that the stack slot in the original code is no longer needed in the
> > new code, saving a small amount of stack space.
>
> This still lacks any information about the effect on GCC. There is a
> world outside clang. It's not my job to validate that.
>
I never said you were the one required to validate that. I'm not sure
where you got that idea from. There of course is no change to GCC's
output. I updated the commit message and sent out v3. PTAL.

-bw

2022-02-07 14:59:59

by Thomas Gleixner

Subject: Re: [PATCH v2] x86: use builtins to read eflags

On Tue, Dec 28 2021 at 18:12, Bill Wendling wrote:
> GCC and Clang both have builtins to read and write the EFLAGS register.
> This allows the compiler to determine the best way to generate this
> code, which can improve code generation.
>
> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations, and so uses memory
> instead of registers. This is a known issue, which is currently being
> addressed.
>
> However, using builtins is benefiical in general, because it removes the
> burden of determining what's the way to read the flags register from the
> programmer and places it on to the compiler, which has the information
> needed to make that decision. Indeed, this piece of code has had several
> changes over the years, some of which were pinging back and forth to
> determine the correct constraints to use.
>
> With this change, Clang generates better code:
>
> Original code:
> movq $0, -48(%rbp)
> #APP
> # __raw_save_flags
> pushfq
> popq -48(%rbp)
> #NO_APP
> movq -48(%rbp), %rbx
>
> New code:
> pushfq
> popq %rbx
> #APP
>
> Note that the stack slot in the original code is no longer needed in the
> new code, saving a small amount of stack space.

This still lacks any information about the effect on GCC. There is a
world outside clang. It's not my job to validate that.

Thanks,

tglx

2022-02-09 10:02:34

by Nick Desaulniers

Subject: Re: [PATCH v3] x86: use builtins to read eflags

On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <[email protected]> wrote:
>
> GCC and Clang both have builtins to read and write the EFLAGS register.
> This allows the compiler to determine the best way to generate this
> code, which can improve code generation.
>
> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations, and so uses memory
> instead of registers. This is a known issue, which is currently being
> addressed.
>
> However, using builtins is benefiical in general, because it removes the

s/benefiical/beneficial/

> burden of determining what's the way to read the flags register from the
> programmer and places it on to the compiler, which has the information
> needed to make that decision. Indeed, this piece of code has had several
> changes over the years, some of which were pinging back and forth to
> determine the correct constraints to use.
>
> With this change, Clang generates better code:
>
> Original code:
> movq $0, -48(%rbp)
> #APP
> # __raw_save_flags
> pushfq
> popq -48(%rbp)
> #NO_APP
> movq -48(%rbp), %rbx
>
> New code:
> pushfq
> popq %rbx
> #APP

But it also forces frame pointers due to another bug in LLVM.
https://godbolt.org/z/6badWaGjo
For x86_64, we default to CONFIG_UNWINDER_ORC=y, not
CONFIG_UNWINDER_FRAME_POINTER=y. So this change would make us use
registers instead of stack slots (improvement), but then force frame
pointers when we probably didn't need or want them (deterioration) for
all released versions of clang.

I think we should fix https://reviews.llvm.org/D92695 first before I'd
be comfortable signing off on this kernel change. Again, I think we
should test out Phoebe's recommendation
https://reviews.llvm.org/D92695#inline-1086936
or do you already have a fix that I haven't yet been cc'ed on, perhaps?

>
> Note that the stack slot in the original code is no longer needed in the
> new code, saving a small amount of stack space.
>
> There is no change to GCC's ouput:
>
> Original code:
>
> # __raw_save_flags
> pushf ; pop %r13 # flags
>
> New code:
>
> pushfq
> popq %r13 # _23
>
> Signed-off-by: Bill Wendling <[email protected]>
> ---
> v3: - Add blurb indicating that GCC's output hasn't changed.
> v2: - Kept the original function to retain the out-of-line symbol.
> - Improved the commit message.
> - Note that I couldn't use Nick's suggestion of
>
> return IS_ENABLED(CONFIG_X86_64) ? ...
>
> because Clang complains about using __builtin_ia32_readeflags_u32 in
> 64-bit mode.
> ---
> arch/x86/include/asm/irqflags.h | 19 +++++--------------
> 1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 87761396e8cc0..f31a035f3c6a9 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -19,20 +19,11 @@
> extern inline unsigned long native_save_fl(void);
> extern __always_inline unsigned long native_save_fl(void)
> {
> - unsigned long flags;
> -
> - /*
> - * "=rm" is safe here, because "pop" adjusts the stack before
> - * it evaluates its effective address -- this is part of the
> - * documented behavior of the "pop" instruction.
> - */
> - asm volatile("# __raw_save_flags\n\t"
> - "pushf ; pop %0"
> - : "=rm" (flags)
> - : /* no input */
> - : "memory");
> -
> - return flags;
> +#ifdef CONFIG_X86_64
> + return __builtin_ia32_readeflags_u64();
> +#else
> + return __builtin_ia32_readeflags_u32();
> +#endif
> }
>
> static __always_inline void native_irq_disable(void)
> --
> 2.35.0.263.gb82422642f-goog
>


--
Thanks,
~Nick Desaulniers

2022-02-09 11:54:49

by David Laight

Subject: RE: [PATCH v3] x86: use builtins to read eflags

From: Nick Desaulniers
> Sent: 07 February 2022 22:12
>
> On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <[email protected]> wrote:
> >
> > GCC and Clang both have builtins to read and write the EFLAGS register.
> > This allows the compiler to determine the best way to generate this
> > code, which can improve code generation.
> >
> > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > chooses to be conservative in these situations, and so uses memory
> > instead of registers. This is a known issue, which is currently being
> > addressed.

How much performance would be lost by just using "=r"?

You get two instructions if the actual target is memory.
This might be a marginal code size increase - but not much,
It might also slow things down if the execution is limited
by the instruction decoder.

But on Intel cpu 'pop memory' is 2 uops, exactly the same
as 'pop register' 'store register' (and I think amd is similar).
So the actual execution time is exactly the same for both.

Also it looks like clang's builtin is effectively "=r".
Compiling:
long fl;
void native_save_fl(void) {
	fl = __builtin_ia32_readeflags_u64();
}
Not only generates a stack frame, it also generates:
pushf; pop %rax; mov mem, %rax.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-02-09 12:05:24

by Bill Wendling

Subject: Re: [PATCH v3] x86: use builtins to read eflags

On Tue, Feb 8, 2022 at 1:14 AM David Laight <[email protected]> wrote:
>
> From: Nick Desaulniers
> > Sent: 07 February 2022 22:12
> >
> > On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <[email protected]> wrote:
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
>
> How much performance would be lost by just using "=r"?
>
> You get two instructions if the actual target is memory.
> This might be a marginal code size increase - but not much,
> It might also slow things down if the execution is limited
> by the instruction decoder.
>
> But on Intel cpu 'pop memory' is 2 uops, exactly the same
> as 'pop register' 'store register' (and I think amd is similar).
> So the actual execution time is exactly the same for both.
>
> Also it looks like clang's builtin is effectively "=r".
> Compiling:
> long fl;
> void native_save_fl(void) {
> fl = __builtin_ia32_readeflags_u64();
> }
> Not only generates a stack frame, it also generates:
> pushf; pop %rax; mov mem, %rax.
>
It used to be "=r" (see f1f029c7bfbf4e), but was changed back to "=rm"
in ab94fcf528d127. This pinging back and forth is another reason to
use the builtins and be done with it.

-bw

2022-02-11 02:14:15

by Bill Wendling

Subject: [PATCH v4] x86: use builtins to read eflags

GCC and Clang both have builtins to read and write the EFLAGS register.
This allows the compiler to determine the best way to generate this
code, which can improve code generation.

This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations, and so uses memory
instead of registers. This is a known issue, which is currently being
addressed.

However, using builtins is beneficial in general, because it removes the
burden of determining the best way to read the flags register from the
programmer and places it onto the compiler, which has the information
needed to make that decision. Indeed, this piece of code has had several
changes over the years, some of which were pinging back and forth to
determine the correct constraints to use.

With this change, Clang generates better code:

Original code:
movq $0, -48(%rbp)
#APP
# __raw_save_flags
pushfq
popq -48(%rbp)
#NO_APP
movq -48(%rbp), %rbx

New code:
pushfq
popq %rbx
#APP

Note that the stack slot in the original code is no longer needed in the
new code, saving a small amount of stack space.

There is no change to GCC's output:

Original code:

# __raw_save_flags
pushf ; pop %r13 # flags

New code:

pushfq
popq %r13 # _23

Signed-off-by: Bill Wendling <[email protected]>
---
v4: - Clang no longer generates stack frames when using these builtins.
- Corrected misspellings.
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
- Improved the commit message.
- Note that I couldn't use Nick's suggestion of

return IS_ENABLED(CONFIG_X86_64) ? ...

because Clang complains about using __builtin_ia32_readeflags_u32 in
64-bit mode.
---
arch/x86/include/asm/irqflags.h | 19 +++++--------------
1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc..f31a035f3c6a 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,20 +19,11 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
- unsigned long flags;
-
- /*
- * "=rm" is safe here, because "pop" adjusts the stack before
- * it evaluates its effective address -- this is part of the
- * documented behavior of the "pop" instruction.
- */
- asm volatile("# __raw_save_flags\n\t"
- "pushf ; pop %0"
- : "=rm" (flags)
- : /* no input */
- : "memory");
-
- return flags;
+#ifdef CONFIG_X86_64
+ return __builtin_ia32_readeflags_u64();
+#else
+ return __builtin_ia32_readeflags_u32();
+#endif
}

static __always_inline void native_irq_disable(void)
--
2.35.1.265.g69c8d7142f-goog


2022-02-11 17:25:22

by David Laight

Subject: RE: [PATCH v4] x86: use builtins to read eflags

From: Bill Wendling
> Sent: 10 February 2022 22:32
>
> GCC and Clang both have builtins to read and write the EFLAGS register.
> This allows the compiler to determine the best way to generate this
> code, which can improve code generation.
>
> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations, and so uses memory
> instead of registers. This is a known issue, which is currently being
> addressed.
>
> However, using builtins is beneficial in general, because it removes the
> burden of determining what's the way to read the flags register from the
> programmer and places it on to the compiler, which has the information
> needed to make that decision.

Except that neither gcc nor clang attempt to make that decision.
They always do pushf; pop ax;

...
> v4: - Clang now no longer generates stack frames when using these builtins.
> - Corrected misspellings.

While clang 'head' has been fixed, it seems a bit premature to say
it is 'fixed' enough for all clang builds to use the builtin.

Seems better to change it (back) to "=r" and comment that this
is currently as good as __builtin_ia32_readeflags_u64() and that
clang makes a 'pig's breakfast' of "=rm" - which has only marginal
benefit.

Changing to __builtin_ia32_readeflags_u64() may be worth while
if/when the compilers will generate pushf; pop mem; for it.

David


2022-02-11 21:36:40

by Bill Wendling

Subject: Re: [PATCH v4] x86: use builtins to read eflags

On Fri, Feb 11, 2022 at 8:40 AM David Laight <[email protected]> wrote:
> From: Bill Wendling
> > Sent: 10 February 2022 22:32
> >
> > GCC and Clang both have builtins to read and write the EFLAGS register.
> > This allows the compiler to determine the best way to generate this
> > code, which can improve code generation.
> >
> > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > chooses to be conservative in these situations, and so uses memory
> > instead of registers. This is a known issue, which is currently being
> > addressed.
> >
> > However, using builtins is beneficial in general, because it removes the
> > burden of determining what's the way to read the flags register from the
> > programmer and places it on to the compiler, which has the information
> > needed to make that decision.
>
> Except that neither gcc nor clang attempt to make that decision.
> They always do pushf; pop ax;
>
It looks like both GCC and Clang pop into virtual registers. The
register allocator is then able to determine if it can allocate a
physical register or if a stack slot is required.

> ...
> > v4: - Clang now no longer generates stack frames when using these builtins.
> > - Corrected misspellings.
>
> While clang 'head' has been fixed, it seems a bit premature to say
> it is 'fixed' enough for all clang builds to use the builtin.
>
True, but it's been cherry-picked into the clang 14.0.0 branch, which
is scheduled for release in March.

> Seems better to change it (back) to "=r" and comment that this
> is currently as good as __builtin_ia32_readeflags_u64() and that
> clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> benefit.
>
That would be okay as far as code generation is concerned, but it does
place the burden of correctness back on the programmer. Also, it was
"=r" at some point, but was changed to "=rm" here. :-)

commit ab94fcf528d127fcb490175512a8910f37e5b346
Author: H. Peter Anvin <[email protected]>
Date: Tue Aug 25 16:47:16 2009 -0700

x86: allow "=rm" in native_save_fl()

This is a partial revert of f1f029c7bfbf4ee1918b90a431ab823bed812504.

"=rm" is allowed in this context, because "pop" is explicitly defined
to adjust the stack pointer *before* it evaluates its effective
address, if it has one. Thus, we do end up writing to the correct
address even if we use an on-stack memory argument.

The original reporter for f1f029c7bfbf4ee1918b90a431ab823bed812504 was
apparently using a broken x86 simulator.

[ Impact: performance ]

Signed-off-by: H. Peter Anvin <[email protected]>
Cc: Gabe Black <[email protected]>


> Changing to __builtin_ia32_readeflags_u64() may be worth while
> if/when the compilers will generate pushf; pop mem; for it.
>
I was able to come up with an example where GCC generates "pushf ; pop mem":

https://godbolt.org/z/9rocjdoaK

(Clang generates a variation of "pop mem," and the code is horrible,
but it's meant for demonstration purposes only.) One interesting thing
about the use of the builtins is that if at all possible, the "pop"
instruction may be moved away from the "pushf" if it's safe and would
reduce register pressure.

-bw

2022-02-12 16:30:53

by Bill Wendling

Subject: Re: [PATCH v4] x86: use builtins to read eflags

On Fri, Feb 11, 2022 at 4:24 PM Nick Desaulniers
<[email protected]> wrote:
> I would like to discuss this further off list with you Bill before you
> resend this patch again.

That won't be necessary.

-bw

2022-02-12 18:00:29

by David Laight

Subject: RE: [PATCH v4] x86: use builtins to read eflags

From: Bill Wendling
> Sent: 11 February 2022 19:26
>
> On Fri, Feb 11, 2022 at 8:40 AM David Laight <[email protected]> wrote:
> > From: Bill Wendling
> > > Sent: 10 February 2022 22:32
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
> > >
> > > However, using builtins is beneficial in general, because it removes the
> > > burden of determining what's the way to read the flags register from the
> > > programmer and places it on to the compiler, which has the information
> > > needed to make that decision.
> >
> > Except that neither gcc nor clang attempt to make that decision.
> > They always do pushf; pop ax;
> >
> It looks like both GCC and Clang pop into virtual registers. The
> register allocator is then able to determine if it can allocate a
> physical register or if a stack slot is required.

Doing:
int fl;
void f(void) { fl = __builtin_ia32_readeflags_u64(); }
Seems to use register.
If it pops to a virtual register it will probably never pop
into a real target location.

See https://godbolt.org/z/8aY8o8rhe

But performance wise the pop+mov is just one byte longer.
Instruction decode time might be higher for two instructions, but since
'pop mem' generates 2 uops (intel) it may be constrained to the first
decoder (I can't remember the exact details), but the separate pop+mov
can be decoded in parallel - so could end up faster.

Actual execution time (if that makes any sense) is really the same.
Two operations, one pop and one memory write.

I bet you'd be hard pressed to find a piece of code where it even made
a consistent difference.

> > ...
> > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > - Corrected misspellings.
> >
> > While clang 'head' has been fixed, it seems a bit premature to say
> > it is 'fixed' enough for all clang builds to use the builtin.
> >
> True, but it's been cherry-picked into the clang 14.0.0 branch, which
> is scheduled for release in March.
>
> > Seems better to change it (back) to "=r" and comment that this
> > is currently as good as __builtin_ia32_readeflags_u64() and that
> > clang makes a 'pigs breakfast' of "=rm" - which has only marginal
> > benefit.
> >
> That would be okay as far as code generation is concerned, but it does
> place the burden of correctness back on the programmer. Also, it was
> that at some point, but was changed to "=rm" here. :-)

As I said, a comment should stop the bounce.

...
> I was able to come up with an example where GCC generates "pushf ; pop mem":
>
> https://godbolt.org/z/9rocjdoaK
>
> (Clang generates a variation of "pop mem," and is horrible code, but
> it's meant for demonstration purposes only.) One interesting thing
> about the use of the builtins is that if at all possible, the "pop"
> instruction may be moved away from the "pushf" if it's safe and would
> reduce register pressure.

I wouldn't trust the compiler to get stack pointer relative accesses
right if it does move them apart.
Definitely scope for horrid bugs ;-)

David


2022-02-14 10:11:01

by Nick Desaulniers

Subject: Re: [PATCH v4] x86: use builtins to read eflags

On Fri, Feb 11, 2022 at 11:25 AM Bill Wendling <[email protected]> wrote:
>
> On Fri, Feb 11, 2022 at 8:40 AM David Laight <[email protected]> wrote:
> > From: Bill Wendling
> > > Sent: 10 February 2022 22:32
> > >
> > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > - Corrected misspellings.
> >
> > While clang 'head' has been fixed, it seems a bit premature to say
> > it is 'fixed' enough for all clang builds to use the builtin.
> >
> True, but it's been cherry-picked into the clang 14.0.0 branch, which
> is scheduled for release in March.

While I've requested a cherry pick, the auto PR has not yet been
merged into the release/14.x branch.
https://github.com/llvm/llvm-project/issues/46875#issuecomment-1035262333
https://github.com/llvmbot/llvm-project/pull/42

I share David's concern. We currently support clang-11+, see
Documentation/process/changes.rst

Users external to Google do not necessarily live at ToT LLVM; we MUST
support them. There are also multiple unwinders for x86 selectable at
kconfig time; this patch has implications for those different choices.

I would like to discuss this further off list with you Bill before you
resend this patch again. Please do not resend this patch again until
we've had the chance to speak more about it. I'm generally in favor
of the intent of this patch, but let's make sure this patch is the
best it can be for everyone, regardless of their compiler, compiler
version, or unwinder.
--
Thanks,
~Nick Desaulniers

2022-02-14 11:42:53

by Bill Wendling

Subject: Re: [PATCH v4] x86: use builtins to read eflags

On Fri, Feb 11, 2022 at 2:10 PM David Laight <[email protected]> wrote:
>
> From: Bill Wendling
> > Sent: 11 February 2022 19:26
> >
> > On Fri, Feb 11, 2022 at 8:40 AM David Laight <[email protected]> wrote:
> > > From: Bill Wendling
> > > > Sent: 10 February 2022 22:32
> > > >
> > > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > > This allows the compiler to determine the best way to generate this
> > > > code, which can improve code generation.
> > > >
> > > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > > chooses to be conservative in these situations, and so uses memory
> > > > instead of registers. This is a known issue, which is currently being
> > > > addressed.
> > > >
> > > > However, using builtins is beneficial in general, because it removes the
> > > > burden of determining the best way to read the flags register from the
> > > > programmer and places it on to the compiler, which has the information
> > > > needed to make that decision.
> > >
> > > Except that neither gcc nor clang attempt to make that decision.
> > > They always do pushf; pop ax;
> > >
> > It looks like both GCC and Clang pop into virtual registers. The
> > register allocator is then able to determine if it can allocate a
> > physical register or if a stack slot is required.
>
> Doing:
> int fl;
> void f(void) { fl = __builtin_ia32_readeflags_u64(); }
> Seems to use register.
> If it pops to a virtual register it will probably never pop
> into a real target location.
>
> See https://godbolt.org/z/8aY8o8rhe
>
Yes, it does produce the appropriate code. What I meant was that,
internal to the compiler, the code that's generated before register
allocation contains "virtual" registers. I.e., registers that would
exist if the machine had an infinite number of registers to use. It's
the job of the register allocator to replace those virtual registers
with physical ones, or with spills to memory if no registers are
available. My (completely made up) example was to show that in an
extreme case where no registers are available, and no amount of code
motion can alleviate the register pressure, both clang and gcc
will produce the appropriate spills to memory.

> But performance wise the pop+mov is just one byte longer.
> Instruction decode time might be higher for two instructions, but since
> 'pop mem' generates 2 uops (intel) it may be constrained to the first
> decoder (I can't remember the exact details), but the separate pop+mov
> can be decoded in parallel - so could end up faster.
>
It's the spill to memory that I'm trying to avoid here. I'm not
concerned about a "pop ; mov" combination (though even that is one
instruction too many except in the most extreme cases).

> Actual execution time (if that makes any sense) is really the same.
> Two operations, one pop and one memory write.
>
> I bet you'd be hard pressed to find a piece of code where it even made
> a consistent difference.
>
> > > ...
> > > > v4: - Clang now no longer generates stack frames when using these builtins.
> > > > - Corrected misspellings.
> > >
> > > While clang 'head' has been fixed, it seems a bit premature to say
> > > it is 'fixed' enough for all clang builds to use the builtin.
> > >
> > True, but it's been cherry-picked into the clang 14.0.0 branch, which
> > is scheduled for release in March.
> >
> > > Seems better to change it (back) to "=r" and comment that this
> > > is currently as good as __builtin_ia32_readeflags_u64() and that
> > > clang makes a 'pig's breakfast' of "=rm" - which has only marginal
> > > benefit.
> > >
> > That would be okay as far as code generation is concerned, but it does
> > place the burden of correctness back on the programmer. Also, it was
> > "=r" at some point, but was changed to "=rm" here. :-)
>
> As I said, a comment should stop the bounce.
>
The constraints on this piece of code went from "=g" to "=r" to "=rm".
Which is the correct one and why? The last will apparently work in all
situations, because the compiler's left to determine the output
destination: register or memory. The "=r" constraint may fail if a
register cannot be allocated for some reason.

> ...
> > I was able to come up with an example where GCC generates "pushf ; pop mem":
> >
> > https://godbolt.org/z/9rocjdoaK
> >
> > (Clang generates a variation of "pop mem," and it's horrible code, but
> > it's meant for demonstration purposes only.) One interesting thing
> > about the use of the builtins is that if at all possible, the "pop"
> > instruction may be moved away from the "pushf" if it's safe and would
> > reduce register pressure.
>
> I wouldn't trust the compiler to get stack pointer relative accesses
> right if it does move them apart.
> Definitely scope for horrid bugs ;-)
>
Compilers are pretty good at knowing such things and doing it correctly.

-bw

2022-02-15 00:52:19

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v3] x86: use builtins to read eflags

On Tue, Feb 8, 2022 at 1:14 AM David Laight <[email protected]> wrote:
>
> From: Nick Desaulniers
> > Sent: 07 February 2022 22:12
> >
> > On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <[email protected]> wrote:
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint. Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
>
> How much performance would be lost by just using "=r"?

It adds undue register pressure. When native_save_fl() is inlined
into callers, requiring an operand to be in a register clobbers it, so
we may need to insert stack spills + reloads for values produced
before but consumed after the asm stmt if we run out of registers to
allocate for the live range.

Looking at the disassembly of native_save_fl() or calls to
__builtin_ia32_readeflags_u64() in isolation are less interesting than
what happens once native_save_fl() starts getting inlined into
callers.

>
> You get two instructions if the actual target is memory.
> This might be a marginal code size increase - but not much,
> It might also slow things down if the execution is limited
> by the instruction decoder.

"=rm" vs "=r" is more so about whether the operands to the asm are
required to be in memory or not, rather than number of instructions
generated.

>
> But on Intel cpu 'pop memory' is 2 uops, exactly the same
> as 'pop register' 'store register' (and I think amd is similar).
> So the actual execution time is exactly the same for both.

Page 10 of Agner Fog's
https://www.agner.org/optimize/instruction_tables.pdf
shows that a pop to memory is 3 "ops" while pop to register is 2. Page
9 defines "ops" as

Number of macro-operations issued from instruction decoder to
schedulers. Instructions with more than 2 macro-operations use
microcode.

Internally, it looks like we have benchmarks that show that
kmem_cache_free spends 5.18% of cycles in that pop-to-memory
instruction (looks like this is CONFIG_SLAB). Further,
native_save_fl() is called by local_irq_save() all over the kernel.
After this patch, the hot spot disappears from the profile (the
push+pop becomes 0.003% of samples from 5.18%).

So I think with those above two notes, that should answer the question
"why is any of this important?"

Further, page 10 also notes that push to memory is 2 "ops" and push to
register is 1, but that doesn't make a difference in the profiles.

Additionally, I compare the disassembly of kmem_cache_free
(CONFIG_SLAB) before+after this patch. There is no difference for GCC,
but you can see quite an improvement for clang:
https://gist.github.com/nickdesaulniers/7f143bae821d86603294bf00b8bd5074
This is the kind of info I would give Thomas when he asks about
changes to GCC codegen.

To the point I made last Friday about support for actually released
versions of clang, here's how v4 of the patch makes clang codegen
worse. This is ToT llvm with f3481f43bbe2c8a24e74210d310cf3be291bf52d
reverted. (That's the commit that fixed the forcibly inserted stack
frame).
https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67

Specifically, CONFIG_UNWINDER_FRAME_POINTER=y is improved;
CONFIG_UNWINDER_ORC=y is made worse (for all released versions of
clang, since they don't yet have
f3481f43bbe2c8a24e74210d310cf3be291bf52d).

Therefore, v5 of this patch should have something like:

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index f31a035f3c6a..2d2d4d8ab823 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,7 +19,12 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
-#ifdef CONFIG_X86_64
+/* clang would force frame pointers via the builtins until clang-14 */
+#if defined(CONFIG_CC_IS_CLANG) && defined(CONFIG_UNWINDER_ORC) && CONFIG_CLANG_VERSION < 140000
+ unsigned long flags;
+ asm volatile ("pushf; pop %0" : "=rm"(flags));
+ return flags;
+#elif defined(CONFIG_X86_64)
return __builtin_ia32_readeflags_u64();
#else
return __builtin_ia32_readeflags_u32();

(though I would keep the existing code and comment for the complex
case, rather than blindly applying my full diff above; and I would
wait for https://github.com/llvmbot/llvm-project/pull/42 to actually
be merged before claiming clang was fixed. oh, looks like it just was.
I'd add `Link: https://github.com/llvm/llvm-project/issues/46875`
explicitly to the commit message.)

Then once clang-14 is the minimally supported version of clang
according to Documentation/process/changes.rst we can drop the
existing inline asm and simply retain the builtins.

Thomas also asked in review of v1 that additional info about previous
commits be included in the commit message.
https://lore.kernel.org/llvm/87o85faum5.ffs@tglx/

Finally, if we need exhaustive proof that the builtins are better for
codegen, even with GCC, I think this case demonstrates so perfectly:
https://godbolt.org/z/5n3Eov1xT

Because compilers treat inline asm as black boxes, they can't schedule
instructions in the inline asm with instructions lowered from the
surrounding C statements. With the compiler builtins, they now can,
because the assembler isn't involved. Perhaps noting that in the
commit message would help improve the first paragraph of the commit
message of v4.

>
> Also it looks like clang's builtin is effectively "=r".

It's not; it's closer to how "=rm" should work; add register pressure
and you may find a memory operand rather than a register operand. But
the builtin is resolved by different machinery than the inline asm.
The inline asm is conservative in the presence of both r and m and
picks m to avoid undue register pressure; r and m together is treated
as "r or m" with a priority for m. In this case, we'd prefer r's
codegen when possible.

I read "=rm" as "r" or "m"; we'd prefer if llvm chose "r" for most
cases, but it will choose "m" conservatively which is legal by the
"or". "r" would be more precise for what we want. I'm not sure if
there are configs that could exhaust the register allocator though if
"m" was not specified as an option.

AFAICT
commit ab94fcf528d1 ("x86: allow "=rm" in native_save_fl()")
does not cite register exhaustion as a reason to permit "m", so
perhaps the version from
commit f1f029c7bfbf ("x86: fix assembly constraints in native_save_fl()")
would be preferred. That said, who knows how "r" will affect codegen
elsewhere, which is why my suggested diff above retains "=rm" as the
constraint for the fallback code. "No functional change" for released
versions of clang.

> Compiling:
> long fl;
> void native_save_fl(void) {
> fl = __builtin_ia32_readeflags_u64();
> }
> Not only generates a stack frame, it also generates:
> pushf; pop %rax; mov mem, %rax.

https://github.com/llvm/llvm-project/issues/46875

--
Thanks,
~Nick Desaulniers

2022-02-15 02:40:00

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v4] x86: use builtins to read eflags

On Sat, Feb 12, 2022 at 1:23 AM Bill Wendling <[email protected]> wrote:
>
> On Fri, Feb 11, 2022 at 4:24 PM Nick Desaulniers
> <[email protected]> wrote:
> > I would like to discuss this further off list with you Bill before you
> > resend this patch again.
>
> That won't be necessary.

Bill,
I'm sorry; I could have worded that better+differently.

In code review, I tend to use phrases like "consider doing X" for
suggestions that are merely suggestions; things I don't care about.

I should have been more explicit that my concern regarding released
versions of clang and CONFIG_UNWINDER_ORC=y was not one of those
cases.

I like the intent of the patch; with the following fixups, I would
gladly sign off on it and encourage the x86 maintainers to consider it
further.
1. handle the case of released versions of clang and
CONFIG_UNWINDER_ORC=y. Something with a grep'able comment for clang-14
so we can clean that up when the minimally supported version of clang
is bumped.
2. update the commit message to refer to previous commits that
modified these flags, as per Thomas in response to v1.
3. Note no change in generated code from GCC, as per Thomas in
response to v2. Perhaps test a few more configs, since I only checked
kmem_cache_free, while native_save_fl and local_irq_save have MANY more
call sites.
4. Add `Link: https://github.com/llvm/llvm-project/issues/46875` to
commit message.

Optionally:
5. Add notes to commit message about the intent of this patch
improving profiles for kmem_cache_free for CONFIG_SLAB=y. Feel free to
use numbers I cited from our internal bug and
https://lore.kernel.org/lkml/CAKwvOdmXxmYgqEOT4vSbN60smSkQRcc3B5duQAhaaYoaDo=Riw@mail.gmail.com/.
6. Add note to commit message about memory operands from Agner Fog's
instruction tables in the above lore link.
7. Add note to commit that compilers can schedule the generated
instructions from the builtins among instructions from surrounding C
statements, which they cannot do for inline asm, as demonstrated by
https://godbolt.org/z/EsP1KTjxa.

Actually 7 is probably more precise than 3. Since there's no change
generally in terms of operands chosen AFAICT, but GCC (and clang)
could now better schedule instructions.

It's still a good patch; please stick with it! Let me know how I can help.
--
Thanks,
~Nick Desaulniers

2022-03-03 00:04:27

by Bill Wendling

[permalink] [raw]
Subject: [PATCH v5] x86: use builtins to read eflags

This issue arose due to Clang's issue with the "=rm" constraint. Clang
chooses to be conservative in these situations and always uses memory
instead of registers, resulting in sub-optimal code. (This is a known
issue, which is currently being addressed.)

This function has gone through numerous changes over the years:

- The original version of this function used the "=g" constraint,
which has the following description:

Any register, memory or immediate integer operand is allowed,
except for registers that are not general registers.

- This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
constraints in native_save_fl()"), because someone noticed that:

the offset of the flags variable from the stack pointer will
change when the pushf is performed. gcc doesn't attempt to
understand that fact, and address used for pop will still be the
same. It will write to somewhere near flags on the stack but not
actually into it and overwrite some other value.

- However, commit f1f029c7bfbf ("x86: fix assembly constraints in
native_save_fl()") was partially reverted in commit ab94fcf528d1
("x86: allow "=rm" in native_save_fl()"), because the original
reporter of the issue was using a broken x86 simulator. The
justification for this change was:

"=rm" is allowed in this context, because "pop" is explicitly
defined to adjust the stack pointer *before* it evaluates its
effective address, if it has one. Thus, we do end up writing to
the correct address even if we use an on-stack memory argument.

Clang generates good code when the builtins are used. On one benchmark,
a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
a memory address to 0.13% popping to a register. This benefit is
magnified given that this code is inlined in numerous places in the
kernel.

The builtins also help GCC. They allow GCC (and Clang) to reduce register
pressure and, consequently, register spills by rescheduling
instructions. This can't happen with instructions in inline assembly,
because compilers view inline assembly blocks as "black boxes," whose
instructions can't be rescheduled.

Another benefit of builtins over asm blocks is that compilers are able
to make more precise inlining decisions, since they no longer need to
rely on imprecise measures based on newline counts.

A trivial example demonstrates this code motion.

void y(void);
unsigned long x(void) {
unsigned long v = __builtin_ia32_readeflags_u64();
y();
return v;
}

GCC at -O1:
pushq %rbx
pushfq
popq %rbx
movl $0, %eax
call y
movq %rbx, %rax
popq %rbx
ret

GCC at -O2:
pushq %r12
pushfq
xorl %eax, %eax
popq %r12
call y
movq %r12, %rax
popq %r12
ret

Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
Link: https://github.com/llvm/llvm-project/issues/20571
Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
Link: https://godbolt.org/z/5n3Eov1xT
Signed-off-by: Bill Wendling <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
---
v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
GCC.
v4: - Clang no longer generates stack frames when using these builtins.
- Corrected misspellings.
v3: - Add blurb indicating that GCC's output hasn't changed.
v2: - Kept the original function to retain the out-of-line symbol.
- Improved the commit message.
- Note that I couldn't use Nick's suggestion of

return IS_ENABLED(CONFIG_X86_64) ? ...

because Clang complains about using __builtin_ia32_readeflags_u32 in
64-bit mode.
---
arch/x86/include/asm/irqflags.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 87761396e8cc..2eded855f6ab 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,6 +19,11 @@
extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
+#if defined(CONFIG_CC_IS_CLANG) && defined(CONFIG_UNWINDER_ORC) && CONFIG_CLANG_VERSION < 140000
+ /*
+ * Clang forced frame pointers via the builtins until Clang-14. Use
+ * this as a fall-back until the minimum Clang version is >= 14.0.
+ */
unsigned long flags;

/*
@@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
: "memory");

return flags;
+#elif defined(CONFIG_X86_64)
+ return __builtin_ia32_readeflags_u64();
+#else
+ return __builtin_ia32_readeflags_u32();
+#endif
}

static __always_inline void native_irq_disable(void)
--
2.35.1.574.g5d30c73bfb-goog

2022-03-16 12:42:15

by Bill Wendling

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Tue, Mar 1, 2022 at 12:19 PM Bill Wendling <[email protected]> wrote:
>
Bump for review.

-bw

> This issue arose due to Clang's issue with the "=rm" constraint. Clang
> chooses to be conservative in these situations and always uses memory
> instead of registers, resulting in sub-optimal code. (This is a known
> issue, which is currently being addressed.)
>
> This function has gone through numerous changes over the years:
>
> - The original version of this function used the "=g" constraint,
> which has the following description:
>
> Any register, memory or immediate integer operand is allowed,
> except for registers that are not general registers.
>
> - This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
> constraints in native_save_fl()"), because someone noticed that:
>
> the offset of the flags variable from the stack pointer will
> change when the pushf is performed. gcc doesn't attempt to
> understand that fact, and address used for pop will still be the
> same. It will write to somewhere near flags on the stack but not
> actually into it and overwrite some other value.
>
> - However, commit f1f029c7bfbf ("x86: fix assembly constraints in
> native_save_fl()") was partially reverted in commit ab94fcf528d1
> ("x86: allow "=rm" in native_save_fl()"), because the original
> reporter of the issue was using a broken x86 simulator. The
> justification for this change was:
>
> "=rm" is allowed in this context, because "pop" is explicitly
> defined to adjust the stack pointer *before* it evaluates its
> effective address, if it has one. Thus, we do end up writing to
> the correct address even if we use an on-stack memory argument.
>
> Clang generates good code when the builtins are used. On one benchmark,
> a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
> a memory address to 0.13% popping to a register. This benefit is
> magnified given that this code is inlined in numerous places in the
> kernel.
>
> The builtins also help GCC. It allows GCC (and Clang) to reduce register
> pressure and, consequently, register spills by rescheduling
> instructions. It can't happen with instructions in inline assembly,
> because compilers view inline assembly blocks as "black boxes," whose
> instructions can't be rescheduled.
>
> Another benefit of builtins over asm blocks is that compilers are able
> to make more precise inlining decisions, since they no longer need to
> rely on imprecise measures based on newline counts.
>
> A trivial example demonstrates this code motion.
>
> void y(void);
> unsigned long x(void) {
> unsigned long v = __builtin_ia32_readeflags_u64();
> y();
> return v;
> }
>
> GCC at -O1:
> pushq %rbx
> pushfq
> popq %rbx
> movl $0, %eax
> call y
> movq %rbx, %rax
> popq %rbx
> ret
>
> GCC at -O2:
> pushq %r12
> pushfq
> xorl %eax, %eax
> popq %r12
> call y
> movq %r12, %rax
> popq %r12
> ret
>
> Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
> Link: https://github.com/llvm/llvm-project/issues/20571
> Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
> Link: https://godbolt.org/z/5n3Eov1xT
> Signed-off-by: Bill Wendling <[email protected]>
> Reviewed-by: Nick Desaulniers <[email protected]>
> ---
> v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
> GCC.
> v4: - Clang now no longer generates stack frames when using these builtins.
> - Corrected misspellings.
> v3: - Add blurb indicating that GCC's output hasn't changed.
> v2: - Kept the original function to retain the out-of-line symbol.
> - Improved the commit message.
> - Note that I couldn't use Nick's suggestion of
>
> return IS_ENABLED(CONFIG_X86_64) ? ...
>
> because Clang complains about using __builtin_ia32_readeflags_u32 in
> 64-bit mode.
> ---
> arch/x86/include/asm/irqflags.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 87761396e8cc..2eded855f6ab 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -19,6 +19,11 @@
> extern inline unsigned long native_save_fl(void);
> extern __always_inline unsigned long native_save_fl(void)
> {
> +#if defined(CONFIG_CC_IS_CLANG) && defined(CONFIG_UNWINDER_ORC) && CONFIG_CLANG_VERSION < 140000
> + /*
> + * Clang forced frame pointers via the builtins until Clang-14. Use
> + * this as a fall-back until the minimum Clang version is >= 14.0.
> + */
> unsigned long flags;
>
> /*
> @@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
> : "memory");
>
> return flags;
> +#elif defined(CONFIG_X86_64)
> + return __builtin_ia32_readeflags_u64();
> +#else
> + return __builtin_ia32_readeflags_u32();
> +#endif
> }
>
> static __always_inline void native_irq_disable(void)
> --
> 2.35.1.574.g5d30c73bfb-goog
>

2022-03-17 01:57:31

by Bill Wendling

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Mon, Mar 14, 2022 at 6:08 PM H. Peter Anvin <[email protected]> wrote:
>
> My only concern is this: how does this affect the placement of these sequences in relation to the things they need to protect?
>
With clang (and I assume it's similar with gcc), the
__builtin_ia32_readeflags_u{64|32} builtins specify that the EFLAGS
register is used (and the stack pointer is modified). Instructions
that may change EFLAGS are marked as defining EFLAGS, and clang and
gcc are careful not to move instructions that access EFLAGS past them.

One change you may see due to this patch is the compiler moving the
"pop %..." instruction away from the "pushf" instruction. This could
happen if the compiler determines that it could produce better code by
doing so---e.g. to reduce register pressure. The "gcc -O2" example
below shows this code movement.

-bw

> On March 14, 2022 4:07:23 PM PDT, Bill Wendling <[email protected]> wrote:
>>
>> On Tue, Mar 1, 2022 at 12:19 PM Bill Wendling <[email protected]> wrote:
>>>
>>>
>> Bump for review.
>>
>> -bw
>>
>>> This issue arose due to Clang's issue with the "=rm" constraint. Clang
>>> chooses to be conservative in these situations and always uses memory
>>> instead of registers, resulting in sub-optimal code. (This is a known
>>> issue, which is currently being addressed.)
>>>
>>> This function has gone through numerous changes over the years:
>>>
>>> - The original version of this function used the "=g" constraint,
>>> which has the following description:
>>>
>>> Any register, memory or immediate integer operand is allowed,
>>> except for registers that are not general registers.
>>>
>>> - This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
>>> constraints in native_save_fl()"), because someone noticed that:
>>>
>>> the offset of the flags variable from the stack pointer will
>>> change when the pushf is performed. gcc doesn't attempt to
>>> understand that fact, and address used for pop will still be the
>>> same. It will write to somewhere near flags on the stack but not
>>> actually into it and overwrite some other value.
>>>
>>> - However, commit f1f029c7bfbf ("x86: fix assembly constraints in
>>> native_save_fl()") was partially reverted in commit ab94fcf528d1
>>> ("x86: allow "=rm" in native_save_fl()"), because the original
>>> reporter of the issue was using a broken x86 simulator. The
>>> justification for this change was:
>>>
>>> "=rm" is allowed in this context, because "pop" is explicitly
>>> defined to adjust the stack pointer *before* it evaluates its
>>> effective address, if it has one. Thus, we do end up writing to
>>> the correct address even if we use an on-stack memory argument.
>>>
>>> Clang generates good code when the builtins are used. On one benchmark,
>>> a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
>>> a memory address to 0.13% popping to a register. This benefit is
>>> magnified given that this code is inlined in numerous places in the
>>> kernel.
>>>
>>> The builtins also help GCC. It allows GCC (and Clang) to reduce register
>>> pressure and, consequently, register spills by rescheduling
>>> instructions. It can't happen with instructions in inline assembly,
>>> because compilers view inline assembly blocks as "black boxes," whose
>>> instructions can't be rescheduled.
>>>
>>> Another benefit of builtins over asm blocks is that compilers are able
>>> to make more precise inlining decisions, since they no longer need to
>>> rely on imprecise measures based on newline counts.
>>>
>>> A trivial example demonstrates this code motion.
>>>
>>> void y(void);
>>> unsigned long x(void) {
>>> unsigned long v = __builtin_ia32_readeflags_u64();
>>> y();
>>> return v;
>>> }
>>>
>>> GCC at -O1:
>>> pushq %rbx
>>> pushfq
>>> popq %rbx
>>> movl $0, %eax
>>> call y
>>> movq %rbx, %rax
>>> popq %rbx
>>> ret
>>>
>>> GCC at -O2:
>>> pushq %r12
>>> pushfq
>>> xorl %eax, %eax
>>> popq %r12
>>> call y
>>> movq %r12, %rax
>>> popq %r12
>>> ret
>>>
>>> Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
>>> Link: https://github.com/llvm/llvm-project/issues/20571
>>> Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
>>> Link: https://godbolt.org/z/5n3Eov1xT
>>> Signed-off-by: Bill Wendling <[email protected]>
>>> Reviewed-by: Nick Desaulniers <[email protected]>
>>> ---
>>> v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
>>> GCC.
>>> v4: - Clang now no longer generates stack frames when using these builtins.
>>> - Corrected misspellings.
>>> v3: - Add blurb indicating that GCC's output hasn't changed.
>>> v2: - Kept the original function to retain the out-of-line symbol.
>>> - Improved the commit message.
>>> - Note that I couldn't use Nick's suggestion of
>>>
>>> return IS_ENABLED(CONFIG_X86_64) ? ...
>>>
>>> because Clang complains about using __builtin_ia32_readeflags_u32 in
>>> 64-bit mode.
>>> ---
>>> arch/x86/include/asm/irqflags.h | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
>>> index 87761396e8cc..2eded855f6ab 100644
>>> --- a/arch/x86/include/asm/irqflags.h
>>> +++ b/arch/x86/include/asm/irqflags.h
>>> @@ -19,6 +19,11 @@
>>> extern inline unsigned long native_save_fl(void);
>>> extern __always_inline unsigned long native_save_fl(void)
>>> {
>>> +#if defined(CONFIG_CC_IS_CLANG) && defined(CONFIG_UNWINDER_ORC) && CONFIG_CLANG_VERSION < 140000
>>> + /*
>>> + * Clang forced frame pointers via the builtins until Clang-14. Use
>>> + * this as a fall-back until the minimum Clang version is >= 14.0.
>>> + */
>>> unsigned long flags;
>>>
>>> /*
>>> @@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
>>> : "memory");
>>>
>>> return flags;
>>> +#elif defined(CONFIG_X86_64)
>>> + return __builtin_ia32_readeflags_u64();
>>> +#else
>>> + return __builtin_ia32_readeflags_u32();
>>> +#endif
>>> }
>>>
>>> static __always_inline void native_irq_disable(void)
>>> --
>>> 2.35.1.574.g5d30c73bfb-goog
>>>

2022-03-17 18:55:12

by Nick Desaulniers

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 8:43 AM H. Peter Anvin <[email protected]> wrote:
>
> On March 15, 2022 12:19:40 AM PDT, Bill Wendling <[email protected]> wrote:
> >On Mon, Mar 14, 2022 at 6:08 PM H. Peter Anvin <[email protected]> wrote:
> >>
> >> My only concern is this: how does this affect the placement of these sequences in relation to the things they need to protect?
> >>
> >With clang (and I assume it's similar with gcc), the
> >__builtin_ia32_readeflags_u{64|32} builtins specify that the EFLAGS
> >register is used (and the stack pointer is modified). Instructions
> >that may change EFLAGS are marked as defining EFLAGS, and clang and
> >gcc are careful not to move instructions that access EFLAGS past them.
> >
> >One change you may see due to this patch is the compiler moving the
> >"pop %..." instruction away from the "pushf" instruction. This could
> >happen if the compiler determines that it could produce better code by
> >doing so---e.g. to reduce register pressure. The "gcc -O2" example
> >below shows this code movement.
> >
> >-bw
> >
> >> On March 14, 2022 4:07:23 PM PDT, Bill Wendling <[email protected]> wrote:
> >>>
> >>> On Tue, Mar 1, 2022 at 12:19 PM Bill Wendling <[email protected]> wrote:
> >>>>
> >>>>
> >>>> Clang generates good code when the builtins are used. On one benchmark,
> >>>> a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
> >>>> a memory address to 0.13% popping to a register. This benefit is
> >>>> magnified given that this code is inlined in numerous places in the
> >>>> kernel.
> >>>>
> >>>> The builtins also help GCC. It allows GCC (and Clang) to reduce register
> >>>> pressure and, consequently, register spills by rescheduling
> >>>> instructions. It can't happen with instructions in inline assembly,
> >>>> because compilers view inline assembly blocks as "black boxes," whose
> >>>> instructions can't be rescheduled.
> >>>>
> >>>> Another benefit of builtins over asm blocks is that compilers are able
> >>>> to make more precise inlining decisions, since they no longer need to
> >>>> rely on imprecise measures based on newline counts.
> >>>>
> >>>> A trivial example demonstrates this code motion.
> >>>>
> >>>> void y(void);
> >>>> unsigned long x(void) {
> >>>> unsigned long v = __builtin_ia32_readeflags_u64();
> >>>> y();
> >>>> return v;
> >>>> }
> >>>>
> >>>> GCC at -O1:
> >>>> pushq %rbx
> >>>> pushfq
> >>>> popq %rbx
> >>>> movl $0, %eax
> >>>> call y
> >>>> movq %rbx, %rax
> >>>> popq %rbx
> >>>> ret
> >>>>
> >>>> GCC at -O2:
> >>>> pushq %r12
> >>>> pushfq
> >>>> xorl %eax, %eax
> >>>> popq %r12
> >>>> call y
> >>>> movq %r12, %rax
> >>>> popq %r12
> >>>> ret
> >>>>
> EFLAGS is a mishmash of different things, only some of which are visible to the compiler, and the semantics of which are totally different.
>
> Changing IF, changing DF, changing AC, and changing the arithmetic flags have *enormously* different properties. The compiler can't know what the semantics of a particular instance are, at least without being explicitly told (by giving it some kind of mask of bits that could change).

EFLAGS is defined as the lower 16 bits of the FLAGS register, yeah? Aren't
IF, DF, and AC all within that range of bits? Yes; there's no way to
express at a finer grain resolution that you only care to read/write an
individual bitfield and not the whole register; EFLAGS is the finest
range. Even if the compiler determined, via the uses of the result of
reading eflags, that only a particular bitfield was needed, I'm pretty
sure there are no instructions in x86 for reading these individual
fields. So I don't understand your point; the finest grain resolution
the compiler has to work with is the whole EFLAGS register, not
individual bits like IF, DF, or AC. (I triple-checked this with the
maintainer of LLVM's x86 backend).

> The memory barrier is needed for IF changes, for example.

Shouldn't native_irq_disable/native_irq_enable be declaring that they
clobber "cc" then?
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers

> This feels like "let's fix LLVM by hacking the kernel in dangerous ways" once again!

Even if the behavior of llvm was changed, you'd _still_ get better
codegen for _both_ compilers by using the intrinsics.

Another issue is that when multiple constraints are used, the order
they're selected in is unspecified.
https://gcc.gnu.org/onlinedocs/gcc/Constraints.html#Constraints if you
want to look.
I could ask GCC folks if they want to specify that behavior; doing so
does take away some freedom from implementations, so they may choose
not to.

> We rely on "memory" as compiler barriers *all over the place*.

Ah, your concern is about removing the explicit "memory"
clobber/compiler barrier. Now your earlier question "how does this
affect the placement of these sequences in relation to the things they
need to protect?" makes sense. Is the concern a call to
native_save_fl() being re-ordered with respect to a call to
native_irq_disable()/native_irq_enable()? Or do you have a call site
of native_save_fl() that you have concrete concerns about?

> This is especially so since this appears to be a suboptimal code generation issue and not a correctness issue; your change is likely to promote the former (underoptimizing) to the latter (overoptimizing.)

How so? Are you going to ignore the example posted above (`x()`)? Can
you help me understand via code and disassembly that can demonstrate
your claim?
--
Thanks,
~Nick Desaulniers

2022-03-17 19:49:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

[ I got cc'd in the middle of the discussion, so I might be missing
some context or pointing something out that was already discussed ]

On Thu, Mar 17, 2022 at 11:00 AM Nick Desaulniers
<[email protected]> wrote:
>
> > >One change you may see due to this patch is the compiler moving the
> > >"pop %..." instruction away from the "pushf" instruction. This could
> > >happen if the compiler determines that it could produce better code by
> > >doing so---e.g. to reduce register pressure. The "gcc -O2" example
> > >below shows this code movement.

Honestly, that part worries me a _lot_.

Why?

Because clang in particular has already messed up eflags handling
once, by spilling condition codes (in the form of eflags) onto the
stack, and then loading them later with a "popf".

And it did so across a function call THAT MODIFIED 'IF'. This was a
major bug in clang back in 2015 or so, and made it unusable for the
kernel.

See for example

https://lore.kernel.org/all/CA+icZUU7y5ATSLV_0TGzi5m5deWADLmAMBkAT32FKGyUWNSJSA@mail.gmail.com/

for some kernel discussion, and separately

https://lists.llvm.org/pipermail/llvm-dev/2015-July/088774.html

for just llvm discussions.

It is perhaps telling that the LLVM discussion I found seems to talk
more about the performance impact, not about the fact that THE
GENERATED CODE WAS WRONG.

That compiler bug would basically enable or disable interrupts in
random places - because clang developers thought that 'eflags' is only
about the condition codes.

> EFLAGS is defined as the lower 16 bits of FLAGS register, yeah?

No. EFLAGS is a 32-bit register.

A lot of the high bits end up being reserved, but not all of them. AC,
VIF/VIP are all in the upper 16 bits.

Also:

> So I don't understand your point; the finest grain resolution
> the compiler has to work with is the whole EFLAGS register, not
> individual bits like IF, DF, or AC. (I triple checked this with the
> maintainer of LLVM's x86 backend)

You can actually operate on EFLAGS at multiple granularities.

- normal pushf/popf. Don't do it unless you are doing system software.

- You *can* (but you really *really* should not) operate on only the
lower 16 bits, with "popfw". Don't do it - it doesn't help, requires
an operand size override, and is just bad.

- you can use lahf/sahf to load/store only the arithmetic flags
into/from AH. Deprecated, and going away, but historically supported.

- you can obviously use arithmetic and setcc to modify/read the
arithmetic flags properly.

A compiler should NEVER think that it can access eflags natively with
pushf/popf. Clang thought it could do so, and clang was WRONG.

The absolutely only case that a compiler should ever worry about and
handle is that last case above: arithmetic instructions that set the
flags, and 'jcc', 'setcc', 'cmov', 'adc' etc. that test them.

Yes, yes, that complete mental breakdown with pushf/popf did get
fixed, but it really makes me very wary of thinking that we should
ever use a built-in that compiler writers really fundamentally got so
wrong before.

What would make me think that you'd get it right now? In user space,
you'll basically never actually see the whole system flags issues, so
your test-cases would never work or be very contrived. You'd have to
really work at it to see the problems.

> > The memory barrier is needed for IF changes, for example.
>
> Shouldn't native_irq_disable/native_irq_enable be declaring that they
> clobber "cc" then?
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers

No. Once again: compilers don't understand system flags in eflags.

In fact, gcc pretty much documents that the "cc" clobber doesn't do
anything on x86, and isn't needed. It just assumes that the arithmetic
flags always get clobbered, because on x86 that's pretty much the
case. They have sometimes been added for documentation purposes in the
kernel, but that's all they are.

So you can find that "cc" clobber in a few places in the kernel, but
it's pointless.

The reason pushf/popf need to have a memory clobber is so that the
compiler will not move it around other things that have memory
clobbers.

Because memory clobbers are things that compilers *UNDERSTAND*.

Put another way: when you see that "memory" clobber in an inline asm,
don't think that it means "this modifies or uses memory".

That may be the documentation, and that may be how compiler people
think about them, but that's just wrong.

What *kernel* people think is "this makes the compiler not do stupid things".

Because compilers really don't understand things like system registers
and system flags, and the subtle issues they have, and the ordering
requirements they have. Never have, and never will.

And compilers really don't understand - or care about - things like
"cc", because compiler people think it's about arithmetic flags, and
would do things like "oh, I'll just save and restore the flags over
this since the asm modifies it".

Note that "eflags" is just the tip of the iceberg here. This is true
for a lot of other cases.

Yes, when you have something like a "disable interrupts", the compiler
really must not move memory accesses around it, because the interrupt
flag may in fact be protecting that memory access from becoming a
deadlock.

But the whole "you can't move _other_ things that you don't even
understand around this either" is equally important. A "disable
interrupts" could easily be protecting a "read and modify a CPU MSR
value" too - no real "memory" access necessarily involved, but
"memory" is the only way we can tell you "don't move this".

Particularly since both the clang people and gcc people have actively
been trying to weaken the "volatile" semantics (which would have
generally been the logically consistent model - thinking of inline asm
as being visible in the machine model, and not being able to move them
around for those reasons).

> > This feels like "let's fix LLVM by hacking the kernel in dangerous ways" once again!
>
> Even if the behavior of llvm was changed, you'd _still_ get better
> codegen for _both_ compilers by using the intrinsics.

No.

"moving instructions around a bit" is not better code generation when
the end result DOES NOT WORK.

And honestly, we have absolutely zero reason to believe that it would work.

If it turns out that "pop to memory" is a big deal for performance,
let's just change the "=rm" to "=r" and be done with it. Problem
solved. We already used to do that on the "pushf" side, but these days
we avoid "popfl" entirely and do a conditional jump over a "sti"
instead.

Because maybe moving that "pop" by a few instructions helps a bit,
but if you move the "pushf" noticeably, you've just created a big big
problem.

A bug you've had once already.

Linus

2022-03-17 20:40:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On March 15, 2022 12:19:40 AM PDT, Bill Wendling <[email protected]> wrote:
>On Mon, Mar 14, 2022 at 6:08 PM H. Peter Anvin <[email protected]> wrote:
>>
>> My only concern is this: how does this affect the placement of these sequences in relation to the things they need to protect?
>>
>With clang (and I assume it's similar with gcc), the
>__builtin_ia32_readeflags_u{64|32} builtins specify that the EFLAGS
>register is used (and the stack pointer is modified). Instructions
>that may change EFLAGS are marked as defining EFLAGS, and clang and
>gcc are careful not to move instructions that access EFLAGS past them.
>
>One change you may see due to this patch is the compiler moving the
>"pop %..." instruction away from the "pushf" instruction. This could
>happen if the compiler determines that it could produce better code by
>doing so---e.g. to reduce register pressure. The "gcc -O2" example
>below shows this code movement.
>
>-bw
>
>> On March 14, 2022 4:07:23 PM PDT, Bill Wendling <[email protected]> wrote:
>>>
>>> On Tue, Mar 1, 2022 at 12:19 PM Bill Wendling <[email protected]> wrote:
>>>>
>>>>
>>> Bump for review.
>>>
>>> -bw
>>>
>>>> This issue arose due to Clang's issue with the "=rm" constraint. Clang
>>>> chooses to be conservative in these situations and always uses memory
>>>> instead of registers, resulting in sub-optimal code. (This is a known
>>>> issue, which is currently being addressed.)
>>>>
>>>> This function has gone through numerous changes over the years:
>>>>
>>>> - The original version of this function used the "=g" constraint,
>>>> which has the following description:
>>>>
>>>> Any register, memory or immediate integer operand is allowed,
>>>> except for registers that are not general registers.
>>>>
>>>> - This was changed to "=r" in commit f1f029c7bfbf ("x86: fix assembly
>>>> constraints in native_save_fl()"), because someone noticed that:
>>>>
>>>> the offset of the flags variable from the stack pointer will
>>>> change when the pushf is performed. gcc doesn't attempt to
>>>> understand that fact, and address used for pop will still be the
>>>> same. It will write to somewhere near flags on the stack but not
>>>> actually into it and overwrite some other value.
>>>>
>>>> - However, commit f1f029c7bfbf ("x86: fix assembly constraints in
>>>> native_save_fl()") was partially reverted in commit ab94fcf528d1
>>>> ("x86: allow "=rm" in native_save_fl()"), because the original
>>>> reporter of the issue was using a broken x86 simulator. The
>>>> justification for this change was:
>>>>
>>>> "=rm" is allowed in this context, because "pop" is explicitly
>>>> defined to adjust the stack pointer *before* it evaluates its
>>>> effective address, if it has one. Thus, we do end up writing to
>>>> the correct address even if we use an on-stack memory argument.
>>>>
>>>> Clang generates good code when the builtins are used. On one benchmark,
>>>> a hotspot in kmem_cache_free went from using 5.18% of cycles popping to
>>>> a memory address to 0.13% popping to a register. This benefit is
>>>> magnified given that this code is inlined in numerous places in the
>>>> kernel.
>>>>
>>>> The builtins also help GCC. It allows GCC (and Clang) to reduce register
>>>> pressure and, consequently, register spills by rescheduling
>>>> instructions. It can't happen with instructions in inline assembly,
>>>> because compilers view inline assembly blocks as "black boxes," whose
>>>> instructions can't be rescheduled.
>>>>
>>>> Another benefit of builtins over asm blocks is that compilers are able
>>>> to make more precise inlining decisions, since they no longer need to
>>>> rely on imprecise measures based on newline counts.
>>>>
>>>> A trivial example demonstrates this code motion.
>>>>
>>>> void y(void);
>>>> unsigned long x(void) {
>>>> unsigned long v = __builtin_ia32_readeflags_u64();
>>>> y();
>>>> return v;
>>>> }
>>>>
>>>> GCC at -O1:
>>>> pushq %rbx
>>>> pushfq
>>>> popq %rbx
>>>> movl $0, %eax
>>>> call y
>>>> movq %rbx, %rax
>>>> popq %rbx
>>>> ret
>>>>
>>>> GCC at -O2:
>>>> pushq %r12
>>>> pushfq
>>>> xorl %eax, %eax
>>>> popq %r12
>>>> call y
>>>> movq %r12, %rax
>>>> popq %r12
>>>> ret
>>>>
>>>> Link: https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67
>>>> Link: https://github.com/llvm/llvm-project/issues/20571
>>>> Link: https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints
>>>> Link: https://godbolt.org/z/5n3Eov1xT
>>>> Signed-off-by: Bill Wendling <[email protected]>
>>>> Reviewed-by: Nick Desaulniers <[email protected]>
>>>> ---
>>>> v5: - Incorporate Nick's suggestion to limit the change to Clang >= 14.0 and
>>>> GCC.
>>>> v4: - Clang now no longer generates stack frames when using these builtins.
>>>> - Corrected misspellings.
>>>> v3: - Add blurb indicating that GCC's output hasn't changed.
>>>> v2: - Kept the original function to retain the out-of-line symbol.
>>>> - Improved the commit message.
>>>> - Note that I couldn't use Nick's suggestion of
>>>>
>>>> return IS_ENABLED(CONFIG_X86_64) ? ...
>>>>
>>>> because Clang complains about using __builtin_ia32_readeflags_u32 in
>>>> 64-bit mode.
>>>> ---
>>>> arch/x86/include/asm/irqflags.h | 10 ++++++++++
>>>> 1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
>>>> index 87761396e8cc..2eded855f6ab 100644
>>>> --- a/arch/x86/include/asm/irqflags.h
>>>> +++ b/arch/x86/include/asm/irqflags.h
>>>> @@ -19,6 +19,11 @@
>>>> extern inline unsigned long native_save_fl(void);
>>>> extern __always_inline unsigned long native_save_fl(void)
>>>> {
>>>> +#if defined(CC_IS_CLANG) && defined(UNWINDER_ORC) && CLANG_VERSION < 140000
>>>> + /*
>>>> + * Clang forced frame pointers via the builtins until Clang-14. Use
>>>> + * this as a fall-back until the minimum Clang version is >= 14.0.
>>>> + */
>>>> unsigned long flags;
>>>>
>>>> /*
>>>> @@ -33,6 +38,11 @@ extern __always_inline unsigned long native_save_fl(void)
>>>> : "memory");
>>>>
>>>> return flags;
>>>> +#elif defined(CONFIG_X86_64)
>>>> + return __builtin_ia32_readeflags_u64();
>>>> +#else
>>>> + return __builtin_ia32_readeflags_u32();
>>>> +#endif
>>>> }
>>>>
>>>> static __always_inline void native_irq_disable(void)
>>>> --
>>>> 2.35.1.574.g5d30c73bfb-goog
>>>>

EFLAGS is a mishmash of different things, only some of which are visible to the compiler, and the semantics of which are totally different.

Changing IF, changing DF, changing AC, and changing the arithmetic flags have *enormously* different properties. The compiler can't know what the semantics of a particular instance are, at least without being explicitly told (by giving it some kind of mask of bits that could change). The memory barrier is needed for IF changes, for example.

This feels like "let's fix LLVM by hacking the kernel in dangerous ways" once again! We rely on "memory" as compiler barriers *all over the place*. This is especially so since this appears to be a suboptimal code generation issue and not a correctness issue; your change is likely to promote the former (underoptimizing) to the latter (overoptimizing.)

To whatever extent my vote matters these days:

Nacked-by: H. Peter Anvin (Intel) <[email protected]>


2022-03-17 20:44:59

by Bill Wendling

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 11:52 AM Linus Torvalds
<[email protected]> wrote:
>
> [ I got cc'd in the middle of the discussion, so I might be missing
> some context or pointing something out that was already discussed ]
>
> On Thu, Mar 17, 2022 at 11:00 AM Nick Desaulniers
> <[email protected]> wrote:
> >
> > > >One change you may see due to this patch is the compiler moving the
> > > >"pop %..." instruction away from the "pushf" instruction. This could
> > > >happen if the compiler determines that it could produce better code by
> > > >doing so---e.g. to reduce register pressure. The "gcc -O2" example
> > > >below shows this code movement.
>
> Honestly, that part worries me a _lot_.
>
> Why?
>
> Because clang in particular has already messed up eflags handling
> once, by spilling condition codes (in the form of eflags) onto the
> stack, and then loading them later with a "popf".
>
> And it did so across a function call THAT MODIFIED 'IF'. This was a
> major bug in clang back in 2015 or so, and made it unusable for the
> kernel.
>
> See for example
>
> https://lore.kernel.org/all/CA+icZUU7y5ATSLV_0TGzi5m5deWADLmAMBkAT32FKGyUWNSJSA@mail.gmail.com/
>
> for some kernel discussion, and separately
>
> https://lists.llvm.org/pipermail/llvm-dev/2015-July/088774.html
>
> for just llvm discussions.
>
I can't account for LLVM's past sins. That particular issue was fixed,
of course. This change will generate "pushf/popf" (actually just
"pushf" here) only when explicitly used.

> It is perhaps telling that the LLVM discussion I found seems to talk
> more about the performance impact, not about the fact that THE
> GENERATED CODE WAS WRONG.
>
I think that's a bit unfair. There seemed to be a general consensus
that the code was wrong and needed to be fixed. However, the compiler
is also expected to generate the most performant code that it can.

> That compiler bug would basically enable or disable interrupts in
> random places - because clang developers thought that 'eflags' is only
> about the condition codes.
>
> > EFLAGS is defined as the lower 16 bits of FLAGS register, yeah?
>
> No. EFLAGS is a 32-bit register.
>
> A lot of the high bits end up being reserved, but not all of them. AC,
> VIF/VIP are all in the upper 16 bits.
>
> Also:
>
> > So I don't understand your point; the finest grain resolution
> > the compiler has to work with is the whole EFLAGS register, not
> > individual bits like IF, DF, or AC. (I triple checked this with the
> > maintainer of LLVM's x86 backend)
>
> You can actually operate on EFLAGS at multiple granularities.
>
> - normal pushf/popf. Don't do it unless you are doing system software.
>
> - You *can* (but you really *really* should not) operate on only the
> lower 16 bits, with "popfw". Don't do it - it doesn't help, requires
> an operand size override, and is just bad.
>
> - you can use lahf/sahf to load/store only the arithmetic flags
> into/from AH. Deprecated, and going away, but historically supported.
>
> - you can obviously use arithmetic and setcc to modify/read the
> arithmetic flags properly.
>
> A compiler should NEVER think that it can access eflags natively with
> pushf/popf. Clang thought it could do so, and clang was WRONG.
>
> The absolutely only case that a compiler should ever worry about and
> handle is that last case above: arithmetic instructions that set the
> flags, and 'jcc', 'setcc', 'cmov', 'adc' etc. that test them.
>
> Yes, yes, that complete mental breakdown with pushf/popf did get
> fixed, but it really makes me very wary of thinking that we should
> ever use a built-in that compiler writers really fundamentally got so
> wrong before.
>
> What would make me think that you'd get it right now? In user space,
> you'll basically never actually see the whole system flags issues, so
> your test-cases would never work or be very contrived. You'd have to
> really work at it to see the problems.
>
As pointed out above, the LLVM people agreed with your sentiment here:
never generate "pushf/popf" unless explicitly told to do so. This
patch isn't trying to walk that back; it just allows LLVM and GCC to
generate the best instruction sequence when "pushf/popf" is explicitly
asked for.

> > > The memory barrier is needed for IF changes, for example.
> >
> > Shouldn't native_irq_disable/native_irq_enable be declaring that they
> > clobber "cc" then?
> > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers
>
> No. Once again: compilers don't understand system flags in eflags.
>
> In fact, gcc pretty much documents that the "cc" clobber doesn't do
> anything on x86, and isn't needed. It just assumes that the arithmetic
> flags always get clobbered, because on x86 that's pretty much the
> case. They have sometimes been added for documentation purposes in the
> kernel, but that's all they are.
>
> So you can find that "cc" clobber in a few places in the kernel, but
> it's pointless.
>
> The reason pushf/popf need to have a memory clobber is so that the
> compiler will not move it around other things that have memory
> clobbers.
>
> Because memory clobbers are things that compilers *UNDERSTAND*.
>
> Put another way: when you see that "memory" clobber in an inline asm,
> don't think that it means "this modifies or uses memory".
>
> That may be the documentation, and that may be how compiler people
> think about them, but that's just wrong.
>
> What *kernel* people think is "this makes the compiler not do stupid things".
>
> Because compilers really don't understand things like system registers
> and system flags, and the subtle issues they have, and the ordering
> requirements they have. Never have, and never will.
>
> And compilers really don't understand - or care about - things like
> "cc", because compiler people think it's about arithmetic flags, and
> would do things like "oh, I'll just save and restore the flags over
> this since the asm modifies it".
>
> Note that "eflags" is just the tip of the iceberg here. This is true
> for a lot of other cases.
>
> Yes, when you have something like a "disable interrupts", the compiler
> really must not move memory accesses around it, because the interrupt
> flag may in fact be protecting that memory access from becoming a
> deadlock.
>
> But the whole "you can't move _other_ things that you don't even
> understand around this either" is equally important. A "disable
> interrupts" could easily be protecting a "read and modify a CPU MSR
> value" too - no real "memory" access necessarily involved, but
> "memory" is the only way we can tell you "don't move this".
>
And yet that's not guaranteed. From
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:

```
Note that the compiler can move even volatile asm instructions
relative to other code, including across jump instructions. For
example, on many targets there is a system register that controls the
rounding mode of floating-point operations. Setting it with a volatile
asm statement, as in the following PowerPC example, does not work
reliably.

asm volatile("mtfsf 255, %0" : : "f" (fpenv));
sum = x + y;

The solution is to reference the "sum" variable in the asm:

asm volatile ("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
```

Note that the solution _isn't_ to add a "memory" clobber, because it's
not guaranteed to work, as it's explicitly defined to be a read/write
_memory_ barrier, despite what kernel writers wish it would do.

Your assertion that compilers don't know about control registers isn't
exactly true. In the case of "pushf/popf", those instructions know
about the eflags register. All subsequent instructions that read or
modify eflags also know about it. In essence, the compiler can
determine its own clobber list, which includes MSRs.

-bw

> Particularly since both the clang people and gcc people have actively
> been trying to weaken the "volatile" semantics (which would have
> generally been the logically consistent model - thinking of inline asm
> as being visible in the machine model, and not being able to move them
> around for those reasons).
>
> > > This feels like "let's fix LLVM by hacking the kernel in dangerous ways" once again!
> >
> > Even if the behavior of llvm was changed, you'd _still_ get better
> > codegen for _both_ compilers by using the intrinsics.
>
> No.
>
> "moving instructions around a bit" is not better code generation when
> the end result DOES NOT WORK.
>
> And honestly, we have absolutely zero reason to believe that it would work.
>
> If it turns out that "pop to memory" is a big deal for performance,
> let's just change the "=rm" to "=r" and be done with it. Problem
> solved. We already used to do that on the "pushf" side, but these days
> we avoid "popfl" entirely and do a conditional jump over a "sti"
> instead.
>
> Because maybe moving that "pop" by a few instructions helps a bit,
> but if you move the "pushf" noticeably, you've just created a big big
> problem.
>
> A bug you've had once already.
>
> Linus

2022-03-17 20:51:11

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

* Linus Torvalds:

> You can actually operate on EFLAGS at multiple granularities.
>
> - normal pushf/popf. Don't do it unless you are doing system software.

There's one exception: PUSHF/twiddle/POPF/PUSHF/compare is the
recommended sequence to detect CPUID support on i386 (in userspace and
elsewhere).

> - you can use lahf/sahf to load/store only the arithmetic flags
> into/from AH. Deprecated, and going away, but historically supported.

And these instructions were missing from the original long mode, but
they were added back.

> Yes, yes, that complete mental breakdown with pushf/popf did get
> fixed, but it really makes me very wary of thinking that we should
> ever use a built-in that compiler writers really fundamentally got so
> wrong before.
>
> What would make me think that you'd get it right now? In user space,
> you'll basically never actually see the whole system flags issues, so
> your test-cases would never work or be very contrived. You'd have to
> really work at it to see the problems.

I think, given the nature of that kind of bug, it does not matter
whether you use a compiler builtin to access the flags (to put their
combined value into a general-purpose register).

GCC doesn't have barriers in the built-ins (if we are talking about
__builtin_ia32_readeflags_u64 and __builtin_ia32_writeeflags_u64). I
expect they are actually pretty useless, and were merely added for
completeness of the intrinsics headers.

It's not that you can write

unsigned a, b, c;
// …
c = a + b;

and examine __builtin_ia32_readeflags_u64() to see if there was an
overflow. Neither GCC nor Clang model the EFLAGS register and
arithmetic expression side effects to make this possible.

Thanks,
Florian

2022-03-17 20:51:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 12:45 PM Bill Wendling <[email protected]> wrote:
>
> On Thu, Mar 17, 2022 at 11:52 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > But the whole "you can't move _other_ things that you don't even
> > understand around this either" is equally important. A "disable
> > interrupts" could easily be protecting a "read and modify a CPU MSR
> > value" too - no real "memory" access necessarily involved, but
> > "memory" is the only way we can tell you "don't move this".
> >
> And yet that's not guaranteed. From
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:

That's my point exactly.

The 'volatile' part of 'asm volatile' is almost meaningless.

As a result, we mark pretty much all system instructions as being
memory clobbers, because that actually works.

Whether they actually clobber memory or not is immaterial, and is not
why we do it.

> Note that the solution _isn't_ to add a "memory" clobber, because it's
> not guaranteed to work, as it's explicitly defined to be a read/write
> _memory_ barrier, despite what kernel writers wish it would do.

The solution you quote *ALSO* doesn't work, because they used a
pointless example that was made-up in order to get to that solution.

Nobody cares about an operation being ordered wrt an addition.

Mostly kernel people care about an operation being ordered wrt
something that the compiler DOES NOT KNOW ABOUT, and there is no way
to actually tell the compiler, exactly because the compiler has no
effin idea about it.

But the other thing kernel people care about is ordering those
operations wrt externally visible things - like modifying memory. So
an MSR write (or a write to a register like %CR0) may not itself
directly read or modify memory at all, but there are other reasons why
it might need to be ordered with any memory operations around it
anyway, because some of those memory operations may be indirectly
relevant (ie maybe they are page table writes and you just changed the
page table pointer in %CR3, and now - even if you don't access the
particular memory location, speculation may cause TLB fills to happen
at any time).

You can't tell the compiler "this eflags operation must be ordered wrt
this MSR write" - because even if the compiler knows about eflags, it
doesn't know about things like page table contents or specific MSR
bits, or whatever.

In a perfect world, we could actually enumerate resources we cared
about somehow. But that world is not the world we live in.

So we end up basically having exactly *ONE* resource we can use as a
"the compiler knows about this, and lets us use it as a
synchronization point".

That one resource is "memory". You may not like it, but you have
absolutely zero realistic alternatives, do you?

> Your assertion that compilers don't know about control registers isn't
> exactly true. In the case of "pushf/popf", those instructions know
> about the eflags registers. All subsequent instructions that read or
> modify eflags also know about it. In essence, the compiler can
> determine its own clobber list, which include MSRs.

Nope.

I think you are thinking of the arm64 MSR's. As far as I know, no
compiler out there - and certainly not the complete set of compilers
we support - know a thing about x86 msr registers. It's all inline
asm.

And honestly, no sane person would _want_ a compiler worrying about x86 MSR's.

A compiler should do a good job at the basic instruction set, and then
give us the escapes for the unusual cases.

Stop believing that a compiler can deal with every part of system
programming and that everything should be intrinsics. Because you
don't have all the intrinsics we need anyway.

Linus

2022-03-17 21:07:50

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 1:13 PM Florian Weimer <[email protected]> wrote:
>
> * Linus Torvalds:
>
> > You can actually operate on EFLAGS at multiple granularities.
> >
> > - normal pushf/popf. Don't do it unless you are doing system software.
>
> There's one exception: PUSHF/twiddle/POPF/PUSHF/compare is the
> recommended sequence to detect CPUID support on i386 (in userspace and
> elsewhere).

Yeah.

I do think that hand-crafted sequences using pushf/popf work. But I
think they should be in one single inline asm statement.

Obviously the kernel use of

asm volatile("# __raw_save_flags\n\t"
"pushf ; pop %0"
: "=rm" (flags)
: /* no input */
: "memory");

is exactly that (and yes, I can well believe that we should make "=rm"
be "=r", or at least show that the "m" case is much more expensive in
some way).

Is it optimal that we put the push/pop right next to each other? No.
But it avoids a *lot* of problems.

And is that "memory" clobber because it modifies the memory location
just below the current stack pointer?

No, not really - outside the kernel that might be an issue, but we
already have to build the kernel with -mno-red-zone, so if the
compiler uses that memory location, that would be a *HUGE* compiler
bug already.

So the "memory" clobber has absolutely nothing to do with the fact
that 'pushf' updates the stack pointer, writes to that location, and
the popf then undoes it.

It's literally because we don't want the compiler to move non-spill
memory accesses around it (or other asm statements with memory
clobbers), regardless of the fact that the sequence doesn't really
read or write memory in any way that is relevant to the compiler.

> > - you can use lahf/sahf to load/store only the arithmetic flags
> > into/from AH. Deprecated, and going away, but historically supported.
>
> And these instructions were missing from the original long mode, but
> they were added back.

They're also actively being deprecated, so the "adding back" ends up
being (maybe) temporary.

> GCC doesn't have barriers in the built-ins (if we are talking about
> __builtin_ia32_readeflags_u64 and __builtin_ia32_writeeflags_u64). I
> expect they are actually pretty useless, and were merely added for
> completeness of the intrinsics headers.

Yeah, without any kind of ordering guarantee, I think those builtins
are builtins in name only. They might as well return a random
value - they're not *useful*, because they don't have any defined
behavior.

I mean, we *could* certainly use "read eflags" in the kernel, and yes,
in theory it would be lovely if we didn't have to encode it as a
"pushf/pop" sequence, and the compiler tracked the stack pointer for
us, and perhaps combined it with other stack pointer changes to the
point where the "popf" would never happen, it would just undo the %rsp
change at function exit time.

So yes, a builtin can improve code generation.

But if there is no documented barrier guarantees, how do we know that
the read of the eflags the compiler does doesn't migrate around the
code that sets IF or whatever?

So then we'd need compiler intrinsics for 'sti' and 'cli' too. And
then we'd need to have a way to describe the ordering requirements for
*those* wrt all the other things we do (ie locks or whatever other
code that need to be protected from interrupts).

None of which exists.

And even if it ends up existing in newer compiler versions, we'd have
to wait for the better part of a decade (or more) for that to have
percolated everywhere.

And even _then_, the builtin syntax is often easily clumsier and less
flexible than what we get with inline asm (I'm thinking of the
fundamentally mis-designed memory ordering intrinsics).

Linus

2022-03-17 21:44:16

by Bill Wendling

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 1:14 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Mar 17, 2022 at 12:45 PM Bill Wendling <[email protected]> wrote:
> >
> > On Thu, Mar 17, 2022 at 11:52 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > But the whole "you can't move _other_ things that you don't even
> > > understand around this either" is equally important. A "disable
> > > interrupts" could easily be protecting a "read and modify a CPU MSR
> > > value" too - no real "memory" access necessarily involved, but
> > > "memory" is the only way we can tell you "don't move this".
> > >
> > And yet that's not guaranteed. From
> > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:
>
> That's my point exactly.
>
> The 'volatile' part of 'asm volatile' is almost meaningless.
>
> As a result, we mark pretty much all system instructions as being
> memory clobbers, because that actually works.
>
For now.

> Whether they actually clobber memory or not is immaterial, and is not
> why we do it.
>
I understand that. My point is that it's not a guarantee that the
compiler won't change in the future.

> > Note that the solution _isn't_ to add a "memory" clobber, because it's
> > not guaranteed to work, as it's explicitly defined to be a read/write
> > _memory_ barrier, despite what kernel writers wish it would do.
>
> The solution you quote *ALSO* doesn't work, because they used a
> pointless example that was made-up in order to get to that solution.
>
> Nobody cares about an operation being ordered wrt an addition.
>
It's a short example to demonstrate what they meant. I doubt it was
meant to be definitive. And I'm not suggesting that it's a general
solution to the "we want the ordering to be this and don't change it!"
issue. I'm just pointing out that even the current code isn't
guaranteed to remain ordered.

> Mostly kernel people care about an operation being ordered wrt
> something that the compiler DOES NOT KNOW ABOUT, and there is no way
> to actually tell the compiler, exactly because the compiler has no
> effin idea about it.
>
> But the other thing kernel people care about is ordering those
> operations wrt externally visible things - like modifying memory. So
> an MSR write (or a write to a register like %CR0) may not itself
> directly read or modify memory at all, but there are other reasons why
> it might need to be ordered with any memory operations around it
> anyway, because some of those memory operations may be indirectly
> relevant (ie maybe they are page table writes and you just changed the
> page table pointer in %CR3, and now - even if you don't access the
> particular memory location, speculation may cause TLB fills to happen
> at any time).
>
> You can't tell the compiler "this eflags operation must be ordered wrt
> this MSR write" - because even if the compiler knows about eflags, it
> doesn't know about things like page table contents or specific MSR
> bits, or whatever.
>
> In a perfect world, we could actually enumerate resources we cared
> about somehow. But that world is not the world we live in.
>
> So we end up basically having exactly *ONE* resource we can use as a
> "the compiler knows about this, and lets us use it as a
> synchronization point".
>
> That one resource is "memory". You may not like it, but you have
> absolutely zero realistic alternatives, do you?
>
> > Your assertion that compilers don't know about control registers isn't
> > exactly true. In the case of "pushf/popf", those instructions know
> > about the eflags registers. All subsequent instructions that read or
> > modify eflags also know about it. In essence, the compiler can
> > determine its own clobber list, which includes MSRs.
>
> Nope.
>
> I think you are thinking of the arm64 MSR's. As far as I know, no
> compiler out there - and certainly not the complete set of compilers
> we support - know a thing about x86 msr registers. It's all inline
> asm.
>
> And honestly, no sane person would _want_ a compiler worrying about x86 MSR's.
>
It's true that a compiler won't take a seemingly unrelated resource
(MSR) into account ("unrelated" only in that the ordering is important
to the programmer, but the compiler doesn't see an interaction and
thus may move it). There was a discussion on that topic here:

https://lore.kernel.org/lkml/CAMj1kXFnUuWLyy5q-fAV1jwZobTCNHqhKSN3mF98frsJ4ai4Ow@mail.gmail.com/T/#md85de238571dfe2bb4e4a822e6c983fb7a5ecf60

I don't think a conclusion was reached about preventing a reordering.

-bw

2022-03-17 21:53:50

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 2:10 PM Bill Wendling <[email protected]> wrote:
> >
> > As a result, we mark pretty much all system instructions as being
> > memory clobbers, because that actually works.
>
> For now.

No. Forever.

If you change the compiler to ignore memory clobbers in inline asms,
we'll stop using clang again.

This is not some kind of "threat". This is literally just a plain fact.

If you turn your compiler into garbage, we can't use it.

End of discussion.

> > Whether they actually clobber memory or not is immaterial, and is not
> > why we do it.
>
> I understand that. My point is that it's not a guarantee that the
> compiler won't change in the future.

YES IT DAMN WELL IS.

If I have an inline asm thing, and I tell the compiler "this inline
asm reads or writes memory in ways you don't understand", and you then
move externally visible memory operations around it - or move other
inline asms that do the same thing around it - then your compiler is
simply not a compiler any more.

IT IS BROKEN SHIT.

See?

That memory clobber is not a "please mister compiler, can you pretty
please take this into account".

That memory clobber is a "if you don't take this into account, you are
now no longer a working compiler, and thank Gods we have
alternatives".

This is not a "in ten years things can change" kind of issue. This is
a very fundamental and simple thing.

I don't understand why you can't just admit that.

This is as simple as 2+2 being 4. That's black and white.

There is no "the compiler might optimize it to be 3 at some future date".

Linus

2022-03-17 22:00:53

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 2:06 PM Andrew Cooper <[email protected]> wrote:
>
> I was idly playing with this builtin in response to the thread, and
> found that GCC has a far more serious bug than Clang had.

Heh. Yeah, that's bad.

But it kind of fits my argument: compiler intrinsics aren't
necessarily such a wonderful thing. They hardly get any testing at all
unless they are some very very core thing.

I will personally *always* prefer some generic "escape clause" over a
compiler that thinks it knows better.

I think compiler people should see inline asm as a really *good*
thing, because it allows people to have that "I'm going to do
something that nobody normal does, look away now" kind of escape
clause, letting the compiler concentrate on the things it does best.

Yes, I realize inline asm isn't something compiler people love. And
yes, I realize that it might not give optimal code.

But think of it a bit like casts - it adds complexity to the language,
and it might make some optimizations harder or even impossible - but
it makes the language much more powerful and allows you to do things
that you simply couldn't do without it.

There's a reason people use C instead of Pascal or FORTRAN. Those
casts, and that pointer arithmetic - the bane of many optimizations -
are really really good things.

Intrinsics should generally be shunned, not seen as a "this is an
alternative to inline asm". They should only exist for situations
where you can show "look, I can do so much better", and have a
test-suite for it and a real-life case where it gives you a clear and
undeniable uplift on a meaningful load.

(And here I separate 'intrinsic' and generic compiler builtins. I love
the builtins that allow me to tell the compiler something, or the
compiler to tell me something about the code: __builtin_unreachable()
and __builtin_constant_p() are both wonderful examples of those kinds
of things)

But an intrinsic for some odd target-specific thing that can't even be
portably generalized? Give me inline asm any day.

We can do things with inline asms that the compiler can't even _dream_
of, like the whole instruction rewriting thing we use.

I'd much rather have a powerful generic concept (eg "asm goto" is
lovely) than a specialized intrinsic that does only one thing, and
doesn't even do it well because it's untested and badly defined.

Linus

2022-03-17 22:03:26

by Bill Wendling

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 2:21 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Mar 17, 2022 at 2:10 PM Bill Wendling <[email protected]> wrote:
> > >
> > > As a result, we mark pretty much all system instructions as being
> > > memory clobbers, because that actually works.
> >
> > For now.
>
> No. Forever.
>
> If you change the compiler to ignore memory clobbers in inline asms,
> we'll stop using clang again.
>
> This is not some kind of "threat". This is literally just a plain fact.
>
> If you turn your compiler into garbage, we can't use it.
>
> End of discussion.
>
> > > Whether they actually clobber memory or not is immaterial, and is not
> > > why we do it.
> >
> > I understand that. My point is that it's not a guarantee that the
> > compiler won't change in the future.
>
> YES IT DAMN WELL IS.
>
> If I have an inline asm thing, and I tell the compiler "this inline
> asm reads or writes memory in ways you don't understand", and you then
> move externally visible memory operations around it - or move other
> inline asms that do the same thing around it - then your compiler is
> simply not a compiler any more.
>
> IT IS BROKEN SHIT.
>
> See?
>
> That memory clobber is not a "please mister compiler, can you pretty
> please take this into account".
>
> That memory clobber is a "if you don't take this into account, you are
> now no longer a working compiler, and thank Gods we have
> alternatives".
>
> This is not a "in ten years things can change" kind of issue. This is
> a very fundamental and simple thing.
>
> I don't understand why you can't just admit that.
>
> This is as simple as 2+2 being 4. That's black and white.
>
> There is no "the compiler might optimize it to be 3 at some future date".
>
I'm NOT saying that it WILL change or that it SHOULD change. I'm also
not saying that your concern isn't justified. What I am saying is that
unless you're using a compiler feature that's DEFINED as having a
certain effect, then you are not using that feature correctly,
regardless of how it's acted in the past. And it has the potential to
bite you in the ass sooner or later. We've all seen such things happen
before.

I'm not thick and you don't have to yell about this. We're on the same
page! :-) It would be much much better if there was a feature that was
defined to act in the way you want it to. But I'm not offering one at
the moment or even with the change that started this thread.

-bw

2022-03-17 23:31:19

by Segher Boessenkool

Subject: Re: [PATCH v5] x86: use builtins to read eflags

Hi!

On Thu, Mar 17, 2022 at 12:45:30PM -0700, Bill Wendling wrote:
> On Thu, Mar 17, 2022 at 11:52 AM Linus Torvalds
> <[email protected]> wrote:
> > It is perhaps telling that the LLVM discussion I found seems to talk
> > more about the performance impact, not about the fact that THE
> > GENERATED CODE WAS WRONG.
> >
> I think that's a bit unfair. There seemed to be a general consensus
> that the code was wrong and needed to be fixed. However, the compiler
> is also expected to generate the most performant code that it can.

I think Linus' point is that correctness is required, and performance
is just a nice-to-have. I don't think you disagree, I sure don't :-)

> > In fact, gcc pretty much documents that "cc" clobbers doesn't do
> > anything on x86, and isn't needed. It just assumes that the arithmetic
> > flags always get clobbered, because on x86 that's pretty much the
> > case. They have sometimes been added for documentation purposes in the
> > kernel, but that's all they are.

GCC on x86 clobbers the arithmetic flags in every asm. Long ago the x86
port still used CC0 (an internal representation of the condition code
flags), which acted effectively like clobbering the flags on every
instruction (CC0 will finally be gone in GCC 12 btw). The x86 port
stopped using CC0 (and so started using the automatic "cc" clobber, to
keep old inline asm working) in 1999 :-)

"cc" is a valid clobber on every port, but it doesn't necessarily do
anything, and if it does, it does not necessarily correspond to any more
specific register, neither to a GCC register nor to a hardware register.
It also is not really useful for generic code, since the asm itself is
by nature pretty much target-specific :-)

> > But the whole "you can't move _other_ things that you don't even
> > understand around this either" is equally important. A "disable
> > interrupts" could easily be protecting a "read and modify a CPU MSR
> > value" too - no real "memory" access necessarily involved, but
> > "memory" is the only way we can tell you "don't move this".
> >
> And yet that's not guaranteed. From
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html:
>
> ```
> Note that the compiler can move even volatile asm instructions
> relative to other code, including across jump instructions. For
> example, on many targets there is a system register that controls the
> rounding mode of floating-point operations. Setting it with a volatile
> asm statement, as in the following PowerPC example, does not work
> reliably.
>
> asm volatile("mtfsf 255, %0" : : "f" (fpenv));
> sum = x + y;
>
> The solution is to reference the "sum" variable in the asm:
>
> asm volatile ("mtfsf 255,%1" : "=X" (sum) : "f" (fpenv));
> ```
>
> Note that the solution _isn't_ to add a "memory" clobber, because it's
> not guaranteed to work, as it's explicitly defined to be a read/write
> _memory_ barrier, despite what kernel writers wish it would do.

It doesn't work because *other* code can change the fp environment as
well; adding "memory" clobbers to both asms will keep them in order (if
they both are not deleted, etc). This hinders optimisation very
seriously for code like this btw, another important reason to not do it
here.

And what is "memory" here anyway? New memory items can be added very
late in the pass pipeline, and anything done in an earlier pass will
not have considered those of course. The most important case of this
(but not the only case) is new slots on the stack.

> Your assertion that compilers don't know about control registers isn't
> exactly true. In the case of "pushf/popf", those instructions know
> about the eflags registers. All subsequent instructions that read or
> modify eflags also know about it. In essence, the compiler can
> determine its own clobber list, which includes MSRs.

The compiler only knows about some bits in the flags register. The
compiler does not know about most machine resources, including all of
the model-specific registers.


Segher

2022-03-17 23:33:49

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 2:45 PM Bill Wendling <[email protected]> wrote:
>
> I'm NOT saying that it WILL change or that it SHOULD change. I'm also
> not saying that your concern isn't justified. What I am saying is that
> unless you're using a compiler feature that's DEFINED as having a
> certain effect, then you are not using that feature correctly,
> regardless of how it's acted in the past. And it has the potential to
> bite you in the ass sooner or later. We've all seen such things happen
> before.

So I think most of inline asm constraints are fairly well defined.
Certainly "memory" clobbers are.

The unfortunate exception to this is, I think, "volatile". It has
always had somewhat undefined semantics (iirc originally gcc talked
about it not being "moved significantly" etc), and it ends up getting
mixed reasons for existing.

The *natural* semantics would be to literally make it have the same
rules as volatile data accesses: something like "'volatile' marks the
asm as having visible side effects in the virtual machine".

So I think natural semantics for "asm volatile" - and the ones that
would be simple to document - would literally be to compare it to
those volatile memory accesses, and say that it can't be optimized
away, and it's ordered wrt other volatile operations (whether volatile
data accesses or other volatile asm instructions).

But that is, afaik, not what it ever did, so it always had somewhat
random semantics, the main being "it can't be removed even if its
outputs are never used". So the "cannot be optimized away" ends up
being the central part of the definition, but without the conceptual
sense.

And then we in the kernel have then also co-opted 'asm volatile' to
just fix some compiler bugs, so we end up using "asm volatile goto"
because of

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670

although *that* particular issue is probably historical now that we
require more modern compiler versions.

I still think that from a sanity standpoint, it would be good to
actually strengthen the semantics of "asm volatile" to literally act
as - and be ordered with - volatile memory accesses.

But I guess that's water under the bridge.

Linus

2022-03-18 00:57:36

by Segher Boessenkool

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 03:51:28PM -0700, Linus Torvalds wrote:
> On Thu, Mar 17, 2022 at 2:45 PM Bill Wendling <[email protected]> wrote:
> > I'm NOT saying that it WILL change or that it SHOULD change. I'm also
> > not saying that your concern isn't justified. What I am saying is that
> > unless you're using a compiler feature that's DEFINED as having a
> > certain effect, then you are not using that feature correctly,
> > regardless of how it's acted in the past. And it has the potential to
> > bite you in the ass sooner or later. We've all seen such things happen
> > before.
>
> So I think most of inline asm constraints are fairly well defined.
> Certainly "memory" clobbers are.
>
> The unfortunate exception to this is, I think, "volatile". It has
> always had somewhat undefined semantics (iirc originally gcc talked
> about it not being "moved significantly" etc), and it ends up getting
> mixed reasons for existing.

"asm volatile" has always meant "has some unspecified side effect", in
parallel with what a volatile object is in C (there, all *accesses* to
such objects have the side effects). All such side effects have to
happen on the real machine in the same order (and exactly as often) as
on the abstract C machine. This is all it means, nothing more, nothing
less.

This is a little hard to understand, certainly for most users, who do
not often have heard of the abstract machine before (which is a shame,
because *all* of C semantics are defined wrt that).

> The *natural* semantics would be to literally make it have the same
> rules as volatile data accesses: something like "'volatile' marks the
> asm as having visible side effects in the virtual machine".

Not necessarily visible, that is the point even, but yes :-)

> So I think natural semantics for "asm volatile" - and the ones that
> would be simple to document - would literally be to compare it to
> those volatile memory accesses, and say that it can't be optimized
> away, and it's ordered wrt other volatile operations (whether volatile
> data accesses or other volatile asm instructions).

"Cannot be optimised away" means something else to everyone, and almost
all of those meanings do not correspond to the truth very well.

> And then we in the kernel have then also co-opted 'asm volatile' to
> just fix some compiler bugs, so we end up using "asm volatile goto"
> because of
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
>
> although *that* particular issue is probably historical now that we
> require more modern compiler versions.

Thankfully. The barrier_before_unreachable() thing in compiler-gcc.h
needs some newer GCC 7 (.3 or .4 I think?), so that is still needed for
a while more. The barrier() one in compiler.h is probably not a bug
anymore since decades, it is from before the kernel started documenting
what versions of required tools have what known bugs, apparently ;-)

> I still think that from a sanity standpoint, it would be good to
> actually strengthen the semantics of "asm volatile" to literally act
> as - and be ordered with - volatile memory accesses.
>
> But I guess that's water under the bridge.

That is what it has actually done since forever. See C 5.1.2.3. For
GCC, "asm volatile" has a side effect like in /2 there as well, as does
unspec_volatile (an internal GCC thing used to implement certain
builtins, among other things).

"asm volatile" does not mean, and has never meant, anything like "can
not be deleted" or "can not be cloned". "Cannot be moved significantly"
is vague enough that it isn't untrue (but isn't very helpful either).


Segher

2022-03-18 01:00:30

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 4:25 PM Segher Boessenkool
<[email protected]> wrote:
>
> > I still think that from a sanity standpoint, it would be good to
> > actually strengthen the semantics of "asm volatile" to literally act
> > as - and be ordered with - volatile memory accesses.
> >
> > But I guess that's water under the bridge.
>
> That is what it has actually done since forever. See C 5.1.2.3. For
> GCC, "asm volatile" has a side effect like in /2 there as well, as does
> unspec_volatile (an internal GCC thing used to implement certain
> builtins, among other things).

Oh, so two "asm volatile" statements _are_ in fact defined to be
ordered wrt each other?

Because the gcc docs certainly don't say that ;(

Yeah, yeah, dead code can be removed, whether volatile or not. That's
true of "*(volatile int *)x" too. It's not the dead code that is the
interesting case, though..

Is this also well-defined ordering-wise:

asm volatile("do_something");
WRITE_ONCE(x, 1);

(where "WRITE_ONCE()" is that kernel macro that just uses a volatile
pointer assignment to force the access)?

And could we get that documented?

Linus

2022-03-18 02:15:04

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 5:31 PM Segher Boessenkool
<[email protected]> wrote:
>
> On Thu, Mar 17, 2022 at 01:36:19PM -0700, Linus Torvalds wrote:
> >
> > So yes, a builtin can improve code generation.
>
> Yes, and they are much easier to write, and can be written correctly by
> lookenspeepers who just *have* to twist every knob that they can -- more
> easily than inline asm anyway, which is unforgiving to the extreme.

What?

They aren't easier for the user: they are entirely undocumented, they
can't be found anywhere sane, and even if you do find them, you don't
actually have a clue which compiler version implements them.

If you know about the name - and how would you? - one of the first
google hits when you try to figure out what it does is the gcc
bugzilla about how they are undocumented:

https://gcc.gnu.org/bugzilla//show_bug.cgi?id=92137

and then when you look at that bugzilla, the reply to the "it's
undocumented" is _literally_

"You shouldn't use those, they are for internal use only. That's
the reason they are not documented"

so I call BS on them being "easier to write".

I also call BS on the "can be written correctly", since the actual gcc
built-in is not just undocumented, it is also (a) badly defined in
that there is no chance in hell it can ever do anything sane thanks to
missing serialization definitions, and (b) clearly seriously buggy and
generates completely bogus code as Andrew Cooper found out.

Now, compare that to just using inline asm: it's trivial, and we've
used it basically unchanged for three decades.

So seriously: when I say "yes, a builtin can improve code generation",
I mean it purely in the theoretical sense.

Because in the *practical* sense that actually takes reality into
account, I can unequivocally say that the built-in is pure and utter
garbage, should never be used, and should actively be deleted from the
gcc source base because it's so fundamentally broken.

Do you really want to make arguments that are counter-factual when the
facts are *this* clear?

Linus

2022-03-18 03:36:03

by Linus Torvalds

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 6:21 PM Linus Torvalds
<[email protected]> wrote:
>
> Now, compare that to just using inline asm: it's trivial, and we've
> used it basically unchanged for three decades.

Ok, so going _really_ far back, we used to have them literally written out:

__asm__ __volatile__("pushfl ; popl %0 ; cli":"=r" (flags));

in random code, and then in October 1992 switched those nasty things
to instead use

#define cli() __asm__ __volatile__ ("cli"::)

#define save_flags(x) \
__asm__ __volatile__("pushfl ; popl %0":"=r" (x))

and the code was changed to actually use

save_flags(flags);
cli();

instead of that open-coded raw asm.

And the "memory" clobber was added early June, 1993:

#define save_flags(x) \
-__asm__ __volatile__("pushfl ; popl %0":"=r" (x))
+__asm__ __volatile__("pushfl ; popl %0":"=r" (x)::"memory")

so that thing really has existed in pretty much that exact form for
almost 30 years.

There's been tweaks since (the "=r" became "=g" before becoming "=rm",
comments have been added, "pushfl" became just "pushf" with x86-64,
and the thing has moved around and is now called "native_save_fl()" in
a completely different header file etc)

But at no point was it ever as buggy as the actual gcc intrinsic seems
to be today, nor have we needed to check for compiler versions etc.

Linus

2022-03-18 05:43:23

by Segher Boessenkool

Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 04:31:19PM -0700, Linus Torvalds wrote:
> On Thu, Mar 17, 2022 at 4:25 PM Segher Boessenkool
> <[email protected]> wrote:
> >
> > > I still think that from a sanity standpoint, it would be good to
> > > actually strengthen the semantics of "asm volatile" to literally act
> > > as - and be ordered with - volatile memory accesses.
> > >
> > > But I guess that's water under the bridge.
> >
> > That is what it has actually done since forever. See C 5.1.2.3. For
> > GCC, "asm volatile" has a side effect like in /2 there as well, as does
> > unspec_volatile (an internal GCC thing used to implement certain
> > builtins, among other things).
>
> Oh, so two "asm volatile" statements _are_ in fact defined to be
> ordered wrt each other?

Colloquially you could say that. But that statement can be understood
to mean more than the actual guarantee (and it is also much less than
it, on the positive side).

> Because the gcc docs certainly don't say that ;(

Older GCC docs said
You will also want to add the volatile keyword if the memory affected
is not listed in the inputs or outputs of the asm, as the `memory'
clobber does not count as a side-effect of the asm.

and
The volatile keyword indicates that the instruction has important
side-effects. GCC will not delete a volatile asm if it is reachable.
(The instruction can still be deleted if GCC can prove that
control-flow will never reach the location of the instruction.) Note
that even a volatile asm instruction can be moved relative to other
code, including across jump instructions.

and
Similarly, you can't expect a sequence of volatile asm instructions to
remain perfectly consecutive. If you want consecutive output, use a
single asm. Also, GCC will perform some optimizations across a
volatile asm instruction; GCC does not “forget everything” when it
encounters a volatile asm instruction the way some other compilers do.

The internal assembler docs were rewritten later, because they were
considered too terse, too hard for users to understand. But they may
have lost some clarity in the process.

> Yeah, yeah, dead code can be removed, whether volatile or not. That's
> true of "*(volatile int *)x" too. It's not the dead code that is the
> interesting case, though..

Internally to GCC this is the same in most ways, too; it's no accident
these things correspond so well :-)

> Is this also well-defined ordering-wise:
>
> asm volatile("do_something");
> WRITE_ONCE(x, 1);
>
> (where "WRITE_ONCE()" is that kernel macro that just uses a volatile
> pointer assignment to force the access)?

The macro in asm-generic is

#define __WRITE_ONCE(x, val) \
do { \
*(volatile typeof(x) *)&(x) = (val); \
} while (0)

which will work fine given the undocumented C extension I mentioned in
https://gcc.gnu.org/PR33053 (this will be part of C2x as well btw).

I don't know if all arch-specific versions are fine.

> And could we get that documented?

Could you open a GCC PR for it? I can do it, but you know better what
you want, and I might forget, etc.


Segher

2022-03-18 07:45:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 3:51 PM Linus Torvalds
<[email protected]> wrote:
>
> I still think that from a sanity standpoint, it would be good to
> actually strengthen the semantics of "asm volatile" to literally act
> as - and be ordered with - volatile memory accesses.

Side note: the reason I personally would prefer these semantics is
that if we had a stronger definition of what "asm volatile" actually
means, we could get rid of some - but certainly not all - "memory"
clobbers.

And while I've been very positive about memory clobbers in this thread
- because they give us that serialization we so often need - they can
often be bad for code generation.

Quite often it's not actually a huge deal. Even with a memory clobber,
it's not like the compiler needs to flush any spills or local
variables that haven't had their address taken to memory.

So quite often a memory clobber is only going to affect instruction
ordering a bit, and that's usually exactly what you want.

But it certainly _can_ be quite noticeable, and while things like
"cli" really wants a memory clobber anyway because it really does want
to order with respect to memory accesses that the compiler does, other
operations wouldn't necessarily need it.

That "native_save_fl()" that does that "pushf ; pop %0" instruction,
for example, would be perfectly happy only being ordered wrt other asm
instructions (notably ordered wrt cli/sti).

So we could easily remove the "memory" clobber there, but only if we
had that other ordering constraint of "asm volatile" being ordered wrt
each other.

Right now we have memory clobbers just about everywhere. It's almost
the default, unless it's about some purely arithmetic
operation. I'm not sure how many of them we could remove, but I do
think that pushf/popf is _one_ such case.

So stronger semantics for "asm volatile" could potentially help
generate better code.

Of course, then the "I just don't want you to optimize this away"
crowd might be unhappy and find cases where it really hurts.

Linus

2022-03-18 09:01:36

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 01:36:19PM -0700, Linus Torvalds wrote:
> asm volatile("# __raw_save_flags\n\t"
> "pushf ; pop %0"
> : "=rm" (flags)
> : /* no input */
> : "memory");

> And is that "memory" clobber because it modifies the memory location
> just below the current stack pointer?
>
> No, not really - outside the kernel that might be an issue, but we
> already have to build the kernel with -mno-red-zone, so if the
> compiler uses that memory location, that would be a *HUGE* compiler
> bug already.

There is the problem though that this might extend the stack, which
isn't marked up anywhere, so the static checkers do not see the stack
overflow, and it won't be noticed until runtime. Or do the checkers
consider such cases?

> So the "memory" clobber has absolutely nothing to do with the fact
> that 'pushf' updates the stack pointer, writes to that location, and
> the popf then undoes it.
>
> It's literally because we don't want the compiler to move non-spill
memory accesses around it (or other asm statements with memory
> clobbers), regardless of the fact that the sequence doesn't really
> read or write memory in any way that is relevant to the compiler.

Well, that, or the writer of the code didn't consider this, just went
"writes memory, so we clobber".

> > GCC doesn't have barriers in the built-ins (if we are talking about
> > __builtin_ia32_readeflags_u64 and __builtin_ia32_writeeflags_u64). I
> > expect they are actually pretty useless, and were merely added for
> > completeness of the intrinsics headers.
>
> Yeah, without any kinds of ordering guarantees, I think those builtins
> are basically only so in name. They might as well return a random
> value - they're not *useful*, because they don't have any defined
> behavior.

No ordering wrt any other code, yes. Which is not anything you can
solve in only the builtin -- you need to consider the other code that
you order with as well, change that code as well.

> I mean, we *could* certainly use "read eflags" in the kernel, and yes,
> in theory it would be lovely if we didn't have to encode it as a
> "pushf/pop" sequence, and the compiler tracked the stack pointer for
> us, and perhaps combined it with other stack pointer changes to the
> point where the "popf" would never happen, it would just undo the %rsp
> change at function exit time.
>
> So yes, a builtin can improve code generation.

Yes, and they are much easier to write, and can be written correctly by
lookenspeepers who just *have* to twist every knob that they can -- more
easily than inline asm anyway, which is unforgiving to the extreme.

They are also much easier to read usually, make for cleaner code. They
certainly have their place. But they should stay there, too :-)


Segher

2022-03-18 20:12:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On 3/17/22 14:39, Linus Torvalds wrote:
> On Thu, Mar 17, 2022 at 2:06 PM Andrew Cooper <[email protected]> wrote:
>>
>> I was idly playing with this builtin in response to the thread, and
>> found that GCC has a far more serious bug than Clang had.
>
> Heh. Yeah, that's bad.
>
> But it kind of fits my argument: compiler intrinsics aren't
> necessarily such a wonderful thing. They hardly get any testing at all
> unless they are some very very core thing.
>
> I will personally *always* prefer some generic "escape clause" over a
> compiler that thinks it knows better.
>
> I think compiler people should see inline asm as a really *good*
> thing, because it allows people to have that "I'm going to do
> something that nobody normal does, look away now" kind of escape
> clause, letting the compiler concentrate on the things it does best.
>
> Yes, I realize inline asm isn't something compiler people love. And
> yes, I realize that it might not give optimal code.
>
> But think of it a bit like casts - it adds complexity to the language,
> and it might make some optimizations harder or even impossible - but
> it makes the language much more powerful and allows you to do things
> that you simply couldn't do without it.
>
> There's a reason people use C instead of Pascal or FORTRAN. Those
> casts, and that pointer arithmetic - the bane of many optimizations -
> are really really good things.
>
> Intrinsics should generally be shunned, not seen as a "this is an
> alternative to inline asm". They should only exist for situations
> where you can show "look, I can do so much better", and have a
> test-suite for it and a real-life case where it gives you a clear and
> undeniable uplift on a meaningful load.
>
> (And here I separate 'intrinsic' and generic compiler builtins. I love
> the builtins that allow me to tell the compiler something, or the
> compiler to tell me something about the code: __builtin_unreachable()
> and __builtin_constant_p() are both wonderful examples of those kinds
> of things)
>
> But an intrinsic for some odd target-specific thing that can't even be
> portably generalized? Give me inline asm any day.
>
> We can do things with inline asms that the compiler can't even _dream_
> of, like the whole instruction rewriting thing we use.
>
> I'd much rather have a powerful generic concept (eg "asm goto" is
> lovely) than a specialized intrinsic that does only one thing, and
> doesn't even do it well because it's untested and badly defined.
>

I generally agree. In this particular case, though, I will keep using
the builtin in tools/testing/selftests/x86/helpers.h unless we actually
hit breakage. This is because this is *user* code, it is compiled with
the redzone enabled, and the asm code to do the right thing when the
redzone is enabled is too hairy for me to want to deal with it. The
builtins have their share of problems, but they do seem to work in this
context.

This isn't at all an argument that we should do the same thing in actual
kernel code.

2022-03-19 13:04:59

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v5] x86: use builtins to read eflags

From: Linus Torvalds
> Sent: 18 March 2022 18:19
>
> On Fri, Mar 18, 2022 at 10:59 AM Andy Lutomirski <[email protected]> wrote:
> >
> > I generally agree. In this particular case, though, I will keep using
> > the builtin in tools/testing/selftests/x86/helpers.h unless we actually
> > hit breakage. This is because this is *user* code, it is compiled with
> > the redzone enabled, and the asm code to do the right thing when the
> > redzone is enabled is too hairy for me to want to deal with it.
>
> Yeah, redzoning is a problem for "pushf".
>
> Or rather, it's not the redzoning itself, but the fact that the
> compiler might use the word under the stack for random other things,
> and the pushf will then corrupt some local variable storage.
>
> I think it would be lovely to solve that in inline asm itself some way
> - by marking the stack pointer clobbered or something.

Something that generates:
mov -8(%rsp),%rax
pushf
xchg %rax,0(%rsp)
add $8,%rsp
should work with or without a redzone.
Will be a bit slower :-(

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-03-19 13:27:25

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v5] x86: use builtins to read eflags

From: Segher Boessenkool
> Sent: 18 March 2022 23:04
...
> The vast majority of compiler builtins are for simple transformations
> that the machine can do, for example with vector instructions. Using
> such builtins does *not* instruct the compiler to use those machine
> insns, even if the builtin name would suggest that; instead, it asks to
> have code generated that has such semantics. So it can be optimised by
> the compiler, much more than what can be done with inline asm.

Bah.
I wrote some small functions to convert blocks of 80 audio
samples between C 'float' and the 8-bit u-law and A-law floating
point formats - one set use the F16C conversions for denormalised
values.
I really want the instructions I've asked for in the order
I've asked for them.
I don't want the compiler doing stupid things.
(Like deciding to try to vectorise the bit of code at the end
that handled non 80 byte blocks.)

> It also can be optimised better by the compiler than if you would
> open-code the transforms (if you ask to frobnicate something, the
> compiler will know you want to frobnicate that thing, and it will not
> always recognise that is what you want if you just write it out in more
> general code).

Yep.
If I write 'for (i = 0; i < n; i++) foo[i] = bar[i]'
I want a loop - not a call to memcpy().
If I want a memcpy() I'll call memcpy().

And if I write:
do {
sum64a += buff32[0];
sum64b += buff32[1];
sum64a += buff32[2];
sum64b += buff32[3];
buff32 += 4;
} while (buff32 != lim);
I don't want to see 'buff[1] + buff[2]' anywhere!
That loop has half a chance of running at 8 bytes/clock.
But not how gcc compiles it.

> Well-chosen builtin names are also much more readable than the best
> inline asm can ever be, and it can express much more in a much smaller
> space, without so much opportunity to make mistakes, either.

Hmmm...
Trying to write that SSE2/AVX code was a nightmare.
Chase through the cpu instruction set trying to sort out
the name of the required instruction.
Then search through the 'intrinsic' header to find the
name of the builtin.
Then disassemble the code to check that I'd got the right one.
I'm pretty sure the asm would have been shorter
and needed just as many comments.

David


2022-03-19 17:08:04

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v5] x86: use builtins to read eflags

From: Segher Boessenkool
> Sent: 18 March 2022 22:09
...
> It generally is a very good idea to
> have a redzone though, without it you pay much more than necessary for
> frame setup and teardown in leaf functions (similar to some of what the
> misnamed "shrink-wrapping" optimisation does, but the two are mostly
> independent, the benefits add up).

Are there really leaf functions that need to spill data to stack
where the cost of setting up a stack frame is significant?

I'd have thought that was relatively rare.

David


2022-03-19 18:59:38

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Thu, Mar 17, 2022 at 02:39:34PM -0700, Linus Torvalds wrote:
> I think compiler people should see inline asm as a really *good*
> thing, because it allows people to have that "I'm going to do
> something that nobody normal does, look away now" kind of escape
> clause, letting the compiler concentrate on the things it does best.

There is a reason that we have the GCC inline asm extension at all. It
is not because the compiler developers think it is useless ;-)

It often *is* abused, or used in wrong, dangerous ways. This often will
go undetected for a long time, for all the same reasons you want to have
inline asm in the first place (it is an escape hatch, much of what you
do with it is outside of what the compiler can (and does) know about)!

> Yes, I realize inline asm isn't something compiler people love. And
> yes, I realize that it might not give optimal code.

Some of us do love it. But it should not be used when you do not need
it, indeed to get better machine code generated, but much more
importantly to get more correct code, and much simpler code, simpler to
read code. All these things are connected.

> (And here I separate 'intrinsic' and generic compiler builtins. I love
> the builtins that allow me to tell the compiler something, or the
> compiler to tell me something about the code: __builtin_unreachable()
> and __builtin_constant_p() are both wonderful examples of those kinds
> of things)

I don't know what you mean when you say "intrinsic". It is a worse name
than a "built-in function", which is either not a function or not a
built-in thing, so that name is rubbish; both "compiler intrinsic" and
"intrinsic function" are even worse though!

I like to say "builtin", which has no other meaning in the English
language, so it cannot be easily understood to have properties it
doesn't, not just from a misleading name anyway.

And yes, __builtin_unreachable() is nicer than having to write "1/0" in
all those places, it is much clearer. It is undefined behaviour to
execute it exactly the same though.

__builtin_constant_p() is a curious one: it returns whether the compiler
knows its argument to be constant, which is something C does not define
(*cannot* define). It is a curse as well as a blessing, moving from
compiler version to compiler version it can return false instead of true
on the same expression, or the other way around, and the same when
trying out different compilers of course.

> But an intrinsic for some odd target-specific thing that can't even be
> portably generalized? Give me inline asm any day.

The vast majority of compiler builtins are for simple transformations
that the machine can do, for example with vector instructions. Using
such builtins does *not* instruct the compiler to use those machine
insns, even if the builtin name would suggest that; instead, it asks to
have code generated that has such semantics. So it can be optimised by
the compiler, much more than what can be done with inline asm.

It also can be optimised better by the compiler than if you would
open-code the transforms (if you ask to frobnicate something, the
compiler will know you want to frobnicate that thing, and it will not
always recognise that is what you want if you just write it out in more
general code).

Well-chosen builtin names are also much more readable than the best
inline asm can ever be, and it can express much more in a much smaller
space, without so much opportunity to make mistakes, either.

Builtins are good!

> We can do things with inline asms that the compiler can't even _dream_
> of, like the whole instruction rewriting thing we use.

Yes. Inline asm is good as well! Both things have their places.

Some builtins are not well-conceived. And some inline asm has the
same problem :-(

> I'd much rather have a powerful generic concept (eg "asm goto" is
> lovely) than a specialized intrinsic that does only one thing, and
> doesn't even do it well because it's untested and badly defined.

Being generally applicable is a double-edged sword. The trick is to
have things more general when reasonable, and not when not.


Segher

2022-03-19 22:52:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Fri, Mar 18, 2022 at 10:59 AM Andy Lutomirski <[email protected]> wrote:
>
> I generally agree. In this particular case, though, I will keep using
> the builtin in tools/testing/selftests/x86/helpers.h unless we actually
> hit breakage. This is because this is *user* code, it is compiled with
> the redzone enabled, and the asm code to do the right thing when the
> redzone is enabled is too hairy for me to want to deal with it.

Yeah, redzoning is a problem for "pushf".

Or rather, it's not the redzoning itself, but the fact that the
compiler might use the word under the stack for random other things,
and the pushf will then corrupt some local variable storage.

I think it would be lovely to solve that in inline asm itself some way
- by marking the stack pointer clobbered or something.

Because you have the same issue if an inline asm might need to do a
function call - think magic calling conventions etc, but also possibly
slow-path cases.

As mentioned, it's not an issue for the kernel proper due to
-mno-red-zone which we need for entirely unrelated reasons.

Side note and kind of related: we do have this in the kernel:

register unsigned long current_stack_pointer asm(_ASM_SP);
#define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)

which *might* also solve the redzoning issue.

In the kernel we need it not because of redzoned stack use, but
because we need the stack frame to be set up properly or objtool
complains.

Linus

2022-03-21 12:53:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On March 18, 2022 3:09:01 PM PDT, Segher Boessenkool <[email protected]> wrote:
>On Fri, Mar 18, 2022 at 11:19:28AM -0700, Linus Torvalds wrote:
>> Or rather, it's not the redzoning itself, but the fact that the
>> compiler might use the word under the stack for random other things,
>> and the pushf will then corrupt some local variable storage.
>>
>> I think it would be lovely to solve that in inline asm itself some way
>> - by marking the stack pointer clobbered or something.
>
>Inline assembler does not allow you to change the stack pointer, in
>principle. You have to return everything to its original state before
>you return control from the asm code, and you have to deal with what
>happens if an interrupt comes in halfway through the asm, and all other
>ABI things that may happen on your platform.
>
>> Because you have the same issue if an inline asm might need to do a
>> function call - think magic calling conventions etc, but also possibly
>> slow-path cases.
>
>Yes. The compiler itself can deal with all the red zone restrictions --
>precisely *because* it is in full control of the stack frame -- but
>those restrictions are very real. It generally is a very good idea to
>have a redzone though, without it you pay much more than necessary for
>frame setup and teardown in leaf functions (similar to some of what the
>misnamed "shrink-wrapping" optimisation does, but the two are mostly
>independent, the benefits add up).
>
>> As mentioned, it's not an issue for the kernel proper due to
>> -mno-red-zone which we need for entirely unrelated reasons.
>
>It might help to have some special clobber syntax that says "this asm
>clobbers the red zone", so the compiler can arrange for the red zone to
>contain nothing during the asm (it already does the same for function
>calls, for example).
>
>How bad is it to do the fail-safe general solution here though? I.e.,
>write actual assembler code:
>
># u16 getflags(void);
>getflags:
> pushf
> pop %ax
> ret
>
>(or whatever the syntax is, my x86 is rusty).
>
>> Side note and kind of related: we do have this in the kernel:
>>
>> register unsigned long current_stack_pointer asm(_ASM_SP);
>> #define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)
>>
>> which *might* also solve the redzoning issue.
>
>The GCC documentation of inline asm says
> Another restriction is that the clobber list should not contain the
> stack pointer register. This is because the compiler requires the
> value of the stack pointer to be the same after an 'asm' statement as
> it was on entry to the statement. However, previous versions of GCC
> did not enforce this rule and allowed the stack pointer to appear in
> the list, with unclear semantics. This behavior is deprecated and
> listing the stack pointer may become an error in future versions of
> GCC.
>
>> In the kernel we need it not because of redzoned stack use, but
>> because we need the stack frame to be set up properly or objtool
>> complains.
>
>If the kernel has special rules for the stack, it had better teach the
>compiler about this special ABI, or there will be tears eventually. If
>the kernel requires only what the standard ABIs provide, it can trust
>the compiler to do that correctly, this is one of the core jobs of a
>compiler!
>
>
>Segher

It is extremely common for inline assembly to be written using push/pop or call sequences, and not just because of eflags. In the kernel redzone is (currently) not supported (with FRED exception handling it would be possible to support it as a kernel-wide compile-time option), but there needs to be a way to communicate this to the compiler.

2022-03-21 17:34:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Fri, Mar 18, 2022 at 2:48 PM Andrew Cooper <[email protected]> wrote:
>
> As such, I'm not sure how current_stack_pointer can work as intended in
> all cases...

So as mentioned, the kernel doesn't really care, since for the kernel
that inline asm use is more of a "get proper backtraces" thing. The
compiler thinks the function is a leaf function, doesn't set up a
frame for a call that happens inside the inline asm.

The code *works* without it, but the frame annotations aren't right.

And obviously we don't actually *change* the stack pointer. Or
rather, a call will do exactly as with pushf/pop: rsp gets updated
there but gets put right back.

See commit 317c2ce77d8a ("x86/alternatives: Add stack frame dependency
to alternative_call_2()") for some details.

And yes, that trick then caused problems with clang, and we did
f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang")
that created the current situation.

It would be lovely to have some explicit model for "I want the frame
to have been set up for backtraces", but here we are. Marking '%rsp'
used makes the compiler understand it's not a leaf function.

And while we have other uses for it that then use the actual value,
those don't care about the exact value of the stack pointer register,
they just want "give me a pointer that is contained within the current
stack", because we control the stack allocation and do funky things
there. So "any random stack pointer value in this function" is
perfectly fine and expected.

But for user mode, it would probably be a great idea to also have a "I
cannot use a redzone in this function" thing. The kernel can't use it
because we have nested exceptions, but maybe some day even the kernel
could make use of (controlled) red-zoning.

Linus

2022-03-21 18:39:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Fri, Mar 18, 2022 at 4:47 PM Segher Boessenkool
<[email protected]> wrote:
>>
> > Marking '%rsp'
> > used makes the compiler understand it's not a leaf function.
>
> As I said before, this is explicitly incorrect code. Always was, but
> it is documented since a while (since GCC 9). Clobbering the stack
> pointer can never be correct, the stack pointer after an asm has to be
> identical to the one before that asm!

We've never clobbered the stack register.

We've _marked_ it as an in-out register, but we obviously never
actually change it (as far as the compiler can tell). That would very
obviously never work.

And even that marking must have been some gcc person telling us to do
that, because I don't think we would have come up with it otherwise. I
would guess it's what gcc uses internally to decide "I need this
function to have a stack frame". And once you have a stack frame, the
inline asm will automatically be put inside of it.

I spent a lot of time trying to find the origin of it. Finding the
commit in the kernel where it was first introduced is easy: looks like
commit 0e8e2238b52e ("x86/xen: Add stack frame dependency to hypercall
inline asm calls") was the first to use that trick, although it was a
series of patches from Josh that did that to make objtool happy.

And I found the culprit. It's you, on the gcc lists:

https://gcc.gnu.org/legacy-ml/gcc/2015-07/msg00080.html

Heh.

Linus

2022-03-21 21:34:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags



On Fri, Mar 18, 2022, at 4:42 PM, Segher Boessenkool wrote:
> On Fri, Mar 18, 2022 at 04:10:55PM -0700, Linus Torvalds wrote:
>> It would be lovely to have some explicit model for "I want the frame
>> to have been set up for backtraces", but here we are.
>
> So please define exactly what that *means*? Preferably portably, but I
> reckon at least some of it will have to be machine-specific (and ABI-
> specific). But it needs to be well-defined, clearly defined, defined at
> all, and *documented* :-)
>
>> Marking '%rsp'
>> used makes the compiler understand it's not a leaf function.
>
> As I said before, this is explicitly incorrect code. Always was, but
> it is documented since a while (since GCC 9). Clobbering the stack
> pointer can never be correct, the stack pointer after an asm has to be
> identical to the one before that asm!
>
>> And while we have other uses for it that then use the actual value,
>> those don't care about the exact value of the stack pointer register,
>> they just want "give me a pointer that is contained within the current
>> stack", because we control the stack allocation and do funky things
>> there. So "any random stack pointer value in this function" is
>> perfectly fine and expected.
>
> You can use %rsp as *input* operand just fine, which is all you need for
> that.
>
>> But for user mode, it would probably be a great idea to also have a "I
>> cannot use a redzone in this function" thing. The kernel can't use it
>> because we have nested exceptions, but maybe some day even the kernel
>> could make use of (controlled) red-zoning.
>
> Yes. We just have to figure out what the exact semantics we want is,
> and how to express that in a target-independent way, and then relatedly
> what a good name for it would be ("redzone" in the clobber list is the
> best I can come up with right now, but that may have to change).

Here’s the semantics I want:

I want to tell the compiler that an asm statement makes a function call. I want to specify the stack alignment and offset I need. I want the compiler to make it work. Something like this, but preferably with better syntax:

asm("asm here" ::: "call" (align=16, offset=0));

This means that the asm in question wants rsp to be 0 more than a multiple of 16 and that it wants precisely the setup needed for a call to be done. If frame pointers are enabled, a frame should be set up. If there is a redzone then either the compiler needs to not use it or needs to advance rsp past it.

>
>
> Segher

2022-03-21 22:30:15

by H. Peter Anvin

[permalink] [raw]
Subject: RE: [PATCH v5] x86: use builtins to read eflags

On March 18, 2022 3:36:48 PM PDT, David Laight <[email protected]> wrote:
>From: Segher Boessenkool
>> Sent: 18 March 2022 22:09
>...
>> It generally is a very good idea to
>> have a redzone though, without it you pay much more than necessary for
>> frame setup and teardown in leaf functions (similar to some of what the
>> misnamed "shrink-wrapping" optimisation does, but the two are mostly
>> independent, the benefits add up).
>
>Are there really leaf functions that need to spill data to stack
>where the cost of setting up a stack frame is significant?
>
>I'd have thought that was relatively rare.
>
> David
>
>
>

Yes.

2022-03-21 22:36:16

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Fri, Mar 18, 2022 at 11:19:28AM -0700, Linus Torvalds wrote:
> Or rather, it's not the redzoning itself, but the fact that the
> compiler might use the word under the stack for random other things,
> and the pushf will then corrupt some local variable storage.
>
> I think it would be lovely to solve that in inline asm itself some way
> - by marking the stack pointer clobbered or something.

Inline assembler does not allow you to change the stack pointer, in
principle. You have to return everything to its original state before
you return control from the asm code, and you have to deal with what
happens if an interrupt comes in halfway through the asm, and all other
ABI things that may happen on your platform.

> Because you have the same issue if an inline asm might need to do a
> function call - think magic calling conventions etc, but also possibly
> slow-path cases.

Yes. The compiler itself can deal with all the red zone restrictions --
precisely *because* it is in full control of the stack frame -- but
those restrictions are very real. It generally is a very good idea to
have a redzone though, without it you pay much more than necessary for
frame setup and teardown in leaf functions (similar to some of what the
misnamed "shrink-wrapping" optimisation does, but the two are mostly
independent, the benefits add up).

> As mentioned, it's not an issue for the kernel proper due to
> -mno-red-zone which we need for entirely unrelated reasons.

It might help to have some special clobber syntax that says "this asm
clobbers the red zone", so the compiler can arrange for the red zone to
contain nothing during the asm (it already does the same for function
calls, for example).

How bad is it to do the fail-safe general solution here though? I.e.,
write actual assembler code:

# u16 getflags(void);
getflags:
pushf
pop %ax
ret

(or whatever the syntax is, my x86 is rusty).

> Side note and kind of related: we do have this in the kernel:
>
> register unsigned long current_stack_pointer asm(_ASM_SP);
> #define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)
>
> which *might* also solve the redzoning issue.

The GCC documentation of inline asm says:

    Another restriction is that the clobber list should not contain the
    stack pointer register. This is because the compiler requires the
    value of the stack pointer to be the same after an 'asm' statement as
    it was on entry to the statement. However, previous versions of GCC
    did not enforce this rule and allowed the stack pointer to appear in
    the list, with unclear semantics. This behavior is deprecated and
    listing the stack pointer may become an error in future versions of
    GCC.

> In the kernel we need it not because of redzoned stack use, but
> because we need the stack frame to be set up properly or objtool
> complains.

If the kernel has special rules for the stack, it had better teach the
compiler about this special ABI, or there will be tears eventually. If
the kernel requires only what the standard ABIs provide, it can trust
the compiler to do that correctly, this is one of the core jobs of a
compiler!


Segher

2022-03-21 22:47:19

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [PATCH v5] x86: use builtins to read eflags

On Fri, Mar 18, 2022 at 04:10:55PM -0700, Linus Torvalds wrote:
> It would be lovely to have some explicit model for "I want the frame
> to have been set up for backtraces", but here we are.

So please define exactly what that *means*? Preferably portably, but I
reckon at least some of it will have to be machine-specific (and ABI-
specific). But it needs to be well-defined, clearly defined, defined at
all, and *documented* :-)

> Marking '%rsp'
> used makes the compiler understand it's not a leaf function.

As I said before, this is explicitly incorrect code. It always was,
but it has been documented as such for a while now (since GCC 9).
Clobbering the stack pointer can never be correct: the stack pointer
after an asm has to be identical to the one before that asm!

> And while we have other uses for it that then use the actual value,
> those don't care about the exact value of the stack pointer register,
> they just want "give me a pointer that is contained within the current
> stack", because we control the stack allocation and do funky things
> there. So "any random stack pointer value in this function" is
> perfectly fine and expected.

You can use %rsp as *input* operand just fine, which is all you need for
that.

> But for user mode, it would probably be a great idea to also have a "I
> cannot use a redzone in this function" thing. The kernel can't use it
> because we have nested exceptions, but maybe some day even the kernel
> could make use of (controlled) red-zoning.

Yes. We just have to figure out what the exact semantics we want is,
and how to express that in a target-independent way, and then relatedly
what a good name for it would be ("redzone" in the clobber list is the
best I can come up with right now, but that may have to change).


Segher

2022-03-21 23:08:52

by H. Peter Anvin

[permalink] [raw]
Subject: RE: [PATCH v5] x86: use builtins to read eflags

On March 18, 2022 3:43:13 PM PDT, David Laight <[email protected]> wrote:
>From: Linus Torvalds
>> Sent: 18 March 2022 18:19
>>
>> On Fri, Mar 18, 2022 at 10:59 AM Andy Lutomirski <[email protected]> wrote:
>> >
>> > I generally agree. In this particular case, though, I will keep using
>> > the builtin in tools/testing/selftests/x86/helpers.h unless we actually
>> > hit breakage. This is because this is *user* code, it is compiled with
>> > the redzone enabled, and the asm code to do the right thing when the
>> > redzone is enabled is too hairy for me to want to deal with it.
>>
>> Yeah, redzoning is a problem for "pushf".
>>
>> Or rather, it's not the redzoning itself, but the fact that the
>> compiler might use the word under the stack for random other things,
>> and the pushf will then corrupt some local variable storage.
>>
>> I think it would be lovely to solve that in inline asm itself some way
>> - by marking the stack pointer clobbered or something.
>
>Something that generates:
> mov %rax,-8(%rsp)
> pushf
> xchg %rax,0(%rsp)
> add $8,%rsp
>should work with or without a redzone.
>Will be a bit slower :-(
>
> David
>
>

*Much* slower (XCHG is implicitly locked, for hysterical raisins) and you are using the redzone between the mov and the push, so it seems like you just made things worse for no reason.