2024-04-04 22:55:41

by Linus Torvalds

Subject: More annoying code generation by clang

So this doesn't really matter in any real life situation, but it
really grated on me.

Clang has this nasty habit of taking our nice asm constraints, and
turning them into worst-case garbage. It's been reported a couple of
times where we use "g" to tell the compiler that pretty much any
source to the asm works, and then clang takes that to mean "I will
take that to use 'memory'" even when that makes no sense what-so-ever.

See for example

https://lore.kernel.org/all/CAHk-=wgobnShg4c2yyMbk2p=U-wmnOmX_0=b3ZY_479Jjey2xw@mail.gmail.com/

where I was ranting about clang just doing pointlessly stupid things.

However, I found a case where yes, clang does pointlessly stupid
things, but it's at least _partly_ our fault, and gcc can't generate
optimal code either.

We have this fairly critical code in __fget_files_rcu() to look up a
'struct file *' from an fd, and it does this:

/* Mask is a 0 for invalid fd's, ~0 for valid ones */
nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);

and clang makes a *horrid* mess of it, generating this code:

movl %edi, %r14d
movq 32(%rbx), %rdx
movl (%rdx), %eax
movq %rax, 8(%rsp)
cmpq 8(%rsp), %r14
sbbq %rcx, %rcx

which is just crazy. Notice how it does that "move rax to stack, then
do the compare against the stack", instead of just using %rax.

In fact, that function shouldn't have a stack frame at all, and the
only reason it is generated is because of this whole oddity.

All clang's fault, right?

Yeah, mostly. But it turns out that what really messes with clang's
little head is that the x86 array_index_mask_nospec() function is
being a bit annoying.

This is what we do:

static __always_inline unsigned long
array_index_mask_nospec(unsigned long index,
                        unsigned long size)
{
        unsigned long mask;

        asm volatile ("cmp %1,%2; sbb %0,%0;"
                        :"=r" (mask)
                        :"g"(size),"r" (index)
                        :"cc");
        return mask;
}

and look at the use again:

nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);

here all the values are actually 'unsigned int'. So what happens is
that clang can't just use the fdt->max_fds value *directly* from
memory, because it needs to be expanded from 32-bit to 64-bit because
we've made our array_index_mask_nospec() function only work on 64-bit
'unsigned long' values.

So it turns out that by massaging this a bit, and making it just be a
macro - so that the asm can decide that "I can do this in 32-bit" - I
can get clang to generate much better code.

Clang still absolutely hates the "g" constraint, so to get clang to
really get this right I have to use "ir" instead of "g". Which is
wrong. Because gcc does this right, and could use the memory op
directly. But even gcc cannot do that with our *current* function,
because of that "the memory value is 32-bit, we require a 64-bit
value"

Anyway, I can get gcc to generate the right code:

movq 32(%r13), %rdx
cmp (%rdx),%ebx
sbb %esi,%esi

which is basically the right code to replace the six crazy instructions
clang generates. And if I make the "g" be "ir", I can get clang to generate

movq 32(%rdi), %rcx
movl (%rcx), %eax
cmpl %eax, %esi
sbbl %esi, %esi

which is the same thing, but with that (pointless) load to a register.

And now clang doesn't generate that stack frame at all.

Anyway, this was a long email to explain the odd attached patch.

Comments? Note that this patch is *entirely* untested, I have done
this purely by looking at the code generation in fs/file.c.

Linus


Attachments:
patch.diff (1.10 kB)

2024-04-06 10:56:41

by Ingo Molnar

Subject: Re: More annoying code generation by clang


[ I've Cc:-ed a few more people who might be interested in this. ]
[ Copy of Linus's email below. ]

* Linus Torvalds <[email protected]> wrote:

> So this doesn't really matter in any real life situation, but it
> really grated on me.
>
> Clang has this nasty habit of taking our nice asm constraints, and
> turning them into worst-case garbage. It's been reported a couple of
> times where we use "g" to tell the compiler that pretty much any
> source to the asm works, and then clang takes that to mean "I will
> take that to use 'memory'" even when that makes no sense what-so-ever.
>
> See for example
>
> https://lore.kernel.org/all/CAHk-=wgobnShg4c2yyMbk2p=U-wmnOmX_0=b3ZY_479Jjey2xw@mail.gmail.com/
>
> where I was ranting about clang just doing pointlessly stupid things.
>
> However, I found a case where yes, clang does pointlessly stupid
> things, but it's at least _partly_ our fault, and gcc can't generate
> optimal code either.
>
> We have this fairly critical code in __fget_files_rcu() to look up a
> 'struct file *' from an fd, and it does this:
>
> /* Mask is a 0 for invalid fd's, ~0 for valid ones */
> nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);
>
> and clang makes a *horrid* mess of it, generating this code:
>
> movl %edi, %r14d
> movq 32(%rbx), %rdx
> movl (%rdx), %eax
> movq %rax, 8(%rsp)
> cmpq 8(%rsp), %r14
> sbbq %rcx, %rcx
>
> which is just crazy. Notice how it does that "move rax to stack, then
> do the compare against the stack", instead of just using %rax.
>
> In fact, that function shouldn't have a stack frame at all, and the
> only reason it is generated is because of this whole oddity.
>
> All clang's fault, right?
>
> Yeah, mostly. But it turns out that what really messes with clangs
> little head is that the x86 array_index_mask_nospec() function is
> being a bit annoying.
>
> This is what we do:
>
> static __always_inline unsigned long
> array_index_mask_nospec(unsigned long index,
> unsigned long size)
> {
> unsigned long mask;
>
> asm volatile ("cmp %1,%2; sbb %0,%0;"
> :"=r" (mask)
> :"g"(size),"r" (index)
> :"cc");
> return mask;
> }
>
> and look at the use again:
>
> nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);
>
> here all the values are actually 'unsigned int'. So what happens is
> that clang can't just use the fdt->max_fds value *directly* from
> memory, because it needs to be expanded from 32-bit to 64-bit because
> we've made our array_index_mask_nospec() function only work on 64-bit
> 'unsigned long' values.
>
> So it turns out that by massaging this a bit, and making it just be a
> macro - so that the asm can decide that "I can do this in 32-bit" - I
> can get clang to generate much better code.
>
> Clang still absolutely hates the "g" constraint, so to get clang to
> really get this right I have to use "ir" instead of "g". Which is
> wrong. Because gcc does this right, and could use the memory op
> directly. But even gcc cannot do that with our *current* function,
> because of that "the memory value is 32-bit, we require a 64-bit
> value"
>
> Anyway, I can get gcc to generate the right code:
>
> movq 32(%r13), %rdx
> cmp (%rdx),%ebx
> sbb %esi,%esi
>
> which is basically the right code for the six crazy instructions clang
> generates. And if I make the "g" be "ir", I can get clang to generate
>
> movq 32(%rdi), %rcx
> movl (%rcx), %eax
> cmpl %eax, %esi
> sbbl %esi, %esi
>
> which is the same thing, but with that (pointless) load to a register.
>
> And now clang doesn't generate that stack frame at all.
>
> Anyway, this was a long email to explain the odd attached patch.
>
> Comments? Note that this patch is *entirely* untested, I have done
> this purely by looking at the code generation in fs/file.c.
>
> Linus

> arch/x86/include/asm/barrier.h | 23 +++++++++--------------
> 1 file changed, 9 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 66e57c010392..6159d2cbbfde 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -33,20 +33,15 @@
> * Returns:
> * 0 - (index < size)
> */
> -static __always_inline unsigned long array_index_mask_nospec(unsigned long index,
> - unsigned long size)
> -{
> - unsigned long mask;
> -
> - asm volatile ("cmp %1,%2; sbb %0,%0;"
> - :"=r" (mask)
> - :"g"(size),"r" (index)
> - :"cc");
> - return mask;
> -}
> -
> -/* Override the default implementation from linux/nospec.h. */
> -#define array_index_mask_nospec array_index_mask_nospec
> +#define array_index_mask_nospec(idx,sz) ({ \
> + typeof((idx)+(sz)) __idx = (idx); \
> + typeof(__idx) __sz = (sz); \
> + typeof(__idx) __mask; \
> + asm volatile ("cmp %1,%2; sbb %0,%0" \
> + :"=r" (__mask) \
> + :"ir"(__sz),"r" (__idx) \
> + :"cc"); \
> + __mask; })
>
> /* Prevent speculative execution past this barrier. */
> #define barrier_nospec() asm volatile("lfence":::"memory")


2024-04-06 12:30:51

by Uros Bizjak

Subject: Re: More annoying code generation by clang

On Sat, Apr 6, 2024 at 12:56 PM Ingo Molnar <[email protected]> wrote:
>
>
> [ I've Cc:-ed a few more people who might be interested in this. ]
> [ Copy of Linus's email below. ]
>
> * Linus Torvalds <[email protected]> wrote:
>
> > So this doesn't really matter in any real life situation, but it
> > really grated on me.
> >
> > Clang has this nasty habit of taking our nice asm constraints, and
> > turning them into worst-case garbage. It's been reported a couple of
> > times where we use "g" to tell the compiler that pretty much any
> > source to the asm works, and then clang takes that to mean "I will
> > take that to use 'memory'" even when that makes no sense what-so-ever.
> >
> > See for example
> >
> > https://lore.kernel.org/all/CAHk-=wgobnShg4c2yyMbk2p=U-wmnOmX_0=b3ZY_479Jjey2xw@mail.gmail.com/
> >
> > where I was ranting about clang just doing pointlessly stupid things.
> >
> > However, I found a case where yes, clang does pointlessly stupid
> > things, but it's at least _partly_ our fault, and gcc can't generate
> > optimal code either.
> >
> > We have this fairly critical code in __fget_files_rcu() to look up a
> > 'struct file *' from an fd, and it does this:
> >
> > /* Mask is a 0 for invalid fd's, ~0 for valid ones */
> > nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);
> >
> > and clang makes a *horrid* mess of it, generating this code:
> >
> > movl %edi, %r14d
> > movq 32(%rbx), %rdx
> > movl (%rdx), %eax
> > movq %rax, 8(%rsp)
> > cmpq 8(%rsp), %r14
> > sbbq %rcx, %rcx
> >
> > which is just crazy. Notice how it does that "move rax to stack, then
> > do the compare against the stack", instead of just using %rax.
> >
> > In fact, that function shouldn't have a stack frame at all, and the
> > only reason it is generated is because of this whole oddity.
> >
> > All clang's fault, right?
> >
> > Yeah, mostly. But it turns out that what really messes with clangs
> > little head is that the x86 array_index_mask_nospec() function is
> > being a bit annoying.
> >
> > This is what we do:
> >
> > static __always_inline unsigned long
> > array_index_mask_nospec(unsigned long index,
> > unsigned long size)
> > {
> > unsigned long mask;
> >
> > asm volatile ("cmp %1,%2; sbb %0,%0;"
> > :"=r" (mask)
> > :"g"(size),"r" (index)
> > :"cc");
> > return mask;
> > }
> >
> > and look at the use again:
> >
> > nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);
> >
> > here all the values are actually 'unsigned int'. So what happens is
> > that clang can't just use the fdt->max_fds value *directly* from
> > memory, because it needs to be expanded from 32-bit to 64-bit because
> > we've made our array_index_mask_nospec() function only work on 64-bit
> > 'unsigned long' values.
> >
> > So it turns out that by massaging this a bit, and making it just be a
> > macro - so that the asm can decide that "I can do this in 32-bit" - I
> > can get clang to generate much better code.
> >
> > Clang still absolutely hates the "g" constraint, so to get clang to
> > really get this right I have to use "ir" instead of "g". Which is
> > wrong. Because gcc does this right, and could use the memory op
> > directly. But even gcc cannot do that with our *current* function,
> > because of that "the memory value is 32-bit, we require a 64-bit
> > value"
> >
> > Anyway, I can get gcc to generate the right code:
> >
> > movq 32(%r13), %rdx
> > cmp (%rdx),%ebx
> > sbb %esi,%esi
> >
> > which is basically the right code for the six crazy instructions clang
> > generates. And if I make the "g" be "ir", I can get clang to generate
> >
> > movq 32(%rdi), %rcx
> > movl (%rcx), %eax
> > cmpl %eax, %esi
> > sbbl %esi, %esi
> >
> > which is the same thing, but with that (pointless) load to a register.
> >
> > And now clang doesn't generate that stack frame at all.
> >
> > Anyway, this was a long email to explain the odd attached patch.
> >
> > Comments? Note that this patch is *entirely* untested, I have done
> > this purely by looking at the code generation in fs/file.c.

FYI, please note that gcc-12 is able to synthesize carry-flag compares
on its own:

--cut here--
unsigned long foo (unsigned long index, unsigned long size)
{
        return (index <= size) ? 0 : -1;
}

extern unsigned int size;

unsigned long bar (unsigned int index)
{
        return (index <= size) ? 0 : -1;
}
--cut here--

gcc-12 -O2:

foo:
cmpq %rdi, %rsi
sbbq %rax, %rax
ret
bar:
cmpl %edi, size(%rip)
sbbq %rax, %rax
ret

Uros.

> >
> > Linus
>
> > arch/x86/include/asm/barrier.h | 23 +++++++++--------------
> > 1 file changed, 9 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> > index 66e57c010392..6159d2cbbfde 100644
> > --- a/arch/x86/include/asm/barrier.h
> > +++ b/arch/x86/include/asm/barrier.h
> > @@ -33,20 +33,15 @@
> > * Returns:
> > * 0 - (index < size)
> > */
> > -static __always_inline unsigned long array_index_mask_nospec(unsigned long index,
> > - unsigned long size)
> > -{
> > - unsigned long mask;
> > -
> > - asm volatile ("cmp %1,%2; sbb %0,%0;"
> > - :"=r" (mask)
> > - :"g"(size),"r" (index)
> > - :"cc");
> > - return mask;
> > -}
> > -
> > -/* Override the default implementation from linux/nospec.h. */
> > -#define array_index_mask_nospec array_index_mask_nospec
> > +#define array_index_mask_nospec(idx,sz) ({ \
> > + typeof((idx)+(sz)) __idx = (idx); \
> > + typeof(__idx) __sz = (sz); \
> > + typeof(__idx) __mask; \
> > + asm volatile ("cmp %1,%2; sbb %0,%0" \
> > + :"=r" (__mask) \
> > + :"ir"(__sz),"r" (__idx) \
> > + :"cc"); \
> > + __mask; })
> >
> > /* Prevent speculative execution past this barrier. */
> > #define barrier_nospec() asm volatile("lfence":::"memory")
>

2024-04-06 15:40:20

by Linus Torvalds

Subject: Re: More annoying code generation by clang

On Sat, 6 Apr 2024 at 05:30, Uros Bizjak <[email protected]> wrote:
>
> FYI, please note that gcc-12 is able to synthesize carry-flag compares
> on its own:

Oh, gcc has been able to do that for much longer than that. It's an
idiomatic i386 pattern, and gcc has generated it for as long as I can
remember.

HOWEVER.

There's a big difference between "able to" and "GUARANTEED to".

Because this code actually requires a data dependency and not a
control dependency as a correctness issue, because of Spectre-v1.

So while I know very well that gcc _can_ do it, I also know very well
that there are absolutely no guarantees that gcc won't use a
conditional branch instead.

So this code needs to generate good code because it's actually
important code that shows up in benchmarks, but this code also needs
to generate a very _particular_ pattern of code, and it's not good
enough that gcc may "happen" to generate that pattern of code.
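
To make that concrete, here is a minimal sketch (not kernel code) of
the "obvious" C version of the mask, and why it can't be trusted:

/*
 * Sketch only: a compiler may lower this ?: either to a flag-based
 * cmp/sbb sequence or to a conditional branch; nothing in the C
 * semantics forces one over the other.
 */
static inline unsigned long mask_nospec_c(unsigned long index,
                                          unsigned long size)
{
        /* ~0 for valid (index < size), 0 for invalid */
        return (index < size) ? ~0UL : 0UL;
}

If the compiler picks the branch, the CPU can mispredict it and
speculatively run the "valid index" path with an out-of-range index,
which is exactly the Spectre-v1 window the mask is there to close.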

Thus the inline asm.

Linus

2024-04-06 16:05:25

by Linus Torvalds

Subject: Re: More annoying code generation by clang

On Sat, 6 Apr 2024 at 08:39, Linus Torvalds
<[email protected]> wrote:
>
> Because this code actually requires a data-depencency and not a
> control dependency as a correctness issue because of Spectre-v1.

Just to clarify: our comments in this code are maybe a bit odd,
because they are not so much about the Spectre-v1 issue (which
predates the rewrite) as about the odd RCU pattern and conditional
avoidance we use here:

unsigned long nospec_mask;

/* Mask is a 0 for invalid fd's, ~0 for valid ones */
nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);

/*
 * fdentry points to the 'fd' offset, or fdt->fd[0].
 * Loading from fdt->fd[0] is always safe, because the
 * array always exists.
 */
fdentry = fdt->fd + (fd & nospec_mask);

/* Do the load, then mask any invalid result */
file = rcu_dereference_raw(*fdentry);

where *normally* (if RCU wasn't an issue) we'd just write this as

file = fdt->fd[array_index_nospec(fd, fdt->max_fds)];

where the key part is that "nospec" array indexing that will not
speculatively access the array past the "max_fds".

IOW, the code naively would want to do just

if (fd < fdt->max_fds) {
        file = fdt->fd[fd];
        ...

but we need to make sure that it can't be fooled into using a branch
mispredict and use a user-controlled index ("fd") to speculatively
access the array with an arbitrary index and then leak unrelated data
through some side channel (mostly cache access).

And while the normal pattern doesn't expose the mask generation and
just hides that mask in that simpler "array_index_nospec()" macro,
this code actually ends up using the same mask *twice*, because it
will later end up doing this hack:

file = (void *)(nospec_mask & (unsigned long)file);
if (unlikely(!file))
        return NULL;

to have just one single conditional at the end (ie we may have loaded
a non-NULL file pointer from fdt->fd[0] because an invalid index got
masked down to a zero index, and the second masking will mask away
that pointer and make it NULL, because we're bad people and we know
that NULL is "bitpattern 0" and we care about the code working, not
about some unreal "NULL could be anything else" thing).

End result: this code, which is just a few lines long, has more
comments than code, and generates only a handful of instructions, is
fairly subtle but also fairly important both for hardware security
issues and for performance.
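
Putting those fragments together, the fast path is roughly this (a
simplified sketch with a made-up function name, not the verbatim
kernel code, which also has to deal with refcounting and retries):

static struct file *fd_lookup_sketch(struct fdtable *fdt, unsigned int fd)
{
        unsigned long nospec_mask;
        struct file *file;
        struct file __rcu **fdentry;

        /* Mask is a 0 for invalid fd's, ~0 for valid ones */
        nospec_mask = array_index_mask_nospec(fd, fdt->max_fds);

        /* First use: clamp the index so speculation can't leave the array */
        fdentry = fdt->fd + (fd & nospec_mask);
        file = rcu_dereference_raw(*fdentry);

        /* Second use: turn a bogus fdt->fd[0] load back into NULL */
        file = (void *)(nospec_mask & (unsigned long)file);
        if (unlikely(!file))
                return NULL;
        return file;
}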

See commit 253ca8678d30 ("Improve __fget_files_rcu() code generation
(and thus __fget_light())") that actually started doing this "use mask
twice" trick, and realize that that commit is what this performance
regression report is talking about:

https://lore.kernel.org/all/ZWQ+LEcfFFi4YOAU@xsang-OptiPlex-9020/

ie that whole "use masks and avoid doing the obvious thing" may be a
bit subtle, but it's what turned a 2.9% performance regression into a
3.4% improvement.

(Ok, those performance numbers are on just one random microbenchmark
and don't really matter, so take that with a pinch of salt, but if you
care about a _lot_ of random benchmarks, eventually you get good
performance overall).

Anyway, hopefully that explains the dual issue here: we care about
performance, but we also have to use a specific instruction pattern,
and can't just hope for the best.

Linus

2024-04-08 12:29:16

by Peter Zijlstra

Subject: Re: More annoying code generation by clang

On Thu, Apr 04, 2024 at 03:53:49PM -0700, Linus Torvalds wrote:

> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 66e57c010392..6159d2cbbfde 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -33,20 +33,15 @@
> * Returns:
> * 0 - (index < size)
> */
> +#define array_index_mask_nospec(idx,sz) ({ \
> + typeof((idx)+(sz)) __idx = (idx); \
> + typeof(__idx) __sz = (sz); \
> + typeof(__idx) __mask; \
> + asm volatile ("cmp %1,%2; sbb %0,%0" \
> + :"=r" (__mask) \
> + :"ir"(__sz),"r" (__idx) \
> + :"cc"); \
> + __mask; })

Should this not carry a comment about the "ir" constraint wanting to be
"g" except for clang being daft?

(I really wish clang would go fix this, it keeps coming up time and
again).

>
> /* Prevent speculative execution past this barrier. */
> #define barrier_nospec() asm volatile("lfence":::"memory")


2024-04-08 13:46:32

by H. Peter Anvin

Subject: Re: More annoying code generation by clang

On April 8, 2024 1:49:34 AM PDT, Peter Zijlstra <[email protected]> wrote:
>On Thu, Apr 04, 2024 at 03:53:49PM -0700, Linus Torvalds wrote:
>
>> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
>> index 66e57c010392..6159d2cbbfde 100644
>> --- a/arch/x86/include/asm/barrier.h
>> +++ b/arch/x86/include/asm/barrier.h
>> @@ -33,20 +33,15 @@
>> * Returns:
>> * 0 - (index < size)
>> */
>> +#define array_index_mask_nospec(idx,sz) ({ \
>> + typeof((idx)+(sz)) __idx = (idx); \
>> + typeof(__idx) __sz = (sz); \
>> + typeof(__idx) __mask; \
>> + asm volatile ("cmp %1,%2; sbb %0,%0" \
>> + :"=r" (__mask) \
>> + :"ir"(__sz),"r" (__idx) \
>> + :"cc"); \
>> + __mask; })
>
>Should this not carry a comment about the "ir" constraint wanting to be
>"g" except for clang being daft?
>
>(I really wish clang would go fix this, it keeps coming up time and
>again).
>
>>
>> /* Prevent speculative execution past this barrier. */
>> #define barrier_nospec() asm volatile("lfence":::"memory")
>

If the only reason for "ir" as opposed to "g" (= "irm") is clang then it really needs to be called out. Or better yet, don't do anything and let the clang people actually fix their code generation.

2024-04-08 18:33:25

by Linus Torvalds

Subject: Re: More annoying code generation by clang

On Mon, 8 Apr 2024 at 01:49, Peter Zijlstra <[email protected]> wrote:
>
> Should this not carry a comment about the "ir" constraint wanting to be
> "g" except for clang being daft?

Yeah. Except I think I'll do something like

/* Clang messes up "g" as an asm source */
#define ASM_SOURCE_G "ir"

in <linux/compiler-clang.h>, and

#ifndef ASM_SOURCE_G
#define ASM_SOURCE_G "g"
#endif

in linux/compiler.h.

> (I really wish clang would go fix this, it keeps coming up time and
> again).

It's been reported long ago, and it seems to be hard to fix.

I suspect the issue is that the inline asm format is fairly closely
related to the gcc machine descriptions (look at the machine
descriptor files in gcc, and if you can ignore the horrid LISP-style
syntax you see how close they are).

And clang has a different model and needs to "translate" things, and
that one doesn't translate.

It's not like we don't have workarounds for gcc bugs in this area too
(eg "asm_goto_output()", née "asm_volatile_goto()").

There was another bug in my patch, though: the output mask should
always be "unsigned long", not tied to the input type.

Linus

2024-04-08 19:42:59

by Linus Torvalds

Subject: Re: More annoying code generation by clang

On Mon, 8 Apr 2024 at 11:32, Linus Torvalds
<[email protected]> wrote:
>
> It's been reported long ago, it seems to be hard to fix.
>
> I suspect the issue is that the inline asm format is fairly closely
> related to the gcc machine descriptions (look at the machine
> descriptor files in gcc, and if you can ignore the horrid LISP-style
> syntax you see how close they are).

Actually, one of the github issues pages has more of an explanation
(and yes, it's tied to impedance issues between the inline asm syntax
and how clang works):

https://github.com/llvm/llvm-project/issues/20571#issuecomment-980933442

so I wrote more of a commit log and did that "ASM_SOURCE_G" thing
(except I decided to call it "input" instead of "source", since that's
the standard inline asm language).

This version also has that output size fixed, and the commit message
talks about it.

This does *not* fix other inline asms to use "ASM_INPUT_G/RM".

I think it's mainly some of the bitop code that people have noticed
before - fls and variable_ffs() and friends.

I suspect clang is more common in the arm64 world than it is for
x86-64 kernel developers, and arm64 inline asm basically never uses
"rm" or "g" since arm64 doesn't have instructions that take either a
register or a memory operand.

Anyway, with gcc this generates

cmp (%rdx),%ebx; sbb %rax,%rax # _7->max_fds, fd, __mask

IOW, it uses the memory location for "max_fds". It couldn't do that
before, because it used to think that it always had to do the compare
in 64 bits, and the memory location is only 32-bit.

With clang, this generates

movl (%rcx), %eax
cmpl %eax, %edi
sbbq %rdi, %rdi

which has that extra register use, but is at least much better than
what it used to generate with crazy "load into register, spill to
stack, then compare against stack contents".

Linus


Attachments:
0001-x86-improve-array_index_mask_nospec-code-generation.patch (4.45 kB)

2024-04-09 10:46:00

by Peter Zijlstra

Subject: Re: More annoying code generation by clang

On Mon, Apr 08, 2024 at 12:42:31PM -0700, Linus Torvalds wrote:

> Actually, one of the github issues pages has more of an explanation
> (and yes, it's tied to impedance issues between the inline asm syntax
> and how clang works):
>
> https://github.com/llvm/llvm-project/issues/20571#issuecomment-980933442
>

So that same issue seems to suggest Nick is actually working on this and
got stuff merged. Nick, what is the status of your efforts and should we
indeed do the below as Linus suggests or should he upgrade his compiler?

> diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> index 49feac0162a5..0dee061fd7a6 100644
> --- a/include/linux/compiler-clang.h
> +++ b/include/linux/compiler-clang.h
> @@ -118,3 +118,15 @@
>
> #define __diag_ignore_all(option, comment) \
> __diag_clang(13, ignore, option)
> +
> +/*
> + * clang has horrible behavior with "g" or "rm" constraints for asm
> + * inputs, turning them into something worse than "m". Avoid using
> + * constraints with multiple possible uses (but "ir" seems to be ok):
> + *
> + * https://github.com/llvm/llvm-project/issues/20571
> + * https://github.com/llvm/llvm-project/issues/30873
> + * https://github.com/llvm/llvm-project/issues/34837
> + */
> +#define ASM_INPUT_G "ir"
> +#define ASM_INPUT_RM "r"
> diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> index 2abaa3a825a9..e53acd310545 100644
> --- a/include/linux/compiler_types.h
> +++ b/include/linux/compiler_types.h
> @@ -380,6 +380,15 @@ struct ftrace_likely_data {
> #define asm_goto_output(x...) asm volatile goto(x)
> #endif
>
> +/*
> + * Clang has trouble with constraints with multiple
> + * alternative behaviors (mainly "g" and "rm").
> + */
> +#ifndef ASM_INPUT_G
> + #define ASM_INPUT_G "g"
> + #define ASM_INPUT_RM "rm"
> +#endif
> +
> #ifdef CONFIG_CC_HAS_ASM_INLINE
> #define asm_inline asm __inline
> #else
> --
> 2.44.0.330.g4d18c88175
>


2024-04-09 15:40:21

by Nick Desaulniers

Subject: Re: More annoying code generation by clang

On Tue, Apr 9, 2024 at 3:45 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Apr 08, 2024 at 12:42:31PM -0700, Linus Torvalds wrote:
>
> > Actually, one of the github issues pages has more of an explanation
> > (and yes, it's tied to impedance issues between the inline asm syntax
> > and how clang works):
> >
> > https://github.com/llvm/llvm-project/issues/20571#issuecomment-980933442
> >
>
> So that same issue seems to suggest Nick is actually working on this and
> got stuff merged. Nick, what is the status of your efforts and should we
> indeed do the below as Linus suggests or should he upgrade his compiler?

Sorry, I'm no longer working on the issue. I should mark that as such.

The feature got hung up on rewriting one of the register allocation
frameworks in llvm.
https://github.com/llvm/llvm-project/pull/74344

I have a new set of responsibilities at work, so I probably won't be
working on that issue any time soon.

>
> > diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> > index 49feac0162a5..0dee061fd7a6 100644
> > --- a/include/linux/compiler-clang.h
> > +++ b/include/linux/compiler-clang.h
> > @@ -118,3 +118,15 @@
> >
> > #define __diag_ignore_all(option, comment) \
> > __diag_clang(13, ignore, option)
> > +
> > +/*
> > + * clang has horrible behavior with "g" or "rm" constraints for asm
> > + * inputs, turning them into something worse than "m". Avoid using
> > + * constraints with multiple possible uses (but "ir" seems to be ok):
> > + *
> > + * https://github.com/llvm/llvm-project/issues/20571
> > + * https://github.com/llvm/llvm-project/issues/30873
> > + * https://github.com/llvm/llvm-project/issues/34837

20571 is the canonical bug for this.

> > + */
> > +#define ASM_INPUT_G "ir"
> > +#define ASM_INPUT_RM "r"
> > diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> > index 2abaa3a825a9..e53acd310545 100644
> > --- a/include/linux/compiler_types.h
> > +++ b/include/linux/compiler_types.h
> > @@ -380,6 +380,15 @@ struct ftrace_likely_data {
> > #define asm_goto_output(x...) asm volatile goto(x)
> > #endif
> >
> > +/*
> > + * Clang has trouble with constraints with multiple
> > + * alternative behaviors (mainly "g" and "rm").
> > + */
> > +#ifndef ASM_INPUT_G
> > + #define ASM_INPUT_G "g"
> > + #define ASM_INPUT_RM "rm"
> > +#endif
> > +
> > #ifdef CONFIG_CC_HAS_ASM_INLINE
> > #define asm_inline asm __inline
> > #else
> > --
> > 2.44.0.330.g4d18c88175
> >
>


--
Thanks,
~Nick Desaulniers

2024-04-10 08:12:51

by David Laight

Subject: RE: More annoying code generation by clang

From: Linus Torvalds
> Sent: 08 April 2024 20:43
...
> I think it's mainly some of the bitop code that people have noticed
> before - fls and variable_ffs() and friends.
>
> I suspect clang is more common in the arm64 world than it is for
> x86-64 kernel developers, and arm64 inline asm basically never uses
> "rm" or "g" since arm64 doesn't have instructions that take either a
> register or a memory operand.
>
> Anyway, with gcc this generates
>
> cmp (%rdx),%ebx; sbb %rax,%rax # _7->max_fds, fd, __mask
>
> IOW, it uses the memory location for "max_fds". It couldn't do that
> before, because it used to think that it always had to do the compare
> in 64 bits, and the memory location is only 32-bit.
>
> With clang, this generates
>
> movl (%rcx), %eax
> cmpl %eax, %edi
> sbbq %rdi, %rdi
>
> which has that extra register use, but is at least much better than
> what it used to generate with crazy "load into register, spill to
> stack, then compare against stack contents".

Provided the compiler can find a register, I doubt the extra
instruction makes much difference.
The 'cmp (%rdx),%ebx' ends up being 2 u-ops, the same as
the movl/cmpl pair.
Instruction decode and retirement aren't often bottlenecks on recent CPUs.
So I suspect the main difference is cache footprint.

Trying to measure the difference is probably impossible...

You'll probably get a bigger difference by changing a lot of
function results and parameters to 'unsigned long' to remove
all the zero-extending that happens.
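
For example, a quick sketch (hypothetical functions, typical x86-64
codegen noted in the comments) of the kind of zero-extending I mean:

void *pick_uint(void **arr, unsigned int idx)
{
        /* usually costs an extra zero-extend, e.g. "movl %esi, %esi",
         * before idx can be used as a 64-bit index */
        return arr[idx];
}

void *pick_ulong(void **arr, unsigned long idx)
{
        /* the index is already 64-bit, so no extension is needed */
        return arr[idx];
}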

David


2024-04-10 19:23:38

by Bill Wendling

Subject: Re: More annoying code generation by clang

On Tue, Apr 9, 2024 at 8:37 AM Nick Desaulniers <[email protected]> wrote:
>
> On Tue, Apr 9, 2024 at 3:45 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Apr 08, 2024 at 12:42:31PM -0700, Linus Torvalds wrote:
> >
> > > Actually, one of the github issues pages has more of an explanation
> > > (and yes, it's tied to impedance issues between the inline asm syntax
> > > and how clang works):
> > >
> > > https://github.com/llvm/llvm-project/issues/20571#issuecomment-980933442
> > >
> >
> > So that same issue seems to suggest Nick is actually working on this and
> > got stuff merged. Nick, what is the status of your efforts and should we
> > indeed do the below as Linus suggests or should he upgrade his compiler?
>
> Sorry, I'm no longer working on the issue. I should mark that as such.
>
> The feature got hung up on rewriting one of the register allocation
> frameworks in llvm.
> https://github.com/llvm/llvm-project/pull/74344
>
> I have a new set of responsibilities at work so I probably wont be
> working on that issue any time soon.
>
I plan on picking this up. Could you please create a WIP branch on
GitHub with your current work?

-bw

> >
> > > diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> > > index 49feac0162a5..0dee061fd7a6 100644
> > > --- a/include/linux/compiler-clang.h
> > > +++ b/include/linux/compiler-clang.h
> > > @@ -118,3 +118,15 @@
> > >
> > > #define __diag_ignore_all(option, comment) \
> > > __diag_clang(13, ignore, option)
> > > +
> > > +/*
> > > + * clang has horrible behavior with "g" or "rm" constraints for asm
> > > + * inputs, turning them into something worse than "m". Avoid using
> > > + * constraints with multiple possible uses (but "ir" seems to be ok):
> > > + *
> > > + * https://github.com/llvm/llvm-project/issues/20571
> > > + * https://github.com/llvm/llvm-project/issues/30873
> > > + * https://github.com/llvm/llvm-project/issues/34837
>
> 20571 is the cannonical bug for this.
>
> > > + */
> > > +#define ASM_INPUT_G "ir"
> > > +#define ASM_INPUT_RM "r"
> > > diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> > > index 2abaa3a825a9..e53acd310545 100644
> > > --- a/include/linux/compiler_types.h
> > > +++ b/include/linux/compiler_types.h
> > > @@ -380,6 +380,15 @@ struct ftrace_likely_data {
> > > #define asm_goto_output(x...) asm volatile goto(x)
> > > #endif
> > >
> > > +/*
> > > + * Clang has trouble with constraints with multiple
> > > + * alternative behaviors (mainly "g" and "rm").
> > > + */
> > > +#ifndef ASM_INPUT_G
> > > + #define ASM_INPUT_G "g"
> > > + #define ASM_INPUT_RM "rm"
> > > +#endif
> > > +
> > > #ifdef CONFIG_CC_HAS_ASM_INLINE
> > > #define asm_inline asm __inline
> > > #else
> > > --
> > > 2.44.0.330.g4d18c88175
> > >
> >
>
>
> --
> Thanks,
> ~Nick Desaulniers
>