2023-07-21 17:09:22

by Brian Gerst

Subject: [PATCH v2 0/6] x86: Clean up fast syscall return validation

This patch set cleans up the tests done to determine if a fast syscall
return instruction can be used to return to userspace. It converts the
code to C, and refactors existing code to be more readable.

v2:
- Fix shift value for canonical RIP test and use
__is_canonical_address()

Brian Gerst (6):
x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
x86/entry/64: Convert SYSRET validation tests to C
x86/entry/compat: Combine return value test from syscall handler
x86/entry/32: Convert do_fast_syscall_32() to bool return type
x86/entry/32: Remove SEP test for SYSEXIT
x86/entry/32: Clean up syscall fast exit tests

arch/x86/entry/common.c | 99 +++++++++++++++++++++-----------
arch/x86/entry/entry_32.S | 2 +-
arch/x86/entry/entry_64.S | 68 +---------------------
arch/x86/entry/entry_64_compat.S | 12 ++--
arch/x86/include/asm/syscall.h | 6 +-
5 files changed, 77 insertions(+), 110 deletions(-)

--
2.41.0



2023-10-05 14:17:43

by Ingo Molnar

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation


* Brian Gerst <[email protected]> wrote:

> This patch set cleans up the tests done to determine if a fast syscall
> return instruction can be used to return to userspace. It converts the
> code to C, and refactors existing code to be more readable.
>
> v2:
> - Fix shift value for canonical RIP test and use
> __is_canonical_address()
>
> Brian Gerst (6):
> x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
> x86/entry/64: Convert SYSRET validation tests to C
> x86/entry/compat: Combine return value test from syscall handler
> x86/entry/32: Convert do_fast_syscall_32() to bool return type
> x86/entry/32: Remove SEP test for SYSEXIT
> x86/entry/32: Clean up syscall fast exit tests
>
> arch/x86/entry/common.c | 99 +++++++++++++++++++++-----------
> arch/x86/entry/entry_32.S | 2 +-
> arch/x86/entry/entry_64.S | 68 +---------------------
> arch/x86/entry/entry_64_compat.S | 12 ++--
> arch/x86/include/asm/syscall.h | 6 +-
> 5 files changed, 77 insertions(+), 110 deletions(-)

Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry
(i.e. skipped #2 & #6 for now), as they look correct and are good
improvements. None of these four patches depend on the skipped
patches in some way I missed, correct?

As for #2, I looked at the before/after disassembly, and the new
C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
if I was reading it right.

Mind re-evaluating that, and if you still think the C conversion
is a good idea, mind putting a before/after analysis of the
generated instructions into the changelog? This is our primary
system call return path after all.

Thanks,

Ingo

2023-10-05 16:21:07

by Brian Gerst

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation

On Thu, Oct 5, 2023 at 4:22 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Brian Gerst <[email protected]> wrote:
>
> > This patch set cleans up the tests done to determine if a fast syscall
> > return instruction can be used to return to userspace. It converts the
> > code to C, and refactors existing code to be more readable.
> >
> > v2:
> > - Fix shift value for canonical RIP test and use
> > __is_canonical_address()
> >
> > Brian Gerst (6):
> > x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
> > x86/entry/64: Convert SYSRET validation tests to C
> > x86/entry/compat: Combine return value test from syscall handler
> > x86/entry/32: Convert do_fast_syscall_32() to bool return type
> > x86/entry/32: Remove SEP test for SYSEXIT
> > x86/entry/32: Clean up syscall fast exit tests
> >
> > arch/x86/entry/common.c | 99 +++++++++++++++++++++-----------
> > arch/x86/entry/entry_32.S | 2 +-
> > arch/x86/entry/entry_64.S | 68 +---------------------
> > arch/x86/entry/entry_64_compat.S | 12 ++--
> > arch/x86/include/asm/syscall.h | 6 +-
> > 5 files changed, 77 insertions(+), 110 deletions(-)
>
> Ok, so I've applied patches #1, #3, #4 and #5 to tip:x86/entry,
> (ie. skipped #2 & #6 for now), as they look correct and are good
> improvements. None of these four patches depend on the skipped
> patches in some way I missed, correct?
>
> As for #2, I looked at the before/after disassembly, and the new
> C code in do_syscall_64() looked suboptimal on x86-64 defconfig,
> if I was reading it right.
>
> Mind re-evaluating that, and if you still think the C conversion
> is a good idea, mind putting a before/after analysis of the
> generated instructions into the changelog? This is our primary
> system call return path after all.

Looking at the compiled output, the only suboptimal code appears to be
the canonical address test, where the C code uses the CL register for
the shifts instead of immediates.

180: e9 00 00 00 00 jmp 185 <do_syscall_64+0x85>
181: R_X86_64_PC32 .altinstr_aux-0x4
185: b9 07 00 00 00 mov $0x7,%ecx
18a: eb 05 jmp 191 <do_syscall_64+0x91>
18c: b9 10 00 00 00 mov $0x10,%ecx
191: 48 89 c2 mov %rax,%rdx
194: 48 d3 e2 shl %cl,%rdx
197: 48 d3 fa sar %cl,%rdx
19a: 48 39 d0 cmp %rdx,%rax
19d: 75 39 jne 1d8 <do_syscall_64+0xd8>
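
For reference, the C test here boils down to __is_canonical_address()
from arch/x86/include/asm/page.h, which (quoting roughly from memory,
the exact signatures may differ) looks like:

        static __always_inline u64 __canonical_address(u64 vaddr, u8 vaddr_bits)
        {
                /* sign-extend from bit (vaddr_bits - 1) */
                return ((s64)vaddr << (64 - vaddr_bits)) >> (64 - vaddr_bits);
        }

        static __always_inline bool __is_canonical_address(u64 vaddr, u8 vaddr_bits)
        {
                /* canonical iff sign-extension is a no-op */
                return __canonical_address(vaddr, vaddr_bits) == vaddr;
        }

Since the address width is a runtime value (47+1 or 56+1 bits depending
on LA57), the shift count has to live in a register, hence the
mov $0x10 / mov $0x7 (64 - 48 and 64 - 57) and the %cl shifts above.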

Was there anything else specifically that you can point out?

Brian Gerst

2023-10-05 20:21:47

by Ingo Molnar

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation


* Brian Gerst <[email protected]> wrote:

> Looking at the compiled output, the only suboptimal code appears to be
> the canonical address test, where the C code uses the CL register for
> the shifts instead of immediates.
>
> 180: e9 00 00 00 00 jmp 185 <do_syscall_64+0x85>
> 181: R_X86_64_PC32 .altinstr_aux-0x4
> 185: b9 07 00 00 00 mov $0x7,%ecx
> 18a: eb 05 jmp 191 <do_syscall_64+0x91>
> 18c: b9 10 00 00 00 mov $0x10,%ecx
> 191: 48 89 c2 mov %rax,%rdx
> 194: 48 d3 e2 shl %cl,%rdx
> 197: 48 d3 fa sar %cl,%rdx
> 19a: 48 39 d0 cmp %rdx,%rax
> 19d: 75 39 jne 1d8 <do_syscall_64+0xd8>

Yeah, it didn't look equivalent - so I guess we want a C equivalent for:

- ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
- "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

instead of the pgtable_l5_enabled() runtime test that
__is_canonical_address() uses?

Thanks,

Ingo

2023-10-06 19:00:36

by H. Peter Anvin

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation

On 10/5/23 13:20, Ingo Molnar wrote:
>
> * Brian Gerst <[email protected]> wrote:
>
>> Looking at the compiled output, the only suboptimal code appears to be
>> the canonical address test, where the C code uses the CL register for
>> the shifts instead of immediates.
>>
>> 180: e9 00 00 00 00 jmp 185 <do_syscall_64+0x85>
>> 181: R_X86_64_PC32 .altinstr_aux-0x4
>> 185: b9 07 00 00 00 mov $0x7,%ecx
>> 18a: eb 05 jmp 191 <do_syscall_64+0x91>
>> 18c: b9 10 00 00 00 mov $0x10,%ecx
>> 191: 48 89 c2 mov %rax,%rdx
>> 194: 48 d3 e2 shl %cl,%rdx
>> 197: 48 d3 fa sar %cl,%rdx
>> 19a: 48 39 d0 cmp %rdx,%rax
>> 19d: 75 39 jne 1d8 <do_syscall_64+0xd8>
>
> Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
>
> - ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> - "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
>
> instead of the pgtable_l5_enabled() runtime test that
> __is_canonical_address() uses?
>

I don't think such a thing would be possible without immediate
patching support[*] (short of simply duplicating the above as an
alternative asm, which is obviously easy enough and still lets the
compiler pick the register used).

Incidentally, this is a question for Uros: is there a reason this is a
mov to %ecx and not just %cl, which would save 3 bytes?

Incidentally, it is possible to save one instruction and use only *one*
alternative immediate:

leaq (%rax,%rax),%rdx
xorq %rax,%rdx
shrq $(63 - LA),%rdx # Yes, 63, not 64
# ZF=1 if canonical

This works because if bit [x] is set in the output, then bits [x] and
[x-1] in the input are different (bit [-1] considered to be zero); and
by definition an address is canonical if and only if all the bits [63:LA]
are identical, thus bits [63:LA+1] in the output must all be zero.

The first two instructions are pure arithmetic and can thus be done in C:

bar = foo ^ (foo << 1);

... leaving only one instruction needing to be patched at runtime.
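
A minimal C sketch of that (illustrative only; ip_canonical() is a
made-up helper name, and the shift counts are spelled out numerically
rather than derived from a symbolic LA):

        static __always_inline bool ip_canonical(unsigned long ip)
        {
                /* bit x of fold is set iff bits x and x-1 of ip differ */
                unsigned long fold = ip ^ (ip << 1);

                /* canonical iff no differing adjacent bits at or above
                   the top implemented address bit */
                asm(ALTERNATIVE("shrq $48, %[fold]",  /* 4-level: bits 63:47 equal */
                                "shrq $57, %[fold]",  /* 5-level: bits 63:56 equal */
                                X86_FEATURE_LA57)
                    : [fold] "+r" (fold));

                return fold == 0;
        }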

-hpa



[*] which is a whole bit of infrastructure that I know first-hand is
extremely complex[**] -- I had an almost-complete patchset done at one
time, but it has severely bitrotted. The good part is that it is
probably a lot easier to do today, with the alternatives-patching
callback routines available.

[**] the absolute best would of course be if such infrastructure could
be compiler-supported (probably as part of some really fancy support for
alternatives/static branch), but that would probably be a nightmare to
do in the compiler or a plugin.


2023-10-06 21:33:26

by Brian Gerst

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation

On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <[email protected]> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <[email protected]> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >> 180: e9 00 00 00 00 jmp 185 <do_syscall_64+0x85>
> >> 181: R_X86_64_PC32 .altinstr_aux-0x4
> >> 185: b9 07 00 00 00 mov $0x7,%ecx
> >> 18a: eb 05 jmp 191 <do_syscall_64+0x91>
> >> 18c: b9 10 00 00 00 mov $0x10,%ecx
> >> 191: 48 89 c2 mov %rax,%rdx
> >> 194: 48 d3 e2 shl %cl,%rdx
> >> 197: 48 d3 fa sar %cl,%rdx
> >> 19a: 48 39 d0 cmp %rdx,%rax
> >> 19d: 75 39 jne 1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > - ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > - "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
> >
>
> I don't think such a thing (without simply duplicate the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?
>
> Incidentally, it is possible to save one instruction and use only *one*
> alternative immediate:
>
> leaq (%rax,%rax),%rdx
> xorq %rax,%rdx
> shrq $(63 - LA),%rdx # Yes, 63, not 64
> # ZF=1 if canonical
>
> This works because if bit [x] is set in the output, then bit [x] and
> [x-1] in the input are different (bit [-1] considered to be zero); and
> by definition a bit is canonical if and only if all the bits [63:LA] are
> identical, thus bits [63:LA+1] in the output must all be zero.
>
> The first two instructions are pure arithmetic and can thus be done in C:
>
> bar = foo ^ (foo << 1);
>
> ... leaving only one instruction needing to be patched at runtime.
>
> -hpa

One other alternative I have been considering is comparing against
TASK_SIZE_MAX. The only user-executable address above that is the
long-deprecated vsyscall page. IMHO it's not worth optimizing for
that case, so just let it fall back to using IRET.

if (unlikely(regs->ip >= TASK_SIZE_MAX)) return false;

compiles to:

180: 48 b9 00 f0 ff ff ff movabs $0x7ffffffff000,%rcx
187: 7f 00 00
18a: 48 39 c8 cmp %rcx,%rax
18d: 73 39 jae 1c8 <do_syscall_64+0xc8>

0000000000000000 <.altinstr_replacement>:
0: 48 b9 00 f0 ff ff ff movabs $0xfffffffffff000,%rcx
7: ff ff 00
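
(The movabs plus the .altinstr_replacement entry come from TASK_SIZE_MAX
itself; if I remember arch/x86/include/asm/page_64.h right, task_size_max()
is roughly:)

        static __always_inline unsigned long task_size_max(void)
        {
                unsigned long ret;

                /* patched immediate: 4-level vs 5-level user address limit */
                alternative_io("movq %[small],%0", "movq %[large],%0",
                               X86_FEATURE_LA57,
                               "=r" (ret),
                               [small] "i" ((1ul << 47) - PAGE_SIZE),
                               [large] "i" ((1ul << 56) - PAGE_SIZE));

                return ret;
        }

so the limit stays a patched immediate and there is no branch at all.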

Brian Gerst

2023-10-06 23:59:23

by H. Peter Anvin

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation

On 10/6/23 11:59, H. Peter Anvin wrote:
>
> Incidentally, it is possible to save one instruction and use only *one*
> alternative immediate:
>
>     leaq (%rax,%rax),%rdx
>     xorq %rax,%rdx
>     shrq $(63 - LA),%rdx        # Yes, 63, not 64
>     # ZF=1 if canonical
>
> This works because if bit [x] is set in the output, then bit [x] and
> [x-1] in the input are different (bit [-1] considered to be zero); and
> by definition a bit is canonical if and only if all the bits [63:LA] are
> identical, thus bits [63:LA+1] in the output must all be zero.
>

Yes, I'm a doofus. Bits [63:LA-1] must be identical, so 64 is correct :$)

-hpa

2023-10-07 09:42:22

by Ingo Molnar

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation


* Brian Gerst <[email protected]> wrote:

> On Fri, Oct 6, 2023 at 2:59 PM H. Peter Anvin <[email protected]> wrote:
> >
> > On 10/5/23 13:20, Ingo Molnar wrote:
> > >
> > > * Brian Gerst <[email protected]> wrote:
> > >
> > >> Looking at the compiled output, the only suboptimal code appears to be
> > >> the canonical address test, where the C code uses the CL register for
> > >> the shifts instead of immediates.
> > >>
> > >> 180: e9 00 00 00 00 jmp 185 <do_syscall_64+0x85>
> > >> 181: R_X86_64_PC32 .altinstr_aux-0x4
> > >> 185: b9 07 00 00 00 mov $0x7,%ecx
> > >> 18a: eb 05 jmp 191 <do_syscall_64+0x91>
> > >> 18c: b9 10 00 00 00 mov $0x10,%ecx
> > >> 191: 48 89 c2 mov %rax,%rdx
> > >> 194: 48 d3 e2 shl %cl,%rdx
> > >> 197: 48 d3 fa sar %cl,%rdx
> > >> 19a: 48 39 d0 cmp %rdx,%rax
> > >> 19d: 75 39 jne 1d8 <do_syscall_64+0xd8>
> > >
> > > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> > >
> > > - ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > > - "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> > >
> > > instead of the pgtable_l5_enabled() runtime test that
> > > __is_canonical_address() uses?
> > >
> >
> > I don't think such a thing (without simply duplicate the above as an
> > alternative asm, which is obviously easy enough, and still allows the
> > compiler to pick the register used) would be possible without immediate
> > patching support[*].
> >
> > Incidentally, this is a question for Uros: is there a reason this is a
> > mov to %ecx and not just %cl, which would save 3 bytes?
> >
> > Incidentally, it is possible to save one instruction and use only *one*
> > alternative immediate:
> >
> > leaq (%rax,%rax),%rdx
> > xorq %rax,%rdx
> > shrq $(63 - LA),%rdx # Yes, 63, not 64
> > # ZF=1 if canonical
> >
> > This works because if bit [x] is set in the output, then bit [x] and
> > [x-1] in the input are different (bit [-1] considered to be zero); and
> > by definition a bit is canonical if and only if all the bits [63:LA] are
> > identical, thus bits [63:LA+1] in the output must all be zero.
> >
> > The first two instructions are pure arithmetic and can thus be done in C:
> >
> > bar = foo ^ (foo << 1);
> >
> > ... leaving only one instruction needing to be patched at runtime.
> >
> > -hpa
>
> One other alternative I have been considering is comparing against
> TASK_SIZE_MAX. The only user-executable address above that is the
> long deprecated vsyscall page. IMHO it's not worth optimizing for
> that case, so just let it fall back to using IRET.
>
> if (unlikely(regs->ip >= TASK_SIZE_MAX)) return false;
>
> compiles to:
>
> 180: 48 b9 00 f0 ff ff ff movabs $0x7ffffffff000,%rcx
> 187: 7f 00 00
> 18a: 48 39 c8 cmp %rcx,%rax
> 18d: 73 39 jae 1c8 <do_syscall_64+0xc8>
>
> 0000000000000000 <.altinstr_replacement>:
> 0: 48 b9 00 f0 ff ff ff movabs $0xfffffffffff000,%rcx
> 7: ff ff 00

That sounds good - and we could do this as a separate patch on top
of your existing patches, to keep it bisectable in case there's
any problems.

Thanks,

Ingo

2023-10-07 09:57:36

by Uros Bizjak

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation

On Fri, Oct 6, 2023 at 8:59 PM H. Peter Anvin <[email protected]> wrote:
>
> On 10/5/23 13:20, Ingo Molnar wrote:
> >
> > * Brian Gerst <[email protected]> wrote:
> >
> >> Looking at the compiled output, the only suboptimal code appears to be
> >> the canonical address test, where the C code uses the CL register for
> >> the shifts instead of immediates.
> >>
> >> 180: e9 00 00 00 00 jmp 185 <do_syscall_64+0x85>
> >> 181: R_X86_64_PC32 .altinstr_aux-0x4
> >> 185: b9 07 00 00 00 mov $0x7,%ecx
> >> 18a: eb 05 jmp 191 <do_syscall_64+0x91>
> >> 18c: b9 10 00 00 00 mov $0x10,%ecx
> >> 191: 48 89 c2 mov %rax,%rdx
> >> 194: 48 d3 e2 shl %cl,%rdx
> >> 197: 48 d3 fa sar %cl,%rdx
> >> 19a: 48 39 d0 cmp %rdx,%rax
> >> 19d: 75 39 jne 1d8 <do_syscall_64+0xd8>
> >
> > Yeah, it didn't look equivalent - so I guess we want a C equivalent for:
> >
> > - ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> > - "shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
> >
> > instead of the pgtable_l5_enabled() runtime test that
> > __is_canonical_address() uses?
> >
>
> I don't think such a thing (without simply duplicate the above as an
> alternative asm, which is obviously easy enough, and still allows the
> compiler to pick the register used) would be possible without immediate
> patching support[*].
>
> Incidentally, this is a question for Uros: is there a reason this is a
> mov to %ecx and not just %cl, which would save 3 bytes?

The compiler uses 32-bit moves between registers, even when the values
are less than 32 bits wide. To avoid partial register stalls, it uses
32-bit operations as much as possible (e.g. it zero-extends when loading
an 8-bit value from memory, uses a 32-bit load for constants, etc.).
Since the kernel is compiled with -O2, the compiler does not care that
much about instruction size, so it uses the full 32-bit width to
initialize a register with a constant.

Please note that an 8-bit movb instruction in fact represents an insert
into a word-mode register. The compiler does not know how this register
will later be used, so to avoid partial register stalls it takes the
cautious approach and (with -O2) moves the constant into the register
with a full-width instruction.
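
To make that concrete, the two possible encodings here would be:

        b1 07                   movb $0x7,%cl   # 2 bytes, writes only CL
        b9 07 00 00 00          movl $0x7,%ecx  # 5 bytes, writes all of ECX
                                                # (and zero-extends into RCX)

so with -O2 the compiler prefers the larger form to avoid the partial
register update.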

Also, the compiler is quite eager to CSE constants. When there are two
or more uses of the same constant, it will first move the constant into
a register.

Uros.

2023-10-07 18:09:15

by Linus Torvalds

Subject: Re: [PATCH v2 0/6] x86: Clean up fast syscall return validation

On Sat, 7 Oct 2023 at 02:57, Uros Bizjak <[email protected]> wrote:
>
> Please note that 8-bit movb instruction in fact represents insert into
> word-mode register. The compiler does not know how this word-mode
> register will be used, so to avoid partial register stalls, it takes a
> cautious approach and (with -O2) moves constant to a register with a
> word-width instruction.

Yes. In this case I really think it's our kernel code that is disgusting.

People seem to think that because cpu_feature_enabled() uses static
jumps, it's somehow "free". Not so. The static jumps are often
horrendous for code quality, and they effectively make the compiler
blind to otherwise obvious optimizations.

We need to stop using those so blindly. I see code like this

return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;

and it just makes me sad. And yes, that virtual mask is disgusting. This:

#define __VIRTUAL_MASK_SHIFT (pgtable_l5_enabled() ? 56 : 47)

is *NOT* cheap. The end result is *not* a constant, it's a two-way
branch, and while the branch is static and the compiler may then
optimize each side of the branch with the constant in question, most
of the time it just results in *disgusting* code like this where gcc
just has to load a constant into a register, and treat it as a
variable.

With the code shuffling, it's possibly *more* expensive than just
loading from memory, in that you just change a D$ load into an I$ one
instead. I'm sure that looks great on benchmarks that stay entirely in
the I$, but ...

Something like using alternatives is *much* cheaper. Better yet is the
suggestion to simplify the conditionals entirely so they don't depend on
the virtual shift at all (i.e. our recent access_ok() simplifications
that were due to _another_ CPU feature).
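
Roughly, this is the shape of code the static-jump form forces
(illustrative only; it's the same shape as the mov $0x7 / mov $0x10
sequence in the disassembly earlier in this thread):

        # pgtable_l5_enabled() ? 56 : 47 -- the patched jump picks a block,
        # but each block still has to materialize its constant in a register:
                jmp     1f              # patched to fall through or jump
                mov     $47, %ecx
                jmp     2f
        1:      mov     $56, %ecx
        2:      ...                     # %ecx is now a "variable" shift count

versus an alternative, where the constant stays an immediate inside the
one instruction that actually needs it.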

> Also, the compiler is quite eager to CSE constants. When there are two
> or more uses of the same constant, it will move a constant into the
> register first.

Now, honestly, that may be a good thing for other architectures that
have issues with constants, but on x86 - and integer constants - it's
likely a bad thing unless the constant is particularly complicated.

Most x86 instructions will take the usual 8/32-bit constants (but if
we have arbitrary 64-bit ones, you tend to need them in registers).

Linus