2015-08-24 21:14:13

by Andy Lutomirski

Subject: Proposal for finishing the 64-bit x86 syscall cleanup

Hi all-

I want to (try to) mostly or fully get rid of the messy bits (as
opposed to the hardware-bs-forced bits) of the 64-bit syscall asm.
There are two major conceptual things that are in the way.

Thing 1: partial pt_regs

64-bit fast path syscalls don't fully initialize pt_regs: bx, bp, and
r12-r15 are uninitialized. Some syscalls require them to be
initialized, and they have special awful stubs to do it. The entry
and exit tracing code (except for phase1 tracing) also needs them
initialized, and it has its own messy initialization. Compat
syscalls are their own private little mess here.

This gets in the way of all kinds of cleanups, because C code can't
switch between the full and partial pt_regs states.
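
To make this concrete, here is roughly the 64-bit pt_regs layout (a
sketch from memory; see arch/x86/include/asm/ptrace.h for the
authoritative version), with the fields the fast path skips marked:

struct pt_regs {
	unsigned long r15;	/* uninitialized on the fast path */
	unsigned long r14;	/* uninitialized on the fast path */
	unsigned long r13;	/* uninitialized on the fast path */
	unsigned long r12;	/* uninitialized on the fast path */
	unsigned long bp;	/* uninitialized on the fast path */
	unsigned long bx;	/* uninitialized on the fast path */
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	unsigned long orig_ax;	/* syscall number on entry */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
};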

I can see two ways out. We could remove the optimization entirely;
that would mean pushing and popping six more registers, which adds
about ten cycles to fast path syscalls on Sandy Bridge. It would also
simplify and presumably speed up the slow paths.

We could also annotate which syscalls need full regs and jump to the
slow path for them. This would leave the fast path unchanged (we
could duplicate the syscall table so that regs-requiring syscalls
would turn into some asm that switches to the slow path). We'd make
the syscall table say something like:

59 64 execve sys_execve:regs

The fast path would have exactly identical performance and the slow
path would presumably speed up. The down side would be additional
complexity.

Thing 2: vdso compilation with binutils that doesn't support .cfi directives

Userspace debuggers really like having the vdso properly
CFI-annotated, and the 32-bit fast syscall entries are annotated
manually in hexadecimal. AFAIK Jan Beulich is the only person who
understands it.

I want to be able to change the entries a little bit to clean them up
(and possibly rework the SYSCALL32 and SYSENTER register tricks, which
currently suck), but it's really, really messy right now because of
the hex CFI stuff. Could we just drop the CFI annotations if the
binutils version is too old or even just require new enough binutils
to build 32-bit and compat kernels?

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC


2015-08-25 07:29:43

by Jan Beulich

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

>>> Andy Lutomirski <[email protected]> 08/24/15 11:14 PM >>>
>Thing 1: partial pt_regs
>
>64-bit fast path syscalls don't fully initialize pt_regs: bx, bp, and
>r12-r15 are uninitialized. Some syscalls require them to be
>initialized, and they have special awful stubs to do it. The entry
>and exit tracing code (except for phase1 tracing) also needs them
>initialized, and it has its own messy initialization. Compat
>syscalls are their own private little mess here.
>
>This gets in the way of all kinds of cleanups, because C code can't
>switch between the full and partial pt_regs states.
>
>I can see two ways out. We could remove the optimization entirely;
>that would mean pushing and popping six more registers, which adds
>about ten cycles to fast path syscalls on Sandy Bridge. It would also
>simplify and presumably speed up the slow paths.
>
>We could also annotate which syscalls need full regs and jump to the
>slow path for them. This would leave the fast path unchanged (we
>could duplicate the syscall table so that regs-requiring syscalls
>would turn into some asm that switches to the slow path). We'd make
>the syscall table say something like:
>
>59 64 execve sys_execve:regs
>
>The fast path would have exactly identical performance and the slow
>path would presumably speed up. The down side would be additional
>complexity.

Namely - would this be any better than the current, "special awful" stubs?

>Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>
>Userspace debuggers really like having the vdso properly
>CFI-annotated, and the 32-bit fast syscall entries are annotated
>manually in hexadecimal. AFAIK Jan Beulich is the only person who
>understands it.
>
>I want to be able to change the entries a little bit to clean them up
>(and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>currently suck), but it's really, really messy right now because of
>the hex CFI stuff. Could we just drop the CFI annotations if the
>binutils version is too old or even just require new enough binutils
>to build 32-bit and compat kernels?

I think that's a reasonable thing - iirc the oldest binutils I'm building with
(SLE10, i.e. 2.16.91-ish) supports them, and I'd suppose the equally old
RHEL binutils do too. Not sure if there are any other long-maintained
distros that might carry even older binutils.

Jan

2015-08-25 08:18:49

by Ingo Molnar

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup


* Andy Lutomirski <[email protected]> wrote:

> Hi all-
>
> I want to (try to) mostly or fully get rid of the messy bits (as
> opposed to the hardware-bs-forced bits) of the 64-bit syscall asm.
> There are two major conceptual things that are in the way.
>
> Thing 1: partial pt_regs
>
> 64-bit fast path syscalls don't fully initialize pt_regs: bx, bp, and
> r12-r15 are uninitialized. Some syscalls require them to be
> initialized, and they have special awful stubs to do it. The entry
> and exit tracing code (except for phase1 tracing) also needs them
> initialized, and it has its own messy initialization. Compat
> syscalls are their own private little mess here.
>
> This gets in the way of all kinds of cleanups, because C code can't
> switch between the full and partial pt_regs states.
>
> I can see two ways out. We could remove the optimization entirely;
> that would mean pushing and popping six more registers, which adds
> about ten cycles to fast path syscalls on Sandy Bridge. It would also
> simplify and presumably speed up the slow paths.

So out of hundreds of regular system calls there's only a handful of such system
calls:

triton:~/tip> git grep stub arch/x86/entry/syscalls/
arch/x86/entry/syscalls/syscall_32.tbl:2 i386 fork sys_fork stub32_fork
arch/x86/entry/syscalls/syscall_32.tbl:11 i386 execve sys_execve stub32_execve
arch/x86/entry/syscalls/syscall_32.tbl:119 i386 sigreturn sys_sigreturn stub32_sigreturn
arch/x86/entry/syscalls/syscall_32.tbl:120 i386 clone sys_clone stub32_clone
arch/x86/entry/syscalls/syscall_32.tbl:173 i386 rt_sigreturn sys_rt_sigreturn stub32_rt_sigreturn
arch/x86/entry/syscalls/syscall_32.tbl:190 i386 vfork sys_vfork stub32_vfork
arch/x86/entry/syscalls/syscall_32.tbl:358 i386 execveat sys_execveat stub32_execveat
arch/x86/entry/syscalls/syscall_64.tbl:15 64 rt_sigreturn stub_rt_sigreturn
arch/x86/entry/syscalls/syscall_64.tbl:56 common clone stub_clone
arch/x86/entry/syscalls/syscall_64.tbl:57 common fork stub_fork
arch/x86/entry/syscalls/syscall_64.tbl:58 common vfork stub_vfork
arch/x86/entry/syscalls/syscall_64.tbl:59 64 execve stub_execve
arch/x86/entry/syscalls/syscall_64.tbl:322 64 execveat stub_execveat
arch/x86/entry/syscalls/syscall_64.tbl:513 x32 rt_sigreturn stub_x32_rt_sigreturn
arch/x86/entry/syscalls/syscall_64.tbl:520 x32 execve stub_x32_execve
arch/x86/entry/syscalls/syscall_64.tbl:545 x32 execveat stub_x32_execveat

and none of them are super performance-critical system calls, so no way would I go
for unconditionally saving/restoring all of pt_regs, just to make it a bit simpler
for these syscalls.

> We could also annotate which syscalls need full regs and jump to the
> slow path for them. This would leave the fast path unchanged (we
> could duplicate the syscall table so that regs-requiring syscalls
> would turn into some asm that switches to the slow path). We'd make
> the syscall table say something like:
>
> 59 64 execve sys_execve:regs
>
> The fast path would have exactly identical performance and the slow
> path would presumably speed up. The down side would be additional
> complexity.

The 'fast path performance unchanged' aspect definitely gives me warm fuzzy
feelings.

Your suggested annotation would essentially be a syntactical cleanup, in that we'd
auto-generate the stubs during build, instead of the current ugly open-coded
stubs? Or did you have something else in mind?


> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>
> Userspace debuggers really like having the vdso properly
> CFI-annotated, and the 32-bit fast syscall entries are annotated
> manually in hexadecimal. AFAIK Jan Beulich is the only person who
> understands it.
>
> I want to be able to change the entries a little bit to clean them up
> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
> currently suck), but it's really, really messy right now because of
> the hex CFI stuff. Could we just drop the CFI annotations if the
> binutils version is too old or even just require new enough binutils
> to build 32-bit and compat kernels?

We could also test for those directives and not generate debuginfo on such
tooling. Not generating debuginfo is still much better than failing the build.

I'm all for removing the hex-encoded debuginfo.

Thanks,

Ingo

2015-08-25 08:42:40

by Ingo Molnar

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup


* Ingo Molnar <[email protected]> wrote:

>
> * Andy Lutomirski <[email protected]> wrote:
>
> > Hi all-
> >
> > I want to (try to) mostly or fully get rid of the messy bits (as
> > opposed to the hardware-bs-forced bits) of the 64-bit syscall asm.
> > There are two major conceptual things that are in the way.
> >
> > Thing 1: partial pt_regs
> >
> > 64-bit fast path syscalls don't fully initialize pt_regs: bx, bp, and
> > r12-r15 are uninitialized. Some syscalls require them to be
> > initialized, and they have special awful stubs to do it. The entry
> > and exit tracing code (except for phase1 tracing) also needs them
> > initialized, and it has its own messy initialization. Compat
> > syscalls are their own private little mess here.
> >
> > This gets in the way of all kinds of cleanups, because C code can't
> > switch between the full and partial pt_regs states.
> >
> > I can see two ways out. We could remove the optimization entirely;
> > that would mean pushing and popping six more registers, which adds
> > about ten cycles to fast path syscalls on Sandy Bridge. It would also
> > simplify and presumably speed up the slow paths.
>
> So out of hundreds of regular system calls there's only a handful of such system
> calls:
>
> triton:~/tip> git grep stub arch/x86/entry/syscalls/
> arch/x86/entry/syscalls/syscall_32.tbl:2 i386 fork sys_fork stub32_fork
> arch/x86/entry/syscalls/syscall_32.tbl:11 i386 execve sys_execve stub32_execve
> arch/x86/entry/syscalls/syscall_32.tbl:119 i386 sigreturn sys_sigreturn stub32_sigreturn
> arch/x86/entry/syscalls/syscall_32.tbl:120 i386 clone sys_clone stub32_clone
> arch/x86/entry/syscalls/syscall_32.tbl:173 i386 rt_sigreturn sys_rt_sigreturn stub32_rt_sigreturn
> arch/x86/entry/syscalls/syscall_32.tbl:190 i386 vfork sys_vfork stub32_vfork
> arch/x86/entry/syscalls/syscall_32.tbl:358 i386 execveat sys_execveat stub32_execveat
> arch/x86/entry/syscalls/syscall_64.tbl:15 64 rt_sigreturn stub_rt_sigreturn
> arch/x86/entry/syscalls/syscall_64.tbl:56 common clone stub_clone
> arch/x86/entry/syscalls/syscall_64.tbl:57 common fork stub_fork
> arch/x86/entry/syscalls/syscall_64.tbl:58 common vfork stub_vfork
> arch/x86/entry/syscalls/syscall_64.tbl:59 64 execve stub_execve
> arch/x86/entry/syscalls/syscall_64.tbl:322 64 execveat stub_execveat
> arch/x86/entry/syscalls/syscall_64.tbl:513 x32 rt_sigreturn stub_x32_rt_sigreturn
> arch/x86/entry/syscalls/syscall_64.tbl:520 x32 execve stub_x32_execve
> arch/x86/entry/syscalls/syscall_64.tbl:545 x32 execveat stub_x32_execveat
>
> and none of them are super performance-critical system calls, so no way would I
> go for unconditionally saving/restoring all of pt_regs, just to make it a bit
> simpler for these syscalls.

Let me qualify that: no way in the long run.

In the short run we can drop the optimization and reintroduce it later, to lower
all the risks that the C conversion brings with it.

( That would also make it easier to re-analyze the cost/benefit ratio of the
optimization. )

So feel free to introduce a simple pt_regs save/restore pattern for now.

Thanks,

Ingo

2015-08-25 10:59:55

by Brian Gerst

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

On Mon, Aug 24, 2015 at 5:13 PM, Andy Lutomirski <[email protected]> wrote:
> Hi all-
>
> I want to (try to) mostly or fully get rid of the messy bits (as
> opposed to the hardware-bs-forced bits) of the 64-bit syscall asm.
> There are two major conceptual things that are in the way.
>
> Thing 1: partial pt_regs
>
> 64-bit fast path syscalls don't fully initialize pt_regs: bx, bp, and
> r12-r15 are uninitialized. Some syscalls require them to be
> initialized, and they have special awful stubs to do it. The entry
> and exit tracing code (except for phase1 tracing) also needs them
> initialized, and it has its own messy initialization. Compat
> syscalls are their own private little mess here.
>
> This gets in the way of all kinds of cleanups, because C code can't
> switch between the full and partial pt_regs states.
>
> I can see two ways out. We could remove the optimization entirely;
> that would mean pushing and popping six more registers, which adds
> about ten cycles to fast path syscalls on Sandy Bridge. It would also
> simplify and presumably speed up the slow paths.
>
> We could also annotate which syscalls need full regs and jump to the
> slow path for them. This would leave the fast path unchanged (we
> could duplicate the syscall table so that regs-requiring syscalls
> would turn into some asm that switches to the slow path). We'd make
> the syscall table say something like:
>
> 59 64 execve sys_execve:regs
>
> The fast path would have exactly identical performance and the slow
> path would presumably speed up. The down side would be additional
> complexity.

I don't think it is worth it to optimize the syscalls that need full
pt_regs (which are generally quite expensive and less frequently used)
at the expense of every other syscall.

What kind of cleanups, other than just removing the stubs, would this
allow? Is there more code you plan to move to C?

> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>
> Userspace debuggers really like having the vdso properly
> CFI-annotated, and the 32-bit fast syscall entries are annotated
> manually in hexadecimal. AFAIK Jan Beulich is the only person who
> understands it.
>
> I want to be able to change the entries a little bit to clean them up
> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
> currently suck), but it's really, really messy right now because of
> the hex CFI stuff. Could we just drop the CFI annotations if the
> binutils version is too old or even just require new enough binutils
> to build 32-bit and compat kernels?

One thing I want to do is rework the 32-bit VDSO into a single image,
using alternatives to handle the selection of entry method. The
open-coded CFI crap has made that near impossible to do.

--
Brian Gerst

2015-08-25 16:29:21

by Andy Lutomirski

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

On Tue, Aug 25, 2015 at 3:59 AM, Brian Gerst <[email protected]> wrote:
> On Mon, Aug 24, 2015 at 5:13 PM, Andy Lutomirski <[email protected]> wrote:
>>
>> We could also annotate which syscalls need full regs and jump to the
>> slow path for them. This would leave the fast path unchanged (we
>> could duplicate the syscall table so that regs-requiring syscalls
>> would turn into some asm that switches to the slow path). We'd make
>> the syscall table say something like:
>>
>> 59 64 execve sys_execve:regs
>>
>> The fast path would have exactly identical performance and the slow
>> path would presumably speed up. The down side would be additional
>> complexity.
>
> I don't think it is worth it to optimize the syscalls that need full
> pt_regs (which are generally quite expensive and less frequently used)
> at the expense of every other syscall.
>
> What kind of cleanups, other than just removing the stubs, would this
> allow? Is there more code you plan to move to C?

This isn't about optimizing the regs-using syscalls at all -- it's
about simplifying all the other ones and optimizing the slow path.

The way that the regs-using syscalls currently work is that the entry
in the syscall table expects to see rbx, rbp, and r12-r15 in
*registers* and it shoves them into pt_regs and pulls them back out.
This means that we pretty much have to call syscalls from asm, which
precludes the straightforward re-implementation of the whole slow path
as:

void do_slow_syscall(...) {
	enter_from_user_mode();
	fixup_arg5 [if compat fast syscall];
	seccomp, etc;
	if (nr < max)
		call the syscall;
	exit tracing;
	prepare_return_to_usermode();
}

I bet that, with a bit of tweaking, that would actually end up faster
than what we do right now for everything except fully fast-path
syscalls. This would also be a *huge* sanity improvement for the
compat case in which the args are currently jumbled in asm. It would
become:

if (nr < max)
	call the syscall(regs->bx, regs->cx, regs->dx, ...);

which completely avoids the unreadable and probably buggy mess we have now.
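
Spelled out with the i386 argument order, it would be something like
this (a sketch, not tested; ia32_sys_call_table is the existing compat
table):

	if (likely(nr < IA32_NR_syscalls))
		regs->ax = ia32_sys_call_table[nr](regs->bx, regs->cx,
						   regs->dx, regs->si,
						   regs->di, regs->bp);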

We could just get rid of the compat fast path entirely -- I would be a
bit surprised if anyone cared about a couple cycles for compat, but I
don't think it's a great idea long-term to have the compat path fully
written in C but the native 64-bit path partially in asm.

My concrete idea here is to have two 64-bit syscall tables: fast and
slow. The slow table would point to the real C functions for all
syscalls. The fast table would be the same except for the syscalls
that use regs; for those syscalls it would point to:

GLOBAL(stub_switch_to_slow_path_64)
	popq %r11 /* discard return address */
	movq %rbp, RBP(%rsp), etc;
	jmp entry_SYSCALL_64_slow_path
END(stub_switch_to_slow_path_64)

so that the regs-using syscalls take the slow path no matter what.
This doesn't even require autogenerated stubs, since they can all
share the same stub.
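
As a sketch (macro and table names invented for illustration; extern
declarations omitted), the two tables could be generated from the same
header:

/* Slow table: every entry is the real C function. */
#define __SYSCALL_64(nr, sym)		[nr] = sym,
#define __SYSCALL_64_REGS(nr, sym)	[nr] = sym,
const sys_call_ptr_t sys_call_table_slow[] = {
#include <asm/syscalls_64.h>
};

/* Fast table: regs-using entries point at the shared stub instead. */
#undef __SYSCALL_64_REGS
#define __SYSCALL_64_REGS(nr, sym)	[nr] = stub_switch_to_slow_path_64,
const sys_call_ptr_t sys_call_table_fast[] = {
#include <asm/syscalls_64.h>
};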

Now the 64-bit fast path can stay more or less the same (we'd reorder
the first flags test and the subq $(6*8), %rsp), and the slow path can
be almost all in C.

Then I can back out the two-phase entry tracing thing, and after
*that*, muahaha, I can dust off some languishing seccomp improvements
I have that are incompatible with two-phase entry tracing.

(I have a half-written test case to exercise the dark corners of
syscall args and tracing. So far it catches a bug in SYSCALL32 that
was apparently never fixed (which makes me wonder why signal-heavy
workloads work on AMD systems in compat mode), but I haven't extended
it enough to catch the R9 thing.)

>
>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>
>> Userspace debuggers really like having the vdso properly
>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>> manually in hexadecimal. AFAIK Jan Beulich is the only person who
>> understands it.
>>
>> I want to be able to change the entries a little bit to clean them up
>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>> currently suck), but it's really, really messy right now because of
>> the hex CFI stuff. Could we just drop the CFI annotations if the
>> binutils version is too old or even just require new enough binutils
>> to build 32-bit and compat kernels?
>
> One thing I want to do is rework the 32-bit VDSO into a single image,
> using alternatives to handle the selection of entry method. The
> open-coded CFI crap has made that near impossible to do.
>

Yes please!

But please don't change the actual instruction ordering at all yet,
since the SYSCALL case seems to be buggy right now.

(If you want to be really fancy, don't use alternatives. Instead
teach vdso2c to annotate the actual dynamic table function pointers so
we can rewrite the pointers at boot time. That will save a cycle or
two.)

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-08-25 16:59:14

by Linus Torvalds

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

On Tue, Aug 25, 2015 at 9:28 AM, Andy Lutomirski <[email protected]> wrote:
>
> I bet that, with a bit of tweaking, that would actually end up faster
> than what we do right now for everything except fully fast-path
> syscalls. This would also be a *huge* sanity improvement for the
> compat case in which the args are currently jumbled in asm. It would
> become:
>
> if (nr < max)
> 	call the syscall(regs->bx, regs->cx, regs->dx, ...);
>
> which completely avoids the unreadable and probably buggy mess we have now.

I'm willing to try it. Just pass in the system call number and the
regs pointer as the arguments, and keep the asm portion as the minimal
"save regs, call function, restore regs, return" sequence with
absolutely nothing else going on.

The main cost would be a few more push/pop instructions, and then the
loads from the stack frame in the C function. Go for it.
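
For concreteness, that shape would be something like the following
sketch (function name invented, untested):

	void do_syscall_64(unsigned long nr, struct pt_regs *regs)
	{
		/* 64-bit syscall args live in di, si, dx, r10, r8, r9 */
		if (likely(nr < NR_syscalls))
			regs->ax = sys_call_table[nr](regs->di, regs->si,
						      regs->dx, regs->r10,
						      regs->r8, regs->r9);
	}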

The system calls where the extra five cycles might be noticeable are
all irrelevant anyway (ie "getppid()" etc). They aren't actually
performance-critical, even if they may be used for benchmarks.

But please do make it be very tight push/pop sequences, not those
crazy "movq %reg,%off(%rsp)" things.

Linus

2015-08-26 05:20:55

by Brian Gerst

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

>>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>>
>>> Userspace debuggers really like having the vdso properly
>>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>>> manually in hexadecimal. AFAIK Jan Beulich is the only person who
>>> understands it.
>>>
>>> I want to be able to change the entries a little bit to clean them up
>>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>>> currently suck), but it's really, really messy right now because of
>>> the hex CFI stuff. Could we just drop the CFI annotations if the
>>> binutils version is too old or even just require new enough binutils
>>> to build 32-bit and compat kernels?
>>
>> One thing I want to do is rework the 32-bit VDSO into a single image,
>> using alternatives to handle the selection of entry method. The
>> open-coded CFI crap has made that near impossible to do.
>>
>
> Yes please!
>
> But please don't change the actual instruction ordering at all yet,
> since the SYSCALL case seems to be buggy right now.
>
> (If you want to be really fancy, don't use alternatives. Instead
> teach vdso2c to annotate the actual dynamic table function pointers so
> we can rewrite the pointers at boot time. That will save a cycle or
> two.)

The easiest way to select the right entry code is by changing the ELF
AUX vector. That covers the normal usage, but there are two additional
cases that need addressing.

1) Some code could possibly look up the __kernel_vsyscall symbol
directly and call it, but that's non-standard. If there is code out
there that does this, we could update the ELF symbol table to point
__kernel_vsyscall to the chosen entry point, or just remove the symbol
and let the caller fall back to INT80.

2) The sigreturn trampolines. These are tricky because the sigreturn
syscalls implicitly use regs->sp to find the signal frame. That
interacts badly with the SYSENTER/SYSCALL entries, which save
registers on the stack. The trampoline currently uses a bare SYSCALL
instruction (no pushes to the stack), but falls back to INT80 for
SYSENTER. One option is to create new syscalls that take a pointer to
the signal frame as arg1.

--
Brian Gerst

2015-08-26 17:11:08

by Andy Lutomirski

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

On Tue, Aug 25, 2015 at 10:20 PM, Brian Gerst <[email protected]> wrote:
>>>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>>>
>>>> Userspace debuggers really like having the vdso properly
>>>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>>>> manually in hexadecimal. AFAIK Jan Beulich is the only person who
>>>> understands it.
>>>>
>>>> I want to be able to change the entries a little bit to clean them up
>>>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>>>> currently suck), but it's really, really messy right now because of
>>>> the hex CFI stuff. Could we just drop the CFI annotations if the
>>>> binutils version is too old or even just require new enough binutils
>>>> to build 32-bit and compat kernels?
>>>
>>> One thing I want to do is rework the 32-bit VDSO into a single image,
>>> using alternatives to handle the selection of entry method. The
>>> open-coded CFI crap has made that near impossible to do.
>>>
>>
>> Yes please!
>>
>> But please don't change the actual instruction ordering at all yet,
>> since the SYSCALL case seems to be buggy right now.
>>
>> (If you want to be really fancy, don't use alternatives. Instead
>> teach vdso2c to annotate the actual dynamic table function pointers so
>> we can rewrite the pointers at boot time. That will save a cycle or
>> two.)
>
> The easiest way to select the right entry code is by changing the ELF
> AUX vector. That covers the normal usage, but there are two additional
> cases that need addressing.
>
> 1) Some code could possibly look up the __kernel_vsyscall symbol
> directly and call it, but that's non-standard. If there is code out
> there that does this, we could update the ELF symbol table to point
> __kernel_vsyscall to the chosen entry point, or just remove the symbol
> and let the caller fall back to INT80.

Here's an alternate proposal, which is mostly copied from what I
posted here yesterday afternoon:

https://bugzilla.kernel.org/show_bug.cgi?id=101061

I think we should consider doing this:

__kernel_vsyscall:
	push %ecx
	push %edx
	movl %esp, %edx

ALTERNATIVE (Intel with SEP):
	sysenter

ALTERNATIVE (AMD with SYSCALL32 on 64-bit kernels):
	syscall
	hlt /* keep weird binary tracers from utterly screwing up */

ALTERNATIVE (if neither of the other cases apply):
	nops

	movl %edx, %esp		/* restore esp */
	movl (%esp), %edx	/* reload saved edx */
	movl 4(%esp), %ecx	/* reload saved ecx */
	int $0x80
vsyscall_after_int80:
	popl %edx
	popl %ecx
	ret

First, in the case where we have neither SEP nor SYSCALL32, I claim
that this Just Works. We push a couple regs, pointlessly shuffle esp,
restore the regs, do int $0x80 with the same regs we started with, and
then (again, pointlessly) pop the regs we pushed.

Now we change the semantics of *both* syscall32 and sysenter so that they
load edx into regs->sp, fetch regs->dx and regs->cx from memory, and
set regs->ip to vsyscall_after_int80. (This is a wee bit slower than
the current sysenter code because it does two memory fetches instead
of one.) Then they proceed just like int80. In particular, anything
that does "ip -= 2" works exactly like int80 because it points at an
actual int80 instruction.
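
In kernel terms, the fixup would be roughly this sketch (error
handling and exact types glossed over):

	regs->sp = regs->dx;	/* edx held the user esp */
	get_user(regs->dx, (u32 __user *)regs->sp);	    /* saved edx */
	get_user(regs->cx, (u32 __user *)(regs->sp + 4));   /* saved ecx */
	regs->ip = (unsigned long)vsyscall_after_int80;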

Note that sysenter's slow path already sort of works like this.

Now we've fixed the correctness issues but we've killed performance,
as we'll use IRET instead of SYSRET to get back out. We can fix that
using opportunistic sysret. If we're returning from a compat syscall
that entered via sysenter or syscall32, if regs->ip ==
vsyscall_after_int80, regs->r11 == regs->flags, regs->ss == __USER_DS,
regs->cs == __USER32_CS, and flags are sane, then return using
SYSRETL. (The r11 check is probably unnecessary.)
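
Roughly, as a sketch of the check in the compat exit path (the "flags
are sane" part shown as RF and TF clear; the return helper is
invented):

	if (regs->ip == (unsigned long)vsyscall_after_int80 &&
	    regs->cs == __USER32_CS && regs->ss == __USER_DS &&
	    regs->r11 == regs->flags &&
	    !(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
		return_via_sysretl(regs);	/* hypothetical helper */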

This is not quite as elegant as 64-bit opportunistic sysret, since
we're zapping ecx. This should be unobservable except by debuggers,
since we already know that we're returning to a 'pop ecx' instruction.

NB: I don't think we can enable SYSCALL on 32-bit kernels. IIRC
there's no MSR_SYSCALL_MASK support, which makes the whole thing
basically useless, since we can't mask TF and we don't control ESP.

>
> 2) The sigreturn trampolines. These are tricky because the sigreturn
> syscalls implicitly use regs->sp to find the signal frame. That
> interacts badly with the SYSENTER/SYSCALL entries, which save
> registers on the stack. The trampoline currently uses a bare SYSCALL
> instruction (no pushes to the stack), but falls back to INT80 for
> SYSENTER. One option is to create new syscalls that take a pointer to
> the signal frame as arg1.

I'm not sure this is worth doing, since IIRC modern userspace doesn't
use the vdso sigreturn trampolines at all. We could just continue
using int80 for now.

--Andy

2015-08-27 03:13:25

by Brian Gerst

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

On Wed, Aug 26, 2015 at 1:10 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Aug 25, 2015 at 10:20 PM, Brian Gerst <[email protected]> wrote:
>>>>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>>>>
>>>>> Userspace debuggers really like having the vdso properly
>>>>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>>>>> manually in hexadecimal. AFAIK Jan Beulich is the only person who
>>>>> understands it.
>>>>>
>>>>> I want to be able to change the entries a little bit to clean them up
>>>>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>>>>> currently suck), but it's really, really messy right now because of
>>>>> the hex CFI stuff. Could we just drop the CFI annotations if the
>>>>> binutils version is too old or even just require new enough binutils
>>>>> to build 32-bit and compat kernels?
>>>>
>>>> One thing I want to do is rework the 32-bit VDSO into a single image,
>>>> using alternatives to handle the selection of entry method. The
>>>> open-coded CFI crap has made that near impossible to do.
>>>>
>>>
>>> Yes please!
>>>
>>> But please don't change the actual instruction ordering at all yet,
>>> since the SYSCALL case seems to be buggy right now.
>>>
>>> (If you want to be really fancy, don't use alternatives. Instead
>>> teach vdso2c to annotate the actual dynamic table function pointers so
>>> we can rewrite the pointers at boot time. That will save a cycle or
>>> two.)
>>
>> The easiest way to select the right entry code is by changing the ELF
>> AUX vector. That covers the normal usage, but there are two additional
>> cases that need addressing.
>>
>> 1) Some code could possibly look up the __kernel_vsyscall symbol
>> directly and call it, but that's non-standard. If there is code out
>> there that does this, we could update the ELF symbol table to point
>> __kernel_vsyscall to the chosen entry point, or just remove the symbol
>> and let the caller fall back to INT80.
>
> Here's an alternate proposal, which is mostly copied from what I
> posted here yesterday afternoon:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=101061
>
> I think we should consider doing this:
>
> __kernel_vsyscall:
>	push %ecx
>	push %edx
>	movl %esp, %edx
>
> ALTERNATIVE (Intel with SEP):
>	sysenter
>
> ALTERNATIVE (AMD with SYSCALL32 on 64-bit kernels):
>	syscall
>	hlt /* keep weird binary tracers from utterly screwing up */
>
> ALTERNATIVE (if neither of the other cases apply):
>	nops
>
>	movl %edx, %esp		/* restore esp */
>	movl (%esp), %edx	/* reload saved edx */
>	movl 4(%esp), %ecx	/* reload saved ecx */
>	int $0x80
> vsyscall_after_int80:
>	popl %edx
>	popl %ecx
>	ret

This could interfere with sigreturn/ptrace, since if EDX or ECX are
changed, those changes would get overwritten by the pops from the
stack. That's a problem with the current code too.

> First, in the case where we have neither SEP nor SYSCALL32, I claim
> that this Just Works. We push a couple regs, pointlessly shuffle esp,
> restore the regs, do int $0x80 with the same regs we started with, and
> then (again, pointlessly) pop the regs we pushed.

If neither SYSENTER nor SYSCALL is supported then it should just be
"int $0x80; ret;", nothing more. You can't assume all int80 calls
come from the VDSO. In fact, fork/clone cannot use the VDSO (the
saved values on the stack are not copied to the child) and always use
int80 directly.

> Now we change the semantics of *both* syscall32 and sysenter so that they
> load edx into regs->sp, fetch regs->dx and regs->cx from memory, and
> set regs->ip to vsyscall_after_int80. (This is a wee bit slower than
> the current sysenter code because it does two memory fetches instead
> of one.) Then they proceed just like int80. In particular, anything
> that does "ip -= 2" works exactly like int80 because it points at an
> actual int80 instruction.
>
> Note that sysenter's slow path already sort of works like this.
>
> Now we've fixed the correctness issues but we've killed performance,
> as we'll use IRET instead of SYSRET to get back out. We can fix that
> using opportunistic sysret. If we're returning from a compat syscall
> that entered via sysenter or syscall32, if regs->ip ==
> vsyscall_after_int80, regs->r11 == regs->flags, regs->ss == __USER_DS,
> regs->cs == __USER32_CS, and flags are sane, then return using
> SYSRETL. (The r11 check is probably unnecessary.)
>
> This is not quite as elegant as 64-bit opportunistic sysret, since
> we're zapping ecx. This should be unobservable except by debuggers,
> since we already know that we're returning to a 'pop ecx' instruction.
>
> NB: I don't think we can enable SYSCALL on 32-bit kernels. IIRC
> there's no MSR_SYSCALL_MASK support, which makes the whole thing
> basically useless, since we can't mask TF and we don't control ESP.

The original implementation of SYSCALL on the K6 had a bad side
effect. You could only return to userspace with SYSRET. It set some
internal state that caused IRET to fault, which meant you couldn't
context switch to another task that had taken a page fault for
example. That made it unusable without some ugly hacks. It may work
differently on modern AMD processors in legacy mode, but it's probably
not worth trying to support it.

--
Brian Gerst

2015-08-27 03:38:44

by Andy Lutomirski

Subject: Re: Proposal for finishing the 64-bit x86 syscall cleanup

On Wed, Aug 26, 2015 at 8:13 PM, Brian Gerst <[email protected]> wrote:
> On Wed, Aug 26, 2015 at 1:10 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Aug 25, 2015 at 10:20 PM, Brian Gerst <[email protected]> wrote:
>>>>>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>>>>>
>>>>>> Userspace debuggers really like having the vdso properly
>>>>>> CFI-annotated, and the 32-bit fast syscall entries are annotated
>>>>>> manually in hexadecimal. AFAIK Jan Beulich is the only person who
>>>>>> understands it.
>>>>>>
>>>>>> I want to be able to change the entries a little bit to clean them up
>>>>>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>>>>>> currently suck), but it's really, really messy right now because of
>>>>>> the hex CFI stuff. Could we just drop the CFI annotations if the
>>>>>> binutils version is too old or even just require new enough binutils
>>>>>> to build 32-bit and compat kernels?
>>>>>
>>>>> One thing I want to do is rework the 32-bit VDSO into a single image,
>>>>> using alternatives to handle the selection of entry method. The
>>>>> open-coded CFI crap has made that near impossible to do.
>>>>>
>>>>
>>>> Yes please!
>>>>
>>>> But please don't change the actual instruction ordering at all yet,
>>>> since the SYSCALL case seems to be buggy right now.
>>>>
>>>> (If you want to be really fancy, don't use alternatives. Instead
>>>> teach vdso2c to annotate the actual dynamic table function pointers so
>>>> we can rewrite the pointers at boot time. That will save a cycle or
>>>> two.)
>>>
>>> The easiest way to select the right entry code is by changing the ELF
>>> AUX vector. That covers the normal usage, but there are two additional
>>> cases that need addressing.
>>>
>>> 1) Some code could possibly look up the __kernel_vsyscall symbol
>>> directly and call it, but that's non-standard. If there is code out
>>> there that does this, we could update the ELF symbol table to point
>>> __kernel_vsyscall to the chosen entry point, or just remove the symbol
>>> and let the caller fall back to INT80.
>>
>> Here's an alternate proposal, which is mostly copied from what I
>> posted here yesterday afternoon:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=101061
>>
>> I think we should consider doing this:
>>
>> __kernel_vsyscall:
>>	push %ecx
>>	push %edx
>>	movl %esp, %edx
>>
>> ALTERNATIVE (Intel with SEP):
>>	sysenter
>>
>> ALTERNATIVE (AMD with SYSCALL32 on 64-bit kernels):
>>	syscall
>>	hlt /* keep weird binary tracers from utterly screwing up */
>>
>> ALTERNATIVE (if neither of the other cases apply):
>>	nops
>>
>>	movl %edx, %esp		/* restore esp */
>>	movl (%esp), %edx	/* reload saved edx */
>>	movl 4(%esp), %ecx	/* reload saved ecx */
>>	int $0x80
>> vsyscall_after_int80:
>>	popl %edx
>>	popl %ecx
>>	ret
>
> This could interfere with sigreturn/ptrace, since if EDX or ECX are
> changed, those changes would get overwritten by the pops from the
> stack. That's a problem with the current code too.
>

Anyone who ptraces a fast 32-bit syscall, changes ecx or edx, and
doesn't change it back is asking for serious trouble. Glibc has a
bunch of asm that sticks ebx (which is preserved across function calls
per the C ABI) into ecx or edx across syscalls with few enough args,
and, if those change, then ebx is toast.

I don't see any clean way to fix it.

>> First, in the case where we have neither SEP nor SYSCALL32, I claim
>> that this Just Works. We push a couple regs, pointlessly shuffle esp,
>> restore the regs, do int $0x80 with the same regs we started with, and
>> then (again, pointlessly) pop the regs we pushed.
>
> If neither SYSENTER nor SYSCALL is supported then it should just be
> "int $0x80; ret;", nothing more. You can't assume all int80 calls
> come from the VDSO.

I don't understand. I'm talking about having __kernel_vsyscall
contain the extra prologue and epilogue code on non-SEP 32-bit
systems. This won't have any effect on bare int80 instructions.

> In fact, fork/clone cannot use the VDSO (the
> saved values on the stack are not copied to the child) and always use
> int80 directly.

I see why it's screwed up for threads, but why is fork a problem?
Isn't the entire stack copied to the child?

In any event, if we wanted to make the vsyscall work for shared-VM
clone, we have a bigger problem than ecx and edx: the actual return
address gets zapped. Fixing *that* would be an incredible mess, and,
if anyone actually wanted to try it, they'd probably want to just add
a new entry point for clone.

>
>> Now we change the semantics of *both* syscall32 and sysenter so that they
>> load edx into regs->sp, fetch regs->dx and regs->cx from memory, and
>> set regs->ip to vsyscall_after_int80. (This is a wee bit slower than
>> the current sysenter code because it does two memory fetches instead
>> of one.) Then they proceed just like int80. In particular, anything
>> that does "ip -= 2" works exactly like int80 because it points at an
>> actual int80 instruction.
>>
>> Note that sysenter's slow path already sort of works like this.
>>
>> Now we've fixed the correctness issues but we've killed performance,
>> as we'll use IRET instead of SYSRET to get back out. We can fix that
>> using opportunistic sysret. If we're returning from a compat syscall
>> that entered via sysenter or syscall32, if regs->ip ==
>> vsyscall_after_int80, regs->r11 == regs->flags, regs->ss == __USER_DS,
>> regs->cs == __USER32_CS, and flags are sane, then return using
>> SYSRETL. (The r11 check is probably unnecessary.)
>>
>> This is not quite as elegant as 64-bit opportunistic sysret, since
>> we're zapping ecx. This should be unobservable except by debuggers,
>> since we already know that we're returning to a 'pop ecx' instruction.
>>
>> NB: I don't think we can enable SYSCALL on 32-bit kernels. IIRC
>> there's no MSR_SYSCALL_MASK support, which makes the whole thing
>> basically useless, since we can't mask TF and we don't control ESP.
>
> The original implementation of SYSCALL on the K6 had a bad side
> effect. You could only return to userspace with SYSRET. It set some
> internal state that caused IRET to fault, which meant you couldn't
> context switch to another task that had taken a page fault for
> example. That made it unusable without some ugly hacks. It may work
> differently on modern AMD processors in legacy mode, but it's probably
> not worth trying to support it.
>

I don't know about that, but even the modern implementation screws up
SYSRET. If you enter via an interrupt and exit via SYSRET, SS ends up
unusable even though the selector is correct. That's the
"sysret_ss_attrs" thing that we now hack around.

--Andy