2016-03-06 17:39:54

by Andy Lutomirski

Subject: [PATCH] x86/entry: Improve system call entry comments

Ingo suggested that the comments should explain when the various
entries are used. This adds these explanations and improves other
parts of the comments.

Signed-off-by: Andy Lutomirski <[email protected]>
---

This applies on top of all the TF / debug / SYSENTER stuff. If this
is too confusing, I'll rebase it and resend after everything settles
in -tip.

arch/x86/entry/entry_32.S | 59 +++++++++++++++++++++++++++-
arch/x86/entry/entry_64.S | 10 +++++
arch/x86/entry/entry_64_compat.S | 85 +++++++++++++++++++++++++++-------------
3 files changed, 126 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 09360c445c69..846a6c478bfd 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -307,6 +307,38 @@ ENTRY(xen_sysenter_target)
jmp sysenter_past_esp
#endif

+/*
+ * 32-bit SYSENTER entry.
+ *
+ * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
+ * if X86_FEATURE_SEP is available. This is the preferred system call
+ * entry on 32-bit systems.
+ *
+ * The SYSENTER instruction, in principle, should *only* occur in the
+ * vDSO. In practice, a small number of Android devices were shipped
+ * with a copy of Bionic that inlined a SYSENTER instruction. This
+ * never happened in any of Google's Bionic versions -- it only happened
+ * in a narrow range of Intel-provided versions.
+ *
+ * SYSENTER loads SS, ESP, CS, and EIP from previously programmed MSRs.
+ * IF and VM in EFLAGS are cleared (IOW: interrupts are off).
+ * SYSENTER does not save anything on the stack,
+ * and does not save old EIP (!!!), ESP, or EFLAGS.
+ *
+ * To avoid losing track of EFLAGS.VM (and thus potentially corrupting
+ * user and/or vm86 state), we explicitly disable the SYSENTER
+ * instruction in vm86 mode by reprogramming the MSRs.
+ *
+ * Arguments:
+ * eax system call number
+ * ebx arg1
+ * ecx arg2
+ * edx arg3
+ * esi arg4
+ * edi arg5
+ * ebp user stack
+ * 0(%ebp) arg6
+ */
ENTRY(entry_SYSENTER_32)
movl TSS_sysenter_sp0(%esp), %esp
sysenter_past_esp:
@@ -398,7 +430,32 @@ sysenter_past_esp:
GLOBAL(__end_SYSENTER_singlestep_region)
ENDPROC(entry_SYSENTER_32)

- # system call handler stub
+/*
+ * 32-bit legacy system call entry.
+ *
+ * 32-bit x86 Linux system calls traditionally used the INT $0x80
+ * instruction. INT $0x80 lands here.
+ *
+ * This entry point can be used by 32-bit and 64-bit programs to perform
+ * 32-bit system calls. Instances of INT $0x80 can be found inline in
+ * various programs and libraries. It is also used by the vDSO's
+ * __kernel_vsyscall fallback for hardware that doesn't support a faster
+ * entry method. Restarted 32-bit system calls also fall back to INT
+ * $0x80 regardless of what instruction was originally used to do the
+ * system call.
+ *
+ * This is considered a slow path. It is not used by modern libc
+ * implementations on modern hardware except during process startup.
+ *
+ * Arguments:
+ * eax system call number
+ * ebx arg1
+ * ecx arg2
+ * edx arg3
+ * esi arg4
+ * edi arg5
+ * ebp arg6
+ */
ENTRY(entry_INT80_32)
ASM_CLAC
pushl %eax /* pt_regs->orig_ax */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 2f61059dacc3..6f5fbc584cb1 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -106,6 +106,16 @@ ENDPROC(native_usergs_sysret64)
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
+ * This is the only entry point used for 64-bit system calls. The
+ * hardware interface is reasonably well designed and the register to
+ * argument mapping Linux uses fits well with the registers that are
+ * available when SYSCALL is used.
+ *
+ * SYSCALL instructions can be found inlined in libc implementations as
+ * well as some other programs and libraries. There are also a handful
+ * of SYSCALL instructions in the vDSO used, for example, as a
+ * clock_gettime or gettimeofday fallback.
+ *
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index aa3f79ba2e4d..1838dcdd3886 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -19,12 +19,21 @@
.section .entry.text, "ax"

/*
- * 32-bit SYSENTER instruction entry.
+ * 32-bit SYSENTER entry.
*
- * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
- * IF and VM in rflags are cleared (IOW: interrupts are off).
+ * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
+ * on 64-bit kernels running on Intel CPUs.
+ *
+ * The SYSENTER instruction, in principle, should *only* occur in the
+ * vDSO. In practice, a small number of Android devices were shipped
+ * with a copy of Bionic that inlined a SYSENTER instruction. This
+ * never happened in any of Google's Bionic versions -- it only happened
+ * in a narrow range of Intel-provided versions.
+ *
+ * SYSENTER loads SS, RSP, CS, and RIP from previously programmed MSRs.
+ * IF and VM in RFLAGS are cleared (IOW: interrupts are off).
* SYSENTER does not save anything on the stack,
- * and does not save old rip (!!!) and rflags.
+ * and does not save old RIP (!!!), RSP, or RFLAGS.
*
* Arguments:
* eax system call number
@@ -35,10 +44,6 @@
* edi arg5
* ebp user stack
* 0(%ebp) arg6
- *
- * This is purely a fast path. For anything complicated we use the int 0x80
- * path below. We set up a complete hardware stack frame to share code
- * with the int 0x80 path.
*/
ENTRY(entry_SYSENTER_compat)
/* Interrupts are off on entry. */
@@ -131,17 +136,38 @@ GLOBAL(__end_entry_SYSENTER_compat)
ENDPROC(entry_SYSENTER_compat)

/*
- * 32-bit SYSCALL instruction entry.
+ * 32-bit SYSCALL entry.
+ *
+ * 32-bit system calls through the vDSO's __kernel_vsyscall enter here
+ * on 64-bit kernels running on AMD CPUs.
+ *
+ * The SYSCALL instruction, in principle, should *only* occur in the
+ * vDSO. In practice, it appears that this really is the case.
+ * As evidence:
+ *
+ * - The calling convention for SYSCALL has changed several times without
+ * anyone noticing.
+ *
+ * - Prior to the in-kernel X86_BUG_SYSRET_SS_ATTRS fixup, any
+ * user task that did SYSCALL without immediately reloading SS
+ * would randomly crash.
*
- * 32-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
- * then loads new ss, cs, and rip from previously programmed MSRs.
- * rflags gets masked by a value from another MSR (so CLD and CLAC
- * are not needed). SYSCALL does not save anything on the stack
- * and does not change rsp.
+ * - Most programmers do not directly target AMD CPUs, and the 32-bit
+ * SYSCALL instruction does not exist on Intel CPUs. Even on AMD
+ * CPUs, Linux disables the SYSCALL instruction on 32-bit kernels
+ * because the SYSCALL instruction in legacy/native 32-bit mode (as
+ * opposed to compat mode) is sufficiently poorly designed as to be
+ * essentially unusable.
*
- * Note: rflags saving+masking-with-MSR happens only in Long mode
+ * 32-bit SYSCALL saves RIP to RCX, clears RFLAGS.RF, then saves
+ * RFLAGS to R11, then loads new SS, CS, and RIP from previously
+ * programmed MSRs. RFLAGS gets masked by a value from another MSR
+ * (so CLD and CLAC are not needed). SYSCALL does not save anything on
+ * the stack and does not change RSP.
+ *
+ * Note: RFLAGS saving+masking-with-MSR happens only in Long mode
* (in legacy 32-bit mode, IF, RF and VM bits are cleared and that's it).
- * Don't get confused: rflags saving+masking depends on Long Mode Active bit
+ * Don't get confused: RFLAGS saving+masking depends on Long Mode Active bit
* (EFER.LMA=1), NOT on bitness of userspace where SYSCALL executes
* or target CS descriptor's L bit (SYSCALL does not read segment descriptors).
*
@@ -241,7 +267,21 @@ sysret32_from_system_call:
END(entry_SYSCALL_compat)

/*
- * Emulated IA32 system calls via int 0x80.
+ * 32-bit legacy system call entry.
+ *
+ * 32-bit x86 Linux system calls traditionally used the INT $0x80
+ * instruction. INT $0x80 lands here.
+ *
+ * This entry point can be used by 32-bit and 64-bit programs to perform
+ * 32-bit system calls. Instances of INT $0x80 can be found inline in
+ * various programs and libraries. It is also used by the vDSO's
+ * __kernel_vsyscall fallback for hardware that doesn't support a faster
+ * entry method. Restarted 32-bit system calls also fall back to INT
+ * $0x80 regardless of what instruction was originally used to do the
+ * system call.
+ *
+ * This is considered a slow path. It is not used by modern libc
+ * implementations on modern hardware except during process startup.
*
* Arguments:
* eax system call number
@@ -250,17 +290,8 @@ END(entry_SYSCALL_compat)
* edx arg3
* esi arg4
* edi arg5
- * ebp arg6 (note: not saved in the stack frame, should not be touched)
- *
- * Notes:
- * Uses the same stack frame as the x86-64 version.
- * All registers except eax must be saved (but ptrace may violate that).
- * Arguments are zero extended. For system calls that want sign extension and
- * take long arguments a wrapper is needed. Most calls can just be called
- * directly.
- * Assumes it is only called from user space and entered with interrupts off.
+ * ebp arg6
*/
-
ENTRY(entry_INT80_compat)
/*
* Interrupts are off on entry.
--
2.5.0
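
The two 32-bit argument tables in the patch differ only in where arg6 comes from: INT $0x80 passes it in ebp, while SYSENTER uses ebp as the user stack pointer and takes arg6 from 0(%ebp). A minimal user-space model of that difference follows. It is only a sketch, not kernel code: `struct user_regs32`, `struct user_mem`, `read_user_u32` and `collect_args` are hypothetical names invented for illustration, and user memory is modeled as a flat array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the 32-bit user registers at system call entry. */
struct user_regs32 {
	uint32_t eax, ebx, ecx, edx, esi, edi, ebp;
};

enum entry_kind { ENTRY_INT80, ENTRY_SYSENTER };

/* Hypothetical stand-in for user memory: words[] starts at address base. */
struct user_mem {
	uint32_t base;
	const uint32_t *words;
	size_t nwords;
};

/* Read one aligned 32-bit word of modeled user memory; 0 on success. */
static int read_user_u32(const struct user_mem *m, uint32_t addr, uint32_t *out)
{
	size_t idx;

	if (addr < m->base || (addr - m->base) % 4 != 0)
		return -1;
	idx = (addr - m->base) / 4;
	if (idx >= m->nwords)
		return -1;
	*out = m->words[idx];
	return 0;
}

/*
 * Gather arg1..arg6 as the comment tables describe them.
 * arg1..arg5 are identical for both entries; only arg6 differs.
 */
static int collect_args(enum entry_kind kind, const struct user_regs32 *r,
			const struct user_mem *m, uint32_t args[6])
{
	args[0] = r->ebx;
	args[1] = r->ecx;
	args[2] = r->edx;
	args[3] = r->esi;
	args[4] = r->edi;

	if (kind == ENTRY_INT80) {
		args[5] = r->ebp;	/* INT $0x80: ebp is arg6 itself */
		return 0;
	}
	/* SYSENTER: ebp is the user stack, arg6 lives at 0(%ebp) */
	return read_user_u32(m, r->ebp, &args[5]);
}
```

The model makes one consequence visible: on the SYSENTER path, fetching arg6 is a user memory access that can fail, whereas on the INT $0x80 path all six arguments are already in registers.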


2016-03-07 08:22:41

by Ingo Molnar

Subject: Re: [PATCH] x86/entry: Improve system call entry comments


* Andy Lutomirski <[email protected]> wrote:

> Ingo suggested that the comments should explain when the various
> entries are used. This adds these explanations and improves other
> parts of the comments.

Thanks for doing this, this is really useful!

One very small detail I noticed:

> +/*
> + * 32-bit legacy system call entry.
> + *
> + * 32-bit x86 Linux system calls traditionally used the INT $0x80
> + * instruction. INT $0x80 lands here.
> + *
> + * This entry point can be used by 32-bit and 64-bit programs to perform
> + * 32-bit system calls. Instances of INT $0x80 can be found inline in
> + * various programs and libraries. It is also used by the vDSO's
> + * __kernel_vsyscall fallback for hardware that doesn't support a faster
> + * entry method. Restarted 32-bit system calls also fall back to INT
> + * $0x80 regardless of what instruction was originally used to do the
> + * system call.
> + *
> + * This is considered a slow path. It is not used by modern libc
> + * implementations on modern hardware except during process startup.
> + *
> + * Arguments:
> + * eax system call number
> + * ebx arg1
> + * ecx arg2
> + * edx arg3
> + * esi arg4
> + * edi arg5
> + * ebp arg6
> + */
> ENTRY(entry_INT80_32)

entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
they can never trigger this specific entry point, right?

So I'd change the explanation to something like:

> + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> + * programs to perform 32-bit system calls. (Programs running on 64-bit
> + * kernels executing INT $0x80 will land on another entry point:
> + * entry_INT80_compat. The ABI is identical.)

Agreed?

Thanks,

Ingo

2016-03-07 16:35:06

by H. Peter Anvin

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On March 7, 2016 12:22:28 AM PST, Ingo Molnar <[email protected]> wrote:
>
>* Andy Lutomirski <[email protected]> wrote:
>
>> Ingo suggested that the comments should explain when the various
>> entries are used. This adds these explanations and improves other
>> parts of the comments.
>
>Thanks for doing this, this is really useful!
>
>One very small detail I noticed:
>
>> +/*
>> + * 32-bit legacy system call entry.
>> + *
>> + * 32-bit x86 Linux system calls traditionally used the INT $0x80
>> + * instruction. INT $0x80 lands here.
>> + *
>> + * This entry point can be used by 32-bit and 64-bit programs to perform
>> + * 32-bit system calls. Instances of INT $0x80 can be found inline in
>> + * various programs and libraries. It is also used by the vDSO's
>> + * __kernel_vsyscall fallback for hardware that doesn't support a faster
>> + * entry method. Restarted 32-bit system calls also fall back to INT
>> + * $0x80 regardless of what instruction was originally used to do the
>> + * system call.
>> + *
>> + * This is considered a slow path. It is not used by modern libc
>> + * implementations on modern hardware except during process startup.
>> + *
>> + * Arguments:
>> + * eax system call number
>> + * ebx arg1
>> + * ecx arg2
>> + * edx arg3
>> + * esi arg4
>> + * edi arg5
>> + * ebp arg6
>> + */
>> ENTRY(entry_INT80_32)
>
>entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
>entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
>they can never trigger this specific entry point, right?
>
>So I'd change the explanation to something like:
>
>> + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
>> + * programs to perform 32-bit system calls. (Programs running on 64-bit
>> + * kernels executing INT $0x80 will land on another entry point:
>> + * entry_INT80_compat. The ABI is identical.)
>
>Agreed?
>
>Thanks,
>
> Ingo

Sadly I believe Android still uses int $0x80 in the upstream version.
--
Sent from my Android device with K-9 Mail. Please excuse brevity and formatting.

2016-03-07 17:02:19

by Andy Lutomirski

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On Mar 7, 2016 12:22 AM, "Ingo Molnar" <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > Ingo suggested that the comments should explain when the various
> > entries are used. This adds these explanations and improves other
> > parts of the comments.
>
> Thanks for doing this, this is really useful!
>
> One very small detail I noticed:
>
> > +/*
> > + * 32-bit legacy system call entry.
> > + *
> > + * 32-bit x86 Linux system calls traditionally used the INT $0x80
> > + * instruction. INT $0x80 lands here.
> > + *
> > + * This entry point can be used by 32-bit and 64-bit programs to perform
> > + * 32-bit system calls. Instances of INT $0x80 can be found inline in
> > + * various programs and libraries. It is also used by the vDSO's
> > + * __kernel_vsyscall fallback for hardware that doesn't support a faster
> > + * entry method. Restarted 32-bit system calls also fall back to INT
> > + * $0x80 regardless of what instruction was originally used to do the
> > + * system call.
> > + *
> > + * This is considered a slow path. It is not used by modern libc
> > + * implementations on modern hardware except during process startup.
> > + *
> > + * Arguments:
> > + * eax system call number
> > + * ebx arg1
> > + * ecx arg2
> > + * edx arg3
> > + * esi arg4
> > + * edi arg5
> > + * ebp arg6
> > + */
> > ENTRY(entry_INT80_32)
>
> entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> they can never trigger this specific entry point, right?
>

64-bit programs can and sometimes do trigger this entry point. It
does a 32-bit syscall regardless of the caller's bitness, but it
returns back to the caller's original context, whatever it was.

> So I'd change the explanation to something like:
>
> > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > + * kernels executing INT $0x80 will land on another entry point:
> > + * entry_INT80_compat. The ABI is identical.)

I like the part in parentheses.

--Andy

2016-03-08 10:28:05

by Ingo Molnar

Subject: Re: [PATCH] x86/entry: Improve system call entry comments


* Andy Lutomirski <[email protected]> wrote:

> > > ENTRY(entry_INT80_32)
> >
> > entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> > entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> > they can never trigger this specific entry point, right?
> >
>
> 64-bit programs can and sometimes do trigger this entry point. [...]

How can 64-bit programs trigger entry_INT80_32? It's only ever set on 32-bit
kernels:

#ifdef CONFIG_X86_32
set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

> [...] It does a 32-bit syscall regardless of the caller's bitness, but it
> returns back to the caller's original context, whatever it was.

That's true of INT $0x80, but I'm talking about the entry point: AFAICS
entry_INT80_32 can only ever execute on 32-bit kernels.

We don't even build the entry_32.S::entry_INT80_32 entry point on 64-bit kernels:

obj-y := entry_$(BITS).o [...]

>
> > So I'd change the explanation to something like:
> >
> > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > + * kernels executing INT $0x80 will land on another entry point:
> > > + * entry_INT80_compat. The ABI is identical.)
>
> I like the part in parentheses.

So the part in parentheses conflicts with your above statement :)

What I wanted to say with this:

> > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > + * kernels executing INT $0x80 will land on another entry point:
> > > + * entry_INT80_compat. The ABI is identical.)

... is what it says: that entry_INT80_32 is only active on 32-bit kernels, running
32-bit programs, performing 32-bit system calls.

Programs running on 64-bit kernels can use INT $0x80 as well, but will land on
another, different, 64-bit kernel specific entry point.

What am I missing?

Thanks,

Ingo
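
The distinction drawn above is between the instruction user space executes and the kernel entry point it lands on, which depends on the kernel's bitness rather than the program's. A small lookup function can summarize it. This is an illustrative sketch only: `enum insn` and `entry_point` are hypothetical names, and the 64-bit SYSCALL symbol `entry_SYSCALL_64` is taken from the kernel source rather than from this thread.

```c
#include <stddef.h>

enum insn {
	INSN_INT80,	/* INT $0x80 */
	INSN_SYSENTER,	/* 32-bit SYSENTER */
	INSN_SYSCALL32,	/* SYSCALL executed by 32-bit code */
	INSN_SYSCALL64	/* SYSCALL executed by 64-bit code */
};

/*
 * Return the name of the entry point that handles `insn` on a kernel of
 * the given bitness, or NULL if that combination has no entry point.
 * The bitness of the calling program is deliberately not a parameter:
 * INT $0x80 on a 64-bit kernel lands on entry_INT80_compat whether the
 * caller is 32-bit or 64-bit code.
 */
static const char *entry_point(int kernel_bits, enum insn insn)
{
	if (kernel_bits == 32) {
		switch (insn) {
		case INSN_INT80:	return "entry_INT80_32";
		case INSN_SYSENTER:	return "entry_SYSENTER_32";
		default:		return NULL; /* SYSCALL is disabled on 32-bit kernels */
		}
	}
	if (kernel_bits == 64) {
		switch (insn) {
		case INSN_INT80:	return "entry_INT80_compat";
		case INSN_SYSENTER:	return "entry_SYSENTER_compat";
		case INSN_SYSCALL32:	return "entry_SYSCALL_compat";
		case INSN_SYSCALL64:	return "entry_SYSCALL_64";
		}
	}
	return NULL;
}
```

Under this model entry_INT80_32 appears only in the 32-bit kernel branch, matching the CONFIG_X86_32 guard and the entry_$(BITS).o Makefile rule quoted above.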

2016-03-08 10:36:41

by Ingo Molnar

Subject: Re: [PATCH] x86/entry: Improve system call entry comments


* H. Peter Anvin <[email protected]> wrote:

> On March 7, 2016 12:22:28 AM PST, Ingo Molnar <[email protected]> wrote:
> >
> >* Andy Lutomirski <[email protected]> wrote:
> >
> >> Ingo suggested that the comments should explain when the various
> >> entries are used. This adds these explanations and improves other
> >> parts of the comments.
> >
> >Thanks for doing this, this is really useful!
> >
> >One very small detail I noticed:
> >
> >> +/*
> >> + * 32-bit legacy system call entry.
> >> + *
> >> + * 32-bit x86 Linux system calls traditionally used the INT $0x80
> >> + * instruction. INT $0x80 lands here.
> >> + *
> >> + * This entry point can be used by 32-bit and 64-bit programs to perform
> >> + * 32-bit system calls. Instances of INT $0x80 can be found inline in
> >> + * various programs and libraries. It is also used by the vDSO's
> >> + * __kernel_vsyscall fallback for hardware that doesn't support a faster
> >> + * entry method. Restarted 32-bit system calls also fall back to INT
> >> + * $0x80 regardless of what instruction was originally used to do the
> >> + * system call.
> >> + *
> >> + * This is considered a slow path. It is not used by modern libc
> >> + * implementations on modern hardware except during process startup.
> >> + *
> >> + * Arguments:
> >> + * eax system call number
> >> + * ebx arg1
> >> + * ecx arg2
> >> + * edx arg3
> >> + * esi arg4
> >> + * edi arg5
> >> + * ebp arg6
> >> + */
> >> ENTRY(entry_INT80_32)
> >
> >entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> >entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> >they can never trigger this specific entry point, right?
> >
> >So I'd change the explanation to something like:
> >
> >> + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> >> + * programs to perform 32-bit system calls. (Programs running on 64-bit
> >> + * kernels executing INT $0x80 will land on another entry point:
> >> + * entry_INT80_compat. The ABI is identical.)
> >
> >Agreed?
> >
> >Thanks,
> >
> > Ingo
>
> Sadly I believe Android still uses int $0x80 in the upstream version.

I don't see how that fact conflicts with my statement: on 64-bit kernels INT $0x80
will (of course) work, but will land on another entry point: entry_INT80_compat(),
not entry_INT80_32().

On 32-bit kernels the INT $0x80 entry point is entry_INT80_32().

Thanks,

Ingo

2016-03-08 18:30:23

by Andy Lutomirski

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On Mar 8, 2016 2:27 AM, "Ingo Molnar" <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > > > ENTRY(entry_INT80_32)
> > >
> > > entry_INT80_32() is only used on pure 32-bit kernels, 64-bit kernels use
> > > entry_INT80_compat(). So the above text should not talk about 64-bit programs, as
> > > they can never trigger this specific entry point, right?
> > >
> >
> > 64-bit programs can and sometimes do trigger this entry point. [...]
>
> How can 64-bit programs trigger entry_INT80_32? It's only ever set on 32-bit
> kernels:
>
> #ifdef CONFIG_X86_32
> set_system_trap_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
> set_bit(IA32_SYSCALL_VECTOR, used_vectors);
> #endif
>
> > [...] It does a 32-bit syscall regardless of the caller's bitness, but it
> > returns back to the caller's original context, whatever it was.
>
> That's true of INT $0x80, but I'm talking about the entry point: AFAICS
> entry_INT80_32 can only ever execute on 32-bit kernels.

Oh, duh.

>
> We don't even build the entry_32.S::entry_INT80_32 entry point on 64-bit kernels:
>
> obj-y := entry_$(BITS).o [...]
>
> >
> > > So I'd change the explanation to something like:
> > >
> > > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > > + * kernels executing INT $0x80 will land on another entry point:
> > > > + * entry_INT80_compat. The ABI is identical.)
> >
> > I like the part in parentheses.
>
> So the part in parentheses conflicts with your above statement :)
>
> What I wanted to say with this:
>
> > > > + * This entry point is active on 32-bit kernels and can thus be used by 32-bit
> > > > + * programs to perform 32-bit system calls. (Programs running on 64-bit
> > > > + * kernels executing INT $0x80 will land on another entry point:
> > > > + * entry_INT80_compat. The ABI is identical.)
>
> ... is what it says: that entry_INT80_32 is only active on 32-bit kernels, running
> 32-bit programs, performing 32-bit system calls.
>
> Programs running on 64-bit kernels can use INT $0x80 as well, but will land on
> another, different, 64-bit kernel specific entry point.
>
> What am I missing?
>

Nothing. I mis-read your earlier email. Want to fix and apply it, or
should I send a new version?

> Thanks,
>
> Ingo

2016-03-08 18:41:24

by H. Peter Anvin

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On 03/08/16 02:30, Ingo Molnar wrote:
>>>> + *
>>>> + * This is considered a slow path. It is not used by modern libc
>>>> + * implementations on modern hardware except during process startup.
>>>> + *
>>
>> Sadly I believe Android still uses int $0x80 in the upstream version.
>
> I don't see how that fact conflicts with my statement: on 64-bit kernels INT $0x80
> will (of course) work, but will land on another entry point: entry_INT80_compat(),
> not entry_INT80_32().
>
> On 32-bit kernels the INT $0x80 entry point is entry_INT80_32().
>

It doesn't. I was referring to the above quote. Trying to fix that.

-hpa


2016-03-08 18:45:50

by Andy Lutomirski

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On Tue, Mar 8, 2016 at 10:40 AM, H. Peter Anvin <[email protected]> wrote:
> On 03/08/16 02:30, Ingo Molnar wrote:
>>>>> + *
>>>>> + * This is considered a slow path. It is not used by modern libc
>>>>> + * implementations on modern hardware except during process startup.
>>>>> + *
>>>
>>> Sadly I believe Android still uses int $0x80 in the upstream version.
>>
>> I don't see how that fact conflicts with my statement: on 64-bit kernels INT $0x80
>> will (of course) work, but will land on another entry point: entry_INT80_compat(),
>> not entry_INT80_32().
>>
>> On 32-bit kernels the INT $0x80 entry point is entry_INT80_32().
>>
>
> It doesn't. I was referring to the above quote. Trying to fix that.

s/modern/most, perhaps?

I'm hoping that some day Bionic goes away and gets replaced by musl.

Of course, musl doesn't always use fast syscalls because it needs a
vdso facility that doesn't currently exist. I'll deal with that
eventually.

>
> -hpa
>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2016-03-08 18:48:37

by H. Peter Anvin

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On 03/08/16 10:45, Andy Lutomirski wrote:
>
> s/modern/most, perhaps?
>
> I'm hoping that some day Bionic goes away and gets replaced by musl.
>
> Of course, musl doesn't always use fast syscalls because it needs a
> vdso facility that doesn't currently exist. I'll deal with that
> eventually.
>

You don't actually need actual DSO support to support fast system calls
on i386. Even klibc uses them now, and the additional code to support
it is trivial.

-hpa


2016-03-08 18:51:29

by Andy Lutomirski

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On Tue, Mar 8, 2016 at 10:47 AM, H. Peter Anvin <[email protected]> wrote:
> On 03/08/16 10:45, Andy Lutomirski wrote:
>>
>> s/modern/most, perhaps?
>>
>> I'm hoping that some day Bionic goes away and gets replaced by musl.
>>
>> Of course, musl doesn't always use fast syscalls because it needs a
>> vdso facility that doesn't currently exist. I'll deal with that
>> eventually.
>>
>
> You don't actually need actual DSO support to support fast system calls
> on i386. Even klibc uses them now, and the additional code to support
> it is trivial.

That's not the issue. The issue is that musl does something
crazy^Wclever to support POSIX pthread cancellation, and it involves
being able to tell whether a signal's ucontext points to a syscall
and, if so, what the return address is. This is straightforward with
an inlined int $0x80, but doing it reliably with the current vdso
design would require parsing the DWARF data, and I can't really
blame musl for not wanting to do that.

There was a thread awhile back about adding a new vdso helper to do
this. I think I even had some code for it. If I find time, I'll try
to send patches for 4.7.

--Andy

2016-03-08 19:00:06

by H. Peter Anvin

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On 03/08/16 10:50, Andy Lutomirski wrote:
> On Tue, Mar 8, 2016 at 10:47 AM, H. Peter Anvin <[email protected]> wrote:
>> On 03/08/16 10:45, Andy Lutomirski wrote:
>>>
>>> s/modern/most, perhaps?
>>>
>>> I'm hoping that some day Bionic goes away and gets replaced by musl.
>>>
>>> Of course, musl doesn't always use fast syscalls because it needs a
>>> vdso facility that doesn't currently exist. I'll deal with that
>>> eventually.
>>>
>>
>> You don't actually need actual DSO support to support fast system calls
>> on i386. Even klibc uses them now, and the additional code to support
>> it is trivial.
>
> That's not the issue. The issue is that musl does something
> crazy^Wclever to support POSIX pthread cancellation, and it involves
> being able to tell whether a signal's ucontext points to a syscall
> and, if so, what the return address is. This is straightforward with
> an inlined int $0x80, but doing it reliably with the current vdso
> design would require parsing the DWARF data, and I can't really
> blame musl for not wanting to do that.
>
> There was a thread awhile back about adding a new vdso helper to do
> this. I think I even had some code for it. If I find time, I'll try
> to send patches for 4.7.
>

As far as I know, when we get a signal the EIP always points to int
$0x80 as we don't support system call restart (being a rare case) for
the fast system calls.

-hpa


2016-03-08 19:11:37

by Andy Lutomirski

Subject: Re: [PATCH] x86/entry: Improve system call entry comments

On Tue, Mar 8, 2016 at 10:59 AM, H. Peter Anvin <[email protected]> wrote:
> On 03/08/16 10:50, Andy Lutomirski wrote:
>> On Tue, Mar 8, 2016 at 10:47 AM, H. Peter Anvin <[email protected]> wrote:
>>> On 03/08/16 10:45, Andy Lutomirski wrote:
>>>>
>>>> s/modern/most, perhaps?
>>>>
>>>> I'm hoping that some day Bionic goes away and gets replaced by musl.
>>>>
>>>> Of course, musl doesn't always use fast syscalls because it needs a
>>>> vdso facility that doesn't currently exist. I'll deal with that
>>>> eventually.
>>>>
>>>
>>> You don't actually need actual DSO support to support fast system calls
>>> on i386. Even klibc uses them now, and the additional code to support
>>> it is trivial.
>>
>> That's not the issue. The issue is that musl does something
>> crazy^Wclever to support POSIX pthread cancellation, and it involves
>> being able to tell whether a signal's ucontext points to a syscall
>> and, if so, what the return address is. This is straightforward with
>> an inlined int $0x80, but doing it reliably with the current vdso
>> design would require parsing the DWARF data, and I can't really
>> blame musl for not wanting to do that.
>>
>> There was a thread awhile back about adding a new vdso helper to do
>> this. I think I even had some code for it. If I find time, I'll try
>> to send patches for 4.7.
>>
>
> As far as I know, when we get a signal the EIP always points to int
> $0x80 as we don't support system call restart (being a rare case) for
> the fast system calls.
>

We actually fully support system call restart on fast syscalls as of
(IIRC) 4.5, even on AMD. Phew!
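
One way to picture how a restarted fast system call ends up on INT $0x80, as the patch comment says: the kernel rewinds the saved user instruction pointer by one system call instruction and restores the call number. The sketch below is a simplified model, not the kernel's code. It assumes that the vDSO stub keeps an INT $0x80 directly in front of the address that SYSENTER/SYSCALL return to; `struct regs_model`, `rewind_for_restart` and `points_at_int80` are names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* INT $0x80 (CD 80), SYSENTER (0F 34) and SYSCALL (0F 05) are all 2 bytes. */
#define SYSCALL_INSN_LEN 2

/* Hypothetical, heavily reduced stand-in for the saved user registers. */
struct regs_model {
	uint32_t ip;		/* user address to return to */
	uint32_t ax;		/* return value slot, e.g. a restart error code */
	uint32_t orig_ax;	/* system call number saved at entry */
};

/* Rewind the saved state so that returning to user mode re-issues the call. */
static void rewind_for_restart(struct regs_model *r)
{
	r->ax = r->orig_ax;
	r->ip -= SYSCALL_INSN_LEN;
}

/* True if the two bytes at offset ip inside code[] encode INT $0x80. */
static bool points_at_int80(const uint8_t *code, size_t code_len, uint32_t ip)
{
	return (size_t)ip + SYSCALL_INSN_LEN <= code_len &&
	       code[ip] == 0xcd && code[ip + 1] == 0x80;
}
```

With a modeled stub of `sysenter; int $0x80; <landing pad>`, rewinding from the landing pad lands on the INT $0x80 bytes, so the restarted call goes through the legacy entry no matter which instruction made the original call.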

However, the nasty case for musl is when the cancellation signal
happens immediately before the actual kernel entry. The signal
handler needs some way to detect whether the thread is at a
cancellation point.

--Andy
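
The check the cancellation signal handler needs can be sketched as a range test on the interrupted program counter taken from the ucontext. This is a hedged illustration of the idea, not musl's actual code: `struct cancel_region` and `pc_in_cancel_region` are hypothetical names, and the assumption is that the library knows the addresses that bracket its own inlined system call sequence.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical half-open address range [begin, end) covering the
 * library's own code from the cancellation check up to and including
 * the inlined system call instruction.
 */
struct cancel_region {
	uintptr_t begin;
	uintptr_t end;
};

/*
 * Decide whether a signal interrupted the thread inside the region,
 * i.e. before the system call has completed. pc is the interrupted
 * program counter as reported in the signal's ucontext.
 */
static bool pc_in_cancel_region(const struct cancel_region *r, uintptr_t pc)
{
	return pc >= r->begin && pc < r->end;
}
```

With an inlined int $0x80 the end of the region is simply the address after that instruction. When the call goes through __kernel_vsyscall, the interrupted pc can lie inside the vDSO instead, outside any range the library controls, which is why a dedicated vdso helper would be needed.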