2022-01-07 23:52:38

by Ammar Faizi

Subject: [PATCH v1 2/3] x86/entry/64: Add info about registers on exit

There was a contentious discussion about the wording of the System
V ABI document regarding which registers the kernel is allowed to
clobber when userspace executes a syscall.

The discussion was resolved by reviewing the clobber list in the
glibc source. For historical reasons, glibc relies on the kernel
restoring all registers (except for rax, rcx and r11) before
returning to userspace.

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/25

This adds info about registers on exit.

Cc: Andy Lutomirski <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Michael Matz <[email protected]>
Cc: "H.J. Lu" <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: x86-ml <[email protected]>
Cc: lkml <[email protected]>
Cc: GNU/Weeb Mailing List <[email protected]>
Signed-off-by: Ammar Faizi <[email protected]>
---

The full comment from that file with the patch applied is quoted
below for easier review:
/*
* 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
*
* This is the only entry point used for 64-bit system calls. The
* hardware interface is reasonably well designed and the register to
* argument mapping Linux uses fits well with the registers that are
* available when SYSCALL is used.
*
* SYSCALL instructions can be found inlined in libc implementations as
* well as some other programs and libraries. There are also a handful
* of SYSCALL instructions in the vDSO used, for example, as a
* clock_gettimeofday fallback.
*
* 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
* then loads new ss, cs, and rip from previously programmed MSRs.
* rflags gets masked by a value from another MSR (so CLD and CLAC
* are not needed). SYSCALL does not save anything on the stack
* and does not change rsp.
*
* Registers on entry:
* rax system call number
* rcx return address
* r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
* rdi arg0
* rsi arg1
* rdx arg2
* r10 arg3 (needs to be moved to rcx to conform to C ABI)
* r8 arg4
* r9 arg5
* (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
*
* Only called from user space.
*
* Registers on exit:
* rax syscall return value
* rcx return address
* r11 rflags
*
* For a historical reason in the glibc source, the kernel must restore all
* registers except the rax (syscall return value) before returning to the
* userspace.
*
* In other words, with respect to the userspace, when the kernel returns
* to the userspace, only 3 registers are clobbered, they are rax, rcx,
* and r11.
*
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
*/

---

arch/x86/entry/entry_64.S | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e432dd075291..1111fff2e05f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -79,6 +79,19 @@
*
* Only called from user space.
*
+ * Registers on exit:
+ * rax syscall return value
+ * rcx return address
+ * r11 rflags
+ *
+ * For a historical reason in the glibc source, the kernel must restore all
+ * registers except the rax (syscall return value) before returning to the
+ * userspace.
+ *
+ * In other words, with respect to the userspace, when the kernel returns
+ * to the userspace, only 3 registers are clobbered, they are rax, rcx,
+ * and r11.
+ *
* When user can change pt_regs->foo always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
--
2.32.0
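
To make the "registers on exit" rule concrete from the userspace side, here is a minimal sketch of a raw syscall wrapper, assuming x86-64 Linux and a GCC- or Clang-compatible compiler. The helper name raw_syscall3 is hypothetical and is not taken from the patch or from glibc. The point is the clobber list: besides the rax output, only rcx and r11 need to be declared as clobbered.

```c
/*
 * Sketch only: assumes x86-64 Linux and GCC/Clang extended asm.
 * raw_syscall3 is a hypothetical helper name used for illustration.
 */
static long raw_syscall3(long nr, long a0, long a1, long a2)
{
	long ret;

	__asm__ volatile (
		"syscall"
		: "=a" (ret)            /* rax: syscall return value */
		: "a" (nr),             /* rax: system call number */
		  "D" (a0),             /* rdi: arg0 */
		  "S" (a1),             /* rsi: arg1 */
		  "d" (a2)              /* rdx: arg2 */
		: "rcx",                /* overwritten with the return address */
		  "r11",                /* overwritten with the saved rflags */
		  "memory"              /* the kernel may read or write user memory */
	);

	/* On failure the kernel returns a negative errno value in rax. */
	return ret;
}
```

For example, raw_syscall3(39, 0, 0, 0) would invoke getpid (number 39 on x86-64) and return the caller's PID. Because rdi, rsi and rdx are declared as inputs only, the compiler is free to assume they still hold their values after the instruction, which is exactly the behavior the comment documents.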



2022-01-08 00:03:57

by Andy Lutomirski

Subject: Re: [PATCH v1 2/3] x86/entry/64: Add info about registers on exit

On 1/7/22 15:52, Ammar Faizi wrote:
> There was a contentious discussion about the wording of the System
> V ABI document regarding which registers the kernel is allowed to
> clobber when userspace executes a syscall.
>
> The discussion was resolved by reviewing the clobber list in the
> glibc source. For historical reasons, glibc relies on the kernel
> restoring all registers (except for rax, rcx and r11) before
> returning to userspace.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/25
>
> This adds info about registers on exit.
>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Michael Matz <[email protected]>
> Cc: "H.J. Lu" <[email protected]>
> Cc: Willy Tarreau <[email protected]>
> Cc: x86-ml <[email protected]>
> Cc: lkml <[email protected]>
> Cc: GNU/Weeb Mailing List <[email protected]>
> Signed-off-by: Ammar Faizi <[email protected]>
> ---
>
> The full comment from that file with the patch applied is quoted
> below for easier review:
> /*
> * 64-bit SYSCALL instruction entry. Up to 6 arguments in registers.
> *
> * This is the only entry point used for 64-bit system calls. The
> * hardware interface is reasonably well designed and the register to
> * argument mapping Linux uses fits well with the registers that are
> * available when SYSCALL is used.
> *
> * SYSCALL instructions can be found inlined in libc implementations as
> * well as some other programs and libraries. There are also a handful
> * of SYSCALL instructions in the vDSO used, for example, as a
> * clock_gettimeofday fallback.
> *
> * 64-bit SYSCALL saves rip to rcx, clears rflags.RF, then saves rflags to r11,
> * then loads new ss, cs, and rip from previously programmed MSRs.
> * rflags gets masked by a value from another MSR (so CLD and CLAC
> * are not needed). SYSCALL does not save anything on the stack
> * and does not change rsp.
> *
> * Registers on entry:
> * rax system call number
> * rcx return address
> * r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
> * rdi arg0
> * rsi arg1
> * rdx arg2
> * r10 arg3 (needs to be moved to rcx to conform to C ABI)
> * r8 arg4
> * r9 arg5
> * (note: r12-r15, rbp, rbx are callee-preserved in C ABI)
> *
> * Only called from user space.
> *
> * Registers on exit:
> * rax syscall return value
> * rcx return address
> * r11 rflags
> *
> * For a historical reason in the glibc source, the kernel must restore all
> * registers except the rax (syscall return value) before returning to the
> * userspace.
> *
> * In other words, with respect to the userspace, when the kernel returns
> * to the userspace, only 3 registers are clobbered, they are rax, rcx,
> * and r11.
> *
> * When user can change pt_regs->foo always force IRET. That is because
> * it deals with uncanonical addresses better. SYSRET has trouble
> * with them due to bugs in both AMD and Intel CPUs.
> */
>
> ---
>
> arch/x86/entry/entry_64.S | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e432dd075291..1111fff2e05f 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -79,6 +79,19 @@
> *
> * Only called from user space.
> *
> + * Registers on exit:
> + * rax syscall return value
> + * rcx return address
> + * r11 rflags
> + *
> + * For a historical reason in the glibc source, the kernel must restore all
> + * registers except the rax (syscall return value) before returning to the
> + * userspace.
> + *
> + * In other words, with respect to the userspace, when the kernel returns
> + * to the userspace, only 3 registers are clobbered, they are rax, rcx,
> + * and r11.
> + *

I would say this much more concisely:

The Linux kernel preserves all registers (even C callee-clobbered
registers) except for rax, rcx and r11 across system calls, and existing
user code relies on this behavior.

> * When user can change pt_regs->foo always force IRET. That is because
> * it deals with uncanonical addresses better. SYSRET has trouble
> * with them due to bugs in both AMD and Intel CPUs.
>


2022-01-08 00:34:51

by Ammar Faizi

Subject: Re: [PATCH v1 2/3] x86/entry/64: Add info about registers on exit

On 1/8/22 7:03 AM, Andy Lutomirski wrote:
> On 1/7/22 15:52, Ammar Faizi wrote:
>> There was a contentious discussion about the wording of the System
>> V ABI document regarding which registers the kernel is allowed to
>> clobber when userspace executes a syscall.
>>
>> The discussion was resolved by reviewing the clobber list in the
>> glibc source. For historical reasons, glibc relies on the kernel
>> restoring all registers (except for rax, rcx and r11) before
>> returning to userspace.
>>
>> Link: https://lore.kernel.org/lkml/[email protected]/
>> Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/25
>>
>> This adds info about registers on exit.
>>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: Michael Matz <[email protected]>
>> Cc: "H.J. Lu" <[email protected]>
>> Cc: Willy Tarreau <[email protected]>
>> Cc: x86-ml <[email protected]>
>> Cc: lkml <[email protected]>
>> Cc: GNU/Weeb Mailing List <[email protected]>
>> Signed-off-by: Ammar Faizi <[email protected]>
>> ---
[...]
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index e432dd075291..1111fff2e05f 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -79,6 +79,19 @@
>> *
>> * Only called from user space.
>> *
>> + * Registers on exit:
>> + * rax syscall return value
>> + * rcx return address
>> + * r11 rflags
>> + *
>> + * For a historical reason in the glibc source, the kernel must restore all
>> + * registers except the rax (syscall return value) before returning to the
>> + * userspace.
>> + *
>> + * In other words, with respect to the userspace, when the kernel returns
>> + * to the userspace, only 3 registers are clobbered, they are rax, rcx,
>> + * and r11.
>> + *
>
> I would say this much more concisely:
>
> The Linux kernel preserves all registers (even C callee-clobbered
> registers) except for rax, rcx and r11 across system calls, and
> existing user code relies on this behavior.

Agreed, I will use that wording in v2 and credit it with a Suggested-by tag.

--
Ammar Faizi
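
The preservation behavior described in this thread can be observed from userspace with a small check. The following is a sketch, assuming x86-64 Linux and GCC/Clang local register variables. The function name regs_preserved_across_getpid and the marker values are made up for illustration. It loads known values into the argument registers, issues getpid (which ignores its arguments), and compares the registers afterwards. One passing run only illustrates the behavior; it does not prove the guarantee.

```c
/*
 * Sketch only: assumes x86-64 Linux and GCC/Clang.
 * regs_preserved_across_getpid is a hypothetical name.
 * Returns 1 if rdi, rsi, rdx, r10, r8 and r9 kept their values.
 */
static int regs_preserved_across_getpid(void)
{
	/* Pin these variables to the registers that have no constraint letter. */
	register long r10 __asm__("r10") = 0x1010;
	register long r8  __asm__("r8")  = 0x0808;
	register long r9  __asm__("r9")  = 0x0909;
	long rax = 39;          /* getpid on x86-64 */
	long rdi = 0x1111;
	long rsi = 0x2222;
	long rdx = 0x3333;

	__asm__ volatile (
		"syscall"
		/* "+" marks read-write operands so the values are re-read after. */
		: "+a" (rax), "+D" (rdi), "+S" (rsi), "+d" (rdx),
		  "+r" (r10), "+r" (r8), "+r" (r9)
		:
		: "rcx", "r11", "memory"
	);

	return rax > 0 &&
	       rdi == 0x1111 && rsi == 0x2222 && rdx == 0x3333 &&
	       r10 == 0x1010 && r8 == 0x0808 && r9 == 0x0909;
}
```

The registers checked here are all call-clobbered in the C ABI, so a C function call would be allowed to change them, while the syscall path is expected to leave them intact.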

