2016-03-11 16:53:43

by Denys Vlasenko

[permalink] [raw]
Subject: [PATCH] x86/asm/entry/32: simplify pushes of zeroed pt_regs->REGs

Use of a temporary R8 register here seems to be unnecessary.

"push %r8" is a two-byte insn (it needs REX prefix to specify R8),
"push $0" is two-byte too. It seems just using the latter would be
no worse.

Thus, code had an unnecessary "xorq %r8,%r8" insn.
It probably costs nothing in execution time here since we are probably
limited by store bandwidth at this point, but still.

Run-tested under QEMU: 32-bit calls still work:

/ # ./test_syscall_vdso32
[RUN] Executing 6-argument 32-bit syscall via VDSO
[OK] Arguments are preserved across syscall
[NOTE] R11 has changed:0000000000200ed7 - assuming clobbered by SYSRET insn
[OK] R8..R15 did not leak kernel data
[RUN] Executing 6-argument 32-bit syscall via INT 80
[OK] Arguments are preserved across syscall
[OK] R8..R15 did not leak kernel data
[RUN] Running tests under ptrace
[RUN] Executing 6-argument 32-bit syscall via VDSO
[OK] Arguments are preserved across syscall
[NOTE] R11 has changed:0000000000200ed7 - assuming clobbered by SYSRET insn
[OK] R8..R15 did not leak kernel data
[RUN] Executing 6-argument 32-bit syscall via INT 80
[OK] Arguments are preserved across syscall
[OK] R8..R15 did not leak kernel data

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/entry/entry_64_compat.S | 45 +++++++++++++++++++---------------------
1 file changed, 21 insertions(+), 24 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 847f2f0..e1721da 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -72,24 +72,23 @@ ENTRY(entry_SYSENTER_compat)
pushfq /* pt_regs->flags (except IF = 0) */
orl $X86_EFLAGS_IF, (%rsp) /* Fix saved flags */
pushq $__USER32_CS /* pt_regs->cs */
- xorq %r8,%r8
- pushq %r8 /* pt_regs->ip = 0 (placeholder) */
+ pushq $0 /* pt_regs->ip = 0 (placeholder) */
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */
pushq %rsi /* pt_regs->si */
pushq %rdx /* pt_regs->dx */
pushq %rcx /* pt_regs->cx */
pushq $-ENOSYS /* pt_regs->ax */
- pushq %r8 /* pt_regs->r8 = 0 */
- pushq %r8 /* pt_regs->r9 = 0 */
- pushq %r8 /* pt_regs->r10 = 0 */
- pushq %r8 /* pt_regs->r11 = 0 */
+ pushq $0 /* pt_regs->r8 = 0 */
+ pushq $0 /* pt_regs->r9 = 0 */
+ pushq $0 /* pt_regs->r10 = 0 */
+ pushq $0 /* pt_regs->r11 = 0 */
pushq %rbx /* pt_regs->rbx */
pushq %rbp /* pt_regs->rbp (will be overwritten) */
- pushq %r8 /* pt_regs->r12 = 0 */
- pushq %r8 /* pt_regs->r13 = 0 */
- pushq %r8 /* pt_regs->r14 = 0 */
- pushq %r8 /* pt_regs->r15 = 0 */
+ pushq $0 /* pt_regs->r12 = 0 */
+ pushq $0 /* pt_regs->r13 = 0 */
+ pushq $0 /* pt_regs->r14 = 0 */
+ pushq $0 /* pt_regs->r15 = 0 */
cld

/*
@@ -205,17 +204,16 @@ ENTRY(entry_SYSCALL_compat)
pushq %rdx /* pt_regs->dx */
pushq %rbp /* pt_regs->cx (stashed in bp) */
pushq $-ENOSYS /* pt_regs->ax */
- xorq %r8,%r8
- pushq %r8 /* pt_regs->r8 = 0 */
- pushq %r8 /* pt_regs->r9 = 0 */
- pushq %r8 /* pt_regs->r10 = 0 */
- pushq %r8 /* pt_regs->r11 = 0 */
+ pushq $0 /* pt_regs->r8 = 0 */
+ pushq $0 /* pt_regs->r9 = 0 */
+ pushq $0 /* pt_regs->r10 = 0 */
+ pushq $0 /* pt_regs->r11 = 0 */
pushq %rbx /* pt_regs->rbx */
pushq %rbp /* pt_regs->rbp (will be overwritten) */
- pushq %r8 /* pt_regs->r12 = 0 */
- pushq %r8 /* pt_regs->r13 = 0 */
- pushq %r8 /* pt_regs->r14 = 0 */
- pushq %r8 /* pt_regs->r15 = 0 */
+ pushq $0 /* pt_regs->r12 = 0 */
+ pushq $0 /* pt_regs->r13 = 0 */
+ pushq $0 /* pt_regs->r14 = 0 */
+ pushq $0 /* pt_regs->r15 = 0 */

/*
* User mode is traced as though IRQs are on, and SYSENTER
@@ -316,11 +314,10 @@ ENTRY(entry_INT80_compat)
pushq %rdx /* pt_regs->dx */
pushq %rcx /* pt_regs->cx */
pushq $-ENOSYS /* pt_regs->ax */
- xorq %r8,%r8
- pushq %r8 /* pt_regs->r8 = 0 */
- pushq %r8 /* pt_regs->r9 = 0 */
- pushq %r8 /* pt_regs->r10 = 0 */
- pushq %r8 /* pt_regs->r11 = 0 */
+ pushq $0 /* pt_regs->r8 = 0 */
+ pushq $0 /* pt_regs->r9 = 0 */
+ pushq $0 /* pt_regs->r10 = 0 */
+ pushq $0 /* pt_regs->r11 = 0 */
pushq %rbx /* pt_regs->rbx */
pushq %rbp /* pt_regs->rbp */
pushq %r12 /* pt_regs->r12 */
--
1.8.1.4


2016-03-12 15:38:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/asm/entry/32: simplify pushes of zeroed pt_regs->REGs


* Denys Vlasenko <[email protected]> wrote:

> Use of a temporary R8 register here seems to be unnecessary.
>
> "push %r8" is a two-byte insn (it needs REX prefix to specify R8),
> "push $0" is two-byte too. It seems just using the latter would be
> no worse.
>
> Thus, code had an unnecessary "xorq %r8,%r8" insn.

Neat!

> It probably costs nothing in execution time here since we are probably
> limited by store bandwidth at this point, but still.
>
> Run-tested under QEMU: 32-bit calls still work:
>
> / # ./test_syscall_vdso32

Did you manage to test all 3 compat variants:

> @@ -72,24 +72,23 @@ ENTRY(entry_SYSENTER_compat)
> @@ -205,17 +204,16 @@ ENTRY(entry_SYSCALL_compat)
> @@ -316,11 +314,10 @@ ENTRY(entry_INT80_compat)

?

Thanks,

Ingo

2016-03-12 15:45:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/asm/entry/32: simplify pushes of zeroed pt_regs->REGs


* Denys Vlasenko <[email protected]> wrote:

> Use of a temporary R8 register here seems to be unnecessary.
>
> "push %r8" is a two-byte insn (it needs REX prefix to specify R8),
> "push $0" is two-byte too. It seems just using the latter would be
> no worse.
>
> Thus, code had an unnecessary "xorq %r8,%r8" insn.
> It probably costs nothing in execution time here since we are probably
> limited by store bandwidth at this point, but still.

Note that the 3 fewer instruction in the image also shrink the code by 16 bytes:

arch/x86/entry/entry_64_compat.o:

text data bss dec hex filename
380 0 0 380 17c entry_64_compat.o.before
364 0 0 364 16c entry_64_compat.o.after

because (at least in this defconfig build) one of these functions shrunk below a
16-byte boundary. So cache footprint got denser.

Thanks,

Ingo

2016-03-12 17:53:19

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86/asm/entry/32: simplify pushes of zeroed pt_regs->REGs

On 03/12/2016 04:38 PM, Ingo Molnar wrote:
>
> * Denys Vlasenko <[email protected]> wrote:
>
>> Use of a temporary R8 register here seems to be unnecessary.
>>
>> "push %r8" is a two-byte insn (it needs REX prefix to specify R8),
>> "push $0" is two-byte too. It seems just using the latter would be
>> no worse.
>>
>> Thus, code had an unnecessary "xorq %r8,%r8" insn.
>
> Neat!
>
>> It probably costs nothing in execution time here since we are probably
>> limited by store bandwidth at this point, but still.
>>
>> Run-tested under QEMU: 32-bit calls still work:
>>
>> / # ./test_syscall_vdso32
>
> Did you manage to test all 3 compat variants:
>
>> @@ -72,24 +72,23 @@ ENTRY(entry_SYSENTER_compat)
>> @@ -205,17 +204,16 @@ ENTRY(entry_SYSCALL_compat)
>> @@ -316,11 +314,10 @@ ENTRY(entry_INT80_compat)

Yes.

test_syscall_vdso32 checks vdso syscall (if available)
and direct int80 syscall.
Booting two times, with different qemu flags:

qemu-system-x86_64 -cpu Opteron_G4
qemu-system-x86_64 -cpu SandyBridge

makes kernel choose either SYSCALL or SYSENTER vdso.
So it's all covered.

2016-03-12 18:05:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86/asm/entry/32: simplify pushes of zeroed pt_regs->REGs

On Sat, Mar 12, 2016 at 9:53 AM, Denys Vlasenko <[email protected]> wrote:
> On 03/12/2016 04:38 PM, Ingo Molnar wrote:
>>
>> * Denys Vlasenko <[email protected]> wrote:
>>
>>> Use of a temporary R8 register here seems to be unnecessary.
>>>
>>> "push %r8" is a two-byte insn (it needs REX prefix to specify R8),
>>> "push $0" is two-byte too. It seems just using the latter would be
>>> no worse.
>>>
>>> Thus, code had an unnecessary "xorq %r8,%r8" insn.
>>
>> Neat!
>>
>>> It probably costs nothing in execution time here since we are probably
>>> limited by store bandwidth at this point, but still.
>>>
>>> Run-tested under QEMU: 32-bit calls still work:
>>>
>>> / # ./test_syscall_vdso32
>>
>> Did you manage to test all 3 compat variants:
>>
>>> @@ -72,24 +72,23 @@ ENTRY(entry_SYSENTER_compat)
>>> @@ -205,17 +204,16 @@ ENTRY(entry_SYSCALL_compat)
>>> @@ -316,11 +314,10 @@ ENTRY(entry_INT80_compat)
>
> Yes.
>
> test_syscall_vdso32 checks vdso syscall (if available)
> and direct int80 syscall.
> Booting two times, with different qemu flags:
>
> qemu-system-x86_64 -cpu Opteron_G4
> qemu-system-x86_64 -cpu SandyBridge
>
> makes kernel choose either SYSCALL or SYSENTER vdso.
> So it's all covered.

How carefully did you check the latter bit? In my experience, if KVM
is used, your cpu will report as your native CPU's manufacturer
regardless of who actually makes the emulated CPU. -machine accel=tcg
turns that off.

--
Andy Lutomirski
AMA Capital Management, LLC

2016-03-12 21:41:32

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86/asm/entry/32: simplify pushes of zeroed pt_regs->REGs

On 03/12/2016 07:05 PM, Andy Lutomirski wrote:
> On Sat, Mar 12, 2016 at 9:53 AM, Denys Vlasenko <[email protected]> wrote:
>> On 03/12/2016 04:38 PM, Ingo Molnar wrote:
>>>
>>> * Denys Vlasenko <[email protected]> wrote:
>>>
>>>> Use of a temporary R8 register here seems to be unnecessary.
>>>>
>>>> "push %r8" is a two-byte insn (it needs REX prefix to specify R8),
>>>> "push $0" is two-byte too. It seems just using the latter would be
>>>> no worse.
>>>>
>>>> Thus, code had an unnecessary "xorq %r8,%r8" insn.
>>>
>>> Neat!
>>>
>>>> It probably costs nothing in execution time here since we are probably
>>>> limited by store bandwidth at this point, but still.
>>>>
>>>> Run-tested under QEMU: 32-bit calls still work:
>>>>
>>>> / # ./test_syscall_vdso32
>>>
>>> Did you manage to test all 3 compat variants:
>>>
>>>> @@ -72,24 +72,23 @@ ENTRY(entry_SYSENTER_compat)
>>>> @@ -205,17 +204,16 @@ ENTRY(entry_SYSCALL_compat)
>>>> @@ -316,11 +314,10 @@ ENTRY(entry_INT80_compat)
>>
>> Yes.
>>
>> test_syscall_vdso32 checks vdso syscall (if available)
>> and direct int80 syscall.
>> Booting two times, with different qemu flags:
>>
>> qemu-system-x86_64 -cpu Opteron_G4
>> qemu-system-x86_64 -cpu SandyBridge
>>
>> makes kernel choose either SYSCALL or SYSENTER vdso.
>> So it's all covered.
>
> How carefully did you check the latter bit?

To double-check, I built a kernel with intentionally crippled
SYSENTER handling (infinite loop).

qemu-system-x86_64 -cpu Opteron_G4 - works
qemu-system-x86_64 -cpu SandyBridge - ./test_syscall_vdso_32 hung

This proves that -cpu SandyBridge does cause SYSENTER path to be used.