2015-02-24 18:52:22

by Denys Vlasenko

Subject: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

In all three 32-bit entry points, %eax is zero-extended to %rax.
It is safe to do 32-bit compare when checking that syscall#
is not too large.

The last instance of "mysterious" SS+8 constant is replaced by SIZEOF_PTREGS.

The $AUDIT_ARCH_X86_64 parameter to syscall_trace_enter_phase1/2
is a 32-bit constant; loading it with a 64-bit MOV produces a 10-byte insn
instead of a 5-byte one.
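
For reference, roughly what the assembler emits for the two forms
(byte counts are the usual GAS encodings; AUDIT_ARCH_X86_64 is
0xc000003e, which cannot be a sign-extended imm32):

    movl $AUDIT_ARCH_X86_64, %esi   # be 3e 00 00 c0                 (5 bytes, zero-extends into %rsi)
    movq $AUDIT_ARCH_X86_64, %rsi   # 48 be 3e 00 00 c0 00 00 00 00  (10-byte movabs form)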

After a TEST insn, JE actually means "jump if zero";
let's use the JZ mnemonic instead.

At irq_return_via_sysret:
* avoid redundant load of %r11 (it is already loaded a few instructions before).
* do not needlessly increment %rsp - we are going to return to userspace
via SYSRET, and this insn doesn't use the stack for its return.

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/ia32/ia32entry.S | 14 +++++++-------
arch/x86/include/asm/calling.h | 3 +++
arch/x86/kernel/entry_64.S | 13 ++++++-------
3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 6dcd372..01eca80 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -163,8 +163,8 @@ sysenter_flags_fixed:
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
CFI_REMEMBER_STATE
- jnz sysenter_tracesys
- cmpq $(IA32_NR_syscalls-1),%rax
+ jnz sysenter_tracesys
+ cmpl $(IA32_NR_syscalls-1),%eax
ja ia32_badsys
sysenter_do_call:
/* 32bit syscall -> 64bit C ABI argument conversion */
@@ -350,9 +350,9 @@ ENTRY(ia32_cstar_target)
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
CFI_REMEMBER_STATE
- jnz cstar_tracesys
- cmpq $IA32_NR_syscalls-1,%rax
- ja ia32_badsys
+ jnz cstar_tracesys
+ cmpl $(IA32_NR_syscalls-1),%eax
+ ja ia32_badsys
cstar_do_call:
/* 32bit syscall -> 64bit C ABI argument conversion */
movl %edi,%r8d /* arg5 */
@@ -473,7 +473,7 @@ ENTRY(ia32_syscall)
orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
jnz ia32_tracesys
- cmpq $(IA32_NR_syscalls-1),%rax
+ cmpl $(IA32_NR_syscalls-1),%eax
ja ia32_badsys
ia32_do_call:
/* 32bit syscall -> 64bit C ABI argument conversion */
@@ -536,7 +536,7 @@ ia32_ptregs_common:
CFI_ENDPROC
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,SS+8
+ CFI_DEF_CFA rsp,SIZEOF_PTREGS
CFI_REL_OFFSET rax,RAX
CFI_REL_OFFSET rcx,RCX
CFI_REL_OFFSET rdx,RDX
diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index 3374235..f1a962f 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -176,6 +176,9 @@ For 32-bit we have the following conventions - kernel is built with
.macro RESTORE_C_REGS_EXCEPT_RCX
RESTORE_C_REGS_HELPER 1,0,1,1,1
.endm
+ .macro RESTORE_C_REGS_EXCEPT_R11
+ RESTORE_C_REGS_HELPER 1,1,0,1,1
+ .endm
.macro RESTORE_RSI_RDI
RESTORE_C_REGS_HELPER 0,0,0,0,0
.endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 64f2fd3..b93f3a2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -312,7 +312,7 @@ int_ret_from_sys_call_fixup:
/* Do syscall tracing */
tracesys:
movq %rsp, %rdi
- movq $AUDIT_ARCH_X86_64, %rsi
+ movl $AUDIT_ARCH_X86_64, %esi
call syscall_trace_enter_phase1
test %rax, %rax
jnz tracesys_phase2 /* if needed, run the slow path */
@@ -323,7 +323,7 @@ tracesys_phase2:
SAVE_EXTRA_REGS
FIXUP_TOP_OF_STACK %rdi
movq %rsp, %rdi
- movq $AUDIT_ARCH_X86_64, %rsi
+ movl $AUDIT_ARCH_X86_64, %esi
movq %rax,%rdx
call syscall_trace_enter_phase2

@@ -686,7 +686,7 @@ ret_from_intr:
exit_intr:
GET_THREAD_INFO(%rcx)
testl $3,CS(%rsp)
- je retint_kernel
+ jz retint_kernel

/* Interrupt came from user space */
/*
@@ -742,7 +742,7 @@ retint_swapgs: /* return to user-space */
cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */
jne opportunistic_sysret_failed

- testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
+ testl $X86_EFLAGS_RF,%r11d /* sysret can't restore RF */
jnz opportunistic_sysret_failed

/* nothing to check for RSP */
@@ -756,9 +756,8 @@ retint_swapgs: /* return to user-space */
*/
irq_return_via_sysret:
CFI_REMEMBER_STATE
- RESTORE_C_REGS
- REMOVE_PT_GPREGS_FROM_STACK 8
- movq (RSP-RIP)(%rsp),%rsp
+ RESTORE_C_REGS_EXCEPT_R11
+ movq RSP(%rsp),%rsp
USERGS_SYSRET64
CFI_RESTORE_STATE

--
1.8.1.4


2015-02-24 18:52:23

by Denys Vlasenko

Subject: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

PER_CPU_VAR(kernel_stack) was set up in a way where it points
five stack slots below the top of stack.

Presumably, it was done to avoid one "sub $5*8,%rsp"
in syscall/sysenter code paths, where iret frame needs to be
created by hand.

Ironically, none of them benefit from this optimization,
since all of them need to allocate additional data on stack
(struct pt_regs), so they still have to perform subtraction.
And ia32_sysenter_target even needs to *undo* this optimization:
it constructs iret stack with pushes instead of movs,
so it needs to start right at the top.

This patch eliminates KERNEL_STACK_OFFSET.
PER_CPU_VAR(kernel_stack) now points directly to top of stack.
pt_regs allocations are adjusted to allocate iret frame as well.

Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??"
are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
"rsp is SIZEOF_PTREGS bytes below the top, calculate
thread_info's address using that information"
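
For example (writing "top" for PER_CPU_VAR(kernel_stack), which after
this patch really is the top of the stack):

    TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
      == %rsp + TI_flags + SIZEOF_PTREGS - THREAD_SIZE   /* %rsp == top - SIZEOF_PTREGS here */
      == top - THREAD_SIZE + TI_flags                    /* i.e. &current_thread_info()->flags */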

Net result in code is that one instruction is eliminated,
and constants in several others have changed.

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/ia32/ia32entry.S | 35 +++++++++++++++++------------------
arch/x86/include/asm/thread_info.h | 8 +++-----
arch/x86/kernel/cpu/common.c | 2 +-
arch/x86/kernel/entry_64.S | 8 ++++----
arch/x86/kernel/process_32.c | 3 +--
arch/x86/kernel/process_64.c | 3 +--
arch/x86/kernel/smpboot.c | 3 +--
arch/x86/xen/smp.c | 3 +--
8 files changed, 29 insertions(+), 36 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 01eca80..b349ae8 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -114,7 +114,6 @@ ENTRY(ia32_sysenter_target)
CFI_REGISTER rsp,rbp
SWAPGS_UNSAFE_STACK
movq PER_CPU_VAR(kernel_stack), %rsp
- addq $(KERNEL_STACK_OFFSET),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs, here we enable it straight after entry:
@@ -128,7 +127,7 @@ ENTRY(ia32_sysenter_target)
CFI_REL_OFFSET rsp,0
pushfq_cfi
/*CFI_REL_OFFSET rflags,0*/
- movl TI_sysenter_return+THREAD_INFO(%rsp,3*8-KERNEL_STACK_OFFSET),%r10d
+ movl TI_sysenter_return+THREAD_INFO(%rsp,3*8),%r10d
CFI_REGISTER rip,r10
pushq_cfi $__USER32_CS
/*CFI_REL_OFFSET cs,0*/
@@ -160,8 +159,8 @@ ENTRY(ia32_sysenter_target)
jnz sysenter_fix_flags
sysenter_flags_fixed:

- orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+ orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+ testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
CFI_REMEMBER_STATE
jnz sysenter_tracesys
cmpl $(IA32_NR_syscalls-1),%eax
@@ -178,10 +177,10 @@ sysenter_dispatch:
movq %rax,RAX(%rsp)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
- testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz sysexit_audit
sysexit_from_sys_call:
- andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+ andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
/* clear IF, that popfq doesn't enable interrupts early */
andl $~0x200,EFLAGS(%rsp)
movl RIP(%rsp),%edx /* User %eip */
@@ -226,7 +225,7 @@ sysexit_from_sys_call:
.endm

.macro auditsys_exit exit
- testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz ia32_ret_from_sys_call
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
@@ -241,7 +240,7 @@ sysexit_from_sys_call:
movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
- testl %edi,TI_flags+THREAD_INFO(%rsp,RIP)
+ testl %edi,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jz \exit
CLEAR_RREGS
jmp int_with_check
@@ -263,7 +262,7 @@ sysenter_fix_flags:

sysenter_tracesys:
#ifdef CONFIG_AUDITSYSCALL
- testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jz sysenter_auditsys
#endif
SAVE_EXTRA_REGS
@@ -312,7 +311,7 @@ ENDPROC(ia32_sysenter_target)
ENTRY(ia32_cstar_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
+ CFI_DEF_CFA rsp,0
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
@@ -324,7 +323,7 @@ ENTRY(ia32_cstar_target)
* disabled irqs and here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
- ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */
+ ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
SAVE_C_REGS_EXCEPT_RCX_R891011
movl %eax,%eax /* zero extension */
movq %rax,ORIG_RAX(%rsp)
@@ -347,8 +346,8 @@ ENTRY(ia32_cstar_target)
1: movl (%r8),%r9d
_ASM_EXTABLE(1b,ia32_badarg)
ASM_CLAC
- orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+ orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+ testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
CFI_REMEMBER_STATE
jnz cstar_tracesys
cmpl $(IA32_NR_syscalls-1),%eax
@@ -365,10 +364,10 @@ cstar_dispatch:
movq %rax,RAX(%rsp)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
- testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz sysretl_audit
sysretl_from_sys_call:
- andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
+ andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
RESTORE_RSI_RDI_RDX
movl RIP(%rsp),%ecx
CFI_REGISTER rip,rcx
@@ -403,7 +402,7 @@ sysretl_audit:

cstar_tracesys:
#ifdef CONFIG_AUDITSYSCALL
- testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jz cstar_auditsys
#endif
xchgl %r9d,%ebp
@@ -470,8 +469,8 @@ ENTRY(ia32_syscall)
this could be a problem. */
ALLOC_PT_GPREGS_ON_STACK
SAVE_C_REGS_EXCEPT_R891011
- orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+ orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
+ testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz ia32_tracesys
cmpl $(IA32_NR_syscalls-1),%eax
ja ia32_badsys
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e82e95a..3eb9d05 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -149,7 +149,6 @@ struct thread_info {
#define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)

#define STACK_WARN (THREAD_SIZE/8)
-#define KERNEL_STACK_OFFSET (5*(BITS_PER_LONG/8))

/*
* macros/functions for gaining access to the thread information structure
@@ -163,8 +162,7 @@ DECLARE_PER_CPU(unsigned long, kernel_stack);
static inline struct thread_info *current_thread_info(void)
{
struct thread_info *ti;
- ti = (void *)(this_cpu_read_stable(kernel_stack) +
- KERNEL_STACK_OFFSET - THREAD_SIZE);
+ ti = (void *)(this_cpu_read_stable(kernel_stack) - THREAD_SIZE);
return ti;
}

@@ -184,13 +182,13 @@ static inline unsigned long current_stack_pointer(void)
/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg) \
_ASM_MOV PER_CPU_VAR(kernel_stack),reg ; \
- _ASM_SUB $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg ;
+ _ASM_SUB $(THREAD_SIZE),reg ;

/*
* Same if PER_CPU_VAR(kernel_stack) is, perhaps with some offset, already in
* a certain register (to be used in assembler memory operands).
*/
-#define THREAD_INFO(reg, off) KERNEL_STACK_OFFSET+(off)-THREAD_SIZE(reg)
+#define THREAD_INFO(reg, off) (off)-THREAD_SIZE(reg)

#endif

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb56925..a47ca751 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1116,7 +1116,7 @@ static __init int setup_disablecpuid(char *arg)
__setup("clearcpuid=", setup_disablecpuid);

DEFINE_PER_CPU(unsigned long, kernel_stack) =
- (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
+ (unsigned long)&init_thread_union + THREAD_SIZE;
EXPORT_PER_CPU_SYMBOL(kernel_stack);

#ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b93f3a2..91af6be 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -237,7 +237,7 @@ ENDPROC(native_usergs_sysret64)
ENTRY(system_call)
CFI_STARTPROC simple
CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
+ CFI_DEF_CFA rsp,0
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
@@ -256,13 +256,13 @@ GLOBAL(system_call_after_swapgs)
* and short:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
- ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */
+ ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
SAVE_C_REGS_EXCEPT_RAX_RCX
movq $-ENOSYS,RAX(%rsp)
movq_cfi rax,ORIG_RAX
movq %rcx,RIP(%rsp)
CFI_REL_OFFSET rip,RIP
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz tracesys
system_call_fastpath:
#if __SYSCALL_MASK == ~0
@@ -280,7 +280,7 @@ system_call_fastpath:
* Has incomplete stack frame and undefined top of stack.
*/
ret_from_sys_call:
- testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
+ testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz int_ret_from_sys_call_fixup /* Go the the slow path */

LOCKDEP_SYS_EXIT
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8f3ebfe..ecda2bd 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -311,8 +311,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
arch_end_context_switch(next_p);

this_cpu_write(kernel_stack,
- (unsigned long)task_stack_page(next_p) +
- THREAD_SIZE - KERNEL_STACK_OFFSET);
+ (unsigned long)task_stack_page(next_p) + THREAD_SIZE);

/*
* Restore %gs if needed (which is common)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 5a2c029..975d342 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -414,8 +414,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(__preempt_count, task_thread_info(next_p)->saved_preempt_count);

this_cpu_write(kernel_stack,
- (unsigned long)task_stack_page(next_p) +
- THREAD_SIZE - KERNEL_STACK_OFFSET);
+ (unsigned long)task_stack_page(next_p) + THREAD_SIZE);

/*
* Now maybe reload the debug registers and handle I/O bitmaps
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index febc6aa..73dfb1f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -811,8 +811,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
initial_gs = per_cpu_offset(cpu);
#endif
per_cpu(kernel_stack, cpu) =
- (unsigned long)task_stack_page(idle) -
- KERNEL_STACK_OFFSET + THREAD_SIZE;
+ (unsigned long)task_stack_page(idle) + THREAD_SIZE;
early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
initial_code = (unsigned long)start_secondary;
stack_start = idle->thread.sp;
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 4c071ae..4e71a3b 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -452,8 +452,7 @@ static int xen_cpu_up(unsigned int cpu, struct task_struct *idle)
clear_tsk_thread_flag(idle, TIF_FORK);
#endif
per_cpu(kernel_stack, cpu) =
- (unsigned long)task_stack_page(idle) -
- KERNEL_STACK_OFFSET + THREAD_SIZE;
+ (unsigned long)task_stack_page(idle) + THREAD_SIZE;

xen_setup_runstate_info(cpu);
xen_setup_timer(cpu);
--
1.8.1.4

2015-02-24 18:53:02

by Denys Vlasenko

Subject: [PATCH 3/4] x86: save r11 into pt_regs->eflags on SYSCALL64 fastpath

Before this patch, rcx and r11 were saved in pt_regs->rcx
and pt_regs->r11, which looks natural but requires messy
shuffling to/from the iret stack whenever ptrace or e.g. iopl
wants to modify the return address or flags - because that's
how these registers are used by SYSCALL/SYSRET.

This patch saves rcx and r11 in pt_regs->rip and pt_regs->flags,
and uses these values for SYSRET64 insn. Shuffling is eliminated.
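
For reference, this just follows the SYSCALL/SYSRET register convention
(hardware behavior, also noted in the entry_64.S comments):

    /* SYSCALL:  rcx = user rip, r11 = user rflags */
    movq %rcx,RIP(%rsp)
    movq %r11,EFLAGS(%rsp)
    ...
    /* SYSRET64: rip = rcx, rflags = r11 (RF and VM forced to 0) */
    movq RIP(%rsp),%rcx
    movq EFLAGS(%rsp),%r11
    USERGS_SYSRET64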

stub_iopl is no longer needed: pt_regs->flags needs no fixing up.

Testing shows that syscall fast path is ~54.3 ns before
and after the patch (on 2.7 GHz Sandy Bridge CPU).

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/include/asm/calling.h | 20 ++++++++++++++------
arch/x86/kernel/entry_64.S | 33 +++++++++------------------------
arch/x86/syscalls/syscall_64.tbl | 2 +-
arch/x86/um/sys_call_table_64.c | 2 +-
4 files changed, 25 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index f1a962f..4b5f7bf 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -95,9 +95,11 @@ For 32-bit we have the following conventions - kernel is built with
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
.endm

- .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8plus=1
- .if \r8plus
+ .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
+ .if \r11
movq_cfi r11, 6*8+\offset
+ .endif
+ .if \r8910
movq_cfi r10, 7*8+\offset
movq_cfi r9, 8*8+\offset
movq_cfi r8, 9*8+\offset
@@ -113,16 +115,19 @@ For 32-bit we have the following conventions - kernel is built with
movq_cfi rdi, 14*8+\offset
.endm
.macro SAVE_C_REGS offset=0
- SAVE_C_REGS_HELPER \offset, 1, 1, 1
+ SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
.endm
.macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
- SAVE_C_REGS_HELPER \offset, 0, 0, 1
+ SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
.endm
.macro SAVE_C_REGS_EXCEPT_R891011
- SAVE_C_REGS_HELPER 0, 1, 1, 0
+ SAVE_C_REGS_HELPER 0, 1, 1, 0, 0
.endm
.macro SAVE_C_REGS_EXCEPT_RCX_R891011
- SAVE_C_REGS_HELPER 0, 1, 0, 0
+ SAVE_C_REGS_HELPER 0, 1, 0, 0, 0
+ .endm
+ .macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
+ SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
.endm

.macro SAVE_EXTRA_REGS offset=0
@@ -179,6 +184,9 @@ For 32-bit we have the following conventions - kernel is built with
.macro RESTORE_C_REGS_EXCEPT_R11
RESTORE_C_REGS_HELPER 1,1,0,1,1
.endm
+ .macro RESTORE_C_REGS_EXCEPT_RCX_R11
+ RESTORE_C_REGS_HELPER 1,0,0,1,1
+ .endm
.macro RESTORE_RSI_RDI
RESTORE_C_REGS_HELPER 0,0,0,0,0
.endm
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 91af6be..2fd9349 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -121,14 +121,12 @@ ENDPROC(native_usergs_sysret64)
#endif

/*
- * C code is not supposed to know about undefined top of stack. Every time
+ * C code is not supposed to know that iret frame is not populated. Every time
* a C function with an pt_regs argument is called from the SYSCALL based
* fast path FIXUP_TOP_OF_STACK is needed.
* RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
* manipulation.
*/
-
- /* %rsp:at FRAMEEND */
.macro FIXUP_TOP_OF_STACK tmp offset=0
movq PER_CPU_VAR(old_rsp),\tmp
movq \tmp,RSP+\offset(%rsp)
@@ -136,15 +134,13 @@ ENDPROC(native_usergs_sysret64)
movq $__USER_CS,CS+\offset(%rsp)
movq RIP+\offset(%rsp),\tmp /* get rip */
movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */
- movq R11+\offset(%rsp),\tmp /* get eflags */
- movq \tmp,EFLAGS+\offset(%rsp)
+ movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
+ movq \tmp,R11+\offset(%rsp)
.endm

.macro RESTORE_TOP_OF_STACK tmp offset=0
movq RSP+\offset(%rsp),\tmp
movq \tmp,PER_CPU_VAR(old_rsp)
- movq EFLAGS+\offset(%rsp),\tmp
- movq \tmp,R11+\offset(%rsp)
.endm

/*
@@ -257,9 +253,10 @@ GLOBAL(system_call_after_swapgs)
*/
ENABLE_INTERRUPTS(CLBR_NONE)
ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
- SAVE_C_REGS_EXCEPT_RAX_RCX
+ SAVE_C_REGS_EXCEPT_RAX_RCX_R11
movq $-ENOSYS,RAX(%rsp)
movq_cfi rax,ORIG_RAX
+ movq %r11,EFLAGS(%rsp)
movq %rcx,RIP(%rsp)
CFI_REL_OFFSET rip,RIP
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
@@ -277,7 +274,7 @@ system_call_fastpath:
movq %rax,RAX(%rsp)
/*
* Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
+ * Has incompletely filled pt_regs, iret frame is also incomplete.
*/
ret_from_sys_call:
testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
@@ -291,9 +288,10 @@ ret_from_sys_call:
* sysretq will re-enable interrupts:
*/
TRACE_IRQS_ON
- RESTORE_C_REGS_EXCEPT_RCX
- movq RIP(%rsp),%rcx
+ RESTORE_C_REGS_EXCEPT_RCX_R11
+ movq RIP(%rsp),%rcx
CFI_REGISTER rip,rcx
+ movq EFLAGS(%rsp),%r11
/*CFI_REGISTER rflags,r11*/
movq PER_CPU_VAR(old_rsp), %rsp
/*
@@ -422,22 +420,9 @@ ENTRY(stub_\func)
END(stub_\func)
.endm

- .macro FIXED_FRAME label,func
-ENTRY(\label)
- CFI_STARTPROC
- DEFAULT_FRAME 0, 8 /* offset 8: return address */
- FIXUP_TOP_OF_STACK %r11, 8
- call \func
- RESTORE_TOP_OF_STACK %r11, 8
- ret
- CFI_ENDPROC
-END(\label)
- .endm
-
FORK_LIKE clone
FORK_LIKE fork
FORK_LIKE vfork
- FIXED_FRAME stub_iopl, sys_iopl

ENTRY(stub_execve)
CFI_STARTPROC
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..9ef32d5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -178,7 +178,7 @@
169 common reboot sys_reboot
170 common sethostname sys_sethostname
171 common setdomainname sys_setdomainname
-172 common iopl stub_iopl
+172 common iopl sys_iopl
173 common ioperm sys_ioperm
174 64 create_module
175 common init_module sys_init_module
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index 5cdfa9d..a75d8700 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -16,7 +16,7 @@
*/

/* Not going to be implemented by UML, since we have no hardware. */
-#define stub_iopl sys_ni_syscall
+#define sys_iopl sys_ni_syscall
#define sys_ioperm sys_ni_syscall

/*
--
1.8.1.4

2015-02-24 18:52:30

by Denys Vlasenko

Subject: [PATCH 4/4] x86: save user %rsp in pt_regs->sp

PER_CPU(old_rsp) usage is simplified - now it is used only
as temp storage, and the userspace stack pointer is immediately stored
in pt_regs->sp on syscall entry, instead of being used later,
on syscall exit. This allows us to get rid of thread_struct::usersp.

The lazy store of pt_regs->cs and pt_regs->ss is retained -
tests have shown that these stores do cost ~2 cycles on the fast path.

Signed-off-by: Denys Vlasenko <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Borislav Petkov <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Oleg Nesterov <[email protected]>
CC: Frederic Weisbecker <[email protected]>
CC: Alexei Starovoitov <[email protected]>
CC: Will Drewry <[email protected]>
CC: Kees Cook <[email protected]>
CC: [email protected]
CC: [email protected]
---
arch/x86/include/asm/compat.h | 2 +-
arch/x86/include/asm/processor.h | 6 ------
arch/x86/include/asm/ptrace.h | 8 ++------
arch/x86/kernel/entry_64.S | 22 +++++++++-------------
arch/x86/kernel/perf_regs.c | 2 +-
arch/x86/kernel/process_64.c | 8 +-------
6 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index 59c6c40..acdee09 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -301,7 +301,7 @@ static inline void __user *arch_compat_alloc_user_space(long len)
sp = task_pt_regs(current)->sp;
} else {
/* -128 for the x32 ABI redzone */
- sp = this_cpu_read(old_rsp) - 128;
+ sp = task_pt_regs(current)->sp - 128;
}

return (void __user *)round_down(sp - len, 16);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a092a0c..ce4aadf 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -474,7 +474,6 @@ struct thread_struct {
#ifdef CONFIG_X86_32
unsigned long sysenter_cs;
#else
- unsigned long usersp; /* Copy from PDA */
unsigned short es;
unsigned short ds;
unsigned short fsindex;
@@ -935,11 +934,6 @@ extern unsigned long thread_saved_pc(struct task_struct *tsk);
#define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)
extern unsigned long KSTK_ESP(struct task_struct *task);

-/*
- * User space RSP while inside the SYSCALL fast path
- */
-DECLARE_PER_CPU(unsigned long, old_rsp);
-
#endif /* CONFIG_X86_64 */

extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 4077d96..74bb2e0 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -145,12 +145,8 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
#endif
}

-#define current_user_stack_pointer() this_cpu_read(old_rsp)
-/* ia32 vs. x32 difference */
-#define compat_user_stack_pointer() \
- (test_thread_flag(TIF_IA32) \
- ? current_pt_regs()->sp \
- : this_cpu_read(old_rsp))
+#define current_user_stack_pointer() current_pt_regs()->sp
+#define compat_user_stack_pointer() current_pt_regs()->sp
#endif

#ifdef CONFIG_X86_32
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2fd9349..c63a582 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
* manipulation.
*/
.macro FIXUP_TOP_OF_STACK tmp offset=0
- movq PER_CPU_VAR(old_rsp),\tmp
- movq \tmp,RSP+\offset(%rsp)
movq $__USER_DS,SS+\offset(%rsp)
movq $__USER_CS,CS+\offset(%rsp)
movq RIP+\offset(%rsp),\tmp /* get rip */
@@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
.endm

.macro RESTORE_TOP_OF_STACK tmp offset=0
- movq RSP+\offset(%rsp),\tmp
- movq \tmp,PER_CPU_VAR(old_rsp)
+ /* nothing to do */
.endm

/*
@@ -222,9 +219,6 @@ ENDPROC(native_usergs_sysret64)
* Interrupts are off on entry.
* Only called from user space.
*
- * XXX if we had a free scratch register we could save the RSP into the stack frame
- * and report it properly in ps. Unfortunately we haven't.
- *
* When user can change the frames always force IRET. That is because
* it deals with uncanonical addresses better. SYSRET has trouble
* with them due to bugs in both AMD and Intel CPUs.
@@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
*/
ENABLE_INTERRUPTS(CLBR_NONE)
ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
+ movq %rcx,RIP(%rsp)
+ movq %r11,EFLAGS(%rsp)
+ movq PER_CPU_VAR(old_rsp),%rcx
+ movq %rcx,RSP(%rsp)
+ movq_cfi rax,ORIG_RAX
SAVE_C_REGS_EXCEPT_RAX_RCX_R11
movq $-ENOSYS,RAX(%rsp)
- movq_cfi rax,ORIG_RAX
- movq %r11,EFLAGS(%rsp)
- movq %rcx,RIP(%rsp)
CFI_REL_OFFSET rip,RIP
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz tracesys
@@ -293,7 +289,7 @@ ret_from_sys_call:
CFI_REGISTER rip,rcx
movq EFLAGS(%rsp),%r11
/*CFI_REGISTER rflags,r11*/
- movq PER_CPU_VAR(old_rsp), %rsp
+ movq RSP(%rsp),%rsp
/*
* 64bit SYSRET restores rip from rcx,
* rflags from r11 (but RF and VM bits are forced to 0),
@@ -307,7 +303,7 @@ int_ret_from_sys_call_fixup:
FIXUP_TOP_OF_STACK %r11
jmp int_ret_from_sys_call

- /* Do syscall tracing */
+ /* Do syscall entry tracing */
tracesys:
movq %rsp, %rdi
movl $AUDIT_ARCH_X86_64, %esi
@@ -346,7 +342,7 @@ tracesys_phase2:

/*
* Syscall return path ending with IRET.
- * Has correct top of stack, but partial stack frame.
+ * Has correct iret frame.
*/
GLOBAL(int_ret_from_sys_call)
DISABLE_INTERRUPTS(CLBR_NONE)
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 781861c..02a8720 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -177,7 +177,7 @@ void perf_get_regs_user(struct perf_regs *regs_user,
* than just blindly copying user_regs.
*/
regs_user->abi = PERF_SAMPLE_REGS_ABI_64;
- regs_user_copy->sp = this_cpu_read(old_rsp);
+ regs_user_copy->sp = user_regs->sp;
regs_user_copy->cs = __USER_CS;
regs_user_copy->ss = __USER_DS;
regs_user_copy->cx = -1; /* usually contains garbage */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 975d342..ab79139 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -161,7 +161,6 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,
p->thread.sp0 = (unsigned long)task_stack_page(p) + THREAD_SIZE;
childregs = task_pt_regs(p);
p->thread.sp = (unsigned long) childregs;
- p->thread.usersp = me->thread.usersp;
set_tsk_thread_flag(p, TIF_FORK);
p->thread.io_bitmap_ptr = NULL;

@@ -235,10 +234,8 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
loadsegment(es, _ds);
loadsegment(ds, _ds);
load_gs_index(0);
- current->thread.usersp = new_sp;
regs->ip = new_ip;
regs->sp = new_sp;
- this_cpu_write(old_rsp, new_sp);
regs->cs = _cs;
regs->ss = _ss;
regs->flags = X86_EFLAGS_IF;
@@ -401,8 +398,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/*
* Switch the PDA and FPU contexts.
*/
- prev->usersp = this_cpu_read(old_rsp);
- this_cpu_write(old_rsp, next->usersp);
this_cpu_write(current_task, next_p);

/*
@@ -601,6 +596,5 @@ long sys_arch_prctl(int code, unsigned long addr)

unsigned long KSTK_ESP(struct task_struct *task)
{
- return (test_tsk_thread_flag(task, TIF_IA32)) ?
- (task_pt_regs(task)->sp) : ((task)->thread.usersp);
+ return task_pt_regs(task)->sp;
}
--
1.8.1.4

2015-02-24 19:30:09

by Steven Rostedt

Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On Tue, 24 Feb 2015 19:51:33 +0100
Denys Vlasenko <[email protected]> wrote:

> PER_CPU_VAR(kernel_stack) was set up in a way where it points
> five stack slots below the top of stack.
>
> Presumably, it was done to avoid one "sub $5*8,%rsp"
> in syscall/sysenter code paths, where iret frame needs to be
> created by hand.
>
> Ironically, none of them benefit from this optimization,
> since all of them need to allocate additional data on stack
> (struct pt_regs), so they still have to perform subtraction.
> And ia32_sysenter_target even needs to *undo* this optimization:
> it constructs iret stack with pushes instead of movs,
> so it needs to start right at the top.
>
> This patch eliminates KERNEL_STACK_OFFSET.
> PER_CPU_VAR(kernel_stack) now points directly to top of stack.
> pt_regs allocations are adjusted to allocate iret frame as well.
>

I always thought the KERNEL_STACK_OFFSET wasn't an optimization, but a
buffer from the real top of stack, so that if we had any off-by-one bugs,
they wouldn't crash the system.

-- Steve

2015-02-24 19:56:50

by Borislav Petkov

Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Tue, Feb 24, 2015 at 07:51:32PM +0100, Denys Vlasenko wrote:
> In all three 32-bit entry points, %eax is zero-extended to %rax.
> It is safe to do 32-bit compare when checking that syscall#
> is not too large.
>
> The last instance of "mysterious" SS+8 constant is replaced by SIZEOF_PTREGS.
>
> The $AUDIT_ARCH_X86_64 parameter to syscall_trace_enter_phase1/2
> is a 32-bit constant; loading it with a 64-bit MOV produces a 10-byte insn
> instead of a 5-byte one.
>
> After a TEST insn, JE actually means "jump if zero";
> let's use the JZ mnemonic instead.

Actually, JE == JZ as that's the same opcode for testing ZF=1.

And I have to object:

testl $3,CS(%rsp)
je retint_kernel

is much more understandable than

testl $3,CS(%rsp)
jz retint_kernel

It basically says are $3 and CS(%rsp) equal.

JZ, on the other hand, not so clear: the TEST ANDed the two operands and
set flags accordingly, so JZ is testing the ZF. This means, you actually
know what TEST does and you haven't forgotten.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-24 20:02:59

by Denys Vlasenko

Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On 02/24/2015 08:30 PM, Steven Rostedt wrote:
> On Tue, 24 Feb 2015 19:51:33 +0100
> Denys Vlasenko <[email protected]> wrote:
>
>> PER_CPU_VAR(kernel_stack) was set up in a way where it points
>> five stack slots below the top of stack.
>>
>> Presumably, it was done to avoid one "sub $5*8,%rsp"
>> in syscall/sysenter code paths, where iret frame needs to be
>> created by hand.
>>
>> Ironically, none of them benefit from this optimization,
>> since all of them need to allocate additional data on stack
>> (struct pt_regs), so they still have to perform subtraction.
>> And ia32_sysenter_target even needs to *undo* this optimization:
>> it constructs iret stack with pushes instead of movs,
>> so it needs to start right at the top.
>>
>> This patch eliminates KERNEL_STACK_OFFSET.
>> PER_CPU_VAR(kernel_stack) now points directly to top of stack.
>> pt_regs allocations are adjusted to allocate iret frame as well.
>>
>
> I always thought the KERNEL_STACK_OFFSET wasn't an optimization, but a
> buffer from the real top of stack, so that if we had any off-by-one bugs,
> they wouldn't crash the system.

I was thinking about it, but it looks unlikely. Reasons:

(1) ia32_sysenter_target does "addq $(KERNEL_STACK_OFFSET),%rsp"
on entry, before saving registers with PUSHes;
this returns %rsp to the very top of the kernel stack.
If that were a problem (say, an NMI at this point doing bad things),
it would have been noticed by now.

(2) even the ordinary 64-bit syscall path uses an IRET return at times.
For one, on every execve and signal return (because they need
to load a modified %rsp). With the current layout, the return frame
for IRET lies exactly there, in those 5 stack slots "reserved"
via the KERNEL_STACK_OFFSET thingy (see the layout sketched below).

(3) There are no comments anywhere about KERNEL_STACK_OFFSET being
a safety measure.
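
For reference, those five slots are exactly the hardware iret frame
at the very top of the kernel stack:

    top-1*8: SS
    top-2*8: RSP
    top-3*8: RFLAGS
    top-4*8: CS
    top-5*8: RIP     <- old PER_CPU_VAR(kernel_stack) pointed here (5*8 == KERNEL_STACK_OFFSET)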

2015-02-24 20:13:27

by Denys Vlasenko

Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Tue, Feb 24, 2015 at 8:58 PM, Borislav Petkov <[email protected]> wrote:
> On Tue, Feb 24, 2015 at 07:51:32PM +0100, Denys Vlasenko wrote:
>> In all three 32-bit entry points, %eax is zero-extended to %rax.
>> It is safe to do 32-bit compare when checking that syscall#
>> is not too large.
>>
>> The last instance of "mysterious" SS+8 constant is replaced by SIZEOF_PTREGS.
>>
>> The $AUDIT_ARCH_X86_64 parameter to syscall_trace_enter_phase1/2
>> is a 32-bit constant; loading it with a 64-bit MOV produces a 10-byte insn
>> instead of a 5-byte one.
>>
>> After a TEST insn, JE actually means "jump if zero";
>> let's use the JZ mnemonic instead.
>
> Actually, JE == LZ as that's the same opcode for testing ZF=1.

Yes, I know that :)

> And I have to object:
>
> testl $3,CS(%rsp)
> je retint_kernel
>
> is much more understandable than
>
> testl $3,CS(%rsp)
> jz retint_kernel
>
> It basically says are $3 and CS(%rsp) equal.

They aren't equal. $1 and $2 in two lowest bits will also
be interpreted as "userspace" here. "Equal to $3" sends
a wrong message here to a human reading the code,
the code doesn't test for CPL=3, it tests for any nonzero CPL.

> JZ, on the other hand, not so clear: the TEST ANDed the two operands and
> set flags accordingly, so JZ is testing the ZF. This means, you actually
> know what TEST does and you haven't forgotten.

JZ says "jump if zero", in this case, "jump if CPL is zero".

2015-02-24 20:35:18

by Borislav Petkov

Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Tue, Feb 24, 2015 at 09:13:03PM +0100, Denys Vlasenko wrote:
> They aren't equal. $1 and $2 in two lowest bits will also
> be interpreted as "userspace" here. "Equal to $3" sends
> a wrong message here to a human reading the code,
> the code doesn't test for CPL=3, it tests for any nonzero CPL.

Doh, of course.

I was going to propose to make it even more explicit:

andb $3, CS(%rsp)...

but that's destructive.

So yours makes it less misleading, actually. To me at least and
obviously.

:-)

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-24 22:25:24

by Andy Lutomirski

Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
> In all three 32-bit entry points, %eax is zero-extended to %rax.
> It is safe to do 32-bit compare when checking that syscall#
> is not too large.

Applied. Thanks!

2015-02-24 22:37:36

by Andy Lutomirski

Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
> PER_CPU_VAR(kernel_stack) was set up in a way where it points
> five stack slots below the top of stack.
>
> Presumably, it was done to avoid one "sub $5*8,%rsp"
> in syscall/sysenter code paths, where iret frame needs to be
> created by hand.
>
> Ironically, none of them benefit from this optimization,
> since all of them need to allocate additional data on stack
> (struct pt_regs), so they still have to perform subtraction.
> And ia32_sysenter_target even needs to *undo* this optimization:
> it constructs iret stack with pushes instead of movs,
> so it needs to start right at the top.
>
> This patch eliminates KERNEL_STACK_OFFSET.
> PER_CPU_VAR(kernel_stack) now points directly to top of stack.
> pt_regs allocations are adjusted to allocate iret frame as well.
>
> Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??"
> are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) -
> "rsp is SIZEOF_PTREGS bytes below the top, calculate
> thread_info's address using that information"
>
> Net result in code is that one instruction is eliminated,
> and constants in several others have changed.

This has a side effect: kernel_stack now points to the first byte of
the page above the stack instead of to the page containing the stack.
That means we need this fix:

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c74f2f5652da..5f20a0a612a6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -174,7 +174,8 @@ void ist_begin_non_atomic(struct pt_regs *regs)
* will catch asm bugs and any attempt to use ist_preempt_enable
* from double_fault.
*/
- BUG_ON(((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
+ BUG_ON(((current_stack_pointer() ^
+ (this_cpu_read_stable(kernel_stack) - 1))
& ~(THREAD_SIZE - 1)) != 0);

preempt_count_sub(HARDIRQ_OFFSET);

I added that in and applied this patch.

Borislav and/or Tony, when all the dust settles, could you make sure
to re-run your memory failure test on the result?

Thanks,
Andy

>
> Signed-off-by: Denys Vlasenko <[email protected]>
> CC: Linus Torvalds <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Borislav Petkov <[email protected]>
> CC: "H. Peter Anvin" <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> CC: Alexei Starovoitov <[email protected]>
> CC: Will Drewry <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> arch/x86/ia32/ia32entry.S | 35 +++++++++++++++++------------------
> arch/x86/include/asm/thread_info.h | 8 +++-----
> arch/x86/kernel/cpu/common.c | 2 +-
> arch/x86/kernel/entry_64.S | 8 ++++----
> arch/x86/kernel/process_32.c | 3 +--
> arch/x86/kernel/process_64.c | 3 +--
> arch/x86/kernel/smpboot.c | 3 +--
> arch/x86/xen/smp.c | 3 +--
> 8 files changed, 29 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 01eca80..b349ae8 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -114,7 +114,6 @@ ENTRY(ia32_sysenter_target)
> CFI_REGISTER rsp,rbp
> SWAPGS_UNSAFE_STACK
> movq PER_CPU_VAR(kernel_stack), %rsp
> - addq $(KERNEL_STACK_OFFSET),%rsp
> /*
> * No need to follow this irqs on/off section: the syscall
> * disabled irqs, here we enable it straight after entry:
> @@ -128,7 +127,7 @@ ENTRY(ia32_sysenter_target)
> CFI_REL_OFFSET rsp,0
> pushfq_cfi
> /*CFI_REL_OFFSET rflags,0*/
> - movl TI_sysenter_return+THREAD_INFO(%rsp,3*8-KERNEL_STACK_OFFSET),%r10d
> + movl TI_sysenter_return+THREAD_INFO(%rsp,3*8),%r10d
> CFI_REGISTER rip,r10
> pushq_cfi $__USER32_CS
> /*CFI_REL_OFFSET cs,0*/
> @@ -160,8 +159,8 @@ ENTRY(ia32_sysenter_target)
> jnz sysenter_fix_flags
> sysenter_flags_fixed:
>
> - orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
> - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
> + orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> CFI_REMEMBER_STATE
> jnz sysenter_tracesys
> cmpl $(IA32_NR_syscalls-1),%eax
> @@ -178,10 +177,10 @@ sysenter_dispatch:
> movq %rax,RAX(%rsp)
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> - testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jnz sysexit_audit
> sysexit_from_sys_call:
> - andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
> + andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> /* clear IF, that popfq doesn't enable interrupts early */
> andl $~0x200,EFLAGS(%rsp)
> movl RIP(%rsp),%edx /* User %eip */
> @@ -226,7 +225,7 @@ sysexit_from_sys_call:
> .endm
>
> .macro auditsys_exit exit
> - testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jnz ia32_ret_from_sys_call
> TRACE_IRQS_ON
> ENABLE_INTERRUPTS(CLBR_NONE)
> @@ -241,7 +240,7 @@ sysexit_from_sys_call:
> movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> - testl %edi,TI_flags+THREAD_INFO(%rsp,RIP)
> + testl %edi,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jz \exit
> CLEAR_RREGS
> jmp int_with_check
> @@ -263,7 +262,7 @@ sysenter_fix_flags:
>
> sysenter_tracesys:
> #ifdef CONFIG_AUDITSYSCALL
> - testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jz sysenter_auditsys
> #endif
> SAVE_EXTRA_REGS
> @@ -312,7 +311,7 @@ ENDPROC(ia32_sysenter_target)
> ENTRY(ia32_cstar_target)
> CFI_STARTPROC32 simple
> CFI_SIGNAL_FRAME
> - CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
> + CFI_DEF_CFA rsp,0
> CFI_REGISTER rip,rcx
> /*CFI_REGISTER rflags,r11*/
> SWAPGS_UNSAFE_STACK
> @@ -324,7 +323,7 @@ ENTRY(ia32_cstar_target)
> * disabled irqs and here we enable it straight after entry:
> */
> ENABLE_INTERRUPTS(CLBR_NONE)
> - ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */
> + ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
> SAVE_C_REGS_EXCEPT_RCX_R891011
> movl %eax,%eax /* zero extension */
> movq %rax,ORIG_RAX(%rsp)
> @@ -347,8 +346,8 @@ ENTRY(ia32_cstar_target)
> 1: movl (%r8),%r9d
> _ASM_EXTABLE(1b,ia32_badarg)
> ASM_CLAC
> - orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
> - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
> + orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> CFI_REMEMBER_STATE
> jnz cstar_tracesys
> cmpl $(IA32_NR_syscalls-1),%eax
> @@ -365,10 +364,10 @@ cstar_dispatch:
> movq %rax,RAX(%rsp)
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF
> - testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jnz sysretl_audit
> sysretl_from_sys_call:
> - andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
> + andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> RESTORE_RSI_RDI_RDX
> movl RIP(%rsp),%ecx
> CFI_REGISTER rip,rcx
> @@ -403,7 +402,7 @@ sysretl_audit:
>
> cstar_tracesys:
> #ifdef CONFIG_AUDITSYSCALL
> - testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jz cstar_auditsys
> #endif
> xchgl %r9d,%ebp
> @@ -470,8 +469,8 @@ ENTRY(ia32_syscall)
> this could be a problem. */
> ALLOC_PT_GPREGS_ON_STACK
> SAVE_C_REGS_EXCEPT_R891011
> - orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP)
> - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
> + orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jnz ia32_tracesys
> cmpl $(IA32_NR_syscalls-1),%eax
> ja ia32_badsys
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index e82e95a..3eb9d05 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -149,7 +149,6 @@ struct thread_info {
> #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
>
> #define STACK_WARN (THREAD_SIZE/8)
> -#define KERNEL_STACK_OFFSET (5*(BITS_PER_LONG/8))
>
> /*
> * macros/functions for gaining access to the thread information structure
> @@ -163,8 +162,7 @@ DECLARE_PER_CPU(unsigned long, kernel_stack);
> static inline struct thread_info *current_thread_info(void)
> {
> struct thread_info *ti;
> - ti = (void *)(this_cpu_read_stable(kernel_stack) +
> - KERNEL_STACK_OFFSET - THREAD_SIZE);
> + ti = (void *)(this_cpu_read_stable(kernel_stack) - THREAD_SIZE);
> return ti;
> }
>
> @@ -184,13 +182,13 @@ static inline unsigned long current_stack_pointer(void)
> /* how to get the thread information struct from ASM */
> #define GET_THREAD_INFO(reg) \
> _ASM_MOV PER_CPU_VAR(kernel_stack),reg ; \
> - _ASM_SUB $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg ;
> + _ASM_SUB $(THREAD_SIZE),reg ;
>
> /*
> * Same if PER_CPU_VAR(kernel_stack) is, perhaps with some offset, already in
> * a certain register (to be used in assembler memory operands).
> */
> -#define THREAD_INFO(reg, off) KERNEL_STACK_OFFSET+(off)-THREAD_SIZE(reg)
> +#define THREAD_INFO(reg, off) (off)-THREAD_SIZE(reg)
>
> #endif
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index cb56925..a47ca751 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1116,7 +1116,7 @@ static __init int setup_disablecpuid(char *arg)
> __setup("clearcpuid=", setup_disablecpuid);
>
> DEFINE_PER_CPU(unsigned long, kernel_stack) =
> - (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
> + (unsigned long)&init_thread_union + THREAD_SIZE;
> EXPORT_PER_CPU_SYMBOL(kernel_stack);
>
> #ifdef CONFIG_X86_64
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index b93f3a2..91af6be 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -237,7 +237,7 @@ ENDPROC(native_usergs_sysret64)
> ENTRY(system_call)
> CFI_STARTPROC simple
> CFI_SIGNAL_FRAME
> - CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
> + CFI_DEF_CFA rsp,0
> CFI_REGISTER rip,rcx
> /*CFI_REGISTER rflags,r11*/
> SWAPGS_UNSAFE_STACK
> @@ -256,13 +256,13 @@ GLOBAL(system_call_after_swapgs)
> * and short:
> */
> ENABLE_INTERRUPTS(CLBR_NONE)
> - ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */
> + ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
> SAVE_C_REGS_EXCEPT_RAX_RCX
> movq $-ENOSYS,RAX(%rsp)
> movq_cfi rax,ORIG_RAX
> movq %rcx,RIP(%rsp)
> CFI_REL_OFFSET rip,RIP
> - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jnz tracesys
> system_call_fastpath:
> #if __SYSCALL_MASK == ~0
> @@ -280,7 +280,7 @@ system_call_fastpath:
> * Has incomplete stack frame and undefined top of stack.
> */
> ret_from_sys_call:
> - testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
> + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
> jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>
> LOCKDEP_SYS_EXIT
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 8f3ebfe..ecda2bd 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -311,8 +311,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> arch_end_context_switch(next_p);
>
> this_cpu_write(kernel_stack,
> - (unsigned long)task_stack_page(next_p) +
> - THREAD_SIZE - KERNEL_STACK_OFFSET);
> + (unsigned long)task_stack_page(next_p) + THREAD_SIZE);
>
> /*
> * Restore %gs if needed (which is common)
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 5a2c029..975d342 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -414,8 +414,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> this_cpu_write(__preempt_count, task_thread_info(next_p)->saved_preempt_count);
>
> this_cpu_write(kernel_stack,
> - (unsigned long)task_stack_page(next_p) +
> - THREAD_SIZE - KERNEL_STACK_OFFSET);
> + (unsigned long)task_stack_page(next_p) + THREAD_SIZE);
>
> /*
> * Now maybe reload the debug registers and handle I/O bitmaps
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index febc6aa..73dfb1f 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -811,8 +811,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
> initial_gs = per_cpu_offset(cpu);
> #endif
> per_cpu(kernel_stack, cpu) =
> - (unsigned long)task_stack_page(idle) -
> - KERNEL_STACK_OFFSET + THREAD_SIZE;
> + (unsigned long)task_stack_page(idle) + THREAD_SIZE;
> early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
> initial_code = (unsigned long)start_secondary;
> stack_start = idle->thread.sp;
> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> index 4c071ae..4e71a3b 100644
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -452,8 +452,7 @@ static int xen_cpu_up(unsigned int cpu, struct task_struct *idle)
> clear_tsk_thread_flag(idle, TIF_FORK);
> #endif
> per_cpu(kernel_stack, cpu) =
> - (unsigned long)task_stack_page(idle) -
> - KERNEL_STACK_OFFSET + THREAD_SIZE;
> + (unsigned long)task_stack_page(idle) + THREAD_SIZE;
>
> xen_setup_runstate_info(cpu);
> xen_setup_timer(cpu);
> --
> 1.8.1.4
>



--
Andy Lutomirski
AMA Capital Management, LLC

2015-02-24 22:45:03

by Andy Lutomirski

Subject: Re: [PATCH 3/4] x86: save r11 into pt_regs->eflags on SYSCALL64 fastpath

On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
> Before this patch, rcx and r11 were saved in pt_regs->rcx
> and pt_regs->r11, which looks natural but requires messy
> shuffling to/from the iret stack whenever ptrace or e.g. iopl
> wants to modify the return address or flags - because that's
> how these registers are used by SYSCALL/SYSRET.
>
> This patch saves rcx and r11 in pt_regs->rip and pt_regs->flags,
> and uses these values for SYSRET64 insn. Shuffling is eliminated.
>
> stub_iopl is no longer needed: pt_regs->flags needs no fixing up.
>
> Testing shows that syscall fast path is ~54.3 ns before
> and after the patch (on 2.7 GHz Sandy Bridge CPU).
>
> Signed-off-by: Denys Vlasenko <[email protected]>
> CC: Linus Torvalds <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: Ingo Molnar <[email protected]>
> CC: Borislav Petkov <[email protected]>
> CC: "H. Peter Anvin" <[email protected]>
> CC: Andy Lutomirski <[email protected]>
> CC: Oleg Nesterov <[email protected]>
> CC: Frederic Weisbecker <[email protected]>
> CC: Alexei Starovoitov <[email protected]>
> CC: Will Drewry <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: [email protected]
> CC: [email protected]
> ---
> arch/x86/include/asm/calling.h | 20 ++++++++++++++------
> arch/x86/kernel/entry_64.S | 33 +++++++++------------------------
> arch/x86/syscalls/syscall_64.tbl | 2 +-
> arch/x86/um/sys_call_table_64.c | 2 +-
> 4 files changed, 25 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
> index f1a962f..4b5f7bf 100644
> --- a/arch/x86/include/asm/calling.h
> +++ b/arch/x86/include/asm/calling.h
> @@ -95,9 +95,11 @@ For 32-bit we have the following conventions - kernel is built with
> CFI_ADJUST_CFA_OFFSET 15*8+\addskip
> .endm
>
> - .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8plus=1
> - .if \r8plus
> + .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
> + .if \r11
> movq_cfi r11, 6*8+\offset
> + .endif
> + .if \r8910
> movq_cfi r10, 7*8+\offset
> movq_cfi r9, 8*8+\offset
> movq_cfi r8, 9*8+\offset
> @@ -113,16 +115,19 @@ For 32-bit we have the following conventions - kernel is built with
> movq_cfi rdi, 14*8+\offset
> .endm
> .macro SAVE_C_REGS offset=0
> - SAVE_C_REGS_HELPER \offset, 1, 1, 1
> + SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
> .endm
> .macro SAVE_C_REGS_EXCEPT_RAX_RCX offset=0
> - SAVE_C_REGS_HELPER \offset, 0, 0, 1
> + SAVE_C_REGS_HELPER \offset, 0, 0, 1, 1
> .endm
> .macro SAVE_C_REGS_EXCEPT_R891011
> - SAVE_C_REGS_HELPER 0, 1, 1, 0
> + SAVE_C_REGS_HELPER 0, 1, 1, 0, 0
> .endm
> .macro SAVE_C_REGS_EXCEPT_RCX_R891011
> - SAVE_C_REGS_HELPER 0, 1, 0, 0
> + SAVE_C_REGS_HELPER 0, 1, 0, 0, 0
> + .endm
> + .macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
> + SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
> .endm
>

This is unnecessarily difficult to read. Could you rework it to use
named macro parameters?
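
Something like this, say (just a sketch - gas accepts keyword arguments
at macro invocation):

    .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
    ...
    .endm

    SAVE_C_REGS_HELPER rax=0, rcx=0, r11=0    /* i.e. SAVE_C_REGS_EXCEPT_RAX_RCX_R11 */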

> .macro SAVE_EXTRA_REGS offset=0
> @@ -179,6 +184,9 @@ For 32-bit we have the following conventions - kernel is built with
> .macro RESTORE_C_REGS_EXCEPT_R11
> RESTORE_C_REGS_HELPER 1,1,0,1,1
> .endm
> + .macro RESTORE_C_REGS_EXCEPT_RCX_R11
> + RESTORE_C_REGS_HELPER 1,0,0,1,1
> + .endm

Ditto.

> .macro RESTORE_RSI_RDI
> RESTORE_C_REGS_HELPER 0,0,0,0,0
> .endm
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 91af6be..2fd9349 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -121,14 +121,12 @@ ENDPROC(native_usergs_sysret64)
> #endif
>
> /*
> - * C code is not supposed to know about undefined top of stack. Every time
> + * C code is not supposed to know that iret frame is not populated. Every time

"that the iret frame," please. (Вы говорите по-русски? :) )

> * a C function with an pt_regs argument is called from the SYSCALL based
> * fast path FIXUP_TOP_OF_STACK is needed.
> * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
> * manipulation.
> */
> -
> - /* %rsp:at FRAMEEND */
> .macro FIXUP_TOP_OF_STACK tmp offset=0
> movq PER_CPU_VAR(old_rsp),\tmp
> movq \tmp,RSP+\offset(%rsp)
> @@ -136,15 +134,13 @@ ENDPROC(native_usergs_sysret64)
> movq $__USER_CS,CS+\offset(%rsp)
> movq RIP+\offset(%rsp),\tmp /* get rip */
> movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */
> - movq R11+\offset(%rsp),\tmp /* get eflags */
> - movq \tmp,EFLAGS+\offset(%rsp)
> + movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
> + movq \tmp,R11+\offset(%rsp)
> .endm

It occurs to me that both the name of this macro and its comment are
wrong. It's not fixing the *top* of the stack, since it fixes both
rcx and r11. Oh well, maybe we'll just delete it eventually.

The patch looks correct. Can you submit a v2 once I finish reading
the rest of these? (Also, can you put v2 in the subject? My email is
going nuts here.)

--Andy

2015-02-24 22:51:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On 02/24/2015 12:36 PM, Borislav Petkov wrote:
> On Tue, Feb 24, 2015 at 09:13:03PM +0100, Denys Vlasenko wrote:
>> They aren't equal. $1 and $2 in two lowest bits will also
>> be interpreted as "userspace" here. "Equal to $3" sends
>> a wrong message here to a human reading the code,
>> the code doesn't test for CPL=3, it tests for any nonzero CPL.
>
> Doh, of course.
>
> I was going to propose to make it even more explicit:
>
> andb $3, CS(%rsp)...
>
> but that's destructive.
>
> So yours makes it less misleading, actually. To me at least and
> obviously.
>

Yes, please use JZ/JNZ with TEST.

-hpa
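
For reference, the test under discussion is a plain flag test on the saved
CS value, along these lines (the exact offset expression and label are
approximations of the entry code):

        testl   $3, CS(%rsp)            /* low two bits = CPL of the interrupted context */
        jz      retint_kernel           /* zero -> CPL 0, returning to the kernel */
        /* any nonzero CPL (1, 2 or 3) is treated as a return to userspace */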

2015-02-24 22:53:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On 02/24/2015 02:25 PM, Andy Lutomirski wrote:
> On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
>> In all three 32-bit entry points, %eax is zero-extended to %rax.
>> It is safe to do 32-bit compare when checking that syscall#
>> is not too large.
>
> Applied. Thanks!
>

NAK NAK NAK NAK NAK!!!!

We have already had this turn into a security issue not just once but
TWICE, because someone decided to "optimize" the path by taking out the
zero extend.

The use of a 64-bit compare here is an intentional "belts and
suspenders" safety issue.

-hpa

2015-02-24 22:55:56

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/4] x86: save user %rsp in pt_regs->sp

On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and the userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit. This allows us to get rid of thread_struct::usersp.

I'm not seeing any performance loss here, which is good. This is a
nice cleanup.

> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
> */
> ENABLE_INTERRUPTS(CLBR_NONE)
> ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
> + movq %rcx,RIP(%rsp)
> + movq %r11,EFLAGS(%rsp)

> + movq PER_CPU_VAR(old_rsp),%rcx
> + movq %rcx,RSP(%rsp)

It's a bit unfortunate that this adds two instructions instead of just
one. I guess this gets fix when we start using push instead of mov
here. (And, if we use push, then we can fix up the RCX slot for free,
too.)

Acked-by me, but I won't apply it until v2 to avoid messy rebases.

--Andy

2015-02-24 22:56:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Tue, Feb 24, 2015 at 2:52 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/24/2015 02:25 PM, Andy Lutomirski wrote:
>> On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
>>> In all three 32-bit entry points, %eax is zero-extended to %rax.
>>> It is safe to do 32-bit compare when checking that syscall#
>>> is not too large.
>>
>> Applied. Thanks!
>>
>
> NAK NAK NAK NAK NAK!!!!
>
> We have already had this turn into a security issue not just once but
> TWICE, because someone decided to "optimize" the path by taking out the
> zero extend.
>
> The use of a 64-bit compare here is an intentional "belts and
> suspenders" safety issue.

Fair enough. OK if I just undo that part of this patch?

--Andy

2015-02-24 23:02:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On 02/24/2015 02:56 PM, Andy Lutomirski wrote:
> On Tue, Feb 24, 2015 at 2:52 PM, H. Peter Anvin <[email protected]> wrote:
>> On 02/24/2015 02:25 PM, Andy Lutomirski wrote:
>>> On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
>>>> In all three 32-bit entry points, %eax is zero-extended to %rax.
>>>> It is safe to do 32-bit compare when checking that syscall#
>>>> is not too large.
>>>
>>> Applied. Thanks!
>>>
>>
>> NAK NAK NAK NAK NAK!!!!
>>
>> We have already had this turn into a security issue not just once but
>> TWICE, because someone decided to "optimize" the path by taking out the
>> zero extend.
>>
>> The use of a 64-bit compare here is an intentional "belts and
>> suspenders" safety issue.
>
> Fair enough. OK if I just undo that part of this patch?
>

Actually this part should have been broken up. The word "several" in
the patch description is by itself a cause to NAK the patch.

-hpa

2015-02-24 23:04:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Tue, Feb 24, 2015 at 3:01 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/24/2015 02:56 PM, Andy Lutomirski wrote:
>> On Tue, Feb 24, 2015 at 2:52 PM, H. Peter Anvin <[email protected]> wrote:
>>> On 02/24/2015 02:25 PM, Andy Lutomirski wrote:
>>>> On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
>>>>> In all three 32-bit entry points, %eax is zero-extended to %rax.
>>>>> It is safe to do 32-bit compare when checking that syscall#
>>>>> is not too large.
>>>>
>>>> Applied. Thanks!
>>>>
>>>
>>> NAK NAK NAK NAK NAK!!!!
>>>
>>> We have already had this turn into a security issue not just once but
>>> TWICE, because someone decided to "optimize" the path by taking out the
>>> zero extend.
>>>
>>> The use of a 64-bit compare here is an intentional "belts and
>>> suspenders" safety issue.
>>
>> Fair enough. OK if I just undo that part of this patch?
>>
>
> Actually this part should have been broken up. The word "several" in
> the patch description is by itself a cause to NAK the patch.

Point taken.

Denys, can you fix this and send a v2 of the entire series with the
traps.c fix as well?

Thanks,
Andy

>
> -hpa
>
>



--
Andy Lutomirski
AMA Capital Management, LLC

2015-02-24 23:26:37

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Wed, Feb 25, 2015 at 12:03 AM, Andy Lutomirski <[email protected]> wrote:
>> Actually this part should have been broken up. The word "several" in
>> the patch description is by itself a cause to NAK the patch.
>
> Point taken.
>
> Denys, can you fix this and send a v2 of the entire series with the
> traps.c fix as well?

Ok, working on it.

2015-02-25 08:54:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET


* Denys Vlasenko <[email protected]> wrote:

> PER_CPU_VAR(kernel_stack) was set up in a way where it
> points five stack slots below the top of stack.
>
> Presumably, it was done to avoid one "sub $5*8,%rsp" in
> syscall/sysenter code paths, where iret frame needs to be
> created by hand.
>
> Ironically, none of them benefit from this optimization,
> since all of them need to allocate additional data on
> stack (struct pt_regs), so they still have to perform
> subtraction.

Well, the original idea of percpu::kernel_stack was that of
an optimization of the 64-bit system_call() path: to set up
RSP as it has to be before we call into system calls.

This optimization has bitrotted away: because these days
the first SAVE_ARGS in the 64-bit entry path modifies RSP
as well, undoing the optimization.

But the fix should be to not touch RSP in SAVE_ARGS, to
keep percpu::kernel_stack as an optimized entry point -
with KERNEL_STACK_OFFSET pointing to it.

So NAK - this should be fixed for real.

> And ia32_sysenter_target even needs to *undo* this
> optimization: it constructs iret stack with pushes
> instead of movs, so it needs to start right at the top.

Lets keep it in mind that in any case the micro-costs of
the 32-bit entry path are almost always irrelevant: we
optimize the 64-bit entry path, if that helps the 32-bit
side as well then that's a happy coincidence, nothing more.

If the 32-bit entry path can be optimized without affecting
the 64-bit path then that's good, but we don't ever hurt
the 64-bit path to make things easier or simpler for the
32-bit path.

Thanks,

Ingo
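
For context, the setup being discussed looks roughly like this in the tree
at the time (a simplified sketch, not an exact quote):

        /* 64-bit syscall entry: SYSCALL does not switch stacks, so do it by hand */
        movq    %rsp, PER_CPU_VAR(old_rsp)              /* stash the user stack pointer */
        movq    PER_CPU_VAR(kernel_stack), %rsp         /* points KERNEL_STACK_OFFSET (5*8) below the top of stack */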

2015-02-25 09:20:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns


* H. Peter Anvin <[email protected]> wrote:

> On 02/24/2015 02:25 PM, Andy Lutomirski wrote:
> > On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko <[email protected]> wrote:
> >>
> >> In all three 32-bit entry points, %eax is
> >> zero-extended to %rax. It is safe to do 32-bit compare
> >> when checking that syscall# is not too large.
> >
> > Applied. Thanks!
> >
>
> NAK NAK NAK NAK NAK!!!!
>
> We have already had this turn into a security issue not
> just once but TWICE, because someone decided to
> "optimize" the path by taking out the zero extend.
>
> The use of a 64-bit compare here is an intentional "belts
> and suspenders" safety issue.

I think the fundamental fragility is that we allow the high
32 bits to be nonzero.

So could we just zap the high 32 bits of RAX early in the
entry code, and then from that point on we could both use
32-bit ops and won't have to remember the possibility
either?

That's arguably one more (cheap) instruction in the 32-bit
entry paths but then we could use the shorter 32-bit
instructions for compares and other uses and could always
be certain that we get what we want.


But, if we do that, we can do even better, and also do an
optimization of the 64-bit entry path as well: we could
simply mask RAX with 0x3ff and not do a compare. Pad the
syscall table up to 0x400 (1024) entries and fill in the
table with sys_ni syscall entries.

This is valid on 64-bit and 32-bit kernels as well, and it
allows the removal of a compare from the syscall entry
path, at the cost of a couple of kilobytes of unused
syscall table.

The downside would be that if we ever grow past 1024
syscall entries we'll be in trouble if new userspace calls
syscall 513 on an old kernel and gets syscall 1.

I doubt we'll ever get so many syscalls, and user-space
will be able to be smart in any case, so it's not a
showstopper.

This is the safest and quickest implementation as well.

Thoughts?

Thanks,

Ingo
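
A minimal sketch of that masking idea, assuming the syscall table is padded
to 1024 entries filled with sys_ni_syscall:

        andl    $0x3ff, %eax                    /* clamp into the padded table; also zero-extends %rax */
        call    *sys_call_table(, %rax, 8)      /* out-of-range numbers land on sys_ni_syscall entries */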

2015-02-25 09:27:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On 02/25/2015 01:20 AM, Ingo Molnar wrote:
>
> I think the fundamental fragility is that we allow the high
> 32 bits to be nonzero.
>
> So could we just zap the high 32 bits of RAX early in the
> entry code, and then from that point on we could both use
> 32-bit ops and won't have to remember the possibility
> either?
>

We do that, but people keep "optimizing" the zero extend away. We have
had this cause a wide-open security hole twice already. So the extra
REX prefix is a cheap cost to avoid this happening again.

-hpa

2015-02-25 09:39:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns


* H. Peter Anvin <[email protected]> wrote:

> > So could we just zap the high 32 bits of RAX early in
> > the entry code, and then from that point on we could
> > both use 32-bit ops and won't have to remember the
> > possibility either?
>
> We do that, [...]

Ok, indeed, so in ia32_sysenter_target() we have:

movl %eax, %eax

> [...] but people keep "optimizing" the zero extend away.
> [...]

Possibly because there's not a single comment near that
code explaining the importance of that line. But nobody
will get a change past me with such a warning next to the
line.

> [...] We have had this cause a wide-open security hole
> twice already. So the extra REX prefix is a cheap cost
> to avoid this happening again.

But since we already zap the high bits, there's no point in
doing 64-bit compares...

Just make sure the clearing of the high bits is there and is
never removed.

So in that sense the changes are correct, even in the
security robustness sense.

Furthermore, with the masking suggestion I made in the
previous mail it's moot as we can solve both problems:
64-bit uses of RAX will become correct as well, and it
will be a bit faster too.

Hm?

Thanks,

Ingo
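
The kind of warning comment being asked for might look something like this
(the wording is only a suggestion):

        movl    %eax, %eax      /* Zero-extend the syscall number: the upper 32 bits
                                 * of %rax are user-controlled garbage here. Removing
                                 * this has caused security holes twice already - do
                                 * not "optimize" it away. */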

2015-02-25 09:45:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET


* Andy Lutomirski <[email protected]> wrote:

> - BUG_ON(((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
> + BUG_ON(((current_stack_pointer() ^
> + (this_cpu_read_stable(kernel_stack) - 1))
> & ~(THREAD_SIZE - 1)) != 0);
>
> preempt_count_sub(HARDIRQ_OFFSET);
>
> I added that in and applied this patch.

So this is not just slightly buggy, it's fundamentally
wrong as well as it removes the possibility of an RSP value
optimization from the 64-bit path, see my previous mail.

The right solution would be to make SAVE_ARGS recognize
when KERNEL_STACK_OFFSET == args-offset and omit the RSP
fixup in that case, or to simply use a __SAVE_ARGS for the
64-bit path, knowing that RSP has the right value already.

Please also add comments that explain the relationship
between percpu::kernel_stack, KERNEL_STACK_OFFSET and the
64-bit system call entry code.

Also, guys, please slow down a bit and be more careful.
Andy, could you please send me all currently pending entry
bits in your tree, because all the in-flight
changes make it hard for me to review patches? Please
(re-)send all patches as well as part of the submission.

Thanks,

Ingo

2015-02-25 12:48:46

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On Wed, Feb 25, 2015 at 9:53 AM, Ingo Molnar <[email protected]> wrote:
>
> * Denys Vlasenko <[email protected]> wrote:
>
>> PER_CPU_VAR(kernel_stack) was set up in a way where it
>> points five stack slots below the top of stack.
>>
>> Presumably, it was done to avoid one "sub $5*8,%rsp" in
>> syscall/sysenter code paths, where iret frame needs to be
>> created by hand.
>>
>> Ironically, none of them benefit from this optimization,
>> since all of them need to allocate additional data on
>> stack (struct pt_regs), so they still have to perform
>> subtraction.
>
> Well, the original idea of percpu::kernel_stack was that of
> an optimization of the 64-bit system_call() path: to set up
> RSP as it has to be before we call into system calls.
>
> This optimization has bitrotted away: because these days
> the first SAVE_ARGS in the 64-bit entry path modifies RSP
> as well, undoing the optimization.

Yes, I figured this is how it was supposed to work.

> But the fix should be to not touch RSP in SAVE_ARGS, to
> keep percpu::kernel_stack as an optimized entry point -
> with KERNEL_STACK_OFFSET pointing to it.
>
> So NAK - this should be fixed for real.

IOW, the proposal is to set KERNEL_STACK_OFFSET
to SIZEOF_PTREGS. I can do that.

However.

There is an orthogonal idea we were discussing: to save
registers and construct iret frame using PUSH insns, not MOVs.
IIRC Andy and Linus liked it. I am ambivalent: the code will be smaller,
but might get slower (at least on some CPUs).
If we go that way, we will require KERNEL_STACK_OFFSET = 0
(IOW: the current patch).

The decision on how exactly we should fix KERNEL_STACK_OFFSET
(set it to SIZEOF_PTREGS or to zero) depends on whether
we switch to using PUSHes, or not. What do you think?

--
vda

2015-02-25 13:25:39

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On 02/25/2015 01:48 PM, Denys Vlasenko wrote:
> On Wed, Feb 25, 2015 at 9:53 AM, Ingo Molnar <[email protected]> wrote:
>> But the fix should be to not touch RSP in SAVE_ARGS, to
>> keep percpu::kernel_stack as an optimized entry point -
>> with KERNEL_STACK_OFFSET pointing to it.
>>
>> So NAK - this should be fixed for real.
>
> IOW, the proposal is to set KERNEL_STACK_OFFSET
> to SIZEOF_PTREGS. I can do that.
>
> However.
>
> There is an orthogonal idea we were discussing: to save
> registers and construct iret frame using PUSH insns, not MOVs.
> IIRC Andy and Linus liked it. I am ambivalent: the code will be smaller,
> but might get slower (at least on some CPUs).
> If we go that way, we will require KERNEL_STACK_OFFSET = 0
> (IOW: the current patch).
>
> The decision on how exactly we should fix KERNEL_STACK_OFFSET
> (set it to SIZEOF_PTREGS or to zero) depends on whether
> we switch to using PUSHes, or not. What do you think?

A data point. I implemented push-based creation of pt_regs
and benchmarked it. The patch is on top of all my latest
patches sent to ML.

On SandyBridge CPU, it does not get slower: seems to be 1 cycle
faster per syscall.

We lose a number of large insns there:

text data bss dec hex filename
- 9863 0 0 9863 2687 entry_64.o
+ 9671 0 0 9671 25c7 entry_64.o


diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index f505cb6..d97bd92 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
* manipulation.
*/
.macro FIXUP_TOP_OF_STACK tmp offset=0
- movq $__USER_DS,SS+\offset(%rsp)
- movq $__USER_CS,CS+\offset(%rsp)
movq RIP+\offset(%rsp),\tmp /* get rip */
movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */
movq EFLAGS+\offset(%rsp),\tmp /* ditto for rflags->r11 */
@@ -245,14 +243,22 @@ GLOBAL(system_call_after_swapgs)
* and short:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
- ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */
- movq %rcx,RIP(%rsp)
- movq %r11,EFLAGS(%rsp)
- movq PER_CPU_VAR(old_rsp),%rcx
- movq %rcx,RSP(%rsp)
- movq_cfi rax,ORIG_RAX
- SAVE_C_REGS_EXCEPT_RAX_RCX_R11
- movq $-ENOSYS,RAX(%rsp)
+ /* Construct struct pt_regs on stack */
+ pushq $__USER_DS /* pt_regs->ss */
+ pushq PER_CPU_VAR(old_rsp) /* pt_regs->sp */
+ pushq %r11 /* pt_regs->flags */
+ pushq $__USER_CS /* pt_regs->cs */
+ pushq %rcx /* pt_regs->ip */
+ pushq %rax /* pt_regs->orig_ax */
+ pushq %rdi /* pt_regs->di */
+ pushq %rsi /* pt_regs->si */
+ pushq %rdx /* pt_regs->dx */
+ pushq %rcx /* pt_regs->cx */
+ pushq $-ENOSYS /* pt_regs->ax */
+ pushq %r8 /* pt_regs->r8 */
+ pushq %r9 /* pt_regs->r9 */
+ pushq %r10 /* pt_regs->r10 */
+ sub $(7*8),%rsp /* pt_regs->r11,bp,bx,r12-15 */
CFI_REL_OFFSET rip,RIP
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS)
jnz tracesys

2015-02-25 13:53:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET


* Denys Vlasenko <[email protected]> wrote:

> > The decision on how exactly we should fix
> > KERNEL_STACK_OFFSET (set it to SIZEOF_PTREGS or to
> > zero) depends on whether we switch to using PUSHes, or
> > not. What do you think?

Yes.

> A data point. I implemented push-based creation of
> pt_regs and benchmarked it. The patch is on top of all my
> latest patches sent to ML.
>
> On SandyBridge CPU, it does not get slower: seems to be 1
> cycle faster per syscall.
>
> We lose a number of large insns there:
>
> text data bss dec hex filename
> - 9863 0 0 9863 2687 entry_64.o
> + 9671 0 0 9671 25c7 entry_64.o

That's a nice reduction in I$ footprint ...

> + /* Construct struct pt_regs on stack */
> + pushq $__USER_DS /* pt_regs->ss */
> + pushq PER_CPU_VAR(old_rsp) /* pt_regs->sp */
> + pushq %r11 /* pt_regs->flags */

Btw., this could also construct all the dwarf annotations
in a natural, maintainable fashion - pushq_cfi and friends?

> + pushq $__USER_CS /* pt_regs->cs */
> + pushq %rcx /* pt_regs->ip */
> + pushq %rax /* pt_regs->orig_ax */
> + pushq %rdi /* pt_regs->di */
> + pushq %rsi /* pt_regs->si */
> + pushq %rdx /* pt_regs->dx */
> + pushq %rcx /* pt_regs->cx */
> + pushq $-ENOSYS /* pt_regs->ax */
> + pushq %r8 /* pt_regs->r8 */
> + pushq %r9 /* pt_regs->r9 */
> + pushq %r10 /* pt_regs->r10 */
> + sub $(7*8),%rsp /* pt_regs->r11,bp,bx,r12-15 */

So the 'SUB' there is a bit sad, but push sequences are
generally easier to read, so I like it altogether.

Then we could indeed get rid of KERNEL_STACK_OFFSET.

Thanks,

Ingo
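
For illustration, the annotated variant could look along these lines,
assuming the pushq_cfi helper (a pushq plus CFI_ADJUST_CFA_OFFSET) from
dwarf2.h:

        pushq_cfi       $__USER_DS              /* pt_regs->ss */
        pushq_cfi       PER_CPU_VAR(old_rsp)    /* pt_regs->sp */
        pushq_cfi       %r11                    /* pt_regs->flags */
        pushq_cfi       $__USER_CS              /* pt_regs->cs */
        pushq_cfi       %rcx                    /* pt_regs->ip */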

2015-02-25 14:43:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Wed, 25 Feb 2015 10:20:43 +0100
Ingo Molnar <[email protected]> wrote:


> But, if we do that, we can do even better, and also do an
> optimization of the 64-bit entry path as well: we could
> simply mask RAX with 0x3ff and not do a compare. Pad the
> syscall table up to 0x400 (1024) entries and fill in the
> table with sys_ni syscall entries.
>
> This is valid on 64-bit and 32-bit kernels as well, and it
> allows the removal of a compare from the syscall entry
> path, at the cost of a couple of kilobytes of unused
> syscall table.
>
> The downside would be that if we ever grow past 1024
> syscall entries we'll be in trouble if new userspace calls
> syscall 513 on an old kernel and gets syscall 1.

What if we test against ~0x3ff and jump to sys_ni if anything is set.
This would still work with new userspace calls on older kernels.

-- Steve
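
As a sketch, that variant keeps one test-and-branch in front of the
indirect call (the label name is an assumption):

        testl   $~0x3ff, %eax                   /* any bits set above the padded table range? */
        jnz     badsys                          /* yes -> -ENOSYS, as for any unknown syscall */
        call    *sys_call_table(, %rax, 8)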

2015-02-25 15:41:12

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Wed, Feb 25, 2015 at 3:43 PM, Steven Rostedt <[email protected]> wrote:
> On Wed, 25 Feb 2015 10:20:43 +0100 Ingo Molnar <[email protected]> wrote:
>> But, if we do that, we can do even better, and also do an
>> optimization of the 64-bit entry path as well: we could
>> simply mask RAX with 0x3ff and not do a compare. Pad the
>> syscall table up to 0x400 (1024) entries and fill in the
>> table with sys_ni syscall entries.
>>
>> This is valid on 64-bit and 32-bit kernels as well, and it
>> allows the removal of a compare from the syscall entry
>> path, at the cost of a couple of kilobytes of unused
>> syscall table.
>>
>> The downside would be that if we ever grow past 1024
>> syscall entries we'll be in trouble if new userspace calls
>> syscall 513 on an old kernel and gets syscall 1.
>
> What if we test against ~0x3ff and jump to sys_ni if anything is set.
> This would still work with new userspace calls on older kernels.

That would require a branch insn. The whole idea of masking
was merely to avoid that branch. If you need a branch,
then you can as well just retain current code.

2015-02-25 16:01:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Wed, 25 Feb 2015 16:40:49 +0100
Denys Vlasenko <[email protected]> wrote:

> >> The downside would be that if we ever grow past 1024
> >> syscall entries we'll be in trouble if new userspace calls
> >> syscall 513 on an old kernel and gets syscall 1.
> >
> > What if we test against ~0x3ff and jump to sys_ni if anything is set.
> > This would still work with new userspace calls on older kernels.
>
> That would require a branch insn. The whole idea of masking
> was merely to avoid that branch. If you need a branch,
> then you can as well just retain current code.

I'm just curious, do all these micro optimizations have any real impact
on real use cases?

That is, if we are going to make the system less robust, shouldn't we
show that it has real benefit?

And yes, I consider user space passing in a system call of 0x401 and
having it work the same as 0x1 as being less robust.

-- Steve

2015-02-25 16:07:25

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Wed, Feb 25, 2015 at 11:01:29AM -0500, Steven Rostedt wrote:
> I'm just curious, do all these micro optimizations have any real impact
> on real use cases?
>
> That is, if we are going to make the system less robust, shouldn't we
> show that it has real benefit?

I'm wondering the same thing this whole time. Is it even worth
the bother if those "improvements" don't show on any reasonable
benchmark...? And maybe we should benchmark stuff first and then
apply.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-02-25 16:15:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On Wed, Feb 25, 2015 at 1:45 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> - BUG_ON(((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
>> + BUG_ON(((current_stack_pointer() ^
>> + (this_cpu_read_stable(kernel_stack) - 1))
>> & ~(THREAD_SIZE - 1)) != 0);
>>
>> preempt_count_sub(HARDIRQ_OFFSET);
>>
>> I added that in and applied this patch.
>
> So this is not just slightly buggy, it's fundamentally
> wrong as well as it removes the possibility of an RSP value
> optimization from the 64-bit path, see my previous mail.

This is just trying to check that the function is executing on the
per-thread stack. It was correct (and fairly heavily tested by Tony)
with KERNEL_STACK_OFFSET being nonzero, but we're checking the wrong
page if KERNEL_STACK_OFFSET becomes zero.

I don't think I understand your objection to this bit.

>
> The right solution would be to make SAVE_ARGS recognize
> when KERNEL_STACK_OFFSET == args-offset and omit the RSP
> fixup in that case, or to simply use a __SAVE_ARGS for the
> 64-bit path, knowing that RSP has the right value already.
>
> Please also add comments that explain the relationship
> between percpu::kernel_stack, KERNEL_STACK_OFFSET and the
> 64-bit system call entry code.
>
> Also, guys, please slow down a bit and be more careful.
> Andy, could you please send me all currently pending entry
> bits pending in your tree, because all the in-flight
> changes make it hard for me to review patches? Please
> (re-)send all patches as well as part of the submission.

Will do later today.

Note that someone just found a bug as a result of this being in -next.

--Andy

2015-02-26 11:43:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET


* Andy Lutomirski <[email protected]> wrote:

> >> I added that in and applied this patch.
> >
> > So this is not just slightly buggy, it's fundamentally
> > wrong as well as it removes the possibility of an RSP
> > value optimization from the 64-bit path, see my
> > previous mail.
>
> This is just trying to check that the function is
> executing on the per-thread stack. It was correct (and
> fairly heavily tested by Tony) with KERNEL_STACK_OFFSET
> being nonzero, but we're checking the wrong page if
> KERNEL_STACK_OFFSET becomes zero.
>
> I don't think I understand your objection to this bit.

I object to the KERNEL_STACK_OFFSET removal patch you fixed
here, not to the add-on fix in particular (which is correct
in the context of that patch).

Thanks,

Ingo

2015-02-26 11:47:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns


* Steven Rostedt <[email protected]> wrote:

> > That would require a branch insn. The whole idea of
> > masking was merely to avoid that branch. If you need a
> > branch, then you can as well just retain current code.
>
> I'm just curious, do all these micro optimizations have
> any real impact on real use cases?

The bona fide removal of a real instruction from a true hot
path is generally always worth doing, you don't even have
to 'prove' that it improves things: unless the claim is
that for some really robust reason the instruction was zero
cost to begin with.

So the main question here is not whether it's worth doing
it, the question is the cost of the removal:

- the change in syscall number overflow handling
behavior. (We might not want the new behavior)

- the increase in the syscall table size.

Thanks,

Ingo

2015-02-26 12:46:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Thu, 26 Feb 2015 12:47:03 +0100
Ingo Molnar <[email protected]> wrote:

> So the main question here is not whether it's worth doing
> it, the question is the cost of the removal:
>
> - the change in syscall number overflow handling
> behavior. (We might not want the new behavior)
>
> - the increase in the syscall table size.

Yep, totally agree.

-- Steve

2015-02-26 12:50:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 1/4] x86: entry.S: tidy up several suboptimal insns

On Thu, 26 Feb 2015 12:47:03 +0100
Ingo Molnar <[email protected]> wrote:


> The bona fide removal of a real instruction from a true hot
> path is generally always worth doing, you don't even have
> to 'prove' that it improves things: unless the claim is
> that for some really robust reason the instruction was zero
> cost to begin with.

I agree that removing instructions from hot paths is for the most part
worth doing without any proof, as long as it doesn't change behavior or
increase the memory footprint. But that's not the case here. If the
change does affect behavior or increases memory footprint, then it
should prove itself as worth doing.

-- Steve

2015-02-26 15:21:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On Thu, Feb 26, 2015 at 3:42 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> >> I added that in and applied this patch.
>> >
>> > So this is not just slightly buggy, it's fundamentally
>> > wrong as well as it removes the possibility of an RSP
>> > value optimization from the 64-bit path, see my
>> > previous mail.
>>
>> This is just trying to check that the function is
>> executing on the per-thread stack. It was correct (and
>> fairly heavily tested by Tony) wither KERNEL_STACK_OFFSET
>> being nonzero, but we're checking the wrong page if
>> KERNEL_STACK_OFFSET becomes zero.
>>
>> I don't think I understand your objection to this bit.
>
> I object to the KERNEL_STACK_OFFSET removal patch you fixed
> here, not to the add-on fix in particular (which is correct
> in the context of that patch).
>

Aha. I misunderstood.

I would object, too, except that I think that a better improvement
would be to build the entire frame using pushes, including the "top of
stack" part, even on fast-path syscalls. That would have
KERNEL_STACK_OFFSET=0.

Actually, I want an even bigger change. kernel_stack seems entirely
pointless to me, since we could just as easily be using tss.sp0 (I
think), as long as there's no useful offset. So maybe the right way
to handle this patch would be to first clean up the ia32entry code and
all the non-asm users to use tss.sp0, and then to figure out what to
do about the syscall64 path.

I'd be happy to defer this patch until the FIXUP_TOP_OF_STACK cleanups
later on, assuming it can be easily extricated, which I haven't
checked yet. Thoughts?

--Andy

2015-02-26 15:30:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET

On Thu, Feb 26, 2015 at 7:21 AM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Feb 26, 2015 at 3:42 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Andy Lutomirski <[email protected]> wrote:
>>
>>> >> I added that in and applied this patch.
>>> >
>>> > So this is not just slightly buggy, it's fundamentally
>>> > wrong as well as it removes the possibility of an RSP
>>> > value optimization from the 64-bit path, see my
>>> > previous mail.
>>>
>>> This is just trying to check that the function is
>>> executing on the per-thread stack. It was correct (and
>>> fairly heavily tested by Tony) with KERNEL_STACK_OFFSET
>>> being nonzero, but we're checking the wrong page if
>>> KERNEL_STACK_OFFSET becomes zero.
>>>
>>> I don't think I understand your objection to this bit.
>>
>> I object to the KERNEL_STACK_OFFSET removal patch you fixed
>> here, not to the add-on fix in particular (which is correct
>> in the context of that patch).
>>
>
> Aha. I misunderstood.
>
> I would object, too, except that I think that a better improvement
> would be to build the entire frame using pushes, including the "top of
> stack" part, even on fast-path syscalls. That would have
> KERNEL_STACK_OFFSET=0.
>
> Actually, I want an even bigger change. kernel_stack seems entirely
> pointless to me, since we could just as easily be using tss.sp0 (I
> think), as long as there's no useful offset. So maybe the right way
> to handle this patch would be to first clean up the ia32entry code and
> all the non-asm users to use tss.sp0, and then to figure out what to
> do about the syscall64 path.
>
> I'd be happy to defer this patch until the FIXUP_TOP_OF_STACK cleanups
> later on, assuming it can be easily extricated, which I haven't
> checked yet. Thoughts?
>

Damnit, there are so many sets of patches that I've confused myself.
I haven't applied this one yet. My bad for letting the set of things
flying around get out of control.

I think I'll NAK this patch as is. Let's get the existing stuff
stabilized first.

For a resubmission of this, or something like it, let's do it in nice
bite-sized pieces:

a) Make sure that asm can read sp0. (It's at a fixed offset from a
percpu variable, so it should be fine, but asm-offsets might need
updating.)

b) Remove the C inline helpers and fix anything that needs fixing,
such as the BUG_ON in traps.c

c) In appropriately split-up pieces, change over the asm code to use
sp0 instead of kernel_stack

d) For the 64-bit syscall path, either switch to sp0 or keep using
kernel_stack but rename it to avoid confusion.

One thing that was problematic with the big layout change patch is
that it unnecessarily made too many changes at once. IMO it really
ought to have introduced new macros (including
ALLOC_PARTIAL_WHATEVER_ON_STACK), then switched users of the macros to
the new ones piece by piece without layout changes (to improve
bisectability and to fix the real underlying problem with the old
code, namely that it was totally incomprehensible), and *then* changed
the layout with a patch in which both the old and new code were
readable.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC