Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752920AbbBXWhg (ORCPT ); Tue, 24 Feb 2015 17:37:36 -0500 Received: from mail-lb0-f176.google.com ([209.85.217.176]:36856 "EHLO mail-lb0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750748AbbBXWhe (ORCPT ); Tue, 24 Feb 2015 17:37:34 -0500 MIME-Version: 1.0 In-Reply-To: <1424803895-4420-2-git-send-email-dvlasenk@redhat.com> References: <1424803895-4420-1-git-send-email-dvlasenk@redhat.com> <1424803895-4420-2-git-send-email-dvlasenk@redhat.com> From: Andy Lutomirski Date: Tue, 24 Feb 2015 14:37:12 -0800 Message-ID: Subject: Re: [PATCH 2/4] x86: get rid of KERNEL_STACK_OFFSET To: Denys Vlasenko Cc: Linus Torvalds , Steven Rostedt , Ingo Molnar , Borislav Petkov , "H. Peter Anvin" , Oleg Nesterov , Frederic Weisbecker , Alexei Starovoitov , Will Drewry , Kees Cook , X86 ML , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 15693 Lines: 373 On Tue, Feb 24, 2015 at 10:51 AM, Denys Vlasenko wrote: > PER_CPU_VAR(kernel_stack) was set up in a way where it points > five stack slots below the top of stack. > > Presumably, it was done to avoid one "sub $5*8,%rsp" > in syscall/sysenter code paths, where iret frame needs to be > created by hand. > > Ironically, none of them benefit from this optimization, > since all of them need to allocate additional data on stack > (struct pt_regs), so they still have to perform subtraction. > And ia32_sysenter_target even needs to *undo* this optimization: > it constructs iret stack with pushes instead of movs, > so it needs to start right at the top. > > This patch eliminates KERNEL_STACK_OFFSET. > PER_CPU_VAR(kernel_stack) now points directly to top of stack. > pt_regs allocations are adjusted to allocate iret frame as well. > > Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??" > are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS) - > "rsp is SIZEOF_PTREGS bytes below the top, calculate > thread_info's address using that information" > > Net result in code is that one instruction is eliminated, > and constants in several others have changed. This has a side effect: kernel_stack now points to the first byte of the page above the stack instead of to the page containing the stack. That means we need this fix: diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index c74f2f5652da..5f20a0a612a6 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -174,7 +174,8 @@ void ist_begin_non_atomic(struct pt_regs *regs) * will catch asm bugs and any attempt to use ist_preempt_enable * from double_fault. */ - BUG_ON(((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack)) + BUG_ON(((current_stack_pointer() ^ + (this_cpu_read_stable(kernel_stack) - 1)) & ~(THREAD_SIZE - 1)) != 0); preempt_count_sub(HARDIRQ_OFFSET); I added that in and applied this patch. Borislav and/or Tony, when all the dust settles, could you make sure to re-run your memory failure test on the result? Thanks, Andy > > Signed-off-by: Denys Vlasenko > CC: Linus Torvalds > CC: Steven Rostedt > CC: Ingo Molnar > CC: Borislav Petkov > CC: "H. Peter Anvin" > CC: Andy Lutomirski > CC: Oleg Nesterov > CC: Frederic Weisbecker > CC: Alexei Starovoitov > CC: Will Drewry > CC: Kees Cook > CC: x86@kernel.org > CC: linux-kernel@vger.kernel.org > --- > arch/x86/ia32/ia32entry.S | 35 +++++++++++++++++------------------ > arch/x86/include/asm/thread_info.h | 8 +++----- > arch/x86/kernel/cpu/common.c | 2 +- > arch/x86/kernel/entry_64.S | 8 ++++---- > arch/x86/kernel/process_32.c | 3 +-- > arch/x86/kernel/process_64.c | 3 +-- > arch/x86/kernel/smpboot.c | 3 +-- > arch/x86/xen/smp.c | 3 +-- > 8 files changed, 29 insertions(+), 36 deletions(-) > > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S > index 01eca80..b349ae8 100644 > --- a/arch/x86/ia32/ia32entry.S > +++ b/arch/x86/ia32/ia32entry.S > @@ -114,7 +114,6 @@ ENTRY(ia32_sysenter_target) > CFI_REGISTER rsp,rbp > SWAPGS_UNSAFE_STACK > movq PER_CPU_VAR(kernel_stack), %rsp > - addq $(KERNEL_STACK_OFFSET),%rsp > /* > * No need to follow this irqs on/off section: the syscall > * disabled irqs, here we enable it straight after entry: > @@ -128,7 +127,7 @@ ENTRY(ia32_sysenter_target) > CFI_REL_OFFSET rsp,0 > pushfq_cfi > /*CFI_REL_OFFSET rflags,0*/ > - movl TI_sysenter_return+THREAD_INFO(%rsp,3*8-KERNEL_STACK_OFFSET),%r10d > + movl TI_sysenter_return+THREAD_INFO(%rsp,3*8),%r10d > CFI_REGISTER rip,r10 > pushq_cfi $__USER32_CS > /*CFI_REL_OFFSET cs,0*/ > @@ -160,8 +159,8 @@ ENTRY(ia32_sysenter_target) > jnz sysenter_fix_flags > sysenter_flags_fixed: > > - orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP) > - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP) > + orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS) > + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > CFI_REMEMBER_STATE > jnz sysenter_tracesys > cmpl $(IA32_NR_syscalls-1),%eax > @@ -178,10 +177,10 @@ sysenter_dispatch: > movq %rax,RAX(%rsp) > DISABLE_INTERRUPTS(CLBR_NONE) > TRACE_IRQS_OFF > - testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP) > + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jnz sysexit_audit > sysexit_from_sys_call: > - andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP) > + andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS) > /* clear IF, that popfq doesn't enable interrupts early */ > andl $~0x200,EFLAGS(%rsp) > movl RIP(%rsp),%edx /* User %eip */ > @@ -226,7 +225,7 @@ sysexit_from_sys_call: > .endm > > .macro auditsys_exit exit > - testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP) > + testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jnz ia32_ret_from_sys_call > TRACE_IRQS_ON > ENABLE_INTERRUPTS(CLBR_NONE) > @@ -241,7 +240,7 @@ sysexit_from_sys_call: > movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi > DISABLE_INTERRUPTS(CLBR_NONE) > TRACE_IRQS_OFF > - testl %edi,TI_flags+THREAD_INFO(%rsp,RIP) > + testl %edi,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jz \exit > CLEAR_RREGS > jmp int_with_check > @@ -263,7 +262,7 @@ sysenter_fix_flags: > > sysenter_tracesys: > #ifdef CONFIG_AUDITSYSCALL > - testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP) > + testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jz sysenter_auditsys > #endif > SAVE_EXTRA_REGS > @@ -312,7 +311,7 @@ ENDPROC(ia32_sysenter_target) > ENTRY(ia32_cstar_target) > CFI_STARTPROC32 simple > CFI_SIGNAL_FRAME > - CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET > + CFI_DEF_CFA rsp,0 > CFI_REGISTER rip,rcx > /*CFI_REGISTER rflags,r11*/ > SWAPGS_UNSAFE_STACK > @@ -324,7 +323,7 @@ ENTRY(ia32_cstar_target) > * disabled irqs and here we enable it straight after entry: > */ > ENABLE_INTERRUPTS(CLBR_NONE) > - ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */ > + ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */ > SAVE_C_REGS_EXCEPT_RCX_R891011 > movl %eax,%eax /* zero extension */ > movq %rax,ORIG_RAX(%rsp) > @@ -347,8 +346,8 @@ ENTRY(ia32_cstar_target) > 1: movl (%r8),%r9d > _ASM_EXTABLE(1b,ia32_badarg) > ASM_CLAC > - orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP) > - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP) > + orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS) > + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > CFI_REMEMBER_STATE > jnz cstar_tracesys > cmpl $(IA32_NR_syscalls-1),%eax > @@ -365,10 +364,10 @@ cstar_dispatch: > movq %rax,RAX(%rsp) > DISABLE_INTERRUPTS(CLBR_NONE) > TRACE_IRQS_OFF > - testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP) > + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jnz sysretl_audit > sysretl_from_sys_call: > - andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP) > + andl $~TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS) > RESTORE_RSI_RDI_RDX > movl RIP(%rsp),%ecx > CFI_REGISTER rip,rcx > @@ -403,7 +402,7 @@ sysretl_audit: > > cstar_tracesys: > #ifdef CONFIG_AUDITSYSCALL > - testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP) > + testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jz cstar_auditsys > #endif > xchgl %r9d,%ebp > @@ -470,8 +469,8 @@ ENTRY(ia32_syscall) > this could be a problem. */ > ALLOC_PT_GPREGS_ON_STACK > SAVE_C_REGS_EXCEPT_R891011 > - orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,RIP) > - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP) > + orl $TS_COMPAT,TI_status+THREAD_INFO(%rsp,SIZEOF_PTREGS) > + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jnz ia32_tracesys > cmpl $(IA32_NR_syscalls-1),%eax > ja ia32_badsys > diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h > index e82e95a..3eb9d05 100644 > --- a/arch/x86/include/asm/thread_info.h > +++ b/arch/x86/include/asm/thread_info.h > @@ -149,7 +149,6 @@ struct thread_info { > #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW) > > #define STACK_WARN (THREAD_SIZE/8) > -#define KERNEL_STACK_OFFSET (5*(BITS_PER_LONG/8)) > > /* > * macros/functions for gaining access to the thread information structure > @@ -163,8 +162,7 @@ DECLARE_PER_CPU(unsigned long, kernel_stack); > static inline struct thread_info *current_thread_info(void) > { > struct thread_info *ti; > - ti = (void *)(this_cpu_read_stable(kernel_stack) + > - KERNEL_STACK_OFFSET - THREAD_SIZE); > + ti = (void *)(this_cpu_read_stable(kernel_stack) - THREAD_SIZE); > return ti; > } > > @@ -184,13 +182,13 @@ static inline unsigned long current_stack_pointer(void) > /* how to get the thread information struct from ASM */ > #define GET_THREAD_INFO(reg) \ > _ASM_MOV PER_CPU_VAR(kernel_stack),reg ; \ > - _ASM_SUB $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg ; > + _ASM_SUB $(THREAD_SIZE),reg ; > > /* > * Same if PER_CPU_VAR(kernel_stack) is, perhaps with some offset, already in > * a certain register (to be used in assembler memory operands). > */ > -#define THREAD_INFO(reg, off) KERNEL_STACK_OFFSET+(off)-THREAD_SIZE(reg) > +#define THREAD_INFO(reg, off) (off)-THREAD_SIZE(reg) > > #endif > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c > index cb56925..a47ca751 100644 > --- a/arch/x86/kernel/cpu/common.c > +++ b/arch/x86/kernel/cpu/common.c > @@ -1116,7 +1116,7 @@ static __init int setup_disablecpuid(char *arg) > __setup("clearcpuid=", setup_disablecpuid); > > DEFINE_PER_CPU(unsigned long, kernel_stack) = > - (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE; > + (unsigned long)&init_thread_union + THREAD_SIZE; > EXPORT_PER_CPU_SYMBOL(kernel_stack); > > #ifdef CONFIG_X86_64 > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S > index b93f3a2..91af6be 100644 > --- a/arch/x86/kernel/entry_64.S > +++ b/arch/x86/kernel/entry_64.S > @@ -237,7 +237,7 @@ ENDPROC(native_usergs_sysret64) > ENTRY(system_call) > CFI_STARTPROC simple > CFI_SIGNAL_FRAME > - CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET > + CFI_DEF_CFA rsp,0 > CFI_REGISTER rip,rcx > /*CFI_REGISTER rflags,r11*/ > SWAPGS_UNSAFE_STACK > @@ -256,13 +256,13 @@ GLOBAL(system_call_after_swapgs) > * and short: > */ > ENABLE_INTERRUPTS(CLBR_NONE) > - ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */ > + ALLOC_PT_GPREGS_ON_STACK 6*8 /* 6*8: space for orig_ax and iret frame */ > SAVE_C_REGS_EXCEPT_RAX_RCX > movq $-ENOSYS,RAX(%rsp) > movq_cfi rax,ORIG_RAX > movq %rcx,RIP(%rsp) > CFI_REL_OFFSET rip,RIP > - testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP) > + testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jnz tracesys > system_call_fastpath: > #if __SYSCALL_MASK == ~0 > @@ -280,7 +280,7 @@ system_call_fastpath: > * Has incomplete stack frame and undefined top of stack. > */ > ret_from_sys_call: > - testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP) > + testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS) > jnz int_ret_from_sys_call_fixup /* Go the the slow path */ > > LOCKDEP_SYS_EXIT > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c > index 8f3ebfe..ecda2bd 100644 > --- a/arch/x86/kernel/process_32.c > +++ b/arch/x86/kernel/process_32.c > @@ -311,8 +311,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) > arch_end_context_switch(next_p); > > this_cpu_write(kernel_stack, > - (unsigned long)task_stack_page(next_p) + > - THREAD_SIZE - KERNEL_STACK_OFFSET); > + (unsigned long)task_stack_page(next_p) + THREAD_SIZE); > > /* > * Restore %gs if needed (which is common) > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c > index 5a2c029..975d342 100644 > --- a/arch/x86/kernel/process_64.c > +++ b/arch/x86/kernel/process_64.c > @@ -414,8 +414,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) > this_cpu_write(__preempt_count, task_thread_info(next_p)->saved_preempt_count); > > this_cpu_write(kernel_stack, > - (unsigned long)task_stack_page(next_p) + > - THREAD_SIZE - KERNEL_STACK_OFFSET); > + (unsigned long)task_stack_page(next_p) + THREAD_SIZE); > > /* > * Now maybe reload the debug registers and handle I/O bitmaps > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index febc6aa..73dfb1f 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -811,8 +811,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle) > initial_gs = per_cpu_offset(cpu); > #endif > per_cpu(kernel_stack, cpu) = > - (unsigned long)task_stack_page(idle) - > - KERNEL_STACK_OFFSET + THREAD_SIZE; > + (unsigned long)task_stack_page(idle) + THREAD_SIZE; > early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu); > initial_code = (unsigned long)start_secondary; > stack_start = idle->thread.sp; > diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c > index 4c071ae..4e71a3b 100644 > --- a/arch/x86/xen/smp.c > +++ b/arch/x86/xen/smp.c > @@ -452,8 +452,7 @@ static int xen_cpu_up(unsigned int cpu, struct task_struct *idle) > clear_tsk_thread_flag(idle, TIF_FORK); > #endif > per_cpu(kernel_stack, cpu) = > - (unsigned long)task_stack_page(idle) - > - KERNEL_STACK_OFFSET + THREAD_SIZE; > + (unsigned long)task_stack_page(idle) + THREAD_SIZE; > > xen_setup_runstate_info(cpu); > xen_setup_timer(cpu); > -- > 1.8.1.4 > -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/