Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757073AbbDWJ5C (ORCPT ); Thu, 23 Apr 2015 05:57:02 -0400
Received: from mx1.redhat.com ([209.132.183.28]:42263 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756995AbbDWJ45 (ORCPT ); Thu, 23 Apr 2015 05:56:57 -0400
Message-ID: <5538C1C5.7010408@redhat.com>
Date: Thu, 23 Apr 2015 11:56:21 +0200
From: Denys Vlasenko
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Andy Lutomirski, Brian Gerst
CC: Steven Rostedt, Oleg Nesterov, Ingo Molnar, "H. Peter Anvin", Borislav Petkov, Linus Torvalds, Andy Lutomirski, Will Drewry, Frédéric Weisbecker, Alexei Starovoitov, Linux Kernel Mailing List, Kees Cook, Thomas Gleixner, "linux-tip-commits@vger.kernel.org"
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
References: <63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 8515
Lines: 222

On 04/23/2015 10:49 AM, Andy Lutomirski wrote:
> On Thu, Apr 23, 2015 at 12:37 AM, Brian Gerst wrote:
>> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
>> wrote:
>>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>>> Author: Denys Vlasenko
>>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>>> Committer: Ingo Molnar
>>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>>
>>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>>
>>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>>
>>> On 64-bit kernels, the data segment is the same for 32-bit and
>>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>>> selector.
>>>
>>> So there's no need to repeat it by hand. Segment loads are somewhat
>>> expensive: tens of cycles.
>>>
>>> Signed-off-by: Denys Vlasenko
>>> [ Removed unnecessary comment. ]
>>> Signed-off-by: Andy Lutomirski
>>> Cc: Alexei Starovoitov
>>> Cc: Andy Lutomirski
>>> Cc: Borislav Petkov
>>> Cc: Frederic Weisbecker
>>> Cc: H. Peter Anvin
>>> Cc: Kees Cook
>>> Cc: Linus Torvalds
>>> Cc: Oleg Nesterov
>>> Cc: Steven Rostedt
>>> Cc: Will Drewry
>>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org
>>> Signed-off-by: Ingo Molnar
>>> ---
>>>  arch/x86/vdso/vdso32/syscall.S | 2 --
>>>  1 file changed, 2 deletions(-)
>>>
>>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>>> index 5415b56..6b286bb 100644
>>> --- a/arch/x86/vdso/vdso32/syscall.S
>>> +++ b/arch/x86/vdso/vdso32/syscall.S
>>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>>>  .Lpush_ebp:
>>>         movl    %ecx, %ebp
>>>         syscall
>>> -       movl    $__USER32_DS, %ecx
>>> -       movl    %ecx, %ss
>>>         movl    %ebp, %ecx
>>>         popl    %ebp
>>>  .Lpop_ebp:
>>
>> This patch unfortunately is causing Wine to break on some applications:
>>
>> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
>> Register dump:
>>  CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
>>  EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
>>  EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
>>  ESI:00000040 EDI:7ffd4000
>> Stack dump:
>> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
>> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
>> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
>> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
>> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
>> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
>> Backtrace:
>> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
>>   1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
>>   2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
>> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
>> in ntdll (0x00aed648)
>>   3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
>>   4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
>> NumberOfThreadsReleased=0x0(nil))
>> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
>> in ntdll (0x00aed7c8)
>>   5 0x7b874afa SetEvent+0x24(handle=)
>> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
>> in kernel32 (0x00aed7e8)
>>   6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
>> ...
>>
>> __kernel_vsyscall+0x7 points to "pop %ebp".
>>
>> This is on an AMD Phenom(tm) II X6 1055T Processor.
>>
>> It appears that there are some subtle differences in how sysretl works
>> on AMD vs. Intel. According to the Intel docs, the SS selector and
>> descriptor cache is completely reset by sysret to fixed values. The
>> AMD docs however are concerning:
>
> My understanding is that, in long mode, the segment attributes are
> ignored, and that there is no such thing as a "64-bit stack". So...
>
>>
>> AMD's syscall:
>> SS.sel = MSR_STAR.SYSCALL_CS + 8
>> SS.attr = 64-bit stack,dpl0
>
> I don't really believe that.
>
>> SS.base = 0x00000000
>> SS.limit = 0xFFFFFFFF
>>
>> AMD's sysret:
>> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
>>                                 // SS base, limit, attributes unchanged.
>>
>
> I'm pretty sure that this is at least a little bit wrong. It makes no
> sense for me for syscall to set SS.DPL=0 and for sysret to leave
> SS.DPL=0. It had better at least change DPL to 3. (Except... don't
> they mean RPL? Why is the DPL cached at all? But RPL is clearly
> changed, since it's part of the selector.)
>
>> Not changing base or limit is no big deal, but not changing attributes
>> could be the problem. It might be leaving the "64-bit stack"
>> attribute set, for whatever that means.
>
> Hmm. I don't know if I believe that explanation. For one thing, the
> APM says "Executing SYSRET in non-64-bit mode or with a 16- or 32-bit
> operand size returns to 32-bit mode with a 32-bit stack pointer."
>
> We can revert this patch or fix it, but I'd like to at least try to
> understand what's wrong first. Borislav, any ideas?
>
> I'm curious whether we can somehow end up in the kernel without a
> sensible SS. What happens if we have SS = 0?

Nothing happens. We continue to execute as if nothing is wrong.

24593_APM_v21.pdf says:

8.9 Long-Mode Interrupt Control Transfers

The long-mode architecture expands the legacy interrupt-mechanism
to support 64-bit operating systems and applications.
These changes include:

* All interrupt handlers are 64-bit code and operate in 64-bit mode.
* The size of an interrupt-stack push is fixed at 64 bits (8 bytes).
* The interrupt-stack frame is aligned on a 16-byte boundary.
* The stack pointer, SS:RSP, is pushed unconditionally on interrupts,
  rather than conditionally based on a change in CPL.
* The SS selector register is loaded with a null selector as a result of an interrupt, if the CPL changes.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ !!!!!
* The IRET instruction behavior changes, to unconditionally pop SS:RSP,
  allowing a null SS to be popped.
...

Intel's doc has a similar statement.

> Try this on for size:
>
> 1. Wine process does syscall
> 2. Context switch to any other task
> 3. Interrupt (software or hardware), which loads SS with ss0, which is
> 0 on x86_64.

(should be "which loads SS with 0 unconditionally" according to AMD docs)

> 4. Context switch back to Wine.
> 5. sysretl

Plausible. But why does it happen only with Wine?

The fix can look like this (untested):

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0c302d0..9f4c232 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -198,6 +198,18 @@ sysexit_from_sys_call:
          * with 'sysenter' and it uses the SYSENTER calling convention.
          */
         andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
+        /*
+         * On AMD, SYSRET32 does not modify %ss cached descriptor;
+         * and in kernel, %ss can be loaded with 0, invalidating it.
+         * On return to 32-bit mode, this makes stack ops fail.
+         * Fix %ss only if it's wrong: it takes ~40 cycles.
+         */
+        movl %ss, %ecx
+        cmpl $__USER32_DS, %ecx
+        je 1f
+        movl $__USER32_DS, %ecx
+        movl %ecx, %ss
+1:
         movl RIP(%rsp),%ecx /* User %eip */
         CFI_REGISTER rip,rcx
         RESTORE_RSI_RDI
@@ -408,6 +420,18 @@ cstar_dispatch:
 sysretl_from_sys_call:
         andl $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
         RESTORE_RSI_RDI_RDX
+        /*
+         * On AMD, SYSRET32 does not modify %ss cached descriptor;
+         * and in kernel, %ss can be loaded with 0, invalidating it.
+         * On return to 32-bit mode, this makes stack ops fail.
+         * Fix %ss only if it's wrong: it takes ~40 cycles.
+         */
+        movl %ss, %ecx
+        cmpl $__USER32_DS, %ecx
+        je 1f
+        movl $__USER32_DS, %ecx
+        movl %ecx, %ss
+1:
         movl RIP(%rsp),%ecx
         CFI_REGISTER rip,rcx
         movl EFLAGS(%rsp),%r11d
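
For readers who want to follow the hypothesized sequence (syscall, context switch, interrupt that nulls %ss, switch back, sysretl) without a test machine, here is a toy C model of it. It is only a sketch of the theory discussed in this thread, not a description of confirmed hardware behavior. The struct, the function names, the single "usable" bit standing in for the cached descriptor, and the 0x18 kernel selector are all illustrative assumptions; only the 0x2b value is taken from the register dump (SS:002b).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selector value seen in the register dump in this thread (SS:002b). */
#define USER32_DS 0x2bu
/* Hypothetical stand-in for the kernel's flat stack selector. */
#define KERNEL_DS 0x18u

/* Toy view of %ss: the visible selector plus one bit that summarizes
 * whether the hidden (cached) descriptor is usable for stack accesses.
 * Real hardware caches base, limit and attributes; this is an assumption
 * made to keep the model small. */
struct ss_reg {
    uint16_t selector;
    bool     cache_usable;
};

/* An explicit "movl ..., %ss" reloads selector and cached descriptor. */
static void load_ss(struct ss_reg *ss, uint16_t sel)
{
    ss->selector = sel;
    ss->cache_usable = (sel & ~3u) != 0;  /* a null selector is not usable */
}

/* Long-mode interrupt with a CPL change: SS gets a null selector. */
static void interrupt_with_cpl_change(struct ss_reg *ss)
{
    ss->selector = 0;
    ss->cache_usable = false;
}

/* SYSRET to 32-bit mode as the quoted AMD pseudocode describes it:
 * the selector is replaced, the cached descriptor is left untouched.
 * Whether real CPUs behave exactly this way is the open question. */
static void sysretl_amd_model(struct ss_reg *ss)
{
    ss->selector = USER32_DS;
}

/* The check from the proposed (untested) patch: reload only if needed. */
static void fixup_ss(struct ss_reg *ss)
{
    if (ss->selector != USER32_DS)
        load_ss(ss, USER32_DS);
}

/* Runs steps 1-5 and reports whether 32-bit stack ops would work. */
bool user_stack_usable_after_sysretl(bool apply_fix)
{
    struct ss_reg ss;

    load_ss(&ss, KERNEL_DS);         /* 1. task enters kernel via syscall   */
                                     /* 2. context switch to another task   */
    interrupt_with_cpl_change(&ss);  /* 3. interrupt nulls SS               */
    assert(ss.selector == 0);        /* model invariant after step 3        */
                                     /* 4. context switch back to the task  */
    if (apply_fix)
        fixup_ss(&ss);
    sysretl_amd_model(&ss);          /* 5. sysretl                          */
    return ss.cache_usable;
}
```

In this model, user_stack_usable_after_sysretl(false) reports an unusable stack segment, which would match the fault at "pop %ebp", while user_stack_usable_after_sysretl(true) reports a usable one. That only shows the proposed check is consistent with the theory; it says nothing about why only Wine triggers the problem.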