2015-04-23 07:37:43

by Brian Gerst

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
<[email protected]> wrote:
> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
> Author: Denys Vlasenko <[email protected]>
> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>
> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>
> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>
> On 64-bit kernels, the data segment is the same for 32-bit and
> 64-bit userspace, and the SYSRET instruction loads %ss with its
> selector.
>
> So there's no need to repeat it by hand. Segment loads are somewhat
> expensive: tens of cycles.
>
> Signed-off-by: Denys Vlasenko <[email protected]>
> [ Removed unnecessary comment. ]
> Signed-off-by: Andy Lutomirski <[email protected]>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Will Drewry <[email protected]>
> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/vdso/vdso32/syscall.S | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
> index 5415b56..6b286bb 100644
> --- a/arch/x86/vdso/vdso32/syscall.S
> +++ b/arch/x86/vdso/vdso32/syscall.S
> @@ -19,8 +19,6 @@ __kernel_vsyscall:
> .Lpush_ebp:
> movl %ecx, %ebp
> syscall
> - movl $__USER32_DS, %ecx
> - movl %ecx, %ss
> movl %ebp, %ecx
> popl %ebp
> .Lpop_ebp:

This patch unfortunately is causing Wine to break on some applications:

Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
Register dump:
CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
ESI:00000040 EDI:7ffd4000
Stack dump:
0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
0x00aed65c: 00000020 00000000 00000000 7bc4f141
Backtrace:
=>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
[/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
in ntdll (0x00aed648)
3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
NumberOfThreadsReleased=0x0(nil))
[/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
in ntdll (0x00aed7c8)
5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
[/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
in kernel32 (0x00aed7e8)
6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
...

__kernel_vsyscall+0x7 points to "pop %ebp".

This is on an AMD Phenom(tm) II X6 1055T Processor.

It appears that there are some subtle differences in how sysretl works
on AMD vs. Intel. According to the Intel docs, the SS selector and
descriptor cache is completely reset by sysret to fixed values. The
AMD docs however are concerning:

AMD's syscall:
SS.sel = MSR_STAR.SYSCALL_CS + 8
SS.attr = 64-bit stack,dpl0
SS.base = 0x00000000
SS.limit = 0xFFFFFFFF

AMD's sysret:
SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
// SS base, limit, attributes unchanged.

Not changing base or limit is no big deal, but not changing attributes
could be the problem. It might be leaving the "64-bit stack"
attribute set, for whatever that means.

Reloading SS from the GDT would obviously reset any bad state left by
sysretl. Unfortunately we may have to put it back in, and then NOP it
out on Intel.

--
Brian Gerst


2015-04-23 08:50:22

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 12:37 AM, Brian Gerst <[email protected]> wrote:
> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
> <[email protected]> wrote:
>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Author: Denys Vlasenko <[email protected]>
>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>
>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>
>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>
>> On 64-bit kernels, the data segment is the same for 32-bit and
>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>> selector.
>>
>> So there's no need to repeat it by hand. Segment loads are somewhat
>> expensive: tens of cycles.
>>
>> Signed-off-by: Denys Vlasenko <[email protected]>
>> [ Removed unnecessary comment. ]
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> Cc: Alexei Starovoitov <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Will Drewry <[email protected]>
>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/vdso/vdso32/syscall.S | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>> index 5415b56..6b286bb 100644
>> --- a/arch/x86/vdso/vdso32/syscall.S
>> +++ b/arch/x86/vdso/vdso32/syscall.S
>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>> .Lpush_ebp:
>> movl %ecx, %ebp
>> syscall
>> - movl $__USER32_DS, %ecx
>> - movl %ecx, %ss
>> movl %ebp, %ecx
>> popl %ebp
>> .Lpop_ebp:
>
> This patch unfortunately is causing Wine to break on some applications:
>
> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000
> Stack dump:
> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
> Backtrace:
> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
> in ntdll (0x00aed648)
> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
> NumberOfThreadsReleased=0x0(nil))
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
> in ntdll (0x00aed7c8)
> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
> in kernel32 (0x00aed7e8)
> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
> ...
>
> __kernel_vsyscall+0x7 points to "pop %ebp".
>
> This is on an AMD Phenom(tm) II X6 1055T Processor.
>
> It appears that there are some subtle differences in how sysretl works
> on AMD vs. Intel. According to the Intel docs, the SS selector and
> descriptor cache is completely reset by sysret to fixed values. The
> AMD docs however are concerning:

My understanding is that, in long mode, the segment attributes are
ignored, and that there is no such thing as a "64-bit stack". So...

>
> AMD's syscall:
> SS.sel = MSR_STAR.SYSCALL_CS + 8
> SS.attr = 64-bit stack,dpl0

I don't really believe that.

> SS.base = 0x00000000
> SS.limit = 0xFFFFFFFF
>
> AMD's sysret:
> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
> // SS base, limit, attributes unchanged.
>

I'm pretty sure that this is at least a little bit wrong. It makes no
sense for me for syscall to set SS.DPL=0 and for sysret to leave
SS.DPL=0. It had better at least change DPL to 3. (Except... don't
they mean RPL? Why is the DPL cached at all? But RPL is clearly
changed, since it's part of the selector.)

> Not changing base or limit is no big deal, but not changing attributes
> could be the problem. It might be leaving the "64-bit stack"
> attribute set, for whatever that means.

Hmm. I don't know if I believe that explanation. For one thing, the
APM says "Executing SYSRET in non-64-bit mode or with a 16- or 32-bit
operand size returns to 32-bit mode with a 32-bit stack pointer."

We can revert this patch or fix it, but I'd like to at least try to
understand what's wrong first. Borislav, any ideas?

I'm curious whether we can somehow end up in the kernel without a
sensible SS. What happens if we have SS = 0?

Try this on for size:

1. Wine process does syscall
2. Context switch to any other task
3. Interrupt (software or hardware), which loads SS with ss0, which is
0 on x86_64.
4. Context switch back to Wine.
5. sysretl

Would fixing this be as simple as changing this code in
arch/x86/kernel/process.c:

__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
.x86_tss = {
.sp0 = TOP_OF_INIT_STACK,
#ifdef CONFIG_X86_32
.ss0 = __KERNEL_DS,

by moving the ifdef down a line? Even if that fixed it, it would be
extremely fragile, but IMO it would be a good change to make
regardless (i.e. the kernel's SS would be less unpredictable).

In any event, if we really need the SS reload, I'd rather reload it in
kernel space before sysret than in user space after sysret, so we
never run user code with whatever screwed up SS hidden part we have
here. Unless, of course, Xen would corrupt it for us.

>
> Reloading SS from the GDT would obviously reset any bad state left by
> sysretl. Unfortunately we may have to put it back in, and then NOP it
> out on Intel.

At least this is easy now that alternatives work in the vdso.

--Andy

2015-04-23 09:07:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 1:49 AM, Andy Lutomirski <[email protected]> wrote:
>
> I'm curious whether we can somehow end up in the kernel without a
> sensible SS. What happens if we have SS = 0?
>
> Try this on for size:
>
> 1. Wine process does syscall
> 2. Context switch to any other task
> 3. Interrupt (software or hardware), which loads SS with ss0, which is
> 0 on x86_64.
> 4. Context switch back to Wine.
> 5. sysretl
>
> Would fixing this be as simple as changing this code in
> arch/x86/kernel/process.c:
>
> __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
> .x86_tss = {
> .sp0 = TOP_OF_INIT_STACK,
> #ifdef CONFIG_X86_32
> .ss0 = __KERNEL_DS,
>
> by moving the ifdef down a line? Even if that fixed it, it would be
> extremely fragile, but IMO it would be a good change to make
> regardless (i.e. the kernel's SS would be less unpredictable).

Confirmed with KVM on VMX: we can definitely end up in the kernel with SS == 0.

I don't know whether changing ss0 would be a good idea, though. It
would be cleaner, but it could slow down interrupt processing:
interrupt delivery would have to do an extra GDT load.

Food for thought: wouldn't this mean that we have a bug on sysretq
too? If we're in the kernel with SS == 0, we do sysretq, and then
user code does a far jump to 32-bit code, then we end up with a bogus
SS. Maybe we don't care, and reloading SS on every sysretq would
suck. We could fix it up in a kind of evil way: in do_stack_segment,
we could detect that we had SS == __USER_DS, in which case #SS should
be impossible, and just return without signalling the process. IRET
would fix up the attributes.

We just might need a stable fix, though -- I wonder if there's any bad
interaction with opportunistic sysret in 4.0. Maybe we should
benchmark ss0 = __KERNEL_DS and try it after all.

--Andy

2015-04-23 09:21:10

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 09:37 AM, Brian Gerst wrote:
> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
> <[email protected]> wrote:
>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Author: Denys Vlasenko <[email protected]>
>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>
>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>
>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>
>> On 64-bit kernels, the data segment is the same for 32-bit and
>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>> selector.
>>
>> So there's no need to repeat it by hand. Segment loads are somewhat
>> expensive: tens of cycles.
>>
>> Signed-off-by: Denys Vlasenko <[email protected]>
>> [ Removed unnecessary comment. ]
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> Cc: Alexei Starovoitov <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Linus Torvalds <[email protected]>
>> Cc: Oleg Nesterov <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Will Drewry <[email protected]>
>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> arch/x86/vdso/vdso32/syscall.S | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>> index 5415b56..6b286bb 100644
>> --- a/arch/x86/vdso/vdso32/syscall.S
>> +++ b/arch/x86/vdso/vdso32/syscall.S
>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>> .Lpush_ebp:
>> movl %ecx, %ebp
>> syscall
>> - movl $__USER32_DS, %ecx
>> - movl %ecx, %ss
>> movl %ebp, %ecx
>> popl %ebp
>> .Lpop_ebp:
>
> This patch unfortunately is causing Wine to break on some applications:
>
> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000
> Stack dump:
> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
> Backtrace:
> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
> in ntdll (0x00aed648)
> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
> NumberOfThreadsReleased=0x0(nil))
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
> in ntdll (0x00aed7c8)
> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
> in kernel32 (0x00aed7e8)
> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
> ...
>
> __kernel_vsyscall+0x7 points to "pop %ebp".
>
> This is on an AMD Phenom(tm) II X6 1055T Processor.
>
> It appears that there are some subtle differences in how sysretl works
> on AMD vs. Intel. According to the Intel docs, the SS selector and
> descriptor cache is completely reset by sysret to fixed values. The
> AMD docs however are concerning:
>
> AMD's syscall:
> SS.sel = MSR_STAR.SYSCALL_CS + 8
> SS.attr = 64-bit stack,dpl0
> SS.base = 0x00000000
> SS.limit = 0xFFFFFFFF
>
> AMD's sysret:
> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
> // SS base, limit, attributes unchanged.



> Not changing base or limit is no big deal, but not changing attributes
> could be the problem. It might be leaving the "64-bit stack"
> attribute set, for whatever that means.

I am not aware of any officially existing "64-bit" stack or
data segment attribute. x86 data segment descriptors
don't have any such bits, the 64-bitness of stack operations
in long mode is hardwired. (Unlike code segment descriptors,
which _do_ have a bit which controls 64-bitness).

This is not to say that CPU internally is prohibited from having
something along those lines.

However, if AMD CPUs would have a bug where after sysretl %ss
descriptor cache is left in a bad state causing stack ops to be
done in 64-bit fashion, *any* 32-bit userspace would immediately explode.
This is not the case.

What Wine could do differently from a typical Linux executable?
It may use nonzero %ss base, it may use a non-4Gb limit,
it may use 16-bit stack segment, it may use an expand-down stack segment.
(I know very little about Windows/Wine internals, so I just listed
all possibilities which came to mind).

Looking at the error message:

> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000

it is not coming from Wine itself, looks like it's from Windows code,
and I'd guess it just tells us that they got exception 12,
without further information on the cause.

I don't have any solid theories here, so I'd just brainstorm.

* what if %ss before syscall was NOT the usual value of 0x2b, but some
other segment, not the typical 0-base, 0xffffffff limit 32-bit expand-up one?
Not restoring proper %ss would not go well.
[but then, Intel CPUs work, and old code worked....]

* what if there is indeed a hidden "64-bit mode" bit in %ss descriptor
cache which makes stack ops use %rsp, not %esp; and %rsp somehow
has nonzero high half? Then, in compat mode, CPU checks segment limit
on every memory reference including POP (which it doesn't do in 64-bit mode)
and we got exception 12. [But we do:
movl RSP(%rsp),%esp
USERGS_SYSRET32
which ensures that %rsp never has upper bits set...]

* what if in kernel (64-bmode) we load %ss descriptor cache with a segment
whose limit is too small or with outright invalid segment? In 64-bit mode,
%ss is not actually checked by CPU. For example, nested exceptions/interrupts
even auto-load %ss with zero selector - and stack ops continue to work fine
in 64-bit mode. But as soon as we return to non-64-bit world, CPU starts
checking data sergemnt limits and validity - and if cached %ss descriptor
was made invalid while in 64-bit mode and was not changed by SYSRETL,
POP will fail.

> Reloading SS from the GDT would obviously reset any bad state left by
> sysretl. Unfortunately we may have to put it back in, and then NOP it
> out on Intel.

Sure we can, but let's try to get a better understanding
what's going on here.

2015-04-23 09:24:21

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 10:49 AM, Andy Lutomirski wrote:
> Would fixing this be as simple as changing this code in
> arch/x86/kernel/process.c:
>
> __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
> .x86_tss = {
> .sp0 = TOP_OF_INIT_STACK,
> #ifdef CONFIG_X86_32
> .ss0 = __KERNEL_DS,
>
> by moving the ifdef down a line?

You can't do that. 64-bit TSS has no .ss0/1/2 fields :)

2015-04-23 09:47:20

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 01:49:50AM -0700, Andy Lutomirski wrote:
> I'm pretty sure that this is at least a little bit wrong. It makes no
> sense for me for syscall to set SS.DPL=0 and for sysret to leave
> SS.DPL=0. It had better at least change DPL to 3. (Except... don't
> they mean RPL? Why is the DPL cached at all? But RPL is clearly
> changed, since it's part of the selector.)

I think this should explain it a bit:

"• STAR—The STAR register has the following fields (unless otherwise
noted, all bits are read/write):

- SYSRET CS and SS Selectors—Bits 63:48. This field is used to specify
both the CS and SS selectors loaded into CS and SS during SYSRET. If
SYSRET is returning to 32-bit mode (either legacy or compatibility),
this field is copied directly into the CS selector field. If SYSRET is
returning to 64-bit mode, the CS selector is set to this field + 16.
SS.Sel is set to this field + 8, regardless of the target mode. Because
SYSRET always returns to CPL 3, the RPL bits 49:48 should be initialized
to 11b.

- SYSCALL CS and SS Selectors—Bits 47:32. This field is used to
specify both the CS and SS selectors loaded into CS and SS during
SYSCALL. This field is copied directly into CS.Sel. SS.Sel is set to
this field + 8. Because SYSCALL always switches to CPL 0, the RPL bits
33:32 should be initialized to 00b."

So I'm reading "SYSRET always returns to CPL3" and "SYSCALL always
switches to CPL 0" as those are being enforced. And this is also
mentioned in the SYSCALL/SYSRET documentation:

"SYSCALL sets the CPL to 0, regardless of the values of bits 33:32 of
the STAR register."

and

"SYSRET sets the CPL to 3, regardless of the values of bits 49:48 of the
star register."

BUT(!)

"It is also assumed (but not checked) that the RPL of the SYSCALL and
SYSRET target selectors are set to 0 and 3, respectively."

>
> > Not changing base or limit is no big deal, but not changing attributes
> > could be the problem. It might be leaving the "64-bit stack"
> > attribute set, for whatever that means.
>
> Hmm. I don't know if I believe that explanation. For one thing, the
> APM says "Executing SYSRET in non-64-bit mode or with a 16- or 32-bit
> operand size returns to 32-bit mode with a 32-bit stack pointer."
>
> We can revert this patch or fix it, but I'd like to at least try to
> understand what's wrong first. Borislav, any ideas?

Right, so according to the documentation, SYSRET does load SS from
MSR_STAR[63:48] and forces the RPL bits [49:48] to 3.

So if we really do have a bad %ss, then something is changing it in
MSR_STAR.

But that sounds really far-fetched and implausible so it must be
something else.

/me scratches head...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 09:57:02

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 10:49 AM, Andy Lutomirski wrote:
> On Thu, Apr 23, 2015 at 12:37 AM, Brian Gerst <[email protected]> wrote:
>> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
>> <[email protected]> wrote:
>>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>>> Author: Denys Vlasenko <[email protected]>
>>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>>> Committer: Ingo Molnar <[email protected]>
>>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>>
>>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>>
>>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>>
>>> On 64-bit kernels, the data segment is the same for 32-bit and
>>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>>> selector.
>>>
>>> So there's no need to repeat it by hand. Segment loads are somewhat
>>> expensive: tens of cycles.
>>>
>>> Signed-off-by: Denys Vlasenko <[email protected]>
>>> [ Removed unnecessary comment. ]
>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>> Cc: Alexei Starovoitov <[email protected]>
>>> Cc: Andy Lutomirski <[email protected]>
>>> Cc: Borislav Petkov <[email protected]>
>>> Cc: Frederic Weisbecker <[email protected]>
>>> Cc: H. Peter Anvin <[email protected]>
>>> Cc: Kees Cook <[email protected]>
>>> Cc: Linus Torvalds <[email protected]>
>>> Cc: Oleg Nesterov <[email protected]>
>>> Cc: Steven Rostedt <[email protected]>
>>> Cc: Will Drewry <[email protected]>
>>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org
>>> Signed-off-by: Ingo Molnar <[email protected]>
>>> ---
>>> arch/x86/vdso/vdso32/syscall.S | 2 --
>>> 1 file changed, 2 deletions(-)
>>>
>>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>>> index 5415b56..6b286bb 100644
>>> --- a/arch/x86/vdso/vdso32/syscall.S
>>> +++ b/arch/x86/vdso/vdso32/syscall.S
>>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>>> .Lpush_ebp:
>>> movl %ecx, %ebp
>>> syscall
>>> - movl $__USER32_DS, %ecx
>>> - movl %ecx, %ss
>>> movl %ebp, %ecx
>>> popl %ebp
>>> .Lpop_ebp:
>>
>> This patch unfortunately is causing Wine to break on some applications:
>>
>> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
>> Register dump:
>> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
>> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
>> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
>> ESI:00000040 EDI:7ffd4000
>> Stack dump:
>> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
>> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
>> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
>> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
>> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
>> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
>> Backtrace:
>> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
>> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
>> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
>> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
>> in ntdll (0x00aed648)
>> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
>> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
>> NumberOfThreadsReleased=0x0(nil))
>> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
>> in ntdll (0x00aed7c8)
>> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
>> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
>> in kernel32 (0x00aed7e8)
>> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
>> ...
>>
>> __kernel_vsyscall+0x7 points to "pop %ebp".
>>
>> This is on an AMD Phenom(tm) II X6 1055T Processor.
>>
>> It appears that there are some subtle differences in how sysretl works
>> on AMD vs. Intel. According to the Intel docs, the SS selector and
>> descriptor cache is completely reset by sysret to fixed values. The
>> AMD docs however are concerning:
>
> My understanding is that, in long mode, the segment attributes are
> ignored, and that there is no such thing as a "64-bit stack". So...
>
>>
>> AMD's syscall:
>> SS.sel = MSR_STAR.SYSCALL_CS + 8
>> SS.attr = 64-bit stack,dpl0
>
> I don't really believe that.
>
>> SS.base = 0x00000000
>> SS.limit = 0xFFFFFFFF
>>
>> AMD's sysret:
>> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
>> // SS base, limit, attributes unchanged.
>>
>
> I'm pretty sure that this is at least a little bit wrong. It makes no
> sense for me for syscall to set SS.DPL=0 and for sysret to leave
> SS.DPL=0. It had better at least change DPL to 3. (Except... don't
> they mean RPL? Why is the DPL cached at all? But RPL is clearly
> changed, since it's part of the selector.)
>
>> Not changing base or limit is no big deal, but not changing attributes
>> could be the problem. It might be leaving the "64-bit stack"
>> attribute set, for whatever that means.
>
> Hmm. I don't know if I believe that explanation. For one thing, the
> APM says "Executing SYSRET in non-64-bit mode or with a 16- or 32-bit
> operand size returns to 32-bit mode with a 32-bit stack pointer."
>
> We can revert this patch or fix it, but I'd like to at least try to
> understand what's wrong first. Borislav, any ideas?
>
> I'm curious whether we can somehow end up in the kernel without a
> sensible SS. What happens if we have SS = 0?

Nothing happens. We continue to execute as if nothing is wrong.

24593_APM_v21.pdf says:

8.9 Long-Mode Interrupt Control Transfers
The long-mode architecture expands the legacy interrupt-mechanism to support 64-bit operating
systems and applications. These changes include:

* All interrupt handlers are 64-bit code and operate in 64-bit mode.
* The size of an interrupt-stack push is fixed at 64 bits (8 bytes).
* The interrupt-stack frame is aligned on a 16-byte boundary.
* The stack pointer, SS:RSP, is pushed unconditionally on interrupts, rather than conditionally based
on a change in CPL.
* The SS selector register is loaded with a null selector as a result of an interrupt, if the CPL changes.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ !!!!!
* The IRET instruction behavior changes, to unconditionally pop SS:RSP, allowing a null SS to be
popped.
...


Intel's doc has a similar statement.

> Try this on for size:
>
> 1. Wine process does syscall
> 2. Context switch to any other task
> 3. Interrupt (software or hardware), which loads SS with ss0, which is
> 0 on x86_64.

(should be "which loads SS with 0 unconditionally" according to AMD docs)

> 4. Context switch back to Wine.
> 5. sysretl

Plausible. But why it happens only with Wine?


The fix can look like this (untested):


diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0c302d0..9f4c232 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -198,6 +198,18 @@ sysexit_from_sys_call:
* with 'sysenter' and it uses the SYSENTER calling convention.
*/
andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
+ /*
+ * On AMD, SYSRET32 does not modify %ss cached descriptor;
+ * and in kernel, %ss can be loaded with 0, invalidating it.
+ * On return to 32-bit mode, this makes stack ops fail.
+ * Fix %ss only if it's wrong: it takes ~40 cycles.
+ */
+ movl %ss, %ecx
+ cmpl $__USER32_DS, %ecx
+ je 1f
+ movl $__USER32_DS, %ecx
+ movl %ecx, %ss
+1:
movl RIP(%rsp),%ecx /* User %eip */
CFI_REGISTER rip,rcx
RESTORE_RSI_RDI
@@ -408,6 +420,18 @@ cstar_dispatch:
sysretl_from_sys_call:
andl $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
RESTORE_RSI_RDI_RDX
+ /*
+ * On AMD, SYSRET32 does not modify %ss cached descriptor;
+ * and in kernel, %ss can be loaded with 0, invalidating it.
+ * On return to 32-bit mode, this makes stack ops fail.
+ * Fix %ss only if it's wrong: it takes ~40 cycles.
+ */
+ movl %ss, %ecx
+ cmpl $__USER32_DS, %ecx
+ je 1f
+ movl $__USER32_DS, %ecx
+ movl %ecx, %ss
+1:
movl RIP(%rsp),%ecx
CFI_REGISTER rip,rcx
movl EFLAGS(%rsp),%r11d

2015-04-23 09:56:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 11:20:56AM +0200, Denys Vlasenko wrote:
> * what if %ss before syscall was NOT the usual value of 0x2b, but some
> other segment, not the typical 0-base, 0xffffffff limit 32-bit expand-up one?
> Not restoring proper %ss would not go well.
> [but then, Intel CPUs work, and old code worked....]

Have we run the exact same reproducer on Intel already?

Brian, can you run the same thing on an Intel box, if you haven't done
so already?

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 10:18:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 11:56:21AM +0200, Denys Vlasenko wrote:
> The fix can look like this (untested):
>
>
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 0c302d0..9f4c232 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -198,6 +198,18 @@ sysexit_from_sys_call:
> * with 'sysenter' and it uses the SYSENTER calling convention.
> */
> andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
> + /*
> + * On AMD, SYSRET32 does not modify %ss cached descriptor;

Ok, but doc says that in both long and compat mode, SYSRET does load
SS.sel with the value in MSR_STAR...

Hmmm.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 10:27:20

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 12:18 PM, Borislav Petkov wrote:
> On Thu, Apr 23, 2015 at 11:56:21AM +0200, Denys Vlasenko wrote:
>> The fix can look like this (untested):
>>
>>
>> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> index 0c302d0..9f4c232 100644
>> --- a/arch/x86/ia32/ia32entry.S
>> +++ b/arch/x86/ia32/ia32entry.S
>> @@ -198,6 +198,18 @@ sysexit_from_sys_call:
>> * with 'sysenter' and it uses the SYSENTER calling convention.
>> */
>> andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
>> + /*
>> + * On AMD, SYSRET32 does not modify %ss cached descriptor;
>
> Ok, but doc says that in both long and compat mode, SYSRET does load
> SS.sel with the value in MSR_STAR...

Yes. It loads *selector*. AMD docs say that selector is loaded as you say,
but *cached descriptor* of SS (which is a different entity) is not modified.

If *cached descriptor* is invalid, in 32-bit mode stack ops
will fail. (In 64-bit mode, CPU doesn't do those checks).

2015-04-23 10:44:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 12:26:43PM +0200, Denys Vlasenko wrote:
> Yes. It loads *selector*. AMD docs say that selector is loaded as you say,
> but *cached descriptor* of SS (which is a different entity) is not modified.
>
> If *cached descriptor* is invalid, in 32-bit mode stack ops
> will fail. (In 64-bit mode, CPU doesn't do those checks).

So how can that happen with wine? Something's changing the cached
descriptor and only the write to %ss reloads it with the correct value?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 11:06:26

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 12:44 PM, Borislav Petkov wrote:
> On Thu, Apr 23, 2015 at 12:26:43PM +0200, Denys Vlasenko wrote:
>> Yes. It loads *selector*. AMD docs say that selector is loaded as you say,
>> but *cached descriptor* of SS (which is a different entity) is not modified.
>>
>> If *cached descriptor* is invalid, in 32-bit mode stack ops
>> will fail. (In 64-bit mode, CPU doesn't do those checks).
>
> So how can that happen with wine? Something's changing the cached
> descriptor ... ?

Yes. We know of at least one case where documentation
(both Intel and AMD) specifically states that %ss is set to NULL:
this happens on every interrupt and exception.

If interrupt/exception returns to the same task with IRET, all is well:
%ss is reloaded from iret frame (both selector and cached descriptor).

However, if interrupt results in a preemption, we end up in a different
task (say, Wine), and we can return to its userspace code
with SYSRETL. *This* type of return does not reload cached descriptor.

I don't know why it happens only with Wine. Maybe it just happens
with Wine much more easily than with other 32-bit tasks?

2015-04-23 11:11:55

by Brian Gerst

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 5:56 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 11:20:56AM +0200, Denys Vlasenko wrote:
>> * what if %ss before syscall was NOT the usual value of 0x2b, but some
>> other segment, not the typical 0-base, 0xffffffff limit 32-bit expand-up one?
>> Not restoring proper %ss would not go well.
>> [but then, Intel CPUs work, and old code worked....]
>
> Have we run the exact same reproducer on Intel already?
>
> Brian, can you run the same thing on an Intel box, if you haven't done
> so already?
>
> Thanks.

Tested it on a Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz and cannot
reproduce it there. Note that on Intel CPUs, we use the sysenter VDSO
but return with sysret.

--
Brian Gerst

2015-04-23 11:12:49

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 09:37 AM, Brian Gerst wrote:
> This patch unfortunately is causing Wine to break on some applications:
>
> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000
> Stack dump:
> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
> Backtrace:
> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
> in ntdll (0x00aed648)
> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
> NumberOfThreadsReleased=0x0(nil))
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
> in ntdll (0x00aed7c8)
> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
> in kernel32 (0x00aed7e8)
> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
> ...
>
> __kernel_vsyscall+0x7 points to "pop %ebp".
>
> This is on an AMD Phenom(tm) II X6 1055T Processor.
>
> It appears that there are some subtle differences in how sysretl works
> on AMD vs. Intel. According to the Intel docs, the SS selector and
> descriptor cache is completely reset by sysret to fixed values. The
> AMD docs however are concerning:
>
> AMD's syscall:
> SS.sel = MSR_STAR.SYSCALL_CS + 8
> SS.attr = 64-bit stack,dpl0
> SS.base = 0x00000000
> SS.limit = 0xFFFFFFFF
>
> AMD's sysret:
> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
> // SS base, limit, attributes unchanged.
>
> Not changing base or limit is no big deal, but not changing attributes
> could be the problem. It might be leaving the "64-bit stack"
> attribute set, for whatever that means.
>
> Reloading SS from the GDT would obviously reset any bad state left by
> sysretl. Unfortunately we may have to put it back in, and then NOP it
> out on Intel.

Please try attached tentative fix:


>From 4b9e9bdf08b092724934e62892116af6dd712128 Mon Sep 17 00:00:00 2001
From: Denys Vlasenko <[email protected]>
Date: Thu, 23 Apr 2015 12:38:47 +0200
Subject: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary

AMD docs say that SYSRET32 loads %ss selector with a value from a MSR,
but *cached descriptor* of %ss is not modified.

It was observed to cause Wine crashes. Conjectured sequence of events
causing it is:

1. Wine process does syscall.
2. Context switch to any other task.
3. Interrupt (software or hardware), which loads %ss with 0.
(This happens according to both Intel and AMD docs.)
%ss cached descriptor is set to "invalid" state.
4. Context switch back to Wine.
5. sysret to 32-bit. %ss has correct value but its cached descriptor
is still invalid.
6. The very first userspace POP insn after this causes exception 12.

Fix this by checking %ss value. If it is not __KERNEL_DS,
(and it really can only be __KERNEL_DS or zero),
then load it with __KERNEL_DS.

Signed-off-by: Denys Vlasenko <[email protected]>
---
arch/x86/ia32/ia32entry.S | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 0c302d0..94c0b39 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -198,6 +198,20 @@ sysexit_from_sys_call:
* with 'sysenter' and it uses the SYSENTER calling convention.
*/
andl $~TS_COMPAT,ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
+ /*
+ * On AMD, SYSRET32 loads %ss selector, but does not modify its
+ * cached descriptor; and in kernel, %ss can be loaded with 0,
+ * setting cached descriptor to "invalid". This has no effect on
+ * 64-bit mode, but on return to 32-bit mode, it makes stack ops fail.
+ * Fix %ss only if it's wrong: read from %ss takes ~2 cycles,
+ * write to %ss is ~40 cycles.
+ */
+ movl %ss, %ecx
+ cmpl $__KERNEL_DS, %ecx
+ je 1f
+ movl $__KERNEL_DS, %ecx
+ movl %ecx, %ss
+1:
movl RIP(%rsp),%ecx /* User %eip */
CFI_REGISTER rip,rcx
RESTORE_RSI_RDI
@@ -408,6 +421,20 @@ cstar_dispatch:
sysretl_from_sys_call:
andl $~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
RESTORE_RSI_RDI_RDX
+ /*
+ * On AMD, SYSRET32 loads %ss selector, but does not modify its
+ * cached descriptor; and in kernel, %ss can be loaded with 0,
+ * setting cached descriptor to "invalid". This has no effect on
+ * 64-bit mode, but on return to 32-bit mode, it makes stack ops fail.
+ * Fix %ss only if it's wrong: read from %ss takes ~2 cycles,
+ * write to %ss is ~40 cycles.
+ */
+ movl %ss, %ecx
+ cmpl $__KERNEL_DS, %ecx
+ je 1f
+ movl $__KERNEL_DS, %ecx
+ movl %ecx, %ss
+1:
movl RIP(%rsp),%ecx
CFI_REGISTER rip,rcx
movl EFLAGS(%rsp),%r11d
--
1.8.1.4



2015-04-23 11:28:49

by Brian Gerst

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 5:20 AM, Denys Vlasenko <[email protected]> wrote:
> On 04/23/2015 09:37 AM, Brian Gerst wrote:
>> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
>> <[email protected]> wrote:
>>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>>> Author: Denys Vlasenko <[email protected]>
>>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>>> Committer: Ingo Molnar <[email protected]>
>>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>>
>>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>>
>>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>>
>>> On 64-bit kernels, the data segment is the same for 32-bit and
>>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>>> selector.
>>>
>>> So there's no need to repeat it by hand. Segment loads are somewhat
>>> expensive: tens of cycles.
>>>
>>> Signed-off-by: Denys Vlasenko <[email protected]>
>>> [ Removed unnecessary comment. ]
>>> Signed-off-by: Andy Lutomirski <[email protected]>
>>> Cc: Alexei Starovoitov <[email protected]>
>>> Cc: Andy Lutomirski <[email protected]>
>>> Cc: Borislav Petkov <[email protected]>
>>> Cc: Frederic Weisbecker <[email protected]>
>>> Cc: H. Peter Anvin <[email protected]>
>>> Cc: Kees Cook <[email protected]>
>>> Cc: Linus Torvalds <[email protected]>
>>> Cc: Oleg Nesterov <[email protected]>
>>> Cc: Steven Rostedt <[email protected]>
>>> Cc: Will Drewry <[email protected]>
>>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@kernel.org
>>> Signed-off-by: Ingo Molnar <[email protected]>
>>> ---
>>> arch/x86/vdso/vdso32/syscall.S | 2 --
>>> 1 file changed, 2 deletions(-)
>>>
>>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>>> index 5415b56..6b286bb 100644
>>> --- a/arch/x86/vdso/vdso32/syscall.S
>>> +++ b/arch/x86/vdso/vdso32/syscall.S
>>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>>> .Lpush_ebp:
>>> movl %ecx, %ebp
>>> syscall
>>> - movl $__USER32_DS, %ecx
>>> - movl %ecx, %ss
>>> movl %ebp, %ecx
>>> popl %ebp
>>> .Lpop_ebp:
>>
>> This patch unfortunately is causing Wine to break on some applications:
>>
>> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
>> Register dump:
>> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
>> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
>> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
>> ESI:00000040 EDI:7ffd4000
>> Stack dump:
>> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
>> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
>> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
>> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
>> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
>> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
>> Backtrace:
>> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
>> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
>> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
>> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
>> in ntdll (0x00aed648)
>> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
>> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
>> NumberOfThreadsReleased=0x0(nil))
>> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
>> in ntdll (0x00aed7c8)
>> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
>> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
>> in kernel32 (0x00aed7e8)
>> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
>> ...
>>
>> __kernel_vsyscall+0x7 points to "pop %ebp".
>>
>> This is on an AMD Phenom(tm) II X6 1055T Processor.
>>
>> It appears that there are some subtle differences in how sysretl works
>> on AMD vs. Intel. According to the Intel docs, the SS selector and
>> descriptor cache is completely reset by sysret to fixed values. The
>> AMD docs however are concerning:
>>
>> AMD's syscall:
>> SS.sel = MSR_STAR.SYSCALL_CS + 8
>> SS.attr = 64-bit stack,dpl0
>> SS.base = 0x00000000
>> SS.limit = 0xFFFFFFFF
>>
>> AMD's sysret:
>> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
>> // SS base, limit, attributes unchanged.
>
>
>
>> Not changing base or limit is no big deal, but not changing attributes
>> could be the problem. It might be leaving the "64-bit stack"
>> attribute set, for whatever that means.
>
> I am not aware of any officially existing "64-bit" stack or
> data segment attribute. x86 data segment descriptors
> don't have any such bits, the 64-bitness of stack operations
> in long mode is hardwired. (Unlike code segment descriptors,
> which _do_ have a bit which controls 64-bitness).
>
> This is not to say that CPU internally is prohibited from having
> something along those lines.
>
> However, if AMD CPUs would have a bug where after sysretl %ss
> descriptor cache is left in a bad state causing stack ops to be
> done in 64-bit fashion, *any* 32-bit userspace would immediately explode.
> This is not the case.
>
> What Wine could do differently from a typical Linux executable?
> It may use nonzero %ss base, it may use a non-4Gb limit,
> it may use 16-bit stack segment, it may use an expand-down stack segment.
> (I know very little about Windows/Wine internals, so I just listed
> all possibilities which came to mind).

This is a modern Win32 app, so it shouldn't be doing any segment
modifications on its own (Win32 uses flat segments just like Linux).

> Looking at the error message:
>
>> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
>> Register dump:
>> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
>> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
>> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
>> ESI:00000040 EDI:7ffd4000
>
> it is not coming from Wine itself, looks like it's from Windows code,
> and I'd guess it just tells us that they got exception 12,
> without further information on the cause.

The backtrace shows the fault is in the VDSO, the first pop
instruction after returning from the kernel.

--
Brian Gerst

2015-04-23 11:47:19

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 01:28 PM, Brian Gerst wrote:
>> Looking at the error message:
>>
>>> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
>>> Register dump:
>>> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
>>> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
>>> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
>>> ESI:00000040 EDI:7ffd4000
>>
>> it is not coming from Wine itself, looks like it's from Windows code,
>> and I'd guess it just tells us that they got exception 12,
>> without further information on the cause.
>
> The backtrace shows the fault is in the VDSO, the first pop
> instruction after returning from the kernel.

Yes, I understand at which insn exception happens.

I meant that *the message* is not generated by Wine or kernel.
grep for "Unhandled exception:" comes up empty
on their source trees.

After much grepping, I see that I'm wrong.
It does come from wine:

void info_win32_exception(void)
{
const EXCEPTION_RECORD* rec;
ADDRESS64 addr;
char hexbuf[MAX_OFFSET_TO_STR_LEN];

if (!dbg_curr_thread->in_exception)
{
dbg_printf("Thread isn't in an exception\n");
return;
}
rec = &dbg_curr_thread->excpt_record;
memory_get_current_pc(&addr);

/* print some infos */
dbg_printf("%s: ",
dbg_curr_thread->first_chance ? "First chance exception" : "Unhandled exception");
switch (rec->ExceptionCode)
{
case EXCEPTION_BREAKPOINT:
dbg_printf("breakpoint");
break;
case EXCEPTION_SINGLE_STEP:
dbg_printf("single step");
break;
case EXCEPTION_INT_DIVIDE_BY_ZERO:
dbg_printf("divide by zero");
break;
case EXCEPTION_INT_OVERFLOW:
dbg_printf("overflow");
break;
case EXCEPTION_ARRAY_BOUNDS_EXCEEDED:
dbg_printf("array bounds");
break;
case EXCEPTION_ILLEGAL_INSTRUCTION:
dbg_printf("illegal instruction");
break;
case EXCEPTION_STACK_OVERFLOW:
dbg_printf("stack overflow");
break;
...

I hoped we can easily make Wine show exception's error code.
Not that easy :/

2015-04-23 12:01:25

by Brian Gerst

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 7:46 AM, Denys Vlasenko <[email protected]> wrote:
> On 04/23/2015 01:28 PM, Brian Gerst wrote:
>>> Looking at the error message:
>>>
>>>> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
>>>> Register dump:
>>>> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
>>>> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
>>>> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
>>>> ESI:00000040 EDI:7ffd4000
>>>
>>> it is not coming from Wine itself, looks like it's from Windows code,
>>> and I'd guess it just tells us that they got exception 12,
>>> without further information on the cause.
>>
>> The backtrace shows the fault is in the VDSO, the first pop
>> instruction after returning from the kernel.
>
> Yes, I understand at which insn exception happens.
>
> I meant that *the message* is not generated by Wine or kernel.
> grep for "Unhandled exception:" comes up empty
> on their source trees.
>
> After much grepping, I see that I'm wrong.
> It does come from wine:
>
> void info_win32_exception(void)
> {
> const EXCEPTION_RECORD* rec;
> ADDRESS64 addr;
> char hexbuf[MAX_OFFSET_TO_STR_LEN];
>
> if (!dbg_curr_thread->in_exception)
> {
> dbg_printf("Thread isn't in an exception\n");
> return;
> }
> rec = &dbg_curr_thread->excpt_record;
> memory_get_current_pc(&addr);
>
> /* print some infos */
> dbg_printf("%s: ",
> dbg_curr_thread->first_chance ? "First chance exception" : "Unhandled exception");
> switch (rec->ExceptionCode)
> {
> case EXCEPTION_BREAKPOINT:
> dbg_printf("breakpoint");
> break;
> case EXCEPTION_SINGLE_STEP:
> dbg_printf("single step");
> break;
> case EXCEPTION_INT_DIVIDE_BY_ZERO:
> dbg_printf("divide by zero");
> break;
> case EXCEPTION_INT_OVERFLOW:
> dbg_printf("overflow");
> break;
> case EXCEPTION_ARRAY_BOUNDS_EXCEEDED:
> dbg_printf("array bounds");
> break;
> case EXCEPTION_ILLEGAL_INSTRUCTION:
> dbg_printf("illegal instruction");
> break;
> case EXCEPTION_STACK_OVERFLOW:
> dbg_printf("stack overflow");
> break;
> ...
>
> I hoped we can easily make Wine show exception's error code.
> Not that easy :/
>

I added some debug messages to an unpatched kernel:
[ 382.639763] traps: wine[14281] trap stack segment ip:f7716c07
sp:fff9a024 error:0
[ 382.639778] traps: wine[14281] trap stack segment ip:f7716c07
sp:fff9a024 error:0

The patch does appear to fix the crash.

--
Brian Gerst

2015-04-23 12:35:38

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On 04/23/2015 02:01 PM, Brian Gerst wrote:
> The patch does appear to fix the crash.

Thanks for testing it.

I changed the patch a bit and sent it to Ingo.
Can you test that one and confirm that it also works?

2015-04-23 15:48:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 3:44 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 12:26:43PM +0200, Denys Vlasenko wrote:
>> Yes. It loads *selector*. AMD docs say that selector is loaded as you say,
>> but *cached descriptor* of SS (which is a different entity) is not modified.
>>
>> If *cached descriptor* is invalid, in 32-bit mode stack ops
>> will fail. (In 64-bit mode, CPU doesn't do those checks).
>
> So how can that happen with wine? Something's changing the cached
> descriptor and only the write to %ss reloads it with the correct value?
>

I bet that the actual crashing task is a red herring. How about this theory:

Some *other* Wine program is running either in 16-bit mode or in
32-bit mode with the stack limit != 0xffffffff. We take an interrupt
and, as an optimization, the CPU doesn't reset the cached SS limit
and/or attributes (what would it reload them to, anyway?). Now we
context switch to a syscall issues by the dying task and do sysretl,
and we accidentally land back in userspace in segmented mode.

If this theory is right, it explains why we only see this on Wine. It
also means we may have an info leak on 64-bit syscalls, too (see other
thread).

A test for this would be to run syscalls in a loop in a native 32-bit
process while running dosbox or something like that.

--Andy


> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --



--
Andy Lutomirski
AMA Capital Management, LLC

2015-04-23 16:41:30

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

An alternative fix would be, if we decided to schedule
in an interrupt, check %ss for zero and reload it
with __KERNEL_DS before schedule.

2015-04-23 16:50:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 9:41 AM, Denys Vlasenko
<[email protected]> wrote:
> An alternative fix would be, if we decided to schedule
> in an interrupt, check %ss for zero and reload it
> with __KERNEL_DS before schedule.

For anyone who has the right hardware (not me!), a possible reproducer is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/

make && taskset -c 0 ./sysret_ss_attrs_32

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-04-23 17:14:54

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 09:50:17AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 23, 2015 at 9:41 AM, Denys Vlasenko
> <[email protected]> wrote:
> > An alternative fix would be, if we decided to schedule
> > in an interrupt, check %ss for zero and reload it
> > with __KERNEL_DS before schedule.
>
> For anyone who has the right hardware (not me!), a possible reproducer is here:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>
> make && taskset -c 0 ./sysret_ss_attrs_32

[ 195.438441] traps: sysret_ss_attrs[1745] trap stack segment ip:f77dab87 sp:ffdf0b70 error:0
[ 196.831952] traps: sysret_ss_attrs[1748] trap stack segment ip:f7786b87 sp:fffc0810 error:0

Ran it twice.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 18:24:40

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 10:14 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 09:50:17AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 23, 2015 at 9:41 AM, Denys Vlasenko
>> <[email protected]> wrote:
>> > An alternative fix would be, if we decided to schedule
>> > in an interrupt, check %ss for zero and reload it
>> > with __KERNEL_DS before schedule.
>>
>> For anyone who has the right hardware (not me!), a possible reproducer is here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>>
>> make && taskset -c 0 ./sysret_ss_attrs_32
>
> [ 195.438441] traps: sysret_ss_attrs[1745] trap stack segment ip:f77dab87 sp:ffdf0b70 error:0
> [ 196.831952] traps: sysret_ss_attrs[1748] trap stack segment ip:f7786b87 sp:fffc0810 error:0
>
> Ran it twice.

That nails it. We really do leak segment limits to other tasks on AMD
chips. I see at least two questions we should answer before fixing
this:

1. Do we consider this to be enough of a security issue that we want
to fix it for 64-bit userspace as well?

2. Do we fix it at sysret time (at the cost of an ss read even in the
best case on AMD chips) or at context switch time (with the risk of
more ss writes than necessary)?

I slightly favor fixing it at sysret time for both the 32-bit and
64-bit paths., but I'm not really convinced.

Regardless, I'd rather fix this in the kernel than the vdso. I see no
reason at all that we should ever return to 32-bit userspace with a
corrupt SS cached descriptor.

(OK, tiny lie. The vdso approach avoids a nop somewhere on Intel
CPUs. Big deal.)

--Andy

2015-04-23 18:36:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 11:24 AM, Andy Lutomirski <[email protected]> wrote:
>
> 1. Do we consider this to be enough of a security issue that we want
> to fix it for 64-bit userspace as well?
>
> 2. Do we fix it at sysret time (at the cost of an ss read even in the
> best case on AMD chips) or at context switch time (with the risk of
> more ss writes than necessary)?

So I'm ok with doing it in the sysret path, and just know that we
randomly have 0 or __KERNEL_DS in %ss while in kernel mode depending
on how we entered it.

But I'd like to make sure it is:

- documented somewhere (I feel that I understand the problem better
now, but three years from now I will have forgotten all the details
and be surprised about the 0/__KERNEL_DS issue all over again)

And by "documented somewhere" I very much mean _both_ the
0/__KERNEL_DS issue _and_ the odd 'AMD sysret buglet'

- I think we should do it for both 32-bit and 64-bit, or at the very
least add a comment to the 64-bit path about this

- I think the sysret code should also make it clear that this is a CPU buglet

That "sysret code should make it clear" coould possibly be just by
commenting on it (so that "document it clearly" could be the entire
solution), but possibly by even making the fixup be something that
uses the instruction rewriting logic so that it is both (a) very
explicit as a "these CPU's have a problem" and (b) cheaper on CPU's
that don't have the issue. I don't think anybody really cares about a
few cycles in the 32-bit compat path on a 64-bit kernel, but the pure
64-bit case we might care about an extra cycle or three.

Considering that it apparently _is_ a CPU buglet, I don't think we
want to touch the schedule path after all. The odd behavior wrt %ss is
kind of confusing, but if we document it and just clarify that all
returns to user space fix things up, I guess I don't care.

Hmm?

Linus

2015-04-23 18:52:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 11:24:14AM -0700, Andy Lutomirski wrote:
> That nails it. We really do leak segment limits to other tasks on AMD
> chips. I see at least two questions we should answer before fixing
> this:

Ok, WTF is going on?! Even this trivial test case causes a Bus Error:

---
static unsigned short GDT3(int idx)
{
return (idx << 3) | 3;
}

static void *threadproc(void *ctx)
{
printf("Hello world\n");
return NULL;
}

int main()
{
pthread_t thread;
if (pthread_create(&thread, 0, threadproc, 0) != 0)
err(1, "pthread_create");

while (1) {
usleep(1);
}

return 0;
}
---

$ make sysret_ss_attrs_32
gcc -m32 -o sysret_ss_attrs_32 -O2 -g -std=gnu99 -pthread -Wall sysret_ss_attrs.c -lrt -ldl
sysret_ss_attrs.c:23:23: warning: ‘GDT3’ defined but not used [-Wunused-function]
static unsigned short GDT3(int idx)
^
$ taskset -c 0 ./sysret_ss_attrs_32
Hello world
Bus error

in dmesg:

[ 583.389368] traps: sysret_ss_attrs[2135] trap stack segment ip:f7784b87 sp:ffb640c0 error:0

This is insane.

> 1. Do we consider this to be enough of a security issue that we want
> to fix it for 64-bit userspace as well?
>
> 2. Do we fix it at sysret time (at the cost of an ss read even in the
> best case on AMD chips) or at context switch time (with the risk of
> more ss writes than necessary)?
>
> I slightly favor fixing it at sysret time for both the 32-bit and
> 64-bit paths., but I'm not really convinced.

Yeah, a "call amd_fixup_ss" which gets NOPped out on Intel with
alternatives sounds nice and clean to me.

Pending we have an explanation WTH is going on...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-23 19:21:18

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 11:52 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 11:24:14AM -0700, Andy Lutomirski wrote:
>> That nails it. We really do leak segment limits to other tasks on AMD
>> chips. I see at least two questions we should answer before fixing
>> this:
>
> Ok, WTF is going on?! Even this trivial test case causes a Bus Error:
>
> ---
> static unsigned short GDT3(int idx)
> {
> return (idx << 3) | 3;
> }
>
> static void *threadproc(void *ctx)
> {
> printf("Hello world\n");
> return NULL;
> }
>
> int main()
> {
> pthread_t thread;
> if (pthread_create(&thread, 0, threadproc, 0) != 0)
> err(1, "pthread_create");
>
> while (1) {
> usleep(1);
> }
>
> return 0;
> }
> ---
>
> $ make sysret_ss_attrs_32
> gcc -m32 -o sysret_ss_attrs_32 -O2 -g -std=gnu99 -pthread -Wall sysret_ss_attrs.c -lrt -ldl
> sysret_ss_attrs.c:23:23: warning: ‘GDT3’ defined but not used [-Wunused-function]
> static unsigned short GDT3(int idx)
> ^
> $ taskset -c 0 ./sysret_ss_attrs_32
> Hello world
> Bus error
>
> in dmesg:
>
> [ 583.389368] traps: sysret_ss_attrs[2135] trap stack segment ip:f7784b87 sp:ffb640c0 error:0
>
> This is insane.
>
>> 1. Do we consider this to be enough of a security issue that we want
>> to fix it for 64-bit userspace as well?
>>
>> 2. Do we fix it at sysret time (at the cost of an ss read even in the
>> best case on AMD chips) or at context switch time (with the risk of
>> more ss writes than necessary)?
>>
>> I slightly favor fixing it at sysret time for both the 32-bit and
>> 64-bit paths., but I'm not really convinced.
>
> Yeah, a "call amd_fixup_ss" which gets NOPped out on Intel with
> alternatives sounds nice and clean to me.
>
> Pending we have an explanation WTH is going on...

Huh, what? Maybe interrupt delivery on AMD CPUs actually blows away
the descriptor cache completely.

On VMX, at least, it's possible to read the descriptor cache directly.
I wonder whether the KVM SVM code could be instrumented to do
something similar.

--Andy

>
> --
> Regards/Gruss,
> Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --



--
Andy Lutomirski
AMA Capital Management, LLC

2015-04-23 19:50:49

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

On Thu, Apr 23, 2015 at 8:52 PM, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 23, 2015 at 11:24:14AM -0700, Andy Lutomirski wrote:
>> That nails it. We really do leak segment limits to other tasks on AMD
>> chips. I see at least two questions we should answer before fixing
>> this:
>
> Ok, WTF is going on?! Even this trivial test case causes a Bus Error:
>
> ---
> static unsigned short GDT3(int idx)
> {
> return (idx << 3) | 3;
> }
>
> static void *threadproc(void *ctx)
> {
> printf("Hello world\n");
> return NULL;
> }
>
> int main()
> {
> pthread_t thread;
> if (pthread_create(&thread, 0, threadproc, 0) != 0)
> err(1, "pthread_create");
>
> while (1) {
> usleep(1);
> }
>
> return 0;
> }
> ---
>
> $ make sysret_ss_attrs_32
> gcc -m32 -o sysret_ss_attrs_32 -O2 -g -std=gnu99 -pthread -Wall sysret_ss_attrs.c -lrt -ldl
> sysret_ss_attrs.c:23:23: warning: ‘GDT3’ defined but not used [-Wunused-function]
> static unsigned short GDT3(int idx)
> ^
> $ taskset -c 0 ./sysret_ss_attrs_32
> Hello world
> Bus error
>
> in dmesg:
>
> [ 583.389368] traps: sysret_ss_attrs[2135] trap stack segment ip:f7784b87 sp:ffb640c0 error:0

I reproduced it.

I also confirm that the patch fixes it.

In fact, the simplest reproducer is

int main()
{
while (1)
usleep(1);
return 0;
}

- no threads necessary. You only need to do a lot of sysret32's,
and eventually it happens. If you omit -m32, it doesn't happen.

--
vda