LinuxLists.cc - Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

2009-09-03 20:51:34

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On 08/30/09 11:16, Bastian Blank wrote:
> Hi folks
>
> I upgraded one of my 32bit chroots on an x86-64 machine running under Xen
> lately. All binaries started to segfault. Some extensive checks later
> show the vdso as the culprit. Later I found <[email protected]>
> with the same problem. The full story can be found in the Debian bug
> 544145[1].
>
> It happens with Linux 2.6.30 and 2.6.31-rc8 on Xen 3.2 and 3.4.
>
> For the tests I set the vdso to compat mode to have it loaded on a fixed
> location.
>
> The following program is a minimal test case for the vdso in compat
> mode, it can be compiled against dietlibc to minimize other effects.
>

Is this an AMD machine? Does booting with vdso32=0 on the kernel
command line work around the problem?

J

2009-09-03 22:02:53

by Bastian Blank

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On Thu, Sep 03, 2009 at 01:51:35PM -0700, Jeremy Fitzhardinge wrote:
> On 08/30/09 11:16, Bastian Blank wrote:
> > I upgraded one of my 32bit chroots on an x86-64 machine running under Xen
> > lately. All binaries started to segfault. Some extensive checks later
> > show the vdso as the culprit. Later I found <[email protected]>
> > with the same problem. The full story can be found in the Debian bug
> > 544145[1].
> >
> > It happens with Linux 2.6.30 and 2.6.31-rc8 on Xen 3.2 and 3.4.
> >
> > For the tests I set the vdso to compat mode to have it loaded on a fixed
> > location.
> >
> > The following program is a minimal test case for the vdso in compat
> > mode, it can be compiled against dietlibc to minimize other effects.
>
> Is this an AMD machine? Does booting with vdso32=0 on the kernel
> command line work around the problem?

AFAIK only AMD supports the syscall instruction, so yes, it is an AMD
machine. And yes, disabling the only thing that makes glibc call this
instruction works around it.

Bastian

--
Conquest is easy. Control is not.
-- Kirk, "Mirror, Mirror", stardate unknown

2009-09-03 22:06:37

by Jeremy Fitzhardinge

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On 09/03/09 15:02, Bastian Blank wrote:
> AFAIK only AMD supports the syscall instruction, so yes, it is an AMD
> machine. And yes, disabling the only thing that makes glibc call this
> instruction works around it.
>

The bug actually appears to be in xen_sysret32, ie the crash happens on
the way out of the kernel.

J

2009-09-03 22:36:00

by Bastian Blank

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On Thu, Sep 03, 2009 at 03:06:32PM -0700, Jeremy Fitzhardinge wrote:
> On 09/03/09 15:02, Bastian Blank wrote:
> > AFAIK only AMD supports the syscall instruction, so yes, it is an AMD
> > machine. And yes, disabling the only thing that makes glibc call this
> > instruction works around it.
> The bug actually appears to be in xen_sysret32, ie the crash happens on
> the way out of the kernel.

This function looks weird. It tries to restore the user code segment.
But the documentation from AMD explicitly states that CS and SS are
restored from the STAR register.

Bastian

--
Killing is stupid; useless!
-- McCoy, "A Private Little War", stardate 4211.8

2009-09-04 16:07:39

by Jeremy Fitzhardinge

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On 09/03/09 15:36, Bastian Blank wrote:
> This function looks weird. It tries to restore the user code segment.
> But the documentation from AMD explicitly states that CS and SS are
> restored from the STAR register.

And STAR is always set with:

wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);

so when using sysret to return to 32-bit, it:

The CS selector value is set to MSR IA32_STAR[63:48]. The SS is set
to IA32_STAR[63:48] + 8.

so CS is __USER32_CS and SS is __USER32_DS.

The code for xen_sysret32 is:

ENTRY(xen_sysret32)
    /*
     * We're already on the usermode stack at this point, but
     * still with the kernel gs, so we can easily switch back
     */
    movq %rsp, PER_CPU_VAR(old_rsp)
    movq PER_CPU_VAR(kernel_stack), %rsp

    pushq $__USER32_DS
    pushq PER_CPU_VAR(old_rsp)
    pushq %r11
    pushq $__USER32_CS
    pushq %rcx

    pushq $VGCF_in_syscall
1:  jmp hypercall_iret

The iret frame is:

ss
rsp
rflags
cs
rip <-- rsp

so this constructs a frame of:

__USER32_DS
user_esp
user_eflags
__USER32_CS
user_eip <-- kernel rsp

and then it does the iret hypercall.

But for some reason that's triggering a failsafe callback, which invokes
a GP.

J

2009-09-04 16:20:01

by Bastian Blank

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On Fri, Sep 04, 2009 at 09:07:39AM -0700, Jeremy Fitzhardinge wrote:
> But for some reason that's triggering a failsafe callback, which invokes
> a GP.

Hmm, not in my tests. It always returned to userspace correctly and died
some operations later, usually the "ret". This then produced either a
segfault (unreadable address), sigill (if it managed to reach the ELF
header of the ld.so) or a GPF.

Bastian

--
You! What PLANET is this!
-- McCoy, "The City on the Edge of Forever", stardate 3134.0

2009-09-04 16:56:57

by Jeremy Fitzhardinge

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On 09/04/09 09:20, Bastian Blank wrote:
> On Fri, Sep 04, 2009 at 09:07:39AM -0700, Jeremy Fitzhardinge wrote:
>
>> But for some reason that's triggering a failsafe callback, which invokes
>> a GP.
>>
> Hmm, not in my tests. It always returned to userspace correctly and died
> some operations later, usually the "ret". This then produced either a
> segfault (unreadable address), sigill (if it managed to reach the ELF
> header of the ld.so) or a GPF.

Hm, I may have misdiagnosed it then. Your symptoms are odd; either it's
landing back in userspace in the right place but then stumbling on for a
while before crashing (wrong processor mode?), or the eip is wrong and
it's just landing in the wrong place and crashing immediately.

How non-deterministic is it? Does it differ every time, or from boot to
boot, build to build?

J

2009-09-04 17:46:07

by Bastian Blank

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On Fri, Sep 04, 2009 at 09:07:39AM -0700, Jeremy Fitzhardinge wrote:
> On 09/03/09 15:36, Bastian Blank wrote:
> > This function looks weird. It tries to restore the user code segment.
> > But the documentation from AMD explicitly states that CS and SS are
> > restored from the STAR register.
>
> And STAR is always set with:
> wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);

No. This is the normal kernel setup. But the Xen setup (the relevant
one) looks different:

| #define FLAT_RING3_CS32 0xe023
| wrmsr(MSR_STAR, 0, (FLAT_RING3_CS32<<16) | __HYPERVISOR_CS);

But this does not match my observation either.

And even the native Linux kernel uses "iret" to jump out of a compat
(32bit) syscall. No, I don't want to understand this, but it looks
highly broken.

Bastian

--
Captain's Log, star date 21:34.5...

2009-09-04 18:19:37

by Bastian Blank

Subject: Re: 32bit binaries on x86_64/Xen segfaults in syscall-vdso

On Fri, Sep 04, 2009 at 07:46:05PM +0200, Bastian Blank wrote:
> On Fri, Sep 04, 2009 at 09:07:39AM -0700, Jeremy Fitzhardinge wrote:
> > On 09/03/09 15:36, Bastian Blank wrote:
> > > This function looks weird. It tries to restore the user code segment.
> > > But the documentation from AMD explicitly states that CS and SS are
> > > restored from the STAR register.
> | #define FLAT_RING3_CS32 0xe023
> | wrmsr(MSR_STAR, 0, (FLAT_RING3_CS32<<16) | __HYPERVISOR_CS);
> But this does not match my observation either.

Well, it does. The values for a long-mode program within Xen:

| cs 0xe033 57395
| ss 0xe02b 57387

Values on the bare hardware:

| cs 0x33 51
| ss 0x2b 43

Values for a compatibility-mode program on the bare hardware:

| cs 0x23 35
| ss 0x2b 43

So Xen adds 0xe000 (no idea what that means), but the Linux kernel
expects the values without it. Long mode is not affected.

Okay, I tried the test program again and yes, it jumps back into
long-mode. (See the 0x10 in the restored CS[1].)

| cs 0xe033 57395
| ss 0xe02b 57387

Bastian

[1]: http://amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf,
page 151
--
Where there's no emotion, there's no motive for violence.
-- Spock, "Dagger of the Mind", stardate 2715.1