2008-06-16 12:30:16

by Sean Young

[permalink] [raw]
Subject: Regression: boot failure on AMD Elan TS-5500

The symptons are either a crash or reboot on booting the kernel. No printk's
have occurred yet -- even with early printk on.

2.6.15 worked on this board however current does not. I've bisected it to:

commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
Author: Rusty Russell <[email protected]>
Date: Sun Oct 21 16:41:35 2007 -0700

i386: paravirt boot sequence

This patch uses the updated boot protocol to do paravirtualized boot.
If the boot version is >= 2.07, then it will do two things:

1. Check the bootparams loadflags to see if we should reload the
segment registers and clear interrupts. This is appropriate
for normal native boot and some paravirtualized environments, but
inapproprate for others.

2. Check the hardware architecture, and dispatch to the appropriate
kernel entrypoint. If the bootloader doesn't set this, then we
simply do the normal boot sequence.

Unfortunately this patch can't be reverse-applied onto mainline, so I've
tried it on 2.6.24-rc1. 2.6.24-rc1 as-is fails, but with this patch
reverted it does work.

This is the board:

http://www.embeddedarm.com/products/board-detail.php?product=TS-5500

Regards
Sean


2008-06-16 13:28:34

by Rusty Russell

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Monday 16 June 2008 22:11:39 Sean Young wrote:
> The symptons are either a crash or reboot on booting the kernel. No
> printk's have occurred yet -- even with early printk on.
>
> 2.6.15 worked on this board however current does not. I've bisected it to:
>
> commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
> Author: Rusty Russell <[email protected]>
> Date: Sun Oct 21 16:41:35 2007 -0700
>
> i386: paravirt boot sequence

Hi Sean,

Thanks for tracking this down. Can we try reverting this in pieces to see
exactly what the cause was?

1) Revert arch/x86/boot/compressed/head_32.S
2) If that doesn't fix it, Try removing the 8 lines which were added to
arch/x86/kernel/head_32.S

I assume this is CONFIG_PARAVIRT=n?

Thanks,
Rusty.

2008-06-16 13:38:07

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Rusty Russell wrote:
> On Monday 16 June 2008 22:11:39 Sean Young wrote:
>
>> The symptons are either a crash or reboot on booting the kernel. No
>> printk's have occurred yet -- even with early printk on.
>>
>> 2.6.15 worked on this board however current does not. I've bisected it to:
>>
>> commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
>> Author: Rusty Russell <[email protected]>
>> Date: Sun Oct 21 16:41:35 2007 -0700
>>
>> i386: paravirt boot sequence
>>
>
> Hi Sean,
>
> Thanks for tracking this down. Can we try reverting this in pieces to see
> exactly what the cause was?
>
> 1) Revert arch/x86/boot/compressed/head_32.S
> 2) If that doesn't fix it, Try removing the 8 lines which were added to
> arch/x86/kernel/head_32.S
>
And the arch/x86/boot/compressed/misc_32.c change too, just in case...

> I assume this is CONFIG_PARAVIRT=n?

What bootloader does this platform use? I wonder if it's setting things
up in an unexpected way?

J

2008-06-16 16:19:55

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Mon, Jun 16, 2008 at 06:37:12AM -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> >On Monday 16 June 2008 22:11:39 Sean Young wrote:
> >
> >>The symptons are either a crash or reboot on booting the kernel. No
> >>printk's have occurred yet -- even with early printk on.
> >>
> >>2.6.15 worked on this board however current does not. I've bisected it to:
> >>
> >> commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
> >> Author: Rusty Russell <[email protected]>
> >> Date: Sun Oct 21 16:41:35 2007 -0700
> >>
> >> i386: paravirt boot sequence
> >>
> >
> >Hi Sean,
> >
> > Thanks for tracking this down. Can we try reverting this in pieces to
> > see exactly what the cause was?
> >
> >1) Revert arch/x86/boot/compressed/head_32.S
> >2) If that doesn't fix it, Try removing the 8 lines which were added to
> >arch/x86/kernel/head_32.S
> >
> And the arch/x86/boot/compressed/misc_32.c change too, just in case...

Reverting only head_32.S makes the problem go away. I didn't try any other
changes.

> >I assume this is CONFIG_PARAVIRT=n?

Indeed.

> What bootloader does this platform use? I wonder if it's setting things
> up in an unexpected way?

It's syslinux. That must be the culprit, considering the change to head_32.S.

I'm digging in syslinux right now.

Thanks
Sean

2008-06-16 17:08:36

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
> On Mon, Jun 16, 2008 at 06:37:12AM -0700, Jeremy Fitzhardinge wrote:
>
>> Rusty Russell wrote:
>>
>>> On Monday 16 June 2008 22:11:39 Sean Young wrote:
>>>
>>>
>>>> The symptons are either a crash or reboot on booting the kernel. No
>>>> printk's have occurred yet -- even with early printk on.
>>>>
>>>> 2.6.15 worked on this board however current does not. I've bisected it to:
>>>>
>>>> commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
>>>> Author: Rusty Russell <[email protected]>
>>>> Date: Sun Oct 21 16:41:35 2007 -0700
>>>>
>>>> i386: paravirt boot sequence
>>>>
>>>>
>>> Hi Sean,
>>>
>>> Thanks for tracking this down. Can we try reverting this in pieces to
>>> see exactly what the cause was?
>>>
>>> 1) Revert arch/x86/boot/compressed/head_32.S
>>> 2) If that doesn't fix it, Try removing the 8 lines which were added to
>>> arch/x86/kernel/head_32.S
>>>
>>>
>> And the arch/x86/boot/compressed/misc_32.c change too, just in case...
>>
>
> Reverting only head_32.S makes the problem go away. I didn't try any other
> changes.
>
>
>>> I assume this is CONFIG_PARAVIRT=n?
>>>
>
> Indeed.
>
>
>> What bootloader does this platform use? I wonder if it's setting things
>> up in an unexpected way?
>>
>
> It's syslinux. That must be the culprit, considering the change to head_32.S.
>
> I'm digging in syslinux right now.

Weird. That change has been in there a fairly long time now, and lots of
people have successfully used syslinux to boot it in that time. Must be
some specific combination of factors on your platform.

I would guess the KEEP_SEGMENTS flag is being set for some reason, but
it's hard to imagine how, unless there's some memory corruption going on.

J

2008-06-16 17:10:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
>
>> What bootloader does this platform use? I wonder if it's setting things
>> up in an unexpected way?
>
> It's syslinux. That must be the culprit, considering the change to head_32.S.
>
> I'm digging in syslinux right now.
>

What makes you draw that conclusion?

-hpa

2008-06-16 17:10:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
>
>> What bootloader does this platform use? I wonder if it's setting things
>> up in an unexpected way?
>
> It's syslinux. That must be the culprit, considering the change to head_32.S.
>
> I'm digging in syslinux right now.
>

Also, what version of syslinux?

-hpa

2008-06-16 17:58:41

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Mon, Jun 16, 2008 at 10:06:30AM -0700, H. Peter Anvin wrote:
> Sean Young wrote:
> >
> >>What bootloader does this platform use? I wonder if it's setting things
> >>up in an unexpected way?
> >
> >It's syslinux. That must be the culprit, considering the change to
> >head_32.S.
> >
> >I'm digging in syslinux right now.
> >
>
> What makes you draw that conclusion?

The thinking was that syslinux sets the boot parameters which are being read
in head_32.S. I've used syslinux 3.53 but have just moved to 3.63.

That was the theory at least. I'm not sure what's going on now. I'll spend
some more time on this.

Thanks
Sean

2008-06-16 18:45:31

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
>
> The thinking was that syslinux sets the boot parameters which are being read
> in head_32.S. I've used syslinux 3.53 but have just moved to 3.63.
>
> That was the theory at least. I'm not sure what's going on now. I'll spend
> some more time on this.
>

Most of the boot parameters actually come from the real-mode boot code
(arch/x86/boot). A few of them come from the bootloader.

With a version of Syslinux that recent there shouldn't be any issues,
especially since Syslinux is, for somewhat obvious reasons, my reference
bootloader. You may want to experiment with using the optional
linux.c32 module just to see if it changes the behaviour:

boot: linux.c32 kernel options...

Or, in the configuration file:

label blah
kernel linux.c32
append kernel options...

-hpa

2008-06-30 21:53:22

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Mon, Jun 16, 2008 at 04:19:44PM +0000, Sean Young wrote:
> On Mon, Jun 16, 2008 at 06:37:12AM -0700, Jeremy Fitzhardinge wrote:
> > Rusty Russell wrote:
> > >On Monday 16 June 2008 22:11:39 Sean Young wrote:
> > >
> > >>The symptons are either a crash or reboot on booting the kernel. No
> > >>printk's have occurred yet -- even with early printk on.
> > >>
> > >>2.6.15 worked on this board however current does not. I've bisected it to:
> > >>
> > >> commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
> > >> Author: Rusty Russell <[email protected]>
> > >> Date: Sun Oct 21 16:41:35 2007 -0700
> > >>
> > >> i386: paravirt boot sequence
> > >>
> > >
> > >Hi Sean,
> > >
> > > Thanks for tracking this down. Can we try reverting this in pieces to
> > > see exactly what the cause was?
> > >
> > >1) Revert arch/x86/boot/compressed/head_32.S
> > >2) If that doesn't fix it, Try removing the 8 lines which were added to
> > >arch/x86/kernel/head_32.S
> > >
> > And the arch/x86/boot/compressed/misc_32.c change too, just in case...
>
> Reverting only head_32.S makes the problem go away. I didn't try any other
> changes.

Looking at the beginning of startup_32, it seems ds is used before it is set:

startup_32:
cld
/* test KEEP_SEGMENTS flag to see if the bootloader is asking
* us to not reload segments */
testb $(1<<6), BP_loadflags(%esi)
jnz 1f

cli
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
movl %eax,%ss
1:

Since the testb instruction is a dereference, ds is implicitly used. If
I move the testb to after "movl %eax,%ds" it seems to work (not that it
would make any sense there, but just to prove the point).

1) Am I barking up the wrong tree?

2) If I'm right I have no idea what the correct solution is; it seems that
a chicken & egg issue is introduced.

Please advise. I am very new to all of this.


Sean

2008-06-30 21:55:36

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
> On Mon, Jun 16, 2008 at 04:19:44PM +0000, Sean Young wrote:
>
>> On Mon, Jun 16, 2008 at 06:37:12AM -0700, Jeremy Fitzhardinge wrote:
>>
>>> Rusty Russell wrote:
>>>
>>>> On Monday 16 June 2008 22:11:39 Sean Young wrote:
>>>>
>>>>
>>>>> The symptons are either a crash or reboot on booting the kernel. No
>>>>> printk's have occurred yet -- even with early printk on.
>>>>>
>>>>> 2.6.15 worked on this board however current does not. I've bisected it to:
>>>>>
>>>>> commit a24e785111a32ccb7cebafd24b1b1cb474ea8e5d
>>>>> Author: Rusty Russell <[email protected]>
>>>>> Date: Sun Oct 21 16:41:35 2007 -0700
>>>>>
>>>>> i386: paravirt boot sequence
>>>>>
>>>>>
>>>> Hi Sean,
>>>>
>>>> Thanks for tracking this down. Can we try reverting this in pieces to
>>>> see exactly what the cause was?
>>>>
>>>> 1) Revert arch/x86/boot/compressed/head_32.S
>>>> 2) If that doesn't fix it, Try removing the 8 lines which were added to
>>>> arch/x86/kernel/head_32.S
>>>>
>>>>
>>> And the arch/x86/boot/compressed/misc_32.c change too, just in case...
>>>
>> Reverting only head_32.S makes the problem go away. I didn't try any other
>> changes.
>>
>
> Looking at the beginning of startup_32, it seems ds is used before it is set:
>
> startup_32:
> cld
> /* test KEEP_SEGMENTS flag to see if the bootloader is asking
> * us to not reload segments */
> testb $(1<<6), BP_loadflags(%esi)
> jnz 1f
>
> cli
> movl $(__BOOT_DS),%eax
> movl %eax,%ds
> movl %eax,%es
> movl %eax,%fs
> movl %eax,%gs
> movl %eax,%ss
> 1:
>
> Since the testb instruction is a dereference, ds is implicitly used. If
> I move the testb to after "movl %eax,%ds" it seems to work (not that it
> would make any sense there, but just to prove the point).
>
> 1) Am I barking up the wrong tree?
>
> 2) If I'm right I have no idea what the correct solution is; it seems that
> a chicken & egg issue is introduced.
>
> Please advise. I am very new to all of this.

It's a bit odd that the boot loader neglected to set up ds properly, but
changing the testb line to

testb $(1<<6), %cs:BP_loadflags(%esi)

should work. (Or perhaps a %ss: override would be better?)

I'm assuming that the GDT setup isn't completely mad and that the
segments have the same base at least.

J

2008-06-30 22:08:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Jeremy Fitzhardinge wrote:
>>
>> Looking at the beginning of startup_32, it seems ds is used before it
>> is set:
>>
>> startup_32:
>> cld
>> /* test KEEP_SEGMENTS flag to see if the bootloader is asking
>> * us to not reload segments */
>> testb $(1<<6), BP_loadflags(%esi)
>> jnz 1f
>>
>> cli
>> movl $(__BOOT_DS),%eax
>> movl %eax,%ds
>> movl %eax,%es
>> movl %eax,%fs
>> movl %eax,%gs
>> movl %eax,%ss
>> 1:
>>
>> Since the testb instruction is a dereference, ds is implicitly used. If
>> I move the testb to after "movl %eax,%ds" it seems to work (not that it
>> would make any sense there, but just to prove the point).
>>
>> 1) Am I barking up the wrong tree?
>>
>> 2) If I'm right I have no idea what the correct solution is; it seems
>> that
>> a chicken & egg issue is introduced.
>>
>> Please advise. I am very new to all of this.
>
> It's a bit odd that the boot loader neglected to set up ds properly, but
> changing the testb line to
>
> testb $(1<<6), %cs:BP_loadflags(%esi)
>
> should work. (Or perhaps a %ss: override would be better?)
>
> I'm assuming that the GDT setup isn't completely mad and that the
> segments have the same base at least.
>

This should have been set up by the *boot code* (specifically lines
57-61 of arch/x86/boot/pmjump.S) since he's using a conventional boot
loader (syslinux) so something is utterly fuggled up. Using %cs: here
should be safe, though (and *is* more conservative, after all, why
otherwise bother reloading these segments at all?), but it still
concerns me a great deal if this is broken in this way. It's definitely
better than %ss:.

In particular, I'm wondering if the Elan CPU has any strange ordering
requirements with regards to the protected mode transition that we're
not obeying.

It would be interesting to put in a heavyweight "brutally synchronizing"
instruction like WBINVD at various places in arch/x86/boot/pmjump.S and
see if it helps.

-hpa

2008-06-30 22:09:51

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Mon, Jun 30, 2008 at 02:52:10PM -0700, Jeremy Fitzhardinge wrote:
> It's a bit odd that the boot loader neglected to set up ds properly, but
> changing the testb line to
>
> testb $(1<<6), %cs:BP_loadflags(%esi)
>
> should work. (Or perhaps a %ss: override would be better?)

Yup, it does (%cs: prefix).

> I'm assuming that the GDT setup isn't completely mad and that the
> segments have the same base at least.

Where do I go from here?

Thanks
Sean

2008-06-30 22:18:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

H. Peter Anvin wrote:
>>>
>>> startup_32:
>>> cld
>>> /* test KEEP_SEGMENTS flag to see if the bootloader is asking
>>> * us to not reload segments */
>>> testb $(1<<6), BP_loadflags(%esi)
>>> jnz 1f
>>>
>>> cli
>>> movl $(__BOOT_DS),%eax
>>> movl %eax,%ds
>>> movl %eax,%es
>>> movl %eax,%fs
>>> movl %eax,%gs
>>> movl %eax,%ss
>>> 1:

On this general subject... I keep thinking that it would be better to
have this as:

movl %cs, %eax
addl $8, %eax
movl %eax, %cs

... instead of a hard-coded constant. That actually removes all
hard-coded uses of BOOT_CS/BOOT_DS until we eventually load the kernel's
own boot GDT at head_32.S:94.

Does anyone see any problem with that? As far as I can tell, we're
requiring %cs == BOOT_CS for the current code anyway (unless
KEEP_SEGMENTS), but %ds == %cs + 8 seems like a more sensible requirement.

-hpa

2008-06-30 22:31:30

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Mon, Jun 30, 2008 at 03:04:33PM -0700, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> >It's a bit odd that the boot loader neglected to set up ds properly, but
> >changing the testb line to
> >
> > testb $(1<<6), %cs:BP_loadflags(%esi)
> >
> >should work. (Or perhaps a %ss: override would be better?)
> >
> >I'm assuming that the GDT setup isn't completely mad and that the
> >segments have the same base at least.
> >
>
> This should have been set up by the *boot code* (specifically lines
> 57-61 of arch/x86/boot/pmjump.S) since he's using a conventional boot
> loader (syslinux) so something is utterly fuggled up. Using %cs: here
> should be safe, though (and *is* more conservative, after all, why
> otherwise bother reloading these segments at all?), but it still
> concerns me a great deal if this is broken in this way. It's definitely
> better than %ss:.
>
> In particular, I'm wondering if the Elan CPU has any strange ordering
> requirements with regards to the protected mode transition that we're
> not obeying.
>
> It would be interesting to put in a heavyweight "brutally synchronizing"
> instruction like WBINVD at various places in arch/x86/boot/pmjump.S and
> see if it helps.

You are right. With wbinvd before %ds is changed in that file the kernel
boots.


Sean

2008-06-30 22:35:03

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>>
>>> Looking at the beginning of startup_32, it seems ds is used before
>>> it is set:
>>>
>>> startup_32:
>>> cld
>>> /* test KEEP_SEGMENTS flag to see if the bootloader is asking
>>> * us to not reload segments */
>>> testb $(1<<6), BP_loadflags(%esi)
>>> jnz 1f
>>>
>>> cli
>>> movl $(__BOOT_DS),%eax
>>> movl %eax,%ds
>>> movl %eax,%es
>>> movl %eax,%fs
>>> movl %eax,%gs
>>> movl %eax,%ss
>>> 1:
>>>
>>> Since the testb instruction is a dereference, ds is implicitly used. If
>>> I move the testb to after "movl %eax,%ds" it seems to work (not that it
>>> would make any sense there, but just to prove the point).
>>>
>>> 1) Am I barking up the wrong tree?
>>>
>>> 2) If I'm right I have no idea what the correct solution is; it
>>> seems that
>>> a chicken & egg issue is introduced.
>>>
>>> Please advise. I am very new to all of this.
>>
>> It's a bit odd that the boot loader neglected to set up ds properly,
>> but changing the testb line to
>>
>> testb $(1<<6), %cs:BP_loadflags(%esi)
>>
>> should work. (Or perhaps a %ss: override would be better?)
>>
>> I'm assuming that the GDT setup isn't completely mad and that the
>> segments have the same base at least.
>>
>
> This should have been set up by the *boot code* (specifically lines
> 57-61 of arch/x86/boot/pmjump.S) since he's using a conventional boot
> loader (syslinux) so something is utterly fuggled up.

Hm, yeah.

> Using %cs: here should be safe, though (and *is* more conservative,
> after all, why otherwise bother reloading these segments at all?), but
> it still concerns me a great deal if this is broken in this way. It's
> definitely better than %ss:.
>
> In particular, I'm wondering if the Elan CPU has any strange ordering
> requirements with regards to the protected mode transition that we're
> not obeying.

Maybe it really does require the far jump immediately after setting PE
in cr0...

Hm, I don't remember this paragraph being in vol 3a, section 8.9.1
before. Is it a recent addition?

Random failures can occur if other instructions exist between steps
3 and 4 above. Failures will be readily seen in some situations,
such as when instructions that reference memory are inserted between
steps 3 and 4 while in system management mode.



J

2008-06-30 22:36:20

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

H. Peter Anvin wrote:
> H. Peter Anvin wrote:
>>>>
>>>> startup_32:
>>>> cld
>>>> /* test KEEP_SEGMENTS flag to see if the bootloader is asking
>>>> * us to not reload segments */
>>>> testb $(1<<6), BP_loadflags(%esi)
>>>> jnz 1f
>>>>
>>>> cli
>>>> movl $(__BOOT_DS),%eax
>>>> movl %eax,%ds
>>>> movl %eax,%es
>>>> movl %eax,%fs
>>>> movl %eax,%gs
>>>> movl %eax,%ss
>>>> 1:
>
> On this general subject... I keep thinking that it would be better to
> have this as:
>
> movl %cs, %eax
> addl $8, %eax
> movl %eax, %cs

What's your intent? You can't directly load %cs; do you mean the other
segment registers?

>
> ... instead of a hard-coded constant. That actually removes all
> hard-coded uses of BOOT_CS/BOOT_DS until we eventually load the
> kernel's own boot GDT at head_32.S:94.
>
> Does anyone see any problem with that? As far as I can tell, we're
> requiring %cs == BOOT_CS for the current code anyway (unless
> KEEP_SEGMENTS), but %ds == %cs + 8 seems like a more sensible
> requirement.
>

Oh, you mean

mov %cs, %eax
add $8, %eax
mov %eax, %ds (etc)
?

Seems reasonable.

J

2008-06-30 22:46:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S
index ab049d4..141b6e2 100644
--- a/arch/x86/boot/pmjump.S
+++ b/arch/x86/boot/pmjump.S
@@ -33,6 +33,8 @@ protected_mode_jump:
movw %cs, %bx
shll $4, %ebx
addl %ebx, 2f
+ jmp 1f # Short jump to serialize on 386/486
+1:

movw $__BOOT_DS, %cx
movw $__BOOT_TSS, %di
@@ -40,8 +42,6 @@ protected_mode_jump:
movl %cr0, %edx
orb $X86_CR0_PE, %dl # Protected mode
movl %edx, %cr0
- jmp 1f # Short jump to serialize on 386/486
-1:

# Transition to 32-bit mode
.byte 0x66, 0xea # ljmpl opcode


Attachments:
diff (571.00 B)

2008-06-30 22:49:07

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Jeremy Fitzhardinge wrote:
>>
>> On this general subject... I keep thinking that it would be better to
>> have this as:
>>
>> movl %cs, %eax
>> addl $8, %eax
>> movl %eax, %cs
>
> What's your intent? You can't directly load %cs; do you mean the other
> segment registers?
>
>
> Oh, you mean
>
> mov %cs, %eax
> add $8, %eax
> mov %eax, %ds (etc)
> ?
>
> Seems reasonable.
>

Uhm, yeah ;)

-hpa

2008-07-01 19:59:53

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Mon, Jun 30, 2008 at 03:42:47PM -0700, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> >
> >Maybe it really does require the far jump immediately after setting PE
> >in cr0...
> >
> >Hm, I don't remember this paragraph being in vol 3a, section 8.9.1
> >before. Is it a recent addition?
> >
> > Random failures can occur if other instructions exist between steps
> > 3 and 4 above. Failures will be readily seen in some situations,
> > such as when instructions that reference memory are inserted between
> > steps 3 and 4 while in system management mode.
> >
>
> I don't remember that, either.

Which document are we talking about?

> Sean: could you try the following patch?
>
> -hpa

> diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S
> index ab049d4..141b6e2 100644
> --- a/arch/x86/boot/pmjump.S
> +++ b/arch/x86/boot/pmjump.S
> @@ -33,6 +33,8 @@ protected_mode_jump:
> movw %cs, %bx
> shll $4, %ebx
> addl %ebx, 2f
> + jmp 1f # Short jump to serialize on 386/486
> +1:
>
> movw $__BOOT_DS, %cx
> movw $__BOOT_TSS, %di
> @@ -40,8 +42,6 @@ protected_mode_jump:
> movl %cr0, %edx
> orb $X86_CR0_PE, %dl # Protected mode
> movl %edx, %cr0
> - jmp 1f # Short jump to serialize on 386/486
> -1:
>
> # Transition to 32-bit mode
> .byte 0x66, 0xea # ljmpl opcode

I'm afraid it doesn't work. Maybe I can find something in the AMD Elan
documentation. Would a fence make sense?

Thanks
Sean

2008-07-01 20:20:28

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Tue, Jul 01, 2008 at 07:59:42PM +0000, Sean Young wrote:
> On Mon, Jun 30, 2008 at 03:42:47PM -0700, H. Peter Anvin wrote:
> > diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S
> > index ab049d4..141b6e2 100644
> > --- a/arch/x86/boot/pmjump.S
> > +++ b/arch/x86/boot/pmjump.S
> > @@ -33,6 +33,8 @@ protected_mode_jump:
> > movw %cs, %bx
> > shll $4, %ebx
> > addl %ebx, 2f
> > + jmp 1f # Short jump to serialize on 386/486
> > +1:
> >
> > movw $__BOOT_DS, %cx
> > movw $__BOOT_TSS, %di
> > @@ -40,8 +42,6 @@ protected_mode_jump:
> > movl %cr0, %edx
> > orb $X86_CR0_PE, %dl # Protected mode
> > movl %edx, %cr0
> > - jmp 1f # Short jump to serialize on 386/486
> > -1:
> >
> > # Transition to 32-bit mode
> > .byte 0x66, 0xea # ljmpl opcode
>
> I'm afraid it doesn't work. Maybe I can find something in the AMD Elan
> documentation. Would a fence make sense?

I was just trying to confirm this with postive and negative tests which
is failing. I'm probably being an idiot, please let systematically test
this.

This does not seem like the heisenbug category, or is it?


Sean

2008-07-01 20:23:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
> On Mon, Jun 30, 2008 at 03:42:47PM -0700, H. Peter Anvin wrote:
>
>> Jeremy Fitzhardinge wrote:
>>
>>> Maybe it really does require the far jump immediately after setting PE
>>> in cr0...
>>>
>>> Hm, I don't remember this paragraph being in vol 3a, section 8.9.1
>>> before. Is it a recent addition?
>>>
>>> Random failures can occur if other instructions exist between steps
>>> 3 and 4 above. Failures will be readily seen in some situations,
>>> such as when instructions that reference memory are inserted between
>>> steps 3 and 4 while in system management mode.
>>>
>>>
>> I don't remember that, either.
>>
>
> Which document are we talking about?
>

*The* document ;)
http://www.intel.com/products/processor/manuals/ specifically, Volume
3a: System Programming Guide, Part 1.

Section 8.9.1 describes the steps needed to turn on protected mode
correctly. It says that you need to do a far jump or call immediately
after turning on protected mode. Linux has not done it immediately, and
there has been a school of thought that this advice is a workaround for
some obsolete CPU, and is not something we have to worry about now.

However, the paragraph I quoted was added since the previous release of
the manual, and so presumably documents a current concern.
Specifically, the mention of SMM is interesting, because I gather that
embedded-class processors like the Elan are very SMM-dependent.

>> Sean: could you try the following patch?
>>
>> -hpa
>>
>
>
>> diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S
>> index ab049d4..141b6e2 100644
>> --- a/arch/x86/boot/pmjump.S
>> +++ b/arch/x86/boot/pmjump.S
>> @@ -33,6 +33,8 @@ protected_mode_jump:
>> movw %cs, %bx
>> shll $4, %ebx
>> addl %ebx, 2f
>> + jmp 1f # Short jump to serialize on 386/486
>> +1:
>>
>> movw $__BOOT_DS, %cx
>> movw $__BOOT_TSS, %di
>> @@ -40,8 +42,6 @@ protected_mode_jump:
>> movl %cr0, %edx
>> orb $X86_CR0_PE, %dl # Protected mode
>> movl %edx, %cr0
>> - jmp 1f # Short jump to serialize on 386/486
>> -1:
>>
>> # Transition to 32-bit mode
>> .byte 0x66, 0xea # ljmpl opcode
>>
>
> I'm afraid it doesn't work. Maybe I can find something in the AMD Elan
> documentation. Would a fence make sense?
>

Not really, but if it fixes the bug it won't hurt anyone else (unless
older processors treat it as an illegal instruction).

J

2008-07-01 20:25:00

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
> I was just trying to confirm this with postive and negative tests which
> is failing. I'm probably being an idiot, please let systematically test
> this.
>
> This does not seem like the heisenbug category, or is it?
>

Possibly. Does the platform use system-management mode? If it does, it
may depend on whether you get an SMM interrupt in a critical window.

J

2008-07-01 20:27:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
> On Mon, Jun 30, 2008 at 03:42:47PM -0700, H. Peter Anvin wrote:
>> Jeremy Fitzhardinge wrote:
>>> Maybe it really does require the far jump immediately after setting PE
>>> in cr0...
>>>
>>> Hm, I don't remember this paragraph being in vol 3a, section 8.9.1
>>> before. Is it a recent addition?
>>>
>>> Random failures can occur if other instructions exist between steps
>>> 3 and 4 above. Failures will be readily seen in some situations,
>>> such as when instructions that reference memory are inserted between
>>> steps 3 and 4 while in system management mode.
>>>
>> I don't remember that, either.
>
> Which document are we talking about?
>

Intel's Software Developer's Manual, volume 3a.

> I'm afraid it doesn't work. Maybe I can find something in the AMD Elan
> documentation. Would a fence make sense?

Not really. In particular, I really don't want to add something that is
even remotely likely to break on other platforms, and a fence
instruction is relatively recent.

-hpa

2008-07-01 20:29:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
>
> I was just trying to confirm this with postive and negative tests which
> is failing. I'm probably being an idiot, please let systematically test
> this.
>
> This does not seem like the heisenbug category, or is it?
>

It could easily be something timing-dependent, which might affect only X
boots out of Y. Please keep us posted on your testing progress.

It would also be very interesting if you put, say, a WBINVD at the point
of label 1:

-hpa

2008-07-01 20:29:23

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

H. Peter Anvin wrote:
> Not really. In particular, I really don't want to add something that
> is even remotely likely to break on other platforms, and a fence
> instruction is relatively recent.

I wonder if a cpuid would do the trick? Do we still support cpuid-less
processors?

J

2008-07-01 20:42:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Not really. In particular, I really don't want to add something that
>> is even remotely likely to break on other platforms, and a fence
>> instruction is relatively recent.
>
> I wonder if a cpuid would do the trick? Do we still support cpuid-less
> processors?

Yes, we do. Worst case, we could bypass it, but I'd like to understand
what the real constraint is, here.

-hpa

2008-07-06 02:51:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young <[email protected]> writes:

> On Mon, Jun 30, 2008 at 02:52:10PM -0700, Jeremy Fitzhardinge wrote:
>> It's a bit odd that the boot loader neglected to set up ds properly, but
>> changing the testb line to
>>
>> testb $(1<<6), %cs:BP_loadflags(%esi)
>>
>> should work. (Or perhaps a %ss: override would be better?)
>
> Yup, it does (%cs: prefix).

Guys I have a fuzzy memory of reads and definitely writes through %cs not working,
on some processor models. i386?

If that hadn't been the case I would have used that form more to remove the
dependencies before we reached lgdt in haed.S when I was in there years ago.

Eric

2008-07-06 03:14:55

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Eric W. Biederman wrote:
> Sean Young <[email protected]> writes:
>
>> On Mon, Jun 30, 2008 at 02:52:10PM -0700, Jeremy Fitzhardinge wrote:
>>> It's a bit odd that the boot loader neglected to set up ds properly, but
>>> changing the testb line to
>>>
>>> testb $(1<<6), %cs:BP_loadflags(%esi)
>>>
>>> should work. (Or perhaps a %ss: override would be better?)
>> Yup, it does (%cs: prefix).
>
> Guys I have a fuzzy memory of reads and definitely writes through %cs not working,
> on some processor models. i386?
>

You can never write through %cs in protected mode.

You can read through %cs: in protected mode unless %cs contains an
execute-only descriptor.

-hpa

2008-07-06 13:21:13

by Pavel Machek

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Tue 2008-07-01 13:29:01, Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Not really. In particular, I really don't want to add something that is
>> even remotely likely to break on other platforms, and a fence instruction
>> is relatively recent.
>
> I wonder if a cpuid would do the trick? Do we still support cpuid-less
> processors?

We still support 386~~, AFAICT.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-15 22:24:17

by Sean Young

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

On Tue, Jul 01, 2008 at 08:20:14PM +0000, Sean Young wrote:
> On Tue, Jul 01, 2008 at 07:59:42PM +0000, Sean Young wrote:
> > On Mon, Jun 30, 2008 at 03:42:47PM -0700, H. Peter Anvin wrote:
> > > diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S
> > > index ab049d4..141b6e2 100644
> > > --- a/arch/x86/boot/pmjump.S
> > > +++ b/arch/x86/boot/pmjump.S
> > > @@ -33,6 +33,8 @@ protected_mode_jump:
> > > movw %cs, %bx
> > > shll $4, %ebx
> > > addl %ebx, 2f
> > > + jmp 1f # Short jump to serialize on 386/486
> > > +1:
> > >
> > > movw $__BOOT_DS, %cx
> > > movw $__BOOT_TSS, %di
> > > @@ -40,8 +42,6 @@ protected_mode_jump:
> > > movl %cr0, %edx
> > > orb $X86_CR0_PE, %dl # Protected mode
> > > movl %edx, %cr0
> > > - jmp 1f # Short jump to serialize on 386/486
> > > -1:
> > >
> > > # Transition to 32-bit mode
> > > .byte 0x66, 0xea # ljmpl opcode
> >
> > I'm afraid it doesn't work. Maybe I can find something in the AMD Elan
> > documentation. Would a fence make sense?
>
> I was just trying to confirm this with postive and negative tests which
> is failing. I'm probably being an idiot, please let systematically test
> this.

When I bisected this issue between latest and 2.6.23 the issue seemed to be
here (round about 2.6.24-rc1). Trying to repeat the same things on latest
just led to confusion.

So I've started from scratch with contemporary trees and found something
very different.

2.6.25: works
2.6.25.5: works
2.6.25.6: broken
2.6.25.10: broken
2.6.26: works

Now the issue which was introduced in 2.6.25.6 is that a rdtsc will be done
even though this CPU does not support it (before console_init(), leaving
me with only one led for output!). This issue has been fixed in 2.6.26 but
not in 2.6.25-stable.

The protected mode jump was just a red herring (or at least whatever was
broken back then is fixed now); 2.6.26 is working fine and that's good
enough for me.

Thanks for all the help,

Sean

2008-07-15 22:36:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Regression: boot failure on AMD Elan TS-5500

Sean Young wrote:
>
> When I bisected this issue between latest and 2.6.23 the issue seemed to be
> here (round about 2.6.24-rc1). Trying to repeat the same things on latest
> just led to confusion.
>
> So I've started from scratch with contemporary trees and found something
> very different.
>
> 2.6.25: works
> 2.6.25.5: works
> 2.6.25.6: broken
> 2.6.25.10: broken
> 2.6.26: works
>
> Now the issue which was introduced in 2.6.25.6 is that a rdtsc will be done
> even though this CPU does not support it (before console_init(), leaving
> me with only one led for output!). This issue has been fixed in 2.6.26 but
> not in 2.6.25-stable.
>
> The protected mode jump was just a red herring (or at least whatever was
> broken back then is fixed now); 2.6.26 is working fine and that's good
> enough for me.
>

Okay, that is good to know. We did move the jmp in pmjump.S for 2.6.26;
even if it ended up being redundant I have to say I'm happier following
the documented script, especially given the apparent recent
documentation change on Intel's part about this.

-hpa