2004-09-16 17:54:18

by Stas Sergeev

[permalink] [raw]
Subject: ESP corruption bug - what CPUs are affected?

Hello.

My question is not directly linux-related,
but I think people here can give me some clue
on an issue.

There is a "semi-official" bug in Intel CPUs,
which is described here:
http://www.intel.com/design/intarch/specupdt/27287402.PDF
chapter "Specification Clarifications"
section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".

A short quote:
"ISSUE: When a 32-bit IRET is used to return to another privilege level,
and the old level uses a 4G stack (D/B bit in the segment register = 1),
while the new level uses a 64k stack (D/B bit = 0), then only the
lower word of ESP is updated."

Which means that the higher word of ESP gets
trashed. This bug beats dosemu rather badly,
but there seem to be not much info about that
bug available on the net.
What I want to find out, is what CPUs are
affected. I wrote a small test-case. It is
attached. I tried it on Intel Pentium and
on AMD Athlon - bug is present on both. But
I'd like to know if it is also present on a
Transmeta Crusoe, Cyrix and other CPUs.
Can someone please run the test-case on some
of that CPUs and see how it goes?
Note: specifying "32" as an arg to the
test-case, will make it to use the 32bit
segment for the stack, and it will not detect
the bug. It is there to prove that it detects
the Right Thing. Run it without that argument.

I'd also like to know any thoughts on whether
it is possible to work around the bug, probably
in a kernel? Well, I am not hoping on such a
possibility, but who knows...
Anyway, I'd be glad to get any info on that bug.
Why it was not fixed for so many years, looks
also like an interesting question, as for me.


Attachments:
stk.c (2.57 kB)

2004-09-16 18:41:39

by Petr Vandrovec

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On 16 Sep 04 at 21:49, Stas Sergeev wrote:
>
> There is a "semi-official" bug in Intel CPUs,
> which is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
> chapter "Specification Clarifications"
> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".

Not a bug, but a feature.

> What I want to find out, is what CPUs are
> affected. I wrote a small test-case. It is
> attached. I tried it on Intel Pentium and
> on AMD Athlon - bug is present on both. But
> I'd like to know if it is also present on a
> Transmeta Crusoe, Cyrix and other CPUs.

AFAIK all. There are products which depend on this behavior.

> I'd also like to know any thoughts on whether
> it is possible to work around the bug, probably
> in a kernel? Well, I am not hoping on such a
> possibility, but who knows...

IMHO you have to switch to 16bit stack, load upper bits of ESP
with target value, and then execute IRET, while 16bit SS:SP points
to same place where flat ESP pointed.

This way IRET, as stack is 16bit, uses only SP instead of ESP,
and so you can preload upper bits of ESP before executing IRET.

And I do not think that linux NMI handler survives 16bit stack, so
natural solution seems to be to create complete 16bit CPL1
environment, return to it, load ESP as you want, and then do IRET
to return to CPL2 or CPL3. Fortunately V8086 mode is not affected,
so there should be no problem with using CPL1 for this middle step.
But of course it is not something you want to do on each return
from interrupt handler... Well, or maybe you want...

> Anyway, I'd be glad to get any info on that bug.
> Why it was not fixed for so many years, looks
> also like an interesting question, as for me.

It is part of architecture now...
Petr Vandrovec


2004-09-16 19:03:32

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Thursday 16 September 2004 20:49, Stas Sergeev wrote:
> Hello.
>
> My question is not directly linux-related,
> but I think people here can give me some clue
> on an issue.
>
> There is a "semi-official" bug in Intel CPUs,
> which is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
> chapter "Specification Clarifications"
> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
>
> A short quote:
> "ISSUE: When a 32-bit IRET is used to return to another privilege level,
> and the old level uses a 4G stack (D/B bit in the segment register = 1),
> while the new level uses a 64k stack (D/B bit = 0), then only the
> lower word of ESP is updated."
>
> Which means that the higher word of ESP gets
> trashed. This bug beats dosemu rather badly,
> but there seem to be not much info about that
> bug available on the net.

Well. Strictly speaking it's not a bug.
Code executes as it should. IRET returns to the correct
location. Upper 16 bits of ESP do not affect execution
in this case.

You want CPU to preserve upper bits,
but this will waste a handful of transistors
practically for no gain.

I think dosemu should just work around this.
Is there some subtle problem with that in dosemu?

> What I want to find out, is what CPUs are
> affected. I wrote a small test-case. It is
> attached. I tried it on Intel Pentium and
> on AMD Athlon - bug is present on both. But
> I'd like to know if it is also present on a
> Transmeta Crusoe, Cyrix and other CPUs.
> Can someone please run the test-case on some
> of that CPUs and see how it goes?
> Note: specifying "32" as an arg to the
> test-case, will make it to use the 32bit
> segment for the stack, and it will not detect
> the bug. It is there to prove that it detects
> the Right Thing. Run it without that argument.
>
> I'd also like to know any thoughts on whether
> it is possible to work around the bug, probably
> in a kernel? Well, I am not hoping on such a
> possibility, but who knows...
> Anyway, I'd be glad to get any info on that bug.
> Why it was not fixed for so many years, looks
> also like an interesting question, as for me.

Because it does not matter for 99.9999% of users...

# gcc -O2 stk.c
# ./a.out
sp=0xbffffb30, ss=0x7b
In sighandler: esp=bffffb10
Now sp=0xccf1fb10, ss=0x7f
Esp changed, strange...
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : Unknown CPU Typ
stepping : 1
cpu MHz : 1743.632
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
--
vda

2004-09-17 18:11:37

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hello.

Petr Vandrovec wrote:
>> There is a "semi-official" bug in Intel CPUs,
>> which is described here:
>> http://www.intel.com/design/intarch/specupdt/27287402.PDF
>> chapter "Specification Clarifications"
>> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
> Not a bug, but a feature.
Yep, one of those features, that you can
neither disable, nor (speaking about dosemu)
work around.

>> What I want to find out, is what CPUs are
>> affected. I wrote a small test-case. It is
>> attached. I tried it on Intel Pentium and
>> on AMD Athlon - bug is present on both. But
>> I'd like to know if it is also present on a
>> Transmeta Crusoe, Cyrix and other CPUs.
> AFAIK all.
I expected that. My last hopes were on Cyrix, which,
at least as I've heard, doesn't follow the Intel's
specs very strictly. So I thought maybe they have
not duplicated that "feature".

> There are products which depend on this behavior.
Out of curiosity, what are those products? I
think it is pretty much brain-damaged to depend
on a ESP corruption.

>> I'd also like to know any thoughts on whether
>> it is possible to work around the bug, probably
>> in a kernel?
> IMHO you have to switch to 16bit stack, load upper bits of ESP
> with target value, and then execute IRET, while 16bit SS:SP points
> to same place where flat ESP pointed.
Hmm, that sounds feasible.

> And I do not think that linux NMI handler survives 16bit stack, so
> natural solution seems to be to create complete 16bit CPL1
> environment, return to it, load ESP as you want, and then do IRET
> to return to CPL2 or CPL3. Fortunately V8086 mode is not affected,
> so there should be no problem with using CPL1 for this middle step.
> But of course it is not something you want to do on each return
> from interrupt handler... Well, or maybe you want...
Umm, no. But well, the spec claims that this
happens only with iret. It doesn't say that it
is a general problem with switching between a
different protection rings. And it seems, for
example, that WinNT/XP do not have that problem.
I.e. the DOS progs that get a Stack Fault under
dosemu because ESP points to the kernel space, can
actually work under WinXP. Maybe they are using
some other technique to return to Ring3, the one
that doesn't trigger the bug at all?

>> Anyway, I'd be glad to get any info on that bug.
>> Why it was not fixed for so many years, looks
>> also like an interesting question, as for me.
> It is part of architecture now...
How could that happen? Was it so difficult to just
fix? Oh well, perhaps it was...

2004-09-17 18:13:40

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hello.

Denis Vlasenko wrote:
>> Which means that the higher word of ESP gets
>> trashed. This bug beats dosemu rather badly,
>> but there seem to be not much info about that
>> bug available on the net.
> Well. Strictly speaking it's not a bug.
> Code executes as it should. IRET returns to the correct
> location. Upper 16 bits of ESP do not affect execution
> in this case.
Unfortunately this is not true. Mainly because
the code is still 32bit, so when is operates with
the stack pointer directly, it gets ESP, not SP.
Here are a few examples from the real DOS progs:
---
frstor [esp] <--crash
---
push ebp
mov ebp,esp
mov edx,[ebp+0x8] <--crash
---
and there are much more. "mov ebp,esp" is just a
standard thing, and then the ebp is being used to
access the locals and function arguments. Why do
they use the 16bit stack in that case, is another
question. Unfortunately some of them do.

> You want CPU to preserve upper bits,
> but this will waste a handful of transistors
> practically for no gain.
I don't think so. After all, if only the B bit is
set, this is being done. So I guess all the transistors
are in place (well, of course this is just a wild
guess).

> I think dosemu should just work around this.
I see no way to work around this in user-space.
Esp. since dosemu doesn't control the execution
completely, it acts as a monitor, only getting a
control when exceptions occur. For dosemu code
there are no problems, but the DOS program that
is running, crashes.

> Is there some subtle problem with that in dosemu?
Nothing subtle I think. The problem is real and
quite hurts. Fortunately it happened that using
the dos32a extender instead of dos4gw "fixes" it
for the dos4gw-based apps. But users wants
everything out of the box, and never read docs or
even the error messages in a log, so this doesn't
really help:) And there are some not dos4gw-based
progs that are affected either.

>> Why it was not fixed for so many years, looks
>> also like an interesting question, as for me.
> Because it does not matter for 99.9999% of users...
Aparently you are underestimating the amount of
the dosemu users out there:) AFAIK it is also a
problem for Win95 DOS box sometimes.

> # gcc -O2 stk.c
> # ./a.out
> sp=0xbffffb30, ss=0x7b
> In sighandler: esp=bffffb10
> Now sp=0xccf1fb10, ss=0x7f
> Esp changed, strange...
That's strange indeed! Apparently your ESP value,
0xccf1fb10, is greater than 0xc0000000, in which
case my program should just write "BUG!", but for
you it doesn't. Haven't you altered the code
somehow? Why I compare it with 0xc0000000, is
because ESP has the higher word of a kernel's stack
address, so it should be above the TASK_SIZE when
corrupted.

2004-09-17 22:10:11

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Friday 17 September 2004 21:13, Stas Sergeev wrote:
> Hello.
>
> Denis Vlasenko wrote:
> >> Which means that the higher word of ESP gets
> >> trashed. This bug beats dosemu rather badly,
> >> but there seem to be not much info about that
> >> bug available on the net.
> >
> > Well. Strictly speaking it's not a bug.
> > Code executes as it should. IRET returns to the correct
> > location. Upper 16 bits of ESP do not affect execution
> > in this case.
>
> Unfortunately this is not true. Mainly because
> the code is still 32bit, so when is operates with
> the stack pointer directly, it gets ESP, not SP.
> Here are a few examples from the real DOS progs:
> ---
> frstor [esp] <--crash
> ---
> push ebp
> mov ebp,esp
> mov edx,[ebp+0x8] <--crash
> ---

Hmmm. Let me dust off my i386 knwoledge.

This is what happens: CPU is doing IRET in 32bit mode.
Stack layout:

+20 (empty) <--- SS:ESP
+16 old_CS
+12 old_EIP
+8 old_EFLAGS
+4 old_SS
+0 old_ESP

If old_SS references 'small' data segment (16-bit one),
processor does not restore ESP from old_ESP.
It restores lower 16 bits only. Upper 16 bits are filled
with garbage (or just not modified at all?).
This works fine because processor uses SP, not ESP,
for subsequent push/pop/call/ret operations.
But if code uses full ESP, thinking that upper 16 bits are zero,
it will crash badly. Correct me if I'm wrong.

> and there are much more. "mov ebp,esp" is just a
> standard thing, and then the ebp is being used to
> access the locals and function arguments. Why do
> they use the 16bit stack in that case, is another
> question. Unfortunately some of them do.

Hmm. I think you need to *emulate* this IRET.
Is this IRET happens in kernel or in dosemu code?

> > # gcc -O2 stk.c
> > # ./a.out
> > sp=0xbffffb30, ss=0x7b
> > In sighandler: esp=bffffb10
> > Now sp=0xccf1fb10, ss=0x7f
> > Esp changed, strange...
>
> That's strange indeed! Apparently your ESP value,
> 0xccf1fb10, is greater than 0xc0000000, in which
> case my program should just write "BUG!", but for
> you it doesn't. Haven't you altered the code
> somehow?

No, I didn't alter anything. Strange indeed.

> Why I compare it with 0xc0000000, is
> because ESP has the higher word of a kernel's stack
> address, so it should be above the TASK_SIZE when
> corrupted.

Even more strange! Now I did modify the code (attached),
and this is what I've got:

# gcc -o stk2 -O2 stk2.c
# ./stk2
sp=0xbffffb30, ss=0x7b
In sighandler: esp=bffffb10
Now sp=cd79fb10 <=========
Now sp=bffffb50 <========= !?
Now sp=bffffb50, ss=007f
Now sp=bffffb50
Now sp=bffffb50
Now sp=bffffb50
Esp changed, strange...

I am thoroughly confused now. Two back-to-back
'printf("Now sp=%08lx\n", new_esp);'
gave different result. How this can happen??

Corresponding gcc -S code. new_esp is at -180(%ebp):

#APP
movw -182(%ebp), %ss
hlt
movw -170(%ebp), %ss
movl %esp, -180(%ebp)
movl -176(%ebp), %esp

#NO_APP
addl $40, %esp
pushl -180(%ebp)
pushl $.LC4
call printf
popl %eax
popl %edx
pushl -180(%ebp)
pushl $.LC4
call printf
addl $12, %esp
movzwl -182(%ebp), %eax
pushl %eax
pushl -180(%ebp)
pushl $.LC5
call printf
popl %ecx
popl %ebx
pushl -180(%ebp)
pushl $.LC4
call printf
popl %eax
popl %edx

--
vda


Attachments:
(No filename) (3.38 kB)
stk2.c (2.83 kB)
Download all attachments

2004-09-18 10:58:04

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hello.

Denis Vlasenko wrote:
>> This is what happens: CPU is doing IRET in 32bit mode.
>> Stack layout:
> +20 (empty) <--- SS:ESP
> +16 old_CS
> +12 old_EIP
> +8 old_EFLAGS
> +4 old_SS
> +0 old_ESP
JFYI, looks like you swapped CS<-->EIP and SS<-->ESP
on your stack frame layout.

> If old_SS references 'small' data segment (16-bit one),
> processor does not restore ESP from old_ESP.
> It restores lower 16 bits only. Upper 16 bits are filled
> with garbage (or just not modified at all?).
Not modified at all, yes. That's why it is always
greater than TASK_SIZE (0xc0000000) - it is still
in a kernel space.

> This works fine because processor uses SP, not ESP,
> for subsequent push/pop/call/ret operations.
> But if code uses full ESP, thinking that upper 16 bits are zero,
> it will crash badly. Correct me if I'm wrong.
That's correct. But you should note that the
program not just "thinks" that the upper 16 bits
are zero. It writes zero there itself, and a few
instructions later - oops, it is no longer zero...
My test-case does "hlt" to force the context switch,
but in fact it is not necessary. I tried an
empty loop instead of HLT. If the loop is long
enough and some IRQ being handled by the kernel
in a mean time, the ESP is corrupted. So the
user-space never knows when exactly it gets corrupted.
And, as I mentioned already, since the code is 32bit,
it just does the things like "mov ebp,esp" and
uses ebp to access the function params and locals,
which crashes, or uses ESP directly to access
something, etc.

> Hmm. I think you need to *emulate* this IRET.
> Is this IRET happens in kernel or in dosemu code?
The problem happens only when iret is used to
switch to another priv level. So it is the kernel's
iret of course, something to the effect of entry.S:128
I think. The control is transferred from kernel code
to the DOS code directly. dosemu code is completely
bypassed. dosemu have no chances to intercept that.
The DOS program is running on its own (as long as
it doesn't trigger an exception, or SIGALRM comes),
and the CPU is corrupting its ESP in *any* place,
since it happens when an external IRQ arrives.

>> That's strange indeed! Apparently your ESP value,
>> 0xccf1fb10, is greater than 0xc0000000, in which
>> case my program should just write "BUG!", but for
>> you it doesn't. Haven't you altered the code
>> somehow?
> No, I didn't alter anything. Strange indeed.
So that proves that writing the bug-free code, esp.
when the asm is involved, is not always as easy as
I assume...

> I am thoroughly confused now. Two back-to-back
> 'printf("Now sp=%08lx\n", new_esp);'
> gave different result. How this can happen??
My code did the assumption that the ESP should
not change within the execution of main(). This
is true only if you don't enable the optimization
(as I did), or use -fno-defer-pop option of gcc.
Fixed code is attached (just for the overall
completeness:) Should survive the optimization now.


Attachments:
stk1.c (2.58 kB)

2004-09-18 13:09:39

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Saturday 18 September 2004 13:58, Stas Sergeev wrote:
> Hello.
>
> Denis Vlasenko wrote:
> >> This is what happens: CPU is doing IRET in 32bit mode.
> >> Stack layout:
> >
> > +20 (empty) <--- SS:ESP
> > +16 old_CS
> > +12 old_EIP
> > +8 old_EFLAGS
> > +4 old_SS
> > +0 old_ESP
>
> JFYI, looks like you swapped CS<-->EIP and SS<-->ESP
> on your stack frame layout.

Hm. More than that. My figure is totally incorrect.
Corrected one:

+16 old_SS
+12 old_ESP
+8 old_EFLAGS
+4 old_CS
+0 old_EIP <--- SS:ESP

Addresses grow from the bottom in this fig.
Note that offsets are always stored in lower address
relative to where selectors are stored.

> > If old_SS references 'small' data segment (16-bit one),
> > processor does not restore ESP from old_ESP.
> > It restores lower 16 bits only. Upper 16 bits are filled
> > with garbage (or just not modified at all?).
>
> Not modified at all, yes. That's why it is always
> greater than TASK_SIZE (0xc0000000) - it is still
> in a kernel space.
>
> > This works fine because processor uses SP, not ESP,
> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
>
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...
> My test-case does "hlt" to force the context switch,
> but in fact it is not necessary. I tried an
> empty loop instead of HLT. If the loop is long
> enough and some IRQ being handled by the kernel
> in a mean time, the ESP is corrupted. So the
> user-space never knows when exactly it gets corrupted.
> And, as I mentioned already, since the code is 32bit,
> it just does the things like "mov ebp,esp" and
> uses ebp to access the function params and locals,
> which crashes, or uses ESP directly to access
> something, etc.
>
> > Hmm. I think you need to *emulate* this IRET.
> > Is this IRET happens in kernel or in dosemu code?
>
> The problem happens only when iret is used to
> switch to another priv level. So it is the kernel's
> iret of course, something to the effect of entry.S:128
> I think. The control is transferred from kernel code
> to the DOS code directly. dosemu code is completely
> bypassed. dosemu have no chances to intercept that.
> The DOS program is running on its own (as long as
> it doesn't trigger an exception, or SIGALRM comes),
> and the CPU is corrupting its ESP in *any* place,
> since it happens when an external IRQ arrives.

Aha. The only way to sanely handle it is to
hack on entry.S I'm afraid. Something like rewriting
CS:EIP so that it returns to a small ring-3 trampoline
which clears upper 16 bits of ESP and jumps to original CS:EIP.

You may use ring-3 user stack as a place to save original CS:EIP,
and then just do RETF. This way, trampoline boils down to

movzxl %sp,%esp
retf

which does not touch EFLAGS and you don't need to bother
with thinking about preserving it. IRET will restore it,
and trampoline wouldn't touch it anymore.

Now, how to detect when to use this? Hmm.... the simplest thing
is to check that

(old_ESP <= 0xffff) && !(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)

This will cost us only one comparison in the normal
path, because typically ESP of Linus executables
is greater than 0xffff.

P.S.

# gcc -o stk1 -O2 stk1.c
# ./stk1
old_ss=0x7b new_ss=0x7f
In sighandler: esp=bffffb30
old_esp=0xbffffb30 new_esp=0xc618fb30
BUG!

--
vda

2004-09-18 16:44:09

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi Petr.

Petr Vandrovec wrote:
> natural solution seems to be to create complete 16bit CPL1
> environment, return to it, load ESP as you want, and then do IRET
> to return to CPL2 or CPL3. Fortunately V8086 mode is not affected,
> so there should be no problem with using CPL1 for this middle step.
> But of course it is not something you want to do on each return
> from interrupt handler... Well, or maybe you want...
Actually, this may indeed be what I want!
I think this can be implemented with the checks
that Denis Vlasenko suggests. Something like this
can be added to entry.S, right before the "iret":
---
if (!(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one))
push_a_stack_frame_to_return_to_the_ring1_trampoline();
---
This way the overhead for the normal case would be
something about 4 asm insns (the check), and for the
dosemu case - who cares? (and probably also the wine
people will value that)

Does this look reasonable? If it does, I think I
should just start implementing that.

2004-09-18 16:59:38

by Petr Vandrovec

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Sat, Sep 18, 2004 at 08:45:33PM +0400, Stas Sergeev wrote:
> Hi Petr.
>
> Petr Vandrovec wrote:
> >natural solution seems to be to create complete 16bit CPL1
> >environment, return to it, load ESP as you want, and then do IRET
> >to return to CPL2 or CPL3. Fortunately V8086 mode is not affected,
> >so there should be no problem with using CPL1 for this middle step.
> >But of course it is not something you want to do on each return
> >from interrupt handler... Well, or maybe you want...
> Actually, this may indeed be what I want!
> I think this can be implemented with the checks
> that Denis Vlasenko suggests. Something like this
> can be added to entry.S, right before the "iret":
> ---
> if (!(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one))
> push_a_stack_frame_to_return_to_the_ring1_trampoline();
> ---
> This way the overhead for the normal case would be
> something about 4 asm insns (the check), and for the
> dosemu case - who cares? (and probably also the wine
> people will value that)
>
> Does this look reasonable? If it does, I think I
> should just start implementing that.

Do not forget that you have to implement also return to CPL1, as
NMI may arrive while you are running on CPL1. So it may not be
as trivial as it seemed. Maybe all these programs survive that
their CPL3 stack changes, and then you could just push cs,eip and
esp on CPL3 stack, and return to user code (vsyscall page seems
natural place for that code) which would do pop %esp; retf?

Only problem is how to find that old SS points to 16bit segment.
You need LAR and/or you have to peek GDT/LDT to find stack size,
and AFAIK LAR is microcoded on P4.
Petr Vandrovec

2004-09-18 17:03:35

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi Denis.

Denis Vlasenko wrote:
> Aha. The only way to sanely handle it is to
> hack on entry.S I'm afraid. Something like rewriting
> CS:EIP so that it returns to a small ring-3 trampoline
> which clears upper 16 bits of ESP and jumps to original CS:EIP.
Well, I don't really want to clear the higher word
of ESP (even if I want to do that in most cases),
but I'd really like to just restore it properly.
Of course (I think) ring-3 trampoline will not
work for many reasons. The most obvious one is
that it itself can be interrupted in any place.
Another problem with it is that the return frame
will have to be pushed to the stack of a DOS prog.
This is not the good thing to do. dosemu avoids
ever touching the stack of a DOS prog by setting
up the sigaltstack.
What *will* work, however, is a ring-1 trampoline,
as Petr Vandrovec suggested. It can be executed
with interrupts disabled (I think) so will not
be interrupted, and it can (isn't it?) use the
kernel stack, if I understand that correctly.

> Now, how to detect when to use this? Hmm.... the simplest thing
> is to check that
> (old_ESP <= 0xffff) && !(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)
Yes, that's an excellent idea (except for the
ESP<=0xffff check - I don't think this one is
necessary).

> This will cost us only one comparison in the normal
> path,
Yes, so I think I should just try to implement
that. Not as a ring-3 trampoline, but as a ring-1
one.

> because typically ESP of Linus executables
> is greater than 0xffff.
I'd rather say, because the stack segment is 32bit
for them always.
And this all should work, as it seems to me.
Thanks for the hint about the checking.

2004-09-18 19:13:17

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi,

Petr Vandrovec wrote:
>> Does this look reasonable? If it does, I think I
>> should just start implementing that.
> Do not forget that you have to implement also return to CPL1, as
> NMI may arrive while you are running on CPL1. So it may not be
> as trivial as it seemed.
I am not sure what special actions have to be
taken here compared to returning to ring-3 from NMI.
Is there anywhere in the sources an example to take
a look at? (sorry for the newbie questions)

> Maybe all these programs survive that
> their CPL3 stack changes,
Most likely they will, I am just not sure. What
if they disabled interrupts and are switching the
stack by loading the SS and ESP separately? If we
interrupt it there, there may be the problems, which
would be almost impossible to track down later.
It just looks a bit unsafe to me. Or maybe exploit
a sigaltstack for that? Hmm, is implementing the
CPL1 trampoline really that difficult after all?
I think it is somewhat cleaner and maybe safer.

> Only problem is how to find that old SS points to 16bit segment.
> You need LAR and/or you have to peek GDT/LDT to find stack size,
Yes, I was thinking about using LAR - looks like the
most easy and fast way to just get that single bit
out of LDT.

> and AFAIK LAR is microcoded on P4.
Where does this lead us to? Some other problems I
am not aware about?

2004-09-18 20:36:51

by Petr Vandrovec

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Sat, Sep 18, 2004 at 11:14:44PM +0400, Stas Sergeev wrote:
> Hi,
>
> Petr Vandrovec wrote:
> >>Does this look reasonable? If it does, I think I
> >>should just start implementing that.
> >Do not forget that you have to implement also return to CPL1, as
> >NMI may arrive while you are running on CPL1. So it may not be
> >as trivial as it seemed.
> I am not sure what special actions have to be
> taken here compared to returning to ring-3 from NMI.
> Is there anywhere in the sources an example to take
> a look at? (sorry for the newbie questions)

It means that you cannot blindly create CPL1 trampoline stack
in some static per-cpu area. But if we can assume that there
is no other CPL1 code in the system, something like code below
could work:

/* + 20 [word 5] SS
+ 16 [word 4] ESP
+ 12 [word 3] EFLAGS
+ 8 [word 2] CS
+ 4 [word 1] EIP
+ 0 [word 0] ESP for popl %esp
*/
u_int32_t cpl1stacks[NUM_CPUS][6];

curCPU = smp_processor_id();
minSP = curCPU * sizeof(cpl1stacks[0]);
maxSP = minSP + sizeof(cpl1stacks[0]);
cpl1stack = cpl1stacks[curCPU];

if (cpl0stack[retCS] & 3 == 1) {
/* Going back to our trampoline */
/* There is no other place in kernel running on CPL1
except our trampoline; so interrupt could occur either
on popl %esp or on iret. If it occured on popl %esp,
just return, code will do proper things. If interrupt
occured on iret, we have to perform popl %esp again,
so that upper bits of %esp are correctly restored
for CPL3 code */
ASSERT(cpl0stack[retCS] == FLAT_4G_CPL1_CS);
ASSERT(cpl0stack[retSS] == SMALL_CPL1_SS);
if (cpl0stack[retEIP] == fixup_proc) {
ASSERT(cpl0stack[retESP] == minSP]);
} else if (cpl0stack[retEIP] == fixup_proc_iret) {
ASSERT(cpl0stack[retESP] & 0xFFFF == minSP + 4);
/* Undo popl %esp - copy value from ESP we were
using on CPL1 back to stack */
cpl1stack[0] = cpl0stack[retESP];
cpl0stack[retEIP] = fixup_proc;
cpl0stack[retESP] = minSP;
} else {
/* unexpected code running on CPL1 */
/* Probably do simple IRET and hope for the best? */
ASSERT(0);
}
iret;
} else {
cpl1Stack[5] = cpl0stack[retSS];
cpl1Stack[4] = cpl0stack[retESP];
cpl1Stack[3] = cpl0stack[retEFLAGS];
cpl1Stack[2] = cpl0stack[retCS];
cpl1Stack[1] = cpl0stack[retEIP];
cpl1Stack[0] = (cpl0stack[retESP] & 0xFFFF0000) | minSP + 4;
cpl0stack[retSS] = SMALL_CPL1_SS;
cpl0stack[retESP] = minSP;
/*
Do NOT clear IF... IF flag is affected only if IOPL >= CPL, so
with IOPL=0 IRET on CPL1 won't reenable interrupts. This is reason
why we cannot use RETF to return from CPL0 to CPL1 (retf
is much faster than iret on P4) (and we cannot use retf for
CPL1->CPL3 due to TF/RF).

Clear TF so we do not start tracing on CPL1 if we trace
userspace, and clear RF, so if somebody intentionaly pointed
hardware breakpoint into CPL1 handler, it will be triggered
(we must use this path for returns from INT1 too, so it
is possible that RF is set in EFLAGS on stack).
*/
cpl0stack[retEFLAGS] &= ~(EFLAGS_TF | EFLAGS_RF);
cpl0stack[retCS] = FLAT_4G_CPL1_CS;
cpl0stack[retEIP] = fixup_proc;
iret;
}

fixup_proc:
popl %esp
fixup_proc_iret;
iret


It assumes that there is one new 32bit CPL1 flat CS descriptor in GDT, and
one 16bit (small) CPL1 SS descriptor (grows up, with limit 24*<num_cpus>
and base of cpl1stacks), plus <num_cpus> 24byte CPL1 stacks (max. 64KB,
so for kernels with more than 2730 CPUs you need more than one stack
descriptor).


> >Maybe all these programs survive that
> >their CPL3 stack changes,
> Most likely they will, I am just not sure. What
> if they disabled interrupts and are switching the
> stack by loading the SS and ESP separately? If we
> interrupt it there, there may be the problems, which
> would be almost impossible to track down later.
> It just looks a bit unsafe to me. Or maybe exploit
> a sigaltstack for that? Hmm, is implementing the
> CPL1 trampoline really that difficult after all?
> I think it is somewhat cleaner and maybe safer.

As in pseudocode above I was able to handle even NMIs with just
24 bytes of stack on CPL1, it is definitely preferred solution.

> >and AFAIK LAR is microcoded on P4.
> Where does this lead us to? Some other problems I
> am not aware about?

It is slow. No other problems, except that doing
(((ss & 4) ? gdt[ss >> 3] : ldt[ss >> 3]) & 0x00????00) == 0x00????00;
may be faster than doing (lar(ss) & 0x00????00) == 0x00????00.
Petr

2004-09-21 11:23:26

by Pavel Machek

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi!

> >for subsequent push/pop/call/ret operations.
> >But if code uses full ESP, thinking that upper 16 bits are zero,
> >it will crash badly. Correct me if I'm wrong.
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...

Hmm, perhaps this can also be viewed as a "information leak"? Program
running under dosemu is not expected to know high bits of kernel
%esp...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2004-09-21 11:44:34

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Tuesday 21 September 2004 14:19, Pavel Machek wrote:
> Hi!
>
> > >for subsequent push/pop/call/ret operations.
> > >But if code uses full ESP, thinking that upper 16 bits are zero,
> > >it will crash badly. Correct me if I'm wrong.
> >
> > That's correct. But you should note that the
> > program not just "thinks" that the upper 16 bits
> > are zero. It writes zero there itself, and a few
> > instructions later - oops, it is no longer zero...
>
> Hmm, perhaps this can also be viewed as a "information leak"? Program
> running under dosemu is not expected to know high bits of kernel
> %esp...

Yes.

I must admit that Stas was right and I was wrong.
This is a 386 CPU bug.

Since then it seems to be codified in i86 architecture
and duplicated in all successors.
Incredible... :(
--
vda

2004-09-22 18:41:37

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi Petr et al.

I coded up something that remotely looks like the
discussed patch.
I have deviated from your proposals at some important
points:
1. I am not allocating the ring1 stack separately,
I am allocating it on a ring0 stack. Much simpler
code, although non-reentrant/preempt-unsafe.
2. I am disabling the interrupts after all. That's
because of the preempt-unsafeness. I pass up the
IOPL=1 when necessary, to avoid problems.
But I guess also with your technique the interrupts
had to be disabled, unless the ring1 stack is
per-thread.
3. I am using LAR. Do you really think it can be
slower than the whole thing of locating LDT?

Let me know if I did something stupid.
The patch is attached.
I tested (pretty much) everything in it, except
probably the "popl %esp" restartability. But that
one looks fairly simple and should work.

Does this patch look good?


Attachments:
linux-2.6.8-stacks2.diff (9.63 kB)

2004-09-22 18:55:45

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi,

Denis Vlasenko wrote:
> Maybe. This would be a complicated thing.
I bet it was! :)

> Well. Not okay. Maybe this?
> 1. We build IRET frame on ring1 stack for step 4 (see below),
> modify IRET frame on ring0 stack so that intrs are disabled
> and CS:EIP and SS:ESP point to values of ring1 code/stack.
Much simpler: IRET frame to return to user is
already there. Just push another one to return
to ring1 first.

> 2. IRET returns to ring1 code with dedicated *16-bit* ring1 stack
> Upper word of ESP is wrong now, but we can safely fiddle with it.
> 3. trampoline code fixes upper word of ESP (how?)
popl %esp (as per Petr Vandrovec's suggestion)
The value on stack is carefully prepared on ring0.

> 4. trampoline IRETs to user code.
> May work.
Works!

> ring1 stacks must be per-CPU.
I allocate it on a ring0 stack. Noone seem to
suggest that. Is this flawed for some reasons?

>> ESP<=0xffff check - I don't think this one is
>> necessary).
> Any program which runs with 16bit stack and yet with
> ESP > 0xffff is doing something *terminally* weird.
> I think it is acceptable to leave this case unfixed.
For what? We can have that fixed so why not?

> You cannot check 16bitness of SS descriptor in two
> insns.
I do actually:
larl OLDSS(%esp), %eax
testl $0x00400000, %eax
which is exactly two insns.

Since I've forgot to CC the patch to you, I uploaded
it here:
http://www.dosemu.org/stas/linux-2.6.8-stacks2.diff
so that you (or anyone interested) can make a review.

2004-09-22 19:20:18

by Richard B. Johnson

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?


What problem is this supposed to fix? ESP is __not__ corrupted
when returning to protected-mode or a different privilege level.
You don't 'return' to protected mode from a virtual-8086 mode,
even though you (can) use an `iret`instruction. The fundamental
change is though the EFLAGS register stored in the TSS.

The so-called bug is that when in real mode or in virtual-8086
mode, the high word of ESP is not changed. It is not a bug
because the high word doesn't even exist when in VM-86 mode!!
It is possible to use the 32-bit prefix, when in 16-bit mode,
to muck with the high word of the stack, but that's not
a documented procedure, but a side-effect of such an undocumented
instruction.

There is no bug to fix. When the VM-86 mode transitions to
32-bit protected mode, the stack is restored to the condition
it was just prior to the transition to VM-86 mode, therefore you
don't "use up" any stack. The so-called bug is only cosmetic
when somebody is prowling around in undocumented shadows.

Please, somebody from Intel tell these guys to leave the thing
alone. I, for one, don't want a bunch of "fixes" that do nothing
except consume valuable RAM, making it near impossible to
use later versions of Linux in embedded systems.


On Wed, 22 Sep 2004, Stas Sergeev wrote:

> Hi Petr et al.
>
> I coded up something that remotely looks like the
> discussed patch.
> I have deviated from your proposals at some important
> points:
> 1. I am not allocating the ring1 stack separately,
> I am allocating it on a ring0 stack. Much simpler
> code, although non-reentrant/preempt-unsafe.
> 2. I am disabling the interrupts after all. That's
> because of the preempt-unsafeness. I pass up the
> IOPL=1 when necessary, to avoid problems.
> But I guess also with your technique the interrupts
> had to be disabled, unless the ring1 stack is
> per-thread.
> 3. I am using LAR. Do you really think it can be
> slower than the whole thing of locating LDT?
>
> Let me know if I did something stupid.
> The patch is attached.
> I tested (pretty much) everything in it, except
> probably the "popl %esp" restartability. But that
> one looks fairly simple and should work.
>
> Does this patch look good?
>
>

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
Note 96.31% of all statistics are fiction.

2004-09-22 19:53:23

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi,

Richard B. Johnson wrote:
> What problem is this supposed to fix?
Richard, it will really help if you read the
whole thread. I was answering this to Denis
Vlasenko already - he agreed. Do I have to
repeat the explanations?

> ESP is __not__ corrupted
Right, but is not properly restored either,
while it have to be.

> when returning to protected-mode or a different privilege level.
It gets "corrupted" (not properly restored)
exactly when returning to *protected mode*
from another priv level. Please refer to the
Intel docs I pointed to in that thread earlier.

> You don't 'return' to protected mode from a virtual-8086 mode,
Noone was speaking about v86. I don't see why
you pick that up.

> The so-called bug is that when in real mode or in virtual-8086
> mode, the high word of ESP is not changed.
In short: Wrong.
The complete explanations are easily locateable
in that thread. And it have nothing to do with
the real mode either.

> It is not a bug
> because the high word doesn't even exist when in VM-86 mode!!
Noone was speaking about v86, sorry. I am bypassing
that part.

> It is possible to use the 32-bit prefix, when in 16-bit mode,
That's not about the prefixes either, sorry.
We were talking about the 32bit PM code.

> Please, somebody from Intel tell these guys to leave the thing
> alone.
Thanks many, they already left that alone once:)
Maybe enough of leaving the bugs alone?
I have lots of the DOS progs here that do not
work under dosemu without that patch, and work
perfectly with it. It should be enough. If
you need an examples - well, OpenCubicPlayer
for one. It will start, but crash as soon as
the music is attempted to play. The patch fixes
it. Other progs you'll have problems downloading
anyway, but let me know if you need these.

> I, for one, don't want a bunch of "fixes" that do nothing
> except consume valuable RAM, making it near impossible to
> use later versions of Linux in embedded systems.
Well, my patch is purely in asm. How many
valueable bytes does it take from you?
As for performance - 8 asm insns on a generic
path. Not too much either, as for me.

2004-09-22 20:02:39

by Petr Vandrovec

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Wed, Sep 22, 2004 at 10:49:45PM +0400, Stas Sergeev wrote:
> Hi Petr et al.
>
> I coded up something that remotely looks like the
> discussed patch.
> I have deviated from your proposals at some important
> points:
> 1. I am not allocating the ring1 stack separately,
> I am allocating it on a ring0 stack. Much simpler
> code, although non-reentrant/preempt-unsafe.

I think that it is problem. And do not forget that NMI can occur at any
place, so I'm not quite sure how your code works.

> 2. I am disabling the interrupts after all. That's
> because of the preempt-unsafeness. I pass up the
> IOPL=1 when necessary, to avoid problems.
> But I guess also with your technique the interrupts
> had to be disabled, unless the ring1 stack is
> per-thread.

What if exception occurs in CPL1 code? Processor reloads SS/ESP from
TSS, pointing at top of CPL0 stack - where it overwrites CPL1 stack -
- and it pushes on top of CPL0 stack address of CPL1 - and real return
address for CPL3 was lost :-( Boom.

You change base address for SS segment - in that case you need per-CPU
SS segment, it wont' work with global SS segment.

> 3. I am using LAR. Do you really think it can be
> slower than the whole thing of locating LDT?

Try measuring that. I think that LAR will be faster than lookup in
memory, but...
>
> Let me know if I did something stupid.
> The patch is attached.
> I tested (pretty much) everything in it, except
> probably the "popl %esp" restartability. But that
> one looks fairly simple and should work.

Did you tried setting SS/ESP on stack to invalid values, so IRET on CPL1
triggers fault? You should be able to do that by modifying SS/ESP from
signal handler.

> +#define NMI_STACK_RESERVE 0x400
> +#define MAKE_ESPFIX \
> + cli; \
> + subl $NMI_STACK_RESERVE, %esp; \

This is ugly. What if NMI handler uses more than 1KB?

> + .rept 5; \
> + pushl NMI_STACK_RESERVE+0x10(%esp); \
> + .endr; \
> + pushl %eax; \
> + movl %esp, %eax; \
> + pushl %ebp; \
> + GET_THREAD_INFO(%ebp); \
> + movl TI_cpu(%ebp), %ebp; \
> + shll $GDT_SIZE_SHIFT, %ebp; \
> + addl $cpu_gdt_table, %ebp; \
> + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \
> + shrl $16, %eax; \
> + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \
> + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \
> + popl %ebp; \
> + popl %eax; \
> + pushw 14(%esp); \

What about referencing 14+NMI_STACK_RESERVE() ? And ESP is not
multiple of 4 here - it may slow down processor a bit.

I think that you should prepare stack by using moves to
-NMI_STACK_RESERVE-xx(%esp) instead of adjusting stack and using pushes.
Or you can (if you want per-thread and not per-CPU stack) prepare CPL1
stack above CPL0 stack - this way you steal 24 bytes from CPL0 stack,
but CPL0 can use full 4KB (8KB), without NMI being limited by 1KB.

> + pushw $4; \
> + pushl $__ESPFIX_SS; \
> + pushl $0; \
> + pushl 20(%esp); \
> + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
> + orl $IOPL1_MASK, (%esp); \

Does this work if userspace runs with iopl 3,2 or 1? Does iret on CPL1
set iopl level properly (i.e. is iopl set if IOPL <= CPL, or only if CPL=0?).

> + pushl $__ESPFIX_CS; \
> + pushl $espfix_trampoline;
> +
> #define SAVE_ALL \
> cld; \
> pushl %es; \
> @@ -122,9 +172,24 @@
> .previous
>
>
> +/* If returning to Ring-3, not to V86, and with
> + * the small stack, try to fix the higher word of
> + * ESP, as the CPU won't restore it from stack.
> + * This is an "official" bug of all the x86-compatible
> + * CPUs, which we can try to work around to make
> + * dosemu happy. */
> #define RESTORE_ALL \
> - RESTORE_REGS \
> - addl $4, %esp; \
> + movl EFLAGS(%esp), %eax; \
> + movb CS(%esp), %al; \
> + andl $(VM_MASK | 2), %eax; \
> + cmpl $2, %eax; \
> + jne 3f; \
> + larl OLDSS(%esp), %eax; \
> + testl $0x00400000, %eax; \
> +3: RESTORE_REGS \
> + leal 4(%esp), %esp; \
> + jnz 1f; \
> + MAKE_ESPFIX \
> 1: iret; \
> .section .fixup,"ax"; \

I'm not sure about lea speed. And fast path should NOT jump,
forward jumps are initialy predicted as not taken... Maybe:

3: RESTORE_REGS
jz 8f
add $4,%esp
1: iret
8: add $4,%esp
MAKE_ESPFIX
iret

And then you can move MAKE_ESPFIX out of macro.

And it seems to me that it adds too many operations to fast path.
Maybe you want to introduce some per-thread flag, or leave syscall path
to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).

> diff -urN linux-2.6.8-pcsp/arch/i386/kernel/process.c linux-2.6.8-stacks/arch/i386/kernel/process.c
> --- linux-2.6.8-pcsp/arch/i386/kernel/process.c 2004-06-21 09:19:07.000000000 +0400
> +++ linux-2.6.8-stacks/arch/i386/kernel/process.c 2004-09-21 10:45:55.000000000 +0400
> @@ -225,7 +225,7 @@
> printk("EIP: %04x:[<%08lx>] CPU: %d\n",0xffff & regs->xcs,regs->eip, smp_processor_id());
> print_symbol("EIP is at %s\n", regs->eip);
>
> - if (regs->xcs & 3)
> + if (regs->xcs & 2)
> printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);

I think you should leave this as is, regs->xss/regs->esp should be valid
for CPL1 exceptions.

> printk(" EFLAGS: %08lx %s (%s)\n",regs->eflags, print_tainted(),UTS_RELEASE);
> printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
> diff -urN linux-2.6.8-pcsp/arch/i386/kernel/signal.c linux-2.6.8-stacks/arch/i386/kernel/signal.c
> --- linux-2.6.8-pcsp/arch/i386/kernel/signal.c 2004-08-10 11:02:36.000000000 +0400
> +++ linux-2.6.8-stacks/arch/i386/kernel/signal.c 2004-09-21 10:48:38.000000000 +0400
> @@ -558,7 +558,7 @@
> * kernel mode. Just return without doing anything
> * if so.
> */
> - if ((regs->xcs & 3) != 3)
> + if (!(regs->xcs & 2))
> return 1;

Really? I think that it should stay.

> @@ -630,7 +630,7 @@
> /* If this is a kernel mode trap, save the user PC on entry to
> * the kernel, that's what the debugger can make sense of.
> */
> - info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip :
> + info.si_addr = ((regs->xcs & 2) == 0) ? (void *)tsk->thread.eip :
> (void *)regs->eip;

Sure? Maybe this should be (regs->xcs & 3 != 3) instead?

Rest looks fine, but I'm not completely sure that SMP and NMIs are
handled in the best possible way.
Petr Vandrovec

2004-09-22 20:23:33

by Richard B. Johnson

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Thu, 23 Sep 2004, Stas Sergeev wrote:

> Hi,
>
> Richard B. Johnson wrote:
> > What problem is this supposed to fix?
> Richard, it will really help if you read the
> whole thread. I was answering this to Denis
> Vlasenko already - he agreed. Do I have to
> repeat the explanations?
>
> > ESP is __not__ corrupted
> Right, but is not properly restored either,
> while it have to be.
>
> > when returning to protected-mode or a different privilege level.
> It gets "corrupted" (not properly restored)
> exactly when returning to *protected mode*
> from another priv level. Please refer to the
> Intel docs I pointed to in that thread earlier.
>
> > You don't 'return' to protected mode from a virtual-8086 mode,
> Noone was speaking about v86. I don't see why
> you pick that up.
>
> > The so-called bug is that when in real mode or in virtual-8086
> > mode, the high word of ESP is not changed.
> In short: Wrong.
> The complete explanations are easily locateable
> in that thread. And it have nothing to do with
> the real mode either.
>
> > It is not a bug
> > because the high word doesn't even exist when in VM-86 mode!!
> Noone was speaking about v86, sorry. I am bypassing
> that part.
>
> > It is possible to use the 32-bit prefix, when in 16-bit mode,
> That's not about the prefixes either, sorry.
> We were talking about the 32bit PM code.
>
> > Please, somebody from Intel tell these guys to leave the thing
> > alone.
> Thanks many, they already left that alone once:)
> Maybe enough of leaving the bugs alone?
> I have lots of the DOS progs here that do not
> work under dosemu without that patch, and work
> perfectly with it. It should be enough. If
> you need an examples - well, OpenCubicPlayer
> for one. It will start, but crash as soon as
> the music is attempted to play. The patch fixes
> it. Other progs you'll have problems downloading
> anyway, but let me know if you need these.
>
> > I, for one, don't want a bunch of "fixes" that do nothing
> > except consume valuable RAM, making it near impossible to
> > use later versions of Linux in embedded systems.
> Well, my patch is purely in asm. How many
> valueable bytes does it take from you?
> As for performance - 8 asm insns on a generic
> path. Not too much either, as for me.
>

Well DOSEMU uses VM-86 mode. That's how it works. It
creates a virtual 8086 machine, complete with the
required "DOS compatible" virtual BIOS environment.

I use it all the time because I write, amongst other things,
the complete BIOS and startup code for many Intel based
machines.

I run compilers, assemblers, linkers, and editors in that
environment and it works.

Sombody mentions a completely unrelated so-called Intel
bug and next thing you know, there are patches to work-
around the bug???

The bug doesn't exist period. Here is a session in
VM-86 mode., using DOSEMU. It does one hell of a lot
more work than your games.



Script started on Wed Sep 22 16:00:02 2004
# godos
CPU speed set to 2793/1 MHz
Running on CPU=586, FPU=1, rdtsc=1
Linux DOS emulator 0.98.5.0 $Date: 99/01/15 $
Last configured at Wed Mar 24 12:44:16 EST 1999 on linux
This is work in progress.
Please test against a recent version before reporting bugs and problems.
Bugs, Patches & New Code to [email protected]

Starting MS-DOS...





 1. Start the TCP/IP Network (NE* B
radley's code)  2. Start the TCP/IP Network (3COM Board PCTCP)  3. Start the TCP/IP Network (3COM Board ANALOGIC)  4. Load NDIS driver only (3COM Board)  5. Do not load any network  6. Do not execute any AUTOEXEC.BAT commands  7. Start RCCS Host Software  MS-DOS 6.22 Startup Menu  Enter a choice: 5??????            F5=Bypass startup files F8=Confirm each line of CONFIG.SYS and AUTOEXEC.BAT [N]Time remaining: 04  
MS-DOS Version 6.22

C:\>cd pbios

C:\PBIOS>make clean

Microsoft (R) Program Maintenance Utility Version 4.07
Copyright (C) Microsoft Corp 1984-1988. All rights reserved.



  if exist *.obj del *.obj 


 if exist *.bin del *.bin  if exist *.rom del *.rom  if exist *.hex del *.hex 






 if exist *.lis del *.lis  if exist *.com del *.com  if exist *.exe del *.exe  if exist *.oe0 del *.oe0  if exist *.oe1 del *.oe1  cd tools  if exist *.obj del *.obj 






C:\PBIOS> make clean if exist *.bin del *.bin  if exist *.rom del *.rom  if exist *.hex del *.hex  if exist *.com del *.com  if exist *.exe del *.exe  cd ..  C:\PBIOS>makepbiosMMMMMMMMMMMMMMMM if exist *.obj del *.obj
if exist *.bin del *.bin  if exist *.rom del *.rom  if exist *.hex del *.hex  if exist *.com del *.com  if exist *.exe del *.exe  cd ..  C:\PBIOS> make pbios  Microsoft (R) Program Maintenance Utility Version 4.07 Copyright (C) Microsoft Corp 1984-1988. All rights reserved.   MAKE : warning U4000: 'TOOLS\IHEX.EXE' : target does not exist  CD TOOLS  MAKE TOOLS  
MAKE : warning U4000: 'odeven.exe' : target does not existcl -W3 odeven.c

MMMMcomcomexeexecd .. 
C:\PBIOS> make pbios 
Microsoft (R) Program Maintenance Utility Version 4.07
Copyright (C) Microsoft Corp 1984-1988. All rights reserved.

MAKE : warning U4000: 'TOOLS\IHEX.EXE' : target does not exist  CD TOOLS  MAKE TOOLS  AKE : warning U4000: 'odeven.exe' : target does not exist
cl -W3 odeven.c
Microsoft (R) C Optimizing Compiler Version 6.00A
Copyright (c) Microsoft Corp 1984-1990. All rights reserved.

odeven.c









  Microsoft (R) Segmented-Executable Linker Version 5.13 Copyright (C) Microsoft Corp 1984-1991. All rights reserved.  Object Modules [.OBJ]: odeven.obj /farcall Run File [odeven.exe]: "odeven.exe" /noi List File [NUL.MAP]: NUL Libraries [.LIB]: Definitions File [NUL.DEF]: ; 






  MAKE : warning U4000: 'fixup.exe' : target does not exist  cl -W3 fixup.c Microsoft (R) C Optimizing Compiler Version 6.00A Copyright (c) Microsoft Corp 1984-1990. All rights reserved.  fixup.c Definitions File [NUL.DEF]: ;fixup.exe' : target does not exist fixup.c



fixup.c fixup.obj /farcall fixup.exe]: "fixup.exe" /noi getall.exe' : target does not existgetall.c



getall.c
getall.exe' : target does not existgetall.c



getall.cgetall.obj /farcallgetall.exe]: "getall.exe" /noipci.obj' : target does not exist AL -c -W3 -o pci.obj pci.c



pci.c






  pci.c  MAKE : warning U4000: 'pcireg.obj' : target does not exist  tasm pcireg; Turbo Assembler Version 2.0 Copyright (c) 1988, 1990 Borland International Assembling file: pcireg.ASM
MMMMMMMMMMMMMMMMMMMAKE : warning U4000: 'pci.obj' : target does not exist
cl -AL -c -W3 -o pci.obj pci.c
Microsoft (R) C Optimizing Compiler Version 6.00A Copyright (c) Microsoft Corp 1984-1990. All rights reserved.  pci.c  MAKE : warning U4000: 'pcireg.obj' : target does not exist  tasm pcireg; Turbo Assembler Version 2.0 Copyright (c) 1988, 1990 Borland International  Assembling file: pcireg.ASM Error messages: None Warning messages: None Passes: 1 Remaining memory: 324k   MAKE : warning U4000: 'pci.exe' : target does not exist  link pci pcireg; 







 MAKE : warning U4000: 'ihex.exe' : target does not exist  cl -W3 ihex.c Microsoft (R) C Optimizing Compiler Version 6.00A Copyright (c) Microsoft Corp 1984-1990. All rights reserved.  ihex.c MM 
MAKE : warning U4000: 'pci.exe' : target does not exist link pci pcireg;

Microsoft (R) Segmented-Executable Linker Version 5.13
Copyright (C) Microsoft Corp 1984-1991. All rights reserved.


MAKE : warning U4000: 'ihex.exe' : target does not existcl -W3 ihex.cicrosoft (R) C Optimizing Compiler Version 6.00A
Copyright (c) Microsoft Corp 1984-1990. All rights reserved.
ihex.c 
Object Modules [.OBJ]: ihex.obj /farcall
Run File [ihex.exe]: "ihex.exe" /noi
List File [NUL.MAP]: NUL
Libraries [.LIB]:
Definitions File [NUL.DEF]: ;
MMMMihex.exe' : target does not existcl -W3 ihex.c
Microsoft (R) C Optimizing Compiler Version 6.00A Copyright (c) Microsoft Corp 1984-1990. All rights reserved.  ihex.c 
Object Modules [.OBJ]: ihex.obj /farcall
Run File [ihex.exe]: "ihex.exe" /noi
List File [NUL.MAP]: NUL
Libraries [.LIB]:
Definitions File [NUL.DEF]: ;
CD ..

MAKE : warning U4000: 'ABIOS.OBJ' : target does not exist
TASM /ml /p /w2 /m2 PBIOS, ABIOS.OBJ, PBIOS.LIS;
Turbo Assembler Version 2.0 Copyright (c) 1988, 1990 Borland International

Assembling file: PBIOS.ASM to ABIOS.OBJ




Turbo Assembler Version 2.0 Copyright (c) 1988, 1990 Borland International  Assembling file: PBIOS.ASM to ABIOS.OBJ Error messages: None Warning messages: None Passes: 2 Remaining memory: 258k   MAKE : warning U4000: 'BBIOS.OBJ' : target does not exist  TASM /ml /p /w2 /m2 /DCOM=1 PBIOS, BBIOS.OBJ; B








Error messages: None Warning messages: None Passes: 2 Remaining memory: 262k   MAKE : warning U4000: 'ABIOS.BIN' : target does not exist  LINK/T ABIOS.OBJ, ABIOS.BIN; MMMMMMMAssembling file: PBIOS.ASM to ABIOS.OBJ
Error messages: None
Warning messages: None
Passes:2
Remaining memory: 258k
 MAKE : warning U4000: 'BBIOS.OBJ' : target does not exist  TASM /ml /p /w2 /m2 /DCOM=1 PBIOS, BBIOS.OBJ; Turbo Assembler Version 2.0 Copyright (c) 1988, 1990 Borland International  Assembling file: PBIOS.ASM to BBIOS.OBJ Remaining memory: 262k


MAKE : warning U4000: 'ABIOS.BIN' : target does not exist
LINK/T ABIOS.OBJ, ABIOS.BIN;

Microsoft (R) Segmented-Executable Linker Version 5.13
Copyright (C) Microsoft Corp 1984-1991. All rights reserved.
M
Assembling file: PBIOS.ASM to BBIOS.OBJ Remaining memory: 262k


MAKE : warning U4000: 'ABIOS.BIN' : target does not exist
LINK/T ABIOS.OBJ, ABIOS.BIN;

Microsoft (R) Segmented-Executable Linker Version 5.13
Copyright (C) Microsoft Corp 1984-1991. All rights reserved.


MAKE : warning U4000: 'PBIOS.COM' : target does not existLINK/T BBIOS.OBJ, PBIOS.COM;icrosoft (R) Segmented-Executable Linker Version 5.13
Copyright (C) Microsoft Corp 1984-1991. All rights reserved.


MAKE : warning U4000: 'PBIOS.ROM' : target does not exist TOOLS\FIXUP ABIOS.BIN PBIOS.ROM







 MAKE : warning U4000: 'PBIOS.HEX' : target does not exist  TOOLS\IHEX PBIOS.ROM PBIOS.HEX 0 20 Y IHEX V2.00 Analogic Software Tools  Reading from file PBIOS.ROM, writing to file PBIOS.HEX ... Starting address (hex) 0000, block length (hex) 20 
C:\PBIOS> back


(B(B

#
#
#
#
# exit
Script done on Wed Sep 22 16:00:39 2004

Even Linux 32-bit `script` works when I am in VM-86 mode.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
Note 96.31% of all statistics are fiction.

2004-09-23 04:06:40

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi,

Petr Vandrovec wrote:
>> 1. I am not allocating the ring1 stack separately,
>> I am allocating it on a ring0 stack. Much simpler
>> code, although non-reentrant/preempt-unsafe.
> I think that it is problem. And do not forget that NMI can occur at any
> place, so I'm not quite sure how your code works.
Why it is not preempt-unsafe, I beleive, is because
if it is preempted after modifying GDT, and then
another process will walk that path modifying GDT,
the first process will not return anywhere. But as
for NMI this should not be the case, as it doesn't
seem to call schedule().

> What if exception occurs in CPL1 code? Processor reloads SS/ESP from
> TSS, pointing at top of CPL0 stack - where it overwrites CPL1 stack -
> - and it pushes on top of CPL0 stack address of CPL1 - and real return
> address for CPL3 was lost :-( Boom.
But for that I reserve the room on stack, so that
should not hurt.

> You change base address for SS segment - in that case you need per-CPU
> SS segment, it wont' work with global SS segment.
But I beleive it *is* per-CPU, no? I am taking the
TI_cpu value from the thread struct to look up the
proper GDT, should that work? (no SMP-box here to
test)

> Did you tried setting SS/ESP on stack to invalid values, so IRET on CPL1
> triggers fault? You should be able to do that by modifying SS/ESP from
> signal handler.
Yes, I did that - that seem to work (i.e. no Oops
or something bad).

>> +#define NMI_STACK_RESERVE 0x400
>> +#define MAKE_ESPFIX \
>> + cli; \
>> + subl $NMI_STACK_RESERVE, %esp; \
> This is ugly. What if NMI handler uses more than 1KB?
Adjusting NMI_STACK_RESERVE? I need only a few
bytes of stack to work with, so I can reserve
any amount for NMI. What should it be? 0x400 was
taken "randomly", but really, is it worth allocating
another per-CPU object to only save a few bytes of
the ring-0 stack?

> What about referencing 14+NMI_STACK_RESERVE() ?
Will work either, but for what? Just looks a bit
cleaner to me to do it that way, or am I missing
something?

> And ESP is not
> multiple of 4 here - it may slow down processor a bit.
But I have to store the value of ((ESP & 0xffff) | 4) on
stack, so that's why I do that. How can it be
done in a more effective way? pushl, then movw?

> I think that you should prepare stack by using moves to
> -NMI_STACK_RESERVE-xx(%esp) instead of adjusting stack and using pushes.
I can do that of course, do you think it is faster?

> Or you can (if you want per-thread and not per-CPU stack) prepare CPL1
> stack above CPL0 stack - this way you steal 24 bytes from CPL0 stack,
> but CPL0 can use full 4KB (8KB), without NMI being limited by 1KB.
Ughh, I really need the motivation to fiddle with
the CPL1 stack separately. I don't want to limit the
NMI handler, and 0x400 was just a random value. Maybe
just enlarging it will resolve the problem?

>> + pushl 20(%esp); \
>> + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
>> + orl $IOPL1_MASK, (%esp); \
> Does this work if userspace runs with iopl 3,2 or 1? Does iret on CPL1
Yes, that should work I think.

> set iopl level properly (i.e. is iopl set if IOPL <= CPL, or only if CPL=0?).
Ring-1 code can't alter the IOPL, but then I set
IOPL3 initially. I just use ring3_IOPL|IOPL1, which
should be able to handle all cases (i.e. I do
"pushl 20(%esp)" exactly only to get the Ring-3 IOPL,
I don't care about the other flags there).

> I'm not sure about lea speed. And fast path should NOT jump,
> forward jumps are initialy predicted as not taken... Maybe:
No problems.

> And then you can move MAKE_ESPFIX out of macro.
Good idea! :)

> And it seems to me that it adds too many operations to fast path.
Ughh, 8 insns only! Is this really too much? :(

> Maybe you want to introduce some per-thread flag,
I don't know. Probably not. I'll think about that.

> or leave syscall path
> to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).
Yes. I'll see how that can be done.

>> - if (regs->xcs & 3)
>> + if (regs->xcs & 2)
>> printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);
> I think you should leave this as is, regs->xss/regs->esp should be valid
> for CPL1 exceptions.
Very probably, I was not sure about all that
s/3/2/g changes I made. So I'll remove those
you mentioned, thanks. It was just that the
kernel/user code should now be considered by
regs->xcs&2 rather than regs->xcs&3, so I just
replaced it everywhere. I'll have a closer look.

Also I was suggested privately to try doing
everything at Ring-0 after all (unfortunately
the guy have not CCd the message to LKML).
I know that we use Ring-1 so that the NMI
handler to work on the small stack, but the
guy writes:

... but i think linux
already has a task gate for NMI, or if not, it should.

I should have a look also in that direction I think.

2004-09-23 17:03:18

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi.

I'll reply to that mail again because my previous
reply was done in a big hurry.
The new patch is attached. I applied most of the
optimizations you suggested and reverted the things
you said were unnecessary/wrong - thanks.
Does this one look better?

Petr Vandrovec wrote:
>> 1. I am not allocating the ring1 stack separately,
>> I am allocating it on a ring0 stack. Much simpler
>> code, although non-reentrant/preempt-unsafe.
> I think that it is problem. And do not forget that NMI can occur at any
> place, so I'm not quite sure how your code works.
I assume this was a misunderstanding. Now I use
"preempt_stop" instead of "cli" to make it more
obvious what am I doing. So I assume that part is
preempt-unsafe, but not NMI-unsafe.

>> +#define NMI_STACK_RESERVE 0x400
>> +#define MAKE_ESPFIX \
>> + cli; \
>> + subl $NMI_STACK_RESERVE, %esp; \
> This is ugly. What if NMI handler uses more than 1KB?
In that new patch I set the const to 0xe00, which
is 3,5K. Is it still the limitation? I can probably
raise it ever higher, but I am not sure if there is
some sensible data below the stack that I may overwrite
ocasionally?

> And ESP is not
> multiple of 4 here - it may slow down processor a bit.
Fixed with "pushl/movw".

> I'm not sure about lea speed. And fast path should NOT jump,
> forward jumps are initialy predicted as not taken... Maybe:
Done. Looks much better now!

> And then you can move MAKE_ESPFIX out of macro.
He-he, but RESTORE_ALL is a macro itself, so... :)

> Maybe you want to introduce some per-thread flag,
What kind of flag do you have in mind? Flag the
modify_ldt() calls that set some segment to 16bit?
But is it really worth the 8 insns I use? Checking
the per-thread flag will cost you ~5 insns anyway,
and a headache...

> or leave syscall path
> to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).
I wanted to do that, but then recalled the comment
of Pavel Machek, who thinks exporting the %esp
value to user-space is an "informational leak".
I personally don't see the security threat here,
but it looks safe to assume that Pavel is right.
In which case I think we have to root out the
bug from every path, including the syscall. So I
left that place as is for now.

>> - if (regs->xcs & 3)
>> + if (regs->xcs & 2)
>> printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);
> I think you should leave this as is, regs->xss/regs->esp should be valid
> for CPL1 exceptions.
OK, everything you mentioned, is reverted.

> Rest looks fine, but I'm not completely sure that SMP and NMIs are
> handled in the best possible way.
Well, even though my patch can be improved, I don't
feel quite comfortable with its complexity.
Now I really want to re-assure myself that this all
is not possible to do on ring0, in which case this
will simply be the trivial thing.
Anonymous LKML reader (whom I have to thank for the
hints) proposed 2 things:
1. Try lss followed by iret. In that case the interrupt
will not kill us because after lss the interrupts are
not accepted until the next insn is executed, AFAIK.
But he was not sure, and I don't know either, if it is
true even for NMI. So if it is - we could just do
lss/iret, and the problem is solved. So is this true
for NMI? Is this documented anywhere?
2. Set task gate for NMI. AFAICS, the task-gate is
now used only for the doublefault exception, but not
for NMI. If I understand the idea correctly, this
will guarantee that the NMI will be executing on a
separate stack, which may be a good idea in any case,
and will allow us to use the small stack at ring-0,
if only we disable the interrupts. Are there any
chances to use the task-gate for NMI? I may even try to
implement that myself, but I guess there are *reasons*
why it is not yet? (of course this applies only if 1.
fails)

Or maybe somehow to modify the NMI handler itself,
so that it will check if it is on a "small" stack,
and switch to the "big" stack before anything else?
Well, doing the whole trick at ring-0 sounds really
plausible to me...


Attachments:
linux-2.6.8-stacks3.diff (8.30 kB)

2004-09-23 18:09:49

by Petr Vandrovec

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Thu, Sep 23, 2004 at 09:08:54PM +0400, Stas Sergeev wrote:
> Petr Vandrovec wrote:
> >>1. I am not allocating the ring1 stack separately,
> >>I am allocating it on a ring0 stack. Much simpler
> >>code, although non-reentrant/preempt-unsafe.
> >I think that it is problem. And do not forget that NMI can occur at any
> >place, so I'm not quite sure how your code works.
> I assume this was a misunderstanding. Now I use
> "preempt_stop" instead of "cli" to make it more
> obvious what am I doing. So I assume that part is
> preempt-unsafe, but not NMI-unsafe.
>
> >>+#define NMI_STACK_RESERVE 0x400
> >>+#define MAKE_ESPFIX \
> >>+ cli; \
> >>+ subl $NMI_STACK_RESERVE, %esp; \
> >This is ugly. What if NMI handler uses more than 1KB?
> In that new patch I set the const to 0xe00, which
> is 3,5K. Is it still the limitation? I can probably
> raise it ever higher, but I am not sure if there is
> some sensible data below the stack that I may overwrite
> ocasionally?

For 4KB stacks 2KB looks better (if NMIs do not use
its own stacks).

> >Rest looks fine, but I'm not completely sure that SMP and NMIs are
> >handled in the best possible way.
> Well, even though my patch can be improved, I don't
> feel quite comfortable with its complexity.
> Now I really want to re-assure myself that this all
> is not possible to do on ring0, in which case this
> will simply be the trivial thing.
> Anonymous LKML reader (whom I have to thank for the
> hints) proposed 2 things:
> 1. Try lss followed by iret. In that case the interrupt
> will not kill us because after lss the interrupts are
> not accepted until the next insn is executed, AFAIK.
> But he was not sure, and I don't know either, if it is
> true even for NMI. So if it is - we could just do
> lss/iret, and the problem is solved. So is this true
> for NMI? Is this documented anywhere?

LSS is exempted from rule that interrupts cannot occur after
operation modifying SS, as with LSS you are supposed to
load SS together with ESP in one operation, and so you
do not need interrupt holdoff. And when exception
occurs on iret, it will happen with 16bit stack, and
you are dead...

> 2. Set task gate for NMI. AFAICS, the task-gate is
> now used only for the doublefault exception, but not
> for NMI. If I understand the idea correctly, this
> will guarantee that the NMI will be executing on a
> separate stack, which may be a good idea in any case,
> and will allow us to use the small stack at ring-0,
> if only we disable the interrupts. Are there any
> chances to use the task-gate for NMI? I may even try to
> implement that myself, but I guess there are *reasons*
> why it is not yet? (of course this applies only if 1.
> fails)

Yes. But you still have to handle exceptions on iret
to userspace :-(

> Or maybe somehow to modify the NMI handler itself,
> so that it will check if it is on a "small" stack,
> and switch to the "big" stack before anything else?
> Well, doing the whole trick at ring-0 sounds really
> plausible to me...

If you can do that, great. But you have to modify
at least NMI, GP and SS fault handlers to reload
SS/ESP with correct values.

> diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
> --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-10 13:28:35.000000000 +0400
> +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-09-23 16:14:55.209520160 +0400
> @@ -122,8 +129,23 @@
> .previous
>
>
> +#define NMI_STACK_RESERVE 0xe00
> +/* If returning to Ring-3, not to V86, and with
> + * the small stack, try to fix the higher word of
> + * ESP, as the CPU won't restore it from stack.
> + * This is an "official" bug of all the x86-compatible
> + * CPUs, which we can try to work around to make
> + * dosemu happy. */
> #define RESTORE_ALL \
> - RESTORE_REGS \
> + movl EFLAGS(%esp), %eax; \
> + movb CS(%esp), %al; \
> + andl $(VM_MASK | 2), %eax; \
> + cmpl $2, %eax; \
> + jne 3f; \
> + larl OLDSS(%esp), %eax; \
> + testl $0x00400000, %eax; \
> +3: RESTORE_REGS \
> + jz 8f; \
> addl $4, %esp; \
> 1: iret; \
> .section .fixup,"ax"; \
> @@ -137,8 +159,43 @@
> .section __ex_table,"a";\
> .align 4; \
> .long 1b,2b; \
> -.previous
> -
> +.previous; \
> + /* preparing the ESPfix here */ \
> + /* reserve some space on stack for the NMI handler */ \
> +8: subl $(NMI_STACK_RESERVE-4), %esp; \
> + .rept 5; \

Any reason why you left this part of RESTORE_ALL macro? It can
be moved out, IMHO. RESTORE_ALL is used twice in entry.S, so you
could save one copy. Though I'm not sure why NMI handler simple
does not jump to RESTORE_ALL we already have.

> + pushl NMI_STACK_RESERVE+16(%esp); \
> + .endr; \
> + pushl %eax; \
> + movl %esp, %eax; \
> + pushl %ebp; \
> + preempt_stop; \
> + GET_THREAD_INFO(%ebp); \
> + movl TI_cpu(%ebp), %ebp; \
> + shll $GDT_SIZE_SHIFT, %ebp; \
> + /* find GDT of the proper CPU */ \
> + addl $cpu_gdt_table, %ebp; \

I did not know we have per-CPU GDT tables... This explains it.

> + /* patch the base of the ring-1 16bit stack */ \
> + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \
> + shrl $16, %eax; \
> + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \
> + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \
> + popl %ebp; \
> + popl %eax; \
> + /* push the ESP value to preload on ring-1 */ \
> + pushl 12(%esp); \
> + movw $4, (%esp); \
> + /* push the iret frame for our ring-1 trampoline */ \
> + pushl $__ESPFIX_SS; \
> + pushl $0; \
> + /* push ring-3 flags only to get the IOPL right */ \
> + pushl 20(%esp); \
> + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
> + /* we need at least IOPL1 to re-enable interrupts */ \
> + orl $IOPL1_MASK, (%esp); \
> + pushl $__ESPFIX_CS; \
> + pushl $espfix_trampoline; \
> + iret;

FYI, on my system (P4/1.6GHz) 100M loops of these pushes takes 1.20sec
while written with subl $24,%esp and then doing movs to xx(%esp) takes
0.94sec. Plus you could then reorganize code a bit (first do 5 pushes
to copy stack, then subtract 24 from esp, and push eax/ebp after that.
This way you can use %eax for computations you currently do in memory
(which are probably most important for 27% slowdown between
pushes and movs - I wrote mov code with using eax for EFLAGS1 & ESP1).

> - .quad 0x0000000000000000 /* 0xd0 - unused */
> - .quad 0x0000000000000000 /* 0xd8 - unused */
> + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix CS */
> + .quad 0x0000b2000000ffff /* 0xd8 - ESPfix SS */

Set SS limit to 23 bytes, so misuse can be quickly catched? As SP=0 and
we use 16bit stack, it goes from 0000 => FFFE, and you could corrupt
memory 64KB beyond top of CPL1 stack.
Best regards,
Petr Vandrovec


2004-09-24 20:37:27

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi,

Petr Vandrovec wrote:
> In that new patch I set the const to 0xe00, which
> is 3,5K. Is it still the limitation? I can probably
> For 4KB stacks 2KB looks better
OK, done that. Wondering though, for what?
I don't need 2K myself, I need 24 bytes only.
So what prevents me to raise the gap to 3.5K
or somesuch? Why 2K looks better?

>> 2. Set task gate for NMI. AFAICS, the task-gate is
> Yes. But you still have to handle exceptions on iret
> to userspace :-(
Yes. Thanks, I see the problem now. And I
suppose setting the task-gates also for all
that exceptions is not an option...

>> Or maybe somehow to modify the NMI handler itself,
> If you can do that, great. But you have to modify
> at least NMI, GP and SS fault handlers to reload
> SS/ESP with correct values.
Yes, that's the problem either. Doesn't look too
difficult (I'll have to look up TSS for the stack
pointer I guess, then do lss to it), but probably
modifying all that handlers for only that small
purpose is an overkill...

>> +.previous; \
>> + /* preparing the ESPfix here */ \
>> + /* reserve some space on stack for the NMI handler */ \
>> +8: subl $(NMI_STACK_RESERVE-4), %esp; \
>> + .rept 5; \
> Any reason why you left this part of RESTORE_ALL macro?
The reason is that the previous part of the macro
can jump to that part. So how can I divide those?

> be moved out, IMHO. RESTORE_ALL is used twice in entry.S, so you
> could save one copy.
Do you mean the NMI return path doesn't need
the ESP fix at all? Why?

> Though I'm not sure why NMI handler simple
> does not jump to RESTORE_ALL we already have.
I can only change that and then un-macro the
RESTORE_ALL completely. So I did that.
Additionally I introduced the "short path"
for the case where we know for sure that we
are returning to the kernel. And I am not
setting the exception handler there because
returning to the kernel if fails, should die()
anyway. Is this correct?

>> + pushl 20(%esp); \
>> + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
>> + /* we need at least IOPL1 to re-enable interrupts */ \
>> + orl $IOPL1_MASK, (%esp); \
>> + pushl $__ESPFIX_CS; \
>> + pushl $espfix_trampoline; \
> FYI, on my system (P4/1.6GHz) 100M loops of these pushes takes 1.20sec
> while written with subl $24,%esp and then doing movs to xx(%esp) takes
> 0.94sec.
OK, I wasn't sure what pushes do you mean.
I supposed you wanted me to replace those
5 pushes that copy the stack frame.

> Plus you could then reorganize code a bit (first do 5 pushes
> to copy stack, then subtract 24 from esp, and push eax/ebp after that.
> This way you can use %eax for computations you currently do in memory
Done, thanks! Does this help your test-case?

>> + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix CS */
>> + .quad 0x0000b2000000ffff /* 0xd8 - ESPfix SS */
> Set SS limit to 23 bytes, so misuse can be quickly catched?
Yes!

So the new patch is attached:)


Attachments:
linux-2.6.8-stacks4.diff (9.70 kB)

2004-09-24 21:44:29

by Petr Vandrovec

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Sat, Sep 25, 2004 at 12:36:15AM +0400, Stas Sergeev wrote:
> Hi,
>
> Petr Vandrovec wrote:
> >In that new patch I set the const to 0xe00, which
> >is 3,5K. Is it still the limitation? I can probably
> >For 4KB stacks 2KB looks better
> OK, done that. Wondering though, for what?
> I don't need 2K myself, I need 24 bytes only.
> So what prevents me to raise the gap to 3.5K
> or somesuch? Why 2K looks better?

You run with ESP decreased by 2KB for some time during
CPL1 stack setup. As you run in this part at CPL0
with same setup as on CPL1, I think that you should
offer same stack for setup code, and for CPL1 code,
and so each should get 2KB.

> >>+8: subl $(NMI_STACK_RESERVE-4), %esp; \
> >>+ .rept 5; \
> >Any reason why you left this part of RESTORE_ALL macro?
> The reason is that the previous part of the macro
> can jump to that part. So how can I divide those?

Use real labels instead of numeric local labels. That way
macro does jne cpl1exit, and you create cpl1exit: label outside
of macro, somewhere in entry.S...

> >be moved out, IMHO. RESTORE_ALL is used twice in entry.S, so you
> >could save one copy.
> Do you mean the NMI return path doesn't need
> the ESP fix at all? Why?

No. I was attempting to say that RESTORE_ALL is on two
places to save jumps (I cannot think about any other reason),
and so cpl1exit could be shared:

#define RESTORE_ALL \
movl EFLAGS(%esp),%eax \
.... \
jne 3f \
lar OLDSS(%esp),%eax \
testl $0x00400000,%eax \
jz cpl1exit \
3: RESTORE_REGS \
add $4,%esp \
1: iret \
.section __ex_table,"a";\
.align 4; \
long 1b,badss; \
.previous


badss: sti;
movl $(__USER_DS), %edx;
movl %edx, %ds;
movl %edx, %es;
pushl $11;
call do_exit;

cpl1exit:
RESTORE_REGS
subl $...,%esp
...

But your solution with killing RESTORE_ALL completely is also
fine - you just could reorder jumps a bit, so return to kernel
or VM86 is one (not taken) jump shorter than it is with current
version of patch.

> >Though I'm not sure why NMI handler simple
> >does not jump to RESTORE_ALL we already have.
> I can only change that and then un-macro the
> RESTORE_ALL completely. So I did that.
> Additionally I introduced the "short path"
> for the case where we know for sure that we
> are returning to the kernel. And I am not
> setting the exception handler there because
> returning to the kernel if fails, should die()
> anyway. Is this correct?

If you'll first test result of lar, and then
you'll do two independent RESTORE_REGS (second
one actually does not have to restore %ebp
and %eax and you can restore them directly by
move from stack after you are done with trampoline
setup, instead of doing push + pop), you can reuse
RESOTRE_REGS+add $4,%esp+iret return path, including
its exception handling.

> +ENTRY(espfix_trampoline)
> + popl %esp
> +espfix_past_esp:
> +1: iret
> +.section .fixup,"ax"
> +2: sti
> + movl $(__USER_DS), %edx
> + movl %edx, %ds
> + movl %edx, %es
> + pushl $11
> + call do_exit
> +.previous

You can reuse fixup code from other iret instance,
just give it real label and reference it from __ex_table
instead of '2b'.

Patch looks fine, though you could speedup it a bit as outlined above.
Now you have to persuade others that this patch should include patch
into the kernel.
Petr

2004-09-25 08:08:53

by Gabriel Paubert

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Fri, Sep 24, 2004 at 11:43:30PM +0200, Petr Vandrovec wrote:
> On Sat, Sep 25, 2004 at 12:36:15AM +0400, Stas Sergeev wrote:
> > Hi,
> >
> > Petr Vandrovec wrote:
> > >In that new patch I set the const to 0xe00, which
> > >is 3,5K. Is it still the limitation? I can probably
> > >For 4KB stacks 2KB looks better
> > OK, done that. Wondering though, for what?
> > I don't need 2K myself, I need 24 bytes only.
> > So what prevents me to raise the gap to 3.5K
> > or somesuch? Why 2K looks better?
>
> You run with ESP decreased by 2KB for some time during
> CPL1 stack setup. As you run in this part at CPL0
> with same setup as on CPL1, I think that you should
> offer same stack for setup code, and for CPL1 code,
> and so each should get 2KB.

Maybe I miss something, but it seems that lret (or retl)
is not affected by this bug. What prevents you from reordering
the stack (doing the inverse operation of what the lcall7 entry
point does) and then doing:

popfl
lret

Yes, I see issues with debugging (trap flag mostly) but I believe
that they are solvable.

Regards,
Gabriel

2004-09-25 12:25:23

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hello.

Gabriel Paubert wrote:
> Maybe I miss something, but it seems that lret (or retl)
> is not affected by this bug.
Petr Vandrovec says (he forgot to CC that
to LKML I think):
---
Looking at VMware's code it seems that RETF suffers from
this bug too...
---

I tested that - he is right, and Intel docs
make no sense as to not mentioning this.

2004-09-25 19:28:13

by Gabriel Paubert

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Sat, Sep 25, 2004 at 04:25:43PM +0400, Stas Sergeev wrote:
> Hello.
>
> Gabriel Paubert wrote:
> >Maybe I miss something, but it seems that lret (or retl)
> >is not affected by this bug.
> Petr Vandrovec says (he forgot to CC that
> to LKML I think):

At least I did not see it.

> ---
> Looking at VMware's code it seems that RETF suffers from
> this bug too...
> ---
>
> I tested that - he is right, and Intel docs
> make no sense as to not mentioning this.

I suspected that they behaved differently because the
pseudocode in iret's description is quite different
(for iret, it even does not mention restoring ESP!).
But if you expect Intel's doc, or that of any manufacturer
for the matter, to tell the whole truth, you're na?ve.

Is ESP really properly restored for V86 bmode or is it that
it does not hit the case of a default 32 bit code segment with
a 16 bit stack?

I'm absolutely amazed by the fact that this bug has been there
since the beginning and only seems to hit users right now.

I don't like adding an intermediate privilege level, but
it looks hard to do through the vdso, and you always have to
push something on the final stack. A hardware task switch
might work, but it has problems with races on the segment
descriptors.

Anyway, I've just read again Intel's doc about mixing 16 and 32 bit
code and I have found the understament of the day:

"For most efficient and trouble-free operation of the processor, 32-bit
^^^^^^^^^^^^
programs or tasks should have the D flag in the code-segment descriptor
and the B flag in the stack-segment descriptor set, and 16-bit programs
or tasks should have these flags clear. Program control transfers from
16-bit segments to 32-bit segments (and vice versa) are handled most
efficiently through call, interrupt, or trap gates."


Regards,
Gabriel

2004-09-25 20:37:42

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi.

Gabriel Paubert wrote:
> Is ESP really properly restored for V86 bmode or is it that
> it does not hit the case of a default 32 bit code segment with
> a 16 bit stack?
It does not restore it in any case for the 16bit
stack (no matter whether the code is 16 or 32 in PM).
Petr says V86 is not affected, but I have not tested
that case because why to care? The problem is only for
the 32bit code. For 16bit code (PM or V86) it just
doesn't really matter I think (I don't think using
prefixes for ESP is sane).

> I'm absolutely amazed by the fact that this bug has been there
> since the beginning and only seems to hit users right now.
No, right now it just hits me:)
It used to hit the dosemu users always, but people just
blamed dosemu itself and that was it. Noone wanted
to spend weeks traceing the DOS programs under dosemu,
then traceing dosemu itself, then traceing kernel,
then looking through the docs to track the problem down
to something known, then writing to Intel's techsup for
clarifications, and then writing to LKML:) And if not
for the great help I got here, this will end up nowhere
again. So that's how it used to stay "unnoticed" for
years.
As for the other instances of that problem, here are some:

http://www.tenberry.com/dos4g/watcom/rn4gw.html
---
B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
which mangles the high word of ESP.
---

http://clio.rice.edu/djgpp/r5bug04.txt
---
On VCPI servers which mode switch using ESP upper bits, CWSDPMI does not
clear those bits when swapping to it's own stack.
---

Searching google reveals quite a few implicit
instances of that problem.

> Anyway, I've just read again Intel's doc about mixing 16 and 32 bit
> code and I have found the understament of the day:
> "For most efficient and trouble-free operation of the processor,
> ^^^^^^^^^^^^
What is to expect? They knew there are troubles,
yet decided to make it a part of the specs...

2004-09-25 23:48:41

by Gabriel Paubert

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi,

> It does not restore it in any case for the 16bit
> stack (no matter whether the code is 16 or 32 in PM).
> Petr says V86 is not affected, but I have not tested
> that case because why to care? The problem is only for
> the 32bit code. For 16bit code (PM or V86) it just
> doesn't really matter I think (I don't think using
> prefixes for ESP is sane).

Well, thrashing a register at any time under the user
just because an interrupt happened is even less sane ;-)

>
> >I'm absolutely amazed by the fact that this bug has been there
> >since the beginning and only seems to hit users right now.
> No, right now it just hits me:)
> It used to hit the dosemu users always, but people just
> blamed dosemu itself and that was it. Noone wanted
> to spend weeks traceing the DOS programs under dosemu,
> then traceing dosemu itself, then traceing kernel,
> then looking through the docs to track the problem down
> to something known, then writing to Intel's techsup for
> clarifications, and then writing to LKML:) And if not
> for the great help I got here, this will end up nowhere
> again. So that's how it used to stay "unnoticed" for
> years.

I see. And finally we know that the problem with the
processor and that Intel changed the spec to conform
to what the hardware does but is rather coy about it.

> As for the other instances of that problem, here are some:
>
> http://www.tenberry.com/dos4g/watcom/rn4gw.html
> ---
> B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
> which mangles the high word of ESP.

But if the bug is also affects Windows DOS box, it means that
V86 is affected too, no?

I'd like to know what OS/2 did. The DOS boxes and 16 bit mode
DPMI applications ran very well and it was very stable, despite
the fact that the kernel spent its time switching between 16
and 32 bit modes (for example, almost all device drivers were
16 bit code, but the drivers for DOS emulation were 32 bit).
But I removed it from all my machines several years ago.


> http://clio.rice.edu/djgpp/r5bug04.txt
> ---
> On VCPI servers which mode switch using ESP upper bits, CWSDPMI does not
> clear those bits when swapping to it's own stack.
> ---
>
> Searching google reveals quite a few implicit
> instances of that problem.
>
> >Anyway, I've just read again Intel's doc about mixing 16 and 32 bit
> >code and I have found the understament of the day:
> >"For most efficient and trouble-free operation of the processor,
> > ^^^^^^^^^^^^
> What is to expect? They knew there are troubles,
> yet decided to make it a part of the specs...

Of which specs? It is never clearly stated in the whole
4 volume doc of the architecture. Only in an obscure
specification change...

Gabriel

2004-09-26 18:01:34

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Hi.

Gabriel Paubert wrote:
>> the 32bit code. For 16bit code (PM or V86) it just
>> doesn't really matter I think (I don't think using
>> prefixes for ESP is sane).
> Well, thrashing a register at any time under the user
> just because an interrupt happened is even less sane ;-)
OK, I agree, so I tested that: wrote some value
to the higher word of ESP in the struct that
dosemu passes to vm86() syscall, let the prog
to run for a while, and the value was still there.
It is not 100% reliable test because I can't
guarantee the vm86() was interrupted during the
period I was waiting (because vm86() is executed
for the short durations, then exits on fault or
signal), but it looks OK and why not to beleive
Petr that v86 is not affected.

>> http://www.tenberry.com/dos4g/watcom/rn4gw.html
>> ---
>> B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
>> which mangles the high word of ESP.
> But if the bug is also affects Windows DOS box, it means that
> V86 is affected too, no?
No. This URL is about dos4gw, so that's about a
prot. mode either.

> I'd like to know what OS/2 did.
AFAIK also WinNT is not affected. Not sure though.

> The DOS boxes and 16 bit mode
> DPMI applications ran very well and it was very stable, despite
Hey. It is not about 16bit DPMI mode, it is
about the 32bit DPMI mode. That's the whole
problem. Be it only about the 16bit DPMI mode,
the problem may not ever harm anything at all.
But for 32bit mode that's quite a problem.

2004-09-27 09:10:19

by Gabriel Paubert

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Sun, Sep 26, 2004 at 10:04:55PM +0400, Stas Sergeev wrote:

Hi,

On Sun, Sep 26, 2004 at 10:04:55PM +0400, Stas Sergeev wrote:
> OK, I agree, so I tested that: wrote some value
> to the higher word of ESP in the struct that
> dosemu passes to vm86() syscall, let the prog
> to run for a while, and the value was still there.
> It is not 100% reliable test because I can't
> guarantee the vm86() was interrupted during the
> period I was waiting (because vm86() is executed
> for the short durations, then exits on fault or
> signal), but it looks OK and why not to beleive
> Petr that v86 is not affected.

Should be enough, you'll likely get at least a timer
interrupt in the middle of v86 mode sooner or later.

>
> >>http://www.tenberry.com/dos4g/watcom/rn4gw.html
> >>---
> >>B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
> >> which mangles the high word of ESP.
> >But if the bug is also affects Windows DOS box, it means that
> >V86 is affected too, no?
> No. This URL is about dos4gw, so that's about a
> prot. mode either.

I see, I missed the 32 in mouse32.

>
> >I'd like to know what OS/2 did.
> AFAIK also WinNT is not affected. Not sure though.
>
> >The DOS boxes and 16 bit mode
> >DPMI applications ran very well and it was very stable, despite
> Hey. It is not about 16bit DPMI mode, it is
> about the 32bit DPMI mode. That's the whole
> problem. Be it only about the 16bit DPMI mode,
> the problem may not ever harm anything at all.
> But for 32bit mode that's quite a problem.

Well, I did not express myself correctly. OS/2 was probably the
most frightening mix of 16 and 32 bit code ever produced, so
they certainly had to work around the problem. There
were literally tons of "thunks" to interface both modes
and the LDT was set up in a way which made the translation
of addresses relatively easy:

off32 = (seg16>>3) + off16

Therefore the base address of LDT segment n was n*64k,
limiting the addressable memory to 512MB in all versions
I used (later versions lifted this limit), but making
conversions of 16 to 32 bit addresses and back quite
efficient. However, I can't remember exactly how the
thunks worked.

Gabriel


Gabriel

2004-09-28 15:44:26

by Denis Vlasenko

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

On Wednesday 22 September 2004 23:13, Richard B. Johnson wrote:
> Well DOSEMU uses VM-86 mode. That's how it works. It
> creates a virtual 8086 machine, complete with the
> required "DOS compatible" virtual BIOS environment.

You forgot DOS extenders which do not use VM86
but 32bit protected mode.

> I use it all the time because I write, amongst other things,
> the complete BIOS and startup code for many Intel based
> machines.
>
> I run compilers, assemblers, linkers, and editors in that
> environment and it works.

The fact that those programs work fine does not automatically mean
that _other_ DOS programs will work.

> Sombody mentions a completely unrelated so-called Intel
> bug and next thing you know, there are patches to work-
> around the bug???
>
> The bug doesn't exist period.

Sorry, but it does exist. It is really obscure bug.
Please read on:

> There is a "semi-official" bug in Intel CPUs,
> which is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
> chapter "Specification Clarifications"
> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
>
> A short quote:
> "ISSUE: When a 32-bit IRET is used to return to another privilege level,
> and the old level uses a 4G stack (D/B bit in the segment register = 1),
> while the new level uses a 64k stack (D/B bit = 0), then only the
> lower word of ESP is updated."
>
> Which means that the higher word of ESP gets
> trashed. This bug beats dosemu rather badly,
> but there seem to be not much info about that
> bug available on the net.

IRET-to-lower-CPL-level stack frame:

+16 old_SS
+12 old_ESP
+8 old_EFLAGS
+4 old_CS
+0 old_EIP <--- SS:ESP

> > If old_SS references 'small' data segment (16-bit one),
> > processor does not restore ESP from old_ESP.
> > It restores lower 16 bits only. Upper 16 bits are filled
> > with garbage (or just not modified at all?).
>
> Not modified at all, yes. That's why it is always
> greater than TASK_SIZE (0xc0000000) - it is still
> in a kernel space.
>
> > This works fine because processor uses SP, not ESP,
> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
>
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...
--
vda

2004-09-30 15:10:27

by Bill Davidsen

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected?

Gabriel Paubert wrote:
> Hi,
>
>
>>It does not restore it in any case for the 16bit
>>stack (no matter whether the code is 16 or 32 in PM).
>>Petr says V86 is not affected, but I have not tested
>>that case because why to care? The problem is only for
>>the 32bit code. For 16bit code (PM or V86) it just
>>doesn't really matter I think (I don't think using
>>prefixes for ESP is sane).
>
>
> Well, thrashing a register at any time under the user
> just because an interrupt happened is even less sane ;-)
>
>
>>>I'm absolutely amazed by the fact that this bug has been there
>>>since the beginning and only seems to hit users right now.
>>
>>No, right now it just hits me:)
>>It used to hit the dosemu users always, but people just
>>blamed dosemu itself and that was it. Noone wanted
>>to spend weeks traceing the DOS programs under dosemu,
>>then traceing dosemu itself, then traceing kernel,
>>then looking through the docs to track the problem down
>>to something known, then writing to Intel's techsup for
>>clarifications, and then writing to LKML:) And if not
>>for the great help I got here, this will end up nowhere
>>again. So that's how it used to stay "unnoticed" for
>>years.
>
>
> I see. And finally we know that the problem with the
> processor and that Intel changed the spec to conform
> to what the hardware does but is rather coy about it.
>
>
>>As for the other instances of that problem, here are some:
>>
>>http://www.tenberry.com/dos4g/watcom/rn4gw.html
>>---
>>B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
>> which mangles the high word of ESP.
>
>
> But if the bug is also affects Windows DOS box, it means that
> V86 is affected too, no?
>
> I'd like to know what OS/2 did. The DOS boxes and 16 bit mode
> DPMI applications ran very well and it was very stable, despite
> the fact that the kernel spent its time switching between 16
> and 32 bit modes (for example, almost all device drivers were
> 16 bit code, but the drivers for DOS emulation were 32 bit).
> But I removed it from all my machines several years ago.

May I suggest that IBM is a friend of Linux, and that *if* there is a
simple way to get around the problem they will probably be very willing
to share it with us.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-10-06 16:13:54

by Stas Sergeev

[permalink] [raw]
Subject: Re: ESP corruption bug - what CPUs are affected? (patch attempts)

Hi,

Petr Vandrovec wrote:
>> The reason is that the previous part of the macro
>> can jump to that part. So how can I divide those?
> Use real labels instead of numeric local labels.
OK, I am really sorry for asking such a crap :(

>> returning to the kernel if fails, should die()
>> anyway. Is this correct?
> If you'll first test result of lar, and then
> you'll do two independent RESTORE_REGS (second
> setup, instead of doing push + pop), you can reuse
> RESOTRE_REGS+add $4,%esp+iret return path, including
> its exception handling.
Yes, but I was wondering whether we really do
need an exception handler when returning to
kernel. There should be no fault there, and
if there is - it should be an Oops, I think.
Nevertheless, I did what you said. Perhaps
something like die_if_kernel() is somewhere
down the road anyway.

> Patch looks fine, though you could speedup it a bit as outlined above.
New patch is attached.

> Now you have to persuade others that this patch should include patch
> into the kernel.
Yes, if not for that anonymous guy, who kept posting
to me until he finally convinced me that the Ring-0
approach is not that difficult at all.
So I tried... It was much more difficult to code
up, but at the end it looks a little better
and localized to entry.S completely. OTOH it
touches the exception handlers, but not too much -
it adds only 5 insns on the fast path. And the
code is very fragile, but after I made all the
magic numbers a #define consts, it actually looks
not so bad.
I don't know which patch is really better, so
I am attaching both.
Any ideas on which one should I have to continue
working on and what improvements can be done to
either one?


Attachments:
linux-2.6.8-stacks5.diff (9.54 kB)
linux-2.6.8-stk0-2.diff (7.23 kB)
Download all attachments