Hi Alan.
Attached patch works around the corruption
of the high word of the ESP register, which
is the official bug of x86 CPUs. The bug
triggers only when the one is using the
16bit stack segment, and is described here:
http://www.intel.com/design/intarch/specupdt/27287402.PDF
Patch helps running many apps under dosemu,
and, according to the comments found in
Wine sources, also helps Wine.
Also the patch closes the "information leak",
which is that the half of the kernel's %esp
value gets leaked to user-space.
The initial discussion about the problem
can be found here:
http://lkml.org/lkml/2004/9/16/254
On a later stages the development of that
patch was driven mainly by Linus, but it
was in private.
This patch adds only 6 instructions for
the fast kernel-->user return path, 6
instructions on an exception path and
5 instructions on an NMI path. All the
rest of the patch gets executed only
for the two apps in that world, namely,
dosemu and wine (and VMWare?).
That's why it should be relatevely easy
to make sure the patch doesn't do any
harm for the normal apps, and so it is
safe, and probably fits into an -ac tree.
How it works:
- Allocates the per-cpu data for 16bit
stacks and sets the per-cpu GDT entries
for them.
- On a return to userspace, checks whether
the SS is from LDT and is 16 bit.
- If it is, the iret frame gets copied
to the 16bit per-cpu stack and stack gets
switched to the 16bit one, while the
higher word of %esp gets preloaded with
the proper value.
- On an exceptions the check is performed
to see if we are on a 16bit stack, and
if we are - switch to the 32bit one.
Alan, would it be possible to apply that
patch to an -ac tree?
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Petr Vandrovec <[email protected]>
Signed-off-by: Stas Sergeev <[email protected]>
On Sun, 13 Mar 2005, Stas Sergeev wrote:
> Hi Alan.
>
> Attached patch works around the corruption
> of the high word of the ESP register, which
> is the official bug of x86 CPUs. The bug
> triggers only when the one is using the
> 16bit stack segment, and is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
Does the bug also egsist on AMD CPU's? Does the patch add anything to
kernels compiled for AMD CPU's?
Thanks,
Grzegorz Kulewski
Hello.
Grzegorz Kulewski wrote:
> Does the bug also egsist on AMD CPU's?
Yes. As well as the ones of a Transmeta etc.
I just haven't tested the old Cyrixes, that
AFAIK were trying to ignore some Intel bugs.
The test-case for the bug is here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0409.2/0690.html
> Does the patch add anything to
> kernels compiled for AMD CPU's?
Same as for the Intel ones - unless you are
a dosemu or, in a lesser extent, Wine user -
nothing except for fixing the small "information
leak".
If you are the dosemu user however, then this
patch adds a lot. Whether or not it adds
something for the VMWare users I can't say, since
I am not the one of them, but my guess is that
it can help with the DOS programs under it.
Stas Sergeev wrote:
> Hello.
>
> Grzegorz Kulewski wrote:
>
>> Does the bug also egsist on AMD CPU's?
>
> Yes. As well as the ones of a Transmeta etc.
> I just haven't tested the old Cyrixes, that
> AFAIK were trying to ignore some Intel bugs.
> The test-case for the bug is here:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0409.2/0690.html
I've just ran that on my Cyrix MII PR300 and the bug is present:
old_ss=0x7b new_ss=0x7f
In sighandler: esp=bffff780
old_esp=0xbffff780 new_esp=0xc1a6f780
BUG!
I have also an older Cyrix CPU - 6x86 PR166 - but can't test it now as
it's sitting in a plastic box on the shelf :-)
UMC U5SX/33 in my router - also present:
old_ss=0x7b new_ss=0x7f
In sighandler: esp=bffff820
old_esp=0xbffff820 new_esp=0xc003f820
BUG!
--
Ondrej Zary
Hello.
Ondrej Zary wrote:
> I've just ran that on my Cyrix MII PR300 and the bug is present:<>
> UMC U5SX/33 in my router - also present:
Thanks, now I know for sure that it exist
everywhere.
Now you can apply the patch and make sure
the bug goes away:)
Hi!
> >Attached patch works around the corruption
> >of the high word of the ESP register, which
> >is the official bug of x86 CPUs. The bug
> >triggers only when the one is using the
> >16bit stack segment, and is described here:
> >http://www.intel.com/design/intarch/specupdt/27287402.PDF
>
> Does the bug also egsist on AMD CPU's? Does the patch add anything to
> kernels compiled for AMD CPU's?
Yes, same workaround is needed on AMDs, Cyrixes, ...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
Hi!
> @@ -257,8 +265,31 @@
> movl TI_flags(%ebp), %ecx
> testw $_TIF_ALLWORK_MASK, %cx # current->work
> jne syscall_exit_work
> +
> restore_all:
> - RESTORE_ALL
> + movl EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
> + movb OLDSS(%esp), %ah
> + movb CS(%esp), %al
> + andl $(VM_MASK | (4 << 8) | 3), %eax
> + cmpl $((4 << 8) | 3), %eax
> + je ldt_ss # returning to user-space with LDT SS
All common linux apps use same %ss, no? Perhaps it would be more
efficient to just check if %ss == 0x7b, and proceed directly to
restore_nocheck if not?
Or perhaps we could only enable this code after application loads
custom ldt?
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
Hi.
Pavel Machek wrote:
>> + andl $(VM_MASK | (4 << 8) | 3), %eax
>> + cmpl $((4 << 8) | 3), %eax
>> + je ldt_ss # returning to user-space with LDT SS
> All common linux apps use same %ss, no? Perhaps it would be more
> efficient to just check if %ss == 0x7b, and proceed directly to
> restore_nocheck if not?
Such an optimization will cost three more
instructions, one of which is a "taken"
jump. It seems like the "taken" jump on
a fast path is not good, while now it is
only 5 instructions with a not-taken jump.
I am not sure here, but I think the current
solution is better (depends on how bad the
"taken" jump is, and how bad it is to have
the three extra insns for that optimization
purpose).
> Or perhaps we could only enable this code
> after application loads custom ldt?
The good thing here is that the code
actually does what you say, i.e. it jumps
to ldt_ss only when the app has loaded
the custom ldt and loaded that selector
to %ss. The way it is implemented, is
probably different from what you mean,
I assume you mean the new per-thread flag?
But I don't see how it can be more optimal,
i.e. you propose to check whether or not
the app altered the ldt (which can just be
the old glibc I think), while the current
solution is to also check whether it was
loaded to %ss (so the glibc case is avoided,
IIRC glibc used to load %gs with LDT selector).
I.e. since right now we jump to ldt_ss only
when the %ss is loaded with an LDT selector,
I think the extra checks are not needed.
On Sun, 13 Mar 2005, Stas Sergeev wrote:
>
> Such an optimization will cost three more
> instructions, one of which is a "taken"
> jump.
I think Pavel missed the fact that you have to check the VM86 bit in
eflags before you check SS, since otherwise SS doesn't mean anything
special at all (ie checking for just the normal SS isn't correct: you
could have a 16-bit SS that looks normal, but is actually a vm86 segment).
Pavel: for the same reason you have to check the low two bits of CS too,
since if they are zero, then SS hasn't been saved on the stack at all, so
comparing it against some normal value is meaningless.
That said, the "ldt_ss" case should be moved _after_ the conditional
tests, since most CPU's out there will do static prediction based on
forward/backwards direction if the branch predictor isn't primed. And so
since it's an unusual case, the branch should be a forward branch, which
is usually the not-taken one (this depends on the core, of course, and you
could also add the prefix byte to mark it explicitly predict-not-taken for
the newer cores that support it).
Linus
Hello.
Linus Torvalds wrote:
> I think Pavel missed the fact that you have to check the VM86 bit in
Ah, thanks, I must have forgotten the
essentials of the own patch :(
> That said, the "ldt_ss" case should be moved _after_ the conditional
> tests, since most CPU's out there will do static prediction based on<>
> forward/backwards direction if the branch predictor isn't primed.
OK, I moved the "ldt_ss" below the check.
The new patch is attached.
Alan, can you please apply that to an -ac
tree?
----
Summary:
Attached patch works around the corruption
of the high word of the ESP register, which
is the official bug of x86 CPUs. The bug
triggers only when the one is using the
16bit stack segment, and is described here:
http://www.intel.com/design/intarch/specupdt/27287402.PDF
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Petr Vandrovec <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Signed-off-by: Stas Sergeev <[email protected]>
On Ne 13-03-05 13:13:16, Linus Torvalds wrote:
>
>
> On Sun, 13 Mar 2005, Stas Sergeev wrote:
> >
> > Such an optimization will cost three more
> > instructions, one of which is a "taken"
> > jump.
>
> I think Pavel missed the fact that you have to check the VM86 bit in
> eflags before you check SS, since otherwise SS doesn't mean anything
> special at all (ie checking for just the normal SS isn't correct: you
> could have a 16-bit SS that looks normal, but is actually a vm86 segment).
>
> Pavel: for the same reason you have to check the low two bits of CS too,
> since if they are zero, then SS hasn't been saved on the stack at all, so
> comparing it against some normal value is meaningless.
Yes, I missed that one, thanks.
What about flag similar to _TIF_SYSCALL_TRACE (call it
_TIF_THIS_BEAST_USES_V86 or something), and only do the tests in the
slowpath if it is set? As normal applications do not use v86, we could
make this 0 instructions in syscall fast path...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
On Mon, 14 Mar 2005, Pavel Machek wrote:
>
> What about flag similar to _TIF_SYSCALL_TRACE (call it
> _TIF_THIS_BEAST_USES_V86 or something), and only do the tests in the
> slowpath if it is set? As normal applications do not use v86, we could
> make this 0 instructions in syscall fast path...
It wouldn't help you. You'd need to mix in two of the values anyway, so at
most you'd save one instruction. And the cost would be that anything that
has ever used vm86 mode (can you say "X server"?) would be slower. Not a
good trade-off.
Oh, I guess you could clear the flag when you know there's no vm86 state
anywhere (easy enough, those things never nest), but then it still comes
back to "extra complexity that you can get wrong, just to save a single
"mov" instruction - that "mov" may have partial-register-stall issues,
but still..).
Linus
On Sun, 13 Mar 2005, Linus Torvalds wrote:
>
> That said, the "ldt_ss" case should be moved _after_ the conditional
> tests, since most CPU's out there will do static prediction based on
> forward/backwards direction
Btw, Stas, one thing I'd really like to see is even a partial list of
anything that actually cares about this. Ie, if there is some known
Windows app where Wine works better or something like that, just adding
that information to the comments would be hugely appreciated.
Another way of saying the same thing: I absolutely hate seeing patches
that fix some theoretical issue that no Linux apps will ever care about.
So I'd like to have a bit more of a case for this patch, since I know what
the case against it is ;)
Linus
Hi,
Linus Torvalds wrote:
> Btw, Stas, one thing I'd really like to see is even a partial list of
> anything that actually cares about this. Ie, if there is some known
> Windows app where Wine works better or something like that, just adding
I am not using Wine too much, but I've
found this:
http://cvs.winehq.org/cvsweb/~checkout~/wine/dlls/winedos/int31.c?rev=1.41&content-type=text/plain
---
/* due to a flaw in some CPUs (at least mine), it is best to mark stack segments as 32-bit if they
can be used in 32-bit code. Otherwise, these CPUs may not set the high word of esp during a
ring transition (from kernel code) to the 16-bit stack, and this causes trouble if executing
32-bit code using this stack. */
---
I added win-devel to CC, maybe people there
can tell if that patch has any value for them
or not.
The reference to the original patch:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0503.1/1794.html
Dosemu looks a little better on that, the
whole chapter of the docs is dedicated to
that problem:
http://dosemu.sourceforge.net/docs/EMUfailure/t1.html#AEN55
There you can find a (relatively small)
list of the programs that are affected,
but I personally have the old Microsoft
linker that crashes, and a few more DOS
games.
> Another way of saying the same thing: I absolutely hate seeing patches
> that fix some theoretical issue that no Linux apps will ever care about.
No, it is not theoretical, but it is mainly
about a DOS games and an MS linker, as for
me. The things I'd like to get working, but
the ones you may not care too much about:)
The particular game I want to get working,
is "Master of Orion 2" for DOS.
> So I'd like to have a bit more of a case for this patch, since I know what
> the case against it is ;)
Yep, and the informational leak it closes,
looks also rather minor.
So it is only a matter of how do you care
about the dosemu and the DOS games under
linux. Considering the amount of the
dosemu-related code in vm86.c, I guess you
care:)
And uhm, adding the list of the DOS games
to the comments of the Linux kernel code,
doesn't sound like a good idea to me:)
Stas Sergeev <[email protected]> writes:
>
>> Another way of saying the same thing: I absolutely hate seeing
>> patches that fix some theoretical issue that no Linux apps will ever
>> care about.
> No, it is not theoretical, but it is mainly
> about a DOS games and an MS linker, as for
> me. The things I'd like to get working, but
> the ones you may not care too much about:)
> The particular game I want to get working,
> is "Master of Orion 2" for DOS.
How about you just run it in dosbox instead of dosemu ?
-Andi
Hi Andi
I tried to read /dev/kmem on x86_64 (linux-2.6.11) and got no success.
read() or pread() returns EINVAL
I tried mmap() too : mmap() calls succeed, but as soon the user process
dereference memory, we get :
tinfo: Corrupted page table at address 2aaaaaabf800
PGD 8a983067 PUD c7e5a067 PMD 91588067 PTE ffffffff8048a025
Bad pagetable: 000d [1] SMP
CPU 0
Modules linked in: ipt_REJECT
Pid: 10892, comm: tinfo Not tainted 2.6.11
RIP: 0033:[<0000000000100562>] [<0000000000100562>]
RSP: 002b:00007ffffffff790 EFLAGS: 00010217
RAX: 00002aaaaaabf000 RBX: 00002aaaaabbe000 RCX: 00002aaaaac8fc0c
RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
RBP: 00007ffffffff7f8 R08: 0000000000000003 R09: ffffffff8048a000
R10: 0000000000000001 R11: 0000000000000206 R12: 00000000001005b0
R13: 0000000000000001 R14: 00002aaaaadfdfe8 R15: 0000000000100530
FS: 00002aaaaabcb970(0000) GS:ffffffff804866c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaaabf800 CR3: 0000000090368000 CR4: 00000000000006e0
Process tinfo (pid: 10892, threadinfo ffff8100901b0000, task
ffff8100c7d976c0)
RIP [<0000000000100562>] RSP <00007ffffffff790>
Thank you
Eric Dumazet
----------------------------------------------------------------
# cat tinfo.c
#define _XOPEN_SOURCE 500
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
struct tcp_hashinfo {
struct tcp_ehash_bucket *__tcp_ehash;
struct tcp_bind_hashbucket *__tcp_bhash;
int __tcp_bhash_size;
int __tcp_ehash_size;
} tcp_hashinfo;
#define TCPINFO_ADDR 0xffffffff8048a000 /* tcp_hashinfo */
int main()
{
int fd = open("/dev/kmem", O_RDONLY) ;
if (pread(fd, &tcp_hashinfo, sizeof(tcp_hashinfo), TCPINFO_ADDR) == -1) {
lseek(fd, TCPINFO_ADDR, 0) ;
if (read(fd, &tcp_hashinfo, sizeof(tcp_hashinfo)) == -1) {
perror("Can not read /dev/kmem ?") ;
return 1 ;
}
}
printf("ehash=%p esize=%d bhash=%p bsize=%d\n",
tcp_hashinfo.__tcp_ehash,
tcp_hashinfo.__tcp_ehash_size,
tcp_hashinfo.__tcp_bhash,
tcp_hashinfo.__tcp_bhash_size) ;
return 0 ;
}
> Stas Sergeev <[email protected]> writes:
>>
>>> Another way of saying the same thing: I absolutely hate seeing
>>> patches that fix some theoretical issue that no Linux apps will ever
>>> care about.
>> No, it is not theoretical, but it is mainly
>> about a DOS games and an MS linker, as for
>> me. The things I'd like to get working, but
>> the ones you may not care too much about:)
>> The particular game I want to get working,
>> is "Master of Orion 2" for DOS.
>
> How about you just run it in dosbox instead of dosemu ?
>
> -Andi
Nah, don't insult a DOSemu developer. ;-) Stas is one of them...
Best regards,
Zolt?n B?sz?rm?nyi
Andi Kleen wrote:
>Stas Sergeev <[email protected]> writes:
>
>
>>>Another way of saying the same thing: I absolutely hate seeing
>>>patches that fix some theoretical issue that no Linux apps will ever
>>>care about.
>>>
>>>
>>No, it is not theoretical, but it is mainly
>>about a DOS games and an MS linker, as for
>>me. The things I'd like to get working, but
>>the ones you may not care too much about:)
>>The particular game I want to get working,
>>is "Master of Orion 2" for DOS.
>>
>>
>
>How about you just run it in dosbox instead of dosemu ?
>
>
Yes, that's a solution of course, but it is a bit like saying why
not use Open Office instead of MS Word.
A long term goal of wine is to support DOS apps to. Of course
it's not a priority, but it's there.
regards,
Jakob
On Mon, 14 Mar 2005, Jakob Eriksson wrote:
> Andi Kleen wrote:
>
>> Stas Sergeev <[email protected]> writes:
>>
>>>> Another way of saying the same thing: I absolutely hate seeing
>>>> patches that fix some theoretical issue that no Linux apps will ever
>>>> care about.
>>>>
>>> No, it is not theoretical, but it is mainly
>>> about a DOS games and an MS linker, as for
>>> me. The things I'd like to get working, but
>>> the ones you may not care too much about:)
>>> The particular game I want to get working,
>>> is "Master of Orion 2" for DOS.
>>>
>>
>> How about you just run it in dosbox instead of dosemu ?
>>
>
> Yes, that's a solution of course, but it is a bit like saying why
> not use Open Office instead of MS Word.
>
> A long term goal of wine is to support DOS apps to. Of course
> it's not a priority, but it's there.
>
> regards,
> Jakob
>
Can you tell me how the invisible high-word (invisible in VM-86, and
in real mode) could possibly harm something running in VM-86 or
read-mode ??? I don't even think it's a BUG. If the transition
into and out of VM-86 doesn't handle the fact that the high-word
of the stack hasn't been used in VM-86, then that piece of code
is bad (the SP isn't even the same stack, BTW).
Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.
Hi!
> Can you tell me how the invisible high-word (invisible in VM-86, and
> in real mode) could possibly harm something running in VM-86 or
> read-mode ??? I don't even think it's a BUG. If the transition
You can have protected-mode application running in dosemu with 16-bit
stack segment.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
Hello.
Andi Kleen wrote:
>> The particular game I want to get working,
>> is "Master of Orion 2" for DOS.
> How about you just run it in dosbox instead of dosemu ?
Way too slow, and there are other reasons
too, like the better networking support on
dosemu side (not for Orion, but for other
games is important), but that might be an
off-topic here:)
Hi,
Jakob Eriksson wrote:
> A long term goal of wine is to support DOS apps to. Of course
> it's not a priority, but it's there.
Yes, that's exactly what I was hoping
for, thanks!
Even if no Windows apps do such a thing
(which wasn't confirmed yet), Wine may
still need that fix for the DOS support
in the future, and dosemu seems to be in
need of that fix already and for long
(that's where I am getting the use of it).
And we haven't heard a word for a VMWare
yet, and I think they may appreciate that
too.
Also, since the first version of the path,
I've been contacted by a few people asking
me to provide an updated version for 2.6.9
and 2.6.10. I don't know the reason, but
I know it was used (I think they were more
concerned about an aforementioned "information
leak", rather than about the %esp corruption).
So if the last problem with that patch was
that it is not really needed, I think it
no longer stays:)
linux-os wrote:
> On Mon, 14 Mar 2005, Jakob Eriksson wrote:
>
>> Andi Kleen wrote:
>>
>>> Stas Sergeev <[email protected]> writes:
>>>
>>>>> Another way of saying the same thing: I absolutely hate seeing
>>>>> patches that fix some theoretical issue that no Linux apps will ever
>>>>> care about.
>>>>>
>>>> No, it is not theoretical, but it is mainly
>>>> about a DOS games and an MS linker, as for
>>>> me. The things I'd like to get working, but
>>>> the ones you may not care too much about:)
>>>> The particular game I want to get working,
>>>> is "Master of Orion 2" for DOS.
>>>>
>>>
>>> How about you just run it in dosbox instead of dosemu ?
>>>
>>
>> Yes, that's a solution of course, but it is a bit like saying why
>> not use Open Office instead of MS Word.
>>
>> A long term goal of wine is to support DOS apps to. Of course
>> it's not a priority, but it's there.
>>
>> regards,
>> Jakob
>>
>
> Can you tell me how the invisible high-word (invisible in VM-86, and
> in real mode) could possibly harm something running in VM-86 or
> read-mode ??? I don't even think it's a BUG. If the transition
> into and out of VM-86 doesn't handle the fact that the high-word
> of the stack hasn't been used in VM-86, then that piece of code
> is bad (the SP isn't even the same stack, BTW).
Because even in 16-bit mode (real, vm86 or 16-bit protected) you can use
32-bit instructions, with an operand and/or address size override
prefix. Of course this only works on 386 or later.
--
Brian Gerst
On Mon, Mar 14, 2005 at 01:06:36AM +0300, Stas Sergeev wrote:
> Alan, can you please apply that to an -ac
> tree?
Ask Andrew Morton as it belongs in the -mm tree
Hi!
Alan Cox wrote:
>> Alan, can you please apply that to an -ac
>> tree?
> Ask Andrew Morton as it belongs in the -mm tree
Actually I tried that already. Andrew
had nothing against that patch personally,
as well as Linus, but after all that didn't
work:
http://lkml.org/lkml/2005/1/3/260
So it can't be applied to -mm, and not
depending on the kgdb-ga patch allowed for
some extra optimization.
So I have to try the -ac as the last resort.
I realized you won't be pleased with that
too much. If you are confident it doesn't
fit the -ac tree by whatever means, I can
understand that. (In that case I'll much
appreciate the suggestion what other tree
should I try.)
Hello.
Brian Gerst wrote:
>> Can you tell me how the invisible high-word (invisible in VM-86, and
>> in real mode) could possibly harm something running in VM-86 or
>> read-mode ??? I don't even think it's a BUG. If the transition
>> into and out of VM-86 doesn't handle the fact that the high-word
>> of the stack hasn't been used in VM-86, then that piece of code
>> is bad (the SP isn't even the same stack, BTW).
> Because even in 16-bit mode (real, vm86 or 16-bit protected) you can use
> 32-bit instructions, with an operand and/or address size override
> prefix.
And the real problem is when the pure
32bit code is starting to use the 16bit
stack for some strange reasons. Looks like
the common technique for the early dos4gw
-based apps...
Stas Sergeev <[email protected]> wrote:
>
> Alan Cox wrote:
> >> Alan, can you please apply that to an -ac
> >> tree?
> > Ask Andrew Morton as it belongs in the -mm tree
> Actually I tried that already.
I added this patch to -mm.
> Andrew
> had nothing against that patch personally,
> as well as Linus, but after all that didn't
> work:
> http://lkml.org/lkml/2005/1/3/260
>
> So it can't be applied to -mm, and not
> depending on the kgdb-ga patch allowed for
> some extra optimization.
The rule is:
- If the patch patches something which is in Linus's kernel, prepare a
diff against Linus's latest kernel.
- If the patch patches something which is only in -mm, prepare a patch
against -mm.
In this case, I merged the patch prior to the kgdb patch and then fixed
up the fallout.
(If that causes kgdb to break in non-obvious-to-me ways then I might come
calling "help". We'll see)
Hi all
I noticed that in current linux kernel versions (2.6.11), some basic
functions are out of line (not inlined)
Example of a call to spin_unlock(&somelock)
c01069fa: b8 e8 7b 35 c0 mov $0xc0357be8,%eax
c01069ff: e8 3c e4 1f 00 call c0304e40 <_spin_unlock>
c0304e40 <_spin_unlock>:
c0304e40: c6 00 01 movb $0x1,(%eax)
c0304e43: c3 ret
Same problem for _write_unlock(), _read_unlock(), _spin_unlock_irq(), ...
That seems odd, and I fail to see the reason for that. (It's OK for
complex functions, but not for very short ones...)
Is it a regression, or is it needed ?
configuration :
- SMP
- Processor family (Pentium-4/Celeron(P4-based)/Pentium-4 M/Xeon)
- No "Generic x86 support"
Thank you
Eric Dumazet
On Tue, 2005-03-15 at 11:48 +0100, Eric Dumazet wrote:
> Is it a regression, or is it needed ?
>
Please see the "Completely out of line spinlocks" thread from about a
month ago.
Lee
Hello.
Andrew Morton wrote:
> I added this patch to -mm.
Many thanks!
Alan, sorry for bothering you with that.
> - If the patch patches something which is in Linus's kernel, prepare a
> diff against Linus's latest kernel.
> - If the patch patches something which is only in -mm, prepare a patch
> against -mm.
> In this case, I merged the patch prior to the kgdb patch and then fixed
> up the fallout.
> (If that causes kgdb to break in non-obvious-to-me ways then I might come
> calling "help".
Thanks for explanations. That's what I
tried to find in your "The perfect patch"
guidelines many times, but it is not there
in such details.
Sorry for the late answer.
On Mon, Mar 14, 2005 at 11:37:19AM +0100, Eric Dumazet wrote:
>
> Hi Andi
>
> I tried to read /dev/kmem on x86_64 (linux-2.6.11) and got no success.
>
> read() or pread() returns EINVAL
Yes. I fixed this once for 2.4, but somehow the changes never made
it into 2.6. The VFS code doesn't like negative offsets, and the kernel
addresses are negative.
>
> I tried mmap() too : mmap() calls succeed, but as soon the user process
> dereference memory, we get :
Hmm, looks like a bug yes. I can reproduce it. Thanks for the report.
Will investigate later.
-Andi
On Mon, Mar 14, 2005 at 11:37:19AM +0100, Eric Dumazet wrote:
>
> Hi Andi
>
> I tried to mmap /dev/kmem on x86_64 (linux-2.6.11) and got no success.
>
Here's a patch that fixes the problem to me.
Fix mmap of /dev/kmem. It cannot ever have worked before.
vmalloc is still not supported because that would be more
complicated.
Signed-off-by: Andi Kleen <[email protected]>
diff -u linux-2.6.11/drivers/char/mem.c-o linux-2.6.11/drivers/char/mem.c
--- linux-2.6.11/drivers/char/mem.c-o 2004-12-24 22:34:47.000000000 +0100
+++ linux-2.6.11/drivers/char/mem.c 2005-03-22 12:36:33.852319000 +0100
@@ -211,6 +211,23 @@
return 0;
}
+static int mmap_kmem(struct file * file, struct vm_area_struct * vma)
+{
+ unsigned long long val;
+ /*
+ * RED-PEN: on some architectures there is more mapped memory
+ * than available in mem_map which pfn_valid checks
+ * for. Perhaps should add a new macro here.
+ *
+ * RED-PEN: vmalloc is not supported right now.
+ */
+ if (!pfn_valid(vma->vm_pgoff))
+ return -EIO;
+ val = (u64)vma->vm_pgoff << PAGE_SHIFT;
+ vma->vm_pgoff = __pa(val) >> PAGE_SHIFT;
+ return mmap_mem(file, vma);
+}
+
extern long vread(char *buf, char *addr, unsigned long count);
extern long vwrite(char *buf, char *addr, unsigned long count);
@@ -567,7 +584,6 @@
return capable(CAP_SYS_RAWIO) ? 0 : -EPERM;
}
-#define mmap_kmem mmap_mem
#define zero_lseek null_lseek
#define full_lseek null_lseek
#define write_zero write_null
Stas, Petr,
I know it's been a while since this was discussed and integrated into
mainline, but I just now came across this, and following all of the
original discussion that I was able to locate I didn't see any mention
of a potential different approach to solving the problem which, as it
would appear to me, requires much less code changes: Instead of
allocating a separate stack to set intermediately, the 16-bit stack
segment could be mapped directly onto the normal, flat kernel stack,
thus requiring only an in-place stack change (without any copying of
contents) and an adjustment to the __switch_to code to change the base
address of that segment. Since this approach seems fairly natural I'm
sure it was actually considered, and I'd be curious to learn why it
wasn't chosen.
Thanks, Jan
Hello.
Jan Beulich wrote:
> mainline, but I just now came across this, and following all of the
> original discussion that I was able to locate I didn't see any mention
> of a potential different approach to solving the problem which, as it
There were at least 3 fundamentally different
approaches proposed by different people, I implemented
and posted all of them. At least 2 approaches were
publically discussed. Both did what you say. The
one that didn't, wasn't publically discussed either
(the one that ended up in 2.6.12, in fact).
So it is a bit odd that you weren't able to find
the code that does what you say. :)
> would appear to me, requires much less code changes: Instead of
> allocating a separate stack to set intermediately, the 16-bit stack
> segment could be mapped directly onto the normal, flat kernel stack,
Do you mean, eg, this?
http://www.ussg.iu.edu/hypermail/linux/kernel/0409.2/1533.html
Relevant quote:
---
> ring1 stacks must be per-CPU.
I allocate it on a ring0 stack. Noone seem to
suggest that. Is this flawed for some reasons?
---
>>> Stas Sergeev <[email protected]> 11.09.05 16:02:24 >>>
>Jan Beulich wrote:
>> mainline, but I just now came across this, and following all of the
>> original discussion that I was able to locate I didn't see any
mention
>> of a potential different approach to solving the problem which, as
it
>There were at least 3 fundamentally different
>approaches proposed by different people, I implemented
>and posted all of them. At least 2 approaches were
>publically discussed. Both did what you say. The
>one that didn't, wasn't publically discussed either
>(the one that ended up in 2.6.12, in fact).
>So it is a bit odd that you weren't able to find
>the code that does what you say. :)
>
>> would appear to me, requires much less code changes: Instead of
>> allocating a separate stack to set intermediately, the 16-bit stack
>> segment could be mapped directly onto the normal, flat kernel
stack,
>Do you mean, eg, this?
>http://www.ussg.iu.edu/hypermail/linux/kernel/0409.2/1533.html
No, I don't. This talks about going through ring 1 intermediately,
which isn't what I have in mind.
>Relevant quote:
>---
>> ring1 stacks must be per-CPU.
>I allocate it on a ring0 stack. Noone seem to
>suggest that. Is this flawed for some reasons?
>---
Jan
Hi.
Jan Beulich wrote:
>>Do you mean, eg, this?
>>http://www.ussg.iu.edu/hypermail/linux/kernel/0409.2/1533.html
> No, I don't. This talks about going through ring 1 intermediately,
> which isn't what I have in mind.
Well, like I said, 2 approaches do use the
kernel stack for the 16bit stack. One approach
uses ring-1 trampoline, the other one doesn't.
The posting I pointed to, was explicit about
the stack usage, but as for the ring-0 approach
while still using the kernel stack - here it is:
http://www.ussg.iu.edu/hypermail/linux/kernel/0410.0/1402.html
Is this what you mean? This is pretty much all
about it, the third approach is in the kernel,
and there were no more, even under discussion.
>>>Do you mean, eg, this?
>>>http://www.ussg.iu.edu/hypermail/linux/kernel/0409.2/1533.html
>> No, I don't. This talks about going through ring 1 intermediately,
>> which isn't what I have in mind.
>Well, like I said, 2 approaches do use the
>kernel stack for the 16bit stack. One approach
>uses ring-1 trampoline, the other one doesn't.
>The posting I pointed to, was explicit about
>the stack usage, but as for the ring-0 approach
>while still using the kernel stack - here it is:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0410.0/1402.html
>
>Is this what you mean? This is pretty much all
>about it, the third approach is in the kernel,
>and there were no more, even under discussion.
Yes, this comes close. Still, I'm more interested to understand why
this approach was *not* chosen, which doesn't seem to be covered by any
of the (only two) followups.
Thanks again, Jan
Hi.
Jan Beulich wrote:
> Yes, this comes close. Still, I'm more interested to understand why
> this approach was *not* chosen, which doesn't seem to be covered by any
> of the (only two) followups.
Yes, that was discussed privately.
But I am very intrigued why are you asking.
I thought this problem is very specific to
the virtualizers, like dosemu, maybe wine or
vmware, but not much more. Could you please
clarify what exactly problem does this solve
for you? I'm just curious.
As for why the other approach was developed,
here are the main points that I can recall:
- Run-time GDT patching is UGLY.
- Switching back to the 32bit stack is extremely
tricky in an NMI handler. Proving that this will
work even when the exception handler is being
NMIed, was nearly impossible (though I think I
got that part right).
- Overall the patch was so fragile and hackish-looking,
that Linus complained and proposed the plan to
get rid of the most of fragility. That included
allocating a separate per-cpu stacks and rewriting
the fixups mostly in C. He also proposed a lot of
the optimizations to make the C-based patch as
fast as an asmish one. That all made the patch
overall much better (and much bigger, too) and
helped to fix the regressions quickly (there were
quite a few, but they were not the bugs in the
patch itself :)