The following small program (linked against glibc 2.1.3) reliably
freezes my system (Athlon Thunderbird CPU) with at least kernels
2.4.0-test10 and 2.4.0-test11-pre5. Even the SysRq keys do not work
after the freeze.
Older kernels (e.g. 2.3.40) seem to work. Any ideas?
---------------------------------------
#define _GNU_SOURCE
#include <fenv.h>
#include <unistd.h>
int
main(void)
{
    double a;
    fesetenv(FE_NOMASK_ENV);
    a = 1.0 / 0.0;
    sleep(10);
    return a;
}
---------------------------------------
--
Markus
__________________________________________________________________
Do You Yahoo!?
Gesendet von Yahoo! Mail - http://mail.yahoo.de
Gratis zum Millionär! - http://10millionenspiel.yahoo.de
In article <[email protected]>,
Markus Schoder <[email protected]> wrote:
>The following small program (linked against glibc 2.1.3) reliably
>freezes my system (Athlon Thunderbird CPU) with at least kernels
>2.4.0-test10 and 2.4.0-test11-pre5. Even the SysRq keys do not work
>after the freeze.
Are you sure sysrq doesn't work? Many distributions will disable the
kernel printing to the console, or move it to console 7 or similar.
It would be really good to get the EIP trace from pressing RightAlt+ScrollLock
a few times, so please see whether you can use klogd to enable proper printk's.
>Older kernels (e.g. 2.3.40) seem to work. Any Ideas?
The FP exception handling has certainly changed, but the changes should
all have affected mainly just PIII kernels with XMM support enabled. An
Athlon system should have been pretty unaffected. But I'll take a look
if I see something obvious.
One thing to try: if interrupts really don't work for you (and if SysRq
doesn't work, that may be the case), please test out a kernel that
simply ignores irq13 by just commenting out the line
setup_irq(13, &irq13);
in arch/i386/kernel/i8259.c. Does that make any difference? (irq13
shouldn't be used any more, it's horrible legacy crap, but we do want to
support even horrible legacy systems).
Linus
In article <[email protected]>,
Markus Schoder <[email protected]> wrote:
>The following small program (linked against glibc 2.1.3) reliably
>freezes my system (Athlon Thunderbird CPU) with at least kernels
>2.4.0-test10 and 2.4.0-test11-pre5. Even the SysRq keys do not work
>after the freeze.
>
>Older kernels (e.g. 2.3.40) seem to work. Any Ideas?
It certainly doesn't happen for me on any of the machines I work with,
but it wouldn't compile as-is for me, so I exchanged the FPU setting
with a simpler
asm("fldcw %0": :"m" (0));
which should do the equivalent (ie unmask divide by zero errors). Does
that make a difference for you?
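For readers who want to unmask only the divide-by-zero exception rather than load a control word of 0 (which clears all six mask bits and the precision and rounding fields too), here is a minimal sketch. The helper names are hypothetical and not part of the thread; the bit layout is the standard x87 control word, where a set mask bit means "masked" and 0x037f is the default after FNINIT. The inline-asm wrapper assumes GCC-style asm on x86 and is compiled out elsewhere.

```c
#include <stdint.h>

/* x87 control-word exception mask bits: a set bit means "masked". */
enum {
    X87_MASK_INVALID   = 0x01,
    X87_MASK_DENORMAL  = 0x02,
    X87_MASK_ZERODIV   = 0x04,
    X87_MASK_OVERFLOW  = 0x08,
    X87_MASK_UNDERFLOW = 0x10,
    X87_MASK_PRECISION = 0x20,
    X87_MASK_ALL       = 0x3f
};

/* Control word after FNINIT: every exception masked, extended precision. */
#define X87_CW_DEFAULT 0x037f

/* Hypothetical helper: return cw with the given exceptions unmasked. */
uint16_t x87_cw_unmask(uint16_t cw, uint16_t exceptions)
{
    return (uint16_t)(cw & ~(exceptions & X87_MASK_ALL));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read-modify-write of the live control word (x86 with GCC-style asm only). */
void x87_unmask_live(uint16_t exceptions)
{
    uint16_t cw;

    __asm__ volatile ("fnstcw %0" : "=m" (cw));
    cw = x87_cw_unmask(cw, exceptions);
    __asm__ volatile ("fldcw %0" : : "m" (cw));
}
#endif
```

Calling x87_unmask_live(X87_MASK_ZERODIV) before the division would leave the other exceptions masked, which narrows the test to the one condition under discussion.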
Can you try to figure out where it started happening? Ie try test9 and
back too, to figure out what might be bringing it on...
I sure as hell hope this isn't an Athlon issue. Can other people try
the test-program and see if we have a pattern (ie "it happens only on
Athlons", or "Linus is on drugs and it happens for everybody else").
Thanks,
Linus
On 17 Nov 2000, Linus Torvalds wrote:
> In article <[email protected]>,
> Markus Schoder <[email protected]> wrote:
> >The following small program (linked against glibc 2.1.3) reliably
> >freezes my system (Athlon Thunderbird CPU) with at least kernels
> >2.4.0-test10 and 2.4.0-test11-pre5. Even the SysRq keys do not work
> >after the freeze.
> >
> >Older kernels (e.g. 2.3.40) seem to work. Any Ideas?
>
> It certainly doesn't happen for me on any of the machines I work with,
> but it wouldn't compile as-is for me, so I exchanged the FPU setting
> with a simpler
>
> asm("fldcw %0": :"m" (0));
>
> which should do the equivalent (ie unmask divide by zero errors). Does
> that make a difference for you?
>
> Can you try to figure out where it started happening? Ie try test9 and
> back too, to figure out what might be bringing it on...
>
> I sure as hell hope this isn't an Athlon issue. Can other people try
> the test-program and see if we have a pattern (ie "it happens only on
> Athlons", or "Linus is on drugs and it happens for everybody else").
>
I couldn't get it to freeze. I tried it with asm("fldcw %0": :"m" (0))
and with fesetenv() using gcc -lm to link it. I have glibc-2.1.2,
egcs 2.91.66, and 2.4.0-test10.
Regards,
Adrian
Linus Torvalds wrote:
>
> I sure as hell hope this isn't an Athlon issue. Can other people try
> the test-program and see if we have a pattern (ie "it happens only on
> Athlons", or "Linus is on drugs and it happens for everybody else").
I've tried both variants (fesetenv and inline-asm) with glibc-2.1.3,
2.4.0-test11pre7 and an AMD Thunderbird. Neither freezes, but
both yield:
Floating point exception (core dumped)
-Udo.
Hi,
I am using test10-pre5 on a Duron.
>
> I couldn't get it to freeze. I tried it with asm("fldcw %0": :"m" (0))
> and with fesetenv() using gcc -lm to link it. I have glibc-2.1.2,
> egcs 2.91.66, and 2.4.0-test10.
>
> Regards,
> Adrian
Same here, except with gcc-2.95.2 and glibc 2.1.3. I got a floating point
exception. No freeze here.
Greetings
Michael Meding
Linus Torvalds wrote:
>
> In article <[email protected]>,
> Markus Schoder <[email protected]> wrote:
> >The following small program (linked against glibc 2.1.3) reliably
> >freezes my system (Athlon Thunderbird CPU) with at least kernels
> >2.4.0-test10 and 2.4.0-test11-pre5. Even the SysRq keys do not work
> >after the freeze.
> >
> >Older kernels (e.g. 2.3.40) seem to work. Any Ideas?
>
> It certainly doesn't happen for me on any of the machines I work with,
> but it wouldn't compile as-is for me, so I exchanged the FPU setting
> with a simpler
>
> asm("fldcw %0": :"m" (0));
>
> which should do the equivalent (ie unmask divide by zero errors). Does
> that make a difference for you?
>
> Can you try to figure out where it started happening? Ie try test9 and
> back too, to figure out what might be bringing it on...
>
> I sure as hell hope this isn't an Athlon issue. Can other people try
> the test-program and see if we have a pattern (ie "it happens only on
> Athlons", or "Linus is on drugs and it happens for everybody else").
>
> Thanks,
>
> Linus
I get Floating Point Exception (core dumped), but I needed to use the
modified program below to keep GCC from optimizing the division away as
a constant. This is on test11-pre5.
--------------
#define _GNU_SOURCE
#include <fenv.h>
#include <unistd.h>
double a = 1.0, b = 0.0;
int
main(void)
{
    double c;
    asm("fldcw %0": :"m" (0));
    // fesetenv(FE_NOMASK_ENV);
    c = a / b;
    sleep(10);
    return a;
}
--------------
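Moving the operands into globals works with the compiler used here, but an optimizer is still free to drop a quotient that is never used. A common alternative, sketched below with hypothetical helper names, is to route the operands and the result through volatile objects so that the loads, the division and the store all have to happen at run time. This is an illustration only, not what anyone in the thread ran.

```c
#include <float.h>

/*
 * Hypothetical helper: divide at run time. The volatile qualifiers force
 * the compiler to load both operands and to store the quotient, so the
 * division can neither be folded into a constant nor removed as dead code.
 */
double divide_at_runtime(double numerator, double denominator)
{
    volatile double n = numerator;
    volatile double d = denominator;
    volatile double q = n / d;

    return q;
}

/* Returns 1 if x is +infinity, using only <float.h>. */
int is_positive_infinity(double x)
{
    return x > DBL_MAX;
}
```

With the FP exceptions left masked (the default), divide_at_runtime(1.0, 0.0) simply returns +infinity. With divide-by-zero unmasked, the same call is expected to raise SIGFPE instead of returning.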
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 2
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 748.000573
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1494.22
--
Brian Gerst
On Sat, 18 Nov 2000, Brian Gerst wrote:
>
> I get Floating Point Exception (core dumped), but I needed to use the
> modified program below to keep GCC from optimizing the division away as
> a constant. This is on test11-pre5.
I'm starting to suspect that it's really a combination of three things:
- 3dnow optimization (ie you have to compile the kernel with Athlon
support)
- pending, but not yet noticed, FPU exceptions.
- a bug/feature in the kernel, where a process exit does not bother to
clear the FPU, only marks it as "unused".
If I'm right, the proper test-program should be something like
int main(int argc, char **argv)
{
    asm("fldcw %0": :"m" (0));
    asm("fldz ; fld1 ; fdiv");
    sleep(1);
    return 0;
}
where it's important that we do not wait for the result of the fdiv, we
just exit after having caused a pending exception (and you cannot do this
reliably from C code - depending on compiler version and optimizations
gcc may try to write the bad value back to memory etc).
Now, with the pending exception, do a 3dnow MMX memcpy() - which will
clear the TS bit (because it decides that the FP state can be thrown
away and doesn't need to do a full save/restore) and start using the FPU.
Boom. Instant FP exception. With the exception handler deciding that
nobody owns the FP state, and thus doing nothing sane.
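The sequence described above can be followed in a small user-space toy model. Everything in it is hypothetical (the struct names, the outcome enum, the guard flag); it is not kernel code and only mirrors the three ingredients of the theory: an error left pending in the FPU, an exit that marks the FPU unused without clearing it, and a later kernel use of the FPU that trips over the stale error. The guard_unowned flag stands in for a `!current->used_math` check in the exception handler.

```c
#include <stdbool.h>

/* Toy state, not kernel structures: all names here are hypothetical. */
struct fpu_model {
    bool pending_error;  /* unmasked x87 error raised but not yet delivered */
    bool ts;             /* stands in for the CR0.TS "task switched" bit */
};

struct task_model {
    bool used_math;      /* does the current task own the FPU contents? */
};

enum fault_outcome {
    NO_FAULT,            /* nothing pending, the FPU use goes through */
    OWNER_SIGNALLED,     /* normal case: the owning task gets its SIGFPE */
    STALE_DISCARDED,     /* guarded handler resets the FPU and returns */
    UNOWNED_FAULT        /* unguarded handler runs with no FPU owner */
};

/* fldz; fld1; fdiv with zero-divide unmasked: the error stays pending. */
void task_divides_by_zero(struct fpu_model *fpu, struct task_model *task)
{
    task->used_math = true;
    fpu->pending_error = true;
}

/* Exit marks the FPU as unused but leaves its contents, error included. */
void task_exits(struct fpu_model *fpu, struct task_model *task)
{
    task->used_math = false;
    fpu->ts = true;
}

/* Kernel borrows the FPU: TS is cleared and the next FP use faults. */
enum fault_outcome kernel_borrows_fpu(struct fpu_model *fpu,
                                      const struct task_model *current,
                                      bool guard_unowned)
{
    fpu->ts = false;
    if (!fpu->pending_error)
        return NO_FAULT;
    if (current->used_math) {
        fpu->pending_error = false;
        return OWNER_SIGNALLED;
    }
    if (guard_unowned) {
        fpu->pending_error = false;  /* models re-initialising the FPU */
        return STALE_DISCARDED;
    }
    return UNOWNED_FAULT;
}

/* Runs the whole sequence: divide, exit, then kernel FPU use. */
enum fault_outcome run_exit_scenario(bool guard_unowned)
{
    struct fpu_model fpu = { false, false };
    struct task_model task = { false };

    task_divides_by_zero(&fpu, &task);
    task_exits(&fpu, &task);
    return kernel_borrows_fpu(&fpu, &task, guard_unowned);
}
```

In this model run_exit_scenario(false) ends in UNOWNED_FAULT, the "nothing sane" case, while run_exit_scenario(true) ends in STALE_DISCARDED.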
If I'm right (and I'm _always_ right), the following patch would make a
difference.
Markus?
Linus
----
--- v2.4.0-test10/linux/arch/i386/kernel/traps.c Tue Oct 31 12:42:26 2000
+++ linux/arch/i386/kernel/traps.c Fri Nov 17 21:52:55 2000
@@ -643,6 +640,12 @@
 asmlinkage void do_coprocessor_error(struct pt_regs * regs, long error_code)
 {
        ignore_irq13 = 1;
+
+       /* Due to lazy error handling, we might have false pending errors! */
+       if (!current->used_math) {
+               init_fpu();
+               return;
+       }
        math_error((void *)regs->eip);
 }
@@ -700,6 +703,12 @@
        if (cpu_has_xmm) {
                /* Handle SIMD FPU exceptions on PIII+ processors. */
                ignore_irq13 = 1;
+
+               /* Due to lazy error handling, we might have false pending errors! */
+               if (!current->used_math) {
+                       init_fpu();
+                       return;
+               }
                simd_math_error((void *)regs->eip);
        } else {
                /*
> Linus Torvalds wrote:
> >
> > I sure as hell hope this isn't an Athlon issue. Can other people try
> > the test-program and see if we have a pattern (ie "it happens only on
> > Athlons", or "Linus is on drugs and it happens for everybody else").
>
> I've tried both variants (fesetenv and inline-asm) with glibc-2.1.3,
> 2.4.0-test11pre7 and an AMD Thunderbird. Neither does freeze, but
> both yield:
>
> Floating point exception (core dumped)
Compiler specific ?
On Sat, 18 Nov 2000, Alan Cox wrote:
> > Linus Torvalds wrote:
> > >
> > > I sure as hell hope this isn't an Athlon issue. Can other people try
> > > the test-program and see if we have a pattern (ie "it happens only on
> > > Athlons", or "Linus is on drugs and it happens for everybody else").
> >
> > I've tried both variants (fesetenv and inline-asm) with glibc-2.1.3,
> > 2.4.0-test11pre7 and an AMD Thunderbird. Neither does freeze, but
> > both yield:
> >
> > Floating point exception (core dumped)
>
> Compiler specific ?
There's almost certainly more than that. I'd love to have a report on my
asm-only version, but even so I suspect it also requires the 3dnow stuff,
because I'm not able to trigger anything like this on any machines I have
access to (none of them are AMD, though)
Linus
--- Linus Torvalds <[email protected]> wrote:
> If I'm right, the proper test-program should be something like
>
Your test program is indeed sufficient to trigger the
freeze. Unfortunately the patch does not make a
difference :(
My test program caused the exception (and the freeze)
unintentionally, in the return statement, since the
division was optimized away, as Brian pointed out.
Some more data points:
SysRq is definitely not working after the freeze :(
Judging from the processor temperature after rebooting,
the processor is very busy during the freeze.
I know of another guy with the exact same CPU (Athlon
Thunderbird 900MHz) and mainboard (ABIT KT7-RAID) who
has the same problem.
I use gcc 2.95.2 to compile the kernel.
So far I couldn't find out where the problem was
introduced, since the older 2.4.0-test kernels do not
support the HPT370. I will recable the drive to the
primary controller and try again.
I will also try to compile a non-AMD-specific kernel
and see if that makes a difference. If only this 40GB
drive would fsck faster :)
Note that cpuinfo shows model 4 whereas e.g. Brian had
model 2 if that means anything.
/proc/cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 900.000063
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1795.69
--
Markus
Linus Torvalds wrote:
> > Compiler specific ?
>
> There's almost certainly more than that. I'd love to have a report on my
> asm-only version, but even so I suspect it also requires the 3dnow stuff,
> because I'm not able to trigger anything like this on any machines I have
> access to (none of them are AMD, though)
>
Hmm...
Linus, I've tried your latest asm-version with my Thunderbird, egcs-1.1.2
and 3dnow support compiled in. Several runs show no problems - the program
runs and terminates gracefully.
-Udo.
Markus Schoder wrote:
> My test program caused the exception (and the freeze)
> unintendedly in the return statement since the
> division was optimized away as Brian pointed out.
It's quite strange that I cannot seem to trigger the
problem here on my machine.
> I know of another guy with the exact same CPU (Athlon
> Thunderbird 900MHz) and mainboard (ABIT KT7-RAID) who
> has the same problem.
>
> I use gcc 2.95.2 to compile the kernel.
Makes me wonder whether it could be an issue with your
board (I have an Asus A7V) or with gcc 2.95-2 (I use
egcs-1.1.2).
> Note that cpuinfo shows model 4 whereas e.g. Brian had
> model 2 if that means anything.
Mine is a model 4 also, so if it's related to that, I
should probably see the problem here as well.
/proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 807.000213
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1608.91
-Udo.
On Sat, 18 Nov 2000, Linus Torvalds wrote:
> There's almost certainly more than that. I'd love to have a report on my
> asm-only version, but even so I suspect it also requires the 3dnow stuff,
I tried all three versions, and no freezes. I forgot to mention the tests
were run on a model 2 Athlon (original slot K7, .18 micron). The kernel
is compiled with 3dnow support.
Regards,
Adrian
On Sat, 18 Nov 2000, Markus Schoder wrote:
>
> Your test program is indeed sufficient to trigger the
> freeze. Unfortunately the patch does not make a
> difference :(
Ok.
This may in fact be an Athlon CPU bug. But before we contact anybody from
AMD, I'd really need to know what the result from the irq13 disabling and
the non-3dnow thing is.
Considering that Udo reports no lockup at all with the same test-program
even with an Athlon and 3dnow, it looks like it's either irq13 (and a
motherboard routing issue: sane modern motherboards shouldn't even route
the external FERR at _all_ any more), or something stepping-specific on
your Athlon. It doesn't sound kernel-related per se.
Let's hope it's irq13. If so, it will be easy to fix (tentative fix: any
CPU that reports a built-in FPU just doesn't get irq13 enabled at all).
Current working theory:
- Athlons do FERR wrong. They drive FERR externally when the
unmasked exception happens, rather than when the next FP instruction
actually detects the exception. This means that the external FERR irq13
actually happens _before_ the internal exception 16, which is wrong.
- Linux has seen exception 16 working, so it ignores irq13 and assumes
that it's some real external device (which does happen - sometimes
SCI is wired to irq13).
- irq13 is not only wired on the motherboard (which was right in 1989,
but is not right in 2000), but is marked level-triggered (which
probably wasn't right even in 1989). So when the irq13 happens, it
_keeps_ on happening, and we never get an exception 16 at all.
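The third point, a level-triggered line whose source is never serviced, can be illustrated with a tiny toy model. The function name and the delivery cap are hypothetical and exist only so the loop terminates; real hardware has no such cap, which is the whole problem. This is a sketch of the theory as stated, not a model of any actual interrupt controller.

```c
#include <stdbool.h>

/*
 * Toy model of a level-triggered interrupt line. Returns how many times
 * the handler ran before the line dropped, capped at max_deliveries.
 */
unsigned int irq13_deliveries(bool handler_ignores_irq13,
                              unsigned int max_deliveries)
{
    bool ferr_asserted = true;  /* unmasked FP error is driving the line */
    unsigned int deliveries = 0;

    while (ferr_asserted && deliveries < max_deliveries) {
        deliveries++;
        if (!handler_ignores_irq13) {
            /* Handler treats irq13 as an FPU error and clears the cause. */
            ferr_asserted = false;
        }
        /* Otherwise nothing services the source; the line stays asserted. */
    }
    return deliveries;
}
```

A handler that services the FPU error sees exactly one delivery. A handler that ignores irq13 is re-entered until the cap is hit, which on a real machine would look like a hard freeze with a very busy CPU.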
The reason 2.2.x works on your machine might be that the early bootup test
for FP exceptions will have done something to mask the fpu exception just
by luck. I forget the exact details of the test - it got removed in later
kernels because it made it really nasty to handle XMM faults correctly.
Does anybody have any better ideas?
Linus
On Sat, 18 Nov 2000, adrian wrote:
>
>
> On Sat, 18 Nov 2000, Linus Torvalds wrote:
>
> > There's almost certainly more than that. I'd love to have a report on my
> > asm-only version, but even so I suspect it also requires the 3dnow stuff,
>
> I tried all three versions, and no freezes. I forgot to mention the tests
> were run on a model 2 Athlon (original slot K7, .18 micron). The kernel
> is compiled with 3dnow support.
Apparently it isn't the stepping, as we have Athlon model 4's both showing
it and not showing it. The motherboard seems to be the only real
difference here, which is why I like the irq13 explanation more and more.
I've been wanting to get rid of irq13 anyway (some boards wire up USB
and/or ACPI to irq13 and the fact that the FPU has claimed it makes those
machines unhappy), so if the solution is to only check for irq13 on old
i386 and i486sx machines and just leave it alone for newer CPU's, I won't
complain.
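As a rough sketch of the rule proposed above (claim irq13 only where an external coprocessor can actually be the error source), the decision reduces to a two-input predicate. The function and parameter names are hypothetical; the kernel's real data structures and the eventual fix may look different.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: should the kernel claim irq13 for FPU errors?
 *   has_math    - some FPU is present, on-chip or as an external coprocessor
 *   fpu_on_chip - the CPU itself reports a built-in FPU
 * Only an external coprocessor needs the legacy irq13 error reporting.
 */
bool wants_irq13(bool has_math, bool fpu_on_chip)
{
    return has_math && !fpu_on_chip;
}
```

Under this rule an Athlon (built-in FPU) never claims irq13, a 386 with an external 387 still does, and a machine with no FPU at all leaves the line free for other devices.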
Markus, can you make the irq13 test the first thing - don't worry about
3dnow as that seems to not be a deciding factor..
Linus
--- Linus Torvalds <[email protected]> wrote:
>
> On Sat, 18 Nov 2000, adrian wrote:
> >
> > On Sat, 18 Nov 2000, Linus Torvalds wrote:
> >
> > > There's almost certainly more than that. I'd love to have a report on my
> > > asm-only version, but even so I suspect it also requires the 3dnow stuff,
> >
> > I tried all three versions, and no freezes. I forgot to mention the tests
> > were run on a model 2 Athlon (original slot K7, .18 micron). The kernel
> > is compiled with 3dnow support.
>
> Apparently it isn't the stepping, as we have Athlon model 4's both showing
> it and not showing it. The motherboard seems to be the only real
> difference here, which is why I like the irq13 explanation more and more.
>
> I've been wanting to get rid of irq13 anyway (some boards wire up USB
> and/or ACPI to irq13 and the fact that the FPU has claimed it makes those
> machines unhappy), so if the solution is to only check for irq13 on old
> i386 and i486sx machines and just leave it alone for newer CPU's, I won't
> complain.
>
> Markus, can you make the irq13 test the first thing - don't worry about
> 3dnow as that seems to not be a deciding factor..
>
> Linus
>
Ok, that was it! It's IRQ 13. Guess I should have
tried that first. Now everything works perfectly.
Thanks everybody.
--
Markus
adrian wrote:
>
> On Sat, 18 Nov 2000, Linus Torvalds wrote:
>
> > There's almost certainly more than that. I'd love to have a report on my
> > asm-only version, but even so I suspect it also requires the 3dnow stuff,
>
> I tried all three versions, and no freezes. I forgot to mention the tests
> were run on a model 2 Athlon (original slot K7, .18 micron). The kernel
> is compiled with 3dnow support.
>
> Regards,
> Adrian
>
>
Mine freezes with both C versions; I haven't tried the asm version
yet.
gcc-2.95.2 and glibc-2.1.3 both of which are compiled for 486.
--
===============
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 2
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 751.000719
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
features : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1500.77
-- Tim
Markus Schoder wrote:
>
> Ok, that was it! It's IRQ 13. Guess I should have
> tried that first. Now everything works perfectly.
> Thanks everybody.
What motherboard do you have? I can't reproduce this on my FIC SD11.
--
Brian Gerst
--- Brian Gerst <[email protected]> wrote:
> > Ok, that was it! It's IRQ 13. Guess I should have
> > tried that first. Now everything works perfectly.
> > Thanks everybody.
>
> What motherboard do you have? I can't reproduce
> this on my FIC SD11.
>
> --
>
> Brian Gerst
It's an ABIT KT7-100 RAID. And I know somebody else
who has the same problem with this board. So it definitely
seems board related.
--
Markus
Markus Schoder <[email protected]> said:
[...]
> I will also try to compile a non AMD specific kernel
> and see if that makes a difference. If just this 40GB
> drive would fsck faster :)
mount -o remount,ro [...]
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616
Markus Schoder <[email protected]> said:
> --- Linus Torvalds <[email protected]> wrote:
[...]
> > Markus, can you make the irq13 test the first thing
> > - don't worry about
> > 3dnow as that seems to not be a deciding factor..
> Ok, that was it! It's IRQ 13. Guess I should have
> tried that first. Now everything works perfectly.
Could you _please_ document this somewhere in the source, so no one trips
over this or starts wondering?
> Thanks everybody.
Nodz.
--
Horst von Brand [email protected]
Casilla 9G, Vin~a del Mar, Chile +56 32 672616
This may not be helpful (and the thread is now two days
old), but I just wanted to add that after using Mr.
Torvalds' comment-out-irq13 suggestion and Vojtech
Pavlik's "Possible critical VIA vt82c686a chip bug"
patch (Oct 26), both with 2.4.0-t10, I've successfully
had a 2.4 series uptime >48 hours for the first time
with an Athlon/KX133 setup! Whoopee!
Hopefully both of these are incorporated in some way
into 2.4.0-t11? I'm about to test it out...
Thanks all... I'm just so overjoyed. *sniff*
-A. Hsiao