2002-01-14 10:58:24

by Zwane Mwaikambo

Subject: floating point exception

>Right after that my window manager segfaults. Ok, switch to console,
>restart it and go. No! Can't start any programs anymore, no login. All
>tasks die one after the other, up to the complete lock of the machine.
>Even alt-sysrq doesn't work.

Can you reproduce the problem with some degree of success? (2/5 is fine)

Regards,
Zwane Mwaikambo




2002-01-14 21:27:42

by Christian Thalinger

Subject: Re: floating point exception

On Mon, 2002-01-14 at 11:56, Zwane Mwaikambo wrote:
> >Right after that my window manager segfaults. Ok, switch to console,
> >restart it and go. No! Can't start any programs anymore, no login. All
> >tasks die one after the other, up to the complete lock of the machine.
> >Even alt-sysrq doesn't work.
>
> Can you reproduce the problem with some degree of success? (2/5 is fine)
>
> Regards,
> Zwane Mwaikambo
>

After a little bit of testing I would say yes: 2-3 out of 5 with kernels
2.4.17 and 2.4.18-pre3. Mainly with X, but I got some without X.

It seems the floating point exception is only raised with a new data
package. Is there a simple way to raise such an exception?

2002-01-15 14:36:53

by Zwane Mwaikambo

Subject: Re: floating point exception

On 14 Jan 2002, Christian Thalinger wrote:

> It seems the floating point exception is only raised with a new data
> package. Is there a simple way to raise such an exception?

New data package? And does the same behaviour recur after the FPU
exception, i.e. do programs start segfaulting etc.? Can you try running "dmesg"
after the segfaults and FPU exception and see if there is anything in the
kernel ring buffer too?

Regards,
Zwane Mwaikambo


2002-01-15 14:47:44

by Richard B. Johnson

Subject: Re: floating point exception

On Tue, 15 Jan 2002, Zwane Mwaikambo wrote:

> On 14 Jan 2002, Christian Thalinger wrote:
>
> > It seems the floating point exception is only raised with a new data
> > package. Is there a simple way to raise such an exception?
>
> New data package? And does the same behaviour re-occur after the fpu
> exception? ie programs start segfaulting etc. Can you try doing a "dmesg"
> after the segfaults and fpu exception and see if there is anything in the
> kernel ring buffer too.
>
> Regards,
> Zwane Mwaikambo


This will let you generate some math errors and see whether everything
works okay. By default, upon process creation, math errors such as
division by zero are masked.

/*
 * Note FPU control only exists per process. Therefore, you have
 * to set up the FPU before you use it in any program.
 */
#include <i386/fpu_control.h>

#define FPU_MASK (_FPU_MASK_IM |\
                  _FPU_MASK_DM |\
                  _FPU_MASK_ZM |\
                  _FPU_MASK_OM |\
                  _FPU_MASK_UM |\
                  _FPU_MASK_PM)

void fpu()
{
    __setfpucw(_FPU_DEFAULT & ~FPU_MASK);
}


main() {
    double zero=0.0;
    double one=1.0;
    fpu();

    one /=zero;
}



Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (797.90 BogoMips).

I was going to compile a list of innovations that could be
attributed to Microsoft. Once I realized that Ctrl-Alt-Del
was handled in the BIOS, I found that there aren't any.


2002-01-15 18:20:37

by Christian Thalinger

Subject: Re: floating point exception

On Tue, 2002-01-15 at 15:34, Zwane Mwaikambo wrote:
> On 14 Jan 2002, Christian Thalinger wrote:
>
> > It seems the floating point exception is only raised with a new data
> > package. Is there a simple way to raise such an exception?
>
> New data package? And does the same behaviour re-occur after the fpu
> exception? ie programs start segfaulting etc. Can you try doing a "dmesg"
> after the segfaults and fpu exception and see if there is anything in the
> kernel ring buffer too.
>
> Regards,
> Zwane Mwaikambo
>

There are .sah files, in which the data to be analysed is stored. So I
deleted these files and the client downloaded a new package -> new data
package.

Yes, the segfault did recur, and there is nothing in
dmesg. That was also my first thought; I then checked
/var/log/messages with tail and it got stuck. No ctrl-c.

Tried this:

#define _GNU_SOURCE 1
#include <fenv.h>

main() {
double zero=0.0;
double one=1.0;

feenableexcept(FE_ALL_EXCEPT);

one /=zero;
}

...but nothing happens.

2002-01-15 18:31:57

by Richard B. Johnson

Subject: Re: floating point exception

On 15 Jan 2002, Christian Thalinger wrote:

> On Tue, 2002-01-15 at 15:34, Zwane Mwaikambo wrote:
> > On 14 Jan 2002, Christian Thalinger wrote:
[SNIPPED...]

>
> Tried this:
>
> #define _GNU_SOURCE 1
> #include <fenv.h>
>
> main() {
> double zero=0.0;
> double one=1.0;
>
> feenableexcept(FE_ALL_EXCEPT);
>
> one /=zero;
> }
>
Well, that won't even link. The source I showed previously
compiles and links fine. It also shows an FPU exception when
one divides by zero:

Script started on Tue Jan 15 13:27:05 2002
# gcc -o zzz zzz.c -lm
/tmp/ccjhyGHj.o: In function `main':
/tmp/ccjhyGHj.o(.text+0x25): undefined reference to `feenableexcept'
collect2: ld returned 1 exit status
# gcc -o zzz fpu.c
# zzz
Floating point exception (core dumped)
# cat fpu.c
/*
* Note FPU control only exists per process. Therefore, you have
* to set up the FPU before you use it in any program.
*/
#include <i386/fpu_control.h>

#define FPU_MASK (_FPU_MASK_IM |\
_FPU_MASK_DM |\
_FPU_MASK_ZM |\
_FPU_MASK_OM |\
_FPU_MASK_UM |\
_FPU_MASK_PM)

void fpu()
{
__setfpucw(_FPU_DEFAULT & ~FPU_MASK);
}


main() {
double zero=0.0;
double one=1.0;
fpu();

one /=zero;
}

# cat zzz.c

#define _GNU_SOURCE 1
#include <fenv.h>

main() {
double zero=0.0;
double one=1.0;

feenableexcept(FE_ALL_EXCEPT);

one /=zero;
}


You have new mail in /var/spool/mail/root
# exit
exit

Script done on Tue Jan 15 13:28:32 2002



Cheers,
Dick Johnson



2002-01-15 18:50:37

by Christian Thalinger

Subject: Re: floating point exception

On Tue, 2002-01-15 at 19:31, Richard B. Johnson wrote:
> On 15 Jan 2002, Christian Thalinger wrote:
>
> > On Tue, 2002-01-15 at 15:34, Zwane Mwaikambo wrote:
> > > On 14 Jan 2002, Christian Thalinger wrote:
> [SNIPPED...]
>
> >
> > Tried this:
> >
> > #define _GNU_SOURCE 1
> > #include <fenv.h>
> >
> > main() {
> > double zero=0.0;
> > double one=1.0;
> >
> > feenableexcept(FE_ALL_EXCEPT);
> >
> > one /=zero;
> > }
> >
> Well, that won't even link. The source I showed previously
> compiles and links fine. It also shows an FPU exception when
> one divides by zero:
>
> Script started on Tue Jan 15 13:27:05 2002
> # gcc -o zzz zzz.c -lm
> /tmp/ccjhyGHj.o: In function `main':
> /tmp/ccjhyGHj.o(.text+0x25): undefined reference to `feenableexcept'
> collect2: ld returned 1 exit status

This depends on the libc version. Seems you have 2.1. For me it's 2.2.

[root@sector17:/root/src]# cat fpu-exception.c
#define _GNU_SOURCE 1
#include <fenv.h>

main() {
double zero=0.0;
double one=1.0;

feenableexcept(FE_ALL_EXCEPT);

one /=zero;
}
[root@sector17:/root/src]# gcc -Wall -lm -o fpu-exception fpu-exception.c
fpu-exception.c:4: warning: return type defaults to `int'
fpu-exception.c: In function `main':
fpu-exception.c:11: warning: control reaches end of non-void function
[root@sector17:/root/src]# ./fpu-exception
Floating point exception
[root@sector17:/root/src]#


2002-01-16 05:47:15

by Zwane Mwaikambo

Subject: Re: floating point exception

On 15 Jan 2002, Christian Thalinger wrote:

> Yes, the segfault did recur, and there is nothing in
> dmesg. That was also my first thought; I then checked
> /var/log/messages with tail and it got stuck. No ctrl-c.

ctrl-alt-sysrq k? I'd just like to know whether your box hung completely.
Could you also run the ver_linux script in linux/scripts so that we can
get a better idea of your operating environment.

Cheers,
Zwane Mwaikambo


2002-01-16 11:57:20

by Christian Thalinger

Subject: Re: floating point exception

On Wed, 2002-01-16 at 06:45, Zwane Mwaikambo wrote:
> On 15 Jan 2002, Christian Thalinger wrote:
>
> > Yes, the segfault did recur, and there is nothing in
> > dmesg. That was also my first thought; I then checked
> > /var/log/messages with tail and it got stuck. No ctrl-c.
>
> ctrl-alt-sysrq k? I'd just like to know whether your box hung completely.
> Could you also run the ver_linux script in linux/scripts so that we can
> get a better idea of your operating environment.
>
> Cheers,
> Zwane Mwaikambo
>
>

What I got at my last exception (started the client on tty1):

I was listening to an mp3 with mpg123. After the exception the mp3 got into
the _hey_my_system_is_completely_locked loop. Couldn't kill the process.
The system was responsive, console switching was ok. Switched console to
tty2 where X was running - ctrl-c - X went down -> console switching
wasn't possible anymore.

ctrl-alt-sysrq was responding, but only with the line:

SysRq : Emergency Sync
SysRq : .... (tried also the other ones)

but nothing happened. No syncing, no unmount and no showtasks. Right now I
noticed that showTasks, mem and pc do not give _any_ output, but syncing
works.

I'll do further testing when I'm back from work.

Gnu C 3.0.3
Gnu make 3.79.1
util-linux 2.11m
mount 2.11h
modutils 2.4.11
e2fsprogs 1.25
reiserfsprogs 3.x.0b
Linux C Library 2.2.4
Dynamic linker (ldd) 2.2.4
Linux C++ Library 3.0.2
Procps 2.0.7
Net-tools 1.60
Console-tools 0.2.3
Sh-utils 2.0.11
Modules Loaded NVdriver sym53c8xx scsi_mod pwcx-i386 pwc rio500 usb-ohci
usbcore w83781d eeprom i2c-proc i2c-amd756 i2c-isa binfmt_misc binfmt_aout
ospm_processor ospm_system ospm_busmgr sercontrol lirc_i2c lirc_dev tuner
tvaudio msp3400 bttv videodev i2c-algo-bit i2c-core nfsd lockd sunrpc
parport_pc lp parport via-rhine emu10k1 sound ac97_codec soundcore rtc


2002-01-16 14:35:16

by Zwane Mwaikambo

Subject: Re: floating point exception

On 16 Jan 2002, Christian Thalinger wrote:

> Gnu C 3.0.3
> Gnu make 3.79.1
> util-linux 2.11m
> mount 2.11h
> modutils 2.4.11
> e2fsprogs 1.25
> reiserfsprogs 3.x.0b
> Linux C Library 2.2.4
> Dynamic linker (ldd) 2.2.4
> Linux C++ Library 3.0.2
> Procps 2.0.7
> Net-tools 1.60
> Console-tools 0.2.3
> Sh-utils 2.0.11
> Modules Loaded NVdriver sym53c8xx scsi_mod pwcx-i386 pwc rio500 usb-ohci
> usbcore w83781d eeprom i2c-proc i2c-amd756 i2c-isa binfmt_misc binfmt_aout
> ospm_processor ospm_system ospm_busmgr sercontrol lirc_i2c lirc_dev tuner
> tvaudio msp3400 bttv videodev i2c-algo-bit i2c-core nfsd lockd sunrpc
> parport_pc lp parport via-rhine emu10k1 sound ac97_codec soundcore rtc

Can you also reproduce _without_ loading NVdriver, just to make everybody
happy.

Thanks,
Zwane Mwaikambo


2002-01-16 20:28:58

by Christian Thalinger

Subject: Re: floating point exception

On Wed, 2002-01-16 at 15:32, Zwane Mwaikambo wrote:
> Can you also reproduce _without_ loading NVdriver, just to make everybody
> happy.
>
> Thanks,
> Zwane Mwaikambo
>

Sure, same breakdown. Maybe it's really a dual Athlon XP issue, as Dave
Jones mentioned. But shouldn't this also occur when I trigger a floating
point exception myself? Is there a way to check which floating point
exception was raised by the seti client?

Regards.

2002-01-16 21:24:40

by Richard B. Johnson

Subject: Re: floating point exception

On 16 Jan 2002, Christian Thalinger wrote:

> On Wed, 2002-01-16 at 15:32, Zwane Mwaikambo wrote:
> > Can you also reproduce _without_ loading NVdriver, just to make everybody
> > happy.
> >
> > Thanks,
> > Zwane Mwaikambo
> >
>
> Sure, same breakdown. Maybe it's really a dual Athlon XP issue, as Dave
> Jones mentioned. But shouldn't this also occur when I trigger a floating
> point exception myself? Is there a way to check which floating point
> exception was raised by the seti client?
>
> Regards.
>

Maybe you can run it under gdb? Or `strace` it to a file? Usually
these things are caused by invalid 'C' runtime libraries, either
corrupt, "installed by just making a symlink to something that
was presumed to be close to what the application was compiled with",
or an error in mem-mapping.

Another very-real possibility is that somebody used floating-point
within the kernel thus corrupting the `seti` FPU state. You can
check this out by making a program that does lots of FP calculations,
perhaps the sine of a large number of values. You put the results
into one array. Then you do the exact same thing with the results
put into another array. Then just `memcmp` the arrays! You run
this in a loop for an hour. If the kernel is mucking with your FPU,
it will certainly show.


Cheers,
Dick Johnson



2002-01-16 22:00:32

by Brian Gerst

Subject: Re: floating point exception

"Richard B. Johnson" wrote:
>
> On 16 Jan 2002, Christian Thalinger wrote:
>
> > On Wed, 2002-01-16 at 15:32, Zwane Mwaikambo wrote:
> > > Can you also reproduce _without_ loading NVdriver, just to make everybody
> > > happy.
> > >
> > > Thanks,
> > > Zwane Mwaikambo
> > >
> >
> > Sure, same breakdown. Maybe it's really a dual Athlon XP issue, as Dave
> > Jones mentioned. But shouldn't this also occur when I trigger a floating
> > point exception myself? Is there a way to check which floating point
> > exception was raised by the seti client?
> >
> > Regards.
> >
>
> Maybe you can run it under gdb? Or `strace` it to a file? Usually
> these things are caused by invalid 'C' runtime libraries, either
> corrupt, "installed by just making a symlink to something that
> was presumed to be close to what the application was compiled with",
> or an error in mem-mapping.
>
> Another very-real possibility is that somebody used floating-point
> within the kernel thus corrupting the `seti` FPU state. You can
> check this out by making a program that does lots of FP calculations,
> perhaps the sine of a large number of values. You put the results
> into one array. Then you do the exact same thing with the results
> put into another array. Then just `memcmp` the arrays! You run
> this in a loop for an hour. If the kernel is mucking with your FPU,
> it will certainly show.

Hmm, that's an interesting idea... An Athlon optimised kernel does use
the MMX/FPU registers to do mem copies. Try running a kernel compiled
for just a Pentium and see if the problem persists.

--

Brian Gerst

2002-01-16 22:05:29

by Richard B. Johnson

Subject: Re: floating point exception

On Wed, 16 Jan 2002, Brian Gerst wrote:

> "Richard B. Johnson" wrote:
> >
> > On 16 Jan 2002, Christian Thalinger wrote:
> >
> > > On Wed, 2002-01-16 at 15:32, Zwane Mwaikambo wrote:
> > > > Can you also reproduce _without_ loading NVdriver, just to make everybody
> > > > happy.
> > > >
> > > > Thanks,
> > > > Zwane Mwaikambo
> > > >
> > >
> > > Sure, same breakdown. Maybe it's really a dual Athlon XP issue, as Dave
> > > Jones mentioned. But shouldn't this also occur when I trigger a floating
> > > point exception myself? Is there a way to check which floating point
> > > exception was raised by the seti client?
> > >
> > > Regards.
> > >
[SNIPPED...]



> > into one array. Then you do the exact same thing with the results
> > put into another array. Then just `memcmp` the arrays! You run
> > this in a loop for an hour. If the kernel is mucking with your FPU,
> > it will certainly show.
>
> Hmm, that's an interesting idea... An Athlon optimised kernel does use
> the MMX/FPU registers to do mem copies. Try running a kernel compiled
> for just a Pentium and see if the problem persists.
>

Here's a progy.. This SHOULD run forever. I assume malloc() works and
don't check the result --yes I already know that.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>

#define MAX_FLOAT 0x100000   /* number of doubles per array */

int main(void)
{
    unsigned int seed;
    double *x;
    double *y;
    double *z;
    size_t i;

    x = (double *) malloc(MAX_FLOAT * sizeof(double));
    y = (double *) malloc(MAX_FLOAT * sizeof(double));
    /* Seed from the clock without storing a time_t through an unsigned int pointer. */
    seed = (unsigned int) time(NULL);
    for(;;)
    {
        /* First pass: fill x from the sequence started by seed. */
        srand(seed);
        z = x;
        for(i = 0; i < MAX_FLOAT; i++)
            *z++ = cos((double) rand());
        /* Second pass: same seed again, so y should come out identical. */
        srand(seed);
        z = y;
        for(i = 0; i < MAX_FLOAT; i++)
            *z++ = cos((double) rand());
        /* Any difference means a result changed between the two passes. */
        if(memcmp(x, y, MAX_FLOAT * sizeof(double)))
            break;
        seed = rand();
    }
    fprintf(stderr, "Floating point failure\n");
    return 1;
}



Cheers,
Dick Johnson



2002-01-16 22:13:19

by Mark Zealey

Subject: Re: floating point exception

On Wed, Jan 16, 2002 at 05:05:55PM -0500, Richard B. Johnson wrote:

> for(;;)
> {
> srand(seed);
> z = x;
> for(i = 0; i < MAX_FLOAT; i++)
> *z++ = cos((double) rand());
> srand(seed);
> z = y;
> for(i = 0; i < MAX_FLOAT; i++)
> *z++ = cos((double) rand());
> if(memcmp(x, y, MAX_FLOAT * sizeof(double)))
> break;
> seed = rand();

Um, maybe I'm not reading this properly.. why are you randing, doing 1 set and
then using different random values for the other set ?

--

Mark Zealey
[email protected]
[email protected]

UL++++>$ G!>(GCM/GCS/GS/GM) dpu? s:-@ a16! C++++>$ P++++>+++++$ L+++>+++++$
!E---? W+++>$ N- !o? !w--- O? !M? !V? !PS !PE--@ PGP+? r++ !t---?@ !X---?
!R- b+ !tv b+ DI+ D+? G+++ e>+++++ !h++* r!-- y--

(http://www.geekcode.com)

2002-01-16 22:22:19

by Richard B. Johnson

Subject: Re: floating point exception

On Wed, 16 Jan 2002, Mark Zealey wrote:

> On Wed, Jan 16, 2002 at 05:05:55PM -0500, Richard B. Johnson wrote:
>
> > for(;;)
> > {
> > srand(seed);
^^^^^^^^^^^^^^^^

> > z = x;
> > for(i = 0; i < MAX_FLOAT; i++)
> > *z++ = cos((double) rand());
> > srand(seed);
^^^^^^^^^^^^^

> > z = y;
> > for(i = 0; i < MAX_FLOAT; i++)
> > *z++ = cos((double) rand());
> > if(memcmp(x, y, MAX_FLOAT * sizeof(double)))
> > break;
> > seed = rand();
>
> Um, maybe I'm not reading this properly.. why are you randing, doing 1 set and
> then using different random values for the other set ?

I am NOT. I am setting the seed BACK to whatever it was for the first
set with srand(seed). After the compare, I change the seed.


Cheers,
Dick Johnson



2002-01-16 23:37:38

by Christian Thalinger

Subject: Re: floating point exception

On Wed, 2002-01-16 at 23:05, Richard B. Johnson wrote:
> Here's a progy.. This SHOULD run forever. I assume malloc() works and
> don't check the result --yes I already know that.
>
[snip]

It ran for about 70 min on both CPUs (started twice) and no problem
occurred. I still have to try the Pentium-optimized kernel.

Regards.