2002-03-04 18:59:00

by Eric Ries

Subject: FPU precision & signal handlers (bug?)


I have uncovered what I believe to be a bug in the 2.4.2 i386
kernel. I know that this is not the latest kernel, but I believe the
bug is present in later versions. I will outline the symptoms of the
bug, my belief about what causes it, and a proposed solution. If this
solution seems right, I'm happy to code up a patch and submit it to
the appropriate maintainer (I was unable to figure out who the correct
maintainer-target was for this message, so suggestions are welcome).

Background (or, why this seemingly minor bug is a problem):

We develop an application that relies heavily on being able to do
floating-point operations in a precise way across platforms in a
distributed system. Because we need to be able to marshal the results
of these operations across the network, we have to be very careful
that they are rounded the same on all machines. To accomplish this, we
have to set the FPU precision mode in all processors to 64 bits of
precision (80 bits is the default). This is because, as you already
know, performing an operation in 80 bits of precision and then
rounding it to 64 bits is not the same as doing that same operation at
64 bits of precision in the first place. We set the FPU precision mode
with the 'fldcw' instruction once, and naively expected this control
word to be in effect throughout the life of our process.

Symptoms:

We have been tracking a nasty bug in our program where some of our
operations happen at the 80-bit precision mode despite our having
explicitly set it to 64 bits. After much searching, we realized that
all of the offending operations originate in one of our signal
handlers. We use signals extensively in our program, and were quite
surprised to find kernel FINIT traps being generated from them. After
reading through the 2.4.2 source, I now believe that all signal
handlers run with the default FPU control word in effect. Here's
why...

Explanation:

When it's time to deliver a signal to a user process, the kernel goes
through setup_sigcontext() (arch/i386/kernel/signal.c:318) which saves
the state of the CPU registers at the time of the signal
call. Naturally, the kernel will use this saved state to restore the
registers after the signal handler returns. In order to save the state
of the FPU, it calls save_i387() (signal.c:347). save_i387()
(arch/i386/kernel/i387.c:321) uses one of several appropriate
instructions to save the state of the FPU registers. You will find, in
that function, this handy comment:

/* This will cause a "finit" to be triggered by the next
 * attempted FPU operation by the 'current' process.
 */

This happens because, when you save the state of the FPU registers
using an FNSAVE instruction (and I believe that all of the
instructions used in save_i387 are equivalent to the FNSAVE
instruction, although they differ in their details), you reset the FPU
state as a side-effect. (see
http://webster.cs.ucr.edu/Page_TechDocs/MASMDoc/ReferenceGuide/Chap_05.htm
which has a nice little reference on 387 instructions). Here's the
definition of FNSAVE I am familiar with: "Stores the 94-byte
coprocessor state to the specified memory location. In 32-bit mode on
the 80387-80486, the environment state takes 108 bytes. This
instruction has wait and no-wait versions. After the save, the
coprocessor is initialized as if FINIT had been executed." It's this
last bit that is interesting for this bug. In i387.c:329, still in
save_i387, the kernel has this line:

current->used_math = 0;

My belief is that this allows the kernel to leave the i387 in FINIT
mode. Now, back in user space, if the signal handler does any
floating-point operations, the FINIT trap will be generated, and the
kernel will respond by issuing the FNINIT instruction (in
i387.c:init_fpu). The net result is that this floating point
operation, and any others that take place before the signal
handler-return code is executed, will use the default FPU precision
mode (and the defaults for all other FPU flags) instead of the
(expected) process-global FPU precision mode.

Work-Around:

Now, in our application, we only care about the FPU precision mode
part of the FPU control word, so we can work around this problem by
simply resetting the control word in every signal handler. But this
strikes me as kind of a hack. Why should the signal handler, alone
among all my functions (excepting main) be responsible for blowing
away the control word?

Solution:

As the symptoms of this bug are relatively minor, so too I believe is
the solution. It seems to me that the FPU control word is the only FPU
register in the i387 that is considered to be "global" in scope. If
there are others, I'd appreciate someone letting me know. In any
event, I can think of two solutions. One is less efficient but
probably cleaner. At the time that we invoke the signal handler, we
save the FPU state, which resets the FPU state as a side-effect. At
this time, we could (if the FPU control word is in a non-default
setting) immediately re-set the FPU control word, causing an FINIT
trap, and then invoke the signal handler. This would impose a small
performance penalty on signal-handler invocations. A possibly more
efficient solution would be to add another member (along with
used_math) to the 'current' process data structure. Then, whenever a
process generated an FINIT trap, we could inspect this extra member to
see if we should, in addition to clearing the FPU registers, also
reset the FPU control word. To figure out what to reset the FPU cw to
would involve finding the saved _fpstate that we created with
save_i387 for this process, and then extracting the control word from
that.

So, both solutions have some performance implications, but only during
signal-handler invocation, and both, at least it seems to me, are very
small.

If someone would take a minute of their time to let me know if this
approach seems right/wrong/crazy, I'd greatly appreciate
it. Furthermore, if anybody thinks a patch along these lines would get
incorporated into the kernel, I'd be happy to produce it.

Thanks so much,

Eric Ries
[email protected]

PS. Here's the output from ver_linux for one machine that I've noticed
this problem on:

Linux thdev4 2.4.2-2 #1 Sun Apr 8 20:41:30 EDT 2001 i686 unknown

Gnu C 2.96
Gnu make 3.77
binutils 2.10.91.0.2
util-linux 2.10r
modutils 2.4.2
e2fsprogs 1.19
pcmcia-cs 3.1.22
PPP 2.4.0
Linux C Library 2.2.2
Dynamic linker (ldd) 2.2.2
Procps 2.0.7
Net-tools 1.57
Console-tools 0.3.3
Sh-utils 2.0
Modules Loaded nfs lockd sunrpc 3c59x ipchains usb-uhci usbcore

And here's /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 7
model name : Pentium III (Katmai)
stepping : 3
cpu MHz : 548.745
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1094.45


2002-03-04 22:23:04

by Alan

Subject: Re: FPU precision & signal handlers (bug?)

> handlers. We use signals extensively in our program, and were quite
> surprised to find kernel FINIT traps being generated from them. After
> reading through the 2.4.2 source, I now believe that all signal
> handlers run with the default FPU control word in effect. Here's
> why...

Think about MMX and hopefully it makes sense then.

> strikes me as kind of a hack. Why should the signal handler, alone
> among all my functions (excepting main) be responsible for blowing
> away the control word?

Right - I would expect it to be restored at the end of the signal handler
for you - is that occurring or not? I just want to make sure I understand
the precise details of the problem here.

2002-03-04 23:32:26

by Eric Ries

Subject: RE: FPU precision & signal handlers (bug?)

> -----Original Message-----
> From: Alan Cox [mailto:[email protected]]
> Sent: Monday, March 04, 2002 2:38 PM
> To: Eric Ries
> Cc: [email protected]
> Subject: Re: FPU precision & signal handlers (bug?)
>
>
> Think about MMX and hopefully it makes sense then.

Yes, I think I understand why this is the case presently.

> > strikes me as kind of a hack. Why should the signal handler, alone
> > among all my functions (excepting main) be responsible for blowing
> > away the control word?
>
> Right - I would expect it to be restored at the end of the signal handler
> for you - is that occurring or not? I just want to make sure I understand
> the precise details of the problem here.

Yes, my belief is that the kernel undoes the FINIT changes by restoring the
FPU state after the signal handler returns.

Eric

2002-03-05 18:24:25

by Gabriel Paubert

Subject: Re: FPU precision & signal handlers (bug?)



On 4 Mar 2002, Alan Cox wrote:

> > handlers. We use signals extensively in our program, and were quite
> > surprised to find kernel FINIT traps being generated from them. After
> > reading through the 2.4.2 source, I now believe that all signal
> > handlers run with the default FPU control word in effect. Here's
> > why...
>
> Think about MMX and hopefully it makes sense then.

AFAIR MMX only mucks with tag and status words (and the exponent fields of
the stack elements), but never depends on or modifies the control word.

>
> > strikes me as kind of a hack. Why should the signal handler, alone
> > among all my functions (excepting main) be responsible for blowing
> > away the control word?
>
> Right - I would expect it to be restored at the end of the signal handler
> for you - is that occurring or not? I just want to make sure I understand
> the precise details of the problem here.

State is restored properly at end of signal handler, it always worked
fine, no problem there...

Gabriel.

2002-03-05 23:52:05

by Alan

Subject: Re: FPU precision & signal handlers (bug?)

> > Think about MMX and hopefully it makes sense then.
>
> AFAIR MMX only mucks with tag and status words (and the exponent fields of
> the stack elements), but never depends on or modifies the control word.

Right but you don't want to end up in MMX mode by surprise in a
signal handler in library code. By the same argument you don't want to end
up in a weird maths mode.

I don't think it's a bug. I think it's correct (but seriously underdocumented)
behaviour.

2002-03-06 19:12:17

by Gabriel Paubert

Subject: Re: FPU precision & signal handlers (bug?)



On 4 Mar 2002, Eric Ries wrote:

>
> I have uncovered what I believe to be a bug in the 2.4.2 i386
> kernel. I know that this is not the latest kernel, but I believe the
> bug is present in later versions. I will outline the symptoms of the
> bug, my belief about what causes it, and a proposed solution. If this
> solution seems right, I'm happy to code up a patch and submit it to
> the appropriate maintainer (I was unable to figure out who the correct
> maintainer-target was for this message, so suggestions are welcome).
>
> Background (or, why this seemingly minor bug is a problem):
>
> We develop an application that relies heavily on being able to do
> floating-point operations in a precise way across platforms in a
> distributed system.

<Offtopic>
Believing in getting the same floating-point results down to the last bit
on different platforms is almost always bound to fail: for transcendental
functions, a 486 will not give the same results as a Pentium, a PIV will
use a library and give different results if you use SSE2 mode, and so on.
I don't even know whether an Athlon and the PII/PIII always give the same
results or not.

Now if you have other architectures, even if they all use IEEE-[78]54
floating point format, this becomes even more interesting. For example,
PPC, MIPS, IA64, and perhaps others will tend to use fused
multiply-accumulate instructions unless you tell the compiler not to do so
(BTW after a quick look at GCC doc and sources, -mno-fused-madd is not
even an option for IA-64).
</Offtopic>

> Because we need to be able to marshal the results
> of these operations across the network, we have to be very careful
> that they are rounded the same on all machines. To accomplish this, we
> have to set the FPU precision mode in all processors to 64 bits of
> precision (80 bits is the default). This is because, as you already
> know, performing an operation in 80 bits of precision and then
> rounding it to 64 bits is not the same as doing that same operation at
> 64 bits of precision in the first place. We set the FPU precision mode
> with the 'fldcw' instruction once, and naively expected this control
> word to be in effect throughout the life of our process.
>
> Symptoms:
>
> We have been tracking a nasty bug in our program where some of our
> operations happen at the 80-bit precision mode despite our having
> explicitly set it to 64 bits. After much searching, we realized that
> all of the offending operations originate in one of our signal
> handlers. We use signals extensively in our program, and were quite
> surprised to find kernel FINIT traps being generated from them. After
> reading through the 2.4.2 source, I now believe that all signal
> handlers run with the default FPU control word in effect.

Right.
[Snipped the clear explanation showing that you've done your homework]

> instruction, although they differ in their details), you reset the FPU
> state as a side-effect.

Actually, fxsave does not reset the FPU state IIRC (so it could be faster
for signal delivery to use fxsave followed by fnsave instead of the format
conversion routine if the FPU happens to hold the state of the current
process).

> http://webster.cs.ucr.edu/Page_TechDocs/MASMDoc/ReferenceGuide/Chap_05.htm
> which has a nice little reference on 387 instructions). Here's the
> definition of FNSAVE I am familiar with: "Stores the 94-byte
> coprocessor state to the specified memory location. In 32-bit mode on
> the 80387-80486, the environment state takes 108 bytes. This
> instruction has wait and no-wait versions. After the save, the
> coprocessor is initialized as if FINIT had been executed." It's this
> last bit that is interesting for this bug. In i387.c:329, still in
> save_i387, the kernel has this line:
>
> current->used_math = 0;
>
> My belief is that this allows the kernel to leave the i387 in FINIT
> mode. Now, back in user space, if the signal handler does any
> floating-point operations, the FINIT trap will be generated, and the
> kernel will respond by issuing the FNINIT instruction (in
> i387.c:init_fpu). The net result is that this floating point
> operation, and any others that take place before the signal
> handler-return code is executed, will use the default FPU precision
> mode (and the defaults for all other FPU flags) instead of the
> (expected) process-global FPU precision mode.
>
> Work-Around:
>
> Now, in our application, we only care about the FPU precision mode
> part of the FPU control word, so we can work around this problem by
> simply resetting the control word in every signal handler. But this
> strikes me as kind of a hack. Why should the signal handler, alone
> among all my functions (excepting main) be responsible for blowing
> away the control word?
>
> Solution:
>
> As the symptoms of this bug are relatively minor, so too I believe is
> the solution. It seems to me that the FPU control word is the only FPU
> register in the i387 that is considered to be "global" in scope. If
> there are others, I'd appreciate someone letting me know. In any
> event, I can think of two solutions. One is less efficient but
> probably cleaner. At the time that we invoke the signal handler, we
> save the FPU state, which resets the FPU state as a side-effect. At
> this time, we could (if the FPU control word is in a non-default
> setting) immediately re-set the FPU control word, causing an FINIT
> trap, and then invoke the signal handler. This would impose a small
> performance penalty on signal-handler invocations. A possibly more
> efficient solution would be to add another member (along with
> used_math) to the 'current' process data structure. Then, whenever a
> process generated an FINIT trap, we could inspect this extra member to
> see if we should, in addition to clearing the FPU registers, also
> reset the FPU control word. To figure out what to reset the FPU cw to
> would involve finding the saved _fpstate that we created with
> save_i387 for this process, and then extracting the control word from
> that.

Very bad idea, the control word is often changed in the middle of the
code, especially the rounding mode field for float->int conversions; have
a look at the code that GCC generates (grep for f{nst,ld}cw). The Pentium
IV doc even states that you can efficiently toggle between 2 values of the
control word, but not more.

Therefore you certainly don't want to inherit the control word of the
executing thread. Now adding a prctl or something similar to say "I'd like
to get this control word(s) value as initial value(s) in signal handlers"
might make sense, even on other architectures or for SSE/SSE2 to control
such things as handle denormal as zeros or change the set of exceptions
enabled by default...

Regards,
Gabriel.

2002-03-06 23:32:55

by Eric Ries

Subject: RE: FPU precision & signal handlers (bug?)

> -----Original Message-----
> From: Gabriel Paubert [mailto:[email protected]]
> Sent: Wednesday, March 06, 2002 11:12 AM
> To: Eric Ries
> Cc: [email protected]
> Subject: Re: FPU precision & signal handlers (bug?)
>
> <Offtopic>
> Believing in getting the same floating-point results down to the last bit
> on different platforms is almost always bound to fail: for transcendental
> functions, a 486 will not give the same results as a Pentium, a PIV will
> use a library and give different results if you use SSE2 mode, and so on.
> I don't even know whether an Athlon and the PII/PIII always give the same
> results or not.
>
> Now if you have other architectures, even if they all use IEEE-[78]54
> floating point format, this becomes even more interesting. For example,
> PPC, MIPS, IA64, and perhaps others will tend to use fused
> multiply-accumulate instructions unless you tell the compiler not to do so
> (BTW after a quick look at GCC doc and sources, -mno-fused-madd is not
> even an option for IA-64).
> </Offtopic>

Yes, this is a quite tricky problem. Fortunately, in our situation we are
extremely picky about just which calculations must be bit-for-bit consistent
across machines, and try to keep those operations very simple. We have been
doing this for some time on Intel hardware, and - apart from this signal
handler issue - have never had any problems.

> Right.
> [Snipped the clear explanation showing that you've done your homework]

Thanks :) I know how annoying it can be to put up with clueless posts on
mailing lists.

> Actually, fxsave does not reset the FPU state IIRC (so it could be faster
> for signal delivery to use fxsave followed by fnsave instead of the format
> conversion routine if the FPU happens to hold the state of the current
> process).

That's an interesting thought. I didn't have a decent reference on MMX
instructions while I was tracking this bug down, so I just assumed they were
basically equivalent to their 387 counterparts.


> Very bad idea, the control word is often changed in the middle of the
> code, especially the rounding mode field for float->int conversions; have
> a look at the code that GCC generates (grep for f{nst,ld}cw). The Pentium
> IV doc even states that you can efficiently toggle between 2 values of the
> control word, but not more.

I don't see how this is a problem, because (as far as I can tell) there is
no need to use the "default" control word at all. In the solution I propose,
the FPU state is still saved before a signal handler call, and restored
afterwards. It's just that during the signal handler execution, the control
word is set to the process-global value. Keep in mind that, in the case that
your signal handler has no floating-point instructions, the control word
never has to be set, because no FINIT trap will be generated. So there's
only a performance cost to those of us who use floating point in our signal
handlers.

> Therefore you certainly don't want to inherit the control word of the
> executing thread. Now adding a prctl or something similar to say "I'd like
> to get this control word(s) value as initial value(s) in signal handlers"
> might make sense, even on other architectures or for SSE/SSE2 to control
> such things as handle denormal as zeros or change the set of exceptions
> enabled by default...

I'm afraid I don't quite follow what you're suggesting here. Don't you
always want 'your' control word in any function that executes as part of
your process?

Thanks for the thoughtful reply,

Eric

2002-03-07 17:32:34

by Gabriel Paubert

Subject: Re: FPU precision & signal handlers (bug?)



On 6 Mar 2002, Alan Cox wrote:

> > > Think about MMX and hopefully it makes sense then.
> >
> > AFAIR MMX only mucks with tag and status words (and the exponent fields of
> > the stack elements), but never depends on or modifies the control word.
>
> Right but you don't want to end up in MMX mode by surprise in a
> signal handler in library code. By the same argument you don't want to end
> up in a weird maths mode.

I agree with the second part, but actually what you want is to start with
an empty stack. Whether the contents are FP or MMX is irrelevant.
Actually the support of applications using MMX did not require any change
to the kernel (Intel carefully designed it that way).

> I don't think it's a bug. I think it's correct (but seriously underdocumented)
> behaviour.

Indeed.

Gabriel.

2002-03-07 19:13:32

by Alan

Subject: Re: FPU precision & signal handlers (bug?)

> I agree with the second part, but actually what you want is to start with
> an empty stack. Whether the contents are FP or MMX is irrelevant.
> Actually the support of applications using MMX did not require any change
> to the kernel (Intel carefully designed it that way).

Not the case. If you drop into a signal or exception handler and it uses
FPU while MMX is on it'll get a nasty shock. As it happens Linux already
did the right thing.

Intel minimised it and did pretty much the best job that could be done for
it.

2002-03-08 11:54:39

by Gabriel Paubert

Subject: RE: FPU precision & signal handlers (bug?)



On 7 Mar 2002, Eric Ries wrote:

> Yes, this is a quite tricky problem. Fortunately, in our situation we are
> extremely picky about just which calculations must be bit-for-bit consistent
> across machines, and try to keep those operations very simple. We have been
> doing this for some time on Intel hardware, and - apart from this signal
> handler issue - have never had any problems.

Intel but not IA-64 I presume ?

>
> > Right.
> > [Snipped the clear explanation showing that you've done your homework]
>
> Thanks :) I know how annoying it can be to put up with clueless posts on
> mailing lists.

Indeed :)

> > Actually, fxsave does not reset the FPU state IIRC (so it could be faster
> > for signal delivery to use fxsave followed by fnsave instead of the format
> > conversion routine if the FPU happens to hold the state of the current
> > process).
>
> That's an interesting thought. I didn't have a decent reference on MMX
> instructions while I was tracking this bug down, so I just assumed they were
> basically equivalent to their 387 counterparts.

I confirm.

> I don't see how this is a problem, because (as far as I can tell) there is
> no need to use the "default" control word at all. In the solution I propose,
> the FPU state is still saved before a signal handler call, and restored
> afterwards. It's just that during the signal handler execution, the control
> word is set to the process-global value. Keep in mind that, in the case that
> your signal handler has no floating-point instructions, the control word
> never has to be set, because no FINIT trap will be generated. So there's
> only a performance cost to those of us who use floating point in our signal
> handlers.

And where would you take the global value from ?

>
> > Therefore you certainly don't want to inherit the control word of the
> > executing thread. Now adding a prctl or something similar to say "I'd like
> > to get this control word(s) value as initial value(s) in signal handlers"
> > might make sense, even on other architectures or for SSE/SSE2 to control
> > such things as handle denormal as zeros or change the set of exceptions
> > enabled by default...
>
> I'm afraid I don't quite follow what you're suggesting here. Don't you
> always want 'your' control word in any function that executes as part of
> your process?

But your control word is changed on the fly by the compiler for things as
trivial as float/double to int conversion. It is not as global and static
as you seem to believe.

Gabriel.

2002-03-08 12:11:59

by Gabriel Paubert

Subject: Re: FPU precision & signal handlers (bug?)



On 8 Mar 2002, Alan Cox wrote:

> > I agree with the second part, but actually what you want is to start with
> > an empty stack. Whether the contents are FP or MMX is irrelevant.
> > Actually the support of applications using MMX did not require any change
> > to the kernel (Intel carefully designed it that way).
>
> Not the case. If you drop into a signal or exception handler and it uses
> FPU while MMX is on it'll get a nasty shock. As it happens Linux already
> did the right thing.

You are in for the same shock if the FPU is in non-MMX mode. Think of
the case when all stack entries are marked valid in the interrupted
process: stack overflow on the first fld.

Or alternatively, show me how you could simplify the signal-delivery FPU
logic for a non-MMX processor.

Answer: you can't. I still stand by my statement that MMX is
completely irrelevant and does not add any special case.

> Intel minimised it and did pretty much the best job that could be done for
> it.

Better than this, they made it completely transparent.

Gabriel.