2004-03-31 13:18:00

by Ulrich Windl

Subject: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

Hello,

I tried to find an answer in SuSE's support database and in SAP's support
database, and also searched Google, but could not find one:

We run SuSE Linux Enterprise Server 8 (SLES8) on a HP rx4640 Itanium2 server
with 2 CPUs (family: Itanium 2, model: 1, revision: 5, archrev: 0).

In syslog I do see periodic kernel messages (with no implicit priority) that
read:

dw.sapC11_DVS02(14393): floating-point assist fault at ip 400000000062ada1,
isr 0000020000000008

("dw.sapC11_DVS02" is a SAP R/3 work process (46D_EXT, patch 1754), for those
who care.)

Can anybody explain what this message means? Is it an application problem, or
is it a kernel problem?

Regards,
Ulrich
P.S. I'm not subscribed to linux-kernel, so please CC: at least.


2004-03-31 17:00:35

by Denis Vlasenko

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

On Wednesday 31 March 2004 15:16, Ulrich Windl wrote:
> Hello,
>
> I tried to find an answer in SuSE's support database and in SAP's
> support database, and also searched Google, but could not find one:
>
> We run SuSE Linux Enterprise Server 8 (SLES8) on a HP rx4640 Itanium2
> server with 2 CPUs (family: Itanium 2, model: 1, revision: 5, archrev: 0).
>
> In syslog I do see periodic kernel messages (with no implicit priority)
> that read:
>
> dw.sapC11_DVS02(14393): floating-point assist fault at ip 400000000062ada1,
> isr 0000020000000008
>
> ("dw.sapC11_DVS02" is a SAP R/3 work process (46D_EXT, patch 1754), for
> those who care.)
>
> Can anybody explain what this message means? Is it an application problem,
> or is it a kernel problem?

static int fpu_swa_count = 0;
static unsigned long last_time;
...
	if (jiffies - last_time > 5*HZ)
		fpu_swa_count = 0;
	if ((fpu_swa_count < 4) && !(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT)) {
		last_time = jiffies;
		++fpu_swa_count;
		printk(KERN_WARNING "%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
		       current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
	}

kernel says that you have them too frequently, which probably
impairs efficiency. It's a hint to programmer.
--
vda
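
A minimal user-space sketch (not from the original mail) of the rate-limiting logic in the snippet above. The value of HZ and the helper names fault_would_be_logged() and logged_in_burst() are assumptions made for illustration; only the two counters and the two conditions are taken from the quoted kernel code. Read literally, the `< 4` test lets four messages through per window, and the window is measured from the last message that was actually printed.

```c
#include <stdbool.h>

#define HZ 100UL /* assumed tick rate, for illustration only */

static int fpu_swa_count;
static unsigned long last_time;

/* Returns true if a fault arriving at tick `now` would be logged. */
static bool fault_would_be_logged(unsigned long now)
{
    /* More than five seconds since the last logged fault: start over. */
    if (now - last_time > 5 * HZ)
        fpu_swa_count = 0;

    if (fpu_swa_count < 4) {
        last_time = now;
        ++fpu_swa_count;
        return true;
    }
    return false;
}

/* Counts how many of `n` faults arriving at the same tick get logged. */
static int logged_in_burst(unsigned long now, int n)
{
    int logged = 0;

    for (int i = 0; i < n; i++)
        if (fault_would_be_logged(now))
            logged++;
    return logged;
}
```

Calling logged_in_burst() with a large burst shows that the number of log lines stays bounded no matter how many faults occur, so the log alone cannot tell you how frequent the faults really are.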

2004-03-31 18:06:15

by David Mosberger

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

>>>>> On Wed, 31 Mar 2004 19:00:17 +0200, Denis Vlasenko said:

Denis> kernel says that you have them too frequently, which probably
Denis> impairs efficiency. It's a hint to programmer.

Close: the kernel limits the frequency of the printing to avoid
flooding the log files. Even if you do get the faults frequently, it
won't print more than 5 warning messages every 5 seconds. The
floating-point software-assist (fpswa) faults are harmless in the
sense that they don't affect correctness of the program, but if you do
get them _very_ frequently (which is quite rare), they could impair
performance. FPSWA faults occur only for corner-cases of
floating-point arithmetic, such as operations on denormals or
non-finite numbers. Many programs don't need denormal support at all
and for those, you can link the program with -ffast-math (GCC) or -ftz
(Intel compiler). This will turn on "flush-to-zero" mode and avoid
any FPSWA-faults due to denormals (in x86-speak, this is equivalent to
the "flush-to-zero" mode that SSE offers).
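
A small sketch (not from the original mail) of what a denormal is in practice. The helper names is_denormal() and half_of_smallest_normal() are made up for this illustration. It assumes IEEE 754 doubles and a build without flush-to-zero; with flush-to-zero enabled the halved value would be expected to become zero instead.

```c
#include <float.h>

/*
 * Returns nonzero if x is a denormal (subnormal) double: nonzero, but
 * smaller in magnitude than DBL_MIN, the smallest normal double.
 * NaN fails every comparison and is therefore reported as "not denormal".
 */
static int is_denormal(double x)
{
    return x != 0.0 && x > -DBL_MIN && x < DBL_MIN;
}

/*
 * Halving the smallest normal double leaves the normal range.
 * The volatile keeps the division from being folded at compile time.
 */
static double half_of_smallest_normal(void)
{
    volatile double tiny = DBL_MIN;

    return tiny / 2.0;
}
```

Arithmetic or comparisons on values like the one returned by half_of_smallest_normal() are the kind of corner case described above as needing software assistance on Itanium.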

If the messages appear with a frequency of less than 5 messages/5
seconds, then there is certainly no performance issue and you may want
to just turn off the messages. This can be done via the prctl(2)
system-call or with the prctl command. With the latter:

prctl --fpemu=silent

will fork a new shell and disable the printing of the FPSWA-messages
in the shell and all its children. There is also "--unaligned=silent"
which will turn off a similar message for unaligned access emulation
done by the kernel.
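
A minimal sketch (not from the original mail) of the prctl(2) route mentioned above. PR_SET_FPEMU and PR_FPEMU_NOPRINT are the constants from the Linux headers; the wrapper name silence_fpswa_messages() is made up for this example. The option is specific to ia64, so on other architectures the call is expected to fail (typically with EINVAL) rather than do anything.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Ask the kernel not to log floating-point assist messages for the
 * calling process. Returns 0 on success, or -1 with errno set.
 */
static int silence_fpswa_messages(void)
{
    return prctl(PR_SET_FPEMU, PR_FPEMU_NOPRINT, 0, 0, 0);
}
```

A program would call this once early in main() and check the return value. This only hides the messages; the faults themselves still occur and still cost time.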

--david

2004-03-31 18:24:02

by Alex Williamson

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

On Wed, 2004-03-31 at 11:06, David Mosberger wrote:

> If the messages appear with a frequency of less than 5 messages/5
> seconds, then there is certainly no performance issue and you may want
> to just turn off the messages.

But if you do get them at the maximum rate for a computational
application, performance could be _severely_ impacted (i.e., by orders of
magnitude).

Alex


2004-03-31 18:38:00

by David Mosberger

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

>>>>> On Wed, 31 Mar 2004 11:23:53 -0700, Alex Williamson said:

Alex> On Wed, 2004-03-31 at 11:06, David Mosberger wrote:
>> If the messages appear with a frequency of less than 5 messages/5
>> seconds, then there is certainly no performance issue and you may want
>> to just turn off the messages.

Alex> But if you do get them at the maximum rate for a computational
Alex> application, performance could be _severely_ impacted (ie. orders of
Alex> magnitude).

Good point.

--david

2004-03-31 18:52:31

by Richard B. Johnson

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

On Wed, 31 Mar 2004, Alex Williamson wrote:

> On Wed, 2004-03-31 at 11:06, David Mosberger wrote:
>
> > If the messages appear with a frequency of less than 5 messages/5
> > seconds, then there is certainly no performance issue and you may want
> > to just turn off the messages.
>
> But if you do get them at the maximum rate for a computational
> application, performance could be _severely_ impacted (ie. orders of
> magnitude).
>
> Alex
>
The power-on or hardware-reset default for the ix86 FPU
is to attempt to handle div 0 errors transparently.
In other words:

R = 1 / (1/r1 + 1/r2 + 1/r3 +...) will resolve correctly
if any r...n = 0. Parallel resistance when one or more
resistors is 0 ohms.
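
A minimal sketch (not from the original mail) of the parallel-resistance formula above, assuming IEEE 754 arithmetic with the divide-by-zero exception masked, which is the usual default. The function name parallel() is made up for this example.

```c
/*
 * Parallel resistance R = 1 / (1/r[0] + 1/r[1] + ...).
 * With divide-by-zero masked, 1.0 / 0.0 evaluates to +infinity and
 * 1.0 / infinity evaluates to 0.0, so a 0-ohm resistor yields R = 0.
 */
static double parallel(const double *r, int n)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += 1.0 / r[i];
    return 1.0 / sum;
}
```

For example, two 4-ohm resistors give 2 ohms, and adding a 0-ohm resistor to any set gives 0 ohms without any special-casing in the code.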

This is probably not the default for the Itanium so your
application either needs to be fixed or at least needs to
set the FPU to handle these problems. The configuration of
the FPU remains, per-process, so some executable could
be run upon login to "fix" the processor.

The following program should cause the bad program to
core-dump any time it is run. You could also configure
this so it lets anybody write garbage and it will "work".

----------------------------------------------------------------
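/*
 * Note FPU control only exists per process. Therefore, you have
 * to set up the FPU before you use it in any program.
 *
 * This uses the x87 control word from glibc's <fpu_control.h> on ix86.
 */
#include <fpu_control.h>

/* All six x87 exception mask bits. */
#define FPU_MASK (_FPU_MASK_IM | \
                  _FPU_MASK_DM | \
                  _FPU_MASK_ZM | \
                  _FPU_MASK_OM | \
                  _FPU_MASK_UM | \
                  _FPU_MASK_PM)

/* Unmask every exception so that a faulting operation raises SIGFPE. */
static void fpu(void)
{
    fpu_control_t cw = _FPU_DEFAULT & ~FPU_MASK;

    _FPU_SETCW(cw);
}

int main(void)
{
    volatile double zero = 0.0;  /* volatile: keep the division at run time */
    volatile double one = 1.0;

    fpu();
    one /= zero;    /* Testing, remove this after */
    return 0;
}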
-----------------------------


Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2004-03-31 19:11:03

by David Mosberger

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

>>>>> On Wed, 31 Mar 2004 13:53:13 -0500 (EST), "Richard B. Johnson" said:

Richard> The power-on or hardware-reset default for the ix86 FPU
Richard> is to attempt to handle div 0 errors transparently.

I must be missing something. So far I haven't seen anything that
would suggest the FPSWA faults were due to infinities. I'd guess that
it's much more likely that they're due to denormals.

--david

2004-03-31 20:16:37

by Helge Deller

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

On Wednesday 31 March 2004 19:00, Denis Vlasenko wrote:
> On Wednesday 31 March 2004 15:16, Ulrich Windl wrote:
> > Hello,
> >
> > I tried to find an answer in SuSE's support database and in SAP's
> > support database, and also searched Google, but could not find one:
> >
> > We run SuSE Linux Enterprise Server 8 (SLES8) on a HP rx4640 Itanium2
> > server with 2 CPUs (family: Itanium 2, model: 1, revision: 5, archrev: 0).
> >
> > In syslog I do see periodic kernel messages (with no implicit priority)
> > that read:
> >
> > dw.sapC11_DVS02(14393): floating-point assist fault at ip 400000000062ada1,
> > isr 0000020000000008
> >
> > ("dw.sapC11_DVS02" is a SAP R/3 work process (46D_EXT, patch 1754), for
> > those who care.)
> >
> > Can anybody explain what this message means? Is it an application problem,
> > or is it a kernel problem?
>
>....
>
> kernel says that you have them too frequently, which probably
> impairs efficiency. It's a hint to programmer.

Correct.
We are aware of this message and it will be fixed with a future SAP R/3 kernel patch.
Since this message is only raised in the startup code of the R/3 kernel you shouldn't
see any runtime performance impact and can safely ignore the message for now.

Helge Deller
SAP LinuxLab

2004-03-31 20:18:49

by Richard B. Johnson

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

On Wed, 31 Mar 2004, David Mosberger wrote:

> >>>>> On Wed, 31 Mar 2004 13:53:13 -0500 (EST), "Richard B. Johnson" said:
>
> Richard> The power-on or hardware-reset default for the ix86 FPU
> Richard> is to attempt to handle div 0 errors transparently.
>
> I must be missing something. So far I haven't seen anything that
> would suggest the FPSWA faults were due to infinities. I'd guess that
> it's much more likely that they're due to denormals.
>
> --david


ftp://download.intel.com/design/Itanium/Downloads/24541401.pfd

"Itanium Processor Floating-point Software Assistance
and Floating-Point Exception Handling"


Any FPU fault gets trapped to this code. NaNs, denormals, overflow,
inexact, etc., everything....

The reading of 2.2 may not be clear, but further reading will
show that anything that didn't go according to plan gets trapped
to the "Software Assistance" Handler. Writing a message about
the trap to a log-file is a BUG! The handler should just do
whatever it's supposed to do!

Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2004-03-31 21:33:39

by David Mosberger

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

>>>>> On Wed, 31 Mar 2004 15:13:31 -0500 (EST), "Richard B. Johnson" <[email protected]> said:

Richard> The reading of 2.2 may not be clear, but further reading
Richard> will show that anything that didn't go according to plan
Richard> gets trapped to the "Software Assistance" Handler. Writing
Richard> a message about the trap to a log-file is a BUG! The
Richard> handler should just do whatever it's supposed to do!

Sorry, I thought you were trying to help diagnose the issue at hand.
I didn't realize you were making a statement.

Never mind.

--david

2004-04-01 07:05:58

by Ulrich Windl

Subject: Re: 2.4.21 on Itanium2: floating-point assist fault at ip 400000000062ada1, isr 0000020000000008

OK,

thanks guys for the answers: It seems to be a hint that an instruction can't
be done in hardware, but has to be emulated instead. I tried to find the
offending code, and I was surprised that it's not a complicated function:

br.few 0x400000000062aca0 <ab_CompManyEq+384>
0x400000000062ad70 <ab_CompManyEq+592>: [MII] nop.m 0x0
0x400000000062ad71 <ab_CompManyEq+593>: zxt4 r14=r44;;
0x400000000062ad72 <ab_CompManyEq+594>: add r15=r35,r14
0x400000000062ad80 <ab_CompManyEq+608>: [MMI] add r14=r34,r14;;
0x400000000062ad81 <ab_CompManyEq+609>: ldfd f7=[r14]
0x400000000062ad82 <ab_CompManyEq+610>: nop.i 0x0
0x400000000062ad90 <ab_CompManyEq+624>: [MMI] ldfd f6=[r15];;
0x400000000062ad91 <ab_CompManyEq+625>: nop.m 0x0
0x400000000062ad92 <ab_CompManyEq+626>: nop.i 0x0
0x400000000062ada0 <ab_CompManyEq+640>: [MFI] nop.m 0x0
0x400000000062ada1 <ab_CompManyEq+641>: fcmp.eq.s0 p7,p6=f7,f6 <<<
0x400000000062ada2 <ab_CompManyEq+642>: nop.i 0x0;;
0x400000000062adb0 <ab_CompManyEq+656>: [MIB] nop.m 0x0
0x400000000062adb1 <ab_CompManyEq+657>: nop.i 0x0

(when attaching to the process with gdb)
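
A minimal sketch (not from the original mail) of how the ip in the log line maps onto the disassembly above. The printk quoted in this thread reports cr_iip plus the slot index ri, and IA-64 instruction bundles are 16 bytes with three slots, so the low four bits of the reported ip select the slot. The helper names bundle_address() and slot_number() are made up for this example.

```c
#include <stdint.h>

/* Address of the 16-byte bundle containing the faulting instruction. */
static uint64_t bundle_address(uint64_t ip)
{
    return ip & ~UINT64_C(0xf);
}

/* Slot (0, 1 or 2) within that bundle. */
static unsigned slot_number(uint64_t ip)
{
    return (unsigned)(ip & 0xf);
}
```

For the reported ip 400000000062ada1 this gives bundle 400000000062ada0, slot 1, which is the F slot of the [MFI] bundle holding the fcmp.eq.s0 instruction marked above.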

Thanks and regards,
Ulrich

On 31 Mar 2004 at 19:00, Denis Vlasenko wrote:

> On Wednesday 31 March 2004 15:16, Ulrich Windl wrote:
> > Hello,
> >
> > I tried to find an answer in SuSE's support database and in SAP's
> > support database, and also searched Google, but could not find one:
> >
> > We run SuSE Linux Enterprise Server 8 (SLES8) on a HP rx4640 Itanium2
> > server with 2 CPUs (family: Itanium 2, model: 1, revision: 5, archrev: 0).
> >
> > In syslog I do see periodic kernel messages (with no implicit priority)
> > that read:
> >
> > dw.sapC11_DVS02(14393): floating-point assist fault at ip 400000000062ada1,
> > isr 0000020000000008
> >
> > ("dw.sapC11_DVS02" is a SAP R/3 work process (46D_EXT, patch 1754), for
> > those who care.)
> >
> > Can anybody explain what this message means? Is it an application problem,
> > or is it a kernel problem?
>
> static int fpu_swa_count = 0;
> static unsigned long last_time;
> ...
> 	if (jiffies - last_time > 5*HZ)
> 		fpu_swa_count = 0;
> 	if ((fpu_swa_count < 4) && !(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT)) {
> 		last_time = jiffies;
> 		++fpu_swa_count;
> 		printk(KERN_WARNING "%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
> 		       current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
> 	}
>
> kernel says that you have them too frequently, which probably
> impairs efficiency. It's a hint to programmer.
> --
> vda
>
>