LinuxLists.cc - [RFC/PATCH, 1/4] readX_check() performance evaluation

2004-01-28 01:55:33

Subject: [RFC/PATCH, 1/4] readX_check() performance evaluation

Hi all,

We are planning recovery from intermittent PCI bus errors.
The PCI-X standard describes this in "5.4 Error Handling and Fault
Tolerance" of PCI-X 1.0b. There have been several discussions on lkml
about how to recover from PCI errors.

Seto posted "[RFC] How drivers notice a HW error?" (readX_check() I/F)
http://marc.theaimsgroup.com/?l=linux-kernel&m=106992207709400&w=2

Grant will show his idea in the near future,
http://marc.theaimsgroup.com/?l=linux-kernel&m=107453681120603&w=2

I made a prototype of the readX_check() I/F that Seto proposed, to
measure the performance disadvantage of this kind of I/F, and I ran a
disk I/O performance evaluation with this prototype.
Comments are welcome.

Conclusion:
The performance disadvantage of readX_check() is very small.
I'd like you to understand that such a function will not
cause as severe a performance disadvantage as you might imagine.

This patch:
- is for the Fusion MPT driver.
- has no error recovery code yet, sorry.
- currently supports ia64 only. But I believe that
  some other CPUs (such as SPARC, PPC, PA-RISC) can also
  support this kind of I/F.
  I know, unfortunately, that i386 can't support this kind
  of I/F, because it can't recover from machine check state.
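
(For illustration only, not taken from the patch: a minimal userspace
sketch of what a checked read of this kind might look like. The names
sim_dev and sim_readl_check, the injected-fault flag, and the 0/-1 return
convention are assumptions made for this sketch. The prototype apparently
depends on ia64 machine check recovery, which is not modeled here.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated device: one register plus an injected bus fault. */
struct sim_dev {
    uint32_t reg;        /* register contents */
    int      bus_fault;  /* nonzero: the next read hits a simulated PCI error */
};

/*
 * Checked read: returns 0 and stores the value in *val on success,
 * returns -1 and leaves *val untouched if the read hit an error.
 */
static int sim_readl_check(struct sim_dev *dev, uint32_t *val)
{
    assert(dev != NULL && val != NULL);
    if (dev->bus_fault) {
        dev->bus_fault = 0;   /* the error is consumed by reporting it */
        return -1;
    }
    *val = dev->reg;
    return 0;
}
```

The point of the shape is that the caller gets the error status separately
from the data, so a valid register value can never be mistaken for an error.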

How to use this patch:
- Apply to vanilla 2.6.1.
- Rename drivers/message/fusion/mptbase.c to mptbase_main.c
  (Because we build mptbase.ko from mptbase.c and read_check.S,
  the source file has to be renamed. I know read_check.S
  should go under the architecture directory, but since this
  patch is only for performance evaluation, please forgive me.)

Evaluation Environment:
  Kernel:
    vanilla 2.6.1 and 2.6.1+readX_check patch
  Platform: Intel Tiger-4 (1-CPU)
    processor  : 0
    vendor     : GenuineIntel
    arch       : IA-64
    family     : Itanium 2
    model      : 1
    revision   : 5
    archrev    : 0
    features   : branchlong
    cpu number : 0
    cpu regs   : 4
    cpu MHz    : 1296.473997
    itc MHz    : 1296.473997
    BogoMIPS   : 1941.96
  SCSI HBA/driver:
    onboard LSI Logic 53C1030 (FwRev=01030600h, MaxQ=255)
    Fusion MPT driver (kernel 2.6.1)
  Disks:
    Host: scsi0 Channel: 00 Id: 00 Lun: 00
      Vendor: FUJITSU  Model: MAP3367NC  Rev: 5207
      Type:   Direct-Access  ANSI SCSI revision: 03
    Host: scsi0 Channel: 00 Id: 01 Lun: 00
      Vendor: FUJITSU  Model: MAP3367NC  Rev: 5207
      Type:   Direct-Access  ANSI SCSI revision: 03
    Host: scsi0 Channel: 00 Id: 02 Lun: 00
      Vendor: FUJITSU  Model: MAP3367NC  Rev: 5207
      Type:   Direct-Access  ANSI SCSI revision: 03

Test tool:
rawread 1.0.3
http://www-124.ibm.com/developerworks/opensource/linuxperf/rawread/rawread.html

Results:
To avoid the buffer cache, we measured performance with O_DIRECT.

1-1) ./rawread -p 1 -d 1 -s 512 -n 131072 -x -z
(1 disk, 512 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1       5.245    10741   16.2
vanilla 2.6.1       5.269    10790   15.7
----------------------------------------------------------
patched/vanilla     0.995        -   1.032

1-2) ./rawread -p 2 -d 1 -s 512 -n 131072 -x -z
(2 disks, 512 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1      10.473    21448   30.6
vanilla 2.6.1      10.548    21602   30.6
----------------------------------------------------------
patched/vanilla     0.993        -   1.000

1-3) ./rawread -p 3 -d 1 -s 512 -n 131072 -x -z
(3 disks, 512 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1      11.267    23074   29.9
vanilla 2.6.1      11.251    23042   30.5
----------------------------------------------------------
patched/vanilla     1.001        -   0.980

2-1) ./rawread -p 1 -d 1 -s 4096 -n 131072 -x -z
(1 disk, 4096 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1      39.422    10092   14.1
vanilla 2.6.1      39.389    10083   14.0
----------------------------------------------------------
patched/vanilla     1.001        -   1.007

2-2) ./rawread -p 2 -d 1 -s 4096 -n 131072 -x -z
(2 disks, 4096 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1      70.438    18032   24.1
vanilla 2.6.1      70.390    18019   24.1
----------------------------------------------------------
patched/vanilla     1.001        -   1.000

2-3) ./rawread -p 3 -d 1 -s 4096 -n 131072 -x -z
(3 disks, 4096 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1      70.588    18070   24.5
vanilla 2.6.1      70.861    18140   24.4
----------------------------------------------------------
patched/vanilla     0.996        -   1.004

3-1) ./rawread -p 1 -d 1 -s 32768 -n 131072 -x -z
(1 disk, 32768 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1      69.226     2215   3.5
vanilla 2.6.1      68.546     2190   3.4
----------------------------------------------------------
patched/vanilla     1.010        -   1.029

3-2) ./rawread -p 2 -d 1 -s 32768 -n 131072 -x -z
(2 disks, 32768 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1     139.315     4458   7.0
vanilla 2.6.1     139.188     4454   6.6
----------------------------------------------------------
patched/vanilla     1.001        -   1.061

3-3) ./rawread -p 3 -d 1 -s 32768 -n 131072 -x -z
(3 disks, 32768 bytes per read)
                     MB/s     IOPS   avg. sys(%) of vmstat
----------------------------------------------------------
patched 2.6.1     208.883     6684   10.0
vanilla 2.6.1     209.193     6694   10.2
----------------------------------------------------------
patched/vanilla     0.999        -   0.980

-------
Thanks,
Hironobu Ishii

2004-01-28 17:19:14

by Grant Grundler

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, Jan 28, 2004 at 10:54:28AM +0900, Hironobu Ishii wrote:
> Seto posted "[RFC] How drivers notice a HW error?" (readX_check() I/F)
> http://marc.theaimsgroup.com/?l=linux-kernel&m=106992207709400&w=2
>
> Grant will show his idea in the near future,
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107453681120603&w=2

I don't work on error recovery full time and that's really a full time job.
In a nutshell, I'd like to treat IO errors as exceptions and hide
most of the support in the regular readX() macros. Arch support
controls readX/writeX implementations and CONFIG_* options can
be used to pick which behavior someone wants. I'd expect drivers
which support error recovery to register an error recovery callback
and a "fake" value to hand back for PIO reads until recovery is complete.

I could be wrong. Exception handling is ugly. But my hope is that
by putting all the exception handling in one place in the driver,
the driver will be forced to be methodical in being "deterministic"
WRT driver state and can return to a known state by calling one
routine. This will keep the drivers maintainable by "part-time hackers"
who don't care about error recovery.
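
A rough userspace sketch of that idea, to make the shape concrete. All
names here (exc_dev, exc_register_recovery, exc_readl, and so on) are
hypothetical, and the error is injected with a flag instead of coming from
the platform's exception machinery:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct exc_dev;
typedef void (*exc_recover_fn)(struct exc_dev *dev);

/* Hypothetical simulated device with exception-style error handling. */
struct exc_dev {
    uint32_t       reg;
    int            bus_fault;   /* injected error */
    int            recovering;  /* set from the error until recovery completes */
    uint32_t       fake_value;  /* handed back for reads while recovering */
    exc_recover_fn recover;     /* the driver's single recovery entry point */
};

static int recovery_calls;      /* counts callback invocations in this sketch */

static void exc_register_recovery(struct exc_dev *dev, exc_recover_fn fn,
                                  uint32_t fake)
{
    assert(dev != NULL && fn != NULL);
    dev->recover = fn;
    dev->fake_value = fake;
}

/* Plain-looking read: the error handling lives here, not at the call sites. */
static uint32_t exc_readl(struct exc_dev *dev)
{
    assert(dev != NULL);
    if (dev->bus_fault) {
        dev->bus_fault = 0;
        dev->recovering = 1;
        if (dev->recover)
            dev->recover(dev);
    }
    return dev->recovering ? dev->fake_value : dev->reg;
}

/* Example callback: a real driver would schedule its reset from here. */
static void exc_on_error(struct exc_dev *dev)
{
    (void)dev;
    recovery_calls++;
}

/* Called once the reset has brought the device back to a known state. */
static void exc_recovery_done(struct exc_dev *dev)
{
    assert(dev != NULL);
    dev->reg = 0;
    dev->recovering = 0;
}
```

Callers keep using the plain read. Whether they cope with the fake value
is exactly the part that still needs review in each driver.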

> Conclusion:
> The performance disadvantage of readX_check() is very small.
> I'd like you to understand that such a function will not
> cause severe performance disadvantage as you imagine.

This is no surprise. The cost of PIO reads is far greater (100x)
than the extra cost to check for errors.
E.g., a PIO read on a 1GHz HP rx2600 takes ~1000-1100 CPU cycles, and it's
in the same order of magnitude on all architectures.
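
A back-of-the-envelope version of that ratio. The 10-cycle cost for the
check is an assumption derived from the 100x estimate above, not a
measurement:

```c
#include <assert.h>

/* Fraction of extra cost the error check adds to each PIO read. */
static double check_overhead(double pio_read_cycles, double check_cycles)
{
    assert(pio_read_cycles > 0.0 && check_cycles >= 0.0);
    return check_cycles / pio_read_cycles;
}
```

With 1000 cycles per read and 10 cycles per check this gives about 0.01,
i.e. roughly 1% extra per PIO read, and the reads themselves are only a
part of the total I/O path.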

> This patch:
> - is for Fusion MPT driver.
> - has no error recovery code yet, sorry.

Error recovery code is the hard part: finding all the locations in the
code and writing instance-specific error recovery code. The HPUX driver
I first worked on is amazingly similar to MPT. And it had error recovery
support (for "Host Powerfail") and truly was a PITA to support.

> - currently supports ia64 only. But I believe that
> some other CPU(such as SPARC, PPC, PA-RISC) can also
> support this kind of I/F.

yes - probably a few others as well.

> I know, unfortunately, that i386 can't support this kind
> of I/F, because it can't recover from machine check state.

I think i386 could. The method to check for errors will be different
and the types of errors which are detectable are fewer.
I'm not sure it would be recoverable though. But it should be able
to shut down a misbehaving driver instance/device before the box crashes.
(well, assuming there is no memory corruption).

thanks,
grant

2004-01-28 17:44:49

by Andi Kleen

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, 28 Jan 2004 09:20:04 -0800
Grant Grundler wrote:

> I could be wrong. Exception handling is ugly. But my hope is that
> by putting all the exception handling in one place in the driver,
> the driver will be forced to be methodical in being "deterministic"
> WRT driver state and can return to a known state by calling one
> routine. This will keep the drivers maintainable by "part-time hackers"
> who don't care about error recovery.

One big problem is how to get rid of the spinlocks after the exception though
(hardware access usually happens inside a spinlock)

I presume you could return a magic value (all ones), but then you still
have to make sure the driver doesn't break when that happens. That would
likely require testing for that value on every read access and make
the code similarly ugly and difficult to write as with Linus'
explicit checking model.
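
To make that concrete, a small userspace sketch. The names (lock_dev,
lock_poll_status) are hypothetical and the "spinlock" is just a flag; the
point is only that each read needs an explicit test and that the error
path has to drop the lock like every other path:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_ALL_ONES 0xFFFFFFFFu   /* the magic value handed back on error */

/* Hypothetical simulated device. */
struct lock_dev {
    uint32_t status_reg;
    int      locked;   /* stand-in for a spinlock: 1 while held */
    int      dead;     /* set once the device is considered failed */
};

static void lock_take(struct lock_dev *dev)
{
    assert(!dev->locked);
    dev->locked = 1;
}

static void lock_drop(struct lock_dev *dev)
{
    assert(dev->locked);
    dev->locked = 0;
}

/* Returns 0 on success, -1 if the read came back as all ones. */
static int lock_poll_status(struct lock_dev *dev, uint32_t *status)
{
    int ret = 0;
    uint32_t v;

    assert(dev != NULL && status != NULL);
    lock_take(dev);
    v = dev->status_reg;          /* stands in for readl() */
    if (v == SIM_ALL_ONES) {      /* explicit test after the read */
        dev->dead = 1;
        ret = -1;
        goto out;                 /* the error path must drop the lock too */
    }
    *status = v;
out:
    lock_drop(dev);
    return ret;
}
```

Note that this only works for registers that can never legitimately read
as all ones.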

But there may be no other choice, see below...

> > I know, unfortunately, that i386 can't support this kind
> > of I/F, because it can't recover from machine check state.
>
> I think i386 could. The method to check for errors will be different
> and the types of errors which are detectable are fewer.

Yes, there are often magic bits in northbridges and chipsets. Problem is that
they're sometimes buggy (because not well tested) and give random errors.

Also enabling them tends to trigger a *lot* of bugs in random drivers.

> I'm not sure it would be recoverable though. But it should be able

They usually give an MCE, but it is not exact for writes (happens sometime
later) and may not even be for reads.

The only sane way to handle them would be a global call back per pci_dev,
but then you run into problems with the locking again.

Also in my experience from AMD64 which originally was a bit aggressive
on enabling MCEs: enabling MCEs increases your kernel support load a lot.

Many people have slightly buggy systems which still happen to work mostly.
If you report every problem you as kernel maintainer will be flooded with
reports about things you can do nothing about. So I don't think it would
make sense to enable it by default.

One idea I played with was to only enable it for driver debugging, but
it is hard to educate driver developers about it (most just don't know
about it and we have no way to pass information to them). In the end
I removed it because it was too much hassle. In short this stuff
probably only makes sense when you're a system vendor who sells
support contracts for whole systems including hardware support.
For the normal linux model where software is independent from hardware
(and hardware is usually crappy) it just doesn't work very well.

-Andi

2004-01-28 18:32:07

by David Mosberger

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

>>>>> On Wed, 28 Jan 2004 18:41:37 +0100, Andi Kleen said:

Andi> Also in my experience from AMD64 which originally was a bit
Andi> aggressive on enabling MCEs: enabling MCEs increases your
Andi> kernel support load a lot.

Andi> Many people have slightly buggy systems which still happen to
Andi> work mostly. If you report every problem you as kernel
Andi> maintainer will be flooded with reports about things you can
Andi> do nothing about.

I find this comment interesting. Can you elaborate what you mean by
"slightly buggy systems"?

--david

2004-01-28 18:54:58

by Andi Kleen

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, 28 Jan 2004 10:31:58 -0800
David Mosberger wrote:

> >>>>> On Wed, 28 Jan 2004 18:41:37 +0100, Andi Kleen said:
>
> Andi> Also in my experience from AMD64 which originally was a bit
> Andi> aggressive on enabling MCEs: enabling MCEs increases your
> Andi> kernel support load a lot.
>
> Andi> Many people have slightly buggy systems which still happen to
> Andi> work mostly. If you report every problem you as kernel
> Andi> maintainer will be flooded with reports about things you can
> Andi> do nothing about.
>
> I find this comment interesting. Can you elaborate what you mean by
> "slightly buggy systems"?

e.g. one bit ECC errors in memory are quite common. And with ECC memory
they are not really fatal. Similar with drivers. A lot of drivers do
bus aborts and other things regularly, but there is not necessarily
data corruption.

-Andi

2004-01-28 19:19:17

by Andi Kleen

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, 28 Jan 2004 11:09:23 -0800
Grant Grundler wrote:

> ...
> > In short this stuff
> > probably only makes sense when you're a system vendor who sells
> > support contracts for whole systems including hardware support.
> > For the normal linux model where software is independent from hardware
> > (and hardware is usually crappy) it just doesn't work very well.
>
> While ia64/parisc platforms have HW support for this,
> I totally agree it won't work well for most (x86) platforms.
> I'd like to reduce the burden on the driver writers for common
> drivers (eg MPT) used on "vanilla" x86.

It would probably be a good idea to implement it for i386 on chipsets
that support it reliably and try to educate driver writers to
enable it when they are testing drivers. This would likely
improve the quality of linux drivers long term and make your
job as maintainer of an "anal IO error" platform easier.

Just it should not be enabled by default in production kernels.
And finding out where it works reliably will be some work.

>
> And like I pointed out before, linux kernel needs to review panic()
> calls to see if some of them could easily be eliminated. The general
> robustness issues (eg pci_map_single() panics on failure) aren't
> prerequisites for IO error checking, but the latter seems less
> useful without the former.

There is no reason pci_map_single() has to panic on overflow. On x86-64
it returns an unmapped address that is guaranteed to cause a bus abort
for 128KB. And you have a macro to test for it (pci_dma_error()).
I believe ppc64 has adopted it too. Of course most drivers don't
use it yet.

Still, panic on overflow is useful for testing, and it is kept as a
kernel command line option.
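
A userspace sketch of the pattern, for illustration. Everything prefixed
sim_ is hypothetical, including the value of the bad address; only the
shape of the test macro follows the x86-64 pci_dma_error() mentioned above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sim_dma_addr_t;

/* Reserved, never-mapped bus address that signals a failed mapping. */
static const sim_dma_addr_t sim_bad_dma_address = 0xFFFFFFFFFFFF0000ull;

/* Same shape as pci_dma_error(): compare against the reserved address. */
#define sim_dma_error(x) ((x) == sim_bad_dma_address)

/* Hypothetical IOMMU state: remaining space and next free bus address. */
struct sim_iommu {
    size_t         free_bytes;
    sim_dma_addr_t next;
};

/* Maps len bytes; on overflow returns the bad address instead of panicking. */
static sim_dma_addr_t sim_map_single(struct sim_iommu *mmu, size_t len)
{
    sim_dma_addr_t addr;

    assert(mmu != NULL);
    if (len > mmu->free_bytes)
        return sim_bad_dma_address;
    addr = mmu->next;
    mmu->next += len;
    mmu->free_bytes -= len;
    return addr;
}
```

A driver that tests the returned handle can fail the request cleanly; one
that does not will hit the bus abort on the unmapped address.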

-Andi

2004-01-28 19:08:09

by Grant Grundler

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, Jan 28, 2004 at 06:41:37PM +0100, Andi Kleen wrote:
> > I could be wrong. Exception handling is ugly.
...
> One big problem is how to get rid of the spinlocks after the exception though
> (hardware access usually happens inside a spinlock)
>
> I presume you could return a magic value (all ones), but then you still
> have to make sure the driver doesn't break when that happens.

yes - any proposal is going to require reviewing all PIO reads
and how the read return value is consumed (or discarded).

> That would
> likely require testing for that value on every read access and make
> the code similarly ugly and difficult to write as with Linus'
> explicit checking model.

yeah. My hope was it would be less invasive.
But more changes are probably needed than I expected.

...
> In short this stuff
> probably only makes sense when you're a system vendor who sells
> support contracts for whole systems including hardware support.
> For the normal linux model where software is independent from hardware
> (and hardware is usually crappy) it just doesn't work very well.

While ia64/parisc platforms have HW support for this,
I totally agree it won't work well for most (x86) platforms.
I'd like to reduce the burden on the driver writers for common
drivers (eg MPT) used on "vanilla" x86.

And like I pointed out before, linux kernel needs to review panic()
calls to see if some of them could easily be eliminated. The general
robustness issues (eg pci_map_single() panics on failure) aren't
prerequisites for IO error checking, but the latter seems less
useful without the former.

I'd like to defend the pci_map_single() interface. It was designed
to reduce the complexity at the cost of robustness.
I think it was a fair trade off at the time and it sounds like
the time has come for a different trade off.

thanks,
grant

> -Andi

2004-01-28 19:25:07

by David Mosberger

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

>>>>> On Wed, 28 Jan 2004 19:52:46 +0100, Andi Kleen said:

>> I find this comment interesting. Can you elaborate what you mean by
>> "slightly buggy systems"?

Andi> e.g. one bit ECC errors in memory are quite common. And with
Andi> ECC memory they are not really fatal.

Yet they are a good indicator that something is wrong (not performing
properly) or may be failing soon. I don't think putting on blinders
for such problems is a good idea. Though I agree that the question of
how to report such things without needlessly alerting Joe Clueless is
an interesting challenge.

--david

2004-01-28 19:48:13

by David Mosberger

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

>>>>> On Wed, 28 Jan 2004 20:39:15 +0100, Andi Kleen said:

>> Yet they are a good indicator that something is wrong (not performing
>> properly) or may be failing soon. I don't think putting on blinders
>> for such problems is a good idea. Though I agree that the question of

Andi> Most server class hardware should log it somewhere and allow
Andi> to read the event log in the firmware. This even works for
Andi> unhandleable errors unlike what the OS could do.

And you'd want to reboot your server just so you can check on the soft
failure rate? ;-)

--david

2004-01-28 19:40:38

by Andi Kleen

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, 28 Jan 2004 11:24:05 -0800
David Mosberger wrote:

> >>>>> On Wed, 28 Jan 2004 19:52:46 +0100, Andi Kleen said:
>
> >> I find this comment interesting. Can you elaborate what you mean by
> >> "slightly buggy systems"?
>
> Andi> e.g. one bit ECC errors in memory are quite common. And with
> Andi> ECC memory they are not really fatal.
>
> Yet they are a good indicator that something is wrong (not performing
> properly) or may be failing soon. I don't think putting on blinders
> for such problems is a good idea. Though I agree that the question of

Most server class hardware should log it somewhere and allow
to read the event log in the firmware. This even works for unhandleable
errors unlike what the OS could do.

But when printed in Linux they will report it to the linux maintainer or their
distribution vendor. "My Linux is buggy and giving these weird messages" And they
are both in no position at all to do something about it.

I toyed with the idea of printking a disclaimer of
"This is likely not a software bug. Report it to your hardware vendor."
But I doubt this would help much. Even when you say clearly in the message
that the hardware failed the user sees a weird message and thinks
it is Linux's fault.

You could enable it with CONFIG_I_HAVE_A_HARDWARE_SUPPORT_CONTRACT_OR_I_WRITE_DRIVERS
Or just make it a kernel command line option with off by default.

-Andi

2004-01-28 20:01:38

by Andi Kleen

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, 28 Jan 2004 11:48:05 -0800
David Mosberger wrote:

> >>>>> On Wed, 28 Jan 2004 20:39:15 +0100, Andi Kleen said:
>
> >> Yet they are a good indicator that something is wrong (not performing
> >> properly) or may be failing soon. I don't think putting on blinders
> >> for such problems is a good idea. Though I agree that the question of
>
> Andi> Most server class hardware should log it somewhere and allow
> Andi> to read the event log in the firmware. This even works for
> Andi> unhandleable errors unlike what the OS could do.
>
> And you'd want to reboot your server just so you can check on the soft
> failure rate? ;-)

Yep, I reboot my machines all the time ;-)

Seriously you can count it somewhere and present it in sysfs or /proc.
Or log it somewhere else and supply a special utility to show them
that makes it clear that the events are hardware and not software related.
I suppose if your server vendor is serious they will supply a tool
to read the firmware log from a running system.

But printks enabled by default are a bad idea (and a bug too BTW - printk called from
MCE handlers can randomly deadlock)
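
Something along these lines, as a userspace sketch. The names
(hw_err_counters, hw_err_note_corrected, hw_err_show) and the output
format are made up for illustration; the handler side only does an atomic
increment, and the text is produced later when somebody reads it:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical per-class counters, bumped from the error handler. */
struct hw_err_counters {
    atomic_ulong corrected;    /* e.g. single-bit ECC errors */
    atomic_ulong bus_aborts;
};

/* Handler side: one atomic increment, no printing and no locks. */
static void hw_err_note_corrected(struct hw_err_counters *c)
{
    assert(c != NULL);
    atomic_fetch_add(&c->corrected, 1);
}

/* Reader side: formats the counters on demand, as a /proc or sysfs
 * file might. Returns the number of characters written. */
static int hw_err_show(struct hw_err_counters *c, char *buf, size_t len)
{
    assert(c != NULL && buf != NULL);
    return snprintf(buf, len, "corrected %lu\nbus_aborts %lu\n",
                    atomic_load(&c->corrected),
                    atomic_load(&c->bus_aborts));
}
```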

-Andi

2004-01-28 21:13:08

by Grant Grundler

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, Jan 28, 2004 at 08:17:01PM +0100, Andi Kleen wrote:
> This would likely
> improve the quality of linux drivers long term and make your
> job as maintainer of an "anal IO error" platform easier.

yup. The key drivers we deal with reached that point last year.
Drivers could always be better. But those issues have been discussed
and presented:
o LinuxTag 2002 keynote by Alan Cox, "Submitting new Kernel drivers"
(http://gsyprf11.external.hp.com/porting_zx1/mgp/Code.mgp)
o OLS 2002 talk by Arjan van de Ven, "How not to write kernel drivers"
o OLS 2002 talk by Greg K-H, "Documentation/CodingStyle and Beyond"
o OLS 2002 talk by myself, "Porting Drivers to HP ZX1"

It helps to "enforce" driver quality through "anal IO Error containment"
but it's too late when it happens on a customer box.

> Just it should not be enabled by default in production kernels.
> And finding out where it works reliably will be some work.

agreed.

> There is no reason pci_map_single() has to panic on overflow. On x86-64
> it returns an unmapped address that is guaranteed to cause a bus abort
> for 128KB.

parisc and ia64 will also bus abort. And then HPMC/MCA respectively.
We could reserve a "safe page" and then hand that back I guess.
But that sounds like a very broken error containment strategy to me.
(ie outbound data will be garbage).

This really isn't an issue for HP ZX1/IA64 since most drivers (64-bit)
can bypass the IOMMU and directly address memory. parisc-linux still
isn't commercially interesting.

> And you have a macro to test for it (pci_dma_error()).

I didn't know about pci_dma_error.
Google found two references: One is:
http://www.x86-64.org/lists/discuss/msg03841.html

> I believe ppc64 has adopted it too. Of course most drivers don't
> use it yet.

<search 2.6.2-rc2 source tree>
grundler <502>find -name '*.[chS]' | xargs fgrep pci_dma_error
./include/asm-x86_64/pci.h:#define pci_dma_error(x) ((x) == bad_dma_address)
grundler <503>

That explains why most drivers don't use it yet.
It's only supported on one arch.
Maybe propose this to linux-pci mailing list?

grant

2004-01-28 21:39:12

by Andi Kleen

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

On Wed, 28 Jan 2004 13:14:05 -0800
Grant Grundler wrote:

>
> > I believe ppc64 has adopted it too. Of course most drivers don't
> > use it yet.
>
> <search 2.6.2-rc2 source tree>
> grundler <502>find -name '*.[chS]' | xargs fgrep pci_dma_error
> ./include/asm-x86_64/pci.h:#define pci_dma_error(x) ((x) == bad_dma_address)
> grundler <503>
>
> That explains why most drivers don't use it yet.
> It's only supported on one arch.
> Maybe propose this to linux-pci mailing list?

It was discussed on linux-arch and ppc64 at least agreed on it.
The other architectures can get it via a compatibility #define that
is always 0.

There was a patch for that somewhere, but apparently it was never merged
or not merged yet.

Anton, what was the state of that?

-Andi

2004-01-28 23:35:14

by David Mosberger

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

>>>>> On Wed, 28 Jan 2004 21:01:32 +0100, Andi Kleen said:

Andi> Seriously you can count it somewhere and present it in sysfs
Andi> or /proc. Or log it somewhere else and supply a special
Andi> utility to show them that makes it clear that the events are
Andi> hardware and not software related. I suppose if your server
Andi> vendor is serious they will supply a tool to read the firmware
Andi> log from a running system.

Andi> But printks enabled by default are a bad idea (and a bug too
Andi> BTW - printk called from MCE handlers can randomly deadlock)

No argument here. I didn't get/see the earlier part of this
discussion so I didn't realize you were complaining about printks
only. Never mind.

--david

2004-01-29 08:36:41

by Matthias Fouquet-Lapar

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

> Andi> e.g. one bit ECC errors in memory are quite common. And with
> Andi> ECC memory they are not really fatal.
>
> Yet they are a good indicator that something is wrong (not performing
> properly) or may be failing soon. I don't think putting on blinders
> for such problems is a good idea. Though I agree that the question of
> how to report such things without needlessly alerting Joe Clueless is
> an interesting challenge.

We have done a rather large study with DIMMs that had SBEs and have
found no evidence that a SBE turns into a UCE, i.e. the fact that a SBE is
reported, is no indication that the device might fail soon.

As a matter of fact, the soft error rate increases as parts use
smaller process technologies and lower supply voltages. Cosmic rays
are one source for soft errors. Another source are alpha particles
emitted by the solder.

Still I think it's important to log SBEs, but you probably will need
a threshold in case you hit a hard SBE. Also scrubbing the memory location
(and re-reading the location to check if the error was transient or not)
might be a good idea if the memory controller supports this.
If it is a true, hard SBE it should be reported. It also might be a good
idea to mark the page, so it does not get re-allocated.
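
As a rough sketch of that flow in userspace C. The simulated cell, the
names, and the reading of "threshold" as a cap on how many events get
logged per page are all assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

enum sbe_kind { SBE_SOFT, SBE_HARD };

/* Hypothetical simulated memory cell: a stuck bit survives rewriting. */
struct sim_cell {
    int stuck;          /* nonzero: hard fault, error returns after scrub */
    int error_pending;  /* nonzero: a read would report a corrected error */
};

/* Scrub: write the corrected data back, which clears a transient flip. */
static void sim_scrub(struct sim_cell *c)
{
    c->error_pending = c->stuck;   /* only a stuck bit reasserts the error */
}

/* Scrub, re-read, and classify the error as soft (transient) or hard. */
static enum sbe_kind sbe_classify(struct sim_cell *c)
{
    assert(c != NULL);
    sim_scrub(c);
    return c->error_pending ? SBE_HARD : SBE_SOFT;
}

struct sbe_page_stats {
    unsigned logged;     /* events logged so far for this page */
    unsigned threshold;  /* stop logging after this many events */
    int      retired;    /* page marked so it does not get re-allocated */
};

/* Returns 1 if this event should be logged, 0 if it is suppressed. */
static int sbe_account(struct sbe_page_stats *p, enum sbe_kind kind)
{
    assert(p != NULL);
    if (kind == SBE_HARD)
        p->retired = 1;
    if (p->logged >= p->threshold)
        return 0;
    p->logged++;
    return 1;
}
```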

Thanks

Matthias Fouquet-Lapar Core Platform Software VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127

2004-01-29 19:30:29

by David Mosberger

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

>>>>> On Thu, 29 Jan 2004 09:23:20 +0100 (CET), Matthias Fouquet-Lapar said:

Matthias> We have done a rather large study with DIMMs that had SBEs
Matthias> and have found no evidence that a SBE turns into a UCE,
Matthias> i.e. the fact that a SBE is reported, is no indication
Matthias> that the device might fail soon.

Matthias> As a matter of fact, the soft error rate increases as
Matthias> parts use smaller process technologies and lower supply
Matthias> voltages. Cosmic rays are one source for soft
Matthias> errors. Another source are alpha particles emitted by the
Matthias> solder.

Ehh, wait a second: you're saying that your study proved that if the
device isn't failing, it isn't failing. ;-) Of course you'll get noise
and perhaps even lots of it due to cosmic rays but this doesn't say
anything about the error pattern you get when a device _is_ failing (e.g.,
due to overheating, over-clocking, or wrong voltage). Or did your
study cover the cases where a system is operated under "out-of-spec"
situation?

Matthias> Still I think it's important to log SBEs, but you probably
Matthias> will need a threshold in case you hit a hard SBE. Also
Matthias> scrubbing the memory location (and re-reading the location to
Matthias> check if the error was transient or not) might be a good
Matthias> idea if the memory controller supports this. If it is a
Matthias> true, hard SBE it should be reported. It also might be a
Matthias> good idea to mark the page, so it does not get
Matthias> re-allocated.

Yes. And once I finally received Andi's earlier mails (guess I have
to thank MyDoom for that... ;-( ), it was clear that nobody argued for
turning off the error reporting. The issue was only whether or not to
log a message via printk() (which, in this case, clearly isn't a good
idea). So I think we're all in violent agreement.

--david

2004-01-29 20:28:18

by Matthias Fouquet-Lapar

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

> Matthias> We have done a rather large study with DIMMs that had SBEs
> Matthias> and have found no evidence that a SBE turns into a UCE,
> Matthias> i.e. the fact that a SBE is reported, is no indication
> Matthias> that the device might fail soon.
>
> Ehh, wait a second: you're saying that your study proved that if the
> device isn't failing, it isn't failing. ;-) Of course you'll get noise

I should have been more precise. We used field returned parts which
had reported SBEs and had been exchanged in the field. Our goal was to
see if any of these parts "de-generate" over time. Most of these parts
had hard single bit failures in one or more locations. As I said,
we didn't find evidence that even hard SBEs turn into a multiple bit
error. Of course the chances of getting a UCE are higher when a "soft"
SBE occurs in a memory location which already has a hard SBE.

Thanks

Matthias Fouquet-Lapar Core Platform Software VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127

2004-01-29 21:09:20

by David Mosberger

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

>>>>> On Thu, 29 Jan 2004 21:16:52 +0100 (CET), Matthias Fouquet-Lapar said:

Matthias> We have done a rather large study with DIMMs that had SBEs
Matthias> I should have been more precise. We used field returned
Matthias> parts which had reported SBEs and had been exchanged in
Matthias> the field. Our goal was to see if any of these parts
Matthias> "de-generate" over time. Most of these parts had hard
Matthias> single bit failures in one or more locations.

Ah, that's more interesting, agreed.

Matthias> As I said, we didn't find evidence that even hard SBEs
Matthias> turn into a multiple bit error.

But you were changing the operating environment of the chip, so I
wouldn't draw too strong of a conclusion. Or was the reason for the
hard SBEs known and it was determined that the operating environment
was not a factor in triggering them?

--david

2004-01-29 22:34:18

by Matthias Fouquet-Lapar

Subject: Re: [RFC/PATCH, 1/4] readX_check() performance evaluation

> Matthias> We have done a rather large study with DIMMs that had SBEs
> Matthias> I should have been more precise. We used field returned
> Matthias> parts which had reported SBEs and had been exchanged in
> Matthias> the field. Our goal was to see if any of these parts
> Matthias> "de-generate" over time. Most of these parts had hard
> Matthias> single bit failures in one or more locations.
>
> Ah, that's more interesting, agreed.
>
> Matthias> As I said, we didn't find evidence that even hard SBEs
> Matthias> turn into a multiple bit error.
>
> But you were changing the operating environment of the chip, so I
> wouldn't draw too strong of a conclusion. Or was the reason for the
> hard SBEs known and it was determined that the operating environment
> was not a factor in triggering them?

That is a very good point and one of my favourite subjects. I think
a lot of error checking has to be done in-flight, i.e. at the time of
the error: check whether the error is transient or can be reproduced, and
if possible log environmental information (temp and VDD) with the error, etc.
And then have a small EEPROM on standard DIMMs and save this error information,
so we don't rely on paper tags. Or maybe include the DIMM serial number
in the error message. But I'm getting carried away :)

As for the test environment, a fair number of DIMMs were put through
environmental stress ("shake & bake") as well as extended voltage margins.
(I remain impressed with these chambers where you can dial down from +60C to
-40C within a few minutes while the system is vibrating with a couple of G's)
We actually exceeded the DIMM manufacturers specifications for limits,
again no sign of increased failure rate for DIMMs with SBEs.

Some failure modes are very complex and data pattern sensitive.

As someone pointed out quite correctly, there is a fine line how much
information should be logged and potentially ring a bell for the customer
to place a service call to replace a part which potentially will never
fail again (again there is a difference between a hard and a soft error).

One option might be to have a separate error log, so the console is not
overflowed with messages and then use some tool to diagnose the errors
and potentially warn the user, i.e. turn on the "check engine" light.
We should keep the average user in mind.

Thanks

Matthias Fouquet-Lapar Core Platform Software VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127