2007-08-10 21:10:57

by Roland

Subject: Software based ECC ?

Hello !

Since ECC (for RAM) is a widespread hardware technology in
server/enterprise computing for protection against memory failure, I
wonder:

Can't this be done in software, too?

I didn't find a reference on this list, but I found an interesting paper I'd
like to share at:

http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf

"SoftECC : A System for Software Memory Integrity Checking"

Is it possible to implement something like this within the Linux virtual
memory subsystem?
If it can be done, wouldn't this be a great feature?

regards
Roland K.
system engineer





2007-08-10 22:15:39

by Alan

Subject: Re: Software based ECC ?

On Fri, 10 Aug 2007 23:16:45 +0200
"roland" <[email protected]> wrote:

> Hello !
>
> since ECC (speaking in terms of ram/memory) is some widespread hardware
> technology
> within server/enterprise computing for protection of memory failure, i
> wonder:
>
> Can`t this be done in software, too ?

Only one way to find out. If it interests you, have a go at it.

2007-08-11 06:11:24

by Valdis Klētnieks

Subject: Re: Software based ECC ?

On Fri, 10 Aug 2007 23:16:45 +0200, roland said:

> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>
> "SoftECC : A System for Software Memory Integrity Checking"
>
> Is it possible to implement something like this within the Linux virtual
> memory subsystem ?

Anything that can be simulated with a Turing machine is *possible*.

The question is how many rocket boosters the pig needs for takeoff.

Hint: The thesis talks about why he didn't implement it for Linux.

> If it can be done, wouldn`t this be a great feature ?

Read section 5.2 of that thesis, particularly this quote from 5.2.2:

"For random word writes, this implies that SoftECC will need an order of
magnitude more compute time than the user-mode code"

Basically, on every single memory page that gets dirtied, we have to then
re-checksum the page (blowing away cache lines in the process). If you want
to get a feel for it, find the kernel code that recognizes that a page is
dirtied, and just add a few lines there:

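unsigned int foo = 0;
int i;
for (i = 0; i < 1024; i++) { // 1024 words = one 4K page; adjust for non-4K pages
    foo ^= *(page + i);      // assumes 'page' is an unsigned int * to the page data
}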

and see how much your system crawls.

Personally, I'd recommend just shelling out the bucks for hardware ECC if
the reliability matters.
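
To get a rough feel for the cost before touching the kernel, the same XOR loop can be tried in userspace first. What follows is only a sketch under assumptions: PAGE_WORDS, page_xor and page_matches are illustrative names, a 4 KiB page of 32-bit words is assumed, and a plain XOR is a far weaker check than a real ECC (it cannot correct anything, and it misses two flips that land in the same bit position of different words).

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS 1024  /* assumption: 4 KiB page made of 32-bit words */

/* XOR all words of one page together; any single flipped bit changes the result. */
uint32_t page_xor(const uint32_t *page)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++)
        sum ^= page[i];
    return sum;
}

/* Returns 1 if the page still matches the checksum recorded earlier. */
int page_matches(const uint32_t *page, uint32_t recorded)
{
    return page_xor(page) == recorded;
}
```

Recording page_xor() after every write to a buffer and calling page_matches() before every read gives a crude userspace stand-in for the per-dirty-page work described above.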



2007-08-12 16:51:51

by folkert

Subject: Re: Software based ECC ?

> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > "SoftECC : A System for Software Memory Integrity Checking"
>
> Personally, I'd recommend just shelling out the bucks for hardware ECC if
> the reliability matters.

A question and an idea. Q: Is ECC guaranteed to detect all bit flips?

Idea: what about a multicore system (3 or more cores) that runs the same
processes on 2 cores, with a third core verifying that they both do the
same? After all, I think it is not only RAM that can become faulty.



Folkert van Heusden

--
MultiTail is a flexible tool for monitoring logfiles and commands.
With filters, colors, merging, different views, etc.
http://www.vanheusden.com/multitail/
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, http://www.vanheusden.com

2007-08-12 17:07:30

by Jan Engelhardt

Subject: Re: Software based ECC ?


On Aug 12 2007 18:51, Folkert van Heusden wrote:
>
>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>>
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
>a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
>Idea: what about a multicore system (3 or more) that runs the same
>processes on 2 cores and a third core verifying that they both do the
>same? As I think it is not only ram that can become faulty.

Indeed. BOINC (Seti@home), for example, has to consider this. Hence they
recalculate each work unit at least three times and then compare the
results. What makes this different from ECC is that the checksum is not calculated
on every memory operation, but at the end of a larger block of operations. Of
course this may mean that an error can propagate for a while, but the total
walltime (including recomputation) is lower. :)
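
The "compute it three times and compare at the end" idea can be sketched as a bitwise majority vote over three result buffers. This is an illustration only, not a description of how BOINC validates work units; vote3 and vote3_buf are made-up names, and the results are assumed to be plain arrays of 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise majority of three words: each output bit takes the value
 * that at least two of the three inputs agree on. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Vote over three result buffers of n words each, writing the winner to out.
 * Returns the number of words in which the three copies were not identical. */
size_t vote3_buf(const uint32_t *a, const uint32_t *b, const uint32_t *c,
                 uint32_t *out, size_t n)
{
    size_t disagreements = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = vote3(a[i], b[i], c[i]);
        if (a[i] != b[i] || b[i] != c[i])
            disagreements++;
    }
    return disagreements;
}
```

The vote runs once per block of work rather than once per memory access, which is where the walltime saving comes from. It only helps if at most one of the three copies is wrong in any given bit.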


Jan
--

2007-08-12 19:05:59

by Tzy-Jye Daniel Lin

Subject: Re: Software based ECC ?

On 8/12/07, Folkert van Heusden <[email protected]> wrote:
> > > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
> > > "SoftECC : A System for Software Memory Integrity Checking"
> >
> > Personally, I'd recommend just shelling out the bucks for hardware ECC if
> > the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

Such hardware does exist -- for example, Stratus sells systems that
run the same OS on two separate boards in lockstep, with a voter to
determine what action to take if they ever diverge.

2007-08-13 03:09:48

by Valdis Klētnieks

Subject: Re: Software based ECC ?

On Sun, 12 Aug 2007 18:51:31 +0200, Folkert van Heusden said:

> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It depends on the exact ECC function the hardware implements. Usually it
provides guarantees such as:

"Correct all 1-bit errors. Detect, but do not correct, all 2-bit errors and
most errors of 3 or more bits".

(Of course, "correct all 1 or 2 bit and detect all 3 bit" can be done, it
just takes more bits of ECC.)
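
As a toy illustration of "correct 1 bit, detect 2 bits", here is an extended Hamming (8,4) code: 4 data bits protected by 4 check bits. This is a sketch, not what any particular memory controller does (ECC DIMMs commonly use 8 check bits per 64 data bits); the names secded_encode, secded_decode and the status enum are invented for the example.

```c
#include <stdint.h>

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_UNCORRECTABLE };

static unsigned bit(uint8_t w, unsigned pos) { return (w >> pos) & 1u; }

/* Layout: bit 0 = overall parity, bits 1,2,4 = Hamming check bits,
 * bits 3,5,6,7 = the four data bits. */
uint8_t secded_encode(uint8_t nibble)
{
    uint8_t w = 0;
    w |= (uint8_t)(((nibble >> 0) & 1u) << 3);
    w |= (uint8_t)(((nibble >> 1) & 1u) << 5);
    w |= (uint8_t)(((nibble >> 2) & 1u) << 6);
    w |= (uint8_t)(((nibble >> 3) & 1u) << 7);
    w |= (uint8_t)((bit(w, 3) ^ bit(w, 5) ^ bit(w, 7)) << 1);
    w |= (uint8_t)((bit(w, 3) ^ bit(w, 6) ^ bit(w, 7)) << 2);
    w |= (uint8_t)((bit(w, 5) ^ bit(w, 6) ^ bit(w, 7)) << 4);
    unsigned all = 0;
    for (unsigned pos = 1; pos < 8; pos++)
        all ^= bit(w, pos);
    w |= (uint8_t)all;               /* makes the parity of all 8 bits even */
    return w;
}

enum secded_status secded_decode(uint8_t w, uint8_t *nibble)
{
    unsigned syndrome = 0, overall = 0;
    /* XOR of the positions of all set bits is 0 for a valid codeword. */
    for (unsigned pos = 0; pos < 8; pos++) {
        if (bit(w, pos)) {
            syndrome ^= pos;
            overall ^= 1u;
        }
    }
    enum secded_status st = SECDED_OK;
    if (overall) {
        /* Odd parity: assume one flip; the syndrome is its position
         * (syndrome 0 means the overall parity bit itself flipped). */
        w ^= (uint8_t)(1u << syndrome);
        st = SECDED_CORRECTED;
    } else if (syndrome) {
        /* Even parity but non-zero syndrome: two flips, cannot be repaired. */
        return SECDED_UNCORRECTABLE;
    }
    *nibble = (uint8_t)(bit(w, 3) | (bit(w, 5) << 1) |
                        (bit(w, 6) << 2) | (bit(w, 7) << 3));
    return st;
}
```

Three flips have odd parity, so this toy code mistakes them for a single flip and "corrects" the wrong bit. That is the small-scale version of "most, but not all, errors of 3 or more bits are detected".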

> Idea: what about a multicore system (3 or more) that runs the same
> processes on 2 cores and a third core verifying that they both do the
> same? As I think it is not only ram that can become faulty.

This is actually done for high-reliability systems (Google for "tell me twice"
and "tell me three times"). The problem is that it takes a lot of extra
hardware. The G5 and later IBM Z-series mainframe chipsets (not to be confused with
the PowerPC G5) implemented dual computation units and a comparator that
signals a 'Machine Check' condition if the two CPUs don't end up in the
same exact state. As an added bonus, at the end of each instruction where
both *do* compare good, it latches the *entire* state of the CPU out. When a
compare fails, it then does the following:

1) Retry the instruction on the same CPU - if it compares correctly, keep
going and flag a "soft" error.

2) If it still fails, read out the last "known good" status latch, and load
it into a spare CPU, and fire it up, and flag the failing one as bad.

http://www.research.ibm.com/journal/rd/435/spainhower.pdf
http://www.research.ibm.com/journal/rd/435/mueller.pdf

These guys have forgotten more about designing highly reliable systems than
most of us will ever know. ;)

Needless to say, not everybody is willing to pay the costs of the hardware
overhead of this approach.



2007-08-21 18:44:34

by Bodo Eggert

Subject: Re: Software based ECC ?

Folkert van Heusden <[email protected]> wrote:

>> > http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>> > "SoftECC : A System for Software Memory Integrity Checking"
>>
>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>> the reliability matters.
>
> a question and an idea: Q: is ecc guaranteed to detect all bitflips?

It's guaranteed not to.

Having n extra bits, you can detect n-bit-flips and correct n/2-bit-flips
(provided you use an optimal code).

These extra bits can flip, too, so if you have m >= 1 data bits and any
finite number n of extra bits, it's possible to have an undetectable
n+1-bit-flip.
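
The smallest case makes the point concrete: with n = 1 extra bit (a single parity bit), one flip is detected but n+1 = 2 flips slip through, whether both land in the data or one of them lands in the parity bit itself. A sketch, with parity8 and parity_ok as made-up names and an 8-bit data word assumed:

```c
#include <stdint.h>

/* Parity of 8 data bits: returns 1 if an odd number of bits are set. */
unsigned parity8(uint8_t data)
{
    unsigned p = 0;
    while (data) {
        p ^= data & 1u;
        data >>= 1;
    }
    return p;
}

/* Returns 1 if the stored parity bit still agrees with the data. */
int parity_ok(uint8_t data, unsigned stored_parity)
{
    return parity8(data) == stored_parity;
}
```

For a stored pair (data, p), parity_ok(data ^ 0x01, p) fails as it should, while parity_ok(data ^ 0x03, p) and parity_ok(data ^ 0x01, p ^ 1) both pass even though two bits have changed.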
--
If you can't remember, then the claymore IS pointed at you.


2007-08-21 20:18:18

by linux-os (Dick Johnson)

Subject: Re: Software based ECC ?


On Tue, 21 Aug 2007, Bodo Eggert wrote:

> Folkert van Heusden <[email protected]> wrote:
>
>>>> http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>>>> "SoftECC : A System for Software Memory Integrity Checking"
>>>
>>> Personally, I'd recommend just shelling out the bucks for hardware ECC if
>>> the reliability matters.
>>
>> a question and an idea: Q: is ecc guaranteed to detect all bitflips?
>
> It's guaranteed not to.
>
> Having n extra bits, you can detect n-bit-flips and correct n/2-bit-flips
> (provided you use an optimal code).
>
> These extra bits can flip, too, so if you have m >= 1 data bits and any
> finite number n of extra bits, it's possible to have an undetectable
> n+1-bit-flip.
> --
> If you can't remember, then the claymore IS pointed at you.
>
Of course, common ECC codes detect and correct single-bit errors.
When used in memory, the bits of one word are not stored in adjacent cells, so
a cosmic ray or other stray particle that upsets several neighboring bits
usually upsets bits in different words, and each word remains correctable.

The MIT paper is noticeably deficient in its ability to do anything
useful. It proposes checking things at 100 Hz intervals and trapping
each memory access, as though these things happened only once in
a while, and, of course, it assumes that the code doing the checking will
never be corrupted. Further, it ignores the cache(s).

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.29 BogoMips).
My book : http://www.AbominableFirebug.com/