2002-11-01 22:19:29

by Ed Vance

Subject: RE: [STATUS 2.5] October 30, 2002

On Fri, November 01, 2002 at 11:56 AM, Richard B. Johnson wrote:
> [...]
> So, ten seconds after you have some cosmic-ray upset, you guarantee
> that your machine will crash if you read everything every ten
> seconds. This will never be acceptable. You need to leave the
> machine alone and not try to "pick scabs". That's how you get
> the best reliability. Also, at some periodic intervals, you
> re-boot (restart) the whole machine, reinitializing everything
> including all the RAM.
>
Here's a Monty Python analogy to ECC memory scrubbing:

Do you remember the battle between Arthur and the Black Knight?

Without scrubbing, the memory bits suffer damage at a more or less constant
rate, like the Black Knight. The damage accumulates and eventually renders
the Black Knight non-functional. For the memory, this would be an
uncorrectable error from the accumulation of many separate bit error events.

With scrubbing, the memory bits and the Black Knight suffer damage at the
same rate, but this time the Black Knight is able to stick his limbs back on
(while fighting) after Arthur hacks them off. If the Black Knight's rate of
sticking his limbs back on equals Arthur's rate of hacking his limbs off,
the Black Knight will sustain the same amount of damage, but will remain
functional as long as he can keep up. For the memory, the many separate bit
error events would cause only correctable errors, as long as the scrubbing
can keep up.
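
To put rough numbers on the analogy, here is a toy simulation. It is only a sketch under assumptions that are not from this thread: a single ECC word that can correct one bad bit, a fixed per-step flip probability, every flip landing on a different bit, and a scrub pass that instantly repairs a single bad bit. The names (steps_until_uncorrectable, mean_survival) and the values passed in are made up for illustration.

```python
import random

def steps_until_uncorrectable(p_flip, scrub_every, rng, max_steps=1_000_000):
    """Count steps until one ECC word holds two bad bits at once.

    p_flip      -- assumed chance of a new bit flip on each step
    scrub_every -- scrub period in steps, or None for no scrubbing
    """
    bad_bits = 0
    for step in range(1, max_steps + 1):
        if rng.random() < p_flip:
            bad_bits += 1
            if bad_bits >= 2:
                return step  # second limb gone before the first was reattached
        if scrub_every and step % scrub_every == 0:
            bad_bits = 0  # a lone bad bit is corrected and written back
    return max_steps  # safety cap so the loop always terminates

def mean_survival(p_flip, scrub_every, trials=200, seed=1):
    """Average survival time over several independent runs."""
    rng = random.Random(seed)
    total = sum(steps_until_uncorrectable(p_flip, scrub_every, rng)
                for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    print("no scrubbing:  ", mean_survival(0.01, None))
    print("scrub every 10:", mean_survival(0.01, 10))
```

In this model the damage arrives at the same rate in both runs; the only difference is whether it gets a chance to accumulate.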

cheers,
Ed


2002-11-02 00:27:15

by Werner Almesberger

Subject: Re: [STATUS 2.5] October 30, 2002

Ed Vance wrote:
> functional as long as he can keep up. For the memory, the many separate bit
> error events would cause only correctable errors, as long as the scrubbing
> can keep up.

Don't those bit errors have a Poissonian character? If so, it's
impossible to "keep up". All you can do is make the interval small
enough that, on average, it takes a long time until you get hit
twice (or more often) in that interval.
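
As a back-of-the-envelope sketch of that argument, assume upsets in each ECC word arrive as an independent Poisson process, and that a word fails when it collects two or more upsets within one scrub interval. The function names and the numbers in the example run are hypothetical, not measured rates.

```python
import math

def p_multi_hit(mu):
    """P(two or more events) for a Poisson count with mean mu.

    Equals 1 - P(0) - P(1) = 1 - exp(-mu) * (1 + mu).
    """
    if mu < 1e-3:
        # Leading series terms; avoids cancellation when mu is tiny.
        return mu * mu / 2.0 - mu ** 3 / 3.0
    return -math.expm1(-mu) - mu * math.exp(-mu)

def mean_time_to_double_hit(rate, interval, words):
    """Approximate mean time until some word takes a double hit.

    rate     -- assumed upsets per word per second
    interval -- scrub interval in seconds
    words    -- number of ECC words being scrubbed
    Treats scrub intervals as independent trials, which is only a
    good approximation while the per-interval failure chance is small.
    """
    p_fail = words * p_multi_hit(rate * interval)
    return interval / p_fail

if __name__ == "__main__":
    # Hypothetical values, chosen only to show the trend.
    for interval in (10.0, 5.0, 2.5):
        t = mean_time_to_double_hit(rate=1e-9, interval=interval, words=2**24)
        print(f"scrub every {interval:>4} s -> about {t:.3g} s between double hits")
```

Under these assumptions the chance of a double hit per interval falls off as the square of the interval, so halving the interval roughly doubles the expected time between uncorrectable errors. It never reaches zero.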

A better example would be car tires on roads with many randomly
distributed sharp objects (i.e. such that age does not significantly
change the odds of tire damage): you can keep going as long as you
can get a flat tire fixed before another tire gets punctured. But
sometimes, you may end up with two flat tires, and need a tow truck.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/