2010-06-10 17:29:51

by Brian Gordon

[permalink] [raw]
Subject: Aerospace and linux

Greetings,

I work in the aerospace industry, and one of the considerations
that arise in aerospace is a phenomenon called Single Event Upsets
(SEUs). I'm not an expert on the physics behind this phenomenon, but
the end result is that bits in RAM change state due to high-energy
particles passing through the device. This phenomenon happens more
often at higher altitudes (aircraft) and is a very serious
consideration for space vehicles.

When these SEUs can be detected, some action may be taken to improve
the behaviour of the system (log a fault and reset in order to
refresh things from scratch?). So the first question becomes how to
detect an SEU. Flash is considered somewhat safer than RAM. When
executables run under Linux, do the .text and .ro sections get copied
into RAM? If so, can a background task monitor the RAM copies of .text
and .ro for corruption? Tripwire seems to offer this kind of
detection as a means of detecting tampering by a malicious attacker
in the filesystem, but I am not convinced that it would detect
modifications to the copies of the ELF in RAM.

My understanding of how Linux does "on-demand" loading of executables
suggests that may be a problem here. But this SEU detection capability
would seem to have some applicability to intrusion detection, so I have
to think some mechanism already exists.

Thank you to anyone for any pointers on where I can look to learn
more about detecting SEUs in Linux.

legerde at gmail com


2010-06-10 18:23:19

by Andi Kleen

[permalink] [raw]
Subject: Re: Aerospace and linux

Brian Gordon <[email protected]> writes:
> I work in the aerospace industry, and one of the considerations
> that arise in aerospace is a phenomenon called Single Event Upsets
> (SEUs). I'm not an expert on the physics behind this phenomenon, but
> the end result is that bits in RAM change state due to high-energy
> particles passing through the device. This phenomenon happens more
> often at higher altitudes (aircraft) and is a very serious
> consideration for space vehicles.

It's also a serious consideration for standard servers.

> When these SEUs can be detected, some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU. Flash is considered somewhat safer than RAM. When
> executables run under Linux, do the .text and .ro sections get copied
> into RAM? If so, can a background task monitor the RAM copies of .text
> and .ro for corruption?

On server-class systems with ECC memory, the hardware does that.

The hardware stores the RAM contents using an error-correcting
code that can normally correct single-bit errors and detect multi-bit
errors.

There are various more or less sophisticated variations of
this around, from simple ECC, through chipkill to handle failing DIMMs,
up to various variants of full memory mirroring.

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEUs in Linux.

Normally server-class hardware handles this and the kernel then reports
memory errors (e.g. through mcelog or through EDAC).
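
If you just want to watch those counters from userspace, a minimal
sketch along these lines works on boxes where an EDAC driver is loaded
(it assumes a single memory controller, mc0):

/* Minimal sketch: poll the EDAC corrected-error count via sysfs.
 * Assumes a single memory controller (mc0) and a loaded EDAC driver. */
#include <stdio.h>
#include <unistd.h>

static long read_count(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

int main(void)
{
	const char *ce = "/sys/devices/system/edac/mc/mc0/ce_count";
	long prev = read_count(ce);

	if (prev < 0) {
		fprintf(stderr, "EDAC not available?\n");
		return 1;
	}
	for (;;) {
		long now = read_count(ce);

		if (now >= 0 && now > prev)
			fprintf(stderr, "corrected memory errors: %ld -> %ld\n",
				prev, now);
		if (now >= 0)
			prev = now;
		sleep(60);	/* poll once a minute */
	}
}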

Hardware also stops the system before it would consume corrupted
data.

Newer Linux kernels also have special code that allows recovering
from this in some circumstances, or using predictive failure analysis
with page offlining to prevent future problems. This requires
suitable hardware support.

Lower-end systems which are optimized for cost generally ignore the
problem though, and any flipped bit in memory will result
in a crash (if you're lucky) or silent data corruption (if you're unlucky).

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-10 18:30:11

by Chris Friesen

[permalink] [raw]
Subject: Re: Aerospace and linux

On 06/10/2010 11:29 AM, Brian Gordon wrote:

> When these SEUs can be detected, some action may be taken to improve
> the behaviour of the system (log a fault and reset in order to
> refresh things from scratch?). So the first question becomes how to
> detect an SEU.

I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the
various buses, enable error-checking in the hardware, etc.

At higher abstraction levels you can checksum the data being stored and
validate it when you access it.
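
Something along these lines, as a rough illustration; the zlib crc32()
here is just an example, any strong checksum would do:

/* Rough illustration: guard a critical structure with a checksum and
 * verify it on every access.  Uses zlib's crc32() (link with -lz). */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

struct guarded {
	double data[64];
	uLong  crc;		/* checksum over data[] */
};

static uLong checksum(const struct guarded *g)
{
	return crc32(0L, (const Bytef *)g->data, sizeof(g->data));
}

static void guarded_store(struct guarded *g, int i, double v)
{
	g->data[i] = v;
	g->crc = checksum(g);		/* re-seal after every update */
}

static int guarded_load(const struct guarded *g, int i, double *v)
{
	if (checksum(g) != g->crc)
		return -1;		/* corruption detected: caller resets */
	*v = g->data[i];
	return 0;
}

int main(void)
{
	struct guarded g;
	double v;

	memset(&g, 0, sizeof(g));
	g.crc = checksum(&g);

	guarded_store(&g, 3, 42.0);
	if (guarded_load(&g, 3, &v) == 0)
		printf("value = %f\n", v);
	else
		fprintf(stderr, "checksum mismatch, refusing to use data\n");
	return 0;
}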

Some of the errors are "soft" and can be corrected, others are "hard"
and uncorrectable. If you get enough "soft" errors in a short enough
time it may be desirable to treat it as a "hard" error and reset.

> Thank you to anyone for any pointers on where I can look to learn
> more about detecting SEUs in Linux.

You might start by taking a look at the "edac" code in the kernel.
Linux in general doesn't normally enable all the fault detection code,
so you may need to start looking at datasheets.

Chris

--
The author works for GENBAND Corporation (GENBAND) who is solely
responsible for this email and its contents. All enquiries regarding
this email should be addressed to GENBAND. Nortel has provided the use
of the nortel.com domain to GENBAND in connection with this email solely
for the purpose of connectivity and Nortel Networks Inc. has no
liability for the email or its contents. GENBAND's web site is
http://www.genband.com

2010-06-10 18:38:13

by Brian Gordon

[permalink] [raw]
Subject: Re: Aerospace and linux

> It's also a serious consideration for standard servers.
Yes. Good point.

> On server class systems with ECC memory hardware does that.

> Normally server class hardware handles this and the kernel then reports
> memory errors (e.g. through mcelog or through EDAC)

Agreed. EDAC is a good and sane solution and most companies do this.
Some do not due to naivety or cost reduction. EDAC doesn't cover
processor registers, and I have fairly good solutions on how to deal
with that in tiny "home-grown" tasking systems.

On the more exotic end, I have also seen systems that have dual
redundant processors / memories. Then they add compare logic between
the redundant processors that compares most pins each clock cycle. If
any pins are not identical at a clock cycle, then something has gone
wrong (SEU, hardware failure, etc.).

> Lower end systems which are optimized for cost generally ignore the
> problem though and any flipped bit in memory will result
> in a crash (if you're lucky) or silent data corruption (if you're unlucky)

Right! And this is the area that I am interested in. Some people
insist on lowering the cost of the hardware without considering these
issues. One thing I want to do is to be as diligent as possible (even
in these low-cost situations) and do the best job I can in spite of
the low-cost hardware.

So, some pages of RAM are going to be read-only and the data in those
pages came from some source (file system?). Can anyone describe a
high-level strategy to occasionally provide some coverage of this data?

So far I have thought about page descriptors adding an MD5 hash
whenever they are read-only and first being "loaded/mapped?" and then
a background daemon could occasionally verify. Does Tripwire
accomplish this kind of detection by monitoring the underlying
filesystem (I don't think so)?
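
A very rough userland sketch of what I mean (no kernel support; it
assumes the GNU toolchain's __executable_start/etext symbols, and uses
a toy checksum where a real version would hash every read-only mapping
listed in /proc/self/maps):

/* Hash the program's own text segment at startup and re-verify it from
 * a background thread.  Assumes the GNU toolchain, which provides the
 * __executable_start and etext symbols.  Link with -lpthread. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

extern char __executable_start, etext;	/* GNU ld: bounds of .text */

static uint64_t hash_range(const unsigned char *p, const unsigned char *end)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV-1a */

	while (p < end) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

static uint64_t reference;

static void *verifier(void *arg)
{
	(void)arg;
	for (;;) {
		sleep(10);	/* "occasionally" */
		if (hash_range((unsigned char *)&__executable_start,
			       (unsigned char *)&etext) != reference) {
			fprintf(stderr, "text segment corrupted, resetting\n");
			abort();	/* or trigger a clean reboot/reload */
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	reference = hash_range((unsigned char *)&__executable_start,
			       (unsigned char *)&etext);
	pthread_create(&tid, NULL, verifier, NULL);

	/* ... real work here ... */
	pause();
	return 0;
}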

2010-06-10 18:42:29

by Brian Gordon

[permalink] [raw]
Subject: Re: Aerospace and linux

> I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the
> various buses, enable error-checking in the hardware, etc.

Excellent stuff when you have it. :)

> At higher abstraction levels you can checksum the data being stored and
> validate it when you access it.

What about the .ro and .text sections of an executable? I would think
kernel support for that would be required. If it's application data,
then all sorts of things are possible like you described. I've also
seen critical RAM variables stored in triplicate and then
compared/voted just to ensure no silent SEU corruption.
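
For anyone who hasn't seen the trick, a toy sketch of such a
triplicated variable with majority voting (the real thing spreads the
copies around in memory and has to decide what to do when all three
copies disagree):

/* Toy sketch of triple modular redundancy for a critical variable:
 * keep three copies and majority-vote on every read. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct tmr_u32 {
	volatile uint32_t copy[3];
};

static void tmr_write(struct tmr_u32 *v, uint32_t val)
{
	v->copy[0] = v->copy[1] = v->copy[2] = val;
}

/* Returns 0 on success; -1 if no two copies agree. */
static int tmr_read(struct tmr_u32 *v, uint32_t *out)
{
	uint32_t a = v->copy[0], b = v->copy[1], c = v->copy[2];

	if (a == b || a == c)
		*out = a;
	else if (b == c)
		*out = b;
	else
		return -1;		/* unrecoverable: all three differ */

	tmr_write(v, *out);		/* rewrite all three to scrub the bad copy */
	return 0;
}

int main(void)
{
	struct tmr_u32 alt;
	uint32_t val;

	tmr_write(&alt, 35000);
	alt.copy[1] ^= 0x40;		/* simulate a single event upset */

	if (tmr_read(&alt, &val) == 0)
		printf("voted value: %" PRIu32 "\n", val);
	else
		printf("all three copies disagree\n");
	return 0;
}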

> You might start by taking a look at the "edac" code in the kernel.
> Linux in general doesn't normally enable all the fault detection code,
> so you may need to start looking at datasheets.

Thank you for the suggestion. If the memory device supports EDAC/ECC,
then enabling it is definitely a good strategy.

2010-06-10 18:48:41

by Andi Kleen

[permalink] [raw]
Subject: Re: Aerospace and linux

On Thu, Jun 10, 2010 at 12:38:10PM -0600, Brian Gordon wrote:
> > It's also a serious consideration for standard servers.
> Yes. Good point.
>
> > On server class systems with ECC memory hardware does that.
>
> > Normally server class hardware handles this and the kernel then reports
> > memory errors (e.g. through mcelog or through EDAC)
>
> Agreed. EDAC is a good and sane solution and most companies do this.

Sorry, but you mean ECC?

IMHO EDAC is not a good solution for error reporting (however I'm biased
because I work on a better one).

> Some do not due to naivety or cost reduction. EDAC doesn't cover
> processor registers, and I have fairly good solutions on how to deal
> with that in tiny "home-grown" tasking systems.

mcelog covers OS-visible processor registers on x86 systems.

If your hardware doesn't support it, it's hard for the general
case, although special cases are always possible.
>
> > Lower end systems which are optimized for cost generally ignore the
> > problem though and any flipped bit in memory will result
> > in a crash (if you're lucky) or silent data corruption (if you're unlucky)
>
> Right! And this is the area that I am interested in. Some people
> insist on lowering the cost of the hardware without considering these
> issues. One thing I want to do is to be as diligent as possible (even
> in these low cost situations) and do the best job I can in spite of
> the low cost hardware.

AFAIK there's no support for this in a standard Linux kernel.

That is, some architectures do scrubbing in software,
but the basic ECC implementation is still in hardware.

In general I suspect you'll need some application-specific
strategy if your hardware doesn't help you with this.

Having good hardware definitely helps; software is generally
not happy if it cannot trust its memory enough.

It's a bit like a human with no reliable air supply.

That is, the existing memory error handling mechanisms (like hwpoison)
assume events are reliably detected and relatively rare.

>
> So, some pages of RAM are going to be read-only and the data in those
> pages came from some source (file system?). Can anyone describe a
> high-level strategy to occasionally provide some coverage of this data?

Just for block data there's some support for checksumming,
e.g. block integrity (needs special support in the device)
or file systems (e.g. btrfs).

However they all normally assume memory is reliable and
are more focussed on errors coming from storage.

>
> So far I have thought about page descriptors adding an MD5 hash
> whenever they are read-only and first being "loaded/mapped?" and then
> a background daemon could occasionally verify.

In theory btrfs or block integrity could probably be extended
to regularly re-check the page cache. It would not be trivial.

But to really catch errors before use you would need to recheck on
every access, and that's hard (or rather extremely slow) in some cases
(e.g. mmap).

And this still wouldn't help with r/w memory. Normally on most
workloads r/o (that is, clean) memory is only a small fraction of the
active memory.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-10 18:49:37

by Chris Friesen

[permalink] [raw]
Subject: Re: Aerospace and linux

On 06/10/2010 12:38 PM, Brian Gordon wrote:

> On the more exotic end, I have also seen systems that have dual
> redundant processors / memories. Then they add compare logic between
> the redundant processors that compares most pins each clock cycle. If
> any pins are not identical at a clock cycle, then something has gone
> wrong (SEU, hardware failure, etc.).

Some phone switches do this. Some of them also have at least two copies
of everything in memory and will do transactional operations that can be
rolled back if there is a hardware glitch.

> So, some pages of RAM are going to be read-only and the data in those
> pages came from some source (file system?). Can anyone describe a
> high-level strategy to occasionally provide some coverage of this data?

> So far I have thought about page descriptors adding an MD5 hash
> whenever they are read-only and first being "loaded/mapped?" and then
> a background daemon could occasionally verify.

Makes sense to me. You might also pick an on-disk format with extra
checksumming so you could compare the on-disk checksum with the
in-memory checksum.
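
As a hypothetical sketch of that comparison: keep a raw MD5 digest next
to each file on disk and check the mapped in-memory copy against it
(OpenSSL's MD5() here is purely for illustration):

/* Hypothetical sketch: verify the in-memory copy of a mapped file
 * against a digest stored next to it on disk (digest_file holds the
 * 16 raw MD5 bytes).  Link with -lcrypto. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <openssl/md5.h>

/* Returns 0 if the RAM copy matches, 1 on mismatch, -1 on error. */
static int verify_mapping(const char *file, const char *digest_file)
{
	unsigned char stored[MD5_DIGEST_LENGTH], fresh[MD5_DIGEST_LENGTH];
	struct stat st;
	int fd, dfd, ok = -1;
	void *map;

	fd = open(file, O_RDONLY);
	dfd = open(digest_file, O_RDONLY);
	if (fd < 0 || dfd < 0 || fstat(fd, &st) < 0)
		goto out;
	if (read(dfd, stored, sizeof(stored)) != (ssize_t)sizeof(stored))
		goto out;

	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	MD5(map, st.st_size, fresh);		/* digest of the RAM copy */
	ok = memcmp(stored, fresh, sizeof(fresh)) == 0 ? 0 : 1;
	munmap(map, st.st_size);
out:
	if (fd >= 0)
		close(fd);
	if (dfd >= 0)
		close(dfd);
	return ok;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <file.md5>\n", argv[0]);
		return 2;
	}
	int r = verify_mapping(argv[1], argv[2]);
	printf(r == 0 ? "ok\n" : "MISMATCH or error\n");
	return r;
}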

Chris

--
The author works for GENBAND Corporation (GENBAND) who is solely
responsible for this email and its contents. All enquiries regarding
this email should be addressed to GENBAND. Nortel has provided the use
of the nortel.com domain to GENBAND in connection with this email solely
for the purpose of connectivity and Nortel Networks Inc. has no
liability for the email or its contents. GENBAND's web site is
http://www.genband.com

2010-06-10 19:15:00

by Brian Gordon

[permalink] [raw]
Subject: Re: Aerospace and linux

Thank you both. This has been very helpful for me.

I think I read two conclusions:
1) R/O is a small percentage of RAM
2) To cover this small percentage would be non-trivial

Thank you both very much for your time and knowledge, I'll move along now.



On Thu, Jun 10, 2010 at 12:46 PM, Chris Friesen <[email protected]> wrote:
> On 06/10/2010 12:38 PM, Brian Gordon wrote:
>
>> On the more exotic end, I have also seen systems that have dual
>> redundant processors / memories. Then they add compare logic between
>> the redundant processors that compares most pins each clock cycle. If
>> any pins are not identical at a clock cycle, then something has gone
>> wrong (SEU, hardware failure, etc.).
>
> Some phone switches do this. Some of them also have at least two copies
> of everything in memory and will do transactional operations that can be
> rolled back if there is a hardware glitch.
>
>> So, some pages of RAM are going to be read-only and the data in those
>> pages came from some source (file system?). Can anyone describe a
>> high-level strategy to occasionally provide some coverage of this data?
>
>> So far I have thought about page descriptors adding an MD5 hash
>> whenever they are read-only and first being "loaded/mapped?" and then
>> a background daemon could occasionally verify.
>
> Makes sense to me. You might also pick an on-disk format with extra
> checksumming so you could compare the on-disk checksum with the
> in-memory checksum.
>
> Chris
>

2010-06-10 19:34:08

by Massimiliano Galanti

[permalink] [raw]
Subject: Re: Aerospace and linux


> What about the .ro and .text sections of an executable? I would think
> kernel support for that would be required. If it's application data,
> then all sorts of things are possible like you described. I've also
> seen critical RAM variables stored in triplicate and then
> compared/voted just to ensure no silent SEU corruption.

Maybe slightly off topic but... if flash is safer than RAM, what about
using XIP (where possible, e.g. on NORs)? That would not put .data
sections into RAM, at least.

--
Massimiliano

2010-06-10 19:37:49

by Brian Gordon

[permalink] [raw]
Subject: Re: Aerospace and linux

Yes. That's exactly what I am looking for. Even if there is a speed
penalty, I wouldn't mind so much. However, Wikipedia says that XIP is
filesystem-dependent and I'm stuck with FAT32 or NTFS. Wikipedia
claims NTFS can do XIP. Is this true under Linux?

On Thu, Jun 10, 2010 at 1:23 PM, Massimiliano Galanti
<[email protected]> wrote:
>
>> What about the .ro and .text sections of an executable? I would think
>> kernel support for that would be required. If it's application data,
>> then all sorts of things are possible like you described. I've also
>> seen critical RAM variables stored in triplicate and then
>> compared/voted just to ensure no silent SEU corruption.
>
> Maybe slightly off topic but... if flash is safer than RAM, what about using
> XIP (where possible, e.g. on NORs)? That would not put .data sections into
> RAM, at least.
>
> --
> Massimiliano
>

2010-06-10 19:43:00

by Brian Gordon

[permalink] [raw]
Subject: Re: Aerospace and linux

Sorry, I take it back. This won't work for me because I won't have
NOR. Also, I only want the "in-place" to apply to read-only pages.
This looks like all reads and writes get passed to the underlying
storage, and I can't suffer flash page erase/writes to update a
variable. :) The device will wear out and meaningful work would be
starved.

On Thu, Jun 10, 2010 at 1:37 PM, Brian Gordon <[email protected]> wrote:
> Yes. That's exactly what I am looking for. Even if there is a speed
> penalty, I wouldn't mind so much. However, Wikipedia says that XIP is
> filesystem-dependent and I'm stuck with FAT32 or NTFS. Wikipedia
> claims NTFS can do XIP. Is this true under Linux?
>
> On Thu, Jun 10, 2010 at 1:23 PM, Massimiliano Galanti
> <[email protected]> wrote:
>>
>>> What about the .ro and .text sections of an executable? I would think
>>> kernel support for that would be required. If it's application data,
>>> then all sorts of things are possible like you described. I've also
>>> seen critical RAM variables stored in triplicate and then
>>> compared/voted just to ensure no silent SEU corruption.
>>
>> Maybe slightly off topic but... if flash is safer than RAM, what about using
>> XIP (where possible, e.g. on NORs)? That would not put .data sections into
>> RAM, at least.
>>
>> --
>> Massimiliano
>>
>

2010-06-10 19:52:40

by Massimiliano Galanti

[permalink] [raw]
Subject: Re: Aerospace and linux

Well, not quite. See e.g. AXFS, or squashfs, which are read-only and
flash-oriented by design.

Anyway, if you're stuck with NTFS/VFAT and can't use NOR...

(Just curious what storage technology you are relying on.)

On 10/06/2010 21:42, Brian Gordon wrote:
> Sorry, I take it back. This won't work for me because I won't have
> NOR. Also, I only want the "in-place" to apply to read-only pages.
> This looks like all reads and writes get passed to the underlying
> storage and I can't suffer flash page erase/writes to update a
> variable. :) The device will wear out and meaningful work would be
> starved.

--
Massimiliano

2010-06-10 19:59:40

by Massimiliano Galanti

[permalink] [raw]
Subject: Re: Aerospace and linux

> using XIP (where possible, e.g. on NORs)? That would not put .data

err... _.rodata_ (and .text)


--
Massimiliano

2010-06-10 20:12:16

by Brian Gordon

[permalink] [raw]
Subject: Re: Aerospace and linux

Storage will probably be something really cheap, so I assume flash.
But possibly a USB stick type device, or maybe an IDE-based solid
state storage device.


On Thu, Jun 10, 2010 at 1:52 PM, Massimiliano Galanti
<[email protected]> wrote:
> Well, not quite. See e.g. AXFS, or squashfs, which are read-only and
> flash-oriented by design.
>
> Anyway, if you're stuck with NTFS/VFAT and can't use NOR...
>
> (Just curious what storage technology you are relying on.)
>
> On 10/06/2010 21:42, Brian Gordon wrote:
>>
>> Sorry, I take it back. This won't work for me because I won't have
>> NOR. Also, I only want the "in-place" to apply to read-only pages.
>> This looks like all reads and writes get passed to the underlying
>> storage, and I can't suffer flash page erase/writes to update a
>> variable. :) The device will wear out and meaningful work would be
>> starved.
>
> --
> Massimiliano
>

by Henrique Holschuh

[permalink] [raw]
Subject: Re: Aerospace and linux

On Thu, 10 Jun 2010, Chris Friesen wrote:
> On 06/10/2010 11:29 AM, Brian Gordon wrote:
> > When these SEUs can be detected, some action may be taken to improve
> > the behaviour of the system (log a fault and reset in order to
> > refresh things from scratch?). So the first question becomes how to
> > detect an SEU.
>
> I do work in telco stuff. We use ECC RAM, turn on ECC/parity on the
> various buses, enable error-checking in the hardware, etc.

Let's not forget that the hardware had better have unassisted scrubbing
(rewrite cells where a CE is detected), because we don't scrub when
we are notified of a CE.

Background scrubbing might also be something to look for (run over all
RAM over a long period of time, to catch dormant CEs and fix them
before they become UEs).
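
Where the memory-controller driver implements it, the scrub rate can be
tuned from userspace through EDAC's sdram_scrub_rate attribute (bytes
per second); a hypothetical example, not all drivers support it:

/* Hypothetical example: request a hardware background scrub rate
 * through the EDAC sysfs interface (bytes/second).  Not all
 * memory-controller drivers implement sdram_scrub_rate. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/devices/system/edac/mc/mc0/sdram_scrub_rate";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%d\n", 100000);	/* ask for ~100 kB/s of scrubbing */
	fclose(f);
	return 0;
}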

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2010-06-13 08:51:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: Aerospace and linux

From: Brian Gordon <[email protected]>
Date: Thu, Jun 10, 2010 at 12:38:10PM -0600

Hi,

> > It's also a serious consideration for standard servers.
> Yes. Good point.
>
> > On server class systems with ECC memory hardware does that.
>
> > Normally server class hardware handles this and the kernel then reports
> > memory errors (e.g. through mcelog or through EDAC)
>
> Agreed. EDAC is a good and sane solution and most companies do this.
> Some do not due to naivety or cost reduction. EDAC doesn't cover
> processor registers, and I have fairly good solutions on how to deal
> with that in tiny "home-grown" tasking systems.

No, not processor registers, but all cache levels of modern x86
processors have ECC checking capability, so that the possibility of
data arriving in the core dirty is minimized. Now, if a bit flip is
caused by an SEU while the data is passing through the execution units,
then you lose, I guess. For such cases, some sort of processor
redundancy is needed to compare and validate results, as you say below.

> On the more exotic end, I have also seen systems that have dual
> redundant processors / memories. Then they add compare logic between
> the redundant processors that compares most pins each clock cycle. If
> any pins are not identical at a clock cycle, then something has gone
> wrong (SEU, hardware failure, etc.).
>
> > Lower end systems which are optimized for cost generally ignore the
> > problem though and any flipped bit in memory will result
> > in a crash (if you're lucky) or silent data corruption (if you're unlucky)
>
> Right! And this is the area that I am interested in. Some people
> insist on lowering the cost of the hardware without considering these
> issues. One thing I want to do is to be as diligent as possible (even
> in these low cost situations) and do the best job I can in spite of
> the low cost hardware.
>
> So, some pages of RAM are going to be read-only and the data in those
> pages came from some source (file system?). Can anyone describe a
> high-level strategy to occasionally provide some coverage of this data?
>
> So far I have thought about page descriptors adding an MD5 hash
> whenever they are read-only and first being "loaded/mapped?" and then
> a background daemon could occasionally verify.

... and if an SEU corrupts the MD5 hash itself, this should cause a page
reload, right?

--
Regards/Gruss,
Boris.