Date: Sun, 13 Jun 2010 10:51:11 +0200
From: Borislav Petkov
To: Brian Gordon
Cc: Andi Kleen, linux-kernel@vger.kernel.org
Subject: Re: Aerospace and linux
Message-ID: <20100613085111.GB6428@liondog.tnic>
References: <877hm64ui4.fsf@basil.nowhere.org>

From: Brian Gordon
Date: Thu, Jun 10, 2010 at 12:38:10PM -0600

Hi,

> > It's also a serious consideration for standard servers.
> Yes. Good point.
>
> > > On server class systems with ECC memory hardware does that.
>
> > Normally server class hardware handles this and the kernel then reports
> > memory errors (e.g. through mcelog or through EDAC)
>
> Agreed. EDAC is a good and sane solution and most companies do this.
> Some do not due to naivety or cost reduction. EDAC doesn't cover
> processor registers and I have fairly good solutions on how to deal
> with that in tiny "home-grown" tasking systems.

No, not processor registers, but all cache levels of modern x86
processors have ECC checking capability, so the possibility of
corrupted data making it up into the core is minimized. Now, if a bit
flip is caused by an SEU while the data is passing through the execution
units, then you lose, I guess. For such cases, some sort of processor
redundancy is needed to compare and validate results, as you say below.
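To make the compare-and-validate idea a bit more concrete, here is a
minimal software-only sketch, assuming the protected computation is a pure
function that can simply be run three times with a majority vote over the
results. All names (compute_fn, vote3, run_redundant, square_plus_one) are
made up for illustration. Real lockstep hardware compares pins in silicon,
and running three times on the same core only helps against transient
upsets, not against a permanently broken unit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type for a pure computation we want to protect. */
typedef uint32_t (*compute_fn)(uint32_t input);

/*
 * Bitwise majority vote: every output bit takes the value that at
 * least two of the three inputs agree on.
 */
static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & b) | (a & c) | (b & c);
}

/*
 * Run fn three times and vote on the result. *disagreed is set when
 * the three runs did not all match, so the caller can log the event.
 * Assumption: the compiler does not fold the three calls into one; a
 * real implementation would have to guarantee that.
 */
static uint32_t run_redundant(compute_fn fn, uint32_t input, bool *disagreed)
{
	uint32_t r0 = fn(input);
	uint32_t r1 = fn(input);
	uint32_t r2 = fn(input);

	*disagreed = (r0 != r1) || (r1 != r2);
	return vote3(r0, r1, r2);
}

/* Example workload, purely illustrative. */
static uint32_t square_plus_one(uint32_t x)
{
	return x * x + 1u;
}
```

A caller would do something like run_redundant(square_plus_one, 3, &flag)
and treat a set flag as a detected (and here also masked) upset.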
> On the more exotic end, I have also seen systems that have dual
> redundant processors / memories. Then they add compare logic between
> the redundant processors that compare most pins each clock cycle. If
> any pins are not identical at a clock cycle, then something has gone
> wrong (SEU, hardware failure, etc..)
>
> > Lower end systems which are optimized for cost generally ignore the
> > problem though and any flipped bit in memory will result
> > in a crash (if you're lucky) or silent data corruption (if you're unlucky)
>
> Right! And this is the area that I am interested in. Some people
> insist on lowering the cost of the hardware without considering these
> issues. One thing I want to do is to be as diligent as possible (even
> in these low cost situations) and do the best job I can in spite of
> the low cost hardware.
>
> So, some pages of RAM are going to be read-only and the data in those
> pages came from some source (file system?). Can anyone describe a
> high level strategy to occasionally provide some coverage of this data?
>
> So far I have thought about page descriptors adding an MD5 hash
> whenever they are read-only and first being "loaded/mapped?" and then
> a background daemon could occasionally verify.

... and if an SEU corrupts the MD5 hash itself, this should cause a page
reload, right?

--
Regards/Gruss,
Boris.
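On the corrupted-hash question, here is a rough user-space sketch of the
verify-and-reload idea, assuming each read-only page keeps a checksum taken
at load time and still has a pristine backing copy to reload from. The
struct and function names (ro_page, page_load, page_scrub) are hypothetical
and this is not how the kernel tracks pages. FNV-1a is used only as a small
self-contained stand-in for MD5. The pointer to the backing copy is itself
unprotected here, which a real design would have to address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

/* Hypothetical descriptor for one read-only page held in RAM. */
struct ro_page {
	uint8_t data[PAGE_BYTES];
	uint32_t digest;        /* checksum taken when the page was loaded */
	const uint8_t *backing; /* pristine copy, standing in for the file system */
};

/* 32-bit FNV-1a, a compact stand-in for MD5 in this sketch. */
static uint32_t fnv1a(const uint8_t *p, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Copy the page in from its backing source and record its checksum. */
static void page_load(struct ro_page *pg, const uint8_t *backing)
{
	pg->backing = backing;
	memcpy(pg->data, backing, PAGE_BYTES);
	pg->digest = fnv1a(pg->data, PAGE_BYTES);
}

/*
 * One scrub pass, as a background daemon might run it. A mismatch does
 * not tell us whether the data or the stored digest took the hit, so
 * both are refreshed from the backing copy. Returns true on reload.
 */
static bool page_scrub(struct ro_page *pg)
{
	if (fnv1a(pg->data, PAGE_BYTES) == pg->digest)
		return false;

	page_load(pg, pg->backing);
	return true;
}
```

With this shape, a flipped bit in the digest and a flipped bit in the data
take the same path: the scrub sees a mismatch and reloads, at the cost of
one unnecessary reload when only the digest was hit.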