Date: Fri, 14 Sep 2007 11:32:10 +0200
From: KELEMEN Peter <Peter.Kelemen@cern.ch>
To: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Bruce Allen, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: ECC and DMA to/from disk controllers
Message-ID: <20070914093210.GB27479@luba.cern.ch>
In-Reply-To: <20070910145415.0fe05319@the-village.bc.nu>
References: <20070910145415.0fe05319@the-village.bc.nu>
Organization: CERN European Laboratory for Particle Physics, Switzerland
User-Agent: Mutt/1.5.13 (2006-08-11)

* Alan Cox (alan@lxorguk.ukuu.org.uk) [20070910 14:54]:

Alan,

Thanks for your interest (and Bruce, for posting).

> - The ECC level on the drive processors and memory cache vary
>   by vendor.  Good luck getting any information on this although
>   maybe if you are Cern sized they will talk

Do you have any contacts?  We're in direct contact only with the
system integrators, not the drive manufacturers.

> The next usual mess is network transfers. [...]

All our data is based on system-local probes (i.e. no network is
involved).

> Type III wrong block on PATA fits with the fact the block number
> isn't protected and also the limits on the cache quality of
> drives/drive firmware bugs.

Thanks, this is new information.  I was planning to extend fsprobe
with locality information inside the buffers so that we can catch
this as it is happening.

> For drivers/ide there are *lots* of problems with error handling
> so that might be implicated (would want to do old v new ide
> tests on the same h/w which would be very intriguing).

We tried to “force” these corruptions out of their hiding places on
targeted systems, but we failed miserably.  Currently we cannot
reproduce the issue at will, even on the affected systems.

> Stale data from disk cache I've seen reported, also offsets from
> FIFO hardware bugs (The LOTR render farm hit the latter and had
> to avoid UDMA to avoid a hardware bug)

That's interesting, I'll think about how to expose this.  Currently a
single pass writes the data only once, so I don't think any chunk can
live for hours in the drives' cache.
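To make the wrong-block idea concrete, the plan is roughly this: stamp
every chunk we write with its own file offset and a pass counter, so a
block that comes back from the wrong place identifies itself on
read-back.  A rough sketch of what I have in mind (the tag layout,
sizes and names below are only illustrative, not fsprobe's actual
on-disk format):

/*
 * Sketch of "locality information inside the buffers": every
 * CHUNK-sized block written to the test file carries its own byte
 * offset and a pass number, so a misdirected or stale block
 * identifies itself on read-back.  Illustrative layout only.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define CHUNK 4096

struct tag {
        uint64_t offset;        /* where this chunk was written */
        uint64_t pass;          /* which write pass produced it */
};

static void fill_chunk(unsigned char *buf, uint64_t offset, uint64_t pass)
{
        struct tag t = { offset, pass };

        memset(buf, 0xAA, CHUNK);       /* payload pattern */
        memcpy(buf, &t, sizeof(t));     /* self-identifying header */
}

/* Return 0 if the chunk read back from 'offset' is the one we wrote.
 * A real checker would verify the payload pattern as well. */
static int check_chunk(const unsigned char *buf, uint64_t offset, uint64_t pass)
{
        struct tag t;

        memcpy(&t, buf, sizeof(t));
        if (t.offset != offset || t.pass != pass) {
                fprintf(stderr,
                        "corruption: expected off=%llu pass=%llu, "
                        "got off=%llu pass=%llu\n",
                        (unsigned long long)offset, (unsigned long long)pass,
                        (unsigned long long)t.offset, (unsigned long long)t.pass);
                return -1;      /* wrong block or stale data */
        }
        return 0;
}

int main(void)
{
        unsigned char buf[CHUNK];

        /* Simulate one write/read cycle of a chunk at offset 8192. */
        fill_chunk(buf, 8192, 1);
        return check_chunk(buf, 8192, 1) ? EXIT_FAILURE : EXIT_SUCCESS;
}

A mismatching offset would point at a misplaced read or write (your
Type III), while a stale pass number would point at cached data
surviving longer than it should.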
> Chunks of zero sounds like caches again, would be interesting to
> know what hardware changes occurred at the point they began to
> pop up and what software.

They seem to pop up more frequently on ARECA-based boxes.  The
“software” is a moving target, as we gradually upgrade the computer
center.

> We also see chipset bugs under high contention some of which
> are explained and worked around (VIA ones in the past), others
> we see are clear correlations - eg between Nvidia chipsets and
> Silicon Image SATA controllers.

Most of our workhorses use 3ware controllers; the CPU nodes usually
have Intel SATA chips.

The fsprobe utility we run in the background on practically all our
boxes is available at http://cern.ch/Peter.Kelemen/fsprobe/ .  We
have it deployed on several thousand machines to gather data.  I know
that some other HEP institutes have looked at it, but I have no
information on who is running it on how many boxes, let alone what it
has found.  I would be very much interested in whatever findings
people have.

Peter

-- 
    .+'''+.         .+'''+.         .+'''+.         .+'''+.         .+''
 Kelemen Péter     /       \       /       \     Peter.Kelemen@cern.ch
.+'         `+...+'         `+...+'         `+...+'         `+...+'