Date: Sat, 2 Dec 2006 12:00:36 +0100 (CET)
From: Karsten Weiss
To: Christoph Anton Mitterer
cc: linux-kernel@vger.kernel.org
Subject: Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

Hello Christoph!

On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:

> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on

This sounds very familiar! One of the Linux compute clusters I administer at
work is a 336 node system consisting of the following components:

* 2x Dual-Core AMD Opteron 275
* Tyan S2891 mainboard
* Hitachi HDS728080PLA380 harddisk
* 4 GB RAM (some nodes have 8 GB) - intensively tested with memtest86+
* SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - but I have also tried e.g.
  the latest openSUSE 10.2 RC1 + kernel 2.6.18.2-33, which makes no
  difference.

We are running LS-Dyna on these machines and discovered a testcase which
shows a similar data corruption, so I can confirm that the problem is real
and not a hardware defect of a single machine!

Here's a diff of a corrupted and a good file written during our testcase
("-" == corrupted file, "+" == good file). A rough sketch of this kind of
write-and-compare check is appended at the end of this mail.

...
009f2ff0 67 2a 4c c4 6d 9d 34 44 ad e6 3c 45 05 9a 4d c4 |g*L.m.4D..
...

From our testing I can also tell that the data corruption does *not* appear
at all when we boot the nodes with mem=2G. However, when we use all 4 GB the
data corruption shows up - but not every time and thus not on all nodes.
Sometimes a node runs for hours without any problem. That's why we are
testing on 32 nodes in parallel most of the time. I have the impression that
it has something to do with the physical memory layout of the running
processes.

Please also note that this is a *silent* data corruption, i.e. there are no
error or warning messages in the kernel log or the MCE log at all.

Christoph, I will carefully re-read your entire posting and the included
links on Monday and will also try the memory hole setting. If somebody has
an explanation for this problem I can offer some of our compute nodes + time
for testing, because we really want to get this fixed as soon as possible.

Best regards,
Karsten

--
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss
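
PS: For anyone who wants to try to reproduce this without LS-Dyna, here is a
minimal sketch of the kind of write-and-compare loop one could use to catch
such silent corruption. The paths and the use of cp/cmp/hexdump are only
illustrative assumptions, not our actual testcase:

#!/bin/bash
# Sketch only: REF is a known-good reference file (ideally larger than RAM,
# so the read-back is not served from the page cache), SCRATCH is a copy
# written on the node under test. Both paths are placeholders.
REF=/data/reference.bin
SCRATCH=/scratch/copy.bin

while true; do
    cp "$REF" "$SCRATCH"
    sync                          # make sure the copy really hits the disk
    if ! cmp -s "$REF" "$SCRATCH"; then
        echo "corruption detected on $(hostname) at $(date)"
        # show the first differing lines, in the same hex+ASCII format as above
        diff <(hexdump -C "$REF") <(hexdump -C "$SCRATCH") | head -20
        break
    fi
    rm -f "$SCRATCH"
done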