Date: Tue, 13 Apr 2010 21:34:45 +1000
From: Dave Chinner
To: Janos Haar
Cc: xiyou.wangcong@gmail.com, linux-kernel@vger.kernel.org,
	kamezawa.hiroyu@jp.fujitsu.com, linux-mm@kvack.org,
	xfs@oss.sgi.com, axboe@kernel.dk
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
Message-ID: <20100413113445.GZ2493@dastard>
References: <20100404103701.GX3335@dastard>
	<2bd101cad4ec$5a425f30$0400a8c0@dcccs>
	<20100405224522.GZ3335@dastard>
	<3a5f01cad6c5$8a722c00$0400a8c0@dcccs>
	<20100408025822.GL11036@dastard>
	<11b701cad9c8$93212530$0400a8c0@dcccs>
	<20100412001158.GA2493@dastard>
	<18b101cadadf$5edbb660$0400a8c0@dcccs>
	<20100413083931.GW2493@dastard>
	<190201cadaeb$02ec22c0$0400a8c0@dcccs>
In-Reply-To: <190201cadaeb$02ec22c0$0400a8c0@dcccs>

On Tue, Apr 13, 2010 at 11:23:36AM +0200, Janos Haar wrote:
> > If you run:
> >
> > $ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
> >
> > Then I can confirm whether there is corruption on disk or not.
> > Probably best to sample multiple of the inode numbers from the above
> > list of bad inodes.
>
> Here is the log:
> http://download.netcenter.hu/bughunt/20100413/debug.log

There are multiple fields in the inode that are corrupted. I am really
surprised that xfs_repair - even an old version - is not picking up the
corruption....

> xfs_db gets a segmentation fault. :-)

Yup, it probably ran off into la-la land chasing corrupted extent
pointers.

> Btw, memory corruption:
> At the beginning of March, one of my bets was a memory problem too,
> but the server was offline for 7 days and ran memtest86 on the hw the
> whole time, and it passed all 8GB 74 times without a single bit
> error.
> I don't think it is a memory problem; additionally, the server can
> create big .tar.gz files without CRC problems.

Ok.

> If I force myself to think of a hw memory problem, I can only think
> of the raid card's cache memory, which I can't test with memtest86.
> Or the cache on the HDD's pcb...

Yes, it could be something like that, too, but the only way to test it
is to swap out the card....

> On the other hand, I have seen more people report memory corruption
> on these kernel versions; can we check this and determine for sure
> which is the problem (hw or sw)?

I haven't heard of any significant memory corruption problems in 2.6.32
or 2.6.33, but it is a possibility given the nature of the corruption.
However, it may have only happened once and be completely
unreproducible. I'd suggest fixing the existing corruption first, and
then seeing if it re-appears. If it does reappear, then we know there's
a reproducible problem we need to dig out....

> I mean, if I am right, a hw memory problem makes only 1-2 bits of
> corruption, and a sw page handling problem makes whole memory pages
> bad, no?

ECC RAM guarantees correction of single bit errors and detection of
double bit errors (which cause the kernel to panic, IIRC). I can't tell
you what happens when larger errors occur, though...
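To make that SECDED (single-error-correct, double-error-detect)
behaviour concrete, here is a toy sketch in C. It is an illustration
only - real ECC DIMMs implement a wider Hamming-style (72,64) code in
the memory controller, not anything like this - but it shows why one
flipped bit can be silently corrected while two flipped bits can only
be detected:

/*
 * Toy SECDED sketch: an 8-bit codeword protecting a 4-bit nibble.
 * One flipped bit is corrected, two flipped bits are only detected.
 */
#include <stdio.h>
#include <stdint.h>

static int parity(uint8_t v)		/* 1 if an odd number of bits set */
{
	return __builtin_parity(v);
}

/* Encode a 4-bit nibble into an 8-bit SECDED codeword.
 * Bit layout (1-based positions): 1=p1 2=p2 3=d0 4=p4 5=d1 6=d2 7=d3 8=overall */
static uint8_t secded_encode(uint8_t data)
{
	uint8_t d0 = data & 1, d1 = (data >> 1) & 1;
	uint8_t d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
	uint8_t p1 = d0 ^ d1 ^ d3;	/* covers positions 3,5,7 */
	uint8_t p2 = d0 ^ d2 ^ d3;	/* covers positions 3,6,7 */
	uint8_t p4 = d1 ^ d2 ^ d3;	/* covers positions 5,6,7 */
	uint8_t cw = p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
		     (d1 << 4) | (d2 << 5) | (d3 << 6);
	return cw | ((uint8_t)parity(cw) << 7);	/* overall parity bit */
}

/* Returns 0 = clean, 1 = single-bit error (corrected in place),
 * 2 = double-bit error (detected, cannot be corrected). */
static int secded_check(uint8_t *cw)
{
	uint8_t c = *cw;
	/* syndrome = 1-based position of a single flipped bit (0 if none) */
	uint8_t syndrome = parity(c & 0x55) |		/* positions 1,3,5,7 */
			   (parity(c & 0x66) << 1) |	/* positions 2,3,6,7 */
			   (parity(c & 0x78) << 2);	/* positions 4,5,6,7 */
	int overall = parity(c);	/* odd number of flipped bits? */

	if (!syndrome && !overall)
		return 0;
	if (overall) {			/* assume a single flip: fix it */
		*cw ^= syndrome ? 1 << (syndrome - 1) : 1 << 7;
		return 1;
	}
	return 2;			/* even flips, bad syndrome: detect only */
}

int main(void)
{
	uint8_t cw = secded_encode(0xA);

	cw ^= 1 << 4;				/* one bit flip */
	printf("single flip -> %d\n", secded_check(&cw));	/* 1: corrected */

	cw ^= (1 << 2) | (1 << 6);		/* two bit flips */
	printf("double flip -> %d\n", secded_check(&cw));	/* 2: detected only */
	return 0;
}

With three or more flips the syndrome can alias to what looks like a
valid single-bit correction, which is why the guarantee stops at
double-bit detection.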
Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com