From: Nick Dokos
Subject: Re: ll_ver_fs data verification failure - 96TB fs
Date: Thu, 06 Aug 2009 17:28:08 -0400
Message-ID: <18690.1249594088@alphaville.usa.hp.com>
References: <28623.1249307676@gamaville.dokosmarshall.org> <20090806200400.GC1800@shell> <18249.1249591034@alphaville.usa.hp.com> <20090806205002.GH3340@webber.adilger.int>
In-Reply-To: Message from Andreas Dilger of "Thu, 06 Aug 2009 14:50:02 MDT." <20090806205002.GH3340@webber.adilger.int>
Reply-To: nicholas.dokos@hp.com
To: Andreas Dilger
Cc: Nick Dokos, Valerie Aurora, linux-ext4@vger.kernel.org

> On Aug 06, 2009 16:37 -0400, Nick Dokos wrote:
> > I did that to begin with but the problem turns out to be much more
> > mundane: there was an IO error on one of the volumes. It wasn't quite
> > obvious (no red lights going off) but there *was* a message in
> > /var/log/messages - unfortunately I missed it. I eventually recreated
> > the error by trying to read the file with ``od -c'' and then went back
> > and found the original error. I don't know why/how ll_ver_fs managed to
> > read the offset and come up with a 1M difference[1] -- ``od -c'' failed with
> > a big thud.
>
> Can you have a look at the error handling in ll_ver_fs at that point?
> It seems that it might just have re-used the previous 1MB buffer, but
> didn't detect/report the error from the read, which would itself be bad.
>

It looks right to me:

,----
| ...
|         if (read(fd, chunk_buf, chunksize) < 0) {
|                 fprintf(stderr, "\n%s: read %s+%llu failed: %s\n",
|                         progname, file, offset, strerror(errno));
|                 return 1;
|         }
|         if (verify_chunk(chunk_buf, chunksize, offset, time_st,
|                          inode_st, file) != 0)
|                 return 1;
| ...
`----

The read() should have failed (and I should have gotten a different
error message) but somehow it didn't - instead, verify_chunk() was
called and *that* detected the mismatch.

Thanks,
Nick
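
One possible explanation, not confirmed anywhere in this thread: the
check quoted above only catches read() returning a negative value.
POSIX allows read() to return fewer bytes than requested, and a read
that hits an I/O error partway through a chunk may report the bytes
transferred so far instead of -1. If that happened here, the tail of
chunk_buf would still hold stale data and verify_chunk() would be the
first place to notice. A minimal sketch of a stricter read follows;
read_full() is a hypothetical helper name, not something that exists in
ll_ver_fs.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * read_full() is a hypothetical helper, not part of ll_ver_fs.
 *
 * It keeps calling read() until `count` bytes have arrived, EOF is
 * reached, or an error occurs.
 *
 * Returns the number of bytes read (less than `count` only if EOF was
 * hit first), or -1 with errno set if any read() call failed.
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, retry */
            return -1;      /* real error, errno is set by read() */
        }
        if (n == 0)
            break;          /* EOF before the chunk was complete */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/*
 * Assumed usage: treat anything other than a full chunk as a failure,
 * so a short read is reported before the buffer is verified.
 *
 *     ssize_t n = read_full(fd, chunk_buf, chunksize);
 *     if (n < 0 || (size_t)n != chunksize)
 *             ... report the read failure and return ...
 */
```

With a loop like this, an error that follows a partial transfer shows up
on the next read() call instead of being hidden behind a positive
return value.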