From: Andreas Dilger <adilger@sun.com>
Subject: Re: Problems with checking corrupted large ext3 file system
Date: Wed, 03 Dec 2008 17:09:36 -0700
Message-ID: <20081204000936.GE3186@webber.adilger.int>
To: Andre Noll
Cc: linux-ext4@vger.kernel.org

On Dec 03, 2008  11:11 +0100, Andre Noll wrote:
> I have some trouble checking a corrupted 9 TB ext3 file system which
> resides on a logical volume. The underlying physical volumes are three
> hardware RAID systems, one of which started to crash frequently. I was
> able to pvmove the data away from the buggy system, so everything is
> fine now on the hardware side.

A big question is what kernel you are running. Anything less than
2.6.18-rhel5 (not sure which vanilla kernel) has bugs with ext3 > 8TB.

The other question is whether there is any expectation that the data
moved from the bad RAID arrays was corrupted.

> However, the crashes left me with a seriously corrupted file system
> from which I'm trying to recover as much as possible.
> First step was to unmount the file system after users reported I/O
> errors when trying to open files. The system log contained many
> messages like
>
> [102445.420125] EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared for block 544108393
>
> and some of the form
>
> [160301.277477] EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #153542738: rec_len % 4 != 0 - offset=0, inode=1381653864, rec_len=26709, name_len=79
>
> So I compiled the master branch of the e2fsprogs git repo as of
> Dec 1 (tip: 8680b4) and executed
>
> ./e2fsck -y -C0 /dev/mapper/abel-abt6_projects
>
> This ran for a while and then started to output a couple of these:
>
> Inode table for group 68217 is not in group.  (block 825373744)
> WARNING: SEVERE DATA LOSS POSSIBLE.
>
> along with many lines of the form
>
> Illegal block #3036172 (4233778405) in inode 115335438.  CLEARED.

Running "e2fsck -y" vs. "e2fsck -p" will sometimes do "bad" things,
because "-y" forces it to continue no matter what. It looks like there
was some serious filesystem corruption beyond the 8TB boundary, and the
inode table for one or more groups (depending on how many of the
"SEVERE DATA LOSS POSSIBLE" messages were printed) is completely lost.

> But then it continued just fine without printing further
> messages. After about 4 hours it completed, but decided to re-run from
> the beginning, and this is where the real trouble seems to start. The
> next day I found thousands of lines like this on the console:
>
> /backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)
>
> followed by
>
> Clone multiply-claimed blocks? yes

This is likely fallout from the original corruption above. The bad news
is that these "multiply-claimed blocks" are really bogus, because of
the garbage in the missing inode tables... e2fsck has turned random
garbage into inodes, and that results in what you are seeing now.
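For the archives, the cautious order of operations on a filesystem this
badly damaged is to survey first and let e2fsck write only afterwards.
A sketch (the device path is the one from Andre's report; the guard just
keeps the commands from firing on the wrong machine):

```shell
DEV=/dev/mapper/abel-abt6_projects

if [ -b "$DEV" ]; then
    # Survey only: -n opens the filesystem read-only and answers "no"
    # to every question, so the damage can be assessed without writes.
    e2fsck -fn -C0 "$DEV"

    # Preen: -p makes only safe, automatic repairs and stops on
    # anything serious, instead of "fixing" garbage the way -y can.
    e2fsck -p -C0 "$DEV"
else
    echo "no such block device: $DEV"
fi
```

Only once those two passes look sane is "-y" a reasonable last step.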
> At this point the fsck seems to hang. No further messages, no progress
> bar for at least 17 hours.

The pass1b (clone multiply-claimed blocks) code is very slow, because
it involves an O(n^2) operation to find all of the duplicate blocks,
read them from disk, then write them to some new spot on disk, and the
e2fsck allocator is also very slow.

> The lights on the raid system aren't flashing, but there seems to be
> a bit of I/O going on, as stracing the e2fsck process yields
>
> lseek(3, 6206310776832, SEEK_SET) = 6206310776832
> read(3, "002107740635\tD\t2\t169\t35\t0\thhhhhh"..., 4096) = 4096
> lseek(3, 1263113973760, SEEK_SET) = 1263113973760
> write(3, "B9K@=?4C=L-F77F4:CGGK\n3\t14221118"..., 4096) = 4096
> lseek(3, 5861641846784, SEEK_SET) = 5861641846784
> read(3, "hhhhhh\tIIIIIIIIIIIIIIIIIIIIIIIII"..., 4096) = 4096
> lseek(3, 1263113977856, SEEK_SET) = 1263113977856
> write(3, "\t1.00\t0.46\t19\t4\t2\t0\t1\tA\t33\t31\t0\t"..., 4096) = 4096
>
> There's only about one read per second, so the fsck might take rather
> long if it continues to run at this speed ;)
>
> It's been running for 34 hours now and I don't know what to do, so
> here are a couple of questions for you ext3 gurus:
>
> Is there any hope this will ever complete?

It depends on how many inodes are duplicated, but it could be days :-(.

> Should I abort the fsck and restart?

Restarting won't fix anything, because it will just get you back to the
same spot 34h from now.

> Do things get even worse if I abort it and mount the file
> system r/o so that I can see whether important files are
> still there?

As a starter, I would suggest running "debugfs -c {devicename}" and
using it to explore the filesystem a bit. This can be done while e2fsck
is running, and will give you an idea of what data is still there. If
you think that a majority of your file data (or even just the important
bits) is available, then I would suggest killing e2fsck, mounting the
filesystem read-only, and copying off as much as possible.
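Concretely, the explore-then-salvage steps might look like the sketch
below. The directory path and inode number are taken from your messages
above; the mount point and destination are placeholders, and the guard
keeps the commands inert if the device is absent:

```shell
DEV=/dev/mapper/abel-abt6_projects
MNT=/mnt/rescue     # placeholder mount point

if [ -b "$DEV" ]; then
    # Explore read-only while e2fsck is still running: -c opens the
    # filesystem in "catastrophic" mode, ignoring the (possibly bad)
    # block and inode bitmaps, and -R runs a single debugfs command.
    debugfs -c -R "ls -l /backup/data/solexa_analysis" "$DEV"
    debugfs -c -R "stat <145326082>" "$DEV"

    # If enough looks intact: kill e2fsck, then mount read-only.
    # "noload" skips ext3 journal replay, so nothing is written
    # to the device during the salvage copy.
    mkdir -p "$MNT"
    mount -t ext3 -o ro,noload "$DEV" "$MNT"

    # rsync keeps going past individual unreadable files and reports
    # the failures at the end, which suits a salvage copy.
    rsync -a "$MNT"/ /safe/place/   # /safe/place is a placeholder
    umount "$MNT"
else
    echo "no such block device: $DEV"
fi
```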
The kernel should be largely forgiving of errors it finds on disk.

> Are there any magic e2fsck command line options I should try?

One option is to use the Lustre e2fsprogs, which has a patch that tries
to detect such "garbage" inodes and wipe them clean, instead of trying
to continue using them.

http://downloads.lustre.org/public/tools/e2fsprogs/latest/

That said, it may be too late to help, because the previous e2fsck run
will have done a lot of work to "clean up" the garbage inodes, and they
may no longer be above the "bad inode threshold". You could try this
after copying the data elsewhere, to avoid the need to restore the
filesystem and get a bit more data back, but at that point it might
also be faster to just reformat and restore the data.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.