2005-11-25 10:34:08

by Jerome Lacoste

Subject: defective RAM: corrupted ext3 FS. How to identify corrupted files/directories?

Hi.

My RAM died, and it corrupted my file system. It seems like this
machine just wants to die... [1]

After removing the faulty RAM, I can boot. I made extensive memtest86+ tests.
I now have my home partition mounted as read-only because of said corruption.

I see a bunch of "ext3_readdir: directory xxxx contains a hole at
offset xxxxx" when I try to access some parts of my disk.

I postponed fscking the FS until I have identified the faulty data.

I was thinking of doing a rsync --dry-run against a known working
backup and check the logs. Any better idea? Is there a way to convert
the directory IDs into file paths?

I have around 500 000 files on that partition. It takes time checking them all.

Cheers,

Jerome

[1] http://lkml.org/lkml/2005/2/4/51


2005-11-25 22:59:57

by Bodo Eggert

Subject: Re: defective RAM: corrupted ext3 FS. How to identify corrupted files/directories?

jerome lacoste wrote:

> My RAM died, and it corrupted my file system. It seems like this
> machine just wants to die... [1]
>
> After removing the faulty RAM, I can boot. I made extensive memtest86+ tests.
> I now have my home partition mounted as read-only because of said corruption.
>
> I see a bunch of "ext3_readdir: directory xxxx contains a hole at
> offset xxxxx" when I try to access some parts of my disk.
>
> I postponed fscking the FS until I have identified the faulty data.
>
> I was thinking of doing a rsync --dry-run against a known working
> backup and check the logs. Any better idea? Is there a way to convert
> the directory IDs into file paths?
>
> I have around 500 000 files on that partition. It takes time checking them
> all.

1) Use the backup to get a base on a completely separate HDD.
1a) Feel glad about having a backup.
2) Find new and changed files on the corrupted disk.
3) For each of the files found, inspect its contents and copy it over
if it's not corrupted. You can't automatically find corrupted files
unless you know otherwise.
4) mkfs
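
Step 2 of the list above can be sketched in Python. This is a minimal
illustration, not a tool mentioned in the thread: the function name
find_new_and_changed and the two directory arguments are assumptions, and
the backup is assumed to be restored to a directory on another disk. It
only narrows down the candidates. A file whose size and mtime still match
the backup can be silently corrupted, so step 3 (inspecting contents) is
still needed.

```python
import os

def find_new_and_changed(backup_root, damaged_root):
    """Compare the damaged tree against the restored backup.

    Returns (new, changed, unreadable): relative paths of files that exist
    only on the damaged tree, files whose size or mtime differs from the
    backup copy, and entries that could not be read at all.
    """
    new, changed, unreadable = [], [], []

    def note_error(err):
        # os.walk calls this with the OSError when a directory cannot be listed.
        unreadable.append(os.path.relpath(err.filename, damaged_root))

    for dirpath, _dirnames, filenames in os.walk(damaged_root, onerror=note_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, damaged_root)
            ref = os.path.join(backup_root, rel)
            try:
                st = os.lstat(path)
            except OSError:
                unreadable.append(rel)
                continue
            if not os.path.lexists(ref):
                new.append(rel)          # not in the backup: created since
                continue
            ref_st = os.lstat(ref)
            if (st.st_size != ref_st.st_size
                    or int(st.st_mtime) != int(ref_st.st_mtime)):
                changed.append(rel)      # metadata differs from the backup
    return sorted(new), sorted(changed), sorted(unreadable)
```

Walking the damaged tree read-only like this does not write to it, which
fits the advice to leave the corrupted disk untouched until the data has
been copied off.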
--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

2005-11-26 06:43:12

by Willy Tarreau

Subject: Re: defective RAM: corrupted ext3 FS. How to identify corrupted files/directories?

On Fri, Nov 25, 2005 at 11:34:06AM +0100, jerome lacoste wrote:
> Hi.
>
> My RAM died, and it corrupted my file system. It seems like this
> machine just wants to die... [1]
>
> After removing the faulty RAM, I can boot. I made extensive memtest86+ tests.
> I now have my home partition mounted as read-only because of said corruption.
>
> I see a bunch of "ext3_readdir: directory xxxx contains a hole at
> offset xxxxx" when I try to access some parts of my disk.
>
> I postponed fscking the FS until I have identified the faulty data.

You had the right reflex. Never write your data off as lost until you
have actually tried to recover it. Hard disks are always cheaper than the
time needed to redo all the lost work, so keep the disk in safe conditions.

> I was thinking of doing a rsync --dry-run against a known working
> backup and check the logs. Any better idea? Is there a way to convert
> the directory IDs into file paths?

For this, I've already used a tool that a friend of mine developed
for our distro. It computes file-system signatures (including permissions,
size, date, file names, links and MD5 sums) and can check them with
the ability to ignore differences on some fields (e.g. date, size,
perms, ...). It's called 'flx' [1].

The way to proceed is to write a full sig of your FS to a text file, then
another sig of your backup to another file. For this, you'll have to
restore your old backup somewhere. It's important to write the sigs to
files because you'll be able to edit them to rename the corrupted dirs
once you find the names:

# flx sign bad_fs/. > bad_fs.sig
# flx sign old_fs/. > old_fs.sig
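
For readers without flx at hand, the idea of a text signature file can be
sketched in Python. This is not flx and does not reproduce its file format:
the function name sign_tree and the line layout "mode size mtime md5 path"
are assumptions made for illustration, and this sketch covers regular files
only (no links or directories).

```python
import hashlib
import os
import stat

def sign_tree(root):
    """Return one 'mode size mtime md5 path' line per regular file under
    root, sorted by relative path so two runs can be compared."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # this sketch skips symlinks, devices and sockets
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                # Hash in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(path, root)
            line = (f"{stat.S_IMODE(st.st_mode):04o} {st.st_size} "
                    f"{int(st.st_mtime)} {digest.hexdigest()} {rel}")
            entries.append((rel, line))
    return [line for _rel, line in sorted(entries)]
```

Writing the returned lines to a file, one per line, gives a plain-text
signature that can be edited and diffed by hand. The path is kept as the
last field so that names containing spaces survive a split on the first
four spaces.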

Then you can compare them to find new entries (some of which will be the
corrupted ones) :

# flx check --only-new --ignore-dot --ignore-ldate \
file:old_fs.sig file:bad_fs.sig > diff-sig-1

Check this file and identify the corrupted directory names. Then look at
their contents in bad_fs.sig, and find the same content in old_fs.sig.
This way, you'll know their name and will be able to s/bad_name/name/
in bad_fs.sig, and iterate the process again. (Hint: note somewhere
the association between <bad_name> and <name>).

Once you have found all your directory names, you can check the
contents. Hint: ignore the date changes, because those will belong to
new files or to files with recent changes. What you're interested in
are the files with a different MD5 while the other fields are the same
(because then only the content will have changed):

# flx check --ignore-date --ignore-ldate --ignore-... file:bad_fs.sig file:old_fs.sig

Manually check those files and remove the destroyed ones from the list.
Then, redo the check ignoring MD5 signatures this time, just to find
files that have 'normally' changed since your backup because you've
worked on them.
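
The two passes described above can be sketched in Python as well. This is
an illustration under assumptions, not flx: the signature lines are assumed
to use a hypothetical "mode size mtime md5 path" layout, and the names
parse_sig and classify are invented here. The sketch also requires the
mtime to match before calling a file suspect, which is a stricter reading
than the --ignore-date check shown above.

```python
def parse_sig(lines):
    """Map path -> (mode, size, mtime, md5) from 'mode size mtime md5 path' lines."""
    table = {}
    for line in lines:
        mode, size, mtime, md5, path = line.rstrip("\n").split(" ", 4)
        table[path] = (mode, int(size), int(mtime), md5)
    return table

def classify(old_lines, bad_lines):
    """Split files present in both signatures whose MD5 differs.

    suspect: same mode, size and mtime but different MD5, so only the
             content changed; these need a manual check.
    edited:  MD5 differs and other fields differ too, which is what a
             normal edit since the backup looks like.
    """
    old, bad = parse_sig(old_lines), parse_sig(bad_lines)
    suspect, edited = [], []
    for path in sorted(old.keys() & bad.keys()):
        if old[path][3] == bad[path][3]:
            continue  # identical content, nothing to do
        if old[path][:3] == bad[path][:3]:
            suspect.append(path)
        else:
            edited.append(path)
    return suspect, edited
```

Files that appear in only one of the two signatures are left out on
purpose; they are the new or missing entries handled by the earlier
comparison.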

At the end of this long process, you should be able to identify the
files and directories you want to keep from your corrupted disk, and
copy them over the backup disk to get a nearly complete restore.

> I have around 500 000 files on that partition. It takes time checking them all.

Of course, that's why using a text file signature helps considerably!
Hint: after the restore, you can sign your FS regularly, and you'll be
able to check what has changed since the last backup simply by diffing
the sig files.
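
Diffing two signature snapshots needs nothing more than a line diff. A
minimal sketch with Python's standard difflib, assuming each snapshot is a
list of text lines (the function name sig_changes is hypothetical):

```python
import difflib

def sig_changes(previous_lines, current_lines):
    """Return only the removed ('-') and added ('+') signature lines
    between two snapshots, without headers or hunk markers."""
    diff = list(difflib.unified_diff(previous_lines, current_lines,
                                     lineterm="", n=0))
    # The first two lines are the '---' / '+++' file headers.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

A changed file shows up as one removed line and one added line with the
same path, a new file as a single added line.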

Good luck,
Willy

[1] I've put it online here : http://w.ods.org/flx/flx-0.7.1.tar.gz

> Cheers,
>
> Jerome
>
> [1] http://lkml.org/lkml/2005/2/4/51
> -

2005-11-26 08:48:23

by Grant Coady

Subject: Re: defective RAM: corrupted ext3 FS. How to identify corrupted files/directories?

On Sat, 26 Nov 2005 07:43:07 +0100, Willy Tarreau wrote:

>
>At the end of this long process,

Since it will be a long process, I'd suggest the OP put the HDD in a
machine that runs cooler. Running a HDD at up to 72°C, as has been done,
is not conducive to data integrity :-p

Grant.

2005-11-28 13:32:34

by linux-os (Dick Johnson)

Subject: Re: defective RAM: corrupted ext3 FS. How to identify corrupted files/directories?


On Fri, 25 Nov 2005, Bodo Eggert wrote:

> jerome lacoste wrote:
>
>> My RAM died, and it corrupted my file system. It seems like this
>> machine just wants to die... [1]
>>
>> After removing the faulty RAM, I can boot. I made extensive memtest86+ tests.
>> I now have my home partition mounted as read-only because of said corruption.
>>
>> I see a bunch of "ext3_readdir: directory xxxx contains a hole at
>> offset xxxxx" when I try to access some parts of my disk.
>>
>> I postponed fscking the FS until I have identified the faulty data.
>>
>> I was thinking of doing a rsync --dry-run against a known working
>> backup and check the logs. Any better idea? Is there a way to convert
>> the directory IDs into file paths?
>>
>> I have around 500 000 files on that partition. It takes time checking them
>> all.
>
> 1) Use the backup to get a base on a completely separate HDD.
> 1a) Feel glad about having a backup.
> 2) Find new and changed files on the corrupted disk.
> 3) For each of the files found, inspect its contents and copy it over
> if it's not corrupted. You can't automatically find corrupted files
> unless you know otherwise.
> 4) mkfs
> --
> I thank GMX for sabotaging the use of my addresses by means of lies
> spread via SPF.
> -

Every file on that disk could be defective, even those that were never
accessed. This is because of the device buffering which is independent
of the file-system.

I'd advise copying off (using tar) all user-files to backup media.
Then initialize the file-system(s) on the device from scratch.
Reinstall your OS, then restore the user-files from backups or
the files you copied if they were not backed up previously.

Prior to reinstalling, you can save some of the initialization files
that are still readable and ASCII (like /etc/passwd, and /etc/group).
Remember to execute `pwconv` after copying them back onto the new
distribution.

This might be the time to install the "latest and greatest" distribution.
A few years ago, I had a similar situation where I had religiously made
backups every night. I had been backing up corrupt files! Fortunately
most of the data was 'C' language source-files and headers. The
users were able to find and fix them by attempting to recompile them.
One directory tree had about 3,000 files of which at least 1/2 had
to be fixed. This took one user over a month so this was not trivial.
Some of the source-file problems were not visible! There might be
added spaces after some macros. These were the killers to find!

If you have ECC (or parity) capability on your RAM, make sure it's turned
ON in the BIOS. It's much better to crash as a result of a bad bit than
to have a "few" bad bits in every file.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.
