2008-12-03 10:52:11

by Andre Noll

Subject: Problems with checking corrupted large ext3 file system

Hi,

I've some trouble checking a corrupted 9T large ext3 fs which resides
on a logical volume. The underlying physical volumes are three hardware
raid systems, one of which started to crash frequently. I was able
to pvmove away the data from the buggy system, so everything is fine
now on the hardware side.

However, the crashes left me with a seriously corrupted file system
from which I'm trying to recover as much as possible. First step was
to unmount the file system after users reported I/O errors when trying
to open files. The system log contained many messages like

[102445.420125] EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared for block 544108393

and some of the form

[160301.277477] EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #153542738: rec_len % 4 != 0 - offset=0, inode=1381653864, +rec_len=26709, name_len=79

So I compiled the master branch of the e2fsprogs git repo as of
Dec 1 (tip: 8680b4) and executed

./e2fsck -y -C0 /dev/mapper/abel-abt6_projects

This ran for a while and then started to output a couple of these:

Inode table for group 68217 is not in group. (block 825373744)
WARNING: SEVERE DATA LOSS POSSIBLE.

along with many lines of the form

Illegal block #3036172 (4233778405) in inode 115335438.
CLEARED.

But then it continued just fine without printing further
messages. After about 4 hours it completed but decided to re-run from
the beginning and this is where the real trouble seems to start. The
next day I found thousands of lines like this on the console:

/backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)

followed by

Clone multiply-claimed blocks? yes

At this point the fsck seems to hang. No further messages, no progress
bar for at least 17 hours. The lights on the raid system aren't
flashing but there seems to be a bit of I/O going on as stracing the
e2fsck process yields

lseek(3, 6206310776832, SEEK_SET) = 6206310776832
read(3, "002107740635\tD\t2\t169\t35\t0\thhhhhh"..., 4096) = 4096
lseek(3, 1263113973760, SEEK_SET) = 1263113973760
write(3, "B9K@=?4C=L-F77F4:CGGK\n3\t14221118"..., 4096) = 4096
lseek(3, 5861641846784, SEEK_SET) = 5861641846784
read(3, "hhhhhh\tIIIIIIIIIIIIIIIIIIIIIIIII"..., 4096) = 4096
lseek(3, 1263113977856, SEEK_SET) = 1263113977856
write(3, "\t1.00\t0.46\t19\t4\t2\t0\t1\tA\t33\t31\t0\t"..., 4096) = 4096

There's only about one read per second, so the fsck might take rather
long if it continues to run at this speed ;)
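
For reference, the byte offsets in the trace above can be turned into block numbers. This is a minimal sketch, assuming a 4 KiB block size (suggested by the 4096-byte transfers, but not stated anywhere in this thread):

```python
# Convert the byte offsets from the strace output into block numbers.
# BLOCK_SIZE = 4096 is an assumption based on the 4096-byte reads and
# writes; the block size is not stated explicitly in the thread.
BLOCK_SIZE = 4096


def offset_to_block(offset, block_size=BLOCK_SIZE):
    """Return the number of the block that contains the given byte offset."""
    return offset // block_size


# (operation, byte offset) pairs copied from the strace output above.
trace = [
    ("read", 6206310776832),
    ("write", 1263113973760),
    ("read", 5861641846784),
    ("write", 1263113977856),
]

blocks = [(op, offset_to_block(off)) for op, off in trace]
for op, blk in blocks:
    print(f"{op:5s} block {blk}")
```

Under that assumption all four offsets are block-aligned. The two writes land on consecutive blocks (308377435 and 308377436), while the reads come from widely separated locations.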

It's running for 34 hours now and I don't know what to do, so here are
a couple of questions for you ext3 gurus:

Is there any hope this will ever complete?

Should I abort the fsck and restart?

Do things get even worse if I abort it and mount the file
system r/o so that I can see whether important files are
still there?

Are there any magic e2fsck command line options I should try?

The box is a 2xQuad Core Intel machine with 32G Ram and is running
a vanilla 2.6.25.20 kernel. Any help is greatly appreciated.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


2008-12-04 00:09:53

by Andreas Dilger

Subject: Re: Problems with checking corrupted large ext3 file system

On Dec 03, 2008 11:11 +0100, Andre Noll wrote:
> I've some trouble checking a corrupted 9T large ext3 fs which resides
> on a logical volume. The underlying physical volumes are three hardware
> raid systems, one of which started to crash frequently. I was able
> to pvmove away the data from the buggy system, so everything is fine
> now on the hardware side.

A big question is what kernel you are running on. Anything less than
2.6.18-rhel5 (not sure what vanilla kernel) has bugs with ext3 > 8TB.

The other question is whether there is any expectation that the data
moved from the bad RAID arrays was corrupted.

> However, the crashes left me with a seriously corrupted file system
> from which I'm trying to recover as much as possible. First step was
> to unmount the file system after users reported I/O errors when trying
> to open files. The system log contained many messages like
>
> [102445.420125] EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared for block 544108393
>
> and some of the form
>
> [160301.277477] EXT3-fs error (device dm-2): htree_dirblock_to_tree: bad entry in directory #153542738: rec_len % 4 != 0 - offset=0, inode=1381653864, +rec_len=26709, name_len=79
>
> So I compiled the master branch of the e2fsprogs git repo as of
> Dec 1 (tip: 8680b4) and executed
>
> ./e2fsck -y -C0 /dev/mapper/abel-abt6_projects
>
> This ran for a while and then started to output a couple of these:
>
> Inode table for group 68217 is not in group. (block 825373744)
> WARNING: SEVERE DATA LOSS POSSIBLE.
>
> along with many lines of the form
>
> Illegal block #3036172 (4233778405) in inode 115335438.
> CLEARED.

Running "e2fsck -y" vs. "e2fsck -p" will sometimes do "bad" things because
the "-y" forces it to continue on no matter what. It looks like there
was some serious filesystem corruption beyond the 8TB boundary, and the
inode table for one or more groups (depending on how many of the
"SEVERE DATA LOSS POSSIBLE" messages were printed) is completely lost.

> But then it continued just fine without printing further
> messages. After about 4 hours it completed but decided to re-run from
> the beginning and this is where the real trouble seems to start. The
> next day I found thousands of lines like this on the console:
>
> /backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)
> followed by
>
> Clone multiply-claimed blocks? yes

This is likely fallout from the original corruption above. The bad news
is that these "multiply-claimed blocks" are really bogus because of the
garbage in the missing inode tables... e2fsck has turned random garbage
into inodes, and it results in what you are seeing now.

> At this point the fsck seems to hang. No further messages, no progress
> bar for at least 17 hours.

The pass1b (clone multiply-claimed blocks) code is very slow, because it
involves an O(n^2) operation to find all of the duplicate blocks, read
them from disk, then write them to some new spot on disk, and the e2fsck
allocator is very slow also.
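
As a toy illustration of what "multiply-claimed" means (this is not e2fsck's actual pass1b code, and the inode and block numbers are made up), the following sketch finds blocks that more than one inode points to:

```python
from collections import defaultdict


def find_multiply_claimed(inode_blocks):
    """Map each block claimed by more than one inode to its claimants.

    inode_blocks: dict of inode number -> list of block numbers.
    """
    owners = defaultdict(list)
    for ino, blocks in inode_blocks.items():
        for blk in blocks:
            owners[blk].append(ino)
    return {blk: inos for blk, inos in owners.items() if len(inos) > 1}


# Made-up example: inode 12 holds garbage and points into two good files.
example = {
    10: [100, 101, 102],
    11: [200, 201],
    12: [101, 200, 999],
}
dups = find_multiply_claimed(example)
print(dups)
```

Here a single garbage inode (12) drags two healthy files into the duplicate list. Cloning then means giving each claimant its own copy of every shared block, which costs at least one read and one write per block.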

> The lights on the raid system aren't
> flashing but there seems to be a bit of I/O going on as stracing the
> e2fsck process yields
>
> lseek(3, 6206310776832, SEEK_SET) = 6206310776832
> read(3, "002107740635\tD\t2\t169\t35\t0\thhhhhh"..., 4096) = 4096
> lseek(3, 1263113973760, SEEK_SET) = 1263113973760
> write(3, "B9K@=?4C=L-F77F4:CGGK\n3\t14221118"..., 4096) = 4096
> lseek(3, 5861641846784, SEEK_SET) = 5861641846784
> read(3, "hhhhhh\tIIIIIIIIIIIIIIIIIIIIIIIII"..., 4096) = 4096
> lseek(3, 1263113977856, SEEK_SET) = 1263113977856
> write(3, "\t1.00\t0.46\t19\t4\t2\t0\t1\tA\t33\t31\t0\t"..., 4096) = 4096
>
> There's only about one read per second, so the fsck might take rather
> long if it continues to run at this speed ;)
>
> It's running for 34 hours now and I don't know what to do, so here are
> a couple of questions for you ext3 gurus:
>
> Is there any hope this will ever complete?

Depends on how many inodes are duplicated, but it could be days :-(.

> Should I abort the fsck and restart?

Restarting won't fix anything because it will just get you back to the
same spot 34h from now.

> Do things get even worse if I abort it and mount the file
> system r/o so that I can see whether important files are
> still there?

I would suggest as a starter to run "debugfs -c {devicename}" and
use this to explore the filesystem a bit. This can be done while
e2fsck is running, and will give you an idea of what data is still
there. If you think that a majority of your file data (or even just
the important bits) are available, then I would suggest killing e2fsck,
mounting the filesystem read-only, and copying as much as possible.

The kernel should be largely forgiving of errors it finds on disk.

> Are there any magic e2fsck command line options I should try?

One option is to use the Lustre e2fsprogs which has a patch that tries
to detect such "garbage" inodes and wipe them clean, instead of trying
to continue using them.

http://downloads.lustre.org/public/tools/e2fsprogs/latest/

That said, it may be too late to help because the previous e2fsck run
will have done a lot of work to "clean up" the garbage inodes and they
may no longer be above the "bad inode threshold".

You could try this after copying the data elsewhere, to avoid the need
to restore the filesystem and get a bit more data back, but at that
point it might also be faster to just reformat and restore the data.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-12-04 16:53:16

by Andre Noll

Subject: Re: Problems with checking corrupted large ext3 file system

On 17:09, Andreas Dilger wrote:
> On Dec 03, 2008 11:11 +0100, Andre Noll wrote:
> > I've some trouble checking a corrupted 9T large ext3 fs which resides
> > on a logical volume. The underlying physical volumes are three hardware
> > raid systems, one of which started to crash frequently. I was able
> > to pvmove away the data from the buggy system, so everything is fine
> > now on the hardware side.
>
> A big question is what kernel you are running on. Anything less than
> 2.6.18-rhel5 (not sure what vanilla kernel) has bugs with ext3 > 8TB.

The box is currently running 2.6.25.20 and was never running a kernel
older than 2.6.23.x. So we should be safe regarding those bugs.

> The other question is whether there is any expectation that the data
> moved from the bad RAID arrays was corrupted.

I can't say for sure but I'd guess the data was already corrupted when
I started the pvmove.

> Running "e2fsck -y" vs. "e2fsck -p" will sometimes do "bad" things because
> the "-y" forces it to continue on no matter what.

True. But running with -p would abort and ask me to run without -p
anyway.

> > /backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)
> > followed by
> >
> > Clone multiply-claimed blocks? yes
>
> This is likely fallout from the original corruption above. The bad news
> is that these "multiply-claimed blocks" are really bogus because of the
> garbage in the missing inode tables... e2fsck has turned random garbage
> into inodes, and it results in what you are seeing now.

OK, so I guess I would like to run e2fsck again without cloning those
blocks.

> I would suggest as a starter to run "debugfs -c {devicename}" and
> use this to explore the filesystem a bit. This can be done while
> e2fsck is running, and will give you an idea of what data is still
> there.

Very good idea, thanks. We just did this and the important files
seem to be there but some of them, in particular those which were
mentioned in the fsck output, contain garbage or data from other files
in the middle. So the expensive O(n^2) algorithm indeed seems to be
of little use for our particular case.

> If you think that a majority of your file data (or even just the
> important bits) are available, then I would suggest killing e2fsck,
> mounting the filesystem read-only, and copying as much as possible.

We are considering this, but it also means we have to quickly get 9T
of additional disk space which could turn out to be difficult given the
fact we already borrowed 16T from another department for the pvmove :)

> One option is to use the Lustre e2fsprogs which has a patch that tries
> to detect such "garbage" inodes and wipe them clean, instead of trying
> to continue using them.
>
> http://downloads.lustre.org/public/tools/e2fsprogs/latest/
>
> That said, it may be too late to help because the previous e2fsck run
> will have done a lot of work to "clean up" the garbage inodes and they
> may no longer be above the "bad inode threshold".

I would love to give it a try if it gets me an intact file system
within hours rather than days or even weeks because it avoids the
lengthy algorithm that clones the multiply-claimed blocks.

As the box is running Ubuntu, I could not install the rpm directly.
So I compiled the source from e2fsprogs-1.40.11.sun1.tar.gz which is
contained in e2fsprogs-1.40.11.sun1-0redhat.src.rpm. gcc complained
about unsafe format strings but produced the e2fsck executable.

Do I need to use any command line option to the patched e2fsck? And
is there anything else I should consider before killing the currently
running e2fsck?

Thanks a lot for your help.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


2008-12-04 19:51:43

by Theodore Ts'o

Subject: Re: Problems with checking corrupted large ext3 file system

On Thu, Dec 04, 2008 at 05:37:59PM +0100, Andre Noll wrote:
> OK, so I guess I would like to run e2fsck again without cloning those
> blocks.

Actually, what you should probably do is to take a look at the inodes
which were listed in the pass1b, and if they don't make sense, to
clear them out. An individual inode can be cleared by using the
debugfs clri command. i.e., to zero out inode 12345, you do this:

debugfs -w /dev/mapper/thunk-closure
debugfs: clri <12345>
debugfs: quit

This doesn't work very easily if there is a large number of inodes
that contain garbage, though. I don't have tools that deal well with
wholesale corruption of large portions of the inode table, mostly
because those tools, if misused, could actually cause a lot more harm
than good, and so designing the proper safety mechanism so they are
safe to use in the hands of system administrators that are not
filesystem experts and tend to use commands like "fsck -y" is very
difficult to get right. It's also a failure mode which happens
rarely, so it's never been a high priority to figure out how to create
tools that can safely handle this problem in the general case.
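
If the list of suspect inodes is long, the debugfs commands could at least be generated rather than typed. The sketch below is a hypothetical helper, not part of e2fsprogs: it pulls inode numbers out of pass1b lines in the format shown earlier in this thread and writes one clri request per inode. The resulting text would still have to be reviewed by hand, since the pass1b list names good files as well as garbage ones and clri destroys whatever inode it is given.

```python
import re

# Matches the "(inode #145326082, mod time ...)" fragment that e2fsck
# prints for each file sharing blocks, as quoted earlier in the thread.
INODE_RE = re.compile(r"\(inode #(\d+), mod time ")


def inodes_from_log(lines):
    """Return the inode numbers found in the log, in first-seen order."""
    found = {}
    for line in lines:
        match = INODE_RE.search(line)
        if match:
            found.setdefault(int(match.group(1)), None)
    return list(found)


def clri_commands(inodes):
    """Build debugfs requests, one 'clri <N>' line per inode."""
    return "".join(f"clri <{ino}>\n" for ino in inodes)


log = [
    "/backup/data/solexa_analysis/ATH/MA/MA-30-29/run_30/4/length_42/"
    "reads_0.fl (inode #145326082, mod time Tue Jan 22 05:09:36 2008)",
    "Clone multiply-claimed blocks? yes",
]
commands = clri_commands(inodes_from_log(log))
print(commands, end="")
```

According to its man page, debugfs can read requests from a file given with -f, so a reviewed command file could be fed to "debugfs -w" in one go instead of being entered interactively.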

If you're convinced that all of the inode tables greater than 4TB have
been corrupted, or blocks from a particular physical volume are *all*
toast, one solution is to zero out all of the damaged blocks, on the
theory that there's nothing to save anyway, and e2fsck is trying hard
to save all possible data --- and if you know there's nothing to save
there, clearing the parts of the inode table that are known to be bad
will make e2fsck run more cleanly.

In the long run, I can imagine enhancements to ext4 where we reserve 4
bytes in each inode which are used collectively to store
information to assure ourselves an inode table block really contains
valid data and not random garbage. The first inode in an inode table
block would use the 4 byte field to store the first inode number in
the itable block. The second inode in the inode table block would
store the block number for the current itable block. Each subsequent
inode, for up to 32 inodes, would use the 4 byte field to store
successive 4 bytes of the filesystem UUID. This would allow e2fsck to
validate whether a particular inode table block read in from disk
really was valid or not. (I'm deliberately not including an actual
checksum since that would complicate matters, and if we are going to
store a checksum, we should have one set of fields which indicates
that this block belongs to filesystem XYZ's inode table starting at
position A, and another set of fields that indicates whether one or
more bits in the itable block have gotten flipped. The two are
different concepts and how we react may differ depending on what we know
is incorrect.)
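
The tagging scheme described above can be sketched in a few lines. This only illustrates the idea and is not an existing on-disk format; the byte order, the cycling of the UUID chunks beyond the first four, and all values in the example are assumptions.

```python
import uuid


def expected_tags(first_ino, block_nr, fs_uuid, inodes_per_block):
    """Return the 4-byte tag that each inode slot in one itable block carries.

    Slot 0 holds the first inode number, slot 1 the block number, and the
    remaining slots hold successive 4-byte chunks of the filesystem UUID.
    """
    chunks = [fs_uuid.bytes[i:i + 4] for i in range(0, 16, 4)]
    tags = [
        (first_ino & 0xFFFFFFFF).to_bytes(4, "little"),  # assumed byte order
        (block_nr & 0xFFFFFFFF).to_bytes(4, "little"),
    ]
    for slot in range(2, inodes_per_block):
        # Assumption: a 16-byte UUID has only four chunks, so they repeat.
        tags.append(chunks[(slot - 2) % len(chunks)])
    return tags[:inodes_per_block]


def block_looks_valid(tags, first_ino, block_nr, fs_uuid):
    """Compare the tags read from disk with what this block should hold."""
    return tags == expected_tags(first_ino, block_nr, fs_uuid, len(tags))


# Hypothetical values: 32 inodes of 128 bytes in a 4 KiB block.
fs_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")
good = expected_tags(first_ino=1, block_nr=1000, fs_uuid=fs_uuid,
                     inodes_per_block=32)
garbage = [b"\x00" * 4] * 32

print(block_looks_valid(good, 1, 1000, fs_uuid))
print(block_looks_valid(garbage, 1, 1000, fs_uuid))
```

A block of random data would have to match all of these tags by chance to pass, and a valid itable block that ended up at the wrong location would fail the inode number or block number comparison.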

> > One option is to use the Lustre e2fsprogs which has a patch that tries
> > to detect such "garbage" inodes and wipe them clean, instead of trying
> > to continue using them.
> >
> > http://downloads.lustre.org/public/tools/e2fsprogs/latest/
> >
> > That said, it may be too late to help because the previous e2fsck run
> > will have done a lot of work to "clean up" the garbage inodes and they
> > may no longer be above the "bad inode threshold".
>
> I would love to give it a try if it gets me an intact file system
> within hours rather than days or even weeks because it avoids the
> lengthy algorithm that clones the multiply-claimed blocks.

Well, it's still worth a shot.


> As the box is running Ubuntu, I could not install the rpm directly.
> So I compiled the source from e2fsprogs-1.40.11.sun1.tar.gz which is
> contained in e2fsprogs-1.40.11.sun1-0redhat.src.rpm. gcc complained
> about unsafe format strings but produced the e2fsck executable.
>
> Do I need to use any command line option to the patched e2fsck? And
> is there anything else I should consider before killing the currently
> running e2fsck?

Nope, try it and let us know whether it seems to work. It might be
possible to augment the heuristics to detect the bad inodes (e.g.,
check whether the modtimes/ctimes reflect times that are totally
outside of what might be considered "normal" times, as an indication
of the itable block's sanity).

But long-term (although it probably won't help you), we should
seriously think about adding some inode sanity-check fields whose
primary purpose is to tell us whether an itable block is likely to be
valid or garbage.

- Ted

2008-12-05 19:39:39

by Andre Noll

Subject: Re: Problems with checking corrupted large ext3 file system

On 14:51, Theodore Tso wrote:
> On Thu, Dec 04, 2008 at 05:37:59PM +0100, Andre Noll wrote:
> > OK, so I guess I would like to run e2fsck again without cloning those
> > blocks.
>
> Actually, what you should probably do is to take a look at the inodes
> which were listed in the pass1b, and if they don't make sense, to
> clear them out. An individual inode can be cleared by using the
> debugfs clri command. i.e., to zero out inode 12345, you do this:
>
> debugfs -w /dev/mapper/thunk-closure
> debugfs: clri <12345>
> debugfs: quit
>
> This doesn't work very easily if there is a large number of inodes
> that contain garbage, though.

That number is larger than the number of lines in the scrollback
buffer of my screen session...

> I don't have tools that deal well with wholesale corruption of large
> portions of the inode table, mostly because those tools, if misused,
> could actually cause a lot more harm than good, and so designing the
> proper safety mechanism so they are safe to use in the hands of system
> administrators that are not filesystem experts and tend to use
> commands like "fsck -y" is very difficult to get right.

What are the alternatives to using -y? Today I interrupted the e2fsck
and started the patched version without -y. The first thing it did
was to ask me

Group descriptor 53702 checksum is invalid. Fix<y>

I typed "y" perhaps 100 times. Then I gave up and reran the command
with the -y switch.

Wouldn't it be nice if e2fsck gave the user not only the option to
fix or not fix the problem, but also the option to always answer
"yes" to _that particular question_, just in case e2fsck later wants
to ask the same question again?

Another useful feature for the clueless admin would be a short
description of the problem at hand, probably together with a severity
indicator and a hint about how safe it is to answer "yes" at this
point and which alternatives there are. Something in the spirit of
Knuth's TeX messages perhaps :)

> If you're convinced that all of the inode tables greater than 4TB have
> been corrupted, or blocks from a particular physical volume are *all*
> toast, one solution is to zero out all of the damaged blocks, on the
> theory that there's nothing to save anyway, and e2fsck is trying hard
> to save all possible data --- and if you know there's nothing to save
> there, clearing the parts of the inode table that are known to be bad
> will make e2fsck run more cleanly.

Unfortunately I have no idea which inode tables might be corrupted. The
PVs containing the corrupted ext3 file system reside entirely on a
24 x 1T Areca raid system which is split into two raid6 volumes. So
Linux sees two ~10T devices which are used as PVs for LVM. The LV
containing the corrupted file system was created with --stripes 2 so
that all 24 disks are used.

That setup worked fine for two months. The raid system started to
crash and fail disks soon after we added a 16 x 1T JBOD extension
unit. We've lost 11(!) disks during one month but no more than two in
a single raid set at any given time. Four of these 11 disk failures
happened to be bogus, i.e. we could fix the problem simply by pulling
and re-inserting the "failed" disks. The event log of the raid system
also contained quite a lot of timeout messages for various disks.

Unfortunately, at the time these problems started to show up, the
16T of the JBOD unit had been added to the VG already (configured
as a single raid6). However, we've only used the additional space to
enlarge _another_ file system in the same VG.

> > > One option is to use the Lustre e2fsprogs which has a patch that tries
> > > to detect such "garbage" inodes and wipe them clean, instead of trying
> > > to continue using them.
> > >
> > > http://downloads.lustre.org/public/tools/e2fsprogs/latest/
> > >
> > > That said, it may be too late to help because the previous e2fsck run
> > > will have done a lot of work to "clean up" the garbage inodes and they
> > > may no longer be above the "bad inode threshold".
> >
> > I would love to give it a try if it gets me an intact file system
> > within hours rather than days or even weeks because it avoids the
> > lengthy algorithm that clones the multiply-claimed blocks.
>
> Well, it's still worth a shot.
>
>
> > As the box is running Ubuntu, I could not install the rpm directly.
> > So I compiled the source from e2fsprogs-1.40.11.sun1.tar.gz which is
> > contained in e2fsprogs-1.40.11.sun1-0redhat.src.rpm. gcc complained
> > about unsafe format strings but produced the e2fsck executable.
> >
> > Do I need to use any command line option to the patched e2fsck? And
> > is there anything else I should consider before killing the currently
> > running e2fsck?
>
> Nope, try it and let us know whether it seems to work.

It completed within 5 hours. During this time it printed many messages
of the form

Inode 132952070 has corrupt indirect block
Clear? yes

and

Inode 132087812, i_blocks is 448088, should be 439888. Fix? yes

Also a couple of these are contained in the output:

Too many illegal blocks in inode 132952070.
Clear inode? yes

After 5 hours, it printed the "Restarting e2fsck from the
beginning..." message just like the unpatched version. It's now at 45%
in the second run with no further messages so far. In particular, there
are no more "clone multiply-claimed blocks" messages. I'm leaving
for the weekend now, but I'll send another mail on Monday.

Thanks for your help, I really appreciate it.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


2008-12-06 00:09:41

by Andreas Dilger

Subject: Re: Problems with checking corrupted large ext3 file system

On Dec 05, 2008 20:23 +0100, Andre Noll wrote:
> What are the alternatives to using -y? Today I interrupted the e2fsck
> and started the patched version without -y. The first thing it did
> was to ask me
>
> Group descriptor 53702 checksum is invalid. Fix<y>
>
> I typed "y" perhaps 100 times. Then I gave up and reran the command
> with the -y switch.
>
> Wouldn't it be nice if e2fsck gave the user not only the option to
> fix or not fix the problem, but also the option to always answer
> "yes" to _that particular question_, just in case e2fsck later wants
> to ask the same question again?

Yes, this would be very useful... IIRC there was a patch for this
posted to this list maybe 2 years ago, and I was just thinking about
it... It would allow answering "y/n/A/N" (yes/no/always/never) or
similar, but IIRC there were some language issues involved...

Searching in Google and the linux-ext4 archive didn't show anything
obvious, but I thought it was useful at the time and I don't know what
happened to it. Maybe Ted has a copy in his mail archive?
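
To illustrate the y/n/A/N idea (this is not the lost patch, just a sketch of the behaviour; the problem code names below are invented), a prompt helper could remember "always" and "never" answers per problem type:

```python
def make_asker(read_answer):
    """Return ask(code, question), which remembers 'A'/'N' per problem code.

    read_answer is any callable that takes a prompt string and returns the
    user's reply; pass the built-in input for interactive use.
    """
    remembered = {}

    def ask(code, question):
        if code in remembered:
            return remembered[code]
        while True:
            reply = read_answer(f"{question} (y/n/A/N)? ")
            if reply in ("y", "n"):
                return reply == "y"
            if reply in ("A", "N"):
                remembered[code] = (reply == "A")
                return remembered[code]

    return ask


# Scripted replies stand in for a user typing at the console.
replies = iter(["A", "n"])
ask = make_asker(lambda prompt: next(replies))

results = [
    ask("GROUP_DESC_CHECKSUM", "Group descriptor 53702 checksum is invalid. Fix"),
    ask("GROUP_DESC_CHECKSUM", "Group descriptor 53703 checksum is invalid. Fix"),
    ask("CORRUPT_INDIRECT", "Inode 132952070 has corrupt indirect block. Clear"),
]
print(results)
```

The second group descriptor question is answered from memory without prompting, while the unrelated indirect-block question still asks the user.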

> It completed within 5 hours. During this time it printed many messages
> of the form
>
> Inode 132952070 has corrupt indirect block
> Clear? yes
>
> and
>
> Inode 132087812, i_blocks is 448088, should be 439888. Fix? yes
>
> Also a couple of these are contained in the output:
>
> Too many illegal blocks in inode 132952070.
> Clear inode? yes
>
> After 5 hours, it printed the "Restarting e2fsck from the
> beginning..." message just like the unpatched version. It's now at 45%
> in the second run with no further messages so far. In particular, there
> are no more "clone multiply-claimed blocks" messages. I'm leaving
> for the weekend now, but I'll send another mail on Monday.

It would be worthwhile to get these patches into upstream at some point.

The "ibadness" patch was implemented exactly for cases like this where
the "clone blocks" pass would otherwise take forever. While it is a
"heuristic" approach, in the majority of cases the inode with the most
"badness" (i.e. shared blocks, bad fields, etc) is the offender and is
the one that is zeroed.
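
The bookkeeping can be sketched roughly as follows. This is a simplified model and not the patch itself: the two increments and the timestamp checks follow the patch text further down, whereas the strict "greater than" comparison, the use of a plain dictionary in place of icount, and every value in the example are assumptions.

```python
BADNESS_NORMAL = 1   # value taken from the patch's e2fsck.h hunk
BADNESS_HIGH = 2     # value taken from the patch's e2fsck.h hunk


class BadnessTracker:
    """Accumulate a badness score per inode (a dict stands in for icount)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.badness = {}

    def mark_bad(self, ino, count):
        self.badness[ino] = self.badness.get(ino, 0) + count

    def should_clear(self, ino):
        # Assumption: "above the threshold" means strictly greater than.
        return self.badness.get(ino, 0) > self.threshold


def check_times(tracker, ino, atime, mtime, ctime, now, tolerance, mkfs_time):
    """Timestamp checks modelled on the pass1.c hunk of the patch."""
    if atime > now + tolerance or mtime > now + tolerance:
        tracker.mark_bad(ino, BADNESS_NORMAL)
    if ctime < mkfs_time or ctime > now + tolerance:
        tracker.mark_bad(ino, BADNESS_HIGH)


# Hypothetical numbers; 7 is the default named in the patch description.
tracker = BadnessTracker(threshold=7)
now, tolerance, mkfs_time = 1228500000, 86400, 1200000000

# A garbage inode with timestamps far in the future, plus other bad fields.
check_times(tracker, 1001, 4000000000, 4000000000, 4000000000,
            now, tolerance, mkfs_time)
after_times = tracker.badness[1001]
for _ in range(3):
    tracker.mark_bad(1001, BADNESS_NORMAL)
tracker.mark_bad(1001, BADNESS_HIGH)

# A healthy inode with plausible timestamps.
check_times(tracker, 1002, 1210000000, 1210000000, 1210000000,
            now, tolerance, mkfs_time)

print(tracker.should_clear(1001), tracker.should_clear(1002))
```

A single odd field leaves an inode well below the threshold, so only inodes that look wrong in several independent ways are wiped instead of being sent through the block cloning pass.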

This patch is against 1.40.11, and needs a bunch of rework against the
latest extents code, but before that is done it would be good to get
an ack from Ted as to the basic approach.

[there are 2 extra testcases for this patch, not included here]

==================== e2fsprogs-ibadness-counter.patch ===================

The present e2fsck code checks the inode on a per-field basis. It doesn't
take into consideration the total sanity of the inode. This may cause
e2fsck to turn a garbage inode into an apparently sane inode ("It is a
vessel of fertilizer, and none may abide its strength.").

The following patch adds a heuristic to detect the degree of badness of
an inode. The icount mechanism is used to keep track of the badness of
every inode. The badness is increased as various fields in the inode are
found to be corrupt. Badness above a certain threshold value results in
deletion of the inode. The default threshold value is 7; it can be
specified to e2fsck using "-E inode_badness_threshold=<value>".

This can avoid lengthy pass1b shared block processing, where a corrupt
chunk of the inode table has resulted in a bunch of garbage inodes
suddenly having shared blocks with a lot of good inodes (or each other).

Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Girish Shilamkar <[email protected]>

Index: e2fsprogs-1.41.1/e2fsck/e2fsck.h
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/e2fsck.h
+++ e2fsprogs-1.41.1/e2fsck/e2fsck.h
@@ -11,6 +11,7 @@

#include <stdio.h>
#include <string.h>
+#include <stddef.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
@@ -198,6 +199,18 @@ typedef enum {
E2F_CLONE_ZERO
} clone_opt_t;

+#define EXT4_FITS_IN_INODE(ext4_inode, einode, field) \
+ ((offsetof(typeof(*ext4_inode), field) + \
+ sizeof(ext4_inode->field)) \
+ <= (EXT2_GOOD_OLD_INODE_SIZE + \
+ (einode)->i_extra_isize)) \
+
+#define BADNESS_NORMAL 1
+#define BADNESS_HIGH 2
+#define BADNESS_THRESHOLD 8
+#define BADNESS_BAD_MODE 100
+#define BADNESS_LARGE_FILE 2199023255552ULL
+
/*
* Define the extended attribute refcount structure
*/
@@ -234,7 +247,6 @@ struct e2fsck_struct {
unsigned long max);

ext2fs_inode_bitmap inode_used_map; /* Inodes which are in use */
- ext2fs_inode_bitmap inode_bad_map; /* Inodes which are bad somehow */
ext2fs_inode_bitmap inode_dir_map; /* Inodes which are directories */
ext2fs_inode_bitmap inode_bb_map; /* Inodes which are in bad blocks */
ext2fs_inode_bitmap inode_imagic_map; /* AFS inodes */
@@ -249,6 +261,8 @@ struct e2fsck_struct {
*/
ext2_icount_t inode_count;
ext2_icount_t inode_link_info;
+ ext2_icount_t inode_badness;
+ int inode_badness_threshold;

ext2_refcount_t refcount;
ext2_refcount_t refcount_extra;
@@ -349,6 +363,7 @@ struct e2fsck_struct {
/* misc fields */
time_t now;
time_t time_fudge; /* For working around buggy init scripts */
+ time_t now_tolerance_val;
int ext_attr_ver;
shared_opt_t shared;
clone_opt_t clone;
@@ -465,6 +480,8 @@ extern int e2fsck_pass1_check_symlink(ex
extern void e2fsck_clear_inode(e2fsck_t ctx, ext2_ino_t ino,
struct ext2_inode *inode, int restart_flag,
const char *source);
+extern void e2fsck_mark_inode_bad(e2fsck_t ctx, ino_t ino, int count);
+extern int is_inode_bad(e2fsck_t ctx, ino_t ino);

/* pass2.c */
extern int e2fsck_process_bad_inode(e2fsck_t ctx, ext2_ino_t dir,
Index: e2fsprogs-1.41.1/e2fsck/pass1.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/pass1.c
+++ e2fsprogs-1.41.1/e2fsck/pass1.c
@@ -20,7 +20,8 @@
* - A bitmap of which inodes are in use. (inode_used_map)
* - A bitmap of which inodes are directories. (inode_dir_map)
* - A bitmap of which inodes are regular files. (inode_reg_map)
- * - A bitmap of which inodes have bad fields. (inode_bad_map)
+ * - An icount mechanism is used to keep track of
+ * inodes with bad fields and its badness (ctx->inode_badness)
* - A bitmap of which inodes are in bad blocks. (inode_bb_map)
* - A bitmap of which inodes are imagic inodes. (inode_imagic_map)
* - A bitmap of which inodes need to be expanded (expand_eisize_map)
@@ -67,7 +68,6 @@ static void check_blocks(e2fsck_t ctx, s
static void mark_table_blocks(e2fsck_t ctx);
static void alloc_bb_map(e2fsck_t ctx);
static void alloc_imagic_map(e2fsck_t ctx);
-static void mark_inode_bad(e2fsck_t ctx, ino_t ino);
static void handle_fs_bad_blocks(e2fsck_t ctx);
static void process_inodes(e2fsck_t ctx, char *block_buf);
static EXT2_QSORT_TYPE process_inode_cmp(const void *a, const void *b);
@@ -241,6 +241,7 @@ static void check_immutable(e2fsck_t ctx
if (!(pctx->inode->i_flags & BAD_SPECIAL_FLAGS))
return;

+ e2fsck_mark_inode_bad(ctx, pctx->ino, BADNESS_NORMAL);
if (!fix_problem(ctx, PR_1_SET_IMMUTABLE, pctx))
return;

@@ -259,6 +260,7 @@ static void check_size(e2fsck_t ctx, str
if ((inode->i_size == 0) && (inode->i_size_high == 0))
return;

+ e2fsck_mark_inode_bad(ctx, pctx->ino, BADNESS_NORMAL);
if (!fix_problem(ctx, PR_1_SET_NONZSIZE, pctx))
return;

@@ -373,6 +375,7 @@ static void check_inode_extra_space(e2fs
*/
if (inode->i_extra_isize &&
(inode->i_extra_isize < min || inode->i_extra_isize > max)) {
+ e2fsck_mark_inode_bad(ctx, pctx->ino, BADNESS_NORMAL);
if (!fix_problem(ctx, PR_1_EXTRA_ISIZE, pctx))
return;
inode->i_extra_isize = ctx->want_extra_isize;
@@ -466,6 +469,7 @@ static void check_is_really_dir(e2fsck_t
(rec_len % 4))
return;

+ e2fsck_mark_inode_bad(ctx, pctx->ino, BADNESS_NORMAL);
if (fix_problem(ctx, PR_1_TREAT_AS_DIRECTORY, pctx)) {
inode->i_mode = (inode->i_mode & 07777) | LINUX_S_IFDIR;
e2fsck_write_inode_full(ctx, pctx->ino, inode,
@@ -662,6 +666,7 @@ void e2fsck_pass1(e2fsck_t ctx)
ext2_filsys fs = ctx->fs;
ext2_ino_t ino;
struct ext2_inode *inode;
+ struct ext2_inode_large *inode_large;
ext2_inode_scan scan;
char *block_buf;
#ifdef RESOURCE_TRACK
@@ -907,14 +912,17 @@ void e2fsck_pass1(e2fsck_t ctx)
ehp = inode->i_block;
#endif
if ((ext2fs_extent_header_verify(ehp,
- sizeof(inode->i_block)) == 0) &&
- (fix_problem(ctx, PR_1_UNSET_EXTENT_FL, &pctx))) {
- inode->i_flags |= EXT4_EXTENTS_FL;
+ sizeof(inode->i_block)) == 0)) {
+ if (fix_problem(ctx, PR_1_UNSET_EXTENT_FL, &pctx)) {
+ e2fsck_mark_inode_bad(ctx, ino,
+ BADNESS_NORMAL);
+ inode->i_flags |= EXT4_EXTENTS_FL;
#ifdef WORDS_BIGENDIAN
- memcpy(inode->i_block, tmp_block,
- sizeof(inode->i_block));
+ memcpy(inode->i_block, tmp_block,
+ sizeof(inode->i_block));
#endif
- e2fsck_write_inode(ctx, ino, inode, "pass1");
+ e2fsck_write_inode(ctx, ino, inode, "pass1");
+ }
}
}

@@ -978,6 +986,7 @@ void e2fsck_pass1(e2fsck_t ctx)
e2fsck_write_inode(ctx, ino, inode,
"pass1");
}
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
}
} else if (ino == EXT2_JOURNAL_INO) {
ext2fs_mark_inode_bitmap(ctx->inode_used_map, ino);
@@ -1084,6 +1093,7 @@ void e2fsck_pass1(e2fsck_t ctx)
inode->i_dtime = 0;
e2fsck_write_inode(ctx, ino, inode, "pass1");
}
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
}

ext2fs_mark_inode_bitmap(ctx->inode_used_map, ino);
@@ -1096,14 +1106,15 @@ void e2fsck_pass1(e2fsck_t ctx)
frag = fsize = 0;
}

+ /* Fixed in pass2, e2fsck_process_bad_inode(). */
if (inode->i_faddr || frag || fsize ||
(LINUX_S_ISDIR(inode->i_mode) && inode->i_dir_acl))
- mark_inode_bad(ctx, ino);
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
if ((fs->super->s_creator_os == EXT2_OS_LINUX) &&
!(fs->super->s_feature_ro_compat &
EXT4_FEATURE_RO_COMPAT_HUGE_FILE) &&
(inode->osd2.linux2.l_i_blocks_hi != 0))
- mark_inode_bad(ctx, ino);
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
if (inode->i_flags & EXT2_IMAGIC_FL) {
if (imagic_fs) {
if (!ctx->inode_imagic_map)
@@ -1116,6 +1127,7 @@ void e2fsck_pass1(e2fsck_t ctx)
e2fsck_write_inode(ctx, ino,
inode, "pass1");
}
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
}
}

@@ -1171,8 +1183,42 @@ void e2fsck_pass1(e2fsck_t ctx)
check_immutable(ctx, &pctx);
check_size(ctx, &pctx);
ctx->fs_sockets_count++;
- } else
- mark_inode_bad(ctx, ino);
+ } else {
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
+ }
+
+ if (inode->i_atime > ctx->now + ctx->now_tolerance_val ||
+ inode->i_mtime > ctx->now + ctx->now_tolerance_val)
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
+
+ if (inode->i_ctime < sb->s_mkfs_time ||
+ inode->i_ctime > ctx->now + ctx->now_tolerance_val)
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_HIGH);
+
+ if (EXT4_FITS_IN_INODE(inode_large,
+ (struct ext2_inode_large *)inode, i_crtime)) {
+ if (((struct ext2_inode_large *)inode)->i_crtime <
+ sb->s_mkfs_time ||
+ ((struct ext2_inode_large *)inode)->i_crtime >
+ ctx->now + ctx->now_tolerance_val) {
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_HIGH);
+ }
+ }
+
+ /* Is it a regular file */
+ if ((LINUX_S_ISREG(inode->i_mode)) &&
+ /* File size > 2TB */
+ ((((long long)inode->i_size_high << 32) +
+ inode->i_size) > BADNESS_LARGE_FILE) &&
+ /* fs does not have huge file feature */
+ ((fs->super->s_creator_os == EXT2_OS_LINUX) &&
+ !(fs->super->s_feature_ro_compat &
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE) &&
+ /* inode does not have enough blocks for size */
+ (inode->osd2.linux2.l_i_blocks_hi != 0))) {
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
+ }
+
if (!(inode->i_flags & EXT4_EXTENTS_FL)) {
if (inode->i_block[EXT2_IND_BLOCK])
ctx->fs_ind_count++;
@@ -1389,29 +1435,27 @@ static EXT2_QSORT_TYPE process_inode_cmp
}

/*
- * Mark an inode as being bad in some what
+ * Mark an inode as being bad and increment its badness counter.
*/
-static void mark_inode_bad(e2fsck_t ctx, ino_t ino)
+void e2fsck_mark_inode_bad(e2fsck_t ctx, ino_t ino, int count)
{
- struct problem_context pctx;
+ struct problem_context pctx;
+ __u16 result;

- if (!ctx->inode_bad_map) {
+ if (!ctx->inode_badness) {
clear_problem_context(&pctx);
-
- pctx.errcode = ext2fs_allocate_inode_bitmap(ctx->fs,
- _("bad inode map"), &ctx->inode_bad_map);
+ pctx.errcode = ext2fs_create_icount2(ctx->fs, 0, 0, NULL,
+ &ctx->inode_badness);
if (pctx.errcode) {
- pctx.num = 3;
- fix_problem(ctx, PR_1_ALLOCATE_IBITMAP_ERROR, &pctx);
- /* Should never get here */
+ fix_problem(ctx, PR_1_ALLOCATE_ICOUNT, &pctx);
ctx->flags |= E2F_FLAG_ABORT;
return;
}
}
- ext2fs_mark_inode_bitmap(ctx->inode_bad_map, ino);
+ ext2fs_icount_fetch(ctx->inode_badness, ino, &result);
+ ext2fs_icount_store(ctx->inode_badness, ino, count + result);
}

-
/*
* This procedure will allocate the inode "bb" (badblock) map table
*/
@@ -1566,7 +1610,8 @@ static int check_ext_attr(e2fsck_t ctx,
if (!(fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_EXT_ATTR) ||
(blk < fs->super->s_first_data_block) ||
(blk >= fs->super->s_blocks_count)) {
- mark_inode_bad(ctx, ino);
+ /* Fixed in pass2, e2fsck_process_bad_inode(). */
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
return 0;
}

@@ -1732,9 +1777,11 @@ static int handle_htree(e2fsck_t ctx, st

if ((!LINUX_S_ISDIR(inode->i_mode) &&
fix_problem(ctx, PR_1_HTREE_NODIR, pctx)) ||
- (!(fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX) &&
- fix_problem(ctx, PR_1_HTREE_SET, pctx)))
- return 1;
+ (!(fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_DIR_INDEX))) {
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
+ if (fix_problem(ctx, PR_1_HTREE_SET, pctx))
+ return 1;
+ }

pctx->errcode = ext2fs_bmap(fs, ino, inode, 0, 0, 0, &blk);

@@ -1742,6 +1789,7 @@ static int handle_htree(e2fsck_t ctx, st
(blk == 0) ||
(blk < fs->super->s_first_data_block) ||
(blk >= fs->super->s_blocks_count)) {
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
if (fix_problem(ctx, PR_1_HTREE_BADROOT, pctx))
return 1;
else
@@ -1749,8 +1797,11 @@ static int handle_htree(e2fsck_t ctx, st
}

retval = io_channel_read_blk(fs->io, blk, 1, block_buf);
- if (retval && fix_problem(ctx, PR_1_HTREE_BADROOT, pctx))
- return 1;
+ if (retval) {
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
+ if (fix_problem(ctx, PR_1_HTREE_BADROOT, pctx))
+ return 1;
+ }

/* XXX should check that beginning matches a directory */
root = (struct ext2_dx_root_info *) (block_buf + 24);
@@ -1791,8 +1842,8 @@ void e2fsck_clear_inode(e2fsck_t ctx, ex
ext2fs_unmark_inode_bitmap(ctx->inode_used_map, ino);
if (ctx->inode_reg_map)
ext2fs_unmark_inode_bitmap(ctx->inode_reg_map, ino);
- if (ctx->inode_bad_map)
- ext2fs_unmark_inode_bitmap(ctx->inode_bad_map, ino);
+ if (ctx->inode_badness)
+ ext2fs_icount_store(ctx->inode_badness, ino, 0);

/*
* If the inode was partially accounted for before processing
@@ -1863,6 +1914,11 @@ static void scan_extent_node(e2fsck_t ct
problem = PR_1_EXTENT_ENDS_BEYOND;

if (problem) {
+ /* To ensure that extent is in inode */
+ if (info.curr_level == 0)
+ e2fsck_mark_inode_bad(ctx, pctx->ino,
+ BADNESS_HIGH);
+
pctx->blk = extent.e_pblk;
pctx->blk2 = extent.e_lblk;
pctx->num = extent.e_len;
@@ -2031,6 +2087,7 @@ static void check_blocks(e2fsck_t ctx, s
inode->i_flags &= ~EXT2_COMPRBLK_FL;
dirty_inode++;
}
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
}
}

@@ -2105,6 +2162,11 @@ static void check_blocks(e2fsck_t ctx, s
ctx->fs_directory_count--;
return;
}
+ /*
+ * The mode might be incorrect. Increasing the badness by a
+ * small amount won't hurt much.
+ */
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
}

if (!(fs->super->s_feature_ro_compat &
@@ -2157,6 +2219,7 @@ static void check_blocks(e2fsck_t ctx, s
inode->i_size_high = pctx->num >> 32;
dirty_inode++;
}
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
pctx->num = 0;
}
if (LINUX_S_ISREG(inode->i_mode) &&
@@ -2173,6 +2236,7 @@ static void check_blocks(e2fsck_t ctx, s
inode->osd2.linux2.l_i_blocks_hi = 0;
dirty_inode++;
}
+ e2fsck_mark_inode_bad(ctx, ino, BADNESS_NORMAL);
pctx->num = 0;
}
out:
@@ -2332,8 +2396,10 @@ static int process_block(ext2_filsys fs,
problem = PR_1_TOOBIG_SYMLINK;

if (blk < fs->super->s_first_data_block ||
- blk >= fs->super->s_blocks_count)
+ blk >= fs->super->s_blocks_count) {
problem = PR_1_ILLEGAL_BLOCK_NUM;
+ e2fsck_mark_inode_bad(ctx, pctx->ino, BADNESS_NORMAL);
+ }

if (problem) {
p->num_illegal_blocks++;
Index: e2fsprogs-1.41.1/e2fsck/pass4.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/pass4.c
+++ e2fsprogs-1.41.1/e2fsck/pass4.c
@@ -181,6 +181,7 @@ void e2fsck_pass4(e2fsck_t ctx)
}
ext2fs_free_icount(ctx->inode_link_info); ctx->inode_link_info = 0;
ext2fs_free_icount(ctx->inode_count); ctx->inode_count = 0;
+ ext2fs_free_icount(ctx->inode_badness); ctx->inode_badness = 0;
ext2fs_free_inode_bitmap(ctx->inode_bb_map);
ctx->inode_bb_map = 0;
ext2fs_free_inode_bitmap(ctx->inode_imagic_map);
Index: e2fsprogs-1.41.1/e2fsck/pass2.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/pass2.c
+++ e2fsprogs-1.41.1/e2fsck/pass2.c
@@ -33,11 +33,10 @@
* Pass 2 relies on the following information from previous passes:
* - The directory information collected in pass 1.
* - The inode_used_map bitmap
- * - The inode_bad_map bitmap
+ * - The inode_badness icount
* - The inode_dir_map bitmap
*
* Pass 2 frees the following data structures
- * - The inode_bad_map bitmap
* - The inode_reg_map bitmap
*/

@@ -258,10 +257,6 @@ void e2fsck_pass2(e2fsck_t ctx)
ext2fs_free_mem(&buf);
ext2fs_free_dblist(fs->dblist);

- if (ctx->inode_bad_map) {
- ext2fs_free_inode_bitmap(ctx->inode_bad_map);
- ctx->inode_bad_map = 0;
- }
if (ctx->inode_reg_map) {
ext2fs_free_inode_bitmap(ctx->inode_reg_map);
ctx->inode_reg_map = 0;
@@ -494,6 +489,7 @@ static _INLINE_ int check_filetype(e2fsc
{
int filetype = dirent->name_len >> 8;
int should_be = EXT2_FT_UNKNOWN;
+ __u16 result;
struct ext2_inode inode;

if (!(ctx->fs->super->s_feature_incompat &
@@ -505,16 +501,18 @@ static _INLINE_ int check_filetype(e2fsc
return 1;
}

+ if (ctx->inode_badness)
+ ext2fs_icount_fetch(ctx->inode_badness, dirent->inode,
+ &result);
+
if (ext2fs_test_inode_bitmap(ctx->inode_dir_map, dirent->inode)) {
should_be = EXT2_FT_DIR;
} else if (ext2fs_test_inode_bitmap(ctx->inode_reg_map,
dirent->inode)) {
should_be = EXT2_FT_REG_FILE;
- } else if (ctx->inode_bad_map &&
- ext2fs_test_inode_bitmap(ctx->inode_bad_map,
- dirent->inode))
+ } else if (ctx->inode_badness && result >= BADNESS_BAD_MODE) {
should_be = 0;
- else {
+ } else {
e2fsck_read_inode(ctx, dirent->inode, &inode,
"check_filetype");
should_be = ext2_file_type(inode.i_mode);
@@ -965,12 +963,10 @@ out_htree:
* (We wait until now so that we can display the
* pathname to the user.)
*/
- if (ctx->inode_bad_map &&
- ext2fs_test_inode_bitmap(ctx->inode_bad_map,
- dirent->inode)) {
- if (e2fsck_process_bad_inode(ctx, ino,
- dirent->inode,
- buf + fs->blocksize)) {
+ if ((ctx->inode_badness) &&
+ ext2fs_icount_is_set(ctx->inode_badness, dirent->inode)) {
+ if (e2fsck_process_bad_inode(ctx, ino, dirent->inode,
+ buf + fs->blocksize)) {
dirent->inode = 0;
dir_modified++;
goto next;
@@ -1189,9 +1185,17 @@ static void deallocate_inode(e2fsck_t ct
struct ext2_inode inode;
struct problem_context pctx;
__u32 count;
+ int extent_fs = 0;

e2fsck_read_inode(ctx, ino, &inode, "deallocate_inode");
+ /* ext2fs_block_iterate2() depends on the extents flags */
+ if (inode.i_flags & EXT4_EXTENTS_FL)
+ extent_fs = 1;
e2fsck_clear_inode(ctx, ino, &inode, 0, "deallocate_inode");
+ if (extent_fs) {
+ inode.i_flags |= EXT4_EXTENTS_FL;
+ e2fsck_write_inode(ctx, ino, &inode, "deallocate_inode");
+ }
clear_problem_context(&pctx);
pctx.ino = ino;

@@ -1218,6 +1222,8 @@ static void deallocate_inode(e2fsck_t ct
if (count == 0) {
ext2fs_unmark_block_bitmap(ctx->block_found_map,
inode.i_file_acl);
+ if (ctx->inode_badness)
+ ext2fs_icount_store(ctx->inode_badness, ino, 0);
ext2fs_block_alloc_stats(fs, inode.i_file_acl, -1);
}
inode.i_file_acl = 0;
@@ -1263,8 +1269,11 @@ extern int e2fsck_process_bad_inode(e2fs
int not_fixed = 0;
unsigned char *frag, *fsize;
struct problem_context pctx;
- int problem = 0;
+ int problem = 0;
+ __u16 badness = 0;

+ if (ctx->inode_badness)
+ ext2fs_icount_fetch(ctx->inode_badness, ino, &badness);
e2fsck_read_inode(ctx, ino, &inode, "process_bad_inode");

clear_problem_context(&pctx);
@@ -1279,6 +1288,7 @@ extern int e2fsck_process_bad_inode(e2fs
inode_modified++;
} else
not_fixed++;
+ badness += BADNESS_NORMAL;
}

if (!LINUX_S_ISDIR(inode.i_mode) && !LINUX_S_ISREG(inode.i_mode) &&
@@ -1312,6 +1322,11 @@ extern int e2fsck_process_bad_inode(e2fs
} else
not_fixed++;
problem = 0;
+ /*
+ * A high value is associated with bad mode in order to detect
+ * that mode was corrupt in check_filetype()
+ */
+ badness += BADNESS_BAD_MODE;
}

if (inode.i_faddr) {
@@ -1320,6 +1335,7 @@ extern int e2fsck_process_bad_inode(e2fs
inode_modified++;
} else
not_fixed++;
+ badness += BADNESS_NORMAL;
}

switch (fs->super->s_creator_os) {
@@ -1337,6 +1353,7 @@ extern int e2fsck_process_bad_inode(e2fs
inode_modified++;
} else
not_fixed++;
+ badness += BADNESS_NORMAL;
pctx.num = 0;
}
if (fsize && *fsize) {
@@ -1346,9 +1363,26 @@ extern int e2fsck_process_bad_inode(e2fs
inode_modified++;
} else
not_fixed++;
+ badness += BADNESS_NORMAL;
pctx.num = 0;
}

+ /* In pass1 these conditions were used to mark the inode bad so that
+ * e2fsck_process_bad_inode() is called to make an extensive check
+ * and prompt for the action to be taken. To compensate for the badness
+ * incremented in pass1 for these conditions, decrease it here.
+ */
+ if ((inode.i_faddr || frag || fsize ||
+ (LINUX_S_ISDIR(inode.i_mode) && inode.i_dir_acl)) ||
+ (inode.i_file_acl &&
+ (!(fs->super->s_feature_compat & EXT2_FEATURE_COMPAT_EXT_ATTR) ||
+ (inode.i_file_acl < fs->super->s_first_data_block) ||
+ (inode.i_file_acl >= fs->super->s_blocks_count)))) {
+ /* badness can be 0 if called from pass4. */
+ if (badness)
+ badness -= BADNESS_NORMAL;
+ }
+
if ((fs->super->s_creator_os == EXT2_OS_LINUX) &&
!(fs->super->s_feature_ro_compat &
EXT4_FEATURE_RO_COMPAT_HUGE_FILE) &&
@@ -1358,6 +1392,8 @@ extern int e2fsck_process_bad_inode(e2fs
inode.osd2.linux2.l_i_blocks_hi = 0;
inode_modified++;
}
+ /* Badness was increased in pass1 for this condition */
+ /* badness += BADNESS_NORMAL; */
}

if (inode.i_file_acl &&
@@ -1368,6 +1404,7 @@ extern int e2fsck_process_bad_inode(e2fs
inode_modified++;
} else
not_fixed++;
+ badness += BADNESS_NORMAL;
}
if (inode.i_dir_acl &&
LINUX_S_ISDIR(inode.i_mode)) {
@@ -1376,12 +1413,29 @@ extern int e2fsck_process_bad_inode(e2fs
inode_modified++;
} else
not_fixed++;
+ badness += BADNESS_NORMAL;
+ }
+
+ /*
+ * The high value due to BADNESS_BAD_MODE should not delete the inode.
+ */
+ if (ctx->inode_badness &&
+ (badness - ((badness >= BADNESS_BAD_MODE) ? BADNESS_BAD_MODE : 0))>=
+ ctx->inode_badness_threshold) {
+ pctx.num = badness;
+ if (fix_problem(ctx, PR_2_INODE_TOOBAD, &pctx)) {
+ deallocate_inode(ctx, ino, 0);
+ if (ctx->flags & E2F_FLAG_SIGNAL_MASK)
+ return 0;
+ return 1;
+ }
+ not_fixed++;
}

if (inode_modified)
e2fsck_write_inode(ctx, ino, &inode, "process_bad_inode");
- if (!not_fixed && ctx->inode_bad_map)
- ext2fs_unmark_inode_bitmap(ctx->inode_bad_map, ino);
+ if (ctx->inode_badness)
+ ext2fs_icount_store(ctx->inode_badness, ino, 0);
return 0;
}

Index: e2fsprogs-1.41.1/e2fsck/problem.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/problem.c
+++ e2fsprogs-1.41.1/e2fsck/problem.c
@@ -1354,6 +1354,11 @@ static struct e2fsck_problem problem_tab
N_("@i %N found in @g %g unused inodes area. "),
PROMPT_FIX, PR_PREEN_OK },

+ /* Inode too bad */
+ { PR_2_INODE_TOOBAD,
+ N_("@i %i is badly corrupt (badness value = %N). "),
+ PROMPT_CLEAR, PR_PREEN_OK },
+
/* Pass 3 errors */

/* Pass 3: Checking directory connectivity */
Index: e2fsprogs-1.41.1/e2fsck/problem.h
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/problem.h
+++ e2fsprogs-1.41.1/e2fsck/problem.h
@@ -814,6 +814,9 @@ struct problem_context {
/* Inode found in group unused inodes area */
#define PR_2_INOREF_IN_UNUSED 0x020047

+/* Inode completely corrupt */
+#define PR_2_INODE_TOOBAD 0x020048
+
/*
* Pass 3 errors
*/
Index: e2fsprogs-1.41.1/lib/ext2fs/icount.c
===================================================================
--- e2fsprogs-1.41.1.orig/lib/ext2fs/icount.c
+++ e2fsprogs-1.41.1/lib/ext2fs/icount.c
@@ -467,6 +467,23 @@ static errcode_t get_inode_count(ext2_ic
return 0;
}

+int ext2fs_icount_is_set(ext2_icount_t icount, ext2_ino_t ino)
+{
+ __u16 result;
+
+ if (ext2fs_test_inode_bitmap(icount->single, ino))
+ return 1;
+ else if (icount->multiple) {
+ if (ext2fs_test_inode_bitmap(icount->multiple, ino))
+ return 1;
+ return 0;
+ }
+ ext2fs_icount_fetch(icount, ino, &result);
+ if (result)
+ return 1;
+ return 0;
+}
+
errcode_t ext2fs_icount_validate(ext2_icount_t icount, FILE *out)
{
errcode_t ret = 0;
Index: e2fsprogs-1.41.1/e2fsck/pass1b.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/pass1b.c
+++ e2fsprogs-1.41.1/e2fsck/pass1b.c
@@ -612,8 +612,8 @@ static void delete_file(e2fsck_t ctx, ex
block_buf, delete_file_block, &pb);
if (pctx.errcode)
fix_problem(ctx, PR_1B_BLOCK_ITERATE, &pctx);
- if (ctx->inode_bad_map)
- ext2fs_unmark_inode_bitmap(ctx->inode_bad_map, ino);
+ if (ctx->inode_badness)
+ e2fsck_mark_inode_bad(ctx, ino, 0);
ext2fs_inode_alloc_stats2(fs, ino, -1, LINUX_S_ISDIR(inode.i_mode));

/* Inode may have changed by block_iterate, so reread it */
Index: e2fsprogs-1.41.1/e2fsck/unix.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/unix.c
+++ e2fsprogs-1.41.1/e2fsck/unix.c
@@ -650,6 +650,18 @@ static void parse_extended_opts(e2fsck_t
extended_usage++;
continue;
}
+ /* -E inode_badness_threshold=<value> */
+ } else if (strcmp(token, "inode_badness_threshold") == 0) {
+ if (!arg) {
+ extended_usage++;
+ continue;
+ }
+ ctx->inode_badness_threshold = strtoul(arg, &p, 0);
+ if (*p != '\0' || (ctx->inode_badness_threshold > 200)){
+ fprintf(stderr, _("Invalid badness value.\n"));
+ extended_usage++;
+ continue;
+ }
} else {
fprintf(stderr, _("Unknown extended option: %s\n"),
token);
@@ -668,6 +680,7 @@ static void parse_extended_opts(e2fsck_t
fputs(("\tshared=<preserve|lost+found|delete>\n"), stderr);
fputs(("\tclone=<dup|zero>\n"), stderr);
fputs(("\texpand_extra_isize\n"), stderr);
+ fputs(("\tinode_badness_threshold=(value)\n"), stderr);
fputc('\n', stderr);
exit(1);
}
@@ -732,6 +745,9 @@ static errcode_t PRS(int argc, char *arg
profile_init(config_fn, &ctx->profile);
initialize_profile_options(ctx);

+ ctx->inode_badness_threshold = BADNESS_THRESHOLD;
+ ctx->now_tolerance_val = 172800; /* Two days */
+
while ((c = getopt (argc, argv, "panyrcC:B:dE:fvtFVM:b:I:j:P:l:L:N:SsDk")) != EOF)
switch (c) {
case 'C':
Index: e2fsprogs-1.41.1/e2fsck/e2fsck.c
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/e2fsck.c
+++ e2fsprogs-1.41.1/e2fsck/e2fsck.c
@@ -107,10 +107,6 @@ errcode_t e2fsck_reset_context(e2fsck_t
ext2fs_free_inode_bitmap(ctx->inode_bb_map);
ctx->inode_bb_map = 0;
}
- if (ctx->inode_bad_map) {
- ext2fs_free_inode_bitmap(ctx->inode_bad_map);
- ctx->inode_bad_map = 0;
- }
if (ctx->inode_imagic_map) {
ext2fs_free_inode_bitmap(ctx->inode_imagic_map);
ctx->inode_imagic_map = 0;
Index: e2fsprogs-1.41.1/e2fsck/e2fsck.8.in
===================================================================
--- e2fsprogs-1.41.1.orig/e2fsck/e2fsck.8.in
+++ e2fsprogs-1.41.1/e2fsck/e2fsck.8.in
@@ -203,6 +203,13 @@ Set the version of the extended attribut
will require while checking the filesystem. The version number may
be 1 or 2. The default extended attribute version format is 2.
.TP
+.BI inode_badness_threshold= threshold_value
+A badness counter is associated with every inode and reflects the degree
+of inode corruption. Each error found in the inode increases the badness by
+1 or 2, and e2fsck offers to clear inodes whose badness is at or above
+.I threshold_value
+(default 7).
+.TP
.BI fragcheck
During pass 1, print a detailed report of any discontiguous blocks for
files in the filesystem.
Index: e2fsprogs-1.41.1/lib/ext2fs/ext2fs.h
===================================================================
--- e2fsprogs-1.41.1.orig/lib/ext2fs/ext2fs.h
+++ e2fsprogs-1.41.1/lib/ext2fs/ext2fs.h
@@ -974,6 +974,7 @@ extern errcode_t ext2fs_initialize(const

/* icount.c */
extern void ext2fs_free_icount(ext2_icount_t icount);
+extern int ext2fs_icount_is_set(ext2_icount_t icount, ext2_ino_t ino);
extern errcode_t ext2fs_create_icount_tdb(ext2_filsys fs, char *tdb_dir,
int flags, ext2_icount_t *ret);
extern errcode_t ext2fs_create_icount2(ext2_filsys fs, int flags,
Index: e2fsprogs-1.41.1/tests/f_messy_inode/expect.1
===================================================================
--- e2fsprogs-1.41.1.orig/tests/f_messy_inode/expect.1
+++ e2fsprogs-1.41.1/tests/f_messy_inode/expect.1
@@ -20,19 +20,21 @@ Pass 2: Checking directory structure
i_file_acl for inode 14 (/MAKEDEV) is 4294901760, should be zero.
Clear? yes

+Inode 14 is badly corrupt (badness value = 13). Clear? yes
+
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(43--49)
Fix? yes

-Free blocks count wrong for group #0 (68, counted=75).
+Free blocks count wrong for group #0 (70, counted=77).
Fix? yes

-Free blocks count wrong (68, counted=75).
+Free blocks count wrong (70, counted=77).
Fix? yes


test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 29/32 files (3.4% non-contiguous), 25/100 blocks
+test_filesys: 28/32 files (3.6% non-contiguous), 23/100 blocks
Exit status is 1
Index: e2fsprogs-1.41.1/tests/f_messy_inode/expect.2
===================================================================
--- e2fsprogs-1.41.1.orig/tests/f_messy_inode/expect.2
+++ e2fsprogs-1.41.1/tests/f_messy_inode/expect.2
@@ -3,5 +3,5 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 29/32 files (0.0% non-contiguous), 25/100 blocks
+test_filesys: 28/32 files (0.0% non-contiguous), 23/100 blocks
Exit status is 0
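
For readers skimming the patch, the core idea can be sketched in a few lines of standalone C. This is only an illustration, not code from e2fsprogs: the sketch_* names and the fixed-size array are hypothetical stand-ins for the icount structure, the values 1, 2 and 7 are taken from the man page text in the patch, and the bad-mode weight of 100 is an assumption (its definition is not shown in this part of the patch). The saturating add is a choice made for the sketch, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the man page text says each error adds 1 or 2 and
 * that the default threshold is 7. The bad-mode weight is hypothetical. */
enum {
	SKETCH_BADNESS_NORMAL    = 1,
	SKETCH_BADNESS_HIGH      = 2,
	SKETCH_BADNESS_THRESHOLD = 7,
	SKETCH_BADNESS_BAD_MODE  = 100
};

#define SKETCH_MAX_INODES 64

/* Stand-in for the icount structure: one 16-bit counter per inode. */
struct sketch_badness {
	uint16_t count[SKETCH_MAX_INODES];
};

/* Add "amount" to an inode's counter, saturating at UINT16_MAX. */
static void sketch_mark_inode_bad(struct sketch_badness *b,
				  unsigned int ino, unsigned int amount)
{
	uint32_t sum;

	assert(ino < SKETCH_MAX_INODES);
	sum = (uint32_t)b->count[ino] + amount;
	b->count[ino] = sum > UINT16_MAX ? UINT16_MAX : (uint16_t)sum;
}

/* Return 1 if the inode should be offered for clearing. The bad-mode
 * weight is subtracted first so that a corrupt mode alone does not
 * push the inode over the threshold. */
static int sketch_inode_too_bad(uint16_t badness, unsigned int threshold)
{
	unsigned int effective = badness;

	if (effective >= SKETCH_BADNESS_BAD_MODE)
		effective -= SKETCH_BADNESS_BAD_MODE;
	return effective >= threshold;
}
```

With these assumed values, an inode that collects one normal and one high error has a badness of 3 and stays below the default threshold, while an inode with a bad mode is only offered for clearing if the rest of its score reaches the threshold.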

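The handling of the new -E inode_badness_threshold=<value> option can be sketched the same way. Again this is a standalone illustration under stated assumptions: sketch_parse_threshold() is a hypothetical helper, not a function in e2fsck, and the upper limit of 200 is the one checked in the unix.c hunk. Unlike the patch, the sketch also rejects an empty value and checks errno, which is a stricter choice made here.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Upper limit taken from the unix.c hunk of the patch. */
#define SKETCH_MAX_THRESHOLD 200UL

/* Parse the value part of "inode_badness_threshold=<value>".
 * Returns 0 and stores the result in *out on success, -1 on bad input. */
static int sketch_parse_threshold(const char *arg, unsigned int *out)
{
	char *end;
	unsigned long val;

	assert(out != NULL);
	if (arg == NULL || *arg == '\0')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);	/* base 0: decimal, octal or hex */
	if (errno != 0 || end == arg || *end != '\0' ||
	    val > SKETCH_MAX_THRESHOLD)
		return -1;

	*out = (unsigned int)val;
	return 0;
}
```

Because the base is 0, a value such as 0x10 is accepted and read as 16, while trailing garbage or anything above 200 is rejected.
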
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-12-08 13:25:31

by Andre Noll

Subject: Re: Problems with checking corrupted large ext3 file system

On 20:23, Andre Noll wrote:

> After 5 hours, it printed the "Restarting e2fsck from the
> beginning..." message just like the unpatched version. It's now at 45%
> in the second run with no further messages so far. In particular, there
> are no more "clone multiply-claimed blocks" messages. I'm leaving
> for the weekend now, but I'll send another mail on Monday.

The fsck completed successfully and I've mounted the file system
again. The users are currently investigating the situation, which will
take a while. So far, one user has reported that approximately half of
his files are corrupted. I guess I will have to restore from tapes :-(.

Anyway, many thanks to Andreas and Ted.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe

