2007-03-05 16:36:20

by Daniel Drake

Subject: e2fsck and human intervention

Hi,

I'm working with ext3 partitions in a product environment, where
numerous embedded Linux systems will be shipped to various locations.

In testing we occasionally find that system boot is halted by e2fsck
with an "UNEXPECTED INCONSISTENCY" error message. This is while running
in preen mode.

This usually happens during e2fsck's regular "check every X mounts"
thing, as opposed to immediately after booting up after power loss, so
to begin with it's not immediately obvious why there is a problem.

It's of course understandable and inevitable that power loss will
occasionally cause some file loss or corruption, and that's fine. My
main concern is that fsck is halting the boot process, and in a product
scenario this would require an engineer to perform a service call. If
e2fsck could unconditionally perform a best-effort attempt at solving
the problems, it would be ideal.

Are there any better approaches than something like the following?

1. Run "e2fsck -p /"

2. If bit 3 is set in exit code (i.e. preen functionality detected
unexpected inconsistency) then run "e2fsck -y /"

Is there significant risk of further data loss through using -y than
might be experienced otherwise?
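
The two-step approach above could be scripted roughly as follows. This is
a minimal sketch, not a tested boot script: the function names
has_uncorrected_errors and preen_then_force are made up for illustration,
the device argument is an assumption, and "bit 3" is read here as the
exit-status bit with value 4, which the e2fsck man page documents as
"file system errors left uncorrected".

```shell
# Sketch only: function names are hypothetical; pass the device to check.

# Succeeds if the e2fsck exit status has the value-4 bit set
# ("file system errors left uncorrected").
has_uncorrected_errors() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# Preen first; fall back to answering "yes" to every question only
# if preen mode left errors uncorrected. Returns e2fsck's last status.
preen_then_force() {
    dev=$1
    e2fsck -p "$dev"
    status=$?
    if has_uncorrected_errors "$status"; then
        e2fsck -y "$dev"
        status=$?
    fi
    return "$status"
}
```

It would be called as something like "preen_then_force /dev/hda1", where
the device name is only a placeholder.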

Thanks!
--
Daniel Drake
Brontes Technologies, A 3M Company


2007-03-05 16:48:50

by Theodore Ts'o

Subject: Re: e2fsck and human intervention

On Mon, Mar 05, 2007 at 11:26:57AM -0500, Daniel Drake wrote:
> Hi,
>
> I'm working with ext3 partitions in a product environment, where
> numerous embedded Linux systems will be shipped to various locations.
>
> In testing we occasionally find that system boot is halted by e2fsck
> with an "UNEXPECTED INCONSISTENCY" error message. This is while running
> in preen mode.
>
> This usually happens during e2fsck's regular "check every X mounts"
> thing, as opposed to immediately after booting up after power loss, so
> to begin with it's not immediately obvious why there is a problem.
>
> It's of course understandable and inevitable that power loss will
> occasionally cause some file loss or corruption, and that's fine. My
> main concern is that fsck is halting the boot process, and in a product
> scenario this would require an engineer to perform a service call. If
> e2fsck could unconditionally perform a best-effort attempt at solving
> the problems, it would be ideal.

Actually, power loss by itself should *not* cause any corruption when
you are using ext3; that's the whole point of the journal. If there
is, you probably have some other problem that you might do well to try
to debug before you ship your product, since that may lead to
significant data loss in the long-term.

> Are there any better approaches than something like the following?
>
> 1. Run "e2fsck -p /"
>
> 2. If bit 3 is set in exit code (i.e. preen functionality detected
> unexpected inconsistency) then run "e2fsck -y /"
>
> Is there significant risk of further data loss through using -y than
> might be experienced otherwise?

You could do this, but if you are using ext3, this is really papering
over the problem. With ext3, there really should not be any
corruptions caused by power loss.

What sort of errors are being reported by e2fsck?

- Ted

2007-03-05 17:01:37

by Daniel Drake

Subject: Re: e2fsck and human intervention

Hi Ted,

Thanks for the quick response.

On Mon, 2007-03-05 at 11:48 -0500, Theodore Tso wrote:
> Actually, power loss by itself should *not* cause any corruption when
> you are using ext3; that's the whole point of the journal. If there
> is, you probably have some other problem that you might do well to try
> to debug before you ship your product, since that may lead to
> significant data loss in the long-term.

OK. I do need to do further testing to see how well the system copes in
situations like this; understanding e2fsck's behaviour is just a
starting point.

In the quoted section above, do you regard 'preen-mode not automatically
fixing everything' and 'corruption' as the same? i.e. are you saying
that the "UNEXPECTED INCONSISTENCY" error message should never appear
after an unexpected power loss? Alternatively, in which kinds of
situations would you reasonably expect preen mode to not be able to
automate all the repairs?

I realise that I mentioned corruption in my last mail, but actually I
don't think we have really seen any of that. Instead the biggest issue
right now is e2fsck halting system boot, and I definitely have seen this
happen a few times (even if no damage was observed after re-running fsck
with -y)

Thanks.

--
Daniel Drake
Brontes Technologies, A 3M Company

2007-03-05 17:50:04

by Sev Binello

Subject: Re: e2fsck and human intervention

Theodore Tso wrote:
> On Mon, Mar 05, 2007 at 11:26:57AM -0500, Daniel Drake wrote:
>
>> Hi,
>>
>> I'm working with ext3 partitions in a product environment, where
>> numerous embedded Linux systems will be shipped to various locations.
>>
>> In testing we occasionally find that system boot is halted by e2fsck
>> with an "UNEXPECTED INCONSISTENCY" error message. This is while running
>> in preen mode.
>>
>> This usually happens during e2fsck's regular "check every X mounts"
>> thing, as opposed to immediately after booting up after power loss, so
>> to begin with it's not immediately obvious why there is a problem.
>>
>> It's of course understandable and inevitable that power loss will
>> occasionally cause some file loss or corruption, and that's fine. My
>> main concern is that fsck is halting the boot process, and in a product
>> scenario this would require an engineer to perform a service call. If
>> e2fsck could unconditionally perform a best-effort attempt at solving
>> the problems, it would be ideal.
>>
>
> Actually, power loss by itself should *not* cause any corruption when
> you are using ext3; that's the whole point of the journal. If there
> is, you probably have some other problem that you might do well to try
> to debug before you ship your product, since that may lead to
> significant data loss in the long-term.
>
>

So when and why is an fsck necessary?

>> Are there any better approaches than something like the following?
>>
>> 1. Run "e2fsck -p /"
>>
>> 2. If bit 3 is set in exit code (i.e. preen functionality detected
>> unexpected inconsistency) then run "e2fsck -y /"
>>
>> Is there significant risk of further data loss through using -y than
>> might be experienced otherwise?
>>
>
> You could do this, but if you are using ext3, this is really papering
> over the problem. With ext3, there really should not be any
> corruptions caused by power loss.
>
> What sort of errors are being reported by e2fsck?
>
> - Ted



--

Sev Binello
Brookhaven National Laboratory
Upton, New York
631-344-5647

2007-03-06 02:40:29

by Andreas Dilger

Subject: Re: e2fsck and human intervention

On Mar 05, 2007 11:26 -0500, Daniel Drake wrote:
> This usually happens during e2fsck's regular "check every X mounts"
> thing, as opposed to immediately after booting up after power loss, so
> to begin with it's not immediately obvious why there is a problem.

That's because the post-boot e2fsck will only check the superblock and
replay the journal. It won't do a full check of the filesystem unless
the kernel detected some corruption and marked an error in the superblock.

That is the primary reason why the "check after N mounts/months" code
is in e2fsck, even though it annoys some people.
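
To see how close a filesystem is to its next forced full check, the
superblock counters printed by "tune2fs -l" can be compared. A rough
sketch follows; full_check_due is a hypothetical helper name, and it
looks only at the mount-count trigger, not at the time-based check
interval.

```shell
# Reads `tune2fs -l` output on stdin and prints "due" when the mount
# count has reached a positive maximum, otherwise "not due".
# A maximum of 0 or -1 means the mount-count check is disabled.
full_check_due() {
    awk -F: '
        /^Mount count:/         { mounts = $2 + 0 }
        /^Maximum mount count:/ { max = $2 + 0 }
        END {
            if (max > 0 && mounts >= max) print "due"
            else print "not due"
        }
    '
}
```

Usage would look like "tune2fs -l /dev/hda1 | full_check_due" (the
device name is a placeholder). The maximum itself can be changed with
"tune2fs -c".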

> It's of course understandable and inevitable that power loss will
> occasionally cause some file loss or corruption, and that's fine.

As Ted said, if e2fsck detects anything wrong then this IS corruption
of some kind. It might indicate that your disks are writing with
cache enabled and losing some writes that had been reported to the
kernel as committed to disk.
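
One way to test the write-cache theory is to turn the drive's write
cache off and see whether the inconsistencies stop appearing. The sketch
below assumes an ATA disk with hdparm installed; set_write_cache_off and
the APPLY variable are hypothetical names, and by default the function
only prints the command it would run.

```shell
# Prints the hdparm command by default; set APPLY=1 to actually run it.
# hdparm -W0 turns the drive's write cache off (-W1 turns it back on).
set_write_cache_off() {
    dev=$1
    if [ "${APPLY:-0}" = "1" ]; then
        hdparm -W0 "$dev"
    else
        echo "hdparm -W0 $dev"
    fi
}
```

For example, "APPLY=1 set_write_cache_off /dev/hda" (placeholder device
name). ext3 also has a "barrier=1" mount option intended to keep journal
commits ordered with the cache left on; whether it helps depends on the
drive and the kernel.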

> Are there any better approaches than something like the following?
>
> 1. Run "e2fsck -p /"
>
> 2. If bit 3 is set in exit code (i.e. preen functionality detected
> unexpected inconsistency) then run "e2fsck -y /"

This is no better than just running "e2fsck -y" in the first place,
just twice as slow.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-03-06 14:27:09

by Daniel Drake

Subject: Re: e2fsck and human intervention

On Tue, 2007-03-06 at 10:40 +0800, Andreas Dilger wrote:
> As Ted said, if e2fsck detects anything wrong then this IS corruption
> of some kind. It might indicate that your disks are writing with
> cache enabled and losing some writes that had been reported to the
> kernel as committed to disk.

Entirely possible, I'll look into that. Thanks for the pointer.

> > Are there any better approaches than something like the following?
> >
> > 1. Run "e2fsck -p /"
> >
> > 2. If bit 3 is set in exit code (i.e. preen functionality detected
> > unexpected inconsistency) then run "e2fsck -y /"
>
> This is no better than just running "e2fsck -y" in the first place,
> just twice as slow.

OK. Given that write caching may be required for performance reasons or
there might be other possible reasons which would result in
preen-unrepairable fs corruption on power loss, my question is now: Is
it a really bad idea to run "e2fsck -y" on every boot?

I'm not expecting magic: I realise that in such configurations there is
risk of data loss. However, every time I have seen preen fail so far,
running "e2fsck -y" gets things back into bootable state and I'm simply
wondering how much potential trouble I would be getting myself into by
automating this.
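
For illustration, an automated "e2fsck -y" at boot might still want to
look at the exit status before carrying on. The sketch below maps the
exit codes documented in the e2fsck man page (1 = errors corrected,
2 = errors corrected and the system should be rebooted, 4 = errors left
uncorrected, 8 and above = e2fsck itself failed) to an action. The
function name and the action names are hypothetical.

```shell
# Maps an e2fsck exit status to a boot-time action.
boot_action_for() {
    if [ $(( $1 & 4 )) -ne 0 ] || [ "$1" -ge 8 ]; then
        echo "fallback"    # errors left uncorrected, or e2fsck failed
    elif [ $(( $1 & 2 )) -ne 0 ]; then
        echo "reboot"      # errors corrected, reboot requested
    else
        echo "continue"    # 0 = clean, 1 = errors corrected
    fi
}
```

A boot script could then run "e2fsck -y" on the device, pass "$?" to
boot_action_for, and branch on the printed word. What "fallback" should
do (a recovery image, for instance) is left open here.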

Thanks.
--
Daniel Drake
Brontes Technologies, A 3M Company

2007-03-07 05:19:17

by Andreas Dilger

Subject: Re: e2fsck and human intervention

On Mar 06, 2007 09:27 -0500, Daniel Drake wrote:
> OK. Given that write caching may be required for performance reasons or
> there might be other possible reasons which would result in
> preen-unrepairable fs corruption on power loss, my question is now: Is
> it a really bad idea to run "e2fsck -y" on every boot?

If your primary concern is not halting the boot, then yes, go ahead. 99%
of people only know to answer "y" to e2fsck anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.