2010-10-22 13:33:32

by Bernd Schubert

[permalink] [raw]
Subject: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

Hi all,

Is it really a good idea to allow the filesystem to mount if something like
that comes up? I really would prefer it if mount would abort.

Oct 22 12:37:36 vm7 kernel: [ 1227.814294] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Oct 22 12:37:36 vm7 kernel: [ 1227.814314] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.

(please ignore "ldiskfs"; it was just renamed to that by Lustre, but it is
ext4-based, as in RHEL5.5, so 2.6.32-ish).


I'm testing with a DDN prototype storage system, which still has some issues,
so IO errors come up on certain storage-side operations.

We have a pacemaker resource agent that already refuses to mount if the
superblock has the error flag. But obviously that cannot work if the flag
is only set at mount time. The device had been remounted read-only before,
but somehow no error flag was set in the superblock:

Oct 22 12:10:39 vm7 kernel: [55827.998615] LDISKFS-fs (sfa0074): Remounting filesystem read-only
Oct 22 12:10:39 vm7 kernel: [55827.998619] LustreError: 2569:0:(filter.c:190:filter_finish_transno()) wrote trans 416613470762 for client 9c62eca5-08e4-d60d-9883-2a2140085b6c at #15: err = -30


Other devices that have also been remounted read-only do get the flag,
however. Maybe the kernel couldn't set the flag due to the IO error?


Thanks,
Bernd

--
Bernd Schubert
DataDirect Networks


2010-10-22 17:25:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Fri, Oct 22, 2010 at 03:33:29PM +0200, Bernd Schubert wrote:
>
> Is it really a good idea to allow the filesystem to mount if something like
> that comes up? I really would prefer it if mount would abort.
>
> Oct 22 12:37:36 vm7 kernel: [ 1227.814294] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
> Oct 22 12:37:36 vm7 kernel: [ 1227.814314] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
>
> (please ignore "ldiskfs"; it was just renamed to that by Lustre, but it is
> ext4-based, as in RHEL5.5, so 2.6.32-ish).

Did you try running e2fsck first? If it detects the error after
running the journal, it will run the file system check right then and
there. If it doesn't, it's a bug. If you're not running e2fsck
first, and the filesystem had previously detected inconsistencies, the
long-standing tradition is to allow that, since root should know what
it's doing.

And there are times when you do want to mount a filesystem with known
errors; for example, in the case of the root file system, we have
always allowed a read-only mount to continue, so that we can run
e2fsck without requiring a rescue CD 99% of the time.

- Ted

2010-10-22 17:42:52

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Friday, October 22, 2010, Ted Ts'o wrote:
> On Fri, Oct 22, 2010 at 03:33:29PM +0200, Bernd Schubert wrote:
> > Is it really a good idea to allow the filesystem to mount if something
> > like that comes up? I really would prefer it if mount would abort.
> >
> > Oct 22 12:37:36 vm7 kernel: [ 1227.814294] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
> > Oct 22 12:37:36 vm7 kernel: [ 1227.814314] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
> >
> > (please ignore "ldiskfs"; it was just renamed to that by Lustre, but it is
> > ext4-based, as in RHEL5.5, so 2.6.32-ish).
>
> Did you try running e2fsck first? If it detects the error after
> running the journal, it will run the file system check right then and
> there. If it doesn't, it's a bug. If you're not running e2fsck

I *think* I got those messages at least once even though I ran e2fsck. But I'm
not sure.

> first, and the filesystem had previously detected inconsistencies, the
> long-standing tradition is to allow that, since root should know what
> it's doing.

No, it is far more difficult than that. The devices are managed by pacemaker.
Which means: I/O errors come up -> Lustre complains about that in its proc
file. Pacemaker monitoring fails, so pacemaker stops the device and starts it
again. If that does not succeed, it tries to start it on the fail-over system.
I also cannot tell pacemaker not to try to re-start after an error, as that
would completely defeat an HA solution.

>
> And there are times when you do want to mount a filesystem with known
> errors; for example, in the case of the root file system, we have
> always allowed a read-only mount to continue, so that we can run
> e2fsck without requiring a rescue CD 99% of the time.

Yes, it seems a mount option is missing here.


Thanks,
Bernd


--
Bernd Schubert
DataDirect Networks

2010-10-22 18:32:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Fri, Oct 22, 2010 at 07:42:49PM +0200, Bernd Schubert wrote:
> No, it is far more difficult than that. The devices are managed by
> pacemaker. Which means: I/O errors come up -> Lustre complains
> about that in its proc file. Pacemaker monitoring fails, so
> pacemaker stops the device and starts it again.

I'm not sure what errors you're referring to, but if the errors are
related to file system inconsistencies, by definition umounting and
re-mounting isn't going to fix things, and could result in more
damage. For certain errors, you really do need to run e2fsck before
remounting the device.

Can you not change pacemaker to stop the device, run e2fsck, and then
remount the file system?

It seems like the safer thing to do.

- Ted


2010-10-22 18:54:47

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Friday, October 22, 2010, Ted Ts'o wrote:
> On Fri, Oct 22, 2010 at 07:42:49PM +0200, Bernd Schubert wrote:
> > No, it is far more difficult than that. The devices are managed by
> > pacemaker. Which means: I/O errors come up -> Lustre complains
> > about that in its proc file. Pacemaker monitoring fails, so
> > pacemaker stops the device and starts it again.
>
> I'm not sure what errors you're referring to, but if the errors are

There are multiple ways for Lustre to tell you that there is a problem.
Underlying-filesystem issues are just one of many.

> related to file system inconsistencies, by definition umounting and
> re-mounting isn't going to fix things, and could result in more
> damage. For certain errors, you really do need to run e2fsck before
> remounting the device.

Yes, and that is exactly why I'm asking for another mount option to disallow
mounts when the filesystem knows better.

>
> Can you not change pacemaker to stop the device, run e2fsck, and then
> remount the file system?

I am sure I could spend the next 4 weeks writing code that would allow doing
that with Lustre and pacemaker. But at the same time, it seems far easier
to add another mount flag to ext4...

I also cannot simply set a max_failcount=1 in pacemaker, as that would go
completely against the HA concept. There are so many ways to increase the
failcount, for example Lustre bugs (ext4 unrelated), pacemaker bugs, human
errors (something missing on one node, but available on another), etc. A few
failures (ext4 unrelated) are absolutely 'normal' over a couple of months and
there is no reason not to allow that.

I'm not asking you to implement another feature, but I'm asking if a patch to
add a new option would be accepted. I also cannot promise to implement it
any time soon, given that I will leave DDN at the end of November. But it
seems to be an option useful for everyone, including my desktop. So I will
either do that over the next 4 weeks when I find a minute, or during x-mas or so.

Thanks,
Bernd

--
Bernd Schubert
DataDirect Networks

2010-10-23 16:00:07

by Amir Goldstein

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Fri, Oct 22, 2010 at 7:25 PM, Ted Ts'o <[email protected]> wrote:
> On Fri, Oct 22, 2010 at 03:33:29PM +0200, Bernd Schubert wrote:
>>
>> Is it really a good idea to allow the filesystem to mount if something like
>> that comes up? I really would prefer it if mount would abort.
>>
>> Oct 22 12:37:36 vm7 kernel: [ 1227.814294] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
>> Oct 22 12:37:36 vm7 kernel: [ 1227.814314] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
>>
>> (please ignore "ldiskfs"; it was just renamed to that by Lustre, but it is
>> ext4-based, as in RHEL5.5, so 2.6.32-ish).
>
> Did you try running e2fsck first? If it detects the error after
> running the journal, it will run the file system check right then and
> there. If it doesn't, it's a bug. If you're not running e2fsck
> first, and the filesystem had previously detected inconsistencies, the
> long-standing tradition is to allow that, since root should know what
> it's doing.
>
> And there are times when you do want to mount a filesystem with known
> errors; for example, in the case of the root file system, we have
> always allowed a read-only mount to continue, so that we can run
> e2fsck without requiring a rescue CD 99% of the time.
>


Ted,

IMHO, and I've said it before, the mount flag which Bernd requests
already exists, namely 'errors=', both as a mount option and as a
persistent default, but it is not enforced correctly at mount time.
If an administrator decides that the correct behavior when an error is
detected is abort or remount-ro, what's the sense in letting the
filesystem mount read-write without fixing the problem?
I realize that the umount/mount may have fixed things by "unrolling"
the last transaction, but still, the state of ERROR_FS with a
read-write mount seems to be inconsistent with the defined errors=
behavior. root can always use an errors=continue mount to override
this restriction.

Amir.

2010-10-23 17:46:59

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Saturday, October 23, 2010, Amir Goldstein wrote:
> On Fri, Oct 22, 2010 at 7:25 PM, Ted Ts'o <[email protected]> wrote:
> > On Fri, Oct 22, 2010 at 03:33:29PM +0200, Bernd Schubert wrote:
> >> Is it really a good idea to allow the filesystem to mount if something
> >> like that comes up? I really would prefer it if mount would abort.
> >>
> >> Oct 22 12:37:36 vm7 kernel: [ 1227.814294] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Filesystem error recorded from previous mount: IO failure
> >> Oct 22 12:37:36 vm7 kernel: [ 1227.814314] LDISKFS-fs warning (device sfa0074): ldiskfs_clear_journal_err: Marking fs in need of filesystem check.
> >>
> >> (please ignore "ldiskfs"; it was just renamed to that by Lustre, but it is
> >> ext4-based, as in RHEL5.5, so 2.6.32-ish).
> >
> > Did you try running e2fsck first? If it detects the error after
> > running the journal, it will run the file system check right then and
> > there. If it doesn't, it's a bug. If you're not running e2fsck
> > first, and the filesystem had previously detected inconsistencies, the
> > long-standing tradition is to allow that, since root should know what
> > it's doing.
> >
> > And there are times when you do want to mount a filesystem with known
> > errors; for example, in the case of the root file system, we have
> > always allowed a read-only mount to continue, so that we can run
> > e2fsck without requiring a rescue CD 99% of the time.
>
> Ted,
>
> IMHO, and I've said it before, the mount flag which Bernd requests
> already exists, namely 'errors=', both as a mount option and as a
> persistent default, but it is not enforced correctly at mount time.
> If an administrator decides that the correct behavior when an error is
> detected is abort or remount-ro, what's the sense in letting the
> filesystem mount read-write without fixing the problem?
> I realize that the umount/mount may have fixed things by "unrolling"
> the last transaction, but still, the state of ERROR_FS with a
> read-write mount seems to be inconsistent with the defined errors=
> behavior. root can always use an errors=continue mount to override
> this restriction.

Hmm, yes and no. While mounting it read-only would eventually be detected
later on by Lustre, that would cause a fencing/stonith of the whole node.

I'm really looking for something to abort the mount if an error comes up.
However, I just had an idea for how to do that without an additional mount flag:

Let e2fsck play back the journal only. That way e2fsck could set the error
flag if it detects a problem in the journal, and our pacemaker script would
refuse to mount. That option would also be quite useful for our other scripts,
as we usually first run a read-only fsck, check the log files (presently by
size, as e2fsck always returns an error code even for journal recoveries...)
and only if we don't see serious corruption do we run e2fsck. Otherwise we
sometimes create device or e2image backups first.
Would a patch introducing "-J recover journal only" be accepted?
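
To illustrate, the start path of the agent could then look roughly like this
(just a sketch; the journal-replay-only e2fsck option is the part that does
not exist yet, and the device/mount point names are examples):

    DEV=/dev/sfa0074
    e2fsck -p -J "$DEV"        # hypothetical journal-replay-only option
    if dumpe2fs -h "$DEV" 2>/dev/null | grep -q '^Filesystem state:.*error'; then
        echo "filesystem errors recorded, refusing to mount" >&2
        exit 1                 # leave the full e2fsck to an admin, outside pacemaker
    fi
    mount -t ldiskfs "$DEV" /mnt/ost0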

Another option is to add a proc or sysfs file stating the health of the
filesystem.

Thanks,
Bernd




--
Bernd Schubert
DataDirect Networks

2010-10-23 22:17:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Sat, Oct 23, 2010 at 06:00:05PM +0200, Amir Goldstein wrote:
>
> IMHO, and I've said it before, the mount flag which Bernd requests
> already exists, namely 'errors=', both as a mount option and as a
> persistent default, but it is not enforced correctly at mount time.
> If an administrator decides that the correct behavior when an error is
> detected is abort or remount-ro, what's the sense in letting the
> filesystem mount read-write without fixing the problem?

Again, consider the case of the root filesystem containing an error.
When the error is first discovered during the course of the system's
operation, and it's set to errors=panic, you want to immediately
reboot the system. But then, when the root file system is mounted, it
would be bad to have the system immediately panic again. Instead,
what you want to have happen is to allow e2fsck to run, correct the
file system errors, and then the system can go back to normal operation.

So the current behavior was deliberately designed to be the way that
it is, and the difference is between "what do you do when you come
across a file system error", which is what the errors= mount option is
all about, and "this file system has some kind of error associated
with it". Just because it has an error associated with it does not
mean that immediately rebooting is the right thing to do, even if the
file system is set to "errors=panic". In fact, in the case of a root
file system, it is manifestly the wrong thing to do. If we did what
you suggested, then the system would be trapped in a reboot loop
forever.

- Ted

2010-10-23 22:26:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Sat, Oct 23, 2010 at 07:46:56PM +0200, Bernd Schubert wrote:
> I'm really looking for something to abort the mount if an error comes up.
> However, I just had an idea for how to do that without an additional mount flag:
>
> Let e2fsck play back the journal only. That way e2fsck could set the
> error flag if it detects a problem in the journal, and our pacemaker
> script would refuse to mount. That option would also be quite useful
> for our other scripts, as we usually first run a read-only fsck,
> check the log files (presently by size, as e2fsck always returns an
> error code even for journal recoveries...) and only if we don't see
> serious corruption do we run e2fsck. Otherwise we sometimes create
> device or e2image backups first. Would a patch introducing "-J recover
> journal only" be accepted?

So I'm confused, and partially it's because I don't know the
capabilities of pacemaker.

If you have a pacemaker script, why aren't you willing to just run
e2fsck on the journal and be done with it? Earlier you talked about
"man months of effort" to rewrite pacemaker. Huh? If the file system
is fine, it will recover the journal, and then see that the file
system is clean, and then exit.

As far as the exit codes, it sounds like you haven't read the man
page. The exit codes are documented in both the fsck and e2fsck man
page, and are standardized across all file systems:

0 - No errors
1 - File system errors corrected
2 - System should be rebooted
4 - File system errors left uncorrected
8 - Operational error
16 - Usage or syntax error
32 - Fsck canceled by user request
128 - Shared library error

(These status codes are boolean OR'ed together.)

An exit code with the '1' bit set means that the file system had
some errors, but they have since been fixed. An exit code where the
'2' bit is set will only occur in the case of a mounted read-only file
system, and it instructs the init script to reboot before continuing,
because while the file system may have had errors fixed, there may be
invalid information cached in memory due to the root file system being
mounted; the only safe way to make sure that invalid information
won't be written back to disk is to reboot. If you are not checking
the root filesystem, you will never see the '2' bit set.
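
A script can simply mask out the bits it cares about; a minimal sketch
(the device name is an example):

    e2fsck -p /dev/sdXX
    rc=$?
    if [ $((rc & 4)) -ne 0 ]; then
        echo "file system errors left uncorrected" >&2
    elif [ $((rc & 1)) -ne 0 ]; then
        echo "file system errors were corrected"
    fi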

So if you are looking at the size of the fsck log files, I'm guessing
it's because no one has bothered to read and understand how the exit
codes for fsck work.

And I really don't understand why you need or want to do a read-only
fsck first....

- Ted

2010-10-23 23:56:06

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Sunday, October 24, 2010, Ted Ts'o wrote:
> On Sat, Oct 23, 2010 at 07:46:56PM +0200, Bernd Schubert wrote:
> > I'm really looking for something to abort the mount if an error comes up.
> > However, I just had an idea for how to do that without an additional mount flag:
> >
> > Let e2fsck play back the journal only. That way e2fsck could set the
> > error flag if it detects a problem in the journal, and our pacemaker
> > script would refuse to mount. That option would also be quite useful
> > for our other scripts, as we usually first run a read-only fsck,
> > check the log files (presently by size, as e2fsck always returns an
> > error code even for journal recoveries...) and only if we don't see
> > serious corruption do we run e2fsck. Otherwise we sometimes create
> > device or e2image backups first. Would a patch introducing "-J recover
> > journal only" be accepted?
>
> So I'm confused, and partially it's because I don't know the
> capabilities of pacemaker.
>
> If you have a pacemaker script, why aren't you willing to just run
> e2fsck on the journal and be done with it? Earlier you talked about
> "man months of effort" to rewrite pacemaker. Huh? If the file system

Even if I rewrote it, it wouldn't get accepted. Upstream would just
start to discuss it the other way around...

> is fine, it will recover the journal, and then see that the file
> system is clean, and then exit.

Now please consider what happens if the filesystem is not clean. Resources in
pacemaker have start/stop/monitor timeouts. Default upstream timeouts are
120s. We already increase the start timeout to 600s. MMP timeouts could be
huge in the past, although that is limited now, and journal recovery can also
take quite some time.
Anyway, there is no way to allow such huge timeouts as required by
e2fsck. Sometimes you simply want to try to mount on another node as fast as
possible (consider a driver bug that makes mount go into D-state), and then
10 minutes is already a lot. Setting that to hours, as might be required by
e2fsck, is not an option (yes, I'm aware of uninit_bg, and Lustre sets it of
course).
So if we ran e2fsck from the pacemaker script, it would simply be killed
when the timeout is over. Then it would be started on another node and would
repeat that ping-pong until the maximum restart counter is exceeded.

(And while we are here, I read in the past that you had some concerns about
MMP, but MMP is really a great feature to make doubly sure the HA software does
not try to do a double mount. While pacemaker supports monitoring, compared to
old heartbeat, it still is not perfect. In fact there exists an
unmanaged->managed resource state bug that could easily cause a double
mount.)


>
> As far as the exit codes, it sounds like you haven't read the man
> page. The exit codes are documented in both the fsck and e2fsck man
> page, and are standardized across all file systems:
>
> 0 - No errors
> 1 - File system errors corrected
> 2 - System should be rebooted
> 4 - File system errors left uncorrected
> 8 - Operational error
> 16 - Usage or syntax error
> 32 - Fsck canceled by user request
> 128 - Shared library error
>
> (These status codes are boolean OR'ed together.)
>
> An exit code with the '1' bit set means that the file system had
> some errors, but they have since been fixed. An exit code where the
> '2' bit is set will only occur in the case of a mounted read-only file
> system, and it instructs the init script to reboot before continuing,
> because while the file system may have had errors fixed, there may be
> invalid information cached in memory due to the root file system being
> mounted; the only safe way to make sure that invalid information
> won't be written back to disk is to reboot. If you are not checking
> the root filesystem, you will never see the '2' bit set.
>
> So if you are looking at the size of the fsck log files, I'm guessing
> it's because no one has bothered to read and understand how the exit
> codes for fsck work.

As I said before, journal replay already sets the '1' bit. So how can I
differentiate between the '1' bit from a journal replay and the '1' bit from
pass 1 to pass 5? And no, '2' will never come up for pacemaker-managed
devices, of course.

>
> And I really don't understand why you need or want to do a read-only
> fsck first....

I have seen it more than once that e2fsck causes more damage than there
had been before. The last case was in January, when an e2fsck version from 2008
wiped out a Lustre OST. The customer just ran it without asking anyone, and
that old version then caused lots of trouble.
Before the "e2fsck -y" the filesystem could be mounted read-only and files could
be read, as far as I remember. If you should be interested, the case with
some log files is in the Lustre bugzilla.
And as I said before, if 'e2fsck -n' shows that a huge repair is required,
we double-check what is going on and then also consider creating a
device or at least an e2image backup.
As you might understand, not every customer can afford peta-byte backups, so
they sometimes take the risk of data loss, but of course they also appreciate
any precautions to prevent that.

Please also note that Lustre combines *many* ext3/ext4 filesystems into a
global filesystem, and that high number increases the probability of running
into bugs by orders of magnitude.


Thanks,
Bernd


--
Bernd Schubert
DataDirect Networks

2010-10-24 00:20:48

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Sunday, October 24, 2010, Bernd Schubert wrote:
> On Sunday, October 24, 2010, Ted Ts'o wrote:
> > On Sat, Oct 23, 2010 at 07:46:56PM +0200, Bernd Schubert wrote:
> > > I'm really looking for something to abort the mount if an error comes
> > > up. However, I just had an idea for how to do that without an additional
> > > mount flag:
> > >
> > > Let e2fsck play back the journal only. That way e2fsck could set the
> > > error flag if it detects a problem in the journal, and our pacemaker
> > > script would refuse to mount. That option would also be quite useful
> > > for our other scripts, as we usually first run a read-only fsck,
> > > check the log files (presently by size, as e2fsck always returns an
> > > error code even for journal recoveries...) and only if we don't see
> > > serious corruption do we run e2fsck. Otherwise we sometimes create
> > > device or e2image backups first. Would a patch introducing "-J recover
> > > journal only" be accepted?
> >
> > So I'm confused, and partially it's because I don't know the
> > capabilities of pacemaker.
> >
> > If you have a pacemaker script, why aren't you willing to just run
> > e2fsck on the journal and be done with it? Earlier you talked about
> > "man months of effort" to rewrite pacemaker. Huh? If the file system

Hmm, maybe we have a misunderstanding here. If we could make e2fsck
*only* recover the journal, that would be perfect. Kernel and e2fsck journal
recovery should take approximately the same time. But that option does not
exist yet (well, a half-baked patch is on my disk now). If e2fsck would then
detect, as the kernel does:
"clear_journal_err: Filesystem error recorded from previous mount"
and mark the filesystem with an error, that would be all we need to abort
the mount in the pacemaker script and allow us to run a real e2fsck outside of
pacemaker.


Thanks,
Bernd
--
Bernd Schubert
DataDirect Networks

2010-10-24 01:09:05

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Sun, Oct 24, 2010 at 02:20:45AM +0200, Bernd Schubert wrote:
> Hmm, maybe we have a misunderstanding here. If we could make e2fsck
> *only* recover the journal, that would be perfect. Kernel and
> e2fsck journal recovery should take approximately the same time. But
> that option does not exist yet (well, a half-baked patch is on my
> disk now). If e2fsck would then detect, as the kernel does:
> "clear_journal_err: Filesystem error recorded from previous mount"
> and mark the filesystem with an error, that would be all we need to
> abort the mount in the pacemaker script and allow us to run a
> real e2fsck outside of pacemaker.

What probably makes sense is to have an extended option which causes
e2fsck to just run the journal and then exit. Part of running the
journal should be setting the EXT4_ERROR_FS bit in s_mount_state and
then clearing the journal. That seems to be missing entirely from
e2fsck, which is a bug that we should fix regardless.

As far as detecting whether or not the file system has known errors,
you can do that by using dumpe2fs -h and grepping for "Filesystem
state". That can have the values "clean" or "with errors". (For ext2
file systems, or ext4 file systems without a journal, you can also
have the state "not clean" and "not clean with errors", but if you
have a journal the latter two states shouldn't ever come up.)
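
In a script that check could look something like this (a sketch; the device
name is an example):

    state=$(dumpe2fs -h /dev/sdXX 2>/dev/null | sed -n 's/^Filesystem state: *//p')
    case "$state" in
        clean) ;;   # no recorded errors
        *)     echo "fs state is '$state', run e2fsck first" >&2
               exit 1 ;;
    esac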

That way the logic that you want is something you can build into your
script, and we don't need to embed application specific logic into
e2fsprogs. The ability to just run the journal without doing any
further checking seems like a reasonable thing to add to e2fsck ---
and by using dumpe2fs -h you'll be able to detect all possible file
system errors (not just the ones which are reported via the journal
error system).

Does that sound reasonable to you?

- Ted


2010-10-24 08:50:14

by Amir Goldstein

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On Sun, Oct 24, 2010 at 12:17 AM, Ted Ts'o <[email protected]> wrote:
> On Sat, Oct 23, 2010 at 06:00:05PM +0200, Amir Goldstein wrote:
>>
>> IMHO, and I've said it before, the mount flag which Bernd requests
>> already exists, namely 'errors=', both as a mount option and as a
>> persistent default, but it is not enforced correctly at mount time.
>> If an administrator decides that the correct behavior when an error is
>> detected is abort or remount-ro, what's the sense in letting the
>> filesystem mount read-write without fixing the problem?
>
> Again, consider the case of the root filesystem containing an error.
> When the error is first discovered during the course of the system's
> operation, and it's set to errors=panic, you want to immediately
> reboot the system. But then, when the root file system is mounted, it
> would be bad to have the system immediately panic again. Instead,
> what you want to have happen is to allow e2fsck to run, correct the
> file system errors, and then the system can go back to normal operation.
>
> So the current behavior was deliberately designed to be the way that
> it is, and the difference is between "what do you do when you come
> across a file system error", which is what the errors= mount option is
> all about, and "this file system has some kind of error associated
> with it". Just because it has an error associated with it does not
> mean that immediately rebooting is the right thing to do, even if the
> file system is set to "errors=panic". In fact, in the case of a root
> file system, it is manifestly the wrong thing to do. If we did what
> you suggested, then the system would be trapped in a reboot loop
> forever.
>

Yes, I do realize that to panic on mount would be stupid :-)
This is why I wrote that there is no sense in mounting the file system
read-write.
Let me rephrase the 3 error behaviors as the designer (you?) intended:
errors=continue - "always stay out of my way and let me corrupt my
file system as much as I want".
errors=read-only - "never let me corrupt my file system more than it
already is".
errors=panic - "never let me corrupt my file system ... and never let
me view files which may not be there after I reboot".

If you agree with my interpretations of the errors behavior codes,
then you should agree to enforcing them at mount time:
errors=continue - if ERROR_FS, go ahead and corrupt your file system
errors=read-only - if ERROR_FS, allow only read-only mount
errors=panic - if ERROR_FS, allow only read-only mount (files you see
now are safely stored on disk)

Amir.

2010-10-24 13:54:10

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/23/2010 06:17 PM, Ted Ts'o wrote:
> On Sat, Oct 23, 2010 at 06:00:05PM +0200, Amir Goldstein wrote:
>> IMHO, and I've said it before, the mount flag which Bernd requests
>> already exists, namely 'errors=', both as a mount option and as a
>> persistent default, but it is not enforced correctly at mount time.
>> If an administrator decides that the correct behavior when an error is
>> detected is abort or remount-ro, what's the sense in letting the
>> filesystem mount read-write without fixing the problem?
> Again, consider the case of the root filesystem containing an error.
> When the error is first discovered during the course of the system's
> operation, and it's set to errors=panic, you want to immediately
> reboot the system. But then, when the root file system is mounted, it
> would be bad to have the system immediately panic again. Instead,
> what you want to have happen is to allow e2fsck to run, correct the
> file system errors, and then the system can go back to normal operation.
>
> So the current behavior was deliberately designed to be the way that
> it is, and the difference is between "what do you do when you come
> across a file system error", which is what the errors= mount option is
> all about, and "this file system has some kind of error associated
> with it". Just because it has an error associated with it does not
> mean that immediately rebooting is the right thing to do, even if the
> file system is set to "errors=panic". In fact, in the case of a root
> file system, it is manifestly the wrong thing to do. If we did what
> you suggested, then the system would be trapped in a reboot loop
> forever.
>
> - Ted

I am still fuzzy on the use case here.

In any shared ext* file system (pacemaker or other), you have some basic rules:

* you cannot have the file system mounted on more than one node
* failover must fence out any other nodes before starting recovery
* failover (once the node is assured that it is uniquely mounting the file
system) must do any recovery required to clean up the state

Using ext* (or xfs) in an active/passive cluster with fail over rules that
follow the above is really common today.

I don't see what the use case here is - are we trying to pretend that pacemaker
+ ext* allows us to have a single, shared file system in a cluster mounted on
multiple nodes?

Why not use ocfs2 or gfs2 for that?

Thanks!

Ric


2010-10-24 14:35:59

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 03:55 PM, Ric Wheeler wrote:
> On 10/23/2010 06:17 PM, Ted Ts'o wrote:
>> On Sat, Oct 23, 2010 at 06:00:05PM +0200, Amir Goldstein wrote:
>>> IMHO, and I've said it before, the mount flag which Bernd requests
>>> already exists, namely 'errors=', both as a mount option and as a
>>> persistent default, but it is not enforced correctly at mount time.
>>> If an administrator decides that the correct behavior when an error is
>>> detected is abort or remount-ro, what's the sense in letting the
>>> filesystem mount read-write without fixing the problem?
>> Again, consider the case of the root filesystem containing an error.
>> When the error is first discovered during the course of the system's
>> operation, and it's set to errors=panic, you want to immediately
>> reboot the system. But then, when the root file system is mounted, it
>> would be bad to have the system immediately panic again. Instead,
>> what you want to have happen is to allow e2fsck to run, correct the
>> file system errors, and then the system can go back to normal operation.
>>
>> So the current behavior was deliberately designed to be the way that
>> it is, and the difference is between "what do you do when you come
>> across a file system error", which is what the errors= mount option is
>> all about, and "this file system has some kind of error associated
>> with it". Just because it has an error associated with it does not
>> mean that immediately rebooting is the right thing to do, even if the
>> file system is set to "errors=panic". In fact, in the case of a root
>> file system, it is manifestly the wrong thing to do. If we did what
>> you suggested, then the system would be trapped in a reboot loop
>> forever.
>>
>> - Ted
>
> I am still fuzzy on the use case here.
>
> In any shared ext* file system (pacemaker or other), you have some basic rules:
>
> * you cannot have the file system mounted on more than one node
> * failover must fence out any other nodes before starting recovery
> * failover (once the node is assured that it is uniquely mounting the file
> system) must do any recovery required to clean up the state
>
> Using ext* (or xfs) in an active/passive cluster with fail over rules that
> follow the above is really common today.
>
> I don't see what the use case here is - are we trying to pretend that pacemaker
> + ext* allows us to have a single, shared file system in a cluster mounted on
> multiple nodes?

The use case here is Lustre. I think ClusterFS and then later the Sun
Lustre group (Andreas Dilger, Alex Zhuravlev/Tomas, Girish Shilamkar)
contributed lots of ext3 and ext4 code, as Lustre's underlying disk
format ldiskfs is based on ext3/ext4 (remaining patches, such as MMP, are
supposed to be added to ext4, and others, such as open-by-inode, are
supposed to be given up once the VFS supports open-by-filehandle (or so)).

So Lustre mounts a device to a directory (but hides the content from user
space) and then makes the objects in the filesystem available globally to
many clients. At first glance that is similar to NFS, but Lustre
combines the objects of many ldiskfs filesystems into a single global
filesystem. In order to provide high availability, you need to use
some kind of shared storage device. Internal raid1 is planned, but still
not available; so far only raid0 (striping) is supported.

>
> Why not use ocfs2 or gfs2 for that?

You are welcome to write a Lustre plugin for that :) Although extending
btrfs and using that might be the better choice. Lustre is already going
to support ZFS and will make use of ZFS checksums also for its network
checksums, as far as I know. The same should be feasible with btrfs
checksums.


Cheers,
Bernd



2010-10-24 14:42:30

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 03:08 AM, Ted Ts'o wrote:
> On Sun, Oct 24, 2010 at 02:20:45AM +0200, Bernd Schubert wrote:
>> Hmm, maybe we have a misunderstanding here. If we could make e2fsck
>> *only* recover the journal, that would be perfect. Kernel and
>> e2fsck journal recovery should take approximately the same time. But
>> that option does not exist yet (well, a half-baked patch is on my
>> disk now). If e2fsck would then detect, as the kernel does:
>> "clear_journal_err: Filesystem error recorded from previous mount"
>> and mark the filesystem with an error, that would be all we need to
>> abort the mount in the pacemaker script and allow us to run a
>> real e2fsck outside of pacemaker.
>
> What probably makes sense is to have an extended option which causes
> e2fsck to just run the journal and then exit. Part of running the
> journal should be setting the EXT4_ERROR_FS bit in s_mount_state and
> then clearing the journal. That seems to be missing entirely from
> e2fsck, which is a bug that we should fix regardless.

Adding the journal option is simple; I will provide a patch by Wednesday
or Thursday. I will also check if it sets EXT2_ERROR_FS and, if not, will
try to find some time to add that.

>
> As far as detecting whether or not the file system has known errors,
> you can do that by using dumpe2fs -h and grepping for "Filesystem
> state". That can have the values "clean" or "with errors". (For ext2
> file systems, or ext4 file systems without a journal, you can also
> have the state "not clean" and "not clean with errors", but if you
> have a journal the latter two states shouldn't ever come up.)

I added exactly that to our lustre_server pacemaker agent last week :)
And when I noticed it still mounts filesystems with errors, I started
this thread here.


>
> That way the logic that you want is something you can build into your
> script, and we don't need to embed application specific logic into
> e2fsprogs. The ability to just run the journal without doing any
> further checking seems like a reasonable thing to add to e2fsck ---
> and by using dumpe2fs -h you'll be able to detect all possible file
> system errors (not just the ones which are reported via the journal
> error system).
>
> Does that sound reasonable to you?

Yes, we perfectly agree with each other now :)

Thanks,
Bernd



2010-10-24 15:19:06

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 10:30 AM, Bernd Schubert wrote:
> On 10/24/2010 03:55 PM, Ric Wheeler wrote:
>> On 10/23/2010 06:17 PM, Ted Ts'o wrote:
>>> On Sat, Oct 23, 2010 at 06:00:05PM +0200, Amir Goldstein wrote:
>>>> IMHO, and I've said it before, the mount flag which Bernd requests
>>>> already exists, namely 'errors=', both as a mount option and as a
>>>> persistent default, but it is not enforced correctly at mount time.
>>>> If an administrator decides that the correct behavior when an error is
>>>> detected is abort or remount-ro, what's the sense in letting the
>>>> filesystem mount read-write without fixing the problem?
>>> Again, consider the case of the root filesystem containing an error.
>>> When the error is first discovered during the course of the system's
>>> operation, and it's set to errors=panic, you want to immediately
>>> reboot the system. But then, when the root file system is mounted, it
>>> would be bad to have the system immediately panic again. Instead,
>>> what you want to have happen is to allow e2fsck to run, correct the
>>> file system errors, and then the system can go back to normal operation.
>>>
>>> So the current behavior was deliberately designed to be the way that
>>> it is, and the difference is between "what do you do when you come
>>> across a file system error", which is what the errors= mount option is
>>> all about, and "this file system has some kind of error associated
>>> with it". Just because it has an error associated with it does not
>>> mean that immediately rebooting is the right thing to do, even if the
>>> file system is set to "errors=panic". In fact, in the case of a root
>>> file system, it is manifestly the wrong thing to do. If we did what
>>> you suggested, then the system would be trapped in a reboot loop
>>> forever.
>>>
>>> - Ted
>> I am still fuzzy on the use case here.
>>
>> In any shared ext* file system (pacemaker or other), you have some basic rules:
>>
>> * you cannot have the file system mounted on more than one node
>> * failover must fence out any other nodes before starting recovery
>> * failover (once the node is assured that it is uniquely mounting the file
>> system) must do any recovery required to clean up the state
>>
>> Using ext* (or xfs) in an active/passive cluster with fail over rules that
>> follow the above is really common today.
>>
>> I don't see what the use case here is - are we trying to pretend that pacemaker
>> + ext* allows us to have a single, shared file system in a cluster mounted on
>> multiple nodes?
> The use case here is Lustre. I think ClusterFS and then later the Sun
> Lustre group (Andreas Dilger, Alex Zhuravlev/Tomas, Girish Shilamkar)
> contributed lots of ext3 and ext4 code, as Lustre's underlying disk
> format ldiskfs is based on ext3/ext4 (remaining patches, such as MMP, are
> supposed to be added to ext4, and others, such as open-by-inode, are
> supposed to be given up once the VFS supports open-by-filehandle (or so)).
>
> So Lustre mounts a device to a directory (but hides the content from user
> space) and then makes the objects in the filesystem available globally to
> many clients. At first glance that is similar to NFS, but Lustre
> combines the objects of many ldiskfs filesystems into a single global
> filesystem. In order to provide high availability, you need to use
> some kind of shared storage device. Internal raid1 is planned, but still
> not available; so far only raid0 (striping) is supported.
>
>

This still sounds more like a Lustre issue than an ext4 one, Andreas can fill in
the technical details.

Whatever shared storage sits under ext4 is irrelevant to the fail-over case.

Unless Lustre does other magic, they still need to obey the basic cluster rules
- one mount per cluster.

If Lustre is doing the same trick you would do with active/passive fail-over
clusters that export ext4 via NFS, you would still need to clean up the file
system before being able to re-export it from a fail-over node.

Ric



2010-10-24 15:39:10

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 05:20 PM, Ric Wheeler wrote:
>
> This still sounds more like a Lustre issue than an ext4 one, Andreas can fill in
> the technical details.

The underlying device handling is unrelated to Lustre. In that sense it
is just a local filesystem.

>
> Whatever shared storage sits under ext4 is irrelevant to the fail-over case.
>
> Unless Lustre does other magic, they still need to obey the basic cluster rules
> - one mount per cluster.

Yes, one mount per cluster.

>
> If Lustre is doing the same trick you would do with active/passive fail-over
> clusters that export ext4 via NFS, you would still need to clean up the file
> system before being able to re-export it from a fail-over node.

What exactly is your question here? We use pacemaker/stonith to do the
fencing job.
What exactly do you want to clean up? The device is recovered by
journals, Lustre goes into recovery mode, clients reconnect, locks are
updated, and incomplete transactions are resent.


Cheers,
Bernd



2010-10-24 15:48:26

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 11:39 AM, Bernd Schubert wrote:
> On 10/24/2010 05:20 PM, Ric Wheeler wrote:
>> This still sounds more like a Lustre issue than an ext4 one, Andreas can fill in
>> the technical details.
> The underlying device handling is unrelated to Lustre. In that sense it
> is just a local filesystem.
>
>> Whatever shared storage sits under ext4 is irrelevant to the fail-over case.
>>
>> Unless Lustre does other magic, they still need to obey the basic cluster rules
>> - one mount per cluster.
> Yes, one mount per cluster.
>
>> If Lustre is doing the same trick you would do with active/passive fail-over
>> clusters that export ext4 via NFS, you would still need to clean up the file
>> system before being able to re-export it from a fail-over node.
> What exactly is your question here? We use pacemaker/stonith to do the
> fencing job.
> What exactly do you want to clean up? The device is recovered by
> journals, Lustre goes into recovery mode, clients reconnect, locks are
> updated, and incomplete transactions are resent.
>
>
> Cheers,
> Bernd
>

What I don't get (certainly might just be me) is why this is a unique issue when
used by Lustre. Normally, any similar type of fail-over will clean up the local
file system before trying to re-export from the second node.

Why exactly can't you use the same type of recovery here? Is it the fencing
agent killing nodes on detection of the file system errors?

Ric


2010-10-24 16:17:05

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 05:49 PM, Ric Wheeler wrote:
> On 10/24/2010 11:39 AM, Bernd Schubert wrote:
>> On 10/24/2010 05:20 PM, Ric Wheeler wrote:
>>> This still sounds more like a Lustre issue than an ext4 one, Andreas can fill in
>>> the technical details.
>> The underlying device handling is unrelated to Lustre. In that sense it
>> is just a local filesystem.
>>
>>> Whatever shared storage sits under ext4 is irrelevant to the fail-over case.
>>>
>>> Unless Lustre does other magic, they still need to obey the basic cluster rules
>>> - one mount per cluster.
>> Yes, one mount per cluster.
>>
>>> If Lustre is doing the same trick you would do with active/passive fail-over
>>> clusters that export ext4 via NFS, you would still need to clean up the file
>>> system before being able to re-export it from a fail-over node.
>> What exactly is your question here? We use pacemaker/stonith to do the
>> fencing job.
>> What exactly do you want to clean up? The device is recovered by
>> journals, Lustre goes into recovery mode, clients reconnect, locks are
>> updated, and incomplete transactions are resent.
>>
>>
>> Cheers,
>> Bernd
>>
>
> What I don't get (certainly might just be me) is why this is a unique issue when
> used by Lustre. Normally, any similar type of fail-over will clean up the local
> file system before trying to re-export from the second node.

Of course that is not a Lustre-specific issue, which is why I did not open a
Lustre bugzilla but opened this thread here.

>
> Why exactly can't you use the same type of recovery here? Is it the fencing
> agent killing nodes on detection of the file system errors?

But I'm using the same type of recovery! I just rewrote pacemaker's
default "Filesystem" agent into a lustre_server agent, to include more
Lustre-specific checks. When I then added a check for the dumpe2fs
"Filesystem state" last week, I noticed that sometimes the error state
is only set *after* mounting the filesystem, so it is difficult to script.
And as I also wrote, running e2fsck from that script to do a
complete fs check is not appropriate, as that might simply time out.
Again, not Lustre specific. So after some discussion, the proposed
solution is to add a "journal recovery only" option to e2fsck and to run
that before the mount. I will add that to the 'lustre_server' agent
(which is part of Lustre now), but leave it to someone else to do that for
the 'Filesystem' agent script (I'm not using that script myself and IMHO
it is already too complex, as it tries to support all filesystems -
shell code is not ideal for that anymore).

Really, the only Lustre-specific thing here is the feature of having a proc
file to see if filesystem errors came up on a node. That is a missing feature
in extX and all other Linux filesystems I have worked with. And Lustre
server nodes just mean the usage of dozens to hundreds of
ext3/ext4/ldiskfs devices, so bugs are more likely to be exposed by that high
number.


Cheers,
Bernd



2010-10-24 16:42:40

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/24/2010 12:16 PM, Bernd Schubert wrote:
> On 10/24/2010 05:49 PM, Ric Wheeler wrote:
>> On 10/24/2010 11:39 AM, Bernd Schubert wrote:
>>> On 10/24/2010 05:20 PM, Ric Wheeler wrote:
>>>> This still sounds more like a Lustre issue than an ext4 one, Andreas can fill in
>>>> the technical details.
>>> The underlying device handling is unrelated to Lustre. In that sense it
>>> is just a local filesystem.
>>>
>>>> Whatever shared storage sits under ext4 is irrelevant to the fail-over case.
>>>>
>>>> Unless Lustre does other magic, they still need to obey the basic cluster rules
>>>> - one mount per cluster.
>>> Yes, one mount per cluster.
>>>
>>>> If Lustre is doing the same trick you would do with active/passive fail-over
>>>> clusters that export ext4 via NFS, you would still need to clean up the file
>>>> system before being able to re-export it from a fail-over node.
>>> What exactly is your question here? We use pacemaker/stonith to do the
>>> fencing job.
>>> What exactly do you want to clean up? The device is recovered by
>>> journals, Lustre goes into recovery mode, clients reconnect, locks are
>>> updated, and incomplete transactions are resent.
>>>
>>>
>>> Cheers,
>>> Bernd
>>>
>> What I don't get (certainly might just be me) is why this is a unique issue when
>> used by Lustre. Normally, any similar type of fail-over will clean up the local
>> file system before trying to re-export from the second node.
> Of course that is not a Lustre-specific issue, which is why I did not open a
> Lustre bugzilla but opened this thread here.
>
>> Why exactly can't you use the same type of recovery here? Is it the fencing
>> agent killing nodes on detection of the file system errors?
> But I'm using the same type of recovery! I just rewrote pacemaker's
> default "Filesystem" agent into a lustre_server agent, to include more
> Lustre-specific checks. When I then added a check for the dumpe2fs
> "Filesystem state" last week, I noticed that sometimes the error state
> is only set *after* mounting the filesystem, so it is difficult to script.
> And as I also wrote, running e2fsck from that script to do a
> complete fs check is not appropriate, as that might simply time out.
> Again, not Lustre specific. So after some discussion, the proposed
> solution is to add a "journal recovery only" option to e2fsck and to run
> that before the mount. I will add that to the 'lustre_server' agent
> (which is part of Lustre now), but leave it to someone else to do that for
> the 'Filesystem' agent script (I'm not using that script myself and IMHO
> it is already too complex, as it tries to support all filesystems -
> shell code is not ideal for that anymore).

Why not simply have your script attempt to mount the file system? If it
succeeds, it will replay the journal. If it fails, you will need to fall back to
the long fsck which is unavoidable.
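
Roughly like this sketch (device and mount point are examples):

    # the kernel replays the journal during a normal mount
    if ! mount -t ldiskfs /dev/sdXX /mnt/ost0; then
        # mount refused: fall back to the long e2fsck, then retry
        e2fsck -fy /dev/sdXX && mount -t ldiskfs /dev/sdXX /mnt/ost0
    fi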

We spend a lot of time and testing to make sure that ext* can be shot at any
point and come back after a storage outage and still mount.

Ric

> Really, the only Lustre-specific thing here is the feature of having a proc
> file to see if filesystem errors came up on a node. That is a missing feature
> in extX and all other Linux filesystems I have worked with. And Lustre
> server nodes just mean the usage of dozens to hundreds of
> ext3/ext4/ldiskfs devices, so bugs are more likely to be exposed by that high
> number.
>
>
> Cheers,
> Bernd
>


2010-10-25 10:14:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 2010-10-25, at 00:43, Ric Wheeler wrote:
> On 10/24/2010 12:16 PM, Bernd Schubert wrote:
>>
>> ... sometimes the error state is only set *after* mounting the filesystem,
>> so it is difficult to script. And as I also wrote, running e2fsck from that
>> script to do a complete fs check is not appropriate, as that might
>> simply time out. Again, not Lustre specific. So after some discussion,
>> the proposed solution is to add a "journal recovery only" option to e2fsck
>> and to run that before the mount. I will add that to the 'lustre_server'
>> agent (which is part of Lustre now), but leave it to someone else to do that
>> for the 'Filesystem' agent script (I'm not using that script myself and
>> IMHO it is already too complex, as it tries to support all filesystems -
>> shell code is not ideal for that anymore).
>
> Why not simply have your script attempt to mount the file system? If it succeeds, it will replay the journal. If it fails, you will need to fall back to the long fsck which is unavoidable.

I don't really agree with this. The whole reason for having the error flag in the superblock and ALWAYS running e2fsck at mount time to replay the journal is that e2fsck should be done before mounting the filesystem.

I really dislike the reiserfs/XFS model where a filesystem is mounted and fsck is not run in advance, and then if there is a serious error in the filesystem this needs to be detected by the kernel, the filesystem unmounted, e2fsck started, and the filesystem remounted... That's just backward.

> We spend a lot of time and testing to make sure that ext* can be shot at any point and come back after a storage outage and still mount.

Sure, it can still mount, but the only thing it might be able to do is detect the error and remount the filesystem read-only or panic... That's why e2fsck should ALWAYS be run BEFORE the filesystem is mounted.

Bernd's issue (the part that I agree with) is that the error may only be recorded in the journal, not in the ext3 superblock, and there is no easy way to detect this from userspace. Allowing e2fsck to only replay the journal is useful for this problem. Another similar issue is that if tune2fs is run on an unmounted filesystem that hasn't had a journal replay, then it may modify the superblock, but journal replay will clobber this. There are other similar issues.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.


2010-10-25 11:46:02

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/25/2010 06:14 AM, Andreas Dilger wrote:
> On 2010-10-25, at 00:43, Ric Wheeler wrote:
>> On 10/24/2010 12:16 PM, Bernd Schubert wrote:
>>> ... sometimes the error state is only set *after* mounting the filesystem,
>>> so it is difficult to script. And as I also wrote, running e2fsck from that
>>> script to do a complete fs check is not appropriate, as that might
>>> simply time out. Again, not Lustre specific. So after some discussion,
>>> the proposed solution is to add a "journal recovery only" option to e2fsck
>>> and to run that before the mount. I will add that to the 'lustre_server'
>>> agent (which is part of Lustre now), but leave it to someone else to do that
>>> for the 'Filesystem' agent script (I'm not using that script myself and
>>> IMHO it is already too complex, as it tries to support all filesystems -
>>> shell code is not ideal for that anymore).
>> Why not simply have your script attempt to mount the file system? If it succeeds, it will replay the journal. If it fails, you will need to fall back to the long fsck which is unavoidable.
> I don't really agree with this. The whole reason for having the error flag in the superblock and ALWAYS running e2fsck at mount time to replay the journal is that e2fsck should be done before mounting the filesystem.
>
> I really dislike the reiserfs/XFS model where a filesystem is mounted and fsck is not run in advance, and then if there is a serious error in the filesystem this needs to be detected by the kernel, the filesystem unmounted, e2fsck started, and the filesystem remounted... That's just backward.
>

Even if you disagree with the model, that would seem to solve the issue for
Bernd without having to make a change in the utilities.

Thanks!

Ric



2010-10-25 12:54:49

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/25/2010 07:45 AM, Ric Wheeler wrote:
> On 10/25/2010 06:14 AM, Andreas Dilger wrote:
>> On 2010-10-25, at 00:43, Ric Wheeler wrote:
>>> Why not simply have your script attempt to mount the file system? If it
>>> succeeds, it will replay the journal. If it fails, you will need to fall
>>> back to the long fsck which is unavoidable.
>> I don't really agree with this. The whole reason for having the error flag
>> in the superblock and ALWAYS running e2fsck at mount time to replay the
>> journal is that e2fsck should be done before mounting the filesystem.
>>
>> I really dislike the reiserfs/XFS model where a filesystem is mounted and
>> fsck is not run in advance, and then if there is a serious error in the
>> filesystem this needs to be detected by the kernel, the filesystem unmounted,
>> e2fsck started, and the filesystem remounted... That's just backward.
>>
>
> Even if you disagree with the model, that would seem to solve the issue for
> Bernd without having to make a change in the utilities.
>
> Thanks!
>
> Ric
>

One more thought here is that effectively the xfs model of mount before fsck is
basically just doing the journal replay - if you need to repair the file system,
it will fail to mount. If not, you are done.

For HA fail over, what Bernd is proposing is effectively equivalent:

(1) Replay the journal without doing a full fsck which is the same as the mount
for XFS

(2) See if the journal replay failed (i.e., set the error flag) which is the
same as seeing if the mount succeeded

(3) If error, you need to do a full, time consuming fsck for either

(4) If no error in (2), you need to mount the file system for ext4 (xfs is
already done at this stage)

Aside from putting the journal replay into a magic fsck flag, I really do not
see that you are saving any complexity. In fact, for this case, you add step (4).
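
In shell terms, the XFS side of that comparison would be roughly the
sketch below (device and mount point are made up, and xfs_repair -L,
which zeroes an unreplayable log, is a lossy last resort):

  DEV=/dev/sdX1     # made-up device
  MNT=/mnt/target   # made-up mount point

  # For XFS, mount doubles as the journal replay:
  if mount -t xfs "$DEV" "$MNT"; then
      exit 0                    # replay done, filesystem in service
  fi

  # Mount failed, so fall back to the long repair and retry:
  xfs_repair "$DEV" || xfs_repair -L "$DEV"
  mount -t xfs "$DEV" "$MNT"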

Regards,

Ric


2010-10-25 14:58:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 2010-10-25, at 20:54, Ric Wheeler wrote:
>
> One more thought here is that effectively the xfs model of mount before fsck is basically just doing the journal replay - if you need to repair the file system, it will fail to mount. If not, you are done.

This won't happen with ext3 today - if you mount the filesystem, it will succeed regardless of whether the filesystem is in error. I did like Bernd's suggestion that the "errors=" mount option should be used to catch a filesystem with recorded errors being mounted read-write, but I think that is only a safety measure.
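
The knobs that exist today only control behaviour once an error is hit
while the filesystem is mounted, not whether a filesystem with recorded
errors may be mounted read-write in the first place; roughly (made-up
device):

  # Persistent default, stored in the superblock:
  tune2fs -e remount-ro /dev/sdX1    # or: continue, panic

  # Per-mount override:
  mount -o errors=panic /dev/sdX1 /mnt/target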

> For HA fail over, what Bernd is proposing is effectively equivalent:
>
> (1) Replay the journal without doing a full fsck which is the same as the mount for XFS

Does XFS fail the mount if there was an error from a previous mount on it?

> (2) See if the journal replay failed (i.e., set the error flag) which is the same as seeing if the mount succeeded

I assume you mean for XFS here, since ext3/4 will happily mount the filesystem today without returning an error.

> (3) If error, you need to do a full, time consuming fsck for either
>
> (4) If no error in (2), you need to mount the file system for ext4 (xfs is already done at this stage)
>
> Aside from putting the journal replay into a magic fsck flag, I really do not see that you are saving any complexity. In fact, for this case, you add step (4).

In comparison, the normal ext2/3/4 model is:

1) Run e2fsck against the filesystem before accessing it (without the -f flag that forces a full check). e2fsck will replay the journal, and if there is no error recorded it will only check the superblock validity before exiting. If there is an error, it will run a full e2fsck.

2) mount the filesystem

This is the simplest model, and IMHO the most correct one. Using "mount" as a proxy for "is my filesystem broken" seems unusual to me, and unsafe for most filesystems.

For Bernd, I guess he needs to split step #1 into:

1a) replay the journal so the superblock is up-to-date
1b) check if the filesystem has an error and report it to the HA agent, so that it doesn't have a fit because the mount is taking so long
1c) run the actual e2fsck (which may take a few hours on a 16TB filesystem)
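
As a sketch of what the agent would run (device, mount point, and
notification helper are made up, and the journal-replay-only e2fsck
option is hypothetical - it does not exist yet, which is the point):

  DEV=/dev/sdX1     # made-up device
  MNT=/mnt/target   # made-up mount point

  # 1a) replay the journal only, so the superblock reflects reality
  #     (--replay-only stands in for the proposed, not-yet-existing option)
  e2fsck --replay-only "$DEV"

  # 1b) report the error state before the long check starts
  if dumpe2fs -h "$DEV" 2>/dev/null | grep -q 'with errors'; then
      notify_ha_agent "full fsck needed, this will take a while"  # made-up helper

      # 1c) the actual check, possibly hours on a 16TB filesystem
      e2fsck -y "$DEV"
  fi

  mount -t ext4 "$DEV" "$MNT"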

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.


2010-10-25 19:43:21

by Eric Sandeen

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

Andreas Dilger wrote:
> On 2010-10-25, at 00:43, Ric Wheeler wrote:
>> Why not simply have your script attempt to mount the file system?
>> If it succeeds, it will replay the journal. If it fails, you will
>> need to fall back to the long fsck which is unavoidable.
>
> I don't really agree with this. The whole reason for having the
> error flag in the superblock and ALWAYS running e2fsck at mount time
> to replay the journal is that e2fsck should be done before mounting
> the filesystem.

Wait, why? Why did we run with a journal if an IO error causes us
to require a fsck prior to the next mount?

> I really dislike the reiserfs/XFS model where a filesystem is mounted
> and fsck is not run in advance, and then if there is a serious error
> in the filesystem this needs to be detected by the kernel, the
> filesystem unmounted, e2fsck started, and the filesystem remounted...
> That's just backward.

I must be missing something. We run with a proper, carefully designed
journal on properly configured storage so that the journal + filesystem
is always consistent.

fsck is needed when that carefully configured storage munges something
on disk, or when there's a bug in the code that corrupted the filesystem,
but certainly not just because you happened to unmount a while back and
now wish to remount...

Now, extN has this feature of recording fs errors in the superblock,
but I'm not sure we distinguish between "errors which require a fsck"
and others?

Anyway your characterization of xfs is wrong, IMHO, it's:

Mount (possibly replaying the journal) because all should be well,
we have faith in our hardware and our software.
If during runtime the fs encounters a severe metadata error, it will
shut down, and this is your cue to unmount and run xfs_repair, then
remount. Doesn't seem backwards to me. ;) Requiring that fsck
prior to the first mount makes no sense for a journaling fs.

However, Bernd's issue is probably an issue in general with XFS
as well (which doesn't record error state on-disk) - how to quickly
know whether the filesystem you're about to mount in a cluster has
a -known- integrity issue from a previous mount and really does
require a fsck.

For XFS, you have to have monitored the previous mount, I guess,
and watched for any errors the kernel threw when it encountered them.

For extN we record it in the SB, but that record may only be
in the as-yet-unreplayed journal, where the tools can't see it until
it's replayed by a mount or by a full fsck.
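
Concretely, the check the tools can do today is only this, and it is
exactly what an unreplayed journal defeats (made-up device):

  # Reads the on-disk superblock only; an error recorded solely in the
  # not-yet-replayed journal will not show up here:
  dumpe2fs -h /dev/sdX1 2>/dev/null | grep '^Filesystem state'
  #   "clean"             -> nothing recorded, as far as userspace can tell
  #   "clean with errors" -> error flag set, a full fsck is warranted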

-Eric



2010-10-25 19:49:30

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/25/2010 10:57 AM, Andreas Dilger wrote:
> On 2010-10-25, at 20:54, Ric Wheeler wrote:
>> One more thought here is that effectively the xfs model of mount before fsck is basically just doing the journal replay - if you need to repair the file system, it will fail to mount. If not, you are done.
> This won't happen with ext3 today - if you mount the filesystem, it will succeed regardless of whether the filesystem is in error. I did like Bernd's suggestion that the "errors=" mount option should be used to catch a filesystem with recorded errors being mounted read-write, but I think that is only a safety measure.
>
>> For HA fail over, what Bernd is proposing is effectively equivalent:
>>
>> (1) Replay the journal without doing a full fsck which is the same as the mount for XFS
> Does XFS fail the mount if there was an error from a previous mount on it?
>

It does not have an "in error" state bit, but does have sanity checks at mount time.
>> (2) See if the journal replay failed (i.e., set the error flag) which is the same as seeing if the mount succeeded
> I assume you mean for XFS here, since ext3/4 will happily mount the filesystem today without returning an error.
>

Per an IRC discussion with Eric, xfs will also mount happily after many types of errors.


>> (3) If error, you need to do a full, time consuming fsck for either
>>
>> (4) If no error in (2), you need to mount the file system for ext4 (xfs is already done at this stage)
>>
>> Aside from putting the journal replay into a magic fsck flag, I really do not see that you are saving any complexity. In fact, for this case, you add step (4).
> In comparison, the normal ext2/3/4 model is:
>
> 1) Run e2fsck against the filesystem before accessing it (without the -f flag that forces a full check). e2fsck will replay the journal, and if there is no error recorded it will only check the superblock validity before exiting. If there is an error, it will run a full e2fsck.

One thing that prevents this from being useful in a cluster fail-over context is
that it is really hard to script responses for the full fsck for ext*. Feeding
it a "-y" should work, but it is still a bit scary in practice.

> 2) mount the filesystem
>
> This is the simplest model, and IMHO the most correct one. Using "mount" as a proxy for "is my filesystem broken" seems unusual to me, and unsafe for most filesystems.
>
> For Bernd, I guess he needs to split step #1 into:
>
> 1a) replay the journal so the superblock is up-to-date
> 1b) check if the filesystem has an error and report it to the HA agent, so that it doesn't have a fit because the mount is taking so long
> 1c) run the actual e2fsck (which may take a few hours on a 16TB filesystem)
>

I suppose that makes some sense, but it would seem that you could do (1a) and
(1b) today with the mount & unmount (and then check for file system errors)?

Ric


2010-10-25 20:08:13

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/25/2010 09:49 PM, Ric Wheeler wrote:
> On 10/25/2010 10:57 AM, Andreas Dilger wrote:
>> For Bernd, I guess he needs to split step #1 into:
>>
>> 1a) replay the journal so the superblock is up-to-date
>> 1b) check if the filesystem has an error and report it to the HA agent, so that it doesn't have a fit because the mount is taking so long
>> 1c) run the actual e2fsck (which may take a few hours on a 16TB filesystem)
>>
>
> I suppose that makes some sense, but it would seem that you could do (1a) and
> (1b) today with the mount & unmount (and then check for file system errors)?

Hmm yes, mount + umount to replay the journal should work. The
disadvantage is that the kernel might run into a NULL pointer or panic
if something was totally messed up, while e2fsck would 'only' segfault.
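
For completeness, that workaround is only this (made-up device and
mount point):

  # Replay the journal via a throwaway mount, then inspect the superblock:
  mount -t ext4 /dev/sdX1 /mnt/tmp && umount /mnt/tmp
  dumpe2fs -h /dev/sdX1 2>/dev/null | grep '^Filesystem state'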

Cheers,
Bernd





2010-10-25 20:10:58

by Ric Wheeler

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/25/2010 04:08 PM, Bernd Schubert wrote:
> On 10/25/2010 09:49 PM, Ric Wheeler wrote:
>> I suppose that makes some sense, but it would seem that you could do (1a) and
>> (1b) today with the mount & unmount (and then check for file system errors)?
> Hmm yes, mount + umount to replay the journal should work. The
> disadvantage is that the kernel might run into a NULL pointer or panic
> if something was totally messed up, while e2fsck would 'only' segfault.
>
> Cheers,
> Bernd
>
>

This is roughly what we do for active/passive fail over.

The thread has been a good source for rethinking how to improve this
fairly common use case (both for ext* and xfs)....

thanks!

ric


2010-10-25 20:37:23

by Bernd Schubert

[permalink] [raw]
Subject: Re: ext4_clear_journal_err: Filesystem error recorded from previous mount: IO failure

On 10/25/2010 09:43 PM, Eric Sandeen wrote:
>
> Now, extN has this feature of recording fs errors in the superblock,
> but I'm not sure we distinguish between "errors which require a fsck"
> and others?

That is definitely a good question - is it right to set a generic error
flag if 'only' I/O errors came up? The problem is that the error flag
is set from ext4_error() and ext4_abort(), which are called all over the
code and which do not distinguish between a mere IO error and a real
filesystem issue.

>
> Anyway your characterization of xfs is wrong, IMHO, it's:
>
> Mount (possibly replaying the journal) because all should be well,
> we have faith in our hardware and our software.
> If during runtime the fs encounters a severe metadata error, it will
> shut down, and this is your cue to unmount and run xfs_repair, then
> remount. Doesn't seem backwards to me. ;) Requiring that fsck
> prior to the first mount makes no sense for a journaling fs.
>
> However, Bernd's issue is probably an issue in general with XFS
> as well (which doesn't record error state on-disk) - how to quickly
> know whether the filesystem you're about to mount in a cluster has
> a -known- integrity issue from a previous mount and really does
> require a fsck.
>
> For XFS, you have to have monitored the previous mount, I guess,
> and watched for any errors the kernel threw when it encountered them.


It really would be helpful if filesystems provided a health file as
Lustre does. A generic VFS proc/sys file or ioctl would give userspace
a uniform interface. I probably should write a patch for it ;)
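
For reference, the Lustre precedent is just a one-line read, and a
generic analogue could look much the same (the second path is purely
hypothetical):

  # Lustre's existing per-node health file:
  cat /proc/fs/lustre/health_check   # prints "healthy" or "NOT HEALTHY"

  # A generic VFS analogue might look like this - it does not exist today:
  cat /sys/fs/ext4/sdX1/health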

Cheers,
Bernd

