2008-10-15 00:41:40

by Simon Kirby

Subject: EXT3 way too happy with write errors

Hello!

While attempting to track down a failed write error at the device layer,
I noticed that EXT3 seems to behave strangely after a single block I/O
failure.

I would expect that upon the first failed request, it would abort the
journal and remount-ro (if errors=remount-ro is specified). Instead, it
seems to happily plonk along until I inject a few more failures (testing
with the fault injection framework), until it eventually fails enough to
abort the journal. However, by then, "fsck" will show corruption --
sometimes severe. If I force only one or two write failures and
then unmount, I can reproduce consistency corruption that shows up
with "fsck -f" even though the file system is not marked "errors"!

Why is this?

Example:

Oct 9 19:57:31 nas02 kernel: kjournald starting. Commit interval 5 seconds
Oct 9 19:57:31 nas02 kernel: EXT3 FS on etherd/e3.0p1, internal journal
Oct 9 19:57:31 nas02 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 9 20:00:18 nas02 kernel: FAULT_INJECTION: forcing a failure
Oct 9 20:00:18 nas02 kernel: Buffer I/O error on device etherd/e3.0p1, logical block 5186046
Oct 9 20:00:18 nas02 kernel: lost page write due to I/O error on etherd/e3.0p1
Oct 9 20:00:37 nas02 kernel: FAULT_INJECTION: forcing a failure
Oct 9 20:00:37 nas02 kernel: Buffer I/O error on device etherd/e3.0p1, logical block 410322
Oct 9 20:00:37 nas02 kernel: lost page write due to I/O error on etherd/e3.0p1
Oct 9 20:00:40 nas02 kernel: FAULT_INJECTION: forcing a failure
Oct 9 20:00:40 nas02 kernel: EXT3-fs error (device etherd/e3.0p1): read_block_bitmap: Cannot read block bitmap - block_group = 18, block_bitmap = 589824
Oct 9 20:00:40 nas02 kernel: Aborting journal on device etherd/e3.0p1.
Oct 9 20:00:40 nas02 kernel: FAULT_INJECTION: forcing a failure
Oct 9 20:00:40 nas02 kernel: Buffer I/O error on device etherd/e3.0p1, logical block 1545
Oct 9 20:00:40 nas02 kernel: lost page write due to I/O error on etherd/e3.0p1
Oct 9 20:00:40 nas02 kernel: Remounting filesystem read-only

[sroot@nas02:/]# fsck -C /mnt/web00
fsck 1.40-WIP (14-Nov-2006)
e2fsck 1.40-WIP (14-Nov-2006)
/dev/etherd/e3.0p1: recovering journal
/dev/etherd/e3.0p1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 49153, i_blocks is 2942528, should be 2942520. Fix<y>?
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/etherd/e3.0p1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/etherd/e3.0p1: 126254/24690688 files (0.1% non-contiguous), 1778971/49359704 blocks

Shouldn't it be the case that the first request failure should
remount-ro? Assuming the fault merely denied a single read or write
request, it should then be possible to reboot or remount,rw after the
fault is fixed and have consistency after just a journal replay...
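
The "errors" marking referred to above lives in the superblock's s_state field. The following is only a minimal sketch, assuming the standard ext2/ext3 on-disk layout (primary superblock at byte 1024, with s_magic, s_state and s_errors as little-endian 16-bit fields at offsets 56, 58 and 60). The function name read_state is hypothetical and it works on a raw image prefix rather than on a live device:

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock starts 1024 bytes into the device
EXT_MAGIC = 0xEF53         # s_magic value shared by ext2/ext3
VALID_FS = 0x0001          # s_state bit: cleanly unmounted
ERROR_FS = 0x0002          # s_state bit: errors detected

def read_state(image: bytes) -> dict:
    """Decode s_magic, s_state and s_errors from a raw ext2/ext3 image prefix."""
    # The three little-endian u16 fields sit at bytes 56..61 of the superblock.
    magic, state, errors = struct.unpack_from("<HHH", image, SUPERBLOCK_OFFSET + 56)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    return {
        "clean": bool(state & VALID_FS),
        "errors_flagged": bool(state & ERROR_FS),
        # s_errors records the configured behaviour (errors=... mount default)
        "on_error": {1: "continue", 2: "remount-ro", 3: "panic"}.get(errors, "unknown"),
    }
```

In practice one would feed it the first 2048 bytes of the unmounted device (which needs root), or simply look at the "Filesystem state" line printed by dumpe2fs -h.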

Cheers,

Simon-


2008-12-18 17:07:15

by Jan Kara

Subject: Re: EXT3 way too happy with write errors

Hello,

This was quite a long time ago but it seems nobody replied yet :).

> While attempting to track down a failed write error at the device layer,
> I noticed that EXT3 seems to behave strangely after a single block I/O
> failure.
>
> I would expect that upon the first failed request, it would abort the
> journal and remount-ro (if errors=remount-ro is specified). Instead, it
> seems to happily plonk along until I inject a few more failures (testing
> with the fault injection framework), until it eventually fails enough to
> abort the journal. However, by then, "fsck" will show corruption --
> sometimes severe. If I force only one or two write failures and
> then unmount, I can reproduce consistency corruption that shows up
> with "fsck -f" even though the file system is not marked "errors"!
>
> Why is this?
What kernel version is this? Originally, we aborted a journal only if
we spotted a write error in filesystem metadata. If we spotted an error
in data, we just complained but continued. This seems to be exactly the
thing you are hitting. Latest Linus's tree (i.e. 2.6.28-rc5 or so) should
have the patches that allow tuning the behavior in data=ordered mode - i.e.
you can tell the filesystem with the data_err=abort or data_err=ignore mount
option whether it should abort the filesystem or ignore a write error in fs data.
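
As a rough illustration of how that option would be passed, here is a small sketch that only builds a mount command line and does not run it. The helper name ordered_mount_argv is hypothetical. The device and mount point are the ones from the report, and it assumes a kernel that actually carries the data_err patches:

```python
def ordered_mount_argv(device, mountpoint, abort_on_data_error=True):
    """Build (but do not run) a mount command for ext3 in data=ordered mode."""
    data_err = "abort" if abort_on_data_error else "ignore"
    options = ["data=ordered", f"data_err={data_err}", "errors=remount-ro"]
    return ["mount", "-t", "ext3", "-o", ",".join(options), device, mountpoint]

# Print the command line for the device from the original report.
print(" ".join(ordered_mount_argv("/dev/etherd/e3.0p1", "/mnt/web00")))
```

The same comma-separated option string could go into the fourth field of an /etc/fstab entry instead.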

> Example:
>
> Oct 9 19:57:31 nas02 kernel: kjournald starting. Commit interval 5 seconds
> Oct 9 19:57:31 nas02 kernel: EXT3 FS on etherd/e3.0p1, internal journal
> Oct 9 19:57:31 nas02 kernel: EXT3-fs: mounted filesystem with ordered data mode.
> Oct 9 20:00:18 nas02 kernel: FAULT_INJECTION: forcing a failure
> Oct 9 20:00:18 nas02 kernel: Buffer I/O error on device etherd/e3.0p1, logical block 5186046
> Oct 9 20:00:18 nas02 kernel: lost page write due to I/O error on etherd/e3.0p1
> Oct 9 20:00:37 nas02 kernel: FAULT_INJECTION: forcing a failure
> Oct 9 20:00:37 nas02 kernel: Buffer I/O error on device etherd/e3.0p1, logical block 410322
> Oct 9 20:00:37 nas02 kernel: lost page write due to I/O error on etherd/e3.0p1
> Oct 9 20:00:40 nas02 kernel: FAULT_INJECTION: forcing a failure
> Oct 9 20:00:40 nas02 kernel: EXT3-fs error (device etherd/e3.0p1): read_block_bitmap: Cannot read block bitmap - block_group = 18, block_bitmap = 589824
> Oct 9 20:00:40 nas02 kernel: Aborting journal on device etherd/e3.0p1.
> Oct 9 20:00:40 nas02 kernel: FAULT_INJECTION: forcing a failure
> Oct 9 20:00:40 nas02 kernel: Buffer I/O error on device etherd/e3.0p1, logical block 1545
> Oct 9 20:00:40 nas02 kernel: lost page write due to I/O error on etherd/e3.0p1
> Oct 9 20:00:40 nas02 kernel: Remounting filesystem read-only
>
> [sroot@nas02:/]# fsck -C /mnt/web00
> fsck 1.40-WIP (14-Nov-2006)
> e2fsck 1.40-WIP (14-Nov-2006)
> /dev/etherd/e3.0p1: recovering journal
> /dev/etherd/e3.0p1 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Inode 49153, i_blocks is 2942528, should be 2942520. Fix<y>?
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
>
> /dev/etherd/e3.0p1: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/etherd/e3.0p1: 126254/24690688 files (0.1% non-contiguous), 1778971/49359704 blocks
>
> Shouldn't it be the case that the first request failure should
> remount-ro? Assuming the fault merely denied a single read or write
> request, it should then be possible to reboot or remount,rw after the
> fault is fixed and have consistency after just a journal replay...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2008-12-18 17:28:02

by Jan Kara

Subject: Re: EXT3 way too happy with write errors

On Thu 18-12-08 09:18:25, Simon Kirby wrote:
> > we spotted a write error in filesystem metadata. If we spotted an error
> > in data, we just complained but continued. This seems to be exactly the
> > thing you are hitting. Latest Linus's tree (i.e. 2.6.28-rc5 or so) should
> > have the patches that allow tuning the behavior in data=ordered mode - i.e.
> > you can tell the filesystem with the data_err=abort or data_err=ignore mount
> > option whether it should abort the filesystem or ignore a write error in fs data.
>
> Cool, but one question.. Can you think of a case where anyone would ever
> want data_err=ignore?
>
> Should this really be a knob?
Originally, we changed the behavior unconditionally but then someone came
up with some reasonable argument why it should be tunable. I don't remember
it exactly, sorry :).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-12-18 17:38:23

by Simon Kirby

Subject: Re: EXT3 way too happy with write errors

On Thu, Dec 18, 2008 at 06:07:14PM +0100, Jan Kara wrote:

> Hello,
>
> This was quite a long time ago but it seems nobody replied yet :).

Thanks :)

> What kernel version is this? Originally, we aborted a journal only if

At the time, this was 2.6.26.5.

> we spotted a write error in filesystem metadata. If we spotted an error
> in data, we just complained but continued. This seems to be exactly the
> thing you are hitting. Latest Linus's tree (i.e. 2.6.28-rc5 or so) should
> have the patches that allow tuning the behavior in data=ordered mode - i.e.
> you can tell the filesystem with the data_err=abort or data_err=ignore mount
> option whether it should abort the filesystem or ignore a write error in fs data.

Cool, but one question.. Can you think of a case where anyone would ever
want data_err=ignore?

Should this really be a knob?

Simon-

2008-12-18 17:49:25

by Simon Kirby

Subject: Re: EXT3 way too happy with write errors

[ You guys were on this original thread.. ]

Re: http://markmail.org/message/jcku5vo5grcjjd3s#query:+page:1+mid:ws2wkcj66ucozlnd+state:results

Maybe you could explain why on earth you would want this configurable?

I think it's a horrible idea to make the default to ignore write errors,
and still a bad idea to even make this an option. Do people really want
data corruption and a log message rather than a clean way to recover
from such an error, depending on the cause of it?

Aborting on data write error: User can fix why it can't write (maybe the
bus just went to lunch), remount-rw or reboot and the journal will replay
and the file system will be consistent, data and metadata, just as if the
power had failed.

Not aborting on data write error: User loses data. File system gets very
confused.

What am I missing?

Simon-

On Thu, Dec 18, 2008 at 06:27:59PM +0100, Jan Kara wrote:

> On Thu 18-12-08 09:18:25, Simon Kirby wrote:
>
> > Cool, but one question.. Can you think of a case where anyone would ever
> > want data_err=ignore?
> >
> > Should this really be a knob?
>
> Originally, we changed the behavior unconditionally but then someone came
> up with some reasonable argument why it should be tunable. I don't remember
> it exactly, sorry :).
>
> Honza

2008-12-18 18:29:13

by Michael Rubin

Subject: Re: EXT3 way too happy with write errors

On Thu, Dec 18, 2008 at 9:49 AM, Simon Kirby <[email protected]> wrote:
> Not aborting on data write error: User loses data. File system gets very
> confused.
>
> What am I missing?

I can think of certain situations when companies may care about
getting most of the data to disk and cleaning it up later.
Datacenters may be replicating the data to many spindles and may
sometimes care about throughput as much as possible. So lossy data
could be preferred to complete data.

Not saying this is always preferred but I can see a use case.

mrubin

2009-01-03 02:15:22

by Simon Kirby

Subject: Re: EXT3 way too happy with write errors

> On Thu, Dec 18, 2008 at 9:49 AM, Simon Kirby <[email protected]> wrote:
> > Not aborting on data write error: User loses data. File system gets very
> > confused.
> >
> > What am I missing?
>
> I can think of certain situations when companies may care about
> getting most of the data to disk and cleaning it up later.
> Datacenters may be replicating the data to many spindles and may
> sometimes care about throughput as much as possible. So lossy data
> could be preferred to complete data.
>
> Not saying this is always preferred but I can see a use case.

Ok, fine, in this case they might know what they are doing. Still, this
is not reason enough to default the case in point... ?

:)

Simon-

2009-01-03 02:45:48

by Eric Sandeen

Subject: Re: EXT3 way too happy with write errors

Simon Kirby wrote:
>> On Thu, Dec 18, 2008 at 9:49 AM, Simon Kirby <[email protected]> wrote:
>>> Not aborting on data write error: User loses data. File system gets very
>>> confused.

A *data* write error should not confuse the *filesystem* - it'll just be
a corrupt file (assuming it was just an EIO / write failure and not some
misdirected IO).

>>> What am I missing?
>> I can think of certain situations when companies may care about
>> getting most of the data to disk and cleaning it up later.
>> Datacenters may be replicating the data to many spindles and may
>> sometimes care about throughput as much as possible. So lossy data
>> could be preferred to complete data.
>>
>> Not saying this is always preferred but I can see a use case.
>
> Ok, fine, in this case they might know what they are doing. Still, this
> is not reason enough to default the case in point... ?
>
> :)

So one thing I have not seen clearly stated:

When you got the initial write error that bothers you, was that for data
or metadata?

For a metadata write it should certainly not be ignored (other than for
crazy people who run with errors=ignore) because this implies that the
filesystem is no longer consistent.

But for a data write error there is some grey area. If your application
cares about data integrity then it'd be doing direct IO or syncing data
and checking for errors; if it's doing buffered writes and carrying on
blindly assuming that everything is sweetness and light, well, that's
the application's choice. But assuming the entire filesystem should
implode on one file's data write failure is probably not the best plan.
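
To make the "syncing data and checking for errors" case concrete, here is a minimal sketch of what such an application might do. The helper name durable_write is hypothetical, and it assumes the failure is reported to the caller as an error from write() or fsync() (an EIO, say), which Python surfaces as OSError:

```python
import os

def durable_write(path: str, payload: bytes) -> None:
    """Write payload and raise OSError if the kernel reports a write failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)   # may be a short write, so loop
            view = view[written:]
        # Force write-back now; a device-level error for this file's data
        # is reported here instead of being lost in a later background flush.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller that wraps durable_write() in try/except OSError can retry, write elsewhere, or report the failure. A plain buffered write without the fsync() would normally return success long before the block ever reaches the disk.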

FWIW, part of the reason for the defaults as they are, IIRC, is to keep
the current/historical behavior, but with an option to be more strict
for those who wish it. As you do. :)

-Eric (coming off a long vacation and hoping he's remembering this all
correctly) :)