2008-06-23 11:25:31

by Hidehiro Kawai

[permalink] [raw]
Subject: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

A transient I/O error can corrupt inode data. Here is the scenario:

(1) update inode_A at the block_B
(2) pdflush writes out new inode_A to the filesystem, but it results
in write I/O error, at this point, BH_Uptodate flag of the buffer
for block_B is cleared and BH_Write_EIO is set
(3) create new inode_C which located at block_B, and
__ext3_get_inode_loc() tries to read on-disk block_B because the
buffer is not uptodate
(4) if it can read on-disk block_B successfully, inode_A is
overwritten by old data

This patch makes __ext3_get_inode_loc() not read the inode block if
the buffer has BH_Write_EIO flag. In this case, the buffer should
have the latest information, so setting the uptodate flag to the
buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().)

According to this change, we would need to test BH_Write_EIO flag for
the error checking. Currently nobody checks write I/O errors on
metadata buffers, but it will be done in other patches I'm working on.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/ext3/inode.c | 10 ++++++++++
1 file changed, 10 insertions(+)

Index: linux-2.6.26-rc5-mm3/fs/ext3/inode.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/fs/ext3/inode.c
+++ linux-2.6.26-rc5-mm3/fs/ext3/inode.c
@@ -2516,6 +2516,16 @@ static int __ext3_get_inode_loc(struct i
}
if (!buffer_uptodate(bh)) {
lock_buffer(bh);
+
+ /*
+ * If the buffer has the write error flag, we have failed
+ * to write out another inode in the same block. In this
+ * case, we don't have to read the block because we may
+ * read the old inode data successfully.
+ */
+ if (buffer_write_io_error(bh) && !buffer_uptodate(bh))
+ set_buffer_uptodate(bh);
+
if (buffer_uptodate(bh)) {
/* someone brought it uptodate while we waited */
unlock_buffer(bh);





2008-06-23 11:46:27

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

On Monday 23 June 2008 21:25, Hidehiro Kawai wrote:
> A transient I/O error can corrupt inode data. Here is the scenario:
>
> (1) update inode_A at the block_B
> (2) pdflush writes out new inode_A to the filesystem, but it results
> in write I/O error, at this point, BH_Uptodate flag of the buffer
> for block_B is cleared and BH_Write_EIO is set
> (3) create new inode_C which located at block_B, and
> __ext3_get_inode_loc() tries to read on-disk block_B because the
> buffer is not uptodate
> (4) if it can read on-disk block_B successfully, inode_A is
> overwritten by old data
>
> This patch makes __ext3_get_inode_loc() not read the inode block if
> the buffer has BH_Write_EIO flag. In this case, the buffer should
> have the latest information, so setting the uptodate flag to the
> buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().)
>
> According to this change, we would need to test BH_Write_EIO flag for
> the error checking. Currently nobody checks write I/O errors on
> metadata buffers, but it will be done in other patches I'm working on.

IMO there is a problem with all the buffer head and pagecache error
handling in that uptodate gets cleared on write errors. This is not
only wrong (because the in-memory copy continues to contain the most
uptodate copy), but it will trigger assertions all over the mm and
buffer code AFAIKS.

I don't know why it was done like this, or if anybody actually tested
any of it, but AFAIKS the best way to fix this is to simply not
clear any uptodate bits upon write errors.

2008-06-23 12:31:18

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

On Mon 23-06-08 21:46:27, Nick Piggin wrote:
> On Monday 23 June 2008 21:25, Hidehiro Kawai wrote:
> > A transient I/O error can corrupt inode data. Here is the scenario:
> >
> > (1) update inode_A at the block_B
> > (2) pdflush writes out new inode_A to the filesystem, but it results
> > in write I/O error, at this point, BH_Uptodate flag of the buffer
> > for block_B is cleared and BH_Write_EIO is set
> > (3) create new inode_C which located at block_B, and
> > __ext3_get_inode_loc() tries to read on-disk block_B because the
> > buffer is not uptodate
> > (4) if it can read on-disk block_B successfully, inode_A is
> > overwritten by old data
> >
> > This patch makes __ext3_get_inode_loc() not read the inode block if
> > the buffer has BH_Write_EIO flag. In this case, the buffer should
> > have the latest information, so setting the uptodate flag to the
> > buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().)
> >
> > According to this change, we would need to test BH_Write_EIO flag for
> > the error checking. Currently nobody checks write I/O errors on
> > metadata buffers, but it will be done in other patches I'm working on.
>
> IMO there is a problem with all the buffer head and pagecache error
> handling in that uptodate gets cleared on write errors. This is not
> only wrong (because the in-memory copy continues to contain the most
> uptodate copy), but it will trigger assertions all over the mm and
> buffer code AFAIKS.
>
> I don't know why it was done like this, or if anybody actually tested
> any of it, but AFAIKS the best way to fix this is to simply not
> clear any uptodate bits upon write errors.
That would be non-trivial effort because there are lots of places which
do things like:
wait_on_buffer(bh);
if (!buffer_uptodate)
/* IO error handling */

But what you say sounds like a reasonable thing from a logical
perspective.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-23 12:36:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

On Monday 23 June 2008 22:31, Jan Kara wrote:
> On Mon 23-06-08 21:46:27, Nick Piggin wrote:

> > I don't know why it was done like this, or if anybody actually tested
> > any of it, but AFAIKS the best way to fix this is to simply not
> > clear any uptodate bits upon write errors.
>
> That would be non-trivial effort because there are lots of places which
> do things like:
> wait_on_buffer(bh);
> if (!buffer_uptodate)
> /* IO error handling */
>
> But what you say sounds like a reasonable thing from a logical
> perspective.

For reads, that's obviously a common pattern, although even that's
broken in some cases where it is used. But definitely uptodate should
not be set on a read error (although does it need to be explicitly
cleared? I would hope we don't submit a read anyway if the page/buffer
is already uptodate).

But you're right, even changing this for writes would not be a trivial
effort.

2008-06-24 02:18:01

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

On Mon, 23 Jun 2008 21:46:27 +1000 Nick Piggin <[email protected]> wrote:

> On Monday 23 June 2008 21:25, Hidehiro Kawai wrote:
> > A transient I/O error can corrupt inode data. Here is the scenario:
> >
> > (1) update inode_A at the block_B
> > (2) pdflush writes out new inode_A to the filesystem, but it results
> > in write I/O error, at this point, BH_Uptodate flag of the buffer
> > for block_B is cleared and BH_Write_EIO is set
> > (3) create new inode_C which located at block_B, and
> > __ext3_get_inode_loc() tries to read on-disk block_B because the
> > buffer is not uptodate
> > (4) if it can read on-disk block_B successfully, inode_A is
> > overwritten by old data
> >
> > This patch makes __ext3_get_inode_loc() not read the inode block if
> > the buffer has BH_Write_EIO flag. In this case, the buffer should
> > have the latest information, so setting the uptodate flag to the
> > buffer (this avoids WARN_ON_ONCE() in mark_buffer_dirty().)
> >
> > According to this change, we would need to test BH_Write_EIO flag for
> > the error checking. Currently nobody checks write I/O errors on
> > metadata buffers, but it will be done in other patches I'm working on.
>
> IMO there is a problem with all the buffer head and pagecache error
> handling in that uptodate gets cleared on write errors. This is not
> only wrong (because the in-memory copy continues to contain the most
> uptodate copy), but it will trigger assertions all over the mm and
> buffer code AFAIKS.
>
> I don't know why it was done like this, or if anybody actually tested
> any of it, but AFAIKS the best way to fix this is to simply not
> clear any uptodate bits upon write errors.

There's a plausible-sounding reason for this behaviour which I forgot
about three years ago. Maybe Linus remembers?

Meanwhile I guess I'll queue the patch, sad though it is.

2008-06-24 03:02:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error



On Mon, 23 Jun 2008, Andrew Morton wrote:
> > I don't know why it was done like this, or if anybody actually tested
> > any of it, but AFAIKS the best way to fix this is to simply not
> > clear any uptodate bits upon write errors.
>
> There's a plausible-sounding reason for this behaviour which I forgot
> about three years ago. Maybe Linus remembers?

We have to drop the data at _some_ point. Maybe some errors are transient,
but a whole lot aren't. Jank out your USB memory stick, and those writes
will continue fail. So you can't just keep things dirty - and that also
implies that the buffer sure as heck isn't up-to-date either.

Yes, we could haev a "retry once or twice", but quite frankly, that has
always been left to the low-level driver. By the time the buffer cache or
page cache sees the error, it should be considered more than "transient",
and the data in memory is simply not _useful_ any more.

So clearing the uptodate bit seems to be the logical thing to do. But on
the other hand, it's probably not helping much either, so I don't
personally care if we keep it "uptodate" - as long as the dirty bit
doesn't get set, and as long as there is *some* way to get rid of the bad
buffer later.

Linus

2008-06-24 03:17:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

On Tuesday 24 June 2008 13:01, Linus Torvalds wrote:
> On Mon, 23 Jun 2008, Andrew Morton wrote:
> > > I don't know why it was done like this, or if anybody actually tested
> > > any of it, but AFAIKS the best way to fix this is to simply not
> > > clear any uptodate bits upon write errors.
> >
> > There's a plausible-sounding reason for this behaviour which I forgot
> > about three years ago. Maybe Linus remembers?
>
> We have to drop the data at _some_ point. Maybe some errors are transient,
> but a whole lot aren't. Jank out your USB memory stick, and those writes
> will continue fail. So you can't just keep things dirty - and that also
> implies that the buffer sure as heck isn't up-to-date either.

Depends what semantics you want, I guess. I have the newest copy of the
data... yes you do unless you want to try to say that a write error to
the media somehow invalidates that fact.


> Yes, we could haev a "retry once or twice", but quite frankly, that has
> always been left to the low-level driver. By the time the buffer cache or
> page cache sees the error, it should be considered more than "transient",
> and the data in memory is simply not _useful_ any more.

It could be useful. For example if you have it mmapped (or even just
reading it back from pagecache) and working on it, then you really may
not want to lose all your program just because of the write error.

Keeping it around longer may allow you eg. to "save as something else".

Yes, we have to discard the page at some point, but I don't know if
this is the right place. Maybe a sysctl thing?


> So clearing the uptodate bit seems to be the logical thing to do. But on
> the other hand, it's probably not helping much either, so I don't
> personally care if we keep it "uptodate" - as long as the dirty bit
> doesn't get set, and as long as there is *some* way to get rid of the bad
> buffer later.

What you want to do is not insane, but the way it is currently being
done is. As I said, just clearing the uptodate bit might blow up your
kernel pretty quickly from assertions in the vm. It should be going
through the whole truncate or invalidate page machinery in order to
do that.

2008-06-24 03:42:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error



On Tue, 24 Jun 2008, Nick Piggin wrote:
>
> What you want to do is not insane, but the way it is currently being
> done is. As I said, just clearing the uptodate bit might blow up your
> kernel pretty quickly from assertions in the vm. It should be going
> through the whole truncate or invalidate page machinery in order to
> do that.

Fair enough.

I would not mind, for example, leaving the uptodate bit, but removing it
from the radix tree or something like that (ie turning it into an
anonymous page for a page-cache page, just removing it from the
hash-queues for a buffer_head).

Of course, that could cause other problems (eg any VM assertions that
shared mappings only contain non-anon pages).

Linus

2008-06-24 13:04:48

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [RFC][PATCH] ext3: don't read inode block if the buffer has a write error

Hi all,

Thank you for precious comments.

Linus Torvalds wrote:

>
> On Tue, 24 Jun 2008, Nick Piggin wrote:
>
>>What you want to do is not insane, but the way it is currently being
>>done is. As I said, just clearing the uptodate bit might blow up your
>>kernel pretty quickly from assertions in the vm. It should be going
>>through the whole truncate or invalidate page machinery in order to
>>do that.
>
> Fair enough.
>
> I would not mind, for example, leaving the uptodate bit, but removing it
> from the radix tree or something like that (ie turning it into an
> anonymous page for a page-cache page, just removing it from the
> hash-queues for a buffer_head).

If we move page caches with errors to another radix tree instead of
just removing, we may be able to do special handlings: rewrite once,
or check the page caches and get rid of them from user space, and so on.

> Of course, that could cause other problems (eg any VM assertions that
> shared mappings only contain non-anon pages).

As Jan and Nick stated, this seems to need a great effort, but I think
it is worthwhile to do.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center