2012-01-05 14:40:04

by Jan Kara

Subject: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error


Hello,

for mostly historical reasons we still clear the buffer_uptodate flag on write
IO errors. This is not only semantically wrong (the buffer still has correct data
in it) but it also triggers warnings in mark_buffer_dirty when the buffer is
written again, which scares lots of users (as they tend to unplug USB sticks
too early). This patch series is an attempt to start solving the problem
by allowing a filesystem to tell whether it wishes the buffer_uptodate flag cleared
on write IO error. Ext2 is converted to use this as an example. Eventually,
when all filesystems are converted, we can remove the condition.

Do people think this is a desirable way to go?

Honza


2012-01-05 14:40:06

by Jan Kara

Subject: [PATCH 2/3] fs: Do not clear uptodate flag on write IO error

It is semantically wrong to clear the uptodate flag of buffers for which a write
has failed, since the data in the buffer is still correct despite the write failing.
This also causes spurious warnings when an IO error happens, we decide to write
the buffer again, and mark_buffer_dirty() warns us about a non-uptodate
buffer.

Sadly, historically the buffer_uptodate() check has been the way we detected IO
errors, and quite a lot of code still uses this check. Eventually we'd like to
convert all users to the buffer_write_io_error() test instead, but that requires
an audit of all filesystems. So for now just provide an opt-in flag in
file_system_type by which a filesystem can say it detects write IO errors by
checking buffer_write_io_error(), and thus we don't have to clear the uptodate flag.

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 8 ++++++--
include/linux/fs.h | 3 +++
2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3469d53..96cbb4c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -156,7 +156,9 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
bdevname(bh->b_bdev, b));
}
set_buffer_write_io_error(bh);
- clear_buffer_uptodate(bh);
+ if (!(bh->b_bdev->bd_super->s_type->fs_flags &
+ FS_HANDLE_WRITE_ERROR))
+ clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
put_bh(bh);
@@ -389,7 +391,9 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
}
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
- clear_buffer_uptodate(bh);
+ if (!(bh->b_bdev->bd_super->s_type->fs_flags &
+ FS_HANDLE_WRITE_ERROR))
+ clear_buffer_uptodate(bh);
SetPageError(page);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0bc4ff..f69e5f3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -181,6 +181,9 @@ struct inodes_stat_t {
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
* during rename() internally.
*/
+#define FS_HANDLE_WRITE_ERROR 65536 /* Don't clear uptodate flag on buffers
+ * when write IO error happens
+ */

/*
* These are the fs-independent mount-flags: up to 32 flags are supported
--
1.7.1

2012-01-05 14:40:05

by Jan Kara

Subject: [PATCH 1/3] fs: Convert checks for write IO errors from !buffer_uptodate to buffer_write_io_error

Convert the few remaining checks for write IO errors in the VFS from the
!buffer_uptodate test to buffer_write_io_error.

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..3469d53 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -550,7 +550,7 @@ repeat:
get_bh(bh);
spin_unlock(lock);
wait_on_buffer(bh);
- if (!buffer_uptodate(bh))
+ if (buffer_write_io_error(bh))
err = -EIO;
brelse(bh);
spin_lock(lock);
@@ -810,7 +810,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
}
spin_unlock(lock);
wait_on_buffer(bh);
- if (!buffer_uptodate(bh))
+ if (buffer_write_io_error(bh))
err = -EIO;
brelse(bh);
spin_lock(lock);
@@ -3039,7 +3039,7 @@ int __sync_dirty_buffer(struct buffer_head *bh, int rw)
bh->b_end_io = end_buffer_write_sync;
ret = submit_bh(rw, bh);
wait_on_buffer(bh);
- if (!ret && !buffer_uptodate(bh))
+ if (!ret && buffer_write_io_error(bh))
ret = -EIO;
} else {
unlock_buffer(bh);
--
1.7.1

2012-01-05 14:40:07

by Jan Kara

Subject: [PATCH 3/3] ext2: Replace tests of write IO errors using buffer_uptodate

Replace the old tests for whether an IO error happened during a write, which
used the buffer_uptodate flag, with checks of buffer_write_io_error. This allows
us to mark the filesystem as not needing the buffer_uptodate flag cleared on
write IO error, which stops the bogus warnings from mark_buffer_dirty after IO
errors and allows us to remove code that tried to avoid these warnings in the
common case.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/inode.c | 5 ++---
fs/ext2/super.c | 23 +----------------------
fs/ext2/xattr.c | 5 ++---
3 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 91a6945..dc70d6b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1506,11 +1506,10 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
raw_inode->i_block[n] = ei->i_data[n];
mark_buffer_dirty(bh);
if (do_sync) {
- sync_dirty_buffer(bh);
- if (buffer_req(bh) && !buffer_uptodate(bh)) {
+ err = sync_dirty_buffer(bh);
+ if (err) {
printk ("IO error syncing ext2 inode [%s:%08lx]\n",
sb->s_id, (unsigned long) ino);
- err = -EIO;
}
}
ei->i_state &= ~EXT2_STATE_NEW;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index bd8ac16..1e775e5 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1129,30 +1129,9 @@ failed_unlock:
return ret;
}

-static void ext2_clear_super_error(struct super_block *sb)
-{
- struct buffer_head *sbh = EXT2_SB(sb)->s_sbh;
-
- if (buffer_write_io_error(sbh)) {
- /*
- * Oh, dear. A previous attempt to write the
- * superblock failed. This could happen because the
- * USB device was yanked out. Or it could happen to
- * be a transient write error and maybe the block will
- * be remapped. Nothing we can do but to retry the
- * write and hope for the best.
- */
- ext2_msg(sb, KERN_ERR,
- "previous I/O error to superblock detected\n");
- clear_buffer_write_io_error(sbh);
- set_buffer_uptodate(sbh);
- }
-}
-
static void ext2_sync_super(struct super_block *sb, struct ext2_super_block *es,
int wait)
{
- ext2_clear_super_error(sb);
spin_lock(&EXT2_SB(sb)->s_lock);
es->s_free_blocks_count = cpu_to_le32(ext2_count_free_blocks(sb));
es->s_free_inodes_count = cpu_to_le32(ext2_count_free_inodes(sb));
@@ -1492,7 +1471,7 @@ static struct file_system_type ext2_fs_type = {
.name = "ext2",
.mount = ext2_mount,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_HANDLE_WRITE_ERROR,
};

static int __init init_ext2_fs(void)
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index d27b71f..d6e7a7a 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -680,9 +680,8 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
}
mark_buffer_dirty(new_bh);
if (IS_SYNC(inode)) {
- sync_dirty_buffer(new_bh);
- error = -EIO;
- if (buffer_req(new_bh) && !buffer_uptodate(new_bh))
+ error = sync_dirty_buffer(new_bh);
+ if (error)
goto cleanup;
}
}
--
1.7.1

2012-01-05 22:16:06

by Andrew Morton

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Thu, 5 Jan 2012 15:40:04 +0100
Jan Kara <[email protected]> wrote:

>
> Hello,
>
> for mostly historical reasons we still clear buffer_uptodate flag on write
> IO errors.

I have long and strong memories of never being able to understand why
we did that.

> This is not only semantically wrong (buffer still has correct data
> in it) but it also triggers warnings in mark_buffer_dirty when buffer is
> written again which scares lots of users (as they tend to unplug USB sticks
> too early). This patch series is an attempt to start solving the problem
> by allowing filesystem to tell whether it wishes buffer_uptodate flag cleared
> on write IO error. Ext2 is converted to use this as an example. Eventually
> when all filesystems are converted, we can remove the condition.
>
> Do people think this is a desirable way to go?
>

Linus might have some thoughts.

2012-01-15 02:19:25

by Linus Torvalds

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Thu, Jan 5, 2012 at 6:40 AM, Jan Kara <[email protected]> wrote:
>
> for mostly historical reasons we still clear buffer_uptodate flag on write
> IO errors. This is not only semantically wrong (buffer still has correct data
> in it)

That statement is so nonsensical that I can't get past it.

When you understand why it is nonsensical, you understand why the bit
is cleared.

Feel free to bring this up again without that idiotic statement as an argument.

Linus

2012-01-16 16:01:36

by Jan Kara

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Sat 14-01-12 18:19:02, Linus Torvalds wrote:
> On Thu, Jan 5, 2012 at 6:40 AM, Jan Kara <[email protected]> wrote:
> >
> > for mostly historical reasons we still clear buffer_uptodate flag on write
> > IO errors. This is not only semantically wrong (buffer still has correct data
> > in it)
>
> That statement is so nonsensical that I can't get past it.
>
> When you understand why it is nonsensical, you understand why the bit
> is cleared.
Hum, let me understand this. I understand the meaning of the buffer_uptodate
bit as "the buffer has at least as new content as what is on disk". Now
when the storage cannot write the block under the buffer, the contents of the
buffer are still "at least as new as what is (was) on disk". Therefore I
made the above statement about clearing the buffer_uptodate bit being wrong. But
apparently you have a different definition of buffer_uptodate, or I was not
clear enough in explaining what I mean... Which is the case?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-01-16 18:55:55

by Linus Torvalds

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 16, 2012 at 8:01 AM, Jan Kara <[email protected]> wrote:
>
> Hum, let me understand this. I understand the meaning of buffer_uptodate
> bit as "the buffer has at least as new content as what is on disk". Now
> when storage cannot write the block under the buffer, the contents of the
> buffer is still "at least as new as what is (was) on disk".

No.

Stop making crap up.

If the write fails, the buffer contents have *nothing* to do with what
is on disk.

You don't know what the disk contents are.

So clearly the buffer cannot be up-to-date.

Now, feel free to use *other* arguments for why we shouldn't clear the
up-to-date bit, but using the disk contents as one is pure and utter
garbage. And it is *obviously* pure and utter garbage.

Linus

2012-01-16 19:07:04

by Linus Torvalds

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 16, 2012 at 10:55 AM, Linus Torvalds
<[email protected]> wrote:
>
> If the write fails, the buffer contents have *nothing* to do with what
> is on disk.

Another way of thinking of it: if the write fails, you really have two choices:

- retry the write until it doesn't fail. In this case, the buffer is
always "up-to-date" in the sense that it is what we *want* to be on
disk, and what we try to make sure really *is* on disk.

This is the "good" case, but we can't really do it, because if we do
and the disk has had a hard-failure, we'll just fill up memory with
dirty data that we cannot do anything about.

- just admit that the buffer we have has nothing what-so-ever to do
with what the disk contents are. Any claim about the disk buffer
having any relationship to the disk is clearly bogus.

One reason to clear the up-to-date flag is simply to find out what the
f*&^ we actually have on disk, rather than have to wait for the next
reboot or whatever. Maybe the disk contents ended up ok'ish, but we
got an error for some random reason. Clearing the bit and re-reading
means that we can at least figure it out.

Another is that if we don't clear it, it *will* get cleared eventually
anyway, since the buffer will be free'd (which semantically is the
same thing as clearing the up-to-date bit, in that any future access
will have to read it from disk).

So stop trying to claim that the buffer actually somehow is
"up-to-date". It damn well isn't. If it's not marked dirty, and it
doesn't match the disk contents, then it sure as hell is not
"up-to-date", since dropping the buffer would result in something
*different* being read back in.

Now, you can use *other* arguments for not clearing the up-to-date
bit. For example, if the up-to-date bit being cleared results in worse
problems than some random warning, there's an implementation reason
not to clear it. Or if you can argue that instead of clearing the
up-to-date bit we instead flush the buffer aggressively and try to
invalidate it, I would certainly agree that that is conceptually
equally correct as clearing it.

But just leaving it alone, and thinking that it's all good - that's
just ugly and hiding the issue. The buffer is clearly *not* all good.

Linus

2012-01-17 00:36:13

by Dave Chinner

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 16, 2012 at 10:55:55AM -0800, Linus Torvalds wrote:
> On Mon, Jan 16, 2012 at 8:01 AM, Jan Kara <[email protected]> wrote:
> >
> > Hum, let me understand this. I understand the meaning of buffer_uptodate
> > bit as "the buffer has at least as new content as what is on disk". Now
> > when storage cannot write the block under the buffer, the contents of the
> > buffer is still "at least as new as what is (was) on disk".
>
> No.
>
> Stop making crap up.

Jan is right, Linus. His definition of what up-to-date means for
dirty buffers is correct, especially in the case of write errors.

> If the write fails, the buffer contents have *nothing* to do with what
> is on disk.

The dirty buffer contains what is *supposed* to be on disk. If we
fail to write it, we corrupt some application's data.

> You don't know what the disk contents are.

But *we don't care* what is on disk after a write error because
there is no guarantee that after a write error we can even read the
previous data that was on disk. IOWs, the contents of the region on
disk where the write failed are -undefined- and cannot be trusted.

> So clearly the buffer cannot be up-to-date.

What we have in memory is what is *supposed* to be on disk, and the
error is telling us that the disk is failing to be made up-to-date.
IOWs, the disk is stale after a write error, not what is in memory.
So clearly the buffer contains the up-to-date version of the data
after a write error.

How the filesystem handles that error is now up to the filesystem.
For example, the filesystem can choose to allocate new blocks for the
failed write, write the valid, up-to-date in-memory data to a
different location, and continue onwards without errors. From this
example, it's pretty obvious that the data in memory is the
data that we need to care about after a write error, not what
is on disk.

> Now, feel free to use *other* arguments for why we shouldn't clear the
> up-to-date bit, but using the disk contents as one is pure and utter
> garbage. And it is *obviously* pure and utter garbage.

For the read case you are correct, but that logic (that the disk
version is always correct) does not apply to handling write errors.
It's an important distinction....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-17 01:00:04

by Linus Torvalds

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 16, 2012 at 4:36 PM, Dave Chinner <[email protected]> wrote:
>
> Jan is right, Linus. His definition of what up-to-date means for
> dirty buffers is correct, especially in the case of write errors.

It's not a dirty buffer any more.

Go look. We've long since cleared the dirty bit.

So stop spouting garbage.

My argument is simple: the contents ARE NOT CORRECT ENOUGH to be
called "up-to-date and clean".

And I outlined the two choices:

- mark it dirty and continue trying to write it out forever

- invalidate it.

Anything else is crazy talk. And marking it dirty forever isn't really
an option. So..

Linus

2012-01-17 10:53:36

by Boaz Harrosh

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On 01/17/2012 02:59 AM, Linus Torvalds wrote:
> On Mon, Jan 16, 2012 at 4:36 PM, Dave Chinner <[email protected]> wrote:
>>
>> Jan is right, Linus. His definition of what up-to-date means for
>> dirty buffers is correct, especially in the case of write errors.
>
> It's not a dirty buffer any more.
>
> Go look. We've long since cleared the dirty bit.
>
> So stop spouting garbage.
>
> My argument is simple: the contents ARE NOT CORRECT ENOUGH to be
> called "up-to-date and clean".
>
> And I outlined the two choices:
>
> - mark it dirty and continue trying to write it out forever
>
> - invalidate it.
>
> Anything else is crazy talk. And marking it dirty forever isn't really
> an option. So..
>
> Linus

I think this conversation is a hint that the page_cache-page
state machine is clear as mud. And I thought it was only me. For years
I've wanted to catch some VFS guru to sit down and finally explain to me all
the stages and how they are expressed in page-flag bits.

Back to the conversation. The way I understood it (which is probably wrong):
1. The application dirties a page; it is in a *dirty* state.
2. Write-out begins; the page goes into that in-write-out state. (Am I correct?)

Now the page comes back from write-out with an error. As Linus stated, we cannot
put it back into the *dirty* state because it will probably never clear.
(We did a bunch of retries at the block level.) And we surely can't keep it
in-write-out. But I think we should surely *not* put it in the *not-clean* state,
because that one implies reading, and the worst we can do is read that page as
it is now.

Therefore I agree with Jan that the best option is to use that extra error bit
to indicate an *error state*, which is up to the FS to handle.
If it was a read error - error-is-set, clean-is-cleared.
If it was a write error - error-is-set, clean-is-set.

All the rest of the kernel should treat these as being in an error state,
and I really like Jan's patch of inspecting the error-bit and not the
not-clean bit in write-out, which is darn confusing. (Regardless of the meaning
of the clean-bit.)

Now the filesystem needs to do something about these pages, like put them in a
journal, shove them into a recovery work queue, or whatever. All the VFS/MM can
do is, like Linus said, wait until they are plainly removed, which is effectively
like invalidating them. (In case the FS did nothing to fix it.)

I wish there was some heavy logging when the VFS/MM trashes error-set but
clean-set pages (write errors), even a write-out of these buffers to some global
journal, from which tools can extract and amend later. (Like the USB stick
snatched too soon example.)

So I see Linus's point that "we can't go back to any of the old states", but
let's not overload the clean-bit; use the proper error-bit like Jan suggested.

My $0.017
Boaz

2012-01-23 03:04:27

by Dave Chinner

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

[ LCA delayed responding to this... ]

On Mon, Jan 16, 2012 at 04:59:41PM -0800, Linus Torvalds wrote:
> On Mon, Jan 16, 2012 at 4:36 PM, Dave Chinner <[email protected]> wrote:
> >
> > Jan is right, Linus. His definition of what up-to-date means for
> > dirty buffers is correct, especially in the case of write errors.
>
> It's not a dirty buffer any more.

Yes it is. The write has not completed, so by definition the buffer
is not clean.

> Go look. We've long since cleared the dirty bit.

Sure, but the buffer contents are dirty until the IO completes
successfully and what is on disk matches the contents of the buffer
in memory. It doesn't magically become clean when we clear the dirty
bit. We only clear the dirty bit before submitting the IO to stop
multiple callers from trying to submit it for write at the same
time. IOWs, the buffer dirty bit doesn't really track the dirty
state of the buffer correctly.

> So stop spouting garbage.
>
> My argument is simple: the contents ARE NOT CORRECT ENOUGH to be
> called "up-to-date and clean".

I didn't say it was clean - I said a buffer that failed a write is
not invalid but is still up-to-date, and the error handling should
treat it that way. I thought it was obvious that this meant we have
to redirty the buffer at the same time we mark it with an IO error
so that its state is correct....

> And I outlined the two choices:
>
> - mark it dirty and continue trying to write it out forever
>
> - invalidate it.
>
> Anything else is crazy talk.

I can only assume that you didn't read what I said about how
different filesystems can (and do) handle write errors differently.
Indeed, even within a filesystem there can be different error
handling methods for different types of write IO errors (e.g.
transient vs unrecoverable). Hence there are any number of valid
error handling strategies that can be added to the above list. One
size does not fit all...

> And marking it dirty forever isn't really
> an option. So..

I guess you don't realise that Linux already has a filesystem that
uses this technique. It's called XFS. ;)

FYI, XFS has used the redirtying method to retry failed delayed
write buffer IO since day zero (i.e. 1993). EFS (XFS's predecessor
on Irix) was doing this long before XFS came along so this technique
for handling certain types of transient write IO errors has been
used in production filesystems for somewhere around 25 years.

The thing is, transient write errors tend to be isolated and go away
when a retry occurs (think of IO timeouts when multipath failover
occurs). When non-isolated IO or unrecoverable problems occur (e.g.
no paths left to fail over onto), critical other metadata reads and
writes will fail and shut down the filesystem, thereby terminating
the "try forever" background writeback loop those delayed write
buffers may be in. So the truth is that "trying forever" on write
errors can handle a whole class of write IO errors very
effectively....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-23 23:42:38

by Theodore Ts'o

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 23, 2012 at 02:04:22PM +1100, Dave Chinner wrote:
>
> Sure, but the buffer contents are dirty until the IO completes
> successfully and what is on disk matches the contents of the buffer
> in memory. It doesn't magically become clean when we clear the dirty
> bit. We only clear the dirty bit before submitting the IO to stop
> multiple callers from trying to submit it for write at the same
> time. IOWs, the buffer dirty bit doesn't really track the dirty
> state of the buffer correctly.

Doesn't BH_Lock prevent multiple callers from submitting it for write
at the same time? If memory serves, one of the reasons why we cleared
the dirty bit before submitting the write was because we allowed
writers to dirty the buffer_head while the write was "in flight". Of
course, this becomes problematic if we're trying to support DIF/DIX.

What if we simply disallow BH_Dirty from being set (and disallow
modification of the buffer) while the buffer is locked? Then the
dirty bit would indeed track the state of the buffer correctly.

> I can only assume that you didn't read what I said about how
> different filesystems can (and do) handle write errors differently.
> Indeed, even within a filesystem there can be different error
> handling methods for different types of write IO errors (e.g.
> transient vs unrecoverable). Hence there are any number of valid
> error handling strategies that can be added to the above list. One
> size does not fit all...

That's another problem, which is that we need more context than just
!uptodate. We need to know what sort of write I/O errors occurred, so
we can determine whether it's likely to be transient or permanent.

> The thing is, transient write errors tend to be isolated and go away
> when a retry occurs (think of IO timeouts when multipath failover
> occurs). When non-isolated IO or unrecoverable problems occur (e.g.
> no paths left to fail over onto), critical other metadata reads and
> writes will fail and shut down the filesystem, thereby terminating
> the "try forever" background writeback loop those delayed write
> buffers may be in. So the truth is that "trying forever" on write
> errors can handle a whole class of write IO errors very
> effectively....

So how does XFS decide whether a write should fail and shutdown the
file system, or just "try forever"?

- Ted

2012-01-23 23:49:39

by Linus Torvalds

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 23, 2012 at 1:47 PM, Ted Ts'o <[email protected]> wrote:
>
> So how does XFS decide whether a write should fail and shutdown the
> file system, or just "try forever"?

Why would it bother? XFS tends to be a filesystem that you'd only use
for core files in environments where you have a system manager that
knows what he is doing. So there, maybe "try forever" is the right
thing to do.

Things are a bit different with some random unreliable USB stick FAT32
filesystem that just died on you, with a normal user that just removes
the thing or doesn't even notice that the stick is now dead. There the
"try forever" is totally the wrong thing to do.

Linus

2012-01-24 00:37:01

by Dave Chinner

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 23, 2012 at 04:47:09PM -0500, Ted Ts'o wrote:
> > The thing is, transient write errors tend to be isolated and go away
> > when a retry occurs (think of IO timeouts when multipath failover
> > occurs). When non-isolated IO or unrecoverable problems occur (e.g.
> > no paths left to fail over onto), critical other metadata reads and
> > writes will fail and shut down the filesystem, thereby terminating
> > the "try forever" background writeback loop those delayed write
> > buffers may be in. So the truth is that "trying forever" on write
> > errors can handle a whole class of write IO errors very
> > effectively....
>
> So how does XFS decide whether a write should fail and shutdown the
> file system, or just "try forever"?

The IO dispatcher decides that. If the dispatcher has handed the IO
off to the delayed write queue, then failed writes will be tried
again. If the caller is catching the IO completion (e.g. sync
writes) or attaching a completion callback (journal IO), then the
completion context will handle the error appropriately. Journal IO
errors tend to shutdown the filesystem on the first error, other
contexts may handle the error, retry or shutdown the filesystem
depending on their current state when the error occurs.

Reads are even more complex, because the dispatch context can be
within a transaction, and the correct error handling is then
dependent on the current state of the transaction....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-24 06:12:16

by Dave Chinner

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 23, 2012 at 03:49:39PM -0800, Linus Torvalds wrote:
> On Mon, Jan 23, 2012 at 1:47 PM, Ted Ts'o <[email protected]> wrote:
> >
> > So how does XFS decide whether a write should fail and shutdown the
> > file system, or just "try forever"?
>
> Why would it bother? XFS tends to be a filesystem that you'd only use
> for core files in environments where you have a system manager that
> knows what he is doing.

Oh, man, what a troll.

Even my *TV* uses XFS. First google reference:

http://settorezero.blogspot.com/2010/10/xfs-filesystem-and-samsung-ledtvs.html

You won't find a more hostile "pull-the-drive-without-unmounting"
environment than a TV. XFS handles this sort of stuff just fine and
it has for a long time.

> So there, maybe "try forever" is the right
> thing to do.

You still haven't grasped that there are many different
classes of IO errors and that some require different handling to
others. :/

> Things are a bit different with some random unreliable USB stick FAT32
> filesystem that just died on you, with a normal user that just removes
> the thing or doesn't even notice that the stick is now dead. There the
> "try forever" is totally the wrong thing to do.

I'm not saying that "try forever" is how all errors should be
handled. You are completely focussed on that as the only possible
error handling technique that can be used. Stop being stupid!

I'm saying that the dirty, uptodate, errored buffer should
be handed back to the filesystem so that it can decide what to do
with it. For filesystems that can differentiate the types of
errors, let them choose how to handle the problem. Those that are
unable to handle the error can do the invalidation themselves. One
size does not fit all...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-01-24 07:10:53

by Linus Torvalds

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon, Jan 23, 2012 at 10:12 PM, Dave Chinner <[email protected]> wrote:
>
> You still haven't grasped that there are many different
> classes of IO errors and that some require different handling to
> others. :/

I think *you* haven't grasped what we're talking about.

Did you look at the thread?

We're not talking about some filesystem-specific buffer handling.
We're talking about the *default* handler in fs/buffer.c for IO
completion.

A filesystem can override that handler if it wants to and you can do
whatever the f*ck you want in your filesystem end-of-io handler.

But dammit, in the default handler - which BY DEFINITION IS ABOUT
FILESYSTEMS THAT DO NOT DO THOSE "many different classes of IO errors"
that you continue to blather about, there really are two choices:
- throw the damn thing away
- retry the write forever

It really is that simple. The thing I objected to was Jan claiming
that clearing the up-to-date flag made no semantic sense. That's just
crazy talk.

What does *not* make any semantic sense is to just mark the damn thing
clean after an IO error!

I very much NAK that whole idiotic approach, and object to it. That
is the routine that is used even without any filesystem AT ALL, for
chrissake.

So all your arguments are for something totally different that is
irrelevant for the whole discussion. Really.

Linus

2012-01-24 12:13:14

by Jan Kara

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Mon 23-01-12 23:10:53, Linus Torvalds wrote:
> On Mon, Jan 23, 2012 at 10:12 PM, Dave Chinner <[email protected]> wrote:
> >
> > You still haven't grasped that there are many different
> > classes of IO errors and that some require different handling to
> > others. :/
>
> I think *you* haven't grasped what we're talking about.
>
> Did you look at the thread?
>
> We're not talking about some filesystem-specific buffer handling.
> We're talking about the *default* hander in fs/buffer.c for IO
> completion.
Well, not sure if you remember, but the change I was proposing was only
for filesystems that would promise (in their fstype) that they know better
how to handle IO errors.

> A filesystem can override that handler if it wants to and you can do
> whatever the f*ck you want in your filesystem end-of-io handler.
That's another possibility, but a flag in fstype and a check in the default
end-of-io handler looked like an easier solution to me.

> But dammit, in the default handler - which BY DEFINITION IS ABOUT
> FILESYSTEMS THAT DO NOT DO THOSE "many different classes of IO errors"
> that you continue to blather about, there really are two choices:
> - throw the damn thing away
> - retry the write forever
>
> It really is that simple. The thing I objected to was Jan claiming
> that clearing the up-to-date flag made no semantic sense. That's just
> crazy talk.
>
> What does *not* make any semantic sense is to just mark the damn thing
> clean after an IO error!
I agree that just keeping the uptodate flag was a half-baked proposal. I
ultimately want to set the dirty flag back as well, and we seem to agree
that that would be logically correct. But then, as you noted, the filesystem
has to properly handle unwriteable pages to avoid driving the system to OOM
(or actually blocking all writes due to an exceeded dirty limit), and that is
more complicated (distinguishing a path that is temporarily down from a USB
stick that is gone, etc.), so I didn't want to go there in the beginning. But
it seems I'll have to ;)
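The "set the dirty flag back" follow-up Jan describes can be sketched in the same toy-model style (names are illustrative stand-ins, not kernel code): keep the buffer uptodate on a failed write and re-dirty it so writeback retries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch; not the actual kernel buffer_head. */
struct xbuffer {
    bool uptodate;        /* data in memory is still correct */
    bool dirty;           /* needs writeback */
    bool write_io_error;  /* last write failed */
};

/* On write failure, record the error, keep uptodate, and set the
 * buffer dirty again so writeback will retry it.  As the thread
 * notes, this only works if the kernel also has a way to drop
 * unwriteable data, or a dead device can pin memory and OOM the
 * box / exhaust the dirty limit. */
static void end_write_redirty(struct xbuffer *bh, bool success)
{
    if (success)
        return;
    bh->write_io_error = true;
    bh->dirty = true;     /* retry via normal writeback */
}
```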

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-01-26 12:18:22

by Ric Wheeler

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On 01/23/2012 07:36 PM, Dave Chinner wrote:
> On Mon, Jan 23, 2012 at 04:47:09PM -0500, Ted Ts'o wrote:
>>> The thing is, transient write errors tend to be isolated and go away
>>> when a retry occurs (think of IO timeouts when multipath failover
>>> occurs). When non-isolated IO or unrecoverable problems occur (e.g.
>>> no paths left to fail over onto), critical other metadata reads and
>>> writes will fail and shut down the filesystem, thereby terminating
>>> the "try forever" background writeback loop those delayed write
>>> buffers may be in. So the truth is that "trying forever" on write
>>> errors can handle a whole class of write IO errors very
>>> effectively....
>> So how does XFS decide whether a write should fail and shutdown the
>> file system, or just "try forever"?
> The IO dispatcher decides that. If the dispatcher has handed the IO
> off to the delayed write queue, then failed writes will be tried
> again. If the caller is catching the IO completion (e.g. sync
> writes) or attaching a completion callback (journal IO), then the
> completion context will handle the error appropriately. Journal IO
> errors tend to shutdown the filesystem on the first error, other
> contexts may handle the error, retry or shutdown the filesystem
> depending on their current state when the error occurs.
>
> Reads are even more complex, because the dispatch context can be
> within a transaction and the correct error handling is then
> dependent on the current state of the transaction....
>
> Cheers,
>
> Dave.

I think that having retry logic at the file system layer is really putting the
fix in the wrong place.

Specifically, if we have multipath configured under a file system, it is up to
the multipath logic to handle the failure (and use another path, retry, etc).
If we see a failed IO further up the stack, it is *really* dead at that point.

Transient errors on normal drives are also rarely worth retrying, since pretty
much all modern storage devices have firmware that will have done exhaustive
retries on a failed write. Definitely not worth retrying forever for a normal
device.

At one end of the spectrum, think of a box with dozens of storage devices
attached (either via SAN or local S-ATA devices). If we are doing large,
streaming writes, we could get a large amount of memory dirtied while writing.
If that one device dies and we keep that memory in use for the endless retry
loop, we have really crippled the box, which still has multiple happy storage
devices and file systems....

Ric

2012-01-26 20:51:09

by Jan Kara

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On Thu 26-01-12 07:17:41, Ric Wheeler wrote:
> On 01/23/2012 07:36 PM, Dave Chinner wrote:
> >On Mon, Jan 23, 2012 at 04:47:09PM -0500, Ted Ts'o wrote:
> >>>The thing is, transient write errors tend to be isolated and go away
> >>>when a retry occurs (think of IO timeouts when multipath failover
> >>>occurs). When non-isolated IO or unrecoverable problems occur (e.g.
> >>>no paths left to fail over onto), critical other metadata reads and
> >>>writes will fail and shut down the filesystem, thereby terminating
> >>>the "try forever" background writeback loop those delayed write
> >>>buffers may be in. So the truth is that "trying forever" on write
> >>>errors can handle a whole class of write IO errors very
> >>>effectively....
> >>So how does XFS decide whether a write should fail and shutdown the
> >>file system, or just "try forever"?
> >The IO dispatcher decides that. If the dispatcher has handed the IO
> >off to the delayed write queue, then failed writes will be tried
> >again. If the caller is catching the IO completion (e.g. sync
> >writes) or attaching a completion callback (journal IO), then the
> >completion context will handle the error appropriately. Journal IO
> >errors tend to shutdown the filesystem on the first error, other
> >contexts may handle the error, retry or shutdown the filesystem
> >depending on their current state when the error occurs.
> >
> >Reads are even more complex, because the dispatch context can be
> >within a transaction and the correct error handling is then
> >dependent on the current state of the transaction....
>
> I think that having retry logic at the file system layer is really
> putting the fix in the wrong place.
>
> Specifically, if we have multipath configured under a file system,
> it is up to the multipath logic to handle the failure (and use
> another path, retry, etc). If we see a failed IO further up the
> stack, it is *really* dead at that point.
Yes, that makes sense. Only, if my memory serves well, e.g. with iSCSI we
do see transient errors so it's not like they don't happen.

> Transient errors on normal drives are also rarely worth re-trying
> since pretty much all modern storage devices have firmware that will
> have done exhaustive retries on a failed write. Definitely not worth
> retrying forever for a normal device.
Agreed. But we could still be clever enough to write the data / metadata
to a different place.

> At one end of the spectrum, think of a box with dozens of storage
> devices attached (either via SAN or local S-ATA devices). If we are
> doing large, streaming writes, we could get a large amount of memory
> dirtied while writing. If that one device dies and we keep that
> memory in use for the endless retry loop, we have really crippled the
> box, which still has multiple happy storage devices and file
> systems....
I agree that if we ever decide to keep unwriteable data in memory, the
kernel has to have a way to get rid of this data if it needs to.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-01-26 20:59:09

by Ric Wheeler

Subject: Re: [RFC PATCH 0/3] Stop clearing uptodate flag on write IO error

On 01/26/2012 03:51 PM, Jan Kara wrote:
> On Thu 26-01-12 07:17:41, Ric Wheeler wrote:
>> On 01/23/2012 07:36 PM, Dave Chinner wrote:
>>> On Mon, Jan 23, 2012 at 04:47:09PM -0500, Ted Ts'o wrote:
>>>>> The thing is, transient write errors tend to be isolated and go away
>>>>> when a retry occurs (think of IO timeouts when multipath failover
>>>>> occurs). When non-isolated IO or unrecoverable problems occur (e.g.
>>>>> no paths left to fail over onto), critical other metadata reads and
>>>>> writes will fail and shut down the filesystem, thereby terminating
>>>>> the "try forever" background writeback loop those delayed write
>>>>> buffers may be in. So the truth is that "trying forever" on write
>>>>> errors can handle a whole class of write IO errors very
>>>>> effectively....
>>>> So how does XFS decide whether a write should fail and shutdown the
>>>> file system, or just "try forever"?
>>> The IO dispatcher decides that. If the dispatcher has handed the IO
>>> off to the delayed write queue, then failed writes will be tried
>>> again. If the caller is catching the IO completion (e.g. sync
>>> writes) or attaching a completion callback (journal IO), then the
>>> completion context will handle the error appropriately. Journal IO
>>> errors tend to shutdown the filesystem on the first error, other
>>> contexts may handle the error, retry or shutdown the filesystem
>>> depending on their current state when the error occurs.
>>>
>>> Reads are even more complex, because the dispatch context can be
>>> within a transaction and the correct error handling is then
>>> dependent on the current state of the transaction....
>> I think that having retry logic at the file system layer is really
>> putting the fix in the wrong place.
>>
>> Specifically, if we have multipath configured under a file system,
>> it is up to the multipath logic to handle the failure (and use
>> another path, retry, etc). If we see a failed IO further up the
>> stack, it is *really* dead at that point.
> Yes, that makes sense. Only, if my memory serves well, e.g. with iSCSI we
> do see transient errors so it's not like they don't happen.

iSCSI is "just" a transport for SCSI - you can have multipath enabled for iSCSI
as well of course :)
>
>> Transient errors on normal drives are also rarely worth re-trying
>> since pretty much all modern storage devices have firmware that will
>> have done exhaustive retries on a failed write. Definitely not worth
>> retrying forever for a normal device.
> Agreed. But we could still be clever enough to write the data / metadata
> to a different place.

Most storage devices totally lie to you about the layout, but there is some
value (as btrfs shows) in writing things twice to make sure that you can
survive a single bad sector. Even in that case, you still want to avoid a
retry of a failed IO, though.

>
>> At one end of the spectrum, think of a box with dozens of storage
>> devices attached (either via SAN or local S-ATA devices). If we are
>> doing large, streaming writes, we could get a large amount of memory
>> dirtied while writing. If that one device dies and we keep that
>> memory in use for the endless retry loop, we have really crippled the
>> box, which still has multiple happy storage devices and file
>> systems....
> I agree that if we ever decide to keep unwriteable data in memory, the
> kernel has to have a way to get rid of this data if it needs to.

I seem to recall having this discussion (LinuxCon Japan?) a few years back.

Ric