Some devices - notably dm and md - can change their behaviour in
response to BIO_RW_BARRIER requests. They might start out accepting
such requests but on reconfiguration, they find out that they cannot
any more.
ext3 (and other filesystems) deal with this by always testing if
BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
requests without the barrier (probably after waiting for any pending
writes to complete).
However there is a bug in the handling for this for ext3.
When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
it sets the buffer_ordered flag on the buffer head.
If the request completes successfully, the flag STAYS SET.
Other code might then write the same buffer_head after the device has
been reconfigured to not accept barriers. This write will then fail,
but the "other code" is not ready to handle EOPNOTSUPP errors and the
error will be treated as fatal.
This can be seen without having to reconfigure a device at exactly the
wrong time by putting:
if (buffer_ordered(bh))
printk("OH DEAR, and ordered buffer\n");
in the while loop in "commit phase 5" of journal_commit_transaction.
If it ever prints the "OH DEAR ..." message (as it does sometimes for
me), then that request could (in different circumstances) have failed
with EOPNOTSUPP, but that isn't tested for.
My proposed fix is to clear the buffer_ordered flag after it has been
used, as in the following patch.
Thanks,
NeilBrown
Signed-off-by: Neil Brown <[email protected]>
diff .prev/fs/jbd/commit.c ./fs/jbd/commit.c
--- .prev/fs/jbd/commit.c 2008-02-07 10:01:57.000000000 +1100
+++ ./fs/jbd/commit.c 2008-02-07 10:04:58.000000000 +1100
@@ -131,6 +131,8 @@ static int journal_write_commit_record(j
barrier_done = 1;
}
ret = sync_dirty_buffer(bh);
+ if (barrier_done)
+ clear_buffer_ordered(bh);
/* is it possible for another commit to fail at roughly
* the same time as this one? If so, we don't want to
* trust the barrier flag in the super, but instead want
@@ -148,7 +150,6 @@ static int journal_write_commit_record(j
spin_unlock(&journal->j_state_lock);
/* And try again, without the barrier */
- clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
ret = sync_dirty_buffer(bh);
Duplicating Neil Brown's jbd patch for jbd2. I guess this can go
through the ext4 queue rather than straight into -mm.
Neil's text:
Some devices - notably dm and md - can change their behaviour in
response to BIO_RW_BARRIER requests. They might start out accepting
such requests but on reconfiguration, they find out that they cannot
any more.
ext3 (and other filesystems) deal with this by always testing if
BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
requests without the barrier (probably after waiting for any pending
writes to complete).
However there is a bug in the handling for this for ext3.
When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
it sets the buffer_ordered flag on the buffer head.
If the request completes successfully, the flag STAYS SET.
Other code might then write the same buffer_head after the device has
been reconfigured to not accept barriers. This write will then fail,
but the "other code" is not ready to handle EOPNOTSUPP errors and the
error will be treated as fatal.
This can be seen without having to reconfigure a device at exactly the
wrong time by putting:
if (buffer_ordered(bh))
printk("OH DEAR, and ordered buffer\n");
in the while loop in "commit phase 5" of journal_commit_transaction.
If it ever prints the "OH DEAR ..." message (as it does sometimes for
me), then that request could (in different circumstances) have failed
with EOPNOTSUPP, but that isn't tested for.
My proposed fix is to clear the buffer_ordered flag after it has been
used, as in the following patch.
Thanks,
NeilBrown
Signed-off-by: Dave Kleikamp <[email protected]>
diff -Nurp linux-2.6.24-mm1/fs/jbd2/commit.c linux/fs/jbd2/commit.c
--- linux-2.6.24-mm1/fs/jbd2/commit.c 2008-02-04 09:08:44.000000000 -0600
+++ linux/fs/jbd2/commit.c 2008-02-06 22:11:14.000000000 -0600
@@ -148,6 +148,8 @@ static int journal_submit_commit_record(
barrier_done = 1;
}
ret = submit_bh(WRITE, bh);
+ if (barrier_done)
+ clear_buffer_ordered(bh);
/* is it possible for another commit to fail at roughly
* the same time as this one? If so, we don't want to
@@ -166,7 +168,6 @@ static int journal_submit_commit_record(
spin_unlock(&journal->j_state_lock);
/* And try again, without the barrier */
- clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
ret = submit_bh(WRITE, bh);
Hi Neil,
> Some devices - notably dm and md - can change their behaviour in
> response to BIO_RW_BARRIER requests. They might start out accepting
> such requests but on reconfiguration, they find out that they cannot
> any more.
>
> ext3 (and other filesystems) deal with this by always testing if
> BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
> requests without the barrier (probably after waiting for any pending
> writes to complete).
>
> However there is a bug in the handling for this for ext3.
>
> When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
> it sets the buffer_ordered flag on the buffer head.
> If the request completes successfully, the flag STAYS SET.
Yes, I've recently noted this as well :)
> Other code might then write the same buffer_head after the device has
> been reconfigured to not accept barriers. This write will then fail,
> but the "other code" is not ready to handle EOPNOTSUPP errors and the
> error will be treated as fatal.
>
> This can be seen without having to reconfigure a device at exactly the
> wrong time by putting:
>
> if (buffer_ordered(bh))
> printk("OH DEAR, and ordered buffer\n");
>
>
> in the while loop in "commit phase 5" of journal_commit_transaction.
>
> If it ever prints the "OH DEAR ..." message (as it does sometimes for
> me), then that request could (in different circumstances) have failed
> with EOPNOTSUPP, but that isn't tested for.
>
> My proposed fix is to clear the buffer_ordered flag after it has been
> used, as in the following patch.
Yes. Actually, I think there's another bug in there as well - at least
I have bugreport where we obviously miss that writing ordered buffer
failed (I see completion function of the buffer return with eopnotsupp
bit set but barriers aren't disabled). I think someone starts writing
out the buffer before we call sync_dirty_buffer(bh) but I'm not completely
sure who can do this when we just do set_buffer_dirty(bh)... Anyway
before I check that it is indeed happening what I think is happening,
your fix is fine :).
You can add: Acked-by: Jan Kara <[email protected]>
> Signed-off-by: Neil Brown <[email protected]>
>
> diff .prev/fs/jbd/commit.c ./fs/jbd/commit.c
> --- .prev/fs/jbd/commit.c 2008-02-07 10:01:57.000000000 +1100
> +++ ./fs/jbd/commit.c 2008-02-07 10:04:58.000000000 +1100
> @@ -131,6 +131,8 @@ static int journal_write_commit_record(j
> barrier_done = 1;
> }
> ret = sync_dirty_buffer(bh);
> + if (barrier_done)
> + clear_buffer_ordered(bh);
> /* is it possible for another commit to fail at roughly
> * the same time as this one? If so, we don't want to
> * trust the barrier flag in the super, but instead want
> @@ -148,7 +150,6 @@ static int journal_write_commit_record(j
> spin_unlock(&journal->j_state_lock);
>
> /* And try again, without the barrier */
> - clear_buffer_ordered(bh);
> set_buffer_uptodate(bh);
> set_buffer_dirty(bh);
> ret = sync_dirty_buffer(bh);
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Wed, 2008-02-06 at 22:25 -0600, Dave Kleikamp wrote:
> Duplicating Neil Brown's jbd patch for jbd2. I guess this can go
> through the ext4 queue rather than straight into -mm.
>
Checked-in.
Thanks Shaggy and Neil.
Mingming
> Neil's text:
>
> Some devices - notably dm and md - can change their behaviour in
> response to BIO_RW_BARRIER requests. They might start out accepting
> such requests but on reconfiguration, they find out that they cannot
> any more.
>
> ext3 (and other filesystems) deal with this by always testing if
> BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
> requests without the barrier (probably after waiting for any pending
> writes to complete).
>
> However there is a bug in the handling for this for ext3.
>
> When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
> it sets the buffer_ordered flag on the buffer head.
> If the request completes successfully, the flag STAYS SET.
>
> Other code might then write the same buffer_head after the device has
> been reconfigured to not accept barriers. This write will then fail,
> but the "other code" is not ready to handle EOPNOTSUPP errors and the
> error will be treated as fatal.
>
> This can be seen without having to reconfigure a device at exactly the
> wrong time by putting:
>
> if (buffer_ordered(bh))
> printk("OH DEAR, and ordered buffer\n");
>
>
> in the while loop in "commit phase 5" of journal_commit_transaction.
>
> If it ever prints the "OH DEAR ..." message (as it does sometimes for
> me), then that request could (in different circumstances) have failed
> with EOPNOTSUPP, but that isn't tested for.
>
> My proposed fix is to clear the buffer_ordered flag after it has been
> used, as in the following patch.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Dave Kleikamp <[email protected]>
>
> diff -Nurp linux-2.6.24-mm1/fs/jbd2/commit.c linux/fs/jbd2/commit.c
> --- linux-2.6.24-mm1/fs/jbd2/commit.c 2008-02-04 09:08:44.000000000 -0600
> +++ linux/fs/jbd2/commit.c 2008-02-06 22:11:14.000000000 -0600
> @@ -148,6 +148,8 @@ static int journal_submit_commit_record(
> barrier_done = 1;
> }
> ret = submit_bh(WRITE, bh);
> + if (barrier_done)
> + clear_buffer_ordered(bh);
>
> /* is it possible for another commit to fail at roughly
> * the same time as this one? If so, we don't want to
> @@ -166,7 +168,6 @@ static int journal_submit_commit_record(
> spin_unlock(&journal->j_state_lock);
>
> /* And try again, without the barrier */
> - clear_buffer_ordered(bh);
> set_buffer_uptodate(bh);
> set_buffer_dirty(bh);
> ret = submit_bh(WRITE, bh);
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html