2009-08-24 13:34:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 10:33:10AM +0200, Christian Fischer wrote:
> I'm trying to figure out reasonable mount options for ext4.
>
> I've seen an "Enable asynchronous commits by default" patch from Sun, 21 Sep
> 2008.
>
> Why is it revoked?

The patch was never merged because the asynchronous commits feature
disabled all write barriers, so under heavy workloads a power failure
could cause data loss.

No one has gotten around to looking at this closely; I think adding a
strategically placed blkdev_issue_flush() will allow us to safely
enable this feature, but it needs careful study.

Thanks for reminding us about this bit of unfinished business. I'll
put it back on our todo list, where it had fallen off. For now, it's
not safe to use, which is why it's not enabled by default.

- Ted


2009-08-24 18:31:32

by Andreas Dilger

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 24, 2009 09:34 -0400, Theodore Ts'o wrote:
> On Mon, Aug 24, 2009 at 10:33:10AM +0200, Christian Fischer wrote:
> > I try to figure out reasonable mount options for ext4.
> >
> > I've seen a "Enable asynchronous commits by default" patch from Sun, 21 Sep
> > 2008.
> >
> > Why is it revoked?
>
> The patch was never merged because the asynchronous commits feature
> disabled all write barriers, so under heavy workloads a power failure
> could cause data loss.
>
> No one has gotten around to looking at this closely; I think adding a
> strategically placed blkdev_issue_flush() will allow us to safely
> enable this feature, but it needs careful study.

I don't think that was the issue, but rather that we wanted to have
per-block checksums in order to handle the case where some block in
transaction A is causing a transaction checksum failure, yet transaction
B has already committed and begun checkpointing.

One option discussed was to add a lightweight 16-bit checksum (e.g. TCP
checksum) to the high bits of the t_flags of the block tag. The checksum
doesn't have to be very strong since the whole-transaction checksum will
be the primary point of validation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-24 18:37:50

by Ric Wheeler

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

Andreas Dilger wrote:
> On Aug 24, 2009 09:34 -0400, Theodore Ts'o wrote:
>
>> On Mon, Aug 24, 2009 at 10:33:10AM +0200, Christian Fischer wrote:
>>
>>> I try to figure out reasonable mount options for ext4.
>>>
>>> I've seen a "Enable asynchronous commits by default" patch from Sun, 21 Sep
>>> 2008.
>>>
>>> Why is it revoked?
>>>
>> The patch was never merged because the asynchronous commits feature
>> disabled all write barriers, so under heavy workloads a power failure
>> could cause data loss.
>>
>> No one has gotten around to looking at this closely; I think adding a
>> strategically placed blkdev_issue_flush() will allow us to safely
>> enable this feature, but it needs careful study.
>>
>
> I don't think that was the issue, but rather that we wanted to have
> per-block checksums in order to handle the case where some block in
> transaction A is causing a transaction checksum failure, yet transaction
> B has already committed and begun checkpointing.
>
> One option discussed was to add a lightweight 16-bit checksum (e.g. TCP
> checksum) to the high bits of the t_flags of the block tag. The checksum
> doesn't have to be very strong since the whole-transaction checksum will
> be the primary point of validation.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

I still don't trust the logic. It seems like a very complex (and really
non-intuitive - async and commit, really?) thing to support for a
marginal performance gain. Any blkdev_issue_flush() call would dwarf
the advantage of the async part of the commit.

If we are looking to better support workloads that suffer from
journalling, I suspect that we have more natural ways to do that...

ric


2009-08-24 20:10:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 12:31:19PM -0600, Andreas Dilger wrote:
> > No one has gotten around to looking at this closely; I think adding a
> > strategically placed blkdev_issue_flush() will allow us to safely
> > enable this feature, but it needs careful study.
>
> I don't think that was the issue, but rather that we wanted to have
> per-block checksums in order to handle the case where some block in
> transaction A is causing a transaction checksum failure, yet transaction
> B has already committed and begun checkpointing.

There are multiple problems that are going on here.

Adding per-block checksums is useful in providing better resilience in
the case of write errors in the journal, and providing better error
handling when we detect a checksum error in the commit block.

However, even if we add per-block checksums, there is
still the problem that there is logic in the jbd layer where we avoid
reusing certain blocks until we are sure the transaction has
committed. With asynchronous commits, there is no way of knowing when
it is safe to reuse those blocks.

To illustrate, consider a data block for an inode which has just been
deleted. We have code which prevents that block from being reused
until the transaction has committed; but when we use asynchronous
commits, we don't know when that will be. Currently the async commit
code assumes that once we send the commit block to be written, the
commit has happened; this opens up a race between when the
commit record (and all of the associated journal blocks) is actually
written to disk, and when a data block gets reused.

Most of the time, this will cause silent corruption of a data file
that was about to be deleted right before the power failure --- but if
the block in question was part of a directory that was in the process
of being deleted, it could result in a filesystem corruption
detectable by e2fsck.

Looking at the code, the best we can do in the short-term is to write
the commit record where we do, but do so with a barrier requested, and
then wait for the commit block where we do. This will provide some
performance improvement, since we won't wait for all of the journal
blocks to be sent to disk before writing the commit record. However,
we still have to wait for the commit record (and all of the blocks
before it) to be committed to stable store before we can mark the
transaction as being truly committed. So it's not a true "async
commit", but it is a benefit of adding journal checksums.

It might be possible in the long term to batch together multiple
transaction in the "committing" state, instead of only allowing one
transaction to be in the committing state, and to prevent blocks from
being reused or allowing pinned buffer_heads from writing the contents
of the blocks to their final location on disk until we are 100% sure
the commit block and all of the associated journal blocks really have
made it to disk. However, it's probably not worth doing this, since
the only time this would make a big difference is when we have several
transactions closing within the space of a short time; and in that
case the fsync() requires a barrier operation anyway.

- Ted

2009-08-24 20:29:10

by Ric Wheeler

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 12:31:19PM -0600, Andreas Dilger wrote:
>
>>> No one has gotten around to looking at this closely; I think adding a
>>> strategically placed blkdev_issue_flush() will allow us to safely
>>> enable this feature, but it needs careful study.
>>>
>> I don't think that was the issue, but rather that we wanted to have
>> per-block checksums in order to handle the case where some block in
>> transaction A is causing a transaction checksum failure, yet transaction
>> B has already committed and begun checkpointing.
>>
>
> There are multiple problems that are going on here.
>
> Adding per-block checksums is useful in providing better resilience in
> the case of write errors in the journal, and providing better error
> handling when we detect a checksum error in the commit block.
>

My issue with the async commit is that it is basically a detection
mechanism.

Drives will (almost always) write sequential writes to the platter in order.
Async commit lets us send things down out of order, which means that we
have a wider window of "bad state" for any given transaction...

ric

> However, even if we add per-block checksums, there is
> still the problem that there is logic in the jbd layer where we avoid
> reusing certain blocks until we are sure the transaction has
> committed. With asynchronous commits, there is no way of knowing when
> it is safe to reuse those blocks.
>
> To illustrate, consider a data block for an inode which has just been
> deleted. We have code which prevents that block from being reused
> until the transaction has committed; but when we use asynchronous
> commits, we don't know when that will be. Currently the async commit
> code assumes that once we send the commit block to be written, the
> commit has happened; this opens up a race between when the
> commit record (and all of the associated journal blocks) is actually
> written to disk, and when a data block gets reused.
>
> Most of the time, this will cause silent corruption of a data file
> that was about to be deleted right before the power failure --- but if
> the block in question was part of a directory that was in the process
> of being deleted, it could result in a filesystem corruption
> detectable by e2fsck.
>
> Looking at the code, the best we can do in the short-term is to write
> the commit record where we do, but do so with a barrier requested, and
> then wait for the commit block where we do. This will provide some
> performance improvement, since we won't wait for all of the journal
> blocks to be sent to disk before writing the commit record. However,
> we still have to wait for the commit record (and all of the blocks
> before it) to be committed to stable store before we can mark the
> transaction as being truly committed. So it's not a true "async
> commit", but it is a benefit of adding journal checksums.
>
> It might be possible in the long term to batch together multiple
> transaction in the "committing" state, instead of only allowing one
> transaction to be in the committing state, and to prevent blocks from
> being reused or allowing pinned buffer_heads from writing the contents
> of the blocks to their final location on disk until we are 100% sure
> the commit block and all of the associated journal blocks really have
> made it to disk. However, it's probably not worth doing this, since
> the only time this would make a big difference is when we have several
> transactions closing within the space of a short time; and in that
> case the fsync() requires a barrier operation anyway.
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2009-08-24 21:28:35

by Andreas Dilger

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 24, 2009 16:10 -0400, Theodore Ts'o wrote:
> However, even if we add per-block checksums, there is
> still the problem that there is logic in the jbd layer where we avoid
> reusing certain blocks until we are sure the transaction has
> committed. With asynchronous commits, there is no way of knowing when
> it is safe to reuse those blocks.

AFAIK, "async commits" are not wholly async. There is still a wait
for the transaction commit block to hit the disk before calling the
transaction committed. The main improvement with "async commits" is
that JBD only waits ONCE for both the transaction data blocks and the
commit block (which are submitted at the same time). In particular
note that there is an unconditional wait on the commit block for
every transaction commit.

JBD currently submits the transaction data and waits once for them
to complete, and then submits the commit block and waits a second time
for it to complete, which is 2x latency for every transaction.


> Looking at the code, the best we can do in the short-term is to write
> the commit record where we do, but do so with a barrier requested, and
> then wait for the commit block where we do. This will provide some
> performance improvement, since we won't wait for all of the journal
> blocks to be sent to disk before writing the commit record.

The big problem with Linux "barriers" is that they are actually cache
flushes, so they are worse than just having the original code that waits
for completion of the actual journal blocks, because the "barrier"
will force all kinds of unrelated blocks to disk as well.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-24 22:07:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 04:28:16PM -0400, Ric Wheeler wrote:
>
> My issue with the async commit is that it is basically a detection
> mechanism.
>
> Drives will (almost always) write sequential writes to the platter in order.
> Async commit lets us send things down out of order, which means that we
> have a wider window of "bad state" for any given transaction...

Sure, agreed. But let's look a bit closer at what "async commit"
really means.

What ext3 and ext4 does by default is this:

1) Write data blocks required by data=ordered mode (if any)

2) Write the journal blocks

3) Wait for the journal blocks to be sent to disk. (We don't actually
do a barrier operation, so this just means the blocks have been sent
to the disk, not necessarily that they are forced to the platter.)

4) Write the commit block, with the barrier flag set.

5) Wait for the commit block.

-----

What the current async commit code does is this:

1) Write data blocks required by data=ordered mode (if any)

2) Write the journal blocks

3) Write the commit block, without a barrier.

4) Wait for the journal blocks to be sent to disk.

5) Wait for the commit block (since no barrier is requested, this is
just when it was sent to the disk, not when it is actually committed
to stable store).

Since there are no barriers at all, the async mount option basically
works the same as barrier=0, and is subject to exactly the same
problems as barrier=0 --- problems which I've actually demonstrated
exist in practice.

----

What I think we can do safely in ext4 is this:

1) Write data blocks required by data=ordered mode (if any)

2) Write the journal blocks

3) Write the commit block, WITH a barrier requested.

4) Wait for the commit block to be completed.

5) Wait for the journal blocks to be sent to disk. #4 implies that
all of the journal block I/O will have been completed, so this is just
to collect the commit completion status; we shouldn't actually block
during step #5, assuming the block layer's barrier operation was
implemented correctly.


This should save us a little bit, since it implies the commit record
will be sent to disk in the same I/O request to the storage device as
the other journal blocks, which is _not_ currently the case today.


Technically, what ext3 does today could result in problems, since
without the barrier between the journal blocks and the commit block,
the two could theoretically get reordered by the disk such that the
commit block is written before the journal blocks are completely
written --- and since ext3 doesn't have journal checksumming, this
would never be noticed. Fortunately in practice this generally won't
happen since the commit block is adjacent to the rest of the journal
blocks, so a sane disk drive will likely coalesce the two write
requests together.

- Ted


2009-08-24 22:12:56

by Ric Wheeler

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 04:28:16PM -0400, Ric Wheeler wrote:
>
>> My issue with the async commit is that it is basically a detection
>> mechanism.
>>
>> Drives will (almost always) write sequential writes to the platter in order.
>> Async commit lets us send things down out of order, which means that we
>> have a wider window of "bad state" for any given transaction...
>>
>
> Sure, agreed. But let's look a bit closer at what "async commit"
> really means.
>
> What ext3 and ext4 does by default is this:
>
> 1) Write data blocks required by data=ordered mode (if any)
>
> 2) Write the journal blocks
>
> 3) Wait for the journal blocks to be sent to disk. (We don't actually
> do a barrier operation, so this just means the blocks have been sent
> to the disk, not necessarily that they are forced to the platter.)
>
> 4) Write the commit block, with the barrier flag set.
>
> 5) Wait for the commit block.
>
> -----
>
> What the current async commit code does is this:
>
> 1) Write data blocks required by data=ordered mode (if any)
>
> 2) Write the journal blocks
>
> 3) Write the commit block, without a barrier.
>
> 4) Wait for the journal blocks to be sent to disk.
>
> 5) Wait for the commit block (since no barrier is requested, this is
> just when it was sent to the disk, not when it is actually committed
> to stable store).
>
> Since there are no barriers at all, the async mount option basically
> works the same as barriers=0, and is subject to exactly the same
> problems as barrier=0 --- problems which I've actually demonstrated
> exist in practice.
>
> ----
>
> What I think we can do safely in ext4 is this:
>
> 1) Write data blocks required by data=ordered mode (if any)
>
> 2) Write the journal blocks
>
> 3) Write the commit block, WITH a barrier requested.
>
> 4) Wait for the commit block to be completed.
>
> 5) Wait for the journal blocks to be sent to disk. #4 implies that
> all of the journal block I/O will have been completed, so this is just
> to collect the commit completion status; we shouldn't actually block
> during step #5, assuming the block layer's barrier operation was
> implemented correctly.
>
>
> This should save us a little bit, since it implies the commit record
> will be sent to disk in the same I/O request to the storage device as
> the other journal blocks, which is _not_ currently the case today.
>
>
> Technically, what ext3 does today could result in problems, since
> without the barrier between the journal blocks and the commit block,
> the two could theoretically get reordered by the disk such that the
> commit block is written before the journal blocks are completely
> written --- and since ext3 doesn't have journal checksumming, this
> would never be noticed. Fortunately in practice this generally won't
> happen since the commit block is adjacent to the rest of the journal
> blocks, so a sane disk drive will likely coalesce the two write
> requests together.
>
> - Ted
>
>
I see that this might be slightly faster, but would be very interested
in seeing that the gain is big enough to warrant the complexity :-)

ric


2009-08-24 22:46:16

by Andreas Dilger

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 24, 2009 18:07 -0400, Theodore Ts'o wrote:
> What ext3 and ext4 does by default is this:
>
> 1) Write data blocks required by data=ordered mode (if any)
>
> 2) Write the journal blocks
>
> 3) Wait for the journal blocks to be sent to disk. (We don't actually
> do a barrier operation, so this just means the blocks have been
> sent to the disk, not necessarily that they are forced to the platter.)

Hmm, I think you are missing a step here. In both jbd and jbd2 there is
a wait for these buffers to hit the disk. In the jbd case it is at
"commit phase 2", and in jbd2 it is at "wait_for_iobuf".

> 4) Write the commit block, with the barrier flag set.
>
> 5) Wait for the commit block.
>
> -----
>
> What the current async commit code does is this:
>
> 1) Write data blocks required by data=ordered mode (if any)
>
> 2) Write the journal blocks
>
> 3) Write the commit block, without a barrier.
>
> 4) Wait for the journal blocks to be sent to disk.
>
> 5) Wait for the commit block (since no barrier is requested, this is
> just when it was sent to the disk, not when it is actually committed
> to stable store).

Similarly, in the async case, all of the data blocks and the commit
block are waited on, AFAICS. It's just that with async_commit the
commit block is submitted with the data blocks, and in case of a
crash the transaction checksum is needed to determine if the commit
block is valid or not.

> What I think we can do safely in ext4 is this:
>
> 1) Write data blocks required by data=ordered mode (if any)
>
> 2) Write the journal blocks
>
> 3) Write the commit block, WITH a barrier requested.
>
> 4) Wait for the commit block to be completed.
>
> 5) Wait for the journal blocks to be sent to disk. #4 implies that
> all of the journal block I/O will have been completed, so this is just
> to collect the commit completion status; we shouldn't actually block
> during step #5, assuming the block layer's barrier operation was
> implemented correctly.

Since a barrier is a painful operation, it is better to just wait
explicitly on the completion of the various blocks as needed (i.e.
journal data + commit block). That avoids the huge wait on many
other blocks that may have been sent to disk unrelated to the journal
itself, if the journal is on the same device as the filesystem.

> This should save us a little bit, since it implies the commit record
> will be sent to disk in the same I/O request to the storage device as
> the other journal blocks, which is _not_ currently the case today.

Are you _really_ sure that isn't what is done today? My reading of the
code is different, but it's of course possible that I'm seeing what I
want to see (which is how it was originally designed) and not what is
really there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-24 23:28:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 06:12:11PM -0400, Ric Wheeler wrote:
> I see that this might be slightly faster, but would be very interested
> in seeing that the gain is big enough to warrant the complexity :-)

This simple enough? :-)

commit 5127a5da28fc12a219474c96e59fd178629436fe
Author: Theodore Ts'o <[email protected]>
Date: Mon Aug 24 19:18:31 2009 -0400

ext4: Fix async commit mode by writing the commit record using a barrier

Signed-off-by: "Theodore Ts'o" <[email protected]>

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7b4088b..a7fe81d 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -132,9 +132,7 @@ static int journal_submit_commit_record(journal_t *journal,
set_buffer_uptodate(bh);
bh->b_end_io = journal_end_buffer_io_sync;

- if (journal->j_flags & JBD2_BARRIER &&
- !JBD2_HAS_INCOMPAT_FEATURE(journal,
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ if (journal->j_flags & JBD2_BARRIER) {
set_buffer_ordered(bh);
barrier_done = 1;
}


OK, to be fair, most of the complexity was already in the code;
the main complexity was simply in separating
journal_write_commit_record() into journal_submit_commit_record() and
journal_wait_on_commit_record().

We can clean up the patch by recombining these two functions, since
there was never any point in separating the submission of the commit
record from where we waited for it. I think whoever implemented it
thought we could add a bit more parallelism, but in reality all of the
code between line 709 of commit.c and line 834 of commit.c (i.e.,
commit phases 3-5) is waiting for the various journal data blocks to
be written. So we might as well wait for the commit block, which will
save a bit of scheduling overhead, using the same rationale listed in
the comment found at line 740 of commit.c:

/*
Wait for the buffers in reverse order. That way we are
less likely to be woken up until all IOs have completed, and
so we incur less scheduling load.
*/

But in terms of a simple patch to test things, the above patch is all
we need. At this point, we can try benchmarking with and without
async commit, and see if it makes a significant difference or not.

- Ted



2009-08-24 23:43:36

by Andreas Dilger

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 24, 2009 19:28 -0400, Theodore Ts'o wrote:
> @@ -132,9 +132,7 @@ static int journal_submit_commit_record(journal_t *journal,
> set_buffer_uptodate(bh);
> bh->b_end_io = journal_end_buffer_io_sync;
>
> - if (journal->j_flags & JBD2_BARRIER &&
> - !JBD2_HAS_INCOMPAT_FEATURE(journal,
> - JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
> + if (journal->j_flags & JBD2_BARRIER) {
> set_buffer_ordered(bh);
> barrier_done = 1;
> }
>
>
> OK, to be fair, most of the complexity was already in the code;
> the main complexity was simply in separating
> journal_write_commit_record() into journal_submit_commit_record() and
> journal_wait_on_commit_record().
>
> We can clean up the patch by recombining these two functions, since
> there was never any point in separating the submission of the commit
> record from where we waited for it. I think whoever implemented it
> thought we could add a bit more parallelism, but in reality all of the
> code between line 709 of commit.c and line 834 of commit.c (i.e.,
> commit phases 3-5) is waiting for the various journal data blocks to
> be written. So we might as well wait for the commit block, which will
> save a bit of scheduling overhead, using the same rationale listed in
> the comment found at line 740 of commit.c:
>
> /*
> Wait for the buffers in reverse order. That way we are
> less likely to be woken up until all IOs have completed, and
> so we incur less scheduling load.
> */

Without transaction checksums waiting on all of the blocks together
is NOT safe. If the commit record is on disk, but the rest of the
transaction's blocks are not then during replay it may cause garbage
to be written from the journal into the filesystem metadata.

Have you seen any of my other emails on this topic? It would seem not...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-24 23:52:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 04:46:16PM -0600, Andreas Dilger wrote:
> On Aug 24, 2009 18:07 -0400, Theodore Ts'o wrote:
> > What ext3 and ext4 does by default is this:
> >
> > 1) Write data blocks required by data=ordered mode (if any)
> >
> > 2) Write the journal blocks
> >
> > 3) Wait for the journal blocks to be sent to disk. (We don't actually
> > do a barrier operation), so this just means the blocks have been
> > sent to the disk, not necessarily that they are forced to a platter.
>
> Hmm, I think you are missing a step here. In both jbd and jbd2 there is
> a wait for these buffers to hit the disk. In the jbd case it is at
> "commit phase 2", and in jbd2 it is at "wait_for_iobuf".

That's what I meant by step 3. We wait for the blocks to be *sent* to
disk, but since there is no barrier operation, the blocks have not
necessarily been committed to iron oxide (or whatever alloy is used on
HDD platters these days :-).

Without a barrier, Chris Mason has demonstrated that with a very heavy
workload, while the system is under memory pressure, and with lots of
fsync()'s thrown in for good measure, simply waiting for the block
device to signal completion is **not** enough. He has demonstrated
filesystem corruption bad enough that fsck -p was not able to recover
the filesystem; it required manual intervention to clear the
filesystem corruption. The bottom line is that modern disks *do* do
significant reordering in their 8-32MB internal buffer, and they
*don't* have sufficient power storage to guarantee that everything
accepted and stored in the cache will actually be written out in the
event of a power failure.

So waiting for the block device layer to say, "OK the write is done",
is not sufficient.

> > 5) Wait for the commit block (since a barrier is requested, this is
> > just when it was sent to the disk, not when it is actually committed
> > to stable store).
>
> Similarly, in the async case, all of the data blocks and the commit
> block are waited on, AFAICS. It's just that with async_commit the
> commit block is submitted with the data blocks, and in case of a
> crash the transaction checksum is needed to determine if the commit
> block is valid or not.

The key here is what is meant by "waited on". We don't have a way for
the HDD to tell us, "this block has hit stable store"; all we know
that the DMA operation has completed, and the data has been posted to
the device.

The real problem is that the cache flush operation is the only thing
which modern disks give us to guarantee that blocks sent to the disk
are on stable storage. Some SCSI disks have FUA, but its semantics
are incredibly sucky (force just this specific sector to disk,
ignoring all hard drive optimizations or elevator optimizations), and
very few hard drives have FUA in any case.

What we *really* want is something where we can say, "please write
these disk blocks tagged with tag <Foo>, in whatever order you like
that is most optimal, and let the OS know when all blocks tagged with
<Foo> are safely written to stable store". Unfortunately, that's not
a facility that HDD manufacturers are willing to give us....

- Ted

2009-08-25 00:15:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 05:43:36PM -0600, Andreas Dilger wrote:
> Without transaction checksums waiting on all of the blocks together
> is NOT safe. If the commit record is on disk, but the rest of the
> transaction's blocks are not then during replay it may cause garbage
> to be written from the journal into the filesystem metadata.

Yes, I *said* that we can only wait on all of the blocks together with
the commit record when doing journal checksums. Sorry if I didn't
make that clear enough.

That's the one optimization that using journal checksums buys us.
Unfortunately it does not allow us to omit the barrier
operation.... and we have real-world testing experience showing that
without the barrier, a power drop can cause significant filesystem
corruption and potential data loss.

Try using Chris Mason's torture-test workload with async-checksums
without this patch; you will get data corruption if you try dropping
power while his torture-test is running. I know you really don't like
the barrier, but I'm afraid it's not safe to run without it, even with
journal checksums.

- Ted

2009-08-25 06:16:52

by Christian Fischer

[permalink] [raw]
Subject: Re: Enable asynchronous commits by default patch revoked?

On Monday 24 August 2009, Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 10:33:10AM +0200, Christian Fischer wrote:
> > I try to figure out reasonable mount options for ext4.
> >
> > I've seen a "Enable asynchronous commits by default" patch from Sun, 21
> > Sep 2008.
> >
> > Why is it revoked?
>
> The patch was never merged because the asynchronous commits feature
> disabled all write barriers, so under heavy workloads a power failure
> could cause data loss.
>
> No one has gotten around to looking at this closely; I think adding a
> strategically placed blkdev_issue_flush() will allow us to safely
> enable this feature, but it needs careful study.
>
> Thanks for reminding us about this bit of unfinished business. I'll
> put it back on our todo list, where it had fallen off. For now, it's
> not safe to use, which is why it's not enabled by default.
>
> - Ted

Hi Ted,

Thanks, good to know that it's better not to use this.

Christian

2009-08-25 17:52:41

by Andreas Dilger

Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 24, 2009 20:15 -0400, Theodore Ts'o wrote:
> On Mon, Aug 24, 2009 at 05:43:36PM -0600, Andreas Dilger wrote:
> > Without transaction checksums waiting on all of the blocks together
> > is NOT safe. If the commit record is on disk, but the rest of the
> > transaction's blocks are not then during replay it may cause garbage
> > to be written from the journal into the filesystem metadata.
>
> That's the one optimization that using journal checksums buys us.
> Unfortunately it does not allow us to omit the barrier
> operation.... and we have real-world testing experience that without the
> barrier, a power drop can cause significant filesystem corruption and
> potential data loss.
>
> Try using Chris Mason's torture-test workload with async-checksums
> without this patch; you will get data corruption if you try dropping
> power while his torture-test is running. I know you really don't like
> the barrier, but I'm afraid it's not safe to run without it, even with
> journal checksums.

In our performance testing of barriers (not with Chris' program), it
was FAR better to disable the disk cache and wait for IO completion
(i.e. barriers disabled) on just the journal blocks than to enable the
cache and cause a cache flush for each "barrier". The problem is that at
high IO rates there is much more data in the cache vs. the actual journal
blocks, and forcing the whole cache to be flushed each transaction commit
hurt our performance noticeably.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-25 18:08:27

by Ric Wheeler

Subject: Re: Enable asynchronous commits by default patch revoked?

On 08/25/2009 01:52 PM, Andreas Dilger wrote:
> On Aug 24, 2009 20:15 -0400, Theodore Ts'o wrote:
>> On Mon, Aug 24, 2009 at 05:43:36PM -0600, Andreas Dilger wrote:
>>> Without transaction checksums waiting on all of the blocks together
>>> is NOT safe. If the commit record is on disk, but the rest of the
>>> transaction's blocks are not then during replay it may cause garbage
>>> to be written from the journal into the filesystem metadata.
>>
>> That's the one optimization that using journal checksums buys us.
>> Unfortunately it does not allow us to omit the barrier
>> operation.... and we have real-world testing experience that without the
>> barrier, a power drop can cause significant filesystem corruption and
>> potential data loss.
>>
>> Try using Chris Mason's torture-test workload with async-checksums
>> without this patch; you will get data corruption if you try dropping
>> power while his torture-test is running. I know you really don't like
>> the barrier, but I'm afraid it's not safe to run without it, even with
>> journal checksums.
>
> In our performance testing of barriers (not with Chris' program), it
> was FAR better to disable the disk cache and wait for IO completion
> (i.e. barriers disabled) on just the journal blocks than to enable the
> cache and cause a cache flush for each "barrier". The problem is that at
> high IO rates there is much more data in the cache vs. the actual journal
> blocks, and forcing the whole cache to be flushed each transaction commit
> hurt our performance noticeably.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>

This entirely depends on the nature of the storage device. On any normal
S-ATA drive, I have seen that running with the write cache disabled can cut
your large file write speed in half compared to running with barriers
enabled. That is certainly less true with SAS or fibre channel devices.

Checking on ext4 on a newish Seagate 1TB disk, I see roughly parity (F12
rawhide, RC6 kernel):

[root@ricdesktop ~]# hdparm -W0 /dev/sdb

/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
[root@ricdesktop ~]# mkfs.ext4 /dev/sdb
mke2fs 1.41.8 (11-July-2009)
<snip>
[root@ricdesktop ~]# mount -o barrier=0 /dev/sdb /mnt/
[root@ricdesktop ~]# dd if=/dev/zero of=/mnt/bigfile bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 6.75127 s, 155 MB/s
[root@ricdesktop ~]# umount /mnt
[root@ricdesktop ~]# hdparm -W1 /dev/sdb

/dev/sdb:
setting drive write-caching to 1 (on)
write-caching = 1 (on)
[root@ricdesktop ~]# mount -o barrier=1 /dev/sdb /mnt/
[root@ricdesktop ~]# dd if=/dev/zero of=/mnt/bigfile bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 6.74861 s, 155 MB/s





2009-08-25 18:22:03

by Ric Wheeler

Subject: Re: Enable asynchronous commits by default patch revoked?

On 08/25/2009 01:52 PM, Andreas Dilger wrote:
> On Aug 24, 2009 20:15 -0400, Theodore Ts'o wrote:
>> On Mon, Aug 24, 2009 at 05:43:36PM -0600, Andreas Dilger wrote:
>>> Without transaction checksums waiting on all of the blocks together
>>> is NOT safe. If the commit record is on disk, but the rest of the
>>> transaction's blocks are not then during replay it may cause garbage
>>> to be written from the journal into the filesystem metadata.
>>
>> That's the one optimization that using journal checksums buys us.
>> Unfortunately it does not allow us to omit the barrier
>> operation.... and we have real-world testing experience that without the
>> barrier, a power drop can cause significant filesystem corruption and
>> potential data loss.
>>
>> Try using Chris Mason's torture-test workload with async-checksums
>> without this patch; you will get data corruption if you try dropping
>> power while his torture-test is running. I know you really don't like
>> the barrier, but I'm afraid it's not safe to run without it, even with
>> journal checksums.
>
> In our performance testing of barriers (not with Chris' program), it
> was FAR better to disable the disk cache and wait for IO completion
> (i.e. barriers disabled) on just the journal blocks than to enable the
> cache and cause a cache flush for each "barrier". The problem is that at
> high IO rates there is much more data in the cache vs. the actual journal
> blocks, and forcing the whole cache to be flushed each transaction commit
> hurt our performance noticeably.
>
> Cheers, Andreas


Just for completeness, I ran a quick test on ext3, which was marginally
better with barriers, and XFS, which was much better with barriers...

EXT3:

[root@ricdesktop ~]# mkfs.ext3 /dev/sdb
<snip>
[root@ricdesktop ~]# hdparm -W0 /dev/sdb

/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
[root@ricdesktop ~]# mount -o barrier=0 /dev/sdb /mnt/
[root@ricdesktop ~]# rm -f /mnt/bigfile
[root@ricdesktop ~]# dd if=/dev/zero of=/mnt/bigfile bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 9.26707 s, 113 MB/s
[root@ricdesktop ~]# umount /mnt
[root@ricdesktop ~]# hdparm -W1 /dev/sdb

/dev/sdb:
setting drive write-caching to 1 (on)
write-caching = 1 (on)
[root@ricdesktop ~]# mount -o barrier=1 /dev/sdb /mnt/
[root@ricdesktop ~]# rm -f /mnt/bigfile
[root@ricdesktop ~]# dd if=/dev/zero of=/mnt/bigfile bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 8.90897 s, 118 MB/s
[root@ricdesktop ~]# umount /mnt

XFS:

[root@ricdesktop ~]# umount /mnt
[root@ricdesktop ~]# mkfs.xfs -f /dev/sdb
<snip>
[root@ricdesktop ~]# hdparm -W0 /dev/sdb

/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)
[root@ricdesktop ~]# mount -o nobarrier /dev/sdb /mnt
[root@ricdesktop ~]# rm -f /mnt/bigfile
[root@ricdesktop ~]# dd if=/dev/zero of=/mnt/bigfile bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 4.04406 s, 259 MB/s
[root@ricdesktop ~]# umount /mnt
[root@ricdesktop ~]# hdparm -W1 /dev/sdb

/dev/sdb:
setting drive write-caching to 1 (on)
write-caching = 1 (on)
[root@ricdesktop ~]# mount -o barrier /dev/sdb /mnt
[root@ricdesktop ~]# rm -f /mnt/bigfile
[root@ricdesktop ~]# dd if=/dev/zero of=/mnt/bigfile bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 3.03633 s, 345 MB/s



2009-08-25 21:11:43

by Theodore Ts'o

Subject: Re: Enable asynchronous commits by default patch revoked?

On Tue, Aug 25, 2009 at 02:07:56PM -0400, Ric Wheeler wrote:
>> In our performance testing of barriers (not with Chris' program), it
>> was FAR better to disable the disk cache and wait for IO completion
>> (i.e. barriers disabled) on just the journal blocks than to enable the
>> cache and cause a cache flush for each "barrier". The problem is that at
>> high IO rates there is much more data in the cache vs. the actual journal
>> blocks, and forcing the whole cache to be flushed each transaction commit
>> hurt our performance noticeably.
>>
> This entirely depends on the nature of the storage device. If you are
> running on any normal S-ATA drive, I have seen running with write cache
> disabled can cut your large file write speed by half over running with
> barriers enabled. Certainly less true when running with SAS or fibre
> channel devices.

It also depends on workload, of course.

With the patch which I've posted, we can leave it up to the system
administrator. If they want to disable the write cache, and then use
the barriers=0 mount option, that can be their choice. However, given
that most hard drives ship with write cache enabled so they can post
competitive WinBench scores, I think we need to have async-commit
respect the barrier option, and as with the current !async-commit
case, we should default barriers to be enabled, given that there are
known cases (such as Chris Mason's torture test) which would lead to
filesystem corruption otherwise.

Ric's test, which shows that in some cases barrier=1 is faster than
barrier=0 with write caching disabled, is I think another argument for
keeping the defaults in the current configuration. If people want to
experiment with disabling write cache and using barrier=0, they can
choose to do that, with or without async_commit.

The problem is that without my patch, the barrier=1 mount option is
completely ignored, and there's no way to enable barriers with
async_commit --- which is clearly wrong. So with my patch, it now
becomes safe for people to experiment with async_commit --- and if
they also want to experiment with barrier=0 w/ "hdparm -W 0" they can
do so. For any given workload and hardware combination, there are
therefore three safe configurations that people can try benchmarking:

!async_commit,barrier=1,"hdparm -W 1" (currently the default)
async_commit,barrier=1,"hdparm -W 1"
async_commit,barrier=0,"hdparm -W 0"

(n.b., !async_commit,barrier=0,"hdparm -W 0" is not completely safe,
since without the barrier, it's possible, although granted not very
likely, for the block layer elevator algorithm to reorder blocks in
block device queue.)

If someone would like to try applying my patch and then running their
favorite benchmark on the above three combinations,
I'd love to see the results.

- Ted

2009-08-26 09:50:37

by Andreas Dilger

Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 25, 2009 17:11 -0400, Theodore Ts'o wrote:
> The problem is that without my patch, the barrier=1 mount option is
> completely ignored, and there's no way to enable barriers with
> async_commit --- which is clearly wrong. So with my patch, it now
> becomes safe for people to experiment with async_commit --- and if
> they also want to experiment with barrier=0 w/ "hdparm -W 0" they can
> do so. For any given workload and hardware combination, there are
> therefore three safe configurations that people can try benchmarking:
>
> !async_commit,barrier=1,"hdparm -W 1" (currently the default)
> async_commit,barrier=1,"hdparm -W 1"
> async_commit,barrier=0,"hdparm -W 0"
>
> (n.b., !async_commit,barrier=0,"hdparm -W 0" is not completely safe,
> since without the barrier, it's possible, although granted not very
> likely, for the block layer elevator algorithm to reorder blocks in
> block device queue.)

I'm not sure I understand about the "n.b." case. If the filesystem
is running with !async_commit,barrier=0,wcache=0 (which is basically
ext3 with write cache off), it should still have the jbd code doing
an explicit wait for the data blocks (which should be guaranteed to
make it to disk, wcache being off) before even submitting the commit
block to the elevator? It doesn't matter what order the transaction
blocks are written to disk, so long as the commit block is last.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-26 13:14:20

by Theodore Ts'o

Subject: Re: Enable asynchronous commits by default patch revoked?

On Wed, Aug 26, 2009 at 03:50:35AM -0600, Andreas Dilger wrote:
> On Aug 25, 2009 17:11 -0400, Theodore Ts'o wrote:
> > The problem is that without my patch, the barrier=1 mount option is
> > completely ignored, and there's no way to enable barriers with
> > async_commit --- which is clearly wrong. So with my patch, it now
> > becomes safe for people to experiment with async_commit --- and if
> > they also want to experiment with barrier=0 w/ "hdparm -W 0" they can
> > do so. For any given workload and hardware combination, there are
> > therefore three safe configurations that people can try benchmarking:
> >
> > !async_commit,barrier=1,"hdparm -W 1" (currently the default)
> > async_commit,barrier=1,"hdparm -W 1"
> > async_commit,barrier=0,"hdparm -W 0"
> >
> > (n.b., !async_commit,barrier=0,"hdparm -W 0" is not completely safe,
> > since without the barrier, it's possible, although granted not very
> > likely, for the block layer elevator algorithm to reorder blocks in
> > block device queue.)
>
> I'm not sure I understand about the "n.b." case. If the filesystem
> is running with !async_commit,barrier=0,wcache=0 (which is basically
> ext3 with write cache off), it should still have the jbd code doing
> an explicit wait for the data blocks (which should be guaranteed to
> make it to disk, wcache being off) before even submitting the commit
> block to the elevator? It doesn't matter what order the transaction
> blocks are written to disk, so long as the commit block is last.

Gack, sorry, I screwed that up. What I should have written is this:

The safe configurations people could try benchmarking:

!async_commit,barrier=1,"hdparm -W 1" (currently the default)
!async_commit,barrier=0,"hdparm -W 0"
async_commit,barrier=1,"hdparm -W 1"

and the unsafe case in the n.b. should have been <async_commit,
barrier=0, "hdparm -W 0">, since without the barrier, async_commit
writes the commit block at the same time as the rest of the journal
(data, metadata, and revoke) blocks, and so there is the chance the
commit block could get reordered in front of the other journal blocks.

- Ted

2009-08-26 16:02:30

by Jan Kara

Subject: Re: Enable asynchronous commits by default patch revoked?

> On Mon, Aug 24, 2009 at 05:43:36PM -0600, Andreas Dilger wrote:
> > Without transaction checksums waiting on all of the blocks together
> > is NOT safe. If the commit record is on disk, but the rest of the
> > transaction's blocks are not then during replay it may cause garbage
> > to be written from the journal into the filesystem metadata.
>
> Yes, I *said* that we can only wait on all of the blocks together with
> the commit record when doing journal checksums. Sorry if I didn't
> make that clear enough.
I suppose we are talking about the case when write caches are turned off
and we use barrier=0 (because the barrier=1 case does not really care
about when we wait for the blocks from the correctness POV - at least
according to Documentation/block/barrier.txt). In that case, you have to
wait for *data* blocks before writing the commit block because you have
no checksumming of those...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-08-26 22:00:56

by Andreas Dilger

Subject: Re: Enable asynchronous commits by default patch revoked?

On Aug 26, 2009 09:14 -0400, Theodore Ts'o wrote:
> On Wed, Aug 26, 2009 at 03:50:35AM -0600, Andreas Dilger wrote:
> > I'm not sure I understand about the "n.b." case. If the filesystem
> > is running with !async_commit,barrier=0,"hdparm -W 0" (which is basically
> > ext3 with write cache off), it should still have the jbd code doing
> > an explicit wait for the data blocks (which should be guaranteed to
> > make it to disk, wcache being off) before even submitting the commit
> > block to the elevator? It doesn't matter what order the transaction
> > blocks are written to disk, so long as the commit block is last.
>
> Gack, sorry, I screwed that up. What I should have written is this:
>
> The safe configurations people could try benchmarking:
>
> !async_commit,barrier=1,"hdparm -W 1" (currently the default)
> !async_commit,barrier=0,"hdparm -W 0"
> async_commit,barrier=1,"hdparm -W 1"
>
> and the unsafe case in the nb should have been <async_commit,
> barrier=0, "hdparm -W 0">, since without the barrier, async_commit
> writes the commit block at the same time as the rest of the journal
> (data, metadata, and revoke) blocks, and so there is the chance the
> commit block could get reordered in front of the other journal blocks.

I'm still missing something. With async_commit enabled, it doesn't
matter if the commit block is reordered, since the transaction checksum
will verify if all of the data + commit block are written for that
transaction, in case of a crash. That is the whole point of async_commit.
If the commit block is on disk, but there are some transaction blocks
missing the checksum will (except in very rare coincidences) fail and the
transaction is skipped. With "hdparm -W 0" we are guaranteed to only
have a single uncommitted transaction, except in the case of journal
corruption (i.e. disk error or software bug).

I can imagine that with async_commit,barrier=0,"hdparm -W 1", multiple
transactions could begin checkpointing before they are fully
committed, which means the filesystem is modified in a non-recoverable
way.
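The replay behaviour being discussed can be sketched in a few lines. This is a
toy model, not the actual jbd2 code; the block contents and helper names are
invented for illustration:

```python
import zlib

def make_transaction(blocks):
    # The commit record carries a checksum over every journal block in
    # the transaction, so replay can verify completeness after a crash.
    return {"blocks": list(blocks), "commit": zlib.crc32(b"".join(blocks))}

def replay(journal):
    # Replay transactions in order; stop at the first transaction whose
    # blocks do not match the checksum stored in its commit record.
    recovered = []
    for txn in journal:
        if zlib.crc32(b"".join(txn["blocks"])) != txn["commit"]:
            break  # some blocks never reached the disk: skip the rest
        recovered.append(txn)
    return recovered

complete = make_transaction([b"blockA", b"blockB"])
torn = make_transaction([b"blockC", b"blockD"])
torn["blocks"][1] = b"\x00" * 6  # simulate a block lost in the crash

assert replay([complete, torn]) == [complete]
```

Even though the torn transaction's commit record is "on disk", the checksum
mismatch causes replay to discard it, which is the property async_commit
relies on.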

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-08-26 22:55:19

by Theodore Ts'o

Subject: Re: Enable asynchronous commits by default patch revoked?

On Wed, Aug 26, 2009 at 04:00:45PM -0600, Andreas Dilger wrote:
> I'm still missing something. With async_commit enabled, it doesn't
> matter if the commit block is reordered, since the transaction checksum
> will verify if all of the data + commit block are written for that
> transaction, in case of a crash. That is the whole point of async_commit.

The problem isn't reordering with respect to the journal blocks alone;
the problem is reordering with respect to the journal blocks *plus*
normal filesystem metadata.

The key point here is that jbd pins filesystem metadata blocks and
prevents them from being pushed out to disk until the transaction has
committed. Once the transaction has been committed, they are free to
be written to disk, and {directory,indirect,extent} blocks which have
been released during the last transaction are now free to be reused
by the block allocator.

If the system is under memory pressure and is getting lots of
fsync() calls, there are a large number of transaction boundaries. So it's
possible for an I/O stream of the form:

...
commit seq #17
journal of block #12
journal of block #52
journal of block #36
journal of block allocation bitmap releasing block #23
commit seq #18
update of block #12
write of reallocated block #23
..,

Could get reordered as follows:

...
commit seq #17
journal of block #12
journal of block #52
update of block #12
write of reallocated block #23
journal of block #36
<crash>
(journal of block allocation bitmap releasing block #23)
(commit seq #18)

OK, so what's happened? Since there was no barrier when we wrote the
commit block for transaction #18, some of the (non-journal) I/O that
was only supposed to have happened *after* the commit has completed,
has happened too early, and then the system crashed before all of the
journal blocks associated with commit #18 could be written out.

So from the perspective of the journal replay commit #18 never
happened. So among other things the act of releasing block #23 never
happened --- but block #23 has gotten reused already, since a write
that took place *after* commit #18 has taken place, due to reordering
that took place on the disk drive.

This is what Chris Mason has demonstrated with his barrier=0 file
system corruption workload. And this is something which journal
checksums don't help with, because it's not about the commit block getting
written out before the rest of the journal blocks. *That* case will
be detected by an incorrect journal checksum. The problem is other
I/O taking place to other parts of the filesystem.

I've actually used bad numbers here, since the journal is typically at
the very front of the disk (for ext3) or in the middle of the disk
(for ext4). If the I/O for the rest of the filesystem is at the very
end of the disk, it's in fact very believable that the drive might defer
the journal update (at the beginning of the disk) and try to do lots
of filesystem metadata updates (at the end of the disk) to avoid
seeking back and forth, without realizing that this violates the
ordering constraints that the jbd layer needs for correctness.

Unfortunately, the only way we can communicate these constraints to the
disk drive is via barriers.
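The failure sequence above can be modeled with a tiny reordering simulation.
This is purely illustrative; the block numbers follow the example I/O stream
and the crash point is chosen to show the hazard:

```python
# The intended I/O stream: journal writes for commit #18, then the
# filesystem writes that are only allowed once that commit is durable.
intended = [
    ("journal", "block 12"),
    ("journal", "block 52"),
    ("journal", "block 36"),
    ("journal", "bitmap: free block 23"),
    ("journal", "commit #18"),
    ("fs", "update of block 12"),
    ("fs", "write of reallocated block 23"),
]

# Without a barrier, the drive may complete the late fs writes before
# the remaining journal writes; power drops mid-stream.
completed_before_crash = [intended[0], intended[1],
                          intended[5], intended[6]]

journal_on_disk = [p for k, p in completed_before_crash if k == "journal"]
fs_on_disk = [p for k, p in completed_before_crash if k == "fs"]

# Replay never finds commit #18, so the record freeing block 23 is
# discarded -- yet block 23's reallocated contents are already on the
# platter, so its old and new owners now conflict: silent corruption.
commit_durable = "commit #18" in journal_on_disk
assert not commit_durable
assert "write of reallocated block 23" in fs_on_disk
```

Note that the journal checksum is irrelevant here: the damage comes from the
non-journal writes landing early, not from a torn commit record.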

- Ted

2009-09-02 14:48:47

by Tom Vier

Subject: Re: Enable asynchronous commits by default patch revoked?

On Mon, Aug 24, 2009 at 06:07:39PM -0400, Theodore Tso wrote:
> Sure, agreed. But let's look a bit closer at what "async commit"
> really means.
>
> What ext3 and ext4 does by default is this:
>
> 1) Write data blocks required by data=ordered mode (if any)

Shouldn't there be a write barrier after data blocks, so that the journal
blocks aren't written first? i.e., mark the first journal block write with
the barrier flag set.

> 2) Write the journal blocks
>
> 3) Wait for the journal blocks to be sent to disk. (We don't actually
> do a barrier operation), so this just means the blocks have been sent
> to the disk, not necessarily that they are forced to a platter.
>
> 4) Write the commit block, with the barrier flag set.
>
> 5) Wait for the commit block.

--
Tom Vier <[email protected]>
DSA Key ID 0x15741ECE

2009-09-02 15:03:06

by Theodore Ts'o

Subject: Re: Enable asynchronous commits by default patch revoked?

On Wed, Sep 02, 2009 at 10:48:33AM -0400, Tom Vier wrote:
> On Mon, Aug 24, 2009 at 06:07:39PM -0400, Theodore Tso wrote:
> > Sure, agreed. But let's look a bit closer at what "async commit"
> > really means.
> >
> > What ext3 and ext4 does by default is this:
> >
> > 1) Write data blocks required by data=ordered mode (if any)
>
> Shouldn't there be a write barrier after data blocks, so that the journal
> blocks aren't written first? i.e., mark the first journal block write with
> the barrier flag set.

No, it doesn't matter, because the journal blocks are ignored until
the commit block is written. So the order in which the data blocks
required by data=ordered mode and the journal blocks are written can
be freely rearranged by the elevator and the hard drive without any
risk of data integrity problems.
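The default (!async_commit) ordering quoted earlier in the thread can be
sketched like this. The Device class and its submit/wait_all methods are toy
stand-ins for the buffer-layer calls, not real kernel interfaces:

```python
class Device:
    """Toy block device: submitted writes become durable on wait_all."""
    def __init__(self):
        self.log = []       # writes that have completed, in order
        self.pending = []   # writes submitted but not yet waited on
    def submit(self, block, barrier=False):
        # A barrier write must not complete before everything submitted
        # earlier; we model it by tagging the write.
        self.pending.append(("FLUSH+" if barrier else "") + block)
    def wait_all(self):
        self.log.extend(self.pending)
        self.pending = []

def commit_transaction(dev, data_blocks, journal_blocks, commit_block):
    for b in data_blocks:      # 1) data=ordered file data
        dev.submit(b)
    for b in journal_blocks:   # 2) journal blocks (any relative order
        dev.submit(b)          #    vs. step 1 is fine: ignored until
    dev.wait_all()             # 3) wait -- no barrier at this point
    dev.submit(commit_block, barrier=True)  # 4) commit record, barrier
    dev.wait_all()             # 5) wait for the commit record

dev = Device()
commit_transaction(dev, ["D1"], ["J1", "J2"], "C")
assert dev.log == ["D1", "J1", "J2", "FLUSH+C"]
```

The key invariant is only that the barriered commit record is the last write
to complete; everything before it may be scheduled freely.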

- Ted