2009-12-11 08:06:11

by Oleg Drokin

[permalink] [raw]
Subject: Potential data consistency issue with ASYNC_COMMIT feature

Hello!

I think ext4 ASYNC_COMMIT feature is potentially pretty unsafe
when write-back cache is enabled on the device.
Since no barriers are ever done with this feature even if
the barriers are enabled, we might end up in the situation
where we write the journal blocks, then commit block, they
hit the device write-back cache, after that actual metadata
blocks would be allowed to go to disk and eventually they will.

In the end the device might decide to reorder some of the
actual metadata updates in front of journal updates and
if metadata updates will hit the disk and a power or other
failure occurs after that, we have inconsistent filesystem
as a result.

I do not see an easy way to remedy the problem in this case
other than to insert empty barrier after the commit block
and wait for it completion, but I think that would negate
the entire gain from this feature. I wish we actually had
real ordered writes implemented, not just barrier/FUA
sent to the device before every ordered buffer.

Am I missing something?

Thanks.

Bye,
Oleg


2009-12-11 07:14:00

by Oleg Drokin

[permalink] [raw]
Subject: Re: Potential data consistency issue with ASYNC_COMMIT feature

Whoops, nevermind, it seems blkdev_issue_flush after commit does the barrier, I see it now.
It's just rhel5 kernel that is affected.

On Dec 11, 2009, at 1:45 AM, Oleg Drokin wrote:

> Hello!
>
> I think ext4 ASYNC_COMMIT feature is potentially pretty unsafe
> when write-back cache is enabled on the device.
> Since no barriers are ever done with this feature even if
> the barriers are enabled, we might end up in the situation
> where we write the journal blocks, then commit block, they
> hit the device write-back cache, after that actual metadata
> blocks would be allowed to go to disk and eventually they will.
>
> In the end the device might decide to reorder some of the
> actual metadata updates in front of journal updates and
> if metadata updates will hit the disk and a power or other
> failure occurs after that, we have inconsistent filesystem
> as a result.
>
> I do not see an easy way to remedy the problem in this case
> other than to insert empty barrier after the commit block
> and wait for it completion, but I think that would negate
> the entire gain from this feature. I wish we actually had
> real ordered writes implemented, not just barrier/FUA
> sent to the device before every ordered buffer.
>
> Am I missing something?
>
> Thanks.
>
> Bye,
> Oleg


2009-12-11 16:01:14

by Eric Sandeen

[permalink] [raw]
Subject: Re: Potential data consistency issue with ASYNC_COMMIT feature

Oleg Drokin wrote:
> Whoops, nevermind, it seems blkdev_issue_flush after commit does the barrier, I see it now.
> It's just rhel5 kernel that is affected.

rhel5 necessarily lags upstream, but updates will come.

We still need to do a lot of actual real-world testing of
lost caches in -all- ext4 journaling modes, I think.

-Eric

> On Dec 11, 2009, at 1:45 AM, Oleg Drokin wrote:
>
>> Hello!
>>
>> I think ext4 ASYNC_COMMIT feature is potentially pretty unsafe
>> when write-back cache is enabled on the device.
>> Since no barriers are ever done with this feature even if
>> the barriers are enabled, we might end up in the situation
>> where we write the journal blocks, then commit block, they
>> hit the device write-back cache, after that actual metadata
>> blocks would be allowed to go to disk and eventually they will.
>>
>> In the end the device might decide to reorder some of the
>> actual metadata updates in front of journal updates and
>> if metadata updates will hit the disk and a power or other
>> failure occurs after that, we have inconsistent filesystem
>> as a result.
>>
>> I do not see an easy way to remedy the problem in this case
>> other than to insert empty barrier after the commit block
>> and wait for it completion, but I think that would negate
>> the entire gain from this feature. I wish we actually had
>> real ordered writes implemented, not just barrier/FUA
>> sent to the device before every ordered buffer.
>>
>> Am I missing something?
>>
>> Thanks.
>>
>> Bye,
>> Oleg
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2009-12-11 20:52:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Potential data consistency issue with ASYNC_COMMIT feature

On Fri, Dec 11, 2009 at 02:14:01AM -0500, Oleg Drokin wrote:
> Whoops, nevermind, it seems blkdev_issue_flush after commit does the
> barrier, I see it now. It's just rhel5 kernel that is affected.

Yeah, the original ASYNC_COMMIT was totally unsafe, for the reason you
suggested; I was able to trivially induce fs corruption after a crash.
However, with the fixed async_commit code, in combination with journal
checksums, we can reduce the number of barrier ops per commit from two
to one, which increases the fs_mark by 50% (i.e., from 30 ops/sec to
45 ops/sec on a laptop hard drive).

However, journal checksums failed horribly when we tried to enable
them by default during the last merge window, because of bugs in ext4
where we were modifying certain metadata blocks (in particular
superblock and xattr's) without journalling them. (Note to self; we
need to back port those fixes to ext3; the lack of journalling in
xattr in particular could mean that in some cases we could lose some
updates that could affect SELINUX after a crash.)

I think we fixed them all for 2.6.33, but we haven't had time to do
the necessary testing before we enable journal checksums by default,
and after additional testing, I'd like to enable async commit by
default as well, since it means we'll beat the pants off of all of the
other journalling file systems (XFS and JFS are doing two barrier ops
per commit, if I recall correctly; not sure about btrfs) at least on
that particular benchmark. Unfortunately, we probably won't be able
to do that for 2.6.33; hopefully 2.6.34.

- Ted

2009-12-16 20:33:07

by Jan Kara

[permalink] [raw]
Subject: Re: Potential data consistency issue with ASYNC_COMMIT feature

> However, journal checksums failed horribly when we tried to enable
> them by default during the last merge window, because of bugs in ext4
> where we were modifying certain metadata blocks (in particular
> superblock and xattr's) without journalling them. (Note to self; we
> need to back port those fixes to ext3; the lack of journalling in
> xattr in particular could mean that in some cases we could lose some
> updates that could affect SELINUX after a crash.)
Already done by Eric and merged ;): d965736b8cb42ae51ba9c3f13488035a98d025c6,
b918397542388de75bd86c32fbfa820e5d629fa9.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs