2011-05-22 19:11:18

by Alex Bligh

Subject: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

I have been doing some testing to see what file systems successfully send
REQ_FLUSH after all writes to the file system in the case of an unmount.

Results so far:
1. ext2, ext3 (with default options), never send REQ_FLUSH
2. ext3 (with barrier=1) and ext4 do send REQ_FLUSH but then
send further writes afterwards.
3. btrfs and xfs do things right (i.e. either end with a REQ_FLUSH in
xfs's case, or a REQ_FLUSH and a REQ_FUA in btrfs's case)

So the first bug is that ext3 and ext4 appear to send writes (without a
subsequent flush/FUA) at the end of an unmount, and thus will never fully
flush a write-behind cache. They look like this (see the trace at the end
of this mail):

But quite aside from the question of whether the FS supports barriers,
should the kernel itself (rather than the FS) not be sending REQ_FLUSH on
an unmount as the last thing that happens? I.e. shouldn't we see a flush
even on (say) ext2, which is never going to support barriers? If the kernel
itself generated a REQ_FLUSH for the block device, this would keep
filesystems that don't support barriers safe provided the unmount
completed successfully and would have no impact on ones that had already
flushed the write-behind cache.

I have been using an instrumented version of nbd to test this (see
git.alex.org.uk). nbd in this instance is patched to support REQ_FLUSH
and REQ_FUA.

Trace from ext3 below (ext4 is similar)

--
Alex Bligh

> H=10ee1e1b0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=0000000002529000
L=00000400
> H=00d00b1f0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=0000000002531000
L=00000400
> H=082714110088ffff C=0x00000003 (NBD_CMD_FLUSH+NONE) O=0000000000000000
L=00000000
> H=68d10b1f0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=0000000002544400
L=00000400
> H=d0d20b1f0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=0000000002564400
L=00000400
> H=082714110088ffff C=0x00010001 (NBD_CMD_WRITE+ FUA) O=000000000112cc00
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=000000000103a000
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=000000000103a000
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=000000000103a000
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=000000000103a000
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=000000000103a000
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=000000000103a000
L=00000400
> H=d052c31a0088ffff C=0x00000001 (NBD_CMD_WRITE+NONE) O=0000000000000400
L=00000400
> H=88dcdd1b0088ffff C=0x00000002 ( NBD_CMD_DISC+NONE) O=fffffffffffffe00
L=00000000


2011-05-23 15:55:53

by Jan Kara

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

On Sun 22-05-11 20:11:08, Alex Bligh wrote:
> I have been doing some testing to see what file systems successfully send
> REQ_FLUSH after all writes to the file system in the case of an unmount.
>
> Results so far:
> 1. ext2, ext3 (with default options), never send REQ_FLUSH
> 2. ext3 (with barrier=1) and ext4 do send REQ_FLUSH but then
> send further writes afterwards.
> 3. btrfs and xfs do things right (i.e. either end with a REQ_FLUSH in
> xfs's case, or a REQ_FLUSH and a REQ_FUA in btrfs's case)
>
> So the first bug is that ext3 and ext4 appear to send writes (without a
> subsequent flush/FUA) before an unmount, and thus will never fully
> flush a write-behind cache. They look like this:
Yeah, I think ext3/4 write journal superblock and fs superblock without
issuing a barrier after everything is synced.

> But quite aside from the question of whether the FS supports barriers,
> should the kernel itself (rather than the FS) not be sending REQ_FLUSH on
> an unmount as the last thing that happens? I.e. shouldn't we see a flush
> even on (say) ext2, which is never going to support barriers? If the kernel
> itself generated a REQ_FLUSH for the block device, this would keep
> filesystems that don't support barriers safe provided the unmount
> completed successfully and would have no impact on ones that had already
> flushed the write-behind cache.
Yes, I think that generic VFS helpers should send barriers in cases where
it makes sense, and umount is one of them. There have even been some
attempts to do so, if I recall right, but they didn't go anywhere.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-23 17:09:23

by Alex Bligh

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

Jan,

--On 23 May 2011 17:55:50 +0200 Jan Kara <[email protected]> wrote:

>> But quite aside from the question of whether the FS supports barriers,
>> should the kernel itself (rather than the FS) not be sending REQ_FLUSH on
>> an unmount as the last thing that happens? I.e. shouldn't we see a flush
>> even on (say) ext2, which is never going to support barriers? If the
>> kernel itself generated a REQ_FLUSH for the block device, this would keep
>> filesystems that don't support barriers safe provided the unmount
>> completed successfully and would have no impact on ones that had already
>> flushed the write-behind cache.
>
> Yes, I think that generic VFS helpers should send barriers in cases
> where it makes sense and umount is one of them. There even have been some
> attempts to do so if I recall right but they didn't go anywhere.

Indeed I think even doing sync() on ext3 with default options does not
send a flush to the write cache. I had a quick look at the code (which
has got rather more complicated since the umount syscall moved from
super.c to namespace.c) and it seemed to me the best thing to do would
be for sync() on a block device to send a REQ_FLUSH to that device at
the end (assuming the comment about sync actually completing I/O rather
than merely initiating it still holds), and to ensure umount is calling
sync.

Would there be any interest in these patches if I cooked them up,
or did they die because of opposition before rather than apathy?

--
Alex Bligh

2011-05-23 17:29:15

by Jan Kara

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

On Mon 23-05-11 18:09:17, Alex Bligh wrote:
> Jan,
>
> --On 23 May 2011 17:55:50 +0200 Jan Kara <[email protected]> wrote:
>
> >>But quite aside from the question of whether the FS supports barriers,
> >>should the kernel itself (rather than the FS) not be sending REQ_FLUSH on
> >>an unmount as the last thing that happens? I.e. shouldn't we see a flush
> >>even on (say) ext2, which is never going to support barriers? If the
> >>kernel itself generated a REQ_FLUSH for the block device, this would keep
> >>filesystems that don't support barriers safe provided the unmount
> >>completed successfully and would have no impact on ones that had already
> >>flushed the write-behind cache.
> >
> > Yes, I think that generic VFS helpers should send barriers in cases
> >where it makes sense and umount is one of them. There even have been some
> >attempts to do so if I recall right but they didn't go anywhere.
>
> Indeed I think even doing sync() on ext3 with default options does not
> send a flush to the write cache. I had a quick look at the code (which
Yes, but that's rather a deficiency in the default mount options of ext3,
which is kept for backward buggy-for-performance compatibility. Anyone who
seriously cares about the data should use barrier=1, and BTW SUSE and RH
distros change the default to barrier=1. Anyway, this is a separate
issue.

> has got rather more complicated since the umount syscall moved from
> super.c to namespace.c) and it seemed to me the best thing to do would
> be for sync() on a block device to send a REQ_FLUSH to that device at
> the end (assuming the comment about sync actually completing I/O rather
> than merely initiating it still holds), and to ensure umount is calling
> sync.
I wish it was this simple ;) The trouble is that clever filesystems -
e.g. xfs, ext4 - will send the flush when it's needed (after a transaction
commit). So sending it after flushing the device (which happens from
generic sync code) would result in two flushes instead of one - not good
for performance (although these days when we do merging of flush requests
the result need not be that bad).

The fs might indicate whether it handles barriers itself or whether it
wants the VFS to handle it, but that's where it gets a bit complicated /
controversial ;).

> Would there be any interest in these patches if I cooked them up,
> or did they die because of opposition before rather than apathy?
I guess you might come with some proposal and post it to linux-fsdevel
(include Al Viro and Christoph Hellwig in CC) and see what happens...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-05-23 17:33:54

by Christoph Hellwig

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

On Mon, May 23, 2011 at 07:29:06PM +0200, Jan Kara wrote:
> Yes, but that's rather a deficiency in the default mount options of ext3,
> which is kept for backward buggy-for-performance compatibility. Anyone who
> seriously cares about the data should use barrier=1, and BTW SUSE and RH
> distros change the default to barrier=1. Anyway, this is a separate
> issue.

It really should be changed. The previous (bad) excuse was that the
ordering barrier code was too much overhead. Making a filesystem
non-safe by default is already a bad sin, but having the code to make
it safe around and not enabling it is plain criminal.

> > Would there be any interest in these patches if I cooked them up,
> > or did they die because of opposition before rather than apathy?
> I guess you might come with some proposal and post it to linux-fsdevel
> (include Al Viro and Christoph Hellwig in CC) and see what happens...

There's no way to make it generic. Fortunately adding support to a
filesystem is generally trivial, take a look at my recently added
cache flushing support for hfsplus for example.

2011-05-23 17:39:28

by Alex Bligh

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

Jan,

--On 23 May 2011 19:29:06 +0200 Jan Kara <[email protected]> wrote:

> I wish it was this simple ;) The trouble is that clever filesystems -
> e.g. xfs, ext4 - will send the flush when it's needed (after a transaction
> commit). So sending it after flushing the device (which happens from
> generic sync code) would result in two flushes instead of one - not good
> for performance (although these days when we do merging of flush requests
> the result need not be that bad).
>
> The fs might indicate whether it handles barriers itself or whether it
> wants the VFS to handle it, but that's where it gets a bit complicated /
> controversial ;).

Well, to "fix" sync(), one could simply look at whether the file system
had ever sent a REQ_FLUSH or REQ_FUA since that FS was mounted. If there
has been one, assume the FS is taking responsibility for sending them.
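Something like this, perhaps (untested sketch; the flag name and the exact
hook points are invented):

	/* in struct request_queue: */
	unsigned int		flush_seen:1;	/* FS has sent REQ_FLUSH/REQ_FUA */

	/* in the block layer, wherever flush/FUA requests are queued: */
	if (rq->cmd_flags & (REQ_FLUSH | REQ_FUA))
		q->flush_seen = 1;

	/* in the generic sync path, after sync_blockdev(bdev): only flush
	 * on behalf of filesystems that have never done it themselves */
	if (!bdev_get_queue(bdev)->flush_seen)
		blkdev_issue_flush(bdev, GFP_KERNEL, NULL);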

I'm presuming that if just umount() were altered to do a REQ_FLUSH,
the potential presence of 2 sync()s would not be too offensive, as
unmount isn't exactly time critical, and as Christoph pointed out in
the other thread, a REQ_FLUSH when the write cache has recently been
emptied isn't going to take long.

>> Would there be any interest in these patches if I cooked them up,
>> or did they die because of opposition before rather than apathy?
>
> I guess you might come with some proposal and post it to linux-fsdevel
> (include Al Viro and Christoph Hellwig in CC) and see what happens...

Ah, fsdevel not here. OK. Partly I'd like to understand whether
sync() not flushing write caches on barrier-less file systems
is a good thing or a bad thing. I know barriers are better, but if
writing to (e.g.) FAT32, I'm betting there is little prospect of
barrier support.

--
Alex Bligh

2011-05-23 17:52:09

by Christoph Hellwig

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

On Mon, May 23, 2011 at 06:39:23PM +0100, Alex Bligh wrote:
> I'm presuming that if just umount() were altered to do a REQ_FLUSH,
> the potential presence of 2 sync()s would not be too offensive, as
> unmount isn't exactly time critical, and as Christoph pointed out in
> the other thread, a REQ_FLUSH when the write cache has recently been
> emptied isn't going to take long.

Umount actually is the only place where adding it generically makes
sense. It's not time-critical, and with kill_block_super we actually
have a block specific place to put it, instead of having to hack
it into the generic VFS, which is something we've been trying to avoid.

> Ah, fsdevel not here. OK. Partly I'd like to understand whether
> sync() not flushing write caches on barrier-less file systems
> is a good thing or a bad thing. I know barriers are better, but if
> writing to (e.g.) FAT32, I'm betting there is little prospect of
> barrier support.

"Barrier" support is gone. It's really just the FUA and FLUSH
flags these days. For transactional filesystems these need to be
used to guarantee transaction integrity, but for all others just
adding one blkdev_issue_flush call to ->fsync and ->sync_fs is
enough. That's discounting filesystems that use multiple block
devices, which are a bit more complicated.
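For a non-transactional filesystem, the one-call change described above
would look roughly like this (untested sketch; "simplefs" and its
write_super helper are invented, cf. the hfsplus cache-flush patch):

	static int simplefs_sync_fs(struct super_block *sb, int wait)
	{
		/* write back our own dirty metadata first ... */
		int err = simplefs_write_super(sb);	/* invented helper */

		/* ... then ask the device to drain its volatile write cache */
		if (!err && wait)
			err = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
		return err;
	}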

2011-05-23 18:50:45

by Alex Bligh

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general

Christoph,

--On 23 May 2011 13:52:04 -0400 Christoph Hellwig <[email protected]> wrote:

> On Mon, May 23, 2011 at 06:39:23PM +0100, Alex Bligh wrote:
>> I'm presuming that if just umount() were altered to do a REQ_FLUSH,
>> the potential presence of 2 sync()s would not be too offensive, as
>> unmount isn't exactly time critical, and as Christoph pointed out in
>> the other thread, a REQ_FLUSH when the write cache has recently been
>> emptied isn't going to take long.
>
> Umount actually is the only place where adding it generically makes
> sense. It's not time-critical, and with kill_block_super we actually
> have a block specific place to put it, instead of having to hack
> it into the generic VFS, which is something we've been trying to avoid.

You mean like this (completely untested)?

diff --git a/fs/super.c b/fs/super.c
index 8a06881..a86201a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -852,6 +852,7 @@ void kill_block_super(struct super_block *sb)
 	bdev->bd_super = NULL;
 	generic_shutdown_super(sb);
 	sync_blockdev(bdev);
+	blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
 	WARN_ON_ONCE(!(mode & FMODE_EXCL));
 	blkdev_put(bdev, mode | FMODE_EXCL);
 }


One thing I am puzzled by is that blkdev_fsync unconditionally
calls blkdev_issue_flush, but no amount of fsync(), sync() or
whatever generates any REQ_FLUSH traffic. The only explanation
I can guess at is that blkdev_issue_flush is a no-op
if the driver doesn't have a make_request_fn:

	/*
	 * some block devices may not have their queue correctly set up here
	 * (e.g. loop device without a backing file) and so issuing a flush
	 * here will panic. Ensure there is a request function before issuing
	 * the flush.
	 */
	if (!q->make_request_fn)
		return -ENXIO;

According to Documentation/block/writeback_cache_control.txt, drivers
with a request_fn are still meant to get REQ_FLUSH etc. provided
they have done:

blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

So should that read (again untested) as follows:

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 6c9b5e1..3a6d4bd 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -408,7 +408,8 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 	 * here will panic. Ensure there is a request function before issuing
 	 * the flush.
 	 */
-	if (!q->make_request_fn)
+	if (!q->make_request_fn &&
+	    !(q->request_fn && (q->flush_flags & REQ_FLUSH)))
 		return -ENXIO;
 
 	bio = bio_alloc(gfp_mask, 0);


>> Ah, fsdevel not here. OK. Partly I'd like to understand whether
>> sync() not flushing write caches on barrier-less file systems
>> is a good thing or a bad thing. I know barriers are better, but if
>> writing to (e.g.) FAT32, I'm betting there is little prospect of
>> barrier support.
>
> "Barrier" support is gone. It's really just the FUA and FLUSH
> flags these days.

Sorry - slack terminology on my part.

--
Alex Bligh

2011-05-23 18:56:09

by Alex Bligh

Subject: Re: BUG: Failure to send REQ_FLUSH on unmount on ext3, ext4, and FS in general



--On 23 May 2011 13:33:50 -0400 Christoph Hellwig <[email protected]> wrote:

> It really should be changed. The previous (bad) excuse was that the
> ordering barrier code was too much overhead. Making a filesystem
> non-safe by default is already a bad sin, but having the code to make
> it safe around and not enabling it is plain criminal.

At the risk of wandering into an ext3-defaults flamewar, I suppose people,
foolishly or not, might want to live dangerously. If I'm creating an image on a
loopback mount, or copying stuff to a removable harddrive, perhaps I don't
really care if data gets corrupted if the power drops while I am writing,
as I can recreate the data. I'm always going to have that issue even with
FAT32 and other dinosaurs (ext2). I *do*, however, care if unmount()
doesn't write all the data, especially as "sync; sync; sync" doesn't
appear to write the data either.

--
Alex Bligh