I am doing some work on making REQ_FLUSH and REQ_FUA work with block devices
and have some patches that make them perform as expected with nbd if
an nbd device is mounted (e.g. -t ext3 -o data=journal,barrier=1), and
I see the relevant REQ_FLUSH and REQ_FUA appearing much as expected.
However, if I do a straight dd to the device (which generates an open()
and a close()), I see no barrier activity at all (i.e. no REQ_FLUSH and
no REQ_FUA). It is surprising to me that a close() on a raw device does
not generate a REQ_FLUSH; I cannot imagine the performance overhead would
be significant. I would have thought this would be useful anyway (if I've
written to a raw device I'd rather expect the data to hit it when I do
the close()), but my specific application is ensuring cache coherency on
live migration
of virtual servers: if migrating from node A to node B, then when the
hypervisor closes the block device on node A, I want to be sure that any
locally cached write data is written to the remote disk before it
unfreezes node B.
Should a close() of a dirty block device result in a REQ_FLUSH?
--
Alex Bligh
On Thu, May 19, 2011 at 04:06:27PM +0100, Alex Bligh wrote:
> Should a close() of a dirty block device result in a REQ_FLUSH?
No, why would it? That's what fsync is for.
--On 20 May 2011 08:20:10 -0400 Christoph Hellwig <[email protected]> wrote:
> On Thu, May 19, 2011 at 04:06:27PM +0100, Alex Bligh wrote:
>> Should a close() of a dirty block device result in a REQ_FLUSH?
>
> No, why would it? That's what fsync is for.
I had thought fsync() was meant to be implicit in a close() of a raw
device, though perhaps that's my faulty memory; I take it you are saying
it's up to userspace to issue the fsync() itself; fair enough.
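For what it's worth, a minimal sketch of what that userspace fix looks
like (the function name and the error handling are mine, not from any
real tool):

#include <fcntl.h>
#include <unistd.h>

/* Flush a raw block device explicitly before close(); close() alone
 * does not imply an fsync(). */
int write_and_flush(const char *dev, const void *buf, size_t len)
{
        int fd = open(dev, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                return -1;
        }
        return close(fd);  /* the fsync() above is what issues the flush */
}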
However, I'm also seeing writes to the device after the last flush when
the device is unmounted. Specifically, a sequence ending
mount -t ext3 -odata=journal,barrier=1 /dev/nbd0 /mnt
(cd /mnt ; tar cvzf /dev/null . ; sync) 2>&1 >/dev/null
dbench -D /mnt 1 &
sleep 10
killall dbench
sleep 2
killall -KILL dbench
sync
umount /mnt
produces these commands (at the end):
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_FLUSH [NONE] (0x00000003)
Sending command: NBD_CMD_WRITE [FUA] (0x00010001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
Sending command: NBD_CMD_WRITE [NONE] (0x00000001)
(I'm testing this out by adding flush and FUA support to nbd; see
git.alex.org.uk if this is interesting).
What I am concerned about is that relatively normal actions (e.g. unmount
a filing system) do not appear to be flushing all data, even though I
did "sync" then "umount". I suspect the sync is generating the FLUSH here,
and nothing is flushing the umount writes. How can I know as a block
device that I have to write out a (long lasting) writeback cache if
I don't receive anything beyond the last WRITE?
--
Alex Bligh
On Sat, May 21, 2011 at 09:42:45AM +0100, Alex Bligh wrote:
> What I am concerned about is that relatively normal actions (e.g. unmount
> a filing system) do not appear to be flushing all data, even though I
> did "sync" then "umount". I suspect the sync is generating the FLUSH here,
> and nothing is flushing the umount writes. How can I know as a block
> device that I have to write out a (long lasting) writeback cache if
> I don't receive anything beyond the last WRITE?
In your case it seems like ext3 is doing something wrong. If you
run the same on XFS, you should not only see the last real write
having FUA and FLUSH as it's a transaction commit, but also an
explicit cache flush when devices are closed from the filesystem
to work around issues like that. But the raw block device node
really doesn't behave differently from a file and shouldn't cause
any fsync on close.
Btw, using sync_file_range is a really bad idea. It will not actually
flush the disk cache on the server, nor make sure metadata is committed in
case of a sparse or preallocated file, and thus does not implement
the FLUSH or FUA semantics correctly.
And btw, I'd like to know what makes sync_file_range so tempting,
even after I added documentation to the man page explaining why it's
almost always wrong to use it.
Christoph,
--On 22 May 2011 06:44:49 -0400 Christoph Hellwig <[email protected]> wrote:
> On Sat, May 21, 2011 at 09:42:45AM +0100, Alex Bligh wrote:
>> What I am concerned about is that relatively normal actions (e.g. unmount
>> a filing system) do not appear to be flushing all data, even though I
>> did "sync" then "umount". I suspect the sync is generating the FLUSH
>> here, and nothing is flushing the umount writes. How can I know as a
>> block device that I have to write out a (long lasting) writeback cache if
>> I don't receive anything beyond the last WRITE?
>
> In your case it seems like ext3 is doing something wrong. If you
> run the same on XFS, you should not only see the last real write
> having FUA and FLUSH as it's a transaction commit, but also an
> explicit cache flush when devices are closed from the filesystem
> to work around issues like that.
OK. Sounds like an ext3 bug then. I will test with xfs, ext4 and btrfs
and see if they exhibit the same symptoms, and come back with a more
appropriate subject line.
> But the raw block device node
> really doesn't behave differently from a file and shouldn't cause
> any fsync on close.
Fair enough. I will check whether the hypervisor concerned is doing
an fsync() or equivalent in the right place.
> Btw, using sync_file_range is a really bad idea. It will not actually
> flush the disk cache on the server, nor make sure metadata is committed in
> case of a sparse or preallocated file, and thus does not implement
> the FLUSH or FUA semantics correctly.
>
> And btw, I'd like to know what makes sync_file_range so tempting,
> even after I added documentation to the man page explaining why it's
> almost always wrong to use it.
I think you are referring to this (which in my defence wasn't in my
local copy of the manpage).
> This system call is extremely dangerous and should not be used in
> portable programs. None of these operations writes out the file's
> metadata. Therefore, unless the application is strictly performing
overwrites of already-instantiated disk blocks, there are no
> guarantees that the data will be available after a crash. There is no
> user interface to know if a write is purely an overwrite. On file
> systems using copy-on-write semantics (e.g., btrfs) an overwrite of
> existing allocated blocks is impossible. When writing into preallocated
> space, many file systems also require calls into the block allocator,
> which this system call does not sync out to disk. This system call
> does not flush disk write caches and thus does not provide any data
> integrity on systems with volatile disk write caches.
So, the file in question is not mmap'd (it's an nbd disk). fsync() /
fdatasync() is too expensive as it will sync everything. As far as I can
tell, this is no more dangerous re metadata than fdatasync() which also
does not sync metadata. I had read the last sentence as "this system
call does not *necessarily* flush disk write caches" (meaning "if you
haven't mounted e.g. ext3 with barriers=1, then you can't ensure write
caches write through"), as opposed to "will not ever flush disk write
caches", and given mounting ext3 without barriers=1 produces no FUA or
FLUSH commands in normal operation anyway (as far as light debugging
can see) that's not much of a loss.
But rather than trying to justify myself: what is the best way to
emulate FUA, i.e. ensure a specific portion of a file is synced before
returning, without ensuring the whole lot is synced (which is far too
slow)? The only other option I can see is to open the file with a second
fd, mmap the chunk of the file (it may be larger than the available
virtual address space), msync it with MS_SYNC, then fsync, then munmap
and close, and hope the fsync doesn't spit anything else out. This
seems a little excessive, and I don't even know whether it would work.
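To make the idea concrete, this is roughly what I have in mind (untested,
names are mine, and the offset passed to mmap() would have to be
page-aligned):

#include <sys/mman.h>
#include <unistd.h>

/* Sync just [offset, offset+len) of the backing file via a temporary
 * mapping; fd must be open read/write for the PROT_WRITE mapping.
 * Whether msync() here reliably pushes out pages dirtied through
 * write() on the other fd is exactly the part I am unsure about. */
static int sync_range_via_mmap(int fd, off_t offset, size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, offset);

        if (p == MAP_FAILED)
                return -1;
        if (msync(p, len, MS_SYNC) != 0) {
                munmap(p, len);
                return -1;
        }
        return munmap(p, len);
}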
I guess given NBD currently does nothing at all to support barriers,
I thought this was an improvement!
--
Alex Bligh
> So, the file in question is not mmap'd (it's an nbd disk). fsync() /
> fdatasync() is too expensive as it will sync everything. As far as I can
> tell, this is no more dangerous re metadata than fdatasync() which also
> does not sync metadata. I had read the last sentence as "this system
> call does not *necessarily* flush disk write caches" (meaning "if you
> haven't mounted e.g. ext3 with barriers=1, then you can't ensure write
> caches write through"), as opposed to "will not ever flush disk write
> caches", and given mounting ext3 without barriers=1 produces no FUA or
> FLUSH commands in normal operation anyway (as far as light debugging
> can see) that's not much of a loss.
ext3 without barriers does not guarantee any data integrity and will lose
your data in an eye blink if you have a large enough cache.
fdatasync is equivalent to fsync except that it does not flush
non-essential metadata (basically just timestamps in practice), but it
does flush metadata required to find the data again, e.g. allocation
information and extent maps. sync_file_range does nothing but flush
out pagecache content - it means you basically won't get your data
back in case of a crash if you either:
a) have a volatile write cache in your disk (e.g. any normal SATA disk)
b) are using a sparse file on a filesystem
c) are using a fallocate-preallocated file on a filesystem
d) use any file on a COW filesystem like btrfs
i.e. it only does anything useful for you if you do not have a volatile
write cache, and either use a raw block device node, or just overwrite
an already fully allocated (and not preallocated) file on a non-COW
filesystem.
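In other words, for something like an nbd server, a FLUSH against the
backing file should simply be an fdatasync() (a sketch; the fd and
function names are illustrative, not from any real nbd-server code):

#include <unistd.h>

/* Emulate a FLUSH: fdatasync() writes out the data, the allocation
 * metadata needed to find it again, and triggers a disk cache flush --
 * all the things sync_file_range() does not do. */
static int handle_flush(int backing_fd)
{
        return fdatasync(backing_fd);
}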
> But rather than trying to justify myself: what is the best way to
> emulate FUA, i.e. ensure a specific portion of a file is synced before
> returning, without ensuring the whole lot is synced (which is far too
> slow)? The only other option I can see is to open the file with a second
> fd, mmap the chunk of the file (it may be larger than the available
> virtual address space), msync it with MS_SYNC, then fsync, then munmap
> and close, and hope the fsync doesn't spit anything else out. This
> seems a little excessive, and I don't even know whether it would work.
You can have a second FD with O_DSYNC open and write to that. But for
NBD and a Linux guest that won't make any difference yet. While REQ_FUA
is a separate flag so far it's only used in combination with REQ_FLUSH,
so the only pattern you'll see REQ_FUA used in is:
REQ_FLUSH
REQ_FUA
which means there's no data but the one just written in the cache.
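As a sketch of the second-fd approach (names are mine, not from nbd):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep one fd opened with O_DSYNC and direct FUA writes at it: each
 * pwrite() returns only once the data and the metadata needed to read
 * it back are stable. */
int open_fua_fd(const char *path)
{
        return open(path, O_WRONLY | O_DSYNC);
}

ssize_t fua_write(int dsync_fd, const void *buf, size_t len, off_t off)
{
        return pwrite(dsync_fd, buf, len, off);
}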
Christoph,
> ext3 without barriers does not guarantee any data integrity and will lose
> your data in an eye blink if you have a large enough cache.
This doesn't appear to stop people using it :-)
> fdatasync is equivalent to fsync except that it does not flush
> non-essential metadata (basically just timestamps in practice), but it
> does flush metadata required to find the data again, e.g. allocation
> information and extent maps. sync_file_range does nothing but flush
> out pagecache content - it means you basically won't get your data
> back in case of a crash if you either:
>
> a) have a volatile write cache in your disk (e.g. any normal SATA disk)
> b) are using a sparse file on a filesystem
> c) are using a fallocate-preallocated file on a filesystem
> d) use any file on a COW filesystem like btrfs
>
> i.e. it only does anything useful for you if you do not have a volatile
> write cache, and either use a raw block device node, or just overwrite
> an already fully allocated (and not preallocated) file on a non-COW
> filesystem.
Thanks, that's really useful.
>> But rather than trying to justify myself: what is the best way to
>> emulate FUA, i.e. ensure a specific portion of a file is synced before
>> returning, without ensuring the whole lot is synced (which is far too
>> slow)? The only other option I can see is to open the file with a second
>> fd, mmap the chunk of the file (it may be larger than the available
>> virtual address space), msync it with MS_SYNC, then fsync, then munmap
>> and close, and hope the fsync doesn't spit anything else out. This
>> seems a little excessive, and I don't even know whether it would work.
>
> You can have a second FD with O_DSYNC open and write to that.
Fantastic - I shall do that in the long term.
> But for
> NBD and a Linux guest that won't make any difference yet.
As far as I know, nbd only has Linux clients. It certainly only has
Linux clients that transmit flush and FUA because I only added that to
the protocol last week :-)
> While REQ_FUA
> is a separate flag so far it's only used in combination with REQ_FLUSH,
> so the only pattern you'll see REQ_FUA used in is:
>
> REQ_FLUSH
> REQ_FUA
>
> which means there's no data but the one just written in the cache.
I think what you are saying is that when the request with REQ_FUA arrives,
it will have been immediately preceded by a REQ_FLUSH. Therefore, I will
only have the data attached to the request with REQ_FUA to flush anyway, so
an fdatasync() does no harm performance wise. That's what I'm currently
doing if sync_file_range() is not supported. It sounds like that's what I
should be doing all the time. If you don't mind, I shall borrow your
text above and put it in the source.
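Concretely, the fallback ends up looking something like this (a sketch
only; the helper and fd names are mine):

#include <sys/types.h>
#include <unistd.h>

/* A FUA write is handled as an ordinary write followed by fdatasync().
 * Since the kernel currently only issues REQ_FUA straight after a
 * REQ_FLUSH, the fdatasync() has little beyond this write left to push. */
static int handle_write(int backing_fd, const void *buf, size_t len,
                        off_t off, int fua)
{
        if (pwrite(backing_fd, buf, len, off) != (ssize_t)len)
                return -1;
        return fua ? fdatasync(backing_fd) : 0;
}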
--
Alex Bligh
On Sun, May 22, 2011 at 01:00:41PM +0100, Alex Bligh wrote:
> I think what you are saying is that when the request with REQ_FUA arrives,
> it will have been immediately preceded by a REQ_FLUSH. Therefore, I will
> only have the data attached to the request with REQ_FUA to flush anyway, so
> an fdatasync() does no harm performance wise. That's what I'm currently
> doing if sync_file_range() is not supported. It sounds like that's what I
> should be doing all the time. If you don't mind, I shall borrow your
> text above and put it in the source.
Sure, feel free to borrow it. Note that I have a mid-term plan to
actually use REQ_FUA without a preceding REQ_FLUSH in XFS, but even in
that case the write cache probably won't be too full.
Long term, someone who cares enough should simply submit patches
for range fsync/fdatasync syscalls. We already have all the
infrastructure for it in the kernel, as it's used by the O_SYNC/O_DSYNC
implementation and nfsd, so it's just the actual syscall entry points
that need to be added.
On 05/22/2011 06:44 AM, Christoph Hellwig wrote:
> And btw, I'd like to know what makes sync_file_range so tempting,
> even after I added documentation to the man page explaining why it's
> almost always wrong to use it.
Because Linus used it to describe how to stream data to disk, such as
what MythTV does. Example:
http://lkml.indiana.edu/hypermail/linux/kernel/0904.0/01076.html
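The pattern described there is roughly the following (reconstructed from
memory, so treat the exact flags and chunk size as my assumptions rather
than a quote of the post):

#define _GNU_SOURCE
#include <fcntl.h>

#define CHUNK (8 * 1024 * 1024)

/* Called after each CHUNK of streamed data has been written at 'offset':
 * start writeback of the chunk just written, and wait for the previous
 * chunk, so dirty pages never pile up.  As noted above, this gives no
 * integrity guarantee -- it only smooths out writeback for streaming. */
void stream_chunk_written(int fd, off_t offset)
{
        sync_file_range(fd, offset, CHUNK, SYNC_FILE_RANGE_WRITE);
        if (offset >= CHUNK)
                sync_file_range(fd, offset - CHUNK, CHUNK,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
}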