2008-02-26 07:27:12

by Jamie Lokier

Subject: Proposal for "proper" durable fsync() and fdatasync()

Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything. Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled. Databases and guest VMs need this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache. But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already. There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss. (For guest VMs, fsync() isn't
necessary if the host machine stays up, and it isn't enough (on a
Linux host) if the host machine loses power or the hard disk fails
another way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics. I thought "how do I ensure ordered
journal writes; can I use fdatasync()?" and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:

http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that it would be fixed in 2.6. So I bet
some people are under the illusion that it's fixed in 2.6...


For a while, I've been meaning to bring it up on linux-kernel...


The fsync problem
-----------------

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +0000, Jamie Lokier wrote:
>
> > On Linux (and other host OSes), fdatasync() and fsync() don't always
> > commit data to hard storage; it sometimes only commits it to the hard
> > drive cache.
>
> That's a filesystem bug IMO. People should be able to use f[data]sync
> with some level of confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync(). Considering how much Linux
is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable. Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering. This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, the same applies...

Linux fsync is faulty in two ways:

1. Database commits aren't _durable_ against power failure, because
fsync doesn't flush the disk's cache. This means committed data
is not guaranteed to have the expected durability.

2. It's unsafe for write-ahead logging, because it doesn't really
guarantee any _ordering_ for the writes at the hard storage
level. So aside from losing committed data, it can also corrupt
structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry. (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats. If the file's mtime second _hasn't_ changed, the
inode isn't written. If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and fewer than 20.

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier. If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.
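As a sketch, the fchmod() trick looks like this from userspace. The
file path, the particular mode bit toggled, and the function name are
my own choices; whether a journal write and barrier actually result
depends on the filesystem (ext3 with barriers, in the discussion
above) and mount options:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write data, then dirty the inode so the following fsync() has
 * metadata to commit, forcing a journal record (and, with barriers
 * enabled, a disk cache flush) on filesystems like ext3. */
static int journal_write(int fd, const void *buf, size_t len, off_t off)
{
    struct stat st;

    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    if (fstat(fd, &st) != 0)
        return -1;
    /* Toggle an unused permission bit; this dirties the inode. */
    if (fchmod(fd, st.st_mode ^ S_IXOTH) != 0)
        return -1;
    /* fsync() now must write the inode, so ext3 commits a journal
     * record, which is issued with a write barrier. */
    return fsync(fd);
}
```

As the text notes, this buys ordering at the cost of an extra inode
write and a seek per commit, and it still isn't a durability
guarantee if barriers are ever implemented with ordered tags alone.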

However, even this ugly trick has problems:

3. Using the fchmod() trick or good fortune, fsync() issues a write
barrier. Right now, this does commit data (if the device can).
But, if the SCSI mid-layer is fixed to use tag ordering, this
won't commit data! Therefore, the fchmod() trick with fsync() is
good enough for ordering writes for, e.g. a database journal, but
not for reporting that data is committed to hard storage,
i.e. it's not durable.

4. Again using the trick or good fortune, now you have two writes at
different parts of the disk, with a great big seek. This is a
disaster for database-style journalling. One of the writes is
technically unnecessary, and the seeks add hugely to the commit
time and disk wear, and break any attempt to optimise journal
placement.

Linux has not only fsync(), but fdatasync() and sync_file_range().

Someone clearly put thought into a reasonably performant API for
database-like applications. (It would be nicer if sync_file_range()
took a vector of ranges for better elevator scheduling, but let's
ignore that :-)

Yet, it isn't safe for the simplest of journalling applications.

If you think this isn't a problem, I can tell you: it is. Power
failures happen, sometimes by design. I've seen filesystem corruption
in ext3 filesystems before journalling barriers were added; it wasn't
pretty, and it was enough of a problem that a lot of work was done to
add them cleanly.

The same corruption can happen to databases and guest VM filesystems
with current kernels.


Implementation proposal - block layer
-------------------------------------

Solving this, i.e. implementing fsync() and friends properly, isn't
trivial, but it isn't huge either.

Firstly, we have to look at the elevator and block driver APIs. It's
worth reading Documentation/block/barrier.txt. You can queue a
request with HARDBARRIER. On devices which use ordering tags
(i.e. none because of SCSI driver limitations at present, according to
that doc), it uses ordering tags. On other devices, if possible, it
uses cache flush commands and/or sets the FUA ("force unit access")
bit on the request.

Now imagine a database (guest VM, etc.) issues some writes. Time
passes. The writes are written to the disk's cache. Then the
database calls fsync(). What kind of request shall we send to the
block device? We have _no_ outstanding read or write requests to
attach HARDBARRIER to.

So, that's the first thing: the block API needs a way to send that
fsync flush _without_ an associated read or write, and for the fsync()
system call to return when that flush indicates completion. Let's
call this request HARDFLUSH (similar to HARDBARRIER).

The second thing is that the flush cannot be equivalent to a
HARDBARRIER attached to a NOP request, because HARDBARRIER provides
ordering only, at least in principle. It must be a real flush.

Sometimes, there _are_ writes pending. If there's only one since the
last flush, it could be optimised into a HARDBARRIER-FUA request,
which (assuming FUA is ever useful) is good for databases which have
exactly this pattern for their journal writes.

So, that's the third thing: we'd like to coalesce an fsync flush
request with a preceding undispatched write request if there is only
one write pending since the last flush. Note: it must use
HARDBARRIER-FLUSH or HARDBARRIER-FUA, not HARDBARRIER-TAG alone. If
tag ordering is used, follow it with HARDFLUSH. Tag ordering before
the write is fine, but not enough after.


I/O request queue optimisations
-------------------------------

If there's only one write since the last flush, it may be possible to
set the FUA bit on that write instead of flushing after it.

There's no need to send a HARDFLUSH request if there have been no
write requests since the last flush (FUA or explicit), but non-flush
ordering tags don't count.

"Only one write pending" and "no write requests" can actually count
writes which originated from the file being synced; they don't need to
consider writes for other files.

When fsync() issues HARDFLUSH, the POSTFLUSH which is _currently_
issued with HARDBARRIER filesystem requests won't be required any
longer. It could be deferred, safely and maybe profitably, until
before the next write. This doesn't compromise filesystem integrity
(it's equivalent behaviour to tagged ordering), and it doesn't
compromise fsync() when fsync() does force the flushing.


Ordering of HARDFLUSH and HARDBARRIER
-------------------------------------

At first it may seem that HARDFLUSH is always stronger than
HARDBARRIER; i.e. that one includes the effect of the other. This is
not true: writes can be moved before a HARDFLUSH, if the elevator
wants, but writes cannot be moved before a HARDBARRIER. Another point
of view is that a HARDFLUSH can be safely delayed while other writes
proceed, perhaps to coalesce it with something.

Therefore, when queuing a request, both flags must be used together if
that's intended. There are scenarios where either flag alone is
useful, or both together.

When a request has both HARDFLUSH and HARDBARRIER flags, it is
permitted to split it into two requests, to move later writes before
the HARDFLUSH but not before the HARDBARRIER. This might be
advantageous in some scenarios using tagged ordering: delaying
flushes, perhaps to coalesce them, can be useful. It is obviously
useless when barriers are implemented using flush.


Block drivers
-------------

These need the ability to receive a HARDFLUSH request by itself or
combined with a write (after it). HARDFLUSH must have the option of
being combined with the HARDBARRIER flag, just like other requests.
When HARDBARRIER is itself implemented using a flush or FUA, they
simply combine. But when HARDBARRIER is using ordered tags, then this
ordering still must apply to the flush command.


Software RAID (etc.) drivers
----------------------------

HARDFLUSH can optionally be confined to a subset of the underlying
devices. Thus it is reasonable for HARDFLUSH to be associated with a
sector range, which these drivers can use to select which devices to
flush.

HARDBARRIER can optionally be associated with a sector range too. For
certain purposes, that means to wait for writes before the barrier
only in the corresponding range. But be careful: it still orders
_all_ writes after the barrier, regardless of which underlying device
they reach. Thus there are cross-device barriers.

To implement cross-device barriers, HARDBARRIERs must convert to
flushes, when followed by writes to other underlying devices, but can
use tagged ordering when followed only by writes to the same
underlying device, if there is only one. Here be dragons, take care.

The easy way out, albeit not quite optimal, is to always convert
barriers to flushes on all underlying devices, which I think the
existing implementation does.


Filesystems
-----------

The fsync() methods should issue a HARDFLUSH after/with the journal
write, in addition to HARDBARRIER as is used now. This may involve
adding a flag to the journalling code of each filesystem.

The proposed sync_file_range() enhancements might have interesting
consequences for how and when filesystem metadata is written, when new
blocks are allocated.


Userspace API enhancements
--------------------------

It is questionable whether fsync() and fdatasync() should always
implement hard flushes. Immediately, there will be complaints that
Linux got much slower with some databases.

I read rumours that Mac OS X encountered this, and because it looks
bad, decided to keep hard flushes separate, using fcntl(F_FULLFSYNC).
I don't think there is a hard flush equivalent to fdatasync().
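For illustration, here is roughly what the Mac OS X approach looks
like from userspace. F_FULLFSYNC is the Darwin fcntl mentioned above;
the #ifdef fallback to plain fsync() is my own addition so the sketch
compiles everywhere (and because F_FULLFSYNC can fail on filesystems
that don't support it):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Request a "hard" flush to stable storage where the OS offers one,
 * degrading to an ordinary fsync() otherwise. */
static int full_fsync(int fd)
{
#ifdef F_FULLFSYNC
    /* Darwin: ask the drive to flush its write cache too. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Fall through: unsupported filesystem, try plain fsync(). */
#endif
    return fsync(fd);
}
```

On Linux this compiles to the fallback branch only, which is exactly
the gap this proposal is about.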

I'm thinking it should be a per-filesystem (and/or system-wide
default, and/or per-file-descriptor) flag whether fsync() and
fdatasync() implement hard flushes.

For proper application control, we have the flags in
sync_file_range(). I propose that additional flags be added.

Just to be a bit cheeky and versatile, I propose that the additional
flags indicate when hard flushing is required, when it's explicitly
not required (overriding a system default for fsync), and orthogonally
(since it is orthogonal) do the same for hard barriers. I'm sure some
databases and userspace filesystems would appreciate the various options.

To add to the cheekiness, I propose that the API _allow_ but not
require that individual pages (actually bytes) keep track of whether
they have been followed by a hard barrier and/or hard flush. The
implementation doesn't have to do that: it can be much coarser. It's
nice if the API allows the possibility to refine the implementation
later.

Finally, support for flushes and/or barriers between O_DIRECT writes
is essential for some applications.


Proposal for sync_file_range()
------------------------------

Logically, associate with each page (or byte, block, file...) some flags:

    hardbarrier = { needed, pending, clean }
    hardflush   = { needed, pending, clean }

These flags are maintained at whatever granularity is convenient.

In addition, flags are maintained at whatever granularity is
convenient with O_DIRECT too. This might be the file or file
descriptor, and/or the flags may be associated with each underlying
device in a software RAID.

Note: this is not as invasive as it sounds. A simple implementation
can maintain those two flags for the file as a whole (not per page),
or even just the block device as a whole; that's easy. We describe it
with fine granularity conceptually, to allow it in principle, as it
appears in the new API description of sync_file_range().

When a dirty page is scheduled for write-out (by any mechanism), and
the write-out completes, it is marked as clean. When this occurs,
mark the page as "hardbarrier-needed" and "hardflush-needed", to
indicate it is written to the block device, but not committed to hard
storage.

When a HARDBARRIER or HARDFLUSH request is enqueued to a device (not
when it's issued), for all pages backed by the device, change the
flags to "hardbarrier-pending" and/or "hardflush-pending" if they were
"-needed". When such a request completes (successfully?), set the
appropriate flags to "hardbarrier-clean" and/or "hardflush-clean".
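A toy userspace model may make the transitions above clearer. The
function names are mine, and the proposal deliberately leaves the
granularity (page, file, device) open; only the "hardflush" side is
modelled, the "hardbarrier" side being identical in shape:

```c
/* Per-page hard-flush state, as described in the proposal:
 * clean -> needed when write-out completes (data in drive cache),
 * needed -> pending when a hard flush is enqueued to the device,
 * pending -> clean when that flush completes. */
enum hardflush_state { HF_CLEAN, HF_NEEDED, HF_PENDING };

/* Write-out of a dirty page completed: the data reached the block
 * device, but not necessarily stable storage. */
static enum hardflush_state on_writeout_done(enum hardflush_state s)
{
    (void)s;
    return HF_NEEDED;
}

/* A hard flush covering this page was enqueued (not yet issued). */
static enum hardflush_state on_flush_enqueued(enum hardflush_state s)
{
    return s == HF_NEEDED ? HF_PENDING : s;
}

/* The hard flush completed successfully. */
static enum hardflush_state on_flush_done(enum hardflush_state s)
{
    return s == HF_PENDING ? HF_CLEAN : s;
}
```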

New flags:

SYNC_FILE_RANGE_HARD_FLUSH
    If SYNC_FILE_RANGE_WRITE is set, and any dirty page write-outs
    are initiated, queue a hard flush following the last one. If
    there are no dirty pages, check the "hardflush" flags
    corresponding to all pages in the range, and corresponding to
    O_DIRECT for this file descriptor. If any are
    "hardflush-needed", or the page range is empty, queue a hard
    flush soon. In the empty page range case, set
    "hardflush-needed" in the flags corresponding to O_DIRECT,
    so that waiting for an empty page range will wait for it.

    If SYNC_FILE_RANGE_WAIT_BEFORE and/or
    SYNC_FILE_RANGE_WAIT_AFTER are set, after waiting for all
    write-outs to complete, check the "hardflush" flags
    corresponding to all pages in the range, and corresponding to
    O_DIRECT for this file descriptor. If any are set to
    "hardflush-needed", queue a hard flush, then wait until they
    are all "hardflush-clean".

SYNC_FILE_RANGE_HARD_BARRIER
    Same as SYNC_FILE_RANGE_HARD_FLUSH, except that "hardbarrier"
    is used instead of "hardflush", and hard barrier requests are
    queued instead of hard flushes.

    Important: SYNC_FILE_RANGE_HARD_BARRIER is a barrier only for
    writes in the specified range _before_ the barrier, but it
    controls _all_ writes to any offset after the barrier. This
    is because there's no point in the barrier controlling offsets
    other than those where write-outs have been explicitly
    requested, and this has the practical benefit of reducing
    flushes in multi-device configurations, but acting as a
    barrier against later writes for other offsets is very useful.

    Note that this flag is not normally used if
    SYNC_FILE_RANGE_HARD_FLUSH is used in conjunction with
    SYNC_FILE_RANGE_WAIT_AFTER or SYNC_FILE_RANGE_FSYNC. Those
    combinations wait until data is written and hard flushed
    before returning, so there is no way for the caller to issue
    more requests logically after the barrier, until the data is
    flushed anyway. In these cases, using a barrier only
    penalises other processes for no gain. However, you can do
    so; it is not forbidden.

SYNC_FILE_RANGE_NO_FLUSH
    If the system is administratively set to issue hard flushes
    for fsync(), fdatasync() and sync_file_range(), which means it
    implicitly sets SYNC_FILE_RANGE_HARD_FLUSH, this flag
    _disables_ the implicit setting of that flag. This does not
    guarantee no hard flush occurs; it merely disables asking for
    it. This has no effect on SYNC_FILE_RANGE_HARD_BARRIER.

SYNC_FILE_RANGE_NO_BARRIER
    Same as SYNC_FILE_RANGE_NO_FLUSH, except it affects implicit
    SYNC_FILE_RANGE_HARD_BARRIER instead. This has no effect on
    SYNC_FILE_RANGE_HARD_FLUSH.

SYNC_FILE_RANGE_FSYNC
    Write any additional metadata that fsync() would include over
    fdatasync(), and wait for those writes to complete. It might,
    potentially, do everything that fsync() does, including
    writing all data and waiting for it, even without setting any
    other flags. Or it might just write the metadata.

    This flag allows you to combine SYNC_FILE_RANGE_FSYNC with
    SYNC_FILE_RANGE_{,NO_}HARD_{FLUSH,BARRIER}, to have more
    fine-grained control over the behaviour of fsync().

SYNC_FILE_RANGE_HARD_FSYNC
    This forces a hard flushing fsync(). You should set the page
    range to cover all possible offsets, to get the full effect of
    fsync().

    It is an alias for SYNC_FILE_RANGE_FSYNC |
    SYNC_FILE_RANGE_HARD_FLUSH | SYNC_FILE_RANGE_WAIT_BEFORE |
    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER.

    SYNC_FILE_RANGE_HARD_BARRIER is omitted, because this waits
    for the flush to complete before returning, so there is
    nothing gained by a hard barrier and it can penalise other
    processes.


Usage notes for journalling filesystem in userspace
---------------------------------------------------

For something like ext3, the pattern for a non-flushing metadata
journal update is: write to journal, write barrier, write journal
commit record, write barrier, write metadata elsewhere.

In this API, you could write (whether using O_DIRECT or not):

    pwrite(fd, journal_data, journal_length, journal_offset);
    sync_file_range(fd, journal_offset, journal_length,
                    (SYNC_FILE_RANGE_WRITE
                     | SYNC_FILE_RANGE_WAIT_AFTER
                     | SYNC_FILE_RANGE_HARD_BARRIER));
    pwrite(fd, commit_data, commit_length, commit_offset);
    sync_file_range(fd, commit_offset, commit_length,
                    (SYNC_FILE_RANGE_WRITE
                     | SYNC_FILE_RANGE_WAIT_AFTER
                     | SYNC_FILE_RANGE_HARD_BARRIER));
    pwrite(fd, metadata, metadata_length, metadata_offset);

If you wanted to request a durable commit (i.e. hard flush, fsync()
from filesystem user's perspective), then you could add
SYNC_FILE_RANGE_HARD_FLUSH to the second sync_file_range() call. The
barrier from the first call ensures the journal entry is implicitly
flushed before the commit record, making the whole commit durable.

Alternatively, you could use a third sync_file_range() just for the
flush, after the data write. Probably the first method is better: if
there is an advantage to reordering the requests to move the flush
later, the elevator is free to do that.

(By the way, if the commit record is a single device sector and
O_DIRECT is used, and everything is aligned just so, you may feel it
doesn't require a checksum, such is your confidence in a disk's
ability to write whole sectors or not. If the commit record is any
other size, or O_DIRECT isn't used (which makes it a page size at
least), a checksum should be used. Also, without O_DIRECT, be careful
of writing partial pages or misaligned pages as they are converted to
full page writes, and power failure may corrupt data that you didn't
explicitly write to. There are many issues besides barriers and
flushing to get right when journalling for data integrity.)


Request for comments
--------------------

I'm not 100% sure of this API, but on the face of it, it seems it
could be quite versatile while not being too hard to implement, with
room for performance improvements in future.

I expect the call should work with block devices, as well as files.
Does it provide sufficiently full access to the elevator barrier
capabilities in a tidy package?

Is this sufficient for correct and efficient behaviour over software
RAID and similar things?

Database, virtual machine and filesystem implementors,
please take a look at the API and see if it makes sense.

If one or two other people are interested in helping, even if it's only
testing (and you're not in a rush...), I am willing to help implement
this.

-- Jamie


2008-02-26 07:43:46

by Andrew Morton

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[email protected]> wrote:

> (It would be nicer if sync_file_range()
> took a vector of ranges for better elevator scheduling, but let's
> ignore that :-)

Two passes:

Pass 1: shove each of the segments into the queue with
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE

Pass 2: wait for them all to complete and return accumulated result
with SYNC_FILE_RANGE_WAIT_AFTER
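That two-pass scheme can be sketched with the existing flags only (the
range structure and helper name are mine; note this gives write-out,
not the hard-storage durability the thread is about):

```c
#define _GNU_SOURCE        /* for sync_file_range() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

struct range { off_t off; off_t len; };

/* Sync a vector of file ranges in two passes: start write-out on
 * every range first, so the elevator sees them all, then wait. */
static int sync_ranges(int fd, const struct range *r, int n)
{
    int i;

    /* Pass 1: shove each segment into the queue. */
    for (i = 0; i < n; i++)
        if (sync_file_range(fd, r[i].off, r[i].len,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE) != 0)
            return -1;

    /* Pass 2: wait for them all and accumulate the result. */
    for (i = 0; i < n; i++)
        if (sync_file_range(fd, r[i].off, r[i].len,
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            return -1;
    return 0;
}
```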

2008-02-26 07:44:21

by Jeff Garzik

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jamie Lokier wrote:
> By durable, I mean that fsync() should actually commit writes to
> physical stable storage,

Yes, it should.


> I was surprised that fsync() doesn't do this already. There was a lot
> of effort put into block I/O write barriers during 2.5, so that
> journalling filesystems can force correct write ordering, using disk
> flush cache commands.
>
> After all that effort, I was very surprised to notice that Linux 2.6.x
> doesn't use that capability to ensure fsync() flushes the disk cache
> onto stable storage.

It's surprising you are surprised, given that this [lame] fsync behavior
has remained consistently lame throughout Linux's history.

[snip huge long proposal]

Rather than invent new APIs, we should fix the existing ones to _really_
flush data to physical media.

Linux should default to SAFE data storage, and permit users to retain
the older unsafe behavior via an option. It's completely ridiculous
that we default to an unsafe fsync.

And [anticipating a common response from others] it is completely
irrelevant that POSIX fsync(2) permits Linux's current behavior. The
current behavior is unsafe.

Safety before performance -- ESPECIALLY when it comes to storing user data.

Regards,

Jeff (Linux ATA driver dude)

2008-02-26 07:55:46

by Jamie Lokier

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
>
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already. There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
>
> It's surprising you are surprised, given that this [lame] fsync behavior
> has remained consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
>
> Rather than invent new APIs, we should fix the existing ones to _really_
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain
> the older unsafe behavior via an option. It's completely ridiculous
> that we default to an unsafe fsync.

Well, I agree with you. Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.

> And [anticipating a common response from others] it is completely
> irrelevant that POSIX fsync(2) permits Linux's current behavior. The
> current behavior is unsafe.
>
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync() alone wasn't popular. Heck, you already get people asking
"how to turn off fsync in PostgreSQL"... (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support database's finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by its performance for other things.

-- Jamie

2008-02-26 07:59:34

by Jamie Lokier

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Andrew Morton wrote:
> On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[email protected]> wrote:
>
> > (It would be nicer if sync_file_range()
> > took a vector of ranges for better elevator scheduling, but let's
> > ignore that :-)
>
> Two passes:
>
> Pass 1: shove each of the segments into the queue with
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
>
> Pass 2: wait for them all to complete and return accumulated result
> with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_CORK... SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a
reason why you have it there? The man page isn't very enlightening.

-- Jamie

2008-02-26 09:17:28

by Nick Piggin

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
> Andrew Morton wrote:
> > On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[email protected]> wrote:
> > > (It would be nicer if sync_file_range()
> > > took a vector of ranges for better elevator scheduling, but let's
> > > ignore that :-)
> >
> > Two passes:
> >
> > Pass 1: shove each of the segments into the queue with
> > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> >
> > Pass 2: wait for them all to complete and return accumulated result
> > with SYNC_FILE_RANGE_WAIT_AFTER
>
> Thanks.
>
> Seems ok, though being able to cork the I/O until the last one would
> be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?)
>
> I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a
> reason why you have it there? The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduces
a new concept, "writeout", to userspace (does "writeout" include
"in drive cache"? the kernel doesn't think so, but the only way to
make sync_file_range "safe" is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with
safe/unsafe and sync/async semantics that are part of the normal
POSIX API.

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good one IMO, and it is a bad bug that we don't do that :(

2008-02-26 09:26:29

by Jamie Lokier

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jeff Garzik wrote:
> [snip huge long proposal]
>
> Rather than invent new APIs, we should fix the existing ones to _really_
> flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough. I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie

2008-02-26 14:10:30

by Jörn Engel

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
>
> Yeah, sync_file_range has slightly unusual semantics and introduces
> a new concept, "writeout", to userspace (does "writeout" include
> "in drive cache"? the kernel doesn't think so, but the only way to
> make sync_file_range "safe" is if you do consider it writeout).

If sync_file_range isn't safe, it should get replaced by a noop
implementation. There really is no point in promising "a little"
safety.

One interesting aspect of this comes with COW filesystems like btrfs or
logfs. Writing out data pages is not sufficient, because those will get
lost unless their referencing metadata is written as well. So either we
have to call fsync for those filesystems or add another callback and let
filesystems override the default implementation.

Jörn

--
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scarabouche

2008-02-26 15:08:14

by Jamie Lokier

Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> >
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > a new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
>
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation. There really is no point in promising "a little"
> safety.
>
> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs. Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well. So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else. Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file date which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications. Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately. It can sync more if it wants, but not less.

That's what I understand by

    sync_file_range(fd, start, length, SYNC_FILE_RANGE_WAIT_BEFORE
                                       | SYNC_FILE_RANGE_WRITE
                                       | SYNC_FILE_RANGE_WAIT_AFTER);

Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().

The concept of "write-out" is not defined in the manual. I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out. It marks those with
a "write-out" flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed that writes for all the pages will
complete eventually if there are no errors. When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the "write-out" flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out. I don't actually see the point in this. Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that the "write-out" flag is a concept userspace can rely
on. That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie

2008-02-26 15:14:25

by Ric Wheeler

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jeff Garzik wrote:
> Jamie Lokier wrote:
>> By durable, I mean that fsync() should actually commit writes to
>> physical stable storage,
>
> Yes, it should.
>
>
>> I was surprised that fsync() doesn't do this already. There was a lot
>> of effort put into block I/O write barriers during 2.5, so that
>> journalling filesystems can force correct write ordering, using disk
>> flush cache commands.
>>
>> After all that effort, I was very surprised to notice that Linux 2.6.x
>> doesn't use that capability to ensure fsync() flushes the disk cache
>> onto stable storage.
>
> It's surprising you are surprised, given that this [lame] fsync behavior
> has remained consistently lame throughout Linux's history.

Maybe I am confused, but isn't this what fsync() does today whenever
barriers are enabled (the fsync() invalidates the drive's write cache)?

ric

2008-02-26 15:28:31

by Jamie Lokier

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduces
> > a new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
>
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation. There really is no point in promising "a little"
> safety.

Sometimes there is a point in "a little" safety.

There's a spectrum of durability (meaning how safely stored the data
is). In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface. There are others.

_None_ of those provide perfect safety for your data. They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice. They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-) Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end. So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might. (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things. 1: So a program can say "I've written it" with
confidence, e.g. when announcing email receipt. 2: For write ordering
with write-ahead logging: write, fdatasync, write. When you tease at
the details, efficient implementations of the two are different...
Think SCSI tagged commands versus cache flushes.)

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs. Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well. So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie

2008-02-26 15:43:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already. There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior
> >has remained consistently lame throughout Linux's history.
>
> Maybe I am confused, but isn't this what fsync() does today whenever
> barriers are enabled (the fsync() invalidates the drive's write cache)?

No, fsync() doesn't always flush the drive's write cache. It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
while (1) {
        char byte = 0;
        usleep (100000);
        pwrite (fd, &byte, 1, 0);
        fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed. The inode mtime is changed by write only with 1 second
granularity. Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more. That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals. A journal commit uses an ordered request, which isn't
necessarily the same as a flush; it just happens to use a flush in
this instance. I'm not sure if ordered requests are actually
implemented by any drivers at the moment. If not now, they will be
one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests. But there's lots of things wrong with that. Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks. That way lies wrongness.

Rightness is to make fdatasync() work well: a genuine flush (or
equivalent, e.g. FUA) only when required, not a mere ordered barrier,
and no inode write; and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie

2008-02-26 16:28:55

by Andrew Morton

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <[email protected]> wrote:

> SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
> pages which aren't already queued for write-out. It marks those with
> a "write-out" flag, and starts write I/Os at some unspecified time in
> the near future; it can be assumed writes for all the pages will
> complete eventually if there's no errors. When I/O completes on a
> page, it cleans the page and also clears the write-out flag.
>
> SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
> have the "write-out" flag set.
>
> SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
> pages for write-out. I don't actually see the point in this. Isn't a
> preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
> BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout.
ie: someone redirtied the page after someone started writing the page out.
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

those are the three stages of sync_file_range(). They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages. This is especially hurtful if
you're trying to feed a lot of little segments into the queue. In that
case perhaps userspace should do an asynchronous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as possible into the
queue, then a SYNC_FILE_RANGE_WAIT_AFTER pass, then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers. Which mode is best very much depends on
the application's file-dirtying patterns. One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside
tuning of the application's write() design.

It's an interesting problem, with potentially high payback.

2008-02-26 16:44:04

by Jeff Garzik

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Nick Piggin wrote:
> Anyway, the idea of making fsync/fdatasync etc. safe by default is
> a good idea IMO, and is a bad bug that we don't do that :(

Agreed... it's also disappointing that [unless I'm mistaken] you have
to hack each filesystem to support barriers.

It seems far easier to make sync_blkdev() Do The Right Thing, and
magically make all filesystems data-safe.

Jeff

2008-02-26 17:00:42

by Jamie Lokier

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and is a bad bug that we don't do that :(
>
> Agreed... it's also disappointing that [unless I'm mistaken] you have
> to hack each filesystem to support barriers.
>
> It seems far easier to make sync_blkdev() Do The Right Thing, and
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

Devices: On a software RAID, you ideally don't want to issue flushes
to all drives if your database did a 1 block commit entry. (But they
probably use O_DIRECT anyway, changing the rules again). But all that
can be optimised in generic VFS code eventually. It doesn't need
filesystem assistance in most cases.

Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie

2008-02-26 17:03:44

by Jörn Engel

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
>
> > One interesting aspect of this comes with COW filesystems like btrfs or
> > logfs. Writing out data pages is not sufficient, because those will get
> > lost unless their referencing metadata is written as well. So either we
> > have to call fsync for those filesystems or add another callback and let
> > filesystems override the default implementation.
>
> Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> with appropriate arguments?

My paragraph above was aimed at the sync_file_range() case. fsync and
fdatasync do the right thing within the limitations you brought up in
this thread. sync_file_range() without further changes will only write
data pages, not the metadata required to actually access those data
pages. This works just fine for non-COW filesystems, which covers all
currently merged ones.

With COW filesystems it is currently impossible to do sync_file_range()
properly. The problem is orthogonal to yours, I just brought it up
since you were already mentioning sync_file_range().


Jörn

--
Joern's library part 10:
http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx

2008-02-26 17:29:52

by Jamie Lokier

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> >
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs. Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written as well. So either we
> > > have to call fsync for those filesystems or add another callback and let
> > > filesystems override the default implementation.
> >
> > Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> > with appropriate arguments?
>
> My paragraph above was aimed at the sync_file_range() case. fsync and
> fdatasync do the right thing within the limitations you brought up in
> this thread. sync_file_range() without further changes will only write
> data pages, not the metadata required to actually access those data
> pages. This works just fine for non-COW filesystems, which covers all
> currently merged ones.
>
> With COW filesystems it is currently impossible to do sync_file_range()
> properly. The problem is orthogonal to yours, I just brought it up
> since you were already mentioning sync_file_range().

You're right. Though, doesn't normal page writeback enqueue the COW
metadata changes? If not, how do they get written in a timely
fashion?

-- Jamie

2008-02-26 17:39:22

by Jörn Engel

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

On Tue, 26 February 2008 17:29:13 +0000, Jamie Lokier wrote:
>
> You're right. Though, doesn't normal page writeback enqueue the COW
> metadata changes? If not, how do they get written in a timely
> fashion?

It does. But this is not sufficient to guarantee that the pages in
question have been safely committed to the device by the time
sync_file_range() has returned.

Jörn

--
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html

2008-02-26 17:55:15

by Jeff Garzik

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jamie Lokier wrote:
> Jeff Garzik wrote:
>> Nick Piggin wrote:
>>> Anyway, the idea of making fsync/fdatasync etc. safe by default is
>>> a good idea IMO, and is a bad bug that we don't do that :(
>> Agreed... it's also disappointing that [unless I'm mistaken] you have
>> to hack each filesystem to support barriers.
>>
>> It seems far easier to make sync_blkdev() Do The Right Thing, and
>> magically make all filesystems data-safe.
>
> Well, you need ordered metadata writes, barriers _and_ flushes with
> some filesystems.
>
> Merely writing all the data pages then issuing a drive cache flush
> won't Do The Right Thing with those filesystems - someone already
> mentioned Btrfs, where it won't.

Oh certainly. That's why we have a VFS :) fsync for NFS will look
quite different, too.


> But I agree that your suggestion would make a superb default, for
> filesystems which don't provide their own function.

Yep. That would immediately cover a bunch of filesystems.


> It's not optimal even then.
>
> Devices: On a software RAID, you ideally don't want to issue flushes
> to all drives if your database did a 1 block commit entry. (But they
> probably use O_DIRECT anyway, changing the rules again). But all that
> can be optimised in generic VFS code eventually. It doesn't need
> filesystem assistance in most cases.

My own idea is that we create a FLUSH command for blkdev request queues,
to exist alongside READ, WRITE, and the current barrier implementation.
Then FLUSH could be passed down through MD or DM.

Jeff

2008-02-27 14:17:17

by Jamie Lokier

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Jeff Garzik wrote:
> >It's not optimal even then.
> >
> > Devices: On a software RAID, you ideally don't want to issue flushes
> > to all drives if your database did a 1 block commit entry. (But they
> > probably use O_DIRECT anyway, changing the rules again). But all that
> > can be optimised in generic VFS code eventually. It doesn't need
> > filesystem assistance in most cases.
>
> My own idea is that we create a FLUSH command for blkdev request queues,
> to exist alongside READ, WRITE, and the current barrier implementation.
> Then FLUSH could be passed down through MD or DM.

I like your thought, and it has the benefit of being simple.

My thought is very similar, but with (hopefully not premature...)
optimisations:

- I would merge FLUSH with a preceding write in some cases,
converting to an FUA-write command. Probably the generic request
queue is the best place to detect and merge. This is so that
userspace filesystems (including guest VMs) and databases can do
journal commits with the same I/O sequence as in kernel
filesystems.

- I would create BARRIER too, so that a userspace API can ask for
this weaker form of fsync, which may improve throughput of
userspace journalling.

- I would include a sector range in FLUSH and BARRIER, for MD and DM
to flush _only_ relevant sub-devices. This may improve performance
for journalling both kernel and userspace filesystems, as journal
commits are often very small and hit one or two sub-devices in RAID.

- I would ask the nice MD and DM people to take tag-barriers rather
than flush-barriers on the input queue, converting to
tag-barriers, flush-barriers and independent FLUSH on the
sub-device queues according to sector ranges and subsequent
writes. It's not obvious, but my barrier proposal which started
this thread is designed to support an efficient inter-sub-device
flush-barrier when necessary, and single-sub-device tag-barrier
when possible.

-- Jamie

2008-11-24 21:11:05

by Sachin Gaikwad

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Hi Jamie,

On Tue, Feb 26, 2008 at 10:43 AM, Jamie Lokier <[email protected]> wrote:
> Ric Wheeler wrote:
>> >>I was surprised that fsync() doesn't do this already. There was a lot
>> >>of effort put into block I/O write barriers during 2.5, so that
>> >>journalling filesystems can force correct write ordering, using disk
>> >>flush cache commands.
>> >>
>> >>After all that effort, I was very surprised to notice that Linux 2.6.x
>> >>doesn't use that capability to ensure fsync() flushes the disk cache
>> >>onto stable storage.
>> >
>> >It's surprising you are surprised, given that this [lame] fsync behavior
>> >has remained consistently lame throughout Linux's history.
>>
>> Maybe I am confused, but isn't this what fsync() does today whenever
>> barriers are enabled (the fsync() invalidates the drive's write cache)?
>
> No, fsync() doesn't always flush the drive's write cache. It often
> does, and I think many people are under the impression it always does,
> but it doesn't.
>
> Try this code on ext3:
>
> fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
> while (1) {
> char byte;
> usleep (100000);
> pwrite (fd, &byte, 1, 0);
> fsync (fd);
> }
>
> It will do just over 10 write ops per second on an idle system (13 on
> mine), and 1 flush op per second.

How did you measure write-ops and flush-ops? Is there any tool which
can be used? I tried looking at what CONFIG_BSD_PROCESS_ACCT
provides, but no luck.

Sachin

>
> That's because ext3 fsync() only does a journal commit when the inode
> has changed. The inode mtime is changed by write only with 1 second
> granularity. Without a journal commit, there's no barrier, which
> translates to not flushing disk write cache.
>
> If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
> and fsync, you'll see at least 20 write ops and 20 flush ops per
> second, and you'll hear the disk seeking more. That's because the
> fchmod dirties the inode, so fsync() writes the inode with a journal
> commit.
>
> It turns out even _that_ is not sufficient according to the kernel
> internals. A journal commit uses an ordered request, which isn't the
> same as a flush potentially, it just happens to use flush in this
> instance. I'm not sure if ordered requests are actually implemented
> by any drivers at the moment. If not now, they will be one day.
>
> We could change ext3 fsync() to always do a journal commit, and depend
> on the non-existence of block drivers which do ordered (not flush)
> barrier requests. But there's lots of things wrong with that. Not
> least, it sucks performance for database-like applications and virtual
> machines, a lot due to unnecessary seeks. That way lies wrongness.
>
> Rightness is to make fdatasync() work well, with a genuine flush (or
> equivalent (see FUA), only when required, and not a mere ordered
> barrier), no inode write, and to make sync_file_range()[*] offer the
> fancier applications finer controls which reflect what they actually
> need.
>
> [*] - or whatever.
>
> -- Jamie

2008-11-25 10:17:31

by Jamie Lokier

[permalink] [raw]
Subject: Re: Proposal for "proper" durable fsync() and fdatasync()

Sachin Gaikwad wrote:
> > No, fsync() doesn't always flush the drive's write cache. It often
> > does, and I think many people are under the impression it always does,
> > but it doesn't.
> >
> > Try this code on ext3:
> >
> > fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
> > while (1) {
> > char byte;
> > usleep (100000);
> > pwrite (fd, &byte, 1, 0);
> > fsync (fd);
> > }
> >
> > It will do just over 10 write ops per second on an idle system (13 on
> > mine), and 1 flush op per second.
>
> How did you measure write-ops and flush-ops ? Is there any tool which
> can be used ? I tried looking at what CONFIG_BSD_PROCESS_ACCT
> provides, but no luck.

I don't remember; it was such a long time ago!

It probably involved looking at /sys/block/*/stat or something like that.

You might find the "blktrace" tool does what you want.

-- Jamie