2010-11-29 08:37:23

by Jonathan Nieder

Subject: Re: Bug#605009: serious performance regression with ext4

Hi,

Ted Ts'o wrote:

> I did some experimenting, and I figured out what was going on. You're
> right, (c) doesn't quite work, because delayed allocation meant that
> the writeout didn't take place until the fsync() for each file
> happened. I didn't see this at first; my apologies.

Thanks for a clear analysis[1]. I am still confused about something,
though. If the answer is "stop wasting my time, just read the source", I
can accept that.

> sync_file_range() is a Linux-specific system
> call that has been around for a while. It allows a program to control
> when writeback happens in a very low-level fashion. The first set of
> sync_file_range() system calls causes the system to start writing back
> each file once it has finished being extracted. It doesn't actually
> wait for the write to finish; it just starts the writeback.

True, using sync_file_range(..., SYNC_FILE_RANGE_WRITE) for each file
makes later fsync() much faster. But why? Is this a matter of allowing
writeback to overlap with write() or is something else going on?

I'm thinking it has to be something else, since sync() is fast without the
sync_file_range().
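
For concreteness, the per-file call I mean is roughly the following
sketch (my reading of the pattern under discussion, not dpkg's actual
code), with a plain fsync() on each file afterwards:

#define _GNU_SOURCE
#include <fcntl.h>

/* After a file has been written, ask the kernel to start writing it
 * back without waiting for the writeback to complete.  offset 0 and
 * nbytes 0 mean "from the start of the file through to end of file". */
static void start_writeback(int fd)
{
        sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}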

> I've attached the program I used to test and prove this mechanism, as
> well as the kernel tracepoint script I used to debug why (c) wasn't
> working, which might be of interest to folks on debian-kernel.
> Basically it's a demonstration of how cool ftrace is. :-)

Perhaps the answer can be phrased in terms of the output of this script.

> #!/bin/sh
> cd /sys/kernel/debug/tracing
> echo blk > current_tracer
> echo 1 > /sys/block/dm-5/trace/enable
> echo 1 > events/ext4/ext4_sync_file/enable
> echo 1 > events/ext4/ext4_da_writepages/enable
> echo 1 > events/ext4/ext4_mark_inode_dirty/enable
> echo 1 > events/jbd2/jbd2_run_stats/enable
> echo 1 > events/jbd2/jbd2_start_commit/enable
> echo 1 > events/jbd2/jbd2_end_commit/enable
> (cd /kbuild; /home/tytso/src/mass-sync-tester -n)
> cat trace > /tmp/trace
> echo 0 > events/jbd2/jbd2_start_commit/enable
> echo 0 > events/jbd2/jbd2_end_commit/enable
> echo 0 > events/jbd2/jbd2_run_stats/enable
> echo 0 > events/ext4/ext4_sync_file/enable
> echo 0 > events/ext4/ext4_da_writepages/enable
> echo 0 > events/ext4/ext4_mark_inode_dirty/enable
> echo 0 > /sys/block/dm-5/trace/enable
> echo nop > current_tracer

Jonathan

[1] http://lists.debian.org/debian-devel/2010/11/msg00577.html


2010-11-29 14:44:45

by Theodore Ts'o

Subject: Re: Bug#605009: serious performance regression with ext4

On Mon, Nov 29, 2010 at 01:29:30AM -0600, Jonathan Nieder wrote:
>
> > sync_file_range() is a Linux-specific system
> > call that has been around for a while. It allows a program to control
> > when writeback happens in a very low-level fashion. The first set of
> > sync_file_range() system calls causes the system to start writing back
> > each file once it has finished being extracted. It doesn't actually
> > wait for the write to finish; it just starts the writeback.
>
> True, using sync_file_range(..., SYNC_FILE_RANGE_WRITE) for each file
> makes later fsync() much faster. But why? Is this a matter of allowing
> writeback to overlap with write() or is something else going on?

So what's going on is this. dpkg is writing a series of files.
fsync() causes the following to happen:

  * Force the specified file to be written to disk; in the case of
    ext4 with delayed allocation, this means blocks have to be
    allocated, so the block bitmap gets dirtied, etc.
  * Force a journal commit. This causes the block bitmap, the inode
    table block for the inode, etc., to be written to the journal,
    followed by a barrier operation to make sure that all of the file
    system metadata, as well as the data blocks from the previous
    step, are written to disk.

If you call fsync() for each file, these two steps get done for each
file. This means we have to do a journal commit for each and every
file.
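
In rough code, that slow path looks something like this (a hypothetical
sketch of the per-file pattern, not dpkg's actual code; error checking
omitted):

#include <fcntl.h>
#include <unistd.h>

/* Extract one file and force it to disk immediately.  With delayed
 * allocation, each fsync() here resolves the allocation for that one
 * file *and* forces its own journal commit plus barrier. */
static void extract_and_fsync(const char *name, const void *buf, size_t len)
{
        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        write(fd, buf, len);
        fsync(fd);
        close(fd);
}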

By using sync_file_range() first, for all files, this forces the
delayed allocation to be resolved, so all of the block bitmaps, inode
data structures, etc., are updated. Then on the first fdatasync(),
the resulting journal commit updates all of the block bitmaps and all
of the inode table blocks, and we're done. The subsequent
fdatasync() calls become no-ops --- which the ftrace shell script will
show.
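
In code, the pattern is roughly this (just a sketch in the spirit of
mass-sync-tester, not a copy of it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Pass 1: start writeback on every file, which resolves the delayed
 * allocation, but don't wait for it.  Pass 2: fdatasync() each file;
 * only the first one should end up needing a journal commit. */
static void writeback_then_sync(int fds[], int nfds)
{
        int i;

        for (i = 0; i < nfds; i++)
                sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);
        for (i = 0; i < nfds; i++)
                fdatasync(fds[i]);
}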

We could imagine a new kernel interface which took an array of file
descriptors, say call it fsync_array(), which would force writeback on
all of the specified file descriptors, as well as forcing the journal
commit that would guarantee the metadata had been written to disk.
But calling sync_file_range() for each file, and then calling
fdatasync() for all of them, is something that exists today with
currently shipping kernels (and sync_file_range() has been around for
over four years, whereas a new system call wouldn't see wide
deployment for at least 2-3 years).
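
(To be clear, fsync_array() doesn't exist; I'm imagining something with
roughly the shape

        int fsync_array(int fds[], int nfds);

which would write back all of the listed files and then force a single
journal commit.)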

- Ted


2010-11-29 15:18:27

by Bernd Schubert

Subject: Re: Bug#605009: serious performance regression with ext4

On Monday, November 29, 2010, Ted Ts'o wrote:
> By using sync_file_range() first, for all files, this forces the
> delayed allocation to be resolved, so all of the block bitmaps, inode
> data structures, etc., are updated. Then on the first fdatasync(),
> the resulting journal commit updates all of the block bitmaps and all
> of the inode table blocks(), and we're done. The subsequent
> fdatasync() calls become no-ops --- which the ftrace shell script will
> show.

Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
itself? Most applications expect the file to be on disk after a close anyway
and I also don't see a good reason why one should delay disk write-back
after close any longer (well, there are exceptions if the application is
broken, such as ha-logd used by pacemaker, which did an open, seek, write,
flush, close sequence for each line of logs..., but at least we have fixed
that in -hg now).


Cheers,
Bernd

2010-11-29 15:37:08

by Theodore Ts'o

Subject: Re: Bug#605009: serious performance regression with ext4

On Mon, Nov 29, 2010 at 04:18:24PM +0100, Bernd Schubert wrote:
>
> Wouldn't it make sense to modify ext4 or even the vfs to do that on
> close() itself? Most applications expect the file to be on disk
> after a close anyway and I also don't see a good reason why one
> should delay disk write-back after close any longer (well, there
> are exceptions if the application is broken, such as ha-logd used
> by pacemaker, which did an open, seek, write, flush, close sequence
> for each line of logs..., but at least we have fixed that in -hg
> now).

I can think of plenty of cases where it wouldn't make sense to do that
on a close(). For example, it would dramatically slow down compiles:
you really don't want to force writeback to start when the compiler
finishes writing an intermediate .o file. And there are often
temporary files which are created and then deleted very shortly
afterwards; forcing writeback just because the file has been closed
would be pointless.

Now, a hint that could be set via an open flag, or via fcntl(), saying
that *this* file is one that should really be written at close() time
--- that would probably be a good idea, if application/library authors
would actually use it.

- Ted

2010-11-29 15:55:06

by Eric Sandeen

Subject: Re: Bug#605009: serious performance regression with ext4

On 11/29/10 9:18 AM, Bernd Schubert wrote:
> On Monday, November 29, 2010, Ted Ts'o wrote:
>> By using sync_file_range() first, for all files, this forces the
>> delayed allocation to be resolved, so all of the block bitmaps, inode
>> data structures, etc., are updated. Then on the first fdatasync(),
>> the resulting journal commit updates all of the block bitmaps and all
>> of the inode table blocks(), and we're done. The subsequent
>> fdatasync() calls become no-ops --- which the ftrace shell script will
>> show.
>
> Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
> itself? Most applications expect the file to be on disk after a close anyway

but those applications would be wrong.

http://www.flamingspork.com/talks/
Eat My Data: How Everybody Gets File IO Wrong

-Eric

> and I also don't see a good reason why one should delay disk write-back
> after close any longer (well, there are exceptions if the application is
> broken, such as ha-logd used by pacemaker, which did an open, seek, write,
> flush, close sequence for each line of logs..., but at least we have fixed
> that in -hg now).
>
>
> Cheers,
> Bernd


2010-11-29 16:20:28

by Bernd Schubert

Subject: Re: Bug#605009: serious performance regression with ext4

On Monday, November 29, 2010, Eric Sandeen wrote:
> On 11/29/10 9:18 AM, Bernd Schubert wrote:
> > On Monday, November 29, 2010, Ted Ts'o wrote:
> >> By using sync_file_range() first, for all files, this forces the
> >> delayed allocation to be resolved, so all of the block bitmaps, inode
> >> data structures, etc., are updated. Then on the first fdatasync(),
> >> the resulting journal commit updates all of the block bitmaps and all
> >> of the inode table blocks(), and we're done. The subsequent
> >> fdatasync() calls become no-ops --- which the ftrace shell script will
> >> show.
> >
> > Wouldn't it make sense to modify ext4 or even the vfs to do that on
> > close() itself? Most applications expect the file to be on disk after a
> > close anyway
>
> but those applications would be wrong.

Of course they are; I don't deny that. But denying that most applications
expect the file to be on disk after a close() also denies reality, in my
experience. And IMHO, temporary files such as those pointed out by Ted
should either go to tmpfs or be specially flagged with something like
O_TMP. Unfortunately, that changes semantics, so indeed the only way left
is to do it the other way around, as Ted suggested.


Cheers,
Bernd

2010-11-29 16:34:20

by Florian Weimer

Subject: Re: Bug#605009: serious performance regression with ext4

* Bernd Schubert:

> Wouldn't it make sense to modify ext4 or even the vfs to do that on close()
> itself? Most applications expect the file to be on disk after a close anyway
> and I also don't see a good reason why one should delay disk write-back
> after close any longer (well, there are exceptions if the application is
> broken, such as ha-logd used by pacemaker, which did an open, seek, write,
> flush, close sequence for each line of logs..., but at least we have fixed
> that in -hg now).

If you use Oracle Berkeley DB in a process-based fashion, it is
crucial for decent performance that the memory-mapped file containing
the cache is not flushed to disk when the database environment is
closed prior to process termination. Perhaps flushing could be
delayed until the last open file handle is gone. In any case, it's a
pretty drastic change, which should probably be tunable with a
(generic) mount option.

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2010-11-29 20:50:12

by Andreas Dilger

Subject: Re: Bug#605009: serious performance regression with ext4

On 2010-11-29, at 08:18, Bernd Schubert wrote:
> Wouldn't it make sense to modify ext4 or even the vfs to do that on
> close() itself? Most applications expect the file to be on disk after
> a close anyway and I also don't see a good reason why one should delay
> disk write-back after close any longer (well, there are exceptions if
> the application is broken, such as ha-logd used by pacemaker, which
> did an open, seek, write, flush, close sequence for each line of
> logs..., but at least we have fixed that in -hg now).

This would be terrible for applications like tar that create many hundreds or thousands of files. Also, doesn't NFS internally open/close the file for every write?

There would now be an implicit fsync and disk cache flush for every created file. It would be impossible to create or extract more than about 100 files/second on an HDD due to seek limitations, even if the files are tiny and do not fill the memory.

I can imagine that it might make sense to _start_ writeback sooner than the VM currently does, if an application is not repeatedly opening, writing, and closing the same file, since this is otherwise dead time in the IO pipeline that could be better utilized. This kind of background writeout shouldn't trigger a cache flush for each file, so that multiple writes can be aggregated more efficiently.

Lustre has always been more aggressive than the VM in starting writeout when there are good-sized chunks of data to be written, or if there are a lot of small files that are not being modified, and this significantly improves performance when IO is bursty, which it is in most real-world cases.

Cheers, Andreas