2009-10-12 14:01:56

by Chris Mason

[permalink] [raw]
Subject: Fun with fdatasync()


Hello everyone,

Josef has been doing some benchmarking around rpm performance on the
filesystem and noticed that ext3 was going really fast on the
fdatasyncs.

It seems pretty surprising to me that rpm -Uvh should do fdatasync
without forcing fsyncs. The files get overwritten, and any time we mark
an inode dirty I_DIRTY_DATASYNC is getting set.

Handling of I_DIRTY_DATASYNC seems to work like this:

mark_inode_dirty() will set I_DIRTY_DATASYNC

ext3_sync_file will force a full commit on I_DIRTY_DATASYNC

This part makes good sense. If the inode has changed, we're supposed to
do a full commit.

writeback_single_inode is where things seem to go wrong:

/* Set I_SYNC, reset I_DIRTY */
dirty = inode->i_state & I_DIRTY;
inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY;

Whoops, we just lost I_DIRTY_DATASYNC. So, if pdflush comes in and does some
writeback before we fdatasync, we'll skip the full commit because
I_DIRTY_DATASYNC is gone.

The solution to me seems to be that we need to keep I_DIRTY_DATASYNC
until the FS does an fsync/O_SYNC operation, and make the FS
responsible for clearing it.

This does risk extra full fsyncs if the FS does a transaction commit on
its own, but the FS should be responsible for keeping track of which
transaction last changed a given file and doing a shortcut in the fsync
code if the file is already on safely on disk.

Am I missing something? I don't see how fdatasync is safe in our
current usage.

-chris


2009-10-12 22:01:23

by Jan Kara

[permalink] [raw]
Subject: Re: Fun with fdatasync()

Hi,

On Mon 12-10-09 10:00:49, Chris Mason wrote:
> Josef has been doing some benchmarking around rpm performance on the
> filesystem and noticed that ext3 was going really fast on the
> fdatasyncs.
>
> It seems pretty surprising to me that rpm -Uvh should do fdatasync
> without forcing fsyncs. The files get overwritten, and any time we mark
> an inode dirty I_DIRTY_DATASYNC is getting set.
>
> Handling of I_DIRTY_DATASYNC seems to work like this:
>
> mark_inode_dirty() will set I_DIRTY_DATASYNC
>
> ext3_sync_file will force a full commit on I_DIRTY_DATASYNC
>
> This part makes good sense. If the inode has changed, we're supposed to
> do a full commit.
>
> writeback_single_inode is where things seem to go wrong:
>
> /* Set I_SYNC, reset I_DIRTY */
> dirty = inode->i_state & I_DIRTY;
> inode->i_state |= I_SYNC;
> inode->i_state &= ~I_DIRTY;
>
> Whoops, we just lost I_DIRTY_DATASYNC. So, if pdflush comes in and does some
> writeback before we fdatasync, we'll skip the full commit because
> I_DIRTY_DATASYNC is gone.
>
> The solution to me seems to be that we need to keep I_DIRTY_DATASYNC
> until the FS does an fsync/O_SYNC operation, and make the FS
> responsible for clearing it.
>
> This does risk extra full fsyncs if the FS does a transaction commit on
> its own, but the FS should be responsible for keeping track of which
> transaction last changed a given file and doing a shortcut in the fsync
> code if the file is already on safely on disk.
>
> Am I missing something? I don't see how fdatasync is safe in our
> current usage.
Yeah, we already discussed similar problems I_DIRTY flags with Ted and
others in thread "fsync on ext[34] working only by an accident" on
linux-ext4.
I don't quite like clearing dirty flags only on sync - pdflush would then
unnecessarily try to get rid of those inodes and burn CPU on them.
Actually, mapping->private_list (and bh->b_assoc_buffers) is meant to be
used exactly for the purpose of tracking what needs to be written on fsync
so my current plan is to somehow utilize that list to fix the problem.
Maybe I even get to that tomorrow ;) Thanks for the reminder.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-10-13 16:01:48

by Chris Mason

[permalink] [raw]
Subject: Re: Fun with fdatasync()

On Tue, Oct 13, 2009 at 12:00:43AM +0200, Jan Kara wrote:
> Hi,
>
> On Mon 12-10-09 10:00:49, Chris Mason wrote:

[ clearing of I_DIRTY_DATASYNC by pdflush ]

> >
> > Am I missing something? I don't see how fdatasync is safe in our
> > current usage.
> Yeah, we already discussed similar problems I_DIRTY flags with Ted and
> others in thread "fsync on ext[34] working only by an accident" on
> linux-ext4.
> I don't quite like clearing dirty flags only on sync - pdflush would then
> unnecessarily try to get rid of those inodes and burn CPU on them.
> Actually, mapping->private_list (and bh->b_assoc_buffers) is meant to be
> used exactly for the purpose of tracking what needs to be written on fsync
> so my current plan is to somehow utilize that list to fix the problem.
> Maybe I even get to that tomorrow ;) Thanks for the reminder.

I honestly don't remember all the details now, but I know that when
reiserfs stopped using the b_assoc_buffers stuff life got much less
complex. From an outsider's point of view the last thing jbd needs is
another list of buffers to live on.

It seems like ext34 need to be able to answer 3 questions during an
fsync or fdatasync:

The last transaction to change this file (fill hole, change
i_size)

The last transaction to log this inode (for full fsync)

The last transaction committed such that fsync would consider it done.

Filling holes and changing i_size only happens from a handful of places,
so it would be easy to update a transid field in the in-memory inode for
that.

The inode logging code could bump a second transid field to catch all
the other ways inodes change.

The transaction code could (or already does?) export an easy way to
check the last commit. Put the three together and you can safely jump
out of fsync or fdatasync based on what the inode really needs instead
of guessing with the I_ flags or page dirty bits.

-chris





2009-10-13 19:24:45

by Jan Kara

[permalink] [raw]
Subject: Re: Fun with fdatasync()

On Tue 13-10-09 12:00:28, Chris Mason wrote:
> On Tue, Oct 13, 2009 at 12:00:43AM +0200, Jan Kara wrote:
> > Hi,
> >
> > On Mon 12-10-09 10:00:49, Chris Mason wrote:
>
> [ clearing of I_DIRTY_DATASYNC by pdflush ]
>
> > >
> > > Am I missing something? I don't see how fdatasync is safe in our
> > > current usage.
> > Yeah, we already discussed similar problems I_DIRTY flags with Ted and
> > others in thread "fsync on ext[34] working only by an accident" on
> > linux-ext4.
> > I don't quite like clearing dirty flags only on sync - pdflush would then
> > unnecessarily try to get rid of those inodes and burn CPU on them.
> > Actually, mapping->private_list (and bh->b_assoc_buffers) is meant to be
> > used exactly for the purpose of tracking what needs to be written on fsync
> > so my current plan is to somehow utilize that list to fix the problem.
> > Maybe I even get to that tomorrow ;) Thanks for the reminder.
>
> I honestly don't remember all the details now, but I know that when
> reiserfs stopped using the b_assoc_buffers stuff life got much less
> complex. From an outsider's point of view the last thing jbd needs is
> another list of buffers to live on.
>
> It seems like ext34 need to be able to answer 3 questions during an
> fsync or fdatasync:
>
> The last transaction to change this file (fill hole, change
> i_size)
>
> The last transaction to log this inode (for full fsync)
>
> The last transaction committed such that fsync would consider it done.
>
> Filling holes and changing i_size only happens from a handful of places,
> so it would be easy to update a transid field in the in-memory inode for
> that.
>
> The inode logging code could bump a second transid field to catch all
> the other ways inodes change.
>
> The transaction code could (or already does?) export an easy way to
> check the last commit. Put the three together and you can safely jump
> out of fsync or fdatasync based on what the inode really needs instead
> of guessing with the I_ flags or page dirty bits.
I was thinking about this solution as well (before thinking about
using b_assoc_queue). But to make this reliable we have to pin the inode
in memory until transactions modifying it (or its data) are committed.
Otherwise it could be reclaimed and we'd loose the information about
which transactions we have to wait for. And when we pin the inode, we'd
have to unpin it at transaction commit which implies we have to somehow
attach the inode to the transaction... That doesn't look more appealing
than b_assoc_queue in my opinion (OK, for ext4/JBD2 we have a way to
attach inodes to transactions but ext3/JBD does not have such way).
So to me it seemed easier to use b_assoc_queue but I admit I haven't
thought through all the details and your experiences with reiserfs
frighten me a bit ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-10-13 20:02:33

by Chris Mason

[permalink] [raw]
Subject: Re: Fun with fdatasync()

On Tue, Oct 13, 2009 at 09:23:57PM +0200, Jan Kara wrote:
> On Tue 13-10-09 12:00:28, Chris Mason wrote:
> > On Tue, Oct 13, 2009 at 12:00:43AM +0200, Jan Kara wrote:
> > > Hi,
> > >
> > > On Mon 12-10-09 10:00:49, Chris Mason wrote:
> >
> > [ clearing of I_DIRTY_DATASYNC by pdflush ]
> >
> > > >
> > > > Am I missing something? I don't see how fdatasync is safe in our
> > > > current usage.
> > > Yeah, we already discussed similar problems I_DIRTY flags with Ted and
> > > others in thread "fsync on ext[34] working only by an accident" on
> > > linux-ext4.
> > > I don't quite like clearing dirty flags only on sync - pdflush would then
> > > unnecessarily try to get rid of those inodes and burn CPU on them.
> > > Actually, mapping->private_list (and bh->b_assoc_buffers) is meant to be
> > > used exactly for the purpose of tracking what needs to be written on fsync
> > > so my current plan is to somehow utilize that list to fix the problem.
> > > Maybe I even get to that tomorrow ;) Thanks for the reminder.
> >
> > I honestly don't remember all the details now, but I know that when
> > reiserfs stopped using the b_assoc_buffers stuff life got much less
> > complex. From an outsider's point of view the last thing jbd needs is
> > another list of buffers to live on.
> >
> > It seems like ext34 need to be able to answer 3 questions during an
> > fsync or fdatasync:
> >
> > The last transaction to change this file (fill hole, change
> > i_size)
> >
> > The last transaction to log this inode (for full fsync)
> >
> > The last transaction committed such that fsync would consider it done.
> >
> > Filling holes and changing i_size only happens from a handful of places,
> > so it would be easy to update a transid field in the in-memory inode for
> > that.
> >
> > The inode logging code could bump a second transid field to catch all
> > the other ways inodes change.
> >
> > The transaction code could (or already does?) export an easy way to
> > check the last commit. Put the three together and you can safely jump
> > out of fsync or fdatasync based on what the inode really needs instead
> > of guessing with the I_ flags or page dirty bits.
> I was thinking about this solution as well (before thinking about
> using b_assoc_queue). But to make this reliable we have to pin the inode
> in memory until transactions modifying it (or its data) are committed.
> Otherwise it could be reclaimed and we'd loose the information about
> which transactions we have to wait for. And when we pin the inode, we'd
> have to unpin it at transaction commit which implies we have to somehow
> attach the inode to the transaction... That doesn't look more appealing
> than b_assoc_queue in my opinion (OK, for ext4/JBD2 we have a way to
> attach inodes to transactions but ext3/JBD does not have such way).

Most of the time the file is open from the time it is written to the
time it is sunk. For the cases where we lose the inode, we just find a
zero in the last trans field and error on the side of doing the full
fsync. This will be a pretty rare event ;)

-chris

2009-10-13 20:18:56

by Jan Kara

[permalink] [raw]
Subject: Re: Fun with fdatasync()

On Tue 13-10-09 16:01:15, Chris Mason wrote:
> On Tue, Oct 13, 2009 at 09:23:57PM +0200, Jan Kara wrote:
> > On Tue 13-10-09 12:00:28, Chris Mason wrote:
> > > On Tue, Oct 13, 2009 at 12:00:43AM +0200, Jan Kara wrote:
> > > > Hi,
> > > >
> > > > On Mon 12-10-09 10:00:49, Chris Mason wrote:
> > >
> > > [ clearing of I_DIRTY_DATASYNC by pdflush ]
> > >
> > > > >
> > > > > Am I missing something? I don't see how fdatasync is safe in our
> > > > > current usage.
> > > > Yeah, we already discussed similar problems I_DIRTY flags with Ted and
> > > > others in thread "fsync on ext[34] working only by an accident" on
> > > > linux-ext4.
> > > > I don't quite like clearing dirty flags only on sync - pdflush would then
> > > > unnecessarily try to get rid of those inodes and burn CPU on them.
> > > > Actually, mapping->private_list (and bh->b_assoc_buffers) is meant to be
> > > > used exactly for the purpose of tracking what needs to be written on fsync
> > > > so my current plan is to somehow utilize that list to fix the problem.
> > > > Maybe I even get to that tomorrow ;) Thanks for the reminder.
> > >
> > > I honestly don't remember all the details now, but I know that when
> > > reiserfs stopped using the b_assoc_buffers stuff life got much less
> > > complex. From an outsider's point of view the last thing jbd needs is
> > > another list of buffers to live on.
> > >
> > > It seems like ext34 need to be able to answer 3 questions during an
> > > fsync or fdatasync:
> > >
> > > The last transaction to change this file (fill hole, change
> > > i_size)
> > >
> > > The last transaction to log this inode (for full fsync)
> > >
> > > The last transaction committed such that fsync would consider it done.
> > >
> > > Filling holes and changing i_size only happens from a handful of places,
> > > so it would be easy to update a transid field in the in-memory inode for
> > > that.
> > >
> > > The inode logging code could bump a second transid field to catch all
> > > the other ways inodes change.
> > >
> > > The transaction code could (or already does?) export an easy way to
> > > check the last commit. Put the three together and you can safely jump
> > > out of fsync or fdatasync based on what the inode really needs instead
> > > of guessing with the I_ flags or page dirty bits.
> > I was thinking about this solution as well (before thinking about
> > using b_assoc_queue). But to make this reliable we have to pin the inode
> > in memory until transactions modifying it (or its data) are committed.
> > Otherwise it could be reclaimed and we'd loose the information about
> > which transactions we have to wait for. And when we pin the inode, we'd
> > have to unpin it at transaction commit which implies we have to somehow
> > attach the inode to the transaction... That doesn't look more appealing
> > than b_assoc_queue in my opinion (OK, for ext4/JBD2 we have a way to
> > attach inodes to transactions but ext3/JBD does not have such way).
>
> Most of the time the file is open from the time it is written to the
> time it is sunk. For the cases where we lose the inode, we just find a
> zero in the last trans field and error on the side of doing the full
> fsync. This will be a pretty rare event ;)
You're probably right that fsyncing an inode that has been freshly loaded
from disk is rare enough so that we can take the penalty of unnecessary
transaction commit in such cases. OK, I'll implement it like that. It
should be pretty straightforward. Thanks for the discussion.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR