2009-09-08 13:26:00

by Jan Kara

[permalink] [raw]
Subject: fsync on ext[34] working only by an accident

Hi,

When looking at how ext3/4 handles fsync, I've realized I don't
understand how writing out inode on fsync can work. The problem is that
ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
the inode. It just copies the in-memory inode content to disk buffer.
So in particular the inode looks clean to VFS and our check in
ext?_sync_file() shouldn't trigger.
The only obvious case when we call mark_inode_dirty() is from write_end
functions when we update i_size but that's clearly not enough. Now I did
some research why things seem to be actually working. The trick is that
when allocating block, we call vfs_dq_alloc_block() which calls
mark_inode_dirty(). But that's all what's keeping our fsync / writeout
logic from breaking!
There are even some cases when the logic actually is broken (I've tested
it and it really does not work) - for example when you create an empty
file, the inode won't get written when you fsync it.
So what we should IMHO do is to convert all ext?_mark_inode_dirty()
calls to simple mark_inode_dirty() (or even maybe introduce and use
mark_inode_dirty_datasync() where appropriate). It will cost us some more
CPU and stack space but if we optimize ext3_dirty_inode() for the case
where handle is already started, it shouldn't be too bad.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-09-10 06:46:05

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> Hi,
>
> When looking at how ext3/4 handles fsync, I've realized I don't
> understand how writing out inode on fsync can work. The problem is that
> ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> the inode. It just copies the in-memory inode content to disk buffer.
> So in particular the inode looks clean to VFS and our check in
> ext?_sync_file() shouldn't trigger.
> The only obvious case when we call mark_inode_dirty() is from write_end
> functions when we update i_size but that's clearly not enough. Now I did
> some research why things seem to be actually working. The trick is that
> when allocating block, we call vfs_dq_alloc_block() which calls
> mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> logic from breaking!

ext4_handle_dirty_metadata should do mark_inode_dirty right ?
__ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
-> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);



> There are even some cases when the logic actually is broken (I've tested
> it and it really does not work) - for example when you create an empty
> file, the inode won't get written when you fsync it.
> So what we should IMHO do is to convert all ext?_mark_inode_dirty()
> calls to simple mark_inode_dirty() (or even maybe introduce and use
> mark_inode_dirty_datasync() where appropriate). It will cost us some more
> CPU and stack space but if we optimize ext3_dirty_inode() for the case
> where handle is already started, it shouldn't be too bad.


-aneesh

2009-09-10 08:50:54

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

Hi,

On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > When looking at how ext3/4 handles fsync, I've realized I don't
> > understand how writing out inode on fsync can work. The problem is that
> > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > the inode. It just copies the in-memory inode content to disk buffer.
> > So in particular the inode looks clean to VFS and our check in
> > ext?_sync_file() shouldn't trigger.
> > The only obvious case when we call mark_inode_dirty() is from write_end
> > functions when we update i_size but that's clearly not enough. Now I did
> > some research why things seem to be actually working. The trick is that
> > when allocating block, we call vfs_dq_alloc_block() which calls
> > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > logic from breaking!
>
> ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
have a journal (BTW, the inode that gets dirtied in the nojournal case
is the block-device one, not the one whose metadata we mark as dirty, so
it won't work there either - but Google guys are working on this I think).
With a journal the function just calls jbd2_journal_dirty_metadata which
does nothing with the inode.

> > There are even some cases when the logic actually is broken (I've tested
> > it and it really does not work) - for example when you create an empty
> > file, the inode won't get written when you fsync it.
> > So what we should IMHO do is to convert all ext?_mark_inode_dirty()
> > calls to simple mark_inode_dirty() (or even maybe introduce and use
> > mark_inode_dirty_datasync() where appropriate). It will cost us some more
> > CPU and stack space but if we optimize ext3_dirty_inode() for the case
> > where handle is already started, it shouldn't be too bad.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-10 09:05:14

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote:
> Hi,
>
> On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > > When looking at how ext3/4 handles fsync, I've realized I don't
> > > understand how writing out inode on fsync can work. The problem is that
> > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > > the inode. It just copies the in-memory inode content to disk buffer.
> > > So in particular the inode looks clean to VFS and our check in
> > > ext?_sync_file() shouldn't trigger.
> > > The only obvious case when we call mark_inode_dirty() is from write_end
> > > functions when we update i_size but that's clearly not enough. Now I did
> > > some research why things seem to be actually working. The trick is that
> > > when allocating block, we call vfs_dq_alloc_block() which calls
> > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > > logic from breaking!
> >
> > ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
> ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
> have a journal (BTW, the inode that gets dirtied in the nojournal case
> is the block-device one, not the one whose metadata we mark as dirty, so
> it won't work there either - but Google guys are working on this I think).
> With a journal the function just calls jbd2_journal_dirty_metadata which
> does nothing with the inode.

When we don't have a journal handle we do that as a part of journal commit
right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty

I guess fsync only requires the meta data update to be in journal ?

-aneesh

2009-09-10 09:15:31

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu 10-09-09 14:34:49, Aneesh Kumar K.V wrote:
> On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote:
> > Hi,
> >
> > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > > > When looking at how ext3/4 handles fsync, I've realized I don't
> > > > understand how writing out inode on fsync can work. The problem is that
> > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > > > the inode. It just copies the in-memory inode content to disk buffer.
> > > > So in particular the inode looks clean to VFS and our check in
> > > > ext?_sync_file() shouldn't trigger.
> > > > The only obvious case when we call mark_inode_dirty() is from write_end
> > > > functions when we update i_size but that's clearly not enough. Now I did
> > > > some research why things seem to be actually working. The trick is that
> > > > when allocating block, we call vfs_dq_alloc_block() which calls
> > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > > > logic from breaking!
> > >
> > > ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
> > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
> > have a journal (BTW, the inode that gets dirtied in the nojournal case
> > is the block-device one, not the one whose metadata we mark as dirty, so
> > it won't work there either - but Google guys are working on this I think).
> > With a journal the function just calls jbd2_journal_dirty_metadata which
> > does nothing with the inode.
>
> When we don't have a journal handle we do that as a part of journal commit
^^^^^ you meant probably "do"

> right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty
Yes, that happens. But as I said above:
a) mark_buffer_dirty() dirties blockdevice inode and thus not the one we
check in ext4_sync_file().
b) this happens only after we commit the transaction and only if the buffer
is not part of the next transaction.
Believe me, I've actually checked with blktrace, that the code does not
work in some cases ;).

> I guess fsync only requires the meta data update to be in journal ?
Yes, to be more precise, it requires transaction with the metadata update
to be committed. And the problem is we force a transaction commit in
ext4_sync_file() only if the inode has a dirty bit set - which is
accidentally set only by quota code.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-10 09:15:54

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote:
> On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote:
> > Hi,
> >
> > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > > > When looking at how ext3/4 handles fsync, I've realized I don't
> > > > understand how writing out inode on fsync can work. The problem is that
> > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > > > the inode. It just copies the in-memory inode content to disk buffer.
> > > > So in particular the inode looks clean to VFS and our check in
> > > > ext?_sync_file() shouldn't trigger.
> > > > The only obvious case when we call mark_inode_dirty() is from write_end
> > > > functions when we update i_size but that's clearly not enough. Now I did
> > > > some research why things seem to be actually working. The trick is that
> > > > when allocating block, we call vfs_dq_alloc_block() which calls
> > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > > > logic from breaking!
> > >
> > > ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
> > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
> > have a journal (BTW, the inode that gets dirtied in the nojournal case
> > is the block-device one, not the one whose metadata we mark as dirty, so
> > it won't work there either - but Google guys are working on this I think).
> > With a journal the function just calls jbd2_journal_dirty_metadata which
> > does nothing with the inode.
>
> When we don't have a journal handle we do that as a part of journal commit
> right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty
>
> I guess fsync only requires the meta data update to be in journal ?
>

Adding the file inode to the sb->s_dirty is done through block_write_end ?
Why do you mention above that it is not "clearly not enough" ?

-aneesh

2009-09-10 10:52:14

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu 10-09-09 14:45:51, Aneesh Kumar K.V wrote:
> On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote:
> > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote:
> > > Hi,
> > >
> > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > > > > When looking at how ext3/4 handles fsync, I've realized I don't
> > > > > understand how writing out inode on fsync can work. The problem is that
> > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > > > > the inode. It just copies the in-memory inode content to disk buffer.
> > > > > So in particular the inode looks clean to VFS and our check in
> > > > > ext?_sync_file() shouldn't trigger.
> > > > > The only obvious case when we call mark_inode_dirty() is from write_end
> > > > > functions when we update i_size but that's clearly not enough. Now I did
> > > > > some research why things seem to be actually working. The trick is that
> > > > > when allocating block, we call vfs_dq_alloc_block() which calls
> > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > > > > logic from breaking!
> > > >
> > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
> > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
> > > have a journal (BTW, the inode that gets dirtied in the nojournal case
> > > is the block-device one, not the one whose metadata we mark as dirty, so
> > > it won't work there either - but Google guys are working on this I think).
> > > With a journal the function just calls jbd2_journal_dirty_metadata which
> > > does nothing with the inode.
> >
> > When we don't have a journal handle we do that as a part of journal commit
> > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty
> >
> > I guess fsync only requires the meta data update to be in journal ?
> >
>
> Adding the file inode to the sb->s_dirty is done through block_write_end ?
> Why do you mention above that it is not "clearly not enough" ?
Where? I don't see block_write_end() marking the inode dirty anywhere...
It calls __block_commit_write() and that dirties only buffers, not the
inode.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-10 11:04:57

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, Sep 10, 2009 at 12:52:16PM +0200, Jan Kara wrote:
> On Thu 10-09-09 14:45:51, Aneesh Kumar K.V wrote:
> > On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote:
> > > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote:
> > > > Hi,
> > > >
> > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> > > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > > > > > When looking at how ext3/4 handles fsync, I've realized I don't
> > > > > > understand how writing out inode on fsync can work. The problem is that
> > > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > > > > > the inode. It just copies the in-memory inode content to disk buffer.
> > > > > > So in particular the inode looks clean to VFS and our check in
> > > > > > ext?_sync_file() shouldn't trigger.
> > > > > > The only obvious case when we call mark_inode_dirty() is from write_end
> > > > > > functions when we update i_size but that's clearly not enough. Now I did
> > > > > > some research why things seem to be actually working. The trick is that
> > > > > > when allocating block, we call vfs_dq_alloc_block() which calls
> > > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > > > > > logic from breaking!
> > > > >
> > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> > > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> > > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
> > > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
> > > > have a journal (BTW, the inode that gets dirtied in the nojournal case
> > > > is the block-device one, not the one whose metadata we mark as dirty, so
> > > > it won't work there either - but Google guys are working on this I think).
> > > > With a journal the function just calls jbd2_journal_dirty_metadata which
> > > > does nothing with the inode.
> > >
> > > When we don't have a journal handle we do that as a part of journal commit
> > > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty
> > >
> > > I guess fsync only requires the meta data update to be in journal ?
> > >
> >
> > Adding the file inode to the sb->s_dirty is done through block_write_end ?
> > Why do you mention above that it is not "clearly not enough" ?
> Where? I don't see block_write_end() marking the inode dirty anywhere...
> It calls __block_commit_write() and that dirties only buffers, not the
> inode.


mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty

-aneesh

2009-09-10 12:32:38

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu 10-09-09 16:34:55, Aneesh Kumar K.V wrote:
> On Thu, Sep 10, 2009 at 12:52:16PM +0200, Jan Kara wrote:
> > On Thu 10-09-09 14:45:51, Aneesh Kumar K.V wrote:
> > > On Thu, Sep 10, 2009 at 02:34:49PM +0530, Aneesh Kumar K.V wrote:
> > > > On Thu, Sep 10, 2009 at 10:50:56AM +0200, Jan Kara wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu 10-09-09 12:16:05, Aneesh Kumar K.V wrote:
> > > > > > On Tue, Sep 08, 2009 at 03:26:01PM +0200, Jan Kara wrote:
> > > > > > > When looking at how ext3/4 handles fsync, I've realized I don't
> > > > > > > understand how writing out inode on fsync can work. The problem is that
> > > > > > > ext3/4 mostly calls ext?_mark_inode_dirty() which actually does *not* dirty
> > > > > > > the inode. It just copies the in-memory inode content to disk buffer.
> > > > > > > So in particular the inode looks clean to VFS and our check in
> > > > > > > ext?_sync_file() shouldn't trigger.
> > > > > > > The only obvious case when we call mark_inode_dirty() is from write_end
> > > > > > > functions when we update i_size but that's clearly not enough. Now I did
> > > > > > > some research why things seem to be actually working. The trick is that
> > > > > > > when allocating block, we call vfs_dq_alloc_block() which calls
> > > > > > > mark_inode_dirty(). But that's all what's keeping our fsync / writeout
> > > > > > > logic from breaking!
> > > > > >
> > > > > > ext4_handle_dirty_metadata should do mark_inode_dirty right ?
> > > > > > __ext4_handle_dirty_metadata -> mark_buffer_dirty ->__set_page_dirty
> > > > > > -> __mark_inode_dirty -> list_move(&inode->i_list, &sb->s_dirty);
> > > > > ext4_handle_dirty_metadata() marks the buffer dirty only when we do not
> > > > > have a journal (BTW, the inode that gets dirtied in the nojournal case
> > > > > is the block-device one, not the one whose metadata we mark as dirty, so
> > > > > it won't work there either - but Google guys are working on this I think).
> > > > > With a journal the function just calls jbd2_journal_dirty_metadata which
> > > > > does nothing with the inode.
> > > >
> > > > When we don't have a journal handle we do that as a part of journal commit
> > > > right ? __jbd2_journal_temp_unlink_buffer -> mark_buffer_dirty
> > > >
> > > > I guess fsync only requires the meta data update to be in journal ?
> > > >
> > >
> > > Adding the file inode to the sb->s_dirty is done through block_write_end ?
> > > Why do you mention above that it is not "clearly not enough" ?
> > Where? I don't see block_write_end() marking the inode dirty anywhere...
> > It calls __block_commit_write() and that dirties only buffers, not the
> > inode.
> mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty
OK, but that sets only I_DIRTY_PAGES not I_DIRTY_DATA or I_DIRTY.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-10 13:10:13

by Theodore Ts'o

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote:
> mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty

We need to be careful here. First of all, mark_buffer_dirty() on the
code paths you are talking about is being passed a metadata buffer
head. As such, has Jan has pointed out, the bh is part of the buffer
cache, so the page->mapping of associated with bh->b_page is the inode
of the block device --- *not* the ext4 inode.

Secondly, __set_page_dirty calls __mark_inode_dirty passing in
I_DIRTY_PAGES --- which should be a hint. What Jan is talking about
is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC:

* I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
* fdatasync(). i_atime is the usual cause.
* I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
* these changes separately from I_DIRTY_SYNC so that we
* don't have to write inode on fdatasync() when only
* mtime has changed in it.

This is important because ext4_sync_file() (which is called by fsync()
and fdatasync()) uses this logic to determine whether or not to call
sync_inode(), which is what will force a commit when wbc.sync_mode is
set to WB_SYNC_ALL.

In fact, I think the problem is worse than Jan is pointing out,
because it's not enough that vfs_fq_alloc_space() is calling
mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
set, so that fdatasync() will force a commit.

I think the right thing to do is to create an
_ext[34]_mark_inode_dirty() which takes an extra argument, which
controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In
fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we
should probably have ext[34]_mark_inode_dirty() call
_ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a
ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC.

This will cause pdflush to call ext4_write_inode() more frequently,
but pdflush calls write_inode with wait=0, and ext4_write_inode() is a
no-op in that case.

BTW, while I was looking into this, I noted that the comments ahead of
ext[34]_mark_inode_dirty are out of date; they date back to a time
when when prune_icache actually was responsible for cleaning dirty
inodes; these days, that honor is owned by fs-writeback.c and
pdflush.) Also, the second half of the comments in
ext4_write_inode(), where they reference mark_inode_dirty() are also
painfully out of date, and somewhat misleading as a result.

Does this make sense to every one?

- Ted

2009-09-10 14:06:35

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu 10-09-09 09:10:07, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote:
> > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty
>
> We need to be careful here. First of all, mark_buffer_dirty() on the
> code paths you are talking about is being passed a metadata buffer
> head. As such, has Jan has pointed out, the bh is part of the buffer
> cache, so the page->mapping of associated with bh->b_page is the inode
> of the block device --- *not* the ext4 inode.
>
> Secondly, __set_page_dirty calls __mark_inode_dirty passing in
> I_DIRTY_PAGES --- which should be a hint. What Jan is talking about
> is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC:
>
> * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> * fdatasync(). i_atime is the usual cause.
> * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> * these changes separately from I_DIRTY_SYNC so that we
> * don't have to write inode on fdatasync() when only
> * mtime has changed in it.
>
> This is important because ext4_sync_file() (which is called by fsync()
> and fdatasync()) uses this logic to determine whether or not to call
> sync_inode(), which is what will force a commit when wbc.sync_mode is
> set to WB_SYNC_ALL.
Yes, this is exactly what I was trying to point out.

> In fact, I think the problem is worse than Jan is pointing out,
> because it's not enough that vfs_fq_alloc_space() is calling
> mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
> i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
> set, so that fdatasync() will force a commit.
Actually no. mark_inode_dirty() dirties inode with I_DIRTY ==
(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work.
The fact that quota *could* dirty the inode with I_DIRTY_SYNC only
is right but that's a separate issue.

> I think the right thing to do is to create an
> _ext[34]_mark_inode_dirty() which takes an extra argument, which
> controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In
> fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we
> should probably have ext[34]_mark_inode_dirty() call
> _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a
> ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC.
>
> This will cause pdflush to call ext4_write_inode() more frequently,
> but pdflush calls write_inode with wait=0, and ext4_write_inode() is a
> no-op in that case.
Thinking about it, it won't work so easily. The problem is that when
pdflush decides to write the inode, it unconditionally clears dirty flags.
We could redirty the inode in write_inode() but that's IMHO too ugly to
bear it.
We could use ext[34] internal inode dirty flags to track when transaction
commit is needed on fsync and fdatasync... That would provide the integrity
guarantees. The performance problem with this is that because these flags
won't get cleared on transaction commit, we'd force transaction commit
unnecessarily when it already happened before fsync. So either we can force
commit only if inode buffer is part of the running transaction (but that's
slightly hacky as in fact we want to force a commit whetever some metadata
is part of the running transaction) or we can let inode track TIDs
(transaction ID) which must be committed to return from fsync / fdatasync.
The last possibility looks the best to me so I'd go for it. I can write a
patch next week...

> BTW, while I was looking into this, I noted that the comments ahead of
> ext[34]_mark_inode_dirty are out of date; they date back to a time
> when when prune_icache actually was responsible for cleaning dirty
> inodes; these days, that honor is owned by fs-writeback.c and
> pdflush.) Also, the second half of the comments in
> ext4_write_inode(), where they reference mark_inode_dirty() are also
> painfully out of date, and somewhat misleading as a result.
Yeah, I also couldn't make sence from some of those comments probably
because I lack enough history context ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-10 16:25:22

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, Sep 10, 2009 at 09:10:07AM -0400, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote:
> > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty
>
> We need to be careful here. First of all, mark_buffer_dirty() on the
> code paths you are talking about is being passed a metadata buffer
> head. As such, has Jan has pointed out, the bh is part of the buffer
> cache, so the page->mapping of associated with bh->b_page is the inode
> of the block device --- *not* the ext4 inode.
>
> Secondly, __set_page_dirty calls __mark_inode_dirty passing in
> I_DIRTY_PAGES --- which should be a hint. What Jan is talking about
> is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC:
>
> * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> * fdatasync(). i_atime is the usual cause.
> * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> * these changes separately from I_DIRTY_SYNC so that we
> * don't have to write inode on fdatasync() when only
> * mtime has changed in it.
>
> This is important because ext4_sync_file() (which is called by fsync()
> and fdatasync()) uses this logic to determine whether or not to call
> sync_inode(), which is what will force a commit when wbc.sync_mode is
> set to WB_SYNC_ALL.
>
> In fact, I think the problem is worse than Jan is pointing out,
> because it's not enough that vfs_fq_alloc_space() is calling
> mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
> i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
> set, so that fdatasync() will force a commit.
>

That explained it pretty nicely. Thanks

-aneesh

2009-09-10 16:52:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, Sep 10, 2009 at 04:06:36PM +0200, Jan Kara wrote:
> > In fact, I think the problem is worse than Jan is pointing out,
> > because it's not enough that vfs_fq_alloc_space() is calling
> > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
> > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
> > set, so that fdatasync() will force a commit.
> Actually no. mark_inode_dirty() dirties inode with I_DIRTY ==
> (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work.
> The fact that quota *could* dirty the inode with I_DIRTY_SYNC only
> is right but that's a separate issue.

Oops, you're right. I mixed up I_DIRTY and I_DIRTY_SYNC. Hmm, what's
actually a bit surprising is that we don't have a
mark_inode_dirty_datasync() which only sets I_DIRTY_DATASYNC.

And shouldn't quota be dirtying the inode using I_DIRTY_DATASYNC only?
After, all the reason why we need to do this is because it's messing
with i_size, right?

> Thinking about it, it won't work so easily. The problem is that when
> pdflush decides to write the inode, it unconditionally clears dirty flags.
> We could redirty the inode in write_inode() but that's IMHO too ugly to
> bear it.

Hmm, yes. I wonder if this is something we should change, and make it
be the responsibility of the filesystem's write_inode method function
to clear I_DIRTY_SYNC and I_DIRTY_DATASYNC flags. That would allow
the file system to refuse to clean the inode for whatever reason the
file system saw fit. That would require sweeping through all of the
file systems, but it might be useful for more file systems other than
ext3/ext4.

The problem is it's a little late, given that 2.6.31 has already been
released, to try to get consensus on that way of doing things.
Tracking the TID is probably the best short-term way of handling this
problem, although it means bloating the in-core inode by another four
bytes, which is a bit of a shame.

- Ted

2009-09-10 20:14:45

by Mingming Cao

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu, 2009-09-10 at 16:06 +0200, Jan Kara wrote:
> On Thu 10-09-09 09:10:07, Theodore Tso wrote:
> > On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote:
> > > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty
> >
> > We need to be careful here. First of all, mark_buffer_dirty() on the
> > code paths you are talking about is being passed a metadata buffer
> > head. As such, has Jan has pointed out, the bh is part of the buffer
> > cache, so the page->mapping of associated with bh->b_page is the inode
> > of the block device --- *not* the ext4 inode.
> >
> > Secondly, __set_page_dirty calls __mark_inode_dirty passing in
> > I_DIRTY_PAGES --- which should be a hint. What Jan is talking about
> > is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC:
> >
> > * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> > * fdatasync(). i_atime is the usual cause.
> > * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> > * these changes separately from I_DIRTY_SYNC so that we
> > * don't have to write inode on fdatasync() when only
> > * mtime has changed in it.
> >
> > This is important because ext4_sync_file() (which is called by fsync()
> > and fdatasync()) uses this logic to determine whether or not to call
> > sync_inode(), which is what will force a commit when wbc.sync_mode is
> > set to WB_SYNC_ALL.
> Yes, this is exactly what I was trying to point out.
>
> > In fact, I think the problem is worse than Jan is pointing out,
> > because it's not enough that vfs_fq_alloc_space() is calling
> > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
> > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
> > set, so that fdatasync() will force a commit.
> Actually no. mark_inode_dirty() dirties inode with I_DIRTY ==
> (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work.
> The fact that quota *could* dirty the inode with I_DIRTY_SYNC only
> is right but that's a separate issue.
>
> > I think the right thing to do is to create an
> > _ext[34]_mark_inode_dirty() which takes an extra argument, which
> > controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In
> > fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we
> > should probably have ext[34]_mark_inode_dirty() call
> > _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a
> > ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC.
> >
> > This will cause pdflush to call ext4_write_inode() more frequently,
> > but pdflush calls write_inode with wait=0, and ext4_write_inode() is a
> > no-op in that case.
> Thinking about it, it won't work so easily. The problem is that when
> pdflush decides to write the inode, it unconditionally clears dirty flags.
> We could redirty the inode in write_inode() but that's IMHO too ugly to
> bear it.

I am a little confused here, so pdflush could found the dirty inodes
(due to quota) but it doesn't force journal comminit and write the inode
to disk?

Mingming


2009-09-14 15:25:28

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu 10-09-09 13:14:46, Mingming wrote:
> On Thu, 2009-09-10 at 16:06 +0200, Jan Kara wrote:
> > On Thu 10-09-09 09:10:07, Theodore Tso wrote:
> > > On Thu, Sep 10, 2009 at 04:34:55PM +0530, Aneesh Kumar K.V wrote:
> > > > mark_buffer_dirty -> __set_page_dirty -> __mark_inode_dirty
> > >
> > > We need to be careful here. First of all, mark_buffer_dirty() on the
> > > code paths you are talking about is being passed a metadata buffer
> > > head. As such, has Jan has pointed out, the bh is part of the buffer
> > > cache, so the page->mapping of associated with bh->b_page is the inode
> > > of the block device --- *not* the ext4 inode.
> > >
> > > Secondly, __set_page_dirty calls __mark_inode_dirty passing in
> > > I_DIRTY_PAGES --- which should be a hint. What Jan is talking about
> > > is where we set the inode flags I_DIRTY_SYNC and I_DIRTY_DATASYNC:
> > >
> > > * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on
> > > * fdatasync(). i_atime is the usual cause.
> > > * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of
> > > * these changes separately from I_DIRTY_SYNC so that we
> > > * don't have to write inode on fdatasync() when only
> > > * mtime has changed in it.
> > >
> > > This is important because ext4_sync_file() (which is called by fsync()
> > > and fdatasync()) uses this logic to determine whether or not to call
> > > sync_inode(), which is what will force a commit when wbc.sync_mode is
> > > set to WB_SYNC_ALL.
> > Yes, this is exactly what I was trying to point out.
> >
> > > In fact, I think the problem is worse than Jan is pointing out,
> > > because it's not enough that vfs_fq_alloc_space() is calling
> > > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
> > > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
> > > set, so that fdatasync() will force a commit.
> > Actually no. mark_inode_dirty() dirties inode with I_DIRTY ==
> > (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work.
> > The fact that quota *could* dirty the inode with I_DIRTY_SYNC only
> > is right but that's a separate issue.
> >
> > > I think the right thing to do is to create an
> > > _ext[34]_mark_inode_dirty() which takes an extra argument, which
> > > controls whether or not we set I_DIRTY_SYNC or I_DIRTY_DATASYNC. In
> > > fact, most of the time, we want to be setting I_DIRTY_DATASYNC, so we
> > > should probably have ext[34]_mark_inode_dirty() call
> > > _ext[34]_mark_inode_dirty() with I_DIRTY_DATASYNC, and then create a
> > > ext[34]_mark_inode_nodatasync() version passes in I_DIRTY_SYNC.
> > >
> > > This will cause pdflush to call ext4_write_inode() more frequently,
> > > but pdflush calls write_inode with wait=0, and ext4_write_inode() is a
> > > no-op in that case.
> > Thinking about it, it won't work so easily. The problem is that when
> > pdflush decides to write the inode, it unconditionally clears dirty flags.
> > We could redirty the inode in write_inode() but that's IMHO too ugly to
> > bear it.
>
> I am a little confused here, so pdflush could found the dirty inodes
> (due to quota) but it doesn't force journal comminit and write the inode
> to disk?
pdflush never forces a journal commit. It just does a periodic writeout
so there's no need to force a journal commit. It just calls write_inode(),
which ext3/4 implement as NOP because the inode is already part of the
running transaction when it is dirty.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-14 16:00:39

by Jan Kara

[permalink] [raw]
Subject: Re: fsync on ext[34] working only by an accident

On Thu 10-09-09 12:52:01, Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:06:36PM +0200, Jan Kara wrote:
> > > In fact, I think the problem is worse than Jan is pointing out,
> > > because it's not enough that vfs_fq_alloc_space() is calling
> > > mark_inode_dirty(), since that only sets I_DIRTY_SYNC. When we touch
> > > i_size or i_block[], we need to make sure that I_DIRTY_DATASYNC is
> > > set, so that fdatasync() will force a commit.
> > Actually no. mark_inode_dirty() dirties inode with I_DIRTY ==
> > (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) so it happens to work.
> > The fact that quota *could* dirty the inode with I_DIRTY_SYNC only
> > is right but that's a separate issue.
>
> Oops, you're right. I mixed up I_DIRTY and I_DIRTY_SYNC. Hmm, what's
> actually a bit surprising is that we don't have a
> mark_inode_dirty_datasync() which only sets I_DIRTY_DATASYNC.
Yeah. But that's easy to fix ;).

> And shouldn't quota be dirtying the inode using I_DIRTY_DATASYNC only?
> After, all the reason why we need to do this is because it's messing
> with i_size, right?
Quota does not touch i_size. It touches only i_blocks - that's why it
dirties the inode. Since i_blocks are not really needed to get to the data,
I think I_DIRTY_SYNC for quota is more appropriate.

> > Thinking about it, it won't work so easily. The problem is that when
> > pdflush decides to write the inode, it unconditionally clears dirty flags.
> > We could redirty the inode in write_inode() but that's IMHO too ugly to
> > bear it.
>
> Hmm, yes. I wonder if this is something we should change, and make it
> be the responsibility of the filesystem's write_inode method function
> to clear I_DIRTY_SYNC and I_DIRTY_DATASYNC flags. That would allow
> the file system to refuse to clean the inode for whatever reason the
> file system saw fit. That would require sweeping through all of the
> file systems, but it might be useful for more file systems other than
> ext3/ext4.
Possibly yes. But OTOH from the point of view of pdflush, the inode
actually is clean. It could be even evicted from memory when needed before
the transaction commits - which invalidates my approach to solving the
fsync() trouble. Nasty.
Also just not clearing dirty bits in write_inode() would probably rather
confuse current writeback code because all inodes would stay dirty until
a transaction commit with no way of getting rid of them. So pdflush would
needlessly burn CPU cycles scanning them and trying to write them. Also
it's not completely clear how we would clear the dirty bits after the
transaction commits since during commit, it could become part of a freshly
started transaction.
Sigh, it's rather convoluted. Probably we have to somehow pin the inode
in memory until the transaction commits to fix the "remember TID" approach.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR