Hi,
As described earlier, code which wants to write an inode cannot rely on
the I_DIRTY bits (on inode->i_state) being clean to guarantee that the
inode and its dirty pages, if any, are safely synced on disk.
The reason for that is sync_one() --- it cleans the I_DIRTY bits of an
inode, sets the I_LOCK and starts a writeout.
If sync_one() is called to write an inode _asynchronously_, there is no
guarantee that an inode will have its data fully synced on disk even if
the inode gets unlocked, which means the previous fix to
generic_osync_inode() is not safe.
The easy and safe fix is to simply remove the I_DIRTY_* checks from
generic_osync_inode and ext2_fsync_inode. Easy but slow. Another fix would
be to make sync_one() unconditionally synchronous... slow.
Any suggestion for a fast, safe, but simple fix to this bug?
Hi,
On Sat, Apr 14, 2001 at 07:24:42AM -0300, Marcelo Tosatti wrote:
>
> As described earlier, code which wants to write an inode cannot rely on
> the I_DIRTY bits (on inode->i_state) being clean to guarantee that the
> inode and its dirty pages, if any, are safely synced on disk.
Indeed --- for all such structures, including pages, buffer_heads and
inodes, you can only assume that the object is safely on disk if you
have checked both the dirty bit AND the locked bit. If you find it
locked but clean, then a writeout may be in progress, so you need to
do a wait_on_* to be really sure that the write has completed.
> The reason for that is sync_one() --- it cleans the I_DIRTY bits of an
> inode, sets the I_LOCK and starts a writeout.
As long as it is setting the I_LOCK bit, then that's fine.
> The easy and safe fix is to simply remove the I_DIRTY_* checks from
> generic_osync_inode and ext2_fsync_inode. Easy but slow. Another fix would
> be to make sync_one() unconditionally synchronous... slow.
Just make the *sync functions look for the locked bit and do a wait on
the inode if it is locked.
Cheers,
Stephen
On Tue, 17 Apr 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Sat, Apr 14, 2001 at 07:24:42AM -0300, Marcelo Tosatti wrote:
> >
> > As described earlier, code which wants to write an inode cannot rely on
> > the I_DIRTY bits (on inode->i_state) being clean to guarantee that the
> > inode and its dirty pages, if any, are safely synced on disk.
>
> Indeed --- for all such structures, including pages, buffer_heads and
> inodes, you can only assume that the object is safely on disk if you
> have checked both the dirty bit AND the locked bit.
No.
> If you find it locked but clean, then a writeout may be in progress,
> so you need to do a wait_on_* to be really sure that the write has
> completed.
As far as I can see, you cannot guarantee that an inode which is unlocked
_and_ clean (according to inode->i_state) is safely on disk.
The reason for that is calls to sync_one() which write the inode
asynchronously:
sync_one(struct inode *inode, int sync) {
        ...
        dirty = inode->i_state & I_DIRTY;
        inode->i_state |= I_LOCK;
        inode->i_state &= ~I_DIRTY;
        spin_unlock(&inode_lock);

        filemap_fdatasync(inode->i_mapping);

        /* Don't write the inode if only I_DIRTY_PAGES was set */
        if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
                write_inode(inode, sync);       <-------------

        filemap_fdatawait(inode->i_mapping);

        spin_lock(&inode_lock);
        inode->i_state &= ~I_LOCK;
        wake_up(&inode->i_wait);
}
The fs-specific write_inode() function will simply "generate" the dirty
buffer_head and add it to the dirty list. After that, it unlocks the
inode.
Where is the guarantee that this dirty buffer head has hit the disk?
Do you see what I mean now?
I don't see a clean "for 2.4" fix except writing the inode
unconditionally at _each_ call which needs to write the inode
synchronously.
Hi,
On Wed, Apr 18, 2001 at 06:45:40AM -0300, Marcelo Tosatti wrote:
> As far as I can see, you cannot guarantee that an inode which is unlocked
> _and_ clean (according to inode->i_state) is safely on disk.
>
> The reason for that is calls to sync_one() which write the inode
> asynchronously:
>
> sync_one(struct inode *inode, int sync) {
>         ...
>         /* Don't write the inode if only I_DIRTY_PAGES was set */
>         if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
>                 write_inode(inode, sync);       <-------------
> ...
>         inode->i_state &= ~I_LOCK;
Right --- nasty. But not _too_ nasty, there's a moderately easy way
of dealing with this.
Basically we can't trust the i_state flag for this purpose if we are
allowing async IO to happen on inodes without having proper IO
completion callbacks marking the inodes as unlocked once they are firm
on disk. However, in this case the filesystem itself will know which
underlying buffer_head contains the inode data, and can check to see
if that buffer is locked and perform a wait if necessary.
This is somewhat unpleasant in that it may sometimes cause unnecessary
false sharing, given that we have multiple inodes in an inode block.
However, I can't see any simple way around that.
Linus, do you have any preference for how to deal with this? We can
either do it by adding a s_op->wait_inode() function to complement
write_inode(), and have a wait_inode() implementation in block-device
filesystems which does the buffer lookup and wait; or we can push that
whole logic into the filesystems, so that the I_DIRTY check is removed
from the VFS mid-layer altogether and the filesystem is responsible
for testing both the inode and buffer locked state when we try to wait
for outstanding inode IO to complete.
The second method is a bit cleaner conceptually but it results in more
code duplication.
Cheers,
Stephen