This is a call for review and comments from all filesystem developers.
Not just those cced, but everyone who maintains a filesystem[*], because
I wasn't able to put in a lot of time to understand the more complex
filesystems.
[*] You all read linux-fsdevel anyway, right? If not, please do, it's
pretty low volume and high s/n.
So there seem to be several problems with inode data and metadata
synchronization. Some of it is bugs in core code, but also a couple
of oft repeated bugs in filesystems.
http://marc.info/?l=linux-fsdevel&m=129052164315259&w=2
Basically 2 patterns of problem.
First is checking inode dirty bits
(I_DIRTY/I_DIRTY_SYNC/I_DIRTY_DATASYNC) without locking. What happens is
that other paths (async writeout or even a concurrent sync or fsync) can
clear these bits with the data not being on platter.
* solution: must wait on I_SYNC before testing these things. See my
patch in above linked series. I think I covered everyone here, but
please double check.
* There is opportunity in some filesystems for clearing inode dirty bits
in fsync, and for checking and avoiding fsync work if dirty bits are
not sure.
Second is confusing sync and async inode metadata writeout
Core code clears I_DIRTY_SYNC and I_DIRTY_DATASYNC before calling
->write_inode *regardless* of whether it is a for-integrity call or
not. This means background writeback can clear it, and subsequent
sync_inode_metadata or sync(2) call will skip the next ->write_inode
completely.
* solution: for fsync, you must ensure that everything that a
synchronous ->write_inode call does is also done by an
asynchronous ->write_inode call *plus* a subsequent fsync.
So if a synchronous ->write_inode syncs a bh, but an async one
just marks it dirty, your .fsync would have to sync the bh.
* solution: for sync(2), you must ensure that everything that a
synchronous ->write_inode call does is also done by an
asynchronous ->write_inode call plus a subsequent ->sync_fs
call, plus __sync_blockdev call on the buffer mapping.
Many simple filesystems just go via buffer mapping, and
->write_inode simply dirties the inode's bh. These guys are
fine here (although most are broken wrt fsync).
If you need any more state bits in your inode to work out what is going
on, Christoph's recent hfsplus fsync optimisation patches has a good
example of how it can be done.
The ext2 fix below is an example of how we can do this nicely, the
rest of the filesystems I note when it looks like they went wrong, and
band aid fixed it.
---
adfs/inode.c | 2 +-
affs/file.c | 1 +
bfs/inode.c | 2 +-
exofs/inode.c | 2 +-
ext2/ext2.h | 2 ++
ext2/file.c | 24 +++++++++++++++++++++---
ext2/inode.c | 12 ++----------
ext4/inode.c | 2 +-
fat/inode.c | 2 +-
jfs/inode.c | 2 +-
minix/inode.c | 2 +-
omfs/inode.c | 2 +-
reiserfs/inode.c | 2 ++
sysv/inode.c | 2 +-
udf/inode.c | 2 +-
ufs/inode.c | 2 +-
16 files changed, 39 insertions(+), 24 deletions(-)
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c 2010-11-24 11:29:42.000000000 +1100
+++ linux-2.6/fs/ext2/inode.c 2010-11-25 17:22:49.000000000 +1100
@@ -1211,7 +1211,7 @@ static int ext2_setsize(struct inode *in
return 0;
}
-static struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
struct buffer_head **p)
{
struct buffer_head * bh;
@@ -1505,16 +1505,8 @@ static int __ext2_write_inode(struct ino
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
mark_buffer_dirty(bh);
- if (do_sync) {
- sync_dirty_buffer(bh);
- if (buffer_req(bh) && !buffer_uptodate(bh)) {
- printk ("IO error syncing ext2 inode [%s:%08lx]\n",
- sb->s_id, (unsigned long) ino);
- err = -EIO;
- }
- }
- ei->i_state &= ~EXT2_STATE_NEW;
brelse (bh);
+ ei->i_state &= ~EXT2_STATE_NEW;
return err;
}
Index: linux-2.6/fs/ext2/file.c
===================================================================
--- linux-2.6.orig/fs/ext2/file.c 2010-11-24 11:29:42.000000000 +1100
+++ linux-2.6/fs/ext2/file.c 2010-11-24 12:23:53.000000000 +1100
@@ -21,6 +21,7 @@
#include <linux/time.h>
#include <linux/pagemap.h>
#include <linux/quotaops.h>
+#include <linux/buffer_head.h>
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
@@ -43,16 +44,33 @@ static int ext2_release_file (struct ino
int ext2_fsync(struct file *file, int datasync)
{
int ret;
- struct super_block *sb = file->f_mapping->host->i_sb;
- struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ struct inode *inode = file->f_mapping->host;
+ ino_t ino = inode->i_ino;
+ struct super_block *sb = inode->i_sb;
+ struct address_space *sb_mapping = sb->s_bdev->bd_inode->i_mapping;
+ struct buffer_head *bh;
+ struct ext2_inode *raw_inode;
ret = generic_file_fsync(file, datasync);
- if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
+ if (ret == -EIO || test_and_clear_bit(AS_EIO, &sb_mapping->flags)) {
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
"detected IO error when writing metadata buffers");
+ return -EIO;
+ }
+
+ raw_inode = ext2_get_inode(sb, ino, &bh);
+ if (IS_ERR(raw_inode))
+ return -EIO;
+
+ sync_dirty_buffer(bh);
+ if (buffer_req(bh) && !buffer_uptodate(bh)) {
+ printk ("IO error syncing ext2 inode [%s:%08lx]\n",
+ sb->s_id, (unsigned long) ino);
ret = -EIO;
}
+ brelse (bh);
+
return ret;
}
Index: linux-2.6/fs/ext2/ext2.h
===================================================================
--- linux-2.6.orig/fs/ext2/ext2.h 2010-11-24 11:29:42.000000000 +1100
+++ linux-2.6/fs/ext2/ext2.h 2010-11-24 12:23:53.000000000 +1100
@@ -124,6 +124,8 @@ extern int ext2_get_block(struct inode *
extern int ext2_setattr (struct dentry *, struct iattr *);
extern void ext2_set_inode_flags(struct inode *inode);
extern void ext2_get_inode_flags(struct ext2_inode_info *);
+extern struct ext2_inode *ext2_get_inode(struct super_block *sb, ino_t ino,
+ struct buffer_head **p);
extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
Index: linux-2.6/fs/adfs/inode.c
===================================================================
--- linux-2.6.orig/fs/adfs/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/adfs/inode.c 2010-11-25 17:30:08.000000000 +1100
@@ -383,7 +383,7 @@ int adfs_write_inode(struct inode *inode
obj.attr = ADFS_I(inode)->attr;
obj.size = inode->i_size;
- ret = adfs_dir_update(sb, &obj, wbc->sync_mode == WB_SYNC_ALL);
+ ret = adfs_dir_update(sb, &obj, 1 /* XXX: fix fsync and use 'wbc->sync_mode == WB_SYNC_ALL' */);
unlock_kernel();
return ret;
}
Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c 2010-11-25 17:31:12.000000000 +1100
+++ linux-2.6/fs/affs/file.c 2010-11-25 17:31:30.000000000 +1100
@@ -931,6 +931,7 @@ int affs_file_fsync(struct file *filp, i
int ret, err;
ret = write_inode_now(inode, 0);
+ /* XXX: could just sync the buffer been dirtied by write_inode */
err = sync_blockdev(inode->i_sb->s_bdev);
if (!ret)
ret = err;
Index: linux-2.6/fs/bfs/inode.c
===================================================================
--- linux-2.6.orig/fs/bfs/inode.c 2010-11-25 17:23:40.000000000 +1100
+++ linux-2.6/fs/bfs/inode.c 2010-11-25 17:32:26.000000000 +1100
@@ -151,7 +151,7 @@ static int bfs_write_inode(struct inode
di->i_eoffset = cpu_to_le32(i_sblock * BFS_BSIZE + inode->i_size - 1);
mark_buffer_dirty(bh);
- if (wbc->sync_mode == WB_SYNC_ALL) {
+ if (1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ ) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh))
err = -EIO;
Index: linux-2.6/fs/exofs/inode.c
===================================================================
--- linux-2.6.orig/fs/exofs/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/exofs/inode.c 2010-11-25 17:36:32.000000000 +1100
@@ -1273,7 +1273,7 @@ static int exofs_update_inode(struct ino
int exofs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
- return exofs_update_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+ return exofs_update_inode(inode, 1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ );
}
/*
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/ext4/inode.c 2010-11-25 17:40:37.000000000 +1100
@@ -5240,7 +5240,7 @@ int ext4_write_inode(struct inode *inode
err = __ext4_get_inode_loc(inode, &iloc, 0);
if (err)
return err;
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (1 /* XXX: fix fxync and use wbc->sync_mode == WB_SYNC_ALL */)
sync_dirty_buffer(iloc.bh);
if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr,
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c 2010-11-25 17:23:40.000000000 +1100
+++ linux-2.6/fs/fat/inode.c 2010-11-25 17:42:04.000000000 +1100
@@ -645,7 +645,7 @@ static int __fat_write_inode(struct inod
spin_unlock(&sbi->inode_hash_lock);
mark_buffer_dirty(bh);
err = 0;
- if (wait)
+ if (1 /* XXX: fix fsync and use wait */)
err = sync_dirty_buffer(bh);
brelse(bh);
return err;
Index: linux-2.6/fs/jfs/inode.c
===================================================================
--- linux-2.6.orig/fs/jfs/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/jfs/inode.c 2010-11-25 17:45:27.000000000 +1100
@@ -123,7 +123,7 @@ int jfs_commit_inode(struct inode *inode
int jfs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
- int wait = wbc->sync_mode == WB_SYNC_ALL;
+ int wait = 1; /* XXX fix fsync and use wbc->sync_mode == WB_SYNC_ALL; */
if (test_cflag(COMMIT_Nolink, inode))
return 0;
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c 2010-11-25 17:23:40.000000000 +1100
+++ linux-2.6/fs/minix/inode.c 2010-11-25 17:46:42.000000000 +1100
@@ -576,7 +576,7 @@ static int minix_write_inode(struct inod
bh = V2_minix_update_inode(inode);
if (!bh)
return -EIO;
- if (wbc->sync_mode == WB_SYNC_ALL && buffer_dirty(bh)) {
+ if (1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ && buffer_dirty(bh)) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {
printk("IO error syncing minix inode [%s:%08lx]\n",
Index: linux-2.6/fs/omfs/inode.c
===================================================================
--- linux-2.6.orig/fs/omfs/inode.c 2010-11-25 17:23:40.000000000 +1100
+++ linux-2.6/fs/omfs/inode.c 2010-11-25 17:50:50.000000000 +1100
@@ -169,7 +169,7 @@ static int __omfs_write_inode(struct ino
static int omfs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
- return __omfs_write_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
+ return __omfs_write_inode(inode, 1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */);
}
int omfs_sync_inode(struct inode *inode)
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/reiserfs/inode.c 2010-11-25 17:53:45.000000000 +1100
@@ -1635,6 +1635,8 @@ int reiserfs_write_inode(struct inode *i
** these cases are just when the system needs ram, not when the
** inode needs to reach disk for safety, and they can safely be
** ignored because the altered inode has already been logged.
+ ** XXX: is this really OK? The caller clears the inode dirty bit, so
+ ** a subsequent sync for integrity might never reach here.
*/
if (wbc->sync_mode == WB_SYNC_ALL && !(current->flags & PF_MEMALLOC)) {
reiserfs_write_lock(inode->i_sb);
Index: linux-2.6/fs/sysv/inode.c
===================================================================
--- linux-2.6.orig/fs/sysv/inode.c 2010-11-25 17:23:40.000000000 +1100
+++ linux-2.6/fs/sysv/inode.c 2010-11-25 17:54:47.000000000 +1100
@@ -286,7 +286,7 @@ static int __sysv_write_inode(struct ino
write3byte(sbi, (u8 *)&si->i_data[block],
&raw_inode->i_data[3*block]);
mark_buffer_dirty(bh);
- if (wait) {
+ if (1 /* XXX: fix fsync and use wait */) {
sync_dirty_buffer(bh);
if (buffer_req(bh) && !buffer_uptodate(bh)) {
printk ("IO error syncing sysv inode [%s:%08x]\n",
Index: linux-2.6/fs/udf/inode.c
===================================================================
--- linux-2.6.orig/fs/udf/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/udf/inode.c 2010-11-25 17:55:36.000000000 +1100
@@ -1598,7 +1598,7 @@ static int udf_update_inode(struct inode
/* write the data blocks */
mark_buffer_dirty(bh);
- if (do_sync) {
+ if (1 /* XXX fix fsync and use do_sync */) {
sync_dirty_buffer(bh);
if (buffer_write_io_error(bh)) {
printk(KERN_WARNING "IO error syncing udf inode "
Index: linux-2.6/fs/ufs/inode.c
===================================================================
--- linux-2.6.orig/fs/ufs/inode.c 2010-11-25 17:25:54.000000000 +1100
+++ linux-2.6/fs/ufs/inode.c 2010-11-25 17:56:12.000000000 +1100
@@ -889,7 +889,7 @@ static int ufs_update_inode(struct inode
}
mark_buffer_dirty(bh);
- if (do_sync)
+ if (1 /* XXX: fix fsync and use do_sync */)
sync_dirty_buffer(bh);
brelse (bh);
On 11/25/2010 09:49 AM, Nick Piggin wrote:
> This is a call for review and comments from all filesystem developers.
> Not just those cced, but everyone who maintains a filesystem[*], because
> I wasn't able to put in a lot of time to understand the more complex
> filesystems.
>
> [*] You all read linux-fsdevel anyway, right? If not, please do, it's
> pretty low volume and high s/n.
>
> So there seem to be several problems with inode data and metadata
> synchronization. Some of it is bugs in core code, but also a couple
> of oft repeated bugs in filesystems.
>
> http://marc.info/?l=linux-fsdevel&m=129052164315259&w=2
>
> Basically 2 patterns of problem.
>
> First is checking inode dirty bits
> (I_DIRTY/I_DIRTY_SYNC/I_DIRTY_DATASYNC) without locking. What happens is
> that other paths (async writeout or even a concurrent sync or fsync) can
> clear these bits with the data not being on platter.
>
> * solution: must wait on I_SYNC before testing these things. See my
> patch in above linked series. I think I covered everyone here, but
> please double check.
>
> * There is opportunity in some filesystems for clearing inode dirty bits
> in fsync, and for checking and avoiding fsync work if dirty bits are
> not sure.
>
> Second is confusing sync and async inode metadata writeout
> Core code clears I_DIRTY_SYNC and I_DIRTY_DATASYNC before calling
> ->write_inode *regardless* of whether it is a for-integrity call or
> not. This means background writeback can clear it, and subsequent
> sync_inode_metadata or sync(2) call will skip the next ->write_inode
> completely.
>
> * solution: for fsync, you must ensure that everything that a
> synchronous ->write_inode call does is also done by an
> asynchronous ->write_inode call *plus* a subsequent fsync.
>
> So if a synchronous ->write_inode syncs a bh, but an async one
> just marks it dirty, your .fsync would have to sync the bh.
>
> * solution: for sync(2), you must ensure that everything that a
> synchronous ->write_inode call does is also done by an
> asynchronous ->write_inode call plus a subsequent ->sync_fs
> call, plus __sync_blockdev call on the buffer mapping.
>
> Many simple filesystems just go via buffer mapping, and
> ->write_inode simply dirties the inode's bh. These guys are
> fine here (although most are broken wrt fsync).
>
> If you need any more state bits in your inode to work out what is going
> on, Christoph's recent hfsplus fsync optimisation patches has a good
> example of how it can be done.
>
> The ext2 fix below is an example of how we can do this nicely, the
> rest of the filesystems I note when it looks like they went wrong, and
> band aid fixed it.
>
> Index: linux-2.6/fs/exofs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/exofs/inode.c 2010-11-25 17:25:54.000000000 +1100
> +++ linux-2.6/fs/exofs/inode.c 2010-11-25 17:36:32.000000000 +1100
> @@ -1273,7 +1273,7 @@ static int exofs_update_inode(struct ino
>
> int exofs_write_inode(struct inode *inode, struct writeback_control *wbc)
> {
> - return exofs_update_inode(inode, wbc->sync_mode == WB_SYNC_ALL);
> + return exofs_update_inode(inode, 1 /* XXX: fix fsync and use wbc->sync_mode == WB_SYNC_ALL */ );
> }
>
> /*
Hi Nick.
Thanks for digging into this issue, I bet it's causing pain. Which
I totally missed in my tests. I wish I had a better xsync+reboot
tests for all this.
So in that previous patch you had:
> Index: linux-2.6/fs/exofs/file.c
> ===================================================================
> --- linux-2.6.orig/fs/exofs/file.c 2010-11-19 16:50:00.000000000 +1100
> +++ linux-2.6/fs/exofs/file.c 2010-11-19 16:50:07.000000000 +1100
> @@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
> struct inode *inode = filp->f_mapping->host;
> struct super_block *sb;
>
> - if (!(inode->i_state & I_DIRTY))
> - return 0;
> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> - return 0;
> -
> ret = sync_inode_metadata(inode, 1);
>
> /* This is a good place to write the sb */
>
Is that a good enough fix for the issue in your opinion?
Or is there more involved?
In exofs there is nothing special to do other than VFS
managment and the final call, by vfs, to .write_inode.
I wish we had a simple_file_fsync() from VFS that does
what the VFS expects us to do. So when code evolves it
does not need to change all FSs. This is the third time
I'm fixing this code trying to second guess the VFS.
Actually the only other thing I need to do in file_fsync
today is sb_sync. But this is a stupidity (and a bug) that
I'm fixing soon. So that theoretical simple_file_fsync()
would be all I need.
Please advise?
BTW: Do you want that I take the changes through my tree?
Thanks for taking care of this
Boaz
On Thu, Nov 25, 2010 at 11:28:14AM +0200, Boaz Harrosh wrote:
> Hi Nick.
> Thanks for digging into this issue, I bet it's causing pain. Which
> I totally missed in my tests. I wish I had a better xsync+reboot
> tests for all this.
That's no problem, thanks for looking.
> So in that previous patch you had:
> > Index: linux-2.6/fs/exofs/file.c
> > ===================================================================
> > --- linux-2.6.orig/fs/exofs/file.c 2010-11-19 16:50:00.000000000 +1100
> > +++ linux-2.6/fs/exofs/file.c 2010-11-19 16:50:07.000000000 +1100
> > @@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
> > struct inode *inode = filp->f_mapping->host;
> > struct super_block *sb;
> >
> > - if (!(inode->i_state & I_DIRTY))
> > - return 0;
> > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > - return 0;
> > -
> > ret = sync_inode_metadata(inode, 1);
> >
> > /* This is a good place to write the sb */
> >
>
> Is that a good enough fix for the issue in your opinion?
> Or is there more involved?
For the inode dirty bit race problem, yes it should fix it.
sync_inode_metadata basically makes the same checks without
races (in a subsequent patch I re-introduced the datasync
optimisation).
> In exofs there is nothing special to do other than VFS
> managment and the final call, by vfs, to .write_inode.
>
> I wish we had a simple_file_fsync() from VFS that does
> what the VFS expects us to do. So when code evolves it
> does not need to change all FSs. This is the third time
> I'm fixing this code trying to second guess the VFS.
Well in your fsync, you need to wait for inode writeback
that might have been started by an asynchronous write_inode.
Also, with your sync_inode_metadata call, you shouldn't need the
sync_inode call by the looks.
> Actually the only other thing I need to do in file_fsync
> today is sb_sync. But this is a stupidity (and a bug) that
> I'm fixing soon. So that theoretical simple_file_fsync()
> would be all I need.
>
> Please advise?
> BTW: Do you want that I take the changes through my tree?
At this point I'd just like some review and feedback, we
might get some other opinions on how to fix it, so don't
take the changes quite yet.
I'll cc you again with a broken out patch.
Thanks,
Nick
On 11/25/2010 12:06 PM, Nick Piggin wrote:
> On Thu, Nov 25, 2010 at 11:28:14AM +0200, Boaz Harrosh wrote:
>>> Index: linux-2.6/fs/exofs/file.c
>>> ===================================================================
>>> --- linux-2.6.orig/fs/exofs/file.c 2010-11-19 16:50:00.000000000 +1100
>>> +++ linux-2.6/fs/exofs/file.c 2010-11-19 16:50:07.000000000 +1100
>>> @@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
>>> struct inode *inode = filp->f_mapping->host;
>>> struct super_block *sb;
>>>
>>> - if (!(inode->i_state & I_DIRTY))
>>> - return 0;
>>> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
>>> - return 0;
>>> -
>>> ret = sync_inode_metadata(inode, 1);
>>>
>>> /* This is a good place to write the sb */
>>>
>>
>> Is that a good enough fix for the issue in your opinion?
>> Or is there more involved?
>
> For the inode dirty bit race problem, yes it should fix it.
> sync_inode_metadata basically makes the same checks without
> races (in a subsequent patch I re-introduced the datasync
> optimisation).
>
>
>
> Well in your fsync, you need to wait for inode writeback
> that might have been started by an asynchronous write_inode.
>
All I'm calling is sync_inode_metadata(,1) which calls sync_inode()
which calls writeback_single_inode(sync_mode == WB_SYNC_ALL). It gets
a little complicated but from the looks of it, even though the
call to .write_inode() is not under any lock the state machine there
will do inode_wait_for_writeback() if there was one in motion
all ready. ?
And it looks like writeback_single_inode() does all the proper
checks in the correct order for these flags above.
So current code in exofs_file_fsync() looks scary to me. I would
like to push your above patch for this Kernel. (I'll repost it)
> Also, with your sync_inode_metadata call, you shouldn't need the
> sync_inode call by the looks.
>
What? I missed you. You mean I don't need to sync_inode_metadata(,wait==1),
or what did you mean?
> Thanks,
> Nick
>
Thanks
Boaz
From: Nick Piggin <[email protected]>
It is incorrect to test inode dirty bits without participating in the inode
writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
writes out the particular bits, then clears I_SYNC when it is done. BTW. it
may not completely write all pages out, so I_DIRTY_PAGES would get set
again.
This is a standard pattern used throughout the kernel's writeback caches
(I_SYNC ~= I_WRITEBACK, if that makes it clearer).
And so it is not possible to determine an inode's dirty status just by
checking I_DIRTY bits. Especially not for the purpose of data integrity
syncs.
Missing the check for these bits means that fsync can complete while
writeback to the inode is underway. Inode writeback functions get this
right, so call into them rather than try to shortcut things by testing
dirty state improperly.
Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Boaz Harrosh <[email protected]>
---
fs/exofs/file.c | 5 -----
1 files changed, 0 insertions(+), 5 deletions(-)
diff --git a/fs/exofs/file.c b/fs/exofs/file.c
index b905c79..4c0d6ba 100644
--- a/fs/exofs/file.c
+++ b/fs/exofs/file.c
@@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file *filp, int datasync)
struct inode *inode = filp->f_mapping->host;
struct super_block *sb;
- if (!(inode->i_state & I_DIRTY))
- return 0;
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- return 0;
-
ret = sync_inode_metadata(inode, 1);
/* This is a good place to write the sb */
--
1.7.2.3
On Thu, Nov 25, 2010 at 12:51:11PM +0200, Boaz Harrosh wrote:
> On 11/25/2010 12:06 PM, Nick Piggin wrote:
> > On Thu, Nov 25, 2010 at 11:28:14AM +0200, Boaz Harrosh wrote:
>
> >>> Index: linux-2.6/fs/exofs/file.c
> >>> ===================================================================
> >>> --- linux-2.6.orig/fs/exofs/file.c 2010-11-19 16:50:00.000000000 +1100
> >>> +++ linux-2.6/fs/exofs/file.c 2010-11-19 16:50:07.000000000 +1100
> >>> @@ -48,11 +48,6 @@ static int exofs_file_fsync(struct file
> >>> struct inode *inode = filp->f_mapping->host;
> >>> struct super_block *sb;
> >>>
> >>> - if (!(inode->i_state & I_DIRTY))
> >>> - return 0;
> >>> - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> >>> - return 0;
> >>> -
> >>> ret = sync_inode_metadata(inode, 1);
> >>>
> >>> /* This is a good place to write the sb */
> >>>
> >>
> >> Is that a good enough fix for the issue in your opinion?
> >> Or is there more involved?
> >
> > For the inode dirty bit race problem, yes it should fix it.
> > sync_inode_metadata basically makes the same checks without
> > races (in a subsequent patch I re-introduced the datasync
> > optimisation).
> >
> >
>
> >
> > Well in your fsync, you need to wait for inode writeback
> > that might have been started by an asynchronous write_inode.
> >
>
> All I'm calling is sync_inode_metadata(,1) which calls sync_inode()
> which calls writeback_single_inode(sync_mode == WB_SYNC_ALL). It gets
> a little complicated but from the looks of it, even though the
> call to .write_inode() is not under any lock the state machine there
> will do inode_wait_for_writeback() if there was one in motion
> all ready. ?
>
> And it looks like writeback_single_inode() does all the proper
> checks in the correct order for these flags above.
>
> So current code in exofs_file_fsync() looks scary to me. I would
> like to push your above patch for this Kernel. (I'll repost it)
It does not get it right, because of the situation I described
above. Background writeout can come in first, and clear the inode
dirty bits, and call your ->write_inode for async writeout.
That means you skip doing the exofs_put_io_state(), and (I presume)
this means you aren't waiting for write completion there.
What then happens is that sync_inode_metadata() from your fsync
does not call ->write_inode because the inode dirty bits are clear.
It's basically a noop. So you need to either make your .write_inode
always synchronous, or wait for it in your .fsync and .sync_fs.
> > Also, with your sync_inode_metadata call, you shouldn't need the
> > sync_inode call by the looks.
> >
>
> What? I missed you. You mean I don't need to sync_inode_metadata(,wait==1),
> or what did you mean?
Sorry, I was looking at the wrong code, ignore that.
Nick
This is fine, we'll get it merged before 2.6.37, just give a couple of
days for review, I haven't had much confirmation from other vfs people
about these problems.
Thanks,
Nick
On Thu, Nov 25, 2010 at 12:52:08PM +0200, Boaz Harrosh wrote:
> From: Nick Piggin <[email protected]>
>
> It is incorrect to test inode dirty bits without participating in the inode
> writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
> writes out the particular bits, then clears I_SYNC when it is done. BTW. it
> may not completely write all pages out, so I_DIRTY_PAGES would get set
> again.
>
On Thu, Nov 25, 2010 at 06:49:09PM +1100, Nick Piggin wrote:
> Second is confusing sync and async inode metadata writeout
> Core code clears I_DIRTY_SYNC and I_DIRTY_DATASYNC before calling
> ->write_inode *regardless* of whether it is a for-integrity call or
> not. This means background writeback can clear it, and subsequent
> sync_inode_metadata or sync(2) call will skip the next ->write_inode
> completely.
Hmm, this also means that write_inode_now(sync=1) is buggy. It
needs to in fact call ->fsync -- which is a file operation
unfortunately, Christoph didn't you have some patches to move it
into an inode operation?
On Thu, Nov 25, 2010 at 10:54:57PM +1100, Nick Piggin wrote:
> On Thu, Nov 25, 2010 at 06:49:09PM +1100, Nick Piggin wrote:
> > Second is confusing sync and async inode metadata writeout
> > Core code clears I_DIRTY_SYNC and I_DIRTY_DATASYNC before calling
> > ->write_inode *regardless* of whether it is a for-integrity call or
> > not. This means background writeback can clear it, and subsequent
> > sync_inode_metadata or sync(2) call will skip the next ->write_inode
> > completely.
>
> Hmm, this also means that write_inode_now(sync=1) is buggy. It
> needs to in fact call ->fsync -- which is a file operation
> unfortunately, Christoph didn't you have some patches to move it
> into an inode operation?
No, it doesn't really make much sense either. But what I've slowly
started doing is to phase out write_inode_now. For the cases where
we really only want to write the inode we should use
sync_inode_metadata. That only leaves two others callsers:
- iput_final for a filesystem during unmount. This should be caught
by the need to call ->sync_fs rule you mentioned above, but needs
a closer audit.
- nfsd. Any filesystem that cares should just use the commit_metadata
export operations, which is a subsystem of ->fsync as it only need
to guarantee that metadata is on disk, but not actually any file
data - so no cache flush mess as in a real fsync implementation.
On 11/25/2010 01:47 PM, Nick Piggin wrote:
> On Thu, Nov 25, 2010 at 12:51:11PM +0200, Boaz Harrosh wrote:
>
> It does not get it right, because of the situation I described
> above. Background writeout can come in first, and clear the inode
> dirty bits, and call your ->write_inode for async writeout.
>
> That means you skip doing the exofs_put_io_state(), and (I presume)
> this means you aren't waiting for write completion there.
>
> What then happens is that sync_inode_metadata() from your fsync
> does not call ->write_inode because the inode dirty bits are clear.
> It's basically a noop. So you need to either make your .write_inode
> always synchronous, or wait for it in your .fsync and .sync_fs.
>
Rrr I now see what you mean, that the previous call to
writeback_single_inode(wait==0) came and go without actually
finishing because exofs_write_inode(wait==0).
So I wish there was an
write_inode_async_done() that will actually do the final inode_sync_complete
or something like that, right? (That I could call from exofs::updatei_done)
Sigh. I think I'll go with the always wait==1 at exofs_write_inode() for
now. Just as your patch.
If there is nothing better by the next Kernel I might consider my own
wait_event() in exofs_file_fsync()
ACK-by: Boaz Harrosh <[email protected]>
Thanks
Boaz
On 11/25/2010 01:50 PM, Nick Piggin wrote:
> This is fine, we'll get it merged before 2.6.37, just give a couple of
> days for review, I haven't had much confirmation from other vfs people
> about these problems.
>
> Thanks,
> Nick
>
Hi Dear Nick.
So I was cleaning the table and I fell on this patch. What ever happened
to it? should I push it through my tree?
Thanks
Boaz
> On Thu, Nov 25, 2010 at 12:52:08PM +0200, Boaz Harrosh wrote:
>> From: Nick Piggin <[email protected]>
>>
>> It is incorrect to test inode dirty bits without participating in the inode
>> writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then
>> writes out the particular bits, then clears I_SYNC when it is done. BTW. it
>> may not completely write all pages out, so I_DIRTY_PAGES would get set
>> again.
>>