2006-11-30 12:24:42

by Steven Whitehouse

[permalink] [raw]
Subject: [GFS2] Don't flush everything on fdatasync [70/70]

>From 33c3de32872ef3c075e4dac04c0de8f86ac39f6f Mon Sep 17 00:00:00 2001
From: Steven Whitehouse <[email protected]>
Date: Thu, 30 Nov 2006 10:14:32 -0500
Subject: [PATCH] [GFS2] Don't flush everything on fdatasync

The gfs2_fsync() function was doing a journal flush on each
and every call. While this is correct, its also a lot of
overhead. This patch means that on fdatasync flushes we
rely on the VFS to flush the data for us and we don't do
a journal flush unless we really need to.

We have to do a journal flush for stuffed files though because
they have the data and the inode metadata in the same block.
Journaled files also need a journal flush too of course.

Signed-off-by: Steven Whitehouse <[email protected]>
---
fs/gfs2/ops_file.c | 30 +++++++++++++++++++++++++++---
1 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
index c2be216..7bd971b 100644
--- a/fs/gfs2/ops_file.c
+++ b/fs/gfs2/ops_file.c
@@ -22,6 +22,7 @@ #include <linux/gfs2_ondisk.h>
#include <linux/ext2_fs.h>
#include <linux/crc32.h>
#include <linux/lm_interface.h>
+#include <linux/writeback.h>
#include <asm/uaccess.h>

#include "gfs2.h"
@@ -503,16 +504,39 @@ static int gfs2_close(struct inode *inod
* @file: the file that points to the dentry (we ignore this)
* @dentry: the dentry that points to the inode to sync
*
+ * The VFS will flush "normal" data for us. We only need to worry
+ * about metadata here. For journaled data, we just do a log flush
+ * as we can't avoid it. Otherwise we can just bale out if datasync
+ * is set. For stuffed inodes we must flush the log in order to
+ * ensure that all data is on disk.
+ *
* Returns: errno
*/

static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
{
- struct gfs2_inode *ip = GFS2_I(dentry->d_inode);
+ struct inode *inode = dentry->d_inode;
+ int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
+ int ret = 0;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0,
+ };
+
+ if (gfs2_is_jdata(GFS2_I(inode))) {
+ gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
+ return 0;
+ }

- gfs2_log_flush(ip->i_gl->gl_sbd, ip->i_gl);
+ if (sync_state != 0) {
+ if (!datasync)
+ ret = sync_inode(inode, &wbc);

- return 0;
+ if (gfs2_is_stuffed(GFS2_I(inode)))
+ gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
+ }
+
+ return ret;
}

/**
--
1.4.1




2006-12-01 07:02:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [GFS2] Don't flush everything on fdatasync [70/70]

On Thu, 30 Nov 2006 12:24:08 +0000
Steven Whitehouse <[email protected]> wrote:

> static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
> {
> - struct gfs2_inode *ip = GFS2_I(dentry->d_inode);
> + struct inode *inode = dentry->d_inode;
> + int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> + int ret = 0;
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_ALL,
> + .nr_to_write = 0,
> + };
> +
> + if (gfs2_is_jdata(GFS2_I(inode))) {
> + gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> + return 0;
> + }
>
> - gfs2_log_flush(ip->i_gl->gl_sbd, ip->i_gl);
> + if (sync_state != 0) {
> + if (!datasync)
> + ret = sync_inode(inode, &wbc);

filemap_fdatawrite() would be simpler.

2006-12-01 10:57:57

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [GFS2] Don't flush everything on fdatasync [70/70]

Hi,

On Thu, 2006-11-30 at 23:01 -0800, Andrew Morton wrote:
> On Thu, 30 Nov 2006 12:24:08 +0000
> Steven Whitehouse <[email protected]> wrote:
>
> > static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
> > {
> > - struct gfs2_inode *ip = GFS2_I(dentry->d_inode);
> > + struct inode *inode = dentry->d_inode;
> > + int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > + int ret = 0;
> > + struct writeback_control wbc = {
> > + .sync_mode = WB_SYNC_ALL,
> > + .nr_to_write = 0,
> > + };
> > +
> > + if (gfs2_is_jdata(GFS2_I(inode))) {
> > + gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> > + return 0;
> > + }
> >
> > - gfs2_log_flush(ip->i_gl->gl_sbd, ip->i_gl);
> > + if (sync_state != 0) {
> > + if (!datasync)
> > + ret = sync_inode(inode, &wbc);
>
> filemap_fdatawrite() would be simpler.

I was taking my cue here from ext3 which does something similar. The
filemap_fdatawrite() is done by the VFS before this is called with a
filemap_fdatawait() afterwards. This was intended to flush the metadata
via (eventually) ->write_inode() although I guess I should be calling
write_inode_now() instead?

Steve.


2006-12-01 19:09:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [GFS2] Don't flush everything on fdatasync [70/70]

On Fri, 01 Dec 2006 10:58:58 +0000
Steven Whitehouse <[email protected]> wrote:

> On Thu, 2006-11-30 at 23:01 -0800, Andrew Morton wrote:
> > On Thu, 30 Nov 2006 12:24:08 +0000
> > Steven Whitehouse <[email protected]> wrote:
> >
> > > static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
> > > {
> > > - struct gfs2_inode *ip = GFS2_I(dentry->d_inode);
> > > + struct inode *inode = dentry->d_inode;
> > > + int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > > + int ret = 0;
> > > + struct writeback_control wbc = {
> > > + .sync_mode = WB_SYNC_ALL,
> > > + .nr_to_write = 0,
> > > + };
> > > +
> > > + if (gfs2_is_jdata(GFS2_I(inode))) {
> > > + gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> > > + return 0;
> > > + }
> > >
> > > - gfs2_log_flush(ip->i_gl->gl_sbd, ip->i_gl);
> > > + if (sync_state != 0) {
> > > + if (!datasync)
> > > + ret = sync_inode(inode, &wbc);
> >
> > filemap_fdatawrite() would be simpler.
>
> I was taking my cue here from ext3 which does something similar. The
> filemap_fdatawrite() is done by the VFS before this is called with a
> filemap_fdatawait() afterwards. This was intended to flush the metadata
> via (eventually) ->write_inode() although I guess I should be calling
> write_inode_now() instead?

oh I see, you're jsut trying to write the inode itself, not the pages.

write_inode_now() will write the pages, which you seem to not want to do.
Whatever. The APIs here are a bit awkward.

2006-12-05 14:33:38

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [GFS2] Don't flush everything on fdatasync [70/70]

Hi,

On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
> On Fri, 01 Dec 2006 10:58:58 +0000
> Steven Whitehouse <[email protected]> wrote:
>
> > On Thu, 2006-11-30 at 23:01 -0800, Andrew Morton wrote:
> > > On Thu, 30 Nov 2006 12:24:08 +0000
> > > Steven Whitehouse <[email protected]> wrote:
> > >
> > > > static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
> > > > {
> > > > - struct gfs2_inode *ip = GFS2_I(dentry->d_inode);
> > > > + struct inode *inode = dentry->d_inode;
> > > > + int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > > > + int ret = 0;
> > > > + struct writeback_control wbc = {
> > > > + .sync_mode = WB_SYNC_ALL,
> > > > + .nr_to_write = 0,
> > > > + };
> > > > +
> > > > + if (gfs2_is_jdata(GFS2_I(inode))) {
> > > > + gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> > > > + return 0;
> > > > + }
> > > >
> > > > - gfs2_log_flush(ip->i_gl->gl_sbd, ip->i_gl);
> > > > + if (sync_state != 0) {
> > > > + if (!datasync)
> > > > + ret = sync_inode(inode, &wbc);
> > >
> > > filemap_fdatawrite() would be simpler.
> >
> > I was taking my cue here from ext3 which does something similar. The
> > filemap_fdatawrite() is done by the VFS before this is called with a
> > filemap_fdatawait() afterwards. This was intended to flush the metadata
> > via (eventually) ->write_inode() although I guess I should be calling
> > write_inode_now() instead?
>
> oh I see, you're jsut trying to write the inode itself, not the pages.
>
> write_inode_now() will write the pages, which you seem to not want to do.
> Whatever. The APIs here are a bit awkward.

Well its not so much that we want to avoid it, but since the VFS has
already done that, it should be pretty much a no-op aside from the
waiting which will happen within the call (so that presumably the VFS's
wait after the call will be more or less a no-op).

I notice that fs/sync.c:file_fsync() uses write_inode_now(), so maybe it
is a better choice as it at least means that I don't have to fill in a
struct write_back control for a mapping that I don't really want to
write again,

Steve.


2006-12-07 09:08:14

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [GFS2] Don't flush everything on fdatasync [70/70]

Hi,

On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
> On Fri, 01 Dec 2006 10:58:58 +0000
> Steven Whitehouse <[email protected]> wrote:
>
> > On Thu, 2006-11-30 at 23:01 -0800, Andrew Morton wrote:
> > > On Thu, 30 Nov 2006 12:24:08 +0000
> > > Steven Whitehouse <[email protected]> wrote:
> > >
> > > > static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
> > > > {
> > > > - struct gfs2_inode *ip = GFS2_I(dentry->d_inode);
> > > > + struct inode *inode = dentry->d_inode;
> > > > + int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> > > > + int ret = 0;
> > > > + struct writeback_control wbc = {
> > > > + .sync_mode = WB_SYNC_ALL,
> > > > + .nr_to_write = 0,
> > > > + };
> > > > +
> > > > + if (gfs2_is_jdata(GFS2_I(inode))) {
> > > > + gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> > > > + return 0;
> > > > + }
> > > >
> > > > - gfs2_log_flush(ip->i_gl->gl_sbd, ip->i_gl);
> > > > + if (sync_state != 0) {
> > > > + if (!datasync)
> > > > + ret = sync_inode(inode, &wbc);
> > >
> > > filemap_fdatawrite() would be simpler.
> >
> > I was taking my cue here from ext3 which does something similar. The
> > filemap_fdatawrite() is done by the VFS before this is called with a
> > filemap_fdatawait() afterwards. This was intended to flush the metadata
> > via (eventually) ->write_inode() although I guess I should be calling
> > write_inode_now() instead?
>
> oh I see, you're jsut trying to write the inode itself, not the pages.
>
> write_inode_now() will write the pages, which you seem to not want to do.
> Whatever. The APIs here are a bit awkward.

I've added a comment to explain whats going on here, and also the
following patch. I know it could be better, but its still an improvement
on what was there before,

Steve.

--------------------------------------------------------------------------------------------

>From 34126f9f41901ca9d7d0031c2b11fc0d6c07b72d Mon Sep 17 00:00:00 2001
From: Steven Whitehouse <[email protected]>
Date: Thu, 7 Dec 2006 09:13:14 -0500
Subject: [PATCH] [GFS2] Change gfs2_fsync() to use write_inode_now()

This is a bit better than the previous version of gfs2_fsync()
although it would be better still if we were able to call a
function which only wrote the inode & metadata. Its no big deal
though that this will potentially write the data as well since
the VFS has already done that before calling gfs2_fsync(). I've
also added a comment to explain whats going on here.

Signed-off-by: Steven Whitehouse <[email protected]>
Cc: Andrew Morton <[email protected]>
---
fs/gfs2/ops_file.c | 11 ++++++-----
1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
index 7bd971b..b3f1e03 100644
--- a/fs/gfs2/ops_file.c
+++ b/fs/gfs2/ops_file.c
@@ -510,6 +510,11 @@ static int gfs2_close(struct inode *inod
* is set. For stuffed inodes we must flush the log in order to
* ensure that all data is on disk.
*
+ * The call to write_inode_now() is there to write back metadata and
+ * the inode itself. It does also try and write the data, but thats
+ * (hopefully) a no-op due to the VFS having already called filemap_fdatawrite()
+ * for us.
+ *
* Returns: errno
*/

@@ -518,10 +523,6 @@ static int gfs2_fsync(struct file *file,
struct inode *inode = dentry->d_inode;
int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
int ret = 0;
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_ALL,
- .nr_to_write = 0,
- };

if (gfs2_is_jdata(GFS2_I(inode))) {
gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
@@ -530,7 +531,7 @@ static int gfs2_fsync(struct file *file,

if (sync_state != 0) {
if (!datasync)
- ret = sync_inode(inode, &wbc);
+ ret = write_inode_now(inode, 0);

if (gfs2_is_stuffed(GFS2_I(inode)))
gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
--
1.4.1



2006-12-07 12:15:09

by Steven Whitehouse

[permalink] [raw]
Subject: [GFS2 & DLM] Pull request

Hi,

All the outstanding issues have now been resolved. Please consider
pulling the following GFS2 & DLM patches,

Steve.

---------------------------------------------------------------------------------------

The following changes since commit 0215ffb08ce99e2bb59eca114a99499a4d06e704:
Linus Torvalds:
Linux 2.6.19

are found in the git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw.git

Al Viro:
[GFS2] split gfs2_dinode into on-disk and host variants
[GFS2] gfs2_dinode_host fields are host-endian
[GFS2] split gfs2_sb
[GFS2] fields of gfs2_sb_host are host-endian
[GFS2] split and annotate gfs2_rgrp
[GFS2] split and annotate gfs2_inum_range
[GFS2] split and annotate gfs2_log_head
[GFS2] split and annotate gfs2_meta_header
[GFS2] split and annotate gfs_rindex
[GFS2] split and annotate gfs2_inum
[GFS2] split and annotate gfs2_quota
[GFS2] split and annotate gfs2_statfs_change
[GFS2] split and annotate gfs2_quota_change
[GFS2] gfs2 misc endianness annotations
[GFS2] gfs2 __user misannotation fix

David Teigland:
[DLM] res_recover_locks_count not reset when recover_locks is aborted
[DLM] status messages ping-pong between unmounted nodes
[DLM] fix requestqueue race
[DLM] fix aborted recovery during node removal
[DLM] fix stopping unstarted recovery
[DLM] do full recover_locks barrier
[DLM] clear sbflags on lock master
[DLM] fix add_requestqueue checking nodes list
[DLM] fix size of STATUS_REPLY message
[DLM] don't accept replies to old recovery messages

Patrick Caulfield:
[DLM] Add support for tcp communications
[DLM] Fix DLM config
[DLM] Clean up lowcomms

Randy Dunlap:
[GFS2] lock function parameter

Russell Cattelan:
[GFS2] Fix race in logging code
[GFS2] Remove unused zero_readpage from stuffed_readpage

Ryusuke Konishi:
[GFS2] fs/gfs2/log.c:log_bmap() fix printk format warning
[DLM] fix format warnings in rcom.c and recoverd.c

Srinivasa Ds:
[GFS2] Mount problem with the GFS2 code

Steven Whitehouse:
[GFS2] Fix crc32 calculation in recovery.c
[GFS2] Change argument of gfs2_dinode_out
[GFS2] Change argument to gfs2_dinode_in
[GFS2] Move gfs2_dinode_in to inode.c
[GFS2] Change argument to gfs2_dinode_print
[GFS2] Shrink gfs2_inode (1) - di_header/di_num
[GFS2] Shrink gfs2_inode (2) - di_major/di_minor
[GFS2] Shrink gfs2_inode (3) - di_mode
[GFS2] Shrink gfs2_inode (4) - di_uid/di_gid
[GFS2] Shrink gfs2_inode (5) - di_nlink
[GFS2] Shrink gfs2_inode (6) - di_atime/di_mtime/di_ctime
[GFS2] Shrink gfs2_inode (7) - di_payload_format
[GFS2] Shrink gfs2_inode (8) - i_vn
[GFS2] Tidy up 0 initialisations in inode.c
[GFS2] Don't copy meta_header for rgrp in and out
[GFS2] Remove unused GL_DUMP flag
[GFS2] Fix page lock/glock deadlock
[GFS2] Only set inode flags when required
[GFS2] Inode number is constant
[GFS2] Remove gfs2_inode_attr_in
[GFS2] Fix memory allocation in glock.c
[GFS2] Tidy up bmap & fix boundary bug
[GFS2] Remove unused sysfs files
[GFS2] Remove unused function from inode.c
[GFS2] Make sentinel dirents compatible with gfs1
[GFS2] Fix Kconfig wrt CRC32
[GFS2] Simplify glops functions
[GFS2] Fix glock ordering on inode creation
[GFS2] mark_inode_dirty after write to stuffed file
[GFS2] Fix journal flush problem
[GFS2] Move gfs2_meta_syncfs() into log.c
[GFS2] Reduce number of arguments to meta_io.c:getbuf()
[GFS2] Fix recursive locking in gfs2_permission
[GFS2] Fix recursive locking in gfs2_getattr
[GFS2] Remove gfs2_check_acl()
[GFS2] Add a comment about reading the super block
[GFS2] Don't flush everything on fdatasync
[GFS2] Fix indent in recovery.c
[GFS2] Change gfs2_fsync() to use write_inode_now()

fs/dlm/Kconfig | 20 +
fs/dlm/Makefile | 4
fs/dlm/dlm_internal.h | 4
fs/dlm/lock.c | 16 -
fs/dlm/lockspace.c | 4
fs/dlm/lowcomms-sctp.c | 1227 +++++++++++++++++++++++++++++++++++++++++++
fs/dlm/lowcomms-tcp.c | 1189 +++++++++++++++++++++++++++++++++++++++++
fs/dlm/lowcomms.c | 1239 -------------------------------------------
fs/dlm/lowcomms.h | 2
fs/dlm/main.c | 10
fs/dlm/member.c | 8
fs/dlm/rcom.c | 58 ++
fs/dlm/recover.c | 1
fs/dlm/recoverd.c | 44 +-
fs/dlm/requestqueue.c | 26 +
fs/dlm/requestqueue.h | 2
fs/gfs2/Kconfig | 1
fs/gfs2/acl.c | 39 -
fs/gfs2/acl.h | 1
fs/gfs2/bmap.c | 179 +++---
fs/gfs2/daemon.c | 7
fs/gfs2/dir.c | 93 ++-
fs/gfs2/dir.h | 8
fs/gfs2/eaops.c | 2
fs/gfs2/eattr.c | 66 +-
fs/gfs2/eattr.h | 6
fs/gfs2/glock.c | 36 -
fs/gfs2/glock.h | 3
fs/gfs2/glops.c | 138 +----
fs/gfs2/incore.h | 43 +
fs/gfs2/inode.c | 406 +++++---------
fs/gfs2/inode.h | 20 -
fs/gfs2/log.c | 41 +
fs/gfs2/log.h | 2
fs/gfs2/lops.c | 40 +
fs/gfs2/lops.h | 2
fs/gfs2/meta_io.c | 46 +-
fs/gfs2/meta_io.h | 1
fs/gfs2/ondisk.c | 138 +----
fs/gfs2/ops_address.c | 52 +-
fs/gfs2/ops_dentry.c | 4
fs/gfs2/ops_export.c | 38 +
fs/gfs2/ops_export.h | 2
fs/gfs2/ops_file.c | 66 ++
fs/gfs2/ops_file.h | 2
fs/gfs2/ops_fstype.c | 4
fs/gfs2/ops_inode.c | 134 ++---
fs/gfs2/ops_super.c | 11
fs/gfs2/ops_vm.c | 2
fs/gfs2/quota.c | 15 -
fs/gfs2/recovery.c | 29 +
fs/gfs2/recovery.h | 2
fs/gfs2/rgrp.c | 13
fs/gfs2/super.c | 50 +-
fs/gfs2/super.h | 6
fs/gfs2/sys.c | 8
fs/gfs2/util.h | 6
include/linux/gfs2_ondisk.h | 138 ++++-
58 files changed, 3442 insertions(+), 2312 deletions(-)
create mode 100644 fs/dlm/lowcomms-sctp.c
create mode 100644 fs/dlm/lowcomms-tcp.c
delete mode 100644 fs/dlm/lowcomms.c


2006-12-07 19:16:22

by Wendy Cheng

[permalink] [raw]
Subject: Re: [Cluster-devel] Re: [GFS2] Don't flush everything on fdatasync [70/70]

Steven Whitehouse wrote:
> Hi,
>
> On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
>
>>> I was taking my cue here from ext3 which does something similar. The
>>> filemap_fdatawrite() is done by the VFS before this is called with a
>>> filemap_fdatawait() afterwards. This was intended to flush the metadata
>>> via (eventually) ->write_inode() although I guess I should be calling
>>> write_inode_now() instead?
>>>
>> oh I see, you're jsut trying to write the inode itself, not the pages.
>>
>> write_inode_now() will write the pages, which you seem to not want to do.
>> Whatever. The APIs here are a bit awkward.
>>
>
> I've added a comment to explain whats going on here, and also the
> following patch. I know it could be better, but its still an improvement
> on what was there before,
>
>
>
Steve,

I'm in the middle of something else and don't have upstream kernel
source handy at this moment. But I read akpm's comment as
"write_inode_now" would do writepage and that is *not* what you want (?)
(since vfs has done that before this call is invoked). I vaguely
recalled I did try write_inode_now() on GFS1 once but had to replace it
with "sync_inode" on RHEL4 (for the reason that I can't remember at this
moment). I suggest you keep "sync_inode" (at least for a while until we
can prove other call can do better). This "sync_inode" has been well
tested out (with GFS1's fsync call).

There is another issue. It is a gray area. Note that you don't grab any
glock here ... so if someone *has* written something in other nodes,
this sync could miss it (?). This depends on how people expects a
fsync/fdatasync should behave in a cluster filesystem. GFS1 asks for a
shared lock here so it will force other node to flush the data (I
personally think this is a more correct behavior). Your call though.

-- Wendy

> --------------------------------------------------------------------------------------------
>
> >From 34126f9f41901ca9d7d0031c2b11fc0d6c07b72d Mon Sep 17 00:00:00 2001
> From: Steven Whitehouse <[email protected]>
> Date: Thu, 7 Dec 2006 09:13:14 -0500
> Subject: [PATCH] [GFS2] Change gfs2_fsync() to use write_inode_now()
>
> This is a bit better than the previous version of gfs2_fsync()
> although it would be better still if we were able to call a
> function which only wrote the inode & metadata. Its no big deal
> though that this will potentially write the data as well since
> the VFS has already done that before calling gfs2_fsync(). I've
> also added a comment to explain whats going on here.
>
> Signed-off-by: Steven Whitehouse <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> fs/gfs2/ops_file.c | 11 ++++++-----
> 1 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
> index 7bd971b..b3f1e03 100644
> --- a/fs/gfs2/ops_file.c
> +++ b/fs/gfs2/ops_file.c
> @@ -510,6 +510,11 @@ static int gfs2_close(struct inode *inod
> * is set. For stuffed inodes we must flush the log in order to
> * ensure that all data is on disk.
> *
> + * The call to write_inode_now() is there to write back metadata and
> + * the inode itself. It does also try and write the data, but thats
> + * (hopefully) a no-op due to the VFS having already called filemap_fdatawrite()
> + * for us.
> + *
> * Returns: errno
> */
>
> @@ -518,10 +523,6 @@ static int gfs2_fsync(struct file *file,
> struct inode *inode = dentry->d_inode;
> int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
> int ret = 0;
> - struct writeback_control wbc = {
> - .sync_mode = WB_SYNC_ALL,
> - .nr_to_write = 0,
> - };
>
> if (gfs2_is_jdata(GFS2_I(inode))) {
> gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
> @@ -530,7 +531,7 @@ static int gfs2_fsync(struct file *file,
>
> if (sync_state != 0) {
> if (!datasync)
> - ret = sync_inode(inode, &wbc);
> + ret = write_inode_now(inode, 0);
>
> if (gfs2_is_stuffed(GFS2_I(inode)))
> gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
>

2006-12-08 10:26:00

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [Cluster-devel] Re: [GFS2] Don't flush everything on fdatasync [70/70]

Hi,

On Thu, 2006-12-07 at 14:05 -0500, Wendy Cheng wrote:
> Steven Whitehouse wrote:
> > Hi,
> >
> > On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
> >
> >>> I was taking my cue here from ext3 which does something similar. The
> >>> filemap_fdatawrite() is done by the VFS before this is called with a
> >>> filemap_fdatawait() afterwards. This was intended to flush the metadata
> >>> via (eventually) ->write_inode() although I guess I should be calling
> >>> write_inode_now() instead?
> >>>
> >> oh I see, you're jsut trying to write the inode itself, not the pages.
> >>
> >> write_inode_now() will write the pages, which you seem to not want to do.
> >> Whatever. The APIs here are a bit awkward.
> >>
> >
> > I've added a comment to explain whats going on here, and also the
> > following patch. I know it could be better, but its still an improvement
> > on what was there before,
> >
> >
> >
> Steve,
>
> I'm in the middle of something else and don't have upstream kernel
> source handy at this moment. But I read akpm's comment as
> "write_inode_now" would do writepage and that is *not* what you want (?)
> (since vfs has done that before this call is invoked). I vaguely
> recalled I did try write_inode_now() on GFS1 once but had to replace it
> with "sync_inode" on RHEL4 (for the reason that I can't remember at this
> moment). I suggest you keep "sync_inode" (at least for a while until we
> can prove other call can do better). This "sync_inode" has been well
> tested out (with GFS1's fsync call).
>
Ok. Its gone upstream now, but I'm happy to revert that change if it
turns out to be a problem. My tests show identical performance between
the two calls, but maybe there is a corner case I missed?

Both calls do writepage() but since the VFS has already done it for us,
the chances of there being any more dirty pages to write is fairly
small, so its unlikely to be much of a problem I think, but I'm willing
to be proved wrong if there is a good reason.

> There is another issue. It is a gray area. Note that you don't grab any
> glock here ... so if someone *has* written something in other nodes,
> this sync could miss it (?). This depends on how people expects a
> fsync/fdatasync should behave in a cluster filesystem. GFS1 asks for a
> shared lock here so it will force other node to flush the data (I
> personally think this is a more correct behavior). Your call though.
>
> -- Wendy
>
Its a tricky one to deal with. I would expect that the chances of an
application relying on an fsync on one node to cause a cross-cluster
flush is relatively unlikely. It would mean that there would have to be
another communication channel between the processes on the different
nodes through which the node that was writing data could request a flush
and then receive notification that it has finished, otherwise it would
not seem to make any sense. It would seem an odd way to write an
application, but maybe one does exist which does this somewhere.

Delving back into the history it looks like this is a change (with
respect to gfs1) made by Ken rather than myself. I don't mind adding
this feature though, but even so what we have now is still a marked
improvement on what was there previously I think,

Steve.