2001-04-02 17:11:09

by Kenichi Okuyama

Subject: nfsd vfs.c does not seem to fsync() on file overwrite when it has to.

Dear Neil,

I think I've found a bug in nfsd. Here's a patch to fix it.
I'll explain what the problem is and what's changed after the patch.


diff -urN ./linux.orig/fs/nfsd/vfs.c ./linux/fs/nfsd/vfs.c
--- ./linux.orig/fs/nfsd/vfs.c Tue Mar 13 11:13:28 2001
+++ ./linux/fs/nfsd/vfs.c Mon Apr 2 22:08:00 2001
@@ -746,15 +746,14 @@
current->state = TASK_RUNNING;
dprintk("nfsd: write resume %d\n", current->pid);
}
+ }

- if (inode->i_state & I_DIRTY) {
- dprintk("nfsd: write sync %d\n", current->pid);
- nfsd_sync(&file);
- }
+ dprintk("nfsd: write sync %d\n", current->pid);
+ nfsd_sync(&file);
#if 0
- wake_up(&inode->i_wait);
+ wake_up(&inode->i_wait);
#endif
- }
+
last_ino = inode->i_ino;
last_dev = inode->i_dev;
}


OK, here's what's being changed and why.

According to the RFC, NFS v2 requires every write to be synced before
the server replies. NFS v3 adds a 'stable' argument to the write
request, which can be UNSTABLE, DATASYNC, or FILESYNC. When it is
DATASYNC or FILESYNC, the write likewise has to be synced.

An NFS v2 write should be equivalent to FILESYNC in v3. FILESYNC
requires ALL of the file's metadata and the data covered by the write
to be synchronously written to storage. DATASYNC requires the data
covered by the write, plus the file metadata needed to retrieve it, to
be synchronously written to storage.

Currently, Linux does not support FILESYNC, though it does not report
it as such. I think this is reasonable. But if so, we need to make
DATASYNC work properly.
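
For reference, the v3 'stable' values (paraphrasing RFC 1813; these
are the protocol names for what I wrote as DATASYNC/FILESYNC above):

    enum stable_how {
        UNSTABLE  = 0,  /* server may only cache the data; the
                         * client must send a COMMIT later */
        DATA_SYNC = 1,  /* data (and the metadata needed to retrieve
                         * it) on stable storage before replying */
        FILE_SYNC = 2   /* data and all file metadata on stable
                         * storage before replying */
    };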


DATASYNC requires the data to be written to storage even if the file
system is MOUNTED asynchronously. I mean: ext2 can be mounted in
'async' mode, and yet nfsd must guarantee that the file has been
fsync()-ed.

Because

L710: err = file.f_op->write(&file, buf, cnt, &file.f_pos);

does not flush the dirty pages to storage in the async case, we need
to call fsync() whenever the variable 'stable' is non-zero.


Whether we should call nfsd_sync() or not should not depend on
whether EX_WGATHER(exp) is true. So the call to nfsd_sync() should
be outside the

if ( EX_WGATHER(exp) ) {...}

statement.


Also, it should not depend on whether (inode->i_state & I_DIRTY) is
true, because (inode->i_state & I_DIRTY) is true only when the inode
itself is being changed (i.e. on an appending write), while we need to
sync the written data to the filesystem even when we only overwrote an
already existing data area.

This means we are required to write ALL the dirty pages and the dirty
inode at this point, if 'stable' is non-zero.
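
In simplified form (this is not the literal code, just the intended
flow after the patch), the write path becomes:

    err = file.f_op->write(&file, buf, cnt, &file.f_pos);
    ...
    if (stable) {
        if (EX_WGATHER(exp)) {
            /* optionally wait a moment for other writers
             * of the same file */
        }
        dprintk("nfsd: write sync %d\n", current->pid);
        nfsd_sync(&file);   /* flush data and inode, even on an
                             * async mount */
        last_ino = inode->i_ino;
        last_dev = inode->i_dev;
    }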


On the other hand, with this patch, nfsd plus any file system with
fsync functionality will work correctly even with the

export(sync) + mount(async)

combination. This is quite useful, because NFS v3 allows UNSTABLE
writes (which only require the server to receive the write data) and
has a COMMIT command to run fsync() at a later time. For this to
perform well, the mount needs to be async, so that the write() request
does not write any data or metadata to storage.
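
That is, a v3 client can do something like this (protocol level, not
code):

    WRITE  stable=UNSTABLE  -> server just caches the data and
                               replies immediately
    WRITE  stable=UNSTABLE  -> same
    ...
    COMMIT offset, count    -> server must now flush to storage,
                               e.g. via fsync()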


I hope my explanation didn't confuse you.

best regards,
----
Kenichi Okuyama@Tokyo Research Lab. IBM-Japan, Co.

P.S. I don't really think we should wait 10msec at the point of
gathered writes. I don't think this is of any help, because
fsync() has locking of its own.

While someone else has already started flushing the same file that
the current code is handling, we shouldn't be waiting for some fixed
interval. Rather, it is at fsync() time that the mutual flushing and
waiting is required. Nobody should be allowed into the loop that calls
ll_rw_block() inside fsync() except the very first caller, who waits
for the block writes until they finish.

This mutual exclusion SHOULD be what fsync() already provides, and if
not, we should simply dive into fsync() again for the next write. Even
if you end up writing to storage twice, it will usually not cost you
an extra 10msec.


2001-04-03 06:00:36

by NeilBrown

Subject: Re: nfsd vfs.c does not seem to fsync() on file overwrite when it has to.

On Tuesday April 3, [email protected] wrote:
> Dear Neil,
>
> I think I've found a bug in nfsd. Here's a patch to fix it.
> I'll explain what the problem is and what's changed after the patch.

snip

>
> Because
>
> L710: err = file.f_op->write(&file, buf, cnt, &file.f_pos);
>
> does not flush the dirty pages to storage in the async case, we need
> to call fsync() whenever the variable 'stable' is non-zero.
>
>
> Whether we should call nfsd_sync() or not should not depend on
> whether EX_WGATHER(exp) is true. So the call to nfsd_sync() should
> be outside the
>
> if ( EX_WGATHER(exp) ) {...}
>
> statement.
>

You've missed an important fact.
A few lines earlier is the code:
    if (stable && !EX_WGATHER(exp))
        file.f_flags |= O_SYNC;


so if ! EX_WGATHER, the file has the O_SYNC flag set and all IO to
that file is synchronous (whether the fs was mounted synchronous or
not).
If EX_WGATHER, the flag isn't set, and we explicitly call fsync after
a short delay.
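
Roughly, putting the two paths side by side (paraphrasing the code,
not quoting it exactly):

    if (stable && !EX_WGATHER(exp))
        file.f_flags |= O_SYNC;   /* every write() below is synchronous */
    ...
    err = file.f_op->write(&file, buf, cnt, &file.f_pos);
    ...
    if (stable && EX_WGATHER(exp)) {
        /* wait ~10ms for other writers of the same file,
         * then fsync if the inode is still dirty */
        if (inode->i_state & I_DIRTY)
            nfsd_sync(&file);
    }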

>
> Also, it should not depend on whether (inode->i_state & I_DIRTY) is
> true, because (inode->i_state & I_DIRTY) is true only when the inode
> itself is being changed (i.e. on an appending write), while we need
> to sync the written data to the filesystem even when we only
> overwrote an already existing data area.

Whenever you write to a file you change the inode - by changing the
modify time at least. Look at generic_file_write in mm/filemap.c.
Notice the code:
    if (count) {
        remove_suid(inode);
        inode->i_ctime = inode->i_mtime = CURRENT_TIME;
        mark_inode_dirty_sync(inode);
    }

If a non-zero amount of data is being written, the inode is marked dirty.
I don't think the test actually serves its purpose, as an fsync
doesn't cause the inode to be marked clean. I'm not sure when that happens.

>
> P.S. I don't really think we should wait 10msec at the point of
> gathered writes. I don't think this is of any help, because
> fsync() has locking of its own.

The theory is that you might get 4 write requests inside 10msec.
Each of them enter data into the cache, but don't flush it to disc.
10ms after the first arrives, it causes a fsync on the file. This
writes out all the data for the 4 requests. The other requests
eventually finish their 10ms wait, and their fsyncs find that there
is nothing to write out, and so complete quickly.
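
As a timeline (illustrative numbers only):

    t =  0ms   request 1 arrives, writes into cache, starts its 10ms wait
    t =  2ms   request 2 arrives, writes into cache, starts its 10ms wait
    t =  5ms   requests 3 and 4 arrive, write into cache, start their waits
    t = 10ms   request 1 wakes up; its fsync flushes the data of all four
    t = 12ms+  requests 2-4 wake up; their fsyncs find nothing dirty and
               return quickly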

I didn't write this code and I don't know how well this code actually
works, but it doesn't seem *wrong* or harmful in any way.

NeilBrown

2001-04-03 07:27:14

by Kenichi Okuyama

Subject: Re: nfsd vfs.c does not seem to fsync() on file overwrite when it has to.

Dear Neil,

First of all, thank you for the answer. You've given me a good enough
hint to make several points clear.

>>>>> "NB" == Neil Brown <[email protected]> writes:
>> Whether we should call nfsd_sync() or not should not depend on
>> whether EX_WGATHER(exp) is true. So the call to nfsd_sync() should
>> be outside the
>>
>> if ( EX_WGATHER(exp) ) {...}
>>
>> statement.
>>
NB> You've missed an important fact.
NB> A few lines earlier is the code:
NB> if (stable && !EX_WGATHER(exp))
NB> file.f_flags |= O_SYNC;

NB> so if ! EX_WGATHER, the file has the O_SYNC flag set and all IO to
NB> that file is synchronous (whether the fs was mounted synchronous or
NB> not).

I think I got your point. You mean the O_SYNC flag will be added in
the FILESYNC/DATASYNC case (or on NFS v2), so calling fsync() is not
required.

But I have questions about this. They are related to assumptions about
write(), so I'd like to come back to them a little later...


>>
>> Also, it should not depend on whether (inode->i_state & I_DIRTY) is
>> true, because (inode->i_state & I_DIRTY) is true only when the inode
>> itself is being changed (i.e. on an appending write), while we need
>> to sync the written data to the filesystem even when we only
>> overwrote an already existing data area.
NB> Whenever you write to a file you change the inode - by changing the
NB> modify time at least. Look at generic_file_write in mm/filemap.c.
NB> Notice the code:
NB> if (count) {
NB> remove_suid(inode);
NB> inode->i_ctime = inode->i_mtime = CURRENT_TIME;
NB> mark_inode_dirty_sync(inode);
NB> }

Ah! I see....

Well, actually I have three questions here.


1st:

Shouldn't "calling write with size 0" also cause
i_mtime = CURRENT_TIME;
?


2nd:
Where will the inode be written to storage in generic_file_write()?

At the very end of this function it says:

    /* For now, when the user asks for O_SYNC, we'll actually
     * provide O_DSYNC. */
    if ((status >= 0) && (file->f_flags & O_SYNC))
        status = generic_osync_inode(inode, 1); /* 1 means datasync */

This means write() does not write the inode here. So it must be
written out somewhere before this point.


But if you think about a robust filesystem, any inode update should
follow the "FROM FAR, TO NEAR" rule. I mean, if you have a data
update, you should first store the data blocks, then store and update
the indirect pointer blocks, and finally update the inode.

# That way, if updating the file fails due to... something like a
# power failure... the file itself will not be corrupted unless the
# failure hits at the very 'inode update' point. If you update the
# inode first and then write the data (the "FROM NEAR, TO FAR" rule),
# the inode may point to a new block that has not been written yet
# (which means it still contains old data, which might be a security
# problem. Think what would happen if the old block used to belong
# to /etc/shadow).

But I see no inode update after this point.... where is it????



3rd:
Is it really a good idea to rely on write() for SYNCHRONOUS writes,
rather than relying on fsync()??

At least, if we use fsync(), we can assume that all updates made by
the current nfsd are already held in the file cache in memory. Not
only that, we can also "hope" that write requests coming from a
different nfsd, or from a different process, get included too,
especially in the SMP world.

By calling fsync(), all of these are flushed to storage at once.
And this can happen without waiting even a jiffy.
Yes, we can do GATHERED WRITES without any waiting tricks.


Semantically, calling a SYNCHRONOUS write() and calling fsync()
afterwards are equivalent.

Then isn't it better to choose the case with the better odds? I mean,
instead of switching the write() option, isn't it better to call
fsync()?

# It should make the code simpler too, because even if you call
# fsync() when the filesystem is mounted "sync", fsync() has
# nothing to do and comes back quickly anyway.
# All we need to do is check ( stable != UNSTABLE ) and
# call nfsd_sync() right away.


I know that relying on fsync() would be bad if there were a known bug
in fsync(), especially in ext2. But I have never heard of one. Is there?



>> P.S. I don't really think we should wait 10msec at the point of
>> gathered writes. I don't think this is of any help, because
>> fsync() has locking of its own.
NB> The theory is that you might get 4 write requests inside 10msec.

I know, but if those were all synchronous requests, all 4 of those
nfsds will stop for 10msec. And that means the nfs server pays 10msec
for each, which means all 4 clients pay an extra 10msec, even if less
gets written to disk.

We're really losing 40msec in total, with overlaps....


Instead, all we need to do for a GATHERED WRITE is

asynchronous write()
fsync()

in this order. Because write() and fsync() contain locking, if write
requests arrive quickly enough, fsync() will be held back by another
process's (CPU's) write() request queued at the entry point of
write().

If the CPU is not fast enough to handle write() requests at that
rate, or is so fast that write() completes too quickly, the first
fsync() will start running. But other write() requests will then wait
for that fsync() to finish at the entry point of write(). This, I
think, is better than simply waiting for a fixed time.
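
In picture form (two nfsd threads writing to the same file, no fixed
wait anywhere):

    nfsd A:  write() -------- fsync()           <- flushes A's and B's data
    nfsd B:      write() -------------- fsync() <- nothing left to do,
                                                    returns immediately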


NB> I didn't write this code and I don't know how well this code actually
NB> works, but it doesn't seem *wrong* or harmful in any way.

With this I agree. I don't think it's wrong, nor harmful.
Except for performance.


best regards,
----
Kenichi Okuyama@Tokyo Research Lab. IBM-Japan, Co.

2001-04-04 02:21:07

by NeilBrown

Subject: Re: nfsd vfs.c does not seem to fsync() on file overwrite when it has to.

On Tuesday April 3, [email protected] wrote:
> NB> Whenever you write to a file you change the inode - by changing the
> NB> modify time at least. Look at generic_file_write in mm/filemap.c.
> NB> Notice the code:
> NB> if (count) {
> NB> remove_suid(inode);
> NB> inode->i_ctime = inode->i_mtime = CURRENT_TIME;
> NB> mark_inode_dirty_sync(inode);
> NB> }
>
> ! I see....
>
> Well, actually I have three questions here.
>
>
> 1st:
>
> Shouldn't "calling write with size 0" also cause
> i_mtime = CURRENT_TIME;
> ?

Does it matter? Ask POSIX, or SUS or some other standard...

>
>
> 2nd:
> Where will the inode be written to storage in generic_file_write()?
>
> At the very end of this function it says:
>
>     /* For now, when the user asks for O_SYNC, we'll actually
>      * provide O_DSYNC. */
>     if ((status >= 0) && (file->f_flags & O_SYNC))
>         status = generic_osync_inode(inode, 1); /* 1 means datasync */
>
> This means write() does not write the inode here. So it must be
> written out somewhere before this point.

Probably not. I think it gets written after. Possibly not until
memory pressure or a regular sync() flushes it out.
This is probably not ideal, but I think that O_SYNC handling is still
a work-in-progress.

>
>
> 3rd:
> Is it really a good idea to rely on write() for SYNCHRONOUS writes,
> rather than relying on fsync()??
>
> At least, if we use fsync(), we can assume that all updates made by
> the current nfsd are already held in the file cache in memory. Not
> only that, we can also "hope" that write requests coming from a
> different nfsd, or from a different process, get included too,
> especially in the SMP world.
>
> By calling fsync(), all of these are flushed to storage at once.
> And this can happen without waiting even a jiffy.
> Yes, we can do GATHERED WRITES without any waiting tricks.
>
>
> Semantically, calling a SYNCHRONOUS write() and calling fsync()
> afterwards are equivalent.
>
> Then isn't it better to choose the case with the better odds? I
> mean, instead of switching the write() option, isn't it better to
> call fsync()?

You are probably right. In practice, fsync is probably better than
O_SYNC, though in theory they should be the same.
But then in practice, almost everybody uses gathered writes, so fsync
is actually used.

>
> # It should make the code simpler too, because even if you call
> # fsync() when the filesystem is mounted "sync", fsync() has
> # nothing to do and comes back quickly anyway.
> # All we need to do is check ( stable != UNSTABLE ) and
> # call nfsd_sync() right away.
>
>
> I know that relying on fsync() would be bad if there were a known
> bug in fsync(), especially in ext2. But I have never heard of one.
> Is there?
>

Well, fsync in ext2 in 2.2 is really slow on large files, but that is
an issue of fading significance.
However I am not sure that there is any guarantee that filesystems
will support fsync. I note that sys_fsync allows for the possibility
that fsync is not supported on the file. NFSd should too.
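
Something like this, that is (just a sketch; check the exact ->fsync
signature against the tree before copying it):

    /* refuse, or fall back, if the filesystem has no fsync method */
    if (file.f_op && file.f_op->fsync)
        nfsd_sync(&file);
    else {
        /* no fsync method: either fail the stable write, or
         * fall back to an unstable reply */
    }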

I would prefer to wait for O_SYNC to be implemented properly rather
than change nfsd to always use fsync, but I don't feel very strongly
about it.

>
>
> >> P.S. I don't really think we should wait 10msec at the point of
> >> gathered writes. I don't think this is of any help, because
> >> fsync() has locking of its own.
> NB> The theory is that you might get 4 write requests inside 10msec.
>
> I know, but if those were all synchronous requests, all 4 of those
> nfsds will stop for 10msec. And that means the nfs server pays
> 10msec for each, which means all 4 clients pay an extra 10msec, even
> if less gets written to disk.
>
> We're really losing 40msec in total, with overlaps....

If they were synchronous, then nfsd_write wouldn't notice concurrent
writes and wouldn't pause the 10msec.

I suggest that you do some benchmarking. Choose a test. See how fast
it runs. Change the code. See what difference it makes. I would
definitely be interested in those sorts of results.

>
>
> Instead, all we need to do for a GATHERED WRITE is
>
> asynchronous write()
> fsync()
>
> in this order. Because write() and fsync() contain locking, if
> write requests arrive quickly enough, fsync() will be held back by
> another process's (CPU's) write() request queued at the entry point
> of write().

I suspect that in most cases the async write would be quite quick, and
the fsync would start before the async write of the next request. But
try it and tell us what happens.

NeilBrown