2008-08-29 17:54:39

by Peter Staubach

Subject: Re: [Bugme-new] [Bug 11448] New: NFS client has inconsistent write flushing to non-linux servers

Doug Hughes wrote:
> Peter Staubach wrote:
>> J. Bruce Fields wrote:
>>> On Thu, Aug 28, 2008 at 01:27:53PM -0700, Andrew Morton wrote:
>>>
>>>> (switched to email. Please respond via emailed reply-to-all, not
>>>> via the bugzilla web interface).
>>>>
>>>> On Thu, 28 Aug 2008 11:41:08 -0700 (PDT)
>>>> bugme-daemon-590EEB7GvNiWaY/[email protected] wrote:
>>>>
>>>>> NFS client writes to Sun Solaris 10 U4 server. at some point in
>>>>> time, there is an empty portion of the output file from the
>>>>> writer containing missing data (shows as NULL bytes from another
>>>>> NFS client issuing a tail -f on the file being written). confirmed
>>>>> that the file as exists on the NFS server is sparse, missing bytes
>>>>> (not necessarily multiple of 512 or 1024, one sample is a gap of
>>>>> 3818 bytes, another is 1895 bytes, another is 423 bytes)
>>>>>
>>>
>>> Seems like something that could happen if for example two write rpc's
>>> got reordered on the network. That's not necessarily a bug--the nfs
>>> client isn't required to wait for confirmation of every previous write
>>> before sending the next one.
>>>
> if two RPCs got reordered on the network, and they encompass all the
> data, then there shouldn't be any missing data. It seems to me like
> pieces of data are just being skipped, for whatever reason, but I
> haven't exhaustively examined the NFS network data.
>
>>> However if the client isn't flushing dirty data to the server before
>>> returning from close, then that's a violation of NFS's close-to-open
>>> semantics:...
>>>
> this is not confirmed yet. No solid cases of data not being present
> after close.
>>>
>>>>> if you do a read of the entire file from the NFS client doing the
>>>>> writing, it causes the non-flushed writes to be instantly flushed
>>>>> to the server followed by a NFS3 commit operation. The data then
>>>>> can be seen on all other NFS clients.
>>>>>
>>>>> If you do an open of the file alone, no flush
>>>>> if you do an open and a close, no flush
>>>>>
>>>
>>> ... so this "close, no flush" could be a bug (depending on who is doing
>>> that close when--I don't completely understand the described
>>> situation).
>>
>> I suspect that this last might depend upon 1) what options were used
>> when the file system was mounted and 2) how the file was opened. The
>> flush-on-close wouldn't be needed if the file was opened read-only.
>>
> no special options on open. Here are the mount options:
> retry=1000,tcp,noatime,nosuid,nodev,dirsync,timeo=100,rsize=32768,wsize=32768,hard,intr
>
>
>> It seems a little odd that the holes aren't page aligned or page
>> sized multiples.
>>
> indeed. and the time for them to actually get to the server is
> indeterminate (days is not uncommon). We have not as yet confirmed
> whether some of the data never gets sent to the server until close.
>
>> What application is being used to generate the file which is showing
>> these holes?
>>
> namd and some custom code developed in-house for chemistry research
> (at the very least)

Do these applications use mmap() or generate the file contents
serially or randomly?

Thanx...

ps


2008-08-29 18:28:10

by Doug Hughes

Subject: Re: [Bugme-new] [Bug 11448] New: NFS client has inconsistent write flushing to non-linux servers

Peter Staubach wrote:
> Doug Hughes wrote:
>> Peter Staubach wrote:
>>> J. Bruce Fields wrote:
>>>> On Thu, Aug 28, 2008 at 01:27:53PM -0700, Andrew Morton wrote:
>>>>
>>>>> (switched to email. Please respond via emailed reply-to-all, not
>>>>> via the bugzilla web interface).
>>>>>
>>>>> On Thu, 28 Aug 2008 11:41:08 -0700 (PDT)
>>>>> bugme-daemon-590EEB7GvNiWaY/[email protected] wrote:
>>>>>
>>>>>> NFS client writes to Sun Solaris 10 U4 server. at some point in
>>>>>> time, there is an empty portion of the output file from the
>>>>>> writer containing missing data (shows as NULL bytes from another
>>>>>> NFS client issuing a tail -f on the file being written). confirmed
>>>>>> that the file as exists on the NFS server is sparse, missing bytes
>>>>>> (not necessarily multiple of 512 or 1024, one sample is a gap of
>>>>>> 3818 bytes, another is 1895 bytes, another is 423 bytes)
>>>>>>
>>>>
>>>> Seems like something that could happen if for example two write rpc's
>>>> got reordered on the network. That's not necessarily a bug--the nfs
>>>> client isn't required to wait for confirmation of every previous write
>>>> before sending the next one.
>>>>
>> if two RPCs got reordered on the network, and they encompass all the
>> data, then there shouldn't be any missing data. It seems to me like
>> pieces of data are just being skipped, for whatever reason, but I
>> haven't exhaustively examined the NFS network data.
>>
>>>> However if the client isn't flushing dirty data to the server before
>>>> returning from close, then that's a violation of NFS's close-to-open
>>>> semantics:...
>>>>
>> this is not confirmed yet. No solid cases of data not being present
>> after close.
>>>>
>>>>>> if you do a read of the entire file from the NFS client doing the
>>>>>> writing, it causes the non-flushed writes to be instantly flushed
>>>>>> to the server followed by a NFS3 commit operation. The data then
>>>>>> can be seen on all other NFS clients.
>>>>>>
>>>>>> If you do an open of the file alone, no flush
>>>>>> if you do an open and a close, no flush
>>>>>>
>>>>
>>>> ... so this "close, no flush" could be a bug (depending on who is
>>>> doing that close when--I don't completely understand the described
>>>> situation).
>>>
>>> I suspect that this last might depend upon 1) what options were used
>>> when the file system was mounted and 2) how the file was opened. The
>>> flush-on-close wouldn't be needed if the file was opened read-only.
>>>
>> no special options on open. Here are the mount options:
>> retry=1000,tcp,noatime,nosuid,nodev,dirsync,timeo=100,rsize=32768,wsize=32768,hard,intr
>>
>>
>>> It seems a little odd that the holes aren't page aligned or page
>>> sized multiples.
>>>
>> indeed. and the time for them to actually get to the server is
>> indeterminate (days is not uncommon). We have not as yet confirmed
>> whether some of the data never gets sent to the server until close.
>>
>>> What application is being used to generate the file which is showing
>>> these holes?
>>>
>> namd and some custom code developed in-house for chemistry research
>> (at the very least)
>
> Do these applications use mmap() or generate the file contents
> serially or randomly?
>
> Thanx...
>
>
Neither. The file is opened once at the beginning, then it is write,
write, write, write, write (no seek, no offset, entirely serial). The
job runs for a very long time, then ends.
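
For anyone who wants to try reproducing this without namd, the access
pattern described above might be approximated by something like the
following. This is only a rough sketch: the function name
append_records and the sync_each switch are made up for illustration,
and the real applications presumably go through stdio rather than raw
write calls.

```python
import os
import tempfile


def append_records(path, records, sync_each=False):
    """Open once, write each record in order with no seeks.

    Returns the total number of bytes written. sync_each is a
    hypothetical knob: when True, fsync after every record to force
    the data out to the server instead of waiting for writeback.
    """
    total = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for rec in records:
            data = rec.encode("ascii")
            # os.write may write fewer bytes than requested, so loop.
            while data:
                n = os.write(fd, data)
                total += n
                data = data[n:]
            if sync_each:
                os.fsync(fd)
    finally:
        os.close(fd)
    return total


if __name__ == "__main__":
    # Illustrative records only; point the path at an NFS mount to test.
    path = os.path.join(tempfile.mkdtemp(), "out.log")
    lines = ["%d 47.0 0 0 0 47.8 0\n" % step for step in range(0, 500, 100)]
    print(append_records(path, lines))
```

Running it with sync_each=True on the same mount would be one way to
check whether an explicit fsync makes the gaps go away.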

strace excerpt:
16:42:56.143512 write(8, "1948900 47.1225 0 0 0 47.7759 0 "..., 118) = 118
16:43:01.845742 write(8, "1949000 47.0474 0 0 0 47.8865 0 "..., 116) = 116
16:43:07.481889 write(8, "1949100 47.045 0 0 0 48.0742 0 0"..., 116) = 116
16:43:13.150555 write(8, "1949200 47.1848 0 0 0 47.8868 0 "..., 116) = 116
16:43:18.788863 write(8, "1949300 47.251 0 0 0 47.7743 0 0"..., 113) = 113
16:43:24.429424 write(8, "1949400 47.2722 0 0 0 47.6937 0 "..., 118) = 118
16:43:30.057179 write(8, "1949500 47.4865 0 0 0 47.6251 0 "..., 117) = 117
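
To measure gaps like the ones reported earlier in the thread (3818,
1895 and 423 bytes), a small scanner for runs of NUL bytes could be
run against a copy of the file as seen from a second client. A minimal
sketch follows; find_nul_runs is a hypothetical helper, and it only
reports zero bytes as read, so it cannot tell a real hole on the
server from zeros that an application wrote on purpose.

```python
def find_nul_runs(data, min_len=1):
    """Return (offset, length) for each run of NUL bytes in data
    that is at least min_len bytes long."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i  # a new run of zero bytes begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        for offset, length in find_nul_runs(f.read()):
            print("gap at offset %d, %d bytes" % (offset, length))
```

Since the output here is plain text, any run it reports in the log
should correspond to data that has not yet reached the server.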