2008-06-21 19:34:23

by Phil Endecott

Subject: Propagation of changes in shared mmap()ed NFS files

Dear Experts,

I have a program which uses an mmap()ed read-mostly data file. When
not using NFS, each instance of the program can use inotify to detect
when other instances have made changes to the data file. Since inotify
doesn't work with NFS, I have now implemented a scheme using network
broadcasts to announce changes. At present it works like this:

All instances of the program mmap(MAP_SHARED) the data file.

One instance stores some new data at the end of the file and calls
msync(MS_SYNC) on the affected pages. It then "atomically commits" the
new data by write()ing a new header at the start of the file with an
"end of data" field advanced to include the new data. It then calls
fdatasync(). Then it transmits a broadcast packet.
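
The sequence above can be sketched in C roughly as follows. This is a
minimal sketch, not the actual program: the header layout (a single
64-bit end-of-data field at offset 0) and the names file_header,
map_data_file and append_and_commit are assumptions made for
illustration, and the broadcast step is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical header layout: the description only says that the header
 * holds an "end of data" field. */
struct file_header {
    uint64_t end_of_data;   /* offset one past the last committed byte */
};

/* Open path and map its first size bytes MAP_SHARED. Returns the mapping
 * (or MAP_FAILED) and stores the descriptor in *fd_out. */
unsigned char *map_data_file(const char *path, size_t size, int *fd_out)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }
    *fd_out = fd;
    return p;
}

/* Append len bytes at the current end of data and commit them.
 * The mapping must already cover end_of_data + len bytes.
 * Returns 0 on success, -1 on error. */
int append_and_commit(int fd, unsigned char *map, const void *data, size_t len)
{
    struct file_header hdr;
    if (pread(fd, &hdr, sizeof hdr, 0) != (ssize_t)sizeof hdr)
        return -1;

    /* 1. Store the new data through the mapping. */
    memcpy(map + hdr.end_of_data, data, len);

    /* 2. msync() needs a page-aligned start address, so round down. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = (size_t)hdr.end_of_data & ~(page - 1);
    size_t span = (size_t)hdr.end_of_data + len - start;
    if (msync(map + start, span, MS_SYNC) != 0)
        return -1;

    /* 3. Commit by advancing end_of_data with a write, then flush. */
    hdr.end_of_data += len;
    if (pwrite(fd, &hdr, sizeof hdr, 0) != (ssize_t)sizeof hdr)
        return -1;
    if (fdatasync(fd) != 0)
        return -1;

    /* 4. The caller would transmit the broadcast packet at this point. */
    return 0;
}
```

Note that this only orders the writes as seen by the server; it does
nothing to make other clients drop their cached copies.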

The other instance(s) of the program receive the broadcast packet and
read() the header at the start of the file. My hope was that they
would see the new value, but they don't; they continue to see the old value.

To allow for unreliable network broadcasts, the wait-for-broadcast code
has a 30-second timeout; when this timeout next expires, the program
reads the header again, and this time it sees the new end-of-data
offset and the new data in the mapped memory region.
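
For reference, the receiving side described above amounts to something
like the sketch below. It reproduces the scheme as described, including
the stale read, rather than fixing it. The function name, the use of
select(), and the 64-bit end-of-data field at offset 0 are assumptions
for illustration; sock is assumed to be an already-bound datagram
socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_sec seconds for a datagram on sock, then re-read the
 * end-of-data field from the start of fd.
 * Returns 1 if woken by a packet, 0 on timeout, -1 on error. */
int wait_then_read_header(int sock, int fd, int timeout_sec,
                          uint64_t *end_of_data)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(sock, &rfds);
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    int ready = select(sock + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;

    if (ready > 0) {
        /* Drain the notification; its contents are not used here. */
        char pkt[64];
        if (recv(sock, pkt, sizeof pkt, 0) < 0)
            return -1;
    }

    /* Packet or timeout, re-read the header either way. On NFS this
     * read may be served from the client's cache. */
    if (pread(fd, end_of_data, sizeof *end_of_data, 0)
            != (ssize_t)sizeof *end_of_data)
        return -1;

    return ready > 0 ? 1 : 0;
}
```

The program would call this in a loop with timeout_sec set to 30.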

So, what do I have to do so that the new data is visible promptly? Is
there more that the sender or the receiver should do to tell the local
kernel or the NFS server to propagate changes? For example, does
msync(MS_INVALIDATE) do anything useful? Do I simply have to wait for
some delay after receiving the broadcast?

I have also observed that when the changes are finally noticed, it
seems that the whole of the multi-megabyte file is re-fetched from the
server as it is accessed, despite only a few hundred bytes having
changed. This is undesirable; is there anything that I can do to
prevent it? (I tried calling mlock(), mainly to check that this wasn't
a memory-pressure problem, but it still seems to re-fetch it.)

Ideally I'd like to have something that will work with any non-ancient
version of NFS, and perhaps even CIFS too, but for now I'd be happy
with getting it working on this nfsv3 system.

Many thanks for any advice,

Phil.






2008-06-22 12:09:48

by Phil Endecott

Subject: Re: Propagation of changes in shared mmap()ed NFS files

Trond Myklebust wrote:
>> > On Sat, 2008-06-21 at 20:05 +0100, Phil Endecott wrote:
>> >> Dear Experts,
>> >>
>> >> I have a program which uses an mmap()ed read-mostly data file. When
>> >> not using NFS, each instance of the program can use inotify to detect
>> >> when other instances have made changes to the data file. Since inotify
>> >> doesn't work with NFS, I have now implemented a scheme using network
>> >> broadcasts to announce changes. At present it works like this:
>> >>
>> >> All instances of the program mmap(MAP_SHARED) the data file.
>> >>
>> >> One instance stores some new data at the end of the file and calls
>> >> msync(MS_SYNC) on the affected pages. It then "atomically commits" the
>> >> new data by write()ing a new header at the start of the file with an
>> >> "end of data" field advanced to include the new data. It then calls
>> >> fdatasync(). Then it transmits a broadcast packet.
>> >>
>> >> The other instance(s) of the program receive the broadcast packet and
>> >> read() the header at the start of the file. My hope was that they
>> >> would see the new value, but they don't; they continue to see the old value.

> You shouldn't use mmap() to read data in this situation. mmap() is
> designed for cases where the authoritative copy of the data can be kept
> in local memory.
> In your situation, the authoritative copy is always on disk (or the NFS
> server), and so the correct paradigm is to use O_DIRECT read() and
> write() or to use POSIX file locking. The latter allows the NFS clients
> to do the read()/write() synchronisation for you, whereas the former
> assumes that you are doing some other form of locking to ensure
> synchronisation between readers and writers.

Hmmm. OK. But mmap(MAP_SHARED) does exactly what I want in the more
common case where the files are not on NFS; I can have multiple
instances of the program and only one RAM copy of the data is needed,
and changes made by one instance are immediately visible to the
others. The problem is that the NFS implementation of mmap(MAP_SHARED)
doesn't match the behaviour of the non-NFS version.

It looks to me as if the writer does the right thing: after it modifies
pages they are written back to the server when I call msync(). But,
IIUC, the server has no way to inform the other clients that those
pages are modified. Instead, the clients will revalidate with the
server after some timeout; this revalidation is not per-page but
per-file, so the server will tell them that the whole file has changed
and the clients will invalidate all of their pages. Is this true?

So: is there anything that I can do on the client to say:

- "even though the timeout hasn't expired, invalidate these cached
pages now"?
- "even though the timeout has expired, and the server says that the
file has changed, keep using your cached copies of these pages"?


Phil.





2008-06-21 21:43:49

by Trond Myklebust

Subject: Re: Propagation of changes in shared mmap()ed NFS files

On Sat, 2008-06-21 at 20:05 +0100, Phil Endecott wrote:
> Dear Experts,
>
> I have a program which uses an mmap()ed read-mostly data file. When
> not using NFS, each instance of the program can use inotify to detect
> when other instances have made changes to the data file. Since inotify
> doesn't work with NFS, I have now implemented a scheme using network
> broadcasts to announce changes. At present it works like this:
>
> All instances of the program mmap(MAP_SHARED) the data file.
>
> One instance stores some new data at the end of the file and calls
> msync(MS_SYNC) on the affected pages. It then "atomically commits" the
> new data by write()ing a new header at the start of the file with an
> "end of data" field advanced to include the new data. It then calls
> fdatasync(). Then it transmits a broadcast packet.
>
> The other instance(s) of the program receive the broadcast packet and
> read() the header at the start of the file. My hope was that they
> would see the new value, but they don't; they continue to see the old value.

open(O_DIRECT) is your friend.
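
A minimal sketch of what that could look like for the header read, so
that read() goes to the server instead of the client's page cache. The
function name, the 64-bit end-of-data field at offset 0, and the
4096-byte alignment are assumptions for illustration, not details from
the original program.

```c
#define _GNU_SOURCE             /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DIO_BLOCK 4096          /* assumed alignment and transfer size */

/* Read the hypothetical 64-bit end-of-data field at offset 0 using
 * O_DIRECT. Returns 0 on success, -1 on failure. */
int read_end_of_data_direct(const char *path, uint64_t *end_of_data)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;              /* e.g. EINVAL if O_DIRECT is unsupported */

    void *buf = NULL;
    int rc = -1;

    /* Local filesystems generally require the buffer, offset and length
     * to be block-aligned, so read a whole aligned block. */
    if (posix_memalign(&buf, DIO_BLOCK, DIO_BLOCK) == 0) {
        ssize_t n = pread(fd, buf, DIO_BLOCK, 0);
        if (n >= (ssize_t)sizeof *end_of_data) {
            memcpy(end_of_data, buf, sizeof *end_of_data);
            rc = 0;
        }
        free(buf);
    }

    close(fd);
    return rc;
}
```

The writer's header write() would need the same treatment, and some
other form of locking is still needed to keep readers and writers in
step.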

Cheers
Trond


2008-06-21 22:02:40

by Phil Endecott

Subject: Re: Propagation of changes in shared mmap()ed NFS files

Trond Myklebust wrote:
> On Sat, 2008-06-21 at 20:05 +0100, Phil Endecott wrote:
>> Dear Experts,
>>
>> I have a program which uses an mmap()ed read-mostly data file. When
>> not using NFS, each instance of the program can use inotify to detect
>> when other instances have made changes to the data file. Since inotify
>> doesn't work with NFS, I have now implemented a scheme using network
>> broadcasts to announce changes. At present it works like this:
>>
>> All instances of the program mmap(MAP_SHARED) the data file.
>>
>> One instance stores some new data at the end of the file and calls
>> msync(MS_SYNC) on the affected pages. It then "atomically commits" the
>> new data by write()ing a new header at the start of the file with an
>> "end of data" field advanced to include the new data. It then calls
>> fdatasync(). Then it transmits a broadcast packet.
>>
>> The other instance(s) of the program receive the broadcast packet and
>> read() the header at the start of the file. My hope was that they
>> would see the new value, but they don't; they continue to see the old value.
>
> open(O_DIRECT) is your friend.

Thanks Trond, I'll give it a try.

This only affects the write()s and read()s though, doesn't it? So are
you suggesting that the mmap()ed data is correctly propagated already,
and only the write-to-read needs fixing?

BTW the man page is a bit discouraging about the combination of
O_DIRECT and mmap(): "applications should avoid mixing mmap(2) of files
with direct I/O to the same files." Fingers crossed....


Phil.





2008-06-21 22:12:56

by Trond Myklebust

Subject: Re: Propagation of changes in shared mmap()ed NFS files

On Sat, 2008-06-21 at 23:02 +0100, Phil Endecott wrote:
> Trond Myklebust wrote:
> > On Sat, 2008-06-21 at 20:05 +0100, Phil Endecott wrote:
> >> Dear Experts,
> >>
> >> I have a program which uses an mmap()ed read-mostly data file. When
> >> not using NFS, each instance of the program can use inotify to detect
> >> when other instances have made changes to the data file. Since inotify
> >> doesn't work with NFS, I have now implemented a scheme using network
> >> broadcasts to announce changes. At present it works like this:
> >>
> >> All instances of the program mmap(MAP_SHARED) the data file.
> >>
> >> One instance stores some new data at the end of the file and calls
> >> msync(MS_SYNC) on the affected pages. It then "atomically commits" the
> >> new data by write()ing a new header at the start of the file with an
> >> "end of data" field advanced to include the new data. It then calls
> >> fdatasync(). Then it transmits a broadcast packet.
> >>
> >> The other instance(s) of the program receive the broadcast packet and
> >> read() the header at the start of the file. My hope was that they
> >> would see the new value, but they don't; they continue to see the old value.
> >
> > open(O_DIRECT) is your friend.
>
> Thanks Trond, I'll give it a try.
>
> This only affects the write()s and read()s though, doesn't it? So are
> you suggesting that the mmap()ed data is correctly propagated already,
> and only the write-to-read needs fixing?
>
> BTW the man page is a bit discouraging about the combination of
> O_DIRECT and mmap(): "applications should avoid mixing mmap(2) of files
> with direct I/O to the same files." Fingers crossed....

You shouldn't use mmap() to read data in this situation. mmap() is
designed for cases where the authoritative copy of the data can be kept
in local memory.
In your situation, the authoritative copy is always on disk (or the NFS
server), and so the correct paradigm is to use O_DIRECT read() and
write() or to use POSIX file locking. The latter allows the NFS clients
to do the read()/write() synchronisation for you, whereas the former
assumes that you are doing some other form of locking to ensure
synchronisation between readers and writers.
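
As a rough illustration of the POSIX locking variant: each header access
is wrapped in an fcntl() lock, relying on the NFS client to revalidate
its cache when the lock is taken and to flush writes when it is
released. The function names and the 64-bit end-of-data field at offset
0 are assumptions for illustration, and how promptly a given client
revalidates is worth verifying on the system in question.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Lock or unlock the whole file; type is F_RDLCK, F_WRLCK or F_UNLCK. */
static int lock_file(int fd, short type)
{
    struct flock fl = {
        .l_type = type,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,             /* 0 means "to end of file" */
    };
    return fcntl(fd, F_SETLKW, &fl);    /* blocks until granted */
}

/* Read the hypothetical end-of-data field under a shared lock. */
int read_end_of_data_locked(int fd, uint64_t *end_of_data)
{
    if (lock_file(fd, F_RDLCK) != 0)
        return -1;
    ssize_t n = pread(fd, end_of_data, sizeof *end_of_data, 0);
    int rc = (n == (ssize_t)sizeof *end_of_data) ? 0 : -1;
    if (lock_file(fd, F_UNLCK) != 0)
        rc = -1;
    return rc;
}

/* Update the hypothetical end-of-data field under an exclusive lock. */
int write_end_of_data_locked(int fd, uint64_t end_of_data)
{
    if (lock_file(fd, F_WRLCK) != 0)
        return -1;
    ssize_t n = pwrite(fd, &end_of_data, sizeof end_of_data, 0);
    int rc = (n == (ssize_t)sizeof end_of_data) ? 0 : -1;
    if (lock_file(fd, F_UNLCK) != 0)
        rc = -1;
    return rc;
}
```

The descriptor must be open for reading to take F_RDLCK and for writing
to take F_WRLCK.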

Cheers
Trond