2005-12-06 22:04:50

by Kenny Simpson

[permalink] [raw]
Subject: another nfs puzzle

Hi again,
I am seeing some odd behavior with O_DIRECT. If a file opened with O_DIRECT has a page mmap'd,
and the file is extended via pwrite, then the mmap'd region seems to get lost - i.e. it neither
takes up system memory, nor does it get written out.

The attached file demonstrates this.
Compile with -DABUSE to trigger the bad case.
This behavior does not happen with an ext3 partition.

ethereal shows the behavior to be a large amount of block reads, with single byte writes every so
often. Viewing the resultant file from other hosts or even the original host shows the file is
grown, but is zero-filled, not 'a'-filled.

-Kenny




__________________________________________
Yahoo! DSL ? Something to write home about.
Just $16.99/mo. or less.
dsl.yahoo.com


Attachments:
dtest.c (1.72 kB)
862959384-dtest.c

2005-12-07 03:24:55

by Trond Myklebust

[permalink] [raw]
Subject: Re: another nfs puzzle

NFS: Fix another O_DIRECT race

Ensure we call unmap_mapping_range() and sync dirty pages to disk before
doing an NFS direct write.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/direct.c | 43 ++++++++++++++++++++++++-------------------
1 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index a2d2814..ce83000 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -111,6 +111,21 @@ nfs_get_user_pages(int rw, unsigned long
return result;
}

+static int nfs_sync_mapping(struct address_space *mapping, loff_t offset, size_t len)
+{
+ int ret;
+ if (mapping->nrpages)
+ return 0;
+
+ unmap_mapping_range(mapping, offset, len, 0);
+ ret = filemap_fdatawrite(mapping);
+ if (ret == 0)
+ ret = filemap_fdatawait(mapping);
+ if (ret == 0)
+ ret = nfs_wb_all(mapping->host);
+ return ret;
+}
+
/**
* nfs_free_user_pages - tear down page struct array
* @pages: array of page struct pointers underlying target buffer
@@ -676,15 +691,9 @@ nfs_file_direct_read(struct kiocb *iocb,
if (!count)
goto out;

- if (mapping->nrpages) {
- retval = filemap_fdatawrite(mapping);
- if (retval == 0)
- retval = nfs_wb_all(inode);
- if (retval == 0)
- retval = filemap_fdatawait(mapping);
- if (retval)
- goto out;
- }
+ retval = nfs_sync_mapping(mapping, pos, count);
+ if (retval)
+ goto out;

retval = nfs_direct_read(inode, ctx, &iov, pos, 1);
if (retval > 0)
@@ -762,19 +771,15 @@ nfs_file_direct_write(struct kiocb *iocb
if (!count)
goto out;

- if (mapping->nrpages) {
- retval = filemap_fdatawrite(mapping);
- if (retval == 0)
- retval = nfs_wb_all(inode);
- if (retval == 0)
- retval = filemap_fdatawait(mapping);
- if (retval)
- goto out;
- }
+ retval = nfs_sync_mapping(mapping, pos, count);
+ if (retval)
+ goto out;

retval = nfs_direct_write(inode, ctx, &iov, pos, 1);
if (mapping->nrpages)
- invalidate_inode_pages2(mapping);
+ invalidate_inode_pages2_range(mapping,
+ pos >> PAGE_CACHE_SHIFT,
+ (pos + count - 1) >> PAGE_CACHE_SHIFT);
if (retval > 0)
*ppos = pos + retval;


Attachments:
linux-2.6.15-27-unmap_before_odirect.dif (2.07 kB)

2005-12-07 14:01:28

by Peter Staubach

[permalink] [raw]
Subject: Re: another nfs puzzle

Kenny Simpson wrote:

>Hi again,
> I am seeing some odd behavior with O_DIRECT. If a file opened with O_DIRECT has a page mmap'd,
>and the file is extended via pwrite, then the mmap'd region seems to get lost - i.e. it neither
>takes up system memory, nor does it get written out.
>
>

I don't think that I understand why or how the kernel allows a file,
which was opened with O_DIRECT, to be mmap'd. The use of O_DIRECT
implies no caching and mmap implies the use of caching.

I do understand that the kernel can flush and invalidate pages in
the ranges of the i/o operations done, but does it really guarantee
that a pagein operation won't happen on a page within the range of
a region of the file being written to using direct i/o? If it does
not, then any consistency guarantees are gone and data corruption
can occur.

Thanx...

ps

2005-12-07 14:11:38

by Trond Myklebust

[permalink] [raw]
Subject: Re: another nfs puzzle

On Wed, 2005-12-07 at 09:01 -0500, Peter Staubach wrote:
> Kenny Simpson wrote:
>
> >Hi again,
> > I am seeing some odd behavior with O_DIRECT. If a file opened with O_DIRECT has a page mmap'd,
> >and the file is extended via pwrite, then the mmap'd region seems to get lost - i.e. it neither
> >takes up system memory, nor does it get written out.
> >
> >
>
> I don't think that I understand why or how the kernel allows a file,
> which was opened with O_DIRECT, to be mmap'd. The use of O_DIRECT
> implies no caching and mmap implies the use of caching.

In this context it doesn't matter whether or not the you use the same
file descriptor. The problem is the same if my process opens the file
for O_DIRECT and then your process open it for normal I/O, and mmaps it.

Cheers,
Trond

2005-12-07 14:19:00

by Peter Staubach

[permalink] [raw]
Subject: Re: another nfs puzzle

Trond Myklebust wrote:

>On Wed, 2005-12-07 at 09:01 -0500, Peter Staubach wrote:
>
>
>>Kenny Simpson wrote:
>>
>>
>>
>>>Hi again,
>>> I am seeing some odd behavior with O_DIRECT. If a file opened with O_DIRECT has a page mmap'd,
>>>and the file is extended via pwrite, then the mmap'd region seems to get lost - i.e. it neither
>>>takes up system memory, nor does it get written out.
>>>
>>>
>>>
>>>
>>I don't think that I understand why or how the kernel allows a file,
>>which was opened with O_DIRECT, to be mmap'd. The use of O_DIRECT
>>implies no caching and mmap implies the use of caching.
>>
>>
>
>In this context it doesn't matter whether or not the you use the same
>file descriptor. The problem is the same if my process opens the file
>for O_DIRECT and then your process open it for normal I/O, and mmaps it.
>

Yup, same problem. Why is this allowed? Does it really work correctly?

Thanx...

ps

2005-12-07 14:34:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: another nfs puzzle

On Wed, 2005-12-07 at 09:18 -0500, Peter Staubach wrote:
> >In this context it doesn't matter whether or not the you use the same
> >file descriptor. The problem is the same if my process opens the file
> >for O_DIRECT and then your process open it for normal I/O, and mmaps it.
> >
>
> Yup, same problem. Why is this allowed? Does it really work correctly?

Assuming that the processes have _some_ method of synchronisation, then
I cannot see why it shouldn't be workable. Come to think of it, it might
even be possible to use O_DIRECT to provide that synchronisation (use
O_DIRECT to set a "lock" on the page, then modify it using mmap).

Whether or not there are people out there that actually _want_ to do
this is a different matter.

Cheers,
Trond

2005-12-07 14:50:18

by Kenny Simpson

[permalink] [raw]
Subject: Re: another nfs puzzle

--- Trond Myklebust <[email protected]> wrote:
> Does the attached patch fix it?

Dude, you're a machine! Yes, it fixes it.

-Kenny




__________________________________________
Yahoo! DSL ? Something to write home about.
Just $16.99/mo. or less.
dsl.yahoo.com

2005-12-07 15:35:11

by Peter Staubach

[permalink] [raw]
Subject: Re: another nfs puzzle

Trond Myklebust wrote:

>>Yup, same problem. Why is this allowed? Does it really work correctly?
>>
>>
>
>Assuming that the processes have _some_ method of synchronisation, then
>I cannot see why it shouldn't be workable. Come to think of it, it might
>even be possible to use O_DIRECT to provide that synchronisation (use
>O_DIRECT to set a "lock" on the page, then modify it using mmap).
>
>Whether or not there are people out there that actually _want_ to do
>this is a different matter.
>

Mixing O_DIRECT i/o and cached i/o is probably a recipe for disaster,
unless the cooperating programs are very careful and very aware of how
the particular file system in the particular kernel implements direct
i/o and caching, including cache validation.

This seems like a dangerous enough area that denying mmap on a file which
has been opened with O_DIRECT by any process and denying open(O_DIRECT)
on a file which has been mmap'd would be a good thing. These things are
easy enough to keep track of, so it shouldn't be too hard to implement.

Thanx...

ps

2005-12-07 15:42:14

by Trond Myklebust

[permalink] [raw]
Subject: Re: another nfs puzzle

On Wed, 2005-12-07 at 10:34 -0500, Peter Staubach wrote:

> This seems like a dangerous enough area that denying mmap on a file which
> has been opened with O_DIRECT by any process and denying open(O_DIRECT)
> on a file which has been mmap'd would be a good thing. These things are
> easy enough to keep track of, so it shouldn't be too hard to implement.

That would be a recipe for DOSes as it would allow one process to block
another just by opening a file. I'd really not like to see my database
crash just because some obscure monitoring program happens to use mmap()
to browse the file.

Cheers,
Trond

2005-12-07 15:56:59

by Peter Staubach

[permalink] [raw]
Subject: Re: another nfs puzzle

Trond Myklebust wrote:

>On Wed, 2005-12-07 at 10:34 -0500, Peter Staubach wrote:
>
>
>
>>This seems like a dangerous enough area that denying mmap on a file which
>>has been opened with O_DIRECT by any process and denying open(O_DIRECT)
>>on a file which has been mmap'd would be a good thing. These things are
>>easy enough to keep track of, so it shouldn't be too hard to implement.
>>
>>
>
>That would be a recipe for DOSes as it would allow one process to block
>another just by opening a file. I'd really not like to see my database
>crash just because some obscure monitoring program happens to use mmap()
>to browse the file.
>

I wouldn't think that a database would be a problem since it is opened early
and then stays open. However, point granted. There are tools around, lsof
and such, which would be handy for diagnosing such a situation though.

I do know of other operating systems which do implement the semantics that I
have suggested and I don't think that they are concerned or have been
notified
that these semantics for a problem. These semantics are used because the
kernel itself can not even guarantee that the data that it is caching is
valid,
without lots of synchronization which may tend to reduce performance.

Thanx...

ps

2005-12-07 16:10:09

by Trond Myklebust

[permalink] [raw]
Subject: Re: another nfs puzzle

On Wed, 2005-12-07 at 10:56 -0500, Peter Staubach wrote:

> I do know of other operating systems which do implement the semantics that I
> have suggested and I don't think that they are concerned or have been
> notified
> that these semantics for a problem. These semantics are used because the
> kernel itself can not even guarantee that the data that it is caching is
> valid,
> without lots of synchronization which may tend to reduce performance.

I agree. It is clearly going to be a drag on performance, but it should
be very much a corner case, so...

To tell you the truth, I'm more interested in this case in order to
figure out how to make mmap() work correctly for the case of an ordinary
file, but where the file changes on the server. Currently we just call
invalidate_inode_pages2(), without unmapping first. The conclusion from
Kenny's testcase appears to be that this may lead to mmapped dirty data
being lost.

Cheers,
Trond

2005-12-07 16:39:37

by Peter Staubach

[permalink] [raw]
Subject: Re: another nfs puzzle

Trond Myklebust wrote:

>
>I agree. It is clearly going to be a drag on performance, but it should
>be very much a corner case, so...
>
>
>

Yes, file systems tend to mostly be about the corner cases though... :-)

>To tell you the truth, I'm more interested in this case in order to
>figure out how to make mmap() work correctly for the case of an ordinary
>file, but where the file changes on the server. Currently we just call
>invalidate_inode_pages2(), without unmapping first. The conclusion from
>Kenny's testcase appears to be that this may lead to mmapped dirty data
>being lost.
>

Ugly, huh? The options seem to be to either write out all of the data
to the server or just truncate it. I have seen the former used mostly,
although this does generate the "last one there wins" scenario. This
does match the usual semantics associated with normal cached data from
write(2)s and should fit well with the NFSv4 callback notifying that a
write delegation is being withdrawn.

Thanx...

ps