2008-07-02 11:28:14

by Andy Chittenden

Subject: nfs client readdir caching issue?

Very rarely, we're seeing various problems on a Linux kernel client
(seen on various kernel versions) with ls on directories from an NFS
server, even though those directories haven't changed:

* looping ls (strace -v shows getdents returning the same names over
and over again).
* duplicate directory entries.
* missing directory entries.

I've hunted Google but can only see problems where NFS servers have
returned duplicate cookies. I've packet-captured the readdirplus traffic
on one of the directories and see no duplicate cookies. The problems
remain until the directory is touched, the NFS filesystem is unmounted,
or some other event happens (the data is flushed from the cache?).

I think we then got lucky and got two packet captures from different
clients running the same linux kernel. On these clients, the ls output
was ok - no loops, no duplicates, no missing entries. Both captures
showed two readdirplus requests returning the same entries in the same
order but the amount of data in the responses was different. One capture
showed the server returned 1724 bytes, 10 entries, last cookie of 12,
followed by the next readdirplus returning a length of 948 bytes, 5
entries, a first cookie value of 13. In the other capture, the responses
returned 2204 bytes, 13 entries, a last cookie of 17 and 468 bytes, 2
entries, a first cookie of 19.
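
To put the figures from the two captures side by side:

    capture 1, response 1: 1724 bytes, 10 entries, last cookie 12
    capture 1, response 2:  948 bytes,  5 entries, first cookie 13
    capture 2, response 1: 2204 bytes, 13 entries, last cookie 17
    capture 2, response 2:  468 bytes,  2 entries, first cookie 19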

In the past, we've found that ls returned duplicate entries on this
directory (but we didn't have a capture at the time). Those duplicate
entries are the ones returned as the last 3 entries in the first
response of the second capture and as the first 3 entries in the second
response of the first capture.

So what I think has happened in this particular case is that at some
point in the past, the directory was read OK with packets similar to the
first capture. Next, the client decided to get rid of the first page of
cached readdir responses from memory for some reason (running low on
memory?) but kept the second page. Subsequently, the readdir cache needs
repopulating, so the client sends a readdirplus specifying a cookie of 0
and this time it gets a response similar to the first packet of the
second capture, and thus we now have duplicate names and cookie values
in the cache.
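
To lay that hypothesis out with the numbers above (this is the suspected
sequence, not something seen directly in any single capture):

    page 0, old contents: entries  1..10, cookies up to 12   <- evicted
    page 1, old contents: entries 11..15, cookies from 13    <- kept
    page 0, re-read:      entries  1..13, cookies up to 17

With the stale page 1 still cached, entries 11..13 now appear both at
the end of the re-read page 0 and at the start of the old page 1: three
duplicate names, and a cookie sequence that jumps from 17 back to 13
when the pages are walked in order.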

So is this possible? Is there some easy way to provoke it? Does this
mean the client's readdir cache is broken?

Please cc me on any response.

--
Andy, BlueArc Engineering


2008-07-02 17:37:40

by Trond Myklebust

Subject: Re: nfs client readdir caching issue?

On Wed, 2008-07-02 at 12:03 +0100, Andy Chittenden wrote:
> Very rarely, we're seeing various problems on a Linux kernel client
> (seen on various kernel versions) with ls on directories from an NFS
> server, even though those directories haven't changed:
>
> * looping ls (strace -v shows getdents returning the same names over
> and over again).
> * duplicate directory entries.
> * missing directory entries.
>
> I've hunted Google but can only see problems where NFS servers have
> returned duplicate cookies. I've packet-captured the readdirplus traffic
> on one of the directories and see no duplicate cookies. The problems
> remain until the directory is touched, the NFS filesystem is unmounted,
> or some other event happens (the data is flushed from the cache?).
>
> I think we then got lucky and got two packet captures from different
> clients running the same linux kernel. On these clients, the ls output
> was ok - no loops, no duplicates, no missing entries. Both captures
> showed two readdirplus requests returning the same entries in the same
> order but the amount of data in the responses was different. One capture
> showed the server returned 1724 bytes, 10 entries, last cookie of 12,
> followed by the next readdirplus returning a length of 948 bytes, 5
> entries, a first cookie value of 13. In the other capture, the responses
> returned 2204 bytes, 13 entries, a last cookie of 17 and 468 bytes, 2
> entries, a first cookie of 19.
>
> In the past, we've found that ls returned duplicate entries on this
> directory (but we didn't have a capture at the time). Those duplicate
> entries are the ones returned as the last 3 entries in the first
> response of the second capture and as the first 3 entries in the second
> response of the first capture.
>
> So what I think has happened in this particular case is that at some
> point in the past, the directory was read OK with packets similar to the
> first capture. Next, the client decided to get rid of the first page of
> cached readdir responses from memory for some reason (running low on
> memory?) but kept the second page. Subsequently, the readdir cache needs
> repopulating, so the client sends a readdirplus specifying a cookie of 0
> and this time it gets a response similar to the first packet of the
> second capture, and thus we now have duplicate names and cookie values
> in the cache.
>
> So is this possible? Is there some easy way to provoke it? Does this
> mean the client's readdir cache is broken?

If so, then invalidate_inode_pages2_range() would have to be broken: we
always clear the readdir cache immediately after reading in the page
with index 0 (i.e. the first readdir page).

It shouldn't be possible for another thread to race with that cache
invalidation either since the entire readdir() call is protected by the
parent directory's inode->i_mutex.
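
For reference, the i_mutex protection comes from the VFS: fs/readdir.c
at the time looked roughly like this (a simplified paraphrase from
memory, not a verbatim quote):

        int vfs_readdir(struct file *file, filldir_t filler, void *buf)
        {
                struct inode *inode = file->f_path.dentry->d_inode;
                ...
                mutex_lock(&inode->i_mutex);
                res = file->f_op->readdir(file, buf, filler);  /* nfs_readdir() for NFS */
                mutex_unlock(&inode->i_mutex);
                ...
        }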

Cheers
Trond

2008-07-03 12:32:18

by Andy Chittenden

Subject: RE: nfs client readdir caching issue?

> If so, then invalidate_inode_pages2_range() would have to be broken: we
> always clear the readdir cache immediately after reading in the page
> with index 0 (i.e. the first readdir page).

I'm confused by the call to invalidate_inode_pages2_range:

        if (page->index == 0)
                invalidate_inode_pages2_range(inode->i_mapping,
                                PAGE_CACHE_SIZE, -1);

That's passing in a pgoff_t of 4096 as the start page offset from which
to invalidate, and an enormous number for the end page to invalidate to.
So it looks like the NFS client is trying to invalidate from a *byte*
offset of 4096 (but if that's true, the first page could contain less
than 4096 bytes, depending on the size of the readdir response it
received; I'll leave that to one side for the moment).
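
To spell out the unit mismatch (my arithmetic, assuming 4 KB pages so
that PAGE_CACHE_SIZE == 4096): since the start argument is a page index
rather than a byte offset, passing PAGE_CACHE_SIZE asks for invalidation
from page 4096 onwards, i.e. from byte offset 4096 * 4096 = 16 MB into
the mapping, so cached readdir pages 1..4095 are never invalidated at
all.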

What's confusing me is that when I look at the implementation of
invalidate_inode_pages2_range, I see this call to pagevec_lookup:

        pagevec_lookup(&pvec, mapping, next,
                min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {

Looking at the pagevec_lookup comments, they claim the fourth parameter
is the maximum number of pages:

* @nr_pages: The maximum number of pages
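
For reference, the declaration (as I recall it from that era's
include/linux/pagevec.h; treat this as a paraphrase rather than an exact
quote):

        unsigned pagevec_lookup(struct pagevec *pvec,
                        struct address_space *mapping,
                        pgoff_t start, unsigned nr_pages);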

So how can (end - next) be a number of pages? (next will be 4096 in the
call from the NFS client.) That is, it looks like
invalidate_inode_pages2_range expects a page index range (as the name
suggests), so I'm wondering whether the call to
invalidate_inode_pages2_range should be:

        if (page->index == 0)
                invalidate_inode_pages2_range(inode->i_mapping, 1, -1);

FWIW, we've also seen this on larger directories, so I'm wondering what
would happen if a readdir part way down the cookie chain returned more
(or less) data than it did last time. So, if the above is correct, then
replace the two lines with:

        invalidate_inode_pages2_range(inode->i_mapping, page->index + 1,
                        -1);

i.e. purge the rest of the pages for the inode.

--
Andy, BlueArc Engineering

2008-07-07 17:20:24

by Trond Myklebust

Subject: RE: nfs client readdir caching issue?

On Thu, 2008-07-03 at 09:47 +0100, Andy Chittenden wrote:
> > If so, then invalidate_inode_pages2_range() would have to be broken: we
> > always clear the readdir cache immediately after reading in the page
> > with index 0 (i.e. the first readdir page).
>
> I'm confused by the call to invalidate_inode_pages2_range:
>
>         if (page->index == 0)
>                 invalidate_inode_pages2_range(inode->i_mapping,
>                                 PAGE_CACHE_SIZE, -1);
>
> That's passing in a pgoff_t of 4096 as the start page offset from which
> to invalidate, and an enormous number for the end page to invalidate to.
> So it looks like the NFS client is trying to invalidate from a *byte*
> offset of 4096 (but if that's true, the first page could contain less
> than 4096 bytes, depending on the size of the readdir response it
> received; I'll leave that to one side for the moment).
>
> What's confusing me is that when I look at the implementation of
> invalidate_inode_pages2_range, I see this call to pagevec_lookup:
>
>         pagevec_lookup(&pvec, mapping, next,
>                 min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
>
> Looking at the pagevec_lookup comments, they claim the fourth parameter
> is the maximum number of pages:
>
> * @nr_pages: The maximum number of pages
>
> So how can (end - next) be a number of pages? (next will be 4096 in the
> call from the NFS client.) That is, it looks like
> invalidate_inode_pages2_range expects a page index range (as the name
> suggests), so I'm wondering whether the call to
> invalidate_inode_pages2_range should be:
>
>         if (page->index == 0)
>                 invalidate_inode_pages2_range(inode->i_mapping, 1, -1);
>
> FWIW, we've also seen this on larger directories, so I'm wondering what
> would happen if a readdir part way down the cookie chain returned more
> (or less) data than it did last time. So, if the above is correct, then
> replace the two lines with:
>
>         invalidate_inode_pages2_range(inode->i_mapping, page->index + 1,
>                         -1);
>
> i.e. purge the rest of the pages for the inode.

Hi Andy,

Apologies for the late response. I was travelling on Thursday, and got
caught up with the long weekend due to the July 4th holiday in the US.


Yes, I agree with your comment above: the
invalidate_inode_pages2_range() call is specifying a byte range instead
of a page index range, and is therefore wrong.

Another thought is that individual pages might get evicted by VM
pressure, in which case we might want to re-read not only the evicted
page but all subsequent pages too (in case the server returns more or
less data per page, so that the alignment of the next entry changes).

Cheers
Trond
-------------------------------------------------------------------------
From: Trond Myklebust <[email protected]>
NFS: Fix readdir cache invalidation

invalidate_inode_pages2_range() takes page offset arguments, not byte
ranges.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/dir.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 58d43da..982a206 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -204,7 +204,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page *page)
* Note: assumes we have exclusive access to this mapping either
* through inode->i_mutex or some other mechanism.
*/
- if (page->index == 0 && invalidate_inode_pages2_range(inode->i_mapping, PAGE_CACHE_SIZE, -1) < 0) {
+ if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
/* Should never happen */
nfs_zap_mapping(inode, inode->i_mapping);
}