When comparing filehandles in the helper nfs_same_file(), we should not be
using 'strncmp()': filehandles are not null terminated strings.
Instead, we should just use the existing helper nfs_compare_fh().
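For reference, the existing helper checks the size field first and then compares the raw bytes, so it never runs past the end of a filehandle that is not NUL-terminated. Roughly (a sketch of the helper in include/linux/nfs_fs.h):

static inline int nfs_compare_fh(const struct nfs_fh *a, const struct nfs_fh *b)
{
	/* size check first, then an exact byte comparison of the opaque data */
	return a->size != b->size || memcmp(a->data, b->data, a->size) != 0;
}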
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 6 +-----
1 files changed, 1 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8ea4a41..f0a384e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -395,13 +395,9 @@ int xdr_decode(nfs_readdir_descriptor_t *desc, struct nfs_entry *entry, struct x
static
int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
{
- struct nfs_inode *node;
if (dentry->d_inode == NULL)
goto different;
- node = NFS_I(dentry->d_inode);
- if (node->fh.size != entry->fh->size)
- goto different;
- if (strncmp(node->fh.data, entry->fh->data, node->fh.size) != 0)
+ if (nfs_compare_fh(entry->fh, NFS_FH(dentry->d_inode)) != 0)
goto different;
return 1;
different:
--
1.7.3.2
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/inode.c | 1 +
include/linux/nfs_fs.h | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 314f571..0018e07 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
+ inode->i_data.a_ops = &nfs_dir_addr_space_ops;
if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS))
set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
/* Deal with crossing mountpoints */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index c66fdb7..b5d3ab0 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations;
#endif /* CONFIG_NFS_V3 */
extern const struct file_operations nfs_file_operations;
extern const struct address_space_operations nfs_file_aops;
+extern const struct address_space_operations nfs_dir_addr_space_ops;
static inline struct nfs_open_context *nfs_file_open_context(struct file *filp)
{
--
1.7.3.2
OK. The first patch in this series fixes the regression reported by
Nick Bowler when I apply it to my setup.
The 2 remaining patches are needed in order to ensure that the
VM doesn't free the readdir page cache page while we're trying to
read from it, and to ensure that we don't leak memory...
Linus, please don't apply these patches quite yet. I'd like to continue
tests for a couple more days before I send you the pull request.
Cheers
Trond
Trond Myklebust (3):
NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
NFS: lock the readdir page while it is in use
NFS: Fix a memory leak in nfs_readdir
fs/nfs/dir.c | 25 ++++++++++++++++++++++---
fs/nfs/inode.c | 1 +
include/linux/nfs_fs.h | 1 +
3 files changed, 24 insertions(+), 3 deletions(-)
--
1.7.3.2
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f0a384e..4a78d2b 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -178,6 +178,7 @@ typedef struct {
struct page *page;
unsigned long page_index;
u64 *dir_cookie;
+ u64 last_cookie;
loff_t current_index;
decode_dirent_t decode;
@@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc)
else
status = nfs_readdir_search_for_cookie(array, desc);
+ if (status == -EAGAIN)
+ desc->last_cookie = array->last_cookie;
nfs_readdir_release_array(desc->page);
out:
return status;
@@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
unsigned int array_size = ARRAY_SIZE(pages);
entry.prev_cookie = 0;
- entry.cookie = *desc->dir_cookie;
+ entry.cookie = desc->last_cookie;
entry.eof = 0;
entry.fh = nfs_alloc_fhandle();
entry.fattr = nfs_alloc_fattr();
@@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
{
int res;
- if (desc->page_index == 0)
+ if (desc->page_index == 0) {
desc->current_index = 0;
+ desc->last_cookie = 0;
+ }
while (1) {
res = find_cache_page(desc);
if (res != -EAGAIN)
--
1.7.3.2
On Tue, Nov 30, 2010 at 9:42 AM, Trond Myklebust
<[email protected]> wrote:
>
> When comparing filehandles in the helper nfs_same_file(), we should not be
> using 'strncmp()': filehandles are not null terminated strings.
Hmm. Applied, but sadly Nick Bowler <[email protected]> reports
that it doesn't actually fix his problem, so there is still a serious
regression there.
Linus
On Tue, 2010-11-30 at 20:10 -0800, Linus Torvalds wrote:
> On Tue, Nov 30, 2010 at 7:47 PM, Trond Myklebust
> <[email protected]> wrote:
> > Otherwise, the VM may end up removing it while we're reading from it.
>
> I don't think this is valid.
>
> Maybe it fixes a bug, but the commit description is misleading at
> best. Since you have a reference count to the page, the page is not
> going away. Locking may hide some other bug (due to serializing with
> other code you care about), but it is _not_ about the "VM may end up
> removing it".
>
> Even from a serialization angle, I think this patch is a bit suspect,
> since readdir() will always be called under the inode semaphore, so I
> think you'll always be serialized wrt other readdir users. Of course,
> you may have invalidation events etc that are outside of readdir, so
> ...
I'm not worried about other readdir calls invalidating the page. My
concern is rather about the VM memory reclaimers ejecting the page from
the page cache, and calling nfs_readdir_clear_array while we're
referencing the page. This wasn't a problem with the previous readdir
code, but it will be with the new incarnation because the actual
filenames are stored outside the page itself.
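(For context, a rough sketch of the readdir cache layout in question; the real definitions live in fs/nfs/dir.c and the exact fields may differ slightly. The point is that the page cache page only holds the array and cookies, while each entry's name points at a separately kmalloc()'d string, which is exactly what nfs_readdir_clear_array() frees.)

struct nfs_cache_array_entry {
	u64 cookie;
	u64 ino;
	struct qstr string;	/* string.name is a kmalloc()'d copy of the
				 * filename, i.e. storage outside the page */
};

struct nfs_cache_array {
	unsigned int size;
	int eof_index;
	u64 last_cookie;
	struct nfs_cache_array_entry array[0];	/* fills the rest of the page */
};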
As far as I can see, the only way to protect against that is to lock the
page, perform the usual tests and then release the page lock when we're
done...
> Anyway if this patch matters, there's something else going on, and you
> need to describe that.
No problem. I just wanted to get the patches out so that the people who
are reporting regressions can start testing.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Tue, Nov 30, 2010 at 7:47 PM, Trond Myklebust
<[email protected]> wrote:
> Otherwise, the VM may end up removing it while we're reading from it.
I don't think this is valid.
Maybe it fixes a bug, but the commit description is misleading at
best. Since you have a reference count to the page, the page is not
going away. Locking may hide some other bug (due to serializing with
other code you care about), but it is _not_ about the "VM may end up
removing it".
Even from a serialization angle, I think this patch is a bit suspect,
since readdir() will always be called under the inode semaphore, so I
think you'll always be serialized wrt other readdir users. Of course,
you may have invalidation events etc that are outside of readdir, so
...
Anyway if this patch matters, there's something else going on, and you
need to describe that.
Linus
Otherwise, the VM may end up removing it while we're reading from it.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 16 +++++++++++++++-
1 files changed, 15 insertions(+), 1 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 4a78d2b..6edef12 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -639,6 +639,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
+ unlock_page(desc->page);
page_cache_release(desc->page);
desc->page = NULL;
}
@@ -646,8 +647,20 @@ void cache_page_release(nfs_readdir_descriptor_t *desc)
static
struct page *get_cache_page(nfs_readdir_descriptor_t *desc)
{
- return read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping,
+ struct page *page;
+
+ for (;;) {
+ page = read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping,
desc->page_index, (filler_t *)nfs_readdir_filler, desc);
+ if (IS_ERR(page))
+ break;
+ lock_page(page);
+ if (page->mapping != NULL && PageUptodate(page))
+ break;
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ return page;
}
/*
@@ -770,6 +783,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent,
desc->page_index = 0;
desc->page = page;
+ lock_page(page);
status = nfs_readdir_xdr_to_array(desc, page, inode);
if (status < 0)
--
1.7.3.2
On Tue, 2010-11-30 at 14:10 -0800, Linus Torvalds wrote:
> On Tue, Nov 30, 2010 at 9:42 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > When comparing filehandles in the helper nfs_same_file(), we should not be
> > using 'strncmp()': filehandles are not null terminated strings.
>
> Hmm. Applied, but sadly Nick Bowler <[email protected]> reports
> that it doesn't actually fix his problem, so there is still a serious
> regression there.
>
> Linus
OK. I'll follow up that problem with him.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, 1 Dec 2010, Trond Myklebust wrote:
> On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote:
> > On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > We need to ensure that the entries in the nfs_cache_array get cleared
> > > when the page is removed from the page cache. To do so, we use the
> > > releasepage address_space operation (which also requires us to set
> > > the Pg_private flag).
> >
> > So I really think that the whole "releasepage" use in NFS is simply
> > overly complicated and was obviously too subtle.
> >
> > The whole need for odd return values, for the page lock, and for the
> > addition of clearing the up-to-date bit comes from the fact that this
> > wasn't really what releasepage was designed for.
> >
> > 'releasepage' was really designed for the filesystem having its own
> > version of 'try_to_free_buffers()', which is just an optimistic "ok,
> > we may be releasing this page, so try to get rid of any IO structures
> > you have cached". It wasn't really a memory management thing.
> >
> > And the thing is, it looks trivial to do the memory management
> > approach by adding a new callback that gets called after the page is
> > actually removed from the page cache. If we do that, then there are no
> > races with any other users, since we remove things from the page cache
> > atomically wrt page cache lookup. So the need for playing games with
> > page locking and 'uptodate' simply goes away. As does the PG_private
> > thing or the interaction with invalidatepage() etc.
> >
> > So this is a TOTALLY UNTESTED trivial patch that just adds another
> > callback. Does this work? I dunno. But I get the feeling that instead
> > of having NFS work around the odd semantics that don't actually match
> > what NFS wants, introducing a new callback with much simpler semantics
> > would be simpler for everybody, and avoid the need for subtle code.
> >
> > Hmm?
> >
> > Linus
>
>
> > include/linux/fs.h | 1 +
> > mm/vmscan.c | 3 +++
> > 2 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index c9e06cc..090f0ea 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -602,6 +602,7 @@ struct address_space_operations {
> > sector_t (*bmap)(struct address_space *, sector_t);
> > void (*invalidatepage) (struct page *, unsigned long);
> > int (*releasepage) (struct page *, gfp_t);
> > + void (*freepage)(struct page *);
> > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec
> > *iov,
> > loff_t offset, unsigned long nr_segs);
> > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d31d7ce..1accb01 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space
> > *mapping, struct page *page)
> > mem_cgroup_uncharge_cache_page(page);
> > }
> >
> > + if (mapping->a_ops->freepage)
> > + mapping->a_ops->freepage(page);
>
> Hmm... Looking again at the problem, it appears that the same callback
> needs to be added to truncate_complete_page() and
> invalidate_complete_page2(). Otherwise we end up in a situation where
> the page can sometimes be removed from the page cache without calling
> freepage().
>
> > +
> > return 1;
> >
> > cannot_free:
Yes, I was wondering quite how we would define this ->freepage thing,
if it gets called from one place that removes from page cache and not
from others.
Another minor problem with it: it would probably need to take the
struct address_space *mapping as arg as well as struct page *page:
because by this time page->mapping has been reset to NULL.
But I'm not at all keen on adding a callback in this very special
frozen-to-0-references place: please let's not do it without an okay
from Nick Piggin (now Cc'ed).
I agree completely with what Linus said originally about how the
page cannot be freed while there's a reference to it, and it should
be possible to work this without your additional page locks.
Your ->releasepage should be able to judge whether the page is likely
(not certain) to be freed - page_count 3? 1 for the page cache, 1 for
the page_private reference, 1 for vmscan's reference, I think. Then
it can mark !PageUptodate and proceed with freeing the stuff you had
allocated, undo page_has_private and its reference, and return 1 (or
return 0 if it decides to hold on to the page).
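(A hypothetical sketch of the ->releasepage Hugh is outlining here; this is not code from the thread, and the function name and the bare page_count() threshold are made up for illustration.)

static int nfs_readdir_releasepage(struct page *page, gfp_t gfp)
{
	/* roughly 3 refs expected: page cache + page_private + vmscan itself */
	if (page_count(page) > 3)
		return 0;			/* someone else holds a ref: keep the page */
	ClearPageUptodate(page);		/* force a re-fill on the next lookup */
	nfs_readdir_clear_array(page);		/* free the kmalloc()'d name strings */
	ClearPagePrivate(page);
	page_cache_release(page);		/* drop the page_private reference */
	return 1;
}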
If something races in and grabs another reference to prevent removal
from page cache and freeing, then won't read_cache_page(), seeing
!Uptodate, do the right thing and set up the required info again?
Or perhaps I haven't looked far enough, and you do have races which
actually need your page locks, I can see they make it easier to think
about.
But I'd prefer us not to throw in another callback if it's well
workable with the ->releasepage we already have. (If it helps,
perhaps we could adjust shrink_page_list() to allow for the page
being removed from page cache inside try_to_release_page() - but
I don't think that should be necessary.)
Hugh
On Mon, Dec 06, 2010 at 11:59:07AM -0500, Trond Myklebust wrote:
> From: Linus Torvalds <[email protected]>
>
> NFS needs to be able to release objects that are stored in the page
> cache once the page itself is no longer visible from the page cache.
>
> This patch adds a callback to the address space operations that allows
> filesystems to perform page cleanups once the page has been removed
> from the page cache.
>
> Original patch by: Linus Torvalds <[email protected]>
> [trondmy: cover the cases of invalidate_inode_pages2() and
> truncate_inode_pages()]
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> Documentation/filesystems/Locking | 7 ++++++-
> Documentation/filesystems/vfs.txt | 7 +++++++
> include/linux/fs.h | 1 +
> mm/filemap.c | 5 +++++
> mm/truncate.c | 4 ++++
> mm/vmscan.c | 7 +++++++
> 6 files changed, 30 insertions(+), 1 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index a91f308..b6426f1 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -173,12 +173,13 @@ prototypes:
> sector_t (*bmap)(struct address_space *, sector_t);
> int (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, int);
> + void (*freepage)(struct page *);
> int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> int (*launder_page) (struct page *);
>
> locking rules:
> - All except set_page_dirty may block
> + All except set_page_dirty and freepage may block
>
> BKL PageLocked(page) i_mutex
> writepage: no yes, unlocks (see below)
> @@ -193,6 +194,7 @@ perform_write: no n/a yes
> bmap: no
> invalidatepage: no yes
> releasepage: no yes
> +freepage: no yes
> direct_IO: no
> launder_page: no yes
>
> @@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to
> indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
> the kernel assumes that the fs has no private interest in the buffers.
>
> + ->freepage() is called when the kernel is done dropping the page
> +from the page cache.
> +
> ->launder_page() may be called prior to releasing a page if
> it is still found to be dirty. It returns zero if the page was successfully
> cleaned, or an error value if not. Note that in order to prevent the page
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index ed7e5ef..3b14a55 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -534,6 +534,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> int (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, int);
> + void (*freepage)(struct page *);
> ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> struct page* (*get_xip_page)(struct address_space *, sector_t,
> @@ -679,6 +680,12 @@ struct address_space_operations {
> need to ensure this. Possibly it can clear the PageUptodate
> bit if it cannot free private data yet.
>
> + freepage: freepage is called once the page is no longer visible in
> + the page cache in order to allow the cleanup of any private
> + data. Since it may be called by the memory reclaimer, it
> + should not assume that the original address_space mapping still
> + exists, and it should not block.
Of course we still have bugs in this regard, without inode RCU and
filesystem deregistration RCU, but when those things are implemented
for RCU path-walk, this section should be updated somewhat, and we'll
have to look at RCU protecting the final mapping manipulations after
a page is removed from pagecache.
But I'll help work on that after RCU inodes / filesystems is merged.
> +
> direct_IO: called by the generic read/write routines to perform
> direct_IO - that is IO requests which bypass the page cache
> and transfer data directly between the storage and the
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c9e06cc..090f0ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -602,6 +602,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, gfp_t);
> + void (*freepage)(struct page *);
> ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ea89840..6b9aee2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -143,13 +143,18 @@ void __remove_from_page_cache(struct page *page)
> void remove_from_page_cache(struct page *page)
> {
> struct address_space *mapping = page->mapping;
> + void (*freepage)(struct page *);
>
> BUG_ON(!PageLocked(page));
>
> + freepage = mapping->a_ops->freepage;
> spin_lock_irq(&mapping->tree_lock);
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> mem_cgroup_uncharge_cache_page(page);
> +
> + if (freepage)
> + freepage(page);
> }
> EXPORT_SYMBOL(remove_from_page_cache);
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index ba887bf..3c2d5dd 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -390,6 +390,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> mem_cgroup_uncharge_cache_page(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> +
> page_cache_release(page); /* pagecache ref */
> return 1;
> failed:
The generic parts of the code look OK to me, but why is there a
difference in your sequences of loading the freepage function pointer
here?
On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > We need to ensure that the entries in the nfs_cache_array get cleared
> > when the page is removed from the page cache. To do so, we use the
> > releasepage address_space operation (which also requires us to set
> > the Pg_private flag).
>
> So I really think that the whole "releasepage" use in NFS is simply
> overly complicated and was obviously too subtle.
>
> The whole need for odd return values, for the page lock, and for the
> addition of clearing the up-to-date bit comes from the fact that this
> wasn't really what releasepage was designed for.
>
> 'releasepage' was really designed for the filesystem having its own
> version of 'try_to_free_buffers()', which is just an optimistic "ok,
> we may be releasing this page, so try to get rid of any IO structures
> you have cached". It wasn't really a memory management thing.
>
> And the thing is, it looks trivial to do the memory management
> approach by adding a new callback that gets called after the page is
> actually removed from the page cache. If we do that, then there are no
> races with any other users, since we remove things from the page cache
> atomically wrt page cache lookup. So the need for playing games with
> page locking and 'uptodate' simply goes away. As does the PG_private
> thing or the interaction with invalidatepage() etc.
>
> So this is a TOTALLY UNTESTED trivial patch that just adds another
> callback. Does this work? I dunno. But I get the feeling that instead
> of having NFS work around the odd semantics that don't actually match
> what NFS wants, introducing a new callback with much simpler semantics
> would be simpler for everybody, and avoid the need for subtle code.
>
> Hmm?
>
> Linus
> include/linux/fs.h | 1 +
> mm/vmscan.c | 3 +++
> 2 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c9e06cc..090f0ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -602,6 +602,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, gfp_t);
> + void (*freepage)(struct page *);
> ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec
> *iov,
> loff_t offset, unsigned long nr_segs);
> int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..1accb01 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space
> *mapping, struct page *page)
> mem_cgroup_uncharge_cache_page(page);
> }
>
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
Hmm... Looking again at the problem, it appears that the same callback
needs to be added to truncate_complete_page() and
invalidate_complete_page2(). Otherwise we end up in a situation where
the page can sometimes be removed from the page cache without calling
freepage().
> +
> return 1;
>
> cannot_free:
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, 1 Dec 2010, Trond Myklebust wrote:
>
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
> clear_page_mlock(page);
> remove_from_page_cache(page);
> ClearPageMappedToDisk(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> +
> page_cache_release(page); /* pagecache ref */
> return 0;
> }
I think Linus recommended that one be done in remove_from_page_cache()
to catch all instances: did that get overruled later for some reason?
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -454,6 +454,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> BUG_ON(!PageLocked(page));
> BUG_ON(mapping != page_mapping(page));
>
> + preempt_disable();
> spin_lock_irq(&mapping->tree_lock);
> /*
> * The non racy check for a busy page.
> @@ -492,10 +493,19 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> swp_entry_t swap = { .val = page_private(page) };
> __delete_from_swap_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> + preempt_enable();
> swapcache_free(swap, page);
> } else {
> + void (*freepage)(struct page *);
> +
> + freepage = mapping->a_ops->freepage;
> +
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> + if (freepage != NULL)
> + freepage(page);
> + preempt_enable();
> +
> mem_cgroup_uncharge_cache_page(page);
> }
>
> @@ -503,6 +513,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
>
> cannot_free:
> spin_unlock_irq(&mapping->tree_lock);
> + preempt_enable();
> return 0;
> }
I took his "stop_machine()" explanation ("an idle period for everything.
And just a preemption reschedule isn't enough for that") to imply that
there's no need for your preempt_disable/preempt_enable there: they
don't add anything to the module unload case, and they don't help the
spin_unlock_irq issue (and you're already being rightly careful to note
freepage in advance).
But maybe I misunderstood.
Hugh
On Wed, 1 Dec 2010 15:31:11 -0800
Linus Torvalds <[email protected]> wrote:
> On Wed, Dec 1, 2010 at 2:38 PM, Andrew Morton <[email protected]> wrote:
> >
> > OK, the stop_machine() plugs a lot of potential race-vs-module-unload
> > things.  But Trond is referring to races against vmscan inode reclaim,
> > unmount, etc.
>
> So?
>
> A filesystem module cannot be unloaded while it's still mounted.
>
> And unmount doesn't succeed until all inodes are gone.
>
> And getting rid of an inode doesn't succeed until all pages associated
> with it are gone.
>
> And getting rid of the pages involves locking them (whether in
> truncate or vmscan) and removing them from all lists.
>
> Ergo: vmscan has a locked page leads to the filesystem being
> guaranteed to not be unmounted. And that, in turn, guarantees that
> the module won't be unloaded until the machine has gone through an
> idle cycle.
The page isn't attached to the address_space any more:
static int __remove_mapping(struct address_space *mapping, struct page *page)
{
...
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
...
}
On Wed, 1 Dec 2010 17:42:08 -0800 Linus Torvalds <[email protected]> wrote:
> On Wed, Dec 1, 2010 at 5:22 PM, Andrew Morton <[email protected]> wrote:
> >
> > What we're talking about is races against memory reclaim, unmount, etc.
>
> Ahh. Those I can believe in. Although I think they'd almost
> incidentally be fixed by making inode freeing (which is where the
> 'struct address_space' is embedded) RCU-safe, which we're going to do
> anyway in 38. Then we could make the vmscan code just be a rcu-read
> section.
I didn't know that aspect of it. It will be nice to plug this race -
it's been there for so long because nobody was able to think of an
acceptable way of fixing it by direct means (synchronous locking,
refcounting, etc). Taking a ref on the inode doesn't work, because we
can't run iput_final() in direct-reclaim contexts (lock ordering snafus).
vmscan is the problematic path - I _think_ all other code paths which
remove pagecache have an inode ref. But this assumes that
inode->i_mapping points at inode->i_data! Need to think about the
situation where it points at a different inode's i_data - in that case
these callers may have a ref on the wrong inode.
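(An abridged excerpt of the two fields in struct inode behind that caveat: normally i_mapping points at the inode's own i_data, but for e.g. an opened block device it points at the backing bdev inode's i_data, so holding a reference on the file's inode does not necessarily pin the inode that owns the mapping.)

struct inode {
	/* ... */
	struct address_space	*i_mapping;	/* the mapping actually used for I/O */
	struct address_space	i_data;		/* this inode's own page cache */
	/* ... */
};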
> Of course, I do think the race is basically impossible to hit in
> practice regardless.
Actually I was able to hit the race back in late 2.5 or thereabouts.
Really massive memory pressure caused vmscan->icache_shrinker to free
the inode/address_space while another CPU in vmscan was playing with the
address_space. That was quite a debugging session ;)
> OK. The first patch in this series fixes the regression reported by
> Nick Bowler when I apply it to my setup.
> The 2 remaining patches are needed in order to ensure that the
> VM doesn't free the readdir page cache page while we're trying to
> read from it, and to ensure that we don't leak memory...
> Linus, please don't apply these patches quite yet. I'd like to continue
> tests for a couple more days before I send you the pull request.
v2 fixes the following issues:
- Fix up the cookie usage in uncached_readdir()
- Changelog fixups for the second patch
- .releasepage should use kmap_atomic() so that it can be called
from all direct reclaim contexts.
- Ensure that .releasepage also clears PG_uptodate
- Set/clear the PG_private flag to ensure .releasepage gets called when
appropriate.
- Add a .invalidatepage to ensure truncate_inode_pages() works correctly
- Ensure that the anonymous page that is generated by uncached_readdir()
doesn't leak memory.
Trond Myklebust (3):
NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
NFS: lock the readdir page while it is in use
NFS: Fix a memory leak in nfs_readdir
fs/nfs/dir.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
fs/nfs/inode.c | 1 +
include/linux/nfs_fs.h | 1 +
3 files changed, 42 insertions(+), 8 deletions(-)
--
1.7.3.2
On Wed, 1 Dec 2010 14:43:14 -0800 Andrew Morton <[email protected]>
wrote:
> On Thu, 2 Dec 2010 08:29:13 +1100
> Neil Brown <[email protected]> wrote:
>
> > > >
> > > > However, while reading Documentation/filesystems/vfs.txt (in order to
> > > > add documentation for freepage) I was surprised to read that the
> > > > ->releasepage() is itself supposed to be allowed to actually remove the
> > > > page from the address space if it so desires.
> > >
> > > That doesn't sound right. It came from Neil in 2006.
> > >
> > > Neil, what were you thinking there? Did you find such a ->releasepage()?
> >
> > Nope, no idea, sorry.
> >
> > No releasepage functions do anything like that, and no call sites suggest it
> > could be a possibility. Quite the reverse - they are likely to remove the
> > page from the mapping without checking that it is still in the mapping.
> >
> > So that sentence should be deleted.
>
> This?
Perfect, thanks.
NeilBrown
>
> From: Andrew Morton <[email protected]>
>
> ->releasepage() does not remove the page from the mapping.
>
> Cc: Neil Brown <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> Documentation/filesystems/vfs.txt | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff -puN Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description Documentation/filesystems/vfs.txt
> --- a/Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description
> +++ a/Documentation/filesystems/vfs.txt
> @@ -660,11 +660,10 @@ struct address_space_operations {
> releasepage: releasepage is called on PagePrivate pages to indicate
> that the page should be freed if possible. ->releasepage
> should remove any private data from the page and clear the
> - PagePrivate flag. It may also remove the page from the
> - address_space. If this fails for some reason, it may indicate
> - failure with a 0 return value.
> - This is used in two distinct though related cases. The first
> - is when the VM finds a clean page with no active users and
> + PagePrivate flag. If releasepage() fails for some reason, it must
> + indicate failure with a 0 return value.
> + releasepage() is used in two distinct though related cases. The
> + first is when the VM finds a clean page with no active users and
> wants to make it a free page. If ->releasepage succeeds, the
> page will be removed from the address_space and become free.
>
> _
On 11/30/2010 11:29 PM, Trond Myklebust wrote:
> I'm not worried about other readdir calls invalidating the page. My
> concern is rather about the VM memory reclaimers ejecting the page from
> the page cache, and calling nfs_readdir_clear_array while we're
> referencing the page.
Wait, if nfs_readdir_clear_array can clear the page while
somebody else has a refcount to it, what good is the
refcount?
Why is your .releasepage modifying the data on the page,
instead of just the metadata in the struct page?
--
All rights reversed
On Wed, 01 Dec 2010 15:05:38 -0500
Trond Myklebust <[email protected]> wrote:
> On Wed, 2010-12-01 at 11:23 -0800, Hugh Dickins wrote:
> > On Wed, 1 Dec 2010, Trond Myklebust wrote:
> > > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote:
> > > > include/linux/fs.h | 1 +
> > > > mm/vmscan.c | 3 +++
> > > > 2 files changed, 4 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index c9e06cc..090f0ea 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > > sector_t (*bmap)(struct address_space *, sector_t);
> > > > void (*invalidatepage) (struct page *, unsigned long);
> > > > int (*releasepage) (struct page *, gfp_t);
> > > > + void (*freepage)(struct page *);
> > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec
> > > > *iov,
> > > > loff_t offset, unsigned long nr_segs);
> > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index d31d7ce..1accb01 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space
> > > > *mapping, struct page *page)
> > > > mem_cgroup_uncharge_cache_page(page);
> > > > }
> > > >
> > > > + if (mapping->a_ops->freepage)
> > > > + mapping->a_ops->freepage(page);
> > >
> > > Hmm... Looking again at the problem, it appears that the same callback
> > > needs to be added to truncate_complete_page() and
> > > invalidate_complete_page2(). Otherwise we end up in a situation where
> > > the page can sometimes be removed from the page cache without calling
> > > freepage().
> > >
> > > > +
> > > > return 1;
> > > >
> > > > cannot_free:
> >
> > Yes, I was wondering quite how we would define this ->freepage thing,
> > if it gets called from one place that removes from page cache and not
> > from others.
> >
> > Another minor problem with it: it would probably need to take the
> > struct address_space *mapping as arg as well as struct page *page:
> > because by this time page->mapping has been reset to NULL.
> >
> > But I'm not at all keen on adding a calllback in this very special
> > frozen-to-0-references place: please let's not do it without an okay
> > from Nick Piggin (now Cc'ed).
> >
> > I agree completely with what Linus said originally about how the
> > page cannot be freed while there's a reference to it, and it should
> > be possible to work this without your additional page locks.
> >
> > Your ->releasepage should be able to judge whether the page is likely
> > (not certain) to be freed - page_count 3? 1 for the page cache, 1 for
> > the page_private reference, 1 for vmscan's reference, I think. Then
> > it can mark !PageUptodate and proceed with freeing the stuff you had
> > allocated, undo page_has_private and its reference, and return 1 (or
> > return 0 if it decides to hold on to the page).
>
> That is very brittle. I'd prefer not to have to scan linux-mm every week
> in order to find out if someone changed the page_count.
>
> However, while reading Documentation/filesystems/vfs.txt (in order to
> add documentation for freepage) I was surprised to read that the
> ->releasepage() is itself supposed to be allowed to actually remove the
> page from the address space if it so desires.
That doesn't sound right. It came from Neil in 2006.
Neil, what were you thinking there? Did you find such a ->releasepage()?
> Looking at the actual code in shrink_page_list() and friends I can't see
> how that can possibly fail to break things, but if it were true, then
> that might enable us to call remove_mapping() in order to safely free
> the page before it gets cleared.
>
On Wed, 2010-12-01 at 11:47 -0800, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 10:54 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > Hmm... Looking again at the problem, it appears that the same callback
> > needs to be added to truncate_complete_page() and
> > invalidate_complete_page2(). Otherwise we end up in a situation where
> > the page can sometimes be removed from the page cache without calling
> > freepage().
>
> Yes, I think any caller of __remove_from_page_cache() should do it
> once it has dropped all locks.
>
> And to be consistent with that rule, even in the __remove_mapping()
> function I suspect the code to call ->freepage() might as well be done
> only in the __remove_from_page_cache() case (ie not in the
> PageSwapCache() case).
>
> Then, add the case to the end of "remove_page_cache()" itself, and now
> it's really easy to just grep for __remove_from_page_cache() and make
> sure they all do it.
>
> That sounds sane, no?
>
> Linus
Something like the following then?
-----------------------------------------------------------------------------------------
On Thu, 2 Dec 2010 08:29:13 +1100
Neil Brown <[email protected]> wrote:
> > >
> > > However, while reading Documentation/filesystems/vfs.txt (in order to
> > > add documentation for freepage) I was surprised to read that the
> > > ->releasepage() is itself supposed to be allowed to actually remove the
> > > page from the address space if it so desires.
> >
> > That doesn't sound right. It came from Neil in 2006.
> >
> > Neil, what were you thinking there? Did you find such a ->releasepage()?
>
> Nope, no idea, sorry.
>
> No releasepage functions do anything like that, and no call sites suggest it
> could be a possibility. Quite the reverse - they are likely to remove the
> page from the mapping without checking that it is still in the mapping.
>
> So that sentence should be deleted.
This?
From: Andrew Morton <[email protected]>
->releasepage() does not remove the page from the mapping.
Cc: Neil Brown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
Documentation/filesystems/vfs.txt | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff -puN Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description Documentation/filesystems/vfs.txt
--- a/Documentation/filesystems/vfs.txt~documentation-filesystems-vfstxt-fix-repeasepage-description
+++ a/Documentation/filesystems/vfs.txt
@@ -660,11 +660,10 @@ struct address_space_operations {
releasepage: releasepage is called on PagePrivate pages to indicate
that the page should be freed if possible. ->releasepage
should remove any private data from the page and clear the
- PagePrivate flag. It may also remove the page from the
- address_space. If this fails for some reason, it may indicate
- failure with a 0 return value.
- This is used in two distinct though related cases. The first
- is when the VM finds a clean page with no active users and
+ PagePrivate flag. If releasepage() fails for some reason, it must
+ indicate failure with a 0 return value.
+ releasepage() is used in two distinct though related cases. The
+ first is when the VM finds a clean page with no active users and
wants to make it a free page. If ->releasepage succeeds, the
page will be removed from the address_space and become free.
_
On Wed, 1 Dec 2010, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <[email protected]> wrote:
> > On Wed, 01 Dec 2010 16:51:12 -0500
> > Trond Myklebust <[email protected]> wrote:
> >
> >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote:
> >> > Probably on most call paths we'll be OK - if a process is in the middle
> >> > of a file truncate, holdin a file* ref which holds an inode ref then
> >> > nobody will be unmounting that fs and hence nobody will be unloading
> >> > that module.
> >> >
> >> > However on the random_code->alloc_page->vmscan->releasepage path, none
> >> > of that applies.
> >>
> >> Just out of interest, what ensures that the mapping is still around for
> >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()?
> >
> > Nothing, afacit.
>
> No, we're good.
>
> Module unload has to go through a "stop_machine()" cycle, and that in
> turn requires an idle period for everything. And just a preemption
> reschedule isn't enough for that.
>
> So what is sufficient is that
>
> - we had the page locked and on the mapping
>
> This implies that we had an inode reference to the module, and the
> page lock means that the inode reference cannot go away (because it
> will involve invalidate-pages etc)
I'm not so sure of that: doesn't it test inode->i_data.nrpages in
various places, and skip ahead if that is already 0? I don't see
the necessary serialization when nrpages comes down to 0.
>
> - we're not sleeping after __remove_mapping, so unload can't happen afterwards.
>
> A _lot_ of the module races depend on that latter thing. We have
> almost no cases that are strictly about actual reference counts etc.
Okay, I'm reassured on my module unload point; but not on the
question of safety of spin_unlock_irq(&mapping->tree_lock)
which Trond lobbed back in return.
Hugh
On 2010-12-01 18:35, Rik van Riel wrote:
> On 12/01/2010 11:17 AM, Linus Torvalds wrote:
>
>> So this is a TOTALLY UNTESTED trivial patch that just adds another
>> callback. Does this work? I dunno. But I get the feeling that instead
>> of having NFS work around the odd semantics that don't actually match
>> what NFS wants, introducing a new callback with much simpler semantics
>> would be simpler for everybody, and avoid the need for subtle code.
>
> Surely somebody can have just looked up the page and
> gotten a reference count, right before your ->freepage
> call is invoked?
>
> CPU A CPU B
>
> look up page
> grab refcount
> ->freepage
>
> use contents of page
>
> Am I overlooking something obvious?
>
The page is not cached any more at this point, therefore
looking it up won't find it.
Benny
On Wed, 1 Dec 2010, Hugh Dickins wrote:
> On Wed, 1 Dec 2010, Andrew Morton wrote:
> > On Wed, 01 Dec 2010 15:10:50 -0500
> > Trond Myklebust <[email protected]> wrote:
> >
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > sector_t (*bmap)(struct address_space *, sector_t);
> > > void (*invalidatepage) (struct page *, unsigned long);
> > > int (*releasepage) (struct page *, gfp_t);
> > > + void (*freepage)(struct page *);
> > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> > > loff_t offset, unsigned long nr_segs);
> > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> >
> > It would be good to think about and then clearly spell out exactly what
> > state the page is in here. It is locked, and I assume clean and not
> > under writeback. What about its refcount, freezedness status and
> > eligibility for lookups?
> >
> > And as Hugh pointed out, some callees might needs the address_space*
> > although we can perhaps defer that until such a callee turns up.
> > If/when that happens we might have a problem though: if this locked
> > page is no longer attached to the address_space then what now pins the
> > address_space, protecting it from inode reclaim?
>
> That's an excellent point and trumps mine: it would be actively wrong
> to provide the struct address_space *mapping arg I was asking for.
> (Bet someone then tries stashing it away via page->private though.)
Hmm, thinking further along the same lines: can we even guarantee that
the filesystem module is still loaded at that point? i.e. might
mapping->freepage now be pointing off into the garbage heap?
Hugh
On Wed, 01 Dec 2010 18:21:16 -0500
Trond Myklebust <[email protected]> wrote:
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
> clear_page_mlock(page);
> remove_from_page_cache(page);
> ClearPageMappedToDisk(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> +
> page_cache_release(page); /* pagecache ref */
> return 0;
> }
So here we're assuming that `mapping' was pinned by other means.
Fair enough, although subtle. Even drop_pagecache_sb() got it right ;)
> @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> mem_cgroup_uncharge_cache_page(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> +
> page_cache_release(page); /* pagecache ref */
> return 1;
> failed:
And here.
On Wed, Dec 1, 2010 at 10:54 AM, Trond Myklebust
<[email protected]> wrote:
>
> Hmm... Looking again at the problem, it appears that the same callback
> needs to be added to truncate_complete_page() and
> invalidate_complete_page2(). Otherwise we end up in a situation where
> the page can sometimes be removed from the page cache without calling
> freepage().
Yes, I think any caller of __remove_from_page_cache() should do it
once it has dropped all locks.
And to be consistent with that rule, even in the __remove_mapping()
function I suspect the code to call ->freepage() might as well be done
only in the __remove_from_page_cache() case (ie not in the
PageSwapCache() case).
Then, add the case to the end of "remove_page_cache()" itself, and now
it's really easy to just grep for __remove_from_page_cache() and make
sure they all do it.
That sounds sane, no?
Linus
On Wed, 2010-12-01 at 15:31 -0800, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 2:38 PM, Andrew Morton <[email protected]> wrote:
> >
> > OK, the stop_machine() plugs a lot of potential race-vs-module-unload
> > things. But Trond is referring to races against vmscan inode reclaim,
> > unmount, etc.
>
> So?
>
> A filesystem module cannot be unloaded while it's still mounted.
>
> And unmount doesn't succeed until all inodes are gone.
>
> And getting rid of an inode doesn't succeed until all pages associated
> with it are gone.
>
> And getting rid of the pages involves locking them (whether in
> truncate or vmscan) and removing them from all lists.
>
> Ergo: vmscan has a locked page leads to the filesystem being
> guaranteed to not be unmounted. And that, in turn, guarantees that
> the module won't be unloaded until the machine has gone through an
> idle cycle.
>
> It really is that simple. There's nothing subtle there. The reason
> spin_unlock(&mapping->tree_lock) is safe is exactly the above trivial
> chain of dependencies. And it's also exactly why
> mapping->a_ops->freepage() would also be safe.
>
> This is pretty much how all the module races are handled. Doing module
> ref-counts per page (or per packet in flight for things like
> networking) would be prohibitively expensive. There's no way we can
> ever do that.
Although the page is locked, it may no longer be visible to the lockless
page lookup once the radix_tree_delete() completes in
__remove_from_page_cache.
Furthermore, if the same routine causes mapping->nrpages to go to zero
before iput_final() hits truncate_inode_pages(), then the latter exits
immediately.
Both these cases would appear to allow iput_final() to release the inode
before vmscan gets round to unlocking the mapping->tree_lock since
truncate_inode_pages() no longer thinks it has any work to do.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, 1 Dec 2010 14:24:42 -0800
Linus Torvalds <[email protected]> wrote:
> On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <[email protected]> wrote:
> > On Wed, 01 Dec 2010 16:51:12 -0500
> > Trond Myklebust <[email protected]> wrote:
> >
> >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote:
> >> > Probably on most call paths we'll be OK - if a process is in the middle
> >> > of a file truncate, holdin a file* ref which holds an inode ref then
> >> > nobody will be unmounting that fs and hence nobody will be unloading
> >> > that module.
> >> >
> >> > However on the random_code->alloc_page->vmscan->releasepage path, none
> >> > of that applies.
> >>
> >> Just out of interest, what ensures that the mapping is still around for
> >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()?
> >
> > Nothing, afacit.
>
> No, we're good.
>
> Module unload has to go through a "stop_machine()" cycle, and that in
> turn requires an idle period for everything. And just a preemption
> reschedule isn't enough for that.
>
> So what is sufficient is that
>
> - we had the page locked and on the mapping
>
> This implies that we had an inode reference to the module, and the
> page lock means that the inode reference cannot go away (because it
> will involve invalidate-pages etc)
>
> - we're not sleeping after __remove_mapping, so unload can't happen afterwards.
>
> A _lot_ of the module races depend on that latter thing. We have
> almost no cases that are strictly about actual reference counts etc.
>
OK, the stop_machine() plugs a lot of potential race-vs-module-unload
things. But Trond is referring to races against vmscan inode reclaim,
unmount, etc.
On Wed, 2010-12-01 at 14:38 -0800, Andrew Morton wrote:
> On Wed, 1 Dec 2010 14:24:42 -0800
> Linus Torvalds <[email protected]> wrote:
>
> > On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <[email protected]> wrote:
> > > On Wed, 01 Dec 2010 16:51:12 -0500
> > > Trond Myklebust <[email protected]> wrote:
> > >
> > >> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote:
> > >> > Probably on most call paths we'll be OK - if a process is in the middle
> > >> > of a file truncate, holdin a file* ref which holds an inode ref then
> > >> > nobody will be unmounting that fs and hence nobody will be unloading
> > >> > that module.
> > >> >
> > >> > However on the random_code->alloc_page->vmscan->releasepage path, none
> > >> > of that applies.
> > >>
> > >> Just out of interest, what ensures that the mapping is still around for
> > >> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()?
> > >
> > > Nothing, afacit.
> >
> > No, we're good.
> >
> > Module unload has to go through a "stop_machine()" cycle, and that in
> > turn requires an idle period for everything. And just a preemption
> > reschedule isn't enough for that.
> >
> > So what is sufficient is that
> >
> > - we had the page locked and on the mapping
> >
> > This implies that we had an inode reference to the module, and the
> > page lock means that the inode reference cannot go away (because it
> > will involve invalidate-pages etc)
> >
> > - we're not sleeping after __remove_mapping, so unload can't happen afterwards.
> >
> > A _lot_ of the module races depend on that latter thing. We have
> > almost no cases that are strictly about actual reference counts etc.
> >
>
> OK, the stop_machine() plugs a lot of potential race-vs-module-unload
> things. But Trond is referring to races against vmscan inode reclaim,
> unmount, etc.
>
Yes, that was my question.
However, Linus' explanation appears to answer one of Hugh's objections:
we can apparently use preempt_disable() in order to ensure that the
module containing the mapping->a_ops is not unloaded until after the
->freepage() call is complete.
That would imply that ->freepage() cannot sleep, but I don't think that
is too nasty a restriction.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, 1 Dec 2010, Linus Torvalds wrote:
>
> On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > We need to ensure that the entries in the nfs_cache_array get cleared
> > when the page is removed from the page cache. To do so, we use the
> > releasepage address_space operation (which also requires us to set
> > the Pg_private flag).
>
> So I really think that the whole "releasepage" use in NFS is simply
> overly complicated and was obviously too subtle.
>
> The whole need for odd return values, for the page lock, and for the
> addition of clearing the up-to-date bit comes from the fact that this
> wasn't really what releasepage was designed for.
>
> 'releasepage' was really designed for the filesystem having its own
> version of 'try_to_free_buffers()', which is just an optimistic "ok,
> we may be releasing this page, so try to get rid of any IO structures
> you have cached". It wasn't really a memory management thing.
>
> And the thing is, it looks trivial to do the memory management
> approach by adding a new callback that gets called after the page is
> actually removed from the page cache. If we do that, then there are no
> races with any other users, since we remove things from the page cache
> atomically wrt page cache lookup. So the need for playing games with
> page locking and 'uptodate' simply goes away. As does the PG_private
> thing or the interaction with invalidatepage() etc.
>
> So this is a TOTALLY UNTESTED trivial patch that just adds another
> callback. Does this work? I dunno. But I get the feeling that instead
> of having NFS work around the odd semantics that don't actually match
> what NFS wants, introducing a new callback with much simpler semantics
> would be simpler for everybody, and avoid the need for subtle code.
>
> Hmm?
>
> Linus
>
> --00032557542a0c5e6004965ba615
> Content-Type: text/x-patch; charset=US-ASCII; name="patch.diff"
> Content-Disposition: attachment; filename="patch.diff"
> Content-Transfer-Encoding: base64
> X-Attachment-Id: f_gh6f5ghm0
>
This violates the "Really. Don't send patches as attachments."
lkml-faq set forth by yourself ;)
Am I the only one who still uses a mail reader that doesn't decode
mime by default? Maybe it's time to move on...
Thanks,
Miklos
On 2010-11-30 22:47 -0500, Trond Myklebust wrote:
> We need to use the cookie from the previous array entry, not the
> actual cookie that we are searching for.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> fs/nfs/dir.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
Indeed, my test script passes with this patch on top of Linus' master,
thanks. I'll wait for the updated series before trying the other ones.
Tested-by: Nick Bowler <[email protected]>
--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
freepage address_space operation.
Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all contexts.
Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 18 +++++++++---------
fs/nfs/inode.c | 1 +
include/linux/nfs_fs.h | 1 +
3 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e03537f..d529e0e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -57,7 +57,7 @@ static int nfs_rename(struct inode *, struct dentry *,
struct inode *, struct dentry *);
static int nfs_fsync_dir(struct file *, int);
static loff_t nfs_llseek_dir(struct file *, loff_t, int);
-static int nfs_readdir_clear_array(struct page*, gfp_t);
+static void nfs_readdir_clear_array(struct page*);
const struct file_operations nfs_dir_operations = {
.llseek = nfs_llseek_dir,
@@ -83,8 +83,8 @@ const struct inode_operations nfs_dir_inode_operations = {
.setattr = nfs_setattr,
};
-const struct address_space_operations nfs_dir_addr_space_ops = {
- .releasepage = nfs_readdir_clear_array,
+const struct address_space_operations nfs_dir_aops = {
+ .freepage = nfs_readdir_clear_array,
};
#ifdef CONFIG_NFS_V3
@@ -214,17 +214,15 @@ void nfs_readdir_release_array(struct page *page)
* we are freeing strings created by nfs_add_to_readdir_array()
*/
static
-int nfs_readdir_clear_array(struct page *page, gfp_t mask)
+void nfs_readdir_clear_array(struct page *page)
{
- struct nfs_cache_array *array = nfs_readdir_get_array(page);
+ struct nfs_cache_array *array;
int i;
- if (IS_ERR(array))
- return PTR_ERR(array);
+ array = kmap_atomic(page, KM_USER0);
for (i = 0; i < array->size; i++)
kfree(array->array[i].string.name);
- nfs_readdir_release_array(page);
- return 0;
+ kunmap_atomic(array, KM_USER0);
}
/*
@@ -639,6 +637,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
+ if (!desc->page->mapping)
+ nfs_readdir_clear_array(desc->page);
page_cache_release(desc->page);
desc->page = NULL;
}
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 314f571..e67e31c 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
+ inode->i_data.a_ops = &nfs_dir_aops;
if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS))
set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
/* Deal with crossing mountpoints */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index c66fdb7..29d504d 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations;
#endif /* CONFIG_NFS_V3 */
extern const struct file_operations nfs_file_operations;
extern const struct address_space_operations nfs_file_aops;
+extern const struct address_space_operations nfs_dir_aops;
static inline struct nfs_open_context *nfs_file_open_context(struct file *filp)
{
--
1.7.3.2
On Wed, 2010-12-01 at 17:47 -0500, Trond Myklebust wrote:
> However, Linus' explanation appears to answer one of Hugh's objections:
> we can apparently use a preempt-disable() in order to ensure that the
> module containing the mapping->a_ops is not unloaded until after the
> ->freepage() call is complete.
>
> That would imply that ->freepage() cannot sleep, but I don't think that
> is too nasty a restriction.
Revised patch would be as follows...
----------------------------------------------------------------------------------------
On Wed, Dec 1, 2010 at 2:38 PM, Andrew Morton <[email protected]> wrote:
>
> OK, the stop_machine() plugs a lot of potential race-vs-module-unload
> things. But Trond is referring to races against vmscan inode reclaim,
> unmount, etc.
So?
A filesystem module cannot be unloaded while it's still mounted.
And unmount doesn't succeed until all inodes are gone.
And getting rid of an inode doesn't succeed until all pages associated
with it are gone.
And getting rid of the pages involves locking them (whether in
truncate or vmscan) and removing them from all lists.
Ergo: vmscan having a locked page leads to the filesystem being
guaranteed to not be unmounted. And that, in turn, guarantees that
the module won't be unloaded until the machine has gone through an
idle cycle.
It really is that simple. There's nothing subtle there. The reason
spin_unlock(&mapping->tree_lock) is safe is exactly the above trivial
chain of dependencies. And it's also exactly why
mapping->a_ops->freepage() would also be safe.
This is pretty much how all the module races are handled. Doing module
ref-counts per page (or per packet in flight for things like
networking) would be prohibitively expensive. There's no way we can
ever do that.
Linus
On 12/01/2010 11:17 AM, Linus Torvalds wrote:
> So this is a TOTALLY UNTESTED trivial patch that just adds another
> callback. Does this work? I dunno. But I get the feeling that instead
> of having NFS work around the odd semantics that don't actually match
> what NFS wants, introducing a new callback with much simpler semantics
> would be simpler for everybody, and avoid the need for subtle code.
Surely somebody can have just looked up the page and
gotten a reference count, right before your ->freepage
call is invoked?
	CPU A			CPU B

	look up page
	grab refcount
				->freepage
	use contents of page

Am I overlooking something obvious?
--
All rights reversed
On Wed, 1 Dec 2010, Andrew Morton wrote:
> On Wed, 01 Dec 2010 15:10:50 -0500
> Trond Myklebust <[email protected]> wrote:
>
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -602,6 +602,7 @@ struct address_space_operations {
> > sector_t (*bmap)(struct address_space *, sector_t);
> > void (*invalidatepage) (struct page *, unsigned long);
> > int (*releasepage) (struct page *, gfp_t);
> > + void (*freepage)(struct page *);
> > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> > loff_t offset, unsigned long nr_segs);
> > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
>
> It would be good to think about and then clearly spell out exactly what
> state the page is in here. It is locked, and I assume clean and not
> under writeback. What about its refcount, freezedness status and
> eligibility for lookups?
>
> And as Hugh pointed out, some callees might need the address_space*
> although we can perhaps defer that until such a callee turns up.
> If/when that happens we might have a problem though: if this locked
> page is no longer attached to the address_space then what now pins the
> address_space, protecting it from inode reclaim?
That's an excellent point and trumps mine: it would be actively wrong
to provide the struct address_space *mapping arg I was asking for.
(Bet someone then tries stashing it away via page->private though.)
Hugh
On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote:
> On Wed, 1 Dec 2010 13:15:07 -0800 (PST)
> Hugh Dickins <[email protected]> wrote:
>
> > On Wed, 1 Dec 2010, Hugh Dickins wrote:
> > > On Wed, 1 Dec 2010, Andrew Morton wrote:
> > > > On Wed, 01 Dec 2010 15:10:50 -0500
> > > > Trond Myklebust <[email protected]> wrote:
> > > >
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > > > sector_t (*bmap)(struct address_space *, sector_t);
> > > > > void (*invalidatepage) (struct page *, unsigned long);
> > > > > int (*releasepage) (struct page *, gfp_t);
> > > > > + void (*freepage)(struct page *);
> > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> > > > > loff_t offset, unsigned long nr_segs);
> > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > > >
> > > > It would be good to think about and then clearly spell out exactly what
> > > > state the page is in here. It is locked, and I assume clean and not
> > > > under writeback. What about its refcount, freezedness status and
> > > > eligibility for lookups?
> > > >
> > > > And as Hugh pointed out, some callees might need the address_space*
> > > > although we can perhaps defer that until such a callee turns up.
> > > > If/when that happens we might have a problem though: if this locked
> > > > page is no longer attached to the address_space then what now pins the
> > > > address_space, protecting it from inode reclaim?
> > >
> > > That's an excellent point and trumps mine: it would be actively wrong
> > > to provide the struct address_space *mapping arg I was asking for.
> > > (Bet someone then tries stashing it away via page->private though.)
> >
> > Hmm, thinking further along the same lines: can we even guarantee that
> > the filesystem module is still loaded at that point? i.e. might
> > mapping->freepage now be pointing off into the garbage heap?
>
> I don't see anything on the VFS side which would prevent a module
> unload. Or, more realistically, a concurrent unmount, freeing of the
> superblock and everything associated with it. All we have here is a
> page*.
>
> Probably on most call paths we'll be OK - if a process is in the middle
> of a file truncate, holding a file* ref which holds an inode ref then
> nobody will be unmounting that fs and hence nobody will be unloading
> that module.
>
> However on the random_code->alloc_page->vmscan->releasepage path, none
> of that applies.
Just out of interest, what ensures that the mapping is still around for
the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()?
I'm clearly missing whatever mechanism prevents iput_final() from racing
with vmscan if the latter clears out the last page from the mapping.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
freepage address_space operation.
Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all contexts.
Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 18 +++++++++---------
fs/nfs/inode.c | 1 +
include/linux/nfs_fs.h | 1 +
3 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e03537f..d529e0e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -57,7 +57,7 @@ static int nfs_rename(struct inode *, struct dentry *,
struct inode *, struct dentry *);
static int nfs_fsync_dir(struct file *, int);
static loff_t nfs_llseek_dir(struct file *, loff_t, int);
-static int nfs_readdir_clear_array(struct page*, gfp_t);
+static void nfs_readdir_clear_array(struct page*);
const struct file_operations nfs_dir_operations = {
.llseek = nfs_llseek_dir,
@@ -83,8 +83,8 @@ const struct inode_operations nfs_dir_inode_operations = {
.setattr = nfs_setattr,
};
-const struct address_space_operations nfs_dir_addr_space_ops = {
- .releasepage = nfs_readdir_clear_array,
+const struct address_space_operations nfs_dir_aops = {
+ .freepage = nfs_readdir_clear_array,
};
#ifdef CONFIG_NFS_V3
@@ -214,17 +214,15 @@ void nfs_readdir_release_array(struct page *page)
* we are freeing strings created by nfs_add_to_readdir_array()
*/
static
-int nfs_readdir_clear_array(struct page *page, gfp_t mask)
+void nfs_readdir_clear_array(struct page *page)
{
- struct nfs_cache_array *array = nfs_readdir_get_array(page);
+ struct nfs_cache_array *array;
int i;
- if (IS_ERR(array))
- return PTR_ERR(array);
+ array = kmap_atomic(page, KM_USER0);
for (i = 0; i < array->size; i++)
kfree(array->array[i].string.name);
- nfs_readdir_release_array(page);
- return 0;
+ kunmap_atomic(array, KM_USER0);
}
/*
@@ -639,6 +637,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
+ if (!desc->page->mapping)
+ nfs_readdir_clear_array(desc->page);
page_cache_release(desc->page);
desc->page = NULL;
}
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 314f571..e67e31c 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
+ inode->i_data.a_ops = &nfs_dir_aops;
if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS))
set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
/* Deal with crossing mountpoints */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index c66fdb7..29d504d 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations;
#endif /* CONFIG_NFS_V3 */
extern const struct file_operations nfs_file_operations;
extern const struct address_space_operations nfs_file_aops;
+extern const struct address_space_operations nfs_dir_aops;
static inline struct nfs_open_context *nfs_file_open_context(struct file *filp)
{
--
1.7.3.2
On Wed, Dec 1, 2010 at 8:35 AM, Rik van Riel <[email protected]> wrote:
>
> Surely somebody can have just looked up the page and
> gotten a reference count, right before your ->freepage
> call is invoked?
No.
The removal from the page cache is atomic, even in the presence of the
lockless lookup.
The page cache lookup does a "get_page_unless_zero()" on the count, so
when __remove_mapping() has removed the page using
"page_freeze_refs()", it's really gone, and cannot be looked up.
And if that is broken, then we have much more serious problems (like
aliasing the same page when doing mmap/read etc), so that's more than
just an implementation detail, it's a fundamental requirement of the
whole page-cache design.
And that's the whole point of adding this callback to the
__remove_mapping() stage: that's the _only_ point where we really end
up knowing that "yes, we really removed that page, and there are no
more users".
Linus
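A minimal sketch of the refcount handshake described above (simplified: this
is not the actual mm/ code, and the expected count of 2 plus the surrounding
tree_lock/RCU handling are assumptions made only for illustration):

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Lookup side (find_get_page()-style, run under rcu_read_lock()): */
static struct page *lookup_side(struct page *page)
{
	/* Fails once the refcount has been frozen to zero. */
	if (!get_page_unless_zero(page))
		return NULL;	/* page is being freed; caller retries the lookup */
	return page;		/* we hold a reference, so the page cannot be freed */
}

/* Removal side (__remove_mapping()-style, under mapping->tree_lock): */
static int removal_side(struct page *page)
{
	/* Only succeeds if we hold the sole remaining references. */
	if (!page_freeze_refs(page, 2))
		return 0;	/* somebody looked the page up; abort the free */
	/*
	 * No lockless lookup can succeed from here on, so the page can be
	 * deleted and ->freepage() called once the lock is dropped.
	 */
	return 1;
}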
On Wed, Dec 01, 2010 at 05:42:08PM -0800, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 5:22 PM, Andrew Morton <[email protected]> wrote:
> >
> > What we're talking about is races against memory reclaim, unmount, etc.
>
> Ahh. Those I can believe in.
Yeah, it's a tricky question. It would be solved if the inode reclaim
code didn't have the nasty shortcuts for nr_pages == 0 sitting outside
the tree_lock... any time we have these kinds of optimisations checking
things outside locks, we invariably find they have races or data races
:(
So doing those checks under lock would be a reasonable way to fix it
if anyone cares for 2.6.37 or earlier (eg. distros). But it is another
lock in inode freeing path which is nice to avoid, so let's just get
it fixed with RCU in 2.6.38.
> Although I think they'd almost
> incidentally be fixed by making inode freeing (which is where the
> 'struct address_space' is embedded) RCU-safe, which we're going to do
> anyway in 38. Then we could make the vmscan code just be a rcu-read
> section.
Yep, good observation.
Thanks,
Nick
> OK. The first patch in this series fixes the regression reported by
> Nick Bowler when I apply it to my setup.
> The 2 remaining patches are needed in order to ensure that the
> VM doesn't free the readdir page cache page while we're trying to
> read from it, and to ensure that we don't leak memory...
> Linus, please don't apply these patches quite yet. I'd like to continue
> tests for a couple more days before I send you the pull request.
v2 fixes the following issues:
- Fix up the cookie usage in uncached_readdir()
- Changelog fixups for the second patch
- .releasepage should use kmap_atomic() so that it can be called
from all direct reclaim contexts.
- Ensure that .releasepage also clears Pg_uptodate
- Set/clear the Pg_private flag to ensure .releasepage gets called when
appropriate.
- Add a .invalidatepage to ensure truncate_inode_pages() works correctly
- Ensure that the anonymous page that is generated by uncached_readdir()
doesn't leak memory.
v3:
- add the freepage() address space operation.
- Dump the page locking
- rewrite patch 'Fix a memory leak in nfs_readdir' to work with the new
->freepage() callback.
Linus Torvalds (1):
Call the filesystem back whenever a page is removed from the page
cache
Trond Myklebust (2):
NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
NFS: Fix a memory leak in nfs_readdir
Documentation/filesystems/Locking | 7 ++++++-
Documentation/filesystems/vfs.txt | 7 +++++++
fs/nfs/dir.c | 28 +++++++++++++++++-----------
fs/nfs/inode.c | 1 +
include/linux/fs.h | 1 +
include/linux/nfs_fs.h | 1 +
mm/truncate.c | 8 ++++++++
mm/vmscan.c | 11 +++++++++++
8 files changed, 52 insertions(+), 12 deletions(-)
--
1.7.3.2
On Wed, 1 Dec 2010 12:39:29 -0800 Andrew Morton <[email protected]>
wrote:
> On Wed, 01 Dec 2010 15:05:38 -0500
> Trond Myklebust <[email protected]> wrote:
>
> > On Wed, 2010-12-01 at 11:23 -0800, Hugh Dickins wrote:
> > > On Wed, 1 Dec 2010, Trond Myklebust wrote:
> > > > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote:
> > > > > include/linux/fs.h | 1 +
> > > > > mm/vmscan.c | 3 +++
> > > > > 2 files changed, 4 insertions(+), 0 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index c9e06cc..090f0ea 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > > > sector_t (*bmap)(struct address_space *, sector_t);
> > > > > void (*invalidatepage) (struct page *, unsigned long);
> > > > > int (*releasepage) (struct page *, gfp_t);
> > > > > + void (*freepage)(struct page *);
> > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec
> > > > > *iov,
> > > > > loff_t offset, unsigned long nr_segs);
> > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index d31d7ce..1accb01 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space
> > > > > *mapping, struct page *page)
> > > > > mem_cgroup_uncharge_cache_page(page);
> > > > > }
> > > > >
> > > > > + if (mapping->a_ops->freepage)
> > > > > + mapping->a_ops->freepage(page);
> > > >
> > > > Hmm... Looking again at the problem, it appears that the same callback
> > > > needs to be added to truncate_complete_page() and
> > > > invalidate_complete_page2(). Otherwise we end up in a situation where
> > > > the page can sometimes be removed from the page cache without calling
> > > > freepage().
> > > >
> > > > > +
> > > > > return 1;
> > > > >
> > > > > cannot_free:
> > >
> > > Yes, I was wondering quite how we would define this ->freepage thing,
> > > if it gets called from one place that removes from page cache and not
> > > from others.
> > >
> > > Another minor problem with it: it would probably need to take the
> > > struct address_space *mapping as arg as well as struct page *page:
> > > because by this time page->mapping has been reset to NULL.
> > >
> > > But I'm not at all keen on adding a calllback in this very special
> > > frozen-to-0-references place: please let's not do it without an okay
> > > from Nick Piggin (now Cc'ed).
> > >
> > > I agree completely with what Linus said originally about how the
> > > page cannot be freed while there's a reference to it, and it should
> > > be possible to work this without your additional page locks.
> > >
> > > Your ->releasepage should be able to judge whether the page is likely
> > > (not certain) to be freed - page_count 3? 1 for the page cache, 1 for
> > > the page_private reference, 1 for vmscan's reference, I think. Then
> > > it can mark !PageUptodate and proceed with freeing the stuff you had
> > > allocated, undo page_has_private and its reference, and return 1 (or
> > > return 0 if it decides to hold on to the page).
> >
> > That is very brittle. I'd prefer not to have to scan linux-mm every week
> > in order to find out if someone changed the page_count.
> >
> > However, while reading Documentation/filesystems/vfs.txt (in order to
> > add documentation for freepage) I was surprised to read that the
> > ->releasepage() is itself supposed to be allowed to actually remove the
> > page from the address space if it so desires.
>
> That doesn't sound right. It came from Neil in 2006.
>
> Neil, what were you thinking there? Did you find such a ->releasepage()?
Nope, no idea, sorry.
No releasepage functions do anything like that, and no call sites suggest it
could be a possibility. Quite the reverse - they are likely to remove the
page from the mapping without checking that it is still in the mapping.
So that sentence should be deleted.
(Getting good review for doco updates is sooo hard :-)
NeilBrown
On Wed, 2010-12-01 at 08:14 -0500, Rik van Riel wrote:
> On 11/30/2010 11:29 PM, Trond Myklebust wrote:
>
> > I'm not worried about other readdir calls invalidating the page. My
> > concern is rather about the VM memory reclaimers ejecting the page from
> > the page cache, and calling nfs_readdir_clear_array while we're
> > referencing the page.
>
> Wait, if nfs_readdir_clear_array can clear the page while
> somebody else has a refcount to it, what good is the
> refcount?
Hi Rik
This is readdir, so there should normally be only one process accessing
the page. If that process locks it (we don't have mmap() to worry about,
so page locking is reasonable here) then the scheme is safe.
Note that we do clear Pg_uptodate under the page lock in the latest
version of the releasepage method (patches to be published later today
after I finish testing).
> Why is your .releasepage modifying the data on the page,
> instead of just the metadata in the struct page?
We want to be able to free up the data that is referenced by the array
on the page before the page itself is freed.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
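For context, a rough sketch of the on-page layout being discussed (simplified
from fs/nfs/dir.c of this series; field names and ordering are approximate
rather than authoritative). The array itself lives in the page, but each
entry's name points at a separately kmalloc()'ed string, which is what has to
be kfree()'d before the page can go away:

#include <linux/dcache.h>	/* struct qstr */
#include <linux/types.h>

struct nfs_cache_array_entry {
	u64 cookie;
	u64 ino;
	struct qstr string;	/* string.name: kmalloc'ed copy, must be kfree'd */
};

struct nfs_cache_array {
	unsigned int size;	/* number of valid entries on this page */
	int eof_index;
	u64 last_cookie;
	struct nfs_cache_array_entry array[0];	/* fills the rest of the page */
};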
On Wed, 1 Dec 2010 17:05:43 -0800
Linus Torvalds <[email protected]> wrote:
> On Wed, Dec 1, 2010 at 3:36 PM, Andrew Morton <[email protected]> wrote:
> > On Wed, 1 Dec 2010 15:31:11 -0800
> > Linus Torvalds <[email protected]> wrote:
> >>
> >> Ergo: vmscan having a locked page leads to the filesystem being
> >> guaranteed to not be unmounted. And that, in turn, guarantees that
> >> the module won't be unloaded until the machine has gone through an
> >> idle cycle.
> >
> > The page isn't attached to the address_space any more:
>
> Did you even read the email?
We stopped talking about module unload hours ago. That matter is settled.
What we're talking about is races against memory reclaim, unmount, etc.
On Wed, Dec 1, 2010 at 12:10 PM, Trond Myklebust
<[email protected]> wrote:
>
> Something like the following then?
> diff --git a/mm/truncate.c b/mm/truncate.c
> index ba887bf..76ab2a8 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
>        clear_page_mlock(page);
>        remove_from_page_cache(page);
>        ClearPageMappedToDisk(page);
> +
> +       if (mapping->a_ops->freepage)
> +               mapping->a_ops->freepage(page);
> +
>        page_cache_release(page);       /* pagecache ref */
>        return 0;
>  }
I'd do this in "remove_from_page_cache()" instead of in
"truncate_complete_page()". Otherwise it will miss any other callers
that use "remove_from_page_cache()" (not that there are many, and I
doubt NFS cares about the ones that do exist, but..)
Linus
On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust
<[email protected]> wrote:
>
> We need to ensure that the entries in the nfs_cache_array get cleared
> when the page is removed from the page cache. To do so, we use the
> releasepage address_space operation (which also requires us to set
> the Pg_private flag).
So I really think that the whole "releasepage" use in NFS is simply
overly complicated and was obviously too subtle.
The whole need for odd return values, for the page lock, and for the
addition of clearing the up-to-date bit comes from the fact that this
wasn't really what releasepage was designed for.
'releasepage' was really designed for the filesystem having its own
version of 'try_to_free_buffers()', which is just an optimistic "ok,
we may be releasing this page, so try to get rid of any IO structures
you have cached". It wasn't really a memory management thing.
And the thing is, it looks trivial to do the memory management
approach by adding a new callback that gets called after the page is
actually removed from the page cache. If we do that, then there are no
races with any other users, since we remove things from the page cache
atomically wrt page cache lookup. So the need for playing games with
page locking and 'uptodate' simply goes away. As does the PG_private
thing or the interaction with invalidatepage() etc.
So this is a TOTALLY UNTESTED trivial patch that just adds another
callback. Does this work? I dunno. But I get the feeling that instead
of having NFS work around the odd semantics that don't actually match
what NFS wants, introducing a new callback with much simpler semantics
would be simpler for everybody, and avoid the need for subtle code.
Hmm?
Linus
On Wed, Dec 1, 2010 at 2:13 PM, Andrew Morton <[email protected]> wrote:
> On Wed, 01 Dec 2010 16:51:12 -0500
> Trond Myklebust <[email protected]> wrote:
>
>> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote:
>> > Probably on most call paths we'll be OK - if a process is in the middle
>> > of a file truncate, holding a file* ref which holds an inode ref then
>> > nobody will be unmounting that fs and hence nobody will be unloading
>> > that module.
>> >
>> > However on the random_code->alloc_page->vmscan->releasepage path, none
>> > of that applies.
>>
>> Just out of interest, what ensures that the mapping is still around for
>> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()?
>
> Nothing, afaict.
No, we're good.
Module unload has to go through a "stop_machine()" cycle, and that in
turn requires an idle period for everything. And just a preemption
reschedule isn't enough for that.
So what is sufficient is that
- we had the page locked and on the mapping
This implies that we had an inode reference to the module, and the
page lock means that the inode reference cannot go away (because it
will involve invalidate-pages etc)
- we're not sleeping after __remove_mapping, so unload can't happen afterwards.
A _lot_ of the module races depend on that latter thing. We have
almost no cases that are strictly about actual reference counts etc.
Linus
We need to ensure that the entries in the nfs_cache_array get cleared
when the page is removed from the page cache. To do so, we use the
releasepage address_space operation (which also requires us to set
the Pg_private flag).
Change nfs_readdir_clear_array to use kmap_atomic(), so that the
function can be safely called from all direct reclaim contexts.
Finally, modify the cache_page_release helper to call
nfs_readdir_clear_array directly, when dealing with an anonymous
page from 'uncached_readdir'.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 22 +++++++++++++++++-----
fs/nfs/inode.c | 1 +
include/linux/nfs_fs.h | 1 +
3 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 3ec3f1c..4c6319e 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -58,6 +58,7 @@ static int nfs_rename(struct inode *, struct dentry *,
static int nfs_fsync_dir(struct file *, int);
static loff_t nfs_llseek_dir(struct file *, loff_t, int);
static int nfs_readdir_clear_array(struct page*, gfp_t);
+static void nfs_readdir_invalidatepage(struct page*, unsigned long);
const struct file_operations nfs_dir_operations = {
.llseek = nfs_llseek_dir,
@@ -85,6 +86,7 @@ const struct inode_operations nfs_dir_inode_operations = {
const struct address_space_operations nfs_dir_addr_space_ops = {
.releasepage = nfs_readdir_clear_array,
+ .invalidatepage = nfs_readdir_invalidatepage,
};
#ifdef CONFIG_NFS_V3
@@ -216,15 +218,22 @@ void nfs_readdir_release_array(struct page *page)
static
int nfs_readdir_clear_array(struct page *page, gfp_t mask)
{
- struct nfs_cache_array *array = nfs_readdir_get_array(page);
+ struct nfs_cache_array *array;
int i;
- if (IS_ERR(array))
- return PTR_ERR(array);
+ array = kmap_atomic(page, KM_USER0);
for (i = 0; i < array->size; i++)
kfree(array->array[i].string.name);
- nfs_readdir_release_array(page);
- return 0;
+ kunmap_atomic(array, KM_USER0);
+ ClearPageUptodate(page);
+ ClearPagePrivate(page);
+ return 1;
+}
+
+static
+void nfs_readdir_invalidatepage(struct page *page, unsigned long offset)
+{
+ nfs_readdir_clear_array(page, 0);
}
/*
@@ -624,6 +633,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
if (ret < 0)
goto error;
SetPageUptodate(page);
+ SetPagePrivate(page);
if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
/* Should never happen */
@@ -639,6 +649,8 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
+ if (!desc->page->mapping)
+ nfs_readdir_clear_array(desc->page, GFP_KERNEL);
unlock_page(desc->page);
page_cache_release(desc->page);
desc->page = NULL;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 314f571..0018e07 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -289,6 +289,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
inode->i_fop = &nfs_dir_operations;
+ inode->i_data.a_ops = &nfs_dir_addr_space_ops;
if (nfs_server_capable(inode, NFS_CAP_READDIRPLUS))
set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
/* Deal with crossing mountpoints */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index c66fdb7..b5d3ab0 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -401,6 +401,7 @@ extern const struct inode_operations nfs3_file_inode_operations;
#endif /* CONFIG_NFS_V3 */
extern const struct file_operations nfs_file_operations;
extern const struct address_space_operations nfs_file_aops;
+extern const struct address_space_operations nfs_dir_addr_space_ops;
static inline struct nfs_open_context *nfs_file_open_context(struct file *filp)
{
--
1.7.3.2
On Wed, 2010-12-01 at 19:34 -0800, Hugh Dickins wrote:
> On Wed, 1 Dec 2010, Trond Myklebust wrote:
> >
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
> > clear_page_mlock(page);
> > remove_from_page_cache(page);
> > ClearPageMappedToDisk(page);
> > +
> > + if (mapping->a_ops->freepage)
> > + mapping->a_ops->freepage(page);
> > +
> > page_cache_release(page); /* pagecache ref */
> > return 0;
> > }
>
> I think Linus recommended that one be done in remove_from_page_cache()
> to catch all instances: did that get overruled later for some reason?
I'm fine with doing that as long as everyone is happy that there is no
chance of races. I was a bit nervous given the discussion that ensued
from the vmscan case.
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -454,6 +454,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> > BUG_ON(!PageLocked(page));
> > BUG_ON(mapping != page_mapping(page));
> >
> > + preempt_disable();
> > spin_lock_irq(&mapping->tree_lock);
> > /*
> > * The non racy check for a busy page.
> > @@ -492,10 +493,19 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> > swp_entry_t swap = { .val = page_private(page) };
> > __delete_from_swap_cache(page);
> > spin_unlock_irq(&mapping->tree_lock);
> > + preempt_enable();
> > swapcache_free(swap, page);
> > } else {
> > + void (*freepage)(struct page *);
> > +
> > + freepage = mapping->a_ops->freepage;
> > +
> > __remove_from_page_cache(page);
> > spin_unlock_irq(&mapping->tree_lock);
> > + if (freepage != NULL)
> > + freepage(page);
> > + preempt_enable();
> > +
> > mem_cgroup_uncharge_cache_page(page);
> > }
> >
> > @@ -503,6 +513,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> >
> > cannot_free:
> > spin_unlock_irq(&mapping->tree_lock);
> > + preempt_enable();
> > return 0;
> > }
>
> I took his "stop_machine()" explanation ("an idle period for everything.
> And just a preemption reschedule isn't enough for that") to imply that
> there's no need for your preempt_disable/preempt_enable there: they
> don't add anything to the module unload case, and they don't help the
> spin_unlock_irq issue (and you're already being rightly careful to note
> freepage in advance).
>
> But maybe I misunderstood.
Again, I was being cautious (I freely admit to not having studied
stop_machine()). If nobody disagrees with your interpretation, then I'm
very happy to drop the preempt disable/enable crud.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f0a384e..e03537f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -178,6 +178,7 @@ typedef struct {
struct page *page;
unsigned long page_index;
u64 *dir_cookie;
+ u64 last_cookie;
loff_t current_index;
decode_dirent_t decode;
@@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc)
else
status = nfs_readdir_search_for_cookie(array, desc);
+ if (status == -EAGAIN)
+ desc->last_cookie = array->last_cookie;
nfs_readdir_release_array(desc->page);
out:
return status;
@@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
unsigned int array_size = ARRAY_SIZE(pages);
entry.prev_cookie = 0;
- entry.cookie = *desc->dir_cookie;
+ entry.cookie = desc->last_cookie;
entry.eof = 0;
entry.fh = nfs_alloc_fhandle();
entry.fattr = nfs_alloc_fattr();
@@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
{
int res;
- if (desc->page_index == 0)
+ if (desc->page_index == 0) {
desc->current_index = 0;
+ desc->last_cookie = 0;
+ }
while (1) {
res = find_cache_page(desc);
if (res != -EAGAIN)
@@ -764,6 +769,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent,
}
desc->page_index = 0;
+ desc->last_cookie = *desc->dir_cookie;
desc->page = page;
status = nfs_readdir_xdr_to_array(desc, page, inode);
--
1.7.3.2
> OK. The first patch in this series fixes the regression reported by
> Nick Bowler when I apply it to my setup.
> The 2 remaining patches are needed in order to ensure that the
> VM doesn't free the readdir page cache page while we're trying to
> read from it, and to ensure that we don't leak memory...
> Linus, please don't apply these patches quite yet. I'd like to continue
> tests for a couple more days before I send you the pull request.
v2 fixes the following issues:
- Fix up the cookie usage in uncached_readdir()
- Changelog fixups for the second patch
- .releasepage should use kmap_atomic() so that it can be called
from all direct reclaim contexts.
- Ensure that .releasepage also clears Pg_uptodate
- Set/clear the Pg_private flag to ensure .releasepage gets called when
appropriate.
- Add a .invalidatepage to ensure truncate_inode_pages() works correctly
- Ensure that the anonymous page that is generated by uncached_readdir()
doesn't leak memory.
v3:
- add the freepage() address space operation.
- Dump the page locking
- rewrite patch 'Fix a memory leak in nfs_readdir' to work with the new
->freepage() callback.
v4:
- Remove the preempt disable/enable protection in __remove_mapping
- Add a freepage() call to remove_from_page_cache() instead of
open-coding it in truncate_complete_page().
Cheers
Trond
Linus Torvalds (1):
Call the filesystem back whenever a page is removed from the page
cache
Trond Myklebust (2):
NFS: Ensure we use the correct cookie in nfs_readdir_xdr_filler
NFS: Fix a memory leak in nfs_readdir
Documentation/filesystems/Locking | 7 ++++++-
Documentation/filesystems/vfs.txt | 7 +++++++
fs/nfs/dir.c | 28 +++++++++++++++++-----------
fs/nfs/inode.c | 1 +
include/linux/fs.h | 1 +
include/linux/nfs_fs.h | 1 +
mm/filemap.c | 5 +++++
mm/truncate.c | 4 ++++
mm/vmscan.c | 7 +++++++
9 files changed, 49 insertions(+), 12 deletions(-)
--
1.7.3.2
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f0a384e..e03537f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -178,6 +178,7 @@ typedef struct {
struct page *page;
unsigned long page_index;
u64 *dir_cookie;
+ u64 last_cookie;
loff_t current_index;
decode_dirent_t decode;
@@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc)
else
status = nfs_readdir_search_for_cookie(array, desc);
+ if (status == -EAGAIN)
+ desc->last_cookie = array->last_cookie;
nfs_readdir_release_array(desc->page);
out:
return status;
@@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
unsigned int array_size = ARRAY_SIZE(pages);
entry.prev_cookie = 0;
- entry.cookie = *desc->dir_cookie;
+ entry.cookie = desc->last_cookie;
entry.eof = 0;
entry.fh = nfs_alloc_fhandle();
entry.fattr = nfs_alloc_fattr();
@@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
{
int res;
- if (desc->page_index == 0)
+ if (desc->page_index == 0) {
desc->current_index = 0;
+ desc->last_cookie = 0;
+ }
while (1) {
res = find_cache_page(desc);
if (res != -EAGAIN)
@@ -764,6 +769,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent,
}
desc->page_index = 0;
+ desc->last_cookie = *desc->dir_cookie;
desc->page = page;
status = nfs_readdir_xdr_to_array(desc, page, inode);
--
1.7.3.2
On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 7:36 AM, Trond Myklebust
> <[email protected]> wrote:
> >
> > We need to ensure that the entries in the nfs_cache_array get cleared
> > when the page is removed from the page cache. To do so, we use the
> > releasepage address_space operation (which also requires us to set
> > the Pg_private flag).
>
> So I really think that the whole "releasepage" use in NFS is simply
> overly complicated and was obviously too subtle.
>
> The whole need for odd return values, for the page lock, and for the
> addition of clearing the up-to-date bit comes from the fact that this
> wasn't really what releasepage was designed for.
>
> 'releasepage' was really designed for the filesystem having its own
> version of 'try_to_free_buffers()', which is just an optimistic "ok,
> we may be releasing this page, so try to get rid of any IO structures
> you have cached". It wasn't really a memory management thing.
>
> And the thing is, it looks trivial to do the memory management
> approach by adding a new callback that gets called after the page is
> actually removed from the page cache. If we do that, then there are no
> races with any other users, since we remove things from the page cache
> atomically wrt page cache lookup. So the need for playing games with
> page locking and 'uptodate' simply goes away. As does the PG_private
> thing or the interaction with invalidatepage() etc.
>
> So this is a TOTALLY UNTESTED trivial patch that just adds another
> callback. Does this work? I dunno. But I get the feeling that instead
> of having NFS work around the odd semantics that don't actually match
> what NFS wants, introducing a new callback with much simpler semantics
> would be simpler for everybody, and avoid the need for subtle code.
>
> Hmm?
>
> Linus
This works for me, and I can code up a patch that uses it if you like...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, Dec 1, 2010 at 11:23 AM, Hugh Dickins <[email protected]> wrote:
>
> But I'd prefer us not to throw in another callback if it's well
> workable with the ->releasepage we already have.
The thing is, NFS already really uses releasepage for its "real"
designed usage, in the form of the fscache case (which the directory
inodes don't use).
I really hate mixing things up. The NFS directory case really hasn't
got anything to do with releasepage, and taking the page lock on the
read side is just really sad. As is marking the page not up-to-date,
just so that it will get filled again.
In fact, I don't even know if it's kosher by VFS standards to clear
the up-to-date bit in the first place after it has already gotten set.
It absolutely isn't for anything that can be mmap'ed. Obviously, mmap
doesn't work on the directory cache anyway, but the point is that the
work-arounds for the semantic gap are really quite ugly.
Certainly much uglier than just adding some much simpler semantics to
the "now I'm releasing the page" case in the VM.
Linus
Adding Nick Piggin back to the Cc.
I do think it needs to be freepage(struct address_space *, struct page *),
even if you happen not to need to know the mapping in the NFS case.
On Wed, 1 Dec 2010, Trond Myklebust wrote:
> On Wed, 2010-12-01 at 11:47 -0800, Linus Torvalds wrote:
> > On Wed, Dec 1, 2010 at 10:54 AM, Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > Hmm... Looking again at the problem, it appears that the same callback
> > > needs to be added to truncate_complete_page() and
> > > invalidate_complete_page2(). Otherwise we end up in a situation where
> > > the page can sometimes be removed from the page cache without calling
> > > freepage().
> >
> > Yes, I think any caller of __remove_from_page_cache() should do it
> > once it has dropped all locks.
> >
> > And to be consistent with that rule, even in the __remove_mapping()
> > function I suspect the code to call ->freepage() might as well be done
> > only in the __remove_from_page_cache() case (ie not in the
> > PageSwapCache() case).
> >
> > Then, add the case to the end of "remove_from_page_cache()" itself, and now
> > it's really easy to just grep for __remove_from_page_cache() and make
> > sure they all do it.
> >
> > That sounds sane, no?
> >
> > Linus
>
> Something like the following then?
>
> -----------------------------------------------------------------------------------------
> From 3a46d5eab1ac6efe9dfaf873e23de7589e0acccc Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <[email protected]>
> Date: Wed, 1 Dec 2010 13:35:19 -0500
> Subject: [PATCH] Call the filesystem back whenever a page is removed from the page cache
>
> NFS needs to be able to release objects that are stored in the page
> cache once the page itself is no longer visible from the page cache.
>
> This patch adds a callback to the address space operations that allows
> filesystems to perform page cleanups once the page has been removed
> from the page cache.
>
> Original patch by: Linus Torvalds <[email protected]>
> [trondmy: cover the cases of invalidate_inode_pages2() and
> truncate_inode_pages()]
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> Documentation/filesystems/Locking | 5 +++++
> Documentation/filesystems/vfs.txt | 5 +++++
> include/linux/fs.h | 1 +
> mm/truncate.c | 8 ++++++++
> mm/vmscan.c | 3 +++
> 5 files changed, 22 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index a91f308..06d6b71 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -173,6 +173,7 @@ prototypes:
> sector_t (*bmap)(struct address_space *, sector_t);
> int (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, int);
> + void (*freepage)(struct page *);
> int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> int (*launder_page) (struct page *);
> @@ -193,6 +194,7 @@ perform_write: no n/a yes
> bmap: no
> invalidatepage: no yes
> releasepage: no yes
> +freepage: no yes
> direct_IO: no
> launder_page: no yes
>
> @@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to
> indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
> the kernel assumes that the fs has no private interest in the buffers.
>
> + ->freepage() is called when the kernel is done dropping the page
> +from the page cache.
> +
> ->launder_page() may be called prior to releasing a page if
> it is still found to be dirty. It returns zero if the page was successfully
> cleaned, or an error value if not. Note that in order to prevent the page
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index ed7e5ef..76de6fd 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -534,6 +534,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> int (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, int);
> + void (*freepage)(struct page *);
> ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> struct page* (*get_xip_page)(struct address_space *, sector_t,
> @@ -679,6 +680,10 @@ struct address_space_operations {
> need to ensure this. Possibly it can clear the PageUptodate
> bit if it cannot free private data yet.
>
> + freepage: freepage is called once the page is no longer visible in
> + the page cache in order to allow the cleanup of any private
> + data.
> +
> direct_IO: called by the generic read/write routines to perform
> direct_IO - that is IO requests which bypass the page cache
> and transfer data directly between the storage and the
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c9e06cc..090f0ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -602,6 +602,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, gfp_t);
> + void (*freepage)(struct page *);
> ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index ba887bf..76ab2a8 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
> clear_page_mlock(page);
> remove_from_page_cache(page);
> ClearPageMappedToDisk(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> +
> page_cache_release(page); /* pagecache ref */
> return 0;
> }
> @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> mem_cgroup_uncharge_cache_page(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> +
> page_cache_release(page); /* pagecache ref */
> return 1;
> failed:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..c6fc55d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -497,6 +497,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
> __remove_from_page_cache(page);
> spin_unlock_irq(&mapping->tree_lock);
> mem_cgroup_uncharge_cache_page(page);
> +
> + if (mapping->a_ops->freepage)
> + mapping->a_ops->freepage(page);
> }
>
> return 1;
> --
> 1.7.3.2
On 12/01/2010 11:47 AM, Linus Torvalds wrote:
> On Wed, Dec 1, 2010 at 8:35 AM, Rik van Riel<[email protected]> wrote:
>>
>> Surely somebody can have just looked up the page and
>> gotten a reference count, right before your ->freepage
>> call is invoked?
>
> No.
>
> The removal from the page cache is atomic, even in the presence of the
> lockless lookup.
>
> The page cache lookup does a "get_page_unless_zero()" on the count, so
> when __remove_mapping() has removed the page using
> "page_freeze_refs()", it's really gone, and cannot be looked up.
Doh, you're right. I forgot to look at all the stuff that
__remove_mapping does nowadays and remembered some very old
code from vmscan.c instead.
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On Wed, Dec 1, 2010 at 7:53 PM, Trond Myklebust
<[email protected]> wrote:
>
> Again, I was being cautious (I freely admit to not having studied
> stop_machine()). If nobody disagrees with your interpretation, then I'm
> very happy to drop the preempt disable/enable crud.
Just remove it. It won't matter for the stop_machine case, and the
other case that Andrew pointed out (just unmap with memory pressure)
is an independent thing that doesn't do stop_machine anyway.
In the next merge window we'll see the whole RCU name lookup (let's
see if people can agree on it, or whether I'll just have to do an
executive decision and push it through even without consensus), which
then rcu-delays inode freeing, and that will fix it for real. When
that happens, doing a "rcu_read_lock()" in vmscan.c before the removal
of the page from the mapping, and then unlocking after the callback
will be a real fix, but it will require the rcu release to be that
real fix anyway, so doing something else now will just confuse things.
Linus
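A sketch of the shape that future RCU-based fix would take, going only by the
description above (illustrative; not code from any tree, and the refcount
freezing and error paths of the real __remove_mapping() are omitted):

static int remove_mapping_rcu_sketch(struct address_space *mapping,
				     struct page *page)
{
	void (*freepage)(struct page *) = mapping->a_ops->freepage;

	/* With RCU-delayed inode freeing, the read-side section keeps the
	 * struct address_space (and its a_ops) valid across the callback. */
	rcu_read_lock();
	spin_lock_irq(&mapping->tree_lock);
	__remove_from_page_cache(page);
	spin_unlock_irq(&mapping->tree_lock);

	if (freepage != NULL)
		freepage(page);
	rcu_read_unlock();

	return 1;
}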
On Wed, Dec 1, 2010 at 5:22 PM, Andrew Morton <[email protected]> wrote:
>
> What we're talking about is races against memory reclaim, unmount, etc.
Ahh. Those I can believe in. Although I think they'd almost
incidentally be fixed by making inode freeing (which is where the
'struct address_space' is embedded) RCU-safe, which we're going to do
anyway in 38. Then we could make the vmscan code just be a rcu-read
section.
Of course, I do think the race is basically impossible to hit in
practice regardless.
Linus
On Wed, 2010-12-01 at 11:23 -0800, Hugh Dickins wrote:
> On Wed, 1 Dec 2010, Trond Myklebust wrote:
> > On Wed, 2010-12-01 at 08:17 -0800, Linus Torvalds wrote:
> > > include/linux/fs.h | 1 +
> > > mm/vmscan.c | 3 +++
> > > 2 files changed, 4 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index c9e06cc..090f0ea 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > sector_t (*bmap)(struct address_space *, sector_t);
> > > void (*invalidatepage) (struct page *, unsigned long);
> > > int (*releasepage) (struct page *, gfp_t);
> > > + void (*freepage)(struct page *);
> > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec
> > > *iov,
> > > loff_t offset, unsigned long nr_segs);
> > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d31d7ce..1accb01 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -499,6 +499,9 @@ static int __remove_mapping(struct address_space
> > > *mapping, struct page *page)
> > > mem_cgroup_uncharge_cache_page(page);
> > > }
> > >
> > > + if (mapping->a_ops->freepage)
> > > + mapping->a_ops->freepage(page);
> >
> > Hmm... Looking again at the problem, it appears that the same callback
> > needs to be added to truncate_complete_page() and
> > invalidate_complete_page2(). Otherwise we end up in a situation where
> > the page can sometimes be removed from the page cache without calling
> > freepage().
> >
> > > +
> > > return 1;
> > >
> > > cannot_free:
>
> Yes, I was wondering quite how we would define this ->freepage thing,
> if it gets called from one place that removes from page cache and not
> from others.
>
> Another minor problem with it: it would probably need to take the
> struct address_space *mapping as arg as well as struct page *page:
> because by this time page->mapping has been reset to NULL.
>
> But I'm not at all keen on adding a calllback in this very special
> frozen-to-0-references place: please let's not do it without an okay
> from Nick Piggin (now Cc'ed).
>
> I agree completely with what Linus said originally about how the
> page cannot be freed while there's a reference to it, and it should
> be possible to work this without your additional page locks.
>
> Your ->releasepage should be able to judge whether the page is likely
> (not certain) to be freed - page_count 3? 1 for the page cache, 1 for
> the page_private reference, 1 for vmscan's reference, I think. Then
> it can mark !PageUptodate and proceed with freeing the stuff you had
> allocated, undo page_has_private and its reference, and return 1 (or
> return 0 if it decides to hold on to the page).
That is very brittle. I'd prefer not to have to scan linux-mm every week
in order to find out if someone changed the page_count.
However, while reading Documentation/filesystems/vfs.txt (in order to
add documentation for freepage) I was surprised to read that the
->releasepage() is itself supposed to be allowed to actually remove the
page from the address space if it so desires.
Looking at the actual code in shrink_page_list() and friends I can't see
how that can possibly fail to break things, but if it were true, then
that might enable us to call remove_mapping() in order to safely free
the page before it gets cleared.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
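For concreteness, a sketch of the releasepage-only approach Hugh outlines
above and which the reply calls brittle (hypothetical code: the magic count of
3 and the nfs_cache_array layout, as sketched earlier in the thread, are
assumptions, not anything posted here):

#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

static int nfs_readdir_releasepage_sketch(struct page *page, gfp_t gfp)
{
	struct nfs_cache_array *array;
	int i;

	/* Heuristic: 1 ref for the page cache, 1 for page_private,
	 * 1 for vmscan's isolation -- anything else means "still in use". */
	if (page_count(page) != 3)
		return 0;			/* keep the page */

	ClearPageUptodate(page);

	array = kmap_atomic(page, KM_USER0);
	for (i = 0; i < array->size; i++)
		kfree(array->array[i].string.name);
	kunmap_atomic(array, KM_USER0);

	ClearPagePrivate(page);
	page_cache_release(page);		/* drop the page_private ref */
	return 1;
}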
Before we can fix the memory leak due to not freeing up the contents
of the nfs_cache_array when clearing the page cache, we need to
ensure that shrink_page_list can't just lock the page and
call try_to_release_page() while we're accessing the array.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 16 +++++++++++++++-
1 files changed, 15 insertions(+), 1 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e03537f..3ec3f1c 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -639,6 +639,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
+ unlock_page(desc->page);
page_cache_release(desc->page);
desc->page = NULL;
}
@@ -646,8 +647,20 @@ void cache_page_release(nfs_readdir_descriptor_t *desc)
static
struct page *get_cache_page(nfs_readdir_descriptor_t *desc)
{
- return read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping,
+ struct page *page;
+
+ for (;;) {
+ page = read_cache_page(desc->file->f_path.dentry->d_inode->i_mapping,
desc->page_index, (filler_t *)nfs_readdir_filler, desc);
+ if (IS_ERR(page))
+ break;
+ lock_page(page);
+ if (page->mapping != NULL && PageUptodate(page))
+ break;
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ return page;
}
/*
@@ -771,6 +784,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent,
desc->page_index = 0;
desc->last_cookie = *desc->dir_cookie;
desc->page = page;
+ lock_page(page);
status = nfs_readdir_xdr_to_array(desc, page, inode);
if (status < 0)
--
1.7.3.2
On Wed, 2010-12-01 at 15:46 -0800, Andrew Morton wrote:
> On Wed, 01 Dec 2010 18:21:16 -0500
> Trond Myklebust <[email protected]> wrote:
>
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
> > clear_page_mlock(page);
> > remove_from_page_cache(page);
> > ClearPageMappedToDisk(page);
> > +
> > + if (mapping->a_ops->freepage)
> > + mapping->a_ops->freepage(page);
> > +
> > page_cache_release(page); /* pagecache ref */
> > return 0;
> > }
>
> So here we're assuming that `mapping' was pinned by other means.
>
> Fair enough, although subtle. Even drop_pagecache_sb() got it right ;)
>
> > @@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
> > __remove_from_page_cache(page);
> > spin_unlock_irq(&mapping->tree_lock);
> > mem_cgroup_uncharge_cache_page(page);
> > +
> > + if (mapping->a_ops->freepage)
> > + mapping->a_ops->freepage(page);
> > +
> > page_cache_release(page); /* pagecache ref */
> > return 1;
> > failed:
>
> And here.
Yes. Both these functions are static, and their callers are assuming
that something is already pinning the underlying inode, so the above
should be quite safe.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
On Wed, 01 Dec 2010 15:10:50 -0500
Trond Myklebust <[email protected]> wrote:
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -602,6 +602,7 @@ struct address_space_operations {
> sector_t (*bmap)(struct address_space *, sector_t);
> void (*invalidatepage) (struct page *, unsigned long);
> int (*releasepage) (struct page *, gfp_t);
> + void (*freepage)(struct page *);
> ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> loff_t offset, unsigned long nr_segs);
> int (*get_xip_mem)(struct address_space *, pgoff_t, int,
It would be good to think about and then clearly spell out exactly what
state the page is in here. It is locked, and I assume clean and not
under writeback. What about its refcount, freezedness status and
eligibility for lookups?
And as Hugh pointed out, some callees might need the address_space*
although we can perhaps defer that until such a callee turns up.
If/when that happens we might have a problem though: if this locked
page is no longer attached to the address_space then what now pins the
address_space, protecting it from inode reclaim?
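For context, a minimal sketch of the kind of non-blocking cleanup being proposed for the readdir cache; the nfs_cache_array field names are assumptions based on the structure referenced elsewhere in the thread, and this is not necessarily the code that was eventually merged. Because the page is already detached from the mapping and the callback may run from reclaim, it only takes an atomic kmap and frees the per-entry names:

/* Sketch: runs after the page has left the page cache; must not sleep. */
static void nfs_readdir_clear_array(struct page *page)
{
	struct nfs_cache_array *array;
	int i;

	array = kmap_atomic(page, KM_USER0);
	for (i = 0; i < array->size; i++)
		kfree(array->array[i].string.name);	/* entry names are separately allocated */
	array->size = 0;
	kunmap_atomic(array, KM_USER0);
}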
We need to use the cookie from the previous array entry, not the
actual cookie that we are searching for (except for the case of
uncached_readdir).
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/dir.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f0a384e..e03537f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -178,6 +178,7 @@ typedef struct {
struct page *page;
unsigned long page_index;
u64 *dir_cookie;
+ u64 last_cookie;
loff_t current_index;
decode_dirent_t decode;
@@ -344,6 +345,8 @@ int nfs_readdir_search_array(nfs_readdir_descriptor_t *desc)
else
status = nfs_readdir_search_for_cookie(array, desc);
+ if (status == -EAGAIN)
+ desc->last_cookie = array->last_cookie;
nfs_readdir_release_array(desc->page);
out:
return status;
@@ -563,7 +566,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
unsigned int array_size = ARRAY_SIZE(pages);
entry.prev_cookie = 0;
- entry.cookie = *desc->dir_cookie;
+ entry.cookie = desc->last_cookie;
entry.eof = 0;
entry.fh = nfs_alloc_fhandle();
entry.fattr = nfs_alloc_fattr();
@@ -672,8 +675,10 @@ int readdir_search_pagecache(nfs_readdir_descriptor_t *desc)
{
int res;
- if (desc->page_index == 0)
+ if (desc->page_index == 0) {
desc->current_index = 0;
+ desc->last_cookie = 0;
+ }
while (1) {
res = find_cache_page(desc);
if (res != -EAGAIN)
@@ -764,6 +769,7 @@ int uncached_readdir(nfs_readdir_descriptor_t *desc, void *dirent,
}
desc->page_index = 0;
+ desc->last_cookie = *desc->dir_cookie;
desc->page = page;
status = nfs_readdir_xdr_to_array(desc, page, inode);
--
1.7.3.2
On Tue, Nov 30, 2010 at 8:29 PM, Trond Myklebust
<[email protected]> wrote:
>
> I'm not worried about other readdir calls invalidating the page. My
> concern is rather about the VM memory reclaimers ejecting the page from
> the page cache, and calling nfs_readdir_clear_array while we're
> referencing the page.
I think you're making a fundamental mistake here, and you're confused
by a much deeper problem.
The thing is, the ".releasepage" callback gets called _before_ the
page actually gets removed from the page cache, and there is no
guarantee that it will always be removed at all!
In fact, anybody holding a reference to it will result in
page_freeze_refs() not successfully clearing all the refs, and that in
turn will abort the actual freeing of the page. So while you hold the
page count, your page will NOT be freed. Guaranteed.
But it is true that the ".releasepage()" function may be called. So if
your NFS release callback ends up invalidating the data on that page,
that page lock thing will make a difference, yes.
But at the same time, are you sure that you are able to then handle
the case of that page still existing in the page cache and being used
afterwards? Looking at the code, it doesn't look that way to me.
So I think you're confused, and the NFS code totally incorrectly
thinks that ".releasepage" is something that happens at the last use
of the page. It simply is not so. In fact, you seem to return 0, which
I think means "failure to release", so the VM will just mark it busy
again afterwards.
Now, I think you do have a few options:
- keep the current model. BUT! In the page cache release function
(nfs_readdir_clear_array), make sure that you also clear the
up-to-date bit, so that the page gets read back in since it no longer
contains any valid information. And return success for the
"releasepage" operatioin.
Alternatively:
- introduce a callback for the case of the page actually being gone
from the page cache, which gets called _after_ the removal.
which seems to be what you really want, since for you the releasepage
thing is about releasing the data structures associated with that
cache. So you don't want to worry about the page lock, and you don't
want to worry about the case of "maybe it won't get released at all
after this because somebody still holds a ref-count".
> As far as I can see, the only way to protect against that is to lock the
> page, perform the usual tests and then release the page lock when we're
> done...
I think one of us is confused. And it's possible that it is me. It's
been a _long_ time since I looked at that code, and I may well be
missing something. But I get the strong feeling you're mis-using
'.releasepage' and confused about the semantics.
Linus "maybe confused myself" Torvalds
On Wed, 01 Dec 2010 16:51:12 -0500
Trond Myklebust <[email protected]> wrote:
> On Wed, 2010-12-01 at 13:38 -0800, Andrew Morton wrote:
> > On Wed, 1 Dec 2010 13:15:07 -0800 (PST)
> > Hugh Dickins <[email protected]> wrote:
> >
> > > On Wed, 1 Dec 2010, Hugh Dickins wrote:
> > > > On Wed, 1 Dec 2010, Andrew Morton wrote:
> > > > > On Wed, 01 Dec 2010 15:10:50 -0500
> > > > > Trond Myklebust <[email protected]> wrote:
> > > > >
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > > > > sector_t (*bmap)(struct address_space *, sector_t);
> > > > > > void (*invalidatepage) (struct page *, unsigned long);
> > > > > > int (*releasepage) (struct page *, gfp_t);
> > > > > > + void (*freepage)(struct page *);
> > > > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> > > > > > loff_t offset, unsigned long nr_segs);
> > > > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > > > >
> > > > > It would be good to think about and then clearly spell out exactly what
> > > > > state the page is in here. It is locked, and I assume clean and not
> > > > > under writeback. What about its refcount, freezedness status and
> > > > > eligibility for lookups?
> > > > >
> > > > > And as Hugh pointed out, some callees might need the address_space*
> > > > > although we can perhaps defer that until such a callee turns up.
> > > > > If/when that happens we might have a problem though: if this locked
> > > > > page is no longer attached to the address_space then what now pins the
> > > > > address_space, protecting it from inode reclaim?
> > > >
> > > > That's an excellent point and trumps mine: it would be actively wrong
> > > > to provide the struct address_space *mapping arg I was asking for.
> > > > (Bet someone then tries stashing it away via page->private though.)
> > >
> > > Hmm, thinking further along the same lines: can we even guarantee that
> > > the filesystem module is still loaded at that point? i.e. might
> > > mapping->freepage now be pointing off into the garbage heap?
> >
> > I don't see anything on the VFS side which would prevent a module
> > unload. Or, more realistically, a concurrent unmount, freeing of the
> > superblock and everything associated with it. All we have here is a
> > page*.
> >
> > Probably on most call paths we'll be OK - if a process is in the middle
> > of a file truncate, holding a file* ref which holds an inode ref, then
> > nobody will be unmounting that fs and hence nobody will be unloading
> > that module.
> >
> > However on the random_code->alloc_page->vmscan->releasepage path, none
> > of that applies.
>
> Just out of interest, what ensures that the mapping is still around for
> the 'spin_unlock_irq(&mapping->tree_lock);' in __remove_mapping()?
Nothing, afaict.
I think this was the race which I taunted the mm developers about a
couple of months back (can't find the email) and nobody contradicted
me at that time.
> I'm clearly missing whatever mechanism prevents iput_final() from racing
> with vmscan if the latter clears out the last page from the mapping.
The mechanism is called "luck". Way back in the 2.5.late days there
was such a bug in the kernel (inode was reclaimed while vmscan was
playing with the address_space) and I was able to trigger oopses from
it. It required really massive, withering amounts of memory pressure.
Stupid amounts. I should dig out those tools and remember how to
operate them...
On Wed, Dec 1, 2010 at 3:36 PM, Andrew Morton <[email protected]> wrote:
> On Wed, 1 Dec 2010 15:31:11 -0800
> Linus Torvalds <[email protected]> wrote:
>>
>> Ergo: vmscan has a locked page leads to the filesystem being
>> guaranteed to not be unmounted. ?And that, in turn, guarantees that
>> the module won't be unloaded until the machine has gone through an
>> idle cycle.
>
> The page isn't attached to the address_space any more:
Did you even read the email?
Here, let me quote the important parts:
"module won't be unloaded until the machine has gone through an idle cycle"
"This is pretty much how all the module races are handled. Doing module
ref-counts per page (or per packet in flight for things like
networking) would be prohibitively expensive."
IOW, the whole "stop_machine()" part is fundamental. That whole
"module unload won't happen until we've gone through an idle cycle" is
EXACTLY why we don't need to have the page attached or ref-counted -
we're still safe.
Linus
From: Linus Torvalds <[email protected]>
NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.
This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.
Original patch by: Linus Torvalds <[email protected]>
[trondmy: cover the cases of invalidate_inode_pages2() and
truncate_inode_pages()]
Signed-off-by: Trond Myklebust <[email protected]>
---
Documentation/filesystems/Locking | 7 ++++++-
Documentation/filesystems/vfs.txt | 7 +++++++
include/linux/fs.h | 1 +
mm/filemap.c | 5 +++++
mm/truncate.c | 4 ++++
mm/vmscan.c | 7 +++++++
6 files changed, 30 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index a91f308..b6426f1 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -173,12 +173,13 @@ prototypes:
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
+ void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*launder_page) (struct page *);
locking rules:
- All except set_page_dirty may block
+ All except set_page_dirty and freepage may block
BKL PageLocked(page) i_mutex
writepage: no yes, unlocks (see below)
@@ -193,6 +194,7 @@ perform_write: no n/a yes
bmap: no
invalidatepage: no yes
releasepage: no yes
+freepage: no yes
direct_IO: no
launder_page: no yes
@@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.
+ ->freepage() is called when the kernel is done dropping the page
+from the page cache.
+
->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..3b14a55 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -534,6 +534,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
+ void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
@@ -679,6 +680,12 @@ struct address_space_operations {
need to ensure this. Possibly it can clear the PageUptodate
bit if it cannot free private data yet.
+ freepage: freepage is called once the page is no longer visible in
+ the page cache in order to allow the cleanup of any private
+ data. Since it may be called by the memory reclaimer, it
+ should not assume that the original address_space mapping still
+ exists, and it should not block.
+
direct_IO: called by the generic read/write routines to perform
direct_IO - that is IO requests which bypass the page cache
and transfer data directly between the storage and the
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c9e06cc..090f0ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -602,6 +602,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, gfp_t);
+ void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*get_xip_mem)(struct address_space *, pgoff_t, int,
diff --git a/mm/filemap.c b/mm/filemap.c
index ea89840..6b9aee2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -143,13 +143,18 @@ void __remove_from_page_cache(struct page *page)
void remove_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ void (*freepage)(struct page *);
BUG_ON(!PageLocked(page));
+ freepage = mapping->a_ops->freepage;
spin_lock_irq(&mapping->tree_lock);
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
+
+ if (freepage)
+ freepage(page);
}
EXPORT_SYMBOL(remove_from_page_cache);
diff --git a/mm/truncate.c b/mm/truncate.c
index ba887bf..3c2d5dd 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -390,6 +390,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
+
+ if (mapping->a_ops->freepage)
+ mapping->a_ops->freepage(page);
+
page_cache_release(page); /* pagecache ref */
return 1;
failed:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..9ca587c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -494,9 +494,16 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
spin_unlock_irq(&mapping->tree_lock);
swapcache_free(swap, page);
} else {
+ void (*freepage)(struct page *);
+
+ freepage = mapping->a_ops->freepage;
+
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
+
+ if (freepage != NULL)
+ freepage(page);
}
return 1;
--
1.7.3.2
On Wed, 1 Dec 2010 13:15:07 -0800 (PST)
Hugh Dickins <[email protected]> wrote:
> On Wed, 1 Dec 2010, Hugh Dickins wrote:
> > On Wed, 1 Dec 2010, Andrew Morton wrote:
> > > On Wed, 01 Dec 2010 15:10:50 -0500
> > > Trond Myklebust <[email protected]> wrote:
> > >
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -602,6 +602,7 @@ struct address_space_operations {
> > > > sector_t (*bmap)(struct address_space *, sector_t);
> > > > void (*invalidatepage) (struct page *, unsigned long);
> > > > int (*releasepage) (struct page *, gfp_t);
> > > > + void (*freepage)(struct page *);
> > > > ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
> > > > loff_t offset, unsigned long nr_segs);
> > > > int (*get_xip_mem)(struct address_space *, pgoff_t, int,
> > >
> > > It would be good to think about and then clearly spell out exactly what
> > > state the page is in here. It is locked, and I assume clean and not
> > > under writeback. What about its refcount, freezedness status and
> > > eligibility for lookups?
> > >
> > > And as Hugh pointed out, some callees might need the address_space*
> > > although we can perhaps defer that until such a callee turns up.
> > > If/when that happens we might have a problem though: if this locked
> > > page is no longer attached to the address_space then what now pins the
> > > address_space, protecting it from inode reclaim?
> >
> > That's an excellent point and trumps mine: it would be actively wrong
> > to provide the struct address_space *mapping arg I was asking for.
> > (Bet someone then tries stashing it away via page->private though.)
>
> Hmm, thinking further along the same lines: can we even guarantee that
> the filesystem module is still loaded at that point? i.e. might
> mapping->freepage now be pointing off into the garbage heap?
I don't see anything on the VFS side which would prevent a module
unload. Or, more realistically, a concurrent unmount, freeing of the
superblock and everything associated with it. All we have here is a
page*.
Probably on most call paths we'll be OK - if a process is in the middle
of a file truncate, holding a file* ref which holds an inode ref, then
nobody will be unmounting that fs and hence nobody will be unloading
that module.
However on the random_code->alloc_page->vmscan->releasepage path, none
of that applies.
From: Linus Torvalds <[email protected]>
NFS needs to be able to release objects that are stored in the page
cache once the page itself is no longer visible from the page cache.
This patch adds a callback to the address space operations that allows
filesystems to perform page cleanups once the page has been removed
from the page cache.
Original patch by: Linus Torvalds <[email protected]>
[trondmy: cover the cases of invalidate_inode_pages2() and
truncate_inode_pages()]
Signed-off-by: Trond Myklebust <[email protected]>
---
Documentation/filesystems/Locking | 7 ++++++-
Documentation/filesystems/vfs.txt | 7 +++++++
include/linux/fs.h | 1 +
mm/truncate.c | 8 ++++++++
mm/vmscan.c | 11 +++++++++++
5 files changed, 33 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index a91f308..b6426f1 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -173,12 +173,13 @@ prototypes:
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
+ void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*launder_page) (struct page *);
locking rules:
- All except set_page_dirty may block
+ All except set_page_dirty and freepage may block
BKL PageLocked(page) i_mutex
writepage: no yes, unlocks (see below)
@@ -193,6 +194,7 @@ perform_write: no n/a yes
bmap: no
invalidatepage: no yes
releasepage: no yes
+freepage: no yes
direct_IO: no
launder_page: no yes
@@ -288,6 +290,9 @@ buffers from the page in preparation for freeing it. It returns zero to
indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
the kernel assumes that the fs has no private interest in the buffers.
+ ->freepage() is called when the kernel is done dropping the page
+from the page cache.
+
->launder_page() may be called prior to releasing a page if
it is still found to be dirty. It returns zero if the page was successfully
cleaned, or an error value if not. Note that in order to prevent the page
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..3b14a55 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -534,6 +534,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
+ void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
struct page* (*get_xip_page)(struct address_space *, sector_t,
@@ -679,6 +680,12 @@ struct address_space_operations {
need to ensure this. Possibly it can clear the PageUptodate
bit if it cannot free private data yet.
+ freepage: freepage is called once the page is no longer visible in
+ the page cache in order to allow the cleanup of any private
+ data. Since it may be called by the memory reclaimer, it
+ should not assume that the original address_space mapping still
+ exists, and it should not block.
+
direct_IO: called by the generic read/write routines to perform
direct_IO - that is IO requests which bypass the page cache
and transfer data directly between the storage and the
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c9e06cc..090f0ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -602,6 +602,7 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, gfp_t);
+ void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
int (*get_xip_mem)(struct address_space *, pgoff_t, int,
diff --git a/mm/truncate.c b/mm/truncate.c
index ba887bf..76ab2a8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -108,6 +108,10 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
clear_page_mlock(page);
remove_from_page_cache(page);
ClearPageMappedToDisk(page);
+
+ if (mapping->a_ops->freepage)
+ mapping->a_ops->freepage(page);
+
page_cache_release(page); /* pagecache ref */
return 0;
}
@@ -390,6 +394,10 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
+
+ if (mapping->a_ops->freepage)
+ mapping->a_ops->freepage(page);
+
page_cache_release(page); /* pagecache ref */
return 1;
failed:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..9e218e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -454,6 +454,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
+ preempt_disable();
spin_lock_irq(&mapping->tree_lock);
/*
* The non racy check for a busy page.
@@ -492,10 +493,19 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
swp_entry_t swap = { .val = page_private(page) };
__delete_from_swap_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ preempt_enable();
swapcache_free(swap, page);
} else {
+ void (*freepage)(struct page *);
+
+ freepage = mapping->a_ops->freepage;
+
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
+ if (freepage != NULL)
+ freepage(page);
+ preempt_enable();
+
mem_cgroup_uncharge_cache_page(page);
}
@@ -503,6 +513,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
cannot_free:
spin_unlock_irq(&mapping->tree_lock);
+ preempt_enable();
return 0;
}
--
1.7.3.2
On Tue, 2010-11-30 at 21:06 -0800, Linus Torvalds wrote:
> On Tue, Nov 30, 2010 at 8:29 PM, Trond Myklebust
> <[email protected]> wrote:
> >
> > I'm not worried about other readdir calls invalidating the page. My
> > concern is rather about the VM memory reclaimers ejecting the page from
> > the page cache, and calling nfs_readdir_clear_array while we're
> > referencing the page.
>
> I think you're making a fundamental mistake here, and you're confused
> by a much deeper problem.
>
> The thing is, the ".releasepage" callback gets called _before_ the
> page actually gets removed from the page cache, and there is no
> guarantee that it will always be removed at all!
>
> In fact, anybody holding a reference to it will result in
> page_freeze_refs() not successfully clearing all the refs, and that in
> turn will abort the actual freeing of the page. So while you hold the
> page count, your page will NOT be freed. Guaranteed.
>
> But it is true that the ".releasepage()" function may be called. So if
> your NFS release callback ends up invalidating the data on that page,
> that page lock thing will make a difference, yes.
>
> But at the same time, are you sure that you are able to then handle
> the case of that page still existing in the page cache and being used
> afterwards? Looking at the code, it doesn't look that way to me.
>
> So I think you're confused, and the NFS code totally incorrectly
> thinks that ".releasepage" is something that happens at the last use
> of the page. It simply is not so. In fact, you seem to return 0, which
> I think means "failure to release", so the VM will just mark it busy
> again afterwards.
>
> Now, I think you do have a few options:
>
> - keep the current model. BUT! In the page cache release function
> (nfs_readdir_clear_array), make sure that you also clear the
> up-to-date bit, so that the page gets read back in since it no longer
> contains any valid information. And return success for the
> "releasepage" operatioin.
This is the option I'm attempting for now. I'm quite fine with the page
only being readable while under the page lock since this is readdir, and
so we don't have to worry about mmap() and its ilk.
My latest incarnation of the patches has added in all the PagePrivate()
foo to make try_to_release_page() work, and also adds in an
invalidatepage() address space operation (I rediscovered that
truncate_inode_pages() will otherwise call block_invalidatepage, and
subsequently Oops).
I'm still in the process of testing, but the latest patchset appears to
be doing all the right things for now. I'll post an update later today
so that Nick and others can test too.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
[email protected]
http://www.netapp.com
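For reference, one plausible wiring of the pieces discussed in this thread, using the ->freepage hook introduced above (a sketch; the merged series may also hook ->releasepage and ->invalidatepage as Trond describes):

/* Sketch: have the VM call the readdir cleanup once a directory page
 * has actually been removed from the page cache. */
const struct address_space_operations nfs_dir_addr_space_ops = {
	.freepage = nfs_readdir_clear_array,
};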