2007-10-07 19:20:39

by Erez Zadok

Subject: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back
to the VFS/VM. Indeed some filesystems such as tmpfs can return
AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also
return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it.

Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
Therefore, some user programs fail, especially if they're written like this:

err = msync(...);
if (err != 0)
// fail

They temporarily fixed the specific program in question (apt-get) to check

if (err < 0)
// fail

Is this a bug indeed, or are user programs supposed to handle
AOP_WRITEPAGE_ACTIVATE (I hope not the latter)? If it's a kernel bug, what
should the kernel return: a zero, or an -errno (and which one)?

Thanks,
Erez.


2007-10-07 19:58:23

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Erez,

On 10/7/07, Erez Zadok <[email protected]> wrote:
> Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
> returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
> Therefore, some user programs fail, esp. if they're written such as
> this:

[snip]

On 10/7/07, Erez Zadok <[email protected]> wrote:
> Is this a bug indeed, or are user programs supposed to handle
> AOP_WRITEPAGE_ACTIVATE (I hope not the latter)? If it's a kernel bug,
> what should the kernel return: a zero, or an -errno (and which one)?

It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid
writeback of the page in the near future. I wonder if it's enough that we
change the return value to zero from
mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVATE...

Pekka

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 63512a9..717f341 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -672,8 +672,10 @@ retry:

ret = (*writepage)(page, wbc, data);

- if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
+ if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
+ ret = 0;
+ }
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {

2007-10-08 01:59:13

by Ryan Finnie

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On 10/7/07, Pekka J Enberg <[email protected]> wrote:
> On 10/7/07, Erez Zadok <[email protected]> wrote:
> > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
> > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
> > Therefore, some user programs fail, esp. if they're written such as
> > this:
>
...
> It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid
> writeback of the page in the near future. I wonder if it's enough that we
> change the return value to zero from
> mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVATE...

Doesn't appear to be enough. I can't figure out why (since it appears
write_cache_pages bubbles up directly to sys_msync), but with that
patch applied, in my test case[1], msync returns -1 EIO. However,
with the exact same kernel without that patch applied, msync returns
524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips
524288 to 0, I can't figure out how it eventually returns -1 EIO.

Ryan

[1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom,
standard livecd setup

2007-10-08 11:18:52

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Ryan,

On 10/8/07, Ryan Finnie <[email protected]> wrote:
> Doesn't appear to be enough. I can't figure out why (since it appears
> write_cache_pages bubbles up directly to sys_msync), but with that
> patch applied, in my test case[1], msync returns -1 EIO. However,
> with the exact same kernel without that patch applied, msync returns
> 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips
> 524288 to 0, I can't figure out how it eventually returns -1 EIO.
>
> [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom,
> standard livecd setup

You have swap device disabled, right? If so, I can't see any reason
why msync(2) on tmpfs would return -EIO. Can you please send a strace
log for your test case?

Pekka

2007-10-11 21:48:18

by Andrew Morton

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Sun, 7 Oct 2007 15:20:19 -0400
Erez Zadok <[email protected]> wrote:

> According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back
> to the VFS/VM. Indeed some filesystems such as tmpfs can return
> AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also
> return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it.
>
> Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
> returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
> Therefore, some user programs fail, esp. if they're written such as this:
>
> err = msync(...);
> if (err != 0)
> // fail
>
> They temporarily fixed the specific program in question (apt-get) to check
>
> if (err < 0)
> // fail
>
> Is this a bug indeed, or are user programs supposed to handle
> AOP_WRITEPAGE_ACTIVATE (I hope not the latter)? If it's a kernel bug, what
> should the kernel return: a zero, or an -errno (and which one)?
>

shit. That's a nasty bug. Really userspace should be testing for -1, but
the msync() library function should only ever return 0 or -1.

Does this fix it?

--- a/mm/page-writeback.c~a
+++ a/mm/page-writeback.c
@@ -850,8 +850,10 @@ retry:

ret = (*writepage)(page, wbc, data);

- if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
+ if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
+ ret = 0;
+ }
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
_

2007-10-11 22:12:20

by Ryan Finnie

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On 10/11/07, Andrew Morton <[email protected]> wrote:
> shit. That's a nasty bug. Really userspace should be testing for -1, but
> the msync() library function should only ever return 0 or -1.
>
> Does this fix it?
>
> --- a/mm/page-writeback.c~a
> +++ a/mm/page-writeback.c
> @@ -850,8 +850,10 @@ retry:
>
> ret = (*writepage)(page, wbc, data);
>
> - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
> + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> unlock_page(page);
> + ret = 0;
> + }
> if (ret || (--(wbc->nr_to_write) <= 0))
> done = 1;
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> _
>

Pekka Enberg replied with an identical patch a few days ago, but for
some reason the same condition flows up to msync as -1 EIO instead of
AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the
thread is below. Thanks.

Ryan

On 10/7/07, Ryan Finnie <[email protected]> wrote:
> On 10/7/07, Pekka J Enberg <[email protected]> wrote:
> > On 10/7/07, Erez Zadok <[email protected]> wrote:
> > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
> > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
> > > Therefore, some user programs fail, esp. if they're written such as
> > > this:
> >
> ...
> > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid
> > writeback of the page in the near future. I wonder if it's enough that we
> > change the return value to zero from
> > mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVATE...
>
> Doesn't appear to be enough. I can't figure out why (since it appears
> write_cache_pages bubbles up directly to sys_msync), but with that
> patch applied, in my test case[1], msync returns -1 EIO. However,
> with the exact same kernel without that patch applied, msync returns
> 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips
> 524288 to 0, I can't figure out how it eventually returns -1 EIO.
>
> Ryan
>
> [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom,
> standard livecd setup
>

2007-10-12 00:39:15

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Thu, 11 Oct 2007, Ryan Finnie wrote:
> On 10/11/07, Andrew Morton <[email protected]> wrote:
> > shit. That's a nasty bug. Really userspace should be testing for -1, but
> > the msync() library function should only ever return 0 or -1.
> >
> > Does this fix it?
> >
> > --- a/mm/page-writeback.c~a
> > +++ a/mm/page-writeback.c
> > @@ -850,8 +850,10 @@ retry:
> >
> > ret = (*writepage)(page, wbc, data);
> >
> > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
> > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > unlock_page(page);
> > + ret = 0;
> > + }
> > if (ret || (--(wbc->nr_to_write) <= 0))
> > done = 1;
> > if (wbc->nonblocking && bdi_write_congested(bdi)) {
> > _
> >
>
> Pekka Enberg replied with an identical patch a few days ago, but for
> some reason the same condition flows up to msync as -1 EIO instead of
> AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the
> thread is below. Thanks.

Each time I sit down to follow what's going on with writepage and
unionfs and msync, I get distracted: I really haven't researched
this properly.

But I keep suspecting that the answer might be the patch below (which
rather follows what drivers/block/rd.c is doing). I'm especially
worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
to userspace, bad enough in itself, you might be liable to hit that
BUG_ON(page_mapped(page)). shmem_writepage does not expect to be
called by anyone outside mm/vmscan.c, but unionfs can now get to it?

Please let us know if this patch does fix it:
then I'll try harder to work out what goes on.

Thanks,
Hugh

--- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.000000000 +0100
+++ linux/mm/shmem.c 2007-10-12 01:25:46.000000000 +0100
@@ -916,6 +916,11 @@ static int shmem_writepage(struct page *
struct inode *inode;

BUG_ON(!PageLocked(page));
+ if (!wbc->for_reclaim) {
+ set_page_dirty(page);
+ unlock_page(page);
+ return 0;
+ }
BUG_ON(page_mapped(page));

mapping = page->mapping;

2007-10-12 21:49:47

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Hugh,

On 10/12/07, Hugh Dickins <[email protected]> wrote:
> But I keep suspecting that the answer might be the patch below (which
> rather follows what drivers/block/rd.c is doing). I'm especially
> worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
> to userspace, bad enough in itself, you might be liable to hit that
> BUG_ON(page_mapped(page)). shmem_writepage does not expect to be
> called by anyone outside mm/vmscan.c, but unionfs can now get to it?

Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
without unionfs even?

Pekka

2007-10-14 08:45:09

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Sat, 13 Oct 2007, Pekka Enberg wrote:
> On 10/12/07, Hugh Dickins <[email protected]> wrote:
> > But I keep suspecting that the answer might be the patch below (which
> > rather follows what drivers/block/rd.c is doing). I'm especially
> > worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned
> > to userspace, bad enough in itself, you might be liable to hit that
> > BUG_ON(page_mapped(page)). shmem_writepage does not expect to be
> > called by anyone outside mm/vmscan.c, but unionfs can now get to it?
>
> Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
> without unionfs even?

I believe not. Please do double-check my assertions, I've always found
the _writepages paths rather twisty; but my belief (supported by the
fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
in five years) is that tmpfs/shmem opts out of all of that with its
.capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
in shmem_backing_dev_info, which avoids all those _writepages avenues
(e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
just a subfunction of the _writepages.

So, while I don't disagree with your patch to write_cache_pages (though
it wasn't clear to me whether it should break from or continue the loop
if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's
really the root of the problem.

Hugh

2007-10-14 17:10:27

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Hugh,

On Sat, 13 Oct 2007, Pekka Enberg wrote:
> Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
> without unionfs even?

On 10/14/07, Hugh Dickins <[email protected]> wrote:
> I believe not. Please do double-check my assertions, I've always found
> the _writepages paths rather twisty; but my belief (supported by the
> fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
> in five years) is that tmpfs/shmem opts out of all of that with its
> .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
> in shmem_backing_dev_info, which avoids all those _writepages avenues
> (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
> just a subfunction of the _writepages.

Thanks for the explanation, you're obviously correct.

However, I don't think the mapping_cap_writeback_dirty() check in
__filemap_fdatawrite_range() works as expected when tmpfs is a lower
mount for a unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
for unionfs mappings so do_fsync() will call write_cache_pages() that
unconditionally invokes shmem_writepage() via unionfs_writepage().
Unless, of course, there's some other unionfs magic I am missing.

Pekka

2007-10-14 17:24:29

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, "Pekka Enberg" writes:
> Hi Hugh,
>
> On Sat, 13 Oct 2007, Pekka Enberg wrote:
> > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages()
> > without unionfs even?
>
> On 10/14/07, Hugh Dickins <[email protected]> wrote:
> > I believe not. Please do double-check my assertions, I've always found
> > the _writepages paths rather twisty; but my belief (supported by the
> > fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page))
> > in five years) is that tmpfs/shmem opts out of all of that with its
> > .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
> > in shmem_backing_dev_info, which avoids all those _writepages avenues
> > (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is
> > just a subfunction of the _writepages.
>
> Thanks for the explanation, you're obviously correct.
>
> However, I don't think the mapping_cap_writeback_dirty() check in
> __filemap_fdatawrite_range() works as expected when tmpfs is a lower
> mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
> for unionfs mappings so do_fsync() will call write_cache_pages() that
> unconditionally invokes shmem_writepage() via unionfs_writepage().
> Unless, of course, there's some other unionfs magic I am missing.
>
> Pekka

In unionfs_writepage() I tried to emulate as best possible what the lower
f/s will have returned to the VFS. Since tmpfs's ->writepage can return
AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.

Should I be doing something different when unionfs stacks on top of tmpfs?
(BTW, this is probably also relevant to ecryptfs.)

Thanks,
Erez.

2007-10-14 17:52:37

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Erez,

On Sun, 14 Oct 2007, Erez Zadok wrote:
> In unionfs_writepage() I tried to emulate as best possible what the lower
> f/s will have returned to the VFS. Since tmpfs's ->writepage can return
> AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
>
> Should I be doing something different when unionfs stacks on top of tmpfs?
> (BTW, this is probably also relevant to ecryptfs.)

Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be
calling unionfs_writepage() _at all_ if the lower mapping has
BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally
untested patch below?

Pekka

---
fs/unionfs/mmap.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

Index: linux-2.6.23-rc8/fs/unionfs/mmap.c
===================================================================
--- linux-2.6.23-rc8.orig/fs/unionfs/mmap.c
+++ linux-2.6.23-rc8/fs/unionfs/mmap.c
@@ -17,6 +17,7 @@
* published by the Free Software Foundation.
*/

+#include <linux/backing-dev.h>
#include "union.h"

/*
@@ -144,6 +145,21 @@ out:
return err;
}

+static int unionfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *lower_inode;
+ struct inode *inode;
+
+ inode = mapping->host;
+ lower_inode = unionfs_lower_inode(inode);
+
+ if (!mapping_cap_writeback_dirty(lower_inode->i_mapping))
+ return 0;
+
+ return generic_writepages(mapping, wbc);
+}
+
/*
* readpage is called from generic_page_read and the fault handler.
* If your file system uses generic_page_read for the read op, it
@@ -371,6 +387,7 @@ out:

struct address_space_operations unionfs_aops = {
.writepage = unionfs_writepage,
+ .writepages = unionfs_writepages,
.readpage = unionfs_readpage,
.prepare_write = unionfs_prepare_write,
.commit_write = unionfs_commit_write,

2007-10-14 22:32:37

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Pekka J Enberg writes:
> Hi Erez,
>
> On Sun, 14 Oct 2007, Erez Zadok wrote:
> > In unionfs_writepage() I tried to emulate as best possible what the lower
> > f/s will have returned to the VFS. Since tmpfs's ->writepage can return
> > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
> >
> > Should I be doing something different when unionfs stacks on top of tmpfs?
> > (BTW, this is probably also relevant to ecryptfs.)
>
> Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be
> calling unionfs_writepage() _at all_ if the lower mapping has
> BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally
> untested patch below?
>
> Pekka
[...]

Pekka, with a small change to your patch (to handle time-based cache
coherency), your patch worked well and passed all my tests. Thanks.

So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
from being returned to userland. I guess we still need it, b/c even with
your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
the VFS and we need to ensure that doesn't "leak" outside the kernel.

Erez.

2007-10-15 11:48:07

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi,

On 10/15/07, Erez Zadok <[email protected]> wrote:
> Pekka, with a small change to your patch (to handle time-based cache
> coherency), your patch worked well and passed all my tests. Thanks.
>
> So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
> from being returned to userland. I guess we still need it, b/c even with
> your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
> the VFS and we need to ensure that doesn't "leak" outside the kernel.

I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
!wbc->for_reclaim case which would explain why we haven't hit this bug
before. Hugh, Andrew?

And btw, I think we need to fix ecryptfs too.

Pekka

2007-10-16 18:04:41

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, "Pekka Enberg" writes:
> Hi,
>
> On 10/15/07, Erez Zadok <[email protected]> wrote:
> > Pekka, with a small change to your patch (to handle time-based cache
> > coherency), your patch worked well and passed all my tests. Thanks.
> >
> > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
> > from being returned to userland. I guess we still need it, b/c even with
> > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
> > the VFS and we need to ensure that doesn't "leak" outside the kernel.
>
> I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
> ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
> !wbc->for_reclaim case which would explain why we haven't hit this bug
> before. Hugh, Andrew?
>
> And btw, I think we need to fix ecryptfs too.

Yes, ecryptfs needs this fix too (and probably a couple of other mmap fixes
I've made to unionfs recently -- Mike Halcrow already knows :-))

Of course, running ecryptfs on top of tmpfs is somewhat odd and uncommon;
but with unionfs, users use tmpfs as the copyup branch very often.

> Pekka

Erez.

2007-10-22 19:43:36

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Sorry for my delay, here are a few replies.

On Sun, 14 Oct 2007, Erez Zadok wrote:
> In message <[email protected]>, "Pekka Enberg" writes:
> >
> > However, I don't think the mapping_cap_writeback_dirty() check in
> > __filemap_fdatawrite_range() works as expected when tmpfs is a lower
> > mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability
> > for unionfs mappings so do_fsync() will call write_cache_pages() that
> > unconditionally invokes shmem_writepage() via unionfs_writepage().
> > Unless, of course, there's some other unionfs magic I am missing.

Thanks, Pekka, yes that made a lot of sense.

>
> In unionfs_writepage() I tried to emulate as best possible what the lower
> f/s will have returned to the VFS. Since tmpfs's ->writepage can return
> AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.

I think that's inappropriate. Why should unionfs_writepage re-mark its
page as dirty when the lower level does so? Unionfs has successfully
done its write to the lower level, what the lower level then gets up to
(writing then or not) is its own business: needn't be propagated upwards.

The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better.
What I'd like most of all is to eliminate it, in favour of vmscan.c
working out the condition for itself: but I've given that no thought,
it may not be reasonable.

unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
find_lock_page: that case may be appropriate. Though I don't really
understand it: seems dangerous to be relying upon the lower level page
just happening to be there already. Isn't memory pressure then likely
to clog up with lots of upper level dirty pages which cannot get
written out to the lower level?

>
> Should I be doing something different when unionfs stacks on top of tmpfs?

I think not.

> (BTW, this is probably also relevant to ecryptfs.)

You're both agreed on that, but I don't see how: ecryptfs writes the
lower level via vfs_write, it's not using the lower level's writepage,
is it?

Hugh

2007-10-22 20:02:24

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Sun, 14 Oct 2007, Erez Zadok wrote:
> In message <[email protected]>, Pekka J Enberg writes:
> >
> > Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be
> > calling unionfs_writepage() _at all_ if the lower mapping has
> > BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally
> > untested patch below?
...

I don't disagree with your unionfs_writepages patch, Pekka, but I think
it should be viewed as an optimization (don't waste time trying to write
a group of pages when we know that nothing will be done) rather than as
essential.

Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only
been ramdisk and shmem generating it. ramdisk is careful only to
return it in the wbc->for_reclaim case: I think (as in the patch
I sent out before) shmem now ought to do so too for safety.

Back in 2.4 days it was reasonable to assume that ->writepage would
only get called from certain places, but things move faster nowadays,
and the unionfs example shows others are liable to start ab/using it.
I'll send Andrew that patch tomorrow (it's simple enough, but I'd
like at least to try to reproduce the page_mapped bug first).

>
> Pekka, with a small change to your patch (to handle time-based cache
> coherency), your patch worked well and passed all my tests. Thanks.
>
> So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
> from being returned to userland. I guess we still need it, b/c even with
> your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
> the VFS and we need to ensure that doesn't "leak" outside the kernel.

Can it now? Current git has a patch from Andrew which bears a striking
resemblance to that from Pekka, stopping the leak from write_cache_pages.

Hugh

2007-10-22 20:17:28

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Mon, 15 Oct 2007, Pekka Enberg wrote:
>
> I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
> ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
> !wbc->for_reclaim case which would explain why we haven't hit this bug
> before. Hugh, Andrew?

Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
a special code to get the LRU rotation right, not useful elsewhere.
Though Documentation/filesystems/vfs.txt does imply wider use.

I think this is where people use the phrase "go figure" ;)

Hugh

2007-10-22 20:40:35

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Hugh,

On 10/22/07, Hugh Dickins <[email protected]> wrote:
> I don't disagree with your unionfs_writepages patch, Pekka, but I think
> it should be viewed as an optimization (don't waste time trying to write
> a group of pages when we know that nothing will be done) rather than as
> essential.

Ok, so tmpfs needs your fix still.

On 10/22/07, Hugh Dickins <[email protected]> wrote:
> > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE
> > from being returned to userland. I guess we still need it, b/c even with
> > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to
> > the VFS and we need to ensure that doesn't "leak" outside the kernel.
>
> Can it now? Current git has a patch from Andrew which bears a striking
> resemblance to that from Pekka, stopping the leak from write_cache_pages.

I don't think it can, it looks ok now.

Pekka

2007-10-22 20:48:48

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Hugh,

On Mon, 15 Oct 2007, Pekka Enberg wrote:
> > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
> > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
> > !wbc->for_reclaim case which would explain why we haven't hit this bug
> > before. Hugh, Andrew?

On 10/22/07, Hugh Dickins <[email protected]> wrote:
> Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> a special code to get the LRU rotation right, not useful elsewhere.
> Though Documentation/filesystems/vfs.txt does imply wider use.
>
> I think this is where people use the phrase "go figure" ;)

Heh. As far as I can tell, the implication of "wider use" was added by
Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
VFS documentation", so perhaps he might know? Neil?

Pekka

2007-10-22 21:05:51

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Hugh Dickins writes:
> On Mon, 15 Oct 2007, Pekka Enberg wrote:
> >
> > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that
> > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for
> > !wbc->for_reclaim case which would explain why we haven't hit this bug
> > before. Hugh, Andrew?
>
> Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> a special code to get the LRU rotation right, not useful elsewhere.
> Though Documentation/filesystems/vfs.txt does imply wider use.

Yes, based on vfs.txt I figured unionfs should return
AOP_WRITEPAGE_ACTIVATE. But now that unionfs has ->writepages, which won't
even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, perhaps I
no longer need unionfs_writepage to bother checking for
AOP_WRITEPAGE_ACTIVATE, or even return it up?

But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting
BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to
handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right?

> I think this is where people use the phrase "go figure" ;)
>
> Hugh

Erez.

2007-10-22 21:38:55

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Hugh Dickins writes:
> Sorry for my delay, here are a few replies.
>

> > In unionfs_writepage() I tried to emulate as best possible what the lower
> > f/s will have returned to the VFS. Since tmpfs's ->writepage can return
> > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
>
> I think that's inappropriate. Why should unionfs_writepage re-mark its
> page as dirty when the lower level does so? Unionfs has successfully
> done its write to the lower level, what the lower level then gets up to
> (writing then or not) is its own business: needn't be propagated upwards.

What's the precise semantics of AOP_WRITEPAGE_ACTIVATE? Is it considered an
error or not? If it's an error, then I usually feel that it's important for
a stacked f/s to return that error indication upwards.

The unionfs page and the lower page are somewhat tied together, at least
logically. For unionfs's page to be considered to have been written
successfully, the lower page has to be written successfully. So again, if
the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs
page to have been written successfully or not? If I don't return
AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may
never get flushed out?

Anyway, now that unionfs has ->writepages that won't bother calling the lower
->writepage for file systems with BDI_CAP_NO_WRITEBACK, the issue of
AOP_WRITEPAGE_ACTIVATE in ->writepage may be less important.

> unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> find_lock_page: that case may be appropriate. Though I don't really
> understand it: seems dangerous to be relying upon the lower level page
> just happening to be there already. Isn't memory pressure then likely
> to clog up with lots of upper level dirty pages which cannot get
> written out to the lower level?

Based on vfs.txt (which perhaps should be revised :-), I was trying to do
the best I can to ensure that no data is lost if the current page cannot be
written out to the lower f/s.

I used to do grab_cache_page() before, but that caused problems: writepage
is not the right place to _increase_ memory pressure by allocating a new
page...

One solution I thought of is do what ecryptfs does: keep an open struct file
in my inode and call vfs_write(), but I don't see that as a significantly
cleaner/better solution. (BTW, ecryptfs kinda had to go for vfs_write b/c
it changes the data size and content of what it writes below; unionfs is
simpler in that manner b/c it needs to write the same data to the lower file
at the same offset.)

Another idea we've experimented with before is "page pointer flipping." In
writepage, we temporarily set the page->mapping->host to the lower_inode;
then we call the lower writepage with OUR page; then fix back the
page->mapping->host to the upper inode. This had two benefits: first we can
guarantee that we always have a page to write below, and second we don't
need to keep both upper and lower pages (reduces memory pressure). Before
we did this page pointer flipping, we verified that the page is locked so no
other user could observe page->mapping->host in this transient state,
and we ensured that no lower f/s was somehow caching the temporarily changed
value of page->mapping->host for later use. But, mucking with the pointers
in this manner is kinda ugly, to say the least. Still, I'd love to find a
clean and simple way that two layers can share the same struct page and
cleanly pass the upper page to a lower f/s.

If you've got suggestions how I can handle unionfs_writepage more cleanly, or
comments on the above possibilities, I'd love to hear them.

> > Should I be doing something different when unionfs stacks on top of tmpfs?
>
> I think not.
>
> > (BTW, this is probably also relevant to ecryptfs.)
>
> You're both agreed on that, but I don't see how: ecryptfs writes the
> lower level via vfs_write, it's not using the lower level's writepage,
> is it?

Yup. ecryptfs no longer does that: it recently changed things and now it
stores an open struct file in its inode, so it can always pass the file to
vfs_write. This nicely avoids calling the lower writepage, but one has to
keep an open file for every inode. Neither of the solutions currently
employed by unionfs and ecryptfs seems really satisfactory (clean and
efficient).

> Hugh

Thanks,
Erez.

2007-10-24 21:08:57

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

It's possible to provoke unionfs (not yet in mainline, though in mm
and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).
I expect it's possible to provoke the 2.6.23 ecryptfs in the same way
(but the 2.6.24 ecryptfs no longer calls lower level's ->writepage).

This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE
could leak from tmpfs via write_cache_pages and unionfs to userspace.
There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 -
writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for
that, and it's okay so far as it goes; but insufficient because it
doesn't address the underlying issue, that shmem_writepage expects
to be called only by vmscan (relying on backing_dev_info capabilities
to prevent the normal writeback path from ever approaching it).

That's an increasingly fragile expectation, and ramdisk_writepage
(the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful
to check wbc->for_reclaim before returning it. Make the same check
in shmem_writepage, thereby sidestepping the page_mapped BUG also.

Signed-off-by: Hugh Dickins <[email protected]>
---
Unionfs intends its own, third fix to these issues, checking
backing_dev_info capabilities as the normal writeback path does.
And I intend a fourth fix, getting rid of AOP_WRITEPAGE_ACTIVATE
entirely (mainly to put a stop to everybody asking what it means
and when it happens and how to handle it) - but that's a slightly
bigger patch, needing a little more testing, probably for 2.6.25.

I've CC'ed this to stable as you did for the write_cache_pages
fix: it's probably required for ecryptfs (but unionfs was much
easier to set up and test), and helpful to distros using unionfs
and checking stable for fixes. Does this make the write_cache_pages
fix redundant? Probably, but let's have both in for safety.

mm/shmem.c | 5 +++++
1 file changed, 5 insertions(+)

--- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100
+++ linux/mm/shmem.c 2007-10-24 20:24:31.000000000 +0100
@@ -915,6 +915,11 @@ static int shmem_writepage(struct page *
struct inode *inode;

BUG_ON(!PageLocked(page));
+ if (!wbc->for_reclaim) {
+ set_page_dirty(page);
+ unlock_page(page);
+ return 0;
+ }
BUG_ON(page_mapped(page));

mapping = page->mapping;

2007-10-24 21:09:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

On Wed, 24 Oct 2007 22:02:15 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> --- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100
> +++ linux/mm/shmem.c 2007-10-24 20:24:31.000000000 +0100
> @@ -915,6 +915,11 @@ static int shmem_writepage(struct page *
> struct inode *inode;
>
> BUG_ON(!PageLocked(page));
> + if (!wbc->for_reclaim) {
> + set_page_dirty(page);
> + unlock_page(page);
> + return 0;
> + }
> BUG_ON(page_mapped(page));

Needs a comment, methinks.

2007-10-24 21:39:16

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

It's possible to provoke unionfs (not yet in mainline, though in mm
and some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).
I expect it's possible to provoke the 2.6.23 ecryptfs in the same way
(but the 2.6.24 ecryptfs no longer calls lower level's ->writepage).

This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE
could leak from tmpfs via write_cache_pages and unionfs to userspace.
There's already a fix (e423003028183df54f039dfda8b58c49e78c89d7 -
writeback: don't propagate AOP_WRITEPAGE_ACTIVATE) in the tree for
that, and it's okay so far as it goes; but insufficient because it
doesn't address the underlying issue, that shmem_writepage expects
to be called only by vmscan (relying on backing_dev_info capabilities
to prevent the normal writeback path from ever approaching it).

That's an increasingly fragile assumption, and ramdisk_writepage
(the other source of AOP_WRITEPAGE_ACTIVATEs) is already careful
to check wbc->for_reclaim before returning it. Make the same check
in shmem_writepage, thereby sidestepping the page_mapped BUG also.

Signed-off-by: Hugh Dickins <[email protected]>
---
Unionfs intends its own, third fix to these issues, checking
backing_dev_info capabilities as the normal writeback path does.
And I intend a fourth fix, getting rid of AOP_WRITEPAGE_ACTIVATE
entirely (mainly to put a stop to everybody asking what it means
and when it happens and how to handle it) - but that's a slightly
bigger patch, needing a little more testing, probably for 2.6.25.

I've CC'ed this to stable as you did for the write_cache_pages
fix: it's probably required for ecryptfs (but unionfs was much
easier to set up and test), and helpful to distros using unionfs
and checking stable for fixes. Does this make the write_cache_pages
fix redundant? Probably, but let's have both in for safety.

mm/shmem.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

--- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100
+++ linux/mm/shmem.c 2007-10-24 22:31:09.000000000 +0100
@@ -915,6 +915,21 @@ static int shmem_writepage(struct page *
struct inode *inode;

BUG_ON(!PageLocked(page));
+ /*
+ * shmem_backing_dev_info's capabilities prevent regular writeback or
+ * sync from ever calling shmem_writepage; but a stacking filesystem
+ * may use the ->writepage of its underlying filesystem, in which case
+ * we want to do nothing when that underlying filesystem is tmpfs
+ * (writing out to swap is useful as a response to memory pressure, but
+ * of no use to stabilize the data) - just redirty the page, unlock it
+ * and claim success in this case. AOP_WRITEPAGE_ACTIVATE, and the
+ * page_mapped check below, must be avoided unless we're in reclaim.
+ */
+ if (!wbc->for_reclaim) {
+ set_page_dirty(page);
+ unlock_page(page);
+ return 0;
+ }
BUG_ON(page_mapped(page));

mapping = page->mapping;

2007-10-25 05:37:38

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

Hi Hugh,

On 10/25/07, Hugh Dickins <[email protected]> wrote:
> --- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100
> +++ linux/mm/shmem.c 2007-10-24 22:31:09.000000000 +0100
> @@ -915,6 +915,21 @@ static int shmem_writepage(struct page *
> struct inode *inode;
>
> BUG_ON(!PageLocked(page));
> + /*
> + * shmem_backing_dev_info's capabilities prevent regular writeback or
> + * sync from ever calling shmem_writepage; but a stacking filesystem
> + * may use the ->writepage of its underlying filesystem, in which case

I find the above bit somewhat misleading, as it implies that the
!wbc->for_reclaim case can be removed once ecryptfs has a fix similar
to unionfs's. Can we just say that while BDI_CAP_NO_WRITEBACK does
prevent some callers from entering ->writepage(), it's just an
optimization, and ->writepage() must deal with the !wbc->for_reclaim
case properly?

Pekka

2007-10-25 06:31:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

On Thu, 25 Oct 2007, Pekka Enberg wrote:
> On 10/25/07, Hugh Dickins <[email protected]> wrote:
> > --- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100
> > +++ linux/mm/shmem.c 2007-10-24 22:31:09.000000000 +0100
> > @@ -915,6 +915,21 @@ static int shmem_writepage(struct page *
> > struct inode *inode;
> >
> > BUG_ON(!PageLocked(page));
> > + /*
> > + * shmem_backing_dev_info's capabilities prevent regular writeback or
> > + * sync from ever calling shmem_writepage; but a stacking filesystem
> > + * may use the ->writepage of its underlying filesystem, in which case
>
> I find the above bit somewhat misleading as it implies that the
> !wbc->for_reclaim case can be removed after ecryptfs has similar fix
> as unionfs. Can we just say that while BDI_CAP_NO_WRITEBACK does
> prevent some callers from entering ->writepage(), it's just an
> optimization and ->writepage() must deal with !wbc->for_reclaim case
> properly?

Sorry for being obtuse, but I don't see how that's misleading at all.

ecryptfs already has a (dissimilar) fix in 2.6.24-rc1, not using the
writepage route at all. But it remains the case that some stacking
filesystem may (would you prefer "might" to "may"? "may" has a nice
double meaning of "might" and "we'll allow it", but this patch does
indeed allow it) use the ->writepage of its underlying filesystem.

With unionfs also fixed, we don't know of an absolute need for this
patch (and so, on that basis, the !wbc->for_reclaim case could indeed
be removed very soon); but as I see it, the unionfs case has shown
that it's time to future-proof this code against whatever stacking
filesystems come along. Hence I didn't mention the names of such
filesystems in the source comment.

The !page_mapped assumption has been built in there since earliest
2.4, but it took a while for us to get a way to express it in a BUG.

Hugh

2007-10-25 07:24:33

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

Hi Hugh,

On 10/25/07, Hugh Dickins <[email protected]> wrote:
> With unionfs also fixed, we don't know of an absolute need for this
> patch (and so, on that basis, the !wbc->for_reclaim case could indeed
> be removed very soon); but as I see it, the unionfs case has shown
> that it's time to future-proof this code against whatever stacking
> filesystems come along.

Heh, what can I say, after several readings, I still find your above
explanation (which I totally agree with) more to the point than the
actual comment :-).

In any case, the patch looks good to me.

Reviewed-by: Pekka Enberg <[email protected]>

Pekka

2007-10-25 15:52:43

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Mon, 22 Oct 2007, Pekka Enberg wrote:
> On 10/22/07, Hugh Dickins <[email protected]> wrote:
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
> >
> > I think this is where people use the phrase "go figure" ;)
>
> Heh. As far as I can tell, the implication of "wider use" was added by
> Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> VFS documentation", so perhaps he might know? Neil?

I take as gospel this extract from Andrew's original 2.5.52 comment:

So the locking rules for writepage() are unchanged. They are:

- Called with the page locked
- Returns with the page unlocked
- Must redirty the page itself if it wasn't all written.

But there is a new, special, hidden, undocumented, secret hack for
tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move
the page to the active list. The page must be kept locked in this one
case.

Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew
stole his own secret and used it when concocting ramdisk_writepage.
Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil
revealed the secret to the uninitiated in 2.6.17: now, what's the
appropriate punishment for that?

In the full 2.5.52 comment, Andrew explains how prior to this secret
code, we used fail_writepage, which in the memory pressure case did
an activate_page, with the intention of moving the page to the active
list - but that didn't actually work, because the page is off the
LRUs at this point, being passed around between pagevecs.

I've always preferred the way it was originally trying to do it, which
seems clearer and less error-prone than having a special code which
people then have to worry about. Here's the patch I'd like to present
in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk
simply SetPageActive for this case (and go back to obeying the usual
unlocking rule for writepage), vmscan.c observe and act accordingly.

But I've not tested it at all (well, I've run with it in, but not
actually going down the paths in question): it may suffer from
something silly like the original fail_writepage. Plus I might be
persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE)
conditional on !PageActive(page), just to produce the same stats
as before (though they don't make a lot of sense, counting other
non-writes as writes). And would it need a deprecation phase?

Hugh

Documentation/filesystems/Locking | 6 +-----
Documentation/filesystems/vfs.txt | 4 +---
drivers/block/rd.c | 5 ++---
include/linux/fs.h | 10 ----------
mm/migrate.c | 5 ++---
mm/page-writeback.c | 4 ----
mm/shmem.c | 11 ++++++++---
mm/vmscan.c | 17 ++++++-----------
8 files changed, 20 insertions(+), 42 deletions(-)

--- 2.6.24-rc1/Documentation/filesystems/Locking 2007-10-24 07:15:11.000000000 +0100
+++ linux/Documentation/filesystems/Locking 2007-10-24 08:42:07.000000000 +0100
@@ -228,11 +228,7 @@ If the filesystem is called for sync the
in-progress I/O and then start new I/O.

The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
+caller.

Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
--- 2.6.24-rc1/Documentation/filesystems/vfs.txt 2007-10-24 07:15:11.000000000 +0100
+++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.000000000 +0100
@@ -567,9 +567,7 @@ struct address_space_operations {
If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
try too hard if there are problems, and may choose to write out
other pages from the mapping if that is easier (e.g. due to
- internal dependencies). If it chooses not to start writeout, it
- should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
- calling ->writepage on that page.
+ internal dependencies).

See the file "Locking" for more details.

--- 2.6.24-rc1/drivers/block/rd.c 2007-10-24 07:15:23.000000000 +0100
+++ linux/drivers/block/rd.c 2007-10-24 08:42:07.000000000 +0100
@@ -152,8 +152,7 @@ static int ramdisk_commit_write(struct f

/*
* ->writepage to the blockdev's mapping has to redirty the page so that the
- * VM doesn't go and steal it. We return AOP_WRITEPAGE_ACTIVATE so that the VM
- * won't try to (pointlessly) write the page again for a while.
+ * VM doesn't go and steal it.
*
* Really, these pages should not be on the LRU at all.
*/
@@ -163,7 +162,7 @@ static int ramdisk_writepage(struct page
make_page_uptodate(page);
SetPageDirty(page);
if (wbc->for_reclaim)
- return AOP_WRITEPAGE_ACTIVATE;
+ SetPageActive(page);
unlock_page(page);
return 0;
}
--- 2.6.24-rc1/include/linux/fs.h 2007-10-24 07:16:01.000000000 +0100
+++ linux/include/linux/fs.h 2007-10-24 08:42:07.000000000 +0100
@@ -368,15 +368,6 @@ struct iattr {
/**
* enum positive_aop_returns - aop return codes with specific semantics
*
- * @AOP_WRITEPAGE_ACTIVATE: Informs the caller that page writeback has
- * completed, that the page is still locked, and
- * should be considered active. The VM uses this hint
- * to return the page to the active list -- it won't
- * be a candidate for writeback again in the near
- * future. Other callers must be careful to unlock
- * the page if they get this return. Returned by
- * writepage();
- *
* @AOP_TRUNCATED_PAGE: The AOP method that was handed a locked page has
* unlocked it and the page might have been truncated.
* The caller should back up to acquiring a new page and
@@ -392,7 +383,6 @@ struct iattr {
*/

enum positive_aop_returns {
- AOP_WRITEPAGE_ACTIVATE = 0x80000,
AOP_TRUNCATED_PAGE = 0x80001,
};

--- 2.6.24-rc1/mm/migrate.c 2007-10-24 07:16:04.000000000 +0100
+++ linux/mm/migrate.c 2007-10-24 08:42:07.000000000 +0100
@@ -525,9 +525,8 @@ static int writeout(struct address_space
/* I/O Error writing */
return -EIO;

- if (rc != AOP_WRITEPAGE_ACTIVATE)
- /* unlocked. Relock */
- lock_page(page);
+ /* Unlocked: relock */
+ lock_page(page);

return -EAGAIN;
}
--- 2.6.24-rc1/mm/page-writeback.c 2007-10-24 07:16:04.000000000 +0100
+++ linux/mm/page-writeback.c 2007-10-24 08:42:07.000000000 +0100
@@ -850,10 +850,6 @@ retry:

ret = (*writepage)(page, wbc, data);

- if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
- unlock_page(page);
- ret = 0;
- }
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
--- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.000000000 +0100
+++ linux/mm/shmem.c 2007-10-24 08:42:07.000000000 +0100
@@ -915,6 +915,8 @@ static int shmem_writepage(struct page *
struct inode *inode;

BUG_ON(!PageLocked(page));
+ if (!wbc->for_reclaim)
+ goto redirty;
BUG_ON(page_mapped(page));

mapping = page->mapping;
@@ -922,10 +924,10 @@ static int shmem_writepage(struct page *
inode = mapping->host;
info = SHMEM_I(inode);
if (info->flags & VM_LOCKED)
- goto redirty;
+ goto reactivate;
swap = get_swap_page();
if (!swap.val)
- goto redirty;
+ goto reactivate;

spin_lock(&info->lock);
shmem_recalc_inode(inode);
@@ -955,9 +957,12 @@ static int shmem_writepage(struct page *
unlock:
spin_unlock(&info->lock);
swap_free(swap);
+reactivate:
+ SetPageActive(page);
redirty:
set_page_dirty(page);
- return AOP_WRITEPAGE_ACTIVATE; /* Return with the page locked */
+ unlock_page(page);
+ return 0;
}

#ifdef CONFIG_NUMA
--- 2.6.24-rc1/mm/vmscan.c 2007-10-24 07:16:04.000000000 +0100
+++ linux/mm/vmscan.c 2007-10-24 08:42:07.000000000 +0100
@@ -281,8 +281,6 @@ enum pageout_io {
typedef enum {
/* failed to write page out, page is locked */
PAGE_KEEP,
- /* move page to the active list, page is locked */
- PAGE_ACTIVATE,
/* page has been sent to the disk successfully, page is unlocked */
PAGE_SUCCESS,
/* page is clean and locked */
@@ -329,8 +327,10 @@ static pageout_t pageout(struct page *pa
}
return PAGE_KEEP;
}
- if (mapping->a_ops->writepage == NULL)
- return PAGE_ACTIVATE;
+ if (mapping->a_ops->writepage == NULL) {
+ SetPageActive(page);
+ return PAGE_KEEP;
+ }
if (!may_write_to_queue(mapping->backing_dev_info))
return PAGE_KEEP;

@@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa
res = mapping->a_ops->writepage(page, &wbc);
if (res < 0)
handle_write_error(mapping, page, res);
- if (res == AOP_WRITEPAGE_ACTIVATE) {
- ClearPageReclaim(page);
- return PAGE_ACTIVATE;
- }

/*
* Wait on writeback if requested to. This happens when
@@ -538,8 +534,6 @@ static unsigned long shrink_page_list(st
switch (pageout(page, mapping, sync_writeback)) {
case PAGE_KEEP:
goto keep_locked;
- case PAGE_ACTIVATE:
- goto activate_locked;
case PAGE_SUCCESS:
if (PageWriteback(page) || PageDirty(page))
goto keep;
@@ -597,10 +591,11 @@ free_it:

activate_locked:
SetPageActive(page);
- pgactivate++;
keep_locked:
unlock_page(page);
keep:
+ if (PageActive(page))
+ pgactivate++;
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page));
}

2007-10-25 16:03:00

by Erez Zadok

[permalink] [raw]
Subject: Re: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

In message <[email protected]>, Hugh Dickins writes:
> On Thu, 25 Oct 2007, Pekka Enberg wrote:

> With unionfs also fixed, we don't know of an absolute need for this
> patch (and so, on that basis, the !wbc->for_reclaim case could indeed
> be removed very soon); but as I see it, the unionfs case has shown
> that it's time to future-proof this code against whatever stacking
> filesystems come along. Hence I didn't mention the names of such
> filesystems in the source comment.

I think "future proof" for other stackable f/s is a good idea, esp. since
many of the stackable f/s we've developed and distributed over the past 10
years are in some use in various places: gzipfs, avfs, tracefs, replayfs,
ncryptfs, versionfs, wrapfs, i3fs, and more (see http://www.filesystems.org).

Cheers,
Erez.

2007-10-25 16:48:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Mon, 22 Oct 2007, Erez Zadok wrote:
> In message <[email protected]>, Hugh Dickins writes:
> >
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
>
> Yes, based on vfs.txt I figured unionfs should return
> AOP_WRITEPAGE_ACTIVATE.

unionfs_writepage returns it in two different cases: when it can't
find the underlying page; and when the underlying writepage returns
it. I'd say it's wrong to return it in both cases.

In the first case, you don't really want your page put back to the head
of the active list, you want to come back to try it again quite soon
(I think): so you should just redirty and unlock and pretend success.

ramdisk uses A_W_A because none of its pages will ever become freeable
(and its comment points out it'd be better if they weren't even on the
LRUs - I think several people have recently been putting forward
patches to keep such timewasters off the LRUs).

shmem uses A_W_A when there's no swap (left), or when the underlying
shm is marked as locked in memory: in each case, best to move on to
look for other pages to swap out. (But I'm not quite convincing myself
that the temporarily out-of-swap case is different from yours above.)
It's about fixing some horrid busy loops where vmscan kept going
over the same hopeless pages repeatedly, instead of moving on to
better candidates. Oh, there's a third case, when move_to_swap_cache
fails: that's rare, and I think I was just too lazy to separate them.

In your second case, I fail to see why the unionfs level should
mimic the lower level: you've successfully copied data and marked
the lower level pages as dirty, vmscan will come back to those in
due course, but it's just a waste of time for it to come back to
the unionfs pages again - isn't it?

> But, now that unionfs has ->writepages which won't
> even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I
> no longer need unionfs_writepage to bother checking for
> AOP_WRITEPAGE_ACTIVATE, or even return it up?

unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to
userspace issue, as does Pekka & Andrew's patch to write_cache_pages,
as does my patch to shmem_writepage. And I'm contending that
unionfs_writepage should in no case return A_W_A up.

But so long as A_W_A is still defined, unionfs_writepage does
still need to check for it after calling the lower level ->writepage
(because it needs to do the missing unlock_page): unionfs_writepages
prevents unionfs_writepage being called on the normal writeout path,
but it's still getting called by vmscan under memory pressure.

(I'm in the habit of saying "vmscan" rather than naming the functions
in question, because every few months someone restructures that file
and changes their names. I exaggerate, but it's happened often enough.)

> But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting
> BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to
> handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right?

For so long as AOP_WRITEPAGE_ACTIVATE exists, unionfs_writepage needs to
check for it coming from the lower level ->writepage, as I said above.

But your/Pekka's unionfs_writepages doesn't need to worry about it
at all, because Andrew/Pekka's write_cache_pages fix prevents it
leaking up in the !reclaim case (as does my shmem_writepage fix):
please remove that AOP_WRITEPAGE_ACTIVATE comment from unionfs_writepages.

Hugh

2007-10-25 16:49:01

by Erez Zadok

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

That's a nice historical review, Hugh, of how we got into this mess we're in
now -- it all starts with good intentions. :-)

On a related note, I would just love to get rid of calling the lower
->writepage in unionfs b/c I can't even tell if I have a lower page to use
all the time. I'd prefer to call vfs_write() if I can, but I'll need a
struct file, or at least a dentry.

What ecryptfs does is store a struct file inside its inode, so it can use
it later in ->writepage to call vfs_write on the lower f/s. And Unionfs may
have to go in that direction too, but this trick is not terribly clean --
storing a file inside an inode.

I realize that the calling path to ->writepage doesn't have a file/dentry
any more, but if we're considering larger changes to the writepage-related
code, can we perhaps consider passing a file or dentry to ->writepage (same
as commit_write)?

Thanks,
Erez.

2007-10-25 18:04:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Mon, 22 Oct 2007, Erez Zadok wrote:
>
> What's the precise semantics of AOP_WRITEPAGE_ACTIVATE?

Sigh - not at you, at it! It's a secret that couldn't be kept secret,
a hack for tmpfs reclaim, let's just look forward to it going away.

> Is it considered an error or not?

No, it's definitely not an error. It's a private note from tmpfs
(or ramdisk) to vmscan, saying "don't waste your time coming back
to me with this page until you have to, please move on to another
more likely to be freeable".

> If it's an error, then I usually feel that it's important for
> a stacked f/s to return that error indication upwards.

Indeed, but this is not an error. Remember, neither ramdisk nor
tmpfs is stable storage: okay, tmpfs can go out to disk by using
swap, but that's not stable storage - it's not reconstituted after
reboot. (If there's an error in writing to swap, well, that's a
different issue; and there's few filesystems where such an I/O
error would be reported from ->writepage.)

>
> The unionfs page and the lower page are somewhat tied together, at least
> logically. For unionfs's page to be considered to have been written
> successfully, the lower page has to be written successfully. So again, if
> the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs
> page to have been written successfully or not?

Consider it written successfully. (What does written mean with tmpfs?
it means a page can be freed, it doesn't mean the data is forever safe.)

> If I don't return
> AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may
> never get flushed out?

Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE.
If you mark your page as clean and successfully written, vmscan will
be able to free it. If needed, we can get the data back from the
lower page on demand, but meanwhile a page has been freed, which
is what vmscan reclaim is all about. (But of course, in the case
where you couldn't get hold of a page for the lower, you must redirty
yours before returning.)

> > unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> > find_lock_page: that case may be appropriate. Though I don't really
> > understand it: seems dangerous to be relying upon the lower level page
> > just happening to be there already. Isn't memory pressure then likely
> > to clog up with lots of upper level dirty pages which cannot get
> > written out to the lower level?
>
> Based on vfs.txt (which perhaps should be revised :-), I was trying to do
> the best I can to ensure that no data is lost if the current page cannot be
> written out to the lower f/s.
>
> I used to do grab_cache_page() before, but that caused problems: writepage
> is not the right place to _increase_ memory pressure by allocating a new
> page...

Yes, but just hoping the lower page will be there, and doing nothing
to encourage it to become there, sounds an even poorer strategy to me.

It's not easy, I know. Your position reminds me of the loop driver
(drivers/block/loop.c), which has long handled this situation (with
great success, though I doubt an absolute guarantee) by taking
__GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file:
look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(),
that's new to me and seems to be for a very special case).

I grepped for gfp in unionfs, and there seems to be nothing: I doubt
you can be robust under memory pressure without doing something about
that. If you can take __GFP_IO|__GFP_FS off the lower mapping (just
while in unionfs_writepage, or longer term? what locking needed?),
then you should be able to go back to using grab_cache_page().

>
> One solution I thought of is do what ecryptfs does: keep an open struct file
> in my inode and call vfs_write(), but I don't see that as a significantly
> cleaner/better solution.

I agree with you.

> (BTW, ecryptfs kinda had to go for vfs_write b/c
> it changes the data size and content of what it writes below; unionfs is
> simpler in that manner b/c it needs to write the same data to the lower file
> at the same offset.)

Ah, yes.

>
> Another idea we've experimented with before is "page pointer flipping." In
> writepage, we temporarily set the page->mapping->host to the lower_inode;
> then we call the lower writepage with OUR page; then fix back the
> page->mapping->host to the upper inode. This had two benefits: first we can
> guarantee that we always have a page to write below, and second we don't
> need to keep both upper and lower pages (reduces memory pressure). Before
> we did this page pointer flipping, we verified that the page is locked so no
> other user could see page->mapping->host in this transient state,
> and we ensured that no lower f/s was somehow caching the temporarily changed
> value of page->mapping->host for later use. But, mucking with the pointers
> in this manner is kinda ugly, to say the least. Still, I'd love to find a
> clean and simple way that two layers can share the same struct page and
> cleanly pass the upper page to a lower f/s.

I wouldn't call it ugly, but it is exceptional and dangerous and cannot
be sanctioned without a great deal of thought; would very probably need
subtle or wide changes in core vfs/mm. shmem/tmpfs has given enough
trouble in the past with the way it switches page between filecache
and swapcache, and that imposes interesting limitations. We'd need
strong reasons (not for unionfs alone) to go down your page pointer
flipping route, but I wouldn't say it's forever out of the question.

My guess is it shouldn't flip, but that page->mapping should indicate
a list of different struct address_spaces.

The coherency benefit seems very appealing.

But more thought might prove it a nonsense.

>
> If you've got suggestions how I can handle unionfs_write more cleanly, or
> comments on the above possibilities, I'd love to hear them.

For now I think you should pursue the ~(__GFP_FS|__GFP_IO) idea somehow.

Hugh

2007-10-25 18:32:37

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Thu, 25 Oct 2007, Erez Zadok wrote:
>
> On a related note, I would just love to get rid of calling the lower
> ->writepage in unionfs b/c I can't even tell if I have a lower page to use
> all the time. I'd prefer to call vfs_write() if I can, but I'll need a
> struct file, or at least a dentry.

Why do you want to do that? You gave a good reason why it's easier
for ecryptfs, but I doubt it's robust. The higher the level you
choose to use, the harder to guarantee it won't deadlock.

Or that's my gut feeling anyway. It's several years since I've
thought about such issues: just because I came into this knowing
about shmem_writepage, is perhaps not a good reason to choose me
as advisor!

Hugh

2007-10-25 20:53:36

by H. Peter Anvin

Subject: Re: [PATCH+comment] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

Erez Zadok wrote:
> In message <[email protected]>, Hugh Dickins writes:
>> On Thu, 25 Oct 2007, Pekka Enberg wrote:
>
>> With unionfs also fixed, we don't know of an absolute need for this
>> patch (and so, on that basis, the !wbc->for_reclaim case could indeed
>> be removed very soon); but as I see it, the unionfs case has shown
>> that it's time to future-proof this code against whatever stacking
>> filesystems come along. Hence I didn't mention the names of such
>> filesystems in the source comment.
>
> I think "future proof" for other stackable f/s is a good idea, esp. since
> many of the stackable f/s we've developed and distributed over the past 10
> years are in some use in various places: gzipfs, avfs, tracefs, replayfs,
> ncryptfs, versionfs, wrapfs, i3fs, and more (see http://www.filesystems.org).
>

A number of filesystems want partial or full stackability, so getting
rid of lack-of-stackability wherever it may be is highly valuable.

-hpa

2007-10-26 02:01:19

by NeilBrown

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Thursday October 25, [email protected] wrote:
> On Mon, 22 Oct 2007, Pekka Enberg wrote:
> > On 10/22/07, Hugh Dickins <[email protected]> wrote:
> > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > > a special code to get the LRU rotation right, not useful elsewhere.
> > > Though Documentation/filesystems/vfs.txt does imply wider use.
> > >
> > > I think this is where people use the phrase "go figure" ;)
> >
> > Heh. As far as I can tell, the implication of "wider use" was added by
> > Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> > VFS documentation", so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that
understanding into English, and put that in vfs.txt.
It is very possible that what I wrote didn't match the intention of
the author, but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me.
Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
without unlocking the page, and this has precisely the effect of:
ClearPageReclaim(); (if the call path was through pageout)
SetPageActive(); (if the call was through shrink_page_list)
unlock_page();

With the patch, the ->writepage method does the SetPageActive and the
unlock_page, which on the whole seems cleaner.

We seem to have lost a call to ClearPageReclaim - I don't know if that
is significant.

>
> Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew
> stole his own secret and used it when concocting ramdisk_writepage.
> Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil
> revealed the secret to the uninitiated in 2.6.17: now, what's the
> appropriate punishment for that?

Surely the punishment should be for writing hidden undocumented hacks
in the first place! I vote we just make him maintainer for the whole
kernel - that will keep him so busy that he will never have a chance
to do it again :-)

> --- 2.6.24-rc1/Documentation/filesystems/vfs.txt 2007-10-24 07:15:11.000000000 +0100
> +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.000000000 +0100
> @@ -567,9 +567,7 @@ struct address_space_operations {
> If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
> try too hard if there are problems, and may choose to write out
> other pages from the mapping if that is easier (e.g. due to
> - internal dependencies). If it chooses not to start writeout, it
> - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> - calling ->writepage on that page.
> + internal dependencies).
>

It seems that the new requirement is that if the address_space
chooses not to write out the page, it should now call SetPageActive().
If that is the case, I think it should be explicit in the
documentation - please?

NeilBrown

2007-10-26 08:05:44

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi Hugh,

On 10/25/07, Hugh Dickins <[email protected]> wrote:
> @@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa
> res = mapping->a_ops->writepage(page, &wbc);
> if (res < 0)
> handle_write_error(mapping, page, res);
> - if (res == AOP_WRITEPAGE_ACTIVATE) {
> - ClearPageReclaim(page);
> - return PAGE_ACTIVATE;
> - }

I don't see ClearPageReclaim added anywhere. Is that done on purpose?
Other than that, the patch looks good to me and I think we should
stick it into -mm to punish Andrew for his secret hack ;-).

Pekka

2007-10-26 08:09:34

by Pekka Enberg

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hi,

On 10/26/07, Neil Brown <[email protected]> wrote:
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

Agreed.

2007-10-26 11:27:20

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Fri, 26 Oct 2007, Neil Brown wrote:
> On Thursday October 25, [email protected] wrote:
>
> The patch looks like it makes perfect sense to me.

Great, thanks a lot for looking at it, Neil and Pekka.

> Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
> without unlocking the page, and this has precisely the effect of:
> ClearPageReclaim(); (if the call path was through pageout)
> SetPageActive(); (if the call was through shrink_page_list)
> unlock_page();
>
> With the patch, the ->writepage method does the SetPageActive and the
> unlock_page, which on the whole seems cleaner.
>
> We seem to have lost a call to ClearPageReclaim - I don't know if that
> is significant.

It doesn't show up in the diff at all, but pageout() already has
        if (!PageWriteback(page)) {
                /* synchronous write or broken a_ops? */
                ClearPageReclaim(page);
        }
which will clear it since we've never set PageWriteback.

I think no harm would come from leaving it set there, since it only
takes effect in end_page_writeback (its effect being to let the just
written page be moved to the end of the LRU, so that it will then
be soon reclaimed), and clear_page_dirty_for_io clears it before
coming down this way. But I'd never argue for that: I hate having
leftover flags hanging around outside the scope of their relevance.

> > Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew
> > stole his own secret and used it when concocting ramdisk_writepage.
> > Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil
> > revealed the secret to the uninitiated in 2.6.17: now, what's the
> > appropriate punishment for that?
>
> Surely the punishment should be for writing hidden undocumented hacks
> in the first place! I vote we just make him maintainer for the whole
> kernel - that will keep him so busy that he will never have a chance
> to do it again :-)

That is a splendid retort, which has won you absolution.
But it makes me a little sad: that smiley should be a weepy.

> > --- 2.6.24-rc1/Documentation/filesystems/vfs.txt 2007-10-24 07:15:11.000000000 +0100
> > +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.000000000 +0100
> > @@ -567,9 +567,7 @@ struct address_space_operations {
> > If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
> > try too hard if there are problems, and may choose to write out
> > other pages from the mapping if that is easier (e.g. due to
> > - internal dependencies). If it chooses not to start writeout, it
> > - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> > - calling ->writepage on that page.
> > + internal dependencies).
> >
>
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

No, it's not the case; but you're right that I should add something
there, to put an end to the idea. It'll be something along the lines
of "You may notice shmem setting PageActive there, but please don't do
that; or if you insist, be sure never to do so in the !wbc->for_reclaim
case".

The PageActive thing is for when a filesystem regrets that it even
had a ->writepage (it replicates the behaviour of the writepage == NULL
case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying
the return of this page to writepage for as long as it can). It's done
in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only
discovered within that writepage, and no-swap is discovered there too.

ramdisk does it too: I've not tried to understand ramdisk as Nick and
Eric have, but it used to have no writepage, and would prefer to have
no writepage, but appears to need one for some PageUptodate reasons.

It's fairly normal for a filesystem to find that for some reason it
cannot carry out a writepage on this page right now (in the reclaim
case: the sync case demands action, IIRC); so it then simply does
set_page_dirty and unlock_page and returns "success".
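[Editor's sketch: as code, the pattern Hugh describes might look like this. Illustrative only; cannot_write_now() and do_real_writeout() are made-up names standing in for whatever condition and writeout path the filesystem actually has.]

```c
/* Sketch: a ->writepage that finds it cannot act on the page right
 * now (in the reclaim case) re-dirties it, unlocks it, and reports
 * success, so the VM simply comes back to it later. */
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
        if (wbc->for_reclaim && cannot_write_now(page)) {
                set_page_dirty(page);   /* data preserved for a later pass */
                unlock_page(page);
                return 0;               /* "success": nothing was lost */
        }
        /* ... otherwise perform the real writeout ... */
        return do_real_writeout(page, wbc);
}
```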

I'll try to condense this down for the Doc when finalizing the patch;
which I've still not yet tested properly - thanks for the eyes, but
I can't submit it until I've checked in detail that it really gets
to do what we think it does.

Hugh

2007-10-27 20:48:25

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Hugh Dickins writes:
> On Mon, 22 Oct 2007, Erez Zadok wrote:
[...]
> > If you've got suggestions how I can handle unionfs_write more cleanly, or
> > comments on the above possibilities, I'd love to hear them.
>
> For now I think you should pursue the ~(__GFP_FS|__GFP_IO) idea somehow.
>
> Hugh

Hugh, thanks for the great explanations and suggestions (in multiple
emails). I'm going to test all of those soon.

Erez.

2007-10-28 20:23:51

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

Hugh,

I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
and such. I revised my unionfs_writepage and unionfs_sync_page, and tested
it under memory pressure: I have a couple of live CDs that use tmpfs and can
deterministically reproduce the conditions resulting in A_W_A. I also went
back to using grab_cache_page but with the gfp_mask suggestions you made.

I'm happy to report that it all works great now! Below is the entirety of
the new unionfs_mmap and unionfs_sync_page code. I'd appreciate if you and
others can look it over and see if you find any problems.

Thanks,
Erez.


static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
        int err = -EIO;
        struct inode *inode;
        struct inode *lower_inode;
        struct page *lower_page;
        char *kaddr, *lower_kaddr;
        struct address_space *mapping; /* lower inode mapping */
        gfp_t old_gfp_mask;

        inode = page->mapping->host;
        lower_inode = unionfs_lower_inode(inode);
        mapping = lower_inode->i_mapping;

        /*
         * find lower page (returns a locked page)
         *
         * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
         * memory pressure conditions.  This is similar to how the loop
         * driver behaves (see loop_set_fd in drivers/block/loop.c).
         * If we can't find the lower page, we redirty our page and return
         * "success" so that the VM will call us again in the (hopefully
         * near) future.
         */
        old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

        lower_page = grab_cache_page(mapping, page->index);
        mapping_set_gfp_mask(mapping, old_gfp_mask);

        if (!lower_page) {
                err = 0;
                set_page_dirty(page);
                goto out;
        }

        /* get page address, and encode it */
        kaddr = kmap(page);
        lower_kaddr = kmap(lower_page);

        memcpy(lower_kaddr, kaddr, PAGE_CACHE_SIZE);

        kunmap(page);
        kunmap(lower_page);

        BUG_ON(!mapping->a_ops->writepage);

        /* call lower writepage (expects locked page) */
        clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
        err = mapping->a_ops->writepage(lower_page, wbc);

        /* b/c grab_cache_page locked it and ->writepage unlocks on success */
        if (err)
                unlock_page(lower_page);
        /* b/c grab_cache_page increased refcnt */
        page_cache_release(lower_page);

        if (err < 0) {
                ClearPageUptodate(page);
                goto out;
        }
        /*
         * Lower file systems, such as ramfs and tmpfs, may return
         * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
         * write the page again for a while.  But those lower file systems
         * also set the page dirty bit back again.  Since we successfully
         * copied our page data to the lower page, then the VM will come
         * back to the lower page (directly) and try to flush it.  So we can
         * save the VM the hassle of coming back to our page and trying to
         * flush too.  Therefore, we don't re-dirty our own page, and we
         * don't return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
         * this a success).
         */
        if (err == AOP_WRITEPAGE_ACTIVATE)
                err = 0;

        /* all is well */
        SetPageUptodate(page);
        /* lower mtimes have changed: update ours */
        unionfs_copy_attr_times(inode);

        unlock_page(page);

out:
        return err;
}


static void unionfs_sync_page(struct page *page)
{
        struct inode *inode;
        struct inode *lower_inode;
        struct page *lower_page;
        struct address_space *mapping; /* lower inode mapping */
        gfp_t old_gfp_mask;

        inode = page->mapping->host;
        lower_inode = unionfs_lower_inode(inode);
        mapping = lower_inode->i_mapping;

        /*
         * Find lower page (returns a locked page).
         *
         * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
         * memory pressure conditions.  This is similar to how the loop
         * driver behaves (see loop_set_fd in drivers/block/loop.c).
         * If we can't find the lower page, we redirty our page and return
         * "success" so that the VM will call us again in the (hopefully
         * near) future.
         */
        old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

        lower_page = grab_cache_page(mapping, page->index);
        mapping_set_gfp_mask(mapping, old_gfp_mask);

        if (!lower_page) {
                printk(KERN_ERR "unionfs: grab_cache_page failed\n");
                goto out;
        }

        /* do the actual sync */

        /*
         * XXX: can we optimize ala RAIF and set the lower page to be
         * discarded after a successful sync_page?
         */
        if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
                mapping->a_ops->sync_page(lower_page);

        /* b/c grab_cache_page locked it */
        unlock_page(lower_page);
        /* b/c grab_cache_page increased refcnt */
        page_cache_release(lower_page);

out:
        return;
}

2007-10-29 20:34:59

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Sun, 28 Oct 2007, Erez Zadok wrote:
>
> I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
> and such. I revised my unionfs_writepage and unionfs_sync_page, and tested
> it under memory pressure: I have a couple of live CDs that use tmpfs and can
> deterministically reproduce the conditions resulting in A_W_A. I also went
> back to using grab_cache_page but with the gfp_mask suggestions you made.
>
> I'm happy to report that it all works great now!

That's very encouraging...

> Below is the entirety of
> the new unionfs_mmap and unionfs_sync_page code. I'd appreciate if you and
> others can look it over and see if you find any problems.

... but still a few problems, I'm afraid.

The greatest problem is a tmpfs one, that would be for me to solve.
But first...

> static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> int err = -EIO;
> struct inode *inode;
> struct inode *lower_inode;
> struct page *lower_page;
> char *kaddr, *lower_kaddr;
> struct address_space *mapping; /* lower inode mapping */
> gfp_t old_gfp_mask;
>
> inode = page->mapping->host;
> lower_inode = unionfs_lower_inode(inode);
> mapping = lower_inode->i_mapping;
>
> /*
> * find lower page (returns a locked page)
> *
> * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under

On reflection, I think I went too far in asking you to mask off __GFP_IO.
Loop has to do so because it's a block device, down towards the IO layer;
but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent
recursion into the FS layer with danger of deadlock, and leaving __GFP_IO
on gives a better chance of success.

> * memory pressure conditions. This is similar to how the loop
> * driver behaves (see loop_set_fd in drivers/block/loop.c).
> * If we can't find the lower page, we redirty our page and return
> * "success" so that the VM will call us again in the (hopefully
> * near) future.
> */
> old_gfp_mask = mapping_gfp_mask(mapping);
> mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
>
> lower_page = grab_cache_page(mapping, page->index);
> mapping_set_gfp_mask(mapping, old_gfp_mask);

Hmm, several points on that.

When suggesting something like this, I did remark "what locking needed?".
You've got none: which is problematic if two stacked mounts are playing
with the same underlying file concurrently (yes, userspace would have
a data coherency problem in such a case, but the kernel still needs to
worry about its own internal integrity) - you'd be in danger of masking
(__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore,
losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned long.
Neither likely but both wrong.

See the comment on mapping_set_gfp_mask() in include/linux/pagemap.h:
* This is non-atomic. Only to be used before the mapping is activated.
Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active.
It appears to be just as guilty as you in its do_loop_switch() case
(done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl
which would only be expected to be called once, during installation;
whereas you're using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file
for the whole duration while it is looped, whereas you're flipping it
just in this preliminary section of unionfs_writepage. I think you're
probably okay to be doing it only here within ->writepage: I think
loop covered every operation because it's at the block device level,
perhaps both reads and writes needed to serve reclaim at the higher
FS level; and also easier to do it once for all.

Are you wrong to be doing it only around the grab_cache_page,
leaving the lower level ->writepage further down unprotected?
Certainly doing it around the grab_cache_page is likely to be way
more important than around the ->writepage (but rather depends on
filesystem). And on reflection, I think that the lower filesystem's
writepage should already be using GFP_NOFS to avoid deadlocks in
any of its allocations when wbc->for_reclaim, so you should be
okay just masking off around the grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really
need to call the lower level writepage at all. Just set_page_dirty
on the lower page, unlock it and return. In due course that memory
pressure which has called unionfs_writepage, will come around to the
lower level page and do writepage upon it. Whether that's a better
strategy or not, I do not know.)

There's an attractively simple answer to the mapping_set_gfp_mask
locking problem, if we're confident that it's only needed around
the grab_cache_page. Look at the declaration of grab_cache_page
in linux/pagemap.h: it immediately extracts the gfp_mask from the
mapping and passes that down to find_or_create_page, which doesn't
use the mapping's gfp_mask at all.

So, stop flipping and use find_or_create_page directly yourself.
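[Editor's sketch of the suggestion, mirroring what grab_cache_page() would have done internally: the caller computes the restricted mask itself and passes it down, rather than performing a non-atomic read-modify-write on the mapping's own mask.]

```c
/* Sketch: pass an explicitly masked gfp straight to
 * find_or_create_page() instead of flipping the mapping's mask.
 * find_or_create_page() returns the page locked, like
 * grab_cache_page(), but takes the gfp mask as an argument. */
gfp_t mask = mapping_gfp_mask(lower_mapping) & ~__GFP_FS;
struct page *lower_page =
        find_or_create_page(lower_mapping, page->index, mask);
```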

>
> if (!lower_page) {
> err = 0;
> set_page_dirty(page);

You need to unlock_page, don't you? Or move the "out" label up
before the unlock_page. There seems to have been confusion about
this even in the current 2.6.23-mm1 unionfs_writepage: the only
case in which a writepage returns with its page still locked is
that AOP_WRITEPAGE_ACTIVATE case we're going to get rid of.

> goto out;
> }
>
> /* get page address, and encode it */
> kaddr = kmap(page);
> lower_kaddr = kmap(lower_page);
>
> memcpy(lower_kaddr, kaddr, PAGE_CACHE_SIZE);
>
> kunmap(page);
> kunmap(lower_page);

Better to use kmap_atomic. unionfs_writepage cannot get called
at interrupt time, I see no reason to avoid KM_USER0 and KM_USER1:
therefore simply use copy_highpage(lower_page, page) and let it do
all the kmapping and copying.
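[Editor's sketch of the suggested replacement for the kmap/memcpy/kunmap sequence:]

```c
/* Sketch: copy_highpage() (from linux/highmem.h) does the
 * kmap_atomic/copy/kunmap_atomic dance internally, so the four
 * explicit kmap/kunmap calls and the memcpy collapse to one line. */
copy_highpage(lower_page, page);        /* args: destination, source */
```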

If PAGE_CACHE_SIZE ever diverges from PAGE_SIZE (e.g. Christoph
Lameter's variable page_cache_size patches), then yes, this
would need updating to a loop over several pages (or better,
linux/highmem.h should then provide a function to do it).

>
> BUG_ON(!mapping->a_ops->writepage);
>
> /* call lower writepage (expects locked page) */
> clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
> err = mapping->a_ops->writepage(lower_page, wbc);
>
> /* b/c grab_cache_page locked it and ->writepage unlocks on success */
> if (err)
> unlock_page(lower_page);

Another instance of that confusion: lower_page is already unlocked,
on success or failure; it's only the anomalous AOP_WRITEPAGE_ACTIVATE
case that leaves it locked.

> /* b/c grab_cache_page increased refcnt */
> page_cache_release(lower_page);
>
> if (err < 0) {
> ClearPageUptodate(page);

Page needs to be unlocked, whether here or at out.

> goto out;
> }
> /*
> * Lower file systems such as ramfs and tmpfs, may return
> * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
> * write the page again for a while. But those lower file systems
> * also set the page dirty bit back again. Since we successfully
> * copied our page data to the lower page, then the VM will come
> * back to the lower page (directly) and try to flush it. So we can
> * save the VM the hassle of coming back to our page and trying to
> * flush too. Therefore, we don't re-dirty our own page, and we
> * don't return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
> * this a success).
> */
> if (err == AOP_WRITEPAGE_ACTIVATE)
> err = 0;

Right (once you've got the locking right).

>
> /* all is well */
> SetPageUptodate(page);
> /* lower mtimes has changed: update ours */
> unionfs_copy_attr_times(inode);
>
> unlock_page(page);
>
> out:
> return err;
> }
>
>
> static void unionfs_sync_page(struct page *page)
> {

I'm not going to comment much on your unionfs_sync_page: it looks
like a total misunderstanding of what sync_page does, assuming from
the name that it syncs the page in a fsync/msync/sync manner.

No, it would much better be named "unplug_page_io": please take a
look at sync_page() in mm/filemap.c, observe how it gets called
(via wait_on_page_bit) and what it ends up doing. (Don't pay much
attention to what Documentation/filesystems says about it, either!)

It's an odd business; I think Nick did have a patch to get rid of
it completely, which would be nice; but changes to unplugging I/O
(kicking off the I/O after saving up several requests to do all
together) can be a hang-prone business.

Do you need a unionfs_sync_page at all? I think not, since the
I/O, plugged or unplugged, is below your lower level filesystem.

But I started by mentioning a serious tmpfs problem. Now I've
persuaded you to go back to grab_cache_page/find_or_create_page,
I realize a nasty problem for tmpfs. Under memory pressure, you're
liable to be putting tmpfs file pages into the page cache at the
same time as they're already present but in disguise as swap cache
pages. Perhaps the solution will be quite simple (since you're
overwriting the whole page), but I do need to think about it.

Hugh

2007-10-31 23:53:42

by Erez Zadok

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland


Hi Hugh, I've addressed all of your concerns and am happy to report that the
newly revised unionfs_writepage works even better, including under my
memory-pressure conditions. To summarize my changes since the last time:

- I'm only masking __GFP_FS, not __GFP_IO
- using find_or_create_page to avoid locking issues around mapping mask
- handle for_reclaim case more efficiently
- using copy_highpage so we handle KM_USER*
- un/locking upper/lower page as/when needed
- updated comments to clarify what/why
- unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
to have it)

Below is the newest version of unionfs_writepage. Let me know what you
think.

I have to say that with these changes, unionfs appears visibly faster under
memory pressure. I suspect the for_reclaim handling is probably the largest
contributor to this speedup.

Many thanks,
Erez.

//////////////////////////////////////////////////////////////////////////////

static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
        int err = -EIO;
        struct inode *inode;
        struct inode *lower_inode;
        struct page *lower_page;
        struct address_space *lower_mapping; /* lower inode mapping */
        gfp_t mask;

        inode = page->mapping->host;
        lower_inode = unionfs_lower_inode(inode);
        lower_mapping = lower_inode->i_mapping;

        /*
         * find lower page (returns a locked page)
         *
         * We turn off __GFP_FS while we look for or create a new lower
         * page.  This prevents a recursion into the file system code, which
         * under memory pressure conditions could lead to a deadlock.  This
         * is similar to how the loop driver behaves (see loop_set_fd in
         * drivers/block/loop.c).  If we can't find the lower page, we
         * redirty our page and return "success" so that the VM will call us
         * again in the (hopefully near) future.
         */
        mask = mapping_gfp_mask(lower_mapping) & ~(__GFP_FS);
        lower_page = find_or_create_page(lower_mapping, page->index, mask);
        if (!lower_page) {
                err = 0;
                set_page_dirty(page);
                goto out;
        }

        /* copy page data from our upper page to the lower page */
        copy_highpage(lower_page, page);

        /*
         * Call lower writepage (expects locked page).  However, if we are
         * called with wbc->for_reclaim, then the VFS/VM just wants to
         * reclaim our page.  Therefore, we don't need to call the lower
         * ->writepage: just copy our data to the lower page (already done
         * above), then mark the lower page dirty and unlock it, and return
         * success.
         */
        if (wbc->for_reclaim) {
                set_page_dirty(lower_page);
                unlock_page(lower_page);
                goto out_release;
        }
        BUG_ON(!lower_mapping->a_ops->writepage);
        clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
        err = lower_mapping->a_ops->writepage(lower_page, wbc);
        if (err < 0) {
                ClearPageUptodate(page);
                goto out_release;
        }

        /*
         * Lower file systems, such as ramfs and tmpfs, may return
         * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
         * write the page again for a while.  But those lower file systems
         * also set the page dirty bit back again.  Since we successfully
         * copied our page data to the lower page, then the VM will come
         * back to the lower page (directly) and try to flush it.  So we can
         * save the VM the hassle of coming back to our page and trying to
         * flush too.  Therefore, we don't re-dirty our own page, and we
         * never return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
         * this a success).
         *
         * We also unlock the lower page if the lower ->writepage returned
         * AOP_WRITEPAGE_ACTIVATE.  (This "anomalous" behaviour may be
         * addressed in future shmem/VM code.)
         */
        if (err == AOP_WRITEPAGE_ACTIVATE) {
                err = 0;
                unlock_page(lower_page);
        }

        /* all is well */
        SetPageUptodate(page);
        /* lower mtimes have changed: update ours */
        unionfs_copy_attr_times(inode);

out_release:
        /* b/c find_or_create_page increased refcnt */
        page_cache_release(lower_page);
out:
        /*
         * We unlock our page unconditionally, because we never return
         * AOP_WRITEPAGE_ACTIVATE.
         */
        unlock_page(page);
        return err;
}

//////////////////////////////////////////////////////////////////////////////

2007-11-05 15:42:59

by Hugh Dickins

Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

[Dave, I've Cc'ed you re handle_write_count_underflow, see below.]

On Wed, 31 Oct 2007, Erez Zadok wrote:
>
> Hi Hugh, I've addressed all of your concerns and am happy to report that the
> newly revised unionfs_writepage works even better, including under my
> memory-pressure conditions. To summarize my changes since the last time:
>
> - I'm only masking __GFP_FS, not __GFP_IO
> - using find_or_create_page to avoid locking issues around mapping mask
> - handle for_reclaim case more efficiently
> - using copy_highpage so we handle KM_USER*
> - un/locking upper/lower page as/when needed
> - updated comments to clarify what/why
> - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
> to have it)
>
> Below is the newest version of unionfs_writepage. Let me know what you
> think.
>
> I have to say that with these changes, unionfs appears visibly faster under
> memory pressure. I suspect the for_reclaim handling is probably the largest
> contributor to this speedup.

That's good news, and that unionfs_writepage looks good to me -
with three reservations I've not observed before.

One, I think you would be safer to do a set_page_dirty(lower_page)
before your clear_page_dirty_for_io(lower_page). I know that sounds
silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
there's a lot of subtlety hereabouts, and I think you'd be mimicking the
usual path closer if you set_page_dirty first - there's nothing else
doing it on that lower_page, is there? I'm not certain that you need
to, but I think you'd do well to look into it and make up your own mind.

Two, I'm unsure of the way you're clearing or setting PageUptodate on
the upper page there. The rules for PageUptodate are fairly obvious
when reading, but when a write fails, it's not so obvious. Again, I'm
not saying what you've got is wrong (it may be unavoidable, to keep
synch between lower and upper), but it deserves a second thought.

Three, I believe you need to add a flush_dcache_page(lower_page)
after the copy_highpage(lower_page): some architectures will need
that to see the new data if they have lower_page mapped (though I
expect it's anyway shaky ground to be accessing through the lower
mount at the same time as modifying through the upper).

I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9
and your 2 Nov 1-8/8 patches applied (rejects being patches which
were already in 2.6.23-mm1). I was hoping to reproduce the
BUG_ON(entry->val) that I fear from shmem_writepage(), before
fixing it; but not seen that at all yet - that might be good
news, but it's more likely I just haven't tried hard enough yet.

For now I'm doing repeated make -j20 kernel builds, pushing into
swap, in a unionfs mount of just a single dir on tmpfs. This has
shown up several problems, two of which I've had to hack around to
get further.

The first: I very quickly hit "BUG: atomic counter underflow"
from -mm's i386 atomic_dec_and_test: from filp_close calling
unionfs_flush. I did a little test fork()ing while holding a file
open on unionfs, and indeed it appears that your totalopens code is
broken, being unaware of how fork() bumps up a file count without
an open. That's rather basic, I'm puzzled that this has remained
undiscovered until now - or perhaps it's just a recent addition.

It looked to me as if the totalopens count was about avoiding some
d_deleted processing in unionfs_flush, which actually should be left
until unionfs_release (and that your unionfs_flush ought to be calling
the lower level flush in all cases). To get going, I've been running
with the quick hack patch below: but I've spent very little time
thinking about it, plus it's a long time since I've done VFS stuff;
so that patch may be nothing but an embarrassment that reflects
neither your intentions nor the VFS rules! And it may itself be
responsible for the further problems I've seen.

The second problem was a hang: all cpus in handle_write_count_underflow
doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave
Hansen. At first I thought that was a locking problem in Dave's code,
but I now suspect it's that your unionfs reference counting is wrong
somewhere, and the error accumulates until __mnt_writers drops below
MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help
and we're stuck in that loop. My even greater hack to solve that one
was to change Dave's "while" to "if"! Then indeed tests can run for
some while. As I say, my suspicion is that the actual error is within
unionfs (perhaps introduced by my first hack); but I hope Dave can
also make handle_write_count_underflow more robust, it's unfortunate
if refcount errors elsewhere first show up as a hang there.

I've had CONFIG_UNION_FS_DEBUG=y but will probably turn it off when
I come back to this, since it's rather noisy at present. I've not
checked whether its reports are peculiar to having tmpfs below or not.
I get lots of "unionfs: new lower inode mtime" reports on directories
(if there have been any on regular files, I've missed them in the
noise on directories); "unionfs: unhashed dentry being revalidated"s
(mostly or all directories again); "unionfs: saving rdstate with cookie"s.
After five hours hit "kernel BUG at fs/unionfs/fanout.h:128!".

But the first two problems probably make the rest uninteresting:
I'm hoping you can look at those and provide patches for them,
for now I'll switch away to other work.

Hugh

Dubious patch, Not-Signed-off-by: Nobody <[email protected]>

--- 2.6.23-mm1++/fs/unionfs/commonfops.c 2007-11-04 13:14:42.000000000 +0000
+++ linux/fs/unionfs/commonfops.c 2007-11-04 14:21:12.000000000 +0000
@@ -551,9 +551,6 @@ int unionfs_open(struct inode *inode, st
bstart = fbstart(file) = dbstart(dentry);
bend = fbend(file) = dbend(dentry);

- /* increment, so that we can flush appropriately */
- atomic_inc(&UNIONFS_I(dentry->d_inode)->totalopens);
-
/*
* open all directories and make the unionfs file struct point to
* these lower file structs
@@ -565,7 +562,6 @@ int unionfs_open(struct inode *inode, st

/* freeing the allocated resources, and fput the opened files */
if (err) {
- atomic_dec(&UNIONFS_I(dentry->d_inode)->totalopens);
for (bindex = bstart; bindex <= bend; bindex++) {
lower_file = unionfs_lower_file_idx(file, bindex);
if (!lower_file)
@@ -606,6 +602,7 @@ int unionfs_file_release(struct inode *i
struct unionfs_file_info *fileinfo;
struct unionfs_inode_info *inodeinfo;
struct super_block *sb = inode->i_sb;
+ struct dentry *dentry = file->f_path.dentry;
int bindex, bstart, bend;
int fgen, err = 0;

@@ -628,6 +625,7 @@ int unionfs_file_release(struct inode *i
bstart = fbstart(file);
bend = fbend(file);

+ unionfs_lock_dentry(dentry);
for (bindex = bstart; bindex <= bend; bindex++) {
lower_file = unionfs_lower_file_idx(file, bindex);

@@ -635,7 +633,15 @@ int unionfs_file_release(struct inode *i
fput(lower_file);
branchput(sb, bindex);
}
+
+ /* if there are no more refs to the dentry, dput it */
+ if (d_deleted(dentry)) {
+ dput(unionfs_lower_dentry_idx(dentry, bindex));
+ unionfs_set_lower_dentry_idx(dentry, bindex, NULL);
+ }
}
+ unionfs_unlock_dentry(dentry);
+
kfree(fileinfo->lower_files);
kfree(fileinfo->saved_branch_ids);

@@ -799,11 +805,6 @@ int unionfs_flush(struct file *file, fl_
goto out;
unionfs_check_file(file);

- if (!atomic_dec_and_test(&UNIONFS_I(dentry->d_inode)->totalopens))
- goto out;
-
- unionfs_lock_dentry(dentry);
-
bstart = fbstart(file);
bend = fbend(file);
for (bindex = bstart; bindex <= bend; bindex++) {
@@ -813,14 +814,7 @@ int unionfs_flush(struct file *file, fl_
lower_file->f_op->flush) {
err = lower_file->f_op->flush(lower_file, id);
if (err)
- goto out_lock;
-
- /* if there are no more refs to the dentry, dput it */
- if (d_deleted(dentry)) {
- dput(unionfs_lower_dentry_idx(dentry, bindex));
- unionfs_set_lower_dentry_idx(dentry, bindex,
- NULL);
- }
+ goto out;
}

}
@@ -830,8 +824,6 @@ int unionfs_flush(struct file *file, fl_
/* parent time could have changed too (async) */
unionfs_copy_attr_times(dentry->d_parent->d_inode);

-out_lock:
- unionfs_unlock_dentry(dentry);
out:
unionfs_read_unlock(dentry->d_sb);
unionfs_check_file(file);

2007-11-05 16:58:15

by Dave Hansen

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Mon, 2007-11-05 at 15:40 +0000, Hugh Dickins wrote:
> The second problem was a hang: all cpus in
> handle_write_count_underflow
> doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave
> Hansen. At first I thought that was a locking problem in Dave's code,
> but I now suspect it's that your unionfs reference counting is wrong
> somewhere, and the error accumulates until __mnt_writers drops below
> MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help
> and we're stuck in that loop.

I've never actually seen this happen in practice, but I do know exactly
what you're talking about.

> but I hope Dave can
> also make handle_write_count_underflow more robust, it's unfortunate
> if refcount errors elsewhere first show up as a hang there.

Actually, I think your s/while/if/ change is probably a decent fix.
Barring any other races, that loop should always have made progress on
mnt->__mnt_writers the way it is written. If we get to:

> lock_and_coalesce_cpu_mnt_writer_counts();
----------------->HERE
> mnt_unlock_cpus();

and don't have a positive mnt->__mnt_writers, we know something is going
badly. We WARN_ON() there, which should at least give an earlier
warning that the system is not doing well. But it doesn't fix the
inevitable. Could you try the attached patch and see if it at least
warns you earlier?

I have a decent guess what the bug is, too. In the unionfs code:

> int init_lower_nd(struct nameidata *nd, unsigned int flags)
> {
> ...
> #ifdef ALLOC_LOWER_ND_FILE
> 	file = kzalloc(sizeof(struct file), GFP_KERNEL);
> 	if (unlikely(!file)) {
> 		err = -ENOMEM;
> 		break;	/* exit switch statement and thus return */
> 	}
> 	nd->intent.open.file = file;
> #endif /* ALLOC_LOWER_ND_FILE */

The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at
__fput() time. Since that code never got a write on the mount, we'll
see an imbalance if the file was opened for a write. I don't see this
file's mnt set anywhere, so I'm not completely sure that this is it. In
any case, rolling your own 'struct file' without using alloc_file() and
friends is a no-no.

BTW, I have some "debugging" code in my latest set of patches that I
think should fix this kind of imbalance with the mnt->__mnt_writers().
It ensures that before we do that mnt_drop_write() at __fput() that we
absolutely did a mnt_want_write() at some point in the 'struct file's
life.

-- Dave

linux-2.6.git-dave/fs/namespace.c | 31 ++++++++++++++++++++++---------
linux-2.6.git-dave/include/linux/mount.h | 1 +
2 files changed, 23 insertions(+), 9 deletions(-)

diff -puN fs/namei.c~fix-naughty-loop fs/namei.c
diff -puN fs/namespace.c~fix-naughty-loop fs/namespace.c
--- linux-2.6.git/fs/namespace.c~fix-naughty-loop 2007-11-05 08:03:59.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-05 08:35:06.000000000 -0800
@@ -225,16 +225,29 @@ static void lock_and_coalesce_cpu_mnt_wr
*/
static void handle_write_count_underflow(struct vfsmount *mnt)
{
- while (atomic_read(&mnt->__mnt_writers) <
- MNT_WRITER_UNDERFLOW_LIMIT) {
- /*
- * It isn't necessary to hold all of the locks
- * at the same time, but doing it this way makes
- * us share a lot more code.
- */
- lock_and_coalesce_cpu_mnt_writer_counts();
- mnt_unlock_cpus();
+ if (atomic_read(&mnt->__mnt_writers) >=
+ MNT_WRITER_UNDERFLOW_LIMIT)
+ return;
+ /*
+ * It isn't necessary to hold all of the locks
+ * at the same time, but doing it this way makes
+ * us share a lot more code.
+ */
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ /*
+ * If coalescing the per-cpu writer counts did not
+ * get us back to a positive writer count, we have
+ * a bug.
+ */
+ if ((atomic_read(&mnt->__mnt_writers) < 0) &&
+ !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) {
+ printk("leak detected on mount(%p) writers count: %d\n",
+ mnt, atomic_read(&mnt->__mnt_writers));
+ WARN_ON(1);
+ /* use the flag to keep the dmesg spam down */
+ mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT;
}
+ mnt_unlock_cpus();
}

/**
diff -puN include/linux/mount.h~fix-naughty-loop include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~fix-naughty-loop 2007-11-05 08:22:21.000000000 -0800
+++ linux-2.6.git-dave/include/linux/mount.h 2007-11-05 08:28:20.000000000 -0800
@@ -32,6 +32,7 @@ struct mnt_namespace;
#define MNT_READONLY 0x40 /* does the user want this to be r/o? */

#define MNT_SHRINKABLE 0x100
+#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */

#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
_


2007-11-05 19:04:24

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Mon, 5 Nov 2007, Dave Hansen wrote:
>
> Actually, I think your s/while/if/ change is probably a decent fix.

Any resemblance to a decent fix is purely coincidental.

> Barring any other races, that loop should always have made progress on
> mnt->__mnt_writers the way it is written. If we get to:
>
> > lock_and_coalesce_cpu_mnt_writer_counts();
> ----------------->HERE
> > mnt_unlock_cpus();
>
> and don't have a positive mnt->__mnt_writers, we know something is going
> badly. We WARN_ON() there, which should at least give an earlier
> warning that the system is not doing well. But it doesn't fix the
> inevitable. Could you try the attached patch and see if it at least
> warns you earlier?

Thanks, Dave, yes, that gives me a nice warning:

leak detected on mount(c25ebd80) writers count: -65537
WARNING: at fs/namespace.c:249 handle_write_count_underflow()
[<c0103486>] show_trace_log_lvl+0x1b/0x2e
[<c01034b6>] show_trace+0x16/0x1b
[<c0103589>] dump_stack+0x19/0x1e
[<c0171906>] handle_write_count_underflow+0x4c/0x60
[<c0171983>] mnt_drop_write+0x69/0x8e
[<c0160211>] __fput+0xff/0x162
[<c016010d>] fput+0x2e/0x33
[<c01b8f63>] unionfs_file_release+0xc2/0x1c5
[<c01601a1>] __fput+0x8f/0x162
[<c016010d>] fput+0x2e/0x33
[<c015ec9d>] filp_close+0x50/0x5d
[<c015ed1e>] sys_close+0x74/0xb4
[<c01026ce>] sysenter_past_esp+0x5f/0x85

and the test then goes quietly on its way instead of hanging. Though
I imagine, with your patch or mine, that it's then making an unfortunate
frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type
thereafter. But it's hardly your responsibility to optimize for bugs
elsewhere.

The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to
#define MNT_IMBALANCED_WRITE_COUNT 0x400 /* just for debugging */

>
> I have a decent guess what the bug is, too. In the unionfs code:

I'll let Erez take it from there...

Hugh

2007-11-09 02:52:52

by Erez Zadok

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <1194280730.6271.145.camel@localhost>, Dave Hansen writes:
> On Mon, 2007-11-05 at 15:40 +0000, Hugh Dickins wrote:
[...]
> I have a decent guess what the bug is, too. In the unionfs code:
>
> > int init_lower_nd(struct nameidata *nd, unsigned int flags)
> > {
> > ...
> > #ifdef ALLOC_LOWER_ND_FILE
> > 	file = kzalloc(sizeof(struct file), GFP_KERNEL);
> > 	if (unlikely(!file)) {
> > 		err = -ENOMEM;
> > 		break;	/* exit switch statement and thus return */
> > 	}
> > 	nd->intent.open.file = file;
> > #endif /* ALLOC_LOWER_ND_FILE */
>
> The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at
> __fput() time. Since that code never got a write on the mount, we'll
> see an imbalance if the file was opened for a write. I don't see this
> file's mnt set anywhere, so I'm not completely sure that this is it. In
> any case, rolling your own 'struct file' without using alloc_file() and
> friends is a no-no.
[...]

This #ifdef'd code in unionfs is actually not enabled. I left it there as a
reminder of possible future things to come (esp. if nameidata gets split).
There's a related comment earlier in fs/unionfs/lookup.c:init_lower_nd()
that says:

#ifdef ALLOC_LOWER_ND_FILE
	/*
	 * XXX: one day we may need to have the lower return an open file
	 * for us.  It is not needed in 2.6.23-rc1 for nfs2/nfs3, but may
	 * very well be needed for nfs4.
	 */
	struct file *file;
#endif /* ALLOC_LOWER_ND_FILE */

In the interest of keeping unionfs as simple as I can, when I implemented
the whole "pass a lower nd" stuff, I left those bits of semi-experimental
#ifdef code for this lower file upon open-intent. It's not enabled and up
until now, it didn't seem to be needed.

Do you think unionfs has to start using this nd->intent.open.file stuff?

Thanks,
Erez.

2007-11-09 06:06:37

by Erez Zadok

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Hugh Dickins writes:
> [Dave, I've Cc'ed you re handle_write_count_underflow, see below.]
>
> On Wed, 31 Oct 2007, Erez Zadok wrote:
> >
> > Hi Hugh, I've addressed all of your concerns and am happy to report that the
> > newly revised unionfs_writepage works even better, including under my
> > memory-pressure conditions. To summarize my changes since the last time:
> >
> > - I'm only masking __GFP_FS, not __GFP_IO
> > - using find_or_create_page to avoid locking issues around mapping mask
> > - handle for_reclaim case more efficiently
> > - using copy_highpage so we handle KM_USER*
> > - un/locking upper/lower page as/when needed
> > - updated comments to clarify what/why
> > - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
> > to have it)
> >
> > Below is the newest version of unionfs_writepage. Let me know what you
> > think.
> >
> > I have to say that with these changes, unionfs appears visibly faster under
> > memory pressure. I suspect the for_reclaim handling is probably the largest
> > contributor to this speedup.
>
> That's good news, and that unionfs_writepage looks good to me -
> with three reservations I've not observed before.
>
> One, I think you would be safer to do a set_page_dirty(lower_page)
> before your clear_page_dirty_for_io(lower_page). I know that sounds
> silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> there's a lot of subtlety hereabouts, and I think you'd be mimicking the
> usual path closer if you set_page_dirty first - there's nothing else
> doing it on that lower_page, is there? I'm not certain that you need
> to, but I think you'd do well to look into it and make up your own mind.

Hugh, my code looks like:

	if (wbc->for_reclaim) {
		set_page_dirty(lower_page);
		unlock_page(lower_page);
		goto out_release;
	}
	BUG_ON(!lower_mapping->a_ops->writepage);
	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
	err = lower_mapping->a_ops->writepage(lower_page, wbc);

Do you mean I should set_page_dirty(lower_page) unconditionally before
clear_page_dirty_for_io? (I already do that in the 'if' statement above it.)

> Two, I'm unsure of the way you're clearing or setting PageUptodate on
> the upper page there. The rules for PageUptodate are fairly obvious
> when reading, but when a write fails, it's not so obvious. Again, I'm
> not saying what you've got is wrong (it may be unavoidable, to keep
> synch between lower and upper), but it deserves a second thought.

I looked at all mainline filesystems' ->writepage to see what, if any, they
do with their page's uptodate flag. Most f/s don't touch the flag one way
or another.

cifs_writepage sets the uptodate flag unconditionally: why?

ecryptfs_writepage has a legit reason: if encrypting the page failed, it
doesn't want anyone to use it, so it clears its page's uptodate flag (else
it sets it as uptodate).

hostfs_writepage clears pageuptodate if it failed to write_file(), which I'm
not sure if it makes sense or not.

ntfs_writepage goes as far as doing BUG_ON(!PageUptodate(page)) which
indicates to me that the page passed to ->writepage should always be
uptodate. Is that a fair statement?

smb_writepage pretty much unconditionally calls SetPageUptodate(page). Why?

Is there a reason smbfs and cifs both do this unconditionally? If so, then
why is ntfs calling BUG_ON if the page isn't uptodate? Either that BUG_ON
in ntfs is redundant, or cifs/smbfs's SetPageUptodate is redundant, but they
can't both be right.

And finally, unionfs clears the uptodate flag on error from the lower
->writepage, and otherwise sets the flag on success from the lower
->writepage. My gut feeling is that unionfs shouldn't change the page
uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't
uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to
write out a page which isn't up-to-date, right? Otherwise, whether
unionfs_writepage manages to write the lower page or not, why should that
invalidate the state of the unionfs page itself? Come to think of it, I
think clearing pageuptodate on error from ->writepage(lower_page) may be
bad. Imagine if after such a failed unionfs_writepage, I get a
unionfs_readpage: that ->readpage will get data from the lower f/s page and
copy it *over* the unionfs page, even if the upper page's data was more
recent prior to the failed call to unionfs_writepage. IOW, we could be
reverting a user-visible mmap'ed page to a previous on-disk version. What
do you think: could this happen? Anyway, I'll run some exhaustive testing
next and see what happens if I don't set/clear the uptodate flag in
unionfs_writepage.

> Three, I believe you need to add a flush_dcache_page(lower_page)
> after the copy_highpage(lower_page): some architectures will need
> that to see the new data if they have lower_page mapped (though I
> expect it's anyway shaky ground to be accessing through the lower
> mount at the same time as modifying through the upper).

OK.

> For now I'm doing repeated make -j20 kernel builds, pushing into
> swap, in a unionfs mount of just a single dir on tmpfs. This has
> shown up several problems, two of which I've had to hack around to
> get further.
[...]

Thanks. I'll look more closely into these issues and your patches, and post
my findings.

Erez.

2007-11-12 05:57:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[email protected]>, Hugh Dickins writes:
> >
> > One, I think you would be safer to do a set_page_dirty(lower_page)
> > before your clear_page_dirty_for_io(lower_page). I know that sounds
> > silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> > there's a lot of subtlety hereabouts, and I think you'd be mimicking the
> > usual path closer if you set_page_dirty first - there's nothing else
> > doing it on that lower_page, is there? I'm not certain that you need
> > to, but I think you'd do well to look into it and make up your own mind.
>
> Hugh, my code looks like:
>
> 	if (wbc->for_reclaim) {
> 		set_page_dirty(lower_page);
> 		unlock_page(lower_page);
> 		goto out_release;
> 	}
> 	BUG_ON(!lower_mapping->a_ops->writepage);
> 	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
> 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
>
> Do you mean I should set_page_dirty(lower_page) unconditionally before
> clear_page_dirty_for_io? (I already do that in the 'if' statement above it.)

Yes. Whether you're wrong not to be doing that already, I've not checked;
but I think doing so will make unionfs safer against any future changes
in the relationship between set_page_dirty and clear_page_dirty_for_io.

For example, look at clear_page_dirty_for_io: it's decrementing some
statistics which __set_page_dirty_nobuffers increments. Does use of
unionfs (over some filesystems) make those numbers wrap increasingly
negative? Does adding this set_page_dirty(lower_page) correct that?
I suspect so, but may be wrong.

> > Two, I'm unsure of the way you're clearing or setting PageUptodate on
> > the upper page there. The rules for PageUptodate are fairly obvious
> > when reading, but when a write fails, it's not so obvious. Again, I'm
> > not saying what you've got is wrong (it may be unavoidable, to keep
> > synch between lower and upper), but it deserves a second thought.
>
> I looked at all mainline filesystems' ->writepage to see what, if any, they
> do with their page's uptodate flag. Most f/s don't touch the flag one way
> or another.

I'm not going to try and guess what assorted filesystems are up to,
and not all of them will be bugfree. The crucial point of PageUptodate
is that we insert a filesystem page into the page cache before it's had
any data read in: it needs to be !PageUptodate until the data is there,
and then marked PageUptodate to say the data is good and others can
start using it. See mm/filemap.c.

> And finally, unionfs clears the uptodate flag on error from the lower
> ->writepage, and otherwise sets the flag on success from the lower
> ->writepage. My gut feeling is that unionfs shouldn't change the page
> uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't
> uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to
> write out a page which isn't up-to-date, right? Otherwise, whether
> unionfs_writepage manages to write the lower page or not, why should that
> invalidate the state of the unionfs page itself? Come to think of it, I
> think clearing pageuptodate on error from ->writepage(lower_page) may be
> bad. Imagine if after such a failed unionfs_writepage, I get a
> unionfs_readpage: that ->readpage will get data from the lower f/s page and
> copy it *over* the unionfs page, even if the upper page's data was more
> recent prior to the failed call to unionfs_writepage. IOW, we could be
> reverting a user-visible mmap'ed page to a previous on-disk version. What
> do you think: could this happen? Anyway, I'll run some exhaustive testing
> next and see what happens if I don't set/clear the uptodate flag in
> unionfs_writepage.

That was my point, and I don't really have more to add. It's unusual
to do anything with PageUptodate when writing. By clearing it when the
lower level has an error, you're throwing away the changes already made
at the upper level. You might have some good reason for that, but it's
worth questioning. If you don't know why you're Set/Clear'ing it there,
better to just take that out.

Hugh

2007-11-12 17:03:14

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[email protected]>, Hugh Dickins writes:
>
> > Three, I believe you need to add a flush_dcache_page(lower_page)
> > after the copy_highpage(lower_page): some architectures will need
> > that to see the new data if they have lower_page mapped (though I
> > expect it's anyway shaky ground to be accessing through the lower
> > mount at the same time as modifying through the upper).
>
> OK.

While looking into something else entirely, I realize that _here_
you are missing a SetPageUptodate(lower_page): should go in after
the flush_dcache_page(lower_page) I'm suggesting. (Nick would argue
for some kind of barrier there too, but I don't think unionfs has a
special need to be ahead of the pack on that issue.)

Think about it:
when find_or_create_page has created a fresh page in the cache,
and you've just done copy_highpage to put the data into it, you
now need to mark it as Uptodate: otherwise a subsequent vfs_read
or whatever on the lower level will find that page !Uptodate and
read stale data back from disk instead of what you just copied in,
unless its dirtiness has got it written back to disk meanwhile.

Odd that that hasn't been noticed at all: I guess it may be hard
to get testing to reclaim lower/upper pages in such a way as to
test out such paths thoroughly.

Hugh

2007-11-13 10:20:14

by Erez Zadok

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Hugh Dickins writes:
> On Fri, 9 Nov 2007, Erez Zadok wrote:
> > In message <[email protected]>, Hugh Dickins writes:
> >
> > > Three, I believe you need to add a flush_dcache_page(lower_page)
> > > after the copy_highpage(lower_page): some architectures will need
> > > that to see the new data if they have lower_page mapped (though I
> > > expect it's anyway shaky ground to be accessing through the lower
> > > mount at the same time as modifying through the upper).
> >
> > OK.
>
> While looking into something else entirely, I realize that _here_
> you are missing a SetPageUptodate(lower_page): should go in after
> the flush_dcache_page(lower_page) I'm suggesting. (Nick would argue
> for some kind of barrier there too, but I don't think unionfs has a
> special need to be ahead of the pack on that issue.)
>
> Think about it:
> when find_or_create_page has created a fresh page in the cache,
> and you've just done copy_highpage to put the data into it, you
> now need to mark it as Uptodate: otherwise a subsequent vfs_read
> or whatever on the lower level will find that page !Uptodate and
> read stale data back from disk instead of what you just copied in,
> unless its dirtiness has got it written back to disk meanwhile.

Heh, funny you should mention this... A few days ago, while I was doing your other
recommended pageuptodate cleanups, I also added the same call to
SetPageUptodate(lower_page) as you suggested. I tested that change along w/
the other changes you suggested, and they all seem to work great all the way
from my 2.6.9 backport to 2.6.24-rc2 and -mm (modulo the fact that I had to
work around or fix more non-unionfs bugs in -mm than unionfs ones to get it
to work :-)

I posted all of these patches just now. You're CC'ed. Hopefully Andrew can
pull from my unionfs.git branch soon.

You also reported in your previous emails some hangs/oopses while doing make
-j 20 in unionfs on top of a single tmpfs, using -mm. After several days,
I've not been able to reproduce this w/ my latest set of patches. If you
can send me your .config and the specs on the h/w you're using (cpus, mem,
etc.), I'll see if I can find something similar to it on my end and run the
same tests.

Cheers,
Erez.

2007-11-17 21:30:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

On Tue, 13 Nov 2007, Erez Zadok wrote:
>
> I posted all of these patches just now. You're CC'ed. Hopefully Andrew can
> pull from my unionfs.git branch soon.
>
> You also reported in your previous emails some hangs/oopses while doing make
> -j 20 in unionfs on top of a single tmpfs, using -mm. After several days,
> I've not been able to reproduce this w/ my latest set of patches. If you
> can send me your .config and the specs on the h/w you're using (cpus, mem,
> etc.), I'll see if I can find something similar to it on my end and run the
> same tests.

I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
but the one including those 9 patches you posted, now gets through
my testing with tmpfs without a problem. I do still get occasional
"unionfs: new lower inode mtime (bindex=0, name=<directory>)"
messages, but nothing worse seen yet: a big improvement.

I deceived myself for a while that the danger of shmem_writepage
hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
I must go back to working out an escape from that one (despite never
seeing it).

I did think you could clean up the doubled set_page_dirtys,
but it's of no consequence.

Hugh

--- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c 2007-11-17 12:23:30.000000000 +0000
+++ linux/fs/unionfs/mmap.c 2007-11-17 20:22:29.000000000 +0000
@@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
copy_highpage(lower_page, page);
flush_dcache_page(lower_page);
SetPageUptodate(lower_page);
+ set_page_dirty(lower_page);

/*
* Call lower writepage (expects locked page). However, if we are
@@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
* success.
*/
if (wbc->for_reclaim) {
- set_page_dirty(lower_page);
unlock_page(lower_page);
goto out_release;
}
+
BUG_ON(!lower_mapping->a_ops->writepage);
- set_page_dirty(lower_page);
clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
err = lower_mapping->a_ops->writepage(lower_page, wbc);
if (err < 0)

2007-11-20 01:36:49

by Erez Zadok

[permalink] [raw]
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

In message <[email protected]>, Hugh Dickins writes:
> On Tue, 13 Nov 2007, Erez Zadok wrote:
[...]
> I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
> but the one including those 9 patches you posted, now gets through
> my testing with tmpfs without a problem. I do still get occasional
> "unionfs: new lower inode mtime (bindex=0, name=<directory>)"
> messages, but nothing worse seen yet: a big improvement.

Excellent.

> I did think you could clean up the doubled set_page_dirtys,
> but it's of no consequence.

Yes, looks good. I'll send that as a patch. Thanks.

> Hugh
>
> --- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c 2007-11-17 12:23:30.000000000 +0000
> +++ linux/fs/unionfs/mmap.c 2007-11-17 20:22:29.000000000 +0000
> @@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
> copy_highpage(lower_page, page);
> flush_dcache_page(lower_page);
> SetPageUptodate(lower_page);
> + set_page_dirty(lower_page);
>
> /*
> * Call lower writepage (expects locked page). However, if we are
> @@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
> * success.
> */
> if (wbc->for_reclaim) {
> - set_page_dirty(lower_page);
> unlock_page(lower_page);
> goto out_release;
> }
> +
> BUG_ON(!lower_mapping->a_ops->writepage);
> - set_page_dirty(lower_page);
> clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
> err = lower_mapping->a_ops->writepage(lower_page, wbc);
> if (err < 0)

Erez.