2007-11-20 19:15:38

by Hugh Dickins

Subject: unionfs: several more problems

I think it may be time to move away from that by-now-misleading
"msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland" thread.
And I've cut most out of the Cc list, but by all means add any back.

On Saturday I said:
> I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
> but the one including those 9 patches you posted, now gets through
> my testing with tmpfs without a problem. I do still get occasional
> "unionfs: new lower inode mtime (bindex=0, name=<directory>)"
> messages, but nothing worse seen yet: a big improvement.

I have seen worse now: details below.

> I deceived myself for a while that the danger of shmem_writepage
> hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
> I must go back to working out an escape from that one (despite never
> seeing it).

Once I tried a more appropriate test (fsx while swapping) I hit that
easily. After some thought and testing, I'm happy with the mm/shmem.c
+mm/swap_state.c fixes I've arrived at for that; but since it's not
easy to reproduce in normal usage, and hasn't been holding you up,
I'd prefer for the moment to hold on to that patch. I need to make
changes around the same pagecache<->swapcache area to solve some mem
cgroup issues: there might turn out to be some interaction, so I'd
rather finalize both patches in the same series if I can.

For a while I thought the further unionfs problems I'm seeing were
tmpfs ones, but now I seem to be seeing all(?) of them with ext2:
so I'd better spend no more time on them, hand them over to you.

Please try running LTP (e.g. ltp-full-20071031.tgz) after something like
mkfs -t ext2 /dev/sdb1
mount -t ext2 /dev/sdb1 /mnt
mount -t unionfs -o dirs=/mnt unionfs /tmp

That throws up quite a number of errors which don't occur
with a straightforward ext2 mounted on /tmp. Notably, when it
embarks upon "MULTIPLE PROCESSES CREATING AND DELETING FILES",
mkdir seems to collapse into lots of
mkdir: cannot create directory `dirN': No space left on device
mkdir dirN FAILED
(the same happened when I was using a tmpfs instead of an ext2).

But perhaps before fixing up the several LTP tests, you'll want
to concentrate on a more directed test. Please try this sequence:

# Running with mem=512M, probably irrelevant
swapoff -a # Merely to rule out one potential confusion
mkfs -t ext2 /dev/sdb1
mount -t ext2 /dev/sdb1 /mnt
df /mnt # I have 2280 Used out of 1517920 KB
cp -a 2.6.24-rc3 /mnt # Copy a kernel source tree into ext2
rm -rf /mnt/2.6.24-rc3 # Delete the copy
df /mnt # Again 2280 Used, just as you'd expect
mount -t unionfs -o dirs=/mnt unionfs /tmp
cp -a 2.6.24-rc3 /tmp # Copy a kernel source tree into unionfs
rm -rf /tmp/2.6.24-rc3 # Generates 176 unionfs: filldir error messages
df /mnt # Now 68380 Used (df /tmp shows the same)
ls -a /mnt # Shows . .. .wh.2.6.24-rc3 lost+found
echo 1 >/proc/sys/vm/drop_caches # to free pagecache
df /mnt # Still 68380 Used (df /tmp shows the same)
echo 2 >/proc/sys/vm/drop_caches # to free dentries and inodes
df /mnt # Now 2280 Used as it should be (df /tmp same)
ls -a /mnt # But still shows that .wh.2.6.24-rc3
umount /tmp # Restore
umount /mnt # Restore
swapon -a # Restore

Three different problems there:

1. Whiteouts seem to get left behind (at this top level anyway):
I'm getting an increasing number of .wh.run-crons.????? files there.
I'm not familiar with the correct behaviour for whiteouts (and it's
not clear to me why whiteouts are needed at all in this degenerate
case of a single directory in the union, but never mind).

2. Why does copying then deleting a tree leave blocks allocated,
which remain allocated indefinitely, until memory pressure or
drop_caches removes them? Hmm, I should have done "df -i" instead,
that would be more revealing. This may well be the same as the LTP
mkdir problem - inodes remaining half-allocated after they're unlinked.

3. The unionfs filldir error messages:
unionfs: filldir: possible I/O error: a file is duplicated in the same
branch 0: page_offset.h
<... 174 more include/asm-* header files named until ...>
unionfs: filldir: possible I/O error: a file is duplicated in the same
branch 0: sfafsr.h
It's tempting to suppose these are related to the unfreed inodes, but
retrying again gives me 176 messages, whereas inodes fall from 2672
to 30. And I didn't see those messages with tmpfs, just with ext2.

Hugh


2007-11-23 21:20:30

by Erez Zadok

Subject: Re: unionfs: several more problems

In message <[email protected]>, Hugh Dickins writes:
[...]
> > I deceived myself for a while that the danger of shmem_writepage
> > hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
> > I must go back to working out an escape from that one (despite never
> > seeing it).
>
> Once I tried a more appropriate test (fsx while swapping) I hit that
> easily. After some thought and testing, I'm happy with the mm/shmem.c
> +mm/swap_state.c fixes I've arrived at for that; but since it's not
> easy to reproduce in normal usage, and hasn't been holding you up,
> I'd prefer for the moment to hold on to that patch. I need to make
> changes around the same pagecache<->swapcache area to solve some mem
> cgroup issues: there might turn out to be some interaction, so I'd
> rather finalize both patches in the same series if I can.
[...]

If you want, send me those patches and I'll run them w/ my tests, even if
they're not finalized; my testing can give you another useful point of
reference.

> But perhaps before fixing up the several LTP tests, you'll want
> to concentrate on a more directed test. Please try this sequence:
>
> # Running with mem=512M, probably irrelevant
> swapoff -a # Merely to rule out one potential confusion
> mkfs -t ext2 /dev/sdb1
> mount -t ext2 /dev/sdb1 /mnt
> df /mnt # I have 2280 Used out of 1517920 KB
> cp -a 2.6.24-rc3 /mnt # Copy a kernel source tree into ext2
> rm -rf /mnt/2.6.24-rc3 # Delete the copy
> df /mnt # Again 2280 Used, just as you'd expect
> mount -t unionfs -o dirs=/mnt unionfs /tmp
> cp -a 2.6.24-rc3 /tmp # Copy a kernel source tree into unionfs
> rm -rf /tmp/2.6.24-rc3 # Generates 176 unionfs: filldir error messages
> df /mnt # Now 68380 Used (df /tmp shows the same)
> ls -a /mnt # Shows . .. .wh.2.6.24-rc3 lost+found
> echo 1 >/proc/sys/vm/drop_caches # to free pagecache
> df /mnt # Still 68380 Used (df /tmp shows the same)
> echo 2 >/proc/sys/vm/drop_caches # to free dentries and inodes
> df /mnt # Now 2280 Used as it should be (df /tmp same)
> ls -a /mnt # But still shows that .wh.2.6.24-rc3
> umount /tmp # Restore
> umount /mnt # Restore
> swapon -a # Restore
>
> Three different problems there:
>
> 1. Whiteouts seem to get left behind (at this top level anyway):
> I'm getting an increasing number of .wh.run-crons.????? files there.
> I'm not familiar with the correct behaviour for whiteouts (and it's
> not clear to me why whiteouts are needed at all in this degenerate
> case of a single directory in the union, but never mind).

I could spend a lot of time explaining the history of whiteouts in unioning
file systems, and all the different techniques and algorithms we've tried
ourselves over the years. But suffice it to say that I'd be very happy the
day every Linux f/s has native whiteout support. :-)

Our current policy for when/where to create whiteouts has evolved after much
experience with users. The most common use case for unionfs is one or more
read-only branches, plus a high-priority writable branch (for copyup).
Therefore, in the most common case we cannot remove the objects from the
readonly branches, and have to create a whiteout instead.

Using a single branch with unionfs is very uncommon among unionfs users, but
it serves nicely for "null layer" testing (a la BSD's Nullfs or my fistgen's
wrapfs).

Anyway, upon thinking further about this issue I realized that whiteouts in
the single-branch situation are just a special case of a possibly more
common one -- when the object being unlink'ed (or rmdir'ed) is on the
rightmost, lowest-priority branch in which it is known to exist. In that
case, there's no need to create a whiteout there, b/c there's no chance that
a readonly file by the same name could exist below that branch. The same is
true if you try to rmdir a directory anywhere in one of the union's
branches: if we know (thanks to ->lookup) that there is no dir by the same
name anywhere else, then we can safely skip creating a whiteout when the
lowest-priority dir is being rmdir'ed.

I've got a small patch that does just that.
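
To make the rule concrete, here's a hedged sketch of that decision -- a
sketch only, not the actual patch; branch_count() and
name_exists_in_branch() are made-up helper names standing in for whatever
unionfs really uses:

/*
 * Sketch only: does unlink/rmdir of an object found on branch 'bindex'
 * need a whiteout?  branch_count() and name_exists_in_branch() are
 * illustrative helpers, not the real unionfs API.
 */
static bool whiteout_needed(struct dentry *dentry, int bindex)
{
        int bend = branch_count(dentry->d_sb) - 1; /* lowest-priority branch */
        int i;

        /*
         * Object is on the lowest-priority branch: nothing below it can
         * "show through", so no whiteout is needed.
         */
        if (bindex == bend)
                return false;

        /*
         * Otherwise a whiteout is needed only if some lower-priority
         * branch still holds an object by the same name (known from
         * ->lookup).
         */
        for (i = bindex + 1; i <= bend; i++)
                if (name_exists_in_branch(dentry, i))
                        return true;

        return false;
}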

> 2. Why does copying then deleting a tree leave blocks allocated,
> which remain allocated indefinitely, until memory pressure or
> drop_caches removes them? Hmm, I should have done "df -i" instead,
> that would be more revealing. This may well be the same as the LTP
> mkdir problem - inodes remaining half-allocated after they're unlinked.

Turns out we weren't releasing the refs on the lower directory being
rmdir'ed as early as we could. We'd have done it in delete/clear_inode,
upon memory pressure, or unmount -- so those resources wouldn't have stuck
around forever.

I now have a small patch that releases those resources on rmdir and the
space (df and df -i) is reclaimed right away.

And, with the above patch to not create whiteouts on lowest-priority
branches, even that one ".wh.2.6.24-rc3" is gone in your case.
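
Roughly, the rmdir-time release described above looks like this -- again a
sketch with illustrative accessor names (lower_dentry_of(),
lower_inode_of()), not the actual patch:

/*
 * Sketch only: once the lower ->rmdir has succeeded, drop the
 * references unionfs holds on the lower directory right away, instead
 * of leaving that to clear_inode/delete_inode, memory pressure, or
 * unmount.  A real version would also clear the stored pointers.
 */
static void release_lower_dir(struct dentry *upper_dentry)
{
        struct dentry *lower_dentry = lower_dentry_of(upper_dentry);
        struct inode *lower_inode = lower_inode_of(upper_dentry->d_inode);

        if (lower_dentry)
                dput(lower_dentry);     /* drop the pinned lower dentry */
        if (lower_inode)
                iput(lower_inode);      /* last iput lets ext2 free it now */
}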

> 3. The unionfs filldir error messages:
> unionfs: filldir: possible I/O error: a file is duplicated in the same
> branch 0: page_offset.h
> <... 174 more include/asm-* header files named until ...>
> unionfs: filldir: possible I/O error: a file is duplicated in the same
> branch 0: sfafsr.h
> It's tempting to suppose these are related to the unfreed inodes, but
> retrying again gives me 176 messages, whereas inodes fall from 2672
> to 30. And I didn't see those messages with tmpfs, just with ext2.

This had to do with a special case in processing readdir, when we find a
whiteout file (e.g., ".wh.foo.c"), and we want to know if we've already seen
the same file by its non-whiteout name (e.g., "foo.c"). In both cases we
were looking for cached entries named "foo.c", but if we found the name
twice in the readdir cache, we considered it a serious error (EIO, perhaps
even a bug). After all, a duplicate elimination algorithm should not find
duplicates after eliminating them. :-)

Anyway, the message which you saw was a false positive, and should have been
printed only if we were not looking for a whiteout. I've got a small fix
for that as well.
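
The shape of the corrected check is roughly this -- a sketch with
illustrative names (struct filldir_node and its bindex field), not the
actual unionfs code:

/*
 * Sketch only: when the name being processed is a whiteout
 * (".wh.foo.c"), finding a cached entry for "foo.c" is expected; it is
 * an I/O error only when we are NOT looking for a whiteout.
 */
static int check_duplicate(struct filldir_node *found, int bindex,
                           bool whiteout, const char *name)
{
        if (!found)
                return 0;               /* first sighting of this name */

        if (found->bindex == bindex && !whiteout) {
                /*
                 * Same name twice in the same branch, and not the
                 * whiteout case: dupe elimination should have caught it.
                 */
                printk(KERN_ERR "unionfs: filldir: possible I/O error: "
                       "a file is duplicated in the same branch %d: %s\n",
                       bindex, name);
                return -EIO;
        }

        return 0;
}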

> Hugh

Hugh, if you want the fixes for the above problems that I've already solved,
let me know and I'll post them. Otherwise I'll probably wait until I finish
the LTP testing you suggested, then post everything.

Cheers,
Erez.

2007-11-26 16:53:27

by Erez Zadok

Subject: Re: unionfs: several more problems

In message <[email protected]>, Hugh Dickins writes:

[...]
> Please try running LTP (e.g. ltp-full-20071031.tgz) after something like
[...]

Hugh,

I just posted a series of patches to unionfs (already in unionfs.git on
korg), which fix every problem LTP found, as well as other problems
mentioned in your email. With this series of patches, the same set of tests
which pass on ext3 also pass with unionfs mounted over ext3.

I'd like to call your attention to one small unionfs_writepage fix. One of
the ltp fs tests (rwtest04) triggered a BUG_ON(PageWriteback(page)) in
fs/buffer.c:1706, when we call the lower ->writepage. This was due to
multiple writers to the same page. Looking at other kernel code, it seems
that the right fix is to use wait_on_page_writeback(lower_page) to serialize
concurrent writebacks to the same page, no? The small patch below fixed the
problem. Let me know what you think.

Thanks,
Erez.


diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
index eef43aa..d0fd4d0 100644
--- a/fs/unionfs/mmap.c
+++ b/fs/unionfs/mmap.c
@@ -73,6 +73,7 @@ static int unionfs_writepage(struct page *page, struct writeback_control *wbc)

BUG_ON(!lower_mapping->a_ops->writepage);
clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
+ wait_on_page_writeback(lower_page); /* prevent multiple writers */
err = lower_mapping->a_ops->writepage(lower_page, wbc);
if (err < 0)
goto out_release;

2007-11-26 20:15:25

by Hugh Dickins

Subject: Re: unionfs: several more problems

On Mon, 26 Nov 2007, Erez Zadok wrote:
>
> I just posted a series of patches to unionfs (already in unionfs.git on
> korg), which fix every problem LTP found, as well as other problems
> mentioned in your email. With this series of patches, the same set of tests
> which pass on ext3 also pass with unionfs mounted over ext3.

Good, I'll give them a try, but not tonight - my own bugs to work on!

>
> I'd like to call your attention to one small unionfs_writepage fix. One of
> the ltp fs tests (rwtest04) triggered a BUG_ON(PageWriteback(page)) in
> fs/buffer.c:1706, when we call the lower ->writepage. This was due to
> multiple writers to the same page. Looking at other kernel code, it seems
> that the right fix is to use wait_on_page_writeback(lower_page) to serialize
> concurrent writebacks to the same page, no?

Ah, yes. I believe that's the right thing to do (and you do have the
lower_page locked there, which is necessary for that serialization).

> The small patch below fixed the problem. Let me know what you think.

I've one issue with it: please move that wait_on_page_writeback before
the clear_page_dirty_for_io instead of after it, then resubmit your 14/16.

As it stands, you're still mixing the end of the previous writeback with
the beginning of the new one (e.g. clear_page_dirty_for_io is clearing
the PageReclaim flag when it shouldn't be interfering at all); and the
page might get dirtied during that wait for the previous writeback to
complete, in which case you'd end up doing an extra unnecessary write
(because page later found "dirty" even though written back long after
last modification).

Whether this is a big deal in any way, I don't know; but if you look
at other instances, you'll find the wait_on_page_writeback is done
before the clear_page_dirty_for_io, so you'll surely be on safer
ground to follow the same ordering.
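
Concretely, the ordering I'm asking for in that hunk is just:

        wait_on_page_writeback(lower_page);  /* let any previous writeback finish */
        clear_page_dirty_for_io(lower_page); /* then start this one cleanly */
        err = lower_mapping->a_ops->writepage(lower_page, wbc);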

(And so far as I can see, everyone who calls write_one_page does so
with the wait flag 1: that's the clearest example. Hmm, I wonder if
its initial wait_on_page_writeback is right to be conditional on wait
at all. Well, until someone calls it with wait 0, it's hard to tell.)
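
For example, a typical call site is simply

        err = write_one_page(page, 1);  /* write the page and wait for it */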

Hugh

>
> Thanks,
> Erez.
>
>
> diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
> index eef43aa..d0fd4d0 100644
> --- a/fs/unionfs/mmap.c
> +++ b/fs/unionfs/mmap.c
> @@ -73,6 +73,7 @@ static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
>
> BUG_ON(!lower_mapping->a_ops->writepage);
> clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
> + wait_on_page_writeback(lower_page); /* prevent multiple writers */
> err = lower_mapping->a_ops->writepage(lower_page, wbc);
> if (err < 0)
> goto out_release;

2007-11-27 04:17:54

by Erez Zadok

Subject: Re: unionfs: several more problems

In message <[email protected]>, Hugh Dickins writes:
> On Mon, 26 Nov 2007, Erez Zadok wrote:
[...]
> > The small patch below fixed the problem. Let me know what you think.
>
> I've one issue with it: please move that wait_on_page_writeback before
> the clear_page_dirty_for_io instead of after it, then resubmit your 14/16.
[...]

Done, tested, and working. Here's the revised patch (pushed to unionfs.git
on korg).

Thanks,
Erez.

------------------------------------------------------------------------------

Unionfs: prevent multiple writers to lower_page

Without this patch, the LTP fs test "rwtest04" triggers a
BUG_ON(PageWriteback(page)) in fs/buffer.c:1706.

CC: Hugh Dickins <[email protected]>

Signed-off-by: Erez Zadok <[email protected]>
diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
index 623a913..74f2e53 100644
--- a/fs/unionfs/mmap.c
+++ b/fs/unionfs/mmap.c
@@ -72,6 +72,7 @@ static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
}

BUG_ON(!lower_mapping->a_ops->writepage);
+ wait_on_page_writeback(lower_page); /* prevent multiple writers */
clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
err = lower_mapping->a_ops->writepage(lower_page, wbc);
if (err < 0)