Here's a few patches to unionfs, which together with the tmpfs
patches I just posted, get that combination working better.
fs/stack.c | 6 +++
fs/unionfs/inode.c | 83 +++++++++++++++++++++----------------------
fs/unionfs/mmap.c | 3 -
3 files changed, 47 insertions(+), 45 deletions(-)
But on the way I've noticed a number of issues with unionfs not dealt
with in these patches.
1. Please try unionfs with lockdep configured on (perhaps while running LTP
to get coverage of a variety of cases): I soon gave up, not knowing which
way round the locks should really be taken e.g. unionfs_read_lock versus
unionfs_lock_dentry.
2. I'm worried about the locking generally (no doubt it's difficult!).
My 2/4 patch half-fixes one issue, but raises others: if some function
in a filesystem is usually called with i_mutex held, are you safe to be
calling it (not knowing that lower filesystem) without its i_mutex held?
3. LTP's rename14 tends to fill the filesystem with inodes, which prevent
further tests from running (but not on all machines: a timing issue perhaps).
I have to run LTP with rename14 moved aside.
4. Though you made improvements, some blocks still remain mysteriously
allocated even after the unionfs should be empty e.g. after an LTP run.
Hugh
Remove !mapping_cap_writeback_dirty shortcircuit from unionfs_writepages.
It was introduced to avoid the stray AOP_WRITEPAGE_ACTIVATE coming from
shmem_writepage; but that has since been fixed in shmem_writepage and in
write_cache_pages. It stayed because it looked like a good optimization,
not to waste time calling down to tmpfs when that would serve no purpose.
But in fact this optimization causes hangs when running LTP with unionfs
over tmpfs. The problem is that the test comes at the wrong level: unionfs
has already declared in its default_backing_dev_info that it's playing by
cap_writeback_dirty rules. If it does nothing here in its writepages, its
dirty pages accumulate and choke the system. What's needed is to carry on
down and let its pages be cleaned while in turn they dirty the lower level.
And this now has an additional benefit for tmpfs, that a sync or pdflush
pushes these pages down to shmem_writepage, letting it match the filepage
coming from unionfs with the swap which may have been allocated earlier,
so it can free the duplication sooner than waiting for further pressure.
Signed-off-by: Hugh Dickins <[email protected]>
---
fs/unionfs/mmap.c | 3 ---
1 file changed, 3 deletions(-)
--- 2.6.24-rc5-mm1/fs/unionfs/mmap.c 2007-12-05 10:38:38.000000000 +0000
+++ unionfs1/fs/unionfs/mmap.c 2007-12-05 16:50:15.000000000 +0000
@@ -130,9 +130,6 @@ static int unionfs_writepages(struct add
if (!lower_inode)
goto out;
- if (!mapping_cap_writeback_dirty(lower_inode->i_mapping))
- goto out;
-
err = generic_writepages(mapping, wbc);
if (!err)
unionfs_copy_attr_times(inode);
LTP's iogen01 doio tests hang nicely on 32-bit SMP when /tmp is a unionfs
mount of a tmpfs. See the comment on i_size_write in linux/fs.h: it needs
to be locked, otherwise i_size_read can spin forever waiting for a lost
seqcount update.
Most filesystems are already holding i_mutex for this, but unionfs calls
fsstack_copy_inode_size from many places, not necessarily holding i_mutex.
Use the low-level i_lock within fsstack_copy_inode_size when 32-bit SMP.
Signed-off-by: Hugh Dickins <[email protected]>
---
This works for me, to help get through LTP, but is not enough: what about
those places which use i_size_write in common code called by unionfs?
They ought to participate in this i_locking too. Needs an audit to
enumerate such places and decide what course to take.
Bigger question: is unionfs calling down to filesystems without holding
their i_mutex, where some may be relying on their i_mutex to be held?
fs/stack.c | 6 ++++++
1 file changed, 6 insertions(+)
--- 2.6.24-rc5-mm1/fs/stack.c 2007-12-05 10:38:38.000000000 +0000
+++ unionfs2/fs/stack.c 2007-12-05 16:50:15.000000000 +0000
@@ -21,8 +21,14 @@
*/
void fsstack_copy_inode_size(struct inode *dst, const struct inode *src)
{
+#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
+ spin_lock(&dst->i_lock);
+#endif
i_size_write(dst, i_size_read((struct inode *)src));
dst->i_blocks = src->i_blocks;
+#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
+ spin_unlock(&dst->i_lock);
+#endif
}
EXPORT_SYMBOL_GPL(fsstack_copy_inode_size);
In order to fix unionfs truncation, we need to move the lower notify_change
out of the loop in unionfs_setattr. But when I came to do that, I couldn't
understand that loop at all: its continues and breaks and gotos are obscure,
most particularly the "if (copyup || (bindex != bstart)) continue;" which
seems to make a nonsense of the whole thing.
Here's my attempt to restructure it, prior to making the functional change
in the next patch; but I may have misunderstood badly, please check - I've
not tested it beyond my degenerate dirs=/tmpfs case, which hardly tests it.
Signed-off-by: Hugh Dickins <[email protected]>
---
fs/unionfs/inode.c | 64 ++++++++++++++++++-------------------------
1 file changed, 28 insertions(+), 36 deletions(-)
--- 2.6.24-rc5-mm1/fs/unionfs/inode.c 2007-12-05 10:38:38.000000000 +0000
+++ unionfs3/fs/unionfs/inode.c 2007-12-05 18:50:52.000000000 +0000
@@ -1004,11 +1004,10 @@ static int unionfs_setattr(struct dentry
{
int err = 0;
struct dentry *lower_dentry;
- struct inode *inode = NULL;
- struct inode *lower_inode = NULL;
+ struct inode *inode;
+ struct inode *lower_inode;
int bstart, bend, bindex;
- int i;
- int copyup = 0;
+ loff_t size;
unionfs_read_lock(dentry->d_sb);
unionfs_lock_dentry(dentry);
@@ -1029,50 +1028,43 @@ static int unionfs_setattr(struct dentry
if (ia->ia_valid & (ATTR_KILL_SUID | ATTR_KILL_SGID))
ia->ia_valid &= ~ATTR_MODE;
- for (bindex = bstart; (bindex <= bend) || (bindex == bstart);
- bindex++) {
- lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
- if (!lower_dentry)
- continue;
- BUG_ON(lower_dentry->d_inode == NULL);
-
+ lower_dentry = unionfs_lower_dentry(dentry);
+ if (lower_dentry) {
/* If the file is on a read only branch */
- if (is_robranch_super(dentry->d_sb, bindex)
+ if (is_robranch_super(dentry->d_sb, bstart)
|| IS_RDONLY(lower_dentry->d_inode)) {
- if (copyup || (bindex != bstart))
- continue;
- /* Only if its the leftmost file, copyup the file */
- for (i = bstart - 1; i >= 0; i--) {
- loff_t size = i_size_read(dentry->d_inode);
- if (ia->ia_valid & ATTR_SIZE)
- size = ia->ia_size;
+
+ if (ia->ia_valid & ATTR_SIZE)
+ size = ia->ia_size;
+ else
+ size = i_size_read(inode);
+
+ for (bindex = bstart - 1; bindex >= 0; bindex--) {
err = copyup_dentry(dentry->d_parent->d_inode,
- dentry, bstart, i,
+ dentry, bstart, bindex,
dentry->d_name.name,
dentry->d_name.len,
NULL, size);
-
- if (!err) {
- copyup = 1;
- lower_dentry =
- unionfs_lower_dentry(dentry);
+ if (!err)
break;
- }
- /*
- * if error is in the leftmost branch, pass
- * it up.
- */
- if (i == 0)
- goto out;
}
+ if (err)
+ goto out;
+ lower_dentry = unionfs_lower_dentry(dentry);
+ }
+ } else {
+ for (bindex = bstart + 1; bindex <= bend; bindex++) {
+ lower_dentry = unionfs_lower_dentry_idx(dentry, bindex);
+ if (lower_dentry)
+ break;
}
- err = notify_change(lower_dentry, ia);
- if (err)
- goto out;
- break;
}
+ err = notify_change(lower_dentry, ia);
+ if (err)
+ goto out;
+
/* for mmap */
if (ia->ia_valid & ATTR_SIZE) {
if (ia->ia_size != i_size_read(inode)) {
fsx-linux in a unionfs soon fails (with oops on tmpfs): truncation doesn't
work right.
When shrinking a file, unionfs_setattr needs to vmtruncate the upper level
before notifying change to the lower level, to eliminate those dirty pages
beyond new eof which otherwise drift down to the lower level's writepage,
writing beyond its eof (and later uncovered when the file is expanded).
Also truncate the upper level first when expanding, in the case when
the upper level's s_maxbytes is more limiting than the lower level's.
Signed-off-by: Hugh Dickins <[email protected]>
---
fs/unionfs/inode.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
--- unionfs3/fs/unionfs/inode.c 2007-12-05 18:50:52.000000000 +0000
+++ unionfs4/fs/unionfs/inode.c 2007-12-05 18:58:51.000000000 +0000
@@ -1060,23 +1060,30 @@ static int unionfs_setattr(struct dentry
break;
}
}
+ lower_inode = unionfs_lower_inode(inode);
- err = notify_change(lower_dentry, ia);
- if (err)
- goto out;
-
- /* for mmap */
+ /*
+ * If shrinking, first truncate upper level to cancel writing dirty
+ * pages beyond the new eof; and also if its maxbytes is more limiting
+ * (fail with -EFBIG before making any change to the lower level).
+ * There is no need to vmtruncate the upper level afterwards in the
+ * other cases: we fsstack_copy_inode_size from the lower level.
+ */
if (ia->ia_valid & ATTR_SIZE) {
- if (ia->ia_size != i_size_read(inode)) {
+ size = i_size_read(inode);
+ if (ia->ia_size < size || (ia->ia_size > size &&
+ inode->i_sb->s_maxbytes < lower_inode->i_sb->s_maxbytes)) {
err = vmtruncate(inode, ia->ia_size);
if (err)
- printk(KERN_ERR
- "unionfs: setattr: vmtruncate failed\n");
+ goto out;
}
}
- /* get the size from the first lower inode */
- lower_inode = unionfs_lower_inode(inode);
+ err = notify_change(lower_dentry, ia);
+ if (err)
+ goto out;
+
+ /* get attributes from the first lower inode */
unionfs_copy_attr_all(inode, lower_inode);
/*
* unionfs_copy_attr_all will copy the lower times to our inode if
In message <[email protected]>, Hugh Dickins writes:
> In order to fix unionfs truncation, we need to move the lower notify_change
> out of the loop in unionfs_setattr. But when I came to do that, I couldn't
[...]
Hugh, I want to understand how patches 3/4 and 4/4 are related. In patch 3
you say "in order to fix truncation" but you mention a truncation problem
only in patch 4; is there a patch ordering problem, or they're both related
to the same issue (with 3/4 being a code cleanup, and 4/4 actually fixing
the problem)?
What tests did you conduct to tickle this truncation problem: I assume
fsx-linux through unionfs, mounted on tmpfs? Did that include both series
of patches (your 9 tmpfs patches, plus the two memcgrpoup?).
Thanks,
Erez.
On Tue, 18 Dec 2007, Erez Zadok wrote:
> In message <[email protected]>, Hugh Dickins writes:
> > In order to fix unionfs truncation, we need to move the lower notify_change
> > out of the loop in unionfs_setattr. But when I came to do that, I couldn't
> [...]
>
> Hugh, I want to understand how patches 3/4 and 4/4 are related. In patch 3
> you say "in order to fix truncation" but you mention a truncation problem
> only in patch 4; is there a patch ordering problem, or they're both related
> to the same issue (with 3/4 being a code cleanup, and 4/4 actually fixing
> the problem)?
I needed to move that notify_change out of the loop, to fix the truncation
problem, but had great difficulty understanding the loop. So, just as you
say, made the code cleanup first in 3/4, then fixed the problem in 4/4.
But that cleanup does need your review and testing.
>
> What tests did you conduct to tickle this truncation problem: I assume
> fsx-linux through unionfs, mounted on tmpfs?
Exactly. The familiar "fsx foo -q -c 100 -l 10000000" on its own is
enough to trigger it, though my habit is to run that while forcing
swapout and swapoff too. The problem isn't peculiar to tmpfs,
but I expect other filesystems are more likely just to fail the
fsx test rather than oops.
> Did that include both series
> of patches (your 9 tmpfs patches, plus the two memcgrpoup?).
Any kernel without the unionfs 4/4 should show the problem, but
differently with different sets of patches. A lot of my testing
was with an earlier set of tmpfs patches for unionfs (I didn't
realize the tmpfs 5/9 issue until later on), and with those it
manifested as fsx failure.
But if you run without any of these patches,
I believe it manifests as shmem_writepage's
BUG_ON(!(info->flags & SHMEM_TRUNCATE));
or its
BUG_ON(!entry);
Whereas in tmpfs 4/9 I removed that latter BUG_ON(!entry) -
we don't usually bother to BUG on NULL pointers, just let the
dereference oops.
Hugh
In message <[email protected]>, Hugh Dickins writes:
> On Tue, 18 Dec 2007, Erez Zadok wrote:
> > In message <[email protected]>, Hugh Dickins writes:
> > > In order to fix unionfs truncation, we need to move the lower notify_change
> > > out of the loop in unionfs_setattr. But when I came to do that, I couldn't
> > [...]
> >
> > Hugh, I want to understand how patches 3/4 and 4/4 are related. In patch 3
> > you say "in order to fix truncation" but you mention a truncation problem
> > only in patch 4; is there a patch ordering problem, or they're both related
> > to the same issue (with 3/4 being a code cleanup, and 4/4 actually fixing
> > the problem)?
>
> I needed to move that notify_change out of the loop, to fix the truncation
> problem, but had great difficulty understanding the loop. So, just as you
> say, made the code cleanup first in 3/4, then fixed the problem in 4/4.
>
> But that cleanup does need your review and testing.
A quick look at your setattr patches seems very promising. And I've further
realized more cleanup is possible: because we call revalidate_chain at the
opening to ->setattr, we're guaranteed to have a valid lower dentry, so for
example we can remove the else case for "if (lower_dentry)".
I'll be testing this and your other patches, plus look into the locking
issues you brought up.
Thanks,
Erez.
In message <[email protected]>, Hugh Dickins writes:
> Here's a few patches to unionfs, which together with the tmpfs
> patches I just posted, get that combination working better.
>
> fs/stack.c | 6 +++
> fs/unionfs/inode.c | 83 +++++++++++++++++++++----------------------
> fs/unionfs/mmap.c | 3 -
> 3 files changed, 47 insertions(+), 45 deletions(-)
The series of patches I just posted includes your fixes, and more. Thanks!
> But on the way I've noticed a number of issues with unionfs not dealt
> with in these patches.
>
> 1. Please try unionfs with lockdep configured on (perhaps while running LTP
> to get coverage of a variety of cases): I soon gave up, not knowing which
> way round the locks should really be taken e.g. unionfs_read_lock versus
> unionfs_lock_dentry.
>
> 2. I'm worried about the locking generally (no doubt it's difficult!).
> My 2/4 patch half-fixes one issue, but raises others: if some function
> in a filesystem is usually called with i_mutex held, are you safe to be
> calling it (not knowing that lower filesystem) without its i_mutex held?
The series of patches I just posted addresses locking. Lockdep helped me
find and fix a number of locking problems (deadlocks and races). I tried
the fixes w/ my unionfs-specific regression suite (on top of many different
filesystems), with ltp on top of ext3 or tmpfs, and with a large "make -j"
kernel compile on top of ext3 or tmpfs. I also tried it with machines with
more/less memory, SMP/UMP, and/or while forcing drop_caches.
I also found and reported a lockdep problem in nfs4 and ext3. I also
re-discovered a long-standing lockdep problem in jffs2 (which I reported a
while back, but it still exists).
> 3. LTP's rename14 tends to fill the filesystem with inodes, which prevent
> further tests from running (but not on all machines: a timing issue
> perhaps). I have to run LTP with rename14 moved aside.
>
> 4. Though you made improvements, some blocks still remain mysteriously
> allocated even after the unionfs should be empty e.g. after an LTP run.
The series of patches I just posted fixes these. Basically, we weren't
releasing the lower inodes early enough, in several scenarios. So they hung
around until the f/s unmounted. To test the fixes, I've run each of the 707
LTP "syscalls" tests, and each of the 54 "fs" tests, and compared the number
of inodes and blocks used in the upper and lower file systems before and
after each test. And I've tested it with LTP running in a union on top of
ext3 or tmpfs. I'm happy to report we don't leak a single inode now.
> Hugh
I've tried all of these on linus's latest tree. Since Andrew reported that
x86 is broken in -mm, I wasn't yet able to try it with mm or with your tmpfs
patches.
Cheers,
Erez.