LTP's iogen01 doio tests used to hang nicely on 32-bit SMP when /tmp was a
unionfs mount of a tmpfs, i_size_read spinning forever, waiting for a lost
seqcount update: fixed by taking i_lock around i_size_write when 32-bit SMP.
But akpm was dissatisfied with the resulting patch: its lack of commentary,
the #ifs, the nesting around i_size_read, the lack of attention to i_blocks.
I promised to redo it with the general spin_lock_32bit() he proposed; but
disliked the result, partly because "32bit" obscures the real constraints,
which are best commented within fsstack_copy_inode_size itself.
This version adds those comments, and uses sizeof comparisons which the
compiler can optimize out, instead of CONFIG_SMP, CONFIG_LSF. BITS_PER_LONG.
Signed-off-by: Hugh Dickins <[email protected]>
---
Should follow mmotm's git-unionfs-fixup.patch
fs/stack.c | 58 +++++++++++++++++++++++++++++++------
include/linux/fs_stack.h | 3 -
2 files changed, 50 insertions(+), 11 deletions(-)
--- mmotm/fs/stack.c 2008-06-27 13:39:18.000000000 +0100
+++ linux/fs/stack.c 2008-06-27 18:24:19.000000000 +0100
@@ -19,16 +19,56 @@
* This function cannot be inlined since i_size_{read,write} is rather
* heavy-weight on 32-bit systems
*/
-void fsstack_copy_inode_size(struct inode *dst, const struct inode *src)
+void fsstack_copy_inode_size(struct inode *dst, struct inode *src)
{
-#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
- spin_lock(&dst->i_lock);
-#endif
- i_size_write(dst, i_size_read(src));
- dst->i_blocks = src->i_blocks;
-#if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
- spin_unlock(&dst->i_lock);
-#endif
+ loff_t i_size;
+ blkcnt_t i_blocks;
+
+ /*
+ * i_size_read() includes its own seqlocking and protection from
+ * preemption (see include/linux/fs.h): we need nothing extra for
+ * that here, and prefer to avoid nesting locks than attempt to
+ * keep i_size and i_blocks in synch together.
+ */
+ i_size = i_size_read(src);
+
+ /*
+ * But if CONFIG_LSF (on 32-bit), we ought to make an effort to keep
+ * the two halves of i_blocks in synch despite SMP or PREEMPT - though
+ * stat's generic_fillattr() doesn't bother, and we won't be applying
+ * quotas (where i_blocks does become important) at the upper level.
+ *
+ * We don't actually know what locking is used at the lower level; but
+ * if it's a filesystem that supports quotas, it will be using i_lock
+ * as in inode_add_bytes(). tmpfs uses other locking, and its 32-bit
+ * is (just) able to exceed 2TB i_size with the aid of holes; but its
+ * i_blocks cannot carry into the upper long without almost 2TB swap -
+ * let's ignore that case.
+ */
+ if (sizeof(i_blocks) > sizeof(long))
+ spin_lock(&src->i_lock);
+ i_blocks = src->i_blocks;
+ if (sizeof(i_blocks) > sizeof(long))
+ spin_unlock(&src->i_lock);
+
+ /*
+ * If CONFIG_SMP on 32-bit, it's vital for fsstack_copy_inode_size()
+ * to hold some lock around i_size_write(), otherwise i_size_read()
+ * may spin forever (see include/linux/fs.h). We don't necessarily
+ * hold i_mutex when this is called, so take i_lock for that case.
+ *
+ * And if CONFIG_LSF (on 32-bit), continue our effort to keep the
+ * two halves of i_blocks in synch despite SMP or PREEMPT: use i_lock
+ * for that case too, and do both at once by combining the tests.
+ *
+ * There is none of this locking overhead in the 64-bit case.
+ */
+ if (sizeof(i_size) > sizeof(long) || sizeof(i_blocks) > sizeof(long))
+ spin_lock(&dst->i_lock);
+ i_size_write(dst, i_size);
+ dst->i_blocks = i_blocks;
+ if (sizeof(i_size) > sizeof(long) || sizeof(i_blocks) > sizeof(long))
+ spin_unlock(&dst->i_lock);
}
EXPORT_SYMBOL_GPL(fsstack_copy_inode_size);
--- mmotm/include/linux/fs_stack.h 2008-06-27 13:39:19.000000000 +0100
+++ linux/include/linux/fs_stack.h 2008-06-27 14:08:00.000000000 +0100
@@ -21,8 +21,7 @@
/* externs for fs/stack.c */
extern void fsstack_copy_attr_all(struct inode *dest, const struct inode *src);
-extern void fsstack_copy_inode_size(struct inode *dst,
- const struct inode *src);
+extern void fsstack_copy_inode_size(struct inode *dst, struct inode *src);
/* inlines */
static inline void fsstack_copy_attr_atime(struct inode *dest,
Unionfs has slowly been leaking memory: although many places do explicitly
free the dentry's lower_paths array (and I've not changed those, in case
the ordering is important), not all do - and adding a WARN_ON didn't seem
to finger any particular culprit. So free_dentry_private_data needs to
kfree lower_paths (other freeers are good about setting it to NULL).
Signed-off-by: Hugh Dickins <[email protected]>
---
Should follow mmotm's git-unionfs-fixup.patch
fs/unionfs/lookup.c | 1 +
1 file changed, 1 insertion(+)
--- mmotm/fs/unionfs/lookup.c 2008-06-27 13:39:18.000000000 +0100
+++ linux/fs/unionfs/lookup.c 2008-06-27 14:08:00.000000000 +0100
@@ -498,6 +498,7 @@ void free_dentry_private_data(struct den
{
if (!dentry || !dentry->d_fsdata)
return;
+ kfree(UNIONFS_D(dentry)->lower_paths);
kmem_cache_free(unionfs_dentry_cachep, dentry->d_fsdata);
dentry->d_fsdata = NULL;
}
On Sun, Jun 29, 2008 at 01:05:09AM +0100, Hugh Dickins wrote:
> LTP's iogen01 doio tests used to hang nicely on 32-bit SMP when /tmp was a
> unionfs mount of a tmpfs, i_size_read spinning forever, waiting for a lost
> seqcount update: fixed by taking i_lock around i_size_write when 32-bit SMP.
>
> But akpm was dissatisfied with the resulting patch: its lack of commentary,
> the #ifs, the nesting around i_size_read, the lack of attention to i_blocks.
> I promised to redo it with the general spin_lock_32bit() he proposed; but
> disliked the result, partly because "32bit" obscures the real constraints,
> which are best commented within fsstack_copy_inode_size itself.
>
> This version adds those comments, and uses sizeof comparisons which the
> compiler can optimize out, instead of CONFIG_SMP, CONFIG_LSF. BITS_PER_LONG.
Btw, I hope fsstack doesn't rely on i_size having any particular
meaning. As far as the VFS is concerned i_size is field only used by
the filesystem (or library routines like generic_file_*).
On Sun, 29 Jun 2008, Christoph Hellwig wrote:
>
> Btw, I hope fsstack doesn't rely on i_size having any particular
> meaning. As far as the VFS is concerned i_size is field only used by
> the filesystem (or library routines like generic_file_*).
Interesting point. I can't speak for fsstack itself (I'm not even
sure if it's anything beyond fs/stack.c and the tag I used to identify
where this patch lies); but certainly fs/stack.c doesn't use i_size
for anything, just duplicates it from the lower filesystem.
unionfs (which I think you don't care for at all in general) does
look as if it assumes it's the lower file size in a few places,
when copying up or truncating. Isn't that reasonable? Wouldn't
users make the same assumption?
Or are you saying that filesystems which don't support the usual
meaning of inode->i_size (leave it 0?) would supply their own
equivalent to vmtruncate() if they support truncation, and their
own getattr which fills in stat->size from somewhere else.
Yes, I think you are saying that: unionfs may not play well with them.
Hugh
Hugh Dickins:
> LTP's iogen01 doio tests used to hang nicely on 32-bit SMP when /tmp was a
> unionfs mount of a tmpfs, i_size_read spinning forever, waiting for a lost
> seqcount update: fixed by taking i_lock around i_size_write when 32-bit SMP.
I don't know why dst->i_lock is affected by src->i_size_seqcount.
Do you mean that your test issued write(2) to the lower/actual file so
frequently that i_size_read() in unionfs always failed?
Is your test
iogen01 export LTPROOT; rwtest -N iogen01 -i 120s -s read,write -Da -Dv -n 2 500b:doio.f1.$$ 1000b:doio.f2.$$
line in runtest/fs?
Junjiro Okajima
On Mon, 30 Jun 2008, [email protected] wrote:
> Hugh Dickins:
> > LTP's iogen01 doio tests used to hang nicely on 32-bit SMP when /tmp was a
> > unionfs mount of a tmpfs, i_size_read spinning forever, waiting for a lost
> > seqcount update: fixed by taking i_lock around i_size_write when 32-bit SMP.
(That, of course, is already fixed in Erez's unionfs git, so in mm;
but I thought I'd better repeat the history because some of akpm's
objections to pushing the fix to mainline came from the lack of
this explanation when the patch was submitted for mainline.)
>
> I don't know why dst->i_lock is affected by src->i_size_seqcount.
It certainly shouldn't be. The problem would have come from two
racing i_size_write(dst)s, one of the unlocked increments getting
lost, leaving seqcount odd, so the next i_size_read(dst) would
spin forever waiting for it to go even.
> Do you mean that your test issued write(2) to the lower/actual file
I was merely running LTP, on a Core(2) Quad machine, with /tmp as
unionfs mount of a tmpfs. It wouldn't have been doing anything
directly to the lower file, that would all have been coming via
the unionfs mount.
> so frequently that i_size_read() in unionfs always failed?
I'm not sure what you mean by that. i_size_read() doesn't fail,
but it may loop; and if the seqcount has got out of step from
concurrent unlocked i_size_write()s, then it'll spin forever.
>
> Is your test
> iogen01 export LTPROOT; rwtest -N iogen01 -i 120s -s read,write -Da -Dv -n 2 500b:doio.f1.$$ 1000b:doio.f2.$$
> line in runtest/fs?
That looks like it: I don't think I ever tried it as a standalone test,
just investigated what was happening when standard LTP runs hung here.
Hugh
Hugh Dickins:
> It certainly shouldn't be. The problem would have come from two
> racing i_size_write(dst)s, one of the unlocked increments getting
> lost, leaving seqcount odd, so the next i_size_read(dst) would
> spin forever waiting for it to go even.
I see.
The unlocked increment can cause the next i_size_read() hang.
> I'm not sure what you mean by that. i_size_read() doesn't fail,
> but it may loop; and if the seqcount has got out of step from
> concurrent unlocked i_size_write()s, then it'll spin forever.
What I meant by "fail" was "loop" actually.
And I understand that you didn't writing directly (bypassing unionfs)
too.
Junjiro Okajima
In message <[email protected]>, Hugh Dickins writes:
> Unionfs has slowly been leaking memory: although many places do explicitly
> free the dentry's lower_paths array (and I've not changed those, in case
> the ordering is important), not all do - and adding a WARN_ON didn't seem
> to finger any particular culprit. So free_dentry_private_data needs to
> kfree lower_paths (other freeers are good about setting it to NULL).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Should follow mmotm's git-unionfs-fixup.patch
>
> fs/unionfs/lookup.c | 1 +
> 1 file changed, 1 insertion(+)
>
> --- mmotm/fs/unionfs/lookup.c 2008-06-27 13:39:18.000000000 +0100
> +++ linux/fs/unionfs/lookup.c 2008-06-27 14:08:00.000000000 +0100
> @@ -498,6 +498,7 @@ void free_dentry_private_data(struct den
> {
> if (!dentry || !dentry->d_fsdata)
> return;
> + kfree(UNIONFS_D(dentry)->lower_paths);
> kmem_cache_free(unionfs_dentry_cachep, dentry->d_fsdata);
> dentry->d_fsdata = NULL;
> }
Thanks, I'll take a closer look at this to ensure that all paths kfree
lower_paths as needed.
Erez.
In message <[email protected]>, Hugh Dickins writes:
> On Sun, 29 Jun 2008, Christoph Hellwig wrote:
> >
> > Btw, I hope fsstack doesn't rely on i_size having any particular
> > meaning. As far as the VFS is concerned i_size is field only used by
> > the filesystem (or library routines like generic_file_*).
>
> Interesting point. I can't speak for fsstack itself (I'm not even
> sure if it's anything beyond fs/stack.c and the tag I used to identify
> where this patch lies); but certainly fs/stack.c doesn't use i_size
> for anything, just duplicates it from the lower filesystem.
>
> unionfs (which I think you don't care for at all in general) does
> look as if it assumes it's the lower file size in a few places,
> when copying up or truncating. Isn't that reasonable? Wouldn't
> users make the same assumption?
>
> Or are you saying that filesystems which don't support the usual
> meaning of inode->i_size (leave it 0?) would supply their own
> equivalent to vmtruncate() if they support truncation, and their
> own getattr which fills in stat->size from somewhere else.
>
> Yes, I think you are saying that: unionfs may not play well with them.
>
> Hugh
Hugh, yes, the only place in fsstack where i_size is used is to copy the
lower i_size to the upper one verbatim. If this assumption is incorrect for
some lower file systems, then stackable file systems in general may have a
problem with this assumption; in that case, we'll need an alternative way to
"interpret" the lower i_size, and report the i_size in the upper inode (and
hence to the user).
Is there such an alternative?
BTW, ecryptfs may have a problem with this, b/c it uses i_size_read/write
b/t the lower and upper inodes. If some filesystems have a different
interpretation for i_size, then stacking ecryptfs on top of them could be an
issue. Mike?
Erez.
On Mon, Jun 30, 2008 at 02:19:27PM -0400, Erez Zadok wrote:
> BTW, ecryptfs may have a problem with this, b/c it uses
> i_size_read/write b/t the lower and upper inodes. If some
> filesystems have a different interpretation for i_size, then
> stacking ecryptfs on top of them could be an issue. Mike?
eCryptfs depends on the results of i_size_read() on the lower inode
when it needs to interpolate the filesize when the metadata is stored
in the lower file's xattr region.
On Mon, Jun 30, 2008 at 04:49:01PM -0500, Michael Halcrow wrote:
> eCryptfs depends on the results of i_size_read() on the lower inode
> when it needs to interpolate the filesize when the metadata is stored
> in the lower file's xattr region.
please use vfs_getattr to get the size in thst case.
In message <[email protected]>, Christoph Hellwig writes:
> On Mon, Jun 30, 2008 at 04:49:01PM -0500, Michael Halcrow wrote:
> > eCryptfs depends on the results of i_size_read() on the lower inode
> > when it needs to interpolate the filesize when the metadata is stored
> > in the lower file's xattr region.
>
> please use vfs_getattr to get the size in thst case.
So, Christoph, is using vfs_getattr() the preferred choice for stackable
file systems in general to get the lower inode size?
If so, the Hugh's fsstack_copy_inode_size patch will have to be updated: and
we'll probably have to change the names of the src/dst parameters to
lower/upper, possibly watch out for whether "src" was the lower or upper
inode, and be careful about locking semantics when calling vfs_getattr.
Who'd have thought that a concept as simple as
dst->i_size = src->i_size
could turn out to be so hairy... 1/2 a :-)
Thanks,
Erez.