2010-12-13 02:37:42

by Nick Piggin

[permalink] [raw]
Subject: rcu-walk and dcache scaling tree update and status

The vfs-scale branch has had some progress, but it is now requiring
wider testing and detailed review, particularly of the fine details of
dcache_lock lifting, and rcu-walk synchronisation details and
documentation.

Linus has suggested pretty strongly that he wants to pull this in the
next merge window (recently, that "inodes will be RCU freed in 2.6.38"
in an urelated discussion). As far as I know, that's what he's going to
do. I'd like to get this some time in linux-next to improve test
coverage (many filesystems I can't even test, so there are bound to be a
few silly crashes). Stephen, how do I arrange that?

>From my point of view, it has had nowhere near enough review,
particularly I want Al to be happy with it, filesystem changes looked at
and tested by respective fs maintainers, and anybody who is good at
concurrency. However, if Linus still wants to merge it to kick things
along, I am not going to stop him this time, because I have no known
bugs or pending changes required.

I, like everybody else, would prefer bugs or design flaws to be found
*before* it goes upstream, of course. I would be happy to spend time on
irc with reviewers (ask me offline). And if anybody has reasonable
concerns or suggestions, I will be happy to take that into account. I
will not flame anybody who reads my replies, even if it takes a while
for one or both of us to understand.

Documentation/filesystems/path-lookup.txt is a good place to start
reviewing the fun stuff. I would much appreciate review of documentation
and comments too, if anything is not clear, omitted, or not matching the
code.

Also, please keep an eye on the end result when reviewing patches.
Particularly the locking patches before dcache_lock is lifted, these are
supposed to provide a lock coverage to lift dcache_lock with minimal
complexity. They are not supposed to be nice looking code that you'd
want to run on your production box, they are supposed to be nice
changesets (from a review and verification point of view).

Git tree is here:

git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git

Branch is:

vfs-scale-working

Changes since last posting:
* Add a lot more comments for rcu-walk code and functions
* Fix reported d_compare vfat crash
* Incorporate review suggestions
* Make rcu-walk bail out if we have to call a security subsystem
* Fix for filesystems rewriting dentry name in-place
* Audit d_seq barrier write-side, add a few places where it was missing
* Optimised dentry memcmp

Testing:
Testing filesystems is difficult, particularly remote filesystems, and
ones without mkfs packaged in debian. I'm running ltp and xfstests among
others, but those cover a tiny portion of what you can do with the
dcache. The more testing the merrier.

I have been unable to break anything for a long time, but the race
windows can be tiny. I've been trying to insert random delays into
different parts of critical sections, and write tests specifically
targetting particular races, but that's slow going -- review of locking,
or testing on different configurations should be much more productive.

Final note:
You won't be able to reproduce the parallel path walk scalability
numbers that I've posted, because the vfsmount refcounting scalability
patch is not included. I have a new idea for that now, so I'll be asking
for comments with that soon.

Thanks,
Nick


2010-12-13 02:42:23

by Nick Piggin

[permalink] [raw]
Subject: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote:
> Final note:
> You won't be able to reproduce the parallel path walk scalability
> numbers that I've posted, because the vfsmount refcounting scalability
> patch is not included. I have a new idea for that now, so I'll be asking
> for comments with that soon.

Here is the patch I've been using, which works but has the problem
described in the changelog. But it works nicely for testing.

As I said, I have a promising approach to solving the problem.

fs: scale mntget/mntput

Improve scalability of mntget/mntput by using per-cpu counters protected by the
reader side of the brlock vfsmount_lock. If the mnt_hash field of the vfsmount
structure is attached to a list, then it is mounted which contributes to its
refcount, so the per-cpu counters need not be summed.

MNT_PSEUDO keeps track of whether the vfsmount is actually a pseudo filesystem
that will never be attached (such as sockfs).

No extra atomics in the common case because atomic mnt refcount is now replaced
with per-CPU spinlock. Code will be bigger and more complex however. With the
previous per-cpu locking patch, mount lookups and common case refcounting are
now per-cpu and should be ideally scalable. path lookups (and hence
path_get/path_put) within the same vfsmount should now be more scalable,
however this will often be hidden by dcache_lock on final dput, and d_lock on
common path elements (eg. cwd or root dentry).

Signed-off-by: Nick Piggin <[email protected]>

[Note: this is not for merging. Un-attached operation (lazy umount) may not be
uncommon and will be slowed down and actually have worse scalablilty after
this patch. I need to think about how to do fast refcounting with unattached
mounts.]

---
drivers/mtd/mtdchar.c | 1
fs/internal.h | 1
fs/libfs.c | 1
fs/namespace.c | 167 +++++++++++++++++++++++++++++++++++++++++++-------
fs/pnode.c | 4 -
include/linux/mount.h | 26 +------
6 files changed, 154 insertions(+), 46 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c 2010-12-12 03:48:57.000000000 +1100
+++ linux-2.6/fs/namespace.c 2010-12-12 03:51:52.000000000 +1100
@@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun
mnt->mnt_group_id = 0;
}

+/*
+ * vfsmount lock must be held for read
+ */
+static inline void add_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+ (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;
+#else
+ mnt->mnt_count += n;
+#endif
+}
+
+static inline void set_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+ preempt_disable();
+ (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
+ preempt_enable();
+#else
+ mnt->mnt_count = n;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void inc_mnt_count(struct vfsmount *mnt)
+{
+ add_mnt_count(mnt, 1);
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void dec_mnt_count(struct vfsmount *mnt)
+{
+ add_mnt_count(mnt, -1);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
+unsigned int count_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+ unsigned int count = 0;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ count += *per_cpu_ptr(mnt->mnt_count, cpu);
+ }
+
+ return count;
+#else
+ return mnt->mnt_count;
+#endif
+}
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -154,7 +212,15 @@ struct vfsmount *alloc_vfsmnt(const char
goto out_free_id;
}

- atomic_set(&mnt->mnt_count, 1);
+#ifdef CONFIG_SMP
+ mnt->mnt_count = alloc_percpu(int);
+ if (!mnt->mnt_count)
+ goto out_free_devname;
+#else
+ mnt->mnt_count = 0;
+#endif
+ set_mnt_count(mnt, 1);
+
INIT_LIST_HEAD(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -169,7 +235,7 @@ struct vfsmount *alloc_vfsmnt(const char
#ifdef CONFIG_SMP
mnt->mnt_writers = alloc_percpu(int);
if (!mnt->mnt_writers)
- goto out_free_devname;
+ goto out_free_mntcount;
#else
mnt->mnt_writers = 0;
#endif
@@ -177,6 +243,8 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;

#ifdef CONFIG_SMP
+out_free_mntcount:
+ free_percpu(mnt->mnt_count);
out_free_devname:
kfree(mnt->mnt_devname);
#endif
@@ -662,8 +730,8 @@ static inline void __mntput(struct vfsmo
* to make r/w->r/o transitions.
*/
/*
- * atomic_dec_and_lock() used to deal with ->mnt_count decrements
- * provides barriers, so count_mnt_writers() below is safe. AV
+ * The locking used to deal with mnt_count decrement provides barriers,
+ * so count_mnt_writers() below is safe.
*/
WARN_ON(count_mnt_writers(mnt));
fsnotify_vfsmount_delete(mnt);
@@ -675,45 +743,76 @@ static inline void __mntput(struct vfsmo
void mntput_no_expire(struct vfsmount *mnt)
{
repeat:
- if (atomic_add_unless(&mnt->mnt_count, -1, 1))
+ if (likely(!list_empty(&mnt->mnt_hash) ||
+ mnt->mnt_flags & MNT_PSEUDO)) {
+ br_read_lock(vfsmount_lock);
+ if (unlikely(list_empty(&mnt->mnt_hash) &&
+ (!(mnt->mnt_flags & MNT_PSEUDO)))) {
+ br_read_unlock(vfsmount_lock);
+ goto repeat;
+ }
+ dec_mnt_count(mnt);
+ br_read_unlock(vfsmount_lock);
return;
+ }
+
br_write_lock(vfsmount_lock);
- if (!atomic_dec_and_test(&mnt->mnt_count)) {
+ dec_mnt_count(mnt);
+ if (count_mnt_count(mnt)) {
br_write_unlock(vfsmount_lock);
return;
}
- if (likely(!mnt->mnt_pinned)) {
+ if (unlikely(mnt->mnt_pinned)) {
+ add_mnt_count(mnt, mnt->mnt_pinned + 1);
+ mnt->mnt_pinned = 0;
br_write_unlock(vfsmount_lock);
- __mntput(mnt);
- return;
+ acct_auto_close_mnt(mnt);
+ goto repeat;
}
- atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
- mnt->mnt_pinned = 0;
br_write_unlock(vfsmount_lock);
- acct_auto_close_mnt(mnt);
- goto repeat;
+ __mntput(mnt);
}
EXPORT_SYMBOL(mntput_no_expire);

+void mntput(struct vfsmount *mnt)
+{
+ if (mnt) {
+ /* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+ if (unlikely(mnt->mnt_expiry_mark))
+ mnt->mnt_expiry_mark = 0;
+ mntput_no_expire(mnt);
+ }
+}
+EXPORT_SYMBOL(mntput);
+
+struct vfsmount *mntget(struct vfsmount *mnt)
+{
+ if (mnt) {
+ preempt_disable();
+ inc_mnt_count(mnt);
+ preempt_enable();
+ }
+ return mnt;
+}
+EXPORT_SYMBOL(mntget);
+
void mnt_pin(struct vfsmount *mnt)
{
br_write_lock(vfsmount_lock);
mnt->mnt_pinned++;
br_write_unlock(vfsmount_lock);
}
-
EXPORT_SYMBOL(mnt_pin);

void mnt_unpin(struct vfsmount *mnt)
{
br_write_lock(vfsmount_lock);
if (mnt->mnt_pinned) {
- atomic_inc(&mnt->mnt_count);
+ inc_mnt_count(mnt);
mnt->mnt_pinned--;
}
br_write_unlock(vfsmount_lock);
}
-
EXPORT_SYMBOL(mnt_unpin);

static inline void mangle(struct seq_file *m, const char *s)
@@ -1008,12 +1107,13 @@ int may_umount_tree(struct vfsmount *mnt
int minimum_refs = 0;
struct vfsmount *p;

- br_read_lock(vfsmount_lock);
+ /* write lock needed for count_mnt_count */
+ br_write_lock(vfsmount_lock);
for (p = mnt; p; p = next_mnt(p, mnt)) {
- actual_refs += atomic_read(&p->mnt_count);
+ actual_refs += count_mnt_count(p);
minimum_refs += 2;
}
- br_read_unlock(vfsmount_lock);
+ br_write_unlock(vfsmount_lock);

if (actual_refs > minimum_refs)
return 0;
@@ -1040,10 +1140,10 @@ int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
- br_read_lock(vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (propagate_mount_busy(mnt, 2))
ret = 0;
- br_read_unlock(vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_read(&namespace_sem);
return ret;
}
@@ -1125,8 +1225,16 @@ static int do_umount(struct vfsmount *mn
flags & (MNT_FORCE | MNT_DETACH))
return -EINVAL;

- if (atomic_read(&mnt->mnt_count) != 2)
+ /*
+ * probably don't strictly need the lock here if we examined
+ * all race cases, but it's a slowpath.
+ */
+ br_write_lock(vfsmount_lock);
+ if (count_mnt_count(mnt) != 2) {
+ br_write_lock(vfsmount_lock);
return -EBUSY;
+ }
+ br_write_unlock(vfsmount_lock);

if (!xchg(&mnt->mnt_expiry_mark, 1))
return -EAGAIN;
@@ -2350,6 +2458,12 @@ SYSCALL_DEFINE2(pivot_root, const char _
touch_mnt_namespace(current->nsproxy->mnt_ns);
br_write_unlock(vfsmount_lock);
chroot_fs_refs(&root, &new);
+
+ /* Drop MNT_PSEUDO from old, add it to new. See init_mount_tree */
+ BUG_ON(!(root.mnt->mnt_flags & MNT_PSEUDO));
+ root.mnt->mnt_flags &= ~MNT_PSEUDO;
+ new.mnt->mnt_flags |= MNT_PSEUDO;
+
error = 0;
path_put(&root_parent);
path_put(&parent_path);
@@ -2376,6 +2490,13 @@ static void __init init_mount_tree(void)
mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
if (IS_ERR(mnt))
panic("Can't create rootfs");
+ /*
+ * MNT_PSEUDO tells mnt refcounting that we're pinned, so don't
+ * bother checking for zero references. Give one of these to root
+ * because it isn't "attached" to the tree. See mntput().
+ */
+ mnt->mnt_flags |= MNT_PSEUDO;
+
ns = create_mnt_ns(mnt);
if (IS_ERR(ns))
panic("Can't allocate initial namespace");
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h 2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/include/linux/mount.h 2010-12-12 03:51:52.000000000 +1100
@@ -30,6 +30,7 @@ struct mnt_namespace;

#define MNT_SHRINKABLE 0x100
#define MNT_WRITE_HOLD 0x200
+#define MNT_PSEUDO 0x400

#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
@@ -70,19 +71,15 @@ struct vfsmount {
struct mnt_namespace *mnt_ns; /* containing namespace */
int mnt_id; /* mount identifier */
int mnt_group_id; /* peer group identifier */
- /*
- * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
- * to let these frequently modified fields in a separate cache line
- * (so that reads of mnt_flags wont ping-pong on SMP machines)
- */
- atomic_t mnt_count;
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
int mnt_ghosts;
#ifdef CONFIG_SMP
int __percpu *mnt_writers;
+ int __percpu *mnt_count;
#else
int mnt_writers;
+ int mnt_count;
#endif
};

@@ -95,13 +92,6 @@ static inline int *get_mnt_writers_ptr(s
#endif
}

-static inline struct vfsmount *mntget(struct vfsmount *mnt)
-{
- if (mnt)
- atomic_inc(&mnt->mnt_count);
- return mnt;
-}
-
struct file; /* forward dec */

extern int mnt_want_write(struct vfsmount *mnt);
@@ -109,18 +99,12 @@ extern int mnt_want_write_file(struct fi
extern int mnt_clone_write(struct vfsmount *mnt);
extern void mnt_drop_write(struct vfsmount *mnt);
extern void mntput_no_expire(struct vfsmount *mnt);
+extern void mntput(struct vfsmount *mnt);
+extern struct vfsmount *mntget(struct vfsmount *mnt);
extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
extern int __mnt_is_readonly(struct vfsmount *mnt);

-static inline void mntput(struct vfsmount *mnt)
-{
- if (mnt) {
- mnt->mnt_expiry_mark = 0;
- mntput_no_expire(mnt);
- }
-}
-
extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
const char *name, void *data);

Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c 2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/fs/pnode.c 2010-12-12 03:51:52.000000000 +1100
@@ -288,7 +288,7 @@ int propagate_mnt(struct vfsmount *dest_
*/
static inline int do_refcount_check(struct vfsmount *mnt, int count)
{
- int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts;
+ int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts;
return (mycount > count);
}

@@ -300,7 +300,7 @@ static inline int do_refcount_check(stru
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
*
- * vfsmount lock must be held for read or write
+ * vfsmount lock must be held for write
*/
int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
{
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h 2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/fs/internal.h 2010-12-12 03:51:52.000000000 +1100
@@ -63,6 +63,7 @@ extern int copy_mount_string(const void

extern void free_vfsmnt(struct vfsmount *);
extern struct vfsmount *alloc_vfsmnt(const char *);
+extern unsigned int count_mnt_count(struct vfsmount *mnt);
extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
struct vfsmount *);
Index: linux-2.6/drivers/mtd/mtdchar.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdchar.c 2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/drivers/mtd/mtdchar.c 2010-12-12 03:51:52.000000000 +1100
@@ -1201,6 +1201,7 @@ static int __init init_mtdchar(void)
static void __exit cleanup_mtdchar(void)
{
unregister_mtd_user(&mtdchar_notifier);
+ mtd_inode_mnt->mnt_flags &= ~MNT_PSEUDO;
mntput(mtd_inode_mnt);
unregister_filesystem(&mtd_inodefs_type);
__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c 2010-12-12 03:48:57.000000000 +1100
+++ linux-2.6/arch/ia64/kernel/perfmon.c 2010-12-12 03:51:52.000000000 +1100
@@ -1553,8 +1553,10 @@ init_pfm_fs(void)
err = PTR_ERR(pfmfs_mnt);
if (IS_ERR(pfmfs_mnt))
unregister_filesystem(&pfm_fs_type);
- else
+ else {
err = 0;
+ pfmfs_mnt->mnt_flags |= MNT_PSEUDO;
+ }
}
return err;
}
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c 2010-12-12 03:51:50.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c 2010-12-12 03:51:52.000000000 +1100
@@ -223,6 +223,7 @@ static int __init anon_inode_init(void)
error = PTR_ERR(anon_inode_mnt);
goto err_unregister_filesystem;
}
+ anon_inode_mnt->mnt_flags |= MNT_PSEUDO;
anon_inode_inode = anon_inode_mkinode();
if (IS_ERR(anon_inode_inode)) {
error = PTR_ERR(anon_inode_inode);
@@ -232,6 +233,7 @@ static int __init anon_inode_init(void)
return 0;

err_mntput:
+ anon_inode_mnt->mnt_flags &= ~MNT_PSEUDO;
mntput(anon_inode_mnt);
err_unregister_filesystem:
unregister_filesystem(&anon_inode_fs_type);
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2010-12-12 03:27:08.000000000 +1100
+++ linux-2.6/fs/block_dev.c 2010-12-12 03:51:52.000000000 +1100
@@ -499,6 +499,7 @@ void __init bdev_cache_init(void)
bd_mnt = kern_mount(&bd_type);
if (IS_ERR(bd_mnt))
panic("Cannot create bdev pseudo-fs");
+ bd_mnt->mnt_flags |= MNT_PSEUDO;
/*
* This vfsmount structure is only used to obtain the
* blockdev_superblock, so tell kmemleak not to report it.
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c 2010-12-12 03:51:50.000000000 +1100
+++ linux-2.6/fs/pipe.c 2010-12-12 03:51:52.000000000 +1100
@@ -1285,6 +1285,7 @@ static int __init init_pipe_fs(void)
err = PTR_ERR(pipe_mnt);
unregister_filesystem(&pipe_fs_type);
}
+ pipe_mnt->mnt_flags |= MNT_PSEUDO;
}
return err;
}
@@ -1292,6 +1293,7 @@ static int __init init_pipe_fs(void)
static void __exit exit_pipe_fs(void)
{
unregister_filesystem(&pipe_fs_type);
+ pipe_mnt->mnt_flags &= ~MNT_PSEUDO;
mntput(pipe_mnt);
}

Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c 2010-12-12 03:51:50.000000000 +1100
+++ linux-2.6/net/socket.c 2010-12-12 03:51:52.000000000 +1100
@@ -2375,6 +2375,8 @@ EXPORT_SYMBOL(sock_unregister);

static int __init sock_init(void)
{
+ int err;
+
/*
* Initialize sock SLAB cache.
*/
@@ -2391,8 +2393,16 @@ static int __init sock_init(void)
*/

init_inodecache();
- register_filesystem(&sock_fs_type);
+
+ err = register_filesystem(&sock_fs_type);
+ if (err)
+ goto out_fs;
sock_mnt = kern_mount(&sock_fs_type);
+ if (IS_ERR(sock_mnt)) {
+ err = PTR_ERR(sock_mnt);
+ goto out_mount;
+ }
+ sock_mnt->mnt_flags |= MNT_PSEUDO;

/* The real protocol initialization is performed in later initcalls.
*/
@@ -2405,7 +2415,13 @@ static int __init sock_init(void)
skb_timestamping_init();
#endif

- return 0;
+out:
+ return err;
+
+out_mount:
+ unregister_filesystem(&sock_fs_type);
+out_fs:
+ goto out;
}

core_initcall(sock_init); /* early initcall */

2010-12-13 02:59:15

by Nick Piggin

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote:
> On Sunday 12 December 2010 21:37:33 Nick Piggin wrote:
> > Final note:
> > You won't be able to reproduce the parallel path walk scalability
> > numbers that I've posted, because the vfsmount refcounting scalability
> > patch is not included. I have a new idea for that now, so I'll be asking
> > for comments with that soon.
>
> I get this when building:
>
> security/security.c: In function 'security_inode_exec_permission':
> security/security.c:520: error: 'rcu' undeclared (first use in this function)
> security/security.c:520: error: (Each undeclared identifier is reported only once
> security/security.c:520: error: for each function it appears in.)
> make[1]: *** [security/security.o] Error 1
> make: *** [security] Error 2
>
> Missing include maybe?

Ah, missed permutation while doing a final build check. `rcu` should
just be renamed to `flags`.

2010-12-13 03:00:20

by Ed Tomlinson

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

On Sunday 12 December 2010 21:37:33 Nick Piggin wrote:
> The vfs-scale branch has had some progress, but it is now requiring
> wider testing and detailed review, particularly of the fine details of
> dcache_lock lifting, and rcu-walk synchronisation details and
> documentation.
>
> Linus has suggested pretty strongly that he wants to pull this in the
> next merge window (recently, that "inodes will be RCU freed in 2.6.38"
> in an urelated discussion). As far as I know, that's what he's going to
> do. I'd like to get this some time in linux-next to improve test
> coverage (many filesystems I can't even test, so there are bound to be a
> few silly crashes). Stephen, how do I arrange that?
>
> From my point of view, it has had nowhere near enough review,
> particularly I want Al to be happy with it, filesystem changes looked at
> and tested by respective fs maintainers, and anybody who is good at
> concurrency. However, if Linus still wants to merge it to kick things
> along, I am not going to stop him this time, because I have no known
> bugs or pending changes required.
>
> I, like everybody else, would prefer bugs or design flaws to be found
> *before* it goes upstream, of course. I would be happy to spend time on
> irc with reviewers (ask me offline). And if anybody has reasonable
> concerns or suggestions, I will be happy to take that into account. I
> will not flame anybody who reads my replies, even if it takes a while
> for one or both of us to understand.
>
> Documentation/filesystems/path-lookup.txt is a good place to start
> reviewing the fun stuff. I would much appreciate review of documentation
> and comments too, if anything is not clear, omitted, or not matching the
> code.
>
> Also, please keep an eye on the end result when reviewing patches.
> Particularly the locking patches before dcache_lock is lifted, these are
> supposed to provide a lock coverage to lift dcache_lock with minimal
> complexity. They are not supposed to be nice looking code that you'd
> want to run on your production box, they are supposed to be nice
> changesets (from a review and verification point of view).
>
> Git tree is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
>
> Branch is:
>
> vfs-scale-working
>
> Changes since last posting:
> * Add a lot more comments for rcu-walk code and functions
> * Fix reported d_compare vfat crash
> * Incorporate review suggestions
> * Make rcu-walk bail out if we have to call a security subsystem
> * Fix for filesystems rewriting dentry name in-place
> * Audit d_seq barrier write-side, add a few places where it was missing
> * Optimised dentry memcmp
>
> Testing:
> Testing filesystems is difficult, particularly remote filesystems, and
> ones without mkfs packaged in debian. I'm running ltp and xfstests among
> others, but those cover a tiny portion of what you can do with the
> dcache. The more testing the merrier.
>
> I have been unable to break anything for a long time, but the race
> windows can be tiny. I've been trying to insert random delays into
> different parts of critical sections, and write tests specifically
> targetting particular races, but that's slow going -- review of locking,
> or testing on different configurations should be much more productive.
>
> Final note:
> You won't be able to reproduce the parallel path walk scalability
> numbers that I've posted, because the vfsmount refcounting scalability
> patch is not included. I have a new idea for that now, so I'll be asking
> for comments with that soon.

I get this when building:

security/security.c: In function 'security_inode_exec_permission':
security/security.c:520: error: 'rcu' undeclared (first use in this function)
security/security.c:520: error: (Each undeclared identifier is reported only once
security/security.c:520: error: for each function it appears in.)
make[1]: *** [security/security.o] Error 1
make: *** [security] Error 2

Missing include maybe?

Ed

2010-12-13 03:31:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

On Mon, Dec 13, 2010 at 01:42:17PM +1100, Nick Piggin wrote:
> On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote:
> > Final note:
> > You won't be able to reproduce the parallel path walk scalability
> > numbers that I've posted, because the vfsmount refcounting scalability
> > patch is not included. I have a new idea for that now, so I'll be asking
> > for comments with that soon.
>
> Here is the patch I've been using, which works but has the problem
> described in the changelog. But it works nicely for testing.
>
> As I said, I have a promising approach to solving the problem.
>
> fs: scale mntget/mntput

[...]

> [Note: this is not for merging. Un-attached operation (lazy umount) may not be
> uncommon and will be slowed down and actually have worse scalablilty after
> this patch. I need to think about how to do fast refcounting with unattached
> mounts.]

So the problem this patch tries to fix is vfsmount refcount scalability.
We need to take a ref for every successful path lookup, and often
lookups are going to the same mountpoint.

(Yes this little bouncing atomic hurts, badly, even on my small 2s12c
tightly connected system on the parallel git diff workload -- because
there are other bouncing kernel cachelines in this workload).

The fundamental difficulty is that a simple refcount can never be SMP
scalable, because dropping the ref requires we check whether we are
the last reference (which implies communicating with other CPUs that
might have taken references).

We can make them scalable by keeping a local count, and checking the
global sum less frequently. Some possibilities:

- avoid checking global sum while vfsmount is mounted, because the mount
contributes to the refcount (that is what this patch does, but it
kills performance inside a lazy umounted subtree).

- check global sum once every time interval (this would delay mount and
sb garbage collection, so it's probably a showstopper).

- check global sum only if local sum goes to 0 (this is difficult with
vfsmounts because the 'get' and the 'put' can happen on different
CPUs, so we'd need to have a per-thread refcount, or carry around the
CPU number with the refcount, both get horribly ugly, it turns out).

My proposal is a variant / generalisation of the 1st idea, which is to
have "long" refcounts. Normal refcounts will be per-cpu difference of
incs and decs, but dropping a reference will not have to check the
global sum while "long" refcounts are elevated. If the mount is a long
refcount, then that is what this current patch essentially is.

But then I would also have cwd take the long refcount, which allows
detached operation to remain fast while there are processes working
inside the detached namespace.

Details of locking aren't completely worked out -- it's a bit more
tricky because umount can be much heavier than fork() or chdir(), so
there are some difficulties in making long refcount operations faster
(the problem is remaining race-free versus the fast mntput check, but
I think a seqcount to go with the long refcount should do the trick).

2010-12-13 03:41:02

by Stephen Rothwell

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

Hi Nick,

On Mon, 13 Dec 2010 13:37:33 +1100 Nick Piggin <[email protected]> wrote:
>
> Linus has suggested pretty strongly that he wants to pull this in the
> next merge window (recently, that "inodes will be RCU freed in 2.6.38"
> in an urelated discussion). As far as I know, that's what he's going to
> do. I'd like to get this some time in linux-next to improve test
> coverage (many filesystems I can't even test, so there are bound to be a
> few silly crashes). Stephen, how do I arrange that?

Well, you ask me :-) (letting me know where the git tree is).

I this case I would like an email from Al and/or Linus (and/or maybe
Christoph?) before I add this to linux-next as you say below that
(despite being as well tested as you can do) it does require more review
and buyin from others. Of course, if Linus is going to merge this during
the next merge window, then it really should be in linux-next as soon as
it is ready enough.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (1.02 kB)
(No filename) (490.00 B)
Download all attachments

2010-12-13 03:43:29

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

On Mon, Dec 13, 2010 at 02:31:10PM +1100, Nick Piggin wrote:
> (Yes this little bouncing atomic hurts, badly, even on my small 2s12c
> tightly connected system on the parallel git diff workload -- because
> there are other bouncing kernel cachelines in this workload).
^^^
are no other

2010-12-13 03:46:01

by Stephen Rothwell

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

Hi Nick,

On Mon, 13 Dec 2010 13:59:10 +1100 Nick Piggin <[email protected]> wrote:
>
> On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote:
> >
> > I get this when building:
> >
> > security/security.c: In function 'security_inode_exec_permission':
> > security/security.c:520: error: 'rcu' undeclared (first use in this function)
> > security/security.c:520: error: (Each undeclared identifier is reported only once
> > security/security.c:520: error: for each function it appears in.)
> > make[1]: *** [security/security.o] Error 1
> > make: *** [security] Error 2
> >
> > Missing include maybe?
>
> Ah, missed permutation while doing a final build check. `rcu` should
> just be renamed to `flags`.

You will, of course never put anything not build tested into a tree for
linux-next inclusion, right? ;-)
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (925.00 B)
(No filename) (490.00 B)
Download all attachments

2010-12-13 03:49:00

by Nick Piggin

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

On Mon, Dec 13, 2010 at 02:40:49PM +1100, Stephen Rothwell wrote:
> Hi Nick,
>
> On Mon, 13 Dec 2010 13:37:33 +1100 Nick Piggin <[email protected]> wrote:
> >
> > Linus has suggested pretty strongly that he wants to pull this in the
> > next merge window (recently, that "inodes will be RCU freed in 2.6.38"
> > in an urelated discussion). As far as I know, that's what he's going to
> > do. I'd like to get this some time in linux-next to improve test
> > coverage (many filesystems I can't even test, so there are bound to be a
> > few silly crashes). Stephen, how do I arrange that?
>
> Well, you ask me :-) (letting me know where the git tree is).
>
> I this case I would like an email from Al and/or Linus (and/or maybe
> Christoph?) before I add this to linux-next as you say below that
> (despite being as well tested as you can do) it does require more review
> and buyin from others. Of course, if Linus is going to merge this during
> the next merge window, then it really should be in linux-next as soon as
> it is ready enough.

Yes I would like an ack from Al to have it in linux-next. Note that
wouldn't have to be an ack on the patches or agreement to merge it next
window, just an "ok to get wider review and testing" type of thing.

If Linus's is intending *not* to pull it, then it's not so urgent, and I
might instead ask Andrew to host it in -mm for a while. I just don't
want to get caught with my pants down here, because there are a good
number of untested filesystems and things.

2010-12-13 03:50:26

by Nick Piggin

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

On Mon, Dec 13, 2010 at 02:45:55PM +1100, Stephen Rothwell wrote:
> Hi Nick,
>
> On Mon, 13 Dec 2010 13:59:10 +1100 Nick Piggin <[email protected]> wrote:
> >
> > On Sun, Dec 12, 2010 at 09:53:32PM -0500, Ed Tomlinson wrote:
> > >
> > > I get this when building:
> > >
> > > security/security.c: In function 'security_inode_exec_permission':
> > > security/security.c:520: error: 'rcu' undeclared (first use in this function)
> > > security/security.c:520: error: (Each undeclared identifier is reported only once
> > > security/security.c:520: error: for each function it appears in.)
> > > make[1]: *** [security/security.o] Error 1
> > > make: *** [security] Error 2
> > >
> > > Missing include maybe?
> >
> > Ah, missed permutation while doing a final build check. `rcu` should
> > just be renamed to `flags`.
>
> You will, of course never put anything not build tested into a tree for
> linux-next inclusion, right? ;-)

Right. It will be guaranteed to have been build tested with at least
one .config :)

Which reminds me, I haven't done an allmodconfig recently to see if
I've missed exports...

2010-12-13 07:25:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

Le lundi 13 décembre 2010 à 13:42 +1100, Nick Piggin a écrit :
> + */
> +static inline void add_mnt_count(struct vfsmount *mnt, int n)

maybe name it __add_mnt_count() (should be called with preempt off)

> +{
> +#ifdef CONFIG_SMP
> + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;

__this_cpu_add(mnt->mnt_count, n);

> +#else
> + mnt->mnt_count += n;
> +#endif
> +}

and define a preempt safe version :

static inline void __add_mnt_count(struct vfsmount *mnt, int n)
{
#ifdef CONFIG_SMP
__this_cpu_add(mnt->mnt_count, n);
#else
mnt->mnt_count += n;
#endif
}

static inline void add_mnt_count(struct vfsmount *mnt, int n)
{
#ifdef CONFIG_SMP
this_cpu_add(mnt->mnt_count, n);
#else
preempt_disable();
mnt->mnt_count += n;
preempt_enable();
#endif
}

> +
> +static inline void set_mnt_count(struct vfsmount *mnt, int n)
> +{
> +#ifdef CONFIG_SMP


> + preempt_disable();
> + (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
> + preempt_enable();

last 3 lines can be replaced by :

this_cpu_write(mnt->mnt_count, n);

> +#else
> + mnt->mnt_count = n;
> +#endif
> +}
> +

> #ifdef CONFIG_SMP
> int __percpu *mnt_writers;
> + int __percpu *mnt_count;
> #else
> int mnt_writers;
> + int mnt_count;
> #endif

You could use a struct and one per cpu allocation to use one cache line
for both objects :

struct mnt_counters {
int writers;
int count;
};

...

#ifdef CONFIG_SMP
struct mnt_counters __percpu *mnt_counters;
#else
struct mnt_counters mnt_counters;
#endif

This would use one pointer instead of two in SMP

2010-12-13 08:33:40

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

On Mon, Dec 13, 2010 at 6:25 PM, Eric Dumazet <[email protected]> wrote:
> Le lundi 13 d?cembre 2010 ? 13:42 +1100, Nick Piggin a ?crit :
>> + */
>> +static inline void add_mnt_count(struct vfsmount *mnt, int n)
>
> maybe name it __add_mnt_count() (should be called with preempt off)
>
>> +{
>> +#ifdef CONFIG_SMP
>> + ? ? (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) += n;
>
> __this_cpu_add(mnt->mnt_count, n);
>
>> +#else
>> + ? ? mnt->mnt_count += n;
>> +#endif
>> +}
>
> and define a preempt safe version :
>
> static inline void __add_mnt_count(struct vfsmount *mnt, int n)
> {
> #ifdef CONFIG_SMP
> ? ? ? ?__this_cpu_add(mnt->mnt_count, n);
> #else
> ? ? ? ?mnt->mnt_count += n;
> #endif
> }
>
> static inline void add_mnt_count(struct vfsmount *mnt, int n)
> {
> #ifdef CONFIG_SMP
> ? ? ? ?this_cpu_add(mnt->mnt_count, n);
> #else
> ? ? ? ?preempt_disable();
> ? ? ? ?mnt->mnt_count += n;
> ? ? ? ?preempt_enable();
> #endif
> }

That looks good, thanks.


>> +
>> +static inline void set_mnt_count(struct vfsmount *mnt, int n)
>> +{
>> +#ifdef CONFIG_SMP
>
>
>> + ? ? preempt_disable();
>> + ? ? (*per_cpu_ptr(mnt->mnt_count, smp_processor_id())) = n;
>> + ? ? preempt_enable();
>
> last 3 lines can be replaced by :
>
> ? ? ? ?this_cpu_write(mnt->mnt_count, n);

Yep, thanks.


>> +#else
>> + ? ? mnt->mnt_count = n;
>> +#endif
>> +}
>> +
>
>> ?#ifdef CONFIG_SMP
>> ? ? ? int __percpu *mnt_writers;
>> + ? ? int __percpu *mnt_count;
>> ?#else
>> ? ? ? int mnt_writers;
>> + ? ? int mnt_count;
>> ?#endif
>
> You could use a struct and one per cpu allocation to use one cache line
> for both objects :
>
> struct mnt_counters {
> ? ? ? ?int writers;
> ? ? ? ?int count;
> };
>
> ...
>
> #ifdef CONFIG_SMP
> ? ? ? ?struct mnt_counters __percpu *mnt_counters;
> #else
> ? ? ? ?struct mnt_counters mnt_counters;
> #endif
>
> This would use one pointer instead of two in SMP

Yes that's a good point too.

Thanks,
Nick

2010-12-13 16:48:20

by Sedat Dilek

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

Hi,

I will give this new refreshed patchset a try (testing against
systemd-v15, vfat problem I reported) with none-BKL-config of course.

If linux-next will be the new base, then please rebase on it (see below).
(Last time I solved the CONFLICTs manually).

Regards,
- Sedat -

P.S.:

$ cd /mnt/sdb5/linux-kernel/linux-next

$ git log -1 | cat
commit 455b71004aa0d5cb899dc4df34892265e7486b70
Author: Stephen Rothwell <[email protected]>
Date: Mon Dec 13 16:15:19 2010 +1100

Add linux-next specific files for 20101213

Signed-off-by: Stephen Rothwell <[email protected]>

$ git pull git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
vfs-scale-working:vfs-scale-working
remote: Counting objects: 1196, done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 931 (delta 866), reused 915 (delta 851)
Receiving objects: 100% (931/931), 207.90 KiB | 376 KiB/s, done.
Resolving deltas: 100% (866/866), completed with 258 local objects.
>From git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* [new branch] vfs-scale-working -> vfs-scale-working
Auto-merging Documentation/filesystems/Locking
Removing Documentation/filesystems/dentry-locking.txt
Auto-merging Documentation/filesystems/vfs.txt
Auto-merging fs/anon_inodes.c
Auto-merging fs/ceph/dir.c
Auto-merging fs/ceph/inode.c
Auto-merging fs/ceph/mds_client.c
Auto-merging fs/cifs/cifsfs.c
Auto-merging fs/cifs/inode.c
Auto-merging fs/cifs/readdir.c
Auto-merging fs/coda/inode.c
Auto-merging fs/ecryptfs/inode.c
Auto-merging fs/ecryptfs/main.c
Auto-merging fs/ext3/super.c
Auto-merging fs/ext4/super.c
Auto-merging fs/fuse/dir.c
Auto-merging fs/fuse/inode.c
CONFLICT (content): Merge conflict in fs/fuse/inode.c
Auto-merging fs/gfs2/ops_inode.c
Auto-merging fs/hfsplus/dir.c
Auto-merging fs/hfsplus/hfsplus_fs.h
Auto-merging fs/hfsplus/super.c
Auto-merging fs/namei.c
Auto-merging fs/nfs/dir.c
Auto-merging fs/nfs/inode.c
Auto-merging fs/nfsd/vfs.c
Auto-merging fs/nilfs2/super.c
Auto-merging fs/ocfs2/super.c
Auto-merging fs/proc/base.c
Auto-merging fs/super.c
CONFLICT (content): Merge conflict in fs/super.c
Auto-merging fs/udf/super.c
Auto-merging include/linux/fs.h
Auto-merging include/linux/fsnotify.h
Auto-merging include/linux/fsnotify_backend.h
Auto-merging include/linux/security.h
Auto-merging mm/filemap.c
Auto-merging mm/shmem.c
Auto-merging mm/slab.c
Auto-merging mm/slub.c
Auto-merging net/socket.c
Auto-merging security/selinux/selinuxfs.c
Automatic merge failed; fix conflicts and then commit the result.
- EOT-

2010-12-13 23:21:45

by Stephen Rothwell

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

Hi Sedat,

On Mon, 13 Dec 2010 17:48:18 +0100 Sedat Dilek <[email protected]> wrote:
>
> I will give this new refreshed patchset a try (testing against
> systemd-v15, vfat problem I reported) with none-BKL-config of course.
>
> If linux-next will be the new base, then please rebase on it (see below).
> (Last time I solved the CONFLICTs manually).

Nothing released publicly should ever be based on linux-next (unless it
is a short term testing only tree like -mm) as it is rebuilt every day.
Thing should really only be based on other publicly released stable trees.

This is especially true of trees that will be included in linux-next
itself.

> $ git pull git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git
> vfs-scale-working:vfs-scale-working
> remote: Counting objects: 1196, done.
> remote: Compressing objects: 100% (80/80), done.
> remote: Total 931 (delta 866), reused 915 (delta 851)
> Receiving objects: 100% (931/931), 207.90 KiB | 376 KiB/s, done.
> Resolving deltas: 100% (866/866), completed with 258 local objects.
> From git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
> * [new branch] vfs-scale-working -> vfs-scale-working
> CONFLICT (content): Merge conflict in fs/fuse/inode.c
> CONFLICT (content): Merge conflict in fs/super.c

Thanks for the heads up, though. When I include Nick's tree in
linux-next, I will do these merge fixes if necessary (as will Linus when
it gets to him).

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (1.53 kB)
(No filename) (490.00 B)
Download all attachments

2010-12-14 00:03:40

by Stephen Rothwell

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

Hi Nick,

On Mon, 13 Dec 2010 14:48:54 +1100 Nick Piggin <[email protected]> wrote:
>
> Yes I would like an ack from Al to have it in linux-next. Note that
> wouldn't have to be an ack on the patches or agreement to merge it next
> window, just an "ok to get wider review and testing" type of thing.
>
> If Linus's is intending *not* to pull it, then it's not so urgent, and I
> might instead ask Andrew to host it in -mm for a while. I just don't
> want to get caught with my pants down here, because there are a good
> number of untested filesystems and things.

OK, since Linus seems so keen (and you fixed the reported compile error)
I will include your tree today. If Al has objections, I can always
remove it again.

I have called the tree (for my purposes) "vfs_scale" and have just you
listed as the contact.

Thanks for adding your subsystem tree as a participant of linux-next. As
you may know, this is not a judgment of your code. The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window.

You will need to ensure that the patches/commits in your tree/series have
been:
* submitted under GPL v2 (or later) and include the Contributor's
Signed-off-by,
* posted to the relevant mailing list,
* reviewed by you (or another maintainer of your subsystem tree),
* successfully unit tested, and
* destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch). It is allowed to be rebased if you deem it necessary.

--
Cheers,
Stephen Rothwell
[email protected]

Legal Stuff:
By participating in linux-next, your subsystem tree contributions are
public and will be included in the linux-next trees. You may be sent
e-mail messages indicating errors or other issues when the
patches/commits from your subsystem tree are merged and tested in
linux-next. These messages may also be cross-posted to the linux-next
mailing list, the linux-kernel mailing list, etc. The linux-next tree
project and IBM (my employer) make no warranties regarding the linux-next
project, the testing procedures, the results, the e-mails, etc. If you
don't agree to these ground rules, let me know and I'll remove your tree
from participation in linux-next.


Attachments:
(No filename) (2.27 kB)
(No filename) (490.00 B)
Download all attachments

2010-12-14 00:16:57

by Stephen Rothwell

[permalink] [raw]
Subject: Re: rcu-walk and dcache scaling tree update and status

Hi Nick,

On Tue, 14 Dec 2010 11:03:26 +1100 Stephen Rothwell <[email protected]> wrote:
>
> I have called the tree (for my purposes) "vfs_scale" and have just you
^^^^^^^^^
Actually "vfs-scale".

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (347.00 B)
(No filename) (490.00 B)
Download all attachments

2010-12-14 12:41:10

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

OK, this ones includes your suggestions, and implements the approach I
outlined earlier. I verified it works and is scalable on detached
mounts, so I'll add it to the git branch when it survives a bit more
testing. It's the last piece of the puzzle...

fs: scale mntget/mntput

The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.

The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.

We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.

- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).

- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).

- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.

This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.

This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.

This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.

Signed-off-by: Nick Piggin <[email protected]>

---
arch/ia64/kernel/perfmon.c | 2
drivers/mtd/mtdchar.c | 2
fs/anon_inodes.c | 2
fs/fs_struct.c | 26 +++-
fs/internal.h | 1
fs/namei.c | 24 ++++
fs/namespace.c | 240 +++++++++++++++++++++++++++++++++++++--------
fs/pipe.c | 2
fs/pnode.c | 4
fs/super.c | 2
include/linux/mount.h | 53 +++------
include/linux/path.h | 2
net/socket.c | 19 +++
13 files changed, 282 insertions(+), 97 deletions(-)

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/namespace.c 2010-12-14 23:33:00.000000000 +1100
@@ -138,6 +138,64 @@ void mnt_release_group_id(struct vfsmoun
mnt->mnt_group_id = 0;
}

+/*
+ * vfsmount lock must be held for read
+ */
+static inline void add_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+ this_cpu_add(mnt->mnt_pcp->mnt_count, n);
+#else
+ preempt_disable();
+ mnt->mnt_count += n;
+ preempt_enable();
+#endif
+}
+
+static inline void set_mnt_count(struct vfsmount *mnt, int n)
+{
+#ifdef CONFIG_SMP
+ this_cpu_add(mnt->mnt_pcp->mnt_count, n);
+#else
+ mnt->mnt_count = n;
+#endif
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void inc_mnt_count(struct vfsmount *mnt)
+{
+ add_mnt_count(mnt, 1);
+}
+
+/*
+ * vfsmount lock must be held for read
+ */
+static inline void dec_mnt_count(struct vfsmount *mnt)
+{
+ add_mnt_count(mnt, -1);
+}
+
+/*
+ * vfsmount lock must be held for write
+ */
+unsigned int count_mnt_count(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+ unsigned int count = atomic_read(&mnt->mnt_longrefs);
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
+ }
+
+ return count;
+#else
+ return mnt->mnt_count;
+#endif
+}
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -154,7 +212,17 @@ struct vfsmount *alloc_vfsmnt(const char
goto out_free_id;
}

- atomic_set(&mnt->mnt_count, 1);
+#ifdef CONFIG_SMP
+ mnt->mnt_pcp = alloc_percpu(struct mnt_pcp);
+ if (!mnt->mnt_pcp)
+ goto out_free_devname;
+
+ atomic_set(&mnt->mnt_longrefs, 1);
+#else
+ mnt->mnt_count = 1;
+ mnt->mnt_writers = 0;
+#endif
+
INIT_LIST_HEAD(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -166,13 +234,6 @@ struct vfsmount *alloc_vfsmnt(const char
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
-#ifdef CONFIG_SMP
- mnt->mnt_writers = alloc_percpu(int);
- if (!mnt->mnt_writers)
- goto out_free_devname;
-#else
- mnt->mnt_writers = 0;
-#endif
}
return mnt;

@@ -219,7 +280,7 @@ EXPORT_SYMBOL_GPL(__mnt_is_readonly);
static inline void inc_mnt_writers(struct vfsmount *mnt)
{
#ifdef CONFIG_SMP
- (*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))++;
+ this_cpu_inc(mnt->mnt_pcp->mnt_writers);
#else
mnt->mnt_writers++;
#endif
@@ -228,7 +289,7 @@ static inline void inc_mnt_writers(struc
static inline void dec_mnt_writers(struct vfsmount *mnt)
{
#ifdef CONFIG_SMP
- (*per_cpu_ptr(mnt->mnt_writers, smp_processor_id()))--;
+ this_cpu_dec(mnt->mnt_pcp->mnt_writers);
#else
mnt->mnt_writers--;
#endif
@@ -241,7 +302,7 @@ static unsigned int count_mnt_writers(st
int cpu;

for_each_possible_cpu(cpu) {
- count += *per_cpu_ptr(mnt->mnt_writers, cpu);
+ count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_writers;
}

return count;
@@ -418,7 +479,7 @@ void free_vfsmnt(struct vfsmount *mnt)
kfree(mnt->mnt_devname);
mnt_free_id(mnt);
#ifdef CONFIG_SMP
- free_percpu(mnt->mnt_writers);
+ free_percpu(mnt->mnt_pcp);
#endif
kmem_cache_free(mnt_cache, mnt);
}
@@ -652,9 +713,10 @@ static struct vfsmount *clone_mnt(struct
return NULL;
}

-static inline void __mntput(struct vfsmount *mnt)
+static inline void mntfree(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
+
/*
* This probably indicates that somebody messed
* up a mnt_want/drop_write() pair. If this
@@ -662,8 +724,8 @@ static inline void __mntput(struct vfsmo
* to make r/w->r/o transitions.
*/
/*
- * atomic_dec_and_lock() used to deal with ->mnt_count decrements
- * provides barriers, so count_mnt_writers() below is safe. AV
+ * The locking used to deal with mnt_count decrement provides barriers,
+ * so count_mnt_writers() below is safe.
*/
WARN_ON(count_mnt_writers(mnt));
fsnotify_vfsmount_delete(mnt);
@@ -672,28 +734,113 @@ static inline void __mntput(struct vfsmo
deactivate_super(sb);
}

-void mntput_no_expire(struct vfsmount *mnt)
+#ifdef CONFIG_SMP
+static inline void __mntput(struct vfsmount *mnt, int longrefs)
{
-repeat:
- if (atomic_add_unless(&mnt->mnt_count, -1, 1))
- return;
+ if (!longrefs) {
+put_again:
+ br_read_lock(vfsmount_lock);
+ if (likely(atomic_read(&mnt->mnt_longrefs))) {
+ dec_mnt_count(mnt);
+ br_read_unlock(vfsmount_lock);
+ return;
+ }
+ br_read_unlock(vfsmount_lock);
+ } else {
+ BUG_ON(!atomic_read(&mnt->mnt_longrefs));
+ if (atomic_add_unless(&mnt->mnt_longrefs, -1, 1))
+ return;
+ }
+
br_write_lock(vfsmount_lock);
- if (!atomic_dec_and_test(&mnt->mnt_count)) {
+ if (!longrefs)
+ dec_mnt_count(mnt);
+ else
+ atomic_dec(&mnt->mnt_longrefs);
+ if (count_mnt_count(mnt)) {
br_write_unlock(vfsmount_lock);
return;
}
- if (likely(!mnt->mnt_pinned)) {
+ if (unlikely(mnt->mnt_pinned)) {
+ add_mnt_count(mnt, mnt->mnt_pinned + 1);
+ mnt->mnt_pinned = 0;
br_write_unlock(vfsmount_lock);
- __mntput(mnt);
+ acct_auto_close_mnt(mnt);
+ goto put_again;
+ }
+ br_write_unlock(vfsmount_lock);
+ mntfree(mnt);
+}
+#else
+static inline void __mntput(struct vfsmount *mnt, int longrefs)
+{
+put_again:
+ dec_mnt_count(mnt);
+ if (likely(count_mnt_count(mnt)))
return;
+ br_write_lock(vfsmount_lock);
+ if (unlikely(mnt->mnt_pinned)) {
+ add_mnt_count(mnt, mnt->mnt_pinned + 1);
+ mnt->mnt_pinned = 0;
+ br_write_unlock(vfsmount_lock);
+ acct_auto_close_mnt(mnt);
+ goto put_again;
}
- atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
- mnt->mnt_pinned = 0;
br_write_unlock(vfsmount_lock);
- acct_auto_close_mnt(mnt);
- goto repeat;
+ mntfree(mnt);
+}
+#endif
+
+static void mntput_no_expire(struct vfsmount *mnt)
+{
+ __mntput(mnt, 0);
}
-EXPORT_SYMBOL(mntput_no_expire);
+
+void mntput(struct vfsmount *mnt)
+{
+ if (mnt) {
+ /* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+ if (unlikely(mnt->mnt_expiry_mark))
+ mnt->mnt_expiry_mark = 0;
+ __mntput(mnt, 0);
+ }
+}
+EXPORT_SYMBOL(mntput);
+
+struct vfsmount *mntget(struct vfsmount *mnt)
+{
+ if (mnt)
+ inc_mnt_count(mnt);
+ return mnt;
+}
+EXPORT_SYMBOL(mntget);
+
+void mntput_long(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+ if (mnt) {
+ /* avoid cacheline pingpong, hope gcc doesn't get "smart" */
+ if (unlikely(mnt->mnt_expiry_mark))
+ mnt->mnt_expiry_mark = 0;
+ __mntput(mnt, 1);
+ }
+#else
+ mntput(mnt);
+#endif
+}
+EXPORT_SYMBOL(mntput_long);
+
+struct vfsmount *mntget_long(struct vfsmount *mnt)
+{
+#ifdef CONFIG_SMP
+ if (mnt)
+ atomic_inc(&mnt->mnt_longrefs);
+ return mnt;
+#else
+ return mntget(mnt);
+#endif
+}
+EXPORT_SYMBOL(mntget_long);

void mnt_pin(struct vfsmount *mnt)
{
@@ -701,19 +848,17 @@ void mnt_pin(struct vfsmount *mnt)
mnt->mnt_pinned++;
br_write_unlock(vfsmount_lock);
}
-
EXPORT_SYMBOL(mnt_pin);

void mnt_unpin(struct vfsmount *mnt)
{
br_write_lock(vfsmount_lock);
if (mnt->mnt_pinned) {
- atomic_inc(&mnt->mnt_count);
+ inc_mnt_count(mnt);
mnt->mnt_pinned--;
}
br_write_unlock(vfsmount_lock);
}
-
EXPORT_SYMBOL(mnt_unpin);

static inline void mangle(struct seq_file *m, const char *s)
@@ -1008,12 +1153,13 @@ int may_umount_tree(struct vfsmount *mnt
int minimum_refs = 0;
struct vfsmount *p;

- br_read_lock(vfsmount_lock);
+ /* write lock needed for count_mnt_count */
+ br_write_lock(vfsmount_lock);
for (p = mnt; p; p = next_mnt(p, mnt)) {
- actual_refs += atomic_read(&p->mnt_count);
+ actual_refs += count_mnt_count(p);
minimum_refs += 2;
}
- br_read_unlock(vfsmount_lock);
+ br_write_unlock(vfsmount_lock);

if (actual_refs > minimum_refs)
return 0;
@@ -1040,10 +1186,10 @@ int may_umount(struct vfsmount *mnt)
{
int ret = 1;
down_read(&namespace_sem);
- br_read_lock(vfsmount_lock);
+ br_write_lock(vfsmount_lock);
if (propagate_mount_busy(mnt, 2))
ret = 0;
- br_read_unlock(vfsmount_lock);
+ br_write_unlock(vfsmount_lock);
up_read(&namespace_sem);
return ret;
}
@@ -1070,7 +1216,7 @@ void release_mounts(struct list_head *he
dput(dentry);
mntput(m);
}
- mntput(mnt);
+ mntput_long(mnt);
}
}

@@ -1125,8 +1271,16 @@ static int do_umount(struct vfsmount *mn
flags & (MNT_FORCE | MNT_DETACH))
return -EINVAL;

- if (atomic_read(&mnt->mnt_count) != 2)
+ /*
+ * probably don't strictly need the lock here if we examined
+ * all race cases, but it's a slowpath.
+ */
+ br_write_lock(vfsmount_lock);
+ if (count_mnt_count(mnt) != 2) {
+ br_write_lock(vfsmount_lock);
return -EBUSY;
+ }
+ br_write_unlock(vfsmount_lock);

if (!xchg(&mnt->mnt_expiry_mark, 1))
return -EAGAIN;
@@ -1815,7 +1969,7 @@ int do_add_mount(struct vfsmount *newmnt

unlock:
up_write(&namespace_sem);
- mntput(newmnt);
+ mntput_long(newmnt);
return err;
}

@@ -2148,11 +2302,11 @@ static struct mnt_namespace *dup_mnt_ns(
if (fs) {
if (p == fs->root.mnt) {
rootmnt = p;
- fs->root.mnt = mntget(q);
+ fs->root.mnt = mntget_long(q);
}
if (p == fs->pwd.mnt) {
pwdmnt = p;
- fs->pwd.mnt = mntget(q);
+ fs->pwd.mnt = mntget_long(q);
}
}
p = next_mnt(p, mnt_ns->root);
@@ -2161,9 +2315,9 @@ static struct mnt_namespace *dup_mnt_ns(
up_write(&namespace_sem);

if (rootmnt)
- mntput(rootmnt);
+ mntput_long(rootmnt);
if (pwdmnt)
- mntput(pwdmnt);
+ mntput_long(pwdmnt);

return new_ns;
}
@@ -2350,6 +2504,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
touch_mnt_namespace(current->nsproxy->mnt_ns);
br_write_unlock(vfsmount_lock);
chroot_fs_refs(&root, &new);
+
error = 0;
path_put(&root_parent);
path_put(&parent_path);
@@ -2376,6 +2531,7 @@ static void __init init_mount_tree(void)
mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
if (IS_ERR(mnt))
panic("Can't create rootfs");
+
ns = create_mnt_ns(mnt);
if (IS_ERR(ns))
panic("Can't allocate initial namespace");
Index: linux-2.6/include/linux/mount.h
===================================================================
--- linux-2.6.orig/include/linux/mount.h 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/include/linux/mount.h 2010-12-14 21:07:29.000000000 +1100
@@ -13,6 +13,7 @@
#include <linux/list.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>
+#include <linux/seqlock.h>
#include <asm/atomic.h>

struct super_block;
@@ -46,12 +47,24 @@ struct mnt_namespace;

#define MNT_INTERNAL 0x4000

+struct mnt_pcp {
+ int mnt_count;
+ int mnt_writers;
+};
+
struct vfsmount {
struct list_head mnt_hash;
struct vfsmount *mnt_parent; /* fs we are mounted on */
struct dentry *mnt_mountpoint; /* dentry of mountpoint */
struct dentry *mnt_root; /* root of the mounted tree */
struct super_block *mnt_sb; /* pointer to superblock */
+#ifdef CONFIG_SMP
+ struct mnt_pcp __percpu *mnt_pcp;
+ atomic_t mnt_longrefs;
+#else
+ int mnt_count;
+ int mnt_writers;
+#endif
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
int mnt_flags;
@@ -70,57 +83,25 @@ struct vfsmount {
struct mnt_namespace *mnt_ns; /* containing namespace */
int mnt_id; /* mount identifier */
int mnt_group_id; /* peer group identifier */
- /*
- * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount
- * to let these frequently modified fields in a separate cache line
- * (so that reads of mnt_flags wont ping-pong on SMP machines)
- */
- atomic_t mnt_count;
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
int mnt_ghosts;
-#ifdef CONFIG_SMP
- int __percpu *mnt_writers;
-#else
- int mnt_writers;
-#endif
};

-static inline int *get_mnt_writers_ptr(struct vfsmount *mnt)
-{
-#ifdef CONFIG_SMP
- return mnt->mnt_writers;
-#else
- return &mnt->mnt_writers;
-#endif
-}
-
-static inline struct vfsmount *mntget(struct vfsmount *mnt)
-{
- if (mnt)
- atomic_inc(&mnt->mnt_count);
- return mnt;
-}
-
struct file; /* forward dec */

extern int mnt_want_write(struct vfsmount *mnt);
extern int mnt_want_write_file(struct file *file);
extern int mnt_clone_write(struct vfsmount *mnt);
extern void mnt_drop_write(struct vfsmount *mnt);
-extern void mntput_no_expire(struct vfsmount *mnt);
+extern void mntput(struct vfsmount *mnt);
+extern struct vfsmount *mntget(struct vfsmount *mnt);
+extern void mntput_long(struct vfsmount *mnt);
+extern struct vfsmount *mntget_long(struct vfsmount *mnt);
extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
extern int __mnt_is_readonly(struct vfsmount *mnt);

-static inline void mntput(struct vfsmount *mnt)
-{
- if (mnt) {
- mnt->mnt_expiry_mark = 0;
- mntput_no_expire(mnt);
- }
-}
-
extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
const char *name, void *data);

Index: linux-2.6/fs/pnode.c
===================================================================
--- linux-2.6.orig/fs/pnode.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/pnode.c 2010-12-14 21:07:29.000000000 +1100
@@ -288,7 +288,7 @@ int propagate_mnt(struct vfsmount *dest_
*/
static inline int do_refcount_check(struct vfsmount *mnt, int count)
{
- int mycount = atomic_read(&mnt->mnt_count) - mnt->mnt_ghosts;
+ int mycount = count_mnt_count(mnt) - mnt->mnt_ghosts;
return (mycount > count);
}

@@ -300,7 +300,7 @@ static inline int do_refcount_check(stru
* Check if any of these mounts that **do not have submounts**
* have more references than 'refcnt'. If so return busy.
*
- * vfsmount lock must be held for read or write
+ * vfsmount lock must be held for write
*/
int propagate_mount_busy(struct vfsmount *mnt, int refcnt)
{
Index: linux-2.6/fs/internal.h
===================================================================
--- linux-2.6.orig/fs/internal.h 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/internal.h 2010-12-14 21:07:29.000000000 +1100
@@ -63,6 +63,7 @@ extern int copy_mount_string(const void

extern void free_vfsmnt(struct vfsmount *);
extern struct vfsmount *alloc_vfsmnt(const char *);
+extern unsigned int count_mnt_count(struct vfsmount *mnt);
extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
struct vfsmount *);
Index: linux-2.6/drivers/mtd/mtdchar.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdchar.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/drivers/mtd/mtdchar.c 2010-12-14 21:07:29.000000000 +1100
@@ -1201,7 +1201,7 @@ static int __init init_mtdchar(void)
static void __exit cleanup_mtdchar(void)
{
unregister_mtd_user(&mtdchar_notifier);
- mntput(mtd_inode_mnt);
+ mntput_long(mtd_inode_mnt);
unregister_filesystem(&mtd_inodefs_type);
__unregister_chrdev(MTD_CHAR_MAJOR, 0, 1 << MINORBITS, "mtd");
}
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/arch/ia64/kernel/perfmon.c 2010-12-14 21:07:29.000000000 +1100
@@ -1542,7 +1542,7 @@ pfm_exit_smpl_buffer(pfm_buffer_fmt_t *f
* any operations on the root directory. However, we need a non-trivial
* d_name - pfm: will go nicely and kill the special-casing in procfs.
*/
-static struct vfsmount *pfmfs_mnt;
+static struct vfsmount *pfmfs_mnt __read_mostly;

static int __init
init_pfm_fs(void)
Index: linux-2.6/fs/anon_inodes.c
===================================================================
--- linux-2.6.orig/fs/anon_inodes.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/anon_inodes.c 2010-12-14 21:07:29.000000000 +1100
@@ -232,7 +232,7 @@ static int __init anon_inode_init(void)
return 0;

err_mntput:
- mntput(anon_inode_mnt);
+ mntput_long(anon_inode_mnt);
err_unregister_filesystem:
unregister_filesystem(&anon_inode_fs_type);
err_exit:
Index: linux-2.6/fs/pipe.c
===================================================================
--- linux-2.6.orig/fs/pipe.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/pipe.c 2010-12-14 21:07:29.000000000 +1100
@@ -1292,7 +1292,7 @@ static int __init init_pipe_fs(void)
static void __exit exit_pipe_fs(void)
{
unregister_filesystem(&pipe_fs_type);
- mntput(pipe_mnt);
+ mntput_long(pipe_mnt);
}

fs_initcall(init_pipe_fs);
Index: linux-2.6/net/socket.c
===================================================================
--- linux-2.6.orig/net/socket.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/net/socket.c 2010-12-14 21:07:29.000000000 +1100
@@ -2375,6 +2375,8 @@ EXPORT_SYMBOL(sock_unregister);

static int __init sock_init(void)
{
+ int err;
+
/*
* Initialize sock SLAB cache.
*/
@@ -2391,8 +2393,15 @@ static int __init sock_init(void)
*/

init_inodecache();
- register_filesystem(&sock_fs_type);
+
+ err = register_filesystem(&sock_fs_type);
+ if (err)
+ goto out_fs;
sock_mnt = kern_mount(&sock_fs_type);
+ if (IS_ERR(sock_mnt)) {
+ err = PTR_ERR(sock_mnt);
+ goto out_mount;
+ }

/* The real protocol initialization is performed in later initcalls.
*/
@@ -2405,7 +2414,13 @@ static int __init sock_init(void)
skb_timestamping_init();
#endif

- return 0;
+out:
+ return err;
+
+out_mount:
+ unregister_filesystem(&sock_fs_type);
+out_fs:
+ goto out;
}

core_initcall(sock_init); /* early initcall */
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c 2010-12-14 21:07:19.000000000 +1100
+++ linux-2.6/fs/super.c 2010-12-14 21:07:29.000000000 +1100
@@ -1140,7 +1140,7 @@ static struct vfsmount *fs_set_subtype(s
return mnt;

err:
- mntput(mnt);
+ mntput_long(mnt);
return ERR_PTR(err);
}

Index: linux-2.6/fs/fs_struct.c
===================================================================
--- linux-2.6.orig/fs/fs_struct.c 2010-12-14 22:01:15.000000000 +1100
+++ linux-2.6/fs/fs_struct.c 2010-12-14 22:56:27.000000000 +1100
@@ -17,11 +17,11 @@ void set_fs_root(struct fs_struct *fs, s
write_seqcount_begin(&fs->seq);
old_root = fs->root;
fs->root = *path;
- path_get(path);
+ path_get_long(path);
write_seqcount_end(&fs->seq);
spin_unlock(&fs->lock);
if (old_root.dentry)
- path_put(&old_root);
+ path_put_long(&old_root);
}

/*
@@ -36,12 +36,12 @@ void set_fs_pwd(struct fs_struct *fs, st
write_seqcount_begin(&fs->seq);
old_pwd = fs->pwd;
fs->pwd = *path;
- path_get(path);
+ path_get_long(path);
write_seqcount_end(&fs->seq);
spin_unlock(&fs->lock);

if (old_pwd.dentry)
- path_put(&old_pwd);
+ path_put_long(&old_pwd);
}

void chroot_fs_refs(struct path *old_root, struct path *new_root)
@@ -59,13 +59,13 @@ void chroot_fs_refs(struct path *old_roo
write_seqcount_begin(&fs->seq);
if (fs->root.dentry == old_root->dentry
&& fs->root.mnt == old_root->mnt) {
- path_get(new_root);
+ path_get_long(new_root);
fs->root = *new_root;
count++;
}
if (fs->pwd.dentry == old_root->dentry
&& fs->pwd.mnt == old_root->mnt) {
- path_get(new_root);
+ path_get_long(new_root);
fs->pwd = *new_root;
count++;
}
@@ -76,13 +76,13 @@ void chroot_fs_refs(struct path *old_roo
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
while (count--)
- path_put(old_root);
+ path_put_long(old_root);
}

void free_fs_struct(struct fs_struct *fs)
{
- path_put(&fs->root);
- path_put(&fs->pwd);
+ path_put_long(&fs->root);
+ path_put_long(&fs->pwd);
kmem_cache_free(fs_cachep, fs);
}

@@ -115,7 +115,13 @@ struct fs_struct *copy_fs_struct(struct
spin_lock_init(&fs->lock);
seqcount_init(&fs->seq);
fs->umask = old->umask;
- get_fs_root_and_pwd(old, &fs->root, &fs->pwd);
+
+ spin_lock(&old->lock);
+ fs->root = old->root;
+ path_get_long(&fs->root);
+ fs->pwd = old->pwd;
+ path_get_long(&fs->pwd);
+ spin_unlock(&old->lock);
}
return fs;
}
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2010-12-14 22:06:54.000000000 +1100
+++ linux-2.6/fs/namei.c 2010-12-14 22:07:51.000000000 +1100
@@ -448,6 +448,18 @@ void path_get(struct path *path)
EXPORT_SYMBOL(path_get);

/**
+ * path_get_long - get a long reference to a path
+ * @path: path to get the reference to
+ *
+ * Given a path increment the reference count to the dentry and the vfsmount.
+ */
+void path_get_long(struct path *path)
+{
+ mntget_long(path->mnt);
+ dget(path->dentry);
+}
+
+/**
* path_put - put a reference to a path
* @path: path to put the reference to
*
@@ -461,6 +473,18 @@ void path_put(struct path *path)
EXPORT_SYMBOL(path_put);

/**
+ * path_put_long - put a long reference to a path
+ * @path: path to put the reference to
+ *
+ * Given a path decrement the reference count to the dentry and the vfsmount.
+ */
+void path_put_long(struct path *path)
+{
+ dput(path->dentry);
+ mntput_long(path->mnt);
+}
+
+/**
* nameidata_drop_rcu - drop this nameidata out of rcu-walk
* @nd: nameidata pathwalk data to drop
* @Returns: 0 on success, -ECHLID on failure
Index: linux-2.6/include/linux/path.h
===================================================================
--- linux-2.6.orig/include/linux/path.h 2010-12-14 22:06:38.000000000 +1100
+++ linux-2.6/include/linux/path.h 2010-12-14 22:08:13.000000000 +1100
@@ -10,7 +10,9 @@ struct path {
};

extern void path_get(struct path *);
+extern void path_get_long(struct path *);
extern void path_put(struct path *);
+extern void path_put_long(struct path *);

static inline int path_equal(const struct path *path1, const struct path *path2)
{

2010-12-15 08:16:56

by Andreas Dilger

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

On 2010-12-14, at 05:40, Nick Piggin wrote:
> +static inline void add_mnt_count(struct vfsmount *mnt, int n)
> +static inline void set_mnt_count(struct vfsmount *mnt, int n)
> +static inline void inc_mnt_count(struct vfsmount *mnt)
> +static inline void dec_mnt_count(struct vfsmount *mnt)
> +unsigned int count_mnt_count(struct vfsmount *mnt)

Minor nit - it is easier to find these related functions via tag
completion if they are of the form "mnt_count_{add,set,inc,dec}"
and it would also be more consistent and easier to understand if
you rename count_mnt_count() to mnt_count_sum() or mnt_count_read().

This also follows the kernel naming convention much more closely
(e.g. atomic_{inc,dec,add,sub,set,read}(), mutex_{init,lock,unlock}(),
etc), which is normally of the form {type}_{action}.

Yes, I see the other {inc,dec,count}_mnt_writers() functions
exist, but I'd prefer not to add more bad function names to
the kernel. Maybe a separate patch to clean up those names is
in order...

Cheers, Andreas




2010-12-16 02:27:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

On Wed, Dec 15, 2010 at 01:16:52AM -0700, Andreas Dilger wrote:
> On 2010-12-14, at 05:40, Nick Piggin wrote:
> > +static inline void add_mnt_count(struct vfsmount *mnt, int n)
> > +static inline void set_mnt_count(struct vfsmount *mnt, int n)
> > +static inline void inc_mnt_count(struct vfsmount *mnt)
> > +static inline void dec_mnt_count(struct vfsmount *mnt)
> > +unsigned int count_mnt_count(struct vfsmount *mnt)
>
> Minor nit - it is easier to find these related functions via tag
> completion if they are of the form "mnt_count_{add,set,inc,dec}"
> and it would also be more consistent and easier to understand if
> you rename count_mnt_count() to mnt_count_sum() or mnt_count_read().
>
> This also follows the kernel naming convention much more closely
> (e.g. atomic_{inc,dec,add,sub,set,read}(), mutex_{init,lock,unlock}(),
> etc), which is normally of the form {type}_{action}.

OK, I agree.


> Yes, I see the other {inc,dec,count}_mnt_writers() functions
> exist, but I'd prefer not to add more bad function names to
> the kernel. Maybe a separate patch to clean up those names is
> in order...

I might have added those too :P Yes they should be consistent, so
I'll rename them first.

Thanks,
Nick