2004-10-25 14:39:25

by Mike Waychison

Subject: [PATCH 0/28] Autofs NG Patchset 0.2

Hi All,

The following patchset (against 2.6.9) is a breakdown of all the changes
required to support autofsng (currently hosted at http://autofsng.bkbits.net/).
I'm posting this patchset to get more people's eyes on the code.

The series consists of core VFS changes as well as a couple of small changes to
the call_usermodehelper interface. I've also sent the autofsng filesystem itself as a single patch, separate from the rest of the changes.

This isn't ready for inclusion, as there are bound to be errors and the
interfaces it describes haven't settled yet.

Please review / test / comment / flame.



01-unexport_umount_tree.diff
- drop an export with no intree users

02-rename_mnt_fslink_mnt_expire.diff
- give vfsmount->mnt_fslink a more appropriate name

03-move_expiry_into_vfs.diff
- move the recently added expiry code so that it is contained within the VFS

04-stat_on_root_shouldnt_stop_expire.diff
- don't let ops on the root of a mountpoint affect expiry timeouts

05-expiry_is_countdown.diff
- mountpoint expiry now has configurable timeouts

06-expiry_is_recursive.diff
- allow for atomic expiry of subtrees of mountpoints

07-update_kafs_automount_expiry.diff
- update AFS to use new expiry interface

08-drop_expire_umount_flag.diff
- drop unused old expiry interface (MNT_EXPIRE flag to umount(2))

09-expiry_semantics_bind.diff
- fix up the expiry semantics and document them

10-move_next_mnt.diff
- move next_mnt() in preparation for later patches

11-detachable_subtrees.diff
- allow subtrees of mountpoints to not be bound to a struct namespace

12-remove_check_mnt.diff
- remove the now-bogus check_mnt calls (made bogus by 11-detachable_subtrees.diff)

13-introduce_soft_ref_counts.diff
- introduce 'soft' reference counts that don't affect the umount(2) EBUSY
semantics

14-introduce_mountfd.diff
- introduce the mountfd() syscall

15-mountfd_umounts.diff
- add unmount functionality to mountfd interface

16-mountfd_attach.diff
- add attach interface to mountfd interface

17-mountfd_walk.diff
- allow for userspace to walk a tree of mountfds

18-mountfd_read.diff
- allow for reading properties of a mountfd

19-mountfd_vfsexpire.diff
- give vfs expiry an interface through mountfds

20-call_usermodehelper_cb.diff
- add a way for a caller of call_usermodehelper to get a callback before
execve.

21-call_usermodehelper_execve_hack.diff
- quick hack that allows for execve to be called without having to define
errno.

22-export_put_namespace.diff
- autofsng wants to call put_namespace, so export it.

23-export_get_sb_pseudo.diff
- autofsng wants get_sb_pseudo, so export it.

24-follow_link_on_root.diff
- update follow_link logic to pass the right vfsmount to follow_link on root
directory.

25-statfs_nofollow.diff
- hack - statfs doesn't follow symlink on last component.

26-umount_mnt_nofollow.diff
- allow umount to not follow symlink on last component.

27-temp_expiry_syscall.diff
- dummy interface for expiry testing

28-autofsng_support.diff
- big patch for autofsng.


2004-10-25 14:39:37

by Mike Waychison

Subject: [PATCH 1/28] VFS: Unexport umount_tree

Unexport umount_tree. I don't see any in-kernel users of this call.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 2 +-
include/linux/namespace.h | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6.9-quilt/include/linux/namespace.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/namespace.h 2004-08-14 01:36:59.000000000 -0400
+++ linux-2.6.9-quilt/include/linux/namespace.h 2004-10-22 17:17:32.919459624 -0400
@@ -12,7 +12,6 @@ struct namespace {
struct rw_semaphore sem;
};

-extern void umount_tree(struct vfsmount *);
extern int copy_namespace(int, struct task_struct *);
extern void __put_namespace(struct namespace *namespace);

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-08-14 01:37:25.000000000 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:32.921459320 -0400
@@ -338,7 +338,7 @@ int may_umount(struct vfsmount *mnt)

EXPORT_SYMBOL(may_umount);

-void umount_tree(struct vfsmount *mnt)
+static void umount_tree(struct vfsmount *mnt)
{
struct vfsmount *p;
LIST_HEAD(kill);

2004-10-25 14:44:26

by Mike Waychison

Subject: [PATCH 6/28] VFS: Make expiry recursive

This patch allows vfsmounts to be tagged as part of a sub-tree
expiry. It introduces a new vfsmount flag, MNT_CHILDEXPIRE, which is used to
let the system know that the given mountpoint expires with its parent. This
is a recursive definition.

mnt_expiry, the call used to specify that a mount should expire, now takes an
int described as follows:
- 0 - The mountpoint should not expire (default)
- >0 - The value is used to specify the amount of idle time before the
given mountpoint expires.
- <0 - The mountpoint must expire with its immediate parent (the parent
must be set to expire, or must itself be marked to expire
along with _its_ parent).

This allows atomic expiry of a complex hierarchy of mountpoints: userspace
will either 'see' or 'not see' the whole hierarchy. (This is
required when using a generic automount facility that acts like a mounted
filesystem on top of any other filesystem.)
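To make the three cases concrete, here is a userspace C model of the checks that the patched mnt_expire() performs. It is a sketch, not kernel code: struct model_mnt, MODEL_CHILDEXPIRE and model_mnt_expire() are hypothetical names, locking (expiry_sem, vfsmount_lock) and the mnt_namespace check are left out, and MNT_CHILDEXPIRE is cleared on a single mount rather than recursively as clear_childexpire() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* mirrors the value of MNT_CHILDEXPIRE added to include/linux/mount.h */
#define MODEL_CHILDEXPIRE 8

/* Hypothetical stand-in for struct vfsmount: only the fields the
 * expiry rules look at. */
struct model_mnt {
	struct model_mnt *parent;  /* points to itself for a root mount */
	int flags;
	bool on_expiry_list;       /* models !list_empty(&mnt->mnt_expire) */
	int expiry_ticks;
	int expiry_countdown;
};

/* Returns 0 on success and 1 if the request is refused, like the
 * patched mnt_expire(). Locking and the namespace check are omitted. */
static int model_mnt_expire(struct model_mnt *mnt, int expire)
{
	if (expire < 0) {
		/* already expiring on its own timer */
		if (mnt->on_expiry_list)
			return 1;
		/* a root or orphaned mount has no parent to expire with */
		if (mnt->parent == mnt || mnt->parent == NULL)
			return 1;
		/* the parent must expire, on a timer or with its own parent */
		if (!(mnt->parent->flags & MODEL_CHILDEXPIRE) &&
		    !mnt->parent->on_expiry_list)
			return 1;
		mnt->flags |= MODEL_CHILDEXPIRE;
		return 0;
	}

	/* a member of a sub-tree expiry cannot take its own timeout */
	if ((mnt->flags & MODEL_CHILDEXPIRE) && expire)
		return 1;

	/* the kernel clears the flag on the whole sub-tree here */
	mnt->flags &= ~MODEL_CHILDEXPIRE;
	mnt->expiry_ticks = mnt->expiry_countdown = expire;
	mnt->on_expiry_list = expire > 0;
	return 0;
}
```

In this model, once a parent is on a 600-tick timer a child can join its sub-tree expiry with -1, but it cannot then be given its own positive timeout until it is first reset with 0.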

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 236 +++++++++++++++++++++++++++++++++++++-------------
include/linux/mount.h | 3
2 files changed, 179 insertions(+), 60 deletions(-)

Index: linux-2.6.9-quilt/include/linux/mount.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/mount.h 2004-10-22 17:17:35.377086008 -0400
+++ linux-2.6.9-quilt/include/linux/mount.h 2004-10-22 17:17:35.927002408 -0400
@@ -17,6 +17,7 @@
#define MNT_NOSUID 1
#define MNT_NODEV 2
#define MNT_NOEXEC 4
+#define MNT_CHILDEXPIRE 8

struct vfsmount
{
@@ -71,7 +72,7 @@ extern struct vfsmount *do_kern_mount(co
struct nameidata;

extern int do_graft_mount(struct vfsmount *newmnt, struct nameidata *nd);
-extern void mnt_expire(struct vfsmount *mnt, unsigned expire);
+extern int mnt_expire(struct vfsmount *mnt, int expire);

extern spinlock_t vfsmount_lock;

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:35.378085856 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:35.929002104 -0400
@@ -157,6 +157,34 @@ static struct vfsmount *next_mnt(struct
return list_entry(next, struct vfsmount, mnt_child);
}

+static int __can_expire(struct vfsmount *root, int offset)
+{
+ struct vfsmount *mnt;
+ int count;
+
+ /* handle the case of a root or orphaned mountpoint */
+ if (root->mnt_parent == root || root->mnt_parent == NULL)
+ return 0;
+ count = atomic_read(&root->mnt_count) - 1 - offset;
+ for (mnt = next_mnt(root, root); mnt; mnt = next_mnt(mnt, root)) {
+ if (!(mnt->mnt_flags & MNT_CHILDEXPIRE))
+ return 0;
+ count += atomic_read(&mnt->mnt_count) - 2;
+ }
+
+ WARN_ON(count < 0);
+ return count == 0;
+}
+
+static int can_expire(struct vfsmount *root)
+{
+ int ret;
+ spin_lock(&vfsmount_lock);
+ ret = __can_expire(root, 1);
+ spin_unlock(&vfsmount_lock);
+ return ret;
+}
+
static struct vfsmount *
clone_mnt(struct vfsmount *old, struct dentry *root)
{
@@ -164,20 +192,13 @@ clone_mnt(struct vfsmount *old, struct d
struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);

if (mnt) {
- mnt->mnt_flags = old->mnt_flags;
+ mnt->mnt_flags = old->mnt_flags & ~MNT_CHILDEXPIRE;
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
mnt->mnt_root = dget(root);
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
mnt->mnt_namespace = old->mnt_namespace;
-
- /* stick the duplicate mount on the same expiry list
- * as the original if that was on one */
- spin_lock(&vfsmount_lock);
- if (!list_empty(&old->mnt_expire))
- list_add(&mnt->mnt_expire, &old->mnt_expire);
- spin_unlock(&vfsmount_lock);
}
return mnt;
}
@@ -275,6 +296,43 @@ struct seq_operations mounts_op = {
.show = show_vfsmnt
};

+/*
+ * Clear MNT_CHILDEXPIRE from the given mountpoint (recursively) to maintain
+ * our invariant: every mount with MNT_CHILDEXPIRE set has a parent that will
+ * eventually expire.
+ */
+static void clear_childexpire(struct vfsmount *root)
+{
+ struct list_head *next;
+ struct vfsmount *this_parent = root;
+
+ if (!(root->mnt_flags & MNT_CHILDEXPIRE))
+ return;
+
+ root->mnt_flags &= ~MNT_CHILDEXPIRE;
+ next = this_parent->mnt_mounts.next;
+again:
+ for ( ; next != &this_parent->mnt_mounts ; next = next->next ) {
+ struct vfsmount *p = list_entry(next, struct vfsmount,
+ mnt_child);
+
+ if (p->mnt_flags & MNT_CHILDEXPIRE) {
+ p->mnt_flags &= ~MNT_CHILDEXPIRE;
+ if (!list_empty(&p->mnt_mounts)) {
+ this_parent = p;
+ next = this_parent->mnt_mounts.next;
+ goto again;
+ }
+ }
+ }
+
+ if (this_parent != root) {
+ next = this_parent->mnt_child.next;
+ this_parent = this_parent->mnt_parent;
+ goto again;
+ }
+}
+
/**
* may_umount_tree - check if a mount tree is busy
* @mnt: root of mount tree
@@ -347,6 +405,17 @@ int may_umount(struct vfsmount *mnt)

EXPORT_SYMBOL(may_umount);

+/* clear all expire related information in the subtree rooted at root */
+static void clear_expire(struct vfsmount *root)
+{
+ struct vfsmount *p;
+
+ for (p = root; p; p = next_mnt(p, root)) {
+ list_del_init(&p->mnt_expire);
+ p->mnt_flags &= ~MNT_CHILDEXPIRE;
+ }
+}
+
static void umount_tree(struct vfsmount *mnt)
{
struct vfsmount *p;
@@ -394,7 +463,7 @@ static int do_umount(struct vfsmount *mn
flags & (MNT_FORCE | MNT_DETACH))
return -EINVAL;

- if (atomic_read(&mnt->mnt_count) != 2)
+ if (!can_expire(mnt))
return -EBUSY;

if (--mnt->mnt_expiry_countdown != 0)
@@ -455,9 +524,10 @@ static int do_umount(struct vfsmount *mn
spin_lock(&vfsmount_lock);
}
retval = -EBUSY;
- if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
+ if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH
+ || (flags & MNT_EXPIRE && can_expire(mnt))) {
if (!list_empty(&mnt->mnt_list)) {
- list_del_init(&mnt->mnt_expire);
+ clear_expire(mnt);
umount_tree(mnt);
}
retval = 0;
@@ -658,6 +728,7 @@ static int do_loopback(struct nameidata
/* stop bind mounts from expiring */
spin_lock(&vfsmount_lock);
list_del_init(&mnt->mnt_expire);
+ clear_childexpire(mnt);
spin_unlock(&vfsmount_lock);

err = graft_tree(mnt, nd);
@@ -698,7 +769,8 @@ static int do_remount(struct nameidata *
down_write(&sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err)
- nd->mnt->mnt_flags=mnt_flags;
+ nd->mnt->mnt_flags = mnt_flags |
+ (nd->mnt->mnt_flags & MNT_CHILDEXPIRE);
up_write(&sb->s_umount);
if (!err)
security_sb_post_remount(nd->mnt, flags, data);
@@ -754,9 +826,8 @@ static int do_move_mount(struct nameidat
detach_mnt(old_nd.mnt, &parent_nd);
attach_mnt(old_nd.mnt, nd);

- /* if the mount is moved, it should no longer be expire
- * automatically */
- list_del_init(&old_nd.mnt->mnt_expire);
+ /* if the mount is moved, we need to clear its child expire flag */
+ clear_childexpire(old_nd.mnt);
out2:
spin_unlock(&vfsmount_lock);
out1:
@@ -829,8 +900,18 @@ unlock:
}
EXPORT_SYMBOL_GPL(do_graft_mount);

-void mnt_expire(struct vfsmount *mnt, unsigned expire)
+/*
+ * Change the expiry settings for a given mountpoint
+ * 0 - Disable expiry for a given mountpoint
+ * <0 - Set this mountpoint to be part of a sub-tree expiry
+ * >0 - This mountpoint will expire after expire ticks
+ *
+ * Returns zero on success.
+ */
+int mnt_expire(struct vfsmount *mnt, int expire)
{
+ int ret = 1;
+
down(&expiry_sem);
spin_lock(&vfsmount_lock);

@@ -841,17 +922,68 @@ void mnt_expire(struct vfsmount *mnt, un
if (!mnt->mnt_namespace)
goto out;

- list_del_init(&mnt->mnt_expire);
- mnt->mnt_expiry_ticks = mnt->mnt_expiry_countdown = expire;
- mnt->mnt_active = 1;
- if (expire > 0)
- list_add_tail(&mnt->mnt_expire, &expiry_list);
+ if (expire < 0) {
+ if (!list_empty(&mnt->mnt_expire))
+ goto out;
+ if (mnt->mnt_parent == mnt || mnt->mnt_parent == NULL)
+ goto out;
+ if (!(mnt->mnt_parent->mnt_flags & MNT_CHILDEXPIRE)
+ && list_empty(&mnt->mnt_parent->mnt_expire))
+ goto out;
+ mnt->mnt_flags |= MNT_CHILDEXPIRE;
+ } else {
+ if (mnt->mnt_flags & MNT_CHILDEXPIRE && expire)
+ goto out;
+ clear_childexpire(mnt);
+ list_del_init(&mnt->mnt_expire);
+ mnt->mnt_expiry_ticks = mnt->mnt_expiry_countdown = expire;
+ mnt->mnt_active = 1;
+ if (expire > 0)
+ list_add_tail(&mnt->mnt_expire, &expiry_list);
+ }
+ ret = 0;
+
out:
spin_unlock(&vfsmount_lock);
up(&expiry_sem);
+ return ret;
}
EXPORT_SYMBOL_GPL(mnt_expire);

+/* return the first ancestor mount that impacts expiry, or NULL if none do */
+static struct vfsmount *find_expiring_parent(struct vfsmount *mnt)
+{
+ struct vfsmount *p = mnt->mnt_parent;
+ /*
+ * one of three things is true:
+ * 1. all parents are normal mounts
+ * 2. parent is a simple expiry mount
+ * 3. parent is a CHILDEXPIRE -- walk up the tree until we get to
+ * a mount that fits case 1 or case 2
+ */
+ while (p && p->mnt_parent != p && p->mnt_flags & MNT_CHILDEXPIRE)
+ p = p->mnt_parent;
+ if (p && list_empty(&p->mnt_expire))
+ return NULL;
+ return mntget(p);
+}
+
+static void bump_expiry_counter(struct vfsmount *mnt, struct vfsmount *parent)
+{
+ int diff;
+
+ /*
+ * If the parent is set to expire, then its counter has been
+ * counting down. If it thinks it has been idle for longer than
+ * the child, we need to bump it up. This child has been used more
+ * recently than the parent, so the parent can only possibly be idle
+ * for as long as this child (or less).
+ */
+ diff = parent->mnt_expiry_ticks - mnt->mnt_expiry_ticks;
+ if (parent->mnt_expiry_countdown < diff)
+ parent->mnt_expiry_countdown = diff;
+}
+
/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren't in use and haven't been touched since last we came
@@ -887,7 +1019,7 @@ static void do_expiry_run(void *nothing)
}
if (mnt->mnt_expiry_countdown >= 1)
mnt->mnt_expiry_countdown--;
- if (atomic_read(&mnt->mnt_count) == 2 && mnt->mnt_expiry_countdown == 0) {
+ if (__can_expire(mnt, 0) && mnt->mnt_expiry_countdown == 0) {
mntget(mnt);
list_move(&mnt->mnt_expire, &graveyard);
}
@@ -900,6 +1032,8 @@ static void do_expiry_run(void *nothing)
* - dispose of the corpse
*/
while (!list_empty(&graveyard)) {
+ struct vfsmount *parent = NULL;
+
mnt = list_entry(graveyard.next, struct vfsmount, mnt_expire);
list_del_init(&mnt->mnt_expire);

@@ -914,50 +1048,33 @@ static void do_expiry_run(void *nothing)
down_write(&namespace->sem);
spin_lock(&vfsmount_lock);

- /* check that it is still dead: the count should now be 2 - as
- * contributed by the vfsmount parent and the mntget above */
- if (atomic_read(&mnt->mnt_count) == 2 && !mnt->mnt_active) {
- struct vfsmount *xdmnt;
- struct dentry *xdentry;
-
- /* delete from the namespace */
- list_del_init(&mnt->mnt_list);
- list_del_init(&mnt->mnt_child);
- list_del_init(&mnt->mnt_hash);
- mnt->mnt_mountpoint->d_mounted--;
-
- xdentry = mnt->mnt_mountpoint;
- mnt->mnt_mountpoint = mnt->mnt_root;
- xdmnt = mnt->mnt_parent;
- mnt->mnt_parent = mnt;
-
- spin_unlock(&vfsmount_lock);
-
- mntput(xdmnt);
- dput(xdentry);
+ if (!__can_expire(mnt, 1) || mnt->mnt_active) {
+ list_add_tail(&mnt->mnt_expire, &expiry_list);
+ } else {
+ parent = find_expiring_parent(mnt);
+ if (parent)
+ bump_expiry_counter(mnt, parent);
+ umount_tree(mnt);

- /* now lay it to rest if this was the last ref on the
- * superblock */
- if (atomic_read(&mnt->mnt_sb->s_active) == 1) {
- /* last instance - try to be smart */
- lock_kernel();
- DQUOT_OFF(mnt->mnt_sb);
- acct_auto_close(mnt->mnt_sb);
- unlock_kernel();
+ /* the parent may be expirable now */
+ if (parent && __can_expire(parent, 1) &&
+ parent->mnt_expiry_countdown == 0 &&
+ !parent->mnt_active) {
+ list_move_tail(&parent->mnt_expire, &graveyard);
+
+ /* The ref from find_expiring_parent is now used
+ * for the graveyard. Set the parent to NULL so
+ * that it isn't decremented by the _mntput
+ * below */
+ parent = NULL;
}
-
- mntput(mnt);
- } else {
- /* someone brought it back to life whilst we didn't
- * have any locks held so return it to the expiration
- * list */
- list_add_tail(&mnt->mnt_expire, &expiry_list);
- spin_unlock(&vfsmount_lock);
}

+ spin_unlock(&vfsmount_lock);
up_write(&namespace->sem);

_mntput(mnt);
+ _mntput(parent);
put_namespace(namespace);

spin_lock(&vfsmount_lock);
@@ -1472,6 +1589,7 @@ void __put_namespace(struct namespace *n

list_for_each_entry(mnt, &namespace->list, mnt_list) {
mnt->mnt_namespace = NULL;
+ mnt->mnt_flags &= ~MNT_CHILDEXPIRE;
list_del_init(&mnt->mnt_expire);
}

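The reference accounting behind __can_expire() can be checked with a similar userspace model. This is a sketch under assumed names (struct mnode, model_mount(), next_node(), model_can_expire()); it assumes, as in 2.6.9, that an idle mounted vfsmount holds one reference of its own and that each child mount pins its parent with one more. Under those assumptions the running total equals the extra references on the whole sub-tree minus the `offset` the caller passes for its own references, so the sub-tree may expire only when the total reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CHILDEXPIRE 8  /* mirrors MNT_CHILDEXPIRE */

/* Hypothetical mount node: parent/child links plus a reference count. */
struct mnode {
	struct mnode *parent;        /* points to itself for a root mount */
	struct mnode *first_child;
	struct mnode *next_sibling;
	int count;                   /* models mnt_count */
	int flags;
};

/* Attach child under parent: the child starts with its own reference
 * and pins the parent with one more, as attach_mnt() does. */
static void model_mount(struct mnode *child, struct mnode *parent, int flags)
{
	child->parent = parent;
	child->first_child = NULL;
	child->count = 1;
	child->flags = flags;
	child->next_sibling = parent->first_child;
	parent->first_child = child;
	parent->count++;
}

/* Pre-order walk of the sub-tree under root, like next_mnt(). */
static struct mnode *next_node(struct mnode *p, struct mnode *root)
{
	if (p->first_child)
		return p->first_child;
	while (p != root) {
		if (p->next_sibling)
			return p->next_sibling;
		p = p->parent;
	}
	return NULL;
}

/* offset = references the caller itself holds on root */
static bool model_can_expire(struct mnode *root, int offset)
{
	struct mnode *m;
	int count;

	if (root->parent == root || root->parent == NULL)
		return false;
	/* discount root's own reference and the caller's references */
	count = root->count - 1 - offset;
	for (m = next_node(root, root); m; m = next_node(m, root)) {
		if (!(m->flags & MODEL_CHILDEXPIRE))
			return false;  /* an independent mount keeps the tree */
		/* discount m's own reference and the one it holds on its
		 * parent, which was counted in the parent's total */
		count += m->count - 2;
	}
	assert(count >= 0);  /* plays the role of WARN_ON(count < 0) */
	return count == 0;
}
```

A single extra reference anywhere in the sub-tree (for example an open file below the deepest mount) makes the total non-zero, which is what lets the whole hierarchy expire atomically or not at all.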

2004-10-25 14:48:30

by Mike Waychison

Subject: [PATCH 7/28] AFS: Update AFS to use new expiry interface

Update kAFS to use the new mountpoint expire infrastructure.

Signed-off-by: Mike Waychison <[email protected]>
---

cmservice.c | 5 -----
mntpt.c | 34 +++++++---------------------------
super.c | 2 --
3 files changed, 7 insertions(+), 34 deletions(-)

Index: linux-2.6.9-quilt/fs/afs/mntpt.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/afs/mntpt.c 2004-08-14 01:36:17.000000000 -0400
+++ linux-2.6.9-quilt/fs/afs/mntpt.c 2004-10-22 17:17:36.477918656 -0400
@@ -43,16 +43,6 @@ struct inode_operations afs_mntpt_inode_
.getattr = afs_inode_getattr,
};

-static LIST_HEAD(afs_vfsmounts);
-
-static void afs_mntpt_expiry_timed_out(struct afs_timer *timer);
-
-struct afs_timer_ops afs_mntpt_expiry_timer_ops = {
- .timed_out = afs_mntpt_expiry_timed_out,
-};
-
-struct afs_timer afs_mntpt_expiry_timer;
-
unsigned long afs_mntpt_expiry_timeout = 20;

/*****************************************************************************/
@@ -252,9 +242,15 @@ static int afs_mntpt_follow_link(struct

newnd = *nd;
newnd.dentry = dentry;
- err = do_add_mount(newmnt, &newnd, 0, &afs_vfsmounts);
+ err = do_graft_mount(newmnt, &newnd);

if (!err) {
+ /*
+ * We currently don't test to see if we were able to set
+ * expiry.
+ */
+ mnt_expire(newmnt, afs_mntpt_expiry_timeout);
+
path_release(nd);
mntget(newmnt);
nd->mnt = newmnt;
@@ -265,19 +261,3 @@ static int afs_mntpt_follow_link(struct
kleave(" = %d", err);
return err;
} /* end afs_mntpt_follow_link() */
-
-/*****************************************************************************/
-/*
- * handle mountpoint expiry timer going off
- */
-static void afs_mntpt_expiry_timed_out(struct afs_timer *timer)
-{
- kenter("");
-
- mark_mounts_for_expiry(&afs_vfsmounts);
-
- afs_kafstimod_add_timer(&afs_mntpt_expiry_timer,
- afs_mntpt_expiry_timeout * HZ);
-
- kleave("");
-} /* end afs_mntpt_expiry_timed_out() */
Index: linux-2.6.9-quilt/fs/afs/cmservice.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/afs/cmservice.c 2004-08-14 01:36:57.000000000 -0400
+++ linux-2.6.9-quilt/fs/afs/cmservice.c 2004-10-22 17:17:36.478918504 -0400
@@ -306,9 +306,6 @@ int afscm_start(void)
ret = rxrpc_add_service(afs_transport, &AFSCM_service);
if (ret < 0)
goto kill;
-
- afs_kafstimod_add_timer(&afs_mntpt_expiry_timer,
- afs_mntpt_expiry_timeout * HZ);
}

afscm_usage++;
@@ -389,8 +386,6 @@ void afscm_stop(void)
spin_lock(&kafscmd_attention_lock);
}
spin_unlock(&kafscmd_attention_lock);
-
- afs_kafstimod_del_timer(&afs_mntpt_expiry_timer);
}

up_write(&afscm_sem);
Index: linux-2.6.9-quilt/fs/afs/super.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/afs/super.c 2004-08-14 01:36:56.000000000 -0400
+++ linux-2.6.9-quilt/fs/afs/super.c 2004-10-22 17:17:36.478918504 -0400
@@ -78,8 +78,6 @@ int __init afs_fs_init(void)

_enter("");

- afs_timer_init(&afs_mntpt_expiry_timer, &afs_mntpt_expiry_timer_ops);
-
/* create ourselves an inode cache */
atomic_set(&afs_count_active_inodes, 0);


2004-10-25 14:48:32

by Mike Waychison

Subject: [PATCH 2/28] VFS: mnt_fslink -> mnt_expire

This patch renames vfsmount->mnt_fslink to something a little more
descriptive: vfsmount->mnt_expire.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 24 ++++++++++++------------
include/linux/mount.h | 2 +-
2 files changed, 13 insertions(+), 13 deletions(-)

Index: linux-2.6.9-quilt/include/linux/mount.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/mount.h 2004-08-14 01:36:14.000000000 -0400
+++ linux-2.6.9-quilt/include/linux/mount.h 2004-10-22 17:17:33.460377392 -0400
@@ -32,7 +32,7 @@ struct vfsmount
int mnt_expiry_mark; /* true if marked for expiry */
char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */
struct list_head mnt_list;
- struct list_head mnt_fslink; /* link in fs-specific expiry list */
+ struct list_head mnt_expire; /* link in fs-specific expiry list */
struct namespace *mnt_namespace; /* containing namespace */
};

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:32.921459320 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:33.461377240 -0400
@@ -60,7 +60,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
INIT_LIST_HEAD(&mnt->mnt_list);
- INIT_LIST_HEAD(&mnt->mnt_fslink);
+ INIT_LIST_HEAD(&mnt->mnt_expire);
if (name) {
int size = strlen(name)+1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -166,8 +166,8 @@ clone_mnt(struct vfsmount *old, struct d
/* stick the duplicate mount on the same expiry list
* as the original if that was on one */
spin_lock(&vfsmount_lock);
- if (!list_empty(&old->mnt_fslink))
- list_add(&mnt->mnt_fslink, &old->mnt_fslink);
+ if (!list_empty(&old->mnt_expire))
+ list_add(&mnt->mnt_expire, &old->mnt_expire);
spin_unlock(&vfsmount_lock);
}
return mnt;
@@ -351,7 +351,7 @@ static void umount_tree(struct vfsmount
while (!list_empty(&kill)) {
mnt = list_entry(kill.next, struct vfsmount, mnt_list);
list_del_init(&mnt->mnt_list);
- list_del_init(&mnt->mnt_fslink);
+ list_del_init(&mnt->mnt_expire);
if (mnt->mnt_parent == mnt) {
spin_unlock(&vfsmount_lock);
} else {
@@ -644,7 +644,7 @@ static int do_loopback(struct nameidata
if (mnt) {
/* stop bind mounts from expiring */
spin_lock(&vfsmount_lock);
- list_del_init(&mnt->mnt_fslink);
+ list_del_init(&mnt->mnt_expire);
spin_unlock(&vfsmount_lock);

err = graft_tree(mnt, nd);
@@ -743,7 +743,7 @@ static int do_move_mount(struct nameidat

/* if the mount is moved, it should no longer be expire
* automatically */
- list_del_init(&old_nd.mnt->mnt_fslink);
+ list_del_init(&old_nd.mnt->mnt_expire);
out2:
spin_unlock(&vfsmount_lock);
out1:
@@ -812,7 +812,7 @@ int do_add_mount(struct vfsmount *newmnt
if (err == 0 && fslist) {
/* add to the specified expiration list */
spin_lock(&vfsmount_lock);
- list_add_tail(&newmnt->mnt_fslink, fslist);
+ list_add_tail(&newmnt->mnt_expire, fslist);
spin_unlock(&vfsmount_lock);
}

@@ -846,13 +846,13 @@ void mark_mounts_for_expiry(struct list_
* - still marked for expiry (marked on the last call here; marks are
* cleared by mntput())
*/
- list_for_each_entry_safe(mnt, next, mounts, mnt_fslink) {
+ list_for_each_entry_safe(mnt, next, mounts, mnt_expire) {
if (!xchg(&mnt->mnt_expiry_mark, 1) ||
atomic_read(&mnt->mnt_count) != 1)
continue;

mntget(mnt);
- list_move(&mnt->mnt_fslink, &graveyard);
+ list_move(&mnt->mnt_expire, &graveyard);
}

/*
@@ -862,8 +862,8 @@ void mark_mounts_for_expiry(struct list_
* - dispose of the corpse
*/
while (!list_empty(&graveyard)) {
- mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink);
- list_del_init(&mnt->mnt_fslink);
+ mnt = list_entry(graveyard.next, struct vfsmount, mnt_expire);
+ list_del_init(&mnt->mnt_expire);

/* don't do anything if the namespace is dead - all the
* vfsmounts from it are going away anyway */
@@ -913,7 +913,7 @@ void mark_mounts_for_expiry(struct list_
/* someone brought it back to life whilst we didn't
* have any locks held so return it to the expiration
* list */
- list_add_tail(&mnt->mnt_fslink, mounts);
+ list_add_tail(&mnt->mnt_expire, mounts);
spin_unlock(&vfsmount_lock);
}


2004-10-25 14:53:54

by Mike Waychison

Subject: [PATCH 8/28] VFS: Remove MNT_EXPIRE support

Drop support for MNT_EXPIRE (flag to umount(2)). Nobody was using it and it
didn't fit into the new expiry framework.

Note: maybe make this bit a DO NOT USE bit?

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 45 +++++++++------------------------------------
include/linux/fs.h | 1 -
2 files changed, 9 insertions(+), 37 deletions(-)

Index: linux-2.6.9-quilt/include/linux/fs.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/fs.h 2004-08-14 01:36:32.000000000 -0400
+++ linux-2.6.9-quilt/include/linux/fs.h 2004-10-22 17:17:37.120820920 -0400
@@ -719,7 +719,6 @@ extern int send_sigurg(struct fown_struc

#define MNT_FORCE 0x00000001 /* Attempt to forcibily umount */
#define MNT_DETACH 0x00000002 /* Just detach from the tree */
-#define MNT_EXPIRE 0x00000004 /* Mark for expiry */

extern struct list_head super_blocks;
extern spinlock_t sb_lock;
Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:35.929002104 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:37.121820768 -0400
@@ -157,11 +157,12 @@ static struct vfsmount *next_mnt(struct
return list_entry(next, struct vfsmount, mnt_child);
}

-static int __can_expire(struct vfsmount *root, int offset)
+/* this expects the caller to hold vfsmount_lock */
+static int can_expire(struct vfsmount *root, int offset)
{
struct vfsmount *mnt;
int count;
-
+
/* handle the case of a root or orphaned mountpoint */
if (root->mnt_parent == root || root->mnt_parent == NULL)
return 0;
@@ -171,18 +172,9 @@ static int __can_expire(struct vfsmount
return 0;
count += atomic_read(&mnt->mnt_count) - 2;
}
-
- WARN_ON(count < 0);
- return count == 0;
-}

-static int can_expire(struct vfsmount *root)
-{
- int ret;
- spin_lock(&vfsmount_lock);
- ret = __can_expire(root, 1);
- spin_unlock(&vfsmount_lock);
- return ret;
+ WARN_ON(count < 0);
+ return count == 0;
}

static struct vfsmount *
@@ -453,24 +445,6 @@ static int do_umount(struct vfsmount *mn
return retval;

/*
- * Allow userspace to request a mountpoint be expired rather than
- * unmounting unconditionally. Unmount only happens if:
- * (1) the mark is already set (the mark is cleared by mntput())
- * (2) the usage count == 1 [parent vfsmount] + 1 [sys_umount]
- */
- if (flags & MNT_EXPIRE) {
- if (mnt == current->fs->rootmnt ||
- flags & (MNT_FORCE | MNT_DETACH))
- return -EINVAL;
-
- if (!can_expire(mnt))
- return -EBUSY;
-
- if (--mnt->mnt_expiry_countdown != 0)
- return -EAGAIN;
- }
-
- /*
* If we may have to abort operations to get out of this
* mount, and they will themselves hold resources we must
* allow the fs to do things. In the Unix tradition of
@@ -524,8 +498,7 @@ static int do_umount(struct vfsmount *mn
spin_lock(&vfsmount_lock);
}
retval = -EBUSY;
- if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH
- || (flags & MNT_EXPIRE && can_expire(mnt))) {
+ if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
if (!list_empty(&mnt->mnt_list)) {
clear_expire(mnt);
umount_tree(mnt);
@@ -1019,7 +992,7 @@ static void do_expiry_run(void *nothing)
}
if (mnt->mnt_expiry_countdown >= 1)
mnt->mnt_expiry_countdown--;
- if (__can_expire(mnt, 0) && mnt->mnt_expiry_countdown == 0) {
+ if (can_expire(mnt, 0) && mnt->mnt_expiry_countdown == 0) {
mntget(mnt);
list_move(&mnt->mnt_expire, &graveyard);
}
@@ -1048,7 +1021,7 @@ static void do_expiry_run(void *nothing)
down_write(&namespace->sem);
spin_lock(&vfsmount_lock);

- if (!__can_expire(mnt, 1) || mnt->mnt_active) {
+ if (!can_expire(mnt, 1) || mnt->mnt_active) {
list_add_tail(&mnt->mnt_expire, &expiry_list);
} else {
parent = find_expiring_parent(mnt);
@@ -1057,7 +1030,7 @@ static void do_expiry_run(void *nothing)
umount_tree(mnt);

/* the parent may be expirable now */
- if (parent && __can_expire(parent, 1) &&
+ if (parent && can_expire(parent, 1) &&
parent->mnt_expiry_countdown == 0 &&
!parent->mnt_active) {
list_move_tail(&parent->mnt_expire, &graveyard);

2004-10-25 14:55:05

by Mike Waychison

Subject: [PATCH 4/28] VFS: Stat shouldn't stop expire

This patch fixes the problem where a mountpoint that is going to
expire never does so as long as somebody keeps stat(2)ing the root of its
filesystem. For example, consider the case where a user has his home
directory automounted on /home/mikew. Some other user can keep the
filesystem mounted forever by simply calling ls(1) in /home, because each stat
resets the expiry marker.
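A small userspace model of the change shows why the ls(1) case no longer keeps the mount alive. The types and functions below are hypothetical stand-ins: a dentry is reduced to an integer id, and the recently-used flag is named after the mnt_active field from patch 5/28 (at this point in the series mntput() clears mnt_expiry_mark instead, which has the same effect).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct vfsmount and struct nameidata. */
struct pr_mount {
	int count;        /* reference count */
	int active;       /* "recently used" flag set by mntput() */
	int root_dentry;  /* id of the dentry at the root of this mount */
};

struct pr_nameidata {
	struct pr_mount *mnt;
	int dentry;
};

/* _mntput(): drop the reference without touching expiry state */
static void pr__mntput(struct pr_mount *mnt)
{
	mnt->count--;
}

/* mntput(): drop the reference and mark the mount as recently used */
static void pr_mntput(struct pr_mount *mnt)
{
	mnt->active = 1;
	pr__mntput(mnt);
}

/* path_release() as changed by this patch: releasing a path that names
 * the mount's own root does not count as activity. */
static void pr_path_release(struct pr_nameidata *nd)
{
	if (nd->mnt && nd->mnt->root_dentry == nd->dentry)
		pr__mntput(nd->mnt);
	else if (nd->mnt)
		pr_mntput(nd->mnt);
}
```

Stat'ing /home/mikew from /home releases a path whose dentry is the mount's root, so only the reference is dropped; a lookup of a path inside the filesystem still marks the mount as used.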

Signed-off-by: Mike Waychison <[email protected]>
---

namei.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-2.6.9-quilt/fs/namei.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namei.c 2004-08-14 01:36:45.000000000 -0400
+++ linux-2.6.9-quilt/fs/namei.c 2004-10-22 17:17:34.762179488 -0400
@@ -275,7 +275,16 @@ int deny_write_access(struct file * file
void path_release(struct nameidata *nd)
{
dput(nd->dentry);
- mntput(nd->mnt);
+ /*
+ * In order to ensure that access to an automounted filesystem's
+ * root does not reset its expire counter, we check to see if the path
+ * being released here is a mountpoint itself. If it is, then we call
+ * _mntput which leaves the expire counter alone.
+ */
+ if (nd->mnt && nd->mnt->mnt_root == nd->dentry)
+ _mntput(nd->mnt);
+ else
+ mntput(nd->mnt);
}

/*

2004-10-25 14:55:04

by Mike Waychison

Subject: [PATCH 5/28] VFS: Make expiry timeout configurable

This patch modifies the expiry logic to make timeouts configurable. We do
this by using a per-vfsmount counter that ticks downwards every second of
non-usage. We also store a per-vfsmount value that this counter gets reset
to whenever the mount has been used (mntput() sets mnt_active, and the next
expiry run restores the countdown).
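The per-mount step of do_expiry_run() after this patch can be modeled as follows. It is a single-threaded sketch with hypothetical names: the kernel clears mnt_active with xchg() and checks mnt_count under vfsmount_lock, which the `busy` field only stands in for, and the period between passes is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fields this patch adds to struct vfsmount. */
struct tick_mount {
	bool busy;              /* stands in for the mnt_count check */
	int active;             /* mnt_active: set by mntput() */
	int expiry_countdown;   /* ticks left before the mount expires */
	int expiry_ticks;       /* value the countdown is reset to */
};

/* One expiry pass over a single mount; returns true if the mount would
 * be moved to the graveyard list. */
static bool expiry_tick(struct tick_mount *mnt)
{
	if (mnt->active) {
		/* used since the last pass: clear the flag, restart the count */
		mnt->active = 0;
		mnt->expiry_countdown = mnt->expiry_ticks;
		return false;
	}
	if (mnt->expiry_countdown >= 1)
		mnt->expiry_countdown--;
	return !mnt->busy && mnt->expiry_countdown == 0;
}
```

With a 3-tick timeout, a single mntput() in the middle of the countdown means the mount needs three further idle passes before it is picked up, and a busy mount is never picked up even once its countdown reaches zero.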

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 33 +++++++++++++++++++++++----------
include/linux/mount.h | 6 ++++--
2 files changed, 27 insertions(+), 12 deletions(-)

Index: linux-2.6.9-quilt/include/linux/mount.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/mount.h 2004-10-22 17:17:34.147272968 -0400
+++ linux-2.6.9-quilt/include/linux/mount.h 2004-10-22 17:17:35.377086008 -0400
@@ -29,7 +29,9 @@ struct vfsmount
struct list_head mnt_child; /* and going through their mnt_child */
atomic_t mnt_count;
int mnt_flags;
- int mnt_expiry_mark; /* true if marked for expiry */
+ int mnt_active; /* flag set on mntput() */
+ int mnt_expiry_countdown; /* countdown of ticks until expiry */
+ int mnt_expiry_ticks; /* total ticks before expiry */
char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */
struct list_head mnt_list;
struct list_head mnt_expire; /* link in fs-specific expiry list */
@@ -56,7 +58,7 @@ static inline void _mntput(struct vfsmou
static inline void mntput(struct vfsmount *mnt)
{
if (mnt) {
- mnt->mnt_expiry_mark = 0;
+ xchg(&mnt->mnt_active, 1);
_mntput(mnt);
}
}
Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:34.148272816 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:35.378085856 -0400
@@ -68,6 +68,8 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_mounts);
INIT_LIST_HEAD(&mnt->mnt_list);
INIT_LIST_HEAD(&mnt->mnt_expire);
+ mnt->mnt_expiry_ticks = mnt->mnt_expiry_countdown = 0;
+ mnt->mnt_active = 1;
if (name) {
int size = strlen(name)+1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -365,9 +367,9 @@ static void umount_tree(struct vfsmount
struct nameidata old_nd;
detach_mnt(mnt, &old_nd);
spin_unlock(&vfsmount_lock);
- path_release(&old_nd);
+ path_release_on_umount(&old_nd);
}
- mntput(mnt);
+ _mntput(mnt);
spin_lock(&vfsmount_lock);
}
}
@@ -395,7 +397,7 @@ static int do_umount(struct vfsmount *mn
if (atomic_read(&mnt->mnt_count) != 2)
return -EBUSY;

- if (!xchg(&mnt->mnt_expiry_mark, 1))
+ if (--mnt->mnt_expiry_countdown != 0)
return -EAGAIN;
}

@@ -840,6 +842,8 @@ void mnt_expire(struct vfsmount *mnt, un
goto out;

list_del_init(&mnt->mnt_expire);
+ mnt->mnt_expiry_ticks = mnt->mnt_expiry_countdown = expire;
+ mnt->mnt_active = 1;
if (expire > 0)
list_add_tail(&mnt->mnt_expire, &expiry_list);
out:
@@ -847,7 +851,7 @@ out:
up(&expiry_sem);
}
EXPORT_SYMBOL_GPL(mnt_expire);
-
+
/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren't in use and haven't been touched since last we came
@@ -872,12 +876,21 @@ static void do_expiry_run(void *nothing)
* cleared by mntput())
*/
list_for_each_entry_safe(mnt, next, &expiry_list, mnt_expire) {
- if (!xchg(&mnt->mnt_expiry_mark, 1) ||
- atomic_read(&mnt->mnt_count) != 1)
+ /*
+ * Something might still hold a reference to this mount. If
+ * mnt_active is set, we can just move on. If it gets set
+ * after we xchg() here, we'll catch it on the graveyard list.
+ */
+ if (xchg(&mnt->mnt_active, 0)) {
+ mnt->mnt_expiry_countdown = mnt->mnt_expiry_ticks;
continue;
-
- mntget(mnt);
- list_move(&mnt->mnt_expire, &graveyard);
+ }
+ if (mnt->mnt_expiry_countdown >= 1)
+ mnt->mnt_expiry_countdown--;
+ if (atomic_read(&mnt->mnt_count) == 2 && mnt->mnt_expiry_countdown == 0) {
+ mntget(mnt);
+ list_move(&mnt->mnt_expire, &graveyard);
+ }
}

/*
@@ -903,7 +916,7 @@ static void do_expiry_run(void *nothing)

/* check that it is still dead: the count should now be 2 - as
* contributed by the vfsmount parent and the mntget above */
- if (atomic_read(&mnt->mnt_count) == 2) {
+ if (atomic_read(&mnt->mnt_count) == 2 && !mnt->mnt_active) {
struct vfsmount *xdmnt;
struct dentry *xdentry;


2004-10-25 15:14:41

by Christoph Hellwig

Subject: Re: [PATCH 12/28] VFS: Remove (now bogus) check_mnt

On Mon, Oct 25, 2004 at 10:44:33AM -0400, Mike Waychison wrote:
> check_mnt used to be used to see if a mountpoint was actually grafted or not
> to a namespace. This was done because we didn't support mountpoints being
> attached to one another if they weren't associated with a namespace. We now
> support this, so all check_mnt calls are bogus. The only exception is that
> pivot_root still requires all participants to exist within the same
> namespace.

did you audit the namespace code that it doesn't allow attaching to other
namespaces than the current?

2004-10-25 15:19:43

by Christoph Hellwig

Subject: Re: [PATCH 8/28] VFS: Remove MNT_EXPIRE support

On Mon, Oct 25, 2004 at 11:12:00AM -0400, Mike Waychison wrote:
> Christoph Hellwig wrote:
> > On Mon, Oct 25, 2004 at 10:42:32AM -0400, Mike Waychison wrote:
> >
> >>Drop support for MNT_EXPIRE (flag to umount(2)). Nobody was using it and it
> >>didn't fit into the new expiry framework.
> >
> >
> > umm, this is a user API, you can't simply drop it.
> >
>
> Is anybody using it though?

doesn't matter much. Maybe Sun likes deliberately breaking user ABIs in
Solaris, but in Linux we certainly don't.

> Hmm. I'll think about it a while to figure out how to map this
> functionality to the new expire semantics. Any suggestions?

Hey, it's you who wants the new semantics. And you didn't even explain
them in detail.

2004-10-25 15:14:42

by Mike Waychison

Subject: [PATCH 13/28] VFS: Introduce soft reference counts

This patch introduces the concept of a 'soft' reference count for a vfsmount.
This type of reference count allows for references to be held on mountpoints
that do not affect their busy states for userland unmounting. Some might
argue that this is wrong because 'when I unmount a filesystem, I want the
resources associated with it to go away too', but this way of thinking was
deprecated with the addition of namespaces and --bind back in the 2.4 series.

A future addition may see a callback mechanism so that in-kernel users can
use a given mountpoint and have it deregistered some way (quota and
accounting come to mind).

These soft reference counts are used by a later patch that adds an interface
for holding and manipulating mountpoints using file descriptors.
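A minimal userspace model of the distinction, assuming simple non-atomic counters: hard references are what the umount busy check compares against, while soft references only keep the object alive. struct toy_mount and the toy_* functions are hypothetical, and the expected count of 2 (the parent's reference plus the caller's) is taken from the umount walkthrough in patch 11. Unlike the posted mntsoftput(), toy_softput() touches only the soft count, so gets and puts stay balanced.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vfsmount with both kinds of reference. */
struct toy_mount {
	int mnt_count;      /* hard references, including the parent's */
	int mnt_softcount;  /* references that never make the mount busy */
	bool attached;      /* still grafted into a namespace */
};

static void toy_softget(struct toy_mount *m)
{
	m->mnt_softcount++;
}

/* Returns true when this put dropped the last reference of either kind. */
static bool toy_softput(struct toy_mount *m)
{
	assert(m->mnt_softcount > 0);
	m->mnt_softcount--;
	return m->mnt_count == 0 && m->mnt_softcount == 0;
}

/*
 * Busy check in the spirit of do_umount(): only hard references count.
 * 'expected' is the number of hard references that are accounted for
 * (the parent's plus the caller's, i.e. 2).
 */
static int toy_umount(struct toy_mount *m, int expected)
{
	if (m->mnt_count != expected)
		return -EBUSY;
	m->attached = false;
	m->mnt_count -= expected;  /* parent and caller both let go */
	return 0;
}
```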

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 4 ++++
include/linux/mount.h | 28 ++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)

Index: linux-2.6.9-quilt/include/linux/mount.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/mount.h 2004-10-22 17:17:38.881553248 -0400
+++ linux-2.6.9-quilt/include/linux/mount.h 2004-10-22 17:17:40.185355040 -0400
@@ -30,6 +30,7 @@ struct vfsmount
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
atomic_t mnt_count;
+ atomic_t mnt_softcount; /* hold reference w/o going busy */
union {
struct vfsmount *base; /* pointer to root of vfsmount tree */
atomic_t count; /* user ref count on this tree */
@@ -104,6 +105,33 @@ static inline void mntput(struct vfsmoun
}
}

+static inline struct vfsmount *mntsoftget(struct vfsmount *mnt)
+{
+ if (mnt) {
+ read_lock(&vfsmountref_lock);
+ atomic_inc(&mnt->mnt_softcount);
+ mntgroupget(mnt);
+ read_unlock(&vfsmountref_lock);
+ }
+ return mnt;
+}
+
+static inline void mntsoftput(struct vfsmount *mnt)
+{
+ struct vfsmount *cleanup;
+ might_sleep();
+ if (mnt) {
+ if (atomic_dec_and_test(&mnt->mnt_count))
+ __mntput(mnt);
+ read_lock(&vfsmountref_lock);
+ cleanup = mntgroupput(mnt);
+ atomic_dec(&mnt->mnt_softcount);
+ read_unlock(&vfsmountref_lock);
+ if (cleanup)
+ __mntgroupput(cleanup);
+ }
+}
+
extern void free_vfsmnt(struct vfsmount *mnt);
extern struct vfsmount *alloc_vfsmnt(const char *name);
extern struct vfsmount *do_kern_mount(const char *fstype, int flags,
Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:39.557450496 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:40.187354736 -0400
@@ -73,6 +73,7 @@ struct vfsmount *alloc_vfsmnt(const char
memset(mnt, 0, sizeof(struct vfsmount));
mnt->mnt_flags = MNT_ISBASE;
atomic_set(&mnt->mnt_count,1);
+ atomic_set(&mnt->mnt_softcount,0);
atomic_set(&mnt->mnt_group.count, 1);
INIT_LIST_HEAD(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
@@ -187,6 +188,7 @@ static void detach_mnt(struct vfsmount *

/* count the total number of refcounts in the sub-tree */
nrefs += atomic_read(&p->mnt_count);
+ + atomic_read(&p->mnt_softcount);
}

/*
@@ -362,6 +364,7 @@ void __mntgroupput(struct vfsmount *mnt)

if (mnt == mnt->mnt_parent) {
WARN_ON(atomic_read(&mnt->mnt_count) != 0);
+ WARN_ON(atomic_read(&mnt->mnt_softcount) != 0);
list_del_init(&mnt->mnt_list);
__mntput(mnt);
} else {
@@ -373,6 +376,7 @@ void __mntgroupput(struct vfsmount *mnt)
dput(old_nd.dentry);
atomic_dec(&mnt->mnt_count);
WARN_ON(atomic_read(&mnt->mnt_count) != 0);
+ WARN_ON(atomic_read(&mnt->mnt_softcount) != 0);
__mntput(mnt);
}
}

2004-10-25 15:23:23

by Mike Waychison

Subject: Re: [PATCH 12/28] VFS: Remove (now bogus) check_mnt


Christoph Hellwig wrote:
> On Mon, Oct 25, 2004 at 10:44:33AM -0400, Mike Waychison wrote:
>
>>check_mnt used to be used to see if a mountpoint was actually grafted or not
>>to a namespace. This was done because we didn't support mountpoints being
>>attached to one another if they weren't associated with a namespace. We now
>>support this, so all check_mnt calls are bogus. The only exception is that
>>pivot_root still requires all participants to exist within the same
>>namespace.
>
>
> did you audit the namespace code that it doesn't allow attaching to other
> namespaces than the current?
>

So, I don't see how that is possible, other than through relative
resolution from a cwd in the other namespace. Arguably, you aren't
buying any security by denying the mountpoint if you already let other
processes in your namespace.

Auditing the original code, it appeared that doing such a thing was a
no-no only because the locking semantics of current->namespace->sem made
this difficult.


--
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE: The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2004-10-25 15:28:37

by Christoph Hellwig

Subject: Re: [PATCH 13/28] VFS: Introduce soft reference counts

On Mon, Oct 25, 2004 at 10:45:03AM -0400, Mike Waychison wrote:
> This patch introduces the concept of a 'soft' reference count for a vfsmount.
> This type of reference count allows for references to be held on mountpoints
> that do not affect their busy states for userland unmounting. Some might
> argue that this is wrong because 'when I unmount a filesystem, I want the
> resources associated with it to go away too', but this way of thinking was
> deprecated with the addition of namespaces and --bind back in the 2.4 series.
>
> A future addition may see a callback mechanism so that in kernel users can
> use a given mountpoint and have it deregistered some way (quota and
> accounting come to mind).
>
> These soft reference counts are used by a later patch that adds an interface
> for holding and manipulating mountpoints using filedescriptors.

You haven't explained why you actually need it, though.

2004-10-25 15:33:41

by Mike Waychison

Subject: Re: [PATCH 8/28] VFS: Remove MNT_EXPIRE support


Christoph Hellwig wrote:
> On Mon, Oct 25, 2004 at 11:12:00AM -0400, Mike Waychison wrote:
>
>>Christoph Hellwig wrote:
>>
>>>On Mon, Oct 25, 2004 at 10:42:32AM -0400, Mike Waychison wrote:
>>>
>>>
>>>>Drop support for MNT_EXPIRE (flag to umount(2)). Nobody was using it and it
>>>>didn't fit into the new expiry framework.
>>>
>>>
>>>umm, this is a user API, you can't simply drop it.
>>>
>>
>>Is anybody using it though?
>
>
> doesn't matter much. Maybe Sun likes deliberately breaking user ABIs in
> Solaris, but in Linux we certainly don't.

I wouldn't know, I only play with Linux ;)

>
>
>>Hmm. I'll think about it a while to figure out how to map this
>>functionality to the new expire semantics. Any suggestions?
>
>
> Hey, it's you who wants the new semantics. And you didn't even explain
> them in detail.
>

Hmm.. I think some bits of the series bounced. Let me look.


2004-10-25 15:44:10

by Mike Waychison

Subject: Re: [PATCH 14/28] VFS: Introduce Mountpoint file descriptors (resend)


Christoph Hellwig wrote:

> You haven't explained why you actually need it, though.
>

Apparently I used the wrong server and a couple patches bounced :\

I'll try to make the next series more 'forward self-describing' :)


Attachments:
14-introduce_mountfd.diff (10.97 kB)

2004-10-25 17:20:36

by Mika Penttilä

Subject: Re: [PATCH 13/28] VFS: Introduce soft reference counts

Mike Waychison wrote:

>This patch introduces the concept of a 'soft' reference count for a vfsmount.
>This type of reference count allows for references to be held on mountpoints
>that do not affect their busy states for userland unmounting. Some might
>argue that this is wrong because 'when I unmount a filesystem, I want the
>resources associated with it to go away too', but this way of thinking was
>deprecated with the addition of namespaces and --bind back in the 2.4 series.
>
>A future addition may see a callback mechanism so that in kernel users can
>use a given mountpoint and have it deregistered some way (quota and
>accounting come to mind).
>
>These soft reference counts are used by a later patch that adds an interface
>for holding and manipulating mountpoints using filedescriptors.
>
>Signed-off-by: Mike Waychison <[email protected]>
>
>+static inline struct vfsmount *mntsoftget(struct vfsmount *mnt)
>+{
>+ if (mnt) {
>+ read_lock(&vfsmountref_lock);
>+ atomic_inc(&mnt->mnt_softcount);
>+ mntgroupget(mnt);
>+ read_unlock(&vfsmountref_lock);
>+ }
>+ return mnt;
>+}
>+
>+static inline void mntsoftput(struct vfsmount *mnt)
>+{
>+ struct vfsmount *cleanup;
>+ might_sleep();
>+ if (mnt) {
>+ if (atomic_dec_and_test(&mnt->mnt_count))
>+ __mntput(mnt);
>+ read_lock(&vfsmountref_lock);
>+ cleanup = mntgroupput(mnt);
>+ atomic_dec(&mnt->mnt_softcount);
>+ read_unlock(&vfsmountref_lock);
>+ if (cleanup)
>+ __mntgroupput(cleanup);
>+ }
>+}
>+
> extern void free_vfsmnt(struct vfsmount *mnt);
>
>
What is this against? What are mntgroupput and mntgroupget? Why does
soft put decrement mnt_count which isn't incremented by soft get? How do
soft references allow userland umount? I don't see soft references used
anywhere...

--Mika

2004-10-25 17:26:28

by Mike Waychison

Subject: Re: [PATCH 8/28] VFS: Remove MNT_EXPIRE support


Christoph Hellwig wrote:
> On Mon, Oct 25, 2004 at 10:42:32AM -0400, Mike Waychison wrote:
>
>>Drop support for MNT_EXPIRE (flag to umount(2)). Nobody was using it and it
>>didn't fit into the new expiry framework.
>
>
> umm, this is a user API, you can't simply drop it.
>

I also wanted to add that given the current interface that is found in
mainline, there is no way for userspace to even set a mountpoint as
expiring. The only consumer is still AFS which handles the
mark_mounts_for_expiry stuff itself.

So even if userspace wanted to use MNT_EXPIRE, it couldn't.


2004-10-25 17:35:01

by Mike Waychison

Subject: Re: [PATCH 8/28] VFS: Remove MNT_EXPIRE support


Mike Waychison wrote:
> Christoph Hellwig wrote:
>
>>>On Mon, Oct 25, 2004 at 10:42:32AM -0400, Mike Waychison wrote:
>>>
>>>
>>>>Drop support for MNT_EXPIRE (flag to umount(2)). Nobody was using it and it
>>>>didn't fit into the new expiry framework.
>>>
>>>
>>>umm, this is a user API, you can't simply drop it.
>>>
>
>
> I also wanted to add that given the current interface that is found in
> mainline, there is no way for userspace to even set a mountpoint as
> expiring. The only consumer is still AFS which handles the
> mark_mounts_for_expiry stuff itself.
>
> So even if userspace wanted to use MNT_EXPIRE, it couldn't.
>

Gah, never mind. I'm an idiot.

MNT_EXPIRE allows userspace to do the countdown itself. For some reason
I assumed the mount in question had to be on a mnt_fslink list.


2004-10-25 17:53:17

by Mika Penttilä

Subject: Re: [PATCH 13/28] VFS: Introduce soft reference counts

Mike Waychison wrote:

>Mika Penttilä wrote:
>
>
>>What is this against? What are mntgroupput and mntgroupget?
>>
>>
>
>This is against patch [PATCH 11/28] VFS: Allow detachable subtrees.
>
>In that patch, mntgroup(get|put) handles the count of all non-glue
>references for a given tree of vfsmounts.
>
>
>
>
>>Why does soft put decrement mnt_count which isn't incremented by soft get?
>>
>>
>
>Ah, thanks for pointing that out. It got messed up when I created the
>patchset from the bk tree. Will fix.
>
>
>
>>How do
>>soft references allow userland umount? I don't see soft references used
>>anywhere...
>>
>>
>
>Soft references are used by the mountpoint file descriptor patch
>[14/28]. They allow references to be had on a vfsmount such that the
>mountpoint itself is not kept busy in the namespace. This allows a
>program to 'grab a mountpoint' by a magic file (gotten by sys_mountfd),
>and perform ops on it. The mountfd holds a reference to the vfsmount,
>but it doesn't keep userspace from trying to umount(2) the path.
>
>Does that help?
>
>
>
I think at least patches 11 and 14 got lost...

--Mika



2004-10-25 18:04:18

by Mike Waychison

Subject: Re: [PATCH 13/28] VFS: Introduce soft reference counts


Mika Penttilä wrote:
> What is this against? What are mntgroupput and mntgroupget?

This is against patch [PATCH 11/28] VFS: Allow detachable subtrees.

In that patch, mntgroup(get|put) handles the count of all non-glue
references for a given tree of vfsmounts.


> Why does soft put decrement mnt_count which isn't incremented by soft get?

Ah, thanks for pointing that out. It got messed up when I created the
patchset from the bk tree. Will fix.

> How do
> soft references allow userland umount? I don't see soft references used
> anywhere...

Soft references are used by the mountpoint file descriptor patch
[14/28]. They allow references to be held on a vfsmount such that the
mountpoint itself is not kept busy in the namespace. This allows a
program to 'grab a mountpoint' through a magic file (obtained from
sys_mountfd) and perform operations on it. The mountfd holds a reference
to the vfsmount, but it doesn't keep userspace from trying to umount(2)
the path.

Does that help?


2004-10-25 18:18:00

by Mike Waychison

Subject: [PATCH 11/28] VFS: Allow for detachable subtrees (resend)

Mika Penttilä wrote:
> I think at least patches 11 and 14 got lost...

11 attached. 14 was resent as a reply to Christoph.

Sorry for the confusion.


Attachments:
11-detachable_subtrees.diff (19.47 kB)

2004-10-25 15:14:42

by Christoph Hellwig

Subject: Re: [PATCH 3/28] VFS: Move expiry into vfs

On Mon, Oct 25, 2004 at 10:40:00AM -0400, Mike Waychison wrote:
> This patch moves the recently added expiry functionality directly into the
> VFS layer. Doing this gives us a couple advantages:
>
> - Allows for configurable timeouts using a single consolidated timer
> - Keeps filesystems from having to each implement their own expiry logic
> - Provides a generic interface that can be used for _any_ filesystem, as
> desired by user applications and/or the system administrator.
>
> This patch implements expiry by having the VFS recursively register work to
> do. Checks are done for expiry every 1 second, so expiry is configurable to
> that granularity.

The expiry timer should only run as long as there are filesystems registered
for expiry.
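One way to read this suggestion, sketched as a single-threaded model: arm the periodic expiry work when the first mount is registered, and have the work re-arm itself only while the expiry list is non-empty. struct expiry_state and the expiry_* functions are hypothetical; timer_armed stands in for a real timer or delayed work item.

```c
#include <assert.h>
#include <stdbool.h>

struct expiry_state {
	int nr_registered;  /* mounts currently on the expiry list */
	bool timer_armed;   /* is the periodic expiry run scheduled? */
};

static void expiry_register(struct expiry_state *s)
{
	/* the first registration starts the timer; re-arming is idempotent */
	if (s->nr_registered++ == 0)
		s->timer_armed = true;
}

static void expiry_unregister(struct expiry_state *s)
{
	assert(s->nr_registered > 0);
	s->nr_registered--;  /* the timer notices on its next run */
}

/* Body of one timer run: scan, then re-arm only if work is left. */
static void expiry_timer_fn(struct expiry_state *s)
{
	s->timer_armed = false;
	/* ... scan the expiry list here ... */
	if (s->nr_registered > 0)
		s->timer_armed = true;
}
```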

2004-10-25 15:14:41

by Christoph Hellwig

Subject: Re: [PATCH 8/28] VFS: Remove MNT_EXPIRE support

On Mon, Oct 25, 2004 at 10:42:32AM -0400, Mike Waychison wrote:
> Drop support for MNT_EXPIRE (flag to umount(2)). Nobody was using it and it
> didn't fit into the new expiry framework.

umm, this is a user API, you can't simply drop it.

2004-10-25 21:55:46

by Mike Waychison

Subject: [PATCH 12/28] VFS: Remove (now bogus) check_mnt

check_mnt was used to check whether a mountpoint was actually grafted into
a namespace. This was needed because we didn't support attaching
mountpoints to one another unless they were associated with a namespace.
We now support this, so all check_mnt calls are bogus. The only exception
is that pivot_root still requires all participants to exist within the
same namespace.

Signed-off-by: Mike Waychison <[email protected]>
---

namespace.c | 41 +++++++++++------------------------------
1 files changed, 11 insertions(+), 30 deletions(-)

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:38.881553248 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:39.557450496 -0400
@@ -124,14 +124,8 @@ struct vfsmount *lookup_mnt(struct vfsmo
spin_unlock(&vfsmount_lock);
return found;
}
-
EXPORT_SYMBOL(lookup_mnt);

-static inline int check_mnt(struct vfsmount *mnt)
-{
- return mnt->mnt_namespace == current->namespace;
-}
-
static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
{
struct list_head *next = p->mnt_mounts.next;
@@ -701,8 +695,6 @@ asmlinkage long sys_umount(char __user *
retval = -EINVAL;
if (nd.dentry != nd.mnt->mnt_root)
goto dput_and_out;
- if (!check_mnt(nd.mnt))
- goto dput_and_out;

retval = -EPERM;
if (!capable(CAP_SYS_ADMIN))
@@ -867,14 +859,11 @@ static int do_loopback(struct nameidata
return err;

down_write(&current->namespace->sem);
- err = -EINVAL;
- if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
- err = -ENOMEM;
- if (recurse)
- mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
- else
- mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);
- }
+ err = -ENOMEM;
+ if (recurse)
+ mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
+ else
+ mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);

if (mnt) {
/* stop bind mounts from expiring */
@@ -912,9 +901,6 @@ static int do_remount(struct nameidata *
if (!capable(CAP_SYS_ADMIN))
return -EPERM;

- if (!check_mnt(nd->mnt))
- return -EINVAL;
-
if (nd->dentry != nd->mnt->mnt_root)
return -EINVAL;

@@ -945,9 +931,6 @@ static int do_move_mount(struct nameidat
down_write(&current->namespace->sem);
while(d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
;
- err = -EINVAL;
- if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt))
- goto out;

err = -ENOENT;
down(&nd->dentry->d_inode->i_sem);
@@ -984,7 +967,6 @@ out2:
spin_unlock(&vfsmount_lock);
out1:
up(&nd->dentry->d_inode->i_sem);
-out:
up_write(&current->namespace->sem);
if (!err)
path_release(&parent_nd);
@@ -1028,9 +1010,6 @@ int do_graft_mount(struct vfsmount *newm
/* Something was mounted here while we slept */
while(d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
;
- err = -EINVAL;
- if (!check_mnt(nd->mnt))
- goto unlock;

/* Refuse the same filesystem on the same mount point */
err = -EBUSY;
@@ -1569,9 +1548,6 @@ asmlinkage long sys_pivot_root(const cha
error = __user_walk(new_root, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &new_nd);
if (error)
goto out0;
- error = -EINVAL;
- if (!check_mnt(new_nd.mnt))
- goto out1;

error = __user_walk(put_old, LOOKUP_FOLLOW|LOOKUP_DIRECTORY, &old_nd);
if (error)
@@ -1589,9 +1565,14 @@ asmlinkage long sys_pivot_root(const cha
read_unlock(&current->fs->lock);
down_write(&current->namespace->sem);
down(&old_nd.dentry->d_inode->i_sem);
+
+ /* All mountpoints must exist within the same namespace */
error = -EINVAL;
- if (!check_mnt(user_nd.mnt))
+ if (user_nd.mnt->mnt_namespace != current->namespace
+ || user_nd.mnt->mnt_namespace != old_nd.mnt->mnt_namespace
+ || user_nd.mnt->mnt_namespace != new_nd.mnt->mnt_namespace)
goto out2;
+
error = -ENOENT;
if (IS_DEADDIR(new_nd.dentry->d_inode))
goto out2;

2004-10-26 00:19:52

by Christoph Hellwig

Subject: Re: [PATCH 6/28] VFS: Make expiry recursive

On Mon, Oct 25, 2004 at 10:41:31AM -0400, Mike Waychison wrote:
> This patch allows for tagging of vfsmounts as being part of a sub-tree
> expiry. It introduces a new vfsmount flag, MNT_CHILDEXPIRE which is used to
> let the system know that the given mountpoint expires with its parent. This
> is a recursive definition.
>
> mnt_expire, the call used to specify that a mount should expire, now takes an
> int interpreted as follows:
> - 0 - The mountpoint should not expire (default)
> - >0 - The value is used to specify the amount of idle time before the
> given mountpoint expires.
> - <0 - The mountpoint must expire with its immediate parent (the parent
> must be set to expire, or must itself be marked to expire
> along with _its_ parent).

so add some constant for this, ala

#define MNT_EXPIRE_RECURSIVE (-1)
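A sketch of how the argument could be decoded with named constants. MNT_EXPIRE_RECURSIVE is the name suggested here; MNT_EXPIRE_NEVER, enum expire_mode and classify_expire() are hypothetical. Any negative value is still treated as "expire with the parent", matching the quoted description.

```c
#include <assert.h>

#define MNT_EXPIRE_NEVER      0     /* hypothetical: never expire (default) */
#define MNT_EXPIRE_RECURSIVE  (-1)  /* expire together with the parent */

enum expire_mode { EXPIRE_OFF, EXPIRE_TIMEOUT, EXPIRE_WITH_PARENT };

static enum expire_mode classify_expire(int expire)
{
	if (expire == MNT_EXPIRE_NEVER)
		return EXPIRE_OFF;
	if (expire > 0)
		return EXPIRE_TIMEOUT;    /* value is the idle time allowed */
	return EXPIRE_WITH_PARENT;    /* any negative value, per the patch */
}
```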

2004-10-26 00:31:48

by Mike Waychison

Subject: [PATCH 11/28] VFS: Allow for detachable subtrees

This patch adds support for detached sub-trees of vfsmounts. Detached
sub-trees mean we can now have trees that are not required to have an
associated struct namespace holding them together. We call such a subtree
data structure a 'group'.

This is implemented by associating a new state with struct vfsmount, the
MNT_ISBASE flag. This flag is used to determine whether or not a given
vfsmount is the base of a group. When this flag is set, the
vfsmount->mnt_group._count value stores an atomic count that reflects the
number of all references to vfsmounts in the tree, excluding those used
for group stitching (a child holds a reference to its parent and
vice-versa). For any group of N vfsmounts, (N-1)*2 reference counts are
used solely for holding them together. Therefore the invariant for this
structure is: for any group with N vfsmounts, rooted at R,
R->mnt_group._count = (sum of i->mnt_count over all N vfsmounts i) - (N-1)*2.
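The invariant can be checked on a toy tree: each parent/child link adds one reference to the parent and one to the child, so subtracting 2*(N-1) from the summed mnt_count values leaves only the external references. struct toy_node and both helpers are hypothetical stand-ins for a vfsmount tree.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
	int parent;     /* index of the parent, or -1 for the group root */
	int user_refs;  /* references held by anything but the tree itself */
};

/* mnt_count of node i: own users + 1 per child + 1 if it has a parent */
static int toy_mnt_count(const struct toy_node *t, size_t n, size_t i)
{
	int count = t[i].user_refs + (t[i].parent >= 0 ? 1 : 0);

	for (size_t j = 0; j < n; j++)
		if (t[j].parent == (int)i)
			count++;
	return count;
}

/* Group count as defined by the invariant: sum minus stitching refs. */
static int toy_group_count(const struct toy_node *t, size_t n)
{
	int sum = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += toy_mnt_count(t, n, i);
	return sum - 2 * (int)(n - 1);
}
```

For a root with one user and two children holding 0 and 2 user references, the mnt_count values are 3, 1 and 3; their sum of 7 minus 2*(3-1) gives 3, the total of the external references.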

To avoid walking up the tree each time we mntput/mntget to manage this
group count, we store a pointer to the base of the tree in
vfsmount->mnt_group._base in each vfsmount where !(vfsmount->mnt_flags &
MNT_ISBASE). (Showing that each node in the tree does not need to hold a
reference count on the base vfsmount is left as an exercise for the
reader.)
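A sketch of the layout this implies: the base of a group stores the group count, every other vfsmount stores a pointer to the base, and MNT_ISBASE says which union member is valid. The member names follow the mnt_group union in the patch 13 hunk (base/count); the TOY_MNT_ISBASE value, struct toy_vfsmount and toy_vfsmnt_base() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MNT_ISBASE 0x1  /* hypothetical flag value */

struct toy_vfsmount {
	int mnt_flags;
	union {
		struct toy_vfsmount *base;  /* valid when !(flags & ISBASE) */
		int count;                  /* valid when flags & ISBASE */
	} mnt_group;
};

/* One pointer hop instead of a walk up the tree. */
static struct toy_vfsmount *toy_vfsmnt_base(struct toy_vfsmount *mnt)
{
	if (mnt->mnt_flags & TOY_MNT_ISBASE)
		return mnt;
	assert(mnt->mnt_group.base != NULL);
	return mnt->mnt_group.base;
}
```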

Because we now maintain the reference counts in two different places, they
have to be synchronized with any action that may invalidate the
vfsmount->mnt_group._base relationship. The following implementation uses
an rwlock called vfsmountref_lock along with an atomic_t group counter to
do so. The semantics of this lock are as follows:

- To change the reference count of a vfsmount, one must take a read lock on
vfsmountref_lock. This keeps the mnt_group.base pointer from changing
and ensures that the ->mnt_count values are consistent with the
vfsmnt_base(mnt)->mnt_group.count value.
- When a split or merge of the tree occurs, one must take a write lock
on vfsmountref_lock. This keeps others from changing the reference
counts and ensures nobody is walking the mnt_group.base pointers.
- Checking whether a given vfsmount is busy (->mnt_count == 1, implying
it is a leaf) does not require vfsmountref_lock. This check is only done
with vfsmount_lock held, which ensures nobody else can walk into the
vfsmount (current behaviour).
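The first two lock rules can be modelled in user space with a POSIX rwlock standing in for vfsmountref_lock and C11 atomics standing in for atomic_t. This is a hypothetical sketch (struct model_mnt, model_get and model_put are illustrative names, not kernel code) that mirrors the shape of mntget/_mntput in the mount.h hunk below: both counts change under the read lock, and any group teardown is left to the caller after the lock is dropped.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vfsmount (not kernel code). */
struct model_mnt {
    bool isbase;              /* models MNT_ISBASE */
    struct model_mnt *base;   /* meaningful only when !isbase */
    atomic_long mnt_count;    /* per-mount reference count */
    atomic_long group_count;  /* meaningful only when isbase */
};

/* Stand-in for vfsmountref_lock. */
static pthread_rwlock_t model_ref_lock = PTHREAD_RWLOCK_INITIALIZER;

static struct model_mnt *model_base(struct model_mnt *m)
{
    return m->isbase ? m : m->base;
}

/* Take a reference: readers may run concurrently, so counts are atomic. */
void model_get(struct model_mnt *m)
{
    pthread_rwlock_rdlock(&model_ref_lock);
    atomic_fetch_add(&m->mnt_count, 1);
    atomic_fetch_add(&model_base(m)->group_count, 1);
    pthread_rwlock_unlock(&model_ref_lock);
}

/*
 * Drop a reference. Returns the group base if this put brought the group
 * count to zero; the caller tears the group down after the lock is dropped.
 */
struct model_mnt *model_put(struct model_mnt *m)
{
    struct model_mnt *base, *cleanup = NULL;

    pthread_rwlock_rdlock(&model_ref_lock);
    base = model_base(m);
    if (atomic_fetch_sub(&base->group_count, 1) == 1)
        cleanup = base;
    atomic_fetch_sub(&m->mnt_count, 1);
    pthread_rwlock_unlock(&model_ref_lock);
    return cleanup;
}
```

Split and merge operations would take the same lock with pthread_rwlock_wrlock(), so no get or put can run while base pointers and group counts are being rewritten.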

Splitting a group of vfsmounts off a tree, and attaching one to a tree, are
still handled in detach_mnt and attach_mnt. These work as before, but now
also update the group counts.
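The count bookkeeping done by detach_mnt() and attach_mnt() in the patch below reduces to two formulas. This hypothetical sketch (model_split and model_merge are illustrative names) isolates that arithmetic from the locking and list manipulation:

```c
#include <assert.h>

/*
 * Hypothetical model of detach_mnt()'s arithmetic (not kernel code).
 * A subtree of subtree_mounts vfsmounts whose mnt_count values sum to
 * subtree_refs is cut from its parent group. Returns the new subtree group
 * count and adjusts *parent_group. The two stitching references between the
 * subtree root and its old parent become caller-owned references, one in
 * each group, which is why the parent group gives up basecount - 2.
 */
long model_split(long *parent_group, long subtree_refs, long subtree_mounts)
{
    long basecount = subtree_refs - (subtree_mounts - 1) * 2;

    *parent_group -= basecount - 2;
    return basecount;
}

/*
 * Hypothetical model of attach_mnt()'s arithmetic. parent_group is the
 * parent group's count before attach_mnt() runs; the +1 models its
 * mntget() on the new parent (the child->parent stitch), and the -2 removes
 * both stitching references from the user-visible count.
 */
long model_merge(long parent_group, long child_group)
{
    return (parent_group + 1) + child_group - 2;
}
```

Replaying the umount example in the next paragraph with this model: the namespace group starts at 2 (the namespace's reference plus the caller's lookup reference on the leaf). model_split(&parent, 2, 1) returns 2 for the detached leaf and leaves the parent at 2; the leaf's group starts at 2 rather than 1 here because the stitching reference handed back by detach_mnt() is still counted until umount_tree() drops it with mntput(). Dropping both caller-owned stitching references and the caller's lookup reference brings the parent to 1 and the leaf's group to 0.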

When the group count for a given sub-tree hits zero, we know that nobody holds
a reference to _any_ of the vfsmounts within the tree. When this happens, we
call __mntgroupput, which unstitches the pieces of the tree and frees them.
For example, assume a user unmounts a filesystem from their namespace. A check
is made to see whether the vfsmount in question has ->mnt_count == 2 (one for
the parent->child reference and one for the calling code's reference). If so,
the code proceeds to call detach_mnt (via umount_tree) on the vfsmount. This
call updates the base count of the namespace's group (which goes down by one
because the reference to the unmounted filesystem leaves it), and marks the
detached vfsmount with MNT_ISBASE and ->mnt_group.count == 1. When the umount
syscall then drops the locks and calls mntput on the given mountpoint, both
its mnt_count and its mnt_group.count reach 0. The latter triggers a call to
__mntgroupput, which walks the group (which contains only the one vfsmount)
and calls __mntput on it to release its resources, decrement sb->count and so
on.

One of the interesting side effects of this is that now when we do a lazy
umount of a tree of filesystems, we no longer need to break apart each
element of the tree. It also allows us to implement a new interface that
allows for having sub-trees of mountpoints that aren't associated with any
namespace. This would allow for process-specific sandboxes of filesystems
that are mounted yet not available in the process' namespace (implemented in
a later patch using mountpoint file descriptors).

One downside of this implementation is that we take a read lock on a rwlock
on _every_ mntget/mntput. I looked into using a brlock, but those were removed
from the kernel a while ago. I've thought about how to do the same kind of
synchronization using RCU, but haven't yet figured out how to apply it to this
kind of data structure. This patch hasn't been tested for scalability at all.

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 276 +++++++++++++++++++++++++++++++++++++++-----------
fs/super.c | 1
include/linux/mount.h | 46 +++++++-
3 files changed, 263 insertions(+), 60 deletions(-)

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:38.338635784 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:38.881553248 -0400
@@ -38,6 +38,15 @@ static inline int sysfs_init(void)
/* spinlock for vfsmount related operations, inplace of dcache_lock */
spinlock_t vfsmount_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;

+/*
+ * rwlock for mnt_group entries. A write lock indicates that a sub-tree is
+ * being attached/detached and all reference count changes must wait until this
+ * is done. As such, any vfsmount reference count changes must hold a read
+ * lock on vfsmountref_lock.
+ */
+rwlock_t vfsmountref_lock __cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;
+EXPORT_SYMBOL(vfsmountref_lock);
+
static struct list_head *mount_hashtable;
static int hash_mask, hash_bits;
static kmem_cache_t *mnt_cache;
@@ -62,7 +71,9 @@ struct vfsmount *alloc_vfsmnt(const char
struct vfsmount *mnt = kmem_cache_alloc(mnt_cache, GFP_KERNEL);
if (mnt) {
memset(mnt, 0, sizeof(struct vfsmount));
+ mnt->mnt_flags = MNT_ISBASE;
atomic_set(&mnt->mnt_count,1);
+ atomic_set(&mnt->mnt_group.count, 1);
INIT_LIST_HEAD(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -137,24 +148,141 @@ static struct vfsmount *next_mnt(struct
return list_entry(next, struct vfsmount, mnt_child);
}

+/*
+ * detach_mnt breaks apart mnt from its parent. Caller is expected to call this
+ * with a reference held on mnt, as well as an empty nameidata struct. Once mnt
+ * is detached, the caller is responsible for dropping the references in
+ * nameidata as well as a reference to mnt, on top of the caller-owned
+ * reference to mnt.
+ *
+ * Called with namespace->sem held and vfsmount_lock held (if appropriate)
+ */
static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd)
{
+ struct vfsmount *p;
+ long nmounts = 0, nrefs = 0;
+ long basecount;
+
+
+ /* save a copy of the old mountpoint for delayed release */
+ old_nd->mnt = mnt->mnt_parent;
old_nd->dentry = mnt->mnt_mountpoint;
- old_nd->mnt = mnt->mnt_parent;
- mnt->mnt_parent = mnt;
- mnt->mnt_mountpoint = mnt->mnt_root;
+
+ /* undo mountpoint stitching */
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
- old_nd->dentry->d_mounted--;
+ mnt->mnt_parent = mnt;
+ mnt->mnt_mountpoint->d_mounted--;
+
+ /* messing with mount groups requires a write lock */
+ write_lock(&vfsmountref_lock);
+
+ /* mnt is now the root of its own sub-tree */
+ mnt->mnt_flags |= MNT_ISBASE;
+
+ /* disassociate sub-tree from parent namespace */
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+ list_del_init(&p->mnt_list);
+ p->mnt_namespace = NULL;
+
+ /* point to mnt as the base of their sub-tree */
+ p->mnt_group.base = mnt;
+
+ /* count the number of mounts in the detached sub-tree */
+ nmounts++;
+
+ /* count the total number of refcounts in the sub-tree */
+ nrefs += atomic_read(&p->mnt_count);
+ }
+
+ /*
+ * calculate the number of group-refs in the base sub-tree.
+ *
+ * Lemma: within a tree, 2 references are used for stitching
+ * parent->child and child->parent for each child.
+ * Therefore, the total count of references used for stitching in a tree
+ * is the number of children times 2. nmounts here represents the total
+ * count of mountpoints, so nmounts - 1 is number of children within the
+ * tree.
+ * We subtract this value from nrefs to get the total number of refs not
+ * used for stitching.
+ */
+ basecount = nrefs - ((nmounts - 1) * 2);
+
+ /* set the new base counts in the child tree */
+ memset(&mnt->mnt_group.count, 0, sizeof(mnt->mnt_group.count));
+ atomic_set(&mnt->mnt_group.count, basecount);
+
+ /*
+ * - the parent tree gets +1 because old_nd->mnt is the caller's
+ * responsibility to free.
+ * - the parent tree gets another +1 because mnt is the caller's
+ * responsibility to free as well.
+ * These stem from the fact that we can't mntput the references
+ * previously used for stitching because we are under spinlocks.
+ */
+ atomic_sub(basecount - 2 , &vfsmnt_base(old_nd->mnt)->mnt_group.count);
+
+ write_unlock(&vfsmountref_lock);
}

-static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)
+/*
+ * attach_mnt attaches mountpoint mnt at the location given by nd.
+ * This should be called with a reference to mnt, as well as a reference to
+ * nd->mnt/dentry. The reference to mnt is consumed by this call, however the
+ * references in nd are _not_ consumed.
+ *
+ * called with namespace->sem and vfsmount_lock held if appropriate
+ */
+static void attach_mnt (struct vfsmount *mnt, struct nameidata *nd)
{
- mnt->mnt_parent = mntget(nd->mnt);
+ struct vfsmount *p;
+ struct vfsmount *base;
+ struct namespace *namespace = nd->mnt->mnt_namespace;
+ long basecount;
+
+ /* do the stitching */
+ nd->dentry->d_mounted++;
+ mnt->mnt_parent = mntget(nd->mnt);
mnt->mnt_mountpoint = dget(nd->dentry);
- list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
- nd->dentry->d_mounted++;
+ list_add(&mnt->mnt_hash, mount_hashtable+hash(nd->mnt, nd->dentry));
+
+ /* messing with mount groups requires a write lock */
+ write_lock(&vfsmountref_lock);
+
+ /* get a reference to the base of the subtree */
+ /* (protected by vfsmountref_lock) */
+ base = vfsmnt_base(nd->mnt);
+
+ /* mnt is no longer the base of its own subtree */
+ mnt->mnt_flags &= ~MNT_ISBASE;
+
+ /*
+ * calculate the new base count.
+ *
+ * -1 because mnt is consumed by stitching
+ * -1 because nd->mnt is consumed by stitching
+ */
+ basecount = atomic_read(&base->mnt_group.count)
+ + atomic_read(&mnt->mnt_group.count)
+ - 2;
+
+
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+
+ /* update the base pointers */
+ p->mnt_group.base = base;
+
+ /* chain in the namespace of the parenting subtree */
+ p->mnt_namespace = namespace;
+ if (namespace)
+ list_add_tail(&p->mnt_list, &namespace->list);
+ }
+
+ atomic_set(&base->mnt_group.count, basecount);
+
+ write_unlock(&vfsmountref_lock);
}

/* this expects the caller to hold vfsmount_lock */
@@ -194,12 +322,13 @@ clone_mnt(struct vfsmount *old, struct d
} else {
mnt->mnt_flags = old->mnt_flags & ~MNT_CHILDEXPIRE;
}
+ mnt->mnt_flags |= MNT_ISBASE;
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
mnt->mnt_root = dget(root);
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
- mnt->mnt_namespace = old->mnt_namespace;
+ mnt->mnt_namespace = NULL;
}
return mnt;
}
@@ -212,7 +341,49 @@ void __mntput(struct vfsmount *mnt)
deactivate_super(sb);
}

-EXPORT_SYMBOL(__mntput);
+void __mntgroupput(struct vfsmount *mnt)
+{
+ struct vfsmount *p;
+ LIST_HEAD(kill);
+
+
+ /*
+ * We don't need to grab vfsmount_lock here because we are assured
+ * that no-one holds a reference to us.
+ * The kill list ordering depends on next_mnt doing a pre-order
+ * traversal.
+ */
+ for (p = mnt; p; p = next_mnt(p, mnt)) {
+ p->mnt_namespace = NULL;
+ list_del(&p->mnt_list);
+ list_add(&p->mnt_list, &kill);
+ }
+
+ while (!list_empty(&kill)) {
+ struct nameidata old_nd;
+
+
+ mnt = list_entry(kill.next, struct vfsmount, mnt_list);
+ WARN_ON(!list_empty(&mnt->mnt_mounts));
+
+ if (mnt == mnt->mnt_parent) {
+ WARN_ON(atomic_read(&mnt->mnt_count) != 0);
+ list_del_init(&mnt->mnt_list);
+ __mntput(mnt);
+ } else {
+ /* detach_mnt will clear mnt's mnt_list, however, we are
+ * working on a depth_first pass, so this is okay */
+ detach_mnt(mnt, &old_nd);
+ /* we don't mntput because that would cause recursion */
+ atomic_dec(&old_nd.mnt->mnt_count);
+ dput(old_nd.dentry);
+ atomic_dec(&mnt->mnt_count);
+ WARN_ON(atomic_read(&mnt->mnt_count) != 0);
+ __mntput(mnt);
+ }
+ }
+}
+EXPORT_SYMBOL(__mntgroupput);

/* iterator */
static void *m_start(struct seq_file *m, loff_t *pos)
@@ -417,31 +588,18 @@ static void clear_expire(struct vfsmount
}
}

+/* called with vfsmount_lock held */
static void umount_tree(struct vfsmount *mnt)
{
- struct vfsmount *p;
- LIST_HEAD(kill);
-
- for (p = mnt; p; p = next_mnt(p, mnt)) {
- list_del(&p->mnt_list);
- list_add(&p->mnt_list, &kill);
- }
-
- while (!list_empty(&kill)) {
- mnt = list_entry(kill.next, struct vfsmount, mnt_list);
- list_del_init(&mnt->mnt_list);
- list_del_init(&mnt->mnt_expire);
- if (mnt->mnt_parent == mnt) {
- spin_unlock(&vfsmount_lock);
- } else {
- struct nameidata old_nd;
- detach_mnt(mnt, &old_nd);
- spin_unlock(&vfsmount_lock);
- path_release_on_umount(&old_nd);
- }
- _mntput(mnt);
- spin_lock(&vfsmount_lock);
- }
+ struct nameidata old_nd;
+
+ list_del_init(&mnt->mnt_expire);
+ clear_childexpire(mnt);
+ detach_mnt(mnt, &old_nd);
+ spin_unlock(&vfsmount_lock);
+ path_release(&old_nd);
+ mntput(mnt);
+ spin_lock(&vfsmount_lock);
}

static int do_umount(struct vfsmount *mnt, int flags)
@@ -506,14 +664,16 @@ static int do_umount(struct vfsmount *mn
security_sb_umount_close(mnt);
spin_lock(&vfsmount_lock);
}
+ retval = -EEXIST;
+ if (flags & MNT_DETACH && mnt->mnt_parent == mnt)
+ goto out;
retval = -EBUSY;
if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
- if (!list_empty(&mnt->mnt_list)) {
- clear_expire(mnt);
- umount_tree(mnt);
- }
+ clear_expire(mnt);
+ umount_tree(mnt);
retval = 0;
}
+out:
spin_unlock(&vfsmount_lock);
if (retval)
security_sb_umount_busy(mnt);
@@ -662,14 +822,23 @@ static int graft_tree(struct vfsmount *m
if (err)
goto out_unlock;

- err = -ENOENT;
spin_lock(&vfsmount_lock);
+ err = -EEXIST;
+ if (mnt->mnt_parent != mnt) {
+ spin_unlock(&vfsmount_lock);
+ goto out_unlock;
+ }
+ read_lock(&vfsmountref_lock);
+ err = -ELOOP;
+ if (vfsmnt_base(nd->mnt) == mnt) {
+ read_unlock(&vfsmountref_lock);
+ spin_unlock(&vfsmount_lock);
+ goto out_unlock;
+ }
+ read_unlock(&vfsmountref_lock);
+ err = -ENOENT;
if (IS_ROOT(nd->dentry) || !d_unhashed(nd->dentry)) {
- struct list_head head;
-
attach_mnt(mnt, nd);
- list_add_tail(&head, &mnt->mnt_list);
- list_splice(&head, current->namespace->list.prev);
mntget(mnt);
err = 0;
}
@@ -753,7 +922,7 @@ static int do_remount(struct nameidata *
err = do_remount_sb(sb, flags, data, 0);
if (!err)
nd->mnt->mnt_flags = mnt_flags |
- (nd->mnt->mnt_flags & MNT_CHILDEXPIRE);
+ (nd->mnt->mnt_flags & (MNT_CHILDEXPIRE | MNT_ISBASE));
up_write(&sb->s_umount);
if (!err)
security_sb_post_remount(nd->mnt, flags, data);
@@ -843,7 +1012,7 @@ static int do_add_mount(struct nameidata
if (IS_ERR(mnt))
return PTR_ERR(mnt);

- mnt->mnt_flags = mnt_flags;
+ mnt->mnt_flags = mnt_flags | MNT_ISBASE;
return do_graft_mount(mnt, nd);
}

@@ -874,8 +1043,6 @@ int do_graft_mount(struct vfsmount *newm
goto unlock;

err = graft_tree(newmnt, nd);
-
-
unlock:
up_write(&current->namespace->sem);
mntput(newmnt);
@@ -1213,9 +1380,9 @@ int copy_namespace(int flags, struct tas
kfree(new_ns);
goto out;
}
- spin_lock(&vfsmount_lock);
list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
- spin_unlock(&vfsmount_lock);
+ list_for_each_entry(p, &new_ns->list, mnt_list)
+ p->mnt_namespace = new_ns;

/*
* Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -1454,6 +1621,8 @@ asmlinkage long sys_pivot_root(const cha
goto out3;
} else if (!is_subdir(old_nd.dentry, new_nd.dentry))
goto out3;
+ mntget(user_nd.mnt);
+ mntget(new_nd.mnt);
detach_mnt(new_nd.mnt, &parent_nd);
detach_mnt(user_nd.mnt, &root_parent);
attach_mnt(user_nd.mnt, &old_nd);
@@ -1564,21 +1733,16 @@ void __init mnt_init(unsigned long mempa

void __put_namespace(struct namespace *namespace)
{
- struct vfsmount *mnt;
-
down(&expiry_sem);
down_write(&namespace->sem);
spin_lock(&vfsmount_lock);

- list_for_each_entry(mnt, &namespace->list, mnt_list) {
- mnt->mnt_namespace = NULL;
- mnt->mnt_flags &= ~MNT_CHILDEXPIRE;
- list_del_init(&mnt->mnt_expire);
- }
+ clear_expire(namespace->root);

- umount_tree(namespace->root);
spin_unlock(&vfsmount_lock);
up_write(&namespace->sem);
up(&expiry_sem);
+
+ mntput(namespace->root);
kfree(namespace);
}
Index: linux-2.6.9-quilt/include/linux/mount.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/mount.h 2004-10-22 17:17:35.927002408 -0400
+++ linux-2.6.9-quilt/include/linux/mount.h 2004-10-22 17:17:38.881553248 -0400
@@ -18,6 +18,7 @@
#define MNT_NODEV 2
#define MNT_NOEXEC 4
#define MNT_CHILDEXPIRE 8
+#define MNT_ISBASE 16

struct vfsmount
{
@@ -29,6 +30,10 @@ struct vfsmount
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
atomic_t mnt_count;
+ union {
+ struct vfsmount *base; /* pointer to root of vfsmount tree */
+ atomic_t count; /* user ref count on this tree */
+ } mnt_group;
int mnt_flags;
int mnt_active; /* flag set on mntput() */
int mnt_expiry_countdown; /* countdown of ticks until expiry */
@@ -39,10 +44,39 @@ struct vfsmount
struct namespace *mnt_namespace; /* containing namespace */
};

+extern rwlock_t vfsmountref_lock;
+
+static inline struct vfsmount *vfsmnt_base(struct vfsmount *mnt) {
+ if (mnt->mnt_flags & MNT_ISBASE) {
+ return mnt;
+ }
+ return mnt->mnt_group.base;
+}
+
+static inline void mntgroupget(struct vfsmount *mnt)
+{
+ atomic_inc(&vfsmnt_base(mnt)->mnt_group.count);
+}
+
+extern void __mntgroupput(struct vfsmount *mnt);
+static inline struct vfsmount *mntgroupput(struct vfsmount *mnt)
+{
+ struct vfsmount *base;
+
+ base = vfsmnt_base(mnt);
+ if (atomic_dec_and_test(&base->mnt_group.count))
+ return base;
+ return NULL;
+}
+
static inline struct vfsmount *mntget(struct vfsmount *mnt)
{
- if (mnt)
+ if (mnt) {
+ read_lock(&vfsmountref_lock);
atomic_inc(&mnt->mnt_count);
+ mntgroupget(mnt);
+ read_unlock(&vfsmountref_lock);
+ }
return mnt;
}

@@ -50,9 +84,15 @@ extern void __mntput(struct vfsmount *mn

static inline void _mntput(struct vfsmount *mnt)
{
+ struct vfsmount *cleanup;
+ might_sleep();
if (mnt) {
- if (atomic_dec_and_test(&mnt->mnt_count))
- __mntput(mnt);
+ read_lock(&vfsmountref_lock);
+ cleanup = mntgroupput(mnt);
+ atomic_dec(&mnt->mnt_count);
+ read_unlock(&vfsmountref_lock);
+ if (cleanup)
+ __mntgroupput(cleanup);
}
}

Index: linux-2.6.9-quilt/fs/super.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/super.c 2004-08-14 01:36:57.000000000 -0400
+++ linux-2.6.9-quilt/fs/super.c 2004-10-22 17:17:38.882553096 -0400
@@ -793,7 +793,6 @@ do_kern_mount(const char *fstype, int fl
mnt->mnt_root = dget(sb->s_root);
mnt->mnt_mountpoint = sb->s_root;
mnt->mnt_parent = mnt;
- mnt->mnt_namespace = current->namespace;
up_write(&sb->s_umount);
put_filesystem(type);
return mnt;

2004-10-25 15:00:13

by Mike Waychison

Subject: [PATCH 9/28] VFS: Give sane expiry semantics

This patch changes the semantics of expiry slightly for operations such as
--move, --bind and --rbind. The newly added file
Documentation/filesystems/expire_semantics.txt describes how it works.

Signed-off-by: Mike Waychison <[email protected]>
---

Documentation/filesystems/expire_semantics.txt | 184 +++++++++++++++++++++++++
fs/namespace.c | 26 ++-
2 files changed, 202 insertions(+), 8 deletions(-)

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:37.121820768 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:37.770722120 -0400
@@ -178,13 +178,22 @@ static int can_expire(struct vfsmount *r
}

static struct vfsmount *
-clone_mnt(struct vfsmount *old, struct dentry *root)
+clone_mnt(struct vfsmount *old, struct dentry *root, int keep_expiry)
{
struct super_block *sb = old->mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);

if (mnt) {
- mnt->mnt_flags = old->mnt_flags & ~MNT_CHILDEXPIRE;
+ if (keep_expiry) {
+ mnt->mnt_flags = old->mnt_flags;
+ if (!list_empty(&old->mnt_expire)) {
+ spin_lock(&vfsmount_lock);
+ list_add_tail(&mnt->mnt_expire, &expiry_list);
+ spin_unlock(&vfsmount_lock);
+ }
+ } else {
+ mnt->mnt_flags = old->mnt_flags & ~MNT_CHILDEXPIRE;
+ }
atomic_inc(&sb->s_active);
mnt->mnt_sb = sb;
mnt->mnt_root = dget(root);
@@ -589,13 +598,14 @@ lives_below_in_same_fs(struct dentry *d,
}
}

-static struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry)
+static struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
+ int keep_expiry)
{
struct vfsmount *res, *p, *q, *r, *s;
struct list_head *h;
struct nameidata nd;

- res = q = clone_mnt(mnt, dentry);
+ res = q = clone_mnt(mnt, dentry, keep_expiry);
if (!q)
goto Enomem;
q->mnt_mountpoint = mnt->mnt_mountpoint;
@@ -614,7 +624,7 @@ static struct vfsmount *copy_tree(struct
p = s;
nd.mnt = q;
nd.dentry = p->mnt_mountpoint;
- q = clone_mnt(p, p->mnt_root);
+ q = clone_mnt(p, p->mnt_root, keep_expiry);
if (!q)
goto Enomem;
spin_lock(&vfsmount_lock);
@@ -692,9 +702,9 @@ static int do_loopback(struct nameidata
if (check_mnt(nd->mnt) && (!recurse || check_mnt(old_nd.mnt))) {
err = -ENOMEM;
if (recurse)
- mnt = copy_tree(old_nd.mnt, old_nd.dentry);
+ mnt = copy_tree(old_nd.mnt, old_nd.dentry, 0);
else
- mnt = clone_mnt(old_nd.mnt, old_nd.dentry);
+ mnt = clone_mnt(old_nd.mnt, old_nd.dentry, 0);
}

if (mnt) {
@@ -1197,7 +1207,7 @@ int copy_namespace(int flags, struct tas

down_write(&tsk->namespace->sem);
/* First pass: copy the tree topology */
- new_ns->root = copy_tree(namespace->root, namespace->root->mnt_root);
+ new_ns->root = copy_tree(namespace->root, namespace->root->mnt_root, 1);
if (!new_ns->root) {
up_write(&tsk->namespace->sem);
kfree(new_ns);
Index: linux-2.6.9-quilt/Documentation/filesystems/expire_semantics.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.9-quilt/Documentation/filesystems/expire_semantics.txt 2004-10-22 17:17:37.775721360 -0400
@@ -0,0 +1,184 @@
+Mountpoint Expiry Semantics
+===========================
+
+In order to implement an autofs system as described in 'Towards a Modern Autofs'
+(ftp://ftp-eng.cobalt.com/pub/whitepapers/autofs/towards_a_modern_autofs.txt), a
+VFS-native expiry system needs to be designed and implemented. This document
+summarizes the expiry requirements of the autofs design and lays out a set of
+expiry semantics.
+
+Requirements
+------------
+
+The following is the set of operations we need to be able to perform with the
+expiry system.
+
+ - We need to be able to arbitrarily specify timeouts for all expiries.
+ - The ability to set an arbitrary timeout at mount-time.
+ - The ability to change a timeout (via a remount)
+ - The ability to remount a given filesystem without affecting the expiry
+ settings.
+ - The ability to disable an expiry of a mountpoint.
+ - We need to be able to expire a single mountpoint.
+ - We need to be able to expire a tree of mountpoints in an atomic fashion.
+ This is required to support lazy-mounting as this is done by nesting direct
+ mount triggers within expiring NFS filesystems.
+ - The ability to mark a sub-tree as expiring all child mountpoints in
+ unison.
+ - The ability to remount individual mountpoints within a sub-tree (children
+ as well as the root mountpoint of the sub-tree) without affecting the
+ expiry settings for that sub-tree.
+ - The ability to disable expiry of a sub-tree.
+ - The semantic that any mountpoint added to a sub-tree after the sub-tree
+ has been marked to expire will block the expiry of the sub-tree.
+ - We need to be able to read expiry settings.
+ - The ability to read expiry settings for an individual mountpoint expiry.
+ - The ability to read expiry settings for a sub-tree (by looking at the root
+ mountpoint of the sub-tree).
+ - The ability to identify whether a given mountpoint is part of a sub-tree
+ expiry.
+
+Ticks
+-----
+
+Ticks are meant as a way to specify the amount of time of inactivity that should
+occur before mountpoints expire. Ticks will by default be 10 seconds long, and
+will be configurable via a sysctl interface. The sysctl interface will be
+represented in terms of USER_HZ. The length of a tick is left configurable as
+calculating expiry may become costly on larger systems, where less
+frequent expiry checks may be more desirable.
+
+Policies
+--------
+
+In order to fulfill the requirements outlined above, we define a new attribute
+to mountpoints called the expiry policy. Each policy may define a set of
+policy-related attributes.
+
+
+ 'NOEXPIRE' - This is the default expiry policy for any given mountpoint.
+ It signifies that this mountpoint will neither expire nor is it part of a
+ sub-tree expiry.
+
+ This policy has no attributes.
+
+ 'EXPIRE' - This policy means that the given mountpoint is set to
+ eventually expire. It may be an individual mountpoint expiry, or may be
+ the root of a sub-tree that will expire.
+
+ Attribute 'ticks' - This is the number of ticks of inactivity that must
+ pass before the mountpoint will expire. This value is a positive
+ integer. For more information on 'ticks', see 'Ticks' above.
+
+ 'SUBTREEEXPIRE' - This policy specifies that the mountpoint in question is
+ part of a larger sub-tree expiry.
+
+
+Policy transitions
+------------------
+
+Certain rules need to apply when the policy for a mountpoint is changed.
+Following is a description of what occurs when a policy has changed for all
+possible transitions:
+
+NOEXPIRE -> EXPIRE
+ A mountpoint that was previously not marked for expiry is now marked for
+ expiry. If the attribute value for 'ticks' is not a positive integer, then
+ this transition fails. This transition also fails automatically for the
+ true-root of a namespace.
+
+NOEXPIRE -> SUBTREEEXPIRE
+ A mountpoint that was not previously set to expire will now expire as part of
+ its parent's expiry. This transition fails if the policy of the immediate
+ parent mountpoint of the mountpoint in question is neither EXPIRE nor
+ SUBTREEEXPIRE.
+
+EXPIRE -> NOEXPIRE
+ A mountpoint that was set to expire will no longer expire. Any immediate
+ child mountpoints whose policy is SUBTREEEXPIRE transitions to NOEXPIRE, as
+ described below.
+
+EXPIRE -> SUBTREEEXPIRE
+ This is an invalid policy transition. Instead, a transition from EXPIRE to
+ NOEXPIRE followed by a transition from NOEXPIRE to SUBTREEEXPIRE should be
+ performed.
+
+SUBTREEEXPIRE -> NOEXPIRE
+ A mountpoint that was marked as being part of a sub-tree expiry no longer is
+ part of that expiry. Any immediate child mountpoints of the mountpoint in
+ question that also have the SUBTREEEXPIRE policy will recursively receive an
+ implied SUBTREEEXPIRE -> NOEXPIRE transition.
+
+SUBTREEEXPIRE -> EXPIRE
+ This is an invalid policy transition. Instead, a transition from
+ SUBTREEEXPIRE to NOEXPIRE followed by a transition from NOEXPIRE to EXPIRE
+ should be performed.
+
+Note: Transitions between SUBTREEEXPIRE and EXPIRE are invalid in order to
+simplify the corner cases of a failed transition. Transitioning to
+SUBTREEEXPIRE is the only path where things can actually fail (eg: the parent
+mountpoint is NOEXPIRE), and transitioning from EXPIRE may alter the child
+mountpoints of the mountpoint in question. This implied loss of information is
+more explicit when you must transition through NOEXPIRE as an intermediate step.
+
+Mount Operations
+----------------
+
+The following is a list of mount operations that may occur, along with a
+description of the policy changes they imply.
+
+
+Mount Move:
+
+ A mountpoint is moved from one location to another. If the policy was:
+
+ NOEXPIRE - The policy remains NOEXPIRE
+ EXPIRE - The policy remains EXPIRE
+ SUBTREEEXPIRE - The policy for the given mountpoint transitions to NOEXPIRE
+
+Mount Bind:
+
+ There are several variants on mount bind operations:
+
+ - Single bind from a mountpoint to another location. Only one filesystem
+ is actually available in the new location after the bind.
+ - Single bind of a directory (not a mountpoint) to another location. This
+ is similar to the previous scenario however the root directory of the
+ new mountpoint is a sub-directory of the original mountpoint.
+ - A recursive bind from a mountpoint to another location.
+ - A recursive bind from a directory that isn't a mountpoint to another
+ location.
+
+ In all cases, all newly created mountpoints are created with the NOEXPIRE
+ policy. This allows a userspace application to arbitrarily determine what it
+ wishes to do with the expiry policy on the bind.
+
+Remount:
+
+ Filesystem remounts should not affect the expiry policy. That is, the
+ policy and policy attributes should remain the same across a mountpoint
+ remount unless the policy itself was changed.
+
+Namespace Creation:
+
+ When a namespace is created, it is derived from the parenting namespace. A
+ namespace begins its life as a complete clone and as such, any mountpoints in
+ that namespace should inherit the expiry policies set at the time of namespace
+ creation.
+
+Detaching Mountpoints (umount -l):
+
+ Lazy unmounting is a special case, as the mountpoints in question are still
+ accessible but not navigable from the root directory. Currently, when a
+ lazy unmount occurs on a sub-tree, all elements of that sub-tree are torn
+ apart and their resources freed once all references into that given mountpoint
+ are released. In this case, all mountpoints should transition their expiry
+ policy to NOEXPIRE.
+
+ In a future case where a lazy unmount of a sub-tree does not result in the
+ sub-tree being torn apart (so that it may possibly be reattached elsewhere),
+ the policy of this sub-tree's root should transition as follows:
+
+ - EXPIRE -> EXPIRE
+ - NOEXPIRE -> NOEXPIRE
+ - SUBTREEEXPIRE -> NOEXPIRE

2004-10-25 15:00:10

by Mike Waychison

Subject: [PATCH 10/28] VFS: Move next_mnt()

This patch simply moves next_mnt in preparation for the next patch that
implements detachable subtrees.

Signed-off-by: Mike Waychison <[email protected]>
---

namespace.c | 32 ++++++++++++++++----------------
1 files changed, 16 insertions(+), 16 deletions(-)

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:37.770722120 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:38.338635784 -0400
@@ -121,6 +121,22 @@ static inline int check_mnt(struct vfsmo
return mnt->mnt_namespace == current->namespace;
}

+static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
+{
+ struct list_head *next = p->mnt_mounts.next;
+ if (next == &p->mnt_mounts) {
+ while (1) {
+ if (p == root)
+ return NULL;
+ next = p->mnt_child.next;
+ if (next != &p->mnt_parent->mnt_mounts)
+ break;
+ p = p->mnt_parent;
+ }
+ }
+ return list_entry(next, struct vfsmount, mnt_child);
+}
+
static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd)
{
old_nd->dentry = mnt->mnt_mountpoint;
@@ -141,22 +157,6 @@ static void attach_mnt(struct vfsmount *
nd->dentry->d_mounted++;
}

-static struct vfsmount *next_mnt(struct vfsmount *p, struct vfsmount *root)
-{
- struct list_head *next = p->mnt_mounts.next;
- if (next == &p->mnt_mounts) {
- while (1) {
- if (p == root)
- return NULL;
- next = p->mnt_child.next;
- if (next != &p->mnt_parent->mnt_mounts)
- break;
- p = p->mnt_parent;
- }
- }
- return list_entry(next, struct vfsmount, mnt_child);
-}
-
/* this expects the caller to hold vfsmount_lock */
static int can_expire(struct vfsmount *root, int offset)
{
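next_mnt() returns the pre-order successor of p within the subtree rooted at root (the __mntgroupput comment in [PATCH 11/28] relies on this ordering). A hypothetical user-space equivalent, using explicit first-child and next-sibling pointers instead of the kernel's list_head-based mnt_mounts/mnt_child lists:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for struct vfsmount (not kernel code). The parent of
 * the top-level node points to itself, as mnt_parent does for a root mount.
 */
struct node {
    int id;
    struct node *parent;
    struct node *first_child;
    struct node *next_sibling;
};

/* Pre-order successor of p within the subtree rooted at root, or NULL. */
struct node *next_node(struct node *p, struct node *root)
{
    if (p->first_child)
        return p->first_child;          /* descend first */
    while (p != root) {
        if (p->next_sibling)
            return p->next_sibling;     /* then move across */
        p = p->parent;                  /* climb until a sibling exists */
    }
    return NULL;                        /* back at root: walk is complete */
}
```

For a root a with children b and c, where b has a child d, iterating `for (p = &a; p; p = next_node(p, &a))` visits a, b, d, c; passing a subtree root such as b confines the walk to that subtree.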

2004-10-25 14:55:01

by Mike Waychison

Subject: [PATCH 3/28] VFS: Move expiry into vfs

This patch moves the recently added expiry functionality directly into the
VFS layer. Doing this gives us a couple of advantages:

- Allows for configurable timeouts using a single consolidated timer
- Keeps filesystems from each having to implement their own expiry logic
- Provides a generic interface that can be used for _any_ filesystem, as
desired by user applications and/or the system administrator.

This patch implements expiry by having the VFS recursively register its
expiry work. Expiry checks run once per second (EXPIRE_PERIOD is HZ), so
timeouts are configurable with one-second granularity.
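To illustrate the one-second granularity, here is a hypothetical simulation of a work item that re-queues itself every EXPIRE_PERIOD. It is not the body of do_expiry_run(), which is not shown in this excerpt, and it assumes HZ is 1000 and that the mount becomes idle exactly when a run occurs:

```c
#include <assert.h>

#define MODEL_HZ 1000UL                 /* assumption: 1000 jiffies per second */
#define MODEL_EXPIRE_PERIOD (MODEL_HZ)  /* mirrors EXPIRE_PERIOD (HZ) */

/*
 * Hypothetical simulation (not kernel code). A mount idle since jiffy 0 with
 * an idle timeout of timeout_jiffies is expired by the first periodic run at
 * which it has been idle at least that long. Returns the jiffy of that run.
 */
unsigned long model_expiry_time(unsigned long timeout_jiffies)
{
    unsigned long now = 0;

    for (;;) {
        now += MODEL_EXPIRE_PERIOD;     /* next scheduled run */
        if (now >= timeout_jiffies)     /* idle long enough: expire it */
            return now;
        /* otherwise this run re-queues itself for the next period */
    }
}
```

Under these assumptions a 2.5 second timeout (2500 jiffies) is acted on at the 3 second run, so effective timeouts round up to the next whole period.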

Signed-off-by: Mike Waychison <[email protected]>
---

fs/namespace.c | 75 ++++++++++++++++++++++++++++++++++++--------------
include/linux/mount.h | 6 +---
2 files changed, 57 insertions(+), 24 deletions(-)

Index: linux-2.6.9-quilt/include/linux/mount.h
===================================================================
--- linux-2.6.9-quilt.orig/include/linux/mount.h 2004-10-22 17:17:33.460377392 -0400
+++ linux-2.6.9-quilt/include/linux/mount.h 2004-10-22 17:17:34.147272968 -0400
@@ -68,10 +68,8 @@ extern struct vfsmount *do_kern_mount(co

struct nameidata;

-extern int do_add_mount(struct vfsmount *newmnt, struct nameidata *nd,
- int mnt_flags, struct list_head *fslist);
-
-extern void mark_mounts_for_expiry(struct list_head *mounts);
+extern int do_graft_mount(struct vfsmount *newmnt, struct nameidata *nd);
+extern void mnt_expire(struct vfsmount *mnt, unsigned expire);

extern spinlock_t vfsmount_lock;

Index: linux-2.6.9-quilt/fs/namespace.c
===================================================================
--- linux-2.6.9-quilt.orig/fs/namespace.c 2004-10-22 17:17:33.461377240 -0400
+++ linux-2.6.9-quilt/fs/namespace.c 2004-10-22 17:17:34.148272816 -0400
@@ -42,6 +42,13 @@ static struct list_head *mount_hashtable
static int hash_mask, hash_bits;
static kmem_cache_t *mnt_cache;

+/* manage mountpoint expiry */
+static LIST_HEAD(expiry_list);
+static void do_expiry_run(void *nothing);
+static DECLARE_WORK(expiry_work, do_expiry_run, NULL);
+static DECLARE_MUTEX(expiry_sem);
+#define EXPIRE_PERIOD (HZ)
+
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
unsigned long tmp = ((unsigned long) mnt / L1_CACHE_BYTES);
@@ -431,6 +438,7 @@ static int do_umount(struct vfsmount *mn
return retval;
}

+ down(&expiry_sem);
down_write(&current->namespace->sem);
spin_lock(&vfsmount_lock);

@@ -446,14 +454,17 @@ static int do_umount(struct vfsmount *mn
}
retval = -EBUSY;
if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
- if (!list_empty(&mnt->mnt_list))
+ if (!list_empty(&mnt->mnt_list)) {
+ list_del_init(&mnt->mnt_expire);
umount_tree(mnt);
+ }
retval = 0;
}
spin_unlock(&vfsmount_lock);
if (retval)
security_sb_umount_busy(mnt);
up_write(&current->namespace->sem);
+ up(&expiry_sem);
return retval;
}

@@ -760,7 +771,7 @@ out:
* create a new mount for userspace and request it to be added into the
* namespace's tree
*/
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static int do_add_mount(struct nameidata *nd, char *type, int flags,
int mnt_flags, char *name, void *data)
{
struct vfsmount *mnt;
@@ -776,15 +787,15 @@ static int do_new_mount(struct nameidata
if (IS_ERR(mnt))
return PTR_ERR(mnt);

- return do_add_mount(mnt, nd, mnt_flags, NULL);
+ mnt->mnt_flags = mnt_flags;
+ return do_graft_mount(mnt, nd);
}

/*
* add a mount into a namespace's mount tree
* - provide the option of adding the new mount to an expiration list
*/
-int do_add_mount(struct vfsmount *newmnt, struct nameidata *nd,
- int mnt_flags, struct list_head *fslist)
+int do_graft_mount(struct vfsmount *newmnt, struct nameidata *nd)
{
int err;

@@ -806,38 +817,52 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

- newmnt->mnt_flags = mnt_flags;
err = graft_tree(newmnt, nd);

- if (err == 0 && fslist) {
- /* add to the specified expiration list */
- spin_lock(&vfsmount_lock);
- list_add_tail(&newmnt->mnt_expire, fslist);
- spin_unlock(&vfsmount_lock);
- }

unlock:
up_write(&current->namespace->sem);
mntput(newmnt);
return err;
}
+EXPORT_SYMBOL_GPL(do_graft_mount);
+
+void mnt_expire(struct vfsmount *mnt, unsigned expire)
+{
+ down(&expiry_sem);
+ spin_lock(&vfsmount_lock);

-EXPORT_SYMBOL_GPL(do_add_mount);
+ /* Expiry is not permitted on mounts that are not associated with a
+ * namespace. This is due to the fact that we cannot reliably handle
+ * removing the mount from the expiry list when the mount is no longer
+ * referenced */
+ if (!mnt->mnt_namespace)
+ goto out;

+ list_del_init(&mnt->mnt_expire);
+ if (expire > 0)
+ list_add_tail(&mnt->mnt_expire, &expiry_list);
+out:
+ spin_unlock(&vfsmount_lock);
+ up(&expiry_sem);
+}
+EXPORT_SYMBOL_GPL(mnt_expire);
+
/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren't in use and haven't been touched since last we came
* here
*/
-void mark_mounts_for_expiry(struct list_head *mounts)
+static void do_expiry_run(void *nothing)
{
struct namespace *namespace;
struct vfsmount *mnt, *next;
LIST_HEAD(graveyard);

- if (list_empty(mounts))
+ if (list_empty(&expiry_list))
return;

+ down(&expiry_sem);
spin_lock(&vfsmount_lock);

/* extract from the expiration list every vfsmount that matches the
@@ -846,7 +871,7 @@ void mark_mounts_for_expiry(struct list_
* - still marked for expiry (marked on the last call here; marks are
* cleared by mntput())
*/
- list_for_each_entry_safe(mnt, next, mounts, mnt_expire) {
+ list_for_each_entry_safe(mnt, next, &expiry_list, mnt_expire) {
if (!xchg(&mnt->mnt_expiry_mark, 1) ||
atomic_read(&mnt->mnt_count) != 1)
continue;
@@ -913,22 +938,29 @@ void mark_mounts_for_expiry(struct list_
/* someone brought it back to life whilst we didn't
* have any locks held so return it to the expiration
* list */
- list_add_tail(&mnt->mnt_expire, mounts);
+ list_add_tail(&mnt->mnt_expire, &expiry_list);
spin_unlock(&vfsmount_lock);
}

up_write(&namespace->sem);

- mntput(mnt);
+ _mntput(mnt);
put_namespace(namespace);

spin_lock(&vfsmount_lock);
}

spin_unlock(&vfsmount_lock);
+ up(&expiry_sem);
+ schedule_delayed_work(&expiry_work, EXPIRE_PERIOD);
}

-EXPORT_SYMBOL_GPL(mark_mounts_for_expiry);
+static __init int start_expiry_work(void)
+{
+ schedule_delayed_work(&expiry_work, EXPIRE_PERIOD);
+ return 1;
+}
+late_initcall(start_expiry_work);

int copy_mount_options (const void __user *data, unsigned long *where)
{
@@ -1024,7 +1056,7 @@ long do_mount(char * dev_name, char * di
else if (flags & MS_MOVE)
retval = do_move_mount(&nd, dev_name);
else
- retval = do_new_mount(&nd, type_page, flags, mnt_flags,
+ retval = do_add_mount(&nd, type_page, flags, mnt_flags,
dev_name, data_page);
dput_out:
path_release(&nd);
@@ -1421,15 +1453,18 @@ void __put_namespace(struct namespace *n
{
struct vfsmount *mnt;

+ down(&expiry_sem);
down_write(&namespace->sem);
spin_lock(&vfsmount_lock);

list_for_each_entry(mnt, &namespace->list, mnt_list) {
mnt->mnt_namespace = NULL;
+ list_del_init(&mnt->mnt_expire);
}

umount_tree(namespace->root);
spin_unlock(&vfsmount_lock);
up_write(&namespace->sem);
+ up(&expiry_sem);
kfree(namespace);
}

2004-10-26 10:27:12

by Christoph Hellwig

Subject: Re: [PATCH 4/28] VFS: Stat shouldn't stop expire

On Mon, Oct 25, 2004 at 10:40:30AM -0400, Mike Waychison wrote:
> This patch fixes the problem where a mountpoint that is due to expire fails
> to expire if somebody keeps stat(2)ing the root of its
> filesystem. For example, consider the case where a user has his home
> directory automounted on /home/mikew. Some other user can keep the
> filesystem mounted forever by simply calling ls(1) in /home, because the stat
> action resets the marker on each call.
>
> Signed-off-by: Mike Waychison <[email protected]>
> ---
>
> namei.c | 11 ++++++++++-
> 1 files changed, 10 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.9-quilt/fs/namei.c
> ===================================================================
> --- linux-2.6.9-quilt.orig/fs/namei.c 2004-08-14 01:36:45.000000000 -0400
> +++ linux-2.6.9-quilt/fs/namei.c 2004-10-22 17:17:34.762179488 -0400
> @@ -275,7 +275,16 @@ int deny_write_access(struct file * file
> void path_release(struct nameidata *nd)
> {
> dput(nd->dentry);
> - mntput(nd->mnt);
> + /*
> + * In order to ensure that access to an automounted filesystems'
> + * root does not reset it's expire counter, we check to see if the path
> + * being released here is a mountpoint itself. If it is, then we call
> + * _mntput which leaves the expire counter alone.
> + */
> + if (nd->mnt && nd->mnt->mnt_root == nd->dentry)
> + _mntput(nd->mnt);
> + else
> + mntput(nd->mnt);

Why only for the root dentry and not for any dentry on stat()? This seems highly inconsistent.
Also while you're at it please give _mntput a more sensible name, e.g.
mntput_no_expire (yes, I know that name isn't your fault)

2004-10-27 18:40:01

by Mike Waychison

Subject: Re: [PATCH 4/28] VFS: Stat shouldn't stop expire

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Christoph Hellwig wrote:
> On Mon, Oct 25, 2004 at 10:40:30AM -0400, Mike Waychison wrote:
>
>>This patch fixes the problem where a mountpoint that is due to expire fails
>>to expire if somebody keeps stat(2)ing the root of its
>>filesystem. For example, consider the case where a user has his home
>>directory automounted on /home/mikew. Some other user can keep the
>>filesystem mounted forever by simply calling ls(1) in /home, because the stat
>>action resets the marker on each call.
>>
>>Signed-off-by: Mike Waychison <[email protected]>
>>---
>>
>> namei.c | 11 ++++++++++-
>> 1 files changed, 10 insertions(+), 1 deletion(-)
>>
>>Index: linux-2.6.9-quilt/fs/namei.c
>>===================================================================
>>--- linux-2.6.9-quilt.orig/fs/namei.c 2004-08-14 01:36:45.000000000 -0400
>>+++ linux-2.6.9-quilt/fs/namei.c 2004-10-22 17:17:34.762179488 -0400
>>@@ -275,7 +275,16 @@ int deny_write_access(struct file * file
>> void path_release(struct nameidata *nd)
>> {
>> dput(nd->dentry);
>>- mntput(nd->mnt);
>>+ /*
>>+ * In order to ensure that access to an automounted filesystems'
>>+ * root does not reset it's expire counter, we check to see if the path
>>+ * being released here is a mountpoint itself. If it is, then we call
>>+ * _mntput which leaves the expire counter alone.
>>+ */
>>+ if (nd->mnt && nd->mnt->mnt_root == nd->dentry)
>>+ _mntput(nd->mnt);
>>+ else
>>+ mntput(nd->mnt);
>
>
> Why only for the root dentry and not for any dentry on stat()? This seems highly inconsistent.

I'm not sure. I need help understanding the different cases of
path_release (please add any others/discrepancies you see):

1) path walk (across a mountpoint)
2) open / close (of a mountpoint)
3) stat / xattr (of a mountpoint)

The first case, a path walk, will always end on a dentry other than the
mounted filesystem's root. As such, the above snippet will still reset
the expire counter, because some other dentry got path_released.

Case 2, opening a mountpoint, happens when you readdir the base of
the mounted filesystem. In this case, close does not call path_release;
it does an explicit mntput(filp->f_vfsmnt) instead.

Case 3, you are accessing meta information of the root of the
mountpoint. Assuming the call is made from 'outside' the mounted
filesystem, the expire counter is currently ticking. Fetching meta
information for the base directory of the mountpoint shouldn't reset
the counter, and in all the (admittedly few) cases I looked at, the
caller invoked path_release when done.

The above hack^Wsnippet was intended to deal with case #3, but it may
make more sense to look into the matter further and handle the special
case in vfs_stat()/vfs_lstat() themselves.
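To make cases 1 and 3 concrete, here is a hypothetical userspace model of the quoted root-dentry test. struct model_dentry, struct model_vfsmount and the model_* functions are illustrative stand-ins, not kernel APIs. Only the expiry mark is modelled (no reference counting, and dput() is omitted). Case 2 never reaches path_release, so it is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: only the fields the snippet looks at. */
struct model_dentry {
    const char *name;
};

struct model_vfsmount {
    struct model_dentry *mnt_root;  /* root dentry of the mounted fs */
    bool expiry_mark;               /* set by the expiry sweep */
};

/* Models mntput(): any release counts as a use and clears the mark. */
static void model_mntput(struct model_vfsmount *mnt)
{
    mnt->expiry_mark = false;
}

/* Models _mntput() (mntput_no_expire()): leaves the mark alone. */
static void model_mntput_no_expire(struct model_vfsmount *mnt)
{
    (void)mnt;
}

/* Mirrors the proposed path_release() test. */
static void model_path_release(struct model_vfsmount *mnt,
                               struct model_dentry *dentry)
{
    if (mnt && mnt->mnt_root == dentry)
        model_mntput_no_expire(mnt);  /* case 3: stat of the mount's root */
    else if (mnt)
        model_mntput(mnt);            /* case 1: walk ended below the root */
}
```

For example, releasing the root dentry of a marked mount leaves the mark set, while releasing any child dentry clears it. That asymmetry is exactly what the question about stat() on non-root dentries is probing.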

Thoughts?

> Also while you're at it please give _mntput a more sensible name, e.g.
> mntput_no_expire (yes, I know that name isn't your fault)
>

Will do.

- --
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE: The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD4DBQFBf+q2dQs4kOxk3/MRAnlQAJ413gDuQLqF5HgOKSQ/S7LrtlaZqQCYix9y
ogHTca+0B+7+HIuSJsY9kQ==
=Qw5B
-----END PGP SIGNATURE-----