LinuxLists.cc - [PATCH] fix race in mark_mounts_for

2005-05-18 09:52:49

Subject: [PATCH] fix race in mark_mounts_for_expiry()

> > > > I am seeing a locking issue with get_namespace() and put_namespace()
> > > >
> > > > lets say put_namespace() is called and it finds that it is the last
> > > > user of the namespace is just about to call __put_namespace().
> > > >
> > > > Simultaneously another thread calls get_namespace() and increments
> > > > the count.
> > > >
> > > > At this point __put_namespace() goes ahead and cleans the namespace.
> > >
> > > I think you are right.
> > >
> > > There's one place in the existing code, which could experience this
> > > race I think:
> > >
> > >
> > > /* don't do anything if the namespace is dead - all the
> > > * vfsmounts from it are going away anyway */
> > > namespace = mnt->mnt_namespace;
> > > if (!namespace || atomic_read(&namespace->count) <= 0)
> > > continue;
> > > here ---->
> > > get_namespace(namespace);
> > >
> > > spin_unlock(&vfsmount_lock);
> > >
> > >
> > > Locking vfsmount_lock in put_namespace() would fix it. Any better ideas?
> > >
> > yes. it will.
> >
> > However I don't think this issue will trigger currently because for this
> > to happen, a process should attempt a get_namespace() on a foreign
> > namespace. If all tasks are operating in the same
> > namespace, than the count will not go to zero for put namespace to
> > clean it up, unless it is the last task.
> >
> > I think, it will trigger once we allow recursive binds from foreign
> > namespace.
>
> No I am wrong. mark_mounts_for_expiry() is not called from a process
> context. Its in a timer context. So the race could happen.

How about this patch? It tries to solve this race without additional
locking. If refcount is already zero, it will increment and decrement
it. So be careful to only call grab_namespace() with vfsmount_lock
held, otherwise it could race with itself. (vfsmount_lock is also
needed in this case so that mnt->mnt_namespace, doesn't change, while
grabbing the namespace)

Compile tested only, please review carefully.

Signed-off-by: Miklos Szeredi <[email protected]>

Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c 2005-05-17 13:46:58.000000000 +0200
+++ linux/fs/namespace.c 2005-05-18 11:09:47.000000000 +0200
@@ -825,6 +825,16 @@ unlock:

EXPORT_SYMBOL_GPL(do_add_mount);

+/* Must be called with vfsmount_lock */
+static inline struct namespace *grab_namespace(struct namespace *n)
+{
+ if (n && atomic_add_return(1, &n->count) <= 1) {
+ atomic_dec(&n->count);
+ n = NULL;
+ }
+ return n;
+}
+
/*
* process a list of expirable mountpoints with the intent of discarding any
* mountpoints that aren't in use and haven't been touched since last we came
@@ -868,10 +878,9 @@ void mark_mounts_for_expiry(struct list_

/* don't do anything if the namespace is dead - all the
* vfsmounts from it are going away anyway */
- namespace = mnt->mnt_namespace;
- if (!namespace || atomic_read(&namespace->count) <= 0)
+ namespace = grab_namespace(mnt->mnt_namespace);
+ if (!namespace)
continue;
- get_namespace(namespace);

spin_unlock(&vfsmount_lock);
down_write(&namespace->sem);

2005-05-18 10:35:22

by David Howells

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi <[email protected]> wrote:

>
> How about this patch? It tries to solve this race without additional
> locking. If refcount is already zero, it will increment and decrement
> it. So be careful to only call grab_namespace() with vfsmount_lock
> held, otherwise it could race with itself. (vfsmount_lock is also
> needed in this case so that mnt->mnt_namespace, doesn't change, while
> grabbing the namespace)

How about using cmpxchg?

David

2005-05-18 10:41:03

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> > How about this patch? It tries to solve this race without additional
> > locking. If refcount is already zero, it will increment and decrement
> > it. So be careful to only call grab_namespace() with vfsmount_lock
> > held, otherwise it could race with itself. (vfsmount_lock is also
> > needed in this case so that mnt->mnt_namespace, doesn't change, while
> > grabbing the namespace)
>
> How about using cmpxchg?

How? If the count is nonzero, an incremented count must be stored.
You can't do that atomically with cmpxchg.

Miklos

2005-05-18 10:46:44

by David Howells

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi <[email protected]> wrote:

> > How about using cmpxchg?
>
> How? If the count is nonzero, an incremented count must be stored.
> You can't do that atomically with cmpxchg.

Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co
with cmpxchg() or similar.

Something like:

static inline struct namespace *grab_namespace(struct namespace *n)
{
int old = atomic_read(&n->count);

while (old > 0) {
/* attempt to increment the counter */
old = cmpxchg(&n->count, old, old + 1);
}

return old > 0 ? n : NULL;
}

David

2005-05-18 10:54:38

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co
> with cmpxchg() or similar.
>
> Something like:
>
> static inline struct namespace *grab_namespace(struct namespace *n)
> {
> int old = atomic_read(&n->count);
>
> while (old > 0) {
> /* attempt to increment the counter */
> old = cmpxchg(&n->count, old, old + 1);
> }
>
> return old > 0 ? n : NULL;
> }
>

Ahh OK :)

There's still the problem of cmpxchg meddling in the internals of an
atomic_t. Is that OK? Will that work on all archs?

Miklos

2005-05-18 11:00:19

by David Howells

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi <[email protected]> wrote:

>
> There's still the problem of cmpxchg meddling in the internals of an
> atomic_t. Is that OK? Will that work on all archs?

Probably. It might be worth defining an atomic_cmpxchg() op to formalise it
though:

int atomic_cmpxchg(atomic_t *p, int old, int new);

The main reason for this is that atomic_t is mostly a 32-bit value, I think,
and cmpxchg() may sometimes enforce a 64-bit value.

It can be made to work on PPC, PPC64, x86, x86_64, frv. I imagine it'll work
on MIPS, sparc and alpha too without too much trouble. Dunno about the rest.

David

2005-05-18 11:07:56

by Trond Myklebust

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

on den 18.05.2005 Klokka 12:53 (+0200) skreiv Miklos Szeredi:
> > Yes you can. cmpxchg() is atomic. Several archs implement atomic_inc() and co
> > with cmpxchg() or similar.
> >
> > Something like:
> >
> > static inline struct namespace *grab_namespace(struct namespace *n)
> > {
> > int old = atomic_read(&n->count);
> >
> > while (old > 0) {
> > /* attempt to increment the counter */
> > old = cmpxchg(&n->count, old, old + 1);
> > }
> >
> > return old > 0 ? n : NULL;
> > }
> >
>
> Ahh OK :)
>
> There's still the problem of cmpxchg meddling in the internals of an
> atomic_t. Is that OK? Will that work on all archs?

Some archs already have an atomic_dec_if_positive() (see for instance
the PPC). It won't take much work to convert that to an
atomic_inc_if_positive().

For those arches that don't have that sort of thing, then writing a
generic atomic_inc_if_positive() using cmpxchg() will often be possible,
but there are exceptions (for instance the original 386 does not have a
cmpxchg, so there you will have to use something else).

Cheers,
Trond

2005-05-18 11:22:42

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> >
> > There's still the problem of cmpxchg meddling in the internals of an
> > atomic_t. Is that OK? Will that work on all archs?
>
> Probably. It might be worth defining an atomic_cmpxchg() op to formalise it
> though:
>
> int atomic_cmpxchg(atomic_t *p, int old, int new);
>
> The main reason for this is that atomic_t is mostly a 32-bit value, I think,
> and cmpxchg() may sometimes enforce a 64-bit value.
>
> It can be made to work on PPC, PPC64, x86, x86_64, frv. I imagine it'll work
> on MIPS, sparc and alpha too without too much trouble. Dunno about the rest.

Any takers?

I'd rather not do this, if I don't have to :)

Do you think my original fix is wrong, or is this just cosmetics?

Thanks,
Miklos

2005-05-18 11:33:42

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> Some archs already have an atomic_dec_if_positive() (see for instance
> the PPC). It won't take much work to convert that to an
> atomic_inc_if_positive().
>
> For those arches that don't have that sort of thing, then writing a
> generic atomic_inc_if_positive() using cmpxchg() will often be possible,
> but there are exceptions (for instance the original 386 does not have a
> cmpxchg, so there you will have to use something else).

The problem with introducing architecture specific code, is that it's
just asking for new bugs.

If it's something used all over the kernel, than obviously it's OK,
but for the sake of just one caller it's a bit crazy I think.

Miklos

2005-05-18 11:52:26

by David Howells

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi <[email protected]> wrote:

> Do you think my original fix is wrong, or is this just cosmetics?

What you're doing is tricky. It's asking for a race. Admittedly, it may not
occur in the particular situation you're looking at, but can you always
guarantee that? Remember, it may be a race against some piece of code that's
not yet written, by an author who doesn't realise what _you_ are doing here
because their changeset doesn't intersect with yours.

Remember: you have, in effect, made the usage count on that structure
non-atomic.

I do something like that in rwsems and it's something I have to be very
careful about. The main reason I can get away with it is that the actual
implementation of rwsems is small and is located in a very restrictes set of
places and it's not intermingled with other stuff.

David

2005-05-18 12:09:58

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> What you're doing is tricky. It's asking for a race.

I know. The comment above the function is there to make sure the user
is aware of this.

> Admittedly, it may not
> occur in the particular situation you're looking at, but can you always
> guarantee that?

Yes, if it's always called under lock.

> Remember, it may be a race against some piece of code that's not yet
> written, by an author who doesn't realise what _you_ are doing here
> because their changeset doesn't intersect with yours.
>
> Remember: you have, in effect, made the usage count on that structure
> non-atomic.

But _only after_ it's has gone to zero. When in fact there are no
more references to it, so it shouldn't matter.

The fact that it does matter and that mark_mounts_for_expiry()
derefences mnt->mnt_namespace without actually having a proper
reference to the namespace is the real culprit here.

This is the third bug found by Jamie Lokier, Ram and me in the
mnt_namespace change. So if we are looking at proper solutions I
think that is what we should be examining.

Thanks,
Miklos

2005-05-18 12:33:53

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> But _only after_ it's has gone to zero. When in fact there are no
> more references to it, so it shouldn't matter.
>
> The fact that it does matter and that mark_mounts_for_expiry()
> derefences mnt->mnt_namespace without actually having a proper
> reference to the namespace is the real culprit here.
>
> This is the third bug found by Jamie Lokier, Ram and me in the
> mnt_namespace change. So if we are looking at proper solutions I
> think that is what we should be examining.

E.g. having a separate task count, which is incremented/decremented
only by clone/exit. If the task count goes to zero, umount_tree is
called on root, but namespace is not freed.

And each mnt_namespace holds a proper reference to the namespace, so
it's safe to dereference it anytime. When truly no more references
remain, the namespace can go away.

Hmm?

Miklos

2005-05-18 12:51:20

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi wrote:
> > Some archs already have an atomic_dec_if_positive() (see for instance
> > the PPC). It won't take much work to convert that to an
> > atomic_inc_if_positive().
> >
> > For those arches that don't have that sort of thing, then writing a
> > generic atomic_inc_if_positive() using cmpxchg() will often be possible,
> > but there are exceptions (for instance the original 386 does not have a
> > cmpxchg, so there you will have to use something else).
>
> The problem with introducing architecture specific code, is that it's
> just asking for new bugs.
>
> If it's something used all over the kernel, than obviously it's OK,
> but for the sake of just one caller it's a bit crazy I think.

I agree.

And I think you're just adding to the case for removing mnt_namespace
entirely. We'd still keep CLONE_NS, and users currently using
namespaces (in the normal ways) would see no difference.

mnt_namespace has these visible effects:

- Prevents some tasks from mounting/umounting in a "foreign"
namespace, even when they are granted access to the directory
tree of the foreign namespace.

It's not clear if the restriction is a useful security tool.

- Causes every mount in a mount tree to be detached (independently),
when last task associated with a namespace is destroyed.

And this invisible effect:

- More concurrency than a global mount lock would have.

Is that all? Are any of these effects important enough to keep?

-- Jamie

2005-05-18 13:22:17

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> And I think you're just adding to the case for removing mnt_namespace
> entirely. We'd still keep CLONE_NS, and users currently using
> namespaces (in the normal ways) would see no difference.
>
> mnt_namespace has these visible effects:
>
> - Prevents some tasks from mounting/umounting in a "foreign"
> namespace, even when they are granted access to the directory
> tree of the foreign namespace.
>
> It's not clear if the restriction is a useful security tool.
>
> - Causes every mount in a mount tree to be detached (independently),
> when last task associated with a namespace is destroyed.

I don't understand. The tree _has_ to be detached when no task uses
the namespace. That is the main purpose of the namespace structure,
to provide an anchor for the mount tree.

> And this invisible effect:
>
> - More concurrency than a global mount lock would have.

This is the key issue I think. It may even have security implications
in the future if we want to allow unprivileged mounts : a user could
DoS the system by just doing lots of mounts umounts in a private
namespace.

Miklos

2005-05-18 17:03:42

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> E.g. having a separate task count, which is incremented/decremented
> only by clone/exit. If the task count goes to zero, umount_tree is
> called on root, but namespace is not freed.
>
> And each mnt_namespace holds a proper reference to the namespace, so
> it's safe to dereference it anytime. When truly no more references
> remain, the namespace can go away.
>
> Hmm?

Here's a patch (compile and boot tested).

Now there are two solutions to one problem, the previous one is ugly,
this one is buggy. Which should it be?

Thanks,
Miklos

Index: linux/include/linux/namespace.h
===================================================================
--- linux.orig/include/linux/namespace.h 2005-05-18 18:02:24.000000000 +0200
+++ linux/include/linux/namespace.h 2005-05-18 18:28:48.000000000 +0200
@@ -7,18 +7,26 @@

struct namespace {
atomic_t count;
+ atomic_t task_count; /* how many tasks use this */
struct vfsmount * root;
struct list_head list;
struct rw_semaphore sem;
};

extern int copy_namespace(int, struct task_struct *);
-extern void __put_namespace(struct namespace *namespace);
+extern void __exit_namespace(struct namespace *namespace);

static inline void put_namespace(struct namespace *namespace)
{
- if (atomic_dec_and_test(&namespace->count))
- __put_namespace(namespace);
+ if (namespace && atomic_dec_and_test(&namespace->count))
+ kfree(namespace);
+}
+
+static inline struct namespace *get_namespace(struct namespace *namespace)
+{
+ if (namespace)
+ atomic_inc(&namespace->count);
+ return namespace;
}

static inline void exit_namespace(struct task_struct *p)
@@ -28,14 +36,10 @@ static inline void exit_namespace(struct
task_lock(p);
p->namespace = NULL;
task_unlock(p);
- put_namespace(namespace);
+ if (atomic_dec_and_test(&p->namespace->task_count))
+ __exit_namespace(p->namespace);
}
}

-static inline void get_namespace(struct namespace *namespace)
-{
- atomic_inc(&namespace->count);
-}
-
#endif
#endif
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c 2005-05-18 17:42:46.000000000 +0200
+++ linux/fs/namespace.c 2005-05-18 18:32:29.000000000 +0200
@@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
mnt->mnt_root = dget(root);
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
- mnt->mnt_namespace = current->namespace;
+ mnt->mnt_namespace = get_namespace(current->namespace);

/* stick the duplicate mount on the same expiry list
* as the original if that was on one */
@@ -345,13 +345,15 @@ static void umount_tree(struct vfsmount
for (p = mnt; p; p = next_mnt(p, mnt)) {
list_del(&p->mnt_list);
list_add(&p->mnt_list, &kill);
- p->mnt_namespace = NULL;
}

while (!list_empty(&kill)) {
+ struct namespace *n;
mnt = list_entry(kill.next, struct vfsmount, mnt_list);
list_del_init(&mnt->mnt_list);
list_del_init(&mnt->mnt_fslink);
+ n = mnt->mnt_namespace;
+ mnt->mnt_namespace = NULL;
if (mnt->mnt_parent == mnt) {
spin_unlock(&vfsmount_lock);
} else {
@@ -361,6 +363,7 @@ static void umount_tree(struct vfsmount
path_release(&old_nd);
}
mntput(mnt);
+ put_namespace(n);
spin_lock(&vfsmount_lock);
}
}
@@ -866,12 +869,9 @@ void mark_mounts_for_expiry(struct list_
mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink);
list_del_init(&mnt->mnt_fslink);

- /* don't do anything if the namespace is dead - all the
- * vfsmounts from it are going away anyway */
- namespace = mnt->mnt_namespace;
- if (!namespace || atomic_read(&namespace->count) <= 0)
+ namespace = get_namespace(mnt->mnt_namespace);
+ if (!namespace)
continue;
- get_namespace(namespace);

spin_unlock(&vfsmount_lock);
down_write(&namespace->sem);
@@ -1073,8 +1073,10 @@ int copy_namespace(int flags, struct tas

get_namespace(namespace);

- if (!(flags & CLONE_NEWNS))
+ if (!(flags & CLONE_NEWNS)) {
+ atomic_inc(&namespace->task_count);
return 0;
+ }

if (!capable(CAP_SYS_ADMIN)) {
put_namespace(namespace);
@@ -1086,6 +1088,7 @@ int copy_namespace(int flags, struct tas
goto out;

atomic_set(&new_ns->count, 1);
+ atomic_set(&new_ns->task_count, 1);
init_rwsem(&new_ns->sem);
INIT_LIST_HEAD(&new_ns->list);

@@ -1109,7 +1112,9 @@ int copy_namespace(int flags, struct tas
p = namespace->root;
q = new_ns->root;
while (p) {
- q->mnt_namespace = new_ns;
+ struct namespace *oldns = q->mnt_namespace;
+ q->mnt_namespace = get_namespace(new_ns);
+ put_namespace(oldns);
if (fs) {
if (p == fs->rootmnt) {
rootmnt = p;
@@ -1381,16 +1386,17 @@ static void __init init_mount_tree(void)
if (!namespace)
panic("Can't allocate initial namespace");
atomic_set(&namespace->count, 1);
+ atomic_set(&namespace->task_count, 0);
INIT_LIST_HEAD(&namespace->list);
init_rwsem(&namespace->sem);
list_add(&mnt->mnt_list, &namespace->list);
namespace->root = mnt;
- mnt->mnt_namespace = namespace;
+ mnt->mnt_namespace = get_namespace(namespace);

init_task.namespace = namespace;
read_lock(&tasklist_lock);
do_each_thread(g, p) {
- get_namespace(namespace);
+ atomic_inc(&namespace->task_count);
p->namespace = namespace;
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
@@ -1448,12 +1454,13 @@ void __init mnt_init(unsigned long mempa
init_mount_tree();
}

-void __put_namespace(struct namespace *namespace)
+void __exit_namespace(struct namespace *namespace)
{
down_write(&namespace->sem);
spin_lock(&vfsmount_lock);
umount_tree(namespace->root);
+ namespace->root = NULL;
spin_unlock(&vfsmount_lock);
up_write(&namespace->sem);
- kfree(namespace);
+ put_namespace(namespace);
}
Index: linux/fs/super.c
===================================================================
--- linux.orig/fs/super.c 2005-05-17 13:46:56.000000000 +0200
+++ linux/fs/super.c 2005-05-18 18:44:18.000000000 +0200
@@ -37,6 +37,7 @@
#include <linux/writeback.h> /* for the emergency remount stuff */
#include <linux/idr.h>
#include <linux/kobject.h>
+#include <linux/namespace.h>
#include <asm/uaccess.h>

@@ -842,7 +843,7 @@ do_kern_mount(const char *fstype, int fl
mnt->mnt_root = dget(sb->s_root);
mnt->mnt_mountpoint = sb->s_root;
mnt->mnt_parent = mnt;
- mnt->mnt_namespace = current->namespace;
+ mnt->mnt_namespace = get_namespace(current->namespace);
up_write(&sb->s_umount);
put_filesystem(type);
return mnt;

2005-05-18 17:35:12

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi wrote:
> > - Causes every mount in a mount tree to be detached (independently),
> > when last task associated with a namespace is destroyed.
>
> I don't understand. The tree _has_ to be detached when no task uses
> the namespace. That is the main purpose of the namespace structure,
> to provide an anchor for the mount tree.

That makes less sense if we allow other tasks to be using a namespace
through a passing a file descriptor, and then the last task which has
current->namespace equal to that namespace exits. It makes no sense
to me that the mount which is still accessible through the file
descriptor is suddenly detached from it's parent and children mounts.

Why is it not good enough to detach each vfsmnt when the last
reference to each vfsmnt is dropped? In other words, simply when the
vfsmnt becomes unreachable?

That would detach whole vfsmnt trees when the last reference to the
root of the tree is dropped, rather than when a task exits even though
another tasks has a file descriptor still pointing to the tree.

That would be a change of behaviour, but it seems like quite a
sensible change in behaviour, and would not affect "normal" namespace
users.

There's one other purpose to mnt_namespace:

- The list in /proc/mounts.

But really it would *improve* security if the list in /proc/mounts was
derived by walking the vfsmnt subtree starting at current->chroot.

> > And this invisible effect:
> >
> > - More concurrency than a global mount lock would have.
>
> This is the key issue I think. It may even have security implications
> in the future if we want to allow unprivileged mounts : a user could
> DoS the system by just doing lots of mounts umounts in a private
> namespace.

Not an argument. If that's a problem, they could easily DoS the
system by doing lots of mount/umounts in the initial namespace.

The proper thing to do about concurrency is to make sure the slow
operation (getting the superblock) is done outside any mount
semaphore, and just do the tree traversal/splicing operations inside
it. The code seems to do that already, so I think a global semaphore
instead of mnt_namespace->sem would make no practical difference.

But if it was a problem, we'd come up with something. Let's
concentrate on the behaviour we want, rather than a minor detail like
that.

I think the common sense behaviour is to say that vfsmnts remain
mounted as long as they are reachable - from a file descriptor, task's
root, or task's cwd.

-- Jamie

2005-05-18 18:49:08

by Ram Pai

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

On Wed, 2005-05-18 at 09:53, Miklos Szeredi wrote:
> > E.g. having a separate task count, which is incremented/decremented
> > only by clone/exit. If the task count goes to zero, umount_tree is
> > called on root, but namespace is not freed.
> >
> > And each mnt_namespace holds a proper reference to the namespace, so
> > it's safe to dereference it anytime. When truly no more references
> > remain, the namespace can go away.
> >
> > Hmm?

First of all the reason this race exists implies Al Viro may have had
assertion in his mind that "All tasks that have access to a namespace
has a refcount on that namespace". If that was what he was thinking,
than the I would stick with that assertion being true. The overall idea
I am thinking off is:

1) have the automounter hold a refcount on any new namespace created
the mounts in which the automounter has interest in.
2) have a refcount on the namespace when a new task gains access to
a namespace through the file descriptor or any other
similar mechanisms and remove the reference
once the fd gets closed. (bit tricky to implement)

Do you agree with the overall idea?
If so I will try to come up with a patch.

RP

>
> Here's a patch (compile and boot tested).
>
> Now there are two solutions to one problem, the previous one is ugly,
> this one is buggy. Which should it be?
>
> Thanks,
> Miklos
>
> Index: linux/include/linux/namespace.h
> ===================================================================
> --- linux.orig/include/linux/namespace.h 2005-05-18 18:02:24.000000000 +0200
> +++ linux/include/linux/namespace.h 2005-05-18 18:28:48.000000000 +0200
> @@ -7,18 +7,26 @@
>
> struct namespace {
> atomic_t count;
> + atomic_t task_count; /* how many tasks use this */
> struct vfsmount * root;
> struct list_head list;
> struct rw_semaphore sem;
> };
>
> extern int copy_namespace(int, struct task_struct *);
> -extern void __put_namespace(struct namespace *namespace);
> +extern void __exit_namespace(struct namespace *namespace);
>
> static inline void put_namespace(struct namespace *namespace)
> {
> - if (atomic_dec_and_test(&namespace->count))
> - __put_namespace(namespace);
> + if (namespace && atomic_dec_and_test(&namespace->count))
> + kfree(namespace);
> +}
> +
> +static inline struct namespace *get_namespace(struct namespace *namespace)
> +{
> + if (namespace)
> + atomic_inc(&namespace->count);
> + return namespace;
> }
>
> static inline void exit_namespace(struct task_struct *p)
> @@ -28,14 +36,10 @@ static inline void exit_namespace(struct
> task_lock(p);
> p->namespace = NULL;
> task_unlock(p);
> - put_namespace(namespace);
> + if (atomic_dec_and_test(&p->namespace->task_count))
> + __exit_namespace(p->namespace);
> }
> }
>
> -static inline void get_namespace(struct namespace *namespace)
> -{
> - atomic_inc(&namespace->count);
> -}
> -
> #endif
> #endif
> Index: linux/fs/namespace.c
> ===================================================================
> --- linux.orig/fs/namespace.c 2005-05-18 17:42:46.000000000 +0200
> +++ linux/fs/namespace.c 2005-05-18 18:32:29.000000000 +0200
> @@ -160,7 +160,7 @@ clone_mnt(struct vfsmount *old, struct d
> mnt->mnt_root = dget(root);
> mnt->mnt_mountpoint = mnt->mnt_root;
> mnt->mnt_parent = mnt;
> - mnt->mnt_namespace = current->namespace;
> + mnt->mnt_namespace = get_namespace(current->namespace);
>
> /* stick the duplicate mount on the same expiry list
> * as the original if that was on one */
> @@ -345,13 +345,15 @@ static void umount_tree(struct vfsmount
> for (p = mnt; p; p = next_mnt(p, mnt)) {
> list_del(&p->mnt_list);
> list_add(&p->mnt_list, &kill);
> - p->mnt_namespace = NULL;
> }
>
> while (!list_empty(&kill)) {
> + struct namespace *n;
> mnt = list_entry(kill.next, struct vfsmount, mnt_list);
> list_del_init(&mnt->mnt_list);
> list_del_init(&mnt->mnt_fslink);
> + n = mnt->mnt_namespace;
> + mnt->mnt_namespace = NULL;
> if (mnt->mnt_parent == mnt) {
> spin_unlock(&vfsmount_lock);
> } else {
> @@ -361,6 +363,7 @@ static void umount_tree(struct vfsmount
> path_release(&old_nd);
> }
> mntput(mnt);
> + put_namespace(n);
> spin_lock(&vfsmount_lock);
> }
> }
> @@ -866,12 +869,9 @@ void mark_mounts_for_expiry(struct list_
> mnt = list_entry(graveyard.next, struct vfsmount, mnt_fslink);
> list_del_init(&mnt->mnt_fslink);
>
> - /* don't do anything if the namespace is dead - all the
> - * vfsmounts from it are going away anyway */
> - namespace = mnt->mnt_namespace;
> - if (!namespace || atomic_read(&namespace->count) <= 0)
> + namespace = get_namespace(mnt->mnt_namespace);
> + if (!namespace)
> continue;
> - get_namespace(namespace);
>
> spin_unlock(&vfsmount_lock);
> down_write(&namespace->sem);
> @@ -1073,8 +1073,10 @@ int copy_namespace(int flags, struct tas
>
> get_namespace(namespace);
>
> - if (!(flags & CLONE_NEWNS))
> + if (!(flags & CLONE_NEWNS)) {
> + atomic_inc(&namespace->task_count);
> return 0;
> + }
>
> if (!capable(CAP_SYS_ADMIN)) {
> put_namespace(namespace);
> @@ -1086,6 +1088,7 @@ int copy_namespace(int flags, struct tas
> goto out;
>
> atomic_set(&new_ns->count, 1);
> + atomic_set(&new_ns->task_count, 1);
> init_rwsem(&new_ns->sem);
> INIT_LIST_HEAD(&new_ns->list);
>
> @@ -1109,7 +1112,9 @@ int copy_namespace(int flags, struct tas
> p = namespace->root;
> q = new_ns->root;
> while (p) {
> - q->mnt_namespace = new_ns;
> + struct namespace *oldns = q->mnt_namespace;
> + q->mnt_namespace = get_namespace(new_ns);
> + put_namespace(oldns);
> if (fs) {
> if (p == fs->rootmnt) {
> rootmnt = p;
> @@ -1381,16 +1386,17 @@ static void __init init_mount_tree(void)
> if (!namespace)
> panic("Can't allocate initial namespace");
> atomic_set(&namespace->count, 1);
> + atomic_set(&namespace->task_count, 0);
> INIT_LIST_HEAD(&namespace->list);
> init_rwsem(&namespace->sem);
> list_add(&mnt->mnt_list, &namespace->list);
> namespace->root = mnt;
> - mnt->mnt_namespace = namespace;
> + mnt->mnt_namespace = get_namespace(namespace);
>
> init_task.namespace = namespace;
> read_lock(&tasklist_lock);
> do_each_thread(g, p) {
> - get_namespace(namespace);
> + atomic_inc(&namespace->task_count);
> p->namespace = namespace;
> } while_each_thread(g, p);
> read_unlock(&tasklist_lock);
> @@ -1448,12 +1454,13 @@ void __init mnt_init(unsigned long mempa
> init_mount_tree();
> }
>
> -void __put_namespace(struct namespace *namespace)
> +void __exit_namespace(struct namespace *namespace)
> {
> down_write(&namespace->sem);
> spin_lock(&vfsmount_lock);
> umount_tree(namespace->root);
> + namespace->root = NULL;
> spin_unlock(&vfsmount_lock);
> up_write(&namespace->sem);
> - kfree(namespace);
> + put_namespace(namespace);
> }
> Index: linux/fs/super.c
> ===================================================================
> --- linux.orig/fs/super.c 2005-05-17 13:46:56.000000000 +0200
> +++ linux/fs/super.c 2005-05-18 18:44:18.000000000 +0200
> @@ -37,6 +37,7 @@
> #include <linux/writeback.h> /* for the emergency remount stuff */
> #include <linux/idr.h>
> #include <linux/kobject.h>
> +#include <linux/namespace.h>
> #include <asm/uaccess.h>
>
>
> @@ -842,7 +843,7 @@ do_kern_mount(const char *fstype, int fl
> mnt->mnt_root = dget(sb->s_root);
> mnt->mnt_mountpoint = sb->s_root;
> mnt->mnt_parent = mnt;
> - mnt->mnt_namespace = current->namespace;
> + mnt->mnt_namespace = get_namespace(current->namespace);
> up_write(&sb->s_umount);
> put_filesystem(type);
> return mnt;

2005-05-18 19:07:20

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> That makes less sense if we allow other tasks to be using a namespace
> through a passing a file descriptor, and then the last task which has
> current->namespace equal to that namespace exits. It makes no sense
> to me that the mount which is still accessible through the file
> descriptor is suddenly detached from it's parent and children mounts.

I see your point. I don't yet see a solution.

Currently detach is an explicit action, not something automatic which
happens when there are no more references to a vfsmount.

> Why is it not good enough to detach each vfsmnt when the last
> reference to each vfsmnt is dropped? In other words, simply when the
> vfsmnt becomes unreachable?

Define unreachable. Then define a mechanism, by which it can be
detected.

Miklos

2005-05-18 19:20:16

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> First of all the reason this race exists implies Al Viro may have had
> assertion in his mind that "All tasks that have access to a namespace
> has a refcount on that namespace". If that was what he was thinking,
> than the I would stick with that assertion being true. The overall idea
> I am thinking off is:
>
> 1) have the automounter hold a refcount on any new namespace created
> the mounts in which the automounter has interest in.
> 2) have a refcount on the namespace when a new task gains access to
> a namespace through the file descriptor or any other
> similar mechanisms and remove the reference
> once the fd gets closed. (bit tricky to implement)
>
> Do you agree with the overall idea?

I don't really understand it.

A reference count usually means, the number of references (pointers)
to an object. You can sometimes get away with schemes that deviate
from this in various ways, but it's usually asking for trouble.

The usage in mark_mounts_for_expiry() deviated from it so much, that
the result was a subtle race.

Doing some tricky thing like what you propose will just likely
introduce more subtle problems.

Miklos

2005-05-18 19:52:42

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

Miklos Szeredi wrote:
> > That makes less sense if we allow other tasks to be using a namespace
> > through a passing a file descriptor, and then the last task which has
> > current->namespace equal to that namespace exits. It makes no sense
> > to me that the mount which is still accessible through the file
> > descriptor is suddenly detached from it's parent and children mounts.
>
> I see your point. I don't yet see a solution.
>
> Currently detach is an explicit action, not something automatic which
> happens when there are no more references to a vfsmount.

It's implicit, when the last task calls put_namespace:

void __put_namespace(struct namespace *namespace)
{
[...]
umount_tree(namespace->root);
-> calls detach_mnt for each vfsmnt in namespace.
[...]
}

> > Why is it not good enough to detach each vfsmnt when the last
> > reference to each vfsmnt is dropped? In other words, simply when the
> > vfsmnt becomes unreachable?
>
> Define unreachable.

Unreachable as in no file descriptors (or chroot/cwd) refer to the
vfsmnt, either directly or indirectly through a path traversal.

> Then define a mechanism, by which it can be detected.

There aren't any vfsmnt->vfsmnt cycles... They're a forest, vfsmnts
don't move from one tree to another (bind mounts don't link them, they
create new vfsmnts), and each tree can be referenced by a file
descriptor at any point on the tree.

It rather hinges on which of these behaviours you prefer:

1. A file descriptor/chroot/cwd reference to any point in a vfsmnt
tree means the whole tree is retained. This means ".." remains
always accessible: fchdir(fd); open("..") continues to access
that whole tree as you still have fd.

2. A file descriptor/chroot/cwd reference to any point in a vfsmnt
tree means the subtree from that point is retained, and parents
may disappear if there are no references (not counting ".." as a
reference). This behaviour is more sensible for chroots, where
the parents should be inaccessible anyway.

3. A mixture, where current->root references only maintain the
subtree rooted at that point, and other references, if outside
the current->root subtree, retain the whole tree accessible from
those references.

The appropriate data structure / algorithm depends on which behaviour
is preferred. So which is it? 1 Is best done with a mnt_namespace
structure, but references to it counted when vfsmnts are referenced by
file descriptors/root/cwd, _not_ references by tasks (no
current->namespace). 2 is best done by simply reference counting
vfsmnts.

-- Jamie

2005-05-18 20:36:01

by Ram Pai

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

On Wed, 2005-05-18 at 12:19, Miklos Szeredi wrote:
> > First of all the reason this race exists implies Al Viro may have had
> > assertion in his mind that "All tasks that have access to a namespace
> > has a refcount on that namespace". If that was what he was thinking,
> > than the I would stick with that assertion being true. The overall idea
> > I am thinking off is:
> >
> > 1) have the automounter hold a refcount on any new namespace created
> > the mounts in which the automounter has interest in.
> > 2) have a refcount on the namespace when a new task gains access to
> > a namespace through the file descriptor or any other
> > similar mechanisms and remove the reference
> > once the fd gets closed. (bit tricky to implement)
> >
> > Do you agree with the overall idea?
>
> I don't really understand it.
>
> A reference count usually means, the number of references (pointers)
> to an object. You can sometimes get away with schemes that deviate
> from this in various ways, but it's usually asking for trouble.

Ok. One more attempt to be clear.

All the places where get_namespace() is called currently except
in mark_mounts_for_expiry() are safe because they are called in
places where it is guaranteed that they will not race with
__put_namespace().

For example in clone_namespace(), get_namespace() will not race
because the task that called the clone has a refcount on the
namespace and since that task is currently in the kernel, there is
no chance for the task to go away decrementing the refcount
on that namespace.

But the case where the call to get_namespace() is buggy in
mark_mounts_for_expiry() is because:
it is called in a timer context, and the last process referring
the namespace may just disapper right than.
So what I am proposing is:
in automouter, while the automount takes place in
afs_mntpt_follow_link() increment the refcount of the namespace,
by calling get_namespace(). This call will not race with __put_namespace
because the process that is trying to access the
mountpoint already has a refcount on the namespace and it won't be
able to decrement the refcount currently. agree?

Now later when the automounter tries to unmount the mount
call put_namespace() after unmounting. I mean do it in
mark_mounts_for_expiry(). Also delete the call to get_namespace())

So the race will not happen at all.

Makes sense?

That was the easiest part, but the difficult part is how to call
get_namespace() on a forign namespace in do_loopback()?

thinking about it,
RP

>
> The usage in mark_mounts_for_expiry() deviated from it so much, that
> the result was a subtle race.
>
> Doing some tricky thing like what you propose will just likely
> introduce more subtle problems.
>
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2005-05-19 12:43:29

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> > Define unreachable.
>
> Unreachable as in no file descriptors (or chroot/cwd) refer to the
> vfsmnt, either directly or indirectly through a path traversal.
>
> > Then define a mechanism, by which it can be detected.
>
> There aren't any vfsmnt->vfsmnt cycles... They're a forest, vfsmnts
> don't move from one tree to another (bind mounts don't link them, they
> create new vfsmnts), and each tree can be referenced by a file
> descriptor at any point on the tree.
>
> It rather hinges on which of these behaviours you prefer:
>
> 1. A file descriptor/chroot/cwd reference to any point in a vfsmnt
> tree means the whole tree is retained. This means ".." remains
> always accessible: fchdir(fd); open("..") continues to access
> that whole tree as you still have fd.
>
> 2. A file descriptor/chroot/cwd reference to any point in a vfsmnt
> tree means the subtree from that point is retained, and parents
> may disappear if there are no references (not counting ".." as a
> reference). This behaviour is more sensible for chroots, where
> the parents should be inaccessible anyway.
>
> 3. A mixture, where current->root references only maintain the
> subtree rooted at that point, and other references, if outside
> the current->root subtree, retain the whole tree accessible from
> those references.

4. Everything stays the same, except chroot() changes
current->namespace to current->fs->root_mnt->namespace. This would
address your concern with chroot.

> The appropriate data structure / algorithm depends on which
> behaviour is preferred. So which is it?

My preference is 1) or 4).

2) and 3) would have some pretty strange properties (in some
situations you can enter a directory, but you cannot get out with 'cd
..' any more)

> 1 Is best done with a mnt_namespace structure, but references to it
> counted when vfsmnts are referenced by file descriptors/root/cwd,
> _not_ references by tasks (no current->namespace).

So you are proposing selective replacement of

mntget(mnt);

with

mntget(mnt);
get_namespace(mnt->mnt_namespace);

and similar for mntput()?

What do you do on unmount? If you set mnt->mnt_namespace to NULL,
you'd have to do a put_namespace() for each file descriptor/root/cwd
referencing the mount, but how do you do that?

> 2 is best done by simply reference counting vfsmnts.

It would need quite a revamp of the current reference counting. If
I'm not mistaken, currently each vfsmount has an initial reference
given to it in alloc_vfsmnt. And on attach, the parent namespace's
count is also increased, and vice-versa for unmount.

I'm not sure what purpose referencing the parent serves, but it won't
work with 2). There you'd need referencing the other way round:
parents holding a reference on their children, and have no initial
reference.

With 2) it's also questionable, how you copy the namespace in clone().
Do you just copy the tree under current->fs->root_mnt? Or do you copy
tree under cwd as well (if it's disjunct from the root tree)?

Miklos

2005-05-19 12:54:39

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] fix race in mark_mounts_for_expiry()

> Ok. One more attempt to be clear.
>
> All the places where get_namespace() is called currently except
> in mark_mounts_for_expiry() are safe because they are called in
> places where it is guaranteed that they will not race with
> __put_namespace().
>
> For example in clone_namespace(), get_namespace() will not race
> because the task that called the clone has a refcount on the
> namespace and since that task is currently in the kernel, there is
> no chance for the task to go away decrementing the refcount
> on that namespace.
>
> But the case where the call to get_namespace() is buggy in
> mark_mounts_for_expiry() is because:
> it is called in a timer context, and the last process referring
> the namespace may just disapper right than.
> So what I am proposing is:
> in automouter, while the automount takes place in
> afs_mntpt_follow_link() increment the refcount of the namespace,
> by calling get_namespace(). This call will not race with __put_namespace
> because the process that is trying to access the
> mountpoint already has a refcount on the namespace and it won't be
> able to decrement the refcount currently. agree?
>
> Now later when the automounter tries to unmount the mount
> call put_namespace() after unmounting. I mean do it in
> mark_mounts_for_expiry(). Also delete the call to get_namespace())
>
> So the race will not happen at all.
>
> Makes sense?

Well I imagine it could work, but again it's just a special case for
the automounter stuff.

It still won't solve the recursive mount problem that this discussion
(and your discovery of the race) originated from.

My second patch solves the problem generally (though I'm sure that
it's full of bugs yet), by keeping a reference to the namespace from
each vfsmount in that namespace. That way, until the vfsmount remains
in the namespace (i.e. until it's unmounted) mnt_namespace can be
safely used for whatever reason (recursive bind, atomount, etc).

So why does it make sense to solve this problem only for the
automounter?

Miklos