I have enclosed a patch that allows rbinds across any two namespaces.
NOTE: currenly bind from foriegn namespace to current namespace is
allowed. This patch now allows:
binds/rbinds from any namespace to any other namespace, under the
assumption that if a process has access to a namespace, it ought to
have permission to manipulate that namespace.
The patch incorporates ideas from Miklos and Jamie, and is dependent
on Miklos's 'fix race in mark_mounts_for_expiry' patch to function
correctly. Also it depends on Miklos's 'fix bind mount from foreign
namespace' patch, because without that patch umounts would fail.
Though we have not come up with any security reason towards why
this functionality should not be allowed, I am sure it may open
up some concerns.
RP
> I have enclosed a patch that allows rbinds across any two namespaces.
> NOTE: currenly bind from foriegn namespace to current namespace is
> allowed. This patch now allows:
Note: you are accessing ->mnt_namespace without holding vfsmount_lock.
Also why check current->namespace? It doesn't hurt to do
get_namespace() even if it's not strictly needed. And it would
simplify the code.
In fact all uses of current->namespace and check_mnt() could be
eliminated from namespace.c. The only use of ->namespace would be in
copy_namespace() and exit_namespace().
Miklos
On Fri, 2005-05-20 at 23:27, Miklos Szeredi wrote:
> > I have enclosed a patch that allows rbinds across any two namespaces.
> > NOTE: currenly bind from foriegn namespace to current namespace is
> > allowed. This patch now allows:
>
> Note: you are accessing ->mnt_namespace without holding vfsmount_lock.
I realized that just after sending the patch :( . Have corrected it in
the new patch.
>
> Also why check current->namespace? It doesn't hurt to do
> get_namespace() even if it's not strictly needed. And it would
> simplify the code.
however get_namespace() will chew up some extra cycles
sometimes unnecessarily. I did incorporate your comment and
got much simpler code.
>
> In fact all uses of current->namespace and check_mnt() could be
> eliminated from namespace.c. The only use of ->namespace would be in
> copy_namespace() and exit_namespace().
Yes. I will make this a separate patch, which I will send out soon.
Will have to look at each of the cases deeply.
Enclosed the simplified patch,
RP
> Enclosed the simplified patch,
Looks much better :)
I still see a problem: what if old_nd.mnt is already detached, and
bind is non-recursive. Now it fails with EINVAL, though it used to
work (and I think is very useful).
When doing up_write(...) you don't have to keep the order, just check
if the namespaces are not equal for the second up_write().
And why don't you do this:
if (old_ns < mntpt_ns)
down_write(&old_ns->sem);
instead of this
if (old_ns < mntpt_ns) {
down_write(&old_ns->sem);
}
Miklos
On Sat, 2005-05-21 at 01:09, Miklos Szeredi wrote:
> > Enclosed the simplified patch,
>
> Looks much better :)
>
> I still see a problem: what if old_nd.mnt is already detached, and
> bind is non-recursive. Now it fails with EINVAL, though it used to
> work (and I think is very useful).
I am not getting this comment. R u assuming that a detached mount
will have NULL namespace? If so I dont see it being the case.
Or am I missing some subtle point?
> When doing up_write(...) you don't have to keep the order, just check
> if the namespaces are not equal for the second up_write().
Yes. saves atleast 2 lines.
>
> And why don't you do this:
>
> if (old_ns < mntpt_ns)
> down_write(&old_ns->sem);
>
> instead of this
>
> if (old_ns < mntpt_ns) {
> down_write(&old_ns->sem);
> }
Will do. Saves another few lines. :)
I will send out the new patch once I understand your first comment.
RP
>
> Miklos
> > I still see a problem: what if old_nd.mnt is already detached, and
> > bind is non-recursive. Now it fails with EINVAL, though it used to
> > work (and I think is very useful).
>
> I am not getting this comment. R u assuming that a detached mount
> will have NULL namespace?
Yes.
> If so I dont see it being the case.
See this patch:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111627383207049&w=2
Miklos
> I am not getting this comment. R u assuming that a detached mount
> will have NULL namespace? If so I dont see it being the case.
> Or am I missing some subtle point?
On a related note: now that it's not necessarily current->namespace
that's used as a destination for the mount, you'll have to pass
mntpt_ns into copy_tree() and clone_mnt() so that mnt->mnt_namespace
can correctly be set. That will also enable some cleanup in
copy_namespace(), where you can remove the of setting ->mnt_namespace.
Miklos
On Sat, 2005-05-21 at 02:09, Miklos Szeredi wrote:
> > > I still see a problem: what if old_nd.mnt is already detached, and
> > > bind is non-recursive. Now it fails with EINVAL, though it used to
> > > work (and I think is very useful).
> >
> > I am not getting this comment. R u assuming that a detached mount
> > will have NULL namespace?
>
> Yes.
>
> > If so I dont see it being the case.
>
> See this patch:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=111627383207049&w=2
Ok. look at the enclosed patch. Does it look any better? The special
casing for detached mounts ate up some brain cells and made the code
less simpler.
RP
> Ok. look at the enclosed patch. Does it look any better? The special
> casing for detached mounts ate up some brain cells and made the code
> less simpler.
Yes, this isn't trivial stuff.
I realized one more thing: nd->mnt (the destination vfsmount) might be
detached while waiting for the semaphore. So that needs to be
rechecked after taking the semaphores.
And the same for old_nd->mnt in case of rbind. Though I'm
not sure what the semantics should be in this case:
1) rbind always fails if the source is detached
2) rbind always succeeds, and if the source is detached it just
copies that single mount
I like 2) better. Is there anything against it?
Miklos
Ram wrote:
> > Also why check current->namespace? It doesn't hurt to do
> > get_namespace() even if it's not strictly needed. And it would
> > simplify the code.
>
> however get_namespace() will chew up some extra cycles
> sometimes unnecessarily. I did incorporate your comment and
> got much simpler code.
Checking current->namespace also chews up extra cycles, sometimes
unnecessarily, but the cycles are negligable in both cases.
Because they're negligable, the most important thing is to avoid
unnecessary complications in the code.
-- Jamie
Miklos Szeredi wrote:
> I still see a problem: what if old_nd.mnt is already detached, and
> bind is non-recursive. Now it fails with EINVAL, though it used to
> work (and I think is very useful).
Hey, you just made another argument for not detaching mounts when the
last task with that current->namespace exits, but instead detaching
mounts when the last reference to any vfsmnt in the namespace is dropped.
Hint :)
-- Jamie
> > I still see a problem: what if old_nd.mnt is already detached, and
> > bind is non-recursive. Now it fails with EINVAL, though it used to
> > work (and I think is very useful).
>
> Hey, you just made another argument for not detaching mounts when the
> last task with that current->namespace exits, but instead detaching
> mounts when the last reference to any vfsmnt in the namespace is dropped.
>
> Hint :)
I have a better idea:
- create a "dead_mounts" namespace.
- chain each detached mount's ->mnt_list on dead_mounts->list
- set mnt_namespace to dead_mounts
- export the list via proc through the usual mount list interface
The last would be a nice bonus: I've always wanted to see the list of
detached, but not-yet destroyed mounts.
Does anybody see a problem with that?
Miklos
> I have a better idea:
>
> - create a "dead_mounts" namespace.
> - chain each detached mount's ->mnt_list on dead_mounts->list
> - set mnt_namespace to dead_mounts
> - export the list via proc through the usual mount list interface
>
> The last would be a nice bonus: I've always wanted to see the list of
> detached, but not-yet destroyed mounts.
Here's a patch that does this. It needs all the other namespace
patches, so I made a diff against 2.6.12-rc4-mm2 in case anybody wants
to try it:
http://www.inf.bme.hu/~mszeredi/patches/patch-2.6.12-rc4-mm2-szm1
It works for me(TM).
Miklos
Index: linux/include/linux/proc_fs.h
===================================================================
--- linux.orig/include/linux/proc_fs.h 2005-05-22 16:09:37.000000000 +0200
+++ linux/include/linux/proc_fs.h 2005-05-22 16:09:45.000000000 +0200
@@ -127,6 +127,7 @@ extern struct dentry *proc_lookup(struct
extern struct file_operations proc_kcore_operations;
extern struct file_operations proc_kmsg_operations;
extern struct file_operations ppc_htab_operations;
+extern struct file_operations proc_dead_mounts_operations;
/*
* proc_tty.c
Index: linux/fs/proc/proc_misc.c
===================================================================
--- linux.orig/fs/proc/proc_misc.c 2005-05-22 16:09:37.000000000 +0200
+++ linux/fs/proc/proc_misc.c 2005-05-22 16:09:45.000000000 +0200
@@ -652,6 +652,9 @@ void __init proc_misc_init(void)
create_proc_read_entry(p->name, 0, NULL, p->read_proc, NULL);
proc_symlink("mounts", NULL, "self/mounts");
+ entry = create_proc_entry("dead_mounts", S_IRUSR, &proc_root);
+ if (entry)
+ entry->proc_fops = &proc_dead_mounts_operations;
/* And now for trickier ones */
entry = create_proc_entry("kmsg", S_IRUSR, &proc_root);
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c 2005-05-22 16:09:45.000000000 +0200
+++ linux/fs/namespace.c 2005-05-22 16:12:23.000000000 +0200
@@ -42,6 +42,7 @@ static inline int sysfs_init(void)
static struct list_head *mount_hashtable;
static int hash_mask, hash_bits;
static kmem_cache_t *mnt_cache;
+static struct namespace *dead_mounts;
static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry)
{
@@ -175,6 +176,8 @@ clone_mnt(struct vfsmount *old, struct d
void __mntput(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
+ list_del(&mnt->mnt_list);
+ spin_unlock(&vfsmount_lock);
dput(mnt->mnt_root);
free_vfsmnt(mnt);
deactivate_super(sb);
@@ -240,7 +243,10 @@ static int show_vfsmnt(struct seq_file *
mangle(m, mnt->mnt_devname ? mnt->mnt_devname : "none");
seq_putc(m, ' ');
- seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
+ if (mnt->mnt_namespace != dead_mounts)
+ seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
+ else
+ seq_puts(m, "(detached)");
seq_putc(m, ' ');
mangle(m, mnt->mnt_sb->s_type->name);
seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
@@ -265,6 +271,71 @@ struct seq_operations mounts_op = {
.show = show_vfsmnt
};
+/* /proc/dead_mounts needs slightly different handling than above */
+static void *dead_m_start(struct seq_file *m, loff_t *pos)
+{
+ struct namespace *n = m->private;
+ struct vfsmount *res = NULL;
+ struct list_head *p;
+ loff_t l = *pos;
+
+ spin_lock(&vfsmount_lock);
+ list_for_each(p, &n->list)
+ if (!l--) {
+ res = mntget(list_entry(p, struct vfsmount, mnt_list));
+ break;
+ }
+ spin_unlock(&vfsmount_lock);
+ return res;
+}
+
+static void *dead_m_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct namespace *n = m->private;
+ struct vfsmount *curr = v;
+ struct vfsmount *res = NULL;
+ struct list_head *p;
+
+ (*pos)++;
+ spin_lock(&vfsmount_lock);
+ p = curr->mnt_list.next;
+ if (p != &n->list)
+ res = mntget(list_entry(p, struct vfsmount, mnt_list));
+ spin_unlock(&vfsmount_lock);
+ mntput(curr);
+ return res;
+}
+
+static void dead_m_stop(struct seq_file *m, void *v)
+{
+ struct vfsmount *curr = v;
+ mntput(curr);
+}
+
+static struct seq_operations dead_mounts_op = {
+ .start = dead_m_start,
+ .next = dead_m_next,
+ .stop = dead_m_stop,
+ .show = show_vfsmnt
+};
+
+static int dead_mounts_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open(file, &dead_mounts_op);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = dead_mounts;
+ }
+ return ret;
+}
+
+struct file_operations proc_dead_mounts_operations = {
+ .open = dead_mounts_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
/**
* may_umount_tree - check if a mount tree is busy
* @mnt: root of mount tree
@@ -343,14 +414,13 @@ static void umount_tree(struct vfsmount
LIST_HEAD(kill);
for (p = mnt; p; p = next_mnt(p, mnt)) {
- list_del(&p->mnt_list);
- list_add(&p->mnt_list, &kill);
- p->mnt_namespace = NULL;
+ list_move(&p->mnt_list, &kill);
+ p->mnt_namespace = dead_mounts;
}
while (!list_empty(&kill)) {
mnt = list_entry(kill.next, struct vfsmount, mnt_list);
- list_del_init(&mnt->mnt_list);
+ list_move_tail(&mnt->mnt_list, &dead_mounts->list);
list_del_init(&mnt->mnt_fslink);
if (mnt->mnt_parent == mnt) {
spin_unlock(&vfsmount_lock);
@@ -447,7 +517,7 @@ static int do_umount(struct vfsmount *mn
}
retval = -EBUSY;
if (atomic_read(&mnt->mnt_count) == 2 || flags & MNT_DETACH) {
- if (!list_empty(&mnt->mnt_list))
+ if (mnt->mnt_namespace != dead_mounts)
umount_tree(mnt);
retval = 0;
}
@@ -842,8 +912,8 @@ static void expire_mount(struct vfsmount
struct nameidata old_nd;
/* delete from the namespace */
- list_del_init(&mnt->mnt_list);
- mnt->mnt_namespace = NULL;
+ list_move_tail(&mnt->mnt_list, &dead_mounts->list);
+ mnt->mnt_namespace = dead_mounts;
detach_mnt(mnt, &old_nd);
spin_unlock(&vfsmount_lock);
path_release(&old_nd);
@@ -1387,6 +1457,14 @@ static void __init init_mount_tree(void)
namespace->root = mnt;
mnt->mnt_namespace = namespace;
+ dead_mounts = kmalloc(sizeof(struct namespace), GFP_KERNEL);
+ if (!dead_mounts)
+ panic("Can't allocate dead_mounts namespace");
+ atomic_set(&dead_mounts->count, 1);
+ INIT_LIST_HEAD(&dead_mounts->list);
+ init_rwsem(&dead_mounts->sem);
+ dead_mounts->root = NULL;
+
init_task.namespace = namespace;
read_lock(&tasklist_lock);
do_each_thread(g, p) {
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h 2005-05-22 16:09:37.000000000 +0200
+++ linux/include/linux/mount.h 2005-05-22 16:09:45.000000000 +0200
@@ -46,11 +46,12 @@ static inline struct vfsmount *mntget(st
}
extern void __mntput(struct vfsmount *mnt);
+extern spinlock_t vfsmount_lock;
static inline void _mntput(struct vfsmount *mnt)
{
if (mnt) {
- if (atomic_dec_and_test(&mnt->mnt_count))
+ if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock))
__mntput(mnt);
}
}
@@ -75,7 +76,5 @@ extern int do_add_mount(struct vfsmount
extern void mark_mounts_for_expiry(struct list_head *mounts);
-extern spinlock_t vfsmount_lock;
-
#endif
#endif /* _LINUX_MOUNT_H */
On Sat, 2005-05-21 at 06:12, Miklos Szeredi wrote:
> > Ok. look at the enclosed patch. Does it look any better? The special
> > casing for detached mounts ate up some brain cells and made the code
> > less simpler.
>
> Yes, this isn't trivial stuff.
>
> I realized one more thing: nd->mnt (the destination vfsmount) might be
> detached while waiting for the semaphore. So that needs to be
> rechecked after taking the semaphores.
Ok. fixed that. that was surprisingly trivial though initially it looked
like some complex locking.
>
> And the same for old_nd->mnt in case of rbind. Though I'm
> not sure what the semantics should be in this case:
>
> 1) rbind always fails if the source is detached
> 2) rbind always succeeds, and if the source is detached it just
> copies that single mount
> I like 2) better. Is there anything against it?
sure. as much functionality as we can get. I have incorporated (2).
Take a look at the enclosed patch,
RP
>
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2005-05-22 at 13:25, Ram wrote:
> On Sat, 2005-05-21 at 06:12, Miklos Szeredi wrote:
> > > Ok. look at the enclosed patch. Does it look any better? The special
> > > casing for detached mounts ate up some brain cells and made the code
> > > less simpler.
> >
> > Yes, this isn't trivial stuff.
> >
> > I realized one more thing: nd->mnt (the destination vfsmount) might be
> > detached while waiting for the semaphore. So that needs to be
> > rechecked after taking the semaphores.
>
> Ok. fixed that. that was surprisingly trivial though initially it looked
> like some complex locking.
>
> >
> > And the same for old_nd->mnt in case of rbind. Though I'm
> > not sure what the semantics should be in this case:
> >
> > 1) rbind always fails if the source is detached
> > 2) rbind always succeeds, and if the source is detached it just
> > copies that single mount
>
> > I like 2) better. Is there anything against it?
>
> sure. as much functionality as we can get. I have incorporated (2).
>
> Take a look at the enclosed patch,
The patch failed rbinds in some cases. Fixed it. The enclosed patch
has a high chance of being bug free.
RP
> RP
> >
> > Miklos
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 2005-05-22 at 01:08, Miklos Szeredi wrote:
> > > I still see a problem: what if old_nd.mnt is already detached, and
> > > bind is non-recursive. Now it fails with EINVAL, though it used to
> > > work (and I think is very useful).
> >
> > Hey, you just made another argument for not detaching mounts when the
> > last task with that current->namespace exits, but instead detaching
> > mounts when the last reference to any vfsmnt in the namespace is dropped.
> >
> > Hint :)
>
> I have a better idea:
>
> - create a "dead_mounts" namespace.
> - chain each detached mount's ->mnt_list on dead_mounts->list
> - set mnt_namespace to dead_mounts
> - export the list via proc through the usual mount list interface
>
> The last would be a nice bonus: I've always wanted to see the list of
> detached, but not-yet destroyed mounts.
>
> Does anybody see a problem with that?
Yes. :) because I will have to change my 'rbind across namespace' patch
because now detached mounts will have dead_mounts namespace instead of
null namespace.
RP
>
> Miklos
> Yes. :) because I will have to change my 'rbind across namespace' patch
> because now detached mounts will have dead_mounts namespace instead of
> null namespace.
I was thinking, that it would get rid of all special casing, but I
realize now, that the ns->root != NULL check still has to be done. Oh
well...
Miklos
> The patch failed rbinds in some cases. Fixed it. The enclosed patch
> has a high chance of being bug free.
Except not setting mnt_namespace corretly, it does seem to be ok.
Miklos
On Sun, 2005-05-22 at 22:08, Miklos Szeredi wrote:
> > The patch failed rbinds in some cases. Fixed it. The enclosed patch
> > has a high chance of being bug free.
>
> Except not setting mnt_namespace corretly, it does seem to be ok.
yes done. enclosed the patch.
But seems like this patch opens up questions like:
Should mounts/umounts/remounts/pivot_root in foreign namespaces
be allowed? Probably check_mnt() can simply go, and the correct
semaphores have to be held in each of the functions.
RP
>
> Miklos
> yes done. enclosed the patch.
>
> But seems like this patch opens up questions like:
>
> Should mounts/umounts/remounts/pivot_root in foreign namespaces
> be allowed?
I think yes.
Moving a subtree to a different namespace is a bit tricky so maybe
move should still be restricted to be within a single namespace.
And I think we should relax the checks under /proc, to allow proper
access to foreign namespaces as far as it doesn't impact security.
E.g. it makes sense to allow a process to access /proc/self/fd/XXX
even if XXX resides in a different namespace. Currently that is not
allowed.
Miklos
Miklos Szeredi wrote:
>>>I still see a problem: what if old_nd.mnt is already detached, and
>>>bind is non-recursive. Now it fails with EINVAL, though it used to
>>>work (and I think is very useful).
>>
>>Hey, you just made another argument for not detaching mounts when the
>>last task with that current->namespace exits, but instead detaching
>>mounts when the last reference to any vfsmnt in the namespace is dropped.
>>
>>Hint :)
>
>
> I have a better idea:
>
> - create a "dead_mounts" namespace.
> - chain each detached mount's ->mnt_list on dead_mounts->list
> - set mnt_namespace to dead_mounts
> - export the list via proc through the usual mount list interface
>
> The last would be a nice bonus: I've always wanted to see the list of
> detached, but not-yet destroyed mounts.
>
> Does anybody see a problem with that?
>
FWIW, all this stuff has already been done and posted here.
Detachable chunks of vfsmounts:
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109872862003192&w=2
'Soft' reference counts for manipulating vfsmounts without pinning them
down:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109872797030644&w=2
Referencing vfsmounts in userspace using a file descriptor:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109871948812782&w=2
walking mountpoints in userspace:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109875012510262&w=2
attaching mountpoints in userspace:
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109875063100111&w=2
detaching mountpoints in userspace:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109880051800963&w=2
getting info from a vfsmount:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109875135030473&w=2
Mike Waychison
> FWIW, all this stuff has already been done and posted here.
>
> Detachable chunks of vfsmounts:
> http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109872862003192&w=2
>
> 'Soft' reference counts for manipulating vfsmounts without pinning them
> down:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109872797030644&w=2
I think this might just interest Jamie Lokier. He had a very similar
poposal recently, but without reference to this patch, so I guess he
wasn't aware of it.
> Referencing vfsmounts in userspace using a file descriptor:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109871948812782&w=2
Why not just use /proc/PID/fd/FD?
> walking mountpoints in userspace:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109875012510262&w=2
Is this needed? Userspace can find out mountpoints from /proc/mounts
(or something similar for detached trees).
> attaching mountpoints in userspace:
> http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109875063100111&w=2
Again, bind from/to /proc/PID/fd/FD should work without any new
interfaces.
> detaching mountpoints in userspace:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109880051800963&w=2
What's wrong with sys_umount()?
> getting info from a vfsmount:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109875135030473&w=2
/proc or /sys should do fine for this purpose I think.
I agree, that having "floating trees" could be useful, but I don't see
the point of adding new interfaces to support it.
Miklos
Miklos Szeredi wrote:
>>FWIW, all this stuff has already been done and posted here.
>>
>>Detachable chunks of vfsmounts:
>>http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109872862003192&w=2
>>
>>'Soft' reference counts for manipulating vfsmounts without pinning them
>>down:
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109872797030644&w=2
>
>
> I think this might just interest Jamie Lokier. He had a very similar
> poposal recently, but without reference to this patch, so I guess he
> wasn't aware of it.
>
Interesting. I haven't been following LKML/fsdevel lately due to lack
of time.
>
>>Referencing vfsmounts in userspace using a file descriptor:
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109871948812782&w=2
>
>
> Why not just use /proc/PID/fd/FD?
In what sense? readlink of /proc/PID/fd/* will provide a pathname
relative to current's root: useless for any paths not in current's
namespace.
Also, if we were to hijack /proc/PID/fd/* for cross namespace
manipulation, then we'd be enabling any root user on the system to
modify anyone's namespace. Any security *cough* provided by namespaces
is lost. A more secure way is to have root in namespace A allow root in
namespace B do the mounts. If you further restrict how this hand-off
happens, such as the walking constraints in the patch mentioned below,
we can restrict modification of a namespace to a given sub-tree of
vfsmounts.
This interface also has the huge advantage that you gain all the goodies
of using file descriptors, such as SCM_RIGHTS. You can hand of entire
trees of mountpoints between applications without ever even binding them
to any namespace whatsoever.
Tie this in with some userspace code that can mount devices for users
with restrictions and appropriate policy, you can create some API+daemon
for regular user apps to get things mounted in a way that guarantees
hiding from other users.
>
>
>>walking mountpoints in userspace:
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109875012510262&w=2
>
>
> Is this needed? Userspace can find out mountpoints from /proc/mounts
> (or something similar for detached trees).
>
With detached mountpoints (and especially with detached mountpoint
_trees_) it can become very difficult to assess which trees are which.
Also, just like /proc/PID/fd/*, /proc/mounts is built according to
_current_'s root. This only gives a skewed view of what is going on.
>
>>attaching mountpoints in userspace:
>>http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109875063100111&w=2
>
>
> Again, bind from/to /proc/PID/fd/FD should work without any new
> interfaces.
No.. It wouldn't. Pathname resolution is doing everything according to
the ->readlink information provided by this magic proc files, again in
current's namespace. If you care to hijack ->follow_link, prepare
yourself for a slew of corner cases.
>
>
>>detaching mountpoints in userspace:
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109880051800963&w=2
>
>
> What's wrong with sys_umount()?
sys_umount only works with paths in current's namespace. It doesn't
allow you to handle vfsmounts as primaries in userspace.
>
>
>>getting info from a vfsmount:
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109875135030473&w=2
>
>
> /proc or /sys should do fine for this purpose I think.
>
Sure, if you can look it up somehow. Even if you could currently walk
around in another namespace using fchdir+chdir, you couldn't pull out
kernel-knowledge of mountpoints, you have to fall back to /etc/mtab,
which is completely broken when you mix in namespaces anyway..
> I agree, that having "floating trees" could be useful, but I don't see
> the point of adding new interfaces to support it.
>
I'm not hugely tied to the idea at the moment. I implemented it as part
of this interface cause it was a simple extension to what was being done.
> Miklos
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> >
> >>Referencing vfsmounts in userspace using a file descriptor:
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109871948812782&w=2
> >
> >
> > Why not just use /proc/PID/fd/FD?
>
> In what sense? readlink of /proc/PID/fd/* will provide a pathname
> relative to current's root: useless for any paths not in current's
> namespace.
Not readlink, but actual dereference of link will give you exactly the
vfsmount/dentry the file was opened on. If you want to bind/move
whatever on that mount, that's possible, even if it's a "detached
tree".
> Also, if we were to hijack /proc/PID/fd/* for cross namespace
> manipulation, then we'd be enabling any root user on the system to
> modify anyone's namespace.
No. Obviously there must be limitations on which process can access
/proc/PID/fd. It's obviously safe to allow 'self' for example. But
any process you can ptrace should also be safe.
> This interface also has the huge advantage that you gain all the goodies
> of using file descriptors, such as SCM_RIGHTS. You can hand of entire
> trees of mountpoints between applications without ever even binding them
> to any namespace whatsoever.
Why is that dependent on the interface? The /proc interface already
allows _everything_ you want to do with file descriptors. Only the
the necessary restrictions should be worked out.
> Tie this in with some userspace code that can mount devices for users
> with restrictions and appropriate policy, you can create some API+daemon
> for regular user apps to get things mounted in a way that guarantees
> hiding from other users.
Yes. And I think that can be done without any new magic ioctls. Just
with mount/umount and /proc.
The implementation is another question. I'll look through your
patches to see how you achieve all this.
Thanks,
Miklos
(Sorry for replying in two parts, I missed this in the first reading)
> >>walking mountpoints in userspace:
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109875012510262&w=2
> >
> >
> > Is this needed? Userspace can find out mountpoints from /proc/mounts
> > (or something similar for detached trees).
> >
>
> With detached mountpoints (and especially with detached mountpoint
> _trees_) it can become very difficult to assess which trees are which.
>
> Also, just like /proc/PID/fd/*, /proc/mounts is built according to
> _current_'s root. This only gives a skewed view of what is going on.
I'm thinking of /proc/PID/fdinfo/FD/mounts or similar.
> >>attaching mountpoints in userspace:
> >>http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109875063100111&w=2
> >
> >
> > Again, bind from/to /proc/PID/fd/FD should work without any new
> > interfaces.
>
> No.. It wouldn't. Pathname resolution is doing everything according to
> the ->readlink information provided by this magic proc files, again in
> current's namespace. If you care to hijack ->follow_link, prepare
> yourself for a slew of corner cases.
No need to hijack, it's already done. Removing calls to
proc_check_root() will allow access to different namespaces detached
mounts, etc. It's been tried and actually works.
What's more you don't even need /proc, just pass a file descriptor
(with reference to mount in different namespace, etc.) to another
process with SCM_RIGHTS, do fchrdir(fd), and then do mount --bind,
etc. This actually works with an unmodified kernel.
> >>detaching mountpoints in userspace:
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109880051800963&w=2
> >
> >
> > What's wrong with sys_umount()?
>
> sys_umount only works with paths in current's namespace. It doesn't
> allow you to handle vfsmounts as primaries in userspace.
But it does. Again, either with fchdir() or /proc.
fchdir() has the drawback of only working on directories, and that a
process has only one CWD. The /proc approach has no such limitations.
Miklos
Miklos Szeredi wrote:
>>>>Referencing vfsmounts in userspace using a file descriptor:
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109871948812782&w=2
>>>
>>>
>>>Why not just use /proc/PID/fd/FD?
>>
>>In what sense? readlink of /proc/PID/fd/* will provide a pathname
>>relative to current's root: useless for any paths not in current's
>>namespace.
>
>
> Not readlink, but actual dereference of link will give you exactly the
> vfsmount/dentry the file was opened on. If you want to bind/move
> whatever on that mount, that's possible, even if it's a "detached
> tree".
Removing proc_check_root essentially removes any protection that
namespaces provided in the first place.
Think of it like virtual memory. You can fork off and create your own
COW instance, and no one can see what you are doing unless they ptrace
you or explicitly ask you. The mountfd model allows for explicit
handing off of vfsmounts between namespaces without allowing arbitrary
access.
>
>
>>Also, if we were to hijack /proc/PID/fd/* for cross namespace
>>manipulation, then we'd be enabling any root user on the system to
>>modify anyone's namespace.
>
>
> No. Obviously there must be limitations on which process can access
> /proc/PID/fd. It's obviously safe to allow 'self' for example. But
> any process you can ptrace should also be safe.
>
Yet even this doesn't allow userspace to define it's own policy for
inter-namespace manipulation.
>
>>This interface also has the huge advantage that you gain all the goodies
>>of using file descriptors, such as SCM_RIGHTS. You can hand of entire
>>trees of mountpoints between applications without ever even binding them
>>to any namespace whatsoever.
>
>
> Why is that dependent on the interface? The /proc interface already
> allows _everything_ you want to do with file descriptors. Only the
> the necessary restrictions should be worked out.
>
Worked out where? Allowing this stuff to happen in /proc forces you to
set this policy in the kernel.
>
>>Tie this in with some userspace code that can mount devices for users
>>with restrictions and appropriate policy, you can create some API+daemon
>>for regular user apps to get things mounted in a way that guarantees
>>hiding from other users.
>
>
> Yes. And I think that can be done without any new magic ioctls. Just
> with mount/umount and /proc.
>
> The implementation is another question. I'll look through your
> patches to see how you achieve all this.
>
Beware that due to the detached-subtrees bit, the locking became a bit
ugly, requiring a global rw_lock for mntget/mntput. I still haven't
figured out a better way to keep per-vfsmounts counts and per-subtree
counts in sync.
That said, the common case is a read lock, for shared access. Only on
mnt_attach / mnt_detach (usually a slowpath anyway) do we require
exclusive access. This would have been a good candidate for brlocks if
they were still around.
Mike Waychison
Miklos Szeredi wrote:
> (Sorry for replying in two parts, I missed this in the first reading)
>
>
>>>>walking mountpoints in userspace:
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109875012510262&w=2
>>>
>>>
>>>Is this needed? Userspace can find out mountpoints from /proc/mounts
>>>(or something similar for detached trees).
>>>
>>
>>With detached mountpoints (and especially with detached mountpoint
>>_trees_) it can become very difficult to assess which trees are which.
>>
>>Also, just like /proc/PID/fd/*, /proc/mounts is built according to
>>_current_'s root. This only gives a skewed view of what is going on.
>
>
> I'm thinking of /proc/PID/fdinfo/FD/mounts or similar.
>
>
>>>>attaching mountpoints in userspace:
>>>>http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109875063100111&w=2
>>>
>>>
>>>Again, bind from/to /proc/PID/fd/FD should work without any new
>>>interfaces.
>>
>>No.. It wouldn't. Pathname resolution is doing everything according to
>>the ->readlink information provided by this magic proc files, again in
>>current's namespace. If you care to hijack ->follow_link, prepare
>>yourself for a slew of corner cases.
>
>
> No need to hijack, it's already done. Removing calls to
> proc_check_root() will allow access to different namespaces detached
> mounts, etc. It's been tried and actually works.
>
See previous message as why we don't want to allow this.
> What's more you don't even need /proc, just pass a file descriptor
> (with reference to mount in different namespace, etc.) to another
> process with SCM_RIGHTS, do fchrdir(fd), and then do mount --bind,
> etc. This actually works with an unmodified kernel.
It shouldn't. An unmodified kernel has check_mnt to keep you from doing
this.
>
>
>>>>detaching mountpoints in userspace:
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109880051800963&w=2
>>>
>>>
>>>What's wrong with sys_umount()?
>>
>>sys_umount only works with paths in current's namespace. It doesn't
>>allow you to handle vfsmounts as primaries in userspace.
>
>
> But it does. Again, either with fchdir() or /proc.
>
> fchdir() has the drawback of only working on directories, and that a
> process has only one CWD. The /proc approach has no such limitations.
>
Again, check_mnt won't let this happen.
Another limitation is that manipulating mounts in another namespace
breaks namespace->list locking.. Add that to a todo list if you still
think this is a good idea.
Mike Waychison
> >>In what sense? readlink of /proc/PID/fd/* will provide a pathname
> >>relative to current's root: useless for any paths not in current's
> >>namespace.
> >
> >
> > Not readlink, but actual dereference of link will give you exactly the
> > vfsmount/dentry the file was opened on. If you want to bind/move
> > whatever on that mount, that's possible, even if it's a "detached
> > tree".
>
> Removing proc_check_root essentially removes any protection that
> namespaces provided in the first place.
>
> Think of it like virtual memory. You can fork off and create your own
> COW instance, and no one can see what you are doing unless they ptrace
> you or explicitly ask you. The mountfd model allows for explicit
> handing off of vfsmounts between namespaces without allowing arbitrary
> access.
Note: I didn't say we should remove proc_check_root(). I said _if_
you remove it, _then_ mounting on foreign namespace will work, which
has actually been confirmed experimentally.
So your follow_link argument doesn't hold. The proc code does the
follow_link in some clever way, that the looked up object will end up
with the same vfsmount/dentry pair as the file.
> Yet even this doesn't allow userspace to define it's own policy for
> inter-namespace manipulation.
OK. Let's keep the kernel policy simple: just allow a process to
access it's _own_ file descriptors in proc. I.e. allow access to
/proc/self/fd/* even if it comes from a foreign namespace.
Then the policy (who passes whom the file descriptors) is entirely up
to userspace (just as with your scheme).
/proc/self/fd/FD doesn't give any extra rights to the process, it just
makes mounting from/to detached mounts, and foreign namespaces
possible without new interfaces.
> Beware that due to the detached-subtrees bit, the locking became a bit
> ugly, requiring a global rw_lock for mntget/mntput. I still haven't
> figured out a better way to keep per-vfsmounts counts and per-subtree
> counts in sync.
Did you measure the effect on performace? Maybe it isn't so bad.
Miklos
Miklos Szeredi wrote:
>>>>In what sense? readlink of /proc/PID/fd/* will provide a pathname
>>>>relative to current's root: useless for any paths not in current's
>>>>namespace.
>>>
>>>
>>>Not readlink, but actual dereference of link will give you exactly the
>>>vfsmount/dentry the file was opened on. If you want to bind/move
>>>whatever on that mount, that's possible, even if it's a "detached
>>>tree".
>>
>>Removing proc_check_root essentially removes any protection that
>>namespaces provided in the first place.
>>
>>Think of it like virtual memory. You can fork off and create your own
>>COW instance, and no one can see what you are doing unless they ptrace
>>you or explicitly ask you. The mountfd model allows for explicit
>>handing off of vfsmounts between namespaces without allowing arbitrary
>>access.
>
>
> Note: I didn't say we should remove proc_check_root(). I said _if_
> you remove it, _then_ mounting on foreign namespace will work, which
> has actually been confirmed experimentally.
>
OK.
> So your follow_link argument doesn't hold. The proc code does the
> follow_link in some clever way, that the looked up object will end up
> with the same vfsmount/dentry pair as the file.
>
Yes. I hadn't realized that the only safe-guard in place was
proc_check_root. I had made the assumption that follow_link was using
vfs_follow_link with the readlink info.
>
>>Yet even this doesn't allow userspace to define it's own policy for
>>inter-namespace manipulation.
>
>
> OK. Let's keep the kernel policy simple: just allow a process to
> access it's _own_ file descriptors in proc. I.e. allow access to
> /proc/self/fd/* even if it comes from a foreign namespace.
>
> Then the policy (who passes whom the file descriptors) is entirely up
> to userspace (just as with your scheme).
>
> /proc/self/fd/FD doesn't give any extra rights to the process, it just
> makes mounting from/to detached mounts, and foreign namespaces
> possible without new interfaces.
So you'd say 'mount /dev/foo /proc/self/fd/4' if 4 was an fd pointing to
a directory in another namespace?
That does require proc_check_root be removed. :\
>
>
>>Beware that due to the detached-subtrees bit, the locking became a bit
>>ugly, requiring a global rw_lock for mntget/mntput. I still haven't
>>figured out a better way to keep per-vfsmounts counts and per-subtree
>>counts in sync.
>
>
> Did you measure the effect on performace? Maybe it isn't so bad.
As far as I could tell without doing high-smp benchmarks, it doesn't
slow anything down. In this case, we aren't on the fastest path anyway,
as we are usually
a) walking a path across a mountpoint, requiring the global
vfsmount_lock anyway.
b) closing a file, which isn't fast either.
You could micro-optimize the pivoting of from one vfsmount to another
during a walk to only grab the lock once, but there may be no profound
effect and any gains would be lost in the noise.
Mike Waychison
> > What's more you don't even need /proc, just pass a file descriptor
> > (with reference to mount in different namespace, etc.) to another
> > process with SCM_RIGHTS, do fchrdir(fd), and then do mount --bind,
> > etc. This actually works with an unmodified kernel.
>
> It shouldn't. An unmodified kernel has check_mnt to keep you from doing
> this.
Common misconception. Even Al Viro fell for it, even though it was he
who wrote the code :)
What is not allowed is that the _new_ mountpoint is in a different
namespace. The _old_ mount can actually come from a different
namespace or a detached mount in a bind (not an rbind though).
Again experimentally verified.
The check_mnt() function is not a security measure, rather a locking
measure. Current code in namespace protects modification of the mount
tree with current->namespace->sem. The check_mnt() calls are needed,
so that the namespace to be modified is really the current namespace.
This limitation can actually be removed. See patch by Ram Pai posted
earlier in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=111683338801090&w=2
The code can be generalized for other types of mounts, not just bind.
With that and allowing access to /proc/self/fd/*, I think we'll have a
proper interface for passing mounts between namespaces.
The mount groups idea is orthogonal to this I think, and I'll send
comments in a separate mail.
Miklos
> So you'd say 'mount /dev/foo /proc/self/fd/4' if 4 was an fd pointing to
> a directory in another namespace?
>
> That does require proc_check_root be removed. :\
Or just make an exception to self?
proc_check_root() could begin with:
if (current == proc_task(inode))
return 0;
For all other tasks it would still be effective.
Miklos
Miklos Szeredi wrote:
>>So you'd say 'mount /dev/foo /proc/self/fd/4' if 4 was an fd pointing to
>>a directory in another namespace?
>>
>>That does require proc_check_root be removed. :\
>
>
> Or just make an exception to self?
>
> proc_check_root() could begin with:
>
> if (current == proc_task(inode))
> return 0;
>
> For all other tasks it would still be effective.
>
Yes, I think something like that is workable :)
(we still have to fix up all the namespace->sem locking. I have yet to
review Ram's patch.)
Mike Waychison
Mike Waychison wrote:
> >No need to hijack, it's already done. Removing calls to
> >proc_check_root() will allow access to different namespaces detached
> >mounts, etc. It's been tried and actually works.
>
> See previous message as why we don't want to allow this.
If you can ptrace any process which is in another namespace, then you
_effectively_ have full access to that namespace. It's quite easy to
do, and negates the supposed security of namespaces.
Because of that, there's _no_ real security benefit from denying
access to /proc/NNN/fd/ if you are able to ptrace task NNN.
What I think makes sense is this:
1. Deny access to /proc/NNN/fd/, /proc/NNN/cwd, /proc/NNN/root
if task NNN cannot be ptraced.
3. Allow entry to /proc/NNN/fd/, /proc/NNN/cwd, /proc/NNN/root
if ptrace is allowed; the namespace being irrelevant.
3. Use _exactly_ the same condition as for ptracing,
i.e. MAY_PTRACE in fs/proc/base.c. Ensure that condition is
consistent with the tests in kernel/ptrace.c, possibly putting
the condition in a common header file to keep it consistent in
future.
4. If further restrictions are desired, to make namespaces more
strict, those should be implemented by further restrictions on
which tasks are allowed to ptrace other tasks.
-- Jamie
Jamie Lokier wrote:
> Mike Waychison wrote:
>
>>>No need to hijack, it's already done. Removing calls to
>>>proc_check_root() will allow access to different namespaces detached
>>>mounts, etc. It's been tried and actually works.
>>
>>See previous message as why we don't want to allow this.
>
>
> If you can ptrace any process which is in another namespace, then you
> _effectively_ have full access to that namespace. It's quite easy to
> do, and negates the supposed security of namespaces.
>
> Because of that, there's _no_ real security benefit from denying
> access to /proc/NNN/fd/ if you are able to ptrace task NNN.
>
> What I think makes sense is this:
>
> 1. Deny access to /proc/NNN/fd/, /proc/NNN/cwd, /proc/NNN/root
> if task NNN cannot be ptraced.
>
> 3. Allow entry to /proc/NNN/fd/, /proc/NNN/cwd, /proc/NNN/root
> if ptrace is allowed; the namespace being irrelevant.
>
> 3. Use _exactly_ the same condition as for ptracing,
> i.e. MAY_PTRACE in fs/proc/base.c. Ensure that condition is
> consistent with the tests in kernel/ptrace.c, possibly putting
> the condition in a common header file to keep it consistent in
> future.
>
> 4. If further restrictions are desired, to make namespaces more
> strict, those should be implemented by further restrictions on
> which tasks are allowed to ptrace other tasks.
>
Indeed. A combination of MAY_PTRACE ||ed with a check against current
sounds reasonable to me.
Mike Waychison
Mike Waychison wrote:
> > 1. Deny access to /proc/NNN/fd/, /proc/NNN/cwd, /proc/NNN/root
> > if task NNN cannot be ptraced.
> >
> > 3. Allow entry to /proc/NNN/fd/, /proc/NNN/cwd, /proc/NNN/root
> > if ptrace is allowed; the namespace being irrelevant.
> >
> > 3. Use _exactly_ the same condition as for ptracing,
> > i.e. MAY_PTRACE in fs/proc/base.c. Ensure that condition is
> > consistent with the tests in kernel/ptrace.c, possibly putting
> > the condition in a common header file to keep it consistent in
> > future.
> >
> > 4. If further restrictions are desired, to make namespaces more
> > strict, those should be implemented by further restrictions on
> > which tasks are allowed to ptrace other tasks.
> >
>
> Indeed. A combination of MAY_PTRACE ||ed with a check against current
> sounds reasonable to me.
Note that MAY_PTRACE already includes a check against current.
-- Jamie
On Tue, 2005-05-24 at 11:04, Mike Waychison wrote:
> Miklos Szeredi wrote:
> >>So you'd say 'mount /dev/foo /proc/self/fd/4' if 4 was an fd pointing to
> >>a directory in another namespace?
> >>
> >>That does require proc_check_root be removed. :\
> >
> >
> > Or just make an exception to self?
> >
> > proc_check_root() could begin with:
> >
> > if (current == proc_task(inode))
> > return 0;
> >
> > For all other tasks it would still be effective.
> >
>
> Yes, I think something like that is workable :)
>
> (we still have to fix up all the namespace->sem locking. I have yet to
> review Ram's patch.)
Yes. This patch is not fully ready yet. It still has to take care of
using the correct namespace sems for operations like umount/move etc.
Have been recently busy with shared subtree coding. Will work on this
to remove all the assumptions in the code that think 'a process can
access only its own namespace'.
RP
>
> Mike Waychison
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html